Python Third-Party Modules: beautifulsoup (bs4) - Parsing HTML

Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/latest/

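Introduction

In short, Beautiful Soup is a Python library whose main job is extracting data from web pages. The official documentation describes it as follows:

Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying a parse tree.
It is a toolkit that parses a document and hands you the data you want to extract. Because it is simple, a complete application takes very little code.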

Installation

`pip3 install beautifulsoup4`

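Parsers

Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers. If no third-party parser is installed, Beautiful Soup falls back to Python's built-in html.parser. The lxml parser is more capable and faster, so installing it is recommended.

`pip3 install lxml`

Another option is html5lib, a pure-Python parser that parses pages the same way a web browser does. It can be installed with:

`pip3 install html5lib`

How the parsers compare, in brief: html.parser ships with Python and needs no extra installation; lxml is the fastest but is an external dependency; html5lib is the most browser-like and, being pure Python, the slowest of the three.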
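Which parsers are available depends on what is installed, so the following is a minimal sketch of falling back from one parser to the next. The helper name `make_soup` and the preference order are assumptions made for illustration and are not part of Beautiful Soup; `FeatureNotFound` is the exception bs4 raises when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Hypothetical helper: returns a (soup, parser_name) pair.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser is not installed, try the next one
    raise RuntimeError("none of the requested parsers is available")


soup, used = make_soup("<p>Hello, <b>soup</b></p>")
print(used)           # name of the parser that was actually used
print(soup.b.string)  # soup
```

Naming the parser explicitly, as this helper does, keeps results reproducible across machines with different parsers installed.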

The document-parsing constructor

The signature is:

from bs4 import BeautifulSoup

BeautifulSoup(
    markup='',
    features=None,
    builder=None,
    parse_only=None,
    from_encoding=None,
    exclude_encodings=None,
    **kwargs,
)

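markup: The markup to parse, given as a string or a file-like object.
features: Desirable features of the parser to use. This may be the name of a specific parser ("lxml", "lxml-xml", "html.parser", or "html5lib") or the type of markup ("html", "html5", "xml"). Naming a specific parser is recommended so that Beautiful Soup gives the same results across platforms and virtual environments.
builder: A specific tree builder to use instead of looking one up based on features. You should not need this.
parse_only: A SoupStrainer. Only the parts of the document that match the SoupStrainer are considered. This is useful when you need only part of a document that would otherwise be too large to fit in memory.
from_encoding: A string naming the encoding of the document. Pass it if Beautiful Soup guesses the encoding wrongly.
exclude_encodings: A list of strings naming encodings known to be wrong. Pass it if you do not know the document's encoding but do know that Beautiful Soup's guess is wrong.

Passing a string or a file handle to the BeautifulSoup constructor returns a BeautifulSoup object, which represents the document as a nested tree. The tree mirrors the structure of the HTML that was parsed.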
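The parse_only and from_encoding parameters are easier to follow with a concrete case. The sketch below uses a small made-up document and the built-in html.parser (parse_only is not supported by html5lib); the variable names are illustrative only.

```python
from bs4 import BeautifulSoup, SoupStrainer

html = """
<html><body>
<p>Intro text</p>
<a href="http://example.com/one">one</a>
<a href="http://example.com/two">two</a>
</body></html>
"""

# parse_only: keep only the <a> tags while parsing; everything else is skipped.
only_links = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_links)
hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)
print(soup.find("p"))  # None, because <p> was never added to the tree

# from_encoding: state how the raw bytes are encoded instead of letting bs4 guess.
raw = "<p>café</p>".encode("latin-1")
latin_soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")
print(latin_soup.p.string)  # café
```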

# A simple example
from bs4 import BeautifulSoup


html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
asdf
    <div class="title">
        <b>The Dormouse's story总共</b>
        <h1>f</h1>
    </div>
<div class="story">Once upon a time there were three little sisters; and their names were
    <a  class="sister0" id="link1">Els<span>f</span>ie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</div>
ad<br/>sf
<p class="story">...</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, features="lxml")
tag1 = soup.find(name='a')        # the first <a> tag
tag2 = soup.find_all(name='a')    # a list of all <a> tags
tag3 = soup.select('#link2')      # a list of the tags with id="link2"

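Kinds of objects

Every node in the tree is a Python object, and all of them fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

Tag

A Tag object corresponds to a tag in the original XML or HTML document.

soup = BeautifulSoup('<p class="body language"></p>', features="lxml")
# Get the tag
soup.p
# <p class="body language"></p>

A Tag has many attributes (covered in detail under "Navigating the tree" and "Searching the tree"). The most important are its name and its attributes.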

name
Every tag has a name, available as .name.
Assigning to .name renames the tag, and the change shows up in any markup generated from the BeautifulSoup object.

soup.p.name
# 'p'
soup.p.name = 'q'
soup
# <html><body><q class="body language"></q></body></html>

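attributes
A tag may have any number of attributes. They are accessed the same way as dictionary items, and the whole dictionary is available as .attrs.
Attributes can be added, removed, and modified.
Multi-valued attributes
Beautiful Soup returns the value of a multi-valued attribute such as class as a list: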

soup.q['class'] 
# ['body','language']

soup.q.attrs
# {'class': ['body', 'language']}

soup.q['id']=1
soup
#<html><body><q class="body language" id="1"></q></body></html>

soup.q.get('class')
#['body', 'language']

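If an attribute looks like it has several values but is not defined as multi-valued in any version of the HTML standard, Beautiful Soup leaves it as a single string.
When a document is parsed as XML, no attribute is treated as multi-valued.

id_soup = BeautifulSoup('<p id="my id"></p>', features="lxml")
id_soup.p.attrs
# {'id': 'my id'}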

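NavigableString

A string is usually contained in a tag, and .string returns it.
Beautiful Soup wraps that text in the NavigableString class; calling type() on it returns <class 'bs4.element.NavigableString'>.
A string cannot be edited in place, but it can be replaced with replace_with().

`tag.string.replace_with(' ')  # tag.string must not be None`

Strings do not support the .contents or .string attributes or the find() method.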
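As a small illustration of the points above, the sketch below inspects a NavigableString, replaces it, and converts it to a plain str. The markup is a made-up one-liner and html.parser is used so that no extra parser is needed.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")
text = soup.b.string

print(type(text))                         # <class 'bs4.element.NavigableString'>
print(isinstance(text, NavigableString))  # True
print(isinstance(text, str))              # True, NavigableString subclasses str

# The string cannot be edited in place, but it can be swapped for another one.
text.replace_with("No longer bold")
print(soup.b)                             # <b>No longer bold</b>

# str() yields a plain Python string that no longer references the parse tree.
plain = str(soup.b.string)
print(type(plain))                        # <class 'str'>
```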

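BeautifulSoup

The BeautifulSoup object represents the parsed document as a whole. Most of the time it can be treated like a Tag object, but because it does not correspond to a real HTML or XML tag, it has no tag name and no attributes of its own. For convenience it is given the special .name value '[document]'. (The tags inside it have ordinary .name values.)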
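A quick check of that behaviour, as a sketch on a made-up one-line document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p class='intro'>Hello</p>", "html.parser")

print(soup.name)    # [document]
print(soup.p.name)  # p

# The search and navigation API of Tag works on the whole document too.
intro_classes = soup.find("p")["class"]
print(intro_classes)  # ['intro']
```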

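Comment

The Comment object is a special type of NavigableString. It represents a comment in the document, one of a few kinds of special strings.

a = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(a, features="lxml")
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>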

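Navigating the tree

The following excerpt from "Alice's Adventures in Wonderland" is the example used throughout the rest of this article:

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, features="lxml")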

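Child nodes

A tag's children can be other tags or strings. Beautiful Soup provides many attributes for navigating and iterating over them.

Note: strings do not support these attributes, because a string has no children.

.<tag name>
For example, soup.head returns the <head> tag, and soup.body.b returns the first <b> tag inside <body>.
Note: attribute access by tag name returns only the first tag with that name. (To get all of them, use find_all(), described under "Searching the tree".)
.contents
A tag's .contents attribute returns its direct children as a list.
.children
A generator over the tag's direct children (not deeper descendants), for use in a loop.
.descendants
An iterator over all of the tag's descendants: children, grandchildren, and so on.
.string
1. If a tag has exactly one child and that child is a NavigableString, .string returns that child.
2. If a tag's only child is another tag, .string returns the same result as that child's .string.
3. If a tag has more than one child, it is unclear which one .string should refer to, so the result is None.
.strings and .stripped_strings
If a tag contains more than one string, iterate over them with .strings:

for string in soup.strings:
    print(repr(string))

The strings may include a lot of whitespace and blank lines. .stripped_strings skips whitespace-only strings and strips leading and trailing whitespace from the rest:

for string in soup.stripped_strings:
    print(repr(string))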
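The attributes above can be compared side by side. This is a minimal sketch on a compact, whitespace-free variant of the sample document (an assumption made so that the child lists contain no newline strings); html.parser is used to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p class='title'><b>The Dormouse's story</b></p>"
    "<p>Once <i>upon</i> a time</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

head = soup.head
print(head.contents)                 # [<title>The Dormouse's story</title>]

child_names = [child.name for child in soup.body.children]
print(child_names)                   # ['p', 'p']

# .descendants also walks into grandchildren, including strings.
head_descendants = list(head.descendants)
print(len(head_descendants))         # 2: the <title> tag and its string

title_p, story_p = soup.find_all("p")
print(title_p.string)                # The Dormouse's story (rule 2: only child is <b>)
print(story_p.string)                # None (rule 3: several children)
print(list(story_p.strings))         # ['Once ', 'upon', ' a time']
print(list(story_p.stripped_strings))  # ['Once', 'upon', 'a time']
```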

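Parent nodes

Every tag and every string has a parent: the tag that contains it.

.parent
The .parent attribute returns an element's parent.
The parent of a top-level tag, such as <html> in the Alice document, is the BeautifulSoup object itself.
The .parent of the BeautifulSoup object is None.
.parents
The .parents attribute iterates over all of an element's ancestors, from its parent up to the root of the tree.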
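A short sketch of .parent and .parents on a made-up fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)
title = soup.title

print(title.parent.name)          # head
print(title.string.parent.name)   # title (strings have parents too)
print(soup.html.parent is soup)   # True: the top-level tag's parent is the soup
print(soup.parent)                # None

ancestor_names = [parent.name for parent in title.parents]
print(ancestor_names)             # ['head', 'html', '[document]']
```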

Sibling nodes

Nodes on the same level (children of the same element) are called siblings. In pretty-printed output, siblings appear at the same indentation level.

.next_sibling and .previous_sibling
Use .next_sibling and .previous_sibling to move between siblings in the tree.
In real documents, the .next_sibling or .previous_sibling of a tag is usually a string containing whitespace. Consider the Alice document:

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

You might expect the .next_sibling of the first <a> tag to be the second <a> tag. It is not: it is the string between the two tags, a comma followed by a newline. (Printing soup.prettify() shows that ',\n' sits at the same level as the <a> tags.)

.next_siblings and .previous_siblings
The .next_siblings and .previous_siblings attributes iterate over the siblings after or before the current node.

for sibling in soup.a.next_siblings:
    print(repr(sibling))

for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))
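Because text nodes sit between the tags, code that wants "the next tag" usually has to skip strings. The sketch below shows this on a trimmed copy of the three links; filtering with isinstance(..., Tag) is one possible approach, not the only one.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    "<p>"
    "<a id='link1'>Elsie</a>,\n"
    "<a id='link2'>Lacie</a> and\n"
    "<a id='link3'>Tillie</a>;"
    "</p>"
)
soup = BeautifulSoup(html, "html.parser")
first = soup.find(id="link1")

print(repr(first.next_sibling))               # ',\n' (a string, not a tag)
print(first.next_sibling.next_sibling["id"])  # link2
print(first.previous_sibling)                 # None: nothing comes before it

# Keep only the siblings that are tags.
following_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]
print(following_ids)                          # ['link2', 'link3']
```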

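Going back and forth

The HTML parser turns the document string into a series of events: "open an <html> tag", "open a <head> tag", "open a <title> tag", "add a string", "close the <title> tag", "open a <p> tag", and so on. Beautiful Soup offers tools for reconstructing the order in which the document was parsed.

.next_element and .previous_element
.next_element points to whatever was parsed immediately afterwards, a string or a tag. It can be the same as .next_sibling but usually is not, because most tags contain a string, and that string is parsed before the tag's next sibling.
.next_elements and .previous_elements
These iterators move forward or backward through the document in the order it was parsed.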
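The difference between .next_sibling and .next_element shows up clearly on a two-link fragment. This is a sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<p><a id='link1'>Elsie</a>, <a id='link2'>Lacie</a></p>",
    "html.parser",
)
first = soup.a

print(repr(first.next_sibling))     # ', '    (next node on the same level)
print(repr(first.next_element))     # 'Elsie' (next thing parsed: the tag's own string)
print(first.previous_element.name)  # p       (parsed just before the first <a>)

# Walk forward in parse order; show strings as text and tags by name.
order = [el if isinstance(el, str) else el.name for el in first.next_elements]
print(order)                        # ['Elsie', ', ', 'a', 'Lacie']
```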

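Searching the tree

Beautiful Soup defines many search methods. This section concentrates on two, find() and find_all(); the others take similar arguments and are used in similar ways.

Filters

Before looking at find_all(), it helps to know the kinds of filters, which appear throughout the search API. A filter can be applied to a tag's name, to its attributes, to the text of a string, or to a combination of these.

A string
The simplest filter is a string. Pass one to a search method and Beautiful Soup matches against that exact string.

soup.find_all('b')
# [<b>The Dormouse's story</b>]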

A regular expression
If you pass a compiled regular expression, Beautiful Soup filters with its search() method, so the pattern may match anywhere in the tag name. Here 'b' matches both <body> and <b>; use a pattern such as '^b' to match only at the start of the name.

import re
for tag in soup.find_all(re.compile('b')):
    print(tag.name)
# body
# b

A list
If you pass a list, Beautiful Soup returns anything that matches any item in the list.

soup.find_all(['a', 'b'])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

True
The value True matches every tag, but not text strings.

for tag in soup.find_all(True):
    print(tag.name)
#html
#head
#title
#body
#p
#b
#p
#a
#a
#a
#p

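A function

If none of the other filters fits, define a function that takes an element as its only argument. It should return True if the element matches and False otherwise.
The following function returns True for a tag that has a class attribute but no id attribute:

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

Passing this function to find_all() returns all the <p> tags: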

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]
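The five filter types can be tried together in one runnable sketch. It uses a slightly shortened copy of the sample document and html.parser; the helper `names` is hypothetical and only keeps the printed output short.

```python
import re
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
<p class="story">...</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def names(tags):
    """Hypothetical helper: the tag name of each result."""
    return [tag.name for tag in tags]


def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")


print(names(soup.find_all("b")))                  # string: ['b']
print(names(soup.find_all(re.compile("^b"))))     # regex anchored at start: ['body', 'b']
print(names(soup.find_all(re.compile("t"))))      # regex anywhere: ['html', 'title']
print(names(soup.find_all(["a", "b"])))           # list: ['b', 'a', 'a', 'a']
print(len(soup.find_all(True)))                   # True: all 11 tags
print(names(soup.find_all(has_class_but_no_id)))  # function: ['p', 'p', 'p']
```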

find_all()

find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
    Extracts a list of Tag objects that match the given criteria.

find_all() searches all of the current tag's descendants and returns those that match the filters.

The name argument
Pass a value for name to consider only tags with that name. (Text strings are ignored; a matching tag is returned whole, including any strings inside it.)
The value can be any kind of filter: a string, a regular expression, a list, True, or a function.
Keyword arguments
A keyword argument whose name is not one of find_all()'s own parameters is treated as a filter on the tag attribute of that name.

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

An attribute filter can be a string, a regular expression, a list, or True.

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

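The attrs argument
Passing several keyword arguments filters on several attributes at once.
Some attributes cannot be used as keyword arguments because their names are not valid Python identifiers, for example the HTML5 data-* attributes:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', features="lxml")
data_soup.find_all(data-foo="value")
# SyntaxError (the exact message depends on the Python version)

Such attributes can still be searched by passing a dictionary as the attrs argument of find_all():

data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]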

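The text argument
The text argument searches for strings instead of tags. Like name, it accepts a string, a regular expression, a list, or True.
Since Beautiful Soup 4.4.0 this argument is called string; text is the older name for the same argument. When it is combined with a tag filter, Beautiful Soup returns the tags whose .string matches the value.

soup.find_all(string="Elsie")
# ['Elsie']
soup.find_all(text="Elsie")
# ['Elsie']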

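The limit argument
find_all() returns every result, which can be slow on a large tree. If you do not need all of them, pass limit to cap the number returned. It works like the LIMIT keyword in SQL: the search stops as soon as limit results have been found.
The recursive argument
By default, tag.find_all() examines all of the tag's descendants. Pass recursive=False to consider only its direct children.
Searching by CSS class
Because class is a reserved word in Python, tags with a given CSS class are searched with the class_ keyword argument.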

soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

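Like any keyword argument, class_ accepts a string, a regular expression, a function, or True.
When a tag's class attribute has several values, a search matches against each CSS class individually, so searching for any one class finds the tag. You can also search for the exact string value of the class attribute, but then the classes must appear in the same order.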
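The arguments described above can be exercised in one place. The following is a sketch on a shortened copy of the sample document using html.parser; the helper `ids` is hypothetical and only keeps the output compact.

```python
import re
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def ids(tags):
    """Hypothetical helper: the id attribute of each tag."""
    return [tag.get("id") for tag in tags]


# Keyword arguments filter on attributes; several can be combined.
print(ids(soup.find_all(id="link2")))                            # ['link2']
print(ids(soup.find_all(href=re.compile("elsie"), id="link1")))  # ['link1']

# attrs takes a dictionary, which also works for names such as data-foo.
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', "html.parser")
print(data_soup.find_all(attrs={"data-foo": "value"}))

# string searches text nodes instead of tags.
print(soup.find_all(string="Elsie"))                             # ['Elsie']

# limit stops the search early.
print(ids(soup.find_all("a", limit=2)))                          # ['link1', 'link2']

# recursive=False: <title> is a grandchild of <html>, not a direct child.
print(soup.html.find_all("title", recursive=False))              # []

# class_ matches any single class, or the exact attribute string.
css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")
print(len(css_soup.find_all("p", class_="strikeout")))           # 1
print(len(css_soup.find_all("p", class_="body strikeout")))      # 1
print(len(css_soup.find_all("p", class_="strikeout body")))      # 0
```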

Note

Because find_all() is probably the most used search method in Beautiful Soup, it has a shortcut: a BeautifulSoup object or a Tag object can be called like a function, which is the same as calling find_all() on it. The two lines in each pair below are equivalent:

soup.find_all("a")
soup("a")

soup.title.find_all(string=True)
soup.title(string=True)

find()

find(self, name=None, attrs={}, recursive=True, text=None, **kwargs)
    Return only the first child of this Tag matching the given
    criteria.

When only one result is wanted, use find(). It is roughly equivalent to calling find_all() with limit=1.
The difference is that find_all() returns a list containing that single result, while find() returns the result itself.
When nothing matches, find_all() returns an empty list and find() returns None.
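The different return types matter in practice, because chaining an attribute onto a None result raises an error. A minimal sketch on a made-up document (the variable names are illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head><body></body></html>",
    "html.parser",
)

print(soup.find_all("title", limit=1))  # [<title>The Dormouse's story</title>]
print(soup.find("title"))               # <title>The Dormouse's story</title>

print(soup.find_all("table"))           # []
print(soup.find("table"))               # None

# Guard against None before using the result of find().
table = soup.find("table")
table_text = table.get_text() if table is not None else ""
print(repr(table_text))                 # ''
```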

find_parents() and find_parent()

find_parents(self, name=None, attrs={}, limit=None, **kwargs)
    Returns the parents of this Tag that match the given
    criteria.

find_parent(self, name=None, attrs={}, **kwargs)
    Returns the closest parent of this Tag that matches the given
    criteria.

Remember: find_all() and find() work downwards, through the current node's children, grandchildren, and so on.
find_parents() and find_parent() work upwards, through the current node's ancestors. They are used the same way as the ordinary search methods.
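A sketch of searching upwards from a string, on a made-up fragment:

```python
from bs4 import BeautifulSoup

html = (
    "<html><body><p class='story'>Once upon a time "
    "<a class='sister' id='link2'>Lacie</a> lived in a well.</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

a_string = soup.find(string="Lacie")

print(a_string.find_parent("a")["id"])         # link2 (closest matching ancestor)
print(a_string.find_parent("p")["class"])      # ['story']

ancestors = [tag.name for tag in a_string.find_parents(["p", "body"])]
print(ancestors)                               # ['p', 'body'] (nearest first)

print(a_string.find_parents("p", class_="title"))  # [] (no such ancestor)
```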

find_next_siblings() and find_next_sibling()

find_next_sibling(self, name=None, attrs={}, text=None, **kwargs)
    Returns the closest sibling to this Tag that matches the
    given criteria and appears after this Tag in the document.

find_next_siblings(self, name=None, attrs={}, text=None, limit=None, **kwargs)
    Returns the siblings of this Tag that match the given
    criteria and appear after this Tag in the document.

find_previous_siblings() and find_previous_sibling()

find_previous_siblings(self, name=None, attrs={}, text=None, limit=None, **kwargs)
    Returns the siblings of this Tag that match the given
    criteria and appear before this Tag in the document.

find_previous_sibling(self, name=None, attrs={}, text=None, **kwargs)
    Returns the closest sibling to this Tag that matches the
    given criteria and appears before this Tag in the document.
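Unlike the raw .next_sibling attribute, these methods apply a filter, so the text nodes between tags are skipped automatically. A sketch on a trimmed copy of the three links:

```python
from bs4 import BeautifulSoup

html = (
    "<p>"
    "<a id='link1'>Elsie</a>,\n"
    "<a id='link2'>Lacie</a> and\n"
    "<a id='link3'>Tillie</a>;"
    "</p>"
)
soup = BeautifulSoup(html, "html.parser")
first_link = soup.find("a", id="link1")
last_link = soup.find("a", id="link3")

after_first = [a["id"] for a in first_link.find_next_siblings("a")]
print(after_first)                                 # ['link2', 'link3']
print(first_link.find_next_sibling("a")["id"])     # link2

before_last = [a["id"] for a in last_link.find_previous_siblings("a")]
print(before_last)                                 # ['link2', 'link1'] (nearest first)
print(first_link.find_previous_sibling("a"))       # None
```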

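find_all_next() and find_next()

These methods use .next_elements to search the tags and strings that come after the current node in the document.

find_all_previous() and find_previous()

These methods use .previous_elements to search the tags and strings that come before the current node in the document.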
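Since the source gives no example for these four methods, here is a hedged sketch on a compact made-up document. Note that searching backwards also finds the tag that contains the starting node, because that tag was opened earlier in the document.

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>The Dormouse's story</title></head><body>"
    "<p class='title'><b>The Dormouse's story</b></p>"
    "<p class='story'>Once upon a time <a id='link1'>Elsie</a>"
    " and <a id='link2'>Lacie</a></p>"
    "<p class='story'>...</p>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")
first_link = soup.a

# Everything parsed after the first link, limited to strings.
later_strings = first_link.find_all_next(string=True)
print(later_strings)                       # ['Elsie', ' and ', 'Lacie', '...']

# The first <p> that starts after the link.
print(first_link.find_next("p"))           # <p class="story">...</p>

# <p> tags that were opened before the link, nearest first.
earlier_classes = [p["class"] for p in first_link.find_all_previous("p")]
print(earlier_classes)                     # [['story'], ['title']]

print(first_link.find_previous("title").string)  # The Dormouse's story
```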

CSS selectors

Pass a string to the .select() method of a Tag or BeautifulSoup object to find tags with CSS selector syntax. Beautiful Soup supports most CSS selectors.

Find tags by descending through tag names:

soup.select("html head title")
# [<title>The Dormouse's story</title>]

Find tags directly beneath another tag:

soup.select("head > title")
# [<title>The Dormouse's story</title>]

Find the siblings of a tag:

soup.select("#link1 ~ .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("#link1 + .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

select() can also find tags by CSS class, by id, by several selectors at once, by the presence of an attribute, and by the value of an attribute.
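The remaining selector kinds have no examples in the source, so the following sketch fills them in using standard CSS syntax on a shortened copy of the sample document. The helper `ids` is hypothetical; select_one() is the variant of select() that returns only the first match.

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def ids(tags):
    """Hypothetical helper: the id attribute of each tag."""
    return [tag["id"] for tag in tags]


print(ids(soup.select(".sister")))               # by class: all three links
print(ids(soup.select("#link2")))                # by id: ['link2']
print(ids(soup.select("#link1, #link3")))        # several selectors: ['link1', 'link3']
print(ids(soup.select("a[href]")))               # attribute present: all three links
print(ids(soup.select('a[href$="tillie"]')))     # attribute value ends with: ['link3']
print(soup.select_one(".sister")["id"])          # first match only: link1
```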

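Modifying the tree

Changing tag names and attributes
A tag is renamed by assigning to .name, and its attributes are changed through dictionary-style access, as shown in the Tag section.
Modifying .string
Assigning to a tag's .string replaces the tag's contents with the new string.
Note: if the tag contains other tags, assigning to .string destroys them along with everything else inside it.
append()
tag.append() adds content to the end of a tag, much like .append() on a Python list.

soup = BeautifulSoup("<a>Foo</a>", features="lxml")
soup.a.append("Bar")

soup
# <html><body><a>FooBar</a></body></html>
soup.a.contents
# ['Foo', 'Bar']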

new_tag()
The best way to create a new tag is the factory method BeautifulSoup.new_tag().

soup = BeautifulSoup("<b></b>", features="lxml")
original_tag = soup.b

new_tag = soup.new_tag("a", href="http://www.example.com")
original_tag.append(new_tag)
original_tag
# <b><a href="http://www.example.com"></a></b>

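insert()
Inserts an element at a given position. It works like .insert() on a Python list.

`tag.insert(1, "but did not endorse ")`

insert_before() and insert_after()
insert_before() inserts content immediately before the current tag or string.
insert_after() inserts content immediately after the current tag or string.
clear()
tag.clear() removes the contents of the tag.
extract()
PageElement.extract() removes a tag or string from the tree and returns it.

i_tag = soup.i.extract()
i_tag.string.extract()

decompose()
tag.decompose() removes the tag from the tree and then destroys it and its contents.
replace_with()
PageElement.replace_with() removes a tag or string from the tree and puts a tag or string of your choice in its place.
wrap()
PageElement.wrap() wraps an element in the tag you specify and returns the new wrapper.
unwrap()
tag.unwrap() is the opposite of wrap(): it replaces the tag with whatever is inside it, which is useful for stripping out markup.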
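Each of these methods can be tried on the same small fragment. The sketch below re-parses the markup before every operation so the results do not interfere with each other; the helper `fresh` and the result variable names are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

MARKUP = '<a href="http://example.com/">I linked to <i>example.com</i></a>'


def fresh():
    """Hypothetical helper: parse the sample markup from scratch."""
    return BeautifulSoup(MARKUP, "html.parser")


soup = fresh()
soup.a.insert(1, "but did not endorse ")   # position 1 is just before <i>
after_insert = str(soup.a)

soup = fresh()
soup.i.insert_before("the site ")
soup.i.insert_after(" today")
after_before_after = soup.a.get_text()

soup = fresh()
soup.a.clear()                             # drop everything inside <a>
after_clear = str(soup.a)

soup = fresh()
removed = soup.i.extract()                 # <i> leaves the tree but survives
after_extract = str(soup.a)

soup = fresh()
soup.i.decompose()                         # <i> leaves the tree and is destroyed
after_decompose = str(soup.a)

soup = fresh()
bold = soup.new_tag("b")
bold.string = "example.net"
soup.i.replace_with(bold)
after_replace = str(soup.a)

soup = fresh()
soup.i.wrap(soup.new_tag("div"))
after_wrap = str(soup.a)

soup = fresh()
soup.i.unwrap()                            # keep the text, drop the <i> markup
after_unwrap = str(soup.a)

for label, value in [
    ("insert", after_insert), ("before/after", after_before_after),
    ("clear", after_clear), ("extract", after_extract),
    ("decompose", after_decompose), ("replace_with", after_replace),
    ("wrap", after_wrap), ("unwrap", after_unwrap),
]:
    print(f"{label}: {value}")
```

The practical difference between extract() and decompose() is that the extracted element can still be inspected or re-inserted elsewhere, while a decomposed one should not be used again.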

https://flepeng.github.io/021-Python-31-Python-第三方模块-Python-第三方模块之-beautifulsoup（bs4）-解析-HTML/

Author: Lepeng

Published: April 27, 2021
