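What is pyQuery:
A powerful and flexible web page parsing library. If you find regular expressions too tedious to write (I can't write them myself), if you find BeautifulSoup's syntax too hard to remember, and if you already know jQuery's syntax, then PyQuery is your best choice.
Installing pyQuery:
Run pip3 install pyquery to install it.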
Basic usage of pyQuery:
Initialization:
Initializing from a string:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    html = """The Dormouse's story
    The Dormouse's story
    Once upon a time there were three little sisters; and their names were Lacie and Title; and they lived at the bottom of a well.
    ...
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    print(doc('a'))
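Initializing from a URL:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Initialize from a URL
    from pyquery import PyQuery as pq
    # The url= keyword makes the fetch explicit; recent pyquery releases (2.0+)
    # no longer fetch a bare 'http://...' string passed as the first argument.
    doc = pq(url='http://www.baidu.com')
    print(doc('input'))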
Initializing from a file:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Initialize from a file
    from pyquery import PyQuery as pq
    doc = pq(filename='baidu.html')
    print(doc('title'))
Selection works the same way as in jQuery: selecting by id, name, or class, and much more, follows jQuery's syntax.
Basic CSS selectors:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # CSS selector
    html = """The Dormouse's story
    The Dormouse's story
    Once upon a time there were three little sisters; and their names were Lacie and Title; and they lived at the bottom of a well.
    ...
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    print(doc('.title'))
Finding elements:
Child elements:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Child elements
    html = """The Dormouse's story
    The Dormouse's story
    Once upon a time there were three little sisters; and their names were Lacie and Title; and they lived at the bottom of a well.
    ...
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('.title')
    print(type(items))
    print(items)
    p = items.find('b')
    print(type(p))
    print(p)
This code looks up the elements whose class is title (the selector .title matches a class, not an id). Two elements carry that class: a p tag and an a tag. We then call find to pick out the b tag inside them.
Note that find searches the descendants of each matched element.
children:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Child elements
    html = """The Dormouse's story
    The Dormouse's story
    Once upon a time there were three little sisters; and their names were Lacie and Title; and they lived at the bottom of a well.
    ...
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('.title')
    print(items.children())
You can also pass a selector to children() to filter the result:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Child elements, filtered by a selector
    html = """The Dormouse's story
    The Dormouse's story
    Once upon a time there were three little sisters; and their names were Lacie and Title; and they lived at the bottom of a well.
    ...
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('.title')
    print(items.children('b'))
The output is the same as that of the previous example.
Parent elements:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Parent element
    html = """The Dormouse's story
    The Dormouse's story
    Once upon a time there were three little sisters; and their names were Lacie and Title; and they lived at the bottom of a well.
    ...
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('#link1')
    print(items)
    print(items.parent())
parent() returns only the direct parent. The parents() method returns every ancestor element, and an optional selector narrows the result:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Ancestor elements
    html = """The Dormouse's story Once upon a time there were three little sisters; and their names were Elsie Lacie and Title Title
    ...
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('#link1')
    print(items)
    print(items.parents('body'))
Sibling elements:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Sibling elements
    html = """The Dormouse's story Once upon a time there were three little sisters; and their names were Elsie Lacie and Title Title
    ...
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('#link1')
    print(items)
    print(items.siblings('#link2'))
That covers the methods for finding elements. Next, let's look at how to iterate over them.
Traversal
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Traversal
    html = """The Dormouse's story Once upon a time there were three little sisters; and their names were Elsie Lacie and Title Title
    ...
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('a')
    for k, v in enumerate(items.items()):
        print(k, v)
Getting information:
Getting attributes:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Get attributes
    html = """The Dormouse's story Once upon a time there were three little sisters; and their names were Elsie Lacie and Title Title
    ...
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('a')
    print(items)
    print(items.attr('href'))
    print(items.attr.href)
Getting text:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Get text
    html = """The Dormouse's story Once upon a time there were three little sisters; and their names were Elsie Lacie and Title Title
    ...
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('a')
    print(items)
    print(items.text())
    print(type(items.text()))
Getting HTML:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Get HTML
    html = """The Dormouse's story Once upon a time there were three little sisters; and their names were Elsie Lacie and Title Title
    ...
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('a')
    print(items.html())
DOM manipulation:
addClass, removeClass:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # DOM manipulation: addClass, removeClass
    html = """The Dormouse's story Once upon a time there were three little sisters; and their names were Elsie Lacie and Title Title
    ...
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('#link2')
    print(items)
    items.addClass('addStyle')      # can also be written as add_class
    print(items)
    items.remove_class('sister')    # can also be written as removeClass
    print(items)
attr, css:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # DOM manipulation: attr, css
    html = """The Dormouse's story Once upon a time there were three little sisters; and their names were Elsie Lacie and Title Title
    ...
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('#link2')
    items.attr('name', 'addname')
    print(items)
    items.css('width', '100px')
    print(items)
You can add a new attribute this way; if the attribute already exists, its previous value is overwritten.
remove:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # DOM manipulation: remove
    html = """Hello World
    This is a paragraph."""
    from pyquery import PyQuery as pq
    doc = pq(html)
    wrap = doc('.wrap')
    print(wrap.text())
    wrap.find('p').remove()
    print("Data after remove")
    print(wrap)
There are many other DOM methods. Readers who want to learn more can consult the official documentation: https://pyquery.readthedocs.io/en/latest/api.html
Pseudo-class selectors:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Pseudo-class selectors
    html = """The Dormouse's story Once upon a time there were three little sisters; and their names were Elsie Lacie and Title Title
    ...
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    # print(doc)
    wrap = doc('a:first-child')      # a tags that are the first child of their parent
    print(wrap)
    wrap = doc('a:last-child')       # a tags that are the last child of their parent
    print(wrap)
    wrap = doc('a:nth-child(2)')     # a tags that are the second child of their parent
    print(wrap)
    wrap = doc('a:gt(2)')            # a tags whose index is greater than 2; indexes start at 0 (0, 1, 2, 3, 4), not 1
    print(wrap)
    wrap = doc('a:nth-child(2n)')    # a tags at even child positions (2nd, 4th, ...)
    print(wrap)
    wrap = doc('a:contains(Lacie)')  # a tags whose text contains "Lacie"
    print(wrap)
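I won't list every selector here. To learn more about CSS selectors, see the CSS tutorial on w3school: http://www.w3school.com.cn/css/index.asp
That covers the basic usage of pyQuery. For more detail, read the official documentation: https://pyquery.readthedocs.io/en/latest/
The code above is available at: https://gitee.com/dwyui/pyQuery.git
Thanks for reading. If anything here is incorrect, please point it out. Thank you.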