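What is PyQuery:
PyQuery is a powerful and flexible HTML parsing library. If regular expressions feel too cumbersome (I can't write them myself), if BeautifulSoup's syntax is too hard to remember, and if you already know jQuery's syntax, then PyQuery is probably your best choice.
Installing PyQuery: run pip3 install pyquery.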
Basic usage of PyQuery:
Initialization:
Initializing from a string:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
html = """
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Lacie
and
Title
; and they lived at the bottom of a well.
...
"""
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('a'))
Initializing from a URL:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Initialize from a URL
from pyquery import PyQuery as pq
doc = pq('http://www.baidu.com')
print(doc('input'))
Initializing from a file:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Initialize from a file
from pyquery import PyQuery as pq
doc = pq(filename='baidu.html')
print(doc('title'))
Selection works the same way as in jQuery: you can select by id, name, or class, and many other selectors behave as they do in jQuery.
Basic CSS selectors:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# CSS selectors
html = """
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Lacie
and
Title
; and they lived at the bottom of a well.
...
"""
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('.title'))
Finding elements:
Child elements:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Child elements
html = """
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Lacie
and
Title
; and they lived at the bottom of a well.
...
"""
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.title')
print(type(items))
print(items)
p = items.find('b')
print(type(p))
print(p)
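This code selects the tags whose class is title. Two tags have that class: one is a p tag and the other is an a tag. We then call find('b') to pick out the b tags inside them.
Note that find searches inside every matched element, and it looks at all descendants, not only direct children.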
children:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# children
html = """
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Lacie
and
Title
; and they lived at the bottom of a well.
...
"""
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.title')
print(items.children())
You can also pass a selector to children() to filter the result:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# children with a selector
html = """
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Lacie
and
Title
; and they lived at the bottom of a well.
...
"""
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.title')
print(items.children('b'))
The output is the same as above.
Parent element:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Parent element
html = """
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Lacie
and
Title
; and they lived at the bottom of a well.
...
"""
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('#link1')
print(items)
print(items.parent())
parent() returns only the direct parent. To get every ancestor, use the parents() method; it also accepts a selector that filters the ancestors:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Ancestor elements
html = """
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
Lacie
and
Title
Title
...
"""
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('#link1')
print(items)
print(items.parents('body'))
Sibling elements:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Sibling elements
html = """
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
Lacie
and
Title
Title
...
"""
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('#link1')
print(items)
print(items.siblings('#link2'))
That covers the methods for finding elements. Next, let's look at how to iterate over them.
Traversal
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Traversal
html = """
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
Lacie
and
Title
Title
...
"""
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('a')
for k, v in enumerate(items.items()):
    print(k, v)
Getting information:
Getting attributes:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Get attributes
html = """
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
Lacie
and
Title
Title
...
"""
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('a')
print(items)
print(items.attr('href'))
print(items.attr.href)
Getting text:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Get text
html = """
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
Lacie
and
Title
Title
...
"""
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('a')
print(items)
print(items.text())
print(type(items.text()))
Getting HTML:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Get HTML
html = """
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
Lacie
and
Title
Title
...
"""
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('a')
print(items.html())
DOM manipulation:
addClass, removeClass
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# DOM manipulation: addClass, removeClass
html = """
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
Lacie
and
Title
Title
...
"""
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('#link2')
print(items)
items.addClass('addStyle')  # same as add_class
print(items)
items.remove_class('sister')  # same as removeClass
print(items)
attr, css:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# DOM manipulation: attr, css
html = """
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
Lacie
and
Title
Title
...
"""
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('#link2')
items.attr('name', 'addname')
print(items)
items.css('width', '100px')
print(items)
attr() can add a new attribute; if the element already has that attribute, the old value is overwritten.
remove:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# DOM manipulation: remove
html = """
Hello World
This is a paragraph.
"""
from pyquery import PyQuery as pq
doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())
wrap.find('p').remove()
print("After remove:")
print(wrap)
There are many other DOM methods. If you want to learn more, read the official documentation: https://pyquery.readthedocs.io/en/latest/api.html
Pseudo-class selectors:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Pseudo-class selectors
html = """
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
Lacie
and
Title
Title
...
"""
from pyquery import PyQuery as pq
doc = pq(html)
# print(doc)
wrap = doc('a:first-child')  # a tags that are the first child of their parent
print(wrap)
wrap = doc('a:last-child')  # a tags that are the last child of their parent
print(wrap)
wrap = doc('a:nth-child(2)')  # a tags that are the second child of their parent
print(wrap)
wrap = doc('a:gt(2)')  # a tags whose index is greater than 2; indexes are 0 1 2 3 4, starting from 0, not 1
print(wrap)
wrap = doc('a:nth-child(2n)')  # a tags at even child positions (2nd, 4th, ...)
print(wrap)
wrap = doc('a:contains(Lacie)')  # a tags whose text contains Lacie
print(wrap)
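I won't list every selector here. To learn more about CSS selectors, see the W3School CSS tutorial: http://www.w3school.com.cn/css/index.asp
That roughly covers how to use PyQuery. For more detail, read the official documentation: https://pyquery.readthedocs.io/en/latest/
The code above is available at: https://gitee.com/dwyui/pyQuery.git
Thank you for reading. If anything here is incorrect, please point it out. Thanks again.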

