Python Web Scraping from Scratch (5): The pyQuery Library


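What is pyQuery:

pyQuery is a powerful and flexible web page parsing library. If you find regular expressions too tedious to write (I can't write them myself), if BeautifulSoup's syntax is too hard to remember, and if you already know jQuery's syntax, then PyQuery is the best choice for you.

Installing pyQuery: run pip3 install pyquery.

Basic usage of pyQuery:

Initialization:

Initializing from a string: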

            
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
    <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/title" class="sister" id="link3">Title</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    </body></html>
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    print(doc('a'))

Output:

[Image 1: screenshot of the output]

Initializing from a URL:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Initialize from a URL
    from pyquery import PyQuery as pq
    doc = pq('http://www.baidu.com')
    print(doc('input'))

Initializing from a file:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Initialize from a file
    from pyquery import PyQuery as pq
    doc = pq(filename='baidu.html')
    print(doc('title'))

Output:

[Image 2: screenshot of the output]

Selection works the same way as in jQuery: selecting by id, by name, or by class all behave the same, and many other features match jQuery too.

Basic CSS selectors:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # CSS selectors
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
    <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/title" class="title" id="link3">Title</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    </body></html>
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    print(doc('.title'))

Finding elements:

Child elements:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Child elements
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
    <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/title" class="title" id="link3">Title</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    </body></html>
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('.title')
    print(type(items))
    print(items)
    p = items.find('b')
    print(type(p))
    print(p)

This code selects the elements whose class is title. Two elements match: a p tag and an a tag. We then call find to look up the b tag nested inside them. Output:

[Image 3: screenshot of the output]

Note that find searches inside each matched element, so it only returns tags nested within the current selection.

children:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Child elements
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
    <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/title" class="title" id="link3">Title</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    </body></html>
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('.title')
    print(items.children())

Output:

[Image 4: screenshot of the output]

You can also pass a selector to children() to filter the result:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Child elements, filtered by a selector
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
    <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/title" class="title" id="link3">Title</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    </body></html>
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('.title')
    print(items.children('b'))

The output is the same as above.

Parent elements:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Parent element
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
    <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/title" class="title" id="link3">Title</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    </body></html>
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('#link1')
    print(items)
    print(items.parent())

Output:

[Image 5: screenshot of the output]

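parent() returns only the direct parent. The parents() method returns all ancestors, and it accepts an optional selector to filter them:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Ancestor elements
    # Tags and attributes are assumed to match the earlier examples.
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/title" class="sister" id="link3">Title</a></p>
    <p class="story">...</p>
    </body></html>
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('#link1')
    print(items)
    print(items.parents('body'))

Output:

[Image 6: screenshot of the output]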

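Sibling elements:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Sibling elements
    # Tags and attributes are assumed to match the earlier examples.
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/title" class="sister" id="link3">Title</a></p>
    <p class="story">...</p>
    </body></html>
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('#link1')
    print(items)
    print(items.siblings('#link2'))

Output:

[Image 7: screenshot of the output]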

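That covers the methods for finding elements. Next, let's look at how to iterate over elements.

Traversal:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Traversal
    # Tags and attributes are assumed to match the earlier examples.
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/title" class="sister" id="link3">Title</a></p>
    <p class="story">...</p>
    </body></html>
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('a')
    for k, v in enumerate(items.items()):
        print(k, v)

Output:

[Image 8: screenshot of the output]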

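Getting information:

Getting attributes:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Get attributes
    # Tags and attributes are assumed to match the earlier examples.
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/title" class="sister" id="link3">Title</a></p>
    <p class="story">...</p>
    </body></html>
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('a')
    print(items)
    print(items.attr('href'))
    print(items.attr.href)

Output:

[Image 9: screenshot of the output]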

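Getting text:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Get text
    # Tags and attributes are assumed to match the earlier examples.
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/title" class="sister" id="link3">Title</a></p>
    <p class="story">...</p>
    </body></html>
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('a')
    print(items)
    print(items.text())
    print(type(items.text()))

Output:

[Image 10: screenshot of the output]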

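Getting HTML:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Get HTML
    # Tags and attributes are assumed to match the earlier examples.
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/title" class="sister" id="link3">Title</a></p>
    <p class="story">...</p>
    </body></html>
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('a')
    print(items.html())

Output:

[Image 11: screenshot of the output]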

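DOM manipulation:

addClass, removeClass:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # DOM manipulation: addClass, removeClass
    # Tags and attributes are assumed to match the earlier examples.
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/title" class="sister" id="link3">Title</a></p>
    <p class="story">...</p>
    </body></html>
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('#link2')
    print(items)
    items.addClass('addStyle')  # same as add_class
    print(items)
    items.remove_class('sister')  # same as removeClass
    print(items)

Output:

[Image 12: screenshot of the output]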

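attr, css:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # DOM manipulation: attr, css
    # Tags and attributes are assumed to match the earlier examples.
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/title" class="sister" id="link3">Title</a></p>
    <p class="story">...</p>
    </body></html>
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    items = doc('#link2')
    items.attr('name', 'addname')
    print(items)
    items.css('width', '100px')
    print(items)

attr can add a new attribute; if the attribute already exists, its value is overwritten.

Output:

[Image 13: screenshot of the output]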

remove:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # DOM manipulation: remove
    # The wrapper tags are assumed from the selectors used below:
    # an element with class wrap that contains a p.
    html = """
    <div class="wrap">
        Hello World
        <p>This is a paragraph.</p>
    </div>
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    wrap = doc('.wrap')
    print(wrap.text())
    wrap.find('p').remove()
    print("After remove:")
    print(wrap)

Output:

[Image 14: screenshot of the output]

There are many other DOM methods. If you want to learn more, read the official API documentation: https://pyquery.readthedocs.io/en/latest/api.html

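Pseudo-class selectors:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Pseudo-class selectors
    # Tags and attributes are assumed to match the earlier examples.
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/title" class="sister" id="link3">Title</a></p>
    <p class="story">...</p>
    </body></html>
    """
    from pyquery import PyQuery as pq
    doc = pq(html)
    # print(doc)
    wrap = doc('a:first-child')  # a tags that are the first child of their parent
    print(wrap)
    wrap = doc('a:last-child')  # a tags that are the last child of their parent
    print(wrap)
    wrap = doc('a:nth-child(2)')  # a tags that are the second child
    print(wrap)
    wrap = doc('a:gt(2)')  # a tags with an index greater than 2; indexes start at 0, not 1
    print(wrap)
    wrap = doc('a:nth-child(2n)')  # a tags whose position is a multiple of 2
    print(wrap)
    wrap = doc('a:contains(Lacie)')  # a tags containing the text Lacie
    print(wrap)

I won't go through every selector one by one here. To learn more about CSS selectors, see the CSS tutorial on w3school: http://www.w3school.com.cn/css/index.asp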

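That roughly covers how to use pyQuery. For more detail, read the official documentation: https://pyquery.readthedocs.io/en/latest/

The code above is available at: https://gitee.com/dwyui/pyQuery.git

Thanks for reading. If anything here is incorrect, please point it out. Thank you.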

