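Python Web Scraping, First Steps (1): Downloading Web Comics with requests and bs4

Emm... what the summer after senior year of high school is really like: nothing to do all day ~~so I went looking for something to do, and ended up learning Python~~

Okay, enough rambling, on to the actual topic.

As the title says, I'm a beginner, and getting properly good at things like web scraping is going to take a lot of work.

So I'm starting with something simple: downloading the XKCD comics (the pages are simple and easy to extract from).

The plan is to use the requests and bs4 modules to fetch each page and parse its HTML, then save the images to disk.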

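First off, requests and bs4 are both third-party Python libraries. Install them with pip install xxx (where xxx is requests and bs4; the bs4 package on PyPI simply pulls in beautifulsoup4, which is the library's real name).

The workhorse of the requests library is requests.get(), which fetches whatever the given url points to. It also accepts a few extra arguments for specific purposes.

For example, passing timeout=10 sets both the connect timeout and the read timeout to 10 s; if either is exceeded, an exception is raised.

(Signature: get(url, params=None, **kwargs), where **kwargs collects the optional keyword arguments, timeout among them.)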
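
To make the role of params and timeout a bit more concrete, here is a minimal sketch. The query parameter page and the helper name fetch are made up for illustration (XKCD does not need any query parameters). Preparing a request without sending it is just a convenient way to see what params does to the URL without touching the network.

```python
import requests

# Build (but do not send) a GET request to see how `params` ends up in the URL.
# The "page" parameter is hypothetical, purely for illustration.
prepared = requests.Request("GET", "http://xkcd.com/", params={"page": "1"}).prepare()
print(prepared.url)


def fetch(link, timeout=10):
    """Hypothetical helper: return the response, or None if the request fails.

    `timeout` is handed to requests.get() through **kwargs. A request that
    takes too long raises requests.exceptions.Timeout, which is a subclass
    of requests.exceptions.RequestException.
    """
    try:
        res = requests.get(link, timeout=timeout)
        res.raise_for_status()      # turn 4xx/5xx status codes into exceptions
        return res
    except requests.exceptions.RequestException as err:
        print("Request failed:", err)
        return None
```

Calling fetch("http://xkcd.com") would then return either a response object or None, so the caller has to check for None before using the result.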

The site is slow and unstable, and connections fail fairly often, so to keep the scraper from hanging I added some retry code:

            
import os
import requests, bs4

url = "http://xkcd.com"
downloadCount = 0                        # number of image files downloaded so far

def get_elements(link, tle=10):          # tle: timeout in seconds
    count = 1                            # count: which attempt this is
                                         # on success, return what requests.get() gave us; after 3 failures, raise an exception
    while count <= 3:
        try:
            res = requests.get(link, timeout=tle)
            res.raise_for_status()
            return res
        except requests.exceptions.RequestException:
            print("Request failed (attempt %d of 3)." % count)
            count += 1

    raise TimeoutError("Your network is really bad. Please check your Internet and try again later.")

(requests actually ships with ready-made machinery for this... but I'm not going to use it for now.)
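
For reference, that built-in route probably looks something like the following: a Session with an HTTPAdapter whose max_retries is a urllib3 Retry object. This is only a sketch and not what the script in this post uses; the function name make_session and the particular retry settings are assumptions chosen for illustration.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(retries=3):
    """Hypothetical helper: a Session that retries failed requests by itself."""
    retry = Retry(
        total=retries,                           # at most this many retries per request
        backoff_factor=1,                        # wait a little longer before each retry
        status_forcelist=[500, 502, 503, 504],   # also retry on these HTTP status codes
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)            # use the adapter for every http:// URL
    session.mount("https://", adapter)           # and for every https:// URL
    return session


session = make_session()
```

With this in place, session.get(link, timeout=10) would stand in for requests.get(link, timeout=10), and the hand-written retry loop would no longer be needed.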

And then!

bs4.BeautifulSoup() can be used to parse the page's HTML.

Find the link to the image right away, then download the image:

            
    soup = bs4.BeautifulSoup(res.text, features="html.parser")  # parse the HTML to find the comic image's link
    releventContent = soup.select("#comic img")                 # the image always sits inside this part of the HTML
    picUrl = releventContent[0].get("src")                      # src starts with //, hence the "http:" prefix below

    print("Downloading picture %d..." % downloadCount)          # download the image
    picResource = get_elements("http:" + picUrl)
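
To see the selector at work without hitting the site, here is a small sketch that runs on a made-up HTML string. SAMPLE_HTML and the image path in it are invented for illustration; only the #comic img structure is taken from the code above. It also shows urljoin from the standard library as one possible alternative to gluing "http:" onto the front by hand.

```python
import bs4
from urllib.parse import urljoin

# Made-up HTML that imitates the relevant part of a comic page.
SAMPLE_HTML = """
<html><body>
  <div id="comic">
    <img src="//imgs.xkcd.com/comics/example.png" title="hover text" alt="Example">
  </div>
</body></html>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, features="html.parser")
images = soup.select("#comic img")     # CSS selector: <img> tags inside the element with id="comic"

if images:                             # select() returns an empty list when nothing matches
    pic_url = images[0].get("src")
    full_url = urljoin("http://xkcd.com/", pic_url)   # fills in the missing scheme
else:
    pic_url = full_url = None

print(pic_url)
print(full_url)
```

Checking that the list is non-empty before indexing it avoids an IndexError on any page where the selector finds nothing.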

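Every comic has its own number. To make the files easier to tell apart on disk, each one is saved as number + the image's file name (which is usually close to the comic's title):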

            
    prevUrl = soup.select("a[rel='prev']")                      # the "previous comic" link
    prevNum = prevUrl[0].get("href")                            # its href looks like '/1234/'; use it to find the number
    currentNum = 0                                              # number of the current comic
    if prevNum == '#':                                          # no previous comic, so this is the first one
        currentNum = 1
    else:
        currentNum = int(prevNum.strip('/')) + 1

    print("Writing picture %d..." % downloadCount)              # write the file to disk
                                                                # named as number + image file name, written in binary mode
    with open(str(currentNum) + '_' + os.path.basename(picUrl), 'wb') as picFile:
        for c in picResource.iter_content(100000):              # write the file chunk by chunk
            picFile.write(c)
    print("File %d successfully written." % downloadCount)
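
The numbering and naming logic can be tried out offline. The sketch below pulls it into two small functions; comic_number, save_chunks and the sample values are hypothetical names and data made up for illustration, with a list of byte strings standing in for picResource.iter_content(...).

```python
import os
import tempfile


def comic_number(prev_href):
    """Work out the current comic's number from the "prev" link's href."""
    if prev_href == "#":                 # no previous comic: this is number 1
        return 1
    return int(prev_href.strip("/")) + 1


def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path in binary mode."""
    with open(path, "wb") as pic_file:   # the with block closes the file even on errors
        for chunk in chunks:
            pic_file.write(chunk)


# Made-up values standing in for prevNum, picUrl and picResource.iter_content(...)
prev_href = "/1234/"
pic_url = "//imgs.xkcd.com/comics/example.png"
fake_chunks = [b"first chunk, ", b"second chunk"]

file_name = str(comic_number(prev_href)) + "_" + os.path.basename(pic_url)
target = os.path.join(tempfile.mkdtemp(), file_name)
save_chunks(fake_chunks, target)
print(file_name)
```

Writing into a temporary directory keeps the experiment from cluttering the real download folder.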

Finally, don't forget to point url at the previous comic:

            
    url = "http://xkcd.com" + prevNum                           # URL of the previous comic
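
Why this makes the crawl stop can be checked with a tiny simulation. FAKE_PREV_LINKS is a made-up stand-in for the site (three pages only), and walk_back is a hypothetical helper; the only things taken from the real script are the url update and the stop condition of ending in '#'.

```python
# Hypothetical stand-in for the site: maps each page URL to its "prev" href.
FAKE_PREV_LINKS = {
    "http://xkcd.com/": "/2/",
    "http://xkcd.com/2/": "/1/",
    "http://xkcd.com/1/": "#",        # the first comic points nowhere
}


def walk_back(start_url, prev_links, limit=50):
    """Follow "prev" links until the URL ends in '#' or limit pages were visited."""
    visited = []
    url = start_url
    while not url.endswith("#") and len(visited) < limit:
        visited.append(url)
        url = "http://xkcd.com" + prev_links[url]   # same update as in the script
    return visited


pages = walk_back("http://xkcd.com/", FAKE_PREV_LINKS)
print(pages)
```

After the first comic the URL becomes "http://xkcd.com#", which ends in '#', so the loop exits on its own.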

And that's it! The whole process is done! Now just sit back and wait while it slowly grabs the images...


~~Not exactly clean~~ source code:

            
#! python3
# Scraping practice 1: XKCD Comics
# Comics are fetched in reverse order, from the latest back to the 1st image.

import os
import requests,bs4

os.chdir("g:\\work\\gjmtest\\comics")                           # change this to a directory on your own machine
os.makedirs("xkcd",exist_ok=True)
os.chdir(".\\xkcd")

url = "http://xkcd.com/"
downloadCount = 0

def get_elements(link, tle=10):                                 # tle: timeout in seconds
    count = 1                                                   # count: which attempt this is
    while count <= 3:                                           # give up after 3 failed attempts
        try:
            res = requests.get(link, timeout=tle)
            res.raise_for_status()
            return res
        except requests.exceptions.RequestException:
            print("Request failed (attempt %d of 3)." % count)
            count += 1
    raise TimeoutError("Your network is really bad. Please check your Internet and try again later.")

while not url.endswith('#'):
    downloadCount += 1                                          # total number of files downloaded
    if downloadCount > 50:                                      # stop after 50 comics
        break
    print("Analyzing page %d..." % downloadCount)
    res = get_elements(url)

    soup = bs4.BeautifulSoup(res.text,features="html.parser")   # parse the HTML to find the comic image's link
    releventContent = soup.select("#comic img")
    picUrl = releventContent[0].get("src")

    print("Downloading picture %d..." % downloadCount)          # download the image
    picResource = get_elements("http:" + picUrl)

    prevUrl = soup.select("a[rel='prev']")                      # the "previous comic" link
    prevNum = prevUrl[0].get("href")                            # its href looks like '/1234/'; use it to find the number
    currentNum = 0                                              # number of the current comic
    if prevNum == '#':
        currentNum = 1
    else:
        currentNum = int(prevNum.strip('/')) + 1

    print("Writing picture %d..." % downloadCount)              # write the file to disk
                                                                # named as number + image file name, written in binary mode
    with open(str(currentNum) + '_' + os.path.basename(picUrl), 'wb') as picFile:
        for c in picResource.iter_content(100000):              # write the file chunk by chunk
            picFile.write(c)
    print("File %d successfully written." % downloadCount)

    url = "http://xkcd.com" + prevNum                           # URL of the previous comic

print("Done.")
