这是北京理工大学的课程，附上视频link：https://www.bilibili.com/video/av9784617/?p=1

Requests库

Requests库主要方法

Requests库的7个主要方法
方法	说明
requests.request()	构造一个请求，支撑以下各方法的基础方法
requests.get()	获取HTML网页的主要方法，对应于HTTP的GET
requests.head()	获取HTML网页头信息的方法，对应于HTTP的HEAD
requests.post()	向HTML网页提交POST请求的方法，对应于HTTP的POST
requests.put()	向HTML网页提交PUT请求的方法，对应于HTTP的PUT
requests.patch()	向HTML网页提交局部修改请求，对应于HTTP的PATCH
requests.delete()	向HTML页面提交删除请求，对应于HTTP的DELETE

Response对象的属性
属性	说明
r.status_code	HTTP请求的返回状态，200表示连接成功，404表示失败
r.text	HTTP响应内容的字符串形式，即url对应的页面内容
r.encoding	从HTTP header中猜测的响应内容编码方式，如果header中不存在charset，则认为编码为ISO-8859-1
r.apparent_encoding	从内容中分析出的响应内容编码方式（备选编码方式）
r.content	HTTP响应内容的二进制形式

理解Requests库的异常
异常	说明
requests.Connection Error	网络连接错误异常,如DNS查询失败、拒绝连接等
requests.HTTPError	HTTP错误异常
requests.URLRequired	URL缺失异常
requests.TooManyRedirects	超过最大重定向次数，产生重定向异常
requests.ConeetTimeout	连接远程服务器超时异常
requests.Timeout	请求URL超时，产生超时异常
r.raise_for_status()	如果不是200，产生异常requests.HTTPError

使用r.raise_for_status()检查异常，使代码更健壮

爬取网页的通用代码框架

            
              #通用代码框架框架
def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "error"

if __name__ == "__main__":
    url = "http://www.baidu.com"
    print getHTMLText(url)

HTTP协议

HTTP, Hypertext Transfer Protocol,超文本传输协议。HTTP是一个基于“请求与响应”模式的、无状态的应用层协议。

URL格式htp://host[:port][path]

host:合法的Internet主机域名或P地址

port:端口号，缺省端口为80

path:请求资源的路径

HTTP URL的理解:

url是通过http协议存取资源的internet路径，一个url对应一个数据资源。

网络爬虫的限制

◆来源审查:判断User-Agent进行限制

检查来访HTTP协议头的User-Agent域，只响应浏览器或友好爬虫的访问。

◆发布公告: Robots协议

告知所有爬虫网站的爬取策略，要求爬虫遵守。

Robots协议

Robots Exclusion Standard网络爬虫排除标准

作用:网站告知网络爬虫哪些页面可以抓取，哪些不行。

形式:在网站根目录下的robots.txt文件。

在爬取某一个网站时，可以“网站”+“robots.txt”查看是否限制爬虫

案例:京东的Robots协议

https://www.jd.com/robots.txt

*代表所有

user-agent：HuihuiSpider Disallow:/ -->表明针对user-agent是HuihuiSpider，整个网站都不可以爬取

实例1:京东商品页面的爬取

            
              try:
    r = requests.get("https://item.jd.com/100000177760.html")
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print r.text[:1000]
except:
    print "error"

实例2:亚马逊商品页面的爬取

            
              try:
    r = requests.get("https://www.amazon.cn/dp/B01NAS120J/ref=cngwdyfloorv2_recs_0/462-8269923-3092529?pf_rd_m=A1AJ19PSB66TGU&pf_rd_s=desktop-2&pf_rd_r=2RAT2AKB28AT2ZBSCDBE&pf_rd_r=2RAT2AKB28AT2ZBSCDBE&pf_rd_t=36701&pf_rd_p=d2aa3428-dc2b-4cfe-bca6-5e3a33f2342e&pf_rd_p=d2aa3428-dc2b-4cfe-bca6-5e3a33f2342e&pf_rd_i=desktop")
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print r.text[:1000]
except:
    print "error"

#对比传入headers和不传入headers的区别，在第二种可以发现user-agent改变了
r = requests.get("https://www.amazon.cn/gp/product/B01M8L5Z3Y")
print r.status_code
print r.request.headers

kv = {'user-agent':'Mozilla/5.0'}
r = requests.get("https://www.amazon.cn/gp/product/B01M8L5Z3Y", headers=kv)
print r.request.headers

实例3:百度360搜索关键字提交

            
              #百度关键词接口
kv = {'wd':'python'}
r = requests.get("http://www.baidu.com/s", params=kv)
print r.request.url
print len(r.text)

#360关键词接口
kv = {'q':'python'}
r = requests.get("http://www.so.com/s", params=kv)
print r.request.url
print len(r.text)

实例4:网络图片的爬取和存储

            
              url = "https://www.nationalgeographic.com/animals/2019/04/polar-bears-algae-sea-ice-warming/#/01-polar-bears-algae-nationalgeographic_675501.jpg"
root = './pics/'
path = root + url.split('/')[-1]
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        with open(path, 'wb') as f:
            f.write(r.content)
            f.close()
            print '文件保存成功'
    else:
        print '文件已存在'
except:
    print '爬取失败'

实例5:ip地址归属地的自动查询

            
              url = 'http://m.ip138.com/ip.asp?ip='
try:
    r = requests.get(url + '202.204.80.112')
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print r.text[-500:]
except:
    print '爬取失败'

更多文章、技术交流、商务合作、联系博主

微信扫码或搜索：z360901061

微信扫一扫加我为好友

QQ号联系： 360901061

您的支持是博主写作最大的动力，如果您喜欢我的文章，感觉我的文章对您有帮助，请用微信扫描下面二维码支持博主2元、5元、10元、20元等您想捐的金额吧，狠狠点击下面给点支持吧，站长非常感激您！手机微信长按不能支付解决办法：请将微信支付二维码保存到相册，切换到微信，然后点击微信右上角扫一扫功能，选择支付二维码完成支付。

【本文对您有帮助就好】元

2元

5元

10元

20元

自定义