Overall, starting from the Response object, we split into two paths. On one path, the data is embedded in the HTML, so we use the BeautifulSoup library to parse and extract it; on the other, the data is stored as JSON, so we parse it with the response.json() method, then extract and store it.
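To make the two paths concrete, here is a minimal sketch (the URLs are placeholders, not real endpoints):

import requests
from bs4 import BeautifulSoup

# Path 1: the data lives in the HTML, so parse it with BeautifulSoup
res=requests.get('https://example.com/page')        # placeholder URL
soup=BeautifulSoup(res.text,'html.parser')
# ...then locate elements with soup.find() / soup.find_all()

# Path 2: the data comes back as JSON, so parse it with response.json()
res=requests.get('https://example.com/api/items')   # placeholder URL
data=res.json()
# ...then index into the resulting dicts and lists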
Exercise: crawl the title, excerpt, and link of each article by the Zhihu user Zhang Jiawei, and save them to a local file.
Zhang Jiawei's Zhihu article list is here: https://www.zhihu.com/people/zhang-jia-wei/posts?page=1
First fetch the data with requests.get(), then check whether the request succeeded.
import requests
# Request headers, so the request looks like it comes from a normal browser
headers={'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
# The articles API that the page calls behind the scenes
url='https://www.zhihu.com/api/v4/members/zhang-jia-wei/articles?'
# Query parameters, copied from the browser's Network panel
params={
    'include':'data[*].comment_count,suggest_edit,is_normal,thumbnail_extra_info,thumbnail,can_comment,comment_permission,admin_closed_comment,content,voteup_count,created,updated,upvoted_followees,voting,review_info,is_labeled,label_info;data[*].author.badge[?(type=best_answerer)].topics',
    'offset':'10',
    'limit':'20',
    'sort_by':'voteups',
}
# Send the request and store the response in res
res=requests.get(url,headers=headers,params=params)
# Confirm the request succeeded
print(res.status_code)
A printed 200 means the request succeeded.
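Before writing the loop, it helps to print part of the parsed JSON to see its shape. A quick check, continuing from the request above (the key names match the fields we use later):

html=res.json()
# The response is a dict with 'data' (the list of articles) and 'paging'
print(html.keys())
# 'paging' holds the is_end and totals fields discussed next
print(html['paging'])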
Now let's compare what comes back for the first-page request and the last-page request:
Comparing the two, you'll find that is_end is false on the first page and true on the last page; this field can help us end the loop.
As for the totals: 919 field, I checked the page count against the number of articles per page and concluded it is the total number of articles, so it can also serve as the loop's stopping condition. Either field works; I chose totals, combined with changing the value of offset for each page.
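For reference, a loop that stops on is_end instead could look like this sketch (it reuses the headers above; the long include parameter is omitted for brevity, which may change which fields the API returns):

import requests

headers={'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
url='https://www.zhihu.com/api/v4/members/zhang-jia-wei/articles'
offset=0
while True:
    params={'offset':str(offset),'limit':'20','sort_by':'voteups'}
    res=requests.get(url,headers=headers,params=params)
    page=res.json()
    for item in page['data']:
        print(item['title'])
    # is_end is true on the last page, so we can stop here
    if page['paging']['is_end']:
        break
    offset+=20

The full program below uses totals instead, stepping offset by 20 each time: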
import requests
import csv
import openpyxl

n=0   # running article number
# Open the CSV file; 'gbk' encoding is used so Excel on Chinese Windows
# displays the Chinese headers correctly
csv_file=open('zhihu.csv','w',newline='',encoding='gbk')
writer=csv.writer(csv_file)
writer.writerow(['编号','标题','摘要','链接'])   # number, title, excerpt, link
# Create the Excel workbook with the same header row
wb=openpyxl.Workbook()
sheet=wb.active
sheet['A1']='编号'
sheet['B1']='标题'
sheet['C1']='摘要'
sheet['D1']='链接'
headers={'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
url='https://www.zhihu.com/api/v4/members/zhang-jia-wei/articles?'
params={
    'include':'data[*].comment_count,suggest_edit,is_normal,thumbnail_extra_info,thumbnail,can_comment,comment_permission,admin_closed_comment,content,voteup_count,created,updated,upvoted_followees,voting,review_info,is_labeled,label_info;data[*].author.badge[?(type=best_answerer)].topics',
    'offset':'10',
    'limit':'20',
    'sort_by':'voteups'
}
# One request up front, just to read the total number of articles
res=requests.get(url,headers=headers,params=params)
html=res.json()
totals=html['paging']['totals']
# Page through all the articles, 20 per request, by updating offset
for offset in range(0,totals,20):
    params['offset']=str(offset)
    res=requests.get(url,headers=headers,params=params)
    html=res.json()
    items=html['data']
    for item in items:
        n+=1
        title=item['title']
        abstract=item['excerpt']
        link=item['url']   # use a new name; url already holds the API address
        writer.writerow([n,title,abstract,link])
        sheet.append([n,title,abstract,link])
csv_file.close()
wb.save('zhihu.xlsx')
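To confirm the files were written, you can read the CSV back with the same encoding (a quick sanity check, not part of the exercise itself):

import csv

with open('zhihu.csv',newline='',encoding='gbk') as f:
    for row in csv.reader(f):
        print(row)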