Python Crawler, Level 7 Project: Scraping Articles by a Zhihu Big V

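Broadly speaking, everything after the Response object splits into two paths. On one path the data sits in the HTML, so we parse and extract it with the BeautifulSoup library. On the other, the data is delivered as JSON, so we parse it with the response.json() method and then extract and store it.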
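To make the JSON path concrete, here is a minimal sketch that decodes a response body and pulls out the fields of interest. The body is made up for illustration (the titles and example.com links are not real data); only the key names data, title, excerpt, url and paging mirror the ones used in the code later in this post. extract_rows is a hypothetical helper name.

```python
import json

# Made-up response body, shaped like the API responses used later in this post.
SAMPLE_BODY = '''
{
  "paging": {"is_end": true, "totals": 2},
  "data": [
    {"title": "First article", "excerpt": "Opening lines", "url": "https://example.com/p/1"},
    {"title": "Second article", "excerpt": "More opening lines", "url": "https://example.com/p/2"}
  ]
}
'''

def extract_rows(payload):
    """Return one (title, excerpt, url) tuple per article in the payload."""
    return [(item['title'], item['excerpt'], item['url']) for item in payload['data']]

# On a real response, response.json() performs this decoding step.
payload = json.loads(SAMPLE_BODY)
rows = extract_rows(payload)
for row in rows:
    print(row)
```

Once the body is decoded, everything is ordinary dict and list access, which is why no HTML parser is needed on this path.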
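The task: scrape the title, excerpt, and link of every article by Zhihu big V Zhang Jiawei (张佳玮) and save them to a local file.
Zhang Jiawei's article list on Zhihu is here: https://www.zhihu.com/people/zhang-jia-wei/posts?page=1
The code below does not parse that page's HTML. It requests the JSON endpoint under /api/v4/ instead, so this project follows the second path.
Fetch the data with requests.get(), then check that the request succeeded.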

            
import requests
#import the requests library
headers={'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
#request headers carrying a browser-like user-agent
url='https://www.zhihu.com/api/v4/members/zhang-jia-wei/articles'
#API endpoint; requests appends the query string built from params
params={
    'include':'data[*].comment_count,suggest_edit,is_normal,thumbnail_extra_info,thumbnail,can_comment,comment_permission,admin_closed_comment,content,voteup_count,created,updated,upvoted_followees,voting,review_info,is_labeled,label_info;data[*].author.badge[?(type=best_answerer)].topics',
    'offset':'10',
    'limit':'20',
    'sort_by':'voteups',
    }
#query parameters
res=requests.get(url,headers=headers,params=params)
#send the request and store the response in res
print(res.status_code)
#confirm that the request succeeded

A printed status code of 200 means the request succeeded.
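The params dict in the request above ends up as the query string of the URL. The following is a rough sketch of that step using only the standard library, not what requests does internally character for character. build_url is a hypothetical helper, and the long include parameter is left out to keep the output readable.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = 'https://www.zhihu.com/api/v4/members/zhang-jia-wei/articles'

def build_url(offset, limit=20, sort_by='voteups'):
    """Join the endpoint and the encoded query parameters into one URL."""
    query = urlencode({'offset': offset, 'limit': limit, 'sort_by': sort_by})
    return BASE_URL + '?' + query

first_page = build_url(0)
third_page = build_url(40)
print(first_page)
print(third_page)

# Decoding the query string again recovers the original parameter values.
print(parse_qs(urlsplit(third_page).query))
```

Only offset differs between the two URLs, which is the value the crawl has to advance from page to page.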
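Now let's compare the requests for the first page and the last page.
In the response, is_end is false on the first page and true on the last page, so this field can tell us when to stop looping.
As for the totals: 919 field, I worked it out from the page count and the number of articles per page and concluded that it is the total number of articles, so it can also serve as the loop's stopping condition. Either field works; I used totals, combined with how offset changes from page to page.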
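The other stopping condition, is_end, can be sketched as follows. This is not the approach used in this post's final code, which relies on totals. To keep the sketch runnable without network access, the page fetcher is passed in as a function; collect_articles, fake_fetch and FAKE_TOTAL are hypothetical names, and the fake pages only imitate the data and paging keys of the real response.

```python
def collect_articles(fetch_page, limit=20):
    """Request page after page until the response reports is_end == True."""
    rows = []
    offset = 0
    while True:
        page = fetch_page(offset, limit)  # must return the decoded JSON dict
        for item in page['data']:
            rows.append((item['title'], item['excerpt'], item['url']))
        # Stop on the last page; the empty-page check guards against looping forever.
        if page['paging']['is_end'] or not page['data']:
            break
        offset += limit
    return rows

FAKE_TOTAL = 45  # made-up article count for the fake fetcher

def fake_fetch(offset, limit):
    """Stand-in for the real API: returns generated articles and paging info."""
    stop = min(offset + limit, FAKE_TOTAL)
    data = [
        {'title': 'title %d' % i,
         'excerpt': 'excerpt %d' % i,
         'url': 'https://example.com/p/%d' % i}
        for i in range(offset, stop)
    ]
    return {'data': data, 'paging': {'is_end': stop >= FAKE_TOTAL, 'totals': FAKE_TOTAL}}

rows = collect_articles(fake_fetch)
print(len(rows))
```

For a real crawl, fetch_page would be a small function that calls requests.get() with the given offset and limit and returns res.json(). Using is_end avoids the extra first request that the totals approach needs to learn the article count.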

            
import requests
import csv
import openpyxl

n=0
#running article number
csv_file=open('zhihu.csv','w',newline='',encoding='utf-8-sig')
#utf-8-sig instead of gbk: gbk raises UnicodeEncodeError on characters it cannot represent (emoji, for example)
writer=csv.writer(csv_file)
writer.writerow(['No.','Title','Excerpt','URL'])

wb=openpyxl.Workbook()
sheet=wb.active
sheet['A1']='No.'
sheet['B1']='Title'
sheet['C1']='Excerpt'
sheet['D1']='URL'

headers={'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
url='https://www.zhihu.com/api/v4/members/zhang-jia-wei/articles'
limit=20
#articles per page; the same value is used as the step of the offset loop so that no page is skipped
params={
    'include':'data[*].comment_count,suggest_edit,is_normal,thumbnail_extra_info,thumbnail,can_comment,comment_permission,admin_closed_comment,content,voteup_count,created,updated,upvoted_followees,voting,review_info,is_labeled,label_info;data[*].author.badge[?(type=best_answerer)].topics',
    'offset':0,
    'limit':limit,
    'sort_by':'voteups',
    }
res=requests.get(url,headers=headers,params=params)
#first request, made only to read the total number of articles
html=res.json()
totals=html['paging']['totals']
for offset in range(0,totals,limit):
    params['offset']=offset
    #reuse the params dict instead of a hand-encoded URL
    res=requests.get(url,headers=headers,params=params)
    html=res.json()
    items=html['data']
    for item in items:
        n+=1
        title=item['title']
        abstract=item['excerpt']
        link=item['url']
        #named link so that the API url above is not overwritten
        writer.writerow([n,title,abstract,link])
        sheet.append([n,title,abstract,link])
csv_file.close()
wb.save('zhihu.xlsx')
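A quick way to check the CSV step on its own is to write a few rows and read them back. The sketch below uses only the standard library and made-up rows; save_rows and load_rows are hypothetical helper names. It writes with the utf-8-sig encoding, which can represent characters such as emoji that gbk cannot.

```python
import csv
import os
import tempfile

HEADER = ['No.', 'Title', 'Excerpt', 'URL']

def save_rows(path, rows):
    """Write the header and the article rows to a CSV file."""
    # The with-block closes the file even if a write fails.
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(rows)

def load_rows(path):
    """Read the CSV file back as a list of lists of strings."""
    with open(path, newline='', encoding='utf-8-sig') as f:
        return list(csv.reader(f))

# Made-up rows; the second title contains an emoji that gbk cannot encode.
sample = [
    [1, '文章一', '摘要一', 'https://example.com/p/1'],
    [2, 'Title with emoji 😀', 'Excerpt, with a comma', 'https://example.com/p/2'],
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'zhihu_check.csv')
    save_rows(path, sample)
    loaded = load_rows(path)

print(loaded[0])
print(len(loaded))
```

Everything read back from a CSV file is a string, so the article numbers come back as '1' and '2' rather than integers.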
