豆瓣电影评论爬取
这是一个利用requests和xpath路径爬取豆瓣电影评论数据的爬虫代码,比较简单,主要用来熟悉xpath路径的使用。
#导入相关的库
import requests
from lxml import etree
接下来构造headers,如果网站的反爬措施比较简单,则不必构造headers。
# headers是假装浏览器在查看网页,一般设置user agent即可
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
url = 'https://movie.douban.com/subject/26752088/comments?start=0&limit=20&sort=new_score&status=P'
爬取内容,构造选择器
douban = requests.get(url,headers = headers)
selector = etree.HTML(douban.text)
首先获取评论块的内容
comments = selector.xpath('//div[@class="comment"]')
comments
[<Element div at 0x279e286ef08>,
<Element div at 0x279e28dfa08>,
<Element div at 0x279e293c608>,
<Element div at 0x279e293c508>,
<Element div at 0x279e293c548>,
<Element div at 0x279e293c6c8>,
<Element div at 0x279e293c708>,
<Element div at 0x279e293c748>,
<Element div at 0x279e293c788>,
<Element div at 0x279e293c688>,
<Element div at 0x279e293c7c8>,
<Element div at 0x279e293c808>,
<Element div at 0x279e293c848>,
<Element div at 0x279e293c888>,
<Element div at 0x279e293c8c8>,
<Element div at 0x279e293c908>,
<Element div at 0x279e293c948>,
<Element div at 0x279e293c988>,
<Element div at 0x279e293c9c8>,
<Element div at 0x279e293ca08>]
构造空列表,用来存放数据
users = []
stars = []
comment_text = []
循环获得每个用户评论的相关数据
for comment in comments:
user = comment.xpath('.//span[@class="comment-info"]/a/text()')[0]
star = comment.xpath('.//span[@class="comment-info"]/span[2]/@class')[0][7:8]
content = comment.xpath('.//span[@class="short"]/text()')[0]
users.append(user)
stars.append(star)
comment_text.append(content)
users
['忻钰坤',
'沐子荒',
'凌睿',
'徐若风',
'桃桃淘电影',
'远世祖',
'影志',
'开开kergelen',
'Noodles',
'哪吒男',
'栀虞',
'大島',
'木卫二',
'LOOK',
'张小北',
'喝可乐的鸟',
'OreoOlymLee',
'SELVEN',
'给你好看',
'任仁忍']
将爬取的数据构造一个字典,用来构建dataframe
dic = {'user':users,
'star':stars,
'comment':comment_text}
import pandas as pd
df = pd.DataFrame(dic)
df