Scraping Douban Movie Comments (Simple Version)

requests, xpath

Posted by 粘世强 on July 24, 2018

Scraping Douban Movie Comments

This is a simple crawler that uses requests and XPath expressions to scrape the comment data for a Douban movie. It is mainly meant as practice with XPath.

# Import the required libraries
import requests
from lxml import etree

Next, construct the headers. If the site's anti-scraping measures are lenient, you can skip this step.

# The headers make the request look like it comes from a browser; setting the User-Agent is usually enough
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
url = 'https://movie.douban.com/subject/26752088/comments?start=0&limit=20&sort=new_score&status=P'
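The query string in the URL above carries `start` and `limit` parameters. A minimal sketch of generating URLs for several pages follows, assuming `start` is a zero-based offset and `limit` is the page size (this is inferred from the URL, not confirmed by the site). The names `BASE_URL`, `build_comment_url`, and `page_urls` are hypothetical helpers introduced here for illustration.

```python
from urllib.parse import urlencode

# Base address of the comments page, taken from the URL used in this post
BASE_URL = 'https://movie.douban.com/subject/26752088/comments'

def build_comment_url(start=0, limit=20):
    """Return the comments URL for one page.

    Assumption: `start` is an offset into the comment list and
    `limit` is the number of comments per page.
    """
    params = {'start': start, 'limit': limit, 'sort': 'new_score', 'status': 'P'}
    return BASE_URL + '?' + urlencode(params)

# URLs for the first three pages, assuming 20 comments per page
page_urls = [build_comment_url(start=page * 20) for page in range(3)]
```

Calling `build_comment_url()` with the defaults reproduces the URL used in this post, so the rest of the code works unchanged for the first page.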

Fetch the page and build the selector

douban = requests.get(url, headers=headers)
selector = etree.HTML(douban.text)
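The request above assumes the server answers with a normal page. A slightly more defensive sketch adds a timeout and raises on HTTP error codes, so that a blocked or failed request is not silently parsed. The function name `fetch_html` and its `client` parameter are hypothetical, added here only for illustration.

```python
import requests

def fetch_html(url, headers=None, timeout=10, client=requests):
    """Fetch a page and return its text, raising on HTTP errors.

    `client` is a hypothetical hook: by default the requests module is used,
    but a requests.Session (or a test double) can be passed instead.
    """
    response = client.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text
```

With this helper, the selector could be built as `etree.HTML(fetch_html(url, headers=headers))`.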

First, select the comment blocks

comments = selector.xpath('//div[@class="comment"]')
comments
[<Element div at 0x279e286ef08>,
 <Element div at 0x279e28dfa08>,
 <Element div at 0x279e293c608>,
 <Element div at 0x279e293c508>,
 <Element div at 0x279e293c548>,
 <Element div at 0x279e293c6c8>,
 <Element div at 0x279e293c708>,
 <Element div at 0x279e293c748>,
 <Element div at 0x279e293c788>,
 <Element div at 0x279e293c688>,
 <Element div at 0x279e293c7c8>,
 <Element div at 0x279e293c808>,
 <Element div at 0x279e293c848>,
 <Element div at 0x279e293c888>,
 <Element div at 0x279e293c8c8>,
 <Element div at 0x279e293c908>,
 <Element div at 0x279e293c948>,
 <Element div at 0x279e293c988>,
 <Element div at 0x279e293c9c8>,
 <Element div at 0x279e293ca08>]
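To see what these expressions do without hitting the site, here is a small offline sketch. The markup in `sample_html` is a hypothetical, heavily simplified stand-in for the real page, which has more attributes and nesting. It also shows the difference between a relative path (`.//`, searched inside one block) and an absolute path (`//`, searched across the whole document).

```python
from lxml import etree

# Hypothetical, simplified markup; not copied from the real Douban page
sample_html = '''
<div class="comment">
  <span class="comment-info">
    <a href="#">user_a</a>
    <span>watched</span>
    <span class="allstar50 rating"></span>
  </span>
  <span class="short">Great film.</span>
</div>
<div class="comment">
  <span class="comment-info">
    <a href="#">user_b</a>
    <span>watched</span>
    <span class="allstar30 rating"></span>
  </span>
  <span class="short">It was fine.</span>
</div>
'''

sample = etree.HTML(sample_html)
blocks = sample.xpath('//div[@class="comment"]')

names = []
for block in blocks:
    # './/' keeps the search inside the current comment block
    name = block.xpath('.//span[@class="comment-info"]/a/text()')[0]
    text = block.xpath('.//span[@class="short"]/text()')[0]
    names.append(name)
    print(name, '|', text)

# '//' ignores the current block and matches across the whole document
all_short = blocks[0].xpath('//span[@class="short"]')
```

Forgetting the leading dot is a common mistake: every iteration would then return the same first match from the whole page.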

Create empty lists to hold the data

users = []
stars = []
comment_text = []

Loop over the comment blocks and extract the data for each user's comment

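for comment in comments:
    # Username: text of the link inside the comment-info span
    user = comment.xpath('.//span[@class="comment-info"]/a/text()')[0]
    # Star rating: the class looks like "allstar50 rating", so the character at index 7 is the star count
    star = comment.xpath('.//span[@class="comment-info"]/span[2]/@class')[0][7:8]
    # Comment text
    content = comment.xpath('.//span[@class="short"]/text()')[0]

    users.append(user)
    stars.append(star)
    comment_text.append(content)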
users
['忻钰坤',
 '沐子荒',
 '凌睿',
 '徐若风',
 '桃桃淘电影',
 '远世祖',
 '影志',
 '开开kergelen',
 'Noodles',
 '哪吒男',
 '栀虞',
 '大島',
 '木卫二',
 'LOOK',
 '张小北',
 '喝可乐的鸟',
 'OreoOlymLee',
 'SELVEN',
 '给你好看',
 '任仁忍']
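The slice `[7:8]` in the loop works when the class attribute looks like `allstar50 rating`, but it returns an unrelated character if the second span is something else, which may happen when a reviewer leaves no rating (an assumption, not verified here). A regular expression is a more forgiving alternative. The helper name `parse_star` is hypothetical.

```python
import re

def parse_star(class_attr):
    """Return the star count encoded in a class such as 'allstar40 rating'.

    Returns None when the attribute does not contain an 'allstarN0' token.
    """
    match = re.search(r'allstar(\d)0', class_attr)
    return int(match.group(1)) if match else None
```

In the loop, `star = parse_star(...)` could replace the slice, giving an integer for rated comments and `None` otherwise.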

Assemble the scraped data into a dictionary and use it to build a DataFrame

dic = {'user':users,
      'star':stars,
      'comment':comment_text}
import pandas as pd
df = pd.DataFrame(dic)
df
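Since the star values are collected as strings, a natural next step is to convert them to numbers and export the table. The sketch below uses placeholder rows in `sample_dic` rather than the scraped lists, so it runs on its own; the values are made up for illustration and are not scraped data.

```python
import pandas as pd

# Placeholder rows standing in for the scraped lists; not real scraped data
sample_dic = {'user': ['user_a', 'user_b'],
              'star': ['5', '3'],
              'comment': ['Great film.', 'It was fine.']}
sample_df = pd.DataFrame(sample_dic)

# Convert the star strings to numbers; anything unparseable becomes NaN
sample_df['star'] = pd.to_numeric(sample_df['star'], errors='coerce')

# With no path argument, to_csv returns the CSV content as a string
csv_text = sample_df.to_csv(index=False)
```

To write a file instead, pass a path, for example `df.to_csv('douban_comments.csv', index=False, encoding='utf-8-sig')`. The filename is hypothetical, and the `utf-8-sig` encoding is a common choice so that Chinese text opens correctly in Excel.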