Scraping Product Data and Reviews from an E-commerce Platform with Python

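Introduction

As the internet has grown, e-commerce platforms have made shopping more convenient: consumers can buy all kinds of goods without leaving home. Sometimes, though, we need a large amount of product data for analysis, or want to know what other buyers think of a product, and a crawler can collect that data. This article shows how to use Python to scrape product data and reviews from an e-commerce platform, and how to use proxy IPs to keep the crawler running reliably.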

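Main Content

This article is divided into the following parts:

Scraping product list data
Scraping data from a single product page
Scraping review data
Using proxy IPs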

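1. Scraping product list data

First we scrape the product list, which includes each product's name, price, rating, and sales volume. Taking an e-commerce platform as the example, we can use the requests and BeautifulSoup libraries: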

import requests
from bs4 import BeautifulSoup

# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
    'Referer': 'https://www.xxx.com/'
}

# Query parameters
params = {
    'keyword': '手机',  # search keyword ("mobile phone")
    'sort': 's',        # sort order: s = comprehensive, p = by sales
    'pageNum': '1'      # page number
}

# Send the request
url = 'https://search.xxx.com/search'
response = requests.get(url, params=params, headers=headers, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status


def text_of(node, selector):
    """Return the stripped text of the first match, or '' if nothing matches."""
    found = node.select_one(selector)
    return found.text.strip() if found else ''


# Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the product list
for item in soup.select('.gl-item'):
    title = text_of(item, '.p-name em')        # product name
    price = text_of(item, '.p-price i')        # price
    score = text_of(item, '.p-commit strong')  # rating
    sales = text_of(item, '.p-commit a')       # sales volume

    print(title, price, score, sales)

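In the code above, requests sends the request and BeautifulSoup parses the HTML, from which we extract the product list. Inspecting the page source shows that each product sits in an element with the class "gl-item", so the select method with the CSS selector .gl-item retrieves them.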
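The request above covers only the first results page (pageNum is '1'). A minimal sketch of walking several pages follows, assuming that pageNum simply increases by one per page and that an empty page marks the end; both are assumptions about the target site. The names build_params, iter_pages, and fetch_page are hypothetical, and fetch_page stands in for the requests and BeautifulSoup code above so the loop can be shown without network access.

```python
import time
from typing import Callable, Iterator, List


def build_params(keyword: str, page: int, sort: str = 's') -> dict:
    """Build the query parameters used by the search request above."""
    return {'keyword': keyword, 'sort': sort, 'pageNum': str(page)}


def iter_pages(fetch_page: Callable[[dict], List[dict]],
               keyword: str,
               max_pages: int = 5,
               delay: float = 0.0) -> Iterator[dict]:
    """Yield items page by page until a page is empty or max_pages is reached.

    fetch_page is a hypothetical callable: it receives the params dict and
    returns the list of items parsed from that page.
    """
    for page in range(1, max_pages + 1):
        items = fetch_page(build_params(keyword, page))
        if not items:          # assumption: an empty page means no more results
            break
        yield from items
        if delay:
            time.sleep(delay)  # pause between pages to keep the request rate low


# Offline demonstration with a fake fetcher that has two pages of data.
fake_site = {'1': [{'title': 'phone A'}, {'title': 'phone B'}],
             '2': [{'title': 'phone C'}]}
results = list(iter_pages(lambda p: fake_site.get(p['pageNum'], []), 'phone'))
print([r['title'] for r in results])
```

In real use, fetch_page would send the request with the given params and return the parsed items, and a non-zero delay keeps the crawl from hitting the site too quickly.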

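2. Scraping data from a single product page

Next we scrape a single product page, which includes the product's name, price, rating, review count, and description. Again we use the requests and BeautifulSoup libraries: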

import requests
from bs4 import BeautifulSoup

# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
    'Referer': 'https://www.xxx.com/'
}

# Product page URL
url = 'https://item.xxx.com/123456.html'

# Send the request
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status

# Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')


def text_of(selector):
    """Return the stripped text of the first match, or '' if nothing matches."""
    found = soup.select_one(selector)
    return found.text.strip() if found else ''


title = text_of('#itemDisplayName')         # product name
price = text_of('#breakprice em')           # price
# NOTE: the rating and the review count use the same selector here, so both
# return the same text; replace one with the selector for the target page.
score = text_of('.J_commentTotal')          # rating
comment_count = text_of('.J_commentTotal')  # review count
detail = text_of('.J-detail-content')       # product description

print(title, price, score, comment_count, detail)

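In the code above, requests sends the request and BeautifulSoup parses the HTML, from which we extract the details of a single product. Inspecting the page source shows that the fields live in different elements, so the selectors have to be chosen to match the actual page.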
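The example product URL ends in a numeric file name (123456.html), and the review request in section 3 uses the same number as productId. Assuming that pattern holds for the target site, the ID can be pulled out of the URL so the two steps can be chained. extract_product_id is a hypothetical helper written for this illustration, not part of any library.

```python
import re
from typing import Optional
from urllib.parse import urlparse


def extract_product_id(url: str) -> Optional[str]:
    """Return the numeric ID from a URL path like '/123456.html', else None."""
    path = urlparse(url).path  # drops the scheme, host, and query string
    match = re.fullmatch(r'/(\d+)\.html', path)
    return match.group(1) if match else None


print(extract_product_id('https://item.xxx.com/123456.html'))
print(extract_product_id('https://item.xxx.com/about'))
```

Returning None for URLs that do not match lets the caller skip non-product links instead of failing partway through a crawl.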

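3. Scraping review data

Review data matters because other buyers' comments reveal a product's strengths and weaknesses. Taking an e-commerce platform as the example, we can use the requests and json libraries: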

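import json

import requests

# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
    'Referer': 'https://www.xxx.com/'
}

# Request URL and parameters
url = 'https://club.jd.com/comment/productPageComments.action'
params = {
    'productId': '123456',       # product ID
    'score': '0',                # rating filter: 0 = all, 1 = positive, 2 = neutral, 3 = negative
    'sortType': '5',             # sort order: 5 = by time, 6 = by popularity
    'pageNumber': '1',           # page number
    'pageSize': '10',            # reviews per page
    'isShadowSku': '0',          # shadow-SKU flag (described in the original as a "non-mainstream" product)
    'callback': 'fetchJSON_comment98vv123456'  # JSONP callback name (fixed value)
}

# Send the request
response = requests.get(url, params=params, headers=headers, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status

# Strip the JSONP wrapper "callback( ... );" and parse the JSON inside.
# str.lstrip/rstrip remove sets of characters, not a prefix or suffix,
# so slice between the first '(' and the last ')' instead.
text = response.text
data = json.loads(text[text.index('(') + 1:text.rindex(')')])

# Extract the review list
comments = data['comments']
for comment in comments:
    content = comment['content'].strip()   # review text
    score = comment['score']               # rating
    created_at = comment['creationTime']   # review time (avoids shadowing the time module)
    nickname = comment['nickname']         # reviewer

    print(content, score, created_at, nickname)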

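In the code above, requests sends the request and json.loads parses the JSON, from which we extract the review list. Inspecting the JSON shows which fields hold the information we need, and we simply read those fields.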
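Because the request passes a callback parameter, the response is JSONP: a JSON document wrapped in a function call. A minimal sketch of unwrapping it with a regular expression, independent of the callback name, is given here. strip_jsonp is a hypothetical helper, and the sample payload is made up to mirror the fields read above (content, score, creationTime, nickname); it is not real review data.

```python
import json
import re

# Matches  name( ... )  with an optional trailing semicolon.
_JSONP_RE = re.compile(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', re.DOTALL)


def strip_jsonp(text: str) -> dict:
    """Parse a JSONP response; fall back to plain JSON if there is no wrapper."""
    match = _JSONP_RE.match(text)
    payload = match.group(1) if match else text
    return json.loads(payload)


# Made-up sample response, shaped like the fields used in this section.
sample = ('fetchJSON_comment98vv123456({"comments": [{"content": " good ", '
          '"score": 5, "creationTime": "2021-01-01 12:00:00", '
          '"nickname": "a***b"}]});')
data = strip_jsonp(sample)

# Flatten each review into a small dict; .get() tolerates missing fields.
rows = [
    {
        'content': c.get('content', '').strip(),
        'score': c.get('score'),
        'created_at': c.get('creationTime'),
        'nickname': c.get('nickname'),
    }
    for c in data.get('comments', [])
]
print(rows)
```

Matching the wrapper by shape rather than by name means the helper keeps working if the callback value changes, and plain JSON passes through untouched.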

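4. Using proxy IPs

A crawler may get its IP blocked. To reduce that risk, we can route requests through proxy IPs so the crawler keeps running. Taking addresses from a proxy IP provider as the example, we use requests and pick a proxy at random for each request: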

import random

import requests

# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
    'Referer': 'https://www.xxx.com/'
}

# Request URL
url = 'http://www.xxx.com/'

# List of proxy IPs
proxy_list = [
    'http://123.45.67.89:8888',
    'http://123.45.67.90:8888',
    'http://123.45.67.91:8888'
]

# Pick a proxy at random and use it for both http and https URLs
proxy_url = random.choice(proxy_list)
proxy = {
    'http': proxy_url,
    'https': proxy_url
}

# Send the request through the proxy
response = requests.get(url, headers=headers, proxies=proxy, timeout=10)

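The code above defines a list of proxy IPs and picks one at random for the request. Spreading requests across several IPs lowers the chance of any single IP being blocked, though it does not rule it out.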
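Choosing one proxy at random still fails whenever that particular proxy is dead. One possible extension is to retry with a different proxy on failure. The sketch below assumes the same proxy_list format as above; fetch_with_proxies and the injected send callable are hypothetical names, with send standing in for a call such as requests.get(url, headers=headers, proxies=proxies, timeout=10) so the rotation logic can run without network access.

```python
import random
from typing import Callable, Dict, List


def fetch_with_proxies(url: str,
                       proxy_list: List[str],
                       send: Callable[[str, Dict[str, str]], object],
                       max_attempts: int = 3):
    """Try up to max_attempts distinct proxies in random order.

    send(url, proxies) is a hypothetical callable that performs the request
    and raises an exception when the proxy does not work.
    """
    candidates = random.sample(proxy_list, k=min(max_attempts, len(proxy_list)))
    last_error = None
    for proxy in candidates:
        proxies = {'http': proxy, 'https': proxy}  # route both schemes
        try:
            return send(url, proxies)
        except Exception as exc:  # in real code, catch requests.RequestException
            last_error = exc
    raise RuntimeError(f'all {len(candidates)} proxies failed') from last_error


# Offline demonstration: only the proxy ending in .90 "works".
def fake_send(url, proxies):
    if '67.90' not in proxies['http']:
        raise ConnectionError('proxy unreachable')
    return f'fetched {url} via {proxies["http"]}'


proxy_list = ['http://123.45.67.89:8888',
              'http://123.45.67.90:8888',
              'http://123.45.67.91:8888']
result = fetch_with_proxies('http://www.xxx.com/', proxy_list, fake_send)
print(result)
```

Sampling without replacement ensures a failed proxy is not retried within the same call; a longer-running crawler might also drop proxies that fail repeatedly.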