E-commerce Transaction Data Analysis in Practice: Data Cleaning and Initial Insights

花匠小妙招
2025-01-11 22:26

I. Import the Required Modules

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from matplotlib import rcParams

II. Read the Data

ds_trading_data = pd.read_csv("./order_info_2016.csv", index_col="id")
pd.set_option("display.max_columns", None)  # show all columns
ds_trading_data.index.name = None
rcParams["font.sans-serif"] = ["Kaiti"]  # font used when plotting
rcParams["axes.unicode_minus"] = False  # render the minus sign correctly

After loading the data, first check whether any values are missing.

# info() prints its report directly and returns None, so it is not wrapped in print()
ds_trading_data.info()

Int64Index: 104557 entries, 1 to 104557
Data columns (total 10 columns):
orderId 104557 non-null int64
userId 104557 non-null int64
productId 104557 non-null int64
cityId 104557 non-null int64
price 104557 non-null int64
payMoney 104557 non-null int64
channelId 104549 non-null object
deviceType 104557 non-null int64
createTime 104557 non-null object
payTime 104557 non-null object

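The data has 10 columns (fields) and 104557 rows. The channelId column has only 104549 non-null values, so it contains 8 nulls. Next we go through the fields one by one and handle duplicates, missing values, and invalid values.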
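The number of nulls per column can also be read off directly instead of being inferred from the info() report. A minimal sketch, assuming a small hypothetical DataFrame (`toy`) that stands in for the real order data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for order_info_2016.csv
toy = pd.DataFrame({
    "orderId": [1001, 1002, 1003, 1004],
    "channelId": ["a", np.nan, "b", None],
    "payMoney": [500, 1200, 300, 800],
})

# Number of missing values in each column
null_counts = toy.isnull().sum()
print(null_counts)

# Names of the columns that actually contain missing values
cols_with_nulls = null_counts[null_counts > 0].index.tolist()
print(cols_with_nulls)
```

On the real data, the same call would be expected to report the 8 nulls in channelId.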

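III. Clean the Data

1. Handling the orderId column

print(ds_trading_data["orderId"].unique().size)  # number of distinct orderId values; fewer than the row count means duplicates

Output: 104530

print(ds_trading_data.describe())  # inspect the min and max of each column to judge whether its values are in a normal range

Checking whether the maximum and minimum of each field fall within a normal range of values.
The values in this column are all within a normal range, but there are duplicates: 104530 distinct values across 104557 rows. Order numbers must be unique, so the duplicates have to be removed. We defer that step to the end.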
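The gap between 104557 rows and 104530 distinct IDs means 27 surplus rows. A sketch of how such rows could be counted, inspected, and removed, assuming a hypothetical toy frame rather than the real data:

```python
import pandas as pd

# Hypothetical toy data: orderId 1002 appears twice
toy = pd.DataFrame({
    "orderId": [1001, 1002, 1002, 1003],
    "userId": [1, 2, 2, 3],
})

# Row count minus distinct-ID count gives the number of surplus rows
n_surplus = len(toy) - toy["orderId"].nunique()

# keep=False marks every row of a duplicated group, which is handy for inspection
dup_rows = toy[toy["orderId"].duplicated(keep=False)]

# De-duplicate at the DataFrame level, keeping the first row per orderId
deduped = toy.drop_duplicates(subset="orderId", keep="first")

print(n_surplus)
print(dup_rows)
print(deduped)
```

Keeping the first occurrence is only one possible policy. Whether the first or last row of a duplicated order is the right one to keep depends on how the duplicates arose.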

2. Handling the userId column

print(ds_trading_data["userId"].unique().size)

Output: 102672

The userId column also contains repeated values, but one user can place several orders, so repeats are allowed here.

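3. Handling the productId column

The describe() output above shows that the minimum of productId is 0, which is not a plausible product ID, so the zero values need to be handled.

print(ds_trading_data[ds_trading_data["productId"] == 0])

Output: [177 rows x 10 columns]

There are 177 rows with a productId of 0. These rows all need to be cleaned out, and again we defer that to the end.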
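When only the count of invalid rows is needed, summing a boolean mask avoids printing the whole subset. A small sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data with two invalid (zero) product IDs
toy = pd.DataFrame({"productId": [0, 17, 42, 0, 8]})

zero_mask = toy["productId"] == 0

# True counts as 1, so the sum is the number of matching rows
n_zero = int(zero_mask.sum())

# The mean of the mask is the share of matching rows
share_zero = float(zero_mask.mean())

print(n_zero, share_zero)
```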

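4. Handling the cityId column

Like productId, cityId is allowed to contain repeated values, and the describe() output shows that all of its values are within a valid range, so this column needs no processing.

5. Handling the price column

The describe() output shows that the price values are within a normal range and need no cleaning. However, prices are stored in fen (1/100 of a yuan), so we convert them to yuan.

ds_trading_data["price"] = ds_trading_data["price"] / 100

6. Handling the payMoney column

ds_trading_data["payMoney"] = ds_trading_data["payMoney"] / 100  # same as price: convert fen to yuan

In the describe() output, the minimum of payMoney is negative, which is not plausible, so we delete those records.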
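The unit conversion and the removal of negative payments can be sketched together on a toy frame. The column values here are hypothetical, and the filter is written as a boolean mask, which is one common way to express this kind of removal:

```python
import pandas as pd

# Hypothetical toy data, amounts in fen
toy = pd.DataFrame({
    "orderId": [1, 2, 3],
    "payMoney": [12000, -500, 0],
})

# Convert fen to yuan
toy["payMoney"] = toy["payMoney"] / 100

# Keep only rows whose payment is not negative; copy() gives an independent frame
cleaned = toy[toy["payMoney"] >= 0].copy()

print(cleaned)
```

A payment of exactly 0 is kept in this sketch, since only negative values are treated as invalid.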

ds_trading_data.drop(index=ds_trading_data[ds_trading_data["payMoney"] < 0].index, inplace=True)

7. Handling the channelId column

As noted earlier, the channelId column contains null values. We delete those rows.
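pandas also provides dropna for removing rows with nulls in chosen columns. A minimal sketch on hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing channelId
toy = pd.DataFrame({
    "orderId": [1, 2, 3],
    "channelId": ["c1", np.nan, "c3"],
})

# Drop only the rows where channelId is null; other columns are not considered
cleaned = toy.dropna(subset=["channelId"])

print(cleaned)
```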

ds_trading_data.drop(index=ds_trading_data[pd.isnull(ds_trading_data["channelId"])].index, inplace=True)

8. Handling the createTime column

# Convert createTime to datetime
ds_trading_data["createTime"] = pd.to_datetime(ds_trading_data["createTime"])

9. Handling the payTime column

# Convert payTime to datetime
ds_trading_data["payTime"] = pd.to_datetime(ds_trading_data["payTime"])

10. Checking the year in which orders were created

print(pd.DatetimeIndex(ds_trading_data["createTime"]).year.unique())

Output: Int64Index([2016, 2015], dtype='int64', name='createTime')

print(ds_trading_data[pd.DatetimeIndex(ds_trading_data["createTime"]).year == 2015].index)

Output: Int64Index([53, 18669, 36650, 71638, 88692], dtype='int64')

Five orders were created in 2015 and all the rest in 2016, so we delete the 2015 records and analyze only the orders created in 2016.
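The year check and the filter can both be expressed through the `.dt` accessor on a datetime column. A sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: one order created just before 2016
toy = pd.DataFrame({
    "orderId": [1, 2, 3],
    "createTime": [
        "2015-12-31 23:50:00",
        "2016-01-01 00:05:00",
        "2016-06-18 10:00:00",
    ],
})
toy["createTime"] = pd.to_datetime(toy["createTime"])

# Number of orders per creation year
per_year = toy["createTime"].dt.year.value_counts()

# Keep only orders created in 2016 and renumber the rows
only_2016 = toy[toy["createTime"].dt.year == 2016].reset_index(drop=True)

print(per_year)
print(only_2016)
```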

# Delete records created before 2016 (a boolean mask avoids slicing an unsorted DatetimeIndex)
ds_trading_data = ds_trading_data[ds_trading_data["createTime"] >= pd.Timestamp("2016-01-01")]
ds_trading_data = ds_trading_data.reset_index(drop=True)

11. Deleting records whose payment time is earlier than the order time

ds_trading_data.drop(index=ds_trading_data[ds_trading_data["payTime"] < ds_trading_data["createTime"]].index, inplace=True)

12. Handling the productId and orderId columns

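# Delete records whose productId is 0
ds_trading_data.drop(index=ds_trading_data[ds_trading_data["productId"] == 0].index, inplace=True)
# Delete rows with a duplicated orderId (called on the DataFrame, since de-duplicating only the orderId Series would leave the DataFrame's rows in place)
ds_trading_data.drop_duplicates(subset="orderId", inplace=True)

13. Replacing the numbers in the deviceType column with the corresponding strings

# Assign the result back, since an in-place replace on a selected column may not update the DataFrame
ds_trading_data["deviceType"] = ds_trading_data["deviceType"].replace({1: "PC", 2: "Android", 3: "iPhone", 4: "Wap", 5: "Other", 6: "Other"})

The replacement strings come from a separate file. Because there are only a few of them, we did not load that file and instead wrote the mapping directly as a dictionary.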
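One thing a dictionary replacement can hide is a code that has no entry in the mapping. A sketch using map, which returns NaN for unmapped codes so they can be spotted. The code 9 and the `deviceName` column are hypothetical and used only for illustration:

```python
import pandas as pd

device_names = {1: "PC", 2: "Android", 3: "iPhone", 4: "Wap", 5: "Other", 6: "Other"}

# Hypothetical toy data; 9 is a made-up code with no entry in the mapping
toy = pd.DataFrame({"deviceType": [1, 2, 3, 6, 9]})

# map() yields NaN for codes missing from the dictionary
toy["deviceName"] = toy["deviceType"].map(device_names)

# Codes that could not be translated
unmapped = toy.loc[toy["deviceName"].isnull(), "deviceType"].tolist()

print(toy)
print(unmapped)
```

By contrast, replace leaves unmapped codes unchanged as numbers, so the column would end up with mixed types.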

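IV. Analysis and Visualization

Basic Analysis

1. Overall picture

print(ds_trading_data["orderId"].count())  # total number of orders
print(ds_trading_data["userId"].unique().size)  # total number of users. Note: unique() returns a numpy.ndarray, which has no count() method, so .size is used
print(ds_trading_data["productId"].unique().size)  # number of distinct products purchased
print(ds_trading_data["payMoney"].sum())  # total sales

Output: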
104329
102447
1000
9066639970

After the cleaning steps above, 104329 valid records remain. Over the whole of 2016, 102447 customers bought 1000 distinct products, for a total transaction value of 9066639970 yuan.
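The same four headline figures can be gathered in one place with nunique, which counts distinct values without going through a NumPy array. A sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data, amounts already in yuan
toy = pd.DataFrame({
    "orderId": [1, 2, 3, 4],
    "userId": [10, 10, 11, 12],
    "productId": [7, 8, 7, 7],
    "payMoney": [120.0, 35.5, 120.0, 0.0],
})

summary = {
    "orders": len(toy),                       # one row per order
    "users": toy["userId"].nunique(),         # distinct customers
    "products": toy["productId"].nunique(),   # distinct products purchased
    "revenue": toy["payMoney"].sum(),         # total amount paid
}

print(summary)
```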

2. Analysis by productId

# Sales volume per product
product_group_num = ds_trading_data.groupby(by = "productId").