1. Import the data
import pandas as pd
import os
import numpy as np
os.chdir(r"E:Python_learningdata_sciencetrain_05295Preprocessing")
camp = pd.read_csv('teleco_camp_orig.csv')
camp.head()
Out[1]:
ID Suc_flag ARPU PromCnt12 PromCnt36 PromCntMsg12 PromCntMsg36
0 12 1 50.0 6 10 2 3
1 53 0 NaN 5 9 1 4
2 67 1 25.0 6 11 2 4
3 71 1 80.0 7 10 2 4
4 142 1 15.0 6 11 2 4
Class Age Gender HomeOwner AvgARPU AvgHomeValue AvgIncome
0 4 57.0 M H 49.894904 33400 39460
1 3 55.0 M H 48.574742 37600 33545
2 1 57.0 F H 49.272646 100400 42091
3 1 52.0 F H 47.334953 39900 39313
4 1 NaN F U 47.827404 47500 0
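2. Inspect the distribution of an attribute and compute summary statistics
import matplotlib.pyplot as plt
# 'normed' was removed in Matplotlib 3.1; 'density' is its replacement
plt.hist(camp['AvgIncome'], bins=20, density=True)
camp['AvgIncome'].describe()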
Out[3]:
count 9686.000000
mean 40491.444249
std 28707.494146
min 0.000000
25% 24464.000000
50% 43100.000000
75% 56876.000000
max 200001.000000
Name: AvgIncome, dtype: float64
The histogram and the summary above reveal anomalous values: the minimum is 0, and an average income cannot be 0.
plt.hist(camp['AvgHomeValue'], bins=20, density=True)
camp['AvgHomeValue'].describe()
Out[4]:
count 9686.000000
mean 110986.299814
std 98670.855450
min 0.000000
25% 52300.000000
50% 76900.000000
75% 128175.000000
max 600000.000000
Name: AvgHomeValue, dtype: float64
AvgHomeValue also has a minimum of 0, which is likewise anomalous.
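Before replacing anything, it can help to count how many exact zeros each column holds. The following is a minimal sketch on a hypothetical miniature frame (`demo` and its values are illustrative, not taken from `teleco_camp_orig.csv`):

```python
import pandas as pd

# Hypothetical miniature of the campaign table; values are illustrative only.
demo = pd.DataFrame({
    "AvgIncome": [39460, 33545, 0, 39313, 0],
    "AvgHomeValue": [33400, 0, 100400, 39900, 47500],
})

# Number of exact zeros per column.
zero_counts = (demo == 0).sum()
# Share of rows that are zero per column.
zero_share = (demo == 0).mean()

print(zero_counts)
print(zero_share)
```

A large share of zeros in a column where zero is implausible suggests that 0 is being used as a placeholder for "unknown" rather than as a real measurement.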
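3. Handle the anomalous values
# Treat 0 as missing; np.nan is used because the np.NaN alias was removed in NumPy 2.0
camp['AvgIncome'] = camp['AvgIncome'].replace({0: np.nan})
# An explicit range keeps the NaN values from breaking the bin computation
plt.hist(camp['AvgIncome'], bins=20, density=True, range=(camp.AvgIncome.min(), camp.AvgIncome.max()))
camp['AvgIncome'].describe()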
Out[5]:
count 7329.000000
mean 53513.457361
std 19805.168339
min 2499.000000
25% 40389.000000
50% 48699.000000
75% 62385.000000
max 200001.000000
Name: AvgIncome, dtype: float64
Apply the same treatment to the other attribute
camp['AvgHomeValue'] = camp['AvgHomeValue'].replace({0: np.nan})
plt.hist(camp['AvgHomeValue'], bins=20, density=True, range=(camp.AvgHomeValue.min(), camp.AvgHomeValue.max()))
camp['AvgHomeValue'].describe()
Out[6]:
count 9583.000000
mean 112179.202755
std 98522.888583
min 7500.000000
25% 53200.000000
50% 77700.000000
75% 129350.000000
max 600000.000000
Name: AvgHomeValue, dtype: float64
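The non-missing counts in the summaries fall from 9686 to 7329 for AvgIncome and from 9686 to 9583 for AvgHomeValue, so 2357 and 103 zeros respectively were turned into missing values. When several columns share the "0 means unknown" convention, the replacement can be done in one step. This is a sketch on hypothetical data; the list `zero_means_missing` is an assumption introduced here, and columns where 0 is a legitimate value are deliberately left out of it:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; values are illustrative only.
demo = pd.DataFrame({
    "AvgIncome": [39460.0, 0.0, 42091.0, 0.0],
    "AvgHomeValue": [33400.0, 37600.0, 0.0, 39900.0],
    "PromCnt12": [6, 0, 6, 7],  # here 0 is treated as a legitimate count
})

# Only the columns where 0 cannot be a real value (assumed list).
zero_means_missing = ["AvgIncome", "AvgHomeValue"]

# Replace 0 with NaN in the selected columns only.
demo[zero_means_missing] = demo[zero_means_missing].replace(0, np.nan)

missing_counts = demo.isna().sum()
print(missing_counts)
```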
4. Remove duplicate records
camp.describe()
ID Suc_flag ARPU PromCnt12 PromCnt36
count 9686.000000 9686.000000 4843.000000 9686.000000 9686.000000
mean 97975.474086 0.500000 78.121722 3.495251 7.466963
std 56550.171120 0.500026 62.225686 1.270258 1.977909
min 12.000000 0.000000 5.000000 1.000000 1.000000
25% 48835.500000 0.000000 50.000000 3.000000 6.000000
50% 99106.000000 0.500000 65.000000 3.000000 8.000000
75% 148538.750000 1.000000 100.000000 4.000000 8.000000
max 191779.000000 1.000000 1000.000000 15.000000 20.000000
PromCntMsg12 PromCntMsg36 Class Age AvgARPU
count 9686.000000 9686.000000 9686.000000 7279.000000 9686.000000
mean 1.034586 2.323044 2.424530 49.567386 52.905156
std 0.244171 0.904083 1.049047 6.991306 4.993775
min 0.000000 0.000000 1.000000 16.000000 46.138968
25% 1.000000 1.000000 2.000000 45.000000 49.760116
50% 1.000000 3.000000 2.000000 50.000000 50.876672
75% 1.000000 3.000000 3.000000 55.000000 54.452822
max 4.000000 6.000000 4.000000 60.000000 99.444787
AvgHomeValue AvgIncome
count 9583.000000 7329.000000
mean 112179.202755 53513.457361
std 98522.888583 19805.168339
min 7500.000000 2499.000000
25% 53200.000000 40389.000000
50% 77700.000000 48699.000000
75% 129350.000000 62385.000000
max 600000.000000 200001.000000
Flag rows that are exact duplicates of an earlier row
camp['dup'] = camp.duplicated()
camp.dup.head()
Out[7]:
0 False
1 False
2 False
3 False
4 False
Name: dup, dtype: bool
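Split the data according to the flag
camp_dup = camp[camp['dup']]      # rows flagged as duplicates
camp_nodup = camp[~camp['dup']]   # rows kept after de-duplication
camp_nodup.head()
camp['dup1'] = camp['ID'].duplicated()  # flag duplicates by the primary key (ID) only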
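The difference between whole-row duplicates and key-only duplicates is easiest to see on a small example. The sketch below uses a hypothetical frame, and also shows `drop_duplicates`, which pandas provides as a shortcut for flagging and then filtering:

```python
import pandas as pd

# Hypothetical miniature frame; values are illustrative only.
demo = pd.DataFrame({
    "ID": [12, 53, 53, 67, 67],
    "ARPU": [50.0, 25.0, 25.0, 80.0, 15.0],
})

# True where the whole row repeats an earlier row.
full_dups = demo.duplicated()
# True where only the ID repeats an earlier ID.
id_dups = demo["ID"].duplicated()

# Drop exact duplicate rows, keeping the first occurrence.
dedup_rows = demo.drop_duplicates()
# Keep one row per ID, keeping the first occurrence.
dedup_ids = demo.drop_duplicates(subset="ID", keep="first")

print(full_dups.tolist())  # [False, False, True, False, False]
print(id_dups.tolist())    # [False, False, True, False, True]
```

The two rows with ID 67 differ in ARPU, so only the key-based check flags the second one. Which of the two rows is the correct one is a data question that de-duplication alone cannot answer.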
5. Fill in missing values
Start with the Age attribute:
camp['Age'].describe()
Out[9]:
count 7279.000000
mean 49.567386
std 6.991306
min 16.000000
25% 45.000000
50% 50.000000
75% 55.000000
max 60.000000
Name: Age, dtype: float64
Compute the mean of Age (missing values are skipped)
vmean = camp['Age'].mean(axis=0, skipna=True)
vmean
Out[10]: 49.56738562989422
Flag the rows where Age is missing, so the information survives the fill
camp['Age_empflag'] = camp['Age'].isnull()
camp.head()
Out[11]:
ID Suc_flag ARPU PromCnt12 PromCnt36 PromCntMsg12 PromCntMsg36
0 12 1 50.0 6 10 2 3
1 53 0 NaN 5 9 1 4
2 67 1 25.0 6 11 2 4
3 71 1 80.0 7 10 2 4
4 142 1 15.0 6 11 2 4
Class Age Gender HomeOwner AvgARPU AvgHomeValue AvgIncome
0 4 57.0 M H 49.894904 33400.0 39460.0
1 3 55.0 M H 48.574742 37600.0 33545.0
2 1 57.0 F H 49.272646 100400.0 42091.0
3 1 52.0 F H 47.334953 39900.0 39313.0
4 1 NaN F U 47.827404 47500.0 NaN
Age_empflag
0 False
1 False
2 False
3 False
4 True
Fill the missing Age values with the mean
camp['Age']= camp['Age'].fillna(vmean)
camp['Age'].describe()
Out[12]:
count 9686.000000
mean 49.567386
std 6.060585
min 16.000000
25% 47.000000
50% 49.567386
75% 54.000000
max 60.000000
Name: Age, dtype: float64
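Comparing the two Age summaries, the mean is unchanged at 49.567386 while the standard deviation falls from 6.991306 to 6.060585. That is the expected side effect of mean imputation: every filled value sits exactly at the mean, so the spread shrinks. A small sketch on hypothetical values shows the same effect:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries; values are illustrative only.
age = pd.Series([57.0, 55.0, np.nan, 52.0, np.nan, 40.0])

# Fill missing entries with the mean of the observed values.
filled = age.fillna(age.mean())

print(age.mean(), filled.mean())  # the mean is preserved
print(age.std(), filled.std())    # the standard deviation decreases
```

This is one reason for keeping the `_empflag` column: it records which values were imputed, so later analysis can tell them apart from observed ones.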
Apply the same treatment to AvgHomeValue, and then to AvgIncome
vmean = camp['AvgHomeValue'].mean(axis=0, skipna=True)
camp['AvgHomeValue_empflag'] = camp['AvgHomeValue'].isnull()
camp['AvgHomeValue']= camp['AvgHomeValue'].fillna(vmean)
camp['AvgHomeValue'].describe()
Out[13]:
count 9686.000000
mean 112179.202755
std 97997.592632
min 7500.000000
25% 53500.000000
50% 78450.000000
75% 128175.000000
max 600000.000000
Name: AvgHomeValue, dtype: float64
vmean = camp['AvgIncome'].mean(axis=0, skipna=True)
camp['AvgIncome_empflag'] = camp['AvgIncome'].isnull()
camp['AvgIncome']= camp['AvgIncome'].fillna(vmean)
camp['AvgIncome'].describe()
Out[14]:
count 9686.000000
mean 53513.457361
std 17227.468161
min 2499.000000
25% 42775.000000
50% 53513.457361
75% 56876.000000
max 200001.000000
Name: AvgIncome, dtype: float64
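The flag-then-fill sequence is repeated for Age, AvgHomeValue and AvgIncome, so it could be wrapped in a small helper. The function name `flag_and_fill_mean` and the miniature frame are hypothetical, introduced here only for illustration:

```python
import numpy as np
import pandas as pd


def flag_and_fill_mean(df, column):
    """Return a copy with '<column>_empflag' added and NaN in `column` filled with its mean."""
    out = df.copy()
    # Record which rows were missing before filling.
    out[f"{column}_empflag"] = out[column].isnull()
    # Series.mean() skips NaN by default.
    out[column] = out[column].fillna(out[column].mean())
    return out


# Hypothetical miniature frame; values are illustrative only.
demo = pd.DataFrame({
    "Age": [57.0, np.nan, 52.0],
    "AvgIncome": [39460.0, 33545.0, np.nan],
})

for col in ["Age", "AvgIncome"]:
    demo = flag_and_fill_mean(demo, col)

print(demo)
```

Returning a copy rather than modifying the input keeps the original frame available for comparison before and after imputation.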