首页 > 分享 > python 清洗数据

python 清洗数据

1、导入数据

import pandas as pd

import os

import numpy as np

os.chdir(r"E:Python_learningdata_sciencetrain_05295Preprocessing")

camp = pd.read_csv('teleco_camp_orig.csv')

camp.head()

Out[1]:

ID Suc_flag ARPU PromCnt12 PromCnt36 PromCntMsg12 PromCntMsg36

0 12 1 50.0 6 10 2 3

1 53 0 NaN 5 9 1 4

2 67 1 25.0 6 11 2 4

3 71 1 80.0 7 10 2 4

4 142 1 15.0 6 11 2 4

Class Age Gender HomeOwner AvgARPU AvgHomeValue AvgIncome

0 4 57.0 M H 49.894904 33400 39460

1 3 55.0 M H 48.574742 37600 33545

2 1 57.0 F H 49.272646 100400 42091

3 1 52.0 F H 47.334953 39900 39313

4 1 NaN F U 47.827404 47500 0

2、查看数据某属性分布情况,并进行统计描述

import matplotlib.pyplot as plt

plt.hist(camp['AvgIncome'], bins=20, normed=True)

camp['AvgIncome'].describe(include='all')

Out[3]:

count 9686.000000

mean 40491.444249

std 28707.494146

min 0.000000

25% 24464.000000

50% 43100.000000

75% 56876.000000

max 200001.000000

Name: AvgIncome, dtype: float64

由上图可以看出异常值(平均收入不可能为0)

plt.hist(camp['AvgHomeValue'], bins=20, normed=True)

camp['AvgHomeValue'].describe(include='all')

Out[4]:

count 9686.000000

mean 110986.299814

std 98670.855450

min 0.000000

25% 52300.000000

50% 76900.000000

75% 128175.000000

max 600000.000000

Name: AvgHomeValue, dtype: float64

同样也出现了0,为异常值

3、对异常值进行处理

camp['AvgIncome']=camp['AvgIncome'].replace({0: np.NaN})

plt.hist(camp['AvgIncome'], bins=20, normed=True,range=(camp.AvgIncome.min(),camp.AvgIncome.max()))

camp['AvgIncome'].describe(include='all')

Out[5]:

count 7329.000000

mean 53513.457361

std 19805.168339

min 2499.000000

25% 40389.000000

50% 48699.000000

75% 62385.000000

max 200001.000000

Name: AvgIncome, dtype: float64

同理对另一个属性进行处理

camp['AvgHomeValue']=camp['AvgHomeValue'].replace({0: np.NaN})

plt.hist(camp['AvgHomeValue'], bins=20, normed=True,range=(camp.AvgHomeValue.min(),camp.AvgHomeValue.max()))

camp['AvgHomeValue'].describe(include='all')

Out[6]:

count 9583.000000

mean 112179.202755

std 98522.888583

min 7500.000000

25% 53200.000000

50% 77700.000000

75% 129350.000000

max 600000.000000

Name: AvgHomeValue, dtype: float64

4、对各属性值去重

camp.describe()

ID Suc_flag ARPU PromCnt12 PromCnt36

count 9686.000000 9686.000000 4843.000000 9686.000000 9686.000000

mean 97975.474086 0.500000 78.121722 3.495251 7.466963

std 56550.171120 0.500026 62.225686 1.270258 1.977909

min 12.000000 0.000000 5.000000 1.000000 1.000000

25% 48835.500000 0.000000 50.000000 3.000000 6.000000

50% 99106.000000 0.500000 65.000000 3.000000 8.000000

75% 148538.750000 1.000000 100.000000 4.000000 8.000000

max 191779.000000 1.000000 1000.000000 15.000000 20.000000

PromCntMsg12 PromCntMsg36 Class Age AvgARPU

count 9686.000000 9686.000000 9686.000000 7279.000000 9686.000000

mean 1.034586 2.323044 2.424530 49.567386 52.905156

std 0.244171 0.904083 1.049047 6.991306 4.993775

min 0.000000 0.000000 1.000000 16.000000 46.138968

25% 1.000000 1.000000 2.000000 45.000000 49.760116

50% 1.000000 3.000000 2.000000 50.000000 50.876672

75% 1.000000 3.000000 3.000000 55.000000 54.452822

max 4.000000 6.000000 4.000000 60.000000 99.444787

AvgHomeValue AvgIncome

count 9583.000000 7329.000000

mean 112179.202755 53513.457361

std 98522.888583 19805.168339

min 7500.000000 2499.000000

25% 53200.000000 40389.000000

50% 77700.000000 48699.000000

75% 129350.000000 62385.000000

max 600000.000000 200001.000000

 对重复值打标签

camp['dup'] = camp.duplicated()

camp.dup.head()

Out[7]:

0 False

1 False

2 False

3 False

4 False

Name: dup, dtype: bool

    根据标签提取数据

camp_dup = camp[camp['dup'] == True]

camp_nodup = camp[camp['dup'] == False]

camp_nodup.head()

camp['dup1'] = camp['ID'].duplicated() # 按照主键进行重复记录标识

5、对缺失值进行填充

现对属性age进行处理:

camp['Age'].describe()

Out[9]:

count 7279.000000

mean 49.567386

std 6.991306

min 16.000000

25% 45.000000

50% 50.000000

75% 55.000000

max 60.000000

Name: Age, dtype: float64

计算age的均值

vmean = camp['Age'].mean(axis=0, skipna=True)

vmean

Out[10]: 49.56738562989422

对age缺失值打标签

camp['Age_empflag'] = camp['Age'].isnull()

camp.head()

Out[11]:

ID Suc_flag ARPU PromCnt12 PromCnt36 PromCntMsg12 PromCntMsg36

0 12 1 50.0 6 10 2 3

1 53 0 NaN 5 9 1 4

2 67 1 25.0 6 11 2 4

3 71 1 80.0 7 10 2 4

4 142 1 15.0 6 11 2 4

Class Age Gender HomeOwner AvgARPU AvgHomeValue AvgIncome

0 4 57.0 M H 49.894904 33400.0 39460.0

1 3 55.0 M H 48.574742 37600.0 33545.0

2 1 57.0 F H 49.272646 100400.0 42091.0

3 1 52.0 F H 47.334953 39900.0 39313.0

4 1 NaN F U 47.827404 47500.0 NaN

Age_empflag

0 False

1 False

2 False

3 False

4 True

对age中的缺失值用均值填充

camp['Age']= camp['Age'].fillna(vmean)

camp['Age'].describe()

Out[12]:

count 9686.000000

mean 49.567386

std 6.060585

min 16.000000

25% 47.000000

50% 49.567386

75% 54.000000

max 60.000000

Name: Age, dtype: float64

同理对AvgHomeValue属性进行处理

vmean = camp['AvgHomeValue'].mean(axis=0, skipna=True)

camp['AvgHomeValue_empflag'] = camp['AvgHomeValue'].isnull()

camp['AvgHomeValue']= camp['AvgHomeValue'].fillna(vmean)

camp['AvgHomeValue'].describe()

Out[13]:

count 9686.000000

mean 112179.202755

std 97997.592632

min 7500.000000

25% 53500.000000

50% 78450.000000

75% 128175.000000

max 600000.000000

Name: AvgHomeValue, dtype: float64

vmean = camp['AvgIncome'].mean(axis=0, skipna=True)

camp['AvgIncome_empflag'] = camp['AvgIncome'].isnull()

camp['AvgIncome']= camp['AvgIncome'].fillna(vmean)

camp['AvgIncome'].describe()

Out[14]:

count 9686.000000

mean 53513.457361

std 17227.468161

min 2499.000000

25% 42775.000000

50% 53513.457361

75% 56876.000000

max 200001.000000

Name: AvgIncome, dtype: float64

相关知识

基于Python爬虫的电商网站彩妆数据的分析与研究
鸢尾花Python数据分析
python 怎么加载鸢尾花数据
高效农作物病虫害识别:Python项目源码及数据集教程
python 鸢尾花数据集下载
python的鸢尾花数据如何导入
数据分析(Python)入门—鸢尾植物数据集处理
Python数据分析项目实例2:使用seaborn分析鸢尾花(Iris)数据集
iris数据
如何在Python中使用Pandas库进行季节性调整?

网址: python 清洗数据 https://m.huajiangbk.com/newsview996865.html

所属分类:花卉
上一篇: 奶油草莓怎么种
下一篇: @阿秋不啊啾