sklearn Study Notes 2: Data Preprocessing

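(1) Standardization (Mean Removal and Variance Scaling)

After standardization, each feature (column) of the data has zero mean and unit variance.

The scale function offers a convenient way to perform this transformation: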

>>> from sklearn import preprocessing
>>> X = [[1., -1., 2.],
...      [2., 0., 0.],
...      [0., 1., -1.]]
>>> X_scaled = preprocessing.scale(X)
>>> X_scaled
array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])
>>> X_scaled.mean(axis=0)
array([ 0.,  0.,  0.])
>>> X_scaled.std(axis=0)
array([ 1.,  1.,  1.])
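To make the arithmetic behind scale explicit, here is a minimal NumPy sketch of the same computation on the sample matrix above. It assumes column-wise statistics and the population standard deviation (ddof=0); the name X_manual is only illustrative.

```python
import numpy as np

# Same sample matrix as above: rows are samples, columns are features.
X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Per-feature (column-wise) mean and population standard deviation (ddof=0).
mean = X.mean(axis=0)
std = X.std(axis=0)

# Subtract the mean and divide by the standard deviation, column by column.
X_manual = (X - mean) / std

print(X_manual)
print(X_manual.mean(axis=0), X_manual.std(axis=0))
```

A constant column would have a standard deviation of zero, so a real implementation needs a guard against dividing by zero; this sketch leaves that out.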

The same result can be obtained with the Scaler utility class from the preprocessing module (named StandardScaler in versions after 0.15):

>>> scaler = preprocessing.StandardScaler().fit(X)
>>> scaler
StandardScaler(copy=True, with_mean=True, with_std=True)
>>> scaler.mean_
array([ 1.        ,  0.        ,  0.33333333])
>>> scaler.std_  # newer scikit-learn releases expose this as scaler.scale_
array([ 0.81649658,  0.81649658,  1.24721913])
>>> scaler.transform(X)
array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])
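One practical reason to prefer the class over the scale function is that a fitted scaler remembers the mean and standard deviation, so the same transformation can be applied to data it has not seen. The sketch below illustrates this; the sample X_new is a made-up value, not part of the original notes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Data used to learn the per-feature mean and standard deviation.
X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# Hypothetical unseen sample (an assumption for illustration only).
X_new = np.array([[-1., 1., 0.]])

# fit() stores the statistics of X_train; transform() reuses them on X_new.
scaler = StandardScaler().fit(X_train)
X_new_scaled = scaler.transform(X_new)

print(X_new_scaled)
```

The new sample is scaled with the training statistics, so its values are not expected to have zero mean or unit variance on their own.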

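(2) Normalization
Normalization rescales each sample (row) in the data set individually so that it has unit norm. As a result, every value lies within [-1, 1].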

>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> X_normalized = preprocessing.normalize(X, norm='l2')
>>> X_normalized
array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])
>>> normalizer = preprocessing.Normalizer().fit(X)
>>> normalizer
Normalizer(copy=True, norm='l2')
>>> normalizer.transform(X)
array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])
>>> normalizer.transform([[-1., 1., 0.]])
array([[-0.70710678,  0.70710678,  0.        ]])
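The L2 normalization shown above can be sketched by hand with NumPy: divide each row by its own Euclidean norm. The variable names and the zero-row guard are assumptions of this sketch.

```python
import numpy as np

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Euclidean (L2) norm of every row, kept as a column so it broadcasts.
norms = np.linalg.norm(X, axis=1, keepdims=True)

# Guard against all-zero rows so the division below never divides by zero.
norms[norms == 0.0] = 1.0

# Each row now has unit length.
X_l2 = X / norms

print(X_l2)
```

Unlike standardization, this works row by row and needs no statistics from the rest of the data set, which is why fitting the Normalizer does not learn anything.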

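(3) Binarization
Binarization converts numeric data into two-valued (boolean) data according to a configurable threshold: values greater than the threshold become 1, and all other values become 0.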

>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> binarizer = preprocessing.Binarizer().fit(X)
>>> binarizer
Binarizer(copy=True, threshold=0.0)
>>> binarizer.transform(X)
array([[ 1.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.]])
>>> binarizer = preprocessing.Binarizer(threshold=1.1)
>>> binarizer.transform(X)
array([[ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  0.]])
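The thresholding rule can be sketched in a single NumPy comparison. The helper binarize_sketch below is a hypothetical name used only for illustration and is not part of scikit-learn.

```python
import numpy as np

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

def binarize_sketch(data, threshold=0.0):
    """Hypothetical helper: 1.0 where data > threshold, otherwise 0.0."""
    return (np.asarray(data) > threshold).astype(float)

print(binarize_sketch(X))                 # default threshold of 0.0
print(binarize_sketch(X, threshold=1.1))  # only values above 1.1 map to 1
```

Note that the comparison is strict, so a value exactly equal to the threshold maps to 0.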

(4) Label preprocessing

4.1) Label binarization

LabelBinarizer is typically used to create a label indicator matrix from a list of multi-class labels:

>>> lb = preprocessing.LabelBinarizer()
>>> lb.fit([1, 2, 6, 4, 2])
LabelBinarizer(neg_label=0, pos_label=1)
>>> lb.classes_
array([1, 2, 4, 6])
>>> lb.transform([1, 6])
array([[1, 0, 0, 0],
       [0, 0, 0, 1]])
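The indicator matrix can also be mapped back to the original labels with inverse_transform. The following is a small sketch of that round trip, reusing the labels from the example above; the variable names are illustrative.

```python
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
lb.fit([1, 2, 6, 4, 2])

# One row per input label, one column per class in lb.classes_.
indicator = lb.transform([1, 6])

# Recover the original labels from the indicator rows.
recovered = lb.inverse_transform(indicator)

print(indicator)
print(recovered)
```

Each column of the matrix corresponds to one entry of lb.classes_, in sorted order.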

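In the example above each instance has exactly one label. LabelBinarizer also supports instances that carry several labels (newer scikit-learn releases dropped this input form and provide MultiLabelBinarizer for it instead):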

>>> lb.fit_transform([(1, 2), (3,)])
array([[1, 1, 0],
       [0, 0, 1]])
>>> lb.classes_
array([1, 2, 3])

4.2) Label encoding

LabelEncoder maps labels to integers between 0 and n_classes-1:

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6])
array([0, 0, 1, 2])
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])

It can also be used to convert non-numeric labels into numeric labels:

>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1])
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']
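To see where the integer codes come from, the encoding can be approximated with NumPy alone: the classes are the sorted unique labels, and each code is the position of a label in that sorted array. This is a sketch of the idea, not a description of how scikit-learn implements it internally.

```python
import numpy as np

labels = ["paris", "paris", "tokyo", "amsterdam"]

# classes: sorted unique labels; codes: index of each label within classes.
classes, codes = np.unique(labels, return_inverse=True)

# Indexing classes with the codes reverses the encoding.
decoded = classes[codes]

print(list(classes))
print(codes)
print(list(decoded))
```

Because the classes are sorted, "amsterdam" gets code 0 even though it appears last in the input.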
