
Machine Learning Classification Algorithms: SVM, Logistic Regression, KNN

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Load the data

iris = pd.read_csv('E:/练习/Iris.csv')

iris.head()

   id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm  Species
0   1            5.1           3.5            1.4           0.2        0
1   2            4.9           3.0            1.4           0.2        0
2   3            4.7           3.2            1.3           0.2        0
3   4            4.6           3.1            1.5           0.2        0
4   5            5.0           3.6            1.4           0.2        0

Print a summary of the DataFrame, including its memory usage

iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   id             150 non-null    int64
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    int64
dtypes: float64(4), int64(2)
memory usage: 7.2 KB

# Drop the id column in place
iris.drop('id',axis=1,inplace=True)

iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   SepalLengthCm  150 non-null    float64
 1   SepalWidthCm   150 non-null    float64
 2   PetalLengthCm  150 non-null    float64
 3   PetalWidthCm   150 non-null    float64
 4   Species        150 non-null    int64
dtypes: float64(4), int64(1)
memory usage: 6.0 KB

# Draw the three species on one set of axes, one colour per species
fig = iris[iris.Species==0].plot(kind='scatter',x='SepalLengthCm',y='SepalWidthCm',color='orange', label='Setosa')
iris[iris.Species==1].plot(kind='scatter',x='SepalLengthCm',y='SepalWidthCm',color='blue', label='versicolor',ax=fig)
iris[iris.Species==2].plot(kind='scatter',x='SepalLengthCm',y='SepalWidthCm',color='green', label='virginica', ax=fig)
fig.set_xlabel("Sepal Length")
fig.set_ylabel("Sepal Width")
fig.set_title("Sepal Length VS Width")
fig=plt.gcf()
fig.set_size_inches(10,6)
plt.show()

[Figure: scatter plot of sepal length vs. sepal width, colored by species]

The figure above shows the relationship between sepal length and sepal width. Next we look at the relationship between petal length and petal width.

fig = iris[iris.Species==0].plot.scatter(x='PetalLengthCm',y='PetalWidthCm',color='orange', label='Setosa')
iris[iris.Species==1].plot.scatter(x='PetalLengthCm',y='PetalWidthCm',color='blue', label='versicolor',ax=fig)
iris[iris.Species==2].plot.scatter(x='PetalLengthCm',y='PetalWidthCm',color='green', label='virginica', ax=fig)
fig.set_xlabel("Petal Length")
fig.set_ylabel("Petal Width")
fig.set_title(" Petal Length VS Width")
fig=plt.gcf()
fig.set_size_inches(10,6)
plt.show()

[Figure: scatter plot of petal length vs. petal width, colored by species]

As we can see, the petal features separate the species into much cleaner clusters than the sepal features do. This suggests that the petal features will give better, more accurate predictions than the sepal features. We will check this later.

How the lengths and widths are distributed

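iris.hist(edgecolor='black', linewidth=1.2)
fig=plt.gcf()
fig.set_size_inches(12,6)
plt.show()

[Figure: histograms of each column in the dataset]

How length and width vary by species

seaborn.violinplot -- draws a violin plot
A violin plot combines a box plot with a kernel density plot, showing both the spread and the density of the data for each species. Narrower sections indicate lower data density; wider sections indicate higher density.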

plt.figure(figsize=(15,10))
plt.subplot(2,2,1)
sns.violinplot(x='Species',y='PetalLengthCm',data=iris)
plt.subplot(2,2,2)
sns.violinplot(x='Species',y='PetalWidthCm',data=iris)
plt.subplot(2,2,3)
sns.violinplot(x='Species',y='SepalLengthCm',data=iris)
plt.subplot(2,2,4)
sns.violinplot(x='Species',y='SepalWidthCm',data=iris)

<AxesSubplot:xlabel='Species', ylabel='SepalWidthCm'>

[Figure: violin plots of petal length, petal width, sepal length and sepal width by species]
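To make the "box plot plus kernel density" description concrete, the sketch below computes both ingredients numerically for petal length. It is only an illustration: it assumes scikit-learn's bundled iris data as a stand-in for the local Iris.csv, and uses scipy's gaussian_kde, which is not necessarily the estimator seaborn uses internally.

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.datasets import load_iris

# Assumption: the bundled iris data stands in for the article's Iris.csv.
data = load_iris()
petal_length = data.data[:, 2]   # column 2 is petal length (cm)
species = data.target            # 0 = setosa, 1 = versicolor, 2 = virginica

summary = {}
for label in np.unique(species):
    values = petal_length[species == label]
    # Box-plot ingredient: the three quartiles.
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    # Density ingredient: a Gaussian kernel density estimate on a grid.
    grid = np.linspace(values.min(), values.max(), 50)
    density = gaussian_kde(values)(grid)
    summary[int(label)] = {
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        # Where the violin would be widest (highest estimated density).
        "widest_at": float(grid[np.argmax(density)]),
    }

for label, stats in summary.items():
    print(label, stats)
```

The quartiles correspond to the inner box of each violin, and the density values correspond to its width at each height.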

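Classification algorithms

sklearn.linear_model -- logistic regression
sklearn.model_selection.train_test_split -- randomly splits the dataset into a training set and a test set
sklearn.neighbors.KNeighborsClassifier -- K-nearest neighbours
sklearn.svm -- support vector machines
sklearn.metrics.accuracy_score -- checks the accuracy of a model
sklearn.tree.DecisionTreeClassifier -- decision tree

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

pandas.DataFrame.shape -- returns the shape of the DataFrame

iris.shape

(150, 5)

When we train any algorithm, the number of features and how they correlate play an important role. If many of the features are highly correlated, training on all of them can reduce accuracy, so features should be selected carefully. This dataset has few features, but we will still look at their correlation.

pandas.DataFrame.corr -- computes the correlation coefficients
seaborn.heatmap -- draws a heatmap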

plt.figure(figsize=(7,4))
sns.heatmap(iris.corr(),annot=True,cmap='cubehelix_r')
plt.show()

[Figure: heatmap of the correlation matrix]
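The two correlations that matter most for the rest of the article can also be read off numerically. The following is a minimal sketch, assuming scikit-learn's bundled iris data in place of the local Iris.csv (the column order there is sepal length, sepal width, petal length, petal width).

```python
import numpy as np
from sklearn.datasets import load_iris

# Assumption: the bundled iris data stands in for the article's Iris.csv.
X = load_iris().data

# Pearson correlation matrix of the four feature columns.
corr = np.corrcoef(X, rowvar=False)

sepal_corr = corr[0, 1]   # sepal length vs. sepal width
petal_corr = corr[2, 3]   # petal length vs. petal width

print(f"sepal length/width correlation: {sepal_corr:.2f}")
print(f"petal length/width correlation: {petal_corr:.2f}")
```

The sepal pair is only weakly correlated while the petal pair is strongly correlated, which is the pattern the heatmap shows.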

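Observations

Sepal width and length are not correlated, while petal width and length are highly correlated.

We will first use all the features to train the algorithms and check their accuracy. Then we will use one petal feature and one sepal feature to check the accuracy, since that way we use only two uncorrelated features. This keeps variance in the dataset, which may help accuracy. We will check this later.

Steps for training an algorithm

1. Split the dataset into a training set and a test set. The test set is usually smaller than the training set, so that more data is left for training the model.
2. Choose an algorithm according to the problem (classification or regression).
3. Pass the training set to the algorithm to train it, using the .fit() method.
4. Pass the test data to the trained algorithm to predict the results, using the .predict() method.
5. Compare the predictions with the true values to get the accuracy of the algorithm.

Split the data into training and test sets
test_size = 0.3 puts 30% of the data in the test set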

train, test = train_test_split(iris, test_size = 0.3)
print(train.shape)
print(test.shape)

(105, 5)
(45, 5)

Get the features of training set X (['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']), the true labels of training set y, the features of test set X, and the true labels of test set y

train_X = train[['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']]
train_y=train.Species# output of our training data
test_X= test[['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']]
test_y =test.Species

Check the training and test sets

train_X.head(2)

    SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
72            6.3           2.5            4.9           1.5
39            5.1           3.4            1.5           0.2

test_X.head(2)

     SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
143            6.8           3.2            5.9           2.3
40             5.0           3.5            1.3           0.3

Output values of the training set (the classes labelled in the original table)

train_y.head()

72     1
39     0
21     0
109    2
106    2
Name: Species, dtype: int64

Support Vector Machine (SVM)

sklearn.svm.SVC -- the SVC algorithm
sklearn.svm.SVC.fit -- trains the algorithm on the training set
sklearn.svm.SVC.predict -- takes the test set and returns the predicted values
sklearn.metrics.accuracy_score -- compares the predicted values with the true values and returns the accuracy

model = svm.SVC()
model.fit(train_X,train_y)
prediction=model.predict(test_X)
print('The accuracy of the SVM is:',metrics.accuracy_score(prediction,test_y))

The accuracy of the SVM is: 0.9777777777777777

The SVM achieves very good accuracy. We will go on to check the accuracy of other models, following the same steps as above to train each of them.

Logistic Regression

model = LogisticRegression()
model.fit(train_X,train_y)
prediction=model.predict(test_X)
print('The accuracy of the Logistic Regression is',metrics.accuracy_score(prediction,test_y))

The accuracy of the Logistic Regression is 0.9777777777777777

D:\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Decision Tree

model=DecisionTreeClassifier()
model.fit(train_X,train_y)
prediction=model.predict(test_X)
print('The accuracy of the Decision Tree is',metrics.accuracy_score(prediction,test_y))

The accuracy of the Decision Tree is 0.9777777777777777

K-Means clustering (KMeans)

from sklearn.cluster import KMeans

model = KMeans(n_clusters=3)
model.fit(train_X,train_y)
prediction=model.predict(test_X)
print('The accuracy of the KMeans is',metrics.accuracy_score(prediction,test_y))

x0 = (train_X,train_y)[prediction == 0]
x1 = (train_X,train_y)[prediction == 1]
x2 = (train_X,train_y)[prediction == 2]
plt.scatter(x0[:, 0], x0[:, 1], c = "red", marker='o', label='label0')
plt.scatter(x1[:, 0], x1[:, 1], c = "green", marker='*', label='label1')
plt.scatter(x2[:, 0], x2[:, 1], c = "blue", marker='+', label='label2')
plt.xlabel('petal length')
plt.ylabel('petal width')
plt.legend(loc=2)
plt.show()

The accuracy of the KMeans is 0.3111111111111111
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-156-38728399d491> in <module>
      4 print('The accuracy of the KMeans is',metrics.accuracy_score(prediction,test_y))
      5 
----> 6 x0 = (train_X,train_y)[prediction == 0]
      7 x1 = (train_X,train_y)[prediction == 1]
      8 x2 = (train_X,train_y)[prediction == 2]

TypeError: only integer scalar arrays can be converted to a scalar index

The TypeError is raised because (train_X,train_y) is a Python tuple, and a tuple cannot be indexed with a boolean array. The mask also has one entry per test row, so it could only select rows of test_X, not of the training data. Note as well that KMeans is an unsupervised clustering algorithm rather than a classifier: it ignores train_y, and the cluster numbers 0 to 2 are assigned arbitrarily, so comparing them directly with the species labels through accuracy_score does not give a meaningful accuracy.
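One way the failing step could be written so that it runs is sketched below. It is a sketch under stated assumptions, not the author's original code: it uses scikit-learn's bundled iris data instead of Iris.csv, fixes random_state so the run is repeatable, and builds the boolean masks against test_X, the array the cluster predictions were made on.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled iris data stands in for the article's Iris.csv.
data = load_iris()
train_X, test_X, train_y, test_y = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0
)

# KMeans is unsupervised, so no labels are passed to fit().
model = KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(train_X)
prediction = model.predict(test_X)   # one cluster id per test row

# Boolean masks index the NumPy array the predictions belong to (test_X).
groups = {k: test_X[prediction == k] for k in range(3)}

for k, points in groups.items():
    print(f"cluster {k}: {len(points)} test points")
```

Each groups[k] is a plain NumPy array, so groups[k][:, 2] and groups[k][:, 3] (petal length and petal width) can be passed to plt.scatter to draw the plot the original code was aiming for.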


K-Nearest Neighbours (KNN)
n_neighbors=3: the class is decided by looking at the 3 nearest points

model=KNeighborsClassifier(n_neighbors=3)
model.fit(train_X,train_y)
prediction=model.predict(test_X)
print('The accuracy of the KNN is',metrics.accuracy_score(prediction,test_y))

The accuracy of the KNN is 0.9777777777777777

Check how the accuracy of KNN changes for different values of n_neighbors (the class is decided by majority vote among the neighbours)

a_index=list(range(1,11))
a=pd.Series()
x=[1,2,3,4,5,6,7,8,9,10]
for i in list(range(1,11)):
    model=KNeighborsClassifier(n_neighbors=i)
    model.fit(train_X,train_y)
    prediction=model.predict(test_X)
    a=a.append(pd.Series(metrics.accuracy_score(prediction,test_y)))  # Series.append was removed in pandas 2.0; use pd.concat there
plt.plot(a_index, a)
plt.xticks(x)

<ipython-input-134-4f8d635a95d7>:2: DeprecationWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
  a=pd.Series()

([<matplotlib.axis.XTick at 0x231ebade3d0>,
  <matplotlib.axis.XTick at 0x231ea5e09d0>,
  <matplotlib.axis.XTick at 0x231ece293d0>,
  <matplotlib.axis.XTick at 0x231ed283eb0>,
  <matplotlib.axis.XTick at 0x231ed28f400>,
  <matplotlib.axis.XTick at 0x231ed283970>,
  <matplotlib.axis.XTick at 0x231ed28fac0>,
  <matplotlib.axis.XTick at 0x231ed28ffd0>,
  <matplotlib.axis.XTick at 0x231ed295520>,
  <matplotlib.axis.XTick at 0x231ed295a30>],
 [Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, '')])

[Figure: KNN accuracy for n_neighbors from 1 to 10]
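On current pandas versions the loop above no longer runs, because Series.append was removed in pandas 2.0. A plain Python list does the same job without pandas. The sketch below assumes scikit-learn's bundled iris data instead of Iris.csv and a fixed random_state, so its numbers will not match the run shown above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled iris data stands in for the article's Iris.csv.
data = load_iris()
train_X, test_X, train_y, test_y = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0
)

k_values = list(range(1, 11))
accuracies = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(train_X, train_y)
    # accuracy_score takes the true labels first, then the predictions.
    accuracies.append(accuracy_score(test_y, model.predict(test_X)))

for k, acc in zip(k_values, accuracies):
    print(k, round(acc, 3))
```

Calling plt.plot(k_values, accuracies) followed by plt.xticks(k_values) would draw the same kind of curve as the figure above.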

We used all of the iris features in the models above. Now we will use the petal features and the sepal features separately.

Create the petal and sepal training data
Sepals: length and width have very low correlation
Petals: length and width have very high correlation

petal=iris[['PetalLengthCm','PetalWidthCm','Species']]
sepal=iris[['SepalLengthCm','SepalWidthCm','Species']]

train_p,test_p=train_test_split(petal,test_size=0.3,random_state=0) #petals
train_x_p=train_p[['PetalWidthCm','PetalLengthCm']]
train_y_p=train_p.Species
test_x_p=test_p[['PetalWidthCm','PetalLengthCm']]
test_y_p=test_p.Species

train_s,test_s=train_test_split(sepal,test_size=0.3,random_state=0) #Sepal
train_x_s=train_s[['SepalWidthCm','SepalLengthCm']]
train_y_s=train_s.Species
test_x_s=test_s[['SepalWidthCm','SepalLengthCm']]
test_y_s=test_s.Species

SVM

model=svm.SVC()
model.fit(train_x_p,train_y_p)
prediction=model.predict(test_x_p)
print('The accuracy of the SVM using Petals is:',metrics.accuracy_score(prediction,test_y_p))

model=svm.SVC()
model.fit(train_x_s,train_y_s)
prediction=model.predict(test_x_s)
print('The accuracy of the SVM using Sepal is:',metrics.accuracy_score(prediction,test_y_s))

The accuracy of the SVM using Petals is: 0.9777777777777777
The accuracy of the SVM using Sepal is: 0.8

Logistic Regression

model = LogisticRegression()
model.fit(train_x_p,train_y_p)
prediction=model.predict(test_x_p)
print('The accuracy of the Logistic Regression using Petals is:',metrics.accuracy_score(prediction,test_y_p))

model.fit(train_x_s,train_y_s)
prediction=model.predict(test_x_s)
print('The accuracy of the Logistic Regression using Sepals is:',metrics.accuracy_score(prediction,test_y_s))

The accuracy of the Logistic Regression using Petals is: 0.9777777777777777
The accuracy of the Logistic Regression using Sepals is: 0.8222222222222222
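The ConvergenceWarning printed earlier for the four-feature logistic regression names two remedies: scale the data or raise max_iter. The sketch below applies both. It is only one possible setup, assuming scikit-learn's bundled iris data in place of Iris.csv; the value max_iter=1000 is an arbitrary generous choice, not a recommendation from the article.

```python
import warnings

from sklearn.datasets import load_iris
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled iris data stands in for the article's Iris.csv.
data = load_iris()
train_X, test_X, train_y, test_y = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0
)

# Standardize the features, then fit with a higher iteration limit.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Record any warnings raised during fitting so convergence can be checked.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    model.fit(train_X, train_y)

converged = not any(issubclass(w.category, ConvergenceWarning) for w in caught)
accuracy = model.score(test_X, test_y)

print("converged without warning:", converged)
print("test accuracy:", round(accuracy, 3))
```

Putting the scaler inside the pipeline means it is fitted on the training rows only and then reused unchanged on the test rows.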

Decision Tree

model=DecisionTreeClassifier()
model.fit(train_x_p,train_y_p)
prediction=model.predict(test_x_p)
print('The accuracy of the Decision Tree using Petals is:',metrics.accuracy_score(prediction,test_y_p))

model.fit(train_x_s,train_y_s)
prediction=model.predict(test_x_s)
print('The accuracy of the Decision Tree using Sepals is:',metrics.accuracy_score(prediction,test_y_s))

The accuracy of the Decision Tree using Petals is: 0.9555555555555556
The accuracy of the Decision Tree using Sepals is: 0.6444444444444445

K-Means clustering

model=KMeans(n_clusters=3)
model.fit(train_x_p,train_y_p)
prediction=model.predict(test_x_p)
print('The accuracy of the KMeans using Petals is:',metrics.accuracy_score(prediction,test_y_p))

model.fit(train_x_s,train_y_s)
prediction=model.predict(test_x_s)
print('The accuracy of the KMeans using Sepals is:',metrics.accuracy_score(prediction,test_y_s))

The accuracy of the KMeans using Petals is: 0.022222222222222223
The accuracy of the KMeans using Sepals is: 0.7555555555555555
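The very low petal score is most likely an artifact of how the clusters are numbered rather than of poor clustering: KMeans assigns the ids 0, 1 and 2 arbitrarily, so they need not line up with the species codes. The sketch below shows two ways to score the clusters that do not depend on that numbering. It assumes scikit-learn's bundled iris data instead of Iris.csv, and the majority-vote mapping is an illustrative choice, not part of the original article.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, adjusted_rand_score
from sklearn.model_selection import train_test_split

# Assumption: the bundled iris data stands in for the article's Iris.csv.
data = load_iris()
petals = data.data[:, 2:4]   # petal length and petal width
train_x, test_x, train_y, test_y = train_test_split(
    petals, data.target, test_size=0.3, random_state=0
)

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(train_x)
train_clusters = model.labels_

# Map each cluster id to the most common species among its training points.
cluster_to_species = {
    k: int(np.bincount(train_y[train_clusters == k]).argmax())
    for k in range(3)
}

test_clusters = model.predict(test_x)
mapped = np.array([cluster_to_species[k] for k in test_clusters])

raw_accuracy = accuracy_score(test_y, test_clusters)   # depends on cluster numbering
mapped_accuracy = accuracy_score(test_y, mapped)       # after relabelling clusters
ari = adjusted_rand_score(test_y, test_clusters)       # ignores cluster numbering

print("raw accuracy:   ", round(raw_accuracy, 3))
print("mapped accuracy:", round(mapped_accuracy, 3))
print("adjusted Rand:  ", round(ari, 3))
```

The raw score can land anywhere depending on how the ids happen to fall, while the mapped accuracy and the adjusted Rand index describe how well the clusters actually match the species.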

KNN

model=KNeighborsClassifier(n_neighbors=3)
model.fit(train_x_p,train_y_p)
prediction=model.predict(test_x_p)
print('The accuracy of the KNN using Petals is:',metrics.accuracy_score(prediction,test_y_p))

model.fit(train_x_s,train_y_s)
prediction=model.predict(test_x_s)
print('The accuracy of the KNN using Sepals is:',metrics.accuracy_score(prediction,test_y_s))

The accuracy of the KNN using Petals is: 0.9777777777777777
The accuracy of the KNN using Sepals is: 0.7333333333333333

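Summary

Training on the petal features rather than the sepal features gives better accuracy. This is what we expected from the heatmap above: the correlation between sepal width and length is very low, while the correlation between petal width and length is very high.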
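The comparison above rests on a single 70/30 split, where one misclassified flower out of 45 moves the accuracy by about 2.2 percentage points. Cross-validation gives a steadier picture. The sketch below assumes scikit-learn's bundled iris data instead of Iris.csv and uses 5 folds, an arbitrary but common choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Assumption: the bundled iris data stands in for the article's Iris.csv.
data = load_iris()
sepals = data.data[:, 0:2]   # sepal length and sepal width
petals = data.data[:, 2:4]   # petal length and petal width

# 5-fold cross-validated accuracy of an SVM on each pair of features.
sepal_scores = cross_val_score(SVC(), sepals, data.target, cv=5)
petal_scores = cross_val_score(SVC(), petals, data.target, cv=5)

print("sepal features: mean accuracy", round(sepal_scores.mean(), 3))
print("petal features: mean accuracy", round(petal_scores.mean(), 3))
```

If the petal features really are more informative, their mean accuracy should stay above the sepal one across folds, not just on one lucky split.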

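Closing remarks

So we have just implemented some common machine learning algorithms. Because the dataset is small and has few features, I did not cover some concepts that only become relevant when there are many features.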



URL: Machine Learning Classification Algorithms: SVM, Logistic Regression, KNN https://m.huajiangbk.com/newsview1114421.html
