
Machine Learning Classification Algorithms: SVM, Logistic Regression, KNN

花匠小妙招
2024-12-15 20:21

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Load the data

iris = pd.read_csv('E:/练习/Iris.csv')

iris.head()

   id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm  Species
0   1            5.1           3.5            1.4           0.2        0
1   2            4.9           3.0            1.4           0.2        0
2   3            4.7           3.2            1.3           0.2        0
3   4            4.6           3.1            1.5           0.2        0
4   5            5.0           3.6            1.4           0.2        0

Print a summary of the data, including its memory usage

iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   id             150 non-null    int64
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    int64
dtypes: float64(4), int64(2)
memory usage: 7.2 KB

# Drop the id column, which carries no information about the species
iris.drop('id', axis=1, inplace=True)

iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   SepalLengthCm  150 non-null    float64
 1   SepalWidthCm   150 non-null    float64
 2   PetalLengthCm  150 non-null    float64
 3   PetalWidthCm   150 non-null    float64
 4   Species        150 non-null    int64
dtypes: float64(4), int64(1)
memory usage: 6.0 KB

# Scatter plot of sepal length vs. sepal width, one colour per species.
# The first call returns the Axes object that the other two draw on.
fig = iris[iris.Species==0].plot(kind='scatter', x='SepalLengthCm', y='SepalWidthCm', color='orange', label='Setosa')
iris[iris.Species==1].plot(kind='scatter', x='SepalLengthCm', y='SepalWidthCm', color='blue', label='versicolor', ax=fig)
iris[iris.Species==2].plot(kind='scatter', x='SepalLengthCm', y='SepalWidthCm', color='green', label='virginica', ax=fig)
fig.set_xlabel("Sepal Length")
fig.set_ylabel("Sepal Width")
fig.set_title("Sepal Length VS Width")
fig = plt.gcf()
fig.set_size_inches(10, 6)
plt.show()

[Figure: scatter plot "Sepal Length VS Width", coloured by species]

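The figure above shows the relationship between sepal length and sepal width. Next we look at the relationship between petal length and petal width.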

# Scatter plot of petal length vs. petal width, one colour per species
fig = iris[iris.Species==0].plot.scatter(x='PetalLengthCm', y='PetalWidthCm', color='orange', label='Setosa')
iris[iris.Species==1].plot.scatter(x='PetalLengthCm', y='PetalWidthCm', color='blue', label='versicolor', ax=fig)
iris[iris.Species==2].plot.scatter(x='PetalLengthCm', y='PetalWidthCm', color='green', label='virginica', ax=fig)
fig.set_xlabel("Petal Length")
fig.set_ylabel("Petal Width")
fig.set_title("Petal Length VS Width")
fig = plt.gcf()
fig.set_size_inches(10, 6)
plt.show()

[Figure: scatter plot "Petal Length VS Width", coloured by species]

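As we can see, the petal features separate the species into much cleaner clusters than the sepal features do. This suggests that the petal measurements will give better, more accurate predictions than the sepal measurements. We will check this later.

How the lengths and widths are distributed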

# One histogram per numeric column
iris.hist(edgecolor='black', linewidth=1.2)
fig = plt.gcf()
fig.set_size_inches(12, 6)
plt.show()

[Figure: histograms of the iris columns]

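How the lengths and widths vary with species

seaborn.violinplot -- draws a violin plot

A violin plot combines a box plot with a kernel density plot. It shows how the values of a measurement are distributed within each species. Narrower sections mean lower data density; wider sections mean higher data density.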

# 2 x 2 grid of violin plots, one per measurement
plt.figure(figsize=(15,10))
plt.subplot(2,2,1)
sns.violinplot(x='Species', y='PetalLengthCm', data=iris)
plt.subplot(2,2,2)
sns.violinplot(x='Species', y='PetalWidthCm', data=iris)
plt.subplot(2,2,3)
sns.violinplot(x='Species', y='SepalLengthCm', data=iris)
plt.subplot(2,2,4)
sns.violinplot(x='Species', y='SepalWidthCm', data=iris)

<AxesSubplot:xlabel='Species', ylabel='SepalWidthCm'>

[Figure: violin plots of the four measurements by species]
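To make the "box plot plus kernel density" description of a violin plot concrete, the sketch below computes both ingredients by hand. The sample values and the bandwidth are made up for illustration, and the fixed-bandwidth Gaussian kernel is a simplification, not necessarily what seaborn does internally.

```python
import math
import statistics

# Illustrative measurements (made up for this sketch, not read from Iris.csv).
values = [1.3, 1.4, 1.4, 1.5, 1.5, 1.5, 1.6, 1.7, 1.9]

# Box-plot ingredient: the three quartiles.
q1, median, q3 = statistics.quantiles(values, n=4)


def gaussian_kde(x, data, bandwidth):
    """Kernel density estimate at x, using a Gaussian kernel of fixed bandwidth."""
    norm = len(data) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data) / norm


# Density ingredient: the violin is wide where this estimate is large.
bandwidth = 0.1  # assumed value; plotting libraries usually choose one automatically
for x in (1.2, 1.5, 1.9):
    print(f"density at {x}: {gaussian_kde(x, values, bandwidth):.3f}")
print("quartiles:", q1, median, q3)
```

The density is highest near 1.5, where the sample values pile up, which is where a violin drawn from these values would be widest.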

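Classification algorithms

sklearn.linear_model -- logistic regression
sklearn.model_selection.train_test_split -- randomly splits a dataset into a training set and a test set
sklearn.neighbors.KNeighborsClassifier -- k-nearest neighbours
sklearn.svm -- support vector machines
sklearn.metrics.accuracy_score -- checks the accuracy of a model
sklearn.tree.DecisionTreeClassifier -- decision tree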

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

pandas.DataFrame.shape -- returns the shape of the DataFrame

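iris.shape

(150, 5)

When training any algorithm, the number of features and the correlation between them play an important role. If many of the features are highly correlated, training on all of them can reduce accuracy, so features should be chosen carefully. This dataset has only a few features, but we will still look at their correlations.

pandas.DataFrame.corr -- computes correlation coefficients
seaborn.heatmap -- draws a heat map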

plt.figure(figsize=(7,4))
# annot=True writes the correlation value inside each cell
sns.heatmap(iris.corr(), annot=True, cmap='cubehelix_r')
plt.show()

[Figure: correlation heat map of the iris columns]
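By default, the values that DataFrame.corr reports are Pearson correlation coefficients. As a rough illustration of what each cell of the heat map means, here is a hand-written version of that coefficient. The function name `pearson` and the sample values are made up for this sketch.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Illustrative values (made up for this sketch, not read from Iris.csv).
petal_length = [1.4, 1.5, 4.5, 4.7, 5.9, 6.1]
petal_width = [0.2, 0.2, 1.5, 1.4, 2.3, 2.5]

r = pearson(petal_length, petal_width)
print(f"r = {r:.3f}")
```

A value close to 1 means the two measurements rise together almost linearly, a value close to -1 means one falls as the other rises, and a value near 0 means there is no linear relationship.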

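Observation: sepal width and length are not correlated, while petal width and length are highly correlated.

We will first use all the features to train the algorithms and check their accuracy. Then we will use one petal feature and one sepal feature, since that gives us two uncorrelated features. This keeps some variance in the data, which may help accuracy. We will check this later.

Steps for training an algorithm

1. Split the dataset into a training set and a test set. The test set is usually smaller than the training set, because more training data helps train the model better.
2. Choose an algorithm that suits the problem (classification or regression).
3. Pass the training set to the algorithm to train it, using the .fit() method.
4. Pass the test data to the trained algorithm to predict the results, using the .predict() method.
5. Compare the predictions with the true values to get the accuracy of the algorithm.

Split the data into training and test sets

test_size = 0.3 puts 30% of the data in the test set.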

train, test = train_test_split(iris, test_size=0.3)
print(train.shape)
print(test.shape)

(105, 5)
(45, 5)

Take the training features X ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm'] and the training labels y, then do the same for the test set.

train_X = train[['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']]
train_y = train.Species  # output of our training data
test_X = test[['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']]
test_y = test.Species

Check the training and test sets

train_X.head(2)

    SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
72            6.3           2.5            4.9           1.5
39            5.1           3.4            1.5           0.2

test_X.head(2)

     SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
143            6.8           3.2            5.9           2.3
40             5.0           3.5            1.3           0.3

Output values of the training set (the class recorded for each row in the original data)

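train_y.head()

72     1
39     0
21     0
109    2
106    2
Name: Species, dtype: int64

SVM (Support Vector Machine)

sklearn.svm.SVC -- the SVC algorithm
sklearn.svm.SVC.fit -- trains the algorithm on the training set
sklearn.svm.SVC.predict -- takes the test set and returns the predicted values
sklearn.metrics.accuracy_score -- compares the predicted values with the actual values and returns the accuracy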

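model = svm.SVC()
model.fit(train_X, train_y)          # train on the training set
prediction = model.predict(test_X)   # predict the classes of the test set
print('The accuracy of the SVM is:', metrics.accuracy_score(prediction, test_y))

The accuracy of the SVM is: 0.9777777777777777

The support vector machine gives very good accuracy. We will go on to check the accuracy of other models, training each one with the same steps as above.

Logistic Regression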

model = LogisticRegression()
model.fit(train_X, train_y)
prediction = model.predict(test_X)
print('The accuracy of the Logistic Regression is', metrics.accuracy_score(prediction, test_y))

The accuracy of the Logistic Regression is 0.9777777777777777

D:\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Decision Tree

model = DecisionTreeClassifier()
model.fit(train_X, train_y)
prediction = model.predict(test_X)
print('The accuracy of the Decision Tree is', metrics.accuracy_score(prediction, test_y))

The accuracy of the Decision Tree is 0.9777777777777777

K-Means Clustering (KMeans)

from sklearn.cluster import KMeans

model = KMeans(n_clusters=3)
# KMeans is unsupervised: fit() accepts y only for API consistency and ignores it
model.fit(train_X, train_y)
prediction = model.predict(test_X)
print('The accuracy of the KMeans is', metrics.accuracy_score(prediction, test_y))

# Select the test samples assigned to each cluster with a boolean mask
x0 = test_X[prediction == 0]
x1 = test_X[prediction == 1]
x2 = test_X[prediction == 2]
plt.scatter(x0['PetalLengthCm'], x0['PetalWidthCm'], c="red", marker='o', label='label0')
plt.scatter(x1['PetalLengthCm'], x1['PetalWidthCm'], c="green", marker='*', label='label1')
plt.scatter(x2['PetalLengthCm'], x2['PetalWidthCm'], c="blue", marker='+', label='label2')
plt.xlabel('petal length')
plt.ylabel('petal width')
plt.legend(loc=2)
plt.show()

The accuracy of the KMeans is 0.3111111111111111

K-means is a clustering algorithm, not a classifier: it never uses the species labels, and the cluster numbers it assigns are arbitrary. The accuracy above therefore only reflects how the cluster numbers happen to line up with the species codes.
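The low K-means accuracy above is easier to interpret with a measure that does not depend on how the clusters are numbered. The sketch below, using made-up labels, contrasts accuracy_score with scikit-learn's adjusted_rand_score. Using the adjusted Rand index here is a suggestion, not something the original analysis does.

```python
from sklearn.metrics import accuracy_score, adjusted_rand_score

# Made-up labels for illustration: the grouping is perfect,
# but the cluster ids are a permutation of the class codes.
true_labels = [0, 0, 1, 1, 2, 2]
cluster_ids = [2, 2, 0, 0, 1, 1]

acc = accuracy_score(true_labels, cluster_ids)
ari = adjusted_rand_score(true_labels, cluster_ids)
print("accuracy:", acc)
print("adjusted Rand index:", ari)
```

Accuracy is 0 here even though every sample is grouped correctly, while the adjusted Rand index is 1 because it only compares which samples end up together.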

K-Nearest Neighbours (KNN)

n_neighbors=3: the class of a point is decided by looking at its 3 nearest neighbours.

model = KNeighborsClassifier(n_neighbors=3)
model.fit(train_X, train_y)
prediction = model.predict(test_X)
print('The accuracy of the KNN is', metrics.accuracy_score(prediction, test_y))

The accuracy of the KNN is 0.9777777777777777

Next, check how the accuracy of KNN changes for different values of n_neighbors. The class is decided by majority vote among the neighbours.

a_index = list(range(1, 11))   # values of n_neighbors to try
accuracies = []                # test accuracy for each value
for i in a_index:
    model = KNeighborsClassifier(n_neighbors=i)
    model.fit(train_X, train_y)
    prediction = model.predict(test_X)
    accuracies.append(metrics.accuracy_score(prediction, test_y))
plt.plot(a_index, accuracies)
plt.xticks(a_index)
plt.show()

[Figure: KNN accuracy for n_neighbors from 1 to 10]
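The majority vote that KNN relies on can be sketched in a few lines of plain Python. This is a simplified illustration, not scikit-learn's implementation: the function name `knn_predict` and the sample points are made up, and ties are not handled specially.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Sort the training samples by Euclidean distance to the query point.
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    # The most common label among the k neighbours wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Illustrative (petal length, petal width) points, made up for this sketch.
points = [(1.4, 0.2), (1.5, 0.2), (4.5, 1.5), (4.7, 1.4), (5.9, 2.3), (6.1, 2.5)]
labels = [0, 0, 1, 1, 2, 2]

print(knn_predict(points, labels, (1.6, 0.3), k=3))
print(knn_predict(points, labels, (4.6, 1.4), k=3))
```

With k=3, a query point takes the label held by at least two of its three nearest neighbours. Larger values of k smooth the decision but can blur the boundaries between classes.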

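In the models above we used all of the iris features. Now we will build separate training data from the petal features and from the sepal features.

Sepals: length and width have very low correlation
Petals: length and width are highly correlated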

petal = iris[['PetalLengthCm','PetalWidthCm','Species']]
sepal = iris[['SepalLengthCm','SepalWidthCm','Species']]

# petals
train_p, test_p = train_test_split(petal, test_size=0.3, random_state=0)
train_x_p = train_p[['PetalWidthCm','PetalLengthCm']]
train_y_p = train_p.Species
test_x_p = test_p[['PetalWidthCm','PetalLengthCm']]
test_y_p = test_p.Species

# sepals
train_s, test_s = train_test_split(sepal, test_size=0.3, random_state=0)
train_x_s = train_s[['SepalWidthCm','SepalLengthCm']]
train_y_s = train_s.Species
test_x_s = test_s[['SepalWidthCm','SepalLengthCm']]
test_y_s = test_s.Species

SVM

model = svm.SVC()
model.fit(train_x_p, train_y_p)
prediction = model.predict(test_x_p)
print('The accuracy of the SVM using Petals is:', metrics.accuracy_score(prediction, test_y_p))

model = svm.SVC()
model.fit(train_x_s, train_y_s)
prediction = model.predict(test_x_s)
print('The accuracy of the SVM using Sepal is:', metrics.accuracy_score(prediction, test_y_s))

The accuracy of the SVM using Petals is: 0.9777777777777777
The accuracy of the SVM using Sepal is: 0.8

Logistic Regression

model = LogisticRegression()
model.fit(train_x_p, train_y_p)
prediction = model.predict(test_x_p)
print('The accuracy of the Logistic Regression using Petals is:', metrics.accuracy_score(prediction, test_y_p))

model.fit(train_x_s, train_y_s)
prediction = model.predict(test_x_s)
print('The accuracy of the Logistic Regression using Sepals is:', metrics.accuracy_score(prediction, test_y_s))

The accuracy of the Logistic Regression using Petals is: 0.9777777777777777
The accuracy of the Logistic Regression using Sepals is: 0.8222222222222222
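The ConvergenceWarning printed earlier for logistic regression on all four features suggests two remedies: scale the data or raise max_iter. The sketch below applies both through a pipeline. It assumes scikit-learn's bundled iris data as a stand-in for Iris.csv, and both random_state=0 and max_iter=1000 are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled iris data stands in for Iris.csv.
X, y = load_iris(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Standardise the features, then fit logistic regression with a larger
# iteration budget than the default of 100.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(train_X, train_y)
accuracy = model.score(test_X, test_y)
print("accuracy:", accuracy)
```

Putting the scaler inside the pipeline means it is fitted on the training data only, so no information from the test set leaks into training.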

Decision Tree

model = DecisionTreeClassifier()
model.fit(train_x_p, train_y_p)
prediction = model.predict(test_x_p)
print('The accuracy of the Decision Tree using Petals is:', metrics.accuracy_score(prediction, test_y_p))

model.fit(train_x_s, train_y_s)
prediction = model.predict(test_x_s)
print('The accuracy of the Decision Tree using Sepals is:', metrics.accuracy_score(prediction, test_y_s))

The accuracy of the Decision Tree using Petals is: 0.9555555555555556
The accuracy of the Decision Tree using Sepals is: 0.6444444444444445
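One way to cross-check the petal-versus-sepal gap is to look at a fitted tree's feature importances. The sketch below assumes scikit-learn's bundled iris data as a stand-in for Iris.csv and fixes random_state=0 only so that the result is repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled iris data stands in for Iris.csv.
data = load_iris()
tree = DecisionTreeClassifier(random_state=0)  # fixed seed for repeatable results
tree.fit(data.data, data.target)

# Importance of each feature in the fitted tree; the values sum to 1.
importances = dict(zip(data.feature_names, tree.feature_importances_))
for name, value in importances.items():
    print(f"{name}: {value:.3f}")

petal_share = sum(v for k, v in importances.items() if k.startswith("petal"))
print(f"share of importance carried by petal features: {petal_share:.3f}")
```

If the tree relies mostly on the petal measurements to make its splits, that agrees with the accuracies above, where the petal-only models outperform the sepal-only ones.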

K-Means Clustering

model = KMeans(n_clusters=3)
model.fit(train_x_p, train_y_p)
prediction = model.predict(test_x_p)
print('The accuracy of the KMeans using Petals is:', metrics.accuracy_score(prediction, test_y_p))

model.fit(train_x_s, train_y_s)
prediction = model.predict(test_x_s)
print('The accuracy of the KMeans using Sepals is:', metrics.accuracy_score(prediction, test_y_s))

The accuracy of the KMeans using Petals is: 0.022222222222222223
The accuracy of the KMeans using Sepals is: 0.7555555555555555
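The very different K-means scores above mostly reflect how the arbitrary cluster numbers happen to match the species codes. As a diagnostic sketch, each cluster can be mapped to the most common true label among its members before scoring. The helper name `align_clusters` and the labels are made up for illustration, and because the mapping uses the true labels, the result is optimistic.

```python
from collections import Counter


def align_clusters(cluster_ids, true_labels):
    """Map each cluster id to the most common true label among its members."""
    mapping = {}
    for cluster in set(cluster_ids):
        members = [t for c, t in zip(cluster_ids, true_labels) if c == cluster]
        mapping[cluster] = Counter(members).most_common(1)[0][0]
    return [mapping[c] for c in cluster_ids]


# Made-up example: the cluster ids are a permutation of the species codes,
# and the last sample has been placed in the wrong cluster.
true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
cluster_ids = [1, 1, 1, 2, 2, 2, 0, 0, 2]

raw_accuracy = sum(c == t for c, t in zip(cluster_ids, true_labels)) / len(true_labels)
aligned = align_clusters(cluster_ids, true_labels)
aligned_accuracy = sum(a == t for a, t in zip(aligned, true_labels)) / len(true_labels)
print("raw accuracy:", raw_accuracy)
print("accuracy after aligning cluster ids:", aligned_accuracy)
```

In this example only one of nine samples matches before alignment, but eight of nine match after it, even though the clustering itself has not changed.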

KNN

model = KNeighborsClassifier(n_neighbors=3)
model.fit(train_x_p, train_y_p)
prediction = model.predict(test_x_p)
print('The accuracy of the KNN using Petals is:', metrics.accuracy_score(prediction, test_y_p))

model.fit(train_x_s, train_y_s)
prediction = model.predict(test_x_s)
print('The accuracy of the KNN using Sepals is:', metrics.accuracy_score(prediction, test_y_s))

The accuracy of the KNN using Petals is: 0.9777777777777777
The accuracy of the KNN using Sepals is: 0.7333333333333333
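Each accuracy above comes from a single 70/30 split, so the numbers can shift noticeably from one split to another. A sketch of a steadier comparison uses 5-fold cross-validation for each classifier on the petal and sepal features. It assumes scikit-learn's bundled iris data as a stand-in for Iris.csv, and the choice of cv=5 is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled iris data stands in for Iris.csv.
# Its columns are sepal length, sepal width, petal length, petal width.
X, y = load_iris(return_X_y=True)
feature_sets = {"sepal": X[:, :2], "petal": X[:, 2:]}

models = {
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=3),
}

# Mean accuracy over 5 stratified folds for every model / feature-set pair.
results = {}
for model_name, model in models.items():
    for set_name, features in feature_sets.items():
        scores = cross_val_score(model, features, y, cv=5)
        results[(model_name, set_name)] = scores.mean()
        print(f"{model_name:13s} {set_name}: {scores.mean():.3f}")
```

Averaging over folds makes the petal-versus-sepal comparison less dependent on which rows happen to land in the test set.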