
Classifying the iris dataset in pure Python with a decision tree

About the iris dataset

The iris dataset is one of the classic datasets in machine learning, introduced by the British statistician Ronald A. Fisher in 1936. It contains 150 samples, evenly divided into three classes of 50. Each sample has four features:

sepal length

sepal width

petal length

petal width

The task is to predict the species of iris (Setosa, Versicolor, or Virginica) from these four features.
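A quick way to eyeball that layout, assuming the CSV files used later in this article (four feature columns followed by a 'species' label column):

import pandas as pd

# Hypothetical peek at the training split; the path matches the listing
# below, but your split sizes and class balance may differ.
df = pd.read_csv('./data/iris_training.csv')
print(df.shape)                      # (n_rows, 5): four features + label
print(df['species'].value_counts())  # class counts in this split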

Overview of the decision tree algorithm

A decision tree is a tree-structured method for classification and regression. Its core idea is to choose features and split points that partition the data into progressively purer subsets, yielding a tree that can be used for prediction. Building one involves three main steps:

Choose the best feature: use a criterion (information gain in this article; the formulas are given after this list) to pick the feature that best separates the data.

Determine the split point: for a continuous feature, pick a threshold that divides the dataset into two subsets.

Build subtrees recursively: repeat the process on each subset until a stopping condition is met, such as all samples in a node sharing one class or no features remaining.
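For reference, these are the standard definitions the code below implements (notation mine): the entropy of a dataset D with class proportions p_k, and the gain from splitting D on feature a at threshold t:

H(D) = -\sum_{k=1}^{K} p_k \log_2 p_k

\mathrm{Gain}(D, a, t) = H(D) - \frac{|D_{\le t}|}{|D|} H(D_{\le t}) - \frac{|D_{> t}|}{|D|} H(D_{> t})

where D_{\le t} and D_{> t} are the samples whose a-value is at most t and greater than t, respectively. The chosen split maximizes this gain.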

Walking through the code

Implementing the formulas

def calculate_Entropy(dataset):
    # Shannon entropy of the class labels in `dataset` ('species' column).
    Entropy = 0
    cls = calculate_cls(dataset)
    data_num = dataset.shape[0]
    for i in cls:
        p_i = dataset['species'].value_counts().get(i, 0) / data_num
        if p_i > 0:  # skip absent classes: 0 * log2(0) is taken as 0
            Entropy -= p_i * np.log2(p_i)
    return Entropy

def calculate_Gain(dataset, feature):
    # Best information gain over all candidate thresholds of one continuous feature.
    sorted_dataset = dataset.sort_values(by=feature, ascending=True).reset_index(drop=True)
    eigen = sorted_dataset[feature].tolist()
    label = sorted_dataset['species'].tolist()
    sample_num = dataset.shape[0]
    base_entropy = calculate_Entropy(dataset)
    max_Gain = -float('inf')
    max_Gain_point = None
    for i in range(1, sample_num):
        # Only midpoints between adjacent samples of different classes can be optimal.
        if label[i] != label[i - 1]:
            mid_point = (eigen[i] + eigen[i - 1]) / 2
            left_subset = sorted_dataset[sorted_dataset[feature] <= mid_point]
            right_subset = sorted_dataset[sorted_dataset[feature] > mid_point]
            if left_subset.empty or right_subset.empty:
                continue  # degenerate split caused by duplicate feature values
            Gain = base_entropy - (
                (left_subset.shape[0] / sample_num) * calculate_Entropy(left_subset) +
                (right_subset.shape[0] / sample_num) * calculate_Entropy(right_subset)
            )
            if Gain > max_Gain:
                max_Gain = Gain
                max_Gain_point = mid_point
    return (max_Gain_point, max_Gain)
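A minimal sanity check of these two functions (the toy values are mine, not from the article; calculate_cls from the full listing below is assumed to be in scope): two classes split 50/50 have entropy of exactly 1 bit, and a threshold that separates them perfectly recovers all of it.

import pandas as pd

toy = pd.DataFrame({
    'petal_length': [1.0, 1.2, 4.5, 4.7],
    'species': ['setosa', 'setosa', 'versicolor', 'versicolor'],
})
print(calculate_Entropy(toy))               # 1.0 bit
print(calculate_Gain(toy, 'petal_length'))  # (2.85, 1.0): a perfect split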

The complete Python implementation follows:

import pandas as pd
import numpy as np

# Train/test splits stored as CSV, with the four feature columns first
# and the 'species' label column last.
train_data = pd.read_csv('./data/iris_training.csv')
test_data = pd.read_csv('./data/iris_test.csv')

def calculate_cls(dataset):
    # Distinct class labels; the label is assumed to be the last column.
    cls_column = dataset.iloc[:, -1]
    cls = set(cls_column)
    return cls

def calculate_Entropy(dataset):
    # Shannon entropy of the class labels in `dataset` ('species' column).
    Entropy = 0
    cls = calculate_cls(dataset)
    data_num = dataset.shape[0]
    for i in cls:
        p_i = dataset['species'].value_counts().get(i, 0) / data_num
        if p_i > 0:  # skip absent classes: 0 * log2(0) is taken as 0
            Entropy -= p_i * np.log2(p_i)
    return Entropy

def calculate_Gain(dataset, feature):
    # Best information gain over all candidate thresholds of one continuous feature.
    sorted_dataset = dataset.sort_values(by=feature, ascending=True).reset_index(drop=True)
    eigen = sorted_dataset[feature].tolist()
    label = sorted_dataset['species'].tolist()
    sample_num = dataset.shape[0]
    base_entropy = calculate_Entropy(dataset)
    max_Gain = -float('inf')
    max_Gain_point = None
    for i in range(1, sample_num):
        # Only midpoints between adjacent samples of different classes can be optimal.
        if label[i] != label[i - 1]:
            mid_point = (eigen[i] + eigen[i - 1]) / 2
            left_subset = sorted_dataset[sorted_dataset[feature] <= mid_point]
            right_subset = sorted_dataset[sorted_dataset[feature] > mid_point]
            if left_subset.empty or right_subset.empty:
                continue  # degenerate split caused by duplicate feature values
            Gain = base_entropy - (
                (left_subset.shape[0] / sample_num) * calculate_Entropy(left_subset) +
                (right_subset.shape[0] / sample_num) * calculate_Entropy(right_subset)
            )
            if Gain > max_Gain:
                max_Gain = Gain
                max_Gain_point = mid_point
    return (max_Gain_point, max_Gain)

def choose_best_feature(dataset):
    # Try every feature and keep the (feature, threshold) pair with the highest gain.
    features = dataset.columns[:-1].tolist()
    best_Gain = -float('inf')
    best_feature = None
    best_split = None
    for feature in features:
        split_point, Gain = calculate_Gain(dataset, feature)
        if Gain > best_Gain:
            best_Gain = Gain
            best_feature = feature
            best_split = split_point
    return best_feature, best_split

def split_dataset(dataset, feature, point):
    # Binary split on `feature` at threshold `point`.
    dataset_1 = dataset[dataset[feature] <= point].reset_index(drop=True)
    dataset_2 = dataset[dataset[feature] > point].reset_index(drop=True)
    return (dataset_1, dataset_2)

def create_tree(dataset):
    cls = calculate_cls(dataset)
    if len(cls) == 1:
        # Pure node: return the class label as a leaf.
        return cls.pop()
    if dataset.shape[1] == 1:
        # Only the label column is left: return the majority class.
        return dataset['species'].mode()[0]
    best_feature, best_split = choose_best_feature(dataset)
    if best_feature is None:
        # No split yields any gain: fall back to the majority class.
        return dataset['species'].mode()[0]
    tree = {best_feature: {'split_point': best_split, 'left': None, 'right': None}}
    left_subset, right_subset = split_dataset(dataset, best_feature, best_split)
    tree[best_feature]['left'] = create_tree(left_subset)
    tree[best_feature]['right'] = create_tree(right_subset)
    return tree

def predict(tree, sample):
    # Leaves are plain labels; internal nodes are
    # {feature: {'split_point': t, 'left': subtree, 'right': subtree}}.
    if not isinstance(tree, dict):
        return tree
    feature = next(iter(tree))
    split_point = tree[feature]['split_point']
    if sample[feature] <= split_point:
        return predict(tree[feature]['left'], sample)
    else:
        return predict(tree[feature]['right'], sample)

decision_tree = create_tree(train_data)
print("Decision tree:", decision_tree)

# Classify every test row; predict only reads the feature columns,
# so passing the whole row (including the label) is safe.
predictions = test_data.apply(lambda x: predict(decision_tree, x), axis=1)
accuracy = (predictions == test_data['species']).mean()
print(f"Accuracy on the test set: {accuracy * 100:.2f}%")

test_data['predicted_species'] = predictions
test_data.to_csv('./data/test_with_predictions.csv', index=False)
print("Predictions saved to 'test_with_predictions.csv'")

Results

Running the script prints the learned tree as a nested dict, reports the accuracy on the test split, and writes the per-sample predictions to test_with_predictions.csv.

