The Iris dataset is one of the most classic datasets in machine learning, introduced by the British statistician Ronald A. Fisher in 1936. It contains 150 samples split evenly into three classes of 50 samples each. Each sample has four features:
sepal length
sepal width
petal length
petal width
The task is to predict the iris species (Setosa, Versicolor, or Virginica) from these four features.
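Before building anything, it helps to peek at the data. Here is a minimal inspection sketch; it assumes the same './data/iris_training.csv' file used in the implementation below, with the class label stored in a 'species' column:

import pandas as pd

df = pd.read_csv('./data/iris_training.csv')
print(df.shape)                      # (number of samples, 4 feature columns + 1 label column)
print(df.columns.tolist())           # the four feature columns followed by 'species'
print(df['species'].value_counts())  # class counts; the full Iris dataset has 50 per class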
A decision tree is a tree-structured method for classification and regression. Its core idea is to select and split on features so that the dataset is progressively partitioned into purer subsets, yielding a tree model that can make predictions. Building the tree involves three main steps:
Select the best feature: using some criterion (this article uses information gain; see the formulas below), pick the feature that best separates the data.
Choose a split point: for a continuous feature, find a threshold that divides the dataset into two subsets.
Recursively build subtrees: repeat the process on each subset until a stopping condition is met (e.g., all samples in a node belong to the same class, or no feature yields a valid split).
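For reference, the two quantities that drive split selection can be written out. For a dataset D with class proportions p_i, the entropy is

Ent(D) = - Σ_i p_i · log2(p_i)

and for a continuous feature a with threshold t splitting D into D_left (a <= t) and D_right (a > t), the information gain is

Gain(D, a, t) = Ent(D) - ( |D_left|/|D| · Ent(D_left) + |D_right|/|D| · Ent(D_right) )

For example, a node holding 50 Setosa and 50 Versicolor samples has entropy -(0.5·log2(0.5) + 0.5·log2(0.5)) = 1 bit, while a pure node has entropy 0; the chosen split is the (feature, threshold) pair that reduces this the most.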
The complete Python implementation follows:
import pandas as pd
import numpy as np

# Load the training and test splits.
train_data = pd.read_csv('./data/iris_training.csv')
test_data = pd.read_csv('./data/iris_test.csv')
def calculate_cls(dataset):
    # The set of distinct class labels in the last (label) column.
    cls_column = dataset.iloc[:, -1]
    cls = set(cls_column)
    return cls
def calculate_Entropy(dataset):
    # Shannon entropy of the class distribution: Ent(D) = - Σ p_i · log2(p_i).
    Entropy = 0
    cls = calculate_cls(dataset)
    data_num = dataset.shape[0]
    for i in cls:
        p_i = dataset['species'].value_counts().get(i, 0) / data_num
        if p_i > 0:
            Entropy -= p_i * np.log2(p_i)
    return Entropy
def calculate_Gain(dataset, feature):
    # Find the best binary split for one continuous feature: sort by the feature
    # and try the midpoint between every adjacent pair whose labels differ.
    sorted_dataset = dataset.sort_values(by=feature, ascending=True).reset_index(drop=True)
    eigen = sorted_dataset[feature].tolist()
    label = sorted_dataset['species'].tolist()
    sample_num = dataset.shape[0]
    base_entropy = calculate_Entropy(dataset)
    max_Gain = -float('inf')
    max_Gain_point = None
    for i in range(1, sample_num):
        if label[i] != label[i - 1]:
            mid_point = (eigen[i] + eigen[i - 1]) / 2
            left_subset = sorted_dataset[sorted_dataset[feature] <= mid_point]
            right_subset = sorted_dataset[sorted_dataset[feature] > mid_point]
            # Duplicate feature values can leave one side empty; skip such candidates.
            if left_subset.empty or right_subset.empty:
                continue
            # Gain = entropy of the parent minus the weighted entropies of the children.
            Gain = base_entropy - (
                (left_subset.shape[0] / sample_num) * calculate_Entropy(left_subset) +
                (right_subset.shape[0] / sample_num) * calculate_Entropy(right_subset)
            )
            if Gain > max_Gain:
                max_Gain = Gain
                max_Gain_point = mid_point
    return (max_Gain_point, max_Gain)
def choose_best_feature(dataset):
    # Evaluate every feature's best split point and keep the highest-gain pair.
    features = dataset.columns[:-1].tolist()
    best_Gain = -float('inf')
    best_feature = None
    best_split = None
    for feature in features:
        split_point, Gain = calculate_Gain(dataset, feature)
        if Gain > best_Gain:
            best_Gain = Gain
            best_feature = feature
            best_split = split_point
    return best_feature, best_split
def split_dataset(dataset, feature, point):
    dataset_1 = dataset[dataset[feature] <= point].reset_index(drop=True)
    dataset_2 = dataset[dataset[feature] > point].reset_index(drop=True)
    return (dataset_1, dataset_2)
def create_tree(dataset):
    cls = calculate_cls(dataset)
    # A pure node becomes a leaf labelled with its single class.
    if len(cls) == 1:
        return cls.pop()
    # Degenerate case: only the label column remains, so return the majority class.
    if dataset.shape[1] == 1:
        return dataset['species'].mode()[0]
    best_feature, best_split = choose_best_feature(dataset)
    # No feature yields a valid split: fall back to the majority class.
    if best_feature is None:
        return dataset['species'].mode()[0]
    tree = {best_feature: {'split_point': best_split, 'left': None, 'right': None}}
    left_subset, right_subset = split_dataset(dataset, best_feature, best_split)
    tree[best_feature]['left'] = create_tree(left_subset)
    tree[best_feature]['right'] = create_tree(right_subset)
    return tree
def predict(tree, sample):
    # A leaf is a bare class label; an internal node is a single-key dict.
    if not isinstance(tree, dict):
        return tree
    feature = next(iter(tree))
    split_point = tree[feature]['split_point']
    if sample[feature] <= split_point:
        return predict(tree[feature]['left'], sample)
    else:
        return predict(tree[feature]['right'], sample)
# Train on the training split and print the resulting tree.
decision_tree = create_tree(train_data)
print("Decision tree:", decision_tree)

# Classify every test row and measure accuracy against the true labels.
predictions = test_data.apply(lambda x: predict(decision_tree, x), axis=1)
accuracy = (predictions == test_data['species']).mean()
print(f"Accuracy on the test set: {accuracy * 100:.2f}%")

# Save the test set together with the predicted labels.
test_data['predicted_species'] = predictions
test_data.to_csv('./data/test_with_predictions.csv', index=False)
print("Predictions saved to 'test_with_predictions.csv'")