The Iris dataset is one of the most classic datasets in machine learning, introduced by the British statistician Ronald A. Fisher in 1936. It contains 150 samples split evenly into three classes of 50 samples each. Each sample has four features:
sepal length
sepal width
petal length
petal width
The task is to predict the iris species (Setosa, Versicolor, or Virginica) from these four features.
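Before building anything, it helps to peek at the data. Here is a minimal inspection sketch; it assumes the same './data/iris_training.csv' file used in the implementation below, with the class label stored in a 'species' column:

import pandas as pd

df = pd.read_csv('./data/iris_training.csv')
print(df.shape)                      # (number of samples, 4 feature columns + 1 label column)
print(df.columns.tolist())           # the four feature columns followed by 'species'
print(df['species'].value_counts())  # class counts; the full Iris dataset has 50 per class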
A decision tree is a tree-structured method for classification and regression. Its core idea is to select and split on features so that the dataset is progressively partitioned into purer subsets, yielding a tree model that can make predictions. Building the tree involves three main steps:
Select the best feature: using some criterion (this article uses information gain; see the formulas below), pick the feature that best separates the data.
Choose a split point: for a continuous feature, find a threshold that divides the dataset into two subsets.
Recursively build subtrees: repeat the process on each subset until a stopping condition is met (e.g., all samples in a node belong to the same class, or no feature yields a valid split).
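For reference, the two quantities that drive split selection can be written out. For a dataset D with class proportions p_i, the entropy is

Ent(D) = - Σ_i p_i · log2(p_i)

and for a continuous feature a with threshold t splitting D into D_left (a <= t) and D_right (a > t), the information gain is

Gain(D, a, t) = Ent(D) - ( |D_left|/|D| · Ent(D_left) + |D_right|/|D| · Ent(D_right) )

For example, a node holding 50 Setosa and 50 Versicolor samples has entropy -(0.5·log2(0.5) + 0.5·log2(0.5)) = 1 bit, while a pure node has entropy 0; the chosen split is the (feature, threshold) pair that reduces this the most.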
The complete Python implementation follows:
import pandas as pd
import numpy as np

# Load the training and test splits.
train_data = pd.read_csv('./data/iris_training.csv')
test_data = pd.read_csv('./data/iris_test.csv')
def calculate_cls(dataset):
    # The set of distinct class labels in the last (label) column.
    cls_column = dataset.iloc[:, -1]
    cls = set(cls_column)
    return cls
def calculate_Entropy(dataset):
    # Shannon entropy of the class distribution: Ent(D) = - Σ p_i · log2(p_i).
    Entropy = 0
    cls = calculate_cls(dataset)
    data_num = dataset.shape[0]
    for i in cls:
        p_i = dataset['species'].value_counts().get(i, 0) / data_num
        if p_i > 0:
            Entropy -= p_i * np.log2(p_i)
    return Entropy
def calculate_Gain(dataset, feature):
    # Find the best binary split for one continuous feature: sort by the feature
    # and try the midpoint between every adjacent pair whose labels differ.
    sorted_dataset = dataset.sort_values(by=feature, ascending=True).reset_index(drop=True)
    eigen = sorted_dataset[feature].tolist()
    label = sorted_dataset['species'].tolist()
    sample_num = dataset.shape[0]
    base_entropy = calculate_Entropy(dataset)
    max_Gain = -float('inf')
    max_Gain_point = None
    for i in range(1, sample_num):
        if label[i] != label[i - 1]:
            mid_point = (eigen[i] + eigen[i - 1]) / 2
            left_subset = sorted_dataset[sorted_dataset[feature] <= mid_point]
            right_subset = sorted_dataset[sorted_dataset[feature] > mid_point]
            # Duplicate feature values can leave one side empty; skip such candidates.
            if left_subset.empty or right_subset.empty:
                continue
            # Gain = entropy of the parent minus the weighted entropies of the children.
            Gain = base_entropy - (
                (left_subset.shape[0] / sample_num) * calculate_Entropy(left_subset) +
                (right_subset.shape[0] / sample_num) * calculate_Entropy(right_subset)
            )
            if Gain > max_Gain:
                max_Gain = Gain
                max_Gain_point = mid_point
    return (max_Gain_point, max_Gain)
def choose_best_feature(dataset):
    # Evaluate every feature's best split point and keep the highest-gain pair.
    features = dataset.columns[:-1].tolist()
    best_Gain = -float('inf')
    best_feature = None
    best_split = None
    for feature in features:
        split_point, Gain = calculate_Gain(dataset, feature)
        if Gain > best_Gain:
            best_Gain = Gain
            best_feature = feature
            best_split = split_point
    return best_feature, best_split
def split_dataset(dataset, feature, point):
    dataset_1 = dataset[dataset[feature] <= point].reset_index(drop=True)
    dataset_2 = dataset[dataset[feature] > point].reset_index(drop=True)
    return (dataset_1, dataset_2)
def create_tree(dataset):
    cls = calculate_cls(dataset)
    # A pure node becomes a leaf labelled with its single class.
    if len(cls) == 1:
        return cls.pop()
    # Degenerate case: only the label column remains, so return the majority class.
    if dataset.shape[1] == 1:
        return dataset['species'].mode()[0]
    best_feature, best_split = choose_best_feature(dataset)
    # No feature yields a valid split: fall back to the majority class.
    if best_feature is None:
        return dataset['species'].mode()[0]
    tree = {best_feature: {'split_point': best_split, 'left': None, 'right': None}}
    left_subset, right_subset = split_dataset(dataset, best_feature, best_split)
    tree[best_feature]['left'] = create_tree(left_subset)
    tree[best_feature]['right'] = create_tree(right_subset)
    return tree
def predict(tree, sample):
    # A leaf is a bare class label; an internal node is a single-key dict.
    if not isinstance(tree, dict):
        return tree
    feature = next(iter(tree))
    split_point = tree[feature]['split_point']
    if sample[feature] <= split_point:
        return predict(tree[feature]['left'], sample)
    else:
        return predict(tree[feature]['right'], sample)
# Train on the training split and print the resulting tree.
decision_tree = create_tree(train_data)
print("Decision tree:", decision_tree)

# Classify every test row and measure accuracy against the true labels.
predictions = test_data.apply(lambda x: predict(decision_tree, x), axis=1)
accuracy = (predictions == test_data['species']).mean()
print(f"Accuracy on the test set: {accuracy * 100:.2f}%")

# Save the test set together with the predicted labels.
test_data['predicted_species'] = predictions
test_data.to_csv('./data/test_with_predictions.csv', index=False)
print("Predictions saved to 'test_with_predictions.csv'")