首页 > 分享 > 集成学习中的软投票和硬投票机制详解和代码实现

集成学习中的软投票和硬投票机制详解和代码实现

花匠小妙招
2024-12-03 08:04

快速回顾集成方法中的软投票和硬投票

集成方法是将两个或多个单独的机器学习算法的结果结合在一起，并试图产生比任何单个算法都准确的结果。

在软投票中，每个类别的概率被平均以产生结果。例如，如果算法 1 以 40% 的概率预测对象是一块岩石，而算法 2 以 80% 的概率预测它是一个岩石，那么集成将预测该对象是一个具有 (80 + 40) / 2 = 60% 的岩石可能性。

在硬投票中，每个算法的预测都被认为是选择具有最高票数的类的集合。例如，如果三个算法将特定葡萄酒的颜色预测为“白色”、“白色”和“红色”，则集成将预测“白色”。

最简单的解释是：软投票是概率的集成，硬投票是结果标签的集成。

生成测试数据

下面我们开始代码的编写，首先导入一些库和一些简单的配置

import pandas as pd import numpy as np import copy as cp from sklearn.datasets import make_classification from sklearn.model_selection import KFold, cross_val_score from typing import Tuple from statistics import mode from sklearn.ensemble import VotingClassifier from sklearn.metrics import accuracy_score from sklearn.linear_model import LogisticRegression from sklearn.ensemble import RandomForestClassifier from sklearn.ensemble import ExtraTreesClassifier from xgboost import XGBClassifier from sklearn.neural_network import MLPClassifier from sklearn.svm import SVC from lightgbm import LGBMClassifier RANDOM_STATE : int = 42 N_SAMPLES : int = 10000 N_FEATURES : int = 25 N_CLASSES : int = 3 N_CLUSTERS_PER_CLASS : int = 2 FEATURE_NAME_PREFIX : str = "Feature" TARGET_NAME : str = "Target" N_SPLITS : int = 5 np.set_printoptions(suppress=True)

12345678910111213141516171819202122232425262728293031

还需要一些数据作为分类的输入。 make_classification_dataframe 函数将数据创建包含特征和目标的测试数据。

这里我们设置类别数为 3。这样就可以实现多分类算法（超过2类都可以）的软投票和硬投票算法。并且我们的代码也可以适用于二元的分类。

def make_classification_dataframe(n_samples : int = 10000, n_features : int = 25, n_classes : int = 2, n_clusters_per_class : int = 2, feature_name_prefix : str = "Feature", target_name : str = "Target", random_state : int = 42) -> pd.DataFrame: X, y = make_classification(n_samples=n_samples, n_features=n_features, n_classes=n_classes, n_informative = n_classes * n_clusters_per_class, random_state=random_state) feature_names = [feature_name_prefix + " " + str(v) for v in np.arange(1, n_features+1)] return pd.concat([pd.DataFrame(X, columns=feature_names), pd.DataFrame(y, columns=[target_name])], axis=1) df_data = make_classification_dataframe(n_samples=N_SAMPLES, n_features=N_FEATURES, n_classes=N_CLASSES, n_clusters_per_class=N_CLUSTERS_PER_CLASS, feature_name_prefix=FEATURE_NAME_PREFIX, target_name=TARGET_NAME, random_state=RANDOM_STATE) X = df_data.drop([TARGET_NAME], axis=1).to_numpy() y = df_data[TARGET_NAME].to_numpy() df_data.head() 123456789101112

生成的数据如下：

交叉验证

使用交叉验证而不是 train_test_split，是因为可以提供更健壮的算法性能评估。

cross_val_predict 辅助函数提供了执行此操作的代码：

def cross_val_predict(model, kfold : KFold, X : np.array, y : np.array) -> Tuple[np.array, np.array, np.array]: model_ = cp.deepcopy(model) no_classes = len(np.unique(y)) actual_classes = np.empty([0], dtype=int) predicted_classes = np.empty([0], dtype=int) predicted_proba = np.empty([0, no_classes]) for train_ndx, test_ndx in kfold.split(X): train_X, train_y, test_X, test_y = X[train_ndx], y[train_ndx], X[test_ndx], y[test_ndx] actual_classes = np.append(actual_classes, test_y) model_.fit(train_X, train_y) predicted_classes = np.append(predicted_classes, model_.predict(test_X)) try: predicted_proba = np.append(predicted_proba, model_.predict_proba(test_X), axis=0) except: predicted_proba = np.append(predicted_proba, np.zeros((len(test_X), no_classes), dtype=float), axis=0) return actual_classes, predicted_classes, predicted_proba

12345678910111213141516171819202122232425

在 predict_proba 中添加了 try 是因为并非所有算法都支持概率，并且没有一致的警告或错误可以显式捕获。

在开始之前，快速看一下单个算法的 cross_val_predict …

lr = LogisticRegression(random_state=RANDOM_STATE) kfold = KFold(n_splits=N_SPLITS, random_state=RANDOM_STATE, shuffle=True) %time actual, lr_predicted, lr_predicted_proba = cross_val_predict(lr, kfold, X, y) print(f"Accuracy of Logistic Regression: {accuracy_score(actual, lr_predicted)}") lr_predicted Wall time: 309 ms Accuracy of Logistic Regression: 0.6821 array([0, 0, 1, ..., 0, 2, 1]) 12345678910

函数cross_val_predict 已返回概率和预测类别，预测类别已显示在单元格输出中。

第一组数据被预测为属于 0 ， 0，1 等等。

多个分类器进行预测

下一件事是为几个分类器生成一组预测和概率，这里选择的算法是随机森林、XGboost等

def cross_val_predict_all_classifiers(classifiers : dict) -> Tuple[np.array, np.array]: predictions = [None] * len(classifiers) predicted_probas = [None] * len(classifiers) for i, (name, classifier) in enumerate(classifiers.items()): %time actual, predictions[i], predicted_probas[i] = cross_val_predict(classifier, kfold, X, y) print(f"Accuracy of {name}: {accuracy_score(actual, predictions[i])}") return actual, predictions, predicted_probas classifiers = dict() classifiers["Random Forest"] = RandomForestClassifier(random_state=RANDOM_STATE) classifiers["XG Boost"] = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=RANDOM_STATE) classifiers["Extra Random Trees"] = ExtraTreesClassifier(random_state=RANDOM_STATE) actual, predictions, predicted_probas = cross_val_predict_all_classifiers(classifiers) Wall time: 17.1 s Accuracy of Random Forest: 0.8742 Wall time: 24.6 s Accuracy of XG Boost: 0.8838 Wall time: 6.2 s Accuracy of Extra Random Trees: 0.8754

1234567891011121314151617181920212223

predictions 变量是一个包含每个算法的一组预测类的列表：

[array([2, 0, 0, ..., 0, 2, 1]), array([2, 0, 2, ..., 0, 2, 1], dtype=int64), array([2, 0, 0, ..., 0, 2, 1])] 123

predict_probas 也是列表，但是其中包含每个预测目标的概率。每个数组都是(10000, 3)的，其中：

10,000 是样本数据集中的数据点数。每个数组对于每组数据都有一行3 是非二元分类器中的类数（因为我们的目标是3个类）

[array([[0.17, 0.02, 0.81], [0.58, 0.07, 0.35], [0.54, 0.1 , 0.36], ..., [0.46, 0.08, 0.46], [0.15, 0. , 0.85], [0.01, 0.97, 0.02]]), array([[0.05611309, 0.00085733, 0.94302952], [0.95303732, 0.00187497, 0.04508775], [0.4653917 , 0.01353438, 0.52107394], ..., [0.75208634, 0.0398241 , 0.20808953], [0.02066649, 0.00156501, 0.97776848], [0.00079027, 0.99868006, 0.00052966]]), array([[0.33, 0.02, 0.65], [0.54, 0.14, 0.32], [0.51, 0.17, 0.32], ..., [0.52, 0.06, 0.42], [0.1 , 0.03, 0.87], [0.05, 0.93, 0.02]])]

123456789101112131415161718192021

上面输出中的第一行详细解释如下

对于第一种算法的第一组数据的预测（即DataFrame中的第一行有17%的概率属于0类，2%的概率属于1类，81%的概率属于2类（三类相加是100%）。

软投票和硬投票

现在进入本文的主题。只需几行 Python 代码即可实现软投票和硬投票。

def soft_voting(predicted_probas : list) -> np.array: sv_predicted_proba = np.mean(predicted_probas, axis=0) sv_predicted_proba[:,-1] = 1 - np.sum(sv_predicted_proba[:,:-1], axis=1) return sv_predicted_proba, sv_predicted_proba.argmax(axis=1) def hard_voting(predictions : list) -> np.array: return [mode(v) for v in np.array(predictions).T] sv_predicted_proba, sv_predictions = soft_voting(predicted_probas) hv_predictions = hard_voting(predictions) for i, (name, classifier) in enumerate(classifiers.items()): print(f"Accuracy of {name}: {accuracy_score(actual, predictions[i])}") print(f"Accuracy of Soft Voting: {accuracy_score(actual, sv_predictions)}") print(f"Accuracy of Hard Voting: {accuracy_score(actual, hv_predictions)}") Accuracy of Random Forrest: 0.8742 Accuracy of XG Boost: 0.8838 Accuracy of Extra Random Trees: 0.8754 Accuracy of Soft Voting: 0.8868 Accuracy of Hard Voting: 0.881

123456789101112131415161718192021222324

上面代码可以看到软投票比性能最佳的单个算法高出 0.3%（88.68% 对 88.38%）,而硬投票却有所降低（88.10% 对 88.38%）,下面就对这两种机制做详细的解释

投票算法代码实现

软投票

sv_predicted_proba = np.mean(predicted_probas, axis=0) sv_predicted_proba array([[0.18537103, 0.01361911, 0.80100984], [0.69101244, 0.07062499, 0.23836258], [0.50513057, 0.09451146, 0.40035798], ..., [0.57736211, 0.05994137, 0.36269651], [0.09022216, 0.01052167, 0.89925616], [0.02026342, 0.96622669, 0.01350989]]) 12345678910

numpy mean 函数沿轴 0 (列)取平均值。从理论上讲，这应该是软投票的全部内容，因为这已经创建了 3 组输出中的每组输出的平均值（均值）并且看起来是正确的。

print(np.sum(sv_predicted_proba[0])) sv_predicted_proba[0] 0.9999999826153119 array([0.18537103, 0.01361911, 0.80100984]) 12345

但是因为四舍五入的误差，行的值并不总是加起来为 1，因为每个数据点都属于概率和为 1 的三个类之一

如果我们使用topk的方法获取分类标签，这种误差不会有任何的影响。但是有时候还需要进行其他处理，必须要保证概率为1，那么就需要做一些简单的处理：将最后一列中的值设置为 1- 其他列中值的总和

sv_predicted_proba[:,-1] = 1 - np.sum(sv_predicted_proba[:,:-1], axis=1) 1.0 array([0.18537103, 0.01361911, 0.80100986]) 1234

现在数据都没有问题了，每一行的概率加起来就是 1，就像它们应该的那样。

下面就是使用numpy 的 argmax 函数获取概率最大的类别作为预测的结果（即对于每一行，软投票是否预测类别 0、1 或 2）。

sv_predicted_proba.argmax(axis=1) array([2, 0, 0, ..., 0, 2, 1], dtype=int64) 123

argmax 函数是沿axis参数中指定的轴选择数组中最大值的索引，因此它为第一行选择 2，为第二行选择 0，为第三行选择0等。

硬投票

hv_predicted = [mode(v) for v in np.array(predictions).T] 1

其中np.array(predictions).T 语法只是转置数组，将(10000, 3)变为(3，10000 )

print(np.array(predictions).shape) np.array(predictions).T (3, 10000) array([[2, 2, 2], [0, 0, 0], [0, 2, 0], ..., [0, 0, 0], [2, 2, 2], [1, 1, 1]], dtype=int64) 1234567891011

然后列表推导获取每个元素（行）并将 statistics.mode 应用于它，从而选择从算法中获得最多票的分类…

np.array(hv_predicted) array([2, 0, 0, ..., 0, 2, 1], dtype=int64) 123

使用 Scikit-Learn

从头编写代码可以让我们更好的链接算法的实现机制，但是不需要重复的制造轮子， scikit-learn的VotingClassifier 可以完成所有工作

estimators = list(classifiers.items()) vc_sv = VotingClassifier(estimators=estimators, voting="soft") vc_hv = VotingClassifier(estimators=estimators, voting="hard") %time actual, vc_sv_predicted, vc_sv_predicted_proba = cross_val_predict(vc_sv, kfold, X, y) %time actual, vc_hv_predicted, _ = cross_val_predict(vc_hv, kfold, X, y) print(f"Accuracy of SciKit-Learn Soft Voting: {accuracy_score(actual, vc_sv_predicted)}") print(f"Accuracy of SciKit-Learn Hard Voting: {accuracy_score(actual, vc_hv_predicted)}") Wall time: 1min 4s Wall time: 55.3 s Accuracy of SciKit-Learn Soft Voting: 0.8868 Accuracy of SciKit-Learn Hard Voting: 0.881 123456789101112131415

cikit-learn 实现产生的结果与我们手写的算法完全相同——软投票准确率为 88.68%，硬投票准确率为 88.1%。

为什么硬投票的效果不好呢？

想想这样一个情况，以二元分类为例：

B的概率分别为0.01，0.51，0.6，0.1，0.7，则

A的概率为0.99，0.49，0.4，0.9，0.3

如果以硬投票计算：

B 3票，A 2票，结果为B

软投票的计算

B：（0.01+0.51+0.6+0.1+0.7）/5=0.38

A:（0.99+0.49+0.4+0.9+0.3）/5=0.61

结果为A

硬投票的结果最终由概率值相对较低的模型（0.51，0.6，0.7）决定，而软投票则由概率值较高的（0.99，0.9）模型决定，软投票会给使那些概率高模型获得更多的权重，所以表现要比硬投票好。

集成学习到底能有多大的提升？

我们看看集成学习究竟可以在准确度度量上实现多少改进呢？使用常见的6个算法看看我们可以从集成中挤出多少性能…

lassifiers = dict() classifiers["Random Forrest"] = RandomForestClassifier(random_state=RANDOM_STATE) classifiers["XG Boost"] = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=RANDOM_STATE) classifiers["Extra Random Trees"] = ExtraTreesClassifier(random_state=RANDOM_STATE) classifiers['Neural Network'] = MLPClassifier(max_iter = 1000, random_state=RANDOM_STATE) classifiers['Support Vector Machine'] = SVC(probability=True, random_state=RANDOM_STATE) classifiers['Light GMB'] = LGBMClassifier(random_state=RANDOM_STATE) estimators = list(classifiers.items()) 123456789

方法 1：使用手写代码

%time actual, predictions, predicted_probas = cross_val_predict_all_classifiers(classifiers) # Get a collection of predictions and probabilities from the selected algorithms sv_predicted_proba, sv_predictions = soft_voting(predicted_probas) # Combine those collections into a single set of predictions hv_predictions = hard_voting(predictions) print(f"Accuracy of Soft Voting: {accuracy_score(actual, sv_predictions)}") print(f"Accuracy of Hard Voting: {accuracy_score(actual, hv_predictions)}") Wall time: 14.9 s Accuracy of Random Forest: 0.8742 Wall time: 32.8 s Accuracy of XG Boost: 0.8838 Wall time: 5.78 s Accuracy of Extra Random Trees: 0.8754 Wall time: 3min 2s Accuracy of Neural Network: 0.8612 Wall time: 36.2 s Accuracy of Support Vector Machine: 0.8674 Wall time: 1.65 s Accuracy of Light GMB: 0.8828 Accuracy of Soft Voting: 0.8914 Accuracy of Hard Voting: 0.8851 Wall time: 4min 34s

123456789101112131415161718192021222324

方法2：使用 SciKit-Learn 和 cross_val_predict

%%time vc_sv = VotingClassifier(estimators=estimators, voting="soft") vc_hv = VotingClassifier(estimators=estimators, voting="hard") %time actual, vc_sv_predicted, vc_sv_predicted_proba = cross_val_predict(vc_sv, kfold, X, y) %time actual, vc_hv_predicted, _ = cross_val_predict(vc_hv, kfold, X, y) print(f"Accuracy of SciKit-Learn Soft Voting: {accuracy_score(actual, vc_sv_predicted)}") print(f"Accuracy of SciKit-Learn Hard Voting: {accuracy_score(actual, vc_hv_predicted)}") Wall time: 4min 11s Wall time: 4min 41s Accuracy of SciKit-Learn Soft Voting: 0.8914 Accuracy of SciKit-Learn Hard Voting: 0.8859 Wall time: 8min 52s 123456789101112131415

方法 3：使用 SciKit-Learn 和 cross_val_score

%time print(f"Accuracy of SciKit-Learn Soft Voting using cross_val_score: {np.mean(cross_val_score(vc_sv, X, y, cv=kfold))}") Accuracy of SciKit-Learn Soft Voting using cross_val_score: 0.8914 Wall time: 4min 46s 1234

3 种不同的方法对软投票准确性的评分达成一致，这再次说明了我们手写的实现是正确的。

补充：Scikit-Learn 硬投票的限制

from catboost import CatBoostClassifier from sklearn.model_selection import cross_val_score classifiers["Cat Boost"] = CatBoostClassifier(silent=True, random_state=RANDOM_STATE) estimators = list(classifiers.items()) vc_hv = VotingClassifier(estimators=estimators, voting="hard") print(cross_val_score(vc_hv, X, y, cv=kfold)) 123456789

会得到一个错误：

ValueError: could not broadcast input array from shape (2000,1) into shape (2000) 1

这个问题很好解决，是因为输出是（2000，1）而输入的要求是（2000），所以我们可以使用np.squeeze将第二个维度删除即可。

总结

通过将将神经网络、支持向量机和lightGMB 加入到组合中，软投票的准确率从 88.68% 提高了 0.46% 至 89.14%，新的软投票准确率比最佳个体算法（XG Boost 为 88.38 提高了 0.76% ）。添加准确度分数低于 XGBoost 的 Light GMB 将集成性能提高了 0.1%！也就是说集成不是最佳的模型也能够提升软投票的性能。如果是 Kaggle 比赛，那么 0.76% 可能是巨大的，可能会让你在排行榜上飞跃。

作者：Graham Harrison

cv=kfold))

会得到一个错误： ```python ValueError: could not broadcast input array from shape (2000,1) into shape (2000) 12345

这个问题很好解决，是因为输出是（2000，1）而输入的要求是（2000），所以我们可以使用np.squeeze将第二个维度删除即可。

总结

https://www.overfit.cn/post/a4a0f83dafad46f3b4135b5faf1ca85a

作者：Graham Harrison

这也太浪漫了吧丨首.家鲜花主题餐厅它来啦❗

北京七夕浪漫拍照餐厅｜适合约会的漂亮饭

热点分享

家庭养花知识大全(家庭养花知识大全与技巧)

养花常识养花技巧 1.浇花 ①残茶浇花残茶用来浇花,既能保持土...

养花知识大全,养花技巧大全

养花知识绿萝是一种很常见的盆栽植物，因为四季翠绿、养护简单...

推荐分享

家庭养花风水知识家庭养花“五行说”

许多人喜欢在家庭里面养花，但不是很了解家庭养花风水知识。居家...

家庭养花知识大全家庭养花有什么好处

家庭养花知识大全家庭养花有什么好处爱花之人总是喜欢在家里...

热门点击排行

君子兰什么品种最名贵十大名贵君子兰排名

世界上最名贵的10种兰花图片，莲瓣兰价值高达1500万

分享分类导航

花卉

每日分享

花卉图片

养花生活