Contents
Algorithm overview
Entropy
Information gain - ID3 algorithm
Gain ratio - C4.5 algorithm
Source data
Code implementation - ID3 algorithm
Code implementation - C4.5 algorithm
Decision tree plotting code - treePlotter
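Algorithm overview
The decision tree is one of the more common machine learning algorithms and is a form of supervised learning. ID3 is a decision tree classification algorithm that uses entropy and information gain as its selection criteria.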
Entropy
Entropy measures how disordered the information is: the greater the uncertainty of a variable, the larger its entropy. It can be expressed as:
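For reference, the standard Shannon entropy with a base-2 logarithm, which is what dataentropy computes with log(prob, 2), can be written as follows. The symbols are notation chosen here: S is the data set, c is the number of classes, and p_i is the fraction of samples in S that belong to class i.

```latex
\[
\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i
\]
```

For two classes the value ranges from 0 (all samples in one class) to 1 (an even split).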
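Information gain - ID3 algorithm
Information gain is the change in entropy before and after the data is partitioned on a feature. It can be expressed as: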
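In the usual notation (assumed here, not taken from the original figure), S is the data set, A is a feature, Values(A) is the set of values A takes, and S_v is the subset of S in which A = v. Information gain is then the entropy of S minus the weighted entropy of the subsets:

```latex
\[
\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)
\]
```

The second term is the conditional entropy after the split. It corresponds to the variable newEnt in the BestSplit function.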
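Splitting on different features changes the entropy by different amounts, so the information gain differs as well. The larger the information gain, the better the feature separates the samples and the more representative it is. ID3 is a top-down greedy strategy: at every node it selects the feature with the largest information gain.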
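A minimal sketch of this selection rule, assuming a toy data set made up for illustration (each row is [feature_0, feature_1, class]). The helper names entropy and info_gain are hypothetical and are not part of the article's code.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def info_gain(rows, feature_index):
    """Entropy of the class column minus the weighted entropy after the split."""
    base = entropy([row[-1] for row in rows])
    groups = {}  # feature value -> class labels of the rows with that value
    for row in rows:
        groups.setdefault(row[feature_index], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical toy rows: [feature_0, feature_1, class]
rows = [
    [1, 1, 1],
    [1, 2, 1],
    [2, 1, 0],
    [2, 2, 0],
]

gains = [info_gain(rows, i) for i in range(2)]
best = max(range(2), key=lambda i: gains[i])  # greedy choice: largest gain
print(gains, best)
```

Here feature_0 separates the two classes completely, while feature_1 leaves each subset as mixed as the whole set, so the greedy rule picks feature_0.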
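Selecting features by information gain has one drawback: ID3 tends to prefer features with many distinct values, because such features usually have a larger information gain. (Information gain reflects how much uncertainty is removed once a condition is given. The more finely a data set is partitioned, the purer each subset tends to be, so the conditional entropy is smaller and the information gain is larger.)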
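An extreme case makes the bias concrete. Suppose, purely for illustration, that a feature A takes a different value for every sample, like an ID number. With S the data set and S_v the subset of S in which A = v, every S_v contains exactly one sample, so its entropy is 0:

```latex
\[
\mathrm{Gain}(S, A)
= \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)
= \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{1}{|S|} \cdot 0
= \mathrm{Entropy}(S)
\]
```

This is the largest gain any feature can reach, even though such a feature says nothing useful about unseen samples.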
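Gain ratio - C4.5 algorithm
To address this weakness of ID3, C4.5 uses the gain ratio as its criterion for choosing a split. For a feature with many distinct values, the denominator of the gain ratio, SplitInformation(S, A), which we call the split information, is large and dilutes that feature's advantage in feature selection. Split information (formula 1) and gain ratio (formula 2) are computed as follows.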
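The standard C4.5 definitions are given here for reference. The notation is assumed: S is the data set, A is a feature, S_v is the subset of S in which A = v, and Gain(S, A) is the information gain of A.

```latex
\[
\mathrm{SplitInformation}(S, A) = -\sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|} \tag{1}
\]
\[
\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInformation}(S, A)} \tag{2}
\]
```

Split information is the entropy of the feature's own value distribution, so it grows with the number of distinct values and shrinks the ratio for finely divided features.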
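Source data
The data records whether a single woman decided to go on a date, based on a few basic attributes of each candidate. We want to build a decision tree model from her past decisions so that, as far as possible, she is shown profiles of men she would be willing to date.
[Table: first five rows of the data; columns 收入 (income), 身高 (height), 长相 (looks), 体型 (body shape), 是否见面 (whether to meet)]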
Code implementation - ID3 algorithm
from math import log
import operator

import pandas as pd

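def dataentropy(data, feat):
    """Return the Shannon entropy (base 2) of the class labels in data.

    The last element of each row is the class label. feat is unused and is
    kept only so that calls such as dataentropy(data, -1) keep working.
    """
    lendata = len(data)
    labelCounts = {}  # class label -> number of rows
    for featVec in data:
        category = featVec[-1]
        if category not in labelCounts:
            labelCounts[category] = 0
        labelCounts[category] += 1
    entropy = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / lendata
        entropy -= prob * log(prob, 2)
    return entropy
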
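def Importdata(datafile):
    """Read the Excel sheet and return (rows, feature names) with integer-coded values."""
    dataa = pd.read_excel(datafile)
    # Integer codes for the Chinese category labels:
    # 高 (high) 1, 一般 (average) 2, 低 (low) 3, 帅 (handsome) 1, 丑 (ugly) 3,
    # 胖 (heavy) 3, 瘦 (slim) 1, 是 (yes) 1, 否 (no) 0
    productDict = {'高': 1, '一般': 2, '低': 3, '帅': 1, '丑': 3, '胖': 3, '瘦': 1, '是': 1, '否': 0}
    dataa['income'] = dataa['收入'].map(productDict)      # 收入: income
    dataa['hight'] = dataa['身高'].map(productDict)       # 身高: height
    dataa['look'] = dataa['长相'].map(productDict)        # 长相: looks
    dataa['shape'] = dataa['体型'].map(productDict)       # 体型: body shape
    dataa['is_meet'] = dataa['是否见面'].map(productDict)  # 是否见面: whether to meet
    # Columns 5 onward are the encoded features followed by the class label.
    data = dataa.iloc[:, 5:].values.tolist()
    # Feature names: the encoded columns without the trailing class column.
    labels = dataa.columns[5:-1].tolist()
    return data, labels
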
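def splitData(data, i, value):
    """Return the rows whose feature i equals value, with column i removed."""
    subset = []  # renamed so the local no longer shadows the function name
    for featVec in data:
        if featVec[i] == value:
            rfv = featVec[:i]
            rfv.extend(featVec[i+1:])
            subset.append(rfv)
    return subset
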
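def BestSplit(data):
    """Return the index of the feature with the largest information gain (ID3).

    Returns -1 if no feature has a positive information gain.
    """
    numFea = len(data[0]) - 1          # the last column is the class label
    baseEnt = dataentropy(data, -1)    # entropy before any split
    bestInfo = 0
    bestFeat = -1
    for i in range(numFea):
        featList = [rowdata[i] for rowdata in data]
        uniqueVals = set(featList)
        newEnt = 0                     # conditional entropy after splitting on feature i
        for value in uniqueVals:
            subData = splitData(data, i, value)
            prob = len(subData) / float(len(data))
            newEnt += prob * dataentropy(subData, i)
        info = baseEnt - newEnt        # information gain of feature i
        if info > bestInfo:
            bestInfo = info
            bestFeat = i
    return bestFeat
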
def majorityCnt(classList):
    """Return the most frequent class label in classList."""
    c_count = {}
    for i in classList:
        if i not in c_count:
            c_count[i] = 0
        c_count[i] += 1
    ClassCount = sorted(c_count.items(), key=operator.itemgetter(1), reverse=True)
    return ClassCount[0][0]

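def createTree(data, labels):
    """Recursively build the tree as nested dicts: {feature: {value: subtree or class}}."""
    classList = [rowdata[-1] for rowdata in data]
    # Stop when every row has the same class.
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # Stop when no features are left and fall back to the majority class.
    if len(data[0]) == 1:
        return majorityCnt(classList)
    bestFeat = BestSplit(data)
    # BestSplit returns -1 when no feature improves on the current node.
    if bestFeat == -1:
        return majorityCnt(classList)
    bestLab = labels[bestFeat]
    myTree = {bestLab: {}}
    del labels[bestFeat]  # note: this modifies the list passed in by the caller
    featValues = [rowdata[bestFeat] for rowdata in data]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]  # copy so sibling branches do not share deletions
        myTree[bestLab][value] = createTree(splitData(data, bestFeat, value), subLabels)
    return myTree
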
if __name__ == '__main__':
    datafile = u'E:pythondatatree.xlsx'
    data, labels = Importdata(datafile)
    print(createTree(data, labels))
Output:
{'hight': {1: {'look': {1: {'income': {1: {'shape': {1: 1, 2: 1}}, 2: 1, 3: {'shape': {1: 1, 2: 0}}}}, 2: 1, 3: {'income': {1: 1, 2: 0}}}}, 2: {'income': {1: 1, 2: {'look': {1: 1, 2: 0}}, 3: 0}}, 3: {'look': {1: {'shape': {3: 0, 1: 1}}, 2: 0, 3: 0}}}}
[Figure: the decision tree corresponding to the output above]
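To show how a nested dict of this kind can be used for prediction, here is a minimal sketch. The classify function is hypothetical and not part of the original code, and the tree is only a fragment: the hight = 2 and hight = 3 branches copied from the output above. A sample is assumed to be a dict from feature name to encoded value.

```python
def classify(tree, sample):
    """Walk a nested-dict tree and return the predicted class, or None if stuck."""
    while isinstance(tree, dict):
        feature = next(iter(tree))   # the single key is the feature tested here
        branches = tree[feature]
        value = sample[feature]
        if value not in branches:
            return None              # this value has no branch in the tree
        tree = branches[value]
    return tree                      # a leaf: 1 (meet) or 0 (do not meet)


# Fragment of the printed tree (branches for hight = 2 and hight = 3 only).
tree = {'hight': {2: {'income': {1: 1, 2: {'look': {1: 1, 2: 0}}, 3: 0}},
                  3: {'look': {1: {'shape': {3: 0, 1: 1}}, 2: 0, 3: 0}}}}

print(classify(tree, {'hight': 2, 'income': 2, 'look': 1, 'shape': 1}))
print(classify(tree, {'hight': 3, 'income': 1, 'look': 1, 'shape': 3}))
```

Returning None for an unseen value is one possible design choice; falling back to the majority class of the node would be another.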
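Code implementation - C4.5 algorithm
C4.5 follows almost the same logic as ID3. The only difference is that ID3 selects features by information gain while C4.5 uses the gain ratio, so the only code affected is the definition of BestSplit(data), which needs an extra gain ratio calculation. The revised BestSplit(data) is as follows: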
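def BestSplit(data):
    """Return the index of the feature with the largest gain ratio (C4.5).

    Returns -1 if no feature has a positive gain ratio.
    """
    numFea = len(data[0]) - 1          # the last column is the class label
    baseEnt = dataentropy(data, -1)    # entropy before any split
    bestGainRate = 0
    bestFeat = -1
    for i in range(numFea):
        featList = [rowdata[i] for rowdata in data]
        uniqueVals = set(featList)
        newEnt = 0       # conditional entropy after splitting on feature i
        splitInfo = 0    # split information: entropy of feature i's own value distribution
        for value in uniqueVals:
            subData = splitData(data, i, value)
            prob = len(subData) / float(len(data))
            newEnt += prob * dataentropy(subData, i)
            splitInfo -= prob * log(prob, 2)
        info = baseEnt - newEnt        # information gain of feature i
        # A feature with a single value has zero split information and cannot split the data.
        if splitInfo == 0:
            continue
        GainRate = info / splitInfo
        if GainRate > bestGainRate:
            bestGainRate = GainRate
            bestFeat = i
    return bestFeat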
Output:
{'hight': {1: {'look': {1: {'income': {1: {'shape': {0: 0, 1: 1}}, 2: 1, 3: {'shape': {0: 0, 1: 1}}}}, 2: 1, 3: {'shape': {0: 0, 1: 1}}}}, 2: {'shape': {0: 0, 1: 1}}, 3: {'shape': {1: 0, 3: {'look': {0: 0, 1: 1}}}}}}
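The effect of dividing by the split information can be checked on a small example. This is a sketch under stated assumptions: the rows are made up for illustration, with an ID-like column that is unique per row, and the helper names entropy and gain_and_ratio are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (base 2) of a list of values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_and_ratio(rows, i):
    """Return (information gain, gain ratio) of column i; the class is the last column."""
    column = [row[i] for row in rows]
    classes = [row[-1] for row in rows]
    remainder = 0.0
    for v in set(column):
        subset = [c for x, c in zip(column, classes) if x == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    gain = entropy(classes) - remainder
    split_info = entropy(column)  # entropy of the feature's own values
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio


# Hypothetical rows: [row_id, feature, class]
rows = [
    [1, 'a', 1],
    [2, 'a', 1],
    [3, 'b', 0],
    [4, 'b', 0],
]

id_gain, id_ratio = gain_and_ratio(rows, 0)
f_gain, f_ratio = gain_and_ratio(rows, 1)
print(id_gain, id_ratio)
print(f_gain, f_ratio)
```

Both columns reach the same information gain, so ID3 cannot tell them apart. The ID-like column has twice the split information, so its gain ratio is half that of the two-valued feature and C4.5 prefers the latter.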
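Decision tree plotting code - treePlotter
The decision tree can be drawn in code, so there is no need to draw it by hand, piece by piece, from the printed output.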
import treePlotter

# myTree is the nested dict returned by createTree(data, labels)
treePlotter.createPlot(myTree)
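The treePlotter module is the code below. Save it as treePlotter.py in the Python/Lib/site-packages directory (or in the same directory as your script), and then load it with import treePlotter.
treePlotter module code: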
import matplotlib.pyplot as plt

# Box and arrow styles for decision nodes, leaf nodes and edges.
decisionNode = dict(boxstyle="round4", color='#ccccff')
leafNode = dict(boxstyle="circle", color='#66ff99')
arrow_args = dict(arrowstyle="<-", color='#ffcc00')

def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    """Draw one node at centerPt with an arrow coming from parentPt."""
    createPlot.ax1.annotate(str(nodeTxt), xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)

def getNumLeafs(myTree):
    """Count the leaf nodes of the tree."""
    numLeafs = 0
    firstStr = list(myTree.keys())[0]  # keys() is not indexable in Python 3
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if isinstance(secondDict[key], dict):
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs

def getTreeDepth(myTree):
    """Return the number of decision levels on the longest path of the tree."""
    maxDepth = 0
    firstStr = list(myTree.keys())[0]  # keys() is not indexable in Python 3
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if isinstance(secondDict[key], dict):
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth

def plotMidText(cntrPt, parentPt, txtString):
    """Write the branch value halfway along the edge between parent and child."""
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)

def plotTree(myTree, parentPt, nodeTxt):
    """Recursively draw the subtree myTree below parentPt."""
    numLeafs = getNumLeafs(myTree)
    firstStr = list(myTree.keys())[0]  # keys() is not indexable in Python 3
    # Centre this node above the leaves it spans.
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD  # move down one level
    for key in secondDict.keys():
        if isinstance(secondDict[key], dict):
            plotTree(secondDict[key], cntrPt, str(key))
        else:
            plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD  # move back up

def createPlot(inTree):
    """Create the figure, initialise the layout state and draw the whole tree."""
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)
    plotTree.totalW = float(getNumLeafs(inTree))   # total width in leaves
    plotTree.totalD = float(getTreeDepth(inTree))  # total depth in levels
    plotTree.xOff = -0.5 / plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()
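The layout relies on two quantities, the leaf count (totalW) and the depth (totalD). The sketch below computes both without matplotlib so they can be checked on a small tree. The function names num_leaves and depth are hypothetical, and the tree is the income subtree from the hight = 2 branch of the ID3 output.

```python
def num_leaves(tree):
    """Number of leaves under a {feature: {value: subtree_or_class}} node."""
    branches = next(iter(tree.values()))
    return sum(num_leaves(child) if isinstance(child, dict) else 1
               for child in branches.values())


def depth(tree):
    """Number of decision levels on the longest path from this node."""
    branches = next(iter(tree.values()))
    return max(1 + depth(child) if isinstance(child, dict) else 1
               for child in branches.values())


# Subtree taken from the hight = 2 branch of the ID3 output.
subtree = {'income': {1: 1, 2: {'look': {1: 1, 2: 0}}, 3: 0}}

print(num_leaves(subtree), depth(subtree))
```

Each leaf takes a horizontal slot of width 1 / totalW and each level a vertical step of 1 / totalD, which is why both values are computed once in createPlot before drawing starts.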