Decision Tree Model

花匠小妙招
2024-12-15 02:44

Algorithm Overview
Entropy
Information Gain - ID3 Algorithm
Gain Ratio - C4.5 Algorithm
Source Data
Implementation - ID3 Algorithm
Implementation - C4.5 Algorithm
Plotting the Decision Tree - treePlotter

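Algorithm Overview

The decision tree is a common machine learning algorithm and a form of supervised learning. ID3 is a classification algorithm that uses entropy and information gain as its criteria for choosing splits.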

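Entropy

Entropy measures how disordered the information is: the greater the uncertainty of a variable, the larger its entropy. For a dataset S whose samples fall into n classes with proportions p_i, the entropy can be written as:

$$H(S) = -\sum_{i=1}^{n} p_i \log_2 p_i$$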
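As a quick check of the formula, the sketch below computes the entropy of a list of class labels. The function name `entropy` and the two label lists are illustrative assumptions, not part of the original code.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(n / total) * log2(n / total) for n in counts.values())


# Hypothetical label lists, used only to illustrate the formula
pure = ['yes', 'yes', 'yes', 'yes']   # a single class: no uncertainty
mixed = ['yes', 'yes', 'no', 'no']    # two equally likely classes

print(entropy(pure))    # 0.0
print(entropy(mixed))   # 1.0
```

A pure set has entropy 0, and a two-class set split evenly has the maximum entropy of 1 bit.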

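Information Gain - ID3 Algorithm

Information gain is the change in entropy before and after the data is split on a feature. For a feature A that partitions S into subsets S_v, one per value v of A, it can be written as:

$$\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} H(S_v)$$

Splitting on different features changes the entropy by different amounts, so the information gain differs as well. The larger the information gain, the better the feature separates the samples and the more representative it is. ID3 is a top-down greedy strategy: at each node it selects the feature with the maximum information gain.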
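The greedy choice can be sketched in a few lines. The toy rows below (outlook, windy, play) and the helper names `entropy` and `info_gain` are assumptions made for illustration; the class label is taken to be the last item of each row, as in the article's code.

```python
from collections import Counter
from math import log2


def entropy(rows):
    """Entropy of the class label, stored as the last item of each row."""
    counts = Counter(row[-1] for row in rows)
    return sum(-(n / len(rows)) * log2(n / len(rows)) for n in counts.values())


def info_gain(rows, i):
    """Entropy before the split minus the weighted entropy after splitting on column i."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[i], []).append(row)
    after = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy(rows) - after


# Hypothetical toy data: (outlook, windy, play)
rows = [
    ('sunny', 'no', 'yes'),
    ('sunny', 'yes', 'yes'),
    ('rainy', 'no', 'no'),
    ('rainy', 'yes', 'no'),
    ('sunny', 'no', 'yes'),
    ('rainy', 'no', 'yes'),
]
names = ['outlook', 'windy']

gains = {name: info_gain(rows, i) for i, name in enumerate(names)}
best = max(gains, key=gains.get)   # greedy step: the largest gain wins
print(gains)
print(best)
```

On this data `outlook` has a gain of about 0.459 and `windy` about 0.044, so the greedy step picks `outlook`.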

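Selecting features by information gain has a drawback: ID3 tends to prefer features with many distinct values, because such features usually have a larger information gain. (Information gain reflects how much uncertainty is removed once a condition is given. The more finely a dataset is partitioned, the purer each part tends to be, so the conditional entropy is smaller and the information gain is larger.)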
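A small experiment makes the bias visible. In the sketch below, a column that is unique per row (a hypothetical `row_id`) reaches the largest possible gain even though it says nothing useful about new samples. All names and data here are assumptions for illustration.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(values):
    """Shannon entropy (base 2) of a list of hashable values."""
    total = len(values)
    return sum(-(n / total) * log2(n / total) for n in Counter(values).values())


def info_gain(feature, labels):
    """Information gain of one feature column with respect to the class labels."""
    groups = defaultdict(list)
    for f, y in zip(feature, labels):
        groups[f].append(y)
    after = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - after


# Hypothetical columns: row_id is different for every row
labels = ['yes', 'yes', 'no', 'no', 'no', 'no']
row_id = [1, 2, 3, 4, 5, 6]
weather = ['sunny', 'sunny', 'sunny', 'rainy', 'rainy', 'rainy']

print(info_gain(row_id, labels))    # equals the full entropy of labels
print(info_gain(weather, labels))   # smaller, although weather is the meaningful feature
```

Every subset produced by `row_id` holds a single row and is therefore pure, so its gain equals the entropy of the whole label column.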

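Gain Ratio - C4.5 Algorithm

To avoid this weakness of ID3, C4.5 uses the gain ratio as its criterion for choosing a branch. For a feature with many distinct values, the denominator of the gain ratio, SplitInformation(S, A), which we call the split information, dilutes the advantage that the feature would otherwise have in feature selection. The split information (formula 1) and the gain ratio (formula 2) are computed as follows:

$$\mathrm{SplitInformation}(S, A) = -\sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|} \tag{1}$$

$$\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInformation}(S, A)} \tag{2}$$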
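The effect of the denominator can be sketched as follows. The function `gain_ratio` and the sample columns are illustrative assumptions; the split information is computed as the entropy of the feature's own values, which is what formula 1 expresses.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(values):
    """Shannon entropy (base 2) of a list of hashable values."""
    total = len(values)
    return sum(-(n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(feature, labels):
    """Gain ratio of one feature column with respect to the class labels."""
    groups = defaultdict(list)
    for f, y in zip(feature, labels):
        groups[f].append(y)
    after = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    gain = entropy(labels) - after
    split_info = entropy(feature)   # formula 1: entropy of the partition sizes
    # A single-valued feature has split_info == 0 and cannot split the data
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical columns: row_id is different for every row
labels = ['yes', 'yes', 'no', 'no', 'no', 'no']
row_id = [1, 2, 3, 4, 5, 6]
weather = ['sunny', 'sunny', 'sunny', 'rainy', 'rainy', 'rainy']

print(gain_ratio(row_id, labels))    # about 0.355
print(gain_ratio(weather, labels))   # about 0.459
```

The unique-valued column has the larger information gain, but its split information is log2(6), so its gain ratio falls below that of `weather`.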

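Source Data

收入 (income) | 身高 (height) | 长相 (looks) | 体型 (build) | 是否见面 (meet?)
一般 | 高 | 丑 | 胖 | 否
高 | 一般 | 帅 | 瘦 | 是
一般 | 一般 | 一般 | 一般 | 否
高 | 高 | 丑 | 一般 | 是
一般 | 高 | 帅 | 胖 | 是

Values: 高 = high (tall for height), 一般 = average, 帅 = handsome, 丑 = ugly, 胖 = heavy, 瘦 = slim, 是 = yes, 否 = no.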

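These records show how a single woman decided, based on a few basic attributes of a potential date, whether to meet him; only the first five rows are shown here. We want to build a decision tree model from her past decisions, so that the profiles recommended to her are, as far as possible, ones she is willing to meet.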
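For readers without the spreadsheet, the five rows shown can be encoded by hand into the list-of-lists form that the tree-building code works on. This is a sketch: the mapping mirrors the `productDict` used by the article's `Importdata` function, while the names `mapping` and `raw_rows` are assumptions.

```python
# Same category-to-integer mapping as productDict in the article's code
mapping = {'高': 1, '一般': 2, '低': 3, '帅': 1, '丑': 3,
           '胖': 3, '瘦': 1, '是': 1, '否': 0}

# The five sample rows: income, height, looks, build, meet?
raw_rows = [
    ['一般', '高', '丑', '胖', '否'],
    ['高', '一般', '帅', '瘦', '是'],
    ['一般', '一般', '一般', '一般', '否'],
    ['高', '高', '丑', '一般', '是'],
    ['一般', '高', '帅', '胖', '是'],
]

# Feature names as spelled in the article's code (the class label is not included)
labels = ['income', 'hight', 'look', 'shape']

# Each row becomes four integer features followed by the class (1 = meet, 0 = no)
data = [[mapping[value] for value in row] for row in raw_rows]
for row in data:
    print(row)
```

With only five rows any resulting tree is just a toy, but the shapes of `data` and `labels` match what the functions below expect.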

Implementation - ID3 Algorithm

from math import log
import operator

import pandas as pd

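def dataentropy(data, feat):
    """Return the entropy (base 2) of the class labels in data.

    The class label is the last element of each row. feat is not used;
    it is kept so that the existing calls keep working.
    """
    lendata = len(data)
    labelCounts = {}
    for featVec in data:
        category = featVec[-1]
        if category not in labelCounts:
            labelCounts[category] = 0
        labelCounts[category] += 1
    entropy = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / lendata
        entropy -= prob * log(prob, 2)
    return entropy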

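def Importdata(datafile):
    """Read the Excel sheet and encode every category as an integer."""
    dataa = pd.read_excel(datafile)
    # One mapping shared by all columns
    productDict = {'高': 1, '一般': 2, '低': 3, '帅': 1, '丑': 3, '胖': 3, '瘦': 1, '是': 1, '否': 0}
    dataa['income'] = dataa['收入'].map(productDict)
    dataa['hight'] = dataa['身高'].map(productDict)
    dataa['look'] = dataa['长相'].map(productDict)
    dataa['shape'] = dataa['体型'].map(productDict)
    dataa['is_meet'] = dataa['是否见面'].map(productDict)
    # The encoded columns follow the five original ones, so take columns 5 onward
    data = dataa.iloc[:, 5:].values.tolist()
    # Feature names: the encoded columns without the class label is_meet
    labels = dataa.columns[5:-1].tolist()
    return data, labels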

def splitData(data, i, value):
    """Return the rows whose feature i equals value, with column i removed."""
    subset = []
    for featVec in data:
        if featVec[i] == value:
            rfv = featVec[:i]
            rfv.extend(featVec[i+1:])
            subset.append(rfv)
    return subset

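def BestSplit(data):
    """Return the index of the feature with the largest information gain."""
    numFea = len(data[0]) - 1         # the last column is the class label
    baseEnt = dataentropy(data, -1)   # entropy before splitting
    bestInfo = 0
    bestFeat = -1                     # stays -1 if no feature has a positive gain
    for i in range(numFea):
        featList = [rowdata[i] for rowdata in data]
        uniqueVals = set(featList)
        newEnt = 0                    # weighted entropy after splitting on feature i
        for value in uniqueVals:
            subData = splitData(data, i, value)
            prob = len(subData) / float(len(data))
            newEnt += prob * dataentropy(subData, i)
        info = baseEnt - newEnt
        if info > bestInfo:
            bestInfo = info
            bestFeat = i
    return bestFeat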

def majorityCnt(classList):
    """Return the class that occurs most often in classList."""
    c_count = {}
    for i in classList:
        if i not in c_count:
            c_count[i] = 0
        c_count[i] += 1
    ClassCount = sorted(c_count.items(), key=operator.itemgetter(1), reverse=True)
    return ClassCount[0][0]

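def createTree(data, labels):
    """Build the tree recursively as a nested dict. Note: labels is modified in place."""
    classList = [rowdata[-1] for rowdata in data]
    # All samples belong to one class: return it as a leaf
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # No features left: return the majority class
    if len(data[0]) == 1:
        return majorityCnt(classList)
    bestFeat = BestSplit(data)
    # No feature improves the split: fall back to the majority class
    if bestFeat == -1:
        return majorityCnt(classList)
    bestLab = labels[bestFeat]
    myTree = {bestLab: {}}
    del(labels[bestFeat])
    featValues = [rowdata[bestFeat] for rowdata in data]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]   # copy, so sibling branches do not share one list
        myTree[bestLab][value] = createTree(splitData(data, bestFeat, value), subLabels)
    return myTree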

if __name__ == '__main__':
    # Path as it appears in the source; point it at your own copy of the spreadsheet
    datafile = u'E:pythondatatree.xlsx'
    data, labels = Importdata(datafile)
    print(createTree(data, labels))

Run result:

{'hight': {1: {'look': {1: {'income': {1: {'shape': {1: 1, 2: 1}}, 2: 1, 3: {'shape': {1: 1, 2: 0}}}}, 2: 1, 3: {'income': {1: 1, 2: 0}}}}, 2: {'income': {1: 1, 2: {'look': {1: 1, 2: 0}}, 3: 0}}, 3: {'look': {1: {'shape': {3: 0, 1: 1}}, 2: 0, 3: 0}}}}

(Figure: the decision tree corresponding to this result.)
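Once the tree exists as a nested dict, a new candidate can be classified by walking it from the root. The sketch below is one possible way to do this; the function `classify` and the sample candidates are assumptions, while the tree is the ID3 result printed above.

```python
def classify(tree, sample):
    """Follow the branches of a nested-dict tree until a leaf (class label) is reached."""
    while isinstance(tree, dict):
        feature = next(iter(tree))    # the feature tested at this node
        branches = tree[feature]
        value = sample[feature]
        if value not in branches:
            return None               # this value was never seen on this branch
        tree = branches[value]
    return tree


# The ID3 tree from the run result above
tree = {'hight': {
    1: {'look': {1: {'income': {1: {'shape': {1: 1, 2: 1}}, 2: 1,
                                3: {'shape': {1: 1, 2: 0}}}},
                 2: 1,
                 3: {'income': {1: 1, 2: 0}}}},
    2: {'income': {1: 1, 2: {'look': {1: 1, 2: 0}}, 3: 0}},
    3: {'look': {1: {'shape': {3: 0, 1: 1}}, 2: 0, 3: 0}}}}

# Hypothetical candidates, encoded with the same integer mapping as the training data
candidate_a = {'income': 1, 'hight': 2, 'look': 3, 'shape': 1}
candidate_b = {'income': 2, 'hight': 3, 'look': 1, 'shape': 3}

print(classify(tree, candidate_a))   # 1 (meet)
print(classify(tree, candidate_b))   # 0 (do not meet)
```

Returning `None` for an unseen value is a design choice of this sketch; falling back to the majority class of the branch would be another reasonable option.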

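Implementation - C4.5 Algorithm

The logic of C4.5 is very similar to that of ID3. The only difference is that ID3 selects features by information gain while C4.5 uses the gain ratio, so the only part of the code affected is the definition of BestSplit(data), which needs an additional gain ratio computation. The modified BestSplit(data) is as follows: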

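def BestSplit(data):
    """Return the index of the feature with the largest gain ratio."""
    numFea = len(data[0]) - 1
    baseEnt = dataentropy(data, -1)
    bestGainRate = 0
    bestFeat = -1
    for i in range(numFea):
        featList = [rowdata[i] for rowdata in data]
        uniqueVals = set(featList)
        newEnt = 0
        splitonfo = 0   # split information of feature i (formula 1)
        for value in uniqueVals:
            subData = splitData(data, i, value)
            prob = len(subData) / float(len(data))
            newEnt += prob * dataentropy(subData, i)
            splitonfo -= prob * log(prob, 2)
        info = baseEnt - newEnt
        if splitonfo == 0:
            # Feature i has a single value here, so it cannot split the data
            continue
        GainRate = info / splitonfo
        if GainRate > bestGainRate:
            bestGainRate = GainRate
            bestFeat = i
    return bestFeat

Result reported for the original run. The original BestSplit computed splitonfo as dataentropy(subData, i), the class entropy of the last subset, rather than the split information of formula 1, so the corrected function above may produce a different tree: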

{'hight': {1: {'look': {1: {'income': {1: {'shape': {0: 0, 1: 1}}, 2: 1, 3: {'shape': {0: 0, 1: 1}}}}, 2: 1, 3: {'shape': {0: 0, 1: 1}}}}, 2: {'shape': {0: 0, 1: 1}}, 3: {'shape': {1: 0, 3: {'look': {0: 0, 1: 1}}}}}}

Plotting the Decision Tree - treePlotter

The decision tree can be drawn by code; there is no need to draw it by hand from the printed result.

import treePlotter

# myTree is the nested dict returned by createTree(data, labels)
treePlotter.createPlot(myTree)

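The treePlotter module is the code listed below. Save it as a .py file in the Python/Lib/site-packages directory (or anywhere else on the import path), then import it by its file name when you need it.

treePlotter module code: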

import matplotlib.pyplot as plt

# Box and arrow styles for decision nodes, leaf nodes and edges
decisionNode = dict(boxstyle="round4", color='#ccccff')
leafNode = dict(boxstyle="circle", color='#66ff99')
arrow_args = dict(arrowstyle="<-", color='#ffcc00')

def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    """Draw one node at centerPt with an arrow coming from parentPt."""
    createPlot.ax1.annotate(str(nodeTxt), xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)

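def getNumLeafs(myTree):
    """Count the leaf nodes of the tree; this determines the plot width."""
    numLeafs = 0
    # dict.keys() cannot be indexed in Python 3, so convert it to a list first
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if isinstance(secondDict[key], dict):
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs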

def getTreeDepth(myTree):
    """Return the number of decision levels; this determines the plot height."""
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if isinstance(secondDict[key], dict):
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth

def plotMidText(cntrPt, parentPt, txtString):
    """Write the branch value halfway along the edge between parent and child."""
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)

def plotTree(myTree, parentPt, nodeTxt):
    """Draw the subtree rooted at myTree below parentPt; nodeTxt labels the incoming edge."""
    numLeafs = getNumLeafs(myTree)
    firstStr = list(myTree.keys())[0]
    # Center this node horizontally above its own leaves
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    # Move one level down before drawing the children
    plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
    for key in secondDict.keys():
        if isinstance(secondDict[key], dict):
            plotTree(secondDict[key], cntrPt, str(key))
        else:
            plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    # Move back up once all children are drawn
    plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD

def createPlot(inTree):
    """Create the figure and draw the whole tree."""
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)
    # Layout state is stored as attributes on the plotTree function
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5 / plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()
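Where matplotlib is not available, the same nested dict can also be inspected as indented text. The helper below is a hypothetical alternative, not part of treePlotter, and the small tree is made up for the example.

```python
def tree_to_lines(tree, indent=0):
    """Return the nested-dict tree as a list of indented text lines."""
    if not isinstance(tree, dict):
        return [' ' * indent + '-> ' + str(tree)]   # leaf: the predicted class
    lines = []
    feature = next(iter(tree))                      # the feature tested at this node
    for value, subtree in tree[feature].items():
        lines.append(' ' * indent + f'{feature} = {value}')
        lines.extend(tree_to_lines(subtree, indent + 4))
    return lines


# Hypothetical small tree in the same format that createTree returns
tree = {'hight': {1: 1, 2: {'income': {1: 1, 2: 0}}}}
print('\n'.join(tree_to_lines(tree)))
```

Each branch is printed as `feature = value`, with its subtree or leaf indented underneath.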
