Group Members (in order of presentation)
- [Overview, ID3]
- [C4.5]
- [Paper: SLIQ]
References
Overview, ID3
- http://en.wikipedia.org/wiki/Decision_Trees
- http://www.cise.ufl.edu/~ddd/cap6635/Fall97/Short papers/2.htm
- http://www.cis.temple.edu/~ingargio/cis587/readings/id3c45.html
- http://www.cse.unsw.edu.au/~billw/cs9414/notes/ml/06prop/id3/id3.html
- http://www.autonlab.org/tutorials/dtree.html
- http://www.autonlab.org/tutorials/infogain.html
- http://www.rulequest.com/see5comparison.html
What is a decision tree?
- General
  - A graph/model that helps make decisions
What is a decision tree? [cont.]
- Data Mining
  - A predictive model used to classify data.
  - A set of attributes that are tested to predict the outcome [class].
How do we create one?
- We need data.
- CLS/ID3 requirements:
  - Attribute-value description
  - Predefined classes
  - Discrete classes
  - Sufficient test data
Play Baseball?

Outlook   Temperature  Humidity  Wind    Play ball
Sunny     Hot          High      Weak    No
Sunny     Hot          High      Strong  No
Overcast  Hot          High      Weak    Yes
Rain      Mild         High      Weak    Yes
Rain      Cool         Normal    Weak    Yes
Rain      Cool         Normal    Strong  No
Overcast  Cool         Normal    Strong  Yes
Sunny     Mild         High      Weak    No
Sunny     Cool         Normal    Weak    Yes
Rain      Mild         Normal    Weak    Yes
Sunny     Mild         Normal    Strong  Yes
Overcast  Mild         High      Strong  Yes
Overcast  Hot          Normal    Weak    Yes
Rain      Mild         High      Strong  No
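The same table as Python data, handy for checking the numbers on the later slides (a minimal sketch; the tuple layout is just one convenient encoding):

```python
from collections import Counter

# The 14-record "play ball" training set above.
# Column order: (Outlook, Temperature, Humidity, Wind, PlayBall)
RECORDS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

# Class distribution used in the entropy example later on.
print(Counter(r[-1] for r in RECORDS))   # Counter({'Yes': 9, 'No': 5})
```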
Algorithms
- CLS
- ID3
- C4.5
- C5.0/See5
[Chart: comparison of C4.5 and C5 on the sleep, income, and forest datasets. Blue is C5, gray is C4.5.]
CLS
- C = training data
- Step 1: If all [records] in C are positive, then create a YES node and halt. If all [records] in C are negative, create a NO node and halt. Otherwise select an [attribute] with values v1, ..., vn and create a decision node.
- Step 2: Partition the training [records] in C into subsets C1, C2, ..., Cn according to the values of V [v1, ..., vn].
- Step 3: Apply the algorithm recursively to each of the sets Ci.
- Note: the trainer (the expert) decides which [attribute] to select.
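A minimal Python sketch of the CLS recursion as described above. The attribute choice is passed in as a function, since in CLS the trainer (the expert) picks it; the majority-vote fallback is not part of the original description, just a guard so the recursion always terminates:

```python
from collections import Counter

def cls(records, attributes, choose_attribute):
    # records: list of dicts, e.g. {"Outlook": "Sunny", ..., "class": "No"}
    classes = [r["class"] for r in records]
    if all(c == "Yes" for c in classes):   # Step 1: all positive -> YES node
        return "YES"
    if all(c == "No" for c in classes):    # Step 1: all negative -> NO node
        return "NO"
    if not attributes:                     # guard, not in the original CLS:
        return Counter(classes).most_common(1)[0][0].upper()
    a = choose_attribute(records, attributes)   # the expert's choice
    node = {"split on": a, "branches": {}}
    remaining = [x for x in attributes if x != a]
    for v in {r[a] for r in records}:      # Step 2: partition C by v1..vn
        subset = [r for r in records if r[a] == v]
        node["branches"][v] = cls(subset, remaining, choose_attribute)  # Step 3
    return node
```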
ID3
- Extended from CLS.
- Adds an attribute selection heuristic.
- Searches through the data and selects the best attribute, the one that best separates the data.
ID3 Attribute Selection
- Information Gain
  - Measures how well an attribute separates data into classes.
  - Select the attribute with the highest information gain = most useful for classification = most informative = a good decision.
  - To define information gain we need entropy [from information theory = quantifying information].
- Entropy
  - Measures the amount of information in an attribute.
Entropy
- Entropy(S) = −Σ p(I) log2 p(I)
- S = set of samples
- I = value of the class attribute
- Example
  - S has 14 samples, 9 = YES, 5 = NO
  - Entropy(S) = −Σ p(I) log2 p(I)
    = −(p(YES) log2 p(YES) + p(NO) log2 p(NO))
    = −((9/14) log2(9/14) + (5/14) log2(5/14))
    = −(.643(−.637) + .357(−1.485)) = 0.94
- If S had an equal distribution, 7 = YES and 7 = NO: Entropy(S) = 1, interpreted as totally random.
- If S had 14 = YES, 0 = NO: Entropy(S) = 0, interpreted as perfectly classified.
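The entropy calculation above as a short Python check (a sketch; counts is just a list of per-class sample counts):

```python
from math import log2

def entropy(counts):
    # Entropy(S) = -sum over class values of p(I) * log2 p(I)
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

print(entropy([9, 5]))    # ~0.940: the 9-YES / 5-NO example
print(entropy([7, 7]))    # 1.0: equal split, totally random
print(entropy([14, 0]))   # 0.0 (Python prints -0.0): perfectly classified
```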
Information Gain
- Gain(S, A) = information gain of S due to attribute A
- Gain(S, A) = Entropy(S) − Entropy(S, A)
- Entropy(S, A) = Σ ((|Sv|/|S|) · Entropy(Sv)), summed across all possible values v of attribute A
- Sv = subset of S for which attribute A has value v
- |Sv| = number of elements in Sv
- |S| = number of elements in S
Information Gain Example
- Example using the Wind attribute from the dataset
- Wind can be Weak or Strong; |S| = 14
- |S(Wind=Weak)| = 8: YES = 6, NO = 2
- |S(Wind=Strong)| = 6: YES = 3, NO = 3
- Entropy(S, Wind) = Σ ((|Sv|/|S|) · Entropy(Sv)) = (8/14) Entropy(S_weak) + (6/14) Entropy(S_strong)
- Entropy(S_weak) = −((6/8) log2(6/8) + (2/8) log2(2/8)) = 0.811
- Entropy(S_strong) = −((3/6) log2(3/6) + (3/6) log2(3/6)) = 1.0
- Gain(S, Wind) = Entropy(S) − Entropy(S, Wind) = 0.94 − ((8/14)(0.811) + (6/14)(1.0)) = 0.048
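The same Wind computation in Python, reusing the entropy function from the earlier sketch (repeated here so the snippet runs on its own):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

entropy_S = entropy([9, 5])                 # ~0.940 for the full set

# (YES, NO) counts within each value of Wind, from the table
weak, strong = [6, 2], [3, 3]
entropy_S_wind = (8 / 14) * entropy(weak) + (6 / 14) * entropy(strong)

print(round(entropy_S - entropy_S_wind, 3))  # Gain(S, Wind) ~ 0.048
```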
ID3
- Use majority voting [when a leaf still contains records of more than one class]
Gain Ratios
- Gain ratios are used to address the bias of the ID3 algorithm: information gain favors attributes with many distinct values (such as a record id).

GainRatio_A(D) = Gain(A) / SplitInfo_A(D)

where SplitInfo_A(D) = −Σj (|Dj|/|D|) log2(|Dj|/|D|), summed over the subsets Dj that splitting on A produces.
From Professor Anita Wasilewska's example slides. (These worked examples use a different 14-record training set, again with 9 positive [P] and 5 negative [N] records; I(P, N) denotes the entropy of that split.)
The rec (record id) attribute:
- I(P, N) = −(9/14) log2(9/14) − (5/14) log2(5/14) = .643(0.64) + (.357)(1.49) = .944
- Each single-record subset: I(Pi, Ni) = −(0/(0+1)) log2(0/(0+1)) − (1/(0+1)) log2(1/(0+1)) = 0 [taking 0·log2 0 = 0] + (1)(0) = 0
- E(rec) = I(Pr1, Nr1) + I(Pr2, Nr2) + ... = 0
- Gain(rec) = .944 − 0 = .944
- SplitInfo_rec(Root) = −14 · ((1/14) · log2(1/14)) = 14 · 0.271953923 = 3.807354922
- GainRatio_rec(Root) = .944 / 3.807354922 = 0.248
The Student attribute:
- I(P, N) = .944 (as above)
- I(P1, N1) = −(6/(6+1)) log2(6/(6+1)) − (1/(6+1)) log2(1/(6+1)) = .591
- I(P2, N2) = −(3/(3+4)) log2(3/(3+4)) − (4/(3+4)) log2(4/(3+4)) = .987
- E(Student) = ((6+1)/14)·.591 + ((3+4)/14)·.987 = .296 + .493 = .789
- Gain(Student) = .944 − .789 = .155
- SplitInfo_Student(Root) = −(7/14)·log2(7/14) − (7/14)·log2(7/14) = 1
- GainRatio_Student(Root) = .155 / 1 = 0.155
The Income attribute:
- I(P, N) = .944 (as above)
- I(P1, N1) = −(2/(2+2)) log2(2/(2+2)) − (2/(2+2)) log2(2/(2+2)) = 1
- I(P2, N2) = −(4/(4+2)) log2(4/(4+2)) − (2/(4+2)) log2(2/(4+2)) = .918
- I(P3, N3) = −(3/(3+1)) log2(3/(3+1)) − (1/(3+1)) log2(1/(3+1)) = .811
- E(Income) = ((2+2)/14)·1 + ((4+2)/14)·.918 + ((3+1)/14)·.811 = .286 + .393 + .232 = .911
- Gain(Income) = .944 − .911 = .033
- SplitInfo_Income(Root) = −(4/14)·log2(4/14) − (6/14)·log2(6/14) − (4/14)·log2(4/14) = 1.557
- GainRatio_Income(Root) = .033 / 1.557 = 0.0212
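A Python sketch of the whole gain-ratio computation, checked against the Income example. (With unrounded entropies the ratio comes out near 0.019; the slide's 0.0212 reflects the rounded intermediate values above.)

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

def gain_ratio(root_counts, partitions):
    # partitions: one (P, N) pair of class counts per attribute value
    n = sum(root_counts)
    e = sum((p + q) / n * entropy([p, q]) for p, q in partitions)
    gain = entropy(root_counts) - e
    # SplitInfo is the entropy formula applied to the subset *sizes*
    split_info = entropy([p + q for p, q in partitions])
    return gain / split_info

# Income: 9 positive / 5 negative at the root, split into (2,2), (4,2), (3,1)
print(gain_ratio([9, 5], [(2, 2), (4, 2), (3, 1)]))   # ~0.019
```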
C4.5
An extension of ID3. The algorithm is very similar, with the following differences:
- Uses Gain Ratio to find the attribute to split on, instead of just Gain.
- Can handle continuous attributes. First the table is sorted by the continuous attribute, A. A threshold h from the attribute's value list is chosen, splitting A into A ≤ h and A > h. The h used is the one that maximizes the Gain Ratio. (A sketch follows this list.)
- Can build with training sets with unknown attribute values. Only records with defined values are considered for the gain ratio.
- Can use test data with unknown attribute values. When an attribute is missing, we estimate its value by the probability of the various results.
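A sketch of the threshold search for a continuous attribute, under two simplifications: it scores candidate thresholds with plain information gain rather than gain ratio, and the humidity readings are hypothetical stand-ins for the High/Normal column:

```python
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))           # sort the table by attribute A
    best_h, best_gain = None, -1.0
    for i in range(1, len(pairs)):
        h = (pairs[i - 1][0] + pairs[i][0]) / 2   # candidate threshold
        left = [c for v, c in pairs if v <= h]    # subset with A <= h
        right = [c for v, c in pairs if v > h]    # subset with A > h
        if not left or not right:
            continue
        n = len(pairs)
        gain = entropy(labels) - (len(left) / n) * entropy(left) \
                               - (len(right) / n) * entropy(right)
        if gain > best_gain:
            best_h, best_gain = h, gain
    return best_h, best_gain

humidity = [85, 90, 78, 96, 80, 70, 65, 95, 70, 80]   # hypothetical readings
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes"]
print(best_threshold(humidity, play))
```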
Mean and variance for a Bernoulli trial with success probability p:

mean = p, variance = p(1 − p)

For a chosen confidence c, pick z so that

Pr[−z ≤ X ≤ z] = c
Taken from Dr. Gregory Piatetsky-Shapiro's slides.
For a standard normal X this is equivalent to

Pr[−z ≤ X ≤ z] = 1 − 2·Pr[X ≥ z]

Transforming f
- Transformed value for f (i.e. subtract the mean and divide by the standard deviation):

  (f − p) / √(p(1 − p)/N)

- Resulting equation:

  Pr[−z ≤ (f − p)/√(p(1 − p)/N) ≤ z] = c

- Solving for p:

  p = ( f + z²/2N ± z·√(f/N − f²/N + z²/4N²) ) / (1 + z²/N)
Taken from Dr. Gregory Piatetsky-Shapiro's slides.
C4.5 Methods
- Error estimate for a subtree is the weighted sum of the error estimates for all of its leaves.
- Error estimate for a node (upper bound, taking the + root of the equation for p above):

  e = ( f + z²/2N + z·√(f/N − f²/N + z²/4N²) ) / (1 + z²/N)
Taken from Dr. Gregory Piatetsky-Shapiro's slides.
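The node error estimate as a Python function. The confidence-derived z here is an assumption: z ≈ 0.69, the value usually quoted for C4.5's default 25% confidence level, is not stated on these slides:

```python
from math import sqrt

def error_estimate(f, n, z=0.69):
    # Upper bound e on a node's true error rate, given observed
    # error rate f over n records (the formula above).
    return (f + z * z / (2 * n)
            + z * sqrt(f / n - f * f / n + z * z / (4 * n * n))) \
           / (1 + z * z / n)

print(round(error_estimate(5 / 14, 14), 2))   # ~0.45 for f = 5/14, N = 14
```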
C4.5 Pruning

Color  Class
red    1
red    1
red    1
red    1
red    1
red    1
white  2
blue   1
blue   1
blue   1
blue   1
blue   1
blue   1
blue   1
blue   1
blue   1
Based on numbers from J. R. Quinlan's slides.
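Applying the estimate to this table, with the same error_estimate function and the same assumed z = 0.69 as in the previous sketch: each leaf of the Color split predicts its majority class with zero observed errors, while collapsing the split leaves one error (the white record) among 16. Comparing the two estimates is the pruning decision; under these assumptions the subtree's estimate is lower, so the split would be kept:

```python
from math import sqrt

def error_estimate(f, n, z=0.69):
    return (f + z * z / (2 * n)
            + z * sqrt(f / n - f * f / n + z * z / (4 * n * n))) \
           / (1 + z * z / n)

# (observed errors, records) for the red, white, and blue leaves
leaves = [(0, 6), (0, 1), (0, 9)]
subtree = sum(n / 16 * error_estimate(err / n, n) for err, n in leaves)

collapsed = error_estimate(1 / 16, 16)         # one node: 1 error in 16

print(round(subtree, 3), round(collapsed, 3))  # ~0.076 vs ~0.118
```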
C5.0/See5
Ross Quinlan, the creator of ID3 and C4.5, went on to create a commercial improvement on these algorithms called C5.0 for Unix/Linux and See5 for Windows.
- Speed: C5.0 is orders of magnitude faster than C4.5.
- Memory usage: C5.0 is more memory-efficient than C4.5.
- Smaller decision trees: C5.0 gets similar results to C4.5 with considerably smaller decision trees.
References
- Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
- Quinlan, J. R. Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4:77-90, 1996.
- Quinlan, J. R. C4.5 and Beyond. www.cs.uvm.edu/~xwu/kdd/Slides/C4.5byRossQuinlanforICDM06.pdf, 2006.
- Wasilewska, Anita: Decision Tree Examples. www.cs.sunysb.edu/~cse634/examplesdtree.pdf
- Wikipedia: C4.5 algorithm. http://en.wikipedia.org/wiki/C4.5_algorithm
- Piatetsky-Shapiro, Gregory: Machine Learning in Real World: C4.5. http://www.kdnuggets.com/data_mining_course/