
MODULE 11

Decision Trees
LESSON 22
Construction of Decision Trees
Keywords: Entropy, Impurity, Variance, Misclassification

Construction of Decision Trees


A decision tree is induced from training examples and is therefore an
example of inductive learning or learning from examples.
The training set has a number of labelled patterns of each class.
The decision tree is induced by using the values of each feature for the patterns of the various classes.
At each node, an attribute must be chosen on which to base the decision.
The attribute chosen at each node should be the one that makes the most difference to the classification, that is, the most discriminative attribute.
Whenever a decision is made, the example set is split according to the various outcomes. Consider a two-class example where the node splits the positive and negative examples into two outcomes. If the positive and negative examples are split equally between the two outcomes, it is not a good split, as it leaves each subset with almost the same proportion of positive and negative examples as before. On the other hand, if a split puts all the positive examples on one side and all the negative examples on the other, it is the best split, as it has succeeded in classifying the examples as positive or negative.
If the split results in a certain answer as to the class, then it is a good split.
Whenever an attribute is chosen, the different outcomes represent new decision trees over a subset of the examples, and again the most important attribute is chosen. This is repeated until the classification is complete; a sketch of this recursive procedure is given below.
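The following is a minimal Python sketch of this recursive procedure. The names build_tree and choose_best_attribute and the (feature-dictionary, class-label) example format are ours, used only for illustration; the caller supplies choose_best_attribute, which implements the attribute-selection criterion (for example, the information gain developed later in this lesson).

def build_tree(examples, attributes, choose_best_attribute):
    """Recursively induce a decision tree from (feature-dict, class-label) pairs."""
    labels = [label for _, label in examples]
    # All examples belong to one class: classification is complete, return a leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left to test: return the majority class.
    if not attributes:
        return max(set(labels), key=labels.count)
    # Pick the most discriminative attribute (e.g. the one with largest information gain).
    best = choose_best_attribute(examples, attributes)
    remaining = [a for a in attributes if a != best]
    subtrees = {}
    # Each outcome of the chosen attribute starts a new tree on a subset of the examples.
    for value in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == value]
        subtrees[value] = build_tree(subset, remaining, choose_best_attribute)
    return (best, subtrees)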
To choose the attribute at each node, we need to measure how pure a node is; that is, we measure the impurity of a node.
Measures of Impurity
All the measures are computed from the fraction of patterns of each class going along a branch after a split.

At each node, the measure of impurity and the resulting gain in information are computed for the different attributes, and the attribute with the largest information gain is chosen. The different measures of impurity are given below.
1. Entropy Impurity or Information Impurity
At a node q, the entropy impurity Im(q) is given by

Im(q) = -\sum_i P(C_i) \log_2 P(C_i)

where P(C_i) is the fraction of patterns at node q belonging to category C_i.


As an example, if the n patterns at a node divide equally between two classes, n/2 and n/2, then

P(C_1) = P(C_2) = (n/2)/n = 0.5

Then the entropy impurity will be

Im(q) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1
If all the patterns go along one branch and no patterns along the other, then for branch 1

P(C_i) = n/n = 1.0

and for branch 2

P(C_i) = 0/n = 0

and

Im(q) = -1.0 \log_2 1.0 = 0


If we have 50 examples which split into three classes, with the first class getting 14 patterns, the second class 25 patterns and the third class 11 patterns, then we get

P(C_1) = 14/50 = 0.28

P(C_2) = 25/50 = 0.5

P(C_3) = 11/50 = 0.22

and the entropy impurity will be

Im(q) = -0.28 \log_2 0.28 - 0.5 \log_2 0.5 - 0.22 \log_2 0.22 = 1.49
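These entropy impurity values can be checked with a short Python sketch; the function name entropy_impurity is ours, introduced only for illustration:

import math

def entropy_impurity(fractions):
    # Entropy impurity of a node, given the fraction of patterns in each class.
    # Terms with a zero fraction contribute nothing (0 log 0 is taken as 0).
    return -sum(p * math.log2(p) for p in fractions if p > 0)

print(entropy_impurity([0.5, 0.5]))         # 1.0, the equal two-class split above
print(entropy_impurity([1.0, 0.0]))         # 0.0, a pure node (Python may print -0.0)
print(entropy_impurity([0.28, 0.5, 0.22]))  # roughly 1.49, the 50-example case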
2. Variance Impurity
For a two-category problem, the variance impurity is

Im(q) = P(C_1) P(C_2)

If the patterns split equally into the two classes, then

Im(q) = 0.5 \times 0.5 = 0.25

If all the patterns go along one branch and no patterns along the other, we get

Im(q) = 1.0 \times 0 = 0

If the division is 0.8 to Class 1 and 0.2 to Class 2, we get

Im(q) = 0.8 \times 0.2 = 0.16
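A couple of lines of Python reproduce these three cases (the function name is again ours):

def variance_impurity(p1, p2):
    # Two-category variance impurity: the product of the two class fractions.
    return p1 * p2

print(variance_impurity(0.5, 0.5))  # 0.25, equal split
print(variance_impurity(1.0, 0.0))  # 0.0, pure node
print(variance_impurity(0.8, 0.2))  # 0.16 (up to floating-point rounding)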

In the case of more classes, we have to use the generalization of the above equation. This is called the Gini impurity and is given by

Im(q) = \sum_{i \neq j} P(C_i) P(C_j) = \frac{1}{2} \left[ 1 - \sum_j P^2(C_j) \right]

If we have 50 examples which split into three classes, with the first class getting 14 patterns, the second class 25 patterns and the third class 11 patterns, then we get

P(C_1) = 14/50 = 0.28

P(C_2) = 25/50 = 0.5

P(C_3) = 11/50 = 0.22

Then the Gini impurity is

Im(q) = \frac{1}{2} \left[ 1 - 0.28^2 - 0.5^2 - 0.22^2 \right] = 0.31
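A short Python sketch of the Gini impurity (function name ours) reproduces this value; note that for two classes it reduces to P(C_1)P(C_2), the variance impurity above:

def gini_impurity(fractions):
    # Gini impurity: one half of (1 minus the sum of squared class fractions).
    return 0.5 * (1 - sum(p * p for p in fractions))

print(gini_impurity([0.28, 0.5, 0.22]))  # roughly 0.31, the 50-example case
print(gini_impurity([0.5, 0.5]))         # 0.25, matching the two-class variance impurity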
3. Misclassification Impurity

Im(q) = 1 - \max_i P(C_i)

In the above case, in which

P(C_1) = 14/50 = 0.28

P(C_2) = 25/50 = 0.5

P(C_3) = 11/50 = 0.22

the misclassification impurity will be

Im(q) = 1 - \max(0.28, 0.5, 0.22) = 1 - 0.5 = 0.5
This gives the minimum probability of error in classification.
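In Python this measure is essentially a one-liner (function name ours):

def misclassification_impurity(fractions):
    # Misclassification impurity: 1 minus the largest class fraction.
    return 1 - max(fractions)

print(misclassification_impurity([0.28, 0.5, 0.22]))  # 0.5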
Using the impurity of a split, the attribute to be chosen at a decision
node can be decided.
Finding the Attribute to Split at a Node
At each node, the attribute to be used for splitting is to be determined
so as to get the smallest and the most efficient decision tree.
Consider a two-class problem. Let there be n_1 patterns of Class 1 and n_2 patterns of Class 2 before splitting. If the split at the node puts half the patterns of Class 1 and half the patterns of Class 2 at the left node, and the remaining halves at the right node, the splitting has not succeeded in separating the two classes. However, if all the n_1 patterns of Class 1 are at the left node and all the n_2 patterns of Class 2 are at the right node, this is a very good decision as it separates the two classes.
The split is made according to the entropy or uncertainty remaining.
The entropy or uncertainty of the patterns before splitting is

Im(n) = -\sum_i p(i) \log_2 p(i)

In this case, it will be

Im(n) = -\frac{n_1}{n_1+n_2} \log_2 \left( \frac{n_1}{n_1+n_2} \right) - \frac{n_2}{n_1+n_2} \log_2 \left( \frac{n_2}{n_1+n_2} \right)

Let a decision rule split the patterns into two branches. Let there be n_{11} patterns of Class 1 and n_{12} patterns of Class 2 in the left branch, and n_{21} patterns of Class 1 and n_{22} patterns of Class 2 in the right branch.

Figure 1: The splitting at a decision node. Before the split the node has Class 1: 100, Class 2: 100, Class 3: 100 patterns. The decision splits these into three branches: the left branch with Class 1: 28, Class 2: 45, Class 3: 53; the middle branch with Class 1: 60, Class 2: 45, Class 3: 10; and the right branch with Class 1: 12, Class 2: 10, Class 3: 37.


Then, writing n_L = n_{11} + n_{12} for the number of patterns in the left branch and n_R = n_{21} + n_{22} for the number in the right branch, the impurity of the left branch is

Im(left) = -\frac{n_{11}}{n_L} \log_2 \frac{n_{11}}{n_L} - \frac{n_{12}}{n_L} \log_2 \frac{n_{12}}{n_L}

The impurity of the right branch is

Im(right) = -\frac{n_{21}}{n_R} \log_2 \frac{n_{21}}{n_R} - \frac{n_{22}}{n_R} \log_2 \frac{n_{22}}{n_R}

The drop in impurity, with n = n_1 + n_2 the total number of patterns, is

\Delta Im(n) = Im(n) - \frac{n_L}{n} Im(left) - \frac{n_R}{n} Im(right)

The drop in impurity is also called the gain in information. Therefore

IG = Im(n) - \frac{n_L}{n} Im(left) - \frac{n_R}{n} Im(right)
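This computation can be sketched in Python as the parent impurity minus the weighted impurities of the branches. The function names are ours, the inputs are per-class pattern counts, and entropy impurity is used, though any of the measures above could be substituted; the sketch accepts any number of branches, since the same weighted sum applies.

import math

def entropy_from_counts(counts):
    # Entropy impurity of a node described by per-class pattern counts.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, branch_counts):
    # Drop in impurity: parent impurity minus the weighted branch impurities.
    n = sum(parent_counts)
    gain = entropy_from_counts(parent_counts)
    for counts in branch_counts:
        gain -= (sum(counts) / n) * entropy_from_counts(counts)
    return gain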
For example, consider Figure 1, which shows a three-class problem split by the decision into three outcomes. The figure gives the number of patterns of each class before splitting at the node and in each branch after the split. Let us calculate the information gain at this node. Note that a node can also be represented by the number of patterns of each class associated with it.
The impurity at the node before the split is

Im(n) = -\frac{1}{3} \log_2 \frac{1}{3} - \frac{1}{3} \log_2 \frac{1}{3} - \frac{1}{3} \log_2 \frac{1}{3} = 1.585
After the split, the impurity of the leftmost branch is

Total patterns in the left branch = 126

Im(left) = -\frac{28}{126} \log_2 \frac{28}{126} - \frac{45}{126} \log_2 \frac{45}{126} - \frac{53}{126} \log_2 \frac{53}{126} = 1.538

The impurity of the middle branch is

Total patterns in the middle branch = 115

Im(middle) = -\frac{60}{115} \log_2 \frac{60}{115} - \frac{45}{115} \log_2 \frac{45}{115} - \frac{10}{115} \log_2 \frac{10}{115} = 1.326

The impurity of the rightmost branch is

Total patterns in the right branch = 59

Im(right) = -\frac{12}{59} \log_2 \frac{12}{59} - \frac{10}{59} \log_2 \frac{10}{59} - \frac{37}{59} \log_2 \frac{37}{59} = 1.324

The information gain for this node is

IG = 1.585 - \frac{126}{300} \times 1.538 - \frac{115}{300} \times 1.326 - \frac{59}{300} \times 1.324 = 0.17
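As a check, the same numbers can be fed to the information_gain sketch given after the IG formula above (this snippet assumes those two functions are available):

# Per-class pattern counts from Figure 1: the parent node and its three branches.
parent = [100, 100, 100]
branches = [[28, 45, 53], [60, 45, 10], [12, 10, 37]]

print(round(information_gain(parent, branches), 2))  # roughly 0.17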
