
Decision Trees Slide 1 of 40

DECISION TREE INDUCTION

What is a decision tree?


The basic decision tree induction procedure
From decision trees to production rules
Dealing with missing values
Inconsistent training data
Incrementality
Handling numerical attributes
Attribute value grouping
Alternative attribute selection criteria
The problem of overfitting

P.D.Scott University of Essex


Decision Trees Slide 2 of 40

LOAN EVALUATION

When a bank is asked to make a loan it needs to assess how
likely it is that the borrower will be able to repay the loan.
Collectively the bank has a lot of experience of making loans
and discovering which ones are ultimately repaid.
However, any individual bank employee has only a limited
amount of experience.
It would thus be very helpful if, somehow, the bank's collective
experience could be used to construct a set of rules (or a
computer program embodying those rules) that could be used
to assess the risk that a prospective loan would not be repaid.
How?
What we need is a system that could take all the bank's data
on previous borrowers and the outcomes of their loans, and
learn such a set of rules.
One widely used approach is decision tree induction.

P.D.Scott University of Essex


Decision Trees Slide 3 of 40

WHAT IS A DECISION TREE?

The following is a very simple decision tree that assigns
animals to categories:

Skin Covering?
  Feathers -> Beak?
                Hooked   -> Eagle
                Straight -> Heron
  Fur      -> Teeth?
                Sharp -> Lion
                Blunt -> Lamb
  Scales   -> Fish

Thus a decision tree can be used to predict the category (or
class) to which an example belongs.

P.D.Scott University of Essex


Decision Trees Slide 4 of 40

So what is a decision tree?

A tree in which:
Each terminal node (leaf) is associated with a class.
Each non-terminal node is associated with one of the
attributes that examples possess.
Each branch is associated with a particular value that
the attribute of its parent node can take.
What is decision tree induction?
A procedure that, given a training set, attempts to build a
decision tree that will correctly predict the class of any
unclassified example.
What is a training set?
A set of classified examples, drawn from some population of
possible examples. The training set is almost always a very
small fraction of the population.
What is an example?
Typically decision trees operate using examples that take the
form of feature vectors.
A feature vector is simply a vector whose elements are the
values taken by the example's attributes.
For example, a heron might be represented as:

Skin Covering    Beak        Teeth
Feathers         Straight    None

P.D.Scott University of Essex


Decision Trees Slide 5 of 40

THE BASIC DECISION TREE ALGORITHM

FUNCTION build_dec_tree(examples,atts)
// Takes a set of classified examples and
// a list of attributes, atts. Returns the
// root node of a decision tree
Create node N;
IF examples are all in same class
THEN RETURN N labelled with that class;
IF atts is empty
THEN RETURN N labelled with modal example
class;
best_att = choose_best_att(examples,atts);
label N with best_att;
FOR each value ai of best_att
si = subset of examples with best_att = ai;
IF si is not empty
THEN
new_atts = atts - best_att;
subtree = build_dec_tree(si,new_atts);
attach subtree as child of N;
ELSE
Create leaf node L;
Label L with modal example class;
attach L as child of N;
RETURN N;
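The pseudocode can be turned into a short program. The following is a
minimal Python sketch under some assumptions of my own: examples are
(attribute-dictionary, class) pairs, the tree is built from nested
('node', ...) and ('leaf', ...) tuples, and choose_best_att picks the
attribute with the highest information gain. It only creates branches
for attribute values that actually occur in the examples, whereas the
pseudocode adds a modal-class leaf for values with no examples.

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def choose_best_att(examples, atts):
    # Lowest remaining (average) information = highest information gain.
    def remainder(att):
        total = len(examples)
        rem = 0.0
        for value in {f[att] for f, _ in examples}:
            subset = [c for f, c in examples if f[att] == value]
            rem += (len(subset) / total) * entropy(subset)
        return rem
    return min(atts, key=remainder)

def build_dec_tree(examples, atts):
    classes = [c for _, c in examples]
    if len(set(classes)) == 1:                     # all in the same class
        return ('leaf', classes[0])
    if not atts:                                   # no attributes left
        return ('leaf', Counter(classes).most_common(1)[0][0])
    best = choose_best_att(examples, atts)
    children = {}
    for value in {f[best] for f, _ in examples}:
        subset = [(f, c) for f, c in examples if f[best] == value]
        children[value] = build_dec_tree(subset, [a for a in atts if a != best])
    return ('node', best, children)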

P.D.Scott University of Essex


Decision Trees Slide 6 of 40

Choosing the Best Attribute

What is the best attribute?

Many possible definitions.


A reasonable answer would be the attribute that best
discriminates the examples with respect to their classes.
So what does "best discriminates" mean?

Still many possible answers.


Many different criteria have been used.
The most popular is information gain.
What is information gain?

P.D.Scott University of Essex


Decision Trees Slide 7 of 40

Shannon's Information Function

Suppose there is a situation with N possible outcomes and you
do not yet know which will occur.
How much information have you acquired once you know
what the outcome is?

Consider some examples when the outcomes are all equally
likely:

Coin toss              2 outcomes     1 bit of information
Pick 1 card from 8     8 outcomes     3 bits of information
Pick 1 card from 32    32 outcomes    5 bits of information

In general, for N equiprobable outcomes

Information = log2(N) bits

Since the probability of each outcome is p = 1/N, we can also
express this as

Information = -log2(p) bits
Non-equiprobable outcomes

Consider picking 1 card from a pack containing 127 red and 1
black.
There are 2 possible outcomes but you would be almost
certain that the result would be red.
Thus being told the outcome usually gives you less
information than being told the outcome of an experiment with
two equiprobable outcomes.

P.D.Scott University of Essex


Decision Trees Slide 8 of 40

Shannon's Function

We need an expression that reflects the fact that there is less
information to be gained when we already know that some
outcomes are more likely than others.
Shannon derived the following function:

Information = - Σ_{i=1..N} p_i log2(p_i)  bits

where N is the number of alternative outcomes and p_i is the
probability of the i-th outcome.

Notice that it reduces to -log2(p) when the outcomes are all
equiprobable.
If there are only two outcomes, it takes this form:

[Figure: Information in bits for a two-outcome experiment, plotted
against the probability of one outcome. It is 0 at probabilities 0
and 1 and peaks at 1 bit when the probability is 0.5.]

Information is also sometimes called uncertainty or entropy.
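A minimal Python sketch of this two-outcome case (the function name
is my own) reproduces the shape of the curve described above:

from math import log2

def binary_information(p):
    # Shannon information (bits) for two outcomes with probabilities p and 1 - p.
    if p in (0.0, 1.0):
        return 0.0          # a certain outcome carries no information
    return -(p * log2(p) + (1 - p) * log2(1 - p))

for p in (0.1, 0.3, 0.5, 0.9):
    print(p, round(binary_information(p), 3))
# 0.1 -> 0.469, 0.3 -> 0.881, 0.5 -> 1.0, 0.9 -> 0.469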

P.D.Scott University of Essex


Decision Trees Slide 9 of 40

Using Information to Assess Attributes

Suppose
You have a set of 100 examples, E
These examples fall in two classes, c1 and c2
70 are in c1
30 are in c2
How uncertain are you about the class an example belongs
to?
Information = -p(c1)log2(p(c1)) - p(c2)log2(p(c2))
= -0.7 log2(0.7) - 0.3 log2(0.3)
= -(0.7 x -0.51 + 0.3 x -1.74) = 0.88 bits
Now suppose
A is one of the example attributes with values v1 and v2
The 100 examples are distributed thus
        v1    v2
c1      63     7
c2       6    24

What is the uncertainty for the examples whose A value is v1?
There are 69 of them; 63 in c1 and 6 in c2.
So for this subset, p(c1) = 63/69 = 0.913
and p(c2) = 6/69 = 0.087
Hence
Information = -0.913 log2(0.913) - 0.087 log2(0.087) = 0.43

P.D.Scott University of Essex


Decision Trees Slide 10 of 40

Similarly, what is the uncertainty for the examples whose A
value is v2?
There are 31 of them; 7 in c1 and 24 in c2.
So for this subset, p(c1) = 7/31 = 0.226
and p(c2) = 24/31 = 0.774
Hence
Information = -0.226 log2(0.226) - 0.774 log2(0.774) = 0.77

So, if we know the value of attribute A:
The uncertainty is 0.43 if the value is v1.
The uncertainty is 0.77 if the value is v2.
But 69% have value v1 and 31% have value v2.
Hence the average uncertainty if we know the value of
attribute A will be 0.69 x 0.43 + 0.31 x 0.77 = 0.54.
Compare this with the uncertainty if we don't know the value
of A, which we calculated earlier as 0.88.
Hence attribute A provides an information gain of
0.88 - 0.54 = 0.34 bits
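These figures can be checked with a few lines of Python (a sketch of
my own; the helper name is arbitrary). The exact average is about
0.53, which the slide rounds to 0.54 after first rounding the two
subset values:

from math import log2

def information(counts):
    # Shannon information for a class distribution given as counts.
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

i_all = information([70, 30])                   # ~0.88 bits
i_v1 = information([63, 6])                     # ~0.43 bits
i_v2 = information([7, 24])                     # ~0.77 bits
i_avg = (69 / 100) * i_v1 + (31 / 100) * i_v2   # ~0.53 bits
gain = i_all - i_avg                            # ~0.35 bits (0.34 with the rounding above)
print(round(i_all, 2), round(i_avg, 2), round(gain, 2))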

P.D.Scott University of Essex


Decision Trees Slide 11 of 40

AN EXAMPLE

Suppose we have a training set of data derived from weather
records.
These contain four attributes:

Attribute        Possible Values
Temperature      Warm; Cool
Cloud Cover      Overcast; Cloudy; Clear
Wind             Windy; Calm
Precipitation    Rain; Dry

We want to build a system that predicts precipitation from the
other three attributes.
The training data set is:
[ warm, overcast, windy; rain ]
[ cool, overcast, calm; dry ]
[ cool, cloudy, windy; rain ]
[ warm, clear, windy; dry ]
[ cool, clear, windy; dry ]
[ cool, overcast, windy; rain ]
[ cool, clear, calm; dry ]
[ warm, overcast, calm; dry ]

P.D.Scott University of Essex


Decision Trees Slide 12 of 40

Initial Uncertainty

First we consider the initial uncertainty:
p(rain) = 3/8; p(dry) = 5/8
So Inf = -(3/8 log2(3/8) + 5/8 log2(5/8)) = 0.954

Next we must choose the best attribute for building branches
from the root node of the decision tree.
There are three to choose from.
Information Gain from Temperature Attribute

Cool Examples:
There are 5 of these; 2 rain and 3 dry.
So p(rain) = 2/5 and p(dry) = 3/5
Hence Inf_cool = -(2/5 log2(2/5) + 3/5 log2(3/5)) = 0.971
Warm Examples:
There are 3 of these; 1 rain and 2 dry.
So p(rain) = 1/3 and p(dry) = 2/3
Hence Inf_warm = -(1/3 log2(1/3) + 2/3 log2(2/3)) = 0.918
Average Information:
5/8 Inf_cool + 3/8 Inf_warm = 0.625 x 0.971 + 0.375 x 0.918 = 0.951
Hence the Information Gain for Temperature is
Initial Information - Average Information for Temperature
= 0.954 - 0.951 = 0.003. (Very small)

P.D.Scott University of Essex


Decision Trees Slide 13 of 40

Information Gain from Cloud Cover Attribute

A similar calculation gives an average information of 0.500.
Hence the Information Gain for Cloud Cover is
Initial Information - Average Information for Cloud Cover
= 0.954 - 0.500 = 0.454. (Large)
Information Gain from Wind Attribute

A similar calculation gives an average information of 0.607.
Hence the Information Gain for Wind is
Initial Information - Average Information for Wind
= 0.954 - 0.607 = 0.347. (Quite large)
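These three gains can be reproduced with a short Python sketch of my
own (attribute names and column layout are just illustrative):

from math import log2

DATA = [
    ("warm", "overcast", "windy", "rain"),
    ("cool", "overcast", "calm", "dry"),
    ("cool", "cloudy", "windy", "rain"),
    ("warm", "clear", "windy", "dry"),
    ("cool", "clear", "windy", "dry"),
    ("cool", "overcast", "windy", "rain"),
    ("cool", "clear", "calm", "dry"),
    ("warm", "overcast", "calm", "dry"),
]
ATTS = {"temperature": 0, "cloud cover": 1, "wind": 2}

def information(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                for c in set(labels))

def gain(att):
    col = ATTS[att]
    all_labels = [row[3] for row in DATA]
    avg = 0.0
    for value in {row[col] for row in DATA}:
        subset = [row[3] for row in DATA if row[col] == value]
        avg += (len(subset) / len(DATA)) * information(subset)
    return information(all_labels) - avg

for att in ATTS:
    print(att, round(gain(att), 3))
# temperature ~0.003, cloud cover ~0.454, wind ~0.347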
The Best Attribute: Starting to build the tree
Cloud cover gives the greatest information gain so we choose
it as the attribute to begin tree construction

Cloud Cover?
  Overcast:
    warm, overcast, windy: rain
    cool, overcast, calm:  dry
    cool, overcast, windy: rain
    warm, overcast, calm:  dry
  Cloudy:
    cool, cloudy, windy: rain
  Clear:
    warm, clear, windy: dry
    cool, clear, windy: dry
    cool, clear, calm:  dry

P.D.Scott University of Essex



Decision Trees Slide 14 of 40

Developing the Tree

All the examples on the Clear branch belong to the same
class so no further elaboration is needed.
It can be terminated with a leaf node labelled "dry".
Similarly the single example on the Cloudy branch
necessarily belongs to one class.
It can be terminated with a leaf node labelled "rain".
This gives us:

Cloud Cover?
  Overcast:
    warm, overcast, windy: rain
    cool, overcast, calm:  dry
    cool, overcast, windy: rain
    warm, overcast, calm:  dry
  Cloudy -> Rain
  Clear  -> Dry

The Overcast branch has both rain and dry examples.

So we must attempt to extend the tree from this node.

P.D.Scott University of Essex


Decision Trees Slide 15 of 40

Extending the Overcast subtree

There are 4 examples: 2 rain and 2 dry.
So p(rain) = p(dry) = 0.5 and the uncertainty is 1.
There are two remaining attributes: temperature and wind.
Information Gain from Temperature Attribute

Cool Examples:
There are 2 of these; 1 rain and 1 dry.
So p(rain) = 1/2 and p(dry) = 1/2
Hence Inf_cool = -(1/2 log2(1/2) + 1/2 log2(1/2)) = 1
Warm Examples:
There are also 2 of these; 1 rain and 1 dry.
So again Inf_warm = -(1/2 log2(1/2) + 1/2 log2(1/2)) = 1
Average Information:
1/2 Inf_cool + 1/2 Inf_warm = 0.5 x 1.0 + 0.5 x 1.0 = 1
Hence Information Gain for Temperature is zero!

P.D.Scott University of Essex


Decision Trees Slide 16 of 40

Information Gain from Wind Attribute

Windy Examples:
There are 2 of these; 2 rain and 0 dry.
So p(rain) = 1 and p(dry) = 0
Hence Inf_windy = -(1 x log2(1) + 0 x log2(0)) = 0
(taking 0 x log2(0) to be 0)
Calm Examples:
There are also 2 of these; 0 rain and 2 dry.
So again Inf_calm = -(0 x log2(0) + 1 x log2(1)) = 0
Average Information:
1/2 Inf_windy + 1/2 Inf_calm = 0.5 x 0.0 + 0.5 x 0.0 = 0
Hence the Information Gain for Wind is 1 - 0 = 1.
Note: This reflects the fact that wind is a perfect predictor
of precipitation for this subset of examples.
The Best Attribute:
Obviously wind is the best attribute so we can now extend
the tree.

P.D.Scott University of Essex


Decision Trees Slide 17 of 40

Cloud Cover?
  Overcast: Wind?
    Windy:
      warm, overcast, windy: rain
      cool, overcast, windy: rain
    Calm:
      cool, overcast, calm: dry
      warm, overcast, calm: dry
  Cloudy -> Rain
  Clear  -> Dry

All the examples on both the new branches belong to the
same class so they can be terminated with appropriately
labelled leaf nodes.

Cloud Cover?
  Overcast: Wind?
    Windy -> Rain
    Calm  -> Dry
  Cloudy -> Rain
  Clear  -> Dry

P.D.Scott University of Essex


Decision Trees Slide 18 of 40

FROM DECISION TREES TO PRODUCTION RULES

Decision trees can easily be converted into sets of IF-THEN
rules.
The tree just derived would become:
IF clear THEN dry
IF cloudy THEN rain
IF overcast AND calm THEN dry
IF overcast AND windy THEN rain
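The same rule set can be written directly as code. A minimal Python
sketch (the function and parameter names are my own):

def predict_precipitation(cloud_cover, wind):
    # Direct translation of the four rules above.
    if cloud_cover == "clear":
        return "dry"
    if cloud_cover == "cloudy":
        return "rain"
    # cloud_cover == "overcast": the wind attribute decides.
    return "dry" if wind == "calm" else "rain"

print(predict_precipitation("overcast", "windy"))   # rain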

Such rules are usually easier to understand than the
corresponding tree.
Large trees produce large sets of rules.
It is often possible to simplify these considerably by applying
transformations to them.
In some cases these simplified rule sets are more accurate
than the original tree because they reduce the effect of
overfitting, a topic we will discuss later.

P.D.Scott University of Essex


Decision Trees Slide 19 of 40

REFINEMENTS OF DECISION TREE LEARNING

The basic top-down procedure for decision tree construction is
exemplified by ID3.
This technique has proved extremely successful as a
method for classification learning.
Consequently it has been applied to a wide range of
problems.
As this has happened, limitations have emerged and new
techniques have been developed to deal with them.

These include:
Dealing with Missing Values
Inconsistent Data
Incrementality
Handling Numerical Attributes
Attribute Value Grouping
Alternative Attribute Selection Criteria
The Problem of Overfitting

Note.
Many of these problems also arise in other learning
procedures and statistical methods.
Hence many of the solutions developed for use with
decision trees may be useful in conjunction with other
techniques.

P.D.Scott University of Essex


Decision Trees Slide 20 of 40

MISSING VALUES

Missing values are a major problem when working with real
data sets.
A survey respondent may not have answered a question.
The results of a lab test may not be available.
There are various approaches to this problem.

Discard the training example.
If there are many attributes, each of which might be
missing, this may almost wipe out the training data.

Guess the value.
e.g. substitute the commonest value.
Obviously error prone but quite effective if missing values
for a given variable are rare.
More sophisticated guessing techniques have been
developed.
Sometimes called imputation; a small sketch appears at the
end of this slide.

Assign a probability to each possible value.
Probabilities can be estimated from remaining examples.
For training purposes, treat missing value cases as
fractional examples in each class for gain computation.
This fractionation will propagate as the tree is extended.
Resulting tree will give a probability for each class rather
than a definite answer.
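The simplest of these ideas, modal-value imputation, takes only a few
lines. A Python sketch of my own, assuming records are dictionaries
and None marks a missing value:

from collections import Counter

def impute_modal(records, attribute):
    # Fill missing values of one attribute with its commonest observed value.
    observed = [r[attribute] for r in records if r[attribute] is not None]
    modal = Counter(observed).most_common(1)[0][0]
    for r in records:
        if r[attribute] is None:
            r[attribute] = modal
    return records

rows = [{"wind": "windy"}, {"wind": None}, {"wind": "calm"}, {"wind": "windy"}]
impute_modal(rows, "wind")    # the missing entry becomes "windy"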

P.D.Scott University of Essex


Decision Trees Slide 21 of 40

INCONSISTENT DATA

It is possible (and not unusual) to arrive at a situation in which
the examples associated with a leaf node belong to more than
one class.
This situation can arise in two ways:
1. There are no more attributes that could be used to
further subdivide the examples.
e.g. Suppose the weather data had contained a ninth
example
[ cool, cloudy, windy; dry ]
2. There are more attributes but none of them is useful
in distinguishing the classes.
e.g. Suppose the weather data had included the above
example and another attribute that identified the
person who recorded the data.

No More Attributes
A decision tree program can do one of two things.
Predict the most probable (modal) class.
This is what the pseudocode given earlier does.
Make a set of predictions with associated probabilities.
This is better.

P.D.Scott University of Essex


Decision Trees Slide 22 of 40

No More Useful Attributes

In this situation the system cannot build a subtree to further
discriminate the classes because the unused attributes do not
correlate with the classification.
This situation must be handled in the same way as No More
Attributes.
But first the program must detect when the situation has
arisen.
Detecting that Further Progress is Impossible.

This requires a threshold on information gain or a statistical
test.
We will discuss this when we consider pre-pruning as a
possible solution to overfitting.

P.D.Scott University of Essex


Decision Trees Slide 23 of 40

INCREMENTALITY

Most decision tree induction programs require the entire
training set to be available at the start.
Such programs can only incorporate new data by building
a complete new tree.
This is not a problem for most data mining applications.
In some applications, new training examples become
available at intervals.
e.g. Consider a robot dog learning football tactics by
playing games.
Learning programs that can accept new training instances
after learning has begun, without starting again, are said to be
incremental.
Incremental Decision Tree Induction

ID4 is a modification of ID3 that can learn incrementally.
Maintains counts at every node throughout learning:
Number of each class associated with the node and its
subclasses.
Numbers of each class having each possible value of
each attribute.
When a new example is encountered, all the relevant counts
are updated.
Where counts have changed, the system can calculate
whether the current best attribute is still the best.
If not, a new subtree is built to replace the original.

P.D.Scott University of Essex


Decision Trees Slide 24 of 40

Building Decision Trees with Numeric Attributes

The Problem

Standard decision tree procedures are designed to work with
categorical attributes.
No account is taken of any numerical relationship between
the values.
The branching factor of the tree will be reasonable if the
number of distinct values is modest.
If the same procedures are applied to numerical attributes we
run into two difficulties:
The branching factor may become absurdly large.
Consider the attribute income in a survey of 1000
people.
It is not unlikely that every individual will have a
different annual income.
All the information implicit in the ordering of values is
thrown away.

The Solution
Partition the value set into a small number of contiguous
subranges and then treat membership of each subrange as a
categorical variable.
The result is in effect a new ordinal attribute.
A reasonable branching factor.
Some of the ordering information has been used.

P.D.Scott University of Essex


Decision Trees Slide 25 of 40

Discretization of Continuous Variables

Two possible approaches to partitioning a range of numeric
values in order to build a classifier:
Divide the range up into a preset number of subranges.
For example, each subrange could have equal width or
include an equal number of examples.
Use the classification variable to determine the best way
to partition the numeric attribute.
The second approach has proved more successful.
Discretization using the classification variable

In principle this is straightforward:
Consider every possible partitioning of the numeric
attribute.
Assess each partitioning using the attribute selection
criterion.
In practice this is infeasible:
A set of m training examples can be partitioned in 2^(m-1)
ways
However, it can be proved that if two neighbouring values
belong to the same class they should be assigned to the
same group.
This reduces the number of possibilities but the number of
partitionings is still infeasibly large.
Two solutions are possible:
Consider only the m-1 binary partitions (e.g. C4.5); a
sketch of this follows below.
Use heuristics to find good multiple partitions
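A minimal Python sketch of the binary-split idea (my own code and
example data): try every midpoint between adjacent distinct values
and keep the threshold with the highest information gain.

from math import log2

def information(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                for c in set(labels))

def best_binary_split(values, classes):
    pairs = sorted(zip(values, classes))
    base = information([c for _, c in pairs])
    best = (None, -1.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no threshold between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [c for v, c in pairs if v <= threshold]
        right = [c for v, c in pairs if v > threshold]
        avg = (len(left) * information(left)
               + len(right) * information(right)) / len(pairs)
        if base - avg > best[1]:
            best = (threshold, base - avg)
    return best

print(best_binary_split([20, 35, 48, 52, 70],
                        ["dry", "dry", "rain", "rain", "rain"]))
# (41.5, ~0.971): this threshold separates the two classes perfectly.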
P.D.Scott University of Essex
Decision Trees Slide 26 of 40

ATTRIBUTE VALUE GROUPING

Discretization of numeric attributes involves finding subgroups
of values that are equivalent for the purposes of classification.
This notion can usefully be applied to categorical variables:
Suppose
A decision tree program attempts to induce a tree to
predict some binary class, C.
That the training set comprises feature vectors of nominal
attributes A1, A2, ..., Ak.
That the class C is in fact defined by the following
classification function:
C = V2,1 AND (V5,2 OR V5,4) AND V8,3
where Vi,j denotes that attribute Ai takes its jth value.
Suppose finally that each attribute has four possible
values and the program selects attributes in the order of
A5, A2, A8.
The resulting tree will be:

A5?
  V5,1 -> N
  V5,2 -> A2?
            V2,1 -> A8?  (V8,1 -> N, V8,2 -> N, V8,3 -> Y, V8,4 -> N)
            V2,2 -> N
            V2,3 -> N
            V2,4 -> N
  V5,3 -> N
  V5,4 -> A2?
            V2,1 -> A8?  (V8,1 -> N, V8,2 -> N, V8,3 -> Y, V8,4 -> N)
            V2,2 -> N
            V2,3 -> N
            V2,4 -> N

P.D.Scott University of Essex


Decision Trees Slide 27 of 40

There is a great deal of duplication in the structure of this tree.


If branches could be created for groups of attribute values, a
much simpler one could be constructed:

A5?
  {V5,1, V5,3} -> N
  {V5,2, V5,4} -> A2?
                    {V2,2, V2,3, V2,4} -> N
                    {V2,1} -> A8?
                                {V8,1, V8,2, V8,4} -> N
                                {V8,3} -> Y


The original tree:
Has 21 nodes
Divides the example space into 16 regions
The new tree
Has 7 nodes
Divides the example space into 4 regions
Their classification behaviours are identical.
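That equivalence is easy to check in code. A Python sketch of my own,
encoding each example as a dict from attribute index to value index
and enumerating all combinations of A2, A5 and A8:

import itertools

def by_function(v):
    # C = V2,1 AND (V5,2 OR V5,4) AND V8,3
    return "Y" if v[2] == 1 and v[5] in (2, 4) and v[8] == 3 else "N"

def by_grouped_tree(v):
    # The simplified tree: grouped branches on A5, then A2, then A8.
    if v[5] in (1, 3):
        return "N"
    if v[2] != 1:
        return "N"
    return "Y" if v[8] == 3 else "N"

assert all(by_function(v) == by_grouped_tree(v)
           for v in ({2: a2, 5: a5, 8: a8}
                     for a2, a5, a8 in itertools.product(range(1, 5), repeat=3)))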

Attribute Value Grouping Procedures

Heuristic attribute value grouping procedures, similar to those
used to discretize numeric attributes, can be used to produce
such tree simplification.

P.D.Scott University of Essex




Decision Trees Slide 29 of 40

ALTERNATIVE ATTRIBUTE SELECTION CRITERIA

Hill Climbing
The basic decision tree induction algorithm proceeds
using a hill climbing approach.
At every step, a new branch is created for each value of
the best attribute.
There is no backtracking.

Which is the Best Attribute?

The criterion used for selecting the best attribute is therefore
very important, since there is no opportunity to rectify its
mistakes.
Several alternatives have been used successfully.

Information Based Criteria

If an experiment has n possible outcomes then the amount of
information, expressed as bits, provided by knowing the
outcome is defined to be

I = - Σ_{i=1..n} p_i log2(p_i)

where p_i is the prior probability of the i-th outcome.

This quantity is also known as entropy and uncertainty.
For decision tree construction, the experiment is finding out
the correct classification of an example.

P.D.Scott University of Essex


Decision Trees Slide 30 of 40

Information Gain

ID3 originally used information gain to determine the best
attribute:

The information gain for an attribute A when used with a set of
examples X is defined to be:

Gain(X, A) = I(X) - Σ_{v in values(A)} (|Xv| / |X|) I(Xv)

where |X| is the number of examples in set X and Xv is the
subset of X for which attribute A has value v.

This criterion has been (and is) widely and successfully used.
But it is known to be biased towards attributes with many
values.

Why is it biased?
A many-valued attribute will partition the examples into
many subsets.
The average size of these subsets will be small.
Some of these are likely to contain a high percentage of
one class by chance alone.
Hence the true information gain will be over-estimated.

P.D.Scott University of Essex


Decision Trees Slide 31 of 40

Information Gain Ratio

The information gain ratio measure incorporates an additional
term, split information, to compensate for this bias.
It is defined:

SplitInf(X, A) = - Σ_{v in values(A)} (|Xv| / |X|) log2(|Xv| / |X|)

which is in fact the information imparted when you are given
the value of attribute A.

For example, if we consider equiprobable values:
If A has 2 values, SplitInf(X,A) = 1
If A has 4 values, SplitInf(X,A) = 2
If A has 8 values, SplitInf(X,A) = 3

The information gain ratio is then defined:

GainRatio(X, A) = Gain(X, A) / SplitInf(X, A)

The gain ratio itself can lead to difficulties if the values are far
from equiprobable.
If most of the examples have the same value of A then
SplitInf(X,A) will be close to zero, and hence GainRatio(X,A)
may be very large.
The usual solution is to use Gain rather than GainRatio
whenever SplitInf is small.
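A Python sketch of these quantities (my own helper names; the
fallback rule for a small SplitInf is deliberately simplified):

from math import log2

def info(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def split_info(subset_sizes):
    # Information in the value of A itself: how evenly A splits X.
    return info(subset_sizes)

def gain_ratio(gain, subset_sizes, min_split=0.01):
    s = split_info(subset_sizes)
    return gain / s if s > min_split else gain   # fall back to plain gain

# Equiprobable values: 2 values -> 1 bit, 4 -> 2 bits, 8 -> 3 bits
print(split_info([50, 50]), split_info([25] * 4), split_info([10] * 8))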

P.D.Scott University of Essex


Decision Trees Slide 32 of 40

Information Distance

SplitInf can be regarded as a normalisation factor for gain.
Lopez de Mantaras has developed a mathematically sounder
form of normalisation based on the information distance
between two partitions.
This has not been widely adopted.

The Gini Criterion

An alternative to the information theory based measures.
Based on the notion of minimising misclassification rate.

Suppose you know the probabilities p(ci) of each class ci for
the examples assigned to a node.
Suppose you are given an unclassified example that would be
assigned to that node and decide to make a random guess of
its class with probabilities p(ci).

What is the probability that you will guess incorrectly?

G = Σ_{i != j} p(ci) p(cj)

where the sum is taken over all pairs of distinct classes.

G is called the Gini criterion and can be used in the same
ways as information to select the best attribute.
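A one-line Python sketch of my own, using the identity that the sum
over i != j of p_i p_j equals 1 minus the sum of the squared class
probabilities:

def gini(probs):
    # Probability of misclassifying when guessing with the node's own distribution.
    return 1.0 - sum(p * p for p in probs)

print(gini([0.7, 0.3]))   # 0.42
print(gini([0.5, 0.5]))   # 0.50  (maximum impurity for two classes)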

P.D.Scott University of Essex


Decision Trees Slide 33 of 40

SO WHAT IS THE BEST ATTRIBUTE SELECTION CRITERION?


A good attribute selection criterion should select those
attributes that most improve prediction accuracy.
It should also be cheap to compute.

Which criteria are widely used?

Information gain ratio is most popular in machine learning
research.
The Gini criterion is very popular in the statistical and pattern
recognition communities.

The evidence

Empirical evidence suggests that the choice of criterion has little
impact on the classification accuracy of the resulting trees.
There are claims that some methods produce smaller trees
with similar accuracies.

P.D.Scott University of Essex


Decision Trees Slide 34 of 40

THE PROBLEM OF OVERFITTING


Question
What would happen if a completely random set of data were
used as the training and test sets for a decision tree induction
program?
Answer
The program would build a decision tree.
If there were many variables and plenty of data it could be
quite a large tree.

Question
Would the tree be any good as a classifier?
Would it, for example, do better than the simple strategy of
always picking the modal class?
Answer
No.
Note also that if the experiment were repeated with a new set
of random data we would get an entirely different tree.

Questions
Isn't this rather worrying?
Could the same sort of thing be happening with non-random
data?
Answers
Yes and yes.

P.D.Scott University of Essex


Decision Trees Slide 35 of 40

What is going on?


A decision tree is a mathematical model of some population
of examples.
But the tree is built on the basis of a sample from that
population: the training set.
So what a decision tree program really does is build a model
of the training set.
The features of such a model can be divided into two groups:
1. Those that reflect relationships that are true for the
population as a whole.
2. Those that reflect relationships that are peculiar to the
particular training set.
Overfitting
Roughly speaking what happens is this:
Initially the features of a decision tree will reflect features
of the whole population.
As the tree gets deeper, the samples at each node get
smaller and the major relationships of the population will
have already been incorporated into the model.
Consequently any further additions are likely to reflect
relationships that have occurred by chance in the training
data.
From this point on the tree becomes a less accurate
model of the population: typically 20% less accurate.
This phenomenon of modelling the training data rather than
the population it represents is called overfitting.

P.D.Scott University of Essex


Decision Trees Slide 36 of 40

Eliminating Overfitting
There are two basic ways of preventing overfitting:

1. Stop tree growth before it happens: Pre-pruning.

2. Remove the parts of the tree due to overfitting after it has
been constructed: Post-pruning.
Pre-pruning

This approach is appealing because it would save the effort
involved in building then scrapping subtrees.
This implies the need for a stopping criterion: a function
whose value determines when a leaf node should not be
expanded into subtrees.
Two types of stopping criteria have been tried:
Stopping when the improvement gets too small.
Typically stop when the improvement indicated by the
attribute selection criterion drops below some pre-set
threshold ε.
Choice of ε is crucial. Too low and you still get
overfitting. Too high and you lose accuracy.
This method proved unsatisfactory. It wasn't possible
to choose a value for ε that worked for all data sets.
Stopping when the evidence for an extension becomes
statistically insignificant.
Quinlan used chi-square (χ²) testing in some versions of ID3.
He later abandoned this because the results, though often
satisfactory, were uneven.

P.D.Scott University of Essex


Decision Trees Slide 37 of 40

Chi-Square Tests

Chi-square testing is an extremely useful technique for
determining whether the differences between two distributions
could be due to chance.
That is, whether they could both be samples of the same
parent population.
Suppose we have a set of n categories and a set of
observations O1, ..., Oi, ..., On of the frequency with which
each category occurs in a sample.
Suppose we wish to know if this set of observations could
be a sample drawn from some population whose
frequencies we also know.
We can calculate the expected frequencies E1, ..., Ei, ..., En of
each category if the sample exactly followed the
distribution of the population.
Now compute the value of the chi-square statistic defined:
χ² = Σ_{i=1..n} (Oi - Ei)² / Ei

Clearly χ² increases as the two distributions deviate.

To determine whether the deviation is statistically significant,
consult chi-square tables for the appropriate number of
degrees of freedom, in this case n - 1.
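The statistic itself is a one-liner. A Python sketch with made-up
frequencies (not from the slides):

def chi_square(observed, expected):
    # Sum of (O_i - E_i)^2 / E_i over the n categories.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Example: 40 rain / 60 dry observed where 30 / 70 were expected.
stat = chi_square([40, 60], [30, 70])   # 100/30 + 100/70 ~ 4.76
# With n - 1 = 1 degree of freedom the 5% critical value is about 3.84,
# so a deviation this large would count as statistically significant.
print(round(stat, 2))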

P.D.Scott University of Essex


Decision Trees Slide 38 of 40

Post-Pruning

The Basic Idea


First build a decision tree, allowing overfitting to occur.
Then, for each subtree:
Assess whether a more accurate tree would result if the
subtree were replaced by a leaf.
(The leaf will choose the modal class for classification)
If so, replace the subtree with a leaf.

Validation Data Sets


How do we assess whether a pruned tree would be more
accurate?
We can't use the training data because the tree has
overfitted to this.
We can't use the test data because then we would have
no independent measure for the accuracy of the final tree.
We must have a third set used only for this purpose.
This is known as a validation set.
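A simplified Python sketch of the reduced-error idea, reusing the
('leaf', ...)/('node', ...) tree form from the earlier sketch; this is
an illustration under my own assumptions, not Quinlan's exact
procedure:

from collections import Counter

def classify(tree, features, default):
    if tree[0] == 'leaf':
        return tree[1]
    _, att, children = tree
    child = children.get(features.get(att))
    return classify(child, features, default) if child else default

def accuracy(tree, examples, default):
    return sum(classify(tree, f, default) == c for f, c in examples) / len(examples)

def prune(tree, train, valid, default):
    # train/valid are the (features, class) examples that reach this node.
    if tree[0] == 'leaf' or not valid:
        return tree
    _, att, children = tree
    new_children = {}
    for value, sub in children.items():
        sub_train = [(f, c) for f, c in train if f.get(att) == value]
        sub_valid = [(f, c) for f, c in valid if f.get(att) == value]
        new_children[value] = prune(sub, sub_train, sub_valid, default)
    subtree = ('node', att, new_children)
    modal = Counter(c for _, c in train).most_common(1)[0][0] if train else default
    leaf = ('leaf', modal)
    # Replace the subtree with a modal-class leaf unless the subtree
    # is more accurate on the validation examples reaching this node.
    if accuracy(leaf, valid, modal) >= accuracy(subtree, valid, modal):
        return leaf
    return subtree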
Notes:
Validation sets can also be used in pre-pruning.
In C4.5, Quinlan uses the training data for validation but
treats the result as an estimate and sets up a confidence
interval.
This is statistically dubious because the training data
isn't an independent sample.
Quinlan justifies it on the grounds that it works in
practice.
P.D.Scott University of Essex
Decision Trees Slide 39 of 40

Refinements

Substituting Branches
Rather than replacing a subtree with a leaf, it can be replaced
by its most frequently used branch.
More Drastic Transformations
More substantial changes, possibly leading to a structure that
is no longer a tree, have also been used.
One example is the transformation into rule sets in C4.5:
Generate a set of production rules equivalent to the tree
by creating one rule for each path from the root to a leaf.
Generalize each rule by removing any precondition
whose loss does not reduce the accuracy.
This step corresponds to pruning, but note that the
structure may no longer be equivalent to a tree.
An example might match the LHS of more than one
rule.
Sort the rules by their estimated accuracy.
When using the rules for classification, this accuracy is
used for conflict resolution.

P.D.Scott University of Essex


Decision Trees Slide 40 of 40

Suggested Readings
Mitchell, T. M. (1997), Machine Learning, McGraw-Hill. Chapter 3.
Tan, Steinbach & Kumar (2006), Introduction to Data Mining.
Chapter 4.
Han & Kamber (2006), Data Mining: Concepts and Techniques.
Section 6.3.
Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984),
Classification and Regression Trees. Wadsworth, Pacific Grove, CA.
(A thorough treatment of the subject from a more statistical
perspective and an essential reference if you are doing research in the
area; usually known as "The CART book".)
Quinlan, J. R. (1986), Induction of Decision Trees. Machine
Learning, 1(1), pp 81-106. (A full account of ID3.)
Quinlan, J. R. (1993), C4.5: Programs for Machine Learning. Morgan
Kaufmann, Los Altos, CA. (A complete account of C4.5, the successor
to ID3 and the yardstick to which other decision tree induction
procedures are usually compared.)
Dougherty, J., Kohavi, R. and Sahami, M. (1995), Supervised and
Unsupervised Discretisation of Continuous Features, in Proc. 12th Int.
Conf. on Machine Learning, Morgan Kaufmann, Los Altos, CA., pp
194-202. (A good comparative study of different methods for
discretising numeric attributes.)
Ho, K. M. and Scott, P. D. (2000), Reducing Decision Tree
Fragmentation Through Attribute Value Grouping: A Comparative
Study. Intelligent Data Analysis, 6, pp 255-274.

Implementations
An implementation of a decision tree procedure is available as part of
the WEKA suite of data mining programs. It is called J4.8 and closely
resembles C4.5.

P.D.Scott University of Essex
