
CS181 Lecture 3 Overfitting, Description Length and Cross-Validation

Avi Pfeffer; Revised by David Parkes, Jan 23, 2010

Today we continue our discussion of decision tree learning algorithms. The main focus will be on the phenomenon of overfitting, which is an issue for virtually all machine learning algorithms. After discussing why overfitting happens, we will look at several methods for dealing with it.


More on the ID3 Algorithm

Information Gain Heuristic

Several comments on the ID3 algorithm are in order. First, let us try to understand better what kinds of splits the information gain criterion selects. Recall that it selects the feature which maximizes information gain, or equivalently, since the current entropy is fixed, minimizes Remainder(Xk, D), which is the weighted entropy of the partition of data induced by a split on feature Xk given data D consistent with the current node in the tree. To gain a better understanding of the heuristic used by ID3, we first examine the entropy function a little more closely. For binary data, where we denote pT = nT/n (the fraction of instances with classification true), the entropy is:
[Figure: plot of Entropy(pT) on the y-axis against pT on the x-axis (ticks at 0, 0.2, 0.4, 0.6, 0.8, 1): the curve rises from 0 at pT = 0 to a maximum of 1 at pT = 1/2 and falls back to 0 at pT = 1.]

We have Entropy(pT) = -pT log2 pT - (1 - pT) log2(1 - pT). Taking the derivative,

    d Entropy(pT)/dpT |pT=z  =  -(ln z)/ln 2 - 1/ln 2 + (ln(1 - z))/ln 2 + 1/ln 2    (1)
                             =  (1/ln 2) (ln(1 - z) - ln z),                         (2)
which is +infinity at 0 and -infinity at 1. The greatest effect on entropy occurs near the extremes of 0 and 1, while the effect of changes to mid-range distributions is relatively small. For example, we have Entropy(5:9) = 0.94 and Entropy(7:7) = 1, while Entropy(1:6) = 0.59 but Entropy(0:7) = 0. There is a big difference in entropy from small changes to the distribution of the target class at the extremes. This implies that the Remainder function selects strongly for splits that generate a partition of the data in which some of the parts are very extreme, even if the other parts are very mixed and retain a lot of disorder. Another way of putting it is that Remainder prefers a split that generates a partition with one very extreme and one very mixed part to a partition with two fairly well-sorted parts. The second component of the Remainder function is the weights nx/n. These mean that getting low entropy on a large part of the data partition is more important than on a small part. To summarize, then, the information gain criterion will try to choose a feature to split on that induces a partition in which large parts of the data are classified extremely well.
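The entropy values quoted above are easy to check numerically. A minimal sketch (the function name `entropy` is my own):

```python
import math

def entropy(pos, neg):
    """Binary entropy of a pos:neg class distribution, in bits."""
    n = pos + neg
    h = 0.0
    for count in (pos, neg):
        p = count / n
        if p > 0:  # adopt the convention 0 * log2(0) = 0
            h -= p * math.log2(p)
    return h

print(round(entropy(5, 9), 2))  # 0.94
print(entropy(7, 7))            # 1.0
print(round(entropy(1, 6), 2))  # 0.59
print(entropy(0, 7))            # 0.0
```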


Example: Poisonous Plants

Let's have a look at how this works in the earlier example of poisonous and nutritious plants. Recall that there are four features and two target classes. Suppose that the original data set is as follows:

    Skin     Color    Thorny   Flowering   Class
    smooth   pink     false    true        Nutritious
    smooth   pink     false    false       Nutritious
    scaly    pink     false    true        Poisonous
    rough    purple   false    true        Poisonous
    rough    orange   true     true        Poisonous
    scaly    orange   true     false       Poisonous
    smooth   purple   false    true        Nutritious
    smooth   orange   true     true        Poisonous
    rough    purple   true     true        Poisonous
    smooth   purple   true     false       Poisonous
    scaly    purple   false    false       Poisonous
    scaly    pink     true     true        Poisonous
    rough    purple   false    false       Nutritious
    rough    orange   true     false       Nutritious

Overall, there are 5 Nutritious cases and 9 Poisonous cases in the data, for an initial entropy before splitting of 0.94. Splitting on the various features results in the following information gain:

    Feature     Value    Nutritious  Poisonous  Entropy   Remainder                                         Gain
    Skin        smooth       3           2       0.971    5/14 * 0.971 + 5/14 * 0.971 + 4/14 * 0   = 0.694  0.247
                rough        2           3       0.971
                scaly        0           4       0
    Color       pink         2           2       1        4/14 * 1 + 6/14 * 0.918 + 4/14 * 0.811   = 0.911  0.029
                purple       2           4       0.918
                orange       1           3       0.811
    Thorny      false        4           3       0.986    7/14 * 0.986 + 7/14 * 0.592              = 0.788  0.152
                true         1           6       0.592
    Flowering   false        3           3       1        6/14 * 1 + 8/14 * 0.811                  = 0.892  0.048
                true         2           6       0.811
We see that splitting on the Skin feature produces by far the best information gain, because it has one perfectly classified sub-partition, even though the other two sub-partitions are even worse than the original distribution. The ID3 algorithm would split first on Skin.
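The remainders and gains in the table can be reproduced directly from the class counts. A sketch (the data layout and names here are mine):

```python
import math

def entropy(pos, neg):
    """Binary entropy of a pos:neg distribution, in bits."""
    n = pos + neg
    return -sum(c / n * math.log2(c / n) for c in (pos, neg) if c > 0)

# For each feature: value -> (nutritious, poisonous) counts, read off the table.
partitions = {
    "Skin":      {"smooth": (3, 2), "rough": (2, 3), "scaly": (0, 4)},
    "Color":     {"pink": (2, 2), "purple": (2, 4), "orange": (1, 3)},
    "Thorny":    {"false": (4, 3), "true": (1, 6)},
    "Flowering": {"false": (3, 3), "true": (2, 6)},
}

n_total = 14
initial = entropy(5, 9)  # about 0.940

for feature, parts in partitions.items():
    remainder = sum((p + q) / n_total * entropy(p, q) for p, q in parts.values())
    print(feature, round(initial - remainder, 3))
# Skin 0.247, Color 0.029, Thorny 0.152, Flowering 0.048
```

Skin has the largest gain, so it is chosen first, as in the text.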


Variations on Decision Tree Learners

Some other considerations that come into play when learning decision trees include:

Continuous features: here the algorithms can be modified to select a single threshold value that is the most informative in terms of information gain.

Missing feature values: here the feature value can be completed, e.g. by assigning the value that is most common amongst the data D at the current node.

Multiple feature values: here we must be careful to avoid a bias in ID3 in favor of features with many values over those with few. As an extreme example, we could split on a feature such as Date that exactly partitions the data into single instances. To protect against this, a common method is to adopt gain ratio in place of information gain, which normalizes information gain by the entropy of the data with respect to the values of the feature.

Regression trees: regression is the problem of learning a hypothesis h : X -> R (to the reals) rather than to a discrete set of classes. For regression rather than classification, similar approaches can be adopted. E.g., one can seek a split that greedily minimizes the least squared error when fitting a linear regression to the frequency-weighted data sets generated by a split.
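The gain-ratio normalization mentioned above can be sketched in a few lines (function names are my own). The "split information" is the entropy of the data with respect to the feature's values, which is large for a Date-like feature that shatters the data into singletons:

```python
import math

def entropy_of_dist(counts):
    """Entropy (in bits) of a distribution given by raw counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def gain_ratio(info_gain, part_sizes):
    """Information gain normalized by the split information."""
    split_info = entropy_of_dist(part_sizes)
    return info_gain / split_info if split_info > 0 else 0.0

# A Date-like feature splitting 14 instances into 14 singletons has split
# information log2(14), about 3.81 bits, so even a raw gain equal to the
# full initial entropy yields only a modest gain ratio.
print(round(gain_ratio(0.940, [1] * 14), 3))
```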

The Overfitting Phenomenon

We turn now to the main question: how well does ID3 actually perform as a learning algorithm? Given a consistent training set, ID3 will learn a tree that scores perfectly on the training set, unless it happens at some point that no feature has positive information gain taken individually. To understand that the error is zero except in this case, we can simply step through the termination conditions. Furthermore, for an inconsistent training set, it will learn a tree that has the fewest possible errors on the training set, again unless it happens at some point that no feature has positive information gain when considered individually. The new case in the analysis of the error occurs when there is inconsistent data but no features left to branch on. But this data represents noise, no tree can avoid this error, and selecting the majority target minimizes the number of misclassified examples. So, it seems like ID3 is a great algorithm, right? True, it always gets a hypothesis with as low an error on the training data as possible. But that does not necessarily mean that it learns a classifier that will generalize well. In fact, getting perfect accuracy on the training data might not be the best thing to do! Doing better on the training data may actually lead to doing worse on future data. This phenomenon is called overfitting.


Potential for Overfitting in ID3

ID3 is a greedy algorithm. It uses information gain to greedily decide what feature to split on next, and never questions its decision. Choosing the node that has the highest information gain does not necessarily lead to the shortest tree that is exact on a given data set. For example, suppose there are 100 features, and the true function labels an instance true if and only if X1 = X2. In other words, the target class on an example will be true in the event that the value of the first feature is equal to that of the second feature, and false otherwise. The remaining 98 features are irrelevant! But on the other hand, considered individually, X1 and X2 are both useless. It is only taken together that they are useful, and in this case they entirely determine the class of a data instance.

Now we can consider what ID3 would do when presented with some training data from this domain. Because ID3 is greedy and considers only one feature at a time, it is no more likely to split on X1 or X2 than it is to split on any one of the other features. Once it has split on X1 or X2, it will of course split on the other at the next step, and then terminate. But it may take a long time until it splits on X1 or X2. So ID3 will likely grow a much larger tree than necessary. Along the way it is fitting spurious patterns in the training data and learning a hypothesis that does not generalize. Cases like this, in which two features are not individually predictive but work as co-predictors, would not be a problem for an algorithm that looked ahead one step in deciding what to split on, or split on pairs directly. The ID3 algorithm could be modified in this way. But the general problem of overfitting would remain, and this is what we turn to next.
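The co-predictor effect is easy to verify over the full instance distribution for the X1 = X2 concept: each of X1 and X2 alone has zero information gain, yet X2 becomes perfectly informative once X1 is fixed. A small sketch (helper names are mine):

```python
import math

def entropy(pos, neg):
    n = pos + neg
    return -sum(c / n * math.log2(c / n) for c in (pos, neg) if c > 0)

# All four (X1, X2) combinations; class is true iff X1 == X2.
data = [((x1, x2), x1 == x2) for x1 in (0, 1) for x2 in (0, 1)]

def info_gain(data, feature_index):
    pos = sum(1 for _, y in data if y)
    initial = entropy(pos, len(data) - pos)
    remainder = 0.0
    for v in (0, 1):
        part = [(x, y) for x, y in data if x[feature_index] == v]
        p = sum(1 for _, y in part if y)
        remainder += len(part) / len(data) * entropy(p, len(part) - p)
    return initial - remainder

print(info_gain(data, 0))  # gain of X1 alone: 0.0
# After fixing X1 = 1, splitting on X2 separates the classes perfectly:
sub = [(x, y) for x, y in data if x[0] == 1]
print(info_gain(sub, 1))   # 1.0
```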


Training vs. Test (or Generalization) error

We can consider the effect of overfitting by considering a graph that tracks the performance of the sequence of trees constructed by ID3 as it makes successive split decisions. The x-axis would plot the number of nodes in the tree (a measure of its complexity, or the training effort) and the y-axis would plot the percentage of correct classifications on the training data and also on some test set of data that is not used for learning and represents future data. The following graph is typical: the training set performance would continue to improve, while the test set performance would be expected to peak before the complete tree is grown and start to dip afterwards.
[Figure: accuracy (y-axis) vs. training effort (x-axis). The training curve rises toward an asymptotic training accuracy near 100%; the test curve peaks at its best test accuracy at the optimal stopping point and declines thereafter.]

Why does this happen? The problem is that a full decision tree is fitting the noise in the training data, or finding other spurious patterns that happen to hold in the training data by happenstance but don't generalize. In later splits, there may be very few instances on which to base decisions, and so the algorithm may make a splitting decision that is not statistically supported by the data. Such an unsupported decision will in fact override statistically supported decisions at a higher level. It may be the case that a higher-level decision will generalize, while a lower-level decision will not.


General Causes of Overfitting

Overfitting is a huge issue in machine learning. It comes up for just about all learning algorithms. The basic issue is that there is always a tradeoff between producing a model that fits the training data as well as possible, and a model that generalizes well to new instances. Note: the problem of overfitting occurs even in the absence of noisy data, and can arise simply because there are patterns in the training data that turn out to be insignificant.

Before discussing ways of dealing with overfitting, let us look at some of the possible causes. Overfitting can occur when at least one of the following occurs:

The training set is too small. Patterns that show up in a small training set may be spurious, and due to noise. If they carry over to a larger training set, they are likely to reflect actual patterns in the domain. (Imagine thinking that all dogs are black but all cats are white having seen just a few examples of each!)

The domain is non-deterministic. For example, this can occur because of noise in assigning labels, or other natural variance. This increases the likelihood of spurious patterns that do not reflect actual patterns in the domain. (Recall the basic example we saw in class of fitting a curve to noisy data.)

The hypothesis space H from which hypothesis h : X -> Y is drawn is large. Overfitting requires the ability to fit the noise in the data, which may not be possible with a restricted hypothesis space. For example, consider a hypothesis space consisting of conjunctive Boolean concepts, and suppose the true concept is actually a conjunctive concept but the training set is noisy. Since noise in the data is random, it is highly unlikely that a conjunctive concept will be able to classify the noise correctly, and so the conjunctive concept that performs best on the training set may well be the true concept. Recall also: the example from class of fitting a high-degree polynomial to noisy data; and continuing to split on data even without noise when there are features that turn out to be unimportant.

There are a large number of features. Intuitively, each feature provides an opportunity for spurious patterns to show up. This is particularly an issue with irrelevant features, which are in the domain but actually have no impact on the classification. If there are many irrelevant features, it is quite likely that some of them will appear relevant for a particular data set.
For this reason, some machine learning methods use feature selection to try to pick out the features that are actually relevant before beginning to learn. (Consider splitting on the color of a professor's tie when looking to predict the weather, because it happened to rain every time the tie was red in the training data.) For example, we saw in lecture that if there are |D| = 8 training examples and the true concept is f(x) = x1, but there are also 1999 uninformative features ({X2, ..., X2000}), each taking value true or false on any x with probability 0.5, then:

the probability that any one feature (e.g., X2) perfectly explains the data is 2(1/2)^8 = 1/128 (with the first factor of 2 occurring because it can explain as f(x) = x2 or as its negation)

the probability that at least one of these 1999 features perfectly explains the data is 1 - (127/128)^1999, which is approximately 1

and so it is very likely that an uninformative feature will be selected as the initial branch in ID3, in place of the informative feature x1.
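The arithmetic in the lecture example checks out directly:

```python
# Probability that a given irrelevant binary feature (or its negation)
# matches the labels of 8 random examples: 2 * (1/2)**8 = 1/128.
p_one = 2 * (1 / 2) ** 8
print(p_one)  # 0.0078125, i.e. 1/128

# Probability that at least one of the 1999 irrelevant features matches:
p_any = 1 - (1 - p_one) ** 1999
print(p_any)  # very close to 1
```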


Dealing with Overfitting: Inductive Bias

The basic idea behind virtually all methods for dealing with overfitting is to increase the inductive bias of the learning algorithm. Overfitting is the result of the algorithm being too heavily swayed by the training data. Increasing the bias means making stronger assumptions that are not supported by the data. This means in turn that more data will be needed to counter the assumptions. In particular, noise in the data or spurious patterns, which will tend not to be supported by large amounts of data, will not override the inductive assumptions. As discussed in the last class, increasing the inductive bias could mean using a more restricted hypothesis space. For example, in using decision trees one could stipulate that only trees whose depth is at most 3, or that have at most 10 nodes, or that use at most 4 features (or any combination of these) will be considered. The size of the decision tree could also remain a parameter of the model and be selected through a cross-validation approach. We discuss this idea below. Alternatively, increasing the inductive bias could involve increasing the preference bias of the learning algorithm, so that some hypotheses will be preferred over others, even if they perform worse on the training data. This is often referred to as regularization, and we will revisit this idea frequently in the course. For decision trees, this means that simpler trees that have some training error (i.e., incorrect classifications on the training data) may be preferred to more complex trees that have no training error.

Increasing the Preference Bias of ID3

The approach of introducing a preference bias is the one usually taken in preventing overfitting for decision trees. We have already seen some preference bias. But we will now use pre-pruning and post-pruning to add even more bias. For pre-pruning, the learning algorithm will terminate even when there is a split with positive information gain. The algorithm stops growing the tree at some point, once it thinks growing the tree further would not be justified. Essentially, this approach adds another termination condition to the ID3 algorithm. For pre-pruning, we will consider chi-squared pruning, which uses a statistical test on the data itself to determine whether a split is justified. For post-pruning, the learning algorithm will construct a complete tree, as in the basic ID3 algorithm already seen, and then look to replace one or more subtrees with a single leaf, whose label is the majority classification of the training instances associated with each subtree replaced in this way. Post-pruning is also referred to as a generate-and-prune approach. The algorithm grows an entire tree with ID3 and only afterwards prunes away those parts of the tree that are not justified. For post-pruning, we will consider validation-set pruning, which uses more data to determine whether a split is justified. The advantage of pre-pruning is that less work needs to be done, so it results in a quicker learning algorithm. The advantage of post-pruning is that more information is available for making a pruning decision. The decision in post-pruning can be based on the entire subtree beneath a node, rather than just on the data at a node and the effect of a single split at a node on the class distribution of the data. This allows for consideration of co-predictors. These two approaches are also paradigmatic of methods for dealing with overfitting in other learning frameworks.


Pre-Pruning: Chi-Squared Pruning

A first approach to pruning is to apply a statistical test to the data to determine whether a split on some feature Xk is statistically significant, in terms of the effect of the split on the distribution of classes in the partition of the data induced by the split. We will consider a null hypothesis that the data is independently distributed according to a distribution on data consistent with that at the current node.1 If this null hypothesis cannot be rejected with high probability, then the split is not adopted and we terminate ID3 at this node. This is a form of pre-pruning, since it is based only on the distribution of classes induced by the single decision of splitting at the node, and not by the decisions made as a result of growing a full subtree below this node (as in the case of validation-set pruning). OK, so here is our null hypothesis, on which we will do a statistical test:

Null hypothesis: feature Xk is unrelated to the classification of data given features already branched on before this node.

The split is only accepted if this null hypothesis can be rejected with high probability. The statistical test we use is Pearson's chi-square test, which will be used here as a test of independence. To think about this, suppose that at the current node the data is split 10:10 between negative and positive examples. Furthermore, suppose that you know that there are 8 instances for which Xk is false, and 12 for which Xk is true. We'd like to understand whether a split that generates labeled data 3:5 (on the Xk false branch) and 7:5 (on the Xk true branch) could occur simply by chance.
1 Don't be confused: this is the hypothesis that we form to determine whether or not to reject the split. It is distinct from the hypothesis that is represented by a decision tree itself.

Note: once I tell you how many instances have Xk false, and how many Xk true, then there is only one degree of freedom to fully determine the labels after the split on Xk. Once you know that there are 3 negative examples after an Xk false split, then you can complete all the other numbers of labeled instances (i.e., 5 = 8 - 3, 7 = 10 - 3, 5 = 10 - 5). We define a test statistic Dev(Xk) as follows. Let p and n denote the number of positive and negative examples in the data D before the split. (We consider the Boolean classification setting for simplicity, but everything generalizes.) Remember that D is not the complete training set, but rather the data associated with the current node. For each possible value x of Xk, let px and nx denote the number of positive and negative examples in Dx, the subset of D in which Xk = x. Furthermore, let p̂x = p/(p + n) * |Dx| and n̂x = n/(p + n) * |Dx| denote the number of positive and negative examples we would expect on average after a split on an unrelated feature. Our test statistic is the deviation of the data from this, defined as:

    Dev(Xk) = sum over x in Xk of [ (px - p̂x)^2 / p̂x + (nx - n̂x)^2 / n̂x ]    (3)
What does this mean? The larger this test statistic, the higher the probability that Xk is in fact informative. So, we will tend to reject the null hypothesis and accept the split as this deviation increases. In our example, we have Dev(Xk) = (5 - 4)^2/4 + (3 - 4)^2/4 + (7 - 6)^2/6 + (5 - 6)^2/6 = 0.833. Back to our question: what is the probability that a split of labels such as this could occur just by chance under the null hypothesis? For this, we define the chi-square distribution and will assume that

    Dev(Xk) ~ chi-square(v)    (4)

where chi-square(v) is the chi-square distribution with v degrees of freedom. In our setting of a binary feature and a Boolean classification problem, we just have v = 1, as intuited in the discussion above (where once a single number is defined, all remaining numbers are defined, given that you know the number of instances for which Xk takes each value).

What is the chi-square(v) distribution? Let Q = sum from i = 1 to v of Zi^2 define a random variable that is the sum of the squares of v independent standard Normal random variables Z1, ..., Zv. A standard Normal Zi ~ N(0, 1) has mean 0, variance 1. Then this random variable Q is distributed chi-square(v). Chi-square is a one-parameter distribution (a special case of the Gamma distribution). Let F(z; v) denote the cumulative distribution function for chi-square with v degrees of freedom; for v = 2, chi-square has cumulative distribution function F(z; 2) = 1 - e^(-z/2).

Why is this relevant here? Well, for enough data (and typically having p̂x, n̂x >= 5 for each value x of Xk is sufficient), each quantity (px - p̂x)/sqrt(p̂x) is well approximated as a standard Normal under the null hypothesis. Because there is only one degree of freedom in our statistic, it is appropriate to model Dev(Xk) as being distributed chi-square(v) for v = 1 under the null hypothesis. When there is less data, alternate statistical tests (Yates' correction for continuity and Fisher's exact test) can be used. But this is out of the scope of the class.

We are now ready to use the chi-square distribution to determine the probability that a test statistic of at least Dev(Xk) could have occurred by chance. This is the p-value, and gives the probability of obtaining at least as extreme a classification under the null hypothesis. For example, one can use the chi2cdf function in Matlab or the CHIDIST function in Excel. In our running example, 1 - F(z = 0.8333; v = 1) is approximately 0.36, and we see that there is a reasonable possibility that the split could occur simply by chance.
If the p-value is small (less than a threshold alpha) then we reject the null hypothesis with high confidence; thresholds of alpha = 0.05 or alpha = 0.01 are common, where we are said to reject the null hypothesis at significance level 5% or 1% respectively. We accept the split if the p-value is less than alpha. In our example, we would not reject the null hypothesis, and so we would not accept the split (and create a leaf instead). For a general (non-binary) classification problem, with c = |Y| classes and r = |Xk| values to feature Xk, there are (r - 1)(c - 1) degrees of freedom. To see why, one can imagine a table where each feature value is a row and each class value is a column. The table is completed with the number of examples in each category after the split. In our running example, it is a 2 by 2 table:

                 negative   positive
    Xk = false       3          5
    Xk = true        7          5

Once all but one number in every row and all but one number in every column has been completed, there is enough information to complete the table. Why? Well, because the total number of examples in each row must add up to the number of examples known to have that feature value. And the total number of examples in each column must sum to the number of examples in the data with that class value. In our example, we have (2 - 1)(2 - 1) = 1 degree of freedom.

Given this, we can modify the ID3 algorithm with this additional pre-pruning step by splicing in the following lines:

    Choose the best splitting feature Xk in X
    Dev(Xk) = sum over x in Xk of [ (px - p̂x)^2 / p̂x + (nx - n̂x)^2 / n̂x ]
    v = (r - 1)(c - 1), where Xk has r values and there are c class values
    p-value = 1 - F(Dev(Xk); v)
    If p-value > alpha   // do not reject null hypothesis
        Then Label(T) = the most common classification in D
        Return T

For simple examples of chi-square pruning, consider the following splits on a binary feature of data D consisting of 20 instances, with an equal number of positive and negative examples, and an equal number of instances with Xk true and Xk false:

(a) 10:10 splits to 5:5 and 5:5
(b) 10:10 splits to 6:4 and 4:6
(c) 10:10 splits to 7:3 and 3:7
(d) 10:10 splits to 8:2 and 2:8

Which of these could occur just because of chance under the null hypothesis that the feature is unrelated to the classification? Let confidence parameter alpha = 0.05.

For (a), Dev(Xk) = 0 and 1 - F(0; 1) = 1 > alpha. So, we do not reject the null hypothesis, and we would terminate ID3 and not split.

For (b), Dev(Xk) = (6 - 5)^2/5 + (4 - 5)^2/5 + (4 - 5)^2/5 + (6 - 5)^2/5 = 0.8 and 1 - F(0.8; 1) = 0.37 > alpha, so we do not reject the null hypothesis, and we would terminate ID3 and not split.

For (c), Dev(Xk) = (7 - 5)^2/5 + (3 - 5)^2/5 + (3 - 5)^2/5 + (7 - 5)^2/5 = 3.2 and 1 - F(3.2; 1) = 0.074 > alpha, so we do not reject the null hypothesis, and we would terminate ID3 and not split.

For (d), Dev(Xk) = (8 - 5)^2/5 + (2 - 5)^2/5 + (2 - 5)^2/5 + (8 - 5)^2/5 = 7.2 and 1 - F(7.2; 1) = 0.0073 < alpha, so we reject the null hypothesis, and we would accept the split and continue with ID3.

This all seems quite reasonable: the pattern induced by the fourth split is judged to be statistically significant, while the other three are not and could be explained with some probability under the null hypothesis.
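These four cases can be reproduced with the standard library alone: for one degree of freedom, the chi-square upper-tail probability has the closed form P(Q >= x) = erfc(sqrt(x/2)). A sketch (function names are mine):

```python
import math

def deviation(parts, total_pos, total_neg):
    """Chi-squared statistic Dev(Xk) for a candidate split.
    parts: list of (pos, neg) counts, one per feature value."""
    n = total_pos + total_neg
    dev = 0.0
    for pos, neg in parts:
        size = pos + neg
        exp_pos = total_pos / n * size  # expected counts under the null
        exp_neg = total_neg / n * size
        dev += (pos - exp_pos) ** 2 / exp_pos + (neg - exp_neg) ** 2 / exp_neg
    return dev

def p_value_1df(dev):
    """P(Q >= dev) for Q ~ chi-square with one degree of freedom."""
    return math.erfc(math.sqrt(dev / 2))

splits = [[(5, 5), (5, 5)], [(6, 4), (4, 6)], [(7, 3), (3, 7)], [(8, 2), (2, 8)]]
for parts in splits:
    dev = deviation(parts, 10, 10)
    print(round(dev, 1), round(p_value_1df(dev), 4))
# The p-values match the lecture's 1, 0.37, 0.074, and 0.0073.
```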


Post-Pruning: Validation-Set Pruning

The second pruning method tries to judge whether or not a particular split in a tree is justified by seeing if it actually works for real data (i.e., data that is not used for training). The idea is to use a validation set. This is a portion of the training data that is set aside purely to test whether or not a split is actually a good split. In other words, the validation set is used to validate the model. In this case, a split is validated if it improves performance on the validation set. This provides a simple pruning criterion: prune a subtree if it does not improve performance on the validation set. This turns out to be a simple but powerful approach.

It is crucial that the validation set is separate from the data used to induce the decision tree in the first place. Otherwise the validation set will be validating a pattern that was partly derived from itself in the first place. This does not help to prevent overfitting! Here is pseudocode for the method of validation-set pruning:

    Divide the available data into a training set and a validation set
    Build a complete decision tree T using ID3 on the training set
    For each (non-leaf) node z of T, from the bottom up
        Let Dz be the training data that falls to node z
        Let yz be the majority class in Dz
        Let T' be the tree formed by replacing node z and the subtree below z with a leaf with label yz
        If T' performs as well or better on the validation set than T
            Collapse the subtree at z in T to a leaf with label yz

The data that falls to node z is the training data associated with the subtree rooted at z; i.e., all of the instances x in D with feature values that are consistent with the branching decisions in the tree above node z. Note carefully the different roles of the training and validation set: all training of the tree, and also the value yz adopted for a leaf that replaces a subtree, comes from the training set. The validation set is used to check whether the accuracy is just as good under a simplified tree constructed in this way. The approach of validation-set pruning provides an example of the general idea of using a validation set to select the complexity of a hypothesis. For other learning algorithms we will have different notions of a simpler hypothesis, but the basic idea of removing unnecessary detail will remain the same.
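The pseudocode above can be made concrete. Below is a minimal, illustrative implementation (the dict-based tree representation and all names are my own, not from the lecture): each internal node records the majority class of its training data, and a node is collapsed whenever doing so does not hurt accuracy on the validation set.

```python
def classify(tree, x):
    while "label" not in tree:
        tree = tree["children"][x[tree["feature"]]]
    return tree["label"]

def accuracy(tree, data):
    return sum(classify(tree, x) == y for x, y in data) / len(data)

def prune(node, full_tree, val_data):
    """Validation-set (reduced-error) pruning, bottom up."""
    if "label" in node:
        return
    for child in node["children"].values():
        prune(child, full_tree, val_data)
    before = accuracy(full_tree, val_data)
    saved = dict(node)                 # remember the subtree at this node
    node.clear()
    node["label"] = saved["majority"]  # tentatively collapse to a leaf
    if accuracy(full_tree, val_data) < before:
        node.clear()                   # collapsing hurt: restore the subtree
        node.update(saved)

# A tree that overfits: the lower split on feature 1 is spurious.
tree = {"feature": 0, "majority": 1, "children": {
    0: {"label": 0},
    1: {"feature": 1, "majority": 1,
        "children": {0: {"label": 1}, 1: {"label": 0}}},
}}
val = [((1, 0), 1), ((1, 1), 1), ((0, 0), 0)]
prune(tree, tree, val)
print(accuracy(tree, val))  # 1.0 once the spurious subtree is collapsed
```

On this toy validation set the spurious subtree is replaced by a leaf with its majority label, while the useful root split survives.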


Discussion: Pre- vs Post-Pruning

The advantage of the validation-set approach over chi-square pre-pruning is that it doesn't need to be tuned (how to set alpha?) and is a direct, data-driven method. In addition, the chi-square pre-pruning method requires enough data |D| for the chi-square approximation to be good when performing the statistical test (it only holds exactly in the limit). Moreover, the validation-set method is less heuristic, in that it evaluates not only the effect of a split on Xk but rather the effectiveness of the entire subtree constructed by ID3 below the current node. Specifically, the use of post-pruning can handle co-predictors by building out the subtree first, where pre-pruning via chi-square would clearly fail. A disadvantage of the validation-set method is that it needs to reserve some of the training data (typically 10%-33%) to form the validation set, instead of using all of the data for determining the tree. For the most part, and especially because of this challenge with co-predictors, the validation-set approach tends to work better in practice. In fact, Ross Quinlan, the inventor of ID3, began by using the chi-square method, and later moved to the validation-set method.

Regularization: Trading Empirical Error for Complexity

We will briefly look at a third approach to prevent overfitting through an explicit penalty on the complexity of a learned hypothesis, in order to introduce a preference bias. Recall that methods to address overfitting introduce an implicit inductive bias and prefer shorter trees even when they do not fit the training data as well. In this third approach we make this tradeoff between accuracy on training data and hypothesis complexity explicit. In regularization, we adopt a cost measure on a hypothesis h that is the weighted sum of empirical training error and a measure of the complexity of the hypothesis:

    Cost(h, D) = Error(h, D) + lambda * Complexity(h)    (5)

where lambda > 0 is a parameter, Error(h, D) is the number of examples in D classified incorrectly by h, and Complexity(h) is some measure of complexity or size of the hypothesis. The parameter lambda provides a tradeoff between training error and complexity and needs to be tuned. The appropriate notion of Complexity(h) is sometimes a bit hard to define. Later in the term we will see Bayesian methods to determine a measure of complexity penalty. So far in class we have seen the use of the scalar product on coefficients, w . w, as a measure of complexity. For decision trees, this is often taken to be the number of nodes or just the height of the tree. This process of making a tradeoff between empirical error and hypothesis complexity is called regularization, because it looks for a function that is more regular, or less complex. One seeks the hypothesis h that minimizes the total cost, Cost(h, D). This can typically be formulated as an optimization problem and then solved optimally or approximately. An unfortunate aspect of regularization is that we need to find a value of parameter lambda that provides the best generalization. For this we can return to the idea of a validation set that we saw for validation-set pruning: we hold out some data and use accuracy on this held-out data to identify the value of lambda that gives the best generalization performance. This can be achieved through cross-validation, as explained in the next section.
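To make the tradeoff in equation (5) concrete, here is a toy illustration (the candidate error and node counts are invented for illustration, not from the lecture): each candidate tree is summarized by its training error and number of nodes, and different values of lambda select trees of different complexity.

```python
# Hypothetical candidate trees: (training errors, number of nodes).
candidates = [(0, 25), (1, 9), (3, 4), (7, 1)]

def cost(error, complexity, lam):
    """Regularized cost: Error + lambda * Complexity."""
    return error + lam * complexity

def best_tree(lam):
    return min(candidates, key=lambda c: cost(c[0], c[1], lam))

# Larger lambda penalizes complexity more, selecting simpler trees.
for lam in (0.1, 0.5, 2.0):
    print(lam, best_tree(lam))
```

A small lambda favors the large, exact tree family; a large lambda tolerates more training error to get a smaller tree. In practice lambda would be chosen on held-out data.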


Minimum Description Length

Sometimes, one can also try to directly compare Error and Complexity and in doing so avoid the need to tune lambda. Information theory provides a method to do this: we will measure both of them in bits. This is the idea of the minimum description length approach. We select the hypothesis for which the total number of bits required to encode the hypothesis, and also the data given that the hypothesis is known, is minimized. To gain some intuition: one encoding is a hypothesis that exactly classifies all of the input data D. But this hypothesis may itself be costly to describe. Another encoding is a trivial hypothesis that predicts true always, and requires exactly describing the training data, which can itself be costly. The minimum description length (MDL) principle asserts that the best hypothesis makes the optimal tradeoff between these two encoding costs! The total cost function, which we will try to minimize, becomes:

    Cost(h, D) = Bits(D|h) + Bits(h),    (6)

where Bits(D|h) is a measure of the error of the hypothesis h on data D: the number of bits needed to encode D in the optimal encoding given h. If h classifies D exactly then no bits are required; if h is uninformative then more bits are required. Similarly, Bits(h) is the number of bits needed to describe hypothesis h from the space H of hypotheses, again in the optimal encoding given this hypothesis space.

This idea of MDL has an elegant theoretical underpinning, the full details of which are beyond the scope of this course. Here is a very brief summary. For Bits(D|h), this is given by Shannon's information theory, where the hypothesis is assumed to provide a probabilistic model of the data, and the optimal encoding for x uses −log2 Pr(x|h) bits, where Pr(x|h) is the probability of x given h. Given this, Bits(D|h) = −Σx∈D log2 Pr(x|h). For Bits(h), this is determined in a similar way, with an optimal coding that depends on Pr(h), the prior on different hypotheses. This is where Occam's razor is captured: we can put a higher prior on simpler hypotheses and therefore require fewer bits for their encoding.

From a practical perspective we can adapt the principle of MDL to decision trees by adopting reasonable encodings for the hypothesis (a tree) and the data. For the tree, we can encode it so that the description length grows with the number of nodes (or the number of nodes and edges, if this is not a binary tree).
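The Shannon code-length formula can be sketched directly; here prob is a hypothetical function giving the probability the hypothesis assigns to each example:

```python
import math

def bits_given_h(data, prob):
    """Bits(D|h) = -sum over x in D of log2 Pr(x|h)."""
    return sum(-math.log2(prob(x)) for x in data)
```

A hypothesis that assigns probability 1 to every observed example needs 0 bits, while a maximally uninformative hypothesis on binary labels (probability 1/2 for each) needs exactly one bit per example.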

For the data D given a particular hypothesis h, we can suppose that the instances x1, . . . , xn are known to a receiver, and what is not known is the target classes y1, . . . , yn. (Note that the cost of transmitting x1, . . . , xn is independent of the hypothesis in any case, and so wouldn't affect any comparison of different h.) Now if the classifications are identical to h(xi) for every instance i, then we need no additional bits for their description. For any misclassified instance, we need to identify the instance (requiring log2 n bits) and provide the correct classification (in log2 c bits, where there are c = |Y| target classes). Let Ne(h, D) denote the number of misclassified instances in D given h. Putting this together, we seek a tree with corresponding hypothesis h that minimizes the cost

Cost(h, D) = Ne(h, D)(log2 n + log2 c) + |h|,    (7)

where |h| is the number of nodes in the decision tree. The tree that minimizes this cost can be identified through search. Experimental results suggest that MDL-based approaches for decision trees produce results comparable to those of standard tree-pruning methods, such as the pre- and post-pruning methods described above.
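Equation (7) is straightforward to compute. The following sketch, with made-up error and size numbers, compares an exact tree against a smaller pruned tree under the MDL cost:

```python
import math

def mdl_cost(n_errors, n_nodes, n, c):
    """Cost(h, D) = Ne(h, D) * (log2 n + log2 c) + |h|."""
    return n_errors * (math.log2(n) + math.log2(c)) + n_nodes

# Hypothetical trees on n = 1024 examples with c = 2 classes:
exact = mdl_cost(n_errors=0, n_nodes=31, n=1024, c=2)   # 31 bits of tree
pruned = mdl_cost(n_errors=2, n_nodes=7, n=1024, c=2)   # 2*(10 + 1) + 7 bits
```

Here MDL prefers the pruned tree: its two extra errors cost 22 bits to transmit, but pruning saves 24 bits of tree description, for a total of 29 bits versus 31.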

Experimental Methodology: Cross-Validation

One moral of overfitting is that we need to be very careful about our experimental methodology in testing machine learning algorithms. Early machine learning work reported success by showing that algorithms could learn a model that fit the training data. As we know, this is the wrong goal. We want algorithms that generalize well to unseen data.

In this section we discuss the general and extremely important methodology of cross-validation. This is a technique that can be used both for validation-set methods (e.g., for pruning, or for tuning a model parameter such as λ) and also for evaluating the performance of a fixed learning algorithm. In general, there are two kinds of questions we might be interested in:

1. Q1: How to pick the best learning approach for a problem, or the best parameters of a particular learning approach?

2. Q2: How to estimate the accuracy to expect in generalization to unseen data?

A naive approach that is WRONG would be to put all the data D into a big training set, run each algorithm and different parameterizations of the algorithms on the data, and find the algorithm and parameter settings that minimize prediction error. But this would badly overfit and would not be expected to find the method that best generalizes. Moreover, any reported accuracy would be badly optimistic.


Cross-Validation to Select a Learning Approach and Parameters

What we do instead is to split the data D into training data and validation data (with typically 10-30% reserved for validation). The validation set is often referred to as the hold-out set. Given this, one can:

- train each algorithm (or each parameterization of each algorithm) in turn on the training set and compute its accuracy on the validation set

- find the algorithm and parameterization with best performance on the validation set

- ultimately, one can then (having selected the algorithm and parameters) finally train on all the available data before releasing into an application

But we might have insufficient data for such a test to be statistically valid. Instead, a nice approach is to use cross-validation. The data D is divided up into k equal-sized pieces, referred to as folds. A typical value of k is 10. This is k-fold cross-validation. Given this, we run k different experiments. In each

experiment, k − 1 folds are used as training data, while the remaining fold is used as validation data: the algorithm is trained on (k − 1)/k of the data, and its accuracy is evaluated on the hold-out set of 1/k of the data. The final accuracy of an algorithm is reported as the average on the validation data over the k experiments. This method allows multiple experiments to be run on a single data set and puts the data to better use.

Notice that over the course of the k experiments, each fold gets used as validation data exactly once. Because of this, the validation sets are disjoint across the experiments. For example, for 5-fold cross-validation the approach would be to divide the data D into components (ABCD|E), (ABCE|D), (ABDE|C), (ACDE|B), (BCDE|A), where the first part is the training set and the second the validation set. Although the experiments are not completely independent of each other, since they have overlapping training sets, running cross-validated experiments provides more significant results than running a single experiment, and the method is completely standard in the field.

An extreme form of cross-validation is leave-one-out cross-validation. Here, the number of folds is equal to n, the number of examples in D. In each experiment, all but one of the examples are used as training data, and the validation set consists only of the one remaining example. The performance of the algorithm is the percentage of experiments in which the single hold-out instance is classified correctly. This method is extreme in that the performance for each individual experiment is based on only one example, and is therefore extremely insignificant. But by running n experiments, a statistically significant measure of the performance of the algorithm can be attained. Leave-one-out cross-validation is especially useful when the labeled data set D is extremely small (e.g., tens rather than hundreds of examples).
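The k-fold procedure just described can be sketched as follows; train_fn and eval_fn are placeholders for any learner and any accuracy measure:

```python
def kfold_score(data, train_fn, eval_fn, k=10):
    """Average validation accuracy over k folds (k-fold cross-validation)."""
    folds = [data[i::k] for i in range(k)]  # k roughly equal-sized folds
    scores = []
    for i in range(k):
        val = folds[i]                       # the held-out fold
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        h = train_fn(train)                  # train on the other k - 1 folds
        scores.append(eval_fn(h, val))       # evaluate on the held-out fold
    return sum(scores) / k
```

Leave-one-out cross-validation falls out as the special case k = len(data), where each fold is a single example.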
The general experimental methodology in using cross-validation for the process of selecting a good learner and a good parameterization (often termed just model selection) is as follows:

1. Consider some set of learning algorithms L1, L2, . . . and data D. These could be different algorithms (e.g., decision trees, neural networks, etc.), or different parameterizations of the same algorithm (e.g., different limits on the number of nodes).

2. Use cross-validation to select the learning algorithm L∗ with best average performance on the validation sets.

3. Ultimately, one can then train the best learning algorithm L∗ on all the data D and use the classifier h = L∗(D) to classify future data.


Extending Cross-Validation: Reporting an Unbiased Accuracy

The second question we asked above is Q2: how can we report an unbiased measure of performance, to predict the generalization we can expect when the classifier is used on new data? The average performance determined in the second step, when evaluating a learner on the validation sets, cannot be reported as an unbiased measure of what to expect from the trained classifier. The problem is that we have looked at this validation data in selecting an optimal algorithm, and so we would expect a bias in any reported accuracy statistic. In machine learning parlance, it is common to say that peeking has occurred.

In reporting an unbiased performance measure, it is essential that the data on which it is computed is not used for any kind of tuning or selection of the learning algorithm. For example, the data should not be used even for setting a parameter that controls complexity. A simple thing one can do is to divide the data into three pieces:

- a training set

- a validation set

- a test set

The test set is ONLY used once all model selection is complete. It is to be kept locked up until then; only at the very final step should it be used in reporting final performance.

We can also extend the cross-validation approach so as to continue to make good use of our precious data. For this, we run k experiments, where two folds of the data are held out in each experiment, one to be used for validation and one to be used for testing. The best algorithm is selected by training on the remaining k − 2 folds in each experiment, with the average performance determined over the validation sets. Finally, the performance of the selected learner (and parameterization) is reported over the k test folds. This is an extremely important idea.

For example, when using 5-fold cross-validation with both validation and test sets, the approach would be to divide the data D according to components (ABC|D|E), (ABE|C|D), (ADE|B|C), (CDE|A|B), (BCD|E|A), where the first part is the training set, the second the validation set, and the third the test set. The algorithm that performs best on D, C, B, A and E, when trained on ABC, ABE, ADE, CDE and BCD respectively, is selected, and its performance is finally reported in each case on the independent test sets E, D, C, B and A.
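A minimal sketch of the simple three-way split and the one-time use of the test set; the split fractions and the learner interface are illustrative assumptions:

```python
import random

def three_way_split(data, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle and split data into train / validation / test pieces."""
    d = list(data)
    random.Random(seed).shuffle(d)
    n_test = int(len(d) * test_frac)
    n_val = int(len(d) * val_frac)
    return d[n_test + n_val:], d[n_test:n_test + n_val], d[:n_test]

def select_and_report(learners, data, eval_fn):
    """Pick the learner with best validation accuracy; use the test set once."""
    train, val, test = three_way_split(data)
    best = max(learners, key=lambda L: eval_fn(L(train), val))
    return best, eval_fn(best(train), test)  # test set touched only here
```

The key point the code makes explicit is that eval_fn sees the test set exactly once, after the winning learner has already been chosen on the validation set.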