Agenda
Where are we now?
Decision Trees
What is ID3?
Entropy
Information Gain
Pros and Cons of ID3
An Example - The Simpsons
What is ROC Analysis?
ROC Space
ROC Space Example over predictions
Decision Trees
One of the most widely used classification approaches, thanks to its clear model and presentation
Classifies records by their attribute values
The aim is to estimate the value of a destination (target) field from the source fields
Tree induction: first build the tree, then apply data to the tree to classify it
Each branch node represents a choice between a number of alternatives
Each leaf node represents a classification or decision
Leaf count = rule count
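To make the branch-node/leaf-node picture concrete, here is a minimal Python sketch of applying a hand-built tree to a single record; the tuple-and-dict encoding and the attribute values are illustrative assumptions, not part of any particular induction algorithm:

```python
# A hand-built tree: a branch node is (attribute, {value: subtree}),
# a leaf node is a plain class label. Attribute names are made up.
tree = ("Weight", {
    "light": "not fast",               # leaf: a classification
    "heavy": ("Engine", {              # branch: a choice between alternatives
        "small": "not fast",
        "large": "fast",
    }),
})

def classify(tree, record):
    """Walk from the root to a leaf, following the record's attribute values."""
    while not isinstance(tree, str):   # stop once we reach a leaf
        attribute, branches = tree
        tree = branches[record[attribute]]
    return tree

print(classify(tree, {"Weight": "heavy", "Engine": "large"}))  # fast
```

Counting the leaves of this sketch gives three rules, matching the leaf count = rule count observation above.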
ID3 Algorithm
Iterative Dichotomiser 3
Developed by J. Ross Quinlan in 1979
Based on entropy
Works only with discrete data
Cannot handle defective (missing) data
Its advantage over Hunt's algorithm is that it picks the most informative attribute at each split, whereas Hunt's algorithm picks one at random
Entropy
A measure of the homogeneity of a sample; it tells us how much information a split on each attribute would provide
A completely homogeneous sample has entropy 0
An equally divided (two-class) sample has entropy 1
Formula: E(S) = -Σ p_i log2(p_i), where p_i is the proportion of class i in the sample S
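As a quick check of the two boundary cases above, a minimal Python sketch (the function name and the list-of-proportions interface are our own choices):

```python
from math import log2

def entropy(proportions):
    """E(S) = -sum(p_i * log2(p_i)); p_i = 0 terms contribute nothing."""
    return sum(-p * log2(p) for p in proportions if p > 0)

print(entropy([1.0]))       # 0.0 -> completely homogeneous sample
print(entropy([0.5, 0.5]))  # 1.0 -> equally divided sample
```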
Results
Attribute           Information Gain (IG)
Hair Length <= 5    0.0911
Weight <= 160       0.5900
Age <= 40           0.0183
As the results show, weight is the best attribute for classifying this group.
[Decision tree figure: the root splits on Weight <= 160, with leaves Female and Male.]
Example II
Example II (Cont.)
Information Gain over Engine
Engine: 6 small, 5 medium, 4 large. Three values for attribute Engine, so we need three entropy calculations:
small: 5 no, 1 yes: E(small) = -(5/6)log2(5/6) - (1/6)log2(1/6) = ~0.65
medium: 3 no, 2 yes: E(medium) = -(3/5)log2(3/5) - (2/5)log2(2/5) = ~0.97
large: 2 no, 2 yes: E(large) = 1 (evenly distributed subset)
The whole table has 5 fast and 10 not fast cars, so E(S) = -(5/15)log2(5/15) - (10/15)log2(10/15) = ~0.918.
IG(Engine) = E(S) - [(6/15)*E(small) + (5/15)*E(medium) + (4/15)*E(large)] = 0.918 - 0.850 = 0.068
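This arithmetic can be checked mechanically; a minimal sketch, where the entropy helper and the class-count lists (e.g. [1, 5] for 1 yes / 5 no) are our own conventions:

```python
from math import log2

def entropy(counts):
    """Entropy of a subset from its class counts, e.g. [1, 5] = 1 yes / 5 no."""
    total = sum(counts)
    return sum(-c / total * log2(c / total) for c in counts if c)

e_small, e_medium, e_large = entropy([1, 5]), entropy([2, 3]), entropy([2, 2])
print(round(e_small, 2), round(e_medium, 2), round(e_large, 2))  # 0.65 0.97 1.0

remainder = 6/15 * e_small + 5/15 * e_medium + 4/15 * e_large
print(round(entropy([5, 10]) - remainder, 3))                    # IG(Engine) = 0.068
```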
Example II (Cont.)
Information Gain over SC/Turbo
SC/Turbo: 4 yes, 11 no. Two values for attribute SC/Turbo, so we need two entropy calculations:
yes: 2 yes, 2 no: E(yes) = 1 (evenly distributed subset)
no: 3 yes, 8 no: E(no) = -(3/11)log2(3/11) - (8/11)log2(8/11) = ~0.84
IG(SC/Turbo) = E(S) - [(4/15)*E(yes) + (11/15)*E(no)] = 0.918 - 0.886 = 0.032
Example II (Cont.)
Information Gain over Weight
Weight: 6 average, 4 light, 5 heavy. Three values for attribute Weight, so we need three entropy calculations:
average: 3 no, 3 yes: E(average) = 1 (evenly distributed subset)
light: 3 no, 1 yes: E(light) = -(3/4)log2(3/4) - (1/4)log2(1/4) = ~0.81
heavy: 4 no, 1 yes: E(heavy) = -(4/5)log2(4/5) - (1/5)log2(1/5) = ~0.72
IG(Weight) = E(S) - [(6/15)*E(average) + (4/15)*E(light) + (5/15)*E(heavy)] = 0.918 - 0.856 = 0.062
Example II (Cont.)
Information Gain over Fuel Eco
Fuel Eco: 2 good, 3 average, 10 bad. Three values for attribute Fuel Eco, so we need three entropy calculations:
good: 0 yes, 2 no: E(good) = 0 (no variability)
average: 0 yes, 3 no: E(average) = 0 (no variability)
bad: 5 yes, 5 no: E(bad) = 1 (evenly distributed subset)
We can omit the good and average terms, since those subsets always end up not fast and contribute zero entropy.
IG(Fuel Eco) = E(S) - (10/15)*E(bad) = 0.918 - 0.667 = 0.251
Example II (Cont.)
Results:
Attribute   IG
Engine      0.068
SC/Turbo    0.032
Weight      0.062
Fuel Eco    0.251

Fuel Eco has the highest gain, and is thus chosen for the root node.
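The whole table can be verified at once; the info_gain helper and the [fast, not fast] count encoding are our own conventions. The exact values 0.061 and 0.252 differ from the table's 0.062 and 0.251 only because the slides round intermediate results:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return sum(-c / total * log2(c / total) for c in counts if c)

def info_gain(parent, subsets):
    """IG = E(S) minus the weighted entropies of the subsets for each value."""
    return entropy(parent) - sum(sum(s) / sum(parent) * entropy(s) for s in subsets)

S = [5, 10]  # whole table: 5 fast, 10 not fast -> E(S) = 0.918
gains = {
    "Engine":   info_gain(S, [[1, 5], [2, 3], [2, 2]]),  # small, medium, large
    "SC/Turbo": info_gain(S, [[2, 2], [3, 8]]),          # yes, no
    "Weight":   info_gain(S, [[3, 3], [1, 3], [1, 4]]),  # average, light, heavy
    "Fuel Eco": info_gain(S, [[0, 2], [0, 3], [5, 5]]),  # good, average, bad
}
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")  # 0.068, 0.032, 0.061, 0.252
```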
Example II (Cont.)
Since we selected the Fuel Eco attribute for our root node, it is removed from the table for the remaining calculations. Those calculations now run over the Fuel Eco = bad subset only (5 fast, 5 not fast, so E(S_bad) = 1), because the good and average branches are already pure.
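For completeness, a minimal recursive ID3 sketch under the same conventions (dict-valued rows and our own function names); it removes the chosen attribute before recursing, exactly as described above, and it falls back to a majority-vote leaf where the slides simply stop:

```python
from math import log2
from collections import Counter

def entropy(labels):
    total = len(labels)
    return sum(-c / total * log2(c / total) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Gain of splitting the current subset on attr."""
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [l for row, l in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, attributes):
    if len(set(labels)) == 1:   # homogeneous subset -> leaf
        return labels[0]
    if not attributes:          # inconsistent data; majority vote is one common fallback
        return Counter(labels).most_common(1)[0][0]
    attr = max(attributes, key=lambda a: info_gain(rows, labels, a))
    rest = [a for a in attributes if a != attr]   # chosen attribute is removed
    branches = {}
    for value in set(row[attr] for row in rows):
        kept = [(row, l) for row, l in zip(rows, labels) if row[attr] == value]
        branches[value] = id3([r for r, _ in kept], [l for _, l in kept], rest)
    return (attr, branches)     # same (attribute, branches) shape as the earlier sketch
```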
Example II (Cont.)
Information Gain over Engine
Engine: 1 small, 5 medium, 4 large. Three values for attribute Engine, so we need three entropy calculations:
small: 1 yes, 0 no: E(small) = 0 (no variability)
medium: 2 yes, 3 no: E(medium) = -(2/5)log2(2/5) - (3/5)log2(3/5) = ~0.97
large: 2 yes, 2 no: E(large) = 1 (evenly distributed subset)
IG(Engine) = E(S_bad) - [(5/10)*E(medium) + (4/10)*E(large)] = 1 - 0.885 = 0.115 (the small term vanishes since E(small) = 0)
Example II (Cont.)
Information Gain over SC/Turbo
SC/Turbo: 3 yes, 7 no. Two values for attribute SC/Turbo, so we need two entropy calculations:
yes: 2 yes, 1 no: E(yes) = -(2/3)log2(2/3) - (1/3)log2(1/3) = ~0.92
no: 3 yes, 4 no: E(no) = -(3/7)log2(3/7) - (4/7)log2(4/7) = ~0.99
IG(SC/Turbo) = E(S_bad) - [(3/10)*E(yes) + (7/10)*E(no)] = 1 - 0.965 = 0.035
Example II (Cont.)
Information Gain over Weight
Weight: 3 average, 5 heavy, 2 light. Three values for attribute Weight, so we need three entropy calculations:
average: 3 yes, 0 no: E(average) = 0 (no variability)
heavy: 1 yes, 4 no: E(heavy) = -(1/5)log2(1/5) - (4/5)log2(4/5) = ~0.72
light: 1 yes, 1 no: E(light) = 1 (evenly distributed subset)
IG(Weight) = E(S_bad) - [(5/10)*E(heavy) + (2/10)*E(light)] = 1 - 0.561 = 0.439
Example II (Cont.)
Results:
Attribute   IG
Engine      0.115
SC/Turbo    0.035
Weight      0.439
Weight has the highest gain, and is thus the best choice.
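The second-level gains can be reproduced the same way, restricting the class counts to the Fuel Eco = bad subset:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return sum(-c / total * log2(c / total) for c in counts if c)

def info_gain(parent, subsets):
    return entropy(parent) - sum(sum(s) / sum(parent) * entropy(s) for s in subsets)

S_bad = [5, 5]  # Fuel Eco = bad subset: 5 fast, 5 not fast -> E(S_bad) = 1
print(round(info_gain(S_bad, [[1, 0], [2, 3], [2, 2]]), 3))  # Engine:   0.115
print(round(info_gain(S_bad, [[2, 1], [3, 4]]), 3))          # SC/Turbo: 0.035
print(round(info_gain(S_bad, [[3, 0], [1, 4], [1, 1]]), 3))  # Weight:   0.439
```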
Example II (Cont.)
Since there are only two items where Weight = Light, one for each SC/Turbo value, and each result is consistent, we can simplify the Weight = Light path.
Example II (Cont.)
Updated Table: (Weight = Heavy)
All cars with large engines in this table are not fast. For medium-sized engines, however, the data is inconsistent: identical attribute values lead to both fast and not fast, so there is no attribute left that can separate them and the algorithm cannot proceed further down this path.
ROC Analysis
Receiver Operating Characteristic. The limitations of diagnostic accuracy as a measure of decision performance require introducing the concepts of sensitivity and specificity of a diagnostic test. These measures, and the related indices true positive rate and false positive rate, are more meaningful than accuracy. The ROC curve is a complete description of this decision-threshold effect, indicating all possible combinations of the relative frequencies of the various kinds of correct and incorrect decisions.
Prediction outcome vs. actual value:

                Actual p              Actual n
Prediction p    True Positive (TP)    False Positive (FP)
Prediction n    False Negative (FN)   True Negative (TN)

The associated rates are the True Positive Rate (TPR), False Negative Rate (FNR), False Positive Rate (FPR), and True Negative Rate (TNR).
TPR is equivalent to sensitivity. FPR is equivalent to 1 - specificity. The best possible prediction would have 100% sensitivity and 100% specificity (which means FPR = 0%).
ROC Space
A ROC space is defined by FPR and TPR as the x and y axes respectively, depicting the relative trade-off between true positives (benefits) and false positives (costs). Since TPR is equivalent to sensitivity and FPR equals 1 - specificity, the ROC graph is sometimes called the sensitivity vs. (1 - specificity) plot. Each prediction result yields one point in ROC space.
Calculations
Sensitivity: TPR = TP / P = TP / (TP + FN)
False positive rate: FPR = FP / N = FP / (FP + TN) = 1 - specificity
Accuracy: ACC = (TP + TN) / (P + N)
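A minimal sketch of these three formulas; the confusion-matrix counts fed in are made-up illustrative values, not from the slides:

```python
def roc_point(tp, fn, fp, tn):
    """TPR/FPR pair locating one classifier in ROC space, plus its accuracy."""
    tpr = tp / (tp + fn)                   # sensitivity
    fpr = fp / (fp + tn)                   # 1 - specificity
    acc = (tp + tn) / (tp + fn + fp + tn)
    return tpr, fpr, acc

tpr, fpr, acc = roc_point(tp=60, fn=40, fp=20, tn=80)  # hypothetical counts
print(f"TPR={tpr:.2f}, FPR={fpr:.2f}, ACC={acc:.2f}")  # TPR=0.60, FPR=0.20, ACC=0.70
```

A perfect classifier would sit at the top-left corner of the plot (TPR = 1, FPR = 0).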
References
1. Data Mining Course Lectures, Asst. Prof. Nilüfer Yurtay
2. Quinlan, J. R., "Induction of Decision Trees", Machine Learning, 1(1), 81-106, 1986
3. http://www.cse.unsw.edu.au/~billw/cs9414/notes/ml/06prop/id3/id3.html
4. Han, J., Kamber, M., Pei, J., Data Mining: Concepts and Techniques, 3rd Edition, Elsevier, 2011
5. http://www.cise.ufl.edu/~ddd/cap6635/Fall97/Short-papers/2.htm
6. Metz, C. E., "Basic Principles of ROC Analysis", Seminars in Nuclear Medicine, 8(4), 283-298, 1978