Introduction
Decision Trees
Useful to explore data to gain insight into relationships of a large number of candidate input variables to a target (output) variable
Target:
default on a home-equity line of credit (BAD)
Inputs:
number of delinquent trade lines (DELINQ)
number of credit inquiries (NINQ)
debt-to-income ratio (DEBTINC)
possibly many other inputs
Decision Trees:
Interpretation of the fitted decision tree
The internal nodes contain rules.
Start at the root node (top) and follow the rules until a terminal node (leaf) is reached.
The leaves contain the estimate of the expected value of the target; in this case, the posterior probability of BAD.
The probability can then be used to allocate cases to classes.
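The traversal described above can be sketched in a few lines of Python. This is a minimal illustration, not Enterprise Miner's implementation: the `tree` dictionary, the `posterior` and `classify` helpers, and the 0.5 default cutoff are hypothetical names chosen here, and the single DEBTINC < 45 rule with leaf probabilities of 5% and 21% is assumed from the slides' Debt-to-Income example.

```python
# Hypothetical two-leaf tree. The rule and the leaf probabilities are assumed
# from the slides' Debt-to-Income example (about 5% BAD vs. about 21% BAD).
tree = {
    "rule": lambda case: case["DEBTINC"] < 45,
    "yes": {"p_bad": 0.05},
    "no": {"p_bad": 0.21},
}


def posterior(node, case):
    """Follow the rules from the root until a leaf is reached."""
    while "rule" in node:
        node = node["yes"] if node["rule"](case) else node["no"]
    return node["p_bad"]


def classify(node, case, cutoff=0.5):
    """Allocate the case to class 1 (BAD) when the leaf probability reaches the cutoff."""
    return int(posterior(node, case) >= cutoff)


print(posterior(tree, {"DEBTINC": 30}))
print(classify(tree, {"DEBTINC": 50}, cutoff=0.2))
```

A deeper tree only adds more nested `rule` nodes; the loop in `posterior` stays the same.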
[Figure: decision tree. Root node: 10% BAD. "Yes" branch: child node with n = 3,350, 5% BAD. "No" branch: child node with n = 1,650, 21% BAD.]
The method is recursive because each subgroup results from splitting a subgroup from a previous split. Thus, the 3,350 cases in the left child node and the 1,650 cases in the right child node are split again in similar fashion.
Split Search: Which splits are to be considered?
Splitting Criterion: Which split is best?
Stopping Rule: When should the splitting stop?
Pruning Rule: Should some branches be lopped off?
Splitting Criteria
How is the best split determined? In some situations, the worth of a split is obvious.
If the expected target is the same in the child nodes as in the parent node, no improvement was made, and the split is worthless.
In contrast, if a split results in pure child nodes, the split is indisputably best.
For classification trees, the three most widely used splitting criteria are based on the Pearson chi-squared test, the Gini index, and entropy.
All three measure the difference in class distributions across the child nodes.
The three methods usually give similar results.
Splitting Criteria
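Debt-to-Income Ratio < 45
              Left    Right    Total
Not Bad       3196     1304     4500
Bad            154      346      500

Three-branch split (partial table)
Not Bad, left branch: 2521; Not Bad total: 4500
Bad, by branch: 115, 162, 223; Bad total: 500

Perfect Split
              Left    Right    Total
Not Bad       4500        0     4500
Bad              0      500      500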
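To make the comparison concrete, the sketch below computes two of the criteria named earlier, the Pearson chi-squared statistic and the reduction in entropy, for the Debt-to-Income split and for the perfect split. The function names are hypothetical, and the code uses the textbook formulas rather than whatever adjustments a particular tool applies (for example, p-value or Bonferroni corrections).

```python
import math

# (Not Bad, Bad) counts per child node, taken from the slide tables.
debtinc_split = [(3196, 154), (1304, 346)]
perfect_split = [(4500, 0), (0, 500)]


def pearson_chi2(children):
    """Pearson chi-squared statistic for a children-by-class count table."""
    n = sum(sum(child) for child in children)
    class_totals = [sum(col) for col in zip(*children)]
    stat = 0.0
    for child in children:
        size = sum(child)
        for observed, class_total in zip(child, class_totals):
            expected = size * class_total / n
            stat += (observed - expected) ** 2 / expected
    return stat


def entropy(counts):
    """Shannon entropy (base 2) of a class-count vector."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)


def entropy_reduction(children):
    """Parent entropy minus the size-weighted entropy of the children."""
    parent = [sum(col) for col in zip(*children)]
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


for name, split in [("DEBTINC < 45", debtinc_split), ("perfect", perfect_split)]:
    print(name, round(pearson_chi2(split), 2), round(entropy_reduction(split), 4))
```

Both measures are zero when the children have the same class mix as the parent and reach their maximum for the perfect split, which matches the "worthless" and "best" cases described in the text.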
Split Criteria
The best split is defined as one that does the best job of separating the data into groups where a single class predominates in each group.
The measure used to evaluate a potential split is purity.
The best split is one that increases the purity of the subsets by the greatest amount.
A good split also creates nodes of similar size, or at least does not create very small nodes.
Leaf nodes: 0.1^2 + 0.9^2 = 0.82 (close to pure)
Gini score = 0.5 * 0.82 + 0.5 * 0.82 = 0.82 (close to pure)
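The arithmetic above can be checked with a short sketch. Here the score is the sum of squared class proportions, so 1.0 means a pure node; `purity` and `gini_score` are hypothetical helper names, and the (10, 90) and (90, 10) counts are made-up leaves chosen only to match the 10%/90% proportions and equal weights in the example.

```python
def purity(counts):
    """Sum of squared class proportions: 1.0 for a pure node."""
    n = sum(counts)
    return sum((c / n) ** 2 for c in counts)


def gini_score(children):
    """Size-weighted average of the leaf purities."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * purity(child) for child in children)


# Two equal-sized leaves, each split 10% / 90% (hypothetical counts).
slide_example = [(10, 90), (90, 10)]
print(round(gini_score(slide_example), 2))

# (Not Bad, Bad) counts for the Debt-to-Income Ratio < 45 split.
debtinc_split = [(3196, 154), (1304, 346)]
print(gini_score(debtinc_split) > purity((4500, 500)))
```

The second comparison shows the split raising the weighted purity above that of the unsplit parent node, which is what "increases purity" means in this scoring.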
Benefits of Trees
1. Easy to understand
Pruning
Explore the types of decision tree models available in Enterprise Miner.
Build a decision tree model.
Examine the model results and interpret these results.
Choose a decision threshold theoretically and empirically.
The Scenario
Determine who should be approved for a home equity loan. The target variable is a binary variable that indicates whether an applicant eventually defaulted on the loan. The input variables are variables such as the amount of the loan, amount due on the existing mortgage, the value of the property, and the number of recent credit inquiries.
The HMEQ data set contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates if an applicant eventually defaulted or was seriously delinquent. This adverse outcome occurred in 1,189 cases (20%). For each applicant, 12 input variables were recorded. Presume that every two dollars loaned eventually returns three dollars if the loan is paid off in full.
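The stated return can be turned into a theoretical decision threshold. The sketch below is one way to do it under a loudly stated assumption that is not in the source: a loan that defaults loses the entire amount loaned. With that assumption, each dollar loaned earns $0.50 if repaid and loses $1.00 if not, and approval breaks even when the probability of BAD equals 1/3. The constant and function names are hypothetical.

```python
from fractions import Fraction

# Every two dollars loaned returns three: $0.50 profit per dollar if repaid.
PROFIT_IF_REPAID = Fraction(1, 2)
# ASSUMPTION (not stated in the source): the full dollar is lost on default.
LOSS_IF_DEFAULT = Fraction(1)


def expected_profit(p_bad):
    """Expected profit per dollar loaned for an applicant with P(BAD) = p_bad."""
    return (1 - p_bad) * PROFIT_IF_REPAID - p_bad * LOSS_IF_DEFAULT


def break_even_threshold():
    """P(BAD) at which approving and rejecting have the same expected profit."""
    return PROFIT_IF_REPAID / (PROFIT_IF_REPAID + LOSS_IF_DEFAULT)


print(break_even_threshold())
print(expected_profit(Fraction(1, 5)))
```

Under this assumption, applicants whose estimated probability of BAD is below the break-even value have positive expected profit; a different recovery rate on defaulted loans would move the threshold.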
Misclassification error
Error = classifying a record as belonging to one class when it belongs to another class.
Error rate = percent of misclassified records out of the total records in the validation data
Confusion Matrix
Classification Confusion Matrix
                Predicted Class
Actual Class       1        0
     1           201       85
     0            25     2689

201 1s correctly classified as 1
85 1s incorrectly classified as 0
25 0s incorrectly classified as 1
2689 0s correctly classified as 0
Error Rate
Overall error rate = (25 + 85)/3000 = 3.67%
Accuracy = 1 - err = (201 + 2689)/3000 = 96.33%
If multiple classes, error rate is: (sum of misclassified records)/(total records)
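The same calculation as a small sketch, using the counts from the confusion matrix. The dictionary layout keyed by (actual, predicted) and the `error_rate` name are choices made here for illustration; because the function simply sums every off-diagonal cell, it also covers the multiple-class case.

```python
# Keys are (actual class, predicted class); counts are from the confusion matrix.
confusion = {
    (1, 1): 201,   # 1s correctly classified as 1
    (1, 0): 85,    # 1s incorrectly classified as 0
    (0, 1): 25,    # 0s incorrectly classified as 1
    (0, 0): 2689,  # 0s correctly classified as 0
}


def error_rate(matrix):
    """Share of records whose predicted class differs from the actual class."""
    total = sum(matrix.values())
    wrong = sum(n for (actual, predicted), n in matrix.items() if actual != predicted)
    return wrong / total


err = error_rate(confusion)
print(f"error rate = {err:.2%}, accuracy = {1 - err:.2%}")
```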
Default cutoff value is 0.50
If >= 0.50, classify as 1
If < 0.50, classify as 0
Can use different cutoff values
Typically, error rate is lowest for cutoff = 0.50
Cutoff Table
Actual Class   Prob. of "1"     Actual Class   Prob. of "1"
     1            0.996              1            0.506
     1            0.988              0            0.471
     1            0.984              0            0.337
     1            0.980              1            0.218
     1            0.948              0            0.199
     1            0.889              0            0.149
     1            0.848              0            0.048
     0            0.762              0            0.038
     1            0.707              0            0.025
     1            0.681              0            0.022
     1            0.656              0            0.016
     0            0.622              0            0.004
If cutoff is 0.50: thirteen records are classified as 1
If cutoff is 0.80: seven records are classified as 1
Cutoff = 0.25
Classification Confusion Matrix
                  Predicted Class
Actual Class      owner    non-owner
owner               11         1
non-owner            4         8

Cutoff = 0.75
Classification Confusion Matrix
                  Predicted Class
Actual Class      owner    non-owner
owner                7         5
non-owner            1        11
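The effect of the cutoff can be reproduced directly from the 24 records in the cutoff table. This is a minimal sketch: `confusion_at` is a hypothetical helper, and treating class 1 as "owner" and class 0 as "non-owner" is an assumption made to line the table up with the confusion matrices.

```python
# Actual class and estimated probability of "1" for the 24 records in the
# cutoff table. ASSUMPTION: class 1 = owner, class 0 = non-owner.
actual = [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
          1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
prob = [0.996, 0.988, 0.984, 0.980, 0.948, 0.889, 0.848, 0.762,
        0.707, 0.681, 0.656, 0.622, 0.506, 0.471, 0.337, 0.218,
        0.199, 0.149, 0.048, 0.038, 0.025, 0.022, 0.016, 0.004]


def confusion_at(cutoff):
    """Counts keyed by (actual, predicted) when prob >= cutoff means class 1."""
    counts = {(a, p): 0 for a in (1, 0) for p in (1, 0)}
    for a, pr in zip(actual, prob):
        counts[(a, int(pr >= cutoff))] += 1
    return counts


for cutoff in (0.25, 0.50, 0.75):
    m = confusion_at(cutoff)
    errors = m[(1, 0)] + m[(0, 1)]
    print(cutoff, m, f"errors = {errors}/24")
```

On these records the 0.50 cutoff produces fewer misclassifications than 0.25 or 0.75, in line with the remark that the error rate is typically lowest at 0.50.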
Row  Default   Loan  Mortgage   Value  Reason   _NODE_  _LEAF_  P_DEFAULT1  P_DEFAULT0  F_DEFAULT  R_DEFAULT1  R_DEFAULT0
  1     1     24900    62191    83694  HomeImp    16       4      1.0000      0.0000       1        0.0000      0.0000
  2     1     13400   131524   148356  DebtCon    16       4      1.0000      0.0000       1        0.0000      0.0000
  3     1     65500   205156   290239  DebtCon    16       4      1.0000      0.0000       1        0.0000      0.0000
  4     1     16800    27623    88231  DebtCon    16       4      1.0000      0.0000       1        0.0000      0.0000
  5     1     24700    79347   108238  DebtCon    16       4      1.0000      0.0000       1        0.0000      0.0000
  6     1     15500    82054   104627  DebtCon    16       4      1.0000      0.0000       1        0.0000      0.0000
  7     1      6300    12476    32559  HomeImp     9       2      1.0000      0.0000       1        0.0000      0.0000
  8     1     20600    52946    83558  DebtCon    16       4      1.0000      0.0000       1        0.0000      0.0000
  9     1     20100    16755    29412  HomeImp    16       4      1.0000      0.0000       1        0.0000      0.0000
 10     1     24000    88783   116967  DebtCon    16       4      1.0000      0.0000       1        0.0000      0.0000
 11     1     30200    80951   116160  DebtCon    16       4      1.0000      0.0000       1        0.0000      0.0000
 12     1      6500   183860   208910  DebtCon    16       4      1.0000      0.0000       1        0.0000      0.0000
 13     1     11800    74512    93328  HomeImp    16       4      1.0000      0.0000       1        0.0000      0.0000
 14     1     21700    24984    92297  DebtCon     9       2      1.0000      0.0000       1        0.0000      0.0000
 15     1      5700    74172    79846  DebtCon    16       4      1.0000      0.0000       1        0.0000      0.0000
 16     1     10300    70147   122124  HomeImp    16       4      1.0000      0.0000       1        0.0000      0.0000
 17     1     11800    67678   108092  HomeImp     9       2      1.0000      0.0000       1        0.0000      0.0000
 18     1     12000    76345    89036  HomeImp    16       4      1.0000      0.0000       1        0.0000      0.0000
 19     1     15200   105328   113931  HomeImp    16       4      1.0000      0.0000       1        0.0000      0.0000
 20     1      9700    32660    54536  DebtCon    16       4      1.0000      0.0000       1        0.0000      0.0000
 21     1     40000     4742        .  DebtCon     7      11      0.8282      0.1718       1        0.1718     -0.1718
 22     1     40000    53543        .  DebtCon     7      11      0.8282      0.1718       1        0.1718     -0.1718
 23     0     12000    88000   118750  HomeImp     7      11      0.8282      0.1718       0       -0.8282      0.8282
 24     1     24800    37200    67000  DebtCon     7      11      0.8282      0.1718       1        0.1718     -0.1718
 25     0     41100    94600   151000  DebtCon     7      11      0.8282      0.1718       0       -0.8282      0.8282
 26     1     40000   120000   159000  DebtCon     7      11      0.8282      0.1718       1        0.1718     -0.1718
 27     1     20000        .   115750  HomeImp     7      11      0.8282      0.1718       1        0.1718     -0.1718
 28     1     10000        .    65088  HomeImp     7      11      0.8282      0.1718       1        0.1718     -0.1718
 29     1     10000    69727    90312  HomeImp     7      11      0.8282      0.1718       1        0.1718     -0.1718
 30     1     12000    42000    60000  DebtCon     7      11      0.8282      0.1718       1        0.1718     -0.1718
 31     1     17600    76043    95605  DebtCon     7      11      0.8282      0.1718       1        0.1718     -0.1718
 32     0     12000    87000   101200  DebtCon     7      11      0.8282      0.1718       0       -0.8282      0.8282
 33     1     10000    76700    97800  DebtCon     7      11      0.8282      0.1718       1        0.1718     -0.1718
 34     1     45000    47321   115000  DebtCon     7      11      0.8282      0.1718       1        0.1718     -0.1718
 35     1     15000    29000   105000  HomeImp     7      11      0.8282      0.1718       1        0.1718     -0.1718
 36     1      8500    48961    73550  HomeImp     7      11      0.8282      0.1718       1        0.1718     -0.1718
 37     1      8500    18240    40200  HomeImp     7      11      0.8282      0.1718       1        0.1718     -0.1718
 38     0     10000    34767    51000  HomeImp     7      11      0.8282      0.1718       0       -0.8282      0.8282
 39     1     47000   164411   235500  DebtCon     7      11      0.8282      0.1718       1        0.1718     -0.1718
 40     0     10000    32000    59000  DebtCon     7      11      0.8282      0.1718       0       -0.8282      0.8282
 41     1      6000     9660    35900  DebtCon     7      11      0.8282      0.1718       1        0.1718     -0.1718
 42     1     15000    45000    68250  DebtCon     7      11      0.8282      0.1718       1        0.1718     -0.1718
 43     1      6000    24600    30500  HomeImp     7      11      0.8282      0.1718       1        0.1718     -0.1718
I_DEFAULT and U_DEFAULT are 1 in every row; _WARN_ is empty. A period (.) marks a missing value.
Consequences of a Decision
Actual 0: False Positive (decision 1), True Negative (decision 0)