
Predictive Modeling Using Decision Trees

Introduction
Decision Trees

Powerful and popular technique for classification and prediction
Represent rules, which can be expressed in plain English:
IF Age <= 43 AND Sex = Male AND Credit Card Insurance = No
THEN Life Insurance Promotion = No

Useful for exploring data to gain insight into the relationships between a large number of candidate input variables and a target (output) variable
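As an illustration of how a fitted tree maps to such rules, here is a minimal sketch using scikit-learn on a small made-up data set (the column names and values are hypothetical, not from the slides); export_text prints the fitted tree as nested IF/ELSE rules.

# A minimal sketch (hypothetical data): fit a small tree and print its rules.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up records mirroring the Age / Sex / Credit Card Insurance example;
# Male and CCInsurance are coded 0/1.
data = pd.DataFrame({
    "Age":           [45, 40, 42, 30, 55, 38, 61, 26],
    "Male":          [1, 1, 0, 1, 0, 1, 0, 1],
    "CCInsurance":   [0, 0, 1, 1, 1, 0, 0, 1],
    "LifeInsPromo":  [0, 0, 1, 1, 1, 0, 1, 1],
})

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(data[["Age", "Male", "CCInsurance"]], data["LifeInsPromo"])

# export_text prints the fitted tree as nested IF/ELSE rules
print(export_text(tree, feature_names=["Age", "Male", "CCInsurance"]))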

Decision Tree: What Is It?


A structure that can be used to divide up a large collection of records into successively smaller sets of records by applying a sequence of simple decision rules
A decision tree model consists of a set of rules for dividing a large heterogeneous population into smaller, more homogeneous groups with respect to a particular target variable

Decision Trees: HMEQ Example

Banking marketing scenario (HMEQ):

Target:
default on a home-equity line of credit (BAD)

Inputs:
number of delinquent trade lines (DELINQ)
number of credit inquiries (NINQ)
debt-to-income ratio (DEBTINC)
possibly many other inputs

Introduction to Decision Tree modeling

Decision Trees:

Interpretation of the fitted decision tree:
The internal nodes contain rules. Start at the root node (top) and follow the rules until a terminal node (leaf) is reached.
The leaves contain the estimate of the expected value of the target, in this case the posterior probability of BAD. This probability can then be used to allocate cases to classes.

Decision Tree Template


Drawn top-to-bottom or left-to-right
Top (or left-most) node = Root Node
Descendent node(s) = Child Node(s)
Bottom (or right-most) node(s) = Leaf Node(s)
Unique path from root to each leaf = Rule

[Diagram: a root node splitting into child nodes, with further splits ending in leaf nodes]

Divide and Conquer


The tree is fitted to the data by recursive partitioning. Partitioning refers to segmenting the data into subgroups that are as homogeneous as possible with respect to the target.
In this case, the binary split (Debt-to-Income Ratio < 45) was chosen. The 5,000 cases were split into two groups, one with a 5% BAD rate and the other with a 21% BAD rate.

[Diagram: root node, n = 5,000, 10% BAD; split on Debt-to-Income Ratio < 45; "yes" child node, n = 3,350, 5% BAD; "no" child node, n = 1,650, 21% BAD]
The method is recursive because each subgroup results from splitting a subgroup from a previous split. Thus, the 3,350 cases in the left child node and the 1,650 cases in the right child node are split again in similar fashion.
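A minimal sketch of one such partitioning step, on simulated data rather than the actual HMEQ file (the 5%/21% BAD rates are built into the simulation only to mirror the example above):

# A minimal sketch of one partitioning step on simulated data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({"DEBTINC": rng.uniform(10, 80, n)})
# simulate a target whose default rate rises with the debt-to-income ratio
df["BAD"] = (rng.uniform(size=n) < np.where(df["DEBTINC"] < 45, 0.05, 0.21)).astype(int)

left = df[df["DEBTINC"] < 45]     # "yes" branch of the split
right = df[df["DEBTINC"] >= 45]   # "no" branch of the split

print(f"parent: n={len(df)},   BAD rate={df['BAD'].mean():.1%}")
print(f"left:   n={len(left)},  BAD rate={left['BAD'].mean():.1%}")
print(f"right:  n={len(right)}, BAD rate={right['BAD'].mean():.1%}")
# Recursion: each child node would now be searched for its own best split.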

The Cultivation of Trees

Split Search: Which splits are to be considered?
Splitting Criterion: Which split is best?
Stopping Rule: When should the splitting stop?
Pruning Rule: Should some branches be lopped off?

Splitting Criteria
How is the best split determined? In some situations, the worth of a split is obvious. If the expected target is the same in the child nodes as in the parent node, no improvement was made, and the split is worthless. In contrast, if a split results in pure child nodes, the split is undisputedly best.
For classification trees, the three most widely used splitting criteria are based on the Pearson chi-squared test, the Gini index, and entropy. All three measure the difference in class distributions across the child nodes, and the three methods usually give similar results.

Splitting Criteria
Debt-to-Income Ratio < 45 (two-way split):

             Left    Right   Total
Not Bad      3196    1304    4500
Bad           154     346     500

A Competing Three-Way Split:

             Left    Center  Right   Total
Not Bad      2521    1188     791    4500
Bad           115     162     223     500

Perfect Split:

             Left    Right   Total
Not Bad      4500       0    4500
Bad             0     500     500
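The worth of these candidate splits can be checked numerically. This is a minimal sketch assuming numpy and scipy are available; the counts come from the tables above, the Gini value follows the slides' "sum of squared proportions" definition (1 = pure), and the chi-squared value is the Pearson statistic for the child-node contingency table.

# A minimal sketch: score each candidate split with chi-squared, Gini, and entropy.
import numpy as np
from scipy.stats import chi2_contingency

def split_worth(table):
    """table rows: [Not Bad counts per child], [Bad counts per child]."""
    counts = np.asarray(table, dtype=float)
    total = counts.sum()
    gini, entropy = 0.0, 0.0
    for child in counts.T:                      # one column per child node
        p = child / child.sum()
        w = child.sum() / total                 # share of cases in this child
        gini += w * np.sum(p ** 2)              # slides' Gini: sum of squared proportions
        p = p[p > 0]
        entropy += w * -np.sum(p * np.log2(p))  # 0 bits = pure
    chi2 = chi2_contingency(counts, correction=False)[0]   # Pearson statistic
    return chi2, gini, entropy

splits = {
    "Debt-to-Income < 45": [[3196, 1304], [154, 346]],
    "three-way split":     [[2521, 1188, 791], [115, 162, 223]],
    "perfect split":       [[4500, 0], [0, 500]],
}
for name, table in splits.items():
    chi2, gini, entropy = split_worth(table)
    print(f"{name:20s} chi2={chi2:8.1f}  gini={gini:.3f}  entropy={entropy:.3f}")
# Higher chi-squared, higher Gini (with this definition), and lower entropy all
# indicate purer child nodes; all three rank the perfect split best.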

Decision Tree Types


Binary trees only two choices in each split. Can be nonuniform (uneven) in depth N-way trees or ternary trees three or more choices in at least one of its splits (3-way, 4-way, etc.)

Split Criteria
The best split is defined as one that does the best job of separating the data into groups where a single class predominates in each group
The measure used to evaluate a potential split is purity
The best split is one that increases the purity of the subsets by the greatest amount
A good split also creates nodes of similar size, or at least does not create very small nodes

Tests for Choosing Best Split


Purity (Diversity) Measures:

Gini (population diversity)
Entropy (information gain)

Gini (Population Diversity)


The Gini measure of a node is the sum of the squares of the proportions of the classes.

Root Node: 0.5^2 + 0.5^2 = 0.5 (even balance)

Leaf Nodes: 0.1^2 + 0.9^2 = 0.82 (close to pure)
Gini Score of the split = 0.5*0.82 + 0.5*0.82 = 0.82 (close to pure)
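The same arithmetic in a few lines of Python (the 0.5 weights reflect the assumption that each leaf holds half of the cases):

# Gini as defined above: sum of squared class proportions (1 = pure).
def gini(proportions):
    return sum(p ** 2 for p in proportions)

root = gini([0.5, 0.5])                 # 0.50  (even balance)
leaf = gini([0.1, 0.9])                 # 0.82  (close to pure)
split_score = 0.5 * leaf + 0.5 * leaf   # weight each leaf by its share of cases
print(root, leaf, split_score)          # 0.5 0.82 0.82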

Decision Tree Advantages


1. Easy to understand
2. Map nicely to a set of business rules
3. Applied to real problems
4. Make no prior assumptions about the data
5. Able to process both numerical and categorical data

Benefits of Trees

Interpretability: tree-structured presentation
Mixed measurement scales
Regression trees
Handling of outliers
Handling of missing values


The Right-Sized Tree


Stunting

Pruning


Building and Interpreting Decision Trees

Explore the types of decision tree models available in Enterprise Miner.
Build a decision tree model.
Examine the model results and interpret these results.
Choose a decision threshold theoretically and empirically.


The Scenario

Determine who should be approved for a home equity loan. The target variable is a binary variable that indicates whether an applicant eventually defaulted on the loan. The input variables include the amount of the loan, the amount due on the existing mortgage, the value of the property, and the number of recent credit inquiries.


The HMEQ data set contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates if an applicant eventually defaulted or was seriously delinquent. This adverse outcome occurred in 1,189 cases (20%). For each applicant, 12 input variables were recorded. Presume that every two dollars loaned eventually returns three dollars if the loan is paid off in full.
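One way to turn the two-dollars-in, three-dollars-back assumption into numbers is the sketch below. It assumes the entire $2 principal is lost on a default (no recovery), which is an interpretation rather than something stated in the slides.

# A minimal sketch of the assumed payoff per $2 loaned.
def expected_profit(p_default, principal=2.0, repayment=3.0):
    profit_if_good = repayment - principal   # $1 profit if the loan is repaid
    loss_if_bad = principal                  # $2 lost if the borrower defaults (assumption)
    return (1 - p_default) * profit_if_good - p_default * loss_if_bad

# Break-even: expected profit is zero when p_default = 1/3,
# which matches the decision threshold derived later.
for p in (0.1, 1/3, 0.5):
    print(p, round(expected_profit(p), 3))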


Accuracy Measures (Classification)

Misclassification error

Error = classifying a record as belonging to one class when it belongs to another class.

Error rate = percent of misclassified records out of the total records in the validation data

Confusion Matrix
Classification Confusion Matrix

                   Predicted 1   Predicted 0
Actual 1               201            85
Actual 0                25          2689

201 actual 1s correctly classified as 1
85 actual 1s incorrectly classified as 0
25 actual 0s incorrectly classified as 1
2689 actual 0s correctly classified as 0

Error Rate
Classification Confusion Matrix

                   Predicted 1   Predicted 0
Actual 1               201            85
Actual 0                25          2689

Overall error rate = (25 + 85) / 3000 = 3.67%
Accuracy = 1 - error rate = (201 + 2689) / 3000 = 96.33%
With multiple classes, the error rate is the total number of misclassified records (all off-diagonal cells) divided by the total number of records.
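A minimal sketch that reproduces these numbers from the confusion matrix above:

# Error rate and accuracy from the confusion matrix.
import numpy as np

#                 predicted 1, predicted 0
conf = np.array([[201,          85],     # actual 1
                 [ 25,        2689]])    # actual 0

total = conf.sum()                 # 3000 validation records
errors = conf[0, 1] + conf[1, 0]   # off-diagonal cells: 85 + 25
error_rate = errors / total        # 0.0367
accuracy = conf.trace() / total    # (201 + 2689) / 3000 = 0.9633
print(f"error rate = {error_rate:.2%}, accuracy = {accuracy:.2%}")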

Cutoff for classification


Most data mining algorithms classify via a two-step process. For each record:
1. Compute the probability of belonging to class 1
2. Compare it to the cutoff value and classify accordingly

The default cutoff value is 0.50:
If probability >= 0.50, classify as 1
If probability < 0.50, classify as 0
Different cutoff values can be used
Typically, the error rate is lowest for a cutoff of 0.50

Cutoff Table
Actual Class   Prob. of "1"      Actual Class   Prob. of "1"
     1            0.996               1             0.506
     1            0.988               0             0.471
     1            0.984               0             0.337
     1            0.980               1             0.218
     1            0.948               0             0.199
     1            0.889               0             0.149
     1            0.848               0             0.048
     0            0.762               0             0.038
     1            0.707               0             0.025
     1            0.681               0             0.022
     1            0.656               0             0.016
     0            0.622               0             0.004

If the cutoff is 0.50, thirteen records are classified as 1.
If the cutoff is 0.80, seven records are classified as 1.
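A short sketch applying both cutoffs to the 24 probabilities in the table (reproducing the counts quoted above):

# Count how many records clear each cutoff.
probs = [0.996, 0.988, 0.984, 0.980, 0.948, 0.889, 0.848, 0.762,
         0.707, 0.681, 0.656, 0.622, 0.506, 0.471, 0.337, 0.218,
         0.199, 0.149, 0.048, 0.038, 0.025, 0.022, 0.016, 0.004]

for cutoff in (0.50, 0.80):
    classified_as_1 = sum(p >= cutoff for p in probs)
    print(f"cutoff {cutoff}: {classified_as_1} records classified as 1")
# cutoff 0.50 -> 13 records, cutoff 0.80 -> 7 records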

Confusion Matrix for Different Cutoffs


Cutoff probability for success = 0.25:

Classification Confusion Matrix
                    Predicted owner   Predicted non-owner
Actual owner               11                  1
Actual non-owner            4                  8

Cutoff probability for success = 0.75:

Classification Confusion Matrix
                    Predicted owner   Predicted non-owner
Actual owner                7                  5
Actual non-owner            1                 11
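How the confusion matrix shifts with the cutoff can be sketched with scikit-learn; the actual classes and probabilities below are hypothetical stand-ins, not the owner/non-owner data in the tables above.

# A minimal sketch: confusion matrices at two different cutoffs.
import numpy as np
from sklearn.metrics import confusion_matrix

actual = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0])                     # hypothetical classes
prob_1 = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.45, 0.4, 0.3, 0.2, 0.1])  # hypothetical probabilities

for cutoff in (0.25, 0.75):
    predicted = (prob_1 >= cutoff).astype(int)
    # labels=[1, 0] puts the "success" class first, as in the tables above
    print(f"cutoff = {cutoff}")
    print(confusion_matrix(actual, predicted, labels=[1, 0]))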

[Scored data set excerpt: HMEQ inputs such as LOAN, MORTGAGE, VALUE, and REASON (HomeImp / DebtCon) shown alongside the scoring output columns _NODE_, _LEAF_, P_DEFAULT1, P_DEFAULT0, I_DEFAULT, U_DEFAULT, F_DEFAULT, R_DEFAULT1, and R_DEFAULT0. For example, cases in leaf 4 receive P_DEFAULT1 = 1.0000, while cases in leaf 11 receive P_DEFAULT1 = 0.8282.]

Consequences of a Decision

                   Decision 1        Decision 0
Actual 1           True Positive     False Negative
Actual 0           False Positive    True Negative


Consequences of a Decision: Profit matrix (SAS EM)


                   Decision 1                     Decision 0
Actual 1           True Positive (Profit = $2)    False Negative
Actual 0           False Positive (Loss = $1)     True Negative


Bayes Rule: Optimal threshold

Optimal threshold = 1 / (1 + (cost of false negative / cost of false positive))


Using the cost structure defined for the home equity example, the optimal threshold is 1/(1+(2/1)) = 1/3. That is, reject all applications whose predicted probability of default exceeds 0.3333
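A minimal sketch of this calculation, interpreting the HMEQ cost structure as a $2 loss for accepting a defaulter (false negative) and a $1 forgone profit for rejecting a good applicant (false positive):

# Bayes-rule threshold for the assumed HMEQ cost structure.
cost_false_negative = 2.0   # accept an applicant who defaults
cost_false_positive = 1.0   # reject an applicant who would have repaid

threshold = 1.0 / (1.0 + cost_false_negative / cost_false_positive)
print(threshold)            # 0.3333... -> reject if P(default) > 1/3

def decide(p_default, threshold=threshold):
    return "reject" if p_default > threshold else "approve"

print(decide(0.40), decide(0.10))   # reject approve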
