
Predictive Analytics and Data Mining

- Predictive Analytics Using Decision Trees


- Evaluation and Assessment of Data Mining Models - I
Decision Trees: Introduction

Powerful/popular for classification

Represent rules
Rules can be expressed in English
IF Age <=43 & Gender = Male
& Credit Card Insurance = No
THEN Life Insurance Promotion = No

Useful to explore data to gain insight into the relationships of a
large number of candidate input variables to a target (output) variable

YOUTUBE VIDEO: Decision Tree (CART) Machine Learning Made Fun and Easy
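As a concrete illustration (not part of the original slides), here is a minimal Python/scikit-learn sketch that fits a small tree on made-up data echoing the life-insurance example above and prints the fitted rules in readable form. The feature names and toy values are invented:

```python
# Minimal sketch: fit a small classification tree and print its rules.
# Toy data only, loosely echoing the life-insurance promotion example.
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: Age, Gender (1 = Male), CreditCardInsurance (1 = Yes)
X = [[25, 1, 0], [40, 1, 0], [50, 0, 1], [35, 0, 1], [60, 1, 1], [30, 0, 0]]
y = [0, 0, 1, 1, 1, 0]  # LifeInsurancePromotion: 1 = Yes, 0 = No

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["Age", "Gender", "CreditCardInsurance"]))
```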
Decision Trees: HMEQ Example

Banking marketing scenario (HMEQ):

Target:
  default on a home-equity line of credit (BAD)

Inputs:
  number of delinquent trade lines (DELINQ)
  number of credit inquiries (NINQ)
  debt-to-income ratio (DEBTINC)
  other inputs
The Scenario

Determine who should be approved for a home equity loan.

The target variable is a binary variable that indicates whether an
applicant eventually defaulted on the loan.

The input variables are variables such as the amount of the loan,
amount due on the existing mortgage, the value of the property,
and the number of recent credit inquiries.
HMEQ Dataset

The HMEQ data set contains baseline and loan performance information
for 5,960 recent home equity loans. The target (BAD) is a binary
variable that indicates if an applicant eventually defaulted or was
seriously delinquent. This adverse outcome occurred in 1,189 cases
(20%). For each applicant, 12 input variables were recorded.

Presume that every two dollars loaned eventually returns three dollars
if the loan is paid off in full.
The Cultivation of Trees

Split Search
Which splits are to be considered?

Splitting Criterion
Which split is best?

Stopping Rule
When should the splitting stop?

Pruning Rule
Should some branches be lopped off?
Decision Trees

Interpretation of the fitted decision tree

The internal nodes contain rules

Start at the root node (top) and follow the rules until a terminal
node (leaf) is reached.

The leaves contain the estimate of the expected value of the target,
in this case the posterior probability of BAD. The probability can
then be used to allocate cases to classes.
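A minimal sketch of this allocation step, in Python with scikit-learn and made-up data (the course itself uses SAS Enterprise Miner; the single DEBTINC-like input and the 0.5 cutoff are illustrative assumptions):

```python
# Sketch: leaves hold the posterior probability of BAD; a cutoff turns
# that probability into a class allocation. Toy data only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[45.0], [12.0], [50.0], [30.0], [48.0], [20.0]])  # e.g. DEBTINC
y = np.array([1, 0, 1, 0, 1, 0])                                # BAD

tree = DecisionTreeClassifier(max_depth=1).fit(X, y)
p_bad = tree.predict_proba([[47.0], [15.0]])[:, 1]  # posterior P(BAD) per leaf
print(p_bad, (p_bad > 0.5).astype(int))             # allocate with a 0.5 cutoff
```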
Decision Tree Template

Drawn top-to-bottom or left-to-right

Top (or left-most) node = Root Node

Descendent node(s) = Child Node(s)

Bottom (or right-most) node(s) = Leaf Node(s)

Unique path from root to each leaf = Rule
Divide and Conquer

The tree is fitted to the data by recursive partitioning. Partitioning
refers to segmenting the data into subgroups that are as homogeneous as
possible with respect to the target. In this case, the binary split
(Debt-to-Income Ratio < 45) was chosen. The 5,000 cases were split into
two groups, one with a 5% BAD rate and the other with a 21% BAD rate.

[Diagram: root node with n = 5,000 and 10% BAD; split on Debt-to-Income
Ratio < 45; "yes" child with n = 3,350 and 5% BAD, "no" child with
n = 1,650 and 21% BAD.]

The method is recursive because each subgroup results from splitting a
subgroup from a previous split. Thus, the 3,350 cases in the left child
node and the 1,650 cases in the right child node are split again in
similar fashion.
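A simplified sketch of recursive partitioning in plain Python (illustrative only: real trees search over many inputs, and the toy data below is invented). It greedily picks the binary cutpoint on one numeric input that makes the children most homogeneous in the target, then recurses on each child:

```python
def impurity(labels):
    """Gini impurity: 0 for a pure node, 0.5 for a 50/50 binary node."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 1.0 - p * p - (1 - p) * (1 - p)

def best_split(xs, ys):
    """Return (score, cutpoint) minimizing size-weighted child impurity."""
    best = None
    for cut in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x < cut]
        right = [y for x, y in zip(xs, ys) if x >= cut]
        if not left or not right:
            continue
        score = (len(left) * impurity(left)
                 + len(right) * impurity(right)) / len(ys)
        if best is None or score < best[0]:
            best = (score, cut)
    return best

def grow(xs, ys, depth=0, max_depth=2):
    if depth >= max_depth or impurity(ys) == 0.0:
        return                                   # terminal node (leaf)
    split = best_split(xs, ys)
    if split is None:
        return
    _, cut = split
    print("  " * depth + f"split at x < {cut}  (n={len(ys)})")
    left = [(x, y) for x, y in zip(xs, ys) if x < cut]
    right = [(x, y) for x, y in zip(xs, ys) if x >= cut]
    grow(*map(list, zip(*left)), depth + 1, max_depth)
    grow(*map(list, zip(*right)), depth + 1, max_depth)

# Toy usage, echoing the slide: x is a debt-to-income-like ratio, y is BAD.
grow([10, 20, 30, 44, 46, 50, 60], [0, 0, 0, 0, 1, 1, 1])
```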
Splitting Criteria

How is the best split determined? In some situations, the worth of a
split is obvious. If the expected target is the same in the child nodes
as in the parent node, no improvement was made, and the split is
worthless!

In contrast, if a split results in pure child nodes, the split is
indisputably best. For classification trees, the three most widely used
splitting criteria are based on the Pearson chi-squared test, the Gini
index, and entropy. All three measure the difference in class
distributions across the child nodes. The three methods usually give
similar results.
Splitting Criteria

Debt-to-Income Ratio < 45:

             Left    Right   Total
  Not Bad    3196    1304    4500
  Bad         154     346     500

A Competing Three-Way Split:

             Left    Center  Right   Total
  Not Bad    2521    1188     791    4500
  Bad         115     162     223     500

A Perfect Split:

             Left    Right   Total
  Not Bad    4500       0    4500
  Bad           0     500     500
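As a hedged sketch (not from the slides), the first two candidate splits above can be scored with the Pearson chi-squared test via scipy. A larger chi-square (smaller p-value) indicates stronger separation; note that fairly comparing splits with different numbers of branches requires an adjustment for degrees of freedom, which is omitted here:

```python
# Score the slide's candidate splits with the Pearson chi-squared test.
from scipy.stats import chi2_contingency

two_way = [[3196, 1304],         # Not Bad in left / right child
           [154, 346]]           # Bad in left / right child
three_way = [[2521, 1188, 791],
             [115, 162, 223]]

for name, table in [("Debt-to-Income < 45", two_way),
                    ("three-way split", three_way)]:
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"{name}: chi2 = {chi2:.1f}, dof = {dof}, p = {p:.2e}")
```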
Split Criteria

The best split is defined as one that does the best job of separating
the data into groups where a single class predominates in each
group

The measure used to evaluate a potential split is purity

The best split is one that increases the purity of the subsets by the
greatest amount

A good split also creates nodes of similar size or at least does not
create very small nodes
Tests for Choosing Best Split

Purity (Diversity) Measures:

Gini (population diversity)

Entropy (information gain)

Pearson Chi-Square Test


Information Gain (from Data Science for Business by Provost and Fawcett)

The most common splitting criterion is called information gain (IG)

It is based on a purity measure called entropy

Entropy measures the general disorder of a set

Information Gain

Information gain measures the change in entropy due to any amount of
new information being added
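A minimal sketch of these two definitions in plain Python, applied to the two-way split from the earlier table (the functions follow the standard Provost & Fawcett formulas; the code itself is not from the slides):

```python
# Entropy and information gain for class-count vectors.
from math import log2

def entropy(counts):
    """entropy = -sum(p_i * log2(p_i)) over the class proportions."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

# The slide's two-way split: parent 4500/500,
# children (3196 Not Bad, 154 Bad) and (1304 Not Bad, 346 Bad).
print(information_gain([4500, 500], [[3196, 154], [1304, 346]]))
```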
Gini (Population Diversity)

The Gini measure of a node is the sum of the squares of the proportions
of the classes.

Root Node: 0.5^2 + 0.5^2 = 0.5 (even balance)

Leaf Nodes: 0.1^2 + 0.9^2 = 0.82 (close to pure)

Gini Score = 0.5*0.82 + 0.5*0.82 = 0.82 (close to pure), weighting each
leaf by its share of the cases (here 50/50)
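The same arithmetic in a few lines of Python, using the slide's sum-of-squares form of Gini (higher = purer; many texts instead report 1 minus this quantity):

```python
# Gini "score" as defined on the slide: sum of squared class proportions.
def gini_score(proportions):
    return sum(p * p for p in proportions)

root = gini_score([0.5, 0.5])        # 0.5  (even balance)
leaf = gini_score([0.1, 0.9])        # 0.82 (close to pure)
split = 0.5 * leaf + 0.5 * leaf      # leaves weighted by their share of cases
print(root, leaf, split)             # 0.5 0.82 0.82
```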


Decision Tree Advantages

Easy to understand

Map nicely to a set of business rules

Applied to real problems

Make no prior assumptions about the data

Able to process both numerical and categorical data


Benefits of Trees

Interpretability
  Tree-structured presentation
  Better insights and business understanding
  Determining the most informative attributes

Mixed Measurement Scales
  Regression trees

Handling of Outliers

Handling of Missing Values
The Right-Sized Tree

Stunting (stop the tree's growth early with forward stopping rules)

Pruning (grow a large tree, then cut branches back)
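For illustration, here is how the two approaches look in scikit-learn, as a stand-in for the SAS Enterprise Miner settings the course uses; the parameter values below are arbitrary and would normally be chosen with validation data:

```python
from sklearn.tree import DecisionTreeClassifier

# Stunting: stop growth early with forward stopping rules.
stunted = DecisionTreeClassifier(max_depth=4, min_samples_leaf=50)

# Pruning: grow a large tree, then cut branches back via cost-complexity
# pruning (a larger ccp_alpha prunes more aggressively).
pruned = DecisionTreeClassifier(ccp_alpha=0.01)
```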
SAS Process Flow

[Screenshot slides: Input Data Source node variables; Decision Tree
node results.]

Lift Charts

[Chart slides: % response (cumulative and non-cumulative), lift value
(cumulative and non-cumulative), and % captured response (cumulative).]
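A sketch of the computations behind those charts, in Python with a made-up scored sample (not SAS EM output): cases are sorted by predicted probability, and each top fraction's response rate is compared with the overall rate.

```python
import numpy as np

p_hat = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])
actual = np.array([1, 1, 0, 1, 0, 0, 0, 1, 0, 0])   # 40% overall response

order = np.argsort(-p_hat)                    # best scores first
hits = np.cumsum(actual[order])
depth = np.arange(1, len(actual) + 1)
cum_pct_response = hits / depth               # % response chart (cumulative)
cum_lift = cum_pct_response / actual.mean()   # lift value chart (cumulative)
cum_captured = hits / actual.sum()            # % captured response chart
print(cum_lift[:3])                           # lift in the top 10/20/30%
```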
Consequences of a Decision

              Decision 1         Decision 0

  Actual 1    True Positive      False Negative

  Actual 0    False Positive     True Negative
Consequences of a Decision: Profit matrix (SAS EM)

              Decision 1         Decision 0

  Actual 1    True Positive      False Negative
              (Profit = $2)

  Actual 0    False Positive     True Negative
              (Loss = $1)
Profit
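A sketch of how such a profit matrix converts a leaf's posterior probability into a decision (plain Python, using the slide's $2 profit for a true positive and $1 loss for a false positive; zero payoff otherwise). Note the break-even cutoff becomes p = 1/3 rather than the default 0.5:

```python
# Decide 1 when the expected profit of doing so is positive.
def expected_profit_decision(p_bad, profit_tp=2.0, loss_fp=1.0):
    ev_decide_1 = p_bad * profit_tp - (1 - p_bad) * loss_fp
    return 1 if ev_decide_1 > 0 else 0

# Break-even: p * 2 = (1 - p) * 1  =>  p = 1/3.
for p in (0.2, 1 / 3, 0.4):
    print(p, expected_profit_decision(p))
```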
NEXT TOPIC: Clustering for Segmentation
