
Predictive Analytics and Data Mining

- Predictive Analytics Using Decision Trees


- Evaluation and Assessment of Data Mining Models - I
Decision Trees: Introduction

Powerful/popular for classification

Represent rules
Rules can be expressed in English
IF Age <=43 & Gender = Male
& Credit Card Insurance = No
THEN Life Insurance Promotion = No

Useful to explore data to gain insight into the relationships of a
large number of candidate input variables to a target (output) variable

YOUTUBE VIDEO: Decision Tree (CART) Machine Learning Made Fun and Easy
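As a concrete illustration (not part of the original slides), here is a minimal Python/scikit-learn sketch that fits a small tree on made-up data echoing the life-insurance example above and prints the fitted rules in readable form. The feature names and toy values are invented:

```python
# Minimal sketch: fit a small classification tree and print its rules.
# Toy data only, loosely echoing the life-insurance promotion example.
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: Age, Gender (1 = Male), CreditCardInsurance (1 = Yes)
X = [[25, 1, 0], [40, 1, 0], [50, 0, 1], [35, 0, 1], [60, 1, 1], [30, 0, 0]]
y = [0, 0, 1, 1, 1, 0]  # LifeInsurancePromotion: 1 = Yes, 0 = No

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["Age", "Gender", "CreditCardInsurance"]))
```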
Decision Trees: HMEQ Example

Banking marketing scenario (HMEQ):

Target:
  default on a home-equity line of credit (BAD)

Inputs:
  number of delinquent trade lines (DELINQ)
  number of credit inquiries (NINQ)
  debt-to-income ratio (DEBTINC)
  other inputs
The Scenario

Determine who should be approved for a home equity loan.

The target variable is a binary variable that indicates whether an
applicant eventually defaulted on the loan.

The input variables are variables such as the amount of the loan,
amount due on the existing mortgage, the value of the property,
and the number of recent credit inquiries.
HMEQ Dataset

The HMEQ data set contains baseline and loan performance information
for 5,960 recent home equity loans. The target (BAD) is a binary
variable that indicates if an applicant eventually defaulted or was
seriously delinquent. This adverse outcome occurred in 1,189 cases
(20%). For each applicant, 12 input variables were recorded.

Presume that every two dollars loaned eventually returns three dollars
if the loan is paid off in full.
The Cultivation of Trees

Split Search
Which splits are to be considered?

Splitting Criterion
Which split is best?

Stopping Rule
When should the splitting stop?

Pruning Rule
Should some branches be lopped off?
Decision Trees

Interpretation of the fitted decision tree

The internal nodes contain rules

Start at the root node (top) and follow the rules until a terminal
node (leaf) is reached.

The leaves contain the estimate of the expected value of the target,
in this case the posterior probability of BAD. The probability can
then be used to allocate cases to classes.
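A minimal sketch of this allocation step, in Python with scikit-learn and made-up data (the course itself uses SAS Enterprise Miner; the single DEBTINC-like input and the 0.5 cutoff are illustrative assumptions):

```python
# Sketch: leaves hold the posterior probability of BAD; a cutoff turns
# that probability into a class allocation. Toy data only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[45.0], [12.0], [50.0], [30.0], [48.0], [20.0]])  # e.g. DEBTINC
y = np.array([1, 0, 1, 0, 1, 0])                                # BAD

tree = DecisionTreeClassifier(max_depth=1).fit(X, y)
p_bad = tree.predict_proba([[47.0], [15.0]])[:, 1]  # posterior P(BAD) per leaf
print(p_bad, (p_bad > 0.5).astype(int))             # allocate with a 0.5 cutoff
```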
Decision Tree Template

Drawn top-to-bottom or left-to-right

Top (or left-most) node = Root Node

Descendent node(s) = Child Node(s)

Bottom (or right-most) node(s) = Leaf Node(s)

Unique path from root to each leaf = Rule
Divide and Conquer

The tree is fitted to the data by recursive partitioning. Partitioning
refers to segmenting the data into subgroups that are as homogeneous as
possible with respect to the target. In this case, the binary split
(Debt-to-Income Ratio < 45) was chosen. The 5,000 cases were split into
two groups, one with a 5% BAD rate and the other with a 21% BAD rate.

[Diagram: root node with n = 5,000 and 10% BAD; split on Debt-to-Income
Ratio < 45; "yes" child with n = 3,350 and 5% BAD, "no" child with
n = 1,650 and 21% BAD.]

The method is recursive because each subgroup results from splitting a
subgroup from a previous split. Thus, the 3,350 cases in the left child
node and the 1,650 cases in the right child node are split again in
similar fashion.
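A simplified sketch of recursive partitioning in plain Python (illustrative only: real trees search over many inputs, and the toy data below is invented). It greedily picks the binary cutpoint on one numeric input that makes the children most homogeneous in the target, then recurses on each child:

```python
def impurity(labels):
    """Gini impurity: 0 for a pure node, 0.5 for a 50/50 binary node."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 1.0 - p * p - (1 - p) * (1 - p)

def best_split(xs, ys):
    """Return (score, cutpoint) minimizing size-weighted child impurity."""
    best = None
    for cut in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x < cut]
        right = [y for x, y in zip(xs, ys) if x >= cut]
        if not left or not right:
            continue
        score = (len(left) * impurity(left)
                 + len(right) * impurity(right)) / len(ys)
        if best is None or score < best[0]:
            best = (score, cut)
    return best

def grow(xs, ys, depth=0, max_depth=2):
    if depth >= max_depth or impurity(ys) == 0.0:
        return                                   # terminal node (leaf)
    split = best_split(xs, ys)
    if split is None:
        return
    _, cut = split
    print("  " * depth + f"split at x < {cut}  (n={len(ys)})")
    left = [(x, y) for x, y in zip(xs, ys) if x < cut]
    right = [(x, y) for x, y in zip(xs, ys) if x >= cut]
    grow(*map(list, zip(*left)), depth + 1, max_depth)
    grow(*map(list, zip(*right)), depth + 1, max_depth)

# Toy usage, echoing the slide: x is a debt-to-income-like ratio, y is BAD.
grow([10, 20, 30, 44, 46, 50, 60], [0, 0, 0, 0, 1, 1, 1])
```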
Splitting Criteria

How is the best split determined? In some situations, the worth of a
split is obvious. If the expected target is the same in the child nodes
as in the parent node, no improvement was made, and the split is
worthless!

In contrast, if a split results in pure child nodes, the split is
indisputably best. For classification trees, the three most widely used
splitting criteria are based on the Pearson chi-squared test, the Gini
index, and entropy. All three measure the difference in class
distributions across the child nodes. The three methods usually give
similar results.
Splitting Criteria

Debt-to-Income Ratio < 45:

             Left    Right   Total
  Not Bad    3196    1304    4500
  Bad         154     346     500

A Competing Three-Way Split:

             Left    Center  Right   Total
  Not Bad    2521    1188     791    4500
  Bad         115     162     223     500

A Perfect Split:

             Left    Right   Total
  Not Bad    4500       0    4500
  Bad           0     500     500
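As a hedged sketch (not from the slides), the first two candidate splits above can be scored with the Pearson chi-squared test via scipy. A larger chi-square (smaller p-value) indicates stronger separation; note that fairly comparing splits with different numbers of branches requires an adjustment for degrees of freedom, which is omitted here:

```python
# Score the slide's candidate splits with the Pearson chi-squared test.
from scipy.stats import chi2_contingency

two_way = [[3196, 1304],         # Not Bad in left / right child
           [154, 346]]           # Bad in left / right child
three_way = [[2521, 1188, 791],
             [115, 162, 223]]

for name, table in [("Debt-to-Income < 45", two_way),
                    ("three-way split", three_way)]:
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"{name}: chi2 = {chi2:.1f}, dof = {dof}, p = {p:.2e}")
```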
Split Criteria

The best split is defined as one that does the best job of separating
the data into groups where a single class predominates in each
group

The measure used to evaluate a potential split is purity

The best split is one that increases the purity of the subsets by the
greatest amount

A good split also creates nodes of similar size or at least does not
create very small nodes
Tests for Choosing Best Split

Purity (Diversity) Measures:

Gini (population diversity)

Entropy (information gain)

Pearson Chi-Square Test


Information Gain (from Data Science for Business by Provost and Fawcett)

The most common splitting criterion is called information gain (IG)

It is based on a purity measure called entropy

Entropy measures the general disorder of a set

Information Gain

Information gain measures the change in entropy due to any amount of
new information being added
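A minimal sketch of these two definitions in plain Python, applied to the two-way split from the earlier table (the functions follow the standard Provost & Fawcett formulas; the code itself is not from the slides):

```python
# Entropy and information gain for class-count vectors.
from math import log2

def entropy(counts):
    """entropy = -sum(p_i * log2(p_i)) over the class proportions."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

# The slide's two-way split: parent 4500/500,
# children (3196 Not Bad, 154 Bad) and (1304 Not Bad, 346 Bad).
print(information_gain([4500, 500], [[3196, 154], [1304, 346]]))
```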
Gini (Population Diversity)

The Gini measure of a node is the sum of the squares of the proportions
of the classes.

Root Node: 0.5^2 + 0.5^2 = 0.5 (even balance)

Leaf Nodes: 0.1^2 + 0.9^2 = 0.82 (close to pure)

Gini Score = 0.5*0.82 + 0.5*0.82 = 0.82 (close to pure), weighting each
leaf by its share of the cases (here 50/50)
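The same arithmetic in a few lines of Python, using the slide's sum-of-squares form of Gini (higher = purer; many texts instead report 1 minus this quantity):

```python
# Gini "score" as defined on the slide: sum of squared class proportions.
def gini_score(proportions):
    return sum(p * p for p in proportions)

root = gini_score([0.5, 0.5])        # 0.5  (even balance)
leaf = gini_score([0.1, 0.9])        # 0.82 (close to pure)
split = 0.5 * leaf + 0.5 * leaf      # leaves weighted by their share of cases
print(root, leaf, split)             # 0.5 0.82 0.82
```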


Decision Tree Advantages

Easy to understand

Map nicely to a set of business rules

Applied to real problems

Make no prior assumptions about the data

Able to process both numerical and categorical data


Benefits of Trees

Interpretability
  Tree-structured presentation
  Better insights and business understanding
  Determining the most informative attributes

Mixed Measurement Scales
  Regression trees

Handling of Outliers

Handling of Missing Values
The Right-Sized Tree

Stunting (stop the tree's growth early with forward stopping rules)

Pruning (grow a large tree, then cut branches back)
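For illustration, here is how the two approaches look in scikit-learn, as a stand-in for the SAS Enterprise Miner settings the course uses; the parameter values below are arbitrary and would normally be chosen with validation data:

```python
from sklearn.tree import DecisionTreeClassifier

# Stunting: stop growth early with forward stopping rules.
stunted = DecisionTreeClassifier(max_depth=4, min_samples_leaf=50)

# Pruning: grow a large tree, then cut branches back via cost-complexity
# pruning (a larger ccp_alpha prunes more aggressively).
pruned = DecisionTreeClassifier(ccp_alpha=0.01)
```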
SAS Process Flow

[Screenshot slides: Input Data Source node variables; Decision Tree
node results.]

Lift Charts

[Chart slides: % response (cumulative and non-cumulative), lift value
(cumulative and non-cumulative), and % captured response (cumulative).]
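A sketch of the computations behind those charts, in Python with a made-up scored sample (not SAS EM output): cases are sorted by predicted probability, and each top fraction's response rate is compared with the overall rate.

```python
import numpy as np

p_hat = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])
actual = np.array([1, 1, 0, 1, 0, 0, 0, 1, 0, 0])   # 40% overall response

order = np.argsort(-p_hat)                    # best scores first
hits = np.cumsum(actual[order])
depth = np.arange(1, len(actual) + 1)
cum_pct_response = hits / depth               # % response chart (cumulative)
cum_lift = cum_pct_response / actual.mean()   # lift value chart (cumulative)
cum_captured = hits / actual.sum()            # % captured response chart
print(cum_lift[:3])                           # lift in the top 10/20/30%
```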
Consequences of a Decision

              Decision 1         Decision 0

  Actual 1    True Positive      False Negative

  Actual 0    False Positive     True Negative
Consequences of a Decision: Profit matrix (SAS EM)

              Decision 1         Decision 0

  Actual 1    True Positive      False Negative
              (Profit = $2)

  Actual 0    False Positive     True Negative
              (Loss = $1)
Profit
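A sketch of how such a profit matrix converts a leaf's posterior probability into a decision (plain Python, using the slide's $2 profit for a true positive and $1 loss for a false positive; zero payoff otherwise). Note the break-even cutoff becomes p = 1/3 rather than the default 0.5:

```python
# Decide 1 when the expected profit of doing so is positive.
def expected_profit_decision(p_bad, profit_tp=2.0, loss_fp=1.0):
    ev_decide_1 = p_bad * profit_tp - (1 - p_bad) * loss_fp
    return 1 if ev_decide_1 > 0 else 0

# Break-even: p * 2 = (1 - p) * 1  =>  p = 1/3.
for p in (0.2, 1 / 3, 0.4):
    print(p, expected_profit_decision(p))
```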
NEXT TOPIC: Clustering for Segmentation
