
Applied Data Mining Analysis:

A Step-by-Step Introduction Using Real-World Data Sets

http://info.salford-systems.com/jsm-2015-ctw

August 2015
Salford Systems
Course Outline
Demonstration of two classification examples in SPM
o Bank Marketing
o KDD cup 2009

Predictive Modeling package used for the examples
o Core Statistics
o Logistic Regression
o CART Decision Tree (original, by Jerome Friedman)
o MARS Spline Regression (original, by Jerome Friedman)
o TreeNet gradient boosting machine (original, by Jerome Friedman)
o RandomForests (original, Breiman and Cutler)
o Automation and model acceleration

Salford Systems 2015 http://info.salford-systems.com/jsm-2015-ctw 2


Bank Marketing Data
Portuguese bank marketing data
o 41,188 records
o 20 attributes, such as age, job, education, housing status
o The goal is to predict whether the client will subscribe to a term deposit
o Output variable (desired target):
has the client subscribed to a term deposit? (binary: 'yes','no')
Dataset is publicly available at UCI machine learning
repository
o http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing
Challenges
o Missing values
o Mixed categorical and numerical variables
o Variable selection

Copyright Salford Systems 2013


Sample Data

AGE  JOB          MARITAL  DEF  HOUSING  LOAN  CONTACT    EMP_VAR_RATE  CPI     CCI    EURIBOR  NUM_EMP  Y
56   housemaid    married  no   no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
57   services     married  ?    no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
37   services     married  no   yes      no    telephone  1.1           93.994  -36.4  4.857    5191     no
40   admin.       married  no   no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
56   services     married  no   no       yes   telephone  1.1           93.994  -36.4  4.857    5191     no
45   services     married  ?    no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
59   admin.       married  no   no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
41   blue-collar  married  ?    no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
24   technician   single   no   yes      no    telephone  1.1           93.994  -36.4  4.857    5191     no
25   services     single   no   yes      no    telephone  1.1           93.994  -36.4  4.857    5191     no

(? marks a missing value)

Other variables include: level of education, date of last contact, outcome of last campaign, days since last contact, etc.

Note: missing values, categorical and numeric variables


Open Raw Data:
bank.CSV



Character Variables and
Missing Values



Request Descriptive
Statistics

All variables are included by default


Brief Descriptive Stats

We always check the prevalence of missing data
Always review the number of distinct values (too few? too many?)
Does anything look wrong in the dataset?
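These checks can be sketched outside SPM with a few lines of standard-library Python; the rows and column names below are illustrative stand-ins for the bank data, not the real file.

```python
# Sketch (not SPM): per-column missing-value prevalence and distinct
# counts. Empty strings stand in for missing cells.

rows = [
    {"AGE": "56", "JOB": "housemaid", "HOUSING": "no"},
    {"AGE": "57", "JOB": "services",  "HOUSING": ""},    # missing value
    {"AGE": "37", "JOB": "services",  "HOUSING": "yes"},
]

def column_profile(rows, col):
    """Return (missing_fraction, n_distinct_nonmissing) for one column."""
    values = [r[col] for r in rows]
    missing = sum(1 for v in values if v == "")
    distinct = len({v for v in values if v != ""})
    return missing / len(values), distinct

print(column_profile(rows, "HOUSING"))  # (0.333..., 2)
```

A column with a distinct count of 1 (constant) or equal to the row count (an ID) is usually a sign that something is wrong.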



Full Descriptive Stats

Output contains detailed descriptive statistics for every variable



Frequency of Target
variable
Target Variable
0 means non-subscriber
1 means subscriber
It is not surprising that only a small percentage of people subscribed to a term deposit



Data Preparation
The records in this dataset are ordered by date (from May 2008 to November 2010)
Note that the 2008 economic crisis complicates this dataset: time has to be considered as a factor in the analysis
We partitioned the first 80% of records (in time order) as learning data and the remaining 20% as testing data
Note: pdays = 999 means the client has never been contacted before this phone call
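The partitioning and the pdays sentinel described above can be sketched as follows (illustrative stand-in data, not SPM code):

```python
# Records are already in time order, so the first 80% become the learn
# sample and the last 20% the test sample (no shuffling).
# pdays == 999 is a sentinel for "never contacted before"; recoding it
# to a flag avoids treating 999 as a real number of days.

def time_ordered_split(records, learn_frac=0.8):
    cut = int(len(records) * learn_frac)
    return records[:cut], records[cut:]

def recode_pdays(pdays):
    """Return (was_contacted_before, pdays_or_None)."""
    return (False, None) if pdays == 999 else (True, pdays)

records = list(range(10))        # stand-in for 41,188 time-ordered rows
learn, test = time_ordered_split(records)
print(len(learn), len(test))     # 8 2
print(recode_pdays(999))         # (False, None)
```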



Build LOGIT Model



LOGIT Model Summary

The learn-sample ROC value of 0.94 should get your attention: examine whether it is too good to be true
The difference between learn and test ROC tells us that time does have an impact
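To make the learn-vs-test ROC comparison concrete, here is a minimal rank-based AUC computation (equivalent to the area under the ROC curve); this is a sketch, not SPM's reporting code:

```python
# AUC = P(score of a random positive > score of a random negative),
# counting ties as 1/2.

def roc_auc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.4, 0.1]
print(roc_auc(labels, scores))  # 0.875
```

Computing this separately on the learn and test samples makes a gap like 0.94 vs. a much lower test value immediately visible.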
LOGIT Model Coefficients

A portion of the coefficients is shown in the table above


CART
Classification and Regression Trees
o Separates relevant from irrelevant predictors
o Yields simple, easy-to-understand results
o Doesn't require variable transformations
o Impervious to outliers and missing values

Fastest, most versatile predictive modeling algorithm available to analysts

Provides the foundation for modern data mining techniques such as bagging and boosting
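The core move CART makes at each node can be sketched in a few lines: try every split point of a predictor and keep the one with the lowest weighted Gini impurity. This toy version omits everything SPM adds (surrogates, priors, cost-complexity pruning):

```python
# Find the best binary split x <= t of one numeric predictor,
# scored by weighted Gini impurity of the two child nodes.

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def best_split(x, y):
    best = (None, float("inf"))
    for t in sorted(set(x))[:-1]:          # no split above the maximum
        left  = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        w = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if w < best[1]:
            best = (t, w)
    return best

x = [1, 2, 3, 10, 11, 12]
y = [0, 0, 0, 1, 1, 1]
print(best_split(x, y))  # (3, 0.0) -- a perfect split at x <= 3
```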
Build CART Model



Testing Method



CART Model

The learn and test samples perform quite differently with this model, which means time does contribute as a factor influencing the outcome
Also, the learn-sample performance looks too good to be true
Variable Importance

Duration: this attribute highly affects the output target (e.g., if duration=0
then y='no'). Yet, the duration is not known before a call is performed.



Rerun CART model
excluding Duration



Variable Importance
Ranking

CART gives an initial look at which variables are important; this is useful when there are quite a few predictors in your dataset.
Root Node Split Very
Effective
We can view node details by clicking
Tree Details in the CART output window
The first splitter is month, which is
also shown in the variable importance
ranking table as the most influential
predictor
The whole tree with details can be
viewed as well



MARS
Multivariate Adaptive Regression Splines
Uses knots to impose local linearities
These knots create basis functions to
decompose the information in each variable
individually
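The basis functions MARS builds are hinge functions: for a knot k, the pair max(0, x - k) and max(0, k - x). Summing hinges with coefficients gives a piecewise-linear fit. A minimal sketch (the knot and coefficients here are made up for illustration):

```python
# One hinge pair and a toy piecewise-linear model built from it:
# flat up to the knot, then rising with slope b1.

def hinge_pair(x, knot):
    return max(0.0, x - knot), max(0.0, knot - x)

def model(x, knot=10.0, b0=5.0, b1=2.0):
    up, _down = hinge_pair(x, knot)
    return b0 + b1 * up

print(hinge_pair(12.0, 10.0))   # (2.0, 0.0)
print(model(8.0), model(15.0))  # 5.0 15.0
```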

[Plots: MV vs. LSTAT, without and with a knot, showing a piecewise-linear fit]
Build MARS Model



MARS Model Setup
The Max Basis Functions default setting is 15; often the model hits this limit and stops before reaching the optimal model
So we set it to 60 after a couple of runs



MARS Output Window

This output window shows you the number of basis functions in the model
against the performance of the model. Because MARS is a regression engine, the
MSE and R-squared values will still be reported, but can be ignored here.
Summary

This model improved in targeting customers, with an ROC of 0.72.



MARS Basis Functions

Here is where the logistic regression equation is laid out in terms of the basis
functions (transformations of the predictors). Each basis function is
described and the final model is listed at the bottom. This form of output is
especially desired by those who are comfortable with standard regression.
MARS Plots

Note the presence of nonlinearity in this dataset



TreeNet
Stochastic Gradient Boosting
Small decision trees built in an error-correcting sequence
1. Begin with small tree as initial model
2. Compute residuals from this model for all records
3. Grow a second small tree to predict these residuals
4. And so on
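The four steps above can be sketched for squared error. To keep it short, the "small tree" here is a fixed-threshold stump (a mean on each side), so the error-correcting loop is visible without a full tree grower; this is a toy illustration, not TreeNet itself:

```python
# Gradient boosting sketch: start from a trivial model, repeatedly fit
# a depth-1 "tree" to the residuals, and add it to the ensemble.

def fit_stump(x, y, threshold):
    left  = [yi for xi, yi in zip(x, y) if xi <= threshold]
    right = [yi for xi, yi in zip(x, y) if xi > threshold]
    lmean = sum(left) / len(left)
    rmean = sum(right) / len(right)
    return lambda xi: lmean if xi <= threshold else rmean

x = [1, 2, 3, 4]
y = [1.0, 1.0, 3.0, 5.0]

ensemble = []
pred = [0.0] * len(x)                                  # step 1
for _ in range(20):
    residuals = [yi - pi for yi, pi in zip(y, pred)]   # step 2
    stump = fit_stump(x, residuals, threshold=2)       # step 3
    ensemble.append(stump)
    pred = [pi + stump(xi) for pi, xi in zip(pred, x)] # step 4: and so on

print(pred)  # [1.0, 1.0, 4.0, 4.0]
```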
Build TreeNet Model



TreeNet Output Window

The Output window shows a graph of the number of trees in the ensemble
with its corresponding ROC value. The vertical green bar denotes the model
with the optimal ROC: 9 trees at 0.69.



Partial Dependency Plots

Using TreeNet for targeted marketing has improved on random calling and given you an idea of how the predictors affect subscription
Random Forests
Ensemble of trees built on bootstrap samples
Algorithm:
o Each tree is grown on a bootstrap sample from the learning data
o During tree growing, only P predictors are selected and tried at each
node
o By default, P is the square root of total predictors

The overall prediction is determined by averaging


Law of Large Numbers ensures convergence
The key to accuracy is low correlation and bias
To keep bias low, trees are grown to maximum
depth
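The two sources of randomness described above can be sketched with the standard library: a bootstrap sample of the rows, and a random subset of roughly sqrt(P) predictors tried at each node (not SPM code; the variable names are illustrative):

```python
import math
import random

def bootstrap_sample(records, rng):
    # Sample with replacement, same size as the learning data.
    return [rng.choice(records) for _ in records]

def candidate_predictors(all_predictors, rng):
    # By default, try sqrt(P) of the P predictors at a node.
    p = max(1, round(math.sqrt(len(all_predictors))))
    return rng.sample(all_predictors, p)

rng = random.Random(0)            # fixed seed for reproducibility
records = list(range(100))
boot = bootstrap_sample(records, rng)
preds = candidate_predictors([f"VAR{i}" for i in range(16)], rng)

print(len(boot), len(set(boot)) < len(boot))  # 100 True (duplicates appear)
print(len(preds))                             # 4 (sqrt of 16 predictors)
```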
Build RandomForests
Model



RandomForests Output1

The RandomForests optimal model is always the one with the most trees



RandomForests Summary



Prediction Success Table1

We want to minimize the false non-subscriber rate so that we spend the least effort to reach the most subscribers
Adjust Class Weights
The Class Weights default is BALANCED, which means small classes are upweighted to equal the size of the largest target class
Now we manually upweight class 1, the small class, even more than the BALANCED setting
Prediction Success Table2



Conclusion
CART, MARS, TreeNet and RandomForests
o Handle missing values automatically
o Detect interactions and nonlinearity automatically
o Models can be translated into other programming languages
o Model performance usually exceeds traditional classification algorithms
o Advanced settings boost model performance

CART provides initial insights into the dataset
MARS gives equations in a linear regression format with transformations of the original predictors
TreeNet generates more accurate models
RandomForests excels with wide datasets



KDD Cup 2009
Knowledge Discovery and Data mining competition
held once a year to challenge modelers to a task
o http://www.kdd.org/kddcup/index.php - competitions from 1997-2010
o Includes tasks, data, rules, results, and FAQs

KDD Cup 2009 was about customer relationship prediction
French telecom company Orange provided large
marketing databases
Overall goal was to beat the in-house system
implemented by Orange



Datasets
50,000 customers
15,000 predictors
o e.g., demographic, geographic, behavioral

Three binary classification tasks:
o Appetency: customer buys new product or service
o Churn: customer switches providers
o Upselling: customer buys upgrade offered to them

Training and testing datasets
Smaller subsets of data available for practice



Challenges
Large database
o 50,000 x 15,000

Numerical and categorical variables
Missing data
Unbalanced class distributions
o Many more customers NOT doing these things

Sanitized data - no intuition



Data Preparation
Combine multiple datasets
o Large dataset broken into 5 chunks, 53 MB each
o True target values needed to be appended

Delete or impute missing values
o Not necessary in SPM

Handle categorical variables
o Create dummy indicators
o Combine levels in variables with many levels
o Again, not necessary in SPM
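The two categorical-variable steps listed above (unnecessary in SPM but typical elsewhere) can be sketched as follows; the job values are illustrative stand-ins:

```python
# Pool rare levels into "OTHER", then expand a categorical variable
# into 0/1 dummy indicator columns.
from collections import Counter

def combine_rare_levels(values, min_count=2):
    counts = Counter(values)
    return [v if counts[v] >= min_count else "OTHER" for v in values]

def dummy_indicators(values):
    levels = sorted(set(values))
    return [{f"IS_{lvl}": int(v == lvl) for lvl in levels} for v in values]

jobs = ["services", "services", "admin", "admin", "housemaid"]
pooled = combine_rare_levels(jobs)
print(pooled)  # ['services', 'services', 'admin', 'admin', 'OTHER']
print(dummy_indicators(pooled)[0])
```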



Open Prepared Data



View Data



Run Descriptive Statistics



Target Frequencies



Appetency
In this context, appetency is
the propensity of the
customer to buy a new
product or service



CART Model Setup
Choose CART as the
Analysis Engine
Our Target is coded -1/1,
so we will choose
Classification/Logistic
Binary as the Target Type
Appetency is our
response variable and
VAR1-VAR15000 are our
predictors



Setting a Testing Method

A separate test dataset is provided in the competition, but true target values were not included
For model-building, we will use a 20% random partition of the training dataset to monitor performance



Restricting Tree Size

We are interested in looking at the CART ranking of important predictors
By forcing the tree to only one split, we can quickly create a tree to access this information



Penalties
We are aware there are variables with many missing values and variables with a high number of categorical levels
Setting penalties on these makes it harder for them to enter the model



Results - Single Split CART Tree



Variable Improvement Measures



TreeNet Model Setup



Results - TreeNet Ensemble



Variable Selection

Improvement measures are averaged across all trees in the ensemble
Only 185 of the original 15,000 predictors are flagged as important



Recursive Feature Elimination (RFE)
Remove one variable at a time from
the TOP of the variable importance list
to eliminate "too good to be true" predictors
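The shaving loop above can be sketched as: drop the top-ranked variable, refit, and keep the reduced set as long as the test score does not degrade. Here `fit_and_score` is a hypothetical stand-in for rebuilding a TreeNet model, simulated with a lookup table of made-up ROC values:

```python
# Shave from the TOP of the importance list while the score holds up.

def shave_from_top(ranked_vars, fit_and_score, tolerance=0.0):
    best_vars, best_score = list(ranked_vars), fit_and_score(ranked_vars)
    while len(best_vars) > 1:
        candidate = best_vars[1:]          # drop the current top variable
        score = fit_and_score(candidate)
        if score < best_score - tolerance:
            break                          # top variable was genuinely useful
        best_vars, best_score = candidate, max(best_score, score)
    return best_vars, best_score

# Simulated ROCs: VAR_A adds no honest signal, so removing it is safe.
scores = {("VAR_A", "VAR_B", "VAR_C"): 0.90,
          ("VAR_B", "VAR_C"): 0.905,
          ("VAR_C",): 0.70}
fit = lambda vs: scores[tuple(vs)]
print(shave_from_top(["VAR_A", "VAR_B", "VAR_C"], fit))
# (['VAR_B', 'VAR_C'], 0.905)
```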



RFE, Step 2
Remove one variable at a time from
the BOTTOM of the variable importance
list to eliminate weak predictors

Final ROC: 0.9048



Parameter Variation - Automates
Each TreeNet control parameter can be automatically varied over its values
A model is built at each step and summarized



Stability of the Model
Automate PARTITION varies the learn/test partition so the user can
observe the stability of model performance



Repeat on Churn
Churn is the propensity of the customer to switch providers
We repeat the same steps of model-building to achieve a final model

Final ROC: 0.7320



Repeat on Upsell
Upsell is the propensity of the customer to buy an upgrade offered to
them
We repeat the same steps of model-building to achieve a final model

Final ROC: 0.9059



Summary of Results
Rank  Team                                     Appetency  Churn   Upselling  Score
1     IBM Research                             0.8830     0.7611  0.9038     0.8493
--    You!                                     0.9048     0.7320  0.9059     0.8476
2     ID Analytics, Inc.                       0.8724     0.7565  0.9056     0.8448
3     Old dogs with new tricks                 0.8740     0.7541  0.9050     0.8443
4     Crusaders                                0.8688     0.7569  0.9034     0.8430
5     Financial Engineering Group, Inc. Japan  0.8732     0.7498  0.9057     0.8429

Unable to compare to true target values because these were only seen
by competition judges
However, we are confident in our results (2 of the above groups used SPM)
Results can vary based on optimal selection criterion, random number
seed, etc.



Overall Conclusions
We were able to narrow down the predictor list
significantly using TreeNet and Automate SHAVING
o Of the original 15,000 predictors:
Appetency: 167
Churn: 249
Upselling: 165

Handling of categorical variables and missing values was automatic and didn't cause any issues
Small rates in the class of interest didn't pose a problem
o Priors/Costs and Class Weights can control for this in CART and TreeNet

Couldn't draw any insight as to the variables affecting appetency, churn, and upsell

