
Applied Data Mining Analysis:

A Step-by-Step Introduction Using Real-World Data Sets

http://info.salford-systems.com/jsm-2015-ctw

August 2015
Salford Systems
Course Outline
Demonstration of two classification examples in SPM
o Bank Marketing
o KDD cup 2009

Predictive Modeling package used for the examples
o Core Statistics
o Logistic Regression
o CART Decision Tree (original, by Jerome Friedman)
o MARS Spline Regression (original, by Jerome Friedman)
o TreeNet gradient boosting machine (original, by Jerome Friedman)
o RandomForests (original, Breiman and Cutler)
o Automation and model acceleration

Salford Systems 2015 http://info.salford-systems.com/jsm-2015-ctw 2


Bank Marketing Data
Portuguese bank marketing data
o 41,188 records
o 20 attributes, such as age, job, education, housing status
o The goal is to predict whether the client will subscribe to a term deposit
o Output variable (desired target):
has the client subscribed to a term deposit? (binary: 'yes','no')
Dataset is publicly available at UCI machine learning
repository
o http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing
Challenges
o Missing values
o Mixed categorical and numerical variables
o Variable selection

Copyright Salford Systems 2013


Sample Data

AGE  JOB          MARITAL  DEF  HOUSING  LOAN  CONTACT    EMP_VAR_RATE  CPI     CCI    EURIBOR  NUM_EMP  Y
56   housemaid    married  no   no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
57   services     married  ?    no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
37   services     married  no   yes      no    telephone  1.1           93.994  -36.4  4.857    5191     no
40   admin.       married  no   no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
56   services     married  no   no       yes   telephone  1.1           93.994  -36.4  4.857    5191     no
45   services     married  ?    no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
59   admin.       married  no   no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
41   blue-collar  married  ?    no       no    telephone  1.1           93.994  -36.4  4.857    5191     no
24   technician   single   no   yes      no    telephone  1.1           93.994  -36.4  4.857    5191     no
25   services     single   no   yes      no    telephone  1.1           93.994  -36.4  4.857    5191     no

(? marks a missing value)

Other variables include: level of education, date of last contact, outcome of last campaign, days since last contact, etc.

Note: missing values, categorical and numeric variables


Open Raw Data:
bank.CSV



Character Variables and
Missing Values



Request Descriptive
Statistics

All variables are included by default


Brief Descriptive Stats

We always check the prevalence of missing data
Always review the number of distinct values (too few? too many?)
Does anything look wrong in the dataset?
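These checks can be sketched outside SPM with a few lines of standard-library Python; the rows and column names below are illustrative stand-ins for the bank data, not the real file.

```python
# Sketch (not SPM): per-column missing-value prevalence and distinct
# counts. Empty strings stand in for missing cells.

rows = [
    {"AGE": "56", "JOB": "housemaid", "HOUSING": "no"},
    {"AGE": "57", "JOB": "services",  "HOUSING": ""},    # missing value
    {"AGE": "37", "JOB": "services",  "HOUSING": "yes"},
]

def column_profile(rows, col):
    """Return (missing_fraction, n_distinct_nonmissing) for one column."""
    values = [r[col] for r in rows]
    missing = sum(1 for v in values if v == "")
    distinct = len({v for v in values if v != ""})
    return missing / len(values), distinct

print(column_profile(rows, "HOUSING"))  # (0.333..., 2)
```

A column with a distinct count of 1 (constant) or equal to the row count (an ID) is usually a sign that something is wrong.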



Full Descriptive Stats

Output contains detailed descriptive statistics for every variable



Frequency of Target
variable
Target Variable
0 means non-subscriber
1 means subscriber
It is not surprising that only a small percentage of people subscribed to a term deposit



Data Preparation
The records in this dataset are ordered by date (from May 2008 to November 2010)
Note that the 2008 economic crisis complicates this dataset: time has to be considered as a factor in the analysis
We partitioned the first 80% of records (in time order) as learning data and the remaining 20% as testing data
Note: pdays = 999 means the client has never been contacted before this phone call
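The partitioning and the pdays sentinel described above can be sketched as follows (illustrative stand-in data, not SPM code):

```python
# Records are already in time order, so the first 80% become the learn
# sample and the last 20% the test sample (no shuffling).
# pdays == 999 is a sentinel for "never contacted before"; recoding it
# to a flag avoids treating 999 as a real number of days.

def time_ordered_split(records, learn_frac=0.8):
    cut = int(len(records) * learn_frac)
    return records[:cut], records[cut:]

def recode_pdays(pdays):
    """Return (was_contacted_before, pdays_or_None)."""
    return (False, None) if pdays == 999 else (True, pdays)

records = list(range(10))        # stand-in for 41,188 time-ordered rows
learn, test = time_ordered_split(records)
print(len(learn), len(test))     # 8 2
print(recode_pdays(999))         # (False, None)
```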



Build LOGIT Model



LOGIT Model Summary

The learn-sample ROC value of 0.94 should get your attention: examine whether it is too good to be true
The difference between learn and test ROC tells us that time does have an impact
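To make the learn-vs-test ROC comparison concrete, here is a minimal rank-based AUC computation (equivalent to the area under the ROC curve); this is a sketch, not SPM's reporting code:

```python
# AUC = P(score of a random positive > score of a random negative),
# counting ties as 1/2.

def roc_auc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.4, 0.1]
print(roc_auc(labels, scores))  # 0.875
```

Computing this separately on the learn and test samples makes a gap like 0.94 vs. a much lower test value immediately visible.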
LOGIT Model Coefficients

A portion of the coefficients is shown in the table above


CART
Classification and Regression Trees
o Separates relevant from irrelevant predictors
o Yields simple, easy-to-understand results
o Doesn't require variable transformations
o Impervious to outliers and missing values

Fastest, most versatile predictive modeling algorithm available to analysts

Provides the foundation for modern data mining techniques such as bagging and boosting
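The core move CART makes at each node can be sketched in a few lines: try every split point of a predictor and keep the one with the lowest weighted Gini impurity. This toy version omits everything SPM adds (surrogates, priors, cost-complexity pruning):

```python
# Find the best binary split x <= t of one numeric predictor,
# scored by weighted Gini impurity of the two child nodes.

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def best_split(x, y):
    best = (None, float("inf"))
    for t in sorted(set(x))[:-1]:          # no split above the maximum
        left  = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        w = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if w < best[1]:
            best = (t, w)
    return best

x = [1, 2, 3, 10, 11, 12]
y = [0, 0, 0, 1, 1, 1]
print(best_split(x, y))  # (3, 0.0) -- a perfect split at x <= 3
```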
Build CART Model



Testing Method



CART Model

The learn and test samples perform quite differently with this model, which means time does contribute as a factor influencing the outcome
Also, the learn-sample performance looks too good to be true
Variable Importance

Duration: this attribute highly affects the output target (e.g., if duration=0
then y='no'). Yet, the duration is not known before a call is performed.



Rerun CART model
excluding Duration



Variable Importance
Ranking

CART gives an initial look at which variables are important; this is useful when there are quite a few predictors in your dataset.
Root Node Split Very
Effective
We can view node details by clicking
Tree Details in the CART output window
The first splitter is month, which is
also shown in the variable importance
ranking table as the most influential
predictor
The whole tree with details can be
viewed as well



MARS
Multivariate Adaptive Regression Splines
Uses knots to impose local linearities
These knots create basis functions to
decompose the information in each variable
individually
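The basis functions MARS builds are hinge functions: for a knot k, the pair max(0, x - k) and max(0, k - x). Summing hinges with coefficients gives a piecewise-linear fit. A minimal sketch (the knot and coefficients here are made up for illustration):

```python
# One hinge pair and a toy piecewise-linear model built from it:
# flat up to the knot, then rising with slope b1.

def hinge_pair(x, knot):
    return max(0.0, x - knot), max(0.0, knot - x)

def model(x, knot=10.0, b0=5.0, b1=2.0):
    up, _down = hinge_pair(x, knot)
    return b0 + b1 * up

print(hinge_pair(12.0, 10.0))   # (2.0, 0.0)
print(model(8.0), model(15.0))  # 5.0 15.0
```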

[Plots: MV vs. LSTAT, without and with a knot, showing a piecewise-linear fit]
Build MARS Model



MARS Model Setup
The Max Basis Functions default setting is 15; often the model hits this limit and stops before reaching the optimal model
So we set it to 60 after a couple of runs



MARS Output Window

This output window shows you the number of basis functions in the model
against the performance of the model. Because MARS is a regression engine, the
MSE and R-squared values will still be reported, but can be ignored here.
Summary

This model improved in targeting customers, with an ROC of 0.72.



MARS Basis Functions

Here is where the logistic regression equation is laid out in terms of the basis
functions (transformations of the predictors). Each basis function is
described and the final model is listed at the bottom. This form of output is
especially desired by those who are comfortable with standard regression.
MARS Plots

Note the presence of nonlinearity in this dataset



TreeNet
Stochastic Gradient Boosting
Small decision trees built in an error-correcting sequence
1. Begin with small tree as initial model
2. Compute residuals from this model for all records
3. Grow a second small tree to predict these residuals
4. And so on
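The four steps above can be sketched for squared error. To keep it short, the "small tree" here is a fixed-threshold stump (a mean on each side), so the error-correcting loop is visible without a full tree grower; this is a toy illustration, not TreeNet itself:

```python
# Gradient boosting sketch: start from a trivial model, repeatedly fit
# a depth-1 "tree" to the residuals, and add it to the ensemble.

def fit_stump(x, y, threshold):
    left  = [yi for xi, yi in zip(x, y) if xi <= threshold]
    right = [yi for xi, yi in zip(x, y) if xi > threshold]
    lmean = sum(left) / len(left)
    rmean = sum(right) / len(right)
    return lambda xi: lmean if xi <= threshold else rmean

x = [1, 2, 3, 4]
y = [1.0, 1.0, 3.0, 5.0]

ensemble = []
pred = [0.0] * len(x)                                  # step 1
for _ in range(20):
    residuals = [yi - pi for yi, pi in zip(y, pred)]   # step 2
    stump = fit_stump(x, residuals, threshold=2)       # step 3
    ensemble.append(stump)
    pred = [pi + stump(xi) for pi, xi in zip(pred, x)] # step 4: and so on

print(pred)  # [1.0, 1.0, 4.0, 4.0]
```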
Build TreeNet Model



TreeNet Output Window

The Output window shows a graph of the number of trees in the ensemble
with its corresponding ROC value. The vertical green bar denotes the model
with the optimal ROC: 9 trees at 0.69.



Partial Dependency Plots

Using TreeNet for targeted marketing has improved on random calling and given you an idea of how the predictors affect subscription
Random Forests
Ensemble of trees built on bootstrap samples
Algorithm:
o Each tree is grown on a bootstrap sample from the learning data
o During tree growing, only P predictors are selected and tried at each
node
o By default, P is the square root of total predictors

The overall prediction is determined by averaging


Law of Large Numbers ensures convergence
The key to accuracy is low correlation and bias
To keep bias low, trees are grown to maximum
depth
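The two sources of randomness described above can be sketched with the standard library: a bootstrap sample of the rows, and a random subset of roughly sqrt(P) predictors tried at each node (not SPM code; the variable names are illustrative):

```python
import math
import random

def bootstrap_sample(records, rng):
    # Sample with replacement, same size as the learning data.
    return [rng.choice(records) for _ in records]

def candidate_predictors(all_predictors, rng):
    # By default, try sqrt(P) of the P predictors at a node.
    p = max(1, round(math.sqrt(len(all_predictors))))
    return rng.sample(all_predictors, p)

rng = random.Random(0)            # fixed seed for reproducibility
records = list(range(100))
boot = bootstrap_sample(records, rng)
preds = candidate_predictors([f"VAR{i}" for i in range(16)], rng)

print(len(boot), len(set(boot)) < len(boot))  # 100 True (duplicates appear)
print(len(preds))                             # 4 (sqrt of 16 predictors)
```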
Build RandomForests
Model



RandomForests Output1

The RandomForests optimal model is always the one with the most trees



RandomForests Summary



Prediction Success Table1

We want to minimize the false non-subscriber rate so that we spend the least effort to reach the most subscribers
Adjust Class Weights
The Class Weights default is BALANCED, which means small classes are upweighted to equal the size of the largest target class
Now we manually upweight class 1, the small class, even more than the BALANCED setting
Prediction Success Table2



Conclusion
CART, MARS, TreeNet and RandomForests
o Handle missing values automatically
o Detect interactions and nonlinearity automatically
o Models can be translated into other programming languages
o Model performance usually exceeds traditional classification algorithms
o Advanced settings boost model performance

CART provides initial insights into the dataset
MARS gives equations in a linear regression format with transformations of the original predictors
TreeNet generates more accurate models
RandomForests excels with wide datasets



KDD Cup 2009
Knowledge Discovery and Data mining competition
held once a year to challenge modelers to a task
o http://www.kdd.org/kddcup/index.php - competitions from 1997-2010
o Includes tasks, data, rules, results, and FAQs

KDD Cup 2009 was about customer relationship prediction
French telecom company Orange provided large
marketing databases
Overall goal was to beat the in-house system
implemented by Orange



Datasets
50,000 customers
15,000 predictors
o e.g., demographic, geographic, behavioral

Three binary classification tasks:
o Appetency: customer buys new product or service
o Churn: customer switches providers
o Upselling: customer buys upgrade offered to them

Training and testing datasets
Smaller subsets of data available for practice



Challenges
Large database
o 50,000 x 15,000

Numerical and categorical variables
Missing data
Unbalanced class distributions
o Many more customers NOT doing these things

Sanitized data - no intuition



Data Preparation
Combine multiple datasets
o Large dataset broken into 5 chunks, 53 MB each
o True target values needed to be appended

Delete or impute missing values
o Not necessary in SPM

Handle categorical variables
o Create dummy indicators
o Combine levels in variables with many levels
o Again, not necessary in SPM
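The two categorical-variable steps listed above (unnecessary in SPM but typical elsewhere) can be sketched as follows; the job values are illustrative stand-ins:

```python
# Pool rare levels into "OTHER", then expand a categorical variable
# into 0/1 dummy indicator columns.
from collections import Counter

def combine_rare_levels(values, min_count=2):
    counts = Counter(values)
    return [v if counts[v] >= min_count else "OTHER" for v in values]

def dummy_indicators(values):
    levels = sorted(set(values))
    return [{f"IS_{lvl}": int(v == lvl) for lvl in levels} for v in values]

jobs = ["services", "services", "admin", "admin", "housemaid"]
pooled = combine_rare_levels(jobs)
print(pooled)  # ['services', 'services', 'admin', 'admin', 'OTHER']
print(dummy_indicators(pooled)[0])
```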



Open Prepared Data



View Data



Run Descriptive Statistics



Target Frequencies



Appetency
In this context, appetency is
the propensity of the
customer to buy a new
product or service



CART Model Setup
Choose CART as the
Analysis Engine
Our Target is coded -1/1,
so we will choose
Classification/Logistic
Binary as the Target Type
Appetency is our
response variable and
VAR1-VAR15000 are our
predictors



Setting a Testing Method

A separate test dataset is provided in the competition, but true target values were not included
For model-building, we will use a 20% random partition of the training dataset to monitor performance



Restricting Tree Size

We are interested in looking at the CART ranking of important predictors
By forcing the tree to only one split, we can quickly create a tree to access this information



Penalties
We are aware there are variables with many missing values and variables with a high number of categorical levels
Setting penalties on these makes it harder for them to enter the model



Results - Single Split CART Tree



Variable Improvement Measures



TreeNet Model Setup



Results - TreeNet Ensemble



Variable Selection

Improvement measures are averaged across all trees in the ensemble
Only 185 of the original 15,000 predictors are flagged as important



Recursive Feature Elimination (RFE)
Remove one variable at a time from
the TOP of the variable importance list
to eliminate "too good to be true" predictors
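The shaving loop above can be sketched as: drop the top-ranked variable, refit, and keep the reduced set as long as the test score does not degrade. Here `fit_and_score` is a hypothetical stand-in for rebuilding a TreeNet model, simulated with a lookup table of made-up ROC values:

```python
# Shave from the TOP of the importance list while the score holds up.

def shave_from_top(ranked_vars, fit_and_score, tolerance=0.0):
    best_vars, best_score = list(ranked_vars), fit_and_score(ranked_vars)
    while len(best_vars) > 1:
        candidate = best_vars[1:]          # drop the current top variable
        score = fit_and_score(candidate)
        if score < best_score - tolerance:
            break                          # top variable was genuinely useful
        best_vars, best_score = candidate, max(best_score, score)
    return best_vars, best_score

# Simulated ROCs: VAR_A adds no honest signal, so removing it is safe.
scores = {("VAR_A", "VAR_B", "VAR_C"): 0.90,
          ("VAR_B", "VAR_C"): 0.905,
          ("VAR_C",): 0.70}
fit = lambda vs: scores[tuple(vs)]
print(shave_from_top(["VAR_A", "VAR_B", "VAR_C"], fit))
# (['VAR_B', 'VAR_C'], 0.905)
```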



RFE, Step 2
Remove one variable at a time from
the BOTTOM of the variable importance
list to eliminate weak predictors

Final ROC: 0.9048



Parameter Variation - Automates
Each TreeNet control parameter can be automatically varied over its values
A model is built at each step and summarized



Stability of the Model
Automate PARTITION varies the learn/test partition so the user can
observe the stability of model performance



Repeat on Churn
Churn is the propensity of the customer to switch providers
We repeat the same steps of model-building to achieve a final model

Final ROC: 0.7320



Repeat on Upsell
Upsell is the propensity of the customer to buy an upgrade offered to
them
We repeat the same steps of model-building to achieve a final model

Final ROC: 0.9059



Summary of Results
Rank  Team                                     Appetency  Churn   Upselling  Score
1     IBM Research                             0.8830     0.7611  0.9038     0.8493
--    You!                                     0.9048     0.7320  0.9059     0.8476
2     ID Analytics, Inc.                       0.8724     0.7565  0.9056     0.8448
3     Old dogs with new tricks                 0.8740     0.7541  0.9050     0.8443
4     Crusaders                                0.8688     0.7569  0.9034     0.8430
5     Financial Engineering Group, Inc. Japan  0.8732     0.7498  0.9057     0.8429

Unable to compare to true target values because these were only seen
by competition judges
However, we are confident in our results (2 of the above groups used SPM)
Results can vary based on optimal selection criterion, random number
seed, etc.



Overall Conclusions
We were able to narrow down the predictor list
significantly using TreeNet and Automate SHAVING
o Of the original 15,000 predictors:
Appetency: 167
Churn: 249
Upselling: 165

Handling of categorical variables and missing values was automatic and didn't cause any issues
Small rates in the class of interest didn't pose a problem
o Priors/Costs and Class Weights can control for this in CART and TreeNet

Couldn't draw any insight as to the variables affecting appetency, churn, and upsell

