
Bagging and Boosting Classification Trees to Predict Churn:
Insights from the US Telecom Industry

Forthcoming, Journal of Marketing Research


The Context
The 2002 Churn Tournament organised by the Teradata Center for CRM
at Duke University

Churn means defecting from a company, i.e. taking one's business elsewhere

Customer database from an anonymous U.S. wireless telecom company

Challenge: predicting churn in order to design targeted retention
strategies (Bolton et al. 2000, Ganesh et al. 2000, Shaffer and Zhang 2002)

Details can be found in Neslin et al. (2004)

The Context (cont'd)
The US Wireless Telecom market (2004)

182.1 million subscribers

Leader in market share: Cingular Wireless
26.9% of total market volume
Turnover of US$19.4 billion / net income of US$201 million

Other major players: AT&T, Verizon, Sprint and Nextel

Mergers & Acquisitions: Cingular with AT&T Wireless, and Sprint with Nextel
The Context (cont'd)
Churn

High churn rates: 2.6% a month

Causes: increased competition, lack of differentiation, market saturation

Cost: $300 to $700 to replace a lost customer, in terms of sales support,
marketing, advertising, etc.

Hence the need for targeted retention strategies
Formulation of the Churn Problem
Churn as a classification issue:

Classify a customer i, characterized by K variables
x_i = (x_i1, x_i2, ..., x_iK), as

Churner: y_i = +1

Non-churner: y_i = -1

Churn is the binary response variable to predict: y_i = f(x_i)

Which binary choice model f(.) should be chosen?

Classification Models in Marketing
Simple binary logit choice model (e.g. Andrews et al. 2002)
Models allowing for heterogeneity in consumers' response:
Finite mixture model (e.g. Wedel and Kamakura 2000)
Hierarchical Bayes model (e.g. Yang and Allenby 2003)
Non-parametric choice models:
Decision trees, neural nets (e.g. Thieme et al. 2000; West et al. 1997)
Bagging (Breiman 1996), Boosting (Freund and Schapire 1996),
Stochastic gradient boosting (Friedman 2002)
Mostly ignored in the marketing literature

S.G.B. won the Tournament (Cardell, from Salford Systems)


Decision Trees for Churn
Example:
[Figure: classification tree whose root node splits on the change in consumption
(< 0.5 vs. >= 0.5); subsequent splits use the number of customer care calls
(< 3 vs. >= 3), age (< 26, between 26 and 55, > 55) and handset price
(>= $150 vs. < $150), leading to churn ("Yes") and non-churn ("No") leaves.]
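A single classification tree of this kind could be fit with scikit-learn; the data and predictor names below are hypothetical stand-ins for the variables shown in the figure, so this is only an illustrative sketch.

```python
# Minimal sketch: fit and print one classification tree (illustrative data only).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["change_in_consumption", "customer_care_calls", "age", "handset_price"]
X = np.array([[0.2, 1, 30, 120],
              [0.8, 4, 58, 200],
              [-0.1, 2, 24, 90],
              [0.6, 0, 41, 160]])
y = np.array([-1, 1, 1, -1])          # +1 = churner, -1 = non-churner

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=feature_names))   # splits like the example tree
```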
Bagging and Boosting

Machine Learning Algorithms

Principle: classifier aggregation (Breiman, 1996)

Tree-based method (e.g. Currim et al. 1988)

Bagging: Bootstrap AGGregatING

[Diagram: from the calibration sample Z = {(x_i, y_i)}, i = 1, ..., N, random
bootstrap samples Z_1*, Z_2*, ..., Z_B* are drawn; a base classifier (e.g. a tree)
is fit on each, giving score functions f_1*(x), f_2*(x), ..., f_B*(x), which are
then aggregated.]

Churn propensity score:
f_bag(x) = (1/B) sum_{b=1}^{B} f_b*(x)

Churn classification:
c_bag(x) = sign( f_bag(x) )
Bagging
Let the calibration sample be Z = {(x_1, y_1), ..., (x_i, y_i), ..., (x_N, y_N)}

Draw B bootstrap samples Z_b*, b = 1, 2, ..., B

From each Z_b*, a base classifier (e.g. a tree) is estimated,
giving B score functions: f_1*(x), ..., f_b*(x), ..., f_B*(x)

The final classifier is obtained by averaging the scores:
f_bag(x) = (1/B) sum_{b=1}^{B} f_b*(x)

The classification rule is carried out via
c_bag(x) = sign( f_bag(x) )
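A minimal sketch of these steps in Python, assuming scikit-learn and NumPy (an illustration, not the authors' implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_bagged_trees(X, y, B=100, random_state=0):
    """Fit B trees, each on a bootstrap sample Z_b* of the calibration data."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(random_state)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, len(y), size=len(y))          # bootstrap sample Z_b*
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def f_bag(trees, X):
    """Churn propensity score: average of the B tree scores f_b*(x) in {-1, +1}."""
    return np.mean([t.predict(np.asarray(X)) for t in trees], axis=0)

def c_bag(trees, X):
    """Classification rule: sign of the averaged score."""
    return np.sign(f_bag(trees, X))
```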
Stochastic Gradient Boosting
Winner of the Teradata Churn Modeling Tournament
(Cardell, Golovnya and Steinberg, Salford Systems)

Data are adaptively resampled:

Previously misclassified observations receive increasing weights

Previously well-classified observations receive decreasing weights
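For illustration, Friedman's (2002) stochastic gradient boosting can be sketched with scikit-learn's GradientBoostingClassifier using subsampling; the tournament entry itself used Salford Systems' TreeNet, not this code.

```python
# Sketch of stochastic gradient boosting (Friedman 2002) with scikit-learn.
from sklearn.ensemble import GradientBoostingClassifier

sgb = GradientBoostingClassifier(
    n_estimators=100,     # number of boosting iterations
    learning_rate=0.1,    # shrinkage
    max_depth=3,          # small trees as base learners
    subsample=0.5,        # < 1.0 makes the boosting "stochastic"
    random_state=0,
)
# sgb.fit(X_train, y_train); sgb.predict_proba(X_valid)[:, 1] gives churn scores
```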


Data

[Diagram: the customer database is split over time into a calibration sample
and a hold-out validation sample.]

Calibration sample: balanced sample of N = 51,306 customers, with an equal
proportion of churners (50%); for each customer i we observe x_i = (x_1, ..., x_46)
and y_i

Validation hold-out sample: proportional sample of N = 100,462 customers, with
the real-life proportion of churners (1.8%); again x_i = (x_1, ..., x_46) and y_i

46 predictor variables:
Behavioral predictors, e.g. the average monthly minutes of use
Company interactions, e.g. the mean unrounded minutes of customer care calls
Customer demographics, e.g. the number of adults in the household
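A sketch of how such a balanced calibration sample could be drawn, assuming a pandas DataFrame df with a hypothetical churn column coded +1 / -1:

```python
import pandas as pd

def balanced_sample(df, n_per_class, random_state=0):
    """Draw a calibration sample with 50% churners and 50% non-churners."""
    churners = df[df["churn"] == 1].sample(n_per_class, random_state=random_state)
    stayers = df[df["churn"] == -1].sample(n_per_class, random_state=random_state)
    return pd.concat([churners, stayers]).sample(frac=1, random_state=random_state)
```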
Research Questions
Do bagging and boosting provide better results than other benchmarks?

What are the financial gains to be expected from this improvement?

What are the most relevant churn drivers or triggers that marketers
could watch for?

How should estimated scores obtained from a balanced calibration sample
be corrected when predicting rare events like churn?
Comparing Error Rates

Model*                             Validated Error Rate**
Binary Logit Model                 0.400
Bagging (tree-based)               0.374
Stochastic Gradient Boosting       0.460

* Model estimated on the balanced calibration sample
** Error rates computed on the hold-out proportional validation sample
Bias due to Balanced Sampling
Overestimation of the number of churners

Several bias correction methods exist (see e.g. Cosslett 1993; Donkers et al.
2003; Franses and Paap 2001, pp. 73-75; Imbens and Lancaster 1996; King and
Zeng 2001a,b; Scott and Wild 1997)

However, most are dedicated to traditional models (e.g. logit)

We discuss two corrections for bagging and boosting


The Bias Correction Methods
The weighting correction:
Based on marketers' prior beliefs about the churn rate, i.e. the proportion of
churners among their customers, we attach weights to the observations of the
balanced calibration sample.

The intercept correction:
Take a non-zero cut-off value B^ such that the proportion of predicted churners
in the calibration sample equals the actual a priori proportion of churners.
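A sketch of the two corrections applied to bagged scores (function names and structure are illustrative, not the authors' code):

```python
import numpy as np

def intercept_correction_cutoff(scores, prior_churn_rate):
    """Cut-off B^ such that the share of predicted churners equals the prior.
    Customers with f_bag(x) above the cut-off are classified as churners."""
    return np.quantile(np.asarray(scores), 1.0 - prior_churn_rate)

def weighting_correction_weights(y, prior_churn_rate):
    """Observation weights for a balanced sample (50% churners) given the true
    churn rate, e.g. 1.8%: churners down-weighted, non-churners up-weighted."""
    y = np.asarray(y)
    w_churner = prior_churn_rate / 0.5
    w_nonchurner = (1.0 - prior_churn_rate) / 0.5
    return np.where(y == 1, w_churner, w_nonchurner)
```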


Bagging (recap)
Let the calibration sample be Z = {(x_1, y_1), ..., (x_i, y_i), ..., (x_N, y_N)}

Draw B bootstrap samples Z_b*, b = 1, 2, ..., B

From each Z_b*, a base classifier (e.g. a tree) is estimated,
giving B score functions: f_1*(x), ..., f_b*(x), ..., f_B*(x)

The final classifier is obtained by averaging the scores:
f_bag(x) = (1/B) sum_{b=1}^{B} f_b*(x)

With the intercept correction, the classification is carried out via
c_bag(x) = sign( f_bag(x) - B^ ), with B^ the cut-off value
Assessing the Best Bias Correction

                                Bias Correction
Model*                     No correction   Intercept   Weighting
                                Validated Error Rates**
Binary logit model             0.400          0.035       0.018
Bagging (tree-based)           0.374          0.034       0.025
S.G. boosting                  0.460          0.034       0.018

* Model estimated on the balanced calibration sample
** Error rates computed on the hold-out proportional validation sample
The Top-Decile Lift
Focuses on the most critical group of customers regarding their churn risk:
the ideal segment for targeting a retention marketing campaign, i.e. the
top 10% riskiest customers

Top-decile lift = β_10% / β₀

with β_10% = the proportion of churners in this risky segment
and β₀ = the proportion of churners in the whole validation set
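A sketch of the top-decile lift computation on validation data (assuming NumPy arrays of scores and outcomes coded +1 / -1):

```python
import numpy as np

def top_decile_lift(scores, y):
    """Churn rate among the 10% highest-scored customers / overall churn rate."""
    scores, y = np.asarray(scores), np.asarray(y)
    n_top = int(np.ceil(0.10 * len(scores)))
    top = np.argsort(scores)[::-1][:n_top]          # 10% riskiest customers
    return np.mean(y[top] == 1) / np.mean(y == 1)
```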
Financial Gains: Neslin et al. (2004)

Gain = N α Δ(top-decile lift) β₀ [ γ LVC - δ (γ - ψ) ]

N: customer base of the company
α: percentage of targeted customers (here, 10%)
Δ(top-decile lift): increase in top-decile lift
β₀: proportion of churners in the customer base
γ: success rate of the incentive among the churners
LVC: lifetime value of a customer (Gupta, Lehmann and Stuart 2004)
δ: incentive cost per customer
ψ: success rate of the incentive among the non-churners
Top-Decile Lift with Intercept Correction

[Figure: validated top-decile lift* as a function of the number of iterations
(0 to 100) for bagging, stochastic gradient boosting and the binary logit model
(lift axis from 1.6 to 2.6). The ensemble methods improve the top-decile lift
by about +26% over the logit benchmark.]

* Model estimated on the balanced sample, and lift computed on the validation sample.
Validated** Top-Decile Lift

Model*                             No / Intercept correction   Weighting correction
Binary logit model                 1.775                        1.764
Bagging (tree-based)               2.246                        1.549
Stochastic gradient boosting       2.290                        1.632

* Model estimated on the balanced calibration sample
** Lifts computed on the hold-out proportional validation sample
Financial Gains

Gain = N α Δ(top-decile lift) β₀ [ γ LVC - δ (γ - ψ) ]

If we consider
N: a customer base of 5,000,000 customers
α: 10% of targeted customers
β₀: the real-life churn proportion of 1.8%
γ: 30% success rate of the incentive among the churners
LVC: $2,500 lifetime value of a customer
δ: $50 incentive cost per customer
ψ: 50% success rate of the incentive among the non-churners
Financial Gains
Additional financial gains that we may expect from a retention marketing
campaign targeted using the scores predicted by bagging, instead of a
random selection:
Δ(top-decile lift): 1.246 (= 2.246 - 1.000)
Gain = + $8,550,000

Additional financial gains that we may expect from a retention marketing
campaign targeted using the scores predicted by bagging, instead of the
logit model:
Δ(top-decile lift): 0.471 (= 2.246 - 1.775)
Gain = + $3,214,800
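A quick check of this arithmetic with the gain formula as reconstructed above (lift differences rounded to two decimals; the exact lifts give marginally different figures):

```python
# Gain formula (as reconstructed above); parameter values from the slide.
N, alpha, beta0 = 5_000_000, 0.10, 0.018
gamma, LVC, delta, psi = 0.30, 2_500, 50, 0.50

def gain(delta_lift):
    return N * alpha * delta_lift * beta0 * (gamma * LVC - delta * (gamma - psi))

print(gain(2.25 - 1.00))   # bagging vs. random selection -> 8,550,000
print(gain(2.25 - 1.78))   # bagging vs. binary logit     -> 3,214,800
```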
Most Important Churn Triggers
[Figure: relative importance (scale 0-100) of the predictors in the bagging model.]
Mean inbound calls of less than 1 min.
Mean monthly min. wireless to wireless
Age
Mean monthly min. of use
Mean attempted calls
Months in service
Mean completed calls
Average monthly min. of use (6 months)
Mean overage revenue
Total revenue over life
Mean peak calls
Handset price
Base cost of the calling plan
Change in monthly min. of use
Equipment days
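Relative importances of this kind can be sketched by averaging the impurity-based importances of the bagged trees (reusing the trees from the bagging sketch above; an illustration, not necessarily the measure used in the paper):

```python
import numpy as np

def relative_importance(trees):
    """Average each tree's feature importances and rescale so the top variable = 100."""
    imp = np.mean([t.feature_importances_ for t in trees], axis=0)
    return 100 * imp / imp.max()
```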
Partial Dependence Plots
[Figure: partial dependence of the probability to churn (in %) on two of the most
important predictors for the bagging model: change in monthly min. of use
(left panel, roughly -1000 to 2000) and equipment days (right panel, 0 to 1500).]

[Figure: additional partial dependence plot.]
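A partial dependence curve for one predictor can be sketched by varying that predictor over a grid and averaging the bagged scores (reusing f_bag from the bagging sketch above; illustrative only):

```python
import numpy as np

def partial_dependence(trees, X, feature_idx, grid):
    """Average bagged score over the data, with one predictor fixed at each grid value."""
    X = np.asarray(X, dtype=float)
    pd_values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = v          # set the predictor to the grid value
        pd_values.append(f_bag(trees, X_mod).mean())
    return np.array(pd_values)
```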
Conclusions: Main Findings
1. Bagging and S.G. boosting are substantially better classifiers than the
binary logit choice model:
Improvement of 26% for the top-decile lift,
Good diagnostic measures offering face validity,
Interesting insights about potential churn drivers,
Bagging is conceptually simple and easy to implement.

2. Intercept correction constitutes an appropriate bias correction for
bagging when using a balanced sampling scheme.
Thanks for your attention
From Profit to Financial Gains

Profit_1 = N α [ β_1 γ LVC - δ (β_1 γ + (1 - β_1) ψ) - c ]

with β_1 the proportion of churners among the N α targeted customers:
β_1 γ LVC is the lifetime value recovered from the churners who do not churn,
δ (β_1 γ + (1 - β_1) ψ) is the incentive cost for the churners retained and the
targeted non-churners who accept the incentive, and c is the contact cost.

Top-decile lift_1 = β_1 / β₀

Gain_(1-2) = Profit_1 - Profit_2
           = N α (β_1 - β_2) [ γ LVC - δ (γ - ψ) ]
           = N α Δ(top-decile lift) β₀ [ γ LVC - δ (γ - ψ) ]
