
Bagging and Boosting Classification Trees to Predict Churn:
Insights from the US Telecom Industry

Forthcoming, Journal of Marketing Research


The Context
The 2002 Churn Tournament organised by the Teradata Center for CRM
at Duke University

Churn means defecting from a company, i.e. taking one's business elsewhere

Customer database from an anonymous U.S. wireless telecom company

Challenge: predicting churn in order to design targeted retention
strategies (Bolton et al. 2000, Ganesh et al. 2000, Shaffer and Zhang 2002)

Details can be found in Neslin et al. (2004)

The Context (cont'd)
The US Wireless Telecom market (2004)

182.1 million subscribers

Leader in market share: Cingular Wireless
26.9% of total market volume
Turnover of US$19.4 billion / net income of US$201 million

Other major players: AT&T, Verizon, Sprint and Nextel

Mergers & Acquisitions: Cingular with AT&T Wireless, and Sprint with Nextel
The Context (cont'd)
Churn

High churn rates: 2.6% a month

Causes: increased competition, lack of differentiation, market saturation

Cost: $300 to $700 to replace a lost customer, in terms of sales support,
marketing, advertising, etc.

Hence the need for targeted retention strategies
Formulation of the Churn Problem
Churn as a classification issue:

Classify a customer i, characterized by K variables
x_i = (x_i1, x_i2, ..., x_iK), as

Churner: y_i = +1

Non-churner: y_i = -1

Churn is the binary response variable to predict: y_i = f(x_i)

Which binary choice model f(.) should be chosen?

Classification Models in Marketing
Simple binary logit choice model (e.g. Andrews et al. 2002)
Models allowing for heterogeneity in consumers' response:
Finite mixture model (e.g. Wedel and Kamakura 2000)
Hierarchical Bayes model (e.g. Yang and Allenby 2003)
Non-parametric choice models:
Decision trees, neural nets (e.g. Thieme et al. 2000; West et al. 1997)
Bagging (Breiman 1996), Boosting (Freund and Schapire 1996),
Stochastic gradient boosting (Friedman 2002)
Mostly ignored in the marketing literature

S.G.B. won the Tournament (Cardell, from Salford Systems)


Decision Trees for Churn
Example:
[Figure: classification tree whose root node splits on the change in consumption
(< 0.5 vs. >= 0.5); subsequent splits use the number of customer care calls
(< 3 vs. >= 3), age (< 26, between 26 and 55, > 55) and handset price
(>= $150 vs. < $150), leading to churn ("Yes") and non-churn ("No") leaves.]
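A single classification tree of this kind could be fit with scikit-learn; the data and predictor names below are hypothetical stand-ins for the variables shown in the figure, so this is only an illustrative sketch.

```python
# Minimal sketch: fit and print one classification tree (illustrative data only).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["change_in_consumption", "customer_care_calls", "age", "handset_price"]
X = np.array([[0.2, 1, 30, 120],
              [0.8, 4, 58, 200],
              [-0.1, 2, 24, 90],
              [0.6, 0, 41, 160]])
y = np.array([-1, 1, 1, -1])          # +1 = churner, -1 = non-churner

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=feature_names))   # splits like the example tree
```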
Bagging and Boosting

Machine Learning Algorithms

Principle: classifier aggregation (Breiman, 1996)

Tree-based method (e.g. Currim et al. 1988)

Bagging: Bootstrap AGGregatING

[Diagram: from the calibration sample Z = {(x_i, y_i)}, i = 1, ..., N, random
bootstrap samples Z_1*, Z_2*, ..., Z_B* are drawn; a base classifier (e.g. a tree)
is fit on each, giving score functions f_1*(x), f_2*(x), ..., f_B*(x), which are
then aggregated.]

Churn propensity score:
f_bag(x) = (1/B) sum_{b=1}^{B} f_b*(x)

Churn classification:
c_bag(x) = sign( f_bag(x) )
Bagging
Let the calibration sample be Z = {(x_1, y_1), ..., (x_i, y_i), ..., (x_N, y_N)}

Draw B bootstrap samples Z_b*, b = 1, 2, ..., B

From each Z_b*, a base classifier (e.g. a tree) is estimated,
giving B score functions: f_1*(x), ..., f_b*(x), ..., f_B*(x)

The final classifier is obtained by averaging the scores:
f_bag(x) = (1/B) sum_{b=1}^{B} f_b*(x)

The classification rule is carried out via
c_bag(x) = sign( f_bag(x) )
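A minimal sketch of these steps in Python, assuming scikit-learn and NumPy (an illustration, not the authors' implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_bagged_trees(X, y, B=100, random_state=0):
    """Fit B trees, each on a bootstrap sample Z_b* of the calibration data."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(random_state)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, len(y), size=len(y))          # bootstrap sample Z_b*
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def f_bag(trees, X):
    """Churn propensity score: average of the B tree scores f_b*(x) in {-1, +1}."""
    return np.mean([t.predict(np.asarray(X)) for t in trees], axis=0)

def c_bag(trees, X):
    """Classification rule: sign of the averaged score."""
    return np.sign(f_bag(trees, X))
```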
Stochastic Gradient Boosting
Winner of the Teradata Churn Modeling Tournament
(Cardell, Golovnya and Steinberg, Salford Systems)

Data are adaptively resampled:

Previously misclassified observations receive increasing weights

Previously well-classified observations receive decreasing weights
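For illustration, Friedman's (2002) stochastic gradient boosting can be sketched with scikit-learn's GradientBoostingClassifier using subsampling; the tournament entry itself used Salford Systems' TreeNet, not this code.

```python
# Sketch of stochastic gradient boosting (Friedman 2002) with scikit-learn.
from sklearn.ensemble import GradientBoostingClassifier

sgb = GradientBoostingClassifier(
    n_estimators=100,     # number of boosting iterations
    learning_rate=0.1,    # shrinkage
    max_depth=3,          # small trees as base learners
    subsample=0.5,        # < 1.0 makes the boosting "stochastic"
    random_state=0,
)
# sgb.fit(X_train, y_train); sgb.predict_proba(X_valid)[:, 1] gives churn scores
```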


Data

[Diagram: the customer database is split over time into a calibration sample
and a hold-out validation sample.]

Calibration sample: balanced sample of N = 51,306 customers, with an equal
proportion of churners (50%); for each customer i we observe x_i = (x_1, ..., x_46)
and y_i

Validation hold-out sample: proportional sample of N = 100,462 customers, with
the real-life proportion of churners (1.8%); again x_i = (x_1, ..., x_46) and y_i

46 predictor variables:
Behavioral predictors, e.g. the average monthly minutes of use
Company interactions, e.g. the mean unrounded minutes of customer care calls
Customer demographics, e.g. the number of adults in the household
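A sketch of how such a balanced calibration sample could be drawn, assuming a pandas DataFrame df with a hypothetical churn column coded +1 / -1:

```python
import pandas as pd

def balanced_sample(df, n_per_class, random_state=0):
    """Draw a calibration sample with 50% churners and 50% non-churners."""
    churners = df[df["churn"] == 1].sample(n_per_class, random_state=random_state)
    stayers = df[df["churn"] == -1].sample(n_per_class, random_state=random_state)
    return pd.concat([churners, stayers]).sample(frac=1, random_state=random_state)
```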
Research Questions
Do bagging and boosting provide better results than other benchmarks?

What are the financial gains to be expected from this improvement?

What are the most relevant churn drivers or triggers that marketers
could watch for?

How should estimated scores obtained from a balanced calibration sample
be corrected when predicting rare events like churn?
Comparing Error Rates

Model*                             Validated Error Rate**
Binary Logit Model                 0.400
Bagging (tree-based)               0.374
Stochastic Gradient Boosting       0.460

* Model estimated on the balanced calibration sample
** Error rates computed on the hold-out proportional validation sample
Bias due to Balanced Sampling
Overestimation of the number of churners

Several bias correction methods exist (see e.g. Cosslett 1993; Donkers et al.
2003; Franses and Paap 2001, pp. 73-75; Imbens and Lancaster 1996; King and
Zeng 2001a,b; Scott and Wild 1997)

However, most are dedicated to traditional models (e.g. logit)

We discuss two corrections for bagging and boosting


The Bias Correction Methods
The weighting correction:
Based on marketers' prior beliefs about the churn rate, i.e. the proportion of
churners among their customers, we attach weights to the observations of the
balanced calibration sample.

The intercept correction:
Take a non-zero cut-off value B^ such that the proportion of predicted churners
in the calibration sample equals the actual a priori proportion of churners.
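A sketch of the two corrections applied to bagged scores (function names and structure are illustrative, not the authors' code):

```python
import numpy as np

def intercept_correction_cutoff(scores, prior_churn_rate):
    """Cut-off B^ such that the share of predicted churners equals the prior.
    Customers with f_bag(x) above the cut-off are classified as churners."""
    return np.quantile(np.asarray(scores), 1.0 - prior_churn_rate)

def weighting_correction_weights(y, prior_churn_rate):
    """Observation weights for a balanced sample (50% churners) given the true
    churn rate, e.g. 1.8%: churners down-weighted, non-churners up-weighted."""
    y = np.asarray(y)
    w_churner = prior_churn_rate / 0.5
    w_nonchurner = (1.0 - prior_churn_rate) / 0.5
    return np.where(y == 1, w_churner, w_nonchurner)
```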


Bagging (recap)
Let the calibration sample be Z = {(x_1, y_1), ..., (x_i, y_i), ..., (x_N, y_N)}

Draw B bootstrap samples Z_b*, b = 1, 2, ..., B

From each Z_b*, a base classifier (e.g. a tree) is estimated,
giving B score functions: f_1*(x), ..., f_b*(x), ..., f_B*(x)

The final classifier is obtained by averaging the scores:
f_bag(x) = (1/B) sum_{b=1}^{B} f_b*(x)

With the intercept correction, the classification is carried out via
c_bag(x) = sign( f_bag(x) - B^ ), with B^ the cut-off value
Assessing the Best Bias Correction

                                Bias Correction
Model*                     No correction   Intercept   Weighting
                                Validated Error Rates**
Binary logit model             0.400          0.035       0.018
Bagging (tree-based)           0.374          0.034       0.025
S.G. boosting                  0.460          0.034       0.018

* Model estimated on the balanced calibration sample
** Error rates computed on the hold-out proportional validation sample
The Top-Decile Lift
Focuses on the most critical group of customers regarding their churn risk:
the ideal segment for targeting a retention marketing campaign, i.e. the
top 10% riskiest customers

Top-decile lift = β_10% / β₀

with β_10% = the proportion of churners in this risky segment
and β₀ = the proportion of churners in the whole validation set
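A sketch of the top-decile lift computation on validation data (assuming NumPy arrays of scores and outcomes coded +1 / -1):

```python
import numpy as np

def top_decile_lift(scores, y):
    """Churn rate among the 10% highest-scored customers / overall churn rate."""
    scores, y = np.asarray(scores), np.asarray(y)
    n_top = int(np.ceil(0.10 * len(scores)))
    top = np.argsort(scores)[::-1][:n_top]          # 10% riskiest customers
    return np.mean(y[top] == 1) / np.mean(y == 1)
```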
Financial Gains: Neslin et al. (2004)

Gain = N α Δ(top-decile lift) β₀ [ γ LVC - δ (γ - ψ) ]

N: customer base of the company
α: percentage of targeted customers (here, 10%)
Δ(top-decile lift): increase in top-decile lift
β₀: proportion of churners in the customer base
γ: success rate of the incentive among the churners
LVC: lifetime value of a customer (Gupta, Lehmann and Stuart 2004)
δ: incentive cost per customer
ψ: success rate of the incentive among the non-churners
Top-Decile Lift with Intercept Correction

[Figure: validated top-decile lift* as a function of the number of iterations
(0 to 100) for bagging, stochastic gradient boosting and the binary logit model
(lift axis from 1.6 to 2.6). The ensemble methods improve the top-decile lift
by about +26% over the logit benchmark.]

* Model estimated on the balanced sample, and lift computed on the validation sample.
Validated** Top-Decile Lift

Model*                             No / Intercept correction   Weighting correction
Binary logit model                 1.775                        1.764
Bagging (tree-based)               2.246                        1.549
Stochastic gradient boosting       2.290                        1.632

* Model estimated on the balanced calibration sample
** Lifts computed on the hold-out proportional validation sample
Financial Gains

Gain = N α Δ(top-decile lift) β₀ [ γ LVC - δ (γ - ψ) ]

If we consider
N: a customer base of 5,000,000 customers
α: 10% of targeted customers
β₀: the real-life churn proportion of 1.8%
γ: 30% success rate of the incentive among the churners
LVC: $2,500 lifetime value of a customer
δ: $50 incentive cost per customer
ψ: 50% success rate of the incentive among the non-churners
Financial Gains
Additional financial gains that we may expect from a retention marketing
campaign targeted using the scores predicted by bagging, instead of a
random selection:
Δ(top-decile lift): 1.246 (= 2.246 - 1.000)
Gain = + $8,550,000

Additional financial gains that we may expect from a retention marketing
campaign targeted using the scores predicted by bagging, instead of the
logit model:
Δ(top-decile lift): 0.471 (= 2.246 - 1.775)
Gain = + $3,214,800
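A quick check of this arithmetic with the gain formula as reconstructed above (lift differences rounded to two decimals; the exact lifts give marginally different figures):

```python
# Gain formula (as reconstructed above); parameter values from the slide.
N, alpha, beta0 = 5_000_000, 0.10, 0.018
gamma, LVC, delta, psi = 0.30, 2_500, 50, 0.50

def gain(delta_lift):
    return N * alpha * delta_lift * beta0 * (gamma * LVC - delta * (gamma - psi))

print(gain(2.25 - 1.00))   # bagging vs. random selection -> 8,550,000
print(gain(2.25 - 1.78))   # bagging vs. binary logit     -> 3,214,800
```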
Most Important Churn Triggers
[Figure: relative importance (scale 0-100) of the predictors in the bagging model.]
Mean inbound calls of less than 1 min.
Mean monthly min. wireless to wireless
Age
Mean monthly min. of use
Mean attempted calls
Months in service
Mean completed calls
Average monthly min. of use (6 months)
Mean overage revenue
Total revenue over life
Mean peak calls
Handset price
Base cost of the calling plan
Change in monthly min. of use
Equipment days
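Relative importances of this kind can be sketched by averaging the impurity-based importances of the bagged trees (reusing the trees from the bagging sketch above; an illustration, not necessarily the measure used in the paper):

```python
import numpy as np

def relative_importance(trees):
    """Average each tree's feature importances and rescale so the top variable = 100."""
    imp = np.mean([t.feature_importances_ for t in trees], axis=0)
    return 100 * imp / imp.max()
```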
Partial Dependence Plots
[Figure: partial dependence of the probability to churn (in %) on two of the most
important predictors for the bagging model: change in monthly min. of use
(left panel, roughly -1000 to 2000) and equipment days (right panel, 0 to 1500).]

[Figure: additional partial dependence plot.]
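A partial dependence curve for one predictor can be sketched by varying that predictor over a grid and averaging the bagged scores (reusing f_bag from the bagging sketch above; illustrative only):

```python
import numpy as np

def partial_dependence(trees, X, feature_idx, grid):
    """Average bagged score over the data, with one predictor fixed at each grid value."""
    X = np.asarray(X, dtype=float)
    pd_values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = v          # set the predictor to the grid value
        pd_values.append(f_bag(trees, X_mod).mean())
    return np.array(pd_values)
```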
Conclusions: Main Findings
1. Bagging and S.G. boosting are substantially better classifiers than the
binary logit choice model:
Improvement of 26% for the top-decile lift,
Good diagnostic measures offering face validity,
Interesting insights about potential churn drivers,
Bagging is conceptually simple and easy to implement.

2. Intercept correction constitutes an appropriate bias correction for
bagging when using a balanced sampling scheme.
Thanks for your attention
From Profit to Financial Gains

Profit_1 = N α [ β_1 γ LVC - δ (β_1 γ + (1 - β_1) ψ) - c ]

with β_1 the proportion of churners among the N α targeted customers:
β_1 γ LVC is the lifetime value recovered from the churners who do not churn,
δ (β_1 γ + (1 - β_1) ψ) is the incentive cost for the churners retained and the
targeted non-churners who accept the incentive, and c is the contact cost.

Top-decile lift_1 = β_1 / β₀

Gain_(1-2) = Profit_1 - Profit_2
           = N α (β_1 - β_2) [ γ LVC - δ (γ - ψ) ]
           = N α Δ(top-decile lift) β₀ [ γ LVC - δ (γ - ψ) ]
