
IBM Research

Data Analytics for Marketing


Decision Support: Introduction and a Wallet Estimation Case Study
Saharon Rosset
IBM T.J. Watson Research Center

2006 IBM Corporation


Two parts:
1. Introduction to the use of Data Mining in marketing applications
   (Collaborator: Naoki Abe)
   - What are the problems we address?
   - Comparison of Data Mining and Marketing Science approaches
   - Some of the challenges for Data Mining approaches
2. Customer Wallet and Opportunity Estimation: Analytical Approaches and Applications
   (Collaborators: Claudia Perlich, Rick Lawrence, Srujana Merugu and others)
   - Define the problem
   - Describe analytic solutions
   - Demonstrate performance in a real application


The grand challenges of marketing

- Maximize profits (duh)
- Initiate, maintain and improve relationships with customers:
  - Acquire customers
  - Create loyalty, prevent churn
  - Improve profitability (lifetime value)
- Optimize use of resources:
  - Sales channels
  - Advertising
  - Customer targeting


Some of the concrete modeling problems

- Channel optimization
- Cross/up-sell (customer targeting)
- New customer acquisition
- Churn analysis
- Product life-cycle analysis
- Customer lifetime value modeling
  - Effect of marketing actions on LTV?
- Advertising allocation
- RFM (Recency, Frequency, Monetary) analysis
- ...


Data analytics for decision support: grand challenge

Beyond modeling the current situation, we need to offer insight about the effect or potential of possible actions and decisions:
- How would different channels / incentives affect the LTV of our customers?
- How much more money could this customer be spending with us (customer wallet)?
- Can we predict the effects of new actions that have never been tried in historical data? What if they have been tried only on a non-representative set?
- Can we be confident our results are actionable? Can we differentiate causality from correlation in our models?


Typical marketing analytics vs. data mining

- CRM analytics:
  - Relies on primary research (= surveys) to understand needs and wants
  - Relies on (more or less) detailed models of customer behavior
    - Usually parametric statistical models
    - Often estimates customer-level parameters
- Data mining:
  - Typically relies on data in a Data Warehouse / Mart
  - Uses a minimum of parametric assumptions
  - Often attempts to fit the problem into a standard modeling framework: classification, regression, clustering...


Comparison of approaches

Criterion                                                                | Marketing | DM
Parametric models formalize knowledge of domain and problems             |     x     |
Robust against incorrect assumptions about domain and problems           |           |  x
Actively collect the data to estimate model quantities (active learning) |     x     |
Rely on existing, abundant data in corporate Data Warehouses             |           |  x
Integrate expert input from managers and customers (wants and needs)     |     x     |
Use data to learn new, surprising patterns about customer behavior       |           |  x


Example 1: modeling and improving LTV

Rust, Lemon and Zeithaml (2004), "Return on Marketing: Using Customer Equity to Focus Marketing Strategy", Journal of Marketing

- Modeling customer equity / lifetime value
  - Combines several previous approaches
  - Models the brand switching matrix as a function of customer preference, history and product properties
  - Wants to identify drivers of satisfaction (levers)
  - Calculates the effect (ROI) of marketing actions pulling levers
- Mostly relies on primary research collected specifically for this study
  - Interviews with managers
  - Survey of consumer preferences


Simplified version of the paper's business model

[Diagram: marketing investment → pulling levers → increased equity; return on marketing investment weighs the increased equity against the costs]

Main goals:
1. Identify relevant levers
2. Quantify their effect


Analytic setup (main components only)

- logit(p_ijk) = β_0k LAST_ijk + x_ik' β_k
  - p_ijk is the probability that customer i buys brand k, given they bought brand j previously
  - LAST is a dummy variable for inertia
  - x_ik is a feature vector for customer i, brand k
- This is used to compute the brand switching matrix {p_ijk}, and customer lifetime value is calculated as:
  CLV_ij = Σ_t PROF_ijt B_ijt
  - PROF is a profit measure considering discounting, price & cost (assumed known)
  - B_ijt is the probability that customer i buys brand j at time t, calculated from the stochastic matrix {p_ijk}
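The two formulas above can be sketched numerically: propagate the purchase distribution B_t through the switching matrix and accumulate discounted profit. All numbers below (a 2-brand matrix, per-period profits, the discount factor) are invented for illustration; in the paper the matrix itself comes from the logit model.

```python
# Toy CLV computation: B_t = B_{t-1} P under the switching matrix {p_ijk},
# CLV = sum_t discount^t * sum_j profit_j * B_jt. Numbers are illustrative.

def clv(p_switch, profit, start, horizon, discount=0.9):
    n = len(profit)
    b = [1.0 if j == start else 0.0 for j in range(n)]  # B_0: last brand bought
    total, d = 0.0, 1.0
    for _ in range(horizon):
        b = [sum(b[i] * p_switch[i][j] for i in range(n)) for j in range(n)]
        d *= discount
        total += d * sum(profit[j] * b[j] for j in range(n))
    return total

P = [[0.8, 0.2],    # inertia: our brand's (brand 0) buyers mostly stay
     [0.3, 0.7]]
profit = [10.0, 0.0]  # we profit only when the customer buys our brand
v0 = clv(P, profit, start=0, horizon=24)
v1 = clv(P, profit, start=1, horizon=24)
```

With inertia, a customer who last bought our brand (v0) is worth more than one who last bought the competitor's (v1), which is exactly the quantity the switching-matrix formulation captures.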

Data definitions

- Potential drivers (marketing activities) are reflected in the components of x_i:
  - Price
  - Quality of service
  - etc.
- The data to estimate the logit model is based on:
  - Expert (manager) input
  - Questionnaires of customers
  - Corporate data warehouse (not implemented in their case study...)


Results: important drivers for the airline industry?

Driver       | Coefficient | Std error | Z score (coeff/std)
Inertia      |    .849     |   .075    |   11.34
Quality      |    .441     |   .041    |   10.87
Price        |    .199     |   .020    |    9.86
Convenience  |    .609     |   .093    |    6.56
...          |    ...      |   ...     |    ...

Etc. (all factors deemed important)


What would a data miner do?

- Count more (or only) on historical data in the data warehouse
  - Variables would have a different meaning
  - Identify correlations, not necessarily drivers
- Could use the same analytic formulation, but also try alternative approaches
  - Relate LTV directly to the observed variables?
  - Model transaction sizes in addition to switching?
  - Use non-parametric modeling tools?
  - Etc.


Example 2: the segmentation approach

Common practice in marketing:
- Define static, fixed customer segments
  - Supposed to capture the true essence of customers' behaviors, needs and wants
  - Often given catchy names: "Upwardly mobile businessmen" representing the average profile
- Make marketing decisions at the segment level, based on understanding of needs and wants


A market segmentation methodology

Based on Kotler (2000), Marketing Management, Prentice-Hall.
1. Survey stage: primary research to capture motivations, attitudes, behaviors
2. Analysis stage: factor analysis, then clustering of survey data
   - Identify segments
3. Profiling stage: analyze segments and give them names

An additional stage often taken is to assign all customers to the defined segments:
4. Assignment stage: build a classification model to assign all customers to the learned segments
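The analysis stage above can be sketched as a clustering step over survey responses. A tiny k-means loop stands in for the factor-analysis-plus-clustering pipeline; the two "survey axes" and their scores are made-up illustrations, not data from the deck.

```python
# Minimal k-means over (price sensitivity, service importance) survey scores,
# standing in for the analysis stage of the segmentation methodology.

def kmeans(points, k, iters=20):
    centers = [points[i] for i in range(k)]       # naive init: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                          # assign to nearest center
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        centers = [                               # recompute centroids
            tuple(sum(q[d] for q in cl) / len(cl) for d in range(len(points[0])))
            if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    return centers, clusters

survey = [(0.1, 0.9), (0.2, 0.8), (0.9, 0.1), (0.8, 0.2)]
centers, segments = kmeans(survey, k=2)
```

The profiling stage would then name the resulting centroids (e.g. "service seekers" vs. "price hunters"), and the assignment stage would fit a classifier that maps warehouse attributes to these segment labels.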


What would a data miner do?

- Option 1: clustering
  - Replace primary research by warehouse data
  - Cluster all customers
  - Lose the "needs and wants" aspect
- Option 2: supervised learning
  - Treat each decision problem as a separate modeling task
  - E.g., find positive and negative examples for each binary decision, learn a model
  - Advantage: customized
  - Disadvantages:
    - May not have the right data to model the decisions we want to make
    - Past correlations may not be indicative of future outcomes


Comparison of approaches (recap)

Criterion                                                                | Marketing | DM
Parametric models formalize knowledge of domain and problems             |     x     |
Robust against incorrect assumptions about domain and problems           |           |  x
Actively collect the data to estimate model quantities (active learning) |     x     |
Rely on existing, abundant data in corporate Data Warehouses             |           |  x
Integrate expert input from managers and customers (wants and needs)     |     x     |
Use data to learn new, surprising patterns about customer behavior       |           |  x


An integrated approach

- Count on historical data as much as possible
  - Avoid complex parametric models
  - Let the data guide us
- Still want to integrate domain knowledge
- Analyze and understand the special aspects of marketing modeling problems
  - Importance of the long-term relationship (lifetime value, loyalty)
  - Effects of competition (customer wallet vs. customer spending)
- Modify existing, or develop new, data analytics approaches to address these problems properly


Moving beyond revenue modeling

To really understand the profitability and potential of our customers, we need to move beyond modeling their short-term revenue contribution.

- Revenue over time: Lifetime Value modeling
  - How much can we expect to gain from a customer over time?
  - Incorporates loyalty/churn and prediction of future customer revenue
  - LTV = ∫_t S(t) v(t) D(t) dt
    (S(t) is the customer survival function, v(t) the customer value over time, D(t) the discounting factor)
- Potential revenue: Customer Wallet Estimation
  - How much revenue could we be generating from this customer?
  - Incorporates competition, brand switching, etc.
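The LTV integral is easy to evaluate numerically once functional forms are assumed. The sketch below assumes exponential survival S(t) = exp(-h t), constant value v, and exponential discounting D(t) = exp(-r t); under those assumptions the closed form is v / (h + r), which the Riemann sum should approach.

```python
import math

# Numeric sketch of LTV = integral of S(t) v(t) D(t) dt under assumed forms:
# exponential survival, constant customer value, exponential discounting.

def ltv(hazard, value, rate, horizon=200.0, steps=200_000):
    dt = horizon / steps
    return sum(
        math.exp(-hazard * (i * dt)) * value * math.exp(-rate * (i * dt)) * dt
        for i in range(steps)
    )

est = ltv(hazard=0.2, value=100.0, rate=0.05)  # closed form: 100 / 0.25 = 400
```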

LTV and Wallet: beyond standard modeling

[Diagram: revenue vs. time. Actual sales up to now are covered by sales/revenue modeling; next year by sales forecasting; the farther future by LTV modeling. Potential sales, above the actual-sales curve, are the target of wallet estimation.]

Types of decision support

- Passive decision support
  - Understand more about problems and causes
  - Identify areas of need, under-performance, etc.
  - Help in making better decisions
- Active decision support
  - Model the effect of actions
  - Actively help in deciding between alternative actions
- Active decision support is typically more challenging in terms of the data needed to learn models


Depth and actionability of insights

[Chart: depth (basic concepts → real insight) vs. actionability (passive/correlation → active/causality). Revenue modeling and revenue forecasting sit low on both axes; LTV modeling and wallet estimation add depth; lever identification adds actionability; understanding the effect of potential actions on LTV and wallet attainment is both deep and active.]

The causality challenge

- Predictive models discover correlation
- For active decision support we need to identify levers to pull to affect the outcome
  - Example: linear regression. Significant t-statistics imply the coefficients have a significant association with the response, not that they are actually causing it
  - Pulling levers only works with causality
- Causality is difficult to find or prove from observational data
  - If we have knowledge about causality, we can formalize it as (say) a Bayesian network and use it in our models
  - We can get closer to causality with case-control experiments


Illustration: predictive power is not causality

- Assume we observe, for some companies:
  - X = company's marketing budget
  - Y = company's sales
  and want to understand how to affect Y by controlling X
- Assume we find that X is very predictive of Y
- Possible scenarios:
  - X → Y: causality; we have successfully identified a lever
  - Y → X: a fixed percent of revenue goes to marketing?
  - X ← Z → Y: company size Z independently determining both quantities?

Some other challenges

- Modeling the effects of new/unobserved actions
  - Critical for active support, often difficult or impossible
  - Even established actions may have been applied in a different context than our planned campaign
- Integrating expert knowledge into the process
  - Can be done formally via graphical models
- Handling data issues: matching, leaks, cleaning
  - Always critical
- Delivering solutions and results


Example: Telecom Churn Management

A cell phone company has a set of customers; some leave (churn) every month.
The goals of a Churn Management system:
- Analyze the process of churn
  - Causes
  - Dynamics
  - Effects on the company
- Design policies and actions to improve the situation
  - Marketing campaigns
  - Incentive allocation (offer new features or presents)
  - Changes in plans to contend with competition



First step: understand the current situation

- Who is likely to churn (predictive patterns)?
  - Phone features / plans
  - Usage patterns
  - Demographics
  - Tools: segmentation, classification, etc.
- Which of these patterns are causal?
  - Tools: expert knowledge, Bayesian networks, etc.
- Which causal effects are not in the data?
  - Competition, economy, etc.
- Which of these customers are profitable?
  - Short term: customer value
  - Long term: lifetime value
  - Growth potential: customer wallet


Second step: design actions

- Can we affect causal churn patterns?
  - For example, by improving customer service
- Given possible incentives and marketing actions, what effect will they have on:
  - Loyalty and relationship
  - Current customer value and wallet attainment
  - Customer lifetime value
  - Cost to the company
- How can we optimize the use of our marketing resources?
  - Identify segments we want to retain
  - Identify effective marketing actions


Survey of Useful Methodologies

- Utility-based classification*: cost-sensitive and active learning
  - Motivation: need to handle the utility of decisions and the cost of data acquisition in marketing decision problems
  - Example domains: targeted marketing, brand switch modeling
- Markov Decision Processes (MDP) and Reinforcement Learning
  - Motivation: need to consider long-term profit maximization
  - Example domain: customer lifetime value modeling
- Bayesian Networks
  - Motivation: need to address the causality vs. correlation issue; need to formalize domain knowledge about relationships in the data
  - Example domain: customer wallet estimation

* cf. the Utility-Based Data Mining workshops at KDD'05 and KDD'06


Cost-sensitive Learning for Marketing Decision Support

- The use of basic machine learning (e.g. classification and regression) in marketing decision support is well accepted
  - Example applications include targeted marketing, credit rating, and others
  - But is it the best we have to offer?
- Regression is an inherently harder problem than is required
  - One does not necessarily need to predict the business outcome, customer behavior, etc., but is merely required to make business decisions
  - Regression may fail to detect significant patterns, especially when the data is noisy
- Classification is an over-simplification
  - By mapping to classification, one loses information on the degree of goodness/badness of a business decision in the past data
- Cost-sensitive classification provides the desired middle ground
  - It simplifies the problem almost to classification and thus allows discovery of significant patterns
  - Yet it retains and exploits the information on the degree of goodness of business decisions, in a way that is motivated by utility theory


Cost-sensitive Learning a.k.a. Utility-based Classification

- In regression: given (x,r) ∈ X × R generated from a sampling distribution, find F such that
  F(x) ≈ r
  - E.g. r = profit obtained by targeting customer x
- In classification: given (x,y) ∈ X × {0,1} generated from a sampling distribution, find F such that
  F(x) ≈ y
  - E.g. y = 1 if customer x is good, 0 otherwise
- In utility-based classification: given a (stochastic) utility function U and (x,y) ∈ X × {0,1} generated from a sampling distribution, find F such that
  E[U(x,y,F(x))] is maximized (or, equivalently, E[C(x,y,F(x))] is minimized)
  - E.g. U(x,1,1) = Profit(x) = profit obtained by targeting customer x when x is indeed a good customer


Example Cost and Utility Functions

- Simple formulation (cost/benefit matrix): a misclassification cost matrix or classification utility matrix indexed by (true, predicted) class
- More realistic formulation (utility/cost dependent on individuals), where predicting "bad" means taking no action (zero utility):

Targeted marketing utility:
             | Predicted bad | Predicted good
True bad     |       0       |      -C
True good    |       0       |   Profit - C

Credit rating utility:
             | Predicted bad | Predicted good
True bad     |       0       |  - Default Amt
True good    |       0       |    Interest



Bayesian Approach with Regression

- For each example x, choose the class that minimizes the expected cost:
  i*(x) = argmin_i Σ_j P(j|x) C(x,i,j)
  Both P(j|x) and C(x,i,j) need to be estimated!
- Problem: requires conditional density estimation and regression to solve a classification problem
  - The price is high computational and sample complexity
- Merit: more flexibility and general applicability
  - Business constraints
  - Variability in fixed costs
- But is it necessary?
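Given an estimated class posterior and a cost matrix, the decision rule itself is a one-liner. The sketch below uses an invented targeted-marketing-style cost matrix (mailing costs C = 1, profit 10 on response); the numbers are illustrative, not from the deck.

```python
# Bayesian decision rule: i*(x) = argmin_i sum_j P(j|x) * C(x,i,j).

def bayes_decision(posterior, cost):
    """posterior[j] = P(j|x); cost[i][j] = cost of predicting i when truth is j."""
    expected = [sum(posterior[j] * cost[i][j] for j in range(len(posterior)))
                for i in range(len(cost))]
    return min(range(len(cost)), key=expected.__getitem__)

# Invented costs: i=0 "don't target" (forgo profit when the customer is good),
# i=1 "target" (pay C=1; net -(10 - 1) when the customer is good).
cost = [[0.0, 10.0],
        [1.0, -9.0]]
skip = bayes_decision([0.97, 0.03], cost)  # too unlikely to respond
mail = bayes_decision([0.50, 0.50], cost)  # expected profit outweighs C
```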


A classification approach: the cost-sensitive boosting algorithm GBSE [AZL 2004]

Define the expanded sample S' as:
  S' = {(x, y') | (x, y) ∈ S, y' ∈ Y}

GBSE(learner A, expanded data S', count T):
  (1) For all (x,y) ∈ S', initialize H_0(x,y) = 1/|Y|
  (2) For all (x,y) ∈ S', initialize the weight
        w_{x,y} = E_{y'~H_0}[C(x,y')] − C(x,y)
  (3) For t = 1 to T:
      (a) For all (x,y) ∈ S', update the weight
            w_{x,y} = E_{y'~H_{t−1}}[C(x,y')] − C(x,y)
          (the difference between the average cost under the current ensemble and the cost of y; the weight is updated in each iteration)
      (b) Let T' = {((x,y), I(w_{x,y} > 0)) | (x,y) ∈ S'}
      (c) Let h_t = A(T', |w|)
      (d) f_t = Stochastic(h_t)
      (e) F_t = (1 − ε) F_{t−1} + ε f_t
  (4) Output h(x) = argmax_y Σ_{t=1}^T h_t(x,y)

Gradient Boosting with Stochastic Ensembles: Illustration

- The difference between the current average cost and the cost associated with a particular label is the boosting weight
- The sign of the weight, E[C(x,y)] − C(x,y), is the training label

[Figure: for each label y, the gap between its cost C(x,y) and the ensemble's average cost E[C(x,y)] gives the training weight and label, shown at learning iterations t and t+1]
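The weight computation in the illustration is small enough to sketch directly: for each candidate label of an expanded example, subtract its cost from the ensemble's average cost. The cost values and ensemble distribution below are invented for illustration.

```python
# GBSE-style boosting weights: w_y = E_{y'~H}[C(x,y')] - C(x,y).
# A positive weight marks a better-than-average label; its sign is the
# training label and |w| the example weight.

def gbse_weights(costs, h):
    """costs[y] = C(x,y); h[y] = current ensemble probability of label y."""
    avg = sum(h[y] * costs[y] for y in range(len(costs)))
    return [avg - costs[y] for y in range(len(costs))]

w = gbse_weights(costs=[0.0, 4.0], h=[0.5, 0.5])  # ensemble currently 50/50
labels = [int(wy > 0) for wy in w]                # sign of weight = label
```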



Cost-sensitive boosting outperforms existing methods of cost-sensitive learning as well as classification and regression

Average test-set cost (SE); Bagging, AvgCost and MetaCost are the existing methods:

Data Set   | Bagging   | AvgCost | MetaCost | GBSE
Annealing  | 1059174   | 12712   | 20742    | 344
Solar      | 5403397   | 23738   | 5317390  | 4810
KDD-99     | 31942     | 428     | 499      | 21
Letter     | 1513      | 921     | 1302     | 852
Splice     | 645       | 614     | 503      | 584
Satellite  | 19010     | 1086    | 1046     | 936



Active Learning a.k.a. Query Learning

- The goal is to achieve data- and computation-efficient learning by obtaining labeled data for points of the algorithm's choosing
- Existing approaches can be classified into two main categories:
  - Algorithmic approach (cf. [Angluin])
  - Information-theoretic approach (cf. [SOS 1992])

[Diagram: the learner selects the points necessary for learning from the domain* and adds their labeled data to the training sample]
* The domain size is generally exponential


The Query by Committee Algorithm [STS 1991]

- A representative information-theoretic active learning method
- Main idea: query the points on which the agent algorithms disagree the most (to maximize information gain)
- Merit: data-efficient learning is theoretically guaranteed, subject to assumptions on the representability of the target
- Weakness: the theory requires idealized (Gibbs) agent learners, and is generally not computationally feasible

[Diagram: idealized agents (randomized algorithms) predict on randomly selected points from the input sample; the point with maximum spread of predictions (maximum uncertainty) is queried]


An Efficient Variant: Query by Bagging and Query by Boosting [AM 1998]

- These methods combine the computational approach of ensemble methods with the information-theoretic query-by-committee method
- They allow arbitrary deterministic agent algorithms

[Diagram: an agent learner A is run T times to produce hypotheses h_1, ..., h_T; the point x* on which the component hypotheses disagree most is queried]
- Bagging: re-sampling with a uniform distribution
- Boosting: weighted sampling with boosting weights
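The committee idea can be sketched in a few lines. The 1D "agent learner" (a threshold halfway between class means), the leave-one-out committee (a deterministic stand-in for the bootstrap resampling of query by bagging), and all data points below are invented for illustration.

```python
# Committee-based query selection: train agents on resamples, then query the
# pool point on which the committee's votes are most evenly split.

def fit_threshold(sample):
    """1D stand-in for the agent learner: threshold between class means."""
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2

def query_most_disagreed(labeled, pool):
    # Leave-one-out committee (deterministic stand-in for bootstrap resamples).
    committee = [fit_threshold(labeled[:i] + labeled[i + 1:])
                 for i in range(len(labeled))]
    def disagreement(x):
        votes = sum(x > th for th in committee)
        return min(votes, len(committee) - votes)  # max spread = most informative
    return max(pool, key=disagreement)

labeled = [(0.0, 0), (1.0, 0), (4.0, 1), (5.0, 1)]
picked = query_most_disagreed(labeled, pool=[0.5, 2.4, 4.8])
```

The committee agrees on the clearly-negative and clearly-positive pool points, so the point near the decision boundary is the one queried.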

Active learning can accelerate learning

- It has been observed that active learning can drastically accelerate the rate of learning (e.g. 10- to 100-fold) over passive learning
- Application to primary research (surveys) in marketing analytics is promising but has not been exploited extensively

[Plots: learning curves on the WDBC and Breast Cancer Wisconsin datasets (UCI ML repository) with C4.5]



Sequential Cost-sensitive Decision Making by Reinforcement Learning

- Cost-sensitive classification provides an adequate framework for a single marketing decision
- Real-world marketing decisions are rarely made in isolation; they are made sequentially
- We need to address the sequential dependency in decision making
  - Cost-sensitive classification maximizes E[U(x,h(x))]
  - We now wish to maximize Σ_t γ^t E[U(x_t,h(x_t))], where x_t may depend on earlier decisions
  - This is nothing but reinforcement learning, if we view x as the state:
    maximize Σ_t γ^t E[U(s_t,π(s_t))], where s_t is determined stochastically according to a transition probability determined by s_{t−1} and π(s_{t−1})


Review: Markov Decision Process (MDP)

- At any given time t, the agent is in some state s
- It takes an action a and makes a transition to the next state s', dictated by the transition probability T(s,a)
- It then receives a reward, or utility, U(s,a), which also depends on state s and action a
- The goal of a reinforcement learner in an MDP is to learn a policy π: S → A, mapping states to actions, so as to maximize the cumulative discounted reward:
  R = Σ_{t=0}^∞ γ^t U(s_t, a_t)


MDP and Reinforcement Learning provide an advanced framework for modeling customer lifetime value

- Modeling the CRM process using a Markov Decision Process (MDP)
  - The customer is in some "state" (his/her attributes) at any point in time
  - The retailer's action will move the customer into another state
  - The retailer's goal is to take a sequence of actions that guides the customer's path so as to maximize the customer's lifetime value
- Reinforcement learning produces optimized targeting rules of the form:
  - If the customer is in state "s", then take marketing action "a"
  - The customer state s is represented by the current customer attribute vector
  - The learner estimates LTV(s,a); the best policy is to choose a to maximize LTV(s,a)

[Diagram: a typical CRM process as a state graph (one timer, repeater, bargain hunter, defector, potentially valuable, valuable customer, loyal customer) with campaigns A-E moving customers between states]

MDP enables genuine lifetime value modeling, in contrast to existing approaches that use observed lifetime value

- Observed lifetime value reflects only the customer's lifetime value attained under the current marketing policy, and therefore fails to capture their potential lifetime value
- MDP-based lifetime value modeling allows modeling of lifetime value based on the optimized marketing policy (= the output of the system!)
  - The estimated (potential) lifetime value will be based on the optimal path
  - The output policy will lead the customer through the same path

[Diagram: customer A's path through the state graph under the current marketing policy vs. the optimized marketing policy]

And here is how this is possible

- The MDP enables the use of data for many customers in various stages (states) to determine the potential lifetime value of a particular customer in a particular state
- Reinforcement learning can estimate the lifetime value (function) without explicitly estimating the MDP itself
- The key lies in the value iteration procedure based on Bellman's equation:
  Q(s,a) = E[U(s,a)] + γ max_a' Q(s',a')
  (the LTV of a state = the reward now + the LTV of the best next state)
- Each rule is, in effect, trained with data corresponding to all subsequent states

[Diagram: rules a-d moving a customer from "potentially valuable" through "repeater" states to "valuable customer" and "loyal customer"]


Reinforcement Learning Methods with Function Approximation

- Value iteration (based on the Bellman equation) provides the basis for classic reinforcement learning methods like Q-learning:
  Q_0(s,a) = E[U(s,a)]
  Q_{k+1}(s,a) = E[U(s,a)] + γ max_a' Q_k(s',a')
  π(s) = argmax_a Q(s,a)
- Batch Q-learning (with function approximation) solves value iteration as iterative regression problems:
  Q_0(s,a) ← U(s,a)
  Q_{k+1}(s,a) ← (1−α) Q_k(s,a) + α (U(s,a) + γ max_a' Q_k(s',a'))
  with Q estimated using function approximation (regression)
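The tabular special case of the value-iteration update is a few lines of code. The 2-state toy MDP below (a "repeater" and a "defector", with a "mail" action that wins back the defector) is invented for illustration; real CCOM-style models replace the table with a regression over customer attributes.

```python
# Tabular value iteration: Q_{k+1}(s,a) = U(s,a) + gamma * max_a' Q_k(s',a'),
# here with deterministic transitions for brevity.

def value_iteration(reward, nxt, gamma=0.9, iters=100):
    n_s, n_a = len(reward), len(reward[0])
    q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        q = [[reward[s][a] + gamma * max(q[nxt[s][a]]) for a in range(n_a)]
             for s in range(n_s)]
    return q

# Invented toy: state 0 = "repeater", state 1 = "defector";
# action 0 = no mail, action 1 = mail (which wins back the defector).
reward = [[1.0, 2.0], [0.0, 0.0]]
nxt    = [[0,   0  ], [1,   0  ]]
q = value_iteration(reward, nxt)
policy = [max(range(2), key=lambda a: q[s][a]) for s in range(2)]
```

Note how the defector state gets a high Q-value for mailing even though its immediate reward is zero: the Bellman backup propagates the lifetime value of the states that follow.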

Lifetime value modeling based on reinforcement learning can achieve greater long-term profits than the traditional approach

- The graph below plots profits per campaign obtained in monthly campaigns over 2 years (in an empirical evaluation using benchmark data, the KDD Cup 98 data)

[Plot: profits per campaign (0-80,000) by campaign number for the single-event policy vs. CCOM; the output policy of the MDP approach (CCOM) invests in initial campaigns to yield greater long-term profits]


Bayesian Network a.k.a. Graphical Model

- A Bayesian network is a directed acyclic graphical model and defines a probability model
- A simple example, over Economy (E), Marketing (M), Competition (C) and Revenue (R):

  P(M,E,C,R) = P(E) P(M|E) P(C|E) P(R|M,C)

[Diagram: E has edges to M and C; M and C have edges to R. Example conditional probability tables, each row giving P(true) / P(false):
  P(E): 0.3 / 0.7
  P(M|E): 0.3 / 0.7 and 0.9 / 0.1 (rows for E true / false)
  P(C|E): 0.4 / 0.6 and 0.7 / 0.3 (rows for E true / false)
  P(R|M,C): 0.3 / 0.7 for (F,F), 0.9 / 0.1 for (T,F), 0.2 / 0.8 for (F,T), 0.6 / 0.4 for (T,T)]
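The factorization can be evaluated directly from the conditional tables. The condition-to-row mapping of the extracted tables is an assumption (row labels did not survive extraction), so treat the numbers as illustrative.

```python
# Direct evaluation of P(M,E,C,R) = P(E) P(M|E) P(C|E) P(R|M,C) from CPTs.
# The assignment of CPT rows to conditions is assumed, not from the deck.

p_e = {True: 0.3, False: 0.7}
p_m_e = {True: {True: 0.3, False: 0.7}, False: {True: 0.9, False: 0.1}}
p_c_e = {True: {True: 0.4, False: 0.6}, False: {True: 0.7, False: 0.3}}
p_r_mc = {(False, False): {True: 0.3, False: 0.7},
          (True, False):  {True: 0.9, False: 0.1},
          (False, True):  {True: 0.2, False: 0.8},
          (True, True):   {True: 0.6, False: 0.4}}

def joint(m, e, c, r):
    return p_e[e] * p_m_e[e][m] * p_c_e[e][c] * p_r_mc[(m, c)][r]

bools = (True, False)
total = sum(joint(m, e, c, r) for m in bools for e in bools
            for c in bools for r in bools)   # sanity check: should be 1
```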

Bayesian Network as a General Unifying Framework

- Bayesian networks provide a general framework that subsumes numerous known classes of probabilistic models, e.g.
  - Naive Bayes classification
  - Clustering (mixture models)
  - Autoregressive models
  - Hidden Markov models, etc.
- Bayesian networks provide a framework for discussing modeling, inference, causality, hidden variables, etc.

[Diagrams: Naive Bayes classification (class node pointing to variables 1..N), clustering/mixture (unobserved class node pointing to variables), hidden Markov model (a chain of states emitting symbols)]



Bayesian Network and Causality

- Causality is not necessarily implied by the edge direction
- In the example network, P(M,E,C,R) = P(E) P(M|E) P(C|E) P(R|M,C); restricted to M, E and C, this is actually ambiguous between:
  - M ← E → C:  P(M,E,C) = P(E) P(M|E) P(C|E)
  - M → E → C:  P(M,E,C) = P(M) P(E|M) P(C|E)
  - M ← E ← C:  P(M,E,C) = P(C) P(E|C) P(M|E)


Causal Network and Causal Pattern

- Causal network: a directed graph in which the direction of an edge means causality
- Causal pattern: an equivalence class of causal networks

[Diagram: a causal network over Economy, Marketing, Competition and Revenue, and the corresponding causal pattern, in which some edges are left undirected. The pattern shows that the causal relationships between E, M and C are ambiguous.]


Edge Orientation in Bayesian/Causal Networks

[Figure: edge-orientation rules, from P. Spirtes, C. Glymour, and R. Scheines (2000)]


Inferring the Structure of a Bayesian/Causal Network from Data

If M ⊥ R | E, the causal structure cannot be determined from the data: the following three networks are observationally equivalent:
  M ← E → R:  P(M,E,R) = P(E) P(M|E) P(R|E)
  M → E → R:  P(M,E,R) = P(M) P(E|M) P(R|E)
  M ← E ← R:  P(M,E,R) = P(R) P(E|R) P(M|E)

If Marketing and Competition are marginally independent, the causal structure can be determined from the data:
  M → R ← C:  P(M,C,R) = P(M) P(C) P(R|M,C)
  It can be inferred that Marketing can be a lever for controlling Revenue!


Estimation and Inference with Bayesian Networks

- Inferring causal structure from data
  - Sometimes possible, but in general not
- Bayesian network structure learning from data
  - Known to be intractable for general classes
  - Even robustly estimating polytrees is NP-complete
- Parameter estimation from data, given the structure
  - Efficiently solvable for many model classes
- Inference, given the model
  - Exact inference is known to be NP-complete for the sub-class including undirected cycles
  - Efficiently solvable for tree structures and many models used in practice
- Latent variable estimation, given the structure
  - Local-optimum estimation is often possible via EM algorithms
- Given these facts, determining the network structure using domain knowledge and then using it for parameter estimation and inference is common practice


Lifetime Value Modeling and Cross-Channel Optimized Marketing (CCOM)

- Optimizes targeted marketing across multiple channels for lifetime value maximization
- Combines scalable data mining and reinforcement learning methods to realize this unique capability

[Diagram: marketing channels (Web, Kiosk, Direct Mail, Call Center, Store) generating revenue]

CCOM Pilot Project with Saks Fifth Avenue

- Business problem addressed: optimizing direct mailing to maximize lifetime revenue at the store (and other channels)
- Provided a solution for the cross-channel challenge: no explicit link between marketing actions in one channel and revenue in another
- The CCOM mailing policy was shown to achieve a 7-8% increase in expected revenue in the store (in laboratory experiments)!

[Diagram: the CCOM-pilot business problem — direct mail driving store revenue]


Some Example Features

(numbers are correlations with the action and reward variables)

Demographic features:
Feature                    | Description                             | action | reward
FULL_LINE_STORE_OF_RES.    | If a full-line store exists in the area |  0.018 |  0.004
NON_FL_STORE_OF_RES.       | If a non-full-line store exists in area |  0.012 | -0.004

Transaction features (concerning divisions relevant to the current campaign):
CUR_DIV_PURCHASE_AMT_1M    | Pur amt in last month in curr div       |  0.065 |  0.090
CUR_DIV_PURCHASE_AMT_2_3M  | Pur amt in 2-3 months in curr div       |  0.099 |  0.080
CUR_DIV_PURCHASE_AMT_4_6M  | Pur amt in 4-6 months in curr div       |  0.133 |  0.091
CUR_DIV_PURCHASE_AMT_1Y    | Pur amt in last year in curr div        |  0.162 |  0.128
CUR_DIV_PURCHASE_AMT_TOT   | Total pur amt in current division       |  0.153 |  0.147

Promotion history features (on divisions relevant to the current campaign):
CUR_DIV_N_CATS_1M          | Num cat sent last month in curr div     |  0.294 |  0.028
CUR_DIV_N_CATS_2_3M        | Num cat sent 2-3 months ago in curr div |  0.260 |  0.025
CUR_DIV_N_CATS_4_6M        | Num cat sent 4-6 months ago in curr div |  0.158 |  0.062
CUR_DIV_N_CATS_TOT         | Total num cat sent in curr div to date  |  0.254 |  0.062

Control variable:
ACTION                     | To mail or not to mail                  |  1.000 |  0.008

Target (response) variable:
REWARD                     | Expected cumulative profits             |  0.008 |  1.000


The Cross-Channel Challenge and Solution

- The challenge: no explicit link between actions in one channel (mailing) and rewards in another (revenue)
  - Very low correlation observed between actions and responses
  - Other factors determining lifetime value may dominate over the control variable (the marketing action) in the estimation of expected value
  - The obtained models can be independent of the action and give rise to useless rules!
- The cross-channel solution: learn the relative advantage of competing actions!

[Diagram: under the standard method, the function approximation of the value of actions a1, a2 in states s1, s2 can wash out the action effect; the proposed method fits the per-state difference between actions instead]

The Learning Method

- Definition of advantage:
  A(s,a) := (1/Δt) (Q(s,a) − max_a' Q(s,a'))
- Advantage updating procedure [Baird 94] (learning rates written here as α, β, ω; the original symbols were lost in extraction):
  Repeat:
    1. Learn
       1.1. A(s,a) := (1−α) A(s,a) + α (A_max(s) + (R(s,a) + γ^Δt V(s') − V(s)) / Δt)
       1.2. Use regression to estimate A(s,a)
       1.3. V(s) := (1−β) V(s) + β (V(s) + (A_max-new(s) − A_max-old(s)) / α)
    2. Normalize
       A(s,a) := (1−ω) A(s,a) + ω (A(s,a) − A_max(s))
- Modifications:
  1. Initialization with the empirical lifetime value
  2. Batch learning with optional function approximation



Evaluation Results

- Obtained a policy with 7-8% policy advantage, i.e. a 7-8% increase in expected revenue (for the 1.6 million customers considered)
- Significant policy advantage observed with a small number of iterations
- CCOM evaluates a sequence of models and outputs the best model
- The mailing policy was constrained to mail the same number of catalogues in each campaign as last year

[Plots: policy advantage (percent, roughly -4 to 10) vs. learning iterations for two typical runs (versions 1 and 2)]


Evaluation Method

- Challenge in evaluation: need to evaluate the new policy using data collected by the existing (sampling) policy
- Solution: use bias-corrected estimation of the policy advantage, computed on data collected by the sampling policy
- Definition of policy advantage (π denotes the new policy, μ the sampling policy):

  (Discrete-time) advantage: A_π(s,a) := Q_π(s,a) − max_a' Q_π(s,a')

  Policy advantage: A_μ(π) := E_{s~μ}[ E_{a~π}[ A_π(s,a) ] ]

- Estimating the policy advantage with bias-corrected sampling:

  Â_μ(π) := Ê_{(s,a)~μ}[ (π(a|s) / μ(a|s)) A_π(s,a) ]
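A minimal numeric sketch of this bias-corrected estimator (the state, advantage values and policies are made-up for illustration): each action logged under the sampling policy μ is reweighted by π(a|s)/μ(a|s), so the weighted average is an unbiased estimate of the new policy π's advantage.

```python
import numpy as np

# Toy setup: one state, two actions, with known advantages.
advantage = {"mail": 2.0, "no_mail": -1.0}
mu = {"mail": 0.5, "no_mail": 0.5}   # sampling (logging) policy
pi = {"mail": 0.8, "no_mail": 0.2}   # new policy to evaluate

# Logged actions drawn under mu; counts match mu exactly so the
# illustration is deterministic.
logged = ["mail"] * 50 + ["no_mail"] * 50

# Bias-corrected (importance-weighted) estimate of the policy advantage
weights = np.array([pi[a] / mu[a] for a in logged])
values = np.array([advantage[a] for a in logged])
estimate = float(np.mean(weights * values))

# Direct computation of E_{a~pi}[A(s,a)] for comparison
truth = sum(pi[a] * advantage[a] for a in advantage)
```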


Combination of reinforcement learning (MDP) with predictive data mining enables automatic generation of trigger-based marketing targeting rules:
- Optimized with respect to the customer's potential lifetime value
- Stated in a simple if-then style, which supports flexibility and compatibility
- Refined to make reference to detailed customer attributes, and hence well suited to event- and trigger-based marketing

This is made possible by:
- Representing the states in the MDP by customers' attribute vectors
- Combining reinforcement learning with predictive data mining to estimate lifetime value as a function of customer attributes and marketing actions

[Figure: an example marketing targeting rule output by the CCOM system.]

Some examples of rules output by CCOM

- Avoid saturation effects
  Interpretation: if a customer has spent in the current division but enough catalogues have already been sent, then don't mail.
- Differentiate between customers who may be near saturation and those who are not
  Interpretation: if a customer has spent in the current division and has received moderately many relevant catalogues, then mail.
- Invest in a customer until it is known not to be worth it
  Interpretation: if a customer has spent significantly in the past and yet has not spent much in the current division (product group), then don't mail.


CCOM is generically applicable by mapping physical data to this model

CCOM - Logical Data Model (*developed with CBO)

- Period: Period Identifier; Period Duration
- Customer: Customer Identifier; First Name; Last Name; Age; Gender
- Customer Profile History: Customer Identifier; Profile History Date; Period Identifier; Product Category Identifier; Channel Identifier; Aggregated Count of Events; Aggregated Revenue; Aggregated Profit
- Transaction: Customer Identifier; Transaction Date; Product Category Identifier; Event Identifier; Channel Identifier; Transaction Revenue; Transaction Profit
- Customer Loyalty Level History: Customer Identifier; Loyalty Level Start Date; Loyalty Level End Date; Loyalty Level
- Customer Marketing Action: Event Identifier; Customer Identifier; Marketing Action Date; Marketing Action
- Product Category: Product Category Identifier; Product Category Description
- Channel: Channel Identifier; Channel Description
- Marketing Event: Event Identifier; Channel Identifier; Event Date; Event Category Description; Fixed Cost
- Event - Product Category: Event Identifier; Product Category Identifier; Weight
- CCOM output models (optional entities): Lifetime Value Model and Marketing Policy Model, each with Model Identifier; Model Type; Model


Customer Wallet and Opportunity Estimation: Analytical Approaches and Applications

Saharon Rosset, Claudia Perlich, Rick Lawrence
IBM T. J. Watson Research Center


Outline

- Wallet estimation: problems and solutions
  - The different wallet definitions
  - How can we evaluate wallet models?
  - Modeling approaches
  - Empirical evaluation
- MAP (Market Alignment Program)
  - Description of application and goals
  - The interview process and the feedback loop
  - Evaluation of Wallet models' performance in MAP


What is Wallet (AKA Opportunity)?

- Total amount of money a company can spend on a certain category of products.

  IBM sales ≤ IT wallet ≤ Company revenue

[Figure: nested circles: IBM Sales inside IT Wallet inside Company Revenue.]


Why Are We Interested in Wallet?
(Applications: OnTarget, MAP)

- Better evaluation of growth potential by combining wallet estimates and past sales history
  - Enables focus on high-wallet, low-share-of-wallet customers
- Intelligent marketing using wallet estimates for sub-categories, e.g., software, hardware
- Evaluating success of sales personnel and sales channels by the share-of-wallet they attain
  - Making resource assignment decisions


Wallet Modeling Problem

- Given:
  - customer firmographics x (from D&B): industry, employee number, company type, etc.
  - customer revenue r
  - IBM relationship variables z: historical sales by product
  - IBM sales s
- Goal: model customer wallet w, then use it to predict present/future wallets
- No direct training data on w, or information about its distribution!


Historical Approaches

- Top down: the approach used by IBM Market Intelligence in North America (called ITEM)
  - Use econometric models to assign total opportunity to a segment (e.g., industry and geography)
  - Assign to companies in the segment proportionally to their size (e.g., D&B employee counts)
- Bottom up: learn a model for individual companies
  - Get true wallet values through surveys or appropriate data repositories (these exist, e.g., for credit cards)
- Many issues with both approaches (won't go into detail)
  - We would like a predictive approach from raw data


Agenda

- Introduction and analytical issues
  - Different wallet definitions
- How can we evaluate wallet models?
  - The quantile regression loss function
- Modeling approaches and results:
  - Nearest neighbor approach
  - Quantile regression
  - Model decomposition approach


Multiple Wallet Definitions

- TOTAL: total customer available budget in the relevant area (e.g., total IT)
  - Can we really hope to attain all of it?
- SERVED: total customer spending on IT products covered by IBM
  - A better definition for our marketing purposes
- REALISTIC: IBM spending of the best similar customers
  - This can be concretely defined as a high percentile of P(IBM revenue | customer attributes)

  REALISTIC ≤ SERVED ≤ TOTAL

[Figure: nested sets: Realistic inside Served Wallet inside Total Wallet.]


REALISTIC Wallet: Percentile of Conditional

- Distribution of IBM sales to the customer given customer attributes: s | r, x, z ~ f_{r,x,z}
- E.g., under the standard linear regression assumption:

  s = βx + γr + δz + ε,   ε ~ N(0, σ²)

- What we are looking for is the (say) 90th percentile of this distribution

[Figure: conditional density of s, with the mean E(s|r,x,z) and the Realistic (90th-percentile) point marked.]
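Under the linear-Gaussian assumption above, the 90th percentile has a closed form; a sketch (the coefficient symbols β, γ, δ are notational placeholders matching the regression equation):

```latex
w_{\mathrm{REALISTIC}} \;=\; q_{0.9}(s \mid r,x,z)
  \;=\; \beta x + \gamma r + \delta z + \sigma\,\Phi^{-1}(0.9)
  \;\approx\; E(s \mid r,x,z) + 1.28\,\sigma
```

where Φ⁻¹ is the standard normal quantile function: the REALISTIC wallet sits about 1.28 residual standard deviations above the regression mean.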


Agenda

- Introduction and analytical issues
  - Different wallet definitions
- How can we evaluate wallet models?
  - The quantile regression loss function
- Modeling approaches and results:
  - Nearest neighbor approach
  - Quantile regression approach
  - Model decomposition approach


Traditional Approaches to Model Evaluation

- Evaluate models based on surveys
  - Cost and reliability issues
- Evaluate models based on high-level performance indicators:
  - Do the wallet numbers sum up to numbers that make sense at the segment level (e.g., compared to macroeconomic models)?
  - Does the distribution of differences between predicted Wallet and actual IBM Sales and/or Company Revenue make sense? In particular, are the percentages bigger/smaller as we expect?
- Problem: no observation-level evaluation


The Quantile Loss Function

- Our REALISTIC wallet definition calls for estimating the p-th quantile of P(s | data).
- Can we devise a loss function which is optimized in expectation when we succeed?
- Answer: yes, the quantile loss function for quantile p:

  L_p(y, ŷ) = p (y − ŷ)        if y > ŷ
              (1 − p) (ŷ − y)   otherwise

- This loss function is optimized in expectation when we correctly predict REALISTIC:

  argmin_ŷ E[ L_p(y, ŷ) | x ] = p-th quantile of P(y | x)
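A small numeric check of this optimality property (toy data, nothing IBM-specific): across constant predictions, the average quantile loss for p = 0.9 is minimized at the empirical 90th percentile.

```python
import numpy as np

def quantile_loss(y, y_hat, p):
    """Pinball loss L_p: underprediction costs p per unit,
    overprediction costs (1 - p) per unit."""
    d = y - y_hat
    return np.where(d > 0, p * d, (p - 1) * d)

y = np.arange(1, 101)            # toy outcomes 1..100
p = 0.9
candidates = np.arange(1, 101)   # constant predictions to compare
avg_loss = [quantile_loss(y, c, p).mean() for c in candidates]
best = candidates[int(np.argmin(avg_loss))]
# best lands at the empirical 90th percentile (90 and 91 tie exactly here)
```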


Some Quantile Loss Functions

[Figure: quantile loss as a function of the residual (observed − predicted), for p = 0.8 (asymmetric: slope p for positive residuals, 1 − p for negative) and p = 0.5 (symmetric, proportional to absolute loss).]


Which Wallet Definitions to Model?

- We are generally interested in modeling REALISTIC and SERVED wallets
  - TOTAL wallets are not of real marketing interest
- For REALISTIC (or opportunity) we have multiple modeling approaches:
  - Quantile k-nearest neighbors
  - Quantile regression approaches: linear quantile regression, tree-based quantile regression, kernel quantile regression, quanting, ...
- For SERVED we have developed a graphical modeling approach (will not discuss here)


Modeling REALISTIC Wallets

- REALISTIC defines wallet as the 90th percentile of the conditional distribution of spending given customer attributes
  - Implies some 10% of customers are spending their full wallet with IBM
- Two obvious ways to get at the 90th percentile:
  - Estimate the conditional distribution by integrating over a neighborhood of the specific customer, then take the 90th percentile of spending in that neighborhood
  - Create a global model for the 90th percentile, building regression models with the quantile loss function

K-Nearest Neighbors

- Distance metric: Euclidean distance on firmographics and past IBM sales, with normalization and an industry match requirement
- Neighborhood size (k): has a significant effect on prediction quality
- Prediction: a quantile of IBM sales among the firms in the neighborhood

[Figure: the universe of IBM customers with D&B information, plotted along industry, employees and revenue; the neighborhood of target company i is highlighted, and the wallet estimate is read off a quantile of the neighbors' IBM sales histogram.]
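The neighborhood approach can be sketched in a few lines (the firmographic data, sales figures and the helper name knn_wallet are all illustrative):

```python
import numpy as np

# Toy firmographics: [log(employees), log(revenue)] per company,
# plus each company's observed IBM sales (made-up numbers).
features = np.array([
    [1.0, 1.0],
    [1.1, 0.9],
    [0.9, 1.2],
    [5.0, 5.0],
    [5.1, 4.9],
])
ibm_sales = np.array([10.0, 20.0, 30.0, 500.0, 600.0])

def knn_wallet(target, k=3, q=90):
    """REALISTIC-style estimate: the q-th percentile of IBM sales among
    the k nearest companies (Euclidean distance on normalized features)."""
    mu, sd = features.mean(axis=0), features.std(axis=0)
    z = (features - mu) / sd                 # normalize firmographics
    t = (np.asarray(target) - mu) / sd
    dist = np.sqrt(((z - t) ** 2).sum(axis=1))
    neighbors = np.argsort(dist)[:k]
    return float(np.percentile(ibm_sales[neighbors], q))

est = knn_wallet([1.0, 1.05])  # neighbors are the three small firms
```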


Quantile Regression

- Traditional regression: estimate the conditional expected value by minimizing the sum of squares:

  min_β Σ_{i=1..n} ( y_i − f(x_i, β) )²

- Quantile regression: minimize the quantile loss:

  min_β Σ_{i=1..n} L_p( y_i, f(x_i, β) )

  with L_p(y, ŷ) = p (y − ŷ) if y > ŷ, and (1 − p) (ŷ − y) otherwise

- Implementation: assume a linear function y = βx + ε; the quantile regression loss function then admits a solution by linear programming
  - Linear quantile regression package (quantreg) in R (Koenker, 2001)
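The linear-programming solution is what R's quantreg computes exactly; as a self-contained illustration, the same pinball objective can also be minimized by plain subgradient descent (synthetic data and a hand-rolled optimizer, sketching the objective rather than the package's algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 300, 0.8
x = rng.uniform(0.0, 1.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, n)  # synthetic monetary target
X = np.column_stack([np.ones(n), x])          # intercept + slope

beta = np.zeros(2)
beta_sum = np.zeros(2)
for t in range(5000):
    resid = y - X @ beta
    # Subgradient of the pinball loss: p above the fitted line, p - 1 below
    g = np.where(resid > 0, p, p - 1.0)
    beta += (0.5 / np.sqrt(t + 1.0)) * (X.T @ g) / n  # descent step
    if t >= 2500:
        beta_sum += beta          # average late iterates for stability
beta_avg = beta_sum / 2500

# At the 0.8-quantile fit, roughly 80% of points lie on or below the line
coverage = float(np.mean(y <= X @ beta_avg))
```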


Quantile Regression Tree

- Motivation:
  - Identify a locally optimal definition of neighborhood
  - Inherently nonlinear
- Adjustments of M5/CART for quantile prediction:
  - Predict the percentile rather than the mean of the leaf
  - Splitting/pruning criteria do not require adjustment

[Figure: example tree with splits "Industry = Banking?", "Sales < 100K?", "IBM Rev 2003 > 10K?"; each leaf holds a histogram of IBM sales, and the wallet estimate is a high quantile of the leaf.]
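The M5/CART adjustment above amounts to changing only the leaf prediction; a minimal depth-one sketch (toy data; the split search uses the usual sum-of-squared-errors criterion):

```python
import numpy as np

# Toy data: one numeric feature, a monetary target with two clear regimes
x = np.array([1, 2, 3, 4, 11, 12, 13, 14], dtype=float)
y = np.array([1, 2, 3, 4, 101, 102, 103, 104], dtype=float)

def best_split(x, y):
    """Standard CART split search: the splitting criterion (SSE) is
    unchanged; only the leaf prediction changes for quantile trees."""
    xs = np.unique(x)
    thresholds = (xs[:-1] + xs[1:]) / 2
    def sse(v):
        return float(((v - v.mean()) ** 2).sum()) if len(v) else 0.0
    scores = [sse(y[x <= t]) + sse(y[x > t]) for t in thresholds]
    return float(thresholds[int(np.argmin(scores))])

def quantile_tree_predict(x, y, x_new, q=90):
    t = best_split(x, y)
    leaf = y[x <= t] if x_new <= t else y[x > t]
    return float(np.percentile(leaf, q))   # percentile, not mean

lo = quantile_tree_predict(x, y, 2.0)    # left leaf
hi = quantile_tree_predict(x, y, 12.0)   # right leaf
```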


Empirical Evaluation: Quantile Loss

- Setup
  - 4 domains with a monetary dependent variable, including direct mailing, housing prices, income data, IBM sales
  - Performance on a test set in terms of quantile loss
  - Approaches: kNN, linear quantile regression, quantile tree, quanting
- Baselines
  - Constant model
  - Traditional regression models for expected values (for skewed distributions, the expected value is actually a high quantile)


Performance on Quantile Loss

- Conclusions
  - If there is a time-lagged variable, the linear quantile model is best
  - Quanting (using decision trees) and the quantile tree perform comparably
  - Generalized kNN is not competitive


Residuals for Quantile Regression

Total positive holdout residuals: 90.05% (18009/20000), closely matching the 90% expected when the 0.9 quantile is modeled correctly.


Market Alignment Project (MAP): Background

- MAP objective:
  - Optimize the allocation of the sales force
  - Focus on customers with growth potential
  - Set evaluation baselines for sales personnel
- MAP components:
  - Web interface with customer information
  - Analytical component: wallet estimates
  - Workshops with sales personnel to review and correct the wallet predictions
  - Shift of resources towards customers with lower wallet share


The MAP tool captures expert feedback from the Client Facing teams
(MAP interview process: all Integrated and Aligned Coverages)

[Diagram: transaction data and D&B data feed a data-integration step and the wallet models, which produce predicted opportunity; the MAP interview team delivers and captures insight with the Client Facing Unit (CFU) team through a web interface; the resulting expert-validated opportunity flows into post-processing, analytics and validation, and resource assignments.]

The objective here is to use expert feedback (i.e., validated revenue opportunity) from last year's workshops to evaluate our latest opportunity models.


MAP workshops overview

- Calculated 2005 opportunity using a naive k-NN approach
- 2005 MAP workshops:
  - Displayed opportunity by brand
  - Expert can accept or alter the opportunity
- Selected 3 brands for evaluation: DB2, Rational, Tivoli
- Built ~100 models for each brand using different approaches
- Compared expert opportunity to model prediction
  - Error measures: absolute, squared
  - Scales: original, log, root


Displayed Model Predictions of kNN

- Distance metric:
  - Identical industry
  - Euclidean distance on size (revenue or employees)
- Neighborhood size: 20
- Prediction: median of the non-zero neighbors (alternatives: max, percentile)
- Post-processing: floor the prediction by the max of the last 3 years' revenue

[Figure: the universe of IBM customers with D&B information, plotted along industry, employees and revenue, with the neighborhood of target company i highlighted.]


Expert Feedback (Log Scale) to Original Model (DB2)

[Figure: scatter plot of expert feedback vs. model opportunity, both on log scale. Experts accepted the opportunity for 45% of accounts; changed it for 40% (increase: 17%, decrease: 23%); and reduced the opportunity to 0 for the remaining 15%.]


Observations

- Many accounts are set to zero for external reasons
  - Exclude from evaluation, since no model can predict this
- Exponential distribution of opportunities
  - Residual-based evaluation on the original scale suffers from huge outliers
- Experts seem to make percentage adjustments
  - Consider log-scale evaluation in addition to the original scale, with root as an intermediate
  - Suspect a strong anchoring bias: 45% of opportunities were not touched


Evaluation Measures

- Different scales to avoid outlier artifacts:
  - Original: e = model − expert
  - Root: e = root(model) − root(expert)
  - Log: e = log(model) − log(expert)
- Statistics on the distribution of the errors:
  - Mean of e²
  - Mean of |e|
- Total of 6 criteria (2 statistics on each of 3 scales)
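The six criteria are cheap to compute; a sketch (the function name and dictionary keys are illustrative, and the log scale assumes strictly positive opportunities):

```python
import numpy as np

def evaluation_criteria(model, expert):
    """Mean squared and mean absolute model-vs-expert error on the
    original, root, and log scales: 2 statistics x 3 scales = 6 criteria."""
    model = np.asarray(model, dtype=float)
    expert = np.asarray(expert, dtype=float)
    out = {}
    for scale, f in [("original", lambda v: v),
                     ("root", np.sqrt),
                     ("log", np.log)]:       # log assumes positive values
        e = f(model) - f(expert)
        out[scale + "_mse"] = float(np.mean(e ** 2))
        out[scale + "_mae"] = float(np.mean(np.abs(e)))
    return out

crit = evaluation_criteria([100.0, 400.0], [81.0, 400.0])
```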


Model comparison results: count how often a model scores within the top 10 and top 20 for each of the 6 measures

[Table: per brand (Rational, DB2, Tivoli), top-10/top-20 counts for each model: Displayed Model (kNN; the anchoring baseline), Max 03-05 Revenue, Linear Quantile 0.8 (the best performer overall), Regression Tree, kNN 50 + flooring, Decomposition Center, Quantile Tree 0.8.]


Conclusions

- kNN performs very well after flooring, but is typically low prior to flooring
- Empirically, the linear 80th-quantile model performs consistently well (flooring has a minor effect)
- Experts are strongly influenced by the displayed opportunity (and the displayed revenue of previous years)
- Models without last year's revenue don't perform well

Use linear quantile regression with q = 0.8 in MAP 06


Ongoing and Future Work

- Extend MAP to other geographies
- Quantile estimation performance of different methods as a function of the quantile
- Performance as a function of the shape of the conditional distribution of the dependent variables
- Theoretical generalization of the decomposition approach


A graphical model approach

[Diagram: the unobserved Company IT Wallet node links company firmographics on one side with IT spend with IBM and the historical relationship with IBM on the other.]

- Wallet is unobserved; all other variables are observed
- The two families of variables (firmographics and IBM relationship) are conditionally independent given wallet
- We develop inference procedures and demonstrate them
  - In some cases this leads to simple linear regression as ML inference on wallet
- See poster in this conference: Merugu, Rosset and Perlich, "A new multi-view learning approach with an application to customer wallet estimation."


References

- Marketing Science
  - R. Rust, K. Lemon and V. Zeithaml, "Return on Marketing: Using Customer Equity to Focus Marketing Strategy," J. of Marketing, 2004.
  - P. Kotler, Marketing Management, Millennium Ed., Prentice-Hall, 2000.
- Cost-sensitive Learning
  - P. Domingos, "MetaCost: A General Method for Making Classifiers Cost-Sensitive," The 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.
  - N. Abe, B. Zadrozny and J. Langford, "An Iterative Method for Multi-class Cost-sensitive Learning," The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2004.
- Active Learning
  - H.S. Seung, M. Opper and H. Sompolinsky, "Query by Committee," Proceedings of the Fifth Workshop on Computational Learning Theory, 1992.
  - D. Angluin, "Queries and Concept Learning," Machine Learning, 1988.
- MDP and Reinforcement Learning
  - R. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
  - L. P. Kaelbling, M. L. Littman and A. W. Moore, "Reinforcement Learning: A Survey," Journal of Artificial Intelligence Research, 1996.


References

- Bayesian Networks and Causal Networks
  - K. Murphy, "A Brief Introduction to Bayesian Networks and Graphical Models," http://www.cs.berkeley.edu/~murphyk/Bayes/bayes.html
  - D. Heckerman, "A Tutorial on Learning with Bayesian Networks," Microsoft Research MSR-TR-95-06, March 1995.
  - J. Pearl, Causality: Models, Reasoning, and Inference, Cambridge University Press, 2000.
  - P. Spirtes, C. Glymour and R. Scheines, Causation, Prediction, and Search, 2nd Edition, MIT Press, 2000.
- Case Study: Customer Wallet Estimation
  - S. Rosset, C. Perlich, B. Zadrozny, S. Merugu, S. Weiss and R. Lawrence, "Customer Wallet Estimation," 1st NYU Workshop on CRM and Data Mining, 2005.
  - S. Merugu, S. Rosset and C. Perlich, "A New Multi-View Regression Method with an Application to Customer Wallet Estimation," The Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2006.


Thank you!
srosset@us.ibm.com
