
IBM Research

Data Analytics for Marketing


Decision Support: Introduction and a Wallet Estimation Case Study
Saharon Rosset
IBM T.J. Watson Research Center

2006 IBM Corporation


Two parts:
1. Introduction to the use of Data Mining in marketing applications
   (Collaborator: Naoki Abe)
   - What are the problems we address?
   - Comparison of Data Mining and Marketing Science approaches
   - Some of the challenges for Data Mining approaches
2. Customer Wallet and Opportunity Estimation: Analytical Approaches and Applications
   (Collaborators: Claudia Perlich, Rick Lawrence, Srujana Merugu and others)
   - Define the problem
   - Describe analytic solutions
   - Demonstrate performance in a real application


The grand challenges of marketing

- Maximize profits (duh)
- Initiate, maintain and improve relationships with customers:
  - Acquire customers
  - Create loyalty, prevent churn
  - Improve profitability (lifetime value)
- Optimize use of resources:
  - Sales channels
  - Advertising
  - Customer targeting


Some of the concrete modeling problems

- Channel optimization
- Cross/up-sell (customer targeting)
- New customer acquisition
- Churn analysis
- Product life-cycle analysis
- Customer lifetime value modeling
  - Effect of marketing actions on LTV?
- Advertising allocation
- RFM (Recency, Frequency, Monetary) analysis
- ...


Data analytics for decision support: grand challenge

Beyond modeling the current situation, we need to offer insight about the effect or potential of possible actions and decisions:
- How would different channels / incentives affect the LTV of our customers?
- How much more money could this customer be spending with us (customer wallet)?
- Can we predict the effects of new actions that have never been tried in historical data? What if they have been tried only on a non-representative set?
- Can we be confident our results are actionable? Can we differentiate causality from correlation in our models?


Typical marketing analytics vs. data mining

- CRM analytics:
  - Relies on primary research (= surveys) to understand needs and wants
  - Relies on (more or less) detailed models of customer behavior
    - Usually parametric statistical models
    - Often estimates customer-level parameters
- Data mining:
  - Typically relies on data in a Data Warehouse / Mart
  - Uses a minimum of parametric assumptions
  - Often attempts to fit the problem into a standard modeling framework: classification, regression, clustering...


Comparison of approaches

Criterion                                                                | Marketing | DM
Parametric models formalize knowledge of domain and problems             |     x     |
Robust against incorrect assumptions about domain and problems           |           |  x
Actively collect the data to estimate model quantities (active learning) |     x     |
Rely on existing, abundant data in corporate Data Warehouses             |           |  x
Integrate expert input from managers and customers (wants and needs)     |     x     |
Use data to learn new, surprising patterns about customer behavior       |           |  x


Example 1: modeling and improving LTV

Rust, Lemon and Zeithaml (2004), "Return on Marketing: Using Customer Equity to Focus Marketing Strategy", Journal of Marketing

- Modeling customer equity / lifetime value
  - Combines several previous approaches
  - Models the brand switching matrix as a function of customer preference, history and product properties
  - Wants to identify drivers of satisfaction (levers)
  - Calculates the effect (ROI) of marketing actions pulling levers
- Mostly relies on primary research collected specifically for this study
  - Interviews with managers
  - Survey of consumer preferences


Simplified version of the paper's business model

[Diagram: marketing investment → pulling levers → increased equity; return on marketing investment weighs the increased equity against the costs]

Main goals:
1. Identify relevant levers
2. Quantify their effect


Analytic setup (main components only)

- logit(p_ijk) = β_0k LAST_ijk + x_ik' β_k
  - p_ijk is the probability that customer i buys brand k, given they bought brand j previously
  - LAST is a dummy variable for inertia
  - x_ik is a feature vector for customer i, brand k
- This is used to compute the brand switching matrix {p_ijk}, and customer lifetime value is calculated as:
  CLV_ij = Σ_t PROF_ijt B_ijt
  - PROF is a profit measure considering discounting, price & cost (assumed known)
  - B_ijt is the probability that customer i buys brand j at time t, calculated from the stochastic matrix {p_ijk}
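The two formulas above can be sketched numerically: propagate the purchase distribution B_t through the switching matrix and accumulate discounted profit. All numbers below (a 2-brand matrix, per-period profits, the discount factor) are invented for illustration; in the paper the matrix itself comes from the logit model.

```python
# Toy CLV computation: B_t = B_{t-1} P under the switching matrix {p_ijk},
# CLV = sum_t discount^t * sum_j profit_j * B_jt. Numbers are illustrative.

def clv(p_switch, profit, start, horizon, discount=0.9):
    n = len(profit)
    b = [1.0 if j == start else 0.0 for j in range(n)]  # B_0: last brand bought
    total, d = 0.0, 1.0
    for _ in range(horizon):
        b = [sum(b[i] * p_switch[i][j] for i in range(n)) for j in range(n)]
        d *= discount
        total += d * sum(profit[j] * b[j] for j in range(n))
    return total

P = [[0.8, 0.2],    # inertia: our brand's (brand 0) buyers mostly stay
     [0.3, 0.7]]
profit = [10.0, 0.0]  # we profit only when the customer buys our brand
v0 = clv(P, profit, start=0, horizon=24)
v1 = clv(P, profit, start=1, horizon=24)
```

With inertia, a customer who last bought our brand (v0) is worth more than one who last bought the competitor's (v1), which is exactly the quantity the switching-matrix formulation captures.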

Data definitions

- Potential drivers (marketing activities) are reflected in the components of x_i:
  - Price
  - Quality of service
  - etc.
- The data to estimate the logit model is based on:
  - Expert (manager) input
  - Questionnaires of customers
  - Corporate data warehouse (not implemented in their case study...)


Results: important drivers for the airline industry?

Driver       | Coefficient | Std error | Z score (coeff/std)
Inertia      |    .849     |   .075    |   11.34
Quality      |    .441     |   .041    |   10.87
Price        |    .199     |   .020    |    9.86
Convenience  |    .609     |   .093    |    6.56
...          |    ...      |   ...     |    ...

Etc. (all factors deemed important)


What would a data miner do?

- Count more (or only) on historical data in the data warehouse
  - Variables would have a different meaning
  - Identify correlations, not necessarily drivers
- Could use the same analytic formulation, but also try alternative approaches
  - Relate LTV directly to the observed variables?
  - Model transaction sizes in addition to switching?
  - Use non-parametric modeling tools?
  - Etc.


Example 2: the segmentation approach

Common practice in marketing:
- Define static, fixed customer segments
  - Supposed to capture the true essence of customers' behaviors, needs and wants
  - Often given catchy names: "Upwardly mobile businessmen" representing the average profile
- Make marketing decisions at the segment level, based on understanding of needs and wants


A market segmentation methodology

Based on Kotler (2000), Marketing Management, Prentice-Hall.
1. Survey stage: primary research to capture motivations, attitudes, behaviors
2. Analysis stage: factor analysis, then clustering of survey data
   - Identify segments
3. Profiling stage: analyze segments and give them names

An additional stage often taken is to assign all customers to the defined segments:
4. Assignment stage: build a classification model to assign all customers to the learned segments
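The analysis stage above can be sketched as a clustering step over survey responses. A tiny k-means loop stands in for the factor-analysis-plus-clustering pipeline; the two "survey axes" and their scores are made-up illustrations, not data from the deck.

```python
# Minimal k-means over (price sensitivity, service importance) survey scores,
# standing in for the analysis stage of the segmentation methodology.

def kmeans(points, k, iters=20):
    centers = [points[i] for i in range(k)]       # naive init: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                          # assign to nearest center
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        centers = [                               # recompute centroids
            tuple(sum(q[d] for q in cl) / len(cl) for d in range(len(points[0])))
            if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    return centers, clusters

survey = [(0.1, 0.9), (0.2, 0.8), (0.9, 0.1), (0.8, 0.2)]
centers, segments = kmeans(survey, k=2)
```

The profiling stage would then name the resulting centroids (e.g. "service seekers" vs. "price hunters"), and the assignment stage would fit a classifier that maps warehouse attributes to these segment labels.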


What would a data miner do?

- Option 1: clustering
  - Replace primary research by warehouse data
  - Cluster all customers
  - Lose the "needs and wants" aspect
- Option 2: supervised learning
  - Treat each decision problem as a separate modeling task
  - E.g., find positive and negative examples for each binary decision, learn a model
  - Advantage: customized
  - Disadvantages:
    - May not have the right data to model the decisions we want to make
    - Past correlations may not be indicative of future outcomes


Comparison of approaches (recap)

Criterion                                                                | Marketing | DM
Parametric models formalize knowledge of domain and problems             |     x     |
Robust against incorrect assumptions about domain and problems           |           |  x
Actively collect the data to estimate model quantities (active learning) |     x     |
Rely on existing, abundant data in corporate Data Warehouses             |           |  x
Integrate expert input from managers and customers (wants and needs)     |     x     |
Use data to learn new, surprising patterns about customer behavior       |           |  x


An integrated approach

- Count on historical data as much as possible
  - Avoid complex parametric models
  - Let the data guide us
- Still want to integrate domain knowledge
- Analyze and understand the special aspects of marketing modeling problems
  - Importance of the long-term relationship (lifetime value, loyalty)
  - Effects of competition (customer wallet vs. customer spending)
- Modify existing, or develop new, data analytics approaches to address these problems properly


Moving beyond revenue modeling

To really understand the profitability and potential of our customers, we need to move beyond modeling their short-term revenue contribution.

- Revenue over time: Lifetime Value modeling
  - How much can we expect to gain from a customer over time?
  - Incorporates loyalty/churn and prediction of future customer revenue
  - LTV = ∫_t S(t) v(t) D(t) dt
    (S(t) is the customer survival function, v(t) the customer value over time, D(t) the discounting factor)
- Potential revenue: Customer Wallet Estimation
  - How much revenue could we be generating from this customer?
  - Incorporates competition, brand switching, etc.
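The LTV integral is easy to evaluate numerically once functional forms are assumed. The sketch below assumes exponential survival S(t) = exp(-h t), constant value v, and exponential discounting D(t) = exp(-r t); under those assumptions the closed form is v / (h + r), which the Riemann sum should approach.

```python
import math

# Numeric sketch of LTV = integral of S(t) v(t) D(t) dt under assumed forms:
# exponential survival, constant customer value, exponential discounting.

def ltv(hazard, value, rate, horizon=200.0, steps=200_000):
    dt = horizon / steps
    return sum(
        math.exp(-hazard * (i * dt)) * value * math.exp(-rate * (i * dt)) * dt
        for i in range(steps)
    )

est = ltv(hazard=0.2, value=100.0, rate=0.05)  # closed form: 100 / 0.25 = 400
```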

LTV and Wallet: beyond standard modeling

[Diagram: revenue vs. time. Actual sales up to now are covered by sales/revenue modeling; next year by sales forecasting; the farther future by LTV modeling. Potential sales, above the actual-sales curve, are the target of wallet estimation.]

Types of decision support

- Passive decision support
  - Understand more about problems and causes
  - Identify areas of need, under-performance, etc.
  - Help in making better decisions
- Active decision support
  - Model the effect of actions
  - Actively help in deciding between alternative actions
- Active decision support is typically more challenging in terms of the data needed to learn models


Depth and actionability of insights

[Chart: depth (basic concepts → real insight) vs. actionability (passive/correlation → active/causality). Revenue modeling and revenue forecasting sit low on both axes; LTV modeling and wallet estimation add depth; lever identification adds actionability; understanding the effect of potential actions on LTV and wallet attainment is both deep and active.]

The causality challenge

- Predictive models discover correlation
- For active decision support we need to identify levers to pull to affect the outcome
  - Example: linear regression. Significant t-statistics imply the coefficients have a significant association with the response, not that they are actually causing it
  - Pulling levers only works with causality
- Causality is difficult to find or prove from observational data
  - If we have knowledge about causality, we can formalize it as (say) a Bayesian network and use it in our models
  - We can get closer to causality with case-control experiments


Illustration: predictive power is not causality

- Assume we observe, for some companies:
  - X = company's marketing budget
  - Y = company's sales
  and want to understand how to affect Y by controlling X
- Assume we find that X is very predictive of Y
- Possible scenarios:
  - X → Y: causality; we have successfully identified a lever
  - Y → X: a fixed percent of revenue goes to marketing?
  - X ← Z → Y: company size Z independently determining both quantities?

Some other challenges

- Modeling the effects of new/unobserved actions
  - Critical for active support, often difficult or impossible
  - Even established actions may have been applied in a different context than our planned campaign
- Integrating expert knowledge into the process
  - Can be done formally via graphical models
- Handling data issues: matching, leaks, cleaning
  - Always critical
- Delivering solutions and results


Example: Telecom Churn Management

A cell phone company has a set of customers; some leave (churn) every month.
The goals of a Churn Management system:
- Analyze the process of churn
  - Causes
  - Dynamics
  - Effects on the company
- Design policies and actions to improve the situation
  - Marketing campaigns
  - Incentive allocation (offer new features or presents)
  - Changes in plans to contend with competition



First step: understand the current situation

- Who is likely to churn (predictive patterns)?
  - Phone features / plans
  - Usage patterns
  - Demographics
  - Tools: segmentation, classification, etc.
- Which of these patterns are causal?
  - Tools: expert knowledge, Bayesian networks, etc.
- Which causal effects are not in the data?
  - Competition, economy, etc.
- Which of these customers are profitable?
  - Short term: customer value
  - Long term: lifetime value
  - Growth potential: customer wallet


Second step: design actions

- Can we affect causal churn patterns?
  - For example, by improving customer service
- Given possible incentives and marketing actions, what effect will they have on:
  - Loyalty and relationship
  - Current customer value and wallet attainment
  - Customer lifetime value
  - Cost to the company
- How can we optimize the use of our marketing resources?
  - Identify segments we want to retain
  - Identify effective marketing actions


Survey of Useful Methodologies

- Utility-based classification*: cost-sensitive and active learning
  - Motivation: need to handle the utility of decisions and the cost of data acquisition in marketing decision problems
  - Example domains: targeted marketing, brand switch modeling
- Markov Decision Processes (MDP) and Reinforcement Learning
  - Motivation: need to consider long-term profit maximization
  - Example domain: customer lifetime value modeling
- Bayesian Networks
  - Motivation: need to address the causality vs. correlation issue; need to formalize domain knowledge about relationships in the data
  - Example domain: customer wallet estimation

* cf. the Utility-Based Data Mining workshops at KDD'05 and KDD'06


Cost-sensitive Learning for Marketing Decision Support

- The use of basic machine learning (e.g. classification and regression) in marketing decision support is well accepted
  - Example applications include targeted marketing, credit rating, and others
  - But is it the best we have to offer?
- Regression is an inherently harder problem than is required
  - One does not necessarily need to predict the business outcome, customer behavior, etc., but is merely required to make business decisions
  - Regression may fail to detect significant patterns, especially when the data is noisy
- Classification is an over-simplification
  - By mapping to classification, one loses information on the degree of goodness/badness of a business decision in the past data
- Cost-sensitive classification provides the desired middle ground
  - It simplifies the problem almost to classification and thus allows discovery of significant patterns
  - Yet it retains and exploits the information on the degree of goodness of business decisions, in a way that is motivated by utility theory


Cost-sensitive Learning a.k.a. Utility-based Classification

- In regression: given (x,r) ∈ X × R generated from a sampling distribution, find F such that
  F(x) ≈ r
  - E.g. r = profit obtained by targeting customer x
- In classification: given (x,y) ∈ X × {0,1} generated from a sampling distribution, find F such that
  F(x) ≈ y
  - E.g. y = 1 if customer x is good, 0 otherwise
- In utility-based classification: given a (stochastic) utility function U and (x,y) ∈ X × {0,1} generated from a sampling distribution, find F such that
  E[U(x,y,F(x))] is maximized (or, equivalently, E[C(x,y,F(x))] is minimized)
  - E.g. U(x,1,1) = Profit(x) = profit obtained by targeting customer x when x is indeed a good customer


Example Cost and Utility Functions

- Simple formulation (cost/benefit matrix): a misclassification cost matrix or classification utility matrix indexed by (true, predicted) class
- More realistic formulation (utility/cost dependent on individuals), where predicting "bad" means taking no action (zero utility):

Targeted marketing utility:
             | Predicted bad | Predicted good
True bad     |       0       |      -C
True good    |       0       |   Profit - C

Credit rating utility:
             | Predicted bad | Predicted good
True bad     |       0       |  - Default Amt
True good    |       0       |    Interest



Bayesian Approach with Regression

- For each example x, choose the class that minimizes the expected cost:
  i*(x) = argmin_i Σ_j P(j|x) C(x,i,j)
  Both P(j|x) and C(x,i,j) need to be estimated!
- Problem: requires conditional density estimation and regression to solve a classification problem
  - The price is high computational and sample complexity
- Merit: more flexibility and general applicability
  - Business constraints
  - Variability in fixed costs
- But is it necessary?
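Given an estimated class posterior and a cost matrix, the decision rule itself is a one-liner. The sketch below uses an invented targeted-marketing-style cost matrix (mailing costs C = 1, profit 10 on response); the numbers are illustrative, not from the deck.

```python
# Bayesian decision rule: i*(x) = argmin_i sum_j P(j|x) * C(x,i,j).

def bayes_decision(posterior, cost):
    """posterior[j] = P(j|x); cost[i][j] = cost of predicting i when truth is j."""
    expected = [sum(posterior[j] * cost[i][j] for j in range(len(posterior)))
                for i in range(len(cost))]
    return min(range(len(cost)), key=expected.__getitem__)

# Invented costs: i=0 "don't target" (forgo profit when the customer is good),
# i=1 "target" (pay C=1; net -(10 - 1) when the customer is good).
cost = [[0.0, 10.0],
        [1.0, -9.0]]
skip = bayes_decision([0.97, 0.03], cost)  # too unlikely to respond
mail = bayes_decision([0.50, 0.50], cost)  # expected profit outweighs C
```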


A classification approach: the cost-sensitive boosting algorithm GBSE [AZL 2004]

Define the expanded sample S' as:
  S' = {(x, y') | (x, y) ∈ S, y' ∈ Y}

GBSE(learner A, expanded data S', count T):
  (1) For all (x,y) ∈ S', initialize H_0(x,y) = 1/|Y|
  (2) For all (x,y) ∈ S', initialize the weight
        w_{x,y} = E_{y'~H_0}[C(x,y')] − C(x,y)
  (3) For t = 1 to T:
      (a) For all (x,y) ∈ S', update the weight
            w_{x,y} = E_{y'~H_{t−1}}[C(x,y')] − C(x,y)
          (the difference between the average cost under the current ensemble and the cost of y; the weight is updated in each iteration)
      (b) Let T' = {((x,y), I(w_{x,y} > 0)) | (x,y) ∈ S'}
      (c) Let h_t = A(T', |w|)
      (d) f_t = Stochastic(h_t)
      (e) F_t = (1 − ε) F_{t−1} + ε f_t
  (4) Output h(x) = argmax_y Σ_{t=1}^T h_t(x,y)

Gradient Boosting with Stochastic Ensembles: Illustration

- The difference between the current average cost and the cost associated with a particular label is the boosting weight
- The sign of the weight, E[C(x,y)] − C(x,y), is the training label

[Figure: for each label y, the gap between its cost C(x,y) and the ensemble's average cost E[C(x,y)] gives the training weight and label, shown at learning iterations t and t+1]
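The weight computation in the illustration is small enough to sketch directly: for each candidate label of an expanded example, subtract its cost from the ensemble's average cost. The cost values and ensemble distribution below are invented for illustration.

```python
# GBSE-style boosting weights: w_y = E_{y'~H}[C(x,y')] - C(x,y).
# A positive weight marks a better-than-average label; its sign is the
# training label and |w| the example weight.

def gbse_weights(costs, h):
    """costs[y] = C(x,y); h[y] = current ensemble probability of label y."""
    avg = sum(h[y] * costs[y] for y in range(len(costs)))
    return [avg - costs[y] for y in range(len(costs))]

w = gbse_weights(costs=[0.0, 4.0], h=[0.5, 0.5])  # ensemble currently 50/50
labels = [int(wy > 0) for wy in w]                # sign of weight = label
```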



Cost-sensitive boosting outperforms existing methods of cost-sensitive learning as well as classification and regression

Average test-set cost (SE); Bagging, AvgCost and MetaCost are the existing methods:

Data Set   | Bagging   | AvgCost | MetaCost | GBSE
Annealing  | 1059174   | 12712   | 20742    | 344
Solar      | 5403397   | 23738   | 5317390  | 4810
KDD-99     | 31942     | 428     | 499      | 21
Letter     | 1513      | 921     | 1302     | 852
Splice     | 645       | 614     | 503      | 584
Satellite  | 19010     | 1086    | 1046     | 936



Active Learning a.k.a. Query Learning

- The goal is to achieve data- and computation-efficient learning by obtaining labeled data for points of the algorithm's choosing
- Existing approaches can be classified into two main categories:
  - Algorithmic approach (cf. [Angluin])
  - Information-theoretic approach (cf. [SOS 1992])

[Diagram: the learner selects the points necessary for learning from the domain* and adds their labeled data to the training sample]
* The domain size is generally exponential


The Query by Committee Algorithm [STS 1991]

- A representative information-theoretic active learning method
- Main idea: query the points on which the agent algorithms disagree the most (to maximize information gain)
- Merit: data-efficient learning is theoretically guaranteed, subject to assumptions on the representability of the target
- Weakness: the theory requires idealized (Gibbs) agent learners, and is generally not computationally feasible

[Diagram: idealized agents (randomized algorithms) predict on randomly selected points from the input sample; the point with maximum spread of predictions (maximum uncertainty) is queried]


An Efficient Variant: Query by Bagging and Query by Boosting [AM 1998]

- These methods combine the computational approach of ensemble methods with the information-theoretic query-by-committee method
- They allow arbitrary deterministic agent algorithms

[Diagram: an agent learner A is run T times to produce hypotheses h_1, ..., h_T; the point x* on which the component hypotheses disagree most is queried]
- Bagging: re-sampling with a uniform distribution
- Boosting: weighted sampling with boosting weights
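The committee idea can be sketched in a few lines. The 1D "agent learner" (a threshold halfway between class means), the leave-one-out committee (a deterministic stand-in for the bootstrap resampling of query by bagging), and all data points below are invented for illustration.

```python
# Committee-based query selection: train agents on resamples, then query the
# pool point on which the committee's votes are most evenly split.

def fit_threshold(sample):
    """1D stand-in for the agent learner: threshold between class means."""
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2

def query_most_disagreed(labeled, pool):
    # Leave-one-out committee (deterministic stand-in for bootstrap resamples).
    committee = [fit_threshold(labeled[:i] + labeled[i + 1:])
                 for i in range(len(labeled))]
    def disagreement(x):
        votes = sum(x > th for th in committee)
        return min(votes, len(committee) - votes)  # max spread = most informative
    return max(pool, key=disagreement)

labeled = [(0.0, 0), (1.0, 0), (4.0, 1), (5.0, 1)]
picked = query_most_disagreed(labeled, pool=[0.5, 2.4, 4.8])
```

The committee agrees on the clearly-negative and clearly-positive pool points, so the point near the decision boundary is the one queried.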

Active learning can accelerate learning

- It has been observed that active learning can drastically accelerate the rate of learning (e.g. 10- to 100-fold) over passive learning
- Application to primary research (surveys) in marketing analytics is promising but has not been exploited extensively

[Plots: learning curves on the WDBC and Breast Cancer Wisconsin datasets (UCI ML repository) with C4.5]



Sequential Cost-sensitive Decision Making by Reinforcement Learning

- Cost-sensitive classification provides an adequate framework for a single marketing decision
- Real-world marketing decisions are rarely made in isolation; they are made sequentially
- We need to address the sequential dependency in decision making
  - Cost-sensitive classification maximizes E[U(x,h(x))]
  - We now wish to maximize Σ_t γ^t E[U(x_t,h(x_t))], where x_t may depend on earlier decisions
  - This is nothing but reinforcement learning, if we view x as the state:
    maximize Σ_t γ^t E[U(s_t,π(s_t))], where s_t is determined stochastically according to a transition probability determined by s_{t−1} and π(s_{t−1})


Review: Markov Decision Process (MDP)

- At any given time t, the agent is in some state s
- It takes an action a and makes a transition to the next state s', dictated by the transition probability T(s,a)
- It then receives a reward, or utility, U(s,a), which also depends on state s and action a
- The goal of a reinforcement learner in an MDP is to learn a policy π: S → A, mapping states to actions, so as to maximize the cumulative discounted reward:
  R = Σ_{t=0}^∞ γ^t U(s_t, a_t)


MDP and Reinforcement Learning provide an advanced framework for modeling customer lifetime value

- Modeling the CRM process using a Markov Decision Process (MDP)
  - The customer is in some "state" (his/her attributes) at any point in time
  - The retailer's action will move the customer into another state
  - The retailer's goal is to take a sequence of actions that guides the customer's path so as to maximize the customer's lifetime value
- Reinforcement learning produces optimized targeting rules of the form:
  - If the customer is in state "s", then take marketing action "a"
  - The customer state s is represented by the current customer attribute vector
  - The learner estimates LTV(s,a); the best policy is to choose a to maximize LTV(s,a)

[Diagram: a typical CRM process as a state graph (one timer, repeater, bargain hunter, defector, potentially valuable, valuable customer, loyal customer) with campaigns A-E moving customers between states]

MDP enables genuine lifetime value modeling, in contrast to existing approaches that use observed lifetime value

- Observed lifetime value reflects only the customer's lifetime value attained under the current marketing policy, and therefore fails to capture their potential lifetime value
- MDP-based lifetime value modeling allows modeling of lifetime value based on the optimized marketing policy (= the output of the system!)
  - The estimated (potential) lifetime value will be based on the optimal path
  - The output policy will lead the customer through the same path

[Diagram: customer A's path through the state graph under the current marketing policy vs. the optimized marketing policy]

And here is how this is possible

- The MDP enables the use of data for many customers in various stages (states) to determine the potential lifetime value of a particular customer in a particular state
- Reinforcement learning can estimate the lifetime value (function) without explicitly estimating the MDP itself
- The key lies in the value iteration procedure based on Bellman's equation:
  Q(s,a) = E[U(s,a)] + γ max_a' Q(s',a')
  (the LTV of a state = the reward now + the LTV of the best next state)
- Each rule is, in effect, trained with data corresponding to all subsequent states

[Diagram: rules a-d moving a customer from "potentially valuable" through "repeater" states to "valuable customer" and "loyal customer"]


Reinforcement Learning Methods with Function Approximation

- Value iteration (based on the Bellman equation) provides the basis for classic reinforcement learning methods like Q-learning:
  Q_0(s,a) = E[U(s,a)]
  Q_{k+1}(s,a) = E[U(s,a)] + γ max_a' Q_k(s',a')
  π(s) = argmax_a Q(s,a)
- Batch Q-learning (with function approximation) solves value iteration as iterative regression problems:
  Q_0(s,a) ← U(s,a)
  Q_{k+1}(s,a) ← (1−α) Q_k(s,a) + α (U(s,a) + γ max_a' Q_k(s',a'))
  with Q estimated using function approximation (regression)
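The tabular special case of the value-iteration update is a few lines of code. The 2-state toy MDP below (a "repeater" and a "defector", with a "mail" action that wins back the defector) is invented for illustration; real CCOM-style models replace the table with a regression over customer attributes.

```python
# Tabular value iteration: Q_{k+1}(s,a) = U(s,a) + gamma * max_a' Q_k(s',a'),
# here with deterministic transitions for brevity.

def value_iteration(reward, nxt, gamma=0.9, iters=100):
    n_s, n_a = len(reward), len(reward[0])
    q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        q = [[reward[s][a] + gamma * max(q[nxt[s][a]]) for a in range(n_a)]
             for s in range(n_s)]
    return q

# Invented toy: state 0 = "repeater", state 1 = "defector";
# action 0 = no mail, action 1 = mail (which wins back the defector).
reward = [[1.0, 2.0], [0.0, 0.0]]
nxt    = [[0,   0  ], [1,   0  ]]
q = value_iteration(reward, nxt)
policy = [max(range(2), key=lambda a: q[s][a]) for s in range(2)]
```

Note how the defector state gets a high Q-value for mailing even though its immediate reward is zero: the Bellman backup propagates the lifetime value of the states that follow.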

Lifetime value modeling based on reinforcement learning can achieve greater long-term profits than the traditional approach

- The graph below plots profits per campaign obtained in monthly campaigns over 2 years (in an empirical evaluation using benchmark data, the KDD Cup 98 data)

[Plot: profits per campaign (0-80,000) by campaign number for the single-event policy vs. CCOM; the output policy of the MDP approach (CCOM) invests in initial campaigns to yield greater long-term profits]


Bayesian Network a.k.a. Graphical Model

- A Bayesian network is a directed acyclic graphical model and defines a probability model
- A simple example, over Economy (E), Marketing (M), Competition (C) and Revenue (R):

  P(M,E,C,R) = P(E) P(M|E) P(C|E) P(R|M,C)

[Diagram: E has edges to M and C; M and C have edges to R. Example conditional probability tables, each row giving P(true) / P(false):
  P(E): 0.3 / 0.7
  P(M|E): 0.3 / 0.7 and 0.9 / 0.1 (rows for E true / false)
  P(C|E): 0.4 / 0.6 and 0.7 / 0.3 (rows for E true / false)
  P(R|M,C): 0.3 / 0.7 for (F,F), 0.9 / 0.1 for (T,F), 0.2 / 0.8 for (F,T), 0.6 / 0.4 for (T,T)]
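The factorization can be evaluated directly from the conditional tables. The condition-to-row mapping of the extracted tables is an assumption (row labels did not survive extraction), so treat the numbers as illustrative.

```python
# Direct evaluation of P(M,E,C,R) = P(E) P(M|E) P(C|E) P(R|M,C) from CPTs.
# The assignment of CPT rows to conditions is assumed, not from the deck.

p_e = {True: 0.3, False: 0.7}
p_m_e = {True: {True: 0.3, False: 0.7}, False: {True: 0.9, False: 0.1}}
p_c_e = {True: {True: 0.4, False: 0.6}, False: {True: 0.7, False: 0.3}}
p_r_mc = {(False, False): {True: 0.3, False: 0.7},
          (True, False):  {True: 0.9, False: 0.1},
          (False, True):  {True: 0.2, False: 0.8},
          (True, True):   {True: 0.6, False: 0.4}}

def joint(m, e, c, r):
    return p_e[e] * p_m_e[e][m] * p_c_e[e][c] * p_r_mc[(m, c)][r]

bools = (True, False)
total = sum(joint(m, e, c, r) for m in bools for e in bools
            for c in bools for r in bools)   # sanity check: should be 1
```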

Bayesian Network as a General Unifying Framework

- Bayesian networks provide a general framework that subsumes numerous known classes of probabilistic models, e.g.
  - Naive Bayes classification
  - Clustering (mixture models)
  - Autoregressive models
  - Hidden Markov models, etc.
- Bayesian networks provide a framework for discussing modeling, inference, causality, hidden variables, etc.

[Diagrams: Naive Bayes classification (class node pointing to variables 1..N), clustering/mixture (unobserved class node pointing to variables), hidden Markov model (a chain of states emitting symbols)]



Bayesian Network and Causality

- Causality is not necessarily implied by the edge direction
- In the example network, P(M,E,C,R) = P(E) P(M|E) P(C|E) P(R|M,C); restricted to M, E and C, this is actually ambiguous between:
  - M ← E → C:  P(M,E,C) = P(E) P(M|E) P(C|E)
  - M → E → C:  P(M,E,C) = P(M) P(E|M) P(C|E)
  - M ← E ← C:  P(M,E,C) = P(C) P(E|C) P(M|E)


Causal Network and Causal Pattern

- Causal network: a directed graph in which the direction of an edge means causality
- Causal pattern: an equivalence class of causal networks

[Diagram: a causal network over Economy, Marketing, Competition and Revenue, and the corresponding causal pattern, in which some edges are left undirected. The pattern shows that the causal relationships between E, M and C are ambiguous.]


Edge Orientation in Bayesian/Causal Networks

[Figure: edge-orientation rules, from P. Spirtes, C. Glymour, and R. Scheines (2000)]


Inferring the Structure of a Bayesian/Causal Network from Data

If M ⊥ R | E, the causal structure cannot be determined from the data: the following three networks are observationally equivalent:
  M ← E → R:  P(M,E,R) = P(E) P(M|E) P(R|E)
  M → E → R:  P(M,E,R) = P(M) P(E|M) P(R|E)
  M ← E ← R:  P(M,E,R) = P(R) P(E|R) P(M|E)

If Marketing and Competition are marginally independent, the causal structure can be determined from the data:
  M → R ← C:  P(M,C,R) = P(M) P(C) P(R|M,C)
  It can be inferred that Marketing can be a lever for controlling Revenue!


Estimation and Inference with Bayesian Networks

- Inferring causal structure from data
  - Sometimes possible, but in general not
- Bayesian network structure learning from data
  - Known to be intractable for general classes
  - Even robustly estimating polytrees is NP-complete
- Parameter estimation from data, given the structure
  - Efficiently solvable for many model classes
- Inference, given the model
  - Exact inference is known to be NP-complete for the sub-class including undirected cycles
  - Efficiently solvable for tree structures and many models used in practice
- Latent variable estimation, given the structure
  - Local-optimum estimation is often possible via EM algorithms
- Given these facts, determining the network structure using domain knowledge and then using it for parameter estimation and inference is common practice


Lifetime Value Modeling and Cross-Channel Optimized Marketing (CCOM)

- Optimizes targeted marketing across multiple channels for lifetime value maximization
- Combines scalable data mining and reinforcement learning methods to realize this unique capability

[Diagram: marketing channels (Web, Kiosk, Direct Mail, Call Center, Store) generating revenue]

CCOM Pilot Project with Saks Fifth Avenue

- Business problem addressed: optimizing direct mailing to maximize lifetime revenue at the store (and other channels)
- Provided a solution for the cross-channel challenge: no explicit link between marketing actions in one channel and revenue in another
- The CCOM mailing policy was shown to achieve a 7-8% increase in expected revenue in the store (in laboratory experiments)!

[Diagram: the CCOM-pilot business problem — direct mail driving store revenue]


Some Example Features

(numbers are correlations with the action and reward variables)

Demographic features:
Feature                    | Description                             | action | reward
FULL_LINE_STORE_OF_RES.    | If a full-line store exists in the area |  0.018 |  0.004
NON_FL_STORE_OF_RES.       | If a non-full-line store exists in area |  0.012 | -0.004

Transaction features (concerning divisions relevant to the current campaign):
CUR_DIV_PURCHASE_AMT_1M    | Pur amt in last month in curr div       |  0.065 |  0.090
CUR_DIV_PURCHASE_AMT_2_3M  | Pur amt in 2-3 months in curr div       |  0.099 |  0.080
CUR_DIV_PURCHASE_AMT_4_6M  | Pur amt in 4-6 months in curr div       |  0.133 |  0.091
CUR_DIV_PURCHASE_AMT_1Y    | Pur amt in last year in curr div        |  0.162 |  0.128
CUR_DIV_PURCHASE_AMT_TOT   | Total pur amt in current division       |  0.153 |  0.147

Promotion history features (on divisions relevant to the current campaign):
CUR_DIV_N_CATS_1M          | Num cat sent last month in curr div     |  0.294 |  0.028
CUR_DIV_N_CATS_2_3M        | Num cat sent 2-3 months ago in curr div |  0.260 |  0.025
CUR_DIV_N_CATS_4_6M        | Num cat sent 4-6 months ago in curr div |  0.158 |  0.062
CUR_DIV_N_CATS_TOT         | Total num cat sent in curr div to date  |  0.254 |  0.062

Control variable:
ACTION                     | To mail or not to mail                  |  1.000 |  0.008

Target (response) variable:
REWARD                     | Expected cumulative profits             |  0.008 |  1.000


The Cross-Channel Challenge and Solution

- The challenge: no explicit link between actions in one channel (mailing) and rewards in another (revenue)
  - Very low correlation observed between actions and responses
  - Other factors determining lifetime value may dominate over the control variable (the marketing action) in the estimation of expected value
  - The obtained models can be independent of the action and give rise to useless rules!
- The cross-channel solution: learn the relative advantage of competing actions!

[Diagram: under the standard method, the function approximation of the value of actions a1, a2 in states s1, s2 can wash out the action effect; the proposed method fits the per-state difference between actions instead]

The Learning Method

- Definition of advantage:
  A(s,a) := (1/Δt) (Q(s,a) − max_a' Q(s,a'))
- Advantage updating procedure [Baird 94] (learning rates written here as α, β, ω; the original symbols were lost in extraction):
  Repeat:
    1. Learn
       1.1. A(s,a) := (1−α) A(s,a) + α (A_max(s) + (R(s,a) + γ^Δt V(s') − V(s)) / Δt)
       1.2. Use regression to estimate A(s,a)
       1.3. V(s) := (1−β) V(s) + β (V(s) + (A_max-new(s) − A_max-old(s)) / α)
    2. Normalize
       A(s,a) := (1−ω) A(s,a) + ω (A(s,a) − A_max(s))
- Modifications:
  1. Initialization with the empirical lifetime value
  2. Batch learning with optional function approximation



Evaluation Results

- Obtained a policy with 7-8% policy advantage, i.e. a 7-8% increase in expected revenue (for the 1.6 million customers considered)
- Significant policy advantage observed with a small number of iterations
- CCOM evaluates a sequence of models and outputs the best model
- The mailing policy was constrained to mail the same number of catalogues in each campaign as last year

[Plots: policy advantage (percent, roughly -4 to 10) vs. learning iterations for two typical runs (versions 1 and 2)]


Evaluation Method

- Challenge in evaluation: need to evaluate the new policy using data collected by the existing (sampling) policy
- Solution: use bias-corrected estimation of the policy advantage, computed on data collected by the sampling policy
- Definition of policy advantage (π denotes the new policy, μ the sampling policy):

  (Discrete-time) advantage: A_π(s,a) := Q_π(s,a) − max_a' Q_π(s,a')

  Policy advantage: A_μ(π) := E_{s~μ}[ E_{a~π}[ A_π(s,a) ] ]

- Estimating the policy advantage with bias-corrected sampling:

  Â_μ(π) := Ê_{(s,a)~μ}[ (π(a|s) / μ(a|s)) A_π(s,a) ]
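A minimal numeric sketch of this bias-corrected estimator (the state, advantage values and policies are made-up for illustration): each action logged under the sampling policy μ is reweighted by π(a|s)/μ(a|s), so the weighted average is an unbiased estimate of the new policy π's advantage.

```python
import numpy as np

# Toy setup: one state, two actions, with known advantages.
advantage = {"mail": 2.0, "no_mail": -1.0}
mu = {"mail": 0.5, "no_mail": 0.5}   # sampling (logging) policy
pi = {"mail": 0.8, "no_mail": 0.2}   # new policy to evaluate

# Logged actions drawn under mu; counts match mu exactly so the
# illustration is deterministic.
logged = ["mail"] * 50 + ["no_mail"] * 50

# Bias-corrected (importance-weighted) estimate of the policy advantage
weights = np.array([pi[a] / mu[a] for a in logged])
values = np.array([advantage[a] for a in logged])
estimate = float(np.mean(weights * values))

# Direct computation of E_{a~pi}[A(s,a)] for comparison
truth = sum(pi[a] * advantage[a] for a in advantage)
```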


Combination of reinforcement learning (MDP) with predictive data mining enables automatic generation of trigger-based marketing targeting rules:
- Optimized with respect to the customer's potential lifetime value
- Stated in a simple if-then style, which supports flexibility and compatibility
- Refined to make reference to detailed customer attributes, and hence well suited to event- and trigger-based marketing

This is made possible by:
- Representing the states in the MDP by customers' attribute vectors
- Combining reinforcement learning with predictive data mining to estimate lifetime value as a function of customer attributes and marketing actions

[Figure: an example marketing targeting rule output by the CCOM system.]

Some examples of rules output by CCOM

- Avoid saturation effects
  Interpretation: if a customer has spent in the current division but enough catalogues have already been sent, then don't mail.
- Differentiate between customers who may be near saturation and those who are not
  Interpretation: if a customer has spent in the current division and has received moderately many relevant catalogues, then mail.
- Invest in a customer until it is known not to be worth it
  Interpretation: if a customer has spent significantly in the past and yet has not spent much in the current division (product group), then don't mail.


CCOM is generically applicable by mapping physical data to this model

CCOM - Logical Data Model (*developed with CBO)

- Period: Period Identifier; Period Duration
- Customer: Customer Identifier; First Name; Last Name; Age; Gender
- Customer Profile History: Customer Identifier; Profile History Date; Period Identifier; Product Category Identifier; Channel Identifier; Aggregated Count of Events; Aggregated Revenue; Aggregated Profit
- Transaction: Customer Identifier; Transaction Date; Product Category Identifier; Event Identifier; Channel Identifier; Transaction Revenue; Transaction Profit
- Customer Loyalty Level History: Customer Identifier; Loyalty Level Start Date; Loyalty Level End Date; Loyalty Level
- Customer Marketing Action: Event Identifier; Customer Identifier; Marketing Action Date; Marketing Action
- Product Category: Product Category Identifier; Product Category Description
- Channel: Channel Identifier; Channel Description
- Marketing Event: Event Identifier; Channel Identifier; Event Date; Event Category Description; Fixed Cost
- Event - Product Category: Event Identifier; Product Category Identifier; Weight
- CCOM output models (optional entities): Lifetime Value Model and Marketing Policy Model, each with Model Identifier; Model Type; Model


Customer Wallet and Opportunity Estimation: Analytical Approaches and Applications

Saharon Rosset, Claudia Perlich, Rick Lawrence
IBM T. J. Watson Research Center


Outline

- Wallet estimation: problems and solutions
  - The different wallet definitions
  - How can we evaluate wallet models?
  - Modeling approaches
  - Empirical evaluation
- MAP (Market Alignment Program)
  - Description of application and goals
  - The interview process and the feedback loop
  - Evaluation of Wallet models' performance in MAP


What is Wallet (AKA Opportunity)?

- Total amount of money a company can spend on a certain category of products.

  IBM sales ≤ IT wallet ≤ Company revenue

[Figure: nested circles: IBM Sales inside IT Wallet inside Company Revenue.]


Why Are We Interested in Wallet?
(Applications: OnTarget, MAP)

- Better evaluation of growth potential by combining wallet estimates and past sales history
  - Enables focus on high-wallet, low-share-of-wallet customers
- Intelligent marketing using wallet estimates for sub-categories, e.g., software, hardware
- Evaluating success of sales personnel and sales channels by the share-of-wallet they attain
  - Making resource assignment decisions


Wallet Modeling Problem

- Given:
  - customer firmographics x (from D&B): industry, employee number, company type, etc.
  - customer revenue r
  - IBM relationship variables z: historical sales by product
  - IBM sales s
- Goal: model customer wallet w, then use it to predict present/future wallets
- No direct training data on w, or information about its distribution!


Historical Approaches

- Top down: the approach used by IBM Market Intelligence in North America (called ITEM)
  - Use econometric models to assign total opportunity to a segment (e.g., industry and geography)
  - Assign to companies in the segment proportionally to their size (e.g., D&B employee counts)
- Bottom up: learn a model for individual companies
  - Get true wallet values through surveys or appropriate data repositories (these exist, e.g., for credit cards)
- Many issues with both approaches (won't go into detail)
  - We would like a predictive approach from raw data


Agenda

- Introduction and analytical issues
  - Different wallet definitions
- How can we evaluate wallet models?
  - The quantile regression loss function
- Modeling approaches and results:
  - Nearest neighbor approach
  - Quantile regression
  - Model decomposition approach


Multiple Wallet Definitions

- TOTAL: total customer available budget in the relevant area (e.g., total IT)
  - Can we really hope to attain all of it?
- SERVED: total customer spending on IT products covered by IBM
  - A better definition for our marketing purposes
- REALISTIC: IBM spending of the best similar customers
  - This can be concretely defined as a high percentile of P(IBM revenue | customer attributes)

  REALISTIC ≤ SERVED ≤ TOTAL

[Figure: nested sets: Realistic inside Served Wallet inside Total Wallet.]


REALISTIC Wallet: Percentile of Conditional

- Distribution of IBM sales to the customer given customer attributes: s | r, x, z ~ f_{r,x,z}
- E.g., under the standard linear regression assumption:

  s = βx + γr + δz + ε,   ε ~ N(0, σ²)

- What we are looking for is the (say) 90th percentile of this distribution

[Figure: conditional density of s, with the mean E(s|r,x,z) and the Realistic (90th-percentile) point marked.]
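Under the linear-Gaussian assumption above, the 90th percentile has a closed form; a sketch (the coefficient symbols β, γ, δ are notational placeholders matching the regression equation):

```latex
w_{\mathrm{REALISTIC}} \;=\; q_{0.9}(s \mid r,x,z)
  \;=\; \beta x + \gamma r + \delta z + \sigma\,\Phi^{-1}(0.9)
  \;\approx\; E(s \mid r,x,z) + 1.28\,\sigma
```

where Φ⁻¹ is the standard normal quantile function: the REALISTIC wallet sits about 1.28 residual standard deviations above the regression mean.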


Agenda

- Introduction and analytical issues
  - Different wallet definitions
- How can we evaluate wallet models?
  - The quantile regression loss function
- Modeling approaches and results:
  - Nearest neighbor approach
  - Quantile regression approach
  - Model decomposition approach


Traditional Approaches to Model Evaluation

- Evaluate models based on surveys
  - Cost and reliability issues
- Evaluate models based on high-level performance indicators:
  - Do the wallet numbers sum up to numbers that make sense at the segment level (e.g., compared to macroeconomic models)?
  - Does the distribution of differences between predicted Wallet and actual IBM Sales and/or Company Revenue make sense? In particular, are the percentages bigger/smaller as we expect?
- Problem: no observation-level evaluation


The Quantile Loss Function

- Our REALISTIC wallet definition calls for estimating the p-th quantile of P(s | data).
- Can we devise a loss function which is optimized in expectation when we succeed?
- Answer: yes, the quantile loss function for quantile p:

  L_p(y, ŷ) = p (y − ŷ)        if y > ŷ
              (1 − p) (ŷ − y)   otherwise

- This loss function is optimized in expectation when we correctly predict REALISTIC:

  argmin_ŷ E[ L_p(y, ŷ) | x ] = p-th quantile of P(y | x)
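A small numeric check of this optimality property (toy data, nothing IBM-specific): across constant predictions, the average quantile loss for p = 0.9 is minimized at the empirical 90th percentile.

```python
import numpy as np

def quantile_loss(y, y_hat, p):
    """Pinball loss L_p: underprediction costs p per unit,
    overprediction costs (1 - p) per unit."""
    d = y - y_hat
    return np.where(d > 0, p * d, (p - 1) * d)

y = np.arange(1, 101)            # toy outcomes 1..100
p = 0.9
candidates = np.arange(1, 101)   # constant predictions to compare
avg_loss = [quantile_loss(y, c, p).mean() for c in candidates]
best = candidates[int(np.argmin(avg_loss))]
# best lands at the empirical 90th percentile (90 and 91 tie exactly here)
```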


Some Quantile Loss Functions

[Figure: quantile loss as a function of the residual (observed − predicted), for p = 0.8 (asymmetric: slope p for positive residuals, 1 − p for negative) and p = 0.5 (symmetric, proportional to absolute loss).]


Which Wallet Definitions to Model?

- We are generally interested in modeling REALISTIC and SERVED wallets
  - TOTAL wallets are not of real marketing interest
- For REALISTIC (or opportunity) we have multiple modeling approaches:
  - Quantile k-nearest neighbors
  - Quantile regression approaches: linear quantile regression, tree-based quantile regression, kernel quantile regression, quanting, ...
- For SERVED we have developed a graphical modeling approach (will not discuss here)


Modeling REALISTIC Wallets

- REALISTIC defines wallet as the 90th percentile of the conditional distribution of spending given customer attributes
  - Implies some 10% of customers are spending their full wallet with IBM
- Two obvious ways to get at the 90th percentile:
  - Estimate the conditional distribution by integrating over a neighborhood of the specific customer, then take the 90th percentile of spending in that neighborhood
  - Create a global model for the 90th percentile, building regression models with the quantile loss function

K-Nearest Neighbors

- Distance metric: Euclidean distance on firmographics and past IBM sales, with normalization and an industry match requirement
- Neighborhood size (k): has a significant effect on prediction quality
- Prediction: a quantile of IBM sales among the firms in the neighborhood

[Figure: the universe of IBM customers with D&B information, plotted along industry, employees and revenue; the neighborhood of target company i is highlighted, and the wallet estimate is read off a quantile of the neighbors' IBM sales histogram.]
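The neighborhood approach can be sketched in a few lines (the firmographic data, sales figures and the helper name knn_wallet are all illustrative):

```python
import numpy as np

# Toy firmographics: [log(employees), log(revenue)] per company,
# plus each company's observed IBM sales (made-up numbers).
features = np.array([
    [1.0, 1.0],
    [1.1, 0.9],
    [0.9, 1.2],
    [5.0, 5.0],
    [5.1, 4.9],
])
ibm_sales = np.array([10.0, 20.0, 30.0, 500.0, 600.0])

def knn_wallet(target, k=3, q=90):
    """REALISTIC-style estimate: the q-th percentile of IBM sales among
    the k nearest companies (Euclidean distance on normalized features)."""
    mu, sd = features.mean(axis=0), features.std(axis=0)
    z = (features - mu) / sd                 # normalize firmographics
    t = (np.asarray(target) - mu) / sd
    dist = np.sqrt(((z - t) ** 2).sum(axis=1))
    neighbors = np.argsort(dist)[:k]
    return float(np.percentile(ibm_sales[neighbors], q))

est = knn_wallet([1.0, 1.05])  # neighbors are the three small firms
```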


Quantile Regression

- Traditional regression: estimate the conditional expected value by minimizing the sum of squares:

  min_β Σ_{i=1..n} ( y_i − f(x_i, β) )²

- Quantile regression: minimize the quantile loss:

  min_β Σ_{i=1..n} L_p( y_i, f(x_i, β) )

  with L_p(y, ŷ) = p (y − ŷ) if y > ŷ, and (1 − p) (ŷ − y) otherwise

- Implementation: assume a linear function y = βx + ε; the quantile regression loss function then admits a solution by linear programming
  - Linear quantile regression package (quantreg) in R (Koenker, 2001)
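The linear-programming solution is what R's quantreg computes exactly; as a self-contained illustration, the same pinball objective can also be minimized by plain subgradient descent (synthetic data and a hand-rolled optimizer, sketching the objective rather than the package's algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 300, 0.8
x = rng.uniform(0.0, 1.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, n)  # synthetic monetary target
X = np.column_stack([np.ones(n), x])          # intercept + slope

beta = np.zeros(2)
beta_sum = np.zeros(2)
for t in range(5000):
    resid = y - X @ beta
    # Subgradient of the pinball loss: p above the fitted line, p - 1 below
    g = np.where(resid > 0, p, p - 1.0)
    beta += (0.5 / np.sqrt(t + 1.0)) * (X.T @ g) / n  # descent step
    if t >= 2500:
        beta_sum += beta          # average late iterates for stability
beta_avg = beta_sum / 2500

# At the 0.8-quantile fit, roughly 80% of points lie on or below the line
coverage = float(np.mean(y <= X @ beta_avg))
```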


Quantile Regression Tree

- Motivation:
  - Identify a locally optimal definition of neighborhood
  - Inherently nonlinear
- Adjustments of M5/CART for quantile prediction:
  - Predict the percentile rather than the mean of the leaf
  - Splitting/pruning criteria do not require adjustment

[Figure: example tree with splits "Industry = Banking?", "Sales < 100K?", "IBM Rev 2003 > 10K?"; each leaf holds a histogram of IBM sales, and the wallet estimate is a high quantile of the leaf.]
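The M5/CART adjustment above amounts to changing only the leaf prediction; a minimal depth-one sketch (toy data; the split search uses the usual sum-of-squared-errors criterion):

```python
import numpy as np

# Toy data: one numeric feature, a monetary target with two clear regimes
x = np.array([1, 2, 3, 4, 11, 12, 13, 14], dtype=float)
y = np.array([1, 2, 3, 4, 101, 102, 103, 104], dtype=float)

def best_split(x, y):
    """Standard CART split search: the splitting criterion (SSE) is
    unchanged; only the leaf prediction changes for quantile trees."""
    xs = np.unique(x)
    thresholds = (xs[:-1] + xs[1:]) / 2
    def sse(v):
        return float(((v - v.mean()) ** 2).sum()) if len(v) else 0.0
    scores = [sse(y[x <= t]) + sse(y[x > t]) for t in thresholds]
    return float(thresholds[int(np.argmin(scores))])

def quantile_tree_predict(x, y, x_new, q=90):
    t = best_split(x, y)
    leaf = y[x <= t] if x_new <= t else y[x > t]
    return float(np.percentile(leaf, q))   # percentile, not mean

lo = quantile_tree_predict(x, y, 2.0)    # left leaf
hi = quantile_tree_predict(x, y, 12.0)   # right leaf
```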


Empirical Evaluation: Quantile Loss

- Setup
  - 4 domains with a monetary dependent variable, including direct mailing, housing prices, income data, IBM sales
  - Performance on a test set in terms of quantile loss
  - Approaches: kNN, linear quantile regression, quantile tree, quanting
- Baselines
  - Constant model
  - Traditional regression models for expected values (for skewed distributions, the expected value is actually a high quantile)


Performance on Quantile Loss

- Conclusions
  - If there is a time-lagged variable, the linear quantile model is best
  - Quanting (using decision trees) and the quantile tree perform comparably
  - Generalized kNN is not competitive


Residuals for Quantile Regression

Total positive holdout residuals: 90.05% (18009/20000), closely matching the 90% expected when the 0.9 quantile is modeled correctly.


Market Alignment Project (MAP): Background

- MAP objective:
  - Optimize the allocation of the sales force
  - Focus on customers with growth potential
  - Set evaluation baselines for sales personnel
- MAP components:
  - Web interface with customer information
  - Analytical component: wallet estimates
  - Workshops with sales personnel to review and correct the wallet predictions
  - Shift of resources towards customers with lower wallet share


The MAP tool captures expert feedback from the Client Facing teams
(MAP interview process: all Integrated and Aligned Coverages)

[Diagram: transaction data and D&B data feed a data-integration step and the wallet models, which produce predicted opportunity; the MAP interview team delivers and captures insight with the Client Facing Unit (CFU) team through a web interface; the resulting expert-validated opportunity flows into post-processing, analytics and validation, and resource assignments.]

The objective here is to use expert feedback (i.e., validated revenue opportunity) from last year's workshops to evaluate our latest opportunity models.


MAP workshops overview

- Calculated 2005 opportunity using a naive k-NN approach
- 2005 MAP workshops:
  - Displayed opportunity by brand
  - Expert can accept or alter the opportunity
- Selected 3 brands for evaluation: DB2, Rational, Tivoli
- Built ~100 models for each brand using different approaches
- Compared expert opportunity to model prediction
  - Error measures: absolute, squared
  - Scales: original, log, root


Displayed Model Predictions of kNN

- Distance metric:
  - Identical industry
  - Euclidean distance on size (revenue or employees)
- Neighborhood size: 20
- Prediction: median of the non-zero neighbors (alternatives: max, percentile)
- Post-processing: floor the prediction by the max of the last 3 years' revenue

[Figure: the universe of IBM customers with D&B information, plotted along industry, employees and revenue, with the neighborhood of target company i highlighted.]


Expert Feedback (Log Scale) to Original Model (DB2)

[Figure: scatter plot of expert feedback vs. model opportunity, both on log scale. Experts accepted the opportunity for 45% of accounts; changed it for 40% (increase: 17%, decrease: 23%); and reduced the opportunity to 0 for the remaining 15%.]


Observations

- Many accounts are set to zero for external reasons
  - Exclude from evaluation, since no model can predict this
- Exponential distribution of opportunities
  - Residual-based evaluation on the original scale suffers from huge outliers
- Experts seem to make percentage adjustments
  - Consider log-scale evaluation in addition to the original scale, with root as an intermediate
  - Suspect a strong anchoring bias: 45% of opportunities were not touched


Evaluation Measures

- Different scales to avoid outlier artifacts:
  - Original: e = model − expert
  - Root: e = root(model) − root(expert)
  - Log: e = log(model) − log(expert)
- Statistics on the distribution of the errors:
  - Mean of e²
  - Mean of |e|
- Total of 6 criteria (2 statistics on each of 3 scales)
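The six criteria are cheap to compute; a sketch (the function name and dictionary keys are illustrative, and the log scale assumes strictly positive opportunities):

```python
import numpy as np

def evaluation_criteria(model, expert):
    """Mean squared and mean absolute model-vs-expert error on the
    original, root, and log scales: 2 statistics x 3 scales = 6 criteria."""
    model = np.asarray(model, dtype=float)
    expert = np.asarray(expert, dtype=float)
    out = {}
    for scale, f in [("original", lambda v: v),
                     ("root", np.sqrt),
                     ("log", np.log)]:       # log assumes positive values
        e = f(model) - f(expert)
        out[scale + "_mse"] = float(np.mean(e ** 2))
        out[scale + "_mae"] = float(np.mean(np.abs(e)))
    return out

crit = evaluation_criteria([100.0, 400.0], [81.0, 400.0])
```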


Model comparison results: count how often a model scores within the top 10 and top 20 for each of the 6 measures

[Table: per brand (Rational, DB2, Tivoli), top-10/top-20 counts for each model: Displayed Model (kNN; the anchoring baseline), Max 03-05 Revenue, Linear Quantile 0.8 (the best performer overall), Regression Tree, kNN 50 + flooring, Decomposition Center, Quantile Tree 0.8.]


Conclusions

- kNN performs very well after flooring, but is typically low prior to flooring
- Empirically, the linear 80th-quantile model performs consistently well (flooring has a minor effect)
- Experts are strongly influenced by the displayed opportunity (and the displayed revenue of previous years)
- Models without last year's revenue don't perform well

Use linear quantile regression with q = 0.8 in MAP 06


Ongoing and Future Work

- Extend MAP to other geographies
- Quantile estimation performance of different methods as a function of the quantile
- Performance as a function of the shape of the conditional distribution of the dependent variables
- Theoretical generalization of the decomposition approach


A graphical model approach

[Diagram: the unobserved Company IT Wallet node links company firmographics on one side with IT spend with IBM and the historical relationship with IBM on the other.]

- Wallet is unobserved; all other variables are observed
- The two families of variables (firmographics and IBM relationship) are conditionally independent given wallet
- We develop inference procedures and demonstrate them
  - In some cases this leads to simple linear regression as ML inference on wallet
- See poster in this conference: Merugu, Rosset and Perlich, "A new multi-view learning approach with an application to customer wallet estimation."


References

- Marketing Science
  - R. Rust, K. Lemon and V. Zeithaml, "Return on Marketing: Using Customer Equity to Focus Marketing Strategy," J. of Marketing, 2004.
  - P. Kotler, Marketing Management, Millennium Ed., Prentice-Hall, 2000.
- Cost-sensitive Learning
  - P. Domingos, "MetaCost: A General Method for Making Classifiers Cost-Sensitive," The 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.
  - N. Abe, B. Zadrozny and J. Langford, "An Iterative Method for Multi-class Cost-sensitive Learning," The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2004.
- Active Learning
  - H.S. Seung, M. Opper and H. Sompolinsky, "Query by Committee," Proceedings of the Fifth Workshop on Computational Learning Theory, 1992.
  - D. Angluin, "Queries and Concept Learning," Machine Learning, 1988.
- MDP and Reinforcement Learning
  - R. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
  - L. P. Kaelbling, M. L. Littman and A. W. Moore, "Reinforcement Learning: A Survey," Journal of Artificial Intelligence Research, 1996.


References

- Bayesian Networks and Causal Networks
  - K. Murphy, "A Brief Introduction to Bayesian Networks and Graphical Models," http://www.cs.berkeley.edu/~murphyk/Bayes/bayes.html
  - D. Heckerman, "A Tutorial on Learning with Bayesian Networks," Microsoft Research MSR-TR-95-06, March 1995.
  - J. Pearl, Causality: Models, Reasoning, and Inference, Cambridge University Press, 2000.
  - P. Spirtes, C. Glymour and R. Scheines, Causation, Prediction, and Search, 2nd Edition, MIT Press, 2000.
- Case Study: Customer Wallet Estimation
  - S. Rosset, C. Perlich, B. Zadrozny, S. Merugu, S. Weiss and R. Lawrence, "Customer Wallet Estimation," 1st NYU Workshop on CRM and Data Mining, 2005.
  - S. Merugu, S. Rosset and C. Perlich, "A New Multi-View Regression Method with an Application to Customer Wallet Estimation," The Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2006.


Thank you!
srosset@us.ibm.com
