ing demand for a product as a function of the product's attributes and of mar-
MICHAEL TRUSOV is a doctoral candidate at the UCLA
LEE G. COOPER is Professor Emeritus at the UCLA
© 2006 Wiley Periodicals, Inc. and Direct Marketing Educational Foundation, Inc.
Journal of Interactive Marketing DOI: 10.1002/dir
INTRODUCTION

Since the early 1970s, promotions have emerged to represent the main share of the marketing budget for most consumer-packaged goods (Srinivasan, Pauwels, Hanssens, & Dekimpe, 2004). It is not surprising that numerous studies in the field of marketing research are devoted to promotions. Researchers have examined multiple facets of the promotion phenomenon, such as what, when, and how to promote, optimal length of a promotion, its strategic implications, the persistent effect of promotions, their impact on consumer behavior, and so on. In the business environment, promotion planning is a routine daily task for both retailers and manufacturers; however, the planning task for retailers is quite different from the one faced by manufacturers. Manufacturers, even those with very broad product lines, must develop promotion-planning procedures for a maximum of a couple hundred products. On the other hand, the average supermarket chain carries thousands of items, with some subset of them always being promoted. While there are many important issues that must be addressed in the context of promotion planning, one of the most basic daily issues is the planning of inventory for upcoming sales events. This aspect of promotion planning has received relatively little attention in marketing literature, possibly due in part to the more applied, business-operations nature of the planning process. On the other hand, the techniques traditionally employed by marketing scholars to build forecast models (e.g., econometric methods) are not always well suited for retail-industry-scale systems. Some recent studies, however, bring to light the topic of retail promotion planning and sales forecasting (e.g., Cooper, Baron, Levy, Swisher, & Gogos, 1999; Divakar, Ratchford, & Shankar, 2005). This interest may be attributable to the emergence of new conceptual modeling/computational methods as well as advancements in technologies.

Until the emergence of automated promotion-planning systems, the common practice for retail store managers was to use the "last like" rule in ordering inventory for upcoming promotion events. This meant that they ordered the same quantity of goods that was sold during a similar past promotion (Cooper et al., 1999). While this was the common (and only) approach for decades, advancements in technology offered better ways for managers to handle the planning process. The reliance on the simple "last like" rule became both inefficient and infeasible. Several authors have discussed the advantages of applying market-response models over traditional rule-of-thumb approaches. Indeed, the use of statistical models often results in significant cost savings, as out-of-stock and overstock losses are reduced.

To predict demand for future promotion events, the promotion-forecasting system PromoCast (Cooper et al., 1999) and its extension using data mining (Cooper & Giuffrida, 2000) exploit historical data accumulated from past promotions for each separate SKU in each store within a retail chain. PromoCast uses a traditional market-response model to extract information from continuous variables, and Cooper and Giuffrida (2000) used data mining techniques on the residuals to extract information from many-valued nominal variables such as promotion conditions or merchandise category. The output of the data mining algorithm is a set of rules that specify what adjustments should be made to the forecast produced by the market-response model.

The purpose of this study is to highlight some of the limitations inherent to the data mining techniques employed in the extension of the PromoCast system and to show how addressing these limitations can improve the accuracy of the forecasts and interpretability of produced knowledge. Specifically, we show that by allowing for set-valued features in the rule syntax (Cohen, 1996), the quality of predictions (in terms of correctly applied forecast adjustments) improves by as much as 50%. (The original data mining algorithm produces an 11.84% improvement over the traditional market-response model.) Using this new PromoCast data mining algorithm as an example, we also show that by applying simple optimization techniques to the set of generated rules, the size of the rules space could be significantly reduced without loss of predictive ability. In our sample, we were able to reduce the number of generated rules from 156 to just 21 while preserving the overall accuracy of forecast. The optimized rules set is much more manageable and transparent to a market practitioner.

This article is organized as follows. In the next section, we provide a brief overview of PromoCast, a recent development in the field of automated promotion-forecasting systems. PromoCast has a strong
market-performance record, compares favorably against other leading forecasting packages, and combines the strength of traditional market-response models with state-of-the-art data mining algorithms. Limitations of this system that are inherent to the whole class of such models are discussed. Next, we demonstrate how the incorporation of set-valued features into a mining algorithm can significantly enhance promotion-forecasting performance on two measures: scope and accuracy. Results are discussed, and then we present the issue of nonoptimality of rules space generated by the rule-induction algorithm and propose a simple approach which leads to improved interpretability and manageability of produced knowledge. Consequences of rules space optimization in application to other marketing areas are discussed, as are some efficiency issues. We then present our conclusions.

PROMOTION-EVENT FORECASTING SYSTEM: PROMOCAST

The promotion-event forecasting system PromoCast was initially presented to the scientific community by Cooper et al. (1999). For a complete overview and implementation details, the interested reader should consult the original publication. We offer here a brief summary of the PromoCast system, just sufficient to support the idea of our research.

The motivation for PromoCast was to provide a tactical promotion-planning solution for store managers or buyers/category managers. The objective was to supply retailers with knowledge about the amount of stock to order for a promotional event such that it would minimize inventory costs and out-of-stock conditions (two often-conflicting goals). Since the emergence of scanner technology, businesses have been accumulating data on past promotions for each separate SKU in each store within a retail chain. PromoCast was built to take advantage of this huge amount of historical data to predict demand for future promotion events. Figure 1 presents the general schematic of the PromoCast system.

Of particular interest are two key system components: the PromoCast Forecast module and the Corrective Action Generator. The forecast module uses a traditional market-response model to extract information from 67 variables that either come directly from a retailer's historical records (e.g., unit prices, percentage discounts, type of supporting ads) or are being inferred from those records (e.g., baseline sales) (for some examples of traditional market-response models, see Hanssens, 1998). Excluded from the statistical forecaster are nominal variables such as an item's manufacturer and merchandise category. Market-response models are not sufficiently robust to respond to the addition of 1,000 dummy variables for
FIGURE 1
PromoCast Design
the system over other popular algorithms justifies our choice of PromoCast as a benchmark platform.

In conclusion of this section, note that while the Corrective Action Generator strongly contributes to forecast accuracy, the module has certain limitations inherent to the class of employed data mining algorithms. In the following section, we discuss these limitations in depth and show that forecast performance is significantly improved when these limitations are addressed.

SEARCHING FOR RARE PATTERNS

As discussed in the previous section, the KDS algorithm discovers patterns in excluded variables to explain specific outcomes in residuals. Specifically, in the example presented earlier, the discovered pattern associated with the outcome of underforecast is {DCS = Gelatin, TPR = Very High, Mfr = General Foods, Store Node = Culver City, #231}. The discovery of frequent patterns is one of the essential tasks in data mining. To ensure generation of the correct and complete set of frequent patterns, popular mining algorithms rely on a measure of so-called minimum support (Han, Wang, Lu, & Tzvetkov, 2002). For the reader who is not familiar with the idea of minimum support, we may suggest an analogy in statistics: statistical significance. From the statistical perspective, an inference is reliable if based on a large enough number of observations. Thus, a mining algorithm needs to know in how many records a given pattern must appear to be considered an as-a-rule candidate.

The minimum support value depends on the current user's goal. The higher the value, the fewer rules are created and the fewer records are classified. The problem of defining minimum support is quite subtle: A too-small threshold may lead to the generation of a huge number of rules whereas a too-large one often may generate no answers (Han et al., 2002). The major consequences of the former are poor scalability and outcome (rules) interpretability. The latter could be paralleled to the famous "broken leg cue" problem described by Meehl (1954), in which highly diagnostic cues that occur infrequently in the data sample are being left out of a statistical model. Next, we demonstrate how arbitrary selection of minimum support may lead to lost knowledge.

We start with the hypothetical example of inventory planning for an upcoming promotion event in the ice cream product category. Let us consider a supermarket which carries 40 different flavors of Ben & Jerry's ice cream. The store database contains historical records on the past promotions and inventory status for all 40 flavors. It happens that most flavors, except cherry and coffee, are underforecasted by the statistical model and would result in out of stock before the promotion ends. Assume that the store database contains records for 200 promotion events where none of any particular flavor of Ben & Jerry's ice cream appears more than 30 times. Finally, let's assume that the mining application uses a minimum support level of 50 occurrences. Then the resulting rule for Ben & Jerry's ice cream may look like:

IF
DCS = ICE CREAM AND
MFR = BEN & JERRY'S
THEN
UNDERFORECAST 10 CASES

Note that the flavor attribute does not enter the rule because of insufficient support. The recommendation to adjust the forecast by 10 cases is inferred from the historical records where pattern {DCS = ICE CREAM, MFR = BEN & JERRY'S} appears in association with an underforecast outcome. The same rule can be rewritten as:

IF
DCS = ICE CREAM AND
MFR = BEN & JERRY'S AND
FLAVOR = ANY
THEN
UNDERFORECAST 10 CASES

Note, however, that the following set of rules would produce a more accurate forecast for Ben & Jerry's ice cream:

IF
DCS = ICE CREAM AND
MFR = BEN & JERRY'S AND
FLAVOR = ALL BUT CHERRY AND COFFEE
THEN
UNDERFORECAST 10 CASES

AND
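The ice cream example can be made concrete with a small sketch. The data below are invented for illustration (40 flavors with 5 promotion events each, which respects the "no flavor appears more than 30 times" condition), and the `covered` helper and the dictionary-based pattern format are our own assumptions, not the KDS implementation. The sketch only shows why a single-flavor pattern fails the minimum support test while a set-valued flavor pattern passes it.

```python
MIN_SUPPORT = 50  # minimum number of matching records, as in the example

# Hypothetical data: 40 flavors, 5 promotion events each (200 events in total).
EXCEPTIONS = {"CHERRY", "COFFEE"}
FLAVORS = ["CHERRY", "COFFEE"] + [f"FLAVOR_{i}" for i in range(1, 39)]

records = []
for flavor in FLAVORS:
    # Assumption: every flavor except cherry and coffee is underforecasted.
    outcome = "OK" if flavor in EXCEPTIONS else "UNDERFORECAST"
    for _ in range(5):
        records.append({"DCS": "ICE CREAM", "MFR": "BEN & JERRY'S",
                        "FLAVOR": flavor, "outcome": outcome})


def covered(pattern, records):
    """Return the records matching every attribute of the pattern.

    A pattern value is either a single value or a set of allowed values
    (a set-valued feature).
    """
    def matches(rec):
        for attr, allowed in pattern.items():
            values = allowed if isinstance(allowed, (set, frozenset)) else {allowed}
            if rec[attr] not in values:
                return False
        return True
    return [rec for rec in records if matches(rec)]


base = {"DCS": "ICE CREAM", "MFR": "BEN & JERRY'S"}
single_flavor = covered({**base, "FLAVOR": "FLAVOR_1"}, records)
simple_rule = covered(base, records)
grouped_rule = covered({**base, "FLAVOR": set(FLAVORS) - EXCEPTIONS}, records)

for name, hits in [("single flavor", single_flavor),
                   ("simple rule", simple_rule),
                   ("set-valued rule", grouped_rule)]:
    wrong = sum(1 for rec in hits if rec["outcome"] != "UNDERFORECAST")
    print(name, "support:", len(hits),
          "eligible:", len(hits) >= MIN_SUPPORT,
          "wrong adjustments:", wrong)
```

In this toy data the simple rule has enough support but also adjusts the cherry and coffee records, which do not need it, whereas the set-valued rule keeps sufficient support and leaves those records alone.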
we limited our set to 8,195 records, which were randomly selected from the output of the PromoCast Forecast module. A total of 4,195 records were retained for calibration purposes (training set) and 4,000 records were used in the holdout sample. When compared against the actual promotion outcome, 3,184 records in the holdout sample required corrective actions. Of these, 1,217 (30%) were underforecasted by the PromoCast Forecast module, and 1,967 (49%) were overforecasted.

Performance Measures

Recall that the Corrective Action Generator makes adjustments to the demand predictions generated by the Forecast module. Given that our approach applies to the mining module (and, accordingly, does not interfere with the statistical methods of the Forecast module), it is sufficient to compare the performance results of the proposed new version of the Corrective Action Generator with the results produced by the original module. Therefore, the following discussion pertains to the accuracy of corrective actions, but not to the forecast as a whole.

The Corrective Action Generator first identifies records for which matching rules can be found, and then applies adjustments according to these rules. We count the number of records where suggested adjustments are appropriate and penalize for incorrect adjustments to evaluate algorithm performance. Accordingly, the percentage improvement in the forecast should not be interpreted as absolute but rather as improvements in the number of correctly adjusted records. Therefore, for our holdout sample, a reported improvement of 100% would correspond to correct identification and correction of 3,184 records which need adjustment.

RESULTS

The original Corrective Action Generator attempted to correct 917 of 4,000 records found in the holdout sample. Of these 917 records, 647 were adjusted correctly while 270 records received a wrong adjustment. The resulting success rate was 70.48%. Overall improvement in terms of records was 377 of 3,184 (11.84%). The Corrective Action Generator created a total of 71 simple rules using a minimum support of 50 records.

The extended version of the KDS algorithm produced an additional 85 grouping rules, for a total of 156 rules. The Corrective Action Generator with grouping support attempted to correct 2,805 of 4,000 records in the holdout sample. Based on knowledge contained in composite rules, 2,581 adjustments were made; the rest (224 predictions) came from simple rules. Note that in the original version of the algorithm, all predictions are based on simple rules. The success rate was 78.29%. Overall improvement in terms of records was 1,592 of 3,184 (50.0%). When compared with results produced by the original Corrective Action Generator, we observe significant improvements in performance.

RULES SPACE OPTIMIZATION

In the previous section, we showed how mining and grouping of rare patterns significantly improves forecast performance. In this section, we briefly touch on some issues that are common to systems built on rule-induction algorithms. As described earlier, rule-induction algorithms produce knowledge by means of scanning through historical records in search of patterns. Knowledge is formalized in the form of rules in the following format: pattern → prediction. In a typical real-life application, it is common to have a large number of rules generated by popular data mining algorithms (including the one described in this study). Large numbers of rules tend to have high degrees of redundancy. By redundancy, we mean that multiple rules apply to the same record and produce identical predictions. To clarify this idea, we suggest the following analogy:

Assume that a certain amount of Knowledge K could be represented by a collection of Rules S. If it is possible to remove some subset of Rules s from the original set S without sacrificing any amount of Knowledge K represented by the remaining set S*, then we say that the original set S is not optimal or contains redundant rules.

Depending on the specific application, generated rules could be used either in automatic mode (without human intervention, as is the case with PromoCast) or presented for further analysis to the human agent (e.g., category manager). While in the former case, the number of rules constituting knowledge might be irrelevant (ignoring considerations of additional storage-space requirements and performance issues);
in the latter case, it may directly impact a system's overall usability. Many business settings dictate the necessity for a human expert to conduct the rules screening. Mindless or blind application of automatically generated rules even may have legal consequences, especially in systems where the results of rule application are transparent to the customer, such as dynamic pricing or product recommendations in online retailers. Given the necessity of screening, then, consider the example of a manager who would need to analyze only 20 rules generated by a data mining application versus 300 rules. The problem of not optimizing a set of rules (e.g., having 280 redundant rules) is immediately apparent.

We suggest that by applying simple optimization techniques to the set of generated rules, the size of the rules space could be considerably reduced without significant loss of predictive ability. The resulting reduced rule set is much easier for managers to analyze and screen. We demonstrate this idea using the rules set produced by the Corrective Action Generator in application to the promotion-forecast task discussed in the previous section.

Optimization Method

The goal of the optimization algorithm is to identify and drop redundant rules because they do not add much value in terms of prediction accuracy. The situation here is very similar to the case of regression with highly correlated variables. There, too, we have a lot of redundancies among the variables because the variables are highly correlated. So, if we drop a particular variable out of the regression, it may not hurt accuracy at all because its effect already may be captured via another variable that it is correlated with. In regression, this is called the best K-subset selection problem, where we choose various values of K from 1 to the maximum number of predictors. Then, for each value of K, we choose the best K variables using a combinatorial heuristic called stepwise regression. We do an almost identical process here, except that instead of identifying the subset of variables, we are identifying the best subset of rules. We choose various values of K from 1 to the maximum number of rules generated by the data mining algorithm. Then, for each value of K, we choose the best K rules using a combinatorial heuristic.

Performance (or predictive ability) of a rules set of size K depends on the quality of the rules constituting the set (accuracy in adjustments) and its scope. By scope, we mean the percentage of records in the historical database to which a rules set may be applied. Indeed, an extremely accurate rules set which is applicable only to very few records is not useful, knowing that most predictions produced by the statistical forecaster (e.g., 75% in our dataset) require some adjustment. The applicability of a rules set is determined by the match between patterns in the rules and patterns found in the dataset. For example, consider the record of the following format:

{DCS = ICE CREAM, MFR = BEN & JERRY'S, FLAVOR = CHERRY}

Let us assume that the data mining algorithm has generated rules shown in Table 1.
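The subset-selection idea described in this section can be sketched as a greedy forward search. This is an illustration under our own assumptions rather than the authors' procedure: the rule representation, the first-matching-rule convention, the stopping criterion, and the toy data are all hypothetical. The objective follows the performance measure used in this study, counting correct adjustments and penalizing incorrect ones.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]                      # attribute values plus an "outcome" field
Rule = Tuple[Callable[[Record], bool], str]  # (pattern matcher, predicted outcome)


def net_correct(rules: List[Rule], records: List[Record]) -> int:
    """Correct adjustments minus incorrect ones over all records."""
    score = 0
    for rec in records:
        for matches, prediction in rules:
            if matches(rec):
                # Assumption: the first matching rule decides the adjustment.
                score += 1 if prediction == rec["outcome"] else -1
                break
    return score


def greedy_select(all_rules: List[Rule], records: List[Record], k: int) -> List[Rule]:
    """Add, one at a time, the rule that most improves the net score.

    Stops at k rules or as soon as no remaining rule improves the score.
    """
    chosen: List[Rule] = []
    remaining = list(all_rules)
    while remaining and len(chosen) < k:
        best = max(remaining, key=lambda r: net_correct(chosen + [r], records))
        if net_correct(chosen + [best], records) <= net_correct(chosen, records):
            break
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Toy data (hypothetical): four underforecasted records and one cherry record
# that needs no adjustment.
records = [{"DCS": "ICE CREAM", "MFR": "BEN & JERRY'S", "FLAVOR": f,
            "outcome": "OK" if f == "CHERRY" else "UNDERFORECAST"}
           for f in ["VANILLA", "MINT", "FUDGE", "CARAMEL", "CHERRY"]]

rule_mfr: Rule = (lambda r: r["MFR"] == "BEN & JERRY'S", "UNDERFORECAST")
rule_grouped: Rule = (lambda r: r["MFR"] == "BEN & JERRY'S"
                      and r["FLAVOR"] != "CHERRY", "UNDERFORECAST")
rule_dcs: Rule = (lambda r: r["DCS"] == "ICE CREAM", "UNDERFORECAST")

all_rules = [rule_mfr, rule_grouped, rule_dcs]
selected = greedy_select(all_rules, records, k=3)
print(len(selected), net_correct(selected, records), net_correct(all_rules, records))
```

In this toy case a single rule is retained because the other two only duplicate or worsen its adjustments, which is the kind of redundancy the optimization is meant to remove. A stepwise variant that also drops rules would need a guard against cycling.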
3. In {Rule_i}All, search for the rule {Rule_c} which, if added to the Candidate Set, produces the largest

6 The proposed procedure also should control for possible cyclic behavior.

FIGURE 2
Rules Space Optimization

FIGURE 3
Number of Correct Predictions as a Function of the Rules Set Size (x-axis: rules space size, number of rules)

REFERENCES

Berry, M.J.A., & Linoff, G.S. (2004). Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management (2nd ed.). Hoboken, NJ: Wiley.

Cohen, W.W. (1996). Learning Trees and Rules With Set-Valued Features. Proceedings of the 8th annual conference on Innovative Applications of Artificial Intelligence, Portland, OR.

Cooper, L.G., Baron, P., Levy, W., Swisher, M., & Gogos, P. (1999). PromoCast: A New Forecasting Method for Promotion Planning. Marketing Science, 18(3), 301-316.

Cooper, L.G., & Giuffrida, G. (2000). Turning Datamining Into a Management Science Tool. Management Science, 46(2), 249-264.

Divakar, S., Ratchford, B.T., & Shankar, V. (2005). CHAN4CAST: A multichannel, multiregion sales forecasting model and decision support system for consumer packaged goods. Marketing Science, 24(3), 334-350.

Giuffrida, G., Chu, W., & Hanssens, D.M. (2000). Mining Classification Rules From Datasets With Large Number of Many-Valued Attributes. In Lecture Notes in Computer Science (Vol. 1777, pp. 335-349). New York: Springer Berlin/Heidelberg.

Han, J., Wang, J., Lu, Y., & Tzvetkov, P. (2002). Mining Top-K Frequent Closed Patterns Without Minimum Support. Proceedings of the 2002 international conference on Data Mining, Maebashi, Japan.

Hanssens, D.M. (1998). Order Forecasts, Retail Sales and the Marketing Mix for Consumer Durables. Journal of Forecasting, June-July, (17), 327-346.

Klayman, J., Soll, J.B., González-Vallejo, & Barlas, S. (1999). Overconfidence: It Depends on How, What, and Whom You Ask. Organizational Behavior and Human Performance, 79(3), 216-247.

Krycha, K.A. (1999). Übung aus Verfahren der Marktforschung Endbericht [Case Study Growmart: The Effects of Promotional Activities on Sales]. Universität Wien, Institute für Betriebswirtschaftslehre.

Meehl, P.E. (1954). Clinical Versus Statistical Prediction. Minneapolis: University of Minnesota Press.

Srinivasan, S., Pauwels, K., Hanssens, D.M., & Dekimpe, M. (2004). Do Promotions Benefit Manufacturers, Retailers or Both? Management Science, 50(5), 617-629.