All content following this page was uploaded by Stefan Lessmann on 06 August 2017.
Abstract. Supervised classification embraces theories and algorithms for disclosing patterns within large,
heterogeneous data streams. Several empirical experiments in various domains including medical diagnosis,
drug design, document and image classification as well as text recognition have proven its effectiveness to
solve complex forecasting and identification tasks. This paper considers applications of classification within
the scope of customer relationship management (CRM). Representative operational planning tasks are
reviewed to describe the potential and limitations of classification analysis. To that end, a survey of the
relevant literature is given to summarize the body of knowledge in each field and identify similarities across
applications. The discussion provides a general understanding of technical and managerial challenges
encountered in typical CRM applications and indicates promising areas for future research.
1 Introduction
In times of changing market structures and increasing competition, companies seek novel ways of differentiation
to gain competitive advantages. This development gave rise to management philosophies like CRM which
emphasize customers as strategic assets whose values have to be evaluated, maintained and escalated. On an
operational level, several customer-centric decision problems are considered within the general framework of
CRM. Despite task specific particularities, the common denominator embracing many different applications is
that they may be approached by means of supervised classification. Consequently, this modelling paradigm is of
pivotal importance for corporate data mining.
Classification, also referred to as pattern recognition (see, e.g., [31; 53; 100]), comprises theories and
algorithms to disclose patterns from past observations, which, in turn, facilitate predicting future events. In
particular, a classification model, or classifier, enables estimating the membership of examples to predefined
groups. For example, churn prediction aims at identifying customers who are about to abandon their relationship
with a company, so that proactive marketing actions may be taken to prevent customer attrition. A classification
algorithm supports this task by processing records of previous churners and non-churners and deriving an
input/output mapping between observable customer characteristics (e.g., age, gender, duration of service
subscription, service usage patterns, etc.) and a target variable (e.g., whether or not a particular customer with
particular attributes has churned in the past). The resulting model may be used subsequently to identify
customers who are at the risk of churning, i.e., who exhibit characteristics similar to previous churners.
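The churn example above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's method: the data, the attribute names (age, subscription duration, service usage) and the use of scikit-learn's logistic regression are all invented for exposition.

```python
# Hypothetical sketch: deriving an input/output mapping from past churners
# and non-churners. All data and feature names are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
# x: observable customer characteristics
X = np.column_stack([
    rng.integers(18, 80, n),     # age
    rng.integers(1, 120, n),     # duration of service subscription (months)
    rng.uniform(0, 100, n),      # service usage
])
# y: target variable -- 1 if the customer churned in the past, 0 otherwise.
# Here churn probability falls with tenure and usage (a made-up rule).
y = (rng.uniform(size=n)
     < 1 / (1 + np.exp(0.03 * X[:, 1] + 0.02 * X[:, 2] - 3))).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)  # learn the mapping
scores = model.predict_proba(X)[:, 1]                # churn risk per customer
at_risk = np.argsort(scores)[::-1][:10]              # most churner-like customers
```

In practice the scores would of course be computed for current customers rather than the training records, and the ranked list would drive the proactive marketing action.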
Although respective decision problems are well established in the literature (see, e.g., [7; 70; 76]), they
nowadays attract increasing interest in corporate practice due to the availability of large amounts of data about
customers/consumers as well as techniques and tools to build powerful forecasting models in a widely automated
manner. Extending operational decision support, the knowledge contained in such models facilitates a deeper
understanding of customer behaviour and may consequently enable a refinement and optimization of customer-
centric business processes to support the strategic objective of immunizing customers against competitors’ offers
and gaining competitive advantages in globalized markets.
This environment dictates three major requirements classifiers have to fulfil to support managerial decision
making. First, predictive accuracy is of pivotal importance as even small improvements may induce
significant financial gains [16]. Second, considerations with respect to discovering interesting and
understandable knowledge in data [37] may enforce the usage of transparent and explainable classifiers while
prohibiting the application of opaque (nonlinear) models. Finally, the efficiency of data analysis processes
benefits from a high degree of automation and imposes further constraints associated with the usability,
flexibility and robustness of forecasting procedures. Obviously, these requirements cannot be satisfied
simultaneously to the same degree and a balance between conflicting objectives has to be found. The prevailing
approach within the literature is to focus on, e.g., modelling accuracy or the construction of comprehensible
classifiers for a given task while paying only moderate attention to other criteria and/or applications. Therefore,
the application perspective is emphasized within this paper to review different approaches under a unifying
umbrella. While this paper cannot be fully comprehensive in its literature review, we focus on pointing out most
important aspects of each decision problem to highlight challenges as well as practical constraints.
The paper is organized as follows: The theory of supervised classification is reviewed in Section 2, together
with a brief discussion of algorithmic approaches and performance measurement techniques. Section 3 describes
three exemplary applications of classification in customer-centric data mining: namely response modelling for
direct marketing, customer churn prediction and fraud detection. These applications are considered well
representative for CRM and have been selected because of their popularity within the literature and practical
relevance. Conclusions are drawn in Section 4.
2 Supervised Classification
The task of classification for CRM can be defined as follows: Let S be a dataset of N observations,
S = {( xi , yi )}iN=1 . The vectors xi ∈ ℜ M represent individual customers, each of which is characterized by M
attributes, e.g., demographic data or information concerned with past business transactions. The dependent
variable is denoted by yi and represents some behavioural trait, e.g., whether or not the respective customer has
abandoned his relationship with the company or has responded to direct mail in the past. In a classification
setting, the dependent variable is discrete. That is, yi consists of nominal values that represent individual classes.
This paper considers only the case of binary classification ( yi ∈ {0,1} ) which is the prevailing modelling
approach in CRM.1 A classification model can be defined as a mapping from examples (customers) x to
classes y:
f(x): ℜ^M → y    (1)
In other words, the idea is that some functional relationship between the independent variables and the
dependent variable exists, which allows predicting y if only x is known. The nature of the relationship (i.e. the
function f) is, however, unknown and has to be approximated from past observations (S) in an inductive manner.
The traditional statistical perspective towards this step grounds on the Bayes theorem which states that the a
posteriori probability of x belonging to class 1, p ( y = 1| x ) , can be expressed in terms of the class-conditional
probability, p ( x | y = 1) , and a priori probabilities p ( y = 1) and p ( x ) [42]:
p(y = 1 | x) = p(x | y = 1) · p(y = 1) / p(x)    (2)
A so called Bayes optimal classifier is obtained if examples are assigned to the class with maximal a
posteriori probability. Hence, to construct a classification model (1), the a posteriori probabilities have to be
estimated from S [44].
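The Bayes optimal decision of equation (2) can be made concrete with a small worked example. The priors and class-conditional Gaussians below are invented purely for illustration; in practice they would be estimated from S.

```python
# Minimal sketch of a Bayes optimal classifier, assuming (for illustration)
# known priors and known Gaussian class-conditional densities.
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

prior = {0: 0.9, 1: 0.1}                 # p(y): e.g., 10% of customers churn
cond = {0: (2.0, 1.0), 1: (5.0, 1.0)}    # p(x|y) parameters: (mean, std)

def posterior(x, y):
    # Bayes theorem (2): p(y|x) = p(x|y) p(y) / p(x)
    evidence = sum(gaussian_pdf(x, *cond[c]) * prior[c] for c in prior)
    return gaussian_pdf(x, *cond[y]) * prior[y] / evidence

def bayes_classify(x):
    # assign to the class with maximal a posteriori probability
    return max(prior, key=lambda y: posterior(x, y))
```

Note how the strongly imbalanced prior pulls decisions towards class 0 unless the class-conditional evidence for class 1 is substantial.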
Traditional statistical classifiers like logistic regression or linear/quadratic discriminant analysis adopt the
aforementioned approach. The former strives to model p ( y | x ) by means of a logistic function, whereas
linear/quadratic discriminant analysis estimates class conditional probabilities, assuming them to be Gaussian.
Subsequently, a posteriori probabilities can be obtained via (2). A closely related approach is implemented in the
so called Naïve Bayes classifier. These methods are called parametric since particular functional forms are
assumed (e.g., logistic or Gaussian) whose parameters are estimated from S, normally by means of maximum
likelihood procedures. These basic methods as well as state-of-the-art enhancements are described in detail in
statistical textbooks (see, e.g., [31; 44]).
Recursive partitioning methods, also known as decision trees, are particularly popular within the machine
learning community. They classify examples by traversing a sequence of questions, or rules, until reaching a leaf
node which determines the final class. The rules are derived from the data in such a way that each question
partitions the examples into purer subgroups. That is, the prevalence of one particular class in the subsets is
higher than before separating the data. A variety of different paradigms have been suggested in the literature,
differing mainly in the splitting criterion which determines the attribute used in a given iteration to separate the
data. For example, the popular C4.5 algorithm induces decision trees based on the information-theoretical
concept of entropy [86] and assesses the expected reduction in entropy resulting from splitting on a specific attribute
value. The split realizing the maximal reduction is selected. Class predictions are based on the concentration of
records of a particular class within a leaf node.
1 Generalizations to multi-class settings are straightforward since a classification problem with multiple classes can always
be transformed into multiple binary problems (see, e.g., [2]).
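The entropy-based splitting criterion can be worked through numerically. The node contents below are invented; the computation itself is the standard information-gain formula used by C4.5-style trees.

```python
# Worked example of the entropy splitting criterion (illustrative values).
import math

def entropy(labels):
    """Entropy in bits of a node's class distribution."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

parent = [1, 1, 1, 1, 0, 0, 0, 0]            # maximally mixed node: 1 bit
left, right = [1, 1, 1, 0], [1, 0, 0, 0]     # candidate split on some attribute
# expected reduction in entropy (information gain) from this split:
gain = entropy(parent) - (len(left) / len(parent)) * entropy(left) \
                       - (len(right) / len(parent)) * entropy(right)
```

The split whose gain is maximal over all candidate attributes would be selected for the node.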
Artificial neural networks are mathematical representations inspired by the functioning of the human brain.
Many different types have been suggested in the literature, whereby multilayer perceptrons are probably most
popular for classification. These networks are composed of an input layer, one or more hidden layers and an
output layer, each consisting of several neurons. Each neuron processes its inputs and generates one output value
that is transmitted to the neurons in the subsequent layer. The number of neurons in the input layer equals the
dimension of x (i.e. M), whereas the remaining architecture (number of hidden layers as well as number of
neurons per hidden layer) has to be determined empirically. The task of building a neural network classifier
corresponds to determining the weights of the connections between the neurons, i.e. solving a nonlinear
optimization problem. A detailed description of neural networks in classification applications is given by, e.g.,
[11; 46].
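A multilayer perceptron of the kind described above can be sketched with scikit-learn (an assumption; the paper prescribes no library). The architecture and data are arbitrary: four input neurons (M = 4), one hidden layer of eight neurons, determined here by fiat rather than the empirical tuning the text calls for.

```python
# Minimal multilayer perceptron sketch on a synthetic nonlinear target.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))                # M = 4 customer attributes
y = (X[:, 0] * X[:, 1] > 0).astype(int)      # an XOR-like, nonlinear target

# Fitting solves the nonlinear optimization problem over connection weights.
net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                    random_state=1).fit(X, y)
train_acc = net.score(X, y)
```

A linear model cannot separate this XOR-like target, which is precisely the kind of relationship hidden layers exist to capture.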
Support vector machines are closely related to neural networks from a mathematical point of view. However,
the development of neural networks was predominantly driven by practical considerations (e.g., mimicking the
way humans learn), whereas support vector machines originate from statistical learning theory [101]. They
implement the idea of structural risk minimization [100] which extends the classical paradigm of inferring
relationships (learn) from data by optimizing the empirical risk, i.e., the agreement of the classification model
f ( x ) and the empirical observations S. Especially in data mining settings, where the dimensionality of x may be
large, for example because customer-centric data is available in abundance, focussing only on empirical risk
when building a classifier might lead to poor generalization results when the model is applied to predict the class
of unknown examples. Consequently, support vector machines incorporate such considerations directly into the
process of classifier building and strive to construct simple but accurate models. Roughly speaking, models with
lower complexity are less susceptible to overfitting the empirical data S and thus more likely to produce
generalizable results [101].
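The trade-off between empirical risk and model complexity is exposed directly by the regularization parameter C of a soft-margin support vector machine. The following sketch (invented two-class data, scikit-learn assumed) contrasts a complexity-favouring and a data-fitting setting.

```python
# Sketch of the accuracy/complexity trade-off in a linear SVM: C weights
# empirical risk against model simplicity (margin width).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, size=(100, 2)),
               rng.normal(1, 1, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

loose = SVC(kernel="linear", C=0.01).fit(X, y)   # favours a simple, wide margin
tight = SVC(kernel="linear", C=100.0).fit(X, y)  # favours fitting S closely
# the low-C model tolerates more margin violations, hence more support vectors
```

Roughly, small C corresponds to the structural-risk-minimization intuition of preferring simple models that are less prone to overfitting.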
Finally, meta-learning strategies have received considerable attention in the machine learning and statistical
literature. The underlying idea is that predictive accuracy can be improved if the results of several base
classifiers are aggregated, so that a final class prediction corresponds to a vote (e.g., majority vote) of several
base models. Respective approaches commonly incorporate relatively simple statistical or tree-based methods
and implement different strategies to construct individual members of the classifier committee to achieve a
desirable degree of diversity. Diversity refers to the property of ensemble members to focus on different
idiosyncrasies of the classification task and the data, respectively. In other words, the base classifiers have to
complement each other to truly improve accuracy. Therefore, they can be derived from different training
samples, i.e. different sub-samples of S, and/or by using different attribute subsets. The AdaBoost algorithm [39]
as well as the random forest classifier [15] are among the most popular ensemble approaches. These algorithms
may represent the most accurate classifiers available today (see, e.g., [40]).
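The diversity mechanisms named above (different training sub-samples, different attribute subsets) are both built into the random forest. A brief sketch on invented data, with scikit-learn assumed:

```python
# Sketch of a bagging-style ensemble: each tree sees a bootstrap sample of S
# and considers a random attribute subset at each split, yielding diverse
# base classifiers whose votes are aggregated.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 6))
y = ((X[:, 0] + X[:, 1] ** 2) > 1).astype(int)   # a nonlinear target

forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=3).fit(X, y)
# final prediction = majority vote over the 100 base trees
votes = np.mean([t.predict(X) for t in forest.estimators_], axis=0)
```

The fraction of trees voting for class 1 (`votes`) also serves as a continuous confidence score, which is exactly what ROC analysis (Section 2) requires.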
Assessing the predictive performance of a classification model is a non-trivial endeavour. In particular, one has
to select an appropriate accuracy indicator and decide upon a scheme for estimating the model’s performance on
future data. Reserving a certain part of the available data as hold-out sample, k-fold cross-validation or bootstrap
sampling are common choices for the latter task (see, e.g., [61]). On the other hand, which metric is a suitable
candidate to measure the performance of a classification model depends on the particular task.
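The two estimation schemes mentioned above, a hold-out sample and k-fold cross-validation, can be sketched side by side. Data and model choice are again invented; scikit-learn is assumed.

```python
# Sketch: estimating a model's performance on future data via a hold-out
# split and via 5-fold cross-validation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# hold-out: reserve 30% of S, fit on the rest, assess on the reserved part
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=4)
holdout_acc = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)

# 5-fold cross-validation: every example is used for testing exactly once
cv_acc = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
```

Cross-validation uses the data more economically than a single hold-out split, at the cost of fitting the model k times.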
The most intuitive way to assess a classifier is by counting the number of correctly predicted examples over
hold-out data. Considering binary classification, this procedure leaves four possible outcomes: If the
example belongs to class 1 and is classified as such, it is counted as a true positive (TP), and as a false negative
(FN) otherwise. If the example is a member of class 0 and classified accordingly it is counted as a true negative
(TN), and as a false positive (FP) in the opposite case. Such results can be illustrated by means of a confusion
matrix (Fig. 1) which represents the dispositions of the classified examples and constitutes the basis for many
common performance metrics.
                     True class
                      1      0
Predicted class  1    TP     FP
                 0    FN     TN

accuracy (1 − classification error) = (TN + TP) / (TN + TP + FN + FP)
sensitivity (true positive rate) = TP / (FN + TP)
specificity (true negative rate) = TN / (TN + FP)

Fig. 1. Confusion matrix of a binary classifier and derived performance metrics.
Classification accuracy is a symmetric metric that assigns equal importance to both error types. In other
words, the implicit assumption of accuracy-based evaluation is that misclassification costs as well as the number
of class 1 and class 0 examples are approximately equal [82; 83]. As will be shown later in the paper, these
assumptions are normally violated in customer-centric applications: Customers who are relevant from a business
perspective (e.g., those who respond to direct mail or are about to churn) naturally represent minorities. In
addition, it is also more costly to misclassify one of these customers. Hence, accuracy and similarly classification
error (i.e., the percentage of misclassified points) are conceptually inappropriate performance measures for CRM
applications. One can alleviate this problem to some extent by considering the arithmetic or geometric mean of
sensitivity and specificity [63]. Such measures strive to maximize class individual accuracies while keeping them
balanced.
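A worked example shows why accuracy misleads under class imbalance while the class-individual metrics do not. The confusion-matrix counts below are invented, with a 10% minority class as is typical in CRM.

```python
# Worked example: accuracy vs. class-individual metrics under imbalance.
import math

TP, FN = 60, 40        # class 1: the 100 business-relevant customers
TN, FP = 880, 20       # class 0: the 900 remaining customers

accuracy    = (TP + TN) / (TP + TN + FP + FN)   # 0.94 -- deceptively good
sensitivity = TP / (TP + FN)                    # 0.60 -- 40% of churners missed
specificity = TN / (TN + FP)                    # ~0.978
g_mean      = math.sqrt(sensitivity * specificity)   # balanced summary
```

A trivial classifier predicting class 0 for everyone would score 0.90 accuracy here but zero sensitivity; the geometric mean exposes this immediately.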
However, a better way to evaluate the performance of classifiers for CRM scenarios is known as receiver
operating characteristics (ROC) analysis. ROC has a long tradition within the medical decision making
community to appraise the performance of diagnostic tests [73]. It became a popular tool in machine learning
due to its ability to assess the predictive performance of classification systems when class distributions and/or
misclassification costs are imbalanced, unknown and/or subject to change. That is, ROC analysis separates
classification performance from class and cost distributions [82; 83].
A ROC graph is a 2-dimensional illustration of sensitivity on the Y-axis versus false positive rate on the X-
axis (Fig. 2). Usually, the output of a classification model is some continuous value representing the confidence
of the model that the example is member of a particular class. For example, some classifiers estimate class
membership probabilities. These values have to be thresholded to obtain discrete class predictions. To produce a
ROC curve, this threshold is varied over all possible values each time producing a different true positive and
false positive rate [35]. Each ROC curve passes through the points (0,0) and (1,1). The former represents a
classifier which always predicts class 0. This strategy gives TP rate = FP rate = zero. The point (1,1) depicts a
situation where all examples are classified as class 1 (lowest possible threshold). The ideal point is the upper-left
corner (0,1) since a respective classifier makes no error (TP rate=1, FP rate=0). Hence, points towards the north-
west are preferable. They represent high TP rates with low FP rates. A straight line through (0,0) and (1,1) sets
the benchmark for any classifier since it represents a strategy of guessing a class at random [84].
To compare different classifiers their respective ROC curves are drawn in ROC space. Fig. 2 provides an
example of three classifiers C1, C2 and C3. The ROC curve of C1 is always above the competitors’ ones.
Consequently, C1 achieves a higher TP rate for all FP rates and dominates its two competitors. ROC curves of
different classifiers may however intersect making a performance comparison less obvious (e.g., curves C2 and
C3). To overcome this problem, one often calculates the area under the ROC curve (AUC) as a single scalar
measure for expected performance [14]. A higher AUC indicates that the classifier is on average more to the
north-west region of the graph. Furthermore, the AUC has an important statistical property: it represents the
probability that a classifier ranks a randomly chosen example of class 1 higher than a randomly chosen example
of the opposite class, which is equivalent to the Wilcoxon test of ranks [35; 43]. In summary, the AUC provides
a simple figure-of-merit representing the predictive performance of a classifier.2
2 Note that the area under the diagonal is 0.5. Hence, a good classifier should always have an AUC>0.5.
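The rank interpretation of the AUC can be verified by direct pair counting: the probability that a random class-1 example is scored above a random class-0 example is just the fraction of correctly ordered pairs. The scores below are invented.

```python
# Sketch of the AUC as a pairwise ranking probability (Wilcoxon statistic).
scores_pos = [0.9, 0.8, 0.6, 0.55]   # model outputs for class-1 examples
scores_neg = [0.7, 0.5, 0.4, 0.3]    # model outputs for class-0 examples

pairs = [(p, n) for p in scores_pos for n in scores_neg]
# count a pair as 1 if ranked correctly, 0.5 if tied, 0 otherwise
auc = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)
```

Here 14 of the 16 pairs are ordered correctly, so the AUC is 0.875, comfortably above the 0.5 of random guessing.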
Fig. 2. The ROC curve of three classifiers C1, C2 and C3.
3 Note that this assumption is not unquestioned within the literature and that some empirical results challenge its validity for
certain industries (see, e.g., [88]).
The identification step requires predictive modelling and is predominantly approached by means of
supervised classification. Given that respective churn models are sufficiently accurate, proactive, targeted CCM
may be more cost efficient since customers are not prepared to negotiate for better deals when being contacted
unexpectedly and thus are likely to accept cheaper incentives [20]. Considering the standard classification
setting, the construction of a churn prediction model requires defining a target variable which
distinguishes churners and loyal customers. Since most research is conducted in the field of subscription-based
businesses,4 the prevailing modelling paradigm is to define customers as churners if they (actively) cancel or
(passively) fail to renew their subscription within a predefined time-frame [20; 38; 96; 107]. Similarly, churn in the
financial service industry may be represented by the fact that no new product is bought after a previously owned
product has expired [65].
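The prevailing target definition can be made operational in a few lines. The field names and dates below are invented; the logic simply encodes "cancelled, or not renewed within the predefined time-frame" as described above.

```python
# Hedged sketch of the subscription-churn target definition.
# Field names and the observation window are hypothetical.
from datetime import date

OBSERVATION_END = date(2008, 12, 31)   # end of the predefined time-frame

def churn_label(cancelled_on, renewed_on):
    """1 = churner, 0 = loyal, per the active/passive churn definition."""
    if cancelled_on is not None and cancelled_on <= OBSERVATION_END:
        return 1                        # active churn: subscription cancelled
    if renewed_on is None or renewed_on > OBSERVATION_END:
        return 1                        # passive churn: no renewal in time
    return 0

labels = [churn_label(date(2008, 6, 1), None),   # cancelled
          churn_label(None, date(2008, 9, 1)),   # renewed in time
          churn_label(None, None)]               # never renewed
```

The chosen observation window matters: a window that is too short understates churn, while one that is too long delays any proactive intervention.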
5 Although ‘lifetime value’ is, by definition, a monetary (i.e. continuous) value, classification analysis can be applied to
model some artefacts of lifetime value, e.g., whether a customer increases or decreases his spending [3].
6 For example, consider the number of daily transactions processed by a major credit card company or the number of daily
calls over the network of a leading telecommunication provider.
Superimposed fraud is detectable if authorized users have a fairly regular behaviour which is distinguishable
from the fraudulent behaviour. Consequently, user profiling is a common approach to counter superimposed
fraud. Such approaches employ supervised classification as one component within a multi-stage fraud detection
process (see, e.g., [21]). A respective system is proposed in [36]: First, so called local rules are derived from
individual accounts using standard rule learning methods. It is crucial to perform this step on account level
because a particular behaviour (e.g., making several long distance calls per day) may be regular for one customer
but highly suspicious for another. Subsequently, the rules are assessed with respect to coverage and a subset is
selected as fraud indicators. That is, the validity of a rule which is derived from one account for other accounts is
appraised and only general rules are maintained. The selected rules are converted into profiling monitors to
achieve sensitivity with respect to individual customers. That is, the ordinary usage pattern of an account is
monitored and the number of calls satisfying a particular rule is computed. This produces a threshold against
which a calling pattern may be compared. The evidence produced by the monitors is then fed into a standard
classifier, i.e. monitors’ outputs constitute the features used for classification.7 The final output is a detector that
profiles each user’s behaviour based on several indicators and produces an alarm if there is sufficient evidence of
unusual, i.e. fraudulent, activity. A similar approach using call data records to learn user signatures is presented in
[109].
The task of detecting subscription fraud is even more difficult because no historical data concerning account
usage is available, so that behavioural monitors may fail. Therefore, it is crucial to integrate applicant data (e.g.,
demographic information, commercial antecedents, etc.) with calling data from other accounts [90]. However,
this complicates the construction of a classifier because of the hierarchical relationship between user and calling
data (i.e., one user makes several calls). To alleviate this problem, a modified C4.5 algorithm is developed in
[90], which distinguishes between splitting on user and behavioural attributes and comprises respective splitting
criteria. Similar to [36], the overall system incorporates rule selection mechanisms to prune the rules learnt in the
first step. Prediction of subscription fraud on the basis of behavioural and customer data is also considered in
[34]. The authors devise fuzzy rules to generate a database of fraud/non-fraud cases and employ a multi-layered
neural network as classification engine.
A meta-learning framework for credit card fraud detection is designed in [22; 23]. The authors elaborate on
problems associated with imbalanced class distributions in fraud detection. To cope with this difficulty, a
partitioning of majority class examples into different sub-datasets is proposed, whereby minority class examples
are replicated across all of these sets. Consequently, all examples are considered during classifier building and
each sub-dataset exhibits a balanced class distribution. Classifiers are built from these subsets and are integrated
by means of (meta-) learning. In particular, stacking is considered for integrating subset results. That is, a meta-
training set is derived from the base classifiers’ predictions on a validation sample and the true class label is used
as target. This approach is further elaborated in [80]. The authors employ stacking to identify particularly
suitable base classifiers (e.g., Naïve Bayes or neural network classifiers) whose predictions are subsequently
integrated by means of Bagging. Similarly, the usage of meta-learning schemes and classifier ensembles for
learning fraud detectors from imbalanced data is reported in [1; 56].
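The partitioning scheme of [22; 23] can be sketched directly: majority-class examples are split across sub-datasets while the minority class is replicated into each, so every example is used and each subset is balanced. The counts below are invented.

```python
# Sketch of balanced sub-dataset construction for imbalanced fraud data.
import numpy as np

rng = np.random.default_rng(5)
majority = np.arange(900)          # indices of legitimate transactions
minority = np.arange(900, 1000)    # indices of the 100 fraud cases

# partition the majority class into as many parts as fit the minority size
n_subsets = len(majority) // len(minority)          # 9 balanced subsets
parts = np.array_split(rng.permutation(majority), n_subsets)
# replicate the minority class into every subset
subsets = [np.concatenate([part, minority]) for part in parts]
# every example is used, and each subset has a ~50/50 class distribution
```

A base classifier would then be trained on each subset and the nine resulting models combined via stacking, as described above.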
4 Conclusions
We have discussed the basic setting of supervised classification as well as the most relevant algorithmic
approaches. Three representative applications within the field of customer-centric data mining have been
selected to demonstrate the potential as well as limitations of classification within this discipline.
Studying the relevant literature on direct marketing, churn prediction and fraud detection, we found that
remarkable similarities exist across these applications. In particular, imbalanced class and cost distributions
complicate classification in each of these fields and require special treatment. Therefore, approaches and
algorithms to overcome these obstacles have been a fruitful area of research. In general, it is reasonable to
assume that economically relevant groups, e.g., customers responding to a direct mail solicitation, will always
represent minorities, so that the relevance of learning from imbalanced data [55] as well as cost-sensitive
learning [105] is expected to continue.
Similarly, the need to construct comprehensible classification models which enable gaining a deeper
understanding of customer behaviour is well established in all three settings. A classifier is generally
considered comprehensible if it reveals the influence and relevance of individual attributes for the classification
decision. For example, several studies in the direct marketing community have explored the importance of RFM-
type variables in comparison to behavioural attributes to identify the most informative predictors. Transparency
and accuracy are usually perceived as conflicting goals. On the one hand, nonlinear modelling may be required
to improve the accuracy of a classifier if the relationship between customer attributes and the dependent
variable is nonlinear. However, respective methods like neural networks or support vector
machines are opaque and do not reveal their internal processing of customer attributes. On the other hand,
understandable linear models may lack expressive power in complex CRM settings. One approach to alleviate
this trade-off may be seen in the combination of nonlinear models and rule extraction algorithms which derive an
understandable representation of the nonlinear classifier (see, e.g., [6; 71; 77]). Despite the value of such
approaches it should be noted that the random forest classifier [15] naturally provides an assessment of attribute
relevance and is capable of modelling nonlinear relationships. Therefore, this method may be the most promising
classifier from a practical point of view and researchers should try to adopt the principles underlying random
forests when crafting novel algorithms.
7 Note that feature selection heuristics are applied since some rules do not perform well in some monitors and some monitors
overlap in fraud coverage.
A noteworthy difference between classification tasks in direct marketing, churn prediction and fraud detection
is associated with the role of the classifier in the decision making process. Whereas it might be feasible to
automate the task of selecting the customers for a mailing campaign to a large extent, classification is only a
single step in the complex process of fraud detection. Here, the objective of classification is to support − rather
than automate − decision making by computing suspicion scores [13] to reduce the number of final-line fraud
investigations. Consequently, an important area of future research is to design integrated analytical systems,
which are tailored to real-world decision problems. Nowadays, the principles of learning from data are well
understood and powerful algorithms for classification as well as other forecasting tasks have been developed.
The next logical step is to proceed into the direction of decision making and focus on the solution of real-world
problems.
Literature
1. Akbani, R.; Kwek, S.; Japkowicz, N.: Applying Support Vector Machines to Imbalanced Datasets. In: Boulicaut, J.-F.;
Esposito, F.; Giannotti, F.; Pedreschi, D. (eds.): Machine Learning – Proc. of the 15th European Conference on Machine
Learning. Springer, Berlin 2004, 39-50.
2. Allwein, E.L.; Schapire, R.E.; Singer, Y.: Reducing multi-class to binary: A unifying approach for margin classifiers. In:
Journal of Machine Learning Research 1 (2000), 113-141.
3. Baesens, B.; Verstraeten, G.; Van den Poel, D.; Egmont-Petersen, M.; Van Kenhove, P.; Vanthienen, J.: Bayesian network
classifiers for identifying the slope of the customer lifecycle of long-life customers. In: European Journal of Operational
Research 156 (2004), 508-523.
4. Baesens, B.; Viaene, S.; Van den Poel, D.; Vanthienen, J.; Dedene, G.: Bayesian neural network learning for repeat
purchase modelling in direct marketing. In: European Journal of Operational Research 138 (2002), 191-211.
5. Banslaben, J.: Predictive Modelling. In: Nash, E.L. (ed.): The Direct Marketing Handbook. McGraw-Hill, New York
1992, 620-636.
6. Barakat, N.H.; Bradley, A.P.: Rule extraction from support vector machines: A sequential covering approach. In: IEEE
Transactions on Knowledge and Data Engineering 19 (2007), 729-741.
7. Bauer, C.L.: A direct mail customer purchase model. In: Journal of Direct Marketing 2 (1988), 16-24.
8. Bennett, K.P.; Wu, S.; Auslender, L.: On Support Vector Decision Trees for Database Marketing. In: Proc. of the Intern.
Joint Conf. on Neural Networks. IEEE Press, Piscataway 1999, 904-909.
9. Berger, P.; Magliozzi, T.: The effect of sample size and proportion of buyers in the sample on the performance of list
segmentation equations generated by regression analysis. In: Journal of Direct Marketing 6 (1992), 13-22.
10.Bhattacharyya, S.: Direct marketing performance modeling using genetic algorithms. In: INFORMS Journal on
Computing 11 (1999), 248-257.
11.Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press, Oxford 1995.
12.Bitran, G.R.; Mondschein, S.V.: Mailing decisions in the catalog sales industry. In: Management Science 59 (1996), 1364-
1381.
13.Bolton, R.J.; Hand, D.J.: Statistical fraud detection: A review. In: Statistical Science 17 (2002), 235–255.
14.Bradley, A.P.: The use of the area under the ROC curve in the evaluation of machine learning algorithms. In: Pattern
Recognition 30 (1997), 1145-1159.
15.Breiman, L.: Random forests. In: Machine Learning 45 (2001), 5-32.
16.Buckinx, W.; Van den Poel, D.: Customer base analysis: Partial defection of behaviourally loyal clients in a non-
contractual FMCG retail setting. In: European Journal of Operational Research 164 (2005), 252-268.
17.Buckinx, W.; Verstraeten, G.; Van den Poel, D.: Predicting customer loyalty using the internal transactional database. In:
Expert Systems with Applications 32 (2007), 125-134.
18.Bucklin, R.E.; Lattin, J.M.; Ansari, A.; Gupta, S.; Bell, D.; Coupey, E.; Little, J.D.C.; Mela, C.; Montgomery, A.; Steckel,
J.: Choice and the internet: From clickstream to research stream. In: Marketing Letters 13 (2002), 245-258.
19.Bult, J.R.; Wansbeek, T.: Optimal selection for direct mail. In: Marketing Science 14 (1995), 378-394.
20.Burez, J.; Van den Poel, D.: CRM at a pay-TV company: Using analytical models to reduce customer attrition by targeted
marketing for subscription services. In: Expert Systems with Applications 32 (2007), 277-288.
21.Burge, P.; Shawe-Taylor, J.; Cooke, C.; Moreau, Y.; Preneel, B.; Stoermann, C.: Fraud Detection and Management in
Mobile Telecommunications Networks. In: IEE (ed.): ECOS97 – Proc. of the 2nd European Convention on Security and
Detection. IEE, London 1997, 91-96.
22.Chan, P.K.; Fan, W.; Prodromidis, A.L.; Stolfo, S.J.: Distributed data mining in credit card fraud detection. In: IEEE
Intelligent Systems 14 (1999), 67-74.
23.Chan, P.K.; Stolfo, S.J.: Toward Scalable Learning with Nonuniform Class and Cost Distributions: A Case Study in Credit
Card Fraud Detection. In: Agrawal, R.; Stolorz, P.E.; Piatetsky-Shapiro, G. (eds.): KDD'98 – Proc. of the 4th Intern. Conf.
on Knowledge Discovery and Data Mining. AAAI Press, Menlo Park 1998, 164-168.
24.Coenen, F.; Swinnen, G.; Vanhoof, K.; Wets, G.: The improvement of response modeling: Combining rule-induction and
case-based reasoning. In: Expert Systems with Applications 18 (2000), 307-313.
25.Coussement, K.; Van den Poel, D.: Churn prediction in subscription services: An application of support vector machines
while comparing two parameter-selection techniques. In: Expert Systems with Applications 34 (2008), 313-327.
26.Crone, S.F.; Lessmann, S.; Stahlbock, R.: The impact of preprocessing on data mining: An evaluation of classifier
sensitivity in direct marketing. In: European Journal of Operational Research 173 (2006), 781-800.
27.Cui, D.; Curry, D.: Predictions in marketing using the support vector machine. In: Marketing Science 24 (2005), 595-615.
28.Cui, G.; Wong, M.L.: Implementing neural networks for decision support in direct marketing. In: International Journal of
Market Research 46 (2004), 235-254.
29.Cui, G.; Wong, M.L.; Lui, H.-K.: Machine learning for direct marketing response models: Bayesian networks with
evolutionary programming. In: Management Science 52 (2006), 597-612.
30.Deichmann, J.; Eshghi, A.; Haughton, D.; Sayek, S.; Teebagy, N.: Application of multiple adaptive regression splines
(MARS) in direct response modeling. In: Journal of Interactive Marketing 16 (2002), 15-27.
31.Duda, R.O.; Hart, P.E.; Stork, D.G.: Pattern Classification. Wiley, New York 2001.
32.Eiben, A.E.; Euverman, T.J.; Kowalczyk, W.; Peelen, E.; Slisser, F.; Wesseling, J.A.M.: Comparing Adaptive and
Traditional Techniques for Direct Marketing. In: Zimmermann, H.-J. (ed.): EUFIT'96 – Proc. of the 4th European
Congress on Intelligent Techniques and Soft Computing. Verlag Mainz, Aachen 1996, 434-437.
33.Emmanouilides, C.; Hammond, K.: Internet usage: Predictors of active users and frequency of use. In: Journal of Interactive
Marketing 14 (2000), 17-32.
34.Estevez, P.A.; Held, C.M.; Perez, C.A.: Subscription fraud prevention in telecommunications using fuzzy rules and neural
networks. In: Expert Systems with Applications 31 (2006), 337-344.
35.Fawcett, T.: An introduction to ROC analysis. In: Pattern Recognition Letters 27 (2006), 861-874.
36.Fawcett, T.; Provost, F.: Adaptive fraud detection. In: Data Mining and Knowledge Discovery 1 (1997), 291-316.
37.Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P.: From data mining to knowledge discovery in databases: An overview. In:
AI Magazine 17 (1996), 37-54.
38.Ferreira, J.B.; Vellasco, M.: Data Mining Techniques on the Evaluation of Wireless Churn. In: Verleysen, M. (ed.):
Trends in Neurocomputing – Proc. of the 12th European Symposium on Artificial Neural Networks. Elsevier, Amsterdam
2004, 483-488.
39.Freund, Y.; Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. In:
Journal of Computer and System Science 55 (1997), 119-139.
40.Friedman, J.H.: Recent advances in predictive (machine) learning. In: Journal of Classification 23 (2006), 175-197.
41.Friedman, N.; Geiger, D.; Goldszmidt, M.: Bayesian network classifiers. In: Machine Learning 29 (1997), 131-163.
42.Hand, D.J.: Construction and Assessment of Classification Rules. John Wiley, Chichester 1997.
43.Hanley, J.A.; McNeil, B.J.: The meaning and use of the area under the receiver operating characteristic (ROC) curve. In:
Radiology 143 (1982), 29-36.
44.Hastie, T.; Tibshirani, R.; Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction.
Springer, New York 2002.
45.Haughton, D.; Oulabi, S.: Direct marketing modeling with CART and CHAID. In: Journal of Direct Marketing 11 (1997),
42-52.
46.Haykin, S.S.: Neural Networks: A Comprehensive Foundation. Prentice Hall, Upper Saddle River 1999.
47.He, H.; Wang, J.; Graco, W.; Hawkins, S.: Application of neural networks to detection of medical fraud. In: Expert
Systems with Applications 13 (1997), 329-336.
48.Heilman, C.M.; Kaefer, F.; Ramenofsky, S.D.: Determining the appropriate amount of data for classifying consumers for
direct marketing purposes. In: Journal of Interactive Marketing 17 (2003), 5-28.
49.Hoath, P.: Telecoms fraud, the gory details. In: Computer Fraud & Security 1998 (1998), 10-14.
50.Hung, S.-Y.; Yen, D.C.; Wang, H.-Y.: Applying data mining to telecom churn management. In: Expert Systems with
Applications 31 (2006), 515-524.
51.Hur, Y.; Lim, S.: Customer Churning Prediction Using Support Vector Machines in Online Auto Insurance Service. In:
Wang, J.; Liao, X.; Yi, Z. (eds.): Advances in Neural Networks – Proc. of the 2nd Intern. Symposium on Neural
Networks. Springer, Berlin 2005, 928-933.
52.Hwang, H.; Jung, T.; Suh, E.: An LTV model and customer segmentation based on customer value: A case study on the
wireless telecommunication industry. In: Expert Systems with Applications 26 (2004), 181-188.
53.Jain, A.K.; Duin, R.P.W.; Mao, J.: Statistical pattern recognition: A review. In: IEEE Transactions on Pattern Analysis and
Machine Intelligence 22 (2000), 4-37.
54.Jain, D.; Singh, S.S.: Customer lifetime value research in marketing: A review and future directions. In: Journal of
Interactive Marketing 16 (2002), 34-46.
55.Japkowicz, N.; Stephen, S.: The class imbalance problem: A systematic study. In: Intelligent Data Analysis 6 (2002), 429-
450.
56.Kim, H.-C.; Pang, S.; Je, H.-M.; Kim, D.; Bang, S.Y.: Constructing support vector machine ensemble. In: Pattern
Recognition 36 (2003), 2757-2767.
57.Kim, S.; Shin, K.-S.; Park, K.: An Application of Support Vector Machines for Customer Churn Analysis: Credit Card
Case. In: Wang, L.; Chen, K.; Ong, Y.S. (eds.): Advances in Natural Computation – Proc. of the 1st Intern. Conf. on
Advances in Natural Computation. Springer, Berlin 2005, 636-647.
58.Kim, Y.; Street, W.N.: An intelligent system for customer targeting: A data mining approach. In: Decision Support
Systems 37 (2003), 215-228.
59.Kim, Y.S.; Street, W.N.; Russell, G.J.; Menczer, F.: Customer targeting: A neural network approach guided by genetic
algorithms. In: Management Science 51 (2005), 264-276.
60.Kirkos, E.; Spathis, C.; Manolopoulos, Y.: Data mining techniques for the detection of fraudulent financial statements. In:
Expert Systems with Applications 32 (2007), 995-1003.
61.Kohavi, R.: A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In: Mellish, C.S.
(ed.): IJCAI'95 – Proc. of the 14th Intern. Joint Conf. on Artificial Intelligence. Morgan Kaufmann, San Francisco 1995,
1137-1143.
62.Kohavi, R.; John, G.H.: Wrappers for feature subset selection. In: Artificial Intelligence 97 (1997), 273-324.
63.Kubat, M.; Holte, R.C.; Matwin, S.: Machine learning for the detection of oil spills in satellite radar images. In: Machine
Learning 30 (1998), 195-215.
64.Lariviere, B.; Van den Poel, D.: Investigating the role of product features in preventing customer churn, by using survival
analysis and choice modeling: The case of financial services. In: Expert Systems with Applications 27 (2004), 277-285.
65.Lariviere, B.; Van den Poel, D.: Predicting customer retention and profitability by using random forests and regression
forests techniques. In: Expert Systems with Applications 29 (2005), 472-484.
66.Lewis, M.: The influence of loyalty programs and short-term promotions on customer retention. In: Journal of Marketing
Research 41 (2004), 281-292.
67.Ling, C.X.; Li, C.: Data Mining for Direct Marketing: Problems and Solutions. In: Agrawal, R.; Stolorz, P. (eds.):
KDD'98 – Proc. of the 4th Intern. Conf. on Knowledge Discovery and Data Mining. AAAI Press, Menlo Park 1998, 73-
79.
68.Lo, V.S.Y.: The true lift model: A novel data mining approach to response modeling in database marketing. In: ACM
SIGKDD Explorations Newsletter 4 (2002), 78-86.
69.Madeira, S.; Sousa, J.M.: Comparison of Target Selection Methods in Direct Marketing. In: Lieven, K. (ed.): EUNITE'02
– Proc. of the European Symposium on Intelligent Technologies, Hybrid Systems and their implementation on Smart
Adaptive Systems. Elite Foundation, Aachen 2002, 333-338.
70.Magidson, J.: Improved statistical techniques for response modeling: Progression beyond regression. In: Journal of Direct
Marketing 2 (1988), 6-18.
71.Martens, D.; Baesens, B.; van Gestel, T.; Vanthienen, J.: Comprehensible credit scoring models using rule extraction from
support vector machines. In: European Journal of Operational Research 183 (2007), 1466-1476.
72.McDonald, W.J.: Direct Marketing. McGraw-Hill, Singapore 1998.
73.Metz, C.E.: Basic principles of ROC analysis. In: Seminars in Nuclear Medicine 8 (1978), 283-298.
74.Mozer, M.C.; Dodier, R.; Colagrosso, M.D.; Guerra-Salcedo, C.; Wolniewicz, R.: Prodding the ROC Curve: Constrained
Optimization of Classifier Performance. In: Dietterich, T.G.; Becker, S.; Ghahramani, Z. (eds.): Advances in Neural
Information Processing Systems 14. MIT Press, Cambridge 2002, 1409-1415.
75.Mozer, M.C.; Wolniewicz, R.; Grimes, D.B.; Johnson, E.; Kaushansky, H.: Predicting subscriber dissatisfaction and
improving retention in the wireless telecommunications industry. In: IEEE Transactions on Neural Networks 11 (2000),
690-696.
76.Nash, E.L.: The Direct Marketing Handbook. McGraw-Hill, New York 1992.
77.Navia-Vázquez, A.; Parrado-Hernández, E.: Support vector machine interpretation. In: Neurocomputing 69 (2006),
1754-1759.
78.O'Brien, T.V.: Neural nets for direct marketers. In: Marketing Research 6 (1994), 47.
79.Pan, J.; Yang, Q.; Yang, Y.; Li, L.; Li, F.T.; Li, G.W.: Cost-sensitive preprocessing for mining customer relationship
management databases. In: IEEE Intelligent Systems 22 (2007), 46-51.
80.Phua, C.; Alahakoon, D.; Lee, V.: Minority report in fraud detection: Classification of skewed data. In: ACM SIGKDD
Explorations Newsletter 6 (2004), 50-59.
81.Piatetsky-Shapiro, G.; Masand, B.: Estimating Campaign Benefits and Modeling Lift. In: Chaudhuri, S.; Madigan, D.
(eds.): KDD'99 – Proc. of the 5th Intern. Conf. on Knowledge Discovery and Data Mining. ACM Press 1999, 185-193.
82.Provost, F.; Fawcett, T.: Analysis and Visualization of Classifier Performance: Comparison Under Imprecise Class and
Cost Distributions. In: Heckerman, D.; Mannila, H.; Pregibon, D.; Uthurusamy, R. (eds.): KDD'97 – Proc. of the 3rd
Intern. Conf. on Knowledge Discovery and Data Mining. AAAI Press, Menlo Park 1997, 43-48.
83.Provost, F.; Fawcett, T.: Robust classification for imprecise environments. In: Machine Learning 42 (2001), 203-231.
84.Provost, F.; Fawcett, T.; Kohavi, R.: The Case Against Accuracy Estimation for Comparing Induction Algorithms. In:
Shavlik, J.W. (ed.): Machine Learning – Proc. of the 15th Intern. Conf. on Machine Learning. Morgan Kaufmann, San
Francisco 1998, 445-453.
85.Quah, J.T.S.; Sriganesh, M.: Real-time credit card fraud detection using computational intelligence. In: Expert Systems
with Applications (doi:10.1016/j.eswa.2007.08.093) (2007).
86.Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo 1993.
87.Ratner, B.: Finding the best variables for direct marketing models. In: Journal of Targeting, Measurement & Analysis for
Marketing 9 (2001), 270-296.
88.Reinartz, W.; Kumar, V.: On the profitability of long-life customers in a noncontractual setting: An empirical
investigation and implications for marketing. In: Journal of Marketing 64 (2000), 17-35.
89.Reinartz, W.J.; Kumar, V.: The impact of customer relationship characteristics on profitable lifetime duration. In: Journal
of Marketing 67 (2003), 77-99.
90.Rosset, S.; Murad, U.; Neumann, E.; Idan, Y.; Pinkas, G.: Discovery of Fraud Rules for Telecommunications—
Challenges and Solutions. In: Chaudhuri, S.; Madigan, D. (eds.): Proc. of the 5th Intern. Conf. on Knowledge Discovery
and Data Mining. ACM, New York 1999, 409-413.
91.Rosset, S.; Neumann, E.; Eick, U.; Vatnik, N.; Idan, I.: Evaluation of Prediction Models for Marketing Campaigns. In:
Provost, F.; Srikant, R. (eds.): KDD'01 – Proc. of the 7th Intern. Conf. on Knowledge Discovery and Data Mining. ACM
Press, New York 2001, 456-461.
92.Rossi, P.E.; McCulloch, R.E.; Allenby, G.M.: The value of purchase history data in target marketing. In: Marketing
Science 15 (1996), 321-340.
93.Schölkopf, B.; Platt, J.C.; Shawe-Taylor, J.; Smola, A.J.; Williamson, R.C.: Estimating the support of a high-dimensional
distribution. In: Neural Computation 13 (2001), 1443-1471.
94.Shawe-Taylor, J.; Howker, K.; Gosset, P.; Hyland, M.; Verrelst, H.; Moreau, Y.; Stoermann, C.; Burge, P.: Novel
Techniques for Profiling and Fraud Detection in Mobile Telecommunications. In: Lisboa, P.J.G.; Edisbury, B.; Vellido, A.
(eds.): Business Applications of Neural Networks. World Scientific, Singapore 2000, 113-139.
95.Shin, H.; Cho, S.: Response modeling with support vector machines. In: Expert Systems with Applications 30 (2006),
746-760.
96.Smith, K.A.; Willis, R.J.; Brooks, M.: An analysis of customer retention and insurance claim patterns using data mining:
A case study. In: Journal of the Operational Research Society 51 (2000), 532-541.
97.Thrasher, R.P.: CART: A recent advance in tree-structured list segmentation methodology. In: Journal of Direct
Marketing 5 (1991), 35-47.
98.Van den Poel, D.; Buckinx, W.: Predicting online-purchasing behaviour. In: European Journal of Operational Research
166 (2005), 557-575.
99.Van den Poel, D.; Prinzie, A.: Constrained optimization of data-mining problems to improve model performance: A
direct-marketing application. In: Expert Systems with Applications (doi:10.1016/j.eswa.2005.04.017) (2008).
100. Vapnik, V.; Kotz, S.: Estimation of Dependences Based on Empirical Data. Springer, New York 2006.
101. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, New York 1995.
102. Viaene, S.; Ayuso, M.; Guillen, M.; Van Gheel, D.; Dedene, G.: Strategies for detecting fraudulent claims in the
automobile insurance industry. In: European Journal of Operational Research 176 (2007), 565-583.
103. Viaene, S.; Baesens, B.; Van den Poel, D.; Dedene, G.; Vanthienen, J.: Wrapped input selection using multilayer
perceptrons for repeat-purchase modeling in direct marketing. In: International Journal of Intelligent Systems in
Accounting, Finance & Management 10 (2001), 115-126.
104. Viaene, S.; Baesens, B.; Van Gestel, T.; Suykens, J.A.K.; Van den Poel, D.; Vanthienen, J.; De Moor, B.; Dedene, G.:
Knowledge discovery in a direct marketing case using least squares support vector machines. In: International Journal of
Intelligent Systems 16 (2001), 1023-1036.
105. Viaene, S.; Dedene, G.: Cost-sensitive learning and decision making revisited. In: European Journal of Operational
Research 166 (2004), 212-220.
106. Viaene, S.; Derrig, R.A.; Baesens, B.; Dedene, G.: A comparison of state-of-the-art classification techniques for expert
automobile insurance claim fraud detection. In: Journal of Risk & Insurance 69 (2002), 373-421.
107. Wei, C.P.; Chiu, I.T.: Turning telecommunications call details to churn prediction: A data mining approach. In: Expert
Systems with Applications 23 (2002), 103-112.
108. Wheeler, R.; Aitken, S.: Multiple algorithms for fraud detection. In: Knowledge-Based Systems 13 (2000), 93-99.
109. Xing, D.; Girolami, M.: Employing latent Dirichlet allocation for fraud detection in telecommunications. In: Pattern
Recognition Letters 28 (2007), 1727-1734.
110. Yan, L.; Wolniewicz, R.H.; Dodier, R.: Predicting customer behavior in telecommunications. In: IEEE Intelligent
Systems 19 (2004), 50-58.
111. Yang, W.-S.; Hwang, S.-Y.: A process-mining framework for the detection of healthcare fraud and abuse. In: Expert
Systems with Applications 31 (2006), 56-68.
112. Yu, E.; Cho, S.: Constructing response model using ensemble based on feature subset selection. In: Expert Systems
with Applications 30 (2006), 352-360.
113. Zahavi, J.; Levin, N.: Applying neural computing to target marketing. In: Journal of Direct Marketing 11 (1997), 76-93.
114. Zahavi, J.; Levin, N.: Issues and problems in applying neural computing to target marketing. In: Journal of Direct
Marketing 11 (1997), 63-75.
115. Zhao, Y.; Li, B.; Li, X.; Liu, W.; Ren, S.: Customer Churn Prediction Using Improved One-Class Support Vector
Machine. In: Li, X.; Wang, S.; Dong, Z.Y. (eds.): Advanced Data Mining and Applications – Proc. of the 1st Intern. Conf.
on Advanced Data Mining and Applications. Springer, Berlin 2005, 300-306.