Chapter 15
Model Evaluation Techniques
Prepared by Andrew Hendrickson, Graduate Assistant
Data Mining and Predictive Analytics, By Daniel Larose and Chantal Larose
John Wiley & Sons, Inc, Hoboken, NJ, 2015.
Model Evaluation Techniques
• Evaluation Phase
– Concerned with evaluating the quality and effectiveness of candidate data mining models
Model Evaluation Techniques (cont’d)
Model Evaluation Techniques for the Description Task
– Recall that EDA is a powerful technique for describing data
– No target is classified, estimated, or predicted
– Therefore, methods to objectively measure results are elusive
Model Evaluation Techniques for the Description Task (cont’d)
– The minimum description length (MDL) principle quantifies the information required (in bits) to encode the model and the exceptions to the model
Model Evaluation Techniques for the Estimation and Prediction Tasks
– Models produce an estimation (prediction) ŷ for the actual target
value y
– Mean square error (MSE) is used to evaluate models:
  MSE = Σᵢ (yᵢ − ŷᵢ)² / (n − p − 1)
– p represents the number of model parameters
– Preferred models minimize MSE
– The typical error for estimation/prediction models is the standard error of the estimate:
  s = √MSE
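The two formulas above can be sketched directly in code. The data, predictions, and single-predictor setup below are hypothetical stand-ins, not the Minitab cereal example from the text:

```python
# Minimal sketch of MSE and the standard error of the estimate s.
# Data and predictions are made up for illustration.

def mse_and_s(y, y_hat, p):
    """MSE = sum((y_i - y_hat_i)^2) / (n - p - 1); s = sqrt(MSE)."""
    n = len(y)
    sse = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))
    mse = sse / (n - p - 1)
    return mse, mse ** 0.5

y = [10.0, 12.0, 9.0, 15.0, 11.0, 13.0]      # hypothetical actual targets
y_hat = [9.5, 12.5, 10.0, 14.0, 11.5, 12.0]  # hypothetical model estimates
mse, s = mse_and_s(y, y_hat, p=1)            # p = 1 predictor in the model
print(mse, s)
```

Note the denominator n − p − 1, so a more complex model (larger p) is penalized for the same sum of squared errors.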
Model Evaluation Techniques for the Estimation and Prediction Tasks (cont’d)
– For example, Minitab regression output is shown for predicting nutritional rating based on sugar content
– Here, MSE = 84.0 and s = 9.167
Model Evaluation Techniques for the Estimation and Prediction Tasks (cont’d)
– s = 9.167 is the estimated prediction error for this model
– That is, the model’s typical error when using sugar to estimate rating is 9.167 rating points
– Is this error acceptable?
– Should the model be deployed?
– Deployment of the model depends on the business objectives
Model Evaluation Techniques for the Estimation and Prediction Tasks (cont’d)
– The multiple regression model is more complex than the previous model
– It uses eight predictors, as opposed to a single predictor
Model Evaluation Techniques for the Estimation and Prediction Tasks (cont’d)
• One drawback of the above evaluation measures is that outliers may have an undue influence on the value of the evaluation measure
• This is because these measures are based on the squared error, which is much larger for outliers than for the bulk of the data
• An alternative is the mean absolute error (MAE)
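MAE’s robustness to outliers can be illustrated with a small sketch; the data below are made up, with the last record a deliberate outlier:

```python
# Minimal sketch of mean absolute error (MAE). An outlier contributes its
# absolute residual (here 38) rather than its squared residual (38^2 = 1444),
# so it influences the measure far less than it would influence MSE.

def mae(y, y_hat):
    """MAE = (1/n) * sum(|y_i - y_hat_i|)."""
    return sum(abs(yi - yhi) for yi, yhi in zip(y, y_hat)) / len(y)

y = [10.0, 12.0, 9.0, 50.0]       # hypothetical; last record is an outlier
y_hat = [9.0, 12.5, 10.0, 12.0]   # hypothetical predictions
print(mae(y, y_hat))              # (1 + 0.5 + 1 + 38) / 4 = 10.125
```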
Model Evaluation Techniques for the Classification Task
Model Evaluation Techniques for the Classification Task - cont
ACCURACY AND OVERALL ERROR RATE
• For example:
• 86.48% of the classifications made by this model are correct, while 13.52% are
wrong
SENSITIVITY AND SPECIFICITY
• Sensitivity measures the ability of the model to classify a record positively, while
specificity measures the ability to classify a record negatively
• For example:
• A good classification model should be sensitive, meaning that it should identify a high
proportion of the customers who are positive (have high income)
• A classification model also needs to be specific, meaning that it should identify a high
proportion of the customers who are negative (have low income)
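Both measures come straight from the confusion-matrix cells. The counts below are made up for illustration and are not the counts from the book’s table:

```python
# Minimal sketch of sensitivity and specificity from confusion-matrix
# counts. TP, TN, FP, FN values are hypothetical.

def sensitivity(tp, fn):
    """TP / (TP + FN): proportion of actual positives classified positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """TN / (TN + FP): proportion of actual negatives classified negative."""
    return tn / (tn + fp)

tp, tn, fp, fn = 572, 4000, 180, 428   # hypothetical counts
print(sensitivity(tp, fn))             # 572 / 1000
print(specificity(tn, fp))             # 4000 / 4180
```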
FALSE POSITIVE RATE AND FALSE NEGATIVE RATE
• The false positive rate and false negative rate are the complements of specificity and sensitivity, respectively: FPR = 1 − specificity and FNR = 1 − sensitivity
• For example:
• Our low false positive rate of 4.31% indicates that we incorrectly identify actual low
income customers as high income only 4.31% of the time
• The much higher false negative rate indicates that we incorrectly classify actual high
income customers as low income 42.80% of the time
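The complement relationship can be checked against the rates quoted above: a 4.31% false positive rate and a 42.80% false negative rate imply specificity 0.9569 and sensitivity 0.5720. A one-line sketch:

```python
# Minimal sketch: FPR and FNR as complements of specificity and sensitivity,
# using the rates implied by the example in the text.

specificity = 0.9569
sensitivity = 0.5720

fpr = 1 - specificity   # chance an actual negative is classified positive
fnr = 1 - sensitivity   # chance an actual positive is classified negative
print(round(fpr, 4))    # 0.0431
print(round(fnr, 4))    # 0.428
```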
PROPORTIONS OF TRUE POSITIVES, TRUE NEGATIVES, FALSE POSITIVES, AND FALSE NEGATIVES
• The proportion of true positives and the proportion of true negatives are defined as follows
• For example:
• That is, the probability is 80.69% that a customer actually has high income, given
that our model has classified it as high income, while the probability is 87.66% that a
customer actually has low income, given that we have classified it as low income
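Unlike sensitivity and specificity, these proportions condition on the model’s classification rather than on the actual class. A sketch with made-up counts (not the book’s table):

```python
# Minimal sketch of the proportion of true positives and the proportion of
# true negatives, computed from hypothetical confusion-matrix counts.

def prop_true_positives(tp, fp):
    """TP / (TP + FP): P(actually positive | classified positive)."""
    return tp / (tp + fp)

def prop_true_negatives(tn, fn):
    """TN / (TN + FN): P(actually negative | classified negative)."""
    return tn / (tn + fn)

tp, tn, fp, fn = 572, 4000, 180, 428   # hypothetical counts
print(prop_true_positives(tp, fp))     # 572 / 752
print(prop_true_negatives(tn, fn))     # 4000 / 4428
```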
PROPORTIONS OF TRUE POSITIVES, TRUE NEGATIVES, FALSE POSITIVES, AND FALSE NEGATIVES - cont
• For the Proportion of False Positives and the Proportion of False Negatives
• For example:
• In other words, there is a 19.31% likelihood that a customer actually has low income,
given that our model has classified it as high income, and there is a 12.34%
likelihood that a customer actually has high income, given that we have classified it
as low income
• As an aside, in the parlance of hypothesis testing, since the default decision is to find
that the applicant has low income, we would have the following hypotheses:
MISCLASSIFICATION COST ADJUSTMENT TO REFLECT REAL-WORLD CONCERNS
• Which error, a false negative or a false positive, would be considered more damaging
from the lender’s point of view?
– If the lender commits a false negative, an applicant who had high income gets turned down for a loan: an unfortunate but not very expensive mistake
– If the lender commits a false positive, an applicant who had low income is awarded the loan, which is expensive for the lender
• Therefore, the lender would consider the false positive to be the more damaging type
of error and would prefer to minimize the proportion of false positives
• The analyst would therefore adjust the C5.0 algorithm’s misclassification cost matrix
to reflect the lender’s concerns
• How would you expect the misclassification cost adjustment to affect the
performance of the algorithm?
MISCLASSIFICATION COST ADJUSTMENT TO REFLECT REAL-WORLD CONCERNS - cont
• The C5.0 algorithm was rerun, this time including the misclassification cost adjustment. The
resulting contingency table is shown in Table 15.3
• The classification model evaluation measures are presented in Table 15.4
• As desired, the proportion of false positives has decreased
• However, the algorithm, hesitant to classify records as positive due to the higher cost, instead
made many more negative classifications, and therefore more false negatives
• While the overall error rate is higher (0.1444, up from 0.1352), the higher proportion of false negatives is considered a “good trade” by this lender, who is eager to reduce the loan default rate, which is very costly to the firm
Decision Cost/Benefit Analysis
Decision Cost/Benefit Analysis (cont’d)
Decision Cost/Benefit Analysis (cont’d)
• Using the costs from Table 15.5, we can then compare models 1 and 2:
• Cost of Model 1 (False positive cost not doubled):
• Negative costs represent profits. Thus, the estimated cost savings from deploying
Model 2, which doubles the cost of a false positive error, is
• In other words, the simple step of doubling the false positive cost has resulted in the deployment of a model that greatly increases the company’s profit
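The comparison can be sketched as a weighted sum over the confusion-matrix cells. The cost values and counts below are hypothetical stand-ins (Table 15.5’s actual figures are not reproduced here); as in the text, negative costs represent profits:

```python
# Minimal sketch of a decision cost/benefit comparison between two models.
# All cost values and confusion-matrix counts are hypothetical.

def total_cost(counts, costs):
    """Sum count * unit cost over the four outcome cells."""
    return sum(counts[cell] * costs[cell] for cell in counts)

costs = {"TP": -300, "TN": 0, "FP": 500, "FN": 0}       # hypothetical units
model_1 = {"TP": 600, "TN": 4000, "FP": 200, "FN": 400}  # hypothetical
model_2 = {"TP": 550, "TN": 4080, "FP": 120, "FN": 450}  # hypothetical

c1 = total_cost(model_1, costs)
c2 = total_cost(model_2, costs)
print(c1, c2, c1 - c2)   # a positive difference = savings from model 2
```

With these made-up numbers, model 2’s fewer false positives outweigh its lost true positives, so the cost difference favors deploying model 2.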
LIFT CHARTS AND GAINS CHARTS
• For classification models, lift is a concept, originally from the marketing field, which
seeks to compare the response rates with and without using the classification model
• Lift charts and gains charts are graphical evaluative methods for assessing and
comparing the usefulness of classification models
• We define lift as the proportion of true positives, divided by the proportion of positive
hits in the data set overall:
LIFT CHARTS AND GAINS CHARTS - cont
• When calculating lift, the software will first sort the records by the probability of
being classified positive
• The lift is then calculated for every sample size from n = 1 to n = the size of the data
set
• A chart is then produced which graphs lift against the percentile of the data set
• Note that lift is highest at the lowest percentiles, which makes sense since the data
are sorted according to the most likely positive hits
• As the plot moves from left to right, the positive hits tend to get “used up,” so that
the proportion steadily decreases until the lift finally equals exactly 1 when the entire
data set is considered the sample
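The procedure described above (sort by positive-class probability, then compute lift at each depth) can be sketched as follows; the scores and labels are made up:

```python
# Minimal sketch of the lift calculation: sort records by the model's
# positive-class probability, then at each depth divide the positive rate
# in the top slice by the overall positive rate. Data are hypothetical.

def lift_at_depth(scores, labels, n):
    """Lift for the n records with the highest positive probability."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    top_rate = sum(lab for _, lab in ranked[:n]) / n
    overall_rate = sum(labels) / len(labels)
    return top_rate / overall_rate

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]  # hypothetical probabilities
labels = [1,   1,   0,   1,   0,   1,   0,   0]    # 4 positives of 8 records
print(lift_at_depth(scores, labels, 2))  # 1.0 / 0.5 = 2.0 at the top slice
print(lift_at_depth(scores, labels, 8))  # lift = 1.0 over the whole data set
```

As the slide notes, lift is highest at the lowest percentiles and decays to exactly 1 when the sample is the entire data set.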
LIFT CHARTS AND GAINS CHARTS - cont
• Lift charts are often presented in their cumulative form, where they are denoted as
cumulative lift charts, or gains charts
• The gains chart associated with the lift chart in Figure 15.2 is presented in Figure
15.3
• The diagonal on the gains chart is analogous to the horizontal axis at lift = 1 on the
lift chart
• Analysts would like to see gains charts where the upper curve rises steeply as one
moves from left to right and then gradually flattens out
• For example, in Figure 15.3, canvassing the top 20% of our contact list, we expect to reach about 62% of the total number of high-income persons on the list
• Canvassing the top 40% would allow us to reach about 85%. Past this point, the law of diminishing returns is in effect
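The gains (cumulative lift) curve can be sketched the same way as lift: after sorting by score, the gain at depth n is the fraction of all positives captured in the top n records. Scores and labels below are made up:

```python
# Minimal sketch of a gains (cumulative lift) calculation with
# hypothetical scores and labels.

def gain_at_depth(scores, labels, n):
    """Fraction of ALL positives captured in the n highest-scored records."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    captured = sum(lab for _, lab in ranked[:n])
    return captured / sum(labels)

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]  # hypothetical probabilities
labels = [1,   1,   0,   1,   0,   1,   0,   0]    # 4 positives of 8 records
print(gain_at_depth(scores, labels, 2))  # top 25% captures 2/4 = 0.5
print(gain_at_depth(scores, labels, 4))  # top 50% captures 3/4 = 0.75
```

The steep early rise then flattening is exactly the shape analysts look for in a gains chart.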
LIFT CHARTS AND GAINS CHARTS - cont
• Figure 15.4 shows the combined lift chart for models 1 and 2
• For example, up to about the 6th percentile, there is no apparent difference in model lift
• Then, up to approximately the 17th percentile, model 2 is preferable, providing
slightly higher lift
• Thereafter, model 1 is preferable
• It is to be stressed that model evaluation techniques should be performed
on the test data set, rather than on the training set, or on the data set as a
whole
INTERWEAVING MODEL EVALUATION WITH MODEL BUILDING
CONFLUENCE OF RESULTS: APPLYING A SUITE OF MODELS
• Whenever possible, the analyst should not depend solely on a single data mining method
• Instead, he or she should seek a confluence of results from a suite of different data mining
models
• For example, for the adult database, our analysis from Chapters 11 and 12 shows that the
variables listed in Table 15.7 are the most influential (ranked roughly in order of importance) for
classifying income, as identified by CART, C5.0, and the neural network algorithm, respectively
• All three algorithms identify Marital_Status, education-num, capital-gain, capital-loss, and hours-
per-week as the most important variables, except for the neural network, where age snuck in
past capital-loss
• None of the algorithms identified either work-class or sex as important variables, and only the
neural network identified age as important
• The algorithms agree on various ordering trends, such as that education-num is more important than hours-per-week