Chapter 15
Model Evaluation Techniques
Prepared by Andrew Hendrickson, Graduate Assistant
Data Mining and Predictive Analytics, By Daniel Larose and Chantal Larose
John Wiley & Sons, Inc, Hoboken, NJ, 2015.
Model Evaluation Techniques
• Evaluation Phase
– Concerned with evaluating the quality and effectiveness of candidate data mining models
Model Evaluation Techniques (cont’d)
Model Evaluation Techniques for the Description Task
– Recall that EDA is a powerful technique for describing data
– No target is classified, estimated, or predicted
– Therefore, methods to objectively measure results are elusive
Model Evaluation Techniques for the Description Task (cont’d)
– The minimum description length (MDL) principle quantifies the information required (in bits) to encode the model and the exceptions to the model
Model Evaluation Techniques for the Estimation and Prediction Tasks
– Models produce an estimation (prediction) ŷ for the actual target
value y
– Mean square error (MSE) is used to evaluate models:
  MSE = Σᵢ (yᵢ − ŷᵢ)² / (n − p − 1)
– p represents the number of model parameters
– Preferred models minimize MSE
– The typical error for estimation/prediction models is the standard error of the estimate:
  s = √MSE
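The two formulas above can be sketched directly in code. The data, predictions, and single-predictor setup below are hypothetical stand-ins, not the Minitab cereal example from the text:

```python
# Minimal sketch of MSE and the standard error of the estimate s.
# Data and predictions are made up for illustration.

def mse_and_s(y, y_hat, p):
    """MSE = sum((y_i - y_hat_i)^2) / (n - p - 1); s = sqrt(MSE)."""
    n = len(y)
    sse = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))
    mse = sse / (n - p - 1)
    return mse, mse ** 0.5

y = [10.0, 12.0, 9.0, 15.0, 11.0, 13.0]      # hypothetical actual targets
y_hat = [9.5, 12.5, 10.0, 14.0, 11.5, 12.0]  # hypothetical model estimates
mse, s = mse_and_s(y, y_hat, p=1)            # p = 1 predictor in the model
print(mse, s)
```

Note the denominator n − p − 1, so a more complex model (larger p) is penalized for the same sum of squared errors.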
Model Evaluation Techniques for the Estimation and Prediction Tasks (cont’d)
– For example, Minitab regression output is shown for predicting nutritional rating based on sugar content
– Here, MSE = 84.0 and s = 9.167
Model Evaluation Techniques for the Estimation and Prediction Tasks (cont’d)
– s = 9.167 is the estimated prediction error for this model
– That is, the model’s typical error when using sugar to estimate rating is 9.167 rating points
– Is this error acceptable?
– Should the model be deployed?
– Deployment of the model depends on the business objectives
Model Evaluation Techniques for the Estimation and Prediction Tasks (cont’d)
– The multiple regression model is more complex than the previous model
– It uses eight predictors, as opposed to a single predictor
Model Evaluation Techniques for the Estimation and Prediction Tasks (cont’d)
• One drawback of the above evaluation measures is that outliers may have an undue influence on the value of the evaluation measure
• This is because these measures are based on the squared error, which is much larger for outliers than for the bulk of the data
• An alternative is the mean absolute error (MAE)
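MAE’s robustness to outliers can be illustrated with a small sketch; the data below are made up, with the last record a deliberate outlier:

```python
# Minimal sketch of mean absolute error (MAE). An outlier contributes its
# absolute residual (here 38) rather than its squared residual (38^2 = 1444),
# so it influences the measure far less than it would influence MSE.

def mae(y, y_hat):
    """MAE = (1/n) * sum(|y_i - y_hat_i|)."""
    return sum(abs(yi - yhi) for yi, yhi in zip(y, y_hat)) / len(y)

y = [10.0, 12.0, 9.0, 50.0]       # hypothetical; last record is an outlier
y_hat = [9.0, 12.5, 10.0, 12.0]   # hypothetical predictions
print(mae(y, y_hat))              # (1 + 0.5 + 1 + 38) / 4 = 10.125
```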
Model Evaluation Techniques for the Classification Task
Model Evaluation Techniques for the Classification Task - cont
ACCURACY AND OVERALL ERROR RATE
• For example:
• 86.48% of the classifications made by this model are correct, while 13.52% are
wrong
SENSITIVITY AND SPECIFICITY
• Sensitivity measures the ability of the model to classify a record positively, while
specificity measures the ability to classify a record negatively
• For example:
• A good classification model should be sensitive, meaning that it should identify a high
proportion of the customers who are positive (have high income)
• A classification model also needs to be specific, meaning that it should identify a high
proportion of the customers who are negative (have low income)
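Both measures come straight from the confusion-matrix cells. The counts below are made up for illustration and are not the counts from the book’s table:

```python
# Minimal sketch of sensitivity and specificity from confusion-matrix
# counts. TP, TN, FP, FN values are hypothetical.

def sensitivity(tp, fn):
    """TP / (TP + FN): proportion of actual positives classified positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """TN / (TN + FP): proportion of actual negatives classified negative."""
    return tn / (tn + fp)

tp, tn, fp, fn = 572, 4000, 180, 428   # hypothetical counts
print(sensitivity(tp, fn))             # 572 / 1000
print(specificity(tn, fp))             # 4000 / 4180
```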
FALSE POSITIVE RATE AND FALSE NEGATIVE RATE
• The false positive rate and false negative rate are the complements of specificity and sensitivity, respectively: FPR = 1 − specificity and FNR = 1 − sensitivity
• For example:
• Our low false positive rate of 4.31% indicates that we incorrectly identify actual low
income customers as high income only 4.31% of the time
• The much higher false negative rate indicates that we incorrectly classify actual high
income customers as low income 42.80% of the time
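The complement relationship can be checked against the rates quoted above: a 4.31% false positive rate and a 42.80% false negative rate imply specificity 0.9569 and sensitivity 0.5720. A one-line sketch:

```python
# Minimal sketch: FPR and FNR as complements of specificity and sensitivity,
# using the rates implied by the example in the text.

specificity = 0.9569
sensitivity = 0.5720

fpr = 1 - specificity   # chance an actual negative is classified positive
fnr = 1 - sensitivity   # chance an actual positive is classified negative
print(round(fpr, 4))    # 0.0431
print(round(fnr, 4))    # 0.428
```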
PROPORTIONS OF TRUE POSITIVES, TRUE NEGATIVES, FALSE POSITIVES, AND FALSE NEGATIVES
• The proportion of true positives and the proportion of true negatives are defined as follows
• For example:
• That is, the probability is 80.69% that a customer actually has high income, given
that our model has classified it as high income, while the probability is 87.66% that a
customer actually has low income, given that we have classified it as low income
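Unlike sensitivity and specificity, these proportions condition on the model’s classification rather than on the actual class. A sketch with made-up counts (not the book’s table):

```python
# Minimal sketch of the proportion of true positives and the proportion of
# true negatives, computed from hypothetical confusion-matrix counts.

def prop_true_positives(tp, fp):
    """TP / (TP + FP): P(actually positive | classified positive)."""
    return tp / (tp + fp)

def prop_true_negatives(tn, fn):
    """TN / (TN + FN): P(actually negative | classified negative)."""
    return tn / (tn + fn)

tp, tn, fp, fn = 572, 4000, 180, 428   # hypothetical counts
print(prop_true_positives(tp, fp))     # 572 / 752
print(prop_true_negatives(tn, fn))     # 4000 / 4428
```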
PROPORTIONS OF TRUE POSITIVES, TRUE NEGATIVES, FALSE POSITIVES, AND FALSE NEGATIVES - cont
• For the Proportion of False Positives and the Proportion of False Negatives
• For example:
• In other words, there is a 19.31% likelihood that a customer actually has low income,
given that our model has classified it as high income, and there is a 12.34%
likelihood that a customer actually has high income, given that we have classified it
as low income
• As an aside, in the parlance of hypothesis testing, since the default decision is to find
that the applicant has low income, we would have the following hypotheses:
MISCLASSIFICATION COST ADJUSTMENT TO REFLECT REAL-WORLD CONCERNS
• Which error, a false negative or a false positive, would be considered more damaging
from the lender’s point of view?
– If the lender commits a false negative, an applicant who had high income gets turned down for a loan: an unfortunate but not very expensive mistake
– If the lender commits a false positive, an applicant who had low income is awarded the loan, which is expensive for the lender
• Therefore, the lender would consider the false positive to be the more damaging type
of error and would prefer to minimize the proportion of false positives
• The analyst would therefore adjust the C5.0 algorithm’s misclassification cost matrix
to reflect the lender’s concerns
• How would you expect the misclassification cost adjustment to affect the
performance of the algorithm?
MISCLASSIFICATION COST ADJUSTMENT TO REFLECT REAL-WORLD CONCERNS - cont
• The C5.0 algorithm was rerun, this time including the misclassification cost adjustment. The
resulting contingency table is shown in Table 15.3
• The classification model evaluation measures are presented in Table 15.4
• As desired, the proportion of false positives has decreased
• However, the algorithm, hesitant to classify records as positive due to the higher cost, instead
made many more negative classifications, and therefore more false negatives
• While the overall error rate is higher (0.1444, up from 0.1352), the higher proportion of false negatives is considered a “good trade” by this lender, who is eager to reduce the loan default rate, which is very costly to the firm
Decision Cost/Benefit Analysis
Decision Cost/Benefit Analysis (cont’d)
Decision Cost/Benefit Analysis (cont’d)
• Using the costs from Table 15.5, we can then compare models 1 and 2:
• Cost of Model 1 (False positive cost not doubled):
• Negative costs represent profits. Thus, the estimated cost savings from deploying
Model 2, which doubles the cost of a false positive error, is
• In other words, the simple step of doubling the false positive cost has resulted in the deployment of a model that greatly increases the company’s profit
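The comparison can be sketched as a weighted sum over the confusion-matrix cells. The cost values and counts below are hypothetical stand-ins (Table 15.5’s actual figures are not reproduced here); as in the text, negative costs represent profits:

```python
# Minimal sketch of a decision cost/benefit comparison between two models.
# All cost values and confusion-matrix counts are hypothetical.

def total_cost(counts, costs):
    """Sum count * unit cost over the four outcome cells."""
    return sum(counts[cell] * costs[cell] for cell in counts)

costs = {"TP": -300, "TN": 0, "FP": 500, "FN": 0}       # hypothetical units
model_1 = {"TP": 600, "TN": 4000, "FP": 200, "FN": 400}  # hypothetical
model_2 = {"TP": 550, "TN": 4080, "FP": 120, "FN": 450}  # hypothetical

c1 = total_cost(model_1, costs)
c2 = total_cost(model_2, costs)
print(c1, c2, c1 - c2)   # a positive difference = savings from model 2
```

With these made-up numbers, model 2’s fewer false positives outweigh its lost true positives, so the cost difference favors deploying model 2.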
LIFT CHARTS AND GAINS CHARTS
• For classification models, lift is a concept, originally from the marketing field, which
seeks to compare the response rates with and without using the classification model
• Lift charts and gains charts are graphical evaluative methods for assessing and
comparing the usefulness of classification models
• We define lift as the proportion of true positives, divided by the proportion of positive
hits in the data set overall:
LIFT CHARTS AND GAINS CHARTS - cont
• When calculating lift, the software will first sort the records by the probability of
being classified positive
• The lift is then calculated for every sample size from n = 1 to n = the size of the data
set
• A chart is then produced which graphs lift against the percentile of the data set
• Note that lift is highest at the lowest percentiles, which makes sense since the data
are sorted according to the most likely positive hits
• As the plot moves from left to right, the positive hits tend to get “used up,” so that
the proportion steadily decreases until the lift finally equals exactly 1 when the entire
data set is considered the sample
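The procedure described above (sort by positive-class probability, then compute lift at each depth) can be sketched as follows; the scores and labels are made up:

```python
# Minimal sketch of the lift calculation: sort records by the model's
# positive-class probability, then at each depth divide the positive rate
# in the top slice by the overall positive rate. Data are hypothetical.

def lift_at_depth(scores, labels, n):
    """Lift for the n records with the highest positive probability."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    top_rate = sum(lab for _, lab in ranked[:n]) / n
    overall_rate = sum(labels) / len(labels)
    return top_rate / overall_rate

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]  # hypothetical probabilities
labels = [1,   1,   0,   1,   0,   1,   0,   0]    # 4 positives of 8 records
print(lift_at_depth(scores, labels, 2))  # 1.0 / 0.5 = 2.0 at the top slice
print(lift_at_depth(scores, labels, 8))  # lift = 1.0 over the whole data set
```

As the slide notes, lift is highest at the lowest percentiles and decays to exactly 1 when the sample is the entire data set.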
LIFT CHARTS AND GAINS CHARTS - cont
• Lift charts are often presented in their cumulative form, where they are denoted as
cumulative lift charts, or gains charts
• The gains chart associated with the lift chart in Figure 15.2 is presented in Figure
15.3
• The diagonal on the gains chart is analogous to the horizontal axis at lift = 1 on the
lift chart
• Analysts would like to see gains charts where the upper curve rises steeply as one
moves from left to right and then gradually flattens out
• For example, in Figure 15.3, canvassing the top 20% of our contact list, we expect to reach about 62% of the total number of high-income persons on the list
• Canvassing the top 40% would allow us to reach about 85%. Past this point, the law of diminishing returns is in effect
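The gains (cumulative lift) curve can be sketched the same way as lift: after sorting by score, the gain at depth n is the fraction of all positives captured in the top n records. Scores and labels below are made up:

```python
# Minimal sketch of a gains (cumulative lift) calculation with
# hypothetical scores and labels.

def gain_at_depth(scores, labels, n):
    """Fraction of ALL positives captured in the n highest-scored records."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    captured = sum(lab for _, lab in ranked[:n])
    return captured / sum(labels)

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]  # hypothetical probabilities
labels = [1,   1,   0,   1,   0,   1,   0,   0]    # 4 positives of 8 records
print(gain_at_depth(scores, labels, 2))  # top 25% captures 2/4 = 0.5
print(gain_at_depth(scores, labels, 4))  # top 50% captures 3/4 = 0.75
```

The steep early rise then flattening is exactly the shape analysts look for in a gains chart.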
LIFT CHARTS AND GAINS CHARTS - cont
• Figure 15.4 shows the combined lift chart for models 1 and 2
• For example, up to about the 6th percentile, there is no apparent difference in model lift
• Then, up to approximately the 17th percentile, model 2 is preferable, providing
slightly higher lift
• Thereafter, model 1 is preferable
• It is to be stressed that model evaluation techniques should be performed
on the test data set, rather than on the training set, or on the data set as a
whole
INTERWEAVING MODEL EVALUATION WITH MODEL BUILDING
CONFLUENCE OF RESULTS: APPLYING A SUITE OF MODELS
• Whenever possible, the analyst should not depend solely on a single data mining method
• Instead, he or she should seek a confluence of results from a suite of different data mining
models
• For example, for the adult database, our analysis from Chapters 11 and 12 shows that the
variables listed in Table 15.7 are the most influential (ranked roughly in order of importance) for
classifying income, as identified by CART, C5.0, and the neural network algorithm, respectively
• All three algorithms identify Marital_Status, education-num, capital-gain, capital-loss, and hours-
per-week as the most important variables, except for the neural network, where age snuck in
past capital-loss
• None of the algorithms identified either work-class or sex as important variables, and only the
neural network identified age as important
• The algorithms agree on various ordering trends, such as that education-num is more important than hours-per-week