-
and why using the Gini coefficient has
cost you money
David J. Hand
______________________________________________________________________________
QFRMC - Imperial College London 2
How to evaluate a scorecard
depends on what it is being used for
This talk is about using scorecards for predictive classification
e.g. using scorecards to predict the likely future class of a
customer, e.g. default / not default
______________________________________________________________________________
Scorecards for classification: problem structure
Design sample
    Measured characteristics → Outcome

Applied to
    Measured characteristics → Outcome
______________________________________________________________________________
Scorecards for classification
Notation: measured characteristics x, outcome c
(Here assume two classes, c = 0,1)
Aim: use design sample (xi, ci), i = 1, ..., n
- construct score s = f(x)
- compare score with threshold t
- assign to class 0 if s ≤ t, to class 1 if s > t
⇒ two issues
- how to construct the score
- how to choose the threshold
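In code, the score-and-threshold rule above is just a comparison. A minimal sketch, with made-up scores and threshold (none of these numbers come from the talk):

```python
import numpy as np

# Hypothetical scores s = f(x) for four cases, and a threshold t
# (illustrative values only).
scores = np.array([0.10, 0.35, 0.40, 0.80])
t = 0.5

# Assign to class 0 if s <= t, to class 1 if s > t.
predicted = np.where(scores > t, 1, 0)
print(predicted.tolist())
```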
______________________________________________________________________________
Distribution of class 1 scores is f1(s), with cdf F1(s)
Distribution of class 0 scores is f0(s), with cdf F0(s)
______________________________________________________________________________
Methods for deriving score function
logistic regression
linear discriminant analysis
naive Bayes
segmented scorecards
e.g. logistic regression trees
Other methods: quadratic discriminant analysis, regularised discriminant
analysis, perceptrons, neural networks, radial basis function methods, vector
quantization methods, nearest neighbour and kernel nonparametric
methods, tree classifiers such as CART and C4.5, support vector machines,
rule-based methods, random forests, etc. etc. etc.
______________________________________________________________________________
How to choose between them
- compare performance
- need a performance criterion
______________________________________________________________________________
Basic issues
Business vs statistical criteria
______________________________________________________________________________
Basic misclassification table

                      True class
                      0        1
Predicted class   0   a        b
                  1   c        d

a + b + c + d = n,   a + c = n0,   b + d = n1

Class priors
Proportion in class 0 = π0 = n0/n
Proportion in class 1 = π1 = n1/n
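The table and the priors are easy to tabulate from labelled predictions. A sketch with invented true/predicted labels (the data are illustrative, not the talk's):

```python
import numpy as np

# Hypothetical true classes and predicted classes for eight cases.
true = np.array([0, 0, 0, 1, 1, 0, 1, 1])
pred = np.array([0, 0, 1, 1, 0, 0, 1, 1])

a = np.sum((pred == 0) & (true == 0))  # class 0 predicted as 0
b = np.sum((pred == 0) & (true == 1))  # class 1 predicted as 0
c = np.sum((pred == 1) & (true == 0))  # class 0 predicted as 1
d = np.sum((pred == 1) & (true == 1))  # class 1 predicted as 1

n = len(true)
pi0 = np.sum(true == 0) / n  # prior for class 0: n0 / n
pi1 = np.sum(true == 1) / n  # prior for class 1: n1 / n
```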
______________________________________________________________________________
                      True class
                      0        1
Predicted class   0   a        b
                  1   c        d
______________________________________________________________________________
Can represent performance for all choices of t
simultaneously with an ROC curve

Expected loss at threshold t:
L = c0 π0 (1 − F0(t)) + c1 π1 F1(t)
______________________________________________________________________________
The threshold T which minimises the loss is given by

T(c0, c1) = arg min_t { c0 π0 (1 − F0(t)) + c1 π1 F1(t) }

leading to a mapping between c1/c0 and T:

c1/c0 = π0 f0(T) / ( π1 f1(T) )
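The minimising threshold can be found numerically on a grid. A sketch assuming normal score distributions; the samples, priors, and costs are all illustrative, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
s0 = rng.normal(0.0, 1.0, 5000)   # class-0 scores (assumed distribution)
s1 = rng.normal(1.5, 1.0, 5000)   # class-1 scores (assumed distribution)
pi0, pi1 = 0.5, 0.5               # class priors
c0, c1 = 1.0, 1.0                 # misclassification costs

# Empirical cdfs F0(t), F1(t) on a grid of candidate thresholds.
ts = np.linspace(-3.0, 4.0, 701)
F0 = np.searchsorted(np.sort(s0), ts, side="right") / len(s0)
F1 = np.searchsorted(np.sort(s1), ts, side="right") / len(s1)

# L(t) = c0*pi0*(1 - F0(t)) + c1*pi1*F1(t); T minimises it over the grid.
loss = c0 * pi0 * (1.0 - F0) + c1 * pi1 * F1
T = ts[np.argmin(loss)]
```

With equal priors and unit costs, T lands roughly midway between the two score distributions, as the mapping c1/c0 = π0 f0(T) / (π1 f1(T)) requires.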
______________________________________________________________________________
Strategy 1: choose particular values for the costs

Example 1: c0 = c1 = 1
leading to error rate

Example 2: c0 = π1, c1 = π0
leading to the Kolmogorov–Smirnov statistic
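Both special cases can be checked numerically. A sketch with illustrative normal score samples (the distributions, priors, and sample sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
s0 = rng.normal(0.0, 1.0, 4000)   # class-0 scores (assumed)
s1 = rng.normal(1.5, 1.0, 4000)   # class-1 scores (assumed)
pi0, pi1 = 0.5, 0.5

# Empirical cdfs evaluated at every observed score.
ts = np.sort(np.concatenate([s0, s1]))
F0 = np.searchsorted(np.sort(s0), ts, side="right") / len(s0)
F1 = np.searchsorted(np.sort(s1), ts, side="right") / len(s1)

# Example 1: c0 = c1 = 1 -> minimising L gives the minimum error rate.
error_rate = np.min(pi0 * (1.0 - F0) + pi1 * F1)

# Example 2: c0 = pi1, c1 = pi0 -> L = pi0*pi1*{(1-F0) + F1}, minimised
# where F0 - F1 is largest: the Kolmogorov-Smirnov statistic.
ks = np.max(F0 - F1)
```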
______________________________________________________________________________
BUT
______________________________________________________________________________
Strategy 2: average the loss over a distribution of costs

L = ∫∫ { c0 π0 (1 − F0(T(c))) + c1 π1 F1(T(c)) } u(c0, c1) dc0 dc1

equivalently, in terms of a single cost ratio c,

L = ∫_0^1 { c π0 (1 − F0(T(c))) + (1 − c) π1 F1(T(c)) } w(c) dc

with

c = (1 + c1/c0)^−1   and   c = π1 f1(T) / ( π0 f0(T) + π1 f1(T) )
______________________________________________________________________________
A particular choice of w(c) in

L = ∫_0^1 { c π0 (1 − F0(T(c))) + (1 − c) π1 F1(T(c)) } w(c) dc

leads to

L = 1 − 2 π0 π1 ∫_{−∞}^{∞} F0(s) f1(s) ds
  = 1 − 2 π0 π1 AUC = (1 − π0 π1) − π0 π1 G

This choice is

w*(c) = { π0 f0(T(c)) + π1 f1(T(c)) } dT(c)/dc
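The chain of equalities can be checked empirically: the integral ∫ F0(s) f1(s) ds is the probability that a random class-0 score falls below a random class-1 score, i.e. the AUC, and the Gini identity follows from G = 2·AUC − 1. A sketch with assumed normal score samples:

```python
import numpy as np

rng = np.random.default_rng(2)
s0 = rng.normal(0.0, 1.0, 20000)  # class-0 scores (assumed)
s1 = rng.normal(1.0, 1.0, 20000)  # class-1 scores (assumed)
pi0, pi1 = 0.5, 0.5

# AUC = integral F0(s) f1(s) ds = P(class-0 score < class-1 score),
# estimated by averaging the empirical F0 over the class-1 scores.
auc = np.mean(np.searchsorted(np.sort(s0), s1, side="right") / len(s0))

G = 2.0 * auc - 1.0                     # Gini coefficient
L = 1.0 - 2.0 * pi0 * pi1 * auc         # loss under the implicit w*(c)
identity = (1.0 - pi0 * pi1) - pi0 * pi1 * G
```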
______________________________________________________________________________
That is, AUC and Gini are equivalent to averaging
the loss over a weight function w ( c ) which depends
on the observed distributions
BUT
Knowledge of c, or, equivalently,
knowledge of T or the costs c0 and c1 ,
must come from information in the
problem other than the score
distributions !
______________________________________________________________________________
Using the AUC or Gini is equivalent to saying
your belief in the relative severity which will be
assigned to the different types of
misclassifications depends on choice of
scorecard
This is absurd
______________________________________________________________________________
But (you say) the AUC and Gini have
several nice interpretations
AUC = ∫_0^1 F0(F1^−1(v)) dv
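This interpretation of the AUC, as the average of F0 evaluated at the quantiles of F1, matches the rank-based definition P(S0 < S1). A sketch comparing the two computations on assumed normal score samples:

```python
import numpy as np

rng = np.random.default_rng(3)
s0 = np.sort(rng.normal(0.0, 1.0, 20000))  # class-0 scores (assumed)
s1 = np.sort(rng.normal(1.0, 1.0, 20000))  # class-1 scores (assumed)

# AUC as the integral of F0(F1^{-1}(v)) over v in (0, 1),
# using empirical quantiles of the class-1 scores.
v = (np.arange(20000) + 0.5) / 20000
F1_inv = np.quantile(s1, v)                # empirical F1^{-1}(v)
auc_integral = np.mean(np.searchsorted(s0, F1_inv, side="right") / len(s0))

# Rank-based AUC = P(S0 < S1), for comparison.
auc_rank = np.mean(np.searchsorted(s0, s1, side="right") / len(s0))
```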
______________________________________________________________________________
F1 and F0 are operating characteristics of the scorecard

             threshold t           threshold t′
          1−F1(t)    F0(t)     1−F1(t′)    F0(t′)
Rule 1       95        89          80         90
Rule 2       95        10          80         90
______________________________________________________________________________
● In contrast, the distribution w ( c ) is a matter of belief,
not choice
● It measures the beliefs about the relative severity of
the different kinds of misclassification
● w ( c ) cannot vary from scorecard to scorecard
● I cannot say: if I use logistic regression then
misclassifying a good as a bad is ten times as serious
as the reverse,
but if I use linear discriminant analysis then it is a
hundred times as serious
● c is a property of the problem, not the scorecard
______________________________________________________________________________
Relationship between costs and sensitivity 1 − F1
depends on empirical score distributions f0 and f1
⇒ fundamental complementarity
ϕ(v) ~ choice of v distribution for averaging 1 − F1
w(c) ~ choice of c distribution for averaging costs
______________________________________________________________________________
It is OK for different people to choose different
w
- because they are interested in different aspects
of performance
______________________________________________________________________________
Conclusion
______________________________________________________________________________
What to do instead?
______________________________________________________________________________
Standard H measure: weight w(c) taken as a Beta(α, β) density with (α, β) = (2, 2)
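As a toy reconstruction of the idea (my own sketch, not the talk's example): fix w(c) to the Beta(2,2) density 6c(1−c), so the weight over cost ratios is the same for every scorecard, average the minimum cost-weighted loss over c, and normalise by the loss of the best trivial classifier. The score distributions, priors, and grids below are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
s0 = rng.normal(0.0, 1.0, 8000)   # class-0 scores (assumed)
s1 = rng.normal(1.5, 1.0, 8000)   # class-1 scores (assumed)
pi0, pi1 = 0.5, 0.5

# Empirical cdfs on a threshold grid.
ts = np.linspace(-4.0, 5.0, 901)
F0 = np.searchsorted(np.sort(s0), ts, side="right") / len(s0)
F1 = np.searchsorted(np.sort(s1), ts, side="right") / len(s1)

cs = np.linspace(0.005, 0.995, 199)
dc = cs[1] - cs[0]
w = 6.0 * cs * (1.0 - cs)   # Beta(2,2) density: fixed, not scorecard-dependent

# For each c, the minimum over t of c*pi0*(1 - F0(t)) + (1 - c)*pi1*F1(t).
inner = cs[:, None] * pi0 * (1.0 - F0) + (1.0 - cs)[:, None] * pi1 * F1
L = np.sum(inner.min(axis=1) * w) * dc

# Weighted loss of the best trivial rule (assign every case to one class).
L_max = np.sum(np.minimum(cs * pi0, (1.0 - cs) * pi1) * w) * dc

H = 1.0 - L / L_max   # higher is better, lies in [0, 1]
```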
______________________________________________________________________________
Example (but not needed !):
Whether or not someone will settle outstanding debts
immediately: 9 predictors, 6378 cases
______________________________________________________________________________
The Gini coefficient uses a different w*(c) distribution for each classifier.
______________________________________________________________________________
Some references:
______________________________________________________________________________