1 Introduction
Allocating objects to labels is a central problem in statistics. Most work
focuses on how to build, test and validate assignment rules. We are concerned
here with the a posteriori evaluation of a given rule, taking the error costs
into account. Such an evaluation can reveal several kinds of problems,
including the ability of a procedure to assign objects to classes, the quality
of the available data or the definition of the classes.
Allocations considered here could be descriptive or inductive. The first
situation consists in allocations as primary data. The second situation
concerns supervised learning methods such as discriminant analysis (LDA, QDA),
logistic discrimination, support vector machines (SVM), decision trees, and so
on.
For both kinds of allocation, the errors depend on the quality of the data.
For supervised classification methods in particular, errors can have various
causes, such as the predictive power of the data, the definition of the
predicted categorical variable, the choice of the learning sample, the
methodology used to build the allocation rule, the robustness of the rule over
time, and so on.
Our focus here is the study of misallocations when the associated costs are
non-uniform, so that the correct classification rate alone is insufficient.
68 Beninel and Grun Rehomme
In the statistical literature, non-uniform costs are considered only when
building and validating the decision rule (Breiman et al., 1984).
Unfortunately, in real situations, validated allocation rules that minimize
some cost function could still generate higher misclassification costs.
This paper deals with a post-learning situation, i.e. the allocation rule is
given and we have to evaluate it using a newly observed sample. Such a
situation arises frequently when confidentiality is required or when the
allocation and origin class are only available for a sample of individuals. In
insurance, for instance, the sample could be individuals subject to some risk
observed over the first year after subscription.
The proposed approach evaluates allocation rules using an index which
generalizes the correct classification rate. To compute the significance level
of an observed value of such an index, we consider the p-value associated with
a null hypothesis of acceptable cost. Determining the p-value leads to a
nonlinear programming problem which can be solved using available numerical
analysis packages. For the simple case of 3 cost values, we give an analytical
solution.
2 Methodology
Definitions and notations.
Let us denote by $\Omega$ the set of individuals and by $Y$ the associated
label variable, i.e.
\[ Y : \Omega \longrightarrow G = (g_1, \dots, g_q), \qquad \omega \longmapsto Y(\omega). \]
We denote by $C_{k,l}$ the error cost when assigning to label $g_k$ an
individual from $g_l$, by $\psi : \Omega \longrightarrow G$ the labelling rule
(i.e. $\psi(\omega)$ is the allocation class of the individual $\omega$), and
by $C$ the cost variable (i.e. $C(\omega) = C_{k,l}$ when $\psi(\omega) = g_k$
and $Y(\omega) = g_l$).
Consider the random variable Z(Ω −→ [0, 1]) where Z(ω) measures the level
of concordance between the allocation and the origin class of individual ω.
Given a stratified sample, the problem of interest is to infer a comparison
between allocations and origin classes for all individuals of Ω.
We propose a statistical index which measures the level of concordance be-
tween functions ψ and Y , using observed data. Such an index is a linear
combination of the concordance sampling variables.
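The mean of the misallocation cost is given by
$E(C) = \sum_{k,l} C_{k,l}\, p_{k,l}$, with
$p_{k,l} = P(\psi = g_k, Y = g_l)$, and for a fixed threshold
$\delta \in \mathbb{R}$ the null hypothesis of acceptable cost is
\[ H_0(C, \delta) : \Big\{ (p_{k,l}) \in \mathbb{R}_+^{q \times q},\ \sum_{k,l} p_{k,l} = 1,\ \sum_{k,l} C_{k,l}\, p_{k,l} \le \delta \Big\}. \]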
\[ T_{\psi,Y}(Z^{(n)}, \alpha, w) = \sum_{h=1}^{q} \sum_{j=1}^{n_h} w_{n,h}\, Z_{h,j}. \]
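As a minimal sketch of how the index could be evaluated on data, the Python snippet below computes the weighted double sum for a stratified sample. The function name `concordance_index`, the sample values and the choice of equal weights 1/n are illustrative assumptions, not part of the original methodology.

```python
def concordance_index(z_by_stratum, weights):
    """Weighted sum T = sum_h w_h * sum_j Z_{h,j}.

    z_by_stratum[h] holds the concordance values Z_{h,j} observed in
    stratum h; weights[h] is the weight w_{n,h} of that stratum.
    """
    if len(z_by_stratum) != len(weights):
        raise ValueError("one weight per stratum is required")
    return sum(w * sum(z_h) for z_h, w in zip(z_by_stratum, weights))


# Hypothetical sample with q = 2 strata; concordance values lie in [0, 1].
z = [[1.0, 1.0, 0.5, 0.0],
     [1.0, 0.5, 0.5, 1.0, 0.0, 1.0]]
n = sum(len(z_h) for z_h in z)   # total sample size
w = [1.0 / n] * len(z)           # equal weights (assumption)
t = concordance_index(z, w)
print(t)
```

With equal weights 1/n the index reduces to the sample mean of the concordance values.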
3 Probabilistic study
Asymptotic distribution.
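$T_{\psi,Y}(Z^{(n)}, \alpha, w)$ is a linear combination of multinomial
variables. Under the assumptions of independence of the sampling variables and
convergence of $\sum_h n_h w_{n,h}^2$, we obtain from the Lindeberg theorem
(Billingsley 1995, p. 359-362)
\[ \frac{T_{\psi,Y} - \mu_n}{\sigma_n} \overset{d}{\longrightarrow} N(0, 1), \quad n \to \infty, \qquad (1) \]
where $\mu_n(\bar p) = \sum_{k,l} \alpha_{k,l}\, p_{k,l}$ and
$\sigma_n^2(\bar p) = \Big( \sum_{k,l} \alpha_{k,l}^2\, p_{k,l} - \big( \sum_{k,l} \alpha_{k,l}\, p_{k,l} \big)^2 \Big) \sum_h n_h w_{n,h}^2$.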
From a practical point of view, using the previous result and the Gaussian
model it leads to requires careful consideration of atypical individuals.
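To make the normal approximation concrete, here is a small Python sketch that evaluates the mean and standard deviation appearing in (1) for a given probability vector and then standardizes an observed value. The function names, the flat indexing of the (k, l) cells and all numerical values are hypothetical, and the weights are assumed to satisfy $\sum_h n_h w_{n,h} = 1$.

```python
import math


def asymptotic_moments(p, alpha, n_h, w_h):
    """Return (mu_n, sigma_n) as written in result (1).

    p and alpha are flat lists over the (k, l) cells: p[i] is the cell
    probability and alpha[i] the concordance value attached to that cell.
    """
    mu = sum(a * pr for a, pr in zip(alpha, p))
    second_moment = sum(a * a * pr for a, pr in zip(alpha, p))
    scale = sum(n * w * w for n, w in zip(n_h, w_h))  # sum_h n_h * w_{n,h}^2
    return mu, math.sqrt((second_moment - mu * mu) * scale)


def normal_cdf(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


# Hypothetical setting: three concordance values and two strata.
alpha = [1.0, 0.5, 0.0]
p = [0.7, 0.2, 0.1]
n_h = [4, 6]
w_h = [0.1, 0.1]                 # 1/n with n = 10, so sum_h n_h * w_h = 1
mu, sigma = asymptotic_moments(p, alpha, n_h, w_h)

t = 0.65                         # hypothetical observed index value
z_score = (t - mu) / sigma
print(mu, sigma, normal_cdf(z_score))
```

The last printed value is the approximate probability of observing an index no larger than t under this particular probability vector.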
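\[ p\text{-value}(t) = \Phi\Bigg( \frac{1}{\sqrt{\sum_h n_h w_{n,h}^2}}\, \max F_{\alpha,t}(\bar p) \Bigg), \qquad (2) \]
where the maximum is the solution of
\[ \text{Problem (M.d)} : \begin{cases} \max F_{\alpha,t}(\bar p) = \dfrac{L(\bar p)}{\sqrt{Q(\bar p)}}, \\[1ex] \bar p \in H_0(C, \delta), \end{cases} \]
with, consistently with (1), $L(\bar p) = t - \sum_{k,l} \alpha_{k,l}\, p_{k,l}$
and $Q(\bar p) = \sum_{k,l} \alpha_{k,l}^2\, p_{k,l} - \big( \sum_{k,l} \alpha_{k,l}\, p_{k,l} \big)^2$.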
The order d of the problem corresponds to the number of distinct nonzero cost
values.
Here, the constraints of H0 (C, δ) are linear inequalities and, as L is a
linear function of the components of p̄ and Q a quadratic one, Fα,t (p̄) is a
nonlinear function. To solve the general problem (M.d), one can use or adapt
nonlinear programming procedures.
An ad hoc SCILAB program called VACC (VAlidation under Cost Constraints)
is available at http://www-math.univ-poitiers.fr/Stat/VACC.html.
The case of problem (M.2) is relatively frequent in real situations and has
simple analytical solutions. For this case, we deal w.l.o.g. with three cost
values $C_{k,l} \in \{1, C, 0\}$ with $1 > C > 0$. The concordance variable Z
is defined as follows.
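\[ Z(\omega) = \begin{cases} 1 & \text{if } C(\omega) = 0 \quad (p_1), \\ \alpha & \text{if } C(\omega) = C \quad (p_\alpha), \\ 0 & \text{if } C(\omega) = 1 \quad (1 - p_1 - p_\alpha). \end{cases} \]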
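\[ \text{Problem (M.2)} : \begin{cases} \max F_{\alpha,t}(\bar p) = F_{\alpha,t}(p_1, p_\alpha) = \dfrac{t - p_1 - \alpha p_\alpha}{\sqrt{p_1 + \alpha^2 p_\alpha - (p_1 + \alpha p_\alpha)^2}}, \\[1ex] p_1 \ge 0,\ p_\alpha \ge 0,\ p_1 + p_\alpha \le 1, \\[1ex] C p_\alpha + (1 - p_1 - p_\alpha) \le \delta. \end{cases} \]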
Lemma 1. For t, α ∈ [0, 1] and C, δ such that 1 > C > δ > 0, the maximum
of Fα,t (x, y) is attained for (C − 1)y − x + 1 = δ.
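Let us set $r = \frac{\delta - 1}{C - 1}$ and $s = \frac{1}{C - 1}$. For
$(x, y)$ such that $y = sx + r$ (i.e. the cost constraint at its boundary),
$F_{\alpha,t}(x, y) = G(x)$ where
\[ G(x) = \frac{-(1 + \alpha s)x + t - \alpha r}{\sqrt{-(1 + \alpha s)^2 x^2 + (1 + \alpha^2 s - 2\alpha r - 2\alpha^2 r s)x + \alpha^2 (r - r^2)}}. \qquad (3) \]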
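The reduction to one variable can be checked numerically. The Python sketch below implements G from (3), compares it with F on the boundary line, and maximizes G by a simple scan. The search interval [1 − δ/C, 1 − δ] is derived here from the constraints p_α ≥ 0 and p_1 + p_α ≤ 1 and is an assumption of this sketch, as are the function names, the sample size n = 10, the weights 1/n, the requirement α < 1 and the parameter values.

```python
import math


def F(x, y, alpha, t):
    """Objective of problem (M.2) with x = p_1 and y = p_alpha."""
    mean = x + alpha * y
    return (t - mean) / math.sqrt(x + alpha ** 2 * y - mean ** 2)


def G(x, alpha, t, C, delta):
    """One-variable objective (3) on the boundary y = s*x + r."""
    r = (delta - 1.0) / (C - 1.0)
    s = 1.0 / (C - 1.0)
    num = -(1.0 + alpha * s) * x + t - alpha * r
    den = (-(1.0 + alpha * s) ** 2 * x ** 2
           + (1.0 + alpha ** 2 * s - 2.0 * alpha * r - 2.0 * alpha ** 2 * r * s) * x
           + alpha ** 2 * (r - r ** 2))
    return num / math.sqrt(den)


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def maximize_G(alpha, t, C, delta, steps=3000):
    """Scan the feasible part of the boundary (assumed interval, alpha < 1)."""
    lo, hi = 1.0 - delta / C, 1.0 - delta
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    best_x = max(xs, key=lambda x: G(x, alpha, t, C, delta))
    return best_x, G(best_x, alpha, t, C, delta)


# Hypothetical values; n = 10 with weights 1/n gives the factor sqrt(n).
alpha, t, C, delta, n = 0.5, 0.65, 0.4, 0.2, 10
best_x, best_g = maximize_G(alpha, t, C, delta)
p_value = normal_cdf(math.sqrt(n) * best_g)
print(best_x, best_g, p_value)
```

A scan is used only to keep the sketch dependency-free; any one-dimensional optimizer could replace it.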
On the parameters.
The parameters α, δ, C are fixed by the users of the discussed methodology
and the value t is derived from data. The choice α = 0.5 leads to the UMP
test, when using the class of statistics $T_{\psi,Y}(Z^{(n)}, \alpha, w)$.
The choice $w_{n,h} = \frac{1}{n}$ minimizes $\sum_h n_h w_{n,h}^2$ when
$\sum_h n_h w_{n,h} = 1$. For such a choice of w,
$p\text{-value}(t) = \Phi(\sqrt{n}\, \max F_{\alpha,t}(\bar p))$ and
constitutes an upper bound for the other choices.
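The minimizing property of the equal weights can be verified with a short Cauchy-Schwarz argument. The derivation below is a sketch added for the reader, assuming $n = \sum_h n_h$; it is not taken from the original text.

```latex
% Assumptions: n = \sum_h n_h and the normalization \sum_h n_h w_{n,h} = 1.
\[
1 = \Big( \sum_h n_h w_{n,h} \Big)^2
  = \Big( \sum_h \sqrt{n_h} \cdot \sqrt{n_h}\, w_{n,h} \Big)^2
  \le \Big( \sum_h n_h \Big) \Big( \sum_h n_h w_{n,h}^2 \Big)
  = n \sum_h n_h w_{n,h}^2 .
\]
% Hence \sum_h n_h w_{n,h}^2 \ge 1/n, with equality if and only if
% w_{n,h} is the same for every h, i.e. w_{n,h} = 1/n.
```

With $w_{n,h} = 1/n$ the lower bound $1/n$ is attained, which gives the factor $\sqrt{n}$ in the p-value.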
5 Conclusion
References
ADAMS, N.M., HAND, D.J. (1999): Comparing classifiers when the misallocation
costs are uncertain. Pattern Recognition, 32, 1139-1147.
BILLINGSLEY, P. (1995): Probability and measure. Wiley Series in Probability
and Mathematical Statistics, New York, pp. 593.
BREIMAN, L., FRIEDMAN, J., OLSHEN, R., STONE, C. (1984): Classification
and regression trees. Wadsworth, Belmont.
GIBBONS, J.D., PRATT, J.W. (1975): P-values: interpretation and methodology.
The American Statistician, 29/1, 20-25.
GOVAERT, G. (2003): Analyse des données. Lavoisier, série "Traitement du signal
et de l'image", Paris, pp. 362.
SEBBAN, M., RABASEDA, S., BOUSSAID, O. (1996): Contribution of related
geometrical graph in pattern recognition. In: E. Diday, Y. Lechevallier and O.
Opitz (Eds.): Ordinal and symbolic data analysis. Springer, Berlin, 167-178.