
Evaluation of Allocation Rules

Under Some Cost Constraints

Farid Beninel (1) and Michel Grun Rehomme (2)

(1) Université de Poitiers, UMR CNRS 6086,
    IUT-STID, 8 rue Archimède, 79000 Niort, FRANCE
(2) Université Paris 2, ERMES, UMR CNRS 7017,
    92 rue d’Assas, 75006 Paris, FRANCE

Abstract. Allocation of individuals or objects to labels or classes is a central
problem in statistics, particularly in supervised classification methods such as
Linear and Quadratic Discriminant Analysis, Logistic Discrimination, Neural
Networks, Support Vector Machines, and so on. Misallocations occur when the
allocation class and the origin class differ. These errors can result from
different situations, such as the quality of the data, the definition of the
explained categorical variable or the choice of the learning sample. Generally,
the cost is not uniform across types of error, and consequently the percentage
of correctly classified objects alone is not sufficiently informative.
In this paper we deal with the evaluation of allocation rules taking into account the
error cost. We use a statistical index which generalizes the percentage of correctly
classified objects.

1 Introduction
Allocation of objects to labels is a central problem in statistics. Usually, the
discussed problems focus on the way to build the assignment rules, tests and
validation. We are concerned here with the posterior evaluation of a given
rule taking into account the error cost. Such an evaluation allows one to de-
tect different problems including the ability of a procedure to assign objects
to classes, the quality of available data or the definition of the classes.
Allocations considered here could be descriptive or inductive. The first
situation consists in allocations as primary data. The second situation concerns
supervised learning methods such as discriminant analysis (LDA, QDA), logistic
discrimination, support vector machines (SVM), decision trees, and so on.
For both kinds of allocation, the errors depend on the quality of the data.
For supervised classification methods especially, errors can result from
various causes such as the predictive ability of the data, the definition
of the predicted categorical variable, the choice of the learning sample, the
methodology used to build the allocation rule, the robustness of the rule over
time, and so on.
Our point here is the study of misallocations when the associated costs
are non-uniform, so that using only the correctly classified rate is
insufficient.
68 Beninel and Grun Rehomme

In the statistical literature, the hypothesis of a non-uniform cost is considered
only when building and validating the decision rule (Breiman et al., 1984).
Unfortunately, in real situations validated allocation rules minimizing some
cost function could generate higher misclassification costs.
This paper deals with a post-learning situation, i.e. the allocation rule is given
and we have to evaluate it using a newly observed sample. Such a situation is
frequently encountered when confidentiality is required or when allocation
and origin class are both observed only for a sample of individuals. In insurance,
for instance, the sample could be individuals subject to some risk observed
over the first year from subscription.
The proposed approach consists of evaluating allocation rules, using an in-
dex which is some generalization of the correctly classified rate. To compute
the significance level of an observed value of such an index, we consider the
p-value associated with a null hypothesis of acceptable cost.
Determining the p-value leads to a nonlinear programming problem, which can be
solved using available numerical analysis packages. For the
simple case of three cost values, we give an analytical solution.

2 Methodology
Definitions and notation.
Let us denote by $\Omega$ the set of individuals and by $Y$ the associated label
variable, i.e.

$$Y : \Omega \to G = (g_1, \dots, g_q), \qquad \omega \mapsto Y(\omega).$$

We denote by $C_{k,l}$ the error cost when assigning to label $g_k$ an individual
from $g_l$, by $\psi : \Omega \to G$ the labelling rule (i.e. $\psi(\omega)$ is the
allocation class of the individual $\omega$), and by $C$ the cost variable (i.e.
$C(\omega) = C_{k,l}$ when $\psi(\omega) = g_k$ and $Y(\omega) = g_l$).
Consider the random variable $Z : \Omega \to [0, 1]$, where $Z(\omega)$ measures the
level of concordance between the allocation and the origin class of individual $\omega$.
Given a stratified sample, the problem of interest is to infer a comparison
between allocations and origin classes for all individuals of Ω.
We propose a statistical index which measures the level of concordance be-
tween functions ψ and Y , using observed data. Such an index is a linear
combination of the concordance sampling variables.

Acceptable cost hypothesis.

Let us denote by $\{(\alpha_{k,l}, p_{k,l}) : k, l = 1, \dots, q\}$ the probability
distribution of the variable $Z$, i.e. the $\alpha_{k,l}$ are the possible values of
$Z$ and the $p_{k,l}$ the associated probabilities.
Obviously, the cost of a decision decreases as the associated level of
concordance rises. Hence, $\alpha_{k,l} \ge \alpha_{i,j}$ when $C_{k,l} \le C_{i,j}$.


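The mean of the misallocation cost is given by $E(C) = \sum_{k,l} C_{k,l}\, p_{k,l}$,
and for a fixed threshold $\delta \in \mathbb{R}$ the null hypothesis of acceptable cost is

$$H_0(C, \delta) : \Big\{ (p_{k,l}) \in \mathbb{R}_+^{q \times q} :\ \sum_{k,l} p_{k,l} = 1,\ \sum_{k,l} C_{k,l}\, p_{k,l} \le \delta \Big\}.$$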
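The acceptable cost hypothesis can be checked mechanically for a candidate distribution. The following is a minimal sketch, assuming the costs and probabilities are stored as q x q nested lists indexed as (allocated class k, origin class l); the function names and the two-class numbers are illustrative only and do not come from the paper.

```python
def mean_cost(costs, probs):
    """Mean misallocation cost: sum over (k, l) of C[k][l] * p[k][l]."""
    return sum(c * p
               for c_row, p_row in zip(costs, probs)
               for c, p in zip(c_row, p_row))


def in_acceptable_cost_set(costs, probs, delta, tol=1e-9):
    """True when probs is a probability distribution whose mean cost is <= delta."""
    flat = [p for row in probs for p in row]
    is_distribution = all(p >= 0 for p in flat) and abs(sum(flat) - 1.0) <= tol
    return is_distribution and mean_cost(costs, probs) <= delta + tol


# Hypothetical two-class illustration (values are not from the paper).
costs = [[0.0, 1.0],
         [0.5, 0.0]]
probs = [[0.60, 0.05],
         [0.10, 0.25]]

print(mean_cost(costs, probs))                     # mean cost, close to 0.10
print(in_acceptable_cost_set(costs, probs, 0.15))  # inside H0 for delta = 0.15
print(in_acceptable_cost_set(costs, probs, 0.05))  # outside H0 for delta = 0.05
```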

Let us denote by $Z_{h,j}$ ($h = 1, \dots, q$; $j = 1, \dots, n_h$) the sampling
variables, distributed as $Z$.
We consider, here, the non-independent case derived from sampling without
replacement and with replacement when group sizes are large.

Generalized correctly classified rate.

As a statistical index measuring the concordance between the functions $Y$ and $\psi$,
given a sample $Z^{(n)} = (Z_{h,j})_{h,j}$, we propose

$$T_{\psi,Y}(Z^{(n)}, \alpha, w) = \sum_{h=1}^{q} \sum_{j=1}^{n_h} w_{n,h}\, Z_{h,j}.$$

Here $n = n_1 + \dots + n_q$ and $w = (w_{n,h})_h$ is a weighting parameter, where
observations sampled from the same group are weighted identically. We suppose, without
loss of generality (w.l.o.g.), positive components and $\sum_{h=1}^{q} n_h w_{n,h} = 1$;
consequently, $T_{\psi,Y}(Z^{(n)}, \alpha, w) \in [0, 1]$.
Note that, for equal weighting and a unique type of error, we recover the
classical correctly classified rate, $T_{\psi,Y}(Z^{(n)}, (\delta_{kl}), (\tfrac{1}{n}))$,
where $\delta_{kl}$ is the Kronecker delta.
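To make the index concrete, here is a minimal sketch of its computation, assuming the observed concordance values are grouped by origin class in nested lists and that one weight is supplied per group. The function name and the sample values are hypothetical.

```python
def generalized_rate(z_by_group, weights=None):
    """Weighted sum of concordance values, one weight per group.

    z_by_group[h] holds the n_h concordance values observed in group h.
    With weights=None every observation gets weight 1/n.
    """
    n = sum(len(group) for group in z_by_group)
    if weights is None:
        weights = [1.0 / n] * len(z_by_group)
    # Normalisation assumed in the text: sum_h n_h * w_h = 1.
    total = sum(len(group) * w for group, w in zip(z_by_group, weights))
    if abs(total - 1.0) > 1e-9:
        raise ValueError("weights must satisfy sum_h n_h * w_h = 1")
    return sum(w * sum(group) for group, w in zip(z_by_group, weights))


# Hypothetical sample with an intermediate concordance level of 0.5.
z_sample = [[1, 1, 0.5, 0], [1, 0, 1, 1, 0.5, 1]]
print(generalized_rate(z_sample))          # close to 0.7

# With 0/1 concordance and equal weights the index is the usual accuracy.
z_binary = [[1, 0, 1], [1, 1]]
print(generalized_rate(z_binary))          # close to 4/5
```

Passing explicit unequal weights lets one group count more than another, provided the normalisation constraint still holds.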

3 Probabilistic study

Asymptotic distribution.
$T_{\psi,Y}(Z^{(n)}, \alpha, w)$ is a linear combination of multinomial variables. Under
the assumptions of independence of the sampling variables and convergence of
$\sum_h n_h w_{n,h}^2$, we obtain from the Lindeberg theorem (Billingsley 1995, pp. 359-362)

$$\frac{T_{\psi,Y} - \mu_n}{\sigma_n} \xrightarrow{d} N(0, 1), \qquad n \to \infty, \tag{1}$$

where $\mu_n(\bar p) = \sum_{k,l} \alpha_{k,l}\, p_{k,l}$ and
$\sigma_n^2(\bar p) = \Big( \sum_{k,l} \alpha_{k,l}^2\, p_{k,l} - \big( \sum_{k,l} \alpha_{k,l}\, p_{k,l} \big)^2 \Big) \sum_h n_h w_{n,h}^2$.
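The two moments entering the normal approximation are simple to evaluate. The sketch below assumes the distribution of Z is given as flat lists of concordance levels and probabilities (a flattening of the q x q arrays used in the text); all names and numbers are illustrative.

```python
import math


def concordance_moments(levels, probs, group_sizes, weights):
    """Return (mu_n, sigma_n squared) of the index under the given distribution."""
    mean = sum(a * p for a, p in zip(levels, probs))
    second_moment = sum(a * a * p for a, p in zip(levels, probs))
    # Factor sum_h n_h * w_h^2 coming from the weighted sum of the Z_{h,j}.
    weight_scale = sum(n_h * w_h ** 2 for n_h, w_h in zip(group_sizes, weights))
    return mean, (second_moment - mean ** 2) * weight_scale


def standardized_statistic(t, levels, probs, group_sizes, weights):
    """(t - mu_n) / sigma_n, approximately N(0, 1) for large n."""
    mean, variance = concordance_moments(levels, probs, group_sizes, weights)
    return (t - mean) / math.sqrt(variance)


# Hypothetical setting: three concordance levels, two groups, equal weights 1/50.
levels = [1.0, 0.5, 0.0]
probs = [0.8, 0.1, 0.1]
group_sizes = [20, 30]
weights = [1 / 50, 1 / 50]

mu, var = concordance_moments(levels, probs, group_sizes, weights)
print(mu, var)   # close to 0.85 and 0.00205
print(standardized_statistic(0.90, levels, probs, group_sizes, weights))
```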

From a practical point of view, using the previous result and the resulting
Gaussian model requires careful consideration of atypical individuals.

The optimization problem to compute the p-value.

Given an observed value $t$ of $T_{\psi,Y}$, we deal with the computation of the
associated p-value. Here, the definition corresponding to the most powerful test is

$$\text{p-value}(t) = \max_{\bar p} \operatorname{prob}\big( T_{\psi,Y} \le t \mid \bar p \in H_0(C, \delta) \big).$$

Let $F_{\alpha,t}$ ($[0, 1] \to \mathbb{R}$) be the function such that

$$\frac{t - \mu_n(\bar p)}{\sigma_n(\bar p)} = \frac{1}{\sqrt{\sum_h n_h w_{n,h}^2}}\, F_{\alpha,t}(\bar p).$$

Using the asymptotic distribution, we obtain

$$\text{p-value}(t) = \Phi\Big( \frac{1}{\sqrt{\sum_h n_h w_{n,h}^2}}\, \max F_{\alpha,t}(\bar p) \Big), \tag{2}$$

where $\Phi$ is the CDF of the $N(0, 1)$ distribution.
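Once the maximum of the function F has been found, formula (2) is a one-line evaluation. A minimal sketch, assuming the maximum is supplied by the caller and using the standard library error function for the normal CDF; the function names are hypothetical.

```python
import math


def normal_cdf(x):
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def p_value_from_max_f(max_f, group_sizes, weights):
    """Formula (2): Phi(max F / sqrt(sum_h n_h * w_h^2))."""
    weight_scale = sum(n_h * w_h ** 2 for n_h, w_h in zip(group_sizes, weights))
    return normal_cdf(max_f / math.sqrt(weight_scale))


# Equal weights 1/n with n = 50, so the scaling factor is sqrt(50).
group_sizes = [20, 30]
weights = [1 / 50, 1 / 50]
print(p_value_from_max_f(0.0, group_sizes, weights))               # 0.5
print(p_value_from_max_f(1 / math.sqrt(6), group_sizes, weights))  # about 0.998
```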

The calculation of the p-value leads to the following optimization problem:

$$\text{Problem (M.d)}: \quad \begin{cases} \max F_{\alpha,t}(\bar p) = \dfrac{L(\bar p)}{\sqrt{Q(\bar p)}}, \\[4pt] \bar p \in H_0(C, \delta). \end{cases}$$

The order $d$ of the problem corresponds to the number of distinct non-null
cost values.
Here, the constraints of $H_0(C, \delta)$ are linear inequalities; as $L$ is a linear
function of the components of $\bar p$ and $Q$ a quadratic one, $F_{\alpha,t}(\bar p)$
is a nonlinear function. To solve the general problem (M.d), one could use or adapt
nonlinear programming optimization procedures.
An ad hoc SCILAB program called VACC (VAlidation under Cost Constraints)
is available at http://www-math.univ-poitiers.fr/Stat/VACC.html.

4 Application: The case of problem (M.2)

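The case of problem (M.2) is relatively frequent in real situations and has
simple analytical solutions. For this case, we deal with three cost values,
without loss of generality $C_{k,l} \in \{1, C, 0\}$ with $1 > C > 0$. The concordance
variable $Z$ is defined as follows:

$$Z(\omega) = \begin{cases} 1 & \text{if } C(\omega) = 0 \quad (\text{probability } p_1), \\ \alpha & \text{if } C(\omega) = C \quad (\text{probability } p_\alpha), \\ 0 & \text{if } C(\omega) = 1 \quad (\text{probability } 1 - p_1 - p_\alpha), \end{cases}$$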

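and the optimization problem is

$$\text{Problem (M.2)}: \quad \begin{cases} \max F_{\alpha,t}(\bar p) = F_{\alpha,t}(p_1, p_\alpha) = \dfrac{t - p_1 - \alpha p_\alpha}{\sqrt{p_1 + \alpha^2 p_\alpha - (p_1 + \alpha p_\alpha)^2}}, \\[6pt] p_1 \ge 0,\ p_\alpha \ge 0,\ p_1 + p_\alpha \le 1, \\[2pt] C p_\alpha + (1 - p_1 - p_\alpha) \le \delta. \end{cases}$$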

We derive the solution of problem (M.2) using the following result.

Lemma 1. For $t, \alpha \in [0, 1]$ and $C, \delta$ such that $1 > C > \delta > 0$, the
maximum of $F_{\alpha,t}(x, y)$ is attained for $(C - 1)y - x + 1 = \delta$.

Let us set $r = \frac{\delta - 1}{C - 1}$ and $s = \frac{1}{C - 1}$. For $(x, y)$ such that
$y = sx + r$ (i.e. the cost constraint at its boundary), $F_{\alpha,t}(x, y) = G(x)$, where

$$G(x) = \frac{-(1 + \alpha s)x + t - \alpha r}{\sqrt{-(1 + \alpha s)^2 x^2 + (1 + \alpha^2 s - 2\alpha r - 2\alpha^2 r s)x + \alpha^2 (r - r^2)}}. \tag{3}$$

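We establish the following result.

Proposition 1. Let $\Delta = \Delta(\alpha, C, t) = (1 - \alpha s)(1 - \alpha^2 s - 2t + 2\alpha s t)$ and

$$x_0(t) = -\frac{\alpha^3 r s - \alpha^2 s t (2r - 1) - 2\alpha^2 r + 2\alpha r t + \alpha r - t}{\Delta(\alpha, C, t)}.$$

Then the following result holds:

$$\max F_{\alpha,t}(x, y) = \begin{cases} G(x_0(t)) & \text{if } x_0(t) \in [1 - \tfrac{\delta}{C},\, 1 - \delta] \text{ and } \Delta < 0, \\[4pt] \max\big( G(1 - \tfrac{\delta}{C}),\, G(1 - \delta) \big) & \text{elsewhere.} \end{cases}$$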
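As a numerical cross-check of the maximum, one can evaluate F along the cost-constraint boundary and search the interval of feasible x directly. The sketch below is a brute-force grid search, not the closed form of Proposition 1; it assumes the conditions of Lemma 1 (1 > C > δ > 0), and the function names and grid size are arbitrary choices.

```python
import math


def f_m2(x, y, alpha, t):
    """Objective of problem (M.2); returns None where the variance is not positive."""
    variance = x + alpha ** 2 * y - (x + alpha * y) ** 2
    if variance <= 0:
        return None
    return (t - x - alpha * y) / math.sqrt(variance)


def max_f_on_boundary(alpha, cost, delta, t, grid_points=10001):
    """Grid search for max F on y = s*x + r with x in [1 - delta/C, 1 - delta]."""
    if not (1 > cost > delta > 0):
        raise ValueError("requires 1 > C > delta > 0")
    r = (delta - 1) / (cost - 1)
    s = 1 / (cost - 1)
    lo, hi = 1 - delta / cost, 1 - delta
    best_x, best_value = None, -math.inf
    for i in range(grid_points):
        x = lo + (hi - lo) * i / (grid_points - 1)
        value = f_m2(x, s * x + r, alpha, t)
        if value is not None and value > best_value:
            best_x, best_value = x, value
    return best_x, best_value


# alpha = C = 0.5: the numerator is constant on the boundary, so the maximum
# sits at an endpoint of the interval.
print(max_f_on_boundary(0.5, 0.5, 0.2, 0.90))  # x near 0.6, value near 0.408
print(max_f_on_boundary(0.5, 0.5, 0.2, 0.75))  # x near 0.8, value near -0.125
```

A grid search is crude but makes no assumption on the sign of Δ or on the location of the stationary point, which is why it is convenient for checking an analytical solution.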

On the parameters.
The parameters $\alpha$, $\delta$ and $C$ are fixed by the users of the discussed
methodology, and the value $t$ is derived from data. The choice $\alpha = 0.5$ leads
to the UMP test when using the class of statistics $T_{\psi,Y}(Z^{(n)}, \alpha, w)$.

The choice $w_{n,h} = \frac{1}{n}$ minimizes $\sum_h n_h w_{n,h}^2$ when
$\sum_h n_h w_{n,h} = 1$. For such a choice of $w$,
$\text{p-value}(t) = \Phi(\sqrt{n}\, \max F_{\alpha,t}(\bar p))$, which constitutes an upper
bound for the other choices.
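The minimizing property of equal weights can be checked in one line with the Cauchy-Schwarz inequality. The derivation below is a sketch that uses only the normalisation constraint and $n = \sum_h n_h$:

```latex
\begin{align*}
1 = \Big(\sum_h n_h w_{n,h}\Big)^2
  = \Big(\sum_h \sqrt{n_h}\cdot\sqrt{n_h}\,w_{n,h}\Big)^2
  \le \Big(\sum_h n_h\Big)\Big(\sum_h n_h w_{n,h}^2\Big)
  = n \sum_h n_h w_{n,h}^2 .
\end{align*}
```

Hence $\sum_h n_h w_{n,h}^2 \ge \frac{1}{n}$, with equality exactly when $w_{n,h}$ does not depend on $h$, i.e. $w_{n,h} = \frac{1}{n}$. In that case the scaling factor $1/\sqrt{\sum_h n_h w_{n,h}^2}$ in formula (2) equals $\sqrt{n}$.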

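Example: $\alpha = C = 0.5$ and $\delta < C$

The problem here consists of maximizing

$$F_{\alpha,t}(x, y) = \frac{t - x - 0.5y}{\sqrt{x + 0.25y - (x + 0.5y)^2}}$$

under the system of constraints

$$H_0 = \{(x, y) : y > 0,\ x + y \le 1,\ x + 0.5y \ge 1 - \delta\}.$$

We derive from the previous proposition

$$\sqrt{n}\, \max F_{\alpha,t} = \begin{cases} \dfrac{\sqrt{n}\,(t - 1 + \delta)}{\sqrt{0.5\delta - \delta^2}} & \text{if } t > 1 - \delta, \\[8pt] \dfrac{\sqrt{n}\,(t - 1 + \delta)}{\sqrt{\delta - \delta^2}} & \text{elsewhere,} \end{cases}$$

and $\max \text{p-value}(t) = \Phi(\sqrt{n}\, \max F_{\alpha,t})$.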
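The closed form of this example is easy to evaluate numerically. The following minimal sketch implements the two-branch expression as stated, with equal weights 1/n, and applies the normal CDF to obtain the p-value; the function names are hypothetical.

```python
import math


def normal_cdf(x):
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def example_max_statistic(t, delta, n):
    """sqrt(n) * max F for alpha = C = 0.5 and 0 < delta < 0.5."""
    if not 0 < delta < 0.5:
        raise ValueError("requires 0 < delta < C = 0.5")
    numerator = math.sqrt(n) * (t - 1 + delta)
    if t > 1 - delta:
        # Positive numerator: the smaller variance gives the larger ratio.
        return numerator / math.sqrt(0.5 * delta - delta ** 2)
    # Non-positive numerator: the larger variance gives the larger ratio.
    return numerator / math.sqrt(delta - delta ** 2)


def example_p_value(t, delta, n):
    return normal_cdf(example_max_statistic(t, delta, n))


for delta in (0.20, 0.15):
    stat = example_max_statistic(0.90, delta, 50)
    print(delta, round(stat, 3), round(example_p_value(0.90, delta, 50), 3))
```

For n = 50 and t = 0.90 this gives about 2.887 and 1.543 for δ = 0.20 and δ = 0.15, matching the corresponding rows of Table 1. The table rows with t < 1 − δ coincide numerically with the first branch's denominator, so this sketch, which follows the piecewise expression as stated, returns different values for those rows.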

As an illustration, we give the following table. The considered t values relate
to real situations. The associated p-value is calculated for some δ values.

Table 1. Computations for n = 50

t value   δ value   √n max Fα,t   max p-value(t)
0.90      0.20          2.887         0.998
0.90      0.15          1.543         0.939
0.90      0.10          0.000         0.500
0.90      0.05         -2.357         0.009
0.75      0.20         -1.443         0.074
0.75      0.15         -3.086         0.001
0.75      0.10         -5.303         0.000
0.75      0.05         -9.428         0.000

5 Conclusion

Using the generalized correctly classified rate, the interpretation of an
observed value takes non-uniform error costs into account. As forthcoming
applications and extensions, we have in mind:
• The study of the same concordance statistic in the non-Gaussian case.
Such a case appears when n is small or when weights are unbalanced.
• The study of other statistics measuring the quality of an allocation rule,
under cost constraints.
• The extension of the cost notion to take into account the structure of
classes. This structure is sometimes given by proximity measures between
classes, as for the terminal nodes of a decision tree or for ordered
classes.

References
ADAMS, N.M., HAND, D.J. (1999): Comparing classifiers when the misallocation
costs are uncertain. Pattern Recognition, 32, 1139-1147.
BILLINGSLEY, P. (1990): Probability and measure. Wiley Series in Probability and
Mathematical Statistics, New York, pp. 593.
BREIMAN, L., FRIEDMAN, J., OLSHEN, R., STONE, C. (1984): Classification
and regression trees. Wadsworth, Belmont.
GIBBONS, J.D., PRATT, J.W. (1975): p-value: interpretation and methodology.
JASA, 29/1, 20-25.
GOVAERT, G. (2003): Analyse des données. Lavoisier, série ”Traitement du signal
et de l’image”, Paris, pp. 362.
SEBBAN, M., RABASEDA, S., BOUSSAID, O. (1996): Contribution of related
geometrical graph in pattern and recognition. In: E. Diday, Y. Lechevallier and O.
Opitz (Eds.): Ordinal and symbolic data analysis. Springer, Berlin, 167–178.
