Evaluation of Allocation Rules
Under Some Cost Constraints
Farid Beninel¹ and Michel Grun Rehomme²

¹ Université de Poitiers, UMR CNRS 6086,
  IUT-STID, 8 rue Archimède, 79000 Niort, France
² Université Paris 2, ERMES, UMR CNRS 7017,
  92 rue d'Assas, 75006 Paris, France
Abstract. Allocation of individuals or objects to labels or classes is a central
problem in statistics, particularly in supervised classification methods such as
Linear and Quadratic Discriminant Analysis, Logistic Discrimination, Neural
Networks, Support Vector Machines, and so on. Misallocations occur when the
allocation class and the origin class differ. These errors can arise from
various sources, such as the quality of the data, the definition of the
explained categorical variable or the choice of the learning sample. In general,
the cost is not uniform across error types, and consequently the percentage of
correctly classified objects alone is not sufficiently informative.
In this paper we deal with the evaluation of allocation rules taking into
account the error cost. We use a statistical index which generalizes the
percentage of correctly classified objects.
1 Introduction
Allocation of objects to labels is a central problem in statistics. Usually, the
problems discussed focus on how to build, test and validate the assignment
rules. We are concerned here with the posterior evaluation of a given rule,
taking into account the error cost. Such an evaluation allows one to detect
different problems, including the ability of a procedure to assign objects to
classes, the quality of the available data or the definition of the classes.
The allocations considered here can be descriptive or inductive. The first
situation consists in allocations as primary data. The second situation concerns
supervised learning methods such as discriminant analysis (LDA, QDA), logistic
discrimination, support vector machines (SVM), decision trees, and so on.
For these two ways of allocation, the errors depend on the quality of the data.
For supervised classification methods in particular, errors can result from
various causes such as the ability of the data to predict, the definition of
the predicted categorical variable, the choice of the learning sample, the
methodology used to build the allocation rule, the robustness of the rule over
time, and so on.
Our point here is the study of misallocations when the associated costs are
non-uniform, so that using only the correctly classified rate is insufficient.
In the statistical literature, the hypothesis of a non-uniform cost is only
considered when elaborating and validating the decision rule (Breiman et al.,
1984). Unfortunately, in real situations, validated allocation rules that
minimize some cost function can still generate higher misclassification costs.
This paper deals with a post-learning situation, i.e. the allocation rule is
given and we have to evaluate it using a newly observed sample. We frequently
encounter such a situation when confidentiality is required or when the
allocation and origin classes are observed only for a sample of individuals. In
insurance, for instance, the sample could be individuals subject to some risk
observed over the first year from subscription.
The proposed approach consists of evaluating allocation rules using an index
which is a generalization of the correctly classified rate. To compute the
significance level of an observed value of such an index, we consider the
p-value associated with a null hypothesis of acceptable cost.
The determination of the p-value leads to a nonlinear programming problem that
can be solved using available numerical analysis packages. For the simple case
of 3 cost values, we give an analytical solution.
2 Methodology
Definitions and notations.
Let us denote by $\Omega$ the set of individuals and by $Y$ the associated label
variable, i.e.
$$Y : \Omega \longrightarrow G = (g_1, \dots, g_q), \qquad \omega \longmapsto Y(\omega).$$
We denote by $C_{k,l}$ the error cost when assigning to label $g_k$ an individual
from $g_l$, by $\psi$ ($\Omega \longrightarrow G$) the labelling rule (i.e.
$\psi(\omega)$ is the allocation class of the individual $\omega$) and by $C$ the
cost variable (i.e. $C(\omega) = C_{k,l}$ when $\psi(\omega) = g_k$ and
$Y(\omega) = g_l$).
Consider the random variable $Z$ ($\Omega \longrightarrow [0, 1]$) where
$Z(\omega)$ measures the level of concordance between the allocation and the
origin class of individual $\omega$.
Given a stratified sample, the problem of interest is to infer a comparison
between allocations and origin classes for all individuals of $\Omega$.
We propose a statistical index which measures the level of concordance between
the functions $\psi$ and $Y$, using observed data. Such an index is a linear
combination of the concordance sampling variables.
Acceptable cost hypothesis.
Let us denote by $\{(\alpha_{k,l}, p_{k,l}) : k, l = 1, \dots, q\}$ the
probability distribution of the variable $Z$, i.e. $(\alpha_{k,l})$ are the
possible values of $Z$ and $(p_{k,l})$ the associated probabilities.
Obviously, the cost of a decision decreases as the associated level of
concordance rises. Hence, $\alpha_{k,l} \ge \alpha_{i,j}$ when
$C_{k,l} \le C_{i,j}$.
The mean of the misallocation cost is given by
$E(C) = \sum_{k,l} C_{k,l}\, p_{k,l}$ and, for a fixed threshold
$\delta \in \mathbb{R}$, the null hypothesis of acceptable cost is
$$H_0(C, \delta) : \Big\{ (p_{k,l}) \in \mathbb{R}_+^{q \times q},\ \sum_{k,l} p_{k,l} = 1,\ \sum_{k,l} C_{k,l}\, p_{k,l} \le \delta \Big\}.$$
Let us denote by $Z_{h,j}$ ($h = 1, \dots, q$; $j = 1, \dots, n_h$) the sampling
variables distributed as $Z$.
We consider, here, the non independent case derived from sampling without
replacement and with replacement when group sizes are large.
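To make the acceptable cost hypothesis concrete, the following minimal sketch checks whether a candidate probability table belongs to $H_0(C, \delta)$. The function names `mean_cost` and `satisfies_h0`, the tolerances and the two-class cost and probability tables are all hypothetical choices made for illustration; they do not come from the paper.

```python
def mean_cost(costs, probs):
    """Return sum_{k,l} C_{k,l} * p_{k,l} for two q x q nested lists."""
    return sum(c * p
               for c_row, p_row in zip(costs, probs)
               for c, p in zip(c_row, p_row))


def satisfies_h0(costs, probs, delta, tol=1e-9):
    """Check the three conditions defining H0(C, delta)."""
    flat = [p for row in probs for p in row]
    nonnegative = all(p >= 0.0 for p in flat)
    sums_to_one = abs(sum(flat) - 1.0) <= tol
    acceptable = mean_cost(costs, probs) <= delta + tol
    return nonnegative and sums_to_one and acceptable


# Hypothetical two-class example: rows index the allocated class g_k,
# columns the origin class g_l; correct allocations cost nothing.
costs = [[0.0, 1.0],
         [0.5, 0.0]]
probs = [[0.60, 0.05],
         [0.10, 0.25]]

print(mean_cost(costs, probs))           # expected cost of the rule
print(satisfies_h0(costs, probs, 0.15))  # is the cost acceptable at delta = 0.15?
```

With these illustrative numbers the expected cost is 1 × 0.05 + 0.5 × 0.10, so the table is acceptable for any threshold at or above that value.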
Generalized correctly classified rate.
As a statistical index to measure the concordance between the functions $Y$ and
$\psi$, given a sample $Z^{(n)} = (Z_{h,j})_{h,j}$, we propose
$$T_{\psi,Y}(Z^{(n)}, \alpha, w) = \sum_{h=1}^{q} \sum_{j=1}^{n_h} w_{n,h}\, Z_{h,j}.$$
Here $n = n_1 + \dots + n_q$ and $w = (w_{n,h})_h$ is a weighting parameter where
samplings from the same group are weighted identically. We suppose, without
loss of generality (w.o.l.g.), positive components and
$\sum_{h=1}^{q} n_h w_{n,h} = 1$; consequently,
$T_{\psi,Y}(Z^{(n)}, \alpha, w) \in [0, 1]$.
Note that, for equal weighting and a unique type of error, we recover the
classical correctly classified rate
$T_{\psi,Y}(Z^{(n)}, (\delta_{kl}), (\frac{1}{n}))$, where $\delta_{kl}$ is the
Kronecker coefficient.
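As an illustration of the index, the sketch below computes $T_{\psi,Y}$ from concordance values grouped by origin class and shows that 0/1 concordance with equal weights gives the usual correct classification rate. The function name `generalized_rate` and the sample values are hypothetical and only serve the example.

```python
def generalized_rate(z_by_group, weights):
    """T = sum_h sum_j w_h * Z_{h,j}; z_by_group[h] lists the Z values of group h."""
    sizes = [len(z) for z in z_by_group]
    total_weight = sum(n_h * w_h for n_h, w_h in zip(sizes, weights))
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("weights must satisfy sum_h n_h * w_h = 1")
    return sum(w_h * sum(z) for z, w_h in zip(z_by_group, weights))


# Hypothetical concordance values in {0, 0.5, 1} for q = 2 groups.
z_by_group = [[1.0, 1.0, 0.5, 0.0],             # n_1 = 4
              [1.0, 0.5, 1.0, 1.0, 0.0, 1.0]]   # n_2 = 6
n = sum(len(z) for z in z_by_group)
t_equal = generalized_rate(z_by_group, [1.0 / n] * len(z_by_group))

# With 0/1 concordance and equal weights, T is the proportion of correct
# allocations (here 4 correct out of 5).
hits = [[1.0, 0.0, 1.0], [1.0, 1.0]]
t_classical = generalized_rate(hits, [1.0 / 5] * 2)

print(t_equal, t_classical)
```

The normalization check mirrors the constraint $\sum_h n_h w_{n,h} = 1$, which is what keeps the index in $[0, 1]$.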
3 Probabilistic study
Asymptotic distribution.
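$T_{\psi,Y}(Z^{(n)}, \alpha, w)$ is a linear combination of multinomial
variables. Under the assumptions of independence of the sampling variables and
convergence of $\sum_h n_h w_{n,h}^2$, we obtain from the Lindeberg theorem
(Billingsley 1995, p. 359-362)
$$\frac{T_{\psi,Y} - \mu_n}{\sigma_n} \xrightarrow{\ d\ } N(0, 1), \qquad n \to \infty, \tag{1}$$
where $\mu_n(\bar p) = \sum_{k,l} \alpha_{k,l}\, p_{k,l}$ and
$$\sigma_n^2(\bar p) = \Big( \sum_{k,l} \alpha_{k,l}^2\, p_{k,l} - \big( \sum_{k,l} \alpha_{k,l}\, p_{k,l} \big)^2 \Big) \sum_h n_h w_{n,h}^2 .$$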
From a practical point of view, using the previous result and the resulting
Gaussian model requires careful handling of atypical individuals.
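The moments in (1) are straightforward to evaluate numerically. The following sketch, with hypothetical function names and illustrative inputs, computes $\mu_n$ and $\sigma_n$ for a given distribution of $Z$ and standardizes an observed value $t$; it assumes the values $\alpha_{k,l}$ and probabilities $p_{k,l}$ are passed as flat lists in matching order.

```python
import math


def asymptotic_moments(alphas, probs, sizes, weights):
    """Return (mu_n, sigma_n) as defined after formula (1)."""
    mu = sum(a * p for a, p in zip(alphas, probs))
    second_moment = sum(a * a * p for a, p in zip(alphas, probs))
    scale = sum(n_h * w_h ** 2 for n_h, w_h in zip(sizes, weights))
    return mu, math.sqrt((second_moment - mu ** 2) * scale)


def standardize(t, mu, sigma):
    """Standardized statistic (t - mu_n) / sigma_n."""
    return (t - mu) / sigma


# Illustrative inputs: Z takes the values 1, 0.5, 0 and two groups of
# sizes 20 and 30 are weighted equally (w = 1/50).
alphas = [1.0, 0.5, 0.0]
probs = [0.6, 0.4, 0.0]
sizes = [20, 30]
weights = [1.0 / 50, 1.0 / 50]

mu, sigma = asymptotic_moments(alphas, probs, sizes, weights)
print(mu, sigma, standardize(0.90, mu, sigma))
```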
The optimization problem to compute the p-value.
Given an observed value $t$ of $T_{\psi,Y}$, we deal with the computation of the
associated p-value. Here, the definition corresponding to the most powerful
test is
$$p\text{-value}(t) = \max_{\bar p}\ \mathrm{prob}\big(T_{\psi,Y} \le t \mid \bar p \in H_0(C, \delta)\big).$$
Let $F_{\alpha,t}$ ($[0, 1] \longrightarrow \mathbb{R}$) be the function such that
$$\frac{t - \mu_n(\bar p)}{\sigma_n(\bar p)} = \frac{1}{\sqrt{\sum_h n_h w_{n,h}^2}}\, F_{\alpha,t}(\bar p).$$
Using the asymptotic distribution, we obtain
$$p\text{-value}(t) = \Phi\Big( \frac{1}{\sqrt{\sum_h n_h w_{n,h}^2}}\ \max F_{\alpha,t}(\bar p) \Big), \tag{2}$$
where $\Phi$ is the CDF of the $N(0, 1)$ distribution.
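Once the maximum of $F_{\alpha,t}$ is known, formula (2) reduces to one evaluation of the normal CDF. The sketch below is a minimal illustration using only the Python standard library; `normal_cdf` and `asymptotic_p_value` are hypothetical helper names, and the value passed as `max_f` is an illustrative number rather than a result taken from the paper.

```python
import math


def normal_cdf(x):
    """CDF of N(0, 1), written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def asymptotic_p_value(max_f, sizes, weights):
    """Formula (2): Phi(max F / sqrt(sum_h n_h * w_h^2))."""
    scale = sum(n_h * w_h ** 2 for n_h, w_h in zip(sizes, weights))
    return normal_cdf(max_f / math.sqrt(scale))


# Illustrative call: two groups of sizes 20 and 30 with equal weights 1/50,
# so the scaling factor 1 / sqrt(sum_h n_h w_h^2) equals sqrt(50).
max_f = 0.1 / math.sqrt(0.06)
p = asymptotic_p_value(max_f, [20, 30], [1.0 / 50, 1.0 / 50])
print(round(p, 3))
```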
The calculation of the p-value leads to the following optimization problem.
$$\text{Problem (M.d)}: \begin{cases} \max F_{\alpha,t}(\bar p) = \dfrac{L(\bar p)}{\sqrt{Q(\bar p)}}, \\[2mm] \bar p \in H_0(C, \delta). \end{cases}$$
The order $d$ of the problem corresponds to the number of distinct non-null
cost values.
Here, the constraints of $H_0(C, \delta)$ are linear inequalities and, as $L$ is
a linear function of the components of $\bar p$ and $Q$ a quadratic one,
$F_{\alpha,t}(\bar p)$ is a nonlinear function. To solve the general problem
(M.d) one can use or adapt nonlinear programming optimization procedures.
An ad hoc SCILAB program called VACC (VAlidation under Cost Constraints)
is available at http://www-math.univ-poitiers.fr/Stat/VACC.html.
4 Application: The case of problem (M.2)
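The case of problem (M.2) is relatively frequent in real situations and has
simple analytical solutions. For this case, we deal with three cost values,
w.o.l.g. $C_{k,l} \in \{1, C, 0\}$ with $1 > C > 0$. The concordance variable $Z$
is defined as follows, with the associated probabilities given in parentheses:
$$Z(\omega) = \begin{cases} 1 & \text{if } C(\omega) = 0 \quad (p_1), \\ \alpha & \text{if } C(\omega) = C \quad (p_\alpha), \\ 0 & \text{if } C(\omega) = 1 \quad (1 - p_1 - p_\alpha), \end{cases}$$
and the optimization problem is
$$\text{Problem (M.2)}: \begin{cases} \max F_{\alpha,t}(\bar p) = F_{\alpha,t}(p_1, p_\alpha) = \dfrac{t - p_1 - \alpha p_\alpha}{\sqrt{p_1 + \alpha^2 p_\alpha - (p_1 + \alpha p_\alpha)^2}}, \\[2mm] p_1 \ge 0,\ p_\alpha \ge 0,\ p_1 + p_\alpha \le 1, \\[1mm] C p_\alpha + (1 - p_1 - p_\alpha) \le \delta. \end{cases}$$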
We derive the solution of problem (M.2) using the following result.
Lemma 1. For $t, \alpha \in [0, 1]$ and $C, \delta$ such that
$1 > C > \delta > 0$, the maximum of $F_{\alpha,t}(x, y)$ is attained for
$(C - 1)y - x + 1 = \delta$.
Let us set $r = \frac{\delta - 1}{C - 1}$ and $s = \frac{1}{C - 1}$. For $(x, y)$
such that $y = sx + r$ (i.e. the cost constraint at boundary),
$F_{\alpha,t}(x, y) = G(x)$ where
$$G(x) = \frac{-(1 + \alpha s)x + t - \alpha r}{\sqrt{-(1 + \alpha s)^2 x^2 + (1 + \alpha^2 s - 2\alpha r - 2\alpha^2 rs)x + \alpha^2 (r - r^2)}}. \tag{3}$$
We establish the following result.
Proposition 1. Let
$\Delta = \Delta(\alpha, C, t) = (1 - \alpha s)(1 - \alpha^2 s - 2t + 2\alpha s t)$ and
$$x_0(t) = -\frac{\alpha^3 rs - \alpha^2 st(2r - 1) - 2\alpha^2 r + 2\alpha r t + \alpha r - t}{\Delta(\alpha, C, t)}.$$
Then, the following result holds.
$$\max F_{\alpha,t}(x, y) = \begin{cases} G(x_0(t)) & \text{if } x_0(t) \in [1 - \frac{\delta}{C},\ 1 - \delta] \text{ and } \Delta < 0, \\[1mm] \max\big(G(1 - \frac{\delta}{C}),\ G(1 - \delta)\big) & \text{elsewhere.} \end{cases}$$
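Lemma 1 can be checked numerically on a given configuration. The sketch below does not implement the closed form of Proposition 1; it is a brute-force comparison, under stated assumptions, between a grid search over the whole feasible region of problem (M.2) and a one-dimensional search along the cost boundary $y = sx + r$ for $x \in [1 - \delta/C,\ 1 - \delta]$. All function names, grid sizes and tolerances are hypothetical choices for illustration.

```python
import math


def objective(x, y, alpha, t):
    """F_{alpha,t}(x, y); returns None where the variance term is not positive."""
    variance = x + alpha ** 2 * y - (x + alpha * y) ** 2
    if variance <= 1e-12:
        return None
    return (t - x - alpha * y) / math.sqrt(variance)


def grid_max(alpha, t, cost, delta, steps=200):
    """Brute-force maximum of F over the feasible region of problem (M.2)."""
    best = -math.inf
    for i in range(steps + 1):
        for j in range(steps + 1 - i):          # guarantees x + y <= 1
            x, y = i / steps, j / steps
            if cost * y + (1.0 - x - y) > delta + 1e-9:
                continue                        # cost constraint violated
            value = objective(x, y, alpha, t)
            if value is not None and value > best:
                best = value
    return best


def boundary_max(alpha, t, cost, delta, steps=2000):
    """Maximum of G(x) = F(x, s*x + r) for x in [1 - delta/cost, 1 - delta]."""
    s, r = 1.0 / (cost - 1.0), (delta - 1.0) / (cost - 1.0)
    lo, hi = 1.0 - delta / cost, 1.0 - delta
    best = -math.inf
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        value = objective(x, s * x + r, alpha, t)
        if value is not None and value > best:
            best = value
    return best


# Illustrative configuration: alpha = C = 0.5, delta = 0.2, t = 0.9.
print(grid_max(0.5, 0.9, 0.5, 0.2), boundary_max(0.5, 0.9, 0.5, 0.2))
```

For this configuration the two searches agree, which is what the lemma predicts; a grid search of this kind is only a sanity check and is no substitute for a proper nonlinear programming routine in higher-order problems.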
On the parameters.
The parameters $\alpha, \delta, C$ are fixed by the users of the discussed
methodology and the value $t$ is derived from data. The choice $\alpha = 0.5$
leads to the UMP test, when using the class of statistics
$T_{\psi,Y}(Z^{(n)}, \alpha, w)$.
The choice $w_{n,h} = \frac{1}{n}$ minimizes $\sum_h n_h w_{n,h}^2$ when
$\sum_h n_h w_{n,h} = 1$. For such a choice of $w$,
$p\text{-value}(t) = \Phi(\sqrt{n}\, \max F_{\alpha,t}(\bar p))$ and constitutes
an upper bound for the other choices.
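The minimizing property of equal weights can be verified with the Cauchy-Schwarz inequality. The short derivation below is a sketch added for illustration; it uses only the normalization $\sum_h n_h w_{n,h} = 1$ and $n = \sum_h n_h$.

```latex
1 = \Big(\sum_h n_h w_{n,h}\Big)^2
  = \Big(\sum_h \sqrt{n_h}\cdot\sqrt{n_h}\, w_{n,h}\Big)^2
  \le \Big(\sum_h n_h\Big)\Big(\sum_h n_h w_{n,h}^2\Big)
  = n \sum_h n_h w_{n,h}^2
\quad\Longrightarrow\quad
\sum_h n_h w_{n,h}^2 \ge \frac{1}{n}.
```

Equality holds when $\sqrt{n_h}\, w_{n,h}$ is proportional to $\sqrt{n_h}$, that is when all $w_{n,h}$ are equal, which together with the normalization gives $w_{n,h} = \frac{1}{n}$. In that case $1 / \sqrt{\sum_h n_h w_{n,h}^2} = \sqrt{n}$, the factor appearing in the p-value expression.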
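Example: $\alpha = C = 0.5$ and $\delta < C$
The problem here consists of maximizing
$$F_{\alpha,t}(x, y) = \frac{t - x - 0.5y}{\sqrt{x + 0.25y - (x + 0.5y)^2}}$$
with the system of constraints
$$H_0 = \{(x, y) : y > 0,\ x + y \le 1,\ x + 0.5y \ge 1 - \delta\}.$$
We derive from the previous proposition
$$\sqrt{n}\, \max F_{\alpha,t} = \begin{cases} \dfrac{\sqrt{n}\,(t - 1 + \delta)}{\sqrt{0.5\delta - \delta^2}} & \text{if } t > 1 - \delta, \\[3mm] \dfrac{\sqrt{n}\,(t - 1 + \delta)}{\sqrt{\delta - \delta^2}} & \text{elsewhere,} \end{cases}$$
and $\max p\text{-value}(t) = \Phi(\sqrt{n}\, \max F_{\alpha,t})$.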
As an illustration, we give the following table. The considered t values relate
to real situations. The associated p-value is calculated for some δ values.

Table 1. Computations for n = 50

t value   δ value   √n max F_{α,t}   max p-value(t)
0.90      0.20        2.887            0.998
0.90      0.15        1.543            0.939
0.90      0.10        0.000            0.500
0.90      0.05       -2.357            0.009
0.75      0.20       -1.443            0.074
0.75      0.15       -3.086            0.001
0.75      0.10       -5.303            0.000
0.75      0.05       -9.428            0.000
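The closed form of the example is simple to evaluate. The following sketch implements the two branches exactly as stated in the text for $\alpha = C = 0.5$ and converts the statistic to a p-value with the normal CDF; the function names are hypothetical. Under this reading it recovers the Table 1 rows with $t \ge 1 - \delta$ (for instance $t = 0.90$, $\delta = 0.20$). The rows with $t < 1 - \delta$ appear to have been computed with the denominator $\sqrt{0.5\delta - \delta^2}$ as well, so the sketch is not expected to match them.

```python
import math


def normal_cdf(x):
    """CDF of N(0, 1), written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def example_statistic(t, delta, n):
    """sqrt(n) * max F for alpha = C = 0.5 and delta < 0.5, as stated in the text."""
    numerator = math.sqrt(n) * (t - 1.0 + delta)
    if t > 1.0 - delta:
        return numerator / math.sqrt(0.5 * delta - delta ** 2)
    return numerator / math.sqrt(delta - delta ** 2)


# Rows of Table 1 with t = 0.90 and delta >= 0.10, for n = 50.
for delta in (0.20, 0.15, 0.10):
    stat = example_statistic(0.90, delta, 50)
    print(f"t=0.90 delta={delta:.2f} stat={stat:.3f} p={normal_cdf(stat):.3f}")
```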
5 Conclusion
Using the generalized correctly classified rate, the interpretation of an
observed value integrates the case of non-uniform error cost. As forthcoming
applications and extensions, we have in mind:
• The study of the same concordance statistic in the non-Gaussian case.
Such a case appears when n is small or when weights are unbalanced.
• The study of other statistics measuring the quality of an allocation rule
under cost constraints.
• The extension of the cost notion to take into account the structure of
classes. This structure is sometimes given by proximity measures between
classes, as for the terminal nodes of a decision tree or for ordered
classes.
References
ADAMS, N.M., HAND, D.J. (1999): Comparing classifiers when the misallocation
costs are uncertain. Pattern Recognition, 32, 1139-1147.
BILLINGSLEY, P. (1995): Probability and measure. Wiley Series in Probability and
Mathematical Statistics, New York, pp. 593.
BREIMAN, L., FRIEDMAN, J., OLSHEN, R., STONE, C. (1984): Classification
and regression trees. Wadsworth, Belmont.
GIBBONS, J.D., PRATT, J.W. (1975): P-values: interpretation and methodology.
The American Statistician, 29/1, 20-25.
GOVAERT, G. (2003): Analyse des données. Lavoisier, série "Traitement du signal
et de l'image", Paris, pp. 362.
SEBBAN, M., RABASEDA, S., BOUSSAID, O. (1996): Contribution of related
geometrical graph in pattern recognition. In: E. Diday, Y. Lechevallier and O.
Opitz (Eds.): Ordinal and Symbolic Data Analysis. Springer, Berlin, 167-178.
