Entropy 2014, 16; doi:10.3390/e16052686
OPEN ACCESS
entropy
ISSN 1099-4300
www.mdpi.com/journal/entropy

Article

Model Selection Criteria Using Divergences

Aida Toma 1,2

1 Department of Applied Mathematics, Bucharest Academy of Economic Studies, Piața Romană 6, Bucharest 010374, Romania; E-Mail: aida
$$\varphi_\gamma(x) := \frac{x^\gamma - \gamma x + \gamma - 1}{\gamma(\gamma - 1)} \tag{2}$$

for $\gamma \in \mathbb{R} \setminus \{0, 1\}$, and

$$\varphi_0(x) := -\log x + x - 1, \qquad \varphi_1(x) := x \log x - x + 1,$$

with $\varphi_\gamma(0) := \lim_{x \downarrow 0} \varphi_\gamma(x)$ and $\varphi_\gamma(\infty) := \lim_{x \to \infty} \varphi_\gamma(x)$. Assume that the true distribution of the data has density $f_{\theta_0}$.
In the following, $D(f_\theta, f_{\theta_0})$ denotes the divergence between $f_\theta$ and $f_{\theta_0}$, namely

$$D(f_\theta, f_{\theta_0}) := \int \varphi\Big(\frac{f_\theta}{f_{\theta_0}}\Big) f_{\theta_0}\, d\lambda. \tag{3}$$
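As a numerical illustration (not part of the original development), the family (2) and the divergence (3) can be evaluated by quadrature. The sketch below uses a normal location model purely as an assumed example; for $\gamma = 1$ the divergence reduces to the Kullback–Leibler divergence, which for $N(\mu, 1)$ against $N(0, 1)$ equals $\mu^2/2$.

```python
import numpy as np

def phi(gamma, x):
    """Cressie-Read function phi_gamma from Equation (2); gamma = 0, 1 are the limit cases."""
    x = np.asarray(x, dtype=float)
    if gamma == 0:
        return -np.log(x) + x - 1.0
    if gamma == 1:
        return x * np.log(x) - x + 1.0
    return (x**gamma - gamma * x + gamma - 1.0) / (gamma * (gamma - 1.0))

def divergence(gamma, f_theta, f_theta0, grid):
    """D(f_theta, f_theta0) of Equation (3), approximated by the trapezoidal rule."""
    y = phi(gamma, f_theta(grid) / f_theta0(grid)) * f_theta0(grid)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(grid)) / 2.0)

# Illustrative model choice: unit-variance normal densities.
normal = lambda mu: (lambda x: np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi))
grid = np.linspace(-10.0, 10.0, 20001)

d_kl = divergence(1, normal(0.5), normal(0.0), grid)   # KL(N(0.5,1), N(0,1)) = 0.125
```

Note that $\varphi_\gamma(1) = 0$ for every $\gamma$, so the divergence vanishes when the two densities coincide.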
Using a Fenchel duality technique, Broniatowski and Keziou [11] proved a dual representation of divergences. The main interest of this duality formula is that it leads to a wide variety of estimators, obtained by plugging the empirical measure of the data set into the dual form, without making use of any grouping or smoothing.
We consider divergences, defined through differentiable functions $\varphi$, that we assume to satisfy

(C.0) There exists $0 < \delta < 1$ such that for all $c \in [1-\delta, 1+\delta]$, there exist numbers $c_1, c_2, c_3$ such that
$$\varphi(cx) \le c_1 \varphi(x) + c_2 |x| + c_3, \quad \text{for all } x \in \mathbb{R}. \tag{4}$$

Condition (C.0) holds for all power divergences, including the $KL$ and $KL_m$ divergences.
Assuming that $D(f_\theta, f_{\theta_0})$ is finite and that the function $\varphi$ satisfies condition (C.0), the dual representation
$$D(f_\theta, f_{\theta_0}) = \sup_{\alpha \in \Theta} \int m(\alpha, \theta, x)\, f_{\theta_0}(x)\, dx \tag{5}$$
holds,
with
$$m(\alpha, \theta, x) := \int \varphi'\Big(\frac{f_\theta(z)}{f_\alpha(z)}\Big) f_\theta(z)\, dz - \Big[ \frac{f_\theta(x)}{f_\alpha(x)}\, \varphi'\Big(\frac{f_\theta(x)}{f_\alpha(x)}\Big) - \varphi\Big(\frac{f_\theta(x)}{f_\alpha(x)}\Big) \Big], \tag{6}$$
where $\varphi'$ denotes the derivative of $\varphi$, the supremum in Equation (5) being uniquely attained at $\alpha = \theta_0$, independently of $\theta$.
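The attainment of the supremum in Equation (5) at $\alpha = \theta_0$ can be checked numerically in a simple assumed case. The sketch below (an illustration under explicit assumptions, not part of the paper) takes $\gamma = 1$ (KL) and the normal location model $f_\theta = N(\theta, 1)$, for which the pieces of (6) have closed forms: $\int \varphi'(f_\theta/f_\alpha) f_\theta\, dz = KL(f_\theta, f_\alpha) = (\theta-\alpha)^2/2$ and $r\varphi'(r) - \varphi(r) = r - 1$ with $r = f_\theta/f_\alpha$.

```python
import numpy as np

theta0, theta = 0.0, 0.7
normal = lambda mu: (lambda x: np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi))

def m_kl(alpha, theta, x):
    """Dual function (6) for gamma = 1 in the normal location model f_theta = N(theta, 1)."""
    r = np.exp((theta - alpha) * x - (theta**2 - alpha**2) / 2.0)  # f_theta(x) / f_alpha(x)
    return (theta - alpha) ** 2 / 2.0 - (r - 1.0)

x = np.linspace(-10.0, 10.0, 20001)
w = normal(theta0)(x)

def trapz(y):
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

# Population dual objective alpha -> int m(alpha, theta, x) f_theta0(x) dx.
alphas = np.linspace(-1.5, 1.5, 301)
obj = np.array([trapz(m_kl(a, theta, x) * w) for a in alphas])
alpha_star = alphas[int(np.argmax(obj))]
# The maximizer is alpha = theta0 = 0, and the maximum equals
# D(f_theta, f_theta0) = theta^2 / 2 = 0.245, as Equation (5) asserts.
```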
We mention that the dual representation Equation (5) of divergences has been obtained independently
by Liese and Vajda [30].
Naturally, for fixed $\theta$, an estimator of the divergence $D(f_\theta, f_{\theta_0})$ is obtained by replacing Equation (5) by its sample analogue. This estimator is exactly
$$\widehat{D}(f_\theta, f_{\theta_0}) := \sup_{\alpha \in \Theta} \frac{1}{n} \sum_{i=1}^n m(\alpha, \theta, X_i), \tag{7}$$
the supremum being attained at
$$\widehat{\alpha}(\theta) := \arg\sup_{\alpha \in \Theta} \frac{1}{n} \sum_{i=1}^n m(\alpha, \theta, X_i). \tag{8}$$
Formula (8) defines a class of estimators of the parameter $\theta_0$ called dual divergence estimators.
Further, since
$$\inf_{\theta \in \Theta} D(f_\theta, f_{\theta_0}) = D(f_{\theta_0}, f_{\theta_0}) = 0 \tag{9}$$
and since the infimum in the above display is unique, a natural definition of estimators of the parameter $\theta_0$, called minimum dual divergence estimators, is provided by
$$\widehat{\theta} := \arg\inf_{\theta \in \Theta} \widehat{D}(f_\theta, f_{\theta_0}) = \arg\inf_{\theta \in \Theta} \sup_{\alpha \in \Theta} \frac{1}{n} \sum_{i=1}^n m(\alpha, \theta, X_i). \tag{10}$$
For more details on the dual representation of divergences and associated minimum dual divergence
estimators, we refer to Broniatowski and Keziou [11].
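To make Equations (7)–(10) concrete, here is a small simulation sketch, again under illustrative assumptions (KL divergence, normal location model, and a crude grid search in place of a proper optimizer): the dual objective of (7) is maximized in $\alpha$, and the minimum dual divergence estimator $\widehat{\theta}$ of (10) minimizes the resulting profile.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(2000)    # sample from the true model N(theta0, 1) with theta0 = 0

def m_kl(alpha, theta, x):
    """Dual function (6) for gamma = 1 in the normal location model f_theta = N(theta, 1)."""
    r = np.exp((theta - alpha) * x - (theta**2 - alpha**2) / 2.0)
    return (theta - alpha) ** 2 / 2.0 - (r - 1.0)

grid = np.linspace(-1.0, 1.0, 81)

def dual_sup(theta):
    """hat D(f_theta, f_theta0) of Equation (7): sup over alpha of the sample mean of m."""
    return max(float(m_kl(a, theta, x).mean()) for a in grid)

# Minimum dual divergence estimator, Equation (10): arg inf over theta of (7).
theta_hat = grid[int(np.argmin([dual_sup(t) for t in grid]))]
# theta_hat lands near the true value theta0 = 0, illustrating consistency.
```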
2.3. Asymptotic Properties

Broniatowski and Keziou [11] proved the weak and strong consistency, as well as the asymptotic normality, of the classes of estimators $\widehat{\alpha}(\theta)$, $\widehat{\alpha}(\widehat{\theta})$ and $\widehat{\theta}$. Here, we briefly recall those asymptotic results that will be used in the next sections. The following conditions are considered.
(C.1) The estimates $\widehat{\theta}$ and $\widehat{\alpha}(\widehat{\theta})$ exist.

(C.2) $\sup_{\alpha, \theta} \big| \frac{1}{n} \sum_{i=1}^n m(\alpha, \theta, X_i) - \int m(\alpha, \theta, x)\, f_{\theta_0}(x)\, dx \big|$ tends to $0$ in probability. Moreover,

(a) for any positive $\varepsilon$, there exists some positive $\eta$ such that for any $\alpha$ with $\|\alpha - \theta_0\| > \varepsilon$ and for all $\theta$ it holds that
$$\int m(\alpha, \theta, x)\, f_{\theta_0}(x)\, dx < \int m(\theta_0, \theta, x)\, f_{\theta_0}(x)\, dx - \eta.$$

(b) there exists some neighborhood $N_{\theta_0}$ of $\theta_0$ such that for any positive $\varepsilon$, there exists some positive $\eta$ such that for all $\alpha \in N_{\theta_0}$ and all $\theta$ satisfying $\|\theta - \theta_0\| > \varepsilon$, it holds that
$$\int m(\alpha, \theta_0, x)\, f_{\theta_0}(x)\, dx < \int m(\alpha, \theta, x)\, f_{\theta_0}(x)\, dx - \eta.$$

(C.3) There exists some neighborhood $N_{\theta_0}$ of $\theta_0$ and a positive function $H$ with $\int H(x)\, f_{\theta_0}(x)\, dx$ finite, such that for all $\alpha \in N_{\theta_0}$, $|m(\alpha, \theta_0, X)| \le H(X)$ in probability.
(C.4) There exists a neighborhood $N_{\theta_0}$ of $\theta_0$ such that the first and the second order partial derivatives with respect to $\alpha$ and $\theta$ of $\varphi'\big(f_\theta(x)/f_\alpha(x)\big) f_\theta(x)$ are dominated on $N_{\theta_0} \times N_{\theta_0}$ by some $\lambda$-integrable functions. The third order partial derivatives with respect to $\alpha$ and $\theta$ of $m(\alpha, \theta, x)$ are dominated on $N_{\theta_0} \times N_{\theta_0}$ by some $P_{\theta_0}$-integrable functions (where $P_{\theta_0}$ is the probability measure corresponding to the law $F_{\theta_0}$).
(C.5) The integrals $\int \big\|\frac{\partial}{\partial\alpha} m(\theta_0, \theta_0, x)\big\|^2 f_{\theta_0}(x)\, dx$, $\int \big\|\frac{\partial}{\partial\theta} m(\theta_0, \theta_0, x)\big\|^2 f_{\theta_0}(x)\, dx$, $\int \big\|\frac{\partial^2}{\partial\alpha^2} m(\theta_0, \theta_0, x)\big\| f_{\theta_0}(x)\, dx$, $\int \big\|\frac{\partial^2}{\partial\alpha\,\partial\theta} m(\theta_0, \theta_0, x)\big\| f_{\theta_0}(x)\, dx$ and $\int \big\|\frac{\partial^2}{\partial\theta^2} m(\theta_0, \theta_0, x)\big\| f_{\theta_0}(x)\, dx$ are finite, and the Fisher information matrix
$$I(\theta_0) := \int \frac{\dot f_{\theta_0}(z)\, \dot f_{\theta_0}^t(z)}{f_{\theta_0}(z)}\, dz$$
is nonsingular, $t$ denoting the transpose.
Proposition 1. Assume that conditions (C.1)–(C.3) hold. Then

(a) $\sup_\theta \|\widehat{\alpha}(\theta) - \theta_0\|$ tends to $0$ in probability.

(b) $\widehat{\theta}$ converges to $\theta_0$ in probability.

If (C.1)–(C.5) are fulfilled, then

(c) $\sqrt{n}\,(\widehat{\theta} - \theta_0)$ and $\sqrt{n}\,(\widehat{\alpha}(\widehat{\theta}) - \theta_0)$ converge in distribution to a centered $p$-variate normal random variable with covariance matrix $I(\theta_0)^{-1}$.
For discussions and examples about the fulfillment of conditions (C.1)–(C.5), we refer to Broniatowski and Keziou [11].
3. Model Selection Criteria

In this section, we apply the same methodology used for AIC to the divergences in dual form in order to develop model selection criteria. Consider a random sample $X_1, \ldots, X_n$ from the distribution with density $g$ (the true model) and a candidate model $(f_\theta)_{\theta \in \Theta}$ indexed by an unknown parameter $\theta$, where $\Theta$ is a subset of $\mathbb{R}^p$. We use divergences satisfying (C.0) and denote for simplicity by $W_\theta$ the divergence $D(f_\theta, g)$ between $f_\theta$ and $g$.
3.1. The Expected Overall Discrepancy

The target theoretical quantity that will be approximated by an asymptotically unbiased estimator is given by
$$E[W_{\widehat{\theta}}] = E[W_\theta \,|\, \theta = \widehat{\theta}], \tag{11}$$
where $\widehat{\theta}$ is a minimum dual divergence estimator defined by Equation (10). The same divergence is used for both $W_\theta$ and $\widehat{\theta}$. The quantity $E[W_{\widehat{\theta}}]$ measures the discrepancy between the true model and the fitted candidate model, and it is called the expected overall discrepancy between $g$ and $(f_{\widehat{\theta}})$.
The next lemma gives the gradient vector and the Hessian matrix of $W_\theta$. The following conditions, allowing derivation under the integral sign, are assumed on $g$ and $f_\theta$.

(C.6) There exists a neighborhood $N_\theta$ of $\theta$ such that
$$\int \sup_{u \in N_\theta} \Big\| \frac{\partial}{\partial u} \Big[ \varphi\Big(\frac{f_u}{g}\Big) \Big] \Big\|\, g\, d\lambda < \infty. \tag{12}$$
(C.7) There exists a neighborhood $N_\theta$ of $\theta$ such that
$$\int \sup_{u \in N_\theta} \Big\| \frac{\partial}{\partial u} \Big[ \varphi'\Big(\frac{f_u}{g}\Big) \dot f_u \Big] \Big\|\, d\lambda < \infty. \tag{13}$$
Lemma 1. Assume that conditions (C.6) and (C.7) hold. Then the gradient vector $\frac{\partial}{\partial\theta} W_\theta$ is given by
$$\frac{\partial}{\partial\theta} W_\theta = \int \varphi'\Big(\frac{f_\theta}{g}\Big) \dot f_\theta\, d\lambda \tag{14}$$
and the Hessian matrix $\frac{\partial^2}{\partial\theta^2} W_\theta$ is given by
$$\frac{\partial^2}{\partial\theta^2} W_\theta = \int \Big[ \varphi''\Big(\frac{f_\theta}{g}\Big) \frac{\dot f_\theta\, \dot f_\theta^t}{g} + \varphi'\Big(\frac{f_\theta}{g}\Big) \ddot f_\theta \Big] d\lambda, \tag{15}$$
where $\dot f_\theta$ and $\ddot f_\theta$ denote the first and second order derivatives of $f_\theta$ with respect to $\theta$.

The proof of this lemma is straightforward, therefore it is omitted.
In particular, when using Cressie–Read divergences, the gradient vector $\frac{\partial}{\partial\theta} W_\theta$ is given by
$$\frac{1}{\gamma-1} \int \Big[ \Big(\frac{f_\theta(z)}{g(z)}\Big)^{\gamma-1} - 1 \Big] \dot f_\theta(z)\, dz, \quad \text{if } \gamma \neq 0, 1, \tag{16}$$
$$-\int \frac{g(z)}{f_\theta(z)}\, \dot f_\theta(z)\, dz, \quad \text{if } \gamma = 0, \tag{17}$$
$$\int \log\Big(\frac{f_\theta(z)}{g(z)}\Big) \dot f_\theta(z)\, dz, \quad \text{if } \gamma = 1 \tag{18}$$
and the Hessian matrix $\frac{\partial^2}{\partial\theta^2} W_\theta$ is given by
$$\int \Big(\frac{f_\theta(z)}{g(z)}\Big)^{\gamma-1} \frac{\dot f_\theta(z)\, \dot f_\theta^t(z)}{f_\theta(z)}\, dz + \frac{1}{\gamma-1} \int \Big[ \Big(\frac{f_\theta(z)}{g(z)}\Big)^{\gamma-1} - 1 \Big] \ddot f_\theta(z)\, dz, \quad \text{if } \gamma \neq 0, 1, \tag{19}$$
$$\int \frac{g(z)}{f_\theta^2(z)}\, \dot f_\theta(z)\, \dot f_\theta^t(z)\, dz - \int \frac{g(z)}{f_\theta(z)}\, \ddot f_\theta(z)\, dz, \quad \text{if } \gamma = 0, \tag{20}$$
$$\int \log\Big(\frac{f_\theta(z)}{g(z)}\Big) \ddot f_\theta(z)\, dz + \int \frac{\dot f_\theta(z)\, \dot f_\theta^t(z)}{f_\theta(z)}\, dz, \quad \text{if } \gamma = 1. \tag{21}$$
When the true model $g$ belongs to the parametric model $(f_\theta)_{\theta\in\Theta}$, hence $g = f_{\theta_0}$, the gradient vector and the Hessian matrix of $W_\theta$ evaluated at $\theta = \theta_0$ simplify to
$$\Big[ \frac{\partial}{\partial\theta} W_\theta \Big]_{\theta=\theta_0} = 0, \tag{22}$$
$$\Big[ \frac{\partial^2}{\partial\theta^2} W_\theta \Big]_{\theta=\theta_0} = \varphi''(1)\, I(\theta_0). \tag{23}$$
Proposition 2. Suppose that the true model $g$ belongs to the parametric family $(f_\theta)_{\theta\in\Theta}$, hence $g = f_{\theta_0}$, and that conditions (C.6) and (C.7) hold for $\theta = \theta_0$. Then the expected overall discrepancy is given by
$$E[W_{\widehat{\theta}}] = W_{\theta_0} + \frac{\varphi''(1)}{2}\, E\big[(\widehat{\theta} - \theta_0)^t I(\theta_0) (\widehat{\theta} - \theta_0)\big] + E[R_n], \tag{24}$$
where $R_n = o(\|\widehat{\theta} - \theta_0\|^2)$ and $\theta_0$ is the true value of the parameter.
Proof. By applying a Taylor expansion to $W_\theta$ around $\theta_0$, taking $\theta = \widehat{\theta}$ and using Equations (22) and (23), we obtain
$$W_{\widehat{\theta}} = W_{\theta_0} + \frac{\varphi''(1)}{2}\, (\widehat{\theta} - \theta_0)^t I(\theta_0) (\widehat{\theta} - \theta_0) + o(\|\widehat{\theta} - \theta_0\|^2). \tag{25}$$
Taking the expectation, Equation (24) is proved.
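The quadratic approximation behind (25) can be checked directly in an assumed case where $W_\theta$ has a closed form. For the Cressie–Read divergence with $\gamma = 2$ and the normal location model $f_\theta = N(\theta, 1)$, $g = f_0$ (illustrative choices, not from the paper), one has $W_\theta = (e^{\theta^2} - 1)/2$, $I(\theta_0) = 1$ and $\varphi''(1) = 1$, so the expansion predicts $W_\theta \approx \theta^2/2$ with a remainder of order $\theta^4$:

```python
import math
import numpy as np

phi2 = lambda x: (x - 1.0) ** 2 / 2.0        # Cressie-Read phi_gamma for gamma = 2
normal = lambda mu: (lambda x: np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi))

grid = np.linspace(-12.0, 12.0, 24001)
def W_num(theta):
    """W_theta = D(f_theta, f_0) computed from the definition (3) by quadrature."""
    y = phi2(normal(theta)(grid) / normal(0.0)(grid)) * normal(0.0)(grid)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(grid)) / 2.0)

W_exact = lambda theta: (math.exp(theta**2) - 1.0) / 2.0   # closed form for this model

theta = 0.2
quad_approx = theta**2 / 2.0   # W_{theta0} + phi''(1)/2 (theta-theta0)^2 I(theta0), theta0 = 0
# W_exact(theta) - quad_approx behaves like theta^4/4, i.e. o(|theta - theta0|^2) in (25).
```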
3.2. Estimation of the Expected Overall Discrepancy

In this section, we construct an asymptotically unbiased estimator of the expected overall discrepancy, under the hypothesis that the true model $g$ belongs to the parametric family $(f_\theta)_{\theta\in\Theta}$.

For a given $\theta$, a natural estimator of $W_\theta$ is
$$Q_\theta := \sup_{\alpha\in\Theta} \frac{1}{n} \sum_{i=1}^n m(\alpha, \theta, X_i) = \frac{1}{n} \sum_{i=1}^n m(\widehat{\alpha}(\theta), \theta, X_i), \tag{26}$$
where $m(\alpha, \theta, x)$ is given by formula (6). Using the sample analogue of the dual representation of the divergence, $Q_\theta$ can also be expressed as
$$Q_\theta = \int \varphi'\Big(\frac{f_\theta(z)}{f_{\widehat{\alpha}(\theta)}(z)}\Big) f_\theta(z)\, dz - \frac{1}{n} \sum_{i=1}^n \Big[ \frac{f_\theta(X_i)}{f_{\widehat{\alpha}(\theta)}(X_i)}\, \varphi'\Big(\frac{f_\theta(X_i)}{f_{\widehat{\alpha}(\theta)}(X_i)}\Big) - \varphi\Big(\frac{f_\theta(X_i)}{f_{\widehat{\alpha}(\theta)}(X_i)}\Big) \Big]. \tag{27}$$
The following conditions allow derivation under the integral sign for the integral term of $Q_\theta$.

(C.8) There exists a neighborhood $N_\theta$ of $\theta$ such that
$$\int \sup_{u\in N_\theta} \Big\| \frac{\partial}{\partial u} \Big[ \varphi'\Big(\frac{f_u}{f_{\widehat{\alpha}(u)}}\Big) f_u \Big] \Big\|\, d\lambda < \infty. \tag{28}$$
(C.9) There exists a neighborhood $N_\theta$ of $\theta$ such that
$$\int \sup_{u\in N_\theta} \Big\| \frac{\partial}{\partial u} \Big[ \varphi''\Big(\frac{f_u}{f_{\widehat{\alpha}(u)}}\Big) \Big( \frac{\dot f_u}{f_{\widehat{\alpha}(u)}} - \frac{f_u}{f_{\widehat{\alpha}(u)}^2}\, \frac{\partial}{\partial u} f_{\widehat{\alpha}(u)} \Big) f_u + \varphi'\Big(\frac{f_u}{f_{\widehat{\alpha}(u)}}\Big) \dot f_u \Big] \Big\|\, d\lambda < \infty. \tag{29}$$
Lemma 2. Under (C.8) and (C.9), the gradient vector and the Hessian matrix of $Q_\theta$ are
$$\frac{\partial}{\partial\theta} Q_\theta = \frac{1}{n} \sum_{i=1}^n \frac{\partial}{\partial\theta} m(\widehat{\alpha}(\theta), \theta, X_i), \tag{30}$$
$$\frac{\partial^2}{\partial\theta^2} Q_\theta = \frac{1}{n} \sum_{i=1}^n \frac{\partial^2}{\partial\theta^2} m(\widehat{\alpha}(\theta), \theta, X_i). \tag{31}$$
Proof. Since
$$Q_\theta = \frac{1}{n} \sum_{i=1}^n m(\widehat{\alpha}(\theta), \theta, X_i), \tag{32}$$
derivation with respect to $\theta$ yields
$$\frac{\partial}{\partial\theta} Q_\theta = \Big( \frac{\partial \widehat{\alpha}(\theta)}{\partial\theta} \Big)^t \Big[ \frac{1}{n} \sum_{i=1}^n \frac{\partial}{\partial\alpha} m(\widehat{\alpha}(\theta), \theta, X_i) \Big] + \frac{1}{n} \sum_{i=1}^n \frac{\partial}{\partial\theta} m(\widehat{\alpha}(\theta), \theta, X_i). \tag{33}$$
Note that, by its very definition, $\widehat{\alpha}(\theta)$ is a solution of the equation
$$\frac{1}{n} \sum_{i=1}^n \frac{\partial}{\partial\alpha} m(\alpha, \theta, X_i) = 0 \tag{34}$$
taken with respect to $\alpha$, therefore
$$\frac{\partial}{\partial\theta} Q_\theta = \frac{1}{n} \sum_{i=1}^n \frac{\partial}{\partial\theta} m(\widehat{\alpha}(\theta), \theta, X_i). \tag{35}$$
On the other hand,
$$\frac{\partial^2}{\partial\theta^2} Q_\theta = \Big( \frac{\partial \widehat{\alpha}(\theta)}{\partial\theta} \Big)^t \Big[ \frac{1}{n} \sum_{i=1}^n \frac{\partial^2}{\partial\alpha\,\partial\theta} m(\widehat{\alpha}(\theta), \theta, X_i) \Big] + \frac{1}{n} \sum_{i=1}^n \frac{\partial^2}{\partial\theta^2} m(\widehat{\alpha}(\theta), \theta, X_i) \tag{36}$$
$$= \frac{1}{n} \sum_{i=1}^n \frac{\partial^2}{\partial\theta^2} m(\widehat{\alpha}(\theta), \theta, X_i). \tag{37}$$
Proposition 3. Under conditions (C.1)–(C.3) and (C.8)–(C.9), and assuming that the integrals $\int \big\|\frac{\partial^2}{\partial\theta^2} m(\theta_0, \theta_0, x)\big\| f_{\theta_0}(x)\, dx$, $\int \big\|\frac{\partial^3}{\partial\alpha\,\partial\theta^2} m(\theta_0, \theta_0, x)\big\| f_{\theta_0}(x)\, dx$ and $\int \big\|\frac{\partial^3}{\partial\theta^3} m(\theta_0, \theta_0, x)\big\| f_{\theta_0}(x)\, dx$ are finite, the gradient vector and the Hessian matrix of $Q_\theta$ evaluated at $\theta = \widehat{\theta}$ satisfy
$$\Big[ \frac{\partial}{\partial\theta} Q_\theta \Big]_{\theta=\widehat{\theta}} = 0, \tag{38}$$
$$\Big[ \frac{\partial^2}{\partial\theta^2} Q_\theta \Big]_{\theta=\widehat{\theta}} = \varphi''(1)\, I(\theta_0) + o_P(1). \tag{39}$$
Proof. By the very definition of $\widehat{\theta}$, the equality (38) is verified. For the second relation, we take $\theta = \widehat{\theta}$ in Equation (31):
$$\Big[ \frac{\partial^2}{\partial\theta^2} Q_\theta \Big]_{\theta=\widehat{\theta}} = \frac{1}{n} \sum_{i=1}^n \frac{\partial^2}{\partial\theta^2} m(\widehat{\alpha}(\widehat{\theta}), \widehat{\theta}, X_i). \tag{40}$$
A Taylor expansion of $\frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta^2} m(\alpha, \theta, X_i)$, as a function of $(\alpha, \theta)$, around $(\theta_0, \theta_0)$ yields
$$\frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta^2} m(\widehat{\alpha}(\widehat{\theta}), \widehat{\theta}, X_i) = \frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta^2} m(\theta_0, \theta_0, X_i) + \Big[ \frac{1}{n}\sum_{i=1}^n \frac{\partial^3}{\partial\alpha\,\partial\theta^2} m(\theta_0, \theta_0, X_i) \Big] (\widehat{\alpha}(\widehat{\theta}) - \theta_0)$$
$$+ \Big[ \frac{1}{n}\sum_{i=1}^n \frac{\partial^3}{\partial\theta^3} m(\theta_0, \theta_0, X_i) \Big] (\widehat{\theta} - \theta_0) + o\Big( \sqrt{ \|\widehat{\alpha}(\widehat{\theta}) - \theta_0\|^2 + \|\widehat{\theta} - \theta_0\|^2 } \Big).$$
Using the fact that $\int \big\|\frac{\partial^2}{\partial\theta^2} m(\theta_0, \theta_0, x)\big\| f_{\theta_0}(x)\, dx$ is finite, the weak law of large numbers leads to
$$\frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta^2} m(\theta_0, \theta_0, X_i) \xrightarrow{\;P\;} \varphi''(1)\, I(\theta_0). \tag{41}$$
Then, since $\widehat{\alpha}(\widehat{\theta}) - \theta_0 = o_P(1)$ and $\widehat{\theta} - \theta_0 = o_P(1)$, and taking into account that $\int \big\|\frac{\partial^3}{\partial\alpha\,\partial\theta^2} m(\theta_0, \theta_0, x)\big\| f_{\theta_0}(x)\, dx$ and $\int \big\|\frac{\partial^3}{\partial\theta^3} m(\theta_0, \theta_0, x)\big\| f_{\theta_0}(x)\, dx$ are finite, we deduce that
$$\frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta^2} m(\widehat{\alpha}(\widehat{\theta}), \widehat{\theta}, X_i) = \varphi''(1)\, I(\theta_0) + o_P(1). \tag{42}$$
Thus we obtain Equation (39).
In the following, we suppose that the conditions of Propositions 1, 2 and 3 are all satisfied. These conditions allow obtaining an asymptotically unbiased estimator of the expected overall discrepancy.
Proposition 4. When the true model $g$ belongs to the parametric model $(f_\theta)_{\theta\in\Theta}$, it holds
$$E[W_{\widehat{\theta}}] = E\big[ Q_{\widehat{\theta}} + \varphi''(1)\, (\widehat{\theta} - \theta_0)^t I(\theta_0) (\widehat{\theta} - \theta_0) + R_n \big], \tag{43}$$
where $R_n = o(\|\widehat{\theta} - \theta_0\|^2)$.
Proof. A Taylor expansion of $Q_\theta$ around $\widehat{\theta}$ yields
$$Q_\theta = Q_{\widehat{\theta}} + (\theta - \widehat{\theta})^t \Big[ \frac{\partial}{\partial\theta} Q_\theta \Big]_{\theta=\widehat{\theta}} + \frac{1}{2} (\theta - \widehat{\theta})^t \Big[ \frac{\partial^2}{\partial\theta^2} Q_\theta \Big]_{\theta=\widehat{\theta}} (\theta - \widehat{\theta}) + o(\|\theta - \widehat{\theta}\|^2) \tag{44}$$
and, using Proposition 3, we have
$$Q_\theta = Q_{\widehat{\theta}} + \frac{1}{2} (\theta - \widehat{\theta})^t \big[ \varphi''(1)\, I(\theta_0) + o_P(1) \big] (\theta - \widehat{\theta}) + o(\|\theta - \widehat{\theta}\|^2). \tag{45}$$
Taking $\theta = \theta_0$, for large $n$ it holds
$$Q_{\theta_0} = Q_{\widehat{\theta}} + \frac{\varphi''(1)}{2}\, (\widehat{\theta} - \theta_0)^t I(\theta_0) (\widehat{\theta} - \theta_0) + o(\|\widehat{\theta} - \theta_0\|^2) \tag{46}$$
and consequently
$$E[Q_{\theta_0}] = E[Q_{\widehat{\theta}}] + \frac{\varphi''(1)}{2}\, E\big[(\widehat{\theta} - \theta_0)^t I(\theta_0) (\widehat{\theta} - \theta_0)\big] + E[R_n], \tag{47}$$
where $R_n = o(\|\widehat{\theta} - \theta_0\|^2)$.
According to Proposition 2, it holds
$$E[W_{\widehat{\theta}}] = W_{\theta_0} + \frac{\varphi''(1)}{2}\, E\big[(\widehat{\theta} - \theta_0)^t I(\theta_0) (\widehat{\theta} - \theta_0)\big] + E[R_n]. \tag{48}$$
Note that
$$E[Q_{\theta_0}] = E\Big[ \sup_{\alpha\in\Theta} \frac{1}{n}\sum_{i=1}^n m(\alpha, \theta_0, X_i) \Big] = \sup_{\alpha\in\Theta} E\Big[ \frac{1}{n}\sum_{i=1}^n m(\alpha, \theta_0, X_i) \Big] = \sup_{\alpha\in\Theta} E\big[m(\alpha, \theta_0, X_i)\big] = \sup_{\alpha\in\Theta} \int m(\alpha, \theta_0, x)\, f_{\theta_0}(x)\, dx = W_{\theta_0}. \tag{49}$$
Then, combining Equation (48) with Equations (49) and (47), we get
$$E[W_{\widehat{\theta}}] = E\big[ Q_{\widehat{\theta}} + \varphi''(1)\, (\widehat{\theta} - \theta_0)^t I(\theta_0) (\widehat{\theta} - \theta_0) + R_n \big]. \tag{50}$$
Proposition 4 shows that an asymptotically unbiased estimator of the expected overall discrepancy is given by
$$Q_{\widehat{\theta}} + \varphi''(1)\, (\widehat{\theta} - \theta_0)^t I(\theta_0) (\widehat{\theta} - \theta_0). \tag{51}$$
According to Proposition 1, $\sqrt{n}\,(\widehat{\theta} - \theta_0)$ is asymptotically distributed as $\mathcal{N}_p(0, I(\theta_0)^{-1})$. Consequently, $n\, (\widehat{\theta} - \theta_0)^t I(\theta_0) (\widehat{\theta} - \theta_0)$ has approximately a $\chi^2_p$ distribution, whose expectation is $p$. Then, taking into account that $n \cdot o(\|\widehat{\theta} - \theta_0\|^2) = o_P(1)$, an asymptotically unbiased estimator of $n$ times the expected overall discrepancy evaluated at $\widehat{\theta}$ is provided by
$$n Q_{\widehat{\theta}} + \varphi''(1)\, p. \tag{52}$$
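As a usage sketch of criterion (52), under illustrative assumptions not drawn from the paper (KL divergence, so $\varphi''(1) = 1$; data from $N(0.7, 1)$; grid-search optimization), one can compare a correctly specified location family ($p = 1$) against the single misspecified density $N(0, 1)$ ($p = 0$); the model with the smaller value of $n Q_{\widehat\theta} + \varphi''(1)\, p$ is preferred.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = 0.7 + rng.standard_normal(n)        # true model: N(0.7, 1)

def m_kl(alpha, theta, x):
    # Dual function (6) for gamma = 1 in the normal location model f_theta = N(theta, 1).
    r = np.exp((theta - alpha) * x - (theta**2 - alpha**2) / 2.0)
    return (theta - alpha) ** 2 / 2.0 - (r - 1.0)

grid = np.linspace(-1.5, 1.5, 61)
def Q(theta):
    # Q_theta of Equation (26): sup over alpha of the sample mean of m.
    return max(float(m_kl(a, theta, x).mean()) for a in grid)

# Candidate 1: location family {N(theta, 1)}, p = 1, with theta_hat from Equation (10).
theta_hat = grid[int(np.argmin([Q(t) for t in grid]))]
crit_family = n * Q(theta_hat) + 1.0 * 1     # n Q_theta_hat + phi''(1) p, Equation (52)

# Candidate 2: the single (misspecified) density N(0, 1), p = 0.
crit_fixed = n * Q(0.0) + 1.0 * 0

# The criterion favors the family that contains the true model.
```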
3.3. Influence Functions

In the following, we compute the influence function of the statistics $Q_{\widehat{\theta}}$. Recall that the influence function of a statistical functional $T$ at the distribution $F$ is defined by
$$\mathrm{IF}(x; T, F) := \frac{\partial}{\partial\varepsilon} \big[ T(\widetilde{F}_{\varepsilon x}) \big]_{\varepsilon=0}, \tag{53}$$
where $\widetilde{F}_{\varepsilon x} := (1-\varepsilon) F + \varepsilon\, \delta_x$, $\varepsilon > 0$, $\delta_x$ being the Dirac measure putting all mass at $x$. Whenever the influence function is bounded with respect to $x$, the corresponding estimator is called robust (see [31]).
Since
$$Q_{\widehat{\theta}} = \frac{1}{n}\sum_{i=1}^n m(\widehat{\alpha}(\widehat{\theta}), \widehat{\theta}, X_i), \tag{54}$$
the statistical functional corresponding to $Q_{\widehat{\theta}}$ is
$$U(F) := \int m\big(T_{V(F)}(F), V(F), y\big)\, dF(y), \tag{55}$$
where $T_\theta(F)$ is the statistical functional associated to the estimator $\widehat{\alpha}(\theta)$ and $V(F)$ is the statistical functional associated to the estimator $\widehat{\theta}$.
Proposition 5. The influence function of $Q_{\widehat{\theta}}$ is
$$\mathrm{IF}(x; U, F_{\theta_0}) = \varphi''(1)\, \frac{\dot f_{\theta_0}(x)}{f_{\theta_0}(x)}. \tag{56}$$
Proof. For the contaminated model $\widetilde{F}_{\varepsilon x} := (1-\varepsilon) F_{\theta_0} + \varepsilon\, \delta_x$, it holds
$$U(\widetilde{F}_{\varepsilon x}) = (1-\varepsilon) \int m\big(T_{V(\widetilde{F}_{\varepsilon x})}(\widetilde{F}_{\varepsilon x}), V(\widetilde{F}_{\varepsilon x}), y\big)\, dF_{\theta_0}(y) + \varepsilon\, m\big(T_{V(\widetilde{F}_{\varepsilon x})}(\widetilde{F}_{\varepsilon x}), V(\widetilde{F}_{\varepsilon x}), x\big). \tag{57}$$
Derivation with respect to $\varepsilon$ yields
$$\Big[ \frac{\partial}{\partial\varepsilon} U(\widetilde{F}_{\varepsilon x}) \Big]_{\varepsilon=0} = -\int m(\theta_0, \theta_0, y)\, dF_{\theta_0}(y) + \Big[ \int \frac{\partial}{\partial\alpha} m(\theta_0, \theta_0, y)\, dF_{\theta_0}(y) \Big]^t \Big[ \frac{\partial}{\partial\varepsilon} T_{V(\widetilde{F}_{\varepsilon x})}(\widetilde{F}_{\varepsilon x}) \Big]_{\varepsilon=0}$$
$$+ \Big[ \int \frac{\partial^2}{\partial\theta\,\partial\alpha} m(\theta_0, \theta_0, y)\, dF_{\theta_0}(y) \Big]\, \mathrm{IF}(x; V, F_{\theta_0}) + m(\theta_0, \theta_0, x).$$
Note that $m(\theta_0, \theta_0, y) = 0$ for any $y$ and $\int \frac{\partial}{\partial\alpha} m(\theta_0, \theta_0, y)\, dF_{\theta_0}(y) = 0$. Also, some straightforward calculations give
$$\int \frac{\partial^2}{\partial\theta\,\partial\alpha} m(\theta_0, \theta_0, y)\, dF_{\theta_0}(y) = \varphi''(1)\, I(\theta_0). \tag{58}$$
On the other hand, according to the results presented in [12], the influence function of the minimum dual divergence estimator is
$$\mathrm{IF}(x; V, F_{\theta_0}) = I(\theta_0)^{-1}\, \frac{\dot f_{\theta_0}(x)}{f_{\theta_0}(x)}. \tag{59}$$
Consequently, we obtain Equation (56).
Note that, for Cressie–Read divergences, it holds
$$\mathrm{IF}(x; U, F_{\theta_0}) = \frac{\dot f_{\theta_0}(x)}{f_{\theta_0}(x)}, \tag{60}$$
irrespective of the divergence used, since $\varphi''_\gamma(1) = 1$ for every $\gamma$. Generally, $\dot f_{\theta_0}(x)/f_{\theta_0}(x)$ is not bounded in $x$, therefore the robustness of the statistics $Q_{\widehat{\theta}}$, as measured by the influence function, does not hold.
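Two facts used above can be checked numerically (a sketch with illustrative model choices): that $\varphi''_\gamma(1) = 1$ for every Cressie–Read $\gamma$, via finite differences, and that for the assumed normal location model $f_\theta = N(\theta, 1)$ the score $\dot f_{\theta_0}(x)/f_{\theta_0}(x) = x - \theta_0$, so the influence function (60) grows without bound in the contamination point $x$.

```python
import numpy as np

def phi(gamma, x):
    # Cressie-Read function from Equation (2), with the gamma = 0, 1 limit cases.
    if gamma == 0:
        return -np.log(x) + x - 1.0
    if gamma == 1:
        return x * np.log(x) - x + 1.0
    return (x**gamma - gamma * x + gamma - 1.0) / (gamma * (gamma - 1.0))

h = 1e-4
second_deriv_at_1 = {
    g: (phi(g, 1.0 + h) - 2.0 * phi(g, 1.0) + phi(g, 1.0 - h)) / h**2
    for g in (-0.5, 0.0, 0.5, 1.0, 2.0)
}   # all approximately 1, i.e. phi''(1) = 1 for every gamma

theta0 = 0.0
score = lambda x: x - theta0     # dot f_theta0 / f_theta0 for f_theta = N(theta, 1)
influence = [score(x) for x in (1.0, 10.0, 100.0)]   # grows linearly: unbounded IF
```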
4. Conclusions
The dual representation of divergences and the corresponding minimum dual divergence estimators are useful tools in statistical inference. The theoretical results presented here show that, in the context of model selection, these tools provide asymptotically unbiased criteria. These criteria are not robust in the sense of a bounded influence function, but this fact does not exclude the stability of the criteria with respect to other robustness measures.
to other robustness measures. The computation of Q