
Mathematical Statistics

Sara van de Geer


September 2010
Contents
1 Introduction 7
1.1 Some notation and model assumptions 7
1.2 Estimation 10
1.3 Comparison of estimators: risk functions 12
1.4 Comparison of estimators: sensitivity 12
1.5 Confidence intervals 13
1.5.1 Equivalence of confidence sets and tests 13
1.6 Intermezzo: quantile functions 14
1.7 How to construct tests and confidence sets 14
1.8 An illustration: the two-sample problem 16
1.8.1 Assuming normality 17
1.8.2 A nonparametric test 18
1.8.3 Comparison of Student's test and Wilcoxon's test 20
1.9 How to construct estimators 21
1.9.1 Plug-in estimators 21
1.9.2 The method of moments 22
1.9.3 Likelihood methods 23

2 Decision theory 29
2.1 Decisions and their risk 29
2.2 Admissibility 31
2.3 Minimaxity 33
2.4 Bayes decisions 34
2.5 Intermezzo: conditional distributions 35
2.6 Bayes methods 36
2.7 Discussion of Bayesian approach (to be written) 39
2.8 Integrating parameters out (to be written) 39
2.9 Intermezzo: some distribution theory 39
2.9.1 The multinomial distribution 39
2.9.2 The Poisson distribution 41
2.9.3 The distribution of the maximum of two random variables 42
2.10 Sufficiency 42
2.10.1 Rao-Blackwell 44
2.10.2 Factorization Theorem of Neyman 45
2.10.3 Exponential families 47
2.10.4 Canonical form of an exponential family 48
2.10.5 Minimal sufficiency 53

3 Unbiased estimators 55
3.1 What is an unbiased estimator? 55
3.2 UMVU estimators 56
3.2.1 Complete statistics 59
3.3 The Cramér-Rao lower bound 62
3.4 Higher-dimensional extensions 66
3.5 Uniformly most powerful tests 68
3.5.1 An example 68
3.5.2 UMP tests and exponential families 71
3.5.3 Unbiased tests 74
3.5.4 Conditional tests 77

4 Equivariant statistics 81
4.1 Equivariance in the location model 81
4.2 Equivariance in the location-scale model (to be written) 86

5 Proving admissibility and minimaxity 87
5.1 Minimaxity 88
5.2 Admissibility 89
5.3 Inadmissibility in higher-dimensional settings (to be written) 95

6 Asymptotic theory 97
6.1 Types of convergence 97
6.1.1 Stochastic order symbols 99
6.1.2 Some implications of convergence 99
6.2 Consistency and asymptotic normality 101
6.2.1 Asymptotic linearity 101
6.2.2 The δ-technique 102
6.3 M-estimators 104
6.3.1 Consistency of M-estimators 106
6.3.2 Asymptotic normality of M-estimators 109
6.4 Plug-in estimators 114
6.4.1 Consistency of plug-in estimators 117
6.4.2 Asymptotic normality of plug-in estimators 118
6.5 Asymptotic relative efficiency 121
6.6 Asymptotic Cramér-Rao lower bound 123
6.6.1 Le Cam's 3rd Lemma 126
6.7 Asymptotic confidence intervals and tests 129
6.7.1 Maximum likelihood 131
6.7.2 Likelihood ratio tests 135
6.8 Complexity regularization (to be written) 139

7 Literature 141
These notes in English will closely follow Mathematische Statistik, by H.R.
Künsch (2005), but are as yet incomplete. Mathematische Statistik can be used
as supplementary reading material in German.
Mathematical rigor and clarity often bite each other. At some places, not all
subtleties are fully presented. A snake will indicate this.
Chapter 1
Introduction
Statistics is about the mathematical modeling of observable phenomena, using
stochastic models, and about analyzing data: estimating parameters of the
model and testing hypotheses. In these notes, we study various estimation and
testing procedures. We consider their theoretical properties and we investigate
various notions of optimality.
1.1 Some notation and model assumptions

The data consist of measurements (observations) $x_1, \ldots, x_n$, which are regarded
as realizations of random variables $X_1, \ldots, X_n$. In most of the notes, the $X_i$
are real-valued: $X_i \in \mathbb{R}$ (for $i = 1, \ldots, n$), although we will also consider some
extensions to vector-valued observations.
Example 1.1.1 Fizeau and Foucault developed methods for estimating the
speed of light (1849, 1850), which were later improved by Newcomb and Michelson.
The main idea is to pass light from a rapidly rotating mirror to a fixed
mirror and back to the rotating mirror. An estimate of the velocity of light
is obtained, taking into account the speed of the rotating mirror, the distance
travelled, and the displacement of the light as it returns to the rotating mirror.

Fig. 1

The data are Newcomb's measurements of the passage time it took light to
travel from his lab, to a mirror on the Washington Monument, and back to his
lab.
7
distance: 7.44373 km.
66 measurements on 3 consecutive days
first measurement: 0.000024828 seconds = 24828 nanoseconds
The dataset has the deviations from 24800 nanoseconds.
The measurements on 3 different days:
[Figure: scatter plots of the deviations (vertical axis, roughly −40 to 40) against measurement number, shown separately for day 1, day 2 and day 3, followed by all measurements in one plot.]
One may estimate the speed of light using e.g. the mean, or the median, or
Huber's estimate (see below). This gives the following results (for the 3 days
separately, and for the three days combined):

         Day 1   Day 2   Day 3   All
Mean     21.75   28.55   27.85   26.21
Median   25.5    28      27      27
Huber    25.65   28.40   27.71   27.28

Table 1

The question which estimate is the best one is one of the topics of these notes.
Notation

The collection of observations will be denoted by $X = \{X_1, \ldots, X_n\}$. The
distribution of $X$, denoted by $\mathbb{P}$, is generally unknown. A statistical model is
a collection of assumptions about this unknown distribution.

We will usually assume that the observations $X_1, \ldots, X_n$ are independent and
identically distributed (i.i.d.). Or, to formulate it differently, $X_1, \ldots, X_n$ are
i.i.d. copies from some population random variable, which we denote by $X$.
The common distribution, that is: the distribution of $X$, is denoted by $P$. For
$X \in \mathbb{R}$, the distribution function of $X$ is written as
$$F(\cdot) = P(X \le \cdot).$$
Recall that the distribution function $F$ determines the distribution $P$ (and vice
versa).

Further model assumptions then concern the modeling of $P$. We write such
a model as $P \in \mathcal{P}$, where $\mathcal{P}$ is a given collection of probability measures, the
so-called model class.
The following example will serve to illustrate the concepts that are to follow.

Example 1.1.2 Let $X$ be real-valued. The location model is
$$\mathcal{P} := \{P_{\mu,F_0}(X \le \cdot) := F_0(\cdot - \mu),\ \mu \in \mathbb{R},\ F_0 \in \mathcal{F}_0\}, \quad (1.1)$$
where $\mathcal{F}_0$ is a given collection of distribution functions. Assuming the expectation
exists, we center the distributions in $\mathcal{F}_0$ to have mean zero. Then $P_{\mu,F_0}$
has mean $\mu$. We call $\mu$ a location parameter. Often, only $\mu$ is the parameter of
interest, and $F_0$ is a so-called nuisance parameter.

The class $\mathcal{F}_0$ is for example modeled as the class of all symmetric distributions,
that is
$$\mathcal{F}_0 := \{F_0 : F_0(x) = 1 - F_0(-x),\ \forall x\}. \quad (1.2)$$
This is an infinite-dimensional collection: it is not parametrized by a finite-dimensional
parameter. We then call $F_0$ an infinite-dimensional parameter.

A finite-dimensional model is for example
$$\mathcal{F}_0 := \{\Phi(\cdot/\sigma) : \sigma > 0\}, \quad (1.3)$$
where $\Phi$ is the standard normal distribution function.

Thus, the location model is
$$X_i = \mu + \epsilon_i, \quad i = 1, \ldots, n,$$
with $\epsilon_1, \ldots, \epsilon_n$ i.i.d. and, under model (1.2), symmetrically but otherwise unknown
distributed and, under model (1.3), $\mathcal{N}(0, \sigma^2)$-distributed with unknown
variance $\sigma^2$.
1.2 Estimation

A parameter is an aspect of the unknown distribution. An estimator $T$ is some
given function $T(X)$ of the observations $X$. The estimator is constructed to
estimate some unknown parameter, $\gamma$ say.

In Example 1.1.2, one may consider the following estimators of $\mu$:

• The average
$$\hat\mu_1 := \frac{1}{n} \sum_{i=1}^n X_i.$$
Note that $\hat\mu_1$ minimizes the squared loss
$$\sum_{i=1}^n (X_i - \mu)^2.$$
It can be shown that $\hat\mu_1$ is a good estimator if the model (1.3) holds. When
(1.3) is not true, in particular when there are outliers (large, wrong observations)
(Ausreisser), then one has to apply a more robust estimator.
• The (sample) median is
$$\hat\mu_2 := \begin{cases} X_{((n+1)/2)} & \text{when } n \text{ is odd,} \\ \{X_{(n/2)} + X_{(n/2+1)}\}/2 & \text{when } n \text{ is even,} \end{cases}$$
where $X_{(1)} \le \cdots \le X_{(n)}$ are the order statistics. Note that $\hat\mu_2$ is a minimizer
of the absolute loss
$$\sum_{i=1}^n |X_i - \mu|.$$

• The Huber estimator is
$$\hat\mu_3 := \arg\min_{\mu} \sum_{i=1}^n \rho(X_i - \mu), \quad (1.4)$$
where
$$\rho(x) = \begin{cases} x^2 & \text{if } |x| \le k, \\ k(2|x| - k) & \text{if } |x| > k, \end{cases}$$
with $k > 0$ some given threshold.
• We finally mention the $\alpha$-trimmed mean, defined, for some $0 < \alpha < 1$, as
$$\hat\mu_4 := \frac{1}{n - 2[n\alpha]} \sum_{i=[n\alpha]+1}^{n-[n\alpha]} X_{(i)}.$$
Note To avoid misunderstanding, we note that e.g. in (1.4), $\mu$ is used as the variable
over which is minimized, whereas in (1.1), $\mu$ is a parameter. These are actually
distinct concepts, but it is a general convention to abuse notation and employ
the same symbol $\mu$. When further developing the theory (see Chapter 6) we
shall often introduce a new symbol for the variable, e.g., (1.4) is written as
$$\hat\mu_3 := \arg\min_{c} \sum_{i=1}^n \rho(X_i - c).$$
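In computational terms, the four location estimators above can be sketched as follows. This is a minimal illustration, not part of the original text: the sample values and the threshold $k = 1.5$ are invented, and the Huber estimate is found by bisection, using that the derivative of the Huber criterion is monotone in $c$.

```python
def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    s, n = sorted(xs), len(xs)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

def huber(xs, k=1.5, tol=1e-10):
    # psi(x) = rho'(x)/2 equals x for |x| <= k and k*sign(x) otherwise;
    # c -> sum_i psi(X_i - c) is decreasing, so bisection finds its root.
    def psi_sum(c):
        return sum(max(-k, min(k, x - c)) for x in xs)
    lo, hi = min(xs), max(xs)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if psi_sum(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def trimmed_mean(xs, alpha=0.25):
    # mu_hat_4: drop the [n*alpha] smallest and largest order statistics.
    s, n = sorted(xs), len(xs)
    t = int(n * alpha)
    return sum(s[t:n - t]) / (n - 2 * t)

# A toy sample with one gross outlier (as in Newcomb's data).
xs = [25.0, 28.0, 27.0, 26.0, 24.0, -44.0]
print(mean(xs), median(xs), huber(xs), trimmed_mean(xs))
```

The mean is dragged down by the single outlier, while the median, the Huber estimate and the trimmed mean stay near the bulk of the data.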
An example of a nonparametric estimator is the empirical distribution function
$$\hat F_n(\cdot) := \frac{1}{n} \#\{X_i \le \cdot,\ 1 \le i \le n\}.$$
This is an estimator of the theoretical distribution function
$$F(\cdot) := P(X \le \cdot).$$
Any reasonable estimator is constructed according to the so-called plug-in principle
(Einsetzprinzip). That is, the parameter of interest $\gamma$ is written as $\gamma = Q(F)$,
with $Q$ some given map. The empirical distribution $\hat F_n$ is then plugged in, to
obtain the estimator $T := Q(\hat F_n)$. (We note however that problems can arise,
e.g. $Q(\hat F_n)$ may not be well-defined ....).
Examples are the above estimators $\hat\mu_1, \ldots, \hat\mu_4$ of the location parameter $\mu$. We
define the maps
$$Q_1(F) := \int x\, dF(x)$$
(the mean, or point of gravity, of $F$), and
$$Q_2(F) := F^{-1}(1/2)$$
(the median of $F$), and
$$Q_3(F) := \arg\min_{\mu} \int \rho(\cdot - \mu)\, dF,$$
and finally
$$Q_4(F) := \frac{1}{1 - 2\alpha} \int_{F^{-1}(\alpha)}^{F^{-1}(1-\alpha)} x\, dF(x).$$
Then $\hat\mu_k$ corresponds to $Q_k(\hat F_n)$, $k = 1, \ldots, 4$. If the model (1.2) is correct,
$\hat\mu_1, \ldots, \hat\mu_4$ are all estimators of $\mu$. If the model is incorrect, each $Q_k(\hat F_n)$ is still
an estimator of $Q_k(F)$ (assuming the latter exists), but the $Q_k(F)$ may all be
different aspects of $F$.
1.3 Comparison of estimators: risk functions

A risk function $R(\cdot, \hat\mu)$ measures the loss due to the error of the estimator $\hat\mu$. The
risk depends on the unknown distribution, e.g. in the location model, on $\mu$
and/or $F_0$. Examples are
$$R(\mu, F_0, \hat\mu) := \begin{cases} \mathbb{E}_{\mu,F_0} |\hat\mu - \mu|^p, \\ \mathbb{P}_{\mu,F_0}(|\hat\mu - \mu| > a), \\ \ldots \end{cases}$$
Here $p \ge 1$ and $a > 0$ are chosen by the researcher.

If $\hat\mu$ is an equivariant estimator, the above risks no longer depend on $\mu$. An
estimator $\hat\mu := \hat\mu(X_1, \ldots, X_n)$ is called equivariant if
$$\hat\mu(X_1 + c, \ldots, X_n + c) = \hat\mu(X_1, \ldots, X_n) + c, \quad \forall c.$$
Then, writing
$$\mathbb{P}_{F_0} := \mathbb{P}_{0,F_0}$$
(and likewise for the expectation $\mathbb{E}_{F_0}$), we have for all $t$
$$\mathbb{P}_{\mu,F_0}(\hat\mu - \mu \le t) = \mathbb{P}_{F_0}(\hat\mu \le t),$$
that is, the distribution of $\hat\mu - \mu$ does not depend on $\mu$. We then write
$$R(\mu, F_0, \hat\mu) := R(F_0, \hat\mu) := \begin{cases} \mathbb{E}_{F_0} |\hat\mu|^p, \\ \mathbb{P}_{F_0}(|\hat\mu| > a), \\ \ldots \end{cases}$$
1.4 Comparison of estimators: sensitivity

We can compare estimators with respect to their sensitivity to large errors in
the data. Suppose the estimator $\hat\mu = \hat\mu_n$ is defined for each $n$, and is symmetric
in $X_1, \ldots, X_n$.

Influence of a single additional observation

The influence function is
$$l(x) := \hat\mu_{n+1}(X_1, \ldots, X_n, x) - \hat\mu_n(X_1, \ldots, X_n), \quad x \in \mathbb{R}.$$

Break down point

Let, for $m \le n$,
$$\epsilon(m) := \sup_{x^*_1, \ldots, x^*_m} |\hat\mu(x^*_1, \ldots, x^*_m, X_{m+1}, \ldots, X_n)|.$$
If $\epsilon(m) = \infty$, we say that with $m$ outliers the estimator can break down. The
break down point is defined as
$$\epsilon^* := \min\{m : \epsilon(m) = \infty\}/n.$$
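The break down behavior can be tried out numerically. The sketch below (sample values chosen arbitrarily, not from the text) replaces $m$ observations by a gross outlier and records what happens: the sample mean breaks down already at $m = 1$, while the median only breaks down once about half the sample is corrupted.

```python
def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    s, n = sorted(xs), len(xs)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

def corrupt(xs, m, big=1e12):
    # Replace the first m observations by a gross outlier x* = big.
    return [big] * m + list(xs[m:])

xs = [2.1, -0.3, 0.7, 1.4, -1.2, 0.5, 1.9, -0.8, 0.2, 1.1]  # n = 10

# One outlier already ruins the mean ...
print(mean(corrupt(xs, 1)))
# ... while the median stays bounded for m < n/2.
print([median(corrupt(xs, m)) for m in range(6)])
```

This reflects the break down points $\epsilon^* = 1/n$ for the mean and $\epsilon^* \approx 1/2$ for the median.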
1.5 Confidence intervals

Consider the location model (Example 1.1.2).

Definition A subset $I = I(X) \subset \mathbb{R}$, depending (only) on the data $X =
(X_1, \ldots, X_n)$, is called a confidence set (Vertrauensbereich) for $\mu$, at level $1 - \alpha$,
if
$$\mathbb{P}_{\mu,F_0}(\mu \in I) \ge 1 - \alpha, \quad \forall\ \mu \in \mathbb{R},\ F_0 \in \mathcal{F}_0.$$
A confidence interval is of the form
$$I := [\underline{\mu}, \bar\mu],$$
where the boundaries $\underline{\mu} = \underline{\mu}(X)$ and $\bar\mu = \bar\mu(X)$ depend (only) on the data $X$.
1.5.1 Equivalence of confidence sets and tests

Let, for each $\mu_0 \in \mathbb{R}$, $\phi(X, \mu_0) \in \{0, 1\}$ be a test at level $\alpha$ for the hypothesis
$$H_{\mu_0} : \mu = \mu_0.$$
Thus, we reject $H_{\mu_0}$ if and only if $\phi(X, \mu_0) = 1$, and
$$\mathbb{P}_{\mu_0,F_0}(\phi(X, \mu_0) = 1) \le \alpha.$$
Then
$$I(X) := \{\mu : \phi(X, \mu) = 0\}$$
is a $(1-\alpha)$-confidence set for $\mu$.

Conversely, if $I(X)$ is a $(1-\alpha)$-confidence set for $\mu$, then, for all $\mu_0$, the test
$\phi(X, \mu_0)$ defined as
$$\phi(X, \mu_0) = \begin{cases} 1 & \text{if } \mu_0 \notin I(X) \\ 0 & \text{else} \end{cases}$$
is a test at level $\alpha$ of $H_{\mu_0}$.
1.6 Intermezzo: quantile functions

Let $F$ be a distribution function. Then $F$ is cadlag (continue à droite, limite à
gauche). Define the quantile functions
$$q_F^+(u) := \sup\{x : F(x) \le u\},$$
and
$$q_F^-(u) := \inf\{x : F(x) \ge u\} := F^{-1}(u).$$
It holds that
$$F(q_F^-(u)) \ge u$$
and, for all $h > 0$,
$$F(q_F^+(u) - h) \le u.$$
Hence
$$F(q_F^+(u)^-) := \lim_{h \downarrow 0} F(q_F^+(u) - h) \le u.$$
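These quantile functions can be computed for the empirical distribution of a small sample, where the inequalities above are easy to check directly. A minimal sketch (sample values invented; for an empirical distribution $\sup\{x : F(x) \le u\}$ coincides with the smallest observation where $F$ jumps above $u$):

```python
def make_ecdf(xs):
    s, n = sorted(xs), len(xs)
    def F(x):
        # F(x) = fraction of observations <= x
        return sum(1 for v in s if v <= x) / n
    return F, s

def q_minus(F, grid, u):
    # q_F^-(u) = inf{x : F(x) >= u}, searched over the jump points
    return min(x for x in grid if F(x) >= u)

def q_plus(F, grid, u):
    # q_F^+(u) = sup{x : F(x) <= u} = inf{x : F(x) > u} for a cdf
    return min(x for x in grid if F(x) > u)

xs = [3.0, 1.0, 4.0, 1.5, 5.0]
F, grid = make_ecdf(xs)
u = 0.4  # attained value of F, so the two quantiles differ
print(q_minus(F, grid, u), q_plus(F, grid, u))
print(F(q_minus(F, grid, u)) >= u)          # True
print(F(q_plus(F, grid, u) - 1e-9) <= u)    # True
```

Here $F$ jumps through the value $u = 0.4$ at $x = 1.5$, so $q_F^-(u) = 1.5$ while $q_F^+(u) = 3$: the two quantile functions differ exactly at attained values of $F$.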
1.7 How to construct tests and confidence sets

Consider a model class
$$\mathcal{P} := \{P_\theta : \theta \in \Theta\}.$$
Moreover, consider a space $\Gamma$, and a map
$$g : \Theta \to \Gamma, \quad g(\theta) := \gamma.$$
We think of $\gamma$ as the parameter of interest (as in the plug-in principle, with
$\gamma = Q(P_\theta) = g(\theta)$).

For instance, in Example 1.1.2, the parameter space is $\Theta := \{\theta = (\mu, F_0),\
\mu \in \mathbb{R},\ F_0 \in \mathcal{F}_0\}$, and, when $\mu$ is the parameter of interest, $g(\mu, F_0) = \mu$.

To test
$$H_{\gamma_0} : \gamma = \gamma_0,$$
we look for a pivot (Tür-Angel). This is a function $Z(X, \gamma)$ depending on the
data $X$ and on the parameter $\gamma$, such that for all $\theta \in \Theta$, the distribution
$$\mathbb{P}_\theta(Z(X, g(\theta)) \le \cdot) := G(\cdot)$$
does not depend on $\theta$. We note that to find a pivot is unfortunately not always
possible. However, if we do have a pivot $Z(X, \gamma)$ with distribution $G$, we can
compute its quantile functions
$$q_L := q_G^+\Big(\frac{\alpha}{2}\Big), \quad q_R := q_G^-\Big(1 - \frac{\alpha}{2}\Big),$$
and the test
$$\phi(X, \gamma_0) := \begin{cases} 1 & \text{if } Z(X, \gamma_0) \notin [q_L, q_R] \\ 0 & \text{else.} \end{cases}$$
Then the test has level $\alpha$ for testing $H_{\gamma_0}$, with $\gamma_0 = g(\theta_0)$:
$$\mathbb{P}_{\theta_0}(\phi(X, g(\theta_0)) = 1) = \mathbb{P}_{\theta_0}(Z(X, g(\theta_0)) > q_R) + \mathbb{P}_{\theta_0}(Z(X, g(\theta_0)) < q_L)$$
$$= 1 - G(q_R) + G(q_L^-) \le 1 - \Big(1 - \frac{\alpha}{2}\Big) + \frac{\alpha}{2} = \alpha.$$
As an example, consider again the location model (Example 1.1.2). Let
$$\Theta := \{\theta = (\mu, F_0),\ \mu \in \mathbb{R},\ F_0 \in \mathcal{F}_0\},$$
with $\mathcal{F}_0$ a subset of the collection of symmetric distributions (see (1.2)). Let
$\hat\mu$ be an equivariant estimator, so that the distribution of $\hat\mu - \mu$ does not depend
on $\mu$.

• If $\mathcal{F}_0 := \{F_0\}$ is a single distribution (i.e., the distribution $F_0$ is known), we
take $Z(X, \mu) := \hat\mu - \mu$ as pivot. By the equivariance, this pivot has distribution
$G$ depending only on $F_0$.

• If $\mathcal{F}_0 := \{F_0(\cdot) = \Phi(\cdot/\sigma) : \sigma > 0\}$, we choose $\hat\mu := \bar X_n$, where
$\bar X_n = \sum_{i=1}^n X_i/n$ is the sample mean. As pivot, we take
$$Z(X, \mu) := \frac{\sqrt{n}(\bar X_n - \mu)}{S_n},$$
where $S_n^2 = \sum_{i=1}^n (X_i - \bar X_n)^2/(n-1)$ is the sample variance. Then $G$ is the
Student distribution with $n - 1$ degrees of freedom.

• If $\mathcal{F}_0 := \{F_0 \text{ continuous at } x = 0\}$, we let the pivot be the sign test statistic:
$$Z(X, \mu) := \sum_{i=1}^n 1\{X_i \ge \mu\}.$$
Then $G$ is the Binomial$(n, p)$ distribution, with parameter $p = 1/2$.
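The Binomial$(n, 1/2)$ null distribution of the sign statistic can be handled exactly. The sketch below (the sample size $n = 10$ and the rejection region are invented for illustration) computes the exact level of a two-sided sign test that rejects when the count of observations above $\mu_0$ is at most 1 or at least 9:

```python
from math import comb

def binom_pmf(n, k, p=0.5):
    # P(Bin(n, p) = k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 10
# Exact level of the two-sided sign test rejecting when
# Z = #{X_i >= mu_0} is <= 1 or >= 9:
level = sum(binom_pmf(n, k) for k in (0, 1, 9, 10))
print(level)   # 22/1024, about 0.0215
```

Because the Binomial distribution is discrete, only a discrete set of levels is attainable; here the closest level below 0.05 is 22/1024.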
Let $Z_n(X_1, \ldots, X_n, \gamma)$ be some function of the data and the parameter of interest,
defined for each sample size $n$. We call $Z_n(X_1, \ldots, X_n, \gamma)$ an asymptotic
pivot if for all $\theta \in \Theta$,
$$\lim_{n\to\infty} \mathbb{P}_\theta(Z_n(X_1, \ldots, X_n, g(\theta)) \le \cdot) = G(\cdot),$$
where the limit $G$ does not depend on $\theta$.

In the location model, suppose $X_1, \ldots, X_n$ are the first $n$ of an infinite sequence
of i.i.d. random variables, and that
$$\mathcal{F}_0 := \Big\{F_0 :\ \int x\, dF_0(x) = 0,\ \int x^2\, dF_0(x) < \infty\Big\}.$$
Then
$$Z_n(X_1, \ldots, X_n, \mu) := \frac{\sqrt{n}(\bar X_n - \mu)}{S_n}$$
is an asymptotic pivot, with limiting distribution $G = \Phi$.
Comparison of confidence intervals and tests

When comparing confidence intervals, the aim is usually to take the one with
smallest length on average (keeping the level at $1 - \alpha$). In the case of tests,
we look for the one with maximal power. In the location model, this leads to
studying
$$\mathbb{E}_{\mu,F_0} |\bar\mu(X) - \underline{\mu}(X)|$$
for $(1-\alpha)$-confidence sets $[\underline{\mu}, \bar\mu]$, or to studying the power of a test $\phi(X, \mu_0)$ at
level $\alpha$. Recall that the power is $\mathbb{P}_{\mu,F_0}(\phi(X, \mu_0) = 1)$ for values $\mu \ne \mu_0$.
1.8 An illustration: the two-sample problem

Consider the following data, concerning weight gain/loss. The control group x
had their usual diet, and the treatment group y obtained a special diet, designed
for preventing weight gain. The study was carried out to test whether the diet
works.

   x (control)  rank(x)   y (treatment)  rank(y)
        5          7            6           8
        0          3           -5           2
       16         10           -6           1
        2          5            1           4
        9          9            4           6
  sum: 32                 sum: 0

Table 2
Let $n$ ($m$) be the sample size of the control group x (treatment group y). The
mean in group x (y) is denoted by $\bar x$ ($\bar y$). The sums of squares are $SS_x :=
\sum_{i=1}^n (x_i - \bar x)^2$ and $SS_y := \sum_{j=1}^m (y_j - \bar y)^2$. So in this study, one has $n = m = 5$
and the values $\bar x = 6.4$, $\bar y = 0$, $SS_x = 161.2$ and $SS_y = 114$. The ranks, rank(x)
and rank(y), are the rank-numbers when putting all $n + m$ data together (e.g.,
$y_3 = -6$ is the smallest observation and hence rank$(y_3) = 1$).
We assume that the data are realizations of two independent samples, say
$X = (X_1, \ldots, X_n)$ and $Y = (Y_1, \ldots, Y_m)$, where $X_1, \ldots, X_n$ are i.i.d. with
distribution function $F_X$, and $Y_1, \ldots, Y_m$ are i.i.d. with distribution function
$F_Y$. The distribution functions $F_X$ and $F_Y$ may be in whole or in part unknown.
The testing problem is:
$$H_0 : F_X = F_Y$$
against a one- or two-sided alternative.
1.8.1 Assuming normality

The classical two-sample Student test is based on the assumption that the data
come from a normal distribution. Moreover, it is assumed that the variances of
$F_X$ and $F_Y$ are equal. Thus,
$$(F_X, F_Y) \in \Big\{F_X = \Phi\Big(\frac{\cdot - \mu}{\sigma}\Big),\ F_Y = \Phi\Big(\frac{\cdot - (\mu + \gamma)}{\sigma}\Big) :\ \mu \in \mathbb{R},\ \sigma > 0,\ \gamma \in \Gamma\Big\}.$$
Here, $\Gamma \ni 0$ is the range of shifts in mean one considers, e.g. $\Gamma = \mathbb{R}$ for
two-sided situations, and $\Gamma = (-\infty, 0]$ for a one-sided situation. The testing
problem reduces to
$$H_0 : \gamma = 0.$$
We now look for a pivot $Z(X, Y, \gamma)$. Define the sample means
$$\bar X := \frac{1}{n}\sum_{i=1}^n X_i, \quad \bar Y := \frac{1}{m}\sum_{j=1}^m Y_j,$$
and the pooled sample variance
$$S^2 := \frac{1}{m + n - 2}\Big\{\sum_{i=1}^n (X_i - \bar X)^2 + \sum_{j=1}^m (Y_j - \bar Y)^2\Big\}.$$
Note that $\bar X$ has expectation $\mu$ and variance $\sigma^2/n$, and $\bar Y$ has expectation $\mu + \gamma$
and variance $\sigma^2/m$. So $\bar Y - \bar X$ has expectation $\gamma$ and variance
$$\frac{\sigma^2}{n} + \frac{\sigma^2}{m} = \sigma^2\Big(\frac{n + m}{nm}\Big).$$
The normality assumption implies that
$$\bar Y - \bar X \ \text{is}\ \mathcal{N}\Big(\gamma,\ \sigma^2\Big(\frac{n + m}{nm}\Big)\Big)\text{-distributed}.$$
Hence
$$\sqrt{\frac{nm}{n + m}}\ \frac{\bar Y - \bar X - \gamma}{\sigma} \ \text{is}\ \mathcal{N}(0, 1)\text{-distributed}.$$
To arrive at a pivot, we now plug in the estimate $S$ for the unknown $\sigma$:
$$Z(X, Y, \gamma) := \sqrt{\frac{nm}{n + m}}\ \frac{\bar Y - \bar X - \gamma}{S}.$$
Indeed, $Z(X, Y, \gamma)$ has a distribution $G$ which does not depend on unknown
parameters. The distribution $G$ is Student$(n + m - 2)$ (the Student-distribution
with $n + m - 2$ degrees of freedom). As test statistic for $H_0 : \gamma = 0$, we therefore
take
$$T = T_{\text{Student}} := Z(X, Y, 0).$$
The one-sided test at level $\alpha$, for $H_0 : \gamma = 0$ against $H_1 : \gamma < 0$, is
$$\phi(X, Y) := \begin{cases} 1 & \text{if } T < -t_{n+m-2}(1 - \alpha) \\ 0 & \text{if } T \ge -t_{n+m-2}(1 - \alpha), \end{cases}$$
where, for $\nu > 0$, $t_\nu(1 - \alpha) = -t_\nu(\alpha)$ is the $(1 - \alpha)$-quantile of the Student$(\nu)$-distribution.

Let us apply this test to the data given in Table 2. We take $\alpha = 0.05$. The
observed values are $\bar x = 6.4$, $\bar y = 0$ and $s^2 = 34.4$. The test statistic takes the
value $T = -1.725$, which is bigger than the 5% quantile $t_8(0.05) = -1.9$. Hence, we
cannot reject $H_0$. The p-value of the observed value of $T$ is
$$\text{p-value} := \mathbb{P}_{\gamma=0}(T < -1.725) = 0.06.$$
So the p-value is in this case only a little larger than the level $\alpha = 0.05$.
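The computation for Table 2 can be reproduced directly from the formulas above; the following sketch only uses the data of Table 2 and the definitions of $S^2$ and $Z(X, Y, 0)$:

```python
from math import sqrt

x = [5, 0, 16, 2, 9]   # control group
y = [6, -5, -6, 1, 4]  # treatment group
n, m = len(x), len(y)

xbar, ybar = sum(x) / n, sum(y) / m
ss_x = sum((v - xbar) ** 2 for v in x)
ss_y = sum((v - ybar) ** 2 for v in y)
s2 = (ss_x + ss_y) / (n + m - 2)     # pooled sample variance (34.4 here)

# T = Z(X, Y, 0), Student(n + m - 2)-distributed under H_0
T = sqrt(n * m / (n + m)) * (ybar - xbar) / sqrt(s2)
print(xbar, ybar, s2, T)
```

Since $T \approx -1.725$ does not fall below the critical value $t_8(0.05) \approx -1.86$, $H_0$ is not rejected at level 0.05.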
1.8.2 A nonparametric test

In this subsection, we suppose that $F_X$ and $F_Y$ are continuous, but otherwise
unknown. The model class for both $F_X$ and $F_Y$ is thus
$$\mathcal{F} := \{\text{all continuous distributions}\}.$$
The continuity assumption ensures that all observations are distinct, that is,
there are no ties. We can then put them in strictly increasing order. Let
$N = n + m$ and $Z_1, \ldots, Z_N$ be the pooled sample
$$Z_i := X_i,\ i = 1, \ldots, n, \quad Z_{n+j} := Y_j,\ j = 1, \ldots, m.$$
Define
$$R_i := \text{rank}(Z_i), \quad i = 1, \ldots, N,$$
and let
$$Z_{(1)} < \cdots < Z_{(N)}$$
be the order statistics of the pooled sample (so that $Z_i = Z_{(R_i)}$, $i = 1, \ldots, N$).

The Wilcoxon test statistic is
$$T = T_{\text{Wilcoxon}} := \sum_{i=1}^n R_i.$$
One may check that this test statistic $T$ can alternatively be written as
$$T = \#\{Y_j < X_i\} + \frac{n(n+1)}{2}.$$
For example, for the data in Table 2, the observed value of $T$ is 34, and
$$\#\{y_j < x_i\} = 19, \quad \frac{n(n+1)}{2} = 15.$$
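Both expressions for the Wilcoxon statistic are easy to check on the data of Table 2 (no ties occur, so ranks are well-defined):

```python
x = [5, 0, 16, 2, 9]   # control group
y = [6, -5, -6, 1, 4]  # treatment group
n = len(x)

pooled = x + y
# rank(z) = position of z in the sorted pooled sample (all values distinct)
rank = {z: i + 1 for i, z in enumerate(sorted(pooled))}

T = sum(rank[v] for v in x)                       # sum of the x-ranks
U = sum(1 for xi in x for yj in y if yj < xi)     # #{y_j < x_i}
print(T, U, U + n * (n + 1) // 2)   # 34 19 34
```

The two expressions agree, as claimed: $T = \#\{y_j < x_i\} + n(n+1)/2 = 19 + 15 = 34$.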
Large values of $T$ mean that the $X_i$ are generally larger than the $Y_j$, and hence
indicate evidence against $H_0$.

To check whether or not the observed value of the test statistic is compatible
with the null-hypothesis, we need to know its null-distribution, that is, the
distribution under $H_0$. Under $H_0 : F_X = F_Y$, the vector of ranks $(R_1, \ldots, R_n)$
has the same distribution as $n$ random draws without replacement from the
numbers $\{1, \ldots, N\}$. That is, if we let
$$\mathbf{r} := (r_1, \ldots, r_n, r_{n+1}, \ldots, r_N)$$
denote a permutation of $\{1, \ldots, N\}$, then
$$\mathbb{P}_{H_0}\big((R_1, \ldots, R_n, R_{n+1}, \ldots, R_N) = \mathbf{r}\big) = \frac{1}{N!}$$
(see Theorem 1.8.1), and hence
$$\mathbb{P}_{H_0}(T = t) = \frac{\#\{\mathbf{r} : \sum_{i=1}^n r_i = t\}}{N!}.$$
This can also be written as
$$\mathbb{P}_{H_0}(T = t) = \frac{1}{\binom{N}{n}}\ \#\Big\{r_1 < \cdots < r_n,\ r_{n+1} < \cdots < r_N :\ \sum_{i=1}^n r_i = t\Big\}.$$
So clearly, the null-distribution of $T$ does not depend on $F_X$ or $F_Y$. It does
however depend on the sample sizes $n$ and $m$. It is tabulated for $n$ and $m$
small or moderately large. For large $n$ and $m$, a normal approximation of the
null-distribution can be used.
Theorem 1.8.1 formally derives the null-distribution of the test, and actually
proves that the order statistics and the ranks are independent. The latter result
will be of interest in Example 2.10.4.

For two random variables $X$ and $Y$, use the notation
$$X \stackrel{\mathcal{D}}{=} Y$$
when $X$ and $Y$ have the same distribution.

Theorem 1.8.1 Let $Z_1, \ldots, Z_N$ be i.i.d. with continuous distribution $F$ on
$\mathbb{R}$. Then $(Z_{(1)}, \ldots, Z_{(N)})$ and $\mathbf{R} := (R_1, \ldots, R_N)$ are independent, and for all
permutations $\mathbf{r} := (r_1, \ldots, r_N)$,
$$\mathbb{P}(\mathbf{R} = \mathbf{r}) = \frac{1}{N!}.$$
Proof. Let $Z_{Q_i} := Z_{(i)}$, and $\mathbf{Q} := (Q_1, \ldots, Q_N)$. Then
$$\mathbf{R} = \mathbf{r} \iff \mathbf{Q} = \mathbf{r}^{-1} := \mathbf{q},$$
where $\mathbf{r}^{-1}$ is the inverse permutation of $\mathbf{r}$.¹ For all permutations $\mathbf{q}$ and all
measurable maps $f$,
$$f(Z_1, \ldots, Z_N) \stackrel{\mathcal{D}}{=} f(Z_{q_1}, \ldots, Z_{q_N}).$$
Therefore, for all measurable sets $A \subset \mathbb{R}^N$, and all permutations $\mathbf{q}$,
$$\mathbb{P}\big((Z_1, \ldots, Z_N) \in A,\ Z_1 < \ldots < Z_N\big) = \mathbb{P}\big((Z_{q_1}, \ldots, Z_{q_N}) \in A,\ Z_{q_1} < \ldots < Z_{q_N}\big).$$
Because there are $N!$ permutations, we see that for any $\mathbf{q}$,
$$\mathbb{P}\big((Z_{(1)}, \ldots, Z_{(N)}) \in A\big) = N!\ \mathbb{P}\big((Z_{q_1}, \ldots, Z_{q_N}) \in A,\ Z_{q_1} < \ldots < Z_{q_N}\big)$$
$$= N!\ \mathbb{P}\big((Z_{(1)}, \ldots, Z_{(N)}) \in A,\ \mathbf{R} = \mathbf{r}\big),$$
where $\mathbf{r} = \mathbf{q}^{-1}$. Thus we have shown that for all measurable $A$, and for all $\mathbf{r}$,
$$\mathbb{P}\big((Z_{(1)}, \ldots, Z_{(N)}) \in A,\ \mathbf{R} = \mathbf{r}\big) = \frac{1}{N!}\ \mathbb{P}\big((Z_{(1)}, \ldots, Z_{(N)}) \in A\big). \quad (1.5)$$
Take $A = \mathbb{R}^N$ to find that (1.5) implies
$$\mathbb{P}\big(\mathbf{R} = \mathbf{r}\big) = \frac{1}{N!}.$$
Plug this back into (1.5) to see that we have the product structure
$$\mathbb{P}\big((Z_{(1)}, \ldots, Z_{(N)}) \in A,\ \mathbf{R} = \mathbf{r}\big) = \mathbb{P}\big((Z_{(1)}, \ldots, Z_{(N)}) \in A\big)\ \mathbb{P}\big(\mathbf{R} = \mathbf{r}\big),$$
which holds for all measurable $A$. In other words, $(Z_{(1)}, \ldots, Z_{(N)})$ and $\mathbf{R}$ are
independent. □
¹ Here is an example, with $N = 3$: $(z_1, z_2, z_3) = (5, 6, 4)$, $(r_1, r_2, r_3) = (2, 3, 1)$, $(q_1, q_2, q_3) = (3, 1, 2)$.

1.8.3 Comparison of Student's test and Wilcoxon's test

Because Wilcoxon's test is only based on the ranks, and does not rely on the
assumption of normality, one may expect that, when the data are in fact normally
distributed, Wilcoxon's test will have less power than Student's test. The loss
of power is however small. Let us formulate this more precisely, in terms of
the relative efficiency of the two tests. Let the significance $\alpha$ be fixed, and
let $\beta$ be the required power. Let $n$ and $m$ be equal, $N = 2n$ be the total
sample size, and $N_{\text{Student}}$ ($N_{\text{Wilcoxon}}$) be the number of observations needed to
reach power $\beta$ using Student's (Wilcoxon's) test. Consider shift alternatives,
i.e. $F_Y(\cdot) = F_X(\cdot - \gamma)$ (with, in our example, $\gamma < 0$). One can show that
$N_{\text{Student}}/N_{\text{Wilcoxon}}$ is approximately .95 when the normal model is correct. For
a large class of distributions, the ratio $N_{\text{Student}}/N_{\text{Wilcoxon}}$ ranges from .85 to $\infty$,
that is, when using Wilcoxon one generally has very limited loss of efficiency as
compared to Student, and one may in fact have a substantial gain of efficiency.
1.9 How to construct estimators

Consider i.i.d. observations $X_1, \ldots, X_n$, copies of a random variable $X$ with
distribution $P \in \{P_\theta : \theta \in \Theta\}$. The parameter of interest is denoted by
$\gamma = g(\theta) \in \Gamma$.
1.9.1 Plug-in estimators

For real-valued observations, one can define the distribution function
$$F(\cdot) = P(X \le \cdot).$$
An estimator of $F$ is the empirical distribution function
$$\hat F_n(\cdot) = \frac{1}{n} \sum_{i=1}^n 1\{X_i \le \cdot\}.$$
Note that when knowing only $\hat F_n$, one can reconstruct the order statistics
$X_{(1)} \le \ldots \le X_{(n)}$, but not the original data $X_1, \ldots, X_n$. Now, the order
in which the data are given carries no information about the distribution $P$. In
other words, a reasonable² estimator $T = T(X_1, \ldots, X_n)$ depends only on the
sample $(X_1, \ldots, X_n)$ via the order statistics $(X_{(1)}, \ldots, X_{(n)})$ (i.e., shuffling the
data should have no influence on the value of $T$). Because these order statistics
can be determined from the empirical distribution $\hat F_n$, we conclude that any
reasonable estimator $T$ can be written as a function of $\hat F_n$:
$$T = Q(\hat F_n),$$
for some map $Q$.
Similarly, the distribution function $F_\theta := P_\theta(X \le \cdot)$ completely characterizes
the distribution $P_\theta$. Hence, a parameter is a function of $F_\theta$:
$$\gamma = g(\theta) = Q(F_\theta).$$
If the mapping $Q$ is defined at all $F_\theta$ as well as at $\hat F_n$, we call $Q(\hat F_n)$ a plug-in
estimator of $Q(F_\theta)$.

² What is reasonable has to be considered with some care. There are in fact reasonable
statistical procedures that do treat the $\{X_i\}$ in an asymmetric way. An example is splitting
the sample into a training and test set (for model validation).
The idea is not restricted to the one-dimensional setting. For arbitrary observation
space $\mathcal{X}$, we define the empirical measure
$$\hat P_n = \frac{1}{n} \sum_{i=1}^n \delta_{X_i},$$
where $\delta_x$ is a point-mass at $x$. The empirical measure puts mass $1/n$ at each
observation. This is indeed an extension of $\mathcal{X} = \mathbb{R}$ to general $\mathcal{X}$, as the empirical
distribution function $\hat F_n$ jumps at each observation, with jump height equal to
the number of times the value was observed times $1/n$ (i.e. jump height $1/n$ if all $X_i$ are
distinct). So, as in the real-valued case, if the map $Q$ is defined at all $P_\theta$ as well
as at $\hat P_n$, we call $Q(\hat P_n)$ a plug-in estimator of $Q(P_\theta)$.

We stress that typically, the representation $\gamma = g(\theta)$ as a function $Q$ of $P_\theta$ is not
unique, i.e., there are various choices of $Q$. Each such choice generally
leads to a different estimator. Moreover, the assumption that $Q$ is defined at
$\hat P_n$ is often violated. One can sometimes modify the map $Q$ to a map $Q_n$ that,
in some sense, approximates $Q$ for $n$ large. The modified plug-in estimator then
takes the form $Q_n(\hat P_n)$.
1.9.2 The method of moments

Let $X \in \mathbb{R}$ and suppose (say) that the parameter of interest is $\theta$ itself, and
that $\Theta \subset \mathbb{R}^p$. Let $\mu_1(\theta), \ldots, \mu_p(\theta)$ denote the first $p$ moments of $X$ (assumed
to exist), i.e.,
$$\mu_j(\theta) = \mathbb{E}_\theta X^j = \int x^j\, dF_\theta(x), \quad j = 1, \ldots, p.$$
Also assume that the map
$$m : \Theta \to \mathbb{R}^p,$$
defined by
$$m(\theta) = [\mu_1(\theta), \ldots, \mu_p(\theta)],$$
has an inverse
$$m^{-1}(\mu_1, \ldots, \mu_p),$$
for all $[\mu_1, \ldots, \mu_p] \in \mathcal{M}$ (say). We estimate the $\mu_j$ by their sample counterparts
$$\hat\mu_j := \frac{1}{n}\sum_{i=1}^n X_i^j = \int x^j\, d\hat F_n(x), \quad j = 1, \ldots, p.$$
When $[\hat\mu_1, \ldots, \hat\mu_p] \in \mathcal{M}$ we can plug them in to obtain the estimator
$$\hat\theta := m^{-1}(\hat\mu_1, \ldots, \hat\mu_p).$$
Example
Let $X$ have the negative binomial distribution with known parameter $k$ and
unknown success parameter $\theta \in (0, 1)$:
$$P_\theta(X = x) = \binom{k + x - 1}{x} \theta^k (1 - \theta)^x, \quad x \in \{0, 1, \ldots\}.$$
This is the distribution of the number of failures till the $k$-th success, where at
each trial, the probability of success is $\theta$, and where the trials are independent.
It holds that
$$\mathbb{E}_\theta(X) = k\, \frac{1 - \theta}{\theta} := m(\theta).$$
Hence
$$m^{-1}(\mu) = \frac{k}{\mu + k},$$
and the method of moments estimator is
$$\hat\theta = \frac{k}{\bar X + k} = \frac{nk}{\sum_{i=1}^n X_i + nk} = \frac{\text{number of successes}}{\text{number of trials}}.$$
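A quick numerical check of this estimator; the sample values and the known $k = 3$ are invented for illustration:

```python
def mom_negbin(xs, k):
    # theta_hat = n*k / (sum x_i + n*k) = successes / trials
    n = len(xs)
    return n * k / (sum(xs) + n * k)

# Observed numbers of failures before the 3rd success, per experiment.
xs = [2, 5, 0, 3, 4, 1]
print(mom_negbin(xs, k=3))   # 18/33
```

Here there are $nk = 18$ successes in $\sum x_i + nk = 33$ trials, so $\hat\theta = 18/33$, matching the "successes over trials" interpretation.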
Example
Suppose $X$ has density
$$p_\theta(x) = \theta(1 + x)^{-(1+\theta)}, \quad x > 0,$$
with respect to Lebesgue measure, and with $\theta \in \Theta \subset (0, \infty)$. Then, for $\theta > 1$,
$$\mathbb{E}_\theta X = \frac{1}{\theta - 1} := m(\theta),$$
with inverse
$$m^{-1}(\mu) = \frac{1 + \mu}{\mu}.$$
The method of moments estimator would thus be
$$\hat\theta = \frac{1 + \bar X}{\bar X}.$$
However, the mean $\mathbb{E}_\theta X$ does not exist for $\theta \le 1$, so when $\Theta$ contains values
$\theta \le 1$, the method of moments is perhaps not a good idea. We will see that the
maximum likelihood estimator does not suffer from this problem.
1.9.3 Likelihood methods

Suppose that $\mathcal{P} := \{P_\theta : \theta \in \Theta\}$ is dominated by a $\sigma$-finite measure $\nu$. We
write the densities as
$$p_\theta := \frac{dP_\theta}{d\nu}, \quad \theta \in \Theta.$$

Definition The likelihood function (of the data $X = (X_1, \ldots, X_n)$) is
$$L_X(\vartheta) := \prod_{i=1}^n p_\vartheta(X_i).$$
The MLE (maximum likelihood estimator) is
$$\hat\theta := \arg\max_{\vartheta \in \Theta} L_X(\vartheta).$$

Note We use the symbol $\vartheta$ for the variable in the likelihood function, and the
slightly different symbol $\theta$ for the parameter we want to estimate. It is however
a common convention to use the same symbol for both (as already noted in the
earlier section on estimation). However, as we will see below, different symbols
are needed for the development of the theory.

Note Alternatively, we may write the MLE as the maximizer of the log-likelihood
$$\hat\theta = \arg\max_{\vartheta \in \Theta} \log L_X(\vartheta) = \arg\max_{\vartheta \in \Theta} \sum_{i=1}^n \log p_\vartheta(X_i).$$
The log-likelihood is generally mathematically more tractable. For example,
if the densities are differentiable, one can typically obtain the maximum by
setting the derivatives to zero, and it is easier to differentiate a sum than a
product.

Note The likelihood function may have local maxima. Moreover, the MLE is
not always unique, or may not exist (for example, the likelihood function may
be unbounded).
We will now show that maximum likelihood is a plug-in method. First, as noted
above, the MLE maximizes the log-likelihood. We may of course normalize the
log-likelihood by $1/n$:
$$\hat\theta = \arg\max_{\vartheta \in \Theta} \frac{1}{n}\sum_{i=1}^n \log p_\vartheta(X_i).$$
Replacing the average $\sum_{i=1}^n \log p_\vartheta(X_i)/n$ by its theoretical counterpart gives
$$\arg\max_{\vartheta \in \Theta} \mathbb{E}_\theta \log p_\vartheta(X),$$
which is indeed equal to the parameter $\theta$ we are trying to estimate: by the
inequality $\log x \le x - 1$, $x > 0$,
$$\mathbb{E}_\theta \log \frac{p_\vartheta(X)}{p_\theta(X)} \le \mathbb{E}_\theta \Big(\frac{p_\vartheta(X)}{p_\theta(X)} - 1\Big) = 0.$$
(Note that using different symbols $\theta$ and $\vartheta$ is indeed crucial here.) Chapter 6
will put this in a wider perspective.
Example
We turn back to the previous example. Suppose $X$ has density
$$p_\theta(x) = \theta(1+x)^{-(1+\theta)}, \quad x > 0,$$
with respect to Lebesgue measure, and with $\theta \in \Theta = (0, \infty)$. Then
$$\log p_\vartheta(x) = \log\vartheta - (1+\vartheta)\log(1+x),$$
$$\frac{d}{d\vartheta}\log p_\vartheta(x) = \frac{1}{\vartheta} - \log(1+x).$$
We put the derivative of the log-likelihood to zero and solve:
$$\frac{n}{\vartheta} - \sum_{i=1}^n \log(1+X_i) = 0 \;\Longrightarrow\; \hat\theta = \frac{1}{\sum_{i=1}^n \log(1+X_i)/n}.$$
(One may check that this is indeed the maximum.)
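As a numerical sanity check, here is a minimal sketch (assuming NumPy is available) that simulates from this density via the inverse of the distribution function $F_\theta(x) = 1 - (1+x)^{-\theta}$ and compares the MLE above with the method of moments estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 2.0
n = 100_000

# Inverse-CDF sampling: F(x) = 1 - (1+x)^(-theta), so X = (1-U)^(-1/theta) - 1.
u = rng.uniform(size=n)
x = (1.0 - u) ** (-1.0 / theta_true) - 1.0

# MLE derived above: theta_hat = 1 / (average of log(1 + X_i)).
theta_mle = 1.0 / np.mean(np.log1p(x))

# Method of moments (unreliable in general, since E X exists only for theta > 1):
theta_mom = (1.0 + x.mean()) / x.mean()

print(theta_mle, theta_mom)
```

Note that $\log(1+X)$ is exponentially distributed with parameter $\theta$ under this model, which is why the MLE behaves well for every $\theta > 0$.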
Example
Let $X \in \mathbf{R}$ and $\theta = (\mu, \sigma^2)$, with $\mu \in \mathbf{R}$ a location parameter, $\sigma > 0$ a scale parameter. We assume that the distribution function $F_\theta$ of $X$ is
$$F_\theta(\cdot) = F_0\left(\frac{\cdot - \mu}{\sigma}\right),$$
where $F_0$ is a given distribution function, with density $f_0$ w.r.t. Lebesgue measure. The density of $X$ is thus
$$p_\theta(\cdot) = \frac{1}{\sigma}\, f_0\left(\frac{\cdot - \mu}{\sigma}\right).$$
Case 1 If $F_0 = \Phi$ (the standard normal distribution), then
$$f_0(x) = \phi(x) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}x^2\right), \quad x \in \mathbf{R},$$
so that
$$p_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right), \quad x \in \mathbf{R}.$$
The MLE of $\mu$ resp. $\sigma^2$ is
$$\hat\mu = \bar X, \quad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2.$$
Case 2 The (standardized) double exponential or Laplace distribution has density
$$f_0(x) = \frac{1}{\sqrt 2}\exp\left(-\sqrt 2\,|x|\right), \quad x \in \mathbf{R},$$
so
$$p_\theta(x) = \frac{1}{\sqrt{2\sigma^2}}\exp\left(-\frac{\sqrt 2\,|x-\mu|}{\sigma}\right), \quad x \in \mathbf{R}.$$
The MLE of $\mu$ resp. $\sigma$ is now
$$\hat\mu = \text{sample median}, \quad \hat\sigma = \frac{\sqrt 2}{n}\sum_{i=1}^n |X_i - \hat\mu|.$$
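A small numerical check of the Laplace case (a sketch assuming NumPy; the data vector is an arbitrary toy sample, not from the text): a grid search over $\mu$ should land on the sample median, since minimizing $\sum_i |x_i - \mu|$ is exactly what maximizing the Laplace log-likelihood in $\mu$ amounts to.

```python
import numpy as np

# Arbitrary toy sample (hypothetical data, chosen for illustration).
x = np.array([0.3, -1.2, 2.5, 0.7, 0.1, -0.4, 1.9])

def neg_loglik(mu, sigma, x):
    # Negative log-likelihood of the Laplace location-scale model with
    # f_0(z) = exp(-sqrt(2)|z|)/sqrt(2), as above.
    n = len(x)
    return n * np.log(sigma * np.sqrt(2)) + np.sqrt(2) * np.sum(np.abs(x - mu)) / sigma

# Grid search over mu (sigma held fixed; the minimizer in mu does not depend on sigma).
grid = np.linspace(-2.0, 3.0, 5001)
mu_hat = grid[np.argmin([neg_loglik(m, 1.0, x) for m in grid])]

# Scale MLE from the formula above, plugging in the median.
sigma_hat = np.sqrt(2) * np.mean(np.abs(x - np.median(x)))

print(mu_hat, np.median(x), sigma_hat)
```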
Example
Here is a famous example, from Kiefer and Wolfowitz (1956), where the likelihood is unbounded, and hence the MLE does not exist. It concerns the case of a mixture of two normals: each observation is either $\mathcal{N}(\mu, 1)$-distributed or $\mathcal{N}(\mu, \sigma^2)$-distributed, each with probability $1/2$ (say). The unknown parameter is $\theta = (\mu, \sigma^2)$, and $X$ has density
$$p_\theta(x) = \frac{1}{2}\phi(x - \mu) + \frac{1}{2\sigma}\phi\left(\frac{x-\mu}{\sigma}\right), \quad x \in \mathbf{R},$$
w.r.t. Lebesgue measure. Then
$$L_X(\mu, \sigma^2) = \prod_{i=1}^n \left(\frac{1}{2}\phi(X_i - \mu) + \frac{1}{2\sigma}\phi\left(\frac{X_i-\mu}{\sigma}\right)\right).$$
Taking $\mu = X_1$ yields
$$L_X(X_1, \sigma^2) = \frac{1}{\sqrt{2\pi}}\left(\frac{1}{2} + \frac{1}{2\sigma}\right)\prod_{i=2}^n \left(\frac{1}{2}\phi(X_i - X_1) + \frac{1}{2\sigma}\phi\left(\frac{X_i-X_1}{\sigma}\right)\right).$$
Now, since for all $z \ne 0$,
$$\lim_{\sigma \downarrow 0}\frac{1}{\sigma}\phi(z/\sigma) = 0,$$
we have
$$\lim_{\sigma \downarrow 0}\prod_{i=2}^n \left(\frac{1}{2}\phi(X_i - X_1) + \frac{1}{2\sigma}\phi\left(\frac{X_i-X_1}{\sigma}\right)\right) = \prod_{i=2}^n \frac{1}{2}\phi(X_i - X_1) > 0.$$
It follows that
$$\lim_{\sigma \downarrow 0} L_X(X_1, \sigma^2) = \infty.$$
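The divergence is easy to see numerically. The sketch below (assuming NumPy; the data are an arbitrary fixed toy sample) evaluates the mixture log-likelihood at $\mu = X_1$ for ever smaller $\sigma$ and shows it growing without bound:

```python
import numpy as np

def log_lik(mu, sigma, x):
    # log of prod_i [ phi(x_i - mu)/2 + phi((x_i - mu)/sigma)/(2 sigma) ]
    phi = lambda z: np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return np.sum(np.log(0.5 * phi(x - mu) + 0.5 * phi((x - mu) / sigma) / sigma))

# Fixed toy sample; any sample with distinct values exhibits the same blow-up.
x = np.array([0.0, 0.8, -1.1, 0.5, 1.7, -0.3, 2.2, -0.9, 0.4, 1.3])

# Put mu at X_1 and let sigma shrink: the single i = 1 term contributes
# about log(1/(2 sigma)) + const, which dominates everything else.
vals = [log_lik(x[0], s, x) for s in (1.0, 1e-4, 1e-8)]
print(vals)
```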
Asymptotic tests and confidence intervals based on the likelihood
Suppose that $\Theta$ is an open subset of $\mathbf{R}^p$. Define the log-likelihood ratio
$$Z(X, \vartheta) := 2\left(\log L_X(\hat\theta) - \log L_X(\vartheta)\right).$$
Note that $Z(X, \vartheta) \ge 0$, as $\hat\theta$ maximizes the (log-)likelihood. We will see in Chapter 6 that, under some regularity conditions,
$$Z(X, \theta) \stackrel{\mathcal{D}_\theta}{\longrightarrow} \chi^2_p, \quad \forall\, \theta.$$
Here, $\stackrel{\mathcal{D}_\theta}{\longrightarrow}$ means convergence in distribution under $\mathrm{IP}_\theta$, and $\chi^2_p$ denotes the chi-squared distribution with $p$ degrees of freedom.
Thus, $Z(X, \theta)$ is an asymptotic pivot. For the null hypothesis
$$H_0 : \theta = \theta_0,$$
a test at asymptotic level $\alpha$ is: reject $H_0$ if $Z(X, \theta_0) > \chi^2_p(1-\alpha)$, where $\chi^2_p(1-\alpha)$ is the $(1-\alpha)$-quantile of the $\chi^2_p$-distribution. An asymptotic $(1-\alpha)$-confidence set for $\theta$ is
$$\{\vartheta : Z(X, \vartheta) \le \chi^2_p(1-\alpha)\} = \{\vartheta : 2\log L_X(\hat\theta) \le 2\log L_X(\vartheta) + \chi^2_p(1-\alpha)\}.$$
Example
Here is a toy example. Let $X_1, \ldots, X_n$ be i.i.d. and $\mathcal{N}(\mu, 1)$-distributed, with $\mu \in \mathbf{R}$ unknown. The MLE of $\mu$ is the sample average $\hat\mu = \bar X$. It holds that
$$\log L_X(\hat\mu) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^n (X_i - \bar X)^2,$$
and
$$2\left(\log L_X(\hat\mu) - \log L_X(\mu)\right) = n(\bar X - \mu)^2.$$
The random variable $\sqrt n(\bar X - \mu)$ is $\mathcal{N}(0, 1)$-distributed under $\mathrm{IP}_\mu$. So its square, $n(\bar X - \mu)^2$, has a $\chi^2_1$-distribution. Thus, in this case the above test (confidence interval) is exact.
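The exactness of the pivot in this toy example can be checked by simulation (a sketch assuming NumPy; the $\chi^2_1$ critical value $\approx 3.8415$ is hard-coded to avoid a SciPy dependency): under $H_0$ the rejection rate of the level-$0.05$ likelihood-ratio test should be $5\%$ already for small $n$.

```python
import numpy as np

rng = np.random.default_rng(2)
mu0, n, reps = 1.5, 30, 20_000
crit = 3.8414588  # 0.95-quantile of the chi-squared distribution with 1 df

# Under H_0: mu = mu0, the log-likelihood ratio Z(X, mu0) = n (Xbar - mu0)^2
# is exactly chi^2_1, so the test rejects with probability (about) 5%.
xbar = rng.normal(mu0, 1.0, size=(reps, n)).mean(axis=1)
z = n * (xbar - mu0) ** 2
level = np.mean(z > crit)
print(level)
```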
Chapter 2
Decision theory
Notation
In this chapter, we denote the observable random variable (the data) by $X \in \mathcal{X}$, and its distribution by $P \in \mathcal{P}$. The probability model is $\mathcal{P} := \{P_\theta : \theta \in \Theta\}$, with $\theta$ an unknown parameter. In particular cases, we apply the results with $X$ being replaced by a vector $X = (X_1, \ldots, X_n)$, with $X_1, \ldots, X_n$ i.i.d. with distribution $P \in \{P_\theta : \theta \in \Theta\}$ (so that $X$ has distribution $\mathrm{IP} = \mathrm{IP}_\theta := \prod_{i=1}^n P_\theta$, $\theta \in \Theta$).
2.1 Decisions and their risk
Let $\mathcal{A}$ be the action space.
$\mathcal{A} = \mathbf{R}$ corresponds to estimating a real-valued parameter.
$\mathcal{A} = \{0, 1\}$ corresponds to testing a hypothesis.
$\mathcal{A} = [0, 1]$ corresponds to randomized tests.
$\mathcal{A} = \{\text{intervals}\}$ corresponds to confidence intervals.
Given the observation $X$, we decide to take a certain action in $\mathcal{A}$. Thus, an action is a map $d : \mathcal{X} \to \mathcal{A}$, with $d(X)$ being the decision taken.
A loss function (Verlustfunktion) is a map
$$L : \Theta \times \mathcal{A} \to \mathbf{R},$$
with $L(\theta, a)$ being the loss when the parameter value is $\theta$ and one takes action $a$.
The risk of decision $d(X)$ is defined as
$$R(\theta, d) := E_\theta L(\theta, d(X)), \quad \theta \in \Theta.$$
Example 2.1.1 In the case of estimating a parameter of interest $g(\theta) \in \mathbf{R}$, the action space is $\mathcal{A} = \mathbf{R}$ (or a subset thereof). Important loss functions are then
$$L(\theta, a) := w(\theta)|g(\theta) - a|^r,$$
where $w(\cdot)$ are given non-negative weights and $r \ge 0$ is a given power. The risk is then
$$R(\theta, d) = w(\theta)\, E_\theta |g(\theta) - d(X)|^r.$$
A special case is taking $w \equiv 1$ and $r = 2$. Then
$$R(\theta, d) = E_\theta |g(\theta) - d(X)|^2$$
is called the mean square error.
Example 2.1.2 Consider testing the hypothesis
$$H_0 : \theta \in \Theta_0$$
against the alternative
$$H_1 : \theta \in \Theta_1.$$
Here, $\Theta_0$ and $\Theta_1$ are given subsets of $\Theta$ with $\Theta_0 \cap \Theta_1 = \emptyset$. As action space, we take $\mathcal{A} = \{0, 1\}$, and as loss
$$L(\theta, a) := \begin{cases} 1 & \text{if } \theta \in \Theta_0 \text{ and } a = 1 \\ c & \text{if } \theta \in \Theta_1 \text{ and } a = 0 \\ 0 & \text{otherwise} \end{cases}.$$
Here $c > 0$ is some given constant. Then
$$R(\theta, d) = \begin{cases} P_\theta(d(X) = 1) & \text{if } \theta \in \Theta_0 \\ c\, P_\theta(d(X) = 0) & \text{if } \theta \in \Theta_1 \\ 0 & \text{otherwise} \end{cases}.$$
Thus, the risks correspond to the error probabilities (type I and type II errors).
Note
The best decision $d$ is the one with the smallest risk $R(\theta, d)$. However, $\theta$ is not known. Thus, if we compare two decision functions $d_1$ and $d_2$, we may run into problems because the risks are not comparable: $R(\theta, d_1)$ may be smaller than $R(\theta, d_2)$ for some values of $\theta$, and larger than $R(\theta, d_2)$ for other values of $\theta$.
Example 2.1.3 Let $X \in \mathbf{R}$ and let $g(\theta) = E_\theta X := \mu$. We take quadratic loss
$$L(\theta, a) := |\mu - a|^2.$$
Assume that $\mathrm{var}_\theta(X) = 1$ for all $\theta$. Consider the collection of decisions
$$d_\lambda(X) := \lambda X,$$
where $0 \le \lambda \le 1$. Then
$$R(\theta, d_\lambda) = \mathrm{var}_\theta(\lambda X) + \mathrm{bias}^2_\theta(\lambda X) = \lambda^2 + (\lambda - 1)^2\mu^2.$$
The optimal choice for $\lambda$ would be
$$\lambda_{\mathrm{opt}} := \frac{\mu^2}{1 + \mu^2},$$
because this value minimizes $R(\theta, d_\lambda)$. However, $\lambda_{\mathrm{opt}}$ depends on the unknown $\mu$, so $d_{\lambda_{\mathrm{opt}}}(X)$ is not an estimator.
Various optimality concepts
We will consider three optimality concepts: admissibility (Zulässigkeit), minimax and Bayes.
2.2 Admissibility
Definition A decision $d'$ is called strictly better than $d$ if
$$R(\theta, d') \le R(\theta, d), \quad \forall\, \theta,$$
and
$$\exists\, \theta : R(\theta, d') < R(\theta, d).$$
When there exists a $d'$ that is strictly better than $d$, then $d$ is called inadmissible.
Example 2.2.1 Let, for $n \ge 2$, $X_1, \ldots, X_n$ be i.i.d., with $g(\theta) := E_\theta(X_i) := \mu$, and $\mathrm{var}_\theta(X_i) = 1$ (for all $i$). Take quadratic loss $L(\theta, a) := |\mu - a|^2$. Consider $d'(X_1, \ldots, X_n) := \bar X_n$ and $d(X_1, \ldots, X_n) := X_1$. Then, $\forall\, \theta$,
$$R(\theta, d') = \frac{1}{n}, \quad R(\theta, d) = 1,$$
so that $d$ is inadmissible.
Note
We note that to show that a decision $d$ is inadmissible, it suffices to find a strictly better $d'$. On the other hand, to show that $d$ is admissible, one has to verify that there is no strictly better $d'$. So in principle, one then has to take all possible $d'$ into account.
Example 2.2.2 Let $L(\theta, a) := |g(\theta) - a|^r$ and $d(X) := g(\theta_0)$, where $\theta_0$ is some fixed given value.
Lemma Assume that $P_{\theta_0}$ dominates $P_\theta$¹ for all $\theta$. Then $d$ is admissible.
Proof.
¹ Let $P$ and $Q$ be probability measures on the same measurable space. Then $P$ dominates $Q$ if for all measurable $B$, $P(B) = 0$ implies $Q(B) = 0$ ($Q$ is absolut stetig bezüglich $P$).
Suppose that $d'$ is better than $d$. Then $R(\theta_0, d') \le R(\theta_0, d) = 0$, i.e.,
$$E_{\theta_0}|g(\theta_0) - d'(X)|^r = 0.$$
This implies that
$$d'(X) = g(\theta_0), \quad P_{\theta_0}\text{-almost surely}. \quad (2.1)$$
Since by (2.1),
$$P_{\theta_0}(d'(X) \ne g(\theta_0)) = 0,$$
the assumption that $P_{\theta_0}$ dominates $P_\theta$, $\forall\, \theta$, implies now
$$P_\theta(d'(X) \ne g(\theta_0)) = 0, \quad \forall\, \theta.$$
That is, for all $\theta$, $d'(X) = g(\theta_0)$, $P_\theta$-almost surely, and hence, for all $\theta$, $R(\theta, d') = R(\theta, d)$. So $d'$ is not strictly better than $d$. We conclude that $d$ is admissible. $\Box$
Example 2.2.3 We consider testing
$$H_0 : \theta = \theta_0$$
against the alternative
$$H_1 : \theta = \theta_1.$$
We let $\mathcal{A} = [0, 1]$ and let $d := \phi$ be a randomized test. As risk, we take
$$R(\theta, \phi) := \begin{cases} E_\theta \phi(X), & \theta = \theta_0 \\ 1 - E_\theta \phi(X), & \theta = \theta_1 \end{cases}.$$
We let $p_0$ ($p_1$) be the density of $P_{\theta_0}$ ($P_{\theta_1}$) with respect to some dominating measure $\nu$ (for example $\nu = P_{\theta_0} + P_{\theta_1}$). A Neyman-Pearson test is
$$\phi_{\mathrm{NP}} := \begin{cases} 1 & \text{if } p_1/p_0 > c \\ q & \text{if } p_1/p_0 = c \\ 0 & \text{if } p_1/p_0 < c \end{cases}.$$
Here $0 \le q \le 1$, and $0 \le c < \infty$ are given constants. To check whether $\phi_{\mathrm{NP}}$ is admissible, we first recall the Neyman-Pearson Lemma.
Neyman-Pearson Lemma Let $\phi$ be some test. We have
$$R(\theta_1, \phi_{\mathrm{NP}}) - R(\theta_1, \phi) \le c\left[R(\theta_0, \phi) - R(\theta_0, \phi_{\mathrm{NP}})\right].$$
Proof.
$$R(\theta_1, \phi_{\mathrm{NP}}) - R(\theta_1, \phi) = \int (\phi - \phi_{\mathrm{NP}})\, p_1$$
$$= \int_{p_1/p_0 > c} (\phi - \phi_{\mathrm{NP}})\, p_1 + \int_{p_1/p_0 = c} (\phi - \phi_{\mathrm{NP}})\, p_1 + \int_{p_1/p_0 < c} (\phi - \phi_{\mathrm{NP}})\, p_1$$
$$\le c\int_{p_1/p_0 > c} (\phi - \phi_{\mathrm{NP}})\, p_0 + c\int_{p_1/p_0 = c} (\phi - \phi_{\mathrm{NP}})\, p_0 + c\int_{p_1/p_0 < c} (\phi - \phi_{\mathrm{NP}})\, p_0$$
$$= c\left[R(\theta_0, \phi) - R(\theta_0, \phi_{\mathrm{NP}})\right]. \quad \Box$$
Lemma A Neyman-Pearson test is admissible if and only if one of the following two cases holds:
i) its power is strictly less than 1, or
ii) it has minimal level among all tests with power 1.
Proof. Suppose $R(\theta_0, \phi) < R(\theta_0, \phi_{\mathrm{NP}})$. Then from the Neyman-Pearson Lemma, we know that either $R(\theta_1, \phi) > R(\theta_1, \phi_{\mathrm{NP}})$ (i.e., then $\phi$ is not better than $\phi_{\mathrm{NP}}$), or $c = 0$. But when $c = 0$, it holds that $R(\theta_1, \phi_{\mathrm{NP}}) = 0$, i.e. then $\phi_{\mathrm{NP}}$ has power one.
Similarly, suppose that $R(\theta_1, \phi) < R(\theta_1, \phi_{\mathrm{NP}})$. Then it follows from the Neyman-Pearson Lemma that $R(\theta_0, \phi) > R(\theta_0, \phi_{\mathrm{NP}})$, because we assume $c < \infty$. $\Box$
2.3 Minimaxity
Definition A decision $d$ is called minimax if
$$\sup_\theta R(\theta, d) = \inf_{d'}\sup_\theta R(\theta, d').$$
Thus, the minimax criterion concerns the best decision in the worst possible case.
Lemma A Neyman-Pearson test $\phi_{\mathrm{NP}}$ is minimax, if and only if $R(\theta_0, \phi_{\mathrm{NP}}) = R(\theta_1, \phi_{\mathrm{NP}})$.
Proof. Let $\phi$ be a test, and write for $j = 0, 1$,
$$r_j := R(\theta_j, \phi_{\mathrm{NP}}), \quad r'_j := R(\theta_j, \phi).$$
Suppose that $r_0 = r_1$ and that $\phi_{\mathrm{NP}}$ is not minimax. Then, for some test $\phi$,
$$\max_j r'_j < \max_j r_j.$$
This implies that both
$$r'_0 < r_0, \quad r'_1 < r_1,$$
and by the Neyman-Pearson Lemma, this is not possible.
Let $S := \{(R(\theta_0, \phi), R(\theta_1, \phi)) : \phi : \mathcal{X} \to [0, 1]\}$. Note that $S$ is convex. Thus, if $r_0 < r_1$, we can find a test $\phi$ with $r_0 < r'_0 < r_1$ and $r'_1 < r_1$. So then $\phi_{\mathrm{NP}}$ is not minimax. Similarly if $r_0 > r_1$. $\Box$
2.4 Bayes decisions
Suppose the parameter space $\Theta$ is a measurable space. We can then equip it with a probability measure $\Pi$. We call $\Pi$ the a priori distribution.
Definition The Bayes risk (with respect to the probability measure $\Pi$) is
$$r(\Pi, d) := \int_\Theta R(\vartheta, d)\, d\Pi(\vartheta).$$
A decision $d$ is called Bayes (with respect to $\Pi$) if
$$r(\Pi, d) = \inf_{d'} r(\Pi, d').$$
If $\Pi$ has density $w := d\Pi/d\mu$ with respect to some dominating measure $\mu$, we may write
$$r(\Pi, d) = \int_\Theta R(\vartheta, d)\, w(\vartheta)\, d\mu(\vartheta) := r_w(d).$$
Thus, the Bayes risk may be thought of as taking a weighted average of the risks. For example, one may want to assign more weight to important values of $\theta$.
Example 2.4.1 Consider again the testing problem
$$H_0 : \theta = \theta_0 \quad \text{against the alternative} \quad H_1 : \theta = \theta_1.$$
Let
$$r_w(\phi) := w_0 R(\theta_0, \phi) + w_1 R(\theta_1, \phi).$$
We take $0 < w_0 = 1 - w_1 < 1$.
Lemma Bayes test is
$$\phi_{\mathrm{Bayes}} = \begin{cases} 1 & \text{if } p_1/p_0 > w_0/w_1 \\ q & \text{if } p_1/p_0 = w_0/w_1 \\ 0 & \text{if } p_1/p_0 < w_0/w_1 \end{cases}.$$
Proof.
$$r_w(\phi) = w_0\int \phi\, p_0 + w_1\left(1 - \int \phi\, p_1\right) = \int \phi\,(w_0 p_0 - w_1 p_1) + w_1.$$
So we choose $\phi \in [0, 1]$ to minimize $\phi(w_0 p_0 - w_1 p_1)$. This is done by taking
$$\phi = \begin{cases} 1 & \text{if } w_0 p_0 - w_1 p_1 < 0 \\ q & \text{if } w_0 p_0 - w_1 p_1 = 0 \\ 0 & \text{if } w_0 p_0 - w_1 p_1 > 0 \end{cases},$$
where for $q$ we may take any value between 0 and 1. $\Box$
Note that
$$2 r_w(\phi_{\mathrm{Bayes}}) = 1 - \int |w_1 p_1 - w_0 p_0|.$$
In particular, when $w_0 = w_1 = 1/2$,
$$2 r_w(\phi_{\mathrm{Bayes}}) = 1 - \int |p_1 - p_0|/2,$$
i.e., the risk is large if the two densities are close to each other.
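On a finite sample space (counting measure as dominating measure) the identity above can be checked by direct enumeration. A minimal sketch, with two arbitrary illustrative densities $p_0, p_1$ and $w_0 = w_1 = 1/2$:

```python
import numpy as np

# Two densities on a three-point sample space (w.r.t. counting measure).
p0 = np.array([0.5, 0.3, 0.2])
p1 = np.array([0.1, 0.3, 0.6])
w0 = w1 = 0.5

# Bayes test: phi = 1 where w1*p1 > w0*p0 (on ties, any value of q gives the
# same Bayes risk, since there w0*p0 = w1*p1).
phi = (w1 * p1 > w0 * p0).astype(float)
r_bayes = w0 * np.sum(phi * p0) + w1 * (1.0 - np.sum(phi * p1))

# Identity derived above: 2 r_w(phi_Bayes) = 1 - sum |p1 - p0| / 2.
rhs = 0.5 * (1.0 - np.sum(np.abs(p1 - p0)) / 2.0)
print(r_bayes, rhs)
```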
2.5 Intermezzo: conditional distributions
Recall the definition of conditional probabilities: for two sets $A$ and $B$, with $P(B) \ne 0$, the conditional probability of $A$ given $B$ is defined as
$$P(A|B) = \frac{P(A \cap B)}{P(B)}.$$
It follows that
$$P(B|A) = P(A|B)\frac{P(B)}{P(A)},$$
and that, for a partition $\{B_j\}$²,
$$P(A) = \sum_j P(A|B_j) P(B_j).$$
Consider now two random vectors $X \in \mathbf{R}^n$ and $Y \in \mathbf{R}^m$. Let $f_{X,Y}(\cdot, \cdot)$ be the density of $(X, Y)$ with respect to Lebesgue measure (assumed to exist). The marginal density of $X$ is
$$f_X(\cdot) = \int f_{X,Y}(\cdot, y)\, dy,$$
and the marginal density of $Y$ is
$$f_Y(\cdot) = \int f_{X,Y}(x, \cdot)\, dx.$$
Definition The conditional density of $X$ given $Y = y$ is
$$f_X(x|y) := \frac{f_{X,Y}(x, y)}{f_Y(y)}, \quad x \in \mathbf{R}^n.$$
² $\{B_j\}$ is a partition if $B_j \cap B_k = \emptyset$ for all $j \ne k$ and $P(\cup_j B_j) = 1$.
Thus, we have
$$f_Y(y|x) = f_X(x|y)\frac{f_Y(y)}{f_X(x)}, \quad (x, y) \in \mathbf{R}^{n+m},$$
and
$$f_X(x) = \int f_X(x|y) f_Y(y)\, dy, \quad x \in \mathbf{R}^n.$$
Definition The conditional expectation of $g(X, Y)$ given $Y = y$ is
$$E[g(X, Y)|Y = y] := \int f_X(x|y) g(x, y)\, dx.$$
Note thus that
$$E[g_1(X) g_2(Y)|Y = y] = g_2(y) E[g_1(X)|Y = y].$$
Notation We define the random variable $E[g(X, Y)|Y]$ as
$$E[g(X, Y)|Y] := h(Y),$$
where $h(y)$ is the function $h(y) := E[g(X, Y)|Y = y]$.
Lemma 2.5.1 (Iterated expectations lemma) It holds that
$$E\big[E[g(X, Y)|Y]\big] = E g(X, Y).$$
Proof. Define
$$h(y) := E[g(X, Y)|Y = y].$$
Then
$$E h(Y) = \int h(y) f_Y(y)\, dy = \int E[g(X, Y)|Y = y] f_Y(y)\, dy = \iint g(x, y) f_{X,Y}(x, y)\, dx\, dy = E g(X, Y). \quad \Box$$
2.6 Bayes methods
Let $X$ have distribution $P \in \mathcal{P} := \{P_\theta : \theta \in \Theta\}$. Suppose $\mathcal{P}$ is dominated by a ($\sigma$-finite) measure $\nu$, and let $p_\theta = dP_\theta/d\nu$ denote the densities. Let $\Pi$ be an a priori distribution on $\Theta$, with density $w := d\Pi/d\mu$. We now think of $p_\theta$ as the density of $X$ given the value of $\theta$. We write it as
$$p_\vartheta(x) = p(x|\vartheta), \quad x \in \mathcal{X}.$$
Moreover, we define
$$p(\cdot) := \int_\Theta p(\cdot|\vartheta) w(\vartheta)\, d\mu(\vartheta).$$
Definition The a posteriori density of $\theta$ is
$$w(\vartheta|x) = p(x|\vartheta)\frac{w(\vartheta)}{p(x)}, \quad \vartheta \in \Theta, \; x \in \mathcal{X}.$$
Lemma 2.6.1 Given the data $X = x$, consider $\theta$ as a random variable with density $w(\vartheta|x)$. Let
$$l(x, a) := E[L(\theta, a)|X = x] := \int_\Theta L(\vartheta, a) w(\vartheta|x)\, d\mu(\vartheta),$$
and
$$d(x) := \arg\min_a l(x, a).$$
Then $d$ is Bayes decision $d_{\mathrm{Bayes}}$.
Proof.
$$r_w(d') = \int_\Theta R(\vartheta, d') w(\vartheta)\, d\mu(\vartheta)$$
$$= \int_\Theta \left(\int_{\mathcal{X}} L(\vartheta, d'(x)) p(x|\vartheta)\, d\nu(x)\right) w(\vartheta)\, d\mu(\vartheta)$$
$$= \int_{\mathcal{X}} \left(\int_\Theta L(\vartheta, d'(x)) w(\vartheta|x)\, d\mu(\vartheta)\right) p(x)\, d\nu(x)$$
$$= \int_{\mathcal{X}} l(x, d'(x)) p(x)\, d\nu(x) \ge \int_{\mathcal{X}} l(x, d(x)) p(x)\, d\nu(x) = r_w(d). \quad \Box$$
Example 2.6.1 For the testing problem
$$H_0 : \theta = \theta_0 \quad \text{against the alternative} \quad H_1 : \theta = \theta_1,$$
we have
$$l(x, \phi) = \phi\, w_0 p_0(x)/p(x) + (1 - \phi)\, w_1 p_1(x)/p(x).$$
Thus,
$$\arg\min_\phi l(\cdot, \phi) = \begin{cases} 1 & \text{if } w_1 p_1 > w_0 p_0 \\ q & \text{if } w_1 p_1 = w_0 p_0 \\ 0 & \text{if } w_1 p_1 < w_0 p_0 \end{cases}.$$
In the next example, we shall use:
Lemma Let $Z$ be a real-valued random variable. Then
$$\arg\min_{a \in \mathbf{R}} E(Z - a)^2 = EZ.$$
Proof.
$$E(Z - a)^2 = \mathrm{var}(Z) + (a - EZ)^2. \quad \Box$$
Example 2.6.2 Consider the case $\mathcal{A} = \mathbf{R}$ and $\Theta \subseteq \mathbf{R}$. Let $L(\theta, a) := |\theta - a|^2$. Then
$$d_{\mathrm{Bayes}}(X) = E(\theta|X).$$
Example 2.6.3 Consider again the case $\Theta \subseteq \mathbf{R}$, and $\mathcal{A} = \Theta$, and now with loss function $L(\theta, a) := 1\{|\theta - a| > c\}$ for a given constant $c > 0$. Then
$$l(x, a) = \Pi(|\theta - a| > c\,|X = x) = \int_{|\vartheta - a| > c} w(\vartheta|x)\, d\vartheta.$$
We note that for $c \downarrow 0$,
$$\frac{1 - l(x, a)}{2c} = \frac{\Pi(|\theta - a| \le c\,|X = x)}{2c} \approx w(a|x) = p(x|a)\frac{w(a)}{p(x)}.$$
Thus, for $c$ small, Bayes rule is approximately $d_0(x) := \arg\max_a p(x|a) w(a)$. The estimator $d_0(X)$ is called the maximum a posteriori estimator. If $w$ is the uniform density on $\Theta$ (which only exists if $\Theta$ is bounded), then $d_0(X)$ is the maximum likelihood estimator.
Example 2.6.4 Suppose that given $\theta$, $X$ has the Poisson distribution with parameter $\theta$, and that $\theta$ has the Gamma$(k, \lambda)$-distribution. The density of $\theta$ is then
$$w(\vartheta) = \lambda^k \vartheta^{k-1} e^{-\lambda\vartheta}/\Gamma(k),$$
where
$$\Gamma(k) = \int_0^\infty e^{-z} z^{k-1}\, dz.$$
The Gamma$(k, \lambda)$ distribution has mean
$$E\theta = \int_0^\infty \vartheta\, w(\vartheta)\, d\vartheta = \frac{k}{\lambda}.$$
The a posteriori density is then
$$w(\vartheta|x) = p(x|\vartheta)\frac{w(\vartheta)}{p(x)} = e^{-\vartheta}\frac{\vartheta^x}{x!}\cdot\frac{\lambda^k\vartheta^{k-1}e^{-\lambda\vartheta}/\Gamma(k)}{p(x)} = e^{-(1+\lambda)\vartheta}\vartheta^{k+x-1} c(x, k, \lambda),$$
where $c(x, k, \lambda)$ is such that $\int w(\vartheta|x)\, d\vartheta = 1$. We recognize $w(\vartheta|x)$ as the density of the Gamma$(k + x, 1 + \lambda)$-distribution. Bayes estimator with quadratic loss is thus
$$E(\theta|X) = \frac{k + X}{1 + \lambda}.$$
The maximum a posteriori estimator is
$$\frac{k + X - 1}{1 + \lambda}.$$
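The conjugacy claim can be verified numerically by integrating the unnormalized posterior on a fine grid. A sketch (assuming NumPy; $k$, $\lambda$ and $x$ are arbitrary illustrative values):

```python
import numpy as np

k, lam, x = 3.0, 2.0, 7  # Gamma(k, lambda) prior, observed Poisson count x

# Unnormalized posterior from the computation above: e^{-(1+lambda) t} t^{k+x-1}.
t = np.linspace(1e-6, 40.0, 400_001)   # fine grid; Gamma(10, 3) mass beyond 40 is negligible
w = np.exp(-(1.0 + lam) * t) * t ** (k + x - 1.0)

# Riemann-sum posterior mean and mode vs. the closed forms derived above.
post_mean = np.sum(t * w) / np.sum(w)
post_mode = t[np.argmax(w)]

print(post_mean, (k + x) / (1.0 + lam))       # Gamma(k+x, 1+lambda) mean
print(post_mode, (k + x - 1.0) / (1.0 + lam)) # MAP estimator
```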
Example 2.6.5 Suppose given $\theta$, $X$ has the Binomial$(n, \theta)$-distribution, and that $\theta$ is uniformly distributed on $[0, 1]$. Then
$$w(\vartheta|x) = \binom{n}{x}\vartheta^x(1-\vartheta)^{n-x}/p(x).$$
This is the density of the Beta$(x+1, n-x+1)$-distribution. Thus, with quadratic loss, Bayes estimator is
$$E(\theta|X) = \frac{X + 1}{n + 2}.$$
(... To be extended to general Beta prior.)
2.7 Discussion of Bayesian approach (to be written)
2.8 Integrating parameters out (to be written)
2.9 Intermezzo: some distribution theory
2.9.1 The multinomial distribution
In a survey, people were asked their opinion about some political issue. Let $X$ be the number of "yes"-answers, $Y$ be the number of "no" and $Z$ be the number of "perhaps". The total number of people in the survey is $n = X + Y + Z$. We consider the votes as a sample with replacement, with $p_1 = P(\text{yes})$, $p_2 = P(\text{no})$, and $p_3 = P(\text{perhaps})$, $p_1 + p_2 + p_3 = 1$. Then
$$P(X = x, Y = y, Z = z) = \binom{n}{x\; y\; z} p_1^x p_2^y p_3^z, \quad (x, y, z) \in \{0, \ldots, n\}^3, \; x + y + z = n.$$
Here
$$\binom{n}{x\; y\; z} := \frac{n!}{x!\, y!\, z!}.$$
It is called a multinomial coefficient.
Lemma The marginal distribution of $X$ is the Binomial$(n, p_1)$-distribution.
Proof. For $x \in \{0, \ldots, n\}$, we have
$$P(X = x) = \sum_{y=0}^{n-x} P(X = x, Y = y, Z = n - x - y)$$
$$= \sum_{y=0}^{n-x} \binom{n}{x\; y\; (n-x-y)} p_1^x p_2^y (1 - p_1 - p_2)^{n-x-y}$$
$$= \binom{n}{x} p_1^x \sum_{y=0}^{n-x} \binom{n-x}{y} p_2^y (1 - p_1 - p_2)^{n-x-y}$$
$$= \binom{n}{x} p_1^x (1 - p_1)^{n-x}. \quad \Box$$
Definition We say that the random vector $(N_1, \ldots, N_k)$ has the multinomial distribution with parameters $n$ and $p_1, \ldots, p_k$ (with $\sum_{j=1}^k p_j = 1$), if for all $(n_1, \ldots, n_k) \in \{0, \ldots, n\}^k$ with $n_1 + \cdots + n_k = n$, it holds that
$$P(N_1 = n_1, \ldots, N_k = n_k) = \binom{n}{n_1 \cdots n_k} p_1^{n_1} \cdots p_k^{n_k}.$$
Here
$$\binom{n}{n_1 \cdots n_k} := \frac{n!}{n_1! \cdots n_k!}.$$
Example 2.9.1 Let $X_1, \ldots, X_n$ be i.i.d. copies of a random variable $X \in \mathbf{R}$ with distribution $F$, and let $-\infty = a_0 < a_1 < \cdots < a_{k-1} < a_k = \infty$. Define, for $j = 1, \ldots, k$,
$$p_j := P(X \in (a_{j-1}, a_j]) = F(a_j) - F(a_{j-1}),$$
$$\frac{N_j}{n} := \frac{\#\{X_i \in (a_{j-1}, a_j]\}}{n} = \hat F_n(a_j) - \hat F_n(a_{j-1}).$$
Then $(N_1, \ldots, N_k)$ has the Multinomial$(n, p_1, \ldots, p_k)$-distribution.
2.9.2 The Poisson distribution
Definition A random variable $X \in \{0, 1, \ldots\}$ has the Poisson distribution with parameter $\lambda > 0$, if for all $x \in \{0, 1, \ldots\}$,
$$P(X = x) = e^{-\lambda}\frac{\lambda^x}{x!}.$$
Lemma Suppose $X$ and $Y$ are independent, and that $X$ has the Poisson$(\lambda)$-distribution, and $Y$ the Poisson$(\mu)$-distribution. Then $Z := X + Y$ has the Poisson$(\lambda + \mu)$-distribution.
Proof. For all $z \in \{0, 1, \ldots\}$, we have
$$P(Z = z) = \sum_{x=0}^z P(X = x, Y = z - x) = \sum_{x=0}^z P(X = x) P(Y = z - x)$$
$$= \sum_{x=0}^z e^{-\lambda}\frac{\lambda^x}{x!}\, e^{-\mu}\frac{\mu^{z-x}}{(z-x)!} = e^{-(\lambda+\mu)}\frac{1}{z!}\sum_{x=0}^z \binom{z}{x}\lambda^x\mu^{z-x} = e^{-(\lambda+\mu)}\frac{(\lambda + \mu)^z}{z!}. \quad \Box$$
Lemma Let $X_1, \ldots, X_n$ be independent, and (for $i = 1, \ldots, n$), let $X_i$ have the Poisson$(\lambda_i)$-distribution. Define $Z := \sum_{i=1}^n X_i$. Let $z \in \{0, 1, \ldots\}$. Then the conditional distribution of $(X_1, \ldots, X_n)$ given $Z = z$ is the multinomial distribution with parameters $z$ and $p_1, \ldots, p_n$, where
$$p_j = \frac{\lambda_j}{\sum_{i=1}^n \lambda_i}, \quad j = 1, \ldots, n.$$
Proof. First note that $Z$ is Poisson$(\lambda_+)$-distributed, with $\lambda_+ := \sum_{i=1}^n \lambda_i$. Thus, for all $(x_1, \ldots, x_n) \in \{0, 1, \ldots, z\}^n$ satisfying $\sum_{i=1}^n x_i = z$, we have
$$P(X_1 = x_1, \ldots, X_n = x_n|Z = z) = \frac{P(X_1 = x_1, \ldots, X_n = x_n)}{P(Z = z)}$$
$$= \frac{\prod_{i=1}^n \left(e^{-\lambda_i}\lambda_i^{x_i}/x_i!\right)}{e^{-\lambda_+}\lambda_+^z/z!} = \binom{z}{x_1 \cdots x_n}\left(\frac{\lambda_1}{\lambda_+}\right)^{x_1}\cdots\left(\frac{\lambda_n}{\lambda_+}\right)^{x_n}. \quad \Box$$
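The final identity in the proof can be checked exactly for any particular configuration. A minimal sketch using only the standard library ($\lambda$'s and $x$'s are arbitrary illustrative values):

```python
import math

def pois(lmbda, x):
    # Poisson(lmbda) probability mass at x.
    return math.exp(-lmbda) * lmbda**x / math.factorial(x)

lambdas = [1.0, 2.0, 3.0]
xs = [2, 1, 4]                      # a configuration with sum z = 7
z = sum(xs)
lam_plus = sum(lambdas)

# Left side: joint Poisson probability divided by P(Z = z).
lhs = math.prod(pois(l, x) for l, x in zip(lambdas, xs)) / pois(lam_plus, z)

# Right side: Multinomial(z; p_j = lambda_j / lambda_+) probability.
coef = math.factorial(z)
for x in xs:
    coef //= math.factorial(x)
rhs = coef * math.prod((l / lam_plus) ** x for l, x in zip(lambdas, xs))

print(lhs, rhs)
```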
2.9.3 The distribution of the maximum of two random variables
Let $X_1$ and $X_2$ be independent and both have distribution $F$. Suppose that $F$ has density $f$ w.r.t. Lebesgue measure. Let
$$Z := \max\{X_1, X_2\}.$$
Lemma The distribution function of $Z$ is $F^2$. Moreover, $Z$ has density
$$f_Z(z) = 2F(z)f(z), \quad z \in \mathbf{R}.$$
Proof. We have for all $z$,
$$P(Z \le z) = P(\max\{X_1, X_2\} \le z) = P(X_1 \le z, X_2 \le z) = F^2(z).$$
If $F$ has density $f$, then (Lebesgue-)almost everywhere,
$$f(z) = \frac{d}{dz}F(z).$$
So the derivative of $F^2$ exists almost everywhere, and
$$\frac{d}{dz}F^2(z) = 2F(z)f(z). \quad \Box$$
Let $X := (X_1, X_2)$. The conditional density of $X$ given $Z = z$ is
$$f_X(x_1, x_2|z) = \begin{cases} \dfrac{f(x_2)}{2F(z)} & \text{if } x_1 = z \text{ and } x_2 < z \\[1ex] \dfrac{f(x_1)}{2F(z)} & \text{if } x_1 < z \text{ and } x_2 = z \\[1ex] 0 & \text{else} \end{cases}.$$
The conditional distribution function of $X_1$ given $Z = z$ is
$$F_{X_1}(x_1|z) = \begin{cases} \dfrac{F(x_1)}{2F(z)}, & x_1 < z \\[1ex] 1, & x_1 \ge z \end{cases}.$$
Note thus that this distribution has a jump of size $1/2$ at $z$.
2.10 Sufficiency
Let $S : \mathcal{X} \to \mathcal{S}$ be some given map. We consider the statistic $S = S(X)$. Throughout, by the phrase "for all possible $s$", we mean "for all $s$ for which conditional distributions given $S = s$ are defined" (in other words: for all $s$ in the support of the distribution of $S$, which may depend on $\theta$).
Definition We call $S$ sufficient for $\theta \in \Theta$ if for all $\theta$, and all possible $s$, the conditional distribution
$$P_\theta(X \in \cdot\,|S(X) = s)$$
does not depend on $\theta$.
Example 2.10.1 Let $X_1, \ldots, X_n$ be i.i.d., and have the Bernoulli distribution with probability $\theta \in (0, 1)$ of success: (for $i = 1, \ldots, n$)
$$P_\theta(X_i = 1) = 1 - P_\theta(X_i = 0) = \theta.$$
Take $S = \sum_{i=1}^n X_i$. Then $S$ is sufficient for $\theta$: for all possible $s$,
$$\mathrm{IP}_\theta(X_1 = x_1, \ldots, X_n = x_n|S = s) = \frac{1}{\binom{n}{s}}, \quad \sum_{i=1}^n x_i = s.$$
Example 2.10.2 Let $X := (X_1, \ldots, X_n)$, with $X_1, \ldots, X_n$ i.i.d. and Poisson$(\theta)$-distributed. Take $S = \sum_{i=1}^n X_i$. Then $S$ has the Poisson$(n\theta)$-distribution. For all possible $s$, the conditional distribution of $X$ given $S = s$ is the multinomial distribution with parameters $s$ and $(p_1, \ldots, p_n) = (\frac{1}{n}, \ldots, \frac{1}{n})$:
$$\mathrm{IP}_\theta(X_1 = x_1, \ldots, X_n = x_n|S = s) = \binom{s}{x_1 \cdots x_n}\left(\frac{1}{n}\right)^s, \quad \sum_{i=1}^n x_i = s.$$
This distribution does not depend on $\theta$, so $S$ is sufficient for $\theta$.
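The Bernoulli case can be verified exactly by enumeration: given $S = s$, every configuration is equally likely, with probability $1/\binom{n}{s}$, for every value of $\theta$. A small standard-library sketch ($n$, $s$ and the $\theta$ values are arbitrary):

```python
import itertools
import math

n, s = 5, 2

def cond_prob(x, theta):
    # P_theta(X = x | S = s) = joint probability over P_theta(S = s).
    joint = theta ** sum(x) * (1 - theta) ** (n - sum(x))
    p_s = math.comb(n, s) * theta**s * (1 - theta) ** (n - s)
    return joint / p_s

# Every configuration with sum s has conditional probability 1/C(n, s),
# whatever theta is: the conditional law is free of theta.
for theta in (0.2, 0.7):
    for x in itertools.product((0, 1), repeat=n):
        if sum(x) == s:
            assert abs(cond_prob(x, theta) - 1 / math.comb(n, s)) < 1e-12
print("conditional law is uniform on {x : sum x = s}, free of theta")
```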
Example 2.10.3 Let $X_1$ and $X_2$ be independent, and both have the exponential distribution with parameter $\theta > 0$. The density of, e.g., $X_1$ is then
$$f_{X_1}(x; \theta) = \theta e^{-\theta x}, \quad x > 0.$$
Let $S = X_1 + X_2$. Verify that $S$ has density
$$f_S(s; \theta) = s\theta^2 e^{-\theta s}, \quad s > 0.$$
(This is the Gamma$(2, \theta)$-distribution.) For all possible $s$, the conditional density of $(X_1, X_2)$ given $S = s$ is thus
$$f_{X_1,X_2}(x_1, x_2|S = s) = \frac{1}{s}, \quad x_1 + x_2 = s.$$
Hence, $S$ is sufficient for $\theta$.
Example 2.10.4 Let $X_1, \ldots, X_n$ be an i.i.d. sample from a continuous distribution $F$. Then $S := (X_{(1)}, \ldots, X_{(n)})$ is sufficient for $F$: for all possible $s = (s_1, \ldots, s_n)$ ($s_1 < \ldots < s_n$), and for $(x_{q_1}, \ldots, x_{q_n}) = s$,
$$\mathrm{IP}_F\Big((X_1, \ldots, X_n) = (x_1, \ldots, x_n)\,\Big|\,(X_{(1)}, \ldots, X_{(n)}) = s\Big) = \frac{1}{n!}.$$
Example 2.10.5 Let $X_1$ and $X_2$ be independent, and both uniformly distributed on the interval $[0, \theta]$, with $\theta > 0$. Define $Z := X_1 + X_2$.
Lemma The random variable $Z$ has density
$$f_Z(z; \theta) = \begin{cases} z/\theta^2 & \text{if } 0 \le z \le \theta \\ (2\theta - z)/\theta^2 & \text{if } \theta \le z \le 2\theta \end{cases}.$$
Proof. First, assume $\theta = 1$. Then the distribution function of $Z$ is
$$F_Z(z) = \begin{cases} z^2/2, & 0 \le z \le 1 \\ 1 - (2 - z)^2/2, & 1 \le z \le 2 \end{cases}.$$
So the density is then
$$f_Z(z) = \begin{cases} z, & 0 \le z \le 1 \\ 2 - z, & 1 \le z \le 2 \end{cases}.$$
For general $\theta$, the result follows from the uniform case by the transformation $Z \mapsto \theta Z$, which maps $f_Z$ into $f_Z(\cdot/\theta)/\theta$. $\Box$
The conditional density of $(X_1, X_2)$ given $Z = z \in (0, 2\theta)$ is now
$$f_{X_1,X_2}(x_1, x_2|Z = z; \theta) = \begin{cases} 1/z, & 0 \le z \le \theta \\ 1/(2\theta - z), & \theta \le z \le 2\theta \end{cases}.$$
This depends on $\theta$, so $Z$ is not sufficient for $\theta$.
Consider now $S := \max\{X_1, X_2\}$. The conditional density of $(X_1, X_2)$ given $S = s \in (0, \theta)$ is
$$f_{X_1,X_2}(x_1, x_2|S = s) = \frac{1}{2s}, \quad 0 \le x_1 < s,\; x_2 = s \;\text{ or }\; x_1 = s,\; 0 \le x_2 < s.$$
This does not depend on $\theta$, so $S$ is sufficient for $\theta$.
2.10.1 Rao-Blackwell
Lemma 2.10.1 Suppose $S$ is sufficient for $\theta$. Let $d : \mathcal{X} \to \mathcal{A}$ be some decision. Then there is a randomized decision $\delta(S)$ that only depends on $S$, such that
$$R(\theta, \delta(S)) = R(\theta, d), \quad \forall\, \theta.$$
Proof. Let $X^*_s$ be a random variable with distribution $P(X \in \cdot\,|S = s)$. Then, by construction, for all possible $s$, the conditional distributions, given $S = s$, of $X^*_s$ and $X$ are equal. It follows that $X$ and $X^*_S$ have the same distribution. Formally, let us write $Q_\theta$ for the distribution of $S$. Then
$$P_\theta(X^*_S \in \cdot) = \int P(X^*_s \in \cdot\,|S = s)\, dQ_\theta(s) = \int P(X \in \cdot\,|S = s)\, dQ_\theta(s) = P_\theta(X \in \cdot).$$
The result of the lemma follows by taking $\delta(s) := d(X^*_s)$. $\Box$
Lemma 2.10.2 (Rao-Blackwell) Suppose that $S$ is sufficient for $\theta$. Suppose moreover that the action space $\mathcal{A} \subseteq \mathbf{R}^p$ is convex, and that for each $\theta$, the map $a \mapsto L(\theta, a)$ is convex. Let $d : \mathcal{X} \to \mathcal{A}$ be a decision, and define $d'(s) := E(d(X)|S = s)$ (assumed to exist). Then
$$R(\theta, d') \le R(\theta, d), \quad \forall\, \theta.$$
Proof. Jensen's inequality says that for a convex function $g$,
$$E(g(X)) \ge g(EX).$$
Hence, $\forall\, \theta$,
$$E\big[L(\theta, d(X))\,\big|\,S = s\big] \ge L\big(\theta, E[d(X)|S = s]\big) = L(\theta, d'(s)).$$
By the iterated expectations lemma, we arrive at
$$R(\theta, d) = E_\theta L(\theta, d(X)) = E_\theta E\big[L(\theta, d(X))\,\big|\,S\big] \ge E_\theta L(\theta, d'(S)). \quad \Box$$
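A concrete instance: in the Bernoulli model of Example 2.10.1, the crude unbiased estimator $d(X) = X_1$ conditions on the sufficient statistic to $d'(S) = E(X_1|S) = S/n$, and its mean square error drops by a factor $n$. A simulation sketch (assuming NumPy; $\theta$, $n$ and the replication count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 0.3, 10, 50_000

x = (rng.uniform(size=(reps, n)) < theta).astype(float)
d = x[:, 0]                 # crude unbiased estimator: the first observation
s = x.sum(axis=1)
d_rb = s / n                # E(X_1 | S) = S/n: the Rao-Blackwellized version

mse_d = np.mean((d - theta) ** 2)      # should be about theta(1-theta) = 0.21
mse_rb = np.mean((d_rb - theta) ** 2)  # should be about theta(1-theta)/n = 0.021
print(mse_d, mse_rb)
```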
2.10.2 Factorization Theorem of Neyman
Theorem 2.10.1 (Factorization Theorem of Neyman) Suppose $\{P_\theta : \theta \in \Theta\}$ is dominated by a $\sigma$-finite measure $\nu$. Let $p_\theta := dP_\theta/d\nu$ denote the densities. Then $S$ is sufficient for $\theta$ if and only if one can write $p_\theta$ in the form
$$p_\theta(x) = g_\theta(S(x)) h(x), \quad \forall\, x, \theta.$$
Proof in the discrete case. Suppose $X$ takes only the values $a_1, a_2, \ldots$ (so we may take $\nu$ to be the counting measure). Let $Q_\theta$ be the distribution of $S$:
$$Q_\theta(s) := \sum_{j:\, S(a_j) = s} P_\theta(X = a_j).$$
The conditional distribution of $X$ given $S$ is
$$P_\theta(X = x|S = s) = \frac{P_\theta(X = x)}{Q_\theta(s)}, \quad S(x) = s.$$
($\Rightarrow$) If $S$ is sufficient for $\theta$, the above does not depend on $\theta$, but is only a function of $x$, say $h(x)$. So we may write for $S(x) = s$,
$$P_\theta(X = x) = P_\theta(X = x|S = s) Q_\theta(S = s) = h(x) g_\theta(s),$$
with $g_\theta(s) = Q_\theta(S = s)$.
($\Leftarrow$) Inserting $p_\theta(x) = g_\theta(S(x)) h(x)$, we find
$$Q_\theta(s) = g_\theta(s)\sum_{j:\, S(a_j) = s} h(a_j).$$
This gives in the formula for $P_\theta(X = x|S = s)$,
$$P_\theta(X = x|S = s) = \frac{h(x)}{\sum_{j:\, S(a_j) = s} h(a_j)},$$
which does not depend on $\theta$. $\Box$
Remark The proof for the general case is along the same lines, but does have some subtle elements!
Corollary 2.10.1 The likelihood is $L_X(\theta) = p_\theta(X) = g_\theta(S) h(X)$. Hence, the maximum likelihood estimator $\hat\theta = \arg\max_\theta L_X(\theta) = \arg\max_\theta g_\theta(S)$ depends only on the sufficient statistic $S$.
Corollary 2.10.2 The Bayes decision is
$$d_{\mathrm{Bayes}}(X) = \arg\min_{a \in \mathcal{A}} l(X, a),$$
where
$$l(x, a) = E(L(\theta, a)|X = x) = \int L(\vartheta, a) w(\vartheta|x)\, d\mu(\vartheta) = \int L(\vartheta, a) g_\vartheta(S(x)) w(\vartheta)\, d\mu(\vartheta)\, h(x)/p(x).$$
So
$$d_{\mathrm{Bayes}}(X) = \arg\min_{a \in \mathcal{A}} \int L(\vartheta, a) g_\vartheta(S) w(\vartheta)\, d\mu(\vartheta),$$
which only depends on the sufficient statistic $S$.
Example 2.10.6 Let $X_1, \ldots, X_n$ be i.i.d., and uniformly distributed on the interval $[0, \theta]$. Then the density of $X = (X_1, \ldots, X_n)$ is
$$p_\theta(x_1, \ldots, x_n) = \frac{1}{\theta^n}\, 1\{0 \le \min\{x_1, \ldots, x_n\} \le \max\{x_1, \ldots, x_n\} \le \theta\} = g_\theta(S(x_1, \ldots, x_n))\, h(x_1, \ldots, x_n),$$
with
$$g_\theta(s) := \frac{1}{\theta^n}\, 1\{s \le \theta\},$$
and
$$h(x_1, \ldots, x_n) := 1\{0 \le \min\{x_1, \ldots, x_n\}\}.$$
Thus, $S = \max\{X_1, \ldots, X_n\}$ is sufficient for $\theta$.
2.10.3 Exponential families
Definition A $k$-dimensional exponential family is a family of distributions $\{P_\theta : \theta \in \Theta\}$, dominated by some $\sigma$-finite measure $\nu$, with densities $p_\theta = dP_\theta/d\nu$ of the form
$$p_\theta(x) = \exp\left(\sum_{j=1}^k c_j(\theta) T_j(x) - d(\theta)\right) h(x).$$
Note In case of a $k$-dimensional exponential family, the $k$-dimensional statistic $S(X) = (T_1(X), \ldots, T_k(X))$ is sufficient for $\theta$.
Note If $X_1, \ldots, X_n$ is an i.i.d. sample from a $k$-dimensional exponential family, then the distribution of $X = (X_1, \ldots, X_n)$ is also in a $k$-dimensional exponential family. The density of $X$ is then (for $x := (x_1, \ldots, x_n)$),
$$p_\theta(x) = \prod_{i=1}^n p_\theta(x_i) = \exp\left[\sum_{j=1}^k n c_j(\theta) \bar T_j(x) - n d(\theta)\right]\prod_{i=1}^n h(x_i),$$
where, for $j = 1, \ldots, k$,
$$\bar T_j(x) = \frac{1}{n}\sum_{i=1}^n T_j(x_i).$$
Hence $S(X) = (\bar T_1(X), \ldots, \bar T_k(X))$ is then sufficient for $\theta$.
Note The functions $\{T_j\}$ and $\{c_j\}$ are not uniquely defined.
Example 2.10.7 If $X$ is Poisson$(\theta)$-distributed, we have
$$p_\theta(x) = e^{-\theta}\frac{\theta^x}{x!} = \exp[x\log\theta - \theta]\,\frac{1}{x!}.$$
Hence, we may take $T(x) = x$, $c(\theta) = \log\theta$, and $d(\theta) = \theta$.
Example 2.10.8 If $X$ has the Binomial$(n, \theta)$-distribution, we have
$$p_\theta(x) = \binom{n}{x}\theta^x(1-\theta)^{n-x} = \binom{n}{x}\left(\frac{\theta}{1-\theta}\right)^x(1-\theta)^n = \binom{n}{x}\exp\left[x\log\left(\frac{\theta}{1-\theta}\right) + n\log(1-\theta)\right].$$
So we can take $T(x) = x$, $c(\theta) = \log(\theta/(1-\theta))$, and $d(\theta) = -n\log(1-\theta)$.
Example 2.10.9 If $X$ has the Negative Binomial$(k, \theta)$-distribution we have
$$p_\theta(x) = \frac{\Gamma(x + k)}{\Gamma(k)\, x!}\theta^k(1-\theta)^x = \frac{\Gamma(x + k)}{\Gamma(k)\, x!}\exp[x\log(1-\theta) + k\log(\theta)].$$
So we may take $T(x) = x$, $c(\theta) = \log(1-\theta)$ and $d(\theta) = -k\log(\theta)$.
Example 2.10.10 Let $X$ have the Gamma$(k, \theta)$-distribution (with $k$ known). Then
$$p_\theta(x) = e^{-\theta x} x^{k-1}\frac{\theta^k}{\Gamma(k)} = \frac{x^{k-1}}{\Gamma(k)}\exp[-\theta x + k\log\theta].$$
So we can take $T(x) = x$, $c(\theta) = -\theta$, and $d(\theta) = -k\log\theta$.
Example 2.10.11 Let $X$ have the Gamma$(k, \lambda)$-distribution, and let $\theta = (k, \lambda)$. Then
$$p_\theta(x) = e^{-\lambda x} x^{k-1}\frac{\lambda^k}{\Gamma(k)} = \exp[-\lambda x + (k-1)\log x + k\log\lambda - \log\Gamma(k)].$$
So we can take $T_1(x) = x$, $T_2(x) = \log x$, $c_1(\theta) = -\lambda$, $c_2(\theta) = k - 1$, and $d(\theta) = -k\log\lambda + \log\Gamma(k)$.
Example 2.10.12 Let $X$ be $\mathcal{N}(\mu, \sigma^2)$-distributed, and let $\theta = (\mu, \sigma)$. Then
$$p_\theta(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] = \frac{1}{\sqrt{2\pi}}\exp\left[\frac{x\mu}{\sigma^2} - \frac{x^2}{2\sigma^2} - \frac{\mu^2}{2\sigma^2} - \log\sigma\right].$$
So we can take $T_1(x) = x$, $T_2(x) = x^2$, $c_1(\theta) = \mu/\sigma^2$, $c_2(\theta) = -1/(2\sigma^2)$, and $d(\theta) = \mu^2/(2\sigma^2) + \log(\sigma)$.
2.10.4 Canonical form of an exponential family
In this subsection, we assume regularity conditions, such as existence of derivatives, and inverses, and permission to interchange differentiation and integration.
Let $\Theta \subseteq \mathbf{R}^k$, and let $\{P_\theta : \theta \in \Theta\}$ be a family of probability measures dominated by a $\sigma$-finite measure $\nu$. Define the densities
$$p_\theta := \frac{dP_\theta}{d\nu}.$$
Definition We call $\{P_\theta : \theta \in \Theta\}$ an exponential family in canonical form, if
$$p_\theta(x) = \exp\left(\sum_{j=1}^k \theta_j T_j(x) - d(\theta)\right) h(x).$$
Note that $d(\theta)$ is the normalizing constant
$$d(\theta) = \log\left(\int \exp\left(\sum_{j=1}^k \theta_j T_j(x)\right) h(x)\, d\nu(x)\right).$$
We let
$$\dot d(\theta) := \frac{\partial}{\partial\theta}d(\theta) = \begin{pmatrix} \partial d(\theta)/\partial\theta_1 \\ \vdots \\ \partial d(\theta)/\partial\theta_k \end{pmatrix}$$
denote the vector of first derivatives, and
$$\ddot d(\theta) := \frac{\partial^2}{\partial\theta\partial\theta^T}d(\theta) = \left(\frac{\partial^2 d(\theta)}{\partial\theta_j\partial\theta_{j'}}\right)$$
denote the $k \times k$ matrix of second derivatives. Further, we write
$$T(X) := \begin{pmatrix} T_1(X) \\ \vdots \\ T_k(X) \end{pmatrix}, \quad E_\theta T(X) := \begin{pmatrix} E_\theta T_1(X) \\ \vdots \\ E_\theta T_k(X) \end{pmatrix},$$
and we write the $k \times k$ covariance matrix of $T(X)$ as
$$\mathrm{Cov}_\theta(T(X)) := \left(\mathrm{cov}_\theta(T_j(X), T_{j'}(X))\right).$$
Lemma We have (under regularity)
$$E_\theta T(X) = \dot d(\theta), \quad \mathrm{Cov}_\theta(T(X)) = \ddot d(\theta).$$
Proof. By the definition of $d(\theta)$, we find
$$\dot d(\theta) = \frac{\partial}{\partial\theta}\log\left(\int \exp\left(\theta^T T(x)\right) h(x)\, d\nu(x)\right) = \frac{\int \exp\left(\theta^T T(x)\right) T(x) h(x)\, d\nu(x)}{\int \exp\left(\theta^T T(x)\right) h(x)\, d\nu(x)}$$
$$= \int \exp\left(\theta^T T(x) - d(\theta)\right) T(x) h(x)\, d\nu(x) = \int p_\theta(x) T(x)\, d\nu(x) = E_\theta T(X),$$
and
$$\ddot d(\theta) = \frac{\int \exp\left(\theta^T T(x)\right) T(x) T(x)^T h(x)\, d\nu(x)}{\int \exp\left(\theta^T T(x)\right) h(x)\, d\nu(x)} - \frac{\left(\int \exp\left(\theta^T T(x)\right) T(x) h(x)\, d\nu(x)\right)\left(\int \exp\left(\theta^T T(x)\right) T(x) h(x)\, d\nu(x)\right)^T}{\left(\int \exp\left(\theta^T T(x)\right) h(x)\, d\nu(x)\right)^2},$$
which, rewriting the normalized integrands with $p_\theta(x) = \exp(\theta^T T(x) - d(\theta)) h(x)$, equals
$$\int p_\theta(x) T(x) T(x)^T\, d\nu(x) - \left(\int p_\theta(x) T(x)\, d\nu(x)\right)\left(\int p_\theta(x) T(x)\, d\nu(x)\right)^T$$
$$= E_\theta T(X) T(X)^T - \left(E_\theta T(X)\right)\left(E_\theta T(X)\right)^T = \mathrm{Cov}_\theta(T(X)). \quad \Box$$
Let us now simplify to the one-dimensional case, that is $\Theta \subseteq \mathbf{R}$. Consider an exponential family, not necessarily in canonical form:
$$p_\theta(x) = \exp[c(\theta) T(x) - d(\theta)] h(x).$$
We can put this in canonical form by reparametrizing:
$$c(\theta) := \gamma \text{ (say)},$$
to get
$$\tilde p_\gamma(x) = \exp[\gamma T(x) - d_0(\gamma)] h(x),$$
where
$$d_0(\gamma) = d(c^{-1}(\gamma)).$$
It follows that
$$E_\theta T(X) = \dot d_0(\gamma) = \frac{\dot d(c^{-1}(\gamma))}{\dot c(c^{-1}(\gamma))} = \frac{\dot d(\theta)}{\dot c(\theta)}, \quad (2.2)$$
and
$$\mathrm{var}_\theta(T(X)) = \ddot d_0(\gamma) = \frac{\ddot d(c^{-1}(\gamma))}{[\dot c(c^{-1}(\gamma))]^2} - \frac{\dot d(c^{-1}(\gamma))\,\ddot c(c^{-1}(\gamma))}{[\dot c(c^{-1}(\gamma))]^3} = \frac{\ddot d(\theta)}{[\dot c(\theta)]^2} - \frac{\dot d(\theta)\,\ddot c(\theta)}{[\dot c(\theta)]^3} = \frac{1}{[\dot c(\theta)]^2}\left(\ddot d(\theta) - \frac{\dot d(\theta)}{\dot c(\theta)}\ddot c(\theta)\right). \quad (2.3)$$
For an arbitrary (but regular) family of densities $\{p_\theta : \theta \in \Theta\}$, with (again for simplicity) $\Theta \subseteq \mathbf{R}$, we define the score function
$$s_\theta(x) := \frac{d}{d\theta}\log p_\theta(x),$$
and the Fisher information for estimating $\theta$
$$I(\theta) := \mathrm{var}_\theta(s_\theta(X))$$
(see also Chapters 3 and 6).
Lemma We have (under regularity)
$$E_\theta s_\theta(X) = 0,$$
and
$$I(\theta) = -E_\theta \dot s_\theta(X),$$
where $\dot s_\theta(x) := \frac{d}{d\theta}s_\theta(x)$.
Proof. The results follow from the fact that densities integrate to one, assuming that we may interchange derivatives and integrals:
$$E_\theta s_\theta(X) = \int s_\theta(x) p_\theta(x)\, d\nu(x) = \int \frac{d\log p_\theta(x)}{d\theta} p_\theta(x)\, d\nu(x) = \int \frac{dp_\theta(x)/d\theta}{p_\theta(x)} p_\theta(x)\, d\nu(x)$$
$$= \int \frac{d}{d\theta}p_\theta(x)\, d\nu(x) = \frac{d}{d\theta}\int p_\theta(x)\, d\nu(x) = \frac{d}{d\theta}1 = 0,$$
and
$$E_\theta \dot s_\theta(X) = E_\theta\left(\frac{d^2 p_\theta(X)/d\theta^2}{p_\theta(X)} - \left(\frac{dp_\theta(X)/d\theta}{p_\theta(X)}\right)^2\right) = E_\theta\left(\frac{d^2 p_\theta(X)/d\theta^2}{p_\theta(X)}\right) - E_\theta s^2_\theta(X).$$
Now, $E_\theta s^2_\theta(X)$ equals $\mathrm{var}_\theta(s_\theta(X))$, since $E_\theta s_\theta(X) = 0$. Moreover,
$$E_\theta\left(\frac{d^2 p_\theta(X)/d\theta^2}{p_\theta(X)}\right) = \int \frac{d^2}{d\theta^2}p_\theta(x)\, d\nu(x) = \frac{d^2}{d\theta^2}\int p_\theta(x)\, d\nu(x) = \frac{d^2}{d\theta^2}1 = 0. \quad \Box$$
In the special case that $\{P_\theta : \theta \in \Theta\}$ is a one-dimensional exponential family, the densities are of the form
$$p_\theta(x) = \exp[c(\theta) T(x) - d(\theta)] h(x).$$
Hence
$$s_\theta(x) = \dot c(\theta) T(x) - \dot d(\theta).$$
The equality $E_\theta s_\theta(X) = 0$ implies that
$$E_\theta T(X) = \frac{\dot d(\theta)}{\dot c(\theta)},$$
which re-establishes (2.2). One moreover has
$$\dot s_\theta(x) = \ddot c(\theta) T(x) - \ddot d(\theta).$$
Hence, the equality $\mathrm{var}_\theta(s_\theta(X)) = -E_\theta \dot s_\theta(X)$ implies
$$[\dot c(\theta)]^2\,\mathrm{var}_\theta(T(X)) = -\ddot c(\theta) E_\theta T(X) + \ddot d(\theta) = \ddot d(\theta) - \frac{\dot d(\theta)}{\dot c(\theta)}\ddot c(\theta),$$
which re-establishes (2.3). In addition, it follows that
$$I(\theta) = \ddot d(\theta) - \frac{\dot d(\theta)}{\dot c(\theta)}\ddot c(\theta).$$
The Fisher information for estimating $\gamma = c(\theta)$ is
$$I_0(\gamma) = \ddot d_0(\gamma) = \frac{I(\theta)}{[\dot c(\theta)]^2}.$$
More generally, the Fisher information for estimating a differentiable function $g(\theta)$ of the parameter $\theta$ is equal to $I(\theta)/[\dot g(\theta)]^2$.
Example
Let $X \in \{0, 1\}$ have the Bernoulli distribution with success parameter $\theta \in (0, 1)$:
$$p_\theta(x) = \theta^x(1-\theta)^{1-x} = \exp\left[x\log\left(\frac{\theta}{1-\theta}\right) + \log(1-\theta)\right], \quad x \in \{0, 1\}.$$
We reparametrize:
$$\gamma := c(\theta) = \log\left(\frac{\theta}{1-\theta}\right),$$
which is called the log-odds ratio. Inverting gives
$$\theta = \frac{e^\gamma}{1 + e^\gamma},$$
and hence
$$d(\theta) = -\log(1-\theta) = \log\left(1 + e^\gamma\right) := d_0(\gamma).$$
Thus
$$\dot d_0(\gamma) = \frac{e^\gamma}{1 + e^\gamma} = \theta = E_\theta X,$$
and
$$\ddot d_0(\gamma) = \frac{e^\gamma}{1 + e^\gamma} - \frac{e^{2\gamma}}{(1 + e^\gamma)^2} = \frac{e^\gamma}{(1 + e^\gamma)^2} = \theta(1-\theta) = \mathrm{var}_\theta(X).$$
The score function is
$$s_\theta(x) = \frac{d}{d\theta}\left[x\log\left(\frac{\theta}{1-\theta}\right) + \log(1-\theta)\right] = \frac{x}{\theta(1-\theta)} - \frac{1}{1-\theta}.$$
The Fisher information for estimating the success parameter $\theta$ is
$$I(\theta) = E_\theta s^2_\theta(X) = \frac{\mathrm{var}_\theta(X)}{[\theta(1-\theta)]^2} = \frac{1}{\theta(1-\theta)},$$
whereas the Fisher information for estimating the log-odds ratio $\gamma$ is
$$I_0(\gamma) = \theta(1-\theta).$$
2.10.5 Minimal sufficiency
Definition We say that two likelihoods $L_x(\theta)$ and $L_{x'}(\theta)$ are proportional at $(x, x')$ if
$$L_x(\theta) = L_{x'}(\theta)c(x, x'),\ \forall\ \theta,$$
for some constant $c(x, x')$.
A statistic $S$ is called minimal sufficient if $S(x) = S(x')$ for all $x$ and $x'$ for which the likelihoods are proportional.
Example 2.10.13 Let $X_1,\dots,X_n$ be independent and $\mathcal N(\theta,1)$-distributed. Then $S = \sum_{i=1}^n X_i$ is sufficient for $\theta$. We moreover have
$$\log L_x(\theta) = \theta S(x) - \frac{n\theta^2}{2} - \sum_{i=1}^n\frac{x_i^2}{2} - n\log(2\pi)/2.$$
So
$$\log L_x(\theta) - \log L_{x'}(\theta) = \theta\big(S(x) - S(x')\big) - \frac{\sum_{i=1}^n x_i^2 - \sum_{i=1}^n (x_i')^2}{2},$$
which equals
$$\log c(x, x'),\ \forall\ \theta,$$
for some function $c$, if and only if $S(x) = S(x')$. So $S$ is minimal sufficient.
Example 2.10.14 Let $X_1,\dots,X_n$ be independent and Laplace-distributed with location parameter $\theta$. Then
$$\log L_x(\theta) = -n\log(2)/2 - \sqrt 2\sum_{i=1}^n |x_i - \theta|,$$
so
$$\log L_x(\theta) - \log L_{x'}(\theta) = -\sqrt 2\sum_{i=1}^n\big(|x_i - \theta| - |x_i' - \theta|\big),$$
which equals
$$\log c(x, x'),\ \forall\ \theta,$$
for some function $c$, if and only if $(x_{(1)},\dots,x_{(n)}) = (x'_{(1)},\dots,x'_{(n)})$. So the order statistics $X_{(1)},\dots,X_{(n)}$ are minimal sufficient.
Chapter 3
Unbiased estimators
3.1 What is an unbiased estimator?
Let $X\in\mathcal X$ denote the observations. The distribution $P$ of $X$ is assumed to be a member of a given class $\{P_\theta:\ \theta\in\Theta\}$ of distributions. The parameter of interest in this chapter is $\gamma := g(\theta)$, with $g:\Theta\to\mathbb R$ (for simplicity, we initially assume $\gamma$ to be one-dimensional).
Let $T:\mathcal X\to\mathbb R$ be an estimator of $g(\theta)$.
Definition The bias of $T = T(X)$ is
$$\mathrm{bias}_\theta(T) := E_\theta T - g(\theta).$$
The estimator $T$ is called unbiased if
$$\mathrm{bias}_\theta(T) = 0,\ \forall\ \theta.$$
Thus, unbiasedness means that there is no systematic error: $E_\theta T = g(\theta)$. We require this for all $\theta$.
Example 3.1.1 Let $X\sim\mathrm{Binomial}(n,\theta)$, $0<\theta<1$. We have
$$E_\theta T(X) = \sum_{k=0}^n\binom nk\theta^k(1-\theta)^{n-k}T(k) := q(\theta).$$
Note that $q(\theta)$ is a polynomial in $\theta$ of degree at most $n$. So only parameters $g(\theta)$ which are polynomials of degree at most $n$ can be estimated unbiasedly. It means that there exists no unbiased estimator of, for example, $1/\theta$ or $\theta/(1-\theta)$.
Example 3.1.2 Let $X\sim\mathrm{Poisson}(\theta)$. Then
$$E_\theta T(X) = \sum_{k=0}^\infty e^{-\theta}\frac{\theta^k}{k!}T(k) := e^{-\theta}p(\theta).$$
Note that $p(\theta)$ is a power series in $\theta$. Thus only parameters $g(\theta)$ which are a power series in $\theta$ times $e^{-\theta}$ can be estimated unbiasedly. An example is the probability of early failure
$$g(\theta) := e^{-\theta} = P_\theta(X = 0).$$
An unbiased estimator of $e^{-\theta}$ is for instance
$$T(X) = \mathbb 1\{X = 0\}.$$
As another example, suppose the parameter of interest is
$$g(\theta) := e^{-2\theta}.$$
An unbiased estimator is
$$T(X) = \begin{cases}+1 & \text{if } X \text{ is even}\\ -1 & \text{if } X \text{ is odd}\end{cases}.$$
This estimator does not make sense at all!
Example 3.1.3 Let $X_1,\dots,X_n$ be i.i.d. $\mathcal N(\mu,\sigma^2)$, and let $\theta = (\mu,\sigma^2)\in\mathbb R\times\mathbb R_+$. Then
$$S^2 := \frac{1}{n-1}\sum_{i=1}^n(X_i-\bar X)^2$$
is an unbiased estimator of $\sigma^2$. But $S$ is not an unbiased estimator of $\sigma$: by Jensen's inequality, $(E_\theta S)^2 < E_\theta S^2 = \sigma^2$, so $E_\theta S < \sigma$.
We conclude that requiring unbiasedness can have disadvantages: unbiased estimators do not always exist, and if they do, they can be nonsensical. Moreover, the property of unbiasedness is not preserved under taking nonlinear transformations.
3.2 UMVU estimators
Lemma 3.2.1 We have the following equality for the mean square error:
$$E_\theta|T - g(\theta)|^2 = \mathrm{bias}_\theta^2(T) + \mathrm{var}_\theta(T).$$
In other words, the mean square error consists of two components, the (squared) bias and the variance. This is called the bias-variance decomposition. As we will see, it is often the case that an attempt to decrease the bias results in an increase of the variance (and vice versa).
Example 3.2.1 Let $X_1,\dots,X_n$ be i.i.d. $\mathcal N(\mu,\sigma^2)$-distributed. Both $\mu$ and $\sigma^2$ are unknown parameters: $\theta := (\mu,\sigma^2)$.
Case i Suppose the mean $\mu$ is our parameter of interest. Consider the estimator $T := \lambda\bar X$, where $0\le\lambda\le1$. Then the bias is decreasing in $\lambda$, but the variance is increasing in $\lambda$:
$$E_\theta|T-\mu|^2 = (1-\lambda)^2\mu^2 + \lambda^2\sigma^2/n.$$
The right hand side can be minimized as a function of $\lambda$. The minimum is attained at
$$\lambda_{\mathrm{opt}} := \frac{\mu^2}{\sigma^2/n + \mu^2}.$$
However, $\lambda_{\mathrm{opt}}\bar X$ is not an estimator, as it depends on the unknown parameters.
Case ii Suppose $\sigma^2$ is the parameter of interest. Let $S^2$ be the sample variance:
$$S^2 := \frac{1}{n-1}\sum_{i=1}^n(X_i-\bar X)^2.$$
It is known that $S^2$ is unbiased. But does it also have small mean square error? Let us compare it with the estimator
$$\hat\sigma^2 := \frac1n\sum_{i=1}^n(X_i-\bar X)^2.$$
To compute the mean square errors of these two estimators, we first recall that
$$\frac{\sum_{i=1}^n(X_i-\bar X)^2}{\sigma^2}\sim\chi^2_{n-1},$$
a $\chi^2$-distribution with $n-1$ degrees of freedom. The $\chi^2$-distribution is a special case of the Gamma-distribution, namely
$$\chi^2_{n-1} = \Gamma\left(\frac{n-1}{2},\frac12\right).$$
Thus¹
$$E_\theta\left[\sum_{i=1}^n(X_i-\bar X)^2/\sigma^2\right] = n-1,\quad \mathrm{var}\left[\sum_{i=1}^n(X_i-\bar X)^2/\sigma^2\right] = 2(n-1).$$
It follows that
$$E_\theta|S^2-\sigma^2|^2 = \mathrm{var}(S^2) = \frac{\sigma^4}{(n-1)^2}\,2(n-1) = \frac{2\sigma^4}{n-1},$$
and
$$E_\theta\hat\sigma^2 = \frac{n-1}{n}\sigma^2,\quad \mathrm{bias}_\theta(\hat\sigma^2) = -\frac1n\sigma^2,$$
so that
$$E_\theta|\hat\sigma^2-\sigma^2|^2 = \mathrm{bias}^2_\theta(\hat\sigma^2) + \mathrm{var}_\theta(\hat\sigma^2) = \frac{\sigma^4}{n^2} + \frac{\sigma^4}{n^2}\,2(n-1) = \frac{\sigma^4(2n-1)}{n^2}.$$
Conclusion: the mean square error of $\hat\sigma^2$ is smaller than the mean square error of $S^2$!

¹ If $Y$ has a $\Gamma(k,\lambda)$-distribution, then $EY = k/\lambda$ and $\mathrm{var}(Y) = k/\lambda^2$.
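The two closed-form mean square errors above can be compared directly. A small sketch (function names ours) checks that the biased estimator wins for every sample size, while both MSEs agree to first order in $1/n$:

```python
def mse_S2(n, sigma2):
    # MSE of the unbiased sample variance S^2: 2 sigma^4 / (n - 1)
    return 2.0 * sigma2**2 / (n - 1)

def mse_sighat2(n, sigma2):
    # MSE of the 1/n-version: sigma^4 (2n - 1) / n^2
    return sigma2**2 * (2 * n - 1) / n**2

for n in [2, 5, 20, 100]:
    assert mse_sighat2(n, 1.0) < mse_S2(n, 1.0)

# both MSEs behave like 2 sigma^4 / n for large n, so the advantage vanishes
assert abs(mse_S2(10**6, 1.0) / mse_sighat2(10**6, 1.0) - 1.0) < 1e-5
```

The ratio of the two MSEs is $2n^2/((n-1)(2n-1)) = 1 + O(1/n)$, so the gain from the biased estimator is only a second-order effect.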
Generally, it is not possible to construct an estimator that possesses all desirable properties simultaneously. We therefore fix one property: unbiasedness (despite its disadvantages), and look for good estimators among the unbiased ones.
Definition An unbiased estimator $T^*$ is called UMVU (Uniform Minimum Variance Unbiased) if for any other unbiased estimator $T$,
$$\mathrm{var}_\theta(T^*)\le\mathrm{var}_\theta(T),\ \forall\ \theta.$$
Suppose that $T$ is unbiased, and that $S$ is sufficient. Let
$$T^* := E(T|S).$$
The distribution of $T$ given $S$ does not depend on $\theta$, so $T^*$ is also an estimator. Moreover, it is unbiased:
$$E_\theta T^* = E_\theta(E(T|S)) = E_\theta T = g(\theta).$$
By conditioning on $S$, superfluous variance in the sample is killed. Indeed, the following lemma (which is a general property of conditional distributions) shows that $T^*$ cannot have larger variance than $T$:
$$\mathrm{var}_\theta(T^*)\le\mathrm{var}_\theta(T),\ \forall\ \theta.$$
Lemma 3.2.2 Let $Y$ and $Z$ be two random variables. Then
$$\mathrm{var}(Y) = \mathrm{var}(E(Y|Z)) + E\,\mathrm{var}(Y|Z).$$
Proof. It holds that
$$\mathrm{var}(E(Y|Z)) = E[E(Y|Z)]^2 - [E(E(Y|Z))]^2 = E[E(Y|Z)]^2 - [EY]^2,$$
and
$$E\,\mathrm{var}(Y|Z) = E\left[E(Y^2|Z) - [E(Y|Z)]^2\right] = EY^2 - E[E(Y|Z)]^2.$$
Hence, when adding up, the term $E[E(Y|Z)]^2$ cancels out, and what is left over is exactly the variance
$$\mathrm{var}(Y) = EY^2 - [EY]^2. \qquad\square$$
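The decomposition of Lemma 3.2.2 can be verified exactly on a small discrete joint distribution, using rational arithmetic so that the identity holds with equality rather than up to rounding (the particular pmf below is an arbitrary choice of ours):

```python
from fractions import Fraction as F

# joint pmf of (Y, Z) on a small grid; values chosen arbitrarily
pmf = {(0, 0): F(1, 8), (1, 0): F(3, 8), (0, 1): F(2, 8), (2, 1): F(2, 8)}

def E(f):
    return sum(p * f(y, z) for (y, z), p in pmf.items())

var_Y = E(lambda y, z: y * y) - E(lambda y, z: y) ** 2

# conditional means and variances of Y given Z = z
pz = {z: sum(p for (y, zz), p in pmf.items() if zz == z) for z in (0, 1)}
m = {z: sum(p * y for (y, zz), p in pmf.items() if zz == z) / pz[z]
     for z in (0, 1)}
v = {z: sum(p * y * y for (y, zz), p in pmf.items() if zz == z) / pz[z]
        - m[z] ** 2
     for z in (0, 1)}

var_EYgZ = sum(pz[z] * m[z] ** 2 for z in (0, 1)) - E(lambda y, z: y) ** 2
E_varYgZ = sum(pz[z] * v[z] for z in (0, 1))

# exact rational arithmetic: var(Y) = var(E(Y|Z)) + E var(Y|Z)
assert var_Y == var_EYgZ + E_varYgZ
```

Both terms on the right are nonnegative, which is exactly why $\mathrm{var}(E(T|S)) \le \mathrm{var}(T)$ in the Rao-Blackwell step above.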
3.2.1 Complete statistics
The question arises: can we construct an unbiased estimator with even smaller variance than $T^* = E(T|S)$? Note that $T^*$ depends on $X$ only via $S = S(X)$, i.e., it depends only on the sufficient statistic. In our search for UMVU estimators, we may restrict our attention to estimators depending only on $S$. Thus, if there is only one unbiased estimator depending only on $S$, it has to be UMVU.
Definition A statistic $S$ is called complete if we have the following implication:
$$E_\theta h(S) = 0\ \forall\ \theta\ \Rightarrow\ h(S) = 0,\ P_\theta\text{-a.s.},\ \forall\ \theta.$$
Lemma 3.2.3 (Lehmann-Scheffé) Let $T$ be an unbiased estimator of $g(\theta)$, with, for all $\theta$, finite variance. Moreover, let $S$ be sufficient and complete. Then $T^* := E(T|S)$ is UMVU.
Proof. We already noted that $T^* = T^*(S)$ is unbiased and that $\mathrm{var}_\theta(T^*)\le\mathrm{var}_\theta(T)$ $\forall\ \theta$. If $T'(S)$ is another unbiased estimator of $g(\theta)$, we have
$$E_\theta(T^*(S) - T'(S)) = 0,\ \forall\ \theta.$$
Because $S$ is complete, this implies
$$T^* = T',\ P_\theta\text{-a.s.},\ \forall\ \theta. \qquad\square$$
To check whether a statistic is complete, one often needs somewhat sophisticated tools from analysis/integration theory. In the next two examples, we only sketch the proofs of completeness.
Example 3.2.2 Let $X_1,\dots,X_n$ be i.i.d. Poisson($\theta$)-distributed. We want to estimate $g(\theta) := e^{-\theta}$, the probability of early failure. An unbiased estimator is
$$T(X_1,\dots,X_n) := \mathbb 1\{X_1 = 0\}.$$
A sufficient statistic is
$$S := \sum_{i=1}^n X_i.$$
We now check whether $S$ is complete. Its distribution is the Poisson($n\theta$)-distribution. We therefore have for any function $h$,
$$E_\theta h(S) = \sum_{k=0}^\infty e^{-n\theta}\frac{(n\theta)^k}{k!}h(k).$$
The equation
$$E_\theta h(S) = 0\ \forall\ \theta$$
thus implies
$$\sum_{k=0}^\infty\frac{(n\theta)^k}{k!}h(k) = 0\ \forall\ \theta.$$
Let $f$ be a function with Taylor expansion at zero. Then
$$f(x) = \sum_{k=0}^\infty\frac{x^k}{k!}f^{(k)}(0).$$
The left hand side can only be zero for all $x$ if $f\equiv0$, in which case also $f^{(k)}(0) = 0$ for all $k$. Thus ($h(k)$ takes the role of $f^{(k)}(0)$ and $n\theta$ the role of $x$), we conclude that $h(k) = 0$ for all $k$, i.e., that $S$ is complete.
So we know from the Lehmann-Scheffé Lemma that $T^* := E(T|S)$ is UMVU. Now,
$$P(T = 1|S = s) = P(X_1 = 0|S = s) = \frac{e^{-\theta}\,e^{-(n-1)\theta}[(n-1)\theta]^s/s!}{e^{-n\theta}(n\theta)^s/s!} = \left(\frac{n-1}{n}\right)^s.$$
Hence
$$T^* = \left(\frac{n-1}{n}\right)^S$$
is UMVU.
Example 3.2.3 Let $X_1,\dots,X_n$ be i.i.d. Uniform$[0,\theta]$-distributed, and $g(\theta) := \theta$. We know that $S := \max\{X_1,\dots,X_n\}$ is sufficient. The distribution function of $S$ is
$$F_S(s) = P_\theta(\max\{X_1,\dots,X_n\}\le s) = \left(\frac s\theta\right)^n,\quad 0\le s\le\theta.$$
Its density is thus
$$f_S(s) = \frac{ns^{n-1}}{\theta^n},\quad 0\le s\le\theta.$$
Hence, for any (measurable) function $h$,
$$E_\theta h(S) = \int_0^\theta h(s)\frac{ns^{n-1}}{\theta^n}ds.$$
If
$$E_\theta h(S) = 0\ \forall\ \theta,$$
it must hold that
$$\int_0^\theta h(s)s^{n-1}ds = 0\ \forall\ \theta.$$
Differentiating w.r.t. $\theta$ gives
$$h(\theta)\theta^{n-1} = 0\ \forall\ \theta,$$
which implies $h\equiv0$. So $S$ is complete.
It remains to find a statistic $T^*$ that depends only on $S$ and that is unbiased. We have
$$E_\theta S = \int_0^\theta s\,\frac{ns^{n-1}}{\theta^n}ds = \frac{n}{n+1}\theta.$$
So $S$ itself is not unbiased, it is too small. But this can be easily repaired: take
$$T^* = \frac{n+1}{n}S.$$
Then, by the Lehmann-Scheffé Lemma, $T^*$ is UMVU.
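The expectation $E_\theta S = n\theta/(n+1)$ can be checked numerically from the density of the maximum, e.g. with a midpoint Riemann sum (a sketch of ours, not from the text; the grid size is an arbitrary choice):

```python
# Riemann-sum check of E_theta[S] = n * theta / (n + 1) for the maximum
# of n Uniform[0, theta] variables, using the density n s^(n-1) / theta^n
theta, n, m = 2.0, 4, 200000
h = theta / m
# midpoint rule for the integral of s * n s^(n-1) / theta^n over [0, theta]
ES = sum((i + 0.5) * h * n * ((i + 0.5) * h) ** (n - 1) / theta ** n
         for i in range(m)) * h
assert abs(ES - n * theta / (n + 1)) < 1e-6
# the bias-corrected estimator (n+1)/n * S then has expectation theta
assert abs((n + 1) / n * ES - theta) < 1e-5
```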
In the case of an exponential family, completeness holds for a sufficient statistic if the parameter space is of the same dimension as the sufficient statistic. This is stated more formally in the following lemma. We omit the proof.
Lemma 3.2.4 Let, for $\theta\in\Theta$,
$$p_\theta(x) = \exp\left[\sum_{j=1}^k c_j(\theta)T_j(x) - d(\theta)\right]h(x).$$
Consider the set
$$\mathcal C := \{(c_1(\theta),\dots,c_k(\theta)):\ \theta\in\Theta\}\subset\mathbb R^k.$$
Suppose that $\mathcal C$ is truly $k$-dimensional (that is, not of dimension smaller than $k$), i.e., it contains an open ball in $\mathbb R^k$. (Or an open cube $\prod_{j=1}^k(a_j,b_j)$.) Then $S := (T_1,\dots,T_k)$ is complete.
Example 3.2.4 Let $X_1,\dots,X_n$ be i.i.d. with $\Gamma(k,\lambda)$-distribution. Both $k$ and $\lambda$ are assumed to be unknown, so that $\theta := (k,\lambda)$. We moreover let $\Theta := \mathbb R_+^2$. The density $f$ of the $\Gamma(k,\lambda)$-distribution is
$$f(z) = \frac{\lambda^k}{\Gamma(k)}e^{-\lambda z}z^{k-1},\quad z>0.$$
Hence,
$$p_\theta(x) = \exp\left[-\lambda\sum_{i=1}^n x_i + (k-1)\sum_{i=1}^n\log x_i - d(\theta)\right]h(x),$$
where
$$d(k,\lambda) = -nk\log\lambda + n\log\Gamma(k),$$
and
$$h(x) = \mathbb 1\{x_i > 0,\ i = 1,\dots,n\}.$$
It follows that
$$\left(\sum_{i=1}^n X_i,\ \sum_{i=1}^n\log X_i\right)$$
is sufficient and complete.
Example 3.2.5 Consider two independent samples from normal distributions: $X_1,\dots,X_n$ i.i.d. $\mathcal N(\mu,\sigma^2)$-distributed and $Y_1,\dots,Y_m$ i.i.d. $\mathcal N(\nu,\tau^2)$-distributed.
Case i If $\theta = (\mu,\nu,\sigma^2,\tau^2)\in\mathbb R^2\times\mathbb R_+^2$, one can easily check that
$$S := \left(\sum_{i=1}^n X_i,\ \sum_{i=1}^n X_i^2,\ \sum_{j=1}^m Y_j,\ \sum_{j=1}^m Y_j^2\right)$$
is sufficient and complete.
Case ii If $\mu$, $\sigma^2$ and $\tau^2$ are unknown, and $\nu = \mu$, then $S$ of course remains sufficient. One can however show that $S$ is not complete. Difficult question: does a complete statistic exist?
3.3 The Cramér-Rao lower bound
Let $\{P_\theta:\ \theta\in\Theta\}$ be a collection of distributions on $\mathcal X$, dominated by a $\sigma$-finite measure $\nu$. We denote the densities by
$$p_\theta := \frac{dP_\theta}{d\nu},\ \theta\in\Theta.$$
In this section, we assume that $\Theta$ is a one-dimensional open interval (the extension to a higher-dimensional parameter space will be handled in the next section).
We will impose the following two conditions:
Condition I The set
$$A := \{x:\ p_\theta(x) > 0\}$$
does not depend on $\theta$.
Condition II (Differentiability in $L_2$) For all $\theta$ and for a function $s_\theta:\mathcal X\to\mathbb R$ satisfying
$$I(\theta) := E_\theta s_\theta^2(X) < \infty,$$
it holds that
$$\lim_{h\to0}E_\theta\left[\frac{p_{\theta+h}(X) - p_\theta(X)}{h\,p_\theta(X)} - s_\theta(X)\right]^2 = 0.$$
Definition If I and II hold, we call $s_\theta$ the score function, and $I(\theta)$ the Fisher information.
Lemma 3.3.1 Assume Conditions I and II. Then
$$E_\theta s_\theta(X) = 0,\ \forall\ \theta.$$
Proof. Under $P_\theta$, we only need to consider values $x$ with $p_\theta(x)>0$, that is, we may freely divide by $p_\theta$, without worrying about dividing by zero. Observe that
$$E_\theta\left[\frac{p_{\theta+h}(X) - p_\theta(X)}{p_\theta(X)}\right] = \int_A(p_{\theta+h} - p_\theta)d\nu = 0,$$
since densities integrate to 1, and both $p_{\theta+h}$ and $p_\theta$ vanish outside $A$. Thus,
$$|E_\theta s_\theta(X)|^2 = \left|E_\theta\left[\frac{p_{\theta+h}(X) - p_\theta(X)}{h\,p_\theta(X)} - s_\theta(X)\right]\right|^2 \le E_\theta\left[\frac{p_{\theta+h}(X) - p_\theta(X)}{h\,p_\theta(X)} - s_\theta(X)\right]^2 \to 0. \qquad\square$$
Note Thus $I(\theta) = \mathrm{var}_\theta(s_\theta(X))$.
Remark If $p_\theta(x)$ is differentiable for all $x$, we take
$$s_\theta(x) := \frac{d}{d\theta}\log p_\theta(x) = \frac{\dot p_\theta(x)}{p_\theta(x)},$$
where
$$\dot p_\theta(x) := \frac{d}{d\theta}p_\theta(x).$$
Remark Suppose $X_1,\dots,X_n$ are i.i.d. with density $p_\theta$, and $s_\theta = \dot p_\theta/p_\theta$ exists. The joint density is
$$\mathbf p_\theta(\mathbf x) = \prod_{i=1}^n p_\theta(x_i),$$
so that (under Conditions I and II) the score function for $n$ observations is
$$\mathbf s_\theta(\mathbf x) = \sum_{i=1}^n s_\theta(x_i).$$
The Fisher information for $n$ observations is thus
$$\mathbf I(\theta) = \mathrm{var}_\theta(\mathbf s_\theta(\mathbf X)) = \sum_{i=1}^n\mathrm{var}_\theta(s_\theta(X_i)) = nI(\theta).$$
Theorem 3.3.1 (The Cramér-Rao lower bound) Suppose Conditions I and II are met, and that $T$ is an unbiased estimator of $g(\theta)$ with finite variance. Then $g(\theta)$ has a derivative, $\dot g(\theta) := dg(\theta)/d\theta$, equal to
$$\dot g(\theta) = \mathrm{cov}_\theta(T, s_\theta(X)).$$
Moreover,
$$\mathrm{var}_\theta(T)\ge\frac{[\dot g(\theta)]^2}{I(\theta)},\ \forall\ \theta.$$
Proof. We first show differentiability of $g$. As $T$ is unbiased, we have
$$\frac{g(\theta+h)-g(\theta)}{h} = \frac{E_{\theta+h}T(X) - E_\theta T(X)}{h} = \frac1h\int T(p_{\theta+h} - p_\theta)d\nu = E_\theta T(X)\frac{p_{\theta+h}(X)-p_\theta(X)}{h\,p_\theta(X)}$$
$$= E_\theta T(X)\left[\frac{p_{\theta+h}(X)-p_\theta(X)}{h\,p_\theta(X)} - s_\theta(X)\right] + E_\theta T(X)s_\theta(X)$$
$$= E_\theta\big[T(X) - g(\theta)\big]\left[\frac{p_{\theta+h}(X)-p_\theta(X)}{h\,p_\theta(X)} - s_\theta(X)\right] + E_\theta T(X)s_\theta(X)\ \to\ E_\theta T(X)s_\theta(X),$$
as, by the Cauchy-Schwarz inequality,
$$\left|E_\theta\big[T(X)-g(\theta)\big]\left[\frac{p_{\theta+h}(X)-p_\theta(X)}{h\,p_\theta(X)} - s_\theta(X)\right]\right|^2 \le \mathrm{var}_\theta(T)\,E_\theta\left[\frac{p_{\theta+h}(X)-p_\theta(X)}{h\,p_\theta(X)} - s_\theta(X)\right]^2 \to 0.$$
Thus,
$$\dot g(\theta) = E_\theta T(X)s_\theta(X) = \mathrm{cov}_\theta(T, s_\theta(X)).$$
The last equality holds because $E_\theta s_\theta(X) = 0$. By Cauchy-Schwarz,
$$[\dot g(\theta)]^2 = |\mathrm{cov}_\theta(T, s_\theta(X))|^2 \le \mathrm{var}_\theta(T)\mathrm{var}_\theta(s_\theta(X)) = \mathrm{var}_\theta(T)I(\theta). \qquad\square$$
Definition We call $[\dot g(\theta)]^2/I(\theta)$, $\theta\in\Theta$, the Cramér-Rao lower bound (CRLB) (for estimating $g(\theta)$).
Example 3.3.1 Let $X_1,\dots,X_n$ be i.i.d. Exponential($\theta$), $\theta>0$. The density of a single observation is then
$$p_\theta(x) = \theta e^{-\theta x},\quad x>0.$$
Let $g(\theta) := 1/\theta$, and $T := \bar X$. Then $T$ is unbiased, and $\mathrm{var}_\theta(T) = 1/(n\theta^2)$. We now compute the CRLB. With $g(\theta) = 1/\theta$, one has $\dot g(\theta) = -1/\theta^2$. Moreover,
$$\log p_\theta(x) = \log\theta - \theta x,$$
so
$$s_\theta(x) = 1/\theta - x,$$
and hence
$$I(\theta) = \mathrm{var}_\theta(X) = \frac{1}{\theta^2}.$$
The CRLB for $n$ observations is thus
$$\frac{[\dot g(\theta)]^2}{nI(\theta)} = \frac{1}{n\theta^2}.$$
In other words, T reaches the CRLB.
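The Fisher information $I(\theta) = E_\theta(1/\theta - X)^2 = 1/\theta^2$ used above can be confirmed by direct numerical integration of $s_\theta^2\,p_\theta$ (a sketch of ours; the truncation point and grid size are arbitrary choices):

```python
import math

def fisher_info_exponential(theta, upper=40.0, m=200000):
    # midpoint-rule approximation of I(theta) = E_theta (1/theta - X)^2
    # for the Exponential(theta) density p(x) = theta * exp(-theta * x);
    # the tail beyond `upper` is negligible for moderate theta
    h = upper / m
    total = 0.0
    for i in range(m):
        x = (i + 0.5) * h
        total += (1.0 / theta - x) ** 2 * theta * math.exp(-theta * x) * h
    return total

theta = 1.5
assert abs(fisher_info_exponential(theta) - 1.0 / theta**2) < 1e-6
```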
Example 3.3.2 Suppose $X_1,\dots,X_n$ are i.i.d. Poisson($\theta$), $\theta>0$. Then
$$\log p_\theta(x) = -\theta + x\log\theta - \log x!,$$
so
$$s_\theta(x) = -1 + \frac x\theta,$$
and hence
$$I(\theta) = \mathrm{var}_\theta\left(\frac X\theta\right) = \frac{\mathrm{var}_\theta(X)}{\theta^2} = \frac1\theta.$$
One easily checks that $\bar X$ reaches the CRLB for estimating $\theta$.
Let now $g(\theta) := e^{-\theta}$. The UMVU estimator of $g(\theta)$ is
$$T := \left(1-\frac1n\right)^{\sum_{i=1}^n X_i}.$$
To compute its variance, we first compute
$$E_\theta T^2 = \sum_{k=0}^\infty\left(1-\frac1n\right)^{2k}\frac{(n\theta)^k}{k!}e^{-n\theta} = e^{-n\theta}\sum_{k=0}^\infty\frac1{k!}\left(\frac{(n-1)^2\theta}{n}\right)^k = e^{-n\theta}\exp\left(\frac{(n-1)^2\theta}{n}\right) = \exp\left(\frac{(1-2n)\theta}{n}\right).$$
Thus,
$$\mathrm{var}_\theta(T) = E_\theta T^2 - [E_\theta T]^2 = E_\theta T^2 - e^{-2\theta} = e^{-2\theta}\left(e^{\theta/n}-1\right)\ \begin{cases}> e^{-2\theta}\theta/n\\ \approx e^{-2\theta}\theta/n\ \text{for } n \text{ large}\end{cases}.$$
As $\dot g(\theta) = -e^{-\theta}$, the CRLB is
$$\frac{[\dot g(\theta)]^2}{nI(\theta)} = \frac{e^{-2\theta}\theta}{n}.$$
We conclude that T does not reach the CRLB, but the gap is small for n large.
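The two closed-form expressions above make the gap easy to quantify: since $e^{\theta/n} - 1 = \theta/n + O(1/n^2)$, the variance strictly exceeds the CRLB for every $n$, but the ratio tends to one. A minimal check (function names ours):

```python
import math

def var_T(theta, n):
    # exact variance of the UMVU estimator T = (1 - 1/n)^(sum X_i)
    return math.exp(-2 * theta) * (math.exp(theta / n) - 1)

def crlb(theta, n):
    # Cramer-Rao lower bound for estimating g(theta) = exp(-theta)
    return math.exp(-2 * theta) * theta / n

theta = 1.0
for n in [2, 10, 100]:
    assert var_T(theta, n) > crlb(theta, n)   # strict gap for every n
# the ratio tends to 1: e^(theta/n) - 1 = theta/n + O(1/n^2)
assert abs(var_T(theta, 10**6) / crlb(theta, 10**6) - 1.0) < 1e-5
```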
For the next result, we:
Recall Let $X$ and $Y$ be two real-valued random variables. The correlation between $X$ and $Y$ is
$$\rho(X,Y) := \frac{\mathrm{cov}(X,Y)}{\sqrt{\mathrm{var}(X)\mathrm{var}(Y)}}.$$
We have
$$|\rho(X,Y)| = 1\ \iff\ \exists\ \text{constants } a, b:\ Y = aX + b\ \text{(a.s.)}.$$
The next lemma shows that the CRLB can only be reached within exponential families, and is thus only tight in a rather limited context.
Lemma 3.3.2 Assume Conditions I and II, with $s_\theta = \dot p_\theta/p_\theta$. Suppose $T$ is unbiased for $g(\theta)$, and that $T$ reaches the Cramér-Rao lower bound. Then $\{P_\theta:\ \theta\in\Theta\}$ forms a one-dimensional exponential family: there exist functions $c(\theta)$, $d(\theta)$, and $h(x)$ such that for all $\theta$,
$$p_\theta(x) = \exp[c(\theta)T(x) - d(\theta)]h(x),\quad x\in\mathcal X.$$
Moreover, $c(\theta)$ and $d(\theta)$ are differentiable, say with derivatives $\dot c(\theta)$ and $\dot d(\theta)$ respectively. We furthermore have the equality
$$g(\theta) = \dot d(\theta)/\dot c(\theta),\ \forall\ \theta.$$
Proof. By Theorem 3.3.1, when $T$ reaches the CRLB, we must have
$$\mathrm{var}_\theta(T) = \frac{|\mathrm{cov}_\theta(T, s_\theta(X))|^2}{\mathrm{var}_\theta(s_\theta(X))},$$
i.e., then the correlation between $T$ and $s_\theta(X)$ is $\pm1$. Thus, there exist constants $a(\theta)$ and $b(\theta)$ (depending on $\theta$), such that
$$s_\theta(X) = a(\theta)T(X) - b(\theta). \qquad(3.1)$$
But, as $s_\theta = \dot p_\theta/p_\theta = d\log p_\theta/d\theta$, we can take primitives:
$$\log p_\theta(x) = c(\theta)T(x) - d(\theta) + \tilde h(x),$$
where $\dot c(\theta) = a(\theta)$, $\dot d(\theta) = b(\theta)$, and $\tilde h(x)$ is constant in $\theta$. Hence,
$$p_\theta(x) = \exp[c(\theta)T(x) - d(\theta)]h(x),$$
with $h(x) = \exp[\tilde h(x)]$.
Moreover, the equation (3.1) tells us that
$$E_\theta s_\theta(X) = a(\theta)E_\theta T - b(\theta) = a(\theta)g(\theta) - b(\theta).$$
Because $E_\theta s_\theta(X) = 0$, this implies that $g(\theta) = b(\theta)/a(\theta)$. $\square$
3.4 Higher-dimensional extensions
Expectations and covariances of random vectors
Let $X\in\mathbb R^p$ be a $p$-dimensional random vector. Then $EX$ is a $p$-dimensional vector, and
$$\Sigma := \mathrm{Cov}(X) := EXX^T - (EX)(EX)^T$$
is a $p\times p$ matrix containing all variances (on the diagonal) and covariances (off-diagonal). Note that $\Sigma$ is positive semi-definite: for any vector $a\in\mathbb R^p$, we have
$$\mathrm{var}(a^TX) = a^T\Sigma a\ge0.$$
Some matrix algebra
Let $V$ be a symmetric matrix. If $V$ is positive (semi-)definite, we write this as $V>0$ ($V\ge0$). One then has that $V = W^2$, where $W$ is also positive (semi-)definite.
Auxiliary lemma. Suppose $V>0$. Then
$$\max_{a\in\mathbb R^p}\frac{|a^Tc|^2}{a^TVa} = c^TV^{-1}c.$$
Proof. Write $V = W^2$, and $b := Wa$, $d := W^{-1}c$. Then $a^TVa = b^Tb = \|b\|^2$ and $a^Tc = b^Td$. By Cauchy-Schwarz,
$$\max_{b\in\mathbb R^p}\frac{|b^Td|^2}{\|b\|^2} = \|d\|^2 = d^Td = c^TV^{-1}c. \qquad\square$$
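For a concrete feel for the auxiliary lemma: the maximum is attained at $a = V^{-1}c$, and any other direction gives a strictly smaller ratio. A small $2\times2$ illustration in plain Python (the matrix $V$ and vector $c$ are arbitrary choices of ours):

```python
# Numeric illustration of the auxiliary lemma for a 2x2 matrix V > 0:
# the ratio |a^T c|^2 / (a^T V a) is maximized by a = V^{-1} c, with
# maximal value c^T V^{-1} c.
V = [[2.0, 0.5], [0.5, 1.0]]
c = [1.0, 3.0]

det = V[0][0] * V[1][1] - V[0][1] * V[1][0]
Vinv = [[V[1][1] / det, -V[0][1] / det], [-V[1][0] / det, V[0][0] / det]]

def quad(M, u, v):  # u^T M v
    return sum(u[i] * M[i][j] * v[j] for i in range(2) for j in range(2))

def ratio(a):
    return (a[0] * c[0] + a[1] * c[1]) ** 2 / quad(V, a, a)

bound = quad(Vinv, c, c)                     # c^T V^{-1} c
a_star = [sum(Vinv[i][j] * c[j] for j in range(2)) for i in range(2)]
assert abs(ratio(a_star) - bound) < 1e-9     # maximum attained at V^{-1} c
# other directions stay below the bound
for a in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-2.0, 5.0]):
    assert ratio(a) <= bound + 1e-9
```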
We will now present the CRLB in higher dimensions. To simplify the exposition, we will not carefully formulate the regularity conditions; that is, we assume derivatives to exist and that we can interchange differentiation and integration at suitable places.
Consider a parameter space $\Theta\subset\mathbb R^p$. Let
$$g:\Theta\to\mathbb R$$
be a given function. Denote the vector of partial derivatives as
$$\dot g(\theta) := \begin{pmatrix}\partial g(\theta)/\partial\theta_1\\ \vdots\\ \partial g(\theta)/\partial\theta_p\end{pmatrix}.$$
The score vector is defined as
$$s_\theta(\cdot) := \begin{pmatrix}\partial\log p_\theta/\partial\theta_1\\ \vdots\\ \partial\log p_\theta/\partial\theta_p\end{pmatrix}.$$
The Fisher information matrix is
$$I(\theta) = E_\theta s_\theta(X)s_\theta^T(X) = \mathrm{Cov}_\theta(s_\theta(X)).$$
Theorem 3.4.1 Let $T$ be an unbiased estimator of $g(\theta)$. Then, under regularity conditions,
$$\mathrm{var}_\theta(T)\ge\dot g(\theta)^TI(\theta)^{-1}\dot g(\theta).$$
Proof. As in the one-dimensional case, one can show that, for $j = 1,\dots,p$,
$$\dot g_j(\theta) = \mathrm{cov}_\theta(T, s_{\theta,j}(X)).$$
Hence, for all $a\in\mathbb R^p$,
$$|a^T\dot g(\theta)|^2 = |\mathrm{cov}_\theta(T, a^Ts_\theta(X))|^2 \le \mathrm{var}_\theta(T)\mathrm{var}_\theta(a^Ts_\theta(X)) = \mathrm{var}_\theta(T)\,a^TI(\theta)a.$$
Combining this with the auxiliary lemma gives
$$\mathrm{var}_\theta(T)\ge\max_{a\in\mathbb R^p}\frac{|a^T\dot g(\theta)|^2}{a^TI(\theta)a} = \dot g(\theta)^TI(\theta)^{-1}\dot g(\theta). \qquad\square$$
Corollary 3.4.1 As a consequence, one obtains a lower bound for unbiased estimators of higher-dimensional parameters of interest. As an example, let $g(\theta) := \theta = (\theta_1,\dots,\theta_p)^T$, and suppose that $T\in\mathbb R^p$ is an unbiased estimator of $\theta$. Then, for all $a\in\mathbb R^p$, $a^TT$ is an unbiased estimator of $a^T\theta$. Since $a^T\theta$ has derivative $a$, the CRLB gives
$$\mathrm{var}_\theta(a^TT)\ge a^TI(\theta)^{-1}a.$$
But
$$\mathrm{var}_\theta(a^TT) = a^T\mathrm{Cov}_\theta(T)a.$$
So for all $a$,
$$a^T\mathrm{Cov}_\theta(T)a\ge a^TI(\theta)^{-1}a,$$
in other words, $\mathrm{Cov}_\theta(T)\ge I(\theta)^{-1}$, that is, $\mathrm{Cov}_\theta(T) - I(\theta)^{-1}$ is positive semi-definite.
3.5 Uniformly most powerful tests
3.5.1 An example
Let $X_1,\dots,X_n$ be i.i.d. copies of a Bernoulli random variable $X\in\{0,1\}$ with success parameter $\theta\in(0,1)$:
$$P_\theta(X=1) = 1 - P_\theta(X=0) = \theta.$$
We consider three testing problems. The chosen level in all three problems is $\alpha = 0.05$.
Problem 1
We want to test, at level $\alpha$, the hypothesis
$$H_0:\ \theta = 1/2 =: \theta_0,$$
against the alternative
$$H_1:\ \theta = 1/4 =: \theta_1.$$
Let $T := \sum_{i=1}^n X_i$ be the number of successes ($T$ is a sufficient statistic), and consider the randomized test
$$\phi(T) := \begin{cases}1 & \text{if } T < t_0\\ q & \text{if } T = t_0\\ 0 & \text{if } T > t_0\end{cases},$$
where $q\in(0,1)$, and where $t_0$ is the critical value of the test. The constants $q$ and $t_0\in\{0,\dots,n\}$ are chosen in such a way that the probability of rejecting $H_0$ when it is in fact true is equal to $\alpha$:
$$P_{\theta_0}(H_0\ \text{rejected}) = P_{\theta_0}(T\le t_0-1) + qP_{\theta_0}(T=t_0) := \alpha.$$
Thus, we take $t_0$ in such a way that
$$P_{\theta_0}(T\le t_0-1)\le\alpha,\quad P_{\theta_0}(T\le t_0)>\alpha,$$
(i.e., $t_0-1 = q^+(\alpha)$ with $q^+$ the quantile function defined in Section 1.6) and
$$q = \frac{\alpha - P_{\theta_0}(T\le t_0-1)}{P_{\theta_0}(T=t_0)}.$$
Because $\phi = \phi_{NP}$ is the Neyman-Pearson test, it is the most powerful test (at level $\alpha$) (see the Neyman-Pearson Lemma in Section 2.2). The power of the test is $\beta(\theta_1)$, where
$$\beta(\theta) := E_\theta\phi(T).$$
Numerical Example
Let $n = 7$. Then
$$P_{\theta_0}(T=0) = (1/2)^7 = 0.0078,$$
$$P_{\theta_0}(T=1) = \binom71(1/2)^7 = 0.0546,$$
$$P_{\theta_0}(T\le1) = 0.0624 > \alpha,$$
so we choose $t_0 = 1$. Moreover,
$$q = \frac{0.05 - 0.0078}{0.0546} = \frac{422}{546}.$$
The power is now
$$\beta(\theta_1) = P_{\theta_1}(T=0) + qP_{\theta_1}(T=1) = (3/4)^7 + \frac{422}{546}\binom71(3/4)^6(1/4) = 0.1335 + \frac{422}{546}\cdot0.3114 \approx 0.374.$$
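The same computation can be carried out with exact binomial probabilities rather than the rounded decimals used above (so the exact $q$ is $27/35\approx0.771$, slightly different from $422/546\approx0.773$). A sketch of ours:

```python
from math import comb

n, alpha, th0, th1 = 7, 0.05, 0.5, 0.25

def pmf(theta, k):
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

# critical value: largest t0 with P(T <= t0 - 1) <= alpha
t0 = 0
while sum(pmf(th0, k) for k in range(t0 + 1)) <= alpha:
    t0 += 1
q = (alpha - sum(pmf(th0, k) for k in range(t0))) / pmf(th0, t0)

level = sum(pmf(th0, k) for k in range(t0)) + q * pmf(th0, t0)
power = sum(pmf(th1, k) for k in range(t0)) + q * pmf(th1, t0)

assert t0 == 1
assert abs(level - alpha) < 1e-12      # exact level by construction
assert abs(q - (alpha - 0.5**7) / (7 * 0.5**7)) < 1e-12
assert 0.37 < power < 0.38
```

The randomization weight $q$ is exactly what makes the discrete test attain level $\alpha$ with equality.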
Problem 2
Consider now testing
$$H_0:\ \theta = \theta_0 = 1/2,$$
against
$$H_1:\ \theta < 1/2.$$
In Problem 1, the construction of the test $\phi$ is independent of the value $\theta_1<\theta_0$. So $\phi$ is most powerful for all $\theta_1<\theta_0$. We say that $\phi$ is uniformly most powerful (German: gleichmäßig mächtigst) for the alternative $H_1:\ \theta<\theta_0$.
Problem 3
We now want to test
$$H_0:\ \theta\ge1/2,$$
against the alternative
$$H_1:\ \theta<1/2.$$
Recall the function
$$\beta(\theta) := E_\theta\phi(T).$$
The level of $\phi$ is defined as
$$\sup_{\theta\ge1/2}\beta(\theta).$$
We have
$$\beta(\theta) = P_\theta(T\le t_0-1) + qP_\theta(T=t_0) = (1-q)P_\theta(T\le t_0-1) + qP_\theta(T\le t_0).$$
Observe that if $\theta_1<\theta_0$, small values of $T$ are more likely under $P_{\theta_1}$ than under $P_{\theta_0}$:
$$P_{\theta_1}(T\le t) > P_{\theta_0}(T\le t),\quad t\in\{0,1,\dots,n-1\}.$$
Thus, $\beta(\theta)$ is a decreasing function of $\theta$. It follows that the level of $\phi$ is
$$\sup_{\theta\ge1/2}\beta(\theta) = \beta(1/2) = \alpha.$$
Hence, $\phi$ is uniformly most powerful for $H_0:\ \theta\ge1/2$ against $H_1:\ \theta<1/2$.
3.5.2 UMP tests and exponential families
Let $\mathcal P := \{P_\theta:\ \theta\in\Theta\}$ be a family of probability measures. Let $\Theta_0\subset\Theta$, $\Theta_1\subset\Theta$, and $\Theta_0\cap\Theta_1 = \emptyset$. Based on observations $X$, with distribution $P\in\mathcal P$, we consider the general testing problem, at level $\alpha$, for
$$H_0:\ \theta\in\Theta_0,$$
against
$$H_1:\ \theta\in\Theta_1.$$
We say that a test $\phi$ has level $\alpha$ if
$$\sup_{\theta\in\Theta_0}E_\theta\phi(X)\le\alpha.$$
Definition A test $\phi$ is called Uniformly Most Powerful (UMP) if
- $\phi$ has level $\alpha$,
- for all tests $\phi'$ with level $\alpha$, it holds that $E_\theta\phi'(X)\le E_\theta\phi(X)$ $\forall\ \theta\in\Theta_1$.
We now simplify the situation to the case where $\Theta$ is an interval in $\mathbb R$, and to the testing problem
$$H_0:\ \theta\le\theta_0,$$
against
$$H_1:\ \theta>\theta_0.$$
We also suppose that $\mathcal P$ is dominated by a $\sigma$-finite measure $\nu$.
Theorem 3.5.1 Suppose that $\mathcal P$ is a one-dimensional exponential family
$$\frac{dP_\theta}{d\nu}(x) := p_\theta(x) = \exp[c(\theta)T(x) - d(\theta)]h(x).$$
Assume moreover that $c(\theta)$ is a strictly increasing function of $\theta$. Then the UMP test is
$$\phi(T(x)) := \begin{cases}1 & \text{if } T(x) > t_0\\ q & \text{if } T(x) = t_0\\ 0 & \text{if } T(x) < t_0\end{cases},$$
where $q$ and $t_0$ are chosen in such a way that $E_{\theta_0}\phi(T) = \alpha$.
Proof. The Neyman-Pearson test for $H_0:\ \theta=\theta_0$ against $H_1:\ \theta=\theta_1$ is
$$\phi_{NP}(x) = \begin{cases}1 & \text{if } p_{\theta_1}(x)/p_{\theta_0}(x) > c_0\\ q_0 & \text{if } p_{\theta_1}(x)/p_{\theta_0}(x) = c_0\\ 0 & \text{if } p_{\theta_1}(x)/p_{\theta_0}(x) < c_0\end{cases},$$
where $q_0$ and $c_0$ are chosen in such a way that $E_{\theta_0}\phi_{NP}(X) = \alpha$. We have
$$\frac{p_{\theta_1}(x)}{p_{\theta_0}(x)} = \exp\left[(c(\theta_1)-c(\theta_0))T(x) - (d(\theta_1)-d(\theta_0))\right].$$
Hence, for $\theta_1>\theta_0$,
$$\frac{p_{\theta_1}(x)}{p_{\theta_0}(x)}\ \begin{matrix}>\\=\\<\end{matrix}\ c\quad\iff\quad T(x)\ \begin{matrix}>\\=\\<\end{matrix}\ t,$$
where $t$ is some constant (depending on $c$, $\theta_0$ and $\theta_1$). Therefore, $\phi = \phi_{NP}$. It follows that $\phi$ is most powerful for $H_0:\ \theta=\theta_0$ against $H_1:\ \theta=\theta_1$. Because $\phi$ does not depend on $\theta_1$, it is therefore UMP for $H_0:\ \theta=\theta_0$ against $H_1:\ \theta>\theta_0$.
We will now prove that $\beta(\theta) := E_\theta\phi(T)$ is increasing in $\theta$. Let
$$\tilde p_\theta(t) = \exp[c(\theta)t - d(\theta)].$$
This is the density of $T$ with respect to
$$d\tilde\nu(t) = \int_{x:\ T(x)=t}h(x)d\nu(x).$$
For $\tilde\theta>\theta$,
$$\frac{\tilde p_{\tilde\theta}(t)}{\tilde p_\theta(t)} = \exp\left[(c(\tilde\theta)-c(\theta))t - (d(\tilde\theta)-d(\theta))\right],$$
which is increasing in $t$. Moreover, we have
$$\int\tilde p_{\tilde\theta}\,d\tilde\nu = \int\tilde p_\theta\,d\tilde\nu = 1.$$
Therefore, there must be a point $s_0$ where the two densities cross:
$$\begin{cases}\dfrac{\tilde p_{\tilde\theta}(t)}{\tilde p_\theta(t)}\le1 & \text{for } t\le s_0\\[2ex] \dfrac{\tilde p_{\tilde\theta}(t)}{\tilde p_\theta(t)}\ge1 & \text{for } t\ge s_0\end{cases}.$$
But then
$$\beta(\tilde\theta)-\beta(\theta) = \int\phi(t)[\tilde p_{\tilde\theta}(t)-\tilde p_\theta(t)]d\tilde\nu(t)$$
$$= \int_{t\le s_0}\phi(t)[\tilde p_{\tilde\theta}(t)-\tilde p_\theta(t)]d\tilde\nu(t) + \int_{t\ge s_0}\phi(t)[\tilde p_{\tilde\theta}(t)-\tilde p_\theta(t)]d\tilde\nu(t)$$
$$\ge\phi(s_0)\int[\tilde p_{\tilde\theta}(t)-\tilde p_\theta(t)]d\tilde\nu(t) = 0.$$
So indeed $\beta(\theta)$ is increasing in $\theta$. But then
$$\sup_{\theta\le\theta_0}\beta(\theta) = \beta(\theta_0) = \alpha.$$
Hence, $\phi$ has level $\alpha$. Because any other test $\phi'$ with level $\alpha$ must have $E_{\theta_0}\phi'(X)\le\alpha$, we conclude that $\phi$ is UMP. $\square$
Example 3.5.1 Let $X_1,\dots,X_n$ be an i.i.d. sample from the $\mathcal N(\mu_0,\sigma^2)$-distribution, with $\mu_0$ known, and $\sigma^2>0$ unknown. We want to test
$$H_0:\ \sigma^2\le\sigma_0^2,$$
against
$$H_1:\ \sigma^2>\sigma_0^2.$$
The density of the sample is
$$p_{\sigma^2}(x_1,\dots,x_n) = \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu_0)^2 - \frac n2\log(2\pi\sigma^2)\right].$$
Thus, we may take
$$c(\sigma^2) = -\frac{1}{2\sigma^2},$$
and
$$T(X) = \sum_{i=1}^n(X_i-\mu_0)^2.$$
The function $c(\sigma^2)$ is strictly increasing in $\sigma^2$. So we let $\phi$ be the test which rejects $H_0$ for large values of $T(X)$.
Example 3.5.2 Let $X_1,\dots,X_n$ be an i.i.d. sample from the Bernoulli($\theta$)-distribution, $0<\theta<1$. Then
$$p_\theta(x_1,\dots,x_n) = \exp\left[\log\left(\frac{\theta}{1-\theta}\right)\sum_{i=1}^n x_i + n\log(1-\theta)\right].$$
We can take
$$c(\theta) = \log\left(\frac{\theta}{1-\theta}\right),$$
which is strictly increasing in $\theta$. Then $T(X) = \sum_{i=1}^n X_i$.
Right-sided alternative
$$H_0:\ \theta\le\theta_0,\quad\text{against}\quad H_1:\ \theta>\theta_0.$$
The UMP test is
$$\phi_R(T) := \begin{cases}1 & T>t_R\\ q_R & T=t_R\\ 0 & T<t_R\end{cases}.$$
The function $\beta_R(\theta) := E_\theta\phi_R(T)$ is strictly increasing in $\theta$.
Left-sided alternative
$$H_0:\ \theta\ge\theta_0,\quad\text{against}\quad H_1:\ \theta<\theta_0.$$
The UMP test is
$$\phi_L(T) := \begin{cases}1 & T<t_L\\ q_L & T=t_L\\ 0 & T>t_L\end{cases}.$$
The function $\beta_L(\theta) := E_\theta\phi_L(T)$ is strictly decreasing in $\theta$.
Two-sided alternative
$$H_0:\ \theta=\theta_0,\quad\text{against}\quad H_1:\ \theta\ne\theta_0.$$
The test $\phi_R$ is most powerful for $\theta>\theta_0$, whereas $\phi_L$ is most powerful for $\theta<\theta_0$. Hence, a UMP test does not exist for the two-sided alternative.
3.5.3 Unbiased tests
Consider again the general case: $\mathcal P := \{P_\theta:\ \theta\in\Theta\}$ is a family of probability measures, the spaces $\Theta_0$ and $\Theta_1$ are disjoint subspaces of $\Theta$, and the testing problem is
$$H_0:\ \theta\in\Theta_0,\quad\text{against}\quad H_1:\ \theta\in\Theta_1.$$
The significance level is $\alpha$ ($\alpha<1$).
As we have seen in Example 3.5.2, uniformly most powerful tests do not always exist. We therefore restrict attention to a smaller class of tests, and look for uniformly most powerful tests in the smaller class.
Definition A test $\phi$ is called unbiased (German: unverfälscht) if for all $\theta\in\Theta_0$ and all $\vartheta\in\Theta_1$,
$$E_\theta\phi(X)\le E_\vartheta\phi(X).$$
Definition A test $\phi$ is called Uniformly Most Powerful Unbiased (UMPU) if
- $\phi$ has level $\alpha$,
- $\phi$ is unbiased,
- for all unbiased tests $\phi'$ with level $\alpha$, one has $E_\theta\phi'(X)\le E_\theta\phi(X)$ $\forall\ \theta\in\Theta_1$.
We return to the special case where $\Theta\subset\mathbb R$ is an interval. We consider testing
$$H_0:\ \theta=\theta_0,\quad\text{against}\quad H_1:\ \theta\ne\theta_0.$$
The following theorem presents the UMPU test. We omit the proof (see e.g. Lehmann ...).
Theorem 3.5.2 Suppose $\mathcal P$ is a one-dimensional exponential family:
$$\frac{dP_\theta}{d\nu}(x) := p_\theta(x) = \exp[c(\theta)T(x) - d(\theta)]h(x),$$
with $c(\theta)$ strictly increasing in $\theta$. Then the UMPU test is
$$\phi(T(x)) := \begin{cases}1 & \text{if } T(x)<t_L\ \text{or}\ T(x)>t_R\\ q_L & \text{if } T(x)=t_L\\ q_R & \text{if } T(x)=t_R\\ 0 & \text{if } t_L<T(x)<t_R\end{cases},$$
where the constants $t_R$, $t_L$, $q_R$ and $q_L$ are chosen in such a way that
$$E_{\theta_0}\phi(X) = \alpha,\quad \frac{d}{d\theta}E_\theta\phi(X)\Big|_{\theta=\theta_0} = 0.$$
Note Let $\phi_R$ be a right-sided test as defined in Theorem 3.5.1 with level at most $\alpha$, and $\phi_L$ be the similarly defined left-sided test. Then $\beta_R(\theta) = E_\theta\phi_R(T)$ is strictly increasing, and $\beta_L(\theta) = E_\theta\phi_L(T)$ is strictly decreasing. The two-sided test of Theorem 3.5.2 is a superposition of two one-sided tests. Writing
$$\beta(\theta) = E_\theta\phi(T),$$
the one-sided tests are constructed in such a way that
$$\beta(\theta) = \beta_R(\theta) + \beta_L(\theta).$$
Moreover, $\beta(\theta)$ should be minimal at $\theta=\theta_0$, whence the requirement that its derivative at $\theta_0$ should vanish. Let us see what this derivative looks like. With the notation used in the proof of Theorem 3.5.1, for a test $\phi'$ depending only on the sufficient statistic $T$,
$$E_\theta\phi'(T) = \int\phi'(t)\exp[c(\theta)t-d(\theta)]d\tilde\nu(t).$$
Hence, assuming we can take the differentiation inside the integral,
$$\frac{d}{d\theta}E_\theta\phi'(T) = \int\phi'(t)\exp[c(\theta)t-d(\theta)](\dot c(\theta)t-\dot d(\theta))d\tilde\nu(t) = \dot c(\theta)\,\mathrm{cov}_\theta(\phi'(T),T).$$
Example 3.5.3 Let $X_1,\dots,X_n$ be an i.i.d. sample from the $\mathcal N(\mu,\sigma_0^2)$-distribution, with $\mu\in\mathbb R$ unknown, and with $\sigma_0^2$ known. We consider testing
$$H_0:\ \mu=\mu_0,\quad\text{against}\quad H_1:\ \mu\ne\mu_0.$$
A sufficient statistic is $T := \sum_{i=1}^n X_i$. We have, for $t_L<t_R$,
$$E_\mu\phi(T) = \mathrm{I\!P}_\mu(T>t_R) + \mathrm{I\!P}_\mu(T<t_L)$$
$$= \mathrm{I\!P}_\mu\left(\frac{T-n\mu}{\sqrt n\sigma_0} > \frac{t_R-n\mu}{\sqrt n\sigma_0}\right) + \mathrm{I\!P}_\mu\left(\frac{T-n\mu}{\sqrt n\sigma_0} < \frac{t_L-n\mu}{\sqrt n\sigma_0}\right)$$
$$= 1-\Phi\left(\frac{t_R-n\mu}{\sqrt n\sigma_0}\right) + \Phi\left(\frac{t_L-n\mu}{\sqrt n\sigma_0}\right),$$
where $\Phi$ is the standard normal distribution function. To avoid confusion with the test $\phi$, we denote the standard normal density in this example by $\varphi$. Thus,
$$\frac{d}{d\mu}E_\mu\phi(T) = \frac{\sqrt n}{\sigma_0}\varphi\left(\frac{t_R-n\mu}{\sqrt n\sigma_0}\right) - \frac{\sqrt n}{\sigma_0}\varphi\left(\frac{t_L-n\mu}{\sqrt n\sigma_0}\right).$$
So putting
$$\frac{d}{d\mu}E_\mu\phi(T)\Big|_{\mu=\mu_0} = 0$$
gives
$$\varphi\left(\frac{t_R-n\mu_0}{\sqrt n\sigma_0}\right) = \varphi\left(\frac{t_L-n\mu_0}{\sqrt n\sigma_0}\right),$$
or
$$(t_R-n\mu_0)^2 = (t_L-n\mu_0)^2.$$
We take the solution $(t_L-n\mu_0) = -(t_R-n\mu_0)$ (because the solution $(t_L-n\mu_0) = (t_R-n\mu_0)$ leads to a test that always rejects, and hence does not have level $\alpha$, as $\alpha<1$). Plugging this solution back in gives
$$E_{\mu_0}\phi(T) = 1-\Phi\left(\frac{t_R-n\mu_0}{\sqrt n\sigma_0}\right) + \Phi\left(-\frac{t_R-n\mu_0}{\sqrt n\sigma_0}\right) = 2\left[1-\Phi\left(\frac{t_R-n\mu_0}{\sqrt n\sigma_0}\right)\right].$$
The requirement $E_{\mu_0}\phi(T) = \alpha$ gives us
$$\Phi\left(\frac{t_R-n\mu_0}{\sqrt n\sigma_0}\right) = 1-\alpha/2,$$
and hence
$$t_R-n\mu_0 = \sqrt n\sigma_0\Phi^{-1}(1-\alpha/2),\quad t_L-n\mu_0 = -\sqrt n\sigma_0\Phi^{-1}(1-\alpha/2).$$
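These thresholds are exactly the familiar two-sided z-test cutoffs. A minimal sketch of ours, using the standard-library normal distribution (the sample size and parameters are arbitrary choices):

```python
from statistics import NormalDist

n, mu0, sigma0, alpha = 25, 0.0, 2.0, 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)      # Phi^{-1}(1 - alpha/2)
t_R = n * mu0 + n**0.5 * sigma0 * z
t_L = n * mu0 - n**0.5 * sigma0 * z

# level check: P(T > t_R) + P(T < t_L) under mu = mu0,
# with T ~ N(n mu0, n sigma0^2)
T_dist = NormalDist(n * mu0, n**0.5 * sigma0)
level = (1 - T_dist.cdf(t_R)) + T_dist.cdf(t_L)
assert abs(level - alpha) < 1e-9
# thresholds are symmetric about n * mu0, as derived above
assert abs((t_R - n * mu0) + (t_L - n * mu0)) < 1e-12
```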
3.5.4 Conditional tests
We now study the case where $\Theta$ is an interval in $\mathbb R^2$. We let $\theta = (\beta,\gamma)$, and we assume that $\gamma$ is the parameter of interest. We aim at testing
$$H_0:\ \gamma\le\gamma_0,\quad\text{against the alternative}\quad H_1:\ \gamma>\gamma_0.$$
We assume moreover that we are dealing with an exponential family in canonical form:
$$p_\theta(x) = \exp[\beta T_1(x) + \gamma T_2(x) - d(\theta)]h(x).$$
Then we can restrict ourselves to tests $\phi(T)$ depending only on the sufficient statistic $T = (T_1,T_2)$.
Lemma 3.5.1 Suppose that $\{\beta:\ (\beta,\gamma_0)\in\Theta\}$ contains an open interval. Let
$$\phi(T_1,T_2) = \begin{cases}1 & \text{if } T_2>t_0(T_1)\\ q(T_1) & \text{if } T_2=t_0(T_1)\\ 0 & \text{if } T_2<t_0(T_1)\end{cases},$$
where the constants $t_0(T_1)$ and $q(T_1)$ are allowed to depend on $T_1$, and are chosen in such a way that
$$E_{\gamma_0}\big[\phi(T_1,T_2)\,\big|\,T_1\big] = \alpha.$$
Then $\phi$ is UMPU.
Proof.
Let $\tilde p_\theta(t_1,t_2)$ be the density of $(T_1,T_2)$ with respect to $\tilde\nu$:
$$\tilde p_\theta(t_1,t_2) := \exp[\beta t_1+\gamma t_2-d(\theta)],\quad d\tilde\nu(t_1,t_2) := \int_{T_1(x)=t_1,\ T_2(x)=t_2}h(x)d\nu(x).$$
The conditional density of $T_2$ given $T_1=t_1$ is
$$\tilde p_\gamma(t_2|t_1) = \frac{\exp[\beta t_1+\gamma t_2-d(\theta)]}{\int_{s_2}\exp[\beta t_1+\gamma s_2-d(\theta)]d\tilde\nu(t_1,s_2)} = \exp[\gamma t_2 - d(\gamma|t_1)],$$
where
$$d(\gamma|t_1) := \log\left(\int_{s_2}\exp[\gamma s_2]d\tilde\nu(t_1,s_2)\right).$$
In other words, the conditional distribution of $T_2$ given $T_1=t_1$
- does not depend on $\beta$,
- is a one-parameter exponential family in canonical form.
This implies that given $T_1=t_1$, $\phi$ is UMPU.
Result 1 The test $\phi$ has level $\alpha$, i.e.
$$\sup_{\gamma\le\gamma_0}E_{(\beta,\gamma)}\phi(T) = E_{(\beta,\gamma_0)}\phi(T) = \alpha,\ \forall\ \beta.$$
Proof of Result 1.
$$\sup_{\gamma\le\gamma_0}E_{(\beta,\gamma)}\phi(T)\ge E_{(\beta,\gamma_0)}\phi(T) = E_{(\beta,\gamma_0)}E_{\gamma_0}(\phi(T)|T_1) = \alpha.$$
Conversely,
$$\sup_{\gamma\le\gamma_0}E_{(\beta,\gamma)}\phi(T) = \sup_{\gamma\le\gamma_0}E_{(\beta,\gamma)}E_\gamma(\phi(T)|T_1)\le E_{(\beta,\gamma)}\sup_{\gamma\le\gamma_0}E_\gamma(\phi(T)|T_1) = \alpha.$$
Result 2 The test $\phi$ is unbiased.
Proof of Result 2. If $\gamma>\gamma_0$, it holds that $E_\gamma(\phi(T)|T_1)\ge\alpha$, as the conditional test is unbiased. Thus, also, for all $\beta$,
$$E_{(\beta,\gamma)}\phi(T) = E_{(\beta,\gamma)}E_\gamma(\phi(T)|T_1)\ge\alpha,$$
i.e., $\phi$ is unbiased.
Result 3 Let $\phi'$ be a test with level
$$\alpha' := \sup_{\gamma\le\gamma_0}\sup_\beta E_{(\beta,\gamma)}\phi'(T)\le\alpha,$$
and suppose moreover that $\phi'$ is unbiased, i.e., that
$$\sup_{\gamma\le\gamma_0}\sup_\beta E_{(\beta,\gamma)}\phi'(T)\le\inf_{\gamma>\gamma_0}\inf_\beta E_{(\beta,\gamma)}\phi'(T).$$
Then, conditionally on $T_1$, $\phi'$ has level $\alpha'$.
Proof of Result 3. As
$$\alpha' = \sup_{\gamma\le\gamma_0}\sup_\beta E_{(\beta,\gamma)}\phi'(T),$$
we know that
$$E_{(\beta,\gamma_0)}\phi'(T)\le\alpha',\ \forall\ \beta.$$
Conversely, the unbiasedness implies that for all $\gamma>\gamma_0$,
$$E_{(\beta,\gamma)}\phi'(T)\ge\alpha',\ \forall\ \beta.$$
A continuity argument therefore gives
$$E_{(\beta,\gamma_0)}\phi'(T) = \alpha',\ \forall\ \beta.$$
In other words, we have
$$E_{(\beta,\gamma_0)}(\phi'(T)-\alpha') = 0,\ \forall\ \beta.$$
But then also
$$E_{(\beta,\gamma_0)}E_{\gamma_0}\big[\phi'(T)-\alpha'\,\big|\,T_1\big] = 0,\ \forall\ \beta,$$
which we can write as
$$E_{(\beta,\gamma_0)}h(T_1) = 0,\ \forall\ \beta.$$
The assumption that $\{\beta:\ (\beta,\gamma_0)\in\Theta\}$ contains an open interval implies that $T_1$ is complete for $\{(\beta,\gamma_0)\}$. So we must have
$$h(T_1) = 0,\ P_{(\beta,\gamma_0)}\text{-a.s.},\ \forall\ \beta,$$
or, by the definition of $h$,
$$E_{\gamma_0}(\phi'(T)|T_1) = \alpha',\ P_{(\beta,\gamma_0)}\text{-a.s.},\ \forall\ \beta.$$
So conditionally on $T_1$, the test $\phi'$ has level $\alpha'$.
Result 4 Let $\phi'$ be a test as given in Result 3. Then $\phi'$ cannot be more powerful than $\phi$ at any $(\beta,\gamma)$ with $\gamma>\gamma_0$.
Proof of Result 4. By the Neyman-Pearson lemma, conditionally on $T_1$, we have
$$E_\gamma(\phi'(T)|T_1)\le E_\gamma(\phi(T)|T_1),\ \forall\ \gamma>\gamma_0.$$
Thus also
$$E_{(\beta,\gamma)}\phi'(T)\le E_{(\beta,\gamma)}\phi(T),\ \forall\ \beta,\ \forall\ \gamma>\gamma_0. \qquad\square$$
Example 3.5.4 Consider two independent samples $\mathbf X = (X_1,\dots,X_n)$ and $\mathbf Y = (Y_1,\dots,Y_m)$, where $X_1,\dots,X_n$ are i.i.d. Poisson($\lambda$)-distributed, and $Y_1,\dots,Y_m$ are i.i.d. Poisson($\mu$)-distributed. We aim at testing
$$H_0:\ \lambda\le\mu,\quad\text{against the alternative}\quad H_1:\ \lambda>\mu.$$
Define
$$\beta := \log(m\mu),\quad \gamma := \log\big((n\lambda)/(m\mu)\big).$$
The testing problem is equivalent to
$$H_0:\ \gamma\le\gamma_0,\quad\text{against the alternative}\quad H_1:\ \gamma>\gamma_0,$$
where $\gamma_0 := \log(n/m)$.
The density is
$$p_\theta(x_1,\dots,x_n,y_1,\dots,y_m) = \exp\left[\log(n\lambda)\sum_{i=1}^n x_i + \log(m\mu)\sum_{j=1}^m y_j - n\lambda - m\mu\right]h(\mathbf x,\mathbf y)$$
$$= \exp\left[\log(m\mu)\Big(\sum_{i=1}^n x_i + \sum_{j=1}^m y_j\Big) + \log\big((n\lambda)/(m\mu)\big)\sum_{i=1}^n x_i - n\lambda - m\mu\right]h(\mathbf x,\mathbf y)$$
$$= \exp[\beta T_1(\mathbf x,\mathbf y) + \gamma T_2(\mathbf x) - d(\theta)]h(\mathbf x,\mathbf y),$$
where
$$T_1(\mathbf X,\mathbf Y) := \sum_{i=1}^n X_i + \sum_{j=1}^m Y_j,\quad T_2(\mathbf X) := \sum_{i=1}^n X_i,$$
and
$$h(\mathbf x,\mathbf y) := n^{-\sum_i x_i}\,m^{-\sum_j y_j}\prod_{i=1}^n\frac{1}{x_i!}\prod_{j=1}^m\frac{1}{y_j!}.$$
The conditional distribution of $T_2$ given $T_1=t_1$ is the Binomial($t_1$, $p$)-distribution, with
$$p = \frac{n\lambda}{n\lambda+m\mu} = \frac{e^\gamma}{1+e^\gamma}.$$
Thus, conditionally on $T_1=t_1$, using the observation $T_2$ from the Binomial($t_1$, $p$)-distribution, we test
$$H_0:\ p\le p_0,\quad\text{against the alternative}\quad H_1:\ p>p_0,$$
where $p_0 := n/(n+m)$. This test is UMPU for the unconditional problem.
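The conditional construction leads to a simple exact procedure: condition on the total count $T_1$ and compute a binomial tail probability for $T_2$ at the boundary value $p_0 = n/(n+m)$. A non-randomized sketch of ours (reporting a conditional p-value rather than the exact level-$\alpha$ randomized test of the text):

```python
from math import comb

def conditional_p_value(x_sum, y_sum, n, m):
    # Exact conditional test for H0: lambda <= mu in the two-sample
    # Poisson problem: given T1 = x_sum + y_sum, the statistic T2 = x_sum
    # is Binomial(T1, p), with p0 = n / (n + m) at the boundary gamma_0.
    t1, t2 = x_sum + y_sum, x_sum
    p0 = n / (n + m)
    # one-sided (right-tail) p-value P(T2 >= t2 | T1 = t1) at p = p0
    return sum(comb(t1, k) * p0**k * (1 - p0) ** (t1 - k)
               for k in range(t2, t1 + 1))

# equal sample sizes, X-counts much larger than Y-counts: small p-value
assert conditional_p_value(30, 10, 5, 5) < 0.05
# balanced counts: no evidence against H0
assert conditional_p_value(20, 20, 5, 5) > 0.4
```

Note how the nuisance parameter $\beta$ drops out entirely: the p-value depends on the data only through $(T_1, T_2)$ and on the design only through $n/(n+m)$.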
Chapter 4
Equivariant statistics
As we have seen in the previous chapter, it can be useful to restrict attention to a collection of statistics satisfying certain desirable properties. In Chapter 3, we restricted ourselves to unbiased estimators. In this chapter, equivariance will be the key concept.
The data consist of i.i.d. real-valued random variables $X_1,\dots,X_n$. We write $\mathbf X := (X_1,\dots,X_n)$. The density, w.r.t. some dominating measure $\nu$, of a single observation is denoted by $p_\theta$. The density of $\mathbf X$ is $\mathbf p_\theta(\mathbf x) = \prod_i p_\theta(x_i)$, $\mathbf x = (x_1,\dots,x_n)$.
Location model
Here $\mu\in\mathbb R$ is a location parameter, and we assume
$$X_i = \mu + \epsilon_i,\quad i = 1,\dots,n.$$
We are interested in estimating $\mu$. Both the parameter space $\Theta$ and the action space $\mathcal A$ are the real line $\mathbb R$.
Location-scale model
Here $\theta = (\mu,\sigma)$, with $\mu\in\mathbb R$ a location parameter and $\sigma>0$ a scale parameter. We assume
$$X_i = \mu + \sigma\epsilon_i,\quad i = 1,\dots,n.$$
The parameter space $\Theta$ and action space $\mathcal A$ are both $\mathbb R\times(0,\infty)$.
4.1 Equivariance in the location model
Definition A statistic T = T(X) is called location equivariant if for all constants c ∈ R and all x = (x_1, ..., x_n),

T(x_1 + c, ..., x_n + c) = T(x_1, ..., x_n) + c.

Examples Location equivariant statistics include the sample mean T = X̄ and the sample median T = X_((n+1)/2) (n odd).

Definition A loss function L(θ, a) is called location invariant if for all c ∈ R,

L(θ + c, a + c) = L(θ, a), ∀ (θ, a) ∈ R².

In this section we abbreviate location equivariance (invariance) to simply equivariance (invariance), and we assume throughout that the loss L(θ, a) is invariant.

Corollary If T is equivariant (and L(θ, a) is invariant), then

R(θ, T) = E_θ L(θ, T(X)) = E_θ L(0, T(X) − θ)
= E_θ L(0, T(X − θ)) = E L_0[T(ε)],

where L_0[a] := L(0, a) and ε := (ε_1, ..., ε_n). Because the distribution of ε does not depend on θ, we conclude that the risk does not depend on θ. We may therefore omit the subscript θ in the last expression:

R(θ, T) = E L_0[T(ε)].

Since for θ = 0 we have the equality X = ε, we may alternatively write

R(θ, T) = E_0 L_0[T(X)] = R(0, T).

Definition An equivariant statistic T is called uniform minimum risk equivariant (UMRE) if

R(θ, T) = min_{d equivariant} R(θ, d), ∀ θ,

or equivalently,

R(0, T) = min_{d equivariant} R(0, d).

Lemma 4.1.1 Let Y_i := X_i − X_n, i = 1, ..., n, and Y := (Y_1, ..., Y_n). We have

T equivariant ⇔ T(X) = T(Y) + X_n.

Proof.
(⇒) Trivial.
(⇐) Replacing X by X + c leaves Y unchanged (i.e., Y is invariant). So

T(X + c) = T(Y) + X_n + c = T(X) + c. ⊔⊓
Theorem 4.1.1 Let Y_i := X_i − X_n, i = 1, ..., n, Y := (Y_1, ..., Y_n), and define

T*(Y) := arg min_v E( L_0[v + ε_n] | Y ).

Moreover, let

T*(X) := T*(Y) + X_n.

Then T* is UMRE.

Proof. First, note that the distribution of Y does not depend on θ, so that T* is indeed a statistic. It is also equivariant, by the previous lemma.

Let T be an equivariant statistic. Then T(X) = T(Y) + X_n, and so, since X_n = θ + ε_n,

T(X) − θ = T(Y) + ε_n.

Hence

R(0, T) = E L_0[T(Y) + ε_n] = E( E( L_0[T(Y) + ε_n] | Y ) ).

But

E( L_0[T(Y) + ε_n] | Y ) ≥ min_v E( L_0[v + ε_n] | Y ) = E( L_0[T*(Y) + ε_n] | Y ).

Hence,

R(0, T) ≥ E( E( L_0[T*(Y) + ε_n] | Y ) ) = R(0, T*). ⊔⊓

Corollary 4.1.1 If we take quadratic loss

L(θ, a) := (a − θ)²,

we get L_0[a] = a², and so, for Y = X − X_n,

T*(Y) = arg min_v E( (v + ε_n)² | Y ) = −E( ε_n | Y ),

and hence

T*(X) = X_n − E( ε_n | Y ).

This estimator is called the Pitman estimator.

To investigate the case of quadratic risk further, we note the following.

Note If (X, Z) has density f(x, z) w.r.t. Lebesgue measure, then the density of Y := X − Z is

f_Y(y) = ∫ f(y + z, z) dz.
Lemma 4.1.2 Consider quadratic loss. Let p_0 be the density of ε = (ε_1, ..., ε_n) w.r.t. Lebesgue measure. Then a UMRE statistic is

T*(X) = ∫ z p_0(X_1 − z, ..., X_n − z) dz / ∫ p_0(X_1 − z, ..., X_n − z) dz.

Proof. Let Y = X − X_n. The random vector Y has density

f_Y(y_1, ..., y_{n−1}, 0) = ∫ p_0(y_1 + z, ..., y_{n−1} + z, z) dz.

So the density of ε_n given Y = y = (y_1, ..., y_{n−1}, 0) is

f_{ε_n | Y}(u) = p_0(y_1 + u, ..., y_{n−1} + u, u) / ∫ p_0(y_1 + z, ..., y_{n−1} + z, z) dz.

It follows that

E( ε_n | y ) = ∫ u p_0(y_1 + u, ..., y_{n−1} + u, u) du / ∫ p_0(y_1 + z, ..., y_{n−1} + z, z) dz.

Thus

E( ε_n | Y ) = ∫ u p_0(Y_1 + u, ..., Y_{n−1} + u, u) du / ∫ p_0(Y_1 + z, ..., Y_{n−1} + z, z) dz

= ∫ u p_0(X_1 − X_n + u, ..., X_{n−1} − X_n + u, u) du / ∫ p_0(X_1 − X_n + z, ..., X_{n−1} − X_n + z, z) dz

= ∫ (X_n + u) p_0(X_1 + u, ..., X_{n−1} + u, X_n + u) du / ∫ p_0(X_1 + z, ..., X_{n−1} + z, X_n + z) dz

(substituting u → X_n + u and z → X_n + z)

= X_n − ∫ z p_0(X_1 − z, ..., X_{n−1} − z, X_n − z) dz / ∫ p_0(X_1 − z, ..., X_{n−1} − z, X_n − z) dz

(substituting u → −z in the remaining integral). Finally, recall that T*(X) = X_n − E( ε_n | Y ). ⊔⊓
Example 4.1.1 Suppose X_1, ..., X_n are i.i.d. Uniform[θ − 1/2, θ + 1/2], θ ∈ R. Then

p_0(x) = 1{ max_{1≤i≤n} |x_i| ≤ 1/2 }.

We have

max_{1≤i≤n} |x_i − z| ≤ 1/2 ⇔ x_(n) − 1/2 ≤ z ≤ x_(1) + 1/2.

So

p_0(x_1 − z, ..., x_n − z) = 1{ x_(n) − 1/2 ≤ z ≤ x_(1) + 1/2 }.

Thus, writing

T_1 := X_(n) − 1/2, T_2 := X_(1) + 1/2,

the UMRE estimator T* is

T* = ( ∫_{T_1}^{T_2} z dz ) / ( ∫_{T_1}^{T_2} dz ) = (T_1 + T_2)/2 = ( X_(1) + X_(n) )/2.
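As a quick numerical check, the following sketch (plain Python; the true θ, sample size, and replication count are arbitrary illustration values) compares the Monte Carlo quadratic risk of the midrange ( X_(1) + X_(n) )/2 with that of the sample mean for uniform data; the midrange should win clearly.

```python
import random

def midrange(x):
    """UMRE (Pitman) estimator for Uniform[theta - 1/2, theta + 1/2]:
    (X_(1) + X_(n)) / 2, as derived in Example 4.1.1."""
    return (min(x) + max(x)) / 2

# Monte Carlo sketch comparing quadratic risks.
random.seed(1)
theta, n, reps = 3.0, 50, 2000
mse_mid = mse_mean = 0.0
for _ in range(reps):
    x = [theta - 0.5 + random.random() for _ in range(n)]
    mse_mid += (midrange(x) - theta) ** 2 / reps
    mse_mean += (sum(x) / n - theta) ** 2 / reps
# mse_mid should be clearly smaller than mse_mean (about 1/600)
```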
We now consider more general invariant statistics Y.

Definition A map Y : R^n → R^n is called maximal invariant if

Y(x) = Y(x′) ⇔ ∃ c : x = x′ + c.

(The constant c may depend on x and x′.)

Example The map Y(x) := x − x_n is maximal invariant:
(⇐) is clear,
(⇒) if x − x_n = x′ − x′_n, we have x = x′ + (x_n − x′_n).

More generally:

Example Let d(X) be equivariant. Then Y := X − d(X) is maximal invariant.

Theorem 4.1.2 Suppose that d(X) is equivariant. Let Y := X − d(X), and

T*(Y) := arg min_v E( L_0[v + d(ε)] | Y ).

Then

T*(X) := T*(Y) + d(X)

is UMRE.

Proof. Let T be an equivariant estimator. Then

T(X) = T(X − d(X)) + d(X) = T(Y) + d(X).

Hence

E( L_0[T(ε)] | Y ) = E( L_0[T(Y) + d(ε)] | Y ) ≥ min_v E( L_0[v + d(ε)] | Y ).

Now, use the iterated expectation lemma. ⊔⊓

Special case
For quadratic loss (L_0[a] = a²), the definition of T*(Y) in the above theorem is

T*(Y) = −E( d(ε) | Y ) = −E_0( d(X) | X − d(X) ),

so that

T*(X) = d(X) − E_0( d(X) | X − d(X) ).

So for an equivariant estimator T, we have

T is UMRE ⇔ E_0( T(X) | X − T(X) ) = 0.

From the right hand side, we conclude that E_0 T = 0 and hence E_θ T = θ for all θ. Thus, in the case of quadratic loss, a UMRE estimator is unbiased.
Conversely, suppose we have an equivariant and unbiased estimator T. If T(X) and X − T(X) are independent, it follows that

E_0( T(X) | X − T(X) ) = E_0 T(X) = 0.

So then T is UMRE.

To check independence, Basu's lemma can be useful.

Basu's lemma Let X have distribution P_θ, θ ∈ Θ. Suppose T is sufficient and complete, and that Y = Y(X) has a distribution that does not depend on θ. Then, for all θ, T and Y are independent under P_θ.

Proof. Let A be some measurable set, and

h(T) := P(Y ∈ A | T) − P(Y ∈ A).

Notice that indeed, P(Y ∈ A | T) does not depend on θ, because T is sufficient. Because

E_θ h(T) = 0, ∀ θ,

we conclude from the completeness of T that

h(T) = 0, P_θ-a.s., ∀ θ,

in other words,

P(Y ∈ A | T) = P(Y ∈ A), P_θ-a.s., ∀ θ.

Since A was arbitrary, we thus have that the conditional distribution of Y given T is equal to the unconditional distribution:

P(Y ∈ · | T) = P(Y ∈ ·), P_θ-a.s., ∀ θ,

that is, for all θ, T and Y are independent under P_θ. ⊔⊓
Basus lemma is intriguing: it proves a probabilistic property (independence)
via statistical concepts.
Example 4.1.2 Let X_1, ..., X_n be independent N(θ, σ²), with σ² known. Then T := X̄ is sufficient and complete, and moreover, the distribution of Y := X − X̄ does not depend on θ. So by Basu's lemma, X̄ and X − X̄ are independent. Hence, X̄ is UMRE.

Remark Indeed, Basu's lemma is peculiar: X̄ and X − X̄ of course remain independent if the mean θ is known and/or the variance σ² is unknown!

Remark As a by-product, one concludes the independence of X̄ and the sample variance S² = ∑_{i=1}^n (X_i − X̄)²/(n − 1), because S² is a function of X − X̄.
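A small simulation (a sketch in plain Python; sample size and replication count are arbitrary choices) illustrates Basu's conclusion: across repeated normal samples, the empirical correlation between X̄ and S² is near zero.

```python
import random

# Monte Carlo sketch of Basu's lemma in Example 4.1.2: for i.i.d. normal
# samples, the sample mean and the sample variance are independent, so
# their empirical correlation over many replications should be near zero.
random.seed(2)
n, reps = 10, 4000
means, variances = [], []
for _ in range(reps):
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    xbar = sum(x) / n
    means.append(xbar)
    variances.append(sum((xi - xbar) ** 2 for xi in x) / (n - 1))

def corr(u, v):
    """Plain-Python sample correlation coefficient."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return suv / (su * sv)

rho = corr(means, variances)  # should be close to 0
```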
4.2 Equivariance in the location-scale model (to be
written)
Chapter 5
Proving admissibility and
minimaxity
Bayes estimators are quite useful, also for obdurate frequentists. They can be used to construct estimators that are minimax (admissible), or for verification of minimaxity (admissibility).

Let us first recall the definitions. Let X ∈ X have distribution P_θ, θ ∈ Θ. Let T = T(X) be a statistic (estimator, decision), L(θ, a) be a loss function, and R(θ, T) := E_θ L(θ, T(X)) be the risk of T.

T is minimax if ∀ T′ : sup_θ R(θ, T) ≤ sup_θ R(θ, T′).

T is inadmissible if ∃ T′ : R(θ, T′) ≤ R(θ, T) ∀ θ, and R(θ, T′) < R(θ, T) for some θ.

T is Bayes (for the prior density w on Θ) if ∀ T′ : r_w(T) ≤ r_w(T′).

Recall also that the Bayes risk for w is

r_w(T) = ∫ R(θ, T) w(θ) dμ(θ).

Whenever we say that a statistic T is Bayes, without referring to an explicit prior on Θ, we mean that there exists a prior for which T is Bayes. Of course, if the risk R(θ, T) = R(T) does not depend on θ, then the Bayes risk of T does not depend on the prior.

Especially in cases where one wants to use the uniform distribution as prior, but cannot do so because Θ is not bounded, the notion extended Bayes is useful.

Definition A statistic T is called extended Bayes if there exists a sequence of prior densities {w_m}_{m=1}^∞ (w.r.t. dominating measures that are allowed to depend on m), such that r_{w_m}(T) − inf_{T′} r_{w_m}(T′) → 0 as m → ∞.
5.1 Minimaxity

Lemma 5.1.1 Suppose T is a statistic with risk R(θ, T) = R(T) not depending on θ. Then
(i) T admissible ⇒ T minimax,
(ii) T Bayes ⇒ T minimax,
and in fact more generally,
(iii) T extended Bayes ⇒ T minimax.

Proof.
(i) T is admissible, so for all T′, either there is a θ with R(θ, T′) > R(T), or R(θ, T′) = R(T) for all θ. Hence sup_θ R(θ, T′) ≥ R(T).

(ii) Since Bayes implies extended Bayes, this follows from (iii). We nevertheless present a separate proof, as it is somewhat simpler than (iii).
Note first that for any T′,

r_w(T′) = ∫ R(θ, T′) w(θ) dμ(θ) ≤ ∫ sup_{θ′} R(θ′, T′) w(θ) dμ(θ) (5.1)
= sup_θ R(θ, T′),

that is, the Bayes risk is always bounded by the supremum risk. Suppose now that T′ is a statistic with sup_θ R(θ, T′) < R(T). Then

r_w(T′) ≤ sup_θ R(θ, T′) < R(T) = r_w(T),

which is in contradiction with the assumption that T is Bayes.

(iii) Suppose for simplicity that a Bayes decision T_m for the prior w_m exists, for all m, i.e.,

r_{w_m}(T_m) = inf_{T′} r_{w_m}(T′), m = 1, 2, ... .

By assumption, for all ε > 0, there exists an m sufficiently large, such that

R(T) = r_{w_m}(T) ≤ r_{w_m}(T_m) + ε ≤ r_{w_m}(T′) + ε ≤ sup_θ R(θ, T′) + ε,

because, as we have seen in (5.1), the Bayes risk is bounded by the supremum risk. Since ε can be chosen arbitrarily small, this proves (iii). ⊔⊓
Example 5.1.1 Consider a Binomial(n, θ) random variable X. Let the prior on θ ∈ (0, 1) be the Beta(r, s) distribution. Then the Bayes estimator for quadratic loss is

T = (X + r)/(n + r + s).

Its risk is

R(θ, T) = E_θ (T − θ)² = var_θ(T) + bias²_θ(T)

= nθ(1 − θ)/(n + r + s)² + ( (nθ + r)/(n + r + s) − θ )²

= { [(r + s)² − n] θ² + [n − 2r(r + s)] θ + r² } / (n + r + s)².

This can only be constant in θ if the coefficients in front of θ² and θ are zero:

(r + s)² − n = 0, n − 2r(r + s) = 0.

Solving for r and s gives

r = s = √n / 2.

Plugging these values back in the estimator T shows that

T = ( X + √n/2 )/( n + √n )

is minimax. The minimax risk is

R(T) = 1/( 4 (√n + 1)² ).

We can compare this with the supremum risk of the unbiased estimator X/n:

sup_θ R(θ, X/n) = sup_θ θ(1 − θ)/n = 1/(4n).

So for large n, this does not differ much from the minimax risk.
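The two risk formulas above are easy to verify numerically; the sketch below (plain Python, with hypothetical helper names) checks that the minimax estimator's risk is indeed constant in θ and compares it with the risk of X/n.

```python
import math

def risk_minimax(theta, n):
    """Quadratic risk of T = (X + r)/(n + r + s) with r = s = sqrt(n)/2,
    X ~ Binomial(n, theta): var + bias^2, which is constant in theta."""
    r = s = math.sqrt(n) / 2
    var = n * theta * (1 - theta) / (n + r + s) ** 2
    bias = (n * theta + r) / (n + r + s) - theta
    return var + bias ** 2

def risk_mle(theta, n):
    """Quadratic risk of X/n: theta (1 - theta) / n."""
    return theta * (1 - theta) / n

n = 100
const = 1 / (4 * (math.sqrt(n) + 1) ** 2)     # claimed minimax risk
risks = [risk_minimax(t / 20, n) for t in range(1, 20)]
```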
Example 5.1.2 We consider again the Pitman estimator (see Lemma 4.1.2)

T* = ∫ z p_0(X_1 − z, ..., X_n − z) dz / ∫ p_0(X_1 − z, ..., X_n − z) dz.

(Rest is to be written.)
5.2 Admissibility

In this section, the parameter space Θ is assumed to be an open subset of a topological space, so that we can consider open neighborhoods of members of Θ, and continuous functions on Θ. We moreover restrict ourselves to statistics T with R(θ, T) < ∞, ∀ θ.

Lemma 5.2.1 Suppose that the statistic T is Bayes for the prior density w. Then (i) or (ii) below are sufficient conditions for the admissibility of T.
(i) The statistic T is the unique Bayes decision (i.e., r_w(T) = r_w(T′) implies that ∀ θ, T = T′ P_θ-a.s.),
(ii) For all T′, R(θ, T′) is continuous in θ, and moreover, for all open U ⊂ Θ, the prior probability Π(U) := ∫_U w(θ) dμ(θ) of U is strictly positive.

Proof.
(i) Suppose that for some T′, R(θ, T′) ≤ R(θ, T) for all θ. Then also r_w(T′) ≤ r_w(T). Because T is Bayes, we then must have equality:

r_w(T′) = r_w(T).

So then, ∀ θ, T′ and T are equal P_θ-a.s., and hence, ∀ θ, R(θ, T′) = R(θ, T), so that T′ can not be strictly better than T.

(ii) Suppose that T is inadmissible. Then, for some T′, R(θ, T′) ≤ R(θ, T) for all θ, and, for some θ_0, R(θ_0, T′) < R(θ_0, T). This implies that for some ε > 0, and some open neighborhood U ⊂ Θ of θ_0, we have

R(θ, T′) ≤ R(θ, T) − ε, ∀ θ ∈ U.

But then

r_w(T′) = ∫_U R(θ, T′) w(θ) dμ(θ) + ∫_{U^c} R(θ, T′) w(θ) dμ(θ)

≤ ∫_U R(θ, T) w(θ) dμ(θ) − ε Π(U) + ∫_{U^c} R(θ, T) w(θ) dμ(θ)

= r_w(T) − ε Π(U) < r_w(T).

We thus arrived at a contradiction. ⊔⊓
Lemma 5.2.2 Suppose that T is extended Bayes, and that for all T′, R(θ, T′) is continuous in θ. In fact, assume that for all open sets U ⊂ Θ,

( r_{w_m}(T) − inf_{T′} r_{w_m}(T′) ) / Π_m(U) → 0,

as m → ∞. Here Π_m(U) := ∫_U w_m(θ) dμ_m(θ) is the probability of U under the prior Π_m. Then T is admissible.
Proof. We start out as in the proof of (ii) in the previous lemma. Suppose that T is inadmissible. Then, for some T′, R(θ, T′) ≤ R(θ, T) for all θ, and, for some θ_0, R(θ_0, T′) < R(θ_0, T), so that for some ε > 0, and some open neighborhood U ⊂ Θ of θ_0, we have

R(θ, T′) ≤ R(θ, T) − ε, ∀ θ ∈ U.

This would give that for all m,

r_{w_m}(T′) ≤ r_{w_m}(T) − ε Π_m(U).

Suppose for simplicity that a Bayes decision T_m for the prior w_m exists, for all m, i.e.,

r_{w_m}(T_m) = inf_{T′} r_{w_m}(T′), m = 1, 2, ... .

Then, for all m,

r_{w_m}(T_m) ≤ r_{w_m}(T′) ≤ r_{w_m}(T) − ε Π_m(U),

or

( r_{w_m}(T) − r_{w_m}(T_m) ) / Π_m(U) ≥ ε > 0,

that is, we arrived at a contradiction. ⊔⊓
Example 5.2.1 Let X be N(θ, 1)-distributed, and R(θ, T) := E_θ (T − θ)² be the quadratic risk. We consider estimators of the form

T = aX + b, a > 0, b ∈ R.

Lemma T is admissible if and only if one of the following cases holds:
(i) a < 1,
(ii) a = 1 and b = 0.

Proof.
(⇐) (i)
First, we show that T is Bayes for some prior. It turns out that this works with a normal prior, i.e., we take θ ∼ N(c, τ²) for some c and τ² to be specified. With the notation

f(θ) ∝ g(x, θ)

we mean that f(θ)/g(x, θ) does not depend on θ. We have

w(θ | x) = p(x | θ) w(θ) / p(x) ∝ φ(x − θ) φ( (θ − c)/τ )

∝ exp[ −(1/2)( (x − θ)² + (θ − c)²/τ² ) ]

∝ exp[ −(1/2) ( θ − (τ² x + c)/(τ² + 1) )² (τ² + 1)/τ² ].

We conclude that the Bayes estimator is

T_Bayes = E(θ | X) = (τ² X + c)/(τ² + 1).

Taking

τ²/(τ² + 1) = a, c/(τ² + 1) = b,

yields T = T_Bayes.
Next, we check (i) in Lemma 5.2.1, i.e., that T is the unique Bayes decision. For quadratic loss, and for T = E(θ | X), the Bayes risk of an estimator T′ is

r_w(T′) = E var(θ | X) + E(T − T′)².

This follows from straightforward calculations:

r_w(T′) = ∫ R(θ, T′) w(θ) dθ = E R(ϑ, T′) = E(ϑ − T′)² = E( E( (ϑ − T′)² | X ) ),

and, with ϑ being the random variable,

E( (ϑ − T′)² | X ) = E( (ϑ − T)² | X ) + (T − T′)² = var(ϑ | X) + (T − T′)².

We conclude that if r_w(T′) = r_w(T), then

E(T − T′)² = 0.

Here, the expectation is with θ integrated out, i.e., with respect to the measure P with density

p(x) = ∫ p_θ(x) w(θ) dθ.

Now, we can write X = ϑ + ε, with ϑ N(c, τ²)-distributed, and with ε a standard normal random variable independent of ϑ. So X is N(c, τ² + 1)-distributed, that is, P is the N(c, τ² + 1)-distribution. Now, E(T − T′)² = 0 implies T = T′ P-a.s. Since P dominates all P_θ, we conclude that T = T′ P_θ-a.s., for all θ. So T is unique, and hence admissible.
(⇐) (ii)
In this case, T = X. We use Lemma 5.2.2. Because R(θ, T) = 1 for all θ, also r_w(T) = 1 for any prior. Let w_m be the density of the N(0, m)-distribution. As we have seen in the previous part of the proof, the Bayes estimator is

T_m = mX/(m + 1).

By the bias-variance decomposition, it has risk

R(θ, T_m) = m²/(m + 1)² + ( m/(m + 1) − 1 )² θ² = m²/(m + 1)² + θ²/(m + 1)².

As E θ² = m under the prior, its Bayes risk is

r_{w_m}(T_m) = m²/(m + 1)² + m/(m + 1)² = m/(m + 1).

It follows that

r_{w_m}(T) − r_{w_m}(T_m) = 1 − m/(m + 1) = 1/(m + 1).

So T is extended Bayes. But we need to prove the more refined property of Lemma 5.2.2. It is clear that here, we only need to consider open intervals U = (u, u + h), with u and h > 0 fixed. We have

Π_m(U) = Φ( (u + h)/√m ) − Φ( u/√m ) = (1/√m) φ( u/√m ) h + o(1/√m).

For m large,

φ( u/√m ) → φ(0) = 1/√(2π) > 1/4 (say),

so for m sufficiently large (depending on u),

φ( u/√m ) ≥ 1/4.

Thus, for m sufficiently large (depending on u and h), we have

Π_m(U) ≥ h/(4√m).

We conclude that for m sufficiently large,

( r_{w_m}(T) − r_{w_m}(T_m) ) / Π_m(U) ≤ 4/( h√m ).

As the right hand side converges to zero as m → ∞, this shows that X is admissible.
(⇒)
We now have to show that if (i) or (ii) do not hold, then T is not admissible. This means we have to consider two cases: a > 1, and a = 1, b ≠ 0. In the case a > 1, we have R(θ, aX + b) ≥ var(aX + b) = a² > 1 = R(θ, X), so aX + b is not admissible. When a = 1 and b ≠ 0, it is the bias term that makes aX + b inadmissible:

R(θ, X + b) = 1 + b² > 1 = R(θ, X). ⊔⊓
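The case analysis above can be checked directly from the closed-form risk of an affine estimator; the sketch below (plain Python, with a hypothetical function name) evaluates R(θ, aX + b) = a² + ((a − 1)θ + b)², and also illustrates that an admissible estimator such as X/2 need not dominate X everywhere.

```python
def risk_affine(theta, a, b):
    """Quadratic risk of T = aX + b when X ~ N(theta, 1):
    R(theta, T) = var + bias^2 = a^2 + ((a - 1) theta + b)^2."""
    return a**2 + ((a - 1) * theta + b) ** 2

thetas = [t / 2 for t in range(-10, 11)]
# a > 1: dominated by X at every theta (the risk of X is identically 1)
worse_a = all(risk_affine(t, 1.2, 0.0) > 1.0 for t in thetas)
# a = 1, b != 0: also dominated by X
worse_b = all(risk_affine(t, 1.0, 0.5) > 1.0 for t in thetas)
# a = 1/2 (admissible): beats X near theta = 0 but not for large theta
mixed = risk_affine(0.0, 0.5, 0.0) < 1.0 < risk_affine(5.0, 0.5, 0.0)
```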
Lemma 5.2.3 Let Θ = R and let {P_θ : θ ∈ Θ} be an exponential family in canonical form:

p_θ(x) = exp[ θ T(x) − d(θ) ] h(x).

Then T is an admissible estimator of g(θ) := ḋ(θ), under quadratic loss (i.e., under the loss L(θ, a) := |a − g(θ)|²).

Proof. Recall that

ḋ(θ) = E_θ T, d̈(θ) = var_θ(T) = I(θ).

Now, let T′ be some estimator, with expectation

E_θ T′ =: q(θ).

The bias of T′ is

b(θ) = q(θ) − g(θ),

or

q(θ) = b(θ) + g(θ) = b(θ) + ḋ(θ).

This implies

q̇(θ) = ḃ(θ) + I(θ).

By the Cramér–Rao lower bound,

R(θ, T′) = var_θ(T′) + b²(θ) ≥ [q̇(θ)]²/I(θ) + b²(θ) = [ḃ(θ) + I(θ)]²/I(θ) + b²(θ).

Suppose now that

R(θ, T′) ≤ R(θ, T), ∀ θ.

Because R(θ, T) = I(θ), this implies

[ḃ(θ) + I(θ)]²/I(θ) + b²(θ) ≤ I(θ),

or

I(θ) b²(θ) + 2 ḃ(θ) I(θ) + [ḃ(θ)]² ≤ 0.

This in turn implies

b²(θ) + 2 ḃ(θ) ≤ 0,

and hence (wherever b(θ) ≠ 0)

−ḃ(θ)/b²(θ) ≥ 1/2,

so

(d/dθ)( 1/b(θ) ) − 1/2 ≥ 0,

or

(d/dθ)( 1/b(θ) − θ/2 ) ≥ 0.

In other words, 1/b(θ) − θ/2 is an increasing function. Moreover, since ḃ(θ) ≤ −b²(θ)/2 ≤ 0, b(θ) is a decreasing function.

We will now show that this implies that b(θ) = 0 for all θ.

Suppose instead b(θ_0) < 0 for some θ_0. Let θ_0 < θ. Then

1/b(θ) ≥ 1/b(θ_0) + (θ − θ_0)/2,

so that for θ sufficiently large,

b(θ) > 0.

This is not possible, as b(θ) is a decreasing function.

Similarly, if b(θ_0) > 0, take θ ≤ θ_0 to find b(θ) < 0 for θ sufficiently small, which is again not possible.

We conclude that b(θ) = 0 for all θ, i.e., T′ is an unbiased estimator of g(θ). By the Cramér–Rao lower bound, we now conclude

R(θ, T′) = var_θ(T′) ≥ [ġ(θ)]²/I(θ) = I(θ) = R(θ, T). ⊔⊓
Example Let X be N(θ, 1)-distributed, with θ ∈ R unknown. Then X is an admissible estimator of θ.

Example Let X be N(0, σ²)-distributed, with σ² ∈ (0, ∞) unknown. Its density is

p_σ(x) = ( 1/√(2πσ²) ) exp( −x²/(2σ²) ) = exp[ θ T(x) − d(θ) ] h(x),

with

T(x) = −x²/2, θ = 1/σ², d(θ) = (log σ²)/2 = −(log θ)/2,

ḋ(θ) = −1/(2θ) = −σ²/2, d̈(θ) = 1/(2θ²) = σ⁴/2.

Observe that Θ = (0, ∞), which is not the whole real line. So Lemma 5.2.3 cannot be applied. We will now show that T is indeed not admissible. Define for all a > 0,

T_a := −aX²,

so that T = T_{1/2}. We have

R(θ, T_a) = var_θ(T_a) + bias²_θ(T_a) = 2a²σ⁴ + (a − 1/2)²σ⁴.

Thus, R(θ, T_a) is minimized at a = 1/6, giving

R(θ, T_{1/6}) = σ⁴/6 < σ⁴/2 = R(θ, T).
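The minimization over a is elementary; the sketch below (plain Python) evaluates the exact risk formula just derived and checks on a grid that a = 1/6 is indeed the minimizer.

```python
def risk(a, sigma2=1.0):
    """Exact quadratic risk of T_a = -a X^2 as an estimator of
    d'(theta) = -sigma^2/2, for X ~ N(0, sigma^2):
    var + bias^2 = (2 a^2 + (a - 1/2)^2) sigma^4."""
    return (2 * a**2 + (a - 0.5) ** 2) * sigma2**2

# scan a grid containing a = 1/6 exactly; the minimizer of the
# quadratic 3a^2 - a + 1/4 is a = 1/6
grid = [i / 300 for i in range(1, 301)]
best = min(grid, key=risk)
```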
5.3 Inadmissibility in higher-dimensional settings (to
be written)
Chapter 6
Asymptotic theory
In this chapter, the observations X_1, ..., X_n are considered as the first n of an infinite sequence of i.i.d. random variables X_1, ..., X_n, ... with values in X and with distribution P. We say that the X_i are i.i.d. copies of some random variable X ∈ X with distribution P. We let IP = P × P × ⋯ be the distribution of the whole sequence {X_i}_{i=1}^∞.

The model class for P is P := {P_θ : θ ∈ Θ}. When P = P_θ, we write IP = IP_θ = P_θ × P_θ × ⋯. The parameter of interest is

γ := g(θ) ∈ R^p,

where g : Θ → R^p is a given function. We let

Γ := { g(θ) : θ ∈ Θ }

be the parameter space for γ.

An estimator of γ, based on the data X_1, ..., X_n, is some function T_n = T_n(X_1, ..., X_n) of the data. We assume the estimator is defined for all n, i.e., we actually consider a sequence of estimators {T_n}_{n=1}^∞.

Remark Under the i.i.d. assumption, it is natural to assume that each T_n is a symmetric function of the data, that is,

T_n(X_1, ..., X_n) = T_n(X_{π_1}, ..., X_{π_n})

for all permutations π of {1, ..., n}. In that case, one can write T_n in the form T_n = Q(P̂_n), where P̂_n is the empirical distribution (see also Subsection 1.9.1).
6.1 Types of convergence

Definition Let {Z_n}_{n=1}^∞ and Z be R^p-valued random variables defined on the same probability space.¹ We say that Z_n converges in probability to Z if for all ε > 0,

lim_{n→∞} IP( ‖Z_n − Z‖ > ε ) = 0.

Notation: Z_n →_IP Z.

¹ Let (Ω, A, IP) be a probability space, and X : Ω → X and Y : Ω → Y be two measurable maps. Then X and Y are called random variables, and they are defined on the same probability space Ω.

Remark Chebyshev's inequality can be a tool to prove convergence in probability. It says that for all increasing functions ψ : [0, ∞) → [0, ∞), one has

IP( ‖Z_n − Z‖ ≥ ε ) ≤ E ψ( ‖Z_n − Z‖ ) / ψ(ε).

Definition Let {Z_n}_{n=1}^∞ and Z be R^p-valued random variables. We say that Z_n converges in distribution to Z, if for all continuous and bounded functions f,

lim_{n→∞} E f(Z_n) = E f(Z).

Notation: Z_n →_D Z.

Remark Convergence in probability implies convergence in distribution, but not the other way around.

Example Let X_1, X_2, ... be i.i.d. real-valued random variables with mean μ and variance σ². Let X̄_n := ∑_{i=1}^n X_i/n be the average of the first n. Then, by the central limit theorem (CLT),

√n( X̄_n − μ ) →_D N(0, σ²),

that is,

IP( √n( X̄_n − μ )/σ ≤ z ) → Φ(z), ∀ z.

The following theorem says that for convergence in distribution, one actually can do with one-dimensional random variables. We omit the proof.
Theorem 6.1.1 (Cramér–Wold device) Let {Z_n, Z} be a collection of R^p-valued random variables. Then

Z_n →_D Z ⇔ a^T Z_n →_D a^T Z ∀ a ∈ R^p.

Example Let X_1, X_2, ... be i.i.d. copies of a random variable X = (X^(1), ..., X^(p))^T in R^p. Assume that EX := μ = (μ_1, ..., μ_p)^T and Σ := Cov(X) := E X X^T − μ μ^T exist. Then for all a ∈ R^p,

E a^T X = a^T μ, var(a^T X) = a^T Σ a.

Define

X̄_n := ( X̄_n^(1), ..., X̄_n^(p) )^T.

By the 1-dimensional CLT, for all a ∈ R^p,

√n( a^T X̄_n − a^T μ ) →_D N(0, a^T Σ a).

The Cramér–Wold device therefore gives the p-dimensional CLT

√n( X̄_n − μ ) →_D N(0, Σ).
We recall the Portmanteau Theorem:

Theorem 6.1.2 Let {Z_n, Z} be a collection of R^p-valued random variables. Denote the distribution of Z by Q and let G := Q(Z ≤ ·) be its distribution function. The following statements are equivalent:
(i) Z_n →_D Z (i.e., E f(Z_n) → E f(Z) ∀ f bounded and continuous),
(ii) E f(Z_n) → E f(Z) ∀ f bounded and Lipschitz,²
(iii) E f(Z_n) → E f(Z) ∀ f bounded and Q-a.s. continuous,
(iv) IP(Z_n ≤ z) → G(z) for all G-continuity points z.
6.1.1 Stochastic order symbols

Let {Z_n} be a collection of R^p-valued random variables, and let {r_n} be strictly positive random variables. We write

Z_n = O_IP(1)

(Z_n is bounded in probability) if

lim_{M→∞} limsup_{n→∞} IP( ‖Z_n‖ > M ) = 0.

This is also called uniform tightness of the sequence {Z_n}. We write Z_n = O_IP(r_n) if Z_n/r_n = O_IP(1).

If Z_n converges in probability to zero, we write this as

Z_n = o_IP(1).

Moreover, Z_n = o_IP(r_n) (Z_n is of small order r_n in probability) if Z_n/r_n = o_IP(1).

6.1.2 Some implications of convergence

Lemma 6.1.1 Suppose that Z_n converges in distribution. Then Z_n = O_IP(1).

² A real-valued function f on (a subset of) R^p is Lipschitz if for a constant C and all (z, z̃) in the domain of f, |f(z) − f(z̃)| ≤ C‖z − z̃‖.

Proof. To simplify, take p = 1 (Cramér–Wold device). Let Z_n →_D Z, where Z has distribution function G. Then for every G-continuity point M,

IP(Z_n > M) → 1 − G(M),

and for every G-continuity point −M,

IP(Z_n ≤ −M) → G(−M).

Since 1 − G(M) as well as G(−M) converge to zero as M → ∞, the result follows. ⊔⊓

Example Let X_1, X_2, ... be i.i.d. copies of a random variable X ∈ R with EX = μ and var(X) < ∞. Then, by the CLT,

X̄_n − μ = O_IP( 1/√n ).
Theorem 6.1.3 (Slutsky) Let {Z_n, A_n, Z} be a collection of R^p-valued random variables, and let a ∈ R^p be a vector of constants. Assume that Z_n →_D Z and A_n →_IP a. Then

A_n^T Z_n →_D a^T Z.

Proof. Take a bounded Lipschitz function f, say

|f| ≤ C_B, |f(z) − f(z̃)| ≤ C_L ‖z − z̃‖.

Then

| E f(A_n^T Z_n) − E f(a^T Z) | ≤ | E f(A_n^T Z_n) − E f(a^T Z_n) | + | E f(a^T Z_n) − E f(a^T Z) |.

Because the function z ↦ f(a^T z) is bounded and Lipschitz (with Lipschitz constant ‖a‖ C_L), we know that the second term goes to zero. As for the first term, we argue as follows. Let ε > 0 and M > 0 be arbitrary. Define

S_n := { ‖Z_n‖ ≤ M, ‖A_n − a‖ ≤ ε }.

Then

| E f(A_n^T Z_n) − E f(a^T Z_n) | ≤ E | f(A_n^T Z_n) − f(a^T Z_n) |

= E | f(A_n^T Z_n) − f(a^T Z_n) | 1{S_n} + E | f(A_n^T Z_n) − f(a^T Z_n) | 1{S_n^c}

≤ C_L ε M + 2 C_B IP(S_n^c). (6.1)

Now,

IP(S_n^c) ≤ IP( ‖Z_n‖ > M ) + IP( ‖A_n − a‖ > ε ).

Thus, both terms in (6.1) can be made arbitrarily small by appropriately choosing ε small and n and M large. ⊔⊓
6.2 Consistency and asymptotic normality

Definition A sequence of estimators T_n of γ = g(θ) is called consistent if

T_n →_{IP_θ} γ.

Definition A sequence of estimators T_n of γ = g(θ) is called asymptotically normal with asymptotic covariance matrix V_θ, if

√n( T_n − γ ) →_{D_θ} N(0, V_θ).

Example Suppose P is the location model

P = { P_{μ,F_0}(X ≤ ·) := F_0(· − μ) : μ ∈ R, F_0 ∈ F_0 }.

The parameter is then θ = (μ, F_0) and Θ = R × F_0. We assume for all F_0 ∈ F_0

∫ x dF_0(x) = 0, σ²_{F_0} := ∫ x² dF_0(x) < ∞.

Let g(θ) := μ and T_n := (X_1 + ⋯ + X_n)/n = X̄_n. Then T_n is a consistent estimator of μ and, by the central limit theorem,

√n( T_n − μ ) →_{D_θ} N(0, σ²_{F_0}).
6.2.1 Asymptotic linearity

As we will show, for many estimators, asymptotic normality is a consequence of asymptotic linearity, that is, the estimator is approximately an average, to which we can apply the CLT.

Definition The sequence of estimators T_n of γ = g(θ) is called asymptotically linear if for a function l_θ : X → R^p, with E_θ l_θ(X) = 0 and

E_θ l_θ(X) l_θ^T(X) := V_θ < ∞,

it holds that

T_n − γ = (1/n) ∑_{i=1}^n l_θ(X_i) + o_{IP_θ}( n^{−1/2} ).

Remark We then call l_θ the influence function of (the sequence) T_n. Roughly speaking, l_θ(x) approximately measures the influence of an additional observation x.

Example Assuming the entries of X have finite variance, the estimator T_n := X̄_n is a linear, and hence asymptotically linear, estimator of the mean μ, with influence function

l_θ(x) = x − μ.
Example 6.2.1 Let X be real-valued, with E_θ X := μ, var_θ(X) := σ², and κ := E_θ(X − μ)⁴ (assumed to exist). Consider the estimator

σ̂²_n := (1/n) ∑_{i=1}^n ( X_i − X̄_n )²

of σ². We rewrite

σ̂²_n = (1/n) ∑_{i=1}^n (X_i − μ)² + ( X̄_n − μ )² − (2/n) ∑_{i=1}^n (X_i − μ)( X̄_n − μ )

= (1/n) ∑_{i=1}^n (X_i − μ)² − ( X̄_n − μ )².

Because, by the CLT, X̄_n − μ = O_{IP_θ}( n^{−1/2} ), we get

σ̂²_n = (1/n) ∑_{i=1}^n (X_i − μ)² + O_{IP_θ}(1/n).

So σ̂²_n is asymptotically linear with influence function

l_θ(x) = (x − μ)² − σ².

The asymptotic variance is

V_θ = E_θ( (X − μ)² − σ² )² = κ − σ⁴.
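A quick simulation sketch (plain Python; the sample size, replication count, and the standard-normal choice, for which κ − σ⁴ = 3 − 1 = 2, are arbitrary illustration values) illustrates this asymptotic variance.

```python
import random, statistics

# Monte Carlo sketch: for standard normal data, sqrt(n) (sigma_hat^2 - sigma^2)
# should have variance close to kappa - sigma^4 = 3 - 1 = 2.
random.seed(4)
n, reps = 400, 3000
zs = []
for _ in range(reps):
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    m = sum(x) / n
    s2 = sum((xi - m) ** 2 for xi in x) / n    # divides by n, as above
    zs.append(n ** 0.5 * (s2 - 1.0))
emp_var = statistics.variance(zs)  # close to 2 for large n and reps
```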
6.2.2 The δ-technique

Theorem 6.2.1 Let {T_n, Z} be a collection of random variables in R^p, let c ∈ R^p be a nonrandom vector, and let {r_n} be a nonrandom sequence of positive numbers, with r_n ↓ 0. Moreover, let h : R^p → R be differentiable at c, with derivative ḣ(c) ∈ R^p. Suppose that

( T_n − c )/r_n →_D Z.

Then

( h(T_n) − h(c) )/r_n →_D ḣ(c)^T Z.

Proof. By Slutsky's Theorem,

ḣ(c)^T ( T_n − c )/r_n →_D ḣ(c)^T Z.

Since ( T_n − c )/r_n converges in distribution, we know that ‖T_n − c‖/r_n = O_IP(1). Hence, ‖T_n − c‖ = O_IP(r_n). The result now follows from

h(T_n) − h(c) = ḣ(c)^T ( T_n − c ) + o( ‖T_n − c‖ ) = ḣ(c)^T ( T_n − c ) + o_IP(r_n). ⊔⊓

Corollary 6.2.1 Let T_n be an asymptotically linear estimator of γ := g(θ), with influence function l_θ and asymptotic covariance matrix V_θ. Suppose h is differentiable at γ. Then it follows, in the same way as in the previous theorem, that h(T_n) is an asymptotically linear estimator of h(γ), with influence function

ḣ(γ)^T l_θ

and asymptotic variance

ḣ(γ)^T V_θ ḣ(γ).
Example 6.2.2 Let X_1, ..., X_n be a sample from the Exponential(θ) distribution, with θ > 0. Then X̄_n is a linear estimator of E_θ X = 1/θ =: γ, with influence function l_θ(x) = x − 1/θ. The variance of √n( X̄_n − 1/θ ) is 1/θ² = γ². Thus, 1/X̄_n is an asymptotically linear estimator of θ. In this case, h(γ) = 1/γ, so that ḣ(γ) = −1/γ². The influence function of 1/X̄_n is thus

ḣ(γ) l_θ(x) = −(1/γ²)(x − γ) = −θ²( x − 1/θ ).

The asymptotic variance of 1/X̄_n is

[ḣ(γ)]² γ² = (1/γ⁴) γ² = 1/γ² = θ².

So

√n( 1/X̄_n − θ ) →_{D_θ} N(0, θ²).
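A Monte Carlo sketch (plain Python; θ, n, and the replication count are arbitrary illustration values) of this δ-technique conclusion: for Exponential(θ) data, √n( 1/X̄_n − θ ) should be approximately N(0, θ²).

```python
import random, statistics

# Monte Carlo sketch of Example 6.2.2: sqrt(n) (1/mean(X) - theta)
# should have variance near theta^2 (= 4 below).
random.seed(3)
theta, n, reps = 2.0, 400, 3000
zs = []
for _ in range(reps):
    xbar = statistics.fmean(random.expovariate(theta) for _ in range(n))
    zs.append(n ** 0.5 * (1 / xbar - theta))
emp_var = statistics.variance(zs)  # should be near theta^2 = 4
```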
Example 6.2.3 Consider again Example 6.2.1. Let X be real-valued, with E_θ X := μ, var_θ(X) := σ², and κ := E_θ(X − μ)⁴ (assumed to exist). Define moreover, for r = 1, 2, 3, 4, the r-th moment

μ_r := E_θ X^r.

We again consider the estimator

σ̂²_n := (1/n) ∑_{i=1}^n ( X_i − X̄_n )².

We have

σ̂²_n = h(T_n),

where T_n = ( T_{n,1}, T_{n,2} )^T, with

T_{n,1} = X̄_n, T_{n,2} = (1/n) ∑_{i=1}^n X_i²,

and

h(t) = t_2 − t_1², t = (t_1, t_2)^T.

The estimator T_n has influence function

l_θ(x) = ( x − μ_1 ; x² − μ_2 ).

By the 2-dimensional CLT,

√n( T_n − ( μ_1 ; μ_2 ) ) →_{D_θ} N(0, Σ),

with

Σ = ( μ_2 − μ_1²   μ_3 − μ_1 μ_2
      μ_3 − μ_1 μ_2   μ_4 − μ_2² ).

It holds that

ḣ( ( μ_1 ; μ_2 ) ) = ( −2μ_1 ; 1 ),

so that σ̂²_n has influence function

( −2μ_1 ; 1 )^T ( x − μ_1 ; x² − μ_2 ) = (x − μ)² − σ²

(invoking μ_1 = μ). After some calculations, one finds moreover that

( −2μ_1 ; 1 )^T Σ ( −2μ_1 ; 1 ) = κ − σ⁴,

i.e., the δ-method gives the same result as the ad hoc method in Example 6.2.1, as it of course should.
6.3 M-estimators

Let, for each c ∈ Γ, ρ_c(X) be some loss function. These are for instance constructed as in Chapter 2: we let L(θ, a) be the loss when taking action a. Then, we fix some decision d(x), and rewrite

L(θ, d(x)) := ρ_γ(x),

assuming the loss L depends on θ only via the parameter of interest γ = g(θ).

We now require that the risk

E_θ ρ_c(X)

is minimized at the value c = γ, i.e.,

γ = arg min_c E_θ ρ_c(X). (6.2)

Alternatively, given ρ_c, one may view (6.2) as the definition of γ.

If c ↦ ρ_c(x) is differentiable for all x, we write

ψ_c(x) := ρ̇_c(x) := (∂/∂c) ρ_c(x).

Then, assuming we may interchange differentiation and taking expectations,³ we have

E_θ ψ_γ(X) = 0.

³ If ‖∂ρ_c/∂c‖ ≤ H(·), where E_θ H(X) < ∞, then it follows from the dominated convergence theorem that ∂[ E_θ ρ_c(X) ]/∂c = E_θ[ ∂ρ_c(X)/∂c ].

Example 6.3.1 Let X ∈ R, and let the parameter of interest be the mean μ = E_θ X. Assume X has finite variance σ². Then

μ = arg min_c E_θ( X − c )²,

as (recall), by the bias-variance decomposition,

E_θ( X − c )² = σ² + ( μ − c )².

So in this case, we can take

ρ_c(x) = ( x − c )².
Example 6.3.2 Suppose Θ ⊂ R^p and that the densities p_θ := dP_θ/dν exist w.r.t. some σ-finite measure ν.

Definition The quantity

K( θ̃ | θ ) := E_θ log( p_θ(X) / p_θ̃(X) )

is called the Kullback–Leibler information, or the relative entropy.

Remark Some care has to be taken not to divide by zero! This can be handled e.g. by assuming that the support {x : p_θ(x) > 0} does not depend on θ (see also condition I in the CRLB of Chapter 3).

Define now

ρ_θ(x) := −log p_θ(x).

One easily sees that

K( θ̃ | θ ) = E_θ ρ_θ̃(X) − E_θ ρ_θ(X).

Lemma E_θ ρ_θ̃(X) is minimized at θ̃ = θ:

θ = arg min_{θ̃} E_θ ρ_θ̃(X).

Proof. We will show that

K( θ̃ | θ ) ≥ 0.

This follows from Jensen's inequality. Since the log-function is concave,

−K( θ̃ | θ ) = E_θ log( p_θ̃(X) / p_θ(X) ) ≤ log( E_θ( p_θ̃(X) / p_θ(X) ) ) = log 1 = 0. ⊔⊓
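The lemma is easy to see numerically. Below is a Monte Carlo sketch (plain Python) for two unit-variance normal densities, an arbitrary concrete choice for which the Kullback–Leibler information is known in closed form: K( θ̃ | θ ) = (θ − θ̃)²/2 ≥ 0.

```python
import math, random

# Monte Carlo estimate of K(theta~ | theta) = E_theta log(p_theta / p_theta~)
# for N(theta, 1) versus N(theta~, 1); exact value (theta - theta~)^2 / 2.
random.seed(5)
theta, theta_tilde = 0.0, 1.5

def log_density(x, mu):
    return -0.5 * (x - mu) ** 2 - 0.5 * math.log(2 * math.pi)

draws = (random.gauss(theta, 1.0) for _ in range(20000))
kl_mc = sum(log_density(x, theta) - log_density(x, theta_tilde)
            for x in draws) / 20000
# exact value: (0.0 - 1.5)^2 / 2 = 1.125
```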
Definition The M-estimator γ̂_n of γ is defined as

γ̂_n := arg min_c (1/n) ∑_{i=1}^n ρ_c(X_i).

The "M" in "M-estimator" stands for Minimizer (or, taking minus signs, Maximizer).

If ρ_c(x) is differentiable in c for all x, we generally can define γ̂_n as the solution of putting the derivatives

(∂/∂c) ∑_{i=1}^n ρ_c(X_i) = ∑_{i=1}^n ψ_c(X_i)

to zero. This is called the Z-estimator.

Definition The Z-estimator γ̂_n of γ is defined as a solution of the equations

(1/n) ∑_{i=1}^n ψ_{γ̂_n}(X_i) = 0.

Remark A solution γ̂_n is then assumed to exist.
6.3.1 Consistency of M-estimators
Note that minimizes a theoretical expectation, whereas the M-estimator
n
minimizes the empirical average. Likewise, is a solution of putting a theoret-
ical expectation to zero, whereas the Z-estimator
n
is the solution of putting
an empirical average to zero.
By the law of large numbers, averages converge to expectations. So the M-
estimator (Z-estimator) does make sense. However, consistency and further
properties are not immediate, because we actually need convergence the aver-
ages to expectations over a range of values c simultaneously. This is the
topic of empirical process theory.
We will borrow the notation from empirical process theory. That is, for a
function f : A R
r
, we let
P

f := E

f(X),

P
n
f :=
1
n
n

i=1
f(X
i
).
Then, by the law of large numbers, if P

[f[ < ,
(

P
n
P

)f 0, IP

a.s..
We now show consistency assuming the parameter space is compact. (The
assumption of compactness can however often be omitted if c
c
is convex.
We skip the details.)
6.3. M-ESTIMATORS 107
We will need that convergence of the minimum value also implies convergence of the arg min, i.e., convergence of the location of the minimum. To this end, we assume the minimizer to be well-separated.

Definition The minimizer $\gamma$ of $P_\theta \rho_c$ is called well-separated if for all $\epsilon > 0$,
$$\inf\left\{P_\theta \rho_c :\ c \in \Gamma,\ \|c - \gamma\| > \epsilon\right\} > P_\theta \rho_\gamma.$$

Theorem 6.3.1 Suppose that $\Gamma$ is compact, that $c \mapsto \rho_c(x)$ is continuous for all $x$, and that
$$P_\theta\left(\sup_{c \in \Gamma} |\rho_c|\right) < \infty.$$
Then
$$P_\theta \rho_{\hat\gamma_n} \to P_\theta \rho_\gamma, \quad \mathrm{IP}_\theta\text{-a.s.}$$
If $\gamma$ is well-separated, this implies $\hat\gamma_n \to \gamma$, $\mathrm{IP}_\theta$-a.s.
Proof. We will show the uniform convergence
$$\sup_{c \in \Gamma} |(\hat P_n - P_\theta)\rho_c| \to 0, \quad P_\theta\text{-a.s.} \tag{6.3}$$
This will indeed suffice, as (using $\hat P_n(\rho_{\hat\gamma_n} - \rho_\gamma) \le 0$ by the definition of $\hat\gamma_n$)
$$0 \le P_\theta(\rho_{\hat\gamma_n} - \rho_\gamma) = -(\hat P_n - P_\theta)(\rho_{\hat\gamma_n} - \rho_\gamma) + \hat P_n(\rho_{\hat\gamma_n} - \rho_\gamma)$$
$$\le -(\hat P_n - P_\theta)(\rho_{\hat\gamma_n} - \rho_\gamma) \le |(\hat P_n - P_\theta)\rho_{\hat\gamma_n}| + |(\hat P_n - P_\theta)\rho_\gamma|$$
$$\le \sup_{c \in \Gamma}|(\hat P_n - P_\theta)\rho_c| + |(\hat P_n - P_\theta)\rho_\gamma| \le 2\sup_{c \in \Gamma}|(\hat P_n - P_\theta)\rho_c|.$$
To show (6.3), define for each $\delta > 0$ and $c \in \Gamma$,
$$w(\cdot, \delta, c) := \sup_{\tilde c \in \Gamma:\ \|\tilde c - c\| < \delta} |\rho_{\tilde c} - \rho_c|.$$
Then for all $x$, as $\delta \downarrow 0$,
$$w(x, \delta, c) \downarrow 0.$$
So also, by dominated convergence,
$$P_\theta w(\cdot, \delta, c) \downarrow 0.$$
Hence, for all $\epsilon > 0$, there exists a $\delta_c$ such that
$$P_\theta w(\cdot, \delta_c, c) \le \epsilon.$$
Let
$$B_c := \{\tilde c \in \Gamma :\ \|\tilde c - c\| < \delta_c\}.$$
Then $\{B_c :\ c \in \Gamma\}$ is a covering of $\Gamma$ by open sets. Since $\Gamma$ is compact, there exists a finite sub-covering
$$B_{c_1} \cup \dots \cup B_{c_N}.$$
For $c \in B_{c_j}$,
$$|\rho_c - \rho_{c_j}| \le w(\cdot, \delta_{c_j}, c_j).$$
It follows that
$$\sup_{c \in \Gamma} |(\hat P_n - P_\theta)\rho_c| \le \max_{1 \le j \le N} |(\hat P_n - P_\theta)\rho_{c_j}|$$
$$+ \max_{1 \le j \le N} \hat P_n w(\cdot, \delta_{c_j}, c_j) + \max_{1 \le j \le N} P_\theta w(\cdot, \delta_{c_j}, c_j)$$
$$\to 2\max_{1 \le j \le N} P_\theta w(\cdot, \delta_{c_j}, c_j) \le 2\epsilon, \quad \mathrm{IP}_\theta\text{-a.s.} \qquad \square$$
Example The above theorem directly uses the definition of the M-estimator, and thus does not rely on having an explicit expression available. Here is an example where an explicit expression is indeed not possible. Consider the logistic location family, where the densities are
$$p_\theta(x) = \frac{e^{x-\theta}}{(1 + e^{x-\theta})^2}, \quad x \in \mathbb{R},$$
where $\theta \in \mathbb{R}$ is the location parameter. Take
$$\rho_\theta(x) := -\log p_\theta(x) = -(x - \theta) + 2\log(1 + e^{x-\theta}).$$
So $\hat\theta_n$ is a solution of
$$\frac{2}{n}\sum_{i=1}^n \frac{e^{X_i - \hat\theta_n}}{1 + e^{X_i - \hat\theta_n}} = 1.$$
This cannot be turned into an explicit expression. However, we do note the caveat that in order to be able to apply the above consistency theorem, we need to assume that $\Theta$ is bounded. This problem can be circumvented by using the result below for Z-estimators.
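Although no closed form exists, the score equation above is easy to solve numerically, since its left-hand side is strictly decreasing in the location parameter. A minimal bisection sketch (not from the text; the data values are illustrative):

```python
import math

# Sketch (not from the text): solve the logistic-location score
# equation (2/n) sum_i e^{X_i - t} / (1 + e^{X_i - t}) = 1 by
# bisection; the left-hand side is strictly decreasing in t.
data = [-0.3, 1.1, 0.4, 2.2, -1.0, 0.6, 0.9]

def score_lhs(t):
    return (2.0 / len(data)) * sum(
        1.0 / (1.0 + math.exp(-(x - t))) for x in data)

lo, hi = min(data), max(data)   # LHS > 1 at lo and < 1 at hi
for _ in range(80):
    mid = (lo + hi) / 2.0
    if score_lhs(mid) > 1.0:
        lo = mid
    else:
        hi = mid
theta_hat = (lo + hi) / 2.0
```

The root is unique and lies between the sample minimum and maximum, so the bracketing is always valid.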
To prove consistency of a Z-estimator of a one-dimensional parameter is relatively easy.

Theorem 6.3.2 Assume that $\Gamma \subset \mathbb{R}$, that $c \mapsto \psi_c(x)$ is continuous in $c$ for all $x$, that
$$P_\theta |\psi_c| < \infty, \quad \forall c,$$
and that there exists an $\epsilon > 0$ such that
$$P_\theta \psi_c > 0, \quad \gamma < c < \gamma + \epsilon,$$
$$P_\theta \psi_c < 0, \quad \gamma - \epsilon < c < \gamma.$$
Then for $n$ large enough, $\mathrm{IP}_\theta$-a.s., there is a solution $\hat\gamma_n$ of $\hat P_n \psi_{\hat\gamma_n} = 0$, and this solution $\hat\gamma_n$ is consistent.

Proof. Let $0 < \epsilon' < \epsilon$ be arbitrary. By the law of large numbers, for $n$ sufficiently large, $\mathrm{IP}_\theta$-a.s.,
$$\hat P_n \psi_{\gamma + \epsilon'} > 0, \quad \hat P_n \psi_{\gamma - \epsilon'} < 0.$$
The continuity of $c \mapsto \psi_c$ implies that then $\hat P_n \psi_{\hat\gamma_n} = 0$ for some $|\hat\gamma_n - \gamma| < \epsilon'$. $\qquad \square$
6.3.2 Asymptotic normality of M-estimators
Recall the CLT: for each $f : \mathcal{X} \to \mathbb{R}^r$ for which
$$\Sigma := P_\theta f f^T - (P_\theta f)(P_\theta f)^T$$
exists, we have
$$\sqrt{n}(\hat P_n - P_\theta) f \xrightarrow{\mathcal{D}_\theta} \mathcal{N}(0, \Sigma).$$
Denote now
$$\nu_n(c) := \sqrt{n}(\hat P_n - P_\theta)\psi_c, \quad c \in \Gamma.$$

Definition The stochastic process
$$\{\nu_n(c) :\ c \in \Gamma\}$$
is called the empirical process indexed by $c$. The empirical process is called asymptotically continuous at $\gamma$ if for all (possibly random) sequences $\{\gamma_n\}$ in $\Gamma$, with $\|\gamma_n - \gamma\| = o_{\mathrm{IP}_\theta}(1)$, we have
$$|\nu_n(\gamma_n) - \nu_n(\gamma)| = o_{\mathrm{IP}_\theta}(1).$$

For verifying asymptotic continuity, there are various tools, which involve complexity assumptions on the map $c \mapsto \psi_c$. This goes beyond the scope of these notes. Asymptotic linearity can also be established directly, under rather restrictive assumptions, see Theorem 6.3.4 below. But first, let us see what asymptotic continuity can bring us.

We assume that
$$M_\theta := \frac{\partial}{\partial c^T} P_\theta \psi_c \Big|_{c = \gamma}$$
exists. It is a $p \times p$ matrix. We require it to be of full rank, which amounts to assuming that $\gamma$, as a solution to $P_\theta \psi_\gamma = 0$, is well-identified.
Theorem 6.3.3 Let $\hat\gamma_n$ be the Z-estimator of $\gamma$, and suppose that $\hat\gamma_n$ is a consistent estimator of $\gamma$, and that $\nu_n$ is asymptotically continuous at $\gamma$. Suppose moreover $M_\theta^{-1}$ exists, and also
$$J_\theta := P_\theta \psi_\gamma \psi_\gamma^T.$$
Then $\hat\gamma_n$ is asymptotically linear, with influence function
$$l_\theta = -M_\theta^{-1}\psi_\gamma.$$
Hence
$$\sqrt{n}(\hat\gamma_n - \gamma) \xrightarrow{\mathcal{D}_\theta} \mathcal{N}(0, V_\theta),$$
with
$$V_\theta = M_\theta^{-1} J_\theta M_\theta^{-1}.$$
Proof. By definition,
$$\hat P_n \psi_{\hat\gamma_n} = 0, \quad P_\theta \psi_\gamma = 0.$$
So we have
$$0 = \hat P_n \psi_{\hat\gamma_n} = (\hat P_n - P_\theta)\psi_{\hat\gamma_n} + P_\theta \psi_{\hat\gamma_n} = (\hat P_n - P_\theta)\psi_{\hat\gamma_n} + P_\theta(\psi_{\hat\gamma_n} - \psi_\gamma)$$
$$= (i) + (ii).$$
For the first term, we use the asymptotic continuity of $\nu_n$ at $\gamma$:
$$(i) = (\hat P_n - P_\theta)\psi_{\hat\gamma_n} = \nu_n(\hat\gamma_n)/\sqrt{n} = \nu_n(\gamma)/\sqrt{n} + o_{\mathrm{IP}_\theta}(1/\sqrt{n})$$
$$= \hat P_n \psi_\gamma + o_{\mathrm{IP}_\theta}(1/\sqrt{n}).$$
For the second term, we use the differentiability of $P_\theta \psi_c$ at $c = \gamma$:
$$(ii) = P_\theta(\psi_{\hat\gamma_n} - \psi_\gamma) = M_\theta(\hat\gamma_n - \gamma) + o(\|\hat\gamma_n - \gamma\|).$$
So we arrive at
$$0 = \hat P_n \psi_\gamma + o_{\mathrm{IP}_\theta}(1/\sqrt{n}) + M_\theta(\hat\gamma_n - \gamma) + o(\|\hat\gamma_n - \gamma\|).$$
Because, by the CLT, $\hat P_n \psi_\gamma = O_{\mathrm{IP}_\theta}(1/\sqrt{n})$, this implies $\|\hat\gamma_n - \gamma\| = O_{\mathrm{IP}_\theta}(1/\sqrt{n})$. Hence
$$0 = \hat P_n \psi_\gamma + M_\theta(\hat\gamma_n - \gamma) + o_{\mathrm{IP}_\theta}(1/\sqrt{n}),$$
or
$$M_\theta(\hat\gamma_n - \gamma) = -\hat P_n \psi_\gamma + o_{\mathrm{IP}_\theta}(1/\sqrt{n}),$$
or
$$(\hat\gamma_n - \gamma) = -M_\theta^{-1}\hat P_n \psi_\gamma + o_{\mathrm{IP}_\theta}(1/\sqrt{n}). \qquad \square$$
In the next theorem, we assume quite a lot of smoothness for the functions $\psi_c$ (namely, derivatives that are Lipschitz), so that asymptotic linearity can be proved by straightforward arguments. We stress however that such smoothness assumptions are by no means necessary.

Theorem 6.3.4 Let $\hat\gamma_n$ be the Z-estimator of $\gamma$, and suppose that $\hat\gamma_n$ is a consistent estimator of $\gamma$. Suppose that, for all $c$ in a neighborhood $\{c \in \Gamma :\ \|c - \gamma\| < \epsilon\}$, the map $c \mapsto \psi_c(x)$ is differentiable for all $x$, with derivative
$$\dot\psi_c(x) := \frac{\partial}{\partial c^T}\psi_c(x)$$
(a $p \times p$ matrix). Assume moreover that, for all $c$ and $\tilde c$ in a neighborhood of $\gamma$, and for all $x$, we have, in matrix-norm$^4$,
$$\|\dot\psi_c(x) - \dot\psi_{\tilde c}(x)\| \le H(x)\|c - \tilde c\|,$$

$^4$For a matrix $A$, $\|A\| := \sup_{v \ne 0} \|Av\|/\|v\|$.
where $H : \mathcal{X} \to \mathbb{R}$ satisfies
$$P_\theta H < \infty.$$
Then
$$M_\theta = \frac{\partial}{\partial c^T} P_\theta \psi_c \Big|_{c = \gamma} = P_\theta \dot\psi_\gamma. \tag{6.4}$$
Assuming $M_\theta^{-1}$ and $J_\theta := E_\theta \psi_\gamma(X)\psi_\gamma^T(X)$ exist, the influence function of $\hat\gamma_n$ is
$$l_\theta = -M_\theta^{-1}\psi_\gamma.$$
Proof. Result (6.4) follows from the dominated convergence theorem.

By the mean value theorem,
$$0 = \hat P_n \psi_{\hat\gamma_n} = \hat P_n \psi_\gamma + \hat P_n \dot\psi_{\tilde\gamma_n(\cdot)}(\hat\gamma_n - \gamma),$$
where for all $x$, $\|\tilde\gamma_n(x) - \gamma\| \le \|\hat\gamma_n - \gamma\|$. Thus
$$0 = \hat P_n \psi_\gamma + \hat P_n \dot\psi_\gamma(\hat\gamma_n - \gamma) + \hat P_n(\dot\psi_{\tilde\gamma_n(\cdot)} - \dot\psi_\gamma)(\hat\gamma_n - \gamma),$$
so that
$$\left\|\hat P_n \psi_\gamma + \hat P_n \dot\psi_\gamma(\hat\gamma_n - \gamma)\right\| \le \hat P_n H \|\hat\gamma_n - \gamma\|^2 = O_{\mathrm{IP}_\theta}(1)\|\hat\gamma_n - \gamma\|^2,$$
where in the last inequality we used $P_\theta H < \infty$. Now, by the law of large numbers,
$$\hat P_n \dot\psi_\gamma = P_\theta \dot\psi_\gamma + o_{\mathrm{IP}_\theta}(1) = M_\theta + o_{\mathrm{IP}_\theta}(1).$$
Thus
$$\left\|\hat P_n \psi_\gamma + M_\theta(\hat\gamma_n - \gamma) + o_{\mathrm{IP}_\theta}(\|\hat\gamma_n - \gamma\|)\right\| = O_{\mathrm{IP}_\theta}(\|\hat\gamma_n - \gamma\|^2).$$
Because $\hat P_n \psi_\gamma = O_{\mathrm{IP}_\theta}(1/\sqrt{n})$, this ensures that $\|\hat\gamma_n - \gamma\| = O_{\mathrm{IP}_\theta}(1/\sqrt{n})$. It follows that
$$\left\|\hat P_n \psi_\gamma + M_\theta(\hat\gamma_n - \gamma) + o_{\mathrm{IP}_\theta}(1/\sqrt{n})\right\| = O_{\mathrm{IP}_\theta}(1/n).$$
Hence
$$M_\theta(\hat\gamma_n - \gamma) = -\hat P_n \psi_\gamma + o_{\mathrm{IP}_\theta}(1/\sqrt{n}),$$
and so
$$(\hat\gamma_n - \gamma) = -M_\theta^{-1}\hat P_n \psi_\gamma + o_{\mathrm{IP}_\theta}(1/\sqrt{n}). \qquad \square$$
Example 6.3.3 In this example, we show that, under regularity conditions, the MLE is asymptotically normal with asymptotic covariance matrix the inverse of the Fisher-information matrix $I(\theta)$. Let $\mathcal{P} = \{P_\theta :\ \theta \in \Theta\}$ be dominated by a $\sigma$-finite dominating measure $\nu$, and write the densities as $p_\theta = dP_\theta/d\nu$.
Suppose that $\Theta \subset \mathbb{R}^p$. Assume condition I, i.e. that the support of $p_\theta$ does not depend on $\theta$. As loss we take minus the log-likelihood:
$$\rho_\vartheta := -\log p_\vartheta.$$
We suppose that the score function
$$s_\vartheta := \frac{\partial}{\partial\vartheta}\log p_\vartheta = \dot p_\vartheta / p_\vartheta$$
exists, and that we may interchange differentiation and integration, so that the score has mean zero:
$$P_\theta s_\theta = \int \dot p_\theta \, d\nu = \frac{\partial}{\partial\theta}\int p_\theta \, d\nu = \frac{\partial}{\partial\theta} 1 = 0.$$
Recall that the Fisher-information matrix is
$$I(\theta) := P_\theta s_\theta s_\theta^T.$$
Now, it is clear that $\psi_\vartheta = \dot\rho_\vartheta = -s_\vartheta$, and, assuming derivatives exist and that again we may change the order of differentiation and integration,
$$M_\theta = P_\theta \dot\psi_\theta = -P_\theta \dot s_\theta,$$
and
$$P_\theta \dot s_\theta = P_\theta\left(\frac{\ddot p_\theta}{p_\theta} - s_\theta s_\theta^T\right) = \left(\frac{\partial^2}{\partial\theta\partial\theta^T} 1\right) - P_\theta s_\theta s_\theta^T = 0 - I(\theta).$$
Hence, in this case, $M_\theta = I(\theta)$, and the influence function of the MLE
$$\hat\theta_n := \arg\max_\vartheta \hat P_n \log p_\vartheta$$
is
$$l_\theta = -M_\theta^{-1}\psi_\theta = I(\theta)^{-1} s_\theta.$$
So the asymptotic covariance matrix of the MLE $\hat\theta_n$ is
$$I(\theta)^{-1}\left(P_\theta s_\theta s_\theta^T\right) I(\theta)^{-1} = I(\theta)^{-1}.$$
Example 6.3.4 In this example, the parameter of interest is the $\alpha$-quantile. We will consider a loss function which does not satisfy regularity conditions, but nevertheless leads to an asymptotically linear estimator.

Let $\mathcal{X} := \mathbb{R}$. The distribution function of $X$ is denoted by $F$. Let $0 < \alpha < 1$ be given. The $\alpha$-quantile of $F$ is $\gamma = F^{-1}(\alpha)$ (assumed to exist). We moreover assume that $F$ has density $f$ with respect to Lebesgue measure, and that $f(x) > 0$ in a neighborhood of $\gamma$. As loss function we take
$$\rho_c(x) := \rho(x - c),$$
where
$$\rho(x) := (1-\alpha)|x|\,1\{x < 0\} + \alpha|x|\,1\{x > 0\}.$$
We now first check that
$$\arg\min_c P_\theta \rho_c = F^{-1}(\alpha) := \gamma.$$
We have
$$\dot\rho(x) = \alpha\,1\{x > 0\} - (1-\alpha)\,1\{x < 0\}.$$
Note that $\dot\rho$ does not exist at $x = 0$. This is one of the irregularities in this example.

It follows that
$$\psi_c(x) = -\alpha\,1\{x > c\} + (1-\alpha)\,1\{x < c\}.$$
Hence
$$P_\theta \psi_c = -\alpha + F(c)$$
(the fact that $\psi_c$ is not defined at $x = c$ can be shown not to be a problem, roughly because a single point has probability zero, as $F$ is assumed to be continuous). So
$$P_\theta \psi_\gamma = 0, \quad \text{for } \gamma = F^{-1}(\alpha).$$
We now derive $M_\theta$, which is a scalar in this case:
$$M_\theta = \frac{d}{dc} P_\theta \psi_c \Big|_{c = \gamma} = \frac{d}{dc}(-\alpha + F(c))\Big|_{c = \gamma} = f(\gamma) = f(F^{-1}(\alpha)).$$
The influence function is thus$^5$
$$l_\theta(x) = -M_\theta^{-1}\psi_\gamma(x) = \frac{1}{f(\gamma)}\left(-1\{x < \gamma\} + \alpha\right).$$
We conclude that, for
$$\hat\gamma_n = \arg\min_c \hat P_n \rho_c,$$
which we write as the sample quantile $\hat\gamma_n = \hat F_n^{-1}(\alpha)$ (or an approximation thereof up to order $o_{\mathrm{IP}_\theta}(1/\sqrt{n})$), one has
$$\sqrt{n}(\hat F_n^{-1}(\alpha) - F^{-1}(\alpha)) \xrightarrow{\mathcal{D}_\theta} \mathcal{N}\left(0, \frac{\alpha(1-\alpha)}{f^2(F^{-1}(\alpha))}\right).$$

$^5$Note that in the special case $\alpha = 1/2$ (where $\gamma$ is the median), this becomes
$$l_\theta(x) = \begin{cases} -\frac{1}{2f(\gamma)}, & x < \gamma \\ +\frac{1}{2f(\gamma)}, & x > \gamma \end{cases}.$$
Example 6.3.5 In this example, we illustrate that the Huber-estimator is asymptotically linear. Let again $\mathcal{X} = \mathbb{R}$ and $F$ be the distribution function of $X$. We let the parameter of interest be a location parameter $\gamma$. The Huber loss function is
$$\rho_c(x) = \rho(x - c),$$
with
$$\rho(x) = \begin{cases} x^2, & |x| \le k \\ k(2|x| - k), & |x| > k \end{cases}.$$
We define $\gamma$ as
$$\gamma := \arg\min_c P_\theta \rho_c.$$
It holds that
$$\dot\rho(x) = \begin{cases} 2x, & |x| \le k \\ +2k, & x > k \\ -2k, & x < -k \end{cases}.$$
Therefore,
$$\psi_c(x) = \begin{cases} -2(x - c), & |x - c| \le k \\ -2k, & x - c > k \\ +2k, & x - c < -k \end{cases}.$$
One easily derives that
$$P_\theta \psi_c = -2\int_{-k+c}^{k+c} x\,dF(x) + 2c[F(k+c) - F(-k+c)] - 2k[1 - F(k+c)] + 2kF(-k+c).$$
So
$$M_\theta = \frac{d}{dc} P_\theta \psi_c \Big|_{c=\gamma} = 2[F(k+\gamma) - F(-k+\gamma)].$$
The influence function of the Huber estimator is
$$l_\theta(x) = \frac{1}{F(k+\gamma) - F(-k+\gamma)} \begin{cases} x - \gamma, & |x - \gamma| \le k \\ +k, & x - \gamma > k \\ -k, & x - \gamma < -k \end{cases}.$$
For $k \downarrow 0$, this corresponds to the influence function of the median.
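In code, the Huber Z-estimator solves $\sum_i \psi(X_i - c) = 0$ where $\psi$ clips its argument to $[-k, k]$ (the factor $-2$ does not change the root set). A bisection sketch (not from the text; the data set with one gross outlier is illustrative):

```python
# Sketch (not from the text): Huber Z-estimator via bisection,
# with psi(u) = u clipped to [-k, k]; sum_i psi(X_i - c) is
# decreasing in c, so bisection on [min, max] works.
data = [0.2, -0.5, 1.1, 0.4, 0.9, -0.1, 50.0]   # one gross outlier

def clip(u, k):
    return max(-k, min(k, u))

def huber(xs, k, iters=80):
    lo, hi = min(xs), max(xs)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if sum(clip(x - mid, k) for x in xs) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

mean = sum(data) / len(data)           # dragged up by the outlier
hub_small_k = huber(data, k=1.0)       # robust: outlier influence clipped
hub_large_k = huber(data, k=1e6)       # psi linear, so the root is the mean
```

For very large $k$ the clipping never triggers and the estimator is the mean; for moderate $k$ the outlier contributes at most $\pm k$, as the bounded influence function predicts.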
6.4 Plug-in estimators
When $\mathcal{X}$ is Euclidean space, one can define the distribution function $F(x) := P_\theta(X \le x)$ and the empirical distribution function
$$\hat F_n(x) = \frac{1}{n}\#\{X_i \le x,\ 1 \le i \le n\}.$$
This is the distribution function of a probability measure that puts mass $1/n$ at each observation. For general $\mathcal{X}$, we define likewise the empirical distribution $\hat P_n$ as the distribution that puts mass $1/n$ at each observation, i.e., more formally,
$$\hat P_n := \frac{1}{n}\sum_{i=1}^n \delta_{X_i},$$
where $\delta_x$ is a point mass at $x$. Thus, for (measurable) sets $A \subset \mathcal{X}$,
$$\hat P_n(A) = \frac{1}{n}\#\{X_i \in A,\ 1 \le i \le n\}.$$
For (measurable) functions $f : \mathcal{X} \to \mathbb{R}^r$, we write, as in the previous section,
$$\hat P_n f := \frac{1}{n}\sum_{i=1}^n f(X_i) = \int f\,d\hat P_n.$$
Thus, for sets,
$$\hat P_n(A) = \hat P_n 1_A.$$
Again, as in the previous section, we use the same notations for expectations under $P_\theta$:
$$P_\theta f := E_\theta f(X) = \int f\,dP_\theta,$$
so that
$$P_\theta(A) = P_\theta 1_A.$$
The parameter of interest is denoted as
$$\gamma = g(\theta) \in \mathbb{R}^p.$$
It can often be written in the form
$$\gamma = Q(P_\theta),$$
where $Q$ is some functional on (a superset of) the model class $\mathcal{P}$. Assuming $Q$ is also defined at the empirical measure $\hat P_n$, the plug-in estimator of $\gamma$ is now
$$T_n := Q(\hat P_n).$$
Conversely,

Definition If a statistic $T_n$ can be written as $T_n = Q(\hat P_n)$, then it is called a Fisher-consistent estimator of $\gamma = g(\theta)$ if $Q(P_\theta) = g(\theta)$ for all $\theta \in \Theta$.

We will also encounter modifications, where
$$T_n = Q_n(\hat P_n),$$
and for $n$ large,
$$Q_n(P_\theta) \approx Q(P_\theta) = g(\theta).$$
Example Let $\gamma := h(P_\theta f)$. The plug-in estimator is then $T_n = h(\hat P_n f)$.

Example The M-estimator $\hat\gamma_n = \arg\min_c \hat P_n \rho_c$ is a plug-in estimator of $\gamma = \arg\min_c P_\theta \rho_c$ (and similarly for the Z-estimator).

Example Let $\mathcal{X} = \mathbb{R}$ and consider the $\alpha$-trimmed mean
$$T_n := \frac{1}{n - 2[n\alpha]}\sum_{i=[n\alpha]+1}^{n-[n\alpha]} X_{(i)}.$$
What is its theoretical counterpart? Because the $i$-th order statistic $X_{(i)}$ can be written as
$$X_{(i)} = \hat F_n^{-1}(i/n),$$
and in fact
$$X_{(i)} = \hat F_n^{-1}(u), \quad i/n \le u < (i+1)/n,$$
we may write, for $\alpha_n := [n\alpha]/n$,
$$T_n = \frac{n}{n - 2[n\alpha]} \cdot \frac{1}{n}\sum_{i=[n\alpha]+1}^{n-[n\alpha]} \hat F_n^{-1}(i/n) = \frac{1}{1 - 2\alpha_n}\int_{\alpha_n + 1/n}^{1 - \alpha_n} \hat F_n^{-1}(u)\,du := Q_n(\hat P_n).$$
Replacing $\hat F_n$ by $F$ gives
$$Q_n(F) = \frac{1}{1 - 2\alpha_n}\int_{\alpha_n + 1/n}^{1 - \alpha_n} F^{-1}(u)\,du$$
$$\approx \frac{1}{1 - 2\alpha}\int_\alpha^{1 - \alpha} F^{-1}(u)\,du = \frac{1}{1 - 2\alpha}\int_{F^{-1}(\alpha)}^{F^{-1}(1-\alpha)} x\,dF(x) := Q(P_\theta).$$

Example Let $\mathcal{X} = \mathbb{R}$, and suppose $X$ has density $f$ w.r.t. Lebesgue measure. Suppose $f$ is the parameter of interest. We may write
$$f(x) = \lim_{h \downarrow 0} \frac{F(x+h) - F(x-h)}{2h}.$$
Replacing $F$ by $\hat F_n$ here does not make sense. Thus, this is an example where $Q(P) = f$ is only well defined for distributions $P$ that have a density $f$. We may however slightly extend the plug-in idea, by using the estimator
$$\hat f_n(x) = \frac{\hat F_n(x + h_n) - \hat F_n(x - h_n)}{2h_n} := Q_n(\hat P_n),$$
with $h_n$ "small" ($h_n \downarrow 0$ as $n \to \infty$).
6.4.1 Consistency of plug-in estimators
We first present the uniform convergence of the empirical distribution function to the theoretical one. Such uniform convergence results hold also in much more general settings (see also (6.3) in the proof of consistency for M-estimators).

Theorem 6.4.1 (Glivenko-Cantelli) Let $\mathcal{X} = \mathbb{R}$. We have
$$\sup_x |\hat F_n(x) - F(x)| \to 0, \quad \mathrm{IP}_\theta\text{-a.s.}$$

Proof. We know that by the law of large numbers, for all $x$,
$$|\hat F_n(x) - F(x)| \to 0, \quad \mathrm{IP}_\theta\text{-a.s.},$$
so also for any finite collection $\{a_1, \dots, a_N\}$,
$$\max_{1 \le j \le N} |\hat F_n(a_j) - F(a_j)| \to 0, \quad \mathrm{IP}_\theta\text{-a.s.}$$
Let $\epsilon > 0$ be arbitrary, and take $a_0 < a_1 < \dots < a_{N-1} < a_N$ in such a way that
$$F(a_j) - F(a_{j-1}) \le \epsilon, \quad j = 1, \dots, N,$$
where $F(a_0) := 0$ and $F(a_N) := 1$. Then, when $x \in (a_{j-1}, a_j]$,
$$\hat F_n(x) - F(x) \le \hat F_n(a_j) - F(a_{j-1}) \le \hat F_n(a_j) - F(a_j) + \epsilon,$$
and
$$\hat F_n(x) - F(x) \ge \hat F_n(a_{j-1}) - F(a_j) \ge \hat F_n(a_{j-1}) - F(a_{j-1}) - \epsilon,$$
so
$$\sup_x |\hat F_n(x) - F(x)| \le \max_{1 \le j \le N} |\hat F_n(a_j) - F(a_j)| + \epsilon \to \epsilon, \quad \mathrm{IP}_\theta\text{-a.s.} \qquad \square$$
Example Let $\mathcal{X} = \mathbb{R}$ and let $F$ be the distribution function of $X$. We consider estimating the median $\gamma := F^{-1}(1/2)$. We assume $F$ to be continuous and strictly increasing. The sample median is
$$T_n := \hat F_n^{-1}(1/2) := \begin{cases} X_{((n+1)/2)}, & n \text{ odd} \\ [X_{(n/2)} + X_{(n/2+1)}]/2, & n \text{ even} \end{cases}.$$
So
$$\hat F_n(T_n) = \frac{1}{2} + \begin{cases} 1/(2n), & n \text{ odd} \\ 0, & n \text{ even} \end{cases}.$$
It follows that
$$|F(T_n) - F(\gamma)| \le |\hat F_n(T_n) - F(T_n)| + |\hat F_n(T_n) - F(\gamma)|$$
$$= |\hat F_n(T_n) - F(T_n)| + |\hat F_n(T_n) - \tfrac{1}{2}| \le |\hat F_n(T_n) - F(T_n)| + \frac{1}{2n} \to 0, \quad \mathrm{IP}_\theta\text{-a.s.}$$
Since $F$ is continuous and strictly increasing, this gives $\hat F_n^{-1}(1/2) = T_n \to \gamma = F^{-1}(1/2)$, $\mathrm{IP}_\theta$-a.s., i.e., the sample median is a consistent estimator of the population median.
6.4.2 Asymptotic normality of plug-in estimators
Let $\gamma := Q(P) \in \mathbb{R}^p$ be the parameter of interest. The idea in this subsection is to apply a $\delta$-method, but now in a nonparametric framework. The parametric $\delta$-method says that if $\hat\theta_n$ is an asymptotically linear estimator of $\theta \in \mathbb{R}^p$, and if $\gamma = g(\theta)$ is some function of the parameter $\theta$, with $g$ being differentiable at $\theta$, then $g(\hat\theta_n)$ is an asymptotically linear estimator of $\gamma$. Now, we write $\gamma = Q(P)$ as a function of the probability measure $P$ (with $P = P_\theta$, so that $g(\theta) = Q(P_\theta)$). We let $P$ play the role of $\theta$, i.e., we use the probability measures themselves as parameterization of $\mathcal{P}$. We then have to redefine differentiability in an abstract setting, namely we differentiate w.r.t. $P$.

Definition
The influence function of $Q$ at $P$ is
$$l_P(x) := \lim_{\epsilon \downarrow 0} \frac{Q((1-\epsilon)P + \epsilon\delta_x) - Q(P)}{\epsilon}, \quad x \in \mathcal{X},$$
whenever the limit exists.
The map $Q$ is called Gateaux differentiable at $P$ if for all probability measures $\tilde P$, we have
$$\lim_{\epsilon \downarrow 0} \frac{Q((1-\epsilon)P + \epsilon\tilde P) - Q(P)}{\epsilon} = E_{\tilde P}\, l_P(X).$$
Let $d$ be some (pseudo-)metric on the space of probability measures. The map $Q$ is called Fréchet differentiable at $P$, with respect to the metric $d$, if
$$Q(\tilde P) - Q(P) = E_{\tilde P}\, l_P(X) + o(d(\tilde P, P)).$$

Remark 1 In line with the notation introduced previously, we write for a function $f : \mathcal{X} \to \mathbb{R}^r$ and a probability measure $\tilde P$ on $\mathcal{X}$,
$$\tilde P f := E_{\tilde P} f(X).$$

Remark 2 If $Q$ is Fréchet or Gateaux differentiable at $P$, then
$$P l_P\ (:= E_P l_P(X)) = 0.$$

Remark 3 If $Q$ is Fréchet differentiable at $P$, and if moreover
$$d((1-\epsilon)P + \epsilon\tilde P, P) = O(\epsilon), \quad \epsilon \downarrow 0,$$
then $Q$ is Gateaux differentiable at $P$:
$$Q((1-\epsilon)P + \epsilon\tilde P) - Q(P) = ((1-\epsilon)P + \epsilon\tilde P) l_P + o(\epsilon) = \epsilon\tilde P l_P + o(\epsilon).$$

We now show that Fréchet differentiable functionals are generally asymptotically linear.
Lemma 6.4.1 Suppose that $Q$ is Fréchet differentiable at $P$ with influence function $l_P$, and that
$$d(\hat P_n, P) = O_{\mathrm{IP}}(n^{-1/2}). \tag{6.5}$$
Then
$$Q(\hat P_n) - Q(P) = \hat P_n l_P + o_{\mathrm{IP}}(n^{-1/2}).$$
Proof. This follows immediately from the definition of Fréchet differentiability. $\qquad \square$

Corollary 6.4.1 Assume the conditions of Lemma 6.4.1, with influence function $l_P$ satisfying $V_P := P l_P l_P^T < \infty$. Then
$$\sqrt{n}(Q(\hat P_n) - Q(P)) \xrightarrow{\mathcal{D}_P} \mathcal{N}(0, V_P).$$

An example where (6.5) holds
Suppose $\mathcal{X} = \mathbb{R}$ and that we take
$$d(\tilde P, P) := \sup_x |\tilde F(x) - F(x)|.$$
Then indeed $d(\hat P_n, P) = O_{\mathrm{IP}}(n^{-1/2})$. This follows from Donsker's theorem, which we state here without proof:

Donsker's theorem Suppose $F$ is continuous. Then
$$\sup_x \sqrt{n}|\hat F_n(x) - F(x)| \xrightarrow{\mathcal{D}} Z,$$
where the random variable $Z$ has distribution function
$$G(z) = 1 - 2\sum_{j=1}^\infty (-1)^{j+1}\exp[-2j^2 z^2], \quad z \ge 0.$$

Fréchet differentiability is generally quite hard to prove, and often not even true. We will only illustrate Gateaux differentiability in some examples.
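The limiting distribution function $G$ in Donsker's theorem is an alternating series that converges very fast. A short sketch evaluating it (not from the text):

```python
import math

# Sketch (not from the text): evaluating the limit law of Donsker's
# theorem, G(z) = 1 - 2 * sum_{j>=1} (-1)^(j+1) exp(-2 j^2 z^2).
def G(z, terms=100):
    s = sum((-1) ** (j + 1) * math.exp(-2 * j * j * z * z)
            for j in range(1, terms + 1))
    return 1 - 2 * s

g136 = G(1.36)   # z ~ 1.36 is roughly the 95% point of this law
```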
Example 6.4.1 We consider the Z-estimator. Throughout in this example, we assume enough regularity.

Let $\gamma$ be defined by the equation
$$P\psi_\gamma = 0.$$
Let $P_\epsilon := (1-\epsilon)P + \epsilon\tilde P$, and let $\gamma_\epsilon$ be a solution of the equation
$$P_\epsilon \psi_{\gamma_\epsilon} = 0.$$
We assume that as $\epsilon \downarrow 0$, also $\gamma_\epsilon \to \gamma$. It holds that
$$(1-\epsilon)P\psi_{\gamma_\epsilon} + \epsilon\tilde P\psi_{\gamma_\epsilon} = 0,$$
so
$$P\psi_{\gamma_\epsilon} + \epsilon(\tilde P - P)\psi_{\gamma_\epsilon} = 0,$$
and hence
$$P(\psi_{\gamma_\epsilon} - \psi_\gamma) + \epsilon(\tilde P - P)\psi_{\gamma_\epsilon} = 0.$$
Assuming differentiability of $c \mapsto P\psi_c$, we obtain
$$P(\psi_{\gamma_\epsilon} - \psi_\gamma) = \left(\frac{\partial}{\partial c^T} P\psi_c \Big|_{c=\gamma}\right)(\gamma_\epsilon - \gamma) + o(|\gamma_\epsilon - \gamma|)$$
$$:= M_P(\gamma_\epsilon - \gamma) + o(|\gamma_\epsilon - \gamma|).$$
Moreover, again under regularity,
$$(\tilde P - P)\psi_{\gamma_\epsilon} = (\tilde P - P)\psi_\gamma + (\tilde P - P)(\psi_{\gamma_\epsilon} - \psi_\gamma) = (\tilde P - P)\psi_\gamma + o(1) = \tilde P\psi_\gamma + o(1).$$
It follows that
$$M_P(\gamma_\epsilon - \gamma) + o(|\gamma_\epsilon - \gamma|) + \epsilon\tilde P\psi_\gamma + o(\epsilon) = 0,$$
or, assuming $M_P$ to be invertible,
$$(\gamma_\epsilon - \gamma)(1 + o(1)) = -\epsilon M_P^{-1}\tilde P\psi_\gamma + o(\epsilon),$$
which gives
$$\frac{\gamma_\epsilon - \gamma}{\epsilon} \to -M_P^{-1}\tilde P\psi_\gamma.$$
The influence function is thus (as already seen in Subsection 6.3.2)
$$l_P = -M_P^{-1}\psi_\gamma.$$
Example 6.4.2 The $\alpha$-trimmed mean is a plug-in estimator of
$$\gamma := Q(P) = \frac{1}{1 - 2\alpha}\int_{F^{-1}(\alpha)}^{F^{-1}(1-\alpha)} x\,dF(x).$$
Using partial integration, we may write this as
$$(1 - 2\alpha)\gamma = (1-\alpha)F^{-1}(1-\alpha) - \alpha F^{-1}(\alpha) - \int_\alpha^{1-\alpha} v\,dF^{-1}(v).$$
The influence function of the quantile $F^{-1}(v)$ is
$$q_v(x) = -\frac{1}{f(F^{-1}(v))}\left(1\{x \le F^{-1}(v)\} - v\right)$$
(see Example 6.3.4), i.e., for the distribution $P_\epsilon = (1-\epsilon)P + \epsilon\tilde P$, with distribution function $F_\epsilon = (1-\epsilon)F + \epsilon\tilde F$, we have
$$\lim_{\epsilon \downarrow 0} \frac{F_\epsilon^{-1}(v) - F^{-1}(v)}{\epsilon} = \tilde P q_v = -\frac{1}{f(F^{-1}(v))}\left(\tilde F(F^{-1}(v)) - v\right).$$
Hence, for $P_\epsilon = (1-\epsilon)P + \epsilon\tilde P$,
$$(1 - 2\alpha)\lim_{\epsilon \downarrow 0} \frac{Q((1-\epsilon)P + \epsilon\tilde P) - Q(P)}{\epsilon} = (1-\alpha)\tilde P q_{1-\alpha} - \alpha\tilde P q_\alpha - \int_\alpha^{1-\alpha} v\,d\tilde P q_v$$
$$= \int_\alpha^{1-\alpha} \tilde P q_v\,dv = -\int_\alpha^{1-\alpha} \frac{1}{f(F^{-1}(v))}\left(\tilde F(F^{-1}(v)) - v\right)dv$$
$$= -\int_{F^{-1}(\alpha)}^{F^{-1}(1-\alpha)} \frac{1}{f(u)}\left(\tilde F(u) - F(u)\right)dF(u) = -\int_{F^{-1}(\alpha)}^{F^{-1}(1-\alpha)}\left(\tilde F(u) - F(u)\right)du$$
$$= (1 - 2\alpha)\tilde P l_P,$$
where
$$l_P(x) = -\frac{1}{1 - 2\alpha}\int_{F^{-1}(\alpha)}^{F^{-1}(1-\alpha)}\left(1\{x \le u\} - F(u)\right)du.$$
We conclude that, under regularity conditions, the $\alpha$-trimmed mean is asymptotically linear with the above influence function $l_P$, and hence asymptotically normal with asymptotic variance $P l_P^2$.
6.5 Asymptotic relative efficiency
In this section, we assume that the parameter of interest $\gamma$ is real-valued: $\gamma \in \mathbb{R}$.

Definition Let $T_{n,1}$ and $T_{n,2}$ be two estimators of $\gamma$, that satisfy
$$\sqrt{n}(T_{n,j} - \gamma) \xrightarrow{\mathcal{D}_\theta} \mathcal{N}(0, V_{\theta,j}), \quad j = 1, 2.$$
Then
$$e_{2:1} := \frac{V_{\theta,1}}{V_{\theta,2}}$$
is called the asymptotic relative efficiency of $T_{n,2}$ with respect to $T_{n,1}$.

If $e_{2:1} > 1$, the estimator $T_{n,2}$ is asymptotically more efficient than $T_{n,1}$. An asymptotic $(1-\alpha)$-confidence interval for $\gamma$ based on $T_{n,2}$ is then narrower than the one based on $T_{n,1}$.
Example 6.5.1 Let $\mathcal{X} = \mathbb{R}$, and $F$ be the distribution function of $X$. Suppose that $F$ is symmetric around the parameter of interest $\mu$. In other words,
$$F(\cdot) = F_0(\cdot - \mu),$$
where $F_0$ is symmetric around zero. We assume that $F_0$ has finite variance $\sigma^2$, and that it has density $f_0$ w.r.t. Lebesgue measure, with $f_0(0) > 0$. Take $T_{n,1} := \bar X_n$, the sample mean, and $T_{n,2} := \hat F_n^{-1}(1/2)$, the sample median. Then $V_{\theta,1} = \sigma^2$ and $V_{\theta,2} = 1/(4f_0^2(0))$ (the latter being derived in Example 6.3.4). So
$$e_{2:1} = 4\sigma^2 f_0^2(0).$$
Whether the sample mean is the winner, or rather the sample median, depends thus on the distribution $F_0$. Let us consider three cases.

Case i Let $F_0$ be the standard normal distribution, i.e., $F_0 = \Phi$. Then $\sigma^2 = 1$ and $f_0(0) = 1/\sqrt{2\pi}$. Hence
$$e_{2:1} = \frac{2}{\pi} \approx 0.64.$$
So $\bar X_n$ is the winner. Note that $\bar X_n$ is the MLE in this case.

Case ii Let $F_0$ be the Laplace distribution, with variance $\sigma^2$ equal to one. This distribution has density
$$f_0(x) = \frac{1}{\sqrt{2}}\exp[-\sqrt{2}|x|], \quad x \in \mathbb{R}.$$
So we have $f_0(0) = 1/\sqrt{2}$, and hence
$$e_{2:1} = 2.$$
Thus, the sample median, which is the MLE in this case, is the winner.

Case iii Suppose
$$F_0 = (1-\eta)\Phi + \eta\,\Phi(\cdot/3).$$
This means that the distribution of $X$ is a mixture, with proportions $1-\eta$ and $\eta$, of two normal distributions, one with unit variance, and one with variance $3^2$. Otherwise put, associated with $X$ is an unobservable label $Y \in \{0, 1\}$. If $Y = 1$, the random variable $X$ is $\mathcal{N}(\mu, 1)$-distributed. If $Y = 0$, the random variable $X$ has a $\mathcal{N}(\mu, 3^2)$ distribution. Moreover, $P(Y = 1) = 1 - P(Y = 0) = 1 - \eta$. Hence
$$\sigma^2 := \mathrm{var}(X) = (1-\eta)\mathrm{var}(X|Y=1) + \eta\,\mathrm{var}(X|Y=0) = (1-\eta) + 9\eta = 1 + 8\eta.$$
It furthermore holds that
$$f_0(0) = (1-\eta)\phi(0) + \frac{\eta}{3}\phi(0) = \frac{1}{\sqrt{2\pi}}\left(1 - \frac{2\eta}{3}\right).$$
It follows that
$$e_{2:1} = \frac{2}{\pi}\left(1 - \frac{2\eta}{3}\right)^2(1 + 8\eta).$$
Let us now further compare the results with the $\alpha$-trimmed mean. Because $F$ is symmetric, the $\alpha$-trimmed mean has the same influence function as the Huber-estimator with $k = F_0^{-1}(1-\alpha)$:
$$l_\theta(x) = \frac{1}{F_0(k) - F_0(-k)}\begin{cases} x - \mu, & |x - \mu| \le k \\ +k, & x - \mu > k \\ -k, & x - \mu < -k \end{cases}.$$
This can be seen from Example 6.4.2. The influence function is used to compute the asymptotic variance $V_{\theta,\alpha}$ of the $\alpha$-trimmed mean:
$$V_{\theta,\alpha} = \frac{\int_{F_0^{-1}(\alpha)}^{F_0^{-1}(1-\alpha)} x^2\,dF_0(x) + 2\alpha(F_0^{-1}(1-\alpha))^2}{(1 - 2\alpha)^2}.$$
From this, we then calculate the asymptotic relative efficiency of the $\alpha$-trimmed mean w.r.t. the mean. Note that the median is the limiting case with $\alpha \uparrow 1/2$.

Table: Asymptotic relative efficiency of $\alpha$-trimmed mean over mean

                 alpha = 0.05   alpha = 0.125   alpha = 0.5
    eta = 0.00       0.99            0.94           0.64
    eta = 0.05       1.20            1.19           0.83
    eta = 0.25       1.40            1.66           1.33
6.6 Asymptotic Cramer Rao lower bound
Let $X$ have distribution $P \in \{P_\theta :\ \theta \in \Theta\}$. We assume for simplicity that $\Theta \subset \mathbb{R}$ and that $\theta$ is the parameter of interest. Let $T_n$ be an estimator of $\theta$. Throughout this section, we take certain, sometimes unspecified, regularity conditions for granted.

In particular, we assume that $\mathcal{P}$ is dominated by some $\sigma$-finite measure $\nu$, and that the Fisher-information
$$I(\theta) := E_\theta s_\theta^2(X)$$
exists for all $\theta$. Here, $s_\theta$ is the score function
$$s_\theta := \frac{d}{d\theta}\log p_\theta = \dot p_\theta / p_\theta,$$
with $p_\theta := dP_\theta/d\nu$.

Recall now that if $T_n$ is an unbiased estimator of $\theta$, then by the Cramer Rao lower bound, $1/I(\theta)$ is a lower bound for its variance (under regularity conditions I and II, see Section 3.3.1).
Definition Suppose that
$$\sqrt{n}(T_n - \theta) \xrightarrow{\mathcal{D}_\theta} \mathcal{N}(b_\theta, V_\theta), \quad \forall\theta.$$
Then $b_\theta$ is called the asymptotic bias, and $V_\theta$ the asymptotic variance. The estimator $T_n$ is called asymptotically unbiased if $b_\theta = 0$ for all $\theta$. If $T_n$ is asymptotically unbiased and moreover $V_\theta = 1/I(\theta)$ for all $\theta$, and some regularity conditions hold, then $T_n$ is called asymptotically efficient.

Remark 1 The assumptions in the above definition are for all $\theta$. Clearly, if one only looks at one fixed given $\theta_0$, it is easy to construct a super-efficient estimator, namely $T_n \equiv \theta_0$. More generally, to avoid this kind of super-efficiency, one does not only require conditions to hold for all $\theta$, but in fact uniformly in $\theta$, or for all sequences $\{\theta_n\}$. The regularity one needs here involves the idea that one actually needs to allow for sequences $\theta_n$ of the form $\theta_n = \theta + h/\sqrt{n}$. In fact, the regularity requirement is that also, for all $h$,
$$\sqrt{n}(T_n - \theta_n) \xrightarrow{\mathcal{D}_{\theta_n}} \mathcal{N}(0, V_\theta).$$
To make all this mathematically precise is quite involved. We refer to van der Vaart (1998). A glimpse is given in Le Cam's 3rd Lemma, see the next subsection.
Remark 2 Note that when $\theta = \theta_n$ is allowed to change with $n$, the distribution of $X_i$ can change with $n$, and hence $X_i$ itself can change with $n$. Instead of regarding the sample $X_1, \dots, X_n$ as the first $n$ of an infinite sequence, we now consider for each $n$ a new sample, say $X_{n,1}, \dots, X_{n,n}$.

Remark 3 We have seen that the MLE $\hat\theta_n$ generally is indeed asymptotically unbiased with asymptotic variance $V_\theta$ equal to $1/I(\theta)$, i.e., under regularity assumptions, the MLE is asymptotically efficient.

For asymptotically linear estimators, with influence function $l_\theta$, one has asymptotic variance $V_\theta = E_\theta l_\theta^2(X)$. The next lemma indicates that generally $1/I(\theta)$ is indeed a lower bound for the asymptotic variance.
Lemma 6.6.1 Suppose that
$$T_n - \theta = \frac{1}{n}\sum_{i=1}^n l_\theta(X_i) + o_{\mathrm{IP}_\theta}(n^{-1/2}),$$
where $E_\theta l_\theta(X) = 0$ and $E_\theta l_\theta^2(X) := V_\theta < \infty$. Assume moreover that
$$E_\theta l_\theta(X)s_\theta(X) = 1. \tag{6.6}$$
Then
$$V_\theta \ge \frac{1}{I(\theta)}.$$
Proof. This follows from the Cauchy-Schwarz inequality:
$$1 = |\mathrm{cov}_\theta(l_\theta(X), s_\theta(X))| \le \sqrt{\mathrm{var}_\theta(l_\theta(X))\,\mathrm{var}_\theta(s_\theta(X))} = \sqrt{V_\theta I(\theta)}. \qquad \square$$
It may look like a coincidence when in a special case, equality (6.6) indeed holds. But actually, it is true in quite a few cases. This may at first seem like magic. We consider two examples. To simplify the expressions, we again write the shorthand
$$P_\theta f := E_\theta f(X).$$
Example 6.6.1 This example examines the Z-estimator of $\theta$. Then we have, for $P = P_\theta$,
$$P_\theta \psi_\theta = 0.$$
The influence function is
$$l_\theta = -\psi_\theta / M_\theta,$$
where
$$M_\theta := \frac{d}{dc} P_\theta \psi_c \Big|_{c = \theta}.$$
Under regularity, we have
$$M_\theta = P_\theta \dot\psi_\theta = \int \dot\psi_\theta p_\theta\,d\nu, \quad \dot\psi_\vartheta = \frac{d}{d\vartheta}\psi_\vartheta.$$
We may also write
$$M_\theta = -\int \psi_\theta \dot p_\theta\,d\nu, \quad \dot p_\vartheta = \frac{d}{d\vartheta}p_\vartheta.$$
This follows from the chain rule
$$\frac{d}{d\theta}(\psi_\theta p_\theta) = \dot\psi_\theta p_\theta + \psi_\theta \dot p_\theta,$$
and (under regularity)
$$\int \frac{d}{d\theta}(\psi_\theta p_\theta)\,d\nu = \frac{d}{d\theta}\int \psi_\theta p_\theta\,d\nu = \frac{d}{d\theta} P_\theta \psi_\theta = \frac{d}{d\theta} 0 = 0.$$
Thus
$$P_\theta l_\theta s_\theta = -M_\theta^{-1} P_\theta \psi_\theta s_\theta = -M_\theta^{-1}\int \psi_\theta \dot p_\theta\,d\nu = M_\theta^{-1} M_\theta = 1,$$
that is, (6.6) holds.
Example 6.6.2 We consider now the plug-in estimator $Q(\hat P_n)$. Suppose that $Q$ is Fisher-consistent (i.e., $Q(P_\theta) = \theta$ for all $\theta$). Assume moreover that $Q$ is Fréchet differentiable with respect to the metric $d$, at all $P_\theta$, and that
$$d(P_\theta, P_{\tilde\theta}) = O(|\theta - \tilde\theta|).$$
Then, by the definition of Fréchet differentiability,
$$h = Q(P_{\theta+h}) - Q(P_\theta) = P_{\theta+h} l_\theta + o(|h|) = (P_{\theta+h} - P_\theta) l_\theta + o(|h|),$$
or, as $h \to 0$,
$$1 = \frac{(P_{\theta+h} - P_\theta) l_\theta}{h} + o(1) = \frac{\int l_\theta (p_{\theta+h} - p_\theta)\,d\nu}{h} + o(1)$$
$$\to \int l_\theta \dot p_\theta\,d\nu = P_\theta(l_\theta s_\theta).$$
So (6.6) holds.
6.6.1 Le Cam's 3rd Lemma
The following example serves as a motivation to consider sequences $\theta_n$ depending on $n$. It shows that pointwise asymptotics can be very misleading.
Example 6.6.3 (Hodges-Lehmann example of super-efficiency) Let $X_1, \dots, X_n$ be i.i.d. copies of $X$, where $X = \theta + \epsilon$, and $\epsilon$ is $\mathcal{N}(0,1)$-distributed. Consider the estimator
$$T_n := \begin{cases} \bar X_n, & \text{if } |\bar X_n| > n^{-1/4} \\ \bar X_n/2, & \text{if } |\bar X_n| \le n^{-1/4} \end{cases}.$$
Then
$$\sqrt{n}(T_n - \theta) \xrightarrow{\mathcal{D}_\theta} \begin{cases} \mathcal{N}(0, 1), & \theta \ne 0 \\ \mathcal{N}(0, \frac{1}{4}), & \theta = 0 \end{cases}.$$
So the pointwise asymptotics show that $T_n$ can be more efficient than the sample average $\bar X_n$. But what happens if we consider sequences $\theta_n$? For example, let $\theta_n = h/\sqrt{n}$. Then, under $\mathrm{IP}_{\theta_n}$, $\bar X_n = \theta_n + O_{\mathrm{IP}_{\theta_n}}(1/\sqrt{n}) = O_{\mathrm{IP}_{\theta_n}}(n^{-1/2})$. Hence, $\mathrm{IP}_{\theta_n}(|\bar X_n| > n^{-1/4}) \to 0$, so that $\mathrm{IP}_{\theta_n}(T_n = \bar X_n) \to 0$. Thus,
$$\sqrt{n}(T_n - \theta_n) = \sqrt{n}(T_n - \theta_n)1\{T_n = \bar X_n\} + \sqrt{n}(T_n - \theta_n)1\{T_n = \bar X_n/2\} \xrightarrow{\mathcal{D}_{\theta_n}} \mathcal{N}\left(-\frac{h}{2}, \frac{1}{4}\right).$$
The asymptotic mean square error $\mathrm{AMSE}_\theta(T_n)$ is defined as the asymptotic variance + asymptotic squared bias:
$$\mathrm{AMSE}_{\theta_n}(T_n) = \frac{1 + h^2}{4}.$$
The $\mathrm{AMSE}_\theta(\bar X_n)$ of $\bar X_n$ is its normalized non-asymptotic mean square error, which is
$$\mathrm{AMSE}_{\theta_n}(\bar X_n) = \mathrm{MSE}_{\theta_n}(\bar X_n) = 1.$$
So when $h$ is large enough, the asymptotic mean square error of $T_n$ is larger than that of $\bar X_n$.

Le Cam's 3rd lemma shows that asymptotic linearity for all $\theta$ implies asymptotic normality, now also for sequences $\theta_n = \theta + h/\sqrt{n}$. The asymptotic variance for such sequences $\theta_n$ does not change. Moreover, if (6.6) holds for all $\theta$, the estimator is also asymptotically unbiased under $\mathrm{IP}_{\theta_n}$.
Lemma 6.6.2 (Le Cam's 3rd Lemma) Suppose that for all $\theta$,
$$T_n - \theta = \frac{1}{n}\sum_{i=1}^n l_\theta(X_i) + o_{\mathrm{IP}_\theta}(n^{-1/2}),$$
where $P_\theta l_\theta = 0$, and $V_\theta := P_\theta l_\theta^2 < \infty$. Then, under regularity conditions,
$$\sqrt{n}(T_n - \theta_n) \xrightarrow{\mathcal{D}_{\theta_n}} \mathcal{N}\left([P_\theta(l_\theta s_\theta) - 1]h,\ V_\theta\right).$$
We will present a sketch of the proof of this lemma. For this purpose, we need the following auxiliary lemma.

Lemma 6.6.3 (Auxiliary lemma) Let $Z \in \mathbb{R}^2$ be $\mathcal{N}(\mu, \Sigma)$-distributed, where
$$\mu = \begin{pmatrix}\mu_1 \\ \mu_2\end{pmatrix}, \quad \Sigma = \begin{pmatrix}\sigma_1^2 & \sigma_{1,2} \\ \sigma_{1,2} & \sigma_2^2\end{pmatrix}.$$
Suppose that
$$\mu_2 = -\sigma_2^2/2.$$
Let $Y \in \mathbb{R}^2$ be $\mathcal{N}(\mu + a, \Sigma)$-distributed, with
$$a = \begin{pmatrix}\sigma_{1,2} \\ \sigma_2^2\end{pmatrix}.$$
Let $\phi_Z$ be the density of $Z$ and $\phi_Y$ be the density of $Y$. Then we have the following equality for all $z = (z_1, z_2) \in \mathbb{R}^2$:
$$\phi_Z(z)e^{z_2} = \phi_Y(z).$$

Proof. The density of $Z$ is
$$\phi_Z(z) = \frac{1}{2\pi\sqrt{\det(\Sigma)}}\exp\left[-\frac{1}{2}(z - \mu)^T\Sigma^{-1}(z - \mu)\right].$$
Now, one easily sees that
$$\Sigma^{-1}a = \begin{pmatrix}0 \\ 1\end{pmatrix}.$$
So
$$\frac{1}{2}(z - \mu)^T\Sigma^{-1}(z - \mu) = \frac{1}{2}(z - \mu - a)^T\Sigma^{-1}(z - \mu - a) + a^T\Sigma^{-1}(z - \mu) - \frac{1}{2}a^T\Sigma^{-1}a,$$
and
$$a^T\Sigma^{-1}(z - \mu) - \frac{1}{2}a^T\Sigma^{-1}a = \begin{pmatrix}0 \\ 1\end{pmatrix}^T(z - \mu) - \frac{1}{2}\begin{pmatrix}0 \\ 1\end{pmatrix}^Ta = z_2 - \mu_2 - \frac{1}{2}\sigma_2^2 = z_2. \qquad \square$$
Sketch of proof of Le Cam's 3rd Lemma. Set
$$\Lambda_n := \sum_{i=1}^n\left[\log p_{\theta_n}(X_i) - \log p_\theta(X_i)\right].$$
Then under $\mathrm{IP}_\theta$, by a two-term Taylor expansion,
$$\Lambda_n \approx \frac{h}{\sqrt{n}}\sum_{i=1}^n s_\theta(X_i) + \frac{h^2}{2}\cdot\frac{1}{n}\sum_{i=1}^n \dot s_\theta(X_i) \approx \frac{h}{\sqrt{n}}\sum_{i=1}^n s_\theta(X_i) - \frac{h^2}{2}I(\theta),$$
as
$$\frac{1}{n}\sum_{i=1}^n \dot s_\theta(X_i) \to E_\theta \dot s_\theta(X) = -I(\theta).$$
We moreover have, by the assumed asymptotic linearity, under $\mathrm{IP}_\theta$,
$$\sqrt{n}(T_n - \theta) \approx \frac{1}{\sqrt{n}}\sum_{i=1}^n l_\theta(X_i).$$
Thus,
$$\begin{pmatrix}\sqrt{n}(T_n - \theta) \\ \Lambda_n\end{pmatrix} \xrightarrow{\mathcal{D}_\theta} Z,$$
where $Z \in \mathbb{R}^2$ has the two-dimensional normal distribution:
$$Z = \begin{pmatrix}Z_1 \\ Z_2\end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix}0 \\ -\frac{h^2}{2}I(\theta)\end{pmatrix}, \begin{pmatrix}V_\theta & hP_\theta(l_\theta s_\theta) \\ hP_\theta(l_\theta s_\theta) & h^2 I(\theta)\end{pmatrix}\right).$$
Thus, we know that for all bounded and continuous $f : \mathbb{R}^2 \to \mathbb{R}$, one has
$$E_\theta f(\sqrt{n}(T_n - \theta), \Lambda_n) \to Ef(Z_1, Z_2).$$
Now, let $f : \mathbb{R} \to \mathbb{R}$ be bounded and continuous. Then, since
$$\prod_{i=1}^n p_{\theta_n}(X_i) = \prod_{i=1}^n p_\theta(X_i)e^{\Lambda_n},$$
we may write
$$E_{\theta_n} f(\sqrt{n}(T_n - \theta)) = E_\theta f(\sqrt{n}(T_n - \theta))e^{\Lambda_n}.$$
The function $(z_1, z_2) \mapsto f(z_1)e^{z_2}$ is continuous, but not bounded. However, one can show that one may extend the Portmanteau Theorem to this situation. This then yields
$$E_\theta f(\sqrt{n}(T_n - \theta))e^{\Lambda_n} \to Ef(Z_1)e^{Z_2}.$$
Now, apply the auxiliary lemma, with
$$\mu = \begin{pmatrix}0 \\ -\frac{h^2}{2}I(\theta)\end{pmatrix}, \quad \Sigma = \begin{pmatrix}V_\theta & hP_\theta(l_\theta s_\theta) \\ hP_\theta(l_\theta s_\theta) & h^2 I(\theta)\end{pmatrix}.$$
Then we get
$$Ef(Z_1)e^{Z_2} = \int f(z_1)e^{z_2}\phi_Z(z)\,dz = \int f(z_1)\phi_Y(z)\,dz = Ef(Y_1),$$
where
$$Y = \begin{pmatrix}Y_1 \\ Y_2\end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix}hP_\theta(l_\theta s_\theta) \\ \frac{h^2}{2}I(\theta)\end{pmatrix}, \begin{pmatrix}V_\theta & hP_\theta(l_\theta s_\theta) \\ hP_\theta(l_\theta s_\theta) & h^2 I(\theta)\end{pmatrix}\right),$$
so that
$$Y_1 \sim \mathcal{N}(hP_\theta(l_\theta s_\theta), V_\theta).$$
So we conclude that
$$\sqrt{n}(T_n - \theta) \xrightarrow{\mathcal{D}_{\theta_n}} Y_1 \sim \mathcal{N}(hP_\theta(l_\theta s_\theta), V_\theta).$$
Hence
$$\sqrt{n}(T_n - \theta_n) = \sqrt{n}(T_n - \theta) - h \xrightarrow{\mathcal{D}_{\theta_n}} \mathcal{N}(h[P_\theta(l_\theta s_\theta) - 1], V_\theta). \qquad \square$$
6.7 Asymptotic confidence intervals and tests
Again throughout this section, enough regularity is assumed, such as existence of derivatives and interchanging integration and differentiation.

Intermezzo: the $\chi^2$ distribution. Let $Y_1, \dots, Y_p$ be i.i.d. $\mathcal{N}(0,1)$-distributed. Define the $p$-vector
$$Y := \begin{pmatrix}Y_1 \\ \vdots \\ Y_p\end{pmatrix}.$$
Then $Y$ is $\mathcal{N}(0, I)$-distributed, with $I$ the $p \times p$ identity matrix. The $\chi^2$-distribution with $p$ degrees of freedom is defined as the distribution of
$$\|Y\|^2 := \sum_{j=1}^p Y_j^2.$$
Notation: $\|Y\|^2 \sim \chi_p^2$.

For a symmetric positive definite matrix $\Sigma$, one can define the square root $\Sigma^{1/2}$ as a symmetric positive definite matrix satisfying
$$\Sigma^{1/2}\Sigma^{1/2} = \Sigma.$$
Its inverse is denoted by $\Sigma^{-1/2}$ (which is the square root of $\Sigma^{-1}$). If $Z \in \mathbb{R}^p$ is $\mathcal{N}(0, \Sigma)$-distributed, the transformed vector
$$Y := \Sigma^{-1/2}Z$$
is $\mathcal{N}(0, I)$-distributed. It follows that
$$Z^T\Sigma^{-1}Z = Y^TY = \|Y\|^2 \sim \chi_p^2.$$
Asymptotic pivots. Recall the definition of an asymptotic pivot (see Section 1.7). It is a function $Z_n(\gamma) := Z_n(X_1, \dots, X_n, \gamma)$ of the data $X_1, \dots, X_n$ and the parameter of interest $\gamma = g(\theta) \in \mathbb{R}^p$, such that its asymptotic distribution does not depend on the unknown parameter $\theta$, i.e., for a random variable $Z$, with distribution $Q$ not depending on $\theta$,
$$Z_n(\gamma) \xrightarrow{\mathcal{D}_\theta} Z, \quad \forall\theta.$$
An asymptotic pivot can be used to construct approximate $(1-\alpha)$-confidence intervals for $\gamma$, and tests for $H_0 :\ \gamma = \gamma_0$ with approximate level $\alpha$.

Consider now an asymptotically normal estimator $T_n$ of $\gamma$, which is asymptotically unbiased and has asymptotic covariance matrix $V_\theta$, that is,
$$\sqrt{n}(T_n - \gamma) \xrightarrow{\mathcal{D}_\theta} \mathcal{N}(0, V_\theta), \quad \forall\theta$$
(assuming such an estimator exists). Then, depending on the situation, there are various ways to construct an asymptotic pivot.

1st asymptotic pivot. If the asymptotic covariance matrix $V_\theta$ is non-singular, and depends only on the parameter of interest $\gamma$, say $V_\theta = V(\gamma)$ (for example, if $\gamma = \theta$), then an asymptotic pivot is
$$Z_{n,1}(\gamma) := n(T_n - \gamma)^T V(\gamma)^{-1}(T_n - \gamma).$$
The asymptotic distribution is the $\chi^2$-distribution with $p$ degrees of freedom.

2nd asymptotic pivot. If, for all $\theta$, one has a consistent estimator $\hat V_n$ of $V_\theta$, then an asymptotic pivot is
$$Z_{n,2}(\gamma) := n(T_n - \gamma)^T \hat V_n^{-1}(T_n - \gamma).$$
The asymptotic distribution is again the $\chi^2$-distribution with $p$ degrees of freedom.
Estimators of the asymptotic variance. If $\hat\theta_n$ is a consistent estimator of $\theta$ and if $\theta \mapsto V_\theta$ is continuous, one may insert
$$\hat V_n := V_{\hat\theta_n}.$$
If $T_n = \hat\gamma_n$ is the M-estimator of $\gamma$, $\gamma$ being the solution of $P_\theta\psi_\gamma = 0$, then (under regularity) the asymptotic covariance matrix is
$$V_\theta = M_\theta^{-1}J_\theta M_\theta^{-1},$$
where
$$J_\theta = P_\theta \psi_\gamma \psi_\gamma^T,$$
and
$$M_\theta = \frac{\partial}{\partial c^T}P_\theta\psi_c\Big|_{c=\gamma} = P_\theta\dot\psi_\gamma.$$
Then one may estimate $J_\theta$ and $M_\theta$ by
$$\hat J_n := \hat P_n \psi_{\hat\gamma_n}\psi_{\hat\gamma_n}^T = \frac{1}{n}\sum_{i=1}^n \psi_{\hat\gamma_n}(X_i)\psi_{\hat\gamma_n}^T(X_i),$$
and
$$\hat M_n := \hat P_n \dot\psi_{\hat\gamma_n} = \frac{1}{n}\sum_{i=1}^n \dot\psi_{\hat\gamma_n}(X_i),$$
respectively. Under some regularity conditions,
$$\hat V_n := \hat M_n^{-1}\hat J_n\hat M_n^{-1}$$
is a consistent estimator of $V_\theta$.$^6$
6.7.1 Maximum likelihood

Suppose now that 𝒫 = {P_θ : θ ∈ Θ}, with Θ ⊂ R^p, and that 𝒫 is dominated by some σ-finite measure ν. Let p_θ := dP_θ/dν denote the densities, and let

θ̂_n := arg max_{ϑ∈Θ} Σ_{i=1}^n log p_ϑ(X_i)

be the MLE. Recall that θ̂_n is an M-estimator with loss function ρ_θ = −log p_θ, and hence (under regularity conditions) ψ_θ is minus the score function s_θ := ṗ_θ/p_θ. The asymptotic variance of the MLE is I^{−1}(θ), where I(θ) := P_θ s_θ s_θ^T is the Fisher information:

√n (θ̂_n − θ) →_{D_θ} N(0, I^{−1}(θ)), ∀ θ.
Thus, in this case

Z_{n,1}(θ) = n (θ̂_n − θ)^T I(θ) (θ̂_n − θ),

and, with Î_n a consistent estimator of I(θ),

Z_{n,2}(θ) = n (θ̂_n − θ)^T Î_n (θ̂_n − θ).
⁶ From most algorithms used to compute the M-estimator θ̂_n, one can easily obtain M̂_n and Ĵ_n as output. Recall e.g. that the Newton-Raphson algorithm is based on the iterations

θ_new = θ_old − [ Σ_{i=1}^n ψ̇_{θ_old}(X_i) ]^{−1} Σ_{i=1}^n ψ_{θ_old}(X_i).
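A minimal numerical sketch of this iteration (illustrative only; the Poisson score and all names below are my own choices, not from the text):

```python
def newton_m_estimator(xs, psi, psi_dot, theta0, n_iter=25):
    """Newton-Raphson for the one-dimensional M-estimation equation
    sum_i psi(x_i, theta) = 0, following the iteration in the footnote:
    theta_new = theta_old - (sum psi_dot)^{-1} (sum psi).
    The final Hessian sum yields M_hat_n as a by-product."""
    theta = theta0
    for _ in range(n_iter):
        score = sum(psi(x, theta) for x in xs)
        hessian = sum(psi_dot(x, theta) for x in xs)
        theta -= score / hessian
    m_hat = sum(psi_dot(x, theta) for x in xs) / len(xs)   # M_hat_n = P_n psi_dot
    return theta, m_hat

# Poisson log-likelihood: psi_theta(x) = x/theta - 1 and psi_dot = -x/theta^2,
# so the M-estimator is the sample mean (here 3.0).
xs = [2, 3, 1, 4, 5, 3]
theta_hat, m_hat = newton_m_estimator(
    xs, psi=lambda x, t: x / t - 1.0, psi_dot=lambda x, t: -x / t ** 2, theta0=1.0
)
```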
132 CHAPTER 6. ASYMPTOTIC THEORY
Note that one may take

Î_n := −(1/n) Σ_{i=1}^n ṡ_{θ̂_n}(X_i) = − (∂²/∂θ∂θ^T) (1/n) Σ_{i=1}^n log p_θ(X_i) |_{θ=θ̂_n}

as estimator of the Fisher information.⁷
3rd asymptotic pivot

Define now the twice log-likelihood ratio

2L_n(θ̂_n) − 2L_n(θ) := 2 Σ_{i=1}^n [ log p_{θ̂_n}(X_i) − log p_θ(X_i) ].

It turns out that the log-likelihood ratio is indeed an asymptotic pivot. A practical advantage is that it is self-normalizing: one does not need to explicitly estimate asymptotic (co-)variances.

Lemma 6.7.1 Under regularity conditions, 2L_n(θ̂_n) − 2L_n(θ) is an asymptotic pivot for θ. Its asymptotic distribution is again the χ²-distribution with p degrees of freedom:

2L_n(θ̂_n) − 2L_n(θ) →_{D_θ} χ²_p.
Sketch of the proof. We have by a two-term Taylor expansion

2L_n(θ̂_n) − 2L_n(θ) = 2n P_n [ log p_{θ̂_n} − log p_θ ]

≈ 2n (θ̂_n − θ)^T P_n s_θ + n (θ̂_n − θ)^T P_n ṡ_θ (θ̂_n − θ)

≈ 2n (θ̂_n − θ)^T P_n s_θ − n (θ̂_n − θ)^T I(θ) (θ̂_n − θ),

where in the second step we used P_n ṡ_θ ≈ P_θ ṡ_θ = −I(θ). (You may compare this two-term Taylor expansion with the one in the sketch of proof of Le Cam's 3rd Lemma.) The MLE θ̂_n is asymptotically linear with influence function l_θ = I(θ)^{−1} s_θ:

θ̂_n − θ = I(θ)^{−1} P_n s_θ + o_{P_θ}(n^{−1/2}).

Hence,

2L_n(θ̂_n) − 2L_n(θ) ≈ n (P_n s_θ)^T I(θ)^{−1} (P_n s_θ).

The result now follows from √n P_n s_θ →_{D_θ} N(0, I(θ)). □
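A small numerical illustration of the lemma for p = 1 (my own example, not from the text): for Bernoulli data the twice log-likelihood ratio at a hypothesized θ_0 has a closed form and can be compared with the Wald pivot Z_{n,1}(θ_0); both are asymptotically χ²_1, and for moderate n they are already close.

```python
import math

def bernoulli_lr_and_wald(k, n, theta0):
    """Twice log-likelihood ratio 2 L_n(theta_hat) - 2 L_n(theta0) and the
    Wald pivot Z_{n,1}(theta0) for k successes in n Bernoulli trials."""
    theta_hat = k / n
    lr = 2.0 * (k * math.log(theta_hat / theta0)
                + (n - k) * math.log((1.0 - theta_hat) / (1.0 - theta0)))
    wald = n * (theta_hat - theta0) ** 2 / (theta0 * (1.0 - theta0))
    return lr, wald

# Both statistics are asymptotically chi^2_1 under theta = theta0.
lr_stat, wald_stat = bernoulli_lr_and_wald(k=60, n=100, theta0=0.5)
```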
⁷ In other words (as for general M-estimators), the algorithm (e.g. Newton-Raphson) for calculating the maximum likelihood estimator θ̂_n generally also provides an estimator of the Fisher information as a by-product.
Example 6.7.1 Let X_1, …, X_n be i.i.d. copies of X, where X ∈ {1, …, k} is a label, with

P_θ(X = j) := π_j, j = 1, …, k,

where the probabilities π_j are positive and add up to one:

Σ_{j=1}^k π_j = 1,

but are assumed to be otherwise unknown. Then there are p := k − 1 unknown parameters, say θ = (π_1, …, π_{k−1}). Define N_j := #{i : X_i = j}. (Note that (N_1, …, N_k) has a multinomial distribution with parameters n and (π_1, …, π_k).)

Lemma For each j = 1, …, k, the MLE of π_j is

π̂_j = N_j / n.
Proof. The log-densities can be written as

log p_θ(x) = Σ_{j=1}^k 1{x = j} log π_j,

so that

Σ_{i=1}^n log p_θ(X_i) = Σ_{j=1}^k N_j log π_j.

Putting the derivatives with respect to θ = (π_1, …, π_{k−1}) (with π_k = 1 − Σ_{j=1}^{k−1} π_j) to zero gives

N_j/π̂_j − N_k/π̂_k = 0.

Hence

π̂_j = N_j π̂_k / N_k, j = 1, …, k,

and thus

1 = Σ_{j=1}^k π̂_j = n π̂_k / N_k,

yielding

π̂_k = N_k / n,

and hence

π̂_j = N_j / n, j = 1, …, k. □
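In code the lemma is a one-liner; the following sketch (hypothetical names and counts, not part of the text) computes the MLE from observed label counts:

```python
def multinomial_mle(counts):
    """MLE pi_hat_j = N_j / n of the label probabilities,
    given the observed counts (N_1, ..., N_k)."""
    n = sum(counts)
    return [n_j / n for n_j in counts]

pi_hat = multinomial_mle([30, 50, 20])  # n = 100 observations over k = 3 labels
```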
We now first calculate Z_{n,1}(θ). For that, we need to find the Fisher information I(θ).
Lemma The Fisher information is

I(θ) = diag(1/π_1, …, 1/π_{k−1}) + (1/π_k) ι ι^T,⁸

where ι is the (k − 1)-vector ι := (1, …, 1)^T.
Proof. We have

s_{θ,j}(x) = (1/π_j) 1{x = j} − (1/π_k) 1{x = k}.

So

(I(θ))_{j1,j2} = E_θ [ (1/π_{j1}) 1{X = j1} − (1/π_k) 1{X = k} ] [ (1/π_{j2}) 1{X = j2} − (1/π_k) 1{X = k} ],

which equals 1/π_k if j1 ≠ j2, and 1/π_{j1} + 1/π_k if j1 = j2. □
We thus find

Z_{n,1}(θ) = n (θ̂_n − θ)^T I(θ) (θ̂_n − θ)

= n (π̂_1 − π_1, …, π̂_{k−1} − π_{k−1}) [ diag(1/π_1, …, 1/π_{k−1}) + (1/π_k) ι ι^T ] (π̂_1 − π_1, …, π̂_{k−1} − π_{k−1})^T

= n Σ_{j=1}^{k−1} (π̂_j − π_j)²/π_j + n (1/π_k) ( Σ_{j=1}^{k−1} (π̂_j − π_j) )²

= n Σ_{j=1}^k (π̂_j − π_j)²/π_j = Σ_{j=1}^k (N_j − n π_j)²/(n π_j).

This is Pearson's chi-square

Σ (observed − expected)² / expected.
A version of Z_{n,2}(θ) is obtained by replacing, for j = 1, …, k, π_j by π̂_j in the expression for the Fisher information. This gives

Z_{n,2}(θ) = Σ_{j=1}^k (N_j − n π_j)²/N_j.
⁸ To invert such a matrix, one may apply the formula

(A + b b^T)^{−1} = A^{−1} − A^{−1} b b^T A^{−1} / (1 + b^T A^{−1} b).
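This rank-one update formula (Sherman–Morrison) is easy to check numerically; the sketch below, with a diagonal 2 × 2 example, is my own illustration and not from the text:

```python
def sherman_morrison_inv(a_diag, b):
    """Inverse of A + b b^T for diagonal A, via the footnote's formula
    (A + b b^T)^{-1} = A^{-1} - A^{-1} b b^T A^{-1} / (1 + b^T A^{-1} b)."""
    p = len(b)
    ainv = [1.0 / d for d in a_diag]
    denom = 1.0 + sum(ainv[i] * b[i] * b[i] for i in range(p))
    return [[(ainv[i] if i == j else 0.0) - ainv[i] * b[i] * b[j] * ainv[j] / denom
             for j in range(p)] for i in range(p)]

# Check on a 2x2 example: (A + b b^T) times the claimed inverse is the identity.
a_diag, b = [2.0, 5.0], [1.0, 1.0]
m = [[a_diag[0] + b[0] * b[0], b[0] * b[1]],
     [b[1] * b[0], a_diag[1] + b[1] * b[1]]]
inv = sherman_morrison_inv(a_diag, b)
prod = [[sum(m[i][k] * inv[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)]
```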
This is Pearson's chi-square

Σ (observed − expected)² / observed.

Finally, the log-likelihood ratio pivot is

2L_n(θ̂_n) − 2L_n(θ) = 2 Σ_{j=1}^k N_j log( π̂_j / π_j ).

The approximation log(1 + x) ≈ x − x²/2 shows that 2L_n(θ̂_n) − 2L_n(θ) ≈ Z_{n,2}(θ):

2L_n(θ̂_n) − 2L_n(θ) = −2 Σ_{j=1}^k N_j log( 1 + (π_j − π̂_j)/π̂_j )

≈ −2 Σ_{j=1}^k N_j ( (π_j − π̂_j)/π̂_j ) + Σ_{j=1}^k N_j ( (π_j − π̂_j)/π̂_j )² = Z_{n,2}(θ).

Here the linear term vanishes, since Σ_j N_j (π_j − π̂_j)/π̂_j = n Σ_j (π_j − π̂_j) = 0 (recall N_j = n π̂_j), and the quadratic term equals Σ_j (N_j − n π_j)²/N_j.

The three asymptotic pivots Z_{n,1}(θ), Z_{n,2}(θ) and 2L_n(θ̂_n) − 2L_n(θ) are each asymptotically χ²_{k−1}-distributed under P_θ.
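To make the comparison concrete, here is a small Python sketch (my own illustration; the counts and null probabilities are hypothetical) computing all three pivots for one multinomial sample:

```python
import math

def multinomial_pivots(counts, pi0):
    """The three asymptotic pivots for H_0: pi = pi0 in the multinomial
    example: Pearson with 'expected' in the denominator (Z_{n,1}),
    Pearson with 'observed' in the denominator (Z_{n,2}), and twice
    the log-likelihood ratio. Each is asymptotically chi^2_{k-1}."""
    n = sum(counts)
    z1 = sum((n_j - n * p_j) ** 2 / (n * p_j) for n_j, p_j in zip(counts, pi0))
    z2 = sum((n_j - n * p_j) ** 2 / n_j for n_j, p_j in zip(counts, pi0))
    lr = 2.0 * sum(n_j * math.log(n_j / (n * p_j))
                   for n_j, p_j in zip(counts, pi0))
    return z1, z2, lr

z1, z2, lr = multinomial_pivots([30, 50, 20], [1 / 3, 1 / 3, 1 / 3])
```

For this sample the three statistics agree to first order but are not identical, exactly as the approximations above suggest.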
6.7.2 Likelihood ratio tests

Intermezzo: some matrix algebra

Let z ∈ R^p be a vector and let B be a (q × p)-matrix (p ≥ q) with rank q. Moreover, let V be a positive definite (p × p)-matrix.
Lemma We have

max_{a ∈ R^p : Ba = 0} { 2a^T z − a^T a } = z^T z − z^T B^T (BB^T)^{−1} B z.

Proof. We use a Lagrange multiplier λ ∈ R^q. We have

(∂/∂a) [ 2a^T z − a^T a + 2a^T B^T λ ] = 2( z − a + B^T λ ).

Hence for

a* := arg max_{a ∈ R^p : Ba = 0} { 2a^T z − a^T a },

we have

z − a* + B^T λ = 0,

or

a* = z + B^T λ.

The restriction B a* = 0 gives

Bz + BB^T λ = 0.

So

λ = −(BB^T)^{−1} B z.

Inserting this in the solution a* gives

a* = z − B^T (BB^T)^{−1} B z.

Now,

a*^T a* = ( z^T − z^T B^T (BB^T)^{−1} B )( z − B^T (BB^T)^{−1} B z ) = z^T z − z^T B^T (BB^T)^{−1} B z.

So

2 a*^T z − a*^T a* = z^T z − z^T B^T (BB^T)^{−1} B z. □
Lemma We have

max_{a ∈ R^p : Ba = 0} { 2a^T z − a^T V a } = z^T V^{−1} z − z^T V^{−1} B^T (B V^{−1} B^T)^{−1} B V^{−1} z.

Proof. Make the transformation b := V^{1/2} a, y := V^{−1/2} z, and C := B V^{−1/2}. Then

max_{a: Ba=0} { 2a^T z − a^T V a } = max_{b: Cb=0} { 2b^T y − b^T b }

= y^T y − y^T C^T (CC^T)^{−1} C y = z^T V^{−1} z − z^T V^{−1} B^T (B V^{−1} B^T)^{−1} B V^{−1} z. □
Corollary Let L(a) := 2a^T z − a^T V a. The difference between the unrestricted maximum and the restricted maximum of L(a) is

max_a L(a) − max_{a: Ba=0} L(a) = z^T V^{−1} B^T (B V^{−1} B^T)^{−1} B V^{−1} z.
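The corollary can be sanity-checked numerically. The sketch below (my own illustration, with p = 2, q = 1 and diagonal V) compares the closed form with a direct evaluation of both maxima:

```python
def quad_max_gap(z, v, b):
    """For L(a) = 2 a^T z - a^T V a with diagonal V (p = 2) and a single
    restriction b^T a = 0 (q = 1), compare the corollary's closed form
    z^T V^{-1} b (b^T V^{-1} b)^{-1} b^T V^{-1} z
    with a direct evaluation of max L(a) - max_{b^T a = 0} L(a)."""
    vinv = [1.0 / v[0], 1.0 / v[1]]
    # Closed form (the middle inverse is a scalar when q = 1).
    bvz = vinv[0] * b[0] * z[0] + vinv[1] * b[1] * z[1]   # b^T V^{-1} z
    bvb = vinv[0] * b[0] ** 2 + vinv[1] * b[1] ** 2       # b^T V^{-1} b
    gap_formula = bvz ** 2 / bvb
    # Direct: max L(a) = z^T V^{-1} z, attained at a = V^{-1} z; on the line
    # {b^T a = 0} write a = t * u with u = (-b[1], b[0]) and maximize over t.
    unrestricted = vinv[0] * z[0] ** 2 + vinv[1] * z[1] ** 2
    u = (-b[1], b[0])
    u_z = u[0] * z[0] + u[1] * z[1]
    u_v_u = v[0] * u[0] ** 2 + v[1] * u[1] ** 2
    restricted = u_z ** 2 / u_v_u      # max_t (2 t u^T z - t^2 u^T V u)
    return gap_formula, unrestricted - restricted

gap_formula, gap_direct = quad_max_gap(z=[1.0, 2.0], v=[2.0, 3.0], b=[1.0, 1.0])
```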
Hypothesis testing

For the simple hypothesis

H_0: θ = θ_0,

we can use 2L_n(θ̂_n) − 2L_n(θ_0) as test statistic: reject H_0 if 2L_n(θ̂_n) − 2L_n(θ_0) > χ²_{p,α}, where χ²_{p,α} is the (1 − α)-quantile of the χ²_p-distribution.

Consider now the hypothesis

H_0: R(θ) = 0,

where

R(θ) = ( R_1(θ), …, R_q(θ) )^T.
Let θ̂_n be the unrestricted MLE, that is,

θ̂_n = arg max_{θ∈Θ} Σ_{i=1}^n log p_θ(X_i).

Moreover, let θ̂_n⁰ be the restricted MLE, defined as

θ̂_n⁰ = arg max_{θ: R(θ)=0} Σ_{i=1}^n log p_θ(X_i).

Define the (q × p)-matrix

Ṙ(θ) := (∂/∂ϑ^T) R(ϑ) |_{ϑ=θ}.

We assume that Ṙ(θ) has rank q.
Let

L_n(θ̂_n) − L_n(θ̂_n⁰) = Σ_{i=1}^n [ log p_{θ̂_n}(X_i) − log p_{θ̂_n⁰}(X_i) ]

be the log-likelihood ratio for testing H_0: R(θ) = 0.

Lemma 6.7.2 Under regularity conditions, and if H_0: R(θ) = 0 holds, we have

2L_n(θ̂_n) − 2L_n(θ̂_n⁰) →_{D_θ} χ²_q.
Sketch of the proof. Let

Z_n := (1/√n) Σ_{i=1}^n s_θ(X_i).

As in the sketch of the proof of Lemma 6.7.1, we can use a two-term Taylor expansion to show, for any sequence ϑ_n satisfying ϑ_n = θ + O_{P_θ}(n^{−1/2}), that

2 Σ_{i=1}^n [ log p_{ϑ_n}(X_i) − log p_θ(X_i) ] = 2√n (ϑ_n − θ)^T Z_n − n (ϑ_n − θ)^T I(θ) (ϑ_n − θ) + o_{P_θ}(1).

Here we also again use that −(1/n) Σ_{i=1}^n ṡ_{ϑ_n}(X_i) = I(θ) + o_{P_θ}(1). Moreover, by a one-term Taylor expansion, and invoking that R(θ) = 0,

R(ϑ_n) = Ṙ(θ)(ϑ_n − θ) + o_{P_θ}(n^{−1/2}).

Insert the corollary of the above matrix algebra, with z := Z_n, B := Ṙ(θ), and V := I(θ). This gives

2L_n(θ̂_n) − 2L_n(θ̂_n⁰)

= 2 Σ_{i=1}^n [ log p_{θ̂_n}(X_i) − log p_θ(X_i) ] − 2 Σ_{i=1}^n [ log p_{θ̂_n⁰}(X_i) − log p_θ(X_i) ]

= Z_n^T I(θ)^{−1} Ṙ^T(θ) [ Ṙ(θ) I(θ)^{−1} Ṙ(θ)^T ]^{−1} Ṙ(θ) I(θ)^{−1} Z_n + o_{P_θ}(1)

:= Y_n^T W^{−1} Y_n + o_{P_θ}(1),

where Y_n is the q-vector

Y_n := Ṙ(θ) I(θ)^{−1} Z_n,

and where W is the (q × q)-matrix

W := Ṙ(θ) I(θ)^{−1} Ṙ(θ)^T.

We know that

Z_n →_{D_θ} N(0, I(θ)).

Hence

Y_n →_{D_θ} N(0, W),

so that

Y_n^T W^{−1} Y_n →_{D_θ} χ²_q. □
Corollary 6.7.1 From the sketch of the proof of Lemma 6.7.2, one sees that moreover (under regularity),

2L_n(θ̂_n) − 2L_n(θ̂_n⁰) ≈ n (θ̂_n − θ̂_n⁰)^T I(θ) (θ̂_n − θ̂_n⁰),

and also

2L_n(θ̂_n) − 2L_n(θ̂_n⁰) ≈ n (θ̂_n − θ̂_n⁰)^T I(θ̂_n⁰) (θ̂_n − θ̂_n⁰).
Example 6.7.2 Let X be a bivariate label, say X ∈ {(j, k) : j = 1, …, r, k = 1, …, s}. For example, the first index may correspond to sex (r = 2) and the second index to the color of the eyes (s = 3). The probability of the combination (j, k) is

π_{j,k} := P_θ( X = (j, k) ).

Let X_1, …, X_n be i.i.d. copies of X, and

N_{j,k} := #{ i : X_i = (j, k) }.

From Example 6.7.1, we know that the (unrestricted) MLE of π_{j,k} is equal to

π̂_{j,k} := N_{j,k}/n.

We now want to test whether the two labels are independent. The null-hypothesis is

H_0: π_{j,k} = (π_{j,+})(π_{+,k}) ∀ (j, k).

Here

π_{j,+} := Σ_{k=1}^s π_{j,k}, π_{+,k} := Σ_{j=1}^r π_{j,k}.

One may check that the restricted MLE is

π̂⁰_{j,k} = (π̂_{j,+})(π̂_{+,k}),

where

π̂_{j,+} := Σ_{k=1}^s π̂_{j,k}, π̂_{+,k} := Σ_{j=1}^r π̂_{j,k}.

The log-likelihood ratio test statistic is thus

2L_n(θ̂_n) − 2L_n(θ̂_n⁰) = 2 Σ_{j=1}^r Σ_{k=1}^s N_{j,k} [ log( N_{j,k}/n ) − log( N_{j,+} N_{+,k}/n² ) ]

= 2 Σ_{j=1}^r Σ_{k=1}^s N_{j,k} log( n N_{j,k} / (N_{j,+} N_{+,k}) ),

with marginal counts N_{j,+} := Σ_{k=1}^s N_{j,k} = n π̂_{j,+} and N_{+,k} := Σ_{j=1}^r N_{j,k} = n π̂_{+,k}.
Its approximation as given in Corollary 6.7.1 is

2L_n(θ̂_n) − 2L_n(θ̂_n⁰) ≈ n Σ_{j=1}^r Σ_{k=1}^s (N_{j,k} − N_{j,+} N_{+,k}/n)² / (N_{j,+} N_{+,k}).

This is Pearson's chi-squared test statistic for testing independence. To find out what the value of q is in this example, we first observe that the unrestricted case has p = rs − 1 free parameters. Under the null-hypothesis, there remain (r − 1) + (s − 1) free parameters. Hence, the number of restrictions is

q = ( rs − 1 ) − ( (r − 1) + (s − 1) ) = (r − 1)(s − 1).

Thus, under H_0: π_{j,k} = (π_{j,+})(π_{+,k}) ∀ (j, k), we have

n Σ_{j=1}^r Σ_{k=1}^s (N_{j,k} − N_{j,+} N_{+,k}/n)² / (N_{j,+} N_{+,k}) →_D χ²_{(r−1)(s−1)}.
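As an illustrative sketch (my own, not from the text), the statistic and its degrees of freedom can be computed directly from a contingency table of counts:

```python
def independence_chi2(table):
    """Pearson's chi-squared statistic for independence in an r x s table:
    n * sum_{j,k} (N_jk - N_j+ N_+k / n)^2 / (N_j+ N_+k),
    asymptotically chi^2 with (r-1)(s-1) degrees of freedom under H_0."""
    r, s = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]                             # N_j+
    col_tot = [sum(table[j][k] for j in range(r)) for k in range(s)]  # N_+k
    stat = n * sum(
        (table[j][k] - row_tot[j] * col_tot[k] / n) ** 2
        / (row_tot[j] * col_tot[k])
        for j in range(r) for k in range(s))
    return stat, (r - 1) * (s - 1)

stat, df = independence_chi2([[20, 30], [30, 20]])  # a hypothetical 2 x 2 table
```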
6.8 Complexity regularization (to be written)
Chapter 7
Literature
J.O. Berger (1985). Statistical Decision Theory and Bayesian Analysis. Springer.
A fundamental book on Bayesian theory.

P.J. Bickel and K.A. Doksum (2001). Mathematical Statistics, Basic Ideas and Selected Topics, Volume I, 2nd edition. Prentice Hall.
Quite general, and mathematically sound.

D.R. Cox and D.V. Hinkley (1974). Theoretical Statistics. Chapman and Hall.
Contains good discussions of various concepts and their practical meaning. Mathematical development is sketchy.

J.G. Kalbfleisch (1985). Probability and Statistical Inference, Volume 2. Springer.
Treats likelihood methods.

L.M. Le Cam (1986). Asymptotic Methods in Statistical Decision Theory. Springer.
Treats decision theory on a very abstract level.

E.L. Lehmann (1983). Theory of Point Estimation. Wiley.
A klassiker. The lecture notes partly follow this book.

E.L. Lehmann (1986). Testing Statistical Hypotheses, 2nd edition. Wiley.
Goes with the previous book.

J.A. Rice (1994). Mathematical Statistics and Data Analysis, 2nd edition. Duxbury Press.
A more elementary book.

M.J. Schervish (1995). Theory of Statistics. Springer.
Mathematically exact and quite general. Also good as reference book.

R.J. Serfling (1980). Approximation Theorems of Mathematical Statistics. Wiley.
Treats asymptotics.

A.W. van der Vaart (1998). Asymptotic Statistics. Cambridge University Press.
Treats modern asymptotics and e.g. semiparametric theory.