
STAT 426

Lecture 20
Fall 2012
Arash A. Amini

November 20, 2012

1 / 37

Outline

- Sufficiency
  - What does it really mean?
  - Sufficient statistic is not unique
  - Minimal sufficient
  - Order statistics are sufficient for iid data
- Exponential families
  - More examples
  - Some properties
  - MLE in exponential families
- Back to sufficiency
  - MLE and sufficiency
  - Rao-Blackwell theorem

What does sufficiency mean?

Here is an example to give you some intuition:
- Consider iid Bernoulli trials first:
  X_1, ..., X_n ~ iid Ber(θ)
- Let us think of coin tossing:
  X_i = 1 if heads is observed in the ith throw,
  X_i = 0 if tails is observed in the ith throw.
- Let x = (x_1, ..., x_n). The joint pmf is
  p_θ(x) = ∏_{i=1}^n θ^{x_i} (1 − θ)^{1 − x_i} = θ^{Σ_i x_i} (1 − θ)^{n − Σ_i x_i} = g(Σ_i x_i, θ).
- By the factorization theorem, T(X) = Σ_{i=1}^n X_i is sufficient for θ.

What does sufficiency mean?

- Suppose you observe the sequence (n = 20)
  X = (0 1 1 1 1 0 0 1 0 1 1 1 0 0 1 1 1 0 0 0)
- Since T(X) = 11 is sufficient for θ, we can keep 11 and discard the sequence.
- Any estimator that we can form based on the information in X can be matched in performance by an estimator based on T(X).
  (This is formalized in the Rao-Blackwell theorem.)
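The reduction above is easy to check numerically; a minimal sketch (the sequence is from the slide, the variable names are mine):

```python
# Sufficiency reduction for iid Bernoulli: keep T = sum(x), discard the order.
x = [0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0]  # n = 20

T = sum(x)              # sufficient statistic
theta_hat = T / len(x)  # the MLE depends on the data only through T

print(T)          # 11
print(theta_hat)  # 0.55
```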

What does sufficiency mean?

Now consider the following modification:
- {X_i} are still independent, but they are not identically distributed.
- Say n = 2k is even.
- Assume that X_i ~ Ber(θ) for i = 1, 3, 5, ..., n − 1.
- For i = 2, 4, ..., n, in each trial we toss the coin twice and record 1 if we observe at least one heads:
  X_i = 1 if HH, HT, or TH is observed in the ith throw.
- P(X_i = 1) = 1 − (1 − θ)² = θ(2 − θ).
- We have X_i ~ Ber(θ(2 − θ)) for i = 2, 4, ..., n.
- The joint pmf is
  p_θ(x) = ∏_{i odd} θ^{x_i} (1 − θ)^{1 − x_i} · ∏_{i even} [θ(2 − θ)]^{x_i} [(1 − θ)²]^{1 − x_i}.
- Let t_1 := Σ_{i odd} x_i and t_2 := Σ_{i even} x_i. Then
  p_θ(x) = θ^{t_1 + t_2} (2 − θ)^{t_2} (1 − θ)^{(k − t_1) + 2(k − t_2)} = θ^{t_1 + t_2} (2 − θ)^{t_2} (1 − θ)^{3k − t_1 − 2t_2}.

What does sufficiency mean?

- We have
  p_θ(x) = θ^{t_1 + t_2} (2 − θ)^{t_2} (1 − θ)^{3k − t_1 − 2t_2} = g(t_1, t_2, θ).
- We conclude that (T_1, T_2) is sufficient, where T_1 = Σ_{i odd} X_i and T_2 = Σ_{i even} X_i.
  (Note: (T, T_1) is also sufficient, where T = T_1 + T_2.)
- For example, if we observe the same sequence
  X = (0 1 1 1 1 0 0 1 0 1 1 1 0 0 1 1 1 0 0 0)
  it is enough to keep (T_1, T_2) = (5, 6) and discard the sequence, if we only care about estimating θ.
- In an intuitive sense, with this model there is more information in the same sequence for estimating θ than just its number of ones.

Exercise. Find the MLE of θ in this problem.
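As a numerical check on the exercise, the log-likelihood from the slide can be maximized directly; a sketch assuming the observed values (t_1, t_2) = (5, 6) with k = 10 from the slide (the grid-search approach is mine, a root-finder would also work):

```python
import math

# Log-likelihood from the slide:
# l(theta) = (t1 + t2) log(theta) + t2 log(2 - theta) + (3k - t1 - 2*t2) log(1 - theta)
t1, t2, k = 5, 6, 10

def loglik(theta):
    return ((t1 + t2) * math.log(theta)
            + t2 * math.log(2 - theta)
            + (3 * k - t1 - 2 * t2) * math.log(1 - theta))

# Grid search over the open interval (0, 1).
grid = [i / 10000 for i in range(1, 10000)]
theta_hat = max(grid, key=loglik)
print(round(theta_hat, 2))  # compare with T/n = 11/20 = 0.55 under the iid model
```

The maximizer differs from the plain proportion of ones, reflecting that this model extracts more information from the same sequence.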


Sufficient statistic is not unique

- If T is sufficient and T = h(T̃), then T̃ is also sufficient.
  This follows easily from the factorization theorem. (Exercise)
- Example: In the Bernoulli problem, T = Σ_{i=1}^n X_i is sufficient.
- Since T = X_1 + Σ_{i=2}^n X_i, the pair (X_1, Σ_{i=2}^n X_i) is also sufficient. [Take h(u, v) = u + v.]
- Similarly, (X_1, X_2, Σ_{i=3}^n X_i) is also sufficient,
- ... and also (X_1, X_2, X_3, Σ_{i=4}^n X_i).

Minimal sufficiency

- If T is sufficient and T = h(T̃), then T̃ is also sufficient.
- Example: In the Bernoulli problem, T = Σ_{i=1}^n X_i is sufficient.
- Since T = X_1 + Σ_{i=2}^n X_i, the pair (X_1, Σ_{i=2}^n X_i) is also sufficient.
- Similarly, (X_1, X_2, Σ_{i=3}^n X_i) is also sufficient, ... and also (X_1, X_2, X_3, Σ_{i=4}^n X_i).
- These progressively carry more information.
  (They are in a sense more than sufficient.)
- There is a hierarchy (ordering) of sufficient statistics.
- Intuitively, Σ_{i=1}^n X_i carries the minimal amount of information sufficient for estimating θ. (Minimal sufficient)


Sufficiency of order statistic in iid data

- The entire data (X_1, ..., X_n) is always sufficient.
  (In a sense, this is the maximal sufficient statistic.)
- For iid data, we can discard the order information:
  the order statistic is sufficient for iid data.
- To obtain the order statistic of (X_1, ..., X_n), we sort them and denote them as
  X_(1) ≤ X_(2) ≤ ··· ≤ X_(n).
- Example: (X_1, X_2, X_3) = (2, 0.4, 1.5) has order statistic
  (X_(1), X_(2), X_(3)) = (0.4, 1.5, 2),
  as does (X_1, X_2, X_3) = (1.5, 2, 0.4).

Sufficiency of order statistic in iid data

- Example: (X_1, X_2, X_3) = (2, 0.4, 1.5) has order statistic
  (X_(1), X_(2), X_(3)) = (0.4, 1.5, 2),
  as does (X_1, X_2, X_3) = (1.5, 2, 0.4).
- Why is the order statistic enough? By the factorization theorem,
  f_θ(x) = ∏_{i=1}^n f_θ(x_i) = ∏_{i=1}^n f_θ(x_(i)).
- Concrete example:
  f_θ(2, 0.4, 1.5) = f_θ(2) f_θ(0.4) f_θ(1.5) = f_θ(0.4) f_θ(1.5) f_θ(2).
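The permutation invariance of the iid likelihood can be verified directly; a sketch assuming a standard normal density purely for illustration (the helper names are mine):

```python
import math

# The iid likelihood depends on the sample only through its sorted values:
# reordering the data leaves the product of densities unchanged.
def normal_pdf(x, mu=0.0, sigma=1.0):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def likelihood(xs, mu):
    p = 1.0
    for x in xs:
        p *= normal_pdf(x, mu)
    return p

data = [2.0, 0.4, 1.5]        # the sample from the slide
permuted = [1.5, 2.0, 0.4]    # same order statistic
print(math.isclose(likelihood(data, 1.0), likelihood(permuted, 1.0)))
```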


More on exponential families

Recall: a 1-parameter exponential family in canonical form is
  f_η(x) = exp[η T(x) − A(η)] h(x).
- T(x) is sufficient by factorization.
- η is the canonical parameter.
- e^{A(η)} = ∫ exp[η T(x)] h(x) dx, since we need ∫ f_η(x) dx = 1.
- E = {η : A(η) < ∞} is the natural parameter space.
- We can cook up different distributions by choosing T(x) and h(x).

Cooking up exponential families (1-param)

- Say we want a distribution on [−1, 1].
- We can set up a model by taking
  T(x) = x,  h(x) = 1{−1 ≤ x ≤ 1}.
- We can compute A(η) explicitly:
  e^{A(η)} = ∫_{−1}^{1} e^{ηx} dx = (1/η)(e^η − e^{−η}) = 2 sinh(η)/η.
- The density is
  f_η(x) = exp[ηx − log(2 sinh(η)/η)] 1{−1 ≤ x ≤ 1} ∝ exp(ηx) 1{−1 ≤ x ≤ 1}.
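The closed form for the normalizer agrees with direct numerical integration; a quick sketch (the choice η = 1.2 and the midpoint rule are mine):

```python
import math

# Check the normalizer: e^{A(eta)} = integral of e^{eta*x} over [-1, 1]
#                                  = 2*sinh(eta)/eta.
eta = 1.2  # arbitrary parameter value

closed_form = 2 * math.sinh(eta) / eta

# Midpoint-rule integration over [-1, 1].
m = 100000
h = 2.0 / m
numeric = sum(math.exp(eta * (-1 + (i + 0.5) * h)) * h for i in range(m))

print(abs(closed_form - numeric) < 1e-6)
```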

Cooking up exponential families (1-param)

[Plot of f_η(x) ∝ exp(ηx) 1{−1 ≤ x ≤ 1} for several values of η.]

Cooking up exponential families (2-param)

- We can add more parameters = more modelling ability.
- Say we take
  T(x) = (x, x²),  h(x) = 1{−1 ≤ x ≤ 1}.
- Let η = (η_1, η_2). The density is given by (up to a constant)
  f_η(x) ∝ exp(η_1 x + η_2 x²) 1{−1 ≤ x ≤ 1}.
- It is more cumbersome to compute the normalizing constant e^{A(η)} in this case.

Cooking up exponential families (2-param)

[Plots of f_η(x) ∝ exp(η_1 x + η_2 x²) 1{−1 ≤ x ≤ 1} for several values of η = (η_1, η_2).]

Cooking up exponential families (2-D, 5-param)

- Consider now two random variables (U, V) ∈ [0, 1]².
- Let x = (u, v) and
  T(x) = (u², v², u, v, uv),  h(x) = 1{0 ≤ u, v ≤ 1}.
- The density is
  f_η(x) ∝ exp(η_1 u² + η_2 v² + η_3 u + η_4 v + η_5 uv) 1{0 ≤ u, v ≤ 1}.
- Computing A(η) explicitly is going to cause a headache.
  (For fixed η, it can be done by numerical integration.)
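The numerical-integration remark can be sketched as follows, using one of the parameter settings from the plots on the following slides (the grid size is mine):

```python
import math

# For fixed eta, the normalizer e^{A(eta)} of the 5-parameter family on [0,1]^2
# can be approximated by a 2-D midpoint rule.
eta = (2.0, 2.0, -0.5, -0.5, -1.0)  # setting (a) from the plots

def unnormalized(u, v):
    e1, e2, e3, e4, e5 = eta
    return math.exp(e1*u*u + e2*v*v + e3*u + e4*v + e5*u*v)

m = 400          # grid points per axis
h = 1.0 / m
Z = sum(unnormalized((i + 0.5) * h, (j + 0.5) * h) * h * h
        for i in range(m) for j in range(m))
print(Z > 0 and math.isfinite(Z))
```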

Cooking up exponential families (2-D, 5-param)

[Surface plots of f_η with h(x) = 1{0 ≤ u, v ≤ 1}: (a) η = (2, 2, −0.5, −0.5, −1); (b) η = (−0.5, −0.5, 0.5, 0.5, −4).]

Cooking up exponential families (2-D, 5-param)

[Surface plots of f_η with h(x) = 1{0 ≤ u, v ≤ 1}: (c) η = (2, −2, 2, 2, 1); (d) η = (−2, −2, 2, 2, 1).]

Cooking up exponential families (2-D, 5-param)

[Surface plots of f_η with h(x) = 1{0 ≤ u, v ≤ 1}: (e) η = (2, −2, 2, −2, 1); (f) η = (0.1, −3, −4, −3, 10).]

Cooking up exponential families (2-D, 5-param)

[Surface plots of f_η with h(x) = 1{−1 ≤ u, v ≤ 1}: (g) η = (1, 0, 0, 0, −0.5); (h) η = (2, −1, 0.1, 0.1, −0.1).]

Cooking up exponential families (2-D, 5-param)

[Surface plots of f_η with h(x) = 1{−1 ≤ u, v ≤ 1}: (i) η = (0, 0, 0, 0, −1.5); (j) η = (0, −1, 0, 0, −0.5).]

Cooking up exponential families (2-D, 5-param)

[Surface plots of f_η with h(x) = 1{−1 ≤ u, v ≤ 1}: (k) η = (0.5, −0.5, 0.5, −0.5, 1); (l) η = (0, −0.5, 0, 0, 1).]

More examples of well-known exponential families

- The Poisson distribution has pmf p_λ(x) = e^{−λ} λ^x / x!, for x = 0, 1, 2, ...
- It can be rewritten as
  p_λ(x) = exp(x log λ − λ) · (1/x!),  with h(x) = 1/x!.
- The canonical parameter is η = log λ, with A(η) = e^η and T(x) = x. After reparametrization,
  p_η(x) = (1/x!) exp(ηx − e^η).
- The natural parameter space is E = (−∞, ∞).


Properties of exponential families (mean)

- Consider the 1-parameter case. Then we have
  E_η[T(X)] = A′(η).
- Easy to prove by differentiating e^{A(η)} = ∫ e^{η T(x)} h(x) dx.
- Example: for X ~ Poi(λ), we get E[X] = (e^η)′ = e^η = λ.
- The mgf is easy to obtain too:
  M_T(t) := E_η[e^{t T(X)}] = e^{A(t + η) − A(η)}.
- The cumulant generating function (cgf) is
  K_T(t) := log M_T(t) = A(t + η) − A(η).

Properties of exponential families (cgf)

- The cumulant generating function (cgf) is
  K_T(t) = log M_T(t) = A(t + η) − A(η).
- Cumulants: κ_m(T) = (d^m/dt^m) K_T(t) |_{t=0}.
- One can show that
  κ_1(T) = E[T] = A′(η),
  κ_2(T) = var(T) = A″(η).
- For X ~ Poi(λ), this gives an easy proof of
  var(X) = (e^η)″ = e^η = λ.
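These identities are easy to check numerically in the Poisson case, since A(η) = e^η; a sketch using central finite differences for A′ and A″ (the step size and tolerances are mine):

```python
import math

# For Poisson in canonical form, A(eta) = e^eta, so A'(eta) = A''(eta) = e^eta = lam.
# Compare finite differences of A against the exact mean and variance.
def A(eta):
    return math.exp(eta)

eta = math.log(3.0)
lam = math.exp(eta)  # = 3.0
h = 1e-4

A1 = (A(eta + h) - A(eta - h)) / (2 * h)             # ~ A'(eta)  = E[T]
A2 = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2   # ~ A''(eta) = var(T)

print(abs(A1 - lam) < 1e-6, abs(A2 - lam) < 1e-5)  # True True
```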


MLE in exponential families

- Consider the 1-param canonical case f_η(x) = h(x) exp[η T(x) − A(η)].
- The joint density based on n samples is
  f_η(x) = exp{η Σ_{i=1}^n T(x_i) − n A(η)} ∏_{i=1}^n h(x_i).
- Let ℓ_x(η) = log f_η(x). Differentiating,
  ∂ℓ_x(η)/∂η = 0  ⟹  Σ_i T(x_i) − n A′(η) = 0.
- The MLE is obtained by solving A′(η) = (1/n) Σ_i T(X_i),
  equivalently E_η[T(X)] = (1/n) Σ_i T(X_i).
- The MLE coincides with the MOM.
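For the Poisson family this equation has a closed form, since A′(η) = e^η; a sketch (the data are made up for illustration):

```python
import math

# For Poisson, A(eta) = e^eta, so the MLE solves e^eta = (1/n) * sum(x_i),
# i.e. eta_hat = log(x_bar), equivalently lam_hat = x_bar (MLE = MOM).
data = [2, 0, 3, 1, 4, 2, 2, 5, 1, 0]
x_bar = sum(data) / len(data)

eta_hat = math.log(x_bar)
lam_hat = math.exp(eta_hat)

print(math.isclose(lam_hat, x_bar))  # True
```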

What is not an example of an exponential family?

- The shifted exponential f_{θ,λ}(x) = λ e^{−λ(x−θ)} 1{x ≥ θ}.
- Also the double exponential (Laplace) with shift,
  f_{θ,λ}(x) = (λ/2) exp(−λ|x − θ|).
  (It is one if you take θ = 0.)
- Uniform[0, θ].


Sufficiency and MLE

- For any sufficient statistic T = T(X), the MLE is a function of T(X).
- Immediate from the factorization theorem:
  argmax_θ f_θ(X) = argmax_θ [g(T, θ) h(X)] = argmax_θ g(T, θ).
- Knowing a sufficient statistic is enough for computing the MLE.
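A small illustration in the Bernoulli problem: two samples with the same value of T necessarily give the same MLE (the samples are made up):

```python
# Two Bernoulli samples with the same sufficient statistic T = sum(x)
# yield the same MLE theta_hat = T/n, whatever the ordering of the data.
x1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
x2 = [0, 0, 0, 1, 1, 1, 0, 1, 0, 0]

def mle(x):
    return sum(x) / len(x)

print(sum(x1) == sum(x2))   # same T
print(mle(x1) == mle(x2))   # hence same MLE
```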

Rao-Blackwell Theorem

Formalizing: a sufficient statistic T captures all the information for estimating θ.

Theorem (Rao-Blackwell)
- Let θ̂ be some estimator with E[θ̂²] < ∞ (finite 2nd moment).
- Let T be sufficient for θ.
- Let θ̃ = E[θ̂ | T].
- Then
  E(θ̃ − θ)² ≤ E(θ̂ − θ)².
- The inequality is strict unless θ̃ = θ̂ (a.s. P_θ).

Proof of Rao-Blackwell

- Let θ̃ = E[θ̂ | T]. We want to show that MSE(θ̃) ≤ MSE(θ̂).
- Here is the proof:
  - E[θ̃] = E[θ̂] (by the tower property).
  - Since MSE = var + (bias)², we need to compare variances only.
  - The law of total variance implies
    var(θ̂) = var(E(θ̂ | T)) + E[var(θ̂ | T)] ≥ var(E(θ̂ | T)) = var(θ̃).
- We are done.
- Where did we use that T is sufficient, then?

Rao-Blackwell example

- Bernoulli problem as usual: X_1, ..., X_n ~ iid Ber(θ). (n ≥ 5)
- For some reason, someone proposes θ̂ = X_1 + 2X_5.
- T = Σ_{i=1}^n X_i is sufficient.
- Rao-Blackwell says that θ̃ = E[θ̂ | T] does better.
- By symmetry, E[X_j | T] does not depend on j; hence E[X_j | T] = h(T).
- Now T = E[T | T] = Σ_{j=1}^n E[X_j | T] = n h(T).
- We have shown that h(T) = T/n = X̄_n.
- We have
  θ̃ = E[X_1 | T] + 2 E[X_5 | T] = 3 h(T) = (3/n) T.
- Does this really have a lower MSE? Exercise.
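A Monte Carlo check of the exercise; a sketch comparing the two estimators' MSE against θ (the values of n, θ, the seed, and the number of replications are mine):

```python
import random

# Monte Carlo comparison: the Rao-Blackwellized estimator 3T/n should have
# lower MSE about theta than the proposed estimator X1 + 2*X5.
random.seed(0)
n, theta, reps = 20, 0.3, 20000

mse_hat = mse_tilde = 0.0
for _ in range(reps):
    x = [1 if random.random() < theta else 0 for _ in range(n)]
    t = sum(x)
    hat = x[0] + 2 * x[4]   # proposed estimator X1 + 2*X5
    tilde = 3 * t / n       # its conditional expectation given T
    mse_hat += (hat - theta) ** 2 / reps
    mse_tilde += (tilde - theta) ** 2 / reps

print(mse_tilde < mse_hat)
```

Note that both estimators share the same bias (each has expectation 3θ), so the improvement comes entirely from the variance reduction in the proof.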
