
ST 522: Statistical Theory II

Subhashis Ghoshal,
North Carolina State University
Useful Results from Calculus
We recapitulate some facts from calculus we need throughout.
Theorem (Binomial theorem)
$(a+b)^n = \binom{n}{0}a^n b^0 + \binom{n}{1}a^{n-1}b^1 + \cdots + \binom{n}{n-1}a^1 b^{n-1} + \binom{n}{n}a^0 b^n.$
Common infinite series
Geometric series
$a + ar + \cdots + ar^{n-1} = a\,\frac{r^n - 1}{r - 1} = a\,\frac{1 - r^n}{1 - r}, \quad r \neq 1.$
Infinite geometric series
$a + ar + ar^2 + \cdots = \frac{a}{1 - r}, \quad |r| < 1.$
$(1 - x)^{-1} = 1 + x + x^2 + \cdots, \quad |x| < 1$
$(1 + x)^{-1} = 1 - x + x^2 - x^3 + \cdots, \quad |x| < 1.$
Common infinite series (contd.)
Infinite binomial series
$(1 - x)^{-2} = 1 + 2x + 3x^2 + 4x^3 + \cdots, \quad |x| < 1,$
$(1 - x)^{-r} = 1 + \sum_{n=1}^{\infty} \binom{r+n-1}{n} x^n, \quad |x| < 1,$ where for any real number $\alpha$, $\binom{\alpha}{n} = \alpha(\alpha - 1)\cdots(\alpha - n + 1)/n!$, the generalized binomial coefficient. In particular, $\binom{r+n-1}{n} = r(r+1)\cdots(r+n-1)/n!$. Also note that for $\alpha > 0$, $\binom{-\alpha}{r} = (-1)^r \frac{\alpha(\alpha+1)\cdots(\alpha+r-1)}{r!}$.
Exponential series
$e^x = 1 + \frac{x}{1!} + \frac{x^2}{2!} + \cdots$
Logarithmic series
$\log(1 + x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \cdots, \quad |x| < 1$
Useful limits
$\lim_{n\to\infty} (1 + 1/n)^n = e.$
$\lim_{n\to\infty} (1 + \lambda_n/n)^n = e^{\lambda}$ for any $\lambda_n \to \lambda$.
$\lim_{x\to 0} (1 + ax)^{1/x} = e^a.$
$\lim_{x\to 0} \frac{\log(1 + x)}{x} = 1.$
$\lim_{x\to 0} \frac{\sin x}{x} = 1.$
Derivatives
$\frac{d}{dx} x^n = n x^{n-1}.$
$\frac{d}{dx} e^{ax} = a e^{ax}.$
$\frac{d}{dx} a^x = a^x \log a.$
$\frac{d}{dx} \log x = 1/x.$
$\frac{d}{dx} \sin x = \cos x.$
$\frac{d}{dx} \cos x = -\sin x.$
$\frac{d}{dx} \tan x = 1 + \tan^2 x.$
$\frac{d}{dx} \sin^{-1} x = 1/\sqrt{1 - x^2}.$
$\frac{d}{dx} \tan^{-1} x = \frac{1}{1 + x^2}.$
$\frac{d}{dx} (a f(x) + b g(x)) = a f'(x) + b g'(x).$
$\frac{d}{dx} f(x) g(x) = f'(x) g(x) + f(x) g'(x).$
$\frac{d}{dx} (f(x)/g(x)) = \frac{f'(x) g(x) - f(x) g'(x)}{g^2(x)}.$
$\frac{d}{dx} f(g(x)) = f'(g(x)) g'(x).$
Integration
$\int x^n \, dx = \frac{x^{n+1}}{n+1}, \quad n \neq -1.$
$\int x^{-1} \, dx = \log x.$
$\int e^{ax} \, dx = e^{ax}/a, \quad a \neq 0.$
$\int \frac{f'(x)}{f(x)} \, dx = \log f(x).$
Integration by substitution
$\int g(f(x)) f'(x) \, dx = \int g(y) \, dy, \quad y = f(x).$
Integration by parts
$\int u(x) v(x) \, dx = u(x) V(x) - \int V(x) u'(x) \, dx,$
where $V(x) = \int v(x) \, dx$; $u(x)$ is called the first function and $v(x)$ the second.
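These rules can be spot-checked with a computer algebra system. Below is a minimal sketch (assuming Python with the sympy package is available) verifying one derivative and one antiderivative from the tables above.

```python
import sympy as sp

x = sp.symbols('x')
print(sp.diff(sp.tan(x), x))          # tan(x)**2 + 1, i.e. 1 + tan^2(x)
print(sp.diff(sp.asin(x), x))         # 1/sqrt(1 - x**2)
print(sp.integrate(sp.exp(2*x), x))   # exp(2*x)/2, matching e^{ax}/a with a = 2
```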
Integration (contd.)
Integration by partial fractions
Applies while integrating the ratio of two polynomials P(x) and Q(x), where without loss of generality the degree of P is less than the degree of Q. Factorize Q(x) into linear and quadratic factors. The ratio can then be written uniquely as a linear combination of reciprocals of the linear factors and of linear-over-quadratic terms. The resulting expression can be integrated term by term. Consult any standard calculus text such as Apostol.
Definite integral
$\int_a^b f(x) \, dx = F(x) \big|_a^b = F(b) - F(a),$
where $F(x) = \int f(x) \, dx$.
Order Statistics
Given a random sample, we are interested in the smallest, largest, or middle observations.
the highest flood waters
the lowest winter temperature recorded in the last 50 years
the median price of houses sold in the last month
the median salary of NBA players
Definition: Given a random sample $X_1, \ldots, X_n$, the sample order statistics are the sample values placed in ascending order,
$X_{(1)} = \min_{1 \le i \le n} X_i,$
$X_{(2)} =$ second smallest $X_i$,
$\ldots$
$X_{(n)} = \max_{1 \le i \le n} X_i.$
Example: Suppose four numbers are observed as a sample of size 4. The sample values are $x_1 = 6$, $x_2 = 9$, $x_3 = 3$, $x_4 = 8$. What are the order statistics?
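As a quick illustration of the definition, the following sketch (assuming Python with numpy) sorts the example sample to obtain the realized order statistics.

```python
import numpy as np

x = np.array([6, 9, 3, 8])        # the sample from the example above
order_stats = np.sort(x)          # x_(1) <= x_(2) <= x_(3) <= x_(4)
print(order_stats)                # [3 6 8 9]
```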
Order Statistics (contd.)
Order statistics are random variables themselves (as functions of a random sample).
Order statistics satisfy $X_{(1)} \le \cdots \le X_{(n)}$.
Though the samples $X_1, \ldots, X_n$ are independently and identically distributed, the order statistics $X_{(1)}, \ldots, X_{(n)}$ are never independent because of the order restriction.
We will study their marginal distributions and joint distributions.
Order Statistics - Marginal distributions
Assume $X_1, \ldots, X_n$ are from a continuous population with cdf $F(x)$ and pdf $f(x)$.
The $n$th order statistic, or the sample maximum, $X_{(n)}$ has the pdf
$f_{X_{(n)}}(x) = n [F(x)]^{n-1} f(x).$
The first order statistic, or the sample minimum, $X_{(1)}$ has the pdf
$f_{X_{(1)}}(x) = n [1 - F(x)]^{n-1} f(x).$
More generally, the $j$th order statistic $X_{(j)}$ has the pdf
$f_{X_{(j)}}(x) = \frac{n!}{(j-1)!(n-j)!} f(x) [F(x)]^{j-1} [1 - F(x)]^{n-j}.$
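Integrating the pdf of the maximum gives $P(X_{(n)} \le x) = [F(x)]^n$, which can be checked by simulation. A minimal sketch for the Unif(0, 1) case, with an arbitrarily chosen $n$ and evaluation point:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 5, 100_000
maxima = rng.uniform(size=(reps, n)).max(axis=1)

x = 0.8
print(np.mean(maxima <= x))   # empirical P(X_(n) <= x)
print(x**n)                   # theoretical [F(x)]^n = x^n for Unif(0, 1)
```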
Order Statistics - Joint distributions
For $1 \le i < j \le n$, the joint pdf of $X_{(i)}$ and $X_{(j)}$ is
$f_{X_{(i)}, X_{(j)}}(u, v) = \frac{n!}{(i-1)!(j-i-1)!(n-j)!} f(u) f(v) [F(u)]^{i-1} [F(v) - F(u)]^{j-i-1} [1 - F(v)]^{n-j}$
if $-\infty < u < v < \infty$; $= 0$ otherwise.
Special case: Joint pdf of $X_{(1)}$ and $X_{(n)}$.
The joint pdf of $X_{(1)}, \ldots, X_{(n)}$ is
$f_{X_{(1)}, \ldots, X_{(n)}}(u_1, \ldots, u_n) = n! \, f(u_1) \cdots f(u_n) \, 1\{-\infty < u_1 < \cdots < u_n < \infty\}.$
Illustration
Example: $X_1, \ldots, X_n$ are iid from unif$[0, 1]$.
Show that $X_{(j)} \sim \mathrm{Beta}(j, n + 1 - j)$.
Compute $E[X_{(j)}]$ and $\mathrm{Var}[X_{(j)}]$.
The joint pdf of $X_{(1)}$ and $X_{(n)}$.
Let $n = 5$. Derive the joint pdf of $X_{(2)}$ and $X_{(4)}$.
$X_{(1)} \,|\, X_{(n)} \sim X_{(n)} \, \mathrm{Beta}(1, n - 1)$.
For any $i < j$, $X_{(i)} \,|\, X_{(j)} \sim X_{(j)} \, \mathrm{Beta}(i, j - i)$.
Let $n = 5$. Derive the joint pdf of $X_{(1)}, \ldots, X_{(5)}$.
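The claim $X_{(j)} \sim \mathrm{Beta}(j, n+1-j)$ can be checked by comparing simulated moments with the Beta moments. A minimal sketch (the choices $n = 5$, $j = 2$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, j, reps = 5, 2, 200_000
xj = np.sort(rng.uniform(size=(reps, n)), axis=1)[:, j - 1]  # j-th order statistic

a, b = j, n + 1 - j                              # Beta(j, n+1-j) parameters
print(xj.mean(), a / (a + b))                    # both ~ 1/3
print(xj.var(), a * b / ((a + b)**2 * (a + b + 1)))
```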
Example
Compute $P(X_{(1)} > 1, X_{(n)} \le 2)$.
$P(X_{(1)} > x, X_{(n)} \le y) = \prod_{i=1}^n P(x < X_i \le y) = [F(y) - F(x)]^n.$
Common statistics based on order statistics
sample range: $R = X_{(n)} - X_{(1)}$
sample midrange: $V = (X_{(n)} + X_{(1)})/2$
sample median:
$M = X_{((n+1)/2)}$ if $n$ is odd; $M = (X_{(n/2)} + X_{(n/2+1)})/2$ if $n$ is even.
sample percentile: For any $0 < p < 1$, the $(100p)$th sample percentile is the observation such that about $np$ of the observations are less than this observation and $n(1-p)$ of the observations are larger.
the sample median $M$ is the 50th sample percentile (the second sample quartile)
denote by $Q_1$ the 25th sample percentile (the first sample quartile)
denote by $Q_3$ the 75th sample percentile (the third sample quartile)
interquartile range $\mathrm{IQR} = Q_3 - Q_1$ (describing the spread about the median)
Remarks
Sample Mean vs Sample Median
Sample Median vs Population Median
Principles of data reduction
Data $X = (X_1, \ldots, X_n)$: Probability distribution $P$ completely or partially unknown.
Distribution often modeled by standard ones such as Poisson, normal.
A few parameters control the distribution of the data: $P = P_\theta$.
Parameter $\theta$: unknown, object of interest.
Inference: Any conclusion about parameter values based on data.
Three main inference problems: point estimation, hypothesis testing, interval estimation.
Statistic $T = T(X)$: Any function of data. A summary measure of the data.
Statistics may be used as point estimators, test statistics, upper and lower confidence limits.
Inductive reasoning
Role of probability theory: Extent of randomness of $T$ controlled by $\theta$. Probabilistic characteristics such as expectation, variance, moments, distribution involve $\theta$.
Conversely, the value of $T$ reflects knowledge about $\theta$. For instance, if $T$ has expectation $\theta$ and $\theta$ is unknown, then $\theta$ can be estimated by $T$. Intuitively, if we observe a large value of $T$, we tend to conclude that $\theta$ must be large.
Need to assess the extent of the error.
Frequentist approach: Randomness of error means we need to judge based on average error over repeated sampling. Thus we need to study the sampling distribution of $T$.
Sufficiency
As $T$ summarizes the data $X$, the first natural question is whether there is any loss of information due to summarization.
Data contain many pieces of information, some relevant for $\theta$ and some not.
Dropping irrelevant information is desirable, but dropping relevant information is undesirable.
How to compare the amount of information about $\theta$ in the data and in $T$? Is it sufficient to consider only the reduced data $T$?
Definition (Sufficient statistic)
A statistic $T$ is called sufficient if the conditional distribution of $X$ given $T$ is free of $\theta$ (that is, the conditional distribution is completely known).
Example
Toss a coin 100 times. The probability of head p is unknown.
T = number of heads obtained.
Sufficiency principle
If $T$ is sufficient, the extra information carried by $X$ is worthless as far as $\theta$ is concerned. It is then only natural to consider inference procedures which do not use this extra irrelevant information. This leads to the principle of sufficiency.
Definition (Sufficiency principle)
Any inference procedure should depend on the data only through a sufficient statistic.
How to check sufficiency?
Theorem (Neyman-Fisher Factorization theorem)
$T$ is sufficient iff $f(x; \theta)$ can be written as the product $g(T(x); \theta) h(x)$, where the first factor depends on $x$ only through $T(x)$ and the second factor is free of $\theta$.
Example
$X_1, \ldots, X_n$ iid:
$N(\theta, 1)$.
$\mathrm{Bin}(1, \theta)$.
$\mathrm{Poi}(\lambda)$.
$N(\mu, \sigma^2)$, $\theta = (\mu, \sigma)$.
$\mathrm{Ga}(\alpha, \beta)$, $\theta = (\alpha, \beta)$. (Includes exponential.)
$U(0, \theta)$, range of $X$ depends on $\theta$.
Exponential family
$f(x; \theta) = c(\theta) h(x) \exp[\sum_{j=1}^k w_j(\theta) t_j(x)]$, $\theta = (\theta_1, \ldots, \theta_d)$, $d \le k$.
Theorem
Let $X_1, \ldots, X_n$ be iid observations from the above exponential family. Then $T(X) = (\sum_{i=1}^n t_1(X_i), \ldots, \sum_{i=1}^n t_k(X_i))$ is sufficient for $\theta = (\theta_1, \ldots, \theta_d)$.
Applications
beta$(\alpha, \beta)$.
Curved exponential family: $N(\mu, \mu^2)$.
Old examples revisited: binomial, Poisson, normal, exponential, gamma (except uniform). Exercise.
More applications
Discrete uniform. $P(X = x) = 1/\theta$, $x = 1, \ldots, \theta$, $\theta$ a positive integer.
$f(x, \theta) = e^{-(x - \theta)}$, $x > \theta$.
A universal example: iid with density $f$. The vector of order statistics $T = (X_{(1)}, \ldots, X_{(n)})$ is sufficient.
Remarks
In the order statistics example, the dimension of $T$ is the same as the dimension of the data. Still this is a nontrivial reduction, as $n!$ different values of the data correspond to one value of $T$.
Often one finds better reductions for specific parametric families, as seen in the many examples before.
Trivially $X$ is always sufficient for itself, but there is no gain.
When one statistic is a mathematical function of the other and vice versa (i.e., there is a one-to-one correspondence), then they carry exactly the same amount of information, so they are equivalent.
More generally, if $T$ is sufficient for $\theta$ and $T = c(U)$, a mathematical function of some other statistic $U$, then $U$ is also sufficient.
Examples of insufficiency
$X_1, X_2$ iid $\mathrm{Poi}(\lambda)$. $T = X_1 - X_2$ is not sufficient.
$X_1, \ldots, X_n$ iid pmf $f(x; \theta)$. $T = (X_1, \ldots, X_{n-1})$ is not sufficient.
Minimal sufficiency
Maximum possible reduction.
Definition (Minimal sufficient statistic)
$T$ is a minimal sufficient statistic if, given any other sufficient statistic $T'$, there is a function $c(\cdot)$ such that $T = c(T')$.
Equivalently, $T$ is minimal sufficient if, given any other sufficient statistic $T'$, whenever $x$ and $y$ are two data values such that $T'(x) = T'(y)$, then $T(x) = T(y)$.
Checking minimal sufficiency
Theorem (Lehmann-Scheffé Theorem)
A statistic $T$ is minimal sufficient if the following property holds: For any two sample points $x$ and $y$, $f(x; \theta)/f(y; \theta)$ does not depend on $\theta$ if and only if $T(x) = T(y)$.
Corollary
A minimal sufficient statistic is not unique. But any two are in one-to-one correspondence, so they are equivalent.
Examples
iid $N(\mu, \sigma^2)$.
iid $U(\theta, \theta + 1)$.
iid Cauchy$(\theta)$.
iid $U(-\theta, \theta)$.
Minimal sufficiency in exponential family
Theorem
For iid observations from an exponential family
$f(x; \theta) = c(\theta) h(x) \exp[\sum_{j=1}^k w_j(\theta) t_j(x)],$
such that no affine (linear plus constant) relationship exists between $w_1(\theta), \ldots, w_k(\theta)$, the statistic $T(X) = (\sum_{i=1}^n t_1(X_i), \ldots, \sum_{i=1}^n t_k(X_i))$ is minimal sufficient for $\theta = (\theta_1, \ldots, \theta_d)$.
Examples
$N(\mu, \sigma^2)$.
$\mathrm{Ga}(\alpha, \beta)$.
$\mathrm{Be}(\alpha, \beta)$.
$N(\mu, \mu^2)$.
$\mathrm{Be}(\theta, 1 - \theta)$, $0 < \theta < 1$.
Ancillary statistic
Definition
A statistic $T$ is called ancillary if its distribution does not depend on the parameter.
The induced family is a singleton, completely known, and contains no information about $\theta$. Opposite of sufficiency.
A function of an ancillary statistic is ancillary.
Examples
iid $U(\theta, \theta + 1)$.
Location family, iid $f(x - \theta)$.
Scale family, iid $\sigma^{-1} f(x/\sigma)$.
iid $N(\theta, 1)$.
$X_1, X_2$ iid $N(0, \sigma^2)$.
$X_1, \ldots, X_n$ iid $N(\mu, \sigma^2)$:
$T = ((X_1 - \bar{X})/S, \ldots, (X_n - \bar{X})/S)$, where $S$ is the sample standard deviation, is ancillary.
Results
Location family $f(x - \theta)$.
Let $T$ be a location invariant statistic, i.e., $T(x_1 + b, \ldots, x_n + b) = T(x_1, \ldots, x_n)$. Then $T$ is ancillary.
In particular, the sample sd $S$ is ancillary (and so are other estimates of scale).
Location-scale family $\sigma^{-1} f((x - \mu)/\sigma)$.
Let $T$ be a location-scale invariant statistic, i.e., $T(a x_1 + b, \ldots, a x_n + b) = T(x_1, \ldots, x_n)$. Then $T$ is ancillary.
If $T_1$ and $T_2$ are such that $T_1(a x_1 + b, \ldots, a x_n + b) = a T_1(x_1, \ldots, x_n)$ and $T_2(a x_1 + b, \ldots, a x_n + b) = a T_2(x_1, \ldots, x_n)$, then $T_1/T_2$ is ancillary.
Question. An ancillary statistic does not contain any information about $\theta$. Then why do we study it?
It indicates how good the given sample is.
Example
$X_1, \ldots, X_n$ iid $U(\theta - 1, \theta + 1)$. $\theta$ is estimated by the midrange $(X_{(1)} + X_{(n)})/2$. The range $R = X_{(n)} - X_{(1)}$ is ancillary.
Question. Can addition or removal of ancillary information change the information content about $\theta$?
Intuitively, one may think that an ancillary statistic contains no information about $\theta$, so it should not change the information content. But this interpretation is false.
$U(\theta, \theta + 1)$.
A more dramatic example: $(X, Y) \sim \mathrm{BVN}(0, 0, 1, 1, \rho)$.
Completeness
Let a parametric family $\{f(x, \theta), \theta \in \Theta\}$ be given. Let $T$ be a statistic, with induced family of distributions $f_T(t, \theta)$, $\theta \in \Theta$.
Definition
A statistic $T$ is called complete (for the family $\{f(x, \theta), \theta \in \Theta\}$), or equivalently the induced family $f_T(t, \theta)$, $\theta \in \Theta$, is called complete, if $E_\theta(g(T)) = 0$ for all $\theta$ implies $g(T) = 0$ a.s. $P_\theta$ for all $\theta$.
In other words, no non-constant function of $T$ can have constant expectation (in $\theta$).
Completeness not only depends on the statistic, but also on the family. For instance, no nontrivial statistic is complete if the family is a singleton.
In order to find optimal estimators and tests, one sometimes needs to find complete sufficient statistics.
Examples
$X \sim \mathrm{bin}(n, \theta)$, $0 < \theta < 1$.
$X \sim \mathrm{Poi}(\lambda)$, $0 < \lambda$.
$X \sim N(\theta, 1)$, $-\infty < \theta < \infty$.
Theorem
Let $X_1, \ldots, X_n$ be iid observations from the above exponential family. Then $T(X) = (\sum_{i=1}^n t_1(X_i), \ldots, \sum_{i=1}^n t_k(X_i))$ is complete if the parameter space contains an open set in $\mathbb{R}^k$ (i.e., $d = k$).
A non-exponential example: iid $U(0, \theta)$, $T = X_{(n)}$.
Useful facts
If $T$ is complete and $S = \psi(T)$ is a function of $T$, then $S$ is also complete.
The constant statistic is complete for any family.
A non-constant ancillary statistic cannot be complete.
A statistic is called first order ancillary if its expectation is free of $\theta$. If a non-constant function of a statistic $T$ is first order ancillary, then $T$ cannot be complete.
Connection with minimal sufficiency
Theorem
If $T$ is complete and sufficient, and a minimal sufficient statistic exists, then $T$ is also minimal sufficient.
As a consequence, in the search for complete sufficient statistics, it is enough to check completeness of a minimal sufficient statistic (if one exists and is easily found).
This implies no complete sufficient statistic exists for the $U(\theta, \theta + 1)$ family, or the Cauchy$(\theta)$ family.
Basu's theorem
$T$ complete sufficient carries all relevant information about $\theta$. $S$ ancillary carries no information about $\theta$. The following remarkable result shows that they are statistically independent.
Theorem (Basu's theorem)
A complete sufficient statistic is independent of all ancillary statistics.
Completeness cannot be dropped, even if $T$ is minimal sufficient: iid $U(\theta, \theta + 1)$.
Applications
iid exponential. Then $T = \sum_{i=1}^n X_i$ and $(W_1, \ldots, W_n)$ are independent, where $W_j = X_j/T$. Also calculate $E(W_j)$.
iid normal. $T = \bar{X}$ and the sample standard deviation $S$ are independent.
iid $U(0, \theta)$. Then $X_{(n)}$ and $X_{(1)}/X_{(n)}$ are independent. Also calculate $E(X_{(1)}/X_{(n)})$.
iid $\mathrm{Ga}(\alpha, \lambda)$, $\alpha > 0$ known. Let $U = (\prod_{i=1}^n X_i)^{1/n}$. Then $U/\bar{X}$ is ancillary, independent of $\bar{X}$. Also $E[(U/\bar{X})^k] = E(U^k)/E(\bar{X}^k)$.
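The uniform example can be checked by simulation. The sketch below (with an arbitrary $\theta$) shows the near-zero correlation between $X_{(n)}$ and the ancillary ratio, and uses the independence to read off $E(X_{(1)}/X_{(n)}) = E(X_{(1)})/E(X_{(n)}) = 1/n$.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 3.0, 5, 100_000
x = rng.uniform(0, theta, size=(reps, n))
xmax, xmin = x.max(axis=1), x.min(axis=1)
ratio = xmin / xmax                       # ancillary: distribution free of theta

print(np.corrcoef(xmax, ratio)[0, 1])     # ~ 0, consistent with independence
print(ratio.mean())                       # ~ 1/n = 0.2, by Basu's theorem
```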
Likelihood
$X \sim f(\cdot, \theta)$ pmf or pdf. $X = x$ is observed.
Definition
The likelihood function is a function of the parameter with an observed sample, and is given by $L(\theta | x) = f(x, \theta)$.
Same expression, but now $x$ is fixed and $\theta$ is variable.
Examples
Binomial experiment: decide to stop after 10 trials; 3 successes obtained.
Negative binomial experiment: decide to stop after 3 successes; 10 trials were needed.
Likelihood can be viewed as the degree of plausibility. An estimate of $\theta$ may be obtained by choosing the most plausible value, i.e., where the likelihood function is maximized. This leads to one of the most important methods of estimation: the maximum likelihood estimator (more details in Chapter 7).
For instance, in either example above, the likelihood function is maximized at 0.3.
More examples
iid Poisson$(\lambda)$
iid $N(\mu, \sigma^2)$
iid $U(0, \theta)$
Exponential family
Bayesian approach
Suppose that $\theta$ can be considered as a random quantity with some marginal distribution $\pi(\theta)$, a pre-experiment assessment called the prior distribution. Then we can legitimately calculate the posterior distribution of $\theta$ given the data by the Bayes theorem. This posterior distribution will be the source of any inference about $\theta$.
Theorem (Bayes theorem)
$\pi(\theta | X) = \frac{\pi(\theta) f(X, \theta)}{\int \pi(t) f(X, t) \, dt}.$
Examples
iid $\mathrm{Bin}(1, \theta)$, prior $U(0, 1)$.
iid $\mathrm{Poi}(\lambda)$, prior standard exponential.
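The Bayes theorem on the previous slide can be implemented numerically by evaluating prior times likelihood on a grid and normalizing. The sketch below does this for the Poisson example with a standard exponential prior (the data values are hypothetical) and checks the result against the known Gamma posterior.

```python
import numpy as np
from scipy.stats import poisson, gamma

x = np.array([2, 0, 3, 1, 2])                 # hypothetical Poisson data
lam = np.linspace(1e-4, 10, 4000)             # grid over the parameter
dlam = lam[1] - lam[0]
prior = np.exp(-lam)                          # standard exponential prior
like = np.prod(poisson.pmf(x[:, None], lam), axis=0)
post = prior * like
post /= post.sum() * dlam                     # normalize: Bayes theorem numerically

# closed form: posterior is Gamma(sum(x) + 1, rate = n + 1)
print((lam * post).sum() * dlam)                      # posterior mean, numeric
print(gamma.mean(a=x.sum() + 1, scale=1/(len(x)+1)))  # closed form, = 1.5 here
```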
Difficulty:
$\theta$ is fixed, nonrandom.
How to specify a prior?
Bayesian's response:
Probability is a quantification of uncertainty of any type.
The arbitrariness of prior choice can be rectified to some extent by the use of automatic priors which are non-informative. (More later.)
Point Estimation
Find estimators for the unknown parameter $\theta$ or its function $\psi(\theta)$.
Evaluate your estimators (are they good?)
Definition
A point estimator of $\theta$ is a function $\hat{\theta} = W(X_1, \ldots, X_n)$.
Given a sample of realized observations, the number $W(x_1, \ldots, x_n)$ is called a point estimate of $\theta$.
Methods of point estimation
method of moments
maximum likelihood estimator (MLE)
Bayes estimators
Method of Moments
Let $X_1, \ldots, X_n$ be a sample from a population with pdf or pmf $f(x | \theta_1, \ldots, \theta_k)$. Estimate $\theta = (\theta_1, \ldots, \theta_k)$ by solving the $k$ equations formed by matching the first $k$ sample and population raw moments:
$m_1 = \frac{1}{n} \sum_{i=1}^n X_i, \quad \mu_1' = E_\theta(X)$
$m_2 = \frac{1}{n} \sum_{i=1}^n X_i^2, \quad \mu_2' = E_\theta(X^2)$
$\ldots$
$m_k = \frac{1}{n} \sum_{i=1}^n X_i^k, \quad \mu_k' = E_\theta(X^k)$
Examples
$X_1, \ldots, X_n$ iid $N(\mu, \sigma^2)$, both $\mu$ and $\sigma^2$ unknown.
$X_1, \ldots, X_n$ iid $\mathrm{Bin}(1, p)$.
$X_1, \ldots, X_n$ iid $\mathrm{Ga}(\alpha, \beta)$, with $(\alpha, \beta)$ unknown.
$X_1, \ldots, X_n$ iid $\mathrm{Unif}(\theta_1, \theta_2)$, where $\theta_1 < \theta_2$, both unknown.
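For the gamma example, matching the first two moments gives $\hat{\alpha} = m_1^2/(m_2 - m_1^2)$ and $\hat{\beta} = (m_2 - m_1^2)/m_1$ (in the shape-scale parametrization with mean $\alpha\beta$ and variance $\alpha\beta^2$). A minimal sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, beta = 2.5, 1.5                    # shape, scale: mean = alpha*beta
x = rng.gamma(alpha, beta, size=10_000)

m1 = x.mean()                             # first sample raw moment
m2 = (x**2).mean()                        # second sample raw moment
v = m2 - m1**2                            # moment-based variance
alpha_hat, beta_hat = m1**2 / v, v / m1   # solve alpha*beta = m1, alpha*beta^2 = v
print(alpha_hat, beta_hat)
```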
Features
Easy to implement
Computationally cheap
Converges to the parameter with increasing probability (called consistency)
Does not necessarily give the asymptotically most efficient estimator
Often used as an initial estimator in iterative methods
Maximum Likelihood Estimator
Recall that the likelihood function is
$L(\theta | X) = L(\theta | X_1, \ldots, X_n) = \prod_{i=1}^n f(X_i | \theta).$
Definition
The maximum likelihood estimator (MLE) $\hat{\theta}$ of $\theta$ is the location at which $L(\theta | X)$ attains its maximum as a function of $\theta$. Its numerical value is often called the maximum likelihood estimate.
How to find the MLE?
We want to find the global maximum of $L(\theta | X)$.
If $L(\theta | X)$ is differentiable in $(\theta_1, \ldots, \theta_k)$, we solve
$\frac{\partial}{\partial \theta_j} L(\theta | X) = 0, \quad j = 1, \ldots, k.$
The solutions to these likelihood equations locate only extreme points in the interior of $\Theta$, and provide possible candidates for the MLE. They can be local or global minima, local or global maxima, or inflection points. Our job is to find a global maximum.
$((d^2/d\theta^2) L(\theta))|_{\theta = \hat{\theta}} < 0$ is sufficient for a local maximum. We also need to check the boundary points separately.
If there is only one local maximum, then that must be the unique global maximum.
Many examples fall in this category, so no further work will be needed then.
How to find the MLE? (contd.)
In practice, we often work with $\log L(\theta | X)$, i.e., solve
$\frac{\partial}{\partial \theta_j} \log L(\theta | X) = 0, \quad j = 1, \ldots, k.$
We consider several different situations:
one-parameter case
non-differentiable $L(\theta | X)$
restricted range MLE (e.g., $\Theta$ is not the whole real line)
discrete parameter space
two-parameter case
Examples: One-parameter case
$X_1, \ldots, X_n$ iid $N(\theta, 1)$, with $\theta$ unknown.
$X_1, \ldots, X_n$ iid $\mathrm{Poi}(\lambda)$.
$X_1, \ldots, X_n$ iid $\mathrm{Exp}(\lambda)$.
(numerical/iterative method): $X_1, \ldots, X_n$ iid Weibull$(\theta)$.
(numerical/iterative method): $X_1, \ldots, X_n$ iid gamma$(\alpha, 1)$.
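For the gamma$(\alpha, 1)$ case there is no closed form: the likelihood equation involves the digamma function, so a numerical method is needed. A minimal sketch (assuming Python with scipy) that maximizes the log-likelihood directly:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

rng = np.random.default_rng(4)
x = rng.gamma(3.0, 1.0, size=500)        # gamma(alpha, 1) data, alpha unknown

def neg_loglik(a):
    # -log L(a | x) for gamma(a, 1): sum[(a-1) log x - x - log Gamma(a)]
    return -np.sum((a - 1) * np.log(x) - x - gammaln(a))

res = minimize_scalar(neg_loglik, bounds=(0.01, 20), method="bounded")
print(res.x)                              # numerical MLE of alpha
```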
Restricted MLE
The parameter space $\Theta$ is a proper subset of the set of all possible values of the parameter. Special attention is needed to make sure $\hat{\theta} \in \Theta$.
$X_1, \ldots, X_n$ iid $N(\theta, 1)$, $\theta \ge 0$.
But what if $\theta > 0$?
$X_1, \ldots, X_n$ iid $N(\theta, \sigma^2)$, $a \le \theta \le b$.
Non-differentiable likelihood
$X_1, \ldots, X_n$ iid $\mathrm{Unif}(0, \theta]$, $\theta > 0$.
$X_1, \ldots, X_n$ iid exponential location family with pdf $f(x) = e^{-(x - \theta)}$, if $x \ge \theta$.
$X_1, \ldots, X_n$ iid $\mathrm{Unif}(\theta - \frac{1}{2}, \theta + \frac{1}{2})$.
Discrete parameter space
Example
Let $X$ be a single observation taking values in $\{0, 1, 2\}$ according to $P_\theta$, where $\theta = \theta_0$ or $\theta_1$. The probability of $X$ is summarized below:
                       x = 0   x = 1   x = 2
  theta = theta_0       0.8     0.1     0.1
  theta = theta_1       0.2     0.3     0.5
Examples: Two-parameter case
For a differentiable likelihood, one needs calculus of several variables in general, but often simple tricks help reduce to one dimension.
$X_1, \ldots, X_n$ iid $N(\mu, \sigma^2)$.
$X_1, \ldots, X_n$ iid location-scale exponential family, with pdf $f(x; \mu, \sigma) = \frac{1}{\sigma} e^{-(x - \mu)/\sigma}$ if $x \ge \mu$.
Remarks about the MLE
The MLE $\hat{\theta}$ is the value for which the observed sample $x$ is most likely; it possesses some optimal properties (discussed later).
In exponential families, it coincides with the method of moments estimator.
The MLE can be numerically sensitive to variation in the data if the likelihood function is discontinuous.
If $T$ is sufficient for $\theta$, then the MLE $\hat{\theta}$ must be a function of $T$.
The MLE is the value of $\theta$ that maximizes $g(T(X), \theta)$, where $g(t, \theta)$ is the pdf or pmf of $T = T(X)$ at $t$.
Induced likelihood
If $\eta = \psi(\theta)$ is a parametric function, then the likelihood for $\eta$ is defined by
$L^*(\eta | X) = \sup_{\theta: \psi(\theta) = \eta} L(\theta | X).$
Theorem (Invariance Principle)
If $\hat{\theta}$ is the MLE of $\theta$, then for any function $\psi(\theta)$, the MLE of $\psi(\theta)$ is $\psi(\hat{\theta})$.
Examples
$X_1, \ldots, X_n$ iid $\mathrm{Bin}(1, p)$. Find the MLE of $\sqrt{p(1 - p)}$.
$X_1, \ldots, X_n$ iid $\mathrm{Poi}(\lambda)$. Find the MLE of $P_\lambda(X \ge 1)$.
$X_1, \ldots, X_n$ iid $N(\mu, \sigma^2)$.
Find the MLE of $\mu/\sigma$.
Find the MLE of the population median.
Find the MLE for $c = c(\mu, \sigma)$ such that $P_{\mu, \sigma}(\bar{X} > c) = 0.025$ (the 97.5th percentile of the distribution of $\bar{X}$).
EM-algorithm
Useful numerical algorithm to compute the MLE with missing data.
Iterative method repeating an E-step (Expectation) and an M-step (Maximization).
Given data $Y$, missing vital $X$. Augmented data $(X, Y)$.
Actual likelihood $L(\theta | Y) = E_\theta[L(\theta | X, Y) | Y]$.
Start with an initial estimate $\theta_0$.
Calculate $E_{\theta = \theta_0}(\log L(\theta | X, Y) | Y)$.
Maximize with respect to $\theta$ to get the update $\theta_1$.
Repeat the procedure, replacing the old estimate by the new, until convergence.
Example
Multinomial$((\theta + 1)/2, \theta/4, \theta/4, 1/2 - \theta)$.
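A sketch of the EM iteration for the multinomial example as reconstructed above (cell probabilities $((\theta+1)/2, \theta/4, \theta/4, 1/2 - \theta)$; the counts are hypothetical). Splitting the first cell into a latent part of probability $1/2$ and a $\theta$-part of probability $\theta/2$ yields closed-form E- and M-steps.

```python
# Observed multinomial counts for cells with probabilities
# ((theta+1)/2, theta/4, theta/4, 1/2 - theta), 0 < theta < 1/2.
x1, x2, x3, x4 = 125, 18, 20, 34          # hypothetical counts

theta = 0.25                              # initial guess
for _ in range(50):
    # E-step: expected latent count from the theta-part of cell 1,
    # since (theta/2) / ((theta+1)/2) = theta / (1 + theta)
    v = x1 * theta / (1 + theta)
    # M-step: maximize (v+x2+x3) log(theta) + x4 log(1/2 - theta)
    theta = (v + x2 + x3) / (2 * (v + x2 + x3 + x4))
print(theta)
```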
Bayes Estimators
Recall, in the Bayesian approach $\theta$ is considered as a quantity whose variation can be described by a probability distribution (called the prior distribution). A sample is then taken from a population indexed by $\theta$, and the prior distribution is updated with this sample information. The updated prior is called the posterior distribution.
Prior distribution of $\theta$: $\pi(\theta)$
Posterior distribution of $\theta$: $\pi(\theta | X) = f(X | \theta) \pi(\theta)/m(X)$
Marginal distribution of $X$: $m(X) = \int f(X | \theta) \pi(\theta) \, d\theta$
The mean of the posterior distribution, $E(\theta | X)$, can be used as the Bayes estimator of $\theta$.
Examples
$X_1, \ldots, X_n$ iid $\mathrm{Bin}(1, p)$. Assume the prior distribution on $p$ is $\mathrm{Beta}(\alpha, \beta)$. Find the posterior distribution of $p$ and the Bayes estimator of $p$.
Special case: prior $\pi(p) \sim \mathrm{Unif}(0, 1)$.
$X_1, \ldots, X_n$ iid $N(0, \theta)$, $\theta \in [0, 1]$, prior $U[0, 1]$.
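For the first example, the Beta prior is conjugate: the posterior is $\mathrm{Beta}(\alpha + \sum x_i, \beta + n - \sum x_i)$, and the Bayes estimator is its mean. A minimal sketch with simulated data, using the uniform special case $\alpha = \beta = 1$:

```python
import numpy as np

rng = np.random.default_rng(5)
p_true, n = 0.3, 50
x = rng.binomial(1, p_true, size=n)             # iid Bin(1, p) sample

a, b = 1.0, 1.0                                 # Beta(1, 1) = Unif(0, 1) prior
a_post, b_post = a + x.sum(), b + n - x.sum()   # posterior Beta parameters
print(a_post / (a_post + b_post))               # Bayes estimator: posterior mean
```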
Conjugate family
Let $\mathcal{F}$ denote the class of pdfs or pmfs $f(x | \theta)$. A class $\Pi$ of prior distributions is a conjugate family for $\mathcal{F}$ if the posterior distribution is in the class $\Pi$ for all $f \in \mathcal{F}$, all priors in $\Pi$, and all observation values $x$.
Examples:
The beta family is conjugate for the binomial family.
The normal family is conjugate for the normal family.
Methods of Evaluating Estimators
Various criteria to evaluate $\hat{\theta}$ and compare different point estimators:
mean squared error
best unbiased estimators or UMVUE (Uniformly Minimum Variance Unbiased Estimator)
optimality for general loss functions and risks
Unbiasedness and Mean Squared Error
The bias of a point estimator $W$ of $\theta$ is $\mathrm{Bias}_\theta(W) = E_\theta W - \theta$.
An estimator whose bias is equal to 0 is called unbiased.
An unbiased estimator satisfies $E_\theta W = \theta$ for all $\theta$.
The mean squared error (MSE) of an estimator $W$ of $\theta$ is defined by $E_\theta(W - \theta)^2$.
The MSE is a function of $\theta$, and has the representation
$E_\theta(W - \theta)^2 = \mathrm{Var}_\theta W + (\mathrm{Bias}_\theta W)^2.$
The MSE incorporates two components, one measuring the variability of the estimator (precision) and the other measuring its bias (accuracy).
A small value of the MSE implies small combined variance and bias. Unbiased estimators do a good job of controlling bias.
A smaller MSE indicates a smaller probability for $W$ to be far from $\theta$, because
$P(|W - \theta| > \epsilon) \le \frac{1}{\epsilon^2} E_\theta(W - \theta)^2 = \frac{1}{\epsilon^2} \mathrm{MSE}_\theta(W)$
by the Chebyshev inequality.
In general, there will not be one best estimator. Often the MSEs of two estimators cross each other, showing that each estimator is better in only a portion of the parameter space.
Example
Let $X_1, X_2$ be iid from $\mathrm{Bin}(1, p)$ with $0 < p < 1$. Compare three estimators with respect to their MSE:
$\hat{p}_1 = X_1$
$\hat{p}_2 = \frac{X_1 + X_2}{2}$
$\hat{p}_3 = 0.5.$
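A sketch computing the three MSE curves, $p(1-p)$, $p(1-p)/2$, and $(0.5 - p)^2$, and locating where the constant estimator beats $\hat{p}_2$; no estimator wins for every $p$:

```python
import numpy as np

p = np.linspace(0, 1, 101)
mse1 = p * (1 - p)          # p1 = X1: unbiased, Var = p(1-p)
mse2 = p * (1 - p) / 2      # p2 = (X1+X2)/2: unbiased, Var = p(1-p)/2
mse3 = (0.5 - p) ** 2       # p3 = 0.5: zero variance, pure squared bias

# p3 beats p2 only for p roughly in (0.21, 0.79); the curves cross
print(p[mse3 < mse2])
```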
Illustration
Let $X_1, \ldots, X_n$ be iid $N(\mu, \sigma^2)$. Show $\bar{X}$ is unbiased for $\mu$ and $S^2$ is unbiased for $\sigma^2$, and compute their MSEs.
What about non-normal distributions with mean $\mu$ and variance $\sigma^2$?
Let $X_1, \ldots, X_n$ be iid $N(\mu, \sigma^2)$. Show the estimator $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2$ is biased for $\sigma^2$, but it has a smaller MSE than $S^2$.
More generally, find the MSE of $c S^2$.
Uniformly Minimum Variance Unbiased Estimator
If the estimator $W$ is unbiased for $\psi(\theta)$, then its MSE is equal to $\mathrm{Var}_\theta(W)$. Therefore, choosing a better unbiased estimator is equivalent to choosing the one with smaller variance.
Definition
An estimator $W^*$ is a best unbiased estimator of $\psi(\theta)$ if it satisfies:
$E_\theta W^* = \psi(\theta)$ for all $\theta$;
For any other estimator $W$ with $E_\theta W = \psi(\theta)$, we have $\mathrm{Var}_\theta W^* \le \mathrm{Var}_\theta W$ for all $\theta$.
$W^*$ is also called a uniformly minimum variance unbiased estimator (UMVUE).
Example
$X_1, \ldots, X_n$ iid $\mathrm{Poi}(\lambda)$. Both $\bar{X}$ and $S^2$ are unbiased for $\lambda$.
How to find a best unbiased estimator?
If $B(\theta)$ is a lower bound on the variance of any unbiased estimator of $\psi(\theta)$, and if $W^*$ is unbiased and satisfies $\mathrm{Var}_\theta W^* = B(\theta)$, then $W^*$ is a UMVUE.
Cramer-Rao Inequality
Theorem
Let $X$ be a sample with pdf $f(x, \theta)$. Suppose $W(X)$ is an estimator satisfying
$E_\theta W(X) = \psi(\theta)$ for any $\theta$;
$\mathrm{Var}_\theta W(X) < \infty$.
If differentiation under the integral sign can be carried out, then
$\mathrm{Var}_\theta(W(X)) \ge \frac{[\psi'(\theta)]^2}{E_\theta[(\frac{\partial}{\partial\theta} \log f(X|\theta))^2]}.$
In the i.i.d. case, the bound reduces to $[\psi'(\theta)]^2/(n I(\theta))$, where
$I(\theta) = E_\theta[(\frac{\partial}{\partial\theta} \log f(X|\theta))^2]$
is called the Fisher information (per observation).
Score function: $s(X, \theta) = \frac{\partial}{\partial\theta} \log f(X|\theta) = \frac{1}{f(X|\theta)} \frac{\partial}{\partial\theta} f(X|\theta)$.
Lemma (Expressions for $I(\theta)$)
If differentiation and integration are interchangeable,
$I(\theta) = E_\theta(s(X, \theta))^2 = \mathrm{var}_\theta(s(X, \theta)) = -E_\theta\left[\frac{\partial^2}{\partial\theta^2} \log f(X, \theta)\right]$
$= \int \left(\frac{\partial}{\partial\theta} \log f(x, \theta)\right)^2 f(x, \theta) \, dx = \int \frac{(\frac{\partial}{\partial\theta} f(x, \theta))^2}{f(x, \theta)} \, dx = -\int \left(\frac{\partial^2}{\partial\theta^2} \log f(x, \theta)\right) f(x, \theta) \, dx.$
Examples
$X_1, \ldots, X_n$ iid $\mathrm{Poi}(\lambda)$. Find the Fisher information number and a UMVUE for $\lambda$.
$X_1, \ldots, X_n$ iid $N(\mu, \sigma^2)$, $\mu$ unknown but $\sigma^2$ known. Find a UMVUE for $\mu$ using the Cramer-Rao bound.
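For the Poisson family the score is $s(X, \lambda) = X/\lambda - 1$, so $I(\lambda) = \mathrm{Var}(X)/\lambda^2 = 1/\lambda$. A minimal simulation check:

```python
import numpy as np

rng = np.random.default_rng(6)
lam, reps = 4.0, 200_000
x = rng.poisson(lam, size=reps)

score = x / lam - 1.0         # d/d(lambda) log f(x | lambda) for Poisson
print(score.var())            # ~ I(lambda), by the variance expression
print(1 / lam)                # closed form: I(lambda) = 1/lambda
```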
When can we exchange differentiation and integration?
Yes, for the exponential family.
Not always true for non-exponential families. We have to check directly whether
$\frac{d}{d\theta} \int h(x) f(x, \theta) \, dx$ and $\int h(x) \frac{\partial}{\partial\theta} [f(x, \theta)] \, dx$ match.
Example
$X_1, \ldots, X_n$ iid from $\mathrm{Unif}(0, \theta)$.
The Cramer-Rao bound does not work here!
Attainability of Cramer-Rao bound
The Cramer-Rao inequality says that if $W^*$ achieves the variance bound, then it is a UMVUE. In the one-parameter exponential family case, we can find such an estimator. But there is no guarantee that this lower bound is sharp (attainable) in other situations. It is possible that the value of the Cramer-Rao bound may be strictly smaller than the variance of any unbiased estimator.
Corollary
Let $X_1, \ldots, X_n$ be iid with pdf $f(x, \theta)$, where $f(x, \theta)$ satisfies the assumptions of the Cramer-Rao bound theorem. Let $L(\theta | x) = \prod_{i=1}^n f(x_i, \theta)$ denote the likelihood function. If $W(X)$ is unbiased for $\psi(\theta)$, then $W(X)$ attains the Cramer-Rao Lower Bound if and only if
$a(\theta)[W(X) - \psi(\theta)] = \frac{\partial}{\partial\theta} \log L(\theta | X)$
for some function $a(\theta)$.
Attainability in one-parameter exponential family
Theorem
Let $X_1, \ldots, X_n$ be iid from a one-parameter exponential family with the pdf $f(x, \theta) = c(\theta) h(x) \exp\{w(\theta) T(x)\}$. Assume $E[T(X)] = \psi(\theta)$. Then $n^{-1} \sum_{i=1}^n T(X_i)$, as an unbiased estimator of $\psi(\theta)$, attains the Cramer-Rao Lower Bound, i.e.,
$\mathrm{Var}\left(n^{-1} \sum_{i=1}^n T(X_i)\right) = \frac{[\psi'(\theta)]^2}{n I(\theta)}.$
Examples
$X_1, \ldots, X_n$ iid from $\mathrm{Bin}(1, p)$. Find a UMVUE of $p$ and show it attains the Lower Bound.
$X_1, \ldots, X_n \sim N(\mu, \sigma^2)$, with $(\mu, \sigma^2)$ both unknown. Consider estimation of $\sigma^2$. What is the Cramer-Rao Lower Bound and is it attainable?
Constructing UMVUE using Rao-Blackwell Method
An important method of finding/constructing UMVUEs with the help of conditioning on a complete and sufficient statistic.
Review of conditional expectation:
$E(X) = E[E(X|Y)]$, for any $X, Y$.
$\mathrm{Var}(X) = \mathrm{Var}[E(X|Y)] + E[\mathrm{Var}(X|Y)]$, for any $X, Y$.
$E(g(X)|Y) = \int g(x) f_{X|Y}(x|y) \, dx$, and it is a function of $Y$.
$\mathrm{Cov}(E(X|Y), Y) = \mathrm{Cov}(X, Y)$.
Rao-Blackwell Theorem
Theorem
Let $W$ be unbiased for $\psi(\theta)$ and $T$ be a sufficient statistic for $\theta$. Define $\phi(T) = E(W|T)$. Then the following hold:
$E_\theta \phi(T) = \psi(\theta)$;
$\mathrm{Var}_\theta \phi(T) \le \mathrm{Var}_\theta W$ for all $\theta$.
Thus, $E(W|T)$ is a uniformly better unbiased estimator of $\psi(\theta)$ than $W$.
Conditioning any unbiased estimator on a sufficient statistic will result in a uniform improvement, so we need consider only statistics that are functions of a sufficient statistic in the search for best unbiased estimators.
Examples
Let $X_1, X_2$ be iid $N(\theta, 1)$. Show $X_1$ is unbiased for $\theta$ and $E(X_1 | \bar{X})$ is uniformly better.
Let $X_1, \ldots, X_n$ be iid $\mathrm{Unif}(0, \theta)$. Show $Y = (n + 1) X_{(1)}$ is unbiased for $\theta$ and $E(Y | X_{(n)})$ is uniformly better.
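The uniform example can be made concrete: given $X_{(n)}$, the remaining observations are iid Unif$(0, X_{(n)})$, which gives $E(Y | X_{(n)}) = (n+1) X_{(n)}/n$. The sketch below (arbitrary $\theta$ and $n$) confirms both estimators are unbiased and that the Rao-Blackwellized one has much smaller variance.

```python
import numpy as np

rng = np.random.default_rng(7)
theta, n, reps = 2.0, 5, 200_000
x = rng.uniform(0, theta, size=(reps, n))

y = (n + 1) * x.min(axis=1)                # unbiased: E[(n+1) X_(1)] = theta
phi = (n + 1) / n * x.max(axis=1)          # E(Y | X_(n)) = (n+1) X_(n) / n

print(y.mean(), phi.mean())                # both ~ theta
print(y.var(), phi.var())                  # phi has much smaller variance
```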
Uniqueness of UMVUE
Theorem
If $W$ is a UMVUE of $\psi(\theta)$, then $W$ is unique.
UMVUE and unbiased estimators of zero
Theorem
If $E_\theta W = \psi(\theta)$, then $W$ is the best unbiased estimator of $\psi(\theta)$ if and only if $W$ is uncorrelated with all unbiased estimators of 0.
Example
Let $X$ be an observation from $\mathrm{Unif}(\theta, \theta + 1)$.
Show that $X - \frac{1}{2}$ is unbiased for $\theta$.
Show that $h(X) = \sin(2\pi X)$ is an unbiased estimator of zero.
Show $X - \frac{1}{2}$ and $h(X)$ are correlated. So $X - \frac{1}{2}$ is not best.
Lehmann-Scheffé theorem
Theorem
Let $T$ be a complete sufficient statistic for a parameter $\theta$, and let $\phi(T)$ be any estimator based on $T$. Then $\phi(T)$ is the unique best unbiased estimator of its expected value.
Thus:
Find a complete sufficient statistic $T$ for the parameter $\theta$.
Find an unbiased estimator $h(X)$ of $\psi(\theta)$.
Then $\phi(T) = E(h(X) | T)$ is the best unbiased estimator of $\psi(\theta)$.
Examples
Let $X_1, \ldots, X_n$ be iid $\mathrm{Bin}(k, \theta)$.
$X_1, \ldots, X_n$ are iid from $\mathrm{Unif}(0, \theta)$.
Find the UMVUE of $\theta$.
Find the UMVUE of $g(\theta)$, where $g$ is differentiable on $(0, \infty)$.
Suppose $X_1, \ldots, X_n$ are iid from $\mathrm{Poi}(\lambda)$.
Find the UMVUE of $\lambda$.
Find the UMVUE of $g(\lambda) = \lambda^r$, $r \ge 1$ an integer.
Find the UMVUE of $g(\lambda) = e^{-\lambda}$.
More Examples
Suppose that the random variables $Y_1, \ldots, Y_n$ satisfy
$Y_i = \beta x_i + \epsilon_i, \quad i = 1, \ldots, n,$
where $x_1, \ldots, x_n$ are fixed constants, and $\epsilon_1, \ldots, \epsilon_n$ are iid $N(0, \sigma^2)$ with $\sigma^2$ known. Find the MLE of $\beta$ and show it is the UMVUE.
Suppose $X_1, \ldots, X_n$ are iid from $\exp(\theta)$, $\theta > 0$.
Find the UMVUE for $\theta$.
Find the UMVUE for $\psi(\theta) = 1 - F_\theta(s) = P_\theta(X_1 > s)$.
Find the UMVUE for $e^{-1/\theta}$.
More Examples (contd.)
Suppose $X_1, \ldots, X_n$ are iid from $N(\mu, \sigma^2)$, both $(\mu, \sigma^2)$ unknown.
Find the UMVUE for $\mu$.
Find the UMVUE for $\sigma^2$.
Find the UMVUE for $\mu^2$.
Normal probability. $X_1, \ldots, X_n$ iid $N(\theta, 1)$. $\psi(\theta) = P_\theta(X_1 \le c) = \Phi(c - \theta)$.
Ridiculous UMVUE. $X_1, \ldots, X_n$ iid $\mathrm{Poi}(\lambda)$. $\psi(\lambda) = e^{-3\lambda}$.
Loss Function Optimality
Observations $X_1, \ldots, X_n$ are iid with pdf $f(x, \theta)$, $\theta \in \Theta$. To evaluate an estimator $\hat{\theta}(X)$, various loss functions can be used. The loss function measures the closeness of $\theta$ and $\hat{\theta}$.
squared error loss: $L(\theta, \hat{\theta}) = (\hat{\theta} - \theta)^2$
absolute error loss: $L(\theta, \hat{\theta}) = |\hat{\theta} - \theta|$
a loss that penalizes overestimation more than underestimation is
$L(\theta, \hat{\theta}) = (\hat{\theta} - \theta)^2 I(\hat{\theta} < \theta) + 10 (\hat{\theta} - \theta)^2 I(\hat{\theta} \ge \theta).$
a loss that penalizes more if $\theta$ is near 0 than if $|\theta|$ is large:
$L(\theta, \hat{\theta}) = \frac{(\hat{\theta} - \theta)^2}{|\theta| + 1}$
Loss Function Optimality (contd.)
To compare estimators, we use the expected loss, called the risk function,
$R(\theta, \hat{\theta}) = E_\theta L(\theta, \hat{\theta}(X)).$
If $R(\theta, \hat{\theta}_1) < R(\theta, \hat{\theta}_2)$ for all $\theta$, then $\hat{\theta}_1$ is the preferred estimator because it performs better for all $\theta$. In particular, for the squared error loss, the risk function is the MSE.
Example
$X_1, \ldots, X_n$ iid from $\mathrm{Bin}(1, p)$. Compare two estimators in terms of their MSE:
MLE $\hat{p}_1 = \bar{X}$
Bayes estimator: prior $\pi(p) \sim \mathrm{Beta}(\alpha, \beta)$ with $\alpha = \beta = \sqrt{n/4}$,
$\hat{p}_B = \frac{\sum_{i=1}^n X_i + \sqrt{n/4}}{n + \sqrt{n}}.$
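A sketch comparing the two risk functions; with this particular prior, the Bayes estimator has constant risk $n/(4(n + \sqrt{n})^2)$, while the MLE's risk $p(1-p)/n$ peaks at $p = 1/2$ (the choice $n = 25$ is arbitrary):

```python
import numpy as np

n = 25
p = np.linspace(0, 1, 101)
mse_mle = p * (1 - p) / n                                      # risk of p1 = Xbar
mse_bayes = n / (4 * (n + np.sqrt(n)) ** 2) * np.ones_like(p)  # constant in p

# Bayes risk is flat; the MLE is better near the boundary, worse near p = 1/2
print(mse_mle.max(), mse_bayes[0])
```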
Minimaxity
Risk functions are generally overlapping. One estimator cannot beat everyone else.
Example
$X_1, \ldots, X_n$ iid $N(\mu, \sigma^2)$. Consider the estimators of the form $\hat{\sigma}^2_b(X) = b S^2$.
Minimaxity: Compare the worst case scenario, i.e., compare the maximum risks. Find the estimator which has the smallest maximum risk: the minimax estimator.
Downsides
Problems with unbounded risk: the maximum is infinity.
Not easy to find the minimax estimator.
Too pessimistic.
Bayes Rule
The Bayes risk is the average risk with respect to the prior $\pi$,
$\int_\Theta R(\theta, \hat{\theta}) \pi(\theta) \, d\theta.$
By definition, the Bayes risk can be written as
$\int_\Theta R(\theta, \hat{\theta}) \pi(\theta) \, d\theta = \int_\Theta \left[\int L(\theta, \hat{\theta}(x)) f(x|\theta) \, dx\right] \pi(\theta) \, d\theta.$
Note $f(x|\theta)\pi(\theta) = \pi(\theta|x) m(x)$, where $\pi(\theta|x)$ is the posterior distribution of $\theta$ and $m(x)$ is the marginal distribution of $X$; then the Bayes risk becomes
$\int_\Theta R(\theta, \hat{\theta}) \pi(\theta) \, d\theta = \int \left[\int_\Theta L(\theta, \hat{\theta}(x)) \pi(\theta|x) \, d\theta\right] m(x) \, dx.$
The quantity $\int_\Theta L(\theta, \hat{\theta}(x)) \pi(\theta|x) \, d\theta$ is called the posterior expected loss.
To minimize the Bayes risk, we only need to find $\hat{\theta}$ to minimize the posterior expected loss for each $x$.
Bayes Rule (contd.)
The Bayes rule with respect to a prior $\pi$ is an estimator that yields the smallest value of the Bayes risk.
For squared error loss, the posterior expected loss is
$\int_\Theta (\theta - a)^2 \pi(\theta|x) \, d\theta = E[(\theta - a)^2 | x],$
therefore the Bayes rule is $E(\theta|x)$.
For absolute error loss, the posterior expected loss is $E(|\theta - a| \,|\, x)$. The Bayes rule is the median of $\pi(\theta|x)$.
Examples
$X_1, \ldots, X_n$ are iid from $N(\theta, \sigma^2)$ and let $\pi(\theta)$ be $N(\mu, \tau^2)$. The values $\sigma^2, \mu, \tau^2$ are known.
$X_1, \ldots, X_n$ are iid from $\mathrm{Bin}(1, p)$ and let $\pi(p)$ be $\mathrm{Beta}(\alpha, \beta)$.
Hypothesis Testing
Point estimation: provide a single estimate of $\theta$.
Hypothesis testing: test a statement about $\theta$.
A hypothesis is a statement about a population parameter.
Two complementary hypotheses in a hypothesis testing problem are called the null hypothesis and the alternative hypothesis. Let $\Theta_0$ be a subset of the parameter space, called the null region. The hypotheses are denoted by $H_0$ and $H_1$:
$H_0: \theta \in \Theta_0$ vs $H_1: \theta \in \Theta_0^c.$
Illustration
Example
An ideal manufacturing process requires that all products are non-defective. This is very seldom achieved. The goal is to keep the proportion of defective items as low as possible. Let $\theta$ be the proportion of defective items, and 0.01 be the maximum acceptable proportion of defective items.
Statement 1: $\theta \ge 0.01$ (the proportion of defectives is unacceptably high)
Statement 2: $\theta < 0.01$ (acceptable quality)
Example
Let $\theta$ be the average change in a patient's blood pressure after taking a drug. An experimenter might be interested in testing
$H_0: \theta = 0$ (the drug has no effect on blood pressure)
$H_1: \theta \neq 0$ (there is some effect)
Different Types of Hypotheses
Simple hypotheses: Both $H_0$ and $H_1$ consist of only one probability distribution.
Composite hypotheses: Either $H_0$ or $H_1$ contains more than one possible distribution.
One-sided hypotheses: $H: \theta \ge \theta_0$ or $H: \theta < \theta_0$.
Two-sided hypotheses: $H_0: \theta = \theta_0$ vs $H_1: \theta \neq \theta_0$.
Rejection region
A hypothesis testing procedure or hypothesis test is a rule that specifies:
for which sample values the decision is made to accept $H_0$ as true;
for which sample values $H_0$ is rejected and $H_1$ is accepted as true.
The subset of the sample space for which $H_0$ will be rejected is $R$: the rejection region or critical region.
The complement of the rejection region is $R^c$: the acceptance region.
The rejection region $R$ of a hypothesis test is usually defined by a test statistic $W(X)$, a function of the sample:
$R = \{x : W(x) > c\}$: reject $H_0$.
$R^c = \{x : W(x) \le c\}$: accept $H_0$.
Methods of Evaluating Tests
In deciding to accept or reject the null hypothesis $H_0$, we might make a mistake no matter what the decision is. There are two types of errors:
Type I error: $H_0$ is actually true, i.e., $\theta \in \Theta_0$, but the test incorrectly decides to reject $H_0$.
Type II error: $H_0$ is actually false, i.e., $\theta \in \Theta_0^c$, but the test incorrectly decides to accept $H_0$.
                 Decision: Accept H_0    Decision: Reject H_0
  Truth H_0      Correct decision        Type I error
  Truth H_1      Type II error           Correct decision
Power Function
Definition
The power function of a hypothesis test with rejection region $R$ is the function of $\theta$ defined by
$\beta(\theta) = P_\theta(X \in R).$
$\beta(\theta)$ equals the probability of a Type I error if $\theta \in \Theta_0$, and $1 -$ the probability of a Type II error if $\theta \in \Theta_0^c$.
Note $P(\text{Type I error}) = \beta(\theta)$ for $\theta \in \Theta_0$, and $P(\text{Type II error}) = 1 - \beta(\theta)$ for $\theta \in \Theta_0^c$.
Ideal test: $\beta(\theta) = 0$ for all $\theta \in \Theta_0$; $\beta(\theta) = 1$ for all $\theta \in \Theta_0^c$.
Good test:
$\beta(\theta)$ is near 0 (small) for most $\theta \in \Theta_0$;
$\beta(\theta)$ is near 1 (large) for most $\theta \in \Theta_0^c$.
Example (Binomial power function)
$X \sim \mathrm{Bin}(5, \theta)$.
$H_0: \theta \le \frac{1}{2}$ versus $H_1: \theta > \frac{1}{2}$.
Test 1: reject $H_0$ if and only if all successes are observed, i.e., $R = \{5\}$.
Test 2: reject $H_0$ if $X = 3$, 4, or 5.
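The two power functions are $\beta_1(\theta) = \theta^5$ and $\beta_2(\theta) = P_\theta(X \ge 3)$. A minimal sketch evaluating both (assuming scipy):

```python
import numpy as np
from scipy.stats import binom

theta = np.linspace(0, 1, 101)
beta1 = binom.pmf(5, 5, theta)      # Test 1: reject iff X = 5, so beta = theta^5
beta2 = binom.sf(2, 5, theta)       # Test 2: reject iff X >= 3

# at the H0 boundary theta = 1/2: Test 1 has size 1/32, Test 2 has size 1/2
print(beta1[50], beta2[50])
```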
Likelihood Ratio Tests (LRT)
Definition
The likelihood ratio test statistic for testing $H_0: \theta \in \Theta_0$ vs $H_1: \theta \in \Theta_0^c$ is
$\lambda(x) = \frac{\sup_{\theta \in \Theta_0} L(\theta|x)}{\sup_{\theta \in \Theta} L(\theta|x)}.$
A likelihood ratio test (LRT) has a rejection region
$R: \{x : \lambda(x) \le c\},$
where $c$ is any number satisfying $0 \le c \le 1$.
This should be reduced to the simplest possible form.
Rationale of LRT
The numerator of $\lambda(x)$ is the maximum probability of the observed sample, computed over parameters in $H_0$. The denominator of $\lambda(x)$ is the maximum probability of the observed sample over all possible parameters.
The numerator says which $\theta \in \Theta_0$ makes the observation of the data most likely; the denominator says which $\theta \in \Theta$ makes the observation of the data most likely.
The ratio of these two maxima is small if there are parameter points in $H_1$ for which the observed sample is much more likely than for any parameter in $H_0$. In this situation, the LRT criterion says $H_0$ should be rejected and $H_1$ accepted as true.
Relation between LRT and MLE
Let $\hat{\theta}_0$ be the MLE of $\theta$ in the null set $\Theta_0$ (restricted maximization).
Let $\hat{\theta}$ be the MLE of $\theta$ in the full set $\Theta$ (unrestricted maximization). Then the LRT statistic, a function of $x$ (not $\theta$), is
$\lambda(x) = \frac{\sup_{\theta \in \Theta_0} L(\theta|x)}{\sup_{\theta \in \Theta} L(\theta|x)} = \frac{L(\hat{\theta}_0|x)}{L(\hat{\theta}|x)}.$
In $R: \{x : \lambda(x) \le c\}$, a different $c$ gives a different rejection region and hence a different test.
Examples
$X_1, \ldots, X_n$ iid $N(\theta, \sigma^2)$ with $\theta$ unknown ($\sigma^2$ known). Consider testing
$H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0,$
where $\theta_0$ is a number fixed by the experimenter prior to the experiment.
Find the LRT and its power function.
Comment on the decision rules given by different $c$'s.
Let $X_1, \ldots, X_n$ be a random sample from a location-exponential family
$f(x, \theta) = e^{-(x - \theta)}$ if $x \ge \theta,$
where $-\infty < \theta < \infty$. Consider testing $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$. Find the LRT.
LRT and sufficiency
Theorem
If $T(X)$ is a sufficient statistic for $\theta$, $\lambda^*(t)$ is the LRT statistic based on $T$, and $\lambda(x)$ is the LRT statistic based on $x$, then
$\lambda^*(T(x)) = \lambda(x)$
for every $x$ in the sample space.
Thus the simplified expression for $\lambda(x)$ should depend on $x$ only through $T(x)$ if $T(X)$ is a sufficient statistic for $\theta$.
Examples
$X_1, \ldots, X_n$ iid $N(\theta, \sigma^2)$ with $\sigma^2$ known. Test $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$.
Let $X_1, \ldots, X_n$ be a random sample from a location-exponential family. Test $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$.
Nuisance parameter case
Likelihood ratio tests are also useful when there are nuisance parameters, which are present in the model but not of direct interest.
Example
$X_1, \ldots, X_n$ iid $N(\mu, \sigma^2)$, both $\mu$ and $\sigma^2$ unknown. Test $H_0: \mu \le \mu_0$ versus $H_1: \mu > \mu_0$.
Specify $\Theta$ and $\Theta_0$.
Find the LRT and the power function.
Bayesian Tests
Using the posterior density $\pi(\theta|x)$, compute
$P(\theta \in \Theta_0 | x) = P(H_0 \text{ is true} \,|\, x)$
$P(\theta \in \Theta_0^c | x) = P(H_1 \text{ is true} \,|\, x)$
Decide in favor of the hypothesis which has the greater posterior probability: Accept $H_0$ if $P(\theta \in \Theta_0 | x) \ge \frac{1}{2}$.
This does not work if $\Theta_0$ is a point and $\theta$ is given a prior density. One will need to put a prior mass at the point.
Example
Let $X_1, \ldots, X_n$ be iid $N(\theta, \sigma^2)$ and the prior distribution on $\theta$ be $N(\mu, \tau^2)$, where $\sigma^2, \mu, \tau^2$ are known. Test $H_0: \theta \le \theta_0$ against $H_1: \theta > \theta_0$.
Unbiased Test
Definition
A test with power function $\beta(\theta)$ is unbiased if $\beta(\theta') \ge \beta(\theta'')$ for every $\theta' \in \Theta_0^c$ and $\theta'' \in \Theta_0$.
In most problems, there are many unbiased tests.
Recall $\beta(\theta) = P_\theta(\text{reject } H_0)$. An unbiased test says that the probability of rejecting $H_0$ when $H_0$ is true is smaller than the probability of rejecting $H_0$ when $H_0$ is false.
Examples
$X \sim \mathrm{Bin}(5, \theta)$. Consider testing $H_0: \theta \le \frac{1}{2}$ versus $H_1: \theta > \frac{1}{2}$, and reject $H_0$ if $X = 5$.
$X_1, \ldots, X_n \sim N(\mu, \sigma^2)$, with $\sigma^2$ known. Consider testing $H_0: \mu \le \mu_0$ versus $H_1: \mu > \mu_0$.
The LRT is unbiased.
Draw the graph of the power function.
Controlling Type I error
For a fixed sample size, it is usually impossible to make both types of error arbitrarily small.
Common approach:
Control the Type I error probability at a specified level $\alpha$.
Within this class of tests, make the Type II error probability as small as possible; equivalently, maximize the power.
Size and level of a test
Definition
For $0 \le \alpha \le 1$, a test with power function $\beta(\theta)$ is a size $\alpha$ test if $\sup_{\theta \in \Theta_0} \beta(\theta) = \alpha$.
Definition
For $0 \le \alpha \le 1$, a test with power function $\beta(\theta)$ is a level $\alpha$ test if $\sup_{\theta \in \Theta_0} \beta(\theta) \le \alpha$.
If these relations hold only in the limit as $n \to \infty$, we call the tests respectively asymptotically size (level) $\alpha$. [More details in the final chapter]
Notations and remarks
Typical choices of $\alpha$ are: 0.01, 0.05, 0.10.
We use $z_{\alpha/2}$ to denote the point having probability $\alpha/2$ to the right of it for a standard normal pdf. By convention, we have
$P(Z > z_\alpha) = \alpha$, where $Z \sim N(0, 1)$
$P(T_{n-1} > t_{n-1, \alpha/2}) = \alpha/2$, where $T_{n-1} \sim t_{n-1}$
$P(\chi^2_p > \chi^2_{p, 1-\alpha}) = 1 - \alpha$, chi square with d.f. $p$
Note $z_\alpha = -z_{1-\alpha}$.
Commonly used cutoffs:
$z_{0.05} = 1.645$, $z_{0.025} = 1.96$, $z_{0.01} = 2.33$, $z_{0.005} = 2.58$.
How to specify $H_0$ and $H_1$?
If an experimenter expects an experiment will indicate a phenomenon, he or she should choose $H_1$ to be the theory being proposed.
$H_1$ is sometimes called the researcher's hypothesis. By using a level $\alpha$ test with small $\alpha$, the experimenter is guarding against saying the data support the research hypothesis when it is false.
Announcing a new phenomenon when in fact nothing has happened is usually more serious than missing something new that has in fact occurred.
Similarly, in the judicial system evidence is collected to decide whether the accused is innocent or guilty. To prevent the possibility of penalizing an innocent person incorrectly, the test should be set up as $H_0$: innocent versus $H_1$: guilty.
Example
Let $X \sim \mathrm{Bin}(5, \theta)$. Consider testing $H_0: \theta \le \frac{1}{2}$ versus $H_1: \theta > \frac{1}{2}$ with the procedure: reject $H_0$ if $X = 5$.
Is this test a level 0.05 test?
Is this test a size 0.05 test?
What is the size of the test?
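A one-line check (assuming scipy): the power $\theta^5$ is increasing, so the size is attained at the boundary $\theta = 1/2$.

```python
from scipy.stats import binom

# sup over H0: theta <= 1/2 of the power theta^5 is at theta = 1/2
size = binom.pmf(5, 5, 0.5)
print(size, size <= 0.05)   # 0.03125: a level-0.05 test, of size 1/32, not 0.05
```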
How to choose the critical value of the LRT
In order to make an LRT a size $\alpha$ test, we choose $c$ such that
$\sup_{\theta \in \Theta_0} P_\theta(\lambda(X) \le c) = \alpha.$
iid $N(\mu, \sigma^2)$, $\sigma^2$ known. $H_0: \mu \le \mu_0$ vs $H_1: \mu > \mu_0$.
iid $N(\mu, \sigma^2)$, $\sigma^2$ known. Consider testing $H_0: \mu = \mu_0$ vs $H_1: \mu \neq \mu_0$.
Let $X_1, \ldots, X_n$ be iid from $N(\mu, \sigma^2)$, $\sigma^2$ unknown. Consider testing $H_0: \mu = \mu_0$ versus $H_1: \mu \neq \mu_0$. Show that the LRT that rejects $H_0$ if $|\bar{X} - \mu_0| > t_{n-1, \alpha/2} S/\sqrt{n}$ is a test of size $\alpha$.
iid location-exponential distribution. Consider testing $H_0: \theta \le \theta_0$ vs $H_1: \theta > \theta_0$. Find the size $\alpha$ LRT.
Sample size calculation
For a fixed sample size, it is usually impossible to make both types of error probabilities arbitrarily small. But if we can choose the sample size, it is possible to achieve a desired power level.
Example
iid $N(\mu, \sigma^2)$, $\sigma^2$ known. Test $H_0: \mu \le \mu_0$ vs $H_1: \mu > \mu_0$. The LRT that rejects $H_0$ if $(\bar{X} - \mu_0)/(\sigma/\sqrt{n}) > C$ has the power function
$\beta(\mu) = 1 - \Phi\left(C + \frac{\mu_0 - \mu}{\sigma}\sqrt{n}\right).$
Note $\beta(\mu)$ is increasing in $\mu$.
Notes
The maximum Type I error is
$\sup_{\mu \le \mu_0} \beta(\mu) = \beta(\mu_0) = 1 - \Phi(C).$
For the size $\alpha$ test, $C = z_\alpha$.
After $C$ is chosen, it is possible to increase $\beta(\mu)$ for $\mu > \mu_0$ by increasing the sample size $n$. Thus we can minimize the Type II error (remember: the Type I error is under control already).
Draw the picture of the power function for small $n$ and large $n$.
Assume $C = z_\alpha$. How to choose $n$ such that the maximum Type II error is at most 0.2 if $\mu \ge \mu_0 + \sigma$?
Compute $n$ if $\alpha = 0.05$ in (3).
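Under the reading that the power requirement is imposed at $\mu = \mu_0 + \sigma$ (as reconstructed above), the requirement $1 - \Phi(z_\alpha - \sqrt{n}) \ge 0.8$ gives $\sqrt{n} \ge z_\alpha + z_{0.2}$. A minimal sketch:

```python
import numpy as np
from scipy.stats import norm

alpha, beta_target = 0.05, 0.2            # size and maximum Type II error
z_a, z_b = norm.ppf(1 - alpha), norm.ppf(1 - beta_target)

# power at mu = mu0 + sigma is 1 - Phi(z_a - sqrt(n)); require >= 0.8
n = int(np.ceil((z_a + z_b) ** 2))
print(n)                                  # 7
print(1 - norm.cdf(z_a - np.sqrt(n)))     # achieved power, a bit above 0.8
```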
Example
Let $X \sim \mathrm{Bin}(n, \theta)$. Testing:
$H_0: \theta \ge 3/4$ vs $H_1: \theta < 3/4.$
The LRT for this problem is to reject $H_0$ if $X \le c$.
Choose $c$ and $n$ such that the following are satisfied simultaneously:
If $\theta = \frac{3}{4}$, we have $\Pr(\text{reject } H_0 | \theta) = 0.01$ (control Type I error);
If $\theta = \frac{1}{2}$, we have $\Pr(\text{reject } H_0 | \theta) = 0.99$ (control Type II error).
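Exact equality is generally impossible for a discrete distribution, so a natural reading is to search for the smallest $n$ with a cutoff $c$ meeting both constraints as inequalities. A brute-force sketch:

```python
from scipy.stats import binom

# smallest n admitting a cutoff c with
# P(X <= c | theta = 3/4) <= 0.01 and P(X <= c | theta = 1/2) >= 0.99
for n in range(1, 200):
    for c in range(n + 1):
        if binom.cdf(c, n, 0.75) <= 0.01 and binom.cdf(c, n, 0.5) >= 0.99:
            print(n, c)
            break
    else:
        continue
    break
```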
Most Powerful Tests
Given that the maximum probability of Type I error is less than or equal to $\alpha$, the most powerful level $\alpha$ test minimizes the probability of Type II error, or, equivalently, maximizes the power function at a $\theta \in \Theta_0^c$.
If this occurs for all $\theta \in \Theta_0^c$, such a test is called the uniformly most powerful (UMP) level $\alpha$ test.
Test function
Given a rejection region $R$, define a test function on the sample space to be
$\phi(x) = 1$ if $x \in R$; $\phi(x) = 0$ if $x \notin R$.
Interpret $\phi(X)$ as the probability of rejecting the null hypothesis given the sample $X$.
This also opens doors for randomized tests, where $\phi(X)$ can even take values strictly between 0 and 1.
Note the expected value of $\phi$ is the power function:
$E_\theta[\phi(X)] = P_\theta(X \in R) = \beta(\theta).$
Existence of UMP tests
Lemma (Neyman-Pearson)
Consider testing $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$, where the pdf or pmf corresponding to $\theta_i$ is $f(x, \theta_i)$, $i = 0, 1$. Consider any test function $\phi$ satisfying
$\phi(x) = 1$ if $f(x, \theta_1) > k f(x, \theta_0)$,
$\phi(x) = 0$ if $f(x, \theta_1) < k f(x, \theta_0)$,
for some $k \ge 0$, and $E_{\theta_0} \phi(X) = \alpha$. Then
$\phi(X)$ is a UMP size $\alpha$ test;
if $k > 0$, any other UMP level $\alpha$ test $\phi'$ must have size $\alpha$ and can differ from $\phi$ only on the set $\{x : f(x, \theta_1) = k f(x, \theta_0)\}$.
Examples
$X \sim \mathrm{Bin}(2, \theta)$, one observation. $H_0: \theta = \frac{1}{2}$ versus $H_1: \theta = \frac{3}{4}$. Obtain the UMP level 1/8 test and the UMP level 1/2 test.
$X \sim \mathrm{Exp}(\lambda)$, $H_0: \lambda = 1$ versus $H_1: \lambda = 2$.
$X \sim \mathrm{Cauchy}(\theta)$, $H_0: \theta = 0$ versus $H_1: \theta = 1$.
$X \sim \mathrm{Un}(0, \theta)$, $H_0: \theta = 1$ versus $H_1: \theta = 2$.
$X \sim \mathrm{Un}(\theta, \theta + 1)$, $H_0: \theta = 0$ versus $H_1: \theta = 2$.
Sucient statistic and UMP test
Let $T(X)$ be a sufficient statistic for $\theta$ and $g(t, \theta)$ the pdf or pmf of $T$ corresponding to $\theta$. Then a UMP level $\alpha$ test $\psi(T)$ based on $T$ is given by
$\psi(t) = 1$ if $g(t, \theta_1) > k g(t, \theta_0)$,
$\psi(t) = 0$ if $g(t, \theta_1) < k g(t, \theta_0)$,
for some $k \ge 0$, where $\alpha = E_{\theta_0} \psi(T)$.
Examples
UMP normal test for the mean: let $X_1, \ldots, X_n$ be iid from $N(\mu, \sigma^2)$ with $\sigma^2$ known. $H_0: \mu = \mu_0$ versus $H_1: \mu = \mu_1$, where $\mu_1 > \mu_0$.
UMP normal test for the variance: let $X_1, \ldots, X_n$ be iid from $N(0, \sigma^2)$ with $\sigma^2$ unknown. $H_0: \sigma^2 = \sigma_0^2$ versus $H_1: \sigma^2 = \sigma_1^2$, where $\sigma_1^2 > \sigma_0^2$.
Comments
Discrete case: Suppose $\theta$ has only two possible values $\theta_0$ or $\theta_1$, and $X$ is a discrete variable taking finitely many values $a_1, \ldots, a_k$ with probabilities $P_{\theta_i}(X = a_j)$, $j = 1, \ldots, k$; $i = 0, 1$. $H_0: \theta = \theta_0$ vs $\theta = \theta_1$. The rejection region $R$ of the UMP level $\alpha$ test satisfies
$\max_R \sum_{a_j \in R} P_{\theta_1}(X = a_j)$ subject to $\sum_{a_j \in R} P_{\theta_0}(X = a_j) \le \alpha.$
The N-P test is the LRT for $H_0: \theta = \theta_0$ vs $\theta = \theta_1$.
For simple hypotheses, the UMP level $\alpha$ test is unbiased, i.e., $\beta(\theta_1) > \beta(\theta_0) = \alpha$.
UMP test for one-sided composite alternative
iid $N(\theta, 1)$.
$H_0: \theta = \theta_0$ vs $H_1: \theta > \theta_0$.
Monotone Likelihood Ratio (MLR)
Definition
A family of pdfs or pmfs {g(t, θ) : θ ∈ Θ} for a univariate random variable T with real-valued parameter θ has a monotone likelihood ratio (MLR) if, for every θ₂ > θ₁, g(t, θ₂)/g(t, θ₁) is an increasing function of t on {t : g(t, θ₁) > 0 or g(t, θ₂) > 0}.
Examples
- Normal, Poisson, Binomial all have the MLR property.
- If T is from an exponential family with density f(t, θ) = h(t) c(θ) e^{w(θ)t}, then T has an MLR if w(θ) is a nondecreasing function of θ.
- If X₁, …, Xₙ are iid from N(μ, σ²) with σ known, then X̄ has an MLR.
- If X₁, …, Xₙ are iid from N(μ, σ²) with μ known, then Σ_{i=1}^n (Xᵢ − μ)² has an MLR.
- iid Unif(0, θ): T = X₍ₙ₎ has the MLR property.
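A quick numerical check of the MLR property (a sketch assuming numpy and scipy): for the N(θ, 1) family the ratio g(t, θ₂)/g(t, θ₁) should be increasing in t whenever θ₂ > θ₁.

    # Check that g(t, theta2)/g(t, theta1) is increasing in t for N(theta, 1).
    import numpy as np
    from scipy.stats import norm

    t = np.linspace(-4, 4, 200)
    ratio = norm.pdf(t, loc=2.0) / norm.pdf(t, loc=0.0)  # theta2 = 2 > theta1 = 0
    print(np.all(np.diff(ratio) > 0))                    # True: monotone increasing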
Stochastically increasing
Definition
A statistic T with family of pdfs {f(t, θ) : θ ∈ Θ} is called stochastically increasing in θ if θ₁ < θ₂ implies that
P_{θ₁}(T > c) ≤ P_{θ₂}(T > c) for every c,
or equivalently, F_{θ₂}(c) ≤ F_{θ₁}(c), where F is the cdf.
Useful facts
Lemma
- If a family T has the MLR property, then it is stochastically increasing in its parameter.
- A location family T is stochastically increasing in its location parameter.
- Let a test have rejection region R = {T > c}. If T has the MLR property, then the power function β(θ) = P_θ(T ∈ R) = P_θ(T > c) is non-decreasing in θ.
Karlin-Rubin Theorem
Theorem
Let T(X) be a sufficient statistic for θ and suppose the family {g(t, θ) : θ ∈ Θ} has the MLR property. Then
- For testing H₀: θ ≤ θ₀ vs H₁: θ > θ₀, the UMP level α test rejects H₀ if and only if T > t₀, where α = P_{θ₀}(T > t₀).
- For testing H₀: θ ≥ θ₀ vs H₁: θ < θ₀, the UMP level α test rejects H₀ if and only if T < t₀, where α = P_{θ₀}(T < t₀).
Examples
- Let X₁, …, Xₙ be iid from N(μ, σ²), σ² known.
  Find the UMP level α test for testing H₀: μ ≤ μ₀ vs H₁: μ > μ₀.
  Find the UMP level α test for testing H₀: μ ≥ μ₀ vs H₁: μ < μ₀.
- Let X₁, …, Xₙ be iid from N(μ₀, σ²), σ² unknown, μ₀ known.
  Find the UMP level α test for testing H₀: σ² ≤ σ₀² vs H₁: σ² > σ₀².
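A minimal sketch of the last test (assuming numpy and scipy; the function name is ours): T = Σ (Xᵢ − μ₀)² is sufficient and has MLR in σ², and T/σ₀² has a χ²ₙ distribution at σ² = σ₀², so Karlin-Rubin rejects for large T.

    # UMP test for H0: sigma^2 <= sigma0^2 vs H1: sigma^2 > sigma0^2, mu0 known.
    import numpy as np
    from scipy.stats import chi2

    def ump_var_test(x, mu0, sigma0_sq, alpha=0.05):
        T = np.sum((x - mu0) ** 2)
        return T > sigma0_sq * chi2.ppf(1 - alpha, df=len(x))  # True = reject H0

    rng = np.random.default_rng(2)
    print(ump_var_test(rng.normal(0.0, 1.5, size=30), mu0=0.0, sigma0_sq=1.0))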
Nonexistence of UMP test
- For many problems with a two-sided alternative, there is no UMP level α test, because the class of level α tests is so large that no single test dominates all the others in terms of power.
- Instead, search for a UMP test within some subset of the class of level α tests, for example, the subset of all unbiased tests.
Example
Let X₁, …, Xₙ be iid from N(μ, σ²), σ² known. Consider testing H₀: μ = μ₀ vs H₁: μ ≠ μ₀.
- There is no UMP level α test.
- Find the UMP level α test within the class of unbiased tests.
p-value
- The choice of α is subjective. Different people may have different tolerance levels α.
- If α is small, the decision is conservative.
- If α is large, the decision is overly liberal.
- If you reject (or accept) H₀, is it a strong or a borderline rejection (acceptance)?
p-value (contd.)
Definition
A p-value is the smallest possible level α at which H₀ would be rejected.
Note
- The p-value is a test statistic, taking values 0 ≤ p(x) ≤ 1 for the sample x.
- Small values of p(X) give evidence that H₁ is true.
- The smaller the p-value, the stronger the evidence for rejecting H₀.
- Rejecting H₀ at level α is equivalent to the p-value being less than α.
p-value for composite null
A p-value is called valid if, for every θ ∈ Θ₀ and every 0 ≤ α ≤ 1, we have P_θ(p(X) ≤ α) ≤ α.
Theorem
Let W(X) be a test statistic such that large values of W give evidence that H₁ is true. For each sample point x, define
p(x) = sup_{θ ∈ Θ₀} P_θ(W(X) ≥ W(x)).
Then p(X) is a valid p-value.
Examples
- Two-sided normal p-value: let X₁, …, Xₙ be iid from N(μ, σ²), σ² unknown. For testing H₀: μ = μ₀ versus H₁: μ ≠ μ₀, use the LRT statistic W(X) = |X̄ − μ₀|/(S/√n).
  Let μ₀ = 1, n = 16, observed x̄ = 1.5, s² = 1. Do you reject the hypothesis μ = 1 at level 0.05? At level 0.1?
- One-sided normal p-value: in the above example, consider testing H₀: μ ≤ μ₀ versus H₁: μ > μ₀.
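A sketch of the computation for these numbers (assuming scipy): under H₀ the t-statistic has a t_{n−1} distribution, so the two-sided p-value is 2 P(T₁₅ ≥ W).

    # p-values for mu0 = 1, n = 16, xbar = 1.5, s^2 = 1.
    from math import sqrt
    from scipy.stats import t

    n, xbar, s2, mu0 = 16, 1.5, 1.0, 1.0
    W = abs(xbar - mu0) / (sqrt(s2) / sqrt(n))                  # = 2.0
    print(2 * t.sf(W, df=n - 1))                                # two-sided: about 0.064
    print(t.sf((xbar - mu0) / (sqrt(s2) / sqrt(n)), df=n - 1))  # one-sided: about 0.032
    # Reject mu = 1 at level 0.1 (p < 0.1) but not at level 0.05.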
p-value and sufficient statistic
Sometimes there is a non-trivial sufficient statistic for the null model. Then defining a p-value through conditioning on a sufficient statistic effectively reduces the composite null to a point null:
p(x) = P(W(X) ≥ W(x) | S = S(x)).
Fisher's Exact Test
Let S₁ and S₂ be independent observations with S₁ ~ Bin(n₁, p₁) and S₂ ~ Bin(n₂, p₂). Consider testing H₀: p₁ = p₂ versus H₁: p₁ > p₂.
Goal: form an exact (non-asymptotic) level α test.
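A minimal sketch (assuming scipy; the counts below are hypothetical): conditionally on S = S₁ + S₂ = s, under H₀ the count S₁ is hypergeometric and free of the common p, so an exact conditional p-value is P(S₁ ≥ s₁ | S = s).

    # Fisher's exact test for H1: p1 > p2 via the hypergeometric conditional law.
    from scipy.stats import hypergeom

    def fisher_pvalue(s1, n1, s2, n2):
        s = s1 + s2
        # hypergeom(M, n, N): M = population size, n = successes, N = draws
        return hypergeom.sf(s1 - 1, M=n1 + n2, n=s, N=n1)  # P(S1 >= s1 | S = s)

    print(fisher_pvalue(s1=9, n1=12, s2=3, n2=12))  # hypothetical counts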
Interval Estimation
- Interval estimate: (L(X), U(X)).
- Confidence coefficient: min_θ P_θ(θ ∈ (L(X), U(X))) = 1 − α.
Method of inversion
There is a one-to-one correspondence between tests and confidence intervals.
- Hypothesis testing: fixing the parameter asks what sample values (in the acceptance region) are consistent with that fixed value.
- Confidence set: fixing the sample value asks what parameter values make this sample most plausible.
For each θ₀, let A(θ₀) be the acceptance region of a level α test of H₀: θ = θ₀. Define the set C(x) = {θ₀ : x ∈ A(θ₀)}. Then C(x) is a (1 − α)-confidence set.
Example
iid N(μ, σ²), σ unknown, μ is the parameter of interest.
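A sketch of the inversion in this example (assuming numpy and scipy): the level α two-sided t-test of H₀: μ = μ₀ accepts when |x̄ − μ₀| ≤ t_{α/2, n−1} s/√n, and collecting the accepted μ₀ gives the usual t-interval.

    # Invert the two-sided t-test acceptance region into a confidence interval.
    import numpy as np
    from scipy.stats import t

    def t_interval(x, alpha=0.05):
        n, xbar, s = len(x), np.mean(x), np.std(x, ddof=1)
        half = t.ppf(1 - alpha / 2, df=n - 1) * s / np.sqrt(n)
        return xbar - half, xbar + half   # {mu0 : x in A(mu0)}

    rng = np.random.default_rng(3)
    print(t_interval(rng.normal(5.0, 2.0, size=20)))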
Method of inversion (contd.)
In general, inverting the acceptance region of a two-sided test gives a two-sided interval, and inverting the acceptance region of a one-sided test gives an interval open at one end.
Theorem
Let the acceptance region of a two-sided test be of the form A(θ) = {x : c₁(θ) ≤ T(x) ≤ c₂(θ)} and let the cutoffs be symmetric, that is, P_θ(T(X) > c₂(θ)) = α/2 and P_θ(T(X) < c₁(θ)) = α/2.
If T has the MLR property, then both c₁(θ) and c₂(θ) are increasing in θ.
Examples
- X₁, …, Xₙ ~ N(μ, σ²), both parameters unknown.
  Upper confidence bound for μ.
  Lower confidence bound for μ.
- X₁, …, Xₙ ~ Exp(λ). Invert the LRT.
- Discrete: X₁, …, Xₙ ~ Bin(1, θ). Obtain a lower confidence bound.
Pivot
Definition
A random quantity Q(X, θ) is called a pivotal quantity (or a pivot) if the distribution of Q(X, θ) is independent of θ.
Note this is different from an ancillary statistic, since Q(X, θ) depends also on θ and hence is not a statistic.
Examples
- Location family.
- Scale family.
- Location-scale family.
- iid exponential: gamma pivot.
- A statistic T has density f(t, θ) = g(Q(t, θ)) |(∂/∂t) Q(t, θ)|. Then Q(T, θ) is a pivot.
Method of pivot
How to construct a confidence set using a pivotal quantity?
- Find a, b such that P_θ(a ≤ Q(X, θ) ≤ b) = 1 − α.
- Define C(x) = {θ : a ≤ Q(x, θ) ≤ b}.
- Then P_θ(θ ∈ C(X)) = P_θ(a ≤ Q(X, θ) ≤ b) = 1 − α.
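A sketch for the iid exponential case (assuming numpy and scipy, and the rate parameterization Xᵢ ~ Exp(λ) with mean 1/λ): Q = λ Σ Xᵢ ~ Gamma(n, 1) no matter what λ is, so the pivot inverts to an interval.

    # Gamma-pivot confidence interval for the exponential rate lambda.
    import numpy as np
    from scipy.stats import gamma

    def exp_rate_ci(x, alpha=0.05):
        n, total = len(x), np.sum(x)
        lo, hi = gamma.ppf([alpha / 2, 1 - alpha / 2], a=n)  # Gamma(n, 1) quantiles
        return lo / total, hi / total   # {lambda : lo <= lambda * sum(x) <= hi}

    rng = np.random.default_rng(4)
    print(exp_rate_ci(rng.exponential(scale=1 / 2.0, size=40)))  # true rate 2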
Method of pivot (contd.)
When will C(x) be an interval?
- If Q(x, θ) is monotone in θ, then C(x) is an interval.
Examples:
- iid exponential.
- iid N(μ, σ²), σ known. Interval for μ.
- iid N(μ, σ²), σ unknown. Interval for μ.
- iid N(μ, σ²), μ known. Interval for σ.
- iid N(μ, σ²), μ unknown. Interval for σ.
Method of pivot (contd.)
- If F(t, θ) is decreasing in θ for all t, define θ_L, θ_U by
F(t, θ_L) = 1 − α₂, F(t, θ_U) = α₁, α₁ + α₂ = α.
Then [θ_L(T), θ_U(T)] is a (1 − α) CI for θ.
- Similarly, if F(t, θ) is increasing in θ for all t, define θ_L, θ_U by
F(t, θ_L) = α₂, F(t, θ_U) = 1 − α₁, α₁ + α₂ = α.
Then [θ_L(T), θ_U(T)] is a (1 − α) CI for θ.
Examples:
- iid from f(x, θ) = e^{−(x−θ)} I(x > θ); X₍₁₎ is sufficient.
- A (1 − α) CI is not unique. Among the many choices, one may want to minimize the expected length.
- iid N(μ, σ²), σ known.
- iid N(μ, σ²), σ unknown.
- iid exponential.
Asymptotic Evaluation
X₁, …, Xₙ i.i.d. f(x, θ), n large. Mathematically, n → ∞.
- The assumption n → ∞ makes life easier. Dependence of optimality on models or loss functions becomes less pronounced.
- Because limit theorems become available, distributions can be found approximately. Limiting distributions are much simpler than actual distributions.
Convergence in probability
Definition
We say that Yₙ →ᵖ c (Yₙ converges in probability to the constant c) if P(|Yₙ − c| > ε) → 0 as n → ∞ for all ε > 0.
- Usual calculus applies for convergence in probability.
- A possible method of showing this is Chebyshev's inequality:
P(|Yₙ − c| > ε) ≤ ε⁻² E(Yₙ − c)² = ε⁻² [var(Yₙ) + (E(Yₙ) − c)²],
so it is enough to show that the right-hand side goes to 0.
- If Yₙ = X̄ₙ, then X̄ₙ →ᵖ E(X) by the law of large numbers.
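A quick simulation illustrating the Chebyshev bound (a sketch assuming numpy): the empirical P(|X̄ₙ − μ| > ε) shrinks with n and stays below σ²/(n ε²).

    # Empirical P(|Xbar_n| > eps) versus the Chebyshev bound 1/(n * eps^2).
    import numpy as np

    rng = np.random.default_rng(5)
    eps, reps = 0.5, 10000
    for n in (10, 100, 1000):
        xbar = rng.normal(0.0, 1.0, size=(reps, n)).mean(axis=1)  # mu = 0, sigma = 1
        print(n, np.mean(np.abs(xbar) > eps), 1.0 / (n * eps ** 2))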
Convergence in distribution
Definition
If Yₙ is a sequence of random variables and F is a continuous cdf, we say that Yₙ converges in distribution to F if P(Yₙ ≤ x) → F(x) for all x. We also say that Yₙ →ᵈ Y, where Y is a random variable having cdf F.
- The central limit theorem states that √n (X̄ₙ − E(X)) converges in distribution to N(0, var(X)), i.e.,
P(√n (X̄ₙ − E(X))/√var(X) ≤ x) → Φ(x)
for all x, where Φ stands for the standard normal cdf.
- Another important result is Slutsky's theorem: if Yₙ →ᵈ Y and Zₙ →ᵖ c, then Yₙ + Zₙ →ᵈ Y + c, Yₙ Zₙ →ᵈ cY, and Yₙ/Zₙ →ᵈ Y/c if c ≠ 0.
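A simulation sketch of the CLT and Slutsky combined (assuming numpy and scipy): √n (X̄ − μ)/S, with the consistent S in place of the true standard deviation, is still approximately standard normal.

    # CLT + Slutsky: studentized means of exponential data are nearly N(0, 1).
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(6)
    n, reps = 200, 20000
    x = rng.exponential(scale=1.0, size=(reps, n))   # mean 1, sd 1
    z = np.sqrt(n) * (x.mean(axis=1) - 1.0) / x.std(axis=1, ddof=1)
    print(np.mean(z <= 1.0), norm.cdf(1.0))          # the two should be close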
Consistency
Definition
Let Wₙ = Wₙ(X₁, …, Xₙ) be a sequence of estimators for τ(θ). We say that Wₙ is consistent for estimating τ(θ) if Wₙ →ᵖ τ(θ) under P_θ for all θ.
Theorem
If E_θ(Wₙ) → τ(θ) (in which case Wₙ is called asymptotically unbiased for τ(θ)) and var_θ(Wₙ) → 0 for all θ, then Wₙ is consistent for τ(θ).
Examples
- If X₁, …, Xₙ are i.i.d. f with E(X) = μ and var(X) = σ², then X̄ₙ is consistent for μ and Sₙ² = Σ_{i=1}^n (Xᵢ − X̄ₙ)²/(n − 1) is consistent for σ². Σ_{i=1}^n (Xᵢ − X̄ₙ)²/n is consistent for σ² too.
- (Invariance principle of consistency) If Wₙ is consistent for θ and g is a continuous function, then g(Wₙ) is consistent for g(θ).
- Method of moments estimators are generally consistent.
- UMVUE is consistent: let X₁, …, Xₙ be i.i.d. f(x, θ) and let Wₙ be the UMVUE of τ(θ). Then Wₙ is consistent for τ(θ).
- Consistency of MLE: let X₁, …, Xₙ be i.i.d. f(x, θ), a parametric family satisfying some regularity conditions. Then the MLE θ̂ₙ is consistent for θ.
Asymptotic normality
Definition
A statistic Tₙ is called asymptotically normal with mean τ(θ) and variance v(θ)/n, written Tₙ ~ AN(τ(θ), v(θ)/n), if √n (Tₙ − τ(θ))/√v(θ) →ᵈ N(0, 1) for all θ.
τ(θ) is called the asymptotic mean and v(θ) is called the asymptotic variance.
Note: Tₙ is then consistent for τ(θ).
Example
If X₁, …, Xₙ is i.i.d. f(x, θ), then Tₙ = X̄ₙ is AN(μ(θ), σ²(θ)/n), where μ(θ) = E_θ(X) and σ²(θ) = var_θ(X), by the central limit theorem.
Delta method
Theorem
If Tₙ is AN(θ, σ²(θ)/n), then g(Tₙ) is AN(g(θ), (g′(θ))² σ²(θ)/n).
- A multivariate version is also true.
- The CLT and the delta method in combination give the asymptotic normality of many statistics of interest.
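A numerical check of the delta method (a sketch assuming numpy): with g(x) = x², the variance of g(X̄) should be close to (g′(θ))² σ²/n = (2θ)² σ²/n.

    # Delta method check: var(Xbar^2) vs (2*theta)^2 * sigma^2 / n.
    import numpy as np

    rng = np.random.default_rng(7)
    theta, sigma, n, reps = 2.0, 1.0, 500, 20000
    xbar = rng.normal(theta, sigma, size=(reps, n)).mean(axis=1)
    print(np.var(xbar ** 2), (2 * theta) ** 2 * sigma ** 2 / n)  # both about 0.032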
Efficiency
- How to distinguish between consistent estimators?
- Let the estimators be asymptotically normal with the same asymptotic mean. Then we can compare asymptotic variances.
- Often one variance is smaller than another throughout.
- If there is a lower bound, and that lower bound is attained, then the estimator making that happen is called asymptotically efficient. Clearly such an estimator is impossible to beat asymptotically: it is the best.
Efficiency bound
- Cramer-Rao bound for the MSE of Tₙ in estimating τ(θ):
(τ′(θ) + bₙ′(θ))² / (n I(θ)),
where I(θ) is the Fisher information and bₙ(θ) the bias.
- So if bₙ′(θ) → 0, then the bound for the asymptotic variance should be (τ′(θ))²/I(θ).
- In particular, if τ(θ) = θ, the bound for the asymptotic variance is 1/I(θ).
- Strictly speaking, this bound is not valid, although it is nearly correct.
- Then we can define an estimator to be asymptotically efficient if its asymptotic variance is 1/I(θ).
Attaining efficiency bound
Theorem
The MLE θ̂ is AN(θ, 1/(n I(θ))). More generally, τ(θ̂) is AN(τ(θ), (τ′(θ))²/(n I(θ))).
- The MLE is not the only possible asymptotically efficient estimator.
- Any Bayes estimator is asymptotically efficient.
- Method of moments estimators are asymptotically normal, but need not be asymptotically efficient.
- Define the asymptotic efficiency of θ̂ₙ ~ AN(θ, v(θ)/n) by I⁻¹(θ)/v(θ).
Examples
Cauchy
Logistic
Mean versus median
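A simulation sketch for the mean-versus-median comparison (assuming numpy): for normal data the sample median is AN(μ, (π/2) σ²/n), so its asymptotic efficiency relative to the mean is 2/π ≈ 0.64.

    # Relative efficiency of mean vs median for N(0, 1) data.
    import numpy as np

    rng = np.random.default_rng(8)
    n, reps = 400, 20000
    x = rng.normal(0.0, 1.0, size=(reps, n))
    v_mean = np.var(x.mean(axis=1))
    v_med = np.var(np.median(x, axis=1))
    print(v_mean / v_med)   # close to 2/pi, about 0.64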
Asymptotic distribution of likelihood ratio statistic
Theorem (Point null case)
Let X₁, …, Xₙ be i.i.d. f(x, θ) and let λₙ(X) be the likelihood ratio for testing H₀: θ = θ₀ vs H₁: θ ≠ θ₀, where θ is d-dimensional. Then
−2 log λₙ(X) →ᵈ χ²_d.
Example: Poisson
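A simulation sketch of the Poisson example (assuming numpy and scipy): for H₀: θ = θ₀ the statistic is −2 log λₙ = 2n [θ₀ − X̄ + X̄ log(X̄/θ₀)], and its null distribution approaches χ²₁.

    # Null distribution of -2 log lambda_n for Poisson, H0: theta = theta0.
    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(9)
    theta0, n, reps = 3.0, 100, 20000
    xbar = rng.poisson(theta0, size=(reps, n)).mean(axis=1)
    stat = 2 * n * (theta0 - xbar + xbar * np.log(xbar / theta0))
    print(np.mean(stat > chi2.ppf(0.95, df=1)))   # rejection rate close to 0.05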
Asymptotic distribution of likelihood ratio statistic
Theorem (General case)
Let X₁, …, Xₙ be i.i.d. f(x, θ) and let λₙ(X) be the likelihood ratio for testing H₀: θ ∈ Θ₀ vs H₁: θ ∉ Θ₀. Then −2 log λₙ(X) →ᵈ χ²_d, where d is the difference between the number of free parameters in the model and the number of free parameters in the null.
Example: Multinomial
Common large sample tests
- Two-sample normal, variances unknown.
- Two-sample binomial.
Common large sample condence intervals
- Asymptotic normality of a moment or a function of moments, via the central limit theorem and the delta method, often gives a confidence interval.
- A confidence interval can also be obtained from the distribution of the MLE: θ̂ₙ ~ AN(θ, 1/(n I(θ))), i.e., √(n I(θ)) (θ̂ₙ − θ) ~ AN(0, 1), an asymptotic pivot.
- One can possibly invert this to get a CI for θ, but it may be complicated.
- By Slutsky, √(n I(θ̂ₙ)) (θ̂ₙ − θ) ~ AN(0, 1). This is easy to invert (see the sketch after the examples below).
Examples
Poisson
Binomial
Correlation coefficient
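For the Poisson example, a minimal sketch (assuming numpy and scipy): the MLE is X̄, I(θ) = 1/θ, and the Slutsky-type pivot inverts to the Wald interval X̄ ± z_{α/2} √(X̄/n).

    # Wald confidence interval for a Poisson mean based on the MLE pivot.
    import numpy as np
    from scipy.stats import norm

    def poisson_wald_ci(x, alpha=0.05):
        n, xbar = len(x), np.mean(x)
        half = norm.ppf(1 - alpha / 2) * np.sqrt(xbar / n)
        return xbar - half, xbar + half

    rng = np.random.default_rng(10)
    print(poisson_wald_ci(rng.poisson(4.0, size=60)))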
Variance stabilizing transformation
- If Tₙ ~ AN(θ, v(θ)/n), then g(Tₙ) ~ AN(g(θ), (g′(θ))² v(θ)/n).
- Choose g such that the variance does not depend on θ, that is, g′(θ) = c v(θ)^{−1/2}, or g(x) = c ∫₀ˣ v(u)^{−1/2} du.
- Large sample CI:
g(Tₙ) − z_{α/2} c n^{−1/2} < g(θ) < g(Tₙ) + z_{α/2} c n^{−1/2}, or
[g⁻¹(g(Tₙ) − z_{α/2} c n^{−1/2}), g⁻¹(g(Tₙ) + z_{α/2} c n^{−1/2})].
- This has some advantages, and usually much better accuracy (see the Poisson sketch after the examples below).
Examples
Poisson
Binomial
Chi-square
Correlation coefficient
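For the Poisson case, a minimal sketch (assuming numpy and scipy): v(θ) = θ gives g(x) = 2√x (so c = 1), 2√X̄ is AN(2√θ, 1/n), and the interval is mapped back through g⁻¹(y) = (y/2)².

    # Variance-stabilized (square-root) confidence interval for a Poisson mean.
    import numpy as np
    from scipy.stats import norm

    def poisson_vst_ci(x, alpha=0.05):
        n, xbar = len(x), np.mean(x)
        y = 2 * np.sqrt(xbar)                  # g(Xbar), with variance about 1/n
        z = norm.ppf(1 - alpha / 2)
        lo, hi = y - z / np.sqrt(n), y + z / np.sqrt(n)
        return (lo / 2) ** 2, (hi / 2) ** 2    # map back via g^{-1}

    rng = np.random.default_rng(11)
    print(poisson_vst_ci(rng.poisson(4.0, size=60)))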