Lecture 6:
6.1 Bias and Variance
Example 6.1.1: Let $X_1, \ldots, X_n$ be iid random variables such that $\mu = E_\mu(X_1)$ is finite, and let $\bar{X}$ be the usual sample mean. Consider $\bar{X}/2$ as an estimator of $\mu$. Then
\[
\operatorname{Bias}_\mu(\bar{X}/2) = E_\mu(\bar{X}/2) - \mu = \frac{\mu}{2} - \mu = -\frac{\mu}{2}.
\]
Note that the bias is zero if $\mu$ happens to be zero, but not if $\mu \neq 0$, so this estimator is biased (i.e., not unbiased).
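As a quick sanity check (not part of the example), the bias of $\bar{X}/2$ can be approximated by simulation; the choices of $\mu = 3$, exponential data, and $n = 20$ below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n, reps = 3.0, 20, 100_000           # illustrative values, not from the notes

# Each row is one sample of size n with mean mu (exponential chosen arbitrarily)
samples = rng.exponential(scale=mu, size=(reps, n))
est = samples.mean(axis=1) / 2            # the estimator X-bar / 2

print("Monte Carlo E(X-bar/2):", est.mean())   # approximately mu/2 = 1.5
print("Bias estimate:", est.mean() - mu)       # approximately -mu/2 = -1.5
```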
Example 6.1.2: Let $X_1, \ldots, X_n$ be iid random variables such that both $\mu = E_{\mu,\sigma^2}(X_1)$ and $\sigma^2 = \operatorname{Var}_{\mu,\sigma^2}(X_1)$ are finite, and suppose $n \geq 2$. Let $\bar{X}$ and $S^2$ be the usual sample mean and sample variance, respectively. Then
\[
E_{\mu,\sigma^2}(S^2) = \frac{1}{n-1}\, E_{\mu,\sigma^2}\!\left(\sum_{i=1}^n X_i^2 - n\bar{X}^2\right) = \frac{1}{n-1}\left[n(\mu^2 + \sigma^2) - n\!\left(\mu^2 + \frac{\sigma^2}{n}\right)\right] = \sigma^2.
\]
The bias tells us whether an estimator tends to overestimate or underestimate its target on
average. It does not necessarily mean that the value it takes for a particular data set (i.e.,
a particular estimate) is larger or smaller than the true parameter value.
Example 6.1.3: Consider $(n-1)S^2/n$ as an estimator of $\sigma^2$ in Example 6.1.2. Its bias is
\[
\operatorname{Bias}_{\mu,\sigma^2}\!\left[\frac{(n-1)S^2}{n}\right] = E_{\mu,\sigma^2}\!\left[\frac{(n-1)S^2}{n}\right] - \sigma^2 = \frac{(n-1)\sigma^2}{n} - \sigma^2 = -\frac{\sigma^2}{n},
\]
which is negative for all $\sigma^2 > 0$. Thus, $(n-1)S^2/n$ tends to underestimate $\sigma^2$. However, this does not necessarily mean that the value of this estimator for a particular data set (i.e., a particular estimate) is smaller than $\sigma^2$.
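To make this distinction concrete, a small simulation (my own illustration; normal data with $\sigma^2 = 1$ and $n = 10$ are arbitrary choices) shows that $(n-1)S^2/n$ falls below $\sigma^2$ on average, yet a sizable fraction of individual estimates exceed $\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2, reps = 10, 1.0, 100_000        # arbitrary illustrative values

x = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))
s2 = x.var(axis=1, ddof=1)                # usual sample variance S^2
biased = (n - 1) / n * s2                 # the estimator (n-1)S^2/n

print("mean of (n-1)S^2/n:", biased.mean())              # approximately 0.9 = sigma2 - sigma2/n
print("fraction of estimates above sigma^2:", (biased > sigma2).mean())
```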
It is often the case that we can trade a small amount of bias in order to improve an
estimator in other ways. This idea will be discussed more later.
Variance
It can also be useful to consider the variance $\operatorname{Var}_\theta(\hat{\theta})$ of an estimator $\hat{\theta}$ of a parameter $\theta$.
Example 6.1.5: Continuing from Example 6.1.4, suppose further that the distribution of $X_1, \ldots, X_n$ is normal. Then $(n-1)S^2/\sigma^2 \sim \chi^2_{n-1}$, which has variance $2(n-1)$, so
\[
\operatorname{Var}_{\mu,\sigma^2}(S^2) = \left(\frac{\sigma^2}{n-1}\right)^2 \operatorname{Var}_{\mu,\sigma^2}\!\left[\frac{(n-1)S^2}{\sigma^2}\right] = \left(\frac{\sigma^2}{n-1}\right)^2 [2(n-1)] = \frac{2(\sigma^2)^2}{n-1}.
\]
Similarly,
\[
\operatorname{Var}_{\mu,\sigma^2}\!\left[\left(\frac{n-1}{n}\right)S^2\right] = \left(\frac{n-1}{n}\right)^2 \operatorname{Var}_{\mu,\sigma^2}(S^2),
\]
which is smaller than the variance of $S^2$. The variance of the estimator $(X_1 - X_2)^2/2$ can be found by noting that it is simply the sample variance of the first two observations, and thus
\[
\operatorname{Var}_{\mu,\sigma^2}\!\left[\frac{(X_1 - X_2)^2}{2}\right] = 2(\sigma^2)^2 = (n-1)\operatorname{Var}_{\mu,\sigma^2}(S^2).
\]
Unless n is very small, this estimator has much larger variance than either of the other two
estimators discussed above.
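These variance formulas can also be checked by simulation. A minimal sketch, assuming normal data with $\sigma^2 = 1$ and $n = 10$ (arbitrary illustrative choices, not taken from the notes):

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma2, reps = 10, 1.0, 200_000        # arbitrary illustrative values

x = rng.normal(scale=np.sqrt(sigma2), size=(reps, n))
s2 = x.var(axis=1, ddof=1)                         # S^2
shrunk = (n - 1) / n * s2                          # (n-1)S^2/n
pair = (x[:, 0] - x[:, 1]) ** 2 / 2                # (X1 - X2)^2 / 2

print("Var(S^2):         ", s2.var(),     " theory:", 2 * sigma2**2 / (n - 1))
print("Var((n-1)S^2/n):  ", shrunk.var(), " theory:", ((n - 1) / n) ** 2 * 2 * sigma2**2 / (n - 1))
print("Var((X1-X2)^2/2): ", pair.var(),   " theory:", 2 * sigma2**2)
```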
A smaller variance is usually better, but this is not always true. For example, a constant
estimator (i.e., an estimator that ignores the data altogether) has zero variance but is clearly
not a good estimator.
Bias-Variance Trade-Off
When comparing sensible estimators, an estimator with larger bias often has smaller variance,
and vice versa. Thus, it may not be immediately clear which of several sensible estimators
is to be preferred.
Example 6.1.6: Continuing from Example 6.1.5, the estimators $S^2$ and $(X_1 - X_2)^2/2$ are both unbiased, but $S^2$ has smaller variance. Thus, $S^2$ is a better estimator than $(X_1 - X_2)^2/2$. However, the comparison between $S^2$ and $(n-1)S^2/n$ is not so clear. One estimator has
smaller bias, while the other estimator has smaller variance.
6.2 Mean Squared Error
The mean squared error of an estimator $\hat{\theta}$ of a parameter $\theta$ is $\operatorname{MSE}_\theta(\hat{\theta}) = E_\theta[(\hat{\theta} - \theta)^2]$. It provides one way to evaluate the overall performance of an estimator.
Note: Like the bias, the mean squared error of an estimator implicitly depends on
what the estimator is estimating. Again, we will assume that this is clear from context
rather than explicitly showing it in our notation.
Theorem 6.2.1. $\operatorname{MSE}_\theta(\hat{\theta}) = [\operatorname{Bias}_\theta(\hat{\theta})]^2 + \operatorname{Var}_\theta(\hat{\theta})$.

Proof. Writing $\hat{\theta} - \theta = [\hat{\theta} - E_\theta(\hat{\theta})] + [E_\theta(\hat{\theta}) - \theta]$ and expanding the square,
\[
\operatorname{MSE}_\theta(\hat{\theta}) = E_\theta[(\hat{\theta} - \theta)^2] = \operatorname{Var}_\theta(\hat{\theta}) + 2[E_\theta(\hat{\theta}) - \theta]\, E_\theta[\hat{\theta} - E_\theta(\hat{\theta})] + [\operatorname{Bias}_\theta(\hat{\theta})]^2 = [\operatorname{Bias}_\theta(\hat{\theta})]^2 + \operatorname{Var}_\theta(\hat{\theta}),
\]
since $E_\theta[\hat{\theta} - E_\theta(\hat{\theta})] = 0$.
Example 6.2.2: Continuing from Example 6.1.6, the mean squared errors of the estimators $S^2$ and $(n-1)S^2/n$ are
\[
\operatorname{MSE}_{\mu,\sigma^2}(S^2) = [\operatorname{Bias}_{\mu,\sigma^2}(S^2)]^2 + \operatorname{Var}_{\mu,\sigma^2}(S^2) = 0 + \frac{2(\sigma^2)^2}{n-1} = \frac{2(\sigma^2)^2}{n-1},
\]
\[
\operatorname{MSE}_{\mu,\sigma^2}\!\left[\left(\frac{n-1}{n}\right)S^2\right] = \left\{\operatorname{Bias}_{\mu,\sigma^2}\!\left[\left(\frac{n-1}{n}\right)S^2\right]\right\}^2 + \operatorname{Var}_{\mu,\sigma^2}\!\left[\left(\frac{n-1}{n}\right)S^2\right] = \left(-\frac{\sigma^2}{n}\right)^2 + \frac{2(n-1)(\sigma^2)^2}{n^2} = \frac{(2n-1)(\sigma^2)^2}{n^2}
\]
by Theorem 6.2.1. Since $(2n-1)/n^2 < 2/(n-1)$ for all $n \geq 2$, the slightly biased estimator $(n-1)S^2/n$ in fact has smaller mean squared error than $S^2$ for every value of $\sigma^2$.
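This inequality is easy to confirm numerically; the following brief sketch (the sample sizes are arbitrary) compares the two coefficients of $(\sigma^2)^2$:

```python
# Compare the MSE coefficients (in units of (sigma^2)^2) for a few sample sizes
for n in (2, 5, 10, 50):
    unbiased = 2 / (n - 1)           # MSE of S^2
    shrunk = (2 * n - 1) / n**2      # MSE of (n-1)S^2/n
    print(n, unbiased, shrunk, shrunk < unbiased)
```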
More commonly, when comparing sensible estimators, it is often the case that one estimator
has smaller mean squared error for some parameter values, while the other estimator has
smaller mean squared error for other parameter values. In this case, it is not at all clear
which estimator is better.
Example 6.2.4: Suppose $X \sim \operatorname{Bin}(n, \theta)$, where $\theta$ is unknown and $0 \leq \theta \leq 1$. Recall that the maximum likelihood estimator of $\theta$ is $\hat{\theta}_{\mathrm{MLE}} = X/n$. Its bias and variance are
\[
\operatorname{Bias}_\theta(\hat{\theta}_{\mathrm{MLE}}) = E_\theta\!\left(\frac{X}{n}\right) - \theta = 0,
\qquad
\operatorname{Var}_\theta(\hat{\theta}_{\mathrm{MLE}}) = \operatorname{Var}_\theta\!\left(\frac{X}{n}\right) = \frac{\theta(1-\theta)}{n},
\]
so its mean squared error is
\[
\operatorname{MSE}_\theta(\hat{\theta}_{\mathrm{MLE}}) = \frac{\theta(1-\theta)}{n}.
\]
If we instead put a $\operatorname{Beta}(a, b)$ prior on $\theta$ and conduct a Bayesian analysis, we find that the posterior mean is $\hat{\theta}_B = (X + a)/(n + a + b)$. Its bias and variance are
\[
\operatorname{Bias}_\theta(\hat{\theta}_B) = E_\theta\!\left(\frac{X + a}{n + a + b}\right) - \theta = \frac{n\theta + a}{n + a + b} - \theta = \frac{(1-\theta)a - \theta b}{n + a + b},
\]
\[
\operatorname{Var}_\theta(\hat{\theta}_B) = \operatorname{Var}_\theta\!\left(\frac{X + a}{n + a + b}\right) = \frac{n\theta(1-\theta)}{(n + a + b)^2},
\]
so its mean squared error is
\[
\operatorname{MSE}_\theta(\hat{\theta}_B) = [\operatorname{Bias}_\theta(\hat{\theta}_B)]^2 + \operatorname{Var}_\theta(\hat{\theta}_B) = \frac{[(1-\theta)a - \theta b]^2 + n\theta(1-\theta)}{(n + a + b)^2}.
\]
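For a concrete check of these formulas, the sketch below (my own illustration) compares Monte Carlo estimates of the two mean squared errors with the expressions above; $n = 25$ and $a = b = 1$ match the plotted example further down, while $\theta = 0.5$ is an arbitrary point.

```python
import numpy as np

rng = np.random.default_rng(3)
n, a, b, theta, reps = 25, 1, 1, 0.5, 200_000     # illustrative values

x = rng.binomial(n, theta, size=reps)
mle = x / n
bayes = (x + a) / (n + a + b)

mse_mle_theory = theta * (1 - theta) / n
mse_bayes_theory = (((1 - theta) * a - theta * b) ** 2 + n * theta * (1 - theta)) / (n + a + b) ** 2

print("MLE   MSE: simulated", ((mle - theta) ** 2).mean(),   " theory", mse_mle_theory)
print("Bayes MSE: simulated", ((bayes - theta) ** 2).mean(), " theory", mse_bayes_theory)
```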
A rather stupid choice would be the constant estimator that ignores the data and just estimates some constant $c$ no matter what. Then
\[
\operatorname{Bias}_\theta(c) = c - \theta \qquad\text{and}\qquad \operatorname{Var}_\theta(c) = 0,
\]
so its mean squared error is $\operatorname{MSE}_\theta(c) = (c - \theta)^2$.
The MSE of each estimator as a function of $\theta$ is plotted below in the case where $n = 25$. The Bayes estimator (posterior mean) uses $a = b = 1$. The constant estimator takes $c = 1/3$.
[Figure: MSE of the MLE, Bayes, and constant estimators plotted against $\theta \in [0, 1]$.]
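A short sketch along the following lines should reproduce the figure; the use of matplotlib and the grid of $\theta$ values are my own choices rather than anything specified in the notes.

```python
import numpy as np
import matplotlib.pyplot as plt

n, a, b, c = 25, 1, 1, 1 / 3
theta = np.linspace(0, 1, 401)

mse_mle = theta * (1 - theta) / n
mse_bayes = (((1 - theta) * a - theta * b) ** 2 + n * theta * (1 - theta)) / (n + a + b) ** 2
mse_const = (c - theta) ** 2

plt.plot(theta, mse_mle, label="MLE")
plt.plot(theta, mse_bayes, label="Bayes")
plt.plot(theta, mse_const, label="Constant")
plt.xlabel(r"$\theta$")
plt.ylabel("MSE")
plt.ylim(0, 0.012)
plt.legend()
plt.show()
```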
Best Estimators
It is natural to ask whether we can find an estimator of $\theta$ that has smaller mean squared error than every other estimator for all $\theta$. However, no such estimator can exist. This conclusion is actually trivial, since the constant estimator $\hat{\theta} = c$ will always have smaller mean squared error than any other estimator if $\theta$ is actually equal to $c$. Thus, we must consider the idea of a best estimator in a narrower sense. There are two ways to do this:
- Take a weighted average of the MSE over all possible values, so that we can measure the performance of an estimator through a single number that takes into account all values of $\theta$ (rather than a function of $\theta$). Then try to find the estimator that minimizes this average MSE. It turns out that this is surprisingly easy, as we'll see.
- Restrict our attention to only estimators that meet a certain criterion, then try to find an estimator that is best (has lowest MSE for all $\theta$) within this subset. The most common approach is to restrict our attention to unbiased estimators and try to find the best unbiased estimator.
The notion of average MSE optimality is discussed below, while the notion of best unbiased
estimators will be addressed later in the course.
Average MSE Optimality
Let $w(\theta)$ be a nonnegative weighting function that describes how much we want the various values of $\theta$ to count toward our weighted average MSE. Assume without loss of generality that $\int w(\theta)\, d\theta = 1$ or $\sum_\theta w(\theta) = 1$ (whichever is appropriate). Then let
\[
r_w(\hat{\theta}) = \int \operatorname{MSE}_\theta(\hat{\theta})\, w(\theta)\, d\theta
\qquad\text{or}\qquad
r_w(\hat{\theta}) = \sum_\theta \operatorname{MSE}_\theta(\hat{\theta})\, w(\theta)
\]
denote our weighted average MSE. The following theorem tells us how to find an estimator that minimizes this weighted average MSE.
Theorem 6.2.5. Let $\hat{\theta}_B$ denote the posterior mean of $\theta$ under the prior $\pi(\theta) = w(\theta)$. Then
\[
r_w(\hat{\theta}_B) \leq r_w(\hat{\theta})
\]
for any other estimator $\hat{\theta}$ of $\theta$.
Proof. We provide the proof for the case where the data and parameter are both continuous. (The proofs of the other cases are similar.) Let $f_\theta(x)$ be the joint pdf of the data, where $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$. Then
\[
r_w(\hat{\theta}) = \int \left\{ \int_{\mathbb{R}^n} [\hat{\theta}(x) - \theta]^2 f_\theta(x)\, dx \right\} \pi(\theta)\, d\theta
= \int_{\mathbb{R}^n} \left\{ \int [\hat{\theta}(x) - \theta]^2\, \pi(\theta \mid x)\, d\theta \right\} m(x)\, dx,
\]
where $m(x)$ is the marginal pdf of the data and $\pi(\theta \mid x)$ is the posterior pdf of $\theta$. For each fixed $x$,
\[
\int [\hat{\theta}(x) - \theta]^2\, \pi(\theta \mid x)\, d\theta
= \int [\hat{\theta}(x) - \hat{\theta}_B(x) + \hat{\theta}_B(x) - \theta]^2\, \pi(\theta \mid x)\, d\theta
\]
\[
= \int \left\{ [\hat{\theta}(x) - \hat{\theta}_B(x)]^2 + 2[\hat{\theta}(x) - \hat{\theta}_B(x)][\hat{\theta}_B(x) - \theta] \right\} \pi(\theta \mid x)\, d\theta
+ \int [\hat{\theta}_B(x) - \theta]^2\, \pi(\theta \mid x)\, d\theta
\geq \int [\hat{\theta}_B(x) - \theta]^2\, \pi(\theta \mid x)\, d\theta,
\]
since the cross term vanishes (because $\hat{\theta}_B(x)$ is the posterior mean, $\int [\hat{\theta}_B(x) - \theta]\, \pi(\theta \mid x)\, d\theta = 0$) and the remaining squared term is nonnegative. Therefore
\[
r_w(\hat{\theta}) \geq \int_{\mathbb{R}^n} \left\{ \int [\hat{\theta}_B(x) - \theta]^2\, \pi(\theta \mid x)\, d\theta \right\} m(x)\, dx
= \int E_\theta[(\hat{\theta}_B - \theta)^2]\, \pi(\theta)\, d\theta = r_w(\hat{\theta}_B),
\]
which proves the theorem. $\square$
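As a concrete illustration of the theorem (this worked example is my own addition, not part of the original notes), take the binomial setting of Example 6.2.4 with the uniform weighting function $w(\theta) = 1$ on $[0, 1]$, i.e., the $\operatorname{Beta}(1, 1)$ prior. The theorem says the posterior mean $\hat{\theta}_B = (X + 1)/(n + 2)$ minimizes the weighted average MSE, and indeed
\[
r_w(\hat{\theta}_{\mathrm{MLE}}) = \int_0^1 \frac{\theta(1-\theta)}{n}\, d\theta = \frac{1}{6n},
\qquad
r_w(\hat{\theta}_B) = \int_0^1 \frac{(1 - 2\theta)^2 + n\theta(1-\theta)}{(n+2)^2}\, d\theta = \frac{1/3 + n/6}{(n+2)^2} = \frac{1}{6(n+2)},
\]
which is smaller than $1/(6n)$ for every $n$.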