
James Stein Estimator

Tony Ke
January 29, 2012

1 James-Stein Estimator

To get to learn the James-Stein estimator, one of the most important statistical ideas of the decade (of the sixties) [8], we first need to sharpen our memory of some basic but very important statistical concepts.

1.1 Estimator

Sometimes also referred to as a decision rule, an estimator is a rule for calculating an estimate of an unknown quantity based on observed data. We use $\theta \in \Theta \subseteq \mathbb{R}^n$ to denote the unknown parameter vector, $X \in \mathcal{X} \subseteq \mathbb{R}^n$ to denote the random vector, and correspondingly $x$ as the realized data point.¹ Then an estimator $\hat{\theta}$ is a mapping from the data space $\mathcal{X}$ to the parameter space $\Theta$.
Example 1.1. (Ordinary Estimator)
For $X \sim N(\theta, \sigma^2 I_n)$,

\hat{\theta}(x) = x.                          (1.1)

This is a very straightforward estimator: we perform one measurement for each unknown quantity, and use the measurement result as the estimate of that unknown quantity. Perhaps surprisingly, however, we will show that this ordinary estimator is not the best estimator in some sense.

Some definitions and examples in this note are excerpted and modified from wikipedia.org, though not explicitly cited in the text.
¹ For the sake of conceptual simplicity, in the following discussion we often talk about a data set X that consists of only one data point. The generalization is straightforward.
Prepared by Tony Ke for UC Berkeley MFE 230K class. All rights reserved.

1.2 Risk Function

How do we measure the goodness of an estimator? Intuitively, an estimator is good if it is close to the unknown parameter of interest, or, equivalently speaking, if the estimation error is small. We use a loss function $L(\theta, \hat{\theta})$ to characterize the estimation error.
Example 1.2. (Quadratic Loss)

L(\theta, \hat{\theta}(x)) = |\hat{\theta}(x) - \theta|^2,                          (1.2)

where $|\cdot|$ is the Euclidean norm.


One should notice that the loss function is data-specific. In most cases, we instead want an overall judgement of the estimation quality of an estimator. This leads to the risk function $R(\theta, \hat{\theta})$, which is defined as the expected value of the loss function,

R(\theta, \hat{\theta}) = E_\theta L(\theta, \hat{\theta}(X)),                          (1.3)

where $E_\theta$ is the expectation over all population values of $X$.


Example 1.3. (Mean Squared Error)
Mean squared error risk corresponds to a quadratic loss function,

R(\theta, \hat{\theta}) = E_\theta |\hat{\theta}(X) - \theta|^2.                          (1.4)

For the ordinary estimator $\hat{\theta}(X) = X$ under $X \sim N(\theta, \sigma^2 I_n)$, $R(\theta, \hat{\theta}) = E_\theta |X - \theta|^2 = n\sigma^2$. (Verify it by yourself!)
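For a quick numerical sanity check of this claim, here is a minimal Python/NumPy sketch (not part of the original note; the particular n, sigma, and theta are arbitrary) that estimates the risk by Monte Carlo:

import numpy as np

rng = np.random.default_rng(0)
n, sigma = 5, 2.0                  # dimension and noise level (arbitrary choices)
theta = rng.normal(size=n)         # any fixed parameter vector works

X = theta + sigma * rng.normal(size=(100_000, n))   # draws from N(theta, sigma^2 I_n)
risk = np.mean(np.sum((X - theta) ** 2, axis=1))    # Monte Carlo estimate of E|X - theta|^2
print(risk, n * sigma**2)                           # the two numbers should be close (about 20)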

1.3 Admissible Decision Rule

After introducing the risk function to compare the goodness of any two estimators, one natural question is how to define the best estimator, which turns out to be a non-trivial question.
An admissible decision rule is defined as a rule for making a decision such that there isn't any other rule that is always better than it. Why do we need to specify "always"? Because the parameter is unknown, a decision rule can perform well for some underlying parameter values, while poorly for others. Mathematically speaking, we say $\hat{\theta}$ is an admissible decision rule if there does not exist $\hat{\theta}'$ such that $R(\theta, \hat{\theta}') \le R(\theta, \hat{\theta})$ for all $\theta \in \Theta$, and $R(\theta, \hat{\theta}') < R(\theta, \hat{\theta})$ for some $\theta \in \Theta$.

1.4 Stein's Paradox

Now we are ready to present Stein's amazing discovery:

The ordinary estimator for the mean of a multivariate Gaussian distribution is inadmissible under mean squared error risk in dimension at least three.
Example 1.4. (Stein's Example)
For $X \sim N(\theta, \sigma^2 I_n)$, let's consider the following estimator,

\hat{\theta}_S(x) = x - \frac{(n-2)\sigma^2}{|x|^2}\, x,                          (1.5)

which will be shown to be a better estimator than the ordinary estimator $\hat{\theta}(x) = x$ in Example 1.1.
"
#
2
2
(n
2)
S ) = E X +
R(,
X
|X|2

(n 2) 2
(n 2)2 4
= E | X|2 + 2( X)T
X
+
|X|2
|X|2
|X|4

( X)T X
1
2
2
2 4
= E | X| + 2(n 2) E
+ (n 2) E
2
|X|
|X|2

1
= n 2 (n 2)2 4 E
< n 2.
(1.6)
|X|2
The last equation comes from
by parts. Its not hard to show that
h integration
i
@h
E [(i Xi )h(X)] = E @x
(X) , for any well-behaved function h().
i
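The inequality in (1.6) can also be checked by simulation. Here is a minimal Python/NumPy sketch (not part of the original note; the particular n, sigma, and theta are arbitrary) comparing the Monte Carlo risks of the two estimators:

import numpy as np

rng = np.random.default_rng(1)
n, sigma = 10, 1.0
theta = np.full(n, 0.5)                                    # arbitrary true mean vector

X = theta + sigma * rng.normal(size=(200_000, n))          # X ~ N(theta, sigma^2 I_n)
norm2 = np.sum(X ** 2, axis=1, keepdims=True)              # |x|^2 for each draw
stein = X * (1.0 - (n - 2) * sigma**2 / norm2)             # Stein's estimator (1.5)

risk_ordinary = np.mean(np.sum((X - theta) ** 2, axis=1))  # close to n sigma^2 = 10
risk_stein = np.mean(np.sum((stein - theta) ** 2, axis=1)) # strictly smaller
print(risk_ordinary, risk_stein)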

Stein's example shows that for estimating the mean of a multivariate Gaussian distribution, Stein's estimator (1.5) is better than the ordinary estimator in that its risk function is smaller. As a quirky example, suppose we measure the speed of light ($\theta_1$), tea consumption in Taiwan ($\theta_2$), and hog weight in Montana ($\theta_3$), and observe the data point $x = (x_1, x_2, x_3)$. The estimates $\hat{\theta}_i(x_i) = x_i$ based on the individual measurements are worse than the ones based on measurements of all three quantities together, $\hat{\theta}_i(x) = \left(1 - \frac{1}{|x|^2}\right) x_i$, though the three quantities have nothing to do with each other.
The Stein estimator always improves upon the total mean squared error risk, i.e., the sum of the expected squared errors of each component. Therefore, the total mean squared error in measuring light speed, tea consumption, and hog weight would improve by using the Stein estimator. However, any particular component (such as the speed of light) would improve for some parameter values, and deteriorate for others. Thus, although the Stein estimator dominates the ordinary estimator when three or more parameters are estimated, any single component does not dominate the respective component of the ordinary estimator.
It's also worthwhile to point out that the validity of Stein's paradox doesn't depend on the quadratic form of the loss function, though a quadratic loss function can approximate any well-behaved general loss function by Taylor expansion. Brown has extended the conclusion of inadmissibility to fairly weak conditions on the loss function [1, 2, 3].

1.5 Stein's Original Argument

It is not intuitively clear why the Stein estimator dominates the ordinary one. Stein's original argument [10] is based on a comparison of $\theta^T \theta$ to $x^T x$ when $n$ is large, and proceeds as follows. (This note is based on [9].) Intuitively, a good estimate $\hat{\theta}$ should satisfy $\hat{\theta}_i \approx \theta_i$ for $i = 1, 2, \ldots, n$, which implies $\hat{\theta}_i^2 \approx \theta_i^2$ for $i = 1, 2, \ldots, n$, and thus

\hat{\theta}^T \hat{\theta} \approx \theta^T \theta.                          (1.7)

We would hope that a chosen estimator satisfies this condition. For $x \sim N(\theta, \sigma^2 I_n)$, the ordinary estimator is $\hat{\theta}(x) = x$. Let $y = \frac{1}{n} x^T x = \frac{1}{n}\sum_{i=1}^n x_i^2$, so that $E[y] = \sigma^2 + \frac{1}{n}\theta^T \theta$. The law of large numbers implies²

\frac{1}{n} x^T x \to \sigma^2 + \frac{1}{n}\theta^T \theta, \quad \text{as } n \to \infty.                          (1.8)

In other words, for large $n$, it is very likely that $x^T x$ exceeds $\theta^T \theta$. This suggests that to form a good estimator of $\theta$, we would need to shrink the ordinary estimator toward 0, which is exactly what Stein's estimator does. That's also the reason why Stein's estimator is also called a shrinkage estimator.
² Notice that $x_i$ for $i = 1, 2, \ldots, n$ are not identically distributed, but the law of large numbers still holds, which can be proved by Chebyshev's inequality.
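The overshoot can be made concrete with a one-line simulation (a Python/NumPy sketch, not part of the original note; the choices of n, sigma, and theta are arbitrary), showing that $\frac{1}{n}x^T x$ is close to $\sigma^2 + \frac{1}{n}\theta^T \theta$ for large $n$:

import numpy as np

rng = np.random.default_rng(2)
n, sigma = 1000, 1.0
theta = rng.uniform(-1, 1, size=n)               # arbitrary fixed parameter vector

x = theta + sigma * rng.normal(size=n)           # one observation x ~ N(theta, sigma^2 I_n)
print(x @ x / n, theta @ theta / n + sigma**2)   # the two values should be close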

1.6 Empirical Bayes Perspective

Another argument for Stein's estimator is based on the Bayesian formulation [4, 9]. Suppose the prior distribution on $\theta$ is

\theta \sim N(0, \tau^2 I_n).                          (1.9)

Then for $x \sim N(\theta, \sigma^2 I_n)$, the posterior distribution of $\theta$ would be

p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta)
  \propto \exp\left( -\frac{1}{2\sigma^2}(x - \theta)^T (x - \theta) \right) \exp\left( -\frac{1}{2\tau^2}\, \theta^T \theta \right)
  \propto \exp\left( -\frac{\sigma^2 + \tau^2}{2\sigma^2\tau^2} \left| \theta - \frac{\tau^2}{\sigma^2 + \tau^2}\, x \right|^2 \right).                          (1.10)

So $\theta \mid x \sim N\!\left( \frac{\tau^2}{\sigma^2 + \tau^2}\, x,\; \frac{\sigma^2 \tau^2}{\sigma^2 + \tau^2}\, I_n \right)$. The Bayes least-squares estimate of $\theta$ is

\hat{\theta}_B = \frac{\tau^2}{\sigma^2 + \tau^2}\, x.                          (1.11)

$\hat{\theta}_B$ achieves the smallest risk for any $x$, under the quadratic loss function. One should notice that the definition of the risk function from the Bayesian perspective differs from that from a frequentist's perspective. In the Bayesian framework, we take the expectation of the loss function over the posterior distribution of $\theta$, while in the frequentist's framework, we take the expectation over the population space of $x$. Instead of an admissible rule, we usually call the risk-function-minimizing rule a Bayes rule in the Bayesian framework.
The expression of $\hat{\theta}_B$ in equation (1.11) is intended to evoke the form of Stein's estimator. In fact, instead of specifying $\tau^2$ from outside, we can estimate the prior density from the data, and then apply the Bayesian framework. This approach is known as empirical Bayes estimation. It can be shown that $\frac{(n-2)\sigma^2}{x^T x}$ is an unbiased estimator of $\frac{\sigma^2}{\sigma^2 + \tau^2}$. By substituting this estimator back into (1.11), Stein's estimator (1.5) is obtained.
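To fill in that step (a short sketch, not in the original note): marginally $x \sim N(0, (\sigma^2 + \tau^2) I_n)$, so $x^T x / (\sigma^2 + \tau^2)$ follows a $\chi^2_n$ distribution, whose reciprocal has mean $1/(n-2)$. Hence

E\left[ \frac{(n-2)\sigma^2}{x^T x} \right]
  = \frac{(n-2)\sigma^2}{\sigma^2 + \tau^2}\, E\!\left[ \frac{1}{\chi^2_n} \right]
  = \frac{(n-2)\sigma^2}{\sigma^2 + \tau^2} \cdot \frac{1}{n-2}
  = \frac{\sigma^2}{\sigma^2 + \tau^2},

and substituting this estimate into $\hat{\theta}_B = \left(1 - \frac{\sigma^2}{\sigma^2 + \tau^2}\right) x$ gives $\left(1 - \frac{(n-2)\sigma^2}{x^T x}\right) x$, which is exactly (1.5).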

1.7 James-Stein Estimator

Stein's idea was later completed and improved, notably by James [6] and by Efron and Morris [5]. Let's consider Stein's idea in a more general case.
For $x_t \sim N(\theta, \Sigma)$ ($t = 1, 2, \ldots, T$), we can similarly show that the ordinary maximum-likelihood estimator $\hat{\theta}_{ML} = \bar{x} = \frac{1}{T}\sum_{t=1}^T x_t$ is inadmissible. The following so-called James-Stein estimator is a better estimator for $n \ge 3$:

\hat{\theta}_{JS} = (1 - k)\, \bar{x} + k\, x_0 \mathbf{1},                          (1.12)

where $x_0$ is an arbitrary number and $k$ is defined as

k = \frac{(n-2)/T}{(\bar{x} - x_0 \mathbf{1})^T \Sigma^{-1} (\bar{x} - x_0 \mathbf{1})}.                          (1.13)

We find that the James-Stein estimator can shrink not only toward 0 but also toward $x_0 \mathbf{1}$. For $\Sigma = \sigma^2 I_n$, $T = 1$ and $x_0 = 0$, (1.12) reduces to Stein's estimator (1.5).
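As an illustration (a minimal Python/NumPy sketch, not from the original note; the function and variable names are mine), (1.12)-(1.13) can be implemented directly from $T$ observations, a known covariance $\Sigma$, and a target $x_0$:

import numpy as np

def james_stein(X, Sigma, x0=0.0):
    # X: (T, n) array of observations, Sigma: known covariance, x0: scalar shrinkage target
    T, n = X.shape
    xbar = X.mean(axis=0)                                    # ML estimate (sample mean)
    d = xbar - x0 * np.ones(n)
    k = ((n - 2) / T) / (d @ np.linalg.solve(Sigma, d))      # shrinkage factor, eq. (1.13)
    return (1.0 - k) * xbar + k * x0 * np.ones(n)            # eq. (1.12)

# For Sigma = sigma^2 I_n, T = 1 and x0 = 0 this reproduces Stein's estimator (1.5):
rng = np.random.default_rng(3)
x = rng.normal(size=(1, 5))
print(james_stein(x, np.eye(5)))
print(x[0] * (1 - 3 / (x[0] @ x[0])))                        # here (n - 2) = 3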

2 Jorion 1986 Paper

The James-Stein method improves accuracy when estimating more than two quantities together. This makes it a natural fit for estimating the returns of multiple assets in a portfolio.

2.1 Framework

Jorion (1986) considered the parameter uncertainties in the portfolio optimization problem [7]. He cares more about the uncertainty of the return mean than about the uncertainty of the return variance-covariance matrix. As pointed out by Prof. Leland in class, there are two rationales behind this: (1) the optimal portfolio allocation is very sensitive to changes in the mean; (2) the estimation accuracy of the variance-covariance matrix gets refined if one obtains data at a finer time scale, such as high-frequency data.
An empirical Bayes method is applied, with the prior on the mean of asset returns given by

p(\mu \mid V, \lambda, Y_0) \propto \exp\left( -\frac{\lambda}{2} (\mu - Y_0 \mathbf{1})^T V^{-1} (\mu - Y_0 \mathbf{1}) \right).                          (2.1)
By repeating a procedure similar to that in Section 1.6, we obtain the optimal estimator as a James-Stein estimator,

\hat{\mu}_B(R) = (1 - k)\, \hat{\mu}_{ML}(R) + k\, \hat{\mu}_{min}(R)\, \mathbf{1},                          (2.2)

where $R$ represents all the asset return data, $\hat{\mu}_{ML}(R) = \bar{x}$ is the ordinary maximum-likelihood estimate of the mean, and

k = \frac{\lambda}{\lambda + T},                          (2.3)

Y_0 = \hat{\mu}_{min}(R) = \frac{\mathbf{1}^T V^{-1} \hat{\mu}_{ML}(R)}{\mathbf{1}^T V^{-1} \mathbf{1}}.                          (2.4)

$Y_0$ happens to be the average return of the minimum-variance portfolio. One can verify that the allocation weights $\frac{V^{-1}\mathbf{1}}{\mathbf{1}^T V^{-1} \mathbf{1}}$ minimize the variance of the portfolio subject to the condition that they sum to one.
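As a short verification sketch (not in the original note): minimizing the portfolio variance $w^T V w$ subject to $w^T \mathbf{1} = 1$ with a Lagrange multiplier $\gamma$ gives

\min_w\; w^T V w - \gamma (w^T \mathbf{1} - 1)
\;\Rightarrow\; 2 V w = \gamma \mathbf{1}
\;\Rightarrow\; w \propto V^{-1} \mathbf{1},

and imposing $w^T \mathbf{1} = 1$ yields $w = \frac{V^{-1}\mathbf{1}}{\mathbf{1}^T V^{-1} \mathbf{1}}$, whose expected return is $w^T \hat{\mu}_{ML}(R) = \frac{\mathbf{1}^T V^{-1} \hat{\mu}_{ML}(R)}{\mathbf{1}^T V^{-1} \mathbf{1}} = Y_0$.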
We can further estimate the shrinkage coefficient from the data,

\hat{k} = \frac{n + 2}{(n + 2) + T\, (\hat{\mu}_{ML}(R) - \hat{\mu}_{min}(R)\mathbf{1})^T V^{-1} (\hat{\mu}_{ML}(R) - \hat{\mu}_{min}(R)\mathbf{1})},                          (2.5)

which, as one might have recognized, is exactly a James-Stein estimator.³ In practice, $V$ is unknown and can be replaced by
\hat{V} = \frac{T - 1}{T - n - 2}\, S,                          (2.6)

where $S$ is the usual unbiased sample covariance matrix.
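To make the recipe concrete, here is a minimal Python/NumPy sketch assembling the Bayes-Stein mean estimate from equations (2.2)-(2.6); it is not from the original note, the function name and synthetic data are mine, and it assumes the covariance adjustment in (2.6):

import numpy as np

def bayes_stein_mean(R):
    # R: (T, n) array of asset returns; returns the Bayes-Stein estimate of the mean
    T, n = R.shape
    mu_ml = R.mean(axis=0)                                   # ML estimate of the mean
    S = np.cov(R, rowvar=False, ddof=1)                      # unbiased sample covariance
    V = (T - 1) / (T - n - 2) * S                            # adjusted covariance, eq. (2.6)
    V_inv = np.linalg.inv(V)
    ones = np.ones(n)
    mu_min = (ones @ V_inv @ mu_ml) / (ones @ V_inv @ ones)  # min-variance return, eq. (2.4)
    d = mu_ml - mu_min * ones
    k = (n + 2) / ((n + 2) + T * (d @ V_inv @ d))            # estimated shrinkage, eq. (2.5)
    return (1 - k) * mu_ml + k * mu_min * ones               # Bayes-Stein estimate, eq. (2.2)

# toy example: simulated monthly returns for 7 assets over 60 months
rng = np.random.default_rng(4)
R = rng.normal(0.01, 0.05, size=(60, 7))
print(bayes_stein_mean(R))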

2.2 Example

Figure 1 illustrates sample estimates from stock market returns for seven major countries, calculated over a 60-month period. The underlying parameters $\mu$ and $V$ were chosen equal to the estimates reported in Figure 1. Then $T$ independent vectors of returns were generated from this distribution, and the following estimators were computed:
1. Certainty Equivalence: classical mean-variance optimization
2. Bayes Diffuse Prior: Klein and Bawa (1976) uninformative prior
3. Minimum Variance: $\lambda \to \infty$ and $k = 1$
4. Bayes-Stein estimator
The results are shown in Figure 2. We can see that the Bayes-Stein estimator beats all the others in estimation accuracy for relatively large sample sizes ($T \ge 50$).
³ $n$ is the number of assets in the portfolio, as represented by $J$ in the lecture notes.

Figure 1: Excerpted from the Jorion (1986) paper. Dollar returns in percent per month. The sample period is January 1977 to December 1981.

Figure 2: Excerpted from the Jorion (1986) paper. $F_{max}$ is the investor's utility function calculated from the true underlying parameters, which is the theoretical maximum utility that can be achieved. $F_i$ is the investor's utility function when she/he adopts the corresponding estimator calculated from the simulation samples. The y-axis on the left shows the relative difference of the utility functions, which directly characterizes the goodness of estimation.

References
[1] L. Brown. On the admissibility of invariant estimators of one or more location parameters. The Annals of Mathematical Statistics, 37(5):1087–1136, 1966.
[2] L. Brown. Estimation with incompletely specified loss functions (the case of several location parameters). Journal of the American Statistical Association, 70(350):417–427, 1975.
[3] L. Brown. A heuristic method for determining admissibility of estimators, with applications. The Annals of Statistics, pages 960–994, 1979.
[4] B. Efron and C. Morris. Limiting the risk of Bayes and empirical Bayes estimators, part II: the empirical Bayes case. Journal of the American Statistical Association, 67(337):130–139, 1972.
[5] B. Efron and C. Morris. Stein's estimation rule and its competitors: an empirical Bayes approach. Journal of the American Statistical Association, 68(341):117–130, 1973.
[6] W. James and C. Stein. Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 361–379, 1961.
[7] P. Jorion. Bayes-Stein estimation for portfolio analysis. Journal of Financial and Quantitative Analysis, 21(3):279–292, 1986.
[8] D. Lindley. Discussion on Professor Stein's paper. Journal of the Royal Statistical Society, 24:285–287, 1962.
[9] J. Richards. An introduction to James-Stein estimation, 1999.
[10] C. Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 197–206, 1956.
