\[
y = X\beta + \epsilon, \tag{1}
\]
where $X = [x_i^T]_{i=1}^n$ is the $n \times p$ matrix of regressors with $i$-th row $x_i^T$ and is assumed fixed, $\beta$ is the slope vector of regression coefficients, and $\epsilon = [\epsilon_i]_{i=1}^n$ is the vector of random variables representing pure error or measurement error in the dependent variable. For independent observations, the errors are taken to be $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$.
A popular Bayesian model builds upon the linear regression of y using conjugate priors by specifying
\[
p(\beta, \sigma^2) = p(\beta \mid \sigma^2)\, p(\sigma^2) = N(\mu, \sigma^2 V) \times IG(a, b) = NIG(\mu, V, a, b)
\]
\[
= \frac{b^a}{(2\pi)^{p/2} |V|^{1/2} \Gamma(a)} \left( \frac{1}{\sigma^2} \right)^{a + p/2 + 1} \exp\left\{ -\frac{1}{\sigma^2}\left[ b + \frac{1}{2}(\beta - \mu)^T V^{-1} (\beta - \mu) \right] \right\}, \tag{2}
\]
where $\Gamma(\cdot)$ represents the Gamma function and the $IG(a, b)$ prior density for $\sigma^2$ is given by
\[
p(\sigma^2) = \frac{b^a}{\Gamma(a)} \left( \frac{1}{\sigma^2} \right)^{a+1} \exp\left( -\frac{b}{\sigma^2} \right), \qquad \sigma^2 > 0,
\]
where $a, b > 0$. We call this the Normal-Inverse-Gamma ($NIG$) prior and denote it as $NIG(\mu, V, a, b)$.
The $NIG$ probability distribution is a joint probability distribution of a vector $\beta$ and a scalar $\sigma^2$. If $(\beta, \sigma^2) \sim NIG(\mu, V, a, b)$, then an interesting analytic form results from integrating out $\sigma^2$:
\[
\int NIG(\mu, V, a, b)\, d\sigma^2
= \frac{b^a}{(2\pi)^{p/2} |V|^{1/2} \Gamma(a)} \int \left( \frac{1}{\sigma^2} \right)^{a + p/2 + 1} \exp\left\{ -\frac{1}{\sigma^2}\left[ b + \frac{1}{2}(\beta - \mu)^T V^{-1} (\beta - \mu) \right] \right\} d\sigma^2
\]
\[
= \frac{b^a\, \Gamma\!\left( a + \frac{p}{2} \right)}{(2\pi)^{p/2} |V|^{1/2} \Gamma(a)} \left[ b + \frac{1}{2}(\beta - \mu)^T V^{-1} (\beta - \mu) \right]^{-\left( a + \frac{p}{2} \right)}
\]
\[
= \frac{\Gamma\!\left( a + \frac{p}{2} \right)}{\pi^{p/2} \left| (2a) \frac{b}{a} V \right|^{1/2} \Gamma(a)} \left[ 1 + \frac{(\beta - \mu)^T \left( \frac{b}{a} V \right)^{-1} (\beta - \mu)}{2a} \right]^{-\frac{2a + p}{2}}.
\]
This is the density of a $p$-dimensional multivariate Student's $t$ distribution,
\[
MVSt_{\nu}(\mu, \Sigma) = \frac{\Gamma\!\left( \frac{\nu + p}{2} \right)}{\Gamma\!\left( \frac{\nu}{2} \right) (\nu \pi)^{p/2} |\Sigma|^{1/2}} \left[ 1 + \frac{(\beta - \mu)^T \Sigma^{-1} (\beta - \mu)}{\nu} \right]^{-\frac{\nu + p}{2}}, \tag{3}
\]
with $\nu = 2a$ and $\Sigma = \frac{b}{a} V$.
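The identity (3) can be checked numerically by Monte Carlo: draw $\sigma^2$ from its inverse-Gamma prior and $\beta \mid \sigma^2$ from the conditional normal, then compare the sample moments of $\beta$ with those of the stated multivariate $t$, whose covariance is $\frac{\nu}{\nu-2}\Sigma = \frac{b}{a-1}V$ for $a > 1$. The sketch below is our own illustration, not part of the text:

```python
import numpy as np

# Monte Carlo check (our own sketch): marginalizing sigma^2 out of the
# NIG(mu, V, a, b) prior should leave beta with the multivariate Student-t
# law MVSt_{2a}(mu, (b/a) V), whose covariance is b/(a-1) * V for a > 1.
rng = np.random.default_rng(0)
p, a, b = 2, 3.0, 4.0
mu = np.array([1.0, -2.0])
V = np.array([[2.0, 0.5], [0.5, 1.0]])

n_draws = 200_000
# sigma^2 ~ IG(a, b)  <=>  1/sigma^2 ~ Gamma(shape=a, rate=b)
sigma2 = 1.0 / rng.gamma(shape=a, scale=1.0 / b, size=n_draws)
L = np.linalg.cholesky(V)
z = rng.standard_normal((n_draws, p))
beta = mu + np.sqrt(sigma2)[:, None] * (z @ L.T)   # beta | sigma^2 ~ N(mu, sigma^2 V)

emp_cov = np.cov(beta.T)
t_cov = b / (a - 1.0) * V          # covariance of MVSt_{2a}(mu, (b/a) V)
print(np.max(np.abs(emp_cov - t_cov)))
```

The same scheme (scale mixture of normals) is exactly how one simulates from a multivariate $t$ in practice.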
The likelihood
The likelihood for the model is defined, up to proportionality, as the joint probability of observing the data given the parameters. Since $X$ is fixed, the likelihood is given by
\[
p(y \mid \beta, \sigma^2) = N(X\beta, \sigma^2 I) = \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left\{ -\frac{1}{2\sigma^2} (y - X\beta)^T (y - X\beta) \right\}. \tag{4}
\]
The posterior distribution is, by Bayes' theorem,
\[
p(\beta, \sigma^2 \mid y) = \frac{p(\beta, \sigma^2)\, p(y \mid \beta, \sigma^2)}{p(y)},
\]
where $p(y) = \int\!\!\int p(\beta, \sigma^2)\, p(y \mid \beta, \sigma^2)\, d\beta\, d\sigma^2$ is the marginal distribution of the data. The key to deriving the joint posterior distribution is the following easily verified multivariate completion of squares or ellipsoidal rectification identity:
\[
u^T A u - 2\alpha^T u = (u - A^{-1}\alpha)^T A (u - A^{-1}\alpha) - \alpha^T A^{-1} \alpha, \tag{5}
\]
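Identity (5) is easy to spot-check numerically; the snippet below (our own illustration) verifies it for a random symmetric positive definite $A$:

```python
import numpy as np

# Numerical sanity check (our own sketch) of the completion-of-squares identity:
#   u^T A u - 2 alpha^T u = (u - A^{-1} alpha)^T A (u - A^{-1} alpha) - alpha^T A^{-1} alpha
rng = np.random.default_rng(1)
p = 4
M = rng.standard_normal((p, p))
A = M @ M.T + p * np.eye(p)          # symmetric positive definite, hence invertible
u = rng.standard_normal(p)
alpha = rng.standard_normal(p)

lhs = u @ A @ u - 2 * alpha @ u
c = np.linalg.solve(A, alpha)        # A^{-1} alpha without forming the inverse
rhs = (u - c) @ A @ (u - c) - alpha @ c
print(abs(lhs - rhs))                # ~0 up to floating point
```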
where A is a symmetric positive definite (hence invertible) matrix. An application of this identity
immediately reveals,
\[
\frac{1}{\sigma^2}\left[ b + \frac{1}{2}(\beta - \mu)^T V^{-1} (\beta - \mu) + \frac{1}{2}(y - X\beta)^T (y - X\beta) \right]
= \frac{1}{\sigma^2}\left[ b^* + \frac{1}{2}(\beta - \mu^*)^T V^{*-1} (\beta - \mu^*) \right],
\]
using which we can write the posterior as
\[
p(\beta, \sigma^2 \mid y) \propto \left( \frac{1}{\sigma^2} \right)^{a + (n+p)/2 + 1} \exp\left\{ -\frac{1}{\sigma^2}\left[ b^* + \frac{1}{2}(\beta - \mu^*)^T V^{*-1} (\beta - \mu^*) \right] \right\}, \tag{6}
\]
where
\[
\begin{aligned}
\mu^* &= (V^{-1} + X^T X)^{-1} (V^{-1}\mu + X^T y), \\
V^* &= (V^{-1} + X^T X)^{-1}, \\
a^* &= a + n/2, \\
b^* &= b + \frac{1}{2}\left[ \mu^T V^{-1} \mu + y^T y - \mu^{*T} V^{*-1} \mu^* \right].
\end{aligned}
\]
This posterior distribution is easily identified as a $NIG(\mu^*, V^*, a^*, b^*)$, proving it to be a conjugate family for the linear regression model.
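The update formulas above translate directly into code. The following sketch (the function name `nig_posterior` is ours, not from the text) computes $(\mu^*, V^*, a^*, b^*)$ for given data and prior hyperparameters:

```python
import numpy as np

# A minimal sketch of the conjugate NIG update described above; the function
# name `nig_posterior` and the example data are ours, not from the text.
def nig_posterior(y, X, mu, V, a, b):
    """Return (mu_star, V_star, a_star, b_star) of the NIG posterior."""
    Vinv = np.linalg.inv(V)
    V_star = np.linalg.inv(Vinv + X.T @ X)
    mu_star = V_star @ (Vinv @ mu + X.T @ y)
    a_star = a + len(y) / 2.0
    b_star = b + 0.5 * (mu @ Vinv @ mu + y @ y
                        - mu_star @ np.linalg.inv(V_star) @ mu_star)
    return mu_star, V_star, a_star, b_star

rng = np.random.default_rng(2)
n, p = 50, 3
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.standard_normal(n)
mu, V, a, b = np.zeros(p), np.eye(p), 2.0, 1.0
mu_star, V_star, a_star, b_star = nig_posterior(y, X, mu, V, a, b)
```

Note that $\mu^*$ solves the linear system $(V^{-1} + X^T X)\mu^* = V^{-1}\mu + X^T y$, so in production code one would use a linear solve rather than explicit inverses.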
Note that the marginal posterior distribution of $\sigma^2$ is immediately seen to be an $IG(a^*, b^*)$ whose density is given by:
\[
p(\sigma^2 \mid y) = \frac{b^{*a^*}}{\Gamma(a^*)} \left( \frac{1}{\sigma^2} \right)^{a^* + 1} \exp\left( -\frac{b^*}{\sigma^2} \right). \tag{7}
\]
The marginal posterior distribution of $\beta$ is obtained by integrating out $\sigma^2$ from the $NIG$ joint posterior as follows:
\[
p(\beta \mid y) = \int p(\beta, \sigma^2 \mid y)\, d\sigma^2 = \int NIG(\mu^*, V^*, a^*, b^*)\, d\sigma^2
\]
\[
\propto \int \left( \frac{1}{\sigma^2} \right)^{a^* + p/2 + 1} \exp\left\{ -\frac{1}{\sigma^2}\left[ b^* + \frac{1}{2}(\beta - \mu^*)^T V^{*-1} (\beta - \mu^*) \right] \right\} d\sigma^2
\propto \left[ 1 + \frac{(\beta - \mu^*)^T V^{*-1} (\beta - \mu^*)}{2b^*} \right]^{-\left( a^* + \frac{p}{2} \right)}.
\]
Hence
\[
p(\beta \mid y) = MVSt_{\nu^*}(\mu^*, \Sigma^*), \tag{8}
\]
with $\nu^* = 2a^*$ and $\Sigma^* = \frac{b^*}{a^*} V^*$.
A useful alternative expression for $b^*$, derived below, is
\[
b^* = b + \frac{1}{2}(y - X\mu)^T \left( I + X V X^T \right)^{-1} (y - X\mu). \tag{9}
\]
On account of the expression for $b^*$ derived in the preceding section, it suffices to prove that
\[
y^T y + \mu^T V^{-1} \mu - \mu^{*T} V^{*-1} \mu^* = (y - X\mu)^T \left( I + X V X^T \right)^{-1} (y - X\mu). \tag{10}
\]
(y X )
1
CA1 ,
(11)
where $A$ and $D$ are square matrices that are invertible and $B$ and $C$ are rectangular (square if $A$ and $D$ have the same dimensions) matrices such that the multiplications are well-defined. This identity is easily verified by multiplying the right-hand side by $A + BDC$ and simplifying to reduce it to the identity matrix.
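A quick numerical check of (11), again our own illustration with random well-conditioned matrices:

```python
import numpy as np

# Numerical spot-check (our own sketch) of the Sherman-Morrison-Woodbury
# identity (11) with random, well-conditioned matrices.
rng = np.random.default_rng(4)
n, k = 5, 3
A = 2.0 * np.eye(n) + 0.1 * rng.standard_normal((n, n))
B = 0.3 * rng.standard_normal((n, k))
C = 0.3 * rng.standard_normal((k, n))
D = np.eye(k) + 0.1 * rng.standard_normal((k, k))

lhs = np.linalg.inv(A + B @ D @ C)
Ainv = np.linalg.inv(A)
rhs = Ainv - Ainv @ B @ np.linalg.inv(np.linalg.inv(D) + C @ Ainv @ B) @ C @ Ainv
print(np.max(np.abs(lhs - rhs)))   # ~0 up to floating point
```

The identity is what makes the computation practical when $n \gg p$: inverting the $n \times n$ matrix $I + XVX^T$ is reduced to inverting $p \times p$ matrices.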
Applying (11) twice, once with $A = V$ and $D = (X^T X)^{-1}$ to get the second equality and then with $A = (X^T X)^{-1}$ and $D = V$ to get the third equality, we have
\begin{align}
V^{-1} - V^{-1} V^* V^{-1} &= V^{-1} - V^{-1}(V^{-1} + X^T X)^{-1} V^{-1} \notag \\
&= \left[ V + (X^T X)^{-1} \right]^{-1} \tag{12} \\
&= X^T X - X^T X (X^T X + V^{-1})^{-1} X^T X \tag{13} \\
&= X^T (I_n - X V^* X^T) X. \tag{14}
\end{align}
To obtain the marginal distribution of $y$, we first compute the distribution $p(y \mid \sigma^2)$ by integrating out $\beta$ and subsequently integrate out $\sigma^2$ to obtain $p(y)$. To be precise, we use the expression for $b^*$ derived in the preceding section, proceeding as below:
\[
p(y \mid \sigma^2) = \int p(y \mid \beta, \sigma^2)\, p(\beta \mid \sigma^2)\, d\beta
\]
\[
= \frac{1}{(2\pi\sigma^2)^{\frac{n+p}{2}} |V|^{1/2}} \int \exp\left\{ -\frac{1}{2\sigma^2}\left[ (y - X\mu)^T (I + XVX^T)^{-1} (y - X\mu) + (\beta - \mu^*)^T V^{*-1} (\beta - \mu^*) \right] \right\} d\beta
\]
\[
= \frac{1}{(2\pi\sigma^2)^{\frac{n+p}{2}} |V|^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2} (y - X\mu)^T (I + XVX^T)^{-1} (y - X\mu) \right\} \int \exp\left\{ -\frac{1}{2\sigma^2} (\beta - \mu^*)^T V^{*-1} (\beta - \mu^*) \right\} d\beta
\]
\[
= \frac{|V^*|^{1/2}}{(2\pi\sigma^2)^{n/2} |V|^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2} (y - X\mu)^T (I + XVX^T)^{-1} (y - X\mu) \right\} \tag{15}
\]
\[
= \frac{1}{(2\pi\sigma^2)^{n/2} |I + XVX^T|^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2} (y - X\mu)^T (I + XVX^T)^{-1} (y - X\mu) \right\} = N\!\left( X\mu, \sigma^2 (I + XVX^T) \right), \tag{16}
\]
where we have used the determinant identity
\[
|I_n + X V X^T| = |V| \left| V^{-1} + X^T X \right| = \frac{|V|}{|V^*|}
\]
to obtain the final expression.
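The determinant identity is likewise easy to verify numerically (our own illustration):

```python
import numpy as np

# Numerical check (our own sketch) of the determinant identity used above:
#   |I_n + X V X^T| = |V| |V^{-1} + X^T X| = |V| / |V*|.
rng = np.random.default_rng(5)
n, p = 6, 2
X = rng.standard_normal((n, p))
V = np.array([[1.5, 0.2], [0.2, 0.8]])

lhs = np.linalg.det(np.eye(n) + X @ V @ X.T)
mid = np.linalg.det(V) * np.linalg.det(np.linalg.inv(V) + X.T @ X)
V_star = np.linalg.inv(np.linalg.inv(V) + X.T @ X)
rhs = np.linalg.det(V) / np.linalg.det(V_star)
print(lhs, mid, rhs)   # all three agree
```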
Integrating out $\sigma^2$ then yields
\[
p(y) = \int p(y \mid \sigma^2)\, p(\sigma^2)\, d\sigma^2 = \int N\!\left( X\mu, \sigma^2 (I + XVX^T) \right) IG(a, b)\, d\sigma^2
\]
\[
= \int NIG\!\left( X\mu, (I + XVX^T), a, b \right) d\sigma^2 = MVSt_{2a}\!\left( X\mu, \frac{b}{a}(I + XVX^T) \right). \tag{17}
\]
Rewriting our result slightly differently reveals another useful property of the $NIG$ density:
\[
p(y) = \int\!\!\int N(X\beta, \sigma^2 I)\, NIG(\mu, V, a, b)\, d\beta\, d\sigma^2 = MVSt_{2a}\!\left( X\mu, \frac{b}{a}(I + XVX^T) \right). \tag{18}
\]
Of course, the computation of $p(y)$ could also be carried out in terms of the $NIG$ distribution parameters more directly as
\[
p(y) = \int\!\!\int p(y \mid \beta, \sigma^2)\, p(\beta, \sigma^2)\, d\beta\, d\sigma^2
= \frac{b^a}{(2\pi)^{(n+p)/2} |V|^{1/2} \Gamma(a)} \int\!\!\int \left( \frac{1}{\sigma^2} \right)^{a + \frac{n+p}{2} + 1} \exp\left\{ -\frac{1}{\sigma^2}\left[ b^* + \frac{1}{2}(\beta - \mu^*)^T V^{*-1} (\beta - \mu^*) \right] \right\} d\beta\, d\sigma^2
\]
\[
= \frac{b^a\, \Gamma(a^*)\, (2\pi)^{p/2}\, |V^*|^{1/2}}{(2\pi)^{(n+p)/2} |V|^{1/2}\, \Gamma(a)\, (b^*)^{a^*}}
= \frac{b^a\, \Gamma\!\left( a + \frac{n}{2} \right)}{(2\pi)^{n/2}\, \Gamma(a)} \sqrt{\frac{|V^*|}{|V|}} \left\{ b + \frac{1}{2}\left[ \mu^T V^{-1} \mu + y^T y - \mu^{*T} V^{*-1} \mu^* \right] \right\}^{-\left( a + \frac{n}{2} \right)}. \tag{19}
\]
An alternative and much easier way to derive $p(y \mid \sigma^2)$, avoiding any integration at all, is to note that we can write the above model as:
\[
\begin{aligned}
y &= X\beta + \epsilon_1, \quad \text{where } \epsilon_1 \sim N(0, \sigma^2 I); \\
\beta &= \mu + \epsilon_2, \quad \text{where } \epsilon_2 \sim N(0, \sigma^2 V),
\end{aligned}
\]
where $\epsilon_1$ and $\epsilon_2$ are independent of each other. It then follows that
\[
y = X\mu + X\epsilon_2 + \epsilon_1 \sim N\!\left( X\mu, \sigma^2 (I + XVX^T) \right).
\]
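This integration-free argument is easy to confirm by simulation: generating $y$ through the two-stage representation reproduces the stated mean and covariance. The sketch below is our own illustration:

```python
import numpy as np

# Simulation check (our own sketch): y = X(mu + eps2) + eps1 with independent
# eps1 ~ N(0, sigma^2 I) and eps2 ~ N(0, sigma^2 V) has mean X mu and
# covariance sigma^2 (I + X V X^T).
rng = np.random.default_rng(6)
n, p, sigma2 = 3, 2, 1.5
X = rng.standard_normal((n, p))
mu = np.array([0.5, -1.0])
V = np.array([[1.0, 0.3], [0.3, 2.0]])

n_draws = 200_000
Lv = np.linalg.cholesky(V)
eps2 = np.sqrt(sigma2) * (rng.standard_normal((n_draws, p)) @ Lv.T)
eps1 = np.sqrt(sigma2) * rng.standard_normal((n_draws, n))
y = (mu + eps2) @ X.T + eps1

target_cov = sigma2 * (np.eye(n) + X @ V @ X.T)
print(np.max(np.abs(np.cov(y.T) - target_cov)))
```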
This gives $p(y \mid \sigma^2)$. Next we integrate out $\sigma^2$ to obtain $p(y)$ as in the preceding section. In fact, the entire distribution theory for the Bayesian regression with $NIG$ priors could proceed by completely avoiding any integration. To be precise, we obtain this marginal distribution first and derive the posterior distribution:
\[
p(\beta, \sigma^2 \mid y) = \frac{p(\beta, \sigma^2)\, p(y \mid \beta, \sigma^2)}{p(y)} = \frac{NIG(\mu, V, a, b) \times N(X\beta, \sigma^2 I)}{MVSt_{2a}\!\left( X\mu, \frac{b}{a}(I + XVX^T) \right)},
\]
which indeed reduces (after some algebraic manipulation) to the $NIG(\mu^*, V^*, a^*, b^*)$ density.
Bayesian Predictions
Next consider Bayesian prediction in the context of the linear regression model. Suppose we now want to apply our regression analysis to a new set of data, where we have observed a new $m \times p$ matrix of regressors $\tilde{X}$, and we wish to predict the corresponding outcome $\tilde{y}$. Observe that if $\beta$ and $\sigma^2$ were known, then the probability law for the predicted outcomes would be described as $\tilde{y} \sim N(\tilde{X}\beta, \sigma^2 I_m)$ and would be independent of $y$. However, these parameters are not known; instead they are summarized through their posterior samples. Therefore, all predictions for the data must follow from the posterior predictive distribution:
\[
p(\tilde{y} \mid y) = \int\!\!\int p(\tilde{y} \mid \beta, \sigma^2)\, p(\beta, \sigma^2 \mid y)\, d\beta\, d\sigma^2
= \int\!\!\int N(\tilde{X}\beta, \sigma^2 I_m)\, NIG(\mu^*, V^*, a^*, b^*)\, d\beta\, d\sigma^2
\]
\[
= MVSt_{2a^*}\!\left( \tilde{X}\mu^*, \frac{b^*}{a^*}\left( I_m + \tilde{X} V^* \tilde{X}^T \right) \right), \tag{20}
\]
where the last step follows from (18). There are two sources of uncertainty in the posterior predictive distribution: (1) the fundamental source of variability in the model due to $\sigma^2$, unaccounted for by $\tilde{X}$, and (2) the posterior uncertainty in $\beta$ and $\sigma^2$ as a result of their estimation from a finite sample $y$. As the sample size $n \to \infty$ the variance due to posterior uncertainty disappears, but the predictive uncertainty remains.
Sampling from the posterior predictive distribution proceeds by composition: if $\{\beta^{(l)}\}_{l=1}^L$ and $\{\sigma^{2(l)}\}_{l=1}^L$ provide samples from the posterior $p(\beta, \sigma^2 \mid y)$, then each draw $\tilde{y}^{(l)} \sim N(\tilde{X}\beta^{(l)}, \sigma^{2(l)} I_m)$ is a sample from $p(\tilde{y} \mid y)$.
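This composition scheme is straightforward to implement; in the sketch below (the function name and parameter values are ours, not from the text), each posterior draw is pushed through the likelihood to produce a predictive draw:

```python
import numpy as np

# Sketch of composition sampling from the posterior predictive (our own code):
# sigma^2 ~ IG(a*, b*), then beta | sigma^2 ~ N(mu*, sigma^2 V*), then
# y_tilde | beta, sigma^2 ~ N(X_tilde beta, sigma^2 I_m).
def sample_predictive(X_tilde, mu_star, V_star, a_star, b_star, n_draws, rng):
    m, p = X_tilde.shape
    sigma2 = 1.0 / rng.gamma(shape=a_star, scale=1.0 / b_star, size=n_draws)  # IG draws
    L = np.linalg.cholesky(V_star)
    beta = mu_star + np.sqrt(sigma2)[:, None] * (rng.standard_normal((n_draws, p)) @ L.T)
    return beta @ X_tilde.T + np.sqrt(sigma2)[:, None] * rng.standard_normal((n_draws, m))

rng = np.random.default_rng(7)
mu_star = np.array([1.0, 2.0])           # hypothetical posterior parameters
V_star = 0.1 * np.eye(2)
a_star, b_star = 10.0, 8.0
X_tilde = np.array([[1.0, -1.0]])
draws = sample_predictive(X_tilde, mu_star, V_star, a_star, b_star, 200_000, rng)
print(draws.mean())   # close to X_tilde @ mu_star = -1
```

By (20), the draws should match the $MVSt_{2a^*}$ density exactly, not merely approximately, since the composition is the exact sampling analogue of the integral.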
Taking $V^{-1} \to 0$ (i.e. the null matrix), $a \to -p/2$ and $b \to 0$ leads to the improper prior $p(\beta, \sigma^2) \propto 1/\sigma^2$. The posterior distribution is $NIG(\mu^*, V^*, a^*, b^*)$ with
\[
\begin{aligned}
\mu^* &= (X^T X)^{-1} X^T y = \hat{\beta}, \\
V^* &= (X^T X)^{-1}, \\
a^* &= \frac{n - p}{2}, \\
b^* &= \frac{(n - p) s^2}{2}, \quad \text{where } s^2 = \frac{1}{n - p} (y - X\hat{\beta})^T (y - X\hat{\beta}) = \frac{1}{n - p}\, y^T (I - P_X) y,
\end{aligned}
\]
and $P_X = X(X^T X)^{-1} X^T$.
Here $\hat{\beta}$ is the classical least squares estimate (also the maximum likelihood estimate) of $\beta$, $s^2$ is the classical unbiased estimate of $\sigma^2$ and $P_X$ is the projection matrix onto the column space of $X$.
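The limiting correspondence with least squares can be illustrated by running the conjugate update with a very vague (nearly improper) prior; the sketch below (our own, with hypothetical hyperparameter values) compares $\mu^*$ with the OLS solution and $2b^*/(n-p)$ with $s^2$:

```python
import numpy as np

# Our own illustration of the improper-prior limit: with a very vague NIG
# prior (V^{-1} ~ 0, mu = 0, b ~ 0), mu* collapses to the OLS estimate and
# 2 b* / (n - p) collapses to s^2. All values here are hypothetical.
rng = np.random.default_rng(8)
n, p = 40, 3
X = rng.standard_normal((n, p))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.standard_normal(n)

V = 1e8 * np.eye(p)                     # nearly flat: V^{-1} ~ 0
b = 0.0
Vinv = np.linalg.inv(V)
V_star = np.linalg.inv(Vinv + X.T @ X)
mu_star = V_star @ (X.T @ y)            # prior mean mu = 0
b_star = b + 0.5 * (y @ y - mu_star @ (Vinv + X.T @ X) @ mu_star)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
s2 = (y - X @ beta_hat) @ (y - X @ beta_hat) / (n - p)
print(np.max(np.abs(mu_star - beta_hat)), abs(2 * b_star / (n - p) - s2))
```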
Plugging in the above values implied by the improper prior into the more general $NIG(\mu^*, V^*, a^*, b^*)$ density, we find the marginal posterior distribution of $\sigma^2$ is an $IG\!\left( \frac{n-p}{2}, \frac{(n-p)s^2}{2} \right)$ (equivalently, the posterior distribution of $(n-p)s^2/\sigma^2$ is a $\chi^2_{n-p}$ distribution) and the marginal posterior distribution of $\beta$ is a $MVSt_{n-p}(\hat{\beta}, s^2 (X^T X)^{-1})$ with density:
\[
MVSt_{n-p}\!\left( \hat{\beta}, s^2 (X^T X)^{-1} \right) = \frac{\Gamma\!\left( \frac{n}{2} \right)}{\Gamma\!\left( \frac{n-p}{2} \right) \left[ (n-p)\pi \right]^{p/2} \left| s^2 (X^T X)^{-1} \right|^{1/2}} \left[ 1 + \frac{(\beta - \hat{\beta})^T X^T X (\beta - \hat{\beta})}{(n-p) s^2} \right]^{-\frac{n}{2}}.
\]
Predictions with non-informative priors again follow by sampling from the posterior predictive distribution as earlier, but some additional insight is gained by considering analytical expressions for the expectation and variance of the posterior predictive distribution. Again, plugging in the parameter values implied by the improper priors into (20), we obtain the posterior predictive density as a
\[
MVSt_{n-p}\!\left( \tilde{X}\hat{\beta},\; s^2 \left( I + \tilde{X}(X^T X)^{-1} \tilde{X}^T \right) \right).
\]
Note that
\[
\begin{aligned}
E(\tilde{y} \mid \sigma^2, y) &= E\left[ E(\tilde{y} \mid \beta, \sigma^2, y) \mid \sigma^2, y \right] \\
&= E[\tilde{X}\beta \mid \sigma^2, y] \\
&= \tilde{X}(X^T X)^{-1} X^T y = \tilde{X}\hat{\beta},
\end{aligned}
\]
where the inner expectation averages over $p(\tilde{y} \mid \beta, \sigma^2)$ and the outer expectation averages with respect to $p(\beta \mid \sigma^2, y)$. Note that given $\sigma^2$, the future observations have a mean which does not depend on $\sigma^2$.
depend on 2 . In analogous fashion,
var(
y | 2 , y) = E[var(
y | , 2 , y) | 2 , y] + var[E(
y|, 2 , y)| 2 , y]
| 2 , y]
= E[ 2 Im ] + var[X
T ) 2 .
T X)1 X
= (Im + X(X
Thus, conditional on 2 , the posterior predictive variance has two components: 2 Im , representing
T X)1 X
T 2 , due to uncertainty about .
sampling variation, and X(X
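The two-component decomposition can be confirmed by simulation: conditional on $\sigma^2$, drawing $\beta$ from its posterior and then $\tilde{y}$ from the likelihood reproduces the stated predictive variance. The sketch below is our own illustration with arbitrary stand-in values:

```python
import numpy as np

# Our own simulation check of the decomposition: fixing sigma^2, draw
# beta ~ N(beta_hat, sigma^2 (X^T X)^{-1}) and then
# y_tilde ~ N(X_tilde beta, sigma^2 I_m); the draws should have variance
# sigma^2 (1 + X_tilde (X^T X)^{-1} X_tilde^T) for a single new point.
rng = np.random.default_rng(9)
n, p, sigma2 = 30, 2, 2.0
X = rng.standard_normal((n, p))
X_tilde = np.array([[1.0, 2.0]])
beta_hat = np.array([0.5, -0.5])        # stand-in for the OLS estimate

XtX_inv = np.linalg.inv(X.T @ X)
n_draws = 200_000
Lb = np.linalg.cholesky(sigma2 * XtX_inv)
beta = beta_hat + rng.standard_normal((n_draws, p)) @ Lb.T
y_tilde = beta @ X_tilde.T + np.sqrt(sigma2) * rng.standard_normal((n_draws, 1))

target = sigma2 * (1.0 + (X_tilde @ XtX_inv @ X_tilde.T)[0, 0])
print(y_tilde.var(), target)
```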