
Bayesian Linear Model: Gory Details

Pubh7440 Notes By Sudipto Banerjee


Let $y = [y_i]_{i=1}^n$ be an $n \times 1$ vector of independent observations on a dependent variable (or response) from $n$ experimental units. Associated with each $y_i$ is a $p \times 1$ vector of regressors, say $x_i$, leading to the linear regression model
\[
y = X\beta + \epsilon, \tag{1}
\]
where $X = [x_i^T]_{i=1}^n$ is the $n \times p$ matrix of regressors with $i$-th row $x_i^T$ and is assumed fixed, $\beta$ is the slope vector of regression coefficients, and $\epsilon = [\epsilon_i]_{i=1}^n$ is the vector of random variables representing pure error or measurement error in the dependent variable. For independent observations, we assume $\epsilon \sim MVN(0, \sigma^2 I_n)$, viz. that each component $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$. Furthermore, we will assume that the columns of the matrix $X$ are linearly independent so that the rank of $X$ is $p$.

The NIG conjugate prior family

A popular Bayesian model builds upon the linear regression of $y$ using conjugate priors by specifying
\[
\begin{aligned}
p(\beta, \sigma^2) &= p(\beta \mid \sigma^2)\, p(\sigma^2) = N(\mu, \sigma^2 V) \times IG(a, b) = NIG(\mu, V, a, b) \\
&= \frac{b^a}{(2\pi)^{p/2} |V|^{1/2} \Gamma(a)} \left(\frac{1}{\sigma^2}\right)^{a + p/2 + 1} \exp\left\{-\frac{1}{\sigma^2}\left[b + \frac{1}{2}(\beta - \mu)^T V^{-1} (\beta - \mu)\right]\right\}, \tag{2}
\end{aligned}
\]
where $\Gamma(\cdot)$ represents the Gamma function and the $IG(a, b)$ prior density for $\sigma^2$ is given by
\[
p(\sigma^2) = \frac{b^a}{\Gamma(a)} \left(\frac{1}{\sigma^2}\right)^{a+1} \exp\left(-\frac{b}{\sigma^2}\right), \quad \sigma^2 > 0,
\]
where $a, b > 0$. We call this the Normal-Inverse-Gamma ($NIG$) prior and denote it as $NIG(\mu, V, a, b)$.
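For concreteness, the joint density in (2) is easy to evaluate numerically. Below is a minimal Python/numpy sketch; the function name `nig_logpdf` and its argument layout are illustrative choices, not an established API.

```python
import numpy as np
from scipy.special import gammaln

def nig_logpdf(beta, sigma2, mu, V, a, b):
    """Log of the NIG(mu, V, a, b) joint density in (2), evaluated at (beta, sigma2)."""
    p = len(mu)
    resid = beta - mu
    quad = resid @ np.linalg.solve(V, resid)          # (beta - mu)^T V^{-1} (beta - mu)
    log_const = a * np.log(b) - gammaln(a) \
        - 0.5 * p * np.log(2 * np.pi) - 0.5 * np.linalg.slogdet(V)[1]
    return log_const - (a + p / 2 + 1) * np.log(sigma2) - (b + 0.5 * quad) / sigma2
```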
The $NIG$ probability distribution is a joint probability distribution of a vector $\beta$ and a scalar $\sigma^2$. If $(\beta, \sigma^2) \sim NIG(\mu, V, a, b)$, then an interesting analytic form results from integrating out $\sigma^2$ from the joint density:


\[
\begin{aligned}
\int NIG(\mu, V, a, b)\, d\sigma^2
&= \frac{b^a}{(2\pi)^{p/2} |V|^{1/2} \Gamma(a)} \int \left(\frac{1}{\sigma^2}\right)^{a + p/2 + 1} \exp\left\{-\frac{1}{\sigma^2}\left[b + \frac{1}{2}(\beta - \mu)^T V^{-1} (\beta - \mu)\right]\right\} d\sigma^2 \\
&= \frac{b^a\, \Gamma\!\left(a + \frac{p}{2}\right)}{(2\pi)^{p/2} |V|^{1/2} \Gamma(a)} \left[b + \frac{1}{2}(\beta - \mu)^T V^{-1} (\beta - \mu)\right]^{-\left(a + \frac{p}{2}\right)} \\
&= \frac{\Gamma\!\left(a + \frac{p}{2}\right)}{\Gamma(a)\, \pi^{p/2} \left|(2a)\frac{b}{a} V\right|^{1/2}} \left[1 + \frac{(\beta - \mu)^T \left(\frac{b}{a} V\right)^{-1} (\beta - \mu)}{2a}\right]^{-\frac{2a + p}{2}}.
\end{aligned}
\]

This is a multivariate t density:
\[
MVSt_\nu(\mu, \Sigma) = \frac{\Gamma\!\left(\frac{\nu + p}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)\, \nu^{p/2}\, \pi^{p/2}\, |\Sigma|^{1/2}} \left[1 + \frac{(\beta - \mu)^T \Sigma^{-1} (\beta - \mu)}{\nu}\right]^{-\frac{\nu + p}{2}}, \tag{3}
\]
with $\nu = 2a$ and $\Sigma = \frac{b}{a} V$.
The likelihood

The likelihood for the model is defined, up to proportionality, as the joint probability of observing
the data given the parameters. Since X is fixed, the likelihood is given by
\[
p(y \mid \beta, \sigma^2) = N(X\beta, \sigma^2 I) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left\{-\frac{1}{2\sigma^2}(y - X\beta)^T (y - X\beta)\right\}. \tag{4}
\]

The posterior distribution from the NIG prior

Inference will proceed from the posterior distribution
\[
p(\beta, \sigma^2 \mid y) = \frac{p(\beta, \sigma^2)\, p(y \mid \beta, \sigma^2)}{p(y)},
\]
where $p(y) = \int p(\beta, \sigma^2)\, p(y \mid \beta, \sigma^2)\, d\beta\, d\sigma^2$ is the marginal distribution of the data. The key to deriving the joint posterior distribution is the following easily verified multivariate completion of squares or ellipsoidal rectification identity:
\[
u^T A u - 2\alpha^T u = (u - A^{-1}\alpha)^T A (u - A^{-1}\alpha) - \alpha^T A^{-1} \alpha, \tag{5}
\]

where A is a symmetric positive definite (hence invertible) matrix. An application of this identity
immediately reveals,





\[
b + \frac{1}{2}\left[(\beta - \mu)^T V^{-1} (\beta - \mu) + (y - X\beta)^T (y - X\beta)\right] = b^{*} + \frac{1}{2}(\beta - \mu^{*})^T V^{*-1} (\beta - \mu^{*}),
\]
using which we can write the posterior as



\[
p(\beta, \sigma^2 \mid y) \propto \left(\frac{1}{\sigma^2}\right)^{a + (n+p)/2 + 1} \exp\left\{-\frac{1}{\sigma^2}\left[b^{*} + \frac{1}{2}(\beta - \mu^{*})^T V^{*-1} (\beta - \mu^{*})\right]\right\}, \tag{6}
\]

where
\[
\begin{aligned}
\mu^{*} &= (V^{-1} + X^T X)^{-1}(V^{-1}\mu + X^T y), \\
V^{*} &= (V^{-1} + X^T X)^{-1}, \\
a^{*} &= a + n/2, \\
b^{*} &= b + \frac{1}{2}\left[\mu^T V^{-1}\mu + y^T y - \mu^{*T} V^{*-1}\mu^{*}\right].
\end{aligned}
\]
This posterior distribution is easily identified as a $NIG(\mu^{*}, V^{*}, a^{*}, b^{*})$, proving it to be a conjugate family for the linear regression model.
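The four posterior parameters above translate directly into a few lines of numpy. This is a sketch; the function name and the explicit matrix inverses (fine for small $p$) are illustrative choices.

```python
import numpy as np

def nig_posterior(y, X, mu, V, a, b):
    """Return (mu_star, V_star, a_star, b_star) of the conjugate NIG posterior."""
    n, p = X.shape
    V_inv = np.linalg.inv(V)
    precision = V_inv + X.T @ X                 # V*^{-1}
    V_star = np.linalg.inv(precision)
    mu_star = V_star @ (V_inv @ mu + X.T @ y)
    a_star = a + n / 2
    b_star = b + 0.5 * (mu @ V_inv @ mu + y @ y - mu_star @ precision @ mu_star)
    return mu_star, V_star, a_star, b_star
```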
Note that the marginal posterior distribution of $\sigma^2$ is immediately seen to be an $IG(a^{*}, b^{*})$ whose density is given by:
\[
p(\sigma^2 \mid y) = \frac{b^{*a^{*}}}{\Gamma(a^{*})} \left(\frac{1}{\sigma^2}\right)^{a^{*}+1} \exp\left(-\frac{b^{*}}{\sigma^2}\right). \tag{7}
\]

The marginal posterior distribution of $\beta$ is obtained by integrating out $\sigma^2$ from the $NIG$ joint posterior as follows:
\[
\begin{aligned}
p(\beta \mid y) = \int p(\beta, \sigma^2 \mid y)\, d\sigma^2 &= \int NIG(\mu^{*}, V^{*}, a^{*}, b^{*})\, d\sigma^2 \\
&\propto \int \left(\frac{1}{\sigma^2}\right)^{a^{*} + p/2 + 1} \exp\left\{-\frac{1}{\sigma^2}\left[b^{*} + \frac{1}{2}(\beta - \mu^{*})^T V^{*-1}(\beta - \mu^{*})\right]\right\} d\sigma^2 \\
&\propto \left[1 + \frac{(\beta - \mu^{*})^T V^{*-1}(\beta - \mu^{*})}{2b^{*}}\right]^{-(a^{*} + p/2)}.
\end{aligned}
\]

This is a multivariate t density:
\[
MVSt_{\nu^{*}}(\mu^{*}, \Sigma^{*}) = \frac{\Gamma\!\left(\frac{\nu^{*} + p}{2}\right)}{\Gamma\!\left(\frac{\nu^{*}}{2}\right)\, \nu^{*p/2}\, \pi^{p/2}\, |\Sigma^{*}|^{1/2}} \left[1 + \frac{(\beta - \mu^{*})^T \Sigma^{*-1}(\beta - \mu^{*})}{\nu^{*}}\right]^{-\frac{\nu^{*} + p}{2}}, \tag{8}
\]
with $\nu^{*} = 2a^{*}$ and $\Sigma^{*} = \frac{b^{*}}{a^{*}} V^{*}$.

A useful expression for the NIG scale parameter

Here we will prove:
\[
b^{*} = b + \frac{1}{2}\left(y - X\mu\right)^T \left(I + X V X^T\right)^{-1}\left(y - X\mu\right). \tag{9}
\]

On account of the expression for $b^{*}$ derived in the preceding section, it suffices to prove that
\[
y^T y + \mu^T V^{-1}\mu - \mu^{*T} V^{*-1}\mu^{*} = (y - X\mu)^T \left(I + X V X^T\right)^{-1}(y - X\mu).
\]

Substituting $\mu^{*} = V^{*}(V^{-1}\mu + X^T y)$ in the left hand side above we obtain:
\[
\begin{aligned}
y^T y + \mu^T V^{-1}\mu - \mu^{*T} V^{*-1}\mu^{*} &= y^T y + \mu^T V^{-1}\mu - (V^{-1}\mu + X^T y)^T V^{*}(V^{-1}\mu + X^T y) \\
&= y^T\left(I - X V^{*} X^T\right)y - 2 y^T X V^{*} V^{-1}\mu + \mu^T\left(V^{-1} - V^{-1} V^{*} V^{-1}\right)\mu. \tag{10}
\end{aligned}
\]
Further development of the proof will employ two tricky identities. The first is the well-known Sherman-Woodbury-Morrison identity in matrix algebra:
\[
(A + BDC)^{-1} = A^{-1} - A^{-1}B\left(D^{-1} + C A^{-1} B\right)^{-1} C A^{-1}, \tag{11}
\]
where $A$ and $D$ are square matrices that are invertible and $B$ and $C$ are rectangular (square if $A$ and $D$ have the same dimensions) matrices such that the multiplications are well-defined. This identity is easily verified by multiplying the right hand side with $A + BDC$ and simplifying to reduce it to the identity matrix.
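The identity is also easy to spot-check numerically on random invertible matrices; the dimensions below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = 5, 3
A = np.diag(rng.uniform(1.0, 2.0, p))           # invertible p x p
D = np.diag(rng.uniform(1.0, 2.0, q))           # invertible q x q
B = rng.standard_normal((p, q))
C = rng.standard_normal((q, p))

A_inv = np.linalg.inv(A)
lhs = np.linalg.inv(A + B @ D @ C)
rhs = A_inv - A_inv @ B @ np.linalg.inv(np.linalg.inv(D) + C @ A_inv @ B) @ C @ A_inv
print(np.allclose(lhs, rhs))                    # True
```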
Applying (11) twice, once with $A = V$ and $D = (X^T X)^{-1}$ to get the second equality and then with $A = (X^T X)^{-1}$ and $D = V$ to get the third equality, we have
\[
\begin{aligned}
V^{-1} - V^{-1} V^{*} V^{-1} &= V^{-1} - V^{-1}\left(V^{-1} + X^T X\right)^{-1} V^{-1} \\
&= \left[V + (X^T X)^{-1}\right]^{-1} \\
&= X^T X - X^T X\left(X^T X + V^{-1}\right)^{-1} X^T X \\
&= X^T\left(I_n - X V^{*} X^T\right)X. \tag{12}
\end{aligned}
\]

The next identity notes that since $V^{*}\left(V^{-1} + X^T X\right) = I_p$, we have $V^{*} V^{-1} = I_p - V^{*} X^T X$, so that
\[
X V^{*} V^{-1} = X - X V^{*} X^T X = \left(I_n - X V^{*} X^T\right)X. \tag{13}
\]

Substituting (12) and (13) in (10) we obtain
\[
\begin{aligned}
& y^T\left(I_n - X V^{*} X^T\right)y - 2 y^T\left(I_n - X V^{*} X^T\right)X\mu + \mu^T X^T\left(I_n - X V^{*} X^T\right)X\mu \\
&\qquad = (y - X\mu)^T\left(I_n - X V^{*} X^T\right)(y - X\mu) \\
&\qquad = (y - X\mu)^T\left(I_n + X V X^T\right)^{-1}(y - X\mu), \tag{14}
\end{aligned}
\]
where the last step is again a consequence of (11):
\[
\left(I_n + X V X^T\right)^{-1} = I_n - X\left(V^{-1} + X^T X\right)^{-1} X^T = I_n - X V^{*} X^T.
\]
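Putting (9) to work numerically is a useful sanity check: the two expressions for $b^{*}$ should agree. A small random-data sketch, with arbitrary prior settings:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
mu, V, a, b = np.zeros(p), np.eye(p), 2.0, 1.0

V_inv = np.linalg.inv(V)
precision = V_inv + X.T @ X
V_star = np.linalg.inv(precision)
mu_star = V_star @ (V_inv @ mu + X.T @ y)
b_star = b + 0.5 * (mu @ V_inv @ mu + y @ y - mu_star @ precision @ mu_star)

# Equation (9): the same quantity via the n x n matrix I + X V X^T
r = y - X @ mu
b_star_alt = b + 0.5 * r @ np.linalg.solve(np.eye(n) + X @ V @ X.T, r)
print(np.isclose(b_star, b_star_alt))           # True
```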

Marginal distributions the hard way

To obtain the marginal distribution of $y$, we first compute the distribution $p(y \mid \sigma^2)$ by integrating out $\beta$ and subsequently integrate out $\sigma^2$ to obtain $p(y)$. To be precise, we use the expression for $b^{*}$ derived in the preceding section, proceeding as below:
\[
\begin{aligned}
p(y \mid \sigma^2) &= \int p(y \mid \beta, \sigma^2)\, p(\beta \mid \sigma^2)\, d\beta = \int N(X\beta, \sigma^2 I_n) \times N(\mu, \sigma^2 V)\, d\beta \\
&= \frac{1}{(2\pi\sigma^2)^{\frac{n+p}{2}} |V|^{1/2}} \int \exp\left\{-\frac{1}{2\sigma^2}\left[(y - X\beta)^T(y - X\beta) + (\beta - \mu)^T V^{-1}(\beta - \mu)\right]\right\} d\beta \\
&= \frac{1}{(2\pi\sigma^2)^{\frac{n+p}{2}} |V|^{1/2}} \int \exp\left\{-\frac{1}{2\sigma^2}\left[(y - X\mu)^T\left(I + X V X^T\right)^{-1}(y - X\mu) + (\beta - \mu^{*})^T V^{*-1}(\beta - \mu^{*})\right]\right\} d\beta \\
&= \frac{1}{(2\pi\sigma^2)^{\frac{n+p}{2}} |V|^{1/2}} \exp\left\{-\frac{1}{2\sigma^2}(y - X\mu)^T\left(I + X V X^T\right)^{-1}(y - X\mu)\right\} \int \exp\left\{-\frac{1}{2\sigma^2}(\beta - \mu^{*})^T V^{*-1}(\beta - \mu^{*})\right\} d\beta \\
&= \left(\frac{|V^{*}|}{|V|}\right)^{1/2} \frac{1}{(2\pi\sigma^2)^{\frac{n}{2}}} \exp\left\{-\frac{1}{2\sigma^2}(y - X\mu)^T\left(I + X V X^T\right)^{-1}(y - X\mu)\right\} \\
&= \frac{1}{(2\pi\sigma^2)^{\frac{n}{2}} \left|I + X V X^T\right|^{1/2}} \exp\left\{-\frac{1}{2\sigma^2}(y - X\mu)^T\left(I + X V X^T\right)^{-1}(y - X\mu)\right\} \\
&= N\!\left(X\mu,\; \sigma^2\left(I + X V X^T\right)\right). \tag{15}
\end{aligned}
\]

Here we have applied the matrix identity
\[
|A + BDC| = |A|\,|D|\,\left|D^{-1} + C A^{-1} B\right| \tag{16}
\]
to obtain
\[
\left|I_n + X V X^T\right| = |V|\left|V^{-1} + X^T X\right| = \frac{|V|}{|V^{*}|}.
\]

Now, the marginal distribution $p(y)$ is obtained by integrating a $NIG$ density as follows:
\[
p(y) = \int p(y \mid \sigma^2)\, p(\sigma^2)\, d\sigma^2 = \int N\!\left(X\mu, \sigma^2(I + X V X^T)\right) IG(a, b)\, d\sigma^2
= \int NIG\!\left(X\mu, (I + X V X^T), a, b\right) d\sigma^2 = MVSt_{2a}\!\left(X\mu, \frac{b}{a}(I + X V X^T)\right). \tag{17}
\]

Rewriting our result slightly differently reveals another useful property of the $NIG$ density:
\[
p(y) = \int p(y \mid \beta, \sigma^2)\, p(\beta, \sigma^2)\, d\beta\, d\sigma^2 = \int N(X\beta, \sigma^2 I_n) \times NIG(\mu, V, a, b)\, d\beta\, d\sigma^2 = MVSt_{2a}\!\left(X\mu, \frac{b}{a}(I + X V X^T)\right). \tag{18}
\]

Of course, the computation of $p(y)$ could also be carried out in terms of the $NIG$ distribution parameters more directly as
\[
\begin{aligned}
p(y) &= \int p(y \mid \beta, \sigma^2)\, p(\beta, \sigma^2)\, d\beta\, d\sigma^2 = \int N(X\beta, \sigma^2 I_n) \times NIG(\mu, V, a, b)\, d\beta\, d\sigma^2 \\
&= \frac{b^a}{(2\pi)^{(n+p)/2} |V|^{1/2} \Gamma(a)} \int \left(\frac{1}{\sigma^2}\right)^{a^{*} + p/2 + 1} \exp\left\{-\frac{1}{\sigma^2}\left[b^{*} + \frac{1}{2}(\beta - \mu^{*})^T V^{*-1}(\beta - \mu^{*})\right]\right\} d\beta\, d\sigma^2 \\
&= \frac{b^a}{(2\pi)^{(n+p)/2} |V|^{1/2} \Gamma(a)} \cdot \frac{\Gamma(a^{*})\, (2\pi)^{p/2}\, |V^{*}|^{1/2}}{(b^{*})^{a^{*}}} \\
&= \frac{b^a\, \Gamma\!\left(a + \frac{n}{2}\right)}{(2\pi)^{n/2}\, \Gamma(a)} \sqrt{\frac{|V^{*}|}{|V|}} \left\{b + \frac{1}{2}\left[\mu^T V^{-1}\mu + y^T y - \mu^{*T} V^{*-1}\mu^{*}\right]\right\}^{-\left(a + \frac{n}{2}\right)}. \tag{19}
\end{aligned}
\]
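In code, (17)-(19) give the marginal likelihood of $y$ in closed form; the multivariate-t form (17) is usually the most convenient. A minimal sketch using scipy's multivariate_t distribution (available in scipy 1.6 and later); the function name is an illustrative choice:

```python
import numpy as np
from scipy.stats import multivariate_t

def log_marginal_likelihood(y, X, mu, V, a, b):
    """log p(y) under the NIG(mu, V, a, b) prior, via the MVSt form in (17)."""
    n = len(y)
    scale = (b / a) * (np.eye(n) + X @ V @ X.T)
    return multivariate_t.logpdf(y, loc=X @ mu, shape=scale, df=2 * a)
```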

Marginal distribution: the easy way

An alternative and much easier way to derive $p(y \mid \sigma^2)$, avoiding any integration at all, is to note that we can write the above model as:
\[
\begin{aligned}
y &= X\beta + \epsilon_1, \quad \text{where } \epsilon_1 \sim N(0, \sigma^2 I); \\
\beta &= \mu + \epsilon_2, \quad \text{where } \epsilon_2 \sim N(0, \sigma^2 V),
\end{aligned}
\]
where $\epsilon_1$ and $\epsilon_2$ are independent of each other. It then follows that
\[
y = X\mu + X\epsilon_2 + \epsilon_1 \sim N\!\left(X\mu, \sigma^2(I + X V X^T)\right).
\]
This gives $p(y \mid \sigma^2)$. Next we integrate out $\sigma^2$ to obtain $p(y)$ as in the preceding section.
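The hierarchical representation above also suggests a direct simulation check: generate $\epsilon_2$ and $\epsilon_1$, form $y$, and compare the empirical covariance with $\sigma^2(I + X V X^T)$. A small sketch with arbitrary settings:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma2 = 6, 2, 0.5
X = rng.standard_normal((n, p))
mu, V = np.zeros(p), np.eye(p)

L = 200_000
eps2 = rng.multivariate_normal(np.zeros(p), sigma2 * V, size=L)       # beta = mu + eps2
eps1 = np.sqrt(sigma2) * rng.standard_normal((L, n))                  # pure error
ysim = X @ mu + eps2 @ X.T + eps1

print(np.cov(ysim, rowvar=False))                # empirical covariance of y given sigma^2
print(sigma2 * (np.eye(n) + X @ V @ X.T))        # analytic covariance from the display above
```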
In fact, the entire distribution theory for the Bayesian regression with $NIG$ priors could proceed by completely avoiding any integration. To be precise, we obtain this marginal distribution first and derive the posterior distribution:
\[
p(\beta, \sigma^2 \mid y) = \frac{p(\beta, \sigma^2)\, p(y \mid \beta, \sigma^2)}{p(y)} = \frac{NIG(\mu, V, a, b) \times N(X\beta, \sigma^2 I)}{MVSt_{2a}\!\left(X\mu, \frac{b}{a}(I + X V X^T)\right)},
\]
which indeed reduces (after some algebraic manipulation) to the $NIG(\mu^{*}, V^{*}, a^{*}, b^{*})$ density.

Bayesian Predictions

Next consider Bayesian prediction in the context of the linear regression model. Suppose we now want to apply our regression analysis to a new set of data, where we have observed a new $m \times p$ matrix of regressors $\tilde{X}$, and we wish to predict the corresponding outcome $\tilde{y}$. Observe that if $\beta$ and $\sigma^2$ were known, then the probability law for the predicted outcomes would be described as $\tilde{y} \sim N(\tilde{X}\beta, \sigma^2 I_m)$ and would be independent of $y$. However, these parameters are not known; instead they are summarized through their posterior samples. Therefore, all predictions for the data must follow from the posterior predictive distribution:
\[
\begin{aligned}
p(\tilde{y} \mid y) &= \int p(\tilde{y} \mid \beta, \sigma^2)\, p(\beta, \sigma^2 \mid y)\, d\beta\, d\sigma^2 \\
&= \int N(\tilde{X}\beta, \sigma^2 I_m) \times NIG(\mu^{*}, V^{*}, a^{*}, b^{*})\, d\beta\, d\sigma^2 \\
&= MVSt_{2a^{*}}\!\left(\tilde{X}\mu^{*}, \frac{b^{*}}{a^{*}}\left(I + \tilde{X} V^{*} \tilde{X}^T\right)\right), \tag{20}
\end{aligned}
\]
where the last step follows from (18). There are two sources of uncertainty in the posterior predictive distribution: (1) the fundamental source of variability in the model due to $\sigma^2$, unaccounted for by $\tilde{X}\beta$, and (2) the posterior uncertainty in $\beta$ and $\sigma^2$ as a result of their estimation from a finite sample $y$. As the sample size $n \to \infty$ the variance due to posterior uncertainty disappears, but the predictive uncertainty remains.

Posterior and posterior predictive sampling

Sampling from the $NIG$ posterior distribution is straightforward: for each $l = 1, \ldots, L$, we sample $\sigma^{2(l)} \sim IG(a + n/2, b^{*})$ and $\beta^{(l)} \sim MVN(\mu^{*}, \sigma^{2(l)} V^{*})$. The resulting $\{\beta^{(l)}, \sigma^{2(l)}\}_{l=1}^{L}$ provide samples from the joint distribution $p(\beta, \sigma^2 \mid y)$, while $\{\beta^{(l)}\}_{l=1}^{L}$ and $\{\sigma^{2(l)}\}_{l=1}^{L}$ provide samples from the marginal posterior distributions $p(\beta \mid y)$ and $p(\sigma^2 \mid y)$ respectively.


Predictions are carried out by sampling from the posterior predictive density (20). Sampling from this is easy: for each posterior sample $(\beta^{(l)}, \sigma^{2(l)})$, we draw $\tilde{y}^{(l)} \sim N(\tilde{X}\beta^{(l)}, \sigma^{2(l)} I_m)$. The resulting $\{\tilde{y}^{(l)}\}_{l=1}^{L}$ are samples from the desired posterior predictive distribution in (20); the mean and variance of this sample provide estimates of the predictive mean and variance respectively.
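The sampling recipe above is a few lines of numpy. The sketch below assumes the posterior parameters have already been computed as in the earlier section; the helper name and argument order are illustrative.

```python
import numpy as np

def sample_nig_posterior(mu_star, V_star, a_star, b_star, X_tilde, L=5000, rng=None):
    """Draw L samples of (beta, sigma2) from NIG(mu*, V*, a*, b*) and ytilde from (20)."""
    rng = np.random.default_rng() if rng is None else rng
    p, m = len(mu_star), X_tilde.shape[0]
    sigma2 = 1.0 / rng.gamma(shape=a_star, scale=1.0 / b_star, size=L)   # IG(a*, b*)
    z = rng.multivariate_normal(np.zeros(p), V_star, size=L)
    beta = mu_star + np.sqrt(sigma2)[:, None] * z                        # MVN(mu*, sigma2 V*)
    y_tilde = beta @ X_tilde.T + np.sqrt(sigma2)[:, None] * rng.standard_normal((L, m))
    return beta, sigma2, y_tilde

# Posterior predictive summaries, e.g.:
# beta, sigma2, y_tilde = sample_nig_posterior(mu_star, V_star, a_star, b_star, X_tilde)
# y_tilde.mean(axis=0), y_tilde.var(axis=0)
```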

The posterior distribution from improper priors

Taking $V^{-1} \to 0$ (i.e. the null matrix), $a \to -p/2$ and $b \to 0$ leads to the improper prior $p(\beta, \sigma^2) \propto 1/\sigma^2$. The posterior distribution is $NIG(\mu^{*}, V^{*}, a^{*}, b^{*})$ with
\[
\begin{aligned}
\mu^{*} &= (X^T X)^{-1} X^T y = \hat{\beta}, \\
V^{*} &= (X^T X)^{-1}, \\
a^{*} &= \frac{n - p}{2}, \\
b^{*} &= \frac{(n - p)s^2}{2}, \quad \text{where } s^2 = \frac{1}{n - p}(y - X\hat{\beta})^T(y - X\hat{\beta}) = \frac{1}{n - p}\, y^T(I - P_X)y \text{ and } P_X = X(X^T X)^{-1} X^T.
\end{aligned}
\]
Here $\hat{\beta}$ is the classical least squares estimate (also the maximum likelihood estimate) of $\beta$, $s^2$ is the classical unbiased estimate of $\sigma^2$, and $P_X$ is the projection matrix onto the column space of $X$.
Plugging in the above values implied by the improper prior into the more general $NIG(\mu^{*}, V^{*}, a^{*}, b^{*})$ density, we find that the marginal posterior distribution of $\sigma^2$ is an $IG\!\left(\frac{n-p}{2}, \frac{(n-p)s^2}{2}\right)$ (equivalently, the posterior distribution of $(n-p)s^2/\sigma^2$ is a $\chi^2_{n-p}$ distribution) and the marginal posterior distribution of $\beta$ is a $MVSt_{n-p}\!\left(\hat{\beta}, s^2(X^T X)^{-1}\right)$ with density:
\[
MVSt_{n-p}\!\left(\hat{\beta}, s^2 (X^T X)^{-1}\right) = \frac{\Gamma\!\left(\frac{n}{2}\right)}{\Gamma\!\left(\frac{n-p}{2}\right)\, \pi^{p/2} \left|(n-p)s^2 (X^T X)^{-1}\right|^{1/2}} \left[1 + \frac{(\beta - \hat{\beta})^T X^T X (\beta - \hat{\beta})}{(n-p)s^2}\right]^{-\frac{n}{2}}.
\]
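Numerically, the improper-prior posterior reduces to familiar least-squares quantities; the sketch below simply computes them on simulated data (the data-generating coefficients are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 3
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -0.5, 2.0]) + rng.standard_normal(n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)              # classical least-squares / ML estimate
s2 = (y - X @ beta_hat) @ (y - X @ beta_hat) / (n - p)    # unbiased estimate of sigma^2

# Posterior parameters implied by p(beta, sigma^2) proportional to 1/sigma^2
mu_star = beta_hat
V_star = np.linalg.inv(X.T @ X)
a_star = (n - p) / 2
b_star = (n - p) * s2 / 2
```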

Predictions with non-informative priors again follow by sampling from the posterior predictive distribution as earlier, but some additional insight is gained by considering analytical expressions for the expectation and variance of the posterior predictive distribution. Again, plugging the parameter values implied by the improper prior into (20), we obtain the posterior predictive density as a
\[
MVSt_{n-p}\!\left(\tilde{X}\hat{\beta},\; s^2\left(I + \tilde{X}(X^T X)^{-1}\tilde{X}^T\right)\right).
\]
Note that
\[
\begin{aligned}
E(\tilde{y} \mid \sigma^2, y) &= E\left[E(\tilde{y} \mid \beta, \sigma^2, y) \mid \sigma^2, y\right] \\
&= E\left[\tilde{X}\beta \mid \sigma^2, y\right] \\
&= \tilde{X}(X^T X)^{-1} X^T y = \tilde{X}\hat{\beta},
\end{aligned}
\]
where the inner expectation averages over $p(\tilde{y} \mid \beta, \sigma^2)$ and the outer expectation averages with respect to $p(\beta \mid \sigma^2, y)$. Note that given $\sigma^2$, the future observations have a mean which does not depend on $\sigma^2$. In analogous fashion,
\[
\begin{aligned}
\operatorname{var}(\tilde{y} \mid \sigma^2, y) &= E\left[\operatorname{var}(\tilde{y} \mid \beta, \sigma^2, y) \mid \sigma^2, y\right] + \operatorname{var}\left[E(\tilde{y} \mid \beta, \sigma^2, y) \mid \sigma^2, y\right] \\
&= E\left[\sigma^2 I_m \mid \sigma^2, y\right] + \operatorname{var}\left[\tilde{X}\beta \mid \sigma^2, y\right] \\
&= \left(I_m + \tilde{X}(X^T X)^{-1}\tilde{X}^T\right)\sigma^2.
\end{aligned}
\]
Thus, conditional on $\sigma^2$, the posterior predictive variance has two components: $\sigma^2 I_m$, representing sampling variation, and $\tilde{X}(X^T X)^{-1}\tilde{X}^T \sigma^2$, due to uncertainty about $\beta$.
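Conditional on $\sigma^2$, both pieces of this decomposition can be recovered by simulation: draw $\beta$ from $p(\beta \mid \sigma^2, y)$, then $\tilde{y}$ from $p(\tilde{y} \mid \beta, \sigma^2)$, and compare the empirical covariance of $\tilde{y}$ with $\left(I_m + \tilde{X}(X^T X)^{-1}\tilde{X}^T\right)\sigma^2$. A rough Monte Carlo sketch with arbitrary settings:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, m, sigma2 = 40, 3, 4, 0.7
X = rng.standard_normal((n, p))
X_tilde = rng.standard_normal((m, p))
y = X @ np.array([0.5, -1.0, 2.0]) + np.sqrt(sigma2) * rng.standard_normal(n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

L = 200_000
beta = rng.multivariate_normal(beta_hat, sigma2 * XtX_inv, size=L)          # p(beta | sigma2, y)
y_tilde = beta @ X_tilde.T + np.sqrt(sigma2) * rng.standard_normal((L, m))  # p(ytilde | beta, sigma2)

print(np.cov(y_tilde, rowvar=False))                                        # empirical covariance
print(sigma2 * (np.eye(m) + X_tilde @ XtX_inv @ X_tilde.T))                 # analytic covariance
```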
