
Applied Bayesian Inference

Prof. Dr. Renate Meyer¹,²

¹ Institute for Stochastics, Karlsruhe Institute of Technology, Germany
² Department of Statistics, University of Auckland, New Zealand

KIT, Winter Semester 2010/2011

1 Introduction
1.1 Course Overview

Overview: Applied Bayesian Inference (Parts A and B)

Bayes theorem, discrete and continuous
Conjugate examples: Binomial, Exponential, Poisson, Normal, Exponential Family
Specification of prior distributions
Likelihood Principle
Multivariate and hierarchical models
Introduction to R
Introduction to WinBUGS
Techniques for posterior computation:
  Normal approximation
  Non-iterative simulation
  Simulation-based posterior computation: Markov Chain Monte Carlo
Convergence diagnostics with CODA
Basic model checking with WinBUGS
Bayes Factors, model checking and determination
Regression, ANOVA, GLM, hierarchical models, survival analysis, state-space models for time series, copulas
Decision-theoretic foundations of Bayesian inference

Computing

R: mostly covered in class
WinBUGS: completely covered in class
Other software: at your own risk

1.2 Why Bayesian Inference?

Why Bayesian Inference?
Or: what is wrong with standard statistical inference?

The two mainstays of standard/classical statistical inference are confidence intervals and hypothesis tests.
Anything wrong with them?

Example: Newcomb's Speed of Light

Example 1.1
Light travels fast, but it is not transmitted instantaneously. Light takes over a second to reach us from the moon and over 10 billion years to reach us from the most distant objects yet observed in the expanding universe. Because radio and radar also travel at the speed of light, an accurate value for that speed is important in communicating with astronauts and orbiting satellites. An accurate value for the speed of light is also important to computer designers because electrical signals travel only at light speed.
The first reasonably accurate measurements of the speed of light were made by Simon Newcomb between July and September 1882. He measured the time in seconds that a light signal took to pass from his laboratory on the Potomac River to a mirror at the base of the Washington Monument and back, a total distance of 7400 m. His first measurement was 24.828 millionths of a second.

Newcomb's Speed of Light: CI

Let us assume that the individual measurements Xi ~ N(μ, σ²) with known measurement variance σ² = 0.005². We want to find a 95% confidence interval for μ.
Answer:

x̄ ± 1.96 σ/√n

Because (X̄ − μ)/(σ/√n) ~ N(0, 1):

P(−1.96 < (X̄ − μ)/(σ/√n) < 1.96) = 0.95
P(X̄ − 1.96 σ/√n < μ < X̄ + 1.96 σ/√n) = 0.95

For the observed data this gives

P(24.8182 < μ < 24.8378) = 0.95.

This means that μ is in this interval with 95% probability. Certainly NOT!

Newcomb's Speed of Light: CI
The Level of Confidence

After collecting the data and computing the CI, this interval either contains the true mean or it does not. Its coverage probability is not 0.95 but either 0 or 1.
Then where does our 95% confidence come from?
Let us do an experiment:

draw 1000 samples of size 10 each from N(24.828, 0.005²)
for each sample calculate the 95% CI
check whether the true μ = 24.828 is inside or outside the CI

Newcomb's Speed of Light: Simulation

Sample    Coverage to date
1st       100%
2nd       100%
3rd       100%
4th       100%
5th       100%
6th       100%
7th       100%
8th       100%
9th       88.9%
10th      90.0%
...       ...
100th     94.0%
...       ...
991st     95.2%
...       ...
1000th    95.2%

Figure 1: Coverage over repeated sampling (true mean μ = 24.828).
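The repeated-sampling experiment behind Figure 1 is easy to reproduce in R. The following sketch is illustrative and not from the original slides; the seed is an arbitrary choice. It draws 1000 samples of size 10 from N(24.828, 0.005²), computes the usual z-based 95% CI for each, and reports the long-run coverage.

set.seed(1)                      # illustrative seed, not from the slides
mu    <- 24.828                  # true mean
sigma <- 0.005                   # known standard deviation
n     <- 10                      # sample size per replication
reps  <- 1000                    # number of simulated samples

covered <- replicate(reps, {
  x     <- rnorm(n, mean = mu, sd = sigma)
  lower <- mean(x) - 1.96 * sigma / sqrt(n)
  upper <- mean(x) + 1.96 * sigma / sqrt(n)
  (lower < mu) & (mu < upper)    # does this CI contain the true mean?
})

mean(covered)                    # long-run coverage, close to 0.95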

Newcomb's Speed of Light: CI

952 of the 1000 CIs include the true mean.
48 of the 1000 CIs do not include the true mean.
In reality, we don't know the true mean.
We do not sample repeatedly; we only take one sample and calculate one CI.
Will this CI contain the true value?
It either will or will not, but we do not know.
We take comfort in the fact that the method works 95% of the time in the long run, i.e. the method produces a CI that contains the unknown mean 95% of the time that the method is used in the long run.

Newcomb's Speed of Light: CI

By contrast, Bayesian confidence intervals, known as credible intervals, do not require this awkward frequentist interpretation.
One can make the more natural and direct statement concerning the probability of the unknown parameter falling in this interval.
One needs to provide additional structure to make this interpretation possible.

Newcomb's Speed of Light: Hypothesis Test

H0: μ ≤ μ0 (= 24.828)   versus   H1: μ > μ0

Test statistic:

U = (X̄ − μ0)/(σ/√n) ~ N(0, 1)   if μ = μ0

Small values of u_obs are consistent with H0, large values favour H1.

P-value:

p = P(U > u_obs | μ = μ0) = 1 − Φ(u_obs)

If the P-value < 0.05 (= usual type I error rate), reject H0.

The P-value is the probability that H0 is true. Certainly NOT.

Newcomb's Speed of Light: Hypothesis Test

The P-value is the probability of observing a value of the test statistic that is more extreme than the actually observed value u_obs if the null hypothesis were true (under repeated sampling).

We can do another thought experiment:

imagine we take 1000 samples of size 10 from a Normal distribution with mean μ0
we calculate the P-value for each sample
it will only be smaller than 0.05 in about 5% of the samples, i.e. in about 50 samples
we take comfort in the fact that this test works 95% of the time in the long run, i.e. it rejects H0, even though H0 is true, in only 5% of the cases that this method is used

Newcomb's Speed of Light: Hypothesis Test

Most practitioners are tempted to say that the P-value is the probability that H0 is true.
It can only offer evidence against the null hypothesis. A large P-value does not offer evidence that H0 is true.
A P-value cannot be directly interpreted as "weight of evidence" but only as a long-term probability (in a hypothetical repetition of the same experiment) of obtaining data at least as unusual as what was actually observed.
P-values depend not only on the observed data but also on the sampling probability of certain unobserved data points. This violates the Likelihood Principle.
This has serious practical implications, for instance for the analysis of clinical trials, where interim analyses and unexpected drug toxicities often change the original trial design.

Newcomb's Speed of Light: Hypothesis Test

By contrast, the Bayesian approach to hypothesis testing, due primarily to Jeffreys (1961), is much simpler and avoids the pitfalls of the traditional Neyman-Pearson-based approach.
It allows the direct calculation of the probability that a hypothesis is true and thus has a direct and straightforward interpretation.
Again, as in the case of CIs, we need to add more structure to the underlying probability model.

1.3 Historical Overview

Historical Overview

Figure 2: From William Jefferys' webpage, Univ. of Texas at Austin.

Inverse Probability

Bayes and Laplace (late 1700s): inverse probability
Example: Given x successes in n iid trials with success probability θ:

probability statements about observables given assumptions about unknown parameters, e.g. P(9 ≤ X ≤ 12 | θ): deductive
inverse probability statements about unknown parameters given observed data values, e.g. P(a < θ < b | X = 9): inductive

Thomas Bayes

(b. 1702, London; d. 1761, Tunbridge Wells, Kent)
Presbyterian minister and mathematician

Figure 3: Reverend Thomas Bayes, 1702-1761.

Bayes' Biography

Bellhouse, D.R. (2004) The Reverend Thomas Bayes, FRS: A Biography to Celebrate the Tercentenary of His Birth. Statistical Science 19(1):3-43.

Son of one of the first 6 Nonconformist ministers in England
Private education (by De Moivre?)
Ordained as Nonconformist minister and took the position of minister at the Presbyterian Chapel, Tunbridge Wells
Educated and interested in mathematics, probability and statistics; believed to be the first to use probability inductively; defended the views and philosophy of Sir Isaac Newton against criticism by Bishop Berkeley
Two papers published while he was still living:
  Divine Providence and Government is the Happiness of His Creatures (1731)
  An Introduction to the Doctrine of Fluxions, and a Defense of the Analyst (1736)

Bayes' Biography

Elected Fellow of the Royal Society in 1742
Most well-known paper published posthumously, submitted by his friend Richard Price: "Essay Towards Solving a Problem in the Doctrine of Chances" (1763), Philosophical Transactions of the Royal Society of London. It begins with:
  "Given the number of times in which an unknown event has happened and failed: Required the chance that the probability of its happening in a single trial lies somewhere between any two degrees of probability that can be named."

Figure 4: Bayes' vault at Bunhill Fields, London.

18th and 19th Century

Bayes laid the foundations of modern Bayesian statistics.
Pierre Simon Laplace (1749-1827), French mathematician and astronomer, developed mathematical astronomy and statistics; he refined inverse probability, acknowledging Bayes' work in a monograph in 1812.
George Boole challenged inverse probability in his Laws of Thought in 1854. The Bayesian approach has been controversial ever since, but it was predominant in practical applications until the early 20th century because of the lack of a frequentist alternative. Inverse probability became an integral part of the universities' statistics curriculum.

20th Century

Sir R.A. Fisher (1890-1962) was a lifelong critic of inverse probability and one of the most important persons involved in the demise of inverse probability.

Figure 5: Sir Ronald A. Fisher (1890-1962).

20th Century

Fisher's (1922) paper revolutionized statistical thinking by introducing the notions of "maximum likelihood", "sufficiency", and "efficiency". His main argument was that one needed to look at the likelihood of the data given the theory, NOT the likelihood of the theory given the data. He thus advocated an "indirect" approach to statistical inference based on ideas of logic called "proof by contradiction".
His work impressed two young statisticians at University College London: J. Neyman and E. Pearson. They developed the mathematical theory of significance testing and confidence intervals, which had a huge influence on statistical applications (for good or bad).

Rise of Subjective Probability

Inverse probability ideas were studied by Keynes (1921), Borel (1921) and Ramsey (1926).
In the 1930s, Harold Jeffreys engaged in a published exchange with R.A. Fisher on Fisher's fiducial argument and Jeffreys' inverse probability. Jeffreys' (1939) book "Theory of Probability" is the most cited in the current "objective Bayesian" literature.
In Italy in the 1930s, Bruno de Finetti gave a different justification for subjective probability, introducing the notion of "exchangeability".
Neo-Bayesian revival in the 1950s (Savage, Good, Lindley, ...).
The current huge popularity of Bayesian methods is due to fast computers and MCMC methods.
Syntheses of Bayesian and non-Bayesian methods? See e.g. Efron (2005), "Bayesians, frequentists, and scientists".

1.4 Bayesian and Frequentist Inference

Two main approaches to statistical inference

the Bayesian approach
  - parameters are random variables
  - subjective probability (for some)
the frequentist/conventional/classical/orthodox approach
  - parameters are fixed but unknown quantities
  - probability as long-run relative frequency
Some controversy in the past (and present)
In this course: not adversarial

Motivating Example: CPCRA AIDS Trial

Carlin and Hodges (1999), Biometrics
Compare two treatments for Mycobacterium avium complex, a disease common in late-stage HIV-infected people
Total of 69 patients
In 11 clinical centers
5 deaths in treatment group 1
13 deaths in treatment group 2

Primary Endpoint Data

Unit Treatm. Time    Unit Treatm. Time    Unit Treatm. Time
A    1       74+     B    2       4+      F    1       6
A    2       248     B    1       156+    F    2       16+
A    1       272+    C    2       20+     F    1       76
A    2       244     E    1       50+     F    2       80
D    2       20+     E    2       64+     F    2       202
D    2       64      E    2       82      F    1       258+
D    2       88      E    1       186+    F    1       268+
D    2       148+    E    1       214+    F    2       368+
D    1       162+    E    1       214     F    1       380+
D    1       184+    E    2       228+    F    1       424+
D    1       188+    E    2       262     F    2       428+
D    1       198+    H    2       22+     F    2       436+
D    1       382+    H    1       22+     I    2       8
D    1       436+    H    1       74+     I    2       16+
G    2       32+     H    1       88+     I    2       40
G    1       64+     H    1       148+    I    1       120+
G    2       102     H    2       162     I    1       168+
G    2       162+    K    1       28+     I    2       174+
G    2       182+    K    1       70+     I    1       268+
G    1       364+    K    2       106+    I    2       276
J    1       18+                          I    1       286+
J    1       36+                          I    1       366
J    2       160+                         I    2       396+
J    2       254                          I    2       466+
                                          I    1       468+

(Times marked with + are censored.)

Data Safety and Monitoring Board

Decision based on:

Stratified Cox proportional hazards model:
  relative risk r = 1.9 with 95%-CI [0.6, 5.9], P-value 0.24
Unstratified Cox proportional hazards model:
  relative risk r = 3.1 with 95%-CI [1.1, 8.7], P-value 0.02

On the basis of the stratified analysis, the Board would have had to continue the trial.
The P-value of the unstratified analysis was small enough to convince the Board to stop the trial.

Stratified Cox PH Model

Stratified:    hi(t) = h0i(t) exp(β′x)
Unstratified:  hi(t) = h0(t) exp(β′x)

Why does the stratified analysis fail to detect the treatment difference?
Contribution of the ith stratum to the partial likelihood:

Li(β) = ∏_{k=1}^{di} exp(β′x_ik) / Σ_{j ∈ R_ik} exp(β′x_ij)

If the largest time in the ith stratum is a death, then the partial likelihood derives no information from this event.
This is the case in the study: 4 deaths have the largest survival time in their stratum, and these are all in treatment group 2.

Compromise Stratified-Unstratified Analysis?

unit-specific dummy variables
frailty model
stratum-specific baseline hazards are random draws from a certain population of hazard functions

Bayesian analysis offers a flexibility in modelling that is not possible with the frequentist approach.
We will analyze this example in a Bayesian way in Chapter 4.

Some Advantages of Bayesian Inference

Offers hitherto unknown flexibility in statistical modelling
Highly nonlinear models with many parameters can be analyzed
Can handle "nuisance" parameters that pose problems for frequentist inference
Does not rely on large-sample asymptotics, but gives valid inference also for small sample sizes
Possibility to incorporate prior knowledge and expert judgement
Adheres to the Likelihood Principle

1.5 Discrete Version of Bayes Theorem

Reminder of Bayes' Theorem: Discrete Case

Theorem 1.2
Let A1, A2, ..., An be a set of mutually exclusive and exhaustive events. Then

P(Ai | B) = P(Ai)P(B|Ai)/P(B) = P(Ai)P(B|Ai) / Σ_j P(Aj)P(B|Aj).

Chess Example

Example 1.3
You are in a chess tournament and will play your next game against either Jun or Martha, depending on the results of some other games. Suppose your probability of beating Jun is 7/10, but of beating Martha is only 2/10. You assess your probability of playing Jun as 1/4.

How likely is it that you win your next game?

Given:
P(W|J) = 7/10,  P(W|M) = 2/10
P(J) = 1/4,     P(M) = 3/4

Then P(W) = P(W|J)P(J) + P(W|M)P(M) = 7/10 · 1/4 + 2/10 · 3/4 = 13/40 = 0.325.

Chess Example

Now suppose that you tell me you won your next chess game. Who was your opponent?

P(J|W) = P(W|J)P(J) / [P(W|J)P(J) + P(W|M)P(M)] = 7/13

Diagnostic Testing

Example 1.4
A new home HIV test is claimed to have 95% sensitivity and 98% specificity. In a population with an HIV prevalence of 1/1000, what is the chance that someone testing positive actually has HIV?
Let A be the event that the individual is truly HIV positive and Ā be the event that the individual is truly HIV negative.
P(A) = 0.001.
Let B be the event that the test is positive. We want P(A|B).
"95% sensitivity" means that P(B|A) = 0.95.
"98% specificity" means that P(B̄|Ā) = 0.98, or P(B|Ā) = 0.02.

Diagnostic Testing

Now Bayes' theorem says

P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|Ā)P(Ā)]
       = (0.95 · 0.001) / (0.95 · 0.001 + 0.02 · 0.999) = 0.045.

Thus, over 95% of those testing positive will, in fact, not have HIV.

The following example caused a stir in 1991 after a US columnist, who calls herself Marilyn vos Savant, used it in her column. She gave the correct answer. A surprising number of mathematicians wrote to her saying that she was wrong.

Monty Hall Problem

Example 1.5
You are a contestant on the TV show "Let's Make a Deal" and given the choice of three doors. Two of the doors have a goat behind them and one a car. You choose a door, say door 2, but before opening the chosen door, the emcee, Monty Hall, opens a door that has a goat behind it (e.g. door 1). He gives you the option of revising your choice or sticking to your first choice. What do you do?

Since either door 2 or door 3 must hide the car, it was claimed that the probability of winning by sticking with door 2 had increased to 1/2.
Obviously, switch to door 3. The probability of finding the car behind either door 1 or door 3 is 2/3. As the emcee showed you that it is not behind door 1, the probability that it is behind door 3 is 2/3.

Monty Hall Problem

With Bayes' theorem:

Let Ai = "car behind door no. i", i = 1, 2, 3. These form a partition.
P(Ai) = 1/3 are the prior probabilities for i = 1, 2, 3.

Let B = "Monty Hall opens door 1 (with goat)".
P(B|A1) = 0      likelihood of A1
P(B|A2) = 1/2    likelihood of A2
P(B|A3) = 1      likelihood of A3

We want
P(A3|B) = P(B|A3)P(A3) / [P(B|A1)P(A1) + P(B|A2)P(A2) + P(B|A3)P(A3)]
        = (1 · 1/3) / (0 · 1/3 + 1/2 · 1/3 + 1 · 1/3) = 2/3.

Bayes' Theorem again

Let H1, H2, ..., Hn denote n hypotheses (mutually disjoint) and D observed data. Then Bayes' theorem says:

P(Hi|D) = P(Hi)P(D|Hi) / Σ_j P(Hj)P(D|Hj).

P(D|Hi) are known as likelihoods, the likelihoods given to Hi by D, or, as statisticians usually say, the likelihood of Hi given D. (This notion is used extensively in frequentist statistical inference; the method of maximum likelihood means finding the hypothesis under which the observations are most likely to have occurred.)
P(Hi) are prior probabilities.
P(Hi|D) are posterior probabilities.
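A quick simulation makes the 2/3 answer concrete. The following R sketch is illustrative only (not part of the original slides); it repeats the game many times with the contestant initially picking door 2, as in Example 1.5, and records how often switching wins.

set.seed(1)                              # illustrative seed
reps <- 100000
wins_switch <- replicate(reps, {
  car    <- sample(1:3, 1)               # door hiding the car
  pick   <- 2                            # contestant's initial choice (door 2, as in Example 1.5)
  # Monty opens a door that is neither the pick nor the car
  opened <- if (car == pick) sample(setdiff(1:3, pick), 1) else setdiff(1:3, c(pick, car))
  switched_to <- setdiff(1:3, c(pick, opened))
  switched_to == car                     # TRUE if switching wins
})
mean(wins_switch)                        # close to 2/3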

Importance of Prior Plausibility

Example 1.6
D = event that I look through my window and see a tall, branched thing with green blobs covering its branches.

H1 = tree
H2 = man
H3 = something else

Why do I think it is a tree?

P(D|H1) is close to 1, whereas P(D|H2) is close to 0.
But likelihood is not the only consideration in this reasoning.
More specifically, let H3 = cardboard replica of a tree. Then P(D|H3) is close to 1.
H3 has the same likelihood as H1, but it is not a plausible hypothesis because it has a very much lower prior probability.

Importance of Prior Plausibility

H1 has a high prior probability.
H2 has a high prior probability.
H3 has a low prior probability.

Bayes' theorem is in complete accord with this natural reasoning. The posterior probabilities of the various hypotheses are in proportion to the products of their prior probabilities and their likelihoods:

P(Hi|D) ∝ P(Hi)P(D|Hi)

Bayes' theorem thus combines two sources of information:
prior information represented by prior probabilities
new information represented by likelihoods
These together add up to the total information represented by the posterior probabilities.

2 Bayesian Inference
2.1 Statistical Model

Notation and Definitions

Here, we only consider parametric models.
We assume that the observations X1, ..., Xn have been generated from a parametrized probability distribution, i.e., Xi (1 ≤ i ≤ n) has a distribution with probability density function (pdf) f(xi|θ) on ℝ, such that the parameters θ = (θ1, ..., θp) are unknown and the pdf f is known. This model can then be represented more simply by X ~ f(x|θ), where x is the vector of observations and θ the vector of parameters.

Definition 2.1
A parametric statistical model consists of the observation of a random variable X, distributed according to f(x|θ), where only the parameter θ is unknown and belongs to a vector space Θ ⊆ ℝ^p of finite dimension.

Example: Xi ~ N(μ, σ²) iid for i = 1, ..., n. Then

f(x|μ, σ²) = ∏_{i=1}^n f(xi|μ, σ²) = ∏_{i=1}^n (1/(σ√(2π))) exp(−(xi − μ)²/(2σ²)),

with θ = (μ, σ²).

We are usually interested in questions of the form:
What is the value of θ1?  (parameter estimation)
Is θ1 larger than θ3?  (hypothesis testing)
What is the most likely value of a future event, whose distribution depends on θ?  (prediction)

2.2 Likelihood-based Functions

Overview

In this section, we will introduce (or remind you of):
likelihood function
maximum likelihood estimation
information criteria
score function
Fisher information

Likelihood Function

Definition 2.2
The likelihood function of θ is the function that associates the value f(x|θ) to each θ. This function is denoted by l(θ; x). Other common notations are l_x(θ), l(θ|x) and l(θ). It is defined by

l(θ; x) = f(x|θ),   θ ∈ Θ,    (2.1)

where x is the observed value of X.

The likelihood function associates to each value of θ the probability of an observed value x for X (if X is discrete). Then, the larger the value of l, the greater are the chances associated to the event under consideration, using that particular value of θ. Therefore, by fixing the value of x and varying θ, we observe the plausibility (or likelihood) of each value of θ. The likelihood function is of fundamental importance in many theories of statistical inference.

Maximum Likelihood Estimate

Definition 2.3
Any vector θ̂ maximizing (2.1) as a function of θ, with x fixed, provides a maximum likelihood (ML) estimate of θ.
In intuitive terms, this gives the realization of θ most likely to have given rise to the current data set, an important finite-sample property.

Note that even though ∫ f(x|θ) dx = 1, in general ∫ l(θ; x) dθ ≠ 1.

General Information Criteria

Modeling process: Suppose f belongs to some family F of meaningful functional forms, but where the dimension p of the parameter may vary among members of the family. Then choose f ∈ F to maximize

GIC = General Information Criterion = log l(θ̂; x) − α p/2.

Here log l(θ̂; x) denotes the maximum of the log-likelihood function, and α/2 provides a penalty per parameter in the model.

Two choices of α:
α = 2 (Akaike, 1978):
  AIC = Akaike Information Criterion = log l(θ̂; x) − p
α = log(n/2) (Schwarz, 1978):
  BIC = Bayesian Information Criterion = log l(θ̂; x) − (p/2) log(n/2)

Binomial Example

Example 2.4
X ~ Binomial(2, θ). Then

f(x|θ) = l(θ; x) = C(2, x) θ^x (1 − θ)^(2−x),   x = 0, 1, 2;  Θ = (0, 1),

where C(2, x) is the binomial coefficient.

Note that:
1. If x = 1 then l(θ; x = 1) = 2θ(1 − θ). The value of θ that gives highest likelihood to x = 1, or, in other words, the most likely value of θ, is 0.5.
2. If x = 2 then l(θ; x = 2) = θ². The most likely value of θ is 1.
3. If x = 0 then l(θ; x = 0) = (1 − θ)². The most likely value is 0.

Also note that

Σ_x f(x|θ) = 1

but

∫_0^1 l(θ; x) dθ = C(2, x) ∫_0^1 θ^x (1 − θ)^(2−x) dθ = C(2, x) B(x + 1, 3 − x) = 1/3 ≠ 1.
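These three likelihood curves are plotted in Figure 6 below. A short R sketch (illustrative, not from the original slides) reproduces the figure:

theta <- seq(0, 1, length.out = 200)
plot(theta, (1 - theta)^2, type = "l", lty = 1, ylim = c(0, 1),
     xlab = "theta", ylab = "likelihood")          # l(theta; x = 0)
lines(theta, 2 * theta * (1 - theta), lty = 2)     # l(theta; x = 1)
lines(theta, theta^2, lty = 3)                     # l(theta; x = 2)
legend("top", lty = 1:3,
       legend = c("l(theta;x=0)", "l(theta;x=1)", "l(theta;x=2)"))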

Binomial Example

Figure 6: Likelihood function l(θ; x) for x = 0, 1, 2.

Geometric Example

Example 2.5
Let X1, X2, ..., Xn denote a random sample from a geometric distribution with pdf

f(Xi = xi|θ) = θ(1 − θ)^(xi − 1),   xi = 1, 2, ....

a) Find the likelihood function of θ.

l(θ; x) = P(X1 = x1, X2 = x2, ..., Xn = xn|θ) = f(x1, ..., xn|θ)
        = ∏_{i=1}^n f(xi|θ) = ∏_{i=1}^n θ(1 − θ)^(xi − 1)
        = θ^n (1 − θ)^(Σ_{i=1}^n (xi − 1))
        = θ^n (1 − θ)^(n(x̄ − 1))

(This is a Beta curve as a function of θ.)

Geometric Example

b) The maximum likelihood estimate θ̂ of θ maximizes the probability of obtaining the observations actually observed. Find θ̂.

It is easier to maximize the log-likelihood:

log l(θ; x) = n log θ + n(x̄ − 1) log(1 − θ)

d/dθ log l(θ; x) = n/θ − n(x̄ − 1)/(1 − θ) = 0   ⟹   θ̂ = 1/x̄

d²/dθ² log l(θ; x) = −n/θ² − n(x̄ − 1)/(1 − θ)² < 0

Thus θ̂ is a global maximum.

Geometric Example

c) The invariance property of maximum likelihood estimates tells us that for any function ψ = g(θ) of θ, ψ̂ = g(θ̂) is the ML estimate of g(θ).
Find the ML estimate of ψ = θ(1 − θ) = P(X1 = 2).

ψ̂ = θ̂(1 − θ̂) = (1/x̄)(1 − 1/x̄)

Exponential Example

Example 2.6
Let X1, X2, ..., Xn denote a random sample from the exponential distribution with unknown location parameter θ, unknown scale parameter β, and pdf

f(x|θ, β) = β exp{−β(x − θ)},   θ < x < ∞,

where −∞ < θ < ∞ and 0 < β < ∞.
The common mean and variance of the Xi are μ = θ + β⁻¹ and σ² = β⁻². Find the likelihood function of θ and β and the ML estimates of μ and σ², in situations where the observed values x1, x2, ..., xn are not all equal.

Exponential Example

The joint pdf of X1, ..., Xn is

f(x1, ..., xn|θ, β) = ∏_{i=1}^n f(xi|θ, β) = ∏_{i=1}^n β exp{−β(xi − θ)} I(θ ≤ xi).

Thus, the likelihood of θ and β when x1, ..., xn are observed is

l(θ, β; x1, ..., xn) = β^n exp{−β Σ_{i=1}^n (xi − θ)} ∏_{i=1}^n I(θ ≤ xi).

Exponential Example

Defining z = min(x1, ..., xn):

l(θ, β; x1, ..., xn) = β^n exp{−nβ(x̄ − θ)} I(θ ≤ z)

As a function of θ,

l(θ, β; x1, ..., xn) ∝ exp(nβθ) for θ ≤ z, and 0 otherwise.

This is maximized when θ = θ̂ = z.
Now, as a function of β, the likelihood is proportional to

g(β) = β^n exp{−aβ},   with a = n(x̄ − θ̂) > 0 (if x1, ..., xn are not all equal).

Exponential Example

Then

log g(β) = n log β − aβ
d log g(β)/dβ = n/β − a = 0   ⟹   β̂ = n/a = 1/(x̄ − z)

This is a global maximum, as the second derivative is always negative.
By the invariance property of ML estimators:

μ̂ = θ̂ + β̂⁻¹ = z + (x̄ − z) = x̄
σ̂² = β̂⁻² = (x̄ − z)²
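A quick numerical check of these closed-form estimates in R (illustrative sketch; the parameter values, sample size and seed below are arbitrary choices, not from the slides):

set.seed(1)                         # illustrative values, not from the slides
theta <- 2; beta <- 0.5; n <- 500
x <- theta + rexp(n, rate = beta)   # shifted exponential sample

z          <- min(x)                # ML estimate of theta
beta.hat   <- 1 / (mean(x) - z)     # ML estimate of beta
mu.hat     <- mean(x)               # ML estimate of mu = theta + 1/beta
sigma2.hat <- (mean(x) - z)^2       # ML estimate of sigma^2 = 1/beta^2
c(z, beta.hat, mu.hat, sigma2.hat)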

Fisher Information

Definition 2.7
Let X be a random vector with pdf f(x|θ) depending on a 1-dimensional parameter θ.
The expected Fisher information measure of θ through X is defined by

I(θ) = E_{X|θ}[ −∂² log f(X|θ) / ∂θ² ].

If θ = (θ1, ..., θp) is a vector, then the expected Fisher information matrix of θ through X is defined by

I(θ) = E_{X|θ}[ −∂² log f(X|θ) / ∂θ ∂θ′ ]

with elements Iij(θ) given by

Iij(θ) = E_{X|θ}[ −∂² log f(X|θ) / ∂θi ∂θj ],   i, j = 1, ..., p.

Fisher Information

The information measure defined this way is related to the mean value of the curvature of the likelihood. The larger this curvature is, the larger is the information content summarized in the likelihood function, and so the larger I(θ) will be. Since the curvature is expected to be negative, the information value is taken as minus the curvature. The expectation is taken with respect to the sample distribution.
The observed Fisher information corresponds to minus the second derivative of the log-likelihood:

J_X(θ) = −∂² log f(X|θ) / ∂θ ∂θ′,

and is interpreted as a local measure of the information content, while its expected value, the expected Fisher information, is a global measure.

Fisher Information Example

Example 2.8
Let X ~ N(μ, σ²) with σ² known. It is easy to get I(μ) = J_X(μ) = σ⁻², the normal precision. Verify!

log f(X|μ) = log{ (1/(σ√(2π))) exp(−(X − μ)²/(2σ²)) } = const − (X − μ)²/(2σ²)

d/dμ log f(X|μ) = (X − μ)/σ²
d²/dμ² log f(X|μ) = −1/σ²

I(μ) = E[ −d²/dμ² log f(X|μ) ] = E[1/σ²] = 1/σ² = J_X(μ),

i.e. the normal precision.

Fisher Information

One of the most useful properties of the Fisher information is the additivity of the information with respect to independent observations. This means that if X = (X1, ..., Xn) are independent random variables with densities fi(x|θ), and I and Ii are the expected Fisher information measures obtained through X and Xi, respectively, then

I(θ) = Σ_{i=1}^n Ii(θ).

This states that the total information obtained from independent observations is the sum of the information of the individual observations.

Score Function

Definition 2.9
The score function of X is defined as

U(X; θ) = ∂ log f(X|θ) / ∂θ.

One can show that under certain regularity conditions:

I(θ) = E_{X|θ}[ U²(X; θ) ].

In a large number of situations, θ̂ will, for large n, possess a distribution that is approximately multivariate normal with mean vector θ and covariance matrix I(θ)⁻¹.
The vector I(θ)^(1/2)(θ̂ − θ) is said to converge in distribution, as n → ∞ with p fixed, to a standard spherical normal distribution (i.e. a multivariate normal distribution N(0, Ip) with zero mean vector and covariance matrix equal to the p × p identity matrix).

Example: Fisher Info for Binomial

Example 2.10
Let X1, ..., Xn ~ Binomial(1, θ) iid. Show that the ML estimate of θ has an asymptotic N(θ, θ(1 − θ)/n) distribution.

Xi|θ ~ Binomial(1, θ) with E(Xi) = θ and Var(Xi) = θ(1 − θ).

l(θ; x1, ..., xn) = ∏_{i=1}^n f(xi|θ) = ∏_{i=1}^n θ^(xi)(1 − θ)^(1−xi) = θ^(Σ xi)(1 − θ)^(n − Σ xi) = θ^x (1 − θ)^(n−x),

where x = Σ_{i=1}^n xi.

log l(θ; x1, ..., xn) = x log θ + (n − x) log(1 − θ)
d/dθ log l(θ; x1, ..., xn) = x/θ − (n − x)/(1 − θ) = 0   ⟹   θ̂ = x/n

Example: Fisher Info for Binomial

U(Xi; θ) = d/dθ log f(Xi|θ) = Xi/θ − (1 − Xi)/(1 − θ) = (Xi − θ)/(θ(1 − θ))

U²(Xi; θ) = (Xi − θ)²/(θ²(1 − θ)²)

Ii(θ) = E[U²(Xi; θ)] = Var(Xi)/(θ²(1 − θ)²) = θ(1 − θ)/(θ²(1 − θ)²) = 1/(θ(1 − θ))

I(θ) = Σ_{i=1}^n Ii(θ) = n/(θ(1 − θ))

Hence θ̂ = x/n is asymptotically N(θ, θ(1 − θ)/n).
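The asymptotic N(θ, θ(1 − θ)/n) approximation can be checked by simulation. A minimal R sketch (the values of θ, n, the number of replications and the seed are illustrative choices):

set.seed(1)                          # illustrative seed
theta <- 0.3; n <- 100; reps <- 5000
theta.hat <- rbinom(reps, size = n, prob = theta) / n   # ML estimates from 'reps' samples
c(mean(theta.hat), var(theta.hat))   # compare with theta and theta*(1-theta)/n
theta * (1 - theta) / n              # asymptotic variance I(theta)^(-1)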

2.3 Bayes Theorem: Continuous Case

Bayesian Statistical Model

Given data x whose distribution depends on an unknown parameter θ, we require inference about θ. (x and θ can be vectors, but we assume for ease of notation that they are 1-dimensional.)

Definition 2.11
A Bayesian statistical model consists of a parametric statistical model (the sampling distribution or likelihood), f(x|θ), and a prior distribution on the parameters, f(θ).

Bayes' Theorem

Theorem 2.12 (Continuous version of Bayes' theorem)
Given a Bayesian statistical model, we can update the prior pdf of θ to the posterior pdf of θ given the data x:

f(θ|x) = f(θ)f(x|θ)/f(x) = f(θ)f(x|θ) / ∫ f(θ)f(x|θ) dθ ∝ prior × likelihood.

Essential Distributions

Given a complete Bayesian model, we can construct:
a) the joint distribution of (θ, X):
   f(θ, x) = f(x|θ)f(θ);
b) the marginal or prior predictive distribution of X:
   f(x) = ∫ f(θ, x) dθ = ∫ f(x|θ)f(θ) dθ;
c) the posterior distribution of θ:
   f(θ|x) = f(θ)f(x|θ)/f(x) = f(θ)f(x|θ) / ∫ f(θ)f(x|θ) dθ;
d) the posterior predictive distribution for a future observation Y given x:
   f(y|x) = ∫ f(y, θ|x) dθ = ∫ f(y|θ)f(θ|x) dθ.

Presentation of Posterior Distribution

After seeing the data x, what do we now know about the parameter θ?

plot of the posterior density function
summary statistics like measures of location and dispersion/precision (analogues to frequentist point estimates: e.g. posterior mean, median, mode)
hypothesis test, e.g. H0: θ ≤ θ0:
  Pr(H0 true|x) = Pr(θ ≤ θ0|x) = ∫_{−∞}^{θ0} f(θ|x) dθ
analogues to frequentist confidence intervals: central posterior interval and highest posterior density region

Presentation of Posterior Distribution

If F(θ|x) is the posterior cdf and F(θ1|x) = p1, F(θ2|x) = p2 > p1, then the interval (θ1, θ2] is a posterior interval of θ with coverage probability p2 − p1 (credible interval).
If exactly 100(α/2)% of the posterior probability lies above and below the posterior interval, it is called a central posterior interval with coverage probability 1 − α = p2 − p1.
It is sometimes desirable to find an interval/region which is as short as possible for a given coverage probability. This is called a highest posterior density (HPD) region.

3 Conjugate Distributions

Conjugate Distributions

The term conjugate refers to cases where the posterior distribution is in the same family as the prior distribution.
In Bayesian probability theory, if the posterior distributions f(θ|x) are in the same family as the prior distributions f(θ), the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior.
The concept, as well as the term "conjugate prior", were introduced by Howard Raiffa and Robert Schlaifer in their work on Bayesian decision theory (1961).

3.1 Bernoulli Distribution - Discrete Prior

Bernoulli Trials - Discrete Prior

Assume a drug may have response rate θ of 0.2, 0.4, 0.6 or 0.8, each with equal prior probability. If we observe a single positive response (x = 1), how is our prior revised?

Likelihood:
f(x|θ) = θ^x (1 − θ)^(1−x)
f(x = 1|θ) = θ,  f(x = 0|θ) = 1 − θ

Posterior:
f(θ|x) = f(x|θ)f(θ) / Σ_j f(x|θj)f(θj) ∝ f(x|θ)f(θ)

θ     prior f(θ)
0.2   0.25
0.4   0.25
0.6   0.25
0.8   0.25
Sum   1.0

Calculating the Posterior

θ     f(θ)   f(x=1|θ)·f(θ)        f(θ|x=1)
0.2   0.25   0.2 · 0.25 = 0.05    0.10
0.4   0.25   0.4 · 0.25 = 0.10    0.20
0.6   0.25   0.6 · 0.25 = 0.15    0.30
0.8   0.25   0.8 · 0.25 = 0.20    0.40
Sum   1.00                 0.50   1.00

Note: a single positive response makes it 4 times as likely that the true response rate is 80% rather than 20%.

Prior Predictive Distribution

With a Bayesian approach, prediction is straightforward. The prior predictive distribution of X is given by:

P(X = 1) = f(x = 1) = Σ_j f(x = 1|θj)f(θj) = 0.5
P(X = 0) = f(x = 0) = 1 − f(x = 1) = 0.5

The prior predictive probability is thus a weighted average of the likelihoods under the 4 possible values of θ:

f(x) = Σ_j wj f(x|θj),   with prior weights wj = f(θj).

Furthermore:

f(x = 1) = Σ_j θj wj = prior mean of θ = E[θ].
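The whole table, together with the prior and posterior predictive probabilities used below, can be reproduced in a few lines of R (illustrative sketch):

theta <- c(0.2, 0.4, 0.6, 0.8)
prior <- rep(0.25, 4)

lik  <- theta                             # f(x = 1 | theta) for a single positive response
post <- lik * prior / sum(lik * prior)    # posterior f(theta | x = 1)
cbind(theta, prior, lik * prior, post)

sum(lik * prior)                          # prior predictive P(X = 1) = 0.5
sum(theta * post)                         # posterior predictive P(Z = 1 | x = 1) = 0.6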

Posterior Predictive Distribution

Suppose we wish to predict the outcome of a new observation z, given what we have already observed.
For discrete θ we have the posterior predictive distribution:

f(z|x) = Σ_j f(z, θj|x)

which, since z is usually conditionally independent of x given θ, is generally equal to

f(z|x) = Σ_j f(z|θj, x)f(θj|x) = Σ_j f(z|θj)f(θj|x) = Σ_j f(z|θj) wj(x),

where the wj(x) = f(θj|x) are posterior weights.

Posterior Predictive Distribution

Example: The posterior predictive probability that the next treatment is successful:

f(z = 1|x = 1) = Σ_j θj f(θj|x) = posterior mean of θ
               = 0.2 · 0.1 + 0.4 · 0.2 + 0.6 · 0.3 + 0.8 · 0.4 = 0.6

3.2 Binomial Distribution - Discrete Prior

Binomial Response - Discrete Prior

If we observe r responses out of n patients, how is our prior revised?

Likelihood:
f(x = r|θ) = C(n, r) θ^r (1 − θ)^(n−r) ∝ θ^r (1 − θ)^(n−r)

Suppose n = 20, r = 15:
f(x = 15|θ) ∝ θ^15 (1 − θ)^5

Binomial Response - Discrete Prior

θ     prior f(θ)   f(x=r|θ)·f(θ) (×10⁻⁷)   posterior f(θ|x=r)
0.2   0.25         0.0                      0.0
0.4   0.25         0.2                      0.005
0.6   0.25         12.0                     0.298
0.8   0.25         28.1                     0.697
Sum   1.0          40.3                     1.0

Binomial Response - Discrete Prior

After observing x = 15 successes, what is the posterior predictive probability of a positive response for patient no. 21?

f(z = 1|x = 15) = Σ_i θi f(θi|x = 15)
                = 0.2 · 0.0 + 0.4 · 0.005 + 0.6 · 0.298 + 0.8 · 0.697
                = 0.7384

Summary and Terminology (Discrete Prior)

Two random variables: X (observable), θ (unobservable).
Let X|θ ~ Binomial(n, θ) (or Xj|θ ~ Bernoulli(θ) conditionally independent for j = 1, ..., n), where the unknown parameter θ can attain I different values θi, with a priori probabilities f(θi), i = 1, ..., I, respectively.

X|θ ~ Binomial(n, θ) is called the sampling distribution.
f(θi), i = 1, ..., I, is called the prior distribution.
The likelihood function:

f(x|θ) = C(n, x) θ^x (1 − θ)^(n−x) ∝ θ^x (1 − θ)^(n−x),   θ = θ1, ..., θI.

NOTE: This is considered as a function of θ only; x is considered fixed.

Summary and Terminology (Discrete Prior)

Prior predictive pdf of X:

f(x) = Σ_{i=1}^I f(x|θi)f(θi)   for x = 0, 1, ..., n

(a weighted average of f(x|θ) with weights given by the prior probabilities f(θi)).

Posterior pdf of θ:

f(θi|x) = f(θi)f(x|θi)/f(x) = f(θi)f(x|θi) / Σ_{j=1}^I f(θj)f(x|θj) ∝ f(θi)f(x|θi),   i = 1, ..., I.

Posterior predictive pdf for another future observation Y of the Bernoulli experiment:

f(y|x) = Σ_{i=1}^I f(y|θi)f(θi|x)

(a weighted average of f(y|θ) with weights given by the posterior probabilities f(θi|x)).
As Y can attain only the values 0 and 1, this gives:

f(1|x) = Σ_{i=1}^I θi f(θi|x) = posterior mean of θ
f(0|x) = 1 − f(1|x)

3.3 Binomial Distribution - Continuous Prior

Binomial Response - Continuous Prior

Data: x successes from n independent trials.

Likelihood:
f(x|θ) = C(n, x) θ^x (1 − θ)^(n−x) ∝ θ^x (1 − θ)^(n−x)

Prior: flexible conjugate beta family, θ ~ Beta(α, β):

f(θ) = Γ(α + β)/(Γ(α)Γ(β)) θ^(α−1)(1 − θ)^(β−1) ∝ θ^(α−1)(1 − θ)^(β−1)

Calculating the Posterior

Posterior:
f(θ|x) ∝ f(x|θ)f(θ)
       ∝ θ^x (1 − θ)^(n−x) · θ^(α−1)(1 − θ)^(β−1)
       = θ^(α+x−1)(1 − θ)^(β+n−x−1)

i.e. θ|x ~ Beta(α + x, β + n − x).
Note: the Binomial and Beta distributions are conjugate distributions.

Posterior Moments

For a Beta(α, β) distribution:
mode m = (α − 1)/(α + β − 2)
mean μ = α/(α + β)
variance σ² = μ(1 − μ)/(α + β + 1) = αβ/[(α + β)²(α + β + 1)]

Suppose our prior estimate of the response rate is 0.4 with a standard deviation of 0.1.
Solving μ = 0.4 and σ² = 0.1² gives α = 9.2, β = 13.8.
It is convenient to think of this as equivalent to having observed 9.2 successes in α + β = 23 patients.

Prior and Posterior Densities

            prior   likelihood   posterior
successes   9.2     15           24.2
failures    13.8    5            18.8

Figure 7: Prior, likelihood, and posterior density of θ.
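Figure 7 can be reproduced with a few lines of R (illustrative sketch; the likelihood curve is drawn as its normalized version, a Beta(15+1, 5+1) density):

theta <- seq(0, 1, length.out = 500)
plot(theta, dbeta(theta, 24.2, 18.8), type = "l", lty = 1,
     xlab = "theta", ylab = "density")               # posterior Beta(24.2, 18.8)
lines(theta, dbeta(theta, 9.2, 13.8), lty = 2)       # prior Beta(9.2, 13.8)
lines(theta, dbeta(theta, 16, 6), lty = 3)           # normalized likelihood: Beta(16, 6)
legend("topleft", lty = c(2, 3, 1),
       legend = c("prior", "likelihood", "posterior"))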

Prior and Posterior Means and Modes

Compare modes of prior, likelihood and posterior:
prior mode:         (α − 1)/(α + β − 2) = 8.2/21 = 0.39
mode of likelihood: 15/20 = 0.75
posterior mode:     23.2/41 = 0.57

Compare means of prior, data and posterior:
prior mean:     9.2/23 = 0.40
data mean:      15/20 = 0.75
posterior mean: 24.2/43 = 0.56

Compromise

In general, the posterior mean is a compromise between the prior mean and the data mean, i.e. for some w with 0 ≤ w ≤ 1:

posterior mean = w · prior mean + (1 − w) · data mean

(x + α)/(n + α + β) = w · α/(α + β) + (1 − w) · x/n

Solving with respect to w:

(x + α)/(n + α + β) = (α + β)/(n + α + β) · α/(α + β) + n/(n + α + β) · x/n,

i.e. w = (α + β)/(n + α + β):
the prior gets weight (α + β)/(n + α + β) → 0 as n → ∞,
the data get weight n/(n + α + β) → 1 as n → ∞.

Compromise

"A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule."

Hypothesis Test

H0: θ > θ0 = 0.4
Calculate the prior and posterior probability of H0:

P(θ > θ0) = ∫_{θ0}^1 f(θ) dθ = 1 − F_Beta(α,β)(θ0)
P(θ > θ0|x) = ∫_{θ0}^1 f(θ|x) dθ = 1 − F_Beta(α+x, β+n−x)(θ0)

For θ0 = 0.4, use R:

> priorprob=1-pbeta(0.4,9.2,13.8)
> priorprob
[1] 0.4886101
> postprob=1-pbeta(0.4,24.2,18.8)
> postprob
[1] 0.9842593

Analogue to Confidence Interval

Posterior credible interval

95% central posterior credible interval for θ: (θ_l, θ_u), where

0.95 = ∫_{θ_l}^{θ_u} f(θ|x) dθ

and θ_l and θ_u are the 2.5% and 97.5% quantiles of the posterior.
Use R:

> l=qbeta(0.025,24.2,18.8)
> l
[1] 0.4142266
> u=qbeta(0.975,24.2,18.8)
> u
[1] 0.7058181

Posterior Predictive Distribution

What is the posterior predictive success probability for a further, n + 1 = 21st, patient entering the trial?

P(X_{n+1} = 1|x) = ∫_0^1 f(x_{n+1} = 1|θ) f(θ|x1, ..., xn) dθ
= ∫_0^1 θ · Γ(n + α + β)/(Γ(α + x)Γ(β + n − x)) θ^(α+x−1)(1 − θ)^(β+n−x−1) dθ
= Γ(n + α + β)/(Γ(α + x)Γ(β + n − x)) ∫_0^1 θ^(α+x)(1 − θ)^(β+n−x−1) dθ
= Γ(n + α + β)/(Γ(α + x)Γ(β + n − x)) · Γ(α + x + 1)Γ(β + n − x)/Γ(n + α + β + 1)
= (α + x)/(α + β + n) = (9.2 + 15)/(23 + 20) = 0.562797

Posterior Predictive Distribution

If N = 100 further patients enter the trial, what is the posterior predictive distribution of the number of successes?
Let Y|θ ~ Binomial(N, θ). Then for y = 0, 1, ..., N:

f(y|x) = ∫ f(y|θ) f(θ|x) dθ
= ∫_0^1 C(N, y) θ^y (1 − θ)^(N−y) · Γ(n + α + β)/(Γ(α + x)Γ(β + n − x)) θ^(α+x−1)(1 − θ)^(β+n−x−1) dθ
= C(N, y) Γ(n + α + β)/(Γ(α + x)Γ(β + n − x)) ∫_0^1 θ^(y+α+x−1)(1 − θ)^(N−y+β+n−x−1) dθ
= C(N, y) · Γ(n + α + β)/(Γ(α + x)Γ(β + n − x)) · Γ(α + x + y)Γ(β + n − x + N − y)/Γ(α + β + n + N)

This is called a Beta-Binomial distribution.
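This pmf is easy to evaluate in R via log-gamma functions. An illustrative sketch (the helper function dbetabinom is defined here for illustration; α = 9.2, β = 13.8, n = 20, x = 15 as above):

# posterior predictive (beta-binomial) pmf for y successes out of N future patients
dbetabinom <- function(y, N, a, b) {
  # a, b are the posterior Beta parameters, here a = alpha + x, b = beta + n - x
  exp(lchoose(N, y) + lbeta(a + y, b + N - y) - lbeta(a, b))
}
a <- 9.2 + 15; b <- 13.8 + 20 - 15
y <- 0:100
probs <- dbetabinom(y, N = 100, a = a, b = b)
sum(probs)             # should be 1
sum(y * probs) / 100   # predictive mean proportion, equals a/(a+b) = 0.563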

3.4 Exchangeability

Independence?

A common statement in statistics: assume X1, ..., Xn are iid random variables.
In Bayesian statistics, we need to think hard about independence. Why?

Consider two "independent" Bernoulli trials with probability of success θ.
It is true that

f(x1, x2|θ) = θ^(x1+x2)(1 − θ)^(2−x1−x2) = f(x1|θ)f(x2|θ),

so that X1 and X2 are independent given θ.
But f(x1, x2) = ∫ f(x1, x2|θ)f(θ) dθ may not factor.

Marginal Bivariate Distribution

If f(θ) = Unif(0, 1), then

f(x1, x2) = ∫ f(x1, x2|θ)f(θ) dθ = ∫_0^1 θ^(x1+x2)(1 − θ)^(2−x1−x2) dθ
          = Γ(x1 + x2 + 1)Γ(3 − x1 − x2) / Γ(4).

Exchangeability

If independence is no longer the key, then what is? Exchangeability.

Informal definition: subscripts don't matter.
More formally: Given events A1, A2, ..., An, we say they are exchangeable if

P(A1, A2, ..., Ak) = P(A_{i1}, A_{i2}, ..., A_{ik})

for every k, where i1, i2, ..., ik is a permutation of the indices.
Similarly, given random variables X1, X2, ..., Xn, we say that they are exchangeable if

P(X1 ≤ x1, ..., Xk ≤ xk) = P(X_{i1} ≤ x1, ..., X_{ik} ≤ xk)

for every k.

Relationship between exchangeability and independence

rvs that are iid given θ are exchangeable
an infinite sequence of exchangeable rvs can always be thought of as iid given some parameter θ (de Finetti's theorem)
note that the previous point requires an infinite sequence

What is not exchangeable?
time series, spatial data
these may become exchangeable if we explicitly include time in the analysis, i.e. x1, x2, ..., xt, ... are not exchangeable but (t1, x1), (t2, x2), ... may be

3.5 Sequential Learning

Sequential Inference

Suppose we obtain an observation x1 and form the posterior f(θ|x1) ∝ f(x1|θ)f(θ), and then we obtain a further observation x2 which is conditionally independent of x1 given θ. The posterior based on x1, x2 is given by:

f(θ|x1, x2) ∝ f(x2|θ, x1) f(θ|x1) = f(x2|θ) f(θ|x1)

Today's posterior is tomorrow's prior!
The resulting posterior is the same as if we had obtained the data x1, x2 together:

f(θ|x1, x2) ∝ f(x1, x2|θ) f(θ) = f(x2|θ) f(x1|θ) f(θ)
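For the Beta-Binomial model this identity is easy to verify numerically. An illustrative R sketch (the prior and the split of the data into two batches are arbitrary choices for the demonstration):

# start from a Beta(9.2, 13.8) prior (illustrative choice)
a <- 9.2; b <- 13.8

# sequential updating: first x1 = 6 successes in 10 trials, then x2 = 9 in 10
a1 <- a + 6;  b1 <- b + 10 - 6         # posterior after x1 becomes the new prior
a2 <- a1 + 9; b2 <- b1 + 10 - 9        # posterior after x2

# batch updating: all 15 successes in 20 trials at once
c(a2, b2)                              # Beta(24.2, 18.8)
c(a + 15, b + 20 - 15)                 # identical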

3.6 Comparing Bayesian and Frequentist Inference for Proportion

Comparing Bayesian and Frequentist Inference for Proportion

Frequentist inference is concerned with
point estimation,
interval estimation,
and hypothesis testing.

Point Estimation

A single statistic is calculated from the sample data and used to estimate the unknown parameter.
The statistic depends on the random sample, so it is random, and its distribution is called its sampling distribution.
We call the statistic an estimator of the parameter, and the value it takes for the actual sample data an estimate.
There are various frequentist approaches for finding estimators, such as
Least Squares (LS),
maximum likelihood estimation (MLE), and
uniformly minimum variance unbiased estimation (UMVUE).
For estimating the binomial parameter θ, the LS, MLE and UMVUE estimator of the population proportion is the sample proportion.

Bias

From a Bayesian perspective, point estimation means summarizing the posterior distribution by a single statistic, such as the posterior mean, median or mode. Here, we will use the posterior mean as the Bayesian point estimate (it minimizes the posterior mean squared error, which gives it a decision-theoretic justification).

An estimator is said to be unbiased if the mean of its sampling distribution is the true parameter, i.e. θ̂ is unbiased if

E[θ̂] = ∫ θ̂ f(θ̂|θ) dθ̂ = θ,

where f(θ̂|θ) is the sampling distribution of the estimator θ̂ given the parameter θ. The bias of an estimator is

bias(θ̂) = E[θ̂] − θ.

(Bayes estimators are usually biased.)

Mean Squared Error

An estimator is said to be a minimum variance unbiased estimator if no other unbiased estimator has a smaller variance. However, it is possible that there may be a biased estimator that, on average, is closer to the true value than the unbiased estimator. We need to look at the possible trade-off between bias and variance.
The (frequentist) mean squared error of an estimator is the average squared distance of the estimator from the true value:

MS(θ̂) = E[(θ̂ − θ)²] = ∫ (θ̂ − θ)² f(θ̂|θ) dθ̂.

One can show that

MS(θ̂) = bias(θ̂)² + Var(θ̂).

Thus, it gives a better frequentist criterion for judging estimators than the bias or the variance alone.

MSE Comparison

We will now compare the mean squared error of the Bayesian and the frequentist estimator of the population proportion θ.
The frequentist estimator for θ is

θ̂_f = X/n,

where X, the number of successes in n trials, has the Binomial(n, θ) distribution with mean and variance

E(X) = nθ,   Var(X) = nθ(1 − θ).

Thus, the mean of its sampling distribution is E[θ̂_f] = θ, the variance of its sampling distribution is Var(θ̂_f) = θ(1 − θ)/n, and

MS(θ̂_f) = 0 + θ(1 − θ)/n.

MSE Comparison

Suppose we use the posterior mean as the Bayesian estimate for θ, where we use the Beta(1, 1) prior (uniform prior); then

θ̂_B = (1 + x)/(n + 2) = x/(n + 2) + 1/(n + 2),

and

E[θ̂_B] = nθ/(n + 2) + 1/(n + 2),   Var(θ̂_B) = nθ(1 − θ)/(n + 2)².

Hence the mean squared error is

MS(θ̂_B) = ( nθ/(n + 2) + 1/(n + 2) − θ )² + nθ(1 − θ)/(n + 2)²
         = ( (1 − 2θ)/(n + 2) )² + nθ(1 − θ)/(n + 2)².

MSE Comparison

For example, suppose θ = 0.4 and the sample size is n = 10. Then

MS(θ̂_f) = 0.4 · 0.6/10 = 0.024   and   MS(θ̂_B) = 0.0169.

Next, suppose θ = 0.5 and n = 10. Then

MS(θ̂_f) = 0.025   and   MS(θ̂_B) = 0.01736.

MSE Comparison

Figure 8 shows the mean squared error for the Bayesian and the frequentist estimator as a function of θ. Over most (but not all) of the range, the Bayesian estimator (using the uniform prior) performs better than the frequentist estimator.

Figure 8: Mean squared error for the two estimators.
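A short R sketch (illustrative) reproduces Figure 8 for n = 10:

theta <- seq(0, 1, length.out = 500)
n <- 10
mse.freq  <- theta * (1 - theta) / n
mse.bayes <- ((1 - 2 * theta) / (n + 2))^2 + n * theta * (1 - theta) / (n + 2)^2
plot(theta, mse.bayes, type = "l", lty = 1, ylim = c(0, 0.025),
     xlab = "theta", ylab = "MSE")
lines(theta, mse.freq, lty = 2)
legend("top", lty = 1:2, legend = c("Bayes", "frequentist"))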

3 Conjugate Distributions

3.6 Comparing Bayesian and Frequentist Inference for Proportion

3 Conjugate Distributions

Interval Estimation

3.6 Comparing Bayesian and Frequentist Inference for Proportion

Confidence Credible Interval


The aim is to find an interval (l, u) that has a predetermined probability of containing the parameter:

P(l ≤ θ ≤ u) = 1 − α.

In the frequentist interpretation, the parameter is fixed but unknown and, before the sample is taken, the interval endpoints are random because they depend on the data. After the sample is taken and the endpoints are calculated, there is nothing random any more, so the interval is called a confidence interval for the parameter. Under the frequentist paradigm, the correct interpretation of a (1 − α) × 100% confidence interval is that (1 − α) × 100% of the random intervals calculated this way will contain the true value.

Often, the sampling distribution of the estimator is approximately normal or t_(n−1) distributed with mean equal to the true value. In this case, the confidence interval has the form

estimator ± critical value × stdev of estimator,

where the critical value comes from the normal or t table. For the sample proportion, an approximate (1 − α) × 100% confidence interval for π is given by

π̂_f ± t_(n−1)(α/2) √( π̂_f(1 − π̂_f)/n ).

A Bayesian credible interval for the parameter, on the other hand, has the natural interpretation that we want. Because it is found from the posterior distribution of π, it has the stated coverage probability for this specific data set.

Example: Interval Estimation

Example 3.1
Out of a random sample of 100 Hamilton residents, x = 26 said they support a casino in Hamilton. Compare the frequentist 95% confidence interval with the Bayesian credible interval (using a uniform prior).

Frequentist 95% confidence interval:

0.26 ± 1.96 √(0.26 × 0.74/100) = (0.174, 0.346)

Bayesian 95% credible interval:
prior: Beta(1,1)
posterior: Beta(1 + 26, 1 + 74) = Beta(27, 75)

> lu=qbeta(c(0.025,0.975),27,75)
> lu
[1] 0.1841349 0.3540134

Hypothesis Testing

Example 3.2
Suppose we wish to determine whether a new treatment is better than the standard treatment. If so, π, the proportion of patients who benefit from the new treatment, should be higher than π₀, the proportion who benefit from the standard treatment. It is known from historical records that π₀ = 0.6. A random group of 10 patients is given the new treatment. X, the number who benefit from the treatment, will be Binomial(n, π). We observe that x = 8 patients benefit. This is better than we would expect if π = 0.6. But is it sufficiently better for us to conclude that π > 0.6 at the 5% level of significance?

The following table gives the null distribution of X:

x         0      1      2      3      4      5      6      7      8      9      10
f(x|π₀)   .0001  .0016  .0106  .0425  .1115  .2007  .2508  .2150  .1209  .0403  .0060

Frequentist Test

H₀: π ≤ 0.6   versus   H₁: π > 0.6

Under H₀: X | π = 0.6 ~ Binomial(10, 0.6)

P-value = P(X ≥ 8 | H₀ true) = P(X ≥ 8 | π = 0.6) = 1 − pbinom(7, 10, 0.6)
        = 0.1209 + 0.0403 + 0.0060 = 0.1672 > 0.05,   so we do not reject H₀.

Bayesian Test

prior: Beta(1,1)
data: x = 8, n − x = 2
posterior: Beta(9,3)

P(H₀ | x = 8) = P(π ≤ 0.6 | x = 8) = pbeta(0.6, 9, 3) = 0.1189
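Both numbers on this slide can be checked directly in R:

## frequentist one-sided P-value and Bayesian posterior probability of H0
## for Example 3.2 (x = 8 out of n = 10, pi0 = 0.6, Beta(1,1) prior)
p.value <- 1 - pbinom(7, 10, 0.6)   # P(X >= 8 | pi = 0.6)
post.H0 <- pbeta(0.6, 9, 3)         # P(pi <= 0.6 | x = 8)
c(p.value = p.value, post.H0 = post.H0)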

3.7 Exponential Distribution

Exponential Data and Gamma Prior

The exponential distribution is commonly used to model waiting times and other continuous positive real-valued random variables, usually measured on a time scale. The sampling distribution of an outcome x, given parameter θ, is

f(x|θ) = θ exp(−θx)   for x > 0.

The exponential distribution is a special case of the Gamma distribution with parameters (α, β) = (1, θ).

Let X₁, …, Xₙ be iid Exponential(θ) random variables.

Likelihood:
f(x|θ) ∝ θⁿ exp(−nθx̄)

Conjugate Gamma(α, β) prior:
f(θ) = βᵅ θ^(α−1) exp(−βθ) / Γ(α)

Posterior density:
f(θ|x) ∝ θ^(α+n−1) exp(−θ(nx̄ + β)),   i.e. θ|x ~ Gamma(α + n, β + nx̄)
Exponential Example

Example 3.3
Let Yᵢ, i = 1, …, n, be iid exponentially distributed.

i) Using a conjugate Gamma(α, β) distribution, derive the posterior mean, variance, and mode of θ. For which values α and β does the posterior mode coincide with the ML estimate of θ?

ii) What is the posterior density of the mean μ = 1/θ? Which distribution is conjugate for μ?

iii) The length of life of a light bulb manufactured by a certain process has an exponential distribution with unknown rate θ. Suppose the prior distribution for θ is a Gamma distribution with coefficient of variation 0.5. (The coefficient of variation is defined as the standard deviation divided by the mean.) A random sample of light bulbs is to be tested and the lifetime of each obtained. If the coefficient of variation of the distribution of θ is to be reduced to 0.1, how many light bulbs need to be tested?

iv) In part iii), if the coefficient of variation refers to μ instead of θ, how would your answer be changed?

3.8 Poisson Distribution

Poisson Data and Gamma Prior

Let X be the number of times a certain event occurs in a unit interval of time, and suppose the following conditions hold:

- The events are occurring at a constant average rate of θ per unit time.
- The number of events in any one interval of time is statistically independent of the number in any other nonoverlapping interval.
- The probability of more than one event occurring in an interval of length d goes to zero as d goes to zero.

Any process producing events which satisfies the above three axioms is called a Poisson process, and X, the number of events in a unit time interval, is distributed as Poisson(θ).

Let X be a Poisson(θ) random variable and suppose we observe X = x.

Likelihood:
f(x|θ) = θˣ e^(−θ) / x!  ∝  θˣ e^(−θ)

Conjugate Gamma(α, β) prior:
f(θ) = βᵅ θ^(α−1) exp(−βθ) / Γ(α)  ∝  θ^(α−1) exp(−βθ)

Calculating Posterior

Posterior density:

f(θ|x) ∝ f(θ) f(x|θ) ∝ θ^(α−1) e^(−βθ) · θˣ e^(−θ) = θ^(α+x−1) e^(−(β+1)θ),

i.e. the pdf of Gamma(α + x, β + 1).

Prior Predictive Distribution

Prior predictive distribution for X: f(x)

f(x) = ∫₀^∞ f(x|θ) f(θ) dθ
     = ∫₀^∞ (θˣ e^(−θ)/x!) (βᵅ/Γ(α)) θ^(α−1) e^(−βθ) dθ
     = (βᵅ/(Γ(α) x!)) ∫₀^∞ θ^(α+x−1) e^(−(β+1)θ) dθ
     = (βᵅ/(Γ(α) x!)) · Γ(α + x)/(β + 1)^(α+x)
     = ((α + x − 1)! / ((α − 1)! x!)) · (β/(β+1))ᵅ (1/(β+1))ˣ
     = C(α + x − 1, x) (β/(β+1))ᵅ (1/(β+1))ˣ,   x = 0, 1, 2, …

Negative Binomial

The prior predictive pmf derived above is that of a Negative-Binomial(α, β) distribution, i.e. the distribution of the number of Bernoulli failures obtained before the αth success when the success probability is p = β/(β+1). This shows that

Neg-bin(x|α, β) = ∫ Poisson(x|θ) Gamma(θ|α, β) dθ.
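A quick simulation check of this mixture identity in R (the hyperparameters α = 3, β = 2 are illustrative only):

## the Poisson-Gamma mixture should match dnbinom(size = a, prob = b/(b+1))
set.seed(42)
a <- 3; b <- 2
theta <- rgamma(1e5, shape = a, rate = b)   # draws from the Gamma prior
x <- rpois(1e5, theta)                      # Poisson draws given theta
rbind(simulated = table(factor(x, levels = 0:5)) / 1e5,
      negbin    = dnbinom(0:5, size = a, prob = b / (b + 1)))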

Multiple Poisson Data

Now let X₁, …, Xₙ be iid Poisson(θ) random variables. Suppose we observe x = (x₁, …, xₙ).

Likelihood:
f(x|θ) = ∏_(i=1)^n f(xᵢ|θ) = ∏_(i=1)^n θ^(xᵢ) e^(−θ)/xᵢ!
       = (1/∏ᵢ xᵢ!) θ^(Σᵢ xᵢ) e^(−nθ)  ∝  θ^(n x̄) e^(−nθ)

Conjugate Gamma(α, β) prior:
f(θ) ∝ θ^(α−1) exp(−βθ)

Posterior density:
f(θ|x) ∝ f(θ) f(x|θ) ∝ θ^(α−1) e^(−βθ) · θ^(n x̄) e^(−nθ) = θ^(α+n x̄−1) e^(−(β+n)θ),

i.e. the pdf of Gamma(α + n x̄, β + n).

Poisson Example

Example 3.4
Suppose that causes of death are reviewed in detail for a city in the US for a single year. It is found that 3 persons, out of a population of 200,000, died of asthma, giving a crude estimated asthma mortality rate in the city of 1.5 per 100,000 persons per year. A Poisson sampling model is often used for epidemiological data of this form. Let θ represent the true underlying long-term asthma mortality rate in the city (measured in cases per 100,000 persons per year). Reviews of asthma mortality rates around the world suggest that mortality rates above 1.5 per 100,000 people are rare in Western countries, with typical asthma mortality rates around 0.6 per 100,000.

a) Construct a conjugate prior density and derive the posterior distribution of θ.


b) What is the posterior probability that the long-term death rate from asthma in the city is more than 1.0 per 100,000 per year?

c) What is the posterior predictive distribution of a future observation Y?

d) To consider the effect of additional data, suppose that ten years of data are obtained for the city in this example with y = 30 deaths over 10 years. Assuming the population is constant at 200,000, and assuming the outcomes in the ten years are independent with constant long-term rate θ, derive the posterior distribution of θ.

e) What is the posterior probability that the long-term death rate from asthma in the city is more than 1.0 per 100,000 per year?
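Once a conjugate Gamma(α, β) prior has been chosen in part a), parts a) and b) reduce to a one-line update and a tail probability. A hedged R sketch, in which the prior Gamma(3, 5) (prior mean 0.6 per 100,000) and the exposure of 2.0 units of 100,000 person-years are illustrative assumptions, not part of the exercise statement:

## illustrative Gamma-Poisson update for Example 3.4
a0 <- 3; b0 <- 5   # assumed prior: mean a0/b0 = 0.6 per 100,000
y  <- 3            # observed asthma deaths in one year
ex <- 2.0          # exposure: 200,000 persons = 2.0 units of 100,000 person-years
a1 <- a0 + y       # posterior Gamma(a0 + y, b0 + exposure)
b1 <- b0 + ex
c(post.mean = a1 / b1,
  P.theta.gt.1 = 1 - pgamma(1.0, a1, b1))   # cf. part b)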


3.9 Normal Distribution

Normal Data, Known Variance, Single Observation

A random variable X has a Normal distribution with mean θ and variance σ² if X has a continuous distribution with pdf

f(x) = (1/(√(2π) σ)) exp( −(1/2) ((x − θ)/σ)² )   for −∞ < x < ∞.

Normal Example

Example 3.5
According to Kennett and Ross (1983), Geochronology, London: Longmans, the first apparently reliable datings for the age of the Ennerdale granophyre were obtained from the K/Ar method (which depends on observing the relative proportions of potassium-40 and argon-40 in the rock) in the 1960s and early 1970s, and these resulted in an estimate of 370 ± 20 million years. Later in the 1970s, measurements based on the Rb/Sr method (depending on the relative proportions of rubidium-87 and strontium-87) gave an age of 421 ± 8 million years. It appears that the errors marked are meant to be standard deviations, and it seems plausible that the errors are normally distributed. If a scientist had the K/Ar measurements available in the early 1970s, these could be the basis of her prior beliefs about the age of these rocks.
Normal Prior

Likelihood: X|θ ~ N(θ, σ²), σ² known:

f(x|θ) = (1/(√(2π) σ)) exp( −(x − θ)²/(2σ²) )  ∝  exp( −(x − θ)²/(2σ²) )

Conjugate prior: θ ~ N(μ₀, τ₀²), where μ₀ and τ₀² are hyperparameters:

f(θ) = (1/(√(2π) τ₀)) exp( −(θ − μ₀)²/(2τ₀²) )  ∝  exp( −(θ − μ₀)²/(2τ₀²) )

Posterior: θ|x ~ N(μ₁, τ₁²), where

μ₁ = ( μ₀/τ₀² + x/σ² ) / ( 1/τ₀² + 1/σ² )   and   1/τ₁² = 1/τ₀² + 1/σ².

NB:
- posterior precision = prior precision + data precision
- posterior mean = weighted average of prior mean and observation

Calculating Posterior

f(θ|x) ∝ f(x|θ) f(θ)
       ∝ exp( −(1/2) [ (x − θ)²/σ² + (θ − μ₀)²/τ₀² ] )
       = exp( −(1/2) [ (x² − 2xθ + θ²)/σ² + (θ² − 2θμ₀ + μ₀²)/τ₀² ] )
       ∝ exp( −(1/2) [ θ²(1/σ² + 1/τ₀²) − 2θ(x/σ² + μ₀/τ₀²) ] + const. )
       ∝ exp( −(1/(2τ₁²)) (θ − μ₁)² ),

with μ₁ and τ₁² as given on the previous slide.

Posterior Mean Expressions

Alternative expressions for the posterior mean:

μ₁ = μ₀ + (x − μ₀) τ₀²/(σ² + τ₀²)   (prior mean adjusted towards the observed value)

μ₁ = x − (x − μ₀) σ²/(σ² + τ₀²)   (data shrunk towards the prior mean)
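A small R helper implementing this update; the numbers plugged in are those of Example 3.5 (K/Ar prior 370 ± 20 million years, Rb/Sr observation 421 with standard deviation 8):

## posterior for a single normal observation with known variance
normal.update <- function(x, sigma, mu0, tau0) {
  prec1 <- 1 / tau0^2 + 1 / sigma^2               # posterior precision
  mu1   <- (mu0 / tau0^2 + x / sigma^2) / prec1   # weighted average
  c(mean = mu1, sd = sqrt(1 / prec1))
}
normal.update(x = 421, sigma = 8, mu0 = 370, tau0 = 20)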


Prior Predictive Distribution

Prior predictive distribution of X:   X ~ N(μ₀, σ² + τ₀²).

Because f(x) = ∫ f(x|θ) f(θ) dθ and

f(x, θ) = f(x|θ) f(θ) ∝ exp( −(x − θ)²/(2σ²) ) exp( −(θ − μ₀)²/(2τ₀²) ),

i.e. (X, θ) have a bivariate normal distribution, the marginal distribution of X is normal.

Reminder: Conditional Mean and Variance

If U and V are random variables, then

E[U] = E[E[U|V]]
Var(U) = E[Var(U|V)] + Var(E[U|V])

Now:

E[X] = E[E[X|θ]] = E[θ] = μ₀
Var(X) = E[Var(X|θ)] + Var(E[X|θ]) = E[σ²] + Var(θ) = σ² + τ₀²

Posterior Predictive Distribution

Posterior predictive distribution of a future observation Y:   Y|x ~ N(μ₁, σ² + τ₁²).

Because f(y|x) = ∫ f(y|θ) f(θ|x) dθ and

f(y, θ|x) = f(y|θ) f(θ|x) ∝ exp( −(y − θ)²/(2σ²) ) exp( −(θ − μ₁)²/(2τ₁²) ),

i.e. (Y, θ)|x have a bivariate normal distribution, the marginal distribution of Y|x is normal.

Now:

E[Y|x] = E[E[Y|θ]|x] = E[θ|x] = μ₁
Var(Y|x) = E[Var(Y|θ)|x] + Var(E[Y|θ]|x) = E[σ²|x] + Var(θ|x) = σ² + τ₁²

Normal Example

Now back to Example 3.5: single Normal observation, Normal prior.

Figure 9: Conjugate Normal prior and single observation (prior, likelihood and posterior densities for the age μ, in millions of years).


Normal Data, Known Variance, Multiple Observations

Normal Example

Example 3.6
What is now called the National Institute of Standards and Technology (NIST) in Washington DC conducts extremely high precision measurement of physical constants, such as the actual weight of so-called check-weights that are supposed to serve as reference standards (like the official kg). In 1962-63, for example, n = 100 weighings of a block of metal called NB10, which was supposed to weigh exactly 10g, were made under conditions as close to iid as possible. The 100 measurements x₁, …, xₙ (the units are micrograms below 10g) have a mean of x̄ = 404.6 and an SD of s = 6.5.

weight     375 392 393 397 398 399 400 401 402 403 404 405
frequency    1   1   1   1   2   7   4  12   8   6   9   5

weight     406 407 408 409 410 411 412 413 415 418 423 437
frequency   12   8   5   5   4   1   3   1   1   1   1   1

Questions:
1. How much does NB10 really weigh?
2. How certain are you, given the data, that the true weight of NB10 is less than 405.25 micrograms below 10g?
3. What is the underlying accuracy of the NB10 measuring process?
4. How accurately can you predict the 101st measurement?

A Normal qq-plot shows that a Normal sampling distribution is appropriate. We first assume that σ² is known.

Calculating Posterior

Likelihood: Xᵢ|θ ~ iid N(θ, σ²), i = 1, …, n, σ² known
Conjugate prior: θ ~ N(μ₀, τ₀²)

Posterior: θ|x ~ N(μₙ, τₙ²), where

μₙ = ( μ₀/τ₀² + n x̄/σ² ) / ( 1/τ₀² + n/σ² )   and   1/τₙ² = 1/τ₀² + n/σ².

Why? Reduction to the case of a single data point of the previous section: the likelihood depends on the data x₁, …, xₙ only through the sufficient statistic x̄. If X₁, …, Xₙ|θ ~ iid N(θ, σ²), then

f(x₁, …, xₙ|θ) = ∏_(i=1)^n f(xᵢ|θ) = ∏_(i=1)^n (1/(√(2π)σ)) exp( −(xᵢ − θ)²/(2σ²) )
              = const. × exp( −(1/2) Σᵢ (xᵢ − θ)²/σ² )
              ∝ exp( −(1/2) ((x̄ − θ)/(σ/√n))² ),

and X̄|θ ~ N(θ, σ²/n). Thus, in the previous section, simply substitute σ² by σ²/n and x by x̄.

Remarks

1. If τ₀² = σ², then
   μₙ = (μ₀ + n x̄)/(n + 1) = (μ₀ + Σᵢ xᵢ)/(n + 1)   and   τₙ² = σ²/(n + 1),
   i.e. the prior has the weight of one additional observation with value μ₀.

2. If n is large, the posterior is determined by x̄ and σ².

3. If τ₀² → ∞ (diffuse prior) and n is fixed, then
   θ|x ~ N(x̄, σ²/n),   i.e. posterior mean = MLE.

4. The prior information is equivalent to σ²/τ₀² additional observations all equal to μ₀, since
   μₙ = ( (σ²/τ₀²) μ₀ + n x̄ ) / ( σ²/τ₀² + n ).
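For the NB10 data of Example 3.6 the same update applies with σ² replaced by σ²/n. A sketch in R, where the prior N(400, 10²) is purely illustrative (the notes do not fix one) and s = 6.5 is plugged in for σ:

## posterior for multiple normal observations with known variance
n <- 100; xbar <- 404.6; sigma <- 6.5
mu0 <- 400; tau0 <- 10                       # assumed prior hyperparameters
prec.n <- 1 / tau0^2 + n / sigma^2
mu.n   <- (mu0 / tau0^2 + n * xbar / sigma^2) / prec.n
tau.n  <- sqrt(1 / prec.n)
c(post.mean = mu.n, post.sd = tau.n,
  P.less.405.25 = pnorm(405.25, mu.n, tau.n))   # cf. Question 2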



Back to Normal Example

Figure 10: Conjugate Normal prior and several observations (prior, likelihood and posterior densities for μ; multiple Normal observations, Normal prior).

Normal Data, Known Variance, Noninformative Prior

Example 3.7
Changes in blood pressure (in mmHg) were recorded for each of 100 patients, where negative numbers are decreases while on the drug and positive numbers are increases:

+3.7  −6.7  −10.5  …  −16.7  −7.2

with sample mean x̄ = −7.99 and standard deviation s = 4.33.

We will assume that the change in blood pressure X has a Normal distribution with unknown mean θ and known variance σ² = 4.33².


Example

Let us assume that we don't know anything about the mean change in blood pressure induced by the new drug and thus assume that θ can attain any real value with equal probability. This gives a flat prior distribution for θ on (−∞, ∞), i.e.

f(θ) ∝ 1.

(There is no proper continuous uniform distribution on (−∞, ∞), but you can think of θ being uniform on some finite interval (−a, a), for some large a, and ignore the normalization constant, as it is not needed for the application of Bayes' theorem.)

What is the posterior distribution of θ?

Calculating Posterior

Posterior pdf:

f(θ|x̄) ∝ prior × likelihood ∝ f(θ) f(x̄|θ)
        ∝ 1 × exp( −(1/2) ((x̄ − θ)/(σ/√n))² ),

i.e. the pdf of Normal(x̄, σ²/n).

Simple Updating Rule

If Xᵢ ~ iid Normal(θ, σ²), i = 1, …, n, and a flat prior is used, then the posterior distribution of θ|x is

Normal(μₙ, τₙ²)   with   μₙ = x̄   and   τₙ² = σ²/n.

In Example 3.7:
μₙ = −7.99,   τₙ² = 4.33²/100 = 0.187489.

Credible Intervals

95% posterior probability interval for θ:

L = 2.5% quantile of N(−7.99, 0.187489)
U = 97.5% quantile of N(−7.99, 0.187489)

In R:

> lu=qnorm(c(0.025,0.975),-7.99,sqrt(0.187489))
> lu
[1] -8.838664 -7.141336

Hypothesis Test

Test the null hypothesis H₀: θ ≤ −7.0.

P(H₀|x̄) = P(θ ≤ −7.0|x̄)

In R:

> p=pnorm(-7,-7.99,sqrt(0.187489))
> p
[1] 0.9888838

2-Parameter Normal with Conjugate Prior

Prior distribution:

θ | σ² ~ N(μ₀, σ²/κ₀)
σ² ~ Inv-χ²(ν₀, σ₀²)

i.e. (θ, σ²) ~ N-Inv-χ²(μ₀, σ₀²/κ₀, ν₀, σ₀²),

where Inv-χ²(ν₀, σ₀²) denotes the scaled inverse χ²-distribution with scale σ₀² and ν₀ degrees of freedom, i.e. the distribution of σ₀²ν₀/Z where Z is a χ² random variable with ν₀ degrees of freedom.

Joint prior density:

f(θ, σ²) ∝ σ^(−1) (σ²)^(−(ν₀/2+1)) exp( −(1/(2σ²)) [ν₀σ₀² + κ₀(μ₀ − θ)²] )

Joint posterior density:

f(θ, σ²|x) ∝ σ^(−1) (σ²)^(−(ν₀/2+1)) exp( −(1/(2σ²)) [ν₀σ₀² + κ₀(μ₀ − θ)²] )
             × (σ²)^(−n/2) exp( −(1/(2σ²)) [(n − 1)s² + n(x̄ − θ)²] )
           ∝ N-Inv-χ²(μₙ, σₙ²/κₙ, νₙ, σₙ²),

where

μₙ = (κ₀μ₀ + n x̄)/(κ₀ + n)
κₙ = κ₀ + n
νₙ = ν₀ + n
νₙσₙ² = ν₀σ₀² + (n − 1)s² + (κ₀n/(κ₀ + n)) (x̄ − μ₀)².

Conditional posterior of θ:

θ | σ², x ~ N(μₙ, σ²/κₙ) = N( (κ₀μ₀/σ² + n x̄/σ²)/(κ₀/σ² + n/σ²), 1/(κ₀/σ² + n/σ²) )

Marginal posterior of σ²:

σ² | x ~ Inv-χ²(νₙ, σₙ²)

Marginal posterior of θ:

f(θ|x) ∝ [ 1 + κₙ(θ − μₙ)²/(νₙσₙ²) ]^(−(νₙ+1)/2)  ∝  t_(νₙ)(θ | μₙ, σₙ²/κₙ)
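Simulating from this joint posterior is straightforward: draw σ² from its scaled inverse-χ² marginal and then θ from its conditional normal. A sketch in R, where the posterior hyperparameters are placeholder values rather than results from the notes:

## draws from the Normal-Inverse-chi^2 posterior (mu.n, kappa.n, nu.n, s2.n assumed)
rnorminvchisq <- function(m, mu.n, kappa.n, nu.n, s2.n) {
  sigma2 <- nu.n * s2.n / rchisq(m, df = nu.n)       # scaled inverse chi^2
  theta  <- rnorm(m, mu.n, sqrt(sigma2 / kappa.n))   # conditional normal
  cbind(theta = theta, sigma2 = sigma2)
}
draws <- rnorminvchisq(10000, mu.n = 404.3, kappa.n = 101, nu.n = 100, s2.n = 42)
apply(draws, 2, quantile, probs = c(0.025, 0.5, 0.975))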

3.10 Normal Linear Regression

Normal Linear Regression

This can be extended to linear regression models.

Sampling distribution:

Yᵢ | μᵢ, σ² ~ N(μᵢ, σ²),   i = 1, …, n,
with μᵢ = β₀ + β₁xᵢ₁ + ⋯ + β_(p−1)x_(i,p−1) = xᵢ′β,

or in matrix notation, with the n × p design matrix X whose ith row is xᵢ′ = (1, xᵢ₁, …, x_(i,p−1)) and β = (β₀, β₁, …, β_(p−1))′:

Y | β, σ² ~ Nₙ(Xβ, σ² Iₙ).

Conjugate Normal-Inverse-Gamma Prior

The multivariate normal-inverse gamma prior distribution (β, σ²) ~ NIG(μ, V, a, b) is conjugate and can be specified as:

β | σ² ~ N_p(μ, σ² V)
σ² ~ Inv-Gamma(a, b).

The posterior is NIG(β̃, Ṽ, ã, b̃) with

β̃ = Ṽ (X′y + V⁻¹μ)
Ṽ = (X′X + V⁻¹)⁻¹
ã = n/2 + a
b̃ = SS/2 + b
SS = y′y − β̃′Ṽ⁻¹β̃ + μ′V⁻¹μ.

Weighted Average

β̃ can be written as a weighted average of the prior mean and the MLE, as in the univariate normal case:

β̃ = W β̂ + (I_p − W) μ   with   W = (X′X + V⁻¹)⁻¹ X′X,

where β̂ = (X′X)⁻¹X′y is the MLE.

The marginal posterior distribution of β is a multivariate Student distribution. For details, see Bernardo and Smith (1994). The marginal posterior distribution of σ² is an Inverse Gamma distribution with parameters ã and b̃ as above.
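A compact R sketch of this conjugate update, using the formulas above; the prior hyperparameters and the simulated data are illustrative only:

## NIG posterior for the normal linear model
nig.posterior <- function(y, X, mu, V, a, b) {
  Vinv   <- solve(V)
  V.post <- solve(crossprod(X) + Vinv)                    # (X'X + V^-1)^-1
  b.post <- V.post %*% (crossprod(X, y) + Vinv %*% mu)    # posterior mean of beta
  SS <- drop(crossprod(y) - t(b.post) %*% solve(V.post) %*% b.post
             + t(mu) %*% Vinv %*% mu)
  list(beta = drop(b.post), V = V.post, a = length(y) / 2 + a, b = SS / 2 + b)
}
set.seed(1)
X <- cbind(1, rnorm(50)); y <- drop(X %*% c(2, 1)) + rnorm(50)
nig.posterior(y, X, mu = c(0, 0), V = diag(2) * 100, a = 0.01, b = 0.01)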


4 WinBUGS Applications
4.1 WinBUGS Handouts

WinBUGS Applications: Overview

Calculation of the posterior distribution is difficult in situations with
- nonconjugate priors
- multiple parameters,

as we need to calculate summary statistics, like mean and variance, and, in high-dimensional problems, marginal posterior distributions. All this involves integration, which has been a very big hurdle for Bayesian inference in the past.

For low parameter dimensions, say 2, 3, 4, 5, numerical integration techniques, asymptotic approximations etc. may be used, but these break down for higher dimensions.

The most successful approach, for reasons that we will discuss in the subsequent sections, is based on simulation. That means, instead of explicitly calculating the posterior and performing integrations, we generate a sample from the posterior distribution and use that sample to approximate any quantity of interest, e.g. approximate the posterior mean by the sample mean etc.

A very versatile software package for these posterior simulations is WinBUGS, the Windows version of BUGS (Bayesian inference Using Gibbs Sampling), developed by David Spiegelhalter and colleagues at the MRC Biostatistics Unit of Cambridge University, England.


WinBUGS Handouts

WinBUGS uses the Gibbs sampler to generate samples from the posterior distribution of the parameters of a Bayesian model. We will discuss the Gibbs sampler and other Markov chain Monte Carlo techniques in detail in Chapter 6. For now, we simply treat the simulation method used in WinBUGS as a black box, but keep in mind that the samples generated are not independent but dependent, i.e. they are samples from a Markov chain that converges towards the posterior distribution. Therefore, we can use the samples only from a point in time where convergence has set in and need to discard the initial so-called burn-in samples.

We illustrate this sampling-based approach using our familiar example of Binomial data with a conjugate prior distribution and refer to the handout "Brief Introduction to WinBUGS". Other handouts will discuss running WinBUGS in batch mode, from within R using R2WinBUGS, and how to use the R package CODA for convergence diagnostics.

Once familiar with WinBUGS, we will look at the huge range of Bayesian models, especially Bayesian hierarchical models, that can be handled with WinBUGS and concentrate on practical implementation issues rather than theory. The underlying theory will be recouped in the subsequent chapters.

4.2 Bayesian Linear Regression

Simple Linear Regression

In regression analysis, we look at the conditional distribution of the response variable at different levels of a predictor variable.

- Response variable Y: also called dependent or outcome variable; what we want to explain or predict; in simple linear regression, the response variable is continuous.
- Predictor variables X₁, …, X_p: also called independent variables or covariates; in simple linear regression, the predictor variable is usually continuous.
- Which variable is response and which is predictor depends on our research question.

Example

Example 4.1
This example investigates the quality of the delivery system network of a softdrink company, see Example 5.1 in Ntzoufras (2009). One is interested in estimating the time each employee needs to refill an automatic vending machine owned and served by the company. For this reason, a small quality assurance study was set up by an industrial engineer of the company. The response variable is the total service time (measured in minutes) of each machine, including its stocking with beverages and any required maintenance or housekeeping. After examining the problem, the industrial engineer recommends two important variables that affect delivery time: the number of cases of stocked products and the distance walked by the employee (measured in feet). A dataset of 25 observations was finally collected.


Data: Softdrink Delivery Times


Delivery Time   Cases   Distance
16.68            7       560
11.5             3       220
12.03            3       340
14.88            4        80
13.75            6       150
18.11            7       330
 8               2       110
17.83            7       210
79.24           30      1460
21.5             5       605
40.33           16       688
21              10       215
13.5             4       255
19.75            6       462
24               9       448
29              10       776
15.35            6       200
19               7       132
 9.5             3        36
35.1            17       770
17.9            10       140
52.32           26       810
18.75            9       450
19.83            8       635
10.75            4       150

Model Assumptions

The explanatory variables are assumed fixed, their values denoted by xᵢ₁, …, xᵢ_p for i = 1, …, n. Given the values of the explanatory variables, the observations of the response variable are assumed independent and normally distributed:

Yᵢ | xᵢ₁, …, xᵢ_p ~ N(μᵢ, σ²)   with   μᵢ = β₀ + β₁xᵢ₁ + ⋯ + β_p xᵢ_p,   for i = 1, …, n,

or in matrix notation:

Y | x ~ Nₙ(μ, σ² I)   with   μ = Xβ,

where σ² and β = (β₀, β₁, …, β_p)′ are the regression parameters, I denotes the identity matrix, Y the vector of observations and X = (xᵢⱼ) the n × (p + 1) design matrix.

Likelihood Specification in WinBUGS

Note that in WinBUGS the normal distribution is parametrized in terms of the precision τ = 1/σ². The likelihood is thus specified by:

for (i in 1:n){
   y[i] ~ dnorm(mu[i],tau)
   mu[i] <- beta0 + beta1*x1[i] + ... + betap*xp[i]
}
sigma2 <- 1/tau
sigma <- sqrt(sigma2)

Prior Specification

In normal regression models, the simplest approach is to assume that all parameters are a priori independent, i.e.

f(β, τ) = ∏_(j=0)^p f(βⱼ) · f(τ)

with

βⱼ ~ N(μⱼ, cⱼ²)   for j = 0, …, p
τ ~ Gamma(a, b).

Thus, the precision has a prior mean of E(τ) = a/b and prior variance Var(τ) = a/b². This corresponds to an Inverse Gamma prior distribution for σ² with E(σ²) = b/(a − 1) and Var(σ²) = b²/((a − 1)²(a − 2)).

No information about βⱼ: μⱼ = 0 and cⱼ² = 10000.
No information about τ: a = b = 0.001.

Prior Specification in WinBUGS

beta0 ~ dnorm(0.0,1.0E-4)
beta1 ~ dnorm(0.0,1.0E-4)
....
betap ~ dnorm(0.0,1.0E-4)
tau   ~ dgamma(0.001,0.001)

Interpretation of Regression Coefficients

Each regression coefficient βⱼ measures the effect of the explanatory variable Xⱼ on the expected value of the response variable Y, adjusted for the remaining covariates.

Questions of interest are:
1. Is the effect of Xⱼ important for the description of Y?
2. What is the association between Y and Xⱼ (positive or negative)?
3. What is the magnitude of the effect of Xⱼ on Y?

Interpretation of Regression Coefficients

Answers:

1. Look at the posterior distribution of βⱼ and its credible interval. Does the credible interval contain 0?

2. Calculate the posterior probabilities P(βⱼ > 0) and P(βⱼ < 0). In WinBUGS, use the step function
   p.betaj <- step(betaj)
   which creates a binary node p.betaj taking the value 1 if βⱼ > 0 and 0 otherwise.

3. The posterior mean/median of βⱼ is a measure of the posterior expected change of the response variable Y if Xⱼ increases by 1 unit and all other covariates are fixed.

Interpretation of β₀

β₀ measures the posterior expected value of Y if all covariates are zero. Often, zero is not in the range of the covariates, and then the interpretation of β₀ is not meaningful.

Example: response: heart rate; covariate: body temperature in degrees C.

Better: center the covariates at their means, xᵢⱼᶜ = xᵢⱼ − x̄ⱼ:

μᵢ = β₀ᶜ + β₁ᶜ(xᵢ₁ − x̄₁) + ⋯ + β_pᶜ(xᵢ_p − x̄_p),

where β₀ᶜ is the expected value of Y when all covariates are equal to their means.

Centering the covariates is also advisable from a computational point of view: it decreases the posterior correlation between parameters and thus improves convergence of the Gibbs sampler. We will show this in Section 6.

Regression Example in WinBUGS

Prepare the data file by including the variable names to be used by WinBUGS at the top of each column and END at the end, and save it as a plain text file softdrinkdata.txt in your working directory:

cases[]  time[]   distance[]
7        16.68    560
3        11.5     220
3        12.03    340
4        14.88    80
6        13.75    150
...      ...      ...
17       35.1     770
10       17.9     140
26       52.32    810
9        18.75    450
8        19.83    635
4        10.75    150
END

For some odd reason (bug in WinBUGS?), make sure there is a blank line after END.

Regression Example in R

Alternatively, if we want to fit a linear model in the frequentist way in R first, to compare later on with the Bayesian results in WinBUGS, we read in the data, fit a linear model and output a list using dput(), using the following R commands:

softdrink <- read.table(file="softdrinkdata.txt", header=TRUE, sep="")
attach(softdrink)
cases_cent <- cases - mean(cases)
distance_cent <- distance - mean(distance)
summary(lm(time ~ cases_cent + distance_cent))
dput(list(time=time, cases=cases, distance=distance),
     "softdrinkdatalist.txt")

Regression Output in R

Call:
lm(formula = time ~ cases_cent + distance_cent)

Residuals:
    Min      1Q  Median      3Q     Max
-5.7880 -0.6629  0.4364  1.1566  7.4197

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)   22.384000   0.651895  34.337  < 2e-16 ***
cases_cent     1.615907   0.170735   9.464 3.25e-09 ***
distance_cent  0.014385   0.003613   3.981 0.000631 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 3.259 on 22 degrees of freedom
Multiple R-squared: 0.9596, Adjusted R-squared: 0.9559
F-statistic: 261.2 on 2 and 22 DF, p-value: 4.687e-16

Regression Model in WinBUGS

model
{
# likelihood
for (i in 1:n){
   time[i] ~ dnorm( mu[i],tau)
   mu[i] <- beta0 + beta1*(cases[i]-mean(cases[])) +
            beta2* (distance[i]-mean(distance[]))
}
# prior distributions
tau ~ dgamma(0.001,0.001)
beta0 ~ dnorm(0.0,1.0E-4)
beta1 ~ dnorm(0.0,1.0E-4)
beta2 ~ dnorm(0.0,1.0E-4)
# definition of sigma, sigma2, and sd(Y)
sigma2 <- 1/tau
sigma <- sqrt(sigma2)
# calculation of Bayesian version of R-squared
R2B <- 1-sigma2/pow(sd(time[]),2)
# posterior probabilities
p.beta0 <- step(beta0)
p.beta1 <- step(beta1)
p.beta2 <- step(beta2)
}
# inits
list(tau=1, beta0=1, beta1=0, beta2=0)

Regression Output in WinBUGS

node     mean     sd        MC error   2.5%      median   97.5%    start
R2B      0.9516   0.01732   7.742E-4   0.9063    0.9551   0.9737   1001
beta0    22.37    0.6681    0.02255    21.15     22.35    23.78    1001
beta1    1.61     0.1851    0.005237   1.254     1.606    1.992    1001
beta2    0.01447  0.003931  1.263E-4   0.006683  0.0144   0.02251  1001
p.beta0  1.0      0.0       3.162E-12  1.0       1.0      1.0      1001
p.beta1  1.0      0.0       3.162E-12  1.0       1.0      1.0      1001
p.beta2  1.0      0.0       3.162E-12  1.0       1.0      1.0      1001
sigma2   11.67    4.175     0.1866     6.364     10.82    22.7     1001

Bayesian Coefficient of Determination

A high value of the precision τ (low σ²) indicates that the model can accurately predict the expected value of Y. We can rescale this quantity using the sample variance of the response variable Y, s_Y², giving the R_B² statistic:

R_B² = 1 − σ²/s_Y² = 1 − 1/(τ s_Y²).

This quantity can be interpreted as the proportional reduction of uncertainty concerning the response variable Y achieved by incorporating the explanatory variables Xⱼ in the model.

It can be regarded as the Bayesian analogue of the adjusted coefficient of determination

R_adj² = 1 − σ̂²/s_Y²,   where   σ̂² = (1/(n − p)) Σ_(i=1)^n (yᵢ − ŷᵢ)²   with   ŷᵢ = β̂₀ + Σ_(j=1)^p xᵢⱼ β̂ⱼ,

and β̂ⱼ are the maximum likelihood estimates of βⱼ.

Missing Data

Missing data are easily incorporated in a Bayesian analysis. They are treated as unknown parameters to be estimated.

Assume, for instance, that observation 21 in the linear regression Example 4.1 was missing, i.e. time[21] for cases[21]=10 and distance[21]=140 was missing. In WinBUGS, missing values are denoted by NA in the dataset. Substituting 17.9 in the dataset by NA and running the code again, now monitoring the node time[21], we get the output:

node      mean   sd     MC error  2.5%   median  97.5%  start  sample
time[21]  21.06  3.696  0.0821    13.71  21.02   28.4   1001   2000

Prediction in WinBUGS

Predicting future observations that follow the same distributional assumptions as the observed data is straightforward. In the regression context, we are interested in the posterior predictive distribution of a future observation Y_(n+1)|y₁, …, yₙ with certain values of the predictors x. Its posterior predictive pdf is

f(y_(n+1)|y, x) = ∫ f(y_(n+1)|θ, x) f(θ|y, x) dθ

or, ignoring the dependence on x,

f(y_(n+1)|y) = ∫ f(y_(n+1)|θ) f(θ|y) dθ,

and we can use the mixture method (to be discussed in Chapter 7) to simulate from this distribution. This is easily implemented in WinBUGS.

Prediction in WinBUGS

In the linear regression Example 4.1, this means defining another variable in the code with the same distribution as the original data and values of the predictor variables for which we want to forecast, e.g. cases=20 and distance=1000, and including this variable in the dataset with value NA:

pred.time ~ dnorm( pmu,tau)
pmu <- beta0 + beta1*(20-mean(cases[])) +
       beta2* (1000-mean(distance[]))

Running the model again and monitoring pred.time gives the posterior predictive summary:

node       mean   sd    MC error  2.5%   median  97.5%  start  sample
pred.time  48.98  3.71  0.07796   41.73  49.0    56.56  1001   2000

4.3 Model Checking

Model Assessment

Having successfully fit a model to a given dataset, the statistician must be concerned with whether the fit is adequate and whether the assumptions made by the model are justified. For example, in standard linear regression, the assumptions of normality, independence, linearity, and homogeneity of variance must all be investigated.

Several authors have suggested using the marginal distribution of the data, p(y), in this regard. Observed yᵢ values for which p(yᵢ) is small are "unlikely", and therefore may be considered outliers under the assumed model. Too many small values of p(yᵢ) suggest the model itself is inadequate and should be modified and expanded.

A problem with this approach is the difficulty in defining how small is "small" and how many outliers are "too many". In addition, we have the problem of the possible impropriety of p(y) under noninformative priors. As such, we might work with predictive distributions instead, since they will be proper whenever the posterior is.

Model Checking

Checking the validity of the model assumptions:
- examination of individual observations
- global goodness-of-fit checks
- comparison between two or more competitor models (later)

Examination of Individual Observations

Consider data y₁, …, yₙ and parameters θ under the assumed model. Gelfand et al. (1992) suggest a series of "checking functions". These are based on comparing a predictive distribution p(Yᵢ^rep) (to be made precise in the following) with the actual observed yᵢ:

1. the residuals: yᵢ − E[Yᵢ^rep]
2. the standardised residuals: (yᵢ − E[Yᵢ^rep]) / √Var(Yᵢ^rep)
3. the chance of getting a more extreme observation: min( P(Yᵢ^rep < yᵢ), P(Yᵢ^rep ≥ yᵢ) )
4. the chance of getting a more surprising observation: P(Yᵢ^rep : f(Yᵢ^rep) ≤ f(yᵢ))
5. the predictive ordinate of the observation: f(yᵢ^rep)

Separate Evaluation Data Available

Assume the data have been divided into a training set z and an evaluation set y. Then the posterior distribution of θ is based on z, and the predictive distribution above is given by

f(yᵢ|z) = ∫ f(yᵢ|z, θ) f(θ|z) dθ.

As usually the yᵢ's are conditionally independent of the zᵢ's given θ, this becomes

f(yᵢ|z) = ∫ f(yᵢ|θ) f(θ|z) dθ.

In WinBUGS, calculating the predictive distribution just requires defining an additional node for each Yᵢ^rep with the appropriate parents and monitoring the Yᵢ^rep's.

The observed yᵢ can then be compared with their predictive distribution through the residuals or standardized residuals

rᵢ = yᵢ − E[Yᵢ^rep|z]   and   srᵢ = (yᵢ − E[Yᵢ^rep|z]) / √Var(Yᵢ^rep|z).

- Plotting these residuals versus fitted values might reveal a failure of the normality or homogeneity of variance assumption.
- Plotting them versus time could reveal a failure of independence.
- Summing their squares or absolute values could provide an overall measure of fit.

No Separate Evaluation Data Available: Cross-Validation Approach

The above discussion assumes the existence of two independent data samples, which may well be unavailable in many problems. As such, Gelfand et al. (1992) suggested a cross-validation approach, wherein the fitted value for yᵢ^rep is computed conditionally on all the data except yᵢ, namely y₍ᵢ₎ = (y₁, …, y_(i−1), y_(i+1), …, yₙ). That is, the ith residual becomes

rᵢ = yᵢ − E[Yᵢ^rep|y₍ᵢ₎],

and the ith standardized residual

srᵢ = (yᵢ − E[Yᵢ^rep|y₍ᵢ₎]) / √Var(Yᵢ^rep|y₍ᵢ₎).

Note that in this cross-validatory approach we compute the posterior mean and variance with respect to the conditional predictive distribution

p(yᵢ|y₍ᵢ₎) = p(y)/p(y₍ᵢ₎) = ∫ p(yᵢ|θ, y₍ᵢ₎) p(θ|y₍ᵢ₎) dθ,

which gives the likelihood of each point given the remainder of the data. The actual values of p(yᵢ|y₍ᵢ₎), referred to as the conditional predictive ordinate, or CPO, can be plotted versus i as an outlier diagnostic, since data values having low CPO are poorly fit by the model.

WinBUGS Cross-Validation

Unfortunately, this is generally difficult to do within WinBUGS. But an approximation to the cross-validatory method is to use the methods for a separate evaluation set, replacing z by y. Hence our predictive distribution becomes the posterior predictive density without case omission:

f(yᵢ^rep|y) = ∫ f(yᵢ^rep|y, θ) f(θ|y) dθ = ∫ f(yᵢ^rep|θ) f(θ|y) dθ.

If we do wish to sample from the correct cross-validatory predictive distribution, this can be carried out using an additional importance sampling step to remove the effect of yᵢ when repredicting Yᵢ^rep (Gelfand et al., 1992), although this would have to be carried out external to WinBUGS.

Let us implement checking functions 1 and 2 in WinBUGS for Example 4.1, using the approximate cross-validatory method. Note that

E[Yᵢ^rep|y] = ∫ yᵢ^rep f(yᵢ^rep|y) dyᵢ^rep
            = ∫ yᵢ^rep [ ∫ f(yᵢ^rep|θ) f(θ|y) dθ ] dyᵢ^rep
            = ∫ [ ∫ yᵢ^rep f(yᵢ^rep|θ) dyᵢ^rep ] f(θ|y) dθ
            = E[μᵢ|y],

i.e. the posterior mean of μᵢ = β₀ + β₁xᵢ₁ + β₂xᵢ₂. Similarly, Var(Yᵢ^rep|y) is obtained from the posterior mean of σ²; the standardized residuals below use this.

Examination of Individual Observations in WinBUGS

Thus, in WinBUGS we only need to define the following nodes:

for (i in 1:n){
   r[i] <- time[i]-mu[i]
   sr[i] <- (time[i]-mu[i])*sqrt(tau)
}

Monitoring these vectors r and sr, we can look at summary statistics etc. However, we get a better overview by using the comparison tool of the Inference menu and clicking on "boxplot":

Figure 11: Boxplot of the standardized residuals sr[1]–sr[25].

Checking Function 3 in WinBUGS

To compute P(Yᵢ^rep < yᵢ), we first need to obtain sample values of the random variable Yᵢ^rep by generating a replicate dataset time.rep[i] which depends on the current values of mu[i] and tau at each iteration.

The step() function is then used to calculate the variable p.smaller[i], which takes the value 1 if time[i]-time.rep[i] ≥ 0 and zero otherwise. The posterior mean of p.smaller[i] is simply the proportion of iterations for which time.rep[i] < time[i], and P(Yᵢ^rep ≥ yᵢ) = 1 − posterior mean of p.smaller[i]. The chance of observing a more extreme value for Yᵢ is thus the minimum of these two probabilities.

node           mean    sd      MC error
p.smaller[1]   0.077   0.2666  0.005964
p.smaller[2]   0.626   0.4839  0.01051
p.smaller[3]   0.4875  0.4998  0.01109
p.smaller[4]   0.9275  0.2593  0.006629
p.smaller[5]   0.449   0.4974  0.009629
p.smaller[6]   0.459   0.4983  0.009853
p.smaller[7]   0.5915  0.4916  0.01047
p.smaller[8]   0.6325  0.4821  0.01033
p.smaller[9]   0.9555  0.2062  0.004386
p.smaller[10]  0.7575  0.4286  0.01117
p.smaller[12]  0.431   0.4952  0.009716
p.smaller[13]  0.631   0.4825  0.009968
p.smaller[14]  0.633   0.482   0.0116
p.smaller[15]  0.591   0.4916  0.009021
p.smaller[16]  0.4285  0.4949  0.012
p.smaller[17]  0.571   0.4949  0.01115
p.smaller[18]  0.8505  0.3566  0.007033
p.smaller[19]  0.712   0.4528  0.009266
p.smaller[20]  0.052   0.222   0.004984
p.smaller[21]  0.235   0.424   0.008387
p.smaller[22]  0.175   0.38    0.008043
p.smaller[23]  0.093   0.2904  0.006644
p.smaller[24]  0.09    0.2862  0.007328
p.smaller[25]  0.4685  0.499   0.01068

Checking Function 5 in WinBUGS

The CPO, checking function 5, can be explicitly calculated in WinBUGS using the relationship

1/f(yᵢ|y₍ᵢ₎) = f(y₍ᵢ₎)/f(y)
            = ∫ f(y₍ᵢ₎|θ) f(θ)/f(y) dθ
            = ∫ [1/f(yᵢ|θ)] · f(y|θ) f(θ)/f(y) dθ
            = ∫ [1/f(yᵢ|θ)] f(θ|y) dθ
            = E_(θ|y)[ 1/f(yᵢ|θ) ].

Thus, the ith CPO can be estimated from the inverse of the sample mean of the inverse likelihood of yᵢ, for θ generated from the full posterior distribution, i.e. a Monte Carlo estimate of CPOᵢ is

CPOᵢ ≈ [ (1/N) Σ_(n=1)^N 1/f(yᵢ|θ⁽ⁿ⁾) ]⁻¹,

which is the harmonic mean of the likelihood values. But note that harmonic means are notoriously unstable, so care is required regarding convergence!

In WinBUGS:

like[i] <- sqrt(tau/(2*PI))*exp(-0.5*pow(sr[i],2))
p.inv[i] <- 1/like[i]

node       mean    sd        MC error
p.inv[1]   34.12   27.03     0.7646
p.inv[2]   9.383   1.698     0.04268
p.inv[3]   8.959   1.627     0.03359
p.inv[4]   31.2    20.32     0.4766
p.inv[5]   8.929   1.512     0.03761
p.inv[6]   8.712   1.41      0.03228
p.inv[7]   9.184   1.669     0.042
p.inv[8]   9.37    1.669     0.0396
p.inv[9]   6273.0  154700.0  3500.0
p.inv[10]  13.03   6.565     0.1362
p.inv[11]  11.38   2.956     0.0671
p.inv[12]  9.211   1.792     0.04563
p.inv[13]  9.213   1.586     0.03934
p.inv[14]  9.338   1.699     0.0409
p.inv[15]  8.846   1.423     0.03416
p.inv[16]  9.538   2.268     0.0458
p.inv[17]  8.844   1.473     0.03562
p.inv[18]  16.44   6.838     0.1572
p.inv[19]  10.51   2.532     0.06173
p.inv[20]  53.19   49.06     1.06
p.inv[21]  13.66   7.111     0.1599
p.inv[22]  30.14   53.81     1.025
p.inv[23]  24.4    9.003     0.237
p.inv[24]  27.73   21.34     0.5858
p.inv[25]  8.82    1.473     0.03519

Global Goodness-of-fit Checks

The idea of global goodness-of-fit checks goes back to Rubin (1984). One constructs test statistics or other discrepancy measures D(y) that attempt to measure departures of the observed data from the assumed model (likelihood and prior distribution).

For example, suppose we have fit a normal distribution to a sample of univariate data, and wish to investigate the model's fit in the lower tail. We might compare the observed value of the discrepancy measure D(y) = y_min with its posterior predictive distribution, p(D(y^rep)|y), where y^rep denotes a hypothetical future value of y. If the observed value is extreme relative to this reference distribution, doubt is cast on some aspect of the model.
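Outside WinBUGS, the same harmonic-mean estimate can be computed from exported posterior draws (e.g. via CODA). In the sketch below, mu.draws (an N x n matrix of draws of mu[i]) and tau.draws (a vector of draws of tau) are assumed objects, not produced by the code in the notes:

## Monte Carlo estimate of CPO_i as the harmonic mean of the likelihood values
cpo <- function(y, mu.draws, tau.draws) {
  sapply(seq_along(y), function(i) {
    like <- dnorm(y[i], mean = mu.draws[, i], sd = 1 / sqrt(tau.draws))
    1 / mean(1 / like)   # harmonic mean over the posterior draws
  })
}
## low CPO values flag observations that are poorly fit by the model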

Posterior Predictive Model Checks

In order to be computable in the classical framework, test statistics must be functions of the observed data alone. But as pointed out by Gelman et al. (1996), basing Bayesian model checking on the posterior predictive distribution allows generalized test statistics D(y, θ) that depend on the parameters as well as the data. For example, as an omnibus goodness-of-fit measure, Gelman et al. (1996) recommend

D(y, θ) = Σ_(i=1)^n (yᵢ − E[Yᵢ|θ])² / Var(Yᵢ|θ).

With θ varying according to its posterior distribution, we would now compare the distribution of D(y, θ) for the observed y with that of D(y^rep, θ) for a future observation y^rep.

A convenient summary measure of the extremeness of the former with respect to the latter is the tail area

pD = P[D(y^rep, θ) > D(y, θ) | y] = ∫ P[D(y^rep, θ) > D(y, θ) | θ] p(θ|y) dθ.

As such, pD is sometimes referred to as the Bayesian P-value. In the case where the distribution of D(y^rep, θ) is free of θ, pD is exactly equal to the frequentist P-value, the probability of seeing a test statistic as extreme as the one actually observed.

Posterior Predictive Model Checks in WinBUGS

In Example 4.1, we consider two different statistics for D(y^rep, θ) which may be sensitive to outlying observations in a Normal model:

- coefficient of skewness: E[((X − μ)/σ)³], a measure of asymmetry; the skewness of a Normal rv is zero
- coefficient of kurtosis: E[((X − μ)/σ)⁴], a measure of peakedness; the kurtosis of a Normal rv is 3

for (i in 1:n){
   # residuals and moments for observed data
   r[i] <- time[i]-mu[i]
   sr[i] <- (time[i]-mu[i])*sqrt(tau)
   m3[i] <- pow(sr[i],3)
   m4[i] <- pow(sr[i],4)
   # residuals and moments of replicates for Bayesian p-value
   time.rep[i] ~ dnorm(mu[i], tau)
   resid.rep[i] <- time.rep[i]-mu[i]
   sresid.rep[i] <- resid.rep[i]*sqrt(tau)
   m3.rep[i] <- pow(sresid.rep[i],3)
   m4.rep[i] <- pow(sresid.rep[i],4)
}

Bayesian P-values in WinBUGS

# Bayesian p-value:
skew.obs     <- sum(m3[])/n
skew.rep     <- sum(m3.rep[])/n
p.skew       <- step(skew.rep-skew.obs)
kurtosis.obs <- sum(m4[])/n
kurtosis.rep <- sum(m4.rep[])/n
p.kurtosis   <- step(kurtosis.rep-kurtosis.obs)

node          mean      sd      MC error
skew.obs      0.09787   0.8858  0.0185
skew.rep     -0.02244   0.7959  0.01879
p.skew        0.4685    0.499   0.01028
kurtosis.obs  3.783     2.754   0.05979
kurtosis.rep  3.045     2.023   0.04379
p.kurtosis    0.417     0.4931  0.01081

4.4 Model Comparison via DIC

Model Comparison via DIC

In general, for model comparison we need:
- a measure of fit
- a measure of complexity

e.g.
AIC = −2 log p(y|θ̂) + 2p
BIC = −2 log p(y|θ̂) + p log n

Problems with Classical Information Criteria

- the χ²-approximation is questionable for small samples
- p = no. of parameters: what is it in hierarchical models?
- n = no. of observations: what is it in hierarchical models?

Deviance

Suggestion by Dempster (1974): base model assessment on the posterior distribution of the log-likelihood of the data. This is equivalent to the posterior distribution of the deviance:

D(θ) = −2 log p(y|θ) + 2 log p(y|θ_sat)

Deviance Information Criterion

Suggestion by Spiegelhalter et al. (2002):

measure of fit: D̄ = E_(θ|y)[D], the posterior mean of the deviance
measure of complexity: pD = D̄ − D(θ̄), the effective number of parameters

DIC = D̄ + pD = D(θ̄) + 2 pD

The model with the smallest DIC value is preferred. DIC calculation is implemented in WinBUGS.
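If the deviance is monitored as a node, the DIC components can also be assembled by hand from the MCMC output. A sketch in R, where dev.draws (posterior draws of the deviance) and dhat (the deviance evaluated at the posterior means) are assumed to be available from the sampler output:

## DIC from monitored deviance samples, following the definitions above
dic.from.deviance <- function(dev.draws, dhat) {
  dbar <- mean(dev.draws)          # posterior mean of the deviance
  pD   <- dbar - dhat              # effective number of parameters
  c(Dbar = dbar, Dhat = dhat, pD = pD, DIC = dbar + pD)
}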

DIC Example: Multiple Linear Regression

We will illustrate the use of DIC for comparing four different models for the softdrink Example 4.1:

1. Model 1: intercept only
2. Model 2: cases
3. Model 3: distance
4. Model 4: cases and distance

We run each model in WinBUGS and set the DIC tool in the Inference menu.

DIC Output

Dbar = posterior mean of −2logL;  Dhat = −2logL at the posterior mean of the stochastic nodes.

Model              Dbar     Dhat     pD     DIC
Intercept          209.092  207.061  2.031  211.123
Cases              143.549  140.477  3.072  146.622
Distance           170.575  167.503  3.072  173.647
Cases + Distance   131.289  127.030  4.259  135.547

4.5 Analysis of Variance

ANOVA Models

Now
- response variable Y: continuous
- explanatory variable X: discrete
X is called a factor with levels i = 1, ..., I.

ANOVA Model
Yij ~ N(μi, σ²),   i = 1, ..., I,   j = 1, ..., ni,
where
- Yij is the jth observation of Y at level i of X,
- μi = β0 + βi,
- β0 is the overall common mean,
- βi is the group-specific parameter.

Parametrizations and Interpretations

We need a constraint to make the I + 1 parameters β0, β1, ..., βI identifiable. Either:

Corner constraint:
The effect of the baseline level (or reference category) is set to 0:
β1 = 0, so μ1 = β0 and μi = β0 + βi for i = 2, ..., I.

or

Sum-to-zero constraint:
β1 = -Σ_{i=2}^I βi, i.e. Σ_{i=1}^I βi = 0.
Then β0 = (1/I) Σ_{i=1}^I μi is the overall mean effect, and βi is the deviation of each level from this overall mean effect.
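As a rough illustration, the two constraints correspond to R's "treatment" and "sum" contrasts for a fitted linear model. The small data frame below is made up purely for illustration; note that contr.sum determines the last level from the others, whereas the corner constraint above fixes the first.

set.seed(1)
dat <- data.frame(x = factor(rep(1:3, each = 10)),
                  y = rnorm(30, mean = rep(c(10, 12, 15), each = 10)))

# corner constraint: beta_1 = 0, intercept = mean of the reference level
coef(lm(y ~ x, data = dat, contrasts = list(x = "contr.treatment")))

# sum-to-zero constraint: intercept = average of the level means
coef(lm(y ~ x, data = dat, contrasts = list(x = "contr.sum")))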

ANOVA in WinBUGS

Assume the data are given in pairs (xi, yi), i = 1, ..., n (n = Σi ni):

#likelihood
for (i in 1:n){
  y[i] ~ dnorm(mu[i],tau)
  mu[i] <- beta0 + beta[x[i]]
}
#corner constraint
beta[1] <- 0.0
#sum-to-zero constraint
#beta[1] <- - sum( beta[2:I] )
#prior
beta0 ~ dnorm(0.0,1.0E-4)
for (i in 2:I){
  beta[i] ~ dnorm(0.0,1.0E-4)
}

ANOVA Example

Example 4.2
McCarthy (2007) describes a dataset of weights of starlings at four different locations.

Location 1: 78 88 87 88 83 82 81 80 80 89
Location 2: 78 78 83 81 78 81 81 82 76 76
Location 3: 79 73 79 75 77 78 80 78 83 84
Location 4: 77 69 75 70 74 83 80 75 76 75
Classical ANOVA

Frequentist analysis in R:

star.df <- read.table("starlingdata.txt", header=TRUE)
attach(star.df)
loc <- factor(location)
star.aov <- aov(Y~loc)
anova(star.aov)
summary.lm(star.aov)$coef

R-Output

Analysis of Variance Table

Response: Y
          Df Sum Sq Mean Sq F value    Pr(>F)
loc        3 341.90  113.97  9.0053 0.0001390 ***
Residuals 36 455.60   12.66

> summary.lm(star.aov)$coef
            Estimate Std. Error   t value     Pr(>|t|)
(Intercept)     83.6   1.124969 74.313150 5.325939e-41
loc2            -4.2   1.590947 -2.639938 1.218170e-02
loc3            -5.0   1.590947 -3.142783 3.342926e-03
loc4            -8.2   1.590947 -5.154164 9.372412e-06

WinBUGS Code

model
{ for (i in 1:40) {
    mu[i] <- beta0 + beta[location[i]]
    Y[i] ~ dnorm(mu[i], tau)
  }
  #prior, corner constraint
  beta[1] <- 0
  beta0 ~ dnorm(0.0,1.0E-4)
  for (i in 2:4){
    beta[i] ~ dnorm(0.0, 1.0E-6)
  }
  tau ~ dgamma(0.001, 0.001) # vague prior on the precision
}
#inits
list(beta0=70, beta=c(NA, 70, 70, 70), tau=1)

#data
location[] Y[]
1 78
...
1 89
2 78
...
2 76
3 79
...
3 84
4 77
...
4 75
END

WinBUGS Results

node     mean     sd       MC error  2.5%     median   97.5%    start  sample
beta[2]  -4.204   1.65     0.03838   -7.302   -4.162   -0.9981  1001   2000
beta[3]  -4.963   1.597    0.04041   -7.977   -4.964   -1.699   1001   2000
beta[4]  -8.143   1.61     0.03213   -11.26   -8.168   -5.014   1001   2000
beta0    83.58    1.142    0.02757   81.31    83.59    85.7     1001   2000
tau      0.07878  0.01887  4.333E-4  0.04582  0.07712  0.1183   1001   2000

Using the comparison tool of the Inference menu and clicking on "boxplot" for beta:

Figure 12: Boxplot of location effects (beta[2], beta[3], beta[4]).

Model Comparison

Let us compare the fit of this one-way ANOVA model with a model that assumes no differences in the expected weights at the different locations:

for (i in 1:40) {
  Y[i] ~ dnorm(beta0, tau)
}

Model      Dbar     Dhat     pD     DIC
ANOVA      216.156  211.053  5.103  221.259
Same Mean  235.316  233.229  2.087  237.402

4.6 Generalized Linear Models

Generalized Linear Models

Generalized Linear Models (GLMs) are a generalization of the linear model for modelling random variables from the exponential family, thus including the Normal, Binomial, Poisson, Exponential and Gamma distributions. GLMs are one of the most important components of modern statistical theory, unifying the approach to statistical modelling. Details on GLMs can be found in McCullagh and Nelder (1989), Fahrmeir and Tutz (2001), and Dey, Ghosh, Mallick (2000).

3 components of a LM:
- stochastic component: Yi ~ N(μi, σ²), i.e. E[Yi] = μi
- systematic component: ηi = xi'β (linear predictor)
- link function: g(μi) = ηi, the identity

3 components of a GLM:
- stochastic component: Yi ~ exponential family with location parameter θ and dispersion parameter φ
- systematic component: ηi = xi'β
- link function: g(μi) = ηi
Models for Binary Response

Example 4.3
Fahrmeir and Tutz (1994) describe data provided by the Klinikum Grosshadern, Munich, on infection from births by Caesarean section. The response variable of interest is the occurrence or nonoccurrence of infection, with three dichotomous covariates: whether the Caesarean section was planned or not, whether any risk factors such as diabetes, being overweight etc. were present or not, and whether antibiotics were given as a prophylaxis. The aim was to analyse the effects of the covariates on the risk of infection, especially whether antibiotics can decrease the risk of infection.

The binary data are summarized in the following table (number of patients):

                      Caesarean planned     Not planned
                      Infection             Infection
                      yes      no           yes      no
Antibiotics
  Risk factors          1      17            11      87
  No risk factors       0       2             0       0
No antibiotics
  Risk factors         28      30            23       3
  No risk factors       8      32             0       9

Models for Binary Response

Let Yi = 1 if infection occurs for the ith patient and 0 otherwise, and let xi denote the corresponding vector of covariate values.

Yi | xi, πi ~ Bernoulli(πi)
ηi = xi'β = β0 + β1 xi1 + β2 xi2 + β3 xi3

Link function: η = g(π), or π = F(η) where F is a cdf:
- logit model: g(π) = log(π/(1-π)), π = e^η/(1+e^η), the logistic cdf
- probit model: g(π) = Φ^(-1)(π), π = Φ(η), the Normal cdf
- complementary log-log model: g(π) = log(-log(1-π)), π = 1 - exp(-exp(η)), the extreme-minimal-value cdf

Interpretation of Logit Parameters

log(π/(1-π)) = β0 + β1 x   implies   π/(1-π) = exp(β0) exp(β1 x),

so

OR_{x,x+1} = odds(x+1)/odds(x) = exp(β0) exp(β1 (x+1)) / [exp(β0) exp(β1 x)] = exp(β1).

If x increases by 1 unit, the odds are multiplied by exp(β1): exponentials of covariate effects have a multiplicative effect on the odds/relative risk.

For other link functions:
- Interpret the covariate effects on the linear predictor η = x'β.
- Transform this linear effect on η into a nonlinear effect on π (with the aid of a graph of the response function π = g^(-1)(η)).
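As a rough frequentist cross-check of this odds-ratio interpretation, the R sketch below fits the logit model to the aggregated Caesarean counts with glm() and exponentiates the coefficients. The 0/1 coding of the covariates (1 = planned, risk factors present, antibiotics given) and the data-frame layout are my own assumptions mirroring the table above; the resulting odds ratios should be broadly comparable to the Bayesian estimates reported later.

cae <- data.frame(
  plan  = c(1, 1, 0, 0, 1, 1, 0, 0),
  risk  = c(1, 0, 1, 0, 1, 0, 1, 0),
  antib = c(1, 1, 1, 1, 0, 0, 0, 0),
  inf   = c(1, 0, 11, 0, 28,  8, 23, 0),   # infections ("yes")
  noinf = c(17, 2, 87, 0, 30, 32,  3, 9))  # no infection ("no")
cae <- subset(cae, inf + noinf > 0)        # drop the empty cell

fit <- glm(cbind(inf, noinf) ~ plan + risk + antib,
           family = binomial(link = "logit"), data = cae)
exp(coef(fit))   # exp(beta_j): multiplicative effects on the odds of infection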

Logit WinBUGS Code

model{
  for( i in 1 : N ) {
    y[i] ~ dbern(p[i])
    logit(p[i]) <- beta0 + beta[1]*plan[i] +
                   beta[2]*factor[i] + beta[3]*antib[i]
    # centered covariates
    # logit(p[i]) <- beta0 + beta[1]*(plan[i]-mean(plan[])) +
    #                beta[2]*(factor[i]-mean(factor[])) +
    #                beta[3]*(antib[i]-mean(antib[]))
  }
  beta0 ~ dnorm(0.0,0.001)
  for (i in 1:3){
    beta[i] ~ dnorm(0.0,0.001)
    or[i] <- exp(beta[i])
  }
}
list(beta0=0,beta=c(0,0,0)) #inits
list(N=251)                 #data

WinBUGS Output

Figure 13: Traceplots for uncentered covariates (beta[1], beta[2], beta[3], beta0).

Figure 14: Traceplots for centered covariates.

Figure 15: Autocorrelation plots (centered and uncentered covariates).

Summary statistics for the model with uncentered covariates:

node     mean     sd       MC error  2.5%     median    97.5%
beta[1]  -1.116   0.4392   0.02788   -1.993   -1.114    -0.2388
beta[2]   2.069   0.4982   0.03463    1.157    2.055     3.057
beta[3]  -3.333   0.4921   0.02534   -4.346   -3.316    -2.393
beta0    -0.8242  0.5331   0.04337   -1.961   -0.8118    0.1738
or[1]     0.3604  0.1639   0.009911   0.1362   0.3282    0.7878
or[2]     8.988   4.894    0.3246     3.181    7.804    21.26
or[3]     0.04017 0.02009  0.001003   0.01295  0.03628   0.09139

None of the 95% credible intervals of the covariate effects contains 0. Antibiotics lower the odds of infection by a factor of 0.04. When the Caesarean is planned, the odds of infection decrease by a factor of 0.36, and when risk factors are present, the odds of infection are 8.988 times higher.

Comparing Model Fits

Consider 3 different models with 3 different link functions and compare the fit with DIC:

Model    Dbar     Dhat     pD     DIC
Logit    230.621  226.588  4.033  234.654
Probit   231.221  227.041  4.180  235.400
Cloglog  228.101  224.152  3.949  232.050

The complementary log-log link seems to give a slightly better fit, but there are only minor differences in the DIC values.

4.7 Hierarchical Models

Hierarchical Models

In many statistical applications, model parameters are related by the structure of the problem. For example, in a study of the effectiveness of cardiac treatments, it is assumed that patients in hospital j have survival probability θj.

Estimating each of these θj separately might result in large standard errors for hospitals with few patients. It can also lead to overfitting and to models that cannot predict new data well. Assuming all survival probabilities are the same will ignore potential treatment differences between hospitals and will not fit the data accurately.

It might be reasonable to expect that the θj are related and should be estimated jointly. This is achieved in a natural way by assuming that the θj come from a common population distribution. This population distribution can depend on a further parameter.

Hierarchical model with hyperparameters:

Yij | θj ~ f(yij | θj)
θj | φ ~ f(θj | φ)
φ ~ f(φ)

Hierarchical Models: Rat Tumor Example

Example 4.4
This example, in the context of drug evaluation for possible clinical trial application, is taken from Gelman et al. (2004). A control group of 14 laboratory rats of type F344 is given a zero dose of a certain drug. The aim is to estimate the probability of developing endometrial stromal polyps (a certain tumor). The outcome is that 4 out of 14 rats developed this tumor.

Historical data: 70 previous experiments on the same type of rats (tumors/number of rats):

0/20   0/20   0/20   0/20   0/20   0/20   0/20   0/19   0/19   0/19
0/19   0/18   0/18   0/17   1/20   1/20   1/20   1/20   1/19   1/19
1/18   1/18   2/25   2/24   2/23   2/20   2/20   2/20   2/20   2/20
2/20   1/10   5/49   2/19   5/46   3/37   2/17   7/49   7/47   3/20
3/20   2/13   9/48  10/50   4/20   4/20   4/20   4/20   4/20   4/20
4/20  10/48   4/19   4/19   4/19   5/22  11/46  12/49   5/20   5/20
6/23   5/19   6/22   6/20   6/20   6/20  16/52  15/47  15/46   9/24

1. Approach: Bayesian model with fixed prior

Y | θ ~ Binomial(14, θ)
θ ~ Beta(α, β)

Assume that we know from the historical data the mean and sd of the tumor probabilities among female lab rats of type F344. We find the values of α and β of the beta distribution with this mean and sd. This yields a Beta(α + 4, β + 10) posterior distribution for θ.

The observed sample mean and sd of the yj/nj are 0.136 and 0.103, respectively. Setting

0.136 = α/(α + β)
0.103² = αβ / [(α + β)² (α + β + 1)]

yields α = 1.4 and β = 8.6.
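As a small illustration, the moment equations above can be solved in closed form, as in the R sketch below; the variable names are mine and the result reproduces α = 1.4 and β = 8.6 up to rounding.

m <- 0.136   # sample mean of y_j/n_j
s <- 0.103   # sample sd of y_j/n_j

# from the variance equation: alpha + beta = m(1-m)/s^2 - 1
ab    <- m * (1 - m) / s^2 - 1
alpha <- m * ab
beta  <- (1 - m) * ab
c(alpha = alpha, beta = beta)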

Using a Beta(1.4, 8.6) prior for θ yields a Beta(5.4, 18.6) posterior distribution with posterior mean 0.223 and posterior sd 0.083, whereas 4/14 = 0.286.

Questions:
- Can we use the same prior to make inference about the tumor probabilities in the first 70 groups?
- Is the point estimate used to derive α and β representative?
- Does it make sense to estimate α and β?

2. Approach: Hierarchical Bayesian model

In the absence of any information about the θj (other than the data), and since no ordering or grouping of the parameters can be made, we must assume symmetry in the prior distribution of the parameters. This means that the parameters (θ1, ..., θJ) are modelled as exchangeable in their joint prior distribution, i.e. f(θ1, ..., θJ) is invariant to permutations of the indices (1, ..., J).

Assumptions:
- θ1, ..., θ70, θ71 can be considered a random sample from a common distribution,
- no time trend.

Assume the simplest form of exchangeability: the θj are iid given some unknown parameter φ:

f(θ1, ..., θJ | φ) = ∏_{j=1}^J f(θj | φ).

De Finetti's theorem states that as J → ∞, any exchangeable distribution (under certain regularity conditions) can be written in the iid mixture form above.

Key part of hierarchical models:
φ is unknown, has a prior distribution f(φ), and we estimate its posterior distribution after observing the data. We have a parameter vector (θ, φ) with joint prior distribution

f(θ, φ) = f(φ) f(θ | φ).

By integration, the joint (unconditional or marginal) distribution is

f(θ1, ..., θJ) = ∫ [ ∏_{j=1}^J f(θj | φ) ] f(φ) dφ.

The joint posterior distribution is

f(θ, φ | y) ∝ f(y | θ, φ) f(θ, φ) = f(y | θ) f(θ | φ) f(φ).
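To see the mixture form of the exchangeable prior concretely, the R sketch below draws φ from a hyperprior and then the θj iid given φ; marginally the θj are exchangeable but correlated. The Gamma hyperpriors used here are illustrative only, not the ones used later in the rat tumor analysis.

set.seed(1)
J <- 71
one.draw <- function() {
  alpha <- rgamma(1, 2, 1)    # phi = (alpha, beta), illustrative hyperprior
  beta  <- rgamma(1, 10, 1)
  rbeta(J, alpha, beta)       # theta_1, ..., theta_J | phi are iid
}
theta <- t(replicate(2000, one.draw()))   # 2000 draws from the joint prior
cor(theta[, 1], theta[, 2])               # positive: exchangeable, not independent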

Hyperprior Distribution

If little is known about the hyperparameter φ, we can assign a diffuse prior distribution. But we always need to check whether the resulting posterior distribution is proper. In most real problems, there is sufficient substantive knowledge about φ to constrain it to some finite region.

In the rat tumor Example 4.4, we reparametrize to μi = logit(θi), i.e.

θi = exp(μi) / (1 + exp(μi)),
μi ~ N(ν, τ),

and specify the following diffuse hyperprior distribution for the mean ν and precision τ (precision parametrization, as in WinBUGS):

ν ~ N(0, 0.001)
τ ~ Gamma(0.001, 0.001)

WinBUGS Code: Rat Tumor Example

# rat example
model
{ for (i in 1:71){
    y[i] ~ dbin(theta[i],n[i])
    theta[i] <- exp(mu[i])/(1+exp(mu[i]))
    mu[i] ~ dnorm(nu,tau)
    r[i] <- y[i]/n[i]
  }
  nu ~ dnorm(0.0,0.001)
  tau ~ dgamma(0.001,0.001)
  mtheta <- exp(nu)/(1+exp(nu))
}
#inits
list(nu=0,tau=1)

WinBUGS Output: Rat Tumor Example

Based on 10,000 iterations and a burn-in of 10,000:

node       mean     sd       MC error  2.5%     median   97.5%
mtheta     0.1261   0.01336  3.035E-4  0.1002   0.126    0.1526
nu        -1.941    0.1224   0.002774  -2.195   -1.937   -1.715
tau        2.399    1.134    0.03409   1.052    2.184    4.891
theta[71]  0.2059   0.077    7.983E-4  0.0827   0.1965   0.3825

From the boxplot and the "model fit" plot of the θj estimates against the sample proportions rj, we see that the rates θj are shrunk from their sample point estimates rj = yj/nj towards the population distribution with mean 0.126. Experiments with fewer observations are shrunk more and have higher posterior variances. In contrast to the model with fixed prior parameters, this full Bayesian hierarchical analysis has taken the uncertainty in the hyperparameters into account.

Figure 16: Boxplots for rat tumor rates theta[1]-theta[71].

Figure 17: Model fit for rat tumor rates (posterior estimates against sample proportions).

Hierarchical Models: Pump Failure Example

Example 4.5
George et al. (1993) discuss Bayesian analysis of hierarchical models. The example they consider relates to 10 power plant pumps. The data are given in the following table: the number of failures xi and the length of operation time ti (in thousands of hours) for each pump.

Pump   ti       xi
1       94.50    5
2       15.70    1
3       62.90    5
4      126.00   14
5        5.24    3
6       31.40   19
7        1.05    1
8        1.05    1
9        2.10    4
10      10.50   22

Hierarchical Models: Pump Failure Example

The number of failures Xi is assumed to follow a Poisson distribution:

Xi | θi ~ Poisson(θi ti),   i = 1, ..., 10,

where θi denotes the failure rate for pump i. Assuming that the failure rates of the pumps are related, we specify a hierarchical Bayesian model and a conjugate prior distribution for θi:

θi ~ Gamma(α, β),   i = 1, ..., 10.

We have insufficient information about the pump failure rates to specify values for α and β, but want the data to inform us about these. We specify a hyperprior distribution using substantive knowledge:

α ~ Exponential(1.0)
β ~ Gamma(0.1, 1.0)

WinBUGS Code: Pump Failure Example

model
{
  for (i in 1 : N) {
    theta[i] ~ dgamma(alpha, beta)
    lambda[i] <- theta[i] * t[i]
    x[i] ~ dpois(lambda[i])
  }
  alpha ~ dexp(1)
  beta ~ dgamma(0.1, 1.0)
}
list(t=c(94.3,15.7,62.9,126,5.24,31.4,1.05,1.05,2.1,10.5),
     x=c(5,1,5,14,3,19,1,1,4,22), N=10) #data
list(alpha = 1, beta = 1) #inits

WinBUGS Output: Pump Failure Example

Based on 5,000 iterations and a burn-in of 1,000:

node       mean     sd       MC error  2.5%     median   97.5%
alpha      0.6874   0.2723   0.007535  0.2806   0.6456   1.338
beta       0.9126   0.5411   0.01506   0.1771   0.8161   2.222
theta[1]   0.0599   0.02496  3.49E-4   0.02099  0.05683  0.1184
theta[2]   0.1012   0.07978  0.001012  0.00801  0.08247  0.3089
theta[3]   0.08922  0.03818  5.284E-4  0.03137  0.08349  0.1786
theta[4]   0.1148   0.03023  3.901E-4  0.06324  0.1121   0.1829
theta[5]   0.5964   0.3127   0.004145  0.1508   0.5445   1.338
theta[6]   0.6067   0.137    0.001753  0.3761   0.595    0.9082
theta[7]   0.9106   0.7541   0.01089   0.07487  0.7165   2.845
theta[8]   0.8997   0.7396   0.01236   0.07952  0.7016   2.732
theta[9]   1.599    0.7679   0.01115   0.4925   1.467    3.444
theta[10]  1.995    0.4327   0.00605   1.254    1.966    2.917

MLE: Pump Failure Example

To compare the results to maximum likelihood estimates (MLEs) for the individual pump failure rates, we write down the (log)likelihood:

f(xi | θi) = (θi ti)^xi exp(-θi ti) / xi!
log f(xi | θi) = xi log(θi ti) - θi ti + const.

Setting the first derivative to 0 and solving w.r.t. θi gives

θ̂i = xi / ti.
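As a quick illustration, the R sketch below computes the MLEs xi/ti and places them next to the hierarchical posterior means copied from the WinBUGS table above; this reproduces the shrinkage comparison on the next slide (object names are mine).

t.i   <- c(94.5, 15.7, 62.9, 126, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5)
x.i   <- c(5, 1, 5, 14, 3, 19, 1, 1, 4, 22)
mle   <- x.i / t.i                       # MLE of each failure rate
bayes <- c(0.0599, 0.1012, 0.08922, 0.1148, 0.5964,
           0.6067, 0.9106, 0.8997, 1.599, 1.995)   # posterior means from above
round(cbind(hours = t.i, failures = x.i, MLE = mle, Bayesian = bayes), 4)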

MLE Comparison: Pump Failure Example

The following table compares MLEs and Bayesian estimates:

hours    failures  MLE     Bayesian
 94.50    5        0.0530  0.0599
 15.70    1        0.0637  0.1012
 62.90    5        0.0795  0.08922
126.00   14        0.1111  0.1148
  5.24    3        0.5725  0.5964
 31.40   19        0.6051  0.6067
  1.05    1        0.9524  0.9106
  1.05    1        0.9524  0.8997
  2.10    4        1.9048  1.599
 10.50   22        2.0952  1.995

Remarks: Pump Failure Example

- Individual estimates are "shrunk" away from the MLE toward a common mean.
- Individual estimates "borrow strength" from the rest of the data.
- θi's for observations with large "sample size" (operation time) are shrunk less than θi's for other observations.
- θi's far from the common mean (0.7389) are shrunk more than those near it.

Boxplot and Model Fit: Pump Failure Example

Figure 18: Boxplots for pump failure rates theta[1]-theta[10].

Figure 19: Model fit for pump failure rates.

4.8 Survival Analysis

Survival Analysis

Survival analysis refers to a class of statistical models used to analyse the duration of time until an event of interest (such as death, tumor occurrence, component failure) occurs. Time-to-event data arise in many disciplines, including medicine, biology, engineering, epidemiology and economics. Frequentist textbooks include Cox and Oakes (1984) and Klein and Moeschberger (1997); a comprehensive Bayesian perspective is given in Ibrahim, Chen and Sinha (2001).

As duration times are non-negative, only non-negative random variables can be used to model survival times. Failure time data are often censored, i.e. incomplete, in that one knows that a patient survived the study end point, but one does not know the exact time of death.

Hazard Function

Let T be a continuous nonnegative random variable representing the duration time until a certain event occurs. Let f(t) denote the pdf and F(t) the cdf of T. Let S(t) = 1 - F(t) = P(T > t) be the survival function, which provides the probability of surviving beyond timepoint t.

Definition 4.6
The hazard function is defined as

h(t) = lim_{Δt → 0} P(t < T ≤ t + Δt | T > t) / Δt = f(t)/S(t) = -S'(t)/S(t)

and can be interpreted as the instantaneous death (or event) rate of an individual, provided that this person survived until time t. In particular, h(t)Δt is the approximate probability of failure in [t, t + Δt), given survival up to time t.

In survival analysis, we are less interested in the mean of the distribution than in the hazard function.

Since f(t) = -dS(t)/dt, Definition 4.6 implies that

h(t) = -(d/dt) log S(t).                                   (4.1)

Integrating both sides of (4.1) and then exponentiating yields

S(t) = exp(-∫_0^t h(u) du).                                (4.2)

Thus the hazard function has the properties

h(t) ≥ 0   and   ∫_0^∞ h(t) dt = ∞.

The cumulative hazard H(t) is defined as

H(t) = ∫_0^t h(u) du,

so S(t) = exp(-H(t)). Since S(∞) = 0, H(∞) = ∞.

Finally, it follows from Definition 4.6 and (4.1) that

f(t) = h(t) exp(-∫_0^t h(u) du).                           (4.3)

Example: Weibull Distribution

Suppose T has pdf

f(t) = ρ λ t^(ρ-1) exp(-λ t^ρ)   for t > 0, ρ > 0, λ > 0,
f(t) = 0                         otherwise.

This is a Weibull distribution with parameters (ρ, λ). It follows easily from the equations above that

S(t) = exp(-λ t^ρ),   h(t) = ρ λ t^(ρ-1),   H(t) = λ t^ρ.

Proportional Hazards Models

The hazard function depends in general on both time and a set of covariates. The proportional hazards model (Cox, 1972) separates these components by specifying that the hazard at time t for an individual whose covariate vector is x is given by

h(t, x) = h0(t) exp{G(x, β)},

where h0(t) is called the baseline hazard function and β is a vector of regression coefficients. The second term is written in exponential form because it must be positive.

The ratio of hazards for two individuals is constant over time. Often, the effect of the covariates is assumed to be multiplicative, leading to the hazard function

h(t, x) = h0(t) exp(x'β),

where η = x'β is called the linear predictor. Thus the ratio of hazards for two individuals depends on the difference between their linear predictors at any time.
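As a rough numerical illustration, the R sketch below codes the Weibull(ρ, λ) quantities from the example above and a proportional-hazards version h(t, x) = h0(t) exp(x'β). Parameter values are illustrative only, and note that R's own dweibull() uses a different shape/scale parametrization than the (ρ, λ) form used here.

rho <- 1.5; lambda <- 0.2
f0 <- function(t) rho * lambda * t^(rho - 1) * exp(-lambda * t^rho)
S0 <- function(t) exp(-lambda * t^rho)
h0 <- function(t) rho * lambda * t^(rho - 1)

all.equal(h0(2), f0(2) / S0(2))      # h(t) = f(t)/S(t), cf. Definition 4.6

beta <- c(0.5, -1)                   # illustrative regression coefficients
h <- function(t, x) h0(t) * exp(sum(x * beta))
h(2, c(1, 0)) / h(2, c(0, 0))        # hazard ratio = exp(beta_1), constant in t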

Partial Likelihood

Cox's version (Cox, 1975) of the proportional hazards model is semiparametric, as the baseline hazard function h0(t) is not modeled as a parametric function of t.

Assumptions:
- n individuals, of which d have distinct event times and n - d have right-censored survival times,
- no ties, ordered survival times y(1), ..., y(d),
- Rj = set of individuals who are at risk at time y(j), the jth risk set.

Then the partial likelihood is

PL(β) = ∏_{j=1}^d exp(x'(j) β) / Σ_{l ∈ Rj} exp(x'l β).          (4.4)

The partial MLE of β can be obtained by maximizing (4.4) w.r.t. β.

Likelihood under Censoring

Survival data are often right-censored. An observation is said to be right-censored at c if the exact value of the observation is not known, but only that it is greater than c.

Let n be the number of subjects, where individual i has survival time ti and fixed censoring time ci. The ti are iid with pdf f(t). The exact survival time ti of an individual will be observed only if ti ≤ ci. The data can be represented by n pairs of random variables (yi, δi) where

yi = min(ti, ci)

and

δi = 0 if ti ≤ ci,   δi = 1 if ti > ci.

Likelihood under Censoring

The likelihood function for (β, h0(t)) for right-censored data is

L(β, h0(t) | D) = ∏_{i=1}^n f(yi)^(1-δi) S(yi)^δi
                = ∏_{i=1}^n h(yi)^(1-δi) S(yi)^(1-δi) S(yi)^δi
                = ∏_{i=1}^n h(yi)^(1-δi) S(yi)
                = ∏_{i=1}^n h(yi)^(1-δi) exp{-H(yi)}
                = ∏_{i=1}^n [h0(yi) exp(ηi)]^(1-δi) exp{-exp(ηi) H0(yi)},

where the data are D = (n, y, X, δ).

If we assume a parametric model for the baseline hazard, e.g. Weibull(α, 1), and define μi = exp(ηi), then the likelihood above is that of independent censored Weibull(α, μi) distributions.

Censoring in WinBUGS

In WinBUGS, right censoring can be implemented using the command I(a,) (and I(,b) and I(a,b) for left and interval censoring, respectively).

Two variables are required to define the survival times:
- the actual survival times t[i], taking NA values for censored observations, and
- the censoring times t.cen[i], which take the value 0 when actual survival times (deaths) are observed.

For example, the likelihood of a Weibull(rho, gamma) distribution with right-censored data can be expressed as

t[i] ~ dweib(rho,gamma)I(t.cen[i],)

Mice Example in WinBUGS

We will now look at the mice example in WinBUGS Examples Volume 1.
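To make the two-variable convention concrete, the sketch below shows how a small censored dataset would be laid out for WinBUGS (e.g. written from R via R2WinBUGS). The five times are made up purely for illustration: subjects 1, 2 and 4 are observed deaths, subjects 3 and 5 are right-censored.

list(t     = c(12,  9, NA, 27, NA),    # NA where only a censoring time is known
     t.cen = c( 0,  0, 15,  0, 40),    # 0 where the death time is observed
     N     = 5)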

MAC AIDS Trial

Here we come back to the analysis of the controlled clinical AIDS trial discussed in the introduction. Our data arise from a clinical trial comparing two treatments for Mycobacterium avium complex (MAC), a disease common in late-stage HIV-infected persons.

11 clinical centers (units) have enrolled a total of 69 patients in the trial, of which 18 have died. The data have been analysed in Carlin and Hodges (1999) and Cai and Meyer (2011).

For j = 1, ..., ni and i = 1, ..., k let
tij = time to death or censoring,
xij = treatment indicator for subject j in stratum i.

Primary Endpoint Data

Survival times (in half-days) from the MAC treatment trial, given as (treatment, time) per unit, where "+" indicates a censored observation:

Unit A: (1, 74+), (2, 248), (1, 272+), (2, 244)
Unit B: (2, 4+), (1, 156+)
Unit C: (2, 20+)
Unit D: (2, 20+), (2, 64), (2, 88), (2, 148+), (1, 162+), (1, 184+), (1, 188+), (1, 198+), (1, 382+), (1, 436+)
Unit E: (1, 50+), (2, 64+), (2, 82), (1, 186+), (1, 214+), (1, 214), (2, 228+), (2, 262)
Unit F: (1, 6), (2, 16+), (1, 76), (2, 80), (2, 202), (1, 258+), (1, 268+), (2, 368+), (1, 380+), (1, 424+), (2, 428+), (2, 436+)
Unit G: (2, 32+), (1, 64+), (2, 102), (2, 162+), (2, 182+), (1, 364+)
Unit H: (2, 22+), (1, 22+), (1, 74+), (1, 88+), (1, 148+), (2, 162)
Unit I: (2, 8), (2, 16+), (2, 40), (1, 120+), (1, 168+), (2, 174+), (1, 268+), (2, 276), (1, 286+), (1, 366), (2, 396+), (2, 466+), (1, 468+)
Unit J: (1, 18+), (1, 36+), (2, 160+), (2, 254)
Unit K: (1, 28+), (1, 70+), (2, 106+)

Proportional Hazards Model

With proportional hazards and a Weibull baseline hazard, stratum i's hazard is

h(tij) = h0i(tij) exp(β0 + β1 xij) = ρi tij^(ρi - 1) exp(β0 + β1 xij),

where ρi > 0 and β = (β0, β1). As in the mice example,

μij = exp(β0 + β1 xij),

so that

Tij ~ Weibull(ρi, μij).

The ρi allow differing baseline hazards, which are increasing if ρi > 1 and decreasing if ρi < 1. As the strata may be similar, we model the shape parameters as exchangeable, i.e.

ρi ~ iid Gamma(α, α).

Thus the mean of the ρi is one, corresponding to a constant baseline hazard, with variance 1/α. We put a proper but low-information Gamma(3.0, 0.1) prior on α, reflecting a prior guess for the standard deviation of the ρi of 30^(-1/2) ≈ 0.18 and allowing a fairly broad region of values centered around one.

Weibull Prop. Hazards: WinBUGS Code

model{
  for (i in 1 : 69) {
    t[i] ~ dweib(rho[unit[i]], mu[i]) I(t.cen[i], )
    mu[i] <- exp(beta0+beta1*x[i])
  }
  for (k in 1:11){
    rho[k] ~ dgamma(alpha,alpha)
  }
  alpha ~ dgamma(3.0,0.1)
  beta0 ~ dnorm(0.0,0.001)
  beta1 ~ dnorm(0.0,0.001)
  r <- exp(2.0*beta1)
}

WinBUGS Output

Based on 10,000 iterations and a burn-in of 5,000:

node     mean     sd       MC error  2.5%     median   97.5%
alpha    48.45    20.12    0.3892    18.47    45.61    95.32
beta0    -6.788   0.4114   0.01758   -7.626   -6.78    -6.006
beta1    0.5973   0.2805   0.009956  0.06683  0.5894   1.189
r        3.887    2.515    0.08594   1.143    3.251    10.78
rho[1]   1.028    0.1078   0.002538  0.8111   1.029    1.237
rho[2]   0.9848   0.1456   0.003415  0.704    0.9794   1.289
rho[3]   0.972    0.1414   0.002471  0.7016   0.9696   1.255
rho[4]   0.999    0.1108   0.004363  0.7739   1.0      1.214
rho[5]   1.066    0.1024   0.002894  0.8667   1.064    1.273
rho[6]   0.9642   0.08855  0.002924  0.7894   0.9654   1.133
rho[7]   0.9724   0.1169   0.00354   0.748    0.9709   1.204
rho[8]   1.038    0.1273   0.003974  0.7931   1.038    1.296
rho[9]   0.9756   0.09325  0.003106  0.7885   0.9763   1.158
rho[10]  1.008    0.12     0.002795  0.7667   1.006    1.248
rho[11]  0.9616   0.1386   0.003722  0.6873   0.96     1.242

- Units A, E, and H have increasing baseline hazard functions (posterior mean of ρi > 1).
- All other units have roughly constant or decreasing baseline hazard functions (posterior mean of ρi ≤ 1).
- There is a significant treatment effect:
  the 95% CI for β1 does not include 0,
  the 95% CI for r does not include 1.
- The posterior mean of the relative risk is close to the frequentist estimate r̂ = 3.1 for the unstratified Cox proportional hazards model (cf. Introduction).

4.9 State-Space Modelling of Time Series

State-Space Modelling of Time Series

State-space models are among the most powerful tools for dynamic modeling and forecasting of time series and longitudinal data. Overviews can be found in Fahrmeir and Tutz (1994) and Kuensch (2001).

Observation equation:
yt = ht(θt) + vt
gives the conditional distribution of the observation yt at time t given the latent state θt; vt is an error term, e.g. N(0, σ²).

State equation:
θt = gt(θt-1) + ut
gives the Markovian transition from state θt-1 to θt, where ut denotes an error term.

The ability to include knowledge of the system behaviour in the statistical model is largely what makes state-space modeling so attractive for biologists, economists, engineers and physicists.

ML estimation of the unknown parameters and latent states is difficult. The Kalman filter is applicable only for linear Gaussian state-space models; for nonlinear, non-normal state-space models the likelihood function is intractable. For such models, Carlin et al. (1992) suggested the Gibbs sampler for posterior computation.

In the sequel, we will look at examples of state-space models implemented in WinBUGS.

Fisheries Stock Assessment: Data

Yellowfin tuna data from Pella and Tomlinson (1969)

The data available for stock assessment purposes quite often consist of a time series of annual catches Ct, t = 1, ..., N, and relative abundance indices It, t = 1, ..., N, such as research survey catch rates or catch-per-unit-effort (CPUE) indices from commercial fisheries.

For example, the next table gives an historical dataset of catch-effort data of South Atlantic albacore tuna (Thunnus alalunga) from 1967 to 1989. Catch is in thousands of tons and CPUE in kg/100 hooks.

Year (t)  Catch (Ct)  CPUE (It)
1967      15.9        61.89
1968      25.7        78.98
1969      28.5        55.59
1970      23.7        44.61
1971      25.0        56.89
...       ...         ...
1987      37.5        23.36
1988      25.9        22.36
1989      25.3        21.91

Fisheries Stock Assessment: Objectives

Age-composition data are not available for this stock. This dataset has previously been analysed by Polacheck et al. (1993).

Objectives: estimation of
- the size of the stock at the end of 1989,
- the maximum surplus production (MSP),
- the biomass at which MSP occurs (B_MSP),
- the optimal effort (E_MSP), the level of commercial fishing effort required to harvest MSP when the stock is at B_MSP.

When only catch-effort data are available, biomass dynamics models are the primary assessment tools for many fisheries (Hilborn and Walters 1992).

Fisheries Stock Assessment: Biomass Dynamics

Biomass dynamics model:

new biomass = old biomass + growth + recruitment - natural mortality - catch

The biomass dynamics equations can be written in the form

Bt = Bt-1 + g(Bt-1) - Ct-1,

where Bt, Ct, and g(Bt) denote the biomass at the start of year t, the catch during year t, and the surplus production function, respectively. g(0) = g(K) = 0, where K is the carrying capacity (the level of the stock biomass at equilibrium prior to commencement of the fishery).

Fisheries Stock Assessment: Surplus Production Model

The Schaefer (1954) form of the surplus production function is

g(Bt-1) = r Bt-1 (1 - Bt-1/K).

Substituting this into the biomass dynamics equation gives a parsimonious model describing the annual biomass transitions with just the two parameters r, the intrinsic growth rate, and K:

Bt = Bt-1 + r Bt-1 (1 - Bt-1/K) - Ct-1.                      (4.5)

Note that the annual catch is treated as a fixed constant.

Fisheries Stock Assessment: Relative Abundance Index

A common, though simplifying, assumption is that the relative abundance index is directly proportional to the biomass, i.e.

It = q Bt                                                     (4.6)

with catchability parameter q.

For the Schaefer surplus production model, the maximum surplus production MSP = rK/4 occurs at B_MSP = K/2. When the biomass indices are CPUEs from commercial fishing, the equation above gives MSP/E_MSP = qK/2, and thereby the optimal effort is E_MSP = r/(2q).
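As a deterministic illustration of (4.5) and (4.6), the R sketch below projects the Schaefer dynamics forward and returns the predicted CPUE and the management quantities MSP, B_MSP and E_MSP. The function name, the parameter values in the example call and the choice B1 = K are all illustrative assumptions; in the stock assessment the catches C would be the albacore catches from the table above.

schaefer <- function(r, K, q, C, B1 = K) {
  N <- length(C) + 1
  B <- numeric(N); B[1] <- B1
  for (t in 2:N)
    B[t] <- B[t-1] + r * B[t-1] * (1 - B[t-1] / K) - C[t-1]   # eq. (4.5)
  list(B = B, I.pred = q * B,                                  # eq. (4.6)
       MSP = r * K / 4, B.MSP = K / 2, E.MSP = r / (2 * q))
}
# e.g. schaefer(r = 0.3, K = 250, q = 0.25, C = c(15.9, 25.7, 28.5, 23.7, 25.0))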

Fisheries Stock Assessment: Process and Observation Error

Polacheck et al. (1993) compare three commonly used statistical techniques for fitting the model defined by equations (4.5) and (4.6): process error models, observation error models, and equilibrium models.

None of these is capable of incorporating the uncertainty present in both equations:
- natural variability underlying the annual biomass dynamics transitions (process error), and
- uncertainty in the observed abundance indices due to measurement and sampling error (observation error).

This is possible, however, using a state-space model. Equations (4.5) and (4.6) are the deterministic versions of the stochastic state and observation equations.

We assumed log-normal error structures. We used a reparametrization (Pt = Bt/K), expressing the annual biomass as a proportion of carrying capacity as in Millar and Meyer (2000), to speed mixing (i.e. sampling over the support of the posterior distribution) of the Gibbs sampler.

Fisheries Stock Assessment: State-Space Model

State equations:
P1 | σ² = e^u1,
Pt | Pt-1, K, r, σ² = (Pt-1 + r Pt-1 (1 - Pt-1) - Ct-1/K) e^ut,   t = 2, ..., N,          (4.7)

Observation equations:
It | Pt, q, τ² = q K Pt e^vt,   t = 1, ..., N,                                            (4.8)

where the ut are iid normal with mean 0 and variance σ², and the vt are iid normal with mean 0 and variance τ².

Fisheries Stock Assessment: Posterior Distribution

A fully Bayesian model consists of the joint prior distribution of all unobservables, here the five parameters K, r, q, σ², τ² and the unknown states P1, ..., PN, and the joint distribution of the observables, here the relative abundance indices I1, ..., IN.

We assume that the parameters K, r, q, σ², τ² are independent a priori. By a successive application of Bayes' theorem and the conditional independence of subsequent states, the joint prior density is given by

p(K, r, q, σ², τ², P1, ..., PN) = p(K) p(r) p(q) p(σ²) p(τ²) p(P1 | σ²) ∏_{t=2}^N p(Pt | Pt-1, K, r, σ²).          (4.9)

Fisheries Stock Assessment: Prior Specification

A noninformative prior is chosen for q: p(q) ∝ 1/q.

Prior distributions for K, r, σ², τ² are specified using biological knowledge and inferences from related species and stocks, as discussed in Millar and Meyer (2000):

K  ~ lognormal(μK = 5.04, σK = 0.5162),
r  ~ lognormal(μr = -1.38, σr = 0.51),
σ² ~ inverse-gamma(3.79, 0.0102),
τ² ~ inverse-gamma(1.71, 0.0086).

Fisheries Stock Assessment: Likelihood

Because of the conditional independence assumption for the relative abundance indices given the unobserved states, the sampling distribution is

p(I1, ..., IN | K, r, q, σ², τ², P1, ..., PN) = ∏_{t=1}^N p(It | Pt, q, τ²).          (4.10)

Then, by Bayes' theorem, the joint posterior distribution of the unobservables given the data is

p(K, r, q, σ², τ², P1, ..., PN | I1, ..., IN)
  ∝ p(K) p(r) p(q) p(σ²) p(τ²) p(P1 | σ²) ∏_{t=2}^N p(Pt | Pt-1, K, r, σ²) ∏_{t=1}^N p(It | Pt, q, τ²).          (4.11)

Fisheries Stock Assessment: WinBUGS Code

model {
# lognormal prior on K
K ~ dlnorm(5.042905,3.7603664)I(10,1000)
# lognormal prior on r
r ~ dlnorm(-1.151293,1.239084233)I(0.005,1.0)
# instead of improper (prop. to 1/q) use just proper IG
iq ~ dgamma(0.001,0.001)I(0.5,200)
q <- 1/iq
# inverse gamma on isigma2
isigma2 ~ dgamma(a0,b0)
sigma2 <- 1/isigma2
# inverse gamma on itau2
itau2 ~ dgamma(c0,d0)
tau2 <- 1/itau2

Pmean[1] <- 0
P[1] ~ dlnorm(Pmean[1],isigma2)I(0.05,1.6)
for (i in 2:N) {
  Pmean[i] <- log(max(P[i-1] + r*P[i-1]*(1-P[i-1]) - C[i-1]/K, 0.01))
  P[i] ~ dlnorm(Pmean[i],isigma2)I(0.05,1.5)
}
for (i in 1:N) {
  Imean[i] <- log(q*K*P[i])
  I[i] ~ dlnorm(Imean[i],itau2)
}
P24 ~ dlnorm(Pmean24, isigma2)I(0.05,1.5)
Pmean24 <- log(max(P[23] + r*P[23]*(1-P[23]) - C[23]/K, 0.01))
MSP <- r*K/4
B_MSP <- K/2
E_MSP <- r/(2*q)
}

Fisheries Stock Assessment: DAG

Figure 20: Representation of the surplus production model as a DAG.

Fisheries Stock Assessment: WinBUGS Output

Based on 100,000 iterations and a burn-in of 100,000:

node    mean      sd        MC error   2.5%      median    97.5%
BMSP    135.5     32.44     1.272      87.2      130.2     212.1
EMSP    0.6154    0.09112   0.001935   0.4346    0.6148    0.8002
K       271.0     64.88     2.544      174.4     260.4     424.2
MSP     19.52     2.537     0.05968    13.9      19.76     23.94
P[1]    1.018     0.05427   8.062E-4   0.919     1.016     1.133
P[2]    0.9944    0.07386   0.001368   0.8737    0.986     1.164
P[3]    0.8772    0.06548   0.001485   0.7616    0.8726    1.019
P[4]    0.7825    0.06205   0.001524   0.6711    0.779     0.9144
P[21]   0.4175    0.03452   8.162E-4   0.3545    0.4156    0.491
P[22]   0.353     0.03519   9.208E-4   0.292     0.35      0.4296
P[23]   0.3271    0.03964   0.00103    0.2573    0.3241    0.4123
P24     0.2964    0.04939   0.001221   0.2093    0.2926    0.4028
q       0.2486    0.06136   0.002411   0.1449    0.244     0.3777
r       0.3088    0.09576   0.003559   0.1416    0.3031    0.5104
sigma2  0.003105  0.001912  2.22E-5    0.001132  0.00261   0.008057
tau2    0.01225   0.004516  2.778E-5   0.005832  0.01145   0.02327

Example: Stochastic Volatility in Financial Time Series

The stochastic volatility (SV) model introduced by Tauchen and Pitts (1983) is used to describe financial time series. It offers an alternative to the ARCH-type models of Engle (1982) for the well-documented time-varying volatility exhibited in many financial time series.

The SV model provides a more realistic and flexible modeling of financial time series than the ARCH-type models, since it essentially involves two noise processes, one for the observations and one for the latent volatilities. The so-called observation errors account for the variability due to measurement and sampling errors, whereas the process errors assess variation in the underlying volatility dynamics.

Classical parameter estimation for SV models is difficult due to the intractable form of the likelihood function. Recently, a variety of frequentist estimation methods have been proposed for the SV model, including Generalized Method of Moments (Melino and Turnbull (1990), Sorenson (2000)), Quasi-Maximum Likelihood (Harvey et al., 1994), Efficient Method of Moments (Gallant et al., 1997), Simulated Maximum Likelihood (Danielsson, 1994, and Sandmann and Koopman, 1998), and approximate Maximum Likelihood (Fridman and Harris, 1998).

Bayesian MCMC procedures for the SV model have been suggested by Jacquier et al. (1994), Shephard and Pitt (1997), Kim et al. (1998) and Meyer and Yu (2000). Here we demonstrate the implementation of the Gibbs sampler in WinBUGS.

Stochastic Volatility: Data

The data consist of a time series of daily Pound/Dollar exchange rates {xt} from 01/10/81 to 28/6/85. The series of interest are the daily mean-corrected returns {yt}, given by the transformation

yt = log xt - log xt-1 - (1/n) Σ_{i=1}^n (log xi - log xi-1),   t = 1, ..., n.

returns.dat
-0.320221363079782
1.46071929942995
-0.408629619810947
1.06096027386685
1.71288920763163
0.404314365893326
-0.905699012715806
...
2.22371628398118

Stochastic Volatility: State-Space Model

The SV model used for analyzing these data can be written in the form of a nonlinear state-space model.

Observation equations:
yt | θt = exp(θt/2) ut,   ut ~ iid N(0, 1),   t = 1, ..., n.          (4.12)

State equations:
θt | θt-1, μ, φ, τ² = μ + φ(θt-1 - μ) + vt,   vt ~ iid N(0, τ²),   t = 1, ..., n,          (4.13)

with θ0 ~ N(μ, τ²).
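To see what data this model generates, the R sketch below simulates one path from (4.12)-(4.13). The series length and the parameter values are illustrative (chosen roughly in the range of the posterior estimates reported later), not estimates from the Pound/Dollar data.

set.seed(1)
n   <- 1000
mu  <- -0.7; phi <- 0.98; tau <- 0.15
theta  <- numeric(n)
theta0 <- rnorm(1, mu, tau)                              # theta_0 ~ N(mu, tau^2)
theta[1] <- mu + phi * (theta0 - mu) + rnorm(1, 0, tau)  # state equation (4.13)
for (t in 2:n)
  theta[t] <- mu + phi * (theta[t-1] - mu) + rnorm(1, 0, tau)
y <- exp(theta / 2) * rnorm(n)                           # observation equation (4.12)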

Stochastic Volatility: Parameters

- θt determines the amount of volatility on day t;
- the value of φ, -1 < φ < 1, measures the autocorrelation present in the logged squared data; thus φ can be interpreted as the persistence in the volatility;
- the constant scaling factor β = exp(μ/2) can be interpreted as the modal volatility; and
- τ as the volatility of the log-volatilities.

Stochastic Volatility: Prior Specification

By successive conditioning, the joint prior density is

p(μ, φ, τ², θ0, θ1, ..., θn) = p(μ, φ, τ²) p(θ0 | μ, τ²) ∏_{t=1}^n p(θt | θt-1, μ, φ, τ²).          (4.14)

- We employ a slightly informative prior for μ: μ ~ N(0, 10).
- We set φ = 2φ* - 1 and specify a Beta(α, β) prior for φ* with α = 20 and β = 1.5, which gives a prior mean for φ of 0.86.
- A conjugate inverse-gamma prior is chosen for τ², i.e. τ² ~ IG(2.5, 0.025), which gives a prior mean of 0.0167 and a prior standard deviation of 0.0236.

Stochastic Volatility: Likelihood

The likelihood p(y1, ..., yn | μ, φ, τ², θ0, ..., θn) is specified by the observation equations (4.12) and the conditional independence assumption:

p(y1, ..., yn | μ, φ, τ², θ0, ..., θn) = ∏_{t=1}^n p(yt | θt).          (4.15)

Stochastic Volatility: Posterior Distribution

Then, by Bayes' theorem, the joint posterior distribution of the unobservables given the data is proportional to the prior times the likelihood, i.e.

p(μ, φ, τ², θ0, ..., θn | y1, ..., yn) ∝ p(μ) p(φ) p(τ²) p(θ0 | μ, τ²) ∏_{t=1}^n p(θt | θt-1, μ, φ, τ²) ∏_{t=1}^n p(yt | θt).          (4.16)

Stochastic Volatility: DAG

Figure 21: Representation of the stochastic volatility model as a DAG.

The solid arrows indicate that, given its parent nodes, each node v is independent of all other nodes except the descendants of v.

For instance, if on day t we know the volatility on day t-1 and the values of the parameters μ, φ, and τ², then our belief about the volatility θt on day t is independent of the volatilities on the previous days 1 to t-2 and of the data of all other days except the current return yt.

Stochastic Volatility: WinBUGS Output

Based on 10,000 iterations and a burn-in of 10,000 (insufficient):

node   mean     sd       MC error  2.5%    median   97.5%
beta   0.7163   0.1244   0.00958   0.5554  0.6925   1.005
mu    -0.6927   0.3074   0.02252  -1.176  -0.735    0.01074
phi    0.9805   0.01081  8.306E-4  0.9552  0.9823   0.9962
tau    0.1493   0.03052  0.002965  0.1033  0.1435   0.2196

Stochastic Volatility: Final Remarks

This example clearly shows the limitations of the WinBUGS software. Generating 1000 MCMC iterations takes several seconds. Due to the high posterior correlation between the parameters, convergence is VERY slow and a huge number of MCMC iterations is required to achieve convergence. This takes almost prohibitively long.

More efficient samplers than the single-update Gibbs sampler can be constructed, either by so-called blocking of parameters, updating a whole parameter vector in one Gibbs sampling step, or by a Metropolis-Hastings algorithm with a multivariate proposal distribution.

4.10 Copulas

Copulas

The study of copulas and their applications in statistics is a rather modern phenomenon, although the concept goes back to Sklar (1959); interest in copulas has been growing over the last 15 years.

What are copulas?
The word copula is a Latin noun that means "a link, tie, bond". In statistics, copulas are functions that join or "couple" multivariate distribution functions to their one-dimensional marginal distribution functions. Or: copulas are multivariate distribution functions whose one-dimensional margins are uniform on the interval (0,1).
An extensive theoretical discussion of copulas can be found in Nelsen (2006).

Applications of Copulas

Copulas are used to
- study scale-free measures of dependence,
- construct families of bivariate/multivariate distributions (as alternatives to the multivariate normal, where the normal distribution does not provide an adequate approximation to many datasets, e.g. lifetime random variables and long-tailed claim variables).

Main applications:
- in financial risk assessment and actuarial analysis (some believe the methodology of applying the Gaussian copula to credit derivatives to be one of the reasons behind the global financial crisis of 2008-2009),
- in engineering for reliability studies,
- in biostatistics/epidemiology to model joint survival times of groups of individuals, e.g. husband and wife, twins, father and son, etc.

Definition of a Copula

Definition 4.7
A copula C(u1, ..., ud) is a multivariate distribution function on the unit hypercube [0, 1]^d with univariate marginal distributions that are all uniform on the interval [0, 1], i.e.

C(u1, ..., ud) = P(U1 ≤ u1, ..., Ud ≤ ud),

where Ui ~ Uniform(0, 1) for i = 1, ..., d. For ease of notation, we assume from now on that d = 2.

Sklar's Theorem (1959)

Theorem 4.8
Let F be a joint distribution function with margins F1 and F2. Then there exists a copula C such that for all x1, x2 in IR

F(x1, x2) = C(F1(x1), F2(x2)).                              (4.17)

If F1 and F2 are continuous, then C is unique. Conversely, if C is a copula and F1 and F2 are distribution functions, then the function F defined by (4.17) is a joint distribution function with margins F1 and F2.

Copula Density

By differentiation, it is easy to show that the density function of a bivariate distribution F(x1, x2) = C(F1(x1), F2(x2)) with marginal densities f1 and f2 is given by

f(x1, x2) = c(F1(x1), F2(x2)) f1(x1) f2(x2),                 (4.18)

where c denotes the copula density of C, i.e.

c(u1, u2) = ∂²C(u1, u2) / (∂u1 ∂u2).

Some Copula Families

Clayton copula:
C(u, v) = max(u^(-θ) + v^(-θ) - 1, 0)^(-1/θ),   θ ∈ [-1, ∞)\{0}

Frank copula:
C(u, v) = -(1/θ) log(1 + (e^(-θu) - 1)(e^(-θv) - 1) / (e^(-θ) - 1)),   θ ∈ (-∞, ∞)\{0}

Gumbel copula:
C(u, v) = u v exp(-θ log u log v),   θ ∈ (0, 1]

Gaussian copula:
C(u, v) = Φρ(Φ^(-1)(u), Φ^(-1)(v)),
where Φρ is the standard bivariate normal distribution function with correlation ρ, and Φ is the standard normal distribution function.
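As a small illustration of (4.18), the R sketch below codes the Clayton copula for θ > 0, a copula density obtained by differentiating C(u, v) twice (this analytic form is my own derivation, checked here against a finite-difference approximation), and a bivariate density with Exponential margins built from it. The parameter and rate values are illustrative only.

theta <- 2
C.clayton <- function(u, v) (u^(-theta) + v^(-theta) - 1)^(-1/theta)
c.clayton <- function(u, v)
  (1 + theta) * (u * v)^(-(theta + 1)) * (u^(-theta) + v^(-theta) - 1)^(-1/theta - 2)

# finite-difference check of the mixed partial derivative at one point
u <- 0.4; v <- 0.7; eps <- 1e-4
num <- (C.clayton(u+eps, v+eps) - C.clayton(u+eps, v-eps) -
        C.clayton(u-eps, v+eps) + C.clayton(u-eps, v-eps)) / (4 * eps^2)
c(analytic = c.clayton(u, v), numeric = num)

# bivariate density with Exponential(0.5) and Exponential(1) margins, eq. (4.18)
f.biv <- function(x1, x2, l1 = 0.5, l2 = 1)
  c.clayton(pexp(x1, l1), pexp(x2, l2)) * dexp(x1, l1) * dexp(x2, l2)
f.biv(1, 2)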

Dependence Measure: Concordance

Two observations (xi, yi) and (xj, yj) of a random vector (X, Y) are
concordant (discordant) if
xi < xj and yi < yj, or if xi > xj and yi > yj
(xi < xj and yi > yj, or if xi > xj and yi < yj),
or equivalently:
(xi - xj)(yi - yj) > 0    ((xi - xj)(yi - yj) < 0).

Informally, a pair of rvs is concordant if "large" values of one tend to
be associated with "large" values of the other and "small" values of one
with "small" values of the other.

Dependence Measure: Kendall's tau

The sample version of Kendall's tau is defined in terms of concordance
as follows:
Let (xi, yi), i = 1, . . . , n denote a random sample of n observations of
(X, Y). There are n(n-1)/2 distinct pairs (xi, yi) and (xj, yj) of
observations in the sample, and each pair is either concordant or
discordant. Let c denote the number of concordant pairs and d the
number of discordant pairs. Then Kendall's tau is defined as

τ = (c - d) / [n(n-1)/2] = (c - d)/(c + d).

The population version of Kendall's tau is defined as the probability of
concordance minus the probability of discordance:

τ = P[(X1 - X2)(Y1 - Y2) > 0] - P[(X1 - X2)(Y1 - Y2) < 0]
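A minimal base-R sketch of the sample version (the toy data are purely
illustrative): count concordant minus discordant pairs directly and compare
with R's built-in Kendall estimate.

# Toy data (illustrative only, no ties)
x <- c(1.2, 3.4, 2.2, 5.0, 4.1)
y <- c(0.8, 2.9, 3.5, 4.7, 3.3)

kendall_tau <- function(x, y) {
  n <- length(x)
  s <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      s <- s + sign((x[i] - x[j]) * (y[i] - y[j]))  # +1 concordant, -1 discordant
    }
  }
  s / (n * (n - 1) / 2)   # (c - d) / number of pairs
}

kendall_tau(x, y)
cor(x, y, method = "kendall")   # should agree in the absence of ties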

Relationship: Kendall's tau and copula parameter

We have the following functional relationships between Kendall's tau
and the parameters of the copula families above:

Clayton:  τ = α / (2 + α)

Frank:    τ = 1 - (4/α) [ 1 - (1/α) ∫_0^α t/(e^t - 1) dt ]

Gumbel:   τ = 1 - 1/α

Gauss:    τ = (2/π) arcsin(ρ)

Parameter Estimation

Flexible multivariate distributions can be constructed with
pre-specified, discrete and/or continuous marginal distributions and a
copula function that represents the desired dependence structure. The
joint distribution is usually estimated by a standard two-step procedure:
I the marginals are approximated by their empirical distribution, or the
parameters of the marginals are estimated via ML,
I the parameters in the copula function are estimated by maximum
likelihood conditional on the parameter estimates from the first step.

Here, we propose to estimate all parameters of the marginal distributions
and the copula jointly, using a Bayesian approach implemented in
WinBUGS, as in Kelly (2007).

Simulation Study

We use the copula package in R to simulate N = 500 bivariate failure
times from a Clayton copula with Exponential(λi) marginal distributions
and a Kendall's tau value of 0.8 (as a measure of the association
between the failure times). The rates for the marginal distributions are
λ1 = λ2 = 0.0001.
We use R2WinBUGS to sample from the posterior distribution of the
unknown parameters. We use an approximate Jeffreys prior for the rates
of the Exponential distributions, i.e. λi ∼ Gamma(0.001, 0.001), and we
assume a Uniform(0,100) prior for α (based, for instance, on a priori
information that the association between failure times is positive and
won't exceed 0.98).
To specify the likelihood, we need to calculate the density of the
multivariate distribution first, using (4.18). Exercise!

Simulation Study: R2WinBUGS Code

library(copula)
library(R2WinBUGS)
p <- 2                   # copula dimension
tau <- 0.8               # value of Kendall's tau
alpha <- 2*tau/(1-tau)   # relationship between tau and alpha for the Clayton copula
c.clayton <- archmCopula(family="clayton", dim=p, param=alpha)

# Marginals are exponential with rates lambda1 and lambda2
lambda1 <- 0.0001
lambda2 <- 0.0001
distr.clayton <- mvdc(c.clayton, margins=rep("exp",p),
    paramMargins = list(list(rate=lambda1), list(rate=lambda2)))
# Draw a random sample of size N
N <- 500
w <- rmvdc(distr.clayton, N)
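As a quick sanity check on the simulated data (assuming the objects created
by the code above), the Clayton parameter implied by tau = 0.8 is
alpha = 2(0.8)/(1 - 0.8) = 8, and the sample Kendall's tau of the simulated
pairs should be close to 0.8:

2 * 0.8 / (1 - 0.8)                       # implied Clayton parameter: 8
cor(w[, 1], w[, 2], method = "kendall")   # should be close to 0.8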

Simulation Study

[Figure 22: Scatterplot of the N = 500 simulated values w[,1] vs. w[,2] from
the Clayton copula with Exp(0.0001) marginals.]

Implementation in WinBUGS: Zeros Trick

If we want to implement parameter estimation of this copula model in
WinBUGS, we face a problem, as copula distributions are not included
in the list of standard distributions implemented in WinBUGS.
Fortunately, we can use the so-called zeros trick to specify a new
sampling distribution. An observation yi with new sampling distribution
f(yi|θ) contributes a likelihood term L(i) = f(yi|θ). Let l(i) = log L(i);
then the model likelihood can be written as

f(y1, . . . , yn|θ) = ∏_{i=1}^n f(yi|θ) = ∏_{i=1}^n e^{l(i)} = ∏_{i=1}^n (-l(i))^0 / 0! · e^{-(-l(i))},

i.e. the product of densities of Poisson random variables with
mean -l(i) and all observations equal to zero.

To ensure that the Poisson means are all positive, we may have to add
a positive constant C to each -l(i). This is equivalent to multiplying
the likelihood by the constant term e^{-nC}. With this approach, the
likelihood (up to that constant) can be written as the product of Poisson
likelihoods with observations all equal to zero:

f(y|θ) ∝ ∏_{i=1}^n (-l(i) + C)^0 / 0! · e^{-(-l(i)+C)} = ∏_{i=1}^n fPoisson(0 | -l(i) + C)

Generic WinBUGS code:

C <- 10000
for (i in 1:n){
  zeros[i] <- 0
  zeros[i] ~ dpois(zeros.mean[i])
  zeros.mean[i] <- -l[i] + C
  l[i] <- ... # expression of log-likelihood for obs. i
}

Implementation in WinBUGS: Ones Trick

As an alternative to the zeros trick, the Bernoulli distribution can be
used. The likelihood can be written as

f(y1, . . . , yn|θ) = ∏_{i=1}^n (e^{l(i)})^1 (1 - e^{l(i)})^0 = ∏_{i=1}^n fBernoulli(1 | e^{l(i)}),

i.e. the product of Bernoulli densities with success probability e^{l(i)}
and all observations equal to 1.
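A small numerical illustration of the identity behind the zeros trick, in
plain R rather than WinBUGS (the values of l and C are arbitrary, and C is
kept small here only to avoid underflow): the Poisson density at zero with
mean -l + C equals the likelihood contribution e^l scaled by the constant
e^(-C).

l <- -3.7   # some log-likelihood value
C <- 10     # offset making the Poisson mean positive
dpois(0, lambda = -l + C)   # Poisson(-l + C) density at 0
exp(l - C)                  # e^l * e^(-C): identical value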

Implementation in WinBUGS: Ones Trick

To ensure that the success probability is less than 1, we multiply each
likelihood term by e^{-C}, where C is a large positive constant. Then the
joint likelihood becomes:

f(y|θ) ∝ ∏_{i=1}^n (e^{l(i)-C})^1 (1 - e^{l(i)-C})^0 = ∏_{i=1}^n fBernoulli(1 | e^{l(i)-C})

Generic WinBUGS code:

C <- 100
for (i in 1:n){
  ones[i] <- 1
  ones[i] ~ dbern(ones.p[i])
  ones.p[i] <- exp(l[i] - C)
  l[i] <- ... # expression of log-likelihood for obs. i
}

Simulation Study: R2WinBUGS Code

# Call WinBUGS
data = list(N=500, x=w[,1], y=w[,2])
inits = list(list(lambda1=0.001, lambda2=0.002, alpha=5))
parameters = c("lambda1", "lambda2", "alpha")
clayton.sim <- bugs(data, inits, parameters.to.save=parameters,
    model.file="model_clayton.odc", n.chains=1,
    n.iter=2000, n.burnin=1000, working.directory=getwd())

This performs 2000 iterations of the Gibbs sampler with a burn-in
period of 1000 and monitors the values of the three model parameters.
The WinBUGS code in model_clayton.odc is:

Simulation Study: WinBUGS Code

model
{
  lambda1 ~ dgamma(0.001,0.001)   # Jeffreys prior
  lambda2 ~ dgamma(0.001,0.001)   # Jeffreys prior
  alpha ~ dunif(0,100)            # Uniform prior on alpha
  # likelihood specification using zeros trick
  C <- 10000
  for(i in 1:N) {
    zeros[i] <- 0
    zeros[i] ~ dpois(mu[i])
    mu[i] <- -l[i] + C
    u[i] <- 1 - exp(-lambda1*x[i])   # F1(x[i])
    v[i] <- 1 - exp(-lambda2*y[i])   # F2(y[i])
    # log of the Clayton copula density times the two exponential densities, cf. (4.18)
    l[i] <- log((1+alpha)*
      pow(pow(u[i],-alpha)+pow(v[i],-alpha)-1, -1/alpha-2)
      *pow(u[i],-alpha-1)*pow(v[i],-alpha-1)*
      lambda1*exp(-lambda1*x[i])*lambda2*exp(-lambda2*y[i]))
  }
}

Simulation Study: WinBUGS Output

Based on 1,000 iterations after a burn-in of 1,000:

node       mean       sd         MC error   2.5%       median     97.5%
alpha      8.001      0.3863     0.02022    7.279      8.007      8.789
deviance   1.002E+7   2.507      0.1517     1.002E+7   1.002E+7   1.002E+7
lambda1    9.434E-5   3.815E-6   4.306E-7   8.75E-5    9.401E-5   1.018E-4
lambda2    9.415E-5   3.813E-6   4.298E-7   8.723E-5   9.383E-5   1.017E-4

The posterior means are close to the true values used to simulate the data:
alpha = 2(0.8)/(1 - 0.8) = 8 and lambda1 = lambda2 = 0.0001.
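Assuming the bugs() call above returned a standard R2WinBUGS object,
essentially the same posterior summaries can also be inspected directly
from R, for instance:

print(clayton.sim)    # posterior summary table for the monitored parameters
clayton.sim$summary   # the same summaries as a numeric matrix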

5 References

Aitkin, M. (1997), The calibration of P-values, posterior Bayes factors and the AIC from the posterior distribution of the likelihood, Statistics and Computing 7, 253-272.

Aitkin, M. (2010), Statistical Inference: An Integrated Bayesian/Likelihood Approach, Chapman & Hall, Cambridge, UK.

Albert, J.H. (2007), Bayesian Computation with R, Springer, New York.

Bellhouse, D.R. (2004), The Reverend Thomas Bayes, FRS: A Biography to Celebrate the Tercentenary of his Birth, Statistical Science 19, 3-43.

Berger, J.O. and Wolpert, R.L. (1988), The Likelihood Principle, Hayward, CA.

Bernardo, J. and Smith, A. (1994), Bayesian Theory, Wiley, Chichester, UK.

Bolstad, W.M. (2004), Introduction to Bayesian Statistics, John Wiley & Sons.

Borel, E. (1921), La Theorie du jeu et les Equations Integrales a Noyau Symetrique, Comptes Rendus de l'Academie des Sciences 173, 1304-1308.

Cai, B. and Meyer, R. (2011), Bayesian semiparametric modeling of survival data based on mixtures of B-spline distributions, Computational Statistics and Data Analysis, to appear.

Carlin, B.P., Polson, N.G. and Stoffer, D.S. (1992), A Monte Carlo approach to nonnormal and nonlinear state-space modeling, Journal of the American Statistical Association 87, 493-500.

Carlin, B.P. and Louis, T.A. (2008), Bayesian Methods for Data Analysis, Chapman & Hall.

Carlin, B.P. and Hodges, J.S. (1999), Hierarchical Proportional Hazards Regression Models for Highly Stratified Data, Biometrics 55, 1162-1170.

Cox, D.R. (1972), Regression models and life tables, Journal of the Royal Statistical Society B 34, 187-220.

Cox, D.R. (1975), Partial Likelihood, Biometrika 62, 269-276.

Cox, D.R. and Oakes, D. (1984), Analysis of Survival Data, Chapman & Hall, London.

Dempster, A.P. (1974), The direct use of likelihood for significance testing, in (Barndorff-Nielsen et al., eds.) Proceedings of the Conference on the Foundational Questions of Statistical Inference, 335-352; reprinted in Statistics and Computing 7, 247-252 (1997).

Dey, D., Ghosh, S. and Mallick, B. (2000), Generalized Linear Models: A Bayesian Perspective, Marcel Dekker, New York.

Efron, B. (2005), Bayesians, Frequentists, and Scientists, Journal of the American Statistical Association 100.

Fahrmeir, L. and Tutz, G. (2001), Multivariate Statistical Modelling Based on Generalized Linear Models, Springer Series in Statistics, Springer-Verlag, New York.

Fisher, R.A. (1922), On the interpretation of chi-square from contingency tables and the calculation of P, Journal of the Royal Statistical Society B 85, 87-94.

Gelfand, A., Dey, D. and Chang, H. (1992), Model determination using predictive distributions with implementation via sampling-based methods, in (Bernardo et al., eds.) Bayesian Statistics 4, Oxford University Press, 407-425.

Gelman, A., Carlin, J., Stern, H. and Rubin, D. (2004), Bayesian Data Analysis, Texts in Statistical Science, 2nd ed., Chapman & Hall, London.

Gelman, A. and Meng, X.L. (1996), Model checking and model improvement, in (Gilks et al., eds.) Markov Chain Monte Carlo in Practice, Chapman & Hall, UK, 189-201.

Geman, S. and Geman, D. (1984), Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images, IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721-741.

George, E.I., Makov, U.E. and Smith, A.F.M. (1993), Conjugate Likelihood Distributions, Scandinavian Journal of Statistics 20, 147-156.

Gilks, W., Richardson, S. and Spiegelhalter, D. (1996), Markov Chain Monte Carlo in Practice, Chapman & Hall, Cambridge, UK.

Ibrahim, J.G., Chen, M-H. and Sinha, D. (2001), Bayesian Survival Analysis, Springer, New York.

Jeffreys, H. (1939), Theory of Probability, Oxford University Press, Oxford.

Jeffreys, H. (1961), Theory of Probability, 3rd edition, Oxford University Press, Oxford.

Kelly, D.L. (2007), Using Copulas to Model Dependence in Simulation Risk Assessment, Proceedings of the International Mechanical Engineering Congress and Exposition, IMECE2007-41284.

Keynes, J.M. (1922), A Treatise on Probability, Volume 8, St Martin's.

Klein, J.P. and Moeschberger, M.L. (1997), Survival Analysis, Springer, New York.

Kuensch, H.R. (2001), State space and hidden Markov models, in Barndorff-Nielsen et al. (eds.), Complex Stochastic Systems, Chapman & Hall, London, 109-174.

Lawless, J.F. (1982), Statistical Models and Methods for Lifetime Data, Wiley, New York.

McCarthy, M.A. (2007), Bayesian Methods for Ecology, Cambridge University Press.

McCullagh, P. and Nelder, J. (1989), Generalized Linear Models, Chapman & Hall, Cambridge, UK.

Meyer, R. and Yu, J. (2000), BUGS for a Bayesian analysis of stochastic volatility models, Econometrics Journal 3, 198-215.

Millar, R.B. and Meyer, R. (2000), State-Space Modeling of Non-Linear Fisheries Biomass Dynamics Using the Gibbs Sampler, Applied Statistics 49, 327-342.

Nelsen, R.B. (2006), An Introduction to Copulas, Springer, New York.

Ntzoufras, I. (2009), Bayesian Modeling Using WinBUGS, John Wiley & Sons, Inc.

Raiffa, H. and Schlaifer, R., Applied Statistical Decision Theory, MIT Press, Cambridge.

Ramsey, F.P. (1926), Truth and Probability, published in 1931 in The Foundations of Mathematics and Other Logical Essays, Ch. VII, 156-198.

Rubin, D.B. (1984), Bayesianly justifiable and relevant frequency calculations for the applied statistician, Annals of Statistics 12, 1151-1172.

Sklar, A. (1959), Fonctions de repartition a n dimensions et leurs marges, Publ. Inst. Stat. Univ. Paris 8, 229-231.

Spiegelhalter, D.J., Best, N.G., Carlin, B.P. and van der Linde, A. (2002), Bayesian measures of model complexity and model fit, Journal of the Royal Statistical Society B 64, 583-639.
