
Applied Bayesian Inference

Prof. Dr. Renate Meyer¹,²

¹ Institute for Stochastics, Karlsruhe Institute of Technology, Germany
² Department of Statistics, University of Auckland, New Zealand

KIT, Winter Semester 2010/2011

1 Introduction
1.1 Course Overview

Overview: Applied Bayesian Inference (Parts A and B)

Bayes theorem, discrete and continuous
Conjugate examples: Binomial, Exponential, Poisson, Normal, Exponential Family
Specification of prior distributions
Likelihood Principle
Multivariate and hierarchical models
Introduction to R
Introduction to WinBUGS
Techniques for posterior computation:
  Normal approximation
  Non-iterative simulation
  Simulation-based posterior computation: Markov Chain Monte Carlo
Convergence diagnostics with CODA
Basic model checking with WinBUGS
Bayes Factors, model checking and determination
Regression, ANOVA, GLM, hierarchical models, survival analysis, state-space models for time series, copulas
Decision-theoretic foundations of Bayesian inference

Computing

R: mostly covered in class
WinBUGS: completely covered in class
Other software: at your own risk

1.2 Why Bayesian Inference?

Why Bayesian Inference?
Or: what is wrong with standard statistical inference?

The two mainstays of standard/classical statistical inference are confidence intervals and hypothesis tests.
Anything wrong with them?

Example: Newcomb's Speed of Light

Example 1.1
Light travels fast, but it is not transmitted instantaneously. Light takes over a second to reach us from the moon and over 10 billion years to reach us from the most distant objects yet observed in the expanding universe. Because radio and radar also travel at the speed of light, an accurate value for that speed is important in communicating with astronauts and orbiting satellites. An accurate value for the speed of light is also important to computer designers because electrical signals travel only at light speed.
The first reasonably accurate measurements of the speed of light were made by Simon Newcomb between July and September 1882. He measured the time in seconds that a light signal took to pass from his laboratory on the Potomac River to a mirror at the base of the Washington Monument and back, a total distance of 7400 m. His first measurement was 24.828 millionths of a second.

Newcomb's Speed of Light: CI

Let us assume that the individual measurements Xi ~ N(μ, σ²) with known measurement variance σ² = 0.005². We want to find a 95% confidence interval for μ.
Answer:

x̄ ± 1.96 σ/√n

Because (X̄ − μ)/(σ/√n) ~ N(0, 1):

P(−1.96 < (X̄ − μ)/(σ/√n) < 1.96) = 0.95
P(X̄ − 1.96 σ/√n < μ < X̄ + 1.96 σ/√n) = 0.95

For the observed data this gives

P(24.8182 < μ < 24.8378) = 0.95.

This means that μ is in this interval with 95% probability. Certainly NOT!

Newcomb's Speed of Light: CI
The Level of Confidence

After collecting the data and computing the CI, this interval either contains the true mean or it does not. Its coverage probability is not 0.95 but either 0 or 1.
Then where does our 95% confidence come from?
Let us do an experiment:

draw 1000 samples of size 10 each from N(24.828, 0.005²)
for each sample calculate the 95% CI
check whether the true μ = 24.828 is inside or outside the CI

Newcomb's Speed of Light: Simulation

Sample    Coverage to date
1st       100%
2nd       100%
3rd       100%
4th       100%
5th       100%
6th       100%
7th       100%
8th       100%
9th       88.9%
10th      90.0%
...       ...
100th     94.0%
...       ...
991st     95.2%
...       ...
1000th    95.2%

Figure 1: Coverage over repeated sampling (true mean μ = 24.828).
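The repeated-sampling experiment behind Figure 1 is easy to reproduce in R. The following sketch is illustrative and not from the original slides; the seed is an arbitrary choice. It draws 1000 samples of size 10 from N(24.828, 0.005²), computes the usual z-based 95% CI for each, and reports the long-run coverage.

set.seed(1)                      # illustrative seed, not from the slides
mu    <- 24.828                  # true mean
sigma <- 0.005                   # known standard deviation
n     <- 10                      # sample size per replication
reps  <- 1000                    # number of simulated samples

covered <- replicate(reps, {
  x     <- rnorm(n, mean = mu, sd = sigma)
  lower <- mean(x) - 1.96 * sigma / sqrt(n)
  upper <- mean(x) + 1.96 * sigma / sqrt(n)
  (lower < mu) & (mu < upper)    # does this CI contain the true mean?
})

mean(covered)                    # long-run coverage, close to 0.95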

Newcomb's Speed of Light: CI

952 of the 1000 CIs include the true mean.
48 of the 1000 CIs do not include the true mean.
In reality, we don't know the true mean.
We do not sample repeatedly; we only take one sample and calculate one CI.
Will this CI contain the true value?
It either will or will not, but we do not know.
We take comfort in the fact that the method works 95% of the time in the long run, i.e. the method produces a CI that contains the unknown mean 95% of the time that the method is used in the long run.

Newcomb's Speed of Light: CI

By contrast, Bayesian confidence intervals, known as credible intervals, do not require this awkward frequentist interpretation.
One can make the more natural and direct statement concerning the probability of the unknown parameter falling in this interval.
One needs to provide additional structure to make this interpretation possible.

Newcomb's Speed of Light: Hypothesis Test

H0: μ ≤ μ0 (= 24.828)   versus   H1: μ > μ0

Test statistic:

U = (X̄ − μ0)/(σ/√n) ~ N(0, 1)   if μ = μ0

Small values of u_obs are consistent with H0, large values favour H1.

P-value:

p = P(U > u_obs | μ = μ0) = 1 − Φ(u_obs)

If the P-value < 0.05 (= usual type I error rate), reject H0.

The P-value is the probability that H0 is true. Certainly NOT.

Newcomb's Speed of Light: Hypothesis Test

The P-value is the probability of observing a value of the test statistic that is more extreme than the actually observed value u_obs if the null hypothesis were true (under repeated sampling).

We can do another thought experiment:

imagine we take 1000 samples of size 10 from a Normal distribution with mean μ0
we calculate the P-value for each sample
it will only be smaller than 0.05 in about 5% of the samples, i.e. in about 50 samples
we take comfort in the fact that this test works 95% of the time in the long run, i.e. it rejects H0, even though H0 is true, in only 5% of the cases that this method is used

Newcomb's Speed of Light: Hypothesis Test

Most practitioners are tempted to say that the P-value is the probability that H0 is true.
It can only offer evidence against the null hypothesis. A large P-value does not offer evidence that H0 is true.
A P-value cannot be directly interpreted as "weight of evidence" but only as a long-term probability (in a hypothetical repetition of the same experiment) of obtaining data at least as unusual as what was actually observed.
P-values depend not only on the observed data but also on the sampling probability of certain unobserved data points. This violates the Likelihood Principle.
This has serious practical implications, for instance for the analysis of clinical trials, where interim analyses and unexpected drug toxicities often change the original trial design.

Newcomb's Speed of Light: Hypothesis Test

By contrast, the Bayesian approach to hypothesis testing, due primarily to Jeffreys (1961), is much simpler and avoids the pitfalls of the traditional Neyman-Pearson-based approach.
It allows the direct calculation of the probability that a hypothesis is true and thus has a direct and straightforward interpretation.
Again, as in the case of CIs, we need to add more structure to the underlying probability model.

1.3 Historical Overview

Historical Overview

Figure 2: From William Jefferys' webpage, Univ. of Texas at Austin.

Inverse Probability

Bayes and Laplace (late 1700s): inverse probability
Example: Given x successes in n iid trials with success probability θ:

probability statements about observables given assumptions about unknown parameters, e.g. P(9 ≤ X ≤ 12 | θ): deductive
inverse probability statements about unknown parameters given observed data values, e.g. P(a < θ < b | X = 9): inductive

Thomas Bayes

(b. 1702, London; d. 1761, Tunbridge Wells, Kent)
Presbyterian minister and mathematician

Figure 3: Reverend Thomas Bayes, 1702-1761.

Bayes' Biography

Bellhouse, D.R. (2004) The Reverend Thomas Bayes, FRS: A Biography to Celebrate the Tercentenary of His Birth. Statistical Science 19(1):3-43.

Son of one of the first 6 Nonconformist ministers in England
Private education (by De Moivre?)
Ordained as Nonconformist minister and took the position of minister at the Presbyterian Chapel, Tunbridge Wells
Educated and interested in mathematics, probability and statistics; believed to be the first to use probability inductively; defended the views and philosophy of Sir Isaac Newton against criticism by Bishop Berkeley
Two papers published while he was still living:
  Divine Providence and Government is the Happiness of His Creatures (1731)
  An Introduction to the Doctrine of Fluxions, and a Defense of the Analyst (1736)

Bayes' Biography

Elected Fellow of the Royal Society in 1742
Most well-known paper published posthumously, submitted by his friend Richard Price: "Essay Towards Solving a Problem in the Doctrine of Chances" (1763), Philosophical Transactions of the Royal Society of London. It begins with:
  "Given the number of times in which an unknown event has happened and failed: Required the chance that the probability of its happening in a single trial lies somewhere between any two degrees of probability that can be named."

Figure 4: Bayes' vault at Bunhill Fields, London.

18th and 19th Century

Bayes laid the foundations of modern Bayesian statistics.
Pierre Simon Laplace (1749-1827), French mathematician and astronomer, developed mathematical astronomy and statistics; he refined inverse probability, acknowledging Bayes' work in a monograph in 1812.
George Boole challenged inverse probability in his Laws of Thought in 1854. The Bayesian approach has been controversial ever since, but it was predominant in practical applications until the early 20th century because of the lack of a frequentist alternative. Inverse probability became an integral part of the universities' statistics curriculum.

20th Century

Sir R.A. Fisher (1890-1962) was a lifelong critic of inverse probability and one of the most important persons involved in the demise of inverse probability.

Figure 5: Sir Ronald A. Fisher (1890-1962).

20th Century

Fisher's (1922) paper revolutionized statistical thinking by introducing the notions of "maximum likelihood", "sufficiency", and "efficiency". His main argument was that one needed to look at the likelihood of the data given the theory, NOT the likelihood of the theory given the data. He thus advocated an "indirect" approach to statistical inference based on ideas of logic called "proof by contradiction".
His work impressed two young statisticians at University College London: J. Neyman and E. Pearson. They developed the mathematical theory of significance testing and confidence intervals, which had a huge influence on statistical applications (for good or bad).

Rise of Subjective Probability

Inverse probability ideas were studied by Keynes (1921), Borel (1921) and Ramsey (1926).
In the 1930s, Harold Jeffreys engaged in a published exchange with R.A. Fisher on Fisher's fiducial argument and Jeffreys' inverse probability. Jeffreys' (1939) book "Theory of Probability" is the most cited in the current "objective Bayesian" literature.
In Italy in the 1930s, Bruno de Finetti gave a different justification for subjective probability, introducing the notion of "exchangeability".
Neo-Bayesian revival in the 1950s (Savage, Good, Lindley, ...).
The current huge popularity of Bayesian methods is due to fast computers and MCMC methods.
Syntheses of Bayesian and non-Bayesian methods? See e.g. Efron (2005), "Bayesians, frequentists, and scientists".

1.4 Bayesian and Frequentist Inference

Two main approaches to statistical inference

the Bayesian approach
  - parameters are random variables
  - subjective probability (for some)
the frequentist/conventional/classical/orthodox approach
  - parameters are fixed but unknown quantities
  - probability as long-run relative frequency
Some controversy in the past (and present)
In this course: not adversarial

Motivating Example: CPCRA AIDS Trial

Carlin and Hodges (1999), Biometrics
Compare two treatments for Mycobacterium avium complex, a disease common in late-stage HIV-infected people
Total of 69 patients
In 11 clinical centers
5 deaths in treatment group 1
13 deaths in treatment group 2

Primary Endpoint Data

Unit Treatm. Time    Unit Treatm. Time    Unit Treatm. Time
A    1       74+     B    2       4+      F    1       6
A    2       248     B    1       156+    F    2       16+
A    1       272+    C    2       20+     F    1       76
A    2       244     E    1       50+     F    2       80
D    2       20+     E    2       64+     F    2       202
D    2       64      E    2       82      F    1       258+
D    2       88      E    1       186+    F    1       268+
D    2       148+    E    1       214+    F    2       368+
D    1       162+    E    1       214     F    1       380+
D    1       184+    E    2       228+    F    1       424+
D    1       188+    E    2       262     F    2       428+
D    1       198+    H    2       22+     F    2       436+
D    1       382+    H    1       22+     I    2       8
D    1       436+    H    1       74+     I    2       16+
G    2       32+     H    1       88+     I    2       40
G    1       64+     H    1       148+    I    1       120+
G    2       102     H    2       162     I    1       168+
G    2       162+    K    1       28+     I    2       174+
G    2       182+    K    1       70+     I    1       268+
G    1       364+    K    2       106+    I    2       276
J    1       18+                          I    1       286+
J    1       36+                          I    1       366
J    2       160+                         I    2       396+
J    2       254                          I    2       466+
                                          I    1       468+

(Times marked with + are censored.)

Data Safety and Monitoring Board

Decision based on:

Stratified Cox proportional hazards model:
  relative risk r = 1.9 with 95%-CI [0.6, 5.9], P-value 0.24
Unstratified Cox proportional hazards model:
  relative risk r = 3.1 with 95%-CI [1.1, 8.7], P-value 0.02

On the basis of the stratified analysis, the Board would have had to continue the trial.
The P-value of the unstratified analysis was small enough to convince the Board to stop the trial.

Stratified Cox PH Model

Stratified:    hi(t) = h0i(t) exp(β′x)
Unstratified:  hi(t) = h0(t) exp(β′x)

Why does the stratified analysis fail to detect the treatment difference?
Contribution of the ith stratum to the partial likelihood:

Li(β) = ∏_{k=1}^{di} exp(β′x_ik) / Σ_{j ∈ R_ik} exp(β′x_ij)

If the largest time in the ith stratum is a death, then the partial likelihood derives no information from this event.
This is the case in the study: 4 deaths have the largest survival time in their stratum, and these are all in treatment group 2.

Compromise Stratified-Unstratified Analysis?

unit-specific dummy variables
frailty model
stratum-specific baseline hazards are random draws from a certain population of hazard functions

Bayesian analysis offers a flexibility in modelling that is not possible with the frequentist approach.
We will analyze this example in a Bayesian way in Chapter 4.

Some Advantages of Bayesian Inference

Offers hitherto unknown flexibility in statistical modelling
Highly nonlinear models with many parameters can be analyzed
Can handle "nuisance" parameters that pose problems for frequentist inference
Does not rely on large-sample asymptotics, but gives valid inference also for small sample sizes
Possibility to incorporate prior knowledge and expert judgement
Adheres to the Likelihood Principle

1.5 Discrete Version of Bayes Theorem

Reminder of Bayes' Theorem: Discrete Case

Theorem 1.2
Let A1, A2, ..., An be a set of mutually exclusive and exhaustive events. Then

P(Ai | B) = P(Ai)P(B|Ai)/P(B) = P(Ai)P(B|Ai) / Σ_j P(Aj)P(B|Aj).

Chess Example

Example 1.3
You are in a chess tournament and will play your next game against either Jun or Martha, depending on the results of some other games. Suppose your probability of beating Jun is 7/10, but of beating Martha is only 2/10. You assess your probability of playing Jun as 1/4.

How likely is it that you win your next game?

Given:
P(W|J) = 7/10,  P(W|M) = 2/10
P(J) = 1/4,     P(M) = 3/4

Then P(W) = P(W|J)P(J) + P(W|M)P(M) = 7/10 · 1/4 + 2/10 · 3/4 = 13/40 = 0.325.

Chess Example

Now suppose that you tell me you won your next chess game. Who was your opponent?

P(J|W) = P(W|J)P(J) / [P(W|J)P(J) + P(W|M)P(M)] = 7/13

Diagnostic Testing

Example 1.4
A new home HIV test is claimed to have 95% sensitivity and 98% specificity. In a population with an HIV prevalence of 1/1000, what is the chance that someone testing positive actually has HIV?
Let A be the event that the individual is truly HIV positive and Ā be the event that the individual is truly HIV negative.
P(A) = 0.001.
Let B be the event that the test is positive. We want P(A|B).
"95% sensitivity" means that P(B|A) = 0.95.
"98% specificity" means that P(B̄|Ā) = 0.98, or P(B|Ā) = 0.02.

Diagnostic Testing

Now Bayes' theorem says

P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|Ā)P(Ā)]
       = (0.95 · 0.001) / (0.95 · 0.001 + 0.02 · 0.999) = 0.045.

Thus, over 95% of those testing positive will, in fact, not have HIV.

The following example caused a stir in 1991 after a US columnist, who calls herself Marilyn vos Savant, used it in her column. She gave the correct answer. A surprising number of mathematicians wrote to her saying that she was wrong.

Monty Hall Problem

Example 1.5
You are a contestant on the TV show "Let's Make a Deal" and given the choice of three doors. Two of the doors have a goat behind them and one a car. You choose a door, say door 2, but before opening the chosen door, the emcee, Monty Hall, opens a door that has a goat behind it (e.g. door 1). He gives you the option of revising your choice or sticking to your first choice. What do you do?

Since either door 2 or door 3 must hide the car, it was claimed that the probability of winning by sticking with door 2 had increased to 1/2.
Obviously, switch to door 3. The probability of finding the car behind either door 1 or door 3 is 2/3. As the emcee showed you that it is not behind door 1, the probability that it is behind door 3 is 2/3.

Monty Hall Problem

With Bayes' theorem:

Let Ai = "car behind door no. i", i = 1, 2, 3. These form a partition.
P(Ai) = 1/3 are the prior probabilities for i = 1, 2, 3.

Let B = "Monty Hall opens door 1 (with goat)".
P(B|A1) = 0      likelihood of A1
P(B|A2) = 1/2    likelihood of A2
P(B|A3) = 1      likelihood of A3

We want
P(A3|B) = P(B|A3)P(A3) / [P(B|A1)P(A1) + P(B|A2)P(A2) + P(B|A3)P(A3)]
        = (1 · 1/3) / (0 · 1/3 + 1/2 · 1/3 + 1 · 1/3) = 2/3.

Bayes' Theorem again

Let H1, H2, ..., Hn denote n hypotheses (mutually disjoint) and D observed data. Then Bayes' theorem says:

P(Hi|D) = P(Hi)P(D|Hi) / Σ_j P(Hj)P(D|Hj).

P(D|Hi) are known as likelihoods, the likelihoods given to Hi by D, or, as statisticians usually say, the likelihood of Hi given D. (This notion is used extensively in frequentist statistical inference; the method of maximum likelihood means finding the hypothesis under which the observations are most likely to have occurred.)
P(Hi) are prior probabilities.
P(Hi|D) are posterior probabilities.
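A quick simulation makes the 2/3 answer concrete. The following R sketch is illustrative only (not part of the original slides); it repeats the game many times with the contestant initially picking door 2, as in Example 1.5, and records how often switching wins.

set.seed(1)                              # illustrative seed
reps <- 100000
wins_switch <- replicate(reps, {
  car    <- sample(1:3, 1)               # door hiding the car
  pick   <- 2                            # contestant's initial choice (door 2, as in Example 1.5)
  # Monty opens a door that is neither the pick nor the car
  opened <- if (car == pick) sample(setdiff(1:3, pick), 1) else setdiff(1:3, c(pick, car))
  switched_to <- setdiff(1:3, c(pick, opened))
  switched_to == car                     # TRUE if switching wins
})
mean(wins_switch)                        # close to 2/3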

Importance of Prior Plausibility

Example 1.6
D = event that I look through my window and see a tall, branched thing with green blobs covering its branches.

H1 = tree
H2 = man
H3 = something else

Why do I think it is a tree?

P(D|H1) is close to 1, whereas P(D|H2) is close to 0.
But likelihood is not the only consideration in this reasoning.
More specifically, let H3 = cardboard replica of a tree. Then P(D|H3) is close to 1.
H3 has the same likelihood as H1, but it is not a plausible hypothesis because it has a very much lower prior probability.

Importance of Prior Plausibility

H1 has a high prior probability.
H2 has a high prior probability.
H3 has a low prior probability.

Bayes' theorem is in complete accord with this natural reasoning. The posterior probabilities of the various hypotheses are in proportion to the products of their prior probabilities and their likelihoods:

P(Hi|D) ∝ P(Hi)P(D|Hi)

Bayes' theorem thus combines two sources of information:
prior information represented by prior probabilities
new information represented by likelihoods
These together add up to the total information represented by the posterior probabilities.

2 Bayesian Inference
2.1 Statistical Model

Notation and Definitions

Here, we only consider parametric models.
We assume that the observations X1, ..., Xn have been generated from a parametrized probability distribution, i.e., Xi (1 ≤ i ≤ n) has a distribution with probability density function (pdf) f(xi|θ) on ℝ, such that the parameters θ = (θ1, ..., θp) are unknown and the pdf f is known. This model can then be represented more simply by X ~ f(x|θ), where x is the vector of observations and θ the vector of parameters.

Definition 2.1
A parametric statistical model consists of the observation of a random variable X, distributed according to f(x|θ), where only the parameter θ is unknown and belongs to a vector space Θ ⊆ ℝ^p of finite dimension.

Example: Xi ~ N(μ, σ²) iid for i = 1, ..., n. Then

f(x|μ, σ²) = ∏_{i=1}^n f(xi|μ, σ²) = ∏_{i=1}^n (1/(σ√(2π))) exp(−(xi − μ)²/(2σ²)),

with θ = (μ, σ²).

We are usually interested in questions of the form:
What is the value of θ1?  (parameter estimation)
Is θ1 larger than θ3?  (hypothesis testing)
What is the most likely value of a future event, whose distribution depends on θ?  (prediction)

2.2 Likelihood-based Functions

Overview

In this section, we will introduce (or remind you of):
likelihood function
maximum likelihood estimation
information criteria
score function
Fisher information

Likelihood Function

Definition 2.2
The likelihood function of θ is the function that associates the value f(x|θ) to each θ. This function is denoted by l(θ; x). Other common notations are l_x(θ), l(θ|x) and l(θ). It is defined by

l(θ; x) = f(x|θ),   θ ∈ Θ,    (2.1)

where x is the observed value of X.

The likelihood function associates to each value of θ the probability of an observed value x for X (if X is discrete). Then, the larger the value of l, the greater are the chances associated to the event under consideration, using that particular value of θ. Therefore, by fixing the value of x and varying θ, we observe the plausibility (or likelihood) of each value of θ. The likelihood function is of fundamental importance in many theories of statistical inference.

Maximum Likelihood Estimate

Definition 2.3
Any vector θ̂ maximizing (2.1) as a function of θ, with x fixed, provides a maximum likelihood (ML) estimate of θ.
In intuitive terms, this gives the realization of θ most likely to have given rise to the current data set, an important finite-sample property.

Note that even though ∫ f(x|θ) dx = 1, in general ∫ l(θ; x) dθ ≠ 1.

General Information Criteria

Modeling process: Suppose f belongs to some family F of meaningful functional forms, but where the dimension p of the parameter may vary among members of the family. Then choose f ∈ F to maximize

GIC = General Information Criterion = log l(θ̂; x) − α p/2.

Here log l(θ̂; x) denotes the maximum of the log-likelihood function, and α/2 provides a penalty per parameter in the model.

Two choices of α:
α = 2 (Akaike, 1978):
  AIC = Akaike Information Criterion = log l(θ̂; x) − p
α = log(n/2) (Schwarz, 1978):
  BIC = Bayesian Information Criterion = log l(θ̂; x) − (p/2) log(n/2)

Binomial Example

Example 2.4
X ~ Binomial(2, θ). Then

f(x|θ) = l(θ; x) = C(2, x) θ^x (1 − θ)^(2−x),   x = 0, 1, 2;  Θ = (0, 1),

where C(2, x) is the binomial coefficient.

Note that:
1. If x = 1 then l(θ; x = 1) = 2θ(1 − θ). The value of θ that gives highest likelihood to x = 1, or, in other words, the most likely value of θ, is 0.5.
2. If x = 2 then l(θ; x = 2) = θ². The most likely value of θ is 1.
3. If x = 0 then l(θ; x = 0) = (1 − θ)². The most likely value is 0.

Also note that

Σ_x f(x|θ) = 1

but

∫_0^1 l(θ; x) dθ = C(2, x) ∫_0^1 θ^x (1 − θ)^(2−x) dθ = C(2, x) B(x + 1, 3 − x) = 1/3 ≠ 1.
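These three likelihood curves are plotted in Figure 6 below. A short R sketch (illustrative, not from the original slides) reproduces the figure:

theta <- seq(0, 1, length.out = 200)
plot(theta, (1 - theta)^2, type = "l", lty = 1, ylim = c(0, 1),
     xlab = "theta", ylab = "likelihood")          # l(theta; x = 0)
lines(theta, 2 * theta * (1 - theta), lty = 2)     # l(theta; x = 1)
lines(theta, theta^2, lty = 3)                     # l(theta; x = 2)
legend("top", lty = 1:3,
       legend = c("l(theta;x=0)", "l(theta;x=1)", "l(theta;x=2)"))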

Binomial Example

Figure 6: Likelihood function l(θ; x) for x = 0, 1, 2.

Geometric Example

Example 2.5
Let X1, X2, ..., Xn denote a random sample from a geometric distribution with pdf

f(Xi = xi|θ) = θ(1 − θ)^(xi − 1),   xi = 1, 2, ....

a) Find the likelihood function of θ.

l(θ; x) = P(X1 = x1, X2 = x2, ..., Xn = xn|θ) = f(x1, ..., xn|θ)
        = ∏_{i=1}^n f(xi|θ) = ∏_{i=1}^n θ(1 − θ)^(xi − 1)
        = θ^n (1 − θ)^(Σ_{i=1}^n (xi − 1))
        = θ^n (1 − θ)^(n(x̄ − 1))

(This is a Beta curve as a function of θ.)

Geometric Example

b) The maximum likelihood estimate θ̂ of θ maximizes the probability of obtaining the observations actually observed. Find θ̂.

It is easier to maximize the log-likelihood:

log l(θ; x) = n log θ + n(x̄ − 1) log(1 − θ)

d/dθ log l(θ; x) = n/θ − n(x̄ − 1)/(1 − θ) = 0   ⟹   θ̂ = 1/x̄

d²/dθ² log l(θ; x) = −n/θ² − n(x̄ − 1)/(1 − θ)² < 0

Thus θ̂ is a global maximum.

Geometric Example

c) The invariance property of maximum likelihood estimates tells us that for any function ψ = g(θ) of θ, ψ̂ = g(θ̂) is the ML estimate of g(θ).
Find the ML estimate of ψ = θ(1 − θ) = P(X1 = 2).

ψ̂ = θ̂(1 − θ̂) = (1/x̄)(1 − 1/x̄)

Exponential Example

Example 2.6
Let X1, X2, ..., Xn denote a random sample from the exponential distribution with unknown location parameter θ, unknown scale parameter β, and pdf

f(x|θ, β) = β exp{−β(x − θ)},   θ < x < ∞,

where −∞ < θ < ∞ and 0 < β < ∞.
The common mean and variance of the Xi are μ = θ + β⁻¹ and σ² = β⁻². Find the likelihood function of θ and β and the ML estimates of μ and σ², in situations where the observed values x1, x2, ..., xn are not all equal.

Exponential Example

The joint pdf of X1, ..., Xn is

f(x1, ..., xn|θ, β) = ∏_{i=1}^n f(xi|θ, β) = ∏_{i=1}^n β exp{−β(xi − θ)} I(θ ≤ xi).

Thus, the likelihood of θ and β when x1, ..., xn are observed is

l(θ, β; x1, ..., xn) = β^n exp{−β Σ_{i=1}^n (xi − θ)} ∏_{i=1}^n I(θ ≤ xi).

Exponential Example

Defining z = min(x1, ..., xn):

l(θ, β; x1, ..., xn) = β^n exp{−nβ(x̄ − θ)} I(θ ≤ z)

As a function of θ,

l(θ, β; x1, ..., xn) ∝ exp(nβθ) for θ ≤ z, and 0 otherwise.

This is maximized when θ = θ̂ = z.
Now, as a function of β, the likelihood is proportional to

g(β) = β^n exp{−aβ},   with a = n(x̄ − θ̂) > 0 (if x1, ..., xn are not all equal).

Exponential Example

Then

log g(β) = n log β − aβ
d log g(β)/dβ = n/β − a = 0   ⟹   β̂ = n/a = 1/(x̄ − z)

This is a global maximum, as the second derivative is always negative.
By the invariance property of ML estimators:

μ̂ = θ̂ + β̂⁻¹ = z + (x̄ − z) = x̄
σ̂² = β̂⁻² = (x̄ − z)²
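A quick numerical check of these closed-form estimates in R (illustrative sketch; the parameter values, sample size and seed below are arbitrary choices, not from the slides):

set.seed(1)                         # illustrative values, not from the slides
theta <- 2; beta <- 0.5; n <- 500
x <- theta + rexp(n, rate = beta)   # shifted exponential sample

z          <- min(x)                # ML estimate of theta
beta.hat   <- 1 / (mean(x) - z)     # ML estimate of beta
mu.hat     <- mean(x)               # ML estimate of mu = theta + 1/beta
sigma2.hat <- (mean(x) - z)^2       # ML estimate of sigma^2 = 1/beta^2
c(z, beta.hat, mu.hat, sigma2.hat)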

Fisher Information

Definition 2.7
Let X be a random vector with pdf f(x|θ) depending on a 1-dimensional parameter θ.
The expected Fisher information measure of θ through X is defined by

I(θ) = E_{X|θ}[ −∂² log f(X|θ) / ∂θ² ].

If θ = (θ1, ..., θp) is a vector, then the expected Fisher information matrix of θ through X is defined by

I(θ) = E_{X|θ}[ −∂² log f(X|θ) / ∂θ ∂θ′ ]

with elements Iij(θ) given by

Iij(θ) = E_{X|θ}[ −∂² log f(X|θ) / ∂θi ∂θj ],   i, j = 1, ..., p.

Fisher Information

The information measure defined this way is related to the mean value of the curvature of the likelihood. The larger this curvature is, the larger is the information content summarized in the likelihood function, and so the larger I(θ) will be. Since the curvature is expected to be negative, the information value is taken as minus the curvature. The expectation is taken with respect to the sample distribution.
The observed Fisher information corresponds to minus the second derivative of the log-likelihood:

J_X(θ) = −∂² log f(X|θ) / ∂θ ∂θ′,

and is interpreted as a local measure of the information content, while its expected value, the expected Fisher information, is a global measure.

Fisher Information Example

Example 2.8
Let X ~ N(μ, σ²) with σ² known. It is easy to get I(μ) = J_X(μ) = σ⁻², the normal precision. Verify!

log f(X|μ) = log{ (1/(σ√(2π))) exp(−(X − μ)²/(2σ²)) } = const − (X − μ)²/(2σ²)

d/dμ log f(X|μ) = (X − μ)/σ²
d²/dμ² log f(X|μ) = −1/σ²

I(μ) = E[ −d²/dμ² log f(X|μ) ] = E[1/σ²] = 1/σ² = J_X(μ),

i.e. the normal precision.

Fisher Information

One of the most useful properties of the Fisher information is the additivity of the information with respect to independent observations. This means that if X = (X1, ..., Xn) are independent random variables with densities fi(x|θ), and I and Ii are the expected Fisher information measures obtained through X and Xi, respectively, then

I(θ) = Σ_{i=1}^n Ii(θ).

This states that the total information obtained from independent observations is the sum of the information of the individual observations.

Score Function

Definition 2.9
The score function of X is defined as

U(X; θ) = ∂ log f(X|θ) / ∂θ.

One can show that under certain regularity conditions:

I(θ) = E_{X|θ}[ U²(X; θ) ].

In a large number of situations, θ̂ will, for large n, possess a distribution that is approximately multivariate normal with mean vector θ and covariance matrix I(θ)⁻¹.
The vector I(θ)^(1/2)(θ̂ − θ) is said to converge in distribution, as n → ∞ with p fixed, to a standard spherical normal distribution (i.e. a multivariate normal distribution N(0, Ip) with zero mean vector and covariance matrix equal to the p × p identity matrix).

Example: Fisher Info for Binomial

Example 2.10
Let X1, ..., Xn ~ Binomial(1, θ) iid. Show that the ML estimate of θ has an asymptotic N(θ, θ(1 − θ)/n) distribution.

Xi|θ ~ Binomial(1, θ) with E(Xi) = θ and Var(Xi) = θ(1 − θ).

l(θ; x1, ..., xn) = ∏_{i=1}^n f(xi|θ) = ∏_{i=1}^n θ^(xi)(1 − θ)^(1−xi) = θ^(Σ xi)(1 − θ)^(n − Σ xi) = θ^x (1 − θ)^(n−x),

where x = Σ_{i=1}^n xi.

log l(θ; x1, ..., xn) = x log θ + (n − x) log(1 − θ)
d/dθ log l(θ; x1, ..., xn) = x/θ − (n − x)/(1 − θ) = 0   ⟹   θ̂ = x/n

Example: Fisher Info for Binomial

U(Xi; θ) = d/dθ log f(Xi|θ) = Xi/θ − (1 − Xi)/(1 − θ) = (Xi − θ)/(θ(1 − θ))

U²(Xi; θ) = (Xi − θ)²/(θ²(1 − θ)²)

Ii(θ) = E[U²(Xi; θ)] = Var(Xi)/(θ²(1 − θ)²) = θ(1 − θ)/(θ²(1 − θ)²) = 1/(θ(1 − θ))

I(θ) = Σ_{i=1}^n Ii(θ) = n/(θ(1 − θ))

Hence θ̂ = x/n is asymptotically N(θ, θ(1 − θ)/n).
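The asymptotic N(θ, θ(1 − θ)/n) approximation can be checked by simulation. A minimal R sketch (the values of θ, n, the number of replications and the seed are illustrative choices):

set.seed(1)                          # illustrative seed
theta <- 0.3; n <- 100; reps <- 5000
theta.hat <- rbinom(reps, size = n, prob = theta) / n   # ML estimates from 'reps' samples
c(mean(theta.hat), var(theta.hat))   # compare with theta and theta*(1-theta)/n
theta * (1 - theta) / n              # asymptotic variance I(theta)^(-1)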

2.3 Bayes Theorem: Continuous Case

Bayesian Statistical Model

Given data x whose distribution depends on an unknown parameter θ, we require inference about θ. (x and θ can be vectors, but we assume for ease of notation that they are 1-dimensional.)

Definition 2.11
A Bayesian statistical model consists of a parametric statistical model (the sampling distribution or likelihood), f(x|θ), and a prior distribution on the parameters, f(θ).

Bayes' Theorem

Theorem 2.12 (Continuous version of Bayes' theorem)
Given a Bayesian statistical model, we can update the prior pdf of θ to the posterior pdf of θ given the data x:

f(θ|x) = f(θ)f(x|θ)/f(x) = f(θ)f(x|θ) / ∫ f(θ)f(x|θ) dθ ∝ prior × likelihood.

Essential Distributions

Given a complete Bayesian model, we can construct:
a) the joint distribution of (θ, X):
   f(θ, x) = f(x|θ)f(θ);
b) the marginal or prior predictive distribution of X:
   f(x) = ∫ f(θ, x) dθ = ∫ f(x|θ)f(θ) dθ;
c) the posterior distribution of θ:
   f(θ|x) = f(θ)f(x|θ)/f(x) = f(θ)f(x|θ) / ∫ f(θ)f(x|θ) dθ;
d) the posterior predictive distribution for a future observation Y given x:
   f(y|x) = ∫ f(y, θ|x) dθ = ∫ f(y|θ)f(θ|x) dθ.

Presentation of Posterior Distribution

After seeing the data x, what do we now know about the parameter θ?

plot of the posterior density function
summary statistics like measures of location and dispersion/precision (analogues to frequentist point estimates: e.g. posterior mean, median, mode)
hypothesis test, e.g. H0: θ ≤ θ0:
  Pr(H0 true|x) = Pr(θ ≤ θ0|x) = ∫_{−∞}^{θ0} f(θ|x) dθ
analogues to frequentist confidence intervals: central posterior interval and highest posterior density region

Presentation of Posterior Distribution

If F(θ|x) is the posterior cdf and F(θ1|x) = p1, F(θ2|x) = p2 > p1, then the interval (θ1, θ2] is a posterior interval of θ with coverage probability p2 − p1 (credible interval).
If exactly 100(α/2)% of the posterior probability lies above and below the posterior interval, it is called a central posterior interval with coverage probability 1 − α = p2 − p1.
It is sometimes desirable to find an interval/region which is as short as possible for a given coverage probability. This is called a highest posterior density (HPD) region.

3 Conjugate Distributions

Conjugate Distributions

The term conjugate refers to cases where the posterior distribution is in the same family as the prior distribution.
In Bayesian probability theory, if the posterior distributions f(θ|x) are in the same family as the prior distributions f(θ), the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior.
The concept, as well as the term "conjugate prior", were introduced by Howard Raiffa and Robert Schlaifer in their work on Bayesian decision theory (1961).

3.1 Bernoulli Distribution - Discrete Prior

Bernoulli Trials - Discrete Prior

Assume a drug may have response rate θ of 0.2, 0.4, 0.6 or 0.8, each with equal prior probability. If we observe a single positive response (x = 1), how is our prior revised?

Likelihood:
f(x|θ) = θ^x (1 − θ)^(1−x)
f(x = 1|θ) = θ,  f(x = 0|θ) = 1 − θ

Posterior:
f(θ|x) = f(x|θ)f(θ) / Σ_j f(x|θj)f(θj) ∝ f(x|θ)f(θ)

θ     prior f(θ)
0.2   0.25
0.4   0.25
0.6   0.25
0.8   0.25
Sum   1.0

Calculating the Posterior

θ     f(θ)   f(x=1|θ)·f(θ)        f(θ|x=1)
0.2   0.25   0.2 · 0.25 = 0.05    0.10
0.4   0.25   0.4 · 0.25 = 0.10    0.20
0.6   0.25   0.6 · 0.25 = 0.15    0.30
0.8   0.25   0.8 · 0.25 = 0.20    0.40
Sum   1.00                 0.50   1.00

Note: a single positive response makes it 4 times as likely that the true response rate is 80% rather than 20%.

Prior Predictive Distribution

With a Bayesian approach, prediction is straightforward. The prior predictive distribution of X is given by:

P(X = 1) = f(x = 1) = Σ_j f(x = 1|θj)f(θj) = 0.5
P(X = 0) = f(x = 0) = 1 − f(x = 1) = 0.5

The prior predictive probability is thus a weighted average of the likelihoods under the 4 possible values of θ:

f(x) = Σ_j wj f(x|θj),   with prior weights wj = f(θj).

Furthermore:

f(x = 1) = Σ_j θj wj = prior mean of θ = E[θ].
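The whole table, together with the prior and posterior predictive probabilities used below, can be reproduced in a few lines of R (illustrative sketch):

theta <- c(0.2, 0.4, 0.6, 0.8)
prior <- rep(0.25, 4)

lik  <- theta                             # f(x = 1 | theta) for a single positive response
post <- lik * prior / sum(lik * prior)    # posterior f(theta | x = 1)
cbind(theta, prior, lik * prior, post)

sum(lik * prior)                          # prior predictive P(X = 1) = 0.5
sum(theta * post)                         # posterior predictive P(Z = 1 | x = 1) = 0.6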

Posterior Predictive Distribution

Suppose we wish to predict the outcome of a new observation z, given what we have already observed.
For discrete θ we have the posterior predictive distribution:

f(z|x) = Σ_j f(z, θj|x)

which, since z is usually conditionally independent of x given θ, is generally equal to

f(z|x) = Σ_j f(z|θj, x)f(θj|x) = Σ_j f(z|θj)f(θj|x) = Σ_j f(z|θj) wj(x),

where the wj(x) = f(θj|x) are posterior weights.

Posterior Predictive Distribution

Example: The posterior predictive probability that the next treatment is successful:

f(z = 1|x = 1) = Σ_j θj f(θj|x) = posterior mean of θ
               = 0.2 · 0.1 + 0.4 · 0.2 + 0.6 · 0.3 + 0.8 · 0.4 = 0.6

3.2 Binomial Distribution - Discrete Prior

Binomial Response - Discrete Prior

If we observe r responses out of n patients, how is our prior revised?

Likelihood:
f(x = r|θ) = C(n, r) θ^r (1 − θ)^(n−r) ∝ θ^r (1 − θ)^(n−r)

Suppose n = 20, r = 15:
f(x = 15|θ) ∝ θ^15 (1 − θ)^5

Binomial Response - Discrete Prior

θ     prior f(θ)   f(x=r|θ)·f(θ) (×10⁻⁷)   posterior f(θ|x=r)
0.2   0.25         0.0                      0.0
0.4   0.25         0.2                      0.005
0.6   0.25         12.0                     0.298
0.8   0.25         28.1                     0.697
Sum   1.0          40.3                     1.0

Binomial Response - Discrete Prior

After observing x = 15 successes, what is the posterior predictive probability of a positive response for patient no. 21?

f(z = 1|x = 15) = Σ_i θi f(θi|x = 15)
                = 0.2 · 0.0 + 0.4 · 0.005 + 0.6 · 0.298 + 0.8 · 0.697
                = 0.7384

Summary and Terminology (Discrete Prior)

Two random variables: X (observable), θ (unobservable).
Let X|θ ~ Binomial(n, θ) (or Xj|θ ~ Bernoulli(θ) conditionally independent for j = 1, ..., n), where the unknown parameter θ can attain I different values θi, with a priori probabilities f(θi), i = 1, ..., I, respectively.

X|θ ~ Binomial(n, θ) is called the sampling distribution.
f(θi), i = 1, ..., I, is called the prior distribution.
The likelihood function:

f(x|θ) = C(n, x) θ^x (1 − θ)^(n−x) ∝ θ^x (1 − θ)^(n−x),   θ = θ1, ..., θI.

NOTE: This is considered as a function of θ only; x is considered fixed.

Summary and Terminology (Discrete Prior)

Prior predictive pdf of X:

f(x) = Σ_{i=1}^I f(x|θi)f(θi)   for x = 0, 1, ..., n

(a weighted average of f(x|θ) with weights given by the prior probabilities f(θi)).

Posterior pdf of θ:

f(θi|x) = f(θi)f(x|θi)/f(x) = f(θi)f(x|θi) / Σ_{j=1}^I f(θj)f(x|θj) ∝ f(θi)f(x|θi),   i = 1, ..., I.

Posterior predictive pdf for another future observation Y of the Bernoulli experiment:

f(y|x) = Σ_{i=1}^I f(y|θi)f(θi|x)

(a weighted average of f(y|θ) with weights given by the posterior probabilities f(θi|x)).
As Y can attain only the values 0 and 1, this gives:

f(1|x) = Σ_{i=1}^I θi f(θi|x) = posterior mean of θ
f(0|x) = 1 − f(1|x)

3.3 Binomial Distribution - Continuous Prior

Binomial Response - Continuous Prior

Data: x successes from n independent trials.

Likelihood:
f(x|θ) = C(n, x) θ^x (1 − θ)^(n−x) ∝ θ^x (1 − θ)^(n−x)

Prior: flexible conjugate beta family, θ ~ Beta(α, β):

f(θ) = Γ(α + β)/(Γ(α)Γ(β)) θ^(α−1)(1 − θ)^(β−1) ∝ θ^(α−1)(1 − θ)^(β−1)

Calculating the Posterior

Posterior:
f(θ|x) ∝ f(x|θ)f(θ)
       ∝ θ^x (1 − θ)^(n−x) · θ^(α−1)(1 − θ)^(β−1)
       = θ^(α+x−1)(1 − θ)^(β+n−x−1)

i.e. θ|x ~ Beta(α + x, β + n − x).
Note: the Binomial and Beta distributions are conjugate distributions.

Posterior Moments

For a Beta(α, β) distribution:
mode m = (α − 1)/(α + β − 2)
mean μ = α/(α + β)
variance σ² = μ(1 − μ)/(α + β + 1) = αβ/[(α + β)²(α + β + 1)]

Suppose our prior estimate of the response rate is 0.4 with a standard deviation of 0.1.
Solving μ = 0.4 and σ² = 0.1² gives α = 9.2, β = 13.8.
It is convenient to think of this as equivalent to having observed 9.2 successes in α + β = 23 patients.

Prior and Posterior Densities

            prior   likelihood   posterior
successes   9.2     15           24.2
failures    13.8    5            18.8

Figure 7: Prior, likelihood, and posterior density of θ.
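Figure 7 can be reproduced with a few lines of R (illustrative sketch; the likelihood curve is drawn as its normalized version, a Beta(15+1, 5+1) density):

theta <- seq(0, 1, length.out = 500)
plot(theta, dbeta(theta, 24.2, 18.8), type = "l", lty = 1,
     xlab = "theta", ylab = "density")               # posterior Beta(24.2, 18.8)
lines(theta, dbeta(theta, 9.2, 13.8), lty = 2)       # prior Beta(9.2, 13.8)
lines(theta, dbeta(theta, 16, 6), lty = 3)           # normalized likelihood: Beta(16, 6)
legend("topleft", lty = c(2, 3, 1),
       legend = c("prior", "likelihood", "posterior"))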

Prior and Posterior Means and Modes

Compare modes of prior, likelihood and posterior:
prior mode:         (α − 1)/(α + β − 2) = 8.2/21 = 0.39
mode of likelihood: 15/20 = 0.75
posterior mode:     23.2/41 = 0.57

Compare means of prior, data and posterior:
prior mean:     9.2/23 = 0.40
data mean:      15/20 = 0.75
posterior mean: 24.2/43 = 0.56

Compromise

In general, the posterior mean is a compromise between the prior mean and the data mean, i.e. for some w with 0 ≤ w ≤ 1:

posterior mean = w · prior mean + (1 − w) · data mean

(x + α)/(n + α + β) = w · α/(α + β) + (1 − w) · x/n

Solving with respect to w:

(x + α)/(n + α + β) = (α + β)/(n + α + β) · α/(α + β) + n/(n + α + β) · x/n,

i.e. w = (α + β)/(n + α + β):
the prior gets weight (α + β)/(n + α + β) → 0 as n → ∞,
the data get weight n/(n + α + β) → 1 as n → ∞.

Compromise

"A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule."

Hypothesis Test

H0: θ > θ0 = 0.4
Calculate the prior and posterior probability of H0:

P(θ > θ0) = ∫_{θ0}^1 f(θ) dθ = 1 − F_Beta(α,β)(θ0)
P(θ > θ0|x) = ∫_{θ0}^1 f(θ|x) dθ = 1 − F_Beta(α+x, β+n−x)(θ0)

For θ0 = 0.4, use R:

> priorprob=1-pbeta(0.4,9.2,13.8)
> priorprob
[1] 0.4886101
> postprob=1-pbeta(0.4,24.2,18.8)
> postprob
[1] 0.9842593

Analogue to Confidence Interval

Posterior credible interval

95% central posterior credible interval for θ: (θ_l, θ_u), where

0.95 = ∫_{θ_l}^{θ_u} f(θ|x) dθ

and θ_l and θ_u are the 2.5% and 97.5% quantiles of the posterior.
Use R:

> l=qbeta(0.025,24.2,18.8)
> l
[1] 0.4142266
> u=qbeta(0.975,24.2,18.8)
> u
[1] 0.7058181

Posterior Predictive Distribution

What is the posterior predictive success probability for a further, n + 1 = 21st, patient entering the trial?

P(X_{n+1} = 1|x) = ∫_0^1 f(x_{n+1} = 1|θ) f(θ|x1, ..., xn) dθ
= ∫_0^1 θ · Γ(n + α + β)/(Γ(α + x)Γ(β + n − x)) θ^(α+x−1)(1 − θ)^(β+n−x−1) dθ
= Γ(n + α + β)/(Γ(α + x)Γ(β + n − x)) ∫_0^1 θ^(α+x)(1 − θ)^(β+n−x−1) dθ
= Γ(n + α + β)/(Γ(α + x)Γ(β + n − x)) · Γ(α + x + 1)Γ(β + n − x)/Γ(n + α + β + 1)
= (α + x)/(α + β + n) = (9.2 + 15)/(23 + 20) = 0.562797

Posterior Predictive Distribution

If N = 100 further patients enter the trial, what is the posterior predictive distribution of the number of successes?
Let Y|θ ~ Binomial(N, θ). Then for y = 0, 1, ..., N:

f(y|x) = ∫ f(y|θ) f(θ|x) dθ
= ∫_0^1 C(N, y) θ^y (1 − θ)^(N−y) · Γ(n + α + β)/(Γ(α + x)Γ(β + n − x)) θ^(α+x−1)(1 − θ)^(β+n−x−1) dθ
= C(N, y) Γ(n + α + β)/(Γ(α + x)Γ(β + n − x)) ∫_0^1 θ^(y+α+x−1)(1 − θ)^(N−y+β+n−x−1) dθ
= C(N, y) · Γ(n + α + β)/(Γ(α + x)Γ(β + n − x)) · Γ(α + x + y)Γ(β + n − x + N − y)/Γ(α + β + n + N)

This is called a Beta-Binomial distribution.
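This pmf is easy to evaluate in R via log-gamma functions. An illustrative sketch (the helper function dbetabinom is defined here for illustration; α = 9.2, β = 13.8, n = 20, x = 15 as above):

# posterior predictive (beta-binomial) pmf for y successes out of N future patients
dbetabinom <- function(y, N, a, b) {
  # a, b are the posterior Beta parameters, here a = alpha + x, b = beta + n - x
  exp(lchoose(N, y) + lbeta(a + y, b + N - y) - lbeta(a, b))
}
a <- 9.2 + 15; b <- 13.8 + 20 - 15
y <- 0:100
probs <- dbetabinom(y, N = 100, a = a, b = b)
sum(probs)             # should be 1
sum(y * probs) / 100   # predictive mean proportion, equals a/(a+b) = 0.563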

3.4 Exchangeability

Independence?

A common statement in statistics: assume X1, ..., Xn are iid random variables.
In Bayesian statistics, we need to think hard about independence. Why?

Consider two "independent" Bernoulli trials with probability of success θ.
It is true that

f(x1, x2|θ) = θ^(x1+x2)(1 − θ)^(2−x1−x2) = f(x1|θ)f(x2|θ),

so that X1 and X2 are independent given θ.
But f(x1, x2) = ∫ f(x1, x2|θ)f(θ) dθ may not factor.

Marginal Bivariate Distribution

If f(θ) = Unif(0, 1), then

f(x1, x2) = ∫ f(x1, x2|θ)f(θ) dθ = ∫_0^1 θ^(x1+x2)(1 − θ)^(2−x1−x2) dθ
          = Γ(x1 + x2 + 1)Γ(3 − x1 − x2) / Γ(4).

Exchangeability

If independence is no longer the key, then what is? Exchangeability.

Informal definition: subscripts don't matter.
More formally: Given events A1, A2, ..., An, we say they are exchangeable if

P(A1, A2, ..., Ak) = P(A_{i1}, A_{i2}, ..., A_{ik})

for every k, where i1, i2, ..., ik is a permutation of the indices.
Similarly, given random variables X1, X2, ..., Xn, we say that they are exchangeable if

P(X1 ≤ x1, ..., Xk ≤ xk) = P(X_{i1} ≤ x1, ..., X_{ik} ≤ xk)

for every k.

Relationship between exchangeability and independence

rvs that are iid given θ are exchangeable
an infinite sequence of exchangeable rvs can always be thought of as iid given some parameter θ (de Finetti's theorem)
note that the previous point requires an infinite sequence

What is not exchangeable?
time series, spatial data
these may become exchangeable if we explicitly include time in the analysis, i.e. x1, x2, ..., xt, ... are not exchangeable but (t1, x1), (t2, x2), ... may be

3.5 Sequential Learning

Sequential Inference

Suppose we obtain an observation x1 and form the posterior f(θ|x1) ∝ f(x1|θ)f(θ), and then we obtain a further observation x2 which is conditionally independent of x1 given θ. The posterior based on x1, x2 is given by:

f(θ|x1, x2) ∝ f(x2|θ, x1) f(θ|x1) = f(x2|θ) f(θ|x1)

Today's posterior is tomorrow's prior!
The resulting posterior is the same as if we had obtained the data x1, x2 together:

f(θ|x1, x2) ∝ f(x1, x2|θ) f(θ) = f(x2|θ) f(x1|θ) f(θ)
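For the Beta-Binomial model this identity is easy to verify numerically. An illustrative R sketch (the prior and the split of the data into two batches are arbitrary choices for the demonstration):

# start from a Beta(9.2, 13.8) prior (illustrative choice)
a <- 9.2; b <- 13.8

# sequential updating: first x1 = 6 successes in 10 trials, then x2 = 9 in 10
a1 <- a + 6;  b1 <- b + 10 - 6         # posterior after x1 becomes the new prior
a2 <- a1 + 9; b2 <- b1 + 10 - 9        # posterior after x2

# batch updating: all 15 successes in 20 trials at once
c(a2, b2)                              # Beta(24.2, 18.8)
c(a + 15, b + 20 - 15)                 # identical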

3.6 Comparing Bayesian and Frequentist Inference for Proportion

Comparing Bayesian and Frequentist Inference for Proportion

Frequentist inference is concerned with
point estimation,
interval estimation,
and hypothesis testing.

Point Estimation

A single statistic is calculated from the sample data and used to estimate the unknown parameter.
The statistic depends on the random sample, so it is random, and its distribution is called its sampling distribution.
We call the statistic an estimator of the parameter, and the value it takes for the actual sample data an estimate.
There are various frequentist approaches for finding estimators, such as
Least Squares (LS),
maximum likelihood estimation (MLE), and
uniformly minimum variance unbiased estimation (UMVUE).
For estimating the binomial parameter θ, the LS, MLE and UMVUE estimator of the population proportion is the sample proportion.

Bias

From a Bayesian perspective, point estimation means summarizing the posterior distribution by a single statistic, such as the posterior mean, median or mode. Here, we will use the posterior mean as the Bayesian point estimate (it minimizes the posterior mean squared error, which gives it a decision-theoretic justification).

An estimator is said to be unbiased if the mean of its sampling distribution is the true parameter, i.e. θ̂ is unbiased if

E[θ̂] = ∫ θ̂ f(θ̂|θ) dθ̂ = θ,

where f(θ̂|θ) is the sampling distribution of the estimator θ̂ given the parameter θ. The bias of an estimator is

bias(θ̂) = E[θ̂] − θ.

(Bayes estimators are usually biased.)

Mean Squared Error

An estimator is said to be a minimum variance unbiased estimator if no other unbiased estimator has a smaller variance. However, it is possible that there may be a biased estimator that, on average, is closer to the true value than the unbiased estimator. We need to look at the possible trade-off between bias and variance.
The (frequentist) mean squared error of an estimator is the average squared distance of the estimator from the true value:

MS(θ̂) = E[(θ̂ − θ)²] = ∫ (θ̂ − θ)² f(θ̂|θ) dθ̂.

One can show that

MS(θ̂) = bias(θ̂)² + Var(θ̂).

Thus, it gives a better frequentist criterion for judging estimators than the bias or the variance alone.

MSE Comparison

We will now compare the mean squared error of the Bayesian and the frequentist estimator of the population proportion θ.
The frequentist estimator for θ is

θ̂_f = X/n,

where X, the number of successes in n trials, has the Binomial(n, θ) distribution with mean and variance

E(X) = nθ,   Var(X) = nθ(1 − θ).

Thus, the mean of its sampling distribution is E[θ̂_f] = θ, the variance of its sampling distribution is Var(θ̂_f) = θ(1 − θ)/n, and

MS(θ̂_f) = 0 + θ(1 − θ)/n.

MSE Comparison

Suppose we use the posterior mean as the Bayesian estimate for θ, where we use the Beta(1, 1) prior (uniform prior); then

θ̂_B = (1 + x)/(n + 2) = x/(n + 2) + 1/(n + 2),

and

E[θ̂_B] = nθ/(n + 2) + 1/(n + 2),   Var(θ̂_B) = nθ(1 − θ)/(n + 2)².

Hence the mean squared error is

MS(θ̂_B) = ( nθ/(n + 2) + 1/(n + 2) − θ )² + nθ(1 − θ)/(n + 2)²
         = ( (1 − 2θ)/(n + 2) )² + nθ(1 − θ)/(n + 2)².

MSE Comparison

For example, suppose θ = 0.4 and the sample size is n = 10. Then

MS(θ̂_f) = 0.4 · 0.6/10 = 0.024   and   MS(θ̂_B) = 0.0169.

Next, suppose θ = 0.5 and n = 10. Then

MS(θ̂_f) = 0.025   and   MS(θ̂_B) = 0.01736.

MSE Comparison

Figure 8 shows the mean squared error for the Bayesian and the frequentist estimator as a function of θ. Over most (but not all) of the range, the Bayesian estimator (using the uniform prior) performs better than the frequentist estimator.

Figure 8: Mean squared error for the two estimators.
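A short R sketch (illustrative) reproduces Figure 8 for n = 10:

theta <- seq(0, 1, length.out = 500)
n <- 10
mse.freq  <- theta * (1 - theta) / n
mse.bayes <- ((1 - 2 * theta) / (n + 2))^2 + n * theta * (1 - theta) / (n + 2)^2
plot(theta, mse.bayes, type = "l", lty = 1, ylim = c(0, 0.025),
     xlab = "theta", ylab = "MSE")
lines(theta, mse.freq, lty = 2)
legend("top", lty = 1:2, legend = c("Bayes", "frequentist"))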

3 Conjugate Distributions

3.6 Comparing Bayesian and Frequentist Inference for Proportion

3 Conjugate Distributions

Interval Estimation

3.6 Comparing Bayesian and Frequentist Inference for Proportion

Confidence Credible Interval


The aim is to find an interval (l, u) that has a predetermined probability of containing the parameter:

P(l ≤ θ ≤ u) = 1 − α.

In the frequentist interpretation, the parameter is fixed but unknown and, before the sample is taken, the interval endpoints are random because they depend on the data. After the sample is taken and the endpoints are calculated, there is nothing random any more, so the interval is called a confidence interval for the parameter. Under the frequentist paradigm, the correct interpretation of a (1 − α) × 100% confidence interval is that (1 − α) × 100% of the random intervals calculated this way will contain the true value.

Often, the sampling distribution of the estimator is approximately normal or t_(n−1) distributed with mean equal to the true value. In this case, the confidence interval has the form

estimator ± critical value × stdev of estimator,

where the critical value comes from the normal or t table. For the sample proportion, an approximate (1 − α) × 100% confidence interval for π is given by

π̂_f ± t_(n−1)(α/2) √( π̂_f(1 − π̂_f)/n ).

A Bayesian credible interval for the parameter, on the other hand, has the natural interpretation that we want. Because it is found from the posterior distribution of π, it has the stated coverage probability for this specific data set.

Example: Interval Estimation

Example 3.1
Out of a random sample of 100 Hamilton residents, x = 26 said they support a casino in Hamilton. Compare the frequentist 95% confidence interval with the Bayesian credible interval (using a uniform prior).

Frequentist 95% confidence interval:

0.26 ± 1.96 √(0.26 × 0.74/100) = (0.174, 0.346)

Bayesian 95% credible interval:
prior: Beta(1,1)
posterior: Beta(1 + 26, 1 + 74) = Beta(27, 75)

> lu=qbeta(c(0.025,0.975),27,75)
> lu
[1] 0.1841349 0.3540134

Hypothesis Testing

Example 3.2
Suppose we wish to determine whether a new treatment is better than the standard treatment. If so, π, the proportion of patients who benefit from the new treatment, should be higher than π₀, the proportion who benefit from the standard treatment. It is known from historical records that π₀ = 0.6. A random group of 10 patients is given the new treatment. X, the number who benefit from the treatment, will be Binomial(n, π). We observe that x = 8 patients benefit. This is better than we would expect if π = 0.6. But is it sufficiently better for us to conclude that π > 0.6 at the 5% level of significance?

The following table gives the null distribution of X:

x         0      1      2      3      4      5      6      7      8      9      10
f(x|π₀)   .0001  .0016  .0106  .0425  .1115  .2007  .2508  .2150  .1209  .0403  .0060

Frequentist Test

H₀: π ≤ 0.6   versus   H₁: π > 0.6

Under H₀: X | π = 0.6 ~ Binomial(10, 0.6)

P-value = P(X ≥ 8 | H₀ true) = P(X ≥ 8 | π = 0.6) = 1 − pbinom(7, 10, 0.6)
        = 0.1209 + 0.0403 + 0.0060 = 0.1672 > 0.05,   so we do not reject H₀.

Bayesian Test

prior: Beta(1,1)
data: x = 8, n − x = 2
posterior: Beta(9,3)

P(H₀ | x = 8) = P(π ≤ 0.6 | x = 8) = pbeta(0.6, 9, 3) = 0.1189
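Both numbers on this slide can be checked directly in R:

## frequentist one-sided P-value and Bayesian posterior probability of H0
## for Example 3.2 (x = 8 out of n = 10, pi0 = 0.6, Beta(1,1) prior)
p.value <- 1 - pbinom(7, 10, 0.6)   # P(X >= 8 | pi = 0.6)
post.H0 <- pbeta(0.6, 9, 3)         # P(pi <= 0.6 | x = 8)
c(p.value = p.value, post.H0 = post.H0)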

3.7 Exponential Distribution

Exponential Data and Gamma Prior

The exponential distribution is commonly used to model waiting times and other continuous positive real-valued random variables, usually measured on a time scale. The sampling distribution of an outcome x, given parameter θ, is

f(x|θ) = θ exp(−θx)   for x > 0.

The exponential distribution is a special case of the Gamma distribution with parameters (α, β) = (1, θ).

Let X₁, …, Xₙ be iid Exponential(θ) random variables.

Likelihood:
f(x|θ) ∝ θⁿ exp(−nθx̄)

Conjugate Gamma(α, β) prior:
f(θ) = βᵅ θ^(α−1) exp(−βθ) / Γ(α)

Posterior density:
f(θ|x) ∝ θ^(α+n−1) exp(−θ(nx̄ + β)),   i.e. θ|x ~ Gamma(α + n, β + nx̄)
Exponential Example

Example 3.3
Let Yᵢ, i = 1, …, n, be iid exponentially distributed.

i) Using a conjugate Gamma(α, β) distribution, derive the posterior mean, variance, and mode of θ. For which values α and β does the posterior mode coincide with the ML estimate of θ?

ii) What is the posterior density of the mean μ = 1/θ? Which distribution is conjugate for μ?

iii) The length of life of a light bulb manufactured by a certain process has an exponential distribution with unknown rate θ. Suppose the prior distribution for θ is a Gamma distribution with coefficient of variation 0.5. (The coefficient of variation is defined as the standard deviation divided by the mean.) A random sample of light bulbs is to be tested and the lifetime of each obtained. If the coefficient of variation of the distribution of θ is to be reduced to 0.1, how many light bulbs need to be tested?

iv) In part iii), if the coefficient of variation refers to μ instead of θ, how would your answer be changed?

3.8 Poisson Distribution

Poisson Data and Gamma Prior

Let X be the number of times a certain event occurs in a unit interval of time, and suppose the following conditions hold:

- The events are occurring at a constant average rate of θ per unit time.
- The number of events in any one interval of time is statistically independent of the number in any other nonoverlapping interval.
- The probability of more than one event occurring in an interval of length d goes to zero as d goes to zero.

Any process producing events which satisfies the above three axioms is called a Poisson process, and X, the number of events in a unit time interval, is distributed as Poisson(θ).

Let X be a Poisson(θ) random variable and suppose we observe X = x.

Likelihood:
f(x|θ) = θˣ e^(−θ) / x!  ∝  θˣ e^(−θ)

Conjugate Gamma(α, β) prior:
f(θ) = βᵅ θ^(α−1) exp(−βθ) / Γ(α)  ∝  θ^(α−1) exp(−βθ)

Calculating Posterior

Posterior density:

f(θ|x) ∝ f(θ) f(x|θ) ∝ θ^(α−1) e^(−βθ) · θˣ e^(−θ) = θ^(α+x−1) e^(−(β+1)θ),

i.e. the pdf of Gamma(α + x, β + 1).

Prior Predictive Distribution

Prior predictive distribution for X: f(x)

f(x) = ∫₀^∞ f(x|θ) f(θ) dθ
     = ∫₀^∞ (θˣ e^(−θ)/x!) (βᵅ/Γ(α)) θ^(α−1) e^(−βθ) dθ
     = (βᵅ/(Γ(α) x!)) ∫₀^∞ θ^(α+x−1) e^(−(β+1)θ) dθ
     = (βᵅ/(Γ(α) x!)) · Γ(α + x)/(β + 1)^(α+x)
     = ((α + x − 1)! / ((α − 1)! x!)) · (β/(β+1))ᵅ (1/(β+1))ˣ
     = C(α + x − 1, x) (β/(β+1))ᵅ (1/(β+1))ˣ,   x = 0, 1, 2, …

Negative Binomial

The prior predictive pmf derived above is that of a Negative-Binomial(α, β) distribution, i.e. the distribution of the number of Bernoulli failures obtained before the αth success when the success probability is p = β/(β+1). This shows that

Neg-bin(x|α, β) = ∫ Poisson(x|θ) Gamma(θ|α, β) dθ.
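A quick simulation check of this mixture identity in R (the hyperparameters α = 3, β = 2 are illustrative only):

## the Poisson-Gamma mixture should match dnbinom(size = a, prob = b/(b+1))
set.seed(42)
a <- 3; b <- 2
theta <- rgamma(1e5, shape = a, rate = b)   # draws from the Gamma prior
x <- rpois(1e5, theta)                      # Poisson draws given theta
rbind(simulated = table(factor(x, levels = 0:5)) / 1e5,
      negbin    = dnbinom(0:5, size = a, prob = b / (b + 1)))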

Multiple Poisson Data

Now let X₁, …, Xₙ be iid Poisson(θ) random variables. Suppose we observe x = (x₁, …, xₙ).

Likelihood:
f(x|θ) = ∏_(i=1)^n f(xᵢ|θ) = ∏_(i=1)^n θ^(xᵢ) e^(−θ)/xᵢ!
       = (1/∏ᵢ xᵢ!) θ^(Σᵢ xᵢ) e^(−nθ)  ∝  θ^(n x̄) e^(−nθ)

Conjugate Gamma(α, β) prior:
f(θ) ∝ θ^(α−1) exp(−βθ)

Posterior density:
f(θ|x) ∝ f(θ) f(x|θ) ∝ θ^(α−1) e^(−βθ) · θ^(n x̄) e^(−nθ) = θ^(α+n x̄−1) e^(−(β+n)θ),

i.e. the pdf of Gamma(α + n x̄, β + n).

Poisson Example

Example 3.4
Suppose that causes of death are reviewed in detail for a city in the US for a single year. It is found that 3 persons, out of a population of 200,000, died of asthma, giving a crude estimated asthma mortality rate in the city of 1.5 per 100,000 persons per year. A Poisson sampling model is often used for epidemiological data of this form. Let θ represent the true underlying long-term asthma mortality rate in the city (measured in cases per 100,000 persons per year). Reviews of asthma mortality rates around the world suggest that mortality rates above 1.5 per 100,000 people are rare in Western countries, with typical asthma mortality rates around 0.6 per 100,000.

a) Construct a conjugate prior density and derive the posterior distribution of θ.


b) What is the posterior probability that the long-term death rate from asthma in the city is more than 1.0 per 100,000 per year?

c) What is the posterior predictive distribution of a future observation Y?

d) To consider the effect of additional data, suppose that ten years of data are obtained for the city in this example with y = 30 deaths over 10 years. Assuming the population is constant at 200,000, and assuming the outcomes in the ten years are independent with constant long-term rate θ, derive the posterior distribution of θ.

e) What is the posterior probability that the long-term death rate from asthma in the city is more than 1.0 per 100,000 per year?
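Once a conjugate Gamma(α, β) prior has been chosen in part a), parts a) and b) reduce to a one-line update and a tail probability. A hedged R sketch, in which the prior Gamma(3, 5) (prior mean 0.6 per 100,000) and the exposure of 2.0 units of 100,000 person-years are illustrative assumptions, not part of the exercise statement:

## illustrative Gamma-Poisson update for Example 3.4
a0 <- 3; b0 <- 5   # assumed prior: mean a0/b0 = 0.6 per 100,000
y  <- 3            # observed asthma deaths in one year
ex <- 2.0          # exposure: 200,000 persons = 2.0 units of 100,000 person-years
a1 <- a0 + y       # posterior Gamma(a0 + y, b0 + exposure)
b1 <- b0 + ex
c(post.mean = a1 / b1,
  P.theta.gt.1 = 1 - pgamma(1.0, a1, b1))   # cf. part b)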


3.9 Normal Distribution

Normal Data, Known Variance, Single Observation

A random variable X has a Normal distribution with mean θ and variance σ² if X has a continuous distribution with pdf

f(x) = (1/(√(2π) σ)) exp( −(1/2) ((x − θ)/σ)² )   for −∞ < x < ∞.

Normal Example

Example 3.5
According to Kennett and Ross (1983), Geochronology, London: Longmans, the first apparently reliable datings for the age of the Ennerdale granophyre were obtained from the K/Ar method (which depends on observing the relative proportions of potassium-40 and argon-40 in the rock) in the 1960s and early 1970s, and these resulted in an estimate of 370 ± 20 million years. Later in the 1970s, measurements based on the Rb/Sr method (depending on the relative proportions of rubidium-87 and strontium-87) gave an age of 421 ± 8 million years. It appears that the errors marked are meant to be standard deviations, and it seems plausible that the errors are normally distributed. If a scientist had the K/Ar measurements available in the early 1970s, these could be the basis of her prior beliefs about the age of these rocks.
Normal Prior

Likelihood: X|θ ~ N(θ, σ²), σ² known:

f(x|θ) = (1/(√(2π) σ)) exp( −(x − θ)²/(2σ²) )  ∝  exp( −(x − θ)²/(2σ²) )

Conjugate prior: θ ~ N(μ₀, τ₀²), where μ₀ and τ₀² are hyperparameters:

f(θ) = (1/(√(2π) τ₀)) exp( −(θ − μ₀)²/(2τ₀²) )  ∝  exp( −(θ − μ₀)²/(2τ₀²) )

Posterior: θ|x ~ N(μ₁, τ₁²), where

μ₁ = ( μ₀/τ₀² + x/σ² ) / ( 1/τ₀² + 1/σ² )   and   1/τ₁² = 1/τ₀² + 1/σ².

NB:
- posterior precision = prior precision + data precision
- posterior mean = weighted average of prior mean and observation

Calculating Posterior

f(θ|x) ∝ f(x|θ) f(θ)
       ∝ exp( −(1/2) [ (x − θ)²/σ² + (θ − μ₀)²/τ₀² ] )
       = exp( −(1/2) [ (x² − 2xθ + θ²)/σ² + (θ² − 2θμ₀ + μ₀²)/τ₀² ] )
       ∝ exp( −(1/2) [ θ²(1/σ² + 1/τ₀²) − 2θ(x/σ² + μ₀/τ₀²) ] + const. )
       ∝ exp( −(1/(2τ₁²)) (θ − μ₁)² ),

with μ₁ and τ₁² as given on the previous slide.

Posterior Mean Expressions

Alternative expressions for the posterior mean:

μ₁ = μ₀ + (x − μ₀) τ₀²/(σ² + τ₀²)   (prior mean adjusted towards the observed value)

μ₁ = x − (x − μ₀) σ²/(σ² + τ₀²)   (data shrunk towards the prior mean)
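A small R helper implementing this update; the numbers plugged in are those of Example 3.5 (K/Ar prior 370 ± 20 million years, Rb/Sr observation 421 with standard deviation 8):

## posterior for a single normal observation with known variance
normal.update <- function(x, sigma, mu0, tau0) {
  prec1 <- 1 / tau0^2 + 1 / sigma^2               # posterior precision
  mu1   <- (mu0 / tau0^2 + x / sigma^2) / prec1   # weighted average
  c(mean = mu1, sd = sqrt(1 / prec1))
}
normal.update(x = 421, sigma = 8, mu0 = 370, tau0 = 20)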


Prior Predictive Distribution

Prior predictive distribution of X:   X ~ N(μ₀, σ² + τ₀²).

Because f(x) = ∫ f(x|θ) f(θ) dθ and

f(x, θ) = f(x|θ) f(θ) ∝ exp( −(x − θ)²/(2σ²) ) exp( −(θ − μ₀)²/(2τ₀²) ),

i.e. (X, θ) have a bivariate normal distribution, the marginal distribution of X is normal.

Reminder: Conditional Mean and Variance

If U and V are random variables, then

E[U] = E[E[U|V]]
Var(U) = E[Var(U|V)] + Var(E[U|V])

Now:

E[X] = E[E[X|θ]] = E[θ] = μ₀
Var(X) = E[Var(X|θ)] + Var(E[X|θ]) = E[σ²] + Var(θ) = σ² + τ₀²

Posterior Predictive Distribution

Posterior predictive distribution of a future observation Y:   Y|x ~ N(μ₁, σ² + τ₁²).

Because f(y|x) = ∫ f(y|θ) f(θ|x) dθ and

f(y, θ|x) = f(y|θ) f(θ|x) ∝ exp( −(y − θ)²/(2σ²) ) exp( −(θ − μ₁)²/(2τ₁²) ),

i.e. (Y, θ)|x have a bivariate normal distribution, the marginal distribution of Y|x is normal.

Now:

E[Y|x] = E[E[Y|θ]|x] = E[θ|x] = μ₁
Var(Y|x) = E[Var(Y|θ)|x] + Var(E[Y|θ]|x) = E[σ²|x] + Var(θ|x) = σ² + τ₁²

Normal Example

Now back to Example 3.5: single Normal observation, Normal prior.

Figure 9: Conjugate Normal prior and single observation (prior, likelihood and posterior densities for the age μ, in millions of years).


Normal Data, Known Variance, Multiple Observations

Normal Example

Example 3.6
What is now called the National Institute of Standards and Technology (NIST) in Washington DC conducts extremely high precision measurement of physical constants, such as the actual weight of so-called check-weights that are supposed to serve as reference standards (like the official kg). In 1962-63, for example, n = 100 weighings of a block of metal called NB10, which was supposed to weigh exactly 10g, were made under conditions as close to iid as possible. The 100 measurements x₁, …, xₙ (the units are micrograms below 10g) have a mean of x̄ = 404.6 and an SD of s = 6.5.

weight     375 392 393 397 398 399 400 401 402 403 404 405
frequency    1   1   1   1   2   7   4  12   8   6   9   5

weight     406 407 408 409 410 411 412 413 415 418 423 437
frequency   12   8   5   5   4   1   3   1   1   1   1   1

Questions:
1. How much does NB10 really weigh?
2. How certain are you, given the data, that the true weight of NB10 is less than 405.25 micrograms below 10g?
3. What is the underlying accuracy of the NB10 measuring process?
4. How accurately can you predict the 101st measurement?

A Normal qq-plot shows that a Normal sampling distribution is appropriate. We first assume that σ² is known.

Calculating Posterior

Likelihood: Xᵢ|θ ~ iid N(θ, σ²), i = 1, …, n, σ² known
Conjugate prior: θ ~ N(μ₀, τ₀²)

Posterior: θ|x ~ N(μₙ, τₙ²), where

μₙ = ( μ₀/τ₀² + n x̄/σ² ) / ( 1/τ₀² + n/σ² )   and   1/τₙ² = 1/τ₀² + n/σ².

Why? Reduction to the case of a single data point of the previous section: the likelihood depends on the data x₁, …, xₙ only through the sufficient statistic x̄. If X₁, …, Xₙ|θ ~ iid N(θ, σ²), then

f(x₁, …, xₙ|θ) = ∏_(i=1)^n f(xᵢ|θ) = ∏_(i=1)^n (1/(√(2π)σ)) exp( −(xᵢ − θ)²/(2σ²) )
              = const. × exp( −(1/2) Σᵢ (xᵢ − θ)²/σ² )
              ∝ exp( −(1/2) ((x̄ − θ)/(σ/√n))² ),

and X̄|θ ~ N(θ, σ²/n). Thus, in the previous section, simply substitute σ² by σ²/n and x by x̄.

Remarks

1. If τ₀² = σ², then
   μₙ = (μ₀ + n x̄)/(n + 1) = (μ₀ + Σᵢ xᵢ)/(n + 1)   and   τₙ² = σ²/(n + 1),
   i.e. the prior has the weight of one additional observation with value μ₀.

2. If n is large, the posterior is determined by x̄ and σ².

3. If τ₀² → ∞ (diffuse prior) and n is fixed, then
   θ|x ~ N(x̄, σ²/n),   i.e. posterior mean = MLE.

4. The prior information is equivalent to σ²/τ₀² additional observations all equal to μ₀, since
   μₙ = ( (σ²/τ₀²) μ₀ + n x̄ ) / ( σ²/τ₀² + n ).
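For the NB10 data of Example 3.6 the same update applies with σ² replaced by σ²/n. A sketch in R, where the prior N(400, 10²) is purely illustrative (the notes do not fix one) and s = 6.5 is plugged in for σ:

## posterior for multiple normal observations with known variance
n <- 100; xbar <- 404.6; sigma <- 6.5
mu0 <- 400; tau0 <- 10                       # assumed prior hyperparameters
prec.n <- 1 / tau0^2 + n / sigma^2
mu.n   <- (mu0 / tau0^2 + n * xbar / sigma^2) / prec.n
tau.n  <- sqrt(1 / prec.n)
c(post.mean = mu.n, post.sd = tau.n,
  P.less.405.25 = pnorm(405.25, mu.n, tau.n))   # cf. Question 2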



Back to Normal Example

Figure 10: Conjugate Normal prior and several observations (prior, likelihood and posterior densities for μ; multiple Normal observations, Normal prior).

Normal Data, Known Variance, Noninformative Prior

Example 3.7
Changes in blood pressure (in mmHg) were recorded for each of 100 patients, where negative numbers are decreases while on the drug and positive numbers are increases:

+3.7  −6.7  −10.5  …  −16.7  −7.2

with sample mean x̄ = −7.99 and standard deviation s = 4.33.

We will assume that the change in blood pressure X has a Normal distribution with unknown mean θ and known variance σ² = 4.33².


Example

Let us assume that we don't know anything about the mean change in blood pressure induced by the new drug and thus assume that θ can attain any real value with equal probability. This gives a flat prior distribution for θ on (−∞, ∞), i.e.

f(θ) ∝ 1.

(There is no proper continuous uniform distribution on (−∞, ∞), but you can think of θ being uniform on some finite interval (−a, a), for some large a, and ignore the normalization constant, as it is not needed for the application of Bayes' theorem.)

What is the posterior distribution of θ?

Calculating Posterior

Posterior pdf:

f(θ|x̄) ∝ prior × likelihood ∝ f(θ) f(x̄|θ)
        ∝ 1 × exp( −(1/2) ((x̄ − θ)/(σ/√n))² ),

i.e. the pdf of Normal(x̄, σ²/n).

Simple Updating Rule

If Xᵢ ~ iid Normal(θ, σ²), i = 1, …, n, and a flat prior is used, then the posterior distribution of θ|x is

Normal(μₙ, τₙ²)   with   μₙ = x̄   and   τₙ² = σ²/n.

In Example 3.7:
μₙ = −7.99,   τₙ² = 4.33²/100 = 0.187489.

Credible Intervals

95% posterior probability interval for θ:

L = 2.5% quantile of N(−7.99, 0.187489)
U = 97.5% quantile of N(−7.99, 0.187489)

In R:

> lu=qnorm(c(0.025,0.975),-7.99,sqrt(0.187489))
> lu
[1] -8.838664 -7.141336

Hypothesis Test

Test the null hypothesis H₀: θ ≤ −7.0.

P(H₀|x̄) = P(θ ≤ −7.0|x̄)

In R:

> p=pnorm(-7,-7.99,sqrt(0.187489))
> p
[1] 0.9888838

2-Parameter Normal with Conjugate Prior

Prior distribution:

θ | σ² ~ N(μ₀, σ²/κ₀)
σ² ~ Inv-χ²(ν₀, σ₀²)

i.e. (θ, σ²) ~ N-Inv-χ²(μ₀, σ₀²/κ₀, ν₀, σ₀²),

where Inv-χ²(ν₀, σ₀²) denotes the scaled inverse χ²-distribution with scale σ₀² and ν₀ degrees of freedom, i.e. the distribution of σ₀²ν₀/Z where Z is a χ² random variable with ν₀ degrees of freedom.

Joint prior density:

f(θ, σ²) ∝ σ^(−1) (σ²)^(−(ν₀/2+1)) exp( −(1/(2σ²)) [ν₀σ₀² + κ₀(μ₀ − θ)²] )

Joint posterior density:

f(θ, σ²|x) ∝ σ^(−1) (σ²)^(−(ν₀/2+1)) exp( −(1/(2σ²)) [ν₀σ₀² + κ₀(μ₀ − θ)²] )
             × (σ²)^(−n/2) exp( −(1/(2σ²)) [(n − 1)s² + n(x̄ − θ)²] )
           ∝ N-Inv-χ²(μₙ, σₙ²/κₙ, νₙ, σₙ²),

where

μₙ = (κ₀μ₀ + n x̄)/(κ₀ + n)
κₙ = κ₀ + n
νₙ = ν₀ + n
νₙσₙ² = ν₀σ₀² + (n − 1)s² + (κ₀n/(κ₀ + n)) (x̄ − μ₀)².

Conditional posterior of θ:

θ | σ², x ~ N(μₙ, σ²/κₙ) = N( (κ₀μ₀/σ² + n x̄/σ²)/(κ₀/σ² + n/σ²), 1/(κ₀/σ² + n/σ²) )

Marginal posterior of σ²:

σ² | x ~ Inv-χ²(νₙ, σₙ²)

Marginal posterior of θ:

f(θ|x) ∝ [ 1 + κₙ(θ − μₙ)²/(νₙσₙ²) ]^(−(νₙ+1)/2)  ∝  t_(νₙ)(θ | μₙ, σₙ²/κₙ)
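Simulating from this joint posterior is straightforward: draw σ² from its scaled inverse-χ² marginal and then θ from its conditional normal. A sketch in R, where the posterior hyperparameters are placeholder values rather than results from the notes:

## draws from the Normal-Inverse-chi^2 posterior (mu.n, kappa.n, nu.n, s2.n assumed)
rnorminvchisq <- function(m, mu.n, kappa.n, nu.n, s2.n) {
  sigma2 <- nu.n * s2.n / rchisq(m, df = nu.n)       # scaled inverse chi^2
  theta  <- rnorm(m, mu.n, sqrt(sigma2 / kappa.n))   # conditional normal
  cbind(theta = theta, sigma2 = sigma2)
}
draws <- rnorminvchisq(10000, mu.n = 404.3, kappa.n = 101, nu.n = 100, s2.n = 42)
apply(draws, 2, quantile, probs = c(0.025, 0.5, 0.975))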

3.10 Normal Linear Regression

Normal Linear Regression

This can be extended to linear regression models.

Sampling distribution:

Yᵢ | μᵢ, σ² ~ N(μᵢ, σ²),   i = 1, …, n,
with μᵢ = β₀ + β₁xᵢ₁ + ⋯ + β_(p−1)x_(i,p−1) = xᵢ′β,

or in matrix notation, with the n × p design matrix X whose ith row is xᵢ′ = (1, xᵢ₁, …, x_(i,p−1)) and β = (β₀, β₁, …, β_(p−1))′:

Y | β, σ² ~ Nₙ(Xβ, σ² Iₙ).

Conjugate Normal-Inverse-Gamma Prior

The multivariate normal-inverse gamma prior distribution (β, σ²) ~ NIG(μ, V, a, b) is conjugate and can be specified as:

β | σ² ~ N_p(μ, σ² V)
σ² ~ Inv-Gamma(a, b).

The posterior is NIG(β̃, Ṽ, ã, b̃) with

β̃ = Ṽ (X′y + V⁻¹μ)
Ṽ = (X′X + V⁻¹)⁻¹
ã = n/2 + a
b̃ = SS/2 + b
SS = y′y − β̃′Ṽ⁻¹β̃ + μ′V⁻¹μ.

Weighted Average

β̃ can be written as a weighted average of the prior mean and the MLE, as in the univariate normal case:

β̃ = W β̂ + (I_p − W) μ   with   W = (X′X + V⁻¹)⁻¹ X′X,

where β̂ = (X′X)⁻¹X′y is the MLE.

The marginal posterior distribution of β is a multivariate Student distribution. For details, see Bernardo and Smith (1994). The marginal posterior distribution of σ² is an Inverse Gamma distribution with parameters ã and b̃ as above.
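A compact R sketch of this conjugate update, using the formulas above; the prior hyperparameters and the simulated data are illustrative only:

## NIG posterior for the normal linear model
nig.posterior <- function(y, X, mu, V, a, b) {
  Vinv   <- solve(V)
  V.post <- solve(crossprod(X) + Vinv)                    # (X'X + V^-1)^-1
  b.post <- V.post %*% (crossprod(X, y) + Vinv %*% mu)    # posterior mean of beta
  SS <- drop(crossprod(y) - t(b.post) %*% solve(V.post) %*% b.post
             + t(mu) %*% Vinv %*% mu)
  list(beta = drop(b.post), V = V.post, a = length(y) / 2 + a, b = SS / 2 + b)
}
set.seed(1)
X <- cbind(1, rnorm(50)); y <- drop(X %*% c(2, 1)) + rnorm(50)
nig.posterior(y, X, mu = c(0, 0), V = diag(2) * 100, a = 0.01, b = 0.01)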


4 WinBUGS Applications
4.1 WinBUGS Handouts

WinBUGS Applications: Overview

Calculation of the posterior distribution is difficult in situations with
- nonconjugate priors
- multiple parameters,

as we need to calculate summary statistics, like mean and variance, and, in high-dimensional problems, marginal posterior distributions. All this involves integration, which has been a very big hurdle for Bayesian inference in the past.

For low parameter dimensions, say 2, 3, 4, 5, numerical integration techniques, asymptotic approximations etc. may be used, but these break down for higher dimensions.

The most successful approach, for reasons that we will discuss in the subsequent sections, is based on simulation. That means, instead of explicitly calculating the posterior and performing integrations, we generate a sample from the posterior distribution and use that sample to approximate any quantity of interest, e.g. approximate the posterior mean by the sample mean etc.

A very versatile software package for these posterior simulations is WinBUGS, the Windows version of BUGS (Bayesian inference Using Gibbs Sampling), developed by David Spiegelhalter and colleagues at the MRC Biostatistics Unit of Cambridge University, England.


WinBUGS Handouts

WinBUGS uses the Gibbs sampler to generate samples from the posterior distribution of the parameters of a Bayesian model. We will discuss the Gibbs sampler and other Markov chain Monte Carlo techniques in detail in Chapter 6. For now, we simply treat the simulation method used in WinBUGS as a black box, but keep in mind that the samples generated are not independent but dependent, i.e. they are samples from a Markov chain that converges towards the posterior distribution. Therefore, we can use the samples only from a point in time where convergence has set in and need to discard the initial so-called burn-in samples.

We illustrate this sampling-based approach using our familiar example of Binomial data with a conjugate prior distribution and refer to the handout "Brief Introduction to WinBUGS". Other handouts will discuss running WinBUGS in batch mode, from within R using R2WinBUGS, and how to use the R package CODA for convergence diagnostics.

Once familiar with WinBUGS, we will look at the huge range of Bayesian models, especially Bayesian hierarchical models, that can be handled with WinBUGS and concentrate on practical implementation issues rather than theory. The underlying theory will be recouped in the subsequent chapters.

4.2 Bayesian Linear Regression

Simple Linear Regression

In regression analysis, we look at the conditional distribution of the response variable at different levels of a predictor variable.

- Response variable Y: also called dependent or outcome variable; what we want to explain or predict; in simple linear regression, the response variable is continuous.
- Predictor variables X₁, …, X_p: also called independent variables or covariates; in simple linear regression, the predictor variable is usually continuous.
- Which variable is response and which is predictor depends on our research question.

Example

Example 4.1
This example investigates the quality of the delivery system network of a softdrink company, see Example 5.1 in Ntzoufras (2009). One is interested in estimating the time each employee needs to refill an automatic vending machine owned and served by the company. For this reason, a small quality assurance study was set up by an industrial engineer of the company. The response variable is the total service time (measured in minutes) of each machine, including its stocking with beverages and any required maintenance or housekeeping. After examining the problem, the industrial engineer recommends two important variables that affect delivery time: the number of cases of stocked products and the distance walked by the employee (measured in feet). A dataset of 25 observations was finally collected.


Data: Softdrink Delivery Times


Delivery Time   Cases   Distance
16.68            7       560
11.5             3       220
12.03            3       340
14.88            4        80
13.75            6       150
18.11            7       330
 8               2       110
17.83            7       210
79.24           30      1460
21.5             5       605
40.33           16       688
21              10       215
13.5             4       255
19.75            6       462
24               9       448
29              10       776
15.35            6       200
19               7       132
 9.5             3        36
35.1            17       770
17.9            10       140
52.32           26       810
18.75            9       450
19.83            8       635
10.75            4       150

Model Assumptions

The explanatory variables are assumed fixed, their values denoted by xᵢ₁, …, xᵢ_p for i = 1, …, n. Given the values of the explanatory variables, the observations of the response variable are assumed independent and normally distributed:

Yᵢ | xᵢ₁, …, xᵢ_p ~ N(μᵢ, σ²)   with   μᵢ = β₀ + β₁xᵢ₁ + ⋯ + β_p xᵢ_p,   for i = 1, …, n,

or in matrix notation:

Y | x ~ Nₙ(μ, σ² I)   with   μ = Xβ,

where σ² and β = (β₀, β₁, …, β_p)′ are the regression parameters, I denotes the identity matrix, Y the vector of observations and X = (xᵢⱼ) the n × (p + 1) design matrix.

Likelihood Specification in WinBUGS

Note that in WinBUGS the normal distribution is parametrized in terms of the precision τ = 1/σ². The likelihood is thus specified by:

for (i in 1:n){
   y[i] ~ dnorm(mu[i],tau)
   mu[i] <- beta0 + beta1*x1[i] + ... + betap*xp[i]
}
sigma2 <- 1/tau
sigma <- sqrt(sigma2)

Prior Specification

In normal regression models, the simplest approach is to assume that all parameters are a priori independent, i.e.

f(β, τ) = ∏_(j=0)^p f(βⱼ) · f(τ)

with

βⱼ ~ N(μⱼ, cⱼ²)   for j = 0, …, p
τ ~ Gamma(a, b).

Thus, the precision has a prior mean of E(τ) = a/b and prior variance Var(τ) = a/b². This corresponds to an Inverse Gamma prior distribution for σ² with E(σ²) = b/(a − 1) and Var(σ²) = b²/((a − 1)²(a − 2)).

No information about βⱼ: μⱼ = 0 and cⱼ² = 10000.
No information about τ: a = b = 0.001.

Prior Specification in WinBUGS

beta0 ~ dnorm(0.0,1.0E-4)
beta1 ~ dnorm(0.0,1.0E-4)
....
betap ~ dnorm(0.0,1.0E-4)
tau   ~ dgamma(0.001,0.001)

Interpretation of Regression Coefficients

Each regression coefficient βⱼ measures the effect of the explanatory variable Xⱼ on the expected value of the response variable Y, adjusted for the remaining covariates.

Questions of interest are:
1. Is the effect of Xⱼ important for the description of Y?
2. What is the association between Y and Xⱼ (positive or negative)?
3. What is the magnitude of the effect of Xⱼ on Y?

Interpretation of Regression Coefficients

Answers:

1. Look at the posterior distribution of βⱼ and its credible interval. Does the credible interval contain 0?

2. Calculate the posterior probabilities P(βⱼ > 0) and P(βⱼ < 0). In WinBUGS, use the step function
   p.betaj <- step(betaj)
   which creates a binary node p.betaj taking the value 1 if βⱼ > 0 and 0 otherwise.

3. The posterior mean/median of βⱼ is a measure of the posterior expected change of the response variable Y if Xⱼ increases by 1 unit and all other covariates are fixed.

Interpretation of β₀

β₀ measures the posterior expected value of Y if all covariates are zero. Often, zero is not in the range of the covariates, and then the interpretation of β₀ is not meaningful.

Example: response: heart rate; covariate: body temperature in degrees C.

Better: center the covariates at their means, xᵢⱼᶜ = xᵢⱼ − x̄ⱼ:

μᵢ = β₀ᶜ + β₁ᶜ(xᵢ₁ − x̄₁) + ⋯ + β_pᶜ(xᵢ_p − x̄_p),

where β₀ᶜ is the expected value of Y when all covariates are equal to their means.

Centering the covariates is also advisable from a computational point of view: it decreases the posterior correlation between parameters and thus improves convergence of the Gibbs sampler. We will show this in Section 6.

Regression Example in WinBUGS

Prepare the data file by including the variable names to be used by WinBUGS at the top of each column and END at the end, and save it as a plain text file softdrinkdata.txt in your working directory:

cases[]  time[]   distance[]
7        16.68    560
3        11.5     220
3        12.03    340
4        14.88    80
6        13.75    150
...      ...      ...
17       35.1     770
10       17.9     140
26       52.32    810
9        18.75    450
8        19.83    635
4        10.75    150
END

For some odd reason (bug in WinBUGS?), make sure there is a blank line after END.

Regression Example in R

Alternatively, if we want to fit a linear model in the frequentist way in R first, to compare later on with the Bayesian results in WinBUGS, we read in the data, fit a linear model and output a list using dput(), using the following R commands:

softdrink <- read.table(file="softdrinkdata.txt", header=TRUE, sep="")
attach(softdrink)
cases_cent <- cases - mean(cases)
distance_cent <- distance - mean(distance)
summary(lm(time ~ cases_cent + distance_cent))
dput(list(time=time, cases=cases, distance=distance),
     "softdrinkdatalist.txt")

Regression Output in R

Call:
lm(formula = time ~ cases_cent + distance_cent)

Residuals:
    Min      1Q  Median      3Q     Max
-5.7880 -0.6629  0.4364  1.1566  7.4197

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)   22.384000   0.651895  34.337  < 2e-16 ***
cases_cent     1.615907   0.170735   9.464 3.25e-09 ***
distance_cent  0.014385   0.003613   3.981 0.000631 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 3.259 on 22 degrees of freedom
Multiple R-squared: 0.9596, Adjusted R-squared: 0.9559
F-statistic: 261.2 on 2 and 22 DF, p-value: 4.687e-16

Regression Model in WinBUGS

model
{
# likelihood
for (i in 1:n){
   time[i] ~ dnorm( mu[i],tau)
   mu[i] <- beta0 + beta1*(cases[i]-mean(cases[])) +
            beta2* (distance[i]-mean(distance[]))
}
# prior distributions
tau ~ dgamma(0.001,0.001)
beta0 ~ dnorm(0.0,1.0E-4)
beta1 ~ dnorm(0.0,1.0E-4)
beta2 ~ dnorm(0.0,1.0E-4)
# definition of sigma, sigma2, and sd(Y)
sigma2 <- 1/tau
sigma <- sqrt(sigma2)
# calculation of Bayesian version of R-squared
R2B <- 1-sigma2/pow(sd(time[]),2)
# posterior probabilities
p.beta0 <- step(beta0)
p.beta1 <- step(beta1)
p.beta2 <- step(beta2)
}
# inits
list(tau=1, beta0=1, beta1=0, beta2=0)

Regression Output in WinBUGS

node     mean     sd        MC error   2.5%      median   97.5%    start
R2B      0.9516   0.01732   7.742E-4   0.9063    0.9551   0.9737   1001
beta0    22.37    0.6681    0.02255    21.15     22.35    23.78    1001
beta1    1.61     0.1851    0.005237   1.254     1.606    1.992    1001
beta2    0.01447  0.003931  1.263E-4   0.006683  0.0144   0.02251  1001
p.beta0  1.0      0.0       3.162E-12  1.0       1.0      1.0      1001
p.beta1  1.0      0.0       3.162E-12  1.0       1.0      1.0      1001
p.beta2  1.0      0.0       3.162E-12  1.0       1.0      1.0      1001
sigma2   11.67    4.175     0.1866     6.364     10.82    22.7     1001

Bayesian Coefficient of Determination

A high value of the precision τ (low σ²) indicates that the model can accurately predict the expected value of Y. We can rescale this quantity using the sample variance of the response variable Y, s_Y², giving the R_B² statistic:

R_B² = 1 − σ²/s_Y² = 1 − 1/(τ s_Y²).

This quantity can be interpreted as the proportional reduction of uncertainty concerning the response variable Y achieved by incorporating the explanatory variables Xⱼ in the model.

It can be regarded as the Bayesian analogue of the adjusted coefficient of determination

R_adj² = 1 − σ̂²/s_Y²,   where   σ̂² = (1/(n − p)) Σ_(i=1)^n (yᵢ − ŷᵢ)²   with   ŷᵢ = β̂₀ + Σ_(j=1)^p xᵢⱼ β̂ⱼ,

and β̂ⱼ are the maximum likelihood estimates of βⱼ.

Missing Data

Missing data are easily incorporated in a Bayesian analysis. They are treated as unknown parameters to be estimated.

Assume, for instance, that observation 21 in the linear regression Example 4.1 was missing, i.e. time[21] for cases[21]=10 and distance[21]=140 was missing. In WinBUGS, missing values are denoted by NA in the dataset. Substituting 17.9 in the dataset by NA and running the code again, now monitoring the node time[21], we get the output:

node      mean   sd     MC error  2.5%   median  97.5%  start  sample
time[21]  21.06  3.696  0.0821    13.71  21.02   28.4   1001   2000

Prediction in WinBUGS

Predicting future observations that follow the same distributional assumptions as the observed data is straightforward. In the regression context, we are interested in the posterior predictive distribution of a future observation Y_(n+1)|y₁, …, yₙ with certain values of the predictors x. Its posterior predictive pdf is

f(y_(n+1)|y, x) = ∫ f(y_(n+1)|θ, x) f(θ|y, x) dθ

or, ignoring the dependence on x,

f(y_(n+1)|y) = ∫ f(y_(n+1)|θ) f(θ|y) dθ,

and we can use the mixture method (to be discussed in Chapter 7) to simulate from this distribution. This is easily implemented in WinBUGS.

Prediction in WinBUGS

In the linear regression Example 4.1, this means defining another variable in the code with the same distribution as the original data and values of the predictor variables for which we want to forecast, e.g. cases=20 and distance=1000, and including this variable in the dataset with value NA:

pred.time ~ dnorm( pmu,tau)
pmu <- beta0 + beta1*(20-mean(cases[])) +
       beta2* (1000-mean(distance[]))

Running the model again and monitoring pred.time gives the posterior predictive summary:

node       mean   sd    MC error  2.5%   median  97.5%  start  sample
pred.time  48.98  3.71  0.07796   41.73  49.0    56.56  1001   2000

4.3 Model Checking

Model Assessment

Having successfully fit a model to a given dataset, the statistician must be concerned with whether the fit is adequate and whether the assumptions made by the model are justified. For example, in standard linear regression, the assumptions of normality, independence, linearity, and homogeneity of variance must all be investigated.

Several authors have suggested using the marginal distribution of the data, p(y), in this regard. Observed yᵢ values for which p(yᵢ) is small are "unlikely", and therefore may be considered outliers under the assumed model. Too many small values of p(yᵢ) suggest the model itself is inadequate and should be modified and expanded.

A problem with this approach is the difficulty in defining how small is "small" and how many outliers are "too many". In addition, we have the problem of the possible impropriety of p(y) under noninformative priors. As such, we might work with predictive distributions instead, since they will be proper whenever the posterior is.

Model Checking

Checking the validity of the model assumptions:
- examination of individual observations
- global goodness-of-fit checks
- comparison between two or more competitor models (later)

Examination of Individual Observations

Consider data y₁, …, yₙ and parameters θ under the assumed model. Gelfand et al. (1992) suggest a series of "checking functions". These are based on comparing a predictive distribution p(Yᵢ^rep) (to be made precise in the following) with the actual observed yᵢ:

1. the residuals: yᵢ − E[Yᵢ^rep]
2. the standardised residuals: (yᵢ − E[Yᵢ^rep]) / √Var(Yᵢ^rep)
3. the chance of getting a more extreme observation: min( P(Yᵢ^rep < yᵢ), P(Yᵢ^rep ≥ yᵢ) )
4. the chance of getting a more surprising observation: P(Yᵢ^rep : f(Yᵢ^rep) ≤ f(yᵢ))
5. the predictive ordinate of the observation: f(yᵢ^rep)

Separate Evaluation Data Available

Assume the data have been divided into a training set z and an evaluation set y. Then the posterior distribution of θ is based on z, and the predictive distribution above is given by

f(yᵢ|z) = ∫ f(yᵢ|z, θ) f(θ|z) dθ.

As usually the yᵢ's are conditionally independent of the zᵢ's given θ, this becomes

f(yᵢ|z) = ∫ f(yᵢ|θ) f(θ|z) dθ.

In WinBUGS, calculating the predictive distribution just requires defining an additional node for each Yᵢ^rep with the appropriate parents and monitoring the Yᵢ^rep's.

The observed yᵢ can then be compared with their predictive distribution through the residuals or standardized residuals

rᵢ = yᵢ − E[Yᵢ^rep|z]   and   srᵢ = (yᵢ − E[Yᵢ^rep|z]) / √Var(Yᵢ^rep|z).

- Plotting these residuals versus fitted values might reveal a failure of the normality or homogeneity of variance assumption.
- Plotting them versus time could reveal a failure of independence.
- Summing their squares or absolute values could provide an overall measure of fit.

No Separate Evaluation Data Available: Cross-Validation Approach

The above discussion assumes the existence of two independent data samples, which may well be unavailable in many problems. As such, Gelfand et al. (1992) suggested a cross-validation approach, wherein the fitted value for yᵢ^rep is computed conditionally on all the data except yᵢ, namely y₍ᵢ₎ = (y₁, …, y_(i−1), y_(i+1), …, yₙ). That is, the ith residual becomes

rᵢ = yᵢ − E[Yᵢ^rep|y₍ᵢ₎],

and the ith standardized residual

srᵢ = (yᵢ − E[Yᵢ^rep|y₍ᵢ₎]) / √Var(Yᵢ^rep|y₍ᵢ₎).

Note that in this cross-validatory approach we compute the posterior mean and variance with respect to the conditional predictive distribution

p(yᵢ|y₍ᵢ₎) = p(y)/p(y₍ᵢ₎) = ∫ p(yᵢ|θ, y₍ᵢ₎) p(θ|y₍ᵢ₎) dθ,

which gives the likelihood of each point given the remainder of the data. The actual values of p(yᵢ|y₍ᵢ₎), referred to as the conditional predictive ordinate, or CPO, can be plotted versus i as an outlier diagnostic, since data values having low CPO are poorly fit by the model.

WinBUGS Cross-Validation

Unfortunately, this is generally difficult to do within WinBUGS. But an approximation to the cross-validatory method is to use the methods for a separate evaluation set, replacing z by y. Hence our predictive distribution becomes the posterior predictive density without case omission:

f(yᵢ^rep|y) = ∫ f(yᵢ^rep|y, θ) f(θ|y) dθ = ∫ f(yᵢ^rep|θ) f(θ|y) dθ.

If we do wish to sample from the correct cross-validatory predictive distribution, this can be carried out using an additional importance sampling step to remove the effect of yᵢ when repredicting Yᵢ^rep (Gelfand et al., 1992), although this would have to be carried out external to WinBUGS.

Let us implement checking functions 1 and 2 in WinBUGS for Example 4.1, using the approximate cross-validatory method. Note that

E[Yᵢ^rep|y] = ∫ yᵢ^rep f(yᵢ^rep|y) dyᵢ^rep
            = ∫ yᵢ^rep [ ∫ f(yᵢ^rep|θ) f(θ|y) dθ ] dyᵢ^rep
            = ∫ [ ∫ yᵢ^rep f(yᵢ^rep|θ) dyᵢ^rep ] f(θ|y) dθ
            = E[μᵢ|y],

i.e. the posterior mean of μᵢ = β₀ + β₁xᵢ₁ + β₂xᵢ₂. Similarly, Var(Yᵢ^rep|y) is obtained from the posterior mean of σ²; the standardized residuals below use this.

Examination of Individual Observations in WinBUGS

Thus, in WinBUGS we only need to define the following nodes:

for (i in 1:n){
   r[i] <- time[i]-mu[i]
   sr[i] <- (time[i]-mu[i])*sqrt(tau)
}

Monitoring these vectors r and sr, we can look at summary statistics etc. However, we get a better overview by using the comparison tool of the Inference menu and clicking on "boxplot":

Figure 11: Boxplot of the standardized residuals sr[1]–sr[25].

Checking Function 3 in WinBUGS

To compute P(Yᵢ^rep < yᵢ), we first need to obtain sample values of the random variable Yᵢ^rep by generating a replicate dataset time.rep[i] which depends on the current values of mu[i] and tau at each iteration.

The step() function is then used to calculate the variable p.smaller[i], which takes the value 1 if time[i]-time.rep[i] ≥ 0 and zero otherwise. The posterior mean of p.smaller[i] is simply the proportion of iterations for which time.rep[i] < time[i], and P(Yᵢ^rep ≥ yᵢ) = 1 − posterior mean of p.smaller[i]. The chance of observing a more extreme value for Yᵢ is thus the minimum of these two probabilities.

node           mean    sd      MC error
p.smaller[1]   0.077   0.2666  0.005964
p.smaller[2]   0.626   0.4839  0.01051
p.smaller[3]   0.4875  0.4998  0.01109
p.smaller[4]   0.9275  0.2593  0.006629
p.smaller[5]   0.449   0.4974  0.009629
p.smaller[6]   0.459   0.4983  0.009853
p.smaller[7]   0.5915  0.4916  0.01047
p.smaller[8]   0.6325  0.4821  0.01033
p.smaller[9]   0.9555  0.2062  0.004386
p.smaller[10]  0.7575  0.4286  0.01117
p.smaller[12]  0.431   0.4952  0.009716
p.smaller[13]  0.631   0.4825  0.009968
p.smaller[14]  0.633   0.482   0.0116
p.smaller[15]  0.591   0.4916  0.009021
p.smaller[16]  0.4285  0.4949  0.012
p.smaller[17]  0.571   0.4949  0.01115
p.smaller[18]  0.8505  0.3566  0.007033
p.smaller[19]  0.712   0.4528  0.009266
p.smaller[20]  0.052   0.222   0.004984
p.smaller[21]  0.235   0.424   0.008387
p.smaller[22]  0.175   0.38    0.008043
p.smaller[23]  0.093   0.2904  0.006644
p.smaller[24]  0.09    0.2862  0.007328
p.smaller[25]  0.4685  0.499   0.01068

Checking Function 5 in WinBUGS

The CPO, checking function 5, can be explicitly calculated in WinBUGS using the relationship

1/f(yᵢ|y₍ᵢ₎) = f(y₍ᵢ₎)/f(y)
            = ∫ f(y₍ᵢ₎|θ) f(θ)/f(y) dθ
            = ∫ [1/f(yᵢ|θ)] · f(y|θ) f(θ)/f(y) dθ
            = ∫ [1/f(yᵢ|θ)] f(θ|y) dθ
            = E_(θ|y)[ 1/f(yᵢ|θ) ].

Thus, the ith CPO can be estimated from the inverse of the sample mean of the inverse likelihood of yᵢ, for θ generated from the full posterior distribution, i.e. a Monte Carlo estimate of CPOᵢ is

CPOᵢ ≈ [ (1/N) Σ_(n=1)^N 1/f(yᵢ|θ⁽ⁿ⁾) ]⁻¹,

which is the harmonic mean of the likelihood values. But note that harmonic means are notoriously unstable, so care is required regarding convergence!

In WinBUGS:

like[i] <- sqrt(tau/(2*PI))*exp(-0.5*pow(sr[i],2))
p.inv[i] <- 1/like[i]

node       mean    sd        MC error
p.inv[1]   34.12   27.03     0.7646
p.inv[2]   9.383   1.698     0.04268
p.inv[3]   8.959   1.627     0.03359
p.inv[4]   31.2    20.32     0.4766
p.inv[5]   8.929   1.512     0.03761
p.inv[6]   8.712   1.41      0.03228
p.inv[7]   9.184   1.669     0.042
p.inv[8]   9.37    1.669     0.0396
p.inv[9]   6273.0  154700.0  3500.0
p.inv[10]  13.03   6.565     0.1362
p.inv[11]  11.38   2.956     0.0671
p.inv[12]  9.211   1.792     0.04563
p.inv[13]  9.213   1.586     0.03934
p.inv[14]  9.338   1.699     0.0409
p.inv[15]  8.846   1.423     0.03416
p.inv[16]  9.538   2.268     0.0458
p.inv[17]  8.844   1.473     0.03562
p.inv[18]  16.44   6.838     0.1572
p.inv[19]  10.51   2.532     0.06173
p.inv[20]  53.19   49.06     1.06
p.inv[21]  13.66   7.111     0.1599
p.inv[22]  30.14   53.81     1.025
p.inv[23]  24.4    9.003     0.237
p.inv[24]  27.73   21.34     0.5858
p.inv[25]  8.82    1.473     0.03519

Global Goodness-of-fit Checks

The idea of global goodness-of-fit checks goes back to Rubin (1984). One constructs test statistics or other discrepancy measures D(y) that attempt to measure departures of the observed data from the assumed model (likelihood and prior distribution).

For example, suppose we have fit a normal distribution to a sample of univariate data, and wish to investigate the model's fit in the lower tail. We might compare the observed value of the discrepancy measure D(y) = y_min with its posterior predictive distribution, p(D(y^rep)|y), where y^rep denotes a hypothetical future value of y. If the observed value is extreme relative to this reference distribution, doubt is cast on some aspect of the model.
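Outside WinBUGS, the same harmonic-mean estimate can be computed from exported posterior draws (e.g. via CODA). In the sketch below, mu.draws (an N x n matrix of draws of mu[i]) and tau.draws (a vector of draws of tau) are assumed objects, not produced by the code in the notes:

## Monte Carlo estimate of CPO_i as the harmonic mean of the likelihood values
cpo <- function(y, mu.draws, tau.draws) {
  sapply(seq_along(y), function(i) {
    like <- dnorm(y[i], mean = mu.draws[, i], sd = 1 / sqrt(tau.draws))
    1 / mean(1 / like)   # harmonic mean over the posterior draws
  })
}
## low CPO values flag observations that are poorly fit by the model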

Posterior Predictive Model Checks

In order to be computable in the classical framework, test statistics must be functions of the observed data alone. But as pointed out by Gelman et al. (1996), basing Bayesian model checking on the posterior predictive distribution allows generalized test statistics D(y, θ) that depend on the parameters as well as the data. For example, as an omnibus goodness-of-fit measure, Gelman et al. (1996) recommend

D(y, θ) = Σ_(i=1)^n (yᵢ − E[Yᵢ|θ])² / Var(Yᵢ|θ).

With θ varying according to its posterior distribution, we would now compare the distribution of D(y, θ) for the observed y with that of D(y^rep, θ) for a future observation y^rep.

A convenient summary measure of the extremeness of the former with respect to the latter is the tail area

pD = P[D(y^rep, θ) > D(y, θ) | y] = ∫ P[D(y^rep, θ) > D(y, θ) | θ] p(θ|y) dθ.

As such, pD is sometimes referred to as the Bayesian P-value. In the case where the distribution of D(y^rep, θ) is free of θ, pD is exactly equal to the frequentist P-value, the probability of seeing a test statistic as extreme as the one actually observed.

Posterior Predictive Model Checks in WinBUGS

In Example 4.1, we consider two different statistics for D(y^rep, θ) which may be sensitive to outlying observations in a Normal model:

- coefficient of skewness: E[((X − μ)/σ)³], a measure of asymmetry; the skewness of a Normal rv is zero
- coefficient of kurtosis: E[((X − μ)/σ)⁴], a measure of peakedness; the kurtosis of a Normal rv is 3

for (i in 1:n){
   # residuals and moments for observed data
   r[i] <- time[i]-mu[i]
   sr[i] <- (time[i]-mu[i])*sqrt(tau)
   m3[i] <- pow(sr[i],3)
   m4[i] <- pow(sr[i],4)
   # residuals and moments of replicates for Bayesian p-value
   time.rep[i] ~ dnorm(mu[i], tau)
   resid.rep[i] <- time.rep[i]-mu[i]
   sresid.rep[i] <- resid.rep[i]*sqrt(tau)
   m3.rep[i] <- pow(sresid.rep[i],3)
   m4.rep[i] <- pow(sresid.rep[i],4)
}

Bayesian P-values in WinBUGS

# Bayesian p-value:
skew.obs     <- sum(m3[])/n
skew.rep     <- sum(m3.rep[])/n
p.skew       <- step(skew.rep-skew.obs)
kurtosis.obs <- sum(m4[])/n
kurtosis.rep <- sum(m4.rep[])/n
p.kurtosis   <- step(kurtosis.rep-kurtosis.obs)

node          mean      sd      MC error
skew.obs      0.09787   0.8858  0.0185
skew.rep     -0.02244   0.7959  0.01879
p.skew        0.4685    0.499   0.01028
kurtosis.obs  3.783     2.754   0.05979
kurtosis.rep  3.045     2.023   0.04379
p.kurtosis    0.417     0.4931  0.01081

4.4 Model Comparison via DIC

Model Comparison via DIC

In general, for model comparison we need:
- a measure of fit
- a measure of complexity

e.g.
AIC = −2 log p(y|θ̂) + 2p
BIC = −2 log p(y|θ̂) + p log n

Problems with Classical Information Criteria

- the χ²-approximation is questionable for small samples
- p = no. of parameters: what is it in hierarchical models?
- n = no. of observations: what is it in hierarchical models?

Deviance

Suggestion by Dempster (1974): base model assessment on the posterior distribution of the log-likelihood of the data. This is equivalent to the posterior distribution of the deviance:

D(θ) = −2 log p(y|θ) + 2 log p(y|θ_sat)

Deviance Information Criterion

Suggestion by Spiegelhalter et al. (2002):

measure of fit: D̄ = E_(θ|y)[D], the posterior mean of the deviance
measure of complexity: pD = D̄ − D(θ̄), the effective number of parameters

DIC = D̄ + pD = D(θ̄) + 2 pD

The model with the smallest DIC value is preferred. DIC calculation is implemented in WinBUGS.
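If the deviance is monitored as a node, the DIC components can also be assembled by hand from the MCMC output. A sketch in R, where dev.draws (posterior draws of the deviance) and dhat (the deviance evaluated at the posterior means) are assumed to be available from the sampler output:

## DIC from monitored deviance samples, following the definitions above
dic.from.deviance <- function(dev.draws, dhat) {
  dbar <- mean(dev.draws)          # posterior mean of the deviance
  pD   <- dbar - dhat              # effective number of parameters
  c(Dbar = dbar, Dhat = dhat, pD = pD, DIC = dbar + pD)
}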

DIC Example: Multiple Linear Regression

We will illustrate the use of DIC for comparing four different models for the softdrink Example 4.1:

1. Model 1: intercept only
2. Model 2: cases
3. Model 3: distance
4. Model 4: cases and distance

We run each model in WinBUGS and set the DIC tool in the Inference menu.

DIC Output

Dbar = posterior mean of −2logL;  Dhat = −2logL at the posterior mean of the stochastic nodes.

Model              Dbar     Dhat     pD     DIC
Intercept          209.092  207.061  2.031  211.123
Cases              143.549  140.477  3.072  146.622
Distance           170.575  167.503  3.072  173.647
Cases + Distance   131.289  127.030  4.259  135.547

4.5 Analysis of Variance

ANOVA Models

Now
- response variable Y: continuous
- explanatory variable X: discrete
X is called a factor with levels i = 1, ..., I.

ANOVA Model
Yij ~ N(μi, σ²),   i = 1, ..., I,   j = 1, ..., ni,
where
- Yij is the jth observation of Y at level i of X,
- μi = β0 + βi,
- β0 is the overall common mean,
- βi is the group-specific parameter.

Parametrizations and Interpretations

We need a constraint to make the I + 1 parameters β0, β1, ..., βI identifiable. Either:

Corner constraint:
The effect of the baseline level (or reference category) is set to 0:
β1 = 0, so μ1 = β0 and μi = β0 + βi for i = 2, ..., I.

or

Sum-to-zero constraint:
β1 = -Σ_{i=2}^I βi, i.e. Σ_{i=1}^I βi = 0.
Then β0 = (1/I) Σ_{i=1}^I μi is the overall mean effect, and βi is the deviation of each level from this overall mean effect.
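As a rough illustration, the two constraints correspond to R's "treatment" and "sum" contrasts for a fitted linear model. The small data frame below is made up purely for illustration; note that contr.sum determines the last level from the others, whereas the corner constraint above fixes the first.

set.seed(1)
dat <- data.frame(x = factor(rep(1:3, each = 10)),
                  y = rnorm(30, mean = rep(c(10, 12, 15), each = 10)))

# corner constraint: beta_1 = 0, intercept = mean of the reference level
coef(lm(y ~ x, data = dat, contrasts = list(x = "contr.treatment")))

# sum-to-zero constraint: intercept = average of the level means
coef(lm(y ~ x, data = dat, contrasts = list(x = "contr.sum")))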

ANOVA in WinBUGS

Assume the data are given in pairs (xi, yi), i = 1, ..., n (n = Σi ni):

#likelihood
for (i in 1:n){
  y[i] ~ dnorm(mu[i],tau)
  mu[i] <- beta0 + beta[x[i]]
}
#corner constraint
beta[1] <- 0.0
#sum-to-zero constraint
#beta[1] <- - sum( beta[2:I] )
#prior
beta0 ~ dnorm(0.0,1.0E-4)
for (i in 2:I){
  beta[i] ~ dnorm(0.0,1.0E-4)
}

ANOVA Example

Example 4.2
McCarthy (2007) describes a dataset of weights of starlings at four different locations.

Location 1: 78 88 87 88 83 82 81 80 80 89
Location 2: 78 78 83 81 78 81 81 82 76 76
Location 3: 79 73 79 75 77 78 80 78 83 84
Location 4: 77 69 75 70 74 83 80 75 76 75
Classical ANOVA

Frequentist analysis in R:

star.df <- read.table("starlingdata.txt", header=TRUE)
attach(star.df)
loc <- factor(location)
star.aov <- aov(Y~loc)
anova(star.aov)
summary.lm(star.aov)$coef

R-Output

Analysis of Variance Table

Response: Y
          Df Sum Sq Mean Sq F value    Pr(>F)
loc        3 341.90  113.97  9.0053 0.0001390 ***
Residuals 36 455.60   12.66

> summary.lm(star.aov)$coef
            Estimate Std. Error   t value     Pr(>|t|)
(Intercept)     83.6   1.124969 74.313150 5.325939e-41
loc2            -4.2   1.590947 -2.639938 1.218170e-02
loc3            -5.0   1.590947 -3.142783 3.342926e-03
loc4            -8.2   1.590947 -5.154164 9.372412e-06

WinBUGS Code

model
{ for (i in 1:40) {
    mu[i] <- beta0 + beta[location[i]]
    Y[i] ~ dnorm(mu[i], tau)
  }
  #prior, corner constraint
  beta[1] <- 0
  beta0 ~ dnorm(0.0,1.0E-4)
  for (i in 2:4){
    beta[i] ~ dnorm(0.0, 1.0E-6)
  }
  tau ~ dgamma(0.001, 0.001) # vague prior on the precision
}
#inits
list(beta0=70, beta=c(NA, 70, 70, 70), tau=1)

#data
location[] Y[]
1 78
...
1 89
2 78
...
2 76
3 79
...
3 84
4 77
...
4 75
END

WinBUGS Results

node     mean     sd       MC error  2.5%     median   97.5%    start  sample
beta[2]  -4.204   1.65     0.03838   -7.302   -4.162   -0.9981  1001   2000
beta[3]  -4.963   1.597    0.04041   -7.977   -4.964   -1.699   1001   2000
beta[4]  -8.143   1.61     0.03213   -11.26   -8.168   -5.014   1001   2000
beta0    83.58    1.142    0.02757   81.31    83.59    85.7     1001   2000
tau      0.07878  0.01887  4.333E-4  0.04582  0.07712  0.1183   1001   2000

Using the comparison tool of the Inference menu and clicking on "boxplot" for beta:

Figure 12: Boxplot of location effects (beta[2], beta[3], beta[4]).

Model Comparison

Let us compare the fit of this one-way ANOVA model with a model that assumes no differences in the expected weights at the different locations:

for (i in 1:40) {
  Y[i] ~ dnorm(beta0, tau)
}

Model      Dbar     Dhat     pD     DIC
ANOVA      216.156  211.053  5.103  221.259
Same Mean  235.316  233.229  2.087  237.402

4.6 Generalized Linear Models

Generalized Linear Models

Generalized Linear Models (GLMs) are a generalization of the linear model for modelling random variables from the exponential family, thus including the Normal, Binomial, Poisson, Exponential and Gamma distributions. GLMs are one of the most important components of modern statistical theory, unifying the approach to statistical modelling. Details on GLMs can be found in McCullagh and Nelder (1989), Fahrmeir and Tutz (2001), and Dey, Ghosh, Mallick (2000).

3 components of a LM:
- stochastic component: Yi ~ N(μi, σ²), i.e. E[Yi] = μi
- systematic component: ηi = xi'β (linear predictor)
- link function: g(μi) = ηi, the identity

3 components of a GLM:
- stochastic component: Yi ~ exponential family with location parameter θ and dispersion parameter φ
- systematic component: ηi = xi'β
- link function: g(μi) = ηi
Models for Binary Response

Example 4.3
Fahrmeir and Tutz (1994) describe data provided by the Klinikum Grosshadern, Munich, on infection from births by Caesarean section. The response variable of interest is the occurrence or nonoccurrence of infection, with three dichotomous covariates: whether the Caesarean section was planned or not, whether any risk factors such as diabetes, being overweight etc. were present or not, and whether antibiotics were given as a prophylaxis. The aim was to analyse the effects of the covariates on the risk of infection, especially whether antibiotics can decrease the risk of infection.

The binary data are summarized in the following table (number of patients):

                      Caesarean planned     Not planned
                      Infection             Infection
                      yes      no           yes      no
Antibiotics
  Risk factors          1      17            11      87
  No risk factors       0       2             0       0
No antibiotics
  Risk factors         28      30            23       3
  No risk factors       8      32             0       9

Models for Binary Response

Let Yi = 1 if infection occurs for the ith patient and 0 otherwise, and let xi denote the corresponding vector of covariate values.

Yi | xi, πi ~ Bernoulli(πi)
ηi = xi'β = β0 + β1 xi1 + β2 xi2 + β3 xi3

Link function: η = g(π), or π = F(η) where F is a cdf:
- logit model: g(π) = log(π/(1-π)), π = e^η/(1+e^η), the logistic cdf
- probit model: g(π) = Φ^(-1)(π), π = Φ(η), the Normal cdf
- complementary log-log model: g(π) = log(-log(1-π)), π = 1 - exp(-exp(η)), the extreme-minimal-value cdf

Interpretation of Logit Parameters

log(π/(1-π)) = β0 + β1 x   implies   π/(1-π) = exp(β0) exp(β1 x),

so

OR_{x,x+1} = odds(x+1)/odds(x) = exp(β0) exp(β1 (x+1)) / [exp(β0) exp(β1 x)] = exp(β1).

If x increases by 1 unit, the odds are multiplied by exp(β1): exponentials of covariate effects have a multiplicative effect on the odds/relative risk.

For other link functions:
- Interpret the covariate effects on the linear predictor η = x'β.
- Transform this linear effect on η into a nonlinear effect on π (with the aid of a graph of the response function π = g^(-1)(η)).
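As a rough frequentist cross-check of this odds-ratio interpretation, the R sketch below fits the logit model to the aggregated Caesarean counts with glm() and exponentiates the coefficients. The 0/1 coding of the covariates (1 = planned, risk factors present, antibiotics given) and the data-frame layout are my own assumptions mirroring the table above; the resulting odds ratios should be broadly comparable to the Bayesian estimates reported later.

cae <- data.frame(
  plan  = c(1, 1, 0, 0, 1, 1, 0, 0),
  risk  = c(1, 0, 1, 0, 1, 0, 1, 0),
  antib = c(1, 1, 1, 1, 0, 0, 0, 0),
  inf   = c(1, 0, 11, 0, 28,  8, 23, 0),   # infections ("yes")
  noinf = c(17, 2, 87, 0, 30, 32,  3, 9))  # no infection ("no")
cae <- subset(cae, inf + noinf > 0)        # drop the empty cell

fit <- glm(cbind(inf, noinf) ~ plan + risk + antib,
           family = binomial(link = "logit"), data = cae)
exp(coef(fit))   # exp(beta_j): multiplicative effects on the odds of infection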

Logit WinBUGS Code

model{
  for( i in 1 : N ) {
    y[i] ~ dbern(p[i])
    logit(p[i]) <- beta0 + beta[1]*plan[i] +
                   beta[2]*factor[i] + beta[3]*antib[i]
    # centered covariates
    # logit(p[i]) <- beta0 + beta[1]*(plan[i]-mean(plan[])) +
    #                beta[2]*(factor[i]-mean(factor[])) +
    #                beta[3]*(antib[i]-mean(antib[]))
  }
  beta0 ~ dnorm(0.0,0.001)
  for (i in 1:3){
    beta[i] ~ dnorm(0.0,0.001)
    or[i] <- exp(beta[i])
  }
}
list(beta0=0,beta=c(0,0,0)) #inits
list(N=251)                 #data

WinBUGS Output

Figure 13: Traceplots for uncentered covariates (beta[1], beta[2], beta[3], beta0).

Figure 14: Traceplots for centered covariates.

Figure 15: Autocorrelation plots (centered and uncentered covariates).

Summary statistics for the model with uncentered covariates:

node     mean     sd       MC error  2.5%     median    97.5%
beta[1]  -1.116   0.4392   0.02788   -1.993   -1.114    -0.2388
beta[2]   2.069   0.4982   0.03463    1.157    2.055     3.057
beta[3]  -3.333   0.4921   0.02534   -4.346   -3.316    -2.393
beta0    -0.8242  0.5331   0.04337   -1.961   -0.8118    0.1738
or[1]     0.3604  0.1639   0.009911   0.1362   0.3282    0.7878
or[2]     8.988   4.894    0.3246     3.181    7.804    21.26
or[3]     0.04017 0.02009  0.001003   0.01295  0.03628   0.09139

None of the 95% credible intervals of the covariate effects contains 0. Antibiotics lower the odds of infection by a factor of 0.04. When the Caesarean is planned, the odds of infection decrease by a factor of 0.36, and when risk factors are present, the odds of infection are 8.988 times higher.

Comparing Model Fits

Consider 3 different models with 3 different link functions and compare the fit with DIC:

Model    Dbar     Dhat     pD     DIC
Logit    230.621  226.588  4.033  234.654
Probit   231.221  227.041  4.180  235.400
Cloglog  228.101  224.152  3.949  232.050

The complementary log-log link seems to give a slightly better fit, but there are only minor differences in the DIC values.

4.7 Hierarchical Models

Hierarchical Models

In many statistical applications, model parameters are related by the structure of the problem. For example, in a study of the effectiveness of cardiac treatments, it is assumed that patients in hospital j have survival probability θj.

Estimating each of these θj separately might result in large standard errors for hospitals with few patients. It can also lead to overfitting and to models that cannot predict new data well. Assuming all survival probabilities are the same will ignore potential treatment differences between hospitals and will not fit the data accurately.

It might be reasonable to expect that the θj are related and should be estimated jointly. This is achieved in a natural way by assuming that the θj come from a common population distribution. This population distribution can depend on a further parameter.

Hierarchical model with hyperparameters:

Yij | θj ~ f(yij | θj)
θj | φ ~ f(θj | φ)
φ ~ f(φ)

Hierarchical Models: Rat Tumor Example

Example 4.4
This example, in the context of drug evaluation for possible clinical trial application, is taken from Gelman et al. (2004). A control group of 14 laboratory rats of type F344 is given a zero dose of a certain drug. The aim is to estimate the probability of developing endometrial stromal polyps (a certain tumor). The outcome is that 4 out of 14 rats developed this tumor.

Historical data: 70 previous experiments on the same type of rats (tumors/number of rats):

0/20   0/20   0/20   0/20   0/20   0/20   0/20   0/19   0/19   0/19
0/19   0/18   0/18   0/17   1/20   1/20   1/20   1/20   1/19   1/19
1/18   1/18   2/25   2/24   2/23   2/20   2/20   2/20   2/20   2/20
2/20   1/10   5/49   2/19   5/46   3/37   2/17   7/49   7/47   3/20
3/20   2/13   9/48  10/50   4/20   4/20   4/20   4/20   4/20   4/20
4/20  10/48   4/19   4/19   4/19   5/22  11/46  12/49   5/20   5/20
6/23   5/19   6/22   6/20   6/20   6/20  16/52  15/47  15/46   9/24

1. Approach: Bayesian model with fixed prior

Y | θ ~ Binomial(14, θ)
θ ~ Beta(α, β)

Assume that we know from the historical data the mean and sd of the tumor probabilities among female lab rats of type F344. We find the values of α and β of the beta distribution with this mean and sd. This yields a Beta(α + 4, β + 10) posterior distribution for θ.

The observed sample mean and sd of the yj/nj are 0.136 and 0.103, respectively. Setting

0.136 = α/(α + β)
0.103² = αβ / [(α + β)² (α + β + 1)]

yields α = 1.4 and β = 8.6.
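As a small illustration, the moment equations above can be solved in closed form, as in the R sketch below; the variable names are mine and the result reproduces α = 1.4 and β = 8.6 up to rounding.

m <- 0.136   # sample mean of y_j/n_j
s <- 0.103   # sample sd of y_j/n_j

# from the variance equation: alpha + beta = m(1-m)/s^2 - 1
ab    <- m * (1 - m) / s^2 - 1
alpha <- m * ab
beta  <- (1 - m) * ab
c(alpha = alpha, beta = beta)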

Using a Beta(1.4, 8.6) prior for θ yields a Beta(5.4, 18.6) posterior distribution with posterior mean 0.223 and posterior sd 0.083, whereas 4/14 = 0.286.

Questions:
- Can we use the same prior to make inference about the tumor probabilities in the first 70 groups?
- Is the point estimate used to derive α and β representative?
- Does it make sense to estimate α and β?

2. Approach: Hierarchical Bayesian model

In the absence of any information about the θj (other than the data), and since no ordering or grouping of the parameters can be made, we must assume symmetry in the prior distribution of the parameters. This means that the parameters (θ1, ..., θJ) are modelled as exchangeable in their joint prior distribution, i.e. f(θ1, ..., θJ) is invariant to permutations of the indices (1, ..., J).

Assumptions:
- θ1, ..., θ70, θ71 can be considered a random sample from a common distribution,
- no time trend.

Assume the simplest form of exchangeability: the θj are iid given some unknown parameter φ:

f(θ1, ..., θJ | φ) = ∏_{j=1}^J f(θj | φ).

De Finetti's theorem states that as J → ∞, any exchangeable distribution (under certain regularity conditions) can be written in the iid mixture form above.

Key part of hierarchical models:
φ is unknown, has a prior distribution f(φ), and we estimate its posterior distribution after observing the data. We have a parameter vector (θ, φ) with joint prior distribution

f(θ, φ) = f(φ) f(θ | φ).

By integration, the joint (unconditional or marginal) distribution is

f(θ1, ..., θJ) = ∫ [ ∏_{j=1}^J f(θj | φ) ] f(φ) dφ.

The joint posterior distribution is

f(θ, φ | y) ∝ f(y | θ, φ) f(θ, φ) = f(y | θ) f(θ | φ) f(φ).
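To see the mixture form of the exchangeable prior concretely, the R sketch below draws φ from a hyperprior and then the θj iid given φ; marginally the θj are exchangeable but correlated. The Gamma hyperpriors used here are illustrative only, not the ones used later in the rat tumor analysis.

set.seed(1)
J <- 71
one.draw <- function() {
  alpha <- rgamma(1, 2, 1)    # phi = (alpha, beta), illustrative hyperprior
  beta  <- rgamma(1, 10, 1)
  rbeta(J, alpha, beta)       # theta_1, ..., theta_J | phi are iid
}
theta <- t(replicate(2000, one.draw()))   # 2000 draws from the joint prior
cor(theta[, 1], theta[, 2])               # positive: exchangeable, not independent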

Hyperprior Distribution

If little is known about the hyperparameter φ, we can assign a diffuse prior distribution. But we always need to check whether the resulting posterior distribution is proper. In most real problems, there is sufficient substantive knowledge about φ to constrain it to some finite region.

In the rat tumor Example 4.4, we reparametrize to μi = logit(θi), i.e.

θi = exp(μi) / (1 + exp(μi)),
μi ~ N(ν, τ),

and specify the following diffuse hyperprior distribution for the mean ν and precision τ (precision parametrization, as in WinBUGS):

ν ~ N(0, 0.001)
τ ~ Gamma(0.001, 0.001)

WinBUGS Code: Rat Tumor Example

# rat example
model
{ for (i in 1:71){
    y[i] ~ dbin(theta[i],n[i])
    theta[i] <- exp(mu[i])/(1+exp(mu[i]))
    mu[i] ~ dnorm(nu,tau)
    r[i] <- y[i]/n[i]
  }
  nu ~ dnorm(0.0,0.001)
  tau ~ dgamma(0.001,0.001)
  mtheta <- exp(nu)/(1+exp(nu))
}
#inits
list(nu=0,tau=1)

WinBUGS Output: Rat Tumor Example

Based on 10,000 iterations and a burn-in of 10,000:

node       mean     sd       MC error  2.5%     median   97.5%
mtheta     0.1261   0.01336  3.035E-4  0.1002   0.126    0.1526
nu        -1.941    0.1224   0.002774  -2.195   -1.937   -1.715
tau        2.399    1.134    0.03409   1.052    2.184    4.891
theta[71]  0.2059   0.077    7.983E-4  0.0827   0.1965   0.3825

From the boxplot and the "model fit" plot of the θj estimates against the sample proportions rj, we see that the rates θj are shrunk from their sample point estimates rj = yj/nj towards the population distribution with mean 0.126. Experiments with fewer observations are shrunk more and have higher posterior variances. In contrast to the model with fixed prior parameters, this full Bayesian hierarchical analysis has taken the uncertainty in the hyperparameters into account.

Figure 16: Boxplots for rat tumor rates theta[1]-theta[71].

Figure 17: Model fit for rat tumor rates (posterior estimates against sample proportions).

Hierarchical Models: Pump Failure Example

Example 4.5
George et al. (1993) discuss Bayesian analysis of hierarchical models. The example they consider relates to 10 power plant pumps. The data are given in the following table: the number of failures xi and the length of operation time ti (in thousands of hours) for each pump.

Pump   ti       xi
1       94.50    5
2       15.70    1
3       62.90    5
4      126.00   14
5        5.24    3
6       31.40   19
7        1.05    1
8        1.05    1
9        2.10    4
10      10.50   22

Hierarchical Models: Pump Failure Example

The number of failures Xi is assumed to follow a Poisson distribution:

Xi | θi ~ Poisson(θi ti),   i = 1, ..., 10,

where θi denotes the failure rate for pump i. Assuming that the failure rates of the pumps are related, we specify a hierarchical Bayesian model and a conjugate prior distribution for θi:

θi ~ Gamma(α, β),   i = 1, ..., 10.

We have insufficient information about the pump failure rates to specify values for α and β, but want the data to inform us about these. We specify a hyperprior distribution using substantive knowledge:

α ~ Exponential(1.0)
β ~ Gamma(0.1, 1.0)

WinBUGS Code: Pump Failure Example

model
{
  for (i in 1 : N) {
    theta[i] ~ dgamma(alpha, beta)
    lambda[i] <- theta[i] * t[i]
    x[i] ~ dpois(lambda[i])
  }
  alpha ~ dexp(1)
  beta ~ dgamma(0.1, 1.0)
}
list(t=c(94.3,15.7,62.9,126,5.24,31.4,1.05,1.05,2.1,10.5),
     x=c(5,1,5,14,3,19,1,1,4,22), N=10) #data
list(alpha = 1, beta = 1) #inits

WinBUGS Output: Pump Failure Example

Based on 5,000 iterations and a burn-in of 1,000:

node       mean     sd       MC error  2.5%     median   97.5%
alpha      0.6874   0.2723   0.007535  0.2806   0.6456   1.338
beta       0.9126   0.5411   0.01506   0.1771   0.8161   2.222
theta[1]   0.0599   0.02496  3.49E-4   0.02099  0.05683  0.1184
theta[2]   0.1012   0.07978  0.001012  0.00801  0.08247  0.3089
theta[3]   0.08922  0.03818  5.284E-4  0.03137  0.08349  0.1786
theta[4]   0.1148   0.03023  3.901E-4  0.06324  0.1121   0.1829
theta[5]   0.5964   0.3127   0.004145  0.1508   0.5445   1.338
theta[6]   0.6067   0.137    0.001753  0.3761   0.595    0.9082
theta[7]   0.9106   0.7541   0.01089   0.07487  0.7165   2.845
theta[8]   0.8997   0.7396   0.01236   0.07952  0.7016   2.732
theta[9]   1.599    0.7679   0.01115   0.4925   1.467    3.444
theta[10]  1.995    0.4327   0.00605   1.254    1.966    2.917

MLE: Pump Failure Example

To compare the results to maximum likelihood estimates (MLEs) for the individual pump failure rates, we write down the (log)likelihood:

f(xi | θi) = (θi ti)^xi exp(-θi ti) / xi!
log f(xi | θi) = xi log(θi ti) - θi ti + const.

Setting the first derivative to 0 and solving w.r.t. θi gives

θ̂i = xi / ti.
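As a quick illustration, the R sketch below computes the MLEs xi/ti and places them next to the hierarchical posterior means copied from the WinBUGS table above; this reproduces the shrinkage comparison on the next slide (object names are mine).

t.i   <- c(94.5, 15.7, 62.9, 126, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5)
x.i   <- c(5, 1, 5, 14, 3, 19, 1, 1, 4, 22)
mle   <- x.i / t.i                       # MLE of each failure rate
bayes <- c(0.0599, 0.1012, 0.08922, 0.1148, 0.5964,
           0.6067, 0.9106, 0.8997, 1.599, 1.995)   # posterior means from above
round(cbind(hours = t.i, failures = x.i, MLE = mle, Bayesian = bayes), 4)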

MLE Comparison: Pump Failure Example

The following table compares MLEs and Bayesian estimates:

hours    failures  MLE     Bayesian
 94.50    5        0.0530  0.0599
 15.70    1        0.0637  0.1012
 62.90    5        0.0795  0.08922
126.00   14        0.1111  0.1148
  5.24    3        0.5725  0.5964
 31.40   19        0.6051  0.6067
  1.05    1        0.9524  0.9106
  1.05    1        0.9524  0.8997
  2.10    4        1.9048  1.599
 10.50   22        2.0952  1.995

Remarks: Pump Failure Example

- Individual estimates are "shrunk" away from the MLE toward a common mean.
- Individual estimates "borrow strength" from the rest of the data.
- θi's for observations with large "sample size" (operation time) are shrunk less than θi's for other observations.
- θi's far from the common mean (0.7389) are shrunk more than those near it.

Boxplot and Model Fit: Pump Failure Example

Figure 18: Boxplots for pump failure rates theta[1]-theta[10].

Figure 19: Model fit for pump failure rates.

4.8 Survival Analysis

Survival Analysis

Survival analysis refers to a class of statistical models used to analyse the duration of time until an event of interest (such as death, tumor occurrence, component failure) occurs. Time-to-event data arise in many disciplines, including medicine, biology, engineering, epidemiology and economics. Frequentist textbooks include Cox and Oakes (1984) and Klein and Moeschberger (1997); a comprehensive Bayesian perspective is given in Ibrahim, Chen and Sinha (2001).

As duration times are non-negative, only non-negative random variables can be used to model survival times. Failure time data are often censored, i.e. incomplete, in that one knows that a patient survived the study end point, but one does not know the exact time of death.

Hazard Function

Let T be a continuous nonnegative random variable representing the duration time until a certain event occurs. Let f(t) denote the pdf and F(t) the cdf of T. Let S(t) = 1 - F(t) = P(T > t) be the survival function, which provides the probability of surviving beyond timepoint t.

Definition 4.6
The hazard function is defined as

h(t) = lim_{Δt → 0} P(t < T ≤ t + Δt | T > t) / Δt = f(t)/S(t) = -S'(t)/S(t)

and can be interpreted as the instantaneous death (or event) rate of an individual, provided that this person survived until time t. In particular, h(t)Δt is the approximate probability of failure in [t, t + Δt), given survival up to time t.

In survival analysis, we are less interested in the mean of the distribution than in the hazard function.

Since f(t) = -dS(t)/dt, Definition 4.6 implies that

h(t) = -(d/dt) log S(t).                                   (4.1)

Integrating both sides of (4.1) and then exponentiating yields

S(t) = exp(-∫_0^t h(u) du).                                (4.2)

Thus the hazard function has the properties

h(t) ≥ 0   and   ∫_0^∞ h(t) dt = ∞.

The cumulative hazard H(t) is defined as

H(t) = ∫_0^t h(u) du,

so S(t) = exp(-H(t)). Since S(∞) = 0, H(∞) = ∞.

Finally, it follows from Definition 4.6 and (4.1) that

f(t) = h(t) exp(-∫_0^t h(u) du).                           (4.3)

Example: Weibull Distribution

Suppose T has pdf

f(t) = ρ λ t^(ρ-1) exp(-λ t^ρ)   for t > 0, ρ > 0, λ > 0,
f(t) = 0                         otherwise.

This is a Weibull distribution with parameters (ρ, λ). It follows easily from the equations above that

S(t) = exp(-λ t^ρ),   h(t) = ρ λ t^(ρ-1),   H(t) = λ t^ρ.

Proportional Hazards Models

The hazard function depends in general on both time and a set of covariates. The proportional hazards model (Cox, 1972) separates these components by specifying that the hazard at time t for an individual whose covariate vector is x is given by

h(t, x) = h0(t) exp{G(x, β)},

where h0(t) is called the baseline hazard function and β is a vector of regression coefficients. The second term is written in exponential form because it must be positive.

The ratio of hazards for two individuals is constant over time. Often, the effect of the covariates is assumed to be multiplicative, leading to the hazard function

h(t, x) = h0(t) exp(x'β),

where η = x'β is called the linear predictor. Thus the ratio of hazards for two individuals depends on the difference between their linear predictors at any time.
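As a rough numerical illustration, the R sketch below codes the Weibull(ρ, λ) quantities from the example above and a proportional-hazards version h(t, x) = h0(t) exp(x'β). Parameter values are illustrative only, and note that R's own dweibull() uses a different shape/scale parametrization than the (ρ, λ) form used here.

rho <- 1.5; lambda <- 0.2
f0 <- function(t) rho * lambda * t^(rho - 1) * exp(-lambda * t^rho)
S0 <- function(t) exp(-lambda * t^rho)
h0 <- function(t) rho * lambda * t^(rho - 1)

all.equal(h0(2), f0(2) / S0(2))      # h(t) = f(t)/S(t), cf. Definition 4.6

beta <- c(0.5, -1)                   # illustrative regression coefficients
h <- function(t, x) h0(t) * exp(sum(x * beta))
h(2, c(1, 0)) / h(2, c(0, 0))        # hazard ratio = exp(beta_1), constant in t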

Partial Likelihood

Cox's version (Cox, 1975) of the proportional hazards model is semiparametric, as the baseline hazard function h0(t) is not modeled as a parametric function of t.

Assumptions:
- n individuals, of which d have distinct event times and n - d have right-censored survival times,
- no ties, ordered survival times y(1), ..., y(d),
- Rj = set of individuals who are at risk at time y(j), the jth risk set.

Then the partial likelihood is

PL(β) = ∏_{j=1}^d exp(x'(j) β) / Σ_{l ∈ Rj} exp(x'l β).          (4.4)

The partial MLE of β can be obtained by maximizing (4.4) w.r.t. β.

Likelihood under Censoring

Survival data are often right-censored. An observation is said to be right-censored at c if the exact value of the observation is not known, but only that it is greater than c.

Let n be the number of subjects, where individual i has survival time ti and fixed censoring time ci. The ti are iid with pdf f(t). The exact survival time ti of an individual will be observed only if ti ≤ ci. The data can be represented by n pairs of random variables (yi, δi) where

yi = min(ti, ci)

and

δi = 0 if ti ≤ ci,   δi = 1 if ti > ci.

Likelihood under Censoring

The likelihood function for (β, h0(t)) for right-censored data is

L(β, h0(t) | D) = ∏_{i=1}^n f(yi)^(1-δi) S(yi)^δi
                = ∏_{i=1}^n h(yi)^(1-δi) S(yi)^(1-δi) S(yi)^δi
                = ∏_{i=1}^n h(yi)^(1-δi) S(yi)
                = ∏_{i=1}^n h(yi)^(1-δi) exp{-H(yi)}
                = ∏_{i=1}^n [h0(yi) exp(ηi)]^(1-δi) exp{-exp(ηi) H0(yi)},

where the data are D = (n, y, X, δ).

If we assume a parametric model for the baseline hazard, e.g. Weibull(α, 1), and define μi = exp(ηi), then the likelihood above is that of independent censored Weibull(α, μi) distributions.

Censoring in WinBUGS

In WinBUGS, right censoring can be implemented using the command I(a,) (and I(,b) and I(a,b) for left and interval censoring, respectively).

Two variables are required to define the survival times:
- the actual survival times t[i], taking NA values for censored observations, and
- the censoring times t.cen[i], which take the value 0 when actual survival times (deaths) are observed.

For example, the likelihood of a Weibull(rho, gamma) distribution with right-censored data can be expressed as

t[i] ~ dweib(rho,gamma)I(t.cen[i],)

Mice Example in WinBUGS

We will now look at the mice example in WinBUGS Examples Volume 1.
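To make the two-variable convention concrete, the sketch below shows how a small censored dataset would be laid out for WinBUGS (e.g. written from R via R2WinBUGS). The five times are made up purely for illustration: subjects 1, 2 and 4 are observed deaths, subjects 3 and 5 are right-censored.

list(t     = c(12,  9, NA, 27, NA),    # NA where only a censoring time is known
     t.cen = c( 0,  0, 15,  0, 40),    # 0 where the death time is observed
     N     = 5)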

MAC AIDS Trial

Here we come back to the analysis of the controlled clinical AIDS trial discussed in the introduction. Our data arise from a clinical trial comparing two treatments for Mycobacterium avium complex (MAC), a disease common in late-stage HIV-infected persons.

11 clinical centers (units) have enrolled a total of 69 patients in the trial, of which 18 have died. The data have been analysed in Carlin and Hodges (1999) and Cai and Meyer (2011).

For j = 1, ..., ni and i = 1, ..., k let
tij = time to death or censoring,
xij = treatment indicator for subject j in stratum i.

Primary Endpoint Data

Survival times (in half-days) from the MAC treatment trial, given as (treatment, time) per unit, where "+" indicates a censored observation:

Unit A: (1, 74+), (2, 248), (1, 272+), (2, 244)
Unit B: (2, 4+), (1, 156+)
Unit C: (2, 20+)
Unit D: (2, 20+), (2, 64), (2, 88), (2, 148+), (1, 162+), (1, 184+), (1, 188+), (1, 198+), (1, 382+), (1, 436+)
Unit E: (1, 50+), (2, 64+), (2, 82), (1, 186+), (1, 214+), (1, 214), (2, 228+), (2, 262)
Unit F: (1, 6), (2, 16+), (1, 76), (2, 80), (2, 202), (1, 258+), (1, 268+), (2, 368+), (1, 380+), (1, 424+), (2, 428+), (2, 436+)
Unit G: (2, 32+), (1, 64+), (2, 102), (2, 162+), (2, 182+), (1, 364+)
Unit H: (2, 22+), (1, 22+), (1, 74+), (1, 88+), (1, 148+), (2, 162)
Unit I: (2, 8), (2, 16+), (2, 40), (1, 120+), (1, 168+), (2, 174+), (1, 268+), (2, 276), (1, 286+), (1, 366), (2, 396+), (2, 466+), (1, 468+)
Unit J: (1, 18+), (1, 36+), (2, 160+), (2, 254)
Unit K: (1, 28+), (1, 70+), (2, 106+)

Proportional Hazards Model

With proportional hazards and a Weibull baseline hazard, stratum i's hazard is

h(tij) = h0i(tij) exp(β0 + β1 xij) = ρi tij^(ρi - 1) exp(β0 + β1 xij),

where ρi > 0 and β = (β0, β1). As in the mice example,

μij = exp(β0 + β1 xij),

so that

Tij ~ Weibull(ρi, μij).

The ρi allow differing baseline hazards, which are increasing if ρi > 1 and decreasing if ρi < 1. As the strata may be similar, we model the shape parameters as exchangeable, i.e.

ρi ~ iid Gamma(α, α).

Thus the mean of the ρi is one, corresponding to a constant baseline hazard, with variance 1/α. We put a proper but low-information Gamma(3.0, 0.1) prior on α, reflecting a prior guess for the standard deviation of the ρi of 30^(-1/2) ≈ 0.18 and allowing a fairly broad region of values centered around one.

Weibull Prop. Hazards: WinBUGS Code

model{
  for (i in 1 : 69) {
    t[i] ~ dweib(rho[unit[i]], mu[i]) I(t.cen[i], )
    mu[i] <- exp(beta0+beta1*x[i])
  }
  for (k in 1:11){
    rho[k] ~ dgamma(alpha,alpha)
  }
  alpha ~ dgamma(3.0,0.1)
  beta0 ~ dnorm(0.0,0.001)
  beta1 ~ dnorm(0.0,0.001)
  r <- exp(2.0*beta1)
}

WinBUGS Output

Based on 10,000 iterations and a burn-in of 5,000:

node     mean     sd       MC error  2.5%     median   97.5%
alpha    48.45    20.12    0.3892    18.47    45.61    95.32
beta0    -6.788   0.4114   0.01758   -7.626   -6.78    -6.006
beta1    0.5973   0.2805   0.009956  0.06683  0.5894   1.189
r        3.887    2.515    0.08594   1.143    3.251    10.78
rho[1]   1.028    0.1078   0.002538  0.8111   1.029    1.237
rho[2]   0.9848   0.1456   0.003415  0.704    0.9794   1.289
rho[3]   0.972    0.1414   0.002471  0.7016   0.9696   1.255
rho[4]   0.999    0.1108   0.004363  0.7739   1.0      1.214
rho[5]   1.066    0.1024   0.002894  0.8667   1.064    1.273
rho[6]   0.9642   0.08855  0.002924  0.7894   0.9654   1.133
rho[7]   0.9724   0.1169   0.00354   0.748    0.9709   1.204
rho[8]   1.038    0.1273   0.003974  0.7931   1.038    1.296
rho[9]   0.9756   0.09325  0.003106  0.7885   0.9763   1.158
rho[10]  1.008    0.12     0.002795  0.7667   1.006    1.248
rho[11]  0.9616   0.1386   0.003722  0.6873   0.96     1.242

- Units A, E, and H have increasing baseline hazard functions (posterior mean of ρi > 1).
- All other units have roughly constant or decreasing baseline hazard functions (posterior mean of ρi ≤ 1).
- There is a significant treatment effect:
  the 95% CI for β1 does not include 0,
  the 95% CI for r does not include 1.
- The posterior mean of the relative risk is close to the frequentist estimate r̂ = 3.1 for the unstratified Cox proportional hazards model (cf. Introduction).

4.9 State-Space Modelling of Time Series

State-Space Modelling of Time Series

State-space models are among the most powerful tools for dynamic modeling and forecasting of time series and longitudinal data. Overviews can be found in Fahrmeir and Tutz (1994) and Kuensch (2001).

Observation equation:
yt = ht(θt) + vt
gives the conditional distribution of the observation yt at time t given the latent state θt; vt is an error term, e.g. N(0, σ²).

State equation:
θt = gt(θt-1) + ut
gives the Markovian transition from state θt-1 to θt, where ut denotes an error term.

The ability to include knowledge of the system behaviour in the statistical model is largely what makes state-space modeling so attractive for biologists, economists, engineers and physicists.

ML estimation of the unknown parameters and latent states is difficult. The Kalman filter is applicable only for linear Gaussian state-space models; for nonlinear, non-normal state-space models the likelihood function is intractable. For such models, Carlin et al. (1992) suggested the Gibbs sampler for posterior computation.

In the sequel, we will look at examples of state-space models implemented in WinBUGS.

Fisheries Stock Assessment: Data

Yellowfin tuna data from Pella and Tomlinson (1969)

The data available for stock assessment purposes quite often consist of a time series of annual catches Ct, t = 1, ..., N, and relative abundance indices It, t = 1, ..., N, such as research survey catch rates or catch-per-unit-effort (CPUE) indices from commercial fisheries.

For example, the next table gives an historical dataset of catch-effort data of South Atlantic albacore tuna (Thunnus alalunga) from 1967 to 1989. Catch is in thousands of tons and CPUE in kg/100 hooks.

Year (t)  Catch (Ct)  CPUE (It)
1967      15.9        61.89
1968      25.7        78.98
1969      28.5        55.59
1970      23.7        44.61
1971      25.0        56.89
...       ...         ...
1987      37.5        23.36
1988      25.9        22.36
1989      25.3        21.91

Fisheries Stock Assessment: Objectives

Age-composition data are not available for this stock. This dataset has previously been analysed by Polacheck et al. (1993).

Objectives: estimation of
- the size of the stock at the end of 1989,
- the maximum surplus production (MSP),
- the biomass at which MSP occurs (B_MSP),
- the optimal effort (E_MSP), the level of commercial fishing effort required to harvest MSP when the stock is at B_MSP.

When only catch-effort data are available, biomass dynamics models are the primary assessment tools for many fisheries (Hilborn and Walters 1992).

Fisheries Stock Assessment: Biomass Dynamics

Biomass dynamics model:

new biomass = old biomass + growth + recruitment - natural mortality - catch

The biomass dynamics equations can be written in the form

Bt = Bt-1 + g(Bt-1) - Ct-1,

where Bt, Ct, and g(Bt) denote the biomass at the start of year t, the catch during year t, and the surplus production function, respectively. g(0) = g(K) = 0, where K is the carrying capacity (the level of the stock biomass at equilibrium prior to commencement of the fishery).

Fisheries Stock Assessment: Surplus Production Model

The Schaefer (1954) form of the surplus production function is

g(Bt-1) = r Bt-1 (1 - Bt-1/K).

Substituting this into the biomass dynamics equation gives a parsimonious model describing the annual biomass transitions with just the two parameters r, the intrinsic growth rate, and K:

Bt = Bt-1 + r Bt-1 (1 - Bt-1/K) - Ct-1.                      (4.5)

Note that the annual catch is treated as a fixed constant.

Fisheries Stock Assessment: Relative Abundance Index

A common, though simplifying, assumption is that the relative abundance index is directly proportional to the biomass, i.e.

It = q Bt                                                     (4.6)

with catchability parameter q.

For the Schaefer surplus production model, the maximum surplus production MSP = rK/4 occurs at B_MSP = K/2. When the biomass indices are CPUEs from commercial fishing, the equation above gives MSP/E_MSP = qK/2, and thereby the optimal effort is E_MSP = r/(2q).
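As a deterministic illustration of (4.5) and (4.6), the R sketch below projects the Schaefer dynamics forward and returns the predicted CPUE and the management quantities MSP, B_MSP and E_MSP. The function name, the parameter values in the example call and the choice B1 = K are all illustrative assumptions; in the stock assessment the catches C would be the albacore catches from the table above.

schaefer <- function(r, K, q, C, B1 = K) {
  N <- length(C) + 1
  B <- numeric(N); B[1] <- B1
  for (t in 2:N)
    B[t] <- B[t-1] + r * B[t-1] * (1 - B[t-1] / K) - C[t-1]   # eq. (4.5)
  list(B = B, I.pred = q * B,                                  # eq. (4.6)
       MSP = r * K / 4, B.MSP = K / 2, E.MSP = r / (2 * q))
}
# e.g. schaefer(r = 0.3, K = 250, q = 0.25, C = c(15.9, 25.7, 28.5, 23.7, 25.0))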

Fisheries Stock Assessment: Process and Observation Error

Polacheck et al. (1993) compare three commonly used statistical techniques for fitting the model defined by equations (4.5) and (4.6): process error models, observation error models, and equilibrium models.

None of these is capable of incorporating the uncertainty present in both equations:
- natural variability underlying the annual biomass dynamics transitions (process error), and
- uncertainty in the observed abundance indices due to measurement and sampling error (observation error).

This is possible, however, using a state-space model. Equations (4.5) and (4.6) are the deterministic versions of the stochastic state and observation equations.

We assumed log-normal error structures. We used a reparametrization (Pt = Bt/K), expressing the annual biomass as a proportion of carrying capacity as in Millar and Meyer (2000), to speed mixing (i.e. sampling over the support of the posterior distribution) of the Gibbs sampler.

Fisheries Stock Assessment: State-Space Model

State equations:
P1 | σ² = e^u1,
Pt | Pt-1, K, r, σ² = (Pt-1 + r Pt-1 (1 - Pt-1) - Ct-1/K) e^ut,   t = 2, ..., N,          (4.7)

Observation equations:
It | Pt, q, τ² = q K Pt e^vt,   t = 1, ..., N,                                            (4.8)

where the ut are iid normal with mean 0 and variance σ², and the vt are iid normal with mean 0 and variance τ².

Fisheries Stock Assessment: Posterior Distribution

A fully Bayesian model consists of the joint prior distribution of all unobservables, here the five parameters K, r, q, σ², τ² and the unknown states P1, ..., PN, and the joint distribution of the observables, here the relative abundance indices I1, ..., IN.

We assume that the parameters K, r, q, σ², τ² are independent a priori. By a successive application of Bayes' theorem and the conditional independence of subsequent states, the joint prior density is given by

p(K, r, q, σ², τ², P1, ..., PN) = p(K) p(r) p(q) p(σ²) p(τ²) p(P1 | σ²) ∏_{t=2}^N p(Pt | Pt-1, K, r, σ²).          (4.9)

Fisheries Stock Assessment: Prior Specification

A noninformative prior is chosen for q: p(q) ∝ 1/q.

Prior distributions for K, r, σ², τ² are specified using biological knowledge and inferences from related species and stocks, as discussed in Millar and Meyer (2000):

K  ~ lognormal(μK = 5.04, σK = 0.5162),
r  ~ lognormal(μr = -1.38, σr = 0.51),
σ² ~ inverse-gamma(3.79, 0.0102),
τ² ~ inverse-gamma(1.71, 0.0086).

Fisheries Stock Assessment: Likelihood

Because of the conditional independence assumption for the relative abundance indices given the unobserved states, the sampling distribution is

p(I1, ..., IN | K, r, q, σ², τ², P1, ..., PN) = ∏_{t=1}^N p(It | Pt, q, τ²).          (4.10)

Then, by Bayes' theorem, the joint posterior distribution of the unobservables given the data is

p(K, r, q, σ², τ², P1, ..., PN | I1, ..., IN)
  ∝ p(K) p(r) p(q) p(σ²) p(τ²) p(P1 | σ²) ∏_{t=2}^N p(Pt | Pt-1, K, r, σ²) ∏_{t=1}^N p(It | Pt, q, τ²).          (4.11)

Fisheries Stock Assessment: WinBUGS Code

model {
# lognormal prior on K
K ~ dlnorm(5.042905,3.7603664)I(10,1000)
# lognormal prior on r
r ~ dlnorm(-1.151293,1.239084233)I(0.005,1.0)
# instead of improper (prop. to 1/q) use just proper IG
iq ~ dgamma(0.001,0.001)I(0.5,200)
q <- 1/iq
# inverse gamma on isigma2
isigma2 ~ dgamma(a0,b0)
sigma2 <- 1/isigma2
# inverse gamma on itau2
itau2 ~ dgamma(c0,d0)
tau2 <- 1/itau2

Pmean[1] <- 0
P[1] ~ dlnorm(Pmean[1],isigma2)I(0.05,1.6)
for (i in 2:N) {
  Pmean[i] <- log(max(P[i-1] + r*P[i-1]*(1-P[i-1]) - C[i-1]/K, 0.01))
  P[i] ~ dlnorm(Pmean[i],isigma2)I(0.05,1.5)
}
for (i in 1:N) {
  Imean[i] <- log(q*K*P[i])
  I[i] ~ dlnorm(Imean[i],itau2)
}
P24 ~ dlnorm(Pmean24, isigma2)I(0.05,1.5)
Pmean24 <- log(max(P[23] + r*P[23]*(1-P[23]) - C[23]/K, 0.01))
MSP <- r*K/4
B_MSP <- K/2
E_MSP <- r/(2*q)
}

Fisheries Stock Assessment: DAG

Figure 20: Representation of the surplus production model as a DAG.

Fisheries Stock Assessment: WinBUGS Output

Based on 100,000 iterations and a burn-in of 100,000:

node    mean      sd        MC error   2.5%      median    97.5%
BMSP    135.5     32.44     1.272      87.2      130.2     212.1
EMSP    0.6154    0.09112   0.001935   0.4346    0.6148    0.8002
K       271.0     64.88     2.544      174.4     260.4     424.2
MSP     19.52     2.537     0.05968    13.9      19.76     23.94
P[1]    1.018     0.05427   8.062E-4   0.919     1.016     1.133
P[2]    0.9944    0.07386   0.001368   0.8737    0.986     1.164
P[3]    0.8772    0.06548   0.001485   0.7616    0.8726    1.019
P[4]    0.7825    0.06205   0.001524   0.6711    0.779     0.9144
P[21]   0.4175    0.03452   8.162E-4   0.3545    0.4156    0.491
P[22]   0.353     0.03519   9.208E-4   0.292     0.35      0.4296
P[23]   0.3271    0.03964   0.00103    0.2573    0.3241    0.4123
P24     0.2964    0.04939   0.001221   0.2093    0.2926    0.4028
q       0.2486    0.06136   0.002411   0.1449    0.244     0.3777
r       0.3088    0.09576   0.003559   0.1416    0.3031    0.5104
sigma2  0.003105  0.001912  2.22E-5    0.001132  0.00261   0.008057
tau2    0.01225   0.004516  2.778E-5   0.005832  0.01145   0.02327

Example: Stochastic Volatility in Financial Time Series

The stochastic volatility (SV) model introduced by Tauchen and Pitts (1983) is used to describe financial time series. It offers an alternative to the ARCH-type models of Engle (1982) for the well-documented time-varying volatility exhibited in many financial time series.

The SV model provides a more realistic and flexible modeling of financial time series than the ARCH-type models, since it essentially involves two noise processes, one for the observations and one for the latent volatilities. The so-called observation errors account for the variability due to measurement and sampling errors, whereas the process errors assess variation in the underlying volatility dynamics.

Classical parameter estimation for SV models is difficult due to the intractable form of the likelihood function. Recently, a variety of frequentist estimation methods have been proposed for the SV model, including Generalized Method of Moments (Melino and Turnbull (1990), Sorenson (2000)), Quasi-Maximum Likelihood (Harvey et al., 1994), Efficient Method of Moments (Gallant et al., 1997), Simulated Maximum Likelihood (Danielsson, 1994, and Sandmann and Koopman, 1998), and approximate Maximum Likelihood (Fridman and Harris, 1998).

Bayesian MCMC procedures for the SV model have been suggested by Jacquier et al. (1994), Shephard and Pitt (1997), Kim et al. (1998) and Meyer and Yu (2000). Here we demonstrate the implementation of the Gibbs sampler in WinBUGS.

Stochastic Volatility: Data

The data consist of a time series of daily Pound/Dollar exchange rates {xt} from 01/10/81 to 28/6/85. The series of interest are the daily mean-corrected returns {yt}, given by the transformation

yt = log xt - log xt-1 - (1/n) Σ_{i=1}^n (log xi - log xi-1),   t = 1, ..., n.

returns.dat
-0.320221363079782
1.46071929942995
-0.408629619810947
1.06096027386685
1.71288920763163
0.404314365893326
-0.905699012715806
...
2.22371628398118

Stochastic Volatility: State-Space Model

The SV model used for analyzing these data can be written in the form of a nonlinear state-space model.

Observation equations:
yt | θt = exp(θt/2) ut,   ut ~ iid N(0, 1),   t = 1, ..., n.          (4.12)

State equations:
θt | θt-1, μ, φ, τ² = μ + φ(θt-1 - μ) + vt,   vt ~ iid N(0, τ²),   t = 1, ..., n,          (4.13)

with θ0 ~ N(μ, τ²).
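To see what data this model generates, the R sketch below simulates one path from (4.12)-(4.13). The series length and the parameter values are illustrative (chosen roughly in the range of the posterior estimates reported later), not estimates from the Pound/Dollar data.

set.seed(1)
n   <- 1000
mu  <- -0.7; phi <- 0.98; tau <- 0.15
theta  <- numeric(n)
theta0 <- rnorm(1, mu, tau)                              # theta_0 ~ N(mu, tau^2)
theta[1] <- mu + phi * (theta0 - mu) + rnorm(1, 0, tau)  # state equation (4.13)
for (t in 2:n)
  theta[t] <- mu + phi * (theta[t-1] - mu) + rnorm(1, 0, tau)
y <- exp(theta / 2) * rnorm(n)                           # observation equation (4.12)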

Stochastic Volatility: Parameters

- θt determines the amount of volatility on day t;
- the value of φ, -1 < φ < 1, measures the autocorrelation present in the logged squared data; thus φ can be interpreted as the persistence in the volatility;
- the constant scaling factor β = exp(μ/2) can be interpreted as the modal volatility; and
- τ as the volatility of the log-volatilities.

Stochastic Volatility: Prior Specification

By successive conditioning, the joint prior density is

p(μ, φ, τ², θ0, θ1, ..., θn) = p(μ, φ, τ²) p(θ0 | μ, τ²) ∏_{t=1}^n p(θt | θt-1, μ, φ, τ²).          (4.14)

- We employ a slightly informative prior for μ: μ ~ N(0, 10).
- We set φ = 2φ* - 1 and specify a Beta(α, β) prior for φ* with α = 20 and β = 1.5, which gives a prior mean for φ of 0.86.
- A conjugate inverse-gamma prior is chosen for τ², i.e. τ² ~ IG(2.5, 0.025), which gives a prior mean of 0.0167 and a prior standard deviation of 0.0236.

Stochastic Volatility: Likelihood

The likelihood p(y1, ..., yn | μ, φ, τ², θ0, ..., θn) is specified by the observation equations (4.12) and the conditional independence assumption:

p(y1, ..., yn | μ, φ, τ², θ0, ..., θn) = ∏_{t=1}^n p(yt | θt).          (4.15)

Stochastic Volatility: Posterior Distribution

Then, by Bayes' theorem, the joint posterior distribution of the unobservables given the data is proportional to the prior times the likelihood, i.e.

p(μ, φ, τ², θ0, ..., θn | y1, ..., yn) ∝ p(μ) p(φ) p(τ²) p(θ0 | μ, τ²) ∏_{t=1}^n p(θt | θt-1, μ, φ, τ²) ∏_{t=1}^n p(yt | θt).          (4.16)

Stochastic Volatility: DAG

Figure 21: Representation of the stochastic volatility model as a DAG.

The solid arrows indicate that, given its parent nodes, each node v is independent of all other nodes except the descendants of v.

For instance, if on day t we know the volatility on day t-1 and the values of the parameters μ, φ, and τ², then our belief about the volatility θt on day t is independent of the volatilities on the previous days 1 to t-2 and of the data of all other days except the current return yt.

Stochastic Volatility: WinBUGS Output

Based on 10,000 iterations and a burn-in of 10,000 (insufficient):

node   mean     sd       MC error  2.5%    median   97.5%
beta   0.7163   0.1244   0.00958   0.5554  0.6925   1.005
mu    -0.6927   0.3074   0.02252  -1.176  -0.735    0.01074
phi    0.9805   0.01081  8.306E-4  0.9552  0.9823   0.9962
tau    0.1493   0.03052  0.002965  0.1033  0.1435   0.2196

Stochastic Volatility: Final Remarks

This example clearly shows the limitations of the WinBUGS software. Generating 1000 MCMC iterations takes several seconds. Due to the high posterior correlation between the parameters, convergence is VERY slow and a huge number of MCMC iterations is required to achieve convergence. This takes almost prohibitively long.

More efficient samplers than the single-update Gibbs sampler can be constructed, either by so-called blocking of parameters, updating a whole parameter vector in one Gibbs sampling step, or by a Metropolis-Hastings algorithm with a multivariate proposal distribution.

4.10 Copulas

Copulas

The study of copulas and their applications in statistics is a rather modern phenomenon, although the concept goes back to Sklar (1959); interest in copulas has been growing over the last 15 years.

What are copulas?
The word copula is a Latin noun that means "a link, tie, bond". In statistics, copulas are functions that join or "couple" multivariate distribution functions to their one-dimensional marginal distribution functions. Or: copulas are multivariate distribution functions whose one-dimensional margins are uniform on the interval (0,1).
An extensive theoretical discussion of copulas can be found in Nelsen (2006).

Applications of Copulas

Copulas are used to
- study scale-free measures of dependence,
- construct families of bivariate/multivariate distributions (as alternatives to the multivariate normal, where the normal distribution does not provide an adequate approximation to many datasets, e.g. lifetime random variables and long-tailed claim variables).

Main applications:
- in financial risk assessment and actuarial analysis (some believe the methodology of applying the Gaussian copula to credit derivatives to be one of the reasons behind the global financial crisis of 2008-2009),
- in engineering for reliability studies,
- in biostatistics/epidemiology to model joint survival times of groups of individuals, e.g. husband and wife, twins, father and son, etc.

Definition of a Copula

Definition 4.7
A copula C(u1, ..., ud) is a multivariate distribution function on the unit hypercube [0, 1]^d with univariate marginal distributions that are all uniform on the interval [0, 1], i.e.

C(u1, ..., ud) = P(U1 ≤ u1, ..., Ud ≤ ud),

where Ui ~ Uniform(0, 1) for i = 1, ..., d. For ease of notation, we assume from now on that d = 2.

Sklar's Theorem (1959)

Theorem 4.8
Let F be a joint distribution function with margins F1 and F2. Then there exists a copula C such that for all x1, x2 in IR

F(x1, x2) = C(F1(x1), F2(x2)).                              (4.17)

If F1 and F2 are continuous, then C is unique. Conversely, if C is a copula and F1 and F2 are distribution functions, then the function F defined by (4.17) is a joint distribution function with margins F1 and F2.

Copula Density

By differentiation, it is easy to show that the density function of a bivariate distribution F(x1, x2) = C(F1(x1), F2(x2)) with marginal densities f1 and f2 is given by

f(x1, x2) = c(F1(x1), F2(x2)) f1(x1) f2(x2),                 (4.18)

where c denotes the copula density of C, i.e.

c(u1, u2) = ∂²C(u1, u2) / (∂u1 ∂u2).

Some Copula Families

Clayton copula:
C(u, v) = max(u^(-θ) + v^(-θ) - 1, 0)^(-1/θ),   θ ∈ [-1, ∞)\{0}

Frank copula:
C(u, v) = -(1/θ) log(1 + (e^(-θu) - 1)(e^(-θv) - 1) / (e^(-θ) - 1)),   θ ∈ (-∞, ∞)\{0}

Gumbel copula:
C(u, v) = u v exp(-θ log u log v),   θ ∈ (0, 1]

Gaussian copula:
C(u, v) = Φρ(Φ^(-1)(u), Φ^(-1)(v)),
where Φρ is the standard bivariate normal distribution function with correlation ρ, and Φ is the standard normal distribution function.
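As a small illustration of (4.18), the R sketch below codes the Clayton copula for θ > 0, a copula density obtained by differentiating C(u, v) twice (this analytic form is my own derivation, checked here against a finite-difference approximation), and a bivariate density with Exponential margins built from it. The parameter and rate values are illustrative only.

theta <- 2
C.clayton <- function(u, v) (u^(-theta) + v^(-theta) - 1)^(-1/theta)
c.clayton <- function(u, v)
  (1 + theta) * (u * v)^(-(theta + 1)) * (u^(-theta) + v^(-theta) - 1)^(-1/theta - 2)

# finite-difference check of the mixed partial derivative at one point
u <- 0.4; v <- 0.7; eps <- 1e-4
num <- (C.clayton(u+eps, v+eps) - C.clayton(u+eps, v-eps) -
        C.clayton(u-eps, v+eps) + C.clayton(u-eps, v-eps)) / (4 * eps^2)
c(analytic = c.clayton(u, v), numeric = num)

# bivariate density with Exponential(0.5) and Exponential(1) margins, eq. (4.18)
f.biv <- function(x1, x2, l1 = 0.5, l2 = 1)
  c.clayton(pexp(x1, l1), pexp(x2, l2)) * dexp(x1, l1) * dexp(x2, l2)
f.biv(1, 2)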

Dependence Measure: Concordance

Two observations (xi, yi) and (xj, yj) of a random vector (X, Y) are
concordant (discordant) if
xi < xj and yi < yj, or if xi > xj and yi > yj
(xi < xj and yi > yj, or if xi > xj and yi < yj),
or equivalently:
(xi - xj)(yi - yj) > 0    ((xi - xj)(yi - yj) < 0).

Informally, a pair of rvs is concordant if "large" values of one tend to
be associated with "large" values of the other and "small" values of one
with "small" values of the other.

Dependence Measure: Kendall's tau

The sample version of Kendall's tau is defined in terms of concordance
as follows:
Let (xi, yi), i = 1, . . . , n denote a random sample of n observations of
(X, Y). There are n(n-1)/2 distinct pairs (xi, yi) and (xj, yj) of
observations in the sample, and each pair is either concordant or
discordant. Let c denote the number of concordant pairs and d the
number of discordant pairs. Then Kendall's tau is defined as

τ = (c - d) / [n(n-1)/2] = (c - d)/(c + d).

The population version of Kendall's tau is defined as the probability of
concordance minus the probability of discordance:

τ = P[(X1 - X2)(Y1 - Y2) > 0] - P[(X1 - X2)(Y1 - Y2) < 0]
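A minimal base-R sketch of the sample version (the toy data are purely
illustrative): count concordant minus discordant pairs directly and compare
with R's built-in Kendall estimate.

# Toy data (illustrative only, no ties)
x <- c(1.2, 3.4, 2.2, 5.0, 4.1)
y <- c(0.8, 2.9, 3.5, 4.7, 3.3)

kendall_tau <- function(x, y) {
  n <- length(x)
  s <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      s <- s + sign((x[i] - x[j]) * (y[i] - y[j]))  # +1 concordant, -1 discordant
    }
  }
  s / (n * (n - 1) / 2)   # (c - d) / number of pairs
}

kendall_tau(x, y)
cor(x, y, method = "kendall")   # should agree in the absence of ties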

Relationship: Kendall's tau and copula parameter

We have the following functional relationships between Kendall's tau
and the parameters of the copula families above:

Clayton:  τ = α / (2 + α)

Frank:    τ = 1 - (4/α) [ 1 - (1/α) ∫_0^α t/(e^t - 1) dt ]

Gumbel:   τ = 1 - 1/α

Gauss:    τ = (2/π) arcsin(ρ)

Parameter Estimation

Flexible multivariate distributions can be constructed with
pre-specified, discrete and/or continuous marginal distributions and a
copula function that represents the desired dependence structure. The
joint distribution is usually estimated by a standard two-step procedure:
I the marginals are approximated by their empirical distribution, or the
parameters of the marginals are estimated via ML,
I the parameters in the copula function are estimated by maximum
likelihood conditional on the parameter estimates from the first step.

Here, we propose to estimate all parameters of the marginal distributions
and the copula jointly, using a Bayesian approach implemented in
WinBUGS, as in Kelly (2007).

Simulation Study

We use the copula package in R to simulate N = 500 bivariate failure
times from a Clayton copula with Exponential(λi) marginal distributions
and a Kendall's tau value of 0.8 (as a measure of the association
between the failure times). The rates for the marginal distributions are
λ1 = λ2 = 0.0001.
We use R2WinBUGS to sample from the posterior distribution of the
unknown parameters. We use an approximate Jeffreys prior for the rates
of the Exponential distributions, i.e. λi ∼ Gamma(0.001, 0.001), and we
assume a Uniform(0,100) prior for α (based, for instance, on a priori
information that the association between failure times is positive and
won't exceed 0.98).
To specify the likelihood, we need to calculate the density of the
multivariate distribution first, using (4.18). Exercise!

Simulation Study: R2WinBUGS Code

library(copula)
library(R2WinBUGS)
p <- 2                   # copula dimension
tau <- 0.8               # value of Kendall's tau
alpha <- 2*tau/(1-tau)   # relationship between tau and alpha for the Clayton copula
c.clayton <- archmCopula(family="clayton", dim=p, param=alpha)

# Marginals are exponential with rates lambda1 and lambda2
lambda1 <- 0.0001
lambda2 <- 0.0001
distr.clayton <- mvdc(c.clayton, margins=rep("exp",p),
    paramMargins = list(list(rate=lambda1), list(rate=lambda2)))
# Draw a random sample of size N
N <- 500
w <- rmvdc(distr.clayton, N)
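As a quick sanity check on the simulated data (assuming the objects created
by the code above), the Clayton parameter implied by tau = 0.8 is
alpha = 2(0.8)/(1 - 0.8) = 8, and the sample Kendall's tau of the simulated
pairs should be close to 0.8:

2 * 0.8 / (1 - 0.8)                       # implied Clayton parameter: 8
cor(w[, 1], w[, 2], method = "kendall")   # should be close to 0.8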

Simulation Study

[Figure 22: Scatterplot of the N = 500 simulated values w[,1] vs. w[,2] from
the Clayton copula with Exp(0.0001) marginals.]

Implementation in WinBUGS: Zeros Trick

If we want to implement parameter estimation of this copula model in
WinBUGS, we face a problem, as copula distributions are not included
in the list of standard distributions implemented in WinBUGS.
Fortunately, we can use the so-called zeros trick to specify a new
sampling distribution. An observation yi with new sampling distribution
f(yi|θ) contributes a likelihood term L(i) = f(yi|θ). Let l(i) = log L(i);
then the model likelihood can be written as

f(y1, . . . , yn|θ) = ∏_{i=1}^n f(yi|θ) = ∏_{i=1}^n e^{l(i)} = ∏_{i=1}^n (-l(i))^0 / 0! · e^{-(-l(i))},

i.e. the product of densities of Poisson random variables with
mean -l(i) and all observations equal to zero.

To ensure that the Poisson means are all positive, we may have to add
a positive constant C to each -l(i). This is equivalent to multiplying
the likelihood by the constant term e^{-nC}. With this approach, the
likelihood (up to that constant) can be written as the product of Poisson
likelihoods with observations all equal to zero:

f(y|θ) ∝ ∏_{i=1}^n (-l(i) + C)^0 / 0! · e^{-(-l(i)+C)} = ∏_{i=1}^n fPoisson(0 | -l(i) + C)

Generic WinBUGS code:

C <- 10000
for (i in 1:n){
  zeros[i] <- 0
  zeros[i] ~ dpois(zeros.mean[i])
  zeros.mean[i] <- -l[i] + C
  l[i] <- ... # expression of log-likelihood for obs. i
}

Implementation in WinBUGS: Ones Trick

As an alternative to the zeros trick, the Bernoulli distribution can be
used. The likelihood can be written as

f(y1, . . . , yn|θ) = ∏_{i=1}^n (e^{l(i)})^1 (1 - e^{l(i)})^0 = ∏_{i=1}^n fBernoulli(1 | e^{l(i)}),

i.e. the product of Bernoulli densities with success probability e^{l(i)}
and all observations equal to 1.
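A small numerical illustration of the identity behind the zeros trick, in
plain R rather than WinBUGS (the values of l and C are arbitrary, and C is
kept small here only to avoid underflow): the Poisson density at zero with
mean -l + C equals the likelihood contribution e^l scaled by the constant
e^(-C).

l <- -3.7   # some log-likelihood value
C <- 10     # offset making the Poisson mean positive
dpois(0, lambda = -l + C)   # Poisson(-l + C) density at 0
exp(l - C)                  # e^l * e^(-C): identical value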

Implementation in WinBUGS: Ones Trick

To ensure that the success probability is less than 1, we multiply each
likelihood term by e^{-C}, where C is a large positive constant. Then the
joint likelihood becomes:

f(y|θ) ∝ ∏_{i=1}^n (e^{l(i)-C})^1 (1 - e^{l(i)-C})^0 = ∏_{i=1}^n fBernoulli(1 | e^{l(i)-C})

Generic WinBUGS code:

C <- 100
for (i in 1:n){
  ones[i] <- 1
  ones[i] ~ dbern(ones.p[i])
  ones.p[i] <- exp(l[i] - C)
  l[i] <- ... # expression of log-likelihood for obs. i
}

Simulation Study: R2WinBUGS Code

# Call WinBUGS
data = list(N=500, x=w[,1], y=w[,2])
inits = list(list(lambda1=0.001, lambda2=0.002, alpha=5))
parameters = c("lambda1", "lambda2", "alpha")
clayton.sim <- bugs(data, inits, parameters.to.save=parameters,
    model.file="model_clayton.odc", n.chains=1,
    n.iter=2000, n.burnin=1000, working.directory=getwd())

This performs 2000 iterations of the Gibbs sampler with a burn-in
period of 1000 and monitors the values of the three model parameters.
The WinBUGS code in model_clayton.odc is:

Simulation Study: WinBUGS Code

model
{
  lambda1 ~ dgamma(0.001,0.001)   # Jeffreys prior
  lambda2 ~ dgamma(0.001,0.001)   # Jeffreys prior
  alpha ~ dunif(0,100)            # Uniform prior on alpha
  # likelihood specification using zeros trick
  C <- 10000
  for(i in 1:N) {
    zeros[i] <- 0
    zeros[i] ~ dpois(mu[i])
    mu[i] <- -l[i] + C
    u[i] <- 1 - exp(-lambda1*x[i])   # F1(x[i])
    v[i] <- 1 - exp(-lambda2*y[i])   # F2(y[i])
    # log of the Clayton copula density times the two exponential densities, cf. (4.18)
    l[i] <- log((1+alpha)*
      pow(pow(u[i],-alpha)+pow(v[i],-alpha)-1, -1/alpha-2)
      *pow(u[i],-alpha-1)*pow(v[i],-alpha-1)*
      lambda1*exp(-lambda1*x[i])*lambda2*exp(-lambda2*y[i]))
  }
}

Simulation Study: WinBUGS Output

Based on 1,000 iterations after a burn-in of 1,000:

node       mean       sd         MC error   2.5%       median     97.5%
alpha      8.001      0.3863     0.02022    7.279      8.007      8.789
deviance   1.002E+7   2.507      0.1517     1.002E+7   1.002E+7   1.002E+7
lambda1    9.434E-5   3.815E-6   4.306E-7   8.75E-5    9.401E-5   1.018E-4
lambda2    9.415E-5   3.813E-6   4.298E-7   8.723E-5   9.383E-5   1.017E-4

The posterior means are close to the true values used to simulate the data:
alpha = 2(0.8)/(1 - 0.8) = 8 and lambda1 = lambda2 = 0.0001.
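Assuming the bugs() call above returned a standard R2WinBUGS object,
essentially the same posterior summaries can also be inspected directly
from R, for instance:

print(clayton.sim)    # posterior summary table for the monitored parameters
clayton.sim$summary   # the same summaries as a numeric matrix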

5 References

Aitkin, M. (1997), The calibration of P-values, posterior Bayes factors and the AIC from the posterior distribution of the likelihood, Statistics and Computing 7, 253-272.

Aitkin, M. (2010), Statistical Inference: An Integrated Bayesian/Likelihood Approach, Chapman & Hall, Cambridge, UK.

Albert, J.H. (2007), Bayesian Computation with R, Springer, New York.

Bellhouse, D.R. (2004), The Reverend Thomas Bayes, FRS: A Biography to Celebrate the Tercentenary of his Birth, Statistical Science 19, 3-43.

Berger, J.O. and Wolpert, R.L. (1988), The Likelihood Principle, Hayward, CA.

Bernardo, J. and Smith, A. (1994), Bayesian Theory, Wiley, Chichester, UK.

Bolstad, W.M. (2004), Introduction to Bayesian Statistics, John Wiley & Sons.

Borel, E. (1921), La Theorie du jeu et les Equations Integrales a Noyau Symetrique, Comptes Rendus de l'Academie des Sciences 173, 1304-1308.

Cai, B. and Meyer, R. (2011), Bayesian semiparametric modeling of survival data based on mixtures of B-spline distributions, Computational Statistics and Data Analysis, to appear.

Carlin, B.P., Polson, N.G. and Stoffer, D.S. (1992), A Monte Carlo approach to nonnormal and nonlinear state-space modeling, Journal of the American Statistical Association 87, 493-500.

Carlin, B.P. and Louis, T.A. (2008), Bayesian Methods for Data Analysis, Chapman & Hall.

Carlin, B.P. and Hodges, J.S. (1999), Hierarchical Proportional Hazards Regression Models for Highly Stratified Data, Biometrics 55, 1162-1170.

Cox, D.R. (1972), Regression models and life tables, Journal of the Royal Statistical Society B 34, 187-220.

Cox, D.R. (1975), Partial Likelihood, Biometrika 62, 269-276.

Cox, D.R. and Oakes, D. (1984), Analysis of Survival Data, Chapman & Hall, London.

Dempster, A.P. (1974), The direct use of likelihood for significance testing, in (Barndorff-Nielsen et al., eds.) Proceedings of the Conference on the Foundational Questions of Statistical Inference, 335-352; reprinted in Statistics and Computing 7, 247-252 (1997).

Dey, D., Ghosh, S. and Mallick, B. (2000), Generalized Linear Models: A Bayesian Perspective, Marcel Dekker, New York.

Efron, B. (2005), Bayesians, Frequentists, and Scientists, Journal of the American Statistical Association 100.

Fahrmeir, L. and Tutz, G. (2001), Multivariate Statistical Modelling Based on Generalized Linear Models, Springer Series in Statistics, Springer-Verlag, New York.

Fisher, R.A. (1922), On the interpretation of chi-square from contingency tables and the calculation of P, Journal of the Royal Statistical Society B 85, 87-94.

Gelfand, A., Dey, D. and Chang, H. (1992), Model determination using predictive distributions with implementation via sampling-based methods, in (Bernardo et al., eds.) Bayesian Statistics 4, Oxford University Press, 407-425.

Gelman, A., Carlin, J., Stern, H. and Rubin, D. (2004), Bayesian Data Analysis, Texts in Statistical Science, 2nd ed., Chapman & Hall, London.

Gelman, A. and Meng, X.L. (1996), Model checking and model improvement, in (Gilks et al., eds.) Markov Chain Monte Carlo in Practice, Chapman & Hall, UK, 189-201.

Geman, S. and Geman, D. (1984), Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images, IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721-741.

George, E.I., Makov, U.E. and Smith, A.F.M. (1993), Conjugate Likelihood Distributions, Scandinavian Journal of Statistics 20, 147-156.

Gilks, W., Richardson, S. and Spiegelhalter, D. (1996), Markov Chain Monte Carlo in Practice, Chapman & Hall, Cambridge, UK.

Ibrahim, J.G., Chen, M-H. and Sinha, D. (2001), Bayesian Survival Analysis, Springer, New York.

Jeffreys, H. (1939), Theory of Probability, Oxford University Press, Oxford.

Jeffreys, H. (1961), Theory of Probability, 3rd edition, Oxford University Press, Oxford.

Kelly, D.L. (2007), Using Copulas to Model Dependence in Simulation Risk Assessment, Proceedings of the International Mechanical Engineering Congress and Exposition, IMECE2007-41284.

Keynes, J.M. (1922), A Treatise on Probability, Volume 8, St Martin's.

Klein, J.P. and Moeschberger, M.L. (1997), Survival Analysis, Springer, New York.

Kuensch, H.R. (2001), State space and hidden Markov models, in Barndorff-Nielsen et al. (eds.), Complex Stochastic Systems, Chapman & Hall, London, 109-174.

Lawless, J.F. (1982), Statistical Models and Methods for Lifetime Data, Wiley, New York.

McCarthy, M.A. (2007), Bayesian Methods for Ecology, Cambridge University Press.

McCullagh, P. and Nelder, J. (1989), Generalized Linear Models, Chapman & Hall, Cambridge, UK.

Meyer, R. and Yu, J. (2000), BUGS for a Bayesian analysis of stochastic volatility models, Econometrics Journal 3, 198-215.

Millar, R.B. and Meyer, R. (2000), State-Space Modeling of Non-Linear Fisheries Biomass Dynamics Using the Gibbs Sampler, Applied Statistics 49, 327-342.

Nelsen, R.B. (2006), An Introduction to Copulas, Springer, New York.

Ntzoufras, I. (2009), Bayesian Modeling Using WinBUGS, John Wiley & Sons, Inc.

Raiffa, H. and Schlaifer, R., Applied Statistical Decision Theory, MIT Press, Cambridge.

Ramsey, F.P. (1926), Truth and Probability, published in 1931 in The Foundations of Mathematics and Other Logical Essays, Ch. VII, 156-198.

Rubin, D.B. (1984), Bayesianly justifiable and relevant frequency calculations for the applied statistician, Annals of Statistics 12, 1151-1172.

Sklar, A. (1959), Fonctions de repartition a n dimensions et leurs marges, Publ. Inst. Stat. Univ. Paris 8, 229-231.

Spiegelhalter, D.J., Best, N.G., Carlin, B.P. and van der Linde, A. (2002), Bayesian measures of model complexity and model fit, Journal of the Royal Statistical Society B 64, 583-639.
