
Introduction to Bayesian Inference

Tom Loredo
Dept. of Astronomy, Cornell University
http://www.astro.cornell.edu/staff/loredo/bayes/
CASt Summer School June 7, 2007
1 / 111
Outline
1 The Big Picture
2 Foundations: Logic & Probability Theory
3 Inference With Parametric Models
Parameter Estimation
Model Uncertainty
4 Simple Examples
Binary Outcomes
Normal Distribution
Poisson Distribution
5 Application: Extrasolar Planets
6 Probability & Frequency
2 / 111
Outline
1 The Big Picture
2 Foundations: Logic & Probability Theory
3 Inference With Parametric Models
Parameter Estimation
Model Uncertainty
4 Simple Examples
Binary Outcomes
Normal Distribution
Poisson Distribution
5 Application: Extrasolar Planets
6 Probability & Frequency
3 / 111
Scientific Method
Science is more than a body of knowledge; it is a way of thinking.
The method of science, as stodgy and grumpy as it may seem,
is far more important than the findings of science.
Carl Sagan
Scientists argue!
Argument ≡ Collection of statements comprising an act of
reasoning from premises to a conclusion
A key goal of science: Explain or predict quantitative
measurements
Framework: Mathematical modeling
Science uses rational argument to construct and appraise
mathematical models for measurements
4 / 111
Mathematical Models
A model (in physics) is a representation of structure in a physical
system and/or its properties. (Hestenes)
[Diagram: REAL WORLD ↔ MATHEMATICAL WORLD, connected by TRANSLATION & INTERPRETATION]
A model is a surrogate
The model is not the modeled system! It stands in for the system
for a particular purpose, and is subject to revision.
A model is an idealization
A model is a caricature of the system being modeled (Kac). It
focuses on a subset of system properties of interest.
5 / 111
A model is an abstraction
Models identify common features of different things so that general
ideas can be created and applied to different situations.
We seek a mathematical model for quantifying uncertainty; it will share
these characteristics with physical models.
Asides
Theories are frameworks guiding model construction (laws, principles).
Physics as modeling is a leading school of thought in physics education research;
e.g., http://modeling.la.asu.edu/
6 / 111
The Role of Data
Data do not speak for themselves!
We don't just tabulate data, we analyze data.
We gather data so they may speak for or against existing
hypotheses, and guide the formation of new hypotheses.
A key role of data in science is to be among the premises in
scientic arguments.
7 / 111
Data Analysis
Building & Appraising Arguments Using Data
Statistical inference is but one of several interacting modes of
analyzing data.
8 / 111
Bayesian Statistical Inference
A different approach to all statistical inference problems (i.e.,
not just another method in the list: BLUE, maximum
likelihood, χ² testing, ANOVA, survival analysis . . . )
Foundation: Use probability theory to quantify the strength of
arguments (i.e., a more abstract view than restricting PT to
describe variability in repeated random experiments)
Focuses on deriving consequences of modeling assumptions
rather than devising and calibrating procedures
9 / 111
Outline
1 The Big Picture
2 Foundations: Logic & Probability Theory
3 Inference With Parametric Models
Parameter Estimation
Model Uncertainty
4 Simple Examples
Binary Outcomes
Normal Distribution
Poisson Distribution
5 Application: Extrasolar Planets
6 Probability & Frequency
10 / 111
Logic: Some Essentials
Logic can be defined as the analysis and appraisal of arguments
Gensler, Intro to Logic
Build arguments with propositions and logical
operators/connectives
Propositions: Statements that may be true or false
P : Universe can be modeled with ΛCDM
A : Ω_tot ∈ [0.9, 1.1]
B : Λ is not 0
B̄ : not B, i.e., Λ = 0
Connectives:
A ∧ B : A and B are both true
A ∨ B : A or B is true, or both are
11 / 111
Arguments
Argument: Assertion that an hypothesized conclusion, H, follows
from premises, P = {A, B, C, . . .} (take "," = "and")
Notation:
H|P : Premises P imply H
H may be deduced from P
H follows from P
H is true given that P is true
Arguments are (compound) propositions.
Central role of arguments → special terminology:
A true argument is valid
A false argument is invalid or fallacious
12 / 111
Valid vs. Sound Arguments
Content vs. form
An argument is factually correct iff all of its premises are true
(it has good content).
An argument is valid iff its conclusion follows from its
premises (it has good form).
An argument is sound iff it is both factually correct and valid
(it has good form and content).
We want to make sound arguments. Logic and statistical methods
address validity, but there is no formal approach for addressing
factual correctness.
13 / 111
Factual Correctness
Although logic can teach us something about validity and
invalidity, it can teach us very little about factual correctness. The
question of the truth or falsity of individual statements is primarily
the subject matter of the sciences.
Hardegree, Symbolic Logic
To test the truth or falsehood of premisses is the task of
science. . . . But as a matter of fact we are interested in, and must
often depend upon, the correctness of arguments whose premisses
are not known to be true.
Copi, Introduction to Logic
14 / 111
Premises
Facts: Things known to be true, e.g. observed data
Obvious assumptions: Axioms, postulates, e.g., Euclid's
first 4 postulates (line segment b/t 2 points; congruency of
right angles . . . )
Reasonable or working assumptions: E.g., Euclid's fifth
postulate (parallel lines)
Desperate presumption!
Conclusions from other arguments
15 / 111
Deductive and Inductive Inference
Deduction: Syllogism as prototype
Premise 1: A implies H
Premise 2: A is true
Deduction: H is true
H|P is valid
Induction: Analogy as prototype
Premise 1: A, B, C, D, E all share properties x, y, z
Premise 2: F has properties x, y
Induction: F has property z
F has z|P is not valid, but may still be rational (likely,
plausible, probable); some such arguments are stronger than
others
Boolean algebra (and/or/not over {0, 1}) quantifies deduction.
Probability theory generalizes this to quantify the strength of
inductive arguments.
16 / 111
Real Number Representation of Induction
P(H|P) ≡ strength of argument H|P
P = 0      → Argument is invalid
  = 1      → Argument is valid
  ∈ (0, 1) → Degree of deducibility
A mathematical model for induction:
AND (product rule):  P(A ∧ B|P) = P(A|P) P(B|A ∧ P)
                                = P(B|P) P(A|B ∧ P)
OR (sum rule):  P(A ∨ B|P) = P(A|P) + P(B|P) - P(A ∧ B|P)
We will explore the implications of this model.
17 / 111
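A quick numeric check of the two rules, as a sketch: the joint distribution over two propositions below is made up purely for illustration.

```python
# Minimal check of the product and sum rules on an arbitrary joint
# distribution over two propositions A and B (numbers are illustrative).
P_AB = {(True, True): 0.3, (True, False): 0.2,
        (False, True): 0.1, (False, False): 0.4}

P_A = sum(p for (a, b), p in P_AB.items() if a)   # P(A)
P_B = sum(p for (a, b), p in P_AB.items() if b)   # P(B)
P_A_and_B = P_AB[(True, True)]                    # P(A and B)
P_B_given_A = P_A_and_B / P_A                     # P(B | A)
P_A_given_B = P_A_and_B / P_B                     # P(A | B)

# Product rule: P(A and B) = P(A) P(B|A) = P(B) P(A|B)
assert abs(P_A_and_B - P_A * P_B_given_A) < 1e-12
assert abs(P_A_and_B - P_B * P_A_given_B) < 1e-12

# Sum rule: P(A or B) = P(A) + P(B) - P(A and B)
P_A_or_B = sum(p for (a, b), p in P_AB.items() if a or b)
assert abs(P_A_or_B - (P_A + P_B - P_A_and_B)) < 1e-12
```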
Interpreting Bayesian Probabilities
If we like there is no harm in saying that a probability expresses a
degree of reasonable belief. . . . Degree of confirmation has been
used by Carnap, and possibly avoids some confusion. But whatever
verbal expression we use to try to convey the primitive idea, this
expression cannot amount to a definition. Essentially the notion
can only be described by reference to instances where it is used. It
is intended to express a kind of relation between data and
consequence that habitually arises in science and in everyday life,
and the reader should be able to recognize the relation from
examples of the circumstances when it arises.
Sir Harold Jeffreys, Scientific Inference
18 / 111
More On Interpretation
Physics uses words drawn from ordinary language (mass, weight,
momentum, force, temperature, heat, etc.), but their technical meaning
is more abstract than their colloquial meaning. We can map between the
colloquial and abstract meanings associated with specific values by using
specific instances as calibrators.
A Thermal Analogy
Intuitive notion    Quantification      Calibration
Hot, cold           Temperature, T      Cold as ice = 273 K
                                        Boiling hot = 373 K
Uncertainty         Probability, P      Certainty = 0, 1
                                        p = 1/36: plausible as "snake's eyes"
                                        p = 1/1024: plausible as 10 heads
19 / 111
A Bit More On Interpretation
Bayesian
Probability quantifies uncertainty in an inductive inference. p(x)
describes how probability is distributed over the possible values x
might have taken in the single case before us:
[Plot: PDF p(x) vs x; "p is distributed, x has a single, uncertain value"]
Frequentist
Probabilities are always (limiting) rates/proportions/frequencies in
an ensemble. p(x) describes variability, how the values of x are
distributed among the cases in the ensemble:
[Plot: histogram of x values in the ensemble; "x is distributed"]
20 / 111
Arguments Relating
Hypotheses, Data, and Models
We seek to appraise scientic hypotheses in light of observed data
and modeling assumptions.
Consider the data and modeling assumptions to be the premises of
an argument with each of various hypotheses, H_i, as conclusions:
H_i | D_obs, I .   (I = background information, everything deemed
relevant besides the observed data)
P(H_i | D_obs, I) measures the degree to which (D_obs, I) allow one to
deduce H_i. It provides an ordering among arguments for various H_i
that share common premises.
Probability theory tells us how to analyze and appraise the
argument, i.e., how to calculate P(H_i | D_obs, I) from simpler,
hopefully more accessible probabilities.
21 / 111
The Bayesian Recipe
Assess hypotheses by calculating their probabilities p(H_i | . . .)
conditional on known and/or presumed information using the
rules of probability theory.
Probability Theory Axioms:
OR (sum rule):      P(H_1 ∨ H_2 | I) = P(H_1|I) + P(H_2|I)
                                       - P(H_1, H_2|I)
AND (product rule): P(H_1, D|I) = P(H_1|I) P(D|H_1, I)
                                = P(D|I) P(H_1|D, I)
22 / 111
Three Important Theorems
Bayes's Theorem (BT)
Consider P(H_i, D_obs | I) using the product rule:
P(H_i, D_obs | I) = P(H_i|I) P(D_obs|H_i, I)
                  = P(D_obs|I) P(H_i|D_obs, I)
Solve for the posterior probability:
P(H_i|D_obs, I) = P(H_i|I) P(D_obs|H_i, I) / P(D_obs|I)
Theorem holds for any propositions, but for hypotheses &
data the factors have names:
posterior ∝ prior × likelihood
norm. const. P(D_obs|I) = prior predictive
23 / 111
Law of Total Probability (LTP)
Consider exclusive, exhaustive {B_i} (I asserts one of them
must be true),
∑_i P(A, B_i | I) = ∑_i P(B_i|A, I) P(A|I) = P(A|I)
                  = ∑_i P(B_i|I) P(A|B_i, I)
If we do not see how to get P(A|I) directly, we can find a set
{B_i} and use it as a basis: extend the conversation:
P(A|I) = ∑_i P(B_i|I) P(A|B_i, I)
If our problem already has B_i in it, we can use LTP to get
P(A|I) from the joint probabilities (marginalization):
P(A|I) = ∑_i P(A, B_i|I)
24 / 111
Example: Take A = D_obs, B_i = H_i; then
P(D_obs|I) = ∑_i P(D_obs, H_i|I)
           = ∑_i P(H_i|I) P(D_obs|H_i, I)
prior predictive for D_obs = Average likelihood for H_i
(aka marginal likelihood)
Normalization
For exclusive, exhaustive H_i,
∑_i P(H_i | · · ·) = 1
25 / 111
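A minimal sketch of these three results working together on a discrete hypothesis space; the priors and likelihoods below are hypothetical numbers, not from the slides.

```python
import numpy as np

# Three exclusive, exhaustive hypotheses H_i with assumed priors and
# likelihoods for some observed data D_obs (all numbers illustrative).
prior = np.array([0.5, 0.3, 0.2])          # P(H_i | I); sums to 1
likelihood = np.array([0.10, 0.40, 0.70])  # P(D_obs | H_i, I)

# LTP: prior predictive (average likelihood) for the data
prior_predictive = np.sum(prior * likelihood)      # P(D_obs | I)

# Bayes's theorem: posterior = prior * likelihood / prior predictive
posterior = prior * likelihood / prior_predictive

print("P(D_obs|I) =", prior_predictive)
print("P(H_i|D_obs,I) =", posterior, " sum =", posterior.sum())  # normalization
```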
Well-Posed Problems
The rules express desired probabilities in terms of other
probabilities.
To get a numerical value out, at some point we have to put
numerical values in.
Direct probabilities are probabilities with numerical values
determined directly by premises (via modeling assumptions,
symmetry arguments, previous calculations, desperate
presumption . . . ).
An inference problem is well posed only if all the needed
probabilities are assignable based on the premises. We may need to
add new assumptions as we see what needs to be assigned. We
may not be entirely comfortable with what we need to assume!
(Remember Euclid's fifth postulate!)
Should explore how results depend on uncomfortable assumptions
(robustness).
26 / 111
Recap
Bayesian inference is more than BT
Bayesian inference quantifies uncertainty by reporting
probabilities for things we are uncertain of, given specied
premises.
It uses all of probability theory, not just (or even primarily)
Bayes's theorem.
The Rules in Plain English
Ground rule: Specify premises that include everything relevant
that you know or are willing to presume to be true (for the
sake of the argument!).
BT: Make your appraisal account for all of your premises.
Things you know are false must not enter your accounting.
LTP: If the premises allow multiple arguments for a
hypothesis, its appraisal must account for all of them.
Do not just focus on the most or least favorable way a
hypothesis may be realized.
27 / 111
Outline
1 The Big Picture
2 Foundations: Logic & Probability Theory
3 Inference With Parametric Models
Parameter Estimation
Model Uncertainty
4 Simple Examples
Binary Outcomes
Normal Distribution
Poisson Distribution
5 Application: Extrasolar Planets
6 Probability & Frequency
28 / 111
Inference With Parametric Models
Models M_i (i = 1 to N), each with parameters θ_i, each imply a
sampling dist'n (conditional predictive dist'n for possible data):
p(D|θ_i, M_i)
The θ_i dependence when we fix attention on the observed data is
the likelihood function:
L_i(θ_i) ≡ p(D_obs|θ_i, M_i)
We may be uncertain about i (model uncertainty) or θ_i (parameter
uncertainty).
29 / 111
Three Classes of Problems
Parameter Estimation
Premise = choice of model (pick specific i)
What can we say about θ_i?
Model Assessment
Model comparison: Premise = {M_i}
What can we say about i?
Model adequacy/GoF: Premise = M_1 ∨ "all alternatives"
Is M_1 adequate?
Model Averaging
Models share some common params: θ_i = {φ, η_i}
What can we say about φ w/o committing to one model?
(Systematic error is an example)
30 / 111
Parameter Estimation
Problem statement
I = Model M with parameters θ (+ any add'l info)
H_i = statements about θ; e.g. θ ∈ [2.5, 3.5], or θ > 0
Probability for any such statement can be found using a
probability density function (PDF) for θ:
P(θ ∈ [θ, θ + dθ] | · · ·) = f(θ) dθ
                           = p(θ | · · ·) dθ
Posterior probability density
p(θ|D, M) = p(θ|M) L(θ) / ∫ dθ p(θ|M) L(θ)
Summaries of posterior
Best fit values:
Mode, θ̂, maximizes p(θ|D, M)
Posterior mean, ⟨θ⟩ = ∫ dθ θ p(θ|D, M)
Uncertainties:
Credible region Δ of probability C:
C = P(θ ∈ Δ|D, M) = ∫_Δ dθ p(θ|D, M)
Highest Posterior Density (HPD) region has p(θ|D, M) higher
inside than outside
Posterior standard deviation, variance, covariances
Marginal distributions
Interesting parameters ψ, nuisance parameters φ
Marginal dist'n for ψ: p(ψ|D, M) = ∫ dφ p(ψ, φ|D, M)
32 / 111
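A sketch of these summaries computed by brute force on a grid, for an illustrative two-parameter posterior (ψ interesting, φ a nuisance parameter); the correlated Gaussian used here is purely an assumption for the example, and the HPD region is approximated by accumulating the highest-density grid points.

```python
import numpy as np

# Grid-based posterior summaries for an illustrative 2-parameter model.
psi = np.linspace(-5, 5, 401)
phi = np.linspace(-5, 5, 401)
dpsi, dphi = psi[1] - psi[0], phi[1] - phi[0]
PSI, PHI = np.meshgrid(psi, phi, indexing="ij")

# Unnormalized posterior: flat priors times a correlated Gaussian likelihood
logpost = -0.5 * (PSI**2 + PHI**2 - 1.2 * PSI * PHI) / (1 - 0.6**2)
post = np.exp(logpost - logpost.max())
post /= post.sum() * dpsi * dphi          # normalize on the grid

# Marginal distribution for psi: integrate out the nuisance parameter phi
marg = post.sum(axis=1) * dphi

mode = psi[np.argmax(marg)]
mean = np.sum(psi * marg) * dpsi

# Approximate 68.3% HPD region: keep the highest-density grid points
order = np.argsort(marg)[::-1]
keep = order[np.cumsum(marg[order]) * dpsi <= 0.683]
print(f"mode={mode:.3f}  mean={mean:.3f}  "
      f"68.3% HPD approx [{psi[keep].min():.2f}, {psi[keep].max():.2f}]")
```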
Nuisance Parameters and Marginalization
To model most data, we need to introduce parameters besides
those of ultimate interest: nuisance parameters.
Example
We have data from measuring a rate r = s + b that is a sum
of an interesting signal s and a background b.
We have additional data just about b.
What do the data tell us about s?
33 / 111
Marginal posterior distribution
p(s|D, M) = ∫ db p(s, b|D, M)
          ∝ p(s|M) ∫ db p(b|s) L(s, b)
          ≡ p(s|M) L_m(s)
with L_m(s) the marginal likelihood for s. For broad prior,
L_m(s) ≈ p(b̂_s|s) L(s, b̂_s) δb_s
    b̂_s = best b given s
    δb_s = b uncertainty given s
Profile likelihood L_p(s) ≡ L(s, b̂_s) gets weighted by a parameter
space volume factor
E.g., Gaussians: ŝ = r̂ - b̂, σ_s² = σ_r² + σ_b²
Background subtraction is a special case of background marginalization.
34 / 111
Model Comparison
Problem statement
I = (M_1 ∨ M_2 ∨ . . .)   Specify a set of models.
H_i = M_i   Hypothesis chooses a model.
Posterior probability for a model
p(M_i|D, I) = p(M_i|I) p(D|M_i, I) / p(D|I)
            ∝ p(M_i|I) L(M_i)
But L(M_i) = p(D|M_i) = ∫ dθ_i p(θ_i|M_i) p(D|θ_i, M_i).
Likelihood for model = Average likelihood for its parameters
L(M_i) = ⟨L(θ_i)⟩
Varied terminology: Prior predictive = Average likelihood = Global
likelihood = Marginal likelihood = (Weight of) Evidence for model
35 / 111
Odds and Bayes factors
Ratios of probabilities for two propositions using the same premises
are called odds:
O_ij ≡ p(M_i|D, I) / p(M_j|D, I)
     = [p(M_i|I) / p(M_j|I)] × [p(D|M_i, I) / p(D|M_j, I)]
The data-dependent part is called the Bayes factor:
B_ij ≡ p(D|M_i, I) / p(D|M_j, I)
It is a likelihood ratio; the BF terminology is usually reserved for
cases when the likelihoods are marginal/average likelihoods.
36 / 111
An Automatic Occam's Razor
Predictive probabilities can favor simpler models
p(D|M_i) = ∫ dθ_i p(θ_i|M) L(θ_i)
[Figure: P(D|H) vs D for a simple H (narrow, peaked) and a complicated H (broad, low), with the observed D_obs marked]
37 / 111
The Occam Factor
[Figure: prior p(θ) and likelihood L(θ) vs θ, with the likelihood peaking at θ̂ over a width δθ within the broader prior range Δθ]
p(D|M_i) = ∫ dθ_i p(θ_i|M) L(θ_i) ≈ p(θ̂_i|M) L(θ̂_i) δθ_i
         ≈ L(θ̂_i) δθ_i / Δθ_i
         = Maximum Likelihood × Occam Factor
Models with more parameters often make the data more
probable for the best fit
Occam factor penalizes models for "wasted" volume of
parameter space
Quantifies intuition that models shouldn't require fine-tuning
38 / 111
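A small numeric illustration of the Occam factor, as a sketch under assumed numbers: a Gaussian likelihood of width δθ and a flat prior of total width Δθ, comparing the exact marginal likelihood with the "maximum likelihood times Occam factor" approximation.

```python
import numpy as np
from scipy.integrate import quad

# Made-up 1-d example: Gaussian likelihood of width dtheta at theta_hat,
# flat prior p(theta|M) = 1/Delta over a range of total width Delta.
theta_hat, Lmax, dtheta, Delta = 0.3, 5.0, 0.05, 10.0

L = lambda th: Lmax * np.exp(-0.5 * ((th - theta_hat) / dtheta) ** 2)

# Exact marginal likelihood: p(D|M) = integral of p(theta|M) L(theta)
pD, _ = quad(lambda th: (1.0 / Delta) * L(th), -Delta / 2, Delta / 2)

# Occam-factor approximation: L_max * (effective likelihood width / prior width)
approx = Lmax * (np.sqrt(2 * np.pi) * dtheta) / Delta

print(f"exact p(D|M) = {pD:.4f}, L_max * Occam factor = {approx:.4f}")
# The Occam factor (~0.0125 here) penalizes the "wasted" prior range:
# widening Delta shrinks p(D|M) even though L_max is unchanged.
```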
Model Averaging
Problem statement
I = (M_1 ∨ M_2 ∨ . . .)   Specify a set of models
Models all share a set of "interesting" parameters, φ
Each has different set of nuisance parameters η_i (or different
prior info about them)
H_i = statements about φ
Model averaging
Calculate posterior PDF for φ:
p(φ|D, I) = ∑_i p(M_i|D, I) p(φ|D, M_i)
          ∝ ∑_i L(M_i) ∫ dη_i p(φ, η_i|D, M_i)
The model choice is a (discrete) nuisance parameter here.
39 / 111
Theme: Parameter Space Volume
Bayesian calculations sum/integrate over parameter/hypothesis
space!
(Frequentist calculations average over sample space & typically optimize
over parameter space.)
Marginalization weights the profile likelihood by a volume
factor for the nuisance parameters.
Model likelihoods have Occam factors resulting from
parameter space volume factors.
Many virtues of Bayesian methods can be attributed to this
accounting for the size of parameter space. This idea does not
arise naturally in frequentist statistics (but it can be added by
hand).
40 / 111
Roles of the Prior
Prior has two roles
Incorporate any relevant prior information
Convert likelihood from intensity to measure
Accounts for size of hypothesis space
Physical analogy
Heat:         Q = ∫ dV c_v(r) T(r)
Probability:  P ∝ ∫ dθ p(θ|I) L(θ)
Maximum likelihood focuses on the hottest hypotheses.
Bayes focuses on the hypotheses with the most heat.
A high-T region may contain little heat if its c_v is low or if
its volume is small.
A high-L region may contain little probability if its prior is low or if
its volume is small.
41 / 111
Recap of Key Ideas
Probability as generalized logic for appraising arguments
Three theorems: BT, LTP, Normalization
Calculations characterized by parameter space integrals
Credible regions, posterior expectations
Marginalization over nuisance parameters
Occam's razor via marginal likelihoods
42 / 111
Outline
1 The Big Picture
2 Foundations: Logic & Probability Theory
3 Inference With Parametric Models
Parameter Estimation
Model Uncertainty
4 Simple Examples
Binary Outcomes
Normal Distribution
Poisson Distribution
5 Application: Extrasolar Planets
6 Probability & Frequency
43 / 111
Binary Outcomes:
Parameter Estimation
M = Existence of two outcomes, S and F; each trial has same
probability for S or F
H_i = Statements about α, the probability for success on the next
trial; seek p(α|D, M)
D = Sequence of results from N observed trials:
FFSSSSFSSSFS (n = 8 successes in N = 12 trials)
Likelihood:
p(D|α, M) = p(failure|α, M) p(success|α, M) · · ·
          = α^n (1 - α)^(N-n)
          = L(α)
44 / 111
Prior
Starting with no information about α beyond its definition,
use as an "uninformative" prior p(α|M) = 1. Justifications:
Intuition: Don't prefer any interval to any other of same size
Bayes's justification: "Ignorance" means that before doing the
N trials, we have no preference for how many will be successes:
P(n successes|M) = 1/(N + 1)   →   p(α|M) = 1
Consider this a convention: an assumption added to M to
make the problem well posed.
45 / 111
Prior Predictive
p(D|M) = ∫ dα α^n (1 - α)^(N-n)
       = B(n + 1, N - n + 1) = n!(N - n)! / (N + 1)!
A Beta integral, B(a, b) ≡ ∫ dx x^(a-1) (1 - x)^(b-1) = Γ(a)Γ(b)/Γ(a+b).
46 / 111
Posterior
p(α|D, M) = [(N + 1)! / (n!(N - n)!)] α^n (1 - α)^(N-n)
A Beta distribution. Summaries:
Best-fit: α̂ = n/N = 2/3;   ⟨α⟩ = (n + 1)/(N + 2) ≈ 0.64
Uncertainty: σ_α = sqrt[ (n + 1)(N - n + 1) / ((N + 2)²(N + 3)) ] ≈ 0.12
Find credible regions numerically, or with incomplete beta
function
Note that the posterior depends on the data only through n,
not the N binary numbers describing the sequence.
n is a (minimal) Sufficient Statistic.
47 / 111
48 / 111
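The same summaries can be computed directly with scipy, as a sketch; the credible region shown is a central interval rather than an HPD region.

```python
import numpy as np
from scipy import stats

# Posterior for alpha with a flat prior and the slide's data
# (n = 8 successes in N = 12 trials): Beta(n+1, N-n+1).
n, N = 8, 12
post = stats.beta(n + 1, N - n + 1)

mode = n / N                        # = 2/3
mean = (n + 1) / (N + 2)            # ~ 0.64
sd = np.sqrt((n + 1) * (N - n + 1) / ((N + 2) ** 2 * (N + 3)))  # ~ 0.12

# Central 68.3% credible interval via the incomplete-beta (ppf) function
lo, hi = post.ppf(0.1585), post.ppf(0.8415)
print(f"mode={mode:.3f} mean={mean:.3f} sd={sd:.3f} "
      f"68.3% central region=({lo:.3f}, {hi:.3f})")
print("check:", post.mean(), post.std())
```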
Binary Outcomes: Model Comparison
Equal Probabilities?
M_1: α = 1/2
M_2: α ∈ [0, 1] with flat prior.
Maximum Likelihoods
M_1: p(D|M_1) = 1/2^N = 2.44 × 10^-4
M_2: L(α̂) = (2/3)^n (1/3)^(N-n) = 4.82 × 10^-4
p(D|M_1) / p(D|α̂, M_2) = 0.51
Maximum likelihoods favor M_2 (successes more probable than failures).
49 / 111
Bayes Factor (ratio of model likelihoods)
p(D|M_1) = 1/2^N;   and   p(D|M_2) = n!(N - n)! / (N + 1)!
B_12 ≡ p(D|M_1) / p(D|M_2) = (N + 1)! / [n!(N - n)! 2^N] = 1.57
Bayes factor (odds) favors M_1 (equiprobable).
Note that for n = 6, B_12 = 2.93; for this small amount of
data, we can never be very sure results are equiprobable.
If n = 0, B_12 ≈ 1/315; if n = 2, B_12 ≈ 1/4.8; for extreme
data, 12 flips can be enough to lead us to strongly suspect
outcomes have different probabilities.
(Frequentist significance tests can reject null for any sample size.)
50 / 111
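A sketch reproducing these Bayes factors from the expressions above:

```python
from math import factorial

# M1: alpha = 1/2 exactly; M2: alpha in [0,1] with a flat prior.
def B12(n, N):
    """Bayes factor p(D|M1)/p(D|M2) for n successes in N trials."""
    pD_M1 = 0.5 ** N                                            # 1/2^N
    pD_M2 = factorial(n) * factorial(N - n) / factorial(N + 1)  # Beta integral
    return pD_M1 / pD_M2

N = 12
for n in (8, 6, 2, 0):
    print(f"n={n:2d}  B12 = {B12(n, N):.3g}")
# n=8 gives B12 ~ 1.57 (mild preference for equiprobable outcomes);
# n=0 gives B12 ~ 1/315 (strong preference for unequal probabilities).
```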
Binary Outcomes: Binomial Distribution
Suppose D = n (number of heads in N trials), rather than the
actual sequence. What is p(α|n, M)?
Likelihood
Let S = a sequence of flips with n heads.
p(n|α, M) = ∑_S p(S|α, M) p(n|S, α, M)
          = α^n (1 - α)^(N-n) C_(n,N)   [terms with # successes = n]
C_(n,N) = # of sequences of length N with n heads.
p(n|α, M) = [N! / (n!(N - n)!)] α^n (1 - α)^(N-n)
The binomial distribution for n given α, N.
51 / 111
Posterior
p(α|n, M) = [N! / (n!(N - n)!)] α^n (1 - α)^(N-n) / p(n|M)
p(n|M) = [N! / (n!(N - n)!)] ∫ dα α^n (1 - α)^(N-n) = 1/(N + 1)
⇒ p(α|n, M) = [(N + 1)! / (n!(N - n)!)] α^n (1 - α)^(N-n)
Same result as when data specified the actual sequence.
52 / 111
Another Variation: Negative Binomial
Suppose D = N, the number of trials it took to obtain a predefined
number of successes, n = 8. What is p(α|N, M)?
Likelihood
p(N|α, M) is probability for n - 1 successes in N - 1 trials,
times probability that the final trial is a success:
p(N|α, M) = [(N - 1)! / ((n - 1)!(N - n)!)] α^(n-1) (1 - α)^(N-n) × α
          = [(N - 1)! / ((n - 1)!(N - n)!)] α^n (1 - α)^(N-n)
The negative binomial distribution for N given α, n.
53 / 111
Posterior
p(α|D, M) = C'_(n,N) α^n (1 - α)^(N-n) / p(D|M)
p(D|M) = C'_(n,N) ∫ dα α^n (1 - α)^(N-n)
⇒ p(α|D, M) = [(N + 1)! / (n!(N - n)!)] α^n (1 - α)^(N-n)
Same result as other cases.
54 / 111
Final Variation: Meteorological Stopping
Suppose D = (N, n), the number of samples and number of
successes in an observing run whose total number was determined
by the weather at the telescope. What is p(α|D, M')?
(M' adds info about weather to M.)
Likelihood
p(D|α, M') is the binomial distribution times the probability
that the weather allowed N samples, W(N):
p(D|α, M') = W(N) [N! / (n!(N - n)!)] α^n (1 - α)^(N-n)
Let C_(n,N) = W(N) (N choose n). We get the same result as before!
55 / 111
Likelihood Principle
To define L(H_i) ≡ p(D_obs|H_i, I), we must contemplate what other
data we might have obtained. But the "real" sample space may be
determined by many complicated, seemingly irrelevant factors; it
may not be well-specified at all. Should this concern us?
Likelihood principle: The result of inferences depends only on how
p(D_obs|H_i, I) varies w.r.t. hypotheses. We can ignore aspects of the
observing/sampling procedure that do not affect this dependence.
This is a sensible property that frequentist methods do not share.
Frequentist probabilities are "long run" rates of performance, and
thus depend on details of the sample space that may be irrelevant
in a Bayesian calculation.
Example: Predict 10% of sample is Type A; observe n_A = 5 for N = 96
Significance test accepts α = 0.1 for binomial sampling;
p(> χ² | α = 0.1) = 0.12
Significance test rejects α = 0.1 for negative binomial sampling;
p(> χ² | α = 0.1) = 0.03
56 / 111
Inference With Normals/Gaussians
Gaussian PDF
p(x|μ, σ) = [1/(σ√(2π))] e^(-(x-μ)²/2σ²)   over [-∞, ∞]
Common abbreviated notation: x ∼ N(μ, σ²)
Parameters
μ = ⟨x⟩ ≡ ∫ dx x p(x|μ, σ)
σ² = ⟨(x - μ)²⟩ ≡ ∫ dx (x - μ)² p(x|μ, σ)
57 / 111
Gauss's Observation: Sufficiency
Suppose our data consist of N measurements, d_i = μ + ε_i.
Suppose the noise contributions are independent, and
ε_i ∼ N(0, σ²).
p(D|μ, σ, M) = ∏_i p(d_i|μ, σ, M)
             = ∏_i p(ε_i = d_i - μ|μ, σ, M)
             = ∏_i [1/(σ√(2π))] exp[-(d_i - μ)²/(2σ²)]
             = [1/(σ^N (2π)^(N/2))] e^(-Q(μ)/2σ²)
58 / 111
Find dependence of Q on μ by completing the square:
Q = ∑_i (d_i - μ)²
  = ∑_i d_i² + Nμ² - 2Nμ d̄    where d̄ ≡ (1/N) ∑_i d_i
  = N(μ - d̄)² + Nr²           where r² ≡ (1/N) ∑_i (d_i - d̄)²
Likelihood depends on {d_i} only through d̄ and r:
L(μ, σ) = [1/(σ^N (2π)^(N/2))] exp(-Nr²/2σ²) exp(-N(μ - d̄)²/2σ²)
The sample mean and variance are sufficient statistics.
This is a miraculous compression of information: the normal dist'n
is highly abnormal in this respect!
59 / 111
Estimating a Normal Mean
Problem specication
Model: d_i = μ + ε_i, ε_i ∼ N(0, σ²), σ is known → I = (σ, M).
Parameter space: μ; seek p(μ|D, σ, M)
Likelihood
p(D|μ, σ, M) = [1/(σ^N (2π)^(N/2))] exp(-Nr²/2σ²) exp(-N(μ - d̄)²/2σ²)
             ∝ exp(-N(μ - d̄)²/2σ²)
60 / 111
Uninformative prior
Translation invariance ⇒ p(μ) ∝ C, a constant.
This prior is improper unless bounded.
Prior predictive/normalization
p(D|σ, M) = ∫ dμ C exp(-N(μ - d̄)²/2σ²)
          = C (σ/√N) √(2π)
. . . minus a tiny bit from tails, using a proper prior.
61 / 111
Posterior
p(μ|D, σ, M) = [1/((σ/√N)√(2π))] exp(-N(μ - d̄)²/2σ²)
Posterior is N(d̄, w²), with standard deviation w = σ/√N.
68.3% HPD credible region for μ is d̄ ± σ/√N.
Note that C drops out → limit of infinite prior range is well
behaved.
62 / 111
Informative Conjugate Prior
Use a normal prior, μ ∼ N(μ_0, w_0²)
Posterior
Normal N(μ̃, w̃²), but mean, std. deviation "shrink" towards
prior.
Define B = w²/(w² + w_0²), so B < 1 and B → 0 when w_0 is large.
Then
μ̃ = (1 - B) d̄ + B μ_0
w̃ = w √(1 - B)
"Principle of stable estimation": The prior affects estimates
only when data are not informative relative to prior.
63 / 111
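A short sketch of this update rule; the data summary (d̄, σ, N) and the prior mean and width below are made-up numbers for illustration.

```python
import numpy as np

# Conjugate-normal update: data mean dbar with standard error w = sigma/sqrt(N),
# combined with a N(mu0, w0^2) prior (all numbers illustrative).
dbar, sigma, N = 5.2, 2.0, 16
mu0, w0 = 3.0, 1.0

w = sigma / np.sqrt(N)            # likelihood width for mu
B = w**2 / (w**2 + w0**2)         # shrinkage factor; -> 0 for a broad prior

mu_tilde = (1 - B) * dbar + B * mu0
w_tilde = w * np.sqrt(1 - B)

print(f"B={B:.3f}  posterior mean={mu_tilde:.3f}  posterior sd={w_tilde:.3f}")
# With w0 large (vague prior), B -> 0 and we recover mean dbar, sd sigma/sqrt(N).
```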
Estimating a Normal Mean: Unknown σ
Problem specification
Model: d_i = μ + ε_i, ε_i ∼ N(0, σ²), σ is unknown
Parameter space: (μ, σ); seek p(μ|D, M)
Likelihood
p(D|μ, σ, M) = [1/(σ^N (2π)^(N/2))] exp(-Nr²/2σ²) exp(-N(μ - d̄)²/2σ²)
             ∝ (1/σ^N) e^(-Q/2σ²)
where Q = N[r² + (μ - d̄)²]
64 / 111
Uninformative Priors
Assume priors for μ and σ are independent.
Translation invariance ⇒ p(μ) ∝ C, a constant.
Scale invariance ⇒ p(σ) ∝ 1/σ (flat in log σ).
Joint Posterior for μ, σ
p(μ, σ|D, M) ∝ [1/σ^(N+1)] e^(-Q(μ)/2σ²)
65 / 111
Marginal Posterior
p(μ|D, M) ∝ ∫_0^∞ dσ [1/σ^(N+1)] e^(-Q/2σ²)
Let τ = Q/(2σ²), so σ = √(Q/(2τ)) and |dσ| ∝ τ^(-3/2) √(Q/2) dτ
p(μ|D, M) ∝ 2^(N/2) Q^(-N/2) ∫ dτ τ^(N/2 - 1) e^(-τ)
          ∝ Q^(-N/2)
66 / 111
Write Q = Nr² [1 + ((μ - d̄)/r)²] and normalize:
p(μ|D, M) = [(N/2 - 1)! / ((N/2 - 3/2)! √π r)] × [1 + (1/N) ((μ - d̄)/(r/√N))²]^(-N/2)
Student's t distribution, with t = (μ - d̄)/(r/√N)
A "bell curve," but with power-law tails
Large N:
p(μ|D, M) ≈ e^(-N(μ-d̄)²/2r²)
67 / 111
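A sketch that checks this numerically: σ is integrated out on a grid (with the 1/σ prior) for simulated data, and the result is compared with the Q^(-N/2) form above. The data and grid choices are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
d = rng.normal(5.0, 2.0, size=10)               # simulated data
N, dbar = len(d), d.mean()
r2 = np.mean((d - dbar) ** 2)                   # r^2 as defined on the slide

mu = np.linspace(dbar - 5, dbar + 5, 401)
sigma = np.linspace(0.05, 30, 2000)
dmu, dsig = mu[1] - mu[0], sigma[1] - sigma[0]
MU, SIG = np.meshgrid(mu, sigma, indexing="ij")

Q = N * (r2 + (MU - dbar) ** 2)
integrand = SIG ** -(N + 1) * np.exp(-Q / (2 * SIG**2))  # (1/sigma prior) x likelihood
marg = integrand.sum(axis=1) * dsig
marg /= marg.sum() * dmu

analytic = (1 + (mu - dbar) ** 2 / r2) ** (-N / 2.0)     # proportional to Q^(-N/2)
analytic /= analytic.sum() * dmu

print("max abs difference:", np.abs(marg - analytic).max())  # small
```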
Poisson Dist'n: Infer a Rate from Counts
Problem: Observe n counts in T; infer rate, r
Likelihood
L(r) ≡ p(n|r, M) = [(rT)^n / n!] e^(-rT)
Prior
Two standard choices:
r known to be nonzero; it is a scale parameter:
p(r|M) = [1/ln(r_u/r_l)] (1/r)
r may vanish; require p(n|M) ≈ Const:
p(r|M) = 1/r_u
68 / 111
Prior predictive
p(n|M) = (1/r_u) (1/n!) ∫_0^(r_u) dr (rT)^n e^(-rT)
       = [1/(r_u T)] (1/n!) ∫_0^(r_u T) d(rT) (rT)^n e^(-rT)
       ≈ 1/(r_u T)   for r_u ≫ n/T
Posterior
A gamma distribution:
p(r|n, M) = [T (rT)^n / n!] e^(-rT)
69 / 111
Gamma Distributions
A 2-parameter family of distributions over nonnegative x, with
shape parameter α and scale parameter s:
p_Γ(x|α, s) = [1/(s Γ(α))] (x/s)^(α-1) e^(-x/s)
Moments:
E(x) = sα    Var(x) = s²α
Our posterior corresponds to α = n + 1, s = 1/T.
Mode r̂ = n/T; mean ⟨r⟩ = (n + 1)/T (shift down 1 with 1/r prior)
Std. dev'n σ_r = √(n + 1)/T; credible regions found by integrating (can
use incomplete gamma function)
71 / 111
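A sketch of these posterior summaries with scipy's gamma distribution; n and T below are made-up numbers.

```python
import numpy as np
from scipy import stats

# Flat-prior posterior for a Poisson rate r given n counts in time T:
# Gamma with shape n+1 and scale 1/T (numbers illustrative).
n, T = 16, 4.0
post = stats.gamma(a=n + 1, scale=1.0 / T)

mode = n / T
mean = (n + 1) / T
sd = np.sqrt(n + 1) / T

# Central 68.3% credible region via the incomplete gamma (ppf) function
lo, hi = post.ppf(0.1585), post.ppf(0.8415)
print(f"mode={mode:.2f} mean={mean:.2f} sd={sd:.2f} "
      f"68.3% region=({lo:.2f}, {hi:.2f})")
print("check:", post.mean(), post.std())   # matches (n+1)/T and sqrt(n+1)/T
```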
The flat prior
Bayes's justification: Not that ignorance of r → p(r|I) = C
Require (discrete) predictive distribution to be flat:
p(n|I) = ∫ dr p(r|I) p(n|r, I) = C   →   p(r|I) = C
Useful conventions
Use a flat prior for a rate that may be zero
Use a log-flat prior (∝ 1/r) for a nonzero scale parameter
Use proper (normalized, bounded) priors
Plot posterior with abscissa that makes prior flat
72 / 111
The On/Off Problem
Basic problem
Look off-source; unknown background rate b
Count N_off photons in interval T_off
Look on-source; rate is r = s + b with unknown signal s
Count N_on photons in interval T_on
Infer s
Conventional solution
b̂ = N_off/T_off;   σ_b = √N_off / T_off
r̂ = N_on/T_on;     σ_r = √N_on / T_on
ŝ = r̂ - b̂;         σ_s = √(σ_r² + σ_b²)
But ŝ can be negative!
73 / 111
Examples
Spectra of X-Ray Sources
Bassani et al. 1989 Di Salvo et al. 2001
74 / 111
Spectrum of Ultrahigh-Energy Cosmic Rays
Nagano & Watson 2000
HiRes Team 2007
[Figure: UHECR spectrum, Flux × E³/10²⁴ (eV² m⁻² s⁻¹ sr⁻¹) vs log₁₀(E) (eV), showing AGASA, HiRes-1 Monocular, and HiRes-2 Monocular data]
75 / 111
N is Never Large
Sample sizes are never large. If N is too small to get a
sufficiently-precise estimate, you need to get more data (or make
more assumptions). But once N is large enough, you can start
subdividing the data to learn more (for example, in a public
opinion poll, once you have a good estimate for the entire country,
you can estimate among men and women, northerners and
southerners, different age groups, etc etc). N is never enough
because if it were enough you'd already be on to the next
problem for which you need more data.
Similarly, you never have quite enough money. But that's another
story.
Andrew Gelman (blog entry, 31 July 2005)
76 / 111
Backgrounds as Nuisance Parameters
Background marginalization with Gaussian noise
Measure background rate b̂ ± σ_b with source off. Measure total
rate r̂ ± σ_r with source on. Infer signal source strength s, where
r = s + b. With flat priors,
p(s, b|D, M) ∝ exp[-(b - b̂)²/(2σ_b²)] × exp[-(s + b - r̂)²/(2σ_r²)]
77 / 111
Marginalize b to summarize the results for s (complete the
square to isolate b dependence; then do a simple Gaussian
integral over b):
p(s|D, M) ∝ exp[-(s - ŝ)²/(2σ_s²)]
ŝ = r̂ - b̂
σ_s² = σ_r² + σ_b²
⇒ Background subtraction is a special case of background
marginalization.
78 / 111
Bayesian Solution to On/Off Problem
First consider off-source data; use it to estimate b:
p(b|N_off, I_off) = T_off (bT_off)^(N_off) e^(-bT_off) / N_off!
Use this as a prior for b to analyze on-source data. For on-source
analysis I_all = (I_on, N_off, I_off):
p(s, b|N_on) ∝ p(s) p(b) [(s + b)T_on]^(N_on) e^(-(s+b)T_on)   ‖ I_all
p(s|I_all) is flat, but p(b|I_all) = p(b|N_off, I_off), so
p(s, b|N_on, I_all) ∝ (s + b)^(N_on) b^(N_off) e^(-sT_on) e^(-b(T_on + T_off))
79 / 111
Now marginalize over b;
p(s|N_on, I_all) = ∫ db p(s, b|N_on, I_all)
                 ∝ ∫ db (s + b)^(N_on) b^(N_off) e^(-sT_on) e^(-b(T_on + T_off))
Expand (s + b)^(N_on) and do the resulting Γ integrals:
p(s|N_on, I_all) = ∑_(i=0)^(N_on) C_i [T_on (sT_on)^i e^(-sT_on) / i!]
C_i ∝ (1 + T_off/T_on)^i (N_on + N_off - i)! / (N_on - i)!
Posterior is a weighted sum of Gamma distributions, each assigning a
different number of on-source counts to the source. (Evaluate via
recursive algorithm or confluent hypergeometric function.)
80 / 111
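A sketch implementing this mixture-of-gammas posterior directly. The counts and intervals below are made up; log-space weights are used for numerical stability rather than the recursive algorithm mentioned above.

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

def on_off_posterior(s, N_on, N_off, T_on, T_off):
    """On/off signal posterior: a weighted sum of Gamma densities.

    p(s|N_on, I_all) = sum_i C_i * Gamma(s; shape=i+1, scale=1/T_on),
    with C_i propto (1 + T_off/T_on)^i (N_on + N_off - i)! / (N_on - i)!,
    and the weights normalized to sum to 1.
    """
    i = np.arange(N_on + 1)
    logC = (i * np.log1p(T_off / T_on)
            + gammaln(N_on + N_off - i + 1)
            - gammaln(N_on - i + 1))
    C = np.exp(logC - logC.max())
    C /= C.sum()
    # Each term is a Gamma(i+1, scale=1/T_on) density in s
    dens = np.array([stats.gamma.pdf(s, a=k + 1, scale=1.0 / T_on) for k in i])
    return C @ dens

# Illustrative numbers: a long background integration
s = np.linspace(0, 30, 1001)
p = on_off_posterior(s, N_on=16, N_off=60, T_on=1.0, T_off=10.0)
print("normalization check:", p.sum() * (s[1] - s[0]))   # ~1
```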
Example On/Off Posteriors: Short Integrations
T_on = 1
81 / 111
Example On/Off Posteriors: Long Background Integrations
T_on = 1
82 / 111
Multibin On/Off
The more typical on/off scenario:
Data = spectrum or image with counts in many bins
Model M gives signal rate s_k(θ) in bin k, parameters θ
To infer θ, we need the likelihood:
L(θ) = ∏_k p(N_on,k, N_off,k | s_k(θ), M)
For each k, we have an on/off problem as before, only we just
need the marginal likelihood for s_k (not the posterior). The same
C_i coefficients arise.
XSPEC and CIAO/Sherpa provide this as an option.
CHASC approach does the same thing via data augmentation.
83 / 111
Bayesian Computation
Large sample size: Laplace approximation
Approximate posterior as multivariate normal → det(covar) factors
Uses ingredients available in χ²/ML fitting software (MLE, Hessian)
Often accurate to O(1/N)
Low-dimensional models (d ≲ 10 to 20)
Adaptive quadrature
Monte Carlo integration (importance sampling, quasirandom MC)
Hi-dimensional models (d ≳ 5)
Posterior sampling: create RNG that samples posterior
MCMC is most general framework
84 / 111
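As a concrete instance of posterior sampling, here is a minimal random-walk Metropolis sketch (one standard MCMC recipe); the target log-posterior and tuning choices are illustrative, not tied to any particular package.

```python
import numpy as np

def metropolis(log_post, x0, step, n_samples, rng=None):
    """Random-walk Metropolis sampler for an arbitrary log-posterior."""
    if rng is None:
        rng = np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    lp = log_post(x)
    samples = np.empty((n_samples, x.size))
    accepted = 0
    for i in range(n_samples):
        prop = x + step * rng.standard_normal(x.size)   # symmetric proposal
        lp_prop = log_post(prop)
        if np.log(rng.random()) < lp_prop - lp:         # accept w.p. min(1, ratio)
            x, lp = prop, lp_prop
            accepted += 1
        samples[i] = x
    return samples, accepted / n_samples

# Example: sample a 2-d correlated Gaussian posterior
def log_post(x):
    return -0.5 * (x[0]**2 + x[1]**2 - 1.2 * x[0] * x[1]) / (1 - 0.36)

samples, acc = metropolis(log_post, x0=[0.0, 0.0], step=1.0, n_samples=20000)
print("acceptance rate:", acc, " posterior means:", samples[5000:].mean(axis=0))
```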
Outline
1 The Big Picture
2 Foundations: Logic & Probability Theory
3 Inference With Parametric Models
Parameter Estimation
Model Uncertainty
4 Simple Examples
Binary Outcomes
Normal Distribution
Poisson Distribution
5 Application: Extrasolar Planets
6 Probability & Frequency
85 / 111
Extrasolar Planets
Exoplanet detection/measurement methods:
Direct: Transits, gravitational lensing, imaging,
interferometric nulling
Indirect: Keplerian reflex motion (line-of-sight velocity,
astrometric wobble)
The Sun's Wobble From 10 pc
[Figure: the Sun's reflex wobble as seen from 10 pc]
86 / 111
Radial Velocity Technique
As of May 2007, 242 planets found, including 26 multiple-planet
systems
Vast majority (230) found via Doppler radial velocity (RV)
measurements
Analysis method: Identify best candidate period via periodogram;
fit parameters with nonlinear least squares
Issues: Multimodality & multiple planets, nonlinearity, stellar
jitter, long periods, marginal detections, population studies. . .
87 / 111
Keplerian Radial Velocity Model
Parameters for single planet
τ = orbital period (days)
e = orbital eccentricity
K = velocity amplitude (m/s)
Longitude of pericenter ω
Mean anomaly of pericenter passage M_p
System center-of-mass velocity v_0
Velocity vs. time
v(t) = v_0 + K (e cos ω + cos[ω + υ(t)])
True anomaly υ(t) found via Kepler's equation for eccentric
anomaly E(t):
E(t) - e sin E(t) = 2πt/τ - M_p;   tan(υ/2) = [(1 + e)/(1 - e)]^(1/2) tan(E/2)
A strongly nonlinear model!
Kepler's laws relate (K, τ, e) to masses, semimajor axis a,
inclination i
88 / 111
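A sketch of this velocity model in Python, solving Kepler's equation by Newton iteration; the solver choice and the example parameter values are assumptions for illustration only.

```python
import numpy as np

def kepler_E(M, e, tol=1e-12, max_iter=50):
    """Solve Kepler's equation E - e sin E = M by Newton iteration."""
    E = M.copy() if isinstance(M, np.ndarray) else float(M)
    for _ in range(max_iter):
        dE = (E - e * np.sin(E) - M) / (1.0 - e * np.cos(E))
        E = E - dE
        if np.max(np.abs(dE)) < tol:
            break
    return E

def radial_velocity(t, tau, e, K, omega, Mp, v0):
    """v(t) = v0 + K (e cos omega + cos[omega + nu(t)])."""
    M = 2 * np.pi * t / tau - Mp                      # mean anomaly
    E = kepler_E(np.atleast_1d(M), e)
    nu = 2 * np.arctan(np.sqrt((1 + e) / (1 - e)) * np.tan(E / 2))  # true anomaly
    return v0 + K * (e * np.cos(omega) + np.cos(omega + nu))

# Illustrative evaluation with the toy-problem parameters used later
t = np.linspace(0, 1600, 5)
print(radial_velocity(t, tau=800.0, e=0.5, K=50.0, omega=1.0, Mp=0.5, v0=0.0))
```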
The Likelihood Function
Keplerian velocity model with parameters θ = {K, τ, e, M_p, ω, v_0}:
d_i = v(t_i; θ) + ε_i
For measurement errors with std dev'n σ_i, and additional "jitter"
with std dev'n σ_J,
L(θ, σ_J) ≡ p({d_i}|θ, σ_J)
          = ∏_(i=1)^N [1/√(2π(σ_i² + σ_J²))] exp[ -(1/2) (d_i - v(t_i; θ))² / (σ_i² + σ_J²) ]
          = [∏_i 1/√(2π(σ_i² + σ_J²))] exp[ -χ²(θ)/2 ]
where χ²(θ, σ_J) ≡ ∑_i [d_i - v(t_i; θ)]² / (σ_i² + σ_J²)
Ignore jitter for now . . .
89 / 111
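A sketch of this likelihood with the jitter term added in quadrature. The velocity model is passed in as a callable (e.g., the radial_velocity() sketch after the previous slide); a circular-orbit stand-in and simulated data are used here so the snippet runs on its own.

```python
import numpy as np

def log_likelihood(v_model, theta, sigma_J, t, d, sigma):
    """log L(theta, sigma_J) for data d at times t with per-point errors sigma."""
    v = v_model(t, *theta)
    var = sigma**2 + sigma_J**2
    chi2 = np.sum((d - v) ** 2 / var)
    return -0.5 * chi2 - 0.5 * np.sum(np.log(2 * np.pi * var))

# Circular-orbit stand-in (e = 0): v(t) = v0 + K cos(2 pi t / tau - Mp)
def v_circular(t, K, tau, Mp, v0):
    return v0 + K * np.cos(2 * np.pi * t / tau - Mp)

# Simulated data, purely for illustration
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 1600, 24))
sigma = np.full_like(t, 8.0)
d = v_circular(t, 50.0, 800.0, 0.5, 0.0) + rng.normal(0, sigma)
print(log_likelihood(v_circular, (50.0, 800.0, 0.5, 0.0), 0.0, t, d, sigma))
```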
What To Do With It
Parameter estimation
Posterior dist'n for parameters θ of model M_i with i planets:
p(θ|D, M) ∝ p(θ|M) L_i(θ)
Summarize θ with mode, means, credible regions (found by
integrating over θ)
Detection
Calculate probability for no planets (M_0), one planet
(M_1), . . . . Let I = {M_i}.
p(M_i|D, I) ∝ p(M_i|I) L(M_i)
where L(M_i) = ∫ dθ p(θ|M_i) L(θ)
Marginal likelihood L(M_i) includes Occam factor
90 / 111
Design
Predict future datum d_t at time t, accounting for model
uncertainties:
p(d_t|D, M_i) = ∫ dθ p(d_t|θ, M_i) p(θ|D, M_i).
1st factor is Gaussian for d_t with known model; 2nd term &
integral account for θ uncertainty
Bayesian adaptive exploration
Find time likely to make updated posterior shrink the most.
Information theory → best time has largest d_t uncertainty
(maximum entropy in p(d_t|D, M_i))
[Diagram: adaptive exploration cycle combining prior information & data: Strategy → Observ'n → New data → Inference → interim results → Predictions → Design → updated Strategy]
91 / 111
Periodogram Connections
Assume circular orbits: θ = {K, τ, M_p, v_0}
Frequentist
For given τ, maximize likelihood over K and M_p (set v_0 to
data mean, v̄) → profile likelihood:
log L_p(τ, v̄) ∝ Lomb-Scargle periodogram
Bayesian
For given τ, integrate (marginalize) likelihood × prior over
K and M_p (set v_0 to data mean, v̄) → marginal posterior:
log p(τ, v̄|D) ∝ Lomb-Scargle periodogram
Additionally marginalize over v_0 → "floating-mean" LSP
92 / 111
Kepler Periodogram for Eccentric Orbits
The circular model is linear wrt K sin M_p, K cos M_p → coincidence
of profile and marginal.
The full model is nonlinear wrt e, M_p, ω.
Profiling is known to behave poorly for nonlinear parameters (e.g.,
asymptotic confidence regions have incorrect coverage for finite
samples; can even be inconsistent).
Marginalization is valid with complete generality; also has good
frequentist properties.
For given τ, marginalize likelihood × prior over all params but τ:
log p(τ|D) ∝ "Kepler periodogram"
This requires a 5-d integral.
Trick: For K prior ∝ K (interim prior), three of the integrals can
be done analytically → integrate numerically only over (e, M_p)
93 / 111
A Bayesian Workflow for Exoplanets
Use Kepler periodogram to reduce dimensionality to 3-d (τ, e, M_p).
Use Kepler periodogram results (p(τ), moments of e, M_p) to
define initial population for adaptive, population-based MCMC.
Once we have {τ, e, M_p}, get associated {K, ω, v_0} samples from
their exact conditional distribution.
"Fix" the interim prior by weighting the MCMC samples.
(See work by Phil Gregory & Eric Ford for simpler but less efficient
workflows.)
94 / 111
A Bayesian Workflow for Exoplanets
95 / 111
Estimation Results for HD222582
24 Keck RV observations spanning 683 days; long period; hi e
Kepler Periodogram
96 / 111
Differential Evolution MCMC Performance
Marginal for (τ, e)   Convergence: Autocorrelation
Reaches convergence much more quickly than simpler algorithms
97 / 111
Multiple Planets in HD 208487
Phil Gregory's parallel tempering algorithm found a 2nd planet in
HD 208487.
τ_1 = 129.8 d; τ_2 = 909 d
[Figures: joint posterior for (e_2, K_2) with K_2 in m s⁻¹; phased velocity curves (m s⁻¹) vs orbital phase for planet b (P_2) and planet a (P_1)]
98 / 111
Adaptive Exploration for Toy Problem
Data are Kepler velocities plus noise:
d_i = V(t_i; τ, e, K) + e_i
3 remaining geometrical params (t_0, ω, i) are fixed.
Noise probability is Gaussian with known σ = 8 m s⁻¹.
Simulate data with typical Jupiter-like exoplanet parameters:
τ = 800 d
e = 0.5
K = 50 m s⁻¹
Goal: Estimate parameters τ, e and K as efficiently as possible.
99 / 111
Cycle 1: Observation
Prior setup stage specifies 10 equispaced observations.
100 / 111
Cycle 1: Inference
Use flat priors,
p(τ, e, K|D, I) ∝ exp[-Q(τ, e, K)/2σ²]
Q = sum of squared residuals using best-fit amplitudes.
Generate {τ_j, e_j, K_j} via posterior sampling.
101 / 111
Cycle 1 Design: Predictions, Entropy
102 / 111
Cycle 2: Observation
103 / 111
Cycle 2: Inference
Volume V_2 = V_1/5.8
104 / 111
Evolution of Inferences
Cycle 1 (10 samples)
Cycle 2 (11 samples; V_2 = V_1/5.8)
105 / 111
Cycle 3 (12 samples; V_3 = V_2/3.9)
Cycle 4 (13 samples; V_4 = V_3/1.8)
106 / 111
Outline
1 The Big Picture
2 Foundations: Logic & Probability Theory
3 Inference With Parametric Models
Parameter Estimation
Model Uncertainty
4 Simple Examples
Binary Outcomes
Normal Distribution
Poisson Distribution
5 Application: Extrasolar Planets
6 Probability & Frequency
107 / 111
Probability & Frequency
Frequencies are relevant when modeling repeated trials, or
repeated sampling from a population or ensemble.
Frequencies are observables:
When available, can be used to infer probabilities for next trial
When unavailable, can be predicted
Bayesian/Frequentist relationships:
General relationships between probability and frequency
Long-run performance of Bayesian procedures
Examples of Bayesian/frequentist differences
108 / 111
Relationships Between Probability & Frequency
Frequency from probability
Bernoulli's law of large numbers: In repeated i.i.d. trials, given
P(success| . . .) = α, predict
N_success / N_total → α   as   N_total → ∞
Probability from frequency
Bayes's "An Essay Towards Solving a Problem in the Doctrine
of Chances" → First use of Bayes's theorem:
Probability for success in next trial of i.i.d. sequence:
E(α) → N_success / N_total   as   N_total → ∞
109 / 111
Subtle Relationships For Non-IID Cases
Predict frequency in dependent trials
r_t = result of trial t; p(r_1, r_2 . . . r_N|M) known; predict f:
⟨f⟩ = (1/N) ∑_t p(r_t = success|M)
where p(r_1|M) = ∑_(r_2) · · · ∑_(r_N) p(r_1, r_2 . . . |M)
→ Expected frequency of outcome in many trials =
average probability for outcome across trials.
But also find that σ_f needn't converge to 0.
Infer probabilities for different but related trials
Shrinkage: Biased estimators of the probability that share info
across trials are better than unbiased/BLUE/MLE estimators.
A formalism that distinguishes p from f from the outset is particularly
valuable for exploring subtle connections. E.g., shrinkage is explored via
hierarchical and empirical Bayes.
110 / 111
Frequentist Performance of Bayesian Procedures
Many results known for parametric Bayes performance:
Estimates are consistent if the prior doesn't exclude the true value.
Credible regions found with flat priors are typically confidence
regions to O(n^(-1/2)); "reference" priors can improve their
performance to O(n^(-1)).
Marginal distributions have better frequentist performance than
conventional methods like profile likelihood. (Bartlett correction,
ancillaries, bootstrap are competitive but hard.)
Bayesian model comparison is asymptotically consistent (not true of
significance/NP tests, AIC).
For separate (not nested) models, the posterior probability for the
true model converges to 1 exponentially quickly.
Wald's complete class theorem: Optimal frequentist methods are
Bayes rules (equivalent to Bayes for some prior)
. . .
Parametric Bayesian methods are typically good frequentist methods.
(Not so clear in nonparametric problems.)
111 / 111
