
George Mason University

Department of Systems Engineering and Operations Research

Bayesian Inference and Decision Theory
Instructor: Kathryn Blackmond Laskey
Room 2214 ENGR
(703) 993-1644
Office Hours: Wednesday and Thursday 4:30-5:30 PM,
or by appointment
Spring 2015

Unit 1: A Brief Tour of Bayesian Inference and Decision Theory
Kathryn Blackmond Laskey

Spring 2015

Unit 1 (v5) - 1 -


What this Course is About


You will learn a way of thinking about problems of inference
under uncertainty
You will learn to construct mathematical models for inference
and decision problems
You will learn how to apply these models to draw inferences
from data and to make decisions
These methods are based on Bayesian Decision Theory, a
formal theory for rational inference and decision making


Logistics
Web site
http://seor.gmu.edu/~klaskey/SYST664
Blackboard site: http://mymason.gmu.edu

Textbook and Software

Hoff, A First Course in Bayesian Statistical Methods, Springer, 2009


Other recommended texts on course web site
Some assignments can be done in Excel
We will use R, a free open-source statistical computing environment:
http://www.r-project.org/. R code for many textbook examples is on the author's web site
We will use JAGS, an open-source package for Markov Chain Monte Carlo simulation:
http://mcmc-jags.sourceforge.net/

Requirements
Regular assignments (20%): can be handed in on paper or through Blackboard
Take-home midterm (30%) and final (30%)
Project (20%): apply methods to a problem of your choosing

Office hours
Official office hours are 4:30-5:30PM Wednesdays and Thursdays
I respond to questions by email and am available by appointment

Policies and Resources


Academic integrity policy
Read the policies and resources section of the syllabus

Course Outline
Unit 1: A Brief Tour of Bayesian Inference and Decision Theory
Unit 2: Random Variables, Parametric Models, and Inference
from Observation
Unit 3: Statistical Models with a Single Parameter
Unit 4: Monte Carlo Approximation
Unit 5: The Normal Model
Unit 6: Gibbs Sampling
Unit 7: The Multivariate Normal Model
Unit 6: Bayesian Regression and Analysis of Variance
Unit 8: Hierarchical Bayesian Models
Unit 9: Linear Regression
Unit 10: Metropolis-Hastings Sampling

(Later units are subject to change)



Learning Objectives for Unit 1


Describe the elements of a decision model
Refresh knowledge of probability
Apply Bayes rule for simple inference and decision problems
and interpret the results
Use a graph to express conditional independence among
uncertain variables
Explain why Bayesians believe inference cannot be separated
from decision making
Compare Bayesian and frequentist philosophies of statistical
inference
Compute and interpret the expected value of information (VOI)
for a decision problem with an option to collect information


Bayesian Inference
Bayesians use probability to quantify rational degrees of belief
Bayesians view inference as belief dynamics
Begin with prior beliefs
Use evidence to update prior beliefs to posterior beliefs
Posterior beliefs become prior beliefs for next evidence

Inference problems are usually embedded in decision problems


Decision Theory
Decision theory is a formal theory of decision making under
uncertainty
A decision problem consists of:
Possible actions: $\{a\}_{a \in A}$
States of the world (usually uncertain): $\{s\}_{s \in S}$
Possible consequences: $\{c\}_{c \in C}$ (each depends on state and action)

Question: What is the best action?


Answer (according to decision theory):
Measure goodness of consequences with a utility function u(c)
Measure likelihood of states with probability distribution p(s)
Best action with respect to model maximizes expected utility:
$a^* = \arg\max_a E[u(c) \mid a]$
For brevity, we may write E[u(a)] for E[u(c) | a]

Caveat emptor:
How good the recommended action is for you depends on the fidelity of the model to your beliefs and preferences

Illustrative Example
(Highly Oversimplified)

Decision problem: Should patient be treated for disease?


We suspect she may have disease but do not know
Without treatment the disease will lead to long illness
Treatment has unpleasant side effects

Decision model:
Actions: aT (treat) and aN (don't treat)
States of world: sD (disease now) and sW (well now)
Consequences: cWN (well shortly, no side effects), cWS (well shortly, side
effects), cDN (disease for long time, no side effects)
Probabilities and Utilities:
P(sD) = 0.3
u(cWN) = 100, u(cWS) = 90; u(cDN) = 0

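[Figure: utility scale marking u(cWN) = 100, u(cWS) = 90, EU(aN) = 70 and u(cDN) = 0]

Expected utility:
Treat: $EU(a_T) = 0.3 \times 90 + 0.7 \times 90 = 90$
Don't treat: $EU(a_N) = 0.3 \times 0 + 0.7 \times 100 = 70$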

Best action is aT (treat patient)
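
The expected-utility comparison above can be checked with a few lines of Python. This is a minimal sketch: the dictionary and function names are illustrative assumptions, and only the probabilities and utilities come from the example.

```python
# Illustrative sketch; names are hypothetical, numbers are from the example.
P_DISEASE = 0.3

UTILITY = {
    "cWN": 100.0,  # well shortly, no side effects
    "cWS": 90.0,   # well shortly, side effects
    "cDN": 0.0,    # disease for long time, no side effects
}

# Consequence that follows from each (action, state) pair.
CONSEQUENCE = {
    ("aT", "sD"): "cWS",
    ("aT", "sW"): "cWS",
    ("aN", "sD"): "cDN",
    ("aN", "sW"): "cWN",
}


def expected_utility(action, p_disease):
    """Expected utility of an action given P(sD) = p_disease."""
    state_probs = {"sD": p_disease, "sW": 1.0 - p_disease}
    return sum(
        prob * UTILITY[CONSEQUENCE[(action, state)]]
        for state, prob in state_probs.items()
    )


eu = {action: expected_utility(action, P_DISEASE) for action in ("aT", "aN")}
best_action = max(eu, key=eu.get)  # action with the largest expected utility
print(eu, best_action)
```

Keeping the consequence table separate from the utilities makes it easy to change either one without touching the expected-utility calculation.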


[Figure labels: P(sD) = 0.3, P(sW) = 0.7]

Sensitivity Analysis:
Optimal Decision as Function of Sickness Probability
Expected utility of the two actions:
$E[U \mid a_T] = 90$
$E[U \mid a_N] = 0 \cdot p + 100(1 - p) = 100(1 - p)$

Expected utility of not treating depends on p = P(sD), the probability that the patient has the disease

Treating is better when 90 > 100(1 - p), that is, when p > 0.1
We should treat if p > 0.1, don't treat if p < 0.1

Typically we are uncertain about the value of p


The chart shows how the optimal decision changes as we vary p

We may want to collect more information to refine our estimate of p

[Figure: expected utility of each action as a function of p, with the expected gain from treatment at p = 0.3 marked]
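
The sensitivity analysis can be reproduced numerically. The sketch below is an illustration under the utilities of the example (90 if treated, 100 if well and untreated, 0 if sick and untreated); the function and constant names are assumptions made for this sketch.

```python
U_TREAT = 90.0            # utility if treated, in either state
U_WELL_UNTREATED = 100.0  # utility if well and not treated
U_SICK_UNTREATED = 0.0    # utility if sick and not treated


def eu_treat(p):
    """Expected utility of treating; it does not depend on p."""
    return U_TREAT


def eu_no_treat(p):
    """Expected utility of not treating when P(sD) = p."""
    return U_SICK_UNTREATED * p + U_WELL_UNTREATED * (1.0 - p)


def best_action(p):
    return "treat" if eu_treat(p) > eu_no_treat(p) else "no_treat"


# Break-even probability: solve U_TREAT = U_SICK*p + U_WELL*(1 - p) for p.
break_even = (U_WELL_UNTREATED - U_TREAT) / (U_WELL_UNTREATED - U_SICK_UNTREATED)

for p in (0.05, 0.2, 0.3, 0.5):
    print(p, eu_treat(p), eu_no_treat(p), best_action(p))
```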


Why be a Bayesian?
Arguments from theory
A coherent decision maker uses probability to represent uncertainty, uses utility
to represent value, and maximizes expected utility
If you are not coherent then someone can make "Dutch book" on you (turn you
into a "money pump")

Pragmatic arguments
Decision theory provides a useful and principled methodology for modeling
problems of inference, decision and learning from experience
Engineering tradeoffs between accuracy, complexity and cost can be analyzed
and evaluated
Both empirical data and informed engineering judgment can be explicitly
represented and incorporated into a model
Bayesian methods can handle small, moderate and large sample sizes; small,
moderate and large numbers of parameters
With other approaches it is often more difficult to understand why you got the
results you did and how to improve your model

Arguments from experience


Successful applications
Success attributed to decision theoretic technology

Caution:
Uncritical application of cookbook methods can lead to disaster!
Good modelers iteratively assess, check and revise assumptions

Review: Probability Basics


A probability distribution is a function $P(\cdot)$ applied to sets such that:
$P(A) \ge 0$ for all subsets A of the universal set $\Omega$
$P(\Omega) = 1$
If $A_i \cap A_j = \emptyset$ for $i \ne j$ then $P(A_1 \cup A_2 \cup \cdots) = P(A_1) + P(A_2) + \cdots$

From this basic definition we can deduce other identities of probability theory, e.g.:
$P(\emptyset) = 0$
$P(A) \le 1$ for all subsets A of $\Omega$
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$ for any A and B
If $A_i \cap A_j = \emptyset$ for $i \ne j$ and $A = A_1 \cup A_2 \cup \cdots$, then $P(A \cap B) = \sum_i P(B \mid A_i) P(A_i)$ (law of total probability)

The conditional probability of A given B for any two sets A and B is defined as a number $P(A \mid B)$ satisfying:
$P(A \mid B) P(B) = P(A \cap B)$
If $P(B) \ne 0$ this is equivalent to the traditional formula:
$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
A is independent of B if $P(A \mid B) = P(A)$

Extending the Disease Example: Gathering Information
We may be able to perform a test before deciding whether to
treat the patient
Test has two outcomes: tP (positive) and tN (negative)
Quality of test is characterized by two numbers:
Sensitivity: Probability that test is positive if patient has disease
Specificity: Probability that test is negative if patient does not have
disease

We will assume:
Sensitivity: P(tP | sD) = 0.95
Specificity: P(tN | sW) = 0.85

How does the model change if test results are available?


Take test, observe outcome t
Revise prior beliefs P(sD) to obtain posterior beliefs P(sD|t)
Re-compute optimal decision using P(sD|t)


Bayes Rule: The Law of Belief Dynamics


Objective: use evidence E to update beliefs about a hypothesis H
H: patient has (does not have) disease
E: evidence from test

Procedure: apply Bayes Rule (standard form):

$P(H \mid E) = \frac{P(E \cap H)}{P(E)} = \frac{P(E \mid H) P(H)}{P(E)} = \frac{P(E \mid H) P(H)}{\sum_i P(E \mid H_i) P(H_i)}$

where $P(E) > 0$, $H_i \cap H_j = \emptyset$ for $i \ne j$, and $\Omega = H_1 \cup H_2 \cup \cdots$

Bayes Rule (odds likelihood form):

$\frac{P(H_1 \mid E)}{P(H_2 \mid E)} = \frac{P(E \mid H_1) P(H_1)}{P(E \mid H_2) P(H_2)}$

where $P(E) > 0$ and $P(H_2) > 0$

Terminology:
$P(H)$ - The prior probability of H
$P(E \mid H)$ - The likelihood for E given H
$P(E)$ - The predictive probability of E
$P(H \mid E)$ - The posterior probability of H given E
$P(E \mid H_1) / P(E \mid H_2)$ - The likelihood ratio for H1 versus H2
$P(H_1) / P(H_2)$ - The prior odds ratio for H1 versus H2

The posterior probability of H1 increases relative to H2 if the evidence is more likely given H1 than given H2
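
Both forms of Bayes rule can be written as small functions. The sketch below is illustrative only: the function names and the two-hypothesis numbers are made up for this example and do not come from the slides.

```python
def posterior(prior, likelihood):
    """Standard form: P(H|E) = P(E|H) P(H) / sum_i P(E|H_i) P(H_i).

    prior:      dict mapping hypothesis -> P(H)
    likelihood: dict mapping hypothesis -> P(E | H)
    """
    joint = {h: likelihood[h] * prior[h] for h in prior}
    p_evidence = sum(joint.values())  # predictive probability P(E)
    if p_evidence <= 0.0:
        raise ValueError("P(E) must be positive")
    return {h: joint[h] / p_evidence for h in joint}


def posterior_odds(prior, likelihood, h1, h2):
    """Odds-likelihood form: likelihood ratio times prior odds."""
    likelihood_ratio = likelihood[h1] / likelihood[h2]
    prior_odds = prior[h1] / prior[h2]
    return likelihood_ratio * prior_odds


# Made-up numbers for illustration.
prior = {"H1": 0.2, "H2": 0.8}
likelihood = {"H1": 0.9, "H2": 0.3}

post = posterior(prior, likelihood)
odds = posterior_odds(prior, likelihood, "H1", "H2")
print(post, odds)
```

The odds form never needs P(E), which is why it is convenient when only the ratio between two hypotheses matters.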

Disease Example with Test


Review of Problem Ingredients:
$P(s_D) = 0.3$ (prior probability of disease)
$P(t_P \mid s_D) = 0.95$; $P(t_N \mid s_W) = 0.85$ (sensitivity & specificity of test)
$u(c_{WN}) = 100$, $u(c_{WS}) = 90$; $u(c_{DN}) = 0$ (utilities)

If test is negative:
$P(s_D \mid t_N) = (0.3 \times 0.05)/(0.3 \times 0.05 + 0.7 \times 0.85) = 0.025$
$EU(a_N \mid t_N) = 0.025 \times 0 + 0.975 \times 100 = 97.5$
$EU(a_T \mid t_N) = 0.025 \times 90 + 0.975 \times 90 = 90$
Best action is not to treat

If test is positive:
$P(s_D \mid t_P) = (0.3 \times 0.95)/(0.3 \times 0.95 + 0.7 \times 0.15) = 0.731$
$EU(a_N \mid t_P) = 0.731 \times 0 + 0.269 \times 100 = 26.9$
$EU(a_T \mid t_P) = 0.731 \times 90 + 0.269 \times 90 = 90$
Best action is to treat

[Figure: expected utility scale marking EU(aN | tN) = 97.5, EU(aT) = EU(aT | tP) = EU(aT | tN) = 90, EU(aN) = 70 and EU(aN | tP) = 26.9]

Optimal policy is strategy $a_F$ (FollowTest):
Treat if test is positive; don't treat if test is negative
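
The posterior probabilities and the resulting decisions can be recomputed as follows. This is a sketch under the stated prior, sensitivity, specificity and utilities; the function names and the `policy` dictionary are assumptions made for illustration.

```python
PRIOR_DISEASE = 0.3
SENSITIVITY = 0.95  # P(tP | sD)
SPECIFICITY = 0.85  # P(tN | sW)
U_TREAT, U_WELL_UNTREATED, U_SICK_UNTREATED = 90.0, 100.0, 0.0


def posterior_disease(test_positive):
    """P(sD | test result) by Bayes rule."""
    like_sick = SENSITIVITY if test_positive else 1.0 - SENSITIVITY
    like_well = (1.0 - SPECIFICITY) if test_positive else SPECIFICITY
    numerator = like_sick * PRIOR_DISEASE
    return numerator / (numerator + like_well * (1.0 - PRIOR_DISEASE))


def expected_utilities(p_disease):
    """Expected utility of each action when P(sD) = p_disease."""
    return {
        "treat": U_TREAT,
        "no_treat": p_disease * U_SICK_UNTREATED
        + (1.0 - p_disease) * U_WELL_UNTREATED,
    }


policy = {}
for result, positive in (("positive", True), ("negative", False)):
    p = posterior_disease(positive)
    eu = expected_utilities(p)
    policy[result] = max(eu, key=eu.get)
    print(result, round(p, 3), eu, policy[result])
```

The code keeps full precision, so the expected utility of not treating after a negative test comes out slightly above the 97.5 obtained from the rounded posterior 0.025.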



Value of Information
Reminder of problem ingredients:
$P(s_D) = 0.3$ (prior probability of disease)
$P(t_P \mid s_D) = 0.95$; $P(t_N \mid s_W) = 0.85$ (sensitivity & specificity of test)
$u(c_{WN}) = 100$, $u(c_{WS}) = 90$; $u(c_{DN}) = 0$ (utilities)

Probability test will be positive:
$P(t_P) = P(t_P \mid s_D) P(s_D) + P(t_P \mid s_W) P(s_W) = 0.95 \times 0.3 + 0.15 \times 0.7 = 0.39$

If test is positive we should treat, with $EU(a_T) = 90$
If test is negative we should not treat, with $EU(a_N \mid t_N) = 97.5$
Expected utility of FollowTest strategy (treat if test is positive, otherwise don't):
$EU(a_F) = P(t_P) EU(a_T) + P(t_N) EU(a_N \mid t_N) = 0.39 \times 90 + (1 - 0.39) \times 97.5 = 94.575$

If we do not perform the test, our best action is to treat everyone, with $EU(a_T) = 90$
Gain from test is $94.575 - 90 = 4.575$ (4.6 if $EU(a_N \mid t_N)$ is not rounded to 97.5)
This is called the Expected Value of Sample Information (EVSI)
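
The EVSI calculation can be sketched in code as below. The variable names are illustrative assumptions. Because the code does not round the posterior probability, it yields 4.6 where the hand calculation with the rounded value 97.5 gives 4.575.

```python
PRIOR_DISEASE = 0.3
SENSITIVITY, SPECIFICITY = 0.95, 0.85
U_TREAT, U_WELL_UNTREATED, U_SICK_UNTREATED = 90.0, 100.0, 0.0


def best_expected_utility(p_disease):
    """Expected utility of the better of the two actions when P(sD) = p_disease."""
    eu_no_treat = p_disease * U_SICK_UNTREATED + (1.0 - p_disease) * U_WELL_UNTREATED
    return max(U_TREAT, eu_no_treat)


# Predictive probabilities of the two test results.
p_positive = SENSITIVITY * PRIOR_DISEASE + (1.0 - SPECIFICITY) * (1.0 - PRIOR_DISEASE)
p_negative = 1.0 - p_positive

# Posterior probability of disease after each result.
post_positive = SENSITIVITY * PRIOR_DISEASE / p_positive
post_negative = (1.0 - SENSITIVITY) * PRIOR_DISEASE / p_negative

eu_without_test = best_expected_utility(PRIOR_DISEASE)
eu_follow_test = (
    p_positive * best_expected_utility(post_positive)
    + p_negative * best_expected_utility(post_negative)
)
evsi = eu_follow_test - eu_without_test
print(p_positive, eu_without_test, eu_follow_test, evsi)
```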

Expected Value of Perfect Information


Expected Value of Perfect Information (EVPI) is increase in utility from perfect knowledge of an uncertain variable
Suppose an oracle will tell us whether patient is sick
An oracle has Sensitivity = Specificity = 1

30% chance we discover she is sick and treat - utility 90
70% chance we discover she is well and don't treat - utility 100
Expected utility if we ask the oracle: $0.3 \times 90 + 0.7 \times 100 = 97$
$EVPI = 97 - 90 = 7$

$EVPI \ge EVSI \ge 0$
$EVPI = EVSI = 0$ if the test will not change your decision


Should We Collect Information?


General Principle: Free information can never hurt
To analyze decision of whether to collect information D on outcome variable V:
Find maximum expected utility option if we don't collect information
Compute its expected utility $U_0$
Find EVPI
For each possible value V=v, assume it is known to be the true outcome, find the optimal decision, calculate its utility, and multiply by the probability that V=v
Add these values and subtract the no-information expected utility $U_0$ from the sum to get EVPI

Compare EVPI with cost of information
If EVPI is too small in relation to cost then stop; otherwise, compute EVSI
For each possible result D=d of the experiment, find the maximum expected utility action a(d)
For each outcome V=v and result D=d of the experiment, find the joint probability P(v,d)
Calculate the expected utility with information $U_{SI} = \sum_{v,d} P(v,d)\, u(a(d), v)$, where $u(a, v)$ is the utility of action a when the outcome is v
Compute the difference $U_{SI} - U_0$ in expected utility between the decision with information and the no-information decision to get EVSI

Compare EVSI with cost of information
Collect information if EVSI is large enough in relation to cost of information
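
The procedure above can be sketched as one generic function. This is an illustrative sketch, not a reference implementation: the function name, its arguments and the returned dictionary are assumptions, and the disease example is used only as a check.

```python
def value_of_information(p_joint, utility, actions):
    """Compute U0, EVPI and EVSI for a discrete decision problem.

    p_joint: dict mapping (v, d) -> P(V=v, D=d)
    utility: function (action, v) -> utility of the action when V=v
    actions: iterable of available actions
    """
    actions = list(actions)
    values = sorted({v for v, _ in p_joint})
    results = sorted({d for _, d in p_joint})
    p_v = {v: sum(p for (vv, _), p in p_joint.items() if vv == v) for v in values}

    def weighted_utility(action, weights):
        # Sum over v of weight(v) * u(action, v); weights need not sum to 1.
        return sum(w * utility(action, v) for v, w in weights.items())

    # Best action without information.
    u0 = max(weighted_utility(a, p_v) for a in actions)

    # Perfect information: pick the best action separately for each v.
    u_perfect = sum(p_v[v] * max(utility(a, v) for a in actions) for v in values)

    # Sample information: pick the best action a(d) for each result d.
    u_si = 0.0
    for d in results:
        weights = {v: p_joint.get((v, d), 0.0) for v in values}
        u_si += max(weighted_utility(a, weights) for a in actions)

    return {"U0": u0, "EVPI": u_perfect - u0, "EVSI": u_si - u0}


# Check against the disease example: P(v, d) = P(d | v) P(v).
joint = {
    ("sick", "positive"): 0.95 * 0.3,
    ("sick", "negative"): 0.05 * 0.3,
    ("well", "positive"): 0.15 * 0.7,
    ("well", "negative"): 0.85 * 0.7,
}


def disease_utility(action, v):
    if action == "treat":
        return 90.0
    return 0.0 if v == "sick" else 100.0


voi = value_of_information(joint, disease_utility, ["treat", "no_treat"])
print(voi)
```

Maximizing the joint-weighted utility for each result d gives the same action as maximizing the conditional expected utility, because the two differ only by the positive factor P(d).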

Expected Utility of FollowTest Policy as Function of $\theta = P(s_D)$
FollowTest strategy treats if test is positive and otherwise not

World State    | Probability         | Action  | Utility
Sick, Positive | $0.95\theta$        | Treat   | 90
Sick, Negative | $0.05\theta$        | NoTreat | 0
Well, Positive | $0.15(1 - \theta)$  | Treat   | 90
Well, Negative | $0.85(1 - \theta)$  | NoTreat | 100

$E[U \mid a_F] = 0.95\theta \times 90 + 0.05\theta \times 0 + 0.15(1 - \theta) \times 90 + 0.85(1 - \theta) \times 100 = 98.5 - 13\theta$
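
The algebra behind the result $E[U \mid a_F] = 98.5 - 13\theta$ can be spelled out term by term. The steps below only expand the four probability-times-utility products for the FollowTest policy, with $\theta = P(s_D)$:

```latex
\begin{align*}
E[U \mid a_F]
  &= 0.95\theta \cdot 90 + 0.05\theta \cdot 0
     + 0.15(1-\theta) \cdot 90 + 0.85(1-\theta) \cdot 100 \\
  &= 85.5\theta + 13.5(1-\theta) + 85(1-\theta) \\
  &= 85.5\theta + 98.5(1-\theta) \\
  &= 98.5 - 13\theta
\end{align*}
```

At $\theta = 0.3$ this gives $98.5 - 3.9 = 94.6$, which agrees with the hand-computed 94.575 up to the rounding of $EU(a_N \mid t_N)$ to 97.5.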


Another Way to Find Expected Utility of FollowTest Policy
FollowTest strategy treats if test is positive and otherwise not
Before doing test, we think:

World State    | Probability                 | Action  | Utility
Sick, Positive | $P(s_D \mid t_P) P(t_P)$    | Treat   | 90
Sick, Negative | $P(s_D \mid t_N) P(t_N)$    | NoTreat | 0
Well, Positive | $P(s_W \mid t_P) P(t_P)$    | Treat   | 90
Well, Negative | $P(s_W \mid t_N) P(t_N)$    | NoTreat | 100

We are uncertain about the test result: it will be positive with probability $P(t_P)$ and negative with probability $P(t_N)$.
If it is positive our expected utility will be $EU[a_T \mid t_P]$.
If it is negative our expected utility will be $EU[a_N \mid t_N]$.
So our expected utility of following the test is $P(t_P) EU[a_T \mid t_P] + P(t_N) EU[a_N \mid t_N]$

$E[U \mid a_F] = P(s_D \mid t_P) P(t_P) \times 90 + P(s_D \mid t_N) P(t_N) \times 0 + P(s_W \mid t_P) P(t_P) \times 90 + P(s_W \mid t_N) P(t_N) \times 100$
$= P(t_P) [P(s_D \mid t_P) \times 90 + P(s_W \mid t_P) \times 90] + P(t_N) [P(s_D \mid t_N) \times 0 + P(s_W \mid t_N) \times 100]$
$= P(t_P) EU[a_T \mid t_P] + P(t_N) EU[a_N \mid t_N]$
(Compare with the Value of Information slide)

Strategy Regions
FollowTest: $EU(a_F) = 98.5 - 13\theta$
AlwaysTreat: $EU(a_T) = 90$
FollowTest is better when $98.5 - 13\theta > 90$, or $\theta < 8.5/13 = 0.654$

NoTreat: $EU(a_N) = 100(1 - \theta)$
FollowTest is better when $98.5 - 13\theta > 100(1 - \theta)$, or $\theta > 1.5/87 = 0.017$

[Figure: EVSI as a function of $\theta$]
EVSI is positive for $0.017 < \theta < 0.654$
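
The strategy regions can be recovered directly from the test characteristics instead of from the simplified formula. The sketch below assumes the sensitivity, specificity and utilities of the example; all function names are illustrative.

```python
SENSITIVITY, SPECIFICITY = 0.95, 0.85
U_TREAT, U_WELL_UNTREATED, U_SICK_UNTREATED = 90.0, 100.0, 0.0


def eu_always_treat(theta):
    return U_TREAT


def eu_no_treat(theta):
    return theta * U_SICK_UNTREATED + (1.0 - theta) * U_WELL_UNTREATED


def eu_follow_test(theta):
    """Treat on a positive result, do not treat on a negative one."""
    p_positive = SENSITIVITY * theta + (1.0 - SPECIFICITY) * (1.0 - theta)
    sick_negative = (1.0 - SENSITIVITY) * theta
    well_negative = SPECIFICITY * (1.0 - theta)
    return (
        p_positive * U_TREAT
        + sick_negative * U_SICK_UNTREATED
        + well_negative * U_WELL_UNTREATED
    )


def best_strategy(theta):
    eu = {
        "FollowTest": eu_follow_test(theta),
        "AlwaysTreat": eu_always_treat(theta),
        "NoTreat": eu_no_treat(theta),
    }
    return max(eu, key=eu.get)


def evsi(theta):
    """Gain of FollowTest over the best no-test action, floored at zero."""
    no_test = max(eu_always_treat(theta), eu_no_treat(theta))
    return max(0.0, eu_follow_test(theta) - no_test)


lower, upper = 1.5 / 87.0, 8.5 / 13.0  # boundaries of the FollowTest region
for theta in (0.01, 0.05, 0.3, 0.6, 0.8):
    print(theta, best_strategy(theta), round(evsi(theta), 3))
```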


Strategy Regions for Costly Test


FollowTest: $EU(a_F) = 98.5 - 13\theta - c$ (c is cost of test)
AlwaysTreat: $EU(a_T) = 90$
FollowTest is better when $98.5 - 13\theta - c > 90$, or $\theta < (8.5 - c)/13$

NoTreat: $EU(a_N) = 100(1 - \theta)$
FollowTest is better when $98.5 - 13\theta - c > 100(1 - \theta)$, or $\theta > (1.5 + c)/87$

Test is worth doing if $(1.5 + c)/87 < \theta < (8.5 - c)/13$

[Figure: Expected Utility of Optimal Strategy with Costly Test, plotted against $\theta$ from 0 to 1 for c=0, c=1, c=4 and c≥7.2; the gain from doing the test with c=1 at p=0.3 is marked]

EVSI and Costly Test


For a test with cost c:
$E[U \mid \text{FollowTest}] = 98.5 - 13\theta - c$
NoTreat is best when $\theta < (1.5 + c)/87$
FollowTest is best when $(1.5 + c)/87 < \theta < (8.5 - c)/13$

Probability range where testing is optimal depends on cost of test
If $0.018 < \theta < 0.029$ then test if c=0.1 but do nothing if c=1
If $0.577 < \theta < 0.646$ then test if c=0.1 but treat if c=1

Information collection is optimal when EVSI is greater than cost of test
[Figure: Expected Utility of Optimal Strategy with Costly Test, plotted against $\theta$ from 0 to 1 for c=0, c=1, c=4 and c≥7.2]
[Figure: EVSI as a Function of Prior Probability, with the range of optimality of the test with c=1 marked]

Value of Information: Summary


Collecting information may have value if it might change your decision
Expected value of perfect information (EVPI) is utility gain from knowing true value of uncertain variable
Expected value of sample information (EVSI) is utility gain from available information

In our example, EVSI is positive for $0.017 < \theta < 0.654$
If $0.017 \le \theta \le 0.1$: EVSI is $87\theta - 1.5$
If $0.1 \le \theta \le 0.654$: EVSI is $8.5 - 13\theta$
If $\theta = 0.3$: EVSI is $8.5 - 13\theta = 4.6$ (testing is optimal)

Costly information has value when EVSI is greater than cost of information
In our example (where c is cost of test):
If $0.017 \le \theta \le 0.1$: Test if $87\theta - 1.5 > c$
If $0.1 \le \theta \le 0.654$: Test if $8.5 - 13\theta > c$
If $\theta = 0.3$: Test if $4.6 > c$ (test if c is less than 4.6)

When Parameters of a Model are Unknown


Our disease model depends on several parameters
Prior probability of disease
Sensitivity of test
Specificity of test

Usually these probabilities are estimated from data and/or expert judgment
"Randomized clinical trials have established that Test T has sensitivity 0.95 and specificity 0.85 for Disease D"
"Given the presenting symptoms and my clinical judgment, I estimate a 60% probability that the patient has Disease D."

How does a Bayesian combine data and expert judgment?
Use clinical judgment to quantify uncertainty about the parameter as a probability distribution
Collect data and use Bayes rule to obtain posterior distribution for the parameters given the data
If appropriate, use clinical judgment to adjust results of studies to apply to a particular patient


Learning from a Random Sample drawn from a Parametric Model
Many statistical models assume observations are drawn at random from a common probability distribution
Data $X_1, \ldots, X_n$ are drawn at random from distribution with probability mass function $f(x \mid \theta)$ if:
$P(X_i = x \mid \theta) = f(x \mid \theta)$ for all i
$X_i$ is independent of $X_j$ given $\theta$ for $i \ne j$
Notation convention: uppercase letters for unknowns; lowercase letters for particular values the unknowns can take on

Example:
Middle-aged male patients who complain of symptom S are drawn at random from a population with a proportion $\theta$ who have disease D
Data $X_1, \ldots, X_n$ are independent and identically distributed (iid) given $\theta = P(\text{Sick})$
"$X_i$ iid" means that if $\theta$ is known, learning the condition of some patients does not affect our beliefs about the conditions of the remaining patients
If the value $\theta$ is unknown we can express our uncertainty about it by defining a probability distribution for its value
$X_i = 1$ (disease) or $X_i = 0$ (no disease)
$\Pr(X_i = 1 \mid \theta) = \theta$ (this is called a Bernoulli distribution)
$\Pr(\Theta = \theta) = g(\theta)$

We now use Bayes rule to obtain a posterior distribution $g(\theta \mid X_1, \ldots, X_n)$

Random Sample: Graphical Representation


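We can use a graph to represent conditional independence
Arc from $\Theta$ to $X_i$ means the distribution of $X_i$ depends on $\Theta$
No arc from $X_i$ to $X_j$ means that $X_i$ and $X_j$ are independent given $\Theta$

A plate represents repeated structure
Because all the $X_i$ are inside the same plate, they all follow the same probability distribution

[Figure: graph with arcs from $\Theta$ to $X_1$, $X_2$ and $X_3$, and the equivalent plate diagram with $X_i$ inside a plate indexed by $i = 1, \ldots, 3$]
$\Theta \sim g(\theta)$
$X_i \mid \Theta \sim \text{Bernoulli}(\Theta)$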


Example: Bayesian Inference about a Parameter (with a very small sample)

Uninformative prior distribution $\Pr(\Theta = \theta) = g(\theta)$
We assume that $\Theta$ can have one of 20 equally likely values: 0.025, 0.075, ..., 0.975
We could represent prior knowledge by assigning some values of $\Theta$ greater probability than others
$\Theta$ actually has a continuous range of values. We will treat continuous parameters later. For now we approximate with a finite set of values.

Observe 5 iid cases: $X_1, X_2, X_3, X_4, X_5$
Case 2 has disease; cases 1, 3, 4 and 5 do not
Likelihood function (probability of observations as function of $\theta$):
$\Pr(X_1, X_2, X_3, X_4, X_5 \mid \Theta = \theta) = \theta(1 - \theta)^4$
The likelihood function depends only on how many cases have the disease
The number of cases having the disease and the total number of cases are sufficient for inference about $\Theta$

Use Bayes rule to calculate posterior distribution for $\Theta$:
$g(\theta \mid \underline{x}) = \frac{g(\theta) f(\underline{x} \mid \theta)}{\sum_{\theta'} g(\theta') f(\underline{x} \mid \theta')} = \frac{\frac{1}{20} \theta(1 - \theta)^4}{\sum_{\theta'} \frac{1}{20} \theta'(1 - \theta')^4} = \frac{\theta(1 - \theta)^4}{\sum_{\theta'} \theta'(1 - \theta')^4}$
Because the prior distribution is uniform, the posterior distribution is proportional to the likelihood

[Figure: bar charts "Prior Distribution for Theta" (Prior Probability vs. Theta) and "Posterior Distribution for Theta" (Posterior Probability vs. Theta)]

Underscore indicates a vector: $\underline{x} = (x_1, x_2, x_3, x_4, x_5)$
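
The grid calculation for this example is short enough to write out. The course itself works in R; the sketch below uses Python purely for illustration, and its variable and function names are assumptions.

```python
# Grid of 20 candidate values for Theta: 0.025, 0.075, ..., 0.975.
theta_grid = [0.025 + 0.05 * k for k in range(20)]
prior = [1.0 / len(theta_grid)] * len(theta_grid)  # uniform prior g(theta)

# Observed cases: 1 = disease, 0 = no disease (case 2 has the disease).
cases = [0, 1, 0, 0, 0]


def likelihood(theta, data):
    """Bernoulli likelihood theta^k (1 - theta)^(n - k) for iid 0/1 data."""
    k = sum(data)
    n = len(data)
    return theta ** k * (1.0 - theta) ** (n - k)


unnormalized = [g * likelihood(t, cases) for t, g in zip(theta_grid, prior)]
normalizer = sum(unnormalized)  # predictive probability of the data
posterior = [u / normalizer for u in unnormalized]

# Grid value with the highest posterior probability.
posterior_mode = theta_grid[posterior.index(max(posterior))]

for t, p in zip(theta_grid, posterior):
    print(f"{t:.3f}  {p:.4f}")
```

Replacing `prior` with any other list of 20 non-negative weights that sum to 1 gives the posterior for an informative prior without changing the rest of the code.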



Bayesian Inference Example: R Code

[Figure: bar charts "Prior Distribution for Theta" (Prior Probability vs. Theta) and "Posterior Distribution for Theta" (Posterior Probability vs. Theta)]
Horizontal axis is $\theta = P(\text{Sick})$; height of bar is probability that $\Theta = \theta$

R Computing Environment
R (http://www.r-project.org) is a free, open source statistical
computing language and environment that includes:

data handling and storage


matrix and array operations
tools for data analysis
graphical facilities for data analysis and display
programming language which includes conditionals, loops, user-defined
recursive functions and input and output facilities

Vibrant and active user community contributes new functionality


Users can contribute packages to extend functionality

Many resources exist to help users at a variety of levels


http://cran.r-project.org/doc/manuals/R-intro.pdf

RStudio (http://www.rstudio.com) is a free, open-source integrated


development environment for R
We will use R heavily in this course
Most R assignments can be done by modifying sample code I will
provide

Bayesian Learning and Sample Size


When the sample size is very large:
The posterior distribution will be concentrated around the maximum likelihood
estimate and is relatively insensitive to the prior distribution
We won't go too far wrong if we act as if the parameter is equal to the
maximum likelihood estimate

When the sample size is very small:


The posterior distribution is highly dependent on the prior distribution
Reasonable people may disagree on the value of the parameter

When the sample size is moderate, Bayesian learning can be a big


improvement on either expert judgment alone or data alone
Achieving the benefit requires careful modeling
This course will teach methods for constructing Bayesian models

A powerful characteristic of the Bayesian approach is the flexibility


to tailor results to moderate-sized sub-populations
Bayesian estimate shrinks estimates of sub-population parameters toward
population average
Amount of shrinkage depends on sample size and similarity of sub-population
to overall population
Shrinkage improves estimates for small to moderate sized sub-populations

Effect of Sample Size on Posterior Distribution


These plots show the posterior distribution for $\Theta$ when:
Prior distribution is uniform
20% of patients in sample have the disease

Posterior distribution becomes more concentrated around 1/5 as sample size gets larger

[Figure: three bar charts of the posterior distribution on the grid 0.025, 0.075, ..., 0.975. Panels: "Sample size 5: 1 with, 4 without"; "Sample size 20: 4 with, 16 without"; "Sample size 80: 16 with, 64 without"]
Horizontal axis is $\theta = P(s_D)$; height of bar is probability that $\Theta = \theta$
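
The concentration of the posterior with growing sample size can be quantified on the same 20-point grid. This is a sketch: the helper names and the choice to measure the posterior mass between 0.1 and 0.3 are assumptions made for illustration.

```python
THETA_GRID = [0.025 + 0.05 * k for k in range(20)]
UNIFORM_PRIOR = [1.0 / len(THETA_GRID)] * len(THETA_GRID)


def grid_posterior(prior, n_with, n_total):
    """Posterior over THETA_GRID after n_with cases with disease out of n_total."""
    unnormalized = [
        g * t ** n_with * (1.0 - t) ** (n_total - n_with)
        for t, g in zip(THETA_GRID, prior)
    ]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def mass_between(posterior, low=0.1, high=0.3):
    """Posterior probability that Theta lies strictly between low and high."""
    return sum(p for t, p in zip(THETA_GRID, posterior) if low < t < high)


# Samples in which 20% of patients have the disease: (with disease, total).
samples = [(1, 5), (4, 20), (16, 80)]
concentration = {
    n_total: mass_between(grid_posterior(UNIFORM_PRIOR, n_with, n_total))
    for n_with, n_total in samples
}
for n_total, mass in concentration.items():
    print(n_total, round(mass, 3))
```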



Effect of Sample Size on Impact of the Prior Distribution

Prior distribution favors low probabilities:
[Figure: three bar charts on the grid 0.025, 0.075, ..., 0.975. Panels: "Prior Distribution"; "Posterior Distribution for 1 Case in 5 Samples"; "Posterior Distribution for 10 Cases in 50 Samples"]
Horizontal axis is $\theta = P(s_D)$; height of bar is probability that $\Theta = \theta$

Prior distribution favors high probabilities:
[Figure: three bar charts on the grid 0.025, 0.075, ..., 0.975. Panels: "Prior Distribution"; "Posterior Distribution for 1 Case in 5 Samples"; "Posterior Distribution for 10 Cases in 50 Samples"]
Horizontal axis is $\theta = P(s_D)$; height of bar is probability that $\Theta = \theta$

Bayesian inference shrinks posterior distribution toward prior expectations
Posterior distribution for small sample is very sensitive to prior distribution
Posterior distribution for larger sample is less sensitive to prior distribution
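
The sensitivity to the prior can be illustrated with two made-up priors. The slides do not give the priors behind the plots, so the linearly decreasing and linearly increasing weights below are assumptions chosen only to favor low and high probabilities respectively; the helper names are also hypothetical.

```python
THETA_GRID = [0.025 + 0.05 * k for k in range(20)]


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical priors: weight falls (or rises) linearly with theta.
prior_low = normalize([1.0 - t for t in THETA_GRID])   # favors low probabilities
prior_high = normalize([t for t in THETA_GRID])        # favors high probabilities


def posterior_mean(prior, n_with, n_total):
    """Posterior mean of Theta on the grid after n_with cases out of n_total."""
    posterior = normalize(
        [
            g * t ** n_with * (1.0 - t) ** (n_total - n_with)
            for t, g in zip(THETA_GRID, prior)
        ]
    )
    return sum(t * p for t, p in zip(THETA_GRID, posterior))


# Gap between the posterior means under the two priors, by sample size.
gap = {}
for n_with, n_total in ((1, 5), (10, 50)):
    low = posterior_mean(prior_low, n_with, n_total)
    high = posterior_mean(prior_high, n_with, n_total)
    gap[n_total] = high - low
    print(n_total, round(low, 3), round(high, 3), round(gap[n_total], 3))
```

The gap between the two posterior means is much smaller for the 50-patient sample than for the 5-patient sample, which is the pattern the plots describe.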

The Bayesian Resurgence


Bayesian inference is as old as probability
Bayesian view fell into disfavor in nineteenth and early
twentieth centuries
Positivism, empiricism, and quest for objectivity in science
Paradoxes and systematic deviation of human judgment from
Bayesian norm

There has been a recent resurgence


Computational advances make calculation possible for complex models
Bayesian models can coherently integrate many different kinds of
information

Physical cause and effect


Logical implication
Informed expert judgment
Empirical observation

Unified theory and methods for data-rich and data-poor problems


Clear connection to decision making


Some Concepts of Probability


Classical - Probability is a ratio of favorable cases to total (equipossible) cases
Frequency - Probability is the limiting value as the number of trials becomes infinite of the frequency of occurrence of some event
Logical - Probability is a logical property of one's state of information about a phenomenon
Propensity - Probability is a propensity for certain kinds of event to occur and is a property of physical systems
Subjectivist - Probability is an ideal rational agent's degree of belief about an uncertain event
Algorithmic - The algorithmic probability of a finite sequence is the probability that a universal computer fed a random input will give the sequence as output (related to Kolmogorov complexity)
Game Theoretic - Probability is an agent's optimal announced certainty for an event in a multi-agent game in which agents receive rewards that depend on both forecasts and outcomes

Probability really is none of these things.


Probability can represent all of these things.

Historical Notes
People have long noticed that some events are imperfectly predictable
Mathematical probability first arose to describe regularities in games of
chance
In the twentieth century it became clear that probability theory provided a
good model for a much broader class of problems:
Physical (thermodynamics; quantum mechanics)
Social (actuarial tables; sample surveys)
Industrial (equipment failures)

The subjectivist interpretation dates from the 18th century but fell out of
favor because of the positivist orientation of Western 19th and 20th
century science
Von Mises formulated a rigorous (and much-debated) frequency theory in
the mid-twentieth century.
Hierarchy of generality:

Classical interpretation is restricted to equipossible cases


Frequency interpretation is restricted to repeatable, random phenomena
Subjectivist interpretation applies to any event about which the agent is uncertain
Game theoretic interpretation applies even when probabilities are not true beliefs


The Frequentist
A frequentist believes:
Probability can be legitimately applied only to repeatable problems
Probability is an objective property in the real world
Probability applies only to random processes
Probabilities are associated only with collectives, not individual events

Frequentist Inference
Data are drawn from a distribution of known form but with an unknown parameter
(this includes nonparametric statistics in which the unknown parameter is the
distribution itself)
Often this distribution arises from explicit randomization (when not, statistician
argues that the procedure was close enough to randomized that inferences
apply)
Inferences regard the data as random and the parameter as fixed (even though
the data are known and the parameter is unknown)
For example: A sample $X_1, \ldots, X_N$ is drawn from a normal distribution with mean $\theta$.
A 95% confidence interval is constructed. The interpretation is:
If an experiment like this were performed many times, we would expect in 95% of the cases
that an interval calculated by the procedure we applied would include the true value of $\theta$.

A frequentist can say nothing about this experiment or about what we should believe about $\theta$!
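
The repeated-experiment reading of a confidence interval can be checked by simulation. The sketch below is illustrative only: it assumes a known standard deviation (so the interval is the simple z-interval), and the true mean, sample size, and function names are hypothetical choices, not taken from the course materials.

```python
import random
from statistics import NormalDist, mean


def ci_covers(theta, sigma, n, rng, level=0.95):
    """Draw one sample of size n and report whether its z-interval covers theta."""
    sample = [rng.gauss(theta, sigma) for _ in range(n)]
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    half_width = z * sigma / n ** 0.5          # sigma is assumed known here
    xbar = mean(sample)
    return xbar - half_width <= theta <= xbar + half_width


def coverage(theta=4.3, sigma=1.0, n=10, reps=5000, seed=0):
    """Fraction of repeated experiments whose interval contains the true theta."""
    rng = random.Random(seed)
    hits = sum(ci_covers(theta, sigma, n, rng) for _ in range(reps))
    return hits / reps


if __name__ == "__main__":
    print(f"Fraction of intervals covering theta: {coverage():.3f}")
```

The fraction should come out close to 0.95. Note that this is a statement about the procedure over many repetitions; it does not assign a probability to any single computed interval.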

The Subjectivist
A subjectivist believes:
Probability is an expression of a rational agent's degrees of belief about
uncertain propositions.
Rational agents may disagree. There is no one correct probability.
If the agent receives feedback, her assessed probabilities will in the limit
converge to observed frequencies

Subjectivist Inference:
Probability distributions are assigned to the unknown parameters and to
the observations given the unknown parameters.
Condition on knowns; use probability to express uncertainty about
unknowns
For example: A sample $X_1, \ldots, X_N$ is drawn from a normal distribution with
mean $\theta$ having prior distribution $g(\theta)$. A 95% posterior credible interval is
constructed, and the result is the interval (3.7, 4.9). The interpretation is:
Given the prior distribution for $\theta$ and the observed data, the probability that $\theta$ lies
between 3.7 and 4.9 is 95%.

A subjectivist can draw conclusions about what we should believe about $\theta$
and about what we should expect on the next trial
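
A posterior credible interval of this kind can be computed in closed form in the simplest conjugate case. The sketch below assumes a normal prior on $\theta$ and a known observation standard deviation; the data values, prior parameters, and function names are hypothetical and are not intended to reproduce the interval (3.7, 4.9) quoted above.

```python
from statistics import NormalDist, mean


def normal_posterior(data, sigma, prior_mean, prior_sd):
    """Posterior for theta under a normal prior and normal data with known sigma."""
    n = len(data)
    prior_prec = 1 / prior_sd ** 2          # precision = 1 / variance
    data_prec = n / sigma ** 2
    post_prec = prior_prec + data_prec      # precisions add
    post_mean = (prior_prec * prior_mean + data_prec * mean(data)) / post_prec
    return NormalDist(post_mean, post_prec ** -0.5)


def credible_interval(dist, level=0.95):
    """Central interval containing `level` posterior probability."""
    tail = (1 - level) / 2
    return dist.inv_cdf(tail), dist.inv_cdf(1 - tail)


data = [4.1, 4.6, 3.9, 4.8, 4.2, 4.4]       # hypothetical observations
posterior = normal_posterior(data, sigma=0.7, prior_mean=4.0, prior_sd=2.0)
lo, hi = credible_interval(posterior)

if __name__ == "__main__":
    print(f"Posterior mean: {posterior.mean:.3f}")
    print(f"95% credible interval: ({lo:.2f}, {hi:.2f})")
```

The posterior mean is a precision-weighted average of the prior mean and the sample mean, so it always falls between the two. Unlike the confidence interval, the resulting interval is a direct probability statement about $\theta$ given this prior and these data.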


Comparison: Understandability, Subjectivity and Honest Reporting
Often the Bayesian answer is what the decision maker really wants
to hear.
Untrained people often interpret results in the Bayesian way.
Frequentists are disturbed by the dependence of the posterior
interval on the subjective prior distribution.
"It is more important that stochastics provides a means of communication
among researchers whose personal beliefs about the phenomena under study
may differ. If these beliefs are allowed to contaminate the reporting of results,
how are the results of different researchers to be compared?"
- H. Dinges

Bayesians say the prior distribution is not the only subjective
element in an analysis. Assumptions about the sampling distribution
are often also subjective.
Bayesian probability statements are always subjective, but statistical
analyses are often done for public consumption. Whose probability
distribution should be reported?
When there are enough data, a good Bayesian analysis and a good frequentist
analysis will typically be in close agreement
If the results are sensitive to the prior distribution, a Bayesian analyst is
obligated to report this sensitivity, and to present the range of results obtained
from a range of prior distributions
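
A prior sensitivity report of the kind described above can be sketched with a simpler model than the normal example: a Beta prior on a success probability with binomial data, where the posterior mean has a closed form. The three priors, the data counts, and the function names below are hypothetical illustrations.

```python
def posterior_mean(a, b, successes, trials):
    """Posterior mean of a success probability under a Beta(a, b) prior."""
    return (a + successes) / (a + b + trials)


# Hypothetical range of prior opinions about the success probability.
PRIORS = {
    "uniform Beta(1, 1)": (1, 1),
    "skeptical Beta(2, 8)": (2, 8),
    "optimistic Beta(8, 2)": (8, 2),
}


def sensitivity(successes, trials):
    """Posterior mean under each prior, plus the spread between extremes."""
    means = {
        name: posterior_mean(a, b, successes, trials)
        for name, (a, b) in PRIORS.items()
    }
    spread = max(means.values()) - min(means.values())
    return means, spread


small_means, small_spread = sensitivity(successes=3, trials=10)
large_means, large_spread = sensitivity(successes=300, trials=1000)

if __name__ == "__main__":
    for label, (means, spread) in {
        "n = 10": (small_means, small_spread),
        "n = 1000": (large_means, large_spread),
    }.items():
        print(label)
        for name, m in means.items():
            print(f"  {name}: posterior mean {m:.3f}")
        print(f"  spread across priors: {spread:.3f}")
```

With ten observations the posterior means differ substantially across priors, which is the sensitivity an analyst would be obligated to report. With a thousand observations at the same success rate, the priors nearly wash out.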

Comparison: Generality
Subjectivists can handle problems the frequentist approach
cannot (in particular, problems with not enough data for sound
frequentist inference).
Frequentist statisticians say this comes at a price: when
there are not enough data, the result will be highly dependent
on the prior distribution.
Subjectivists often apply frequentist techniques but with a
Bayesian interpretation
Frequentists often apply Bayesian methods if they have good
frequency properties


Axioms for Probability


De Groot, 1970

There is a qualitative relationship of relative likelihood, $\succeq$, that operates on pairs of
events and satisfies the following conditions:
SP1. For any two uncertain events $A$ and $B$, one of the following relations holds:
$A \succ B$, $A \prec B$, or $A \sim B$.
SP2. If $A_1$, $A_2$, $B_1$, and $B_2$ are four events such that $A_1 \cap A_2 = \emptyset$,
$B_1 \cap B_2 = \emptyset$, and if $A_i \preceq B_i$ for $i = 1, 2$, then
$A_1 \cup A_2 \preceq B_1 \cup B_2$. If in addition $A_i \prec B_i$ for either $i = 1$ or
$i = 2$, then $A_1 \cup A_2 \prec B_1 \cup B_2$.
SP3. If $A$ is any event, then $\emptyset \preceq A$. Furthermore, there is some event $A_0$
for which $\emptyset \prec A_0$.
SP4. If $A_1 \supset A_2 \supset \cdots$ is a decreasing sequence of events, and $B$ is some event
such that $A_i \succeq B$ for $i = 1, 2, \ldots$, then $\bigcap_{i=1}^{\infty} A_i \succeq B$.
SP5. There is an experiment, with a numerical outcome between the values of 0 and
1, such that if $A_i$ is the event that the outcome $x$ lies within the interval
$a_i \le x \le b_i$, for $i = 1, 2$, then $A_1 \preceq A_2$ if and only if
$(b_1 - a_1) \le (b_2 - a_2)$.


Axioms for Utility


Watson and Buede, 1987

A reward is a prize the decision maker cares about. A lottery is a situation in which
the decision maker will receive one of the possible rewards, where the reward to be
received is governed by a probability distribution. There is a qualitative relationship
of relative preference, $\succeq^*$, that operates on lotteries and satisfies the following
conditions:
SU1. For any two lotteries $L_1$ and $L_2$, either $L_1 \succ^* L_2$, $L_1 \prec^* L_2$, or
$L_1 \sim^* L_2$. Furthermore, if $L_1$, $L_2$, and $L_3$ are any lotteries such that
$L_1 \succeq^* L_2$ and $L_2 \succeq^* L_3$, then $L_1 \succeq^* L_3$.
SU2. If $r_1$, $r_2$, and $r_3$ are rewards such that $r_1 \succ^* r_2 \succ^* r_3$, then there
exists a probability $p$ such that $[r_1: p;\ r_3: (1-p)] \sim^* r_2$, where
$[r_1: p;\ r_3: (1-p)]$ is a lottery that pays $r_1$ with probability $p$ and $r_3$ with
probability $(1-p)$.
SU3. If $r_1 \sim^* r_2$ are rewards, then for any probability $p$ and any reward $r_3$,
$[r_1: p;\ r_3: (1-p)] \sim^* [r_2: p;\ r_3: (1-p)]$.
SU4. If $r_1 \succ^* r_2$ are rewards, then
$[r_1: p;\ r_2: (1-p)] \succ^* [r_1: q;\ r_2: (1-q)]$ if and only if $p > q$.
SU5. Consider three lotteries, $L_i = [r_1: p_i;\ r_2: (1-p_i)]$, $i = 1, 2, 3$, giving different
probabilities of the two rewards $r_1$ and $r_2$. Suppose lottery $M$ gives entry to
lottery $L_2$ with probability $q$ and $L_3$ with probability $1-q$. Then $L_1 \sim^* M$ if and
only if $p_1 = q p_2 + (1-q) p_3$.
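
The condition in SU5 says a compound lottery is judged by its overall chance of paying the better reward. A minimal arithmetic sketch, with hypothetical values for q, p2, and p3 and a hypothetical helper name, uses exact fractions so the equality can be checked without rounding:

```python
from fractions import Fraction


def reduce_compound(q, p2, p3):
    """Overall probability of reward r1 when lottery M enters L2 with
    probability q and L3 with probability 1 - q."""
    return q * p2 + (1 - q) * p3


# Hypothetical values: M enters L2 one time in four.
q = Fraction(1, 4)
p2 = Fraction(4, 5)   # chance of r1 inside L2
p3 = Fraction(2, 5)   # chance of r1 inside L3

# SU5: a simple lottery L1 is indifferent to M exactly when p1 equals this.
p1 = reduce_compound(q, p2, p3)

if __name__ == "__main__":
    print(f"p1 = {p1}")   # 1/4 * 4/5 + 3/4 * 2/5 = 1/5 + 3/10 = 1/2
```

Under these numbers, a decision maker satisfying SU5 is indifferent between M and a simple lottery paying r1 with probability one half.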

Probabilities and Utilities


If your beliefs satisfy SP1-SP5, then there is a probability
distribution $\Pr(\cdot)$ over events such that for any two events $A_1$
and $A_2$, $\Pr(A_1) \ge \Pr(A_2)$ if and only if $A_1 \succeq A_2$.
If your preferences satisfy SU1-SU5, then there is a utility
function $u(\cdot)$ defined on rewards such that for any two lotteries
$L_1$ and $L_2$, $L_1 \succeq^* L_2$ if and only if $E[u(L_1)] \ge E[u(L_2)]$, where
$E[\cdot]$ denotes the expected value with respect to the probability
distribution $\Pr(\cdot)$.
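
The representation result means lotteries can be ranked by comparing expected utilities. The sketch below is a hypothetical illustration: the reward labels, the utility values, and the function names are assumptions. The middle utility is scaled the way SU2 suggests, as the probability p at which the decision maker is indifferent between r2 for sure and a lottery over the best and worst rewards.

```python
def expected_utility(lottery, u):
    """Expected utility of a lottery given as (reward, probability) pairs."""
    total_prob = sum(p for _, p in lottery)
    if abs(total_prob - 1.0) > 1e-9:
        raise ValueError("lottery probabilities must sum to 1")
    return sum(p * u[reward] for reward, p in lottery)


def weakly_prefers(lottery_a, lottery_b, u):
    """True when lottery_a is at least as good as lottery_b in expected utility."""
    return expected_utility(lottery_a, u) >= expected_utility(lottery_b, u)


# Hypothetical utilities: best reward scaled to 1, worst to 0, and
# u(r2) = 0.6 taken as the assumed SU2 indifference probability.
u = {"r1": 1.0, "r2": 0.6, "r3": 0.0}

gamble = [("r1", 0.5), ("r3", 0.5)]   # even chance of best or worst reward
sure_thing = [("r2", 1.0)]            # middle reward for certain

if __name__ == "__main__":
    print("E[u(gamble)]     =", expected_utility(gamble, u))
    print("E[u(sure_thing)] =", expected_utility(sure_thing, u))
    print("Prefers sure thing:", weakly_prefers(sure_thing, gamble, u))
```

With these assumed utilities the sure middle reward is preferred to the even gamble. Raising the gamble's chance of r1 above 0.6 would reverse the ranking.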


Myth of the Cold-Hearted Rationalist


A common criticism of decision theory is that its adherents are cold-hearted
technocrats who care about numbers and not about what really matters
They would put a dollar value on a human life
They would send people to possible death on the basis of utilitarian calculations
And so on

The applied decision theorist's response:
These kinds of tradeoffs are unquestionably difficult
Whether we quantify them or not, as a society and as individuals we make them all the time
Our choices will be irrational and capricious if we approach them without a principled methodology
Refusing to think about the tradeoffs will only ensure that they will be addressed haphazardly
and/or by back-door manipulation by powerful special interests
As a society we need open debate and discussion of the difficult tradeoffs we are forced to make.
Decision theory provides a justifiable, communicable framework for doing so:
disagreements about fact are clearly separated from disagreements about value
inconsistencies can be spotted, discussed, and resolved
commonly recurring problems need not be revisited once consensus has been reached

Decision theory can be misused if models are sloppily built and leave out important
elements
When a group or society has not reached consensus there is no clear best choice
Explicitly modeling subjective elements of a problem provides a framework for
informed debate

Unit 1: Summary and Synthesis


The inventors of probability theory thought of it as a logic of
enlightened rational reasoning. In the nineteenth century this was
replaced by a view of probability as measuring objective
propensities of intrinsically random phenomena
The twentieth century has seen a resurgence of interest in subjective
probability and an increased understanding of the appropriate role of
subjectivity in science
Bayesian methods often require more computational power than
traditional frequentist methods
The computer revolution has enabled the Bayesian resurgence
Most statistics texts and courses take a frequentist approach but this
is changing
Bayesian decision theory provides methodology for rational choice
under uncertainty
Bayesian statistics is a theory of rational belief dynamics
We took a broad-brush tour of Bayesian methodology
We applied Bayesian thinking to a simple example that illustrates
many of the concepts we will be learning this semester

References for Unit 1

Bayes, Thomas. An essay towards solving a problem in the doctrine of chances.
Philosophical Transactions of the Royal Society of London, 53:370-418, 1763.
Bashir, S.A., Getting Started in R, http://www.luchsinger-mathematics.ch/Bashir.pdf
Dawid, A.P. and Vovk, V.G. (1999), Prequential Probability: Principles and Properties,
Bernoulli, 5: 125-162.
de Finetti, Bruno. Theory of Probability: A Critical Introductory Treatment. New York:
Wiley, 1974.
Gelman, et al., Chapter 1
Hájek, Alan, "Interpretations of Probability", The Stanford Encyclopedia of Philosophy
(Summer 2003 Edition), Edward N. Zalta (ed.),
URL = <http://plato.stanford.edu/archives/sum2003/entries/probability-interpret/>.
Lee, Chapter 1
Li, Ming and Vitanyi, Paul. An Introduction to Kolmogorov Complexity and Its
Applications. (2nd ed) Springer-Verlag, 2005.
Nau, Robert F. (1999), Arbitrage, Incomplete Models, And Interactive Rationality,
working paper, Fuqua School of Business, Duke University.
Neapolitan, R. Learning Bayesian Networks, Prentice Hall, 2003.
Jaynes, E., Probability Theory: The Logic of Science, Cambridge University Press,
2003.
Savage, L.J., The Foundations of Statistics. Dover, 1972.
Shafer, G. Probability and Finance: It's Only a Game, Wiley, 2001.
von Mises, R., 1957, Probability, Statistics and Truth, revised English edition, New
York: Macmillan.
