# Module 1: Foundations

Contents

1 How random is the flip of a coin?
  1.1 “It’s a toss-up”
  1.2 Probability and Statistics are two sides of the same coin
  1.3 The Bayesian approach
2 Beta-Bernoulli model
  2.1 Bernoulli distribution
  2.2 Beta distribution
  2.3 The posterior
3 The cast of characters
  3.1 Marginal likelihood and posterior predictive
  3.2 Example: Beta-Bernoulli
4 Decision theory
  4.1 The basics of Bayesian decision theory
  4.2 Real-world decision problems
  4.3 Example: Resource allocation for disease prevention/treatment
  4.4 Frequentist risk and Integrated risk
5 Exercises

## 1 How random is the flip of a coin?

### 1.1 “It’s a toss-up”

• It is so widely assumed that a coin toss comes up heads half of the time that it has
even become a standard metaphor for two events with equal probability.
• But think about it—is it really 50-50? Suppose we always flip a coin starting with
heads up. Could the outcome actually be biased toward either heads or tails?
• Assume the coin is physically symmetric. Since an “in-flight” coin is a relatively simple
physical system, the outcome should be essentially determined by the initial conditions
at the beginning of its trajectory. So, any randomness comes from the flipper, not the
flipping.


[Figure: portraits, with captions “P(this = Bayes | data) < 1”, “Richard Price”, and “Pierre-Simon Laplace”.]

• The idea is to assume a prior probability distribution for $\theta$; that is, a distribution representing the plausibility of each possible value of $\theta$ before the data is observed. Then, to make inferences about $\theta$, one simply considers the conditional distribution of $\theta$ given the observed data. This is referred to as the posterior distribution, since it represents the plausibility of each possible value of $\theta$ after seeing the data.
• Mathematically, this is expressed via Bayes’ theorem,
$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)} \propto p(x \mid \theta)\, p(\theta), \tag{1.1}$$
where $x$ is the observed data (for example, $x = x_{1:n}$). In words, we say “the posterior is proportional to the likelihood times the prior”. Bayes’ theorem is essentially just the definition of conditional probability,
$$\mathbb{P}(B \mid A) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(A)} = \frac{\mathbb{P}(A \mid B)\, \mathbb{P}(B)}{\mathbb{P}(A)},$$
extended to conditional densities.
• More generally, the Bayesian approach, in a nutshell, is to assume a prior distribution on any unknowns, and then just follow the rules of probability to answer any questions of interest. This provides a coherent framework for making inferences about unknown parameters $\theta$ as well as any future data or missing data, and for making rational decisions based on such inferences.

[Figure: portrait of Jacob Bernoulli, captioned “not in a bad mood, everyone is just annoying”.]

## 2 Beta-Bernoulli model

We now formally explore a Bayesian approach to the coin flipping problem.

### 2.1 Bernoulli distribution

• The Bernoulli distribution models binary outcomes, i.e., outcomes taking two possible values. The convention is to use 0 and 1.
• It is named for Jacob Bernoulli (1655–1705), who is known for various inconsequential trivialities such as developing the foundations of probability theory (including the law of large numbers), combinatorics, differential equations, and the calculus of variations. Oh, and he discovered the constant $e$.
• The Bernoulli distribution shows up everywhere, due to the ubiquity of binary outcomes (for instance, in computer vision, neuroscience, demographics and polling, public health and epidemiology, etc.).
• We write $X \sim \mathrm{Bernoulli}(\theta)$ to mean that
$$\mathbb{P}(X = x \mid \theta) = \begin{cases} \theta & \text{if } x = 1 \\ 1 - \theta & \text{if } x = 0 \end{cases}$$
and is 0 otherwise. In other words, the p.m.f. (probability mass function) is
$$p(x \mid \theta) = \mathbb{P}(X = x \mid \theta) = \theta^{x} (1 - \theta)^{1 - x}\, \mathbf{1}(x \in \{0, 1\}).$$
• The mean (or expectation) is $\mathbb{E}X = \sum_{x \in \{0,1\}} x\, p(x \mid \theta) = \theta$.
• Notation: $\mathbb{P}$ denotes “the probability of”, and $\mathbb{E}$ denotes the expectation. The indicator function, $\mathbf{1}(E)$, equals 1 when $E$ is true and is 0 otherwise. The symbol $\in$ means “belongs to the set”.
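As a quick check of the definitions in Section 2.1, the following minimal sketch evaluates the Bernoulli p.m.f. and computes the mean directly from the sum over $x \in \{0, 1\}$. The function names and the value of `theta` are illustrative choices, not part of the notes.

```python
def bernoulli_pmf(x, theta):
    """p(x | theta) = theta^x (1 - theta)^(1 - x) for x in {0, 1}, and 0 otherwise."""
    if x not in (0, 1):
        return 0.0
    return theta ** x * (1.0 - theta) ** (1 - x)


def bernoulli_mean(theta):
    """E X computed from the definition: sum over x in {0, 1} of x * p(x | theta)."""
    return sum(x * bernoulli_pmf(x, theta) for x in (0, 1))


theta = 0.3  # illustrative value, not from the notes
print(bernoulli_pmf(1, theta), bernoulli_pmf(0, theta), bernoulli_pmf(2, theta))
print(bernoulli_mean(theta))
```

The indicator in the p.m.f. corresponds to the early return of 0 for any `x` outside {0, 1}.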

• If $X_1, \dots, X_n \overset{\text{iid}}{\sim} \mathrm{Bernoulli}(\theta)$ then for $x_1, \dots, x_n \in \{0, 1\}$,
$$\begin{aligned}
p(x_{1:n} \mid \theta) &= \mathbb{P}(X_1 = x_1, \dots, X_n = x_n \mid \theta) \\
&= \prod_{i=1}^{n} \mathbb{P}(X_i = x_i \mid \theta) \\
&= \prod_{i=1}^{n} p(x_i \mid \theta) \\
&= \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i} \\
&= \theta^{\sum x_i} (1 - \theta)^{n - \sum x_i}.
\end{aligned} \tag{2.1}$$
• Viewed as a function of $\theta$, $p(x_{1:n} \mid \theta)$ is called the likelihood function. It is sometimes denoted $L(\theta; x_{1:n})$ to emphasize this. Viewed as a distribution on $x_{1:n}$, we will refer to this as the generator or generating distribution (sometimes it is referred to as the “sampling distribution”, but this becomes ambiguous when one is also sampling from the posterior).

### 2.2 Beta distribution

[Figure: Some Beta p.d.f.s.]

• Bayes used a uniform prior on $\theta$, which is a special case of the beta distribution.
• Given $a, b > 0$, we write $\boldsymbol{\theta} \sim \mathrm{Beta}(a, b)$ to mean that $\boldsymbol{\theta}$ has p.d.f. (probability density function)
$$p(\theta) = \mathrm{Beta}(\theta \mid a, b) = \frac{1}{B(a, b)}\, \theta^{a - 1} (1 - \theta)^{b - 1}\, \mathbf{1}(0 < \theta < 1). \tag{2.2}$$

That is, $p(\theta) \propto \theta^{a - 1} (1 - \theta)^{b - 1}$ on the interval from 0 to 1. Here, $B(a, b)$ is Euler’s beta function.
• The mean is $\mathbb{E}\boldsymbol{\theta} = \int \theta\, p(\theta)\, d\theta = a / (a + b)$.

#### Notation

• $f(x) \propto g(x)$ (“$f$ is proportional to $g$”) means there is a constant $c$ such that $f(x) = c\, g(x)$ for all $x$. For functions of multiple variables, say $x$ and $y$, we write $\underset{x}{\propto}$ to indicate proportionality with respect to $x$ only. This simple device is surprisingly useful for deriving posterior distributions.
• Usually, we use capital letters to denote random variables (e.g., $X$) and lowercase for particular values (e.g., $x$). However, in the case of theta, we will use bold font to denote the random variable $\boldsymbol{\theta}$, and unbold for particular values $\theta$.
• We will usually use $p$ for all p.d.f.s and p.m.f.s, following the usual convention that the symbol used (e.g., the $\theta$ in the expression $p(\theta)$) indicates which random variable we are talking about.

### 2.3 The posterior

[Figure: “A posteriori”.]

• Using Bayes’ theorem (Equation 1.1), and plugging in the likelihood (Equation 2.1) and the prior (Equation 2.2), the posterior is
$$\begin{aligned}
p(\theta \mid x_{1:n}) &\propto p(x_{1:n} \mid \theta)\, p(\theta) \\
&= \theta^{\sum x_i} (1 - \theta)^{n - \sum x_i}\, \frac{1}{B(a, b)}\, \theta^{a - 1} (1 - \theta)^{b - 1}\, \mathbf{1}(0 < \theta < 1) \\
&\propto \theta^{a + \sum x_i - 1} (1 - \theta)^{b + n - \sum x_i - 1}\, \mathbf{1}(0 < \theta < 1) \\
&\propto \mathrm{Beta}\big(\theta \mid a + \textstyle\sum x_i,\; b + n - \sum x_i\big).
\end{aligned} \tag{2.3}$$

• So, the posterior has the same form (a Beta distribution) as the prior! When this happens, we say that the prior is conjugate (more on this later).
• Since the posterior has such a nice form, it is easy to work with, e.g., for computing certain integrals with respect to the posterior, sampling from the posterior, and computing the posterior p.d.f. and its derivatives.

#### Example

• Suppose we choose $a = 1$ and $b = 1$, so that the prior is uniform. As a simulation to see how the posterior behaves, let’s generate data $X_1, \dots, X_n \overset{\text{iid}}{\sim} \mathrm{Bernoulli}(\theta_0)$ with $\theta_0 = 0.51$.
• Figure 1 shows the posterior p.d.f. for increasing amounts of data. (Note that this will be different each time the experiment is run, because the samples will be different.)

[Figure 1: Posterior densities. The dotted line shows the true value of theta.]
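The simulation described in this example can be sketched in a few lines. This is a minimal illustration rather than the code behind Figure 1: the seed, the sample sizes, and the function names are arbitrary choices, and the standard deviation uses the standard Beta variance formula $ab / ((a+b)^2 (a+b+1))$, which is not derived in these notes.

```python
import math
import random


def posterior_params(a, b, xs):
    """Return (a_n, b_n) of the Beta posterior for a Beta(a, b) prior and 0/1 data xs."""
    s = sum(xs)
    return a + s, b + len(xs) - s


def beta_mean_sd(a, b):
    """Mean a / (a + b) and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)


random.seed(0)      # arbitrary seed, only so that a run is repeatable
theta0 = 0.51       # value used to generate the data, as in the example
a, b = 1.0, 1.0     # uniform prior
data = [1 if random.random() < theta0 else 0 for _ in range(1000)]

for n in (0, 10, 100, 1000):
    a_n, b_n = posterior_params(a, b, data[:n])
    mean, sd = beta_mean_sd(a_n, b_n)
    print(f"n={n:5d}  posterior=Beta({a_n:.0f}, {b_n:.0f})  mean={mean:.3f}  sd={sd:.3f}")
```

The posterior standard deviation shrinks as $n$ grows, which is the concentration around $\theta_0$ that the posterior densities display; changing the seed changes the individual numbers but not this pattern.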

## 3 The cast of characters

Here’s a list of the mathematical objects we will most frequently encounter. So far, we’ve seen the likelihood, prior, and posterior. In the rest of this module, we will get acquainted with the rest of them. Here, we denote the observed data by $x$, noting that this may consist of many data points, e.g., $x = x_{1:n} = (x_1, \dots, x_n)$.

| Object | Name |
|---|---|
| $p(x \mid \theta)$ | generator / likelihood |
| $p(\theta)$ | prior |
| $p(\theta \mid x)$ | posterior |
| $p(x)$ | marginal likelihood |
| $p(x_{n+1} \mid x_{1:n})$ | posterior predictive |
| $\ell(s, a)$ | loss function |
| $\rho(a, x)$ | posterior expected loss |
| $R(\theta, \delta)$ | risk / frequentist risk |
| $r(\delta)$ | integrated risk |

### 3.1 Marginal likelihood and posterior predictive

The marginal likelihood is
$$p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta,$$
i.e., it is the marginal p.d.f./p.m.f. of the observed data, obtained by integrating $\theta$ out of the joint density $p(x, \theta) = p(x \mid \theta)\, p(\theta)$. When $\theta$ is a vector, this will be a multi-dimensional integral.

When the data is a sequence $x = (x_1, \dots, x_n)$, the posterior predictive distribution is the distribution of $X_{n+1}$ given $X_{1:n} = x_{1:n}$. When $X_1, \dots, X_n, X_{n+1}$ are independent given $\boldsymbol{\theta} = \theta$, the posterior predictive p.d.f./p.m.f. is given by
$$\begin{aligned}
p(x_{n+1} \mid x_{1:n}) &= \int p(x_{n+1}, \theta \mid x_{1:n})\, d\theta \\
&= \int p(x_{n+1} \mid \theta, x_{1:n})\, p(\theta \mid x_{1:n})\, d\theta \\
&= \int p(x_{n+1} \mid \theta)\, p(\theta \mid x_{1:n})\, d\theta.
\end{aligned}$$

### 3.2 Example: Beta-Bernoulli

If $\boldsymbol{\theta} \sim \mathrm{Beta}(a, b)$ and $X_1, \dots, X_n \mid \boldsymbol{\theta} = \theta$ are i.i.d. $\mathrm{Bernoulli}(\theta)$ (as in Section 2), then the marginal likelihood is
$$\begin{aligned}
p(x_{1:n}) &= \int p(x_{1:n} \mid \theta)\, p(\theta)\, d\theta \\
&= \int_0^1 \theta^{\sum x_i} (1 - \theta)^{n - \sum x_i}\, \frac{1}{B(a, b)}\, \theta^{a - 1} (1 - \theta)^{b - 1}\, d\theta \\
&= \frac{B\big(a + \sum x_i,\; b + n - \sum x_i\big)}{B(a, b)}.
\end{aligned}$$
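The closed form for the Beta-Bernoulli marginal likelihood can be sanity-checked against a direct numerical integration of likelihood times prior. The sketch below assumes illustrative data and prior parameters (not taken from the notes) and uses a midpoint Riemann sum; the function names are hypothetical.

```python
import math


def log_beta(a, b):
    """Log of Euler's beta function B(a, b), computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def marginal_likelihood(xs, a, b):
    """Closed form: B(a + sum(xs), b + n - sum(xs)) / B(a, b)."""
    s, n = sum(xs), len(xs)
    return math.exp(log_beta(a + s, b + n - s) - log_beta(a, b))


def marginal_likelihood_numeric(xs, a, b, N=20000):
    """Midpoint Riemann sum of p(x | theta) p(theta) over theta in (0, 1)."""
    s, n = sum(xs), len(xs)
    total = 0.0
    for i in range(1, N + 1):
        t = (i - 0.5) / N
        likelihood = t ** s * (1.0 - t) ** (n - s)
        prior = math.exp((a - 1) * math.log(t) + (b - 1) * math.log(1.0 - t) - log_beta(a, b))
        total += likelihood * prior
    return total / N


xs = [1, 0, 1, 1]     # illustrative data, not from the notes
a, b = 2.0, 3.0       # illustrative prior parameters
print(marginal_likelihood(xs, a, b), marginal_likelihood_numeric(xs, a, b))
```

The two values should agree closely for smooth integrands such as this one. When $a < 1$ or $b < 1$ the prior density is unbounded at an endpoint, so the Riemann sum converges slowly and the closed form is the safer choice.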

The last equality in the marginal likelihood above follows from the integral definition of the Beta function. Letting $a_n = a + \sum x_i$ and $b_n = b + n - \sum x_i$ for brevity, and using the fact (from Equation 2.3) that $p(\theta \mid x_{1:n}) = \mathrm{Beta}(\theta \mid a_n, b_n)$,
$$\begin{aligned}
\mathbb{P}(X_{n+1} = 1 \mid x_{1:n}) &= \int \mathbb{P}(X_{n+1} = 1 \mid \theta)\, p(\theta \mid x_{1:n})\, d\theta \\
&= \int \theta\, \mathrm{Beta}(\theta \mid a_n, b_n)\, d\theta = \frac{a_n}{a_n + b_n},
\end{aligned}$$
hence, the posterior predictive p.m.f. is
$$p(x_{n+1} \mid x_{1:n}) = \frac{a_n^{x_{n+1}}\, b_n^{1 - x_{n+1}}}{a_n + b_n}\, \mathbf{1}(x_{n+1} \in \{0, 1\}).$$

[Figure: portraits of Blaise Pascal and Abraham Wald.]

## 4 Decision theory

In decision theory, we start with the end in mind: how are we actually going to use our inferences and what consequences will this have? The basic goal is to minimize loss (or equivalently, to maximize utility/gain). While there are multiple ways of making this precise, here we consider the standard Bayesian approach, which is to minimize posterior expected loss.

A famous early example of decision-theoretic reasoning is Pascal’s Wager, in which Blaise Pascal (1623–1662) suggested the following argument for believing in God: If God exists then one will reap either an infinite gain or infinite loss (eternity in heaven or hell), depending on whether one believes or not; meanwhile, if he does not exist, the gain or loss is finite. Thus, he reasoned, no matter how small the probability that God exists, the rational decision is to believe. Pascal’s loss function can be represented by the following matrix, in which 1 indicates existence, 0 indicates non-existence, and $\alpha$, $\beta$ are finite values:

| Belief \ Truth | 0 | 1 |
|---|---|---|
| 0 | $\alpha$ | $\infty$ |
| 1 | $\beta$ | $-\infty$ |

In statistics, loss functions were used in a limited way during the 1700s and 1800s (most notably Laplace’s absolute error and Gauss’ quadratic error), but the real developments would have to wait until the 1900s.

The father of statistical decision theory was Abraham Wald (1902–1950). Wald was born in Austria-Hungary, and moved to the United States after the annexation of Austria into Germany in 1938. In 1939, he published a groundbreaking paper establishing the foundations of modern statistical decision theory. Wald also developed sequential analysis, made significant contributions to econometrics and geometry, and provided an important statistical analysis of aircraft vulnerability during World War II.

### 4.1 The basics of Bayesian decision theory

• The general setup is that there is some unknown state $S$ (a.k.a. the state of nature), we receive an observation $x$, we take an action $a$, and we incur a real-valued loss $\ell(S, a)$.

| Symbol | Meaning |
|---|---|
| $S$ | state (unknown) |
| $x$ | observation (known) |
| $a$ | action |
| $\ell(s, a)$ | loss |

• In the Bayesian approach, $S$ is a random variable, the distribution of $x$ depends on $S$, and the optimal decision is to choose an action $a$ that minimizes the posterior expected loss,
$$\rho(a, x) = \mathbb{E}(\ell(S, a) \mid x).$$
In other words, $\rho(a, x) = \sum_s \ell(s, a)\, p(s \mid x)$ if $S$ is a discrete random variable, while if $S$ is continuous then the sum is replaced by an integral.
• A decision procedure $\delta$ is a systematic way of choosing actions $a$ based on observations $x$. Typically, this is a deterministic function $a = \delta(x)$ (but sometimes introducing some randomness into $a$ can be useful).
• A Bayes procedure is a decision procedure that chooses an $a$ minimizing the posterior expected loss $\rho(a, x)$, for each $x$.
• Note: Sometimes the loss is restricted to be nonnegative, to avoid certain pathologies.
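For a discrete state, the posterior expected loss and the resulting Bayes action can be computed by direct enumeration. The sketch below is only an illustration of the definitions in Section 4.1: the posterior probabilities and both loss functions are made-up values, and the function names are hypothetical.

```python
def posterior_expected_loss(action, posterior, loss):
    """rho(a, x) = sum over s of loss(s, a) * p(s | x), for a discrete state S."""
    return sum(loss(s, action) * p for s, p in posterior.items())


def bayes_action(actions, posterior, loss):
    """Choose an action that minimizes the posterior expected loss."""
    return min(actions, key=lambda a: posterior_expected_loss(a, posterior, loss))


# Hypothetical posterior p(s | x) over a binary state (made-up numbers).
posterior = {0: 0.3, 1: 0.7}


def zero_one_loss(s, a):
    """Loss 0 if the action matches the state, 1 otherwise."""
    return 0.0 if s == a else 1.0


def asymmetric_loss(s, a):
    """Hypothetical loss: choosing a = 1 when s = 0 costs five times more than the reverse."""
    if s == a:
        return 0.0
    return 5.0 if (s == 0 and a == 1) else 1.0


print(bayes_action([0, 1], posterior, zero_one_loss))
print(bayes_action([0, 1], posterior, asymmetric_loss))
```

With the same posterior, the two losses lead to different actions, which illustrates that the Bayes procedure depends on the loss function as well as on the posterior.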

#### Example 1: Estimating $\theta$, with quadratic loss

• Setup:
  – State: $S = \boldsymbol{\theta}$
  – Observation: $x = x_{1:n}$
  – Action: $a = \hat{\theta}$
  – Loss: $\ell(\theta, \hat{\theta}) = (\theta - \hat{\theta})^2$ (quadratic loss, a.k.a. square loss)
• Using quadratic loss here works out nicely, since the optimal decision is simply to estimate $\theta$ by the posterior mean; in other words, to choose $\hat{\theta} = \delta(x_{1:n}) = \mathbb{E}(\boldsymbol{\theta} \mid x_{1:n})$.
• To see why, note that $\ell(\theta, \hat{\theta}) = \theta^2 - 2\theta\hat{\theta} + \hat{\theta}^2$, and thus
$$\rho(\hat{\theta}, x_{1:n}) = \mathbb{E}(\ell(\boldsymbol{\theta}, \hat{\theta}) \mid x_{1:n}) = \mathbb{E}(\boldsymbol{\theta}^2 \mid x_{1:n}) - 2\hat{\theta}\, \mathbb{E}(\boldsymbol{\theta} \mid x_{1:n}) + \hat{\theta}^2,$$
which is convex as a function of $\hat{\theta}$. Setting the derivative with respect to $\hat{\theta}$ equal to 0, and solving, we find that the minimum occurs at $\hat{\theta} = \mathbb{E}(\boldsymbol{\theta} \mid x_{1:n})$.

#### Example 2: Predicting the next outcome, $X_{n+1}$, with 0-1 loss

• Assume $X_{n+1}$ is a discrete random variable.
• Setup:
  – State: $S = X_{n+1}$
  – Observation: $x = x_{1:n}$
  – Action: $a = \hat{x}_{n+1}$
  – Loss: $\ell(s, a) = \mathbf{1}(s \neq a)$ (this is called the 0-1 loss)
• Using 0-1 loss here works out nicely, since it turns out that the optimal decision is simply to predict the most probable value according to the posterior predictive distribution, i.e.,
$$\hat{x}_{n+1} = \delta(x_{1:n}) = \arg\max_{x_{n+1}} p(x_{n+1} \mid x_{1:n}).$$

### 4.2 Real-world decision problems

#### Medical decision-making

At what age should you get early screening for cancer (such as prostate or breast cancer)? There have been recent recommendations to delay screening until later ages due to a high number of false positives, which lead to unnecessary biopsies and considerable physical discomfort and mental distress, in addition to medical costs.

#### Public health policy

The CDC estimates that 5–20% of the US population gets influenza annually, and thousands die. Each year, researchers and vaccine manufacturers have to predict the prevalence of different strains of the virus at least 6 months in advance of flu season, in order to produce the right kinds of flu shots in sufficient quantities.

#### Government regulations

Per watt produced, there are approximately 4000 times more deaths due to coal power generation than due to nuclear power, not counting the environmental costs. Nonetheless, regulation of nuclear power has been very stringent, perhaps due to a misperception of the risk.

#### Personal financial decisions

Should you buy life insurance? To make this decision, you would need to think about your probability of dying early and the financial impact it would have on your family, weighed against the cost of the policy.

#### A word of caution

Use good judgment! A formal decision analysis is almost always oversimplified, and it’s a bad idea to adhere strictly to such a procedure. Decision-theoretic analysis can help to understand and explore a decision problem, but after all the analysis, decisions should be made based on your best judgment.

### 4.3 Example: Resource allocation for disease prevention/treatment

• Suppose public health officials in a small city need to decide how many resources to devote toward prevention and treatment of a certain disease, but the fraction $\theta$ of infected individuals in the city is unknown.
• Suppose they allocate enough resources to accommodate a fraction $c$ of the population. If $c$ is too large, there will be wasted resources, while if it is too small, preventable cases may occur and some individuals may go untreated. After deliberation, they tentatively adopt the following loss function:
$$\ell(\theta, c) = \begin{cases} |\theta - c| & \text{if } c \geq \theta \\ 10\,|\theta - c| & \text{if } c < \theta. \end{cases}$$
• By considering data from other similar cities, they determine a prior $p(\theta)$. For simplicity, suppose $\boldsymbol{\theta} \sim \mathrm{Beta}(a, b)$ (i.e., $p(\theta) = \mathrm{Beta}(\theta \mid a, b)$), with $a = 0.05$ and $b = 1$.
• They conduct a survey assessing the disease status of $n = 30$ individuals, $x_1, \dots, x_n$. This is modeled as $X_1, \dots, X_n \overset{\text{iid}}{\sim} \mathrm{Bernoulli}(\theta)$, which is reasonable if the individuals are uniformly sampled and the population is large. Suppose all but one are disease-free, i.e., $\sum_{i=1}^{n} x_i = 1$.

#### The Bayes procedure

• The Bayes procedure is to minimize the posterior expected loss
$$\rho(c, x) = \mathbb{E}(\ell(\boldsymbol{\theta}, c) \mid x) = \int \ell(\theta, c)\, p(\theta \mid x)\, d\theta,$$
where $x = x_{1:n}$. We know $p(\theta \mid x)$ from Equation 2.3, so we can numerically compute this integral for each $c$.
• Figure 3 shows $\rho(c, x)$ for our example. To visualize why it looks like this, think about the shape of $\ell(\theta, c)$ as a function of $c$, for some fixed $\theta$; then imagine how it changes as $\theta$ goes from 0 to 1, and think about taking a weighted average of these functions, with weights determined by $p(\theta \mid x)$.
• The minimum occurs at $c \approx 0.08$, so under the assumptions above, this is the optimal amount of resources to allocate. Note that this makes more sense than naively choosing $c = \bar{x} = 1/30 \approx 0.03$, which does not account for uncertainty in $\theta$ and the large loss that would result from possible under-resourcing.
• (Note: A sensitivity analysis should also be performed to assess how much these results depend on the assumptions. More on this later.)

[Figure 3: Posterior expected loss for the disease prevalence example.]
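The numerical computation described above can be sketched as follows. This is one possible implementation, not the code used for Figure 3: it assumes a midpoint Riemann sum over $\theta$ and a coarse grid of candidate values of $c$ (both the grid spacing and the number of integration points are arbitrary choices), with the posterior $\mathrm{Beta}(a + \sum x_i,\, b + n - \sum x_i)$ from Equation 2.3.

```python
import math


def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta, computed on the log scale for stability."""
    if not 0.0 < theta < 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1.0 - theta))


def loss(theta, c):
    """Loss from the example: under-resourcing (c < theta) is ten times as costly."""
    return abs(theta - c) if c >= theta else 10.0 * abs(theta - c)


def posterior_expected_loss(c, a_n, b_n, N=2000):
    """Midpoint Riemann sum of loss(theta, c) * Beta(theta | a_n, b_n) over (0, 1)."""
    total = 0.0
    for i in range(1, N + 1):
        theta = (i - 0.5) / N
        total += loss(theta, c) * beta_pdf(theta, a_n, b_n)
    return total / N


a, b = 0.05, 1.0                        # prior parameters from the example
n, sum_x = 30, 1                        # survey: 30 individuals, one infected
a_n, b_n = a + sum_x, b + n - sum_x     # posterior parameters (Equation 2.3)

c_grid = [k / 200 for k in range(101)]  # candidate allocations 0, 0.005, ..., 0.5
rho = [posterior_expected_loss(c, a_n, b_n) for c in c_grid]
c_best = c_grid[rho.index(min(rho))]
print(f"approximate minimizer: c = {c_best:.3f}")
```

Up to the grid resolution, the minimizer should land near the value $c \approx 0.08$ quoted above. Plotting `rho` against `c_grid` gives a curve of the kind described for Figure 3.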

δ(X)) Post. δ) = E `(θ. δ(X)) for brevity) helps to visualize the relationships between all of these concepts. δ(x)) p(x|θ) dx if X is continuous. δ(X)) | θ = θ . δ(X)) = R(θ. exp.Loss L = `(θ. • The integrated risk associated with δ is Z r(δ) = E(`(θ. In other words. Z R(θ. revisited • The frequentist risk provides a useful way to compare decision procedures in a prior-free way. 4.1 (constant). P • Figure 5 shows each procedure as a function of xi .4. • The risk (or frequentist risk ) associated with a decision procedure δ is  R(θ. where X has distribution p(x|θ). δ) p(θ) dθ.1 Example: Resource allocation.4 Frequentist risk and Integrated risk • Consider a decision problem in which S = θ. δ) = `(θ. the Bayes procedure always picks c to be a little bigger than x¯. For the prior we have chosen. • The diagram in Figure 4 (denoting L = `(θ. consider two other possibilities: choosing c = x¯ (sample mean) or c = 0. while the integral is replaced with a sum if X is discrete. loss E(L | X = x) Frequentist risk E(L | θ = θ) Integrated risk E(L) Figure 4: Visualizing the relationships between different decision-theoretic objects. • In addition to the Bayes procedure above. 14 . the observed number of diseased cases. 4.

[Figure 5: Resources allocated $c$, as a function of the number of diseased individuals observed, $\sum x_i$, for the three different procedures.]

• Figure 6 shows the risk $R(\theta, \delta)$ as a function of $\theta$ for each procedure. Smaller risk is better. (Recall that for each $\theta$, the risk is the expected loss, averaging over all possible data sets. The observed data doesn’t factor into it at all.)
• The constant procedure is fantastic when $\theta$ is near 0.1, but gets very bad very quickly for larger $\theta$. The Bayes procedure is better than the sample mean for nearly all $\theta$’s. These curves reflect the usual situation: some procedures will work better for certain $\theta$’s and some will work better for others.
• A decision procedure is called admissible if there is no other procedure that is at least as good for all $\theta$ and strictly better for some. That is, $\delta$ is admissible if there is no $\delta'$ such that $R(\theta, \delta') \leq R(\theta, \delta)$ for all $\theta$ and $R(\theta, \delta') < R(\theta, \delta)$ for at least one $\theta$.
• Bayes procedures are admissible under very general conditions.
• Admissibility is nice to have, but it doesn’t mean a procedure is necessarily good. Silly procedures can still be admissible; e.g., in this example, the constant procedure $c = 0.1$ is admissible too!
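The risk comparison can be sketched numerically. The code below is a rough illustration, not the computation behind Figures 5 and 6. It relies on the fact that all three procedures depend on the data only through $s = \sum x_i$, which has a $\mathrm{Binomial}(n, \theta)$ distribution, so the sum over data sets reduces to a sum over $s$. The Bayes allocation is approximated by a coarse grid search with a midpoint Riemann sum (both resolutions are arbitrary), which is only a rough approximation when $s = 0$, since the posterior density is then unbounded near 0.

```python
import math


def loss(theta, c):
    """Loss from the example: under-resourcing (c < theta) is ten times as costly."""
    return abs(theta - c) if c >= theta else 10.0 * abs(theta - c)


def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta in (0, 1), computed on the log scale."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1.0 - theta))


def bayes_allocation(s, n, a, b, N=500, grid=200):
    """Grid-minimize the (approximate) posterior expected loss when s of n are infected."""
    thetas = [(i - 0.5) / N for i in range(1, N + 1)]
    weights = [beta_pdf(t, a + s, b + n - s) / N for t in thetas]

    def rho(c):
        return sum(loss(t, c) * w for t, w in zip(thetas, weights))

    return min((k / grid for k in range(grid + 1)), key=rho)


def risk(theta, procedure, n):
    """R(theta, delta) = sum over s of loss(theta, delta(s)) * Binomial(s | n, theta)."""
    return sum(
        loss(theta, procedure[s]) * math.comb(n, s) * theta ** s * (1.0 - theta) ** (n - s)
        for s in range(n + 1)
    )


n, a, b = 30, 0.05, 1.0
procedures = {
    "Bayes": [bayes_allocation(s, n, a, b) for s in range(n + 1)],
    "sample mean": [s / n for s in range(n + 1)],
    "constant": [0.1] * (n + 1),
}
for theta in (0.05, 0.1, 0.3):          # illustrative values of theta
    row = ", ".join(f"{name}: {risk(theta, proc, n):.3f}" for name, proc in procedures.items())
    print(f"theta = {theta}: {row}")
```

Each list in `procedures` gives the allocation $c$ for every possible value of $s$, which is the relationship Figure 5 displays; evaluating `risk` over a fine grid of $\theta$ would trace out curves of the kind shown in Figure 6.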

[Figure 6: Risk functions for the three different procedures.]

## 5 Exercises

#### Gamma-Exponential model

We write $X \sim \mathrm{Exp}(\theta)$ to indicate that $X$ has the Exponential distribution, that is, its p.d.f. is
$$p(x \mid \theta) = \mathrm{Exp}(x \mid \theta) = \theta \exp(-\theta x)\, \mathbf{1}(x > 0).$$
The Exponential distribution has some special properties that make it a good model for certain applications. It has been used to model the time between events (such as neuron spikes, website hits, neutrinos captured in a detector), extreme values such as maximum daily rainfall over a period of one year, or the amount of time until a product fails (lightbulbs are a standard example).

Suppose you have data $x_1, \dots, x_n$ which you are modeling as i.i.d. observations from an Exponential distribution, and suppose that your prior is $\boldsymbol{\theta} \sim \mathrm{Gamma}(a, b)$, that is,
$$p(\theta) = \mathrm{Gamma}(\theta \mid a, b) = \frac{b^a}{\Gamma(a)}\, \theta^{a - 1} \exp(-b\theta)\, \mathbf{1}(\theta > 0).$$

1. Derive the formula for the posterior density, $p(\theta \mid x_{1:n})$. Give the form of the posterior in terms of one of the distributions we’ve considered so far (Bernoulli, Beta, Exponential, or Gamma).
2. Now, suppose you are measuring the number of seconds between lightning strikes during a storm, your prior is $\mathrm{Gamma}(0.1, 1.0)$, and your data is $(x_1, \dots, x_8) = (20.9, 69.7, 3.6, 21.8, 21.4, 0.4, 6.7, 10.0)$. Using the programming language of your choice, plot the prior and posterior p.d.f.s. (Be sure to make your plots on a scale that allows you to clearly see the important features.)
3. Give a specific example of an application where an Exponential model would be reasonable. Give an example where an Exponential model would NOT be appropriate, and explain why.

youtube. Come up with a scenario in which S is discrete but the 0 − 1 loss would NOT be appropriate. . 8. x1:n ) = E(`(S. write it as t¯ x + (1 − t)E(θ) for some t ∈ [0.1 – 5. Dynamical bias in the coin toss. and x1 . reproduce the plot inR Figure 3.2–2. what is the Bayes procedure for making this prediction when ` is 0 − 1 loss? 6. Machine Learning (ML) 7.com/playlist?list=PL17567A1A3F5DB5E4 Beta-Bernoulli model • Hoff (2009). 5.6. R. xn ? Express this 1 as a convex combination of the sample mean x¯ = n xi and the prior mean (that is. 17 . SIAM review. . • mathematicalmonk videos. xn ? Using your result from Question 1. Sections 2.com/playlist?list=PLD0F06AA0D2E8FFBA Coin flipping bias • Diaconis. Pb. 1]). . Holmes. (2007). 49(2). Probability Primer (PP) 2. S. in terms of a. Using the programming language of your choice. b for which the Bayes procedure agrees with your intuitive procedure? Qualitatively (not quantitatively).5 https://www. • mathematicalmonk videos. a)|x1:n ) is the a that maximizes P(S = a | x1:n ). .youtube. .. then the action a that minimizes the posterior expected loss ρ(a. & Montgomery.3. how do a and b influence the Bayes procedure? 7. .Decision theory 4.1. Are there settings of the “hyperparameters” a. consider the loss function and the prior from the example in Section 4. Do the PNintegrals 1 1 numerically using a Riemann sum approximation. References and supplements Probability basics • Hoff (2009). how would you predict xn+1 based on observations x1 . and give an example of the loss function that would be more suitable. Consider the Beta-Bernoulli model. P. .5 and 7. . Intuitively. What is the posterior mean E(θ|x1:n ). 2 9. such as 0 f (x)dx ≈ N i=1 f ((i − 1 )/N ) for a suitably large N . Now. 211-235. beginning of Section 3. Show that if ` is 0 − 1 loss and S is a discrete random variable..6 https://www.

#### Decision theory

• The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. C. P. Robert. Springer Texts in Statistics, (2001).
• Statistical Decision Theory and Bayesian Analysis. J.O. Berger. Springer, (1985).

#### History

• The history of statistics: The measurement of uncertainty before 1900. S.M. Stigler. Harvard University Press, (1986).
• Wolfowitz, J. (1952). Abraham Wald, 1902-1950. The Annals of Mathematical Statistics, 1-13.
• Interesting podcast about Wald: http://youarenotsosmart.com/tag/abraham-wald/