Phil.015
February 29, 2016
ELEMENTS OF CLASSICAL PROBABILITY THEORY
1. Passage from Deductive to Inductive Reasoning
In this handout I introduce the basic measure-theoretic machinery for probability calculus. The philosophy behind various definitions is discussed as we
move along. Recall that a logically valid deductive inference always provides
certainty, but generally only at the cost of a loss of information. For example,
we can deductively infer from the premise "All swans are white" the conclusion that a particular swan named (say) Alma is also white. In our formal language, the inference has the obvious predicate logic form

∀x[Sx → Wx] ⊢ Sa → Wa.
Note that the information (in a classical sense of the term, to be discussed later)
always flows downhill from the premises to the conclusion. That is to say, for
any information measure Info (that assigns non-negative real numbers, e.g., in bits, to sentences) the inequality Info(∀x[Sx → Wx]) ≥ Info(Sa → Wa)
holds. For another example, to arrive deductively at a much desired conclusion
that the universe is finite, it is necessary to come up with powerful premises
that carry sufficient information about spacetime, distribution of matter in the
universe, and so forth, together with sophisticated astronomical data.
Inductive logicians are interested in a reverse form of inference, having the form
Sa → Wa, Sb → Wb, … |≈ ∀x[Sx → Wx]
that goes from particular (observational) premises with less information to conclusions that generally carry vastly more information. Because observing a
couple of swans and finding them to be white does not necessarily mean that
all swans are white, we use the wiggly (or wavy) entailment sign |≈, indicating
that the inference is risky and its conclusion no longer guaranteed. As is well known, David
Hume and Karl Popper have argued that the foregoing inductive inference (and
others like it) cannot be justified, period. Simply, we cannot be certain that the
conclusion is correct. A hasty conclusion is that there is no inductive logic in
the foregoing sense. However, if we agree to measure the strength (possibility,
plausibility, probability, degree of certainty or risk) of the inference by a number
(iii) T(P ∨ Q) = max{T(P), T(Q)} = T(P) + T(Q) − T(P) · T(Q).

(iv) T(P → Q) = max{1 − T(P), T(Q)}.
Thus, a truth function preserves the meanings of logical connectives, and it may
be specified by the states of affairs out there (e.g., we set T(P ) = 1 because
the sentence P happens to be true in reality).
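These conditions can be checked mechanically. The following Python sketch (our illustration; the handout itself contains no code) enumerates all {0, 1}-valued assignments and confirms conditions (iii) and (iv):

```python
from itertools import product

def t_or(tP, tQ):
    """Condition (iii): truth value of P v Q as max{T(P), T(Q)}."""
    return max(tP, tQ)

def t_if(tP, tQ):
    """Condition (iv): truth value of P -> Q as max{1 - T(P), T(Q)}."""
    return max(1 - tP, tQ)

# On {0,1}-values the max form of (iii) agrees with the arithmetical form,
# and (iv) reproduces the classical truth table of the material conditional.
for tP, tQ in product([0, 1], repeat=2):
    assert t_or(tP, tQ) == tP + tQ - tP * tQ
    assert t_if(tP, tQ) == (0 if (tP, tQ) == (1, 0) else 1)
```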
Now, recall that a deductive inference
φ₁, φ₂, … ⊢ ψ

is logically valid if and only if for all truth functions T, whenever T(φ₁) =
1, T(φ₂) = 1, … holds simultaneously for all premises, we have T(ψ) = 1. Or
simply, we have

T(φ₁ & φ₂ & ⋯) ≤ T(ψ)

and equivalently

T(φ₁ & φ₂ & ⋯ → ψ) = 1

for all truth functions T. This is the crux of the truth table method.
Because, as we will note later, truth functions T are limit cases of probability
measures,¹ we may wish to consider a conditional truth function

T_P(Q) =df T(P → Q)

for any sentence P. It is easy to see that T_P satisfies all conditions itemized
above. There is an obvious iteration (T_P)_Q = T_{P & Q} of conditioning and the
multiplication law

T(P & Q) = T(P) · T_P(Q),

important also in classical probability theory.
I want to make two points from this digression into deductive inference and
truth functions. First, deductive logic abstracts from all truth functions T by
quantifying them away from a valid deductive inference. In so doing, logical
claims and inferences turn out to be claims about all possible worlds and not
about any specific world in particular.² In contrast, in applied probability theory, certain probability functions P (to be defined later) are given or presumed
to be identifiable, and not quantified away. Of course, the axioms and basic
theorems of probability calculus hold for all probability functions P.

¹ We should remember that truth functions T are special probability measures that take only two values, namely 0 and 1 (earlier we used the symbols T and F). In view of this simplicity, any pair of sentences turns out to be probabilistically independent for any T.

² This sheds some light on the information loss problem.
Second, deductive logic prompts us to treat inductive inference in a logically familiar manner, having the form

φ₁, φ₂, … |≈_p ψ,

that completely suppresses the crucial role of a function that actually assigns
the numerical value p to the relationship between premises φ₁, φ₂, … and conclusion ψ. The inference above says that the premises φ₁, φ₂, … entail with
probability p the conclusion ψ.
In complete analogy with the role of truth functions T in
T(φ₁ & φ₂ & ⋯ → ψ) = 1,

we now rewrite the foregoing inductive inference into an equational form

P(ψ | φ₁ & φ₂ & ⋯) = p.
In particular, the example about white swans is now symbolized in the form of
an equation, belonging to probability logic
P(∀x[Sx → Wx] | (Sa → Wa) & (Sb → Wb)) = p

with some P, where (i) the measure P of (un)certainty is now explicitly mentioned, (ii) the conclusion is always put first, and (iii) the converse of the inductive entailment relation |≈ is symbolized quite simply by a vertical bar |,
intended to separate the conclusion from the premise. The foregoing formula
expresses the fact that the degree of (un)certainty (probability) that the hypothesis ∀x[Sx → Wx] is true, given evidence or premise (Sa → Wa) & (Sb → Wb),
is p.
Unfortunately, the problem has a further twist for applications. Exactly how
should the measure P be identified? Bayesians hold that P measures the probability theorist's degrees of belief in all sentences of a given formal language, and
it should be identified via subsequent revisions in the face of new information. The
so-called objectivists or frequentists argue that since P measures an objective
property of a system out there, it can be identified by calculating the relative
frequencies of occurrences of events or it may be obtained a priori from geometric or other symmetry considerations. Both interpretations of P are held
⁴ Often probability experiments are entirely conceptual. For example, in discrete probability calculus it is common to assume that a hypothetical urn contains x marbles, each colored differently from any of the others. A marble is drawn from the urn and its color is noted. It does not matter whether or not this is a real-life situation. The interest is in determining the probability that a marble of a designated color is drawn.

Because we are conceptually encoding the empirical outcomes, the points in Ω could be 0 and 1, where the first encodes the empirical outcome heads and the second encodes the empirical outcome tails. It is
strictly a question of good mathematical modeling whether or not Ω should
include a sample point E for additional outcomes, say edge, and so forth.⁵
Unfortunately, sample spaces are not directly available in the logical approach
to probabilistic inference. Yet, they have proved to be extremely useful in upholding the linear laws of random variables and the convex manipulations of
probability density functions.
In many applications, sample spaces are introduced in a mixing
or conditional manner. Suppose the target probability experiment consists
of two steps. First a coin is flipped. Then, if the outcome is tails, a die
is rolled. But if the outcome is heads, then the coin is flipped again. It
is immediately obvious that the correct sample space is specified by Ω₀ =df
{T1, T2, T3, T4, T5, T6, HT, HH}. Here T1 encodes the first outcome of tails
and the subsequent outcome of the upturned face numbered 1, after the die
has been rolled, and so on. Observe that the sample space Ω₀ is actually
defined by the mixture or conditional sample space Ω(Ω′, Ω) =df
{T} × Ω′ + {H} × Ω, where Ω = {H, T} encodes the outcomes of the first
trial and Ω′ = {1, 2, 3, 4, 5, 6} is used in the second trial if T occurs, else Ω is
used again in the second trial (namely, if H occurs in the first trial). Here the
sample spaces Ω′ and Ω are invoked conditionally on Ω.
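The mixture construction can be made concrete in code. The following Python sketch (identifiers are our own, chosen for illustration) assembles Ω₀ as {T} × Ω′ + {H} × Ω:

```python
# A minimal sketch of the mixture sample space {T} x Omega' + {H} x Omega.
# The variable names are ours; the handout itself introduces no code.
omega = ["H", "T"]                # first trial: coin flip
omega_prime = [1, 2, 3, 4, 5, 6]  # second trial after tails: die roll

# Concatenate the conditioning outcome with the second-trial outcome.
omega_0 = {f"T{face}" for face in omega_prime} | {f"H{side}" for side in omega}

assert omega_0 == {"T1", "T2", "T3", "T4", "T5", "T6", "HH", "HT"}
```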
Set theory has two standard concepts we shall need: Cartesian product and
disjoint sum. Given two sets Ω and Ω′, we use the expression Ω × Ω′ to denote
their Cartesian product set, i.e., the set of all pairs (ω, ω′), where ω is a member
of Ω and ω′ is a member of Ω′. For example, the sample space for flipping a
coin twice is defined by the Cartesian product Ω × Ω = {HH, HT, TH, TT}
that should be written more pedantically as {(H, H), (H, T), (T, H), (T, T)}.
Clearly, in general Ω × Ω′ ≠ Ω′ × Ω.

Given two sets Ω and Ω′, we use the expression Ω + Ω′ to denote the disjoint
sum of sets Ω and Ω′. It is the set containing both the elements of Ω and Ω′.
Possible double-counting is allowed by labeling the elements of Ω separately by
0 and those of Ω′ by 1.
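Both set-theoretic operations are easy to exhibit in code. Here is a minimal Python sketch (our illustration), representing the disjoint-sum labels as 0/1 tags:

```python
from itertools import product

omega = ["H", "T"]
omega_prime = ["H", "T"]

# Cartesian product: all ordered pairs (w, w').
cartesian = set(product(omega, omega_prime))
assert cartesian == {("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")}

# Disjoint sum: tag elements of the first set with 0 and of the second with 1,
# so that shared elements are kept apart rather than merged.
disjoint_sum = {(0, w) for w in omega} | {(1, w) for w in omega_prime}
assert len(disjoint_sum) == len(omega) + len(omega_prime)  # no collapsing
```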
Compositions of experiments are typically represented by Cartesian products of
sample spaces (e.g., flipping a coin n times requires the n-fold Cartesian product
Ωₙ = Ω × ⋯ × Ω), and by iterated mixing (conditioning) constructions.
Slightly more complicated sample spaces are needed in cascade probability
⁵ By agreeing on the universe of sample points that encodes all seriously possible outcomes, we are of course resorting to idealization. If edge were to occur reasonably often in repeated tosses, then we would be making an unrealistic assumption in insisting on Ω = {H, T}. On the other hand, if edge never seems to occur, then it is inconvenient to specify the working sample space by Ω = {H, T, E}.
In probability logic, a probability measure measures the probability that a sentence of a designated formal language is true.
P(E) = #{E} / #{Ω}.

Note, however, that the Laplacean model is confined to finite sample spaces only.
Problem: Suppose two dice are rolled. Find the probability that the sum of
outcomes shown by the first and second die is exactly 6.
Solution:
The sample space for the experiment is given by the Cartesian product
Ω × Ω = {(1, 1), (1, 2), …, (2, 1), (2, 2), …, (6, 6)}, where Ω = {1, 2, 3, 4, 5, 6}.
Now, because there are 5 pairs of outcomes with the same sum 6, namely
(5, 1), (4, 2), (3, 3), (2, 4), (1, 5), the probability of the event described by "The
sum of outcomes shown by the first and second die is exactly 6" is equal to 5/36.
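Under the Laplacean model the answer can be verified by brute-force enumeration; here is a small Python check (our own illustration, using exact rational arithmetic):

```python
from fractions import Fraction
from itertools import product

omega = range(1, 7)
sample_space = list(product(omega, omega))           # Omega x Omega, 36 points
event = [pt for pt in sample_space if sum(pt) == 6]  # pairs summing to 6

prob = Fraction(len(event), len(sample_space))       # Laplacean model
assert prob == Fraction(5, 36)
```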
Note that, in general, a measurable space ⟨Ω, A⟩ admits infinitely many distinct
probability measures, forming a kind of meta-sample space Prob⟨Ω, A⟩ with its
own meta-event algebra, important in higher-order probability theory. Observe
also that unlike in deductive inference, where usually all possible truth functions T are considered and then quantified away, in probability theory the focus
tends to be on just one particular probability measure, namely the one that is
believed to be (in some sense) the correct probability measure for the target
experiment. Hence, probabilistic reasoning tends to be local and model-driven.
We will see more of this style in statistical reasoning.
Now we turn to some of the basic theorems of probability theory. Henceforth, we
shall assume that we have an arbitrary but fixed probability model ⟨Ω, A, P⟩
with events A, B, C, …, belonging to its event algebra A. We ignore a host
of trivial facts, including P(∅) = 0, P(A) ≤ 1, P(Ē) = 1 − P(E), and
P(A) ≤ P(B) if A ⊆ B. Since the probability model above is arbitrary,
the theorems below in effect hold for all probability measures, and they do not
depend on Bayesian, objectivist or other interpretations at all. For this reason, we should
preface each theorem below with the quantifier "for all P in Prob⟨Ω, A⟩".
Suppose that A and B are two sets. Then the set A is said to be a subset of B provided that every element of A is also an element of B. This important relationship between two sets is symbolized succinctly as A ⊆ B.

P7   max{0, P(A) + P(B) − 1} ≤ P(A ∩ B) ≤ min{P(A), P(B)}.

P8   P(A ∪ B) ≥ max{P(A), P(B)}.

¹⁰ This theorem is often referred to as the general addition law, because there is no assumption in it about the disjointness of events.

¹¹ Given two sets A and B, often we shall write AB as short for A ∩ B.
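The two-sided bound P7 and the bound P8 can be spot-checked numerically. The following Python sketch (our illustration) samples random probability measures over the four atoms generated by A and B:

```python
import random

random.seed(0)

# Spot-check P7 and P8 on randomly generated probability measures over the
# four atoms A&B, A&notB, notA&B, notA&notB.
for _ in range(1000):
    weights = [random.random() for _ in range(4)]
    total = sum(weights)
    p_ab, p_a_only, p_b_only, _ = [w / total for w in weights]
    pA, pB = p_ab + p_a_only, p_ab + p_b_only
    p_union = p_ab + p_a_only + p_b_only
    eps = 1e-12  # tolerance for floating-point rounding
    assert max(0.0, pA + pB - 1) - eps <= p_ab <= min(pA, pB) + eps  # P7
    assert p_union + eps >= max(pA, pB)                              # P8
```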
We conclude this section with probability logic, pausing to define probability
functions as analogs of truth functions. Inductive logic studies risky arguments
using probability ideas. Probabilities are expressed as numbers from the unit
interval. Probability logic is deductive logic enriched with a designated probability measure. In this framework logicians can analyze questions of the form
"What is the probability that proposition φ is true?". Thus, the focus is on the
probability of propositions or sentences being true (or false), and not on the
probability that a given event occurs (or does not occur).
Suppose we are given a language L of sentential, predicate or other logic. The
notion of probability in probability logic is treated as a generalization of a truth
function. Specifically, a function P : L → [0, 1] that assigns to each sentence φ
of a formal language L a unique real number 0 ≤ P(φ) ≤ 1 (interpreted
as the probability that φ is true) is called a probability function or probability
measure (or belief state) on L provided that the following conditions hold for
all sentences φ and ψ in L:

(i) P(φ) = 1, if ⊢ φ.

(ii) P(φ) ≤ P(ψ), if φ ⊢ ψ.

(iii) P(¬φ) = 1 − P(φ).

(iv) P(φ ∨ ψ) = P(φ) + P(ψ) − P(φ & ψ).

(v) P(φ & ψ) = P(φ) · P(ψ | φ), where P(ψ | φ) =df P(φ & ψ) : P(φ).

(vi) P(φ → ψ) = P(¬φ ∨ ψ).

(vii) P(∀x φ(x)) = Inf_{a₁, …, aₙ} P(φ(a₁) & ⋯ & φ(aₙ)).
Where a typical probability function P differs from truth functions T is in
specifying logical truth and logical validity. If we wanted to define logical truth
by setting ⊨ φ precisely when P(φ) = 1 for all probability functions P on L,
we would not get anything new beyond the usual logical theorems we obtain
from truth functions. However, if we set

⊨_P φ  if and only if  P(φ) = 1

for just one particular probability function, say P, then we obtain a deductively
closed system of sentences enjoying full belief or certainty in belief state P
that of course always includes all logically true sentences. The most important
(v) If C ⫫ A and C ⫫ B, then C ⫫ A ∪ B, where A and B are disjoint.

(vi) If A and B are disjoint events with P(A) ≠ 0 ≠ P(B), then not A ⫫ B.
As an illustrative example, consider the nickel-and-dime experiment. We are
tossing both a nickel and a dime into the air once, the object being to observe
the upturned faces on both coins. A suitable description of the four possible outcomes
is afforded by four ordered pairs, specifying the sample space
Ω =df {(H, H), (H, T), (T, H), (T, T)}. The event algebra A consists of
all subsets of the sample space. Since we are assuming the Laplacean (equally
likely) model, the probability of the event described by "The nickel falls
heads" and modeled by the subset A = {(H, H), (H, T)} is P(A) = 1/2. Similarly, the probability of the event described by "The dime falls tails" and
represented by the subset B = {(H, T), (T, T)} is P(B) = 1/2. From the physical
nature of the experiment (the coins are physically separate objects) we know
that events A and B should be independent. Since A ∩ B = {(H, T)}, we see
that P(A ∩ B) = 1/4 = 1/2 · 1/2 = P(A) · P(B), hence A ⫫ B, as suspected.
Clearly, a given pair of events A and B may be probabilistically independent with respect to a probability measure P, but not necessarily with respect
to some other probability measure P′. In particular, note that if the probability measure P′ on the measurable space ⟨Ω, A⟩ above is now intended
to represent biased coins, given by (say) P′({(H, H)}) = 0.5, P′({(H, T)}) =
0.4, P′({(T, H)}) = 0.05, and P′({(T, T)}) = 0.05, then we have P′(A) =
0.9, P′(B) = 0.45, P′(A ∩ B) = 0.400, but P′(A) · P′(B) = 0.405, so that with
respect to P′ the events A and B are not probabilistically independent.
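Both calculations, for the fair measure P and the biased measure P′, can be verified with exact arithmetic; the following Python sketch is our own illustration:

```python
from fractions import Fraction

omega = [("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")]
A = [w for w in omega if w[0] == "H"]          # the nickel falls heads
B = [w for w in omega if w[1] == "T"]          # the dime falls tails
AB = [w for w in A if w in B]

# Laplacean (fair) measure: every sample point has probability 1/4.
P = lambda event: Fraction(len(event), 4)
assert P(AB) == P(A) * P(B) == Fraction(1, 4)  # independent under P

# Biased measure P' from the text.
weights = {("H", "H"): Fraction(5, 10), ("H", "T"): Fraction(4, 10),
           ("T", "H"): Fraction(1, 20), ("T", "T"): Fraction(1, 20)}
Pp = lambda event: sum(weights[w] for w in event)
assert Pp(A) == Fraction(9, 10) and Pp(B) == Fraction(9, 20)
assert Pp(AB) != Pp(A) * Pp(B)                 # not independent under P'
```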
In general, with respect to a so-called Dirac {0, 1}-valued probability measure
P_ω, defined by P_ω(A) = 1 if ω ∈ A and P_ω(B) = 0 if ω ∉ B for all events
A and B, all events are probabilistically independent. In particular, since each
truth function T is a probability measure over the formal language of sentential
logic, all pairs of sentences are automatically T-independent. Suppose an experiment consists of rolling a die twice. It is easy to see that the event described
by "Getting an odd number in the first roll" is probabilistically independent
of "Getting an even number in the second roll".
If there are three events A, B, C, they are called pairwise independent just in
case A ⫫ B, A ⫫ C, and B ⫫ C. However, multiple independence or mutual
independence, in symbols ⫫(A, B, C), means considerably more, namely, pairwise independence and the equality P(A ∩ B ∩ C) = P(A) · P(B) · P(C). The last
condition is not reducible to pairwise independence. The concept of multiple
independence readily extends to any string of events.
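That the triple condition is genuinely stronger than pairwise independence is shown by a standard example (our illustration, not taken from the handout): flip a fair coin twice and let C be the event that the two outcomes differ.

```python
from fractions import Fraction
from itertools import product

omega = list(product("HT", repeat=2))          # two fair coin flips
P = lambda event: Fraction(len(event), len(omega))

A = [w for w in omega if w[0] == "H"]          # first flip heads
B = [w for w in omega if w[1] == "H"]          # second flip heads
C = [w for w in omega if w[0] != w[1]]         # the two outcomes differ

inter = lambda X, Y: [w for w in X if w in Y]

# Pairwise independence holds...
assert P(inter(A, B)) == P(A) * P(B)
assert P(inter(A, C)) == P(A) * P(C)
assert P(inter(B, C)) == P(B) * P(C)
# ...but the triple condition fails: A and B together determine C.
assert P(inter(inter(A, B), C)) != P(A) * P(B) * P(C)
```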
Suppose we are given an arbitrary but fixed probability model ⟨Ω, A, P⟩ and a
fixed event C with P(C) > 0. The conditional probability of event A, given
conditioning event C, denoted by P(A | C),¹² is defined by the numerical fraction

P_C(A) = P(A | C) =df P(A ∩ C) / P(C).

As an example, let Ω = {1, 2, 3, 4, 5, 6} be the sample space for the roll of a
die. Let B = {2, 4, 6} and A = {2}. Assume the Laplacean (equally likely)
model, so that P(B) = 1/2 and P(A) = 1/6. Then it is easy to calculate that
P(A | B) = 1/3 and P(B | A) = 1.

In the Laplacean (equally likely) model we have

P(A | C) = #{A ∩ C} / #{C}.
A conditional probability measure is just like any other probability measure
in that it satisfies the following properties of probability measures, referred to
earlier:

CP1   P(Ω | C) = 1.

CP2   P(A | C) = 1, if C ⊆ A.

CP3   P(A ∩ C) = P(C) · P(A | C).

CP4   P(A | C) ≥ 0.

CP5   P(Ā | C) = 1 − P(A | C).

CP6   P(B − A | C) = P(B | C) − P(A | C), if A ⊆ B.

CP7   P(A ∪ B | C) = P(A | C) + P(B | C) − P(AB | C).

CP8   P(A | C) = P(AB | C) + P(AB̄ | C).

CP9   P(A ∩ B | C) = P(A | C) · P(B | A ∩ C).

CP10  P(A ∩ B ∩ C) = P(A) · P(B | A) · P(C | A ∩ B), where the conditioning events have a positive probability.

CP11  P(A) = P(B) · P(A | B) + P(B̄) · P(A | B̄),

where P(A) ≠ 0 and P(AB) ≠ 0. The only difference is that in the setting
of deductive logic conditioning (using the symbol →) is quite simple, whereas
in probability theory conditioning (using the symbol |) can be rather complex. Earlier
we mentioned that iterated truth function conditioning of the form (T_P)_Q =
T_{P & Q} does not offer anything new.¹³ This is also true in the case of iterated
conditionalization of probability measures, since we have P((A | B) | C) =
P(A | B ∩ C). In probability theory a major complication arises from the fact
that a probability measure P can take its values in the entire unit interval [0, 1].

¹² Another reading is: The probability that event A will occur, given that event B occurs.
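Both CP11 and the collapse of iterated conditionalization can be verified on a concrete model; in the following Python sketch (our own illustration) the events A, B, C over two die rolls are arbitrary choices:

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))    # two die rolls, Laplacean model
P = lambda ev: Fraction(len(ev), len(omega))
inter = lambda X, Y: [w for w in X if w in Y]
cond = lambda A, C: P(inter(A, C)) / P(C)       # P(A | C)

A = [w for w in omega if w[0] + w[1] == 7]      # sum of the rolls is 7
B = [w for w in omega if w[0] >= 4]             # first roll is at least 4
C = [w for w in omega if w[1] % 2 == 0]         # second roll is even

# CP11 (total probability): P(A) = P(B) P(A|B) + P(not-B) P(A|not-B)
notB = [w for w in omega if w not in B]
assert P(A) == P(B) * cond(A, B) + P(notB) * cond(A, notB)

# Iterated conditioning collapses: conditioning P_C by B gives P(A | B ∩ C)
cond_iter = cond(inter(A, B), C) / cond(B, C)
assert cond_iter == cond(A, inter(B, C))
```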
3. Bayes' Theorem and Its Applications to Complex Experiments
Given an arbitrary but fixed probability model ⟨Ω, A, P⟩, for any pair of events
A and B, with P(A) · P(B) ≠ 0, we can write down two conditional probability
formulas

P(A | B) = P(A ∩ B) / P(B)   and   P(B | A) = P(A ∩ B) / P(A),

giving

P(A) · P(B | A) = P(B) · P(A | B),

which is easily seen to be the probabilistic counterpart of the tautology
[P & (P → Q)] ↔ [Q & (Q → P)] and that of the truth functional identity

T(P) · T_P(Q) = T(Q) · T_Q(P).
It is easy to see that the foregoing equality can be rewritten into one of the
simplest forms of the so-called Bayes formula

P(A | B) = P(A) · P(B | A) / P(B)

or

P(A | B) = P(A) · P(B | A) / [P(A) · P(B | A) + P(Ā) · P(B | Ā)].

These formulas are used whenever the quantities P(A), P(B | A) and P(B | Ā)
are given or can be calculated. The typical situation arises when event B occurs
logically or temporally after A, so that the probabilities P(A) and P(B | A)
can be readily computed. In applications, Bayes' formula is used when we know
the effect of a cause and we wish to make some inference about the cause.
P(R₁ ∩ W₂ ∩ B₃) = 6/(6+4+5) · 4/(6+4+5) · 5/(6+4+5) = 8/225.
Example 2: Three balls are drawn from a box containing 6 red balls, 4 white
balls, and 5 blue balls. Find the probability that they are drawn in the
order red, white and blue, if none of the balls is replaced.
Solution: We shall use the same notation as in Example 1. Now, in view of

P(R₁ ∩ W₂ ∩ B₃) = P(R₁) · P(W₂ | R₁) · P(B₃ | R₁ ∩ W₂),

we have

P(R₁ ∩ W₂ ∩ B₃) = 6/(6+4+5) · 4/(5+4+5) · 5/(5+3+5) = 4/91.
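The two multiplication-law computations, without and with replacement, can be checked in exact arithmetic (our illustration):

```python
from fractions import Fraction

# Drawing red, white, blue in order from a box of 6 red, 4 white, 5 blue balls.
# Without replacement the denominator shrinks by one after each draw.
p = Fraction(6, 15) * Fraction(4, 14) * Fraction(5, 13)
assert p == Fraction(4, 91)

# With replacement (Example 1) the denominator stays fixed at 15.
p_repl = Fraction(6, 15) * Fraction(4, 15) * Fraction(5, 15)
assert p_repl == Fraction(8, 225)
```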
Example 3: Suppose Box No. I contains 3 red and 2 blue marbles, while
Box No. II contains 2 red and 8 blue marbles. A fair coin is tossed. If
the coin turns up heads, a marble is chosen from Box I; if it turns up
tails, a marble is chosen from Box II. Find the probability that a red
marble is chosen. (Note that here the sample space for the coin can be
identified with Ω = {Box I, Box II}, and the sample spaces for the Boxes
are Ω₅ = {r₁, r₂, r₃, b₁, b₂} and Ω₁₀ = {r₁, r₂, b₁, b₂, …, b₈}, respectively.
The composite sample space has the conditional form Ω(Ω₅, Ω₁₀),
prompting the use of conditional probability.)

Solution: Let R denote the event described by "A red marble is chosen"
and let I denote the event that Box I is chosen. Likewise, let II denote the
event that Box II is chosen. Now the probability of choosing a red marble
is calculated using Theorem CP11 as follows:

P(R) = P(I) · P(R | I) + P(II) · P(R | II) = 1/2 · 3/(3+2) + 1/2 · 2/(2+8) = 2/5.
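The CP11 computation can be checked directly (our illustration):

```python
from fractions import Fraction

p_I = p_II = Fraction(1, 2)          # fair coin chooses the box
p_red_given_I = Fraction(3, 3 + 2)   # Box I: 3 red, 2 blue
p_red_given_II = Fraction(2, 2 + 8)  # Box II: 2 red, 8 blue

# Theorem CP11 (total probability)
p_red = p_I * p_red_given_I + p_II * p_red_given_II
assert p_red == Fraction(2, 5)
```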
Example 4: Suppose Box No. I contains 3 red and 2 blue marbles, while Box
No. II contains 2 red and 8 blue marbles. A fair coin is tossed, but this
time it is not revealed whether it has turned up heads or tails, so that the
box from which a marble was chosen is not revealed. But we do know that
a red marble was chosen. What is the probability that Box I was chosen,
i.e., that the coin turned up heads, given that a red marble was chosen?
Solution: This problem requires Bayes' Theorem. Let R denote the event
described by "A red marble is chosen" and let I denote the event that
Box I is chosen. Likewise, let II denote the event that Box II is chosen.
Now the probability that Box I was chosen, given that a red marble was
chosen, is calculated using Bayes' Theorem:

P(I | R) = P(I) · P(R | I) / [P(I) · P(R | I) + P(II) · P(R | II)]
         = (1/2 · 3/(3+2)) / (1/2 · 3/(3+2) + 1/2 · 2/(2+8)) = 3/4.
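The posterior can be checked in exact arithmetic (our illustration):

```python
from fractions import Fraction

p_I = p_II = Fraction(1, 2)          # fair coin chooses the box
p_red_given_I = Fraction(3, 5)       # Box I: 3 red out of 5 marbles
p_red_given_II = Fraction(2, 10)     # Box II: 2 red out of 10 marbles

# Bayes' formula: posterior probability of Box I given a red draw
posterior = (p_I * p_red_given_I) / (p_I * p_red_given_I + p_II * p_red_given_II)
assert posterior == Fraction(3, 4)
```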