Phil.015
February 29, 2016
ELEMENTS OF CLASSICAL PROBABILITY THEORY
1. Passage from Deductive to Inductive Reasoning
In this handout I introduce the basic measure-theoretic machinery for probability calculus. The philosophy behind various definitions is discussed as we
move along. Recall that a logically valid deductive inference always provides
certainty, but generally only at the cost of a loss of information. For example,
we can deductively infer from the premise "All swans are white" the conclusion that a particular swan named (say) Alma is also white. In our formal language, the inference has the obvious predicate logic form

∀x[Sx → Wx] ⊢ Sa → Wa.
Note that the information (in a classical sense of the term, to be discussed later)
always flows downhill from the premises to the conclusion. That is to say, for
any information measure Info (that assigns non-negative real numbers, e.g., in bits, to sentences) the inequality Info(∀x[Sx → Wx]) ≥ Info(Sa → Wa)
holds. For another example, to arrive deductively at a much desired conclusion
that the universe is finite, it is necessary to come up with powerful premises
that carry sufficient information about spacetime, distribution of matter in the
universe, and so forth, together with sophisticated astronomical data.
Inductive logicians are interested in a reverse form of inference, having the form
Sa → Wa, Sb → Wb, … |≈ ∀x[Sx → Wx]
that goes from particular (observational) premises with less information to conclusions that generally carry vastly more information. Because observing a
couple of swans and finding them to be white does not necessarily mean that
all swans are white, we use the wiggly (or wavy) entailment sign |≈, indicating
that the inference is risky and its conclusion no longer guaranteed. As is well known, David
Hume and Karl Popper have argued that the foregoing inductive inference (and
others like it) cannot be justified, period. Simply, we cannot be certain that the
conclusion is correct. A hasty conclusion is that there is no inductive logic in
the foregoing sense. However, if we agree to measure the strength (possibility,
plausibility, probability, degree of certainty or risk) of the inference by a number
(iii) T(P ∨ Q) = max{T(P), T(Q)} = T(P) + T(Q) − T(P) · T(Q).

(iv) T(P → Q) = max{1 − T(P), T(Q)}.
Thus, a truth function preserves the meanings of logical connectives, and it may
be specified by the states of affairs out there (e.g., we set T(P ) = 1 because
the sentence P happens to be true in reality).
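These conditions can be checked mechanically. The following Python sketch (our illustration; the handout itself contains no code) enumerates all {0, 1}-valued assignments and confirms conditions (iii) and (iv):

```python
from itertools import product

def t_or(tP, tQ):
    """Condition (iii): truth value of P v Q as max{T(P), T(Q)}."""
    return max(tP, tQ)

def t_if(tP, tQ):
    """Condition (iv): truth value of P -> Q as max{1 - T(P), T(Q)}."""
    return max(1 - tP, tQ)

# On {0,1}-values the max form of (iii) agrees with the arithmetical form,
# and (iv) reproduces the classical truth table of the material conditional.
for tP, tQ in product([0, 1], repeat=2):
    assert t_or(tP, tQ) == tP + tQ - tP * tQ
    assert t_if(tP, tQ) == (0 if (tP, tQ) == (1, 0) else 1)
```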
Now, recall that a deductive inference
φ₁, φ₂, … ⊢ ψ

is logically valid if and only if for all truth functions T, whenever T(φ₁) =
1, T(φ₂) = 1, … holds simultaneously for all premises, we have T(ψ) = 1. Or
simply, we have

T(φ₁ & φ₂ & ⋯) ≤ T(ψ)

and equivalently

T(φ₁ & φ₂ & ⋯ → ψ) = 1

for all truth functions T. This is the crux of the truth table method.
Because, as we will note later, truth functions T are limit cases of probability
measures,¹ we may wish to consider a conditional truth function

T_P(Q) =df T(P → Q)

for any sentence P. It is easy to see that T_P satisfies all conditions itemized
above. There is an obvious iteration (T_P)_Q = T_{P & Q} of conditioning and the
multiplication law

T(P & Q) = T(P) · T_P(Q),

important also in classical probability theory.
I want to make two points from this digression into deductive inference and
truth functions. First, deductive logic abstracts from all truth functions T by
quantifying them away from a valid deductive inference. In so doing, logical
claims and inferences turn out to be claims about all possible worlds and not
about any specific world in particular.² In contrast, in applied probability theory, certain probability functions P (to be defined later) are given or presumed
to be identifiable, and not quantified away. Of course, the axioms and basic
theorems of probability calculus hold for all probability functions P.

¹ We should remember that truth functions T are special probability measures that take only two values, namely 0 and 1 (earlier we used the symbols T and F). In view of this simplicity, any pair of sentences turns out to be probabilistically independent for any T.

² This sheds some light on the information loss problem.
Second, deductive logic prompts us to treat inductive inference in a logically familiar manner, having the form

φ₁, φ₂, … |≈_p ψ,

that completely suppresses the crucial role of a function that actually assigns
the numerical value p to the relationship between premises φ₁, φ₂, … and conclusion ψ. The inference above says that the premises φ₁, φ₂, … entail with
probability p the conclusion ψ.
In complete analogy with the role of truth functions T in
T(φ₁ & φ₂ & ⋯ → ψ) = 1,

we now rewrite the foregoing inductive inference into an equational form

P(ψ | φ₁ & φ₂ & ⋯) = p.
In particular, the example about white swans is now symbolized in the form of
an equation, belonging to probability logic
P(∀x[Sx → Wx] | (Sa → Wa) & (Sb → Wb)) = p

with some P, where (i) the measure P of (un)certainty is now explicitly mentioned, (ii) the conclusion is always put first, and (iii) the converse of the inductive entailment relation |≈ is symbolized quite simply by a vertical bar |,
intended to separate the conclusion from the premise. The foregoing formula
expresses the fact that the degree of (un)certainty (probability) that the hypothesis ∀x[Sx → Wx] is true, given evidence or premise (Sa → Wa) & (Sb → Wb),
is p.
Unfortunately, the problem has a further twist for applications. Exactly how
should the measure P be identified? Bayesians hold that P measures the probability theorist's degrees of belief in all sentences of a given formal language, and
it should be identified via subsequent revisions in the face of new information. The
so-called objectivists or frequentists argue that since P measures an objective
property of a system out there, it can be identified by calculating the relative
frequencies of occurrences of events or it may be obtained a priori from geometric or other symmetry considerations. Both interpretations of P are held
⁴ Often probability experiments are entirely conceptual. For example, in discrete probability calculus it is common to assume that a hypothetical urn contains x marbles, each colored differently from any of the others. A marble is drawn from the urn and its color is noted. It does not matter whether or not this is a real-life situation. The interest is in determining the probability that a marble of a designated color is drawn.

Because we are conceptually encoding the empirical outcomes, the points in Ω could be 0 and 1, where the first encodes the empirical outcome heads and the second encodes the empirical outcome tails. It is
strictly a question of good mathematical modeling whether or not Ω should
include a sample point E for additional outcomes, say edge, and so forth.⁵
Unfortunately, sample spaces are not directly available in the logical approach
to probabilistic inference. Yet, they have proved to be extremely useful in upholding the linear laws of random variables and the convex manipulations of
probability density functions.
In many applications, sample spaces are introduced in a mixing
or conditional manner. Suppose the target probability experiment consists
of two steps. First a coin is flipped. Then, if the outcome is tails, a die
is rolled. But if the outcome is heads, then the coin is flipped again. It
is immediately obvious that the correct sample space is specified by Ω₀ =df
{T1, T2, T3, T4, T5, T6, HT, HH}. Here T1 encodes the first outcome of tails
and the subsequent outcome of the upturned face numbered 1, after the die
has been rolled, and so on. Observe that the sample space Ω₀ is actually
defined by the mixture or conditional sample space Ω(Ω′, Ω) =df
{T} × Ω′ + {H} × Ω, where Ω = {H, T} encodes the outcomes of the first
trial and Ω′ = {1, 2, 3, 4, 5, 6} is used in the second trial if T occurs, else Ω is
used again in the second trial (namely, if H occurs in the first trial). Here the
sample spaces Ω′ and Ω are invoked conditionally on Ω.
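The mixture construction can be made concrete in code. The following Python sketch (identifiers are our own, chosen for illustration) assembles Ω₀ as {T} × Ω′ + {H} × Ω:

```python
# A minimal sketch of the mixture sample space {T} x Omega' + {H} x Omega.
# The variable names are ours; the handout itself introduces no code.
omega = ["H", "T"]                # first trial: coin flip
omega_prime = [1, 2, 3, 4, 5, 6]  # second trial after tails: die roll

# Concatenate the conditioning outcome with the second-trial outcome.
omega_0 = {f"T{face}" for face in omega_prime} | {f"H{side}" for side in omega}

assert omega_0 == {"T1", "T2", "T3", "T4", "T5", "T6", "HH", "HT"}
```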
Set theory has two standard concepts we shall need: Cartesian product and
disjoint sum. Given two sets Ω and Ω′, we use the expression Ω × Ω′ to denote
their Cartesian product set, i.e., the set of all pairs (ω, ω′), where ω is a member
of Ω and ω′ is a member of Ω′. For example, the sample space for flipping a
coin twice is defined by the Cartesian product Ω × Ω = {HH, HT, TH, TT}
that should be written more pedantically as {(H, H), (H, T), (T, H), (T, T)}.
Clearly, in general Ω × Ω′ ≠ Ω′ × Ω.

Given two sets Ω and Ω′, we use the expression Ω + Ω′ to denote the disjoint
sum of sets Ω and Ω′. It is the set containing both the elements of Ω and Ω′.
Possible double-counting is allowed by labeling the elements of Ω separately by
0 and those of Ω′ by 1.
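Both set-theoretic operations are easy to exhibit in code. Here is a minimal Python sketch (our illustration), representing the disjoint-sum labels as 0/1 tags:

```python
from itertools import product

omega = ["H", "T"]
omega_prime = ["H", "T"]

# Cartesian product: all ordered pairs (w, w').
cartesian = set(product(omega, omega_prime))
assert cartesian == {("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")}

# Disjoint sum: tag elements of the first set with 0 and of the second with 1,
# so that shared elements are kept apart rather than merged.
disjoint_sum = {(0, w) for w in omega} | {(1, w) for w in omega_prime}
assert len(disjoint_sum) == len(omega) + len(omega_prime)  # no collapsing
```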
Compositions of experiments are typically represented by Cartesian products of
sample spaces (e.g., flipping a coin n times requires the n-fold Cartesian product
Ωₙ = Ω × ⋯ × Ω), and by iterated mixing (conditioning) constructions.
Slightly more complicated sample spaces are needed in cascade probability
⁵ By agreeing on the universe of sample points that encodes all seriously possible outcomes, we are of course resorting to idealization. If edge were to occur reasonably often in repeated tosses, then we would be making an unrealistic assumption in insisting on Ω = {H, T}. On the other hand, if edge never seems to occur, then it is inconvenient to specify the working sample space by Ω = {H, T, E}.
In probability logic, a probability measure measures the probability that a sentence of a designated formal language is true.
P(E) = #{E} / #{Ω}.

Note, however, that the Laplacean model is confined to finite sample spaces only.
Problem: Suppose two dice are rolled. Find the probability that the sum of
outcomes shown by the first and second die is exactly 6.
Solution:
The sample space for the experiment is given by the Cartesian product
Ω × Ω = {(1, 1), (1, 2), …, (2, 1), (2, 2), …, (6, 6)}, where Ω = {1, 2, 3, 4, 5, 6}.
Now, because there are 5 pairs of outcomes with the same sum 6, namely
(5, 1), (4, 2), (3, 3), (2, 4), (1, 5), the probability of the event described by "The
sum of outcomes shown by the first and second die is exactly 6" is equal to 5/36.
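Under the Laplacean model the answer can be verified by brute-force enumeration; here is a small Python check (our own illustration, using exact rational arithmetic):

```python
from fractions import Fraction
from itertools import product

omega = range(1, 7)
sample_space = list(product(omega, omega))           # Omega x Omega, 36 points
event = [pt for pt in sample_space if sum(pt) == 6]  # pairs summing to 6

prob = Fraction(len(event), len(sample_space))       # Laplacean model
assert prob == Fraction(5, 36)
```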
Note that, in general, a measurable space ⟨Ω, A⟩ admits infinitely many distinct
probability measures, forming a kind of meta-sample space Prob⟨Ω, A⟩ with its
own meta-event algebra, important in higher-order probability theory. Observe
also that unlike in deductive inference, where usually all possible truth functions T are considered and then quantified away, in probability theory the focus
tends to be on just one particular probability measure, namely the one that is
believed to be (in some sense) the correct probability measure for the target
experiment. Hence, probabilistic reasoning tends to be local and model-driven.
We will see more of this style in statistical reasoning.
Now we turn to some of the basic theorems of probability theory. Henceforth, we
shall assume that we have an arbitrary but fixed probability model ⟨Ω, A, P⟩
with events A, B, C, …, belonging to its event algebra A. We ignore a host
of trivial facts, including P(∅) = 0, P(A) ≤ 1, P(Ē) = 1 − P(E), and
P(A) ≤ P(B) if A ⊆ B. Since the probability model above is arbitrary,
the theorems below in effect hold for all probability measures, and they do not
depend on Bayesian, objectivist or other interpretations at all. For this reason, we should
preface each theorem below with the quantifier "for all P in Prob⟨Ω, A⟩".
Suppose that A and B are two sets. Then the set A is said to be a subset of B provided that every element of A is also an element of B. This important relationship between two sets is symbolized succinctly as A ⊆ B.

P7   max{0, P(A) + P(B) − 1} ≤ P(A ∩ B) ≤ min{P(A), P(B)}.

P8   P(A ∪ B) ≥ max{P(A), P(B)}.

¹⁰ This theorem is often referred to as the general addition law, because there is no assumption in it about the disjointness of events.

¹¹ Given two sets A and B, often we shall write AB as short for A ∩ B.
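The two-sided bound P7 and the bound P8 can be spot-checked numerically. The following Python sketch (our illustration) samples random probability measures over the four atoms generated by A and B:

```python
import random

random.seed(0)

# Spot-check P7 and P8 on randomly generated probability measures over the
# four atoms A&B, A&notB, notA&B, notA&notB.
for _ in range(1000):
    weights = [random.random() for _ in range(4)]
    total = sum(weights)
    p_ab, p_a_only, p_b_only, _ = [w / total for w in weights]
    pA, pB = p_ab + p_a_only, p_ab + p_b_only
    p_union = p_ab + p_a_only + p_b_only
    eps = 1e-12  # tolerance for floating-point rounding
    assert max(0.0, pA + pB - 1) - eps <= p_ab <= min(pA, pB) + eps  # P7
    assert p_union + eps >= max(pA, pB)                              # P8
```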
We conclude this section with probability logic, pausing to define probability
functions as analogs of truth functions. Inductive logic studies risky arguments
using probability ideas. Probabilities are expressed as numbers from the unit
interval. Probability logic is deductive logic enriched with a designated probability measure. In this framework logicians can analyze questions of the form
"What is the probability that proposition φ is true?". Thus, the focus is on the
probability of propositions or sentences being true (or false), and not on the
probability that a given event occurs (or does not occur).
Suppose we are given a language L of sentential, predicate or other logic. The
notion of probability in probability logic is treated as a generalization of a truth
function. Specifically, a function P : L → [0, 1] that assigns to each sentence φ
of a formal language L a unique real number 0 ≤ P(φ) ≤ 1 (interpreted
as the probability that φ is true) is called a probability function or probability
measure (or belief state) on L provided that the following conditions hold for
all sentences φ and ψ in L:

(i) P(φ) = 1, if ⊢ φ.

(ii) P(φ) ≤ P(ψ), if φ ⊢ ψ.

(iii) P(¬φ) = 1 − P(φ).

(iv) P(φ ∨ ψ) = P(φ) + P(ψ) − P(φ & ψ).

(v) P(φ & ψ) = P(φ) · P(ψ | φ), where P(ψ | φ) =df P(φ & ψ) : P(φ).

(vi) P(φ → ψ) = P(¬φ ∨ ψ).

(vii) P(∀x φ(x)) = Inf_{a₁, …, aₙ} P(φ(a₁) & ⋯ & φ(aₙ)).
Where a typical probability function P differs from truth functions T is in
specifying logical truth and logical validity. If we wanted to define logical truth
by setting ⊨ φ precisely when P(φ) = 1 for all probability functions P on L,
we would not get anything new beyond the usual logical theorems we obtain
from truth functions. However, if we set

⊨_P φ  if and only if  P(φ) = 1

for just one particular probability function, say P, then we obtain a deductively
closed system of sentences enjoying full belief or certainty in belief state P
that of course always includes all logically true sentences. The most important
(v) If C ⫫ A and C ⫫ B, then C ⫫ A ∪ B, where A and B are disjoint.

(vi) If A and B are disjoint events with P(A) ≠ 0 ≠ P(B), then not A ⫫ B.
As an illustrative example, consider the nickel-and-dime experiment. We are
tossing both a nickel and a dime into the air once, the object being to observe
the upturned faces on both coins. A suitable description of the four possible outcomes
is afforded by four ordered pairs, specifying the sample space
Ω =df {(H, H), (H, T), (T, H), (T, T)}. The event algebra A consists of
all subsets of the sample space. Since we are assuming the Laplacean (equally
likely) model, the probability of the event described by "The nickel falls
heads" and modeled by the subset A = {(H, H), (H, T)} is P(A) = 1/2. Similarly, the probability of the event described by "The dime falls tails" and
represented by the subset B = {(H, T), (T, T)} is P(B) = 1/2. From the physical
nature of the experiment (the coins are physically separate objects) we know
that events A and B should be independent. Since A ∩ B = {(H, T)}, we see
that P(A ∩ B) = 1/4 = 1/2 · 1/2 = P(A) · P(B), hence A ⫫ B, as suspected.
Clearly, a given pair of events A and B may be probabilistically independent with respect to a probability measure P, but not necessarily with respect
to some other probability measure P′. In particular, note that if the probability measure P′ on the measurable space ⟨Ω, A⟩ above is now intended
to represent biased coins, given by (say) P′({(H, H)}) = 0.5, P′({(H, T)}) =
0.4, P′({(T, H)}) = 0.05, and P′({(T, T)}) = 0.05, then we have P′(A) =
0.9, P′(B) = 0.45, P′(A ∩ B) = 0.400, but P′(A) · P′(B) = 0.405, so that with
respect to P′ the events A and B are not probabilistically independent.
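Both calculations, for the fair measure P and the biased measure P′, can be verified with exact arithmetic; the following Python sketch is our own illustration:

```python
from fractions import Fraction

omega = [("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")]
A = [w for w in omega if w[0] == "H"]          # the nickel falls heads
B = [w for w in omega if w[1] == "T"]          # the dime falls tails
AB = [w for w in A if w in B]

# Laplacean (fair) measure: every sample point has probability 1/4.
P = lambda event: Fraction(len(event), 4)
assert P(AB) == P(A) * P(B) == Fraction(1, 4)  # independent under P

# Biased measure P' from the text.
weights = {("H", "H"): Fraction(5, 10), ("H", "T"): Fraction(4, 10),
           ("T", "H"): Fraction(1, 20), ("T", "T"): Fraction(1, 20)}
Pp = lambda event: sum(weights[w] for w in event)
assert Pp(A) == Fraction(9, 10) and Pp(B) == Fraction(9, 20)
assert Pp(AB) != Pp(A) * Pp(B)                 # not independent under P'
```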
In general, with respect to a so-called Dirac {0, 1}-valued probability measure
P_ω, defined by P_ω(A) = 1 if ω ∈ A and P_ω(B) = 0 if ω ∉ B for all events
A and B, all events are probabilistically independent. In particular, since each
truth function T is a probability measure over the formal language of sentential
logic, all pairs of sentences are automatically T-independent. Suppose an experiment consists of rolling a die twice. It is easy to see that the event described
by "Getting an odd number in the first roll" is probabilistically independent
of "Getting an even number in the second roll".
If there are three events A, B, C, they are called pairwise independent just in
case A ⫫ B, A ⫫ C, and B ⫫ C. However, multiple independence or mutual
independence, in symbols ⫫(A, B, C), means considerably more, namely, pairwise independence and the equality P(A ∩ B ∩ C) = P(A) · P(B) · P(C). The last
condition is not reducible to pairwise independence. The concept of multiple
independence readily extends to any string of events.
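That the triple condition is genuinely stronger than pairwise independence is shown by a standard example (our illustration, not taken from the handout): flip a fair coin twice and let C be the event that the two outcomes differ.

```python
from fractions import Fraction
from itertools import product

omega = list(product("HT", repeat=2))          # two fair coin flips
P = lambda event: Fraction(len(event), len(omega))

A = [w for w in omega if w[0] == "H"]          # first flip heads
B = [w for w in omega if w[1] == "H"]          # second flip heads
C = [w for w in omega if w[0] != w[1]]         # the two outcomes differ

inter = lambda X, Y: [w for w in X if w in Y]

# Pairwise independence holds...
assert P(inter(A, B)) == P(A) * P(B)
assert P(inter(A, C)) == P(A) * P(C)
assert P(inter(B, C)) == P(B) * P(C)
# ...but the triple condition fails: A and B together determine C.
assert P(inter(inter(A, B), C)) != P(A) * P(B) * P(C)
```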
Suppose we are given an arbitrary but fixed probability model ⟨Ω, A, P⟩ and a
fixed event C with P(C) > 0. The conditional probability of event A, given
conditioning event C, denoted by P(A | C),¹² is defined by the numerical fraction

P_C(A) = P(A | C) =df P(A ∩ C) / P(C).

As an example, let Ω = {1, 2, 3, 4, 5, 6} be the sample space for the roll of a
die. Let B = {2, 4, 6} and A = {2}. Assume the Laplacean (equally likely)
model, so that P(B) = 1/2 and P(A) = 1/6. Then it is easy to calculate that
P(A | B) = 1/3 and P(B | A) = 1.

In the Laplacean (equally likely) model we have

P(A | C) = #{A ∩ C} / #{C}.
A conditional probability measure is just like any other probability measure
in that it satisfies the following properties of probability measures, referred to
earlier:

CP1   P(Ω | C) = 1.

CP2   P(A | C) = 1, if C ⊆ A.

CP3   P(A ∩ C) = P(C) · P(A | C).

CP4   P(A | C) ≥ 0.

CP5   P(Ā | C) = 1 − P(A | C).

CP6   P(B − A | C) = P(B | C) − P(A | C), if A ⊆ B.

CP7   P(A ∪ B | C) = P(A | C) + P(B | C) − P(AB | C).

CP8   P(A | C) = P(AB | C) + P(AB̄ | C).

CP9   P(A ∩ B | C) = P(A | C) · P(B | A ∩ C).

CP10  P(A ∩ B ∩ C) = P(A) · P(B | A) · P(C | A ∩ B), where the conditioning events have a positive probability.

CP11  P(A) = P(B) · P(A | B) + P(B̄) · P(A | B̄),

where P(A) ≠ 0 and P(AB) ≠ 0. The only difference is that in the setting
of deductive logic conditioning (using the symbol →) is quite simple, whereas
in probability theory conditioning (using the symbol |) can be rather complex. Earlier
we mentioned that iterated truth function conditioning of the form (T_P)_Q =
T_{P & Q} does not offer anything new.¹³ This is also true in the case of iterated
conditionalization of probability measures, since we have P((A | B) | C) =
P(A | B ∩ C). In probability theory a major complication arises from the fact
that a probability measure P can take its values in the entire unit interval [0, 1].

¹² Another reading is: The probability that event A will occur, given that event B occurs.
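Both CP11 and the collapse of iterated conditionalization can be verified on a concrete model; in the following Python sketch (our own illustration) the events A, B, C over two die rolls are arbitrary choices:

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))    # two die rolls, Laplacean model
P = lambda ev: Fraction(len(ev), len(omega))
inter = lambda X, Y: [w for w in X if w in Y]
cond = lambda A, C: P(inter(A, C)) / P(C)       # P(A | C)

A = [w for w in omega if w[0] + w[1] == 7]      # sum of the rolls is 7
B = [w for w in omega if w[0] >= 4]             # first roll is at least 4
C = [w for w in omega if w[1] % 2 == 0]         # second roll is even

# CP11 (total probability): P(A) = P(B) P(A|B) + P(not-B) P(A|not-B)
notB = [w for w in omega if w not in B]
assert P(A) == P(B) * cond(A, B) + P(notB) * cond(A, notB)

# Iterated conditioning collapses: conditioning P_C by B gives P(A | B ∩ C)
cond_iter = cond(inter(A, B), C) / cond(B, C)
assert cond_iter == cond(A, inter(B, C))
```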
3. Bayes' Theorem and Its Applications to Complex Experiments
Given an arbitrary but fixed probability model ⟨Ω, A, P⟩, for any pair of events
A and B, with P(A) · P(B) ≠ 0, we can write down two conditional probability
formulas

P(A | B) = P(A ∩ B) / P(B)   and   P(B | A) = P(A ∩ B) / P(A),

giving

P(A) · P(B | A) = P(B) · P(A | B),

which is easily seen to be the probabilistic counterpart of the tautology
[P & (P → Q)] ↔ [Q & (Q → P)] and that of the truth functional identity

T(P) · T_P(Q) = T(Q) · T_Q(P).
It is easy to see that the foregoing equality can be rewritten into one of the
simplest forms of the so-called Bayes formula

P(A | B) = P(A) · P(B | A) / P(B)

or

P(A | B) = P(A) · P(B | A) / [P(A) · P(B | A) + P(Ā) · P(B | Ā)].

These formulas are used whenever the quantities P(A), P(B | A) and P(B | Ā)
are given or can be calculated. The typical situation arises when event B occurs
logically or temporally after A, so that the probabilities P(A) and P(B | A)
can be readily computed. In applications, Bayes' formula is used when we know
the effect of a cause and we wish to make some inference about the cause.
P(R₁ ∩ W₂ ∩ B₃) = 6/(6+4+5) · 4/(6+4+5) · 5/(6+4+5) = 8/225.
Example 2: Three balls are drawn from a box containing 6 red balls, 4 white
balls, and 5 blue balls. Find the probability that they are drawn in the
order red, white and blue, if none of the balls is replaced.
Solution: We shall use the same notation as in Example 1. Now, in view of

P(R₁ ∩ W₂ ∩ B₃) = P(R₁) · P(W₂ | R₁) · P(B₃ | R₁ ∩ W₂),

we have

P(R₁ ∩ W₂ ∩ B₃) = 6/(6+4+5) · 4/(5+4+5) · 5/(5+3+5) = 4/91.
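The two multiplication-law computations, without and with replacement, can be checked in exact arithmetic (our illustration):

```python
from fractions import Fraction

# Drawing red, white, blue in order from a box of 6 red, 4 white, 5 blue balls.
# Without replacement the denominator shrinks by one after each draw.
p = Fraction(6, 15) * Fraction(4, 14) * Fraction(5, 13)
assert p == Fraction(4, 91)

# With replacement (Example 1) the denominator stays fixed at 15.
p_repl = Fraction(6, 15) * Fraction(4, 15) * Fraction(5, 15)
assert p_repl == Fraction(8, 225)
```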
Example 3: Suppose Box No. I contains 3 red and 2 blue marbles, while
Box No. II contains 2 red and 8 blue marbles. A fair coin is tossed. If
the coin turns up heads, a marble is chosen from Box I; if it turns up
tails, a marble is chosen from Box II. Find the probability that a red
marble is chosen. (Note that here the sample space for the coin can be
identified with Ω = {Box I, Box II}, and the sample spaces for the Boxes
are Ω₅ = {r₁, r₂, r₃, b₁, b₂} and Ω₁₀ = {r₁, r₂, b₁, b₂, …, b₈}, respectively.
The composite sample space has the conditional form Ω(Ω₅, Ω₁₀),
prompting the use of conditional probability.)

Solution: Let R denote the event described by "A red marble is chosen"
and let I denote the event that Box I is chosen. Likewise, let II denote the
event that Box II is chosen. Now the probability of choosing a red marble
is calculated using Theorem CP11 as follows:

P(R) = P(I) · P(R | I) + P(II) · P(R | II) = 1/2 · 3/(3+2) + 1/2 · 2/(2+8) = 2/5.
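The CP11 computation can be checked directly (our illustration):

```python
from fractions import Fraction

p_I = p_II = Fraction(1, 2)          # fair coin chooses the box
p_red_given_I = Fraction(3, 3 + 2)   # Box I: 3 red, 2 blue
p_red_given_II = Fraction(2, 2 + 8)  # Box II: 2 red, 8 blue

# Theorem CP11 (total probability)
p_red = p_I * p_red_given_I + p_II * p_red_given_II
assert p_red == Fraction(2, 5)
```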
Example 4: Suppose Box No. I contains 3 red and 2 blue marbles, while Box
No. II contains 2 red and 8 blue marbles. A fair coin is tossed, but this
time it is not revealed whether it has turned up heads or tails, so that the
box from which a marble was chosen is not revealed. But we do know that
a red marble was chosen. What is the probability that Box I was chosen,
i.e., that the coin turned up heads, given that a red marble was chosen?
Solution: This problem requires Bayes' Theorem. Let R denote the event
described by "A red marble is chosen" and let I denote the event that
Box I is chosen. Likewise, let II denote the event that Box II is chosen.
Now the probability that Box I was chosen, given that a red marble was
chosen, is calculated using Bayes' Theorem:

P(I | R) = P(I) · P(R | I) / [P(I) · P(R | I) + P(II) · P(R | II)]
         = (1/2 · 3/(3+2)) / (1/2 · 3/(3+2) + 1/2 · 2/(2+8)) = 3/4.
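The posterior can be checked in exact arithmetic (our illustration):

```python
from fractions import Fraction

p_I = p_II = Fraction(1, 2)          # fair coin chooses the box
p_red_given_I = Fraction(3, 5)       # Box I: 3 red out of 5 marbles
p_red_given_II = Fraction(2, 10)     # Box II: 2 red out of 10 marbles

# Bayes' formula: posterior probability of Box I given a red draw
posterior = (p_I * p_red_given_I) / (p_I * p_red_given_I + p_II * p_red_given_II)
assert posterior == Fraction(3, 4)
```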