Basic Probability Theory
This material is extracted from the books of Robert B. Ash
1 Basic concepts
The classical definition of probability is the following: the probability of an
event is the number of outcomes favorable to the event, divided by the total
number of outcomes, where all outcomes are equally likely. This definition
is restrictive (finite number of outcomes) and circular (equally likely =
equally probable).
The frequency approach is based on physical observations of the following
type: if an unbiased coin is tossed independently n times, where n is very
large, the relative frequency of heads is likely to be close to 1/2. We would
like to define the probability of an event as the limit of Sₙ/n, where Sₙ is
the number of occurrences of the event in the first n trials. However, it is
possible that Sₙ/n converges to any number between 0 and 1, or has no limit
at all.
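The frequency approach is easy to observe numerically; the following is a minimal simulation sketch (Python; the function name, the seed, and the sample sizes are our own illustrative choices, not from the source):

    import random

    def relative_frequency_of_heads(n, seed=0):
        """Toss a fair coin n times and return S_n / n, where S_n is the
        number of heads observed."""
        rng = random.Random(seed)
        heads = sum(rng.random() < 0.5 for _ in range(n))
        return heads / n

    for n in (100, 10_000, 1_000_000):
        print(n, relative_frequency_of_heads(n))
    # The printed ratios approach 1/2 as n grows, but nothing in a single
    # simulated sequence guarantees convergence: this is exactly the
    # difficulty pointed out above.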
We need a rigorous definition of probability to construct a mathematical
theory. We now introduce the basic concepts of mathematical probability
theory.
In the experiment of tossing a single die, we can choose Ω = {1, 2, 3, 4, 5, 6}.
Another choice: Ω = {N is even, N is odd}, but if we are interested,
for example, in whether or not N ≥ 3, this second space is not useful.
or = union: A ∪ B.
and = intersection: A ∩ B.
not = complement: A^c = Ω \ A.
For example, in the experiment of tossing a single die with Ω = {1, 2, 3, 4, 5, 6}
and N the result, we can consider the following events: A = {N ≥ 3} =
{3, 4, 5, 6} and B = {N is even} = {2, 4, 6}. Then
A ∪ B = {N ≥ 3 or N is even} = {2, 3, 4, 5, 6}.
Events A₁, A₂, ... are said to be mutually exclusive (disjoint) if Aᵢ ∩ Aⱼ = ∅ for i ≠ j.
In some ways the algebra of events is similar to the algebra of real numbers,
with union corresponding to addition and intersection to multiplication.
For example, the commutative and associative properties hold:
A ∪ B = B ∪ A, A ∪ (B ∪ C) = (A ∪ B) ∪ C
A ∩ B = B ∩ A, A ∩ (B ∩ C) = (A ∩ B) ∩ C
In many ways the algebra of events differs from the algebra of real numbers,
as some of the identities below indicate.
A ∪ A = A, A ∪ A^c = Ω
A ∩ A = A, A ∩ A^c = ∅
A ∪ ∅ = A, A ∩ Ω = A
A ∪ Ω = Ω, A ∩ ∅ = ∅
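These identities are easy to check mechanically; here is a small sketch using Python's built-in sets, on the die experiment above (the variable names are ours):

    # Verify the identities on Omega = {1, ..., 6} with A = {N >= 3}.
    omega = {1, 2, 3, 4, 5, 6}
    A = {3, 4, 5, 6}
    Ac = omega - A          # complement of A

    assert A | A == A and A & A == A                   # A u A = A,      A n A = A
    assert A | Ac == omega and A & Ac == set()         # A u A^c = Omega, A n A^c = empty
    assert A | set() == A and A & omega == A           # A u empty = A,  A n Omega = A
    assert A | omega == omega and A & set() == set()   # A u Omega = Omega, A n empty = empty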
1.1.3 Class of events
In some situations, we may not have complete information about the outcomes
of an experiment. For example, suppose the experiment involves tossing a
coin three times, so that

Ω = {(a₁, a₂, a₃) | aᵢ ∈ {H, T}, i = 1, 2, 3},

and consider, for instance, the subset A = {the third toss is heads}.
Imagine that after the experiment is performed, we only have the following
information about the outcome: the first two tosses, say ω = (H, T, ·).
In this case, we are not able to give a yes or no answer to the question
"Is ω ∈ A?" (this depends on the result of the last toss, and we miss this
information). So, A is not measurable with respect to the given information.
In contrast, the subset B = {(T, T, T), (T, T, H)}, which corresponds to
the condition "the first two tosses are tails", is an event. Indeed, with the
information about the first two tosses, we are always able to say whether
ω is in B or not, for all possible outcomes ω.

This leads us to consider a particular class of subsets of Ω, called the
class of events. The standard notation for the class of events is F. For
reasons of mathematical consistency, we require that F form a sigma field
(σ-field), which is a collection of subsets of Ω satisfying the following
three requirements:
1. Ω ∈ F.
2. If A ∈ F, then A^c ∈ F.
3. If A₁, A₂, ... ∈ F, then ⋃ₙ Aₙ ∈ F (F is closed under countable union).
The above conditions imply also that F is closed under finite or countable
intersection, and that the empty set belongs to F (exercise).

Examples of sigma fields:
F = {∅, Ω}, the trivial sigma field.

If Ω is a part of R (for instance, R, R₊, or [0, 1]), we will typically
consider the sigma field B of Borel sets: the smallest sigma field
containing the intervals (and, in consequence, unions and intersections
of intervals).
We now consider the assignment of probabilities to events. The probability
of an event should somehow reflect the relative frequency of the event in a
large number of independent repetitions of the experiment. Thus, if A ∈ F,
the probability P(A) should be a number between 0 and 1, with P(∅) = 0
and P(Ω) = 1. Furthermore, if the events A and B are disjoint (cannot
occur at the same time), then the number of occurrences of A ∪ B is the
sum of the number of occurrences of A and the number of occurrences of B,
so we should have P(A ∪ B) = P(A) + P(B). This motivates the following
definition.
A function that assigns a number P(A) to each set A in the sigma field F
is called a probability measure on F, provided that the following conditions
are satisfied:

1. P(A) ≥ 0 for every A ∈ F
2. P(Ω) = 1
3. If A₁, A₂, ... are disjoint sets in F, then P(⋃ₙ Aₙ) = Σₙ P(Aₙ)
   (countable additivity).
Note that, in general, a countably additive measure cannot be consistently
defined on all subsets of a line (this is the reason why we consider only the
Borel sets in Rⁿ).
From this definition, we can deduce the following properties of a probability
measure (exercise):

P(∅) = 0
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
If B ⊂ A, then P(B) ≤ P(A).
P(A₁ ∪ A₂ ∪ ···) ≤ P(A₁) + P(A₂) + ···
If the sets Aₙ are nondecreasing, i.e. Aₙ ⊂ Aₙ₊₁, n ≥ 1, then
P(⋃_{n≥1} Aₙ) = lim_{n→∞} P(Aₙ).
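The second property (inclusion-exclusion) is easy to verify exhaustively on the die example above; a small sketch of our own:

    from fractions import Fraction

    # Uniform probability on the die sample space: P(E) = |E| / 6.
    omega = {1, 2, 3, 4, 5, 6}
    P = lambda E: Fraction(len(E), len(omega))

    A = {3, 4, 5, 6}    # N >= 3
    B = {2, 4, 6}       # N is even

    assert P(A | B) == P(A) + P(B) - P(A & B)   # inclusion-exclusion
    assert P(set()) == 0 and P(omega) == 1      # P(empty) = 0, P(Omega) = 1
    assert P(A | B) <= P(A) + P(B)              # subadditivity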
Exercises
1. Let Ω = {ω₁, ω₂, ..., ωₙ, ...} be a countable sample space and F be
the class of all subsets of Ω. To each sample point ωₙ let us attach an
arbitrary weight pₙ subject to the conditions

∀n: pₙ ≥ 0, Σₙ pₙ = 1.

Now for any subset A of Ω we define its probability to be the sum of
the weights of all points in it. In symbols,

∀A ⊂ Ω, P(A) = Σ_{ωₙ ∈ A} pₙ.

Show that P is a probability measure on F.
2. If Ω is countably infinite, may all the sample points ωₙ be equally
likely (that is, may all pₙ in the previous example be equal)?
3. Let Ω be a region of the plane with finite, nonzero area, and for a
(measurable) subset A ⊂ Ω let |A| denote its area. Show that

P(A) = |A| / |Ω|

defines a probability measure.
4. Suppose that the land of a square kingdom is divided into three strips
A, B, C of equal area, and suppose the value per unit area is in the ratio
of 1 : 3 : 2. For any piece of (measurable) land S in this kingdom, the
relative value with respect to that of the kingdom is then given by the
formula

V(S) = [P(S ∩ A) + 3P(S ∩ B) + 2P(S ∩ C)] / 2

where P is as in the previous exercise. Show that V is a probability
measure.
1.2 Independence
Consider the following experiment. A person is selected at random and his
height is recorded. After this, the last digit of the licence number of the next
car to pass is noted. If A is the event that the height is over 1m70, and B is
the event that the digit is 7, then, intuitively, A and B are independent:
the knowledge about the occurrence or nonoccurrence of one of the events
should not influence the odds about the other.
In other words, we expect that the relative frequency of occurrence of B
among all trials should be the same as its relative frequency among only
those trials on which A occurs. This leads to the following definition: the
events A and B are independent if

P(A ∩ B) = P(A)P(B).
We list below some properties of independent events (the proofs are left
as an exercise).
Events A, B, and C are said to be (mutually) independent if
P(A ∩ B) = P(A)P(B), P(A ∩ C) = P(A)P(C), P(B ∩ C) = P(B)P(C),
and P(A ∩ B ∩ C) = P(A)P(B)P(C). Conversely, it is possible to have
P(A ∩ B) = P(A)P(B), P(A ∩ C) = P(A)P(C), P(B ∩ C) = P(B)P(C),
but P(A ∩ B ∩ C) ≠ P(A)P(B)P(C). Thus A and B are independent, as
are A and C, and also B and C, but A, B, and C are not independent.
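A standard concrete instance of this phenomenon (not the one worked out in the source, but equivalent): toss two fair coins, and let A = "first toss is heads", B = "second toss is heads", C = "the two tosses differ". The sketch below checks pairwise independence, and the failure of triple independence, by enumeration:

    from itertools import product
    from fractions import Fraction

    # Sample space of two fair coin tosses; each outcome has probability 1/4.
    omega = set(product("HT", repeat=2))
    P = lambda E: Fraction(len(E), len(omega))

    A = {w for w in omega if w[0] == "H"}     # first toss is heads
    B = {w for w in omega if w[1] == "H"}     # second toss is heads
    C = {w for w in omega if w[0] != w[1]}    # the two tosses differ

    assert P(A & B) == P(A) * P(B)            # pairwise independent
    assert P(A & C) == P(A) * P(C)
    assert P(B & C) == P(B) * P(C)
    assert P(A & B & C) != P(A) * P(B) * P(C) # 0 on the left, 1/8 on the right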
Exercises
1. What can you say about the event A if it is independent of itself? If
the events A and B are disjoint and independent, what can you say
about them?
1.3 Conditional probability

Suppose the experiment is repeated N times; let N_A be the number of trials
on which A occurs, and N_{AB} the number of trials on which both A and B
occur. The relative frequency of B among those trials on which A occurs is
then

N_{AB}/N_A = (N_{AB}/N) / (N_A/N).

This discussion suggests the following definition. If P(A) > 0, the
conditional probability of B given A is defined as

P(B | A) = P(A ∩ B)/P(A).

Note that if A and B are independent, then

P(B | A) = P(A ∩ B)/P(A) = P(A)P(B)/P(A) = P(B),

which is in accordance with the intuition.
We have P(A ∩ B) = P(A)P(B | A), and we can extend this formula
to more than two events. Similarly,

P(A₁ ∩ A₂ ∩ A₃) = P(A₁) P(A₂ | A₁) P(A₃ | A₁ ∩ A₂),

and so on. For example, if three cards are drawn without replacement from
a standard 52-card deck and Aᵢ is the event that the i-th card is not a
spade, then

P(A₁ ∩ A₂ ∩ A₃) = P(A₁) P(A₂ | A₁) P(A₃ | A₁ ∩ A₂) = (39/52)(38/51)(37/50).
We now formulate the most useful results on conditional probabilities.

1.3.1 Theorem of total probability

Let B₁, B₂, ... be mutually exclusive events whose union is Ω, with
P(Bᵢ) > 0 for all i. Then, for any event A,

P(A) = Σᵢ P(Bᵢ) P(A | Bᵢ).
1.3.2 Bayes' theorem
Notice that under the above assumptions we have

P(Bₖ | A) = P(A ∩ Bₖ)/P(A) = P(Bₖ)P(A | Bₖ) / Σᵢ P(Bᵢ)P(A | Bᵢ).

Example. A coin with probability 1/3 of heads is tossed. If it lands heads
(event B₁), a ball is drawn from an urn containing 2 black and 3 white
balls; if it lands tails (event B₂), a ball is drawn from a second urn
containing 3 black and 1 white ball. Suppose that we did not observe the
whole experiment but only the final result: a black ball is drawn. We would
like to estimate the a posteriori probability that the coin fell heads. Let
C = {a black ball is drawn}. We use Bayes' theorem to compute P(B₁ | C):

P(B₁ | C) = P(B₁)P(C | B₁) / [P(B₁)P(C | B₁) + P(B₂)P(C | B₂)]
          = (1/3)(2/5) / [(1/3)(2/5) + (2/3)(3/4)] = 4/19.
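The computation is easy to reproduce exactly with rational arithmetic; a minimal sketch (the variable names are ours):

    from fractions import Fraction

    p_B1, p_B2 = Fraction(1, 3), Fraction(2, 3)       # prior: heads / tails
    p_C_B1, p_C_B2 = Fraction(2, 5), Fraction(3, 4)   # P(black | urn chosen)

    posterior = p_B1 * p_C_B1 / (p_B1 * p_C_B1 + p_B2 * p_C_B2)
    print(posterior)   # 4/19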
2 Random variables
2.1 Definition of a random variable
Intuitively, a random variable is a quantity that is measured in connection
with a random experiment. If Ω is a sample space, and the outcome of the
experiment is ω, a measuring process is carried out to obtain a number R(ω).
Thus a random variable is a real-valued function on a sample space Ω.
Examples: the number of heads in n tosses of a coin; the result of the toss
of a die; the length of a telephone call.
If we are interested in a random variable R, we generally want to know
the probability of events involving R. In general these events are of the form
"R lies in a set B ⊂ R". For instance, "R is less than 5" or "R lies in the
interval [a, b)".

Notation. The event {ω : a ≤ R(ω) < b} will often be abbreviated to
{a ≤ R < b}. We denote its probability P(a ≤ R < b).
Example. A biased coin is tossed independently n times, with probability
p of coming up heads on a given toss. Let R be the number of heads. Then,
for integers k ≤ l,

P(k ≤ R ≤ l) = Σ_{i=k}^{l} C_n^i p^i (1 - p)^{n-i}.
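This sum is straightforward to evaluate numerically; a sketch using only the Python standard library (the helper name is ours):

    from math import comb

    def binom_interval_prob(n, p, k, l):
        """P(k <= R <= l) for R = number of heads in n tosses, P(heads) = p."""
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, l + 1))

    # e.g. between 4 and 6 heads in 10 tosses of a fair coin
    print(binom_interval_prob(10, 0.5, 4, 6))   # 0.65625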
Remember that Borel sets in R are the sets obtained from the intervals
by applying the operations of union and intersection. The last condition
in the definition of a random variable requires that, for every Borel set B,
the set {ω : R(ω) ∈ B} belongs to F; it means that the assertions of the
form "R belongs to a Borel set B" are events on our probability space.

If R is a discrete random variable with possible values x₁, x₂, ..., its
probability function is defined by

p_R(xᵢ) = P(R = xᵢ), i = 1, 2, ...
We say that R has masses of probability at the points xᵢ. The probability
function defines the probabilities of all events involving R:

P(R ∈ B) = Σ_{xᵢ ∈ B} P(R = xᵢ) = Σ_{xᵢ ∈ B} p_R(xᵢ).
Recall that the distribution function of R is defined by F_R(x) = P(R ≤ x),
and that R is absolutely continuous with density f_R if
F_R(x) = ∫_{-∞}^{x} f_R(y) dy. From the definition, it follows that the
distribution function of an absolutely continuous random variable is a
continuous function. Note that if F_R is differentiable at x, then its
derivative is given by f_R:

(d/dx) F_R(x) = (d/dx) ∫_{-∞}^{x} f_R(y) dy = f_R(x).

Example. R has the uniform distribution on [a, b] if its density is

f_R(x) = { 1/(b-a), x ∈ [a, b],
         { 0, otherwise.
Notation. The uniform distribution on [a, b] is denoted U([a, b]), and we
write R ∼ U([a, b]).
Example. R has the normal distribution N(μ, σ²) if its density is

f_R(x) = (1/(σ√(2π))) e^{-(x-μ)²/(2σ²)}.

The distribution function of a normal random variable cannot be written in
closed form, but it is well known and its values are listed in tables or may
be easily obtained on a computer. If μ = 0 and σ = 1, we say that R has a
standard normal distribution. There is a standard notation for the
distribution function in this case:

Φ(x) = (1/√(2π)) ∫_{-∞}^{x} e^{-y²/2} dy.
Example. R has the exponential distribution with parameter λ > 0 if its
density is

f_R(x) = { λe^{-λx}, x ≥ 0,
         { 0, x < 0.

Its distribution function is

F_R(x) = { 1 - e^{-λx}, x ≥ 0,
         { 0, x < 0.
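A quick empirical sanity check of this distribution function; the rate λ = 0.5, the seed, and the sample size are arbitrary choices of ours:

    import math
    import random

    lam = 0.5
    rng = random.Random(1)
    samples = [rng.expovariate(lam) for _ in range(100_000)]

    for x in (0.5, 1.0, 3.0):
        empirical = sum(s <= x for s in samples) / len(samples)
        exact = 1 - math.exp(-lam * x)        # F_R(x) = 1 - e^{-lambda x}
        print(x, round(empirical, 4), round(exact, 4))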
For a mixed random variable, we can identify from the distribution function
the intervals where R has a density f_R and the points where R has masses
of probability. For example, let

F_R(x) = { 0, x < 0,
         { (x + 30)/200, 0 ≤ x < 120,
         { 1, x ≥ 120.

On the intervals (-∞, 0), (0, 120), and (120, ∞), R has a density function

g_R(x) = F_R'(x) = { 1/200, x ∈ (0, 120),
                   { 0, x < 0 or x > 120,

and at the jump points of F_R it has masses of probability P(R = 0) = 0.15
and P(R = 120) = 0.25. Probabilities are then computed as

P(R ∈ B) = ∫_B g_R(y) dy + Σ_{xᵢ ∈ B} P(R = xᵢ),

where xᵢ are the points where R has masses of probability. For instance, in
the example above,

P(R > 100) = ∫_{100}^{120} g_R(y) dy + P(R = 120)
           = ∫_{100}^{120} (1/200) dy + 0.25 = 0.1 + 0.25 = 0.35.
Any distribution function F satisfies the following properties:

1. F is nondecreasing.
2. lim_{x→+∞} F(x) = 1.
3. lim_{x→-∞} F(x) = 0.
Definition 10. The joint distribution function of two arbitrary random
variables R₁ and R₂ on the same probability space is defined by

F₁₂(x, y) = P(R₁ ≤ x, R₂ ≤ y).

In a similar way, we can define a random vector (R₁, R₂, ..., Rₙ) with
joint distribution function

F₁₂...ₙ(x₁, ..., xₙ) = P(R₁ ≤ x₁, ..., Rₙ ≤ xₙ).
Example. Suppose (R₁, R₂) has joint density f₁₂(x, y) = 1 for 0 ≤ x ≤ 1,
0 ≤ y ≤ 1, and 0 elsewhere. (This is the uniform density on the unit
square.) Let us calculate the probability that 1/2 ≤ R₁ + R₂ ≤ 3/2:

P(1/2 ≤ R₁ + R₂ ≤ 3/2) = ∫∫_{1/2 ≤ x+y ≤ 3/2} 1 dx dy = 1 - 2·(1/8) = 3/4,

since the two corner triangles of the unit square excluded by the condition
each have area 1/8.
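The geometric answer 3/4 agrees with a brute-force Monte Carlo estimate; a minimal sketch (sample size and seed are arbitrary):

    import random

    rng = random.Random(2)
    n = 1_000_000
    hits = sum(0.5 <= rng.random() + rng.random() <= 1.5 for _ in range(n))
    print(hits / n)   # close to 0.75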
2.5 Relationship between joint and individual distributions
In this section, we investigate the relationship between joint and individual
distributions of random variables defined on the same probability space.

First, if (R₁, R₂) is absolutely continuous with joint density f₁₂, can we
recover the densities of R₁ and R₂? The answer to this question is positive,
and the individual densities (also called marginal densities) are given by

f₁(x) = ∫ f₁₂(x, y) dy, f₂(y) = ∫ f₁₂(x, y) dx.

Indeed, we have F₁(x) = P(R₁ ≤ x) = P(R₁ ≤ x, R₂ < ∞) =
∫_{-∞}^{x} (∫_{-∞}^{∞} f₁₂(u, y) dy) du, and differentiating with respect
to x gives the first formula.

Conversely, if R₁ and R₂ are each absolutely continuous, is (R₁, R₂)
absolutely continuous? The answer to this question is negative; that is, if
R₁ and R₂ are each absolutely continuous, then (R₁, R₂) is not necessarily
absolutely continuous.
Furthermore, even if (R₁, R₂) is absolutely continuous, f₁(x) and f₂(x) do
not determine f₁₂(x, y). The examples below illustrate these statements.

For the first statement, take R₁ absolutely continuous and put R₂ = R₁,
so that (R₁, R₂) is concentrated on the diagonal line L = {(x, x) : x ∈ R}.
If (R₁, R₂) had a density f₁₂, we would get
1 = P((R₁, R₂) ∈ L) = ∫∫_L f₁₂(x, y) dx dy = 0,
since L has area 0. This contradiction proves that (R₁, R₂) cannot have a
density.

Thus the individual densities are not sufficient to determine the joint
density. However, there is one situation where the individual densities do
determine the joint density: when the random variables are independent.
Definition. The random variables R₁, ..., Rₙ are independent if, for all
Borel sets B₁, ..., Bₙ,

P(R₁ ∈ B₁, ..., Rₙ ∈ Bₙ) = P(R₁ ∈ B₁) ··· P(Rₙ ∈ Bₙ).

Note that the last equality in itself does not imply that the events
{R₁ ∈ B₁}, ..., {Rₙ ∈ Bₙ} are independent. However, since we require that
this equality hold for all Borel subsets B₁, ..., Bₙ, it holds for any
subfamily of the Bᵢ: indeed, it is sufficient to replace the other ones by
(-∞, ∞). For example, in the case n = 3,

P(R₁ ∈ B₁, R₂ ∈ B₂) = P(R₁ ∈ B₁, R₂ ∈ B₂, R₃ ∈ (-∞, ∞))
= P(R₁ ∈ B₁)P(R₂ ∈ B₂)P(R₃ ∈ (-∞, ∞)) = P(R₁ ∈ B₁)P(R₂ ∈ B₂).

If R₁, ..., Rₙ are independent and absolutely continuous, then
f₁₂...ₙ(x₁, ..., xₙ) = f₁(x₁) ··· fₙ(xₙ). Thus in this sense the joint
density is the product of the individual densities.
2.7 Problems
1. An absolutely continuous random variable R has density function
f(x) = (1/2)e^{-|x|}. Compute the probability of each of the following
events:
i. {|R| ≤ 2}
ii. {|R| ≤ 2 or R ≥ 0}
iii. {|R| ≤ 2 and R ≥ 1}
iv. {|R| + |R - 3| ≤ 3}
v. {R³ - R² - R - 2 ≥ 0}
vi. {e^{sin R} ≥ 1}
vii. {R is irrational} = {ω : R(ω) is an irrational number}
3 Expectation
The physical meaning of the expectation of a random variable is the average
value of this variable in a very large number of independent repetitions of the
random experiment. Before we make this definition mathematically precise,
let us consider the following example.
Suppose that we observe the length of a telephone call made from a
specific phone booth at a given time of the day (say, the first call after 12
o'clock). Suppose that the cost R₂ of a call depends on its length R₁ in the
following way:

If 0 ≤ R₁ ≤ 3 (minutes), then R₂ = 10 (cents)
If 3 < R₁ ≤ 6, then R₂ = 20
If 6 < R₁ ≤ 9, then R₂ = 30
Observe how we have computed the average:

E[R] = Σₓ x P(R = x).
Example. Let R be the number of successes in n independent trials, with
success probability p on each trial, so that P(R = k) = C_n^k p^k q^{n-k}
with q = 1 - p. Then

E[R] = Σ_{k=0}^{n} k C_n^k p^k q^{n-k} = Σ_{k=1}^{n} [n!/((k-1)!(n-k)!)] p^k q^{n-k}
     = np Σ_{k=1}^{n} [(n-1)!/((k-1)!(n-k)!)] p^{k-1} q^{n-k}
     = np Σ_{l=0}^{n-1} [(n-1)!/(l!(n-1-l)!)] p^l q^{n-1-l}
     = np Σ_{l=0}^{n-1} C_{n-1}^l p^l q^{n-1-l} = np(p + q)^{n-1} = np.
Example. Consider a simple random variable R with probability function

x        -3    -2    0     10    15
p_R(x)   0.1   0.35  0.2   0.05  0.3
If R is discrete with infinitely (countably) many possible values, the
expectation is defined globally in the same way, but there is a little
complication: an infinite sum is not always convergent. This leads to the
following construction. Let R⁺ = max(R, 0) and R⁻ = max(-R, 0) be the
positive and negative parts of R. We have R = R⁺ - R⁻. For example, for
the simple random variable R above, the probability functions of R⁺ and R⁻
are given by

x       0                      10    15
p⁺(x)   0.1+0.35+0.2 = 0.65    0.05  0.3

x       3     2     0
p⁻(x)   0.1   0.35  0.2+0.05+0.3 = 0.55

We define

E[R⁺] = Σₓ x P(R⁺ = x), E[R⁻] = Σₓ x P(R⁻ = x).

Since R⁺ and R⁻ take on only nonnegative values, these sums are always
well defined (they may be finite or equal to +∞). Now we define:

If E[R⁺] = a < ∞ and E[R⁻] = b < ∞, then E[R] = a - b.
If E[R⁺] = +∞ and E[R⁻] = b < ∞, then E[R] = +∞.
If E[R⁺] = a < ∞ and E[R⁻] = +∞, then E[R] = -∞.
If E[R⁺] = +∞ and E[R⁻] = +∞, then E[R] does not exist.
Example. Let R have a Poisson distribution with parameter λ > 0. Its
probability function is given by

p_R(n) = e^{-λ} λⁿ/n!, n = 0, 1, 2, ...

Let us calculate the expectation of R:

E[R] = Σ_{n=0}^{∞} n e^{-λ} λⁿ/n! = e^{-λ} Σ_{n=1}^{∞} λⁿ/(n-1)!
     = λ e^{-λ} Σ_{k=0}^{∞} λᵏ/k! = λ e^{-λ} e^{λ} = λ.
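The series converges quickly, so the identity E[R] = λ is easy to check numerically by truncating the sum (the value of λ and the cutoff are arbitrary choices of ours):

    import math

    lam = 2.5
    # Truncated version of  sum_{n>=0} n e^{-lam} lam^n / n!
    approx = sum(n * math.exp(-lam) * lam**n / math.factorial(n) for n in range(100))
    print(approx)   # 2.5 up to floating-point error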
3.2 Expectation of absolutely continuous random variables
If R is absolutely continuous, the definition of the expectation is similar,
but the sum is replaced by an integral, and P(R = x) by the density f_R(x).

Definition 14. Let R be an absolutely continuous random variable with
density f_R(x). The expectation of R is defined as

E[R] = ∫_{-∞}^{∞} x f_R(x) dx

if this integral is well defined; that is, E[R⁺] = ∫_{0}^{∞} x f_R(x) dx and
E[R⁻] = ∫_{-∞}^{0} (-x) f_R(x) dx are not equal to +∞ simultaneously.
Note. Don't confuse, however, f_R(x) and P(R = x). For an absolutely
continuous random variable, the probability P(R = x) is zero for all x, but
the density is a non-zero function. The density f_R(x) is not a probability:
in particular, it need not be ≤ 1. The quantity which represents a
probability in this expression is f_R(x)dx: informally speaking, this is the
probability that R belongs to the infinitesimal interval (x, x + dx).
Example. Let R ∼ U([a, b]). Then

E[R] = ∫_a^b x · 1/(b-a) dx = [x²/(2(b-a))]_a^b = (b² - a²)/(2(b-a)) = (a+b)/2.
Example. If R is an exponential random variable with parameter λ, then

E[R] = ∫_0^∞ x λe^{-λx} dx = -∫_0^∞ x (e^{-λx})' dx
     = [-x e^{-λx}]_0^∞ + ∫_0^∞ e^{-λx} dx = [-e^{-λx}/λ]_0^∞ = 1/λ.
Example. Let R be a standard normal random variable, R ∼ N(0, 1). Then

E[R] = (1/√(2π)) ∫_{-∞}^{∞} x e^{-x²/2} dx = 0

since the integrand is an odd function of x. If R ∼ N(μ, σ²), then,
substituting y = x - μ,

E[R] = ∫ x (1/(σ√(2π))) e^{-(x-μ)²/(2σ²)} dx = ∫ (y + μ)(1/(σ√(2π))) e^{-y²/(2σ²)} dy
     = ∫ y (1/(σ√(2π))) e^{-y²/(2σ²)} dy + μ ∫ (1/(σ√(2π))) e^{-y²/(2σ²)} dy = 0 + μ·1 = μ.

On the last line, the first integral is equal to zero because the integrand is
odd, and the second integral is equal to 1 because this is the total mass of
the density function of an N(0, σ²) random variable.

Thus the meaning of the parameter μ of a normal random variable is its
expectation.
Remark. It is possible for the expectation to be infinite, or not to exist at
all. For example, let

f_R(x) = { 1/x², x ≥ 1,
         { 0, x < 1.

Then

E[R] = ∫ x f_R(x) dx = ∫_1^∞ x · (1/x²) dx = ∫_1^∞ dx/x = ∞.

As another example, let f_R(x) = 1/(2x²) for |x| ≥ 1 and f_R(x) = 0 for
|x| < 1. Then

E[R⁺] = ∫_0^∞ x f_R(x) dx = ∫_1^∞ x · 1/(2x²) dx = ∞,
E[R⁻] = ∫_{-∞}^0 (-x) f_R(x) dx = ∫_{-∞}^{-1} (-x) · 1/(2x²) dx = ∞,

so E[R] does not exist.
Example. If R is the mixed random variable from the example in
Section 2.2.3, then

E[R] = ∫_0^{120} x (1/200) dx + 0·P(R = 0) + 120·P(R = 120)
     = (1/200)(120²/2) + 120 · 0.25 = 36 + 30 = 66.

3.3 Expectation of a function of a random variable

If R₂ = g(R₁), where R₁ is a discrete random variable, then

E[g(R₁)] = E[R₂] = Σ_{xᵢ} g(xᵢ) P(R₁ = xᵢ).
If (R₁, ..., Rₙ) is absolutely continuous with density f₁₂...ₙ, then

E[g(R₁, ..., Rₙ)] = ∫···∫ g(x₁, ..., xₙ) f₁₂...ₙ(x₁, ..., xₙ) dx₁ ··· dxₙ.
3.4.1 Terminology
If R is a random variable, the k-th moment of R (k > 0, not necessarily an
integer) is defined by

mₖ = E[R^k]

if the expectation exists. Thus

mₖ = { Σₓ x^k p_R(x) if R is discrete,
     { ∫ x^k f_R(x) dx if R is absolutely continuous.
The first moment is simply the expectation: m₁ = E[R].

The k-th central moment of R (k > 0) is defined by

cₖ = E[(R - E[R])^k] = { Σₓ (x - m₁)^k p_R(x) if R is discrete,
                       { ∫ (x - m₁)^k f_R(x) dx if R is absolutely continuous,

if E[R] is finite and the expectation in question exists. Note that the first
central moment is zero: c₁ = E[R - m₁] = m₁ - m₁ = 0.

The second central moment E[(R - E[R])²] is called the variance of R,
written σ² or Var(R). The positive square root of the variance,
σ = √Var(R), is called the standard deviation of R.
The variance may be interpreted as a measure of dispersion. A large
variance corresponds to a high probability that R will fall far from its mean,
while a small variance indicates that R
is likely to be close to its mean.
The quantities E[g(R)], with g other than x^k or (x - m₁)^k, are sometimes
of interest as well. A particularly useful one is the moment generating
function of R, defined as

M_R(t) = E[e^{Rt}], t ∈ R,
wherever this expectation exists. Provided M_R(t) exists in an open interval
around t = 0, the k-th moment is given by

E[R^k] = M_R^{(k)}(0) = d^k M_R(t)/dt^k |_{t=0}.

For example, M_R'(t) = E[R e^{Rt}], hence M_R'(0) = E[R]. Similarly,
M_R''(t) = E[R² e^{Rt}] yields M_R''(0) = E[R²], and so on.
Example. If R is exponential with parameter λ, then
M_R(t) = E[e^{Rt}] = ∫_0^∞ e^{tx} λe^{-λx} dx = λ/(λ-t) for t < λ, and

M_R'(t) = λ/(λ-t)², M_R''(t) = 2λ/(λ-t)³, M_R'''(t) = 6λ/(λ-t)⁴.

Thus we obtain

E[R] = M_R'(0) = 1/λ
E[R²] = M_R''(0) = 2/λ²
E[R³] = M_R'''(0) = 6/λ³
Example. If R is Poisson with parameter λ, then

M_R(t) = E[e^{Rt}] = Σ_{n=0}^{∞} e^{nt} e^{-λ} λⁿ/n!
       = e^{-λ} Σ_{n=0}^{∞} (λe^t)ⁿ/n! = e^{-λ} e^{λe^t} = e^{λ(e^t - 1)}.

Therefore,

E[R] = M_R'(0) = λ
E[R²] = M_R''(0) = λ + λ²
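Both sets of moments can be recovered symbolically by differentiating the MGF, for example with sympy (a sketch assuming the sympy package is available; it is not part of the source):

    import sympy as sp

    t, lam = sp.symbols("t lam", positive=True)
    M = sp.exp(lam * (sp.exp(t) - 1))       # Poisson MGF derived above

    m1 = sp.diff(M, t, 1).subs(t, 0)        # E[R]
    m2 = sp.diff(M, t, 2).subs(t, 0)        # E[R^2]
    print(sp.simplify(m1))                  # lam
    print(sp.expand(sp.simplify(m2)))       # lam**2 + lam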
3.5 Properties of expectation
In this section we list several basic properties of the expectation of a random
variable. (We will always assume that all expectations which appear in the
properties listed below exist.)
1. E[R₁ + ··· + Rₙ] = E[R₁] + ··· + E[Rₙ]
(if +∞ and -∞ do not both appear in the sum E[R₁] + ··· + E[Rₙ]).

2. If E[R] exists and a is any real number, then E[aR] exists and
E[aR] = aE[R].

If R₁, R₂, ..., Rₙ are independent, then

E[R₁R₂···Rₙ] = E[R₁]E[R₂]···E[Rₙ].

For the variance, Var(aR + b) = a² Var(R) = a²σ².
8. Let R₁, ..., Rₙ be independent random variables, each with finite
expectation and variance. Then

Var(R₁ + ··· + Rₙ) = Var(R₁) + ··· + Var(Rₙ).
3.6 Correlation
If R₁ and R₂ are random variables on a given probability space, we define
their covariance as

Cov(R₁, R₂) = E[(R₁ - E[R₁])(R₂ - E[R₂])] = E[R₁R₂] - E[R₁]E[R₂].
Definition 16. Let R₁ and R₂ be random variables defined on a given
probability space. If Var(R₁) > 0, Var(R₂) > 0, we define the correlation
coefficient of R₁ and R₂ as

ρ(R₁, R₂) = Cov(R₁, R₂) / √(Var(R₁) Var(R₂)).

By Theorem 15, if R₁ and R₂ are independent, they are uncorrelated;
that is, ρ(R₁, R₂) = 0, but not conversely.

It can be shown that

-1 ≤ ρ(R₁, R₂) ≤ 1.
The indicator of an event A is the random variable I_A with I_A(ω) = 1 if
ω ∈ A and I_A(ω) = 0 otherwise. Note that this is a (simple) discrete
random variable since it has only two possible values. Its expectation is
given by

E[I_A] = 1·P(A) + 0·P(A^c) = P(A).
Example. A die is tossed n times; let R₁ be the number of 1s and R₂ the
number of 2s obtained, and let us compute E[R₁R₂]. If Aᵢ is the event that
the i-th toss results in a 1, and Bᵢ the event that the i-th toss results in
a 2, then

R₁ = I_{A₁} + ··· + I_{Aₙ},
R₂ = I_{B₁} + ··· + I_{Bₙ}.

Hence

E[R₁R₂] = Σ_{i,j=1}^{n} E[I_{Aᵢ} I_{Bⱼ}].

If i ≠ j, the events Aᵢ and Bⱼ are independent, so that

E[I_{Aᵢ} I_{Bⱼ}] = E[I_{Aᵢ}]E[I_{Bⱼ}] = P(Aᵢ)P(Bⱼ) = 1/36.

If i = j, Aᵢ and Bᵢ are disjoint, since the i-th toss cannot simultaneously
result in a 1 and in a 2. Thus I_{Aᵢ}I_{Bᵢ} = I_{Aᵢ ∩ Bᵢ} = 0, and so

E[R₁R₂] = n(n - 1)/36,

since there are n(n-1) ordered pairs (i, j) of integers belonging to
{1, 2, ..., n} such that i ≠ j.
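A short simulation confirms the formula (the choices of n, the number of repetitions, and the seed are arbitrary):

    import random

    n, trials = 6, 200_000
    rng = random.Random(3)
    total = 0.0
    for _ in range(trials):
        tosses = [rng.randint(1, 6) for _ in range(n)]
        total += tosses.count(1) * tosses.count(2)    # one sample of R1 * R2
    print(total / trials, n * (n - 1) / 36)           # both close to 5/6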
4 Conditional expectation

For a discrete random variable R and an event A with P(A) > 0, the
conditional expectation of R given A is

E[R | A] = Σₓ x P(R = x | A) = Σₓ x P({R = x} ∩ A)/P(A) = E[R·I_A]/P(A).

For an arbitrary random variable R, we take the last expression as the
definition:

E[R | A] = E[R·I_A]/P(A).
Example. Let R be the result of the toss of a single die and A be the event
"the result is even". Then

E[R | A] = Σ_{n=1}^{6} n P(R = n | A).

We have

P(R = 1 | A) = P(R = 3 | A) = P(R = 5 | A) = 0

and P(R = 2 | A) = P(R = 4 | A) = P(R = 6 | A) = 1/3, so that

E[R | A] = (2 + 4 + 6)/3 = 4 ≠ E[R] = 3.5.
Example. Let N be a discrete random variable with possible values n =
0, 1, 2, ... (for instance, N ∼ P(λ)), and let X₁, X₂, ... be i.i.d. random
variables, independent of N. Define

S = Σ_{i=1}^{N} Xᵢ.

By the theorem of total expectation,
E[S] = Σₙ P(N = n) E[S | N = n] = Σₙ P(N = n) n E[X₁] = E[N] E[X₁].
For example, if N ∼ P(200) and X₁ ∼ E(1/1000), then
E[S] = 200 · 1000 = 2·10⁵.
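A simulation sketch of this compound sum, using numpy (assumed available; the sample size and seed are arbitrary):

    import numpy as np

    rng = np.random.default_rng(4)
    trials = 20_000
    N = rng.poisson(200, size=trials)        # N ~ P(200)
    # For each trial, draw N i.i.d. Exp(1/1000) variables (mean 1000) and sum.
    S = np.array([rng.exponential(1000.0, size=n).sum() for n in N])
    print(S.mean())                          # close to E[N] E[X1] = 2e5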
Now consider conditioning on an event of the form {R₁ = x}, which has
probability zero when R₁ is absolutely continuous. For example, let R₁ and
R₂ be independent, each uniformly distributed on [0, 1]. Intuitively, we
should have

P(R₁ + R₂ ≤ 1 | R₁ = x) = P(x + R₂ ≤ 1) = P(R₂ ≤ 1 - x) = 1 - x

or

E[R₁R₂ | R₁ = x] = E[xR₂ | R₁ = x] = x E[R₂] = x/2.

The rigorous definition of conditional probability and expectation with
respect to events of probability zero of the type {R = x} is rather involved
and is beyond the scope of these lectures.

Instead, we give the continuous equivalents of the theorems of total
probability and total expectation, and define the notion of conditional
density consistently with these theorems.
Example. Suppose that R₁ has density f₁(x) = x e^{-x}, x ≥ 0, and that,
given R₁ = x, R₂ is uniformly distributed on [0, x]. By the theorem of
total probability,

P(R₁ + R₂ ≤ 2) = ∫ P(R₁ + R₂ ≤ 2 | R₁ = x) f₁(x) dx
               = ∫_0^∞ P(x + R₂ ≤ 2 | R₁ = x) x e^{-x} dx
               = ∫_0^1 (1) x e^{-x} dx + ∫_1^2 P(R₂ ≤ 2 - x | R₁ = x) x e^{-x} dx
                 + ∫_2^∞ (0) x e^{-x} dx.

We have

P(R₂ ≤ 2 - x | R₁ = x) = (2 - x)/x for 1 ≤ x ≤ 2.

Therefore

P(R₁ + R₂ ≤ 2) = ∫_0^1 x e^{-x} dx + ∫_1^2 [(2 - x)/x] x e^{-x} dx = 1 - 2e^{-1} + e^{-2}.

Similarly, using the conditional density of R₂ given R₁ = x (which is 1/x
on [0, x]),

E[R₂ | R₁ = x] = ∫_0^x y (1/x) dy = x/2,

E[e^{R₂} | R₁ = x] = ∫_0^x e^y (1/x) dy = (e^x - 1)/x.

Thus we write

E[R₂ | R₁] = R₁/2,

E[e^{R₂} | R₁] = (e^{R₁} - 1)/R₁.
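The conclusions of this example can be checked by simulation; a sketch assuming numpy (note that f₁ here is the Gamma(2, 1) density, which is how we sample R₁):

    import numpy as np

    rng = np.random.default_rng(5)
    n = 1_000_000
    r1 = rng.gamma(shape=2.0, scale=1.0, size=n)   # density f1(x) = x e^{-x}, x >= 0
    r2 = rng.uniform(0.0, r1)                      # given R1 = x, R2 ~ U([0, x])

    print((r1 + r2 <= 2).mean())      # close to 1 - 2e^{-1} + e^{-2}, about 0.3996
    print(np.mean(r2 - r1 / 2))       # close to 0, consistent with E[R2 | R1] = R1/2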
4.3 Conditional expectation with respect to a σ-field

Definition 21. Let X be a random variable on a probability space (Ω, F, P)
such that E[|X|] < ∞, and let G ⊂ F be a σ-field. The conditional
expectation of X given G (written E[X | G]) is a random variable Z
satisfying the following properties:

1. Z ∈ G (Z is G-measurable);
2. E[XG] = E[ZG] for every bounded random variable G ∈ G.

Remark. The random variable defined above is unique in the sense that if Z₁
and Z₂ satisfy the properties of the conditional expectation E[X | G], then
Z₁ = Z₂ almost surely (P{ω | Z₁(ω) ≠ Z₂(ω)} = 0).
Properties. Let X, Y, Z be random variables on a given probability space
(Ω, F, P), and let G, H ⊂ F be σ-algebras of events.

2. If Y is independent of G, then E[Y | G] = E[Y]. (Don't confuse
independence with orthogonality in the sense of L²!)

If H ⊂ G, then E[E[X | G] | H] = E[X | H]: viewing conditional expectation
as a projection in L², we first project on the greater subspace and then
project the result on the smaller subspace.

The comments above do not constitute proofs of these properties but only
provide some intuition about them. To prove these properties, we have to
use the definition of the conditional expectation. For example, let us prove
property 2.

We need to prove that the (constant) random variable Z = E[Y] satisfies
the two properties of the conditional expectation of Y given G. Since it is
constant, it is G-measurable. Let G be an arbitrary bounded G-measurable
random variable, G ∈ G. We have to check that E[YG] = E[E[Y]G]. By
independence of Y with respect to G, the left-hand side is equal to
E[Y]E[G]. The right-hand side is equal to the same expression, simply by
taking the constant E[Y] out of the expectation.

We omit here the proofs of the other properties of the conditional
expectation.
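For a σ-field generated by a finite partition, the conditional expectation is just the cell-by-cell average, and the defining property can be checked directly. A minimal sketch with an assumed uniform 10-point sample space (all names and numbers are ours):

    import numpy as np

    rng = np.random.default_rng(6)
    x = rng.normal(size=10)                  # values X(w) on Omega = {0,...,9}, P uniform
    cells = [np.arange(0, 5), np.arange(5, 10)]   # partition generating G

    z = np.empty_like(x)
    for cell in cells:
        z[cell] = x[cell].mean()             # E[X | G] is the average over each cell

    # Defining property: E[XG] = E[ZG] for every G-measurable G.
    g = np.where(np.arange(10) < 5, 2.0, -1.0)    # a G-measurable random variable
    assert np.isclose((x * g).mean(), (z * g).mean())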
5 Gaussian vectors
(to be completed)