
Stat 205B Lecture Notes of May 14, 2002

Prepared by Gabor Pete


Disclaimer: These lecture notes have been only lightly proofread. Please inform Prof. Peres
<peres@stat.berkeley.edu> of errors.

1 The main ergodic theorems


Let $(\Omega, \mathcal{F}, P)$ be a probability space, and $T : \Omega \to \Omega$ a measure-preserving transformation, meaning two things: it is measurable, and $P(T^{-1}(A)) = P(A)$ for all $A \in \mathcal{F}$. Here we use the inverse image because the image $T(A)$ of $A$ is not necessarily measurable. The set of $T$-invariant sets, $\mathcal{I} = \{A \in \mathcal{F} : T^{-1}A \supseteq A\}$, is a sub-$\sigma$-algebra of $\mathcal{F}$. Note that $A = T^{-1}A$ almost everywhere for $A \in \mathcal{I}$. For any function $f \in L^1(\Omega, \mathcal{F}, P)$ (hereafter, $L^1(\Omega)$), let us define the measurable functions
$$S_n f(\omega) := \sum_{j=0}^{n-1} f(T^j \omega), \qquad M_n f(\omega) := \max_{1 \le j \le n} S_j f(\omega), \qquad Mf(\omega) := \sup_{j \ge 1} S_j f(\omega).$$

The most important result in ergodic theory is the following Pointwise Ergodic Theorem due to G. Birkhoff (1931). He formulated and proved it, stimulated by J. von Neumann's much simpler $L^2$ version.
Theorem 1.1. (Birkhoff's Pointwise Ergodic Theorem) For $f \in L^1(\Omega)$, and a.e. $\omega$,
$$\frac{1}{n} S_n f(\omega) \to \bar{f}(\omega) := E(f \mid \mathcal{I})(\omega).$$
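To see the theorem in action, here is a minimal numerical sketch (my illustration, not part of the original notes): the irrational rotation $T\omega = \omega + \alpha \bmod 1$ preserves Lebesgue measure on $[0,1)$ and is ergodic, so $E(f \mid \mathcal{I})$ is the constant $\int_0^1 f(\omega)\, d\omega$, and the ergodic averages converge to this space average. The choices of $\alpha$, the observable $f$, and the starting point below are arbitrary.

```python
import numpy as np

# Birkhoff averages for the irrational rotation T(w) = w + alpha mod 1,
# which preserves Lebesgue measure on [0,1) and is ergodic: the averages
# (1/n) S_n f should converge to the space average of f.
alpha = np.sqrt(2) - 1                      # an irrational rotation angle
f = lambda w: np.cos(2 * np.pi * w) ** 2    # test observable; its integral is 1/2

w0 = 0.123                                  # an arbitrary starting point
n = 100_000
orbit = (w0 + alpha * np.arange(n)) % 1.0   # w0, T w0, T^2 w0, ..., T^{n-1} w0
print(f(orbit).mean())                      # (1/n) S_n f(w0), close to 0.5
```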
The original proof was 50 pages long. Then Kakutani and Yosida formulated their Maximal Ergodic Theorem in 1939, with a much cleaner 10-page proof, from which Birkhoff's theorem can be deduced nicely.
Theorem 1.2. (Maximal Ergodic Theorem) For $f \in L^1(\Omega)$,
$$\int_{\{Mf > 0\}} f \, dP \ge 0.$$
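As a quick sanity check (again my sketch, not from the notes), one can estimate $\int_{\{M_N f > 0\}} f \, dP$ by Monte Carlo for the same rotation and a finite horizon $N$; Garsia's argument below shows that the inequality also holds with $M_N$ in place of $M$ for each finite $N$, so the estimate should come out nonnegative.

```python
import numpy as np

# Monte Carlo check of the Maximal Ergodic Theorem for T(w) = w + alpha mod 1:
# estimate the integral of f over {M_N f > 0}, where M_N f = max_{1<=j<=N} S_j f.
alpha = np.sqrt(2) - 1
f = lambda w: np.cos(2 * np.pi * w) - 0.3    # observable with negative mean -0.3

rng = np.random.default_rng(1)
w0 = rng.random(20_000)                      # sample points ~ Lebesgue on [0,1)
N = 500
S = np.zeros_like(w0)                        # running partial sums S_j f
M = np.full_like(w0, -np.inf)                # running maxima M_j f
w = w0.copy()
for _ in range(N):
    S += f(w)
    M = np.maximum(M, S)
    w = (w + alpha) % 1.0                    # advance the orbit

# ≈ integral of f over {M_N f > 0}; the theorem says it is >= 0
print(np.mean(f(w0) * (M > 0)))
```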

Similar maximal inequalities are very important in all parts of analysis, and usually they can be considered as improvements to Markov's inequality. Just to mention two examples: we saw the different $L^p$ maximal inequalities for (sub/super)martingales. The Hardy-Littlewood maximal function associated to a real-valued Lebesgue-measurable function $f \in L^1_{\mathrm{loc}}(\mathbb{R}^d)$ is
$$Mf(x) = \sup_r \frac{1}{\lambda(B_r(x))} \int_{B_r(x)} |f(y)| \, d\lambda(y),$$
and the corresponding maximal inequality is
$$\lambda\{x : Mf(x) > \alpha\} \le \frac{3^d}{\alpha} \int |f(x)| \, d\lambda(x)$$
for any $\alpha > 0$.
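As a discrete illustration (a sketch under my own conventions, not from the notes), one can approximate $Mf$ on a grid for $f = \mathbf{1}_{(-1,1)}$ in $d = 1$ and check the weak-type bound with constant $3^d = 3$ numerically; the grid spacing and window radii below are arbitrary choices.

```python
import numpy as np

# Discrete 1-D sketch of the Hardy-Littlewood maximal function on a grid:
# Mf(x) ~ max over centered windows of the average of |f| on the window.
dx = 0.001
x = np.arange(-5, 5, dx)
f = np.where(np.abs(x) < 1, 1.0, 0.0)          # |f| = indicator of (-1, 1)

cumsum = np.concatenate([[0.0], np.cumsum(np.abs(f))])
Mf = np.zeros_like(f)
idx = np.arange(len(x))
for r in range(1, 2001):                        # window half-widths in grid units
    lo = np.clip(idx - r, 0, len(x))
    hi = np.clip(idx + r + 1, 0, len(x))
    Mf = np.maximum(Mf, (cumsum[hi] - cumsum[lo]) / (hi - lo))

alpha = 0.5
lhs = dx * np.sum(Mf > alpha)                   # lambda{ Mf > alpha }
rhs = (3.0 / alpha) * dx * np.sum(np.abs(f))    # (3^d / alpha) * ||f||_1, d = 1
print(lhs, "<=", rhs)                           # the weak-type inequality holds
```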

Now we deduce Birkhoff's theorem from the Maximal Ergodic Theorem. It could be done more briefly, in one step, but instead we first prove the existence of the a.s. limit, and then identify the limit.
Proof of Birkhoff's theorem. For rationals $a < b$ let us define the measurable set $\Lambda_{a,b} = \{\liminf_n \frac{1}{n} S_n f < a \text{ and } \limsup_n \frac{1}{n} S_n f > b\}$. It is easy to check that $T\Lambda_{a,b} \subseteq \Lambda_{a,b}$. Now suppose that $P(\Lambda_{a,b}) > 0$ for some $a, b \in \mathbb{Q}$, and define $\tilde{P}(\cdot) = P(\cdot \mid \Lambda_{a,b})$. It is clear that $\Lambda_{a,b} \subseteq \{M(f - b) > 0\} \cap \{M(a - f) > 0\}$, so the Maximal Ergodic Theorem applied to $(\Lambda_{a,b}, \mathcal{F}|_{\Lambda_{a,b}}, \tilde{P})$ gives
$$\int (f - b) \, d\tilde{P} \ge 0 \quad \text{and} \quad \int (a - f) \, d\tilde{P} \ge 0.$$
Summing up these inequalities we get $\int (a - b) \, d\tilde{P} \ge 0$, which contradicts $a - b < 0$ and $\tilde{P}(\Lambda_{a,b}) = 1$. Thus we have
$$P\Big(\bigcup_{a < b \in \mathbb{Q}} \Lambda_{a,b}\Big) = 0,$$
which means a.s. convergence.
Now write $\bar{f} = \lim_n \frac{1}{n} S_n f$ for this a.s. limit. What can this limit be? First of all, note that $\bar{f}$ is invariant: $\bar{f} = \bar{f} \circ T$. Or, to say the same thing differently: it is $\mathcal{I}$-measurable.
Lemma 1.3. For any $f \in L^1(\Omega)$ and measure-preserving $T$, $\int f \circ T \, dP = \int f \, dP$. More generally, for any invariant set $B \in \mathcal{I}$, $\int_B f \circ T \, dP = \int_B f \, dP$.

Proof. The first statement is true for any $f = \mathbf{1}_A$, $A \in \mathcal{F}$, by the definition of a measure-preserving transformation. Then we can pass to general $f$'s by the "standard machine": approximate $f \ge 0$ by step functions and use the Monotone Convergence Theorem, then write $f \in L^1$ as $f = f_+ - f_-$. The second statement follows by noticing that $(f \mathbf{1}_B) \circ T = (f \circ T) \mathbf{1}_B$ a.s. for $B \in \mathcal{I}$.
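A quick Monte Carlo sanity check of the lemma (my sketch, not from the notes), using the doubling map $T\omega = 2\omega \bmod 1$, which preserves Lebesgue measure on $[0,1)$; the observable $f$ is an arbitrary choice.

```python
import numpy as np

# Monte Carlo check of Lemma 1.3 for the doubling map T(w) = 2w mod 1,
# which preserves Lebesgue measure on [0,1): E[f(T(W))] should equal E[f(W)].
rng = np.random.default_rng(0)
w = rng.random(1_000_000)                   # W ~ Uniform[0,1), i.e. Lebesgue
T = lambda w: (2 * w) % 1.0
f = lambda w: np.exp(-w) * np.sin(5 * w)    # an arbitrary integrable observable

print(f(w).mean(), f(T(w)).mean())          # the two means agree up to MC error
```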
This lemma implies that $\int_B S_n f \, dP = n \int_B f \, dP$ for $B \in \mathcal{I}$. For $f \ge 0$ we can now apply Fatou's lemma to get $\int_B \bar{f} \, dP \le \int_B f \, dP$, and for bounded $f$ we can apply the Dominated Convergence Theorem to get $\int_B \bar{f} \, dP = \int_B f \, dP$. So it is reasonable to expect that $\bar{f} = E(f \mid \mathcal{I})$, the unique $\mathcal{I}$-measurable function that gives the same integral on each invariant set as $f$.
To prove this claim, set $g = f - E(f \mid \mathcal{I})$. Since $E(f \mid \mathcal{I})$ is $T$-invariant, we have to prove that $\bar{g} = \lim \frac{1}{n} S_n g$ equals 0 almost surely; note that we know the existence of the limit from the existence of $\bar{f}$. Let us proceed similarly as before: take $\Lambda = \{\bar{g} > \epsilon\} \in \mathcal{I}$ for some $\epsilon > 0$, and consider the restriction of our dynamical system to $\Lambda$. If $P(\Lambda) > 0$, then we have a decent measurable dynamical system, and the Maximal Ergodic Theorem gives $\int_\Lambda (g - \epsilon) \, dP \ge 0$. If $P(\Lambda) = 0$, then the same inequality is trivial. Hence
$$\epsilon \, P(\Lambda) \le \int_\Lambda g \, dP = \int_\Lambda E(g \mid \mathcal{I}) \, dP = 0,$$
where in the first equality we used $\Lambda \in \mathcal{I}$ and the definition of conditional expectation, while the second one follows simply from the definition of $g$. Thus we have $P(\Lambda) = 0$. Similarly, $P(\bar{g} < -\epsilon) = 0$. These show that $\bar{g} = 0$ a.s., and the proof is complete.
The general belief after 1939 was that the maximal theorem and Birkhoff's theorem were equivalent. Garsia's three-line proof of the maximal theorem in 1965 came as a shock.
Garsia's proof of the Maximal Ergodic Theorem. The sets $\{M_n f > 0\}$ increase monotonically to $\{Mf > 0\}$, so by the Dominated Convergence Theorem it is enough to prove $\int_{\{M_n f > 0\}} f \, dP \ge 0$. Note that $f + [M_{n-1}(f \circ T)]^+ = M_n f$, and $f + [M_n(f \circ T)]^+ \ge M_n f$, hence
$$\begin{aligned}
\int_{\{M_n f > 0\}} f \, dP &\ge \int_{\{M_n f > 0\}} \big( M_n f - [M_n(f \circ T)]^+ \big) \, dP \\
&= \int_{\{M_n f > 0\}} \big( [M_n f]^+ - [M_n(f \circ T)]^+ \big) \, dP \\
&\ge \int [M_n f]^+ \, dP - \int [M_n(f \circ T)]^+ \, dP = 0,
\end{aligned}$$
where in the last step we used Lemma 1.3 again.

2 Information Theory
We close the semester with a very brief introduction to information theory by Peter Ralph.
Given a random variable $X$ on $(\Omega, \mathcal{B}, P)$, and a sub-$\sigma$-algebra $\mathcal{F} \subseteq \mathcal{B}$, let us define the conditional probability, information and entropy as
$$P(X \mid \mathcal{F})(\omega) = E\big(\mathbf{1}_{X^{-1}(X(\omega))} \,\big|\, \mathcal{F}\big)(\omega),$$
$$I(X \mid \mathcal{F}) = -\log P(X \mid \mathcal{F}),$$
$$H(X \mid \mathcal{F}) = E\, I(X \mid \mathcal{F});$$
that is, $P(X \mid \mathcal{F})(\omega)$ is the conditional probability, given $\mathcal{F}$, of the event that $X$ takes the value $X(\omega)$.
In particular, if $X$ is a discrete variable with $P(X = x) = p_x$, and $\mathcal{F} = \{\emptyset, \Omega\}$ is the trivial $\sigma$-algebra, then $P(X \mid \mathcal{F})$ is the random variable $\omega \mapsto p_{X(\omega)}$, and
$$H(X) = H(X \mid \mathcal{F}) = -\sum_x p_x \log p_x.$$

If $\mathcal{F} = \sigma(Y)$ for another r.v. $Y$, then
$$H(X \mid Y) = -\sum_{x,y} P(X = x, Y = y) \log P(X = x \mid Y = y) = H(X, Y) - H(Y).$$

So using the fact that $H(X \mid Y) \le H(X)$, we see
$$H(X, Y) \le H(X) + H(Y),$$
with equality if and only if $X$ and $Y$ are independent.
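A small numerical sketch (my illustration, not from the notes) checking the identity $H(X \mid Y) = H(X,Y) - H(Y)$ and the subadditivity $H(X,Y) \le H(X) + H(Y)$ on an arbitrary small joint distribution:

```python
import numpy as np

# Entropies of a small joint distribution: check H(X|Y) = H(X,Y) - H(Y)
# and subadditivity H(X,Y) <= H(X) + H(Y). (Natural log, entropy in nats.)
def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

pxy = np.array([[0.30, 0.10],       # joint distribution of (X, Y):
                [0.05, 0.25],       # rows = values of X,
                [0.15, 0.15]])      # columns = values of Y
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

H_joint = H(pxy.ravel())
print(H_joint - H(py))              # H(X|Y), via the chain rule
print(H_joint <= H(px) + H(py))     # True: subadditivity
```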
The most important theorem about entropy is probably the following:

Theorem 2.1. (Shannon-McMillan-Breiman) For a stationary ergodic sequence $(X_n)_{n=-\infty}^{\infty}$ on a countable state space, with $H(X_0) < \infty$,
$$\frac{1}{n} I(X_1, \ldots, X_n) = -\frac{1}{n} \log P(X_1, \ldots, X_n) \;\longrightarrow\; H := H(X_0 \mid X_{-1}, X_{-2}, \ldots)$$
almost surely and in $L^1$.
An equivalent reformulation in the case of $X_i \in S$, where $S$ is a finite set: to capture $1 - \epsilon$ of the probability mass of the possible outcomes $(X_1, \ldots, X_n)$, we need at least around $\exp((H - \epsilon)n)$ sequences. E.g. in the case of an i.i.d. uniform sequence, we have $H = \log |S|$, so we need almost all possible outcomes. Thus larger entropy can be interpreted as a larger degree of randomness and independence in the sequence. Among all probability distributions $X$ on a finite set, the uniform one has the largest entropy $H(X)$, and among all probability densities on $\mathbb{R}$ with $E X^2 = 1$, the standard normal has this maximizing property.
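For an i.i.d. sequence the Shannon-McMillan-Breiman theorem reduces to the strong law of large numbers for $-\frac{1}{n} \log P(X_1, \ldots, X_n)$; here is a minimal sketch (my illustration, not from the notes) for a biased coin, where $H = -p \log p - (1-p) \log(1-p)$.

```python
import numpy as np

# Shannon-McMillan-Breiman for an i.i.d. Bernoulli(p) sequence:
# -(1/n) log P(X_1,...,X_n) should converge a.s. to the entropy H.
rng = np.random.default_rng(2)
p, n = 0.3, 100_000
x = rng.random(n) < p                           # X_i = 1 with probability p
log_prob = np.sum(np.where(x, np.log(p), np.log(1 - p)))  # log P(X_1..X_n)

H = -p * np.log(p) - (1 - p) * np.log(1 - p)    # entropy rate, in nats
print(-log_prob / n, H)                         # the two values nearly agree
```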
It is also possible to define the entropy of a measure-preserving transformation, a notion that is central in the theory of dynamical systems.
