
Chapter 3

The Law of Large Numbers and Weierstrass Approximation Theorem

I guess I think of lotteries as a tax on the mathematically challenged.


R. Jones

It is likely that unlikely things should happen.


Aristotle

. . . there are many functions which are not at all suited for approximation by
a single polynomial in the entire interval which is of interest.
G. Dahlquist and Å. Björck

The perfidious polynomial


J. Wilkinson

You do not want to meet a polynomial of degree 1000 on a dark night.


B. Parlett

A panorama
The main goal of this chapter is to begin exploring the deep connection between proba-
bility and analysis. That connection is fully described by measure theory. But it is not at
all obvious, which may be the reason measure theory was not applied to probability until
thirty-some years after it was developed.
Analysis becomes relevant when we ask more about probability than simply carrying
out familiar probability computations in discrete probability, e.g. chances of picking a
particular card from a deck. For example, much deeper mathematical issues arise when
we play a probability game over and over, e.g. how is the probability of choosing a king from
a single shuffled deck related to the results that are obtained when we choose at random a
single card from each deck in a large, or infinite, collection of decks. That relation is key
to using probability as a model for physical phenomena and developing a systematic way
to carry out probability computations.
To give an example of a deep question in probability, we begin by explaining the Law
of Large Numbers, which partially answers the question of what happens when we re-
peat a random experiment over and over. This theorem is a central theme because it is
connected to the use of probability as a mathematical model and we develop a number


of variations in subsequent chapters. Then, in a startling development, we demonstrate


the connection between probability and analysis by using the Law of Large Numbers to
prove the celebrated Weierstrass Approximation Theorem, which states that continuous
functions can be approximated arbitrarily well by polynomials.
We are still some distance from introducing measure theory. But the proofs in this
chapter are a good warmup.

3.1 Some discrete probability


We begin by recalling some discrete probability definitions and ideas that are used in the
rest of the book. As with the material in Chapter 1, it may be a good idea to keep an
elementary probability textbook nearby as a reference while we develop key probability
concepts from a rigorous measure theory point of view.
The following definitions are made in reference to an experiment, which may be
thought of as an abstract generalization of a scientific experiment. Thus, it is an opera-
tion we carry out that produces a result that is subject to some degree of uncertainty or
possible variation. The outcomes of a scientific experiment may be subject to a degree of
uncertainty both because of intrinsic reasons, e.g. the physical system varies in a way that
is not completely predictable or at a scale that is not readily observable, and for external
reasons, e.g., due to experimental and observation errors.

Definition 3.1.1

An experiment is a repeatable procedure that yields exactly one outcome from a


specified set of possible results. The set of possible outcomes is called the sample
space. A trial is a single performance of the experiment. Discrete probability refers
to a situation in which the sample space is at most countable.

Often, an experiment may be associated with several different sample spaces.

Example 3.1.1

The experiment is to draw a card from a standard deck. We can classify the possible
outcomes in a number of ways, e.g.,
Sample space 1 A point in the space of 52 outcomes.

Sample space 2 A point in the space of two outcomes: red or black.


Sample space 3 A point in the space of 13 outcomes: {2, 3, 4, ..., King, Ace}.

Note that sample spaces 2 and 3 are sets whose points are sets.
There is a special case of a sequence of trials that greatly simplifies analysis yet is
still very important.

Definition 3.1.2

A sequence of trials of an experiment is independent if the outcome of one trial


in the sequence does not affect the outcomes of the later trials. We say these are

independent trials and the trials are independent.

Example 3.1.2

We typically assume that a sequence of coin tosses is independent.

Example 3.1.3

Consider a bag holding a collection of white and black marbles. In experiment 1, we


choose a marble without looking, mark its color, put it back in the bag, and shake
vigorously. In experiment 2, we choose a marble without looking, mark its color,
and discard it (it does not go back in the bag). It is reasonable to consider the trials of experiment 1
to be independent, but the trials of experiment 2 are certainly not independent.

We often wish to further group elements in a sample space.

Definition 3.1.3

Any collection or set of outcomes in a sample space is called an event. Individual


members (singleton sets) of the sample space are called (sample) points. We say an
event occurs in a trial if the outcome of the trial is a sample point that is in the event.

Example 3.1.4

Consider Example 3.1.1. In sample space 1, black is an event with 26 points. In


sample space 2, black is an event with 1 point. Black is not an event in sample space
3.

On a purely functional level, probability is a function defined on events in a sample


space satisfying specific rules.

Definition 3.1.4

Consider a sample space with n outcomes. Probabilities are numbers assigned to


events with the properties:
1. Each sample point is assigned a non-negative number called the probability.
2. The sum of the n probabilities of the n outcomes is 1.

3. If A is an event and P (A) is the probability of A, then P (A) is the sum of prob-
abilities of the outcomes in A.
P is generally reserved for probability. The probability (function) on the events in
the sample space is the function that assigns a probability to each event. For later
reference, probability is a non-negative finitely additive set function.

Probability is associated with randomness and uncertainty and the rules governing
probability reflect properties of the experiment. But, it is important to note that there
is nothing uncertain or random about the rules governing probability. The connection
to uncertainty or randomness comes through the interpretation of the probability values
placed on the outcomes and how those values are assigned.

Example 3.1.5

Consider the experiment of flipping a two-sided coin with a head side (H) and a tail
side (T ). The sample space is {H , T }. Given the complexity of modeling the physics
of the motion through the flip to the catch, we might assign probability by assuming
that each outcome is equally likely, i.e. P (H ) = P (T ) = 1/2. The randomness or
uncertainty in the experiment is that, short of carrying out a complex predictive
physics simulation, we cannot predict which outcome will occur before the toss is
made.

In general, a common approach for assigning probabilities in the absence of any infor-
mation about probabilities of events is based on assuming each outcome is equally likely.
If the sample space has n outcomes, then $P(\text{any outcome}) = 1/n$. It follows that,
$$P(\text{event}) = \frac{\text{number of outcomes in the event}}{\text{total number of outcomes}}.$$
But, that rule does not always apply.
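To make the counting rule concrete, here is a minimal Python sketch (our own illustration, not part of the text; the deck representation is an arbitrary choice):

```python
from fractions import Fraction

# A 52-card deck: 13 ranks in each of 4 suits, all outcomes equally likely.
ranks = ["2", "3", "4", "5", "6", "7", "8", "9", "10",
         "Jack", "Queen", "King", "Ace"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(r, s) for r in ranks for s in suits]

def prob(event):
    """P(event) = (number of outcomes in the event) / (total number of outcomes)."""
    return Fraction(len(event), len(deck))

kings = [c for c in deck if c[0] == "King"]
red   = [c for c in deck if c[1] in ("hearts", "diamonds")]
print(prob(kings))  # 1/13
print(prob(red))    # 1/2
```

Using Fraction keeps the counts exact rather than rounding to floating point.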

Example 3.1.6

An important example of transport in porous media is the flow of groundwater


through the ground. The porosity in a region varies on a wide range of scales due to
a variety of geological features and properties. Generally, it is too challenging to use
models for flow valid at the smallest scales, so the most common partial differential
equation models consider bulk flow on a scale that is much larger than the indi-
vidual pores in the region of the Earth's surface under consideration. Such models
use a porosity parameter field that is obtained by some kind of upscaling of the
fine-grained variation. For example, the porosity might be obtained in a region by
averaging (using the harmonic average) the fine-scale porosity over the region. The
variations in the porosity are not random but are often modeled as a kind of random
function that perturbs the upscaled porosity. If one examines the porosity at differ-
ent points in a region, say by looking at samples obtained by drilling at a few places
(drilling is expensive), the variations do indeed appear random and heterogeneous.

Example 3.1.7

We consider a sample space with six outcomes $a, b, \ldots, f$. We assign $P(a) = .1$,
$P(b) = .2$, $P(c) = .1$, $P(d) = .1$, $P(e) = .2$, and $P(f) = .3$. The events are the members of the power set,
$\mathcal{P}_X = \{\emptyset, \{a\}, \{b\}, \ldots, \{f\}, \{a, b\}, \{a, c\}, \ldots, \{a, b, c, d, e, f\}\}$.
If $B = \{d, e, f\}$, then $P(B) = .1 + .2 + .3 = .6$.
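In code, such a finite probability space is nothing more than a map from outcomes to weights. A minimal sketch (our own illustration, not from the text):

```python
from fractions import Fraction

# The probability function of Example 3.1.7 as a map from outcomes to weights.
P = {"a": Fraction(1, 10), "b": Fraction(2, 10), "c": Fraction(1, 10),
     "d": Fraction(1, 10), "e": Fraction(2, 10), "f": Fraction(3, 10)}

assert sum(P.values()) == 1  # property 2 of Definition 3.1.4

def prob(event):
    """Property 3: P(A) is the sum of the probabilities of the outcomes in A."""
    return sum(P[o] for o in event)

print(prob({"d", "e", "f"}))  # 3/5, i.e. .6
```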

A system involving probability has three ingredients.



Definition 3.1.5

In discrete probability, a sample space together with its power set and a set of prob-
abilities is called a probability space. If X is the sample space and P the probability,
then we write $(X, \mathcal{P}_X, P)$ to emphasize the three ingredients.

We give names to some special kinds of events,

Definition 3.1.6

A sure event must occur in an experiment, so it contains the entire sample space.
An almost sure event is an event with probability one. An event with probability
zero happens almost never. An impossible event never occurs in an experiment,
so it is the event with no outcomes.

Definition 3.1.7

If A is an event in a sample space, its complement $A^c$ is the set of outcomes not in A.

Probabilities must satisfy certain properties with respect to taking unions and inter-
sections of events. For example,

Theorem 3.1.1

If A, B are events in a probability space, then
$$P(A \cup B) = P(A) + P(B) - P(A \cap B). \tag{3.1}$$

Definition 3.1.8

Two events in a probability space are (mutually) exclusive if they have no outcomes
in common.

Theorem 3.1.2

If A, B are exclusive events in a probability space, then
$$P(A \cup B) = P(A) + P(B). \tag{3.2}$$

In general, if $\{A_i : 1 \le i \le n\}$ is a collection of n exclusive events, then
$$P(A_1 \cup A_2 \cup \cdots \cup A_n) = \sum_{i=1}^{n} P(A_i). \tag{3.3}$$

3.2 The Law of Large Numbers


The Law of Large Numbers is a probabilistic statement about the frequency of occur-
rence of an outcome in a sequence of a large number of repeated independent trials of an

experiment. It is a result that we discuss several more times throughout the book. In this
section, we give an elementary proof of a simple version that does not require measure
theory or any significant probability theory.
We work in a sample space associated with a given experiment. We assume that a
certain outcome O occurs with some specific probability x when the experiment is con-
ducted, but that we do not know x. How might we determine it? If we conduct a single
trial, O might result or it might not. In either case, it gives little information about x.
However, if we conduct a large number $m \gg 1$ of trials, intuition suggests O should
occur approximately $x \cdot m$ times, at least most of the time. Another way of stating this
intuition is,
$$\text{probability of } O \approx \frac{\text{number of times } O \text{ occurs}}{\text{total number of trials}}.$$
But, we have to be wary about intuitive feelings:

Example 3.2.1

If we conduct a set of trials in which we flip a fair coin many times (m), we expect to
see around 50% ($\frac{m}{2}$) heads most of the time. However, it turns out the probability
of getting exactly $m/2$ heads in m flips is approximately
$$\frac{1}{\sqrt{m}},$$
for m large. This goes to 0 as m increases.
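A direct computation confirms the decay. The following sketch is our own illustration; the limiting constant $\sqrt{2/\pi} \approx 0.80$ in the comment is a standard consequence of Stirling's formula:

```python
from math import comb, sqrt

# P(exactly m/2 heads in m fair flips) = C(m, m/2) / 2^m, compared with 1/sqrt(m).
for m in [10, 100, 1_000, 10_000]:
    exact = comb(m, m // 2) / 2**m
    print(f"m = {m:>5}:  P = {exact:.5f},  1/sqrt(m) = {1 / sqrt(m):.5f}")
# The ratio settles near sqrt(2/pi) ~ 0.80, so the probability decays like 1/sqrt(m).
```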

So, we have to be careful about how we state obtaining the expected result. Moreover,
we might have a run of either good or bad luck in the sequence of experiments that
undermines the intuition.
Theorem 3.2.1: Law of Large Numbers

In an experiment, assume outcome O occurs with probability x and let k be the
number of times O occurs in m independent trials. Let $\epsilon > 0$ and $\delta > 0$ be given.
The probability that $k/m$ differs from x by less than $\epsilon$ is greater than $1 - \delta$,
$$P\left(\left|\frac{k}{m} - x\right| < \epsilon\right) > 1 - \delta, \tag{3.4}$$
for all m sufficiently large.

It is important to spend some time reading the conclusion of Theorem 3.2.1 and un-
derstanding its meaning. The theorem does not say O will occur exactly $x \cdot m$ times in m
trials nor that O must occur approximately $x \cdot m$ times in m trials. The role of $\epsilon$ is that
it quantifies the way in which k/m approximates x, thus avoiding the issue in Ex. 3.2.1.
The role of $\delta$ is that it allows the possibility, however small in probability, that the se-
quence of trials can produce a result that is not expected. For example, we could have the
(mis)fortune to obtain all heads in the sequence of trials. By making $\epsilon$ small, we obtain a
better approximation to x. By making $\delta$ small, we obtain the expected result with higher
probability. The cost in each case is having to conduct a large number m of trials.
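To see the theorem in action, here is a minimal Monte Carlo sketch (our own illustration; the value x = 0.3 and the seed are arbitrary choices):

```python
import random

random.seed(1)  # reproducible illustration
x = 0.3         # the "unknown" probability of outcome O (arbitrary choice)

for m in [10, 100, 1_000, 10_000, 100_000]:
    # k = number of times O occurs in m independent trials
    k = sum(random.random() < x for _ in range(m))
    print(f"m = {m:>6}:  k/m = {k / m:.4f},  |k/m - x| = {abs(k / m - x):.4f}")
```

Each printed line is one realization of k/m; rerunning with a different seed changes the digits but not the trend of $k/m$ toward x.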
As well as being interesting in its own right, the Law of Large Numbers (LLN) is
centrally important to various aspects of probability theory. For example, it is important

in consideration of the role of probability theory as a mathematical model of reality. An


important component of mathematical modeling is validation, which roughly is the
process of verifying the usefulness of the model in terms of describing the system being
modeled. One aspect of validation is quantifying the accuracy of predictions made by the
model. In the context of the LLN, we have assigned a probability to an event. In many
situations, this is a simple way to model the result of a complex process.

Example 3.2.2

We could try to describe the result of a particular coin flip for a fair coin determin-
istically by using physics involving the initial position on the thumb, equations de-
scribing the effects of force and direction of the flip, the effect of airflow, and so on.
In the absence of such a detailed computation for a particular flip, it is reasonable to
believe that the outcome is equally likely to be heads or tails.

The LLN describes how the assignment of a probability to the event in question can be
validated by repeated independent experiments.
The LLN can be proved using a very elementary argument based on the binomial
expansion. Of course, the binomial coefficients and binomial expansions are very impor-
tant in probability. We recall the basic ideas.

Definition 3.2.1

The binomial coefficient is
$$\binom{i}{j} = \frac{i!}{j!\,(i - j)!}, \quad i, j \in \mathbb{N}, \; j \le i.$$
Theorem 3.2.2

For any numbers a, b and positive integer m,
$$(a + b)^m = \sum_{k=0}^{m} \binom{m}{k} a^k b^{m-k},$$
$$m(a + b)^{m-1} = \sum_{k=0}^{m} k \binom{m}{k} a^{k-1} b^{m-k},$$
$$a(a + b)^{m-1} = \sum_{k=0}^{m} \frac{k}{m} \binom{m}{k} a^k b^{m-k},$$
$$(1 - m^{-1})\, a^2 (a + b)^{m-2} = \sum_{k=0}^{m} \left(\frac{k^2}{m^2} - \frac{k}{m^2}\right) \binom{m}{k} a^k b^{m-k}.$$

(The last three identities follow from the first by differentiating with respect to a and rescaling.)
By setting $a = x$ and $b = 1 - x$, we see that Theorem 3.2.2 implies
$$1 = \sum_{k=0}^{m} \binom{m}{k} x^k (1 - x)^{m-k}.$$

Definition 3.2.2

The m + 1 binomial polynomials of degree m are
$$p_{m,k}(x) = \binom{m}{k} x^k (1 - x)^{m-k}, \quad k = 0, 1, \ldots, m.$$

The connection to probability is,

Theorem 3.2.3

If $0 \le x \le 1$ is the probability of an event E, then $p_{m,k}(x)$ is the probability that E
occurs exactly k times in m independent trials.
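Theorem 3.2.3 is easy to test by simulation. A minimal sketch (our own illustration; the parameters m, x, and the number of repetitions are arbitrary):

```python
import random
from math import comb

random.seed(2)
m, x, trials = 10, 0.4, 100_000  # arbitrary parameter choices

# Empirical frequency of "E occurs exactly k times in m independent trials".
counts = [0] * (m + 1)
for _ in range(trials):
    k = sum(random.random() < x for _ in range(m))
    counts[k] += 1

for k in range(m + 1):
    p_mk = comb(m, k) * x**k * (1 - x)**(m - k)
    print(f"k = {k:>2}:  empirical = {counts[k] / trials:.4f},  p_mk = {p_mk:.4f}")
```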

Theorem 3.2.2 implies

Theorem 3.2.4

$$0 \le p_{m,k}(x) \le 1, \quad \text{for all } 0 \le x \le 1. \tag{3.5a}$$
$$\sum_{k=0}^{m} p_{m,k}(x) = 1, \quad \text{for all } 0 \le x \le 1. \tag{3.5b}$$
$$\sum_{k=0}^{m} k\, p_{m,k}(x) = m x, \quad \text{for all } 0 \le x \le 1. \tag{3.5c}$$
$$\sum_{k=0}^{m} k^2\, p_{m,k}(x) = (m^2 - m) x^2 + m x, \quad \text{for all } 0 \le x \le 1. \tag{3.5d}$$
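These identities can also be checked numerically. A short sketch (our own illustration) evaluates the sums in (3.5b)-(3.5d) at an arbitrary point and compares them with the closed forms:

```python
from math import comb

def p(m, k, x):
    """Binomial polynomial p_{m,k}(x) = C(m, k) x^k (1 - x)^(m - k)."""
    return comb(m, k) * x**k * (1 - x)**(m - k)

m, x = 20, 0.37  # arbitrary test values

print(sum(p(m, k, x) for k in range(m + 1)), 1.0)                # (3.5b)
print(sum(k * p(m, k, x) for k in range(m + 1)), m * x)          # (3.5c)
print(sum(k**2 * p(m, k, x) for k in range(m + 1)),
      (m**2 - m) * x**2 + m * x)                                 # (3.5d)
```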

One important use of binomial polynomials is proving the LLN.

Proof. We prove Theorem 3.2.1 by proving that given $\epsilon > 0$ and $\delta > 0$,
$$\sum_{\substack{0 \le k \le m \\ |k/m - x| < \epsilon}} p_{m,k}(x) > 1 - \delta,$$
for all m sufficiently large.

Consider the complementary sum
$$\sum_{\substack{0 \le k \le m \\ |k/m - x| \ge \epsilon}} p_{m,k}(x) = 1 - \sum_{\substack{0 \le k \le m \\ |k/m - x| < \epsilon}} p_{m,k}(x),$$
which is estimated
$$\sum_{\substack{0 \le k \le m \\ |k/m - x| \ge \epsilon}} p_{m,k}(x) \le \sum_{\substack{0 \le k \le m \\ |k/m - x| \ge \epsilon}} \frac{1}{\epsilon^2}\left(\frac{k}{m} - x\right)^2 p_{m,k}(x) \le \frac{1}{\epsilon^2 m^2}\, T_m(x),$$
where
$$T_m(x) = \sum_{k=0}^{m} (k - m x)^2\, p_{m,k}(x).$$
Using (3.5b)-(3.5d), we find $T_m(x) = m x (1 - x)$, and so $T_m(x) \le \frac{m}{4}$, for all $0 \le x \le 1$.
Therefore,
$$\sum_{\substack{0 \le k \le m \\ |k/m - x| \ge \epsilon}} p_{m,k}(x) \le \frac{1}{4 m \epsilon^2} \quad \text{and} \quad \sum_{\substack{0 \le k \le m \\ |k/m - x| < \epsilon}} p_{m,k}(x) \ge 1 - \frac{1}{4 m \epsilon^2}, \quad 0 \le x \le 1. \tag{3.6}$$
This shows the result.

Remark 3.2.1

It is interesting to consider how the final line implies the result. Given $\epsilon$ and $\delta$, we
require
$$m \ge \frac{1}{4 \delta \epsilon^2}.$$
This can be achieved uniformly with respect to the value of x. However, increasing
the accuracy by decreasing $\epsilon$ requires a very substantial increase in the number of
trials m. This adverse scaling occurs again, unfortunately.
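For a concrete instance of the bound (our own numbers, not from the text): to achieve accuracy $\epsilon = 0.01$ with confidence $\delta = 0.01$ requires
$$m \ge \frac{1}{4 \cdot 0.01 \cdot (0.01)^2} = 250{,}000$$
trials, and merely halving $\epsilon$ to 0.005 quadruples the requirement to $m \ge 1{,}000{,}000$.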

3.3 The Weierstrass Approximation Theorem


The Weierstrass Approximation Theorem states that a real-valued continuous function
on a closed interval can be approximated uniformly by a polynomial to any desired ac-
curacy over the interval. Over-familiarity with the commonplace use of polynomials to
approximate functions may weaken the sense of the importance of this fact. It helps to
consider the issues underlying two common ways to generate an approximate polynomial
for a function.
The most familiar approach is the Taylor polynomial, which is computed using the set
of values of a function and its derivatives at a point. The first issue is that this applies only
to functions with derivatives, not functions that are merely continuous. Another issue is
that this is a local approximation in the sense that the approximation is only accurate in
a small neighborhood of the point where the derivatives are evaluated.
The second common approach is to use interpolation. For any set of n + 1 distinct
points in a given interval, there is a unique polynomial of degree at most n that interpolates,
or agrees in value, with the function at the points. An issue is that unless special care is
taken in constructing the interpolation points, it becomes increasingly difficult to com-
pute the approximate polynomial with increasing n and, while the resulting polynomial
matches the function exactly at the interpolating points, it provides a poor approximation
in between the interpolating points. Essentially, interpolating polynomials of high degree
must oscillate with extremely large amplitudes in order to match a set of function values.
This issue can be treated under additional assumptions, e.g. an inner product yielding
orthogonality.
Recall the metric space C ([a, b ]) of real-valued continuous functions on [a, b ] with
the max/sup metric.

Theorem 3.3.1: Weierstrass Approximation Theorem

Assume $f \in C([a, b])$. Given $\epsilon > 0$, there is a polynomial $b_m$ with sufficiently high
degree m, such that
$$\sup_{[a, b]} | f(x) - b_m(x) | < \epsilon.$$

We observe that the set of polynomials with rational coefficients is dense in the space
of all polynomials on [a, b ] with respect to the sup metric. The set of polynomials with
rational coefficients is a countable set. We conclude,

Theorem 3.3.2

C ([a, b ]) is separable.

Proof. [Weierstrass Approximation Theorem] To begin, we note that it suffices to treat
$[a, b] = [0, 1]$. Next, we introduce a convenient way to quantify the continuous behavior
of a function. Recall that a function f that is continuous on [a, b] is actually uniformly
continuous. If f is uniformly continuous on [a, b], then for any $\delta > 0$, the set of numbers
$$\{ | f(x) - f(y) | : x, y \in [a, b], \; |x - y| < \delta \},$$
is bounded, and this set must have a least upper bound.

Definition 3.3.1

Let f be continuous on [a, b]. The modulus of continuity of f is
$$\omega(f, \delta) = \sup_{\substack{x, y \in [a, b] \\ |x - y| < \delta}} | f(x) - f(y) |.$$

To create the approximation, we use special polynomials. We partition [0, 1] into
m + 1 uniformly spaced nodes $x_k = k/m$ for $k = 0, 1, \ldots, m$.

Definition 3.3.2

The Bernstein polynomial of degree m for f on [0, 1] is
$$b_m(f, x) = \sum_{k=0}^{m} f(x_k)\, p_{m,k}(x).$$

The degree of $b_m(f, x)$ is at most m but not necessarily equal to m.
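The construction is simple enough to compute directly from the definition. A minimal sketch (our own illustration; the target function $|t - 1/2|$ is an arbitrary choice of a continuous, non-differentiable function):

```python
from math import comb

def bernstein(f, m, x):
    """Evaluate b_m(f, x) = sum_k f(k/m) C(m, k) x^k (1 - x)^(m - k) on [0, 1]."""
    return sum(f(k / m) * comb(m, k) * x**k * (1 - x)**(m - k)
               for k in range(m + 1))

# A continuous but non-differentiable target: f(t) = |t - 1/2|.
f = lambda t: abs(t - 0.5)
for m in [4, 16, 64, 256]:
    grid = [i / 1000 for i in range(1001)]
    err = max(abs(f(t) - bernstein(f, m, t)) for t in grid)
    print(f"m = {m:>3}:  max sampled error = {err:.4f}")
```

The slow decay of the error visible in the output is quantified by Theorem 3.3.3 below.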


Theorem 3.3.1 follows from

Theorem 3.3.3: Bernstein Approximation Theorem

Let f be a continuous function on [0, 1] and m be a positive integer. Then for $x \in [0, 1]$,
$$| f(x) - b_m(f, x) | \le \frac{9}{4}\, \omega(f, m^{-1/2}). \tag{3.7}$$
Using (3.5b), we write the error as
$$f(x) - b_m(f, x) = \sum_{k=0}^{m} f(x)\, p_{m,k}(x) - \sum_{k=0}^{m} f(x_k)\, p_{m,k}(x) = \sum_{k=0}^{m} (f(x) - f(x_k))\, p_{m,k}(x).$$

We split the last sum into two parts. For $\delta > 0$,
$$f(x) - b_m(f, x) = \underbrace{\sum_{\substack{0 \le k \le m \\ |k/m - x| < \delta}} (f(x) - f(x_k))\, p_{m,k}(x)}_{I} + \underbrace{\sum_{\substack{0 \le k \le m \\ |k/m - x| \ge \delta}} (f(x) - f(x_k))\, p_{m,k}(x)}_{II}.$$

Now I is small by continuity,
$$|I| \le \sum_{\substack{0 \le k \le m \\ |k/m - x| < \delta}} | f(x) - f(x_k) |\, p_{m,k}(x) \le \omega(f, \delta) \cdot 1.$$

For II, we note that there is a C such that $| f(x) | \le C$, for $0 \le x \le 1$. Hence, II is small by
the Law of Large Numbers. More precisely, (3.6) in the proof of the LLN implies
$$|II| \le 2C \sum_{\substack{0 \le k \le m \\ |k/m - x| \ge \delta}} p_{m,k}(x) \le \frac{C}{2 m \delta^2}.$$

So, we can make II as small as desired by taking m large. It is a good exercise to show that
in fact,
$$|II| \le \omega(f, \delta) \left( 1 + \frac{1}{4 m \delta^2} \right),$$
and so,
$$| f(x) - b_m(f, x) | \le \omega(f, \delta) \left( 2 + \frac{1}{4 m \delta^2} \right).$$
Setting $\delta = m^{-1/2}$ proves the result.

The Bernstein polynomials are not interpolating polynomials.


Example 3.3.1

For $x^2$ on [0, 1],
$$b_m(x^2, x) = \sum_{k=0}^{m} \left(\frac{k}{m}\right)^2 p_{m,k}(x) = x^2 + \frac{1}{m}\, x(1 - x).$$
So, $b_m(x^2, x) \ne x^2$ for $x \ne 0, 1$! However, $|b_m(x^2, x) - x^2| \le 1/(4m) \to 0$ as $m \to \infty$.
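A quick numeric check of this formula (our own sketch; the sample point is arbitrary):

```python
from math import comb

def bernstein(f, m, x):
    """b_m(f, x) = sum_k f(k/m) C(m, k) x^k (1 - x)^(m - k)."""
    return sum(f(k / m) * comb(m, k) * x**k * (1 - x)**(m - k)
               for k in range(m + 1))

m, x = 10, 0.3
print(bernstein(lambda t: t * t, m, x))  # 0.111...
print(x**2 + x * (1 - x) / m)            # 0.111 = x^2 + x(1-x)/m
```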

3.4 References
3.5 Worked problems
