
4 Chaining
4.1 What is chaining?
4.2 Covering and packing numbers
4.3 Bounding sums by integrals
4.4 Orlicz norms of maxima of finitely many variables
4.5 Chaining with norms and packing numbers
4.6 An alternative to packing: majorizing measures
4.7 Chaining with tail probabilities
4.7.1 Tail chaining via packing
4.7.2 Tail chaining via majorizing measures
4.8 Brownian Motion, for example
4.8.1 Control by (conditional) expectations
4.8.2 Control by $\Psi_2$-norm
4.8.3 Control by tail probabilities
4.9 Problems
4.10 Notes

Version: 20nov15 (Mini-empirical). Printed: 22 November 2015.
© David Pollard

Chapter 4

Chaining

This chapter introduces some of the tools that are used in the next chapter
(on chaining) to construct approximations to stochastic processes.

Section 4.1 describes two related methods for quantifying the complexity of
a metric space: by means of covering/packing numbers or by means of
majorizing measures.
Section 4.2 defines covering and packing numbers.
Section 4.3 explains why chaining bounds are often expressed as integrals.
Section 4.4 presents a few simple ways to bound the Orlicz norm of a
maximum of finitely many random variables.
Section 4.5 illustrates the method for combining chaining with maximal
inequalities like those from Section 4.4 to control the norm of the
oscillation of a stochastic process.
Section 4.6 mentions a more subtle alternative to packing/covering, which
is discussed in detail in Chapter 7.
Section 4.7 discusses chaining with tail probabilities.
Section 4.8 compares various chaining methods by applying them to the
humble example of Brownian motion on the unit interval.

4.1 What is chaining?


This chapter begins the task of developing various probabilistic bounds for
stochastic processes, $X = \{X_t : t \in T\}$, where the index set $T$ is equipped
with a metric (or semi-metric) $d$ that gives some control over the increments
of the process. For example, we might have $\|X_s - X_t\| \le C\,d(s,t)$ for some
norm (or semi-norm) $\|\cdot\|$ and some constant $C$, or we might have a tail
bound
$$P\{|X_s - X_t| \ge \eta\, d(s,t)\} \le \beta(\eta) \quad\text{for all } s, t \in T,$$
for some decreasing function $\beta$. The leading case, the example that drove
the development of much theory, is the centered Gaussian process with
(semi-)metric defined by $d_X^2(s,t) = P|X_s - X_t|^2$.
Remark. Especially if $T$ is finite or countably infinite, it is not essential
that $d(s,t) = 0$ should imply that $s = t$. It usually suffices to have
$X_s = X_t$ almost surely if $d(s,t) = 0$. More elegantly, we could
partition $T$ into equivalence classes $[t] = \{s \in T : d(s,t) = 0\}$ then
replace $T$ by a subset $T_0$ consisting of one point from each equivalence
class. The restriction of $d$ to $T_0$ would then be a metric. Like most
probabilists, I tend to ignore these niceties unless they start causing
trouble. When dealing with finite index sets we may assume $d$ is a
metric.

The central task in many probabilistic and statistical problems is to find
good upper bounds for quantities such as $\sup_{t\in T}|X_t|$ or the oscillation,
<1>  $$\mathrm{osc}(\delta, X, T) := \sup\{|X_s - X_t| : s, t \in T,\ d(s,t) < \delta\}.$$
For example, we might seek bounds on $\|\sup_{t\in T}|X_t|\|$ or $\|\mathrm{osc}(\delta, X, T)\|$, for
various norms, or on their tail probability analogs
$$P\{\sup_{t\in T}|X_t| > \eta\} \quad\text{or}\quad P\{\mathrm{osc}(\delta, X, T) > \eta\}.$$
Such quantities play an important role in the theory of stochastic processes,
empirical process theory, and statistical asymptotic theory. As explained in
Section 5.4, bounds for the oscillation are essential for the construction of
processes with continuous sample paths and for the study of convergence
in distribution of sequences of stochastic processes. In particular, oscillation
control is the key ingredient in the proofs of Donsker theorems for empirical
processes. In the literature on the asymptotic theory for estimators defined
by optimization over random processes, oscillation bounds (or something
similar) have played a major role under various names, such as stochastic
asymptotic equicontinuity.
If $T$ is uncountable the suprema in the previous paragraph need not
be measurable, which leads to difficulties of the type discussed in Chapter 5.
Some authors sidestep this difficulty by interpreting the inequalities
as statements about arbitrarily large finite subsets of $T$. For example,
interpret $P\sup_{t\in T}X_t$ to mean $\sup_S\{P\sup_{t\in S}X_t : \text{finite } S \subseteq T\}$.
With such a convention the real challenge becomes: find bounds for
finite $S$ that do not grow unhelpfully large as the size (cardinality) of $S$
increases. For example, if $T^*$ is any countable subset of $T$ one might hope
to get a reasonable bound for $P\sup_{t\in T^*}X_t$ by passing to the limit along a
sequence of finite subsets $S_n$ that increase to $T^*$. If $T^*$ is dense in $T$ the
methods from Chapter 5 then take over, leading to bounds for $P\sup_{t\in T}X_t$
if the version of $X$ has suitable sample paths.
The workhorse of the modern approach to approximating stochastic processes
is called chaining. Suppose you wish to approximate $\max_{t\in S}X_t$ on
a (possibly very large) finite subset $S$ of $T$. You could try a union bound,
such as
$$P\{\max_{t\in S}|X_t| \ge \eta\} \le \sum_{t\in S}P\{|X_t| \ge \eta\},$$
but typically the upper bound grows larger than 1 as the size of $S$ increases.
Instead you could creep up on $S$ through a sequence of finite subsets,
$S_0, S_1, \dots, S_m = S$, breaking the process on $S$ into a contribution from $S_0$
plus a sum of increments across each $S_{i+1}$ to $S_i$ pair.
To carry out such a strategy you need maps $\ell_i : S_{i+1} \to S_i$ for
$i = 0, \dots, m-1$. The composition $L_p = \ell_p \circ \ell_{p+1} \circ \dots \circ \ell_{m-1}$ maps $S_m$ into $S_p$,
for $0 \le p < m$. Each $t = t_m$ in $S_m$ is connected to a point $t_0 = L_0 t$ in $S_0$ by
a chain of points
<2>  $$t_m = t \ \xrightarrow{\ell_{m-1}}\ t_{m-1} = L_{m-1}(t) \ \xrightarrow{\ell_{m-2}}\ t_{m-2} = \dots \ \xrightarrow{\ell_0}\ t_0 = L_0(t),$$
with a corresponding decomposition of the process into a sum of increments,
<3>  $$X(t) - X(t_0) = \sum_{i=0}^{m-1}\bigl(X(t_{i+1}) - X(t_i)\bigr) \quad\text{with } t_i = L_i(t).$$

[Figure: a chain of points linking $t = t_m$ in $S_m$ to $t_{m-1} = \ell_{m-1}t$ in $S_{m-1}$, then to $t_{m-2} = \ell_{m-2}t_{m-1}$, and so on through $S_1$ down to $t_0 = L_0(t)$ in $S_0$.]

Remark. To me $\ell$ stands for link. The pair $(t, \ell_i t)$ with $t \in S_{i+1}$ defines
a link in the chain connecting $t$ to $L_0 t$.

The point of decomposition <3> is that many different $t$'s in $S$ might
share the same increment $X(s) - X(\ell_i(s))$ with $s \in S_{i+1}$, but you need to
control that increment only once. For example, for each $t$ in $S$ the triangle
inequality applied to <3> gives
$$|X(t) - X(t_0)| \le \sum_{i=0}^{m-1}|X(t_{i+1}) - X(t_i)| \le \sum_{i=0}^{m-1}\max_{s\in S_{i+1}}|X(s) - X(\ell_i(s))|.$$

Remark. Does $|\cdot|$ stand for the absolute value of a real-valued random
variable or the norm of some random element of a normed vector
space? It really doesn't matter which interpretation you prefer. The
chaining argument generalizes easily to more exotic random elements
of vector spaces.

The final sum is the same for every $t$; it also provides an upper bound for
the maximum over $t$:
<4>  $$\max_{t\in S}|X(t) - X(t_0)| \le \sum_{i=0}^{m-1}\max_{s\in S_{i+1}}|X(s) - X(\ell_i(s))|.$$
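As a concrete check of inequality <4>, here is a minimal sketch under assumptions that are not part of the text: the index sets are dyadic grids $S_i = \{j/2^i\}$ in $[0,1]$, the link $\ell_i$ sends a point of $S_{i+1}$ to the largest point of $S_i$ not exceeding it, and the process is a scaled simple random walk. The function name `dyadic_chain_check` is hypothetical.

```python
import random

def dyadic_chain_check(m=6, seed=0):
    """Compare both sides of inequality <4> for a random walk on the dyadic grid S_m."""
    rng = random.Random(seed)
    n = 2 ** m
    # X[j] is the value of the process at t = j / 2**m (a scaled simple random walk).
    X = [0.0]
    for _ in range(n):
        X.append(X[-1] + rng.choice((-1.0, 1.0)) / n ** 0.5)

    def link(j, i):
        """l_i: send fine-grid index j to the largest point of S_i not exceeding it."""
        step = 2 ** (m - i)          # spacing of S_i measured on the finest grid
        return (j // step) * step

    # Right-hand side of <4>: one maximum per level, shared by every chain.
    rhs = 0.0
    for i in range(m):
        level = range(0, n + 1, 2 ** (m - i - 1))   # fine-grid indices of S_{i+1}
        rhs += max(abs(X[s] - X[link(s, i)]) for s in level)

    # Left-hand side: follow each chain t = t_m -> t_{m-1} -> ... -> t_0.
    lhs = 0.0
    for t in range(n + 1):
        t0 = t
        for i in reversed(range(m)):
            t0 = link(t0, i)
        lhs = max(lhs, abs(X[t] - X[t0]))
    return lhs, rhs

lhs, rhs = dyadic_chain_check()
print(f"max over chains = {lhs:.4f}, sum of level maxima = {rhs:.4f}")
```

The right-hand side needs only $m$ maxima, one per level, however many chains pass through each link.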

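If $\eta = \eta_0 + \dots + \eta_{m-1}$ then a simple union bound provides some control over
the quantity on the left-hand side of <4>:
$$P\{\max_{t\in S}|X(t) - X(t_0)| \ge \eta\}
\le P\bigcup_{i}\bigcup_{s\in S_{i+1}}\{|X(s) - X(\ell_i(s))| \ge \eta_i\}
\le \sum_{i=0}^{m-1}\sum_{s\in S_{i+1}}P\{|X(s) - X(\ell_i(s))| \ge \eta_i\}.$$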

The tail probability $P\{\max_{t\in S}|X_t| \ge \epsilon + \eta\}$ is less than $P\{\max_{t\in S_0}|X_t| \ge \epsilon\}$
plus the double sum over $i$ and $S_{i+1}$. If the growth in the size of the $S_i$'s could
be offset by a decrease in the $d(s, \ell_i s)$ distances you might get probabilistic
bounds that do not depend explicitly on the size of $S$.
The extra details involving the choice of the $\eta_i$'s make chaining with tail
probabilities seem more complicated than the analogous argument where one
merely applies some norm $\rho$ to both sides of inequality <4>, giving
<5>  $$\rho\Bigl(\max_{t\in S}|X(t) - X(t_0)|\Bigr) \le \sum_{i=0}^{m-1}\rho\Bigl(\max_{s\in S_{i+1}}|X(s) - X(\ell_i(s))|\Bigr).$$
Here I am assuming that $\rho(Y_1) \le \rho(Y_2)$ if $0 \le Y_1 \le Y_2$. More precisely, from
now on I assume that $\rho$ has the following properties.

<6> Assumption. Let $\mathcal{M}^+ = \mathcal{M}^+(\Omega, \mathcal{F}, P)$ denote the set of all $\mathcal{F}$-measurable
functions on a probability space $(\Omega, \mathcal{F}, P)$ taking values in $[0, \infty]$. Assume
$\rho$ is a map from $\mathcal{M}^+$ into $[0, \infty]$ that satisfies

(i) $\rho(X + Y) \le \rho(X) + \rho(Y)$

(ii) $\rho(cX) = c\rho(X)$ for all $c \in \mathbb{R}^+$.

(iii) $\rho(X) \le \rho(Y)$ if $X \le Y$ almost surely

(iv) $\rho(X) = 0$ iff $X = 0$ almost surely.

Remark. If we intend to pass to the limit as a sequence of finite subsets
increase to a countable subset of $T$ we also need something like: if
$0 \le X_1 \le X_2 \le \dots \uparrow X$ almost surely then $\rho(X_n) \uparrow \rho(X)$.

In place of union bounds we now need ways to control the maximum
over finite sets. See Section 4.4 for some simple methods when $\rho$ is an Orlicz
norm.
The traditional approach to chaining chooses the $S_i$ sets for their uniform
approximation properties. Typically one starts with a decreasing sequence
$\{\delta_i\}$, such as $\delta_i = \delta/2^i$ for some $\delta > 0$, then seeks sets $S_i$ with
$\min_{s\in S_i}d(t,s) \le \delta_i$ for every $t$ in $S$ such that $S_i$ is as small as possible. The
smallest cardinality is called the covering number. Section 4.2 gives the
formal definition, as well as describing a slightly more convenient concept
called the packing number.
Section 4.5 illustrates the method for combining bounds on packing
numbers with bounds on the norms of maxima of finitely many variables
(Section 4.4) to obtain useful bounds for a norm of $\max_{t\in S}|X(t) - X(t_0)|$.
Section 4.7 replaces norms by tail probabilities.
Section 4.8 returns to the relatively simple case of Brownian motion on
the unit interval as a test case for comparing the different approaches.
For each method I use the task of controlling the oscillation of a stochastic
process as a nontrivial example of what can be done with chaining.

4.2 Covering and packing numbers


The simplest strategy for chaining is to minimize the size (cardinality) of $S_i$
subject to a given upper bound on $\sup_{t\in T}\min_{s\in S_i}d(t,s)$. That idea translates
into a statement about covering numbers.

Remark. The logarithm of the covering number is sometimes called the
metric entropy, probably because it appears naturally when dealing
with processes whose increments have tail probabilities that decrease
exponentially fast.

<7> Definition. For a subset $F$ of $T$ write $N_T(\delta, F, d)$ for the $\delta$-covering number,
the smallest number of closed $\delta$-balls needed to cover $F$. That is, the
covering number is the smallest $N$ for which there exist points $t_1, \dots, t_N$
in $T$ with $\min_{i\le N}d(t, t_i) \le \delta$ for each $t$ in $F$. The set of centers $\{t_i\}$ is
called a $\delta$-net for $F$.
Remark. Notice a small subtlety related to the subscript $T$ in the
definition. If we regard $F$ as a metric space in its own right, not just as
a subset of $T$, then the covering numbers might be larger because the
centers $t_i$ would be forced to lie in $F$. It is an easy exercise (select a
point of $F$ from each covering ball that actually intersects $F$) to show
that $N_F(2\delta, F, d) \le N_T(\delta, F, d)$. The extra factor of 2 would usually
be of little consequence. When in doubt, you should interpret covering
numbers to refer to $N_F$.

Some metric spaces (such as the whole real line under its usual metric)
cannot be covered by a finite set of balls of a fixed radius. A metric space $T$
for which $N_T(\delta, T, d) < \infty$ for every $\delta > 0$ is said to be totally bounded.
A metric space is compact if and only if it is both complete and totally
bounded (Dudley, 2003, Section 2.3).
I prefer to work with the packing number $\mathrm{pack}(\delta, F, d)$, defined as the
largest $N$ for which there exist points $t_1, \dots, t_N$ in $F$ that are $\delta$-separated,
that is, for which $d(t_i, t_j) > \delta$ if $i \ne j$. Notice the lack of a subscript $T$; the
packing numbers are an intrinsic property of $F$, and do not depend on $T$
except through the metric it defines on $F$.
<8> Lemma. For each $\delta > 0$,
$$N_F(\delta, F, d) \le \mathrm{pack}(\delta, F, d) \le N_T(\delta/2, F, d) \le N_F(\delta/2, F, d).$$

Proof For the middle inequality, observe that no closed ball of radius $\delta/2$
can contain points more than $\delta$ apart. Each of the centers for $\mathrm{pack}(\delta, F, d)$
must lie in a distinct $\delta/2$ covering ball. The other inequalities have similarly
simple proofs.

<9> Example. Let $\|\cdot\|$ denote any norm on $\mathbb{R}^k$. For example, it might be
ordinary Euclidean distance (the $\ell_2$ norm), or the $\ell_1$ norm, $\|x\|_1 = \sum_{i\le k}|x_i|$.
The covering numbers for such norms share a common geometric bound.
Write $B_R$ for the ball of radius $R$ centered at the origin. For a fixed $\epsilon$,
with $0 < \epsilon \le 1$, how many balls of radius $\epsilon R$ does it take to cover $B_R$?
Equivalently, what are the packing numbers for $B_R$?
Let $\{x_1, \dots, x_N\}$ be any set of points in $B_R$ that is $\epsilon R$-separated. The
closed balls $B[x_i, \epsilon R/2]$ of radius $\epsilon R/2$ centered at the $x_i$ are disjoint and
their union lies within $B_{R+\epsilon R/2}$. Write $\Gamma$ for the Lebesgue measure of the
unit ball $B_1$. Each $B[x_i, \epsilon R/2]$ has Lebesgue measure $\Gamma(\epsilon R/2)^k$ and $B_{R+\epsilon R/2}$
has Lebesgue measure $\Gamma(R + \epsilon R/2)^k$. It follows that
$$N \le \frac{\Gamma(R + \epsilon R/2)^k}{\Gamma(\epsilon R/2)^k} = \left(\frac{2+\epsilon}{\epsilon}\right)^k \le (3/\epsilon)^k.$$
That is, $\mathrm{pack}(\epsilon R, B_R, d) \le (3/\epsilon)^k$ for $0 < \epsilon \le 1$, where $d$ denotes the metric
corresponding to $\|\cdot\|$.
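The volume bound can be checked empirically. The sketch below assumes the Euclidean norm on $\mathbb{R}^2$ and a random cloud of points in $B_R$, neither of which comes from the text. It extracts an $\epsilon R$-separated subset greedily and compares its size with $(3/\epsilon)^k$. The helper names `greedy_separated` and `sample_ball` are hypothetical.

```python
import math
import random

def greedy_separated(points, delta):
    """Return a maximal subset of points that are pairwise more than delta apart."""
    chosen = []
    for p in points:
        if all(math.dist(p, q) > delta for q in chosen):
            chosen.append(p)
    return chosen

def sample_ball(n, radius, rng):
    """Draw n points uniformly from the Euclidean disc of the given radius in R^2."""
    pts = []
    while len(pts) < n:
        x, y = rng.uniform(-radius, radius), rng.uniform(-radius, radius)
        if x * x + y * y <= radius * radius:
            pts.append((x, y))
    return pts

rng = random.Random(1)
R, eps, k = 2.0, 0.25, 2
cloud = sample_ball(5000, R, rng)
packing = greedy_separated(cloud, eps * R)
bound = (3 / eps) ** k
print(len(packing), "separated points; volume bound", bound)
```

Any separated subset of the cloud is also a separated subset of $B_R$, so its size can never exceed the packing number, whatever the sample.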


Finite packing numbers lend themselves in a simple way to the construction
of increasing sequences of approximating subsets. We start with a
maximal subset $S_0$ of points that are $\delta_0$-separated, then enlarge to a maximal
subset of points that are $\delta_1$-separated, and so on.
If $S$ is finite, the whole procedure must stop after a finite number of
steps, leaving $S$ itself as the final approximating set. In any case, we have
constructed nested finite subsets $S_0 \subseteq S_1 \subseteq S_2 \subseteq S_3 \subseteq \dots$ and we have a
bound on the sizes, $\#S_i \le \mathrm{pack}(\delta_i, S, d)$. If $\delta_0 = \mathrm{diam}(S)$ then $S_0$ consists
of only one point.
The natural choice for $\ell_i$ is the nearest neighbor map from $S_{i+1}$ to $S_i$, for
which $d(t, \ell_i(t)) = \min_{s\in S_i}d(t,s) \le \delta_i$, with a convention such as choosing
the $s$ that appeared earliest in the construction to handle ties.
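The construction just described can be sketched in a few lines. The code assumes a finite index set stored as a list, a metric supplied as a Python function, and halving radii $\delta_i = \delta_0/2^i$. The names `nested_separated_sets` and `nearest_neighbor_link` are illustrative, not from the text.

```python
import random

def nested_separated_sets(S, d, delta0, m):
    """Greedily build nested sets S_0, ..., S_m with S_i maximal delta_i-separated in S."""
    levels, current = [], []
    for i in range(m + 1):
        delta_i = delta0 / 2 ** i
        current = list(current)      # copy, so earlier levels are left unchanged
        for t in S:
            # Keep t only if it is more than delta_i from everything kept so far.
            if all(d(t, s) > delta_i for s in current):
                current.append(t)
        levels.append(current)
    return levels

def nearest_neighbor_link(t, S_i, d):
    """l_i(t): closest point of S_i; min() returns the earliest-constructed point on ties."""
    return min(S_i, key=lambda s: d(t, s))

rng = random.Random(0)
S = [rng.random() for _ in range(200)]
d = lambda s, t: abs(s - t)
delta0 = max(S) - min(S)             # diam(S), so S_0 is a single point
levels = nested_separated_sets(S, d, delta0, m=8)
print([len(level) for level in levels])
```

Maximality at level $i$ is what guarantees $d(t, \ell_i(t)) \le \delta_i$: a point further than $\delta_i$ from every member of $S_i$ would have been added to $S_i$.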

4.3 Bounding sums by integrals


Typically we do not know the exact values of the packing numbers but
instead only have an upper bound, $\mathrm{pack}(\delta, S, d) \le M(\delta)$. The chaining
strategy outlined in Section 4.1 often leads to upper bounds of the
form $\sum_{i\le k}G(\delta_i)\delta_{i-1}$, with $G(\cdot)$ a decreasing function, such as $\Psi^{-1}(M(\delta))$.
If the $\delta_i$'s decrease geometrically, say $\delta_i = \delta_0/2^i$, then $\delta_{i-1} = 4(\delta_i - \delta_{i+1})$, so that
$G(\delta_i)\delta_{i-1}/4$ is the area of a rectangle that sits beneath the graph of $G$ over the
interval $[\delta_{i+1}, \delta_i]$. The sum can therefore be bounded above by an integral,
$$\sum_{i\le k}G(\delta_i)\delta_{i-1} \le 4\int_{\delta_{k+1}}^{\delta_1}G(r)\,dr \le 4\int_0^{\delta_1}G(r)\,dr.$$
If $G$ is integrable the replacement of the lower terminal by 0 costs little. Note
the similarity to Dudley's entropy integral assumption <24> in Section 4.6.
In the early empirical process literature, some authors chose the $\delta_i$'s to make
the $G(\delta_i)$'s increase geometrically, for similar reasons.
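A quick numerical comparison of the sum with the integral may help. The sketch assumes the illustrative choice $G(r) = \sqrt{\log(1 + 1/r)}$, which corresponds to a hypothetical packing bound $M(r) = 1/r$ with a $\Psi_2$-type inverse, and uses a plain midpoint rule for the integral.

```python
import math

def G(r):
    """Illustrative decreasing function: sqrt(log(1 + M(r))) with M(r) = 1/r."""
    return math.sqrt(math.log(1.0 + 1.0 / r))

def chaining_sum(delta0, k):
    """Sum over i = 1..k of G(delta_i) * delta_{i-1}, with delta_i = delta0 / 2**i."""
    return sum(G(delta0 / 2 ** i) * (delta0 / 2 ** (i - 1)) for i in range(1, k + 1))

def midpoint_integral(f, a, b, n=20000):
    """Midpoint-rule approximation to the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (j + 0.5) * h) for j in range(n))

delta0, k = 1.0, 12
lhs = chaining_sum(delta0, k)
rhs = 4 * midpoint_integral(G, delta0 / 2 ** (k + 1), delta0 / 2)
print(f"sum = {lhs:.4f}, 4 * integral = {rhs:.4f}")
```

The gap between the two numbers reflects how much $G$ varies across each interval $[\delta_{i+1}, \delta_i]$.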

4.4 Orlicz norms of maxima of finitely many variables


The Orlicz norms are natural candidates for the $\rho$ of Assumption <6>.
The next Theorem collects a few simple inequalities well suited to inequality <5>.
See Section 4.8 for an example that highlights the relative merits
of the inequalities.

<10> Theorem. Let $X_1, \dots, X_N$ be random variables defined on the same $(\Omega, \mathcal{F}, P)$
with $\max_{i\le N}\|X_i\|_\Psi \le \delta$. Define $M := \max_{i\le N}X_i$.

(i) $PM \le \delta\,\Psi^{-1}(N)$

(ii) For each event $B$ with $PB > 0$,
$$P_BM \le \delta\,\Psi^{-1}(N/PB)$$
where $P_B$ denotes conditional expectation given $B$.

(iii) Let $\Phi$ be another Young function that is related to $\Psi$ by the inequality
$\Phi(\alpha)\Psi(\beta) \le \Psi(\alpha\beta)$ whenever $\Phi(\alpha) \wedge \Psi(\beta) \ge 1$. Then
$$\|M\|_\Phi \le 2\delta\,\Psi^{-1}(N).$$

(iv) In particular, if there exist positive constants $K_0$, $c_0$, and $c_1$ for which
$\Psi(\alpha)\Psi(\beta) \le K_0\Psi(c_0\alpha\beta)$ whenever $\alpha \wedge \beta \ge c_1$ then
$$\|M\|_\Psi \le C_0\,\delta\,\Psi^{-1}(N)$$
for a constant $C_0$ that depends only on $\Psi$.

Remark. Note well that the Theorem makes no independence assumptions.
For each $p \ge 1$ the Young function $\Psi(t) = t^p$ satisfies condition (iv)
with equality when $K_0 = c_0 = 1$ and $c_1 = 0$. Problem [??] shows that
each of the $\Psi_\alpha$ functions from Section 2.3, which grow like $\exp(t^\alpha)$,
also satisfy (iv).
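Part (i) of the Theorem is easy to probe by simulation. The sketch below assumes independent standard normal variables (the Theorem itself needs no independence) and the Young function $\Psi_2(x) = \exp(x^2) - 1$. For a standard normal $Z$ the identity $P\exp(Z^2/c^2) = (1 - 2/c^2)^{-1/2}$ gives $\|Z\|_{\Psi_2} = \sqrt{8/3}$. The function names are hypothetical.

```python
import math
import random

def mean_max_abs_gaussian(N, reps, rng):
    """Monte Carlo estimate of P max_{i<=N} |Z_i| for independent standard normals."""
    total = 0.0
    for _ in range(reps):
        total += max(abs(rng.gauss(0.0, 1.0)) for _ in range(N))
    return total / reps

def psi2_inverse(y):
    """Inverse of Psi_2(x) = exp(x^2) - 1."""
    return math.sqrt(math.log(1.0 + y))

rng = random.Random(2015)
delta = math.sqrt(8.0 / 3.0)         # Psi_2 norm of a standard normal
results = []
for N in (10, 100, 1000):
    estimate = mean_max_abs_gaussian(N, 400, rng)
    bound = delta * psi2_inverse(N)
    results.append((N, estimate, bound))
    print(f"N = {N:4d}: estimate {estimate:.3f}, bound {bound:.3f}")
```

Both columns grow like $\sqrt{\log N}$, which is the point of the inequality: the maximum of $N$ subgaussian variables grows only logarithmically in $N$.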


Proof Without loss of generality, suppose $\delta = 1$ and each $X_i$ is nonnegative,
so that $P\Psi(X_i) \le 1$ for each $i$.
Even though assertion (i) is a special case of assertion (ii), I think it
helps understanding to see both proofs. For (i), Jensen's inequality then
monotonicity of $\Psi$ imply
$$\Psi(PM) \le P\max_{i\le N}\Psi(X_i) \le \sum_{i\le N}P\Psi(X_i) \le N.$$
For (ii), partition $B$ into disjoint subsets $B_i$ such that $\max_jX_j = X_i$
on $B_i$. Without loss of generality assume $PB_i > 0$ for each $i$. (Alternatively,
just discard those $B_i$ with zero probability.) By definition,
$$P_BM = \sum_iP_B(B_iX_i) = \sum_i\frac{PB_i}{PB}P_{B_i}X_i.$$
By Jensen's inequality,
$$\Psi(P_{B_i}X_i) \le P_{B_i}\Psi(X_i) = \frac{P\bigl(B_i\Psi(X_i)\bigr)}{PB_i} \le \frac{P\Psi(X_i)}{PB_i} \le \frac{1}{PB_i}.$$
Thus
$$\Psi(P_BM) = \Psi\Bigl(\sum_i\frac{PB_i}{PB}P_{B_i}X_i\Bigr)
\le \sum_i\frac{PB_i}{PB}\Psi(P_{B_i}X_i) \quad\text{by convexity of } \Psi$$
$$\le \sum_i\frac{PB_i}{PB}\,\frac{1}{PB_i} = \frac{N}{PB}.$$
Perhaps it would be better to write the last equality as an inequality, to
cover the case where some $PB_i$ are zero.
For (iii), let $\tau = \Psi^{-1}(N)$, so that $\Psi(\tau) = N$. Trivially,
$$\{\Phi(M/\tau) \le 1\}\,\Phi(M/\tau) \le 1$$
and, by the relationship between $\Phi$ and $\Psi$,
$$\{\Phi(M/\tau) > 1\}\,\Phi(M/\tau)\Psi(\tau) \le \Psi(M) \le \sum_i\Psi(X_i).$$
Divide both sides of the second inequality by $N$, add, then take expectations
to deduce that
$$P\Phi(M/\tau) \le 1 + \frac{1}{N}\sum_iP\Psi(X_i) \le 2,$$
which implies $\|M\|_\Phi \le 2\tau$ (see Section 2.3).
For (iv), first note that $K_0\Psi(x) \le \Psi(K_1x)$ for $K_1 = \max(1, K_0)$. If
$L \ge 1$ the assumed bound gives
$$\Psi(\alpha)\Psi(\beta) \le \Psi(\alpha L)\Psi(\beta L) \le \Psi(K_1c_0\,\alpha L\,\beta L) \quad\text{if } (\alpha L) \wedge (\beta L) \ge c_1.$$
Choose $L$ so that $L\Psi^{-1}(1) \ge c_1$. Then $(\alpha L) \wedge (\beta L) \ge c_1$ if $\Psi(\alpha) \wedge \Psi(\beta) \ge 1$
and
$$\Psi(\alpha)\Psi(\beta) \le \tilde\Psi(\alpha\beta) := \Psi(c_0K_1L^2\alpha\beta) \quad\text{if } \Psi(\alpha) \wedge \Psi(\beta) \ge 1.$$
Invoke the result from (iii), noting that $\|Y\|_{\tilde\Psi} = c_0K_1L^2\|Y\|_\Psi$ for each
random variable $Y$.


4.5 Chaining with norms and packing numbers


Consider a process $\{X_t : t \in T\}$ indexed by a (necessarily totally bounded)
metric space for which we have some control over the packing numbers.
Instead of working via an assumption like $\|X_s - X_t\| \le C\,d(s,t)$, combined
with one of the bounds from Section 4.4, I'll cut out the middle man by
assuming directly there is a norm $\rho$ that satisfies Assumption <6> and an
increasing function $H : \mathbb{N} \to \mathbb{R}^+$ for which
<11>  $$\rho\Bigl(\max_{i\le N}\frac{|X_{s_i} - X_{t_i}|}{d(s_i, t_i)}\Bigr) \le H(N)$$
for all finite sets of increments with $d(s_i, t_i) > 0$. For example, if $\rho$ were
expected value and $\|X_s - X_t\|_{\Psi_2} \le d(s,t)$ then $H(N) = \Psi_2^{-1}(N) = \sqrt{\log(1 + N)}$.

<12> Lemma. Suppose $X$, $\rho$, and $H$ satisfy <11> and $S_0 \subseteq S_1 \subseteq \dots \subseteq S_m$ are
finite subsets of $T$ for which $d(s, S_i) \le \delta_i$ for all $s \in S_{i+1}$ and $\#S_i \le N_i$.
Then, in the chaining notation from <2>,
$$\rho\Bigl(\max_{t\in S_m}|X(t) - X(L_0t)|\Bigr) \le \sum_{i=0}^{m-1}\delta_iH(N_{i+1}).$$
In particular, if $\delta_i = \delta_0/2^i$ and $S_i$ is $\delta_i$-separated then $N_i \le \mathrm{pack}(\delta_i, T, d)$
and
$$\rho\Bigl(\max_{t\in S_m}|X(t) - X(L_0t)|\Bigr) \le 4\int_{\delta_{m+1}}^{\delta_1}H(\mathrm{pack}(r, T, d))\,dr.$$

Proof The first inequality results from applying $\rho$ to both sides of the
inequality
$$\max_{t\in S_m}|X(t) - X(L_0t)| \le \sum_{i=0}^{m-1}\delta_i\max_{s\in S_{i+1}}\frac{|X(s) - X(\ell_is)|}{d(s, \ell_is)}.$$
The method described in Section 4.3 provides the integral bound.




<13> Example. Suppose the $\{X_t : t \in T\}$ process has subgaussian increments
with $\|X_s - X_t\|_{\Psi_2} \le d(s,t)$. The function $\Psi_2(t) = e^{t^2} - 1$ satisfies the
assumptions of Theorem <10> part (iv). Inequality <11> holds with $\rho$ equal
to the $\|\cdot\|_{\Psi_2}$ and $H(N)$ a constant multiple of $\Psi_2^{-1}(N) = \sqrt{\log(1 + N)}$.
Because $\mathrm{pack}(r, T, d) = 1$ for all $r > \mathrm{diam}(T)$ we may as well assume
$\delta_0 \le \mathrm{diam}(T)$, in which case $\mathrm{pack}(r, T, d) \ge 2$ for all $r \le \delta_1$. That lets us
absorb the pesky $1+$ from the $\log()$ into the constant, leaving
$$\Bigl\|\max_{t\in S_m}|X(t) - X(L_0t)|\Bigr\|_{\Psi_2} \le C\int_0^{\delta_1}\sqrt{\log\mathrm{pack}(r, T, d)}\,dr,$$
for some universal constant $C$. Remember that for each $k \ge 1$ there is a
constant $C_k$ for which $\|Z\|_k \le C_k\|Z\|_{\Psi_2}$ (Section 2.4). Thus the upper
bounds for the $\|\cdot\|_{\Psi_2}$ norm of $\max_{t\in S_m}|X(t) - X(L_0t)|$ automatically imply
similar upper bounds for its $\|\cdot\|_k$ norm. In particular, both Theorems 3.2 and 3.3
in the expository paper of Pollard (1989) follow from the result for the $\Psi_2$-norm.
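To get a feel for the size of such an entropy integral, here is a small numerical sketch under an assumption that is not in the text: $T$ is the unit interval with its usual metric, for which $\mathrm{pack}(r, T, d) \le 2/r$ when $r \le 1$. The function name `entropy_integral` is hypothetical and the midpoint rule is only a rough quadrature.

```python
import math

def entropy_integral(M, upper, n=200000):
    """Midpoint-rule approximation to the integral of sqrt(log M(r)) over (0, upper]."""
    h = upper / n
    return h * sum(math.sqrt(math.log(M((j + 0.5) * h))) for j in range(n))

# Hypothetical packing bound for [0, 1] under |s - t|: at most 2/r points when r <= 1.
value = entropy_integral(lambda r: 2.0 / r, upper=1.0)
print(f"integral of sqrt(log(2/r)) over (0, 1] is roughly {value:.3f}")
```

The integrand blows up at $0$ only like $\sqrt{\log(1/r)}$, so the integral is finite; polynomial growth of the packing numbers is always harmless in this bound.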


Lemma <12> captures the main idea for chaining with norms. The next
Theorem uses the example of oscillation control to show why the uniform
approximation of $\{X_s : s \in S_m\}$ by $\{X_s : s \in S_0\}$ is such a powerful tool.

<14> Theorem. Suppose $\{X_t : t \in T\}$ is a stochastic process whose increments
satisfy <11>, that is, $\rho\bigl(\max_{i\le N}|X_{s_i} - X_{t_i}|/d(s_i, t_i)\bigr) \le H(N)$ for all finite
sets of increments. Suppose also that
$$\int_0^DH(\mathrm{pack}(r, T, d))\,dr < \infty \quad\text{where } D := \mathrm{diam}(T).$$
Then for each $\epsilon > 0$ there exists a $\delta > 0$ for which $\rho(\mathrm{osc}(\delta, X, S)) < \epsilon$ for
every finite subset $S$ of $T$.


Proof Choose $\delta_0$ so that
<15>  $$4\int_0^{\delta_0}H(\mathrm{pack}(r, T, d))\,dr < \epsilon/5.$$
Chain with $\delta_i = \delta_0/2^i$, finding a sequence of $\delta_i$-separated subsets $S_i$ for
which $S_0 \subseteq \dots \subseteq S_m = S$ with
$$N_i = \#S_i \le \mathrm{pack}(\delta_i, S, d) \le \mathrm{pack}(\delta_i, T, d).$$
Define $\Delta := \max_{t\in S_m}|X(t) - X(L_0t)|$. Lemma <12> gives $\rho(\Delta) < \epsilon/5$.
The value $N_0 = \#S_0$, which only depends on $\delta_0$, is now fixed.
There might be many pairs $s, t$ in $S_m$ for which $d(s,t) < \delta$, but they
correspond to at most $N_0^2$ pairs $L_0s, L_0t$ in $S_0$ for which
<16>  $$d(L_0s, L_0t) \le \sum_{i=0}^{m-1}d(L_{i+1}s, L_is) + \delta + \sum_{i=0}^{m-1}d(L_{i+1}t, L_it) \le 4\delta_0 + \delta.$$
For the subgaussian case, where $H(N)$ grows like $\sqrt{\log N}$, we could
afford to invoke a finite maximal inequality to control the contributions from
the pairs in $S_0$ by a constant multiple of $H(N_0^2)(4\delta_0 + \delta)$, which could be
controlled by <15> if $\delta = \delta_0$ because $H(N_0^2) \le \text{constant} \times H(N_0)$. Without
such behavior for $H$ we would need something stronger than <15>.
For general $H(\cdot)$ it pays to make a few more trips up and down the
chains, using a clever idea of Ledoux and Talagrand (1991, Theorem 11.6).

[Figure: two equivalence classes $E$ and $F$ in $S_m$. The points $s$ and $t_{E,F}$ lie in $E$ and share the image $L_0s = L_0t_{E,F}$ in $S_0$; the points $t_{F,E}$ and $t$ lie in $F$ and share the image $L_0t = L_0t_{F,E}$.]

The map $L_0$ defines an equivalence relation on $S_m$, with $t \sim t'$ if and only
if $L_0t = L_0t'$. The corresponding equivalence classes define a partition $\pi_m$
of $S_m$ into at most $N_0$ subsets. For each distinct pair $E, F$ from $\pi_m$ choose
points $t_{E,F} \in E$ and $t_{F,E} \in F$ such that
$$d(t_{E,F}, t_{F,E}) = d(E, F) = \min\{d(s,t) : s \in E, t \in F\}$$
then define
$$\Lambda := \max\Bigl\{\frac{|X(t_{E,F}) - X(t_{F,E})|}{d(E, F)} : E, F \in \pi_m,\ E \ne F\Bigr\}.$$
Assumption <11> implies $\rho(\Lambda) \le H(N_0^2)$.
Remark. The definition of $\Lambda$ might be awkward if $d$ were a semi-metric
and $d(E, F) = 0$ for some pair $E \ne F$. By working with equivalence
classes we avoid such awkwardness.

Now consider any pair $s, t \in S_m$ for which $d(s,t) < \delta$. If $s \sim t$, so that
$L_0s = L_0t$, we have
$$|X(s) - X(t)| \le |X(s) - X(L_0s)| + 0 + |X(L_0t) - X(t)| \le 2\Delta.$$
If $s$ and $t$ belong to different sets, $E$ and $F$, from $\pi_m$ then
$$|X(s) - X(t_{E,F})| \le 2\Delta \quad\text{and}\quad |X(t) - X(t_{F,E})| \le 2\Delta.$$
We also have $d(E, F) \le d(s,t) < \delta$, so that $|X(t_{E,F}) - X(t_{F,E})| \le \delta\Lambda$ and
$|X(s) - X(t)|$ is bounded by
$$|X(s) - X(t_{E,F})| + |X(t_{E,F}) - X(t_{F,E})| + |X(t_{F,E}) - X(t)| \le 4\Delta + \delta\Lambda.$$
The upper bound does not depend on the particular $s, t$ pair for which
$d(s,t) < \delta$. It follows that
$$\rho(\mathrm{osc}(\delta, X, S)) < 4\epsilon/5 + \delta H(N_0^2) < \epsilon \quad\text{for } \delta \text{ small enough.}$$
The upper bound has no hidden dependence on the size of $S$.

The final Example shows that even very weak control over the increments
of a process can lead to powerful inequalities if the index set has
small packing numbers. The Example makes use of some properties of the
Hellinger distance $h(P, Q)$ between two probability measures $P$ and $Q$,
defined by
$$h^2(P, Q) = \mu\Bigl(\sqrt{\frac{dP}{d\mu}} - \sqrt{\frac{dQ}{d\mu}}\Bigr)^2 = 2 - 2\mu\sqrt{\frac{dP}{d\mu}\,\frac{dQ}{d\mu}}$$
where $\mu$ is any measure dominating both $P$ and $Q$. For product measures
the distance reduces to a simple expression involving the marginals,
<17>  $$h^2(P^n, Q^n) = 2 - 2\bigl(1 - \tfrac12h^2(P, Q)\bigr)^n \le n\,h^2(P, Q).$$
See, for example, Pollard (2001, Problem 4.18).
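The product formula <17> can be verified by brute force for small discrete examples. The sketch assumes two hypothetical Bernoulli distributions, with counting measure playing the role of $\mu$, and enumerates the $n$-fold product measure explicitly. The helper names are illustrative.

```python
import itertools
import math

def hellinger_sq(p, q):
    """Squared Hellinger distance between discrete distributions (dicts outcome -> prob)."""
    support = set(p) | set(q)
    return sum((math.sqrt(p.get(x, 0.0)) - math.sqrt(q.get(x, 0.0))) ** 2 for x in support)

def product_measure(p, n):
    """n-fold product of a discrete distribution, keyed by tuples of outcomes."""
    return {xs: math.prod(p[x] for x in xs) for xs in itertools.product(p, repeat=n)}

P = {0: 0.7, 1: 0.3}
Q = {0: 0.6, 1: 0.4}
n = 5
h2 = hellinger_sq(P, Q)
direct = hellinger_sq(product_measure(P, n), product_measure(Q, n))
closed_form = 2 - 2 * (1 - h2 / 2) ** n
print(f"direct {direct:.6f}, closed form {closed_form:.6f}, n * h^2 = {n * h2:.6f}")
```

The closed form follows because the Hellinger affinity $1 - \tfrac12h^2$ factorizes over independent coordinates.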


<18> Example. Let $\{P_\theta : \theta \in [0, 1]\}$ be a family of probability measures having
densities $p(x, \theta)$ with respect to a fixed measure $\mu$ on some set $\mathcal{X}$. Let
$P_{\theta,n} = P_\theta \otimes \dots \otimes P_\theta$ and $\mu_n = \mu \otimes \dots \otimes \mu$, both $n$-fold product measures.
For a fixed $\theta_0$, abbreviate $P_{\theta_0,n}$ to $P$. Under $P_{\theta_0,n}$ the coordinates of the
typical point $x = (x_1, \dots, x_n)$ in $\mathcal{X}^n$ become identically distributed random
variables, each with distribution $P_{\theta_0}$.
To avoid distracting measurability issues, I suppose $\theta \mapsto p(x, \theta)$ is continuous.
I also ignore irrelevant questions regarding uniqueness and measurability
of the maximum likelihood estimator (MLE) $\hat\theta_n = \hat\theta_n(x_1, \dots, x_n)$,
which by definition maximizes the likelihood process $L_n(\theta) := \prod_{i\le n}p(x_i, \theta)$.
Suppose there exist positive constants $C_1$ and $C_2$ for which

<19>  $$C_1|\theta - \theta'| \le h(P_\theta, P_{\theta'}) \le C_2|\theta - \theta'| \quad\text{for all } \theta, \theta' \in [0, 1].$$

Remark. Notice that such an inequality can hold only for a bounded
parameter set, because $h$ is a bounded metric. Typically it would
follow from a slight strengthening of a Hellinger differentiability
condition. In a sense, the right metric to use is $h$. The upper bound
in Assumption <19> ensures that the packing numbers under $h$
behave like packing numbers under ordinary Euclidean distance. The
lower bound ensures that $P\sqrt{L_n(t)}$ decays rapidly as $t$ increases; see
inequality <22> below.

Then a chaining argument will show that, for each $\theta_0$ in $[0, 1]$, the estimator
$\hat\theta_n$ converges at an $n^{-1/2}$-rate to $\theta_0$. More precisely, there exist
constants $C_3$ and $C_4$ for which

<20>  $$P_{\theta_0,n}\{\sqrt{n}\,|\hat\theta_n - \theta_0| \ge y\} \le C_3\exp(-C_4y^2) \quad\text{for all } y \ge 0 \text{ and all } n.$$

The MLE also maximizes the square root of the likelihood process. The
standardized estimator $\hat t_n = \sqrt{n}(\hat\theta_n - \theta_0)$ maximizes, over the interval
$T_n = \{t \in \mathbb{R} : \theta_0 + t/\sqrt{n} \in [0, 1]\}$, the process
$$Z_n(t) = \sqrt{\frac{L_n(\theta_0 + t/\sqrt{n})}{L_n(\theta_0)}} = \prod_{i\le n}\xi(x_i, t) \quad\text{where } \xi(z, t) := \sqrt{p(z, \theta_0 + t/\sqrt{n})/p(z, \theta_0)}.$$

Remark. For a more careful treatment, which has no significant effect
on the final bound, one should include an indicator function $\{z \in S\}$,
where $S = \{p(\cdot, \theta_0) > 0\}$, into the definition of $\xi$. And a few equalities
should actually be inequalities.

Inequality <20> will come from the fact that $Z_n(\hat t_n) \ge Z_n(0) = 1$, which
implies for each $y_0 \ge 0$ that
$$P_{\theta_0,n}\{|\hat t_n| \ge y_0\} \le P_{\theta_0,n}\{\sup_{|t|\ge y_0}Z_n(t) \ge 1\} \le P_{\theta_0,n}\sup_{|t|\ge y_0}Z_n(t).$$
The last expected value involves a supremum over an uncountable set, which
sample path continuity allows us to calculate as a limit over maxima over
finite sets.
Let me show how Lemma <12> handles half the range, $\sup_{t\ge y_0}Z_n(t)$.
The argument for the other half is analogous.
To simplify notation, abbreviate $P_{\theta_0,n}$ to $P$, with $\theta_0$ fixed for the rest
of the Example. Split the set $\{t \in T_n : t \ge y_0\}$ into a union of intervals
$J_k := [y_k, y_{k+1}) \cap T_n$, where $y_k = y_0 + k$ for $k \in \mathbb{N}$. Then
<21>  $$P\sup_{t\ge y_0}Z_n(t) \le \sum_{k\ge0}P\sup_{t\in J_k}Z_n(t).$$

Remark. This technique is sometimes called stratification.

Inequality <19> provides the means for bounding the $k$th term in the
sum <21>. The lower bound from <19> gives us some control over the
expected value of $Z_n$:
$$PZ_n(t) \le \mu_n\prod_{i\le n}\sqrt{p(x_i, \theta_0 + t/\sqrt{n})\,p(x_i, \theta_0)}
= \bigl(1 - \tfrac12h^2(P_{\theta_0}, P_{\theta_0+t/\sqrt{n}})\bigr)^n$$
<22>  $$\le \exp(-\tfrac12C_1^2t^2)$$
for each $t$ in $T_n$. The upper bound in <19> gives some control over the
increments of the $Z_n$ process: for $t_1, t_2 \in T_n$,
$$P|Z_n(t_1) - Z_n(t_2)|^2 = \mu_n\bigl|\sqrt{L_n(\theta_1)} - \sqrt{L_n(\theta_2)}\bigr|^2 \quad\text{where } \theta_i = \theta_0 + t_i/\sqrt{n}$$
$$= h^2(P_{\theta_1}^n, P_{\theta_2}^n) \le C_2^2|t_1 - t_2|^2 \quad\text{by <17> and <19>.}$$
By Theorem <10> part (i), it then follows that
<23>  $$P\max_{i\le N}\frac{|Z_n(s_i) - Z_n(t_i)|}{|s_i - t_i|} \le H(N) := C_2\sqrt{N}$$
for all finite sets of increments.


Consider the $k$th summand in <21>. With $d$ as the usual Euclidean
metric, $J_k$ has diameter at most 1 and $\mathrm{pack}(r, J_k, d) \le 1/r$ for $r \le 1$. For
a $\delta_0$ that needs to be specified, invoke Lemma <12> then pass to the limit
as $S_m$ expands up to a countable dense subset to get
$$P\sup_{t\in J_k}Z_n(t) \le P\max_{t\in S_0}Z_n(t) + 4C_2\int_0^{\delta_1}\sqrt{1/r}\,dr$$
$$\le \sum_{t\in S_0}e^{-\frac12C_1^2t^2} + 4C_2\cdot2\delta_1^{1/2}$$
$$\le (1/\delta_0)\,e^{-\frac12C_1^2y_k^2} + 8C_2\,\delta_0^{1/2}.$$
The last sum is approximately minimized by choosing $\delta_0 = e^{-\frac13C_1^2y_k^2}$, which
leads to
$$P\sup_{t\ge y_0}Z_n(t) \le \sum_{k\ge0}(1 + 8C_2)\exp(-C_1^2(y_0 + k)^2/6).$$
After some fiddling around with constants, the bound <20> emerges.
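The phrase "approximately minimized" can be made concrete with a short calculation. The sketch below writes $a$ for $C_1^2y_k^2$ and compares the bound $(1/\delta_0)e^{-a/2} + 8C_2\delta_0^{1/2}$ at the choice $\delta_0 = e^{-a/3}$ with its value at the exact stationary point, found by setting the derivative in $\delta_0$ to zero. The function names and the numerical values of $a$ and $C_2$ are hypothetical.

```python
import math

def bound(delta0, a, C2):
    """(1/delta0) * exp(-a/2) + 8 * C2 * sqrt(delta0), with a standing for C1^2 * y_k^2."""
    return math.exp(-a / 2) / delta0 + 8 * C2 * math.sqrt(delta0)

def exact_minimizer(a, C2):
    """Stationary point of bound(): delta0^(3/2) = exp(-a/2) / (4 * C2)."""
    return (math.exp(-a / 2) / (4 * C2)) ** (2 / 3)

C2 = 1.0
for a in (1.0, 4.0, 9.0):
    chosen = math.exp(-a / 3)
    at_chosen = bound(chosen, a, C2)
    at_best = bound(exact_minimizer(a, C2), a, C2)
    print(f"a = {a}: chosen {at_chosen:.4f}, optimal {at_best:.4f}, ratio {at_chosen / at_best:.3f}")
```

Both values are constant multiples of $e^{-a/6}$, so the simpler choice of $\delta_0$ changes only the constant in front, not the exponential rate.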


4.6 An alternative to packing: majorizing measures


The modern theory owes much to the work of Dudley (1973), who applied
chaining arguments to establish fine control over the sample paths of the
so-called isonormal Gaussian process, and Dudley (1978), who initiated the
modern theory of empirical processes by adapting the Gaussian methods to
processes more general than the classical empirical distribution function. In
both cases he constrained the complexity of the index set by using bounds
on the covering numbers $N(\delta) = N_T(\delta, T, d)$. For example, for a Gaussian
process he proved existence of a version with continuous sample paths under
the assumption that
$$\int_0^D\sqrt{\log N(r)}\,dr < \infty \quad\text{where } D = \mathrm{diam}(T), \text{ the diameter of } T.$$
Equivalently, but more suggestively, the last inequality can be written as
<24>  $$\int_0^D\Psi_2^{-1}(N_T(r))\,dr < \infty \quad\text{where } \Psi_2(x) = \exp(x^2) - 1,$$
which highlights the role of the $\Psi_2$-Orlicz norm in controlling the increments
of Gaussian processes.

Work by Fernique and Talagrand showed that assumption <24> can be
weakened slightly to get improved bounds, which in some cases are optimal
up to multiplicative constants. As originally formulated, the improvements
assumed existence of a majorizing measure for a Young function $\Psi$ ($= \Psi_2$
for the Gaussian case): a probability measure $\mu$ on the Borel sigma-field of $T$
for which
<25>  $$\sup_{t\in T}\int_0^D\Psi^{-1}\Bigl(\frac{1}{\mu B[t, r]}\Bigr)\,dr < \infty,$$
where $B[t, r]$ denotes the closed ball of radius $r$ and center $t$.

Remark. When I first encountered <25> I was puzzled by how a
single $\mu$ could control the sample paths of a stochastic process. It
seemed not at all obvious that $\mu$ could be used to replace the familiar
constructions with covering numbers. It helped me to dig inside some
of the proofs to find that the key fact is: if $B_1, \dots, B_n$ are disjoint Borel
sets with $\mu B_i \ge \epsilon$ for each $i$ then $n \le 1/\epsilon$. The same idea appears in
the very classical setting of Example <9>, where Lebesgue measure
on $\mathbb{R}^k$ plays a similar role in bounding covering numbers.

Fernique (1975) proved many results about the sample paths of a centered
Gaussian process $X$. For example, he proved (his Section 6) a result
stronger than: existence of a majorizing measure implies that

<26>  $$\sup_{t\in T}|X(\omega, t)| < \infty \quad\text{for almost all } \omega.$$

Slight strengthenings of <25> also imply existence of a version of $X$ with
almost all sample paths uniformly continuous.
Talagrand (1987) brilliantly demonstrated the central role of majorizing
measures by showing that a condition like <26> implies existence of some
probability $\mu$ for which <25> holds with $d(s,t) = \|X_s - X_t\|_2$. He subsequently
showed (Talagrand, 2001) that the majorizing measures work their
magic by generating a nested sequence $\{\pi_i\}$ of finite partitions of $T$, which
can be used for chaining arguments analogous to those used by Dudley.
(See subsection 4.7.2 and Chapter 7 for details.) Each $\pi_{i+1}$ is a refinement
of the previous $\pi_i$. Each point $t$ of $T$ belongs to exactly one member, denoted
by $E_i(t)$, of $\pi_i$. Talagrand allowed the size $\#\pi_i$ to grow very rapidly.
For Young functions $\Psi_\alpha$ that grow like $\exp(x^\alpha)$ for some $\alpha > 0$, he showed
that <25> is equivalent to the existence of partitions $\pi_i$ with $\#\pi_i \le n_i := 2^{2^i}$
for which
<27>  $$\sup_{t\in T}\sum_{i\in\mathbb{N}}\Psi_\alpha^{-1}(n_i)\,\mathrm{diam}(E_i(t)) < \infty.$$

More precisely, he showed that the ratio of
the infimum of <27> over all partitions $\pi_i$ with $\#\pi_i \le n_i$
to
the infimum of the integral in <25> over all majorizing measures
is bounded above and below by positive constants that depend only on $\alpha$.

Remark. The subtle point, which distinguishes <27> from the analogous constructions based on covering numbers, is that the uniform bound on the sum does not come via a bound on $\max_{E\in\pi_i} \mathrm{diam}(E)$.
As Talagrand (2005, bottom of page 13) has pointed out, such
uniform control of the diameters is not the best way to proceed, because
it tacitly assumes some sort of homogeneity of complexity over different
subregions of T . By contrast, the partitions constructed from majorizing
measures are much more adaptive to local complexities, subdividing
more finely in regions where the stochastic process fluctuates more
wildly.

In more recent work Talagrand established analogous equivalences between majorizing measures and partitions for nongaussian processes. He has
even declared (Talagrand, 2005, Section 1.4) that the measures themselves
are now totally obsolete, with everything one needs to know coming from
the nested partitions.
Despite the convincing case made by Talagrand for majorizing measures
without measures, I still think there might be a role for the measures them-
selves in some statistical applications. And in any case, it does help to
know something about how an idea evolved when struggling with its fancier
modern refinements.

4.7 Chaining with tail probabilities


Start once again from a sequence of finite subsets $S_0 \subseteq S_1 \subseteq \dots \subseteq S_m$, with $\#S_i = N_i$ and the chains defined as in Section 4.1. Again we have a decomposition for each $t$ in $S_m$,

$$X(t) - X(t_0) = \sum_{i=0}^{m-1} \bigl(X(t_{i+1}) - X(t_i)\bigr) \qquad\text{with } t_i = L_i(t).$$

And again assume we have increment control via tail probabilities,

$$\mathbb{P}\{|X_s - X_t| \ge \eta\, d(s,t)\} \le \beta(\eta) \qquad\text{for all } s, t \in T,$$

with $\beta$ a decreasing function.

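By analogy with the approach followed in Section 4.5 define

$$M_{i+1} := \max_{s\in S_{i+1}} \frac{|X(s) - X(\ell_i s)|}{\lambda_i(s)} \qquad\text{where } \lambda_i(s) := d(s, \ell_i s),$$

with the convention that any terms for which $\lambda_i(s) = 0$ are omitted from the max; it matters only that $|X(s) - X(\ell_i s)| \le M_{i+1}\lambda_i(s)$ for all $s \in S_{i+1}$. For $i \ge 0$,

$$\mathbb{P}\{M_{i+1} \ge \eta_i\} \le \sum_{s\in S_{i+1}} \mathbb{P}\{|X(s) - X(\ell_i s)| \ge \eta_i \lambda_i(s)\} \le N_{i+1}\beta(\eta_i).$$

The event $B := \{\exists\, i < m : M_{i+1} \ge \eta_i\}$ has probability at most $\sum_{i=0}^{m-1} N_{i+1}\beta(\eta_i)$. On the event $B^c$ we have

$$\Delta_m(t) := |X(t) - X(L_0 t)| \le \sum_{i=0}^{m-1} \eta_i \lambda_i(t) \qquad\text{for each } t \in S_m,$$

where, for $t$ in $S_m$, the link length $\lambda_i(t)$ is evaluated at the point $L_{i+1}(t)$ of the chain. Put another way,

<28>  $$\mathbb{P}\Bigl\{\exists\, t \in S_m : \Delta_m(t) > \sum_{i=0}^{m-1} \eta_i \lambda_i(t)\Bigr\} \le \sum_{i=0}^{m-1} N_{i+1}\beta(\eta_i).$$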

Remark. We could replace $N_{i+1}\beta(\eta_i)$, which is an upper bound on the probability of a union, by $\min(1, N_{i+1}\beta(\eta_i))$. Of course such a
modification typically gains us nothing because <28> is trivial if even
one of the summands on the right-hand side equals 1. Nevertheless, the
modification does eliminate some pesky trivial cases from the following
arguments.

Now comes the clever part: How should we choose the $S_i$'s and the $\eta_i$'s? I do not know any systematic way to handle that question but I do know some heuristics that have proven useful.
Let me attempt to explain for the specific case of $\Psi_\alpha$-norms, for $\alpha > 0$, with

<29>  $$\beta(\eta) = \min\bigl(1, 1/\Psi_\alpha(\eta)\bigr) \le C_\alpha \exp(-\eta^\alpha),$$

first for the traditional approach based on packing numbers then for the majorizing measure alternative.

Remark. You should not be misled by what follows into believing that
all chaining arguments with tail probabilities involve a lot of tedious
fiddling. I present some of the messy details just because so few authors
seem to explain the reasons for their clever choices of constants.

4.7.1 Tail chaining via packing


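Choose the $S_i$ as $\delta_i$-separated subsets with, for example, $\delta_i = \delta_0/2^i$. That ensures $N_i \le \mathrm{pack}(\delta_i, T, d)$ and $\max_t \lambda_i(t) \le \delta_i$. Inequality <28> becomes

<30>  $$\mathbb{P}\Bigl\{\exists\, t \in S_m : \Delta_m(t) > \sum_{i=0}^{m-1} \eta_i\delta_i\Bigr\} \le \sum_{i=0}^{m-1} N_{i+1}\beta(\eta_i),$$

that is

$$\mathbb{P}\Bigl\{\max_{t\in S_m} |X(t) - X(L_0 t)| > \sum_{i=0}^{m-1} \eta_i\delta_i\Bigr\} \le \sum_{i=0}^{m-1} N_{i+1}\beta(\eta_i).$$

We could choose the $\{\eta_i\}$ to control $\sum_i \eta_i\delta_i$, then hope for a useful $\sum_i N_{i+1}\beta(\eta_i)$, or to control $\sum_i N_{i+1}\beta(\eta_i)$, hoping for a useful $\sum_i \eta_i\delta_i$.
Here are some heuristics for $\beta$ as in <29>.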
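To make the right-hand side of <30> small we should expect $\sum_{i=0}^{m-1} \eta_i\delta_i$ to be larger than $\mathbb{P}\max_t \Delta_m(t)$, which an analog of Lemma <12> with $H = C\Psi_\alpha^{-1}$ bounds by a constant multiple of $\sum_{i=0}^{m-1} \Psi_\alpha^{-1}(N_{i+1})\delta_i$. That suggests we should make $\eta_i$ a bit bigger than $\Psi_\alpha^{-1}(N_{i+1})$. Try

$$\eta_i = \Psi_\alpha^{-1}\bigl(N_{i+1}\Psi_\alpha(y_i)\bigr) \le c_\alpha\bigl(\Psi_\alpha^{-1}(N_{i+1}) + y_i\bigr) \qquad\text{for some } y_i > 0.$$

I write the extra factor in that strange way because it gives a clean bound,

$$\min\bigl(1, N_{i+1}\beta(\eta_i)\bigr) = \min\bigl(1, N_{i+1}/\Psi_\alpha(\eta_i)\bigr) \le \min\bigl(1, 1/\Psi_\alpha(y_i)\bigr) \le C_\alpha e^{-y_i^\alpha}.$$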
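
The choice of the threshold as the inverse Young function applied to N times the Young function of y can be checked numerically. The following is a minimal sketch, assuming the concrete form Psi_alpha(x) = exp(x^alpha) - 1 with alpha >= 1 (the text gives this form only for alpha = 2); the function names `psi`, `psi_inv`, `eta_choice` and `union_term` are hypothetical, introduced only for illustration.

```python
import math

def psi(x, alpha=2.0):
    """Assumed Young function Psi_alpha(x) = exp(x^alpha) - 1, for alpha >= 1."""
    return math.expm1(x ** alpha)

def psi_inv(y, alpha=2.0):
    """Inverse of psi on [0, infinity): (log(1 + y))^(1/alpha)."""
    return math.log1p(y) ** (1.0 / alpha)

def eta_choice(n_next, y, alpha=2.0):
    """Threshold eta_i = Psi^{-1}(N_{i+1} * Psi(y_i)) suggested in the text."""
    return psi_inv(n_next * psi(y, alpha), alpha)

def union_term(n_next, y, alpha=2.0):
    """Truncated union-bound term min(1, N_{i+1} / Psi(eta_i))."""
    return min(1.0, n_next / psi(eta_choice(n_next, y, alpha), alpha))
```

With this assumed form, the factor N cancels exactly, so `union_term(n, y)` equals min(1, 1/psi(y)) up to rounding, and for alpha >= 1 the threshold is no larger than `psi_inv(n) + y`.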
Inequality <30> then implies

<31>  $$\mathbb{P}\Bigl\{\max_t \Delta_m(t) > c_\alpha \sum_{i=0}^{m-1} \bigl(\Psi_\alpha^{-1}(N_{i+1}) + y_i\bigr)\delta_i\Bigr\} \le C_\alpha \sum_{i=0}^{m-1} e^{-y_i^\alpha}.$$

Now we need to choose a $\{y_i\}$ sequence to make the sums tractable, then maybe bound sums by integrals to get neater expressions, and so on. If you find such games amusing, look at Problem [2], which guides you to a more elegant bound. For the special case of subgaussian increments it gives existence of positive constants $K_1$, $K_2$, and $c$ for which

$$\mathbb{P}\Bigl\{\max_t \Delta_m(t) > K_1 \int_0^{\delta_0} \Bigl(y + \sqrt{\log \mathrm{pack}(r, T, d)} + \sqrt{\log(1/r)}\Bigr)\, dr\Bigr\} \le K_2 e^{-cy^2}.$$
Remark. The integral of $\sqrt{\log \mathrm{pack}(r, T, d)}$ is a surrogate for $\mathbb{P}\max_t \Delta_m(t)$. The presence of a factor $K_1$, which might be much larger than 1, disappoints. With a lot more effort, and sharper techniques, one can get bounds for $\mathbb{P}\{\max_t \Delta_m(t) \ge \mathbb{P}\max_t \Delta_m(t) + \dots\}$. See, for example, Massart (2000), who showed how tensorization methods (see Chapter 12) can be used to rederive a concentration inequality due to Talagrand (1996).

4.7.2 Tail chaining via majorizing measures


The calculations depend on two simplifying facts about $\Psi_\alpha$:

(i) $\Psi_\alpha(x) \le \exp(x^\alpha)$ for all $x > 0$, so $\Psi_\alpha^{-1}(y) \ge \log^{1/\alpha}(y)$ for all $y > 0$;

(ii) there exists an $x_0$ for which $\Psi_\alpha(x) \ge \tfrac12 \exp(x^\alpha)$ for all $x \ge x_0$, so there exists a $y_0$ for which $\Psi_\alpha^{-1}(y) \le \log^{1/\alpha}(2y)$ for all $y \ge y_0$.

As mentioned in Section 4.6, the majorizing measure can be used to construct a nested sequence of partitions $\pi_i$ of the index set $T$ with $\#\pi_i \le N_i := 2^{2^i}$ such that

$$K_X := \sup_{t\in T} \sum_{i\in\mathbb{N}} \Psi_\alpha^{-1}(N_{i+1})\,\mathrm{diam}(E_i(t)) < \infty$$

where $E_i(t)$ denotes the unique member of $\pi_i$ that contains the index point $t$.

Remark. If you check back you might notice that originally the $i$th summand contained $\Psi_\alpha^{-1}(N_i)$. The change has only a trivial effect on the sum because $\Psi_\alpha^{-1}(y)$ grows like $\log^{1/\alpha}(y)$.

For each $E$ in $\pi_i$ select a point $t_E \in E$ then define $S_i = \{t_E : E \in \pi_i\}$. Clearly we have $\#S_i = \#\pi_i$. With a little care we can also ensure $S_i \subseteq S_{i+1}$: if $F \in \pi_i$ and $t_F \in E \in \pi_{i+1}$ then make sure to choose $t_E = t_F$. The link function $\ell_i$ maps each $t_E$ in $S_{i+1}$ onto the $t_F$ for which $E \subseteq F \in \pi_i$.

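[Figure: the nested partitions $\pi_0, \pi_1, \pi_2, \pi_3$ of $T$ together with the corresponding sets of selected points $S_0, S_1, S_2, S_3$.]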

Remark. We could also ensure, for any given finite subset $S$ of $T$, that $S_0 \subseteq S_1 \subseteq \dots \subseteq S_m = S$, for some $m$.

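For each $t$ in $S_m$ the chain $t \mapsto L_{m-1}(t) \mapsto \dots \mapsto L_0(t)$ then follows a path through the corresponding sets $E_m(t) \subseteq E_{m-1}(t) \subseteq \dots \subseteq E_0(t)$ from the partitions. The diameter of the set $E_i(t)$ provides an upper bound for the link length, $\lambda_i(t) = d(L_{i+1}(t), L_i(t)) \le \mathrm{diam}(E_i(t))$. If we choose $\eta_i = y\Psi_\alpha^{-1}(N_{i+1})$ with $y$ large enough then

$$\Psi_\alpha(\eta_i) \ge \Psi_\alpha\bigl(y \log^{1/\alpha}(N_{i+1})\bigr) \ge \tfrac12 \exp\bigl(y^\alpha \log N_{i+1}\bigr),$$

so that inequality <28> implies

$$\begin{aligned}
\mathbb{P}\{\exists\, t \in S_m : \Delta_m(t) > yK_X\}
 &\le \sum_{i=0}^{m-1} N_{i+1}/\Psi_\alpha(\eta_i) \\
 &\le \sum_{i=0}^{m-1} 2\exp\bigl(\log N_{i+1} - y^\alpha \log N_{i+1}\bigr) \\
 &\le k_\alpha \exp(-y^\alpha/2) \qquad\text{for some constant } k_\alpha, \text{ if } y^\alpha \ge 2.
\end{aligned}$$

With a suitable increase in $k_\alpha$ the final inequality holds for all $y \ge 0$. That is,

$$\mathbb{P}\Bigl\{\max_t |X(t) - X(L_0 t)| > yK_X\Bigr\} \le k_\alpha e^{-y^\alpha/2} \qquad\text{for all } y \ge 0.$$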
Remark. When integrated $\int_0^\infty \dots\, dy$, the last inequality implies that $\mathbb{P}\max_{t\in S} X_t$ is bounded above, uniformly over all finite subsets $S$ of $T$, by a constant times $K_X$. See Chapter 7 for a converse when $X$ is a centered Gaussian process.
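
The last step of the tail bound, summing the terms 2 exp(log N - u log N) over the partition sizes N = 2^(2^(i+1)), where u stands for y raised to the power alpha, can be explored numerically. This is only an illustrative sketch: the function name `tail_sum` and the truncation at twelve terms are assumptions made here, and the comparison constant 2 is merely what the experiment suggests for the values tried, not a claim about the best constant.

```python
import math

def tail_sum(u, terms=12):
    """Sum over i of 2*exp((1 - u) * log N_{i+1}) with N_i = 2^(2^i).

    Here u plays the role of y^alpha.  The summands decay doubly
    exponentially, so a dozen terms are ample for u >= 2.
    """
    total = 0.0
    for i in range(terms):
        log_n_next = (2 ** (i + 1)) * math.log(2.0)   # log N_{i+1}
        total += 2.0 * math.exp((1.0 - u) * log_n_next)
    return total

# Compare with 2*exp(-u/2) for a few values of u >= 2.
comparison = {u: (tail_sum(u), 2.0 * math.exp(-u / 2.0)) for u in (2.0, 3.0, 5.0, 10.0)}
```

For each tabulated value the sum sits below 2 exp(-u/2), consistent with the claimed bound of a constant multiple of exp(-u/2) when u is at least 2.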

4.8 Brownian Motion, for example


This Section provides one way of comparing the merits of the various chaining methods described in Sections 4.5 and 4.7, using the Brownian motion process $\{X_t : 0 \le t \le 1\}$ as a test case.
Remember that $X$ is a stochastic process for which
(i) $X_0 = 0$
(ii) each increment $X_s - X_t$ has a $N(0, |s - t|)$ distribution
(iii) for all $0 \le s_1 < t_1 \le s_2 < t_2 \le \dots \le s_n < t_n \le 1$ the increments $X(s_i) - X(t_i)$ are independent
(iv) each sample path $X(\cdot, \omega)$ is a continuous function on $[0, 1]$.
For current purposes it suffices to know that

<32>  $$\mathbb{P}\{|X_s - X_t| \ge \eta\sqrt{|s - t|}\} \le 2e^{-\eta^2/2} \qquad\text{for } \eta \ge 0$$

<33>  $$\|X_s - X_t\|_{\Psi_2} \le c_0\sqrt{|s - t|} \qquad\text{where } c_0 = \sqrt{8/3}.$$
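
The value of the constant in the Orlicz-norm bound for the increments can be recovered from a standard Gaussian calculation. A sketch, assuming the usual definition of the Orlicz norm as the smallest $c$ for which $\mathbb{P}\Psi_2(|Y|/c) \le 1$, with $\Psi_2(x) = \exp(x^2) - 1$ and $Z$ distributed $N(0,1)$:

```latex
\[
\mathbb{P}\exp\bigl(Z^2/c^2\bigr)
  = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}
      \exp\Bigl(-\tfrac12 z^2\bigl(1 - 2/c^2\bigr)\Bigr)\,dz
  = \bigl(1 - 2/c^2\bigr)^{-1/2}
  \qquad\text{for } c^2 > 2 ,
\]
\[
\mathbb{P}\Psi_2(|Z|/c) \le 1
  \iff \bigl(1 - 2/c^2\bigr)^{-1/2} \le 2
  \iff c^2 \ge 8/3 .
\]
```

Because $X_s - X_t$ is distributed as $\sqrt{|s-t|}\,Z$, rescaling gives the bound with the constant $\sqrt{8/3}$.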

Of course $\Psi_2(x) = \exp(x^2) - 1$, as before.

The oscillation (also known as the modulus of continuity) is defined by

$$\mathrm{osc}_1(\delta) := \sup\{|X_s - X_t| : |s - t| < \delta\}.$$

I have added the subscript 1 as a reminder that the oscillation is calculated for the usual metric $d_1(s, t) := |s - t|$ on $[0, 1]$. Inequalities <32> and <33> suggest that it would be cleaner to work with the metric $d_2(s, t) := \sqrt{|s - t|}$. Note that

$$\mathrm{osc}_2(\delta) := \sup\{|X_s - X_t| : d_2(s, t) < \delta\} = \mathrm{osc}_1(\delta^2).$$

Remark. Note that $d_2(s, t)$ is just the $L^2(\mathrm{Leb})$ distance between the indicator functions of the intervals $[0, s]$ and $[0, t]$.

A beautiful result of Lévy asserts that

<34>  $$\lim_{\delta\to 0} \frac{\mathrm{osc}_1(\delta)}{h_1(\delta)} = 1 \quad\text{almost surely, where } h_1(\delta) = \sqrt{2\delta\log(1/\delta)}.$$

See McKean (1969, Section 1.6) for a detailed proof (which is similar in spirit to the tail chaining described in subsection 4.8.3) of Lévy's theorem. Under the $d_2$ metric the result becomes

$$\lim_{\delta\to 0} \frac{\mathrm{osc}_2(\delta)}{h_2(\delta)} = 1 \quad\text{almost surely, where } h_2(\delta) = h_1(\delta^2) = 2\delta\sqrt{\log(1/\delta)}.$$

To avoid too much detail I will settle for something less than <34>, namely upper bounds that recover the $O(h_2(\delta))$ behavior. For reasons that should soon be clear, it simplifies matters to replace the function $h_2$ by the increasing function (see Problem [5])

$$h(\delta) := \delta\,\Psi_2^{-1}(1/\delta^2) = \delta\sqrt{\log(1 + \delta^{-2})}.$$
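
The behavior of the modulus function along the dyadic scales can be checked numerically. The sketch below assumes the formula h(delta) = delta * sqrt(log(1 + delta^(-2))) and the scales delta_i = 2^(-i); the function name `h` mirrors the text and the list name `ratios` is hypothetical.

```python
import math

def h(delta):
    """Modulus function h(delta) = delta * Psi_2^{-1}(1/delta^2)."""
    return delta * math.sqrt(math.log1p(delta ** -2))

# Ratios h(delta_{i+1}) / h(delta_i) along the dyadic scales delta_i = 2^{-i}.
ratios = [h(2.0 ** -(i + 1)) / h(2.0 ** -i) for i in range(20)]
```

The largest ratio occurs at the coarsest scale (about 0.762, from delta = 1 to delta = 1/2), in line with the geometric decrease quoted from Problem [5]; the ratios drift down toward 1/2 at finer scales.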

Continuity of the sample paths of the Brownian motion process ensures that $\mathrm{osc}(\delta, S_m, d_2) \uparrow \mathrm{osc}_2(\delta)$ as $m \to \infty$, so it suffices to obtain probabilistic bounds that hold uniformly in $m$ before passing to the limit as $m$ tends to infinity.
Define $\delta_i = 2^{-i}$ and $s_{j,i} = j\delta_i^2$ and $S_i = \{s_{j,i} : j = 0, 1, \dots, 4^i - 1\}$, so that $N_i = \#S_i = 4^i = (1/\delta_i)^2$. The function $\ell_i$ that rounds down to an integer multiple of $\delta_i^2$ has the property that $d_2(s, \ell_i s) \le \delta_i$ for all $s \in [0, 1]$.
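
The dyadic grids and the rounding-down link maps are easy to make concrete. A minimal sketch, assuming the scales 2^(-i), grid spacing 4^(-i), and the metric given by the square root of |s - t|; the function names `delta`, `grid`, `link` and `d2` are hypothetical.

```python
import math

def delta(i):
    """Scale delta_i = 2^{-i}."""
    return 2.0 ** (-i)

def grid(i):
    """S_i: the 4^i integer multiples of delta_i^2 = 4^{-i} lying in [0, 1)."""
    return [j * 4.0 ** (-i) for j in range(4 ** i)]

def link(s, i):
    """ell_i: round s down to an integer multiple of delta_i^2."""
    step = 4.0 ** (-i)
    return math.floor(s / step) * step

def d2(s, t):
    """The metric d_2(s, t) = sqrt(|s - t|)."""
    return math.sqrt(abs(s - t))
```

For example, every point of `grid(4)` is mapped by `link(., 3)` to a point of `grid(3)` at `d2`-distance at most `delta(3)`, which is the link-length control used in the chaining argument.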

Remark. I omitted the endpoint $s = 1$ just to avoid having $\#S_i = 1 + 4^i$. Not important. You might object that a smaller covering set could be obtained by shifting $S_{i-1}$ slightly to the right then taking $\ell_i$ to map to the nearest integer multiple of $\delta_{i-1}^2$. The slight improvement would only affect the constant in the $O(h_2(\delta))$, at the cost of a messier argument.

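Given a $\delta < 1$ let $p$ be the integer for which $\delta_{p+1} < \delta \le \delta_p$. The chains will only extend to the $S_p$-level, rather than to the $S_0$-level. If the change from $S_0$ to $S_p$ disturbs you, you could work with $\tilde S_i := S_{i+p}$ and $\tilde\delta_i := \delta_{i+p}$ for $i = 0, 1, \dots$.
The map $L_p$ takes points $s < t$ in $S_m$ to points $L_p s \le L_p t$ in $S_p$ with $d_2(L_p s, L_p t) \le \delta_p + d_2(s, t)$. If $d_2(s, t) < \delta$ then $d_2(L_p s, L_p t) < 2\delta_p$. More precisely, $t - s < \delta_p^2$ and $L_p$ rounds down to a multiple of $\delta_p^2$, which means that either $L_p s = L_p t$ or $L_p t = L_p s + \delta_p^2$. Define

$$\Delta_{m,p} := \max_{t\in S_m} |X(t) - X(L_p t)| \le \sum_{i=p}^{m-1} \max_{s\in S_{i+1}} |X(s) - X(\ell_i s)|$$

$$\Lambda_p := \max\{|X(s_{j,p}) - X(s_{j+1,p})| : j = 0, 1, \dots, 4^p - 1\}.$$

Then for $s, t \in S_m$ with $d_2(s, t) < \delta$,

$$|X_s - X_t| \le |X(s) - X(L_p s)| + |X(L_p s) - X(L_p t)| + |X(L_p t) - X(t)| \le \Delta_{m,p} + \Lambda_p + \Delta_{m,p},$$

which leads to the bound

<35>  $$\mathrm{osc}(\delta, S_m, d_2) \le 2\Delta_{m,p} + \Lambda_p.$$

From here on it is just a matter of applying the various maximal inequalities to the $\Delta_{m,p}$ and $\Lambda_p$ terms.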

4.8.1 Control by (conditional) expectations


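For each event $B$ with $\epsilon := \mathbb{P}B > 0$, inequality <33> and Theorem <10> part (ii) give

<36>  $$\mathbb{P}_B \max_{i\le N} \frac{|X(s_i) - X(t_i)|}{d_2(s_i, t_i)} \le H(N) := c_0\Psi_2^{-1}(N/\epsilon) \qquad\text{where } c_0 = \sqrt{8/3}.$$

The inequality $\Psi_2^{-1}(u) + \Psi_2^{-1}(v) \ge \Psi_2^{-1}(uv)$ (see the Problems for Chapter 2) separates the $N$ and $\epsilon$ contributions. In particular,

$$\begin{aligned}
\delta_i H(N_i) &= c_0\delta_i\Psi_2^{-1}\bigl(\delta_i^{-2}/\epsilon\bigr) \\
 &\le c_0\delta_i\bigl(\Psi_2^{-1}(1/\delta_i^2) + \Psi_2^{-1}(1/\epsilon)\bigr) \\
 &= c_0\bigl(h(\delta_i) + \delta_i\gamma\bigr) \qquad\text{where } \gamma := \Psi_2^{-1}(1/\epsilon).
\end{aligned}$$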

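Inequality <36> gives a neat expression for the conditional expectation of the upper bound in <35>:

$$\mathbb{P}_B \Lambda_p \le \delta_p H(N_p) \le c_0 h(\delta_p) + c_0\delta_p\gamma$$

and, by a minor modification of Lemma <12>,

$$\mathbb{P}_B \Delta_{m,p} \le \sum_{i=p}^{m-1} \delta_i H(N_{i+1}) \le 2c_0 \sum_{i=p}^{\infty} \bigl(h(\delta_{i+1}) + \delta_{i+1}\gamma\bigr).$$

By Problem [5] the $h(\delta_i)$ decrease geometrically fast: $h(\delta_{i+1}) < 0.77\,h(\delta_i)$ for all $i$. It follows that there exists a universal constant $C$ for which

$$\mathbb{P}_B\, \mathrm{osc}(\delta, S_m, d_2) \le \frac{C}{2}\Bigl(h(\delta_{p+1}) + \delta_{p+1}\Psi_2^{-1}(1/\epsilon)\Bigr) \le C\Bigl(h(\delta) + \delta\,\Psi_2^{-1}(1/\mathbb{P}B)\Bigr).$$

Now let $m$ tend to infinity to obtain an analogous upper bound for $\mathbb{P}_B\, \mathrm{osc}_2(\delta)$.
If we choose $B$ equal to the whole sample space the $\Psi_2^{-1}(1/\mathbb{P}B)$ term is superfluous. We have $\mathbb{P}\,\mathrm{osc}_2(\delta) \le Ch(\delta)$, which is sharp up to a multiplicative constant: Problem [6] uses the independence of the Brownian motion increments to show that $\mathbb{P}\,\mathrm{osc}_2(\delta) \ge ch(\delta)$ for all $0 < \delta \le 1/2$, where $c$ is a positive constant.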
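Now for the surprising part. If, for an $x > 0$, we choose

$$B = \{\mathrm{osc}_2(\delta) \ge Ch(\delta) + C\delta x\}$$

then

$$Ch(\delta) + C\delta x \le \mathbb{P}_B\, \mathrm{osc}_2(\delta) \le Ch(\delta) + C\delta\,\Psi_2^{-1}(1/\mathbb{P}B).$$

Solve for $\mathbb{P}B$ to conclude that

<37>  $$\mathbb{P}\{\mathrm{osc}_2(\delta) \ge Ch(\delta) + C\delta x\} \le 1/\Psi_2(x),$$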

a clean subgaussian tail bound.
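
The "solve for the probability of B" step can be spelled out. A sketch, writing $\mathbb{P}_B$ for expectation conditional on the event $B = \{\mathrm{osc}_2(\delta) \ge Ch(\delta) + C\delta x\}$ and assuming $\mathbb{P}B > 0$ (otherwise there is nothing to prove):

```latex
\begin{align*}
Ch(\delta) + C\delta x
  &\le \mathbb{P}_B\,\mathrm{osc}_2(\delta)
  && \text{because } \mathrm{osc}_2(\delta) \ge Ch(\delta) + C\delta x \text{ on } B \\
  &\le Ch(\delta) + C\delta\,\Psi_2^{-1}(1/\mathbb{P}B)
  && \text{by the conditional chaining bound,} \\
\text{so}\quad x &\le \Psi_2^{-1}(1/\mathbb{P}B),
  \quad\text{that is,}\quad
  \mathbb{P}B \le 1/\Psi_2(x) = \frac{1}{e^{x^2} - 1} .
\end{align*}
```

The lower bound uses nothing more than the definition of $B$; all the probabilistic work sits in the upper bound.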

4.8.2 Control by $\Psi_2$-norm

Theorem <10> part (iv) gives

$$\Bigl\| \max_{i\le N} \frac{|X(s_i) - X(t_i)|}{d_2(s_i, t_i)} \Bigr\|_{\Psi_2} \le H(N) := C_0\Psi_2^{-1}(N),$$

which is the same as <36> except that $\mathbb{P}_B$ has been replaced by $\|\cdot\|_{\Psi_2}$ and $\epsilon$ (that is, $\mathbb{P}B$) equals 1 (and the constant is different). With those modifications, repeat the argument from subsection 4.8.1 to deduce that $\|\mathrm{osc}(\delta, S_m, d_2)\|_{\Psi_2} \le Ch(\delta)$, for some new constant $C$, that is

$$\mathbb{P}\Psi_2\Bigl(\frac{\mathrm{osc}(\delta, S_m, d_2)}{Ch(\delta)}\Bigr) \le 1.$$

Again let $m$ increase to infinity to deduce that $\|\mathrm{osc}_2(\delta)\|_{\Psi_2} \le Ch(\delta)$, which is clearly an improvement over $\mathbb{P}\,\mathrm{osc}_2(\delta) \le Ch(\delta)$. For example, we can immediately deduce that $\|\mathrm{osc}_2(\delta)\|_r \le C_r h(\delta)$ for each $r \ge 1$. We also get a tail bound,

$$\mathbb{P}\{\mathrm{osc}_2(\delta) \ge xCh(\delta)\} \le 1/\Psi_2(x) \qquad\text{for } x > 0,$$

which is inferior to inequality <37>.

4.8.3 Control by tail probabilities


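Once again start from inequality <35>

$$\mathrm{osc}(\delta, S_m, d_2) \le 2\Delta_{m,p} + \Lambda_p.$$

For nonnegative constants $\theta_p$ and $\eta_i$, whose choice will soon be considered, define $y = \theta_p\delta_p + 2\sum_{i=p}^{m-1} \eta_i\delta_i$. Then, as in Section 4.1,

$$\begin{aligned}
\mathbb{P}\{\mathrm{osc}(\delta, S_m, d_2) \ge y\}
 &\le \mathbb{P}\{\Lambda_p \ge \theta_p\delta_p\} + \sum_{i=p}^{m-1} \mathbb{P}\Bigl\{\max_{s\in S_{i+1}} |X(s) - X(\ell_i s)| \ge \eta_i\delta_i\Bigr\} \\
 &\le 2N_p e^{-\theta_p^2/2} + \sum_{i=p}^{m-1} 2N_{i+1} e^{-\eta_i^2/2} \\
 &= 2\exp\bigl(\log N_p - \theta_p^2/2\bigr) + \sum_{i=p}^{m-1} 2\exp\bigl(\log N_{i+1} - \eta_i^2/2\bigr). \qquad <38>
\end{aligned}$$

How to choose the $\eta_i$'s and $\theta_p$? For the sake of comparison with the inequality <37> obtained by the clever choice of the conditioning event $B$, let me try for an upper bound that is a constant multiple of $e^{-x^2/2}$, for a given $x \ge 0$.
Consider the first term in the bound <38>. To make the exponential exactly equal to $e^{-x^2/2}$ we should choose

$$\theta_p = \sqrt{2\log N_p + x^2} \le \sqrt{2\log N_p} + x.$$

The contribution to $y$ is then at most

$$\delta_p\sqrt{2\log(1/\delta_p^2)} + \delta_p x \le 2h(\delta_p) + 2\delta x.$$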

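A similar idea works for each of the $\eta_i$'s, except that we need to add on a little bit more to keep the sum bounded as $m$ goes off to infinity. If the terms were to decrease geometrically then the whole sum would be bounded by a multiple of the first term, which would roughly match the $\theta_p$ contribution. With those thoughts in mind, choose

$$\eta_i = \sqrt{2\log N_{i+1} + x^2 + 2\log 2^{i-p}} \le \sqrt{2\log N_{i+1}} + x + \sqrt{2(i - p)\log 2}$$

so that

$$\sum_{i=p}^{m-1} 2\exp\bigl(\log N_{i+1} - \eta_i^2/2\bigr) \le 4e^{-x^2/2}$$

and $\sum_{i=p}^{m-1} \eta_i\delta_i$ is less than

$$\begin{aligned}
&\sum_{i=p}^{\infty} \Bigl(2\delta_{i+1}\sqrt{2\log(1/\delta_{i+1}^2)} + x\delta_i\Bigr) + \delta_p \sum_{k=0}^{\infty} 2^{-k}\sqrt{2k\log 2} \\
&\qquad\le 8\sum_{i=p}^{\infty} h(\delta_{i+1}) + \delta_p(2x + c_1) \le c_2\bigl(h(\delta) + \delta x\bigr)
\end{aligned}$$

for universal constants $c_1$ and $c_2$. (I absorbed the $c_1$ into the $h(\delta)$ term.) With these simplifications, inequality <38> gives a clean bound,

$$\mathbb{P}\{\mathrm{osc}(\delta, S_m, d_2) > Ch(\delta) + C\delta x\} \le 6e^{-x^2/2}$$

for some universal constant $C$. Here I very cunningly changed the $\ge$ to a $>$ to ensure a clean passage to the limit, via

$$\{\mathrm{osc}(\delta, S_m, d_2) > y\} \uparrow \{\mathrm{osc}_2(\delta) > y\} \qquad\text{as } m \to \infty.$$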

In the limit we get an inequality very similar to <37>.
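
The effect of the extra term in the chaining thresholds can be verified numerically. The sketch below assumes grid sizes N_i = 4^i and thresholds whose squares equal 2 log N_{i+1} + x^2 + 2 (i - p) log 2; the function names `eta` and `chain_tail_sum` are hypothetical.

```python
import math

def eta(i, p, x):
    """Threshold eta_i with eta_i^2 = 2*log N_{i+1} + x^2 + 2*(i - p)*log 2."""
    log_n_next = (i + 1) * math.log(4.0)          # log N_{i+1}, with N_i = 4^i
    return math.sqrt(2.0 * log_n_next + x * x + 2.0 * (i - p) * math.log(2.0))

def chain_tail_sum(p, m, x):
    """Sum over p <= i < m of 2*exp(log N_{i+1} - eta_i^2 / 2)."""
    return sum(
        2.0 * math.exp((i + 1) * math.log(4.0) - eta(i, p, x) ** 2 / 2.0)
        for i in range(p, m)
    )
```

Each summand collapses to 2 * 2^(-(i - p)) * exp(-x^2/2), so the sum is a geometric series bounded by 4 exp(-x^2/2) regardless of how far the chain extends.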

4.9 Problems

[1] Prove the 1934 result of Kolmogorov that is cited at the start of Section 4.10.
(It might help to look at Chapter 5 first.)

[2] In inequality <31> put $y_i = (y + i - p)$ for some nonnegative $y$. Use the fact that for each decreasing nonnegative function $g$ on $\mathbb{R}^+$ and nonnegative integers $a$ and $b$

$$\sum_{i=a}^{b} g(i) \le g(a) + \int_a^{b} g(r)\, dr$$

to deduce existence of positive constants $K_1$, $K_2$, and $c$ for which

$$\mathbb{P}\Bigl\{\Delta_{m,p} > K_1 \int_0^{\delta_p} \Bigl(y + \Psi_\alpha^{-1}\bigl(\mathrm{pack}(r, T, d)\bigr) + \log^{1/\alpha}(1/r)\Bigr)\, dr\Bigr\} \le K_2 e^{-cy^\alpha}$$

for all $y \ge 0$.

[3] (How to build a majorizing measure via packing numbers.) Suppose $\Psi$ is a Young function for which $\int_0^1 \Psi^{-1}(1/r)\, dr < \infty$ and

$$\Psi^{-1}(uv) \le C_0\bigl(\Psi^{-1}(u) + \Psi^{-1}(v)\bigr) \qquad\text{if } \min(u, v) \ge 1$$

for some positive constant $C_0$. Suppose also that

$$\int_0^D \Psi^{-1}\bigl(\mathrm{pack}(r, T, d)\bigr)\, dr < \infty \qquad\text{where } D = \mathrm{diam}(T).$$

Define $\delta_i = D/2^i$ for $i = 0, 1, 2, \dots$ and let $S_i$ be the maximal $\delta_i$-separated subset of $T$, with $N_i := \#S_i = \mathrm{pack}(\delta_i, T, d)$, as described near the end of Section 4.2. Define $\mu_i$ to be the uniform probability distribution on $S_i$, that is, mass $1/N_i$ at each point of $S_i$, and $\mu = \sum_{i\ge 0} 2^{-i-1}\mu_i$.

(i) For each $t \in T$ and $r > \delta_i$ show that $\mu B[t, r] \ge 2^{-i-1}\mu_i B[t, r] \ge (2^{i+1}N_i)^{-1}$. Hint: Could $S_i \cap B[t, r]$ be empty?
(ii) By splitting the range of integration into intervals where $\delta_{i-1} \ge r > \delta_i$, deduce (cf. Section 4.3) that

$$\int_0^D \Psi^{-1}\Bigl(\frac{1}{\mu B[t, r]}\Bigr)\, dr \le 2C_0 \sum_{i\ge 1} \delta_i\Bigl(\Psi^{-1}(2^{i+1}) + \Psi^{-1}(N_i)\Bigr) < \infty.$$

That is, $\mu$ is a majorizing measure.

[4] Suppose $\{\pi_i : i \in \mathbb{N}\}$ is a nested sequence of partitions of $T$ with $\#\pi_i \le n_i = 2^{2^i}$ and

$$\sup_{t\in T} \sum_{i\in\mathbb{N}} \Psi^{-1}(2^i n_i)\,\mathrm{diam}(E_i(t)) < \infty.$$

Define $S_i$ to consist of exactly one point from each $E$ in $\pi_i$. Mimic the method from Problem [3] to show that $\sup_{t\in T} \int_0^D \Psi^{-1}(1/\mu B[t, r])\, dr < \infty$ for some Borel probability measure $\mu$.

[5] From the fact that $g(y) = \Psi_2(y)/y^2$ is an increasing function on $\mathbb{R}^+$ deduce that the function

$$h(\delta) = 1\Big/\sqrt{g\bigl(\Psi_2^{-1}(1/\delta^2)\bigr)} = \delta\,\Psi_2^{-1}(1/\delta^2) = \delta\sqrt{\log(1 + \delta^{-2})}$$

is increasing on $\mathbb{R}^+$. More precisely, for each $\theta \in (0, 1)$, show that $h(\theta y)/h(y) = \sqrt{g_\theta(y)}$ where $y \mapsto g_\theta(y)$ is an increasing function with $\sqrt{g_{1/2}(1)} < 0.77$.

[6] Suppose $Z_1, \dots, Z_n$ are independent random variables, each distributed $N(0, 1)$ (with density $\phi$). Define $M_n = \max_{i\le n} |Z_i|$. For a positive constant $c$, let $x_n = x_n(c)$ be the value for which $\mathbb{P}\{|Z_i| \ge x_n\} = c/n$.
(i) Show that $\mathbb{P}\{M_n \le x_n\} = (1 - c/n)^n \to e^{-c}$ as $n \to \infty$.
(ii) If $0 \le x + 1 \le \sqrt{2\log n}$, show that $\mathbb{P}\{|Z_i| \ge x\} \ge 2\phi(1 + x) \ge n^{-1}\sqrt{2/\pi}$. Deduce that $x_n(\sqrt{2/\pi}) + 1 > \sqrt{2\log n}$.
(iii) Deduce that there exists some positive constant $C$ for which $\mathbb{P}M_n \ge C\sqrt{\log n}$ for all $n$.
(iv) For $X$ a standard Brownian motion as in Section 4.8, deduce from (iii) that $\mathbb{P}\,\mathrm{osc}(\delta, X) \ge c\sqrt{\delta\log(1/\delta)}$ for all $0 < \delta \le 1/2$, where $c$ is a positive constant. Hint: Write $2^{m/2}X_1$ as a sum of $2^m$ independent standard normals.

[7] (A sharpening of the result from the previous Problem.) The classical bounds (Feller, 1968, Section VII.1 and Problem 7.1) show that the normal tail probability $\bar\Phi(x) = \mathbb{P}\{N(0, 1) > x\}$ behaves asymptotically like $\phi(x)/x$. More precisely, $1 - x^{-2} \le x\bar\Phi(x)/\phi(x) \le 1$ for all $x > 0$. Less precisely, $-\log\bigl(c_0\bar\Phi(x)\bigr) = \tfrac12 x^2 + \log x + O(x^{-2})$ as $x \to \infty$, where $c_0 = \sqrt{2\pi}$.

(i) (Compare with Leadbetter et al. 1983, Theorem 1.5.3.) Define $a_n = \sqrt{2\log n}$ and $L_n = \log a_n$. For each constant $\tau$ define $m_{\tau,n} = a_n - (1 + \tau)L_n/a_n$. Show that

$$c_0\bar\Phi(m_{\tau,n}) = n^{-1}\exp\bigl(\tau L_n + o(1)\bigr).$$

(ii) Define $M_n = \max_{i\le n} Z_i$ for iid $N(0, 1)$ random variables $Z_1, Z_2, \dots$. Show that

$$\mathbb{P}\{M_n \le m_{\tau,n}\} = \bigl(1 - \bar\Phi(m_{\tau,n})\bigr)^n = \exp\bigl(-a_n^\tau(c_0^{-1} + o(1))\bigr).$$

(iii) If n increases rapidly enough, show that there exists an interval $I_n$ of length $o(L_n/a_n)$ containing the point $a_n - L_n/a_n$ for which $\mathbb{P}\{M_n \notin I_n\} \to 0$ very fast.

4.10 Notes
Credit for the idea of chaining as a method of successive approximations clearly belongs to Kolmogorov, at least for the case of a one-dimensional index set. For example, at the start of the paper of Chentsov (1956):

In 1934 A. N. Kolmogorov proved the following

Theorem. If $\xi(t)$ is a separable (see [1]) stochastic process, $0 \le t \le 1$, and
(1) $M|\xi(t_1) - \xi(t_2)|^p < C|t_1 - t_2|^{1+r}$,
where $p > 0$, $r > 0$ and $C$ is a constant independent of $t$, then the trajectories of the process are continuous with probability 1.
A generalization of this theorem is the following proposition which was suggested to the author by A. N. Kolmogorov: . . .

The statement of the Theorem was footnoted by the comment "This theorem was first published in a paper by E. E. Slutskii [2]", with a reference to a 1937 paper that I have not seen. See Billingsley (1968, Section 12) for a small generalization of the theorem (with credit to Kolmogorov, via Slutsky, and Chentsov) and a chaining proof.
See Dudley (1973, Section 1) and Dudley (1999a, Section 1.2 and Notes) for more about packing and covering. The definitive early work is due to Kolmogorov and Tikhomirov (1959).
Dudley (1973) used chaining with packing/covering numbers and tail inequalities to establish various probabilistic bounds for Gaussian processes. Dudley (1978) adapted the methods using the Bernstein inequality and metric entropy and inclusion assumptions (now called bracketing; see Chapter 13) to extend the Gaussian techniques to empirical processes indexed by collections of sets. He also derived bounds for processes indexed by VC classes of sets (see Chapter 9) via symmetrization (see Chapter 8) arguments. In each case he controlled the increments of the empirical processes by exponential inequalities like those in Chapter ChapHoeffBenn.
Pisier (1983) is usually credited for realizing that the entropy methods used for Gaussian processes could also be extended to nonGaussian processes with Orlicz norm control of the increments. However, as Pisier (page 127) remarked:

For the proof of this theorem, we follow essentially [10]; I have included a slight improvement over [10] which was kindly pointed out to me by X. Fernique. Moreover, I should mention that N. Kôno [6] proved a result which is very close to the above; at the time of [10], I was not aware of Kôno's paper [6].

Here [10] = Pisier (1980) and [6] = Kôno (1980). The earlier paper [10] included extensive discussion of other precursors for the idea. See also the Notes to Section 2.6 of Dudley (1999b).

Using methods like those in Section 4.5, Nolan and Pollard (1988) proved
a functional central limit for the U-statistic analog of the empirical process.
Kim and Pollard (1990) and Pollard (1990) proved limit theorems for a
variety of statistical estimators using second moment control for suprema of
empirical processes.
My analysis in Example <18> is based on arguments of Ibragimov and
Hasminskii (1981, Section 1.5), with the chaining bound replacing their
method for deriving maximal inequalities. The analysis could be extended
to unbounded subsets of R by similar adaptations of their arguments for
unbounded sets.
See Pollard (1985) for one way to use a form of oscillation bound (under
the name stochastic differentiability) to establish central limit theorems for
M-estimators. Pakes and Pollard (1989, Lemma 2.17) used a property more
easily recognized as oscillation around a fixed index point.

References
Billingsley, P. (1968). Convergence of Probability Measures. New York: Wiley.

Chentsov, N. N. (1956). Weak convergence of stochastic processes whose trajectories have no discontinuities of the second kind and the "heuristic" approach to the Kolmogorov-Smirnov tests. Theory of Probability and Its Applications 1 (1), 140-144.

Dudley, R. M. (1973). Sample functions of the Gaussian process. Annals of Probability 1, 66-103.

Dudley, R. M. (1978). Central limit theorems for empirical measures. Annals of Probability 6, 899-929.

Dudley, R. M. (1999a). Uniform Central Limit Theorems. Cambridge University Press.

Dudley, R. M. (1999b). Uniform Central Limit Theorems. Cambridge University Press.

Dudley, R. M. (2003). Real Analysis and Probability (2nd ed.). Cambridge studies in advanced mathematics. Cambridge University Press.

Feller, W. (1968). An Introduction to Probability Theory and Its Applications (third ed.), Volume 1. New York: Wiley.

Fernique, X. (1975). Régularité des trajectoires des fonctions aléatoires gaussiennes. Springer Lecture Notes in Mathematics 480, 1-97. École d'Été de Probabilités de Saint-Flour IV-1974.

Ibragimov, I. A. and R. Z. Has'minskii (1981). Statistical Estimation: Asymptotic Theory. New York: Springer. (English translation from 1979 Russian edition).

Kim, J. and D. Pollard (1990). Cube root asymptotics. Annals of Statistics 18, 191-219.

Kolmogorov, A. N. and V. M. Tikhomirov (1959). ε-entropy and ε-capacity of sets in function spaces. Uspekhi Mat. Nauk 14 (2 (86)), 3-86. Review by G. G. Lorentz at MathSciNet MR0112032. Included as paper 7 in volume 3 of Selected Works of A. N. Kolmogorov.

Kôno, N. (1980). Sample path properties of stochastic processes. J. Math. Kyoto Univ. 20 (2), 295-313.

Leadbetter, M. R., G. Lindgren, and H. Rootzén (1983). Extremes and Related Properties of Random Sequences and Processes. Springer-Verlag.

Ledoux, M. and M. Talagrand (1991). Probability in Banach Spaces: Isoperimetry and Processes. New York: Springer.

Massart, P. (2000). About the constants in Talagrand's concentration inequalities for empirical processes. The Annals of Probability 28 (2), 863-884.

McKean, H. P. (1969). Stochastic Integrals. Academic Press.

Nolan, D. and D. Pollard (1988). Functional limit theorems for U-processes. Annals of Probability 16, 1291-1298.

Pakes, A. and D. Pollard (1989). Simulation and the asymptotics of optimization estimators. Econometrica 57, 1027-1058.

Pisier, G. (1980). Conditions d'entropie assurant la continuité de certains processus et applications à l'analyse harmonique. In Séminaire d'analyse fonctionnelle, 1979-80, pp. 1-41. École Polytechnique Palaiseau. Available from http://archive.numdam.org/.

Pisier, G. (1983). Some applications of the metric entropy condition to harmonic analysis. Springer Lecture Notes in Mathematics 995, 123-154.

Pollard, D. (1985). New ways to prove central limit theorems. Econometric Theory 1, 295-314.

Pollard, D. (1989). Asymptotics via empirical processes (with discussion). Statistical Science 4, 341-366.

Pollard, D. (1990). Empirical Processes: Theory and Applications, Volume 2 of NSF-CBMS Regional Conference Series in Probability and Statistics. Hayward, CA: Institute of Mathematical Statistics.

Pollard, D. (2001). A User's Guide to Measure Theoretic Probability. Cambridge University Press.

Talagrand, M. (1987). Regularity of Gaussian processes. Acta Mathematica 159, 99-149.

Talagrand, M. (1996). New concentration inequalities in product spaces. Inventiones Mathematicae 126, 505-563.

Talagrand, M. (2001). Majorizing measures without measures. The Annals of Probability 29 (1), 411-417.

Talagrand, M. (2005). The Generic Chaining: Upper and lower bounds of stochastic processes. Springer-Verlag.
