
4.1 What is chaining? . . . . . . . . . . . . . . . . . . . . . . . 1
4.2 Covering and packing numbers . . . . . . . . . . . . . . . . . 5
4.3 Bounding sums by integrals . . . . . . . . . . . . . . . . . . . 7
4.4 Orlicz norms of maxima of finitely many variables . . . . . . 8
4.5 Chaining with norms and packing numbers . . . . . . . . . . 10
4.6 An alternative to packing: majorizing measures . . . . . . . . 16
4.7 Chaining with tail probabilities . . . . . . . . . . . . . . . 18
    4.7.1 Tail chaining via packing . . . . . . . . . . . . . . 20
    4.7.2 Tail chaining via majorizing measures . . . . . . . . . 21
4.8 Brownian Motion, for example . . . . . . . . . . . . . . . 22
    4.8.1 Control by (conditional) expectations . . . . . . . . . 24
    4.8.2 Control by Ψ₂-norm . . . . . . . . . . . . . . . . . . 25
    4.8.3 Control by tail probabilities . . . . . . . . . . . . . 26
4.9 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.10 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

printed: 22 November 2015, © David Pollard


Chapter 4

Chaining

This chapter introduces some of the tools used to construct chaining approximations to stochastic processes.

Section 4.1 describes two related methods for quantifying the complexity of

a metric space: by means of covering/packing numbers or by means of

majorizing measures.

Section 4.2 defines covering and packing numbers.

Section 4.3 explains why chaining bounds are often expressed as integrals.

Section 4.4 presents a few simple ways to bound the Orlicz norm of a

maximum of finitely many random variables.

Section 4.5 illustrates the method of combining chaining with maximal inequalities like those from Section 4.4 to control the norm of the oscillation of a stochastic process.

Section 4.6 mentions a more subtle alternative to packing/covering, which

is discussed in detail in Chapter 7.

Section 4.7 discusses chaining with tail probabilities.

Section 4.8 compares various chaining methods by applying them to the

humble example of Brownian motion on the unit interval.

4.1 What is chaining?

This chapter begins the task of developing various probabilistic bounds for stochastic processes, X = {X_t : t ∈ T}, where the index set T is equipped with a metric (or semi-metric) d that gives some control over the increments of the process. For example, we might have ‖X_s − X_t‖ ≤ C d(s, t) for some constant C and norm ‖·‖, or perhaps a tail bound

    P{|X_s − X_t| ≥ ε d(s, t)} ≤ Ψ̄(ε)   for all s, t ∈ T,

for some decreasing function Ψ̄. The leading case, the example that drove the development of much of the theory, is the centered Gaussian process with (semi-)metric defined by d_X²(s, t) = P|X_s − X_t|².

Remark. Especially if T is finite or countably infinite, it is not essential that d(s, t) = 0 should imply that s = t. It usually suffices to have X_s = X_t almost surely if d(s, t) = 0. More elegantly, we could partition T into equivalence classes [s] = {t ∈ T : d(s, t) = 0} then replace T by a subset T₀ consisting of one point from each equivalence class. The restriction of d to T₀ would then be a metric. Like most probabilists, I tend to ignore these niceties unless they start causing trouble. When dealing with finite index sets we may assume d is a metric.

We seek good upper bounds for quantities such as sup_{t∈T} |X_t| or the oscillation,

<1>    osc(δ, X, T) := sup{|X_s − X_t| : s, t ∈ T, d(s, t) < δ}.

For example, we might seek bounds on ‖sup_{t∈T} |X_t|‖ or ‖osc(δ, X, T)‖, for various norms, or on their tail probability analogs

    P{sup_{t∈T} |X_t| > ε}   or   P{osc(δ, X, T) > ε}.
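For a process observed on a finite index set, the oscillation in <1> can be evaluated by brute force. The sketch below (all function names are mine, not the text's) computes osc(δ, X, T) for a Gaussian random walk on a grid in [0, 1]; the induced L² semi-metric for this process is d_X(s, t) = √|s − t|.

```python
import math
import random

def oscillation(delta, xs, ts):
    """Brute-force osc(delta, X, T) = max{|X_s - X_t| : |s - t| < delta}
    over a process observed at finitely many index points ts."""
    return max(
        abs(xs[i] - xs[j])
        for i in range(len(ts))
        for j in range(len(ts))
        if abs(ts[i] - ts[j]) < delta
    )

random.seed(1)
n = 200
ts = [i / n for i in range(n + 1)]
# Gaussian random walk: increments with standard deviation 1/sqrt(n),
# so that Var(X_s - X_t) = |s - t| on the grid.
xs = [0.0]
for _ in range(n):
    xs.append(xs[-1] + random.gauss(0, 1 / math.sqrt(n)))

# oscillation shrinks as delta decreases; at delta > diam(T) it is the full range
print(oscillation(0.05, xs, ts))
```

The whole point of the chapter is to bound such quantities without enumerating all pairs, which is hopeless when the index set is large or uncountable.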

Such quantities play an important role in the theory of stochastic processes,

empirical process theory, and statistical asymptotic theory. As explained in

Section 5.4, bounds for the oscillation are essential for the construction of

processes with continuous sample paths and for the study of convergence

in distribution of sequences of stochastic processes. In particular, oscillation

control is the key ingredient in the proofs of Donsker theorems for empirical

processes. In the literature on the asymptotic theory for estimators defined

by optimization over random processes, oscillation bounds (or something

similar) have played a major role under various names, such as stochastic

asymptotic equicontinuity.

If T is uncountable the suprema in the previous paragraph need not

be measurable, which leads to difficulties of the type discussed in Chap-

ter 5. Some authors sidestep this difficulty by interpreting the inequalities

as statements about arbitrarily large finite subsets of T. For example, interpret P sup_{t∈T} X_t to mean sup_S {P sup_{t∈S} X_t : finite S ⊆ T}.


With such a convention the real challenge becomes: find bounds for

finite S that do not grow unhelpfully large as the size (cardinality) of S

increases. For example, if T is any countable subset of T one might hope

to get a reasonable bound for P suptT Xt by passing to the limit along a

sequence of finite subsets Sn that increase to T . If T is dense in T the

methods from Chapter 5 then take over, leading to bounds for P suptT Xt

if the version of X has suitable sample paths.

The workhorse of the modern approach to approximating stochastic pro-

cesses is called chaining. Suppose you wish to approximate max_{t∈S} X_t on

a (possibly very large) finite subset S of T . You could try a union bound,

such as

    P{max_{t∈S} |X_t| ≥ ε} ≤ Σ_{t∈S} P{|X_t| ≥ ε},

but typically the upper bound grows larger than 1 as the size of S increases. Instead you could creep up on S through a sequence of finite subsets, S₀, S₁, . . . , S_m = S, breaking the process on S into a contribution from S₀ plus a sum of increments across each (S_i, S_{i+1}) pair.

To carry out such a strategy you need maps ℓ_i : S_{i+1} → S_i for i = 0, . . . , m − 1. The composition L_p = ℓ_p ∘ ℓ_{p+1} ∘ · · · ∘ ℓ_{m−1} maps S_m into S_p, for 0 ≤ p < m. Each t = t_m in S_m is connected to a point t₀ = L₀t in S₀ by a chain of points

<2>    t_m = t  →  t_{m−1} = L_{m−1}(t)  →  t_{m−2} = L_{m−2}(t)  →  · · ·  →  t₀ = L₀(t),

each link being given by one of the maps ℓ_{m−1}, ℓ_{m−2}, . . . , ℓ₀, which gives the decomposition

<3>    X(t) − X(t₀) = Σ_{i=0}^{m−1} ( X(t_{i+1}) − X(t_i) )   with t_i = L_i(t).

[Figure: a chain linking a point t = t_m in S_m through t_{m−1} = ℓ_{m−1}t in S_{m−1} and t_{m−2} = ℓ_{m−2}t_{m−1}, and so on, down to t₀ = L₀(t) in S₀.]


Remark. To me ℓ stands for link. The pair (t, ℓ_i t) with t ∈ S_{i+1} defines a link in the chain connecting t to L₀t.

Many different chains can share the same increment X(s) − X(ℓ_i(s)) with s ∈ S_{i+1}, but you need to control that increment only once. For example, for each t in S the triangle inequality applied to <3> gives

    |X(t) − X(t₀)| ≤ Σ_{i=0}^{m−1} |X(t_{i+1}) − X(t_i)| ≤ Σ_{i=0}^{m−1} max_{s∈S_{i+1}} |X(s) − X(ℓ_i(s))|.

Remark. Is |·| here the absolute value of a random variable or the norm of some random element of a normed vector space? It really doesn't matter which interpretation you prefer. The chaining argument generalizes easily to more exotic random elements of vector spaces.

The final sum is the same for every t; it also provides an upper bound for

the maximum over t:

<4>    max_{t∈S} |X(t) − X(t₀)| ≤ Σ_{i=0}^{m−1} max_{s∈S_{i+1}} |X(s) − X(ℓ_i(s))|.

For constants ε_i > 0, a union bound controls the tail probability of the quantity on the left-hand side of <4>:

    P{ max_{t∈S} |X(t) − X(t₀)| ≥ Σ_{i=0}^{m−1} ε_i }
        ≤ P( ⋃_i ⋃_{s∈S_{i+1}} {|X(s) − X(ℓ_i(s))| ≥ ε_i} )
        ≤ Σ_{i=0}^{m−1} Σ_{s∈S_{i+1}} P{|X(s) − X(ℓ_i(s))| ≥ ε_i}.

The tail probability P{max_{t∈S} |X_t| ≥ ε + Σ_i ε_i} is then less than P{max_{t∈S₀} |X_t| ≥ ε} plus the double sum over i and S_{i+1}. If the growth in the size of the S_i's could be offset by a decrease in the d(s, ℓ_i s) distances you might get probabilistic bounds that do not depend explicitly on the size of S.

The extra details involving the choice of the ε_i's make chaining with tail probabilities seem more complicated than the analogous argument where one merely takes some norm Φ of both sides of inequality <4>, giving

<5>    Φ( max_{t∈S} |X(t) − X(t₀)| ) ≤ Σ_{i=0}^{m−1} Φ( max_{s∈S_{i+1}} |X(s) − X(ℓ_i(s))| ).

From now on I assume that Φ has the following properties.

<6> Assumption. Write M⁺ for the set of measurable functions on a probability space (Ω, F, P) taking values in [0, ∞]. Assume Φ is a map from M⁺ into [0, ∞] that satisfies

(i) Φ(X + Y) ≤ Φ(X) + Φ(Y).

For suprema taken along finite subsets that increase to a countable subset of T we also need something like: if 0 ≤ X₁ ≤ X₂ ≤ · · · ↑ X almost surely then Φ(X_n) ↑ Φ(X).

The chaining bound <5> reduces the problem to controlling Φ of maxima over finite sets. See Section 4.4 for some simple methods when Φ is an Orlicz norm.

The traditional approach to chaining chooses the S_i sets for their uniform approximation properties. Typically one starts with a decreasing sequence {δ_i}, such as δ_i = δ/2^i for some δ > 0, then seeks sets S_i with min_{s∈S_i} d(t, s) ≤ δ_i for every t in S such that S_i is as small as possible. The smallest such cardinality is called the covering number. Section 4.2 gives the formal definition, as well as describing a slightly more convenient concept called the packing number.

Section 4.5 illustrates the method for combining bounds on packing numbers with bounds on the norms of maxima of finitely many variables (Section 4.4) to obtain useful bounds for a norm of max_{t∈S} |X(t) − X(t₀)|.

Section 4.7 replaces norms by tail probabilities.

Section 4.8 returns to the relatively simple case of Brownian motion on

the unit interval as a test case for comparing the different approaches.

For each method I use the task of controlling the oscillation of a stochas-

tic process as a nontrivial example of what can be done with chaining.

4.2 Covering and packing numbers

The simplest strategy for chaining is to minimize the size (cardinality) of S_i subject to a given upper bound on sup_{t∈T} min_{s∈S_i} d(t, s). That idea translates into a statement about covering numbers.


Remark. The logarithm of the covering number is often called the metric entropy, probably because it appears naturally when dealing with processes whose increments have tail probabilities that decrease exponentially fast.

<7> Definition. For a subset F of T write N_T(ε, F, d) for the ε-covering number, the smallest number of closed ε-balls needed to cover F. That is, the covering number is the smallest N for which there exist points t₁, . . . , t_N in T with min_{i≤N} d(t, t_i) ≤ ε for each t in F. The set of centers {t_i} is called an ε-net for F.

Remark. Notice a small subtlety related to the subscript T in the definition. If we regard F as a metric space in its own right, not just as a subset of T, then the covering numbers might be larger because the centers t_i would be forced to lie in F. It is an easy exercise (select a point of F from each covering ball that actually intersects F) to show that N_F(2ε, F, d) ≤ N_T(ε, F, d). The extra factor of 2 would usually be of little consequence. When in doubt, you should interpret covering numbers to refer to N_F.

Some metric spaces (such as the whole real line under its usual metric) cannot be covered by a finite set of balls of a fixed radius. A metric space T for which N_T(ε, T, d) < ∞ for every ε > 0 is said to be totally bounded. A metric space is compact if and only if it is both complete and totally bounded (Dudley, 2003, Section 2.3).

I prefer to work with the packing number pack(ε, F, d), defined as the largest N for which there exist points t₁, . . . , t_N in F that are ε-separated, that is, for which d(t_i, t_j) > ε if i ≠ j. Notice the lack of a subscript T; the packing numbers are an intrinsic property of F, and do not depend on T except through the metric it defines on F.

<8> Lemma. For each ε > 0,

    N_T(ε, F, d) ≤ N_F(ε, F, d) ≤ pack(ε, F, d) ≤ N_T(ε/2, F, d).

Proof. For the final inequality, observe that no closed ball of radius ε/2 can contain points more than ε apart; each of the ε-separated points counted by pack(ε, F, d) must therefore lie in a distinct ball from an (ε/2)-cover. The other inequalities have similarly simple proofs.

<9> Example. Let ‖·‖ denote any norm on R^k. For example, it might be ordinary Euclidean distance (the ℓ² norm), or the ℓ¹ norm, ‖x‖₁ = Σ_{i≤k} |x_i|. The covering numbers for such norms share a common geometric bound.


Write B_R for the ball of radius R centered at the origin. For a fixed ε, with 0 < ε ≤ 1, how many balls of radius εR does it take to cover B_R? Equivalently, what are the packing numbers for B_R?

Let {x₁, . . . , x_N} be any set of points in B_R that is εR-separated. The closed balls B[x_i, εR/2] of radius εR/2 centered at the x_i are disjoint and their union lies within B_{R+εR/2}. Write λ for the Lebesgue measure of the unit ball B₁. Each B[x_i, εR/2] has Lebesgue measure λ(εR/2)^k and B_{R+εR/2} has Lebesgue measure λ(R + εR/2)^k. It follows that

    N ≤ (R + εR/2)^k / (εR/2)^k = ( (2 + ε)/ε )^k ≤ (3/ε)^k.

That is, pack(εR, B_R, d) ≤ (3/ε)^k for 0 < ε ≤ 1, where d denotes the metric corresponding to ‖·‖.
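The volume comparison is easy to check numerically in a small dimension. The sketch below (function names are mine) greedily builds a maximal εR-separated subset of a fine grid inside B_R ⊂ R² under the Euclidean metric; its size is a lower bound for pack(εR, B_R, d), and it should never exceed (3/ε)².

```python
import itertools
import math

def greedy_separated(points, eps):
    """Greedily pick a maximal eps-separated subset (Euclidean metric)."""
    chosen = []
    for p in points:
        if all(math.dist(p, q) > eps for q in chosen):
            chosen.append(p)
    return chosen

R = 1.0
# discretized ball B_R in R^2
grid = [(x, y)
        for x, y in itertools.product([i / 20 for i in range(-20, 21)], repeat=2)
        if x * x + y * y <= R * R]

for eps in (1.0, 0.5, 0.25):
    n = len(greedy_separated(grid, eps * R))
    assert n <= (3 / eps) ** 2   # volume bound with k = 2
    print(eps, n)
```

The greedy count only sees grid points, so it is a lower bound for the true packing number; the volume argument bounds the true packing number from above, so the assertion must hold with room to spare.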

Maximal separated sets also give a natural construction of increasing sequences of approximating subsets. We start with a maximal subset S₀ of points that are δ₀-separated, then enlarge it to a maximal subset S₁ of points that are δ₁-separated, and so on.

If S is finite, the whole procedure must stop after a finite number of steps, leaving S itself as the final approximating set. In any case, we have constructed nested finite subsets S₀ ⊆ S₁ ⊆ S₂ ⊆ S₃ ⊆ . . . and we have a bound on the sizes, #S_i ≤ pack(δ_i, S, d). If δ₀ = diam(S) then S₀ consists of only one point.

The natural choice for ℓ_i is the nearest-neighbor map from S_{i+1} to S_i, for which d(t, ℓ_i(t)) = min_{s∈S_i} d(t, s) ≤ δ_i, with a convention such as choosing the s that appeared earliest in the construction to handle ties.
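The whole construction can be sketched in a few lines for a finite index set on the real line (function names are mine): greedy maximal δ_i-separated subsets S₀ ⊆ S₁ ⊆ · · ·, nearest-neighbor link maps ℓ_i, and a check that every link satisfies d(s, ℓ_i s) ≤ δ_i while every chain reaches S₀.

```python
import random

def enlarge(prev, points, delta):
    """Extend prev to a maximal delta-separated subset of points.
    Scanning in a fixed order also settles nearest-neighbor ties."""
    out = list(prev)
    for p in points:
        if all(abs(p - q) > delta for q in out):
            out.append(p)
    return out

def link(s, prev_level):
    """Nearest-neighbor map l_i : S_{i+1} -> S_i."""
    return min(prev_level, key=lambda q: abs(s - q))

random.seed(0)
S = sorted(random.random() for _ in range(50))   # finite index set in [0, 1]
deltas = [1.0 / 2 ** i for i in range(8)]        # delta_0 >= diam(S), so #S0 == 1

levels = []
for d in deltas:
    levels.append(enlarge(levels[-1] if levels else [], S, d))
levels.append(S)                                 # final level S_m = S

# maximality: every point of S_{i+1} is within delta_i of its link in S_i
for i in range(len(levels) - 1):
    for s in levels[i + 1]:
        assert abs(s - link(s, levels[i])) <= deltas[i]

# follow one chain t = t_m -> t_{m-1} -> ... -> t_0 in S_0
t = S[-1]
chain = [t]
for lev in reversed(levels[:-1]):
    chain.append(link(chain[-1], lev))
assert chain[-1] in levels[0] and len(levels[0]) == 1
```

The greedy scan delivers maximality for free: any point left out of S_i must lie within δ_i of some chosen point, which is exactly the approximation property the chaining argument needs.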

4.3 Bounding sums by integrals

Typically we do not know the exact values of the packing numbers but instead only have an upper bound, pack(ε, S, d) ≤ M(ε). The chaining strategy outlined in Section 4.1 often leads to upper bounds of the form Σ_{i≥k} G(δ_i)δ_{i−1}, with G(·) a decreasing function such as Ψ^{-1}(M(·)).

If the δ_i decrease geometrically, say δ_i = δ₀/2^i, then each summand equals 4G(δ_i)(δ_i − δ_{i+1}), which for decreasing G is at most 4 times the area under G between δ_{i+1} and δ_i. The sum is therefore bounded above by an integral,

    Σ_{i≥k} G(δ_i)δ_{i−1} ≤ 4 Σ_{i≥k} ∫_{δ_{i+1}}^{δ_i} G(r) dr ≤ 4 ∫_0^{δ_k} G(r) dr.


Note the similarity to Dudley's entropy integral assumption <24> in Section 4.6. In the early empirical process literature, some authors chose the δ_i's to make the G(δ_i)'s increase geometrically, for similar reasons.
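The sum-to-integral device is easy to sanity-check numerically. In the sketch below the choice G(r) = √log(1 + 1/r) is mine, mimicking the subgaussian case; the comparison is Σ_{i≥k} G(δ_i)δ_{i−1} against 4∫ G(r) dr for geometrically decreasing δ_i.

```python
import math

delta0 = 1.0
deltas = [delta0 / 2 ** i for i in range(40)]
G = lambda r: math.sqrt(math.log(1 + 1 / r))   # decreasing in r

def integral(f, a, b, n=100000):
    """Crude midpoint rule, adequate for this smoke test."""
    h = (b - a) / n
    return sum(f(a + (j + 0.5) * h) for j in range(n)) * h

for k in (1, 2, 3):
    lhs = sum(G(deltas[i]) * deltas[i - 1] for i in range(k, len(deltas)))
    # lower cutoff deltas[-1]/2 = delta_40 avoids r = 0 and matches the last summand
    rhs = 4 * integral(G, deltas[-1] / 2, deltas[k])
    assert lhs <= rhs
    print(k, lhs, rhs)
```

The truncated sum stands in for the infinite one; the terms below δ_39 are astronomically small for this G, so the comparison is not distorted.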

4.4 Orlicz norms of maxima of finitely many variables

The Orlicz norms are natural candidates for the Φ of Assumption <6>. The next Theorem collects a few simple inequalities well suited to inequality <5>. See Section 4.8 for an example that highlights the relative merits of the inequalities.

<10> Theorem. Let Ψ be a Young function and let X₁, . . . , X_N be random variables with max_{i≤N} ‖X_i‖_Ψ ≤ σ. Define M := max_{i≤N} X_i. Then:

(i) PM ≤ σΨ^{-1}(N);

(ii) P_B M ≤ σΨ^{-1}(N/PB) for each event B with PB > 0;

(iii) if Ψ(x)Ψ(y) ≤ Ψ(xy) whenever Ψ(x)Ψ(y) ≥ 1, then ‖M‖_Ψ ≤ 2σΨ^{-1}(N);

(iv) if Ψ(x)Ψ(y) ≤ K₀Ψ(c₀xy) whenever xy ≥ c₁, then ‖M‖_Ψ ≤ C₀σΨ^{-1}(N), for a constant C₀ that depends only on K₀, c₀, and c₁.

Remark. For each p ≥ 1 the Young function Ψ(t) = t^p satisfies condition (iv) with equality when K₀ = c₀ = 1 and c₁ = 0. Problem [??] shows that each of the functions Ψ_γ from Section 2.3, which grow like exp(t^γ), also satisfies (iv).


Proof. Without loss of generality suppose σ = 1 and each X_i is nonnegative, so that PΨ(X_i) ≤ 1 for each i.

Even though assertion (i) is a special case of assertion (ii), I think it helps understanding to see both proofs. For (i), Jensen's inequality then monotonicity of Ψ imply

    Ψ(PM) ≤ P max_{i≤N} Ψ(X_i) ≤ Σ_{i≤N} PΨ(X_i) ≤ N,

so that PM ≤ Ψ^{-1}(N).

For (ii), partition B into disjoint events B₁, . . . , B_N with M = X_i on B_i. Without loss of generality assume PB_i > 0 for each i. (Alternatively, just discard those B_i with zero probability.) By definition,

    P_B M = P_B Σ_i 1_{B_i} X_i = Σ_i (PB_i/PB) P_{B_i} X_i.

By Jensen's inequality,

    Ψ(P_{B_i} X_i) ≤ P_{B_i} Ψ(X_i) = P( 1_{B_i} Ψ(X_i) ) / PB_i.

Thus

    Ψ(P_B M) = Ψ( Σ_i (PB_i/PB) P_{B_i} X_i )
        ≤ Σ_i (PB_i/PB) Ψ(P_{B_i} X_i)   by convexity of Ψ
        ≤ Σ_i (PB_i/PB)(1/PB_i) P( 1_{B_i} Ψ(X_i) ) ≤ N/PB,

so that P_B M ≤ Ψ^{-1}(N/PB). Perhaps it would be better to write the last equality as an inequality, to cover the case where some PB_i are zero.

For (iii), let τ = Ψ^{-1}(N), so that Ψ(τ) = N. Trivially,

    1{Ψ(M/τ) ≤ 1} Ψ(M/τ) ≤ 1,

and, by the assumed property of Ψ (applicable because Ψ(M/τ)Ψ(τ) ≥ 1 on the other event),

    1{Ψ(M/τ) > 1} Ψ(M/τ) Ψ(τ) ≤ Ψ(M) ≤ Σ_i Ψ(X_i).

Divide both sides of the second inequality by N = Ψ(τ), add the two contributions, then take expectations to deduce that

    PΨ(M/τ) ≤ 1 + (1/N) Σ_i PΨ(X_i) ≤ 2,


so that, by convexity, PΨ(M/(2τ)) ≤ 1 and ‖M‖_Ψ ≤ 2Ψ^{-1}(N).

For (iv), first note that K₀Ψ(x) ≤ Ψ(K₁x) for K₁ = max(1, K₀), by convexity. If L ≥ 1 the assumed bound gives

    Ψ(x)Ψ(y) ≤ K₀Ψ(c₀xy) ≤ Ψ(c₀K₁xy)   whenever xy ≥ c₁.

Choose L ≥ 1 large enough that Ψ(x)Ψ(y) ≥ 1 forces the rescaled arguments to satisfy the constraint xy ≥ c₁. A suitably rescaled version Ψ̃ of Ψ, with argument divided by c₀K₁L², then satisfies the hypothesis of part (iii), and its Orlicz norm ‖·‖_{Ψ̃} differs from ‖·‖_Ψ by at most the factor c₀K₁L for each random variable Y. Part (iii) applied to Ψ̃, together with this equivalence of norms, delivers ‖M‖_Ψ ≤ C₀Ψ^{-1}(N).

4.5 Chaining with norms and packing numbers

Consider a process {X_t : t ∈ T} indexed by a (necessarily totally bounded) metric space for which we have some control over the packing numbers. Instead of working via an assumption like ‖X_s − X_t‖ ≤ Cd(s, t), combined with one of the bounds from Section 4.4, I'll cut out the middle man by assuming directly that there is a norm Φ that satisfies Assumption <6> and an increasing function H : N → R⁺ for which

<11>    Φ( max_{i≤N} |X_{s_i} − X_{t_i}| / d(s_i, t_i) ) ≤ H(N)

for all finite sets of increments with d(s_i, t_i) > 0. For example, if Φ were expected value and ‖X_s − X_t‖_{Ψ₂} ≤ d(s, t) then we could take H(N) = Ψ₂^{-1}(N) = √log(1 + N).

<12> Lemma. Suppose S₀ ⊆ S₁ ⊆ · · · ⊆ S_m are finite subsets of T for which d(s, S_i) ≤ δ_i for all s ∈ S_{i+1} and #S_i ≤ N_i. Then, in the chaining notation from <2>,

    Φ( max_{t∈S_m} |X(t) − X(L₀t)| ) ≤ Σ_{i=0}^{m−1} δ_i H(N_{i+1}),

and, when δ_i = δ₀/2^i with N_i = pack(δ_i, T, d),

    Φ( max_{t∈S_m} |X(t) − X(L₀t)| ) ≤ 4 ∫_{δ_{m+1}}^{δ_1} H( pack(r, T, d) ) dr.


Proof. The first inequality results from applying Φ to both sides of the inequality

    max_{t∈S_m} |X(t) − X(L₀t)| ≤ Σ_{i=0}^{m−1} δ_i max_{s∈S_{i+1}} |X(s) − X(ℓ_i s)| / d(s, ℓ_i s),

then invoking <11> for each i. The second follows from the first by the sum-to-integral bound of Section 4.3.

<13> Example. Suppose the {X_t : t ∈ T} process has subgaussian increments with ‖X_s − X_t‖_{Ψ₂} ≤ d(s, t). The function Ψ₂(t) = e^{t²} − 1 satisfies the assumptions of Theorem <10> part (iv). Inequality <11> holds with Φ equal to the ‖·‖_{Ψ₂} norm and H(N) a constant multiple of Ψ₂^{-1}(N) = √log(1 + N).

Because pack(r, T, d) = 1 for all r > diam(T) we may as well assume δ₀ ≤ diam(T), in which case pack(r, T, d) ≥ 2 for all r ≤ δ₁. That lets us absorb the pesky 1+ from the log(1 + ·) into the constant, leaving

    ‖ max_{t∈S_m} |X(t) − X(L₀t)| ‖_{Ψ₂} ≤ C ∫_0^{δ₁} √( log pack(r, T, d) ) dr.

For each k ≥ 1 there is a constant C_k for which ‖Z‖_k ≤ C_k‖Z‖_{Ψ₂} (Section 2.4). Thus the upper bounds for the ‖·‖_{Ψ₂} norm of max_{t∈S_m} |X(t) − X(L₀t)| automatically imply similar upper bounds for its ‖·‖_k norms. In particular, both Theorems 3.2 and 3.3 in the expository paper of Pollard (1989) follow from the result for the Ψ₂-norm.

Lemma <12> captures the main idea for chaining with norms. The next Theorem uses the example of oscillation control to show why the uniform approximation of {X_s : s ∈ S_m} by {X_s : s ∈ S₀} is such a powerful tool.

<14> Theorem. Suppose the process {X_t : t ∈ T} and the norm Φ satisfy <11>, that is, Φ(max_{i≤N} |X_{s_i} − X_{t_i}|/d(s_i, t_i)) ≤ H(N) for all finite sets of increments. Suppose also that

    ∫_0^D H( pack(r, T, d) ) dr < ∞   where D := diam(T).

Then for each ε > 0 there exists a δ > 0 for which Φ(osc(δ, X, S)) < ε for every finite subset S of T.


Proof. Choose δ₀ > 0 for which

<15>    4 ∫_0^{δ₀} H( pack(r, T, d) ) dr < ε/5.

Construct, as in Section 4.2, nested δ_i-separated sets S₀ ⊆ · · · ⊆ S_m = S with δ_i = δ₀/2^i.

Define Δ := max_{t∈S_m} |X(t) − X(L₀t)|. Lemma <12> gives Φ(Δ) < ε/5. The value N₀ = #S₀, which only depends on δ₀, is now fixed.

There might be many pairs s, t in S_m for which d(s, t) < δ, but they correspond to at most N₀² pairs L₀s, L₀t in S₀, for which

<16>    d(L₀s, L₀t) ≤ Σ_{i=0}^{m−1} d(L_{i+1}s, L_i s) + d(s, t) + Σ_{i=0}^{m−1} d(L_{i+1}t, L_i t) ≤ 4δ₀ + δ.

For the subgaussian case, where H(N) grows like √log N, we could afford to invoke a finite maximal inequality to control the contributions from the pairs in S₀ by a constant multiple of H(N₀²)(4δ₀ + δ), which could be controlled by <15> if δ ≤ δ₀, because H(N₀²) ≤ constant × H(N₀). Without such behavior for H we would need something stronger than <15>.

For general H(·) it pays to make a few more trips up and down the chains, using a clever idea of Ledoux and Talagrand (1991, Theorem 11.6).

The map L₀ defines an equivalence relation on S_m, with t ∼ t′ if and only if L₀t = L₀t′. The corresponding equivalence classes define a partition π_m of S_m into at most N₀ subsets. For each distinct pair E, F from π_m choose points t_{E,F} ∈ E and t_{F,E} ∈ F for which d(t_{E,F}, t_{F,E}) = d(E, F) := min{d(s, t) : s ∈ E, t ∈ F},


then define

    Γ := max{ |X(t_{E,F}) − X(t_{F,E})| / d(E, F) : E, F ∈ π_m, E ≠ F }.

Assumption <11> implies Φ(Γ) ≤ H(N₀²).

Remark. The definition of Γ might be awkward if d were a semi-metric and d(E, F) = 0 for some pair E ≠ F. By working with equivalence classes we avoid such awkwardness.

If s and t belong to the same equivalence class, that is L₀s = L₀t, we have

    |X(s) − X(t)| ≤ |X(s) − X(L₀s)| + 0 + |X(L₀t) − X(t)| ≤ 2Δ.

If s and t belong to different sets, E and F, from π_m then L₀s = L₀t_{E,F} and L₀t = L₀t_{F,E}. We also have d(E, F) ≤ d(s, t) < δ, so that |X(t_{E,F}) − X(t_{F,E})| ≤ Γδ and |X(s) − X(t)| is bounded by

    |X(s) − X(L₀s)| + |X(t_{E,F}) − X(L₀t_{E,F})| + |X(t_{E,F}) − X(t_{F,E})| + |X(t_{F,E}) − X(L₀t_{F,E})| + |X(t) − X(L₀t)| ≤ 4Δ + Γδ.

The upper bound does not depend on the particular s, t pair for which d(s, t) < δ. It follows that

    Φ( osc(δ, X, S) ) ≤ 4ε/5 + δ H(N₀²) < ε   for δ small enough.

The upper bound has no hidden dependence on the size of S.

The final Example shows that even very weak control over the increments of a process can lead to powerful inequalities if the index set has small packing numbers. The Example makes use of some properties of the Hellinger distance h(P, Q) between two probability measures P and Q, defined by

    h²(P, Q) = ∫ ( √(dP/dμ) − √(dQ/dμ) )² dμ = 2 − 2 ∫ √( (dP/dμ)(dQ/dμ) ) dμ,

for any measure μ dominating both P and Q. For product measures the distance reduces to a simple expression involving the marginals,

<17>    h²(Pⁿ, Qⁿ) = 2 − 2( 1 − ½h²(P, Q) )ⁿ ≤ n h²(P, Q).

See, for example, Pollard (2001, Problem 4.18).


<18> Example. Consider a family of probability measures {P_θ : θ ∈ [0, 1]} with densities p(x, θ) with respect to a fixed measure μ on some set X. Let P_{θ,n} = P_θ ⊗ · · · ⊗ P_θ and μ_n = μ ⊗ · · · ⊗ μ, both n-fold product measures.

For a fixed θ₀, abbreviate P_{θ₀,n} to P. Under P_{θ₀,n} the coordinates of the typical point x = (x₁, . . . , x_n) in Xⁿ become identically distributed random variables, each with distribution P_{θ₀}.

To avoid distracting measurability issues, I suppose θ ↦ p(x, θ) is continuous. I also ignore irrelevant questions regarding uniqueness and measurability of the maximum likelihood estimator (MLE) θ̂_n = θ̂_n(x₁, . . . , x_n), which by definition maximizes the likelihood process L_n(θ) := Π_{i≤n} p(x_i, θ).

Suppose there exist positive constants C₁ and C₂ for which

<19>    C₁|θ₁ − θ₂| ≤ h(P_{θ₁}, P_{θ₂}) ≤ C₂|θ₁ − θ₂|   for all θ₁, θ₂ in [0, 1].

Remark. Notice that such an inequality can hold only for a bounded parameter set, because h is a bounded metric. Typically it would follow from a slight strengthening of a Hellinger differentiability condition. In a sense, the right metric to use is h. The upper bound in Assumption <19> ensures that the packing numbers under h behave like packing numbers under ordinary Euclidean distance. The lower bound ensures that the expected value of the standardized likelihood-ratio process decays rapidly as t increases; see inequality <22> below.

Then a chaining argument will show that, for each θ₀ in [0, 1], the estimator θ̂_n converges at an n^{-1/2}-rate to θ₀. More precisely, there exist constants C₃ and C₄ for which

<20>    P_{θ₀,n}{ √n |θ̂_n − θ₀| ≥ y } ≤ C₃ exp(−C₄y²)   for all y ≥ 0 and all n.

The MLE also maximizes the square root of the likelihood process. The standardized estimator t̂_n = √n(θ̂_n − θ₀) maximizes, over the interval T_n = {t ∈ R : θ₀ + t/√n ∈ [0, 1]}, the process

    Z_n(t) = √( L_n(θ₀ + t/√n) / L_n(θ₀) ) = Π_{i≤n} ξ(x_i, t)   where ξ(z, t) := √( p(z, θ₀ + t/√n) / p(z, θ₀) ).

Remark. To guard against division by zero, and with no effect on the final bound, one should include an indicator function {z ∈ S}, where S = {p(·, θ₀) > 0}, in the definition of ξ. And a few equalities should actually be inequalities.


By definition Z_n(t̂_n) ≥ Z_n(0) = 1, which implies for each y₀ ≥ 0 that

    P_{θ₀,n}{ |t̂_n| ≥ y₀ } ≤ P_{θ₀,n}{ sup_{|t|≥y₀} Z_n(t) ≥ 1 } ≤ P_{θ₀,n} sup_{|t|≥y₀} Z_n(t).

The last expected value involves a supremum over an uncountable set, which sample path continuity allows us to calculate as a limit of maxima over finite sets.

Let me show how Lemma <12> handles half the range, sup_{t≥y₀} Z_n(t). The argument for the other half is analogous.

To simplify notation, abbreviate P_{θ₀,n} to P, with θ₀ fixed for the rest of the Example. Split the set {t ∈ T_n : t ≥ y₀} into a union of intervals J_k := [y_k, y_{k+1}) ∩ T_n, where y_k = y₀ + k for k ∈ N. Then

<21>    P sup_{t≥y₀} Z_n(t) ≤ Σ_{k≥0} P sup_{t∈J_k} Z_n(t).

Inequality <19> provides the means for bounding the kth term in the sum <21>. The lower bound from <19> gives us some control over the expected value of Z_n:

    PZ_n(t) ≤ ( ∫ √( p(x, θ₀ + t/√n) p(x, θ₀) ) μ(dx) )ⁿ
            = ( 1 − ½h²(P_{θ₀}, P_{θ₀+t/√n}) )ⁿ
<22>        ≤ exp(−½C₁²t²)

for each t in T_n. The upper bound in <19> gives some control over the increments of the Z_n process: for t₁, t₂ ∈ T_n,

    P|Z_n(t₁) − Z_n(t₂)|² ≤ μⁿ| √L_n(θ₁) − √L_n(θ₂) |²   where θ_i = θ₀ + t_i/√n
        = h²(P_{θ₁}ⁿ, P_{θ₂}ⁿ) ≤ C₂²|t₁ − t₂|²   by <17> and <19>.

With only this second-moment control, inequality <11> holds with Φ equal to expected value and

<23>    P max_{i≤N} |Z_n(s_i) − Z_n(t_i)| / |s_i − t_i| ≤ H(N) := C₂√N.


Consider the kth summand in <21>. With d as the usual Euclidean metric, J_k has diameter at most 1 and pack(r, J_k, d) ≤ 1/r for r ≤ 1. For a δ₀ that needs to be specified, invoke Lemma <12> then pass to the limit as S_m expands up to a countable dense subset of J_k to get

    P sup_{t∈J_k} Z_n(t) ≤ P max_{t∈S₀} Z_n(t) + 4C₂ ∫_0^{δ₁} √(1/r) dr
        ≤ Σ_{t∈S₀} e^{−C₁²t²/2} + 8C₂√δ₁
        ≤ (1/δ₀) e^{−C₁²y_k²/2} + 8C₂√δ₀,

because #S₀ ≤ 1/δ₀ and every t in S₀ ⊆ J_k is at least y_k. The last bound is approximately minimized by choosing δ₀ = e^{−C₁²y_k²/3}, which leads to

    P sup_{t≥y₀} Z_n(t) ≤ Σ_{k≥0} (1 + 8C₂) exp( −C₁²(y₀ + k)²/6 ).

After some fiddling around with constants, the bound <20> emerges.

4.6 An alternative to packing: majorizing measures

The modern theory owes much to the work of Dudley (1973), who applied chaining arguments to establish fine control over the sample paths of the so-called isonormal Gaussian process, and Dudley (1978), who initiated the modern theory of empirical processes by adapting the Gaussian methods to processes more general than the classical empirical distribution function. In both cases he constrained the complexity of the index set by using bounds on the covering numbers N(ε) = N_T(ε, T, d). For example, for a Gaussian process he proved existence of a version with continuous sample paths under the assumption that

    ∫_0^D √( log N(r) ) dr < ∞   where D = diam(T), the diameter of T.

Up to a constant factor, the integrand can also be written using the inverse of a Young function,

<24>    ∫_0^D Ψ₂^{-1}( N_T(r) ) dr < ∞   where Ψ₂(x) = exp(x²) − 1,

which highlights the role of the Ψ₂-Orlicz norm in controlling the increments of Gaussian processes.


The entropy condition could be weakened slightly to get improved bounds, which in some cases are optimal up to multiplicative constants. As originally formulated, the improvements assumed existence of a majorizing measure μ for a Young function Ψ (= Ψ₂ for the Gaussian case): a probability measure μ on the Borel sigma-field of T for which

<25>    sup_{t∈T} ∫_0^D Ψ^{-1}( 1/μB[t, r] ) dr < ∞,

where B[t, r] denotes the closed ball of radius r and center t.

Remark. At first it can seem puzzling that a single μ could control the sample paths of a stochastic process. It seemed not at all obvious that μ could be used to replace the familiar constructions with covering numbers. It helped me to dig inside some of the proofs to find that the key fact is: if B₁, . . . , B_n are disjoint Borel sets with μB_i ≥ ε for each i then n ≤ 1/ε. The same idea appears in the very classical setting of Example <9>, where Lebesgue measure on R^k plays a similar role in bounding covering numbers.

Fernique (1975) proved many results about the sample paths of a centered Gaussian process X. For example, he proved (his Section 6) a result stronger than: existence of a majorizing measure implies that

<26>    sup_{t∈T} |X(ω, t)| < ∞   for almost all ω.

He also gave conditions that make almost all sample paths uniformly continuous.

Talagrand (1987) brilliantly demonstrated the central role of majorizing measures by showing that a condition like <26> implies existence of some probability μ for which <25> holds with d(s, t) = ‖X_s − X_t‖₂. He subsequently showed (Talagrand, 2001) that the majorizing measures work their magic by generating a nested sequence {π_i} of finite partitions of T, which can be used for chaining arguments analogous to those used by Dudley. (See subsection 4.7.2 and Chapter 7 for details.) Each π_{i+1} is a refinement of the previous π_i. Each point t of T belongs to exactly one member, denoted by E_i(t), of π_i. Talagrand allowed the size #π_i to grow very rapidly. For Young functions Ψ that grow like exp(x^γ) for some γ > 0, he showed that <25> is equivalent to the existence of partitions π_i with #π_i ≤ n_i := 2^{2^i} for which

<27>    sup_{t∈T} Σ_{i∈N} Ψ^{-1}(n_i) diam( E_i(t) ) < ∞.


More precisely, the ratio between

(a) the infimum of <27> over all partitions π_i with #π_i ≤ n_i, and

(b) the infimum of the integral in <25> over all majorizing measures

is bounded above and below by positive constants that depend only on Ψ.

A key feature of the partition construction, in contrast to the analogous constructions based on covering numbers, is that the uniform bound on the sum does not come via a bound on max_{E∈π_i} diam(E).

As Talagrand (2005, bottom of page 13) has pointed out, such

uniform control of the diameters is not the best way to proceed, because

it tacitly assumes some sort of homogeneity of complexity over different

subregions of T . By contrast, the partitions constructed from majorizing

measures are much more adaptive to local complexities, subdividing

more finely in regions where the stochastic process fluctuates more

wildly.

Talagrand later refined the correspondence between majorizing measures and partitions for nongaussian processes. He has even declared (Talagrand, 2005, Section 1.4) that the measures themselves are now totally obsolete, with everything one needs to know coming from the nested partitions.

Despite the convincing case made by Talagrand for majorizing measures

without measures, I still think there might be a role for the measures them-

selves in some statistical applications. And in any case, it does help to

know something about how an idea evolved when struggling with its fancier

modern refinements.

4.7 Chaining with tail probabilities

Start once again from a sequence of finite subsets S₀ ⊆ S₁ ⊆ · · · ⊆ S_m, with #S_i = N_i and the chains defined as in Section 4.1. Again we have a decomposition for each t in S_m,

    X(t) − X(t₀) = Σ_{i=0}^{m−1} ( X(t_{i+1}) − X(t_i) )   with t_i = L_i(t).


Suppose the standardized increments satisfy a tail bound P{|X(s) − X(ℓ_i s)| ≥ εΔ_i(s)} ≤ Ψ̄(ε), with Ψ̄ decreasing. Define

    M_{i+1} := max_{s∈S_{i+1}} |X(s) − X(ℓ_i s)| / Δ_i(s)   where Δ_i(s) := d(s, ℓ_i s),

with the convention that any terms for which Δ_i(s) = 0 are omitted from the max; it matters only that |X(s) − X(ℓ_i s)| ≤ M_{i+1}Δ_i(s) for all s ∈ S_{i+1}. For constants ε_i > 0 and i ≥ 0,

    P{M_{i+1} ≥ ε_i} ≤ Σ_{s∈S_{i+1}} P{ |X(s) − X(ℓ_i s)| ≥ ε_iΔ_i(s) } ≤ N_{i+1}Ψ̄(ε_i).

Define B := ⋃_{i<m}{M_{i+1} ≥ ε_i}, so that PB ≤ Σ_{i=0}^{m−1} N_{i+1}Ψ̄(ε_i).

On the event B^c we have, for each t ∈ S_m,

    |X(t) − X(L₀t)| ≤ Σ_{i=0}^{m−1} ε_iΔ_i( L_{i+1}t ),

and hence

<28>    P{ ∃t ∈ S_m : |X(t) − X(L₀t)| > Σ_{i=0}^{m−1} ε_iΔ_i(L_{i+1}t) } ≤ Σ_{i=0}^{m−1} N_{i+1}Ψ̄(ε_i).

Remark. Each summand on the right-hand side of <28> could be replaced, before bounding the probability of a union, by min(1, N_{i+1}Ψ̄(ε_i)). Of course such a modification typically gains us nothing because <28> is trivial if even one of the summands on the right-hand side equals 1. Nevertheless, the modification does eliminate some pesky trivial cases from the following arguments.

Now comes the clever part: how should we choose the S_i's and the ε_i's? I do not know any systematic way to handle that question, but I do know some heuristics that have proven useful.

Let me attempt to explain for the specific case of Ψ_γ-norms, for γ > 0, with

<29>    Ψ_γ(x) := exp(x^γ) − 1,

first for the traditional approach based on packing numbers, then for the majorizing measure alternative.

Remark. You should not be misled by what follows into believing that

all chaining arguments with tail probabilities involve a lot of tedious

fiddling. I present some of the messy details just because so few authors

seem to explain the reasons for their clever choices of constants.


4.7.1 Tail chaining via packing

Choose the S_i as maximal δ_i-separated subsets with, for example, δ_i = δ₀/2^i. That ensures N_i ≤ pack(δ_i, T, d) and max_t Δ_i(t) ≤ δ_i. Inequality <28> becomes

<30>    P{ ∃t ∈ S_m : |X(t) − X(L₀t)| > Σ_{i=0}^{m−1} ε_iδ_i } ≤ Σ_{i=0}^{m−1} N_{i+1}Ψ̄(ε_i),

that is,

    P{ max_{t∈S_m} |X(t) − X(L₀t)| > Σ_{i=0}^{m−1} ε_iδ_i } ≤ Σ_{i=0}^{m−1} N_{i+1}Ψ̄(ε_i).

We could choose the $\{\eta_i\}$ to control $\sum_i \eta_i\delta_i$, then hope for a useful bound on $\sum_i N_{i+1}\beta(\eta_i)$; or we could choose them to control $\sum_i N_{i+1}\beta(\eta_i)$, hoping for a useful $\sum_i \eta_i\delta_i$. Here are some heuristics, for $\beta$ as in <29>.

To make the right-hand side of <30> small we should expect $\sum_{i=0}^{m-1}\eta_i\delta_i$ to be larger than the expected value of $\max_{t \in S_m}\Delta_m(t)$, which an analog of Lemma <12> with $H = C_1\Psi^{-1}$ bounds by a constant multiple of $\sum_{i=0}^{m-1}\Psi^{-1}(N_{i+1})\,\delta_i$. That suggests we should make $\eta_i$ a bit bigger than $\Psi^{-1}(N_{i+1})$. Try
$$\eta_i = \Psi^{-1}\bigl(N_{i+1}\Psi(y_i)\bigr) \le c\bigl(\Psi^{-1}(N_{i+1}) + y_i\bigr) \qquad\text{for some } y_i > 0.$$

I write the extra factor in that strange way because it gives a clean bound,
$$\min\bigl(1, N_{i+1}\beta(\eta_i)\bigr) = \min\bigl(1, N_{i+1}/\Psi(\eta_i)\bigr) \le \min\bigl(1, 1/\Psi(y_i)\bigr) \le C\,e^{-y_i^\alpha}.$$
Inequality <30> then implies
$$\text{<31>}\qquad P\{\max_t \Delta_m(t) > c\sum_{i=0}^{m-1}\bigl(\Psi^{-1}(N_{i+1}) + y_i\bigr)\delta_i\} \le C\sum_{i=0}^{m-1} e^{-y_i^\alpha}.$$

Now we need to choose a $\{y_i\}$ sequence to make the sums tractable, then maybe bound sums by integrals to get neater expressions, and so on. If you find such games amusing, look at Problem [2], which guides you to a more elegant bound. For the special case of subgaussian increments it gives the existence of positive constants $K_1$, $K_2$, and $c$ for which
$$P\{\max_t \Delta_m(t) > K_1\int_0^{\delta_0}\Bigl(y + \sqrt{\log \mathrm{pack}(r,T,d)} + \sqrt{\log(1/r)}\Bigr)\,dr\} \le K_2\,e^{-cy^2}.$$

Remark. The integral of $\sqrt{\log \mathrm{pack}(r,T,d)}$ is a surrogate for $P\max_t \Delta_m(t)$. The presence of a factor $K_1$, which might be much larger than 1, disappoints. With a lot more effort, and sharper techniques, one can get bounds for $P\{\max_t \Delta_m(t) \ge P\max_t \Delta_m(t) + \dots\}$. See, for example, Massart (2000), who showed how tensorization methods (see Chapter 12) can be used to rederive a concentration inequality due to Talagrand (1996).
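For a concrete instance, here is a numerical sketch (my own illustration, not from the text) of the entropy integral for $T = [0,1]$ under the Brownian metric $d_2(s,t) = \sqrt{|s-t|}$ used later in Section 4.8. An $r$-separated set in $d_2$ is $r^2$-separated in $|s-t|$, so $\mathrm{pack}(r, T, d_2) \le \lfloor 1/r^2\rfloor + 1$, an assumed bound for this sketch; the integrand grows only like $\sqrt{2\log(1/r)}$ near 0, so the integral converges.

```python
import math

# Numerical sketch (my illustration): the subgaussian entropy integral for
# T = [0,1] under d_2(s,t) = sqrt(|s-t|).  An r-separated set in d_2 is
# r^2-separated in |s-t|, so pack(r, T, d_2) <= floor(1/r^2) + 1.

def pack(r):
    return math.floor(1.0 / r ** 2) + 1

n = 100_000
h = 1.0 / n
# midpoint rule on (0, 1]; the integrand blows up at 0 but stays integrable
integral = sum(math.sqrt(math.log(pack((k + 0.5) * h))) * h for k in range(n))
```

The value is a small constant (roughly 1.4), which is why the chaining bound for Brownian motion comes out finite and of the right order.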


4.7.2 Tail chaining via majorizing measures

The calculations depend on two simplifying facts about $\Psi = \Psi_\alpha$:

(i) $\Psi(x) \le \exp(x^\alpha)$ for all $x > 0$, so $\Psi^{-1}(y) \ge \log^{1/\alpha}(y)$ for all $y > 0$;

(ii) there exists an $x_0$ for which $\Psi(x) \ge \tfrac12\exp(x^\alpha)$ for all $x \ge x_0$, so there exists a $y_0$ for which $\Psi^{-1}(y) \le \log^{1/\alpha}(2y)$ for all $y \ge y_0$.

Suppose, as in Section 4.6, we have constructed a nested sequence of partitions $\pi_i$ of the index set $T$ with $\#\pi_i \le N_i := 2^{2^i}$ such that
$$K_X := \sup_{t \in T}\sum_{i \in \mathbb{N}} \Psi^{-1}(N_{i+1})\,\mathrm{diam}\bigl(E_i(t)\bigr) < \infty,$$
where $E_i(t)$ denotes the unique member of $\pi_i$ that contains the index point $t$.

Remark. If you check back you might notice that originally the $i$th summand contained $\Psi^{-1}(N_i)$. The change has only a trivial effect on the sum because $\Psi^{-1}(y)$ grows like $\log^{1/\alpha}(y)$.

Choose $S_i$ to contain one point $t_E$ from each member $E$ of $\pi_i$. Clearly $\#S_i = \#\pi_i$. With a little care we can also ensure $S_i \subseteq S_{i+1}$: if $F \in \pi_i$ and $t_F \in E \in \pi_{i+1}$ then make sure to choose $t_E = t_F$. The link function $\ell_i$ maps each $t_E$ in $S_{i+1}$ onto the $t_F$ for which $E \subseteq F \in \pi_i$.

[Figure: the nested partitions $\pi_0, \pi_1, \pi_2, \pi_3$ with their representative sets $S_0, S_1, S_2, S_3$.]

Remark. We could also ensure, for any given finite subset $S$ of $T$, that $S_0 \subseteq S_1 \subseteq \cdots \subseteq S_m = S$, for some $m$.

The chain for each $t$ in $S_m$ corresponds to a path through the corresponding sets $E_m(t) \subseteq E_{m-1}(t) \subseteq \cdots \subseteq E_0(t)$ from the partitions. The diameter of the set $E_i(t)$ provides an upper bound for the link length: $\delta_i(t) = d(L_{i+1}(t), L_i(t)) \le \mathrm{diam}(E_i(t))$. If we choose $\eta_i = y\,\Psi^{-1}(N_{i+1})$ with $y$ large enough, then $\sum_{i=0}^{m-1}\eta_i\,\delta_i(t) \le yK_X$ and, by facts (i) and (ii),
$$\Psi(\eta_i) \ge \Psi\bigl(y\log^{1/\alpha}(N_{i+1})\bigr) \ge \tfrac12\exp\bigl(y^\alpha\log N_{i+1}\bigr).$$
Inequality <28> then gives
$$P\{\exists t \in S_m : \Delta_m(t) > yK_X\} \le \sum_{i=0}^{m-1} N_{i+1}/\Psi(\eta_i) \le \sum_{i=0}^{m-1} 2\exp\bigl((1 - y^\alpha)\log N_{i+1}\bigr) \le k_\alpha\exp(-y^\alpha/2)$$
for some constant $k_\alpha$, if $y^\alpha \ge 2$.

With a suitable increase in $k_\alpha$ the final inequality holds for all $y \ge 0$. That is,
$$P\{\max_t |X(t) - X(L_0 t)| > yK_X\} \le k_\alpha\,e^{-y^\alpha/2} \qquad\text{for all } y \ge 0.$$

Remark. When integrated $\int_0^\infty \dots dy$, the last inequality implies that $P\max_{t \in S} X_t$ is bounded above, uniformly over all finite subsets $S$ of $T$, by a constant times $K_X$. See Chapter 7 for a converse when $X$ is a centered Gaussian process.
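As a numerical sketch (my own illustration, not from the text): for $T = [0,1]$ under $d_2(s,t) = \sqrt{|s-t|}$, take $\pi_i$ to be the partition into $N_i = 2^{2^i}$ equal intervals, so each cell has $d_2$-diameter $(1/N_i)^{1/2}$, the same for every $t$; the supremum over $t$ in the definition of $K_X$ then reduces to a single rapidly converging sum.

```python
import math

# Numerical sketch (my illustration): K_X for T = [0,1], d_2(s,t) = sqrt(|s-t|),
# with pi_i the partition of [0,1] into N_i = 2^(2^i) equal intervals.

def psi2_inv(y):
    # inverse of Psi_2(x) = exp(x^2) - 1; y may be a big Python int,
    # which math.log handles without converting it to a float first
    return math.sqrt(math.log(y + 1))

K_X = 0.0
for i in range(8):
    N_next = 2 ** (2 ** (i + 1))        # cardinality bound for pi_{i+1}
    diam_i = (2 ** (2 ** i)) ** -0.5    # d_2-diameter of a cell of pi_i
    K_X += psi2_inv(N_next) * diam_i
```

The doubly exponential growth of $N_i$ makes the diameters shrink much faster than $\Psi^{-1}(N_{i+1})$ grows: the terms are already negligible by $i = 5$, and the sum settles near 2.55.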

4.8 Brownian Motion, for example

This Section provides one way of comparing the merits of the various chaining methods described in Sections 4.5 and 4.7, using the Brownian motion process $\{X_t : 0 \le t \le 1\}$ as a test case. Remember that $X$ is a stochastic process for which:

(i) $X_0 = 0$;

(ii) each increment $X_s - X_t$ has a $N(0, |s-t|)$ distribution;

(iii) for all $0 \le s_1 < t_1 \le s_2 < t_2 \le \cdots \le s_n < t_n \le 1$ the increments $X(t_i) - X(s_i)$ are independent;

(iv) each sample path $X(\cdot, \omega)$ is a continuous function on $[0,1]$.

For current purposes it suffices to know that
$$\text{<32>}\qquad P\{|X_s - X_t| \ge \eta\sqrt{|s-t|}\} \le 2e^{-\eta^2/2} \qquad\text{for } \eta \ge 0,$$
$$\text{<33>}\qquad \|X_s - X_t\|_{\Psi_2} \le c_0\sqrt{|s-t|} \qquad\text{where } c_0 = \sqrt{8/3}.$$


The oscillation (also known as the modulus of continuity) is defined by
$$\mathrm{osc}_1(\delta) := \sup\{|X_s - X_t| : d_1(s,t) \le \delta,\ s,t \in [0,1]\}$$
for the usual metric $d_1(s,t) := |s-t|$ on $[0,1]$. Inequalities <32> and <33> suggest that it would be cleaner to work with the metric $d_2(s,t) := \sqrt{|s-t|}$, with the corresponding oscillation $\mathrm{osc}_2$ defined analogously.

Remark. Note that $d_2(s,t)$ is just the $L^2(\mathrm{Leb})$ distance between the indicator functions of the intervals $[0,s]$ and $[0,t]$.

Lévy's theorem describes the exact behavior of the oscillation:
$$\text{<34>}\qquad \lim_{\delta \to 0}\frac{\mathrm{osc}_1(\delta)}{h_1(\delta)} = 1 \quad\text{almost surely, where } h_1(\delta) = \sqrt{2\delta\log(1/\delta)}.$$
See McKean (1969, Section 1.6) for a detailed proof (which is similar in spirit to the tail chaining described in subsection 4.8.3) of Lévy's theorem. Under the $d_2$ metric the result becomes
$$\lim_{\delta \to 0}\frac{\mathrm{osc}_2(\delta)}{h_2(\delta)} = 1 \quad\text{almost surely, where } h_2(\delta) = h_1(\delta^2) = 2\delta\sqrt{\log(1/\delta)}.$$

To avoid too much detail I will settle for something less than <34>, namely upper bounds that recover the $O(h_2(\delta))$ behavior. For reasons that should soon be clear, it simplifies matters to replace the function $h_2$ by the increasing function (see Problem [5])
$$h(\delta) := \delta\,\Psi_2^{-1}(1/\delta^2) = \delta\sqrt{\log(1 + \delta^{-2})}.$$

By continuity of the sample paths, $\mathrm{osc}(\delta, S_m, d_2) \uparrow \mathrm{osc}_2(\delta)$ as $m \to \infty$ for the finite sets $S_m$ defined below, so it suffices to obtain probabilistic bounds that hold uniformly in $m$ before passing to the limit as $m$ tends to infinity.

Define $\delta_i = 2^{-i}$ and $s_{j,i} = j\delta_i^2$ and $S_i = \{s_{j,i} : j = 0, 1, \dots, 4^i - 1\}$, so that $N_i = \#S_i = 4^i = (1/\delta_i)^2$. The function $\ell_i$ that rounds down to an integer multiple of $\delta_i^2$ has the property that $d_2(s, \ell_i s) \le \delta_i$ for all $s \in [0,1]$.
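The grid bookkeeping is easy to check mechanically; the following sketch (my own illustration, not from the text) verifies that rounding down to a multiple of $\delta_i^2$ never moves a point of $S_{i+1}$ by more than $\delta_i$ in the $d_2$ metric.

```python
import math

# Quick check (my illustration): with delta_i = 2^-i and
# S_i = {j * delta_i^2 : j = 0, ..., 4^i - 1}, the map ell_i that rounds
# down to a multiple of delta_i^2 satisfies d_2(s, ell_i(s)) <= delta_i
# for every s in S_{i+1}.

def d2(s, t):
    return math.sqrt(abs(s - t))

ok = True
for i in range(6):
    delta = 2.0 ** -i
    step = delta ** 2                    # spacing of S_i
    S_next = [j * (delta / 2) ** 2 for j in range(4 ** (i + 1))]
    for s in S_next:
        ell = math.floor(s / step) * step    # round down into S_i
        if d2(s, ell) > delta + 1e-12:
            ok = False
```

Every point of $S_{i+1}$ is at most $3\delta_i^2/4$ below its image, so in fact $d_2(s, \ell_i s) \le \delta_i\sqrt{3}/2$; the stated bound $\delta_i$ has a little room to spare.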


Not important. You might object that a smaller covering set could be obtained by shifting $S_{i-1}$ slightly to the right then taking $\ell_i$ to map to the nearest integer multiple of $\delta_{i-1}^2$. The slight improvement would only affect the constant in the $O(h_2(\delta))$, at the cost of a messier argument.

Given a $\delta < 1$, let $p$ be the integer for which $\delta_{p+1} \le \delta < \delta_p$. The chains will only extend to the $S_p$-level, rather than to the $S_0$-level. If the change from $S_0$ to $S_p$ disturbs you, you could work with $\widetilde S_i := S_{i+p}$ and $\widetilde\delta_i := \delta_{i+p}$ for $i = 0, 1, \dots$.

The map $L_p$ takes points $s < t$ in $S_m$ to points $L_p s \le L_p t$ in $S_p$ with $d_2(L_p s, L_p t) \le \delta_p + d_2(s, t)$. If $d_2(s,t) < \delta$ then $d_2(L_p s, L_p t) < 2\delta_p$, which means that either $L_p s = L_p t$ or $L_p t = L_p s + \delta_p^2$. Write $\Theta_p$ for the maximum of $|X(s + \delta_p^2) - X(s)|$ over adjacent points of $S_p$, and define
$$\Delta_{m,p} := \max_{t \in S_m}|X(t) - X(L_p t)| \le \sum_{i=p}^{m-1}\max_{s \in S_{i+1}}|X(s) - X(\ell_i s)|.$$

Then for $s, t \in S_m$ with $d_2(s,t) < \delta$,
$$|X_s - X_t| \le |X(s) - X(L_p s)| + |X(L_p s) - X(L_p t)| + |X(L_p t) - X(t)| \le \Delta_{m,p} + \Theta_p + \Delta_{m,p},$$
which leads to the bound
$$\text{<35>}\qquad \mathrm{osc}(\delta, S_m, d_2) \le 2\Delta_{m,p} + \Theta_p.$$
From here on it is just a matter of applying the various maximal inequalities to the $\Delta_{m,p}$ and $\Theta_p$ terms.

4.8.1 Control by (conditional) expectations

For each event $B$ with $\beta := PB > 0$, inequality <33> and Theorem <10> part (ii) give
$$\text{<36>}\qquad P_B \max_{i \le N}\frac{|X(s_i) - X(t_i)|}{d_2(s_i, t_i)} \le H(N) := c_0\,\Psi_2^{-1}(N/\beta) \qquad\text{where } c_0 = \sqrt{8/3}.$$

The inequality $\Psi_2^{-1}(u) + \Psi_2^{-1}(v) \ge \Psi_2^{-1}(uv)$ (see the Problems for Chapter 2) separates the $N$ and $\beta$ contributions. In particular,
$$\delta_i H(N_i) = c_0\,\delta_i\,\Psi_2^{-1}\bigl(\delta_i^{-2}\beta^{-1}\bigr) \le c_0\,\delta_i\bigl(\Psi_2^{-1}(1/\delta_i^2) + \Psi_2^{-1}(1/\beta)\bigr) = c_0\bigl(h(\delta_i) + \delta_i\Gamma\bigr) \qquad\text{where } \Gamma := \Psi_2^{-1}(1/\beta).$$

Apply these bounds to each term of the upper bound in <35>:
$$P_B\,\Theta_p \le \delta_p H(N_p) \le c_0 h(\delta_p) + c_0\delta_p\Gamma,$$
$$P_B\,\Delta_{m,p} \le \sum_{i=p}^{m-1}\delta_i H(N_{i+1}) \le \sum_{i=p}^{\infty} 2c_0\bigl(h(\delta_{i+1}) + \delta_{i+1}\Gamma\bigr).$$

By Problem [5] the $h(\delta_i)$ decrease geometrically fast: $h(\delta_{i+1}) < 0.77\,h(\delta_i)$ for all $i$. It follows that there exists a universal constant $C$ for which
$$P_B\,\mathrm{osc}(\delta, S_m, d_2) \le C\bigl(h(\delta_{p+1}) + \delta_{p+1}\Gamma\bigr) \le C\bigl(h(\delta) + \delta\,\Psi_2^{-1}(1/PB)\bigr).$$

Now let $m$ tend to infinity to obtain an analogous upper bound for $P_B\,\mathrm{osc}_2(\delta)$. If we choose $B$ equal to the whole sample space, the $\Psi_2^{-1}(1/PB)$ term is superfluous. We have $P\,\mathrm{osc}_2(\delta) \le Ch(\delta)$, which is sharp up to a multiplicative constant: Problem [6] uses the independence of the Brownian motion increments to show that $P\,\mathrm{osc}_2(\delta) \ge ch(\delta)$ for all $0 < \delta \le 1/2$, where $c$ is a positive constant.
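The comparability of $P\,\mathrm{osc}_2(\delta)$ with $h(\delta)$ is easy to see in simulation. The following Monte Carlo sketch (my own illustration, not from the text; the grid size, replication count, and choice of $\delta$ values are all arbitrary) estimates the ratio for a Brownian path sampled on a dyadic grid.

```python
import math
import random

# Monte Carlo sketch (my illustration): estimate P osc_2(delta) for Brownian
# motion sampled on a grid of mesh 1/n and compare with
# h(delta) = delta * sqrt(log(1 + 1/delta^2)).

random.seed(0)
n = 1024          # path sampled at j/n, j = 0, ..., n
reps = 20

def h(delta):
    return delta * math.sqrt(math.log(1.0 + delta ** -2))

def osc2(path, delta):
    # sup of |X_s - X_t| over grid pairs with d_2(s,t) <= delta,
    # i.e. |s - t| <= delta^2, i.e. at most w grid steps apart
    w = int(delta ** 2 * n)
    return max(abs(path[j + k] - path[j])
               for j in range(n) for k in range(1, w + 1) if j + k <= n)

ratios = []
for delta in (0.25, 0.125):
    acc = 0.0
    for _ in range(reps):
        path = [0.0]
        for _ in range(n):
            # independent N(0, 1/n) increments
            path.append(path[-1] + random.gauss(0.0, 1.0) / math.sqrt(n))
        acc += osc2(path, delta)
    ratios.append(acc / reps / h(delta))
```

Both estimated ratios land around 1.4, consistent with the two-sided comparison $ch(\delta) \le P\,\mathrm{osc}_2(\delta) \le Ch(\delta)$.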

Now for the surprising part. If, for an $x > 0$, we choose $B := \{\mathrm{osc}(\delta, S_m, d_2) \ge Ch(\delta) + C\delta x\}$ (assuming $PB > 0$), then $P_B\,\mathrm{osc}(\delta, S_m, d_2) \ge Ch(\delta) + C\delta x$, and the bound $P_B\,\mathrm{osc}(\delta, S_m, d_2) \le C\bigl(h(\delta) + \delta\,\Psi_2^{-1}(1/PB)\bigr)$ forces $x \le \Psi_2^{-1}(1/PB)$, that is, $PB \le 1/\Psi_2(x)$. In short,
$$\text{<37>}\qquad P\{\mathrm{osc}(\delta, S_m, d_2) \ge Ch(\delta) + C\delta x\} \le 1/\Psi_2(x) \qquad\text{for all } x > 0.$$

4.8.2 Control by $\Psi_2$-norm

Theorem <10> part (iv) gives
$$\Bigl\|\max_{i \le N}\frac{|X(s_i) - X(t_i)|}{d_2(s_i, t_i)}\Bigr\|_{\Psi_2} \le H(N) := C_0\,\Psi_2^{-1}(N),$$
which is the same as <36> except that $P_B$ has been replaced by $\|\cdot\|_{\Psi_2}$ and $\beta = 1$ (and the constant is different). With those modifications, repeat the argument from subsection 4.8.1 to conclude that $\|\mathrm{osc}(\delta, S_m, d_2)\|_{\Psi_2} \le Ch(\delta)$ for some new constant $C$, that is,
$$P\,\Psi_2\Bigl(\frac{\mathrm{osc}(\delta, S_m, d_2)}{Ch(\delta)}\Bigr) \le 1.$$

Again let $m$ increase to infinity to deduce that $\|\mathrm{osc}_2(\delta)\|_{\Psi_2} \le Ch(\delta)$, which is clearly an improvement over $P\,\mathrm{osc}_2(\delta) \le Ch(\delta)$. For example, we can immediately deduce that $\|\mathrm{osc}_2(\delta)\|_r \le C_r h(\delta)$ for each $r \ge 1$. We also get a tail bound: $P\{\mathrm{osc}_2(\delta) \ge x\,Ch(\delta)\} \le 1/\Psi_2(x)$ for each $x > 0$.

4.8.3 Control by tail probabilities

Once again start from inequality <35>,
$$\mathrm{osc}(\delta, S_m, d_2) \le 2\Delta_{m,p} + \Theta_p,$$
and define $y = \eta_p\delta_p + 2\sum_{i=p}^{m-1}\eta_i\delta_i$. Then, as in Section 4.1,

$$P\{\mathrm{osc}(\delta, S_m, d_2) \ge y\} \le P\{\Theta_p \ge \eta_p\delta_p\} + \sum_{i=p}^{m-1} P\{\max_{s \in S_{i+1}}|X(s) - X(\ell_i s)| \ge \eta_i\delta_i\}$$
$$\le 2N_p e^{-\eta_p^2/2} + \sum_{i=p}^{m-1} 2N_{i+1} e^{-\eta_i^2/2}$$
$$\text{<38>}\qquad = 2\exp\bigl(\log N_p - \eta_p^2/2\bigr) + \sum_{i=p}^{m-1} 2\exp\bigl(\log N_{i+1} - \eta_i^2/2\bigr).$$

How to choose the $\eta_i$'s and $\eta_p$? For the sake of comparison with the inequality <37> obtained by the clever choice of the conditioning event $B$, let me try for an upper bound that is a constant multiple of $e^{-x^2/2}$, for a given $x \ge 0$.

Consider the first term in the bound <38>. To make the exponential exactly equal to $e^{-x^2/2}$ we should choose
$$\eta_p = \sqrt{2\log N_p + x^2} \le \sqrt{2\log N_p} + x,$$
so that
$$\eta_p\delta_p \le \delta_p\sqrt{2\log(1/\delta_p^2)} + \delta_p x \le 2h(\delta_p) + \delta_p x.$$


A similar idea works for each of the $\eta_i$'s, except that we need to add on a little bit more to keep the sum bounded as $m$ goes off to infinity. If the terms were to decrease geometrically then the whole sum would be bounded by a multiple of the first term, which would roughly match the $\eta_p$ contribution. With those thoughts in mind, choose
$$\eta_i = \sqrt{2\log N_{i+1} + x^2 + 2(i-p)\log 2} \le \sqrt{2\log N_{i+1}} + x + \sqrt{2(i-p)\log 2},$$
so that
$$\sum_{i=p}^{m-1} 2\exp\bigl(\log N_{i+1} - \eta_i^2/2\bigr) \le 4e^{-x^2/2}$$
and $\sum_{i=p}^{m-1}\eta_i\delta_i$ is less than
$$\sum_{i \ge p}\Bigl(2\delta_{i+1}\sqrt{2\log(1/\delta_{i+1}^2)} + x\delta_i\Bigr) + \delta_p\sum_{k \ge 0} 2^{-k}\sqrt{2k\log 2} \le \sum_{i \ge p} 8\,h(\delta_{i+1}) + \delta_p(x + c_1) \le c_2\bigl(h(\delta) + \delta x\bigr)$$
for universal constants $c_1$ and $c_2$. (I absorbed the $c_1$ into the $h(\delta)$ term.)

With these simplifications, inequality <38> gives a clean bound,
$$P\{\mathrm{osc}(\delta, S_m, d_2) > Ch(\delta) + C\delta x\} \le 5e^{-x^2/2}$$
for some universal constant $C$. Here I very cunningly changed the $\ge$ to a $>$ to ensure a clean passage to the limit: as $m \to \infty$ the events $\{\mathrm{osc}(\delta, S_m, d_2) > Ch(\delta) + C\delta x\}$ increase to $\{\mathrm{osc}_2(\delta) > Ch(\delta) + C\delta x\}$.
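The bookkeeping behind the choice of $\eta_i$ is pure arithmetic, and worth checking once. The following sketch (my own illustration, not from the text) confirms that with $\eta_i^2 = 2\log N_{i+1} + x^2 + 2(i-p)\log 2$ each summand in <38> collapses to $2\cdot 2^{-(i-p)}e^{-x^2/2}$, so the sum stays below $4e^{-x^2/2}$ however large $m$ becomes.

```python
import math

# Arithmetic check (my illustration) of the eta_i feeding <38>: with
# N_{i+1} = 4^(i+1) and eta_i^2 = 2 log N_{i+1} + x^2 + 2(i-p) log 2,
# the i-th summand equals 2 * 2^-(i-p) * exp(-x^2/2).

p, m, x = 3, 30, 1.7
total = 0.0
for i in range(p, m):
    logN = (i + 1) * math.log(4.0)                       # log N_{i+1}
    eta_sq = 2 * logN + x ** 2 + 2 * (i - p) * math.log(2.0)
    total += 2 * math.exp(logN - eta_sq / 2)
bound = 4 * math.exp(-x ** 2 / 2)
```

The geometric factor $2^{-(i-p)}$ is exactly what the extra $2(i-p)\log 2$ inside $\eta_i^2$ buys; without it the sum would grow with $m$.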

4.9 Problems

[1] Prove the 1934 result of Kolmogorov that is cited at the start of Section 4.10. (It might help to look at Chapter 5 first.)

[2] In inequality <31> put $y_i = y + i - p$ for some nonnegative $y$. Use the fact that, for each decreasing nonnegative function $g$ on $\mathbb{R}^+$ and nonnegative integers $a \le b$,
$$\sum_{i=a}^{b} g(i) \le g(a) + \int_a^{b-1} g(r)\,dr,$$
to deduce that there are positive constants $K_1$, $K_2$, and $c$ for which
$$P\{\Delta_{m,p} > K_1\int_0^{\delta_p}\Bigl(y + \Psi^{-1}\bigl(\mathrm{pack}(r,T,d)\bigr) + \log^{1/\alpha}(1/r)\Bigr)\,dr\} \le K_2\,e^{-cy^\alpha}$$
for all $y \ge 0$.

[3] (Construction of a majorizing measure via packing numbers.) Suppose $\Psi$ is a Young function for which $\int_0^1 \Psi^{-1}(1/r)\,dr < \infty$ and
$$\Psi(u)\Psi(v) \le \Psi(uv) \qquad\text{if } \min(u,v) \ge 1,$$
and suppose
$$\int_0^D \Psi^{-1}\bigl(\mathrm{pack}(r, T, d)\bigr)\,dr < \infty \qquad\text{where } D = \mathrm{diam}(T).$$
For each $i$ let $S_i$ be a maximal $k^{-i}$-separated subset of $T$, with $N_i := \#S_i = \mathrm{pack}(k^{-i}, T, d)$, as described near the end of Section 4.2. Define $\mu_i$ to be the uniform probability distribution on $S_i$, that is, mass $1/N_i$ at each point of $S_i$, and $\mu := \sum_{i \ge 0} 2^{-i-1}\mu_i$.

(i) For each $t \in T$ and $r > k^{-i}$ show that $\mu B[t,r] \ge 2^{-i-1}\mu_i B[t,r] \ge (2^{i+1}N_i)^{-1}$. Hint: Could $S_i \cap B[t,r]$ be empty?

(ii) By splitting the range of integration into intervals where $k^{-i} \ge r > k^{-i-1}$, deduce (cf. Section 4.3) that
$$\int_0^D \Psi^{-1}\Bigl(\frac{1}{\mu B[t,r]}\Bigr)\,dr \le 2C_0\sum_{i \ge 1} k^{-i}\bigl(\Psi^{-1}(2^{i+1}) + \Psi^{-1}(N_i)\bigr) < \infty.$$

[4] Suppose there exists a nested sequence of partitions $\pi_i$ of $T$ with $\#\pi_i \le N_i := 2^{2^i}$ and
$$\sup_{t \in T}\sum_{i \in \mathbb{N}} \Psi^{-1}\bigl(2^i N_i\bigr)\,\mathrm{diam}\bigl(E_i(t)\bigr) < \infty.$$
Adapt the method from Problem [3] to show that $\sup_{t \in T}\int_0^D \Psi^{-1}(1/\mu B[t,r])\,dr < \infty$ for some Borel probability measure $\mu$.

[5] From the fact that $g(y) = \Psi_2(y)/y^2$ is an increasing function on $\mathbb{R}^+$, deduce that the function
$$h(\delta) = \delta\,\Psi_2^{-1}(1/\delta^2) = \delta\sqrt{\log(1 + \delta^{-2})}$$
is increasing. Show also that $h(\delta/2)/h(\delta)$ can be written as $\sqrt{g^*(y)}$ for $y = 1/\delta$, where $y \mapsto g^*(y)$ is a decreasing function with $g^*(1)^{1/2} < 0.77$; conclude that $h(\delta_{i+1}) < 0.77\,h(\delta_i)$ for the sequence $\delta_i = 2^{-i}$.

[6] Suppose $Z_1, \dots, Z_n$ are independent random variables, each distributed $N(0,1)$ (with density $\phi$). Define $M_n = \max_{i \le n}|Z_i|$. For a positive constant $c$, let $x_n = x_n(c)$ be the value for which $P\{|Z_i| \ge x_n\} = c/n$.

(i) Show that $P\{M_n \le x_n\} = (1 - c/n)^n \to e^{-c}$ as $n \to \infty$.

(ii) If $0 \le x$ and $x + 1 \le \sqrt{2\log n}$, show that $P\{|Z_i| \ge x\} \ge 2\phi(1 + x) \ge n^{-1}\sqrt{2/\pi}$. Deduce that $x_n(\sqrt{2/\pi}) + 1 > \sqrt{2\log n}$.

(iii) Deduce that there exists some positive constant $C$ for which $PM_n \ge C\sqrt{\log n}$ for all $n$.

(iv) For $X$ a standard Brownian motion as in Section 4.8, deduce from (iii) that $P\,\mathrm{osc}_2(\delta) \ge c\delta\sqrt{\log(1/\delta)}$ for all $0 < \delta \le 1/2$, where $c$ is a positive constant. Hint: Write $2^{m/2}X_1$ as a sum of $2^m$ independent standard normals.

[7] (A sharpening of the result from the previous Problem.) The classical bounds (Feller, 1968, Section VII.1 and Problem 7.1) show that the normal tail probability $\bar\Phi(x) = P\{N(0,1) > x\}$ behaves asymptotically like $\phi(x)/x$. More precisely,
$$1 - x^{-2} \le \frac{x\,\bar\Phi(x)}{\phi(x)} \le 1 \qquad\text{for all } x > 0.$$
Less precisely,
$$\log\bigl(c_0\bar\Phi(x)\bigr) = -\tfrac12 x^2 - \log x + O(x^{-2}) \qquad\text{as } x \to \infty,\ \text{where } c_0 = \sqrt{2\pi}.$$

(i) (Compare with Leadbetter et al. 1983, Theorem 1.5.3.) Define $a_n = \sqrt{2\log n}$ and $L_n = \log a_n$. For each constant $\theta$ define $m_{\theta,n} = a_n - (1+\theta)L_n/a_n$. Show that
$$c_0\bar\Phi(m_{\theta,n}) = n^{-1}\exp\bigl(\theta L_n + o(1)\bigr)$$
and that
$$P\{M_n \le m_{\theta,n}\} = \bigl(1 - \bar\Phi(m_{\theta,n})\bigr)^n = \exp\bigl(-a_n^\theta\,(c_0^{-1} + o(1))\bigr).$$

(ii) Deduce that there exists an interval $I_n$ of length $o(L_n/a_n)$ containing the point $a_n - L_n/a_n$ for which $P\{M_n \notin I_n\} \to 0$ very fast.

4.10 Notes


Credit for the idea of chaining as a method of successive approximations

clearly belongs to Kolmogorov, at least for the case of a one-dimensional

index set. For example, at the start of the paper of Chentsov (1956):


Theorem. If $\xi(t)$ is a separable (see [1]) stochastic process, $0 \le t \le 1$, and
$$(1)\qquad M|\xi(t_1) - \xi(t_2)|^p < C|t_1 - t_2|^{1+r},$$
where $p > 0$, $r > 0$ and $C$ is a constant independent of $t$, then the trajectories of the process are continuous with probability 1. A generalization of this theorem is the following proposition which was suggested to the author by A. N. Kolmogorov: …

The statement of the Theorem was footnoted by the comment "This theorem was first published in a paper by E. E. Slutskii [2]", with a reference to a 1937 paper that I have not seen. See Billingsley (1968, Section 12) for a small generalization of the theorem (with credit to Kolmogorov, via Slutsky, and Chentsov) and a chaining proof.

See Dudley (1973, Section 1) and Dudley (1999a, Section 1.2 and Notes)

for more about packing and covering. The definitive early work is due to

Kolmogorov and Tikhomirov (1959).

Dudley (1973) used chaining with packing/covering numbers and tail inequalities to establish various probabilistic bounds for Gaussian processes. Dudley (1978) adapted the methods, using the Bernstein inequality and metric entropy and inclusion assumptions (now called bracketing; see Chapter 13), to extend the Gaussian techniques to empirical processes indexed by collections of sets. He also derived bounds for processes indexed by VC classes of sets (see Chapter 9) via symmetrization (see Chapter 8) arguments. In each case he controlled the increments of the empirical processes by exponential inequalities like those in Chapter ChapHoeffBenn.

Pisier (1983) is usually credited for realizing that the entropy methods used for Gaussian processes could also be extended to non-Gaussian processes with Orlicz norm control of the increments. However, as Pisier (page 127) remarked:

For the proof of this theorem, we follow essentially [10]; I have included a slight improvement over [10] which was kindly pointed out to me by X. Fernique. Moreover, I should mention that N. Kono [6] proved a result which is very close to the above; at the time of [10], I was not aware of Kono's paper [6].

Here [10] = Pisier (1980) and [6] = Kono (1980). The earlier paper [10] included extensive discussion of other precursors for the idea. See also the Notes to Section 2.6 of Dudley (1999b).


Using methods like those in Section 4.5, Nolan and Pollard (1988) proved a functional central limit theorem for the U-statistic analog of the empirical process.

Kim and Pollard (1990) and Pollard (1990) proved limit theorems for a

variety of statistical estimators using second moment control for suprema of

empirical processes.

My analysis in Example <18> is based on arguments of Ibragimov and

Hasminskii (1981, Section 1.5), with the chaining bound replacing their

method for deriving maximal inequalities. The analysis could be extended

to unbounded subsets of R by similar adaptations of their arguments for

unbounded sets.

See Pollard (1985) for one way to use a form of oscillation bound (under

the name stochastic differentiability) to establish central limit theorems for

M-estimators. Pakes and Pollard (1989, Lemma 2.17) used a property more

easily recognized as oscillation around a fixed index point.

References

Billingsley, P. (1968). Convergence of Probability Measures. New York: Wiley.

Chentsov, N. N. (1956). Weak convergence of stochastic processes whose trajectories have no discontinuities of the second kind and the heuristic approach to the Kolmogorov-Smirnov tests. Theory of Probability and Its Applications 1(1), 140-144.

Dudley, R. M. (1973). Sample functions of the Gaussian process. Annals of Probability 1, 66-103.

Dudley, R. M. (1978). Central limit theorems for empirical measures. Annals of Probability 6, 899-929.

Dudley, R. M. (1999a). Uniform Central Limit Theorems. Cambridge University Press.

Dudley, R. M. (1999b). Cambridge University Press.

Dudley, R. M. (2003). Real Analysis and Probability (2nd ed.). Cambridge Studies in Advanced Mathematics. Cambridge University Press.

Feller, W. (1968). An Introduction to Probability Theory and Its Applications (third ed.), Volume 1. New York: Wiley.

Fernique, X. (1975). Régularité des trajectoires des fonctions aléatoires gaussiennes. Springer Lecture Notes in Mathematics 480, 1-97. École d'Été de Probabilités de Saint-Flour IV, 1974.

Ibragimov, I. A. and R. Z. Hasminskii (1981). Statistical Estimation: Asymptotic Theory. New York: Springer. (English translation from 1979 Russian edition.)

Kim, J. and D. Pollard (1990). Cube root asymptotics. Annals of Statistics 18, 191-219.

Kolmogorov, A. N. and V. M. Tikhomirov (1959). ε-entropy and ε-capacity of sets in function spaces. Uspekhi Mat. Nauk 14(2(86)), 3-86. Review by G. G. Lorentz at MathSciNet MR0112032. Included as paper 7 in volume 3 of Selected Works of A. N. Kolmogorov.

Kono, N. (1980). Sample path properties of stochastic processes. J. Math. Kyoto Univ. 20(2), 295-313.

Leadbetter, M. R., G. Lindgren, and H. Rootzén (1983). Extremes and Related Properties of Random Sequences and Processes. Springer-Verlag.

Ledoux, M. and M. Talagrand (1991). Probability in Banach Spaces: Isoperimetry and Processes. New York: Springer.

Massart, P. (2000). About the constants in Talagrand's concentration inequalities for empirical processes. The Annals of Probability 28(2), 863-884.

McKean, H. P. (1969). Stochastic Integrals. New York: Academic Press.

Nolan, D. and D. Pollard (1988). Functional limit theorems for U-processes. Annals of Probability 16, 1291-1298.

Pakes, A. and D. Pollard (1989). Simulation and the asymptotics of optimization estimators. Econometrica 57, 1027-1058.

Pisier, G. (1980). Conditions d'entropie assurant la continuité de certains processus et applications à l'analyse harmonique. In Séminaire d'analyse fonctionnelle, 1979-80, pp. 1-41. École Polytechnique, Palaiseau. Available from http://archive.numdam.org/.

Pisier, G. (1983). Some applications of the metric entropy condition to harmonic analysis. Springer Lecture Notes in Mathematics 995, 123-154.

Pollard, D. (1985). New ways to prove central limit theorems. Econometric Theory 1, 295-314.

Pollard, D. (1989). Asymptotics via empirical processes. Statistical Science 4, 341-366.

Pollard, D. (1990). Empirical Processes: Theory and Applications, Volume 2 of NSF-CBMS Regional Conference Series in Probability and Statistics. Hayward, CA: Institute of Mathematical Statistics.

Pollard, D. (2002). A User's Guide to Measure Theoretic Probability. Cambridge University Press.

Talagrand, M. (1987). Regularity of Gaussian processes. Acta Mathematica 159, 99-149.

Talagrand, M. (1996). New concentration inequalities in product spaces. Inventiones Mathematicae 126, 505-563.

Talagrand, M. (2001). Majorizing measures without measures. Annals of Probability 29(1), 411-417.

Talagrand, M. (2005). The Generic Chaining: Upper and Lower Bounds of Stochastic Processes. Springer-Verlag.
