Central Limit Theorem and convergence to stable laws in Mallows distance
Oliver Johnson and Richard Samworth Statistical Laboratory, University of Cambridge, Wilberforce Road, Cambridge, CB3 0WB, UK.
October 3, 2005
Running Title: CLT and stable convergence in Mallows distance
Keywords: Central Limit Theorem, Mallows distance, probability metric, stable law, Wasserstein distance
Abstract
We give a new proof of the classical Central Limit Theorem, in the Mallows ($L^r$-Wasserstein) distance. Our proof is elementary in the sense that it does not require complex analysis, but rather makes use of a simple subadditive inequality related to this metric. The key is to analyse the case where equality holds. We provide some results concerning rates of convergence. We also consider convergence to stable distributions, and obtain a bound on the rate of such convergence.
1 Introduction and main results
The spirit of the Central Limit Theorem, that normalised sums of independent random variables converge to a normal distribution, can be understood in different senses, according to the distance used. For example, in addition to the standard Central Limit Theorem in the sense of weak convergence, we mention the proofs in Prohorov (1952) of $L^1$ convergence of densities, in Gnedenko and Kolmogorov (1954) of $L^\infty$ convergence of densities, in Barron (1986) of convergence in relative entropy and in Shimizu (1975) and Johnson and Barron (2004) of convergence in Fisher information.
In this paper we consider the Central Limit Theorem with respect to the Mallows distance and prove convergence to stable laws in the infinite variance setting. We study the rates of convergence in both cases.
Definition 1.1 For any $r > 0$, we define the Mallows $r$-distance between probability distribution functions $F_X$ and $F_Y$ as
$$d_r(F_X, F_Y) = \Bigl( \inf_{(X,Y)} E|X - Y|^r \Bigr)^{1/r},$$
where the infimum is taken over pairs $(X, Y)$ whose marginal distribution functions are $F_X$ and $F_Y$ respectively, and may be infinite. Where it causes no confusion, we write $d_r(X, Y)$ for $d_r(F_X, F_Y)$.
Define $F_r$ to be the set of distribution functions $F$ such that $\int |x|^r \, dF(x) < \infty$. Bickel and Freedman (1981) show that for $r \ge 1$, $d_r$ is a metric on $F_r$. If $r < 1$, then $d_r^r$ is a metric on $F_r$. In considering stable convergence, we shall also be concerned with the case where the absolute $r$th moments are not finite.
Throughout the paper, we write $Z_{\mu,\sigma^2}$ for a $N(\mu, \sigma^2)$ random variable, $Z_{\sigma^2}$ for a $N(0, \sigma^2)$ random variable, and $\Phi_{\mu,\sigma^2}$ and $\Phi_{\sigma^2}$ for their respective distribution functions. We establish the following main theorems:
Theorem 1.2 Let $X_1, X_2, \dots$ be independent and identically distributed random variables with mean zero and finite variance $\sigma^2 > 0$, and let $S_n = (X_1 + \dots + X_n)/\sqrt{n}$. Then
$$\lim_{n\to\infty} d_2(S_n, Z_{\sigma^2}) = 0.$$
Moreover, Theorem 3.2 shows that for any $r \ge 2$, if $d_r(X_i, Z_{\sigma^2}) < \infty$, then $\lim_{n\to\infty} d_r(S_n, Z_{\sigma^2}) = 0$. Theorem 1.2 implies the standard Central Limit Theorem in the sense of weak convergence (Bickel and Freedman 1981, Lemma 8.3).
Theorem 1.3 Fix $\alpha \in (0, 2)$, and let $X_1, X_2, \dots$ be independent random variables (where $EX_i = 0$, if $\alpha > 1$), and $S_n = (X_1 + \dots + X_n)/n^{1/\alpha}$. If there exists an $\alpha$-stable random variable $Y$ such that $\sup_i d_\beta(X_i, Y) < \infty$ for some $\beta \in (\alpha, 2]$, then $\lim_{n\to\infty} d_\beta(S_n, Y) = 0$. In fact
$$d_\beta(S_n, Y) \le \frac{2^{1/\beta}}{n^{1/\alpha}} \Bigl( \sum_{i=1}^n d_\beta^\beta(X_i, Y) \Bigr)^{1/\beta},$$
so in the identically distributed case the rate of convergence is $O(n^{1/\beta - 1/\alpha})$.
See also Rachev and Rüschendorf (1992, 1994), who obtain similar results using different techniques in the case of identically distributed $X_i$ and strictly symmetric $Y$. In Lemma 5.3 we exhibit a large class $C_K$ of distribution functions $F_X$ for which $d_\beta(X, Y) \le K$, so the theorem can be applied.
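As a rough numerical illustration of the bound in Theorem 1.3, the following sketch evaluates its right-hand side. The function names and the sample values ($\alpha = 1.5$, $\beta = 2$, $d_\beta(X_i, Y) = 1$) are illustrative assumptions, not taken from the paper.

```python
def stable_rate_bound(distances, alpha, beta):
    """Right-hand side of the bound in Theorem 1.3.

    distances: assumed values of d_beta(X_i, Y) for i = 1..n (all finite).
    """
    if not (0 < alpha < beta <= 2):
        raise ValueError("need 0 < alpha < beta <= 2")
    n = len(distances)
    total = sum(d ** beta for d in distances)
    return 2 ** (1 / beta) * total ** (1 / beta) / n ** (1 / alpha)


def rate_exponent(alpha, beta):
    """Exponent of n in the identically distributed case: 1/beta - 1/alpha."""
    return 1 / beta - 1 / alpha


# Identically distributed case with d_beta(X_i, Y) = 1 (an arbitrary illustrative value).
alpha, beta = 1.5, 2.0
for n in (10, 100, 1000):
    print(n, stable_rate_bound([1.0] * n, alpha, beta))
```

With equal distances the bound reduces to $2^{1/\beta} n^{1/\beta - 1/\alpha} d_\beta(X_1, Y)$, so it decays polynomially whenever $\beta > \alpha$.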
Theorem 1.2 follows by understanding the subadditivity of $d_2^2(S_n, Z_{\sigma^2})$ (see Equation (4)). We consider the powers-of-two subsequence $T_k = S_{2^k}$, and use Rényi's method, introduced in Rényi (1961) to provide a proof of convergence to equilibrium of Markov chains; see also Kendall (1963). This technique was also used in Csiszár (1965) to show convergence to Haar measure for convolutions of measures on compact groups, and in Shimizu (1975) to show convergence of Fisher information in the Central Limit Theorem. The method has four stages:
1. Consider independent and identically distributed random variables $X_1$ and $X_2$ with mean $\mu$ and variance $\sigma^2 > 0$, and write $D(X)$ for $d_2^2(X, Z_{\mu,\sigma^2})$. In Proposition 2.4, we observe that
$$D\Bigl(\frac{X_1 + X_2}{\sqrt{2}}\Bigr) \le D(X_1), \tag{1}$$
with equality if and only if $X_1, X_2 \sim Z_{\mu,\sigma^2}$. Hence $D(T_k)$ is decreasing and bounded below, so converges to some $D$.
2. In Proposition 2.5, we use a compactness argument to show that there exists a strictly increasing sequence $k_r$ and a random variable $T$ such that
$$\lim_{r\to\infty} D(T_{k_r}) = D(T).$$
Further,
$$\lim_{r\to\infty} D(T_{k_r+1}) = \lim_{r\to\infty} D\Bigl(\frac{T_{k_r} + T'_{k_r}}{\sqrt{2}}\Bigr) = D\Bigl(\frac{T + T'}{\sqrt{2}}\Bigr),$$
where $T'_{k_r}$ and $T'$ are independent copies of $T_{k_r}$ and $T$ respectively.
3. We combine these two results: since $D(T_{k_r})$ and $D(T_{k_r+1})$ are both subsequences of the convergent sequence $D(T_k)$, they must have a common limit. That is,
$$D = D(T) = D\Bigl(\frac{T + T'}{\sqrt{2}}\Bigr),$$
so by the condition for equality in Proposition 2.4, we deduce that $T \sim N(0, \sigma^2)$ and $D = 0$.
4. Proposition 2.4 implies the standard subadditive relation
$$(m + n) D(S_{m+n}) \le m D(S_m) + n D(S_n).$$
Now Theorem 6.6.1 of Hille (1948) implies that $D(S_n)$ converges to $\inf_n D(S_n) = 0$.
The proof of Theorem 1.3 is given in Section 5.
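The monotone decrease of $D(T_k)$ in stage 1 can be checked exactly in a simple case. The sketch below assumes $X_i = \pm 1$ with probability $1/2$ each (an illustrative choice, not one made in the paper), so that $\sigma^2 = 1$ and $S_n$ is a rescaled binomial. Under the quantile coupling, $D(S_n) = 2 - 2k$ with $k = \int_0^1 F_{S_n}^{-1}(p)\Phi^{-1}(p)\,dp$, and the integral is a finite sum because $F_{S_n}^{-1}$ is a step function. The function names are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def phi_at_quantile(p):
    """phi(Phi^{-1}(p)), with the limiting value 0 at p = 0 and p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return STD.pdf(STD.inv_cdf(p))


def mallows_D_rademacher(n):
    """D(S_n) = d_2^2(S_n, Z_1) for S_n = (X_1 + ... + X_n)/sqrt(n), X_i = +-1 w.p. 1/2.

    Uses D = 2 - 2k, where k = int_0^1 F^{-1}(p) Phi^{-1}(p) dp, together with
    int_a^b Phi^{-1}(p) dp = phi(Phi^{-1}(a)) - phi(Phi^{-1}(b)).
    """
    k = 0.0
    cum = 0.0
    for j in range(n + 1):
        atom = (2 * j - n) / sqrt(n)      # value of S_n when j of the X_i equal +1
        lo = cum
        cum = 1.0 if j == n else cum + comb(n, j) / 2 ** n
        k += atom * (phi_at_quantile(lo) - phi_at_quantile(cum))
    return 2.0 - 2.0 * k


# D(T_k) = D(S_{2^k}) along the powers-of-two subsequence.
for k in range(6):
    print(k, mallows_D_rademacher(2 ** k))
```

The printed values decrease with $k$, in line with Equation (1). The same function can be used to check the subadditive relation of stage 4 for small $m$ and $n$.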
2 Subadditivity of Mallows distance
The Mallows distance and related metrics originated with a transportation problem posed by Monge in 1781 (Rachev 1984, Dudley 1989, pp.329–330). Kantorovich generalised this problem, and considered the distance obtained by minimising $Ec(X, Y)$, for a general metric $c$ (known as the cost function), over all joint distributions of pairs $(X, Y)$ with fixed marginals. This distance is also known as the Wasserstein metric. Rachev (1984) reviews applications to differential geometry, infinite-dimensional linear programming and information theory, among many others. Mallows (1972) focused on the metric which we have called $d_2$, while $d_1$ is sometimes called the Gini index.
In Lemma 2.3 below, we review the existence and uniqueness of the construction which attains the infimum in Definition 1.1, using the concept of a quasi-monotone function.
Definition 2.1 A function $k : \mathbb{R}^2 \to \mathbb{R}$ induces a signed measure $\mu_k$ on $\mathbb{R}^2$ given by
$$\mu_k\{(x, x'] \times (y, y']\} = k(x, y) + k(x', y') - k(x, y') - k(x', y).$$
We say that $k$ is quasi-monotone if $\mu_k$ is a non-negative measure.
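Definition 2.1 can be probed numerically by evaluating the rectangle mass directly. The sketch below does this for $k(x, y) = -|x - y|^r$; the helper names are hypothetical. For $r = 2$ the mass of a rectangle is $2(x' - x)(y' - y)$, while for $r = 1/2$ a rectangle away from the diagonal receives negative mass, so that choice of $k$ is not quasi-monotone.

```python
def rectangle_mass(k, x, x2, y, y2):
    """mu_k{(x, x2] x (y, y2]} as in Definition 2.1, for x < x2 and y < y2."""
    return k(x, y) + k(x2, y2) - k(x, y2) - k(x2, y)


def neg_abs_power(r):
    """The function k(x, y) = -|x - y|^r."""
    return lambda x, y: -(abs(x - y) ** r)


# r = 2: the mass is 2 * (x2 - x) * (y2 - y), which is non-negative.
print(rectangle_mass(neg_abs_power(2), 0.0, 1.0, 2.0, 3.0))

# r = 1/2: this rectangle lies away from the diagonal and has negative mass.
print(rectangle_mass(neg_abs_power(0.5), 0.0, 1.0, 2.0, 3.0))
```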
The function $k(x, y) = -|x - y|^r$ is quasi-monotone for $r \ge 1$, and if $r > 1$ then the measure $\mu_k$ is absolutely continuous, with a density which is positive Lebesgue almost everywhere. Tchen (1980, Corollary 2.1) gives the following result, a two-dimensional version of integration by parts.
Lemma 2.2 Let $k(x, y)$ be a quasi-monotone function and let $H_1(x, y)$ and $H_2(x, y)$ be distribution functions with the same marginals, where $H_1(x, y) \le H_2(x, y)$ for all $x, y$. Suppose there exists an $H_1$- and $H_2$-integrable function $g(x, y)$, bounded on compact sets, such that $|k(x^B, y^B)| \le g(x, y)$, where $x^B = (-B) \vee x \wedge B$. Then
$$\int k(x, y)\,dH_2(x, y) - \int k(x, y)\,dH_1(x, y) = \int \bigl\{ H_2^-(x, y) - H_1^-(x, y) \bigr\}\,d\mu_k(x, y).$$
Here $H_i^-(x, y) = P(X < x, Y < y)$, where $(X, Y)$ have joint distribution function $H_i$.
Lemma 2.3 For $r \ge 1$, consider the joint distribution of pairs $(X, Y)$ where $X$ and $Y$ have fixed marginals $F_X$ and $F_Y$, both in $F_r$. Then
$$E|X - Y|^r \ge E|X^* - Y^*|^r, \tag{2}$$
where $X^* = F_X^{-1}(U)$, $Y^* = F_Y^{-1}(U)$ and $U \sim U(0, 1)$. For $r > 1$, equality is attained only if $(X, Y) \sim (X^*, Y^*)$.
Proof Observe, as in Fréchet (1951), that if the random variables $X, Y$ have fixed marginals $F_X$ and $F_Y$, then
$$P(X \le x, Y \le y) \le H_+(x, y), \tag{3}$$
where $H_+(x, y) = \min(F_X(x), F_Y(y))$. This bound is achieved by taking $U \sim U(0, 1)$ and setting $X^* = F_X^{-1}(U)$, $Y^* = F_Y^{-1}(U)$.
Thus, by Lemma 2.2, with $k(x, y) = -|x - y|^r$, for $r \ge 1$, and taking $H_1(x, y) = P(X \le x, Y \le y)$ and $H_2 = H_+$, we deduce that
$$E|X - Y|^r - E|X^* - Y^*|^r = \int \{H_+(x, y) - H_1(x, y)\}\,d\mu_k(x, y) \ge 0,$$
so $(X^*, Y^*)$ achieves the infimum in the definition of the Wasserstein distance.
Finally, since taking $r > 1$ implies that the measure $\mu_k$ has a strictly positive density with respect to Lebesgue measure, we can only have equality in (2) if $P(X \le x, Y \le y) = \min\{F_X(x), F_Y(y)\}$ Lebesgue almost everywhere. But the joint distribution function is right-continuous, so this condition determines the value of $P(X \le x, Y \le y)$ everywhere.
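For two empirical distributions with the same number of atoms, the coupling $(X^*, Y^*) = (F_X^{-1}(U), F_Y^{-1}(U))$ of Lemma 2.3 amounts to pairing the order statistics. The following sketch, with hypothetical function names and arbitrary sample values, computes the Mallows distance this way and compares it with a brute-force minimum over all pairings of atoms.

```python
from itertools import permutations


def empirical_mallows(xs, ys, r=2.0):
    """Mallows r-distance (r >= 1) between two empirical laws with equally many atoms.

    Pairing the sorted samples is the quantile coupling of Lemma 2.3.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes samples of equal size")
    if r < 1:
        raise ValueError("the sorted pairing is only guaranteed optimal for r >= 1")
    cost = sum(abs(x - y) ** r for x, y in zip(sorted(xs), sorted(ys))) / len(xs)
    return cost ** (1.0 / r)


def brute_force_mallows(xs, ys, r=2.0):
    """Minimise the cost over all pairings of atoms (feasible only for tiny samples)."""
    best = min(
        sum(abs(x - y) ** r for x, y in zip(xs, perm)) for perm in permutations(ys)
    )
    return (best / len(xs)) ** (1.0 / r)


xs = [0.3, -1.2, 2.5, 0.0]   # arbitrary illustrative samples
ys = [1.0, 0.1, -0.7, 3.2]
print(empirical_mallows(xs, ys), brute_force_mallows(xs, ys))
```

The brute-force search is included only as a cross-check; it scales factorially, whereas the sorted pairing costs one sort per sample.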
Using the construction in Lemma 2.3, Bickel and Freedman (1981) establish that if $X_1$ and $X_2$ are independent and $Y_1$ and $Y_2$ are independent, then
$$d_2^2(X_1 + X_2, Y_1 + Y_2) \le d_2^2(X_1, Y_1) + d_2^2(X_2, Y_2). \tag{4}$$
Similar subadditive expressions arise in the proof of convergence of Fisher information in Johnson and Barron (2004). By focusing on the case $r = 2$ in Definition 1.1, and by using the theory of $L^2$ spaces and projections, we establish parallels with the Fisher information argument.
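Inequality (4) can be checked in closed form when every variable is normal. Applying the quantile coupling of Lemma 2.3 to $N(m_1, s_1^2)$ and $N(m_2, s_2^2)$ gives $X^* = m_1 + s_1 Z$, $Y^* = m_2 + s_2 Z$, so $d_2^2 = (m_1 - m_2)^2 + (s_1 - s_2)^2$. The sketch below uses this formula; the function names and parameter values are illustrative assumptions.

```python
from math import sqrt


def d2_squared_normal(m1, s1, m2, s2):
    """d_2^2 between N(m1, s1^2) and N(m2, s2^2), via X* = m1 + s1*Z, Y* = m2 + s2*Z."""
    return (m1 - m2) ** 2 + (s1 - s2) ** 2


def check_subadditivity(a1, a2, b1, b2):
    """Both sides of (4) for independent X_i ~ N(0, a_i^2) and Y_i ~ N(0, b_i^2)."""
    lhs = d2_squared_normal(0.0, sqrt(a1**2 + a2**2), 0.0, sqrt(b1**2 + b2**2))
    rhs = d2_squared_normal(0.0, a1, 0.0, b1) + d2_squared_normal(0.0, a2, 0.0, b2)
    return lhs, rhs


print(check_subadditivity(1.0, 2.0, 3.0, 0.5))   # strict inequality
print(check_subadditivity(1.0, 2.0, 2.0, 4.0))   # proportional scales: equality
```

In this Gaussian family, (4) is the reverse triangle inequality for the vectors $(a_1, a_2)$ and $(b_1, b_2)$, which explains why equality occurs when the scales are proportional.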
We prove Equation (4) below, and further consider the case of equality in this relation. Major (1978, p.504) gives an equivalent construction to that given in Lemma 2.3. If $F_Y$ is a continuous distribution function, then $F_Y(Y) \sim U(0, 1)$, so we generate $Y^* \sim F_Y$ and take $X^* = F_X^{-1} \circ F_Y(Y^*)$.
Recall that if $EX = \mu$ and $\mathrm{Var}\,X = \sigma^2$, we write $D(X)$ for $d_2^2(X, Z_{\mu,\sigma^2})$.
Proposition 2.4 If $X_1, X_2$ are independent, with finite variances $\sigma_1^2, \sigma_2^2 > 0$, then for any $t \in (0, 1)$,
$$D\bigl(\sqrt{t}\,X_1 + \sqrt{1 - t}\,X_2\bigr) \le tD(X_1) + (1 - t)D(X_2),$$
with equality if and only if $X_1$ and $X_2$ are normal.
Proof We consider bounding $D(X_1 + X_2)$ for independent $X_1$ and $X_2$ with mean zero, since the general result follows on translation and rescaling.
We generate independent $Y_i^* \sim N(0, \sigma_i^2)$, and take $X_i^* = F_{X_i}^{-1} \circ \Phi_{\sigma_i^2}(Y_i^*) = h_i(Y_i^*)$, say, for $i = 1, 2$. Further, writing $\sigma^2 = \sigma_1^2 + \sigma_2^2$, we define $Y^* = Y_1^* + Y_2^*$ and set $X^* = F_{X_1 + X_2}^{-1} \circ \Phi_{\sigma^2}(Y_1^* + Y_2^*) = h(Y_1^* + Y_2^*)$, say. Then
$$\begin{aligned}
d_2^2(X_1 + X_2, Y_1 + Y_2) &= E(X^* - Y^*)^2 \\
&\le E(X_1^* + X_2^* - Y_1^* - Y_2^*)^2 \\
&= E(X_1^* - Y_1^*)^2 + E(X_2^* - Y_2^*)^2 \\
&= d_2^2(X_1, Y_1) + d_2^2(X_2, Y_2).
\end{aligned}$$
Equality holds if and only if $(X_1^* + X_2^*, Y_1^* + Y_2^*)$ has the same distribution as $(X^*, Y^*)$. By our construction of $Y^* = Y_1^* + Y_2^*$, this means that $(X_1^* + X_2^*, Y_1^* + Y_2^*)$ has the same distribution as $(X^*, Y_1^* + Y_2^*)$, so $P\{X_1^* + X_2^* = h(Y_1^* + Y_2^*)\} = P\{X^* = h(Y_1^* + Y_2^*)\} = 1$. Thus, if equality holds, then
$$h_1(Y_1^*) + h_2(Y_2^*) = h(Y_1^* + Y_2^*) \quad \text{almost surely}. \tag{5}$$
Brown (1982) and Johnson and Barron (2004) showed that equality holds in Equation (5) if and only if $h, h_1, h_2$ are linear. In particular, Proposition 2.1 of Johnson and Barron (2004) implies that there exist constants $a_i$ and $b_i$ such that
$$E\{h(Y_1^* + Y_2^*) - h_1(Y_1^*) - h_2(Y_2^*)\}^2 \ge \frac{2\sigma_1^2\sigma_2^2}{(\sigma_1^2 + \sigma_2^2)^2} \Bigl[ E\{h_1(Y_1^*) - a_1Y_1^* - b_1\}^2 + E\{h_2(Y_2^*) - a_2Y_2^* - b_2\}^2 \Bigr]. \tag{6}$$
Hence, if Equation (5) holds, then $h_i(u) = a_iu + b_i$ almost everywhere. Since $Y_i^*$ and $X_i^*$ have the same mean and variance, it follows that $a_i = 1$, $b_i = 0$. Hence $h_1(u) = h_2(u) = u$ and $X_i^* = Y_i^*$.
Recall that $T_k = S_{2^k}$, where $S_n = (X_1 + \dots + X_n)/\sqrt{n}$ is a normalised sum of independent and identically distributed random variables of mean zero and finite variance $\sigma^2$.
Proposition 2.5 There exists a strictly increasing sequence $(k_r)$ of natural numbers and a random variable $T$ such that
$$\lim_{r\to\infty} D(T_{k_r}) = D(T).$$
If $T'_{k_r}$ and $T'$ are independent copies of $T_{k_r}$ and $T$ respectively, then
$$\lim_{r\to\infty} D(T_{k_r+1}) = \lim_{r\to\infty} D\Bigl(\frac{T_{k_r} + T'_{k_r}}{\sqrt{2}}\Bigr) = D\Bigl(\frac{T + T'}{\sqrt{2}}\Bigr).$$
Proof Since $\mathrm{Var}(T_k) = \sigma^2$ for all $k$, the sequence $(T_k)$ is tight. Therefore, by Prohorov's theorem, there exists a strictly increasing sequence $(k_r)$ and a random variable $T$ such that
$$T_{k_r} \xrightarrow{d} T \tag{7}$$
as $r \to \infty$. Moreover, the proof of Lemma 5.2 of Brown (1982) shows that the sequence $(T_{k_r}^2)$ is uniformly integrable. But this, combined with Equation (7), implies that $\lim_{r\to\infty} d_2(T_{k_r}, T) = 0$ (Bickel and Freedman 1981, Lemma 8.3(b)). Hence
$$D(T_{k_r}) = d_2^2(T_{k_r}, Z_{\sigma^2}) \le \{d_2(T_{k_r}, T) + d_2(T, Z_{\sigma^2})\}^2 \to d_2^2(T, Z_{\sigma^2}) = D(T)$$
as $r \to \infty$. Similarly, $d_2^2(T, Z_{\sigma^2}) \le \{d_2(T, T_{k_r}) + d_2(T_{k_r}, Z_{\sigma^2})\}^2$, yielding the opposite inequality. This proves the first part of the proposition.
For the second part, it suffices to observe that $T_{k_r} + T'_{k_r} \xrightarrow{d} T + T'$ as $r \to \infty$, and $E(T_{k_r} + T'_{k_r})^2 \to E(T + T')^2$, and then use the same argument as in the first part of the proposition.
Combining Propositions 2.4 and 2.5, as described in Section 1, the proof of Theorem 1.2 is now complete.
3 Convergence of $d_r$ for general $r$
The subadditive inequality (4) arises in part from a moment inequality; that is, if $X_1$ and $X_2$ are independent with mean zero, then $E|X_1 + X_2|^r \le E|X_1|^r + E|X_2|^r$, for $r = 2$. Similar results imply that for $r \ge 2$, we have $\lim_{n\to\infty} d_r(S_n, Z_{\sigma^2}) = 0$. First, we prove the following lemma:
Lemma 3.1 Consider independent random variables $V_1, V_2, \dots$ and $W_1, W_2, \dots$, where for some $r \ge 2$ and for all $i$, $E|V_i|^r < \infty$ and $E|W_i|^r < \infty$. Then for any $m$, there exists a constant $c(r)$ such that
$$d_r^r(V_1 + \dots + V_m, W_1 + \dots + W_m) \le c(r) \Bigl\{ \sum_{i=1}^m d_r^r(V_i, W_i) + \Bigl( \sum_{i=1}^m d_2^2(V_i, W_i) \Bigr)^{r/2} \Bigr\}.$$
Proof We consider independent $U_i \sim U(0, 1)$, and set $V_i^* = F_{V_i}^{-1}(U_i)$ and $W_i^* = F_{W_i}^{-1}(U_i)$. Then
$$\begin{aligned}
d_r^r(V_1 + \dots + V_m, W_1 + \dots + W_m) &\le E\Bigl| \sum_{i=1}^m (V_i^* - W_i^*) \Bigr|^r \\
&\le c(r) \Bigl\{ \sum_{i=1}^m E|V_i^* - W_i^*|^r + \Bigl( \sum_{i=1}^m E|V_i^* - W_i^*|^2 \Bigr)^{r/2} \Bigr\},
\end{aligned}$$
as required. This final line is an application of Rosenthal's inequality (Petrov 1995, Theorem 2.9) to the sequence $(V_i^* - W_i^*)$.
Using Lemma 3.1, we establish the following theorem.
Theorem 3.2 Let $X_1, X_2, \dots$ be independent and identically distributed random variables with mean zero, variance $\sigma^2 > 0$ and $E|X_1|^r < \infty$ for some $r \ge 2$. If $S_n = (X_1 + \dots + X_n)/\sqrt{n}$, then
$$\lim_{n\to\infty} d_r(S_n, Z_{\sigma^2}) = 0.$$
Proof Theorem 1.2 covers the case of $r = 2$, so we need only consider $r > 2$. We use a scaled version of Lemma 3.1 twice. First, we use $V_i = X_i$, $W_i \sim N(0, \sigma^2)$ and $m = n$, in order to deduce that, by monotonicity of the $r$-norms:
$$\begin{aligned}
d_r^r(S_n, Z_{\sigma^2}) &\le c(r) \bigl\{ n^{1 - r/2} d_r^r(X_1, Z_{\sigma^2}) + d_2^2(X_1, Z_{\sigma^2})^{r/2} \bigr\} \\
&\le c(r) \bigl( n^{1 - r/2} + 1 \bigr) d_r^r(X_1, Z_{\sigma^2}),
\end{aligned}$$
so that $d_r^r(S_n, Z_{\sigma^2})$ is uniformly bounded in $n$, by $K$, say. Then, for general $n$, define $N = \lceil \sqrt{n} \rceil$, take $m = \lceil n/N \rceil$, and $u = n - (m - 1)N \le N$. In Lemma 3.1, take
$$\begin{aligned}
V_i &= X_{(i-1)N+1} + \dots + X_{iN}, \quad \text{for } i = 1, \dots, m - 1, \\
V_m &= X_{(m-1)N+1} + \dots + X_n,
\end{aligned}$$
and $W_i \sim N(0, N\sigma^2)$ for $i = 1, \dots, m - 1$, $W_m \sim N(0, u\sigma^2)$ independently. Now the uniform bound above gives, on rescaling,
$$d_r^r(V_i, W_i) = N^{r/2} d_r^r(S_N, Z_{\sigma^2}) \le N^{r/2} K \quad \text{for } i = 1, \dots, m - 1$$
and $d_r^r(V_m, W_m) = u^{r/2} d_r^r(S_u, Z_{\sigma^2}) \le N^{r/2} K$. Further $d_2^2(V_i, W_i) = N d_2^2(S_N, Z_{\sigma^2})$ for $i = 1, \dots, m - 1$ and $d_2^2(V_m, W_m) = u\,d_2^2(S_u, Z_{\sigma^2}) \le N d_2^2(S_1, Z_{\sigma^2})$. Hence, using Lemma 3.1 again, we obtain
$$\begin{aligned}
d_r^r(S_n, Z_{\sigma^2}) &= \frac{1}{n^{r/2}} d_r^r(V_1 + \dots + V_m, W_1 + \dots + W_m) \\
&\le \frac{c(r)}{n^{r/2}} \Bigl\{ \sum_{i=1}^m d_r^r(V_i, W_i) + \Bigl( \sum_{i=1}^m d_2^2(V_i, W_i) \Bigr)^{r/2} \Bigr\} \\
&\le c(r) \Bigl\{ mK \frac{N^{r/2}}{n^{r/2}} + \Bigl( \frac{N(m - 1)}{n} d_2^2(S_N, Z_{\sigma^2}) + \frac{N}{n} d_2^2(S_1, Z_{\sigma^2}) \Bigr)^{r/2} \Bigr\} \\
&\le c(r) \Bigl\{ \frac{mK}{(m - 1)^{r/2}} + \Bigl( d_2^2(S_N, Z_{\sigma^2}) + \frac{1}{m - 1} d_2^2(S_1, Z_{\sigma^2}) \Bigr)^{r/2} \Bigr\}.
\end{aligned}$$
This converges to zero since $\lim_{n\to\infty} d_2(S_N, Z_{\sigma^2}) = 0$.
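The blocking step in the proof of Theorem 3.2 splits $n$ summands into $m - 1$ blocks of length $N$ and one final block of length $u$. The sketch below computes these sizes, assuming that $N$ and $m$ are obtained by rounding up (so $N$ is the least integer with $N^2 \ge n$ and $m$ the least integer with $mN \ge n$); the function name is hypothetical.

```python
from math import isqrt


def block_sizes(n):
    """Return (N, m, u): m - 1 blocks of length N and a final block of length u."""
    root = isqrt(n)
    N = root if root * root == n else root + 1   # N = ceil(sqrt(n)), an assumed rounding
    m = -(-n // N)                               # m = ceil(n / N), an assumed rounding
    u = n - (m - 1) * N                          # length of the final block
    return N, m, u


for n in (1, 2, 10, 16, 17, 1000):
    N, m, u = block_sizes(n)
    print(n, N, m, u)
```

With this rounding, $1 \le u \le N$ and $(m - 1)N + u = n$ for every $n \ge 1$, and both $N$ and $m$ grow like $\sqrt{n}$, which is what the final bound in the proof relies on.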
4 Strengthening subadditivity
Under certain conditions, we obtain a rate for the convergence in Theorem 1.2. Equation (1) shows that $D(T_k)$ is decreasing. Since $D(T_k)$ is bounded below, the difference sequence $D(T_k) - D(T_{k+1})$ converges to zero. As in Johnson and Barron (2004) we examine this difference sequence, to show that its convergence implies convergence of $D(T_k)$ to zero.
Further, in the spirit of Johnson and Barron (2004), we hope that if the difference sequence is small, then equality 'nearly' holds in Equation (5), and so the functions $h, h_1, h_2$ are 'nearly' linear. This implies that if $\mathrm{Cov}(X, Y)$ is close to its maximum, then $X$ is close to $h(Y)$ in the $L^2$ sense.
Following del Barrio et al. (1999), we define a new distance quantity $D^*(X) = \inf_{m,s^2} d_2^2(X, Z_{m,s^2})$. Notice that $D(X) = 2\sigma^2 - 2\sigma k \le 2\sigma^2$, where $k = \int_0^1 F_X^{-1}(x)\Phi^{-1}(x)\,dx$. This follows since $F_X^{-1}$ and $\Phi^{-1}$ are increasing functions, so $k \ge 0$ by Chebyshev's rearrangement lemma. Using results of del Barrio et al. (1999), it follows that
$$D^*(X) = \sigma^2 - k^2 = D(X) - \frac{D(X)^2}{4\sigma^2},$$
and convergence of $D(S_n)$ to zero is equivalent to convergence of $D^*(S_n)$ to zero.
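The relation between $D$, $D^*$ and $k$ can be checked numerically in a simple case. The sketch below assumes $X \sim U(-1/2, 1/2)$, an illustrative choice for which $F_X^{-1}(p) = p - 1/2$ and $\sigma^2 = 1/12$, and approximates $k$ by a midpoint rule. For this $X$, the exact value is $k = E\{Z\Phi(Z)\} = E\phi(Z) = 1/(2\sqrt{\pi})$ with $Z$ standard normal. The function name is hypothetical.

```python
from math import pi, sqrt
from statistics import NormalDist


def k_uniform(num_points=20000):
    """Midpoint-rule approximation of k = int_0^1 F_X^{-1}(p) Phi^{-1}(p) dp, X ~ U(-1/2, 1/2)."""
    std = NormalDist()
    h = 1.0 / num_points
    total = 0.0
    for i in range(num_points):
        p = (i + 0.5) * h
        total += (p - 0.5) * std.inv_cdf(p)   # F_X^{-1}(p) = p - 1/2
    return total * h


sigma2 = 1.0 / 12.0                  # variance of U(-1/2, 1/2)
sigma = sqrt(sigma2)
k = k_uniform()
D = 2 * sigma2 - 2 * sigma * k       # D(X) = 2 sigma^2 - 2 sigma k
D_star = sigma2 - k**2               # D*(X) = sigma^2 - k^2
print(k, 1 / (2 * sqrt(pi)))
print(D, D_star, D - D**2 / (4 * sigma2))
```

The last two printed values agree, since substituting $k = \sigma - D(X)/(2\sigma)$ into $\sigma^2 - k^2$ gives $D(X) - D(X)^2/(4\sigma^2)$.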
Proposition 4.1 Let $X_1$ and $X_2$ be independent and identically distributed random variables with mean $\mu$, variance $\sigma^2 > 0$ and densities (with respect to Lebesgue measure). Defining $g(u) = \Phi_{\mu,\sigma^2}^{-1} \circ F_{(X_1 + X_2)/\sqrt{2}}(u)$, if the derivative $g'(u) \ge c$ for all $u$ then
$$D\Bigl(\frac{X_1 + X_2}{\sqrt{2}}\Bigr) \le \Bigl(1 - \frac{c}{2}\Bigr) D(X_1) + \frac{c\,D(X_1)^2}{8\sigma^2} \le \Bigl(1 - \frac{c}{4}\Bigr) D(X_1).$$
Proof As before, translation invariance allows us to take $EX_i = 0$. For random variables $X, Y$, we consider the difference term in Equation (3) and write $g(u) = F_Y^{-1} \circ F_X(u)$, and $h(u) = g^{-1}(u)$. The function $k(x, y) = -\{x - h(y)\}^2$ is quasi-monotone and induces the measure $d\mu_k(x, y) = 2h'(y)\,dx\,dy$. Taking $H_1(x, y) = P(X \le x, Y \le y)$ and $H_2(x, y) = \min\{F_X(x), F_Y(y)\}$ in Lemma 2.2 implies that
$$E\{X - h(Y)\}^2 = 2 \int\!\!\int h'(y) \{H_2(x, y) - H_1(x, y)\}\,dx\,dy,$$
since $E\{X^* - h(Y^*)\}^2 = 0$. By assumption $h'(y) \le 1/c$, so
$$E\{X - h(Y)\}^2 \le \frac{2}{c} \{\mathrm{Cov}(X^*, Y^*) - \mathrm{Cov}(X, Y)\}.$$
Again take $Y_1^*, Y_2^*$ independent $N(0, \sigma^2)$ and set $X_i^* = F_{X_i}^{-1} \circ F_{Y_i}(Y_i^*) = h_i(Y_i^*)$. Then define $Y^* = Y_1^* + Y_2^*$ and take $X^* = F_{X_1 + X_2}^{-1} \circ F_{Y_1 + Y_2}(Y^*)$. Then there exist $a$ and $b$ such that
$$\begin{aligned}
d_2^2(X_1, Y_1) + d_2^2(X_2, Y_2) - d_2^2(X_1 + X_2, Y_1 + Y_2)
&= E(X_1^* + X_2^* - Y_1^* - Y_2^*)^2 - E(X^* - Y^*)^2 \\
&= 2\,\mathrm{Cov}(X^*, Y^*) - 2\,\mathrm{Cov}(X_1^* + X_2^*, Y_1^* + Y_2^*) \\
&\ge c\,E\{X_1^* + X_2^* - h(Y_1^* + Y_2^*)\}^2 \\
&= c\,E\{h_1(Y_1^*) + h_2(Y_2^*) - h(Y_1^* + Y_2^*)\}^2 \\
&\ge c\,E\{h_1(Y_1^*) - aY_1^* - b\}^2 \ge c\,D^*(X_1),
\end{aligned}$$
where the penultimate inequality follows by Equation (6). Recall that $D(X) \le 2\sigma^2$, so that $D^*(X) = D(X) - D(X)^2/(4\sigma^2) \ge D(X)/2$. The result follows on rescaling.
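To see how the two bounds in Proposition 4.1 compare, one can iterate the sharper one and set it against the geometric rate $(1 - c/4)^k$. The sketch below does so under the hypothetical assumption that the derivative condition holds with the same constant $c$ at every stage; the values of $c$ and $\sigma^2$, and the function names, are illustrative only.

```python
def contraction_step(D, c, sigma2):
    """Sharper upper bound from Proposition 4.1 on D((X_1 + X_2)/sqrt(2)), given D = D(X_1)."""
    return (1 - c / 2) * D + c * D**2 / (8 * sigma2)


def iterate_bound(D0, c, sigma2, steps):
    """Iterate the bound, assuming (hypothetically) that g' >= c holds at every stage."""
    values = [D0]
    for _ in range(steps):
        values.append(contraction_step(values[-1], c, sigma2))
    return values


sigma2, c = 1.0, 0.5            # illustrative values only
D0 = 2 * sigma2                 # the largest possible value, since D(X) <= 2 sigma^2
for k, value in enumerate(iterate_bound(D0, c, sigma2, 8)):
    print(k, value, (1 - c / 4) ** k * D0)
```

The two columns coincide at the first step when $D_0 = 2\sigma^2$; afterwards the iterated bound is strictly smaller, and its per-step factor approaches $1 - c/2$ as the quadratic term becomes negligible.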
We briefly discuss the strength of the condition imposed. If $X$ has mean zero, distribution function $F_X$ and continuous density $f_X$, define the scale-invariant quantity
$$C(X) = \inf_u \bigl(\Phi_{\sigma^2}^{-1} \circ F_X\bigr)'(u) = \inf_{p \in (0,1)} \frac{f_X(F_X^{-1}(p))}{\phi_{\sigma^2}(\Phi_{\sigma^2}^{-1}(p))} = \inf_{p \in (0,1)} \frac{\sigma f_X(F_X^{-1}(p))}{\phi(\Phi^{-1}(p))}.$$
We want to understand when $C(X) > 0$.
Example 4.2 If $X \sim U(0, 1)$, then $C(X) = 1/\bigl(\sqrt{12}\,\sup_x \phi(x)\bigr) = \sqrt{\pi/6}$.
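The value in Example 4.2 can be reproduced by evaluating the ratio $\sigma f_X(F_X^{-1}(p))/\phi(\Phi^{-1}(p))$ on a grid of $p$ values. The sketch below is a crude grid search rather than a true infimum, and the function name and grid size are assumptions made for illustration.

```python
from math import pi, sqrt
from statistics import NormalDist


def C_on_grid(density_at_quantile, sigma, num_points=9999):
    """Approximate C(X) = inf_p sigma * f_X(F_X^{-1}(p)) / phi(Phi^{-1}(p)) on an interior grid."""
    std = NormalDist()
    best = float("inf")
    for i in range(1, num_points + 1):
        p = i / (num_points + 1)
        ratio = sigma * density_at_quantile(p) / std.pdf(std.inv_cdf(p))
        best = min(best, ratio)
    return best


# X ~ U(0, 1): f_X(F_X^{-1}(p)) = 1 for every p, and sigma = 1 / sqrt(12).
c_uniform = C_on_grid(lambda p: 1.0, 1.0 / sqrt(12.0))
print(c_uniform, sqrt(pi / 6.0))
```

For the uniform law the minimum is attained at $p = 1/2$, where $\phi(\Phi^{-1}(p))$ is largest, and the default grid contains that point.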
Lemma 4.3 If $X$ has mean zero and variance $\sigma^2$ then $C(X)^2 \le \sigma^2/(\sigma^2 + \mathrm{median}(X)^2)$.
Proof By the Mean Value Inequality, for all $p$,
$$|\Phi_{\sigma^2}^{-1}(p)| = |\Phi_{\sigma^2}^{-1}(p) - \Phi_{\sigma^2}^{-1}(1/2)| \ge C(X)\,|F_X^{-1}(p) - F_X^{-1}(1/2)|,$$
so that
$$\begin{aligned}
\sigma^2 + F_X^{-1}(1/2)^2 &= \int_0^1 F_X^{-1}(p)^2\,dp + F_X^{-1}(1/2)^2 = \int_0^1 \{F_X^{-1}(p) - F_X^{-1}(1/2)\}^2\,dp \\
&\le \frac{1}{C(X)^2} \int_0^1 \Phi_{\sigma^2}^{-1}(p)^2\,dp = \frac{\sigma^2}{C(X)^2}.
\end{aligned}$$
In general we are concerned with the rate at which $f_X(x) \to 0$ at the edges of the support.
Lemma 4.4 If for some $\epsilon > 0$,
$$f_X(F_X^{-1}(p)) \ge c(1 - p)^{1 - \epsilon} \quad \text{as } p \to 1, \tag{8}$$
then $\lim_{p \to 1} f_X(F_X^{-1}(p))/\phi(\Phi^{-1}(p)) = \infty$. Correspondingly, if
$$f_X(F_X^{-1}(p)) \ge c\,p^{1 - \epsilon} \quad \text{as } p \to 0, \tag{9}$$
then $\lim_{p \to 0} f_X(F_X^{-1}(p))/\phi(\Phi^{-1}(p)) = \infty$.
Proof Simply note that by the Mills ratio (Shorack and Wellner 1986, p.850), as $x \to \infty$, $1 - \Phi(x) \sim \phi(x)/x$, so that as $p \to 1$, $\phi(\Phi^{-1}(p)) \sim (1 - p)\Phi^{-1}(p) \sim (1 - p)\sqrt{-2\log(1 - p)}$.
Example 4.5
1. The density of the $n$-fold convolution of $U(0, 1)$ random variables is given by $f_X(x) = x^{n-1}/(n - 1)!$ for $0 < x < 1$, hence $F_X^{-1}(p) = (n!\,p)^{1/n}$, and $f_X(F_X^{-1}(p)) = n/(n!)^{1/n}\,p^{(n-1)/n}$, so that Equation (9) holds.
2. For an Exp(1) random variable, $f_X(F_X^{-1}(p)) = 1 - p$, so that Equation (8) fails and $C(X) = 0$.
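The Exp(1) case in Example 4.5 can be illustrated numerically: writing $q = 1 - p$, the ratio $f_X(F_X^{-1}(p))/\phi(\Phi^{-1}(p))$ equals $q/\phi(\Phi^{-1}(1 - q))$, which is the Mills ratio evaluated at $\Phi^{-1}(1 - q)$ and tends to zero as $q \to 0$. The sketch below prints it next to the leading-order approximation $1/\sqrt{-2\log q}$ suggested by the proof of Lemma 4.4; the function name is hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist


def exp_ratio(q):
    """f_X(F_X^{-1}(p)) / phi(Phi^{-1}(p)) for X ~ Exp(1) at p = 1 - q; the numerator is q."""
    std = NormalDist()
    z = -std.inv_cdf(q)          # Phi^{-1}(1 - q), by symmetry of the normal law
    return q / std.pdf(z)


for j in range(1, 9):
    q = 10.0 ** (-j)
    print(q, exp_ratio(q), 1.0 / sqrt(-2.0 * log(q)))
```

The decay is only logarithmic in $q$, so the ratio approaches zero slowly, but the infimum over $p$ is still zero.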
To obtain bounds on $D(S_n)$ as $n \to \infty$, we need to control the sequence $C(S_n)$. Motivated by properties of the (seemingly related) Poincaré constant, we conjecture that $C((X_1 + X_2)/\sqrt{2}) \ge C(X_1)$ for independent and identically distributed $X_i$. If this is true and $C(X) = c$ then $C(S_n) \ge c$ for all $n$.
Assuming that $C(S_n) \ge c$ for all $n$, note that $D(T_k) \le (1 - c/4)^k D(X_1) \le (1 - c/4)^k (2\sigma^2)$. Now
$$D(T_{k+1}) \le D(T_k)(1 - c/2)\Bigl(1 + \frac{c\,D(T_k)}{8\sigma^2(1 - c/2)}\Bigr),$$
so
$$\prod_{k=0}^{\infty} \Bigl(1 + \cdots$$