Thomas S. Ferguson
where the summation is over the group \Pi_m of all m! permutations of an m-vector, is obviously symmetric in its arguments, and has the same expectation under P as does f.
U_n = U_n(h) = \frac{1}{n!/(n-m)!} \sum_{P_{m,n}} h(X_{i_1}, \ldots, X_{i_m})    (3)

where the summation is over the set P_{m,n} of all n!/(n-m)! permutations (i_1, i_2, \ldots, i_m) of size m chosen from (1, 2, \ldots, n). If the kernel, h, is symmetric in its arguments, U_n has the equivalent form
U_n = U_n(h) = \frac{1}{\binom{n}{m}} \sum_{C_{m,n}} h(X_{i_1}, \ldots, X_{i_m})    (4)

where the summation is over the set C_{m,n} of all \binom{n}{m} combinations of m integers, i_1 < i_2 < \cdots < i_m, chosen from (1, 2, \ldots, n).
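For a symmetric kernel, (4) can be computed directly by averaging the kernel over all \binom{n}{m} combinations. A minimal sketch in Python (the kernels and data here are illustrative, not from the text):

```python
from itertools import combinations
from statistics import mean

def u_statistic(h, xs, m):
    """Average a symmetric kernel h of degree m over all C(n, m)
    combinations of the sample xs, as in (4)."""
    return mean(h(*c) for c in combinations(xs, m))

# The degree-1 kernel h(x) = x gives the sample mean.
xs = [1.0, 2.0, 4.0, 5.0]
print(u_statistic(lambda x: x, xs, 1))         # sample mean: 3.0
# The degree-2 kernel h(x1, x2) = x1 * x2 estimates the squared mean;
# here it averages the 6 pairwise products.
print(u_statistic(lambda a, b: a * b, xs, 2))
```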
2. Examples. 1. Moments. If P is the set of all distributions on the real line with finite mean, then the mean, \mu = \mu(P) = \int x \, dP(x), is an estimable parameter of degree m = 1, because f(X_1) = X_1 is an unbiased estimate of \mu. The corresponding U-statistic is the sample mean, U_n = \bar{X}_n = (1/n) \sum_1^n X_i. Similarly, if P is the set of all distributions on the real line with finite kth moment, then the kth moment, \mu_k = \int x^k \, dP(x), is an estimable parameter of degree 1 with U-statistic (1/n) \sum_1^n X_i^k.
How about estimating the square of the mean, \theta(P) = \mu^2? Since E(X_1 X_2) = \mu^2, it is also an estimable parameter with degree at most 2. It is easy to show it cannot have degree 1 (Exercise 1), so it has degree 2. The U-statistic U_n of (3) and (4) corresponding to h(x_1, x_2) = x_1 x_2 is

U_n = \frac{1}{n(n-1)} \sum_{i \ne j} X_i X_j = \frac{2}{n(n-1)} \sum_{i<j} X_i X_j.    (5)
If P is taken to be the set of all distributions on the real line with finite second moment, then the variance, \sigma^2 = \mu_2 - \mu^2, is also estimable of degree 2, since we can estimate \mu_2 by X_1^2 and \mu^2 by X_1 X_2:

E(X_1^2 - X_1 X_2) = \sigma^2.    (6)
However the kernel, f(x_1, x_2) = x_1^2 - x_1 x_2, is not symmetric in x_1 and x_2. The corresponding symmetric kernel given by (2) is the average,

h(x_1, x_2) = \frac{(x_1^2 - x_1 x_2) + (x_2^2 - x_2 x_1)}{2} = \frac{(x_1 - x_2)^2}{2}.    (7)

This leads to the U-statistic,

U_n = \frac{2}{n(n-1)} \sum_{i<j} \frac{(X_i - X_j)^2}{2} = \frac{1}{n-1} \sum_1^n (X_i - \bar{X})^2 = s_x^2,    (8)

the unbiased sample variance.
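The identity (8) is easy to confirm numerically. A small check in Python (the sample is invented for illustration):

```python
from itertools import combinations
from statistics import mean, variance

def u_statistic(h, xs, m):
    """Average a symmetric kernel h of degree m over all combinations, as in (4)."""
    return mean(h(*c) for c in combinations(xs, m))

# The U-statistic with kernel (7) equals the unbiased sample variance.
xs = [2.0, 3.0, 5.0, 7.0, 11.0]
u = u_statistic(lambda a, b: (a - b) ** 2 / 2, xs, 2)
print(u, variance(xs))  # the two agree, confirming (8)
```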
where R_i^+ is the rank of |Z_i| among |Z_1|, |Z_2|, \ldots, |Z_n|. Although it is not a U-statistic, one can show (Exercise 4) that W_n^+ is a linear combination of two U-statistics,

W_n^+ = \sum_i I(Z_i > 0) + \sum_{i<j} I(Z_i + Z_j > 0),    (10)
and writing it in this way gives some insight into its behavior. The first U-statistic is based on the kernel, h(z) = I(z > 0). The U-statistic itself is U_n^{(1)} = \frac{1}{n} \sum_1^n I(Z_i > 0). This is the U-statistic used for the sign test. The second U-statistic is based on the kernel, h(z_1, z_2) = I(z_1 + z_2 > 0), and the corresponding U-statistic is U_n^{(2)} = \binom{n}{2}^{-1} \sum_{i<j} I(Z_i + Z_j > 0). Thus,

W_n^+ = n U_n^{(1)} + \binom{n}{2} U_n^{(2)}.    (11)
For large n the second term dominates the first, so asymptotically W_n^+ behaves like n^2 U_n^{(2)}/2. The Wilcoxon signed rank test rejects H_0 if W_n^+ is too large, and this is asymptotically equivalent to the test that rejects if U_n^{(2)} is too large.
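The decomposition (10) can be checked directly: compute W_n^+ from the signed ranks and compare it with the sign-test count plus the pairwise count. A sketch (the data are invented):

```python
from itertools import combinations

# A small numeric check of decomposition (10)/(11).
zs = [1.2, -0.4, 2.5, -3.1, 0.7, 1.9]
n = len(zs)

# W_n^+ = sum of the ranks R_i^+ of |Z_i| over the positive Z_i.
order = sorted(range(n), key=lambda i: abs(zs[i]))
rank = {i: r + 1 for r, i in enumerate(order)}
w_plus = sum(rank[i] for i in range(n) if zs[i] > 0)

# Right side of (10): the sign-test count plus the pairwise count.
u1_count = sum(z > 0 for z in zs)
u2_count = sum(zi + zj > 0 for zi, zj in combinations(zs, 2))
print(w_plus, u1_count + u2_count)  # the two sides of (10) agree
```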
3. Testing Symmetry. In some situations, it is important to test for symmetry about an unknown center. Here is one method based on the observation that for a sample of size 3, X_1, X_2, X_3, from a continuous distribution symmetric about a point \mu, P(X_1 > (X_2 + X_3)/2) = P((X_1 - \mu) > ((X_2 - \mu) + (X_3 - \mu))/2) = 1/2. Because of this, f(X_1, X_2, X_3) = \mathrm{sgn}(2X_1 - X_2 - X_3) is an unbiased estimate of \theta(P) = P(2X_1 > X_2 + X_3) - P(2X_1 < X_2 + X_3). Here, \mathrm{sgn}(x) represents the sign function, which is 1 if x > 0, 0 if x = 0, and -1 if x < 0. When P is symmetric, \theta(P) has value zero. The corresponding symmetric kernel is

h(x_1, x_2, x_3) = \frac{1}{3}[\mathrm{sgn}(2x_1 - x_2 - x_3) + \mathrm{sgn}(2x_2 - x_1 - x_3) + \mathrm{sgn}(2x_3 - x_1 - x_2)].    (12)
This is an example of a kernel of degree 3. The hypothesis of symmetry is rejected if the
corresponding U-statistic is too large in absolute value. One can easily show that
h(x_1, x_2, x_3) = \frac{1}{3}\,\mathrm{sgn}(\mathrm{median}(x_1, x_2, x_3) - \mathrm{mean}(x_1, x_2, x_3)).    (13)
Thus the validity of the test also follows from the observation that for a sample of size
three from a symmetric distribution, the sample median is equally likely to be above the
sample mean as below it.
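The equivalence of (12) and (13) is easy to verify on particular triples. A quick Python check (the triples are arbitrary):

```python
from statistics import mean, median

def sgn(x):
    return (x > 0) - (x < 0)

def h12(x1, x2, x3):
    """Symmetric kernel (12)."""
    return (sgn(2*x1 - x2 - x3) + sgn(2*x2 - x1 - x3) + sgn(2*x3 - x1 - x2)) / 3

def h13(x1, x2, x3):
    """The equivalent form (13)."""
    return sgn(median([x1, x2, x3]) - mean([x1, x2, x3])) / 3

# The two forms agree on every triple; a few spot checks:
for t in [(0.0, 1.0, 5.0), (-2.0, 0.5, 0.6), (3.0, 1.0, 2.0)]:
    print(t, h12(*t), h13(*t))
```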
4. Measures of Association. For continuous probability distributions in 2 dimensions, there are several measures of dependence, or association, the simplest of which is perhaps Kendall's tau. Two vectors (x_1, y_1) and (x_2, y_2) are said to be concordant if x_1 < x_2 and y_1 < y_2, or if x_2 < x_1 and y_2 < y_1; in other words, if the line joining the points has positive slope. If the line joining the points has negative slope, the points are said to be discordant.
Suppose (X_1, Y_1) and (X_2, Y_2) are independently distributed according to a distribution F(x, y) in the plane. If the probability of concordance, P(X_1 < X_2, Y_1 < Y_2) + P(X_2 < X_1, Y_2 < Y_1), is bigger than 1/2, there is a positive association between X and Y. If it is less than 1/2, there is negative association. This leads to a measure of association called Kendall's \tau, defined as

\tau = 2[P(X_1 < X_2, Y_1 < Y_2) + P(X_2 < X_1, Y_2 < Y_1)] - 1 = 4P(X_1 < X_2, Y_1 < Y_2) - 1.    (14)
The corresponding U-statistic is known as Kendall's coefficient of rank correlation. This was seen in Exercise 5.7 of Ferguson (1996) to have an asymptotically normal distribution, when suitably normalized, in the case where X and Y are independent. We will see that the asymptotic distribution is normal for general dependent X and Y.
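The sample analogue of (14) replaces the probabilities with pair counts: concordant minus discordant pairs, divided by \binom{n}{2}. A small illustrative sketch (the data are invented):

```python
from itertools import combinations

def kendall_tau(pairs):
    """Sample Kendall tau: (concordant - discordant) / C(n, 2),
    the U-statistic with kernel sgn((x1 - x2)(y1 - y2))."""
    c = d = 0
    for (x1, y1), (x2, y2) in combinations(pairs, 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            c += 1          # concordant: positive slope
        elif s < 0:
            d += 1          # discordant: negative slope
    n = len(pairs)
    return (c - d) / (n * (n - 1) / 2)

data = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
print(kendall_tau(data))  # (8 - 2) / 10 = 0.6
```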
Another measure of association in 2 dimensions is given by Spearman's rho, defined as

\rho = 12\,P(X_1 < X_2, Y_1 < Y_3) - 3,    (16)

where (X_1, Y_1), (X_2, Y_2) and (X_3, Y_3) are independently distributed according to F(x, y). It also has the properties of a correlation coefficient, being zero when the variables are independent and always being between -1 and 1. In fact, one can show that \rho is simply the correlation coefficient between the random variables F(X, \infty) and F(\infty, Y). It is clear that \rho is also an estimable parameter with kernel of degree 3, h((x_1, y_1), (x_2, y_2), (x_3, y_3)) = 12\,I(x_1 < x_2, y_1 < y_3) - 3. The symmetrized version has 6 terms. The corresponding U-statistic is related to the rank statistic of Example 12.5 of Ferguson (1996), which was seen to have an asymptotically normal distribution under the hypothesis of independence.
Lemma 1. For P \in \mathcal{P} and (i_1, \ldots, i_m) and (j_1, \ldots, j_m) in C_{m,n},

Cov(h(X_{i_1}, \ldots, X_{i_m}), h(X_{j_1}, \ldots, X_{j_m})) = Cov(h_c(X_1, \ldots, X_c), h(X_1, \ldots, X_m)) = \sigma_c^2,    (21)

where c is the number of integers common to (i_1, \ldots, i_m) and (j_1, \ldots, j_m).
Proof. If (i_1, \ldots, i_m) and (j_1, \ldots, j_m) have c elements in common, then

Cov(h(X_{i_1}, \ldots, X_{i_m}), h(X_{j_1}, \ldots, X_{j_m})) = E[(h(X_1, \ldots, X_c, X_{c+1}, \ldots, X_m) - \theta)(h(X_1, \ldots, X_c, X'_{c+1}, \ldots, X'_m) - \theta)],    (22)

where X_1, \ldots, X_m, X'_{c+1}, \ldots, X'_m are i.i.d. Conditionally, given X_1, \ldots, X_c, the two terms in this expectation are independent, so taking the expectation of the conditional expectation, we have

Cov(h(X_{i_1}, \ldots, X_{i_m}), h(X_{j_1}, \ldots, X_{j_m})) = E[(h_c(X_1, \ldots, X_c) - \theta)(h_c(X_1, \ldots, X_c) - \theta)] = \sigma_c^2.    (23)
If \sigma_m^2 < \infty, then \sigma_i^2 < \infty for i \le m. For large n, the first term of the sum dominates, since it is of the largest order. The coefficient of \sigma_1^2 is m\binom{n-m}{m-1}/\binom{n}{m} \sim m^2/n.
In the example of estimating a variance with kernel (7), h(x_1, x_2) = (x_1 - x_2)^2/2, we find h_1(x_1) = E(X - x_1)^2/2 = \sigma^2/2 + (x_1 - \mu)^2/2. Then \sigma_1^2 = Var(h_1(X_1)) = Var((X - \mu)^2/2) = (\mu_4 - \sigma^4)/4, and \sigma_2^2 = Var((X_1 - X_2)^2/2) = (\mu_4 + \sigma^4)/2. From this we find

Var(U_n) = \frac{2}{n(n-1)}[2(n-2)\sigma_1^2 + \sigma_2^2] = \frac{\mu_4 - \sigma^4}{n} + O(n^{-2}).    (26)
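The exact form of this variance, Var(U_n) = (2/(n(n-1)))[2(n-2)\sigma_1^2 + \sigma_2^2], can be checked by brute force on a distribution with finitely many atoms, where everything is exactly enumerable. A sketch (the 3-point distribution is invented):

```python
from itertools import product, combinations
from statistics import mean

# Check Var(U_n) = (2/(n(n-1)))[2(n-2) sigma_1^2 + sigma_2^2]
# for the variance kernel h(x1, x2) = (x1 - x2)^2 / 2.
atoms = [0.0, 1.0, 3.0]          # X uniform on these atoms
p = 1.0 / len(atoms)
n = 4

def h(a, b):
    return (a - b) ** 2 / 2

# h1(x) = E h(x, X); theta = E h(X1, X2) = sigma^2 of the distribution.
h1 = {x: sum(h(x, y) * p for y in atoms) for x in atoms}
theta = sum(h1[x] * p for x in atoms)
sig1 = sum((h1[x] - theta) ** 2 * p for x in atoms)
sig2 = sum((h(x, y) - theta) ** 2 * p * p for x in atoms for y in atoms)

# Exact Var(U_n): enumerate all len(atoms)^n equally likely samples.
def U(xs):
    return mean(h(a, b) for a, b in combinations(xs, 2))

EU = sum(U(xs) * p ** n for xs in product(atoms, repeat=n))
VarU = sum((U(xs) - EU) ** 2 * p ** n for xs in product(atoms, repeat=n))

formula = 2 * (2 * (n - 2) * sig1 + sig2) / (n * (n - 1))
print(VarU, formula)  # the two agree (up to rounding)
```

Note that EU recovers theta exactly, illustrating unbiasedness along the way.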
Theorem 2. If \sigma_m^2 < \infty, then \sqrt{n}(U_n - \theta) \xrightarrow{L} N(0, m^2\sigma_1^2).

Proof. Let

\hat{U}_n = \frac{m}{n} \sum_{k=1}^n (h_1(X_k) - \theta).    (27)
Then since the m(h_1(X_k) - \theta) are i.i.d. with mean 0 and variance m^2\sigma_1^2, the central limit theorem implies that \sqrt{n}\,\hat{U}_n \xrightarrow{L} N(0, m^2\sigma_1^2). We complete the proof by showing that \sqrt{n}(U_n - \theta) and \sqrt{n}\,\hat{U}_n are asymptotically equivalent and so have the same limiting distribution. For this it suffices to show that nE(\hat{U}_n - (U_n - \theta))^2 \to 0.
Expanding the square, nE(\hat{U}_n - (U_n - \theta))^2 = n\,Var(\hat{U}_n) - 2n\,Cov(\hat{U}_n, U_n) + n\,Var(U_n). The first term on the right is equal to m^2\sigma_1^2 and the last term converges to m^2\sigma_1^2 from Theorem 1, so we will be finished when we show nCov(\hat{U}_n, U_n) is equal to m^2\sigma_1^2.

nCov(\hat{U}_n, U_n) = n\,\frac{m}{n}\,\frac{1}{\binom{n}{m}} \sum_{k=1}^n \sum_{j \in C_{m,n}} Cov(h_1(X_k), h(X_{j_1}, \ldots, X_{j_m})).    (29)

The inside covariance is zero if k is not equal to one of the j_i, and it is \sigma_1^2 otherwise, from Lemma 1. For fixed k, the number of sets \{j_1, \ldots, j_m\} containing k is \binom{n-1}{m-1}, and since m\binom{n-1}{m-1}/\binom{n}{m} = m^2/n, this gives nCov(\hat{U}_n, U_n) = m^2\sigma_1^2, as required.
Application. As an application of this theorem, consider the U-statistic, U_n^{(2)}, with kernel, h(x_1, x_2) = I(x_1 + x_2 > 0), of degree m = 2, associated with the Wilcoxon signed rank test. The parameter estimated is \theta = Eh(X_1, X_2) = P(X_1 + X_2 > 0). From Lemma 1, we have

\sigma_1^2 = P(X_1 + X_2 > 0, X_1 + X_3 > 0) - \theta^2.

Under the null hypothesis that the distribution P is continuous and symmetric about 0, we have \theta = 1/2 and P(X_1 + X_2 > 0, X_1 + X_3 > 0) = P(X_1 > -X_2, X_1 > -X_3) = 1/3, since by symmetry this is just the probability that of three i.i.d. random variables, the first is the largest. Therefore, under the null hypothesis, \sigma_1^2 = (1/3) - (1/2)^2 = 1/12, and since m = 2, Theorem 2 gives

\sqrt{n}(U_n^{(2)} - 1/2) \xrightarrow{L} N(0, 1/3).    (32)
This test of the null hypothesis based on U_n^{(2)} is consistent only for alternatives P for which \theta(P) \ne 1/2. In Exercise 5, you are to find a test that is consistent against all alternatives.
Under the general hypothesis, \sqrt{n}(U_n^{(2)} - \theta) \xrightarrow{L} N(0, 4\sigma_1^2). This may be used to find a confidence interval for \theta. For this purpose though, we need an estimate of \sigma_1^2. Why not use a U-statistic? One can estimate P(X_1 + X_2 > 0, X_1 + X_3 > 0) by the U-statistic associated with the kernel, f(x_1, x_2, x_3) = I(x_1 + x_2 > 0, x_1 + x_3 > 0), or its symmetrized counterpart, h(x_1, x_2, x_3) = (1/3)[f(x_1, x_2, x_3) + f(x_2, x_1, x_3) + f(x_3, x_2, x_1)].
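Putting this together, a confidence interval for \theta takes only a few lines. A sketch (the data are invented; the plug-in variance estimate is the U-statistic suggested above):

```python
from itertools import combinations, permutations
from statistics import mean

# Invented sample; in practice these would be the observed Z_i.
xs = [0.9, -1.3, 2.1, 0.4, -0.2, 1.7, -0.6, 1.1, 0.3, -1.9]
n = len(xs)

# theta_hat: U-statistic with kernel I(x1 + x2 > 0).
theta_hat = mean((a + b > 0) for a, b in combinations(xs, 2))

# Estimate P(X1+X2>0, X1+X3>0) by the symmetrized degree-3 U-statistic.
def f(a, b, c):
    return (a + b > 0) and (a + c > 0)

p_hat = mean(mean(f(*q) for q in permutations(t)) for t in combinations(xs, 3))

sig1_sq = max(p_hat - theta_hat ** 2, 0.0)
half = 1.96 * 2 * (sig1_sq / n) ** 0.5   # sd of U_n is approximately 2*sigma_1/sqrt(n)
print(f"theta_hat = {theta_hat:.3f}, 95% CI = ({theta_hat - half:.3f}, {theta_hat + half:.3f})")
```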
4. Two-Sample Problems. The important extension to k-sample problems for k \ge 2 has been made by Lehmann (1951). The basic ideas are contained in the 2-sample case which is discussed here. Here \mathcal{P} is a family of pairs of probability measures, (F, G).
Consider independent samples, X_1, \ldots, X_{n_1} from F(x) and Y_1, \ldots, Y_{n_2} from G(y). Let h(x_1, \ldots, x_{m_1}, y_1, \ldots, y_{m_2}) be a kernel, and let \mathcal{P} be the set of all pairs such that the expectation

\theta = \theta(F, G) = E_{F,G}\,h(X_1, \ldots, X_{m_1}, Y_1, \ldots, Y_{m_2})    (33)
is finite. As before we may assume without loss of generality that h is symmetric under independent permutations of x_1, \ldots, x_{m_1} and y_1, \ldots, y_{m_2}. The corresponding U-statistic is

U_{n_1,n_2} = U(h) = \frac{1}{\binom{n_1}{m_1}\binom{n_2}{m_2}} \sum h(X_{i_1}, \ldots, X_{i_{m_1}}, Y_{j_1}, \ldots, Y_{j_{m_2}}),    (34)

where the sum is over all \binom{n_1}{m_1}\binom{n_2}{m_2} sets of subscripts such that 1 \le i_1 < \cdots < i_{m_1} \le n_1 and 1 \le j_1 < \cdots < j_{m_2} \le n_2. Again it is clear that U is an unbiased estimate of \theta.
Examples. There are various two-sample tests based on U-statistics of the hypothesis of equality of distributions, H_0: F = G. They differ in their behavior against various alternative hypotheses.
1. A two-sample comparison of means. Taking F and G to be distributions on the real line with finite variances, let h(x_1, y_1) = x_1 - y_1, a kernel of degree (m_1, m_2) = (1, 1). Then \theta = EX - EY. The corresponding U-statistic is

U_{n_1,n_2} = \frac{1}{n_1 n_2} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} (X_i - Y_j) = \bar{X}_{n_1} - \bar{Y}_{n_2}.    (35)
2. The two-sample Wilcoxon (Mann-Whitney) rank-sum test. With F and G continuous, let h(x, y) = I(y < x), a kernel of degree (1, 1), so that \theta = P(Y < X). The corresponding U-statistic is

U_{n_1,n_2} = \frac{1}{n_1 n_2} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} h(X_i, Y_j) = \frac{W}{n_1 n_2},    (36)

where W is the number of pairs, (X_i, Y_j), with X_i > Y_j. The corresponding test of the hypothesis F = G (or \theta = 1/2) is equivalent to the rank-sum test. It is consistent only against alternatives (F, G) for which P_{F,G}(X > Y) \ne 1/2.
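The statistic (36) is just a pair count, so it is direct to compute. A sketch (the two samples are invented):

```python
# Computing (36) directly as a pair count.
xs = [5.2, 6.1, 4.8, 7.3, 5.9]
ys = [4.1, 5.5, 3.9, 6.0]

W = sum(x > y for x in xs for y in ys)   # number of pairs with X_i > Y_j
U = W / (len(xs) * len(ys))              # estimates theta = P(Y < X)
print(W, U)  # 15 pairs out of 20, so U = 0.75
```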
3. A test consistent against all alternatives. With F and G continuous as before, let h(x_1, x_2, y_1, y_2) = I(x_1 < y_1, x_2 < y_1) + I(y_1 < x_1, y_2 < x_1). (The symmetrized version would have four terms.) The expectation is

\theta = P(X_1 < Y, X_2 < Y) + P(Y_1 < X, Y_2 < X) = \frac{2}{3} + \int (F(x) - G(x))^2 \, d\left(\frac{F(x) + G(x)}{2}\right).    (37)
(See Exercise 6.) The hypothesis that F = G is equivalent to the hypothesis \theta = 2/3. The test that rejects this hypothesis if the corresponding U-statistic is too large is consistent against all alternatives.
Asymptotic Distribution. Corresponding to Theorems 1 and 2, we have the following. Let

\sigma_{ij}^2 = Cov[h(X_1, \ldots, X_i, X_{i+1}, \ldots, X_{m_1}, Y_1, \ldots, Y_j, Y_{j+1}, \ldots, Y_{m_2}), h(X_1, \ldots, X_i, X'_{i+1}, \ldots, X'_{m_1}, Y_1, \ldots, Y_j, Y'_{j+1}, \ldots, Y'_{m_2})]    (38)

where the X's and X''s, and the Y's and Y''s, are independently distributed according to F and G respectively, so that the two kernels share i of the X's and j of the Y's.
Theorem 3. For P \in \mathcal{P},

Var(U_{n_1,n_2}) = \sum_{i=0}^{m_1} \sum_{j=0}^{m_2} \frac{\binom{m_1}{i}\binom{n_1-m_1}{m_1-i}}{\binom{n_1}{m_1}} \cdot \frac{\binom{m_2}{j}\binom{n_2-m_2}{m_2-j}}{\binom{n_2}{m_2}}\, \sigma_{ij}^2,    (39)

where \sigma_{00}^2 = 0. Moreover, if \sigma_{m_1 m_2}^2 is finite, and if n_1/N \to p \in (0, 1) as N = (n_1 + n_2) \to \infty, then

\sqrt{N}(U_{n_1,n_2} - \theta) \xrightarrow{L} N(0, \sigma^2), \quad \text{where } \sigma^2 = \frac{m_1^2}{p}\sigma_{10}^2 + \frac{m_2^2}{1-p}\sigma_{01}^2.    (40)
As an application of this theorem, let us derive the asymptotic distribution of the Wilcoxon two-sample test of Example 2. We have h(x, y) = I(y < x) and \theta = P(Y < X). To find \sigma^2, we have m_1 = m_2 = 1, so we need \sigma_{10}^2 and \sigma_{01}^2:

\sigma_{10}^2 = Cov(I(Y < X), I(Y' < X)) = P(Y < X, Y' < X) - P(Y < X)^2,    (41)

and similarly, \sigma_{01}^2 = P(Y < X, Y < X') - P(Y < X)^2. Under the null hypothesis that F = G, we have \theta = 1/2 and \sigma_{10}^2 = \sigma_{01}^2 = 1/3 - 1/4 = 1/12, so that \sigma^2 = 1/(12p(1-p)). Then p may be replaced by n_1/N, resulting in the approximation

\sqrt{N}(U - 1/2) \approx N(0, N^2/(12 n_1 n_2)).    (42)
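The null variance implied by (39) can be checked exactly for small samples: under F = G continuous, the X-ranks are a uniformly random subset of the combined ranks, so the distribution of W can be enumerated. Plugging \sigma_{10}^2 = \sigma_{01}^2 = 1/12 and \sigma_{11}^2 = 1/4 into (39) gives Var(U) = (N+1)/(12 n_1 n_2), the classical Var(W) = n_1 n_2 (N+1)/12. A sketch (the sample sizes are arbitrary):

```python
from itertools import combinations
from statistics import mean

# Enumerate all placements of the X-ranks among {1, ..., N}.
n1, n2 = 3, 4
N = n1 + n2
ws = []
for xranks in combinations(range(1, N + 1), n1):
    yranks = [r for r in range(1, N + 1) if r not in xranks]
    ws.append(sum(x > y for x in xranks for y in yranks))

var_w = mean(w * w for w in ws) - mean(ws) ** 2
var_u = var_w / (n1 * n2) ** 2
print(var_w, n1 * n2 * (N + 1) / 12)   # exact Var(W) matches the formula
print(var_u, (N + 1) / (12 * n1 * n2)) # and Var(U) matches (39)
```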
Definition 3. We say that a U-statistic has a degeneracy of order k if \sigma_1^2 = \cdots = \sigma_k^2 = 0 and \sigma_{k+1}^2 > 0.
To present the ideas, we restrict attention to kernels with degeneracy of order 1, for which \sigma_1^2 = 0 and \sigma_2^2 > 0.
Example 1. Consider the kernel, h(x_1, x_2) = x_1 x_2, used in (5). Then, h_1(x_1) = E(x_1 X_2) = x_1 E(X_2) = x_1\mu, and \sigma_1^2 = Var(h_1(X_1)) = \mu^2\sigma^2, where \sigma^2 = Var(X_1). So from Theorem 2,

\sqrt{n}(U_n - \mu^2) \xrightarrow{L} N(0, 4\mu^2\sigma^2).    (43)

But suppose that \mu = E(X_1) = 0 under the null hypothesis. Then the limiting variance is zero, so that this theorem is useless for finding cutoff points for a test of the null hypothesis. But, assuming \sigma^2 > 0, we have \sigma_2^2 = Var(X_1 X_2) = \sigma^4 > 0, so that the degeneracy is of order 1. To find the asymptotic distribution of U_n = \binom{n}{2}^{-1} \sum_{i<j} X_i X_j for a sample X_1, X_2, \ldots from a distribution with mean 0 and variance \sigma^2, we rewrite U_n as follows.
U_n = \frac{1}{n(n-1)} \sum_{i \ne j} X_i X_j = \frac{1}{n(n-1)}\left(\Big(\sum_{i=1}^n X_i\Big)^2 - \sum_{i=1}^n X_i^2\right) = \frac{1}{n-1}\left(\Big(\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i\Big)^2 - \frac{1}{n}\sum_{i=1}^n X_i^2\right).    (44)
From the central limit theorem we have \frac{1}{\sqrt{n}} \sum_1^n X_i \xrightarrow{L} N(0, \sigma^2), and from the law of large numbers we have \frac{1}{n} \sum_1^n X_i^2 \xrightarrow{P} \sigma^2. Therefore by Slutsky's Theorem, we have

nU_n \xrightarrow{L} (Z^2 - 1)\sigma^2, \quad \text{where } Z \in N(0, 1),    (45)

and, since n/(n-1) \to 1, (n-1)U_n has the same limiting distribution as well.
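The rewriting in (44) is a pure algebraic identity, which a quick numerical check confirms (the sample is invented):

```python
# Algebraic check of the rewriting in (44) on an arbitrary sample.
xs = [0.5, -1.2, 0.3, 2.0, -0.9, -0.7]
n = len(xs)

lhs = sum(xs[i] * xs[j] for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
s1, s2 = sum(xs), sum(x * x for x in xs)
rhs = (s1 * s1 - s2) / (n * (n - 1))
print(lhs, rhs)  # identical, confirming (44)
```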
Example 2. Suppose now that h(x_1, x_2) = a f(x_1)f(x_2) + b g(x_1)g(x_2), where f(x) and g(x) are orthonormal functions of mean zero; that is, Ef(X)^2 = Eg(X)^2 = 1, Ef(X)g(X) = 0 and Ef(X) = Eg(X) = 0. Then, h_1(x_1) = Eh(x_1, X_2) \equiv 0, so that \sigma_1^2 = 0, while \sigma_2^2 = Var(h(X_1, X_2)) = a^2 + b^2, so the degeneracy is of order 1 (assuming a^2 + b^2 > 0). To find the asymptotic distribution of U_n, we perform an analysis as in Example 1.
(n-1)U_n = \frac{1}{n} \sum_{i \ne j} [a f(X_i)f(X_j) + b g(X_i)g(X_j)]
= a\left[\Big(\frac{1}{\sqrt{n}}\sum f(X_i)\Big)^2 - \frac{1}{n}\sum f(X_i)^2\right] + b\left[\Big(\frac{1}{\sqrt{n}}\sum g(X_i)\Big)^2 - \frac{1}{n}\sum g(X_i)^2\right]
\xrightarrow{L} a(Z_1^2 - 1) + b(Z_2^2 - 1),    (48)

where Z_1 and Z_2 are independent N(0, 1).
The General Case. Example 2 is indicative of the general result for kernels with degeneracy of order 1. This is due to a result from the Hilbert-Schmidt theory of integral equations: For given i.i.d. random variables, X_1 and X_2, any symmetric, square integrable function, A(x_1, x_2), (A(x_1, x_2) = A(x_2, x_1) and EA(X_1, X_2)^2 < \infty), admits a series expansion of the form,

A(x_1, x_2) = \sum_{k=1}^{\infty} \lambda_k \varphi_k(x_1)\varphi_k(x_2)    (49)

where the \lambda_k are real numbers, and the \varphi_k are an orthonormal sequence,

E\varphi_j(X_1)\varphi_k(X_1) = \begin{cases} 1 & \text{if } j = k, \\ 0 & \text{if } j \ne k. \end{cases}    (50)

The \lambda_k are the eigenvalues, and the \varphi_k(x) are corresponding eigenfunctions of the transformation, g(x) \to EA(x, X_1)g(X_1). That is, for all k,

\lambda_k \varphi_k(x) = E[A(x, X_1)\varphi_k(X_1)].    (51)

For a degenerate kernel with A(x_1, x_2) = h(x_1, x_2) - \theta, we have EA(x_2, X_1) = h_1(x_2) - \theta = 0; taking expectations in (51) then gives \lambda_k E\varphi_k(X_1) = 0. Thus all eigenfunctions corresponding to nonzero eigenvalues have mean zero. Now we can apply the method of Example 2 to find the asymptotic distribution of n(U_n - \theta).
Theorem 4. Let U_n be the U-statistic associated with a symmetric kernel of degree 2, degeneracy of order 1, and expectation \theta. Then n(U_n - \theta) \xrightarrow{L} \sum_1^{\infty} \lambda_j(Z_j^2 - 1), where Z_1, Z_2, \ldots are independent N(0, 1) and \lambda_1, \lambda_2, \ldots are the eigenvalues satisfying (49) with A(x_1, x_2) = h(x_1, x_2) - \theta.
For h having degeneracy of order 1 and arbitrary degree m \ge 2, the corresponding result gives the asymptotic distribution of n(U_n - \theta) as \binom{m}{2} \sum_1^{\infty} \lambda_j(Z_j^2 - 1), where the \lambda_j are the eigenvalues for the kernel h_2(x_1, x_2) - \theta. (See Serfling (1980) or Lee (1990).)
Computation. To obtain the asymptotic distribution of U_n in a specific case requires computation of the eigenvalues, \lambda_i, each taken with its multiplicity. In general, there may be an infinite number of these. However, for many kernels, there are just a finite number of nonzero eigenvalues. This occurs, for example, when h(x, y) is a polynomial in x and y, or more generally, when h(x, y) is given in the form, h(x, y) = \sum_1^p f_i(x)g_i(y), for some functions f_i and g_i. See Exercise 8 for an indication of how the \lambda_i are found for such kernels.
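When the distribution of X has finitely many atoms, the transformation g(x) → EA(x, X_1)g(X_1) acts on a finite-dimensional space, so the eigen-relation (51) can be verified directly. A sketch for a rank-2 kernel of the form in Example 2, whose nonzero eigenvalues should be exactly a and b (the atoms, functions, and coefficients are all invented for illustration):

```python
# X uniform on three atoms; f and g are mean-zero and orthonormal
# under this distribution, so h(x, y) = a f(x)f(y) + b g(x)g(y)
# has eigenfunctions f and g with eigenvalues a and b.
atoms = [-1.0, 0.0, 1.0]
p = 1.0 / 3.0
f = [1.5 ** 0.5 * x for x in atoms]               # E f = 0, E f^2 = 1
g = [4.5 ** 0.5 * (x * x - 2.0 / 3.0) for x in atoms]  # E g = 0, E g^2 = 1, E fg = 0
a, b = 2.0, 5.0

def h(i, j):
    return a * f[i] * f[j] + b * g[i] * g[j]

def T(phi):
    """(T phi)(x) = E[h(x, X) phi(X)], evaluated on the atom grid."""
    return [sum(h(i, j) * phi[j] * p for j in range(3)) for i in range(3)]

# Check the eigen-relation (51): T f = a f and T g = b g.
print(T(f))  # equals [a * v for v in f]
print(T(g))  # equals [b * v for v in g]
```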
Exercises.
1. Let \mathcal{P} be the set of all distributions on the real line with finite first moment. Show that there does not exist a function f(x) such that E_P f(X) = \mu^2 for all P \in \mathcal{P}, where \mu is the mean of P, and X is a random variable with distribution P.
2. Let g1 and g2 be estimable parameters within P with respective degrees m1 and
m2 . (a) Show g1 + g2 is an estimable parameter with degree max{m1 , m2 }. (b) Show
g1 g2 is an estimable parameter with degree at most m1 + m2 .
3. Let \mathcal{P} be the class of distributions of two-dimensional vectors, V = (X, Y), with finite second moments. Find a kernel, h(V_1, V_2), of degree 2, for estimating the covariance. Show that the corresponding U-statistic is the (unbiased) sample covariance, s_{xy} = \frac{1}{n-1} \sum_1^n (X_i - \bar{X}_n)(Y_i - \bar{Y}_n).
(c) What is the asymptotic distribution under the hypothesis H0 : F = G? (Give
numerical values for the mean and variance.)
7. Suppose the distribution of X is symmetric about the origin, with variance \sigma^2 > 0 and EX^4 < \infty. Consider the kernel, h(x, y) = xy + (x^2 - \sigma^2)(y^2 - \sigma^2).
(a) Show the problem is degenerate of order 1.
(b) Find \lambda_1, \lambda_2, and \varphi_1(x) and \varphi_2(x) orthonormal, so that h(x, y) = \lambda_1\varphi_1(x)\varphi_1(y) + \lambda_2\varphi_2(x)\varphi_2(y).
(c) Find the asymptotic distribution of nU_n.
8. Suppose the distribution of X is symmetric about the origin, with variance \sigma^2 > 0 and EX^6 < \infty. Consider the kernel, h(x, y) = xy(1 + x^2 y^2).
(a) Show the problem is degenerate of order 1.
(b) Using (51) with A = h, show that any eigenfunction with nonzero eigenvalue must be of the form, \varphi(x) = ax^3 + bx, for some a and b.
(c) Specializing to the case where X has a N(0, 1) distribution (EX^2 = 1, EX^4 = 3 and EX^6 = 15), find the linear equations for a and b by equating coefficients of x and x^3 in (51).
(d) Find the two nonzero eigenvalues (no need to find the eigenfunctions).
(e) What is the asymptotic distribution of nU_n?
References.
Denker, Manfred (1985) Asymptotic Distribution Theory in Nonparametric Statistics,
Fr. Vieweg & Sohn, Braunschweig, Wiesbaden.
Ferguson, T. S. (1996) A Course in Large Sample Theory, Chapman-Hall, New York.
Fraser, D. A. S. (1957) Nonparametric Methods in Statistics, John Wiley & Sons,
New York.
Hoeffding, W. (1948a) A class of statistics with asymptotically normal distribution, Ann. Math. Statist. 19, 293-325.
Hoeffding, W. (1948b) A non-parametric test for independence, Ann. Math. Statist. 19, 546-557.
Lee, A. J. (1990) U-Statistics, Marcel Dekker Inc., New York.
Lehmann, E. L. (1951) Consistency and unbiasedness of certain nonparametric tests
Ann. Math. Statist. 22, 165-179.
Lehmann, E. L. (1999) Elements of Large Sample Theory, Springer.
Mann, H. B. and Whitney, D. R. (1947) On a test of whether one of two random
variables is stochastically larger than the other Ann. Math. Statist. 18, 50-60.
Serfling, R. J. (1980) Approximation Theorems of Mathematical Statistics, John Wiley & Sons, New York.
Wilcoxon, Frank (1945) Individual comparisons by ranking methods, Biometrics 1,
80-83.