Thomas S. Ferguson
where the summation is over the group \Pi_m of all m! permutations of an m-vector, is obviously symmetric in its arguments, and has the same expectation under P as does f.
U_n = U_n(h) = \frac{1}{n!/(n-m)!} \sum_{P_{m,n}} h(X_{i_1}, \ldots, X_{i_m})    (3)

where the summation is over the set P_{m,n} of all n!/(n-m)! permutations (i_1, i_2, \ldots, i_m) of size m chosen from (1, 2, \ldots, n). If the kernel, h, is symmetric in its arguments, U_n has the equivalent form
U_n = U_n(h) = \frac{1}{\binom{n}{m}} \sum_{C_{m,n}} h(X_{i_1}, \ldots, X_{i_m})    (4)

where the summation is over the set C_{m,n} of all \binom{n}{m} combinations of m integers, i_1 < i_2 < \cdots < i_m, chosen from (1, 2, \ldots, n).
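For a symmetric kernel, (4) can be computed directly by averaging the kernel over all \binom{n}{m} combinations. A minimal sketch in Python (the kernels and data here are illustrative, not from the text):

```python
from itertools import combinations
from statistics import mean

def u_statistic(h, xs, m):
    """Average a symmetric kernel h of degree m over all C(n, m)
    combinations of the sample xs, as in (4)."""
    return mean(h(*c) for c in combinations(xs, m))

# The degree-1 kernel h(x) = x gives the sample mean.
xs = [1.0, 2.0, 4.0, 5.0]
print(u_statistic(lambda x: x, xs, 1))         # sample mean: 3.0
# The degree-2 kernel h(x1, x2) = x1 * x2 estimates the squared mean;
# here it averages the 6 pairwise products.
print(u_statistic(lambda a, b: a * b, xs, 2))
```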
2. Examples. 1. Moments. If P is the set of all distributions on the real line with finite mean, then the mean, \mu = \mu(P) = \int x \, dP(x), is an estimable parameter of degree m = 1, because f(X_1) = X_1 is an unbiased estimate of \mu. The corresponding U-statistic is the sample mean, U_n = \bar{X}_n = (1/n) \sum_1^n X_i. Similarly, if P is the set of all distributions on the real line with finite kth moment, then the kth moment, \mu_k = \int x^k \, dP(x), is an estimable parameter of degree 1 with U-statistic (1/n) \sum_1^n X_i^k.
How about estimating the square of the mean, \theta(P) = \mu^2? Since E(X_1 X_2) = \mu^2, it is also an estimable parameter with degree at most 2. It is easy to show it cannot have degree 1 (Exercise 1), so it has degree 2. The U-statistic U_n of (3) and (4) corresponding to h(x_1, x_2) = x_1 x_2 is

U_n = \frac{1}{n(n-1)} \sum_{i \ne j} X_i X_j = \frac{2}{n(n-1)} \sum_{i<j} X_i X_j.    (5)
If P is taken to be the set of all distributions on the real line with finite second moment, then the variance, \sigma^2 = \mu_2 - \mu^2, is also estimable of degree 2, since we can estimate \mu_2 by X_1^2 and \mu^2 by X_1 X_2:

E(X_1^2 - X_1 X_2) = \sigma^2.    (6)
However the kernel, f(x_1, x_2) = x_1^2 - x_1 x_2, is not symmetric in x_1 and x_2. The corresponding symmetric kernel given by (2) is the average,

h(x_1, x_2) = \frac{(x_1^2 - x_1 x_2) + (x_2^2 - x_2 x_1)}{2} = \frac{(x_1 - x_2)^2}{2}.    (7)

This leads to the U-statistic,

U_n = \frac{2}{n(n-1)} \sum_{i<j} \frac{(X_i - X_j)^2}{2} = \frac{1}{n-1} \sum_1^n (X_i - \bar{X})^2 = s_x^2,    (8)

the unbiased sample variance.
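The identity (8) is easy to confirm numerically. A small check in Python (the sample is invented for illustration):

```python
from itertools import combinations
from statistics import mean, variance

def u_statistic(h, xs, m):
    """Average a symmetric kernel h of degree m over all combinations, as in (4)."""
    return mean(h(*c) for c in combinations(xs, m))

# The U-statistic with kernel (7) equals the unbiased sample variance.
xs = [2.0, 3.0, 5.0, 7.0, 11.0]
u = u_statistic(lambda a, b: (a - b) ** 2 / 2, xs, 2)
print(u, variance(xs))  # the two agree, confirming (8)
```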
where R_i^+ is the rank of |Z_i| among |Z_1|, |Z_2|, \ldots, |Z_n|. Although it is not a U-statistic, one can show (Exercise 4) that W_n^+ is a linear combination of two U-statistics,

W_n^+ = \sum_i I(Z_i > 0) + \sum_{i<j} I(Z_i + Z_j > 0),    (10)
and writing it in this way gives some insight into its behavior. The first U-statistic is based on the kernel, h(z) = I(z > 0). The U-statistic itself is U_n^{(1)} = \frac{1}{n} \sum_1^n I(Z_i > 0). This is the U-statistic used for the sign test. The second U-statistic is based on the kernel, h(z_1, z_2) = I(z_1 + z_2 > 0), and the corresponding U-statistic is U_n^{(2)} = \binom{n}{2}^{-1} \sum_{i<j} I(Z_i + Z_j > 0). Thus,

W_n^+ = n U_n^{(1)} + \binom{n}{2} U_n^{(2)}.    (11)
For large n the second term dominates the first, so asymptotically W_n^+ behaves like n^2 U_n^{(2)}/2. The Wilcoxon signed rank test rejects H_0 if W_n^+ is too large, and this is asymptotically equivalent to the test that rejects if U_n^{(2)} is too large.
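The decomposition (10) can be checked directly: compute W_n^+ from the signed ranks and compare it with the sign-test count plus the pairwise count. A sketch (the data are invented):

```python
from itertools import combinations

# A small numeric check of decomposition (10)/(11).
zs = [1.2, -0.4, 2.5, -3.1, 0.7, 1.9]
n = len(zs)

# W_n^+ = sum of the ranks R_i^+ of |Z_i| over the positive Z_i.
order = sorted(range(n), key=lambda i: abs(zs[i]))
rank = {i: r + 1 for r, i in enumerate(order)}
w_plus = sum(rank[i] for i in range(n) if zs[i] > 0)

# Right side of (10): the sign-test count plus the pairwise count.
u1_count = sum(z > 0 for z in zs)
u2_count = sum(zi + zj > 0 for zi, zj in combinations(zs, 2))
print(w_plus, u1_count + u2_count)  # the two sides of (10) agree
```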
3. Testing Symmetry. In some situations, it is important to test for symmetry about an unknown center. Here is one method based on the observation that for a sample of size 3, X_1, X_2, X_3, from a continuous distribution symmetric about a point \mu, P(X_1 > (X_2 + X_3)/2) = P((X_1 - \mu) > ((X_2 - \mu) + (X_3 - \mu))/2) = 1/2. Because of this, f(X_1, X_2, X_3) = \mathrm{sgn}(2X_1 - X_2 - X_3) is an unbiased estimate of \theta(P) = P(2X_1 > X_2 + X_3) - P(2X_1 < X_2 + X_3). Here, \mathrm{sgn}(x) represents the sign function, which is 1 if x > 0, 0 if x = 0, and -1 if x < 0. When P is symmetric, \theta(P) has value zero. The corresponding symmetric kernel is

h(x_1, x_2, x_3) = \frac{1}{3}[\mathrm{sgn}(2x_1 - x_2 - x_3) + \mathrm{sgn}(2x_2 - x_1 - x_3) + \mathrm{sgn}(2x_3 - x_1 - x_2)].    (12)
This is an example of a kernel of degree 3. The hypothesis of symmetry is rejected if the
corresponding U-statistic is too large in absolute value. One can easily show that
h(x_1, x_2, x_3) = \frac{1}{3}\,\mathrm{sgn}(\mathrm{median}(x_1, x_2, x_3) - \mathrm{mean}(x_1, x_2, x_3)).    (13)
Thus the validity of the test also follows from the observation that for a sample of size
three from a symmetric distribution, the sample median is equally likely to be above the
sample mean as below it.
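The equivalence of (12) and (13) is easy to verify on particular triples. A quick Python check (the triples are arbitrary):

```python
from statistics import mean, median

def sgn(x):
    return (x > 0) - (x < 0)

def h12(x1, x2, x3):
    """Symmetric kernel (12)."""
    return (sgn(2*x1 - x2 - x3) + sgn(2*x2 - x1 - x3) + sgn(2*x3 - x1 - x2)) / 3

def h13(x1, x2, x3):
    """The equivalent form (13)."""
    return sgn(median([x1, x2, x3]) - mean([x1, x2, x3])) / 3

# The two forms agree on every triple; a few spot checks:
for t in [(0.0, 1.0, 5.0), (-2.0, 0.5, 0.6), (3.0, 1.0, 2.0)]:
    print(t, h12(*t), h13(*t))
```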
4. Measures of Association. For continuous probability distributions in 2 dimensions, there are several measures of dependence, or association, the simplest of which is perhaps Kendall's tau. Two vectors (x_1, y_1) and (x_2, y_2) are said to be concordant if x_1 < x_2 and y_1 < y_2, or if x_2 < x_1 and y_2 < y_1; in other words, if the line joining the points has positive slope. If the line joining the points has negative slope, the points are said to be discordant.
Suppose (X_1, Y_1) and (X_2, Y_2) are independently distributed according to a distribution F(x, y) in the plane. If the probability of concordance, P(X_1 < X_2, Y_1 < Y_2) + P(X_2 < X_1, Y_2 < Y_1), is bigger than 1/2, there is a positive association between X and Y. If it is less than 1/2, there is negative association. This leads to a measure of association called Kendall's \tau, defined as

\tau = 2[P(X_1 < X_2, Y_1 < Y_2) + P(X_2 < X_1, Y_2 < Y_1)] - 1 = 4P(X_1 < X_2, Y_1 < Y_2) - 1.    (14)
The corresponding U-statistic is known as Kendall's coefficient of rank correlation. This was seen in Exercise 5.7 of Ferguson (1996) to have an asymptotically normal distribution, when suitably normalized, in the case where X and Y are independent. We will see that the asymptotic distribution is normal for general dependent X and Y.
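The sample analogue of (14) replaces the probabilities with pair counts: concordant minus discordant pairs, divided by \binom{n}{2}. A small illustrative sketch (the data are invented):

```python
from itertools import combinations

def kendall_tau(pairs):
    """Sample Kendall tau: (concordant - discordant) / C(n, 2),
    the U-statistic with kernel sgn((x1 - x2)(y1 - y2))."""
    c = d = 0
    for (x1, y1), (x2, y2) in combinations(pairs, 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            c += 1          # concordant: positive slope
        elif s < 0:
            d += 1          # discordant: negative slope
    n = len(pairs)
    return (c - d) / (n * (n - 1) / 2)

data = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
print(kendall_tau(data))  # (8 - 2) / 10 = 0.6
```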
Another measure of association in 2 dimensions is given by Spearman's rho, defined as

\rho = 12\,P(X_1 < X_2, Y_1 < Y_3) - 3,    (16)

where (X_1, Y_1), (X_2, Y_2) and (X_3, Y_3) are independently distributed according to F(x, y). It also has the properties of a correlation coefficient, being zero when the variables are independent and always being between -1 and 1. In fact, one can show that \rho is simply the correlation coefficient between the random variables F(X, \infty) and F(\infty, Y). It is clear that \rho is also an estimable parameter with kernel of degree 3, h((x_1, y_1), (x_2, y_2), (x_3, y_3)) = 12\,I(x_1 < x_2, y_1 < y_3) - 3. The symmetrized version has 6 terms. The corresponding U-statistic is related to the rank statistic of Example 12.5 of Ferguson (1996), which was seen to have an asymptotically normal distribution under the hypothesis of independence.
Lemma 1. For P \in \mathcal{P} and (i_1, \ldots, i_m) and (j_1, \ldots, j_m) in C_{m,n},

Cov(h(X_{i_1}, \ldots, X_{i_m}), h(X_{j_1}, \ldots, X_{j_m})) = Cov(h_c(X_1, \ldots, X_c), h(X_1, \ldots, X_m)) = \sigma_c^2,    (21)

where c is the number of integers common to (i_1, \ldots, i_m) and (j_1, \ldots, j_m).
Proof. If (i_1, \ldots, i_m) and (j_1, \ldots, j_m) have c elements in common, then

Cov(h(X_{i_1}, \ldots, X_{i_m}), h(X_{j_1}, \ldots, X_{j_m})) = E[(h(X_1, \ldots, X_c, X_{c+1}, \ldots, X_m) - \theta)(h(X_1, \ldots, X_c, X'_{c+1}, \ldots, X'_m) - \theta)],    (22)

where X_1, \ldots, X_m, X'_{c+1}, \ldots, X'_m are i.i.d. Conditionally, given X_1, \ldots, X_c, the two terms in this expectation are independent, so taking the expectation of the conditional expectation, we have

Cov(h(X_{i_1}, \ldots, X_{i_m}), h(X_{j_1}, \ldots, X_{j_m})) = E[(h_c(X_1, \ldots, X_c) - \theta)(h_c(X_1, \ldots, X_c) - \theta)] = \sigma_c^2.    (23)
If \sigma_m^2 < \infty, then \sigma_i^2 < \infty for i \le m. For large n, the first term of the sum dominates, since it is of the largest order. The coefficient of \sigma_1^2 is m\binom{n-m}{m-1}/\binom{n}{m} \sim m^2/n.
In the example of estimating a variance with kernel (7), h(x_1, x_2) = (x_1 - x_2)^2/2, we find h_1(x_1) = E(X - x_1)^2/2 = \sigma^2/2 + (x_1 - \mu)^2/2. Then \sigma_1^2 = Var(h_1(X_1)) = Var((X - \mu)^2/2) = (\mu_4 - \sigma^4)/4, and \sigma_2^2 = Var((X_1 - X_2)^2/2) = (\mu_4 + \sigma^4)/2. From this we find

Var(U_n) = \frac{2}{n(n-1)}[2(n-2)\sigma_1^2 + \sigma_2^2] = \frac{\mu_4 - \sigma^4}{n} + O(n^{-2}).    (26)
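The exact form of this variance, Var(U_n) = (2/(n(n-1)))[2(n-2)\sigma_1^2 + \sigma_2^2], can be checked by brute force on a distribution with finitely many atoms, where everything is exactly enumerable. A sketch (the 3-point distribution is invented):

```python
from itertools import product, combinations
from statistics import mean

# Check Var(U_n) = (2/(n(n-1)))[2(n-2) sigma_1^2 + sigma_2^2]
# for the variance kernel h(x1, x2) = (x1 - x2)^2 / 2.
atoms = [0.0, 1.0, 3.0]          # X uniform on these atoms
p = 1.0 / len(atoms)
n = 4

def h(a, b):
    return (a - b) ** 2 / 2

# h1(x) = E h(x, X); theta = E h(X1, X2) = sigma^2 of the distribution.
h1 = {x: sum(h(x, y) * p for y in atoms) for x in atoms}
theta = sum(h1[x] * p for x in atoms)
sig1 = sum((h1[x] - theta) ** 2 * p for x in atoms)
sig2 = sum((h(x, y) - theta) ** 2 * p * p for x in atoms for y in atoms)

# Exact Var(U_n): enumerate all len(atoms)^n equally likely samples.
def U(xs):
    return mean(h(a, b) for a, b in combinations(xs, 2))

EU = sum(U(xs) * p ** n for xs in product(atoms, repeat=n))
VarU = sum((U(xs) - EU) ** 2 * p ** n for xs in product(atoms, repeat=n))

formula = 2 * (2 * (n - 2) * sig1 + sig2) / (n * (n - 1))
print(VarU, formula)  # the two agree (up to rounding)
```

Note that EU recovers theta exactly, illustrating unbiasedness along the way.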
Theorem 2. If \sigma_m^2 < \infty, then \sqrt{n}(U_n - \theta) \xrightarrow{L} N(0, m^2\sigma_1^2).

Proof. Let

\hat{U}_n = \frac{m}{n} \sum_{k=1}^n (h_1(X_k) - \theta).    (27)
Then since the m(h_1(X_k) - \theta) are i.i.d. with mean 0 and variance m^2\sigma_1^2, the central limit theorem implies that \sqrt{n}\,\hat{U}_n \xrightarrow{L} N(0, m^2\sigma_1^2). We complete the proof by showing that \sqrt{n}(U_n - \theta) and \sqrt{n}\,\hat{U}_n are asymptotically equivalent and so have the same limiting distribution. For this it suffices to show that nE(\hat{U}_n - (U_n - \theta))^2 \to 0.
Expanding the square, nE(\hat{U}_n - (U_n - \theta))^2 = n\,Var(\hat{U}_n) - 2n\,Cov(\hat{U}_n, U_n) + n\,Var(U_n). The first term on the right is equal to m^2\sigma_1^2 and the last term converges to m^2\sigma_1^2 from Theorem 1, so we will be finished when we show nCov(\hat{U}_n, U_n) is equal to m^2\sigma_1^2.

nCov(\hat{U}_n, U_n) = n\,\frac{m}{n}\,\frac{1}{\binom{n}{m}} \sum_{k=1}^n \sum_{j \in C_{m,n}} Cov(h_1(X_k), h(X_{j_1}, \ldots, X_{j_m})).    (29)

The inside covariance is zero if k is not equal to one of the j_i, and it is \sigma_1^2 otherwise, from Lemma 1. For fixed k, the number of sets \{j_1, \ldots, j_m\} containing k is \binom{n-1}{m-1}, and since m\binom{n-1}{m-1}/\binom{n}{m} = m^2/n, this gives nCov(\hat{U}_n, U_n) = m^2\sigma_1^2, as required.
Application. As an application of this theorem, consider the U-statistic, U_n^{(2)}, with kernel, h(x_1, x_2) = I(x_1 + x_2 > 0), of degree m = 2, associated with the Wilcoxon signed rank test. The parameter estimated is \theta = Eh(X_1, X_2) = P(X_1 + X_2 > 0). From Lemma 1, we have

\sigma_1^2 = P(X_1 + X_2 > 0, X_1 + X_3 > 0) - \theta^2.

Under the null hypothesis that the distribution P is continuous and symmetric about 0, we have \theta = 1/2 and P(X_1 + X_2 > 0, X_1 + X_3 > 0) = P(X_1 > -X_2, X_1 > -X_3) = 1/3, since by symmetry this is just the probability that of three i.i.d. random variables, the first is the largest. Therefore, under the null hypothesis, \sigma_1^2 = (1/3) - (1/2)^2 = 1/12, and since m = 2, Theorem 2 gives

\sqrt{n}(U_n^{(2)} - 1/2) \xrightarrow{L} N(0, 1/3).    (32)
This test of the null hypothesis based on U_n^{(2)} is consistent only for alternatives P for which \theta(P) \ne 1/2. In Exercise 5, you are to find a test that is consistent against all alternatives.
Under the general hypothesis, \sqrt{n}(U_n^{(2)} - \theta) \xrightarrow{L} N(0, 4\sigma_1^2). This may be used to find a confidence interval for \theta. For this purpose though, we need an estimate of \sigma_1^2. Why not use a U-statistic? One can estimate P(X_1 + X_2 > 0, X_1 + X_3 > 0) by the U-statistic associated with the kernel, f(x_1, x_2, x_3) = I(x_1 + x_2 > 0, x_1 + x_3 > 0), or its symmetrized counterpart, h(x_1, x_2, x_3) = (1/3)[f(x_1, x_2, x_3) + f(x_2, x_1, x_3) + f(x_3, x_2, x_1)].
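Putting this together, a confidence interval for \theta takes only a few lines. A sketch (the data are invented; the plug-in variance estimate is the U-statistic suggested above):

```python
from itertools import combinations, permutations
from statistics import mean

# Invented sample; in practice these would be the observed Z_i.
xs = [0.9, -1.3, 2.1, 0.4, -0.2, 1.7, -0.6, 1.1, 0.3, -1.9]
n = len(xs)

# theta_hat: U-statistic with kernel I(x1 + x2 > 0).
theta_hat = mean((a + b > 0) for a, b in combinations(xs, 2))

# Estimate P(X1+X2>0, X1+X3>0) by the symmetrized degree-3 U-statistic.
def f(a, b, c):
    return (a + b > 0) and (a + c > 0)

p_hat = mean(mean(f(*q) for q in permutations(t)) for t in combinations(xs, 3))

sig1_sq = max(p_hat - theta_hat ** 2, 0.0)
half = 1.96 * 2 * (sig1_sq / n) ** 0.5   # sd of U_n is approximately 2*sigma_1/sqrt(n)
print(f"theta_hat = {theta_hat:.3f}, 95% CI = ({theta_hat - half:.3f}, {theta_hat + half:.3f})")
```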
4. Two-Sample Problems. The important extension to k-sample problems for k \ge 2 has been made by Lehmann (1951). The basic ideas are contained in the 2-sample case which is discussed here. Here \mathcal{P} is a family of pairs of probability measures, (F, G).
Consider independent samples, X_1, \ldots, X_{n_1} from F(x) and Y_1, \ldots, Y_{n_2} from G(y). Let h(x_1, \ldots, x_{m_1}, y_1, \ldots, y_{m_2}) be a kernel, and let \mathcal{P} be the set of all pairs such that the expectation

\theta = \theta(F, G) = E_{F,G}\,h(X_1, \ldots, X_{m_1}, Y_1, \ldots, Y_{m_2})    (33)
is finite. As before we may assume without loss of generality that h is symmetric under independent permutations of x_1, \ldots, x_{m_1} and y_1, \ldots, y_{m_2}. The corresponding U-statistic is

U_{n_1,n_2} = U(h) = \frac{1}{\binom{n_1}{m_1}\binom{n_2}{m_2}} \sum h(X_{i_1}, \ldots, X_{i_{m_1}}, Y_{j_1}, \ldots, Y_{j_{m_2}}),    (34)

where the sum is over all \binom{n_1}{m_1}\binom{n_2}{m_2} sets of subscripts such that 1 \le i_1 < \cdots < i_{m_1} \le n_1 and 1 \le j_1 < \cdots < j_{m_2} \le n_2. Again it is clear that U is an unbiased estimate of \theta.
Examples. There are various two-sample tests based on U-statistics of the hypothesis of equality of distributions, H_0: F = G. They differ in their behavior against various alternative hypotheses.
1. A two-sample comparison of means. Taking F and G to be distributions on the real line with finite variances, let h(x_1, y_1) = x_1 - y_1, a kernel of degree (m_1, m_2) = (1, 1). Then \theta = EX - EY. The corresponding U-statistic is

U_{n_1,n_2} = \frac{1}{n_1 n_2} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} (X_i - Y_j) = \bar{X}_{n_1} - \bar{Y}_{n_2}.    (35)
2. The two-sample Wilcoxon (Mann-Whitney) rank-sum test. With F and G continuous, let h(x, y) = I(y < x), a kernel of degree (1, 1), so that \theta = P(Y < X). The corresponding U-statistic is

U_{n_1,n_2} = \frac{1}{n_1 n_2} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} h(X_i, Y_j) = \frac{W}{n_1 n_2},    (36)

where W is the number of pairs, (X_i, Y_j), with X_i > Y_j. The corresponding test of the hypothesis F = G (or \theta = 1/2) is equivalent to the rank-sum test. It is consistent only against alternatives (F, G) for which P_{F,G}(X > Y) \ne 1/2.
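The statistic (36) is just a pair count, so it is direct to compute. A sketch (the two samples are invented):

```python
# Computing (36) directly as a pair count.
xs = [5.2, 6.1, 4.8, 7.3, 5.9]
ys = [4.1, 5.5, 3.9, 6.0]

W = sum(x > y for x in xs for y in ys)   # number of pairs with X_i > Y_j
U = W / (len(xs) * len(ys))              # estimates theta = P(Y < X)
print(W, U)  # 15 pairs out of 20, so U = 0.75
```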
3. A test consistent against all alternatives. With F and G continuous as before, let h(x_1, x_2, y_1, y_2) = I(x_1 < y_1, x_2 < y_1) + I(y_1 < x_1, y_2 < x_1). (The symmetrized version would have four terms.) The expectation is

\theta = P(X_1 < Y, X_2 < Y) + P(Y_1 < X, Y_2 < X) = \frac{2}{3} + \int (F(x) - G(x))^2 \, d\left(\frac{F(x) + G(x)}{2}\right).    (37)
(See Exercise 6.) The hypothesis that F = G is equivalent to the hypothesis \theta = 2/3. The test that rejects this hypothesis if the corresponding U-statistic is too large is consistent against all alternatives.
Asymptotic Distribution. Corresponding to Theorems 1 and 2, we have the following. Let

\sigma_{ij}^2 = Cov[h(X_1, \ldots, X_i, X_{i+1}, \ldots, X_{m_1}, Y_1, \ldots, Y_j, Y_{j+1}, \ldots, Y_{m_2}), h(X_1, \ldots, X_i, X'_{i+1}, \ldots, X'_{m_1}, Y_1, \ldots, Y_j, Y'_{j+1}, \ldots, Y'_{m_2})]    (38)

where the X's and X''s, and the Y's and Y''s, are independently distributed according to F and G respectively, so that the two kernels share i of the X's and j of the Y's.
Theorem 3. For P \in \mathcal{P},

Var(U_{n_1,n_2}) = \sum_{i=0}^{m_1} \sum_{j=0}^{m_2} \frac{\binom{m_1}{i}\binom{n_1-m_1}{m_1-i}}{\binom{n_1}{m_1}} \cdot \frac{\binom{m_2}{j}\binom{n_2-m_2}{m_2-j}}{\binom{n_2}{m_2}}\, \sigma_{ij}^2,    (39)

where \sigma_{00}^2 = 0. Moreover, if \sigma_{m_1 m_2}^2 is finite, and if n_1/N \to p \in (0, 1) as N = (n_1 + n_2) \to \infty, then

\sqrt{N}(U_{n_1,n_2} - \theta) \xrightarrow{L} N(0, \sigma^2), \quad \text{where } \sigma^2 = \frac{m_1^2}{p}\sigma_{10}^2 + \frac{m_2^2}{1-p}\sigma_{01}^2.    (40)
As an application of this theorem, let us derive the asymptotic distribution of the Wilcoxon two-sample test of Example 2. We have h(x, y) = I(y < x) and \theta = P(Y < X). To find \sigma^2, we have m_1 = m_2 = 1, so we need \sigma_{10}^2 and \sigma_{01}^2:

\sigma_{10}^2 = Cov(I(Y < X), I(Y' < X)) = P(Y < X, Y' < X) - P(Y < X)^2,    (41)

and similarly, \sigma_{01}^2 = P(Y < X, Y < X') - P(Y < X)^2. Under the null hypothesis that F = G, we have \theta = 1/2 and \sigma_{10}^2 = \sigma_{01}^2 = 1/3 - 1/4 = 1/12, so that \sigma^2 = 1/(12p(1-p)). Then p may be replaced by n_1/N, resulting in the approximation

\sqrt{N}(U - 1/2) \approx N(0, N^2/(12 n_1 n_2)).    (42)
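The null variance implied by (39) can be checked exactly for small samples: under F = G continuous, the X-ranks are a uniformly random subset of the combined ranks, so the distribution of W can be enumerated. Plugging \sigma_{10}^2 = \sigma_{01}^2 = 1/12 and \sigma_{11}^2 = 1/4 into (39) gives Var(U) = (N+1)/(12 n_1 n_2), the classical Var(W) = n_1 n_2 (N+1)/12. A sketch (the sample sizes are arbitrary):

```python
from itertools import combinations
from statistics import mean

# Enumerate all placements of the X-ranks among {1, ..., N}.
n1, n2 = 3, 4
N = n1 + n2
ws = []
for xranks in combinations(range(1, N + 1), n1):
    yranks = [r for r in range(1, N + 1) if r not in xranks]
    ws.append(sum(x > y for x in xranks for y in yranks))

var_w = mean(w * w for w in ws) - mean(ws) ** 2
var_u = var_w / (n1 * n2) ** 2
print(var_w, n1 * n2 * (N + 1) / 12)   # exact Var(W) matches the formula
print(var_u, (N + 1) / (12 * n1 * n2)) # and Var(U) matches (39)
```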
Definition 3. We say that a U-statistic has a degeneracy of order k if \sigma_1^2 = \cdots = \sigma_k^2 = 0 and \sigma_{k+1}^2 > 0.
To present the ideas, we restrict attention to kernels with degeneracy of order 1, for which \sigma_1^2 = 0 and \sigma_2^2 > 0.
Example 1. Consider the kernel, h(x_1, x_2) = x_1 x_2, used in (5). Then, h_1(x_1) = E(x_1 X_2) = x_1 E(X_2) = x_1\mu, and \sigma_1^2 = Var(h_1(X_1)) = \mu^2\sigma^2, where \sigma^2 = Var(X_1). So from Theorem 2,

\sqrt{n}(U_n - \mu^2) \xrightarrow{L} N(0, 4\mu^2\sigma^2).    (43)

But suppose that \mu = E(X_1) = 0 under the null hypothesis. Then the limiting variance is zero, so that this theorem is useless for finding cutoff points for a test of the null hypothesis. But, assuming \sigma^2 > 0, we have \sigma_2^2 = Var(X_1 X_2) = \sigma^4 > 0, so that the degeneracy is of order 1. To find the asymptotic distribution of U_n = \binom{n}{2}^{-1} \sum_{i<j} X_i X_j for a sample X_1, X_2, \ldots from a distribution with mean 0 and variance \sigma^2, we rewrite U_n as follows.
U_n = \frac{1}{n(n-1)} \sum_{i \ne j} X_i X_j = \frac{1}{n(n-1)}\left(\Big(\sum_{i=1}^n X_i\Big)^2 - \sum_{i=1}^n X_i^2\right) = \frac{1}{n-1}\left(\Big(\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i\Big)^2 - \frac{1}{n}\sum_{i=1}^n X_i^2\right).    (44)
From the central limit theorem we have \frac{1}{\sqrt{n}} \sum_1^n X_i \xrightarrow{L} N(0, \sigma^2), and from the law of large numbers we have \frac{1}{n} \sum_1^n X_i^2 \xrightarrow{P} \sigma^2. Therefore by Slutsky's Theorem, we have

nU_n \xrightarrow{L} (Z^2 - 1)\sigma^2, \quad \text{where } Z \in N(0, 1),    (45)

and, since n/(n-1) \to 1, (n-1)U_n has the same limiting distribution as well.
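The rewriting in (44) is a pure algebraic identity, which a quick numerical check confirms (the sample is invented):

```python
# Algebraic check of the rewriting in (44) on an arbitrary sample.
xs = [0.5, -1.2, 0.3, 2.0, -0.9, -0.7]
n = len(xs)

lhs = sum(xs[i] * xs[j] for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
s1, s2 = sum(xs), sum(x * x for x in xs)
rhs = (s1 * s1 - s2) / (n * (n - 1))
print(lhs, rhs)  # identical, confirming (44)
```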
Example 2. Suppose now that h(x_1, x_2) = a f(x_1)f(x_2) + b g(x_1)g(x_2), where f(x) and g(x) are orthonormal functions of mean zero; that is, Ef(X)^2 = Eg(X)^2 = 1, Ef(X)g(X) = 0 and Ef(X) = Eg(X) = 0. Then, h_1(x_1) = Eh(x_1, X_2) \equiv 0, so that \sigma_1^2 = 0, while \sigma_2^2 = Var(h(X_1, X_2)) = a^2 + b^2, so the degeneracy is of order 1 (assuming a^2 + b^2 > 0). To find the asymptotic distribution of U_n, we perform an analysis as in Example 1.
(n-1)U_n = \frac{1}{n} \sum_{i \ne j} [a f(X_i)f(X_j) + b g(X_i)g(X_j)]
= a\left[\Big(\frac{1}{\sqrt{n}}\sum f(X_i)\Big)^2 - \frac{1}{n}\sum f(X_i)^2\right] + b\left[\Big(\frac{1}{\sqrt{n}}\sum g(X_i)\Big)^2 - \frac{1}{n}\sum g(X_i)^2\right]
\xrightarrow{L} a(Z_1^2 - 1) + b(Z_2^2 - 1),    (48)

where Z_1 and Z_2 are independent N(0, 1).
The General Case. Example 2 is indicative of the general result for kernels with degeneracy of order 1. This is due to a result from the Hilbert-Schmidt theory of integral equations: For given i.i.d. random variables, X_1 and X_2, any symmetric, square integrable function, A(x_1, x_2), (A(x_1, x_2) = A(x_2, x_1) and EA(X_1, X_2)^2 < \infty), admits a series expansion of the form,

A(x_1, x_2) = \sum_{k=1}^{\infty} \lambda_k \varphi_k(x_1)\varphi_k(x_2)    (49)

where the \lambda_k are real numbers, and the \varphi_k are an orthonormal sequence,

E\varphi_j(X_1)\varphi_k(X_1) = \begin{cases} 1 & \text{if } j = k, \\ 0 & \text{if } j \ne k. \end{cases}    (50)

The \lambda_k are the eigenvalues, and the \varphi_k(x) are corresponding eigenfunctions of the transformation, g(x) \to EA(x, X_1)g(X_1). That is, for all k,

\lambda_k \varphi_k(x) = E[A(x, X_1)\varphi_k(X_1)].    (51)

For a degenerate kernel with A(x_1, x_2) = h(x_1, x_2) - \theta, we have EA(x_2, X_1) = h_1(x_2) - \theta = 0; taking expectations in (51) then gives \lambda_k E\varphi_k(X_1) = 0. Thus all eigenfunctions corresponding to nonzero eigenvalues have mean zero. Now we can apply the method of Example 2 to find the asymptotic distribution of n(U_n - \theta).
Theorem 4. Let U_n be the U-statistic associated with a symmetric kernel of degree 2, degeneracy of order 1, and expectation \theta. Then n(U_n - \theta) \xrightarrow{L} \sum_1^{\infty} \lambda_j(Z_j^2 - 1), where Z_1, Z_2, \ldots are independent N(0, 1) and \lambda_1, \lambda_2, \ldots are the eigenvalues satisfying (49) with A(x_1, x_2) = h(x_1, x_2) - \theta.
For h having degeneracy of order 1 and arbitrary degree m \ge 2, the corresponding result gives the asymptotic distribution of n(U_n - \theta) as \binom{m}{2} \sum_1^{\infty} \lambda_j(Z_j^2 - 1), where the \lambda_j are the eigenvalues for the kernel h_2(x_1, x_2) - \theta. (See Serfling (1980) or Lee (1990).)
Computation. To obtain the asymptotic distribution of U_n in a specific case requires computation of the eigenvalues, \lambda_i, each taken with its multiplicity. In general, there may be an infinite number of these. However, for many kernels, there are just a finite number of nonzero eigenvalues. This occurs, for example, when h(x, y) is a polynomial in x and y, or more generally, when h(x, y) is given in the form, h(x, y) = \sum_1^p f_i(x)g_i(y), for some functions f_i and g_i. See Exercise 8 for an indication of how the \lambda_i are found for such kernels.
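When the distribution of X has finitely many atoms, the transformation g(x) → EA(x, X_1)g(X_1) acts on a finite-dimensional space, so the eigen-relation (51) can be verified directly. A sketch for a rank-2 kernel of the form in Example 2, whose nonzero eigenvalues should be exactly a and b (the atoms, functions, and coefficients are all invented for illustration):

```python
# X uniform on three atoms; f and g are mean-zero and orthonormal
# under this distribution, so h(x, y) = a f(x)f(y) + b g(x)g(y)
# has eigenfunctions f and g with eigenvalues a and b.
atoms = [-1.0, 0.0, 1.0]
p = 1.0 / 3.0
f = [1.5 ** 0.5 * x for x in atoms]               # E f = 0, E f^2 = 1
g = [4.5 ** 0.5 * (x * x - 2.0 / 3.0) for x in atoms]  # E g = 0, E g^2 = 1, E fg = 0
a, b = 2.0, 5.0

def h(i, j):
    return a * f[i] * f[j] + b * g[i] * g[j]

def T(phi):
    """(T phi)(x) = E[h(x, X) phi(X)], evaluated on the atom grid."""
    return [sum(h(i, j) * phi[j] * p for j in range(3)) for i in range(3)]

# Check the eigen-relation (51): T f = a f and T g = b g.
print(T(f))  # equals [a * v for v in f]
print(T(g))  # equals [b * v for v in g]
```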
Exercises.
1. Let \mathcal{P} be the set of all distributions on the real line with finite first moment. Show that there does not exist a function f(x) such that E_P f(X) = \mu^2 for all P \in \mathcal{P}, where \mu is the mean of P, and X is a random variable with distribution P.
2. Let g1 and g2 be estimable parameters within P with respective degrees m1 and
m2 . (a) Show g1 + g2 is an estimable parameter with degree max{m1 , m2 }. (b) Show
g1 g2 is an estimable parameter with degree at most m1 + m2 .
3. Let \mathcal{P} be the class of distributions of two-dimensional vectors, V = (X, Y), with finite second moments. Find a kernel, h(V_1, V_2), of degree 2, for estimating the covariance. Show that the corresponding U-statistic is the (unbiased) sample covariance, s_{xy} = \frac{1}{n-1} \sum_1^n (X_i - \bar{X}_n)(Y_i - \bar{Y}_n).
(c) What is the asymptotic distribution under the hypothesis H0 : F = G? (Give
numerical values for the mean and variance.)
7. Suppose the distribution of X is symmetric about the origin, with variance \sigma^2 > 0 and EX^4 < \infty. Consider the kernel, h(x, y) = xy + (x^2 - \sigma^2)(y^2 - \sigma^2).
(a) Show the problem is degenerate of order 1.
(b) Find \lambda_1, \lambda_2, and \varphi_1(x) and \varphi_2(x) orthonormal, so that h(x, y) = \lambda_1\varphi_1(x)\varphi_1(y) + \lambda_2\varphi_2(x)\varphi_2(y).
(c) Find the asymptotic distribution of nU_n.
8. Suppose the distribution of X is symmetric about the origin, with variance \sigma^2 > 0 and EX^6 < \infty. Consider the kernel, h(x, y) = xy(1 + x^2 y^2).
(a) Show the problem is degenerate of order 1.
(b) Using (51) with A = h, show that any eigenfunction with nonzero eigenvalue must be of the form, \varphi(x) = ax^3 + bx, for some a and b.
(c) Specializing to the case where X has a N(0, 1) distribution (EX^2 = 1, EX^4 = 3 and EX^6 = 15), find the linear equations for a and b by equating coefficients of x and x^3 in (51).
(d) Find the two nonzero eigenvalues (no need to find the eigenfunctions).
(e) What is the asymptotic distribution of nU_n?
References.
Denker, Manfred (1985) Asymptotic Distribution Theory in Nonparametric Statistics,
Fr. Vieweg & Sohn, Braunschweig, Wiesbaden.
Ferguson, T. S. (1996) A Course in Large Sample Theory, Chapman-Hall, New York.
Fraser, D. A. S. (1957) Nonparametric Methods in Statistics, John Wiley & Sons,
New York.
Hoeffding, W. (1948a) A class of statistics with asymptotically normal distribution, Ann. Math. Statist. 19, 293-325.
Hoeffding, W. (1948b) A non-parametric test for independence, Ann. Math. Statist. 19, 546-557.
Lee, A. J. (1990) U-Statistics, Marcel Dekker Inc., New York.
Lehmann, E. L. (1951) Consistency and unbiasedness of certain nonparametric tests
Ann. Math. Statist. 22, 165-179.
Lehmann, E. L. (1999) Elements of Large Sample Theory, Springer.
Mann, H. B. and Whitney, D. R. (1947) On a test of whether one of two random
variables is stochastically larger than the other Ann. Math. Statist. 18, 50-60.
Serfling, R. J. (1980) Approximation Theorems of Mathematical Statistics, John Wiley & Sons, New York.
Wilcoxon, Frank (1945) Individual comparisons by ranking methods, Biometrics 1,
80-83.