
STATISTICS IN MEDICINE, VOL. 15, 631-645 (1996)

THE APPROPRIATENESS OF THE WILCOXON TEST IN ORDINAL DATA

JOAN F. HILTON
Department of Epidemiology and Biostatistics, University of California at San Francisco, San Francisco,
CA 94143-0560, U.S.A.

SUMMARY
In two-sample studies with ordinal responses, the Wilcoxon rank-sum test is generally chosen to test
equality of the distributions, even though it is specifically a test of location shift. I compared the power of
the exact tests based on the Wilcoxon statistic, O'Brien's generalized Wilcoxon statistic, and the omnibus
Smirnov statistic in the presence of location-shift and scale alternatives. All three tests can gain power as
a function of the scale parameter, depending on its magnitude relative to the shift parameter. As the relative
influence of the scale parameter grows, the O'Brien test becomes the most powerful. I also compared the
power of the asymptotic Wilcoxon test with that of its exact version, and found that the asymptotic method
either inflates or deflates power when scale changes are present. I present results of a simulation study.

1. INTRODUCTION

In the analysis of two-sample data, one generally chooses the Wilcoxon rank-sum statistic
to test equality of the distributions, $H_0: F_1 = F_2$. This statistic is specifically sensitive to the
hypothesis

$$H_1: F_1(y) = F_2(y - \Delta),$$

where $y$ is a continuous random variable and $\Delta$ represents a location shift between two
distributions of responses. In categorical data, the Wilcoxon test statistic must accommodate tied
responses and one must characterize alternative hypotheses other than by simple transformations
of the data (for example, $Y' = Y + c$ or $Y' = cY$). With few distinct responses available to
describe the shapes of the distributions, little may be known about the types of differences
that occur between them, and one should consider other statistics. In this situation, and when the
difference between distributions is clearly not a location shift, an omnibus test that
accommodates ordinal categorical data, such as that based on the Smirnov statistic, may be
more powerful than the Wilcoxon test.
In this paper I consider independent samples of $m$ and $n$ ordinal categorical responses that arise
from multinomial distributions, $F_1$ and $F_2$, with parameters $\pi = (\pi_1, \ldots, \pi_K)$, $\sum_j \pi_j = 1$, and
$\pi' = (\pi'_1, \ldots, \pi'_K)$, $\sum_j \pi'_j = 1$, respectively. Denote the two samples by $x = (x_1, \ldots, x_K)$,
$\sum_j x_j = m$, and $x' = (x'_1, \ldots, x'_K)$, $\sum_j x'_j = n$, and the combined data by $t_j = x_j + x'_j$,
$j = 1, \ldots, K$, where $K$ is
the number of distinct categories in the combined samples. I denote realizations of the data in

CCC 0277-6715/96/060631-15                                        Received July 1994
© 1996 by John Wiley & Sons, Ltd.                                 Revised June 1995

lower case and random variables in upper case letters. The data can be laid out in a contingency
table, as shown here:
                 rank of ordered category
             1       2       ...     K       total
sample 1     x_1     x_2     ...     x_K     m
sample 2     x'_1    x'_2    ...     x'_K    n
both         t_1     t_2     ...     t_K     m + n

Such data are common in epidemiologic and clinical research. They could arise, for example, in
a clinical trial in which one compares two chemotherapies, A and B, with respect to the worst
toxicity level observed during a course of treatment. The five levels might be recorded as {none,
mild, moderate, severe, life-threatening} and ranked as (1, 2, 3, 4, 5). Of $m$ subjects on the
standard therapy, arm A, $x_j$ is the number with toxicity nadir at level $j$, compared with $x'_j$ of
$n$ subjects on the experimental therapy, arm B. To pose a simple example, suppose experience
with the standard therapy indicates a uniform distribution of subjects across toxicity levels (that
is, 20 per cent of subjects respond at each level). Reductions in toxicity for experimental arm
subjects could occur through a location-type shift or a scale-type shift: more experimental therapy
subjects could experience low levels of toxicity (for example, 29, 23, 19, 16, and 13 per cent of
subjects respond at levels 1 to 5), or more could experience moderate levels of toxicity and fewer
experience very high or very low toxicity (for example, 9, 25, 32, 25, and 9 per cent of subjects
respond at levels 1 to 5).
The two-sided linear rank Wilcoxon statistic is defined as

$$W = |W_+ - E(W_+)|,$$

where $W_+ = \sum_j w_j X_j$, $E(W_+) = \tfrac{1}{2} m(m + n + 1)$, and
$w_j = t_1 + t_2 + \cdots + t_{j-1} + \tfrac{1}{2}(t_j + 1)$. Inspection of the score, $w_j$, shows that it
reflects the cumulative number of subjects who respond by
the $j$th category. In the special case of equal numbers of tied responses in each category,
$t_1 = t_2 = \cdots = t_K$, the scores $w = (w_1, \ldots, w_K)$ are equally spaced, as are the ranks $j = (1, \ldots, K)$,
and the Wilcoxon statistic is identical to the Cochran-Armitage trend statistic.3,4 In
particular, this occurs when the data are strictly continuous (that is, $K = m + n$ and $t_j = 1$,
$j = 1, \ldots, K$).
Several authors have reported attempts to increase the sensitivity of the Wilcoxon test to
a broader spectrum of differences between distributions than location alone. For example, Eplett5
proposed a statistic that is the sum of the Wilcoxon and Smirnov statistics. The two-sided non-linear
rank Smirnov statistic is defined as

$$S = \max_{1 \le j \le K} |\hat F_1(j) - \hat F_2(j)|,$$

where $\hat F_1(j)$ and $\hat F_2(j)$ are the empirical distribution functions. A nice feature of this statistic is its
simple interpretation.
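
As a short illustration, the sketch below computes the Smirnov statistic for grouped data, assuming its usual form: the largest absolute difference between the two empirical distribution functions evaluated at the category boundaries. The function name and the counts are hypothetical.

```python
from itertools import accumulate


def smirnov_statistic(x, x_prime):
    """Largest absolute gap between the two empirical CDFs over the K categories."""
    m, n = sum(x), sum(x_prime)
    cdf1 = [c / m for c in accumulate(x)]
    cdf2 = [c / n for c in accumulate(x_prime)]
    return max(abs(a - b) for a, b in zip(cdf1, cdf2))


# Hypothetical 2 x 3 table: the empirical CDFs are (0.75, 1, 1) and (0, 0.25, 1).
print(smirnov_statistic([3, 1, 0], [0, 1, 3]))   # 0.75
```

The value reads directly as "at some category, the cumulative proportions differ by 75 percentage points", which is the simple interpretation referred to above.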
More recently, O'Brien6 generalized Wilcoxon's test and Student's t test. Specifically, the
generalized Wilcoxon statistic regresses group membership, $Y$, on the ranks of the responses, $w_j$,
and includes both a linear and a quadratic term. (As I show in Section 2 for exact conditional
tests, if a test conditions on both margins of a contingency table then we can interchange the roles
of $Y$ and $w_j$; treating $Y$ as the response variable and $w_j$ as the explanatory variable has no effect
on the P value.) We can obtain an overall test of no association, $H_0: \beta_1 = \beta_2 = 0$, via either
ordinary least squares or logistic regression. The logistic regression model for the generalized
Wilcoxon test is

$$\mathrm{logit}(\Pr\{Y = 1 \mid w_j\}) = \beta_0 + \beta_1 w_j + \beta_2 w_j^2. \qquad (1)$$

According to O'Brien,6 the main usefulness of the quadratic models 'will be to aid in the detection
and understanding of group differences when the associations with log odds of group membership
are U-shaped'. Podgor and Gastwirth showed that this O'Brien test is asymptotically equivalent
to a Lepage test, which is based on the sum of squares of standardized location shift only
(Wilcoxon) and scale shift only (Ansari-Bradley) linear rank tests.7,8 Furthermore, they showed
that this test is a member of a large family of O'Brien-type tests.8
O'Brien6 and Blair and Morel9 evaluated the power of asymptotic tests based on Wilcoxon's
statistic and Student's t statistic and on O'Brien's generalized (ordinary least squares) versions of
these tests in continuous data. Here, I compare the power of exact tests based on the Wilcoxon
statistic, the Smirnov statistic, and O'Brien's generalized (logistic regression) Wilcoxon statistic in
ordinal categorical data. Thus, these power evaluations complement the earlier comparisons of
the Wilcoxon test and its generalization.
I evaluate exact tests because in relatively small samples some cell counts $x_j$ in a contingency
table can be quite small while others are quite large, and thus the asymptotic approximation may
be poor. Exact tests are often more conservative than their asymptotic counterparts. This is
because the conditional size of an exact test is constrained to be no larger than the nominal size
($\alpha$), while the asymptotic test is not constrained by $\alpha$. Contrary to this usual result, Hilton et al.10
showed that the exact test based on the Smirnov statistic is less conservative than the corresponding
asymptotic test. Because of the Smirnov statistic's ease of use and its (presumed) greater
sensitivity than the Wilcoxon statistic's to non-location type alternatives, I also evaluate this
exact test.
Hilton and Mehta10-12 previously described a very efficient method for estimating power of
exact tests, which I use here. A review of the exact distributions of the test statistics and our
methods for computing power appears in Section 2. Section 3 discusses a model for expressing
location- and scale-type alternatives suitable for categorical data. Two examples based on real
data appear in Section 4. The first concerns a dataset with a non-significant Wilcoxon P value,
but where there is also evidence of differences between distributions other than shift in location.
The second example compares the power of the Wilcoxon test with use of exact and asymptotic
methods (Whitehead's13 approximation). Section 5 presents a simulation study of general
differences in power among the Wilcoxon, O'Brien and Smirnov exact tests.

2. EXACT POWER
Under multinomial sampling $m$ and $n$ are fixed by design. In addition, under the null hypothesis,
when conducting a conditional test the response margin $t$ is also fixed at its observed values.
Thus, we obtain the conditional probability of a particular permutation of the data by the
generalized hypergeometric distribution (Lehmann14),

$$\Pr\{X = x \mid t, H_1\} = \frac{\prod_{j=1}^{K} \binom{t_j}{x_j} \pi_j^{x_j} (\pi'_j)^{x'_j}}
{\sum_{y \in \Gamma_t} \prod_{j=1}^{K} \binom{t_j}{y_j} \pi_j^{y_j} (\pi'_j)^{t_j - y_j}},$$

where $\pi'$ reflects a particular alternative hypothesis $H_1$ of interest and the conditional sample
space of $x$ is

$$\Gamma_t = \Big\{x: \sum_{j=1}^{K} x_j = m,\ \sum_{j=1}^{K} x'_j = n,\ \text{and}\ x + x' = t\Big\}.$$

Different statistics partition this sample space differently, resulting in greater or lesser conserva-
tiveness and sensitivity to specific alternatives. For concreteness, we express the following
notation using the Wilcoxon statistic; the modifications of these equations for other statistics are
analogous.
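The conditional probability of obtaining a Wilcoxon test result that is as or more extreme than
$w = |\sum_{j=1}^{K} w_j x_j - E(W_+)|$ is

$$\Pr\{W \ge w \mid t, H_1\} = \sum_{x \in \Gamma_t(w)} \Pr\{X = x \mid t, H_1\}, \qquad (2)$$

where the rejection region is the set $\Gamma_t(w) = \{x \in \Gamma_t: W \ge w\}$. In particular, if $w$ is the $\alpha$-level
critical value, $w_\alpha(t) = \min[w: \Pr\{W \ge w \mid t, H_0\} \le \alpha]$, then, under $H_1$,
equation (2) gives the exact
conditional power of the test,

$$\Pi_t(w) = \Pr\{W \ge w \mid t, H_1\}. \qquad (3)$$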
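
For a small table the quantities in (2) and (3) can be obtained by brute-force enumeration of $\Gamma_t$. The sketch below is illustrative only and is not the efficient algorithm cited in the text: it assumes the conditional probabilities take the generalized hypergeometric form (weights $\prod_j \binom{t_j}{x_j} \pi_j^{x_j} (\pi'_j)^{t_j - x_j}$), and the function names, the margin and the alternative are hypothetical.

```python
from itertools import product
from math import comb, prod


def midrank_scores(t):
    scores, below = [], 0
    for tj in t:
        scores.append(below + (tj + 1) / 2)
        below += tj
    return scores


def conditional_power(t, m, pi, pi_prime, alpha=0.05):
    """Return (critical value, conditional size, conditional power) for margin t."""
    total = sum(t)
    w = midrank_scores(t)
    expected = m * (total + 1) / 2
    null, alt = {}, {}                      # statistic value -> probability mass
    for x in product(*(range(tj + 1) for tj in t)):
        if sum(x) != m:                     # keep only tables in the reference set
            continue
        stat = abs(sum(wj * xj for wj, xj in zip(w, x)) - expected)
        ways = prod(comb(tj, xj) for tj, xj in zip(t, x))
        weight = ways * prod(p ** xj * q ** (tj - xj)
                             for p, q, tj, xj in zip(pi, pi_prime, t, x))
        null[stat] = null.get(stat, 0.0) + ways / comb(total, m)
        alt[stat] = alt.get(stat, 0.0) + weight
    # Critical value: smallest w whose upper-tail null probability is <= alpha.
    tail, critical = 0.0, None
    for stat in sorted(null, reverse=True):
        if tail + null[stat] > alpha:
            break
        tail += null[stat]
        critical = stat
    if critical is None:                    # no rejection region attainable at alpha
        return None, 0.0, 0.0
    norm = sum(alt.values())
    power = sum(mass for stat, mass in alt.items() if stat >= critical) / norm
    return critical, tail, power


# Hypothetical margin and alternative, for illustration only.
t, m = [4, 4, 4, 4, 4], 10
uniform = [0.2] * 5
shifted = [0.38, 0.24, 0.17, 0.12, 0.09]
critical, size, power = conditional_power(t, m, uniform, shifted)
print(critical, size, power)
```

The returned size is the attained conditional level, which by construction cannot exceed $\alpha$; calling the function with `pi_prime` equal to `pi` returns a power equal to that size.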


As noted above, when conducting a conditional test we compute the exact P-value with the
marginal responses fixed at their observed values. When designing a study, however, the marginal
responses that will arise once the data are gathered are unknown; a priori we can specify only the
distributions of the responses, $\pi$ and $\pi'$. Consequently, we must compute power unconditionally
with respect to $t = x + x'$. We can then obtain exact unconditional power as the expected value
of these terms,

$$\Pi(w) = \sum_{t \in \Omega} \Pi_t(w) \Pr\{T = t \mid H_1\}, \qquad (4)$$

where $\Omega = \{t: \sum_j t_j = m + n\}$ and
$\Pr\{T = t \mid H_1\} = \sum_{x \in \Gamma_t} \Pr\{X = x \mid H_1\} \Pr\{X' = x' \mid H_1\}$.

Thus, in theory computation of unconditional power (4) is not difficult, but in practice it is
because the sizes of $\Gamma_t$ and $\Omega$ can be quite large. For example, for $K = 5$ and $m + n = 50$,
$\Omega$ contains 316,251 distinct vectors $t$.
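
The count quoted above follows from a standard stars-and-bars argument: the number of vectors of $K$ non-negative integers summing to $m + n$ is $\binom{m + n + K - 1}{K - 1}$. A one-line check in Python (the function name is hypothetical):

```python
from math import comb


def count_margins(total, k):
    """Number of vectors t of k non-negative counts that sum to `total`."""
    return comb(total + k - 1, k - 1)


print(count_margins(50, 5))    # 316251
print(count_margins(100, 5))   # the same count for m + n = 100
```

The second call shows how quickly the set grows with the sample size, which is the practical obstacle to evaluating (4) directly.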
Alternatively, to reduce the computational burden we can instead estimate exact power from
a sample of $\Omega$, given $m$ and $n$. Hilton and Mehta11 described a Monte Carlo estimator of exact
power and reported its high efficiency relative to the usual Monte Carlo estimator when using the
Wilcoxon statistic in 5-category data. With this estimator, for each margin $t_i$, $i = 1, \ldots, N$, in the
sample from $\Omega$, we compute conditional power (3) and average these values, inherently weighting
them by $\Pr\{T = t \mid H_1\}$ as desired (see equation (4)):

$$\hat\Pi = \frac{1}{N} \sum_{i=1}^{N} \Pi_{t_i}(w). \qquad (5)$$

We extended this algorithm so that one can also estimate the power of the exact Smirnov test
via equation (5).10 The algorithm does not, however, accommodate the O'Brien statistic, where
the P-value depends on parameter estimates of $(\beta_0, \beta_1, \beta_2)$, given reference set $\Gamma_t$, for which there
are no closed form solutions. Consequently, we generate exact O'Brien P-values and estimate the
power of this test using the usual Monte Carlo estimator,

$$\tilde\Pi = \frac{1}{N} \sum_{i=1}^{N} I[\Pr\{W \ge w_i \mid x_i, t_i, H_0\} \le \alpha], \qquad (6)$$

where $I[\cdot]$ is the indicator function, equal to 1 when true and 0 when false. Both estimators are
unbiased for $\Pi$, the unconditional power of the test, but estimator (6) is less efficient than
estimator (5): because $\Pi_t(w)$ contains more information than $I[\Pr\{W \ge w \mid x, t, H_0\} \le \alpha]$, (5)
greatly reduces the number of simulated samples needed to estimate power reliably.
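
The efficiency gap can be made concrete with the binomial standard error of an average of indicators. The sketch below is a back-of-envelope aid and not part of the original analysis; the function names are hypothetical, and the inputs are taken from figures reported later in the paper ($N = 500$ for estimator (6); 69.6 ± 0.9 per cent for the Wilcoxon test from $N = 100$ margins in Table I).

```python
from math import ceil, sqrt


def indicator_estimator_se(power, n_samples):
    """Standard error of estimator (6): a mean of n_samples Bernoulli(power) indicators."""
    return sqrt(power * (1 - power) / n_samples)


def samples_needed(power, target_se):
    """Smallest N for which estimator (6) reaches the target standard error."""
    return ceil(power * (1 - power) / target_se ** 2)


# Worst case (power = 0.5) with N = 500 simulated data sets.
print(round(indicator_estimator_se(0.5, 500), 4))   # 0.0224
# N that estimator (6) would need to match a standard error of 0.009 at power 0.696.
print(samples_needed(0.696, 0.009))                 # 2613
```

Under these figures, estimator (6) would need roughly 2600 simulated data sets to match the precision that estimator (5) reached with 100 margins.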

3. MODELLING ALTERNATIVES
Both $\Pi_t(w)$ and $\Pr\{T = t \mid H_1\}$ depend on $\pi$ and $\pi'$, and through these parameters we can specify
an infinite range of alternative hypotheses. Even in the simplest case, $K = 2$, however, the
hypotheses are composite: there is one parameter of interest, say, $\Delta$, and one nuisance parameter,
$\pi$. As $K$ increases, the problem of specifying alternatives becomes more difficult because then
there are $2(K - 1)$ independent parameters. Furthermore, when data are continuous we can
transform them directly to specify location and scale changes: $y' = (y - \Delta)/\tau$. When data are
categorical, however, we transform instead the distribution of the data to specify location- and
scale-type alternatives. For simplicity, I drop the suffix '-type' in some instances.
I find the proportional odds model (McCullagh15) useful in simplifying the problem of defining
alternatives in ordered categorical data. To account for the ordering of the responses, for sample
1 define the cumulative probability of responding in categories 1 to $j$ as $\gamma_j = \pi_1 + \pi_2 + \cdots + \pi_j$,
and set $\gamma_0 = 0$. Then, $\gamma_{j-1} \le \gamma_j$, $j = 1, \ldots, K$. For sample 2 define $\gamma'_j$, $j = 0, \ldots, K$, analogously.
Since these are multinomial probabilities, only $K - 1$ parameters are unique and $\gamma_K = \gamma'_K = 1.0$.
We can now express hypotheses for ordinal data in terms of $\gamma = (\gamma_1, \ldots, \gamma_{K-1})$ and
$\gamma' = (\gamma'_1, \ldots, \gamma'_{K-1})$. The null hypothesis is $H_0: \gamma'_j = \gamma_j$, for all $j$. The two-sided Wilcoxon statistic is
sensitive to the alternative $H_1: \gamma'_j \ge \gamma_j$ or $\gamma'_j \le \gamma_j$, for all $j$, with inequality for at least one $j$.
The proportional odds model is

$$\mathrm{logit}(\gamma'_j) = \mathrm{logit}(\gamma_j) - \Delta, \quad j = 1, \ldots, K - 1, \qquad (7)$$

where $\Delta \in (-\infty, \infty)$, with $\Delta = 0$ representing the null case. $\Delta$ is the log odds ratio that compares
the two samples' odds of response in categories 1 to $j$ (versus $j + 1$ to $K$). We can extend this
model also to allow differences in scale:

$$\mathrm{logit}(\gamma'_j) = \frac{\mathrm{logit}(\gamma_j) - \Delta}{\exp(-\tau)}, \quad j = 1, \ldots, K - 1, \qquad (8)$$

where $\tau \in (-\infty, \infty)$ and $\tau = 0$ under $H_0$. Use of such a model reduces the parameters to $K + 1$
and clearly distinguishes between parameters of interest, $\Delta$ and $\tau$, and nuisance parameters, $\gamma$.
Thus, we can generate an alternative hypothesis by specifying $K - 1$ nuisance parameters, $\pi$, and
the parameters by which $\pi'$ differs from $\pi$, and we can plot power as a function of a parameter of
interest. $\Delta$ and $\tau$ may be particular deviations from $\pi$ whose values we can hypothesize based on
the science underlying the problem, while the nuisance parameters might represent the distribution
of a control group whose values we can obtain from previous research.
Figures 1 and 2 illustrate distributions that can arise from the modified proportional odds
model (8) for two different vectors of nuisance parameters, $\pi_1 = (0.20, 0.20, 0.20, 0.20, 0.20)$ and
$\pi_2 = (0.03, 0.06, 0.13, 0.26, 0.52)$ respectively. Parts (a) of the figures show the response probabilities
($\pi$) in each category, while parts (b) show the corresponding cumulative mass functions ($\gamma$).
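
A minimal Python sketch of how an alternative $\pi'$ can be generated from a baseline $\pi$ and a pair $(\Delta, \tau)$ by shifting and rescaling the cumulative logits. The function name is hypothetical, and the direction in which $\Delta$ moves sample 2 is treated as an assumption: the sketch adds `delta` to the baseline cumulative logits, because with illustrative values $\Delta = 0.5$ and $\tau = 0.5$ (chosen here, not stated in the text) that orientation appears to reproduce the percentages quoted in the Introduction. Pass `-delta` for the opposite orientation.

```python
from itertools import accumulate
from math import exp, log


def logit(p):
    return log(p / (1 - p))


def expit(z):
    return 1 / (1 + exp(-z))


def shift_scale_alternative(pi, delta, tau):
    """Category probabilities for sample 2 from baseline pi, shift delta and scale tau."""
    gamma = list(accumulate(pi))[:-1]            # gamma_1, ..., gamma_{K-1}
    # ASSUMPTION: delta is added to the baseline cumulative logits (see the lead-in).
    # Dividing by exp(-tau) is the same as multiplying by exp(tau).
    gamma_prime = [expit((logit(g) + delta) * exp(tau)) for g in gamma]
    bounds = [0.0] + gamma_prime + [1.0]
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]


uniform = [0.2] * 5
# Location-type change only (illustrative delta = 0.5).
print([round(100 * p) for p in shift_scale_alternative(uniform, 0.5, 0.0)])
# [29, 23, 19, 16, 13]
# Scale-type change only (illustrative tau = 0.5).
print([round(100 * p) for p in shift_scale_alternative(uniform, 0.0, 0.5)])
# [9, 25, 32, 25, 9]
```

With `delta = 0` and `tau = 0` the function returns the baseline unchanged, which corresponds to the null case.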
Figure 1. One baseline group distribution, $\pi_1 = (0.20, 0.20, 0.20, 0.20, 0.20)$ and $(\Delta, \tau) = (0, 0)$, and three alternative group
distributions, $(\Delta, \tau) = (0.9, 0), (0.9, 0.2), (0.9, 0.4)$, represented by (a) probability mass functions and (b) cumulative probability
distributions (horizontal axis: category)

Figure 2. One baseline group distribution, $\pi_2 = (0.03, 0.06, 0.13, 0.26, 0.52)$ and $(\Delta, \tau) = (0, 0)$, and three alternative group
distributions, $(\Delta, \tau) = (0.9, 0), (0.9, 0.2), (0.9, 0.4)$, represented by (a) probability mass functions and (b) cumulative probability
distributions (horizontal axis: category)

One may see in the case of symmetric nuisance parameters, such as
$\pi_1 = (0.20, 0.20, 0.20, 0.20, 0.20)$, that the proportional odds model does not itself induce asymmetry
in $\gamma'$. When $\Delta$ changes alone are present, it generates alternative parameters that are
symmetric about the off-diagonal (see Figure 1(b)). Similarly, for $\tau$ changes alone (not shown), the
model generates alternatives that are symmetric about the point defined by the intersection of the
two diagonal axes of the graph. In general, $\tau$ induces U-shaped probability mass functions that
are concave if $\tau > 0$ and convex if $\tau < 0$ (Figure 1(a)).

In the case of asymmetric nuisance parameters, such as $\pi_2 = (0.03, 0.06, 0.13, 0.26, 0.52)$ in
Figure 2, the alternatives are harder to predict but are related to the simpler case above. For
example, the $\tau$ changes are more J-shaped. An important point is that in both cases $\tau$ tends to
make $\gamma'$ cross $\gamma$, so that neither $\gamma'_j \ge \gamma_j$ for all $j$ nor $\gamma'_j \le \gamma_j$ for all $j$ holds, and hence the Wilcoxon
statistic may lose power when $\tau \ne 0$. O'Brien6 anticipated that his generalized statistics would be
sensitive to U-shaped alternatives. Since modification of the proportional odds model via a scale
parameter $\tau$ produces such shapes in categorical data, one would expect his test to have high
power against scale alternatives in this setting.

4. EXAMPLES
4.1. Example 1: Non-location-shift differences between distributions
Graubard and Korn16 discuss maternal alcohol consumption during pregnancy and occurrence
of malformation of their infants. The alcohol data were collected as average number of drinks
per day in five categories. The data, including the proportions of malformed infants, appear
below:

Malformation               Average number of drinks per day
               0             <1            1-2          3-5          ≥6
Present        48 (0.28%)    38 (0.26%)    5 (0.63%)    1 (0.79%)    1 (2.6%)
Absent         17,066        14,464        788          126          37

I computed exact P-values from the Wilcoxon test, the Smirnov test, and the permutation test,
and asymptotic P-values from O'Brien's generalized tests (the sample size is too large to compute
these exact P-values).

                                                    two-sided P-value
Statistic      Scores                               basic test    generalized test
Wilcoxon       {0, 0.659, 0.977, 0.997, 1}          0.5772        (0.0127)
Smirnov        -                                    0.3625        -
Permutation    {0, 0.071, 0.214, 0.571, 1}          0.0172        (0.0075)

I transformed the scores shown for the Wilcoxon test (to $(w_j - w_1)/(w_K - w_1)$, $j = 1, \ldots, K$)
in order to contrast them more clearly with the permutation test scores. The permutation test
scores shown above are the (approximate) midpoints of the alcohol consumption levels
{0, 0.5, 1.5, 4.0, 7.0}, divided by 7. I present the permutation test as a logistic regression analogue of
Student's t test. I substitute its scores for the ranks in equation (1) to obtain O'Brien's generalized
t statistic.6
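
Both score vectors in the table can be recomputed from the counts. The following Python sketch (helper names are hypothetical) forms the midrank scores from the column totals, rescales them to run from 0 to 1, and divides the consumption midpoints by 7.

```python
present = [48, 38, 5, 1, 1]
absent = [17066, 14464, 788, 126, 37]


def midrank_scores(t):
    scores, below = [], 0
    for tj in t:
        scores.append(below + (tj + 1) / 2)
        below += tj
    return scores


def rescale(scores):
    """Map scores linearly so that the first is 0 and the last is 1."""
    lo, hi = scores[0], scores[-1]
    return [(s - lo) / (hi - lo) for s in scores]


t = [a + b for a, b in zip(present, absent)]          # column totals
wilcoxon_scores = [round(s, 3) for s in rescale(midrank_scores(t))]
permutation_scores = [round(v / 7, 3) for v in [0, 0.5, 1.5, 4.0, 7.0]]

print(wilcoxon_scores)      # [0.0, 0.659, 0.977, 0.997, 1.0]
print(permutation_scores)   # [0.0, 0.071, 0.214, 0.571, 1.0]
```

Because more than half of the mothers fall in the first category and most of the rest in the second, the rescaled midranks bunch the three highest consumption levels together near 1, whereas the midpoint scores keep them well separated.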
The highly significant P-values from both tests of $H_0: \beta_1 = \beta_2 = 0$ suggest the presence of
a relatively strong scale effect, as defined by model (8), in these data. If instead a proportional odds
model (7) fit the data, then $\Delta$ would be constant across all categories. In fact, we can estimate
$\Delta_j$, $j = 1, \ldots, 4$, from the data as {0.037, 0.99, 1.5, 2.3}! It is no surprise then that the Wilcoxon test
is non-significant. The generalized Wilcoxon test is sensitive to a broader alternative and is much
more significant (P = 0.0127).
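
The category-specific estimates can be recovered as cumulative log odds ratios. The sketch below is one way to obtain them; the orientation (malformation absent relative to present) and the helper name are assumptions made here so that the values come out positive.

```python
from itertools import accumulate
from math import log

present = [48, 38, 5, 1, 1]
absent = [17066, 14464, 788, 126, 37]


def cumulative_logits(counts):
    """Log odds of falling in categories 1..j rather than j+1..K, for j = 1..K-1."""
    total = sum(counts)
    return [log(c / (total - c)) for c in list(accumulate(counts))[:-1]]


delta_j = [a - p for a, p in zip(cumulative_logits(absent), cumulative_logits(present))]
print([float(f"{d:.2g}") for d in delta_j])   # [0.037, 0.99, 1.5, 2.3]
```

Under a proportional odds model these four values would be roughly equal; here they rise steadily with the cut point.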
Figure 3. Distributions of Barthel's Index scores of control and treated (investigational) subjects: (a) probability mass functions and (b)
cumulative probability distributions; horizontal axis: Barthel Index, categorized (data from Lesaffre et al.,17 used with permission)

The permutation test is significant, however, at P = 0.0172. The scores of this test give
information about the association between occurrence of malformations and maternal drinking,
suggesting how distinct subjects in one category are from the others. Their comparison with the
observed proportions of malformations by drinking level shows that the data support this
hypothesized association reasonably well, hence the small P-value.

4.2. Example 2: Exact versus asymptotic power


Lesaffre et al.17 described the problem of calculating sample size in studies with bounded
outcome scores. Their responses fell into 21 categories, with high probabilities in the first and last
categories. As shown in Figure 3(a), their data demonstrate U-shaped distributions. The difference
between the distributions may result from a location shift alone or from a combination of
location and scale changes; Figure 3(b) suggests that the difference is predominantly a shift in
location. Lesaffre et al. note that when the data have a U- or J-shaped distribution and are
ordinal, Lehmann's14 method of determining power using the Wilcoxon statistic, which assumes
a location-shift alternative and a normally-distributed test statistic, is not advised.17 Using
Lehmann's method, they estimated that 120 subjects per group are required to detect a standardized
difference of 15/40 = 0.375 with 80 per cent power using a two-sided 0.05-level Wilcoxon
test.
Using the estimator described in equation (5), parameter estimates $\hat\pi$ and $\hat\pi'$ obtained from
Lesaffre et al.17 (personal communication), and $(m, n) = (120, 120)$, I estimated the power of the
exact two-sided Wilcoxon test. Based on $N = 10$ simulated samples, the Wilcoxon power was
45.6 ± 1.1 per cent.
Whitehead13 described a new asymptotic method for sample size calculations for ordered
categorical data based on the Wilcoxon statistic. His equation for the combined sample size is

$$m + n = \frac{3(A + 1)^2 (z_{\alpha/2} + z_{\beta})^2}{A\,\theta_R^2 \left(1 - \sum_{j=1}^{K} \bar\pi_j^3\right)}, \qquad (9)$$

where $A = n/m$, the allocation ratio of experimental to control subjects; $z_{\alpha/2}$ and $z_\beta$ are standard
normal variates, and $\bar\pi_j = (\pi_j + A\pi'_j)/(1 + A)$. Whitehead13 refers to $\theta_R$ as the reference difference,
estimated by

$$\theta_R = \mathrm{logit}(\gamma'_j) - \mathrm{logit}(\gamma_j), \quad j = K/2.$$

When $K$ is odd, we calculate $\theta_R$ as the average over the two central categories. Note that
Whitehead also chooses to model the alternative distribution using the proportional odds
model, and that $\theta_R$ equals $\Delta$ in model (8) when $\tau = 0$ but otherwise captures a combination of the
location and scale effects.
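
Solving (9) for $z_\beta$ at a fixed sample size gives the asymptotic power. The Python sketch below is a reading of that calculation rather than Whitehead's own code; the function names are hypothetical. The first call uses the summary values reported for this example in the next paragraph ($\hat\theta_R = 0.519$ and $1 - \sum_j \bar\pi_j^3 = 0.873$). The second builds $\pi'$ from the uniform baseline with $\Delta = 0.9$ and $\tau = 0$, under the assumption that $\Delta$ is added to the baseline cumulative logits.

```python
from itertools import accumulate
from math import exp, log, sqrt
from statistics import NormalDist


def whitehead_power(theta_r, one_minus_sum_cubes, m, n, alpha=0.05):
    """Two-sided power obtained by solving (9) for z_beta; A = n/m."""
    a = n / m
    z_total = sqrt((m + n) * a * theta_r ** 2 * one_minus_sum_cubes / (3 * (a + 1) ** 2))
    return NormalDist().cdf(z_total - NormalDist().inv_cdf(1 - alpha / 2))


def cumulative_logits(p):
    return [log(g / (1 - g)) for g in list(accumulate(p))[:-1]]


def summaries(pi, pi_prime, a=1.0):
    """Return theta_R (central cumulative log odds ratio) and 1 - sum(pibar_j ** 3)."""
    diffs = [lp - l for lp, l in zip(cumulative_logits(pi_prime), cumulative_logits(pi))]
    k = len(pi)
    if k % 2 == 0:
        theta = diffs[k // 2 - 1]                         # category j = K/2
    else:
        theta = (diffs[k // 2 - 1] + diffs[k // 2]) / 2   # two central categories
    pibar = [(p + a * q) / (1 + a) for p, q in zip(pi, pi_prime)]
    return theta, 1 - sum(v ** 3 for v in pibar)


def alternative(pi, delta, tau):
    # ASSUMPTION: delta is added to the baseline cumulative logits, then scaled by exp(tau).
    g = [1 / (1 + exp(-(l + delta) * exp(tau))) for l in cumulative_logits(pi)]
    bounds = [0.0] + g + [1.0]
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]


print(round(100 * whitehead_power(0.519, 0.873, 120, 120), 1))    # 58.3

pi_1 = [0.2] * 5
theta, spread = summaries(pi_1, alternative(pi_1, 0.9, 0.0))
print(round(100 * whitehead_power(theta, spread, 50, 50), 1))     # 71.7
```

The second value agrees with the asymptotic entry for $\pi_1$, $\tau = 0$ and $(m, n) = (50, 50)$ in Table III; only the balanced case is checked here.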
Using $\hat\pi$ and $\hat\pi'$ from Figure 3, I compute $1 - \sum_j \bar\pi_j^3 = 0.873$ and $\hat\theta_R = 0.519$. With
$(m, n) = (120, 120)$, I obtain 58.3 per cent power. This power level is closer to the exact level than
the estimate using Lehmann's asymptotic method, but still underestimates the number of subjects
needed. As shown below, the large difference between the exact and asymptotic power estimates
suggests that these two samples differ in scale as well as in location.

5. SIMULATIONS
5.1. Methods
In this section I compare more generally the unconditional power of the two-sided exact tests
based on the Wilcoxon, O'Brien and Smirnov statistics in moderate ($m + n = 100$) and small
($m + n = 40$) samples. Power is determined as a function of $\tau$ for fixed values of $\Delta$. For
$m + n = 100$ I analyzed $\Delta = 0.9$ and for $m + n = 40$ I analyzed $\Delta = 1.3$. I chose these values of
$\Delta$ to ultimately obtain power values that are $\ge 0.5$, for $\tau = 0(0.1)0.6$. All tests were conducted at
the $\alpha = 0.05$ level.
I generated five-category multinomial data by transforming $m + n$ uniform (0,1) random
variates into multinomial counts specified by $\pi_1 = (0.20, 0.20, 0.20, 0.20, 0.20)$ or $\pi_2 =
(0.03, 0.06, 0.13, 0.26, 0.52)$. For sample 2, I subsequently transformed $n$ variates again via model
(8) to obtain $x' = (x'_1, \ldots, x'_5)$. I evaluated sample size pairs $(m, n) = (50, 50), (75, 25), (20, 20)$, and
$(30, 10)$; the unbalanced pairs could represent three controls per experimental subject. For each
combination of parameters, I generated $N$ two-sample data sets and estimated the power of each
test: via $\hat\Pi$ (5) with $N = 100$ for the Wilcoxon and Smirnov tests and via $\tilde\Pi$ (6) with $N = 500$ for
the O'Brien test. (I determined the latter sample size empirically to produce standard errors of
$\tilde\Pi$ that are < 0.025.)
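
The data-generation step can be sketched as follows. This is an illustrative simplification, not the author's program: it converts uniform variates to category counts by inverse-CDF lookup and draws sample 2 directly from a given $\pi'$ instead of re-transforming variates through model (8). The function name, the seed and the choice of $\pi'$ (the location-type percentages from the Introduction) are assumptions.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def multinomial_counts(size, probs, rng):
    """Turn `size` uniform(0,1) variates into category counts by inverse-CDF lookup."""
    cumulative = list(accumulate(probs))
    cumulative[-1] = 1.0        # guard against round-off in the last boundary
    counts = [0] * len(probs)
    for _ in range(size):
        counts[bisect_right(cumulative, rng.random())] += 1
    return counts


rng = random.Random(1996)                       # arbitrary seed, for reproducibility
pi_1 = [0.20, 0.20, 0.20, 0.20, 0.20]           # baseline for sample 1
pi_alt = [0.29, 0.23, 0.19, 0.16, 0.13]         # illustrative alternative for sample 2
x = multinomial_counts(50, pi_1, rng)
x_prime = multinomial_counts(50, pi_alt, rng)
t = [a + b for a, b in zip(x, x_prime)]         # one sampled margin from the set of margins
print(x, x_prime, t)
```

Each such margin `t` is one draw on which conditional power (3) would be evaluated before averaging as in (5).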
5.2. Results
Table I displays the power of the three tests against $\Delta = 0.9$ and $\tau = 0(0.1)0.6$, when $m + n = 100$,
in balanced ($m = n$) and unbalanced ($m = 3n$) samples. The alternatives analysed in Table I are
those depicted in Figures 1 and 2, for nuisance parameter vectors $\gamma_1 = (0.20, 0.40, 0.60, 0.80, 1.0)$ and
$\gamma_2 = (0.03, 0.09, 0.22, 0.48, 1.0)$, respectively. Table II displays the power of the three tests against
$\Delta = 1.3$ and $\tau = 0(0.1)0.6$, when $m + n = 40$.
The power of the Wilcoxon test increased in $\tau$ for both nuisance parameter vectors at both
$m + n = 100$ and $m + n = 40$. At $\Delta = 0.9$ and $m + n = 100$, the power increase over this range of
$\tau$ was greater for $\pi_1$ than for $\pi_2$. At $\Delta = 1.3$ and $m + n = 40$, however, the power increase was
about the same for $\pi_1$ and $\pi_2$. Assessing these results in light of the parameter pairs $(\pi, \pi')$, we can
see that, at most of these values of $\Delta$, $\tau$, they were consistent with the Wilcoxon alternative, $H_1$:
$\gamma'_j \ge \gamma_j$ or $\gamma'_j \le \gamma_j$, for all $j$ with inequality for at least one $j$. For $\pi_1$, all were consistent with this
alternative; for $\pi_2$, at $\Delta = 0.9$ beginning with $\tau = 0.4$ the two distributions crossed
Table I. Power ($\hat\Pi \pm SE(\hat\Pi)$, × 100 per cent) of exact two-sided
Wilcoxon, O'Brien and Smirnov tests at $\Delta = 0.9$ and $m + n = 100$

m,n      π*    τ      Wilcoxon      O'Brien       Smirnov

50,50    π1    0.0    69.6 ± 0.9    65.0 ± 2.1    57.8 ± 0.9
               0.1    73.8 ± 0.9    65.2 ± 2.1    62.1 ± 1.0
               0.2    78.6 ± 1.0    75.6 ± 1.9    69.0 ± 1.1
               0.3    82.6 ± 1.1    82.0 ± 1.7    75.8 ± 1.2
               0.4    85.1 ± 1.2    91.8 ± 1.2    81.9 ± 1.3
               0.5    86.8 ± 1.3    95.0 ± 1.0    86.6 ± 1.4
               0.6    89.4 ± 1.2    98.4 ± 0.6    91.7 ± 1.3
50,50    π2    0.0    67.5 ± 0.7    57.8 ± 2.2    55.0 ± 0.8
               0.1    69.1 ± 0.6    60.8 ± 2.2    58.5 ± 0.7
               0.2    70.5 ± 0.7    66.0 ± 2.1    63.9 ± 0.8
               0.3    73.2 ± 0.8    78.6 ± 1.8    71.1 ± 0.8
               0.4    74.6 ± 0.9    82.0 ± 1.7    77.2 ± 0.9
               0.5    77.1 ± 1.1    86.2 ± 1.5    84.5 ± 0.9
               0.6    80.4 ± 1.1    94.0 ± 1.1    89.6 ± 0.9
75,25    π1    0.0    57.0 ± 0.7    51.6 ± 2.2    46.7 ± 0.7
               0.1    61.8 ± 0.8    52.8 ± 2.2    51.8 ± 0.9
               0.2    66.4 ± 1.0    61.6 ± 2.2    57.2 ± 1.1
               0.3    71.4 ± 1.1    74.4 ± 2.0    66.0 ± 1.3
               0.4    74.0 ± 1.3    82.6 ± 1.7    72.8 ± 1.4
               0.5    75.3 ± 1.4    88.2 ± 1.4    76.8 ± 1.5
               0.6    77.4 ± 1.5    93.0 ± 1.1    83.0 ± 1.6
75,25    π2    0.0    54.6 ± 0.5    43.0 ± 2.2    44.8 ± 0.6
               0.1    56.4 ± 0.5    49.6 ± 2.2    47.7 ± 0.7
               0.2    57.7 ± 0.5    56.8 ± 2.2    53.2 ± 0.7
               0.3    59.9 ± 0.7    61.8 ± 2.2    58.7 ± 0.8
               0.4    61.7 ± 0.9    66.2 ± 2.1    66.0 ± 0.7
               0.5    62.9 ± 1.0    80.0 ± 1.8    72.8 ± 0.9
               0.6    66.2 ± 1.2    88.6 ± 1.4    80.2 ± 0.9

Table II. Power ($\hat\Pi \pm SE(\hat\Pi)$, × 100 per cent) of exact two-sided
Wilcoxon, O'Brien and Smirnov tests at $\Delta = 1.3$ and $m + n = 40$

m,n      π*    τ      Wilcoxon      O'Brien       Smirnov

20,20    π1    0.0    60.0 ± 0.9    49.6 ± 2.2    45.8 ± 1.0
               0.1    63.2 ± 1.0    57.6 ± 2.2    48.6 ± 1.2
               0.2    69.1 ± 1.1    63.4 ± 2.2    54.1 ± 1.3
               0.3    74.5 ± 1.3    70.4 ± 2.0    61.7 ± 1.5
               0.4    77.2 ± 1.3    77.4 ± 1.9    63.9 ± 1.6
               0.5    78.7 ± 1.5    82.2 ± 1.7    67.4 ± 1.8
               0.6    82.6 ± 1.5    89.2 ± 1.4    73.3 ± 1.9
20,20    π2    0.0    57.8 ± 0.7    46.6 ± 2.2    41.3 ± 0.9
               0.1    61.1 ± 0.7    52.2 ± 2.2    46.1 ± 1.0
               0.2    63.4 ± 0.8    57.2 ± 2.2    49.9 ± 1.1
               0.3    66.7 ± 0.9    65.6 ± 2.1    55.9 ± 1.2
               0.4    69.4 ± 1.0    70.8 ± 2.0    60.1 ± 1.3
               0.5    74.3 ± 1.2    79.0 ± 1.8    68.2 ± 1.4
               0.6    77.0 ± 1.3    82.2 ± 1.7    74.2 ± 1.5
30,10    π1    0.0    46.9 ± 0.7    41.8 ± 2.2    38.0 ± 0.9
               0.1    53.0 ± 0.8    46.4 ± 2.2    44.4 ± 1.1
               0.2    56.4 ± 1.1    55.6 ± 2.2    47.1 ± 1.3
               0.3    60.8 ± 1.4    60.8 ± 2.2    52.5 ± 1.6
               0.4    64.1 ± 1.5    68.0 ± 2.1    56.3 ± 1.8
               0.5    68.2 ± 1.6    72.0 ± 2.0    61.6 ± 1.9
               0.6    69.9 ± 1.9    81.6 ± 1.7    66.2 ± 2.2
30,10    π2    0.0    45.6 ± 0.5    37.8 ± 2.2    36.0 ± 0.8
               0.1    47.4 ± 0.6    41.4 ± 2.2    38.0 ± 1.0
               0.2    51.2 ± 0.6    47.2 ± 2.2    43.2 ± 1.1
               0.3    53.6 ± 0.8    58.0 ± 2.2    46.2 ± 1.2
               0.4    56.7 ± 0.9    60.4 ± 2.2    51.8 ± 1.2
               0.5    58.8 ± 1.2    69.0 ± 2.1    56.3 ± 1.6
               0.6    61.5 ± 1.5    72.8 ± 2.0    58.9 ± 1.8

Figure 4. Difference in power of two-sided exact tests as a function of scale parameter $\tau$, when location parameter $\Delta = 0.9$
and $m + n = 100$: Smirnov minus Wilcoxon power, and O'Brien minus Wilcoxon power. (a) Nuisance parameters are
$\pi_1 = (0.20, 0.20, 0.20, 0.20, 0.20)$; (b) Nuisance parameters are $\pi_2 = (0.03, 0.06, 0.13, 0.26, 0.52)$
($\gamma' = (0.021, 0.107, 0.37, 0.78, 1.0)$), and at $\Delta = 1.3$ beginning with $\tau = 0.6$ the distributions crossed
($\gamma' = (0.03, 0.13, 0.36, 0.37, 1.0)$).
The power of the O'Brien and Smirnov tests also increased with $\tau$ for both nuisance parameter
vectors. Figure 4 shows the differences in power between these tests and the Wilcoxon test for
$m + n = 100$ and $\Delta = 0.9$, and $\pi_1$ and $\pi_2$; Figure 5 shows these differences for $m + n = 40$ and
$\Delta = 1.3$. At $\tau = 0$ the Wilcoxon test has the greatest power of the three tests. The increasing slopes
in both figures, however, show that the O'Brien and Smirnov tests rose in power more quickly as
a function of $\tau$ than did the Wilcoxon test, and eventually exceeded its power.
Figure 5. Difference in power of two-sided exact tests as a function of scale parameter $\tau$, when location parameter $\Delta = 1.3$
and $m + n = 40$: Smirnov minus Wilcoxon power, and O'Brien minus Wilcoxon power. (a) Nuisance parameters are
$\pi_1 = (0.20, 0.20, 0.20, 0.20, 0.20)$; (b) Nuisance parameters are $\pi_2 = (0.03, 0.06, 0.13, 0.26, 0.52)$

Table III. Power (× 100 per cent) of exact and asymptotic Wilcoxon
tests, at $\Delta = 0.9$

                         (m,n) = (50,50)          (m,n) = (75,25)
π*    τ      θ_R      exact    asymptotic†     exact    asymptotic†
π1    0.0    0.900    69.6     71.7            57.0     58.8
      0.2    1.099    78.6     86.3            66.4     74.8
      0.4    1.343    85.1     96.5            74.0     90.1
π2    0.0    0.900    67.5     69.5            54.6     58.0
      0.2    0.703    70.5     50.9            57.7     41.0
      0.4    0.462    74.6     23.8            61.7     19.1

* π1 = (0.20, 0.20, 0.20, 0.20, 0.20), π2 = (0.03, 0.06, 0.13, 0.26, 0.52)
† Whitehead's method (9)

In addition, Figure 4 shows that in moderate-sized samples the O'Brien and Smirnov tests have
comparable power, with the O'Brien test generally having a slight edge on the Smirnov test.
Figure 5 shows that in smaller samples, especially unbalanced ones, the O'Brien test is consistently
more powerful than the Smirnov test. At $m + n = 40$ and $\Delta = 1.3$, the Smirnov test is less
powerful than the Wilcoxon test against $\tau \le 0.6$, but presumably would be more powerful against
larger $\tau$. The O'Brien test, however, is more powerful than the Wilcoxon test, even at these local
values of $\tau$ and for this large $\Delta$. One can also see these results in Table II.
All three tests are slightly more powerful in the presence of nuisance parameter vector $\pi_1$
than $\pi_2$, for given $\Delta$, $\tau$, and $m + n$. This is consistent with the greater separation attained
by the proportional odds model between $\pi$ and $\pi'$ when $\pi = \pi_1$ than when $\pi = \pi_2$, as discussed
above. The slope of the Wilcoxon power function in $\tau$ is steeper for the uniform baseline
parameters ($\pi_1 = (0.20, 0.20, 0.20, 0.20, 0.20)$) than for the geometrically increasing parameters
($\pi_2 = (0.03, 0.06, 0.13, 0.26, 0.52)$). Since the power of the O'Brien test appears less dependent on
the baseline parameters, the power difference between these tests is greater in the geometric
case.
Finally, Table III compares the exact power of the Wilcoxon test with its asymptotic power,
obtained using Whitehead's13 method (9) when $m + n = 100$. At $\tau = 0$ ($\theta_R = \Delta$), the Whitehead
method provides an excellent approximation to the exact power, for both nuisance parameter
vectors, $\pi$. This is the alternative against which the Wilcoxon test was designed to have high
power.
When the effect of $\tau$ on $\pi$ was such that $\theta_R > \Delta$ (for example, for $\pi_1$ and $\tau > 0$), then the
Whitehead method overestimated the power of the Wilcoxon test. Conversely, when the combined
effects of $\tau$ and $\pi$ were such that $\theta_R < \Delta$ (for example, for $\pi_2$ and $\tau > 0$), then the Whitehead
method underestimated the power of the Wilcoxon test.

6. DISCUSSION
I compared the power of exact tests based on the Wilcoxon rank-sum statistic, O'Brien's
generalized Wilcoxon statistic, and the Smirnov statistic in the presence of location- and
scale-type shifts in ordered categorical data. I used an extension of McCullagh's proportional
odds model to characterize these shifts. I found that when location shifts alone were present, the
Wilcoxon test had the greatest power of the three tests. As the influence of the scale parameter
grew relative to the location parameter, however, so did the power of the Smirnov and
O'Brien tests until they exceeded the power of the Wilcoxon test. These results were expected,
since the Wilcoxon statistic is specifically sensitive to location shifts in continuous data and the
other two statistics test the general alternative F1 ≠ F2. What was unexpected, however, was the
relatively high power of O'Brien's test when the scale change was small relative to the location
shift.
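To make the contrast between a location-sensitive statistic and an omnibus statistic concrete, the sketch below computes both from a 2 x K table of ordered category counts. It assumes midranks for the tied responses in the Wilcoxon rank-sum statistic and the maximum absolute difference between the two empirical distribution functions for the Smirnov statistic. The function names and the example counts are hypothetical, and O'Brien's generalized statistic is not attempted here.

```python
def midranks(total_counts):
    """Midrank assigned to every observation in each ordered category."""
    ranks, below = [], 0
    for c in total_counts:
        ranks.append(below + (c + 1) / 2.0)   # average of ranks below+1 .. below+c
        below += c
    return ranks

def wilcoxon_rank_sum(x_counts, y_counts):
    """Sum of midranks of the second sample, given per-category counts."""
    totals = [a + b for a, b in zip(x_counts, y_counts)]
    return sum(r * b for r, b in zip(midranks(totals), y_counts))

def smirnov_statistic(x_counts, y_counts):
    """Largest absolute gap between the two empirical distribution functions."""
    m, n = sum(x_counts), sum(y_counts)
    cum_x = cum_y = 0
    largest = 0.0
    for a, b in zip(x_counts, y_counts):
        cum_x += a
        cum_y += b
        largest = max(largest, abs(cum_x / m - cum_y / n))
    return largest

# Hypothetical five-category counts for two samples of size 10.
x = [4, 3, 2, 1, 0]
y = [0, 1, 2, 3, 4]
print(wilcoxon_rank_sum(x, y), smirnov_statistic(x, y))
```

Both functions work on counts rather than raw observations, which suits ordered categorical data where most responses are tied.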
The O'Brien test was more powerful than the Smirnov test for scale parameter τ > 0 and was
more powerful than the Wilcoxon test for τ > 0.3. The results held for two different baseline
distributions F1 of five-category qualitative responses, with the second distribution F2 specified
via the scale-modified proportional odds model. I demonstrated that the effects of local scale
changes on the baseline distribution can appear very similar to location shifts, depending on the
baseline parameters. Thus, without information on how two ordinal categorical distributions
differ, the O'Brien test appears to be an excellent choice for testing H0: F1 = F2 while maintaining
power against the possible presence of scale changes.
Previous research can guide the choice of the baseline (nuisance) parameters, but the choice of
the proportional odds model is more arbitrary. It is akin to choosing to estimate the association
in a 2 x 2 table via the difference between binomial probabilities or via the relative risk. A nice
property of this model, however, is that it does not induce asymmetry when used to generate F2
from F1. In addition, when the nuisance parameters π that define the baseline distribution are
symmetric, one can anticipate the effects of Δ and τ produced by this model. When, however, the
nuisance parameters are asymmetrical (or when one uses a model that induces asymmetry, such
as the complementary log-log model), one needs some practice to anticipate the effects of Δ and
τ on F2.
In addition to the comparison of these three exact tests, I compared the exact and asymptotic
power of the Wilcoxon test when m + n = 100. Whitehead's asymptotic approximation was
excellent in the absence of scale changes, but abysmal in the presence of even local scale changes,
producing either highly inflated or highly deflated estimates. Formerly, a drawback of conducting
exact tests and calculating exact power was that these methods were either unavailable or were
not as quick as asymptotic methods. Software for performing all three of these exact tests is,
however, increasingly available (for example, StatXact for the Smirnov and Wilcoxon tests, and
LogXact¹⁹ for O'Brien's tests).
Nikiforov²⁰ recently reported an algorithm for exact Smirnov testing. One can find the exact
power of these tests using the algorithms of Mehta and Hilton.¹⁰⁻¹²
I focus on qualitative ordinal responses in this paper and thus do not evaluate the power of
t-type tests. When the response is quantitative it is possible to use a permutation test, a categori-
cal data analogue of the t test that uses the actual values of the response variable as the scores.
When the responses are qualitative, however, these tests cannot be used because no such scores
are available. Any of the rank-based tests evaluated here can be used instead.
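The permutation test with response values as scores can be sketched by brute-force enumeration. This is only an illustration for very small samples: it lists every way of assigning the pooled scores to the second sample and is not the network-type algorithm of the exact-testing references cited above. The function name `exact_permutation_pvalue` and the example data are hypothetical.

```python
from itertools import combinations

def exact_permutation_pvalue(x_scores, y_scores):
    """Two-sided exact permutation p-value for the sum of scores in sample 2.

    Every subset of size n drawn from the pooled scores is treated as equally
    likely under the null hypothesis of identical distributions.
    """
    pooled = list(x_scores) + list(y_scores)
    n = len(y_scores)
    expected = n * sum(pooled) / len(pooled)   # null mean of the score sum
    observed = abs(sum(y_scores) - expected)
    n_subsets = 0
    n_extreme = 0
    for idx in combinations(range(len(pooled)), n):
        n_subsets += 1
        deviation = abs(sum(pooled[i] for i in idx) - expected)
        if deviation >= observed - 1e-12:      # tolerance guards against float ties
            n_extreme += 1
    return n_extreme / n_subsets

# Hypothetical quantitative responses used directly as scores.
print(exact_permutation_pvalue([1, 2, 3], [4, 5, 6]))
```

Passing midranks instead of raw values as the scores turns the same enumeration into an exact Wilcoxon rank-sum test, which is the option that remains when only the ordering of the categories is meaningful.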
The Wilcoxon rank-sum statistic is commonly used to analyse ordinal categorical data.
I showed, however, that one should not use it without careful consideration of the alternative
hypothesis of interest since it was designed specifically to detect a location shift between
distributions. Unless the location-shift effect greatly dominates the scale effect, the O'Brien test is
likely more powerful than either the Smirnov or the Wilcoxon test for analysis of two-sample
ordinal categorical data.

ACKNOWLEDGEMENTS
I thank Dr. E. Lesaffre and colleagues for sharing the response probabilities shown in Figure 3,
and I thank Dr. P. C. O'Brien for constructive comments. This work was supported in part by
a grant from the Heart, Lung and Blood Institute.

REFERENCES
1. Wilcoxon, F. Individual comparisons by ranking methods, Biometrics, 1, 80-83 (1945).
2. Smirnov, N. V. On the estimation of the discrepancy between empirical distribution curves for two
independent samples, Bulletin de l'Université de Moscou, Série Internationale (Mathématiques), 2, 3-14
(1939).
3. Cochran, W. G. Some methods for strengthening the common χ² tests, Biometrics, 10, 417-451 (1954).
4. Armitage, P. Tests for linear trends in proportions and frequencies, Biometrics, 11, 375-386 (1955).
5. Eplett, W. J. R. The distributions of Smirnov type two-sample rank tests for discontinuous distribution
functions, Journal of the Royal Statistical Society, Series B, 44, 361-369 (1982).
6. O'Brien, P. C. Comparing two samples, extensions of the t, rank-sum and log-rank tests, Journal of the
American Statistical Association, 83, 52-61 (1988).
7. Lepage, Y. A combination of Wilcoxon's and Ansari-Bradley statistics, Biometrika, 58, 213-217
(1971).
8. Podgor, M. and Gastwirth, J. On non-parametric and generalized tests for the two-sample problem
with location and scale change alternatives, Statistics in Medicine, 13, 747-758 (1994).
9. Blair, R. C. and Morel, J. G. On the use of the generalized t and generalized rank-sum statistics in
medical research, Statistics in Medicine, 11, 491-501 (1992).
10. Hilton, J. F. and Mehta, C. R. Power and sample size calculations for exact conditional tests with
ordered categorical data, Biometrics, 49, 609-616 (1993).
11. Hilton, J. F., Mehta, C. R. and Patel, N. R. An algorithm for conducting exact Smirnov tests,
Computational Statistics and Data Analysis, 17, 351-361 (1994).
12. Mehta, C. R., Patel, N. R. and Tsiatis, A. A. Exact significance testing to establish treatment equivalence
with ordered categorical data, Biometrics, 40, 819-825 (1984).
13. Whitehead, J. Sample size calculations for ordered categorical data, Statistics in Medicine, 12,
2257-2271 (1993).
14. Lehmann, E. L. Nonparametrics, Statistical Methods Based on Ranks, Holden-Day, Inc., San Francisco,
1975.
15. McCullagh, P. Regression models for ordinal data (with discussion), Journal of the Royal Statistical
Society, Series B, 42, 109-142 (1980).
16. Graubard, B. I. and Korn, E. L. Choice of column scores for testing independence in ordered 2 x K
contingency tables, Biometrics, 43, 471-476 (1987).
17. Lesaffre, E., Scheys, I., Frolich, J. and Bluhmki, E. Calculation of power and sample size with bounded
outcome scores, Statistics in Medicine, 12, 1063-1078 (1993).
18. Cytel Software Corporation. StatXact: Statistical Software for Exact Nonparametric Inference, version 3,
Cambridge, MA, 1995.
19. Cytel Software Corporation. LogXact: Statistical Software for Exact and Asymptotic Logistic Regres-
sion, version 1.1, Cambridge, MA, 1993.
20. Nikiforov, A. M. Exact Smirnov two-sample tests for arbitrary distributions, Applied Statistics, 43,
265-269 (1994).
