https://onlinecourses.science.psu.edu/stat414/prin...

Published on STAT 414 / 415 (https://onlinecourses.science.psu.edu/stat414)


The Wilcoxon Signed Rank Test for a Median


Developed in 1945 by the statistician Frank Wilcoxon, the signed rank test was one of the first
"nonparametric" procedures. It is considered a nonparametric procedure because we make only
two simple assumptions about the underlying distribution of the data, namely that:

(1) the random variable X is continuous

(2) the probability density function of X is symmetric

Then, upon taking a random sample X1, X2, ..., Xn, we are interested in testing the null hypothesis
about the median m of X:

H0 : m = m0

against any of the possible alternative hypotheses:

HA : m > m0 or HA : m < m0 or HA : m ≠ m0

As we often do, let's motivate the procedure by way of example.

Example

Let Xi denote the length, in centimeters, of a randomly selected pygmy sunfish, i = 1, 2, ..., 10.
If we obtain the following data set:

5.0 3.9 5.2 5.5 2.8 6.1 6.4 2.6 1.7 4.3

can we conclude that the median length of pygmy sunfish differs significantly from 3.7
centimeters?

Solution. We are interested in testing the null hypothesis H0: m = 3.7 against the
alternative hypothesis HA: m ≠ 3.7. In general, the Wilcoxon signed rank test procedure
requires five steps. We'll introduce each of the steps as we apply them to the data in this
example.

Step #1. In general, calculate Xi − m0 for i = 1, 2, ..., n. In this case, we have to
calculate Xi − 3.7 for i = 1, 2, ..., 10:

1 de 10 20-12-2015 00:31

Step #2. In general, calculate the absolute value of Xi − m0, that is, |Xi − m0| for i = 1, 2,
..., n. In this case, we have to calculate |Xi − 3.7| for i = 1, 2, ..., 10:

Step #3. Determine the rank Ri, i = 1, 2, ..., n of the absolute values (in ascending order)
according to their magnitude. In this case, the value of 0.2 is the smallest, so it gets rank
1. The value of 0.6 is the next smallest, so it gets rank 2. We continue ranking the data in
this way until we have assigned a rank to each of the data values:
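Steps #1 through #3 can be sketched in a few lines of Python. This is only an illustrative sketch: the helper name `rank_abs_deviations` is made up for this example, it assumes there are no ties among the absolute deviations and no deviation equal to zero (in line with the continuity assumption), and it rounds the differences only to hide floating-point noise.

```python
def rank_abs_deviations(data, m0):
    """Return (differences, absolute differences, ranks) for Steps #1-#3.

    Hypothetical helper; assumes no tied absolute differences.
    """
    # Step #1: deviations from the hypothesized median (rounded to hide float noise)
    diffs = [round(x - m0, 10) for x in data]
    # Step #2: absolute deviations
    abs_diffs = [abs(d) for d in diffs]
    # Step #3: rank 1 goes to the smallest absolute deviation, rank n to the largest
    order = sorted(range(len(data)), key=lambda i: abs_diffs[i])
    ranks = [0] * len(data)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return diffs, abs_diffs, ranks


sunfish = [5.0, 3.9, 5.2, 5.5, 2.8, 6.1, 6.4, 2.6, 1.7, 4.3]
diffs, abs_diffs, ranks = rank_abs_deviations(sunfish, 3.7)
print(diffs)      # Step #1
print(abs_diffs)  # Step #2
print(ranks)      # Step #3
```

For the sunfish data, the second observation (deviation 0.2) receives rank 1 and the tenth (deviation 0.6) receives rank 2, as described in Step #3.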


Step #4. Determine the value of W, the Wilcoxon signed rank test statistic:

$$W = \sum_{i=1}^{n} Z_i R_i$$

where Zi is an indicator variable with Zi = 0 if Xi − m0 is negative and Zi = 1 if Xi − m0 is
positive. That is, with Zi defined as such, W is then the sum of the positive signed ranks. In
this case, because the first observation yields a positive X1 − 3.7, namely 1.3, Z1 = 1.
Because the fifth observation yields a negative X5 − 3.7, namely −0.9, Z5 = 0. Determining
Zi as such for i = 1, 2, ..., 10, we get:

And, therefore, W equals 40:

W = (1)(5) + (1)(1) + ... + (0)(8) + (1)(2) = 5 + 1 + 6 + 7 + 9 + 10 + 2 = 40
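As a check on the arithmetic, here is a minimal Python sketch of Step #4. The function name `signed_rank_statistic` is hypothetical, and the code assumes distinct, nonzero absolute deviations so that every observation gets a unique rank.

```python
def signed_rank_statistic(data, m0):
    """Sum of Z_i * R_i, where R_i is the rank of |x_i - m0| and Z_i flags x_i - m0 > 0."""
    diffs = [x - m0 for x in data]
    # Indices ordered by absolute deviation; position in this order (from 1) is the rank R_i
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    w = 0
    for rank, i in enumerate(order, start=1):
        z = 1 if diffs[i] > 0 else 0  # indicator Z_i
        w += z * rank
    return w


sunfish = [5.0, 3.9, 5.2, 5.5, 2.8, 6.1, 6.4, 2.6, 1.7, 4.3]
w_sunfish = signed_rank_statistic(sunfish, 3.7)
print(w_sunfish)  # 40
```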


Step #5. Determine if the observed value of W is extreme in light of the assumed value of
the median under the null hypothesis. That is, calculate the P-value associated with W, and
make a decision about whether to reject or not to reject. Whoa, nellie! We're going to
have to take a break from this example before we can finish, as we first have to learn
something about the distribution of W.

The Distribution of W

As is always the case, in order to find the distribution of the discrete random variable W, we need:

(1) to find the range of possible values of W, that is, we need to specify the support of W

(2) to determine the probability that W takes on each of the values in the support
Let's tackle the support of W first. Well, the smallest that $W = \sum_{i=1}^{n} Z_i R_i$ could be is 0.
That would happen if each observation Xi fell below the value of the median m0 specified in the null
hypothesis, thereby causing Zi = 0, for i = 1, 2, ..., n:

The largest that $W = \sum_{i=1}^{n} Z_i R_i$ could be is $\frac{n(n+1)}{2}$. That would happen if each
observation fell above the value of the median m0 specified in the null hypothesis, thereby causing
Zi = 1, for i = 1, 2, ..., n:

and therefore W reduces to the sum of the integers from 1 to n:

$$W = \sum_{i=1}^{n} Z_i R_i = \sum_{i=1}^{n} i = \frac{n(n+1)}{2}$$

So, in summary, W is a discrete random variable whose support ranges between 0 and n(n+1)/2.

Now, if we have a small sample size n, such as we do in the above example, we could use the
exact probability distribution of W to calculate the P-values for our hypothesis tests. Errr....
first we have to determine the exact probability distribution of W. Doing so is very doable. It just
takes some thinking and perhaps a bit of tedious work. Let's make our discussion concrete by
considering a very small sample size, n = 3, say. In that case, the possible values of W are the
integers 0, 1, 2, 3, 4, 5, 6. Now, each of the three data points would be assigned a rank Ri of either
1, 2, or 3, and depending on whether the data point fell above or below the hypothesized median
m0, each of the three possible ranks 1, 2, or 3 would remain either a positive signed rank or
become a negative signed rank. In this case, because we are considering such a small sample size,
we can easily enumerate each of the possible outcomes, as well as sum W of the positive ranks to
see how each arrangement results in one of the possible values of W:

There we have it. We're just about done with finding the exact probability distribution of W when n
= 3. All we have to do is recognize that under the null hypothesis, each of the above eight
arrangements (columns) is equally likely. Therefore, we can use the classical approach to assigning
the probabilities. That is:

P(W = 0) = 1/8, because there is only one way that W = 0


P(W = 1) = 1/8, because there is only one way that W = 1
P(W = 2) = 1/8, because there is only one way that W = 2
P(W = 3) = 2/8, because there are two ways that W = 3
P(W = 4) = 1/8, because there is only one way that W = 4
P(W = 5) = 1/8, because there is only one way that W = 5
P(W = 6) = 1/8, because there is only one way that W = 6

And, just to make sure that we haven't made an error in our calculations, we can verify that the
sum of the probabilities over the support 0, 1, ..., 6 is indeed 1/8 + 1/8 + 1/8 + 2/8 + 1/8 + 1/8 + 1/8 = 1.
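The enumeration for n = 3 is small enough to reproduce by brute force. The following Python sketch (the function name `exact_pmf_by_enumeration` is made up for illustration) walks through all 2^n ways of attaching a + or − to the ranks 1, ..., n, treating each as equally likely, and tallies the resulting values of W.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def exact_pmf_by_enumeration(n):
    """Exact null distribution of W from all 2**n equally likely sign assignments."""
    counts = Counter()
    for signs in product((0, 1), repeat=n):  # one tuple (Z_1, ..., Z_n) per arrangement
        w = sum(z * rank for z, rank in zip(signs, range(1, n + 1)))
        counts[w] += 1
    total = 2 ** n
    return {w: Fraction(c, total) for w, c in sorted(counts.items())}


pmf3 = exact_pmf_by_enumeration(3)
for w, p in pmf3.items():
    print(w, p)  # fractions print in lowest terms, so 2/8 appears as 1/4
```

Calling `exact_pmf_by_enumeration(4)` or `exact_pmf_by_enumeration(5)` gives the same kind of table for the next two sample sizes.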

Hmmm. That was easy enough. Let's do the same thing for a sample size of n = 4. Well, in that
case, the possible values of W are the integers 0, 1, 2, ..., 10. Now, each of the four data points
would be assigned a rank Ri of either 1, 2, 3, or 4, and depending on whether the data point fell
above or below the hypothesized median m0, each of the four possible ranks 1, 2, 3, or 4 would
remain either a positive signed rank or become a negative signed rank. Again, because we are
considering such a small sample size, we can easily enumerate each of the possible outcomes, as
well as sum W of the positive ranks to see how each arrangement results in one of the possible
values of W:

Again, under the null hypothesis, each of the above 16 arrangements is equally likely, so we can
use the classical approach to assigning the probabilities:

P(W = 0) = 1/16, because there is only one way that W = 0


P(W = 1) = 1/16, because there is only one way that W = 1
P(W = 2) = 1/16, because there is only one way that W = 2
P(W = 3) = 2/16, because there are two ways that W = 3
and so on...
P(W = 9) = 1/16, because there is only one way that W = 9
P(W = 10) = 1/16, because there is only one way that W = 10

Do you want to do the calculation for the case where n = 5? Here's what the enumeration of
possible outcomes looks like:


After having worked through finding the exact probability distribution of W for the cases where n =
3, 4, and 5, we should be able to make some generalizations. First, note that, in general, there are
a total of $2^n$ ways to make signed rank sums, and therefore the probability that W takes on a
particular value w is:

$$P(W = w) = f(w) = \frac{c(w)}{2^n}$$

where c(w) = the number of possible ways to assign a + or a − to the first n integers so that
$\sum_{i=1}^{n} Z_i R_i = w$.
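For larger n, listing all 2^n arrangements becomes tedious. One way to get c(w) without listing them is a standard subset-sum counting recurrence. This is not a method described in this lesson, just a convenient substitute, and the names `count_ways` and `exact_pmf` are hypothetical.

```python
def count_ways(n):
    """c[w] = number of subsets of {1, ..., n} whose elements sum to w."""
    max_w = n * (n + 1) // 2
    c = [0] * (max_w + 1)
    c[0] = 1  # the empty subset: every rank carries a minus sign
    for rank in range(1, n + 1):
        # Iterate downward so that each rank is used at most once
        for w in range(max_w, rank - 1, -1):
            c[w] += c[w - rank]
    return c


def exact_pmf(n):
    """f(w) = c(w) / 2**n for w = 0, 1, ..., n(n+1)/2."""
    return [cw / 2 ** n for cw in count_ways(n)]


print(count_ways(3))  # [1, 1, 1, 2, 1, 1, 1]
print(exact_pmf(4)[:4])
```

The counts for n = 3 match the eight arrangements enumerated earlier, and the counts for any n add up to 2^n.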
Okay, now that we have the general idea of how to determine the exact probability distribution of
W, we can breathe a sigh of relief when it comes to actually analyzing a set of data. That's because
someone else has done the dirty work for us for sample sizes n = 3, 4, ..., 12, and published the
relevant results in a statistical table of W [1]. (Our textbook authors chose not to include such a
table in our textbook.) By relevant, I mean the probabilities in the "tails" of the distribution of W.
After all, that's what P-values generally are, that is, probabilities in the tails of the distribution
under the null hypothesis.

As the table of W suggests, our determination of the probability distribution of W when n = 4
agrees with the results published in the table:

because both we and the table claim that:

P(W ≤ 0) = P(W ≥ 10) = 0.062


and:

P(W ≤ 1) = P(W = 0) + P(W = 1) = 0.062 + 0.062 = 0.125


P(W ≥ 9) = P(W = 9) + P(W = 10) = 0.062 + 0.062 = 0.125
Okay, it should be pretty obvious that working with the exact distribution of W is going to be pretty
limiting when it comes to large sample sizes. In that case, we do what we typically do when we
have large sample sizes, namely use an approximate distribution of W.


Theorem. When the null hypothesis is true, for large n:

$$W' = \frac{\sum_{i=1}^{n} Z_i R_i - \frac{n(n+1)}{4}}{\sqrt{\frac{n(n+1)(2n+1)}{24}}}$$

follows an approximate standard normal distribution N(0, 1).

Proof. Because the Central Limit Theorem is at work here, the approximate standard
normal distribution part of the theorem is trivial. Our proof therefore reduces to showing
that the mean and variance of W are:

$$E(W) = \frac{n(n+1)}{4} \quad \text{and} \quad \text{Var}(W) = \frac{n(n+1)(2n+1)}{24}$$

respectively. To find E(W) and Var(W), note that $W = \sum_{i=1}^{n} Z_i R_i$ has the same distribution
as $U = \sum_{i=1}^{n} U_i$ where:

Ui = 0 with probability ½
Ui = i with probability ½

In case that claim was less than obvious, consider this intuitive, hand-waving kind of
argument:

W and U are both sums of a subset of the numbers 1, 2, ..., n


Under symmetry, an equally likely chance of getting assigned either a + or a − is
equivalent to having an equally likely chance of being included in the sum or not.

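At any rate, we therefore have:

$$E(W) = E(U) = \sum_{i=1}^{n} E(U_i) = \sum_{i=1}^{n} \left[ 0\left(\frac{1}{2}\right) + i\left(\frac{1}{2}\right) \right] = \frac{1}{2} \sum_{i=1}^{n} i = \frac{1}{2} \times \frac{n(n+1)}{2} = \frac{n(n+1)}{4}$$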

and:

$$\text{Var}(W) = \text{Var}(U) = \sum_{i=1}^{n} \text{Var}(U_i)$$

because the Ui's are independent under the null hypothesis. Now:

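$$\text{Var}(U_i) = E(U_i^2) - [E(U_i)]^2 = \left[ 0^2\left(\frac{1}{2}\right) + i^2\left(\frac{1}{2}\right) \right] - \left(\frac{i}{2}\right)^2 = \frac{i^2}{2} - \frac{i^2}{4} = \frac{i^2}{4}$$
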
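and therefore:

$$\text{Var}(W) = \sum_{i=1}^{n} \text{Var}(U_i) = \sum_{i=1}^{n} \frac{i^2}{4} = \frac{1}{4} \sum_{i=1}^{n} i^2 = \frac{1}{4} \times \frac{n(n+1)(2n+1)}{6} = \frac{n(n+1)(2n+1)}{24}$$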

Therefore, in summary, under the null hypothesis, we have that:

$$W' = \frac{\sum_{i=1}^{n} Z_i R_i - \frac{n(n+1)}{4}}{\sqrt{\frac{n(n+1)(2n+1)}{24}}}$$

follows an approximate standard normal distribution as was to be proved.
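The mean and variance formulas can be sanity-checked numerically for small n. The sketch below is an illustration, not part of the proof, and its function name `moments_by_enumeration` is hypothetical. It computes the exact mean and variance of W over all 2^n equally likely sign assignments and compares them with n(n+1)/4 and n(n+1)(2n+1)/24.

```python
from fractions import Fraction
from itertools import product


def moments_by_enumeration(n):
    """Exact (mean, variance) of W under H0, from all 2**n sign assignments."""
    total = 2 ** n
    values = [
        sum(z * rank for z, rank in zip(signs, range(1, n + 1)))
        for signs in product((0, 1), repeat=n)
    ]
    mean = Fraction(sum(values), total)
    var = Fraction(sum(v * v for v in values), total) - mean ** 2
    return mean, var


for n in (3, 4, 5, 10):
    mean, var = moments_by_enumeration(n)
    # Compare with the closed forms derived in the proof
    assert mean == Fraction(n * (n + 1), 4)
    assert var == Fraction(n * (n + 1) * (2 * n + 1), 24)
    print(n, mean, var)
```

Exact fractions are used so that the comparison is not affected by rounding.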

Let's return to our example now to complete our work.

Example (continued)

Let Xi denote the length of a randomly selected pygmy sunfish, i = 1, 2, ..., 10. If we obtain
the following data set:

5.0 3.9 5.2 5.5 2.8 6.1 6.4 2.6 1.7 4.3

can we conclude that the median length of pygmy sunfish differs significantly from 3.7
centimeters?

Solution. Recall that we are interested in testing the null hypothesis H0: m = 3.7 against
the alternative hypothesis HA: m ≠ 3.7. The last time we worked on this example, we got
as far as determining that W = 40 for the given data set. Now, we just have to use what
we know about the distribution of W to complete our hypothesis test. Well, in this case,
with n = 10, our sample size is fairly small so we can use the exact distribution of W. The
upper and lower percentiles of the Wilcoxon signed rank statistic when n = 10 are:

Therefore, our P-value is 2 × 0.116 = 0.232. Because our P-value is large, we cannot reject
the null hypothesis. There is insufficient evidence at the 0.05 level to conclude that the
median length of pygmy sunfish differs significantly from 3.7 centimeters.
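The tail probability 0.116 can also be recovered without a table. The following Python sketch (function name `upper_tail` is hypothetical) counts how many of the 2^10 = 1024 equally likely sign assignments give a sum of positive ranks of at least 40, then doubles the resulting probability for the two-sided test.

```python
from fractions import Fraction
from itertools import product


def upper_tail(n, w_obs):
    """P(W >= w_obs) under H0, by brute-force enumeration of sign assignments."""
    hits = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(z * rank for z, rank in zip(signs, range(1, n + 1))) >= w_obs
    )
    return Fraction(hits, 2 ** n)


p_upper = upper_tail(10, 40)
p_two_sided = min(1, 2 * p_upper)
print(p_upper, round(float(p_upper), 3))  # 119/1024, about 0.116
print(round(float(p_two_sided), 3))       # about 0.232
```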

Notes

A couple of notes are worth mentioning before we take a look at another example:
(1) Our textbook authors define $W = \sum_{i=1}^{n} R_i$, with the Ri there carrying their signs, as the
sum of all of the signed ranks (negative as well as positive), as opposed to just the sum of the
positive ranks. That is perfectly fine, but not the most typical way of defining W.


(2) W is based on the ranks of the deviations from the hypothesized median m0, not on the
deviations themselves. In the above example, W = 40 even if x7 = 6.4 or 10000 (now that's
a pretty strange sunfish) because its rank would be unchanged. It is in this sense that W
protects against the effect of outliers.
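A quick way to see this is to recompute W after replacing x7 by 10000. The sketch below uses a hypothetical helper `positive_rank_sum` and assumes distinct, nonzero absolute deviations. It also prints the sample mean for contrast, since the mean does react to the outlier.

```python
from statistics import mean


def positive_rank_sum(data, m0):
    """W: sum of the ranks of |x - m0| over the observations that exceed m0."""
    ranked = sorted(data, key=lambda x: abs(x - m0))  # rank 1 = smallest |x - m0|
    return sum(rank for rank, x in enumerate(ranked, start=1) if x > m0)


sunfish = [5.0, 3.9, 5.2, 5.5, 2.8, 6.1, 6.4, 2.6, 1.7, 4.3]
with_outlier = list(sunfish)
with_outlier[6] = 10000.0  # replace x7 = 6.4 by an absurdly large value

w_original = positive_rank_sum(sunfish, 3.7)
w_outlier = positive_rank_sum(with_outlier, 3.7)
print(w_original, w_outlier)              # both 40
print(mean(sunfish), mean(with_outlier))  # the sample mean shifts dramatically
```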

Now for that last example.

Example

The median age of the onset of diabetes is thought to be 45 years. The ages at onset of a random
sample of 30 people with diabetes are:

35.5 44.5 39.8 33.3 51.4 51.3 30.5 48.9 42.1 40.3
46.8 38.0 40.1 36.8 39.3 65.4 42.6 42.8 59.8 52.4
26.2 60.9 45.6 27.1 47.3 36.6 55.6 45.1 52.2 43.5

Assuming the distribution of the age of the onset of diabetes is symmetric, is there evidence to
conclude that the median age of the onset of diabetes differs significantly from 45 years?

Solution. We are interested in testing the null hypothesis H0: m = 45 against the
alternative hypothesis HA: m ≠ 45. We can use Minitab's calculator and statistical
functions to do the dirty work for us:

Then, summing the last column, we get:


Because we have a large sample (n = 30), we can use the normal approximation to the
distribution of W. In this case, our P-value is defined as two times the probability that W ≤
200. Therefore, using a half-unit correction for continuity, our transformed signed rank
statistic is:

$$W' = \frac{200.5 - \frac{30(31)}{4}}{\sqrt{\frac{30(31)(61)}{24}}} \approx -0.658$$

Therefore, upon using a normal probability calculator (or table), we get that our P-value is:

P ≈ 2 × P(W ′ < −0.66) = 2(0.2546) ≈ 0.51


Because our P-value is large, we cannot reject the null hypothesis. There is insufficient
evidence at the 0.05 level to conclude that the median age of the onset of diabetes differs
significantly from 45 years.
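For readers without Minitab, the same calculation can be sketched with Python's standard library. The helper names `positive_rank_sum` and `normal_approx_p_value` are made up for this illustration, the code assumes no tied or zero deviations, and the half-unit continuity correction moves W toward its null mean, as in the calculation above.

```python
from math import sqrt
from statistics import NormalDist


def positive_rank_sum(data, m0):
    """W: sum of the ranks of |x - m0| over the observations that exceed m0."""
    ranked = sorted(data, key=lambda x: abs(x - m0))
    return sum(rank for rank, x in enumerate(ranked, start=1) if x > m0)


def normal_approx_p_value(w, n):
    """Two-sided P-value from the N(0, 1) approximation with continuity correction."""
    mean = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    # Move w half a unit toward the null mean before standardizing
    if w < mean:
        z = (w + 0.5 - mean) / sd
    elif w > mean:
        z = (w - 0.5 - mean) / sd
    else:
        z = 0.0
    p = 2 * NormalDist().cdf(-abs(z))
    return z, min(1.0, p)


ages = [
    35.5, 44.5, 39.8, 33.3, 51.4, 51.3, 30.5, 48.9, 42.1, 40.3,
    46.8, 38.0, 40.1, 36.8, 39.3, 65.4, 42.6, 42.8, 59.8, 52.4,
    26.2, 60.9, 45.6, 27.1, 47.3, 36.6, 55.6, 45.1, 52.2, 43.5,
]
w = positive_rank_sum(ages, 45)
z, p = normal_approx_p_value(w, len(ages))
print(w, round(z, 3), round(p, 2))
```

With these data the sketch gives W = 200, a standardized value of about −0.658, and a two-sided P-value of about 0.51.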

By the way, we can even be lazier and let Minitab do all of the calculation work for us.
Under the Stat menu, if we select Nonparametrics, and then 1-Sample Wilcoxon, we
get:

Source URL: https://onlinecourses.science.psu.edu/stat414/node/319

Links:
[1] https://onlinecourses.science.psu.edu/stat414/sites/onlinecourses.science.psu.edu.stat414/files/lesson48/ExactW_Table.pdf
