
Contemporary Clinical Trials 30 (2009) 490–496


Performance of five two-sample location tests for skewed distributions with unequal variances
Morten W. Fagerland ⁎, Leiv Sandvik
Ullevål Department of Research Administration, Oslo University Hospital, N-0407 Oslo, Norway

Article info

Article history:
Received 16 March 2009
Accepted 18 June 2009

Keywords:
Two-sample location problem
T test
Welch test
Wilcoxon–Mann–Whitney test
Yuen–Welch test
Brunner–Munzel test
Robustness
Skewness
Heteroscedasticity

Abstract

Tests for comparing the locations of two independent populations are associated with different null hypotheses, but results are often interpreted as evidence for or against equality of means or medians. We examine the appropriateness of this practice by investigating the performance of five frequently used tests: the two-sample T test, the Welch U test, the Yuen–Welch test, the Wilcoxon–Mann–Whitney test, and the Brunner–Munzel test. Under combined violations of normality and variance homogeneity, the true significance level and power of the tests depend on a complex interplay of several factors. In a wide-ranging simulation study, we consider scenarios differing in skewness, skewness heterogeneity, variance heterogeneity, sample size, and sample size ratio. We find that small differences in distribution properties can alter test performance markedly, thus confounding the effort to present simple test recommendations. Instead, we provide detailed recommendations in Appendix A. The Welch U test is recommended most frequently, but cannot be considered an omnibus test for this problem.
© 2009 Elsevier Inc. All rights reserved.

⁎ Corresponding author. Tel.: +47 41 50 46 14; fax: +47 22 11 84 79.
E-mail address: morten.fagerland@medisin.uio.no (M.W. Fagerland).

1551-7144/$ – see front matter © 2009 Elsevier Inc. All rights reserved.
doi:10.1016/j.cct.2009.06.007

1. Introduction

Comparison of locations, or central tendency, of two independent populations is common in medical research. A plethora of tests exists, and their suitability depends on the distribution of the data at hand. The choice of test decides what can be inferred from the results. This is due to the different null hypotheses these methods are designed to test.

The two-sample T test is the most common approach. This is a test of equality of means, but it is derived under the assumptions that the two distributions are normal with equal variances. A modification of this test, the Welch U test [1], is designed for unequal variances, but the assumption of normality is maintained.

When distributions deviate from normality, several approaches are available. The most common non-parametric alternative is the Wilcoxon–Mann–Whitney (WMW) test. This test is often regarded as a test of equal medians, but this is not true in general. The correct null hypothesis for this test is P(X < Y) = 0.5, where X and Y are random samples from the two populations. The results from the WMW test can be interpreted as a test of equality of medians only when the two distributions are identical except for a possible shift in location [2]. Many attempts have been made to improve the WMW test. The most prominent of these is the Brunner–Munzel test [3], which allows for tied values and unequal variances.

For markedly skewed distributions, the mean can be a poor measure of central tendency because outliers inflate its value. This can be ameliorated by removing the smallest and the largest values in the sample. If an equal number of values is removed from each tail, the mean of the resulting sample is called the trimmed mean. Comparing trimmed means can be done with the Yuen–Welch test [4], which is identical to the Welch U test when the amount of trimming is zero.

When using these tests, one must be aware that the results pertain to the tests' specific null hypotheses. A significant p-value from the WMW test or the Brunner–Munzel test, for example, can be difficult to interpret beyond noting that the observations from one of the populations tend to be smaller than the observations from the other population. According to Cliff [5], this interpretation has merit in its own right, and he suggests making inference about P(X > Y) − P(X < Y) as an alternative to means or other measures of location. In practice, however, researchers often like to make inference about the two common measures of central tendency, the mean and the median, which offer intuitive interpretations.

In medical research, the assumptions of normality and variance homogeneity are often violated [6,7]. Skewed data are common in medical research [8], and several well-known variables are markedly skewed, for example triglyceride level and sedimentation rate. If two skewed distributions have unequal locations, the variances can be expected to differ as well. Hence, medical data often exhibit a combination of skewness and unequal variances.

The purpose of this paper is to investigate to what extent the five mentioned tests can be appropriately used to compare means and medians for a wide range of skewed distributions with varying degrees of unequal variances. Even though the body of literature on two-sample location tests is considerable [9,10], a consistent and comprehensive examination of this issue has not been previously presented. For example, situations where the two distributions have unequal skewness have not been thoroughly studied, although it has been shown that both type I errors and power can be affected [7,11].

The tests will be subjected to quantified robustness criteria. For each situation, the test or tests with highest power that maintain true significance levels (p) sufficiently close to the nominal level (α) will be identified. Bradley [12] defines criteria for α-robustness as conservative with 0.9α ≤ p ≤ 1.1α and liberal with 0.5α ≤ p ≤ 1.5α. That is, the true significance level is considered sufficiently close if it lies within plus or minus 10% or 50% of the nominal significance level. We consider 50% to be too liberal for most situations, but 10%, 20%, and 40% limits will be studied. We refer to this as the 10%-, 20%-, and 40%-robustness of the tests. For a nominal significance level of 5%, this implies that we accept true significance levels (in percent) that are in the intervals [4.5, 5.5], [4.0, 6.0], and [3.0, 7.0], respectively.

2. Clinical example

Hormone therapy (HT) is associated with adverse effects such as increased risk of arterial and venous thromboembolism, and breast cancer. Eilertsen et al. [13] examined whether different HT regimens have different effects on blood coagulation by randomizing 202 healthy women to either low-dose HT, conventional-dose (high-dose) HT, tibolone, or raloxifene. The primary outcome measure was D-dimer, a marker of fibrin production and degradation which can be used to assess the effect of HT on coagulation.

After six weeks of therapy, the distribution of D-dimer was considerably skewed in the low-dose HT group and moderately skewed in the high-dose HT group (Fig. 1). Summary statistics show that the difference in means is 87, the difference in medians is 103, and the difference in 20% trimmed means is 89:

               n    Mean   Median   20% trimmed mean   Std   Skewness
Low-dose HT    47   398    307      336                284   3.1
High-dose HT   48   485    410      425                260   1.8

Fig. 1. Histogram showing the distribution of D-dimer in the low-dose HT (left) and high-dose HT (right) treatment arms after six weeks of the Eilertsen et al. trial [13]. One outlier in each group was removed.

How strong is the evidence for a difference in location between the two groups? We calculated the two-sample T test (p = 0.13), the Welch U test (p = 0.13), the Wilcoxon–Mann–Whitney test (p = 0.011), the Brunner–Munzel test (p = 0.010), and the Yuen–Welch test (p = 0.027). The highest p-value is more than ten times the smallest p-value. Which test should we trust? We return to this example in section 5.4.

3. Notation and test statistics

Consider two populations A and B. Assume that we have two independent samples: X with m observations from A, and Y with n observations from B. The estimated means and sample variances are:

\[ \bar{X} = \frac{1}{m} \sum_{i=1}^{m} X_i, \qquad \bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i, \]

and

\[ S_X^2 = \frac{1}{m-1} \sum_{i=1}^{m} (X_i - \bar{X})^2, \qquad S_Y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2. \]
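The summary measures used so far (mean, median, 20% trimmed mean, and the sample variance with divisor m − 1) can be sketched in a few lines of Python. This is only an illustration: the sample `x` is hypothetical and is not the D-dimer data, the helper names `trimmed_mean` and `summarize` are ours, and rounding the number of trimmed values down is an assumption taken from Appendix B.

```python
import math
import statistics


def trimmed_mean(sample, gamma=0.2):
    """Mean after dropping floor(gamma * size) values from each tail."""
    ordered = sorted(sample)
    g = math.floor(gamma * len(ordered))  # values trimmed per tail (assumed rounding)
    return statistics.fmean(ordered[g:len(ordered) - g])


def summarize(sample, gamma=0.2):
    """Location and spread measures of one sample."""
    return {
        "n": len(sample),
        "mean": statistics.fmean(sample),
        "median": statistics.median(sample),
        "trimmed_mean": trimmed_mean(sample, gamma),
        "variance": statistics.variance(sample),  # divisor is n - 1
    }


# Hypothetical right-skewed sample with one large value (not the trial data).
x = [2.0, 3.0, 3.0, 4.0, 5.0, 6.0, 7.0, 9.0, 12.0, 49.0]
summary_x = summarize(x)
```

For this made-up sample the single large observation pulls the mean well above the median, while the trimmed mean stays close to the median, which is the behavior the Introduction describes for skewed data.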
The two-sample T test is based on the test statistic

\[ T = \frac{\bar{X} - \bar{Y}}{S_p \sqrt{1/m + 1/n}}, \]

where $S_p$ is the pooled sample standard deviation:

\[ S_p^2 = \frac{(m-1) S_X^2 + (n-1) S_Y^2}{m + n - 2}. \]

Under the null hypothesis of equal means, the T statistic has a t-distribution with m + n − 2 degrees of freedom. It is assumed that the distributions of A and B are normal with equal variances.

Welch [1] proposed several modifications of the two-sample T test suitable for situations with unequal variances. One of these tests, the Welch U test, is available in most software packages. The appropriate test statistic is

\[ U = \frac{\bar{X} - \bar{Y}}{\sqrt{S_X^2/m + S_Y^2/n}}. \]

U is approximately t-distributed with $f_U$ degrees of freedom:

\[ f_U = \frac{\left( S_X^2/m + S_Y^2/n \right)^2}{\dfrac{S_X^4}{m^3 - m^2} + \dfrac{S_Y^4}{n^3 - n^2}}. \]

To obtain the sample trimmed means, the amount of trimming (γ) must be chosen. For general use, γ = 0.2 is a good choice [11,14]. This corresponds to removing the 20% smallest and the 20% largest observations in each sample. Let $\bar{X}_\gamma$ and $\bar{Y}_\gamma$ denote the trimmed means (the means of the samples after trimming). The Yuen–Welch test [4] statistic is given by

\[ Y = \frac{\bar{X}_\gamma - \bar{Y}_\gamma}{\sqrt{d_X + d_Y}}, \]

where $d_X$ and $d_Y$ are estimates of the squared standard errors. Calculation of $d_X$ and $d_Y$ is shown in Appendix B. Under the null hypothesis of equal trimmed means, Y follows a t-distribution with $f_Y$ degrees of freedom,

\[ f_Y = \frac{(d_X + d_Y)^2}{\dfrac{d_X^2}{h_X - 1} + \dfrac{d_Y^2}{h_Y - 1}}, \]

where $h_X$ and $h_Y$ are the number of observations left in samples X and Y after trimming.

The WMW test statistic is based on ranks and involves calculating

\[ W_X = mn + m(m+1)/2 - R_X, \]

where $R_X$ is the sum of the ranks in sample X. Under the null hypothesis that P(X < Y) = 0.5, $W_X$ is approximately normally distributed with mean mn/2 and variance mn(m + n + 1)/12. The distribution of the statistic

\[ W = \frac{W_X - mn/2}{\sqrt{mn(m + n + 1)/12}} \]

can be approximated by the standard normal distribution. By using the exact permutation distribution of ranks, an exact version of the WMW test can be constructed. Since the exact test is only practicable for small samples, we do not consider it. Throughout this paper, references to the WMW test are to the approximate version of the test.

The Brunner–Munzel test [3] is a modification of the WMW test designed to handle ties and unequal variances. Instead of associating ranks with the sample observations, midranks are computed. Midranks are equal to ranks when there are no tied values. For tied values, the midranks are the average of their ranks. The midranks of 2, 5, 5, 6, 9, 9, 9, 10, for example, are 1, 2.5, 2.5, 4, 6, 6, 6, and 8. Let $\bar{M}_X$ and $\bar{M}_Y$ be the means of the midranks associated with the samples X and Y when the data are pooled. The Brunner–Munzel test statistic is

\[ B = \frac{\bar{M}_Y - \bar{M}_X}{(m + n) \sqrt{SB_X^2/(mn^2) + SB_Y^2/(m^2 n)}}, \]

where the expressions for $SB_X^2$ and $SB_Y^2$ are given in Appendix B. The distribution of B can be approximated by a t-distribution with $f_B$ degrees of freedom:

\[ f_B = \frac{\left( SB_X^2/n + SB_Y^2/m \right)^2}{\dfrac{SB_X^4}{n^2 (m-1)} + \dfrac{SB_Y^4}{m^2 (n-1)}}. \]

4. Simulation setup

We examined the significance level and power of the tests by using computer simulations. Table 1 defines the relevant parameters of the simulation setup. The choices of these parameters are discussed below.

Two criteria were used to select sample sizes: the total sample size had to range from small to large, and the ratio of the two sample sizes had to correspond to balanced designs (m = n) and unbalanced designs (m/n > 1 and m/n < 1).

The impact of unequal variances was studied by specifying the ratio of the standard deviations (θ). The largest standard deviation was associated with the sample X of size m, and the smallest standard deviation was associated with the sample Y of size n. Values of θ = 1.0, 1.25, 1.5, 2.0, 4.0 were used. When m > n, the distribution of the largest sample had the largest variance, and when m < n, the distribution of the largest sample had the smallest variance.

Different degrees of skewness (β) were introduced by using gamma and lognormal distributions. When the two distributions were given different degrees of skewness, the distribution with the largest variance had the largest skewness. The normal distribution was used to generate symmetric distributions (β = 0).

In the power simulations, a difference in location (D) between the two distributions was introduced and standardized to make it comparable across distributions and sample sizes:

\[ D = \delta \sqrt{\sigma_A^2/m + \sigma_B^2/n}, \qquad \delta = 1, 2, 3, \]

where $\sigma_A^2$ is the variance of distribution A, and $\sigma_B^2$ is the variance of distribution B.

Table 1
Summary of the simulation setup.

Tests                         T: the two-sample T test
                              U: the Welch U test
                              Y: the Yuen–Welch test
                              W: the Wilcoxon–Mann–Whitney test
                              B: the Brunner–Munzel test
Null hypotheses               Equal means; equal medians
Difference in location        δ = 0, 1, 2, 3
Nominal significance levels   α = 0.05; 0.01
Sampling distributions        Gamma (a); lognormal (a)
Sample sizes                  (m, n) = (10, 10), (10, 25), (25, 10), (25, 25), (50, 50), (25, 100), (100, 25), (100, 100)
Standard deviation ratios     θ = 1.0, 1.25, 1.5, 2.0, 4.0
Equal skewness values         βA = βB = 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0
Unequal skewness values       (βA, βB) = (1.0, 0.5), (2.0, 0.5), (3.0, 2.5), (3.0, 2.0), (3.0, 1.0)
Replications                  10,000
Programming language          Matlab [15] with the Statistics Toolbox

(a) Normal distribution for β = 0.

5. Results and recommendations

5.1. Gamma distribution versus lognormal distribution

We generated data from two types of distributions, the gamma distribution and the lognormal distribution. Test recommendations were based on each distribution individually. In general, the robustness criteria were satisfied slightly more often when data were generated from the lognormal distribution than when data were generated from the gamma distribution. The general behavior of the tests was very similar for the two distributions, both when significance level and power were considered. We have restricted further attention to the results and recommendations based on the gamma distribution. This makes the recommendations slightly more cautious than they would have been if they were based on the lognormal distribution.

5.2. Nominal significance level of 5% versus 1%

The qualitative behavior of the tests was the same for nominal significance levels of 5% and 1%. However, the significance levels of the 1% tests were more sensitive to the effects of skewness, unequal variances, and unequal sample sizes than the significance levels of the 5% tests, thus making the 1% tests a little less type I error robust. As for power, the 1% and 5% power curves had similar shapes.

5.3. Test recommendations

For each studied situation, two criteria were used to decide if a test could be recommended. First, the true significance level (p) of the test had to be close to the nominal significance level (α). Closeness was defined in three levels: p within 10% of α, p within 20% of α, and p within 40% of α. As considerably less robustness was observed for α = 0.01, we felt that demanding that both the α = 0.01 and the α = 0.05 tests satisfy the robustness criteria was too strict, especially since α = 0.05 is by far the most used in medical publications. Therefore, the robustness criteria were based on α = 0.05 only. Second, the power of the test had to be higher than the power of the other tests. To allow for the inaccuracy of results from simulation, a definition of power equivalence was devised. For each test with each distribution and sample size combination, the three power values corresponding to the three introduced differences in distribution location (δ = 1, 2, 3) were summed. Two tests were considered power equivalent if the smallest power sum deviated less than 2.5% from the largest power sum.

Due to the large number of simulated situations, a comprehensive display of the recommendations is given in Appendix A. Two examples are given in Table 2: m = 100, n = 25 with equal distribution skewness, and m = n = 50 with unequal distribution skewness. In both cases, the robustness level is 20%.

It is clear from the recommendations that simple rules about which test should be used in which situation cannot be accurately stated. Each of the factors under consideration in this study (the total sample size, the sample size ratio, the standard deviation ratio, skewness, and skewness heterogeneity) has an effect on type I errors or power or both of some or all the tests. The net effect of these factors is often difficult to predict. We strongly recommend that the relevant tables in Appendix A be consulted before the choice of test is made. Nonetheless, a superficial summary of the recommendations is shown in Table 3.

There are situations where none of the tests can be recommended. Transformation of the data by taking logarithms or square roots may reduce skewness and variance heterogeneity, but there are some problems with this approach [16–18]. First, the exact effect of the transformation on skewness and variance is somewhat unpredictable. Two samples of similar shape may have skewness and variance altered differently, and differences that did not exist between the original samples may be introduced between the transformed samples. Second, the results from tests on transformed data are valid only on the transformed scale, and interpreting the results back onto the original scale can be troublesome. As a general rule, when using transformations of any kind, the transformed samples should be examined with the same scrutiny as the original samples. Specifically, signs of unequal variances and skewness distributed unevenly between the two samples should be given particular attention.
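The two parametric statistics of Section 3 and the two recommendation criteria of Section 5.3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Matlab code: the function names are ours, the samples are made up, p-values are omitted because the Python standard library has no t-distribution, and "deviated less than 2.5% from the largest power sum" is read here as a relative difference measured against the larger sum.

```python
import math
import statistics


def pooled_t(x, y):
    """Two-sample T statistic and its degrees of freedom m + n - 2."""
    m, n = len(x), len(y)
    sp2 = ((m - 1) * statistics.variance(x)
           + (n - 1) * statistics.variance(y)) / (m + n - 2)
    t = (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(sp2 * (1 / m + 1 / n))
    return t, m + n - 2


def welch_u(x, y):
    """Welch U statistic and its approximate degrees of freedom f_U."""
    m, n = len(x), len(y)
    vx = statistics.variance(x) / m  # S_X^2 / m
    vy = statistics.variance(y) / n  # S_Y^2 / n
    u = (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(vx + vy)
    f_u = (vx + vy) ** 2 / (vx ** 2 / (m - 1) + vy ** 2 / (n - 1))
    return u, f_u


def is_robust(p_true, alpha=0.05, limit=0.20):
    """True if the estimated significance level lies within +/- limit of alpha."""
    return (1 - limit) * alpha <= p_true <= (1 + limit) * alpha


def power_equivalent(powers_a, powers_b, tolerance=0.025):
    """Compare power sums over delta = 1, 2, 3 (relative difference is an assumption)."""
    low, high = sorted([sum(powers_a), sum(powers_b)])
    return (high - low) / high < tolerance


# Made-up samples for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, t_df = pooled_t(x, y)
u_stat, u_df = welch_u(x, y)
```

With equal sample sizes the T and U statistics coincide and only the degrees of freedom differ (8 versus roughly 5.9 here), which shows how the Welch correction acts through the reference distribution rather than the statistic itself in balanced designs.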

Table 2
Tests with highest power that satisfy 4.0 ≤ p ≤ 6.0 for α = 0.05.

m = 100, n = 25
Robustness level: 20%

             H0: equal means                        H0: equal medians
Std. ratio   0.0  0.5  1.0  1.5  2.0  2.5  3.0      0.0  0.5  1.0  1.5  2.0  2.5  3.0    (βA = βB)
4.00         U    U    U    U    U    U    U        U    Y    Y    Y    –    –    –
2.00         U    U    U    U    U    U    U        U    Y    Y    –    –    –    W
1.50         UB   U    U    U    U    U    U        UB   U    Y    Y    Y    –    W
1.25         UB   U    U    U    U    Y    Y        UB   U    YB   B    B    B    W
1.00         T    TUW  W    W    W    W    W        TU   TUW  W    W    W    B    T

m = n = 50
Robustness level: 20%

                   H0: equal means              H0: equal medians
Skewness dist. A   1.0  2.0  3.0  3.0  3.0      1.0  2.0  3.0  3.0  3.0
Skewness dist. B   0.5  0.5  2.5  2.0  1.0      0.5  0.5  2.5  2.0  1.0
Std. ratio
4.00               TU   –    –    –    –        Y    YB   Y    Y    –
2.00               TU   TU   –    U    –        WB   Y    Y    Y    –
1.50               TU   TU   TU   TU   TU       WB   Y    WB   Y    Y
1.25               TU   TU   TU   TU   TU       WB   Y    Y    Y    Y
1.00               TU   TU   Y    TU   B        WB   Y    Y    Y    Y

p is the true significance level (in percent) and α is the nominal significance level. βA is the skewness of distribution A and βB is the skewness of distribution B. An entry of "–" means that no test satisfies the robustness criterion. The data were generated from normal distributions (skewness = 0) and gamma distributions (skewness > 0).
T = the two-sample T test.
U = the Welch U test.
Y = the Yuen–Welch test.
W = the Wilcoxon–Mann–Whitney test.
B = the Brunner–Munzel test.

5.4. The clinical example revisited

In section 2, we compared the locations of D-dimer in the low-dose HT and the high-dose HT treatment arms after six weeks of the Eilertsen et al. trial [13]. We obtained widely different p-values with our five tests. The sample sizes in the two groups were 47 and 48, the standard deviation ratio was 284/260 = 1.1, and the sample skewness was 3.1 and 1.8. For distributions with unequal skewness, Table 13 in Appendix A details recommendations for a sample size of 50 in each group. An excerpt is given in the lower part of Table 2. For distributions similar to the ones in our example, the two-sample T test and the Welch U test are the most powerful tests of means, and the Yuen–Welch test is the most powerful test of medians. All three tests are type I error robust at the 10% level. As the differences in means and trimmed means are similar (87 and 89), the smaller p-value for the Yuen–Welch test reflects the smaller variance estimate this test uses due to trimming of the largest observations.

To conclude this example, there is some evidence (Yuen–Welch test: p = 0.027) that there is a difference in 20% trimmed means, but no evidence of a difference in means (T test/Welch: p = 0.13). The Wilcoxon–Mann–Whitney and the Brunner–Munzel tests are not recommended in this situation. Because the Yuen–Welch test is robust for testing medians, and because the trimmed means are close to the medians, any inference drawn about the trimmed means can be applied to the medians as well.

Table 3
Brief summary of the recommendations.

         Comparing means                  Comparing medians
         θ = 1.0    θ > 1.0
m = n    W,B        T,U                   T,U,W,B, sometimes Y (a)
m < n    B          U or no test (b)      U,B, sometimes Y (c)
m > n    W          U                     U,Y,W,B

θ is the standard deviation ratio and β is the skewness. m and n are the sample sizes. When m < n, the smallest sample has the largest variance. When m > n, the largest sample has the largest variance. T = the two-sample T test; U = the Welch U test; Y = the Yuen–Welch test; W = the Wilcoxon–Mann–Whitney test; B = the Brunner–Munzel test.
(a) Y for combinations of large θs or large βs or both.
(b) U when β ≤ 1.0, else no test.
(c) Y for large sample sizes.

6. Discussion

Comparing the locations of two skewed populations is fraught with difficulties. Unless the degree of skewness is small, different measures of central tendency (for example the mean, the median, and the 20% trimmed mean) can differ markedly in numeric value. If the variances are unequal as well, making inferences about equality of two different measures can lead to opposite conclusions. In such cases, it is important to accurately define the population differences of interest, and to interpret test results in strict adherence to the tests' null hypotheses.

The aim of this paper was to assess the ability of some widely used tests to compare means and medians for a wide range of skewed distributions with unequal variances. Our recommendations are detailed in Appendix A. We briefly review the most important results:

• The performance of the tests depends on many factors, most notably variance heterogeneity, skewness and skewness heterogeneity, the sample size ratio, and the total sample size.
• Small distribution changes can lead to large changes in test performance.
• Skewness heterogeneity had a slight negative effect on the rank-based tests, but almost no effect on the parametric tests.
• For the simulated settings, the Welch U test is recommended most frequently.
• The rank-based methods are sensitive to departures from the pure shift model.
• For variables with skewed distributions, the 20% trimmed mean is closer to the median than to the mean.
• The five examined tests performed similarly on samples drawn from gamma distributions and on samples drawn from lognormal distributions.

The advantage of the Welch test demonstrated in our study is in agreement with previous studies, and several authors recommend the Welch test for almost all situations [19–22]. We agree that the Welch test is the best test in general, but to select the most powerful robust test, a careful consideration of the properties of the data is recommended. As an aid in this endeavor, Appendix A should be helpful.

The five tests examined in this paper are but a small part of the large set of tests available for the two-sample location problem. However, because of their widespread use, these five tests merit special attention. Several alternative methods are presented in the two books by Wilcox [11,23], including methods using robust measures of location, rank-based methods, permutation tests, and bootstrap methods. One of the main obstacles to contemporary methods is availability in commercial software. This problem is easily overcome by using the free software R [24], for which a large number of functions exist to perform modern methods [11,23].

Our simulation study is limited in scope by two main factors. First, we have employed two families of distributions, the gamma and the lognormal. Although very similar results were observed for the two distributions, we cannot rule out the possibility that other types of distributions may produce conspicuously different results. Also, extreme observations have a large impact on the T test and the Welch test. A realistic modeling of extreme observations is difficult, and distributions other than the gamma and the lognormal are perhaps better suited. Second, the effect of kurtosis has not been assessed. There is some evidence that kurtosis has only a minor effect on type I error rates [9,16,25], but that power may be affected [26]. For gamma and lognormal distributions, skewness and kurtosis are not independent parameters [27]. Thus, for the skewed distributions studied in this paper, the effect of kurtosis cannot be separated from the effect of skewness.

We have quantified robustness by defining 10%, 20%, and 40% limits to the deviation of the true significance level from the nominal level. We consider a 10% deviation to be acceptable in almost any practical application and a 20% deviation to be sufficiently precise for most situations. However, if a test is robust at the 40% level only, obtained p-values should be interpreted with due caution.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.cct.2009.06.007.

Appendix B

Appendix B.1. Estimates of the squared standard errors in the Yuen–Welch test

Let $g_X = \gamma m$ and $g_Y = \gamma n$ be the number of observations (rounded down) trimmed from each tail in X and Y. Denote the number of remaining observations in the trimmed samples by $h_X = m - 2g_X$ and $h_Y = n - 2g_Y$. The squared standard errors are based on the sample Winsorized variances. Denote the sorted observations in X by $X_{(1)} \le X_{(2)} \le \cdots \le X_{(m)}$. The Winsorized sample of X,

\[ W_X = W_X^1, W_X^2, \ldots, W_X^m, \]

is found by setting $W_X = X$ and replacing each of the $g_X$ smallest observations, $X_{(1)}, \ldots, X_{(g_X)}$, with $X_{(g_X + 1)}$, and replacing each of the $g_X$ largest observations, $X_{(m - g_X + 1)}, \ldots, X_{(m)}$, with $X_{(m - g_X)}$. The Winsorized sample of Y ($W_Y$) is found in the same way. Denote the Winsorized sample means by $\bar{W}_X$ and $\bar{W}_Y$. The sample Winsorized variances are

\[ sw_X^2 = \frac{1}{m-1} \sum_{i=1}^{m} (W_X^i - \bar{W}_X)^2 \quad \text{and} \quad sw_Y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (W_Y^i - \bar{W}_Y)^2. \]

The squared standard errors in the Yuen–Welch test are

\[ d_X = \frac{sw_X^2 (m-1)}{h_X (h_X - 1)} \quad \text{and} \quad d_Y = \frac{sw_Y^2 (n-1)}{h_Y (h_Y - 1)}. \]

Further details can be found in [4,11].

Appendix B.2. Variance estimates in the Brunner–Munzel test

Following the notation in section 3, $M_X = M_X^1, M_X^2, \ldots, M_X^m$ and $M_Y = M_Y^1, M_Y^2, \ldots, M_Y^n$ are the midranks of X and Y based on pooling all observations. $\bar{M}_X$ and $\bar{M}_Y$ are the means of the pooled midranks. Midranks can also be computed within each sample. Denote these by $V_X = V_X^1, V_X^2, \ldots, V_X^m$ and $V_Y = V_Y^1, V_Y^2, \ldots, V_Y^n$. The variance estimates in the Brunner–Munzel test are

\[ SB_X^2 = \frac{1}{m-1} \sum_{i=1}^{m} \left( M_X^i - V_X^i - \bar{M}_X + \frac{m+1}{2} \right)^2 \]

and

\[ SB_Y^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( M_Y^i - V_Y^i - \bar{M}_Y + \frac{n+1}{2} \right)^2. \]

For further details, see [3,11].
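The rank-based and trimmed statistics of Section 3, together with the Appendix B estimates, can be sketched in plain Python as below. This is a hedged illustration rather than the authors' implementation: the function names and the two small samples are ours, p-values are left out, and using midranks inside the WMW statistic when ties occur is an assumption (the paper describes that statistic in terms of ranks only).

```python
import math
import statistics


def midranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        i = j + 1
    return ranks


def wmw_w(x, y):
    """Standardized WMW statistic W (normal approximation, no tie correction)."""
    m, n = len(x), len(y)
    r_x = sum(midranks(list(x) + list(y))[:m])
    w_x = m * n + m * (m + 1) / 2 - r_x
    return (w_x - m * n / 2) / math.sqrt(m * n * (m + n + 1) / 12)


def _yuen_parts(sample, gamma):
    """Trimmed mean, squared standard error d, and remaining count h."""
    s = sorted(sample)
    size = len(s)
    g = math.floor(gamma * size)
    h = size - 2 * g
    core = s[g:size - g]
    winsorized = [s[g]] * g + core + [s[size - g - 1]] * g
    d = statistics.variance(winsorized) * (size - 1) / (h * (h - 1))
    return statistics.fmean(core), d, h


def yuen_welch(x, y, gamma=0.2):
    """Yuen-Welch statistic Y and degrees of freedom f_Y."""
    tx, dx, hx = _yuen_parts(x, gamma)
    ty, dy, hy = _yuen_parts(y, gamma)
    stat = (tx - ty) / math.sqrt(dx + dy)
    f_y = (dx + dy) ** 2 / (dx ** 2 / (hx - 1) + dy ** 2 / (hy - 1))
    return stat, f_y


def brunner_munzel(x, y):
    """Brunner-Munzel statistic B and degrees of freedom f_B."""
    m, n = len(x), len(y)
    pooled = midranks(list(x) + list(y))
    mx, my = pooled[:m], pooled[m:]          # pooled midranks M_X, M_Y
    vx, vy = midranks(list(x)), midranks(list(y))  # within-sample midranks
    mbx, mby = statistics.fmean(mx), statistics.fmean(my)
    sbx = sum((a - b - mbx + (m + 1) / 2) ** 2 for a, b in zip(mx, vx)) / (m - 1)
    sby = sum((a - b - mby + (n + 1) / 2) ** 2 for a, b in zip(my, vy)) / (n - 1)
    b_stat = (mby - mbx) / ((m + n) * math.sqrt(sbx / (m * n ** 2) + sby / (m ** 2 * n)))
    f_b = (sbx / n + sby / m) ** 2 / (
        sbx ** 2 / (n ** 2 * (m - 1)) + sby ** 2 / (m ** 2 * (n - 1)))
    return b_stat, f_b


# Made-up samples for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
example_midranks = midranks([2, 5, 5, 6, 9, 9, 9, 10])
w_stat = wmw_w(x, y)
yw_stat, yw_df = yuen_welch(x, y)
yw0_stat, yw0_df = yuen_welch(x, y, gamma=0.0)
bm_stat, bm_df = brunner_munzel(x, y)
```

Calling `yuen_welch` with `gamma=0.0` reproduces the Welch U statistic and its degrees of freedom, matching the remark in the Introduction that the two tests coincide when nothing is trimmed. Note the sign conventions: W and B are positive when the X observations tend to be smaller, whereas the trimmed-mean statistic is negative in that case.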

References

[1] Welch BL. The significance of the difference between two means when the population variances are unequal. Biometrika 1937;29:350–62.
[2] Lehmann EL. Nonparametrics: statistical methods based on ranks. Upper Saddle River, NJ: Prentice-Hall, Inc.; 1975.
[3] Brunner E, Munzel U. The nonparametric Behrens–Fisher problem: asymptotic theory and a small-sample approximation. Biom J 2000;42:17–25.
[4] Yuen KK. The two-sample trimmed t for unequal population variances. Biometrika 1974;61:165–70.
[5] Cliff N. Dominance statistics: ordinal analyses to answer ordinal questions. Psychol Bull 1993;114:494–509.
[6] Wilcox RR. Comparing the means of two independent groups. Biom J 1990;32:771–80.
[7] Wilcox RR, Keselman HJ. Modern robust data analysis methods: measures of central tendency. Psychol Methods 2003;8:254–74.
[8] Bridge PD, Sawilowsky SS. Increasing physicians' awareness of the impact of statistics on research outcomes: comparative power of the t-test and Wilcoxon rank-sum test in small samples applied research. J Clin Epidemiol 1999;52:229–35.
[9] Penfield DA. Choosing a two-sample location test. J Exp Educ 1994;62:343–60.
[10] Stonehouse JM, Forrester GJ. Robustness of the t and U tests under combined assumption violations. J Appl Stat 1998;25:63–74.
[11] Wilcox RR. Introduction to robust estimation and hypothesis testing. 2nd ed. San Diego, CA: Academic Press; 2005.
[12] Bradley JV. Robustness? Br J Math Stat Psychol 1978;31:144–52.
[13] Eilertsen AL, Qvigstad E, Andersen TO, Sandvik L, Sandset PM. Conventional-dose hormone therapy (HT) and tibolone, but not low-dose HT and raloxifene, increase markers of activated coagulation. Maturitas 2006;55:278–87.
[14] Wilcox RR. Some results on the Tukey–McLaughlin and Yuen methods for trimmed means when distributions are skewed. Biom J 1994;3:259–73.
[15] Matlab 7. Natick, MA: The MathWorks, Inc.; 2005.
[16] Pearson ES, Please NW. Relation between the shape of population distribution and the robustness of four simple test statistics. Biometrika 1975;62:223–41.
[17] Sutton CD. Computer-intensive methods for tests about the mean of an asymmetrical distribution. J Am Stat Assoc 1993;88:802–10.
[18] Grissom RJ. Heterogeneity of variance in clinical data. J Consult Clin Psychol 2000;68:155–65.
[19] Best DJ, Rayner JCW. Welch's approximate solution for the Behrens–Fisher problem. Technometrics 1987;29(2):205–10.
[20] Gans DJ. Use of a preliminary test in comparing two sample means. Commun Stat Simul C 1981;B10(2):163–74.
[21] Zimmerman DW. A note on preliminary tests of equality of variances. Br J Math Stat Psychol 2004;57:173–81.
[22] Ruxton GD. The unequal variance t-test is an underused alternative to Student's t-test and the Mann–Whitney U test. Behav Ecol 2006;17(4):688–90.
[23] Wilcox RR. Applying contemporary statistical techniques. San Diego, CA: Academic Press; 2003.
[24] The R project for statistical computing. [http://www.r-project.org/].
[25] Cressie NAC, Whitford HJ. How to use the two sample t-test. Biom J 1986;28(2):131–48.
[26] Wilcox RR. ANOVA: the practical importance of heteroscedastic methods, using trimmed means versus means, and designing simulation studies. Br J Math Stat Psychol 1995;48:99–114.
[27] Evans M, Hastings N, Peacock B. Statistical distributions. 3rd ed. New York, NY: John Wiley & Sons, Inc.; 2000.
