British Journal of Mathematical and Statistical Psychology (2007), 60, 217–244
© 2007 The British Psychological Society
www.bpsjournals.co.uk
Reproduction in any form (including the internet) is prohibited without prior permission from the Society
1. Introduction
The independent groups t test is one of the most widely used statistical tests. Given this,
it is amazing that after half a century statisticians are still debating how best to compare
two group means. Unfortunately, this lively debate rarely finds its way into statistical
methodology books, the authors of which frequently advocate a strategy for comparing
two group means that some of the literature suggests should not be used. In this paper
we further evaluate this strategy, which we call the ‘conditional decision rule’. This rule
states that the choice of which of two t tests to use, both of which are printed by
popular statistics packages, should be based on the outcome of a test of variance
equality. We compare the Type I error rate of the conditional decision rule in a variety of
situations and also compare it to several alternative methods of comparing two means,
including some based on resampling methods of inference. Although there have been
several studies of the performance of the conditional decision rule that have led their
respective investigators to argue against it, existing studies are limited in scope. By using
four different variance equality tests and sampling from 49 different distributions, the
simulations we present here are arguably the most comprehensive and provide a strong
means of extending and testing the generalizability of previous recommendations
regarding the usefulness (or lack thereof) of the conditional decision rule.

* Correspondence should be addressed to Andrew F. Hayes, School of Communication, Ohio State University, 3016 Derby Hall, 154 N. Oval Mall, Columbus, OH 43210, USA (e-mail: hayes.338@osu.edu).
DOI: 10.1348/000711005X62576
Zumbo, 1992). Therefore, there are some occasions when the pooled variance t test
should be preferred, and others when the separate variance t test should be used.
provides better (or worse) Type I error protection, depending on the test of variance
equality or the shape of the population sampled.
Keppel, 1991), we focus on these two tests as the ‘robust’ variance equality tests in the
simulations reported below.
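The conditional decision rule under evaluation can be sketched in a few lines. The following is a hypothetical illustration using SciPy's `levene` and `ttest_ind` rather than the GAUSS code used in the paper, and the preliminary test's alpha level is an arbitrary choice:

```python
import numpy as np
from scipy import stats

def conditional_t_test(y1, y2, alpha_var=0.05):
    """Pick pooled vs. separate variance t test from a Levene pre-test."""
    _, p_var = stats.levene(y1, y2)      # preliminary test of variance equality
    equal_var = bool(p_var > alpha_var)  # fail to reject -> pool the variances
    t, p = stats.ttest_ind(y1, y2, equal_var=equal_var)
    return t, p, equal_var

rng = np.random.default_rng(1)
y1 = rng.normal(0, 1, 24)
y2 = rng.normal(0, 3, 24)                # heterogeneous population variances
t, p, pooled = conditional_t_test(y1, y2)
```

Whichever variance equality test is substituted for Levene's in the pre-test step, the structure of the rule is the same.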
2. Method
To assess the Type I error characteristics of conditional compared to unconditional tests,
a series of Monte Carlo simulations was conducted using the GAUSS program (Aptech
Systems Inc., 1996). In all, 4,116 different simulations were conducted, manipulating a
number of factors known to influence the performance of the tests examined here.
These factors included total sample size (N: 48, 80, 160), group 1 to group 2 sample size
ratio (n1:n2: 1:7, 1:3, 2:3, 1:1), group 1 to group 2 population variance ratio (θ: 0.10,
0.25, 0.50, 1, 2, 4, 10), population skewness, and population kurtosis.
GAUSS rndu function. Adding these six distributions to the 43 populations generated
with the Fleishman power method yielded 49 different populations for the simulation. These 49
combinations of population skewness and kurtosis are listed in the Appendix along with
the coefficients used when generating samples with the Fleishman power method.
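The Fleishman power method referred to above transforms a standard normal deviate Z into Y = a + bZ + cZ² + dZ³, with the constants chosen to hit a target skewness and kurtosis. A minimal sketch follows; the coefficient values below are placeholders for illustration only, not the values listed in the Appendix:

```python
import numpy as np

def fleishman_sample(n, b, c, d, rng):
    """Y = a + b*Z + c*Z**2 + d*Z**3 with a = -c, so that E[Y] = 0."""
    z = rng.standard_normal(n)
    return -c + b * z + c * z**2 + d * z**3

rng = np.random.default_rng(0)
# Placeholder coefficients (NOT the Appendix values); c > 0 induces
# positive skewness in the generated population.
y = fleishman_sample(10_000, b=0.9, c=0.2, d=0.03, rng=rng)
```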
1,000. The p-value for the separate variance t statistic was computed using the same logic. These tests are
denoted throughout as unconditional permutation pooled variance (UPP) and
unconditional permutation separate variance (UPS).
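The permutation logic just described can be sketched as follows. This is a hypothetical Python version based on random relabelings of the pooled observations; passing `equal_var=False` gives the separate variance (UPS) analogue:

```python
import numpy as np
from scipy import stats

def permutation_p(y1, y2, n_perm=1000, equal_var=True, rng=None):
    """Two-sided permutation p-value for the two-sample t statistic."""
    if rng is None:
        rng = np.random.default_rng()
    n1 = len(y1)
    pooled = np.concatenate([y1, y2])
    t_obs, _ = stats.ttest_ind(y1, y2, equal_var=equal_var)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)              # one random relabeling of the data
        t_perm, _ = stats.ttest_ind(pooled[:n1], pooled[n1:], equal_var=equal_var)
        if abs(t_perm) >= abs(t_obs):
            hits += 1
    return hits / n_perm                 # UPP if equal_var=True, UPS if False

rng = np.random.default_rng(3)
y1 = rng.normal(0, 1, 20)
y2 = rng.normal(5, 1, 20)                # large true mean difference
p_upp = permutation_p(y1, y2, rng=rng)
```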
The final unconditional test we used was based on a permutation procedure
introduced by Mielke and Berry (1994). They proposed a permutation test using the
average of the n_j(n_j - 1)/2 within-group pairwise distances between values of Y_i in
group j. This average, ξ_j, is computed in each group, and the test statistic is a sample
size weighted sum of the average distances across the j groups:
δ = (n_1ξ_1 + n_2ξ_2)/(n_1 + n_2). We used the squared Euclidean distance because this
directly translates into a test of the equality of means (see Mielke & Berry, 1994). As with
other kinds of permutation tests, one can either generate an exact null distribution of
the test statistic by generating all possible permutations of the n1 þ n2 values of Yi
across groups or generate an approximate null distribution based on random sampling
from all possible permutations. But Mielke, Berry, and Johnson (1976) developed a
method, which we used here, for approximating the p-value using the first three moments
of the null distribution and the Pearson Type III distribution. Technical details of this
complicated method can be found in Mielke and Berry (1994). We refer to this throughout as
the UMB test (unconditional Mielke–Berry test).
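The statistic δ can be computed directly. The sketch below uses squared Euclidean distance, as in the paper, but approximates the p-value by random permutation rather than by the Pearson Type III moment method; that substitution is a simplifying assumption:

```python
import numpy as np

def avg_within_distance(y):
    """xi_j: average squared Euclidean distance over within-group pairs."""
    n = len(y)
    diffs = y[:, None] - y[None, :]
    return np.sum(diffs ** 2) / (n * (n - 1))   # each pair is counted twice

def mrpp_delta(y1, y2):
    """delta = (n1*xi_1 + n2*xi_2) / (n1 + n2)."""
    n1, n2 = len(y1), len(y2)
    return (n1 * avg_within_distance(y1) + n2 * avg_within_distance(y2)) / (n1 + n2)

def mrpp_p(y1, y2, n_perm=2000, rng=None):
    """Small delta = tight groups, so count permuted deltas at or below it."""
    if rng is None:
        rng = np.random.default_rng()
    pooled = np.concatenate([y1, y2])
    d_obs = mrpp_delta(y1, y2)
    n1 = len(y1)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mrpp_delta(pooled[:n1], pooled[n1:]) <= d_obs:
            hits += 1
    return hits / n_perm

rng = np.random.default_rng(4)
y1 = rng.normal(0, 1, 15)
y2 = rng.normal(8, 1, 15)                # well-separated group means
p_umb = mrpp_p(y1, y2, rng=rng)
```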
For each combination of sample size, sample size ratio, variance ratio, and
population shape, the procedures described here were repeated 2000 times and the
proportion of rejections of the equal means null hypothesis over the 2000 replications
was recorded as an estimate of the Type I error rate.
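A single cell of this design can be sketched as follows. This is an illustrative Python reconstruction, not the original GAUSS code; the normal populations and seed are assumptions of the sketch, and the rejection decision is shown here for the separate variance (Welch) t test:

```python
import numpy as np
from scipy import stats

def type1_rate(n1, n2, var_ratio, n_reps=2000, alpha=0.05, seed=0):
    """Estimated Type I error: rejection rate under the equal-means null."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_reps):
        y1 = rng.normal(0.0, np.sqrt(var_ratio), n1)  # group 1 variance = theta
        y2 = rng.normal(0.0, 1.0, n2)                 # group 2 variance = 1
        _, p = stats.ttest_ind(y1, y2, equal_var=False)
        if p < alpha:
            rejections += 1
    return rejections / n_reps

# One cell: N = 80, n1:n2 = 1:7, theta = 10 (smaller group more variable).
rate = type1_rate(n1=10, n2=70, var_ratio=10)
```

Substituting another testing procedure for the Welch test changes only the rejection step; the surrounding replication loop is the same for every cell.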
3. Results
When describing the results of simulation studies, the typical practice is to provide
tables of rejection rates for all combinations of the factors manipulated and to make
generalizations about the performance of the test based on how it performs in those
conditions simulated. However, it would be nearly impossible for us to succinctly
summarize the performance of the tests we examined here with a few tables because of
the sheer number of unique combinations of variance inequality, sample size inequality,
sample size, skewness, and kurtosis we generated in the simulations.
Instead, we first set aside the many factors that we manipulated and ask about the
relative performance of the methods overall, collapsing across one or more of the
manipulated factors. This allows us to determine
whether any test or tests tend to do especially well across situations, and whether one of
those better tests is a conditional test. Having answered the question as to how
conditional and unconditional tests perform relative to each other across conditions, we
then more specifically examine the conditions in which some tests are superior to
others, as well as the conditions in which many or most fail.
We first examine the Type I error rates of the methods by ignoring the factors we
manipulated, answering the question of which test is best overall. A researcher may
embark on a study and have little information a priori as to how equal the group sizes
and variances are likely to be or how the data will be distributed. What test or tests can
be trusted to produce a valid test across the majority of situations that a researcher might
confront across a series of studies? And is one or more of those tests a conditional test?
To answer this question, we computed the mean Type I error rate for each testing
procedure across the 4,116 simulated conditions. We also assessed the proportion of
conditions simulated in which the test was valid. We define a valid test in three ways.
Formally, a test is valid at α = .05 if the Type I error rate of the test is no greater than .05.
Because the estimated Type I error rate in any condition is subject to sampling
variability, we define a test as valid in a condition if its estimated Type I error rate in that
condition was no greater than .060. Each Type I error rate was estimated based on 2,000
replications. If the true Type I error rate is .05 or less, then one would expect the estimate to
exceed .060 in 2,000 replications no more than 5% of the time by chance. The second
definition of valid we will call 'practical validity'. Others have argued that a test can be
considered valid if the Type I error rate is sufficiently close to α, where 'sufficiently
close' is defined subjectively. We use .075 as our definition of 'sufficiently close'
(Bradley, 1978). This is a subjective rule, but it reflects the fact that few would worry
about a test rejecting at a rate slightly higher than .05; if the rejection rate goes much above
.075, however, the test does not control the Type I error rate as well as desired. This
definition, however, ignores that a test may be valid but highly conservative. Thus, we
also use a third definition of validity, defining a test as valid if its Type I error rate is close
to .05 in either direction, defining ‘close’ as between .025 and .075 (Bradley, 1978).
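The 5% claim behind the .060 cut-off can be checked with a direct binomial calculation: under a true rate of .05, exceeding an estimated rate of .060 means observing more than 120 rejections out of 2,000.

```python
from scipy.stats import binom

# P(more than 120 rejections in 2,000 replications | true rate .05),
# i.e. the chance a genuinely valid test is flagged by the .060 cut-off.
p_exceed = binom.sf(120, 2000, 0.05)
```

The resulting probability is comfortably below .05, consistent with the justification given above.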
A third means of gauging the performance of a test across conditions is to assess the
variability in the Type I error rate. A good test would be consistent in its false rejection
rate across conditions. In other words, better tests should have smaller variability in the
Type I error rate across conditions. So we computed the standard deviation of the Type I
error rates across conditions and also computed percentile scores for the Type I error
rates to examine the variability in Type I errors across conditions.
These indices of test performance can be found in Table 1. Examining first the mean
(M) Type I error rate, three tests stand out from the others. Of the four conditional tests,
CO had a Type I error rate closest to .05 on average. In addition, CO was valid a greater
proportion of the time compared to the other conditional tests. However, two
unconditional tests performed as well or better. UBS and UMB had average
Type I error rates no greater than .05, and both produced validity
percentages that exceeded CO's. Indeed, both of these unconditional tests were
practically valid in over 97% of the conditions simulated.
A close examination of the distribution of the Type I error rates shows that UBS and
UMB were superior to CO as well as any of the other unconditional tests. Notice that
10% of the estimated Type I error rates for CO exceeded .069, with the other
unconditional tests having even higher 90th percentiles. In contrast, 90% of the Type I
error rates were no greater than .058 when using the UBS or UMB. But looking even
more closely at the entire distribution of Type I error rates, it is clear that UBS is superior
to all methods on the grounds that the Type I error rates tend to be less extreme, more
tightly clustered, and closer to .05. Less than 10% of the error rates were less than .042
using UBS, compared to .026 for CO. UMB, however, was quite conservative, with 50%
of the Type I error rates less than .013 and yielding a Type I error rate between .025 and
.075 in only 36.2% of the conditions simulated. UBS was also the least variable of all the
tests in its Type I error rates, which not only clustered more tightly around its mean
error rate than did those of the other tests but also were usually quite close to
.05 (in 97% of conditions).
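For reference, the UBS procedure, bootstrapping the null sampling distribution of the separate variance t statistic, can be sketched as below. The within-group mean centering and the resample count are illustrative assumptions, not necessarily the authors' exact implementation:

```python
import numpy as np
from scipy import stats

def bootstrap_welch_p(y1, y2, n_boot=1000, rng=None):
    """Bootstrap-t: resample groups centered at their own means (null holds)."""
    if rng is None:
        rng = np.random.default_rng()
    t_obs, _ = stats.ttest_ind(y1, y2, equal_var=False)
    c1, c2 = y1 - y1.mean(), y2 - y2.mean()   # impose the equal-means null
    hits = 0
    for _ in range(n_boot):
        b1 = rng.choice(c1, size=len(c1), replace=True)
        b2 = rng.choice(c2, size=len(c2), replace=True)
        t_b, _ = stats.ttest_ind(b1, b2, equal_var=False)
        if abs(t_b) >= abs(t_obs):
            hits += 1
    return hits / n_boot

rng = np.random.default_rng(5)
y1 = rng.normal(0, 1, 30)
y2 = rng.normal(3, 1, 30)                     # real mean difference
p_ubs = bootstrap_welch_p(y1, y2, rng=rng)
```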
Although some of the differences between the overall performance of these methods
are small and subtle, it is fairly clear looking at Table 1 that the conditional decision rule,
regardless of which variance equality test is used, does not control the Type I error rate
any better than several unconditional alternatives. In fact, a few of the unconditional
alternatives tended to do better. Of the unconditional alternatives, UBS and UMB
maintained Type I error rates closest to or less than .05 on average, were valid more
Table 1. Type I error rates and validities across all 4,116 conditions

                                    Percentiles of Type I error rate
Test                           M     10th  25th  50th  75th  90th    SD     % valid¹  % valid²  % valid³

Unconditional tests
Pooled t (UPT)               0.075   .006  .027  .052  .092  .204   0.078    65.7      71.2      48.0
Separate t (UST)             0.053   .044  .048  .051  .056  .061   0.011    89.0      97.1      97.1
Bootstrap pooled t (UBP)     0.059   .044  .049  .053  .054  .062   0.017    71.1      87.9      87.9
Bootstrap separate t (UBS)   0.050   .042  .046  .050  .054  .058   0.010    93.7      98.1      98.0
Permutation pooled t (UPP)   0.075   .006  .028  .052  .093  .203   0.077    64.8      71.0      47.9
Permutation separate t (UPS) 0.052   .041  .047  .052  .057  .062   0.011    86.0      97.5      95.9
Mielke–Berry (UMB)           0.024   .000  .002  .013  .049  .058   0.025    93.0      97.9      36.2

¹ % of conditions where test yielded Type I error rate less than .06.
² % of conditions where test yielded Type I error rate no greater than .075.
³ % of conditions where test yielded Type I error rate between .025 and .075.
often, and were less likely to have large Type I error rates (as evidenced by the smaller
90th percentile scores). However, UMB tended to be conservative. It is also worth
noting how well the much simpler UST did across situations. Although its average Type I
error rate was not much different from any of the conditional tests, it was more
consistent and was less likely to produce extreme Type I error rates. Furthermore, it was
valid in more conditions than any of the conditional tests, and it is much easier to
compute than UBS or UMB. As such, it should not be discounted at this stage as a serious
competitor to the conditional decision rule.
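The 'much simpler' UST is indeed simple to apply; in SciPy, for example, it is the `equal_var=False` form of `ttest_ind` (an illustration with made-up data, not the paper's code):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y1 = rng.normal(0, 1, 12)                 # small group
y2 = rng.normal(0, 2, 36)                 # larger, more variable group
t, p = stats.ttest_ind(y1, y2, equal_var=False)   # Welch's t: no pooling
```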
The results presented in Table 1 collapse across two factors we manipulated that are
known to affect the validity of the t test: variance inequality and sample size inequality.
Tables 2–5 detail how these 11 tests performed as a function of variance inequality and
sample size inequality, defining a test as valid using the ‘practical validity’ definition (we
will use this definition of validity in all remaining tables and discussions thereof). These
tables tabulate the percentage of populations sampled, out of 49 (because there were 49
different populations simulated), in which the test was invalid (i.e. estimated Type I error
rate above .075). Table 2 singles out the 1,029 simulated conditions with equal group
sample sizes. As can be seen, all of the tests did well when the sample sizes were equal.
In that case, conditioning the choice of t test on a variance equality test serves little
purpose. Furthermore, one can use any of the unconditional tests we examined here.
They all tend to produce a valid test with a few exceptions we do not detail at this stage.
Table 3 focuses on the 588 conditions where the population variances were equal. As
can be seen, when the groups were equally variable, all tests tended to do relatively well,
although UBP, UBS and UST did not do as well as most of the conditional tests when
there were large discrepancies between the group sample sizes and the overall sample
size was relatively small. The researcher might be well advised to avoid the use of these
tests in such situations. However, this advice presents a paradox, in that the researcher
cannot tell definitively whether the group variances are equal without first conducting a
variance equality test, which automatically moves him or her down the path of a
conditional testing strategy.
Table 4 documents the performance of these tests in the 1,323 conditions that
combined sample size inequality and variance inequality such that the smaller group has
the smaller variance (θ < 1). As can be seen, the tests performed well, with the
exception of UBP, which tended to fail far more than the other tests when sample size
inequality was combined with moderate variance inequality. With that exception, it
makes little difference whether a conditional or unconditional test is used when the
group variances are different and the smaller group is less variable than the larger group.
The biggest discrepancy between the methods occurred when the variances were
different and the smaller group was more variable than the larger group (occurring in
1,323 of the simulations). As can be seen in Table 5, some methods performed relatively
well, while others performed miserably. In these conditions, the UPT and UPP should be
avoided, as those tests were invalid in nearly all 49 of the combinations of population
skewness and kurtosis that we simulated. The best performer was UBS, followed by
UMB, UPS and UST. All of the conditional tests clearly were better than the
unconditional use of a pooled t test (whether using the t distribution or a permutation
distribution), but not nearly as good as the unconditional tests based on the separate
variance t statistic. This is an interesting and important finding, because it is in these
situations that researchers might be especially likely to be attracted to an unconditional
test given knowledge of the literature that documents the failure of the pooled t test.
Table 2. Percentage of sampled populations (out of 49) where test was invalid; equal sample sizes

                                  Variance ratio (θ)
Test                       N    0.10  0.25  0.50   1    2    4   10    Total %

Conditional tests
F (CF)                     48     4    0    0    0    0    4    4
                           80     2    0    0    0    0    0    2      <1
                           160    0    0    0    0    0    0    0
Levene (CL)                48     4    0    0    0    0    4    4
                           80     2    0    0    0    0    0    2      <1
                           160    0    0    0    0    0    0    0
O'Brien (CO)               48     4    0    0    0    0    4    6
                           80     2    0    0    0    0    0    2      <1
                           160    0    0    0    0    0    0    0
Brown–Forsythe (CB)        48     4    0    0    0    0    4    4
                           80     2    0    0    0    0    0    2      <1
                           160    0    0    0    0    0    0    0

Unconditional tests
Pooled t (UPT)             48     4    4    0    0    0    4    6
                           80     2    0    0    0    0    0    4       1
                           160    0    0    0    0    0    0    2
Separate t (UST)           48     4    0    0    0    0    4    4
                           80     2    0    0    0    0    0    2      <1
                           160    0    0    0    0    0    0    0
Bootstrap pooled t (UBP)   48     4    0    0    0    0    0    2
                           80     2    0    0    0    0    0    2      <1
                           160    0    0    0    0    0    0    0
Bootstrap separate t (UBS) 48     4    0    0    0    0    0    2
                           80     2    0    0    0    0    0    2      <1
                           160    0    0    0    0    0    0    0
Permutation pooled t (UPP) 48     4    2    0    0    0    4    6
                           80     2    2    0    0    0    0    4       1
                           160    0    0    0    0    0    0    2
Permutation separate t (UPS) 48   4    2    0    0    0    4    6
                           80     2    2    0    0    0    0    4       1
                           160    0    0    0    0    0    0    2
Mielke–Berry (UMB)         48     6    2    0    0    0    4    6
                           80     2    4    0    0    0    0    6       2
                           160    0    0    0    0    0    0    2
These findings suggest that several alternative unconditional tests are more likely to
produce a valid test.
However, there is something unsatisfying about this approach to examining the
results. Although the results tabulated in Tables 1–5 provide information about which
tests tend to be valid across conditions, we cannot specify very precisely the conditions
in which any of these tests fail, because we have not considered the effect of population
skewness and kurtosis on the Type I error rates. Furthermore, these tables provide no
information about the circumstances in which the tests all fail, nor do they identify the
possible conditions in which a conditional test may be the only valid testing strategy.
Table 3. Percentage of sampled populations (out of 49) where test was invalid; equal population variances (θ = 1)

                                  Sample size ratio (n1:n2)
Test                       N     1:7   1:3   2:3   1:1    Total %

Conditional tests
F (CF)                     48     10    0    0    0
                           80      8    0    0    0        2
                           160     4    0    0    0
Levene (CL)                48      0    0    0    0
                           80      0    0    0    0        0
                           160     0    0    0    0
O'Brien (CO)               48      0    0    0    0
                           80      0    0    0    0        0
                           160     0    0    0    0
Brown–Forsythe (CB)        48      0    0    0    0
                           80      0    0    0    0        0
                           160     0    0    0    0

Unconditional tests
Pooled t (UPT)             48      0    0    0    0
                           80      0    0    0    0        0
                           160     0    0    0    0
Separate t (UST)           48     16    0    0    0
                           80      8    0    0    0        2
                           160     4    0    0    0
Bootstrap pooled t (UBP)   48      0    0    0    0
                           80     76    0    0    0        7
                           160    16    0    0    0
Bootstrap separate t (UBS) 48     12    0    0    0
                           80      6    0    0    0        2
                           160     4    0    0    0
Permutation pooled t (UPP) 48      0    0    0    0
                           80      0    0    0    0        0
                           160     0    0    0    0
Permutation separate t (UPS) 48    0    0    0    0
                           80      0    0    0    0        0
                           160     0    0    0    0
Mielke–Berry (UMB)         48      0    0    0    0
                           80      0    0    0    0        0
                           160     0    0    0    0
Table 4. Percentage of sampled populations (out of 49) where test was invalid: unequal sample sizes, smaller group with smaller variance (θ < 1)

                                        Variance ratio (θ)
                                   0.10            0.25            0.50
Test                       N    1:7 1:3 2:3     1:7 1:3 2:3     1:7 1:3 2:3    Total % (out of 1,323)

Conditional tests
F (CF)                     48     0   0   2       0   0   0       8   0   0
                           80     0   0   0       0   0   0       6   0   0     <1
                           160    0   0   0       0   0   0       2   0   0
Levene (CL)                48     0   0   2       0   0   0       2   0   0
                           80     0   0   0       0   0   0       2   0   0     <1
                           160    0   0   0       0   0   0       2   0   0
O'Brien (CO)               48     0   0   0       0   0   0       0   0   0
                           80     0   0   0       0   0   0       0   0   0      0
                           160    0   0   0       0   0   0       0   0   0
Brown–Forsythe (CB)        48     0   0   0       0   0   0       0   0   0
                           80     0   0   0       0   0   0       0   0   0      0
                           160    0   0   0       0   0   0       0   0   0

Unconditional tests
Pooled t (UPT)             48     0   0   0       0   0   0       0   0   0
                           80     0   0   0       0   0   0       0   0   0      0
                           160    0   0   0       0   0   0       0   0   0
Separate t (UST)           48     0   0   4       0   0   0       6   0   0
                           80     0   0   0       0   0   0       4   0   0     <1
                           160    0   0   0       0   0   0       0   0   0
Bootstrap pooled t (UBP)   48     0   0   0      27   0   0      80   0   0
                           80     0   0   0       4   0   0      41   0   0      6
                           160    0   0   0       0   0   0       0   0   0
Bootstrap separate t (UBS) 48     0   0   0       0   0   0       4   0   0
                           80     0   0   0       0   0   0       2   0   0     <1
Andrew F. Hayes and Li Cai
Table 5. Percentage of sampled populations (out of 49) where test was invalid: unequal sample sizes, smaller group with larger variance (θ > 1)

                                        Variance ratio (θ)
                                    2               4               10
Test                       N    1:7 1:3 2:3     1:7 1:3 2:3     1:7 1:3 2:3    Total % (out of 1,323)

Conditional tests
F (CF)                     48    90  35   0     100  20   6      51  12  10
                           80    71  20   0      55  16   2      14   6   4     22
                           160   39   2   0      14   4   0      10   4   2
Levene (CL)                48    82  41   0     100  33   8      82  14  10
                           80    78  24   0      96  16   2      16   8   4     26
                           160   51   0   0      14   4   0      10   4   2
O'Brien (CO)               48    73  27   0     100  37  10      43  14  12
                           80    71  10   0      33  16   2      16  10   8     20
                           160   27   0   0      18   6   2      10   4   2
Brown–Forsythe (CB)        48    90  67   0     100  92  10     100  14  10
                           80    88  33   0     100  18   2      18   8   4     32
                           160   71   0   0      18   4   2      10   4   2

Unconditional tests
Pooled t (UPT)             48   100 100   4     100 100  92     100 100 100
                           80   100  98   8     100 100  88     100 100 100     89
                           160  100 100  10     100 100  92     100 100 100
Separate t (UST)           48    16   8   0      12  10   6      14  10  10
                           80    12   4   0      12   6   2      12   6   4      7
                           160    8   2   0       6   2   0      10   4   2
Bootstrap pooled t (UBP)   48    98  10   0     100  14   4     100  14   6
                           80    78   4   0      86   6   0      71   8   2     24
                           160   20   0   0      16   2   0      16   4   0
Bootstrap separate t (UBS) 48    14   6   0      12   8   4       8   4   4
                           80    10   4   0      10   4   0       6   4   2      4
                           160    4   2   0       4   2   0       4   2   0
Table 6. Pairwise comparisons between conditional and unconditional test performance across all 4,116 conditions
was the only valid test, and the percentage of conditions in which the unconditional test
was the only valid test, respectively. Column 6 (UC/C) quantifies their relative
performance as the ratio of these two columns. A ratio less than 1 signifies that when
the two tests disagreed, the conditional test was valid more frequently, whereas a ratio
greater than 1 indicates that the unconditional test tended to be the sole valid test.
There is a consistent pattern. Whenever a conditional test was pitted against an
unconditional test that relies on a pooling of variances in the test statistic (UPT, UBP,
UPP), the conditional test tended to be valid more frequently. In contrast, whenever a
conditional test competed against an unconditional test that does not use a pooling of
variances in the test statistic (UST, UBS, UPS, UMB), the unconditional test was valid
more often. This table illustrates how poor UPT is compared to any of the conditional
testing strategies in controlling Type I errors. It also illustrates that the conditional tests
are rarely better than any of the unconditional separate variance tests in keeping Type I
error rates at an acceptable level. And again, we see evidence in this table of the
superiority of UBS, UPS and UMB relative to any of the conditional tests, as well as the
good performance of UST across situations compared to the conditional tests.
So far we have focused on the relative performance of the conditional and
unconditional testing strategies collapsing across the 49 variations of population
skewness and kurtosis. Although there are some important differences between the
Type I error rates of conditional and unconditional testing strategies throughout the
results thus far, it is clear from Table 5 that the largest differences between the methods
occurred when both the sample sizes and variances differed, and when the smaller
group had the largest variance. To better understand the nature of the differences in
performance in those specific conditions, in Tables 7 and 8 we provide the mean Type I
error rates for all the methods studied here when sampling from (1) symmetrical
populations with negative kurtosis (five conditions), (2) symmetrical populations with
positive kurtosis (six conditions), (3) asymmetrical populations with zero kurtosis
(three conditions), (4) asymmetrical populations with negative kurtosis (seven
conditions), and (5) asymmetrical populations with positive kurtosis (27 conditions).
Focusing first on the symmetrical populations (Table 7), the basic pattern of findings
we have described thus far appears, with the unconditional tests that do not rely on the
pooling of variances keeping good control over Type I error rates as well as or better
than any of the conditional tests. UMB was generally conservative, and the
unconditional pooled variance t test (UPT) and its resampling variants (UBP and UPP)
were liberal, although the liberalness of UBP decreased with increasing sample size.
The conditional tests generally performed similarly, although conditioning on O’Brien’s
test of variance equality (CO) produced a slightly less liberal test when sampling from
symmetrical populations with negative kurtosis.
Examining the relative performance of the tests when sampling from asymmetrical
populations (Table 8), the same basic pattern is found, with UST, UBS and UPS equalling
or outperforming the conditional tests and generally keeping the Type I error rate, on
average, near .05. UMB was either liberal or conservative depending on the
discrepancies in sample size and variance, and the pooled variance tests (UPT, UBP,
UPP) were generally liberal, sometimes dramatically so.
Table 7. Mean Type I error rates: unequal sample sizes, smaller group with larger variance (θ > 1), symmetrical populations (skewness = 0)

Kurt.  θ  n1:n2  N     CF    CL    CO    CB    UPT   UST   UBP   UBS   UPP   UPS   UMB
Neg. 2 1:7 48 .115 .118 .104 .124 .130 .059 .109 .048 .131 .058 .058
80 .107 .108 .088 .114 .132 .055 .089 .049 .132 .056 .038
160 .076 .079 .062 .084 .128 .050 .065 .049 .128 .052 .017
1:3 48 .084 .084 .078 .089 .103 .054 .068 .052 .102 .057 .035
80 .077 .077 .069 .082 .193 .054 .064 .053 .102 .056 .020
160 .057 .058 .053 .060 .099 .051 .057 .052 .098 .053 .014
2:3 48 .060 .060 .059 .061 .067 .052 .053 .050 .067 .053 .041
80 .058 .057 .056 .059 .071 .053 .054 .052 .072 .054 .032
160 .048 .049 .048 .049 .064 .047 .049 .049 .065 .048 .018
4 1:7 48 .121 .132 .094 .155 .234 .058 .115 .041 .232 .064 .101
80 .071 .082 .063 .091 .231 .055 .083 .044 .230 .057 .067
160 .051 .053 .050 .054 .227 .050 .064 .046 .228 .052 .031
1:3 48 .066 .073 .063 .081 .161 .054 .067 .048 .161 .060 .046
80 .054 .057 .054 .059 .155 .053 .062 .052 .155 .055 .027
160 .048 .048 .048 .048 .148 .048 .052 .048 .149 .050 .009
2:3 48 .055 .057 .055 .058 .089 .054 .055 .052 .089 .058 .048
80 .053 .053 .053 .053 .086 .052 .054 .053 .085 .053 .032
160 .049 .049 .049 .049 .084 .049 .051 .050 .084 .050 .025
10 1:7 48 .067 .073 .061 .087 .361 .057 .101 .038 .352 .068 .127
80 .058 .058 .057 .059 .359 .057 .081 .047 .353 .063 .083
160 .050 .050 .050 .050 .341 .050 .060 .047 .339 .053 .038
1:3 48 .052 .053 .052 .054 .210 .052 .058 .044 .209 .058 .046
80 .050 .050 .050 .050 .213 .050 .057 .050 .212 .055 .025
160 .048 .048 .048 .048 .205 .048 .051 .048 .205 .050 .009
2:3 48 .056 .056 .056 .056 .109 .056 .056 .052 .109 .061 .051
80 .052 .052 .052 .052 .104 .052 .054 .052 .103 .056 .034
160 .055 .055 .055 .055 .101 .055 .056 .055 .100 .058 .026
Pos. 2 1:7 48 .086 .090 .079 .099 .123 .049 .093 .041 .119 .057 .020
80 .075 .081 .075 .088 .122 .043 .072 .037 .122 .049 .008
160 .065 .073 .070 .077 .125 .046 .060 .044 .122 .049 .003
1:3 48 .068 .074 .074 .079 .098 .047 .058 .042 .097 .054 .011
80 .065 .072 .071 .076 .103 .051 .059 .050 .102 .056 .004
160 .053 .058 .059 .059 .096 .047 .051 .045 .095 .049 .001
2:3 48 .056 .060 .061 .061 .068 .052 .052 .049 .069 .054 .024
80 .054 .057 .058 .057 .067 .051 .052 .050 .069 .053 .015
160 .051 .053 .054 .054 .066 .050 .051 .050 .067 .050 .015
4 1:7 48 .096 .107 .092 .125 .228 .046 .095 .034 .220 .060 .046
80 .072 .083 .074 .092 .228 .048 .077 .040 .222 .057 .024
160 .051 .054 .056 .056 .221 .046 .060 .043 .219 .052 .008
1:3 48 .066 .073 .074 .081 .156 .053 .063 .045 .157 .060 .021
80 .053 .055 .057 .057 .152 .049 .057 .046 .154 .054 .007
160 .050 .050 .051 .050 .148 .050 .052 .047 .149 .051 .002
2:3 48 .048 .051 .054 .053 .081 .045 .047 .043 .083 .051 .023
80 .047 .048 .049 .049 .077 .046 .046 .044 .079 .048 .015
160 .046 .046 .046 .046 .079 .046 .048 .047 .078 .048 .013
10 1:7 48 .067 .077 .068 .093 .350 .043 .087 .031 .335 .064 .078
Table 7. (Continued)

Kurt.  θ  n1:n2  N     CF    CL    CO    CB    UPT   UST   UBP   UBS   UPP   UPS   UMB
80 .050 .053 .052 .056 .348 .046 .069 .037 .336 .055 .044
160 .050 .050 .050 .050 .335 .050 .060 .043 .329 .053 .015
1:3 48 .049 .051 .052 .054 .217 .048 .054 .039 .220 .057 .025
80 .050 .050 .050 .050 .213 .050 .054 .045 .214 .055 .010
160 .050 .050 .050 .050 .213 .050 .052 .048 .214 .053 .002
2:3 48 .047 .047 .051 .047 .099 .046 .045 .041 .104 .055 .027
80 .050 .050 .051 .050 .100 .050 .049 .048 .103 .055 .019
160 .049 .049 .049 .049 .099 .049 .050 .049 .099 .052 .010
violations. Even the better of the unconditional tests can fail in some circumstances. In
this study there were only 9 conditions (less than 0.3%) in which none of UST, UBS, UPS
or UMB was valid, all of them when sampling from populations with large skewness
and kurtosis (in our case, operationalized as exponential, χ²(1), or lognormal) and large
variance inequality (0.1 or 10 here). This is consistent with prior research that has
shown that the separate variance t test is not robust when variances are heterogeneous
and the populations non-normal (e.g. Algina, Oshima, & Lin, 1994; Luh & Guo, 2000;
Wilcox, 2003). Even so, our findings show that in these situations, conditioning the
selection of t test on a variance equality test does not yield a more valid test.
4. Summary
In this paper, we evaluated the strategy of conditioning the choice of t test on the
outcome of one of several variance equality tests. Furthermore, we examined this
conditional decision rule using several different variance equality tests, and compared
the Type I error rates of the rule to some alternative unconditional tests based on
resampling methods. Our findings are consistent with those of others who have
reported that the conditional decision rule serves little useful function and may as well
be abandoned (Best & Rayner, 1987; Gans, 1981; Moser & Stevens, 1992; Wilcox et al.,
1986). Furthermore, these results allow us to generalize this conclusion to tests other
than F or Levine’s test as the conditioning test, as well as when sampling from
non-normal populations. Little if any Type I error protection is gained using the
conditional decision rule compared to the unconditional use of the ‘separate variance’
Welch–Satterthwaite t test (UST) in all conditions, irrespective of the outcome of a
variance equality test. Some improvement in performance compared to UST in some
situations can be achieved by bootstrapping the sampling distribution of the separate
variance t statistic (UBS), although this test tended not to perform as well as the separate
variance t test when the sample size of one of the groups was small. The more
complicated Mielke–Berry method also performed well, although it tended to be
conservative compared to UBS and UST. This is not to say that these tests are infallible.
Even these tests failed when extreme variance inequality was combined with large
discrepancies in group sample size. Such failures tended to occur most frequently when
the population distribution was extremely non-normal. Thus, these tests are not
solutions to the Behrens–Fisher problem in all circumstances. Our main point is that
[Table 8. Mean Type I error rates: unequal sample sizes, smaller group with larger
variance (u > 1), asymmetrical populations (skewness > 0), for the conditional and
unconditional tests. The table's column headers did not survive extraction, so its
entries are not reproduced here.]
these tests do as well as or better than the conditional decision rule in nearly every
circumstance we examined.
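The contrast between the conditional decision rule and the unconditional Welch test can be illustrated with a small Monte Carlo sketch. This is hypothetical Python using scipy, not the authors' GAUSS implementation; it uses Levene's test as the conditioning test and one of the troublesome configurations described above, in which the smaller group has the larger variance:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def conditional_p(x, y, alpha_var=0.05):
    """Conditional decision rule: test variance equality first (here with
    Levene's test); use the pooled t test if it does not reject, and the
    separate variance (Welch) t test if it does."""
    _, p_var = stats.levene(x, y)
    _, p = stats.ttest_ind(x, y, equal_var=(p_var >= alpha_var))
    return p

def welch_p(x, y):
    """Unconditional separate variance (Welch-Satterthwaite) t test."""
    _, p = stats.ttest_ind(x, y, equal_var=False)
    return p

# Null hypothesis true (equal means), but the smaller group has the larger
# variance: the pattern under which the pooled t test is liberal.
n1, n2, sd1, sd2, reps = 12, 36, 4.0, 1.0, 4000
rej_cond = rej_welch = 0
for _ in range(reps):
    x, y = rng.normal(0, sd1, n1), rng.normal(0, sd2, n2)
    rej_cond += conditional_p(x, y) < .05
    rej_welch += welch_p(x, y) < .05
print(f"conditional: {rej_cond / reps:.3f}   Welch: {rej_welch / reps:.3f}")
```

The empirical rejection rate of the unconditional Welch test stays near the nominal .05 level, while the conditional rule gains nothing over it, consistent with the pattern reported in the tables.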
Of course, the methods we included in our simulations do not exhaust the available
unconditional tests, and we are not in a position to argue that the alternatives we
examined are better than others that exist. For instance, we acknowledge
that so-called asymmetric t tests are available, and the use of an asymmetric variance
estimate may be better than both the separate variance estimate and the pooled variance
estimate when n2 is much greater than n1 (Balkin & Mallow, 2001). Alternative robust
methods (Andrews et al., 1972; Keselman, Othman, Wilcox, & Fradette, 2004; Lix
& Keselman, 1998; Wilcox, 1997; Wilcox & Keselman, 2003), especially those
comparing trimmed means, provide alternatives resistant to variance heterogeneity and
bizarre distribution shapes, but they test a different null hypothesis, so we did not
include them in this study. Such robust methods also have resampling-based alternatives
that can further improve their performance (Keselman et al., 2004; Wilcox,
Keselman, & Kowalchuk, 1998). It may be that the best solution to the Behrens–Fisher
problem is to abandon the use of the arithmetic mean as the default measure of location,
and instead focus our inferential questions on statistics less susceptible to the influence
of extreme observations that tend to occur when sampling from irregular distributions.
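The bootstrap approach behind the UBS test evaluated above can be sketched briefly. This is a minimal illustration, not the authors' implementation: each group is centred at its own mean before resampling, so the null hypothesis of equal means holds in the bootstrap world while each group retains its own variance and shape.

```python
import numpy as np

def welch_t(x, y):
    """Separate variance (Welch) t statistic."""
    return (x.mean() - y.mean()) / np.sqrt(
        x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))

def bootstrap_t_p(x, y, B=999, seed=0):
    """Two-sided p-value from the bootstrapped null distribution of the
    separate variance t statistic. Centring each group at its own mean
    imposes H0 without equalizing the variances."""
    rng = np.random.default_rng(seed)
    t_obs = welch_t(x, y)
    xc, yc = x - x.mean(), y - y.mean()
    t_star = np.array([
        welch_t(rng.choice(xc, len(x), replace=True),
                rng.choice(yc, len(y), replace=True))
        for _ in range(B)])
    return (1 + np.sum(np.abs(t_star) >= abs(t_obs))) / (B + 1)

rng = np.random.default_rng(1)
x = rng.normal(1.0, 3.0, 15)   # smaller group, larger variance
y = rng.normal(0.0, 1.0, 40)
print(bootstrap_t_p(x, y))
```

Because the reference distribution is built from the data rather than a t distribution with estimated degrees of freedom, this approach can track the true sampling distribution of the statistic more closely when the populations are non-normal, though, as noted above, it can falter when one group's sample size is very small.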
Finally, validity is only one of the criteria used to judge the performance of a statistical
test. A test may be valid but lower in power than alternative tests. Our results, combined
with previous ones, suggest that there is little value in a conditional test if the goal is to
control Type I error rates. But perhaps a conditional test that is valid in a given situation
yields a more powerful test in that situation than any unconditional test. Our results say
nothing about this possibility. Moser and Stevens (1992) and Moser et al. (1989) note
that the only situations in which a conditional test (in their case, using F as variance
equality test for choosing between the pooled and separate t tests) is superior in terms
of power are (i) when the variances are equal or nearly so, and (ii) when the sample sizes
are unequal and the smaller group has the larger variance. However, in the latter case the
conditional test buys power at the cost of high Type I error rates. Otherwise, the
unconditional use of the separate variance t test is either equal in power to or more
powerful than conditioning the choice on the F-ratio test of variance equality. It would
be worth examining further the extent of any power differences between the
conditional decision rule and the unconditional tests we examined here, and seeing
whether those conclusions generalize across the choice of variance equality test as well
as when sampling from non-normal populations.
References
Algina, J., Oshima, T. C., & Lin, W. Y. (1994). Type I error rates for Welch’s test and James’s second-
order test under non-normality and inequality of variance when there are two groups. Journal
of Educational and Behavioral Statistics, 19, 275–291.
Andrews, D., Bickel, P., Hampel, F., Huber, P., Rogers, W., & Tukey, J. (1972). Robust estimates of
location. Princeton, NJ: Princeton University Press.
Aptech Systems, Inc. (1996). GAUSS (version 3.21) [computer software]. Maple Valley, WA:
Aptech Systems, Inc.
Balkin, S. D., & Mallow, C. L. (2001). An adjusted, asymmetric two-sample t test. American
Statistician, 55, 203–206.
Bassili, J. N. (2003). The minority slowness effect: Subtle inhibitions in the expression of views not
shared by others. Journal of Personality and Social Psychology, 84, 261–276.
Best, D. J., & Rayner, J. C. W. (1987). Welch’s approximate solution for the Behrens–Fisher
problem. Technometrics, 29, 205–210.
Bluman, A. G. (2001). Elementary statistics: A step-by-step approach (4th ed.). Boston: McGraw-
Hill.
Boik, R. J. (1987). The Fisher–Pitman permutation test: A non-robust alternative to the normal
theory F when variances are heterogeneous. British Journal of Mathematical and Statistical
Psychology, 40, 26–42.
Boneau, C. A. (1960). The effects of violating the assumptions underlying the t test. Psychological
Bulletin, 57, 49–64.
Mervis, C. B., & Robinson, B. F. (2000). Expressive vocabulary ability of toddlers with Williams
syndrome or Down syndrome: A comparison. Developmental Neuropsychology, 17, 111–126.
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological
Bulletin, 105, 156–166.
Mielke, P. W., & Berry, K. J. (1994). Permutation tests for common locations among samples with
unequal variances. Journal of Educational and Behavioral Statistics, 19, 217–236.
Mielke, P. W., Berry, K. J., & Johnson, E. S. (1976). Some multi-response permutation procedures
for a priori classifications. Communications in Statistics: Theory and Methods, 5, 1409–1424.
Moser, B. K., & Stevens, G. R. (1992). Homogeneity of variance in the two-sample means test.
American Statistician, 46, 19–21.
Moser, B. K., Stevens, G. R., & Watts, C. L. (1989). The two-sample t test versus Satterthwaite’s
approximate F test. Communications in Statistics: Theory and Methods, 18, 3963–3975.
Murphy, B. P. (1976). Comparison of some two sample means tests by simulation.
Communications in Statistics: Simulation and Computation, 5, 23–32.
Norusis, M. J. (2002). SPSS 11.0 guide to data analysis. Chicago, IL: SPSS Inc.
O’Brien, R. G. (1981). A simple test for variance effects in experimental designs. Psychological
Bulletin, 89, 570–574.
Olejnik, S. F. (1988). Variance heterogeneity: An outcome to explain or a nuisance factor to
control. Journal of Experimental Education, 56, 193–197.
Olejnik, S. F., & Algina, J. (1988). Tests of variance equality when distributions differ in form and
location. Educational and Psychological Measurement, 48, 317–329.
Pfanzagl, J. (1974). On the Behrens–Fisher problem. Biometrika, 61, 39–47.
Press, S. J. (1966). A confidence interval comparison of two test procedures proposed for the
Behrens–Fisher problem. Journal of the American Statistical Association, 61, 454–466.
Ramsey, P. H. (1994). Testing variances in psychological and educational research. Journal of
Educational Statistics, 19, 23–42.
Romano, J. P. (1990). On the behavior of randomization tests without a group invariance
assumption. Journal of the American Statistical Association, 85, 686–692.
Rosner, B. (2000). Fundamentals of biostatistics (5th ed.). Pacific Grove, CA: Duxbury.
Satterthwaite, F. E. (1946). An approximate distribution of estimates of variance components.
Biometrics Bulletin, 2, 110–114.
Sawilowsky, S. S., & Blair, R. C. (1992). A more realistic look at the robustness and Type II error
properties of the t test to departures from population normality. Psychological Bulletin, 111,
352–360.
Scheffé, H. (1943). On solutions of the Behrens–Fisher problem, based on the t distribution.
Annals of Mathematical Statistics, 14, 35–44.
Smith, S. S., & Richardson, D. (1983). Amelioration of deception and harm in psychological
research: The important role of the debriefing. Journal of Personality and Social Psychology,
44, 1075–1082.
Snedecor, G. W., & Cochran, W. G. (1989). Statistical methods (8th ed.). Ames, IA: Iowa State University Press.
Stonehouse, J. M., & Forrester, C. J. (1998). Robustness of the t and U tests under combined
assumption violations. Journal of Applied Statistics, 25, 63–74.
Triola, M. F., Goodman, W. M., & Law, R. (2002). Elementary statistics. Toronto: Addison-Wesley.
Welch, B. L. (1938). The significance of the difference between two means when the population
variances are unequal. Biometrika, 29, 350–362.
Welch, B. L. (1947). The generalization of Student’s problem when several different population
variances are involved. Biometrika, 34, 28–35.
Wilcox, R. R. (1990). Comparing variances and means when distributions have non-identical shapes.
Communications in Statistics: Simulation and Computation, 19, 155–173.
Wilcox, R. R. (1997). Introduction to robust estimation and hypothesis testing. San Diego, CA:
Academic Press.
Wilcox, R. R. (2002). Comparing the variances of two independent groups. British Journal of
Mathematical and Statistical Psychology, 55, 169–175.
Wilcox, R. R. (2003). Applying contemporary statistical techniques. San Diego, CA: Academic
Press.
Wilcox, R. R., Charlin, V. L., & Thompson, K. L. (1986). New Monte Carlo results on the robustness
of the ANOVA F, W, and F* statistics. Communication in Statistics: Simulation and
Computation, 15, 933–943.
Wilcox, R. R., & Keselman, H. J. (2003). Modern robust data analysis methods: Measures of central
tendency. Psychological Methods, 8, 254–274.
Wilcox, R. R., Keselman, H. J., & Kowalchuk, R. K. (1998). Can tests for treatment group equality
be improved? The bootstrap and trimmed means conjecture. British Journal of Mathematical
and Statistical Psychology, 51, 123–134.
Zimmerman, D. W. (1996). Some properties of preliminary tests of equality of variances in the two-
sample location problem. Journal of General Psychology, 123, 217–231.
Zimmerman, D. W. (2004). A note on preliminary tests of equality of variances. British Journal of
Mathematical and Statistical Psychology, 57, 173–181.
Zimmerman, D. W., & Zumbo, B. D. (1992). Parametric alternatives to the Student t test under
violations of normality and homogeneity of variance. Perceptual and Motor Skills, 74,
835–844.
Appendix
The table below lists the skewness and kurtosis of each population sampled in the
simulations. When samples were generated with the Fleishman method, the coefficients
used in equation (2) are listed in the row corresponding to the population skewness
and kurtosis those coefficients produce. Otherwise, the probability density function for
the population is listed.
[Appendix table: columns Skewness, Kurtosis, a, b, c, d. The body of the table did not
survive extraction.]