British Journal of Mathematical and Statistical Psychology (2007), 60, 217–244
© 2007 The British Psychological Society

www.bpsjournals.co.uk

Further evaluating the conditional decision rule for comparing two independent means

Andrew F. Hayes1* and Li Cai2

1 School of Communication, Ohio State University, USA
2 Department of Psychology, University of North Carolina, USA

Many books on statistical methods advocate a ‘conditional decision rule’ when comparing two independent group means. This rule states that the decision as to
whether to use a ‘pooled variance’ test that assumes equality of variance or a ‘separate
variance’ Welch t test that does not should be based on the outcome of a variance
equality test. In this paper, we empirically examine the Type I error rate of the
conditional decision rule using four variance equality tests and compare this error rate
to the unconditional use of either of the t tests (i.e. irrespective of the outcome of a
variance homogeneity test) as well as several resampling-based alternatives when
sampling from 49 distributions varying in skewness and kurtosis. Several unconditional
tests including the separate variance test performed as well as or better than the
conditional decision rule across situations. These results extend and generalize the
findings of previous researchers who have argued that the conditional decision rule
should be abandoned.

* Correspondence should be addressed to Andrew F. Hayes, School of Communication, Ohio State University, 3016 Derby Hall, 154 N. Oval Mall, Columbus, OH 43210, USA (e-mail: hayes.338@osu.edu).

DOI:10.1348/000711005X62576

1. Introduction
The independent groups t test is one of the most widely used statistical tests. Given this,
it is amazing that after half a century statisticians are still debating how best to compare
two group means. Unfortunately, this lively debate rarely finds its way into statistical
methodology books, the authors of which frequently advocate a strategy for comparing
two group means that some of the literature suggests should not be used. In this paper
we further evaluate this strategy, which we call the ‘conditional decision rule’. This rule
states that the choice of which of two t tests to use, both of which are printed by
popular statistics packages, should be based on the outcome of a test of variance
equality. We compare the Type I error rate of the conditional decision rule in a variety of
situations and also compare it to several alternative methods of comparing two means,
including some based on resampling methods of inference. Although there have been
several studies of the performance of the conditional decision rule that have led their
respective investigators to argue against it, existing studies are limited in scope. By using
four different variance equality tests and sampling from 49 different distributions, the
simulations we present here are arguably the most comprehensive and provide a strong
means of extending and testing the generalizability of previous recommendations
regarding the usefulness (or lack thereof) of the conditional decision rule.

1.1. Two versions of the t test


The independent groups t test is used to test the null hypothesis H0: μ1 = μ2, where μ1 and μ2 are population means estimated in independent samples of size n1 and n2 with arithmetic sample means Ȳ1 and Ȳ2. The test statistic used is the ratio of the sample
mean difference to an estimate of the standard error of the difference:
t = \frac{\bar{Y}_1 - \bar{Y}_2}{\sqrt{d_1/n_1 + d_2/n_2}},   (1)

where d1 and d2 are within-group variance estimates. There is considerable controversy over how best to estimate the standard error of the difference. Assuming homogeneity
of the variance in Y across groups, the most common method is to derive a pooled
within-group variance estimate, defined as the weighted average of the two sample
variances, weighting each by its degrees of freedom. This pooled variance estimate is
then substituted for d1 and d2 in equation (1). If H0 is true, the p-value for t can be
derived using the t distribution with n1 + n2 − 2 degrees of freedom. This version of the
test, referred to here as the pooled variance test, is printed by nearly every statistical
package when an investigator requests an independent groups t test.
An alternative approach relaxes the assumption of homogeneity of variance in Y.
Rather than pooling the variances to derive a common variance estimate, d1 and d2 are
set to s1² and s2², the sample variances in groups 1 and 2, respectively. In that case, the
sampling distribution of t can be well approximated by a t distribution, but the degrees
of freedom are different than when a pooled variance estimate of the standard error is
used. Welch (1938, equation 14) and Satterthwaite (1946, equation 17) both
recommended a degrees of freedom approximation that has subsequently been
implemented in commonly used statistical packages such as SPSS, SAS and STATA.
We will refer to this test as the ‘separate variance’ test (it has also been called the ‘Welch t test’; see Gans, 1981; Zimmerman, 2004).
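
Both versions are available in standard software; the sketch below (ours, using SciPy with invented data, not code from any source cited here) applies the two tests to the same pair of samples.

```python
# Minimal illustration (not taken from the paper): the pooled variance test and
# the separate variance (Welch) test applied to the same two samples with SciPy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y1 = rng.normal(loc=0.0, scale=1.0, size=20)   # group 1
y2 = rng.normal(loc=0.0, scale=3.0, size=60)   # group 2, larger and more variable

pooled = stats.ttest_ind(y1, y2, equal_var=True)    # pooled variance test, df = n1 + n2 - 2
welch = stats.ttest_ind(y1, y2, equal_var=False)    # separate variance test, Welch-Satterthwaite df

print("pooled   t = %.3f, p = %.3f" % (pooled.statistic, pooled.pvalue))
print("separate t = %.3f, p = %.3f" % (welch.statistic, welch.pvalue))
```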
The extent to which the normality and equality of variance assumptions are met can
influence the performance of these tests. Research suggests that the pooled t test can be
robust to violations of its assumptions (e.g. Boneau, 1960; Sawilowsky & Blair, 1992),
and so many people use the test without thinking about these assumptions. But some
forms of variance inequality can result in a p-value that is erroneously small or large.
The combination of differences in group population variances and sample size can have
adverse effects on the validity or power of the pooled test (e.g. Boneau, 1960;
Stonehouse & Forrester, 1998). This has been called the Behrens–Fisher problem, and
many statisticians have studied it (e.g. Fisher, 1941; Pfanzagl, 1974; Press, 1966; Scheffé,
1943; Welch, 1938, 1947). Research indicates that when the population variances
are unequal, the separate variance t test better controls the Type I error rate. When the
population variances are equal and the populations normal, both t tests are valid but
the pooled variance test is somewhat more powerful (e.g. Murphy, 1976; Zimmerman &
Zumbo, 1992). Therefore, there are some occasions when the pooled variance t test
should be preferred, and others when the separate variance t test should be used.

1.2. The conditional decision rule


To deal with the potential ambiguity over which test to employ in a given situation,
a common rule that has been advocated is to condition the choice of t test on the
outcome of a variance equality test. If the null hypothesis of equal variances is not
rejected, assume them equal and apply the pooled variance t test. But if a variance
homogeneity test rejects the null hypothesis, assume them unequal and apply the
separate variance t test. The flow charts in Keppel (1991, p. 128), Rosner (2000, p. 297)
and Triola, Goodman, and Law (2002, p. 437) are typical examples of the description of
this ‘conditional decision rule’, as it will be referred to throughout this paper. This rule is
advocated by textbook and software manual writers and facilitated by software
developers who provide computer output necessary to carry it out. The documentation
produced by software companies and independent writers makes the conditional
decision rule explicit when describing how to interpret program output (e.g. Coakes &
Steed, 1997, p. 82; Cody & Smith, 1997, p. 141; Field, 2000, p. 238; Foster, 2001,
p. 167; George & Mallery, 2003, p. 140; MathSoft Inc., 1999; Norusis, 2002,
pp. 273–274). And writers of statistical methodology textbooks either suggest it by
presenting different methodologies to use depending on whether or not the group
variances are assumed equal, or directly advocate the rule, spelling out step-by-step how
to carry it out (e.g. Bluman, 2001, pp. 425–427; Howell, 1996, pp. 199–200; Keppel,
1991, p. 128; Levine, Ramsey, & Smidt, 2001, p. 418; McClave & Sincich, 2003,
pp. 388–390; Rosner, 2000, pp. 297–299; Snedecor & Cochran, 1989; Triola et al., 2002,
pp. 430–437). Indeed, there is evidence that researchers take this advice seriously and
use the conditional decision rule in their data analyses (e.g. Bassili, 2003; Chapman,
Hardin-Jones, Schulte, & Halter, 2001; Hesse & Van Ijzendoorn, 1998; Mervis &
Robinson, 2000; Smith & Richardson, 1983).
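
To make the rule concrete, the sketch below (our illustration, not code from any of the sources cited above) conditions the choice of t test on Levene's test, one of the variance equality tests discussed in Section 1.3; the .05 threshold for the variance test mirrors the convention used later in this paper.

```python
# Hypothetical sketch of the conditional decision rule: test variance equality
# first, then choose the pooled or separate variance t test accordingly.
from scipy import stats

def conditional_t_test(y1, y2, alpha_var=0.05):
    # Step 1: test H0 of equal variances (Levene's test here; any variance
    # equality test could be substituted).
    _, p_var = stats.levene(y1, y2, center='mean')
    # Step 2: if equality is not rejected, use the pooled test; otherwise Welch.
    equal_var = p_var >= alpha_var
    t, p = stats.ttest_ind(y1, y2, equal_var=equal_var)
    return t, p, ('pooled' if equal_var else 'separate')
```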
Although the conditional decision rule is frequently advocated, there is evidence that
the unconditional separate variance test is superior to the conditional decision rule with
respect to controlling Type I errors (Wilcox, Charlin, & Thompson, 1986; Gans, 1981;
Moser & Stevens, 1992; Moser, Stevens, & Watts, 1989; Zimmerman, 1996, 2004). As a
result, some have argued that the conditional decision rule should be abandoned.
But this conclusion may be premature. Previous investigators have implemented the
conditional decision rule with a single and usually arbitrarily chosen variance equality
test. What is needed is a comparison of the performance of the conditional decision rule
applied to the same data but using different variance equality tests. This has been done
only in limited circumstances, such as small sample sizes or when sampling from
populations that are normally distributed. Given that our samples are not always small
and our populations frequently not normally distributed (Bradley, 1977; Micceri, 1989),
it is worth studying larger sample sizes and samples from non-normal populations,
among other variations in conditions that might be important. Although research
supports the claim that there is little value to the conditional decision rule when
sampling from normal populations, there is no reason to assume that it provides no
advantages over an unconditional strategy when sampling from non-normal data
because unconditional tests are not always robust when variance inequality is combined
with non-normality and sample size inequalities. Perhaps the conditional decision rule
provides better (or worse) Type I error protection, depending on the test of variance
equality or the shape of the population sampled.

1.3. Selecting a test of variance equality


There are many ways of comparing group variances that differ not only in approach but
also in assumptions, and not all are robust to violations of their assumptions. If the
conditional decision rule is to be used, on which variance equality test should the
decision be conditioned? A variance equality test that liberally rejects the null hypothesis
of variance equality may lead the user of the conditional decision rule to use the separate
variance test too often. Alternatively, if a test is conservative, it may not lead the user to
the separate variance test when it should be used. Fortunately, the performance of the
many variance equality tests available is well researched (e.g. Brown & Forsythe, 1974;
Conover, Johnson, & Johnson, 1981; Olejnik & Algina, 1988), but two of the tests are
more likely to be used regardless of their statistical performance because they are
printed by default by statistical packages that enjoy popular use: the F-ratio test and
Levene’s test. The F-ratio test of variance equality is computed by forming the ratio of
the largest to smallest sample variance. Assuming a normally distributed population and
a true null hypothesis, F follows the F(nmax − 1, nmin − 1) distribution, where nmax is
the sample size of the group with the larger variance, and nmin is the sample size of the
group with the smaller variance. Levene’s test (Levene, 1960) requires a transformation
of the original data prior to conducting the test. If Yij equals the original score on the response variable Y for case i in group j, the transformation is Y′ij = |Yij − Ȳj|. The null hypothesis of equal group variances is tested with an analysis of variance on these transformed scores.
The problem with these tests is that they are not robust. Research has shown that
they are liberal when the populations are not normal, Levene’s test less so but still noticeably (e.g. Conover et al., 1981). Common sense suggests that if one were to
employ the conditional decision rule, it would be better to condition the selection of
t test on a robust variance equality test. Two alternative and putatively robust tests of
equality of variance may improve the performance of the conditional decision rule: the
Brown–Forsythe test (Brown & Forsythe, 1974), and O’Brien’s test (O’Brien, 1981).
The Brown–Forsythe test is identical to Levene’s test, except that Ȳj in the
transformation is replaced with group j’s sample median. O’Brien’s test is also an
analysis-of-variance-based test that requires a transformation of the original data
(see O’Brien, 1981, for details).
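
As an illustration of these variance equality tests, the sketch below (ours, with assumed helper names) implements the transformation-based approach as a one-way ANOVA on absolute deviations from the group mean (Levene) or group median (Brown–Forsythe), together with the variance-ratio F test; O'Brien's transformation is more involved and is omitted. SciPy's built-in levene function with center='mean' or center='median' gives the same two transformation-based tests.

```python
# Sketch of the variance equality tests described above.
import numpy as np
from scipy import stats

def levene_type_test(y1, y2, center='mean'):
    # Levene's test uses |Y - group mean|; the Brown-Forsythe variant uses
    # |Y - group median|. Either way, an ordinary one-way ANOVA is then run
    # on the transformed scores.
    y1, y2 = np.asarray(y1, float), np.asarray(y2, float)
    loc = np.mean if center == 'mean' else np.median
    z1, z2 = np.abs(y1 - loc(y1)), np.abs(y2 - loc(y2))
    return stats.f_oneway(z1, z2)

def f_ratio_test(y1, y2):
    # Variance-ratio F test: larger sample variance over smaller, referred to the
    # F(nmax - 1, nmin - 1) distribution; normality is assumed. Returns the
    # upper-tail p-value.
    v1, v2 = np.var(y1, ddof=1), np.var(y2, ddof=1)
    (vmax, nmax), (vmin, nmin) = sorted([(v1, len(y1)), (v2, len(y2))], reverse=True)
    F = vmax / vmin
    return F, stats.f.sf(F, nmax - 1, nmin - 1)
```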
Research on the performance of these tests suggests that they are fairly robust
(Conover et al., 1981; Olejnik, 1988; Ramsey, 1994) and offer the advantage of being
simple to implement in any statistical package. But as Wilcox (2002) points out, the
Brown–Forsythe test does not actually test the null hypothesis of equal variances
because the expected value of the average absolute deviation from the group median is
not σ². Thus, it should not be considered a literal test of variance equality but rather a
more general test of equality of dispersion, as dispersion is quantified with the Brown–
Forsythe transformation. Furthermore, the Brown–Forsythe and O’Brien tests can be
invalid when the populations are distributed differently (Wilcox, 1990). Although we do
not examine the effect of differences in distribution shape in this study, it is important to
acknowledge that Brown–Forsythe and O’Brien’s tests, although favoured by some as
robust, do have their weaknesses. Nevertheless, because they are advocated as robust
tests in the statistics literature as well as popular statistics texts (e.g. Howell, 1996;
Keppel, 1991), we focus on these two tests as the ‘robust’ variance equality tests in the
simulations reported below.

1.4. Resampling-based unconditional alternatives to the conditional decision rule


Given their implementation in nearly every widely used statistical package, the pooled
and separate variance t tests would appear to be natural benchmarks against which to
compare the performance of the conditional decision rule. But they certainly are not the
only unconditional tests that might be useful to researchers. Many alternative methods
for comparing two means have been proposed, nearly all of which are based on the
computation of a p-value for a test statistic in reference to a theoretical sampling
distribution such as the t or F distribution. Although there are merits to many of these
approaches, when these tests do fail, they do so because the sampling distribution of the
test statistic does not conform to the reference sampling distribution used.
A conceptually appealing alternative is to abandon the use of theoretical sampling
distributions altogether and instead generate p-values empirically. There is a large
literature on resampling approaches to hypothesis testing that may offer advantages over
tests based on a theoretical sampling distribution, and we used several of these
resampling approaches as additional benchmarks for comparing the conditional
decision rule in the simulations reported here. Details of the various approaches to
resampling-based inference are well documented in a number of sources (e.g. Edgington,
1995; Efron & Tibshirani, 1998; Good, 2000; Lunneborg, 2000; Wilcox, 2003), so we
only briefly overview the logic of two of the more common resampling methods we
include in this study: bootstrapping and permutation tests.
When bootstrapping, the investigator computes some statistic g that quantifies the
effect. This obtained g could be a mean difference, a t statistic, or any other measure of
the effect. Once the obtained effect is quantified, the sample is transformed so that the
obtained result corresponds exactly to the null hypothesis. This transformed data set is
then used as a pseudopopulation from which many (say, 1,000) new samples with
replacement are taken. In each resample, the test statistic g* is computed, where g* is the same statistic as g but based on the resampled data set rather than the original data. The p-value for g is computed as the proportion of values of g* that are at least as deviant from the null hypothesis as g derived in the original data. Thus, the distribution of g* is used as an empirical estimation of the sampling distribution of g.
Permutation tests are conceptually similar, but they differ from bootstrapping in how
the empirical sampling distribution is derived. As with bootstrapping, the investigator
first computes some measure of the effect g in the obtained data. But instead of
repeatedly resampling the data with replacement as in bootstrapping, each case’s score
on the outcome variable Y is reassigned to a different case. Each time this reassignment
is undertaken, g* is computed. Repeating this reassignment procedure M times yields M values of g* that can be used as an empirical estimation of the sampling distribution of g. Depending on the size of the sample, the distribution of g* may reflect all possible reassignments of Ys to cases, or it may be only a random sample of the possible reassignments (in which case the test is known as an approximate permutation test). Once this estimated sampling distribution of g is generated, the p-value is computed as the proportion of the values of g* that are at least as deviant as g from the null hypothesis. However, unlike in bootstrapping, the obtained result g is placed in the distribution of g*. So with a permutation test, the p-value can never be smaller than 1/(M + 1).
In the simulations reported below, we included bootstrap and permutation-test analogues of the pooled variance and separate variance t tests as benchmarks against
which to compare the performance of the conditional decision rule, as well as a more
complicated permutation-based procedure described later. There is a literature
illustrating that at least the permutation version of the pooled variance t test does not
validly test the null hypothesis of equal means when sampling from populations with
heterogeneous variance, so there is reason to be sceptical of its performance as a
satisfactory unconditional test (e.g. Boik, 1987; Hayes, 2000; Romano, 1990), although it
is unknown how this approach will compare to the conditional decision rule.
We acknowledge that the methods for comparing two group means that we examine
here are only a few of the many attempted solutions to the Behrens–Fisher problem, any
of which could be used as benchmarks for assessing the usefulness of the conditional
decision rule. Our point is not to conduct an exhaustive comparison of the dozens of
proposed approaches to testing mean differences, nor is our goal to advocate a
particular test for comparing group means as the best test. Instead, our goal is restricted
to assessing the performance of the conditional decision rule based on different variance
equality tests and relative to some alternative unconditional tests, some of which do not
rely on theoretical sampling distributions.

2. Method
To assess the Type I error characteristics of conditional compared to unconditional tests,
a series of Monte Carlo simulations was conducted using the GAUSS program (Aptech
Systems Inc., 1996). In all, 4,116 different simulations were conducted, manipulating a
number of factors known to influence the performance of the tests examined here.
These factors included total sample size (N: 48, 80, 160), group 1 to group 2 sample size
ratio (n1:n2: 1:7, 1:3, 2:3, 1:1), group 1 to group 2 population variance ratio (θ: 0.10,
0.25, 0.50, 1, 2, 4, 10), population skewness, and population kurtosis.

2.1. Sampling from non-normal population distributions


Most of the samples were generated using Fleishman’s (1978) power method. The
power method is a polynomial transformation of the form
Y = a + bX + cX^2 + dX^3,   (2)
where X is a vector of random standard normal deviates and a, b, c and d are coefficients
that produce the desired skewness and kurtosis within sampling error. The Fleishman
method is a simple and efficient way of obtaining non-normal samples with a controlled
degree of population skewness and kurtosis. Forty-three different populations defined
by combinations of skewness and kurtosis were generated in the simulations, and
samples were taken from these populations. In addition to these 43 populations, we also
sampled from populations often used in simulation studies of this sort, including the
exponential with a parameter of 1, χ²(1), χ²(2), χ²(3), lognormal and uniform (0,1). The
exponential samples were generated by creating a vector of random uniform numbers
between 0 and 1 (U ) using the GAUSS rndu function, and then applying the
transformation Y = −ln(U). The χ²(df) samples were generated by creating df vectors of standard normal deviates using the GAUSS rndn function, squaring them, and summing the squares. The lognormal samples were created by exponentiating a vector of standard normal deviates (Y = e^Z). The uniform samples were generated with the
GAUSS rndu function. Adding these six distributions to the 43 populations generated
with the Fleishman method yielded 49 different populations for the simulation. These 49
combinations of population skewness and kurtosis are listed in the Appendix along with
the coefficients used when generating samples with the Fleishman power method.
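
The sketch below (ours; the Fleishman coefficients shown are placeholders, not values from the Appendix) illustrates the power-method transformation and NumPy equivalents of the six additional distributions, standing in for the GAUSS rndu and rndn routines.

```python
# Illustrative data generation for the simulations (NumPy stand-ins for the
# GAUSS code; the Fleishman coefficients below are placeholders only).
import numpy as np

rng = np.random.default_rng(2024)

def fleishman_sample(n, a, b, c, d):
    # Power method: Y = a + bX + cX^2 + dX^3, with X standard normal.
    x = rng.standard_normal(n)
    return a + b * x + c * x**2 + d * x**3

y_power = fleishman_sample(80, a=-0.1, b=0.95, c=0.1, d=0.01)   # placeholder coefficients

n = 80
y_exponential = -np.log(rng.uniform(size=n))                 # exponential(1) via -ln(U)
y_chisq3 = np.sum(rng.standard_normal((n, 3)) ** 2, axis=1)  # chi-square with 3 df
y_lognormal = np.exp(rng.standard_normal(n))                 # lognormal, Y = exp(Z)
y_uniform = rng.uniform(size=n)                              # uniform(0, 1)
```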

2.2. Variance transformation


To simulate sampling from distributions with different variances, the variance of the first
sample was transformed by multiplying all observations in the sample by the square root
of 0.10, 0.25, 0.50, 1, 2, 4 or 10, after first subtracting the population mean
from each of the scores in each sample.

2.3. Statistical tests


Once the two samples were generated, the null hypothesis of equal population means
was tested at α = .05 using various methods. Two of the tests were unconditional t tests.
That is, in each pair of samples we tested the null hypothesis of equal population means
using the pooled variance t test as well as the separate variance t test with the
Welch–Satterthwaite df approximation, irrespective of the outcome of a variance
equality test. For expository convenience, we will refer to these two tests as UPT
(unconditional pooled t) and UST (unconditional separate t). Four of the methods were
conditional tests, where the use of the separate or pooled variance t test was
conditioned on the outcome of either the F-ratio test, Levene’s test, O’Brien’s test, or the
Brown–Forsythe test. The null hypothesis of variance equality was tested using α = .05. We refer to these conditional tests throughout as CF (conditioned on F), CL
(conditioned on Levene’s test), CO (conditioned on O’Brien’s test) and CB (conditioned
on the Brown–Forsythe test).
Two additional unconditional tests implemented were bootstrap versions of the
pooled t and separate t tests using the algorithm described by Efron and Tibshirani
(1998, p. 224). Let Yi1, i = 1, 2, ..., n1, and Yi2, i = 1, 2, ..., n2, be the two samples, Ȳ1 and Ȳ2 be the two sample means, and Ȳ be the mean of the combined sample. Let tpooled and tseparate be the absolute values of the pooled and separate variance t statistics computed in the sample. The first step in generating the bootstrapped sampling distribution for t was to transform the two samples by setting Ỹi1 = Yi1 − Ȳ1 + Ȳ and Ỹi2 = Yi2 − Ȳ2 + Ȳ, so that the two samples both have mean Ȳ. We then took 1,000 random samples with replacement of size n1 and n2 from the transformed samples. For each bootstrap resample, we calculated t*pooled and t*separate, but using the resampled data and again ignoring the sign of t. The bootstrap p-values for tpooled and tseparate were derived by dividing the number of times t*pooled (or t*separate) was greater than or equal to tpooled (or tseparate) by the number of bootstrap resamples (1,000). We will refer to these tests as unconditional bootstrap pooled variance (UBP) and unconditional bootstrap separate variance (UBS).
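
A minimal Python sketch of this bootstrap algorithm follows (ours, not the authors' GAUSS implementation); setting equal_var=False gives UBS and equal_var=True gives UBP.

```python
# Sketch of the bootstrap test described above: shift both samples to the
# combined mean so that H0 holds, resample with replacement within groups,
# and compare the absolute resampled t statistics with the observed one.
import numpy as np
from scipy import stats

def bootstrap_t_pvalue(y1, y2, equal_var=False, B=1000, seed=0):
    rng = np.random.default_rng(seed)
    y1, y2 = np.asarray(y1, float), np.asarray(y2, float)
    t_obs = abs(stats.ttest_ind(y1, y2, equal_var=equal_var).statistic)
    grand_mean = np.concatenate([y1, y2]).mean()
    y1_null = y1 - y1.mean() + grand_mean   # both transformed samples now have
    y2_null = y2 - y2.mean() + grand_mean   # the combined mean
    count = 0
    for _ in range(B):
        b1 = rng.choice(y1_null, size=len(y1), replace=True)
        b2 = rng.choice(y2_null, size=len(y2), replace=True)
        if abs(stats.ttest_ind(b1, b2, equal_var=equal_var).statistic) >= t_obs:
            count += 1
    return count / B
```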
Two additional unconditional tests examined were permutation versions of the pooled and separate variance t tests. In these simulations both pooled variance and separate variance permutation tests were implemented using the approximate permutation approach. To derive the p-values, we computed tpooled and tseparate in the original sample. We then generated 999 new resamples by randomly reassigning the scores in the sample into two groups of size n1 and n2 and computing t*pooled and t*separate in each of the 999 permuted data sets. The approximate p-value of the pooled t test was computed as 1 plus the number of times over the 999 resamples that t*pooled exceeded tpooled, divided by
1,000. The p-value for tseparate was computed using the same logic. These tests are
denoted throughout as unconditional permutation pooled variance (UPP) and
unconditional permutation separate variance (UPS).
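
The approximate permutation versions can be sketched in the same style (again ours); equal_var=True corresponds to UPP and equal_var=False to UPS.

```python
# Sketch of the approximate permutation test described above: pool the scores,
# randomly reassign them to groups of the original sizes, and recompute t.
import numpy as np
from scipy import stats

def permutation_t_pvalue(y1, y2, equal_var=False, M=999, seed=0):
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([np.asarray(y1, float), np.asarray(y2, float)])
    n1 = len(y1)
    t_obs = abs(stats.ttest_ind(y1, y2, equal_var=equal_var).statistic)
    count = 0
    for _ in range(M):
        perm = rng.permutation(pooled)
        if abs(stats.ttest_ind(perm[:n1], perm[n1:], equal_var=equal_var).statistic) >= t_obs:
            count += 1
    # The observed result is counted as one member of the reference
    # distribution, so the p-value can never fall below 1/(M + 1).
    return (count + 1) / (M + 1)
```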
The final unconditional test we used was based on a permutation procedure
introduced by Mielke and Berry (1994). They proposed a permutation test using the
average of the nj(nj − 1)/2 within-group pairwise distances between values of Yi in group j. This average, ξj, is computed in each group, and the test statistic is a sample-size weighted sum of the average distances across the j groups: δ = (n1ξ1 + n2ξ2)/(n1 + n2). We used the squared Euclidean distance because this
directly translates into a test of the equality of means (see Mielke & Berry, 1994). As with
other kinds of permutation tests, one can either generate an exact null distribution of
the test statistic by generating all possible permutations of the n1 þ n2 values of Yi
across groups or generate an approximate null distribution based on random sampling
from all possible permutations. But Mielke, Berry, and Johnson (1976) developed a
method we used here of approximating the p-value using the first three moments of the
null distribution and the Pearson Type III distribution. Technical details of this
complicated method can be found in Mielke and Berry (1994). We refer to this throughout as
the UMB test (unconditional Mielke–Berry test).
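
The δ statistic itself is simple to compute; the sketch below (ours) uses squared Euclidean distances and, for simplicity, approximates the p-value by random permutation rather than by the Pearson Type III moment approximation that the authors used.

```python
# Sketch of the Mielke-Berry delta statistic with a permutation p-value.
import numpy as np
from itertools import combinations

def mb_delta(y1, y2):
    def xi(y):  # mean squared distance over the n(n-1)/2 within-group pairs
        return np.mean([(a - b) ** 2 for a, b in combinations(y, 2)])
    n1, n2 = len(y1), len(y2)
    return (n1 * xi(y1) + n2 * xi(y2)) / (n1 + n2)

def mb_pvalue(y1, y2, M=999, seed=0):
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([np.asarray(y1, float), np.asarray(y2, float)])
    n1 = len(y1)
    d_obs = mb_delta(y1, y2)
    count = 0
    for _ in range(M):
        perm = rng.permutation(pooled)
        # Small delta (tight within-group clustering) is evidence against H0,
        # so permuted deltas at or below the observed one are counted.
        if mb_delta(perm[:n1], perm[n1:]) <= d_obs:
            count += 1
    return (count + 1) / (M + 1)
```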
For each combination of sample size, sample size ratio, variance ratio, and
population shape, the procedures described here were repeated 2,000 times and the proportion of rejections of the equal means null hypothesis over the 2,000 replications
was recorded as an estimate of the Type I error rate.
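
Putting the pieces together, one simulation cell looks roughly like the sketch below (a simplified Python analogue of the GAUSS simulations, using a normal population and only the two unconditional t tests for brevity); the multiplication by √θ is the variance transformation described in Section 2.2.

```python
# Simplified sketch of one simulation cell: draw two samples under H0, rescale
# group 1 by sqrt(theta) to induce variance inequality, apply the tests, and
# record the proportion of rejections at alpha = .05 over 2,000 replications.
import numpy as np
from scipy import stats

def type1_error_rates(n1=10, n2=70, theta=4.0, reps=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    rejections = {'UPT': 0, 'UST': 0}
    for _ in range(reps):
        y1 = rng.standard_normal(n1) * np.sqrt(theta)   # centred scores times sqrt(theta)
        y2 = rng.standard_normal(n2)
        if stats.ttest_ind(y1, y2, equal_var=True).pvalue < alpha:
            rejections['UPT'] += 1
        if stats.ttest_ind(y1, y2, equal_var=False).pvalue < alpha:
            rejections['UST'] += 1
    return {test: n_rej / reps for test, n_rej in rejections.items()}

print(type1_error_rates())   # the pooled test (UPT) should be noticeably liberal here
```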

3. Results
When describing the results of simulation studies, the typical practice is to provide
tables of rejection rates for all combinations of the factors manipulated and to make
generalizations about the performance of the test based on how it performs in those
conditions simulated. However, it would be nearly impossible for us to succinctly
summarize the performance of the tests we examined here with a few tables because of
the sheer number of unique combinations of variance inequality, sample size inequality,
sample size, skewness, and kurtosis we generated in the simulations.
Instead, we ask the reader first to allow us to ignore the many factors that we
manipulated and then ask about the relative performance of the methods overall,
collapsing across one or more of the manipulated factors. This allows us to determine
whether any test or tests tend to do especially well across situations, and whether one of
those better tests is a conditional test. Having answered the question as to how
conditional and unconditional tests perform relative to each other across conditions, we
then more specifically examine the conditions in which some tests are superior to
others, as well as the conditions in which many or most fail.
We first examine the Type I error rates of the methods by ignoring the factors we manipulated, asking which test is best overall. A researcher may
embark on a study and have little information a priori as to how equal the group sizes
and variances are likely to be or how the data will be distributed. What test or tests can
be trusted to produce a valid test across the majority of situations that a researcher might
confront across a series of studies? And is one or more of those tests a conditional test?
To answer this question, we computed the mean Type I error rate for each testing
procedure across the 4,116 simulated conditions. We also assessed the proportion of
conditions simulated in which the test was valid. We define a valid test in three ways.
Formally, a test is valid at α = .05 if the Type I error rate of the test is no greater than .05.
Because the estimated Type I error rate in any condition is subject to sampling
variability, we define a test as valid in a condition if its estimated Type I error rate in that
condition was no greater than .060. Each Type I error rate was estimated based on 2,000
replications. If the true Type I error rate is .05 or less, then we would expect the estimate to exceed .060 in 2,000 replications no more than 5% of the time by chance. The second definition of valid we will call ‘practical validity’. Others have argued that a test can be considered valid if the Type I error rate is sufficiently close to α, where ‘sufficiently close’ is defined subjectively. We use .075 as our definition of ‘sufficiently close’ (Bradley, 1978). This is a subjective rule, but it is sensitive to the fact that few would worry about a test rejecting at a rate slightly higher than .05, whereas if the test goes much above .075, then it does not control the Type I error rate as well as desired. This
definition, however, ignores that a test may be valid but highly conservative. Thus, we
also use a third definition of validity, defining a test as valid if its Type I error rate is close
to .05 in either direction, defining ‘close’ as between .025 and .075 (Bradley, 1978).
A third means of gauging the performance of a test across conditions is to assess the
variability in the Type I error rate. A good test would be consistent in its false rejection
rate across conditions. In other words, better tests should have smaller variability in the
Type I error rate across conditions. So we computed the standard deviation of the Type I
error rates across conditions and also computed percentile scores for the Type I error
rates to examine the variability in Type I errors across conditions.
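
For concreteness, the sketch below (ours) computes these summaries from a vector of estimated Type I error rates, one entry per simulated condition.

```python
# Sketch: summarize a test's estimated Type I error rates across conditions,
# using the three validity definitions and the percentile summaries of Table 1.
import numpy as np

def summarize(rates):
    rates = np.asarray(rates, float)
    return {
        'mean': rates.mean(),
        'sd': rates.std(ddof=1),
        'percentiles_10_25_50_75_90': np.percentile(rates, [10, 25, 50, 75, 90]),
        'valid_1_pct': 100 * np.mean(rates <= 0.060),                       # no greater than .060
        'valid_2_pct': 100 * np.mean(rates <= 0.075),                       # 'practical validity'
        'valid_3_pct': 100 * np.mean((rates >= 0.025) & (rates <= 0.075)),  # between .025 and .075
    }
```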
These indices of test performance can be found in Table 1. Examining first the mean
(M) Type I error rate, three tests stand out from the others. Of the four conditional tests,
CO had a Type I error rate closest to .05 on average. In addition, CO was valid a greater
proportion of the time compared to the other conditional tests. However, two unconditional tests performed as well as or better. UBS and UMB had Type I error rates no greater than .05 on average, and both produced validity
percentages that exceeded CO. Indeed, both of these unconditional tests were
practically valid in over 97% of the conditions simulated.
A close examination of the distribution of the Type I error rates shows that UBS and
UMB were superior to CO as well as any of the other unconditional tests. Notice that
10% of the estimated Type I error rates for CO exceeded .069, with the other conditional tests having even higher 90th percentiles. In contrast, 90% of the Type I
error rates were no greater than .058 when using the UBS or UMB. But looking even
more closely at the entire distribution of Type I error rates, it is clear that UBS is superior
to all methods on the grounds that the Type I error rates tend to be less extreme, more
tightly clustered, and closer to .05. Less than 10% of the error rates were less than .042
using UBS, compared to .026 for CO. UMB, however, was quite conservative, with 50%
of the Type I error rates less than .013 and yielding a Type I error rate between .025 and
.075 in only 36.2% of the conditions simulated. And UBS was the least variable of all the
tests in Type I error rates, which clustered more tightly around not only its mean error
rate relative to the other tests, but also with the Type I error rates usually quite close to
.05 (in 97% of conditions).
Although some of the differences between the overall performance of these methods
are small and subtle, it is fairly clear looking at Table 1 that the conditional decision rule,
regardless of which variance equality test is used, does not control the Type I error rate
any better than several unconditional alternatives. In fact, a few of the unconditional
alternatives tended to do better.

Table 1. Type I error rates and validities across all 4,116 conditions

                            M      10%   25%   50%   75%   90%   SD     Valid %1  Valid %2  Valid %3
Conditional tests
F-ratio (CF)                0.055  .042  .046  .051  .057  .071  0.016  81.4      92.3      92.3
Levene (CL)                 0.054  .041  .046  .051  .058  .073  0.017  81.2      91.1      90.1
O’Brien (CO)                0.050  .026  .043  .050  .057  .069  0.020  81.9      93.0      83.8
Brown–Forsythe (CB)         0.055  .039  .046  .051  .058  .077  0.020  80.1      89.5      87.7
Unconditional tests
Pooled t (UPT)              0.075  .006  .027  .052  .092  .204  0.078  65.7      71.2      48.0
Separate t (UST)            0.053  .044  .048  .051  .056  .061  0.011  89.0      97.1      97.1
Bootstrap pool. t (UBP)     0.059  .044  .049  .053  .054  .062  0.017  71.1      87.9      87.9
Bootstrap sep. t (UBS)      0.050  .042  .046  .050  .054  .058  0.010  93.7      98.1      98.0
Permutation pool. t (UPP)   0.075  .006  .028  .052  .093  .203  0.077  64.8      71.0      47.9
Permutation sep. t (UPS)    0.052  .041  .047  .052  .057  .062  0.011  86.0      97.5      95.9
Mielke–Berry (UMB)          0.024  .000  .002  .013  .049  .058  0.025  93.0      97.9      36.2

1 % of conditions where test yielded Type I error rate less than .06.
2 % of conditions where test yielded Type I error rate no greater than .075.
3 % of conditions where test yielded Type I error rate between .025 and .075.

Of the unconditional alternatives, UBS and UMB maintained Type I error rates closest to or less than .05 on average, were valid more
often, and were less likely to have large Type I error rates (as evidenced by the smaller
90th percentile scores). However, UMB tended to be conservative. It is also worth
noting how well the much simpler UST did across situations. Although its average Type I
error rate was not much different from any of the conditional tests, it was more
consistent and was less likely to produce extreme Type I error rates. Furthermore, it was
valid in more conditions than any of the conditional tests, and it is much easier to
compute than UBS or UMB. As such, it should not be discounted at this stage as a serious
competitor to the conditional decision rule.
The results presented in Table 1 collapse across two factors we manipulated that are
known to affect the validity of the t test: variance inequality and sample size inequality.
Tables 2–5 detail how these 11 tests performed as a function of variance inequality and
sample size inequality, defining a test as valid using the ‘practical validity’ definition (we
will use this definition of validity in all remaining tables and discussions thereof). These
tables tabulate the percent of populations sampled, out of 49 (because there were 49
different populations simulated), in which the test was invalid (i.e. estimated Type I error
rate above .075). Table 2 singles out the 1,029 simulated conditions with equal group
sample sizes. As can be seen, all of the tests did well when the sample sizes were equal.
In that case, conditioning the choice of t test on a variance equality test serves little
purpose. Furthermore, one can use any of the unconditional tests we examined here.
They all tend to produce a valid test with a few exceptions we do not detail at this stage.
Table 3 focuses on the 588 conditions where the population variances were equal. As
can be seen, when the groups were equally variable, all tests tended to do relatively well,
although UBP, UBS and UST did not do as well as most of the conditional tests when
there were large discrepancies between the group sample sizes and the overall sample
size was relatively small. The researcher might be well advised to avoid the use of these
tests in such situations. However, this advice presents a paradox, in that the researcher
cannot tell definitively whether the group variances are equal without first conducting a
variance equality test, which automatically moves him or her down the path of a
conditional testing strategy.
Table 4 documents the performance of these tests in the 1,323 conditions that
combined sample size inequality and variance inequality such that the smaller group has
the smaller variance (θ < 1). As can be seen, the tests performed well, with the
exception of UBP, which tended to fail far more than the other tests when sample size
inequality was combined with moderate variance inequality. With that exception, it
makes little difference whether a conditional or unconditional test is used when the
group variances are different and the smaller group is less variable than the larger group.
The biggest discrepancy between the methods occurred when the variances were
different and the smaller group was more variable than the larger group (occurring in
1,323 of the simulations). As can be seen in Table 5, some methods performed relatively
well, while others performed miserably. In these conditions, the UPT and UPP should be
avoided, as those tests were invalid in nearly all 49 of the combinations of population
skewness and kurtosis that we simulated. The best performer was UBS, followed by
UMB, UPS and UST. All of the conditional tests clearly were better than the
unconditional use of a pooled t test (whether using the t distribution or a permutation
distribution), but not nearly as good as the unconditional tests based on the separate
variance t statistic. This is an interesting and important finding, because it is in these
situations that researchers might be especially likely to be attracted to an unconditional
test given knowledge of the literature that documents the failure of the pooled t test.

Table 2. Percentage of sampled populations (out of 49) where test was invalid; equal sample sizes

Variance ratio (θ)

N 0.1 0.25 0.5 1 2 4 10 Total % (of 1,029)

Conditional tests
F (CF) 48 4 0 0 0 0 4 4
80 2 0 0 0 0 0 2 <1
160 0 0 0 0 0 0 0
Levene (CL) 48 4 0 0 0 0 4 4
80 2 0 0 0 0 0 2 <1
160 0 0 0 0 0 0 0
O’Brien (CO) 48 4 0 0 0 0 4 6
80 2 0 0 0 0 0 2 <1
160 0 0 0 0 0 0 0
Brown–Forsythe (CB) 48 4 0 0 0 0 4 4
80 2 0 0 0 0 0 2 <1
160 0 0 0 0 0 0 0

Unconditional tests
Pooled t (UPT) 48 4 4 0 0 0 4 6
80 2 0 0 0 0 0 4 1
160 0 0 0 0 0 0 2
Separate t (UST) 48 4 0 0 0 0 4 4
80 2 0 0 0 0 0 2 <1
160 0 0 0 0 0 0 0
Bootstrap pooled t (UBP) 48 4 0 0 0 0 0 2
80 2 0 0 0 0 0 2 <1
160 0 0 0 0 0 0 0
Bootstrap separate t (UBS) 48 4 0 0 0 0 0 2
80 2 0 0 0 0 0 2 <1
160 0 0 0 0 0 0 0
Permutation pooled t (UPP) 48 4 2 0 0 0 4 6
80 2 2 0 0 0 0 4 1
160 0 0 0 0 0 0 2
Permutation separate t (UPS) 48 4 2 0 0 0 4 6
80 2 2 0 0 0 0 4 1
160 0 0 0 0 0 0 2
Mielke–Berry (UMB) 48 6 2 0 0 0 4 6
80 2 4 0 0 0 0 6 2
160 0 0 0 0 0 0 2

These findings suggest that several alternative unconditional tests are more likely to
produce a valid test.
However, there is something unsatisfying about this approach to examining the
results. Although the results tabulated in Tables 1–5 provide information about which
tests tend to be valid across conditions, we cannot specify very precisely the conditions
in which any of these tests fail, because we have not considered the effect of population
skewness and kurtosis on the Type I error rates. Furthermore, these tables provide no
information about the circumstances in which the tests all fail, nor do they identify the
possible conditions in which a conditional test may be the only valid testing strategy.

Table 3. Percentage of sampled populations (out of 49) where test was invalid; equal population
variances (θ = 1)

n1:n2

N 1:7 1:3 2:3 1:1 Total % (out of 588)

Conditional tests
F (CF) 48 10 0 0 0
80 8 0 0 0 2
160 4 0 0 0
Levene (CL) 48 0 0 0 0
80 0 0 0 0 0
160 0 0 0 0
O’Brien (CO) 48 0 0 0 0
80 0 0 0 0 0
160 0 0 0 0
Brown–Forsythe (CB) 48 0 0 0 0
80 0 0 0 0 0
160 0 0 0 0

Unconditional tests
Pooled t (UPT) 48 0 0 0 0
80 0 0 0 0 0
160 0 0 0 0
Separate t (UST) 48 16 0 0 0
80 8 0 0 0 2
160 4 0 0 0
Bootstrap pooled t (UBP) 48 0 0 0 0
80 76 0 0 0 7
160 16 0 0 0
Bootstrap separate t (UBS) 48 12 0 0 0
80 6 0 0 0 2
160 4 0 0 0
Permutation pooled t (UPP) 48 0 0 0 0
80 0 0 0 0 0
160 0 0 0 0
Permutation separate t (UPS) 48 0 0 0 0
80 0 0 0 0 0
160 0 0 0 0
Mielke–Berry (UMB) 48 0 0 0 0
80 0 0 0 0 0
160 0 0 0 0

Finally, they provide no information about how frequently conditional and unconditional tests agree, how often they disagree, and which tends to produce the
correct decision when they disagree. We address these issues next.
Table 4. Percentage of sampled populations (out of 49) where test was invalid; unequal sample sizes, smaller group with smaller variance (θ < 1)

Variance ratio (θ)

0.10 0.25 0.50



n1:n2 1:7 1:3 2:3 1:7 1:3 2:3 1:7 1:3 2:3 Total % (out of 1,323)

Conditional tests N
F (CF) 48 0 0 2 0 0 0 8 0 0
80 0 0 0 0 0 0 6 0 0 <1
160 0 0 0 0 0 0 2 0 0
Levene (CL) 48 0 0 2 0 0 0 2 0 0
80 0 0 0 0 0 0 2 0 0 <1
160 0 0 0 0 0 0 2 0 0
O’Brien (CO) 48 0 0 0 0 0 0 0 0 0
80 0 0 0 0 0 0 0 0 0 0
160 0 0 0 0 0 0 0 0 0
Brown–Forsythe (CB) 48 0 0 0 0 0 0 0 0 0
80 0 0 0 0 0 0 0 0 0 0
160 0 0 0 0 0 0 0 0 0

Unconditional tests
Pooled t (UPT) 48 0 0 0 0 0 0 0 0 0
80 0 0 0 0 0 0 0 0 0 0
160 0 0 0 0 0 0 0 0 0
Separate t (UST) 48 0 0 4 0 0 0 6 0 0
80 0 0 0 0 0 0 4 0 0 <1
160 0 0 0 0 0 0 0 0 0
Bootstrap pooled t (UBP) 48 0 0 0 27 0 0 80 0 0
80 0 0 0 4 0 0 41 0 0 6
160 0 0 0 0 0 0 0 0 0
Bootstrap separate t (UBS) 48 0 0 0 0 0 0 4 0 0
80 0 0 0 0 0 0 2 0 0 <1


160 0 0 0 0 0 0 0 0 0
Permutation pooled t (UPP) 48 0 0 0 0 0 0 0 0 0
80 0 0 0 0 0 0 0 0 0 0
160 0 0 0 0 0 0 0 0 0
Permutation separate t (UPS) 48 0 0 4 0 0 0 0 0 0
80 0 0 0 0 0 0 0 0 0 <1
160 0 0 0 0 0 0 0 0 0
Mielke–Berry (UMB) 48 0 0 0 0 0 0 0 0 0
80 0 0 0 0 0 0 0 0 0 0
160 0 0 0 0 0 0 0 0 0

Table 5. Percentage of sampled populations (out of 49) where test was invalid; unequal sample sizes, smaller group with larger variance (θ > 1)

Variance ratio (θ)

2 4 10



n1:n2 1:7 1:3 2:3 1:7 1:3 2:3 1:7 1:3 2:3 Total % (out of 1,323)

Conditional tests N
F (CF) 48 90 35 0 100 20 6 51 12 10
80 71 20 0 55 16 2 14 6 4 22
160 39 2 0 14 4 0 10 4 2
Levene (CL) 48 82 41 0 100 33 8 82 14 10
80 78 24 0 96 16 2 16 8 4 26
160 51 0 0 14 4 0 10 4 2
O’Brien (CO) 48 73 27 0 100 37 10 43 14 12
80 71 10 0 33 16 2 16 10 8 20
160 27 0 0 18 6 2 10 4 2
Brown–Forsythe (CB) 48 90 67 0 100 92 10 100 14 10
80 88 33 0 100 18 2 18 8 4 32
160 71 0 0 18 4 2 10 4 2
Unconditional tests
Pooled t (UPT) 48 100 100 4 100 100 92 100 100 100
80 100 98 8 100 100 88 100 100 100 89
160 100 100 10 100 100 92 100 100 100
Separate t (UST) 48 16 8 0 12 10 6 14 10 10
80 12 4 0 12 6 2 12 6 4 7
160 8 2 0 6 2 0 10 4 2
Bootstrap pooled t (UBP) 48 98 10 0 100 14 4 100 14 6
80 78 4 0 86 6 0 71 8 2 24
160 20 0 0 16 2 0 16 4 0
Bootstrap separate t (UBS) 48 14 6 0 12 8 4 8 4 4
80 10 4 0 10 4 0 6 4 2 4
160 4 2 0 4 2 0 4 2 0


Permutation pooled t (UPP) 48 100 100 8 100 100 94 100 100 100
80 100 100 12 100 100 94 100 100 100 89
160 100 100 12 100 100 92 100 100 100
Permutation separate t (UPS) 48 6 6 0 16 10 8 37 14 10
80 4 2 0 12 6 2 16 8 6 7
160 0 0 0 4 2 0 6 4 2
Mielke–Berry (UMB) 48 4 0 0 27 0 0 82 0 0
80 0 0 0 4 0 0 27 0 0 5
160 0 0 0 0 0 0 0 0 0

Table 6. Pairwise comparisons between conditional and unconditional test performance across all 4,116 conditions

Conditional  Unconditional  Agree  Both valid  Both invalid  Conditional is the        Unconditional is the
test         test           (%)    (%)         (%)           only valid test (%) (C)   only valid test (%) (UC)   UC/C

CF UPT 77.9 70.7 7.2 21.6 0.5 <0.1
UST 94.8 92.1 2.7 0.2 5.0 25.0
UBP 92.7 86.4 6.2 5.9 1.5 0.3
UBS 94.0 92.2 1.8 0.1 5.9 59.0
UPP 77.6 70.5 7.2 21.8 0.5 <0.1
UPS 94.3 92.0 2.3 0.3 5.4 18.0
UMB 92.4 91.3 1.1 1.0 6.6 6.6
CL UPT 79.9 71.1 8.8 20.0 0.1 <0.1
UST 92.1 90.2 1.9 1.0 7.0 7.0
UBP 91.7 85.4 6.3 5.8 2.6 0.4
UBS 91.6 90.4 1.2 0.7 7.7 11.0
UPP 79.6 70.9 8.8 20.3 0.1 <0.1
UPS 93.0 90.8 2.2 0.4 6.7 16.8
UMB 91.9 90.5 1.4 0.7 7.5 10.7
CO UPT 78.3 71.2 7.0 21.7 0 0
UST 93.8 91.9 1.8 1.0 5.2 5.2
UBP 91.5 86.2 5.3 6.8 1.7 0.3
UBS 93.3 92.2 1.1 0.8 5.9 7.4
UPP 78.0 71.0 7.0 22.0 0 0
UPS 94.8 92.6 2.2 0.3 4.8 16.0
UMB 92.9 91.9 1.0 1.1 6.0 5.5
CB UPT 81.7 71.2 10.5 18.2 0 <0.1
UST 90.4 88.5 1.9 1.0 8.6 6.0
UBP 90.9 84.2 6.8 5.3 3.7 0.9
UBS 90.0 88.8 1.2 0.7 9.3 11.8
UPP 81.5 71.0 10.5 18.5 0 <0.1
UPS 91.3 89.1 2.2 0.4 8.3 6.1
UMB 90.6 89.0 1.6 0.5 8.9 17.0

In Table 6 we provide a pairwise comparison of the conditional tests and the unconditional tests, still collapsing across the 49 combinations of skewness and kurtosis. Many of the columns in this table we provide only as information and do not discuss. Most important is the information in the last three columns. Columns 4 and 5 tabulate the percentage of simulated conditions in which the conditional test in the pair
was the only valid test, and the percentage of conditions in which the unconditional test
was the only valid test, respectively. Column 6 (UC/C) quantifies their relative
performance as the ratio of these two columns. A ratio less than one signifies that when
the two tests disagreed, the conditional test was valid more frequently, whereas a ratio
greater than 1 indicates that the unconditional test tended to be the sole valid test.
There is a consistent pattern. Whenever a conditional test was pitted against an
unconditional test that relies on a pooling of variances in the test statistic (UPT, UBP,
UPP), the conditional test tended to be valid more frequently. In contrast, whenever a
conditional test competed against an unconditional test that does not use a pooling of
variances in the test statistic (UST, UBS, UPS, UMB), the unconditional test was valid
more often. This table illustrates how poor UPT is compared to any of the conditional
testing strategies in controlling Type I errors. It also illustrates that the conditional tests
are rarely better than any of the unconditional separate variance tests in keeping Type I
error rates at an acceptable level. And again, we see evidence in this table of the
superiority of UBS, UPS and UMB relative to any of the conditional tests, as well as the
good performance of UST across situations compared to the conditional tests.
So far we have focused on the relative performance of the conditional and
unconditional testing strategies collapsing across the 49 variations of population
skewness and kurtosis. Although there are some important differences between the
Type I error rates of conditional and unconditional testing strategies throughout the
results thus far, it is clear from Table 5 that the largest differences between the methods
occurred when both the sample sizes and variances differed, and when the smaller
group had the largest variance. To better understand the nature of the differences in
performance in those specific conditions, in Tables 7 and 8 we provide the mean Type I
error rates for all the methods studied here when sampling from (1) symmetrical
populations with negative kurtosis (five conditions), (2) symmetrical populations with
positive kurtosis (six conditions), (3) asymmetrical populations with zero kurtosis
(three conditions), (4) asymmetrical populations with negative kurtosis (seven
conditions), and (5) asymmetrical populations with positive kurtosis (27 conditions).
Focusing first on the symmetrical populations (Table 7), the basic pattern of findings
we have described thus far appears, with the unconditional tests that do not rely on the
pooling of variances keeping good control over Type I error rates as well as or better
than any of the conditional tests. UMB was generally conservative, and the
unconditional pooled variance t test (UPT) and its resampling variants (UBP and UPP)
were liberal, although the liberalness of UBP decreased with increasing sample size.
The conditional tests generally performed similarly, although conditioning on O’Brien’s
test of variance equality (CO) produced a slightly less liberal test when sampling from
symmetrical populations with negative kurtosis.
Examining the relative performance of the tests when sampling from asymmetrical
populations (Table 8), the same basic pattern is found, with UST, UBS and UPS equalling
or outperforming the conditional tests and generally keeping the Type I error rate, on
average, near .05. UMB was either liberal or conservative depending on the
discrepancies in sample size and variance, and the pooled variance tests (UPT, UBP, UPP) were generally liberal, sometimes dramatically so.


Table 7. Mean Type I error rates: unequal sample sizes, smaller group with larger variance (θ > 1), symmetrical populations (skewness = 0)

Conditional tests Unconditional tests

Kurt. θ n1:n2 N CF CL CO CB UPT UST UBP UBS UPP UPS UMB

Neg. 2 1:7 48 .115 .118 .104 .124 .130 .059 .109 .048 .131 .058 .058
80 .107 .108 .088 .114 .132 .055 .089 .049 .132 .056 .038
160 .076 .079 .062 .084 .128 .050 .065 .049 .128 .052 .017
1:3 48 .084 .084 .078 .089 .103 .054 .068 .052 .102 .057 .035
80 .077 .077 .069 .082 .193 .054 .064 .053 .102 .056 .020
160 .057 .058 .053 .060 .099 .051 .057 .052 .098 .053 .014
2:3 48 .060 .060 .059 .061 .067 .052 .053 .050 .067 .053 .041
80 .058 .057 .056 .059 .071 .053 .054 .052 .072 .054 .032
160 .048 .049 .048 .049 .064 .047 .049 .049 .065 .048 .018
4 1:7 48 .121 .132 .094 .155 .234 .058 .115 .041 .232 .064 .101
80 .071 .082 .063 .091 .231 .055 .083 .044 .230 .057 .067
160 .051 .053 .050 .054 .227 .050 .064 .046 .228 .052 .031
1:3 48 .066 .073 .063 .081 .161 .054 .067 .048 .161 .060 .046
80 .054 .057 .054 .059 .155 .053 .062 .052 .155 .055 .027
160 .048 .048 .048 .048 .148 .048 .052 .048 .149 .050 .009
2:3 48 .055 .057 .055 .058 .089 .054 .055 .052 .089 .058 .048
80 .053 .053 .053 .053 .086 .052 .054 .053 .085 .053 .032
160 .049 .049 .049 .049 .084 .049 .051 .050 .084 .050 .025
10 1:7 48 .067 .073 .061 .087 .361 .057 .101 .038 .352 .068 .127
80 .058 .058 .057 .059 .359 .057 .081 .047 .353 .063 .083
160 .050 .050 .050 .050 .341 .050 .060 .047 .339 .053 .038
1:3 48 .052 .053 .052 .054 .210 .052 .058 .044 .209 .058 .046
80 .050 .050 .050 .050 .213 .050 .057 .050 .212 .055 .025
160 .048 .048 .048 .048 .205 .048 .051 .048 .205 .050 .009
2:3 48 .056 .056 .056 .056 .109 .056 .056 .052 .109 .061 .051
80 .052 .052 .052 .052 .104 .052 .054 .052 .103 .056 .034
160 .055 .055 .055 .055 .101 .055 .056 .055 .100 .058 .026
Pos. 2 1:7 48 .086 .090 .079 .099 .123 .049 .093 .041 .119 .057 .020
80 .075 .081 .075 .088 .122 .043 .072 .037 .122 .049 .008
160 .065 .073 .070 .077 .125 .046 .060 .044 .122 .049 .003
1:3 48 .068 .074 .074 .079 .098 .047 .058 .042 .097 .054 .011
80 .065 .072 .071 .076 .103 .051 .059 .050 .102 .056 .004
160 .053 .058 .059 .059 .096 .047 .051 .045 .095 .049 .001
2:3 48 .056 .060 .061 .061 .068 .052 .052 .049 .069 .054 .024
80 .054 .057 .058 .057 .067 .051 .052 .050 .069 .053 .015
160 .051 .053 .054 .054 .066 .050 .051 .050 .067 .050 .015
4 1:7 48 .096 .107 .092 .125 .228 .046 .095 .034 .220 .060 .046
80 .072 .083 .074 .092 .228 .048 .077 .040 .222 .057 .024
160 .051 .054 .056 .056 .221 .046 .060 .043 .219 .052 .008
1:3 48 .066 .073 .074 .081 .156 .053 .063 .045 .157 .060 .021
80 .053 .055 .057 .057 .152 .049 .057 .046 .154 .054 .007
160 .050 .050 .051 .050 .148 .050 .052 .047 .149 .051 .002
2:3 48 .048 .051 .054 .053 .081 .045 .047 .043 .083 .051 .023
80 .047 .048 .049 .049 .077 .046 .046 .044 .079 .048 .015
160 .046 .046 .046 .046 .079 .046 .048 .047 .078 .048 .013
10 1:7 48 .067 .077 .068 .093 .350 .043 .087 .031 .335 .064 .078
80 .050 .053 .052 .056 .348 .046 .069 .037 .336 .055 .044
160 .050 .050 .050 .050 .335 .050 .060 .043 .329 .053 .015
1:3 48 .049 .051 .052 .054 .217 .048 .054 .039 .220 .057 .025
80 .050 .050 .050 .050 .213 .050 .054 .045 .214 .055 .010
160 .050 .050 .050 .050 .213 .050 .052 .048 .214 .053 .002
2:3 48 .047 .047 .051 .047 .099 .046 .045 .041 .104 .055 .027
80 .050 .050 .051 .050 .100 .050 .049 .048 .103 .055 .019
160 .049 .049 .049 .049 .099 .049 .050 .049 .099 .052 .010

violations. Even the better of the unconditional tests can fail in some circumstances. In
this study there were only 9 conditions (less than 0.3%) in which none of UST, UBS, UPS,
or UMB was valid, all of them when sampling from populations with large skewness
and kurtosis (in our case, operationalized as the exponential, χ²(1), or lognormal
distributions) and large variance inequality (0.1 or 10 here). This is consistent with prior
research showing that the separate variance t test is not robust when variances are
heterogeneous and the populations are non-normal (e.g. Algina, Oshima, & Lin, 1994;
Luh & Guo, 2000; Wilcox, 2003). Even so, our findings show that in these situations,
conditioning the selection of the t test on a variance equality test does not yield a more
valid test.
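For readers who wish to experiment with resampling tests of the kind referred to above, the following is a minimal sketch (in Python, using numpy) of a bootstrap version of the separate variance t test. It follows the generic bootstrap-t recipe of resampling each group after centring it on its own mean, and is offered only as an illustration of the idea; the function names (welch_t, bootstrap_welch_test) are ours, and the sketch is not a reproduction of the exact UBS procedure used in our simulations.

```python
import numpy as np

def welch_t(x, y):
    """Separate variance (Welch) t statistic for two independent samples."""
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    return (x.mean() - y.mean()) / se

def bootstrap_welch_test(x, y, n_boot=4999, seed=None):
    """Two-sided p-value for H0: equal means, obtained by referring the
    observed Welch t to a bootstrap distribution generated under H0
    (each group is resampled after being centred on its own mean)."""
    rng = np.random.default_rng(seed)
    t_obs = welch_t(x, y)
    xc, yc = x - x.mean(), y - y.mean()   # impose the null hypothesis
    t_boot = np.empty(n_boot)
    for b in range(n_boot):
        xb = rng.choice(xc, size=len(x), replace=True)
        yb = rng.choice(yc, size=len(y), replace=True)
        t_boot[b] = welch_t(xb, yb)
    return np.mean(np.abs(t_boot) >= abs(t_obs))
```

Permutation analogues can be sketched in the same way by reshuffling group membership rather than resampling with replacement.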

4. Summary
In this paper, we evaluated the strategy of conditioning the choice of t test on the
outcome of a preliminary variance equality test. We examined this conditional decision
rule using four different variance equality tests and compared its Type I error rates with
those of several alternative unconditional tests, including some based on resampling
methods. Our findings are consistent with those of others who have reported that the
conditional decision rule serves little useful function and may as well be abandoned
(Best & Rayner, 1987; Gans, 1981; Moser & Stevens, 1992; Wilcox et al., 1986).
Furthermore, these results allow us to generalize this conclusion to conditioning tests
other than the F or Levene tests, as well as to sampling from non-normal populations.
In all conditions, little if any Type I error protection is gained by using the conditional
decision rule rather than the unconditional 'separate variance' Welch–Satterthwaite
t test (UST), irrespective of the outcome of a variance equality test. Some improvement
over UST can be achieved in some situations by bootstrapping the sampling distribution
of the separate variance t statistic (UBS), although this test tended not to perform as well
as the separate variance t test when the sample size of one of the groups was small.
The more complicated Mielke–Berry method also performed well, although it tended to
be conservative compared with UBS and UST. This is not to say that these tests are infallible.
Even these tests failed when extreme variance inequality was combined with large
discrepancies in group sample size. Such failures tended to occur most frequently when
the population distribution was extremely non-normal. Thus, these tests are not
solutions to the Behrens–Fisher problem in all circumstances. Our main point is that
Table 8. Mean Type I error rates: Unequal sample sizes, smaller group with larger variance (u > 1),
asymmetrical populations (skewness > 0)

Conditional tests Unconditional tests

Kurt. u n1:n2 N CF CL CO CB UPT UST UBP UBS UPP UPS UMB

0 2 1:7 48 .101 .104 .094 .109 .128 .059 .113 .049 .129 .055 .036
80 .092 .097 .085 .100 .128 .053 .083 .048 .127 .051 .020
160 .078 .082 .076 .086 .132 .056 .069 .052 .133 .054 .007
1:3 48 .070 .073 .069 .077 .093 .051 .063 .048 .092 .052 .020
80 .066 .069 .066 .071 .095 .052 .060 .052 .093 .052 .010
160 .057 .061 .057 .062 .101 .050 .054 .049 .100 .051 .003
2:3 48 .054 .055 .055 .057 .064 .049 .051 .048 .063 .049 .032
80 .052 .055 .054 .055 .066 .048 .049 .047 .065 .049 .019
160 .051 .052 .051 .053 .066 .050 .051 .050 .066 .051 .010
4 1:7 48 .117 .127 .106 .143 .233 .061 .109 .047 .233 .066 .078
80 .089 .097 .083 .110 .237 .059 .091 .048 .237 .061 .050
160 .058 .060 .057 .063 .237 .055 .072 .050 .239 .057 .021
1:3 48 .070 .075 .071 .083 .154 .053 .065 .047 .156 .057 .033
80 .056 .058 .056 .062 .153 .053 .059 .049 .153 .055 .019
160 .047 .047 .047 .048 .152 .047 .052 .047 .150 .048 .007
2:3 48 .057 .059 .059 .063 .093 .053 .057 .053 .092 .056 .041
80 .058 .058 .058 .059 .094 .057 .058 .055 .095 .058 .028
160 .047 .047 .047 .047 .073 .047 .048 .048 .073 .047 .014
10 1:7 48 .076 .085 .069 .103 .364 .057 .106 .038 .354 .071 .116
80 .056 .057 .056 .060 .341 .055 .080 .046 .335 .063 .073
160 .055 .055 .055 .055 .342 .055 .066 .052 .338 .058 .037
1:3 48 .057 .057 .057 .059 .215 .056 .064 .050 .215 .063 .046
80 .054 .054 .054 .054 .220 .054 .058 .051 .219 .059 .026
160 .055 .055 .055 .055 .210 .055 .058 .053 .208 .057 .007
2:3 48 .050 .050 .051 .050 .105 .050 .049 .047 .106 .058 .042
80 .048 .048 .048 .048 .093 .048 .049 .046 .092 .050 .024
160 .049 .049 .049 .049 .094 .049 .049 .048 .094 .051 .009
Neg. 2 1:7 48 .114 .116 .103 .121 .134 .065 .119 .053 .137 .061 .050
80 .097 .100 .086 .105 .130 .055 .085 .047 .132 .053 .027
160 .078 .080 .068 .084 .132 .052 .068 .049 .132 .052 .013
1:3 48 .076 .077 .072 .080 .095 .054 .066 .050 .095 .054 .027
80 .075 .075 .070 .078 .103 .055 .064 .053 .103 .056 .018
160 .060 .061 .057 .063 .100 .053 .057 .052 .101 .053 .005
2:3 48 .063 .062 .061 .064 .071 .052 .057 .053 .071 .054 .036
80 .054 .054 .053 .055 .066 .048 .051 .049 .065 .049 .024
160 .051 .052 .051 .052 .065 .050 .052 .051 .064 .050 .015
4 1:7 48 .129 .138 .106 .156 .239 .064 .118 .046 .241 .066 .092
80 .077 .087 .068 .099 .231 .055 .085 .045 .230 .057 .060
160 .056 .059 .055 .061 .233 .055 .070 .051 .232 .056 .029
1:3 48 .071 .077 .070 .087 .156 .057 .069 .051 .156 .061 .045
80 .055 .057 .054 .060 .149 .052 .061 .049 .149 .056 .024
160 .052 .052 .052 .052 .156 .052 .057 .051 .155 .054 .008
2:3 48 .052 .054 .054 .056 .086 .050 .051 .047 .084 .054 .042
80 .053 .053 .053 .054 .088 .052 .054 .052 .087 .055 .032
160 .052 .052 .052 .052 .086 .052 .054 .053 .087 .054 .017
10 1:7 48 .078 .085 .071 .102 .361 .064 .109 .042 .354 .074 .130
80 .057 .058 .057 .061 .346 .056 .080 .045 .343 .062 .080
160 .053 .053 .053 .053 .346 .053 .062 .049 .343 .055 .039
1:3 48 .057 .058 .058 .060 .220 .057 .065 .050 .223 .064 .053
80 .052 .052 .052 .052 .209 .052 .057 .048 .211 .055 .026
160 .051 .051 .051 .051 .203 .051 .053 .050 .203 .052 .010
2:3 48 .055 .055 .056 .056 .110 .055 .054 .051 .112 .060 .049
80 .054 .054 .054 .054 .106 .054 .054 .053 .106 .057 .033
160 .049 .049 .049 .049 .097 .049 .050 .049 .097 .051 .017
Pos. 2 1:7 48 .090 .085 .077 .091 .122 .071 .112 .062 .120 .065 .018
80 .082 .080 .076 .085 .125 .065 .090 .058 .126 .059 .008
160 .072 .074 .073 .077 .126 .060 .073 .055 .126 .055 .002
1:3 48 .070 .070 .069 .073 .097 .061 .068 .056 .099 .061 .010
80 .065 .066 .066 .069 .096 .056 .062 .054 .097 .056 .004
160 .058 .061 .062 .062 .098 .054 .057 .052 .099 .053 .001
2:3 48 .054 .055 .056 .056 .065 .051 .050 .048 .067 .053 .021
80 .054 .055 .056 .056 .066 .052 .051 .050 .068 .053 .013
160 .054 .054 .055 .054 .068 .052 .053 .052 .069 .052 .007
4 1:7 48 .117 .124 .112 .136 .230 .071 .118 .058 .223 .077 .045
80 .092 .099 .093 .107 .227 .066 .093 .056 .224 .066 .024
160 .068 .073 .075 .077 .227 .060 .074 .054 .227 .058 .009
1:3 48 .077 .083 .082 .089 .154 .063 .072 .057 .156 .067 .022
80 .067 .071 .072 .074 .154 .060 .066 .054 .155 .061 .011
160 .057 .057 .060 .059 .156 .055 .059 .052 .157 .056 .002
2:3 48 .059 .062 .064 .064 .088 .056 .056 .053 .092 .060 .030
80 .056 .057 .059 .058 .086 .055 .055 .053 .087 .058 .018
160 .052 .053 .054 .053 .085 .052 .053 .051 .085 .054 .009
10 1:7 48 .099 .109 .101 .127 .359 .070 .114 .051 .340 .089 .090
80 .074 .078 .079 .085 .354 .067 .090 .053 .343 .074 .053
160 .062 .063 .065 .063 .347 .062 .073 .054 .343 .062 .023
1:3 48 .067 .070 .073 .075 .219 .063 .070 .054 .222 .072 .038
80 .061 .062 .064 .063 .212 .061 .064 .054 .215 .065 .019
160 .055 .055 .056 .055 .211 .055 .057 .052 .213 .057 .006
2:3 48 .061 .062 .066 .064 .114 .061 .059 .055 .118 .068 .042
80 .059 .059 .061 .059 .107 .059 .057 .055 .109 .063 .026
160 .054 .054 .055 .054 .101 .054 .054 .053 .103 .057 .011

these tests do as well as or better than the conditional decision rule in nearly every
circumstance we examined.
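To make this contrast concrete, the fragment below sketches the two strategies as they might be implemented with scipy: the conditional rule runs a preliminary Levene test and lets its outcome choose between the pooled and separate variance t tests, whereas the unconditional strategy simply always uses the Welch test. The function names and the .05 threshold for the preliminary test are illustrative choices of ours, not a recommendation.

```python
from scipy import stats

def conditional_rule_p(x, y, prelim_alpha=0.05):
    """Conditional decision rule: a preliminary Levene test of variance
    equality determines whether the pooled variance or the separate
    variance (Welch) t test is used to compare the two means."""
    # center='mean' gives the classic Levene test; scipy's default (median)
    # is the Brown-Forsythe variant.
    _, p_var = stats.levene(x, y, center='mean')
    equal_var = p_var > prelim_alpha   # non-significant: treat variances as equal
    return stats.ttest_ind(x, y, equal_var=equal_var).pvalue

def unconditional_welch_p(x, y):
    """Unconditional strategy: always use the separate variance (Welch) test."""
    return stats.ttest_ind(x, y, equal_var=False).pvalue
```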
Of course, the methods we included in our simulations are not exhaustive of all
unconditional tests. We are not in a position to argue that these alternative
unconditional tests are better than others that exist. For instance, we acknowledge
that so-called asymmetric t tests are available, and the use of an asymmetric variance
estimate may be better than both the separate variance estimate and the pooled variance
estimate when n2 is much greater than n1 (Balkin & Mallow, 2001). Alternative robust
methods (Andrews et al., 1972; Keselman, Othman, Wilcox, & Fradette, 2004; Lix
& Keselman, 1998; Wilcox, 1997; Wilcox & Keselman, 2003), especially those
comparing trimmed means, provide alternatives that are resistant to variance heterogeneity
and bizarre distribution shapes, but they test a different null hypothesis, so we did not
include them in this study. Such robust methods also have resampling-based alternatives
which can further improve their performance (Keselman et al., 2004; Wilcox,
Keselman, & Kowalchuk, 1998). It may be that the best solution to the Behrens–Fisher
problem is to abandon the use of the arithmetic mean as the default measure of location,
and instead focus our inferential questions on statistics less susceptible to the influence
of extreme observations that tend to occur when sampling from irregular distributions.
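For readers who want to try the trimmed-means alternatives cited above, recent versions of scipy (1.7 or later, where the trim argument was added) expose a Yuen-type test directly through ttest_ind. The snippet below is a brief sketch using arbitrary example data of our own choosing; as noted, the null hypothesis it tests concerns population trimmed means rather than ordinary means.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=1.0, size=30)   # heavy-tailed example data
y = rng.lognormal(mean=0.2, sigma=1.5, size=40)

# Yuen-type test: compares 20% trimmed means, with a Welch-style standard
# error based on the winsorized variances of the two samples.
result = stats.ttest_ind(x, y, equal_var=False, trim=0.2)
print(result.statistic, result.pvalue)
```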
Finally, validity is only one of the criteria used to judge the performance of a statistical
test. A test may be valid but lower in power than alternative tests. Our results, combined
with previous ones, suggest that there is little value in a conditional test if the goal is to
control Type I error rates. But perhaps a conditional test that is valid in a given situation
yields a more powerful test in that situation than any unconditional test. Our results say
nothing about this possibility. Moser and Stevens (1992) and Moser et al. (1989) note
that the only situations in which a conditional test (in their case, using the F test of
variance equality to choose between the pooled and separate variance t tests) is superior in terms
of power are (i) when the variances are equal or nearly so, and (ii) when the sample sizes
are unequal and the smaller group has the larger variance. However, in the latter case the
conditional test buys power at the cost of high Type I error rates. Otherwise, the
unconditional use of the separate variance t test is either equal in power to or more
powerful than conditioning the choice on the F-ratio test of variance equality. It would
be worth further examining the extent of any power differences between the
conditional decision rule and some of the unconditional tests we examined here, and
seeing whether the conclusions of Moser and colleagues generalize across the choice of
variance equality test as well as to sampling from non-normal populations.
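A power comparison of this kind could be carried out with a small Monte Carlo experiment along the following lines. This is only a sketch under arbitrary illustrative settings of ours (normal populations, a mean difference of 0.5, unequal sample sizes with the smaller group having the larger standard deviation); it is not a reproduction of our simulation design, and the function names are hypothetical.

```python
import numpy as np
from scipy import stats

def conditional_p(x, y):
    """Conditional rule: a Levene test chooses between the pooled and Welch t tests."""
    equal_var = stats.levene(x, y, center='mean').pvalue > 0.05
    return stats.ttest_ind(x, y, equal_var=equal_var).pvalue

def welch_p(x, y):
    """Unconditional separate variance (Welch) t test."""
    return stats.ttest_ind(x, y, equal_var=False).pvalue

def power_estimate(p_fun, n1, n2, sd1, sd2, delta, reps=5000, alpha=0.05, seed=0):
    """Proportion of replications in which the given test rejects H0 of equal means."""
    rng = np.random.default_rng(seed)
    rejections = sum(
        p_fun(rng.normal(0.0, sd1, n1), rng.normal(delta, sd2, n2)) < alpha
        for _ in range(reps)
    )
    return rejections / reps

# Smaller group (n1 = 10) paired with the larger standard deviation (sd1 = 2.0):
for p_fun in (conditional_p, welch_p):
    print(p_fun.__name__,
          power_estimate(p_fun, n1=10, n2=70, sd1=2.0, sd2=1.0, delta=0.5))
```

Setting delta to zero turns the same function into an estimate of the empirical Type I error rate, so validity and power can be checked with the one routine.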

References
Algina, J., Oshima, T. C., & Lin, W. Y. (1994). Type I error rates for Welch’s test and James’s second-
order test under non-normality and inequality of variance when there are two groups. Journal
of Educational and Behavioral Statistics, 19, 275–291.
Andrews, D., Bickel, P., Hampel, F., Huber, P., Rogers, W., & Tukey, J. (1972). Robust estimates of
location. Princeton, NJ: Princeton University Press.
Aptech Systems, Inc. (1996). GAUSS (version 3.21) [computer software]. Maple Valley, WA:
Aptech Systems, Inc.
Balkin, S. D., & Mallow, C. L. (2001). An adjusted, asymmetric two-sample t test. American
Statistician, 55, 203–206.
Bassili, J. N. (2003). The minority slowness effect: Subtle inhibitions in the expression of views not
shared by others. Journal of Personality and Social Psychology, 84, 261–276.
Best, D. J., & Rayner, J. C. W. (1987). Welch’s approximate solution for the Behrens–Fisher
problem. Technometrics, 29, 205–210.
Bluman, A. G. (2001). Elementary statistics: A step-by-step approach (4th ed.). Boston: McGraw-
Hill.
Boik, R. J. (1987). The Fisher–Pitman permutation test: A non-robust alternative to the normal
theory F when variances are heterogeneous. British Journal of Mathematical and Statistical
Psychology, 40, 26–42.
Boneau, C. A. (1960). The effects of violating the assumptions underlying the t test. Psychological
Bulletin, 57, 49–64.
Bradley, J. V. (1977). A common situation conducive to bizarre distribution shapes. American
Statistician, 31, 147–150.
Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31,
144–152.
Brown, M. B., & Forsythe, A. B. (1974). Robust tests for the equality of variances. Journal of the
American Statistical Association, 69, 364–367.
Chapman, K. L., Hardin-Jones, M., Schulte, J., & Halter, K. A. (2001). Vocal development of
9-month-old babies with cleft palate. Journal of Speech, Language and Hearing Research,
44, 1268–1283.
Coakes, S. J., & Steed, L. G. (1997). SPSS: Analysis without anguish. New York: Wiley.
Cody, R. P., & Smith, J. K. (1997). Applied statistics and the SAS programming language (4th ed.).
New York: North-Holland.
Conover, W. J., Johnson, M. E., & Johnson, M. M. (1981). A comparative study of tests of
homogeneity of variances, with applications to the outer continental shelf bidding data.
Technometrics, 23, 351–361.
Edgington, E. E. (1995). Randomization tests (3rd ed.). New York: Dekker.
Efron, B., & Tibshirani, R. J. (1998). An introduction to the bootstrap. Boca Raton, FL: CRC Press.
Field, A. (2000). Discovering statistics: Using SPSS for Windows. Thousand Oaks, CA: Sage.
Fisher, R. A. (1941). The asymptotic approach to Behrens’s integral, with further tables for the
d test of significance. Annals of Eugenics, 11, 141–172.
Fleishman, A. I. (1978). A method for simulating non-normal distributions. Psychometrika, 43,
521–532.
Foster, J. J. (2001). Data analysis using SPSS for Windows: A beginner’s guide. Thousand Oaks,
CA: Sage.
Gans, D. J. (1981). Use of a preliminary test in comparing two sample means. Communications in
Statistics: Simulation and Computation, 10, 163–174.
George, D., & Mallery, P. (2003). SPSS for Windows step by step: A simple guide and reference
(4th ed.). Boston, MA: Allyn and Bacon.
Good, P. I. (2000). Permutation tests: A practical guide to resampling methods for testing
hypotheses (2nd ed.). New York: Springer.
Hayes, A. F. (2000). Randomization tests and the equality of variance assumption when comparing
group means. Animal Behaviour, 59, 653–656.
Hesse, E., & Van Ijzendoorn, M. H. (1998). Parental loss of close family members and propensities
towards absorption in offspring. Developmental Science, 1, 299–305.
Howell, D. C. (1996). Statistical methods for psychology (4th ed.). Belmont, CA: Duxbury.
Keppel, G. (1991). Design and analysis: A researcher’s handbook (3rd ed.). Upper Saddle River,
NJ: Prentice Hall.
Keselman, H. J., Othman, A. R., Wilcox, R. R., & Fradette, K. (2004). The new and improved two-
sample t test. Psychological Science, 15, 47–51.
Levene, H. (1960). Robust tests for the equality of variance. In I. Olkin (Ed.), Contributions to
probability and statistics. Palo Alto, CA: Stanford University Press.
Levine, D. M., Ramsey, P. P., & Smidt, R. K. (2001). Applied statistics for engineers and scientists.
Upper Saddle River, NJ: Prentice Hall.
Lix, L. M., & Keselman, H. J. (1998). To trim or not to trim: Tests of location equality under
heteroscedasticity and non-normality. Educational and Psychological Measurement, 58,
409–429.
Luh, W., & Guo, J. (2000). Johnson’s transformation two-sample trimmed t and its bootstrap
method for heterogeneity and non-normality. Journal of Applied Statistics, 27, 965–973.
Lunneborg, C. E. (2000). Data analysis by resampling: Concepts and applications. Pacific Grove,
CA: Duxbury.
MathSoft, Inc. (1999). S-PLUS 2000 guide to statistics, Vol. 1. Seattle, WA: MathSoft, Inc.
McClave, J. T., & Sincich, T. (2003). Statistics (9th ed.). Upper Saddle River, NJ: Prentice Hall.
Mervis, C. B., & Robinson, B. F. (2000). Expressive vocabulary ability of toddlers with Williams
syndrome or Down syndrome: A comparison. Developmental Neuropsychology, 17, 111–126.
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological
Bulletin, 105, 156–166.
Mielke, P. W., & Berry, K. J. (1994). Permutation tests for common locations among samples with
unequal variances. Journal of Educational and Behavioral Statistics, 19, 217–236.
Mielke, P. W., Berry, K. J., & Johnson, E. S. (1976). Some multi-response permutation procedures
for a priori classifications. Communications in Statistics: Theory and Methods, 5, 1409–1424.
Moser, B. K., & Stevens, G. R. (1992). Homogeneity of variance in the two-sample means test.
American Statistician, 46, 19–21.
Moser, B. K., Stevens, G. R., & Watts, C. L. (1989). The two-sample t test versus Satterthwaite’s
approximate F test. Communications in Statistics: Theory and Methods, 18, 3963–3975.
Murphy, B. P. (1976). Comparison of some two sample means tests by simulation.
Communications in Statistics: Simulation and Computation, 5, 23–32.
Norusis, M. J. (2002). SPSS 11.0 guide to data analysis. Chicago, IL: SPSS Inc.
O’Brien, R. G. (1981). A simple test for variance effects in experimental designs. Psychological
Bulletin, 89, 570–574.
Olejnik, S. F. (1988). Variance heterogeneity: An outcome to explain or a nuisance factor to
control. Journal of Experimental Education, 56, 193–197.
Olejnik, S. F., & Algina, J. (1988). Tests of variance equality when distributions differ in form and
location. Educational and Psychological Measurement, 48, 317–329.
Pfanzagl, J. (1974). On the Behrens–Fisher problem. Biometrika, 61, 39–47.
Press, S. J. (1966). A confidence interval comparison of two test procedures proposed for the
Behrens–Fisher problem. Journal of the American Statistical Association, 61, 454–466.
Ramsey, P. H. (1994). Testing variances in psychological and educational research. Journal of
Educational Statistics, 19, 23–42.
Romano, J. P. (1990). On the behavior of randomization tests without a group invariance
assumption. Journal of the American Statistical Association, 85, 686–692.
Rosner, B. (2000). Fundamentals of biostatistics (5th ed.). Pacific Grove, CA: Duxbury.
Satterthwaite, F. E. (1946). An approximate distribution of estimates of variance components.
Biometrics Bulletin, 2, 110–114.
Sawilowsky, S. S., & Blair, R. C. (1992). A more realistic look at the robustness and Type II error
properties of the t test to departures from population normality. Psychological Bulletin, 111,
352–360.
Scheffé, H. (1943). On solutions of the Behrens–Fisher problem, based on the t distribution.
Annals of Mathematical Statistics, 14, 35–44.
Smith, S. S., & Richardson, D. (1983). Amelioration of deception and harm in psychological
research: The important role of the debriefing. Journal of Personality and Social Psychology,
44, 1075–1082.
Snedecor, G. W., & Cochran, W. G. (1989). Statistical methods (8th ed.). Ames, IA: Iowa State University Press.
Stonehouse, J. M., & Forrester, C. J. (1998). Robustness of the t and U tests under combined
assumption violations. Journal of Applied Statistics, 25, 63–74.
Triola, M. F., Goodman, W. M., & Law, R. (2002). Elementary statistics. Toronto: Wesley.
Welch, B. L. (1938). The significance of the difference between two means when the population
variances are unequal. Biometrika, 29, 350–362.
Welch, B. L. (1947). The generalization of Student’s problem when several different population
variances are involved. Biometrika, 34, 28–35.
Wilcox, R. R. (1990). Comparing variances and means when distributions have non-identical shapes.
Communications in Statistics: Simulation and Computation, 19, 155–173.
Wilcox, R. R. (1997). Introduction to robust estimation and hypothesis testing. San Diego, CA:
Academic Press.
Wilcox, R. R. (2002). Comparing the variances of two independent groups. British Journal of
Mathematical and Statistical Psychology, 55, 169–175.
Wilcox, R. R. (2003). Applying contemporary statistical techniques. San Diego, CA: Academic
Press.
Wilcox, R. R., Charlin, V. L., & Thompson, K. L. (1986). New Monte Carlo results on the robustness
of the ANOVA F, W, and F* statistics. Communications in Statistics: Simulation and
Computation, 15, 933–943.
Wilcox, R. R., & Keselman, H. J. (2003). Modern robust data analysis methods: Measures of central
tendency. Psychological Methods, 8, 254–274.
Wilcox, R. R., Keselman, H. J., & Kowalchuk, R. K. (1998). Can tests for treatment group equality
be improved? The bootstrap and trimmed means conjecture. British Journal of Mathematical
and Statistical Psychology, 51, 123–134.
Zimmerman, D. W. (1996). Some properties of preliminary tests of equality of variances in the two-
sample location problem. Journal of General Psychology, 123, 217–231.
Zimmerman, D. W. (2004). A note on preliminary tests of equality of variances. British Journal of
Mathematical and Statistical Psychology, 57, 173–181.
Zimmerman, D. W., & Zumbo, B. D. (1992). Parametric alternatives to the student t test under
violations of normality and homogeneity of variance. Perceptual and Motor Skills, 74,
835–844.

Received 15 August 2003; revised version received 21 March 2005

Appendix
The table below lists the skewness and kurtosis of the populations sampled in the
simulations. When samples were generated with the Fleishman method, the coefficients
used in equation (2) are listed in the row corresponding to the population skewness and
kurtosis that those coefficients generate. Otherwise, the probability distribution for the
population is listed.

Skewness  Kurtosis  a            b            c            d

0.00   -1.20   Uniform(0, 1)
0.00   -1.00    0.00          1.22100957   0.00         -0.08015837
0.00   -0.75    0.00          1.13362195   0.00         -0.04673170
0.00   -0.50    0.00          1.07673274   0.00         -0.02626832
0.00   -0.25    0.00          1.03424763   0.00         -0.01154929
0.00    0.00    0.00          1.00         0.00          0.00
0.00    0.50    0.00          0.94583094   0.00          0.01774145
0.00    1.00    0.00          0.90297660   0.00          0.03135645
0.00    1.50    0.00          0.86699327   0.00          0.04252248
0.00    2.00    0.00          0.83566457   0.00          0.05205740
0.00    4.00    0.00          0.73738123   0.00          0.08092509
0.00    6.00    0.00          0.66268162   0.00          0.10189081
0.25   -1.00   -0.07746244    1.26341280   0.07746244   -0.10003605
0.25   -0.75   -0.05928145    1.15546858   0.05928145   -0.05617881
0.25   -0.50   -0.05098546    1.09162985   0.05098546   -0.03246963
0.25   -0.25   -0.04602658    1.04545396   0.04602658   -0.01611868
0.25    0.00   -0.04263275    1.00896430   0.04263275   -0.00360753
0.25    0.50   -0.03812715    0.95223759   0.03812715    0.01520430
0.25    1.00   -0.03515419    0.90797194   0.03515419    0.02939742
0.25    1.50   -0.03297936    0.87109462   0.03297936    0.04092481
0.25    2.00   -0.03128626    0.83914834   0.03128626    0.05070704
0.25    4.00   -0.02692846    0.73957028   0.02692846    0.08008653
0.50   -0.25   -0.10290997    1.08559668   0.10290997   -0.03319707
0.50   -0.50   -0.12015607    1.14784906   0.12015607   -0.05750353
0.50    0.00   -0.09262357    1.03994604   0.09262357   -0.01646086
0.50    0.50   -0.08045036    0.97343107   0.08045036    0.00664738
0.50    1.00   -0.07311803    0.92409763   0.07311803    0.02298245
0.50    1.50   -0.06801719    0.84413424   0.06801719    0.03578704
0.50    2.00   -0.06416926    0.85011103   0.06416926    0.04641702
0.50    4.00   -0.05464090    0.74631819   0.05464090    0.07748621
0.75   -0.25   -0.22758948    1.20392341   0.22758948   -0.09549567
0.75    0.00   -0.17363002    1.11251460   0.17363002   -0.05033445
0.75    0.50   -0.13640251    1.01798355   0.13640251   -0.01241224
0.75    1.00   -0.11906128    0.95591357   0.11906128    0.00983810
0.75    1.50   -0.10846069    0.90889311   0.10846069    0.02575256
0.75    2.00   -0.10103831    0.87041099   0.10103831    0.03829112
1.00    1.00   -0.19099508    1.01748519   0.19099508   -0.01857700
1.00    1.50   -0.16319428    0.95307690   0.16319428    0.00659737
1.00    2.00   -0.14721082    0.90475830   0.14721082    0.02386092
1.00    4.00   -0.11697723    0.77658502   0.11697723    0.06549950
1.00    6.00   -0.10288900    0.69033611   0.10288900    0.09116368
1.50    6.00   -0.17053106    0.73313072   0.17053106    0.07350201
1.50    8.00   -0.15092965    0.65323954   0.15092965    0.09793493
2.00    8.00   -0.23336330    0.71043655   0.23336330    0.07226367
1.63    4.00    χ²(3)
2.00    6.00    χ²(2) and exponential(1)
2.83   12.00    χ²(1)
6.18  110.94    Lognormal
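As an illustration of how the coefficients in this table are used, the sketch below applies the Fleishman power transformation to standard normal deviates, on the assumption that equation (2) of the paper is the usual Fleishman polynomial Y = a + bZ + cZ² + dZ³. The function name fleishman_sample is ours, and the coefficients shown are those from the row for skewness 0.75 and kurtosis 1.00.

```python
import numpy as np

def fleishman_sample(n, a, b, c, d, rng=None):
    """Draw n observations from the non-normal population defined by the
    Fleishman transformation Y = a + b*Z + c*Z**2 + d*Z**3 of standard normal Z."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal(n)
    return a + b * z + c * z ** 2 + d * z ** 3

# Coefficients from the table row for skewness 0.75, kurtosis 1.00:
a, b, c, d = -0.11906128, 0.95591357, 0.11906128, 0.00983810
y = fleishman_sample(100_000, a, b, c, d, rng=np.random.default_rng(1))
print(y.mean(), y.std(ddof=1))   # close to 0 and 1 by construction of the coefficients
```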
