
Review of Educational Research, Spring 1990, Vol. 60, No. 1, pp. 91-126

Nonparametric Tests of Interaction in Experimental Design


Shlomo S. Sawilowsky Wayne State University
Until recently the design of experiments in the behavioral and social sciences that focused on interaction effects demanded the use of the parametric analysis of variance. Yet, researchers have been concerned by the presence of nonnormally distributed variables. Although nonparametric statistics are recommended in these situations, researchers often rely on the robustness of parametric tests. Further, often it is assumed that nonparametric methods lack statistical power and that there is a paucity of techniques in more complicated research designs, such as in testing for interaction effects. This paper reviewed (a) research in the past decade and a half that addressed concerns in selecting parametric and nonparametric statistics and (b) 10 recently developed nonparametric techniques for the testing of interactions in experimental design. The review shows that these new techniques are robust, powerful, versatile, and easy to compute. An application of selected nonparametric techniques on fabricated data is provided.

Introduction

Fifteen years ago in this journal Gardner (1975) brought to our attention the data analysis problem of Gatta (1973). Gatta was interested in the effects of two grading systems on the attitude of students with different achievement levels. The design of the experiment was selected to focus on the interaction of grading system and achievement level. "This virtually demands the use of parametric statistics: as Gaito (1959) points out, the few nonparametric tests for interaction that are available are computationally laborious and, more importantly, very much less powerful" (Gardner, 1975, p. 54). The same problem was mentioned by Edgington (1980); Hinkle, Wiersma, and Jurs (1988); Kirk (1968, 1972); Klugh (1970); McCall (1975, 1980, 1986); Siegel (1956); and Tate and Clelland (1957). Klugh (1974) noted, "Nonparametrics have not been developed which can extract and test the significance of complex interaction effects.
Analysis of variance remains the single, most important statistical tool in behavioral research" (p. 308).

Gatta's data analysis problem leads to the two issues reviewed in this paper: (a) What are the concerns in using nonparametric rank tests in experimental design, and what research in the last decade and a half addressed those concerns? (b) Can a technique be suggested for the researcher of the 1990s who needs a nonparametric test of interaction in the analysis of variance (ANOVA) layout? It should be noted that nonparametric tests typically fall into three divisions based on the type of information they use: categorical, sign, or rank. This review is delimited to rank tests. For the other types of nonparametric methods, the reader is referred to Conover (1980), Fienberg (1985), Freeman (1987), Gibbons (1985a, 1985b), Kennedy (1983), Siegel and Castellan (1988), Sprent (1989), and Wilcox (1987).

Author note: I wish to acknowledge the detailed comments of three anonymous referees and Michael R. Harwell, and especially the suggestions from R. Clifford Blair, who reviewed an earlier draft. I also wish to thank my wife, Avigayil, for her support.

Background

Researchers in the behavioral and social sciences have long been concerned by the appearance of nonnormally distributed variables (Blair & Higgins, 1981; Bloom, 1984; Bradley, 1968, 1977; Walberg, Strykowski, Rovai, & Hung, 1984). This concern arose because most commonly employed parametric statistics designed to test hypotheses of shift in location parameters were derived under the assumption of population normality. In the 1950s, many researchers advocated the use of nonparametric statistics, because they do not assume the data were sampled from a normal population (Nunnally, 1975, 1978). The popularity of nonparametric statistics peaked, however, during the 1960s (Michael, 1963; Page & Marcotte, 1966). Glass, Peckham, and Sanders (1972) commented, "Applied statistics in education and the social sciences experienced a largely unnecessary hegira to nonparametric statistics during the 1950s. Increasingly during the 1950s and 1960s the fixed-effects, normal theory ANOVA was replaced by such comparable nonparametric techniques as the Wilcoxon test, Mann-Whitney U test, Kruskal-Wallis one-way ANOVA, and the Friedman two-way ANOVA for ranks.... The flight to nonparametrics was unnecessary" (pp. 237-238). The use of these procedures waned to the point that some textbook authors relegated nonparametric tests to a secondary role, as did Winn and Johnson (1978), who said that nonparametric tests are only "better than doing no statistical analysis at all" (p. 358).
After reviewing 1,698 empirical studies published in the Journal of Counseling Psychology from 1954 through 1984, Jenkins, Fuqua, and Froehle (1984) concluded that "analysis indicated that nonparametric procedures have not been used as widely as parametric procedures. It was also evident that the use of nonparametric procedures has steadily declined over time" (p. 31).

The reasons reported for the decline in the use of nonparametric statistics seem to be threefold. First, it is usually asserted that parametric statistics are extremely robust with respect to the assumption of population normality (Boneau, 1960; Box, 1954; Glass et al., 1972; Lindquist, 1953), precluding the need to consider alternative tests. Second, it is assumed that nonparametric tests are less powerful than their parametric counterparts (Kerlinger, 1964, 1973; Nunnally, 1975), apparently regardless of the shape of the population from which the data were sampled. Third, there has been a paucity of nonparametric tests for the more complicated research designs (Bradley, 1968). For example, in testing for shift there have been few usable nonparametric alternatives to the classical fixed-effect factorial ANOVA, as mentioned in the opening paragraph.

Robustness, power, and versatility are traditional areas of comparison between parametric and nonparametric tests (Anderson, 1961; Blair, 1985). The efficiency of statistical tests and rank tests is briefly discussed for the benefit of readers from various backgrounds, because these topics are referred to in later sections. To respond to the question about concerns and advances in nonparametric rank tests, the issues of robustness of parametric tests and power of nonparametric


tests are reviewed. To respond to the question about the current availability of nonparametric tests of interaction, the advances in this area are discussed, and recommendations for data analysis are made. Finally, an application of selected nonparametric tests of interactions is presented.

Efficiency of Statistical Tests

Efficiency, in its broadest sense, refers to the cost, time, and effort required to use a test (Bradley, 1968). In terms of statistical power, efficiency refers to the minimum sample size necessary to detect a false null hypothesis. The smaller the sample necessary to detect a treatment, the more efficient, or powerful, is the statistic. An index that compares one test's requirements in terms of sample size to an alternative test is the Relative Efficiency (RE). The RE is the ratio of the sample sizes that the two tests require to achieve a desired power level. To be a fair index, nominal alpha of the two competing tests must be maintained at the same level while testing the identical hypothesis. The statistic that requires the smaller sample size is the more efficient test.

The RE is relative because it depends on alpha and the distribution. An index is needed that compares the efficiency of competing statistical tests under many different conditions. The Asymptotic Relative Efficiency (ARE), or Pitman efficiency (Pitman, 1948), is a large sample index that compares the RE of competing statistical tests when sample size a of Test A and sample size b of Test B are infinitely large, and the treatment effect is infinitesimally small. (For a more rigorous definition of the ARE, see Lehmann, 1975; Noether, 1955; and Stuart, 1956.) An advantage of the ARE over the RE is that it does not depend on alpha. Thus, the efficiency of competing tests can be compared under "standardized" (Bradley, 1968, p. 58) conditions. An important use of the ARE is in predicting the relative power of one test to another for small samples.
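To make the ARE concrete, the following sketch evaluates a standard large-sample result that is not stated in this article: for a pure shift in location, the Pitman ARE of the Wilcoxon-Mann-Whitney rank test relative to the t test is 12σ²(∫f(x)²dx)², where f is the population density and σ² is its variance. The function name, the integration grid, and the three example densities are illustrative assumptions, and the integral is approximated numerically rather than derived.

```python
import math

def are_rank_vs_t(pdf, lo, hi, steps=100_000):
    """Approximate 12 * variance * (integral of pdf**2)**2 by the midpoint rule.

    pdf is assumed to integrate to (about) 1 over the interval [lo, hi].
    """
    h = (hi - lo) / steps
    xs = [lo + (i + 0.5) * h for i in range(steps)]
    fx = [pdf(x) for x in xs]
    mean = sum(x * f for x, f in zip(xs, fx)) * h
    var = sum((x - mean) ** 2 * f for x, f in zip(xs, fx)) * h
    int_f2 = sum(f * f for f in fx) * h
    return 12.0 * var * int_f2 ** 2

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def uniform_pdf(x):
    return 1.0  # uniform density on [0, 1]

def laplace_pdf(x):
    return 0.5 * math.exp(-abs(x))  # double exponential with scale 1

are_normal = are_rank_vs_t(normal_pdf, -10.0, 10.0)    # analytic value: 3/pi
are_uniform = are_rank_vs_t(uniform_pdf, 0.0, 1.0)     # analytic value: 1
are_laplace = are_rank_vs_t(laplace_pdf, -40.0, 40.0)  # analytic value: 3/2
print(round(are_normal, 3), round(are_uniform, 3), round(are_laplace, 3))
```

The analytic values are 3/π (about 0.955), 1, and 1.5. A value below 1.0 under normality and above 1.0 under the heavier-tailed double exponential illustrates how the ARE can point to different tests depending on the population shape.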
This use of the ARE was recognized in this journal by Blum and Fattu (1954) over 35 years ago. If the ARE of Test A relative to Test B is greater than 1.0, then Test A should have a power advantage over Test B. If the ARE is a ratio much greater than 1.0, then the magnitude of the ratio should correspond to an even greater power advantage in favor of Test A. If the ratio is less than 1.0, or much less than 1.0, then Test A should display less, or much less, relative power than Test B.

The ARE as an indicator of relative power for small samples does have limitations, because it assumes an infinite sample size and an infinitesimally small treatment effect, which is unrealistic in real-world experimental conditions. However, several Monte Carlo studies confirmed the usefulness of the ARE as an indicator of the relative power of various nonparametric tests as compared to their parametric counterparts. Blair (1981), Blair and Higgins (1981), Hollander and Wolfe (1973), Randles and Wolfe (1979), and Smitley (1981) are some examples. For a review of additional studies using the ARE as an indication of statistical power, see Jenkins, Fuqua, and Hartman (1984).

Rank Tests

The rank test family is a particular application of permutation or randomization tests (Hajek & Sidak, 1967). Permutation tests on original data are robust (Lehmann & D'Abrera, 1975) and "stunningly efficient" (Bradley, 1968, p. 87), but are


impractical to use. This is due to the difficulties that arise in calculating all possible permutations to obtain critical values. The improvement is to replace the observations with their respective rank order. This rank-randomization results in rank tests that maintain the properties of the parent permutation test in being nonparametric exact tests, and yet these procedures are often easy to compute.

There are other advantages of rank tests. For example, no assumption is made about the shape of the population from which samples were drawn, unlike the normality assumption of parametric tests (Marascuilo & Serlin, 1988). Also, the data need to be measured only on an ordinal scale (Stevens, 1946). The latter consideration permits more valid analyses of educational and psychological variables that are measured as rank data obtained from judges or observers. Moreover, original data that are on a higher level of measurement (interval or ratio) need only to be converted to ranks to use rank tests.

The notion of converting higher level variables to a lower rank scale has concerned many textbook authors. They view tests with fewer assumptions as testing more general, and hence, weaker hypotheses, and the ranking of interval and ratio scale measures as a process that throws away valuable information. A consequence that is assumed to occur by the deliberate reduction of information is that rank tests lack statistical power, that they are "quick and dirty" procedures. Authors who accepted this position include the following: Ary and Jacobs (1976); Bartz (1981); Best (1981); Best and Kahn (1989); Borg (1987); Chase (1976); Clayton (1984); Couch (1982); Downie and Starry (1977); Englehart (1972); Fallik and Brown (1983); Freund (1970); Garrett (1966); Gay (1976, 1988); Gravetter and Wallnau (1985); Haber and Runyon (1969); Hays (1963, 1988); M. Johnson (1977); R. R.
Johnson (1973); Kerlinger (1964); Kurtz and Mayo (1979, 1983); Lapin (1975, 1980); Leedy (1989); Lutz (1983); Lynch and Huntsberger (1976); Mansfield (1986); Mason and Bramble (1989); McCall (1980); McMillan and Schumacher (1989); McNemar (1962); Mendenhall (1968, 1971); Miller and Freund (1965); Mills (1977); Minium (1970); Olson (1987); Pagano (1981); Palumbo (1969, 1977); Parket (1974); Roscoe (1969, 1975); Runyon and Haber (1968, 1971, 1976, 1980, 1984, 1988); Sowell and Casey (1982); Spence, Underwood, Duncan, and Cotton (1968); Sprinthall (1982); Summers, Peters, and Armstrong (1977); Williams (1979); Wonnacott (1977); and Wright (1976). Some of these authors stated without qualification that rank tests are less powerful than parametric counterparts, whereas others only allowed that rank tests may be as powerful as parametric tests under nonnormality. Some textbook authors disagreed and opined that rank tests often perform more efficiently than parametric counterparts under nonnormality (Barlow, Bartholomew, Bremmer, & Brunk, 1972; Capon, 1988; Christensen, 1977; Couch, 1987; Dixon & Massey, 1969; Downie & Heath, 1970; Kreyszig, 1970; Marascuilo & Serlin, 1988; Mendenhall & Sheaffer, 1973; Wampold & Drew, 1990). A few authors presented both positions on the issue of ranking data without rendering an opinion (Dinham, 1976; Howell, 1989; Kachigan, 1986; Kossack & Henscke, 1975; Spatz & Johnson, 1989).

It is interesting that Wilcoxon (1949), who proposed the Wilcoxon rank test (1945, 1947; the two independent samples version also was proposed by Mann & Whitney, 1947), believed "these methods do not utilize fully the information contained in the data" (p. 3). However, in a fascinating review of the early years of

nonparametrics, Noether (1984) chronicled the discovery in 1950 that Wilcoxon's "quick and dirty" or "approximate" rank test, at least asymptotically, was a very powerful test: "Far from wasting information, in many sampling situations the Wilcoxon tests make better use of the available information than the corresponding t tests, provide exact tests when the null hypothesis is true, and are considerably less sensitive to outliers among observations than the t procedures" (p. 175). Similarly, Noether cited Wolfowitz (1949), who published a survey of nonparametric methods, as saying as early as 1946 that "the only kind of information a nonparametric procedure is likely to waste is information that is unavailable anyway" (p. 175).

In recent years, Monte Carlo studies (e.g., Blair & Higgins, 1985b; Blair, Higgins, & Smitley, 1980) have shown that many nonparametric rank tests outperform their parametric counterparts under many nonnormal distributions, a subject to be discussed more fully, essentially negating the argument that converting to ranks throws away important information. (See Gibbons, 1971, for an intriguing opinion on such power comparisons.) Furthermore, many variables encountered in education and psychology that are treated as interval in scale may be better justified as ordinal in scale. To the extent that these variables are indeed ordinal, the loss-of-information issue vanishes.

The Robustness Issue

In education, psychology, and related disciplines, researchers often are confronted with applying the ANOVA to data of undetermined characteristics. This might lead to a violation of an underlying assumption of the statistic. If this occurs the results may be due to the abrogation of the assumption, rather than the presence or absence of a significant treatment.
The underlying assumptions of the ANOVA usually given are (a) the population is sampled randomly, (b) the components that contribute to the total variance are additive, (c) there is homogeneity of variance among the groups, and (d) the parent population is normally distributed (Hildebrand, 1986; Winer, 1971). It is generally agreed in statistical literature that to make an externally valid interpretation of the ANOVA the scores must be independent.

The ANOVA is also very sensitive to the assumption of homogeneity of variance (Bishop, 1976; Box, 1954; Brown & Forsythe, 1974; Wilcox, Charlin, & Thompson, 1986). In a one-way ANOVA layout, Scheffé (1959) showed that if three treatments of sample sizes 9, 5, and 1, respectively, were compared and the level of heteroscedasticity was expressed in the ratio of 1:1:3 (there was homogeneity of variance among the first two treatments, but both were one third the variance of the third treatment), then with nominal alpha at 0.05 the actual alpha level inflated considerably to 0.17. If sample sizes are equal and the variances are slightly heterogeneous, the Type I error rate of the one-way ANOVA inflates to a lesser degree (Box, 1954; Goodard & Lindquist, 1940; Hornsnell, 1953; Lindquist, 1953; Welch, 1937). For example, Scheffé (1959) reported that in the same situation, but with the three treatments at sample size 5 each, the actual alpha level inflated only slightly to 0.059. With moderate (1:1:6) or large (1:1:12) heterogeneous variances (or in conditions such as 1:3:4, Randolph & Barcikowski, 1989), however, the one-way ANOVA inflates to a greater degree, even with equal sample sizes (Rogan & Keselman, 1977; Tomarkin & Serlin, 1986). Moreover, the inflations become more severe with an unequal number of observations in each cell and moderate or large heterogeneous variances, and become even worse if the cell with the largest n has the smallest variance, and the cell with the smallest n has the largest variance (Box, 1953; Snedecor & Cochran, 1980).

Population normality is not considered as vital as the first three assumptions. The ANOVA is claimed to be quite robust to "mild" departures from normality (Baker, Hardych, & Petrinovich, 1966; Box, 1954; Cochran, 1947; Cochran & Cox, 1950; Fisher, 1922; Goodard & Lindquist, 1940; Guilford & Fruchter, 1978; Hack, 1958; Havlicek & Peterson, 1974; Lunney, 1970; Mandeville, 1972; Pearson, 1929; Rider, 1929; Scheffé, 1959). For a further review of this position, see Glass et al. (1972) and Ito (1980).

A debate arose concerning the robustness of certain parametric tests to mild departures from normality. Bradley (1968) argued that the robustness of the t test and F test was suspect. Under nonnormal distributions such as the mixed normal, or L shape, the parametric tests performed quite poorly. Glass et al. (1972) retorted, "Incautious statements concerning the robustness of the ANOVA to non-normality could send applied statistics off on a rerun of the unproductive 1950s stampede to nonparametric methods" (p. 56). However, they confined their comments to Bradley's (cited in Glass et al.) earlier work, which showed the tests performed poorly at very small alpha levels. Bradley (1978, 1980a, 1980b, 1980c, 1982) reiterated that the mixed-normal distribution occurred frequently throughout the various disciplines of science, and reliance on the robustness of the t test or the ANOVA is an action based on faith or myth, rather than on prudent statistical practice. An example of a mixed-normal distribution is depicted in Figure 1.

Bradley's notion of the nonnormal distribution shape of real-world data frequently encountered in education and psychology was supported by Blair (1980),

FIGURE 1. Mixed-normal distribution (horizontal axis: Score, tick marks from 97 to 130; vertical axis: Frequency)

Micceri (1989), Still and White (1981), and Tan (1982). For example, Micceri obtained frequency distributions from 440 large-sample achievement and psychometric measures used by researchers from 1982 to 1984. The data sets were the basis of refereed published research in 11 journals and other sources. Micceri found that fewer than 1% of the data sets displayed tail weights and symmetry similar to the Gaussian or normal distribution. All the data sets were found to be nonnormal according to the Kolmogorov-Smirnov test of normality at the 0.01 alpha level. Among the most prevalent nonnormal distribution shapes found were the "digit preference," "multimodality and lumpy," and "extreme asymmetry" shapes. Figure 2 is a histogram, given by Micceri (1986), of scores from an achievement measure whose shape he called multimodal and lumpy. Figure 3 from Micceri is a histogram of scores from a psychometric measure that depicts extreme asymmetry.

Micceri (1989) noted little overlap in the types of distributions that occur with real-world data and those selected for study in the classical study by Boneau (1960) demonstrating the robustness of the t test to nonnormality. Micceri concluded that the robustness issue remains unresolved, because "almost none of these comparisons occurs in real life" (p. 164).

In an unpublished Monte Carlo study on the robustness of the independent and dependent samples t tests under the eight most prevalent nonnormal distributions described by Micceri (1986), Sawilowsky and Blair (1990) demonstrated that the t tests were remarkably robust to digit preference, and multimodality and lumpiness, with the Type I error rates obtained from the simulation almost identical to nominal alpha. Nonrobust results were obtained primarily under distributions with extreme skew.
The t tests, however, were reasonably robust even under extreme skew when (a) sample sizes were equal; (b) sample sizes were fairly large, such as at least 20 or 30; and (c) tests were two tailed rather than one tailed. Blair (1980) and Blair and Higgins (1980a, 1980b, 1981) pointed out that

FIGURE 2. Multimodality and lumpiness achievement measure (horizontal axis: Score, tick marks from 10 to 40; vertical axis: Frequency)

FIGURE 3. Extreme asymmetry psychometric measure (vertical axis: Frequency)

"resolution of these two diverging points of view is not easy, primarily because there are no commonly agreed upon standards as to what constitutes robustness and what does not" (Blair & Higgins, 1981, p. 125). Nevertheless, the selection of nonparametric tests is obvious to the researcher who is convinced of the nonrobustness of the t test and ANOVA. Furthermore, the researcher who insists on the robustness of the t test and ANOVA should be aware that nonparametric tests are also robust, and they are often more powerful under nonnormality.

Robustness of Type II Errors

A consequence of violating the normality assumption, even though the test maintains the nominal Type I error rate in the absence of treatment effects, is that the test may demonstrate an erratic power function. Thus, the robustness issue is related not only to Type I error, but also to Type II error, the complement of the power of a statistical test.

Glass et al. (1972), in reviewing numerous robustness studies, concluded that the ANOVA is a powerful statistic even in the presence of nonnormal data, precluding the need for alternative statistics. This decision was made through an inspection of the characteristics of the power function of the ANOVA under many nonnormal distributions and sample sizes. For example, the power function of the ANOVA reached similar levels regardless of whether the data were sampled from populations that were normal, slightly skewed, moderately skewed, extremely skewed, leptokurtic, or platykurtic. Similarly, under transformations such as the square root, reciprocal, and rectangular, the ANOVA remained relatively consistent in its power function. The Glass et al. (1972) review warranted the position that the ANOVA is consistent across the various types of distributions and treatment sizes investigated.

It would appear that this led Glass et al., and Boneau (1960, 1962) before them, to consider the ANOVA to be essentially distribution free (Blair et al., 1980). Similarly, Harwell (1988) noted, "ANOVA is, in many ways, a 'distribution-free' procedure, because it can validly be performed in the face of massive assumption violations" (p. 36). This does not preclude the need for alternative tests, however, because Scheffé (1959) had pointed out earlier that "the question of whether F tests preserve against non-normal alternatives the power calculated under normal theory should not be confused with that of their efficiency against such alternatives relative to other kinds of tests" (p. 351).

To make conclusions regarding the relative power of the ANOVA, it is necessary to compare its power directly with competing alternative tests. However, the paucity of nonparametric counterparts to the many different layouts possible in factorial ANOVA in experimental design has heretofore presented a major obstacle in making direct power comparisons. Nevertheless, the argument for avoiding nonparametric tests in consideration of the robustness of parametric tests was clearly unwarranted.

The Power Issue

Concerning the t test, whose distribution is equal to the square root of the F distribution with one degree of freedom in the numerator, Hemelrijk (1961) demonstrated that at the cost of establishing its robustness against violations of population normality the t test failed to maintain the characteristic of being the Uniformly Most Powerful Unbiased (UMPU) test (but see the final paragraph of this section). The early theoretical studies of Dixon (1954) and Hodges and Lehmann (1956), and empirical studies by Chernoff and Savage (1958) and Neave and Granger (1968), brought forth evidence supporting the position that the simple rank test is very powerful, and in many cases rivals or outperforms the t test under nonnormality.
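A small-sample power comparison of this kind can be sketched as a Monte Carlo simulation. The example below is a minimal illustration, not a reproduction of any cited study. The mixed-normal population (90% N(0, 1) and 10% N(0, 10²)), the shift of 1.0, the 30 observations per group, and the use of the large-sample critical value 1.96 for both statistics are all assumptions chosen for brevity, and all function names are hypothetical.

```python
import math
import random

def t_statistic(x, y):
    """Pooled-variance two-sample t statistic."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    ssx = sum((v - mx) ** 2 for v in x)
    ssy = sum((v - my) ** 2 for v in y)
    sp2 = (ssx + ssy) / (nx + ny - 2)
    return (my - mx) / math.sqrt(sp2 * (1.0 / nx + 1.0 / ny))

def rank_sum_z(x, y):
    """Wilcoxon rank-sum statistic for y, standardized by its null mean and SD.

    Continuous data are assumed, so no correction for ties is applied.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    w = sum(rank for rank, (_, grp) in enumerate(pooled, start=1) if grp == 1)
    nx, ny = len(x), len(y)
    mu = ny * (nx + ny + 1) / 2.0
    sigma = math.sqrt(nx * ny * (nx + ny + 1) / 12.0)
    return (w - mu) / sigma

def mixed_normal(rng, p_wide=0.10, wide_sd=10.0):
    """One draw from an assumed mixed-normal (contaminated normal) population."""
    sd = wide_sd if rng.random() < p_wide else 1.0
    return rng.gauss(0.0, sd)

def rejection_rates(shift, n=30, reps=2000, crit=1.96, seed=1990):
    """Proportion of two-tailed rejections for the t test and the rank-sum test."""
    rng = random.Random(seed)
    t_rej = w_rej = 0
    for _ in range(reps):
        x = [mixed_normal(rng) for _ in range(n)]
        y = [mixed_normal(rng) + shift for _ in range(n)]
        t_rej += int(abs(t_statistic(x, y)) > crit)
        w_rej += int(abs(rank_sum_z(x, y)) > crit)
    return t_rej / reps, w_rej / reps

t_power, rank_power = rejection_rates(shift=1.0)
print(f"t test: {t_power:.2f}, rank-sum test: {rank_power:.2f}")
```

Calling `rejection_rates(shift=0.0)` gives the corresponding Type I error estimates, which is a useful check before the power figures are interpreted. Using 1.96 rather than the exact t critical value slightly favors the t test, so the approximation does not work against it.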
To cite a few recent Monte Carlo small-samples power studies, Blair (1980), Randles and Wolfe (1979), and Smitley (1981) found that for various nonnormal distributions, the Mann-Whitney U test (Mann & Whitney, 1947, also known as the Wilcoxon Rank Sum Test) was more powerful than the two independent-samples t test. Similarly, Blair and Higgins (1985a) and Randles and Wolfe (1979) found that the Wilcoxon Signed Rank test (WSR), under many nonnormal distributions, was much more powerful than the paired-samples t test.

In a Monte Carlo study, Rasmussen (1985, 1986) compared the WSR test to the t test, but came to the opposite conclusion. Rasmussen applied a data transformation maximizing homoscedasticity and within-group normality prior to conducting the t test. He found that in this situation the t test was more powerful than the WSR test. He concluded that the transformed t test was more powerful than the nonparametric alternative, regardless of the shape of the distribution. However, while the data were transformed prior to the t test, the data remained untransformed prior to the WSR test. The unequal treatment given to each statistic was a major delimitation of Rasmussen's study. Transformations historically have been associated with parametric tests rather than with nonparametric tests, because the primary focus was on Type I error properties. There is nothing to be gained regarding Type I error in transforming the data prior to a distribution-free statistic. With regard to Type II error, however, nonparametric tests are indeed dependent


on the shape of the population. To compare the relative power of the t and the WSR tests, appropriate transformations should be carried out on the data before performing both the t and the WSR statistic. Furthermore, Rasmussen limited the distributions investigated to functions of the normal curve. Therefore, the findings should not be generalized to distributions that are not so amenable to transformation to normality.

Thus, it has been shown that the parametric t test may be much less powerful than its nonparametric counterparts when population normality is violated. These studies suggested that the ANOVA may perform similarly when compared to its nonparametric counterparts. When the assumption of normality is not met, the ANOVA also loses its distinction of being the UMPU test.

As a final point on this issue, it should be noted that most of the textbook authors just reviewed erroneously stated that when the assumption of normality is fully met nonparametric tests must be less powerful than the t test. In fact, Hoeffding (1952) and Lehmann and Stein (1959) demonstrated that the randomization test, mentioned earlier, is equally efficient as the t test under normality. Further, it makes no assumption regarding normality. (It should be noted, however, that the permutation test is not robust to violations of the assumption of homogeneity of variance; Boik, 1987.) The difficulty in calculating the randomization test for all but the smallest sample sizes, however, limits its practical usage.

Comparative Power of Nonparametric Analysis of Variance

Given K independent samples, the Kruskal-Wallis (KW) test (Kruskal & Wallis, 1952), which is based on the two-sample Mann-Whitney U test, will substitute for the one-way ANOVA. The Kruskal-Wallis test has been shown to be robust to departures from normality (Kruskal, 1952). (When the assumption of homogeneous variances is violated, however, the KW is not as robust.
With equal sample sizes the KW inflates slightly; with unequal sample sizes the Type I error inflations become quite severe; Feir-Walsh & Toothaker, 1974; Keselman, Rogan, & Feir-Walsh, 1977; Tomarkin & Serlin, 1986.) As a guide to its power properties, Andrews (1954) and Hodges and Lehmann (1956) found that the ARE of the KW test never falls below 0.864 and can be infinitely large in comparison to the ANOVA under continuous distributions. These findings are valid when the distribution functions have identical shapes but differ only in their means (Conover, 1980, p. 237). The AREs of the KW test relative to the F test for the normal, uniform, and double exponential distributions are 0.955, 1.0, and 1.5, respectively.

Similarly, the Friedman test (Friedman, 1937), which is based on the Sign test, may be used in place of the ANOVA in randomized complete block designs. This layout assumes there is only one observation per cell, and no interaction is present. (For designs with one observation per cell and an interaction present, see Hedgeman & Johnson, 1976.) Studies of the characteristics of the Friedman test indicate that it is also robust to departures from normality (Noether, 1967). (However, the Friedman test is sensitive to the assumption of homogeneity of variance. As a result of their Monte Carlo study, Harwell & Serlin, 1989b, suggested avoiding the Friedman test in the presence of heterogeneity.) Van Elteren and Noether (1959) found that the ARE of the Friedman test relative to the F test has a lower limit of 0.576 and can be as high as infinity. Hollander and Wolfe (1973), in calculating the AREs for the normal, uniform, and double exponential distribution, found

much higher AREs. For example, with the number of groups of 2, 5, 10, and 50, under the normal distribution, the AREs were 0.637, 0.796, 0.868, and 0.936, respectively. Under the uniform distribution, the AREs were 0.667, 0.833, 0.909, and 0.980, respectively. For the double exponential distribution, the AREs were 1.00, 1.250, 1.364, and 1.471, respectively. Feir-Walsh and Toothaker (1974), McSweeney and Penfield (1969), Penfield and Koffler (1985), and Srisukho (1974) are additional references on the comparative power of the Friedman and KW tests to the ANOVA.

The Availability Issue

The KW and Friedman tests apply only to simple layouts in data analysis. The ANOVA is a flexible procedure that can be used in a variety of sophisticated designs, such as to test complex interaction effects. Historically, however, it has been maintained there are no satisfactory nonparametric tests for interaction in the analysis of variance, and even more so for tests of higher order interactions (e.g., A x B x C) in hierarchical designs. Bradley (1968), in commenting on the available nonparametric tests for interaction, indicated that they were complicated and hard to compute, made a multitude of stringent assumptions, and most important, lacked statistical power.

Interaction Effects in Nonparametric ANOVA

Kleijnen (1987) succinctly defined interactions: "Intuitively, 'interactions' among factors means that the effect of a factor also depends on other factors" (p. 28). The nonparametric tests of interaction to be reviewed here are based on ordinal level information, such as the median or ranks. In the former case an interaction can be viewed the same as in parametric tests, except the location parameter is different. The interaction hypothesis tested refers to the difference in the (AB) medians instead of the (AB) means. In the latter case, the interaction hypothesis tested is that the (AB) population distributions are identical.
This hypothesis is also sensitive to shift in location parameter and thus is comparable to the hypothesis tested in parametric tests. (Regarding the hypothesis being tested, however, see Brunner & Neumann, 1984; and Noether, 1981.)

Nonparametric Tests of Interaction

Early nonparametric tests for interactions developed after Bradley's (1968) comments included the Puri and Sen (1969) aligned ranks approach, which was derived from Hodges and Lehmann (1962) and Hajek (1968); the Patel and Hoel (1973) technique; and the Scheirer, Ray, and Hare (1976) expansion of the KW test. The Puri and Sen approach is conditionally distribution free, is limited to balanced designs, and assumes there are no tied observations. This approach, as with the Patel and Hoel technique, is theoretically rigorous but difficult to compute in applied situations (Iman & Conover, 1976; Rineman, 1983), which is often a major consideration for the researcher. The Scheirer et al. technique is less promising. An investigation of its properties by Toothaker and Chang (1980) demonstrated that the resulting tests do not "control alpha at the set value nor are they powerful in the presence of effects other than those being tested" (p. 174), and that the technique should not be considered a viable alternative to the ANOVA.

Collapsed-Reduced Test

After these early attempts, Bradley (1979) introduced a promising technique. The data are entered into a matrix where they are collapsed and reduced. Then, the appropriate nonparametric test of main effects is applied to the collapsed-reduced data, which, in effect, approximates a test of interaction. This approach can be used to test interactions of any order and is easily performed by hand calculations. The procedure assumes there are no tied observations, and it is limited to the balanced design. Also, the method of entering the original data into the matrix is not independent of the test statistic. An insightful researcher could systematically and purposefully place the data into matrix format in a manner to sway the results in a desired direction. Bradley (1979) conceded that "such contrivances spoil the test" (p. 182). Furthermore, Bradley's technique would seem to preclude exact replication. Two researchers could have the identical set of data, apply the Bradley technique, and by merely placing the data into the matrix in a different order they would obtain different numerical results.

Approximate Randomization Test

Another alternative to the ANOVA is the Still and White (1981) approximate randomization test. This approach is based on a Fisher-Pitman type randomization test that considers all possible permutations. This yields an exact test with good power properties, but it is too impractical to perform. The approximate randomization test seeks to maintain the power of the randomization test, but relies on Monte Carlo techniques to sample the possible permutations to generate a sampling distribution from the observations. This test is versatile and can be applied in complex layouts of experimental design. In a Monte Carlo study, Still and White (1981) found that under population normality the approximate randomization test was robust, performing nearly as well as the ANOVA.
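The idea of sampling permutations rather than enumerating them can be sketched as follows. This is a generic illustration for a balanced 2 x 2 layout and is not claimed to be Still and White's exact algorithm: the function names, the choice of the interaction contrast of cell means as the test statistic, the removal of row and column means before shuffling, and the default of 999 sampled permutations are all assumptions made for this example.

```python
import random


def interaction_contrast(cells):
    """Interaction contrast m11 - m12 - m21 + m22 of the four cell means."""
    m = [sum(c) / len(c) for c in cells]
    return m[0] - m[1] - m[2] + m[3]


def approx_randomization_p(cells, n_perm=999, seed=0):
    """Monte Carlo p value for the A x B interaction (hypothetical helper).

    cells = [A1B1, A1B2, A2B1, A2B2], all of the same size.
    """
    n = len(cells[0])
    row_means = [sum(cells[0] + cells[1]) / (2 * n),
                 sum(cells[2] + cells[3]) / (2 * n)]
    col_means = [sum(cells[0] + cells[2]) / (2 * n),
                 sum(cells[1] + cells[3]) / (2 * n)]
    # Assumption: remove the estimated main effects before permuting.
    aligned = [[x - row_means[k // 2] - col_means[k % 2] for x in cell]
               for k, cell in enumerate(cells)]
    observed = abs(interaction_contrast(aligned))
    pooled = [x for cell in aligned for x in cell]
    rng = random.Random(seed)  # fixed seed so a run can be replicated
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled = [pooled[k * n:(k + 1) * n] for k in range(4)]
        if abs(interaction_contrast(shuffled)) >= observed - 1e-12:
            hits += 1
    # Count the observed arrangement itself among the permutations.
    return (hits + 1) / (n_perm + 1)


# Made-up data with a strong crossover interaction.
strong = [[1, 2, 3, 4, 5], [11, 12, 13, 14, 15],
          [11, 12, 13, 14, 15], [1, 2, 3, 4, 5]]
p_strong = approx_randomization_p(strong)
```

Fixing the seed and reporting the number of sampled permutations is one simple way to make such an analysis repeatable.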
Under a mixed-normal distribution, in 2 x 2 factorial and 2 x 2 repeated-measures designs, the approximate randomization test maintained good robustness properties, while outperforming the parametric test in terms of statistical power. A major consideration of the Still and White (1981) approximate randomization test as an alternative to the ANOVA is that the procedure cannot be performed without a computer. Even though all possible permutations are not performed, the required amount of computer time and resources in most cases precludes the researcher from doing hand calculations. The IBM System 370/168 used in the Still and White study rivaled the state of the art in computing facilities of the late 1970s and early 1980s, and yet, due to the nature of permutations, a 2 x 2 layout with five observations per cell required significant computer time. Currently, there are no computer statistics packages that offer this procedure. Also, the Still and White approach makes replication difficult, because there are many ways to sample all possible permutations.

Moment Approximation Test

Berry and Mielke (1983) explained their moment approximation procedure as follows: "[Calculate] the test statistic based on the realized data set, calculating the

necessary exact moments of the sampling distribution of the test statistic, and integrating a continuous distribution representation to get an approximate probability value" (p. 203). Thus, the moment approximation technique relies on the mean, variance, and skew of the sampling distribution. Berry and Mielke's (1983) study investigated the robustness and power of this procedure as an analog to the one-way ANOVA and found it to be robust and powerful. Their approach required a computer and programming expertise, but relatively little computer time. Additional studies are necessary to determine the characteristics of their approach in the factorial design. Further, it is not known how this technique performs under many nonnormal distributions of interest, including the Cauchy, which has undefined moments.

Rank Transform Test

Conover and Iman (1976), Iman (1974), and Iman and Conover (1976) advanced the technique of the rank transform (RT), which is performed by converting original observations of the entire data set to ranks and then calculating the parametric statistic on the ranks. Conover and Iman (1981) call this type of ranking process RT-1. The RT statistic, which is conditionally distribution free, is then referred to the usual table of critical values. To illustrate the power of the RT, Iman found that for a typical block (main) effect with nominal alpha set at 0.05 under the mixed-normal distribution, the ANOVA yielded a power level of 0.32. The corresponding level for the RT was 0.91, almost triple the power of the ANOVA. The RT was shown to have good asymptotic properties under the null hypothesis (Hora & Conover, 1984), in randomized complete block designs (Hora & Iman, 1988; Iman, Hora, & Conover, 1984), in incomplete block designs (Kepner & Robinson, 1988), and in the presence of main effects but no interaction in the balanced two-way layout (Thompson & Ammann, 1989).
The RT was proposed as a "bridge between parametric and nonparametric statistics" (Conover & Iman, 1981), and Harwell (1988, in press) and Harwell and Serlin (1989a) recommended it as a test for interaction. SAS (1985, 1987), a popular mainframe and microcomputer statistics package, and the International Mathematical and Statistical Libraries (IMSL, 1987), a FORTRAN statistical subroutine library, also promoted it. Nevertheless, the bridge between parametric and nonparametric techniques appears to cross troubled waters, as Fligner (1981) raised a caution flag. His position was that the RT should not be used in any new application until that application had been investigated. This concern was amplified by Blair and Higgins (1985a), who more fully investigated the Iman (1974) study, in which the matched-pairs RT was found to be robust and more powerful than the t test. A more thorough investigation by Blair and Higgins revealed that the RT applied to the matched-pairs t test had severe limitations. When samples were correlated, such as in a repeated-measures design, the RT was often less powerful than the WSR and the t test. Blair and Higgins (1985a) found that this loss of power became acute as a function of increasing instrument reliability. Because the development of measurement in education and related disciplines has advanced to the point that instruments with high reliability are not only possible, but are often the norm, this would seem to preclude the use of the RT with matched pairs. Agresti and Pendergast (1986) and Kepner and Robinson (1988) also investigated

the RT, along with the ANOVA, Friedman, and a proposed Rank Transform Hotelling's T² statistic in repeated-measures randomized complete block designs. The RT did not lose power in these layouts so long as the correlation between samples was low. Asymptotic results by Thompson and Ammann (in press) (as well as asymptotic results by Akritas, 1988, on a related rank transform statistic due to Hora & Conover, 1984), however, showed that these limited favorable findings were not generalizable to the factorial repeated-measures layout, because the RT test for interaction broke down in the presence of main effects.

Subsequent Monte Carlo studies have shown that the RT is not robust, nor is it a powerful alternative to the factorial ANOVA. Several factors, such as how the treatment effects are modeled, the magnitude of the treatment, and the number of nonnull treatments present, adversely affect the RT. Lemmer (1980) noticed the anomaly of rejection of null main effects in a Monte Carlo study of the RT as a test for interaction in the 4 x 3 layout. In the 2 x 2 x 2 layout, Sawilowsky (1985b) showed that the Type I error and power properties of the RT were dependent on how the treatment effects were modeled. Blair, Sawilowsky, and Higgins (1987) demonstrated in the 4 x 3 layout that the magnitude of the treatment effect can yield Type I error rates inflated as high as 1.0 in the presence of both a column and row effect. In the 2 x 2 x 2 layout, Sawilowsky, Blair, and Higgins (1989) showed that as the number of nonnull effects present increased, the Type I error rate of null effects inflated to 1.0, and the power properties of the nonnull effects were often less than half the power obtained by the ANOVA test. In the Blair et al. (1987) study, an analysis of expected cell means in the 4 x 3 layout indicated that the nonlinear RT introduced interactions when none were modeled, causing extreme Type I error inflations.
The same analysis also suggested that, as an exception, the RT appeared to test the same hypothesis as the parametric test in the 2 x 2 layout. Thus, the RT as a test for interaction in the simple but common 2 x 2 ANOVA remained as the bridge between the parametric test and the rank transform in analysis of variance in experimental design. In a Monte Carlo study, Sawilowsky (1989) found the RT to be robust in the 2 x 2 layout under the normal distribution. For large treatment effect sizes, however, the RT demonstrated a considerable lack of power relative to the parametric test. Therefore, the RT should be avoided even in the 2 x 2 layout.

The RT also has been proposed for use in analysis of covariance (ANCOVA) by Conover and Iman (1981, 1982). The RT ANCOVA, where both the independent and dependent variables are ranked, was investigated by Olejnik and Algina (1984), who found favorable results compared to ANCOVA; Harwell and Serlin (1988), who found it to be lacking in comparison to Hettmansperger's (1984) technique and their proposed statistic; and Stephenson and Jacobson (1988), who found it less powerful in comparison to their proposed statistic (a hybrid of ranking the dependent variable but not the independent variables) for certain hypotheses, and less useful if the experiment required retaining the actual values for the independent variables. The RT was proposed in multiple regression (Iman & Conover, 1979), discriminant analysis (Conover & Iman, 1980), multivariate tests of the independence of two sets of variables (e.g., Rao, 1951, investigated by Habib & Harwell, 1989, who found good results), and other applications (Worsley, 1977).

Random Normal and Expected Normal Scores Transform Test

A potentially useful modification to the RT is to retain it as the engine of a random normal scores or expected normal scores type transform. The RT, under certain nonnormal distributions, may have an ARE relative to the ANOVA test as high as infinity. Under the normal distribution, however, the ARE falls below 1.0. The substitution of normal scores for the ranks will bring the ARE up to 1.0 under the normal distribution (Bradley, 1968). Thus, the original observations are ranked and replaced by normal scores, and then the usual parametric procedure is performed on the resultant values. Fisher and Yates (1949) and Bell and Doksum (1965) suggested the use of a random normal scores transform (RNST), that is, variates drawn randomly from the Gaussian distribution. This procedure practically eliminates "all population assumptions" (Bradley, 1968, p. 160). Under nonnull conditions, however, the assumption of independence is violated, as with any nonlinear transformation. Bradley (1968) suggested that a more powerful test could be constructed if the variates were drawn specifically from a normally distributed population with a mean of 0 and a standard deviation of 1.0. In the ANOVA test, the denominator of the F ratio, the Mean Square Within (MSW) or error term, estimates the population variance. In determining this particular F ratio the MSW is predetermined to be 1.0, reducing the ratio to the Mean Square Between (MSB) or treatment effect term. This replaces the F distribution as the referent with the more powerful chi-square, divided by the degrees of freedom of the MSB (Hajek & Sidak, 1967; Puri, 1964). (For an analysis of the RNST test under the F distribution, see Sawilowsky, 1985a.) The work of Hoeffding (1951) and Terry (1952) suggested another modification to the RT.
They suggested that the substitution of expected normal scores (ENST) for the ranks might compete favorably with its parametric counterpart. Expected normal scores, the expected values of the order statistics of a normal sample, are constant values that depend on the sample size n. That is, if a normal distribution was sampled randomly, ordered, recorded, and replaced, and this process was repeated an infinite number of times, the average of each position of n is the expected normal score. Gibbons (1985b) and Winer (1962) suggested limiting the ANOVA on normal scores to when "the population to which inferences are to be made is considered to be one in which the criterion scores are normally distributed" (Winer, 1962, p. 623). Li (1964) recommended, without such restrictions, using the ENST in several ANOVA layouts, including one-factor designs, repeated-measures designs, and the 3 x 4 factorial layout. Conover (1971) and Gibbons (1985b) noted that the ENST is easily adapted to fit any hypothesis associated with experimental design. In a Monte Carlo study, Lu and Smith (1979) investigated the properties of the ENST in the context of a one-way ANOVA. They found the ENST to be robust and a powerful alternative to the parametric one-way ANOVA.

Sawilowsky (1985b, 1989) investigated the small-samples properties of the RNST and ENST statistics in the 2 x 2 x 2 layout under the normal and various nonnormal distributions. The RNST and ENST tests were found to be more robust and more powerful than the RT. The RNST and ENST, however, were not as robust or powerful as the ANOVA, particularly when sample sizes were small (n = 2 observations per cell). Although an increase in sample size tended to ameliorate the lack of power, these two tests were never serious competitors to the ANOVA.
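The replacement of ranks by normal scores can be sketched as follows. Exact expected normal scores require tables or numerical integration, so this sketch substitutes Blom's approximation, the standard normal quantile at (rank - 3/8) / (n + 1/4); that substitution, the use of midranks for ties, and the function names are assumptions of this example rather than part of the studies reviewed above.

```python
from statistics import NormalDist


def midranks(values):
    """Ranks 1..n, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        average = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = average
        start = end + 1
    return ranks


def approx_normal_scores(values):
    """Replace each observation by an approximate expected normal score."""
    n = len(values)
    standard_normal = NormalDist()
    # Blom's approximation to the expected normal order statistics.
    return [standard_normal.inv_cdf((r - 0.375) / (n + 0.25))
            for r in midranks(values)]


scores = approx_normal_scores([12, 3, 7, 20, 9])
```

The usual parametric procedure would then be carried out on `scores` in place of the original observations.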

Adjusted Rank Transform Test

An aligned ranks type modification to the RT, in the spirit of Aubuchon and Hettmansperger (1984), Hettmansperger (1984), and Hettmansperger and McKean (1983), made the RT more robust and a powerful alternative to the ANOVA. Reinach (1965, 1966) demonstrated in the randomized block design (which assumes no interaction) that rank tests such as the RT are confounded by the presence of both nonnull main effects and an interaction. This suggests that in a factorial layout the main effects should be treated as nuisance parameters and removed from the model. The residual, which contains the interaction, remains to be analyzed. Blair and Sawilowsky (1990) found that in the 4 x 3 and 2 x 2 layouts the adjusted RT statistic was slightly liberal with respect to Type I error when samples were small (e.g., n = 2 or n = 5 observations per cell). For example, with nominal alpha at 0.05, the actual alpha rose slightly to 0.07 under certain distributions. The power properties, however, were excellent. As sample size increased it became more robust and increased its power advantage over the ANOVA.

The technique is performed in the 2 x 2 layout as follows: Subtract the mean (or a trimmed mean, winsorized mean, or robust mean) of the column from the original observations in each column. Then, subtract the mean of the row from the observations of each row. This removes the presence of the two main effects, leaving only the effect due to the interaction. Of course, the researcher never knows what the real main effects were, but subtracting the mean is the best unbiased guess over the long run. The resultant values are pooled together and ranked (RT-1), and the ranks are returned to their respective cells. The procedure until this point is identical to that by McSweeney (1967).
The test statistic by McSweeney is a ratio of sums of squares referred to the chi-square distribution (or a Wilcoxon or Kruskal-Wallis statistic, depending on the layout; Marascuilo & McSweeney, 1977, pp. 422-425); here the ANOVA is performed. The adjusted RT is almost as convenient as the original RT and can be calculated easily by hand or with existing computer statistics packages. Blair and Sawilowsky (1990) did not investigate using this method for the analysis of main effects. Because the subtraction of nuisance parameters assumes that a linear model underlies the scores, it would appear that, at least for the balanced layout, the interaction could be removed also, leaving the residual main effects to be tested (Marascuilo & McSweeney, 1977, p. 414). It is not known, however, how such a test compares with other well-known nonparametric tests of main effects. Additional studies are necessary before use of this technique can be recommended in the analysis of main effects.

Other Aligned Ranks-Based Tests

Hettmansperger (1984) discussed modified aligned ranks tests based on the work of Hodges and Lehmann (1962). Adichie (1978), Puri and Sen (1971, 1973), and Sen and Puri (1977) extended these techniques. Aubuchon and Hettmansperger (1984) reviewed several related techniques subsumed under the general linear model, such as those by Huber (1981), Schrader and Hettmansperger (1980), and Sen (1980). Little is known about the small sample properties of the residual-based methods for regression models described in Hettmansperger (1984). These statistics are not pure aligned ranks tests in that they also involve the minimizing of a

dispersion measure that is dependent on the original data and their ranks. Hettmansperger and McKean (1983) found a modified aligned ranks test to be slightly liberal (inflated Type I error rates) and lacking in power. Harwell (in press) also found a similar test to be slightly liberal, especially for very small sample sizes, such as n = 3 per cell. These modified aligned ranks methods subtract an appropriate estimate of the block effect with the results ranked in each block separately. Conover and Iman (1981) called this an RT-2 type transformation. Marascuilo and McSweeney (1977) described the test statistic. Fawcett and Salter (1984) suggested following a modified aligned ranks method with the ANOVA. Their Monte Carlo study indicated that this procedure was a robust and powerful improvement over the original modified aligned ranks method and that it rivaled the ANOVA. Groggel (1987) compared the Fawcett and Salter technique to the ANOVA, RT, Friedman, and Quade (1979) modification of the Friedman statistic and found it to be superior except under the normal and uniform distributions, in which the ANOVA performed slightly better.

L Test

It was noted earlier that the Puri and Sen (1969) aligned ranks technique was considered difficult to compute in applied situations. This position applies to the exact form of the statistic. The Puri and Sen (1985) L statistic, however, is a large-sample approximation that is easy to compute and is applicable to ANOVA and many other types of analyses subsumed by a multivariate general linear model. Harwell (1990, in press) and Harwell and Serlin (1989a) presented the L test in trace criterion form (similar to a statistic in Meddis, 1984) by ranking the original scores and calculating various sums of squares on the ranks.
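For a balanced 2 x 2 layout, the ranking and sums of squares just described can be sketched as follows. The sketch uses the form L = (N - 1) x ABSS / TSS and the fabricated scores that appear in the application section of this article; the function names and the ordering of the cells are assumptions made for illustration, and the shortcut used for the interaction sum of squares holds only for the balanced 2 x 2 case.

```python
def midranks(values):
    """Ranks 1..n, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        average = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = average
        start = end + 1
    return ranks


def puri_sen_l_2x2(cells):
    """L in trace criterion form for cells = [A1B1, A1B2, A2B1, A2B2]."""
    n = len(cells[0])
    pooled = [x for cell in cells for x in cell]
    total_n = len(pooled)
    ranks = midranks(pooled)  # rank the original scores as one group
    rank_cells = [ranks[k * n:(k + 1) * n] for k in range(4)]
    means = [sum(c) / n for c in rank_cells]
    grand = sum(ranks) / total_n
    # Balanced 2 x 2 shortcut: ABSS = n * (interaction contrast)^2 / 4.
    contrast = means[0] - means[1] - means[2] + means[3]
    abss = n * contrast ** 2 / 4
    tss = sum((r - grand) ** 2 for r in ranks)
    return (total_n - 1) * abss / tss


A1B1 = [0, 10, 34, 38, 24, 46, 41, 27, 42, 44]
A1B2 = [45, 32, 14, 16, 9, 18, 28, 1, 17, 23]
A2B1 = [20, 15, 21, 25, 30, 19, 47, 33, 11, 13]
A2B2 = [43, 39, 26, 29, 35, 37, 40, 36, 4, 8]
L = puri_sen_l_2x2([A1B1, A1B2, A2B1, A2B2])
```

With these scores the function gives a value of about 4.34, which would be compared with the chi-square critical value on (A - 1)(B - 1) = 1 degree of freedom.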
The difference between this test and McSweeney's (1967) is the absence of an alignment procedure (an adjustment of subtracting out the row and column means, or some other estimate of the block effect, before ranking the original data) in the L test in trace criterion form. The test statistic is then referred to the chi-square distribution. In a Monte Carlo study of the univariate, completely randomized, factorial ANOVA layout, Harwell (in press) found the L test in trace criterion form to be robust and powerful in detecting small treatment effects. Because nuisance parameters are not removed, however, the question remains about the possible loss of power in testing for interactions in the presence of large main effects. Harwell and Serlin (1989a), in a Monte Carlo study using a single-sample multivariate multiple regression layout, found the test to be robust and its power properties competitive, as long as sample sizes were large. Their simulation results indicated that the fit of the L test in trace criterion form to the chi-square distribution was conservative. A satisfactory fit reflecting converging nominal Type I error rates with simulated results required sample sizes as large as N = 400. Also, the conservative nature of this test was reflected in its power properties. For example, with n = 20 and alpha = 0.01, under a symmetric and extremely light-tailed distribution, the ANOVA had a power rate of 0.512, and the power of the L statistic was only 0.130; under a symmetric and moderately heavy-tailed distribution, the power of the ANOVA was 0.568, and the power of the L statistic was only 0.146. The L statistic in trace criterion form closed the gap in power considerably at the same sample size for the 0.05 alpha level. With the larger sample sizes investigated (n = 40 and

n = 100), it was more robust and rivaled or outperformed the ANOVA at both alpha levels.

Extended Median Test

Shoemaker (1986) proposed an extension of the median test for ANOVA layouts. The test is similar to Wilson's (1956) test based on counts, which is not considered very powerful. Shoemaker's procedure tests the proportion of counts in each cell of a column that are above or below their row median. The test statistic is a Pearson chi-square with modified degrees of freedom. This technique had difficulty in preserving nominal alpha for null effects with small samples. The improvement was to subtract the column median from each observation, essentially treating the main effects as nuisance parameters and reducing the analysis to the residuals, a test for interaction. This modification was shown to be robust for small samples obtained from the t distribution with various degrees of freedom, and a powerful alternative to the ANOVA test with n = 5 per cell in a 3 x 3 layout. The test is easy to perform by hand or with any statistics package that provides the median.

Large- and Small-Samples Patel and Hoel Test

It was mentioned earlier that the Patel and Hoel (1973) technique is difficult to compute in applied situations. Krauth (1988) refined the Patel and Hoel statistic into large-samples and small-samples versions for the 2 x 2 layout. The large-samples version can be used in the presence of ties, but is not much easier to compute than the original statistic. The small-samples version assumes there are no ties and is not distribution-free. However, it is easier to compute. The test procedure is analogous to Fisher's Fourfold-Table test, and requires critical values from the hypergeometric distribution. Studies on the robustness and power of this technique have not yet appeared in the literature.
Testing Interactions in Experimental Design

Regarding the decline in the development and use of nonparametric techniques in the 1960s, Page and Marcotte (1966) noted that "there have been some impressive developments.... Yet the field does not appear involved in many exciting new research thrusts, and the special charisma of expectation about nonparametrics seems to have become faded in the last decade" (p. 517). Clearly, there has been a renewed excitement since their review. Many nonparametric techniques have been developed, including tests of interaction in factorial layouts. An important task that remains is choosing the most appropriate test of interaction in experimental design.

The best practice is to use the best test; it must be robust, powerful, versatile, and easy to compute by hand or with available computer statistics packages. The practice of using parametric tests when normality cannot be assumed, on the grounds of their robustness with respect to Type I error, is the subject of much debate. A defense of this practice based on their robustness with respect to Type II error is without merit. There are several considerations in recommending a nonparametric test of interaction. It must have competitive power properties for that apparently rare occurrence of normality, and must hold out the possibility for a gain in power

under many types of nonnormal distributions. Of the 10 recently developed techniques discussed in this review, the rank transform and random/expected normal scores tests should be ruled out based on poor Monte Carlo results. The sophistication of computer programming skills required also may limit the usefulness of the Still and White, and Berry and Mielke techniques until computer statistics packages incorporate them. Krauth's small-samples refinement of the Patel and Hoel technique is limited to the 2 x 2 layout and cannot be recommended without further investigation.

Which of the remaining five statistics is the best test? Each of them has been shown to perform well in comparison to the ANOVA, but it is not known how they compare to each other. A Monte Carlo study comparing their properties would help in selecting the best test. The comparative study also should investigate the following issues: (a) How do they perform in the presence of many tied observations? (b) What corrections for tied observations are available? (c) Can they be extended to unbalanced layouts? (d) Which techniques are amenable to the construction of confidence intervals and post hoc type analyses? (e) What are the effects of violations of homogeneity of variance?

Of the techniques that are most easily performed by hand (Bradley, Adjusted RT, Puri and Sen L, and Shoemaker tests), the Adjusted RT appears to reach desirable power properties with the smallest sample size. The modified aligned ranks test discussed by Hettmansperger is more difficult to compute than the other four tests. Although the Puri and Sen L and the Hettmansperger tests require moderate or large sample sizes, they are especially versatile and permit the testing of a wide variety of hypotheses subsumed under the general linear model.
With a repertoire of excellent nonparametric tests of interaction readily available, the researcher of the 1990s has viable options in experimental design that did not exist a decade and a half ago.

Application of Selected Techniques on Fabricated Data

In this final section the Bradley, Adjusted RT, Puri and Sen L, and Shoemaker tests for interaction are demonstrated with fabricated data in a balanced 2 x 2 layout with n = 10 per cell. (The Hettmansperger test is discussed with an application in Harwell, in press, and Hettmansperger, 1984.) Consider an experiment in which the researcher is interested in the interaction of two types of grading systems, A1 and A2, on the attitude of students at two levels of achievement, B1 and B2. Reliable and valid measurements have been collected on 40 students, with the scores shown in Table 1. Assume further that either the attribute being measured is not known to be normally distributed or that the scores are ranks. The ANOVA is shown in Table 2 for comparison purposes only. With alpha at 0.05, the critical F value for 1 and 36 degrees of freedom for the A x B interaction is 4.11. The obtained F value, 3.98, indicates that the interaction is not significant.

Bradley's Collapsed and Reduced Technique

If the layout were, for example, a 2 x 2 x 2, the first step would be to collapse the data over the third main effect. Here no collapsing is required, because the hypothesis of interest is the A x B interaction of a 2 x 2 layout. The first step, therefore, is to reduce the row scores by subtracting the first score in cell (AB)21 from the first

TABLE 1
Fabricated attitude scores collected on 40 students by grading system and achievement level

                        Achievement level 1 (B1)         Achievement level 2 (B2)
Grading system 1 (A1)   0 10 34 38 24 46 41 27 42 44     45 32 14 16 9 18 28 1 17 23
Grading system 2 (A2)   20 15 21 25 30 19 47 33 11 13    43 39 26 29 35 37 40 36 4 8

observation in cell (AB)11, followed by similar subtractions for the remaining nine sets of observations. This step is completed by performing the same type of reduction on the scores of cell (AB)22 and cell (AB)12. The results, in which the two rows are reduced, are shown in Table 3. The second step is to reduce the remaining values by subtracting the 10 scores in the second column from the scores in the first column, as shown in Table 4. The resultant values are difference scores that are submitted to the Wilcoxon Sign-Rank Test. The magnitudes of the values are ranked, and then the signs are restored. The ranks with the minority sign are summed, and the magnitude of that sum represents the obtained

TABLE 2
Summary table of ANOVA results for fabricated data

Source of variation    Sum of squares    DF    Mean square    F
Grading system                   12.1     1           12.1    0.070
Achievement level                40.0     1           40.0    0.231
2-way interaction               688.9     1          688.9    3.983
Residual                       6227.0    36          172.9
Total                          6968.0    39          178.7



TABLE 3
Bradley's technique: Step 1. Reduction of rows by subtraction

Grading systems 1 and 2 (A1 - A2)

Achievement level 1 (B1)    Achievement level 2 (B2)
 0 - 20 = -20               45 - 43 =   2
10 - 15 =  -5               32 - 39 =  -7
34 - 21 =  13               14 - 26 = -12
38 - 25 =  13               16 - 29 = -13
24 - 30 =  -6                9 - 35 = -26
46 - 19 =  27               18 - 37 = -19
41 - 47 =  -6               28 - 40 = -12
27 - 33 =  -6                1 - 36 = -35
42 - 11 =  31               17 -  4 =  13
44 - 13 =  31               23 -  8 =  15

T. As shown in Table 5, the obtained T is 6. The critical T at the 0.05 alpha level is 10, indicating the interaction is not significant.

Adjusted Rank Transform Statistic

The first step is to calculate the row (A1 and A2) and column (B1 and B2) means. They are as follows: $\bar{X}_{A1} = 25.45$; $\bar{X}_{A2} = 26.55$; $\bar{X}_{B1} = 27$; and $\bar{X}_{B2} = 25$. Next, subtract the row and column means from each observation as shown in Table 6. The results are ranked, as shown in Table 7 (the two tied ranks are assigned midranks), and then submitted to the usual ANOVA. The results are shown in Table 8. The critical F is 4.11, and the obtained F is 4.45, indicating that the interaction is significant.

Puri and Sen L Statistic

The original scores are ranked (Table 9). The necessary sum of squares can be computed by hand, or the ranks can be submitted to any computer program that

TABLE 4
Bradley's technique: Step 2. Reduction of columns by subtraction to obtain difference scores

Achievement levels 1 and 2 (B1 - B2), Grading systems 1 and 2 (A1 - A2)
-20 -    2  = -22
 -5 -  (-7) =   2
 13 - (-12) =  25
 13 - (-13) =  26
 -6 - (-26) =  20
 27 - (-19) =  46
 -6 - (-12) =   6
 -6 - (-35) =  29
 31 -   13  =  18
 31 -   15  =  16


TABLE 5
Bradley's technique: Step 3. Resultant values ranked by magnitude (sign preserved) and Wilcoxon Sign Rank test performed

Ranks: -6, 1, 7, 8, 5, 10, 2, 9, 4, 3
Minority sign: -6
Obtained T: 6
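The reduction steps summarized in Tables 3 to 5 can be sketched in a few lines. This is a minimal illustration for the balanced 2 x 2 layout only (no collapsing step); the function names are hypothetical, the order in which scores are paired is simply the order in which they are listed in Table 1, and zero differences are ignored because the procedure assumes no ties.

```python
def midranks(values):
    """Ranks 1..n, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        average = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = average
        start = end + 1
    return ranks


def bradley_2x2(a1b1, a2b1, a1b2, a2b2):
    """Return (difference scores, Wilcoxon T) for a balanced 2 x 2 layout."""
    reduced_b1 = [x - y for x, y in zip(a1b1, a2b1)]  # Step 1, column B1
    reduced_b2 = [x - y for x, y in zip(a1b2, a2b2)]  # Step 1, column B2
    diffs = [u - v for u, v in zip(reduced_b1, reduced_b2)]  # Step 2
    ranks = midranks([abs(d) for d in diffs])  # Step 3: rank magnitudes
    t_positive = sum(r for r, d in zip(ranks, diffs) if d > 0)
    t_negative = sum(r for r, d in zip(ranks, diffs) if d < 0)
    # T is the smaller of the two signed-rank sums.
    return diffs, min(t_positive, t_negative)


A1B1 = [0, 10, 34, 38, 24, 46, 41, 27, 42, 44]
A2B1 = [20, 15, 21, 25, 30, 19, 47, 33, 11, 13]
A1B2 = [45, 32, 14, 16, 9, 18, 28, 1, 17, 23]
A2B2 = [43, 39, 26, 29, 35, 37, 40, 36, 4, 8]
diffs, T = bradley_2x2(A1B1, A2B1, A1B2, A2B2)
```

Reordering the scores within a cell changes the pairing and therefore the difference scores, which is the replication concern raised earlier about this technique.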

performs the ANOVA and prints out the ANOVA table. (If the ANOVA were to be carried out, the test becomes the original Rank Transform.) The sum of squares (from Table 10) for the interaction (ABSS) is 592.9, and the total sum of squares (TSS) is 5,330.0. The statistic in trace criterion form (Harwell, in press; Harwell & TABLE 6 Adjusted Rank Transform: Step 1Subtraction of row and column means from original scores Achievement level 1 (Bl) Grading system 1 (Al) 0 - 25.45 - 27 10-25.45-27 34-25.45-27 38-25.45-27 24-25.45-27 46-25.45-27 41-25.45-27 27-25.45-27 42-25.45-27 44-25.45-27 20 - 26.55 - 27 15-26.55-27 21 - 2 6 . 5 5 - 2 7 25-26.55-27 30-26.55-27 19-26.55-27 47-26.55-27 33-26.55-27 11-26.55-27 13-26.55-27 Achievement level 2 (B2) 45 - 25.45 - 25 32-25.45-25 14-25.45-25 16-25.45-25 9-25.45-25 18-25.45-25 28-25.45-25 1-25.45-25 17-25.45-25 23-25.45-25 43 - 26.55 - 25 39-26.55-25 26-26.55-25 29-26.55-25 35-26.55-25 37-26.55-25 40-26.55-25 36-26.55-25 4-26.55-25 8-26.55-25

Grading system 2 (A2)

112

Nonparametric Tests of Interaction


TABLE 7 Adjusted Rank Transform: Step 2Rank results
Achievement level 1 (Bl) Score Rank 1 6 26.5 31 18 39 34 21 35 37 13 9 15 17 22 11 38 25 5 8 Achievement level 2 (B2) Score -5.45 -18.45 -36.45 -34.45 -41.45 -32.45 -22.45 -49.45 -33.45 -27.45 -8.55 -12.55 -25.55 -22.55 -16.55 -14.55 -11.55 -15.55 -47.55 -43.55 Rank 40 26.5 10 12 7 16 24 2 14 19 36 32 20 23 28 30 33 29 3 4

Grading system 1 (A 1)

-52.45 -42.45 -18.45 -14.45 -28.45 -6.45 -11.45 -25.45 -10.45 -8.45 -33.55 -38.55 -32.55 -28.55 -23.55 -34.55 -6.55 -20.55 -42.55 -40.55

Grading system 2 (A2)

Serlin, 1989a) is computed as follows: L = (N- 1) x ABSS/TSS = (40 - 1) x (592.9/5330.0) = 4.34 Because the L statistic in trace criterion form is distributed as a chi-square with (A - 1) (B - 1) degrees of freedom (Harwell, in press; Harwell & Serlin, 1989a), the critical value with alpha at 0.05 is 3.84. Thus, the interaction is significant. (Although an ANOVA style table with mean squares and F ratios is unnecessary, Table 10 is presented to show the relationship between the Puri and Sen L in trace criterion form and the Rank Transform statistic.) TABLE 8
Summary table of ANOVA results for Adjusted Rank Transform

| Source of variation | Sum of squares | DF | Mean square | F |
|---|---|---|---|---|
| 2-way interaction | 585.23 | 1 | 585.23 | 4.45 |
| Residual | 4735.95 | 36 | 131.55 | |
| Total | 5329.50 | 39 | 136.65 | |
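The Adjusted Rank Transform steps summarized in Tables 6 through 8 can be sketched in a few lines of Python. This is a minimal illustration rather than code from the article: the function names (`marginal_means`, `midranks`, `interaction_f`, `adjusted_rank_transform`) and the `DATA` layout are assumptions introduced here, the scores are the fabricated data set of the worked example, and a balanced design with equal cell sizes is assumed. The adjusted scores are rounded before ranking so that floating-point noise does not split the one pair of tied values that should share a midrank.

```python
# Fabricated scores from the worked example, keyed by (grading system, achievement level).
DATA = {
    ("A1", "B1"): [0, 10, 34, 38, 24, 46, 41, 27, 42, 44],
    ("A2", "B1"): [20, 15, 21, 25, 30, 19, 47, 33, 11, 13],
    ("A1", "B2"): [45, 32, 14, 16, 9, 18, 28, 1, 17, 23],
    ("A2", "B2"): [43, 39, 26, 29, 35, 37, 40, 36, 4, 8],
}


def mean(values):
    return sum(values) / len(values)


def marginal_means(cells):
    """Return (row means, column means, grand mean) for {(row, col): scores}."""
    rows, cols = {}, {}
    for (r, c), scores in cells.items():
        rows.setdefault(r, []).extend(scores)
        cols.setdefault(c, []).extend(scores)
    pooled = [x for scores in cells.values() for x in scores]
    return ({r: mean(v) for r, v in rows.items()},
            {c: mean(v) for c, v in cols.items()},
            mean(pooled))


def midranks(values):
    """Rank from smallest to largest; tied values share the average of their ranks."""
    return [sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
            for x in values]


def interaction_f(cells):
    """F ratio for the interaction in a balanced two-way layout."""
    row_mean, col_mean, grand = marginal_means(cells)
    n = len(next(iter(cells.values())))  # common cell size (balanced design assumed)
    ss_ab = sum(n * (mean(s) - row_mean[r] - col_mean[c] + grand) ** 2
                for (r, c), s in cells.items())
    ss_within = sum((x - mean(s)) ** 2 for s in cells.values() for x in s)
    df_ab = (len(row_mean) - 1) * (len(col_mean) - 1)
    df_within = len(cells) * (n - 1)
    return (ss_ab / df_ab) / (ss_within / df_within)


def adjusted_rank_transform(cells):
    """Subtract row and column means from each score, then rank all results together."""
    row_mean, col_mean, _ = marginal_means(cells)
    keys = list(cells)
    # Rounding guards against floating-point noise splitting genuinely tied values.
    adjusted = [round(x - row_mean[r] - col_mean[c], 6)
                for (r, c) in keys for x in cells[(r, c)]]
    ranks = iter(midranks(adjusted))
    return {k: [next(ranks) for _ in cells[k]] for k in keys}


ranked = adjusted_rank_transform(DATA)
print(round(interaction_f(DATA), 3))    # ordinary ANOVA on the raw scores
print(round(interaction_f(ranked), 2))  # ANOVA on the adjusted ranks
```

Under these assumptions the first printed value should agree with the interaction F in the ANOVA summary for the raw data (3.983) and the second with Table 8 (4.45), up to rounding.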

TABLE 9
Puri and Sen L: Step 1. Original scores are ranked

| Grading system | Achievement level 1 (B1) | Achievement level 2 (B2) |
|---|---|---|
| 1 (A1) | 1, 6, 27, 31, 18, 39, 34, 21, 35, 37 | 38, 25, 9, 11, 5, 13, 22, 2, 12, 17 |
| 2 (A2) | 15, 10, 16, 19, 24, 14, 40, 26, 7, 8 | 36, 32, 20, 23, 28, 30, 33, 29, 3, 4 |
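The Puri and Sen L computation described for this example (rank the original scores, obtain the interaction and total sums of squares of the ranks, then form L = (N - 1) × ABSS/TSS) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the author's program: the names `midranks` and `puri_sen_l` and the `DATA` layout are hypothetical, the scores are the fabricated data of the worked example, and equal cell sizes are assumed.

```python
# Fabricated scores from the worked example, keyed by (grading system, achievement level).
DATA = {
    ("A1", "B1"): [0, 10, 34, 38, 24, 46, 41, 27, 42, 44],
    ("A2", "B1"): [20, 15, 21, 25, 30, 19, 47, 33, 11, 13],
    ("A1", "B2"): [45, 32, 14, 16, 9, 18, 28, 1, 17, 23],
    ("A2", "B2"): [43, 39, 26, 29, 35, 37, 40, 36, 4, 8],
}


def mean(values):
    return sum(values) / len(values)


def midranks(values):
    """Rank from smallest to largest; tied values share the average of their ranks."""
    return [sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
            for x in values]


def puri_sen_l(cells):
    """Return (L, interaction SS, total SS), all computed on ranks of the pooled scores."""
    keys = list(cells)
    pooled = [x for k in keys for x in cells[k]]
    ranks = iter(midranks(pooled))
    ranked = {k: [next(ranks) for _ in cells[k]] for k in keys}

    total_n = len(pooled)
    grand = (total_n + 1) / 2  # mean of the ranks 1..N, with or without midranks
    tss = sum((r - grand) ** 2 for scores in ranked.values() for r in scores)

    # Marginal mean ranks for each row (grading system) and column (achievement level).
    rows, cols = {}, {}
    for (r, c), scores in ranked.items():
        rows.setdefault(r, []).extend(scores)
        cols.setdefault(c, []).extend(scores)

    # Interaction sum of squares; the formula assumes equal cell sizes.
    ss_ab = sum(len(s) * (mean(s) - mean(rows[r]) - mean(cols[c]) + grand) ** 2
                for (r, c), s in ranked.items())
    return (total_n - 1) * ss_ab / tss, ss_ab, tss


L, abss, tss = puri_sen_l(DATA)
print(round(abss, 1), round(tss, 1), round(L, 2))
```

The resulting L is then compared with the chi-square critical value for (A - 1)(B - 1) degrees of freedom, which the text gives as 3.84 at alpha = 0.05.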

TABLE 10
Summary table of sum of squares for Puri and Sen L (a)

| Source of variation | Sum of squares | DF | Mean square |
|---|---|---|---|
| Grading system | 4.9 | 1 | 4.9 |
| Achievement level | 32.4 | 1 | 32.4 |
| 2-way interaction | 592.9 | 1 | 592.9 |
| Residual | 4699.8 | 36 | 130.6 |
| Total | 5330.0 | 39 | 136.7 |

(a) These statistics are the original rank transform.

Shoemaker's Statistic

The first step is to obtain the median of column 1 and column 2 of the original data: Median(B1) = 26; Median(B2) = 27. Then, the median of each column is subtracted from the observations of that column, as shown in Table 11. The row medians, Median(A1) = -0.5 and Median(A2) = 0.5, are calculated on the results of the first subtraction. Next, each cell is divided into two parts. The number of observations in the cell that are above the row median is placed in the upper part of that cell. The number of observations that are less than or equal to the row median is placed in the lower part of that cell, as shown in Table 12. The last step is to calculate a Pearson chi-square on the resultant values. The expected frequencies are calculated separately for the upper and lower cells. The degrees of freedom are (A - 1)(B - 1) = 1, and the critical value is 3.84. The obtained chi-square is 1.76, indicating that the interaction is not significant.

TABLE 11
Shoemaker's extended median test: Step 1. Subtract column medians from original scores, calculate row medians on results

| Grading system | Achievement level 1 (B1) | Achievement level 2 (B2) | Row median |
|---|---|---|---|
| 1 (A1) | 0 - 26 = -26 | 45 - 27 = 18 | -0.5 |
| | 10 - 26 = -16 | 32 - 27 = 5 | |
| | 34 - 26 = 8 | 14 - 27 = -13 | |
| | 38 - 26 = 12 | 16 - 27 = -11 | |
| | 24 - 26 = -2 | 9 - 27 = -18 | |
| | 46 - 26 = 20 | 18 - 27 = -9 | |
| | 41 - 26 = 15 | 28 - 27 = 1 | |
| | 27 - 26 = 1 | 1 - 27 = -26 | |
| | 42 - 26 = 16 | 17 - 27 = -10 | |
| | 44 - 26 = 18 | 23 - 27 = -4 | |
| 2 (A2) | 20 - 26 = -6 | 43 - 27 = 16 | 0.5 |
| | 15 - 26 = -11 | 39 - 27 = 12 | |
| | 21 - 26 = -5 | 26 - 27 = -1 | |
| | 25 - 26 = -1 | 29 - 27 = 2 | |
| | 30 - 26 = 4 | 35 - 27 = 8 | |
| | 19 - 26 = -7 | 37 - 27 = 10 | |
| | 47 - 26 = 21 | 40 - 27 = 13 | |
| | 33 - 26 = 7 | 36 - 27 = 9 | |
| | 11 - 26 = -15 | 4 - 27 = -23 | |
| | 13 - 26 = -13 | 8 - 27 = -19 | |

TABLE 12
Shoemaker's extended median test: Step 2. Place number of values above row median in upper part of each cell and the number of values below or equal to row median in lower part of each cell

| Grading system | Part of cell | Achievement level 1 (B1) | Achievement level 2 (B2) |
|---|---|---|---|
| 1 (A1) | > Row median | 7 (6.25) | 3 (3.75) |
| | ≤ Row median | 3 (4.17) | 7 (5.83) |
| 2 (A2) | > Row median | 3 (3.75) | 3 (2.25) |
| | ≤ Row median | 7 (5.83) | 7 (8.67) |

Note. Expected values are in parentheses.

Summary

Using a fabricated data set, the Adjusted RT and the Puri and Sen L statistic in trace criterion form rejected the null hypothesis, indicating there is a significant interaction of grading system and achievement level on the attitude of the students. Conversely, the ANOVA, Bradley's technique, and Shoemaker's extension of the median test failed to reject the null hypothesis regarding the presence of an interaction.

References
Adichie, J. N. (1978). Rank tests of sub-hypotheses in the general linear regression. Annals of Statistics, 6, 1012-1026.
Agresti, A., & Pendergast, J. (1986). Comparing mean ranks for repeated measures data. Communications in Statistics, 15(5), 1417-1433.
Akritas, M. G. (1988). The rank transform method in some two factor designs (Tech. Rep. #57). University Park: Pennsylvania State University, Department of Statistics.
Anderson, N. H. (1961). Scales and statistics: Parametric and nonparametric. Psychological Bulletin, 58, 305-316.
Andrews, F. C. (1954). Asymptotic behavior of some rank tests for analysis of variance. Annals of Mathematical Statistics, 25, 724-736.
Ary, D., & Jacobs, L. C. (1976). Introduction to statistics. New York: Holt, Rinehart, and Winston.
Aubuchon, J. C., & Hettmansperger, T. P. (1984). On the use of rank tests and estimates in the linear model. In P. R. Krishnaiah & P. K. Sen (Eds.), Handbook of statistics (Vol. 4, pp. 259-274). Amsterdam: Elsevier.
Baker, B. O., Hardyck, C. D., & Petrinovich, L. F. (1966). Weak measurement versus strong statistics: An empirical critique of S. S. Stevens' prescription on statistics. Educational and Psychological Measurement, 26, 219-309.
Barlow, R. E., Bartholomew, D. J., Bremner, J. M., & Brunk, H. D. (1972). Statistical inference under order restrictions: The theory and application of isotonic regression. New York: John Wiley.
Bartz, A. E. (1981). Basic statistical concepts (2nd ed.). Minneapolis, MN: Burgess.
Bell, C. B., & Doksum, K. A. (1965). Some new distribution-free statistics. Annals of Mathematical Statistics, 36(1), 203-214.
Berry, K. J., & Mielke, P. W. (1983). Moment approximations as an alternative to the F test in analysis of variance. British Journal of Mathematical and Statistical Psychology, 36, 202-206.
Best, J. (1981). Research in education (4th ed.). Englewood Cliffs, NJ: Prentice-Hall.
Best, J. W., & Kahn, J. V. (1989). Research in education (6th ed.). Englewood Cliffs, NJ: Prentice-Hall.
Bishop, T. (1976). Heteroscedastic ANOVA, MANOVA and multiple comparisons. Unpublished doctoral dissertation, Ohio State University.
Blair, R. C. (1980). A comparison of the power of the two independent means t test to that of the Wilcoxon's rank-sum test for samples of various populations. Unpublished doctoral dissertation, University of South Florida, Tampa, FL.
Blair, R. C. (1981). A reaction to "Consequences of failure to meet assumptions underlying the fixed effects analysis of variance and covariance." Review of Educational Research, 51(4), 499-507.
Blair, R. C. (1985). Some comments on the statistical treatment of rank data. Paper presented at the annual meeting of the American Educational Research Association and the National Council on Measurement in Education, Chicago.
Blair, R. C., & Higgins, J. J. (1980a). A comparison of the power of the t test and the Wilcoxon statistics when samples are drawn from a certain mixed normal distribution. Evaluation Review, 4, 645-656.
Blair, R. C., & Higgins, J. J. (1980b). A comparison of the power of the Wilcoxon's rank-sum statistic to that of Student's t statistic under various non-normal distributions. Journal of Educational Statistics, 5(4), 309-335.
Blair, R. C., & Higgins, J. J. (1981). A note on the asymptotic relative efficiency of the Wilcoxon rank-sum test relative to the independent means t test under mixtures of two normal distributions. British Journal of Mathematical and Statistical Psychology, 31, 124-128.
Blair, R. C., & Higgins, J. J. (1985a). A comparison of the power of the paired samples rank transform statistic to that of Wilcoxon's signed rank statistic. Journal of Educational Statistics, 10(4), 368-383.
Blair, R. C., & Higgins, J. J. (1985b). Comparison of the power of the paired samples t test to that of Wilcoxon's signed-ranks test under various population shapes. Psychological Bulletin, 97(1), 119-128.
Blair, R. C., Higgins, J. J., & Smitley, W. D. S. (1980). On the relative power of the U and t tests. British Journal of Mathematical and Statistical Psychology, 33, 114-120.
Blair, R. C., & Sawilowsky, S. S. (1990, April). A test for interaction based on the rank transform. Paper presented at the annual meeting of the American Educational Research Association, Boston.
Blair, R. C., Sawilowsky, S. S., & Higgins, J. J. (1987). Limitations of the rank transform in tests for interaction. Communications in Statistics: Computation and Simulation, B16, 1133-1145.
Bloom, B. S. (1984). The search for methods of group instruction as effective as one-to-one tutoring. Educational Leadership, 41(8), 4-17.
Blum, J. R., & Fattu, N. A. (1954). Nonparametric methods. Review of Educational Research, 24, 467-487.
Boik, R. J. (1987). The Fisher-Pitman permutation test: A nonrobust alternative to the normal theory F test when variances are heterogeneous. British Journal of Mathematical and Statistical Psychology, 40, 26-42.
Boneau, C. A. (1960). The effects of violations of assumptions underlying the t test. Psychological Bulletin, 57, 49-64.
Boneau, C. A. (1962). A comparison of the power of the U and t tests. Psychological Review, 69, 246-256.
Borg, W. R. (1987). Applying educational research: A guide for teachers. White Plains, NY: Longman.
Box, G. E. P. (1953). Non-normality and tests of variances. Biometrika, 40, 318-355.
Box, G. E. P. (1954). Some theories on quadratic forms applied to the study of analysis of variance problems: Effect of unequality of variance in the one-way classification. Annals of Mathematical Statistics, 25, 290-302.
Bradley, J. V. (1968). Distribution-free statistical tests. Englewood Cliffs, NJ: Prentice-Hall.
Bradley, J. V. (1977). A common situation conducive to bizarre distribution shapes. American Statistician, 31, 147-150.
Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144-152.
Bradley, J. V. (1979). A nonparametric test for interactions of any order. Journal of Quality Technology, 11(4), 177-184.
Bradley, J. V. (1980a). Nonrobustness in classical tests on means and variances: A large-scale sampling study. Bulletin of the Psychonomic Society, 15, 275-278.
Bradley, J. V. (1980b). Nonrobustness in one-sample Z and t tests: A large-scale sampling study. Bulletin of the Psychonomic Society, 15, 29-32.
Bradley, J. V. (1980c). Nonrobustness in Z, t, and F tests at large sample sizes. Bulletin of the Psychonomic Society, 16, 333-336.
Bradley, J. V. (1982). The insidious L-shaped distribution. Bulletin of the Psychonomic Society, 20(2), 85-88.
Brown, M. B., & Forsythe, A. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16, 129-132.

Brunner, E., & Neumann, N. (1984). Rank tests for the 2 x 2 split plot design. Metrika, 31, 233-243.
Capon, J. A. (1988). Elementary statistics for the social sciences. Belmont, CA: Wadsworth.
Chase, C. (1976). Elementary statistical procedures (2nd ed.). New York: McGraw-Hill.
Chernoff, H., & Savage, I. R. (1958). Asymptotic normality and efficiency of certain nonparametric test statistics. Annals of Mathematical Statistics, 29, 972-999.
Christensen, H. B. (1977). Statistics step by step. Boston: Houghton Mifflin.
Clayton, K. N. (1984). An introduction to statistics for psychology and education. Columbus, OH: Charles E. Merrill.
Cochran, W. G. (1947). Some consequences when the assumptions for the analysis of variance are not satisfied. Biometrics, 3, 27-38.
Cochran, W. G., & Cox, G. M. (1950). Experimental designs. New York: Wiley.
Conover, W. J. (1971). Practical nonparametric statistics. New York: Wiley.
Conover, W. J. (1980). Practical nonparametric statistics (2nd ed.). New York: John Wiley.
Conover, W. J., & Iman, R. L. (1976). On some alternative procedures using ranks for the analysis of experimental designs. Communications in Statistics, A5, 1349-1368.
Conover, W. J., & Iman, R. L. (1980). The rank transformation as a method of discrimination with some examples. Communications in Statistics, A9(5), 465-487.
Conover, W. J., & Iman, R. L. (1981). Rank transformations as a bridge between parametric and nonparametric statistics. American Statistician, 35, 124-129.
Conover, W. J., & Iman, R. L. (1982). Analysis of covariance using the rank transform. Biometrics, 38, 715-724.
Couch, J. V. (1982). Fundamentals of statistics for the behavioral sciences. St. Paul, MN: St. Martin's Press.
Couch, J. V. (1987). Fundamentals of statistics for the behavioral sciences (2nd ed.). St. Paul, MN: St. Martin's Press.
Dinham, S. M. (1976). Exploring statistics. Belmont, CA: Wadsworth.
Dixon, W. J. (1954). Power under normality of several nonparametric tests. Annals of Mathematical Statistics, 25, 610-614.
Dixon, W. J., & Massey, F. J. (1969). Introduction to statistical analysis (3rd ed.). New York: McGraw-Hill.
Downie, N. M., & Heath, R. (1970). Basic statistical methods (3rd ed.). New York: Harper and Row.
Downie, N. M., & Starry, A. R. (1977). Descriptive and inferential statistics. New York: Harper and Row.
Edgington, E. S. (1980). Randomization tests. New York: Marcel Dekker.
Englehart, M. (1972). Methods of educational research. Chicago: Rand McNally.
Fallik, F., & Brown, B. (1983). Statistics for behavioral sciences. Homewood, IL: Dorsey Press.
Fawcett, R. F., & Salter, K. C. (1984). A Monte Carlo study of the F test and three tests based on ranks of treatment effects in randomized block designs. Communications in Statistics, B13, 213-225.
Feir-Walsh, B. J., & Toothaker, L. E. (1974). An empirical comparison of the ANOVA F test, normal scores, and Kruskal-Wallis test under violations of assumptions. Educational and Psychological Measurement, 34, 789-799.
Fienberg, S. E. (1985). The analysis of cross-classified categorical data (2nd ed.). Cambridge, MA: MIT Press.
Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, A(222), 309-368.
Fisher, R. A., & Yates, F. (1949). Statistical tables for biological, agricultural, and medical research (3rd ed.). New York: Hafner.
Fligner, M. A. (1981). Comment. American Statistician, 35(3), 131-132.
Freeman, D. M., Jr. (1987). Applied categorical data analysis. New York: Marcel Dekker.



Freund, J. E. (1970). Statistics: A first course. Englewood Cliffs, NJ: Prentice-Hall.
Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32, 675-701.
Gaito, J. (1959). Non-parametric methods in psychological research. Psychological Reports, 5, 115-125.
Gardner, P. L. (1975). Scales and statistics. Review of Educational Research, 45(1), 43-57.
Garrett, H. (1966). Statistical methods in psychology and education (6th ed.). New York: David McKay.
Gatta, L. A. (1973). An analysis of the pass-fail grading system as compared to the conventional grading system in high school chemistry. Journal of Research in Science Teaching, 10, 3-12.
Gay, L. R. (1976). Educational research: Competencies for analysis and application. Columbus, OH: Charles E. Merrill.
Gay, L. R. (1988). Educational research: Competencies for analysis and applications (3rd ed.). Columbus, OH: Charles E. Merrill.
Gibbons, J. D. (1971). Nonparametric statistical inference. New York: McGraw-Hill.
Gibbons, J. D. (1985a). Nonparametric methods for qualitative analysis. Columbus, OH: American Sciences Press.
Gibbons, J. D. (1985b). Nonparametric statistical inference (2nd ed.). New York: Marcel Dekker.
Glass, G., Peckham, P., & Sanders, J. R. (1972). Consequences of failure to meet assumptions underlying the fixed effects analysis of variance and covariance. Review of Educational Research, 42, 237-288.
Goodard, R. H., & Lindquist, E. F. (1940). An empirical study of the effects of heterogeneous within groups variance upon certain F-tests of significance in analysis of variance. Psychometrika, 5, 263-274.
Gravetter, F. J., & Wallnau, L. B. (1985). Statistics for the behavioral sciences. St. Paul, MN: West.
Groggel, D. J. (1987). A Monte Carlo study of rank tests for block designs. Communications in Statistics, 16(3), 601-620.
Guilford, J. P., & Fruchter, B. (1978). Fundamental statistics in psychology and education (6th ed.). New York: McGraw-Hill.
Haber, A., & Runyon, R. P. (1969). General statistics. Reading, MA: Addison-Wesley.
Habib, A. R., & Harwell, M. R. (1989). An empirical study of the type I error rate and power for some selected normal-theory and nonparametric tests of the independence of two sets of variables. Communications in Statistics, 18(2), 793-826.
Hack, H. R. B. (1958). An empirical investigation into the distribution of the F-ratio in samples from two nonnormal populations. Biometrika, 45, 260-265.
Hajek, J. (1968). Asymptotic normality of simple linear rank statistics under alternatives. Annals of Mathematical Statistics, 39, 325-346.
Hajek, J., & Sidak, Z. (1967). Theory of rank tests. New York: Academic Press.
Harwell, M. R. (1988). Choosing between parametric and nonparametric statistics. Journal of Counseling and Development, 67, 35-38.
Harwell, M. R. (1990). A general approach to hypothesis testing for nonparametric tests. Journal of Experimental Education, 58(2), 143-156.
Harwell, M. R. (in press). Completely randomized factorial analysis of variance using ranks. British Journal of Mathematical and Statistical Psychology.
Harwell, M. R., & Serlin, R. C. (1988). An empirical study of a proposed test of nonparametric analysis of covariance. Psychological Bulletin, 104(2), 268-281.
Harwell, M. R., & Serlin, R. C. (1989a). A nonparametric test statistic for the general linear model. Journal of Educational Statistics, 14(4), 351-371.
Harwell, M. R., & Serlin, R. C. (1989b, April). An empirical study of the Friedman test under covariance heterogeneity. Paper presented at the annual meeting of the American Educational Research Association, San Francisco.
Havlicek, L. L., & Peterson, N. L. (1974). Robustness of the t test: A guide for researchers on effect of violations of assumptions. Psychological Reports, 34, 1095-1114.
Hays, W. L. (1963). Statistics. New York: Holt, Rinehart, and Winston.
Hays, W. L. (1988). Statistics. New York: Holt, Rinehart, and Winston.
Hedgeman, V., & Johnson, P. E. (1976). On analyzing two-way AoV data with interactions. Technometrics, 18, 273-281.
Hemelrijk, J. (1961). Experimental comparison of Student's and Wilcoxon's two sample test. In H. de Jonge (Ed.), Quantitative methods in psychology. New York: Interscience.
Hettmansperger, T. P. (1984). Statistical inference based on ranks. New York: John Wiley.
Hettmansperger, T. P., & McKean, J. W. (1983). A geometric interpretation of inferences based on ranks in the linear model. Journal of the American Statistical Association, 78, 885-893.
Hildebrand, D. K. (1986). Statistical thinking for behavioral scientists. Boston: Duxbury Press.
Hinkle, D. E., Wiersma, W., & Jurs, S. G. (1988). Applied statistics for the behavioral sciences (2nd ed.). Boston: Houghton Mifflin.
Hodges, J. C., & Lehmann, E. L. (1956). The efficiency of some nonparametric competitors of the t test. Annals of Mathematical Statistics, 27, 324-335.
Hodges, J. C., & Lehmann, E. L. (1962). Rank methods for combination of independent experiments in analysis of variance. Annals of Mathematical Statistics, 33, 482-497.
Hoeffding, W. (1951). Optimum nonparametric tests. In J. Neyman (Ed.), Proceedings of the second Berkeley symposium on mathematics, statistics, and probability (pp. 83-92). Berkeley, CA: University of California Press.
Hoeffding, W. (1952). The large sample power of tests based on permutations of observations. Annals of Mathematical Statistics, 23, 169-192.
Hollander, M., & Wolfe, D. A. (1973). Nonparametric methods. New York: John Wiley.
Hora, S. C., & Conover, W. J. (1984). The F statistic in the two-way layout with rank-score transformed data. Journal of the American Statistical Association, 79, 668-673.
Hora, S. C., & Iman, R. C. (1988). Asymptotic relative efficiencies of the rank transformation procedures in randomized complete block designs. Journal of the American Statistical Association, 83, 462-470.
Horsnell, G. (1953). The effect of unequal group variances on the F-test for the homogeneity of group means. Biometrika, 40, 128-136.
Howell, D. C. (1989). Fundamental statistics for the behavioral sciences. Boston: PWS-Kent.
Huber, P. J. (1981). Robust statistics. New York: Wiley.
Iman, R. L. (1974). A power study of a rank transform for the two-way classification model when interactions may be present. Canadian Journal of Statistics, 2, 227-239.
Iman, R. L., & Conover, W. J. (1976). A comparison of several rank tests for the two-way layout (SAND76-0631). Albuquerque, NM: Sandia Laboratories.
Iman, R. L., & Conover, W. J. (1979). The use of the rank transform in regression. Technometrics, 21(4), 499-509.
Iman, R. L., Hora, S. C., & Conover, W. J. (1984). Comparison of asymptotically distribution-free procedures for the analysis of complete blocks. Journal of the American Statistical Association, 79, 674-685.
International Mathematical and Statistical Libraries. (1987). IMSL library, reference manual (10th ed.). Houston, TX: Author.
Ito, P. K. (1980). Robustness of ANOVA and MANOVA test procedures. In P. R. Krishnaiah (Ed.), Handbook of statistics (Vol. 1, pp. 199-236). The Netherlands: North-Holland.
Jenkins, S. J., Fuqua, D. R., & Froehle, T. C. (1984). A critical examination of the use of nonparametric statistics in the Journal of Counseling Psychology. Perceptual and Motor Skills, 59, 31-35.
Jenkins, S. J., Fuqua, D. R., & Hartman, B. W. (1984). Evaluating criteria for selection of nonparametric statistics. Perceptual and Motor Skills, 58, 979-984.
Johnson, M. (1977). A review of research methods in education. Chicago: Rand McNally.
Johnson, R. R. (1973). Elementary statistics. Belmont, CA: Wadsworth.
Kachigan, S. K. (1986). Statistical analysis: An interdisciplinary introduction to univariate and multivariate methods (2nd ed.). New York: Radius Press.
Kennedy, J. J. (1983). Analyzing qualitative data: Introducing log-linear analysis for behavioral research. New York: Praeger.
Kepner, J. L., & Robinson, D. H. (1988). Nonparametric methods for detecting treatment effects in repeated measures designs. Journal of the American Statistical Association, 83, 456-461.
Kerlinger, F. N. (1964). Foundations of behavioral research. New York: Holt, Rinehart, and Winston.
Kerlinger, F. N. (1973). Foundations of behavioral research (2nd ed.). New York: Holt, Rinehart, and Winston.
Keselman, H. J., Rogan, J. C., & Feir-Walsh, B. J. (1977). An evaluation of some nonparametric tests for location equality. British Journal of Mathematical and Statistical Psychology, 30, 213-221.
Kirk, R. E. (1968). Experimental design: Procedures for the behavioral sciences. Belmont, CA: Wadsworth.
Kirk, R. E. (1972). Nonparametric methods. In R. E. Kirk (Ed.), Statistical issues: A reader for the behavioral sciences. Monterey, CA: Brooks/Cole.
Kleijnen, J. P. C. (1987). Statistical tools for simulation practitioners. New York: Marcel Dekker.
Klugh, H. E. (1970). Statistics: The essentials for research. New York: John Wiley.
Klugh, H. E. (1974). Statistics: The essentials for research (2nd ed.). New York: John Wiley.
Kossack, C. F., & Henscke, C. (1975). Introduction to statistics and computer programming. San Francisco: Holden-Day.
Krauth, J. (1988). Distribution-free statistics: An application-oriented approach. In J. P. Huston (Ed.), Techniques in behavioral and neural sciences (Vol. 2). Amsterdam: Elsevier.
Kreyszig, E. (1970). Introductory mathematical statistics: Principles and methods. New York: John Wiley.
Kruskal, W. H. (1952). A nonparametric test for the several sample problem. Annals of Mathematical Statistics, 23, 525-545.
Kruskal, W. H., & Wallis, W. A. (1952). Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47, 583-621.
Kurtz, A., & Mayo, S. (1979). Statistical methods in education and psychology. New York: Springer-Verlag.
Kurtz, A., & Mayo, S. (1983). Introduction to social statistics. New York: McGraw-Hill.
Lapin, L. (1975). Statistics: Meaning and method. New York: Harcourt Brace Jovanovich.
Lapin, L. (1980). Statistics: Meaning and method (2nd ed.). New York: Harcourt Brace Jovanovich.
Leedy, P. D. (1989). Practical research: Planning and design (4th ed.). New York: Macmillan.
Lehmann, E. L. (1975). Nonparametrics. San Francisco: Holden-Day.
Lehmann, E. L., & D'Abrera, H. J. M. (1975). Nonparametrics: Statistical methods based on ranks. New York: McGraw-Hill.
Lehmann, E. L., & Stein, C. (1959). Testing statistical hypotheses. New York: John Wiley.
Lemmer, H. H. (1980). Some empirical results on the two-way analysis of variance by ranks. Communications in Statistics, A9, 1427-1438.
Li, J. C. R. (1964). Statistical inference I. Ann Arbor, MI: Edwards Brothers.
Lindquist, E. F. (1953). Design and analysis of experiments in psychology and education. Boston: Houghton Mifflin.
Lu, H. T., & Smith, P. J. (1979). Distribution of the normal scores statistic for nonparametric one-way analysis of variance. Journal of the American Statistical Association, 74, 715-722.
Lunney, G. H. (1970). Using analysis of variance with a dichotomous dependent variable: An empirical study. Journal of Educational Measurement, 7, 263-269.
Lutz, G. M. (1983). Understanding social statistics. New York: Macmillan.
Lynch, M. D., & Huntsberger, D. V. (1976). Elements of statistical inference for education and psychology. Boston: Allyn and Bacon.
Mandeville, G. K. (1972). A new look at treatment differences. American Educational Research Journal, 9, 311-321.
Mann, H. B., & Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. American Mathematical Society, 18, 50-60.
Mansfield, E. (1986). Basic statistics with applications. New York: W. W. Norton.
Marascuilo, L. A., & McSweeney, M. (1977). Nonparametric and distribution-free methods for the social sciences. New York: Brooks-Cole.
Marascuilo, L. A., & Serlin, R. C. (1988). Statistical methods for the social and behavioral sciences. New York: W. H. Freeman.
Mason, E. J., & Bramble, W. J. (1989). Understanding and conducting research: Applications in education and the behavioral sciences (2nd ed.). New York: McGraw-Hill.
McCall, R. B. (1975). Fundamental statistics for psychology (2nd ed.). New York: Harcourt Brace Jovanovich.
McCall, R. B. (1980). Fundamental statistics for psychology (3rd ed.). New York: Harcourt Brace Jovanovich.
McCall, R. B. (1986). Fundamental statistics for psychology (4th ed.). New York: Harcourt Brace Jovanovich.
McMillan, J. H., & Schumacher, S. (1989). Research in education: A conceptual introduction (2nd ed.). Glenview, IL: Scott, Foresman.
McNemar, Q. (1962). Psychological statistics (3rd ed.). New York: John Wiley.
McSweeney, M. (1967). An empirical study of two proposed nonparametric tests for main effects and interaction. Unpublished doctoral dissertation, University of California, Berkeley.
McSweeney, M., & Penfield, D. A. (1969). Normal scores tests for the C-sample problem. British Journal of Mathematical and Statistical Psychology, 22, 177-192.
Meddis, R. (1984). Statistics using ranks: A unified approach. Oxford, England: Blackwell.
Mendenhall, W. (1968). Introduction to probability and statistics (2nd ed.). Belmont, CA: Wadsworth.
Mendenhall, W. (1971). Introduction to probability and statistics (3rd ed.). Belmont, CA: Wadsworth.
Mendenhall, W., & Scheaffer, R. L. (1973). Mathematical statistics with applications. North Scituate, MA: Duxbury Press.
Micceri, T. (1986, November). A futile search for that statistical chimera of normality. Paper presented at the annual meeting of the Florida Educational Research Association, Tampa, FL.
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105(1), 156-166.
Michael, W. B. (1963). Selected contributions to parametric and nonparametric statistics. Review of Educational Research, 33, 474-488.
Miller, I., & Freund, J. E. (1965). Probability and statistics for engineers. Englewood Cliffs, NJ: Prentice-Hall.
Mills, R. L. (1977). Statistics for applied economics and business. New York: McGraw-Hill.
Minium, E. W. (1970). Statistical reasoning in psychology and education. New York: John Wiley.
Neave, H. R., & Granger, C. W. J. (1968). A Monte Carlo study comparing various two-sample tests for differences in mean. Technometrics, 10, 509-522.
Noether, G. E. (1955). On a theorem of Pitman. Annals of Mathematical Statistics, 26, 64-68.



Noether, G. E. (1967). Elements of nonparametric statistics. New York: Wiley. Noether, G. E. (1981). Comment. American Statistician, 55(3), 129-132. Noether, G. E. (1984). Nonparametrics: The early yearsimpressions and recollections. American Statistician, 38(3), 173-178. Nunnally, J. (1975). Introduction to statistics for psychology and education. New York: McGraw-Hill. Nunnally, J. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill. Olejnik, S. F., & Algina, J. (1984). Parametric ANCOVA and the rank transform ANCOVA when the data are conditionally nonnormal and heteroscedastic. Journal of Educational Statistics, 9(2), 129-150. Olson, C. L. (1987). Essentials of statistics: Making sense of data. Boston: Allyn and Bacon. Pagano, R. R. (1981). Understanding statistics in the behavioral sciences. St. Paul, MN: West. Page, E. B., & Marcotte, D. R. (1966). Nonparametric statistics. Review of Educational Research, 36, 517-528. Palumbo, D. J. (1969). Statistics in political and behavioral science. New York: AppeltonCentury-Crofts. Palumbo, D. J. (1977). Statistics in political and behavioral science (2nd ed.). New York: Columbia University Press. Parket, I. R. (1974). Statistics for business decision making. New York: Random House. Patel, K. M., & Hoel, D. G. (1973). A nonparametric test for interaction in factorial experiments. Journal of the American Statistical Association, 68, 615-620. Pearson, E. S. (1929). The analysis of variance in cases of nonnormal variation. Biometrika, 23, 259-286. Penfield, D. A., & Koffler, S. L. (1985). A power study of selected nonparametric K-sample tests. Paper presented at the annual meeting of the American Educational Research Association, Chicago. Pitman, E. J. G. (1948). Lecture notes on non-parametric statistics (photocopy). New York: Columbia University. Puri, M. L. (1964). Asymptotic efficiency of a class of c-sample tests. Annals ofMathematical Statistics, 35, 102-121. Puri, M. L., & Sen, P. K. (1969). 
A class of rank order tests for a general linear model. Annals of Mathematical Statistics, 40, 1325-1343.
Puri, M. L., & Sen, P. K. (1971). Nonparametric methods in multivariate analysis. New York: John Wiley.
Puri, M. L., & Sen, P. K. (1973). A note on the asymptotically distribution free tests for subhypotheses in multiple linear regression. Annals of Statistics, 1, 553-556.
Puri, M. L., & Sen, P. K. (1985). Nonparametric methods in general linear models. New York: Wiley.
Quade, D. (1979). Using weighted rankings in the analysis of complete blocks with additive block effects. Journal of the American Statistical Association, 74, 680-683.
Randles, R. H., & Wolfe, D. A. (1979). Introduction to the theory of nonparametric tests. New York: John Wiley.
Randolph, E. A., & Barcikowski, R. S. (1989, November). Type I error rate when real study values are used as population parameters in a Monte Carlo study. Paper presented at the 11th annual meeting of the Mid-Western Educational Research Association, Chicago.
Rao, C. R. (1951). An asymptotic expansion of the distribution of Wilk's criterion. Bulletin of the International Statistics Institute, 33, 177-180.
Rasmussen, J. L. (1985). The power of student's t and Wilcoxon W statistics. Evaluation Review, 9(4), 505-510.
Rasmussen, J. L. (1986). An evaluation of parametric and nonparametric tests on modified and nonmodified data. British Journal of Mathematical and Statistical Psychology, 39, 213-220.
Reinach, S. G. (1965). A nonparametric analysis for a multiway classification with one

element per cell. South African Journal of Agricultural Science, 8, 941-960.
Reinach, S. G. (1966). Distribution-free methods in experimental design. Unpublished doctoral dissertation, University of Pretoria, Pretoria.
Rider, P. R. (1929). On the distribution of the ratio of mean to standard deviation in small samples from nonnormal populations. Biometrika, 21, 124-143.
Rineman, W. C., Jr. (1983). On distribution-free rank tests for two-way layouts. Journal of the American Statistical Association, 78, 655-659.
Rogan, J. C., & Keselman, H. J. (1977). Is the ANOVA F-test robust to variance heterogeneity when sample sizes are equal? An investigation via a coefficient of variation. American Educational Research Journal, 14, 493-498.
Roscoe, J. T. (1969). Fundamental research statistics for the behavioral sciences. New York: Holt, Rinehart, and Winston.
Roscoe, J. T. (1975). Fundamental research statistics for the behavioral sciences (2nd ed.). New York: Holt, Rinehart, and Winston.
Runyon, R. P., & Haber, A. (1968). Fundamentals of behavioral statistics. Reading, MA: Addison-Wesley.
Runyon, R. P., & Haber, A. (1971). Fundamentals of behavioral statistics (2nd ed.). Reading, MA: Addison-Wesley.
Runyon, R. P., & Haber, A. (1976). Fundamentals of behavioral statistics (3rd ed.). Reading, MA: Addison-Wesley.
Runyon, R. P., & Haber, A. (1980). Fundamentals of behavioral statistics (4th ed.). Reading, MA: Addison-Wesley.
Runyon, R. P., & Haber, A. (1984). Fundamentals of behavioral statistics (5th ed.). New York: Random House.
Runyon, R. P., & Haber, A. (1988). Fundamentals of behavioral statistics (6th ed.). New York: Random House.
SAS Institute. (1985). SAS user's guide: Statistics (5th ed.). Cary, NC: Author.
SAS Institute. (1987). SAS/stat guide for personal computers (6th ed.). Cary, NC: Author.
Sawilowsky, S. S. (1985a). A comparison of random normal scores tests under the F and chi-square distributions to the 2 x 2 x 2 ANOVA test.
Florida Journal of Educational Research, 27, 83-97.
Sawilowsky, S. S. (1985b). Robust and power analysis of the 2 x 2 x 2 ANOVA, rank transformation, random normal scores, and expected normal scores transformation tests. Unpublished doctoral dissertation, University of South Florida, Tampa, FL.
Sawilowsky, S. S. (1989, April). Rank transform: The bridge is falling down. Paper presented at the annual meeting of the American Educational Research Association, San Francisco.
Sawilowsky, S. S., & Blair, R. C. (1990). A more realistic look at the robustness of the independent and dependent samples t tests to departures from population normality. Manuscript submitted for publication.
Sawilowsky, S. S., Blair, R. C., & Higgins, J. J. (1989). An investigation of the type I error and power properties of the rank transformation procedure in factorial ANOVA. Journal of Educational Statistics, 14(3), 255-267.
Scheffe, H. (1959). The analysis of variance. New York: Wiley.
Scheirer, G. J., Ray, W. S., & Hare, N. (1976). The analysis of ranked data derived from completely randomized factorial designs. Biometrics, 32, 429-434.
Schrader, R. M., & Hettmansperger, T. P. (1980). Robust analysis of variance based upon a likelihood ratio criterion. Biometrika, 67, 93-101.
Sen, P. K. (1980). On M tests in linear models. Biometrika, 69, 245-248.
Sen, P. K., & Puri, M. L. (1977). Asymptotically distribution free aligned rank order tests for composite hypotheses for general linear models. Zeitschrift fuer Wahrscheinlichkeitstheorie und Verwandte Gebiete, 39, 175-186.
Shoemaker, L. H. (1986). A nonparametric method for analysis of variance. Communications in Statistics, 15(3), 609-632.

Siegel, S. (1956). Nonparametric statistics for the behavioral sciences. New York: McGraw-Hill.
Siegel, S., & Castellan, Jr., N. J. (1988). Nonparametric statistics for the behavioral sciences (2nd ed.). New York: McGraw-Hill.
Smitley, W. D. S. (1981). A comparison of the power of the two independent means t test and the Mann-Whitney U test. Unpublished doctoral dissertation, University of South Florida, Tampa, FL.
Snedecor, G. W., & Cochran, W. G. (1980). Statistical methods (7th ed.). Ames, IA: Iowa University Press.
Sowell, E. J., & Casey, R. J. (1982). Analyzing educational research. Belmont, CA: Wadsworth.
Spatz, C., & Johnston, J. O. (1989). Basic statistics: Tales of distributions. Pacific Grove, CA: Brooks/Cole.
Spence, J. T., Underwood, B. J., Duncan, C. P., & Cotton, J. W. (1968). Elementary statistics (2nd ed.). New York: Appleton-Century-Crofts.
Sprent, P. (1989). Applied nonparametric statistical methods. London: Chapman and Hall.
Sprinthall, R. C. (1982). Basic statistical analysis. Reading, MA: Addison-Wesley.
Srisukho, D. (1974). Monte Carlo study of the power of H-test compared to F-test when population distributions are different in form. Unpublished doctoral dissertation, University of California, Berkeley.
Stephenson, W. R., & Jacobson, D. (1988). A comparison of nonparametric analysis of covariance techniques. Communications in Statistics, 17(2), 451-461.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677-680.
Still, A. W., & White, A. P. (1981). The approximate randomization test as an alternative to the F test in analysis of variance. British Journal of Mathematical and Statistical Psychology, 34, 243-252.
Stuart, A. (1956). The measurement of estimation and test efficiency. Bulletin of the International Statistics Institute, 36, 79-86.
Summers, G. W., Peters, W. S., & Armstrong, C. P. (1977). Basic statistics in business and economics (2nd ed.). Belmont, CA: Wadsworth.
Tan, W. Y. (1982).
Sampling distributions and robustness of t, F, and variance-ratio in two samples and ANOVA models with respect to departure from normality. Communications in Statistics, A11, 2485-2511.
Tate, M. W., & Clelland, R. C. (1957). Nonparametric and shortcut statistics: In the social, biological, and medical sciences. Danville, IL: Interstate.
Terry, M. E. (1952). Some rank-order tests which are most powerful against specific parametric alternatives. Annals of Mathematical Statistics, 23, 346-366.
Thompson, G. L., & Ammann, L. P. (1989). Efficiencies of the rank-transform in two-way models with no interaction. Journal of the American Statistical Association, 84(405), 325-330.
Thompson, G. L., & Ammann, L. P. (in press). Efficiencies of interblock rank statistics for repeated measures designs. Journal of the American Statistical Association.
Tomarkin, A. J., & Serlin, R. C. (1986). Comparison of ANOVA alternatives under variance heterogeneity and specific noncentrality structures. Psychological Bulletin, 99(1), 90-99.
Toothaker, L. E., & Chang, H. (1980). On "The analysis of ranked data derived from completely randomized factorial designs." Journal of Educational Statistics, 5(2), 169-176.
Van Elteren, P., & Noether, G. E. (1959). The asymptotic efficiency of the χ_r² test for a balanced incomplete block design. Biometrika, 46, 465-477.
Walberg, H. J., Strykowski, B. F., Rovai, E., & Hurg, S. S. (1984). Exceptional performance. Review of Educational Research, 54, 87-112.
Wampold, B. E., & Drew, C. J. (1990). Theory and application of statistics. New York: McGraw-Hill.
Welch, B. L. (1937). The significance of the difference between two means when the

population variances are unequal. Biometrika, 29, 350-362.
Wilcox, R. R. (1987). New statistical procedures for the social sciences. Hillsdale, NJ: Lawrence Erlbaum.
Wilcox, R. R., Charlin, V., & Thompson, K. (1986). New Monte Carlo results on the robustness of the ANOVA, F, W, and F* statistics. Communications in Statistics Simulations and Computation, 15, 933-944.
Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics, 1, 80-82.
Wilcoxon, F. (1947). Probability tables for individual comparisons by ranking methods. Biometrics, 3, 119-122.
Wilcoxon, F. (1949). Some rapid approximate statistical procedures. New York: American Cyanamid.
Williams, F. (1979). Reasoning with statistics (2nd ed.). New York: Holt, Rinehart, and Winston.
Wilson, K. V. (1956). A distribution-free test of analysis of variance hypotheses. Psychological Bulletin, 53(1), 96-101.
Winer, B. J. (1962). Statistical principles in experimental design. New York: McGraw-Hill.
Winer, B. J. (1971). Statistical principles in experimental design (2nd ed.). New York: McGraw-Hill.
Winn, P. R., & Johnson, R. H. (1978). Business statistics. New York: Macmillan.
Wolfowitz, J. (1949). Non-parametric statistical inference. In Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability (pp. 93-113). Berkeley, CA: University of California Press.
Wonnacott, T. H. (1977). Introductory statistics for business and economics (2nd ed.). Santa Barbara, CA: John Wiley.
Worsley, K. J. (1977). A non-parametric extension of a cluster analysis method by Scott and Knott. Biometrics, 33, 532-535.
Wright, R. L. D. (1976). Understanding statistics: An informal introduction for the behavioral sciences. New York: Harcourt Brace Jovanovich.

Author

SHLOMO S. SAWILOWSKY is Assistant Professor, Educational Evaluation and Research, Room 347 EDUC, College of Education, Wayne State University, Detroit, MI 48202. He specializes in rank tests and computer simulations.
