Marie Šimečková
Vienna 2008
Abstract
Statistical design is a very important part of applied empirical studies. In this work, evaluation of the required size of an experiment in two specific cases of ANOVA models is proposed.
In the first part, the one-way layout with one fixed factor and an ordinal categorical response variable is considered. The Kruskal-Wallis test is used to test the equality of the main effects and its properties are compared with those of the F-test. The distribution of the response variables is characterized by the relative effect. A formula for evaluating the sample size for the Kruskal-Wallis test was derived by simulation. The formula was then compared with the two formulas of Noether for the Wilcoxon test in the case of two factor levels.
In the second part, the two-way ANOVA mixed model with one observation for each row-column combination is considered and tests of interaction in this model are studied. Five tests of additivity are covered, all developed for models with two fixed factors: the Tukey test, the Mandel test, the Johnson-Graybill test, the Tusell test and the locally best invariant (LBI) test. We confirmed by simulation that these tests hold the type-I-risk even in the mixed model case. Then their power was studied. The power of the Johnson-Graybill, LBI and Tusell tests is sufficient, but the power of the Tukey and Mandel tests is low for a general type of interaction. A modification of the Tukey test is proposed to address this problem. Finally, a formula for the determination of the size of an experiment for the Johnson-Graybill test is derived.
Zusammenfassung
Experimental design is an important part of applied empirical research. In this work the required experiment size is derived for some special cases of the analysis of variance (ANOVA).
In the first part, the one-way layout with one fixed factor and a categorical response variable is treated. The Kruskal-Wallis test is used to test the equality of the main effects, and its properties are compared with those of the F-test. The distribution of the categorical response variable is characterized by the relative effect. A formula for determining the sample size for the Kruskal-Wallis test was derived with the help of simulations. This formula was then compared with two formulas of Noether for the Wilcoxon test, i.e. for the special case of a factor with two levels.
In the second part, the two-way layout with one observation per cell and a mixed model is considered. Five tests for the absence of interaction were studied, which were developed for the model with two fixed factors. These are: the Tukey test, the Mandel test, the Johnson-Graybill test, the Tusell test and a locally best invariant (LBI) test. As our simulation studies show, all these tests hold the type-I-risk also in the case of a mixed model. Concerning the power, we found that the power of the Johnson-Graybill test, the LBI test and the Tusell test is satisfactory, whereas the Tukey test and the Mandel test have unsatisfactorily low power. A modification of the Tukey test was carried out, but it did not yield a sufficient improvement in power. Finally, a formula for the determination of the experiment size for the Johnson-Graybill test was developed.
Contents
1 Introduction
2 One-Way Layout with One Fixed Factor for Ordered Categorical Data
2.2.1 Comparison of the F-test and the Kruskal-Wallis test for the one-way layout with ordinal categorical response
3.1 Preliminaries
3.1.2 Tests of additivity
Chapter 1
Introduction
In biology, agriculture, psychology and many other research fields, experiments are a very important method of acquiring knowledge. To obtain credible conclusions on the one hand and not to waste resources on the other, experiments must be carefully designed. An essential part of planning an experiment is the determination of the number of units included. This work focuses on two specific cases: estimation of power and sample size for the Kruskal-Wallis test and for tests of additivity in the two-way ANOVA model without replication.
The thesis is divided into two parts. In Chapter 2 the interest lies in the one-way layout with ordinal categorical response. The Kruskal-Wallis analysis of variance by ranks is applied; see Šimeček and Šimečková (submitted), Rusch et al. (submitted) and Šimečková and Rasch (submitted).
All presented simulations were performed using the statistical environment R (R Development Core Team, 2008) on a grid of 48 Intel machines at the Supercomputing Centre Brno. I would like to thank the METACentrum project (http://meta.cesnet.cz) for the allocation of computing time.
I cannot forget to commemorate my former supervisor Prof. Harald Strelec. I am extremely grateful to Prof. Dieter Rasch, to my colleagues from the Universität für Bodenkultur Wien and to my friends from the Universität Wien; their suggestions improved this work substantially. A friendly environment and support were provided by the Institute of Animal Science in Prague.
Chapter 2
One-Way Layout with One Fixed Factor for Ordered Categorical Data
2.1 Preliminaries
Let us consider a one-way layout y1, ..., ya with distribution functions F1, ..., Fa, respectively. The random variable yi corresponds to the i-th level of the factor A. We test the hypothesis

    H0 : F1 = F2 = · · · = Fa,                          (2.1)

against the alternative

    HA : ∃ i, j : Fi < Fj or Fi > Fj.                   (2.2)

For F1, ..., Fa Gaussian distribution functions the F-test can be used. For non-Gaussian distributions the Kruskal-Wallis test can be used.
Our interest lies in the case when the response variables y1, ..., ya are ordinal categorical, particularly in the case when y1, ..., ya were derived from some continuous variables x1, ..., xa by discretization. The formal definition follows.
Definition 1. Consider a continuous random variable x. A new ordered categorical random variable y with r categories is derived from x using a decomposition of the real line based on a set {θ1, θ2, ..., θ(r−1)}, −∞ = θ0 < θ1 < θ2 < · · · < θ(r−1) < θr = +∞. Then y = i if and only if x lies in the interval (θ(i−1), θi], i = 1, ..., r.
The set {θ1, θ2, ..., θ(r−1)} is called the support of the decomposition.
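In code, the decomposition of Definition 1 is a one-line cut operation; a minimal sketch (the support values are chosen only for illustration):

```python
# Discretize a continuous sample into r = len(support) + 1 ordered categories:
# y = i iff x lies in (theta_{i-1}, theta_i], as in Definition 1.
import numpy as np

support = [30.0, 50.0, 70.0]                  # {theta_1, ..., theta_{r-1}}
x = np.array([12.0, 30.0, 49.9, 70.1])
y = np.digitize(x, support, right=True) + 1   # categories 1..r, right-closed bins
print(y.tolist())                             # -> [1, 1, 2, 4]
```

The `right=True` argument makes the intervals right-closed, matching the intervals (θ(i−1), θi] of the definition.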
Because we compare the sample size for the Kruskal-Wallis test with the sample size for the F-test, we first recall the F-test and the Kruskal-Wallis test and their properties. Then the definition of the relative effect is introduced.
2.1.1 The F-test
In this section the properties of the F-test are briefly summarized. More information about them can be found e.g. in (Rasch et al., 2007, section 4.1.1), Scheffé (1959) or Lehmann (2005).
Let us consider a continuous random variable x. The model equation is written in the form (ANOVA model I):

    xij = E(xij) + eij = μ + αi + eij    (i = 1, ..., a; j = 1, ..., ni),    (2.3)

where μ, αi are real numbers (i.e. non-random); it should hold either Σ(i=1..a) αi = 0 or Σ(i=1..a) ni αi = 0 (which are equivalent if n1 = · · · = na). The eij are mutually independent normally distributed random variables with E(eij) = 0 and var(eij) = σ².
The ni's and a are known constants. Let us denote N = Σ(i=1..a) ni and ᾱ = Σ(i=1..a) αi / a.
In Rasch and Guiard (1990) it was shown that the F-test is quite robust and the assumption of normality of the error terms eij can be relaxed.
We want to design the experiment for a test of the hypothesis

    H0 : α1 = · · · = αa,    (2.4)

in other words that the factor A has no effect on the response variable, against the alternative

    HA : ∃ i, j : αi ≠ αj.
If the errors eij are normally distributed, the exact test statistic for testing the null hypothesis (2.4) is equal to

    F = MSA / MSR,    (2.5)

where

    MSA = Σi ni (x̄i − x̄)² / (a − 1)   and   MSR = Σi Σj (xij − x̄i)² / (N − a)

are the mean squares of the factor A and the residual mean squares (x̄i the row means, x̄ the overall mean).
Under the null hypothesis F follows the central F-distribution with f1 = a − 1 and f2 = N − a degrees of freedom. Otherwise, it follows the non-central F-distribution with f1 = a − 1 and f2 = N − a degrees of freedom and non-centrality parameter

    λ = Σ(i=1..a) ni (αi − ᾱ)² / σ².    (2.6)

Let F(f1, f2; λ; p) denote the p-quantile of the non-central F-distribution with f1 and f2 degrees of freedom and non-centrality parameter λ. If λ = 0 we shorten F(f1, f2; 0; p) to F(f1, f2; p).
If the realization of F in (2.5) exceeds F(a − 1, N − a; 1 − α), the null hypothesis is rejected at the level α, otherwise it is not rejected. For a = 2 the F-test coincides with the t-test for two independent samples.
A design of an experiment must involve specification of the required type-I-risk α and the required type-II-risk β. For the determination of the sample size of the F-test two more parameters should be specified: the variance σ² of the random errors eij and δ = αmax − αmin (αmax = max(αi) the greatest and αmin = min(αi) the lowest of the effects α1, ..., αa). We will determine the maxi-min sample size, which assures the type-II-risk β for any α1, ..., αa fulfilling αmax − αmin = δ. The maxi-min sample size will be denoted nmax.
The type-II-risk of the F-test depends on the non-centrality parameter λ. The type-II-risk decreases as λ increases. For given δ the term (2.6) is minimized if the other a − 2 effects are equal to ½(αmin + αmax). For any other values λ would be higher and the type-II-risk lower.

If N = Σ(i=1..a) ni is fixed, then the type-II-risk of the F-test is minimized when the subclass numbers ni are as equal as possible. If a is an integer divisor of N, then the type-II-risk is minimized for n1 = · · · = na = n = N/a. Therefore we choose n1 = · · · = na = n.
To calculate the required sample size nmax we have to solve the quantile equation

    F(a − 1, na − a; 1 − α) = F(a − 1, na − a; λ; β).    (2.7)

If αmax = αmin + δ and the other a − 2 effects αi are equal to ½(αmin + αmax) = αmin + δ/2 = ᾱ, then Σ(i=1..a)(αi − ᾱ)² = δ²/2 and so λ = n δ² / (2σ²). Thus λ depends on δ and σ only through their ratio; therefore only this ratio δ/σ, called the relative effect size, needs to be known for the calculation of the maxi-min sample size.
To conclude, for the evaluation of the maxi-min sample size of the F-test (for a given number of factor levels a) the type-I-risk α, the type-II-risk β and the relative effect size δ/σ must be fixed.
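The quantile equation (2.7) can be solved numerically by searching over n; the following sketch (a Python/scipy stand-in, not the thesis' own code) computes the maxi-min sample size this way:

```python
# Numerical solution of the quantile equation (2.7): find the smallest n per
# group so that the F-test, under the least favourable configuration of
# effects (range delta, the rest at the midpoint), has type-II-risk <= beta.
from scipy.stats import f, ncf

def maximin_n(a, alpha, beta, rel_effect, n_cap=1000):
    for n in range(2, n_cap):
        f1, f2 = a - 1, a * n - a
        lam = n * rel_effect**2 / 2.0          # lambda = n * (delta/sigma)^2 / 2
        crit = f.ppf(1 - alpha, f1, f2)        # critical value under H0
        if ncf.cdf(crit, f1, f2, lam) <= beta:  # type-II-risk at this lambda
            return n
    return None

print(maximin_n(a=6, alpha=0.05, beta=0.05, rel_effect=1.0))  # cf. Table 2.1
```

For α = 0.05, a = 6, β = 0.05 and δ/σ = 1 the search lands in the region of the corresponding entry of Table 2.1.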
In Table 2.1 we report some values of nmax in dependence on β and δ/σ.

Table 2.1: Maxi-min sample sizes for α = 0.05, a = 6 and different values of β and δ/σ.

δ/σ    β = 0.05   β = 0.1   β = 0.2
1         41         34        27
1.2       29         24        19
1.5       19         16        13

2.1.2 The Kruskal-Wallis test
The F-test for testing the hypothesis of equality of means in the one-way ANOVA model I discussed in Section 2.1.1 is based on the assumption that the observed variables are normally distributed and that their distributions in different groups differ only in their expected values. The Kruskal-Wallis test considered in this chapter can be used in cases where normality is questionable. The principle of this test will be explained in brief; for details see e.g. (Lehmann, 1975, chapter 5, section 2).
Let y1, ..., ya be random variables with distribution functions F1, ..., Fa, where yi corresponds to the observed variable at the i-th level of the factor A. We will test the hypothesis

    H0 : F1 = F2 = · · · = Fa,    (2.8)

against the alternative

    HA : ∃ i, j : Fi < Fj or Fi > Fj.    (2.9)
The test statistic is

    Q = 12 / (N(N + 1)) · Σ(i=1..a) Ti² / ni − 3(N + 1),    (2.10)

where Ti is the sum of the ranks of the observations in the i-th group (ranks taken in the pooled sample).
This basic version of the Kruskal-Wallis test assumes that the distribution functions Fi are continuous and therefore there are no ties (almost surely). In our case of categorical variables yi this is not true and a Kruskal-Wallis test corrected for ties should be used (Kruskal and Wallis (1952)). Let s be the number of distinct values of observations and tl be the number of tied values of the l-th smallest observed value, l = 1, ..., s. Then the corrected test statistic is equal to

    QK = Q / (1 − Σ(l=1..s) (tl³ − tl) / (N³ − N)).    (2.11)

Let us note that Q is a special case of QK with tl = 1 for all l = 1, ..., s = N.
Under H0 the test statistics Q and QK are asymptotically χ²-distributed with a − 1 degrees of freedom. For small sample sizes the critical values are tabulated in software or tables.
For a = 2 (comparison of distributions in only two groups) the test statistics reduce to the Wilcoxon (Mann-Whitney) test statistics.
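As an illustration (not from the thesis), the tie-corrected statistic (2.11) can be computed directly from mid-ranks and checked against scipy's implementation, which applies the same correction:

```python
# Tie-corrected Kruskal-Wallis statistic Q_K of (2.11), computed from
# mid-ranks of the pooled sample and the tie counts t_l.
import numpy as np
from scipy.stats import kruskal, rankdata

def qk(*groups):
    x = np.concatenate(groups)
    N = x.size
    r = rankdata(x)                          # mid-ranks handle ties
    q, pos = 0.0, 0
    for g in groups:
        t_i = r[pos:pos + len(g)].sum()      # rank sum T_i of group i
        q += t_i**2 / len(g)
        pos += len(g)
    q = 12.0 / (N * (N + 1)) * q - 3 * (N + 1)          # statistic Q of (2.10)
    _, t_l = np.unique(x, return_counts=True)            # tie counts t_l
    return q / (1 - (t_l**3 - t_l).sum() / (N**3 - N))   # correction of (2.11)

g1, g2, g3 = [1, 2, 2, 3], [2, 3, 3, 4], [1, 1, 2, 4]
print(qk(g1, g2, g3), kruskal(g1, g2, g3).statistic)
```

scipy's `kruskal` applies the identical tie correction, so both printed numbers agree.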
2.1.3 The relative effect
The aim of this work is to design an experiment for the Kruskal-Wallis test, i.e. to find the maxi-min sample size for given type-I-risk and type-II-risk. Because the exact distribution of the Kruskal-Wallis test statistic under the alternative hypothesis is not known, we derived a formula for determining the sample size by simulation.

We assume that the ordinal response variables y1, ..., ya are discretized from underlying continuous random variables x1, ..., xa (see Definition 1 on page 5). The loss of information caused by the discretization is measured by the so-called relative effect.
In more detail, the ordered categorical random variable y takes realizations belonging to r ordered categories C1 ≺ C2 ≺ · · · ≺ Cr with r ≥ 1 (we use the symbol ≺ to denote the order relation). We need a measure for the distance between two distributions. For this we use the approach of (Brunner and Munzel, 2002, section 1.4, formula (1.4.1)).
First, let us recall the definition of distribution function. The concept of Brunner and Munzel
(2002) is used to treat the problem of discontinuity of the distribution function.
Definition 2. Let y be a random variable. Then we define the (standardized) distribution function of y as

    F(t) = ½ (P(y < t) + P(y ≤ t)).
The relative effect is defined as follows.
Definition 3. For two independent random variables y1 and y2 with distribution functions F1(y) and F2(y) respectively, the probability

    p(y1; y2) = P(y1 < y2) + ½ P(y1 = y2) = ∫ F1 dF2

is called the relative effect of y2 with respect to y1.
Note that the relative effect p(y1; y2) is equal to 1 − p(y2; y1), and for continuous random variables p(y1; y2) = P(y1 < y2).
Our aim is to find a connection between the non-centrality parameter of the continuous model and the relative effect of the model with ordinal variables. Properties of the relative effect are discussed in this section in detail.
For simulation experiments we generate ordered categorical variables by decomposition of the
real line, as was described in Definition 1 (page 5). The relative effect for this case is computed
in the following example.
Example 1. Let x be a random variable with distribution function F. The random variable y is derived from it using the decomposition with support {θ1, θ2, ..., θ(r−1)}. Then it holds

    P(y = i) = F⁺(θi) − F⁺(θ(i−1)),

where F⁺ denotes the right-continuous version of the distribution function. If x1 and x2 are two variables with distribution functions F1 and F2, respectively, and the variables y1 and y2 are derived from them using the same support {θ1, θ2, ..., θ(r−1)}, the relative effect is

    p(y1; y2) = Σ(j=2..r) Σ(i=1..j−1) (F2⁺(θj) − F2⁺(θ(j−1))) (F1⁺(θi) − F1⁺(θ(i−1)))
                + ½ Σ(j=1..r) (F2⁺(θj) − F2⁺(θ(j−1))) (F1⁺(θj) − F1⁺(θ(j−1))).    (2.12)
The following theorem shows the relation between the relative effects of two continuous variables and two ordinal categorical variables derived from the continuous ones by decomposition.

Theorem 1. Let x1 and x2 be two independent continuous random variables and y1^r, y2^r two ordinal categorical variables with r categories, derived from x1, x2 by decomposition with support {θ1, θ2, ..., θ(r−1)}. Then:

(i) If r = 1 (i.e. y1^1 = y2^1 = 1 with probability 1), then p(y1^1; y2^1) = ½.

(ii) If r → ∞ and max_i |θ(i+1) − θi| → 0, then

    p(y1^r; y2^r) → p(x1; x2).    (2.13)

(iii) If p(x1; x2) > ½, then for θ1, θ2, θ3 ∈ R, θ1 < θ2 < θ3, the condition (2.14) is necessary for the relative effect to increase when a new support point θ2 is added between θ1 and θ3.

Proof.
(i) Obviously

    p(y1^1; y2^1) = P(y1^1 < y2^1) + ½ P(y1^1 = y2^1) = 0 + ½ · 1 = ½.

(ii) It holds

    p(x1; x2) − p(y1^r; y2^r) = ½ (P(x1 < x2, y1^r = y2^r) − P(x1 > x2, y1^r = y2^r)).    (2.15)

Both probabilities on the right-hand side tend to zero as the support becomes finer, and (2.13) follows.

(iii) Let us proceed by induction on r. Take any y1^r, y2^r and refine their support of decomposition by adding one point θ̃ somewhere between some θi and θ(i+1). Denote z1, z2 the categorical variables based on this new decomposition. Analogously to (2.15),

    p(z1; z2) − p(y1^r; y2^r) = ½ (P(z1 < z2, y1^r = y2^r) − P(z1 > z2, y1^r = y2^r)).

The event {y1^r = y2^r, z1 ≠ z2} means that x1 and x2 both lie in (θi, θ(i+1)] on opposite sides of θ̃, so by independence

    p(z1; z2) − p(y1^r; y2^r) = ½ (P(θi < x1 ≤ θ̃) P(θ̃ < x2 ≤ θ(i+1)) − P(θi < x2 ≤ θ̃) P(θ̃ < x1 ≤ θ(i+1))).
Part (ii) of Theorem 1 tells us that random variables with countably infinite support (i.e. variables y1^r, y2^r, which take the values 1, 2, ..., r, r → ∞) can, for a sufficiently dense support, have almost the same relative effect as variables with uncountable support (i.e. variables x1, x2, which take values from some interval of real numbers).
Part (iii) gives a necessary condition for increasing the relative effect by adding a new support point. If we take θ1 = θi, θ3 = θ(i+1) as elements of the original support and θ2 as a new element added between θi and θ(i+1), the condition (2.14) is necessary for an increase of the relative effect in this particular case.

The condition is fulfilled e.g. for x1, x2 uniformly distributed. It is usually fulfilled for normally distributed variables, but it does not hold in general, as is shown in Example 2. There the condition fails because the difference between the expected values of x1 and x2 is very small (relative to their standard deviation).
Example 2. Consider normally distributed random variables x1 and x2 with common variance equal to 1 and expected values equal to −0.1 and 0.1, respectively. Two random variables y1 and y2 are derived from them using the support {−0.4, −0.2}. Then p(y1; y2) = 0.5020 (this is easy to see using formula (2.12)).

If a new point −0.3 is added to the support and random variables z1, z2 are derived from x1 and x2 using the support {−0.4, −0.3, −0.2}, the relative effect is p(z1; z2) = 0.5004. This value is lower than the value of p(y1; y2).
The distribution functions of y1, y2 and z1, z2 are plotted in Figure 2.1.

Figure 2.1: The distribution functions of y1, y2 and z1, z2 for Example 2; y1 and z1 solid lines, y2 and z2 dashed lines.
2.2
2.2.1 Comparison of the F-test and the Kruskal-Wallis test for the one-way layout with ordinal categorical response
Some researchers use the F-test for the comparison of means of ordered categorical variables, although this is not correct. In this section we compare the nominal and the actual risks in order to investigate the impact of this mistake.
Let us consider three normally distributed random variables x1, x2, x3, with the same standard deviation σ = 50/3 = 16.67 and different expected values μ1 = 50 − σ/2 = 41.67, μ2 = 50, μ3 = 50 + σ/2 = 58.33. Three categorical random variables y1, y2, y3 are derived from the variables x1, x2, x3 using the support {30, 50, 70} (see Definition 1 on page 5). We are comparing a = 3 variables, each attaining r = 4 values.
We are interested in the properties of testing the hypothesis

    H0 : μ1 = μ2 = μ3.    (2.16)
Table 2.2: The actual type-II-risk. The bold printed values are the 20 % robust results.

αnom   βnom   n    Normal, F-test     Categorical, K-W   Normal, K-W        Categorical, F-test
                   βact (sd(βact))    βact (sd(βact))    βact (sd(βact))    βact (sd(βact))
0.10   0.20   17   0.1808 (0.0027)    0.2400 (0.0033)    0.2018 (0.0038)    0.2309 (0.0024)
0.10   0.15   19   0.1410 (0.0024)    0.1975 (0.0039)    0.1589 (0.0030)    0.1880 (0.0030)
0.10   0.10   22   0.0962 (0.0020)    0.1440 (0.0036)    0.1117 (0.0031)    0.1369 (0.0036)
0.10   0.05   27   0.0484 (0.0017)    0.0827 (0.0023)    0.0584 (0.0020)    0.0776 (0.0026)
0.05   0.20   21   0.1850 (0.0037)    0.2582 (0.0032)    0.2108 (0.0038)    0.2425 (0.0047)
0.05   0.15   23   0.1454 (0.0023)    0.2128 (0.0044)    0.1685 (0.0026)    0.2004 (0.0045)
0.05   0.10   27   0.0931 (0.0021)    0.1452 (0.0020)    0.1096 (0.0026)    0.1341 (0.0030)
0.05   0.05   32   0.0483 (0.0020)    0.0881 (0.0025)    0.0602 (0.0019)    0.0799 (0.0030)
0.01   0.20   30   0.1889 (0.0032)    0.2820 (0.0027)    0.2234 (0.0023)    0.2573 (0.0027)
0.01   0.15   33   0.1405 (0.0026)    0.2254 (0.0030)    0.1696 (0.0024)    0.2037 (0.0035)
0.01   0.10   37   0.0943 (0.0031)    0.1628 (0.0033)    0.1165 (0.0034)    0.1451 (0.0036)
0.01   0.05   43   0.0485 (0.0017)    0.0974 (0.0027)    0.0630 (0.0029)    0.0856 (0.0019)
The main question is whether the difference between the actual and the nominal type-II-risks is substantial. To assess the difference, the concept of 20 % robustness as defined in Rasch and Guiard (1990) is used: a test is called 20 % robust if |βnom − βact| ≤ 0.2 · βnom.
Simulation
Three levels of the nominal type-I-risk αnom (0.10, 0.05, 0.01) and four levels of the nominal type-II-risk βnom (0.20, 0.15, 0.10, 0.05) were considered. For each combination of these nominal risks the maxi-min sample size n for the F-test was evaluated (see Section 2.1.1). A sample of size n was generated for each of the three normally distributed random variables x1, x2, x3, and from them the categorical random variables y1, y2, y3 were derived by decomposition with the support {30, 50, 70}.

For both the normal and the categorical variables the F-test and the Kruskal-Wallis test were performed. This was repeated 100 000 times and the proportion of non-significant tests was recorded; this is the (estimated) actual type-II-risk βact. The repetitions were divided into 10 blocks of 10 000 and βact was recorded for each of them; these values were used to estimate the standard deviation of βact (denoted sd(βact)).

For the actual type-I-risk, the simulation was done in an analogous way, only the means of the variables x1, x2, x3 were all equal, μ1 = μ2 = μ3 = 50, and the proportion of significant tests was recorded as the estimate of αact.
Results
Tables 2.2 and 2.3 report the actual type-II-risks and type-I-risks of the F-test and the Kruskal-Wallis test. The sample size n is the maxi-min sample size of the F-test for the given αnom and βnom. For each test the standard deviation sd of the estimate of the risk is reported in the second column.

It is not surprising that the actual type-I-risk is in all cases lower than or equal to the nominal risk (the few opposite cases are caused by the error of the estimate, as is seen from the standard deviations). The tests are constructed to keep the given level of the type-I-risk.
Table 2.3: The actual type-I-risk.

αnom   βnom   n    Normal, F-test     Categorical, K-W   Normal, K-W        Categorical, F-test
                   αact (sd(αact))    αact (sd(αact))    αact (sd(αact))    αact (sd(αact))
0.10   0.20   17   0.1013 (0.0030)    0.0995 (0.0029)    0.0999 (0.0020)    0.1007 (0.0035)
0.10   0.15   19   0.1004 (0.0028)    0.0994 (0.0046)    0.0997 (0.0033)    0.1010 (0.0039)
0.10   0.10   22   0.0996 (0.0031)    0.0994 (0.0040)    0.0987 (0.0036)    0.0998 (0.0040)
0.10   0.05   27   0.0984 (0.0022)    0.0981 (0.0018)    0.0982 (0.0020)    0.0985 (0.0025)
0.05   0.20   21   0.0496 (0.0029)    0.0484 (0.0028)    0.0474 (0.0029)    0.0509 (0.0032)
0.05   0.15   23   0.0494 (0.0016)    0.0492 (0.0015)    0.0485 (0.0013)    0.0508 (0.0020)
0.05   0.10   27   0.0505 (0.0026)    0.0487 (0.0027)    0.0495 (0.0032)    0.0516 (0.0024)
0.05   0.05   32   0.0497 (0.0012)    0.0490 (0.0015)    0.0493 (0.0018)    0.0509 (0.0010)
0.01   0.20   30   0.0101 (0.0010)    0.0090 (0.0011)    0.0090 (0.0011)    0.0104 (0.0013)
0.01   0.15   33   0.0098 (0.0011)    0.0091 (0.0004)    0.0089 (0.0010)    0.0100 (0.0008)
0.01   0.10   37   0.0099 (0.0012)    0.0097 (0.0010)    0.0092 (0.0010)    0.0107 (0.0014)
0.01   0.05   43   0.0096 (0.0008)    0.0094 (0.0010)    0.0091 (0.0007)    0.0101 (0.0010)
The results for the type-II-risk are more interesting. In Table 2.2 the 20 % robust results are printed in bold. Naturally, for normally distributed random variables the actual type-II-risk of the F-test is close to the nominal one. The type-II-risk of the Kruskal-Wallis test is a bit higher; it is not 20 % robust in two cases.

For the ordinal categorical variables it seems that the actual type-II-risk of the F-test is closer to the nominal one than the risk of the Kruskal-Wallis test. However, the difference between the nominal risks and the actual ones is greater than the 20 % required by the concept of 20 % robustness: βact lies neither in the interval [0.08, 0.12] for βnom = 0.10, nor in the interval [0.12, 0.18] for βnom = 0.15, nor in the interval [0.16, 0.24] for βnom = 0.20, nor in the interval [0.04, 0.06] for βnom = 0.05. Note that the increase of the type-II-risk is partially caused by the discretization, which decreases the relative effect.

It follows that the maxi-min sample size computed for the F-test and a continuous response variable is lower than the required maxi-min sample size for the Kruskal-Wallis test and ordinal categorical response variables. A method for determining the sample size for the Kruskal-Wallis test is therefore necessary and will be discussed in the following part.
2.2.2 Determination of the sample size
The Kruskal-Wallis test is used to test the hypothesis (2.8). To keep the appropriate type-II-risk it is necessary to determine the maxi-min sample size. Because the exact distribution of the Kruskal-Wallis test statistics (2.10) or (2.11) under the alternative hypothesis is not known, evaluating the power of the test and subsequently determining the sample size is very problematic in the case of ordinal categorical response variables. In this section a formula for computing the sample size is found by simulation.
If only two groups are compared, the Kruskal-Wallis test coincides with the Wilcoxon test. The asymptotic power of the Wilcoxon test statistic under the alternative hypothesis is derived in Lehmann (1975). In Noether (1987) two formulas for determining the size of experiments were provided. Properties of Noether's formulas were described in detail in Chakraborti et al. (2006). In Section 2.2.3 our results for the Kruskal-Wallis test are compared with Noether's formulas.

Table 2.4: The parameters and properties of the used distributions of the Fleishman system. All the distributions have zero mean and standard deviation 1.

No. of distr.   Skewness   Kurtosis   u = −s            t                v
1               0          3.75       0                 0.748020807992   0.077872716101
2               0          7          0                 0.630446727840   0.110696742040
3               1          1.5        0.163194276264    0.953076897706   0.006597369744
4               2          7          0.260022598940    0.761585274860   0.053072273491
5 (Normal)      0          0          0                 1                0
6 (Uniform)     0          −1.2
Let us consider a ≥ 2 continuously distributed random variables x1, ..., xa. We want to test whether their means are all equal or whether there is at least one pair of these variables with different means.

Instead of these continuous variables, only the ordinal categorical variables y1, ..., ya are observed. They are derived from the variables x1, ..., xa using the decomposition based on a support {θ1, θ2, ..., θ(r−1)}, as described in Definition 1 (page 5).
In this section, the simulation to determine the type-II-risk for given sample size is described.
Simulation
For the simulation experiment it is important to choose the mechanism of generating the categorical random variables. We used six different types of distribution of the underlying continuous variables.

All the considered distributions have zero mean and standard deviation 1; they differ in skewness and kurtosis. The first distribution is the normal distribution, i.e. both the skewness and the kurtosis are equal to 0. The second distribution is the uniform distribution on the interval (−√3, √3); its skewness is equal to 0, its kurtosis to −1.2.

The other distributions come from the Fleishman system, described in Rasch and Guiard (1990). The random variable has the form s + tx + ux² + vx³, where x is a standard normally distributed random variable and s, t, u, v are given parameters. Values of these parameters and properties of the distributions can be found in Table 2.4. The densities of these distributions are plotted in Figure 2.2.

For each of these distributions two different decompositions are used:

Equidistant: The support points are equidistant in the area where 99 % of observations lie.

Equal mass: The measure of all categories is equal with respect to the given distribution, i.e. an observation falls in each category with the same probability.

Let us emphasize that the support points were fixed for the distributions with zero mean.

The parameters of the simulation are as follows:
14
0.6
0.4
0.0
0.2
f(x)
0.4
0.0
0.2
f(x)
0.4
0.0
0.2
Distribution 4
Distribution 5
Distribution 6
0.4
0.0
0.2
f(x)
0.0
0.2
f(x)
0.4
0.0
0.2
0.6
0.6
0.6
0.4
f(x)
f(x)
Distribution 3
0.6
Distribution 2
0.6
Distribution 1
1. Continuous random samples of size n were generated for each group with the appropriate expected value. They were then transformed to ordinal categorical variables using the given support of decomposition.

2. The Kruskal-Wallis test was performed and the result was recorded.

These two steps were repeated 10 000 times. The (estimated) actual type-II-risk for the given sample size n is equal to the proportion of non-significant tests in these repetitions. For the further analysis only results with an actual type-II-risk smaller than 0.40 were used.
Formula
By inspection of the simulation results it was found that the required sample size depends almost linearly on the maxi-min sample size for the ANOVA F-test with normally distributed variables (for given a, α, β, the relative effect p and the number of categories r). Many linear models were tried; the model below was chosen as the most appropriate (good fit and not too many factors).
Given the type-I-risk α = 0.05, the maxi-min sample size for the Kruskal-Wallis test can be computed as follows:

    n(β) = 3.054 n0(β) − 47.737 (1/r) + 51.288 p² + 82.050 (1/r²) + 2.336 n0(β)(1/r)
           − 7.428 n0(β) p² − 0.535 n0(β)(1/r²) + 29.708 p²(1/r) + 56.102 − 223.770 p²(1/r²),    (2.17)

where n0(β) denotes the maxi-min sample size of the F-test for normally distributed variables, p the relative effect and r the number of categories.
Table 2.5: Comparison of required sample sizes attained by simulation and calculated by formula (2.17) for β = 0.2. The columns give the number of groups a, the relative effect size δ/σ, the identification of the underlying distribution (as in Table 2.4), the relative effect p of the distribution of the categorical variables and their number of categories r, the maxi-min sample size n0(β) of the F-test for normal variables, and the maxi-min sample sizes for the Kruskal-Wallis test based on the simulation (nSIM) and calculated by formula (2.17) (nFIT).

a   δ/σ    Distribution   p      r   n0(β)   nSIM   nFIT
2   1      1              0.66   3   16.71   31     35
2   1      1              0.78   5   16.71   14     15
2   1.67   1              0.77   3    6.76   11     11
2   1.67   1              0.89   5    6.76    7      7
6   1      1              0.66   3   26.59   47     55
6   1      1              0.78   5   26.59   22     22
6   1.67   1              0.77   3   10.2    16     19
6   1.67   1              0.89   5   10.2    10     10
2   1      3              0.69   3   16.71   28     30
2   1      3              0.77   5   16.71   17     17
2   1.67   3              0.8    3    6.76   11     10
2   1.67   3              0.88   5    6.76    7      7
6   1      3              0.69   3   26.59   44     46
6   1      3              0.77   5   26.59   27     26
6   1.67   3              0.8    3   10.2    16     16
6   1.67   3              0.88   5   10.2    11     10
Figure 2.3: Relation between the residuals of model (2.17) and the required sample size estimated from this model. The ratio of the residuals and the estimated sample sizes is plotted on the y-axis, the estimated sample sizes on the x-axis.
Table 2.6: Sample sizes computed by formula (2.17) for α = 0.05, a = 6, r = 5, p = 0.71 and different values of β and δ/σ.

δ/σ    β = 0.05   β = 0.1   β = 0.2
1         61         51        40
1.2       51         42        32
1.5       38         30        21
It should be noted that the formula was checked for four values of δ/σ between 1 and 1.7. It can be interpolated for all values in this interval. Similarly, it is assumed that the number of categories r can be interpolated for all integer values between 3 and 50. With growing r the influence of r on the required size of an experiment decreases (the distribution tends to a continuous one, which is reflected by the 1/r terms in the formula; see Theorem 1 in Section 2.1.3). Therefore, if r > 50, use formula (2.17) with r = 50.
For a comparison of the sample sizes needed for quantitative and ordered categorical variables, we calculated values analogous to those of Table 2.1 by formula (2.17); the results are in Table 2.6. The required sample size is always higher in the case of the Kruskal-Wallis test and ordinal categorical variables (the relative effect is lower in that case) than for the F-test and continuous normally distributed variables, and the difference is substantial.
To summarize, the required size of an experiment with ordinal categorical variables for given type-I-risk α = 0.05, type-II-risk β in the interval [0.05, 0.4], δ/σ in the interval [1, 1.7] and number of compared groups a between 2 and 10 can be calculated by formula (2.17). There are no restrictions on the other parameters.
2.2.3 Comparison with Noether's formulas
The Wilcoxon test is a special case of the Kruskal-Wallis test for the number of groups equal to a = 2. In Chakraborti et al. (2006) two formulas for evaluating the sample size for the Wilcoxon test are mentioned.
Noethers formula is derived for the local alternatives and the required sample size is computed
as
2 !
1 (1 /2) + 1 (1 )
nN F = CEIL
,
(2.18)
6 (p 0.5)2
where 1 is the quantile function of normal distribution, and the risk of the first and the
second kind and p the relative effect. This formula is quite simple, which is useful in practice.
More accurate, but demanding more inputs, is the second formula (it was derived for a one-tailed
alternative with α instead of α/2):

    n_F2 = ⌈ ( Φ⁻¹(1 − α/2)/√6 + Φ⁻¹(1 − β) √((p₂ − p²) + (p₃ − p²)) )² / (p − 0.5)² ⌉,    (2.19)

where p₂ = P(x₁ < x₂ and x₁ < x₂′) and p₃ = P(x₁ < x₂ and x₁′ < x₂), while x₁, x₁′
are independent random variables distributed as in the group with the lower expected value,
and x₂, x₂′ are independent random variables distributed as in the group with the greater
expected value.
Note that both formulas (2.18) and (2.19) were derived assuming continuous distributions of
the response variables.
Values of the sample size computed in some special cases using formula (2.17), derived in
Section 2.2.2, and formulas (2.18) and (2.19) from Chakraborti et al. (2006) can be found
in Table 2.7.
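For readers who want to reproduce such values, the two closed-form formulas translate directly into code. The sketch below is our illustration (not part of the original computations); it uses the standard-library `NormalDist` for the quantile function Φ⁻¹ and reads (2.19) with the √6 divisor inside the first term, as printed above:

```python
import math
from statistics import NormalDist  # standard library; Phi^{-1} = inv_cdf


def n_noether(alpha, beta, p):
    """Noether's formula (2.18): sample size for the two-sided Wilcoxon test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(1 - beta)
    return math.ceil((z_a + z_b) ** 2 / (6 * (p - 0.5) ** 2))


def n_refined(alpha, beta, p, p2, p3):
    """Formula (2.19) as we read it; needs the extra probabilities p2 and p3."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(1 - beta)
    num = (z_a / math.sqrt(6) + z_b * math.sqrt((p2 - p ** 2) + (p3 - p ** 2))) ** 2
    return math.ceil(num / (p - 0.5) ** 2)


print(n_noether(0.05, 0.2, 0.71))  # -> 30
```

For p close to 0.5 both formulas diverge, reflecting that ever more observations are needed to detect smaller effects.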
In the left part of Table 2.8 the differences between the simulated and calculated sample
sizes are given as a percentage of the calculated sample size. Ideally, most of the
observations should lie in the 0 % row, as happens for formula (2.17). Formulas (2.18)
and (2.19) tend to overestimate the required sample size. The right part of Table 2.8
shows analogous data restricted to type-II-risks β between 0.1 and 0.3.
Table 2.7: Comparison of required sample sizes for the Wilcoxon test evaluated using simulation and formulas (2.17), (2.18) and (2.19) for β = 0.2 and some values of the other
parameters.

    δ/σ     p      p₂     p₃    r   n₀(δ)   Simulated   (2.17)  (2.18)  (2.19)
    1       0.66   0.47   0.54  3   16.71      31         35      55      53
    1       0.78   0.66   0.71  4   16.71      14         15      17      17
    1       0.78   0.67   0.71  5   16.71      14         15      17      16
    1.67    0.77   0.63   0.70  3    6.76      11         11      18      17
    1.67    0.89   0.82   0.85  4    6.76       7          7       9       8
    1.67    0.89   0.82   0.85  5    6.76       7          7       9       8
    1       0.69   0.52   0.58  3   16.71      28         30      37      36
    1       0.76   0.65   0.67  4   16.71      17         18      20      20
    1       0.77   0.66   0.67  5   16.71      17         17      19      19
    1.67    0.80   0.68   0.73  3    6.76      11         10      15      13
    1.67    0.88   0.82   0.83  4    6.76       7          7      10       9
    1.67    0.88   0.83   0.83  5    6.76       7          7       9       8
    1       0.71   0.56   0.63  3   16.71      22         26      30      30
    1       0.74   0.61   0.66  4   16.71      20         21      23      23
    1       0.75   0.62   0.66  5   16.71      19         20      22      21
    1.67    0.83   0.71   0.78  3    6.76       9          9      13      12
    1.67    0.86   0.77   0.81  4    6.76       8          7      11       9
    1.67    0.87   0.78   0.81  5    6.76       8          7      10       9
    1       0.69   0.53   0.59  3   16.71      31         29      36      36
    1       0.72   0.57   0.62  4   16.71      25         24      28      28
    1       0.73   0.59   0.62  5   16.71      22         22      25      24
    1.67    0.82   0.72   0.77  3    6.76      10          9      13      13
    1.67    0.86   0.77   0.81  4    6.76       8          7      11       9
    1.67    0.86   0.76   0.80  5    6.76       8          7      11       9
Table 2.8: The differences between the simulated sample size and the sample sizes evaluated
using formulas (2.17), (2.18) and (2.19), given as a percentage of the evaluated sample sizes.
The cells contain the percentages of cases in the whole data set.

                              β ∈ [0.05, 0.4]            β ∈ [0.1, 0.3]
    Difference            (2.17)  (2.18)  (2.19)    (2.17)  (2.18)  (2.19)
    positive, > 40 %        0.1     0       0         0       0       0
    positive, 20 %-40 %     1.6     0       0         1.1     0       0
    positive, 10 %-20 %     9.1     0       0.2       9.3     0       0
    positive, 5 %-10 %      8.0     0       1.0       8.2     0       0.1
    positive, 0 %-5 %       3.8     0       0.5       2.8     0       0
    0 %                    26.4     0      12.7      27.1     0      13.9
    negative, 0 %-5 %       5.8     0       1.5       3.6     0       1.1
    negative, 5 %-10 %     20.4     2.1    12.3      22.0     1.1    12.0
    negative, 10 %-20 %    21.4    43.0    33.9      23.6    43.9    33.8
    negative, 20 %-40 %     3.1    43.3    30.9       2.3    43.0    32.4
    negative, > 40 %        0.2    11.6     6.9       0      12.1     6.7
Chapter 3

3.1 Preliminaries

3.1.1 Description of the problem
In this section we discuss two-way ANOVA models. First, the model with both factors
and the interaction fixed is considered. The response in the ith row and the jth column is
modeled as

    y_ij = μ + α_i + β_j + (αβ)_ij + e_ij,    i = 1, …, a, j = 1, …, b,    (3.1)

where μ, α_i, β_j and (αβ)_ij are real constants such that

    Σᵢ α_i = Σⱼ β_j = Σᵢ (αβ)_ij = Σⱼ (αβ)_ij = 0,

and the e_ij are independent normally distributed random variables with zero mean and
variance σ².
Second, the model with one factor fixed and one factor random is considered; the interaction
is a random variable. The response variable is modeled as

    y_ij = μ + α_i + b_j + (ab)_ij + e_ij,    i = 1, …, a, j = 1, …, b,    (3.2)

where μ and α_i are real constants and b_j, (ab)_ij and e_ij are normally distributed random
variables, all with zero mean and variances σ_B², σ_AB² and σ², respectively.
We want to test the hypothesis that there is no interaction in the model. In the fixed effects
model (3.1) the additivity hypothesis can be written as

    H₀: (αβ)_ij = 0,    i = 1, …, a, j = 1, …, b.    (3.3)

In the mixed model (3.2) the corresponding hypothesis is

    H₀: σ_AB² = 0.    (3.4)
Many tests have been designed for testing hypothesis (3.3) in the fixed effects model (3.1).
We want to confirm whether these tests also work in the case of the mixed model. The
power of these tests in the mixed model is investigated and an empirical formula for the size
of an experiment for the Johnson-Graybill test is derived. Finally, a modification of the Tukey
test with improved power is developed.
3.1.2 Tests of additivity
Several tests of additivity in the fixed effects model (3.1) have been developed over the years.
Five of them will be discussed here, namely the tests by Tukey (1949), Mandel (1961), Johnson
and Graybill (1972), Boik (1993b) and Tusell (1990). More details can be found e.g. in Boik
(1993a) or Alin and Kurt (2006).
Subsequently, the following notation will be used: let ȳ.. = Σᵢ Σⱼ y_ij/(ab) denote the overall
mean of the response, ȳᵢ. = Σⱼ y_ij/b the ith row mean and ȳ.ⱼ = Σᵢ y_ij/a the jth column
mean. The matrix R will stand for the residual matrix with respect to the main effects,

    r_ij = y_ij − ȳᵢ. − ȳ.ⱼ + ȳ..    (3.5)

If interaction is present, we may expect that some of the eigenvalues of RR⊤ will be
substantially higher than the others.
Tukey test: Introduced in Tukey (1949). Tukey suggested first estimating the row and column
effects by the row and column means, α̂_i = ȳᵢ. − ȳ.., β̂_j = ȳ.ⱼ − ȳ.., and then testing for
interaction of the type (αβ)_ij = k α_i β_j, where k is a real constant (k = 0 implies no
interaction). The Tukey test statistic S_T equals

    S_T = MS_int / MS_error,

where

    MS_int = ( Σᵢ Σⱼ y_ij (ȳᵢ. − ȳ..)(ȳ.ⱼ − ȳ..) )² / ( Σᵢ (ȳᵢ. − ȳ..)² · Σⱼ (ȳ.ⱼ − ȳ..)² )

and

    MS_error = ( Σᵢ Σⱼ (y_ij − ȳᵢ. − ȳ.ⱼ + ȳ..)² − MS_int ) / ( (a − 1)(b − 1) − 1 ).

Under the additivity hypothesis, S_T follows the F distribution with 1 and (a − 1)(b − 1) − 1
degrees of freedom.

Mandel test: Introduced in Mandel (1961). Mandel generalized Tukey's idea by allowing each
row its own slope with respect to the column effects; the slope of the ith row is estimated as

    z_i = Σⱼ y_ij (ȳ.ⱼ − ȳ..) / Σⱼ (ȳ.ⱼ − ȳ..)².
Johnson-Graybill test: Introduced in Johnson and Graybill (1972). These authors chose
a different approach and derived a test for (αβ)_ij = k c_i d_j, with c_i and d_j being certain
row and column constants and k an overall constant. They suggested the test statistic

    S_J = λ₁ = eig₁(RR⊤) / tr(RR⊤),

i.e. the largest eigenvalue of RR⊤ divided by its trace. The additivity hypothesis is rejected
if S_J is large.
Tusell test: Introduced in Tusell (1990). With λ_i = eig_i(RR⊤)/tr(RR⊤) denoting the
normalized eigenvalues of RR⊤, the test statistic equals

    S_U = ( (a − 1)^(a−1) ∏_{i=1}^{a−1} λ_i )^((b−1)/2).

The additivity hypothesis is rejected if S_U is low. Critical values for this test statistic are
given e.g. in Kres (1972). Note that these tables should be used with (a − 1) = p and b = N.
Locally best invariant (LBI) test: See Boik (1993b). This test was designed to have
locally more power than the Tusell test. The LBI test statistic equals

    S_L = 1 / ( (a − 1) Σ_{i=1}^{a−1} λ_i² ),

where λ_i are again the normalized eigenvalues of RR⊤.
The critical values of this test can be found by simulation for given a and b.
All these tests (together with procedures for finding the critical values) were implemented
in the R environment (R Development Core Team (2008)), package AdditivityTests. It may
be downloaded from
http://5r.matfyz.cz/skola/AdditivityTests/additivityTests_0.3.zip.
To our knowledge, this is the first R implementation of additivity tests, with the
exception of the Tukey test.
All these tests were developed for the fixed effects model (3.1). In the next section their usage
for the model with mixed effects (3.2) is examined.
3.2 Additivity tests in the mixed model

3.2.1 Type-I-risk

The main interest of our work lies in model (3.2) with one fixed and one random factor.
The question arises whether the tests presented in the previous section, developed for
the fixed effects model (3.1), can also be used in this situation.
We considered the common 5 % type-I-risk level and performed a simulation to estimate the
actual type-I-risk. In the simulation the parameters were set to the following values:
The number of levels of the fixed factor was equal to a = 3, 4, …, 10.
Table 3.1: Number and percentage of the simulated cases where the actual type-I-risk is lower
or greater than the nominal level 5 %.

    Test                     ≤ 0.05         > 0.05
    Tukey test               349 (96.94)    11 (3.06)
    Mandel test              348 (96.67)    12 (3.33)
    Johnson-Graybill test    339 (94.17)    21 (5.83)
    Tusell test              337 (93.61)    23 (6.39)
    LBI test                 336 (93.33)    24 (6.67)
The number of levels of the random factor b was chosen between 4 and 50 (in steps of 2
between 4 and 20, in steps of 5 between 20 and 50).
The variance of the random factor was equal to σ_B² = 2, 5, 10.
The variance of the random error was σ² = 1. For other values of σ² the model can be
scaled (see Example 3 on page 29).
In one step of the simulation a dataset was generated based on the model without interaction
(σ_AB² = 0). Then one of the Tukey, Mandel, Johnson-Graybill, LBI and Tusell tests was
performed. The percentage of significant results among the simulation steps is taken as an
estimate of the actual type-I-risk of the test.
The 10 000 simulation steps were repeated 10 times and the standard error of the estimate was
computed from these 10 repetitions. Then, for each test and each combination of parameters,
the one-sided hypothesis
H₀: the actual type-I-risk is lower than or equal to 0.05
was tested by a one-sample t-test at the 5 % level against the alternative
HA: the actual type-I-risk is greater than 0.05.
The results of these t-tests for each additivity test are summarized in Table 3.1.
For the Tukey and Mandel tests the vast majority (> 95 %) of cases is not significantly above
the 0.05 level. For the other tests the estimated type-I-risk exceeds 0.05 in slightly
more cases (6-7 %). However, these may also be false positives caused by multiple testing.
We would like to remark that although the tests were derived for the fixed effects model
(Σ_{j=1}^b β_j = 0), we used them in the mixed effects model, where E b_j = 0 but
Σ_{j=1}^b b_j ≠ 0 almost surely. However, for a high number of levels b of the random factor,
the average of the b_j converges to zero (law of large numbers, see e.g. Grimmett and
Stirzaker (1992)).
For the 5 % type-I-risk we can say that none of the five additivity tests seems to violate the
type-I-risk, and therefore they can be used for the mixed effects model as well.
3.2.2 Power of the tests
The power of the Tukey, Mandel, Johnson-Graybill, LBI and Tusell tests is studied in this
section. The powers of these tests were compared by simulation. It is shown that while
the Tukey and Mandel tests have good power when the interaction is a product of the main
effects, i.e. when (ab)_ij = k α_i b_j (k a real constant, α_i and b_j the row and column effects
in model (3.2)), their power for more general interaction is very poor. The other three tests
work slightly worse in this special case but retain good power in more general cases too.
Let us consider the mixed effects model (3.2). Two possible interaction schemes were under
inspection:

Type (A): (ab)_ij = k α_i b_j
Type (B): (ab)_ij = k α_i c_j,

where the c_j are normally distributed random variables with zero mean and variance σ_B²,
mutually independent of b_j and e_ij, k is a real constant, α_i the row effect and b_j the
column effect.
Two possibilities are considered for the value of b, either b = 10 or b = 50, and 10 different
values of the interaction parameter k between 0 and 12 are considered. The other parameters
are μ = 0, σ_B² = 2, σ² = 1, a = 10,

(α₁, …, α₁₀) = (−2.03, −1.92, −1.27, −0.70, 0.46, 0.61, 0.84, 0.94, 1.07, 2.00).
For each combination of parameters a dataset was generated from model (3.2); the Tukey,
Mandel, Johnson-Graybill, LBI and Tusell tests were performed and the results were noted
down. This step was repeated 10 000 times. The estimated power of a test is the percentage
of positive results.
All tests were done at the α = 5 % level. The dependence of the power on the constant k is
visualized in Figure 3.1. As we can see, while the Tukey and Mandel tests outperformed the
other three tests for interaction type (A), they completely fail to detect interaction
type (B), even for large values of k. Therefore, it is desirable to develop a test which is able
to detect a spectrum of practically relevant alternatives while still having power comparable
to the Tukey and Mandel tests for the most common interaction type (A).
Because in practice the type of interaction is usually not known, the Johnson-Graybill, LBI
or Tusell test should be recommended for the hypothesis of additivity (3.3) or (3.4).
Another possibility is to use the modified Tukey test proposed in Section 3.3.
3.2.3 Size of the experiment

In this section we propose an empirical formula for the required size of an experiment for
the Johnson-Graybill test.
We consider an interaction term in model (3.2) of the form

    (ab)_ij = k α_i c_j,    (3.6)

where the α_i are the row effects in (3.2) and the c_j are normally distributed random variables
with zero expected value and variance σ_B², mutually independent of the random variables b_j
and the random errors e_ij; k is a real constant.
The interaction (3.6) is a random variable with zero mean and variance

    var (ab)_ij = k² α_i² σ_B².
Figure 3.1: Dependence of the power on k, b and the interaction type; b = 10 left, b = 50 right,
interaction type (A) top, type (B) bottom. Tukey test solid line, Mandel test dashed line,
Johnson-Graybill test dotted line, Tusell test long-dash line, LBI test dot-dash line.
The variance of the random effect b_j was considered to be equal to 1, 2 or 2. The parameter
k (which controls the variance of the random interaction (ab)_ij) takes the values 0.03, 0.05,
0.07 and 0.1. The variance of the random noise e_ij is considered to be equal to 1; in the case
of another value the model should be scaled (see Example 3 on page 29).
The power of a test increases with the distance of its alternative from the null hypothesis.
Based on the simulation, the power of the Johnson-Graybill test for a type-I-risk equal to 5 %
can be approximated by

    1 − β(b) = 1 − 1 / ( a b k⁴ (σ_B²)² Σ_{i=1}^a α_i⁴ ).    (3.7)

The formula was computed only for powers in the interval [0.10, 0.95].
Let us emphasize that in practice the number of rows a is fixed and we can influence the size
of the experiment only through the number of columns b. By simple manipulation, formula (3.7)
can be reformulated as follows:

    b(β) = ⌈ 1 / ( β a k⁴ (σ_B²)² Σ_{i=1}^a α_i⁴ ) ⌉,    (3.8)

where ⌈x⌉ means the lowest integer equal to or greater than x. In case σ² ≠ 1 the model
should be scaled, see Example 3.
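Formula (3.8), as we read it, is straightforward to evaluate; the following sketch (our illustration, with hypothetical argument names) computes the required number of levels b:

```python
import math


def b_required(beta, a, k, sigma_b2, alphas):
    """Number of levels of the random factor given by formula (3.8), as we
    read it, for the Johnson-Graybill test (type-I-risk 5 %, sigma^2 = 1)."""
    s4 = sum(ai ** 4 for ai in alphas)      # sum of alpha_i^4
    return math.ceil(1.0 / (beta * a * k ** 4 * sigma_b2 ** 2 * s4))
```

For instance, with the row effects of Section 3.2.2, k = 0.1 and σ_B² = 2, halving the type-II-risk β roughly doubles the required b, since b is inversely proportional to β.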
Figure 3.2 plots the difference between the number of levels of the random factor b realized
by simulation and the number computed by formula (3.8). Notice that the formula gives quite
satisfactory results, although there are a few outliers.
Example 3. Scaling the model when σ² ≠ 1
Suppose we want to plan an experiment and use formula (3.8) to determine the size
of this experiment. This formula assumes that the variance of the errors e_ij in model (3.2)
is equal to 1. This example shows the solution when this assumption is violated.
Figure 3.2: Dependence of the difference between the number of levels of the random factor
realized by simulation (b_SIM) and the number estimated by formula (3.8) (b_EST) on the
type-II-risk β (the horizontal line is the zero level).
Let the variance of e_ij in model (3.2) be equal to σ² > 0. We divide equation (3.2) with
interaction (3.6) by σ and for i = 1, …, a, j = 1, …, b define

    ỹ_ij = y_ij/σ,  μ̃ = μ/σ,  α̃_i = α_i/σ,  b̃_j = b_j/σ,  c̃_j = c_j/σ,  k̃ = kσ  and  ẽ_ij = e_ij/σ.    (3.9)

The transformed model is again of the form (3.2) with interaction (3.6), now with error
variance equal to 1, so formula (3.8) can be applied to the scaled parameters.
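In code, the rescaling (3.9) is a simple parameter transformation (a sketch with hypothetical names; note that k is multiplied by σ while all the other quantities are divided by it):

```python
def scale_to_unit_error(mu, alphas, sigma_b2, k, sigma2):
    """Transform the parameters of model (3.2) with interaction (3.6) by
    dividing through by sigma, per (3.9), so the error variance becomes 1."""
    s = sigma2 ** 0.5
    return {
        "mu": mu / s,
        "alphas": [a / s for a in alphas],
        "sigma_b2": sigma_b2 / sigma2,   # var(b_j / sigma)
        "k": k * s,                      # k-tilde = k * sigma
        "sigma2": 1.0,                   # error variance after scaling
    }
```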
3.3 Modification of the Tukey test

To increase the power of the Tukey test, a modification of it is proposed in this section.
In the classical Tukey test the model

    y_ij = μ + α_i + β_j + k α_i β_j + e_ij    (3.10)

is tested against the submodel

    y_ij = μ + α_i + β_j + e_ij.
We start from the usual estimates μ̂ = ȳ.., α̂_i^(0) = ȳᵢ. − ȳ.., β̂_j^(0) = ȳ.ⱼ − ȳ.., with the
residual sum of squares

    RSS(0) = Σᵢ Σⱼ (y_ij − ȳᵢ. − ȳ.ⱼ + ȳ..)²,

i.e. the same as in the Tukey test, and then we continue by an iterative procedure, updating
the estimates based on the previous step's versions:

    α̂_i^(n) = Σⱼ (y_ij − μ̂ − β̂_j^(n−1)) (1 + k̂^(n−1) β̂_j^(n−1)) / Σⱼ (1 + k̂^(n−1) β̂_j^(n−1))²,

    β̂_j^(n) = Σᵢ (y_ij − μ̂ − α̂_i^(n−1)) (1 + k̂^(n−1) α̂_i^(n−1)) / Σᵢ (1 + k̂^(n−1) α̂_i^(n−1))²,

    k̂^(n) = Σᵢ Σⱼ (y_ij − μ̂ − α̂_i^(n−1) − β̂_j^(n−1)) α̂_i^(n−1) β̂_j^(n−1) / Σᵢ Σⱼ (α̂_i^(n−1))² (β̂_j^(n−1))².

Surprisingly, it seems that one iteration is enough in the vast majority of cases. Therefore,
for simplicity, let us define

    RSS = Σᵢ Σⱼ ( y_ij − μ̂ − α̂_i^(1) − β̂_j^(1) − k̂^(1) α̂_i^(1) β̂_j^(1) )².
Figure 3.3: Dependence of the power on k and b for interaction type (B); b = 10 left, b = 50
right. Tukey test solid line, Mandel test dashed line, Johnson-Graybill test dotted line,
Tusell test long-dash line, LBI test dot-dash line, modified Tukey test two-dash line.
The difference (RSS(0) − RSS)/σ² is asymptotically χ² distributed with 1 degree of freedom.
The consistent estimate of the residual variance σ² is s² = RSS/(ab − a − b), and RSS/σ² is
approximately χ² distributed with ab − a − b degrees of freedom. Thus, using a linear
approximation of the nonlinear model (3.10), the statistic

    ( RSS(0) − RSS ) / ( RSS / (ab − a − b) )    (3.11)

is approximately F distributed with 1 and ab − a − b degrees of freedom. However, this
approximation may be inaccurate in small samples.
One possibility to overcome this obstacle is to bootstrap without replacement. Consider
the test statistic S = RSS(0) − RSS. Then generate N(boot) datasets by the model

    y_ij^(boot) = μ̂ + α̂_i^(0) + β̂_j^(0) + r_π(i,j),

where π is a random permutation of the indexes of the R matrix (3.5). For each dataset the
statistic of interest S^(boot) = RSS(0)^(boot) − RSS^(boot) is computed. The critical value of
the modified Tukey test is then the (1 − α)·100 % quantile of the generated S^(boot). The
number of generated samples N(boot) = 1000 seems to be sufficient in most cases.
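The permutation procedure can be sketched as follows (our illustration; for brevity it uses a simplified one-step update in which only k is re-estimated, whereas the procedure described above also iterates the row and column effects):

```python
import numpy as np


def one_step_estimates(y):
    """Simplified one-step fit of model (3.10): additive-model estimates for
    the effects, then a single least-squares update of k."""
    mu = y.mean()
    ai = y.mean(axis=1) - mu                  # alpha_i^(0)
    bj = y.mean(axis=0) - mu                  # beta_j^(0)
    resid = y - mu - ai[:, None] - bj[None, :]
    k = (resid * np.outer(ai, bj)).sum() / (np.outer(ai, bj) ** 2).sum()
    rss0 = (resid ** 2).sum()                 # RSS(0), additive model
    rss = ((resid - k * np.outer(ai, bj)) ** 2).sum()
    return k, rss0, rss


def modified_tukey_p(y, n_boot=500, seed=0):
    """Permutation ('bootstrap without replacement') p-value for the
    statistic S = RSS(0) - RSS."""
    rng = np.random.default_rng(seed)
    a, b = y.shape
    mu = y.mean()
    ai = y.mean(axis=1) - mu
    bj = y.mean(axis=0) - mu
    r = y - mu - ai[:, None] - bj[None, :]    # residual matrix (3.5)
    _, rss0, rss = one_step_estimates(y)
    s_obs = rss0 - rss
    exceed = 0
    for _ in range(n_boot):
        # rebuild a dataset with permuted residuals
        y_boot = mu + ai[:, None] + bj[None, :] \
            + rng.permutation(r.ravel()).reshape(a, b)
        _, b0, b1 = one_step_estimates(y_boot)
        exceed += (b0 - b1) >= s_obs
    return (exceed + 1) / (n_boot + 1)
```

The permutation destroys any multiplicative structure in the residuals while preserving their marginal distribution, so the generated S^(boot) values approximate the null distribution of S.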
The second possibility is to estimate the residual variance σ² of the random errors e_ij as
s² = RSS/(ab − a − b) and then generate N(sample) datasets using the model

    y_ij^(sample) = μ̂ + α̂_i^(0) + β̂_j^(0) + e_ij^(NEW),

where the e_ij^(NEW) are independent identically distributed random variables generated from
a normal distribution with zero mean and variance s². Because under the additivity hypothesis
the parameter k is equal to zero, the proposed test statistic is the absolute value of its
estimator, |k̂^(1)|. As in the bootstrap, for each of the N(sample) datasets the value of the
statistic is computed, and the additivity hypothesis is rejected if more than (1 − α)·100 % of
the sampled statistics lie below the statistic |k̂^(1)| based on the real data. The number of
generated samples N(sample) = 1000 seems to be sufficient in most cases.
To conclude, we have proposed a modification of the Tukey additivity test. The modified
Tukey test has power almost as good as the Tukey test when the interaction is a product
of the main effects, and it should be recommended whenever reasonable power is also required
for more general interaction schemes.
Bibliography

A. Alin and S. Kurt. Testing non-additivity (interaction) in two-way ANOVA tables with no
replication. Statistical Methods in Medical Research, 15:63-85, 2006.

M.S. Bartlett. Properties of sufficiency and statistical tests. Proceedings of the Royal Society
of London, Series A, 160:268-282, 1937.

R.J. Boik. A comparison of three invariant tests of additivity in two-way classifications with
no replications. Computational Statistics and Data Analysis, 15:411-424, 1993a.

R.J. Boik. Testing additivity in two-way classifications with no replications: the locally best
invariant test. Journal of Applied Statistics, 20:41-55, 1993b.

E. Brunner and U. Munzel. Nichtparametrische Datenanalyse - unverbundene Stichproben.
Springer, Berlin, 2002.

S. Chakraborti, B. Hong, and M.A. van de Wiel. A note on sample size determination for
a nonparametric test of location. Technometrics, 48:88-94, 2006.

M.N. Ghosh and D. Sharma. Power of Tukey's test for non-additivity. Journal of the Royal
Statistical Society, Series B, 25:213-219, 1963.

G. R. Grimmett and D. R. Stirzaker. Probability and Random Processes, 2nd Edition.
Clarendon Press, Oxford, 1992.

V. Hegeman and D.E. Johnson. The power of two tests for nonadditivity. Journal of the
American Statistical Association, 71:945-948, 1976.

D.E. Johnson and F.A. Graybill. An analysis of a two-way model with interaction and no
replication. Journal of the American Statistical Association, 67:862-868, 1972.

H. Kres. Statistical Tables for Multivariate Analysis. Springer, New York, 1972.

W.H. Kruskal and W.A. Wallis. Use of ranks in one-criterion variance analysis. Journal of
the American Statistical Association, 47:583-621, 1952.

E. L. Lehmann. Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, Inc., San
Francisco, 1975.

E. L. Lehmann. Testing Statistical Hypothesis. Springer-Verlag, New York, 2005.

M. Mahoney and R. Magel. Estimation of the power of the Kruskal-Wallis test. Biometrical
Journal, 38:613-630, 1996.

T. Rusch, M. Šimečková, K.D. Kubinger, K. Moder, P. Šimeček, and D. Rasch. Tests of
additivity in mixed and fixed effects two-way ANOVA models with single subclass numbers.
In Proceedings of The International Conference on Trends and Perspectives in Linear
Statistical Inference, LINSTAT 2008, Będlewo, Poland, April 21-25, 2008. Springer, special
issue of Statistical Papers, submitted.

H. Scheffé. The Analysis of Variance. John Wiley & Sons, Inc., New York, 1959.

P. Šimeček and M. Šimečková. Modification of Tukey's additivity test. Journal of Statistical
Planning and Inference, submitted.

M. Šimečková and D. Rasch. Additivity hypothesis in the mixed two-way ANOVA model.

M. Šimečková and D. Rasch. Sample size for the one-way layout with one fixed factor for
ordered categorical data. Journal of Statistical Theory and Practice, 2:109-123, 2008.

J.W. Tukey. One degree of freedom for non-additivity. Biometrics, 5:232-242, 1949.

F. Tusell. Testing for interaction in two-way ANOVA tables with no replication.
Computational Statistics and Data Analysis, 10:29-45, 1990.

F. Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1:80-83, 1945.