ANOVA Introd

Fernando Lpz
September 15, 2018

Problem 1.

Assign at random 13 experimental units to the v = 3 treatments, so that the first treatment is
assigned 3 units and the other two treatments are assigned 5 units.

## NOTE: As of emmeans versions > 1.2.3,
## The 'cld' function will be deprecated in favor of 'CLD'.
## You may use 'cld' only if you have package:multcomp attached.
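# Treatment labels: 3 units for treatment 1, 5 units each for treatments 2 and 3
treatments <- c(1, 1, 1, rep(2:3, each = 5))
# A random permutation of the labels assigns one treatment to each of units 1..13
treatment.assignments <- sample(treatments)
frm <- data.frame(rbind(1:13, treatment.assignments),
                  row.names = c("Unit", "Treatment Assignment"))
frm[2, ]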

##                      X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
## Treatment Assignment  3  3  2  3  2  2  2  3  1   2   1   3   1
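
The draw above changes on every run. A minimal sketch of a reproducible variant follows, assuming a fixed seed is acceptable for the exercise. The seed value 2018 and the names `labels` and `assignment` are illustrative choices, not part of the original solution. It also confirms that the group sizes come out as 3, 5 and 5.

```r
# Reproducible completely randomized assignment (the seed value is an arbitrary assumption)
set.seed(2018)

units <- 1:13
# Treatment labels: 3 copies of treatment 1, 5 each of treatments 2 and 3
labels <- rep(1:3, times = c(3, 5, 5))

# sample() returns a random permutation of the labels, one per unit
assignment <- data.frame(unit = units, treatment = sample(labels))

# Group sizes are fixed by construction, whatever the permutation
table(assignment$treatment)
```

Permuting a fixed vector of labels, rather than drawing a label independently for each unit, is what guarantees the required replication of each treatment.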

Problem 2.
For fixed constants $c_1, c_2, \ldots, c_v$ with $\sum_{i=1}^{v} c_i = 0$, show that

$$\sum_{i=1}^{v} c_i \bar{Y}_{i\bullet} \sim N\left( \sum_{i=1}^{v} c_i \tau_i,\ \sigma^2 \sum_{i=1}^{v} \frac{c_i^2}{r_i} \right).$$

Carefully justify all steps.

Recall the following results from probability theory:

• If $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$ are independent, then $X + Y \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.
• If $X = \lambda Y + \eta$, where $Y$ is a random variable and $\lambda, \eta \in \mathbb{R}$, then $E[X] = \lambda E[Y] + \eta$ and $Var(X) = \lambda^2 Var(Y)$.

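We can first find the distribution of $\bar{Y}_{i\bullet}$:

$$\bar{Y}_{i\bullet} = \frac{1}{r_i} \sum_{t=1}^{r_i} Y_{it} = \mu + \tau_i + \frac{1}{r_i} \sum_{t=1}^{r_i} \varepsilon_{it}.$$

Since $E[\varepsilon_{it}] = 0$ and

$$Var\left( \mu + \tau_i + \frac{1}{r_i} \sum_{t=1}^{r_i} \varepsilon_{it} \right) = \frac{1}{r_i^2} \sum_{t=1}^{r_i} \sigma^2 = \frac{\sigma^2}{r_i}$$

(provided that the $\varepsilon_{it} \sim N(0, \sigma^2)$ are mutually independent), then:

$$\bar{Y}_{i\bullet} \sim N\left( \mu + \tau_i,\ \frac{\sigma^2}{r_i} \right).$$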

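Multiplying by the scalars $c_i$ (a scalar multiple of a normal random variable is again normal), we have

$$c_i \bar{Y}_{i\bullet} \sim N\left( c_i [\mu + \tau_i],\ \frac{c_i^2}{r_i} \sigma^2 \right).$$

The treatment sample means $\bar{Y}_{1\bullet}, \ldots, \bar{Y}_{v\bullet}$ are computed from disjoint sets of mutually independent errors, so they are independent and the first result recalled above applies to their sum:

$$\sum_{i=1}^{v} c_i \bar{Y}_{i\bullet} \sim N\left( \mu \sum_{i=1}^{v} c_i + \sum_{i=1}^{v} c_i \tau_i,\ \sigma^2 \sum_{i=1}^{v} \frac{c_i^2}{r_i} \right).$$

Finally, since $\sum_{i=1}^{v} c_i = 0$:

$$\sum_{i=1}^{v} c_i \bar{Y}_{i\bullet} \sim N\left( \sum_{i=1}^{v} c_i \tau_i,\ \sigma^2 \sum_{i=1}^{v} \frac{c_i^2}{r_i} \right).$$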
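
The result can be checked numerically. The sketch below is a Monte Carlo illustration under assumed settings: the values of `mu`, `tau`, `r_vec`, `c_vec` and `sigma2` are hypothetical and chosen only for the demonstration. It simulates raw observations, forms the treatment means, and compares the simulated mean and variance of the contrast with the values stated in the result.

```r
# Hypothetical settings, for illustration only
set.seed(1)
mu     <- 10
tau    <- c(-1, 0, 1)
r_vec  <- c(3, 5, 5)     # replications per treatment
c_vec  <- c(1, 1, -2)    # contrast coefficients, summing to zero
sigma2 <- 4

# Mean and variance claimed by the result in Problem 2
theory_mean <- sum(c_vec * tau)
theory_var  <- sigma2 * sum(c_vec^2 / r_vec)

# Simulate the raw observations, form treatment means, then the contrast
one_contrast <- function() {
  ybar <- sapply(seq_along(r_vec), function(i) {
    mean(rnorm(r_vec[i], mean = mu + tau[i], sd = sqrt(sigma2)))
  })
  sum(c_vec * ybar)
}
sims <- replicate(20000, one_contrast())

c(theory_mean = theory_mean, sim_mean = mean(sims),
  theory_var = theory_var, sim_var = var(sims))
```

Note that `mu` drops out of the simulated mean, as expected when the coefficients sum to zero.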

Problem 3.

For the one-way analysis of variance model determine which of the following are estimable.
For those that are estimable, state the least square estimator.
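We know that in this model $f(\mu, \tau_1, \ldots, \tau_v)$ is estimable if and only if $f(\mu, \tau_1, \ldots, \tau_v) = \sum_{i=1}^{v} b_i (\mu + \tau_i)$ for some constants $b_1, \ldots, b_v$, in which case its least squares estimator is $\sum_{i=1}^{v} b_i \bar{Y}_{i\bullet}$.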
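(a) $\tau_1 + \tau_2 - 2\tau_3$
Let $b_1 = b_2 = 1$ and $b_3 = -2$. Then $\tau_1 + \tau_2 - 2\tau_3$ is estimable since

$$\sum_{i=1}^{3} b_i (\mu + \tau_i) = (\mu + \tau_1) + (\mu + \tau_2) - 2(\mu + \tau_3) = \tau_1 + \tau_2 - 2\tau_3.$$

The least squares estimator is $\bar{Y}_{1\bullet} + \bar{Y}_{2\bullet} - 2\bar{Y}_{3\bullet}$.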


(b) $\mu + \tau_3$
Let $b_1 = b_2 = 0$ and $b_3 = 1$. Then $\mu + \tau_3$ is estimable since

$$\sum_{i=1}^{3} b_i (\mu + \tau_i) = \mu + \tau_3.$$

The least squares estimator is $\bar{Y}_{3\bullet}$.


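(c) $\tau_1 - \tau_2 - \tau_3$
Suppose that, for arbitrary $\mu, \tau_1, \tau_2, \tau_3 \in \mathbb{R}$,

$$\tau_1 - \tau_2 - \tau_3 = \sum_{i=1}^{3} b_i (\mu + \tau_i) = \mu \sum_{i=1}^{3} b_i + \sum_{i=1}^{3} b_i \tau_i.$$

Since $\mu$ occurs on the right-hand side and does not occur on the left-hand side, we need $\sum_{i=1}^{3} b_i = 0$. Therefore:

$$\tau_1 - \tau_2 - \tau_3 = b_1 \tau_1 + b_2 \tau_2 + b_3 \tau_3$$

$$0 = (b_1 - 1)\tau_1 + (b_2 + 1)\tau_2 + (b_3 + 1)\tau_3$$

Since $\tau_1, \tau_2, \tau_3$ are arbitrary, $b_1 = 1$, $b_2 = -1$, $b_3 = -1$. However, these values give $\sum_{i=1}^{3} b_i = -1$, which contradicts $\sum_{i=1}^{3} b_i = 0$.
Therefore, $\tau_1 - \tau_2 - \tau_3$ is not estimable.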

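(d) $\mu + (\tau_1 + \tau_2 + \tau_3)/3$
Let $b_1 = b_2 = b_3 = \frac{1}{3}$. Then $\mu + (\tau_1 + \tau_2 + \tau_3)/3$ is estimable since

$$\sum_{i=1}^{3} b_i (\mu + \tau_i) = \frac{1}{3}(\mu + \tau_1) + \frac{1}{3}(\mu + \tau_2) + \frac{1}{3}(\mu + \tau_3) = \mu + \frac{\tau_1 + \tau_2 + \tau_3}{3}.$$

The least squares estimator is $\frac{1}{3}\bar{Y}_{1\bullet} + \frac{1}{3}\bar{Y}_{2\bullet} + \frac{1}{3}\bar{Y}_{3\bullet}$.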
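
The four answers can be cross-checked with a rank argument: a linear function of $(\mu, \tau_1, \tau_2, \tau_3)$ is estimable exactly when its coefficient vector lies in the row space of the design matrix. The sketch below is one way to test this numerically. The helper name `is_estimable` is hypothetical, and the numerical rank from `qr()` depends on its default tolerance, which is adequate for these small matrices.

```r
# Distinct rows of the one-way design matrix for v = 3.
# Columns correspond to (mu, tau1, tau2, tau3); row i encodes mu + tau_i.
X <- cbind(1, diag(3))

# A coefficient vector is estimable if appending it to X does not raise the rank
is_estimable <- function(coefs, X) {
  qr(rbind(X, coefs))$rank == qr(X)$rank
}

is_estimable(c(0, 1, 1, -2), X)          # (a) tau1 + tau2 - 2 tau3
is_estimable(c(1, 0, 0, 1), X)           # (b) mu + tau3
is_estimable(c(0, 1, -1, -1), X)         # (c) tau1 - tau2 - tau3
is_estimable(c(1, 1/3, 1/3, 1/3), X)     # (d) mu + (tau1 + tau2 + tau3) / 3
```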

Problem 4.

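To find the least squares estimator of $\mu + \tau$ we need to minimize

$$f(\mu, \tau) = \sum_{i=1}^{v} \sum_{t=1}^{r_i} \left( y_{it} - (\mu + \tau) \right)^2 = \sum_{i=1}^{v} \sum_{t=1}^{r_i} (\varepsilon^0_{it})^2.$$

We can take partial derivatives with respect to $\mu$ and $\tau$, set them equal to zero, and solve. Writing $n = \sum_{i=1}^{v} r_i$,

$$\frac{\partial f}{\partial \mu} = \frac{\partial f}{\partial \tau} = -2 \sum_{i=1}^{v} \sum_{t=1}^{r_i} \left( y_{it} - (\mu + \tau) \right) = -2 \sum_{i=1}^{v} \sum_{t=1}^{r_i} y_{it} + 2n(\mu + \tau).$$

Setting this result equal to zero, we get

$$0 = -\sum_{i=1}^{v} \sum_{t=1}^{r_i} y_{it} + n(\mu + \tau).$$

Although we have only one equation for two parameters $(\mu, \tau)$, we can solve for the treatment mean $\mu + \tau$ and get:

$$\hat{\mu} + \hat{\tau} = \frac{1}{n} \sum_{i=1}^{v} \sum_{t=1}^{r_i} y_{it} = \bar{y}_{\bullet\bullet}.$$

Therefore, the least squares estimator is $\bar{Y}_{\bullet\bullet}$.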
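
As a numerical illustration, an intercept-only least squares fit should return the grand mean, because the intercept plays the role of the single identifiable quantity $\mu + \tau$. The observations below are hypothetical values made up for the sketch. They are not data from the exercise.

```r
# Hypothetical observations pooled over all treatments (not data from the exercise)
y <- c(4.1, 5.3, 3.8, 6.0, 5.2, 4.7)

# Least squares criterion as a function of m = mu + tau
ss <- function(m) sum((y - m)^2)

# Intercept-only fit: the fitted intercept estimates mu + tau
fit <- lm(y ~ 1)
m_hat <- unname(coef(fit))

c(m_hat = m_hat, grand_mean = mean(y))

# The criterion is larger on either side of the grand mean
c(below = ss(mean(y) - 0.1), at_mean = ss(mean(y)), above = ss(mean(y) + 0.1))
```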

Problem 5.

For the model in the previous exercise, find an unbiased estimator for σ 2 .

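Define

$$SSE_0 = \sum_{i=1}^{v} \sum_{t=1}^{r_i} Y_{it}^2 - n \bar{Y}_{\bullet\bullet}^2.$$

To calculate $E[SSE_0]$ we can first compute $E\left[ n \bar{Y}_{\bullet\bullet}^2 \right]$, using that $\sum_{i=1}^{v} \sum_{t=1}^{r_i} \varepsilon^0_{it} \sim N(0, n\sigma^2)$:

$$\begin{aligned}
E\left[ n \bar{Y}_{\bullet\bullet}^2 \right] &= \frac{1}{n} E\left[ \left( \sum_{i=1}^{v} \sum_{t=1}^{r_i} Y_{it} \right)^2 \right] \\
&= \frac{1}{n} E\left[ n^2 (\mu + \tau)^2 + 2n(\mu + \tau) \sum_{i=1}^{v} \sum_{t=1}^{r_i} \varepsilon^0_{it} + \left( \sum_{i=1}^{v} \sum_{t=1}^{r_i} \varepsilon^0_{it} \right)^2 \right] \\
&= n(\mu + \tau)^2 + \sigma^2.
\end{aligned}$$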

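Next we compute $E\left[ \sum_{i=1}^{v} \sum_{t=1}^{r_i} Y_{it}^2 \right]$:

$$\begin{aligned}
E\left[ \sum_{i=1}^{v} \sum_{t=1}^{r_i} Y_{it}^2 \right] &= E\left[ \sum_{i=1}^{v} \sum_{t=1}^{r_i} (\mu + \tau + \varepsilon^0_{it})^2 \right] \\
&= E\left[ n(\mu + \tau)^2 + 2(\mu + \tau) \sum_{i=1}^{v} \sum_{t=1}^{r_i} \varepsilon^0_{it} + \sum_{i=1}^{v} \sum_{t=1}^{r_i} (\varepsilon^0_{it})^2 \right] \\
&= n(\mu + \tau)^2 + \sum_{i=1}^{v} \sum_{t=1}^{r_i} E\left[ (\varepsilon^0_{it})^2 \right] = n\left[ (\mu + \tau)^2 + \sigma^2 \right].
\end{aligned}$$

Hence, $E[SSE_0] = E\left[ \sum_{i=1}^{v} \sum_{t=1}^{r_i} Y_{it}^2 \right] - E\left[ n \bar{Y}_{\bullet\bullet}^2 \right] = (n - 1)\sigma^2$. An unbiased estimator for $\sigma^2$ is then

$$\frac{1}{n-1} SSE_0 = \frac{1}{n-1} \sum_{i=1}^{v} \sum_{t=1}^{r_i} Y_{it}^2 - \frac{n}{n-1} \bar{Y}_{\bullet\bullet}^2.$$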
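
This estimator is the ordinary sample variance of the pooled observations, which gives a quick check in R. The sketch below uses hypothetical values made up for illustration, and `sse0` is a helper name introduced here.

```r
# SSE0 as defined above: total sum of squares minus n times the squared grand mean
sse0 <- function(y) sum(y^2) - length(y) * mean(y)^2

# Hypothetical pooled observations, for illustration only
y <- c(4.1, 5.3, 3.8, 6.0, 5.2, 4.7)
n <- length(y)

sigma2_hat <- sse0(y) / (n - 1)

# var() uses the same n - 1 divisor, so the two values agree
c(sigma2_hat = sigma2_hat, sample_variance = var(y))
```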

Problem 6.

(a) and (b). Plot inflation time versus color and comment on the results; estimate the mean
inflation time for each balloon color, and add these estimates to the previous plot.
[Figure: inflation time (in seconds) by balloon color (1 to 4), with the estimated mean for each color overlaid.]
Note that in the previous plot the means for each color are represented as empty circles.
We observe from the previous plot that balloons of colors 2 and 3 seem to take slightly more time to
inflate than the other two. However, this is not clear-cut because of the variability of the data. For example, the ranges
of the data for color 3 and color 4 are similar even though their medians and means differ.
## least_sq_est
## Color1 18.3375
## Color2 22.5750
## Color3 21.8750
## Color4 18.1875

(c) Construct an analysis of variance table and test the hypothesis that color has no effect on
inflation time.

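We will consider the model:

$$Y_{it} = \mu + \tau_i + \varepsilon_{it}, \qquad \varepsilon_{it} \overset{iid}{\sim} N(0, \sigma^2), \qquad 1 \le t \le 8,\ 1 \le i \le 4,$$

where $Y_{it}$ stands for the $t$-th observation on treatment $i$, and $\mu + \tau_i$ is the mean response for the $i$-th
treatment.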
We will test at the $\alpha = 0.05$ level the hypothesis

$$H_0: \tau_1 = \tau_2 = \tau_3 = \tau_4 \quad \text{vs.} \quad H_A: \tau_i \ne \tau_j \text{ for some } (i, j),\ 1 \le i, j \le 4.$$

To test it, the corresponding ANOVA table is shown.


model1 <- aov(Time ~ Color.Code,data = balloons)
anova(model1)

## Analysis of Variance Table
##
## Response: Time
##            Df Sum Sq Mean Sq F value  Pr(>F)
## Color.Code  3 127.66  42.554  3.9379 0.01836 *
## Residuals  28 302.58  10.806
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
According to the previous result (p-value 0.018 < 0.05), at the α = 0.05 level of significance we reject H0 in favor of HA, concluding
that there is enough evidence that color has some effect on inflation time.
In fact, as shown in the following least squares estimates of the treatment means, the data suggest that yellow
(2) and orange (3) balloons take more time to inflate than pink (1) and blue (4) ones. Note, however, that the four 95%
confidence intervals all overlap on (20.19, 20.57).
emmeans(object = model1 , specs = "Color.Code")

##  Color.Code  emmean       SE df lower.CL upper.CL
##  1          18.3375 1.162236 28 15.95677 20.71823
##  2          22.5750 1.162236 28 20.19427 24.95573
##  3          21.8750 1.162236 28 19.49427 24.25573
##  4          18.1875 1.162236 28 15.80677 20.56823
##
## Confidence level used: 0.95
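
The intervals in this table can be reproduced by hand from the ANOVA output, which makes the ingredients explicit: the standard error is $\sqrt{MSE/r}$ with $r = 8$ observations per color, and the critical value is a $t$ quantile on 28 degrees of freedom. The sketch below uses only numbers printed above, with variable names introduced for illustration. Small differences in the last digits are expected because the printed mean square is rounded.

```r
# Values read from the ANOVA table and the estimates printed above
mse    <- 10.806   # mean square for error
df_err <- 28       # error degrees of freedom
r      <- 8        # observations per color

means <- c(Color1 = 18.3375, Color2 = 22.5750, Color3 = 21.8750, Color4 = 18.1875)

se     <- sqrt(mse / r)        # standard error of each treatment mean
t_crit <- qt(0.975, df_err)    # two-sided 95% critical value

ci <- cbind(lower = means - t_crit * se, upper = means + t_crit * se)
round(ci, 3)

# Range common to all four intervals
c(max(ci[, "lower"]), min(ci[, "upper"]))
```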
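Finally, we can provide an estimate of $\sigma^2$ and its corresponding 95% upper confidence limit:

$$\hat{\sigma}^2 = \frac{ssE}{28} = 10.806.$$

Recall that $\hat{\sigma}^2 = MSE \sim \frac{\sigma^2}{28} \chi^2_{28}$. Therefore we can use as a pivot the statistic

$$\frac{28 \hat{\sigma}^2}{\sigma^2} \sim \chi^2_{28},$$

so that

$$P\left[ \frac{28 \hat{\sigma}^2}{\sigma^2} \ge x_{0.95} \right] = P\left[ \sigma^2 \le \frac{28 \hat{\sigma}^2}{x_{0.95}} \right] = 0.95,$$

where $x_\beta$ is such that $P[X \ge x_\beta] = \beta$ for $X \sim \chi^2_{28}$. Therefore a 95% upper confidence limit for $\sigma^2$ is given by

$$\frac{28 \hat{\sigma}^2}{x_{0.95}} = \frac{28 \cdot 10.806}{16.928} = 17.874.$$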
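
The same limit can be computed directly with `qchisq()`. In the sketch below, `sigma2_upper` is a hypothetical helper name. The point $x_{0.95}$, which is exceeded with probability 0.95, is the lower 5% quantile of the chi-square distribution.

```r
# One-sided upper confidence limit for sigma^2, based on SSE / sigma^2 ~ chi-square(df)
sigma2_upper <- function(sse, df, level = 0.95) {
  # qchisq(1 - level, df) is the point exceeded with probability `level`
  sse / qchisq(1 - level, df)
}

# Balloon experiment: SSE = 28 * MSE with MSE = 10.806 on 28 degrees of freedom
sigma2_upper(sse = 28 * 10.806, df = 28)
```

The helper applies equally to any one-way layout once the error sum of squares and its degrees of freedom are known.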

(d) and (e) Plot the data for each color in the order that it was collected. Are you concerned
that the assumptions on the model are not satisfied? If so, why? If not, why not? Is the
analysis in part (c) satisfactory?
[Figure: inflation time (in seconds) versus the order in which the observations were collected.]
The previous plot shows that the order in which data was collected may have an effect over the dependent
variable (it seems that the first balloons may be harder to inflate). This possibility, if true, would compromise
our model’s assumption that the εit ’s are identically distributed, since they would also depend on the order in
which balloons are inflated. However, the use of a completely randomized design should reduce the impact of
the effect of the order of collection of data, if any. In fact, the following boxplot suggests that the assignment
of units (in terms of the order of inflation) was equally likely regardless of the balloons’ colors. Although we
should be aware of the effect that the order of collection of data may possibly have on the results, we can
conclude that our analysis in part (c) was satisfactory.

[Figure: boxplots of collection order by balloon color (1 to 4).]

Problem 7.

(a) Plot the data and comment on the results.


[Figure: hemoglobin (grams per 100 ml) by trough (1 to 4), with empty circles marking a value for each trough.]
We observe from the previous plot that the results in the last three troughs seem to be greater than the
results from trough 1. Based on the graph, results from the last three treatments look similar, although the
second one produced a few results more extreme than those from the last two troughs.

(b) Write down a suitable model for this experiment, assuming trough effects are negligible.

We will consider the model:

$$Y_{it} = \mu + \tau_i + \varepsilon_{it}, \qquad \varepsilon_{it} \overset{iid}{\sim} N(0, \sigma^2), \qquad 1 \le t \le 10,\ 1 \le i \le 4,$$

where $Y_{it}$ stands for the $t$-th observation on treatment $i$ (i.e. the $i$-th trough), and $\mu + \tau_i$ is the mean response
for the $i$-th treatment.

(c) Calculate the least squares estimate of the mean response for each treatment. Show these
estimates on the plot obtained in (a). Can you draw any conclusions from these estimates?

least_sq_est

## least_sq_est
## Trough1 7.20
## Trough2 9.33
## Trough3 9.03
## Trough4 8.69
These values suggest that the mean response of treatment 1 is lower than the rest, while treatments 2 and
3 have the greatest mean responses (the former having a slightly larger mean than the latter). The differences
among treatments 2, 3, and 4, however, do not look as large as the differences between treatment 1 and
any of the rest. Nevertheless, to know whether some of those differences are significant at a given level, a
formal test is needed.

(d) Test the hypothesis that sulfamerazine has no effect on the hemoglobin content of trout
blood.

We will test at the $\alpha = 0.05$ level the hypothesis

$$H_0: \tau_1 = \tau_2 = \tau_3 = \tau_4 \quad \text{vs.} \quad H_A: \tau_i \ne \tau_j \text{ for some } (i, j),\ 1 \le i, j \le 4.$$

To test it, the corresponding ANOVA table is shown.


model1 <- aov(hemoglobin ~ trough,data = fish)
anova(model1)

## Analysis of Variance Table
##
## Response: hemoglobin
##           Df Sum Sq Mean Sq F value   Pr(>F)
## trough     3 26.803  8.9343  5.6955 0.002685 **
## Residuals 36 56.471  1.5686
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emmeans(object = model1 , specs = "trough")

## trough emmean SE df lower.CL upper.CL
## 1 7.20 0.3960605 36 6.396752 8.003248
## 2 9.33 0.3960605 36 8.526752 10.133248
## 3 9.03 0.3960605 36 8.226752 9.833248
## 4 8.69 0.3960605 36 7.886752 9.493248
##
## Confidence level used: 0.95
The p-value of this test was 0.0027 < 0.05. According to this result, at the α = 0.05 level of significance we
reject H0 in favor of HA, concluding that there is enough evidence to claim that sulfamerazine has some
effect on the hemoglobin content of trout blood. Furthermore, the data suggest that the sulfamerazine concentrations
in treatments 2 and 3 (5 and 10 g of sulfamerazine per 100 pounds of fish, respectively) are associated with
higher hemoglobin content of trout blood, while the concentration of 0 g of sulfamerazine per 100 pounds of fish
resulted in the lowest hemoglobin content in this experiment.
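
The test statistic and p-value quoted above follow from the sums of squares and degrees of freedom in the ANOVA table. The sketch below recomputes them with base R, using variable names introduced here for illustration. It also shows the critical value that the observed ratio must exceed for rejection at the 0.05 level.

```r
# Sums of squares and degrees of freedom read from the ANOVA table above
ss_trt <- 26.803; df_trt <- 3
ss_err <- 56.471; df_err <- 36

ms_trt <- ss_trt / df_trt      # mean square for treatments
ms_err <- ss_err / df_err      # mean square for error
f_stat <- ms_trt / ms_err

# Upper-tail probability of the F(3, 36) distribution beyond the observed ratio
p_value <- pf(f_stat, df_trt, df_err, lower.tail = FALSE)

# Rejection threshold at the 0.05 level
f_crit <- qf(0.95, df_trt, df_err)

c(F = f_stat, p = p_value, F_crit = f_crit)
```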

(e) Calculate a 95% upper confidence limit for σ 2 .

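We can provide an estimate of $\sigma^2$ and its corresponding 95% upper confidence limit:

$$\hat{\sigma}^2 = \frac{ssE}{36} = 1.5686.$$

Recall that $\hat{\sigma}^2 = MSE \sim \frac{\sigma^2}{36} \chi^2_{36}$. Therefore we can use as a pivot the statistic

$$\frac{36 \hat{\sigma}^2}{\sigma^2} \sim \chi^2_{36},$$

so that

$$P\left[ \frac{36 \hat{\sigma}^2}{\sigma^2} \ge x_{0.95} \right] = P\left[ \sigma^2 \le \frac{36 \hat{\sigma}^2}{x_{0.95}} \right] = 0.95,$$

where $x_\beta$ is such that $P[X \ge x_\beta] = \beta$ for $X \sim \chi^2_{36}$. Therefore a 95% upper confidence limit for $\sigma^2$ is given by

$$\frac{36 \hat{\sigma}^2}{x_{0.95}} = \frac{36 \cdot 1.5686}{23.269} = 2.427.$$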

Problem 8.

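First note that

$$\sum_{i=1}^{v} r_i \bar{Y}_{\bullet\bullet} = n \bar{Y}_{\bullet\bullet}, \quad \text{and} \quad 2 \sum_{i=1}^{v} r_i \bar{Y}_{i\bullet} = 2 \sum_{i=1}^{v} \sum_{t=1}^{r_i} Y_{it} = 2n \bar{Y}_{\bullet\bullet}.$$

Then

$$\begin{aligned}
E[SST] &= E\left[ \sum_{i=1}^{v} r_i (\bar{Y}_{i\bullet} - \bar{Y}_{\bullet\bullet})^2 \right] \\
&= \sum_{i=1}^{v} r_i E\left[ \bar{Y}_{i\bullet}^2 \right] + E\left[ \bar{Y}_{\bullet\bullet} \sum_{i=1}^{v} r_i (\bar{Y}_{\bullet\bullet} - 2\bar{Y}_{i\bullet}) \right] \\
&= \sum_{i=1}^{v} r_i E\left[ \bar{Y}_{i\bullet}^2 \right] - n E\left[ \bar{Y}_{\bullet\bullet}^2 \right].
\end{aligned}$$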

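Next we can use the identity $E[X^2] = Var(X) + E[X]^2$ and that $Y_{it} \sim N(\mu + \tau_i, \sigma^2)$ implies that

$$\bar{Y}_{i\bullet} = \frac{1}{r_i} \sum_{t=1}^{r_i} Y_{it} \sim N\left( \mu + \tau_i,\ \sigma^2 / r_i \right),$$

$$\bar{Y}_{\bullet\bullet} = \frac{1}{n} \sum_{i=1}^{v} \sum_{t=1}^{r_i} Y_{it} \sim N\left( \mu + \sum_{i=1}^{v} r_i \tau_i / n,\ \sigma^2 / n \right).$$

Hence,

$$\begin{aligned}
E[SST] &= \sum_{i=1}^{v} r_i \left( Var(\bar{Y}_{i\bullet}) + E[\bar{Y}_{i\bullet}]^2 \right) - n \left( Var(\bar{Y}_{\bullet\bullet}) + E[\bar{Y}_{\bullet\bullet}]^2 \right) \\
&= \sum_{i=1}^{v} r_i \left( \sigma^2 / r_i + (\mu + \tau_i)^2 \right) - n \left( \sigma^2 / n + \Big( \mu + \sum_{i=1}^{v} \frac{r_i \tau_i}{n} \Big)^2 \right) \\
&= v\sigma^2 + n\mu^2 + 2\mu \sum_{i=1}^{v} r_i \tau_i + \sum_{i=1}^{v} r_i \tau_i^2 - \sigma^2 - \mu^2 n - 2\mu \sum_{i=1}^{v} r_i \tau_i - n \Big( \sum_{i=1}^{v} r_i \tau_i / n \Big)^2 \\
&= (v - 1)\sigma^2 + \sum_{i=1}^{v} r_i \tau_i^2 - n \Big( \sum_{i=1}^{v} r_i \tau_i / n \Big)^2 \\
&= (v - 1)\sigma^2 + \sum_{i=1}^{v} r_i \tau_i^2 - 2 \Big( \sum_{i=1}^{v} r_i \tau_i \Big)^2 / n + n \Big( \sum_{h=1}^{v} r_h \tau_h / n \Big)^2 \\
&= (v - 1)\sigma^2 + \sum_{i=1}^{v} r_i \tau_i^2 - 2 \Big( \sum_{i=1}^{v} r_i \tau_i \Big) \Big( \sum_{h=1}^{v} r_h \tau_h \Big) / n + \Big( \sum_{i=1}^{v} r_i \Big) \Big( \sum_{h=1}^{v} r_h \tau_h / n \Big)^2 \\
&= (v - 1)\sigma^2 + \sum_{i=1}^{v} r_i \left( \tau_i^2 - 2\tau_i \Big( \sum_{h=1}^{v} r_h \tau_h \Big) / n + \Big( \sum_{h=1}^{v} r_h \tau_h / n \Big)^2 \right) \\
&= (v - 1)\sigma^2 + \sum_{i=1}^{v} r_i \Big( \tau_i - \sum_{h=1}^{v} r_h \tau_h / n \Big)^2.
\end{aligned}$$

And finally,

$$E[MST] = \frac{1}{v - 1} E[SST] = \sigma^2 + \frac{1}{v - 1} \sum_{i=1}^{v} r_i \left( \tau_i - \frac{1}{n} \sum_{h=1}^{v} r_h \tau_h \right)^2.$$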
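
The final formula is easy to evaluate for given parameter values, which helps to see when MST estimates $\sigma^2$ alone. The sketch below codes it directly. The function name `expected_mst` and the parameter values are hypothetical, chosen for illustration.

```r
# E[MST] = sigma^2 + sum_i r_i (tau_i - weighted mean of tau)^2 / (v - 1)
expected_mst <- function(tau, r, sigma2) {
  stopifnot(length(tau) == length(r))
  n <- sum(r)
  v <- length(tau)
  tau_bar_w <- sum(r * tau) / n    # replication-weighted mean of the tau_i
  sigma2 + sum(r * (tau - tau_bar_w)^2) / (v - 1)
}

# Equal treatment effects: the second term vanishes and E[MST] = sigma^2
expected_mst(tau = c(0, 0, 0), r = c(3, 5, 5), sigma2 = 2)

# Unequal effects inflate E[MST] above sigma^2
expected_mst(tau = c(1, 2, 3), r = c(2, 2, 2), sigma2 = 1)
```

Adding the same constant to every $\tau_i$ leaves the value unchanged, since only deviations from the weighted mean enter the formula.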

Problem 9.

(b) Calculate the sample sizes needed for an analysis of variance test with α = 0.05 to have
power 0.95 if: (i) ∆ = 1.5; (ii) ∆ = 1.0; (iii) ∆ = 2.0.

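library(pwr)       # provides pwr.anova.test()

alpha   <- 0.05    # significance level
Delta   <- 1.5     # case (i)
power   <- 0.95    # target power
sigma.2 <- 2       # value used for the error variance
v       <- 4       # number of treatments

# Effect size passed to pwr: f = sqrt(Delta^2 / (2 * v * sigma^2))
pwr.anova.test(k = v, sig.level = alpha, power = power, f = sqrt(Delta^2 / (2 * v * sigma.2)))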

##
##      Balanced one-way analysis of variance power calculation
##
##               k = 4
##               n = 31.5215
##               f = 0.375
##       sig.level = 0.05
##           power = 0.95
##
## NOTE: n is number in each group
For (i) we need at least 32 replicates per treatment.
Delta = 1
pwr.anova.test(k = v, sig.level = alpha , power = power , f = sqrt(Delta^2/(2*v*sigma.2)))

##
## Balanced one-way analysis of variance power calculation
##
## k = 4
## n = 69.66559
## f = 0.25
## sig.level = 0.05
## power = 0.95
##
## NOTE: n is number in each group
For (ii) we need at least 70 replicates per treatment.
Delta = 2
pwr.anova.test(k = v, sig.level = alpha , power = power , f = sqrt(Delta^2/(2*v*sigma.2)))

##
## Balanced one-way analysis of variance power calculation
##
## k = 4
## n = 18.18244
## f = 0.5
## sig.level = 0.05
## power = 0.95
##
## NOTE: n is number in each group
For (iii) we need at least 19 replicates per treatment.
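
The same sample sizes can be found without the pwr package by searching over the number of replicates with the noncentral F distribution in base R. This is a sketch under the assumptions used above: equal replication $r$ per treatment and noncentrality parameter $r\Delta^2/(2\sigma^2)$, which matches $k \cdot n \cdot f^2$ for the effect size $f$ passed to `pwr.anova.test`. The function name `min_replicates` and the search bound `r_max` are hypothetical.

```r
# Smallest number of replicates per treatment giving at least the target power
min_replicates <- function(Delta, sigma2, v, alpha = 0.05, target = 0.95, r_max = 1000) {
  for (r in 2:r_max) {
    df1 <- v - 1
    df2 <- v * (r - 1)                  # error degrees of freedom
    ncp <- r * Delta^2 / (2 * sigma2)   # noncentrality parameter
    crit <- qf(1 - alpha, df1, df2)     # rejection threshold under H0
    pow <- pf(crit, df1, df2, ncp = ncp, lower.tail = FALSE)
    if (pow >= target) return(r)
  }
  NA_integer_                           # target not reached within r_max
}

# Cases (i), (ii), (iii) with sigma^2 = 2 and v = 4
sapply(c(1.5, 1.0, 2.0), min_replicates, sigma2 = 2, v = 4)
```

Searching over integers returns the required whole number of replicates directly, so no rounding of a fractional `n` is needed.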
