
Goal of Analysis of Variance

The Formal ANOVA Model


Explanation by Example
Multiple Comparisons
Assumptions

Analysis of Variance and Contrasts

Ken Kelley’s Class Notes

Lesson Breakdown by Topic

1 Goal of Analysis of Variance
    A Conceptual Example Appropriate for ANOVA
    Example F-Test for Independent Variances
    Conceptual Underpinnings of ANOVA
    Mean Squares
2 The Formal ANOVA Model
    A Worked Example
3 Explanation by Example
    Example: Weight Loss Drink
    ANOVA Using SPSS
4 Multiple Comparisons
    Why Multiplicity Matters
    Error Rates
    Linear Combinations of Means
    Controlling the Type I Error
5 Assumptions
    Assumptions of the ANOVA
    What You Learned
    Notations

What You Will Learn from this Lesson

You will learn:


How to compare more than two independent means to assess if
there are any differences via an analysis of variance (ANOVA).
How the total sum of squares for the data can be decomposed
into a part that is due to the mean differences between groups
and a part that is due to within-group differences.
Why doing multiple t-tests is not the same thing as ANOVA.
Why doing multiple t-tests leads to a multiplicity issue, in that
as the number of tests increases, so too does the probability of
one or more errors.
How to correct for the multiplicity issue so that a set of
contrasts/comparisons has a Type I error rate for the collection
of tests at the desired (e.g., .05) level.
How to use SPSS and R to implement an ANOVA and
follow-up tests.
Motivation

When looking at different allergy medicines, there are numerous
options. So how can it be determined which brand will work best
when they all claim to do so?
Data could be collected on the outcomes from each product
among numerous individuals randomly assigned to different brands.
An ANOVA could be run to infer if there is a performance
difference between these different brands.
If there are no significant results, evidence would not exist to
suggest there are differences in performance among the brands.
If there are significant results, we would infer that the brands
do not perform the same, but further tests would have to be
conducted so as to infer where the differences are.

Goal of Analysis of Variance

The goal of ANOVA is to detect if mean differences exist
among m groups.
Recall that the independent-groups t-test is designed to detect
differences between two independent groups.
The t-test is a special case of ANOVA when m = 2 (t²_df equals
the F(1,df) from ANOVA for two groups).

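Though the lesson's software examples use SPSS and R, this equivalence is easy to check numerically. Below is a short Python sketch (the two groups' values are made up for illustration) showing that the squared t-statistic from an equal-variance independent-groups t-test matches the F-statistic from a one-way ANOVA on the same two groups:

```python
# Illustrative sketch (not from the notes): for m = 2 groups, t^2 == F.
from scipy import stats

group1 = [6, 4, 5, 7, 3, 6]   # hypothetical scores, group 1
group2 = [8, 9, 7, 10, 8, 9]  # hypothetical scores, group 2

t, _ = stats.ttest_ind(group1, group2)  # pooled-variance t-test
F, _ = stats.f_oneway(group1, group2)   # one-way ANOVA

print(round(t**2, 6), round(F, 6))      # the two values agree: t^2 == F
```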
Obtaining a statistically significant result for ANOVA conveys
that not all groups have the same population mean.
However, a statistically significant ANOVA with more than
two groups does not convey where those differences exist.
Follow-up tests (contrasts/comparisons) can be conducted to
help discern specifically where group means differ.

Consumer Preference

Consider the overall perception of how consumers regard
different companies.
An experiment was done in which 30 individuals were
randomly assigned to one of three groups.
All participants saw (almost) the same commercial advertising
a new Android smart phone.
The difference between the groups was that the commercial
attributed the phone to either (a) Nokia, (b) Samsung, or (c)
Motorola.
Of interest is whether consumers tend to rate the brands
differently, even for the “same” cell phone.
What are other examples in which ANOVA would be useful?

Consider the null hypothesis of equal variances:

H₀: σ₁² = σ₂².

The F-statistic is used to evaluate the above null hypothesis,
and is defined as the ratio of two independent variances:

F(df₁,df₂) = s₁² / s₂²,

where df₁ and df₂ are the degrees of freedom for s₁² and s₂²,
respectively.
Notice that F cannot be negative and is unbounded on the high
side.
The F-distribution is positively skewed.
Examples

We have previously asked questions about mean differences,
but the F-distribution allows us to ask questions about variability.
Is the variability of user satisfaction among Gmail users
different from that among Outlook.com users?
Does Mars have “better control” of M&M’s production
(i.e., smaller variance) than Wrigley has of Skittles?
For a given item, are Wal-Mart prices across the country more
stable than Kroger’s (for like items)?
Does a particular machine (or location/worker/shift) produce
more variable products than a counterpart?

The standard deviation of Gmail user satisfaction was 6.35,
based on a sample size of 55.
The standard deviation of Outlook.com user satisfaction was
8.90, based on a sample size of 42.
For an F-test of this sort addressing any difference in the
variances (e.g., is there more variability in user satisfaction in
one group), there are two critical values, one at the α/2 value
and one at the 1 − α/2 value.
The critical values are ____ and ____ for the .025 and
.975 quantiles (i.e., when α = .05).
The F-statistic for the test of the null hypothesis is

F = 6.35² / 8.90² = 40.3225 / 79.21 = .509.

The conclusion is: ____.

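As a sketch of how the values above can be computed (Python with scipy assumed available; the critical values come from the F quantile function, and the fill-in blanks in the slide are left for the reader):

```python
# Sketch of the variance-ratio F-test from the slide.
from scipy import stats

s1, n1 = 6.35, 55   # Gmail: SD and sample size
s2, n2 = 8.90, 42   # Outlook.com: SD and sample size
df1, df2 = n1 - 1, n2 - 1

F = s1**2 / s2**2                      # 40.3225 / 79.21 ≈ .509
lower = stats.f.ppf(0.025, df1, df2)   # .025-quantile critical value
upper = stats.f.ppf(0.975, df1, df2)   # .975-quantile critical value

print(round(F, 3), round(lower, 3), round(upper, 3))
# Reject H0 only if F falls outside [lower, upper].
```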

Thus far, we have talked only about the idea of comparing
two variances.
But what does this have to do with comparing means, which
is the question we are interested in addressing?

Analysis of variance (ANOVA) considers two variances:
one variance is the variance of the group means;
another variance is the (weighted) mean of the within-group
variances (recall s_p² from the two-group t-test).
We thus consider the variability of the group means to assess
if the population group means differ.

Conceptual Underpinnings of ANOVA

The null hypothesis in an ANOVA context is that all of the
group means are the same: µ₁ = µ₂ = … = µ_m = µ,
where m is the total number of groups.
When the null hypothesis is true, we can estimate the
variance of the scores with two methods, both of which are
independent of one another.

If the ratio of variances (i.e., the F-test) is so much larger than 1
that it seems unreasonable to have happened by chance alone,
then the null hypothesis can be rejected.
Of course, “so much larger than 1 that it seems unreasonable”
is defined in terms of the p-value (compared to α).
If the p-value is smaller than α, the null hypothesis of equal
population means is rejected.
The variance of the scores can be calculated from within each
group and then pooled across the groups (in exactly the same
manner as was done for the independent-groups t-test).

Mean Square Within

Recall that the best way to arrive at a pooled within-group
variance is to calculate a weighted mean of the variances:

s²_Pooled = [Σ_{j=1}^{m} (n_j − 1)s_j²] / [Σ_{j=1}^{m} n_j − m] = [Σ_{j=1}^{m} SS_j] / (N − m) = s²_Within = MS_Within,

where SS is sum of squares, MS is mean square (i.e., a
variance), m is the number of groups, n_j is the sample size in
the jth group (j = 1, …, m), and N is the total sample size
(N = Σ_{j=1}^{m} n_j).

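A minimal Python sketch of this pooling, using the within-group sums of squares from the phone-ratings worked example that appears later in the notes:

```python
# Minimal sketch of MS_Within as pooled within-group variation.
ss_j = [28.5, 82.0, 104.0]  # within-group sum of squares, SS_j, per group
ns = [10, 10, 10]           # group sample sizes, n_j
m, N = len(ns), sum(ns)     # number of groups and total sample size

ms_within = sum(ss_j) / (N - m)  # 214.5 / 27
print(round(ms_within, 2))       # 7.94
```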
In the special case where n₁ = n₂ = … = n_m = n, the
equation for the pooled variance reduces to:

[Σ_{j=1}^{m} s_j²] / m = s²_Within = MS_Within.

Notice that the degrees of freedom here are N − m.
The degrees of freedom are N − m because there are N
independent observations yet m sample means estimated.

Mean Square Between

Recall from the single-group situation that the variance of the
mean is equal to the variance of the scores divided by the
sample size (i.e., s²_Ȳj = s²_Yj / n_j).
That is, the variance of the sample means is the variance of
the scores divided by the sample size.

However, under the null hypothesis, we can calculate the
variance of the sample means directly by using the m means
as if they were individual scores.
Then, an estimate of the variance of the scores could be
obtained by multiplying the variance of the means by the
sample size (s²_Between = n·s²_Ȳ).
If the F-statistic is statistically significant, the conclusion is
that the variance of the means is larger than it should have
been if, in fact, the null hypothesis were true.
Notice that the degrees of freedom here are m − 1.

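A minimal Python sketch of this relationship, using the group means (4.50, 8.00, 7.00) and per-group n = 10 from the worked example later in the notes:

```python
# Sketch: MS_Between as n times the variance of the group means (equal n).
import statistics

means = [4.50, 8.00, 7.00]  # sample means of the m = 3 groups
n = 10                      # per-group sample size

ms_between = n * statistics.variance(means)  # n * s²_Ybar
print(round(ms_between, 2))                  # 32.5
```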
There are thus two variances that estimate the same value
under the null hypothesis.
One (σ²_Within) is calculated by pooling within-group variances.
The other (σ²_Between) is calculated from the variance of the means
multiplied by the within-group sample size.
If the null hypothesis is exactly true, σ²_Between / σ²_Within = 1.
If the null hypothesis is false and mean differences do exist,
s²_Between will be larger than would be expected under the null
hypothesis, and then s²_Between / s²_Within > 1.

If F = s²_Between / s²_Within (i.e., F = MS_Between / MS_Within) is
statistically significant, we will reject the null hypothesis and
conclude that H₀: µ₁ = µ₂ = … = µ_m = µ is false.

Thus, we are comparing means based on variances!

The ANOVA Model

The ANOVA model assumes that the score for the ith individual in
the jth group is a function of some overall mean, µ, some
effect of being in the jth group, τ_j, and some uniqueness, ε_ij.
Such a scenario implies that

Y_ij = µ + τ_j + ε_ij,

where

τ_j = µ_j − µ,

with τ_j being the treatment effect of the jth group.

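A hypothetical simulation of this model can make the pieces concrete; all numbers below are made up for illustration:

```python
# Hypothetical simulation of the ANOVA model Y_ij = mu + tau_j + eps_ij.
import random

random.seed(1)
mu = 6.5                 # overall mean
taus = [-2.0, 1.5, 0.5]  # treatment effects tau_j; note they sum to 0
n = 10_000               # large n so group means sit close to mu + tau_j

groups = [[mu + tau + random.gauss(0, 1) for _ in range(n)] for tau in taus]
means = [sum(g) / n for g in groups]
print([round(m, 1) for m in means])  # close to [4.5, 8.0, 7.0]
```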
When the null hypothesis is true, the sum of the squared τs
equals zero: Σ_{j=1}^{m} τ_j² = 0.
When the null hypothesis is false, the sum of the squared τs
equals some number larger than zero: Σ_{j=1}^{m} τ_j² > 0.

Thus, we can formally write the null and alternative
hypotheses for ANOVA as

H₀: Σ_{j=1}^{m} τ_j² = 0

and

Hₐ: Σ_{j=1}^{m} τ_j² > 0,

respectively.
Note that H₀: Σ_{j=1}^{m} τ_j² = 0 is equivalent to
H₀: µ₁ = µ₂ = … = µ_m = µ.

The null hypothesis can be evaluated by determining,
probabilistically, if the sum of the estimated τs squared is
greater than zero by more than what would be expected by
chance alone.
The “hard to believe” part is evaluated by the specified α
value.

The sums of squares are defined as follows:

SS_Between = SS_Treatment = SS_Among = Σ_{j=1}^{m} n_j (Ȳ_j − Ȳ..)²;

SS_Within = SS_Error = SS_Residual = Σ_{j=1}^{m} Σ_{i=1}^{n_j} (Y_ij − Ȳ_j)²;

SS_Total = Σ_{j=1}^{m} Σ_{i=1}^{n_j} (Y_ij − Ȳ..)².

SS_Total = SS_Between + SS_Within.

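The decomposition SS_Total = SS_Between + SS_Within can be verified numerically; here is a Python sketch on small made-up data:

```python
# Sketch verifying SS_Total = SS_Between + SS_Within (data are made up).
data = {"A": [6, 4, 5, 7, 3], "B": [8, 9, 7, 10, 8], "C": [5, 6, 4, 6, 5]}

all_scores = [y for g in data.values() for y in g]
grand = sum(all_scores) / len(all_scores)  # grand mean, Ybar..

ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2
                 for g in data.values())
ss_within = sum((y - sum(g) / len(g)) ** 2
                for g in data.values() for y in g)
ss_total = sum((y - grand) ** 2 for y in all_scores)

print(abs(ss_total - (ss_between + ss_within)) < 1e-9)  # True
```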
As usual, we divide the sums of squares by the appropriate
degrees of freedom in order to obtain a variance.
In the ANOVA context, a sum of squares divided by its
degrees of freedom is called a mean square: SS/df = MS.
“Mean squares” are so named because when a sum of
squares is divided by its degrees of freedom, the resultant value
is the mean of the squared deviations (i.e., the mean square).
Mean square simply means variance.

In general, the ANOVA source table is defined as:

Source    SS                        df      MS                   F                      p-value
Between   Σ n_j (Ȳ_j − Ȳ..)²        m − 1   SS_Between/(m − 1)   MS_Between/MS_Within   p
Within    Σ_j Σ_i (Y_ij − Ȳ_j)²     N − m   SS_Within/(N − m)
Total     Σ_j Σ_i (Y_ij − Ȳ..)²     N − 1
The ANOVA source table is (very) similar to that used in the
context of multiple regression, a widely applicable future topic.

It can also be shown that the expected values of the mean
squares are given as

E[MS_Between] = σ²_Within + [Σ_{j=1}^{m} n_j τ_j²] / (m − 1),

E[MS_Within] = σ²_Within.

When all of the population means are equal, the second
component of E[MS_Between] is zero, and the expectations of the
two mean squares are the same.
When any population mean difference exists,
E[MS_Between] > E[MS_Within].

Worked Example – Raw Data

Ratings by group, where e_ij = y_ij − ȳ_j and e_ij² = (y_ij − ȳ_j)²:

             Nokia                   Samsung                  Motorola
          y_i1  e_i1  e_i1²      y_i2  e_i2  e_i2²      y_i3  e_i3  e_i3²
            6    1.5   2.25       10    2      4         10    3      9
            6    1.5   2.25       10    2      4          6   −1      1
            2   −2.5   6.25        9    1      1         10    3      9
            3   −1.5   2.25        4   −4     16          5   −2      4
            4   −0.5   0.25        4   −4     16         10    3      9
            4   −0.5   0.25       10    2      4          5   −2      4
            6    1.5   2.25       10    2      4          2   −5     25
            2   −2.5   6.25       10    2      4         10    3      9
            5    0.5   0.25        3   −5     25          2   −5     25
            7    2.5   6.25       10    2      4         10    3      9
Σ        45.00  0.00  28.50    80.00  0.00  82.00     70.00  0.00  104.00
Mean      4.50  0.00            8.00  0.00             7.00  0.00
SD        1.78        1.78      3.02        3.02       3.40         3.40
Variance  3.17        3.17      9.11        9.11      11.56        11.56

Grand mean (ȳ..) = (4.50·10 + 8.00·10 + 7.00·10)/30 = 6.50
The grand mean is the (weighted) mean of the sample means (here it is simply equal to the mean of the means due to equal group sample sizes).

Sums of Squares
Between Sum of Squares = 10·(4.50 − 6.50)² + 10·(8.00 − 6.50)² + 10·(7.00 − 6.50)² = 65.00 = SS_Between
This is the weighted (because each score in a group has the same sample mean, of course) sum of squared deviations between the group means and the grand mean.

Within Sum of Squares = 9·3.17 + 9·9.11 + 9·11.56 = 28.5 + 82 + 104 = 214.50 = SS_Within
This is the sum of each of the within-group sums of squares.

Mean Squares
Now, to obtain the mean squares, divide the sums of squares by their appropriate degrees of freedom:
Mean Square Between = 65.00/(3 − 1) = 32.50 = MS_Between
Mean Square Within = 214.50/27 = 7.94 = MS_Within

Inference
Now, to obtain the F-statistic, divide the Mean Square Between by the Mean Square Within:
F = 32.50/7.94 = 4.091

To obtain the p-value, use the "F.DIST.RT" formula for finding the area in the right tail that exceeds the F value of 4.091:
p = 0.028061704

Now, because the p-value is less than α (.05 being typical), we reject the null hypothesis. We infer that the population group means are not all equal.
Thus, the same phone commercial, as attributed to different brands, had an effect on the ratings of the phone.
The conclusion is that there is an effect of brand on consumer sentiment: consumers rate the same thing differently depending on the brand attribution.

The data are available here: nd.edu/~kkelley/Teaching/Data/Phone_Commercial_Preference.sav.

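The worked example can be checked with scipy's one-way ANOVA routine; this Python sketch (an illustration, not part of the original notes) uses the ratings from the table above:

```python
# Sketch reproducing the worked example's F and p with scipy.
from scipy import stats

nokia    = [6, 6, 2, 3, 4, 4, 6, 2, 5, 7]
samsung  = [10, 10, 9, 4, 4, 10, 10, 10, 3, 10]
motorola = [10, 6, 10, 5, 10, 5, 2, 10, 2, 10]

F, p = stats.f_oneway(nokia, samsung, motorola)
print(round(F, 3), round(p, 4))  # ≈ 4.091 and ≈ 0.0281, as in the slide
```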

Worked Example – Summary Statistics

Summary Statistics from the Phone Evaluation
                           Nokia   Samsung   Motorola
Mean (ȳ_j)                  4.50      8.00       7.00
Standard deviation (s_j)    1.78      3.02       3.40
Sample size (n_j)             10        10         10

Grand mean (ȳ..) = (4.50·10 + 8.00·10 + 7.00·10)/30 = 6.50

Rather than using the full data set, only the summary statistics are actually needed. The reason is that we can determine the sums of squares from the summary data. The within sum of squares is literally the sum of the degrees of freedom multiplied by the variance from each group.

Sums of Squares
Between Sum of Squares = 10·(4.50 − 6.50)² + 10·(8.00 − 6.50)² + 10·(7.00 − 6.50)² = 65.00 = SS_Between
This is the weighted (because each score in a group has the same sample mean, of course) sum of squared deviations between the group means and the grand mean.

Within Sum of Squares = 1.78²·(10 − 1) + 3.02²·(10 − 1) + 3.40²·(10 − 1) = 214.50, which in terms of variances (instead of standard deviations) can be written as:
= 3.17·(10 − 1) + 9.11·(10 − 1) + 11.56·(10 − 1) = 214.50 = SS_Within
Recall that a sum of squares divided by its degrees of freedom is a variance. Correspondingly, a variance multiplied by its degrees of freedom is a sum of squares. Thus, we are able to find the sums of squares by multiplying the variances by their degrees of freedom.

Mean Squares
Now, to obtain the mean squares, divide the sums of squares by their appropriate degrees of freedom:
Mean Square Between = 65.00/(3 − 1) = 32.50 = MS_Between
Mean Square Within = 214.50/27 = 7.94 = MS_Within

Inference
Now, to obtain the F-statistic, divide the Mean Square Between by the Mean Square Within:
F = 32.50/7.94 = 4.091

To obtain the p-value, use the "F.DIST.RT" formula for finding the area in the right tail that exceeds the F value of 4.091:
p = 0.0281

Now, because the p-value is less than α (.05 being typical), we reject the null hypothesis. We infer that the population group means are not all equal.
Thus, the same phone commercial, as attributed to different brands, had an effect on the ratings of the phone.
The conclusion is that there is an effect of brand on consumer sentiment: consumers rate the same thing differently depending on the brand attribution.

The data are available here: nd.edu/~kkelley/Teaching/Data/Phone_Commercial_Preference.sav.



Product Effectiveness: Weight Loss Drinks

Over a two-month period in the early spring, 99 participants from
the Midwest were randomly assigned to one of three groups (33
each) to assess the effectiveness of a meal replacement weight loss
drink.
The study was conducted and analyzed by an independent firm.
The three groups were (a) a control group, (b) SF, and (c) TL.
All participants were encouraged to exercise and were given running
shoes, a workout outfit, and a pedometer.

The summary statistics for weight change in pounds (measured
before breakfast) are given as:

     Control    SF      TL      Total
Ȳ     -1.61    -3.06   -7.29    -3.78
s      1.83     2.12    1.79     3.00
n     26       29      22       77

As can be seen, 22 participants did not complete the study.
Implications?

The following table is the ANOVA source table:


Source SS df MS F p
Between 408.28 2 204.14 54.567 < .001
Within 276.84 74 3.74
Total 685.12 76

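This source table can be rebuilt from the summary statistics alone; here is an illustrative Python sketch (small rounding differences from the table are expected, since the slide's values come from the unrounded data):

```python
# Sketch rebuilding the source table from the summary statistics above.
means = [-1.61, -3.06, -7.29]  # group means: Control, SF, TL
sds = [1.83, 2.12, 1.79]       # group standard deviations
ns = [26, 29, 22]              # group sample sizes
m, N = len(ns), sum(ns)

grand = sum(n * yb for n, yb in zip(ns, means)) / N
ss_between = sum(n * (yb - grand) ** 2 for n, yb in zip(ns, means))
ss_within = sum((n - 1) * s ** 2 for n, s in zip(ns, sds))

F = (ss_between / (m - 1)) / (ss_within / (N - m))
print(round(ss_between, 1), round(ss_within, 1), round(F, 1))
```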
The critical F-value at the .05 level for 2 and 74 degrees of
freedom is F(2,74;.95) = 3.12.

So, given the information, the decision is to ____.

The one-sentence interpretation of the results is:

Performing an ANOVA in SPSS

ANOVA Output from SPSS
Suggestions when Performing ANOVA in SPSS

Start with Analyze → Descriptives → Explore.
Use Analyze → Compare Means → One-Way ANOVA for the
ANOVA procedure.
In the One-Way ANOVA specification, request a Means Plot
(via Options).
Consider using Analyze → General Linear Model → Univariate
for a more general approach.

Omnibus Versus Targeted Tests

Procedures such as the t-test are targeted, and thus test
specific hypotheses.
For example, the independent-groups t-test evaluates the
hypothesis that µ₁ = µ₂.
Thus, after an ANOVA is performed, oftentimes we want to
know where the mean differences exist.
However, part of the rationale for ANOVA was to avoid
performing many separate significance tests.

An Analogy

Consider an airline scheduling system at the gate of departure.
This system requires all five processes to simultaneously
function:
1 live feed from the corporate server;
2 live feed to the corporate server;
3 live feed to the departing airport;
4 live feed to the arrival airport;
5 computer terminal functioning properly (e.g., no software
glitches, no power loss).
Suppose that the “uptime” or “reliability” of each of these
independent processes is .95, meaning that at any given time
there is a 95% chance each process is working.

What is the probability that the system can be used when


needed (i.e., that all five systems working properly)?
Recalling the rule of independent events, the probability that
the system can be used is
.95 × .95 × .95 × .95 × .95 = .955 = .7738.
Thus, even though each piece of the system has a 95%
chance of working properly, there is only a 77.38% chance
that the system itself can be used.
The implication here is that the probability of an error occurring somewhere in
the set of processes (1 − .7738 = .2262) is much higher than for
any given system (1 − .95 = .05).
Note that the error rate for the system as a whole is .2262/.05 = 4.524
times that of a single process!
This is the multiplicity issue – an error somewhere among a
set of “tests” is more likely than an error on any given test.
55 / 104
Why Multiplicity Matters – Multiple Testing

The probability of making at least one Type I error out of C
independent comparisons is given as

p(At least one Type I error) = 1 − p(No Type I errors) = 1 − (1 − α)^C,

where C is the number of independent comparisons to be performed
(based on the rules of probability).
If C = 5, then p(At least one Type I error) = .2262!

Note that this is also the probability that one or more of 5
confidence intervals, each computed at the 95% level, fail to
bracket the population quantity.
The scenario here is analogous to the airline scheduling system.

56 / 104
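The formula above is easy to verify numerically; the sketch below (plain Python, nothing beyond the standard library) computes p(At least one Type I error) for several values of C:

```python
# Probability of at least one Type I error among C independent
# tests, each conducted at per-comparison level alpha:
#   p = 1 - (1 - alpha)**C
def familywise_error(alpha, C):
    return 1 - (1 - alpha) ** C

for C in (1, 5, 10, 20):
    print(C, round(familywise_error(0.05, C), 4))
```

With α = .05, C = 5 reproduces the .2262 of the airline scheduling example, and C = 20 reproduces the .6415 of the jelly-bean comic shown later.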
Types of Error Rates

There are three types of error rates that can be considered:


1 Per comparison error rate (αPC ).
Analogous to the per-process failure rate (5%) in the
airline system example.

2 Familywise error rate (αFW ).


Analogous to the system failure rate (22.62%) of the airline
system example above.
3 Experimentwise error rate (αEW ).
Analogous to the multiple systems being required to fly the
airplane (e.g., not only the scheduling system, but also that
the plane functions properly, the flight team arrives on time,
etc.), which can be much higher than αFW (if there are
multiple families).

57 / 104
Per Comparison Error Rate

αPC : the probability that a particular test (i.e., a comparison)


will reject a true null hypothesis.

This is the Type I error rate we have always used
(as we have only worked with a single test at a time).

58 / 104
Familywise Error Rate (αFW )

αFW : the probability that one or more tests will reject a true
null hypothesis somewhere in the “family.”

Defining exactly what a family is can be difficult and open to


interpretation.
As an aside, there are many statistical issues “open to
interpretation.”

Reasonable people can disagree on how to handle various


issues.

Openness about the method, its assumptions, and its limitations
is key.

59 / 104
Experimentwise Error Rate (αEW )

αEW : the probability that one or more tests will reject a true
null hypothesis somewhere in the “experiment” (or study
more generally).

Modifying the significance criterion so that αFW is the
probability of a Type I error out of the set of C significance tests
is the same as forming C simultaneous confidence intervals.

We do not focus on the experimentwise error rate, as we will
assume a single family for our set of tests.

60 / 104
[Several slides show panels of the XKCD comic “Significant” (http://xkcd.com/882/), in which 20 jelly-bean tests yield one “significant” result. Slide titles: “A Hypothesis, A Result”; “Tests, tests, tests, . . .”; “. . . and more tests. . .”; “. . . and yet more tests. . .”; “A Type I Error (It Seems)”; “After Many Tests, A ‘Finding’”; “Error Rate”]

The probability of at least one Type I error for 20 independent tests,
which the jelly bean comparisons were, is

1 − (1 − .05)^20 = 1 − .3584859 = .6415141.

Thus, there is a 64% chance of at least one Type I error in such a case!


A Summary. . .

Multiplicity Matters!
Linear Comparisons: Specifying Contrasts of Interest

Suppose a question of interest is the contrast of group 1 and


group 3 in a three group design.
That is, we are interested in the following effect: Ȳ1 − Ȳ3 .

The above is equivalent to: (1) × Ȳ1 + (0) × Ȳ2 + (−1) × Ȳ3 .

70 / 104
Suppose a question of interest is comparing the mean of groups 1
and 2 (i.e., the average of the two group means) with group 3
in a three-group design.
That is, we are interested in the following effect: (Ȳ1 + Ȳ2)/2 − Ȳ3.

The above is equivalent to: (Ȳ1 + Ȳ2)/2 + (−1) × Ȳ3.

The above is equivalent to: (1/2) × Ȳ1 + (1/2) × Ȳ2 + (−1) × Ȳ3.

We could also write the above as:
(.5) × Ȳ1 + (.5) × Ȳ2 + (−1) × Ȳ3.

71 / 104
Consider a situation in which group 1 receives one type of


allergy medication, group 2 receives another type of allergy
medication, and group 3 receives a placebo (i.e., no
medication).
The question here is “does taking medication have an effect,
relative to not taking medication, on self-reported allergy symptoms?”

72 / 104
Forming Linear Comparisons

In the population, the value of any contrast of interest is
given as

Ψ = c1 µ1 + c2 µ2 + c3 µ3 + . . . + cm µm = Σ_{j=1}^{m} cj µj,

where cj is the comparison weight for the jth group and Ψ is
the population value of a particular linear combination of
means.
An estimated linear comparison is of the form

Ψ̂ = c1 Ȳ1 + c2 Ȳ2 + c3 Ȳ3 + . . . + cm Ȳm = Σ_{j=1}^{m} cj Ȳj,

where cj is the comparison weight for the jth group and Ψ̂ is
the estimated value of the particular linear combination of means.
73 / 104
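An estimated contrast is just a weighted sum of sample means, so it takes only a few lines to compute; the sketch below uses illustrative group means (not values from the lecture's data):

```python
# Estimated contrast: psi_hat = sum_j c_j * Ybar_j, where the
# c-weights must sum to 0.
def contrast(c_weights, means):
    assert abs(sum(c_weights)) < 1e-12, "c-weights must sum to 0"
    return sum(c * ybar for c, ybar in zip(c_weights, means))

means = [10.0, 12.0, 15.0]  # illustrative group means

print(contrast([1, 0, -1], means))      # pairwise: group 1 vs. group 3 -> -5.0
print(contrast([0.5, 0.5, -1], means))  # complex: (10 + 12)/2 - 15 -> -4.0
```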
Forming Linear Comparisons

The first example from above compared the mean of
group 1 versus group 3.
In c-weight form the c-weights are [1, 0, −1]:
(1) × Ȳ1 + (0) × Ȳ2 + (−1) × Ȳ3.

Comparing one mean to another (i.e., using a 1 and a −1
c-weight with the rest 0s) is called a pairwise comparison (as
the comparison involves only a pair).

74 / 104
Forming Linear Comparisons

The second example compared the mean of groups 1 and
2 versus group 3.
In c-weight form the c-weights are [.5, .5, −1]:
(.5) × Ȳ1 + (.5) × Ȳ2 + (−1) × Ȳ3.

Comparing a weighting of two or more groups to one or more
other groups is called a complex comparison. That is, if the
c-weights are anything other than a 1 and a −1 with the rest 0s,
it is a complex comparison.

75 / 104
It is required that Σ_{j=1}^{m} cj = 0.

For example, setting c1 to 1 and c2 to −1 yields the pair-wise


comparison of Group 1 and Group 2:

Ψ̂ = (1 × Ȳ1 ) + (−1 × Ȳ2 ) = Ȳ1 − Ȳ2 .

76 / 104
Rules for c-Weights

The positive c-weights for a comparison should sum to 1.
The negative c-weights for a comparison should sum to −1.
By implication of the two rules above, all
c-weights for a comparison should sum to 0 (i.e., Σ cj = 0).

Otherwise, the corresponding confidence interval is not as
intuitive.
However, any rescaling of such c-weights produces the same
t-test.
The confidence interval will then have a different interpretation than
usual, as the effect will be for a specific linear combination
(e.g., Ψ̂ = 2Ȳ1 − Ȳ2 − Ȳ3).
77 / 104
Thus, for the mean of Groups 1 and 2 compared to the mean


of Group 3, the contrast is

Ψ̂ = (.5 × Ȳ1) + (.5 × Ȳ2) + (−1 × Ȳ3) = (Ȳ1 + Ȳ2)/2 − Ȳ3.

78 / 104
Consider a situation in which one wants to weight the groups


based on the relative size of an outside factor, such as
market share, profit per segment, number of users, et cetera.

Suppose that interest is in comparing teens versus a weighted


average of 20 year olds and 30 year olds in an online
community, where among the 20 and 30 year olds the
proportion of users is 70 percent and 30 percent, respectively.

Ψ̂ = 1 × ȲTeens + (−.70 × Ȳ20s ) + (−.30 × Ȳ30s ).

79 / 104
There are technically an infinite number of comparisons that


can be formed, but only a few will likely be of interest.

The comparisons are formed so that targeted research


questions about population mean differences can be
addressed.

But recall that, in general, the positive c-weights should sum
to 1 and the negative c-weights should sum to −1, so as to
have a more interpretable confidence interval.

80 / 104
A More Powerful t-Test

The t-test corresponding to a particular contrast is given as

t = ( Σ cj Ȳj ) / √( MSWithin Σ_{j=1}^{m} (cj²/nj) ) = Ψ̂ / SE(Ψ̂),

where the MSWithin is from the ANOVA and is the best


estimate of the population variance.
Importantly, this t-test has N − m degrees of freedom (i.e.,
the MSWithin degrees of freedom).
Note that the denominator is simply the standard error of the
contrast, which is used for the corresponding confidence
interval.
81 / 104
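The contrast t-test above can be sketched in a few lines of Python; the means, group sizes, and MSWithin value below are illustrative assumptions, not values from the lecture's example:

```python
import math

# t-test for a contrast, using MS_Within from the ANOVA as the
# pooled variance estimate; df = N - m (the MS_Within df).
def contrast_t(c_weights, means, ns, ms_within):
    psi_hat = sum(c * ybar for c, ybar in zip(c_weights, means))
    se = math.sqrt(ms_within * sum(c ** 2 / n for c, n in zip(c_weights, ns)))
    df = sum(ns) - len(ns)  # N - m
    return psi_hat / se, se, df

# Complex comparison (groups 1 and 2 vs. group 3), illustrative numbers:
t, se, df = contrast_t([0.5, 0.5, -1], [10.0, 12.0, 15.0], [20, 20, 20], 9.0)
print(t, se, df)
```

The denominator `se` is the standard error of the contrast, so the same quantity feeds directly into the corresponding confidence interval.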
Recall that when the homogeneity of variance assumption holds,
there are m different estimates of σ².

For homogeneous variances, the best estimate of the population


variance for any group is the mean square error (MSWithin ), which
uses information from all groups.

Thus, the independent groups t-test can be given as

t = (Ȳj − Ȳk) / √( MSWithin (1/nj + 1/nk) ),

with degrees of freedom based on the mean square within (N − m),


which provides more power.

82 / 104
The above two-group t-test still addresses the question
“does the population mean of Group 1 differ from the
population mean of Group 2?”

However, more information is used because the error
term is based on N − m degrees of freedom instead of
n1 + n2 − 2.

83 / 104
The MSWithin – Even for a Single Group


Even if we are interested in testing or forming a confidence
interval for a single group, the mean square within can (and
usually should) be used — again, due to having a better
estimate of σ²:

t = (Ȳj − µ0) / √( MSWithin (1/nj) ).

The two-sided confidence interval is thus:

Ȳj ± √( MSWithin (1/nj) ) × t(1−α/2, N−m).

The degrees of freedom for the above test and confidence
interval are, because MSWithin is used as the estimate of σ²,
N − m.
84 / 104
Thus, using MSWithin is one way to have more power to test


the null hypothesis concerning a single group or two groups,
even when more than two groups are available.

Additionally, precision is increased because the confidence


interval will be narrower (due to the smaller standard error
and smaller critical value).

85 / 104
The Bonferroni Procedure

The Bonferroni Procedure is also called Dunn’s procedure.

Good for a few pre-planned targeted tests, but doing too


many leads to conservative critical values.
Conservative critical values are those that are bigger (i.e.,
harder to achieve significance) than would be the case ideally.

Liberal critical values are those that are smaller (i.e., easier to
achieve significance) than would be the case ideally.

86 / 104
It can be shown that αPC ≤ αFW ≤ C αPC , where C is the


number of comparisons.

The per comparison error rate can be manipulated by dividing
the desired familywise (or experimentwise) Type I error rate by
C, the number of comparisons: αPC = αFW /C.

The standard t-test formula is used, but the obtained t value
is compared to a critical value based on α/C: t(1−(α/C)/2, df).
The observed p-values (e.g., from SPSS) can be corrected for
multiplicity by multiplying the C p-values by C .
If the corrected p value is less than αFW , then the test is
statistically significant in the context of a correct familywise
Type I error rate.

87 / 104
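Both forms of the Bonferroni adjustment (dividing α by C, or multiplying the observed p-values by C) can be sketched as follows; the example p-values are hypothetical:

```python
# Bonferroni: to hold the familywise Type I error rate at alpha_FW
# across C comparisons, test each comparison at alpha_FW / C.
def bonferroni_alpha(alpha_fw, C):
    return alpha_fw / C

# Equivalently, multiply each observed p-value by C (capped at 1)
# and compare the adjusted p-values to alpha_FW directly.
def bonferroni_adjust(p_values):
    C = len(p_values)
    return [min(1.0, p * C) for p in p_values]

print(bonferroni_alpha(0.05, 4))  # 0.0125
print(bonferroni_adjust([0.010, 0.040, 0.300]))
```

The two routes are equivalent: a raw p-value below αFW /C is the same event as an adjusted p-value below αFW.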
The critical value is what changes in the context of a


Bonferroni test, not the way in which the t-test and/or
confidence intervals are calculated.

Incorporating MSWithin into the denominator of the t-test is
not really a change, as MSWithin is just a pooled variance
based on m (rather than 2) groups.
Recall this is just an extension of s²Pooled when information on
more than two groups is available.

88 / 104
Tukey’s Test

Tukey’s test is used when all (or several) pairwise comparisons


are to be tested.

For comparing all possible pair-wise comparisons, Tukey’s test


provides the most powerful multiple comparison procedure.
There is a Tukey-b in SPSS — I recommend “Tukey.”
For the Tukey procedure, the p-values and confidence intervals
given by SPSS are already “corrected” for multiplicity.

89 / 104
Pairwise comparisons compare the means of two groups (i.e.,
a pair; e.g., µ1 − µ3), without any complex
comparisons (e.g., (Ȳ1 + Ȳ2)/2 − Ȳ3).

The observed test statistic is compared to the tabled values of


the Studentized range distribution.
This is the distribution that the Tukey procedure uses to
obtain confidence intervals and p-values.

90 / 104
The Scheffé Test

For any number of post hoc tests with any linear combination
of means, the Scheffé Test is generally optimal.

Although the Scheffé Test is conservative for a small number


of comparisons, any number of comparisons can be conducted
while controlling the Type I error rate.

91 / 104
We compute the F -value (just a t-value squared) in accord


with some linear combination of means, and a critical value is
determined for the specific context.

The Scheffé critical F -value (take the square root for the
critical t-value) is given as

(m − 1)F(m−1, N−m; α),

which is m − 1 times the critical ANOVA F -value.

92 / 104
The Scheffé procedure should not be done for all pairwise


comparisons (it is not as powerful as Tukey’s Test for pairwise
comparisons).

If many complex and other (e.g., pairwise) are to be done,


usually the Scheffé procedure is optimal.

93 / 104
Flowchart for Multiple Comparisons

Begin:

1 Testing all pairwise and no complex comparisons (either
planned or post hoc), or choosing to test only some pairwise
comparisons post hoc?
Yes → use Tukey’s method.
No → go to 2.

2 Are all comparisons planned?
No → use Scheffé’s method.
Yes → go to 3.

3 Is the Bonferroni critical value less than the Scheffé critical value?
Yes → use Bonferroni’s method.
No → use Scheffé’s method (or, prior to collecting the data,
reduce the number of contrasts to be tested).
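The flowchart's decision logic can be sketched as a small helper function (a hypothetical implementation; the function and argument names are mine, not from the slides):

```python
# Decision helper following the flowchart: Tukey for pairwise-only
# testing, Scheffe for unplanned (post hoc) complex comparisons, and
# Bonferroni for planned comparisons when its critical value is
# smaller than Scheffe's.
def choose_procedure(pairwise_only, all_planned,
                     bonferroni_crit=None, scheffe_crit=None):
    if pairwise_only:
        return "Tukey"
    if not all_planned:
        return "Scheffe"
    if (bonferroni_crit is not None and scheffe_crit is not None
            and bonferroni_crit < scheffe_crit):
        return "Bonferroni"
    # Otherwise use Scheffe (or reduce the number of planned contrasts
    # before collecting the data).
    return "Scheffe"

print(choose_procedure(pairwise_only=True, all_planned=False))  # Tukey
```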
SPSS does not make it easy to get the appropriate p-values


and confidence intervals for complex comparisons.

The Bonferroni and Scheffé procedures in SPSS are for


pair-wise comparisons, which are not of interest because
Tukey is almost always preferred for pair-wise.

For the specified contrasts, SPSS reports only the standard


output (i.e., not controlling the Type I error rate).

Thus, users need to be careful that they are appropriately
controlling the Type I error rate.

95 / 104
Goal of Analysis of Variance
The Formal ANOVA Model Assumptions of the ANOVA
Explanation by Example What You Learned
Multiple Comparisons Notations
Assumptions

Assumptions of the ANOVA

The assumptions of the ANOVA are the same as for the


two-group t-test.
1 The population from which each group’s scores were sampled is
normally distributed.

2 The variance for each of the m groups is the same.

3 The observations are independent.

Recall that multiple regression assumes homoscedasticity,


which is just an extension of homogeneity of variance.

96 / 104
Also like the independent groups t-test, the first two


assumptions become less important as sample size increases.
This is especially true when the per-group sample sizes are equal or
nearly so.

Thus, the larger the sample size, the more robust the model to
these two assumption violations.

97 / 104
Again, like the t-test, the ANOVA is very sensitive (i.e., it is


not robust) to violations of the assumption of independence.
Observations that are not independent can make the empirical
α rate much different than the nominal α rate.

98 / 104
Analysis of variance procedures test an omnibus (i.e., an


overall) hypothesis.
More specifically, ANOVA models test the hypothesis that
µ1 = µ2 = . . . = µm .

In many situations, primary interest concerns targeted null


hypotheses (not just the omnibus hypothesis).

Thus, additional analyses may be necessary.

99 / 104
A Summary from Designing Experiments and Analyzing Data
This discussion “focuses on special methods that are needed when the
goal is to control αFW instead of to control αPC . Once a decision has
been made to control αFW , further consideration is required to choose an
appropriate method of achieving this control for the specific
circumstance. One consideration is whether all comparisons of interest
have been planned in advance of collecting the data. If so, the Bonferroni
adjustment is usually most appropriate, unless the number of planned
comparisons is quite large. Statisticians have devoted a great deal of
attention to methods of controlling αFW for conducting all pairwise
comparisons, because researchers often want to know which groups differ
from other groups. We generally recommend Tukey’s method for
conducting all pairwise comparisons. Neither Bonferroni nor Tukey is
appropriate when interest includes complex comparisons chosen after
having collected the data, in which case Scheffé’s method is generally
100 / 104
What You Learned from this Lesson


You learned:
How to compare more than two independent means to assess if
there are any differences via analysis of variance (ANOVA).
How the total sums of squares for the data can be decomposed
to a part that is due to the mean differences between groups
and to a part that is due to within group differences.
Why doing multiple t-tests is not the same thing as ANOVA.
Why doing multiple t-tests leads to a multiplicity issue, in that
as the number of tests increases, so too does the probability of
one or more errors.
How to correct for the multiplicity issue so that a set of
contrasts/comparisons has a Type I error rate for the collection
of tests at the desired (e.g., .05) level.
How to use SPSS to implement an ANOVA and follow-up
tests.
101 / 104
Notations

H0: σ₁² = σ₂² - The null hypothesis of equal variances

F(df1, df2) - The F -statistic with df1 and df2 as the degrees of
freedom
s₁² and s₂² - The variances for group 1 and group 2, respectively
s²Pooled - Pooled within-group variance
m - Number of groups
nj - Sample size in the jth group (j = 1, . . . , m)
N - Total sample size: N = Σ_{j=1}^{m} nj

102 / 104
Notations Continued

SS - Sum of squares
This can be for the Between, Treatment, Among, Within,
Error, or Total Sum of Squares
MS - Mean square (i.e., a variance)
MSWithin is the mean square within a group
Yij - The score for the ith individual in the jth group

τj - The treatment effect of the jth group

εij - Some uniqueness for the ith individual in the jth group

E[MSWithin ] - The expected value of the mean squares within


a group

103 / 104
Notations Continued

C - The number of independent comparisons to be performed

αPC - Per comparison error rate

αFW - Familywise error rate

αEW - Experimentwise error rate

Ψ̂ - The particular linear combination of means

cj - Comparison weight for the jth group

104 / 104
