
mode = L + ((f1 - f0) / (2*f1 - f0 - f2)) * h

L is the lower class limit of the modal class


f1 is the frequency of the modal class
f0 is the frequency of the class before the modal class in the frequency table
f2 is the frequency of the class after the modal class in the frequency table
h is the class interval of the modal class
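The formula above can be sketched in a few lines of Python. This is only an illustration: the function name grouped_mode and the frequency table are hypothetical, and treating a missing neighbouring class as having frequency 0 is an assumption, not something stated in the text.

```python
def grouped_mode(lower_limits, frequencies, width):
    """Estimate the mode of grouped data as L + (f1 - f0) / (2*f1 - f0 - f2) * h."""
    # Index of the modal class (the first class with the highest frequency).
    i = max(range(len(frequencies)), key=frequencies.__getitem__)
    f1 = frequencies[i]
    # Assumption: a neighbouring class that does not exist counts as frequency 0.
    f0 = frequencies[i - 1] if i > 0 else 0
    f2 = frequencies[i + 1] if i + 1 < len(frequencies) else 0
    denominator = 2 * f1 - f0 - f2
    if denominator == 0:
        raise ValueError("formula is undefined when f1, f0 and f2 are all equal")
    return lower_limits[i] + (f1 - f0) / denominator * width


# Hypothetical frequency table: classes 0-10, 10-20, 20-30, 30-40, 40-50.
lower_limits = [0, 10, 20, 30, 40]
frequencies = [5, 8, 12, 9, 6]
mode_estimate = grouped_mode(lower_limits, frequencies, width=10)
print(round(mode_estimate, 2))  # 20 + (4 / 7) * 10 = 25.71
```

Here the modal class is 20-30, so L = 20, f1 = 12, f0 = 8, f2 = 9 and h = 10.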
THE CHI-SQUARE TEST
Introduction: The chi-square test is a statistical test that can be used to
determine whether observed frequencies are significantly different from expected
frequencies. For example, after we calculated expected frequencies for different
allozymes in the HARDY-WEINBERG module we would use a chi-square test to compare
the observed and expected frequencies and determine whether there is a
statistically significant difference between the two. As in other statistical
tests, we begin by stating a null hypothesis (H0: there is no significant
difference between observed and expected frequencies) and an alternative
hypothesis (H1: there is a significant difference). Based on the outcome of the
chi-square test we will either reject or fail to reject the null hypothesis.
Importance: Chi-square tests enable us to compare observed and expected
frequencies objectively, since it is not always possible to tell just by looking
at them whether they are "different enough" to be considered statistically
significant. Statistical significance in this case implies that the differences
are not due to chance alone, but instead may be indicative of other processes at
work.
Question: How is the chi-square test used to compare samples or populations?
What does a comparison of observed and expected frequencies tell us about these
samples?
Variables:

χ² the chi-square test statistic
o observed count or frequency
e expected count or frequency
n total number of observations
RT row total
CT column total
Methods: Shaklee et al. (1993) collected data to study genetic variation within
a species of fish called the barramundi perch (Lates calcarifer). Many fish
species are composed of breeding groups called stocks, which are populations
that are genetically distinct from one another. One of the goals of Shaklee et
al.'s study was to identify individual stocks of the barramundi perch on the
basis of significant genetic differentiation. Of the 25 collections examined,
those that were not significantly genetically distinct from one another were
considered to be from the same stock; collections that were genetically distinct
were considered to be from different stocks. Understanding species subdivision
into stocks has important implications for conservation and fisheries
management, since maintaining the genetic diversity of the species as a whole
will require conservation of the different stocks.
We'll use some of their data here to illustrate the application of a simple
chi-square test. Below are data showing allele frequencies (counts) at seven
loci for eight collections of perch from different parts of the Australian coast
(table adapted from Shaklee et al. 1993; all errors due to rounding are mine).

Locus    Allele    # 1   # 2  # 14  # 15  # 18  # 21  # 22  # 25
EST-2*   *100+     249    78    97   115   101   242   128   116
         *98        26     4     0     1     2     0     2    30
         *95       126    41    60    60    52   226   125    70
ESTD*    *100+     390   120   155   176   171   465   335   210
         *114       15     4     0     0     0     9     2     6
mIDHP*   *100      387   123   152   167   152   474   333   216
         *78         0     0     5    10     4     1     0     0
sIDHP*   *100      354   113   111   137   143   432   310   177
         *121+      37     7    44    33    27    39    18    28
         *83         9     3     0     0     0     1     1     3
LDH-C*   *100      373   115   156   175   154   400   245   208
         *90+       29     9     1     1     1    75    25     5
PGDH*    *100      382   122   130   145   153   378   240   199
         *88+        5     2    21    18    16    95    89     3
PROT*    *100+     399   120   149   168   147   453   326   207
         *97         8     4     8     9     9    22     5     9
We can use the chi-square test to compare collections # 1 and # 25 at the EST-2*
locus. The expected values are the allele frequencies we would expect if there
were no difference between the two collections at this locus. We can calculate
the expected allele frequencies using the row and column totals from a table of
the observed frequencies for these two collections.
For the first cell (collection #1, allele *100+) we begin by calculating the
probability of an observation being in the first row, regardless of column. To
do this, take the row total (365) and divide it by n (617) (note that n changes
depending on which locus and which pair of populations is being compared). Based
on these two collections, the probability of a barramundi perch having the *100+
allele at the EST-2* locus is 0.5916 (365/617). Next, we calculate the
probability of an observation being in the first column, regardless of row, by
taking the column total (401) and dividing it by n (617). The probability of an
observation coming from collection #1 as opposed to collection #25 is 0.6499
(401/617).
We have now determined the probability of a perch having a given allele at this
locus, and the probability of being in a given collection. But what is the
probability that an individual observation will have the *100+ allele at the
EST-2* locus and be from collection #1? If the two outcomes are independent (as
the null hypothesis assumes), the probability of both occurring together, called
the joint probability, is calculated by multiplying the two separate
probabilities: 0.5916 x 0.6499 = 0.3845. It follows that in a sample of 617 fish
we would expect 617 x 0.3845 = 237 individuals to be from collection #1 and have
the *100+ allele, and we have now calculated our expected value for the first
cell in the table. This calculation can be simplified with the following
formula:
e = (RT/n) x (CT/n) x n = (RT x CT)/n
Verify that the other expected frequencies have been calculated correctly.
Observed frequencies
allele    # 1   # 25    RT
*100+     249    116   365
*98        26     30    56
*95       126     70   196
CT        401    216   n=617

Expected frequencies
allele    # 1   # 25    RT
*100+     237    128   365
*98        36     20    56
*95       127     69   196
CT        401    216   n=617
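As a quick check on the expected frequencies, the e = (RT x CT)/n calculation can be sketched in Python. The observed counts are the EST-2* counts for collections #1 and #25 from the table above; the function and variable names are illustrative only.

```python
# Observed EST-2* allele counts: (collection #1, collection #25).
observed = {
    "*100+": (249, 116),
    "*98": (26, 30),
    "*95": (126, 70),
}


def expected_counts(table):
    """Return expected counts e = RT * CT / n for every cell of the table."""
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]           # RT for each allele
    column_totals = [sum(col) for col in zip(*rows)]  # CT for each collection
    n = sum(row_totals)                               # total observations
    return {
        allele: tuple(rt * ct / n for ct in column_totals)
        for allele, rt in zip(table, row_totals)
    }


expected = expected_counts(observed)
for allele, values in expected.items():
    # Rounded to whole fish, as in the expected-frequency table.
    print(allele, [round(v) for v in values])
```

Rounded to whole numbers, the output matches the expected frequencies listed for each allele.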
Note that the row and column totals are the same for the observed and expected
frequencies. Now we can use the chi-square test to compare the observed and
expected frequencies. The chi-square test statistic is calculated with the
following formula:
χ² = Σ [ (o - e)² / e ]
For each cell, the expected frequency is subtracted from the observed frequency,
the difference is squared, and the result is divided by the expected frequency.
The values are then summed across all cells. This sum is the chi-square test
statistic. For the example here,
χ² = 0.608 + 2.778 + 0.008 + 1.125 + 5.000 + 0.014 = 9.533.
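The sum above can be reproduced with a short Python sketch. It uses the expected frequencies rounded to whole numbers, exactly as they appear in the text; the variable names are illustrative.

```python
# Cells in the order used in the text: collection #1 (*100+, *98, *95),
# then collection #25 (*100+, *98, *95).
observed = [249, 26, 126, 116, 30, 70]
expected = [237, 36, 127, 128, 20, 69]  # rounded expected frequencies

# Per-cell contribution: (o - e)^2 / e
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = sum(contributions)

print([round(c, 3) for c in contributions])
print(round(chi_square, 3))  # 9.533
```

Carrying the unrounded expected frequencies through the calculation gives a somewhat larger statistic (roughly 10.2), which leads to the same conclusion.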
Interpretation: The critical value for the chi-square in this case (significance
level 0.05, 2 degrees of freedom) is 5.991. If the calculated chi-square value
is equal to or greater than this critical value, then the probability of
obtaining differences this large by chance alone, if the null hypothesis were
true, is 0.05 or less: a small probability indeed! Our calculated value of 9.533
is greater than the critical value of 5.991. We therefore reject the null
hypothesis, and conclude that there is a significant difference between the
observed and expected frequencies of alleles at the EST-2* locus for these two
collections of barramundi perch. (Critical values for the chi-square are
determined from a statistical table based on the significance level at which the
test is being performed [0.05 in our case] and a number called degrees of
freedom [2 in this example], but the details are beyond the scope of this
module).
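Although the module leaves the details to a statistical table, the special case of 2 degrees of freedom has a simple closed form, because a chi-square variable with 2 degrees of freedom is exponentially distributed with mean 2. The sketch below relies on that fact and is valid only for 2 degrees of freedom; the function names are hypothetical.

```python
import math


def chi2_tail_df2(x):
    """P(chi-square > x) for exactly 2 degrees of freedom."""
    return math.exp(-x / 2)


def chi2_critical_df2(alpha):
    """Critical value with upper-tail probability alpha, for 2 degrees of freedom."""
    return -2 * math.log(alpha)


critical = chi2_critical_df2(0.05)
p_value = chi2_tail_df2(9.533)

print(round(critical, 3))  # 5.991
print(p_value < 0.05)      # True, so the null hypothesis is rejected
```

For other degrees of freedom (for example the two-allele loci, which have 1 degree of freedom) this shortcut does not apply, and a table or statistical library is needed.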
Conclusions: Our rejection of the null hypothesis allows us to conclude that the
two collections of barramundi perch compared here are genetically distinct at
the EST-2* locus. In other words, the frequencies of the three alleles at this
locus are significantly different between the two populations. Using somewhat
more complicated applications of the chi-square test, the authors concluded that
the 25 collections they analyzed came from seven genetically distinct stocks, or
populations, from adjacent stretches of the northeastern Australian coast. One
of the goals of conservation and/or management is the preservation of genetic
diversity within a species. Management decisions based on the assumption that a
species' genetic variation is evenly distributed across populations could have
disastrous consequences for the future of the species if the populations are
indeed genetically distinct. Techniques for identifying amounts and patterns of
genetic variation within a species are critical tools for biologists.
Additional Questions:
1) Are the allele frequencies at the other six loci also significantly different
between collections #1 and #25? (For loci with two alleles instead of three, the
critical value of the chi-square is 3.841, but otherwise the procedure is the
same.)
2) Use the chi-square test to compare allele frequencies for collections #14 and
#15. Can you determine whether or not these two collections are from the same
stock?
Sources:
Rohlf, F. J. and R. R. Sokal. 1995. Biometry, 3rd ed. W. H. Freeman and Company,
New York, NY.
Rohlf, F. J. and R. R. Sokal. 1995. Statistical Tables, 3rd ed. W. H. Freeman
and Company, New York, NY.
Shaklee, J. B., J. Salini, and R. N. Garrett. 1993. Electrophoretic
characterization of multiple genetic stocks of barramundi perch in Queensland,
Australia. Transactions of the American Fisheries Society 122:685-701.
copyright 1999 by M. Beals, L. Gross, and S. Harrell
Chi-Square Test for Independence
This lesson explains how to conduct a chi-square test for independence. The test
is applied when you have two categorical variables from a single population. It
is used to determine whether there is a significant association between the two
variables.
For example, in an election survey, voters might be classified by gender (male
or female) and voting preference (Democrat, Republican, or Independent). We
could use a chi-square test for independence to determine whether gender is
related to voting preference. The sample problem at the end of the lesson
considers this example.
When to Use Chi-Square Test for Independence
The test procedure described in this lesson is appropriate when the following
conditions are met:
The sampling method is simple random sampling.
Each population is at least 10 times as large as its respective sample.
The variables under study are each categorical.
If sample data are displayed in a contingency table, the expected frequency
count for each cell of the table is at least 5.
This approach consists of four steps: (1) state the hypotheses, (2) formulate an
analysis plan, (3) analyze sample data, and (4) interpret results.
State the Hypotheses
Suppose that Variable A has r levels, and Variable B has c levels. The null
hypothesis states that knowing the level of Variable A does not help you predict
the level of Variable B. That is, the variables are independent.
H0: Variable A and Variable B are independent.
Ha: Variable A and Variable B are not independent.
The alternative hypothesis is that knowing the level of Variable A can help you
predict the level of Variable B.
Note: Support for the alternative hypothesis suggests that the variables are
related; but the relationship is not necessarily causal, in the sense that one
variable "causes" the other.
Formulate an Analysis Plan
The analysis plan describes how to use sample data to reject or fail to reject
the null hypothesis. The plan should specify the following elements.
Significance level. Often, researchers choose significance levels equal to 0.01,
0.05, or 0.10; but any value between 0 and 1 can be used.
Test method. Use the chi-square test for independence to determine whether there
is a significant relationship between two categorical variables.
Analyze Sample Data
Using sample data, find the degrees of freedom, expected frequencies, test
statistic, and the P-value associated with the test statistic. The approach
described in this section is illustrated in the sample problem at the end of
this lesson.
Degrees of freedom. The degrees of freedom (DF) is equal to:
DF = (r - 1) * (c - 1)
where r is the number of levels for one categorical variable, and c is the
number of levels for the other categorical variable.
Expected frequencies. The expected frequency counts are computed separately for
each level of one categorical variable at each level of the other categorical
variable. Compute r * c expected frequencies, according to the following
formula.
E_{r,c} = (n_r * n_c) / n
where E_{r,c} is the expected frequency count for level r of Variable A and
level c of Variable B, n_r is the total number of sample observations at level r
of Variable A, n_c is the total number of sample observations at level c of
Variable B, and n is the total sample size.
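The expected-count formula works for a table of any size. The following Python sketch applies it to a small 2 x 2 table and also checks the "at least 5 per cell" condition listed earlier. The table values are hypothetical, invented purely for illustration, and the function name is an assumption.

```python
def expected_counts(observed):
    """Expected count for each cell: (row total * column total) / n."""
    row_totals = [sum(row) for row in observed]           # n_r
    column_totals = [sum(col) for col in zip(*observed)]  # n_c
    n = sum(row_totals)
    return [[n_r * n_c / n for n_c in column_totals] for n_r in row_totals]


# Hypothetical 2 x 2 contingency table (rows: Variable A, columns: Variable B).
table = [
    [30, 10],
    [20, 40],
]
expected = expected_counts(table)
meets_minimum = min(min(row) for row in expected) >= 5

print(expected)       # [[20.0, 20.0], [30.0, 30.0]]
print(meets_minimum)  # True
```

The expected counts keep the same row and column totals as the observed table, which is a useful sanity check on any hand calculation.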
Test statistic. The test statistic is a chi-square random variable (χ²) defined
by the following equation.
χ² = Σ [ (O_{r,c} - E_{r,c})² / E_{r,c} ]
where O_{r,c} is the observed frequency count at level r of Variable A and level
c of Variable B, and E_{r,c} is the expected frequency count at level r of
Variable A and level c of Variable B.
P-value. The P-value is the probability of observing a sample statistic at least
as extreme as the test statistic, assuming the null hypothesis is true. Since
the test statistic is a chi-square, use the Chi-Square Distribution Calculator
to assess the probability associated with the test statistic. Use the degrees of
freedom computed above.
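Where a calculator is not at hand, the upper-tail probability can be approximated in plain Python. The sketch below is not the calculator the lesson refers to; it is one possible approach, using the standard series expansion of the regularized lower incomplete gamma function. The function name chi2_sf is hypothetical, and the series is only intended for the moderate statistic values seen in examples like these.

```python
import math


def chi2_sf(x, df):
    """Approximate P(chi-square with df degrees of freedom > x)."""
    if df <= 0:
        raise ValueError("df must be positive")
    if x <= 0:
        return 1.0
    a, z = df / 2, x / 2
    # Series: gamma(a, z) = z^a * e^(-z) * sum_{k>=0} z^k / (a * (a+1) * ... * (a+k))
    term = 1.0 / a
    total = term
    k = 0
    while term > total * 1e-16:
        k += 1
        term *= z / (a + k)
        total += term
    # Regularized lower incomplete gamma P(a, z), computed in log space.
    lower = math.exp(a * math.log(z) - z - math.lgamma(a) + math.log(total))
    return 1.0 - lower


print(round(chi2_sf(16.2, 2), 4))   # 0.0003
print(round(chi2_sf(3.841, 1), 3))  # 0.05
```

Because the result is formed as one minus a value close to 1, very small P-values lose relative precision; a statistical library is a better choice for serious work.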
Interpret Results
If the sample findings are unlikely, given the null hypothesis, the researcher
rejects the null hypothesis. Typically, this involves comparing the P-value to
the significance level, and rejecting the null hypothesis when the P-value is
less than the significance level.
Test Your Understanding of This Lesson
Problem
A public opinion poll surveyed a simple random sample of 1000 voters.
Respondents were classified by gender (male or female) and by voting preference
(Republican, Democrat, or Independent). Results are shown in the contingency
table below.
                        Voting Preferences
               Republican   Democrat   Independent   Row total
Male                  200        150            50         400
Female                250        300            50         600
Column total          450        450           100        1000
Is there a gender gap? Do the men's voting preferences differ significantly from
the women's preferences? Use a 0.05 level of significance.
Solution
The solution to this problem takes four steps: (1) state the hypotheses, (2)
formulate an analysis plan, (3) analyze sample data, and (4) interpret results.
We work through those steps below:
State the hypotheses. The first step is to state the null hypothesis and an
alternative hypothesis.
H0: Gender and voting preferences are independent.
Ha: Gender and voting preferences are not independent.
Formulate an analysis plan. For this analysis, the significance level is 0.05.
Using sample data, we will conduct a chi-square test for independence.
Analyze sample data. Applying the chi-square test for independence to sample
data, we compute the degrees of freedom, the expected frequency counts, and the
chi-square test statistic. Based on the chi-square statistic and the degrees of
freedom, we determine the P-value.
DF = (r - 1) * (c - 1) = (2 - 1) * (3 - 1) = 2
E_{r,c} = (n_r * n_c) / n
E_{1,1} = (400 * 450) / 1000 = 180000/1000 = 180
E_{1,2} = (400 * 450) / 1000 = 180000/1000 = 180
E_{1,3} = (400 * 100) / 1000 = 40000/1000 = 40
E_{2,1} = (600 * 450) / 1000 = 270000/1000 = 270
E_{2,2} = (600 * 450) / 1000 = 270000/1000 = 270
E_{2,3} = (600 * 100) / 1000 = 60000/1000 = 60
χ² = Σ [ (O_{r,c} - E_{r,c})² / E_{r,c} ]
χ² = (200 - 180)²/180 + (150 - 180)²/180 + (50 - 40)²/40
+ (250 - 270)²/270 + (300 - 270)²/270 + (50 - 60)²/60
χ² = 400/180 + 900/180 + 100/40 + 400/270 + 900/270 + 100/60
χ² = 2.22 + 5.00 + 2.50 + 1.48 + 3.33 + 1.67 = 16.2
where DF is the degrees of freedom, r is the number of levels of gender, c is
the number of levels of the voting preference, n_r is the number of observations
from level r of gender, n_c is the number of observations from level c of voting
preference, n is the number of observations in the sample, E_{r,c} is the
expected frequency count when gender is level r and voting preference is level
c, and O_{r,c} is the observed frequency count when gender is level r and voting
preference is level c.
The P-value is the probability that a chi-square statistic having 2 degrees of
freedom is more extreme than 16.2.
We use the Chi-Square Distribution Calculator to find P(χ² > 16.2) = 0.0003.
Interpret results. Since the P-value (0.0003) is less than the significance
level (0.05), we reject the null hypothesis. Thus, we conclude that there is a
relationship between gender and voting preference.
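The whole solution can be retraced in a short Python sketch using the counts from the contingency table. The variable names are illustrative, and the P-value line uses the closed form exp(-χ²/2), which holds only because this problem has exactly 2 degrees of freedom; it is not a general method.

```python
import math

# Observed counts: rows are Male, Female; columns are Republican, Democrat, Independent.
observed = [
    [200, 150, 50],
    [250, 300, 50],
]
alpha = 0.05

row_totals = [sum(row) for row in observed]
column_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Degrees of freedom: (r - 1) * (c - 1)
df = (len(row_totals) - 1) * (len(column_totals) - 1)

# Expected counts: (row total * column total) / n
expected = [[n_r * n_c / n for n_c in column_totals] for n_r in row_totals]

# Test statistic: sum over all cells of (O - E)^2 / E
chi_square = sum(
    (o - e) ** 2 / e
    for observed_row, expected_row in zip(observed, expected)
    for o, e in zip(observed_row, expected_row)
)

# Assumption: closed-form upper tail, valid only when df == 2.
p_value = math.exp(-chi_square / 2)
reject_null = p_value < alpha

print(df, round(chi_square, 1), round(p_value, 4), reject_null)
# 2 16.2 0.0003 True
```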
Note: If you use this approach on an exam, you may also want to mention why this
approach is appropriate. Specifically, the approach is appropriate because the
sampling method was simple random sampling, each population was more than 10
times larger than its respective sample, the variables under study were
categorical, and the expected frequency count was at least 5 in each cell of the
contingency table.