Unit 11
Chi-Square Analysis
Structure
11.1 Introduction
Objectives
11.2 A Chi-square Test for the Goodness of Fit
11.3 A Chi-square Test for the Independence of Variables
11.4 A Chi-square Test for the Equality of More than Two Population Proportions
11.5 Case Study
11.6 Summary
11.7 Glossary
11.8 Terminal Questions
11.9 Answers
11.10 References
11.1 Introduction
In the last unit, we discussed the Z test for the equality of two population proportions.
When there are more than two populations and we want to test the equality of
all their proportions simultaneously, the Z test cannot be used, because it
compares only two proportions at a time. In such a situation,
the chi-square test comes to our rescue and carries out the test in one go.
The chi-square test is widely used in research. It requires data in the form
of frequencies. Data expressed as percentages
or proportions can also be used, provided they can be converted into frequencies.
The majority of the applications of chi-square (χ²) are with discrete data. The
test can also be applied to continuous data, provided the data are grouped into
categories and tabulated in such a way that the chi-square test may be applied.
Some of the important properties of the chi-square distribution are:
- Unlike the normal and t distributions, the chi-square distribution is not
  symmetric.
- The values of a chi-square are greater than or equal to zero.
- The shape of a chi-square distribution depends upon the degrees of
  freedom. With the increase in degrees of freedom, the distribution tends
  towards the normal distribution.
Sikkim Manipal University
Research Methodology
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
Where,
O_i = Observed frequency of the ith cell
E_i = Expected frequency of the ith cell
k = Total number of cells
k − 1 = degrees of freedom
Compare the sample value of the statistic as obtained in previous step
with the critical value at a given level of significance and make the decision.
A goodness of fit test is a statistical test of how well the observed data
support an assumption about the distribution of a population. In other words, the test
examines how well an assumed distribution fits the data. Researchers often
assume that a sample is drawn from a normal distribution or some other
distribution of interest, so a test of how well that distribution fits the given
data is frequently of interest.
Consider, for example, the case of the multinomial experiment which is
the extension of a binomial experiment. In the multinomial experiment, the
number of the categories k is greater than 2. Further, a data point can fall into
one of the k categories and the probability of the data point falling in the ith
category is a constant and is denoted by pi where i = 1, 2, 3, 4, ..., k. In summary,
a multinomial experiment has the following features:
- There is a fixed number of trials.
- The trials are statistically independent.
- All the possible outcomes of a trial are classified into one of several
  categories.
- The probabilities for the different categories remain constant for each
  trial.
Consider, as an example, a respondent who can fall into any one of
four non-overlapping income categories. Let the probabilities that the respondent
falls into each of the four groups be denoted by the four parameters p1,
p2, p3 and p4. The multinomial distribution with these parameters,
and with n the number of people in a random sample, specifies the probability of
any combination of the cell counts.
Given such a situation, we may use the multinomial distribution to test how
well the data fit the assumed probabilities p1, p2, ..., pk of falling into the k
cells. The hypothesis to be tested is:
Flavour               Vanilla   Chocolate   Strawberry   Mango
Observed frequency    120       40          18           22
Solution:
Let
pv : proportion of customers preferring vanilla flavour.
pc : proportion of customers preferring chocolate flavour.
ps : proportion of customers preferring strawberry flavour.
pm : proportion of customers preferring mango flavour.
H0 : pv = 0.62, pc = 0.18, ps = 0.12, pm = 0.08
H1 : Proportions are not that specified in the null hypothesis
The expected frequencies corresponding to the various flavors under the
assumption that the null hypothesis is true are:
Vanilla = 200 × 0.62 = 124
Chocolate = 200 × 0.18 = 36
Strawberry = 200 × 0.12 = 24
Mango = 200 × 0.08 = 16
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
Flavour      O (Observed     E (Expected     O − E   (O − E)²   (O − E)²/E
             Frequencies)    Frequencies)
Vanilla      120             124             −4      16         0.129
Chocolate    40              36              4       16         0.444
Strawberry   18              24              −6      36         1.500
Mango        22              16              6       36         2.250
Total                                                           4.323
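The arithmetic in the flavour example can be checked with a short Python sketch. The function name `chi_square_gof` and the constant `CRITICAL_5PCT_DF3` are illustrative names, not part of the unit. The value 7.815 is the standard tabulated chi-square for 3 degrees of freedom at the 5 per cent level; that level is an assumption made here for illustration.

```python
def chi_square_gof(observed, probabilities):
    """Return the chi-square statistic and the expected frequencies."""
    n = sum(observed)
    expected = [n * p for p in probabilities]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, expected


# Observed counts and hypothesised proportions from the flavour example.
observed = [120, 40, 18, 22]  # vanilla, chocolate, strawberry, mango
h0_proportions = [0.62, 0.18, 0.12, 0.08]

statistic, expected = chi_square_gof(observed, h0_proportions)
degrees_of_freedom = len(observed) - 1  # k - 1

# Assumption: 5 per cent level of significance (tabulated value for df = 3).
CRITICAL_5PCT_DF3 = 7.815
reject_h0 = statistic > CRITICAL_5PCT_DF3

print(f"chi-square = {statistic:.3f}, df = {degrees_of_freedom}, "
      f"reject H0: {reject_h0}")
```

Under the assumed 5 per cent level, the sample value stays below the critical value, so the sketch reports that the null hypothesis is not rejected.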
Day                    Monday   Tuesday   Wednesday   Thursday   Friday   Saturday   Sunday
Number of fatalities   31       20        20          22         22       29         36
Solution:
Let
p1 = Proportion of fatalities on Monday
p2 = Proportion of fatalities on Tuesday
p3 = Proportion of fatalities on Wednesday
p4 = Proportion of fatalities on Thursday
p5 = Proportion of fatalities on Friday
p6 = Proportion of fatalities on Saturday
p7 = Proportion of fatalities on Sunday
H0 : p1 = p2 = p3 = p4 = p5 = p6 = p7 = 1/7
Monday = 180 × 1/7 = 25.714
Tuesday = 180 × 1/7 = 25.714
Wednesday = 180 × 1/7 = 25.714
Thursday = 180 × 1/7 = 25.714
Friday = 180 × 1/7 = 25.714
Saturday = 180 × 1/7 = 25.714
Sunday = 180 × 1/7 = 25.714
Day         Observed          Expected          O − E     (O − E)²   (O − E)²/E
            Frequencies (O)   Frequencies (E)
Monday      31                25.714            5.286     27.942     1.087
Tuesday     20                25.714            −5.714    32.650     1.270
Wednesday   20                25.714            −5.714    32.650     1.270
Thursday    22                25.714            −3.714    13.794     0.536
Friday      22                25.714            −3.714    13.794     0.536
Saturday    29                25.714            3.286     10.798     0.420
Sunday      36                25.714            10.286    105.802    4.114
Total                                                                9.233

$\chi^2 = \sum \frac{(O - E)^2}{E} = 9.233$
Degrees of freedom = 7 − 1 = 6
Tabulated $\chi^2_6$ = 12.592
Since the sample chi-square value (9.233) is less than the tabulated χ² (12.592), there is
not enough evidence to reject the null hypothesis as shown in the figure below.
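The same test can be sketched in Python to see how much each day contributes to the statistic. The variable names are illustrative; the critical value 12.592 is the tabulated figure quoted in the text for 6 degrees of freedom.

```python
days = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
fatalities = [31, 20, 20, 22, 22, 29, 36]

n = sum(fatalities)        # total number of fatalities
expected = n / len(days)   # equal share per day under H0

# Contribution of each day to the chi-square statistic.
contributions = {
    day: (o - expected) ** 2 / expected
    for day, o in zip(days, fatalities)
}
statistic = sum(contributions.values())
degrees_of_freedom = len(days) - 1

CRITICAL_DF6 = 12.592      # tabulated value quoted in the text
reject_h0 = statistic > CRITICAL_DF6
largest_day = max(contributions, key=contributions.get)

print(f"chi-square = {statistic:.3f}, df = {degrees_of_freedom}")
print(f"largest contribution: {largest_day}, reject H0: {reject_h0}")
```

Listing the contributions separately shows which category drives the statistic, even when the overall test does not reject the null hypothesis.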
Self-Assessment Questions
1. For the application of a chi-square test, the expected frequency in each
cell should be at least five. (True/False)
2. The sample value of the chi-square can be negative. (True/False)
3. If there are k categories of data, the degree of freedom would be _______.
First Classification    Second Classification Category    Total
Category                1      2      3      4
1                       O11    O12    O13    O14          R1
2                       O21    O22    O23    O24          R2
3                       O31    O32    O33    O34          R3
Total                   C1     C2     C3     C4           n
Assuming that there are r rows and c columns, the count in the cell
corresponding to the ith row and the jth column is denoted by Oij, where i = 1, 2,
..., r and j = 1, 2, ..., c. The total for row i is denoted by Ri whereas that
corresponding to column j is denoted by Cj. The total sample size is given by n,
which is also the sum of all the r row totals or the sum of all the c column totals.
The hypothesis test for independence is:
H0 : Row and column variables are independent of each other.
H1 : Row and column variables are not independent.
The hypothesis is tested using a chi-square test statistic for independence
given by:
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
Where,
$$E_{ij} = \frac{R_i C_j}{n}$$
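As a minimal sketch of how the expected frequencies follow from the row and column totals, the Python function below applies the rule "row total times column total, divided by n" to every cell. The function name `expected_frequencies` and the 2 × 2 table are hypothetical and serve only to illustrate the formula.

```python
def expected_frequencies(table):
    """Expected cell counts under independence: E_ij = R_i * C_j / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Hypothetical 2 x 2 table of observed counts (not taken from the unit).
hypothetical_table = [[10, 20],
                      [30, 40]]

expected = expected_frequencies(hypothetical_table)
for row in expected:
    print(row)
```

Because the expected counts are built from the marginal totals, each row and column of the expected table sums to the same total as in the observed table.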
                Training
Performance     Intensive   Good   Average   Total
Row 1           100         150    40        290
Row 2           100         100    100       300
Row 3           50          80     150       280
Total           250         330    290       870
Solution:
H0 : Attribute performance and the training are independent.
H1 : Attribute performance and the training are not independent.
The expected frequencies corresponding to the ith row and the jth column in
the contingency table are denoted by Eij, where i = 1, 2, 3 and j = 1, 2, 3.
E1,1 = (290 × 250) / 870 = 83.33
E1,2 = (290 × 330) / 870 = 110.00
E1,3 = (290 × 290) / 870 = 96.67
E2,1 = (300 × 250) / 870 = 86.21
E2,2 = (300 × 330) / 870 = 113.79
E2,3 = (300 × 290) / 870 = 100.00
E3,1 = (280 × 250) / 870 = 80.46
E3,2 = (280 × 330) / 870 = 106.21
E3,3 = (280 × 290) / 870 = 93.33
Row, Column   Oij    Eij      (Oij − Eij)²   (Oij − Eij)²/Eij
1,1           100    83.33    277.89         3.335
1,2           150    110.00   1600.00        14.545
1,3           40     96.67    3211.49        33.221
2,1           100    86.21    190.16         2.21
2,2           100    113.79   190.16         1.671
2,3           100    100.00   0              0.000
3,1           50     80.46    927.81         11.53
3,2           80     106.21   686.96         6.468
3,3           150    93.33    3211.49        34.41
Total                                        107.39

Sample $\chi^2 = \sum_{i=1}^{3} \sum_{j=1}^{3} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = 107.39$
The critical value of the chi-square at 5 per cent level of significance with
4 degrees of freedom is given by 9.49. The sample value of the chi-square falls
in the rejection region as shown in the figure below.
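The training example can be reproduced with a short Python sketch. The function name `chi_square_independence` is illustrative; the observed counts and the critical value 9.49 are those given in the text.

```python
def chi_square_independence(table):
    """Return the chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Observed counts: rows are performance levels, columns are training types.
training_table = [[100, 150, 40],
                  [100, 100, 100],
                  [50, 80, 150]]

statistic, df = chi_square_independence(training_table)

CRITICAL_5PCT_DF4 = 9.49  # critical value quoted in the text
reject_h0 = statistic > CRITICAL_5PCT_DF4

print(f"chi-square = {statistic:.2f}, df = {df}, reject H0: {reject_h0}")
```

The sketch keeps the expected frequencies unrounded, so its result may differ from 107.39 in the last decimal place; the decision is unaffected.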
Shift    Good    Defective   Total
1        900     130         1030
2        700     170         870
3        400     200         600
Total    2000    500         2500
Is there any association between the shift and the quality of the parts
produced? Use a 0.05 level of significance.
Solution:
H0 : There is no association between the shift and the quality of parts
produced.
H1 : There is an association between the shift and quality of parts.
The computations of the expected frequencies corresponding to the ith
row and the jth column of the contingency table are shown below: (i = 1, 2, 3)
and (j = 1, 2).
E1,1 = (1,030 × 2,000) / 2,500 = 824
E1,2 = (1,030 × 500) / 2,500 = 206
E2,1 = (870 × 2,000) / 2,500 = 696
E2,2 = (870 × 500) / 2,500 = 174
E3,1 = (600 × 2,000) / 2,500 = 480
E3,2 = (600 × 500) / 2,500 = 120
Oij     Eij    (Oij − Eij)²   (Oij − Eij)²/Eij
900     824    5776           7.010
130     206    5776           28.039
700     696    16             0.023
170     174    16             0.092
400     480    6400           13.333
200     120    6400           53.333
Total                         101.83

$\chi^2 = \sum_{i=1}^{3} \sum_{j=1}^{2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = 101.83$
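To complete the comparison at the 0.05 level, the sketch below recomputes the statistic for the shift data, splits it by shift, and compares it with a critical value. The value 5.991 is the standard tabulated chi-square for (3 − 1)(2 − 1) = 2 degrees of freedom at the 5 per cent level; it is supplied here as an assumption, and the variable names are illustrative.

```python
# Observed counts per shift: [good, defective].
shift_table = [[900, 130],
               [700, 170],
               [400, 200]]

row_totals = [sum(row) for row in shift_table]
col_totals = [sum(col) for col in zip(*shift_table)]
n = sum(row_totals)

# Chi-square contribution of each shift (row).
row_contributions = []
for row, r in zip(shift_table, row_totals):
    contribution = sum(
        (o - r * c / n) ** 2 / (r * c / n) for o, c in zip(row, col_totals)
    )
    row_contributions.append(contribution)

statistic = sum(row_contributions)
df = (len(shift_table) - 1) * (len(shift_table[0]) - 1)

CRITICAL_5PCT_DF2 = 5.991  # assumption: standard table value for df = 2
reject_h0 = statistic > CRITICAL_5PCT_DF2

# Share of defective parts in each shift, for interpretation.
defect_rates = [row[1] / total for row, total in zip(shift_table, row_totals)]

print(f"chi-square = {statistic:.2f}, df = {df}, reject H0: {reject_h0}")
print("defect rates:", [f"{rate:.1%}" for rate in defect_rates])
```

Printing the defect rate per shift alongside the statistic helps interpret a significant result, since the chi-square value alone does not say which shift differs.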
Self-Assessment Questions
4. In a cross table, where chi-square test is applied the null hypothesis is
that the two variables are related. (True/False)
5. The expected frequencies in a cross table are computed under the
assumption that null hypothesis is true. (True/False)
6. If any cell has a zero frequency, the chi-square cannot be applied. (True/
False)
7. The sum of each row and each column for the observed and expected
frequencies need not be equal. (True/False)
Solution:
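Let p1, p2, p3 and p4 denote the proportions of incorrect transactions for
Client 1, Client 2, Client 3 and Client 4 respectively.
H0 : p1 = p2 = p3 = p4
H1 : Not all proportions are the same.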
The observed data in the problem can be rewritten as:
Transactions             Client 1   Client 2   Client 3   Client 4   Total
Incorrect transactions   21         25         30         40         116
Correct transactions     59         75         60         70         264
Total                    80         100        90         110        380
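Combined proportion of incorrect transactions = 116/380 = 0.305, so the
combined proportion of correct transactions is 0.695. The expected frequencies are:
                         Client 1            Client 2             Client 3             Client 4              Total
Incorrect transactions   80 × 0.305 = 24.4   100 × 0.305 = 30.5   90 × 0.305 = 27.45   110 × 0.305 = 33.55   115.9
Correct transactions     80 × 0.695 = 55.6   100 × 0.695 = 69.5   90 × 0.695 = 62.55   110 × 0.695 = 76.45   264.1
Total                    80                  100                  90                   110                   380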
In fact, the sum of each row/column in both the observed and expected
frequency tables should be the same. Here, a small discrepancy is found because
of rounding error. It can be easily verified that the expected frequencies
in each cell would be the same using the formula $E_{ij} = \frac{R_i C_j}{n}$ already
explained. Now the value of the chi-square statistic can be calculated as:
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
Eij (incorrect transactions, Clients 1 to 4): 24.4, 30.5, 27.45, 33.55
Eij (correct transactions, Clients 1 to 4): 55.6, 69.5, 62.55, 76.45
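The calculation for the four clients can be sketched in Python as follows. The sketch uses the unrounded combined proportion 116/380 rather than 0.305, and it checks that the expected counts agree with the row-total times column-total rule. The variable names are illustrative, and the critical value 7.815 (3 degrees of freedom, 5 per cent level) is an assumption supplied for illustration.

```python
incorrect = [21, 25, 30, 40]   # incorrect transactions, Clients 1 to 4
totals = [80, 100, 90, 110]    # transactions sampled per client
correct = [t - i for t, i in zip(totals, incorrect)]

# Combined (pooled) estimate of the proportion of incorrect transactions.
pooled = sum(incorrect) / sum(totals)

expected_incorrect = [t * pooled for t in totals]
expected_correct = [t * (1 - pooled) for t in totals]

# The same expected counts via E_ij = R_i * C_j / n.
via_formula = [sum(incorrect) * t / sum(totals) for t in totals]

statistic = sum(
    (o - e) ** 2 / e
    for o, e in zip(incorrect + correct, expected_incorrect + expected_correct)
)
df = (2 - 1) * (len(totals) - 1)

CRITICAL_5PCT_DF3 = 7.815      # assumption: standard table value for df = 3
reject_h0 = statistic > CRITICAL_5PCT_DF3

print(f"pooled proportion = {pooled:.3f}")
print(f"chi-square = {statistic:.3f}, df = {df}, reject H0: {reject_h0}")
```

Working with the unrounded proportion removes the small discrepancy in the row totals of the expected table noted above.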
Self-Assessment Questions
8. If there are 3 rows and two columns, the degrees of freedom for the chi-square test are ________.
9. The combined estimate of proportion is obtained under the assumption
that __________ is true.
10. To test the equality of three population proportions, the alternative
hypothesis is written as H1 : p1 = p2 = p3. (True/False)
restaurants observes that the total sales revenues of the restaurants have
been more or less stagnant, growing at a rate of only 2 per cent for the last
three years. A meeting of the senior management personnel was called to
discuss the issue. Some of them were of the opinion that young customers
in the age group of 18–35 were switching to fast food. Further, they were of
the view that the trend is mainly among people belonging to the high-income group and to families where both partners were economically employed.
In the series of meetings held by the top management, it was decided to
launch a chain of fast food joints in states where they were already present.
However, before starting the fast food joints, they had a survey conducted to
understand the preference of people for fast food. A sample of 100
respondents was chosen.
Data was collected on preference for fast food on an interval scale: the
respondents were asked to rate their preference for fast food on a 5-point scale, where 1 = not at all preferred, 2 = not preferred, 3 = neutral,
4 = preferred, and 5 = very much preferred. Further, the variable preference
was redefined as "not preferred" for those having a score of 1–3, and "preferred"
for those having a score of 4–5. The actual age of the respondents was
recorded and divided into two categories. Those less than or equal to 40 years of
age were treated as younger respondents, whereas those above 40 years of
age were treated as older respondents. There were three income
categories: low income (household with monthly income less than Rs 25,000/-),
middle income (household with monthly income of Rs 25,000/- or more but
less than Rs 50,000/-), and high income (household with monthly income more
than Rs 50,000/-). The data on gender of the respondents was also taken. A
cross-tabulation was carried out of preference for fast food with age,
gender and income. The results of the cross tables are reported below in
Table 1 to Table 3.
Table 1 Cross-tabulation of Preference for Fast Food with Age
                         Age                                          Total
                         Younger Respondents    Older Respondents
Not preferred   Count    24                     30                    54
Preferred       Count    35                     11                    46
Total           Count    59                     41                    100
Table 2 Cross-tabulation of Preference for Fast Food with Gender
                         Male    Female    Total
Not preferred   Count    30      24        54
Preferred       Count    23      23        46
Total           Count    53      47        100
Table 3 Cross-tabulation of Preference for Fast Food with Income
                         Low Income    Middle Income    High Income    Total
Not preferred   Count    22            19               13             54
Preferred       Count    4             10               32             46
Total           Count    26            29               45             100
Discussion Questions
1. Using the data as given in Tables 1–3, examine the hypothesis that
preference for fast food is related to (i) age, (ii) gender, and (iii) income.
You may use 5 per cent level of significance.
2. Write a summary of the findings.
[Hint: To examine the hypothesis for the relationship for preference for
fast food with the age, the following hypothesis would be tested.
H0 : Preference for fast food is independent of age
H1 : Preference for fast food is related to age
The expected frequencies would be obtained as in Section 11.3. Using
this, the value of chi-square can be computed and the hypothesis be tested.
Similarly the other two cases can be handled.]
11.6 Summary
Let us recapitulate the important concepts discussed in this unit:
The chi-square test has a variety of applications in research. Chi-square is
a non-symmetrical distribution taking non-negative values.
11.7 Glossary
Degrees of freedom: These are given by (r − 1)(c − 1) for a contingency
table.
Chi-square distribution: This is a non-symmetric distribution taking only
non-negative values.
Non-symmetric distribution: Those distributions that are skewed towards
any one tail of the distribution.
Type of cigarette   Male   Female   Total
1                   25     30       55
2                   40     15       55
3                   30     10       40
Total               95     55       150
Test whether the type of cigarette smoked and the sex are independent.
4. A survey was carried out in a state among the doctors belonging to the
rural health service cadre (500 doctors) and among the medical education
directorate cadre (300 teaching doctors). They were asked the question,
"Would it be acceptable to you, if the government proposes to hire all the
doctors on a fixed period contractual basis?" The doctors were to answer
either "Acceptable" or "Not Acceptable". There was no third category,
"Undecided". The following data was compiled in a cross-tabulated
format:
Doctors          Acceptable   Not Acceptable   Total
Rural Cadre      195          305              500
Teaching Cadre   140          160              300
Total            335          465              800
Digit       0       1       2     3     4       5     6       7     8     9     Total
Frequency   1,026   1,107   997   966   1,075   933   1,107   972   964   853   10,000
Test whether the digits may be taken to occur equally in the directory.
11.9 Answers
Answers to Self-Assessment Questions
1. True
2. False
3. k − 1
4. False
5. True
6. True
7. False
8. 2
9. Null hypothesis
10. False
11.10 References
Chawla, D. and Sondhi, N. (2011). Research Methodology: Concepts and
Cases. New Delhi: Vikas Publishing House.
Kothari, C. R. (1990). Research Methodology: Methods and Techniques.
New Delhi: Wiley Eastern.
Zikmund, William G. (2000). Business Research Methods. Fort Worth:
Dryden Press.