
Syllabus sections 22: Goodness of fit test

EXERCISES WITH SOLUTIONS

EXERCISE 1
A plumber has noticed a possible association between the type of sink chosen by his clients and
their area of residence. To confirm or reject his intuition, he selects a sample of 500 clients and
classifies them in the contingency table below:

Sink \ Area      Downtown   Outskirt   Suburb
Cabinet style        60         50        80
Pedestal             50         40        70
Wall mounted         40         60        50

a) Correctly state the hypotheses we need to verify.

H0 : there is NO ASSOCIATION between “sink” and “Area”


H1 : there exists an ASSOCIATION between “sink” and “Area”

b) Is the plumber’s intuition correct? Answer by using α = 0.01.

We have to run a chi-square test of independence. The test statistic is

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}, \qquad \text{where } E_{ij} = \frac{R_i C_j}{n}$$

is the expected frequency of cell (i, j) under the null hypothesis. Under H0 this statistic has approximately
a chi-square distribution with (r-1)(c-1) degrees of freedom. With α = 0.01, r = 3 and c = 3,
the critical value is $\chi^2_{4,\,0.01} = 13.28$.
Therefore, the null hypothesis is rejected only if the observed value of the test statistic is larger than
13.28.

The “expected frequencies” are the absolute frequencies we would expect if the characters A and
B were independent:

$$E_{ij} = \frac{R_i C_j}{n} = \frac{(\text{total of the } i\text{-th row})\,(\text{total of the } j\text{-th column})}{\text{sample size}}$$

The following table shows the expected frequencies under the null hypothesis:

Sink \ Area      Downtown   Outskirt   Suburb
Cabinet style        57         57        76
Pedestal             48         48        64
Wall mounted         45         45        60

Let us compute the value of the test statistic:

$$\chi^2 = \frac{(60-57)^2}{57} + \frac{(50-57)^2}{57} + \frac{(80-76)^2}{76} + \frac{(50-48)^2}{48} + \frac{(40-48)^2}{48} + \frac{(70-64)^2}{64} + \frac{(40-45)^2}{45} + \frac{(60-45)^2}{45} + \frac{(50-60)^2}{60} = 10.4294$$

Since 10.4294 < 13.28, we do not reject H0: at the 1% significance level the sample does not provide
sufficient evidence of an association between “sink” and “Area” (p-value 0.0338).
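
The hand computation for Exercise 1 can be cross-checked with a short script. This is only a minimal sketch: the function names `expected_frequencies` and `chi_square_statistic` are illustrative and not part of the exercise, and the p-value line relies on a closed form of the chi-square upper tail that holds only for 4 degrees of freedom.

```python
import math

# Observed counts from Exercise 1 (rows: sink type, columns: area).
observed = [
    [60, 50, 80],   # Cabinet style
    [50, 40, 70],   # Pedestal
    [40, 60, 50],   # Wall mounted
]

def expected_frequencies(table):
    """E_ij = R_i * C_j / n, the expected counts under independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square_statistic(table):
    """Sum of (O_ij - E_ij)^2 / E_ij over all cells."""
    expected = expected_frequencies(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )

stat = chi_square_statistic(observed)
# Upper tail of a chi-square with exactly 4 degrees of freedom:
# P(X > x) = exp(-x/2) * (1 + x/2). Not valid for other degrees of freedom.
p_value = math.exp(-stat / 2) * (1 + stat / 2)
print(stat, p_value)
```

The printed statistic may differ from the hand-computed value in the last decimal place because of rounding; the decision (statistic below 13.28, p-value above 0.01) is unaffected.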

EXERCISE 2
A plumber has noticed a possible association between the type of sink chosen by his clients and
their area of residence. To confirm or reject his intuition, he selects a sample of 1000 clients and
classifies them in the contingency table below:

Sink \ Area      Downtown   Outskirt   Suburb
Cabinet style       140        120       180
Pedestal            120        100       100
Wall mounted         90         80        70

a) Correctly state the hypotheses we need to verify.

H0 : there is NO ASSOCIATION between “sink” and “Area”


H1 : there exists an ASSOCIATION between “sink” and “Area”

b) Is the plumber’s intuition correct? Answer by using α = 0.025.

We have to run a chi-square test of independence. The test statistic is

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}, \qquad \text{where } E_{ij} = \frac{R_i C_j}{n}$$

is the expected frequency of cell (i, j) under the null hypothesis. Under H0 this statistic has approximately
a chi-square distribution with (r-1)(c-1) degrees of freedom. With α = 0.025, r = 3 and c = 3,
the critical value is $\chi^2_{4,\,0.025} = 11.14$.

Therefore, the null hypothesis is rejected only if the observed value of the test statistic is larger than
11.14.

The “expected frequencies” are the absolute frequencies we would expect if the characters A and
B were independent:

$$E_{ij} = \frac{R_i C_j}{n} = \frac{(\text{total of the } i\text{-th row})\,(\text{total of the } j\text{-th column})}{\text{sample size}}$$

The following table shows the expected frequencies under the null hypothesis:

Sink \ Area      Downtown   Outskirt   Suburb
Cabinet style       154        132       154
Pedestal            112         96       112
Wall mounted         84         72        84

Let us compute the value of the test statistic:

$$\chi^2 = \frac{(140-154)^2}{154} + \frac{(120-132)^2}{132} + \frac{(180-154)^2}{154} + \frac{(120-112)^2}{112} + \frac{(100-96)^2}{96} + \frac{(100-112)^2}{112} + \frac{(90-84)^2}{84} + \frac{(80-72)^2}{72} + \frac{(70-84)^2}{84} = 12.4279$$

Since 12.4279 > 11.14, we reject H0: at the 2.5% significance level the data provide evidence of an
association between “sink” and “Area” (p-value 0.0144).
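
The critical values 11.14 and 13.28 are read from a chi-square table. As a rough cross-check, they can be recovered numerically. The sketch below is an illustration, not part of the exercise: the helper names `chi2_sf_4df` and `critical_value_4df` are hypothetical, the tail formula is valid only for 4 degrees of freedom, and the quantile is found by simple bisection.

```python
import math

def chi2_sf_4df(x):
    """Upper-tail probability P(X > x) for a chi-square with 4 d.f. only."""
    return math.exp(-x / 2) * (1 + x / 2)

def critical_value_4df(alpha, lo=0.0, hi=100.0, tol=1e-8):
    """Find x with P(X > x) = alpha by bisection (the tail decreases in x)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if chi2_sf_4df(mid) > alpha:
            lo = mid   # tail probability still above alpha: move right
        else:
            hi = mid
    return (lo + hi) / 2

crit_025 = critical_value_4df(0.025)   # used in Exercise 2
crit_01 = critical_value_4df(0.01)     # used in Exercise 1
print(round(crit_025, 2), round(crit_01, 2))
```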

EXERCISE 3

On a sample of 100 workers, we want to find a possible link between the average contract duration (years)
and the industry type (A=Automotive, NA=Not Automotive). This contingency table shows the results

Industry type \ Contract duration (years)     2     5    10   Total
A                                            10    15    20      45
NA                                           15    10    30      55
Total                                        25    25    50     100

a) State the two hypotheses we should test.

Chi-square test for independence.

H0 : there is NO ASSOCIATION between “Contract duration” and “Industry type”


H1 : there exists an ASSOCIATION between “Contract duration” and “Industry type”

b) Write the formula of the statistic we should use in the test.

The test statistic is:

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}, \qquad \text{where } E_{ij} = \frac{R_i C_j}{n};$$

under H0 it is distributed according to a chi-square distribution with (r − 1)(c − 1) degrees of freedom.

c) Make your final decision about hypothesis with a level of significance of 5%.

With a significance level of 5%, r = 2 and c = 3, we have $\chi^2_{2;\,0.05} = 5.99$.

We reject H0 if the observed value of the test statistic is > 5.99.

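Expected frequencies $E_{ij}$:

Industry type \ Contract duration (years)        2        5       10
A                                            11.25    11.25    22.5
NA                                           13.75    13.75    27.5

Contributions $(O_{ij}-E_{ij})^2/E_{ij}$ of each cell:

Industry type \ Contract duration (years)        2        5       10
A                                           0.1389   1.2500   0.2778
NA                                          0.1136   1.0227   0.2273

The sum of the six contributions, i.e. the observed value of the test statistic, is 3.0303.
Since 3.0303 < 5.99, we do not reject H0.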

EXERCISE 4

On a sample of 100 workers, we want to find a possible link between the age and the means of transport
used to reach the work place. This contingency table shows the results

Age \ Transport    Private car   Public transport   Others   Total
18 < age < 30           10              35             10      55
≥ 30                    25              15              5      45
Total                   35              50             15     100

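a) State the two hypotheses we should test.

Chi-square test. The objective is to test independence.

H0 : there is NO ASSOCIATION between “Age” and “Transport” (the two variables are independent)
H1 : there exists an ASSOCIATION between “Age” and “Transport” (the two variables are dependent)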

b) Write the formula of the statistic we should use in the test.

The test statistic is:

$$C = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}, \qquad \text{where } E_{ij} = \frac{R_i C_j}{n}.$$

Under H0 the test statistic follows a chi-square distribution with (r − 1)(c − 1) degrees of freedom; in
our case r = 2 and c = 3, so the number of degrees of freedom is 2.

c) Calculate (approximately) the p-value and make your final decision about hypothesis with a level of
significance of 5%.
To derive the p-value, the observed value of the test statistic is required.

Expected frequencies $E_{ij}$:

Age \ Transport    Private car   Public transport   Others
18 < age < 30         19.25            27.5           8.25
≥ 30                  15.75            22.5           6.75

Contributions $(O_{ij}-E_{ij})^2/E_{ij}$ of each cell:

Age \ Transport    Private car   Public transport   Others
18 < age < 30        4.4448           2.0455         0.3712
≥ 30                 5.4325           2.5000         0.4537

The sum is equal to 15.2477; the p-value is:

$$P(C > 15.2477 \mid H_0) < 0.005$$

because 15.2477 exceeds the tabulated value $\chi^2_{2;\,0.005} = 10.60$. Since the p-value is below 5%, we reject H0.
The two variables are dependent.
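
With 2 degrees of freedom the chi-square upper tail has the simple closed form exp(-x/2), so the approximate p-value read from the table can be checked exactly. The lines below are a small illustrative sketch that takes the per-cell contributions from the table above as given; the closed form is specific to 2 degrees of freedom.

```python
import math

# Per-cell contributions (O - E)^2 / E from the Exercise 4 table.
contributions = [4.4448, 2.0455, 0.3712, 5.4325, 2.5000, 0.4537]
stat = sum(contributions)

# For a chi-square with 2 degrees of freedom, P(X > x) = exp(-x / 2).
p_value = math.exp(-stat / 2)

print(round(stat, 4), p_value)
print("reject H0 at 5%" if p_value < 0.05 else "do not reject H0 at 5%")
```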

EXERCISE 5

In order to evaluate a possible association between the payment of the vehicle tax and the geographical
area, a survey of 400 users was conducted. The results are shown in the following table:

Payment vehicle tax \ Geographical Area    Center-North   South-Islands
No                                               40              70
Yes                                             160             130

a) Specify the hypothesis to verify.


H0: independence between payment of the vehicle tax and the geographical area

H1: there exists dependence between payment of the vehicle tax and the geographical area

b) Compute the p-value of the test. For a significance level equal to α=0.05, what is the decision
concerning the hypothesis of the test?

Starting from the marginal frequencies $O_{i\cdot}$ and $O_{\cdot j}$ and the sample size n:

Payment vehicle tax    Center-North   South-Islands   Oi.
No                           40              70        110
Yes                         160             130        290
O.j                         200             200        400

we compute the expected frequencies $E_{ij} = \dfrac{O_{i\cdot}\, O_{\cdot j}}{n}$:

Payment vehicle tax    Center-North   South-Islands
No                           55              55
Yes                         145             145

The value of the test statistic is:

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}} = \frac{(40-55)^2}{55} + \frac{(70-55)^2}{55} + \frac{(160-145)^2}{145} + \frac{(130-145)^2}{145} = 11.2853$$

Under H0 this statistic has a chi-square distribution with (r-1)(c-1) = 1 degree of freedom.

Therefore, p-value = $\Pr(\chi^2_1 > 11.2853)$. From the tables, p-value < 0.005 (the exact
value, not available in the tables, is 0.0008). Since p-value < α = 0.05, we reject the null hypothesis:
there is evidence of dependence between payment of the vehicle tax and the geographical area. The same
decision holds for every significance level of 1% or more.
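
For a 2x2 table the statistic can also be obtained from the well-known shortcut n(ad - bc)^2 / (R1 R2 C1 C2), and with 1 degree of freedom the p-value can be written with the complementary error function. The sketch below applies both to the Exercise 5 data; the shortcut is an alternative to the cell-by-cell sum used in the solution, not the method of the text, and the variable names are illustrative.

```python
import math

# Observed 2x2 table from Exercise 5.
# Rows: No / Yes (payment); columns: Center-North / South-Islands.
a, b = 40, 70
c, d = 160, 130
n = a + b + c + d

# Shortcut valid only for 2x2 tables: n (ad - bc)^2 / (R1 * R2 * C1 * C2).
stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(stat / 2))

print(round(stat, 4), round(p_value, 4))
```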

EXERCISE 6

In order to evaluate the existence of a possible association between the payment of the Rai tax (i.e.
“canone Rai”) and the geographical area, a survey of 200 users was conducted. The results are shown in the
following table:

Rai tax \ Geographical Area    Center-North   South-Islands
Yes                                  80              65
No                                   20              35

a) Specify the hypothesis to verify.

H0: independence between payment of the Rai tax and the geographical area;

H1: there exists dependence between payment of the Rai tax and the geographical area

b) Compute the p-value of the test. For a significance level equal to α=0.01, what is the decision concerning
the hypothesis of the test?

Starting from the marginal frequencies $O_{i\cdot}$ and $O_{\cdot j}$ and the sample size n:

Rai tax    Center-North   South-Islands   Oi.
Yes              80              65        145
No               20              35         55
O.j             100             100        200

we compute the expected frequencies $E_{ij} = \dfrac{O_{i\cdot}\, O_{\cdot j}}{n}$:

Rai tax    Center-North   South-Islands
Yes             72.5            72.5
No              27.5            27.5

The value of the test statistic is:

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}} = \frac{(80-72.5)^2}{72.5} + \frac{(65-72.5)^2}{72.5} + \frac{(20-27.5)^2}{27.5} + \frac{(35-27.5)^2}{27.5}$$

$$= 0.7759 + 0.7759 + 2.0455 + 2.0455 = 5.6428$$

Under H0 this statistic has a chi-square distribution with (r-1)(c-1) = 1 degree of freedom.

Therefore, p-value = $\Pr(\chi^2_1 > 5.6428)$. From the tables, 0.01 < p-value < 0.025 (the
exact value, not available in the tables, is 0.0175). Since p-value > α = 0.01, we do not reject the null
hypothesis.

EXERCISE 7

Gianni is gambling with Andrea on the results of a die roll. Gianni thinks that the die used by Andrea is
unfair. Gianni carries out 60 rolls of the die and observes the following results.

Result 1 2 3 4 5 6

Observed frequency 15 5 5 15 10 10

In order to verify whether the die used by Andrea is really unfair, Gianni decides to run an appropriate
hypothesis test.
a) Specify the hypotheses to be verified.

If the die is fair, the probability of each outcome is 1/6, so over 60 rolls the expected frequency of
each outcome is $\frac{1}{6}\cdot 60 = 10$. We use a chi-square test to evaluate whether the deviation between the
observed frequencies ($O_i$) and the expected frequencies ($E_i$) is significantly different from 0. Hence:

H0: the die is fair,

H1: the die is unfair.

b) Compute the p-value of the test. Fix a significance level equal to α = 0.05 and decide about the hypotheses
previously specified.

Let us compute the statistic

$$\chi^2 = \sum_{i=1}^{6}\frac{(O_i-E_i)^2}{E_i} = \frac{(15-10)^2 + (5-10)^2 + (5-10)^2 + (15-10)^2 + (10-10)^2 + (10-10)^2}{10} = 10$$

which, under H0, follows approximately a $\chi^2$ distribution with 6 – 1 = 5 degrees of freedom.

Fix a significance level equal to α = 0.05.

The p-value of the test is: p-value = $\Pr(\chi^2_5 > 10)$.
From the tables, 0.05 < p-value < 0.1; therefore we do not reject the null hypothesis:
there is not sufficient empirical evidence to consider the die unfair (the exact p-value,
not given in the tables, is 0.0752).
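
The exact p-values quoted for the die exercises can be reproduced without statistical tables. The following is a minimal sketch, assuming integer degrees of freedom: the helper `chi2_sf` is a hypothetical name, and it uses the standard finite-sum expressions of the chi-square upper tail (one form for even and one for odd degrees of freedom).

```python
import math

def chi2_sf(x, df):
    """P(X > x) for a chi-square with integer df, via closed-form finite sums."""
    z = x / 2
    if df % 2 == 0:
        # Even df = 2m: exp(-z) * sum_{k=0}^{m-1} z^k / k!
        return math.exp(-z) * sum(z ** k / math.factorial(k) for k in range(df // 2))
    # Odd df = 2m + 1:
    # erfc(sqrt(z)) + exp(-z) * sum_{k=1}^{m} z^(k - 1/2) / Gamma(k + 1/2)
    m = (df - 1) // 2
    tail = sum(z ** (k - 0.5) / math.gamma(k + 0.5) for k in range(1, m + 1))
    return math.erfc(math.sqrt(z)) + math.exp(-z) * tail

# Exercise 7: 60 rolls, expected frequency 10 per face under a fair die.
observed = [15, 5, 5, 15, 10, 10]
n = sum(observed)
expected = [n / 6] * 6
stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
p_value = chi2_sf(stat, df=len(observed) - 1)

print(stat, round(p_value, 4))
```

Calling `chi2_sf(3.8, 5)` gives the corresponding p-value for a statistic of 3.8 with 5 degrees of freedom.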

EXERCISE 8

Luigi is gambling with Filippo on the results of a die roll. Luigi thinks that the die used by Filippo is unfair.
Luigi carries out 120 rolls of the die and observes the following results.

Result 1 2 3 4 5 6

Observed frequency 15 25 23 22 17 18

In order to verify whether the die used by Filippo is really unfair, we decide to run an appropriate
hypothesis test.
a) Specify the hypotheses to be verified.

If the die is fair, the probability of each outcome is 1/6, so over 120 rolls the expected frequency of
each outcome is $\frac{1}{6}\cdot 120 = 20$. We use a chi-square test to evaluate whether the deviation between the
observed frequencies ($O_i$) and the expected frequencies ($E_i$) is significantly different from 0. Hence:

H0: the die is fair,

H1: the die is unfair.

b) Compute the p-value of the test. Fix a significance level equal to α = 0.05 and decide about the hypotheses
previously specified.

Let us compute the statistic

$$\chi^2 = \sum_{i=1}^{6}\frac{(O_i-E_i)^2}{E_i} = \frac{(15-20)^2 + (25-20)^2 + (23-20)^2 + (22-20)^2 + (17-20)^2 + (18-20)^2}{20} = 3.8$$

which, under H0, follows approximately a $\chi^2$ distribution with 6 – 1 = 5 degrees of freedom.

Fix a significance level equal to α = 0.05.

The p-value of the test is: p-value = $\Pr(\chi^2_5 > 3.8)$.
From the tables, p-value > 0.05; therefore we do not reject the null hypothesis: there is
not sufficient empirical evidence to consider the die unfair (the exact p-value, not
given in the tables, is 0.5786).

EXERCISE 9

A pastry chef wants to know whether his customers prefer some kinds of pastry over others. A sample of 80
customers were asked which kind of pastry they prefer, and the classification in the following table
was obtained:

With fruits   With cream   With chocolate   Dry
     22           26             15          17

a) At a 10% significance level, is it possible for the pastry chef to conclude, on the basis of sample
evidence, that all the 4 typologies of pastry are equally preferred?

                                                      With fruits   With cream   With chocolate   Dry          Total
Probability distribution of the population            p1            p2           p3               p4           1
Oi = observations drawn from the population           22            26           15               17           80
Assumed probability distribution of the population    1/4           1/4          1/4              1/4          1
Ei = expected number of observations if the
     assumed distribution is true                     80*1/4=20     80*1/4=20    80*1/4=20        80*1/4=20    80

We run the test:

H0: the probability distribution of the population is uniform (i.e. $p_1 = p_2 = p_3 = p_4 = \frac{1}{4}$)

H1: the probability distribution of the population is not uniform

We use the test statistic

$$U = \sum_{i=1}^{K}\frac{(O_i-E_i)^2}{E_i}$$

which, under H0, has approximately a $\chi^2_{K-1}$ distribution (with K = 4 and n = 80).

$$U = \sum_{i=1}^{K}\frac{(O_i-E_i)^2}{E_i} = \frac{(22-20)^2}{20} + \frac{(26-20)^2}{20} + \frac{(15-20)^2}{20} + \frac{(17-20)^2}{20} = 3.7$$

We reject H0 if:

$$\sum_{i=1}^{K}\frac{(O_i-E_i)^2}{E_i} > \chi^2_{3,\,0.1}$$

knowing that $\chi^2_{3,\,0.1} = 6.25$.

Since 3.7 < 6.25, at the 10% significance level we do not reject the null hypothesis H0: there is not
sufficient empirical evidence to state that the customers prefer a particular kind of pastry. The sample is
therefore compatible with the pastry chef's conclusion that all 4 types of pastry are
equally preferred.
b) Describe the logic of the test statistic used to verify the hypothesis in the previous point.
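A test with significance level α of the null hypothesis H0 (the assumed probabilities are correct), against
the alternative hypothesis that the assumed probabilities are not correct, is based on the following decision rule:

reject H0 if:

$$\sum_{i=1}^{K}\frac{(O_i-E_i)^2}{E_i} > \chi^2_{K-1,\,\alpha}$$

where $\chi^2_{K-1,\,\alpha}$ is the value for which $P\left(\chi^2_{K-1} > \chi^2_{K-1,\,\alpha}\right) = \alpha$,
and the random variable $\chi^2_{K-1}$ follows a chi-square distribution with (K − 1) degrees of freedom.
The statistic is small when the observed frequencies are close to the expected ones and grows as they
diverge, so only large values count as evidence against H0.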
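
The decision rule can be illustrated by simulation: if samples are drawn repeatedly from a population in which H0 is true, the statistic should exceed the critical value in roughly a fraction α of the samples. The sketch below is an illustration under stated assumptions, not part of the exercise: the function names, the number of simulated samples and the seed are arbitrary choices, and the setting (n = 80, K = 4, critical value 6.25 for α = 0.10) is taken from Exercise 9.

```python
import random

def gof_statistic(observed, expected):
    """Chi-square goodness-of-fit statistic: sum of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def simulate_rejection_rate(n=80, k=4, critical=6.25, n_sims=5000, seed=0):
    """Share of samples drawn under a uniform H0 whose statistic exceeds `critical`."""
    rng = random.Random(seed)
    expected = [n / k] * k
    rejections = 0
    for _ in range(n_sims):
        counts = [0] * k
        for _ in range(n):
            counts[rng.randrange(k)] += 1   # each category equally likely under H0
        if gof_statistic(counts, expected) > critical:
            rejections += 1
    return rejections / n_sims

rate = simulate_rejection_rate()
print(rate)  # expected to be in the neighbourhood of alpha = 0.10
```

Because the chi-square distribution is only an approximation for counts, the simulated rate is close to, but not exactly, 10%.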

EXERCISE 10

A professor suggests that his students choose one of the following books for their studies: A, B, C or D. He
believes that the students have no particular preference, so that the books will be chosen equally often
(uniformly). To test this assumption, the professor records the book chosen by each of a random sample of 100
students, with these results: 20 preferred book A, 40 B, 30 C and 10 D.

a) Which is the statistic we must use to help the professor in verifying his hypothesis?

We must use the chi-square test, that is, we must calculate the statistic:

$$\chi^2 = \sum_{i=1}^{K}\frac{(O_i-E_i)^2}{E_i}$$

where $O_i$ are the observed frequencies and $E_i$ the expected ones, in this case the frequencies of the uniform
distribution.
Under H0 this statistic follows a chi-square distribution with K-1 degrees of freedom (K = number of classes).

b) Test if the professor hypothesis is rejected, with α = 0.01.

Observed $O_i$ and expected $E_i$ frequencies in this situation are:

       A    B    C    D
Oi    20   40   30   10
Ei    25   25   25   25

from which:

$$\chi^2 = \sum_{i=1}^{K}\frac{(O_i-E_i)^2}{E_i} = \frac{(-5)^2}{25} + \frac{15^2}{25} + \frac{5^2}{25} + \frac{(-15)^2}{25} = 20$$

In a chi-square distribution with 3 (= 4 - 1) degrees of freedom the 99th percentile is, according to our table,
11.34.
Since 20 > 11.34 we must reject the professor's null hypothesis H0: the distribution of preferences
among books is not uniform.

EXERCISE 11

A professor suggests that his students choose one of the following books for their studies: A, B, C or D. He
believes that the students have no particular preference, so that the books will be chosen equally often
(uniformly). To test this assumption, the professor records the book chosen by each of a random sample of 200
students, with these results: 70 preferred book A, 40 B, 60 C and 30 D.

a) Test if the professor hypothesis is rejected, with α = 0.05.

We must use the chi-square test, that is, we must calculate the statistic:

$$\chi^2 = \sum_{i=1}^{K}\frac{(O_i-E_i)^2}{E_i}$$

where $O_i$ are the observed frequencies and $E_i$ the expected ones, in this case the frequencies of the uniform
distribution.
Under H0 this statistic follows a chi-square distribution with K-1 degrees of freedom (K = number of classes).

Observed $O_i$ and expected $E_i$ frequencies in this situation are:

       A    B    C    D
Oi    70   40   60   30
Ei    50   50   50   50

from which:

$$\chi^2 = \sum_{i=1}^{K}\frac{(O_i-E_i)^2}{E_i} = \frac{20^2}{50} + \frac{(-10)^2}{50} + \frac{10^2}{50} + \frac{(-20)^2}{50} = 20$$

In a chi-square distribution with 3 (= 4 - 1) degrees of freedom the 95th percentile is, according to our table,
7.81. Since 20 > 7.81 we must reject the professor's null hypothesis H0: the distribution of preferences
among books is not uniform.

EXERCISE 12

You want to verify if there is an association between the area of residence of families and the presence of
underage children. To this aim, a random sample of 100 families is analyzed and the collected information
is organized in the following contingency table:

Presence of underage children \ Area of residence    Urban   Rural
Yes                                                     12      12
No                                                      53      23

a) What are the null and alternative hypotheses to verify?

We need to use a chi-square test where the hypotheses are as follows:

H0 : There is no association between "presence of children" and "Area of residence"

H1 : There is association between "presence of children" and "Area of residence"

b) Determine, at the significance level α = 0.05, if there is association between the two variables.

The test statistic to use is:

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}, \qquad \text{where } E_{ij} = \frac{R_i C_j}{n}.$$

Under the null hypothesis, the test statistic has an approximate chi-square distribution with (r − 1)(c − 1)
degrees of freedom.

We reject H0 when

$$\sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}} > \chi^2_{(r-1)(c-1),\,\alpha}$$

Setting α = 0.05, and since r = 2 and c = 2, we have $\chi^2_{1;\,0.05} = 3.84$. In the following table we
calculate the expected frequencies $E_{ij} = \dfrac{R_i \cdot C_j}{n}$:

Presence of underage children \ Area of residence    Urban   Rural
Yes                                                   15.6     8.4
No                                                    49.4    26.6

while in the following table we calculate the quantities $\dfrac{(O_{ij}-E_{ij})^2}{E_{ij}}$:

Presence of underage children \ Area of residence     Urban    Rural
Yes                                                  0.8308   1.5429
No                                                   0.2623   0.4872

The sum of the values in the last table, i.e. the value of the test statistic, is 3.1232.

Since 3.1232 < 3.84 we do not reject H0 and we conclude that there is no evidence that the presence of
underage children and the area of residence are associated.

EXERCISE 13

You want to verify if there is an association between the area of residence of families and the presence of
underage children. To this aim, a random sample of 500 families is analyzed and the collected information
is organized in the following contingency table:

Presence of underage children \ Area of residence    Urban   Rural
Yes                                                    180     145
No                                                      80      95

a) What are the null and alternative hypotheses to verify?

We need to use a chi-square test where the hypotheses are as follows:

H0 : There is no association between "presence of children" and "Area of residence"

H1 : There is association between "presence of children" and "Area of residence"

b) Determine, at the significance level α = 0.01, if there is association between the two variables.

The test statistic to use is:

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}, \qquad \text{where } E_{ij} = \frac{R_i C_j}{n}.$$

Under the null hypothesis, the test statistic has an approximate chi-square distribution with (r − 1)(c − 1)
degrees of freedom.

We reject H0 when

$$\sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}} > \chi^2_{(r-1)(c-1),\,\alpha}$$

Setting α = 0.01, and since r = 2 and c = 2, we have $\chi^2_{1;\,0.01} = 6.63$. In the following table we
calculate the expected frequencies $E_{ij} = \dfrac{R_i \cdot C_j}{n}$:

Presence of underage children \ Area of residence    Urban   Rural
Yes                                                    169     156
No                                                      91      84

while in the following table we calculate the quantities $\dfrac{(O_{ij}-E_{ij})^2}{E_{ij}}$:

Presence of underage children \ Area of residence     Urban    Rural
Yes                                                  0.7160   0.7756
No                                                   1.3297   1.4405

The sum of the values in the last table, i.e. the value of the test statistic, is 4.2618.

Since 4.2618 < 6.63 we do not reject H0 and we conclude that, at the 1% significance level, there is no
evidence that the presence of underage children and the area of residence are associated.
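
The decision in Exercise 13 depends on the chosen significance level: the statistic 4.2618 is below the 1% critical value 6.63 but above the 5% critical value 3.84 used in Exercise 12. A short sketch makes this explicit by computing the p-value; it assumes 1 degree of freedom, for which the chi-square upper tail equals erfc(sqrt(x/2)).

```python
import math

stat = 4.2618  # test statistic from Exercise 13 (1 degree of freedom)

# For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(stat / 2))

for alpha in (0.05, 0.01):
    decision = "reject H0" if p_value < alpha else "do not reject H0"
    print(f"alpha = {alpha}: p-value = {p_value:.4f} -> {decision}")
```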

EXERCISE 14

A sample of cyclists has been classified according to their gender and to their opinion towards the bike
routes of a city:

OPINION \ GENDER    Male   Female
Low                   16       39
Sufficient            24       17
Good                  20       24

You would like to carry on a test in order to verify whether there is an association between the two
variables.
a) Write down the hypotheses to test.

H0 : there is no association between “OPINION” and “GENDER”


H1 : there is association between “OPINION” and “GENDER”

b) Does the data constitute enough empirical evidence in order to affirm that the two variables are
significantly related? Answer by adopting an α = 0.05.

We need to run a chi-square test of independence. The test statistic to be used is:

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}, \qquad \text{where } E_{ij} = \frac{R_i C_j}{n}$$

Under the null hypothesis, the random variable associated with the test follows a chi-square distribution
with (r-1)(c-1) degrees of freedom.

Setting α = 0.05 and knowing that r = 3 and c = 2, we have $\chi^2_{2,\,0.05} = 5.99$.

The null hypothesis will be rejected if the observed value of the test statistic is above the critical value
$\chi^2_{2,\,0.05} = 5.99$.

In order to compute the test statistic, we first add the marginal totals:

OPINION \ GENDER    Male   Female    Ri
Low                   16       39    55
Sufficient            24       17    41
Good                  20       24    44
Cj                    60       80   140

Below, the expected frequencies under the null hypothesis, $E_{ij} = \dfrac{R_i C_j}{n}$:

OPINION \ GENDER       Male    Female
Low                 23.5714   31.4286
Sufficient          17.5714   23.4286
Good                18.8571   25.1429



Let us compute the value of the test statistic:

$$\sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}} = \frac{(16-23.5714)^2}{23.5714} + \frac{(39-31.4286)^2}{31.4286} + \frac{(24-17.5714)^2}{17.5714} + \frac{(17-23.4286)^2}{23.4286} + \frac{(20-18.8571)^2}{18.8571} + \frac{(24-25.1429)^2}{25.1429} = 8.4931$$

Since 8.4931 > 5.99, we reject H0. The data provide sufficient empirical evidence to affirm that the two
variables are associated at a level of significance α = 0.05 (p-value 0.014313).

EXERCISE 15

100 people are selected for an interview at random. The following two-way table refers to the questions
“will you take part in the next masked parade?” and “interviewee’s gender”.

Gender \ “Will take part…”    Yes    No
M                              12    28
F                              13    47

a) What is the statistical test used to detect association between the variables in the two-way table? What
are the null and the alternative hypotheses?

In order to test for independence we carry out the Chi-squared ( χ 2 ) test. The test’s hypotheses are:

H0: “there is no association between the variables (they are independent)”


H1: “there exists some association between the variables (they are dependent)”

b) Calculate the p-value of the test you indicated in the previous point.

To calculate the p-value we can use the statistic

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}$$

which, under H0, has distribution $\chi^2_{(r-1)(c-1)} = \chi^2_1$ (where c = r = 2 are the number of columns
and the number of rows of the joint distribution table).
If the two variables were independent, the table would be:

Gender \ “Will take part…”    Yes    No
M                              10    30
F                              15    45

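The test statistic is calculated as follows:

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}} = \frac{(12-10)^2}{10} + \frac{(13-15)^2}{15} + \frac{(28-30)^2}{30} + \frac{(47-45)^2}{45} = \frac{4}{10} + \frac{4}{15} + \frac{4}{30} + \frac{4}{45} = 0.8889$$

The p-value is:

$$p\text{-value} = P\left(\chi^2_1 > \chi^2\right) = P\left(\chi^2_1 > 0.8889\right)$$

Thus the p-value is larger than 0.10, since 0.8889 is below the tabulated value $\chi^2_{1,\,0.10} = 2.71$.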

c) Based on the result obtained in point b), make an assertion on H0, with α = 0.05. Briefly comment on the
output.

Since p-value > α, we do not reject the null hypothesis: there is not sufficient empirical evidence to state
that the willingness to take part in the parade varies across genders.

EXERCISE 16

100 people are selected for an interview at random. The following two-way table refers to the questions
“will you take part in the next masked parade?” and “interviewee’s gender”.

Gender \ “Will take part…”    Yes    No
M                              10    30
F                              30    30

a) What is the statistical test used to detect association between the variables in the two-way table? What
are the null and the alternative hypotheses?

In order to test for independence we carry out the Chi-squared ( χ 2 ) test. The test’s hypotheses are:

H0: “there is no association between the variables (they are independent)”


H1: “there exists some association between the variables (they are dependent)”

b) Calculate the p-value of the test you indicated in the previous point.

To calculate the p-value we can use the statistic

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}$$

which, under H0, has distribution $\chi^2_{(r-1)(c-1)} = \chi^2_1$ (where c = r = 2 are the number of columns
and the number of rows of the joint distribution table).
If the two variables were independent, the table would be:

Gender \ “Will take part…”    Yes    No
M                              16    24
F                              24    36

The test statistic is calculated as follows:

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}} = \frac{(10-16)^2}{16} + \frac{(30-24)^2}{24} + \frac{(30-24)^2}{24} + \frac{(30-36)^2}{36} = \frac{36}{16} + \frac{36}{24} + \frac{36}{24} + \frac{36}{36} = 6.25$$

The p-value is:

$$p\text{-value} = P\left(\chi^2_1 > \chi^2\right) = P\left(\chi^2_1 > 6.25\right)$$

Thus the p-value is between 0.01 and 0.025.

c) Based on the result obtained in point b), make an assertion on H0, with α = 0.05. Briefly comment on the
output.

Since p-value < α, we reject the null hypothesis: there is sufficient empirical evidence to state that the
willingness to take part in the parade varies across genders (the variables are statistically dependent).

EXERCISE 17

A retail manager of home appliances wants to analyze whether customers have a particular preference
when they choose a flat TV. The three kinds of flat TV sold are: LCD, plasma and LED. In a simple random sample
of 288 purchasers of a flat TV, 112 bought LCD, 103 plasma and 73 LED.

a) Calculate the p-value of the test to verify whether the customers do have a preference or not.
To verify whether the customers have a preference or not, we need to compare the distribution of the
preferences with a uniform distribution:

                                               LCD           Plasma        LED           Total
Probability distribution of the population     p1            p2            p3            1
Oi = observed distribution                     112           103           73            288
Uniform probability distribution               1/3           1/3           1/3           1
Ei = expected distribution in case of no
     preferences                               288*1/3=96    288*1/3=96    288*1/3=96    288

Test:

H0: uniform distribution (so $p_1 = p_2 = p_3 = \frac{1}{3}$)
H1: the distribution is not uniform, customers have preferences

Under H0, $U = \sum_{i=1}^{k}\dfrac{(O_i-E_i)^2}{E_i}$ is distributed approximately as a $\chi^2_{k-1}$ (with k = 3 and n = 288).

$$U = \sum_{i=1}^{k}\frac{(O_i-E_i)^2}{E_i} = \frac{(112-96)^2}{96} + \frac{(103-96)^2}{96} + \frac{(73-96)^2}{96} = 8.6875$$

$$p\text{-value} = P\left(\chi^2_2 > U\right) = P\left(\chi^2_2 > 8.6875\right)$$

From the $\chi^2_2$ distribution table we can see that 0.01 < p-value < 0.025.

b) Assuming α equal to 0.05, which is the final conclusion you would suggest to the retail manager?
Justify your answer.

With a level of significance of 5%, we reject the Null hypothesis H 0 : there’s enough empirical evidence to say
that customers do have a preference in their TV set choice.
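
The statistic and the p-value bracket for the TV data can be verified in a few lines. This is a small sketch rather than part of the solution: the dictionary layout is an arbitrary choice, and the p-value uses the closed form exp(-x/2), which holds only for 2 degrees of freedom.

```python
import math

observed = {"LCD": 112, "Plasma": 103, "LED": 73}
n = sum(observed.values())
expected = n / len(observed)   # 288 / 3 = 96 under the uniform hypothesis

stat = sum((o - expected) ** 2 / expected for o in observed.values())

# With k - 1 = 2 degrees of freedom the upper tail is exp(-x / 2).
p_value = math.exp(-stat / 2)

print(stat, round(p_value, 4))
```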

EXERCISE 18

A market research on drink consumption has been conducted in order to verify the association between
type of drinks and consumers’ age. A survey was administered to 120 customers and their preferences have
been collected and recorded into the following cross-table:
Age of customers \ Type of drink    Cocktail   Liqueur
< 30 years                              70        10
≥ 30 years                              18        22

a) Using an appropriate statistical test, verify the hypothesis of independence of the two variables,
providing the assumptions and the computations needed to take the final decision. What are the
conclusions? Briefly explain.
Perform a chi-square test. The hypotheses to be tested are:

H0: The variables “Age of customer” and “Type of drink” are independent

H1: The variables “Age of customer” and “Type of drink” are dependent

The test statistic to verify the hypothesis of independence is given by:

$$\chi^2 = \sum_{i,j}\frac{(O_{ij}-E_{ij})^2}{E_{ij}} = \frac{(70-58.6667)^2}{58.6667} + \frac{(10-21.3333)^2}{21.3333} + \frac{(18-29.3333)^2}{29.3333} + \frac{(22-10.6667)^2}{10.6667} = 24.6305$$

Since no significance level is specified, we calculate the p-value considering a $\chi^2$ with (2-1)(2-1) = 1 d.f.:

$$P[\chi^2 > 24.6305] < 0.001 \;\rightarrow\; \text{Reject } H_0 \text{ (reject independence)}$$

Using the sample data we can conclude that the dependence between Type of drink and Age of customer is
statistically significant.

EXERCISE 19

The business agents working for the same company suspect a salary gap against women. In order to verify
their suspicion, they consider a sample of 46 agents and collect the following data on gender and income:

Gender \ Income    Low-Income   Middle-Income   High-Income
Women                   5             10              4
Men                     6             10             11

a) Can they conclude that there is actually a dependency between gender and income? Build and use an
appropriate statistical test with a level of significance α = 0.01. What could the agents conclude?
The appropriate test to verify the association between the two variables is the chi-square test of
independence.
The hypotheses to be tested are:

H 0 : there is no association between the variables "Income" and "Gender" (i.e. they are independent)

H1 : there is an association between the variables "Income" and "Gender" (i.e. they are dependent)
The test statistic to be used is:

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}$$

which, when H0 is true, has distribution $\chi^2_{(r-1)(c-1)} = \chi^2_2$ (where c = 3 is the number of columns
and r = 2 is the number of rows in the table of the joint frequencies).



Expected frequencies table:

Gender \ Income    Low-Income   Middle-Income   High-Income   Tot
Women                4.5435         8.2609         6.1957      19
Men                  6.4565        11.7391         8.8043      27
Tot                 11             20             15           46

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}} = \frac{(5-4.5435)^2}{4.5435} + \frac{(10-8.2609)^2}{8.2609} + \frac{(4-6.1957)^2}{6.1957} + \frac{(6-6.4565)^2}{6.4565} + \frac{(10-11.7391)^2}{11.7391} + \frac{(11-8.8043)^2}{8.8043} = 2.027$$

We reject H0 when the observed value of the test statistic $\chi^2$ falls in the right tail of the
$\chi^2_2$ distribution. From the table of the chi-square distribution we get $\chi^2_{2;\,0.01} = 9.21 > 2.027$, so we do not
reject the null hypothesis of independence at the 1% significance level (the data are compatible with
Income and Gender being independent).
On the basis of the sample considered and of the test result, there is no evidence of a salary gap against women.

EXERCISE 20

A researcher wants to analyse the relation, if any, between children weight and time spent in sports in a
week. A random sample of 320 children between 8 and 10 years old is observed. Let X be weekly time
spent in sports (0 = “less than 1 hour”; 1 = “1-3 hours”; 2 = “more than 3 hours”) and Y be the weight (0 =
“normal”; 1 = “slightly overweight”; 2 = “heavily overweight”).
X \ Y     0     1     2
0        30    70     0
1        20    30    60
2        10    40    60

a) Write the hypotheses we should test.


Hypotheses to be tested:

H0: the two variables are independent;


H1: the two variables are not independent.

b) Write the general formula of the statistic that should be used in this case.

Compute the expected frequencies for each pair (i, j) of values of X and Y under the null hypothesis as
$E_{ij} = \dfrac{R_i C_j}{n}$, where $R_i$ is the marginal frequency corresponding to the i-th row and $C_j$ is the marginal
frequency corresponding to the j-th column. Reject H0 if:

$$\sum_{i=1}^{3}\sum_{j=1}^{3}\frac{(O_{ij}-E_{ij})^2}{E_{ij}} > \chi^2_{(3-1)(3-1),\,0.05}$$

where $O_{ij}$ are the observed frequencies.

c) Test the hypothesis of a relation between the two variables with a significance level α = 0.05.

The following table contains the expected frequencies $E_{ij}$:

X \ Y        0         1        2     Tot
0        18.75     43.75     37.5     100
1       20.625    48.125    41.25     110
2       20.625    48.125    41.25     110
Tot         60       140      120     320

The values $\dfrac{(O_{ij}-E_{ij})^2}{E_{ij}}$ for each cell are:

X \ Y         0         1         2
0          6.75     15.75      37.5
1        0.0189    6.8263    8.5227
2        5.4735    1.3718    8.5227

The sum, 90.7359, exceeds $\chi^2_{(3-1)(3-1),\,0.05} = 9.49$: we reject the hypothesis that X and Y are independent.
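
The table of cell contributions can be rebuilt directly from the observed counts, which also shows which cell drives the result. The following sketch is illustrative only (the variable names are not from the exercise); it recomputes the marginal totals, the contributions and their sum for the Exercise 20 data.

```python
observed = [
    [30, 70, 0],    # X = 0: less than 1 hour
    [20, 30, 60],   # X = 1: 1-3 hours
    [10, 40, 60],   # X = 2: more than 3 hours
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# contributions[i][j] = (O_ij - E_ij)^2 / E_ij with E_ij = R_i * C_j / n
contributions = [
    [(o - r * c / n) ** 2 / (r * c / n) for o, c in zip(row, col_totals)]
    for row, r in zip(observed, row_totals)
]

stat = sum(sum(row) for row in contributions)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Cell with the largest contribution, as (row index, column index).
cells = [(i, j) for i in range(len(observed)) for j in range(len(observed[0]))]
largest = max(cells, key=lambda ij: contributions[ij[0]][ij[1]])

print(round(stat, 4), df, largest)
```

The largest contribution comes from the cell X = 0, Y = 2, where no child was observed although 37.5 were expected under independence.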

EXERCISE 21

You want to study the relationship existing between the revenues of a group of companies and whether or
not the companies have a web page. You conduct a sample survey on 445 companies and obtain:

Revenues \ Web site    Yes    No
> 10 Mil $              56    88
≤ 10 Mil $              99   202

i) Are revenues and having a web site independent (state the null and the alternative hypotheses and verify
whether there is an association between the two variables)? Use the p-value approach.

H0 the two variables X, Y are independent

H1 the two variables X, Y are NOT independent, there is a relationship between the two variables X, Y

Expected frequencies, $E_{ij} = \dfrac{R_i C_j}{n}$:

Revenues \ Web site    Yes                         No
> 10 Mil $             155*144/445 = 50.1573       290*144/445 = 93.8427
≤ 10 Mil $             155*301/445 = 104.8427      290*301/445 = 196.1573

We reject H0 when

$$\sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}} > \chi^2_{(r-1)(c-1),\,\alpha}$$

Here

$$\sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}} = 0.6806 + 0.3638 + 0.3256 + 0.1740 = 1.544$$

with (2-1)(2-1) = 1 degree of freedom, so that p-value > 0.10.

For any reasonable α we do not reject H0: the data are compatible with the two variables being independent.
