
May 25, 2014

© All Rights Reserved

regression


1. The following data were collected by a bank wishing to examine the relationship (if
any) between individual income and savings per year (in units of $1,000).

Income    60   40   50   30   70   80   74   54
Savings    6    3    3    2    8   12   11    7

(a) Which of the two variables would you choose to be the response variable in a

simple linear regression analysis?

Solution:

The bank would be most interested in predicting an individual's Savings, given
their individual Income, and in assessing the effect of a change in Income on
Savings. Since Savings is the variable that we are most interested in predicting,
Savings is taken as the response variable. Income is the explanatory variable.

(b) Without using Excel, sketch an approximate scatterplot of the data.

Solution:

Your scatterplot should look similar to this:

A simple linear regression analysis was performed using Excel, yielding the following

output:

CHAPTER 10. SIMPLE LINEAR REGRESSION

(c) Use the Excel output to write down an estimate \(b_1\) for the regression slope
parameter \(\beta_1\). Interpret the meaning of \(b_1\) in terms of family income and
savings.

Solution:

From the output, the estimate of the slope parameter \(\beta_1\) is \(b_1 = 0.205\) (3 d.p.).
The interpretation is that if an individual's Income increases by $1,000, their
expected Savings increases by $205.
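The slope and intercept reported by Excel can be cross-checked by hand. The following is a minimal sketch in Python (not part of the original worksheet) that applies the usual least squares formulae to the eight data points from the table; the function name `least_squares` is an illustrative choice.

```python
# Income and Savings (both in units of $1,000), copied from the table in Question 1.
income = [60, 40, 50, 30, 70, 80, 74, 54]
savings = [6, 3, 3, 2, 8, 12, 11, 7]


def least_squares(x, y):
    """Return (b0, b1) for the fitted line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of cross-products and squares about the means.
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_x
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares(income, savings)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```

Rounded to three decimal places, the slope agrees with the value 0.205 quoted from the Excel output.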

(d) Test the hypothesis \(H_0: \beta_1 = 0\) against \(H_1: \beta_1 \neq 0\) at the 1% significance
level.

Solution:

In simple linear regression, the test of \(H_0: \beta_1 = 0\) (there is no relationship
between Income and Savings) against \(H_1: \beta_1 \neq 0\) (there is a significant linear
relationship between Income and Savings) can be carried out in either of
two (equivalent) ways. The first approach is based on the test statistic

\[ F = \frac{MS_{\text{Regression}}}{MS_{\text{Residual}}} \quad (\sim F_{1,n-2} \text{ under } H_0). \]

From the Excel output, the observed value is \(F_{\text{obs}} = 48.42\), and the p-value
associated with this observed value is \(0.000437 < \alpha = 0.01\).

The second approach is based on the test statistic

\[ T = \frac{B_1}{\sqrt{MS_{\text{Residual}}/SS_x}} \quad (\sim t_{n-2} = t_6 \text{ under } H_0), \]

where

\[ B_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]

is the least squares estimator of the slope \(\beta_1\). From the Excel output, the
observed value of the test statistic is

\[ t_{\text{obs}} = 6.9585, \]

and the critical region is (two-tail test, \(\alpha = 0.01\))

\[ CR = \{ |T| > t_{\text{crit}} = t^{\alpha/2}_{n-2} = t^{0.005}_{6} = 3.7074 \}. \]

Thus, \(t_{\text{obs}} \in CR\). (Alternatively, it can be seen from the Excel output that
the p-value associated with \(t_{\text{obs}} = 6.9585\) is \(0.000437 < \alpha = 0.01\).)

Thus, either approach results in rejection of \(H_0: \beta_1 = 0\) in favour of
\(H_1: \beta_1 \neq 0\) at the 1% significance level. Hence there is sufficient evidence at the
1% significance level to conclude that there is a significant linear relationship
between Income and Savings.

The advantage of the second approach is that it can be used to test \(H_0: \beta_1 = 0\)
against either of the one-tail alternatives \(H_1: \beta_1 > 0\) (significant positive
relationship between Income and Savings) or \(H_1: \beta_1 < 0\) (significant negative
relationship between Income and Savings), by choosing the appropriate form of
the critical region.
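The equivalence of the two approaches can be checked numerically. The sketch below (Python, added for illustration only) rebuilds the sums of squares from the Question 1 data and computes both statistics; the critical value 3.7074 is simply copied from the solution rather than computed, and all variable names are illustrative.

```python
import math

# Data from Question 1 (units of $1,000).
income = [60, 40, 50, 30, 70, 80, 74, 54]
savings = [6, 3, 3, 2, 8, 12, 11, 7]

n = len(income)
x_bar = sum(income) / n
y_bar = sum(savings) / n

ss_x = sum((x - x_bar) ** 2 for x in income)
ss_y = sum((y - y_bar) ** 2 for y in savings)  # total sum of squares
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(income, savings))

b1 = ss_xy / ss_x
ss_regression = b1 * ss_xy            # 1 degree of freedom
ss_residual = ss_y - ss_regression    # n - 2 degrees of freedom
ms_residual = ss_residual / (n - 2)

# First approach: F = MS_Regression / MS_Residual.
f_obs = (ss_regression / 1) / ms_residual

# Second approach: T = B1 / sqrt(MS_Residual / SS_x).
t_obs = b1 / math.sqrt(ms_residual / ss_x)

T_CRIT = 3.7074  # two-tail critical value for t_6 at alpha = 0.01, quoted in the solution
reject_h0 = abs(t_obs) > T_CRIT

print(f"F_obs = {f_obs:.2f}, t_obs = {t_obs:.4f}, reject H0: {reject_h0}")
```

In simple linear regression the square of the t statistic equals the F statistic, which is why the two tests share the same p-value.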

(e) Briefly explain what, in practice, is the purpose of examining a plot of Residuals

against the explanatory variable (Income).

Solution:

From diagnostic plots, one can check whether any of the assumptions of simple

linear regression appear to be violated. From residual plots, one can assess

the appropriateness of the linear model, and can recognise if the errors are not

independent or do not have constant variance. The remaining assumption is

that of Normality of the errors, which can be checked by examining a Normality

plot of the residuals.

(f) From this regression, what is the predicted Savings for an individual with an
Income of $20,000 per annum? Comment on the usefulness of this prediction.

Solution:

Predicted Savings = \(b_0 + b_1 \times \text{Income} = -5.246 + 0.205 \times 20 = -1.142\)
(evaluated with the unrounded coefficients). One
might question how a negative value of Savings should be interpreted. Note
that an Income of $20,000 is outside the range of the data upon which the
model was constructed, hence this prediction is not reliable and should be
taken with a grain of salt. Predictions are only reliable if the values of any
explanatory variables are within the range of the data.
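The warning about extrapolation can be made explicit in code. The sketch below (Python, illustrative only) refits the line from the Question 1 data and returns, alongside each prediction, a flag saying whether the requested Income lies inside the observed range; the helper names `fit_line` and `predict` are hypothetical.

```python
# Data from Question 1 (units of $1,000).
income = [60, 40, 50, 30, 70, 80, 74, 54]
savings = [6, 3, 3, 2, 8, 12, 11, 7]


def fit_line(x, y):
    """Least squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_x
    return y_bar - b1 * x_bar, b1


def predict(b0, b1, x_new, x_observed):
    """Return (prediction, in_range); in_range is False when extrapolating."""
    in_range = min(x_observed) <= x_new <= max(x_observed)
    return b0 + b1 * x_new, in_range


b0, b1 = fit_line(income, savings)
pred, in_range = predict(b0, b1, 20, income)
print(f"Predicted Savings at Income 20: {pred:.3f} (within data range: {in_range})")
```

An Income of 20 falls below the smallest observed Income of 30, so the flag is False and the negative prediction should not be trusted.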


2. The following output comes from a linear regression, modelling the number of
electronic components assembled (within a certain time) by employees of an electronics
company with differing amounts of experience (in years).

(a) Specify the regression model and explain each term in the model.

(b) State the estimated regression equation between Production and Experience.

(c) Is there a significant linear relationship between Production and Experience?
Justify your answer.

(d) Do the residual plots suggest any problems with model assumptions?

(e) Estimate the effect, on average, of

i. a one year increase in experience,

ii. a two year increase in experience.

(f) State a 95% confidence interval for the slope parameter \(\beta_1\).

(g) What is the coefficient of determination for this model? What is its meaning?

Solution:

(a) \(\text{Production} = \beta_0 + \beta_1\,\text{Experience} + \varepsilon\), where Production is the number
of components produced, Experience is the number of years of experience the
employee has, \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\varepsilon\) is the random variation
term or error.

(b) The estimated regression line is

Production = 2.914 + 1.967 Experience.

(c) We are testing the hypotheses \(H_0: \beta_1 = 0\) against \(H_1: \beta_1 \neq 0\). The p-value
for this test is \(7.33 \times 10^{-31} < 0.05\), so the data provide overwhelming
evidence against the null hypothesis. We conclude that there is a significant
linear relationship between Production and Experience.


(d) 1. A linear model is appropriate. There is no evidence of a trend in the

residual plot.

2. The errors are normally distributed. The points in the normal prob-

ability plot lie approximately on a straight line, indicating the assumption

of normality is okay.

3. The errors have constant variance. The spread of residuals about the

horizontal axis does not vary as Experience increases, so this looks okay.

4. The errors are independent (or uncorrelated). The residual plot
doesn't show any clear violation of independence.

No evidence of outliers or points of high leverage. Thus there is no reason to

doubt the adequacy of our linear regression model.

(e) i. an extra one year of experience will increase Production on average by

1.967 components.

ii. an extra two years of experience will increase Production on average by
\(1.967 \times 2 = 3.934\) components.

(f) We can read the 95% confidence interval for \(\beta_1\) from the Excel output as
(1.912, 2.021).

(g) \(r^2 = 0.9955\); this means that the variation in Experience explains 99.55% of
the variation in Production.

3. House Data: Regression of Price against Age

Open the House.xlsx file. We will perform a regression analysis of Price against the
Age of the houses sold. The data was collected in 2010.

(a) Create a column called AgeHouse (which is simply 2010 - YrBuilt). To do this,
type AgeHouse in Cell J1, type = 2010 - E2 in Cell J2, and fill down.

(b) Produce a scatterplot of Price against AgeHouse, and describe any general

trend. Aside from this, is there anything else of note?

Solution:

A scatterplot of Price against AgeHouse is shown below:

There does seem to be a trend for Price to decrease with increasing age, but this
is due almost entirely to 5 points (possible outliers?) which correspond to very
new houses.

(c) Go to Data -> Data Analysis -> Regression. The Input Y Range is Price,

the Input X Range is AgeHouse. You should include Labels in these ranges;

check the corresponding box. Select an Output Range and click OK.

Solution:

These steps yield the following output:

(d) Write down the equation of regression.

Solution:

From the Table of Coefficients, the regression equation is

(e) Test appropriate hypotheses to determine if there is a significant linear
relationship between Price and AgeHouse. State your conclusion.

Solution:

The hypotheses of interest are

\[ H_0: \beta_1 = 0 \qquad H_1: \beta_1 \neq 0 \]

where \(\beta_1\) is the true regression gradient linking House Prices and the age of
the house.

The p-value for the test is \(3.394 \times 10^{-9} \ll 0.05\), so there is overwhelming
evidence against the null hypothesis. We conclude that there is a significant
linear relationship between Price and the age of the house sold.

(f) Examine the residual plots and Normality plot associated with the regression.

Do the residuals appear Normally distributed?

Solution:

No, there does seem to be some non-Normality in the residuals: the Normality
plot is not entirely linear, due to perhaps 3-5 extreme points. This throws some

doubt on our conclusion above. One option would be to remove one or two of

them and see whether the Normality of the residuals improves, and whether

our conclusions stay the same.

4. Calculate the estimated coefficients \(b_0\) and \(b_1\) in the estimated least squares
regression equation \(\hat{y} = b_0 + b_1 x\) in each of the cases (a) and (b), using the formulae given
in lecture slides, for a set of data \((x_i, y_i)\), \(i = 1, 2, \ldots, 10\), such that

(a) \(\sum_{i=1}^{10} x_i = 15\), \(\sum_{i=1}^{10} y_i = 714\), \(\sum_{i=1}^{10} x_i y_i = 1278\), \(\sum_{i=1}^{10} x_i^2 = 25.8\),

(b) \(\bar{x} = 0\), \(\bar{y} = 12.7\), \(SS_{xy} = -246.56\), \(s_x^2 = 36.67\).

Solution:

(a) From formulae given in Lecture slides,

\[ b_1 = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n (\bar{x})^2} = \frac{1278 - 10\left(\frac{15}{10}\right)\left(\frac{714}{10}\right)}{25.8 - 10\left(\frac{15}{10}\right)^2} = 62.727, \]

\[ b_0 = \bar{y} - b_1 \bar{x} = \frac{714}{10} - 62.727\left(\frac{15}{10}\right) = -22.691, \]

so the equation of the regression line is \(\hat{y} = -22.691 + 62.727x\).

(b) First we need to compute \(SS_x\). Recognising that \(SS_x = \sum_{i=1}^{n} (x_i - \bar{x})^2 = (n - 1)s_x^2\),
we find that \(SS_x = (n - 1)s_x^2 = 9 \times 36.67\). Thus

\[ b_1 = \frac{SS_{xy}}{SS_x} = \frac{-246.56}{36.67 \times 9} = -0.7471, \]

\[ b_0 = \bar{y} - b_1 \bar{x} = 12.7 - (-0.7471)(0) = 12.7, \]

so the equation of the regression line is \(\hat{y} = 12.7 - 0.7471x\).
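Both cases can be verified from the summary statistics alone. The sketch below (Python, for illustration) wraps the two lecture-slide formulae in small helper functions whose names are hypothetical. For case (b), \(SS_{xy}\) is entered as a negative number, on the assumption that the slope of the fitted line is negative (\(\hat{y} = 12.7 - 0.7471x\)).

```python
def coefficients_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Case (a): b0 and b1 from raw sums of x, y, x*y and x^2."""
    x_bar = sum_x / n
    y_bar = sum_y / n
    b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
    b0 = y_bar - b1 * x_bar
    return b0, b1


def coefficients_from_ss(n, x_bar, y_bar, ss_xy, s2_x):
    """Case (b): b0 and b1 from the means, SS_xy and the sample variance of x."""
    ss_x = (n - 1) * s2_x  # SS_x = (n - 1) * s_x^2
    b1 = ss_xy / ss_x
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0_a, b1_a = coefficients_from_sums(10, 15, 714, 1278, 25.8)
# Assumption: SS_xy is negative, matching the negative fitted slope.
b0_b, b1_b = coefficients_from_ss(10, 0.0, 12.7, -246.56, 36.67)

print(f"(a) b0 = {b0_a:.3f}, b1 = {b1_a:.3f}")
print(f"(b) b0 = {b0_b:.1f}, b1 = {b1_b:.4f}")
```

Because \(\bar{x} = 0\) in case (b), the intercept is just \(\bar{y}\) whatever the slope turns out to be.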

5. Open the Excel file House.xlsx. National Realty wants you to investigate the
relationship between the selling price of a house (in $1,000) and the area of the block of
land on which it is situated (in m²). You decide to perform a simple linear regression
between Price and Area.

(a) First, decide which of the two variables should be chosen as the response
variable. Then specify the regression model, and explain each term in the model.

(b) What are the assumptions that must be satisfied to ensure that a simple linear
regression is appropriate?

(c) Using Excel, produce an appropriate Summary Output for the simple linear
regression described by (a). This should include an appropriate set of
diagnostic plots that can be used to assess whether or not the assumptions of the
regression model in (b) are justified.

(d) From your output in (c), write down the estimated regression equation between
Price and Area.

(e) Give an interpretation for the estimate of the slope parameter in the estimated
regression equation in (d).

(f) Do the diagnostic plots suggest any violation of the assumptions in (b)?

Solution:

(a) Price is the appropriate choice for the response variable. The regression model
is \(\text{Price} = \beta_0 + \beta_1\,\text{Area} + \varepsilon\), where Price is the selling price of the house, Area
is the area of the block of the house, \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and
\(\varepsilon\) is the random variation term or residual.

(b) The assumptions of the simple linear regression are

i. A linear model is appropriate: \(\text{Price} = \beta_0 + \beta_1\,\text{Area} + \varepsilon\), where \(E[\varepsilon] = 0\);

ii. The error variables are Normally distributed;

iii. The error variables have constant variance;

iv. The error variables are independent (or at least uncorrelated).

(c) An Excel output is shown:


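(d) The estimated regression equation is Price = 219.23 + 0.2901 Area (the coefficients used for the predictions in Question 6).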

(e) If the area increases by 1 m², then the selling price will, on average, increase by
$290.

(f) 1. A linear model is appropriate. The scatter plot is slightly suggestive

of a curved relationship, particularly if the one extremely negative point

is seen as an outlier.

2. The Error variables are normally distributed. The normality plot

is approximately linear, except at the tails. The Normality assumption is

called into question by 4-6 extreme points. See below.

3. The Error variables have constant variance. This really depends

on how we see the one negative outlier. Without this point, the constant

variance assumption looks okay. Leaving this point in, constant variance

is more open to question.

4. The Error variables are independent (or uncorrelated). The residual
plot doesn't show any clear violation of independence.

The data contains some very extreme points in the Area variable and all of
these would have high leverage. One of these points is an extreme negative
outlier, all of which cast some doubt on the results above. We would be well
advised to see what effect these points have on our regression model, by refitting
the model with one or more of these points removed.


6. (a) Based on your answers to Question 7e, predict the selling price for a house
with area equal to (i) 900 m²; (ii) 1900 m². Comment on the reliability of
these predictions.

(b) Is there a significant (linear) relationship between Price and Area? State the
hypotheses to be tested, and read off the appropriate p-value for this test from
your output in Question 7(e)iii.

(c) Without any calculation, state a 95% confidence interval for the slope
parameter \(\beta_1\).

(d) Calculate a 98% confidence interval for \(\beta_1\).

Solution:

(a) (i) Price = 219.23 + 0.2901(900) = 480.354, that is, $480,354.
(ii) Price = 219.23 + 0.2901(1900) = 770.493, that is, $770,493. The first prediction is
reliable (subject to the comments above about residuals), since 900 is in the
range of Area on which we built the model. The second prediction is unreliable,
as 1900 is well outside the data range of Area upon which we built the model.

(b) We are testing the hypotheses \(H_0: \beta_1 = 0\) versus \(H_1: \beta_1 \neq 0\). The p-value
for this test is \(4.1 \times 10^{-29} < 0.05\), so the data provide overwhelming
evidence against the null hypothesis. We conclude that there is a significant
linear relationship between Price and Area.

(c) We can read the 95% confidence interval from the Excel output as (0.2467, 0.3336).

(d) The 98% confidence limits for \(\beta_1\) are \(b_1 \pm t_{0.01,206}\,SE(b_1)\). So the 98% CL for \(\beta_1\) are
\(0.2901 \pm 2.3451 \times 0.022044 = 0.2901 \pm 0.0517\), so a 98% CI for \(\beta_1\) is \((0.2384, 0.3418)\).
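The interval arithmetic is short enough to check directly. This sketch (Python, illustrative only) takes the slope estimate, its standard error and the critical value exactly as quoted in the solution; nothing here is read from the Excel workbook, and the function name is hypothetical.

```python
def slope_confidence_interval(b1, se_b1, t_crit):
    """Return (lower, upper) for b1 +/- t_crit * SE(b1)."""
    margin = t_crit * se_b1
    return b1 - margin, b1 + margin


B1 = 0.2901           # estimated slope, from the solution
SE_B1 = 0.022044      # standard error of the slope, from the solution
T_CRIT_98 = 2.3451    # t critical value with 206 d.f. for a 98% interval, from the solution

lower, upper = slope_confidence_interval(B1, SE_B1, T_CRIT_98)
print(f"98% CI for the slope: ({lower:.4f}, {upper:.4f})")
```

The same function reproduces any other confidence level once the matching critical value is supplied, which is why only the t value changes between the 95% and 98% intervals.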

Chapter 11

Multiple Linear Regression

1. Absenteeism is a major problem for employers in most countries, reducing potential

output by an estimated 10%. Economists M. Chaudhary and I. Ng (Canadian Journal
of Economics, August 1992) conducted a research project to better understand

the causes of this problem. They randomly selected 100 organisations to participate

in a year long study. For each organisation, the average number of days absent per

employee was recorded, along with several other variables described below:

Wage : the average employee wage

Pct PT: percentage of part time employees

Pct U: the percentage of unionised employees

Av Shift: availability of shift work (1 = yes, 0 = no)

U/M Rel: union-management relationship (1 = good, 0 = not good)

A linear regression analysis was conducted with Absent (average number of days

absent per employee) as response, and some of the output is given on the following

page.

(a) Specify the multiple linear regression model between Absent and the
explanatory variables, and explain each term in the model.

(b) Is there sufficient evidence to conclude that the availability of shift work is
related to absenteeism? Justify your answer.

(c) Can we infer that in organisations where union and management relations are
poor, absenteeism is high? Justify your answer.

(d) Write down the fitted regression model between Absent and the explanatory
variables, using only the significant terms.

(e) State and verify the assumptions of the linear regression model using the
output.

(f) Which variable, Av Shift or U/M Rel, has the greatest effect on absenteeism
in the workplace according to this data?

(g) Compute a 95% confidence interval for the coefficient of the percentage of
unionised employees.

(h) How can this model be improved? Justify your answer.


Solution:

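(a) The regression model is

\[ \text{Absent} = \beta_0 + \beta_1\,\text{Wage} + \beta_2\,\text{Pct PT} + \beta_3\,\text{Pct U} + \beta_4\,\text{Av Shift} + \beta_5\,\text{U/M Rel} + \varepsilon \]

where \(\beta_0\) is the intercept, \(\beta_1\) is the coefficient of Wage, \(\beta_2\) is the coefficient of
Pct PT, \(\beta_3\) is the coefficient of Pct U, \(\beta_4\) is the coefficient of Av Shift, \(\beta_5\) is the
coefficient of U/M Rel, and \(\varepsilon\) is the random variation term.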

(b) The p-value of the coefficient of Av Shift is 0.0025 < 0.05, so there is sufficient
evidence to conclude that the availability of shift work is related to absenteeism.
In fact, since the coefficient of Av Shift is positive, the availability of shift work
increases the mean number of days absent per employee (by 1.56 days per year).

(c) The p-value of the coefficient of U/M Rel is \(5.99 \times 10^{-7} \ll 0.05\), so there is
sufficient evidence to conclude that the status of union-management relations
is related to absenteeism. Since the coefficient of U/M Rel is negative, it
indicates that if union-management relations are good, then the mean number
of days absent per employee decreases (by 2.64 days per year). Equivalently,
bad union-management relations imply that the mean number of days per year
absent per employee will increase by 2.64.

(d) To write down the fitted regression model, we just need to read off the estimated
coefficients from the output:

+ 0.0599 Pct U + 1.5619 Av Shift - 2.6366 U/M Rel.

Note that all the estimated coefficients have p-value < 0.05; thus, every one
of the five explanatory variables contributes significantly to Absenteeism (and
should be included in the model).

(e) i. The linear regression model is appropriate. The scatter plot of
Residuals against Fitted Values is slightly suggestive of some degree of
non-linearity (i.e. a curved relationship). Assume for now that the linear
model is appropriate.

ii. The errors are normal. It is not very evident from the histogram
given that the residuals depart from normality. However,
the Normal Probability Plot shows a distinct curvature. Thus the residuals
are probably not normal; this assumption is not justified.

iii. The errors have constant variance. The scatter plot of Residuals
against Fitted Values shows no clear pattern, so there is no reason to
doubt the equal variance assumption.

iv. The errors are uncorrelated. As observed in i., the scatter plot of
residuals against fitted values shows a slight pattern, but there is possibly
not enough reason to doubt the claim that the residuals are uncorrelated.

(f) In the case of the two factor variables, union/management relations and
availability of shift work, it is clear that union/management relations
have a greater effect than availability of shift work, because the absolute
value of the estimated coefficient is larger.

(g) A 95% CI for the coefficient of Pct U is

\[ b_3 \pm t_{0.025,94}\,SE(b_3) = 0.0599 \pm t_{0.025,94} \times 0.0124 = 0.0599 \pm 1.9855 \times 0.0124 = 0.0599 \pm 0.0246 = (0.0353, 0.0845). \]

One can also read this straight from the Excel output.

Informally, this tells us that if percentage union membership increases by 1%,

then we would expect that mean absenteeism will increase by between 0.0353

and 0.0845 days per year.


(h) The data (Absent) should be transformed and the model re-fitted to see if

there is any improvement in the behaviour of the residuals with respect to the

normality assumption.


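2. As a further analysis, the following log-linear model was fitted to the data:

\[ \ln \text{Absent} = \beta_0 + \beta_1\,\text{Wage} + \beta_2\,\text{PctPT} + \beta_3\,\text{PctU} + \beta_4\,\text{AvShift} + \beta_5\,\text{U/MRel} + \varepsilon \]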

Some of the output from the analysis is given on the following page.

(a) Using the analysis of the previous question, justify fitting the above model to
the data.

(b) Write down the fitted regression model between ln(Absent) and the explanatory
variables.

(c) Is there sufficient evidence to conclude that the availability of shift work is
related to absenteeism? Justify your answer.

(d) Can we infer that in organisations where union and management relations are
poor, absenteeism is high? Justify your answer.

(e) State and verify the assumptions of the regression model using the output.

(f) Compare the log model to the linear model fitted in the previous question.
Which is better? Justify your answer.

(g) Between U/M Rel and Av Shift, which variable has the greatest effect on
absenteeism in this model? How does this compare with the model in Question
1?

(h) Compute a 95% confidence interval for the coefficient of the percent of unionised
employees, and compare your answer to that in Question 7(a)vii.

(i) Write a statement reporting the results of the analysis, referring to the factors
that affect worker absenteeism.


Solution:

(a) The Normality assumption employed in the previous analysis was perhaps not
justified. Now the response variable is being transformed, in an attempt to
find a more appropriate model. After fitting the new model to the data, we
can see if there is any change in the behaviour of the residuals. Since the
histogram of residuals was right-skewed, an appropriate transformation might
be to take the square root or natural logarithm of the response variable. The
log transformation has been used here.

(b) Reading the estimated coefficients from the output, the fitted model is

+ 0.283 Av Shift - 0.371 U/M Rel.

(c) The p-value of the coefficient of Av Shift is 0.0012 < 0.05, so there is sufficient
evidence to conclude that the availability of shift work is related to absenteeism.
In fact, since the coefficient of Av Shift is positive, the availability of shift work
increases the mean number of days absent per employee. From the fitted model
derived in (b), we obtain

+ 0.283 Av Shift - 0.371 U/M Rel)

so absenteeism increases by a factor of \(e^{0.283} = 1.327\) for companies that have
shift work available.

(d) The p-value of the coefficient of U/M Rel is \(2.11 \times 10^{-5} \ll 0.05\), so there is
sufficient evidence to conclude that the status of union-management relations
is related to absenteeism. Since the coefficient of U/M Rel is negative, it
indicates that if union-management relations are good then the mean number
of days absent per employee decreases. In fact, absenteeism decreases by a
factor of \(e^{-0.371} = 0.690\) if management-union relations are good.
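In a model for ln(Absent), a coefficient \(b\) on a 0/1 variable corresponds to multiplying the fitted value of Absent by \(e^{b}\) when the variable switches from 0 to 1. The sketch below (Python, illustrative only) evaluates the two factors quoted in parts (c) and (d) from the reported coefficients; the variable names are hypothetical.

```python
import math

# Estimated coefficients of the two 0/1 variables in the log model, from the solution.
coef_av_shift = 0.283   # shift work available (1 = yes)
coef_um_rel = -0.371    # union-management relationship (1 = good)

# Multiplicative effect on Absent when the indicator changes from 0 to 1.
factor_shift = math.exp(coef_av_shift)
factor_rel = math.exp(coef_um_rel)

# The same effects expressed as percentage changes.
pct_shift = 100 * (factor_shift - 1)
pct_rel = 100 * (factor_rel - 1)

print(f"Shift work available: x{factor_shift:.3f} ({pct_shift:+.1f}%)")
print(f"Good U/M relations:   x{factor_rel:.3f} ({pct_rel:+.1f}%)")
```

Expressing the effect as a factor, and not as a number of days, reflects the fact that the log model describes proportional changes in absenteeism.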

(e) i. The fitted regression model is appropriate. The scatter plot of
Residuals against Fitted Values shows no clear pattern, so we conclude that the
model is appropriate.

ii. The errors are normal. The normal probability plot is fairly straight,
and the histogram is similar to that expected from a Normal
distribution. We conclude that there is no reason to question the assumption of
Normality of the residuals.

iii. The errors have constant variance. The scatter plot of residuals
against fitted values shows possibly that the variances decrease slightly
as the fitted values increase, but perhaps not enough to doubt the equal
variance assumption.

iv. The errors are uncorrelated. The scatter plot of residuals against fitted
values shows no clear pattern, so we conclude that the residuals are
uncorrelated.

(f) The log model is a great improvement on the linear model. The correlation
coefficient has not changed much (0.7252 compared to 0.7296). The standard
error for the log model is much smaller than for the linear model; this is
partly due to the fact that the data values have decreased due to the log
transformation, but even taking this into account, the reduction is large (in
fact, as a rough guide, the log of the standard error for the linear model is
ln 2.3559 = 0.8569, and this is more than twice the standard error for the log
model). Furthermore, the diagnostic plots suggest that we may safely assert
that the assumptions of the regression are satisfied by the log model, whereas
some doubt must be cast on the validity of the assumptions of the linear model,
in particular Normality of the residuals.

(g) Again the coefficient of Union/Management Relations has larger absolute value,
so again of the two variables, this variable has the greatest effect.

(h) The 95% confidence limits for the coefficient of Pct U are

\[ 0.0111 \pm t_{94} \times 0.0021 = 0.0111 \pm 1.9855 \times 0.0021 = 0.0111 \pm 0.0042, \]

so a 95% CI is (0.0069, 0.0153). This is quite different to the 95% CI for the
same coefficient under the previous model. The difference is due to the choice
of model. Note that this is a CI for the increase in ln(Absent) corresponding
to a 1% increase in union membership.
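The two intervals for the Pct U coefficient can be placed side by side. The sketch below (Python, illustrative only) recomputes both from the estimates, standard errors and critical value quoted in the two solutions, and then exponentiates the log-model limits to express them as multiplicative effects on Absent; the names used are hypothetical.

```python
import math

T_CRIT = 1.9855  # t critical value with 94 d.f. for a 95% interval, from the solutions


def confidence_interval(estimate, std_error, t_crit=T_CRIT):
    """Return (lower, upper) for estimate +/- t_crit * std_error."""
    margin = t_crit * std_error
    return estimate - margin, estimate + margin


# Coefficient of Pct U: (estimate, standard error) under each model, from the solutions.
linear_ci = confidence_interval(0.0599, 0.0124)  # change in Absent (days)
log_ci = confidence_interval(0.0111, 0.0021)     # change in ln(Absent)

# On the original scale, the log-model limits become multiplicative factors.
log_ci_factor = tuple(math.exp(limit) for limit in log_ci)

print(f"Linear model: ({linear_ci[0]:.4f}, {linear_ci[1]:.4f}) days")
print(f"Log model:    ({log_ci[0]:.4f}, {log_ci[1]:.4f}) on the ln scale")
print(f"Log model:    x{log_ci_factor[0]:.4f} to x{log_ci_factor[1]:.4f} on Absent")
```

The intervals are not directly comparable, since one measures a change in days and the other a change in the logarithm of days.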

(i) Worker absenteeism decreases as the average employee wages and percent of

part time employees increase. Further, absenteeism is lower for those companies

for which management has a good relationship with the union and higher for

those companies that have shift work available. Finally, absenteeism increases

as the percentage of unionised employees increases.

3. In this exercise, we will examine how to use Excel to generate output that can
be used to conduct a multiple linear regression analysis on a given data set. The
standard Excel regression output does not include all of the diagnostic plots that
one would usually be interested in; separately, we can obtain plots of residuals
against fitted values, and histograms of the residuals. The Absenteeism data of the
previous two questions is contained in the Excel file Absent.xlsx.

Follow the steps outlined below:

(a) Generate output relevant to a multiple linear regression of Absent on the
five explanatory variables. To do this, select Data -> Data Analysis ->

Regression. Since Absent is the response variable, set Y Range to be all

the data in the column Absent. X Range should be set to be all the data in

remaining columns. Labels should be included. Select also Residuals and

Normal Probability Plot.

(b) Under Residual Output, you will see two columns headed Predicted Absent

and Residuals. Copy the two columns to a separate worksheet, and use the

Scatterplot command to generate a scatterplot of Residuals against Fitted

Values (Predicted Absent). Check that the plot is the same as the one given

in the output in Question 7a.

(c) Generate a histogram of the Residuals.

Solution:

See Question 7a for an example output.


4. In Question 7b, we considered a multiple linear regression of the natural logarithm
of Absent on the five explanatory variables. Here, we will reproduce the relevant
output in Excel. First, we must transform the Absent data.

(a) Create a new column, to the right of the Absent column, headed ln(Absent),
calculate the natural logarithm of the first data point as shown below, and
then fill down the column.

(b) Now generate the standard Excel output for a multiple linear regression of
ln(Absent) on the five explanatory variables (ignoring the original Absent
data!).

(c) Once again, create a plot of Residuals against Fitted Values. Check that it is
the same as the plot given in the Excel output in Question 7b. Comment on
the differences in the diagnostic plots for the two different models, and what
these plots tell us.

Solution:

See Question 7b for an example output. The diagnostics for the two models were
examined respectively in Questions 7(a)v and 7(b)v. We can assert that the
assumptions of the regression are satisfied by the log model. There is no evident
pattern in the residual plot to suggest that this model is not appropriate or that
the errors are not independent. Furthermore, the Normal Probability Plot
resembles a line, indicating that the Normality assumption is OK. However, for the linear
model, the Normal Probability Plot has a definite curve. For this model, some doubt
must be cast on the validity of the assumption of Normality of the errors. So, the
model is inappropriate. Inference based on bad models will usually result in wrong
conclusions.


Chapter 12

Chi-Squared Tests for Categorical

Data

12.1-12.2: The Chi-Squared Test for Goodness of Fit

1. A company which manufactures tractors takes daily samples of 4 tractors for careful

inspection as a check on the quality of their product. Over 200 days, the numbers

of tractors needing adjustment on each day were recorded, resulting in the following

frequency table. Test whether a Binomial model with p = 0.1 is appropriate for the

number of tractors needing adjustment on a given day.

[Fill in the rest of the table before your lab, remembering that you might need to

group the categories.]

Number needing adj. per day (x_i)      0     1     2     3     4   total
Number of days (o_i)                 102    78    19     1     0     200
P(X = x_i) if X ~ Bin(4, 0.1)                                          1
Expected frequency (e_i)                                             200
(o_i - e_i)^2 / e_i

Solution:

The hypotheses to be tested are

\(H_0\): data consistent with a Bin(4, 0.1) distribution
\(H_1\): data not consistent with a Bin(4, 0.1) distribution

Let X denote the number of tractors needing adjustment on a given day.
Assume that the numbers of tractors requiring adjustment on each day are iid. Then
the number of days, out of 200, on which \(x_i\) tractors need adjustment (for
\(x_i = 0, 1, \ldots, 4\)) is a \(\text{Bin}(200, p_i)\) random variable, where \(p_i = P(X = x_i)\). So the
expected number of days on which \(x_i\) tractors need adjustment can be written as
\(200 p_i\). Under the Bin(4, 0.1) model,

\[ p_i = \binom{4}{x_i}\, 0.1^{x_i} (1 - 0.1)^{4 - x_i}. \]

The expected frequencies under this model can now be calculated, and the results are given in the
following table:

Number needing adj. (x_i)             0        1        2        3        4    Total
Number of days (o_i)                102       78       19        1        0      200
P(X = x_i) if X ~ Bin(4, 0.1)    0.6561   0.2916   0.0486   0.0036   0.0001        1
Expected freq. (e_i = 200 p_i)   131.22    58.32     9.72     0.72     0.02      200

The chi-square test requires that all expected frequencies be greater than 5. To
achieve this, we group the last three categories. The revised table is shown below:

Number needing adj. (x_i)             0        1   2, 3 or 4    total
Number of days (o_i)                102       78          20      200
P(X = x_i) if X ~ Bin(4, 0.1)    0.6561   0.2916      0.0523        1
Expected freq. (e_i = 200 p_i)   131.22    58.32       10.46      200
(o_i - e_i)^2 / e_i                6.51     6.64        8.70    21.85

The test statistic is

\[ X^2 = \sum_i \frac{(o_i - e_i)^2}{e_i}, \]

where the sum is over all (remaining) categories of the variable. Under \(H_0\), \(X^2\)
follows a \(\chi^2\) distribution, with

df = Number of categories - 1 - Number of parameters estimated
   = 3 - 1 - 0
   = 2,

i.e. \(X^2 \sim \chi^2_2\) under \(H_0\). So the \(\alpha = 0.05\) critical value is

\[ \chi^2_{\text{crit}} = \chi^2_{2,0.05} = 5.99. \]

The observed value of the test statistic is

\[ \chi^2_{\text{obs}} = 21.85. \]

Since \(\chi^2_{\text{obs}} = 21.85 \in CR\), the data provide sufficient evidence to reject \(H_0\) at the 5%
level of significance. We conclude that the number of tractors needing adjustment
is not distributed as Bin(4, 0.1).
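The whole calculation, including the grouping step, can be reproduced in a few lines. The sketch below (Python, illustrative only) computes the Bin(4, 0.1) probabilities, pools the categories 2, 3 and 4 as in the solution, and evaluates the test statistic; the critical value 5.99 is copied from the solution rather than computed, and the variable names are hypothetical.

```python
from math import comb

observed = [102, 78, 19, 1, 0]  # days with 0, 1, 2, 3, 4 tractors needing adjustment
n_days = sum(observed)
n_trials, p = 4, 0.1            # hypothesised Bin(4, 0.1) model

# Binomial probabilities and expected frequencies for each category.
probs = [comb(n_trials, k) * p ** k * (1 - p) ** (n_trials - k)
         for k in range(n_trials + 1)]
expected = [n_days * q for q in probs]

# Pool the categories 2, 3 and 4 so that every expected frequency exceeds 5.
obs_grouped = observed[:2] + [sum(observed[2:])]
exp_grouped = expected[:2] + [sum(expected[2:])]

chi2_obs = sum((o - e) ** 2 / e for o, e in zip(obs_grouped, exp_grouped))

CHI2_CRIT = 5.99  # chi-squared critical value, 2 d.f., alpha = 0.05, from the solution
reject_h0 = chi2_obs > CHI2_CRIT

print(f"Expected (grouped): {[round(e, 2) for e in exp_grouped]}")
print(f"Chi-squared statistic: {chi2_obs:.2f}, reject H0: {reject_h0}")
```

No parameters are estimated from the data here because p = 0.1 is specified in advance, which is why the degrees of freedom are 3 - 1 - 0 = 2.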

2. Political ideology of government has a great impact on business perception and

planning. A market researcher is investigating the support for the various political

parties in Australia at the Federal level. The support at the 2001 Federal election

was Liberal 37%, Labor 38%, National 6%, Democrats 5%, Others 14%

(source: http://www.aec.gov.au/ content/when/past/2001/results/index.html).

Six months after the 2001 election, a survey of 1050 voters was conducted to determine whether the level of support for each party had changed. The results are summarised in the table below.

Party (i)                 Lib    Lab    Nat    Dem    Oth    Total
No. of voters ($o_i$)     350    456    50     44     150    1050
Probability ($p_i$)       0.37   0.38   0.06   0.05   0.14   1


Determine at significance level 0.05 whether the level of support for the parties changed in the six months following the 2001 election. Comment on where the major discrepancy appears to lie.

Solution:

Party (i)                      Lib       Lab      Nat      Dem      Oth      Total
No. of voters ($o_i$)          350       456      50       44       150      1050
Probability ($p_i$)            0.37      0.38     0.06     0.05     0.14     1
Expected frequency ($e_i$)     388.5     399      63       52.5     147      1050
$o_i - e_i$                    -38.5     57       -13      -8.5     3        0
$(o_i - e_i)^2$                1482.25   3249     169      72.25    9        (not reqd)
$(o_i - e_i)^2 / e_i$          3.8153    8.1429   2.6825   1.3762   0.0612   16.0781

(a) df = (number of categories, after grouping to eliminate any expected frequencies less than 5) - 1 - (number of parameters estimated from the data).

(b) [See table above].

df = 5 - 1 - 0 = 4; $\chi^2_{4,0.05} = 9.49$.

$\chi^2_{\text{obs}} = 16.0781 > 9.49$, therefore the data provides sufficient evidence to reject $H_0$ at the 5% level of significance. We conclude that the level of support for the political parties has changed since the last election.

The largest contribution to the chi-squared statistic is from the Labor column (8.1429), where the observed count of 456 exceeds the expected count of 399. Thus the major discrepancy is that the support for Labor increased in the six months following the 2001 election.
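As a check on the table, the goodness-of-fit calculation can be reproduced in a few lines. This is an illustrative sketch only; the list names and the `largest` variable are assumptions made for the example, not notation from the exercise.

```python
# Support at the 2001 election (null-hypothesis proportions) and the survey counts.
parties = ["Lib", "Lab", "Nat", "Dem", "Oth"]
p_null = [0.37, 0.38, 0.06, 0.05, 0.14]
observed = [350, 456, 50, 44, 150]
n = sum(observed)  # 1050 voters surveyed

# Expected counts under H0 and each category's contribution (o - e)^2 / e.
expected = [n * p for p in p_null]
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_sq = sum(contributions)
df = len(parties) - 1  # no parameters estimated from the data

# The category with the largest contribution shows where the main discrepancy lies.
largest = max(zip(contributions, parties))[1]

for party, o, e, c in zip(parties, observed, expected, contributions):
    print(f"{party}: observed={o}, expected={e:.1f}, contribution={c:.4f}")
print(f"chi-squared = {chi_sq:.4f} on {df} df; largest contribution: {largest}")
```

Comparing `observed` with `expected` for the flagged category then gives the direction of the change.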

3. Black et al. 12.5.

Solution:

$H_0$: The way that men define their personal success does not differ from how women define theirs.

$H_1$: $H_0$ is false.

The test statistic is given by

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e},$$

where $f_o$ is the observed frequency and $f_e$ is the expected frequency.

The significance level is given as $\alpha = 0.05$.

There are four categories in this question (happiness, sales, helping others, achievements), so k = 4. The degrees of freedom are k - 1 = 3. For $\alpha = 0.05$ and df = 3, the critical chi-square value is $\chi^2_{0.05,3} = 7.8147$.

The expected frequencies are computed by multiplying the expected proportions (from the women's data) by the total sample size of the men's data. For example, the total sample size for the men's data is 227 (add up all the observed frequencies). The expected frequency for the happiness category is then 227(0.39) = 88.53 and similarly, for the sales category, the expected frequency is 227(0.12) = 27.24, and so on.


Definition      $f_o$   $f_e$    $(f_o - f_e)^2 / f_e$
Happiness       42      88.53    24.46
Sales           95      27.24    168.55
Helping         27      40.86    4.70
Achievements    63      70.34    0.77
Total                            198.48

Since the observed chi-squared value (198.48) is greater than the critical value (7.8147), we reject the null hypothesis.

Thus, the data gathered in the sample suggests that the way men define their personal success differs significantly from how women define theirs.

12.3: Contingency Analysis: The Chi-Squared Test for Independence

4. In a random sample of 100 people, each person was classified by buying response to a particular product and also by degree of exposure to marketing pressure (recorded in four categories I, II, III, IV), with the following results:

[Fill in (say) three of the expected frequencies (in parentheses) before your lab.]

                  Marketing Pressure
                  I        II       III      IV       Totals
Definitely buy    12 ( )   12 ( )   6 ( )    17 ( )   47
Undecided         5 ( )    8 ( )    10 ( )   5 ( )    28
Will not buy      3 ( )    10 ( )   7 ( )    5 ( )    25
Total             20       30       23       27       100

(a) State the hypotheses you would use in testing the advertising agency's claim that buying response is influenced by the degree of marketing pressure.

(b) Explain why you would calculate the expected frequencies using the rule

$$\text{expected frequency for a cell} = \frac{\text{row total} \times \text{column total}}{\text{grand total}}.$$

(c) Test the advertising agency's claim at the 5% significance level.

Solution:

(a) The hypotheses to be tested are

$H_0$: Marketing pressure and buying response are independent

$H_1$: Marketing pressure and buying response are not independent (i.e. buying response is influenced by marketing pressure)


(b) Consider the upper-left cell. The probability that a particular customer will be classified in this cell is P(Marketing Pressure I and Definitely buy). If $H_0$ is true, then

$$P(\text{Marketing Pressure I and Definitely buy}) = P(\text{Marketing Pressure I}) \times P(\text{Definitely buy}).$$

The two probabilities on the right-hand side can be estimated naturally in terms of the respective row and column totals:

$$P(\text{Marketing Pressure I}) \approx \frac{\text{Column 1 Total}}{\text{Grand Total}}, \qquad P(\text{Definitely buy}) \approx \frac{\text{Row 1 Total}}{\text{Grand Total}}.$$

So,

$$P(\text{Marketing Pressure I and Definitely buy}) = \frac{\text{Row 1 Total} \times \text{Column 1 Total}}{(\text{Grand Total})^2}.$$

The expected frequency for the upper-left cell is

$$\text{Grand Total} \times P(\text{Marketing Pressure I and Definitely buy}) \approx \text{Grand Total} \times \frac{\text{Row 1 Total} \times \text{Column 1 Total}}{(\text{Grand Total})^2} = \frac{\text{Row 1 Total} \times \text{Column 1 Total}}{\text{Grand Total}}.$$

This argument applies in the same way for all other cells in the table.

(c) Expected frequencies (under the model of independence) are given in brackets:

                  Marketing Pressure
                  I          II          III          IV           Totals
Definitely buy    12 (9.4)   12 (14.1)   6 (10.81)    17 (12.69)   47
Undecided         5 (5.6)    8 (8.4)     10 (6.44)    5 (7.56)     28
Will not buy      3 (5)      10 (7.5)    7 (5.75)     5 (6.75)     25
Total             20         30          23           27           100

The test statistic is

$$X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - e_{ij})^2}{e_{ij}},$$

where r and c are the numbers of rows and columns (not including totals), $O_{ij}$ is the observed count in the cell in row i and column j, and $e_{ij}$ is the expected count in the same cell (assuming $H_0$, that is, no relationship between the two variables). This double sum can be thought of simply as a single sum over all cells in the table.

Under $H_0$, the test statistic approximately follows a $\chi^2$ distribution with (r - 1) × (c - 1) = (3 - 1) × (4 - 1) = 6 degrees of freedom, i.e. $X^2 \sim \chi^2_6$ under $H_0$. Thus, the $\alpha = 0.05$ critical value of the test is $\chi^2_{\text{crit}} = \chi^2_{6,0.05} = 12.59$.


The observed value of the test statistic is

$$\chi^2_{\text{obs}} = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$

= 0.719 + 0.312 + 2.140 + 1.464 + 0.064 + 0.019 + 1.968 + 0.867 + 0.800 + 0.833 + 0.271 + 0.454

= 9.91.

Since $\chi^2_{\text{obs}} = 9.91 < 12.59$, the data does not provide sufficient evidence to reject $H_0$ in favour of $H_1$ at the 5% level of significance. We conclude that there is no significant evidence that buying response is influenced by marketing pressure.
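The expected-frequency rule from part (b) and the statistic from part (c) can be reproduced as follows. This is a minimal sketch in plain Python, not the Excel procedure used in the course; the names `observed`, `expected` and `chi_sq` are illustrative.

```python
# Observed counts: rows = buying response, columns = marketing pressure I-IV.
observed = [
    [12, 12, 6, 17],   # Definitely buy
    [5, 8, 10, 5],     # Undecided
    [3, 10, 7, 5],     # Will not buy
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Under independence: expected = row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Pearson statistic: a single sum of (o - e)^2 / e over all cells of the table.
chi_sq = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

print([[round(e, 2) for e in row] for row in expected])
print(round(chi_sq, 2), df)
```

The statistic is then compared with the tabulated critical value 12.59 for 6 degrees of freedom, as in the solution.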

5. Four hotels took part in a survey on hotel guest satisfaction. A follow-up question was asked of all respondents who were dissatisfied with the service. These guests were asked to indicate the main reason for their dissatisfaction. You are asked to investigate whether the choice of hotel has any bearing on the main reason for dissatisfaction.

Do not use Excel in this question! Write your answers on paper, showing

full working.

(a) State appropriate hypotheses that could be tested to answer the question: Do

the results of the survey provide evidence that the nature of dissatisfaction and

the choice of hotel are related?

Solution:

$H_0$: Choice of hotel and reason for dissatisfaction are independent.

$H_1$: $H_0$ is false (i.e. choice of hotel and reason for dissatisfaction are related).

(b) A contingency table, summarising the results of the survey, is given below. The table shows the observed frequencies for each cell, as well as some of the expected frequencies under $H_0$ (in parentheses). Copy down this table, and without using Excel, calculate the remaining expected frequencies under $H_0$. Show working!

                  Hotel
                  Fijian         Tradeswest     Sheraton        Coral Reef      Totals
Politeness        23 ( )         7 ( )          37 (33.7410)    67 (62.0192)    134
Knowledge         25 ( )         13 ( )         25 (30.9712)    60 (56.9281)    123
Responsiveness    13 (11.0024)   5 (6.6906)     13 (15.6115)    31 (28.6954)    62
Other             13 (17.3909)   20 (10.5755)   30 (24.6763)    35 (45.3573)    98
Totals            74             45             105             193             417

Solution:


The expected frequency for Fijian and Politeness is

$$\frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}} = \frac{134 \times 74}{417} = 23.7794.$$

One can either work out the remaining three expected frequencies as above, or use the fact that the expected frequencies in each row/column are required to sum to the (observed) row/column total. The complete table is below:

                  Hotel
                  Fijian         Tradeswest     Sheraton        Coral Reef      Totals
Politeness        23 (23.7794)   7 (14.4604)    37 (33.7410)    67 (62.0192)    134
Knowledge         25 (21.8273)   13 (13.2734)   25 (30.9712)    60 (56.9281)    123
Responsiveness    13 (11.0024)   5 (6.6906)     13 (15.6115)    31 (28.6954)    62
Other             13 (17.3909)   20 (10.5755)   30 (24.6763)    35 (45.3573)    98
Totals            74             45             105             193             417

(c) Write down an expression for the relevant test statistic, and state its distribution under $H_0$ (together with any associated parameters!).

Solution:

The test statistic is

$$X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - e_{ij})^2}{e_{ij}},$$

where r and c are the numbers of rows and columns (not including totals), $O_{ij}$ is the observed count in the cell in row i and column j, and $e_{ij}$ is the expected count in the same cell (assuming $H_0$, that is, no relationship between the two variables).

Under $H_0$, the test statistic approximately follows a $\chi^2$ distribution with (r - 1) × (c - 1) = (4 - 1) × (4 - 1) = 9 degrees of freedom, i.e. $X^2 \sim \chi^2_9$ under $H_0$.

(d) Without using Excel, calculate the contribution from the upper-left cell to the

observed value of the test statistic.

Solution:

The contribution to the observed value from the upper-left cell (i = 1, j = 1) is

$$\frac{(o_{11} - e_{11})^2}{e_{11}} = \frac{(23 - 23.7794)^2}{23.7794} = 0.0256.$$

(e) Given that the observed value of the test statistic is $\chi^2_{\text{obs}} = 20.8059$, carry out the test (without using Excel) at the 5% significance level, and state your conclusion. Is there sufficient evidence to conclude that there is a relationship between the choice of hotel and the nature of dissatisfaction?

Solution:

The critical value for this test is

$$\chi^2_{\text{crit}} = \chi^2_{9,\alpha} = \chi^2_{9,0.05} = 16.92,$$

so the critical region is $\{X^2 > \chi^2_{\text{crit}} = 16.92\}$. Since $\chi^2_{\text{obs}} = 20.8059$ is within the critical region, we reject $H_0$ in favour of $H_1$. There is sufficient evidence, at the 5% significance level, to conclude that the nature of dissatisfaction is related to the choice of hotel.

6. To undertake contingency analysis in Excel, first enter the data, then go to KaddSTAT -> Hypothesis Testing -> Chi-Square Test. Select the data as Input Range, tick the Header Row and Column Included box, and choose where you want Excel to print the output.

Enter the data from Question 5 as shown below:

(a) Use Excel to generate an appropriate output for a test for independence of the two variables of interest, carry out the test, and check that your conclusions are the same as in Question 5(e).

Solution:

Excel returns the following output:


(b) If there is evidence that the nature of dissatisfaction is related to the choice of hotel, where do the discrepancies lie? Which hotel(s) could be advised to improve their service, and in which area(s)? Do any of the hotels appear to provide significantly better service than the others in a particular area?

Solution:

Having established that there is indeed a relationship between the choice of hotel and the nature of dissatisfaction, one can examine the output to determine which hotels have a greater (or lesser) proportion of complaints of each type.

It can be seen from the output of the chi-square calculations that three cells have much larger contributions to the observed value of the test statistic than the others. These cells are Tradeswest and Politeness (3.8490), Tradeswest and Other (8.3987) and Coral Reef and Other (2.3651). Comparing the observed frequencies with the expected frequencies for these cells, we see that of the dissatisfied hotel guests, those who stayed at Tradeswest are less often dissatisfied with Politeness, and more often their dissatisfaction is classified as Other. It might also be that those dissatisfied guests who stay at Coral Reef less often state that their dissatisfaction is due to Other, although there is probably not enough evidence to confirm this (the chi-square contribution is not that large).

We conclude that Tradeswest should take steps to improve their service in the area of Other. Some further analysis might be required to provide more useful advice. To find out which particular aspects of Tradeswest's service guests are dissatisfied with, one might choose to replace Other by a collection of more meaningful categories (e.g. Cleanliness, Food, etc.). One can ensure that the expected frequencies are all greater than 5 by combining any categories that have small expected frequencies, or by simply gathering enough data.
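The cell-by-cell comparison described above does not depend on the Excel add-in. As a rough sketch (plain Python; the `cells` record layout and the names `top_three` and `direction` are assumptions made for this illustration), the contributions can be computed from the hotel survey counts and ranked:

```python
# Observed counts from the hotel survey: rows = reason, columns = hotel.
reasons = ["Politeness", "Knowledge", "Responsiveness", "Other"]
hotels = ["Fijian", "Tradeswest", "Sheraton", "Coral Reef"]
observed = [
    [23, 7, 37, 67],
    [25, 13, 25, 60],
    [13, 5, 13, 31],
    [13, 20, 30, 35],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# One record per cell: (contribution, reason, hotel, observed, expected).
cells = []
for i, reason in enumerate(reasons):
    for j, hotel in enumerate(hotels):
        e = row_totals[i] * col_totals[j] / grand_total
        o = observed[i][j]
        cells.append(((o - e) ** 2 / e, reason, hotel, o, e))

chi_sq = sum(cell[0] for cell in cells)
print(f"chi-squared = {chi_sq:.4f}")

# List the three largest contributions and the direction of each discrepancy.
top_three = sorted(cells, reverse=True)[:3]
for contrib, reason, hotel, o, e in top_three:
    direction = "more" if o > e else "fewer"
    print(f"{hotel} / {reason}: {contrib:.4f} ({direction} complaints than expected)")
```

Ranking by contribution only flags where the observed and expected counts disagree most; whether a hotel is doing better or worse in that area still has to be read from the sign of the difference.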

7. A market researcher wished to investigate whether a buyer's age had any bearing on choice of car colour. A random sample of 200 car buyers resulted in the following table, which shows the observed frequencies and some of the expected frequencies (in parentheses).

               Chose Red   Chose White   Chose Grey
Age 17-24      20 ( 16 )   15 ( 16 )     5 ( )
Age 25-40      30 ( 24 )   20 ( 24 )     10 ( )
Age over 40    30 ( )      45 ( )        25 ( )

(a) State the hypotheses that the researcher is comparing in this investigation.

(b) Copy the body of the table and complete the entries for expected frequencies.

(c) Give the number of degrees of freedom for a $\chi^2$-test of the hypotheses in part (a).

(d) Explain fully how the expected frequency of 16 is obtained for the 17-24 age group with a preference for Red. (Do not merely quote a formula or show one line of arithmetic.)

(e) Using a 5% level of significance, determine if the buyer's age has any bearing on the choice of car colour.

Solution:

(a) $H_0$: choice of colour is independent of age; $H_1$: choice of colour is dependent on age.

(b)

               Chose Red   Chose White   Chose Grey   Total
Age 17-24      20 ( 16 )   15 ( 16 )     5 ( 8 )      40
Age 25-40      30 ( 24 )   20 ( 24 )     10 ( 12 )    60
Age over 40    30 ( 40 )   45 ( 40 )     25 ( 20 )    100
Total          80          80            40           200

(c) No grouping is required, so the number of degrees of freedom for the $\chi^2$-distribution is (3 - 1)(3 - 1) = 4.

(d) Under $H_0$, P(Age 17-24 and Red) = P(Age 17-24) × P(Red). Estimating P(Age 17-24) by $\frac{40}{200}$ and P(Red) by $\frac{80}{200}$ gives expected frequency for the (1, 1)-cell = $\frac{40}{200} \times \frac{80}{200} \times 200 = 16$.

(e) From tables, $\chi^2_{4,0.05} = 9.49$ and the observed value of the test statistic is 9.06 $\notin$ CR, so the data does not provide sufficient evidence to reject $H_0$ at the 5% level of significance. We conclude that there is no significant evidence that choice of colour depends on age.
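The observed value 9.06 quoted in part (e) is not derived in the solution. As a worked check, using the observed and expected frequencies from the completed table in part (b) and summing row by row:

```latex
\begin{align*}
\chi^2_{\text{obs}}
  &= \frac{(20-16)^2}{16} + \frac{(15-16)^2}{16} + \frac{(5-8)^2}{8}
   + \frac{(30-24)^2}{24} + \frac{(20-24)^2}{24} + \frac{(10-12)^2}{12} \\
  &\quad + \frac{(30-40)^2}{40} + \frac{(45-40)^2}{40} + \frac{(25-20)^2}{20} \\
  &= 1 + 0.0625 + 1.125 + 1.5 + 0.6667 + 0.3333 + 2.5 + 0.625 + 1.25 \\
  &= 9.0625 \approx 9.06 .
\end{align*}
```

The two largest terms come from the over-40 row (Red, 2.5) and the 25-40 row (Red, 1.5), but the total still falls short of the critical value 9.49.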

8. Black et al. Exercises 12.27 and 12.29.

Solution:

(a) Black et al. 12.27


The hypotheses of interest here are

$H_0$: Proportion of households with internet access is not dependent on whether they have children under the age of 15 for the period 1989 to 2003.

$H_1$: Proportion of households with internet access is dependent on whether they have children under the age of 15 for the period 1989 to 2003.

Note that all the expected frequencies are greater than 5. df = (2 - 1)(6 - 1) = 5. The p-value of the test = $P(\chi^2_5 > 0.13) = 0.9997$ (from Excel), so there is insufficient evidence against the null hypothesis. We conclude that there is no evidence that the proportion of households with internet access differs between those with children under 15 and those without children under 15 in the period 1989 to 2003.

(b) Black et al. 12.29

$H_0$: Gender and colour preference for cars are independent

$H_1$: Gender and colour preference for cars are not independent

To test this hypothesis, we use the chi-squared test of independence. The observed chi-squared value (from the test statistic) is 5.366. The p-value is 0.252 > 0.05, so there is insufficient evidence to reject the null hypothesis. (The critical value at the 5% level of significance with (5 - 1)(2 - 1) = 4 degrees of freedom is 9.4877. Since the observed value does not lie in the critical region, we do not reject the null hypothesis.) Therefore, there is not enough evidence provided by the data to suggest that colour preference is dependent on gender. Marketing agencies do not have to model colour as a factor when trying to sell cars to either gender. Also, manufacturers can determine car colour quotes on another basis, instead of gender preference.
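The p-value 0.252 above is taken from software. For an even number of degrees of freedom it can also be checked by hand, because the upper tail of the chi-squared distribution then has a closed form. The sketch below is an illustration under that restriction only; the function name `chi2_sf_even_df` is hypothetical and it is not a general replacement for statistical tables or Excel.

```python
from math import exp

def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability P(chi2_df > x), valid only for even df.

    For df = 2k the chi-squared survival function has the closed form
    exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only applies to positive even df")
    half_x = x / 2
    term, total = 1.0, 0.0
    for j in range(df // 2):
        total += term               # term equals (x/2)^j / j!
        term *= half_x / (j + 1)
    return exp(-half_x) * total

# p-value for the observed statistic 5.366 on 4 degrees of freedom.
p_value = chi2_sf_even_df(5.366, 4)
print(round(p_value, 3))

# The tabulated 5% critical values for df = 2, 4, 6 should each leave
# an upper-tail area of about 0.05.
for crit, df in [(5.99, 2), (9.49, 4), (12.59, 6)]:
    print(df, round(chi2_sf_even_df(crit, df), 3))
```

For odd degrees of freedom (such as df = 5, 3 or 9 elsewhere in this chapter) the formula does not apply, and tables or software are still needed.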