
QMM5100 W19 Multiple Regression HW Solution

1) It appears that over the past 45 years, the number of farms in the United States declined while the
average size of farms increased. The data, provided by the U.S. Department of Agriculture,
appear in the Farms worksheet in the Regression Modeling HW data workbook on Moodle. The
data show five-year interval data for US farms.

a) Use JMP to fit the simple regression model using Average Size as the dependent or Y variable
and number of farms as the independent or X variable. Is the model statistically significant at
α = 0.05?

The model is statistically significant.

b) Interpret the regression model parameters (intercept and slope) in the context of relating
farms to size.

Intercept – This would be the average farm size when there are no farms. This has no
practical meaning in the context of this problem and is outside the scope of the data.

Slope – This represents the change in the average farm size (in acres) for a one unit (one
million) increase in number of farms. The negative coefficient indicates a negative
correlation or that the variables move in opposite directions. Notice from the raw
data that the number of farms has been decreasing over time so I would interpret this
coefficient in the context of a decrease. Interpretation: Each one million farm
decrease corresponds to an increase of 72.3 acres in the average farm size.


c) Perform a residual analysis. Do the residuals satisfy the model assumptions?

Use a histogram to check the normal assumption. Due to the small sample size, it is
difficult to assess normality with a histogram. The histogram to the right doesn’t look great,
but it might be ok.

Verifying this with a hypothesis test, you fail to reject Ho and cannot conclude that the residuals
are not Normally distributed. In other words, there is not enough deviation from a Normal
distribution to violate the regression model assumption.

Use a scatter plot of the residuals versus the independent variable to determine if a
relationship exists between the residuals and the independent variable. The scatter plot to
the right suggests there may be some type of non-linear relationship in the data. The
Durbin-Watson test shows that there is a significant positive autocorrelation in the
residuals. Therefore, the data violates the regression model assumption.
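
As a rough illustration of what the Durbin-Watson statistic measures, the sketch below computes it directly from a list of residuals (sum of squared successive differences divided by the sum of squared residuals). The residual values are hypothetical and are not the Farms data; values well below 2 point toward positive autocorrelation, values well above 2 toward negative autocorrelation.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals (NOT taken from the Farms worksheet).
trending = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]      # long runs of the same sign
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # sign flips every period

print(durbin_watson(trending))     # well below 2
print(durbin_watson(alternating))  # well above 2
```

JMP additionally reports a p-value for the statistic; this sketch only reproduces the statistic itself.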


2) Larry Swattle, a local attorney, is interested in understanding the relationship between the outside
temperature and his energy usage. Larry finds his electric bills for the last 24 months and records the
electric consumption in Kilowatt Hours. He then uses the internet to find the average monthly
temperature in Fahrenheit for his city for those months. The data appear in the Electricity worksheet
of the Multiple Regression HW data workbook on Moodle.

a) Fit the simple linear regression model using monthly electric consumption as the dependent or
Y variable and average daily temperature as the independent or X variable. Analyze the
residuals to determine if they satisfy the three regression model assumptions.

Use a histogram to check the normal assumption. The distribution appears more uniform than
Normal.

Fitting a normal distribution to the residuals allows performing a goodness of fit test. At
α = 0.05, you cannot conclude that the residuals do not follow a Normal distribution, i.e.,
with the goodness of fit test you would conclude the Normal assumption is ok.

Use a scatter plot of the residuals versus the independent variable to determine if a
relationship exists between the residuals and the independent variable. The scatter plot to
the right does not suggest a trend in the data.

Use a line chart of the residuals to determine if there is autocorrelation or if the residuals
are correlated with themselves. The line chart to the right looks ok.

b) Calculate the Durbin Watson statistic. Do the residuals show autocorrelation?


Since the p-value is not less than α = 0.05, cannot conclude the errors are autocorrelated.

c) Predict electric consumption for a month with an average temperature of 60.

Prediction: 586.13

d) Determine the leverage for each observation. Do any of the data points have excess leverage?

Since n = 24 the critical value is 6/24 = 0.25.


None of the data points have excess leverage.
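
The leverage check can be sketched in a few lines for a simple regression with one X variable, where the leverage of observation i is 1/n + (x_i - mean)^2 / Sxx. The temperatures below are hypothetical and are not the Electricity worksheet data, and the `multiplier` argument is an assumption added so that other cutoffs (for example 4/n) can be tried.

```python
def leverages(x):
    """Leverage of each observation in a simple (one-X) linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]


def high_leverage(x, multiplier=6.0):
    """Indices of observations whose leverage exceeds multiplier / n."""
    cutoff = multiplier / len(x)
    return [i for i, h in enumerate(leverages(x)) if h > cutoff]


# Hypothetical temperatures (NOT the Electricity data); the last value is extreme.
temps = [30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 150]

print(high_leverage(temps))  # index of the extreme temperature only
```

With n = 24 the default cutoff is 6/24 = 0.25, matching the critical value used above.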

e) Remove any data points with excess leverage and fit the simple linear regression model
using monthly electric consumption as the dependent or Y variable and average daily
temperature as the independent or X variable. Using the model without points that have
excess leverage, predict electric consumption for a month with an average temperature of 60. Do
the point(s) with excess leverage appear to unduly influence the model?

There were no points with excess leverage.


3) The quality of the orange juice produced by a manufacturer (e.g. Minute Maid, Tropicana) is
constantly monitored. There are numerous sensory and chemical components that combine
to make the best tasting orange juice. For example, one manufacturer has developed a
quantitative index of the “sweetness” of orange juice where the higher the index, the sweeter
the juice. To determine if there is a relationship between the sweetness index and the parts
per million (ppm) of water soluble pectin (a chemical measure of the juice) data was
collected on 24 production runs at a juice plant. The data appear in the Sweetness worksheet
of the Multiple Regression HW data workbook on Moodle.

a) Use JMP to fit the simple regression model using the sweetness index as the Y variable and
pectin as the X variable. Is the model statistically significant at α = 0.05?

The model is statistically significant.

b) What percentage of the variation in sweetness index has been explained by the linear
relationship with pectin?

The model explains 22.86% of the variation in sweetness index with pectin.
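
For readers who want to see where a percentage like this comes from, the following minimal sketch fits a least-squares line and computes R-squared as 1 - SSE/SST. The x and y values are hypothetical and are not the Sweetness worksheet data.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a simple regression."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y):
    """Share of the variation in y explained by the fitted line."""
    intercept, slope = fit_line(x, y)
    mean_y = sum(y) / len(y)
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - sse / sst


# Hypothetical data (NOT the Sweetness worksheet).
x_demo = [1.0, 2.0, 3.0, 4.0]
y_demo = [2.0, 1.0, 4.0, 3.0]

print(round(r_squared(x_demo, y_demo), 2))  # 0.36, i.e. 36% of variation explained
```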

c) Analyze the residuals to determine if they satisfy the three regression model assumptions.

Use a histogram to check the normal assumption. The histogram to the right may deviate from
what one would expect from a normal distribution.

To determine if this deviation is significant use a goodness-of-fit test. For the
goodness-of-fit test you fail to reject Ho, cannot conclude that the residuals do not follow a
Normal distribution.


Use a scatter plot of the residuals versus the independent variable to determine if a
relationship exists between the residuals and the independent variable. The scatter plot to
the right doesn’t show any apparent relationship.

d) Use a hypothesis test to determine if any of the data points have too much leverage.

Critical Value: 4/n = 4/24 = 0.167

For runs 11 and 16 the Test Statistic exceeds the critical value, therefore runs 11 and 16
may have undue influence on the model.


e) Remove any data points that have excess leverage and refit the regression model. Does
removing these data points change the model?

Without runs 11 and 16 the regression model is no longer significant.

4) A medical researcher is studying percent body fat. As part of the study, the researcher takes
data on 50 males aged 22 to 50. The dataset consists of the following variables: Fat% = percent
body fat, Age = age (yrs), Weight = weight (lbs), Height = height (in.), Neck = neck
circumference (cm), Chest = chest circumference (cm), Abdomen = abdomen circumference (cm),
Hip = hip circumference (cm), Thigh = thigh circumference (cm). The data appear in the BodyFat
worksheet of the Multiple Regression HW data workbook on Moodle.

a) Use JMP to fit the multiple regression model using Weight as the dependent or Y variable and
the other 8 variables as the independent or X variables. Is the model statistically significant
at α = 0.05? Are all 8 independent variables significant?

The model is statistically significant. Age and Chest are not significant with the other 6
variables significant.


b) Remove insignificant independent variables one at a time until all remaining variables in the
model are significant. Provide the equation of the fit line. Interpret the regression model
parameters in the context of predicting weight.

Weight = -362.48 + 2.574 Height + 2.948 Neck + 0.692 Abdomen + 1.856 Hip

Intercept: You would expect a person 0 inches tall with a neck, abdomen and hips that all
have 0 circumference to weigh -362.48 pounds. This value has no meaning in this
problem.

Height: Each additional inch of height adds 2.574 pounds to the average weight, holding
Neck, Abdomen and Hip circumference constant.

Neck: Each additional centimeter of neck circumference adds 2.948 pounds to the
average weight, holding Height and Abdomen and Hip circumference constant.

Abdomen: Each additional centimeter of abdomen circumference adds 0.692 pounds to the
average weight, holding Height and Neck and Hip circumference constant.

Hip: Each additional centimeter of hip circumference adds 1.856 pounds to the average
weight, holding Height and Neck and Abdomen circumference constant.

c) Provide a point prediction for the weight of a male 70 inches tall with a 39cm neck, 94cm
abdomen and a 100cm hip.

Weight = -362.48 + 2.574 (70) + 2.948 (39) + 0.692 (94) + 1.856 (100) = 183.33
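
The point prediction is plain arithmetic on the fitted equation, which the short sketch below reproduces. It uses the rounded coefficients printed in the equation, so the result can differ from the reported value in the last decimal place; the function and variable names are illustrative only.

```python
# Rounded coefficients copied from the fitted equation above.
INTERCEPT = -362.48
COEFFICIENTS = {"Height": 2.574, "Neck": 2.948, "Abdomen": 0.692, "Hip": 1.856}


def predict_weight(measurements):
    """Intercept plus the coefficient-weighted sum of the measurements."""
    return INTERCEPT + sum(
        COEFFICIENTS[name] * measurements[name] for name in COEFFICIENTS
    )


# Height in inches, circumferences in centimeters, as in part c).
example = {"Height": 70, "Neck": 39, "Abdomen": 94, "Hip": 100}

print(round(predict_weight(example), 2))  # about 183.32 with rounded coefficients
```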


d) Provide a 95% confidence interval for the average weight of males 70 inches tall with
20% body fat.

180.40 ; 186.27

e) Use a hypothesis test to determine if an additional 1cm in neck circumference corresponds to
expected increase of more than 2 pounds, i.e., test the hypothesis
Ho: β_Neck ≤ 2 versus H1: β_Neck > 2

Reject Ho, conclude that an additional 1cm in neck circumference corresponds to an expected
increase of more than 2 pounds in weight.


5) The State of Ohio Department of Education has a mandated ninth-grade proficiency test that
covers writing, reading, mathematics, citizenship (social studies) and science. The Ohio
worksheet in the Multiple Regression HW data workbook contains proficiency test results for 31
school districts in Ohio for 2000. The dataset contains the percent of students in the district
passing each of the 5 test sections (listed above) and the percent of students who passed all
five sections.

a) Use JMP to fit the multiple regression model using All as the dependent or Y variable and
the other five variables as the independent or X variables. Is the model statistically
significant at α = 0.05? Are all five independent variables statistically significant
at α = 0.05?

The regression model is significant. The Math and Writing variables are significant
while Reading, Citizenship and Science are not significant.

b) Continue to use All as the dependent or Y variable. Remove the insignificant variables one at
a time (removing the variable with the highest p-value) and fit regression models until you
obtain a model with only significant independent or X variables. Provide the output for your
final model.

c) What percentage of the variation in All has been explained by the linear relationship with
the independent variables in your model?

R-squared = 0.9631, which means that 96.31% of the variation in All has been explained by the
linear relationship with the three independent variables.


d) Provide the equation of the fit line. Interpret the regression model parameters in the
context of passing proficiency tests.

All = -40.401 + 0.328 Writing + 0.616 Math + 0.338 Science

Intercept: A school district with 0 percent of students passing Math, Science and Writing would
expect -40.401 percent of students to pass all five sections. This value may not have
meaning in the context of this problem.

Writing: Each additional one percent of students that pass Writing corresponds to an increase
of 0.328 percentage points in students passing all sections, everything else being equal.

Math: Each additional one percent of students that pass Math corresponds to an increase of
0.616 percentage points in students passing all sections, everything else being equal.

Science: Each additional one percent of students that pass Science corresponds to an increase
of 0.338 percentage points in students passing all sections, everything else being equal.

e) Use a hypothesis test to determine if the increase in students passing All sections for a one
percentage increase of students passing Math is greater than 0.50%, i.e., test the hypothesis
Ho: β_Math ≤ 0.50 versus H1: β_Math > 0.50 using α = 0.05.

Fail to reject Ho, cannot conclude the increase in students passing All sections for a one
percentage increase of students passing Math is greater than 0.50%.
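
The mechanics of this kind of one-sided slope test can be sketched as follows: the test statistic is (estimate - hypothesized value) / standard error, compared with an upper-tail t critical value. The standard error below is a hypothetical placeholder, since the JMP output is not reproduced here; the critical value 1.703 is the usual table value for a one-sided test at 0.05 with 27 degrees of freedom (31 districts minus 3 predictors minus 1).

```python
def slope_t_statistic(estimate, hypothesized, std_error):
    """t statistic for testing a regression slope against a hypothesized value."""
    return (estimate - hypothesized) / std_error


def reject_upper_tail(t_stat, critical_value):
    """Reject Ho in an upper-tail test when t exceeds the critical value."""
    return t_stat > critical_value


MATH_ESTIMATE = 0.616     # coefficient from the fitted equation
HYPOTHESIZED = 0.50       # value under Ho
ASSUMED_STD_ERROR = 0.10  # HYPOTHETICAL: not taken from the JMP output
T_CRITICAL = 1.703        # one-sided table value, alpha = 0.05, 27 df

t_stat = slope_t_statistic(MATH_ESTIMATE, HYPOTHESIZED, ASSUMED_STD_ERROR)
print(round(t_stat, 2), reject_upper_tail(t_stat, T_CRITICAL))
```

With the actual standard error from the regression output substituted for the placeholder, the same comparison gives the decision for the test.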
