CHAPTER 12
More on regression analysis

Learning objectives
12.1 Understand the role of dummy variables to represent qualitative explanatory variables and use them in regression.
12.2 Test for differences between the categories of a qualitative variable.
12.3 Calculate and interpret confidence intervals and prediction intervals, to allow inferences about the regression coefficients.
12.4 Explain the role of the assumptions on the OLS estimators.
12.5 Describe common violations of the assumptions and offer remedies.
Copyright © 2016 McGraw-Hill Education (Australia) Pty Ltd
Jaggia, Essentials of Business Statistics, 1e
Updated: Azizur Rahman
1/14/2019
• For the sake of simplicity, consider a model containing one quantitative explanatory variable and one dummy variable.
y = β0 + β1x1 + β2d + ε
• Conducting a standard ordinary least squares (OLS) regression will yield an estimated equation of
ŷ = b0 + b1x1 + b2d.
• For a given x1 and d = 0, we compute ŷ as
ŷ = b0 + b1x1 + b2(0) = b0 + b1x1.
• Similarly, when d = 1,
ŷ = b0 + b1x1 + b2(1) = (b0 + b2) + b1x1.
• The dummy variable allows a shift in the intercept term, enabling us to use a single regression equation to represent both categories of the qualitative variable.
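The intercept shift can be sketched in a few lines of Python. The coefficient values and the helper name `predict` are made up for illustration and do not come from the salary data.

```python
# Hypothetical estimates for y-hat = b0 + b1*x1 + b2*d (illustrative values only).
B0, B1, B2 = 50.0, 1.5, 18.0


def predict(x1, d):
    """Return y-hat for a quantitative value x1 and a 0/1 dummy d."""
    return B0 + B1 * x1 + B2 * d


for x1 in (0.0, 5.0, 10.0):
    y_d0 = predict(x1, 0)  # line with intercept b0
    y_d1 = predict(x1, 1)  # line with intercept b0 + b2
    # The gap between the two lines is b2 at every x1: same slope, shifted intercept.
    print(f"x1={x1}: d=0 -> {y_d0:.1f}, d=1 -> {y_d1:.1f}, gap = {y_d1 - y_d0:.1f}")
```

Because the gap equals b2 for every value of x1, the two categories share one slope and differ only in the intercept.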
Graphically, we can see how the dummy variable shifts the intercept of the regression line.
• Example: Evidence of gender pay discrimination?
– The introductory case has two qualitative variables, gender and age group. To measure the impact of gender and age on salary, we need to create two dummy variables.
Let d1 = 1 if the professor is male; 0 if female.
Let d2 = 1 if the professor is 60 or over; 0 if under 60.
LO 12.2 Qualitative variables with two categories
• Example: Evidence of gender pay discrimination?
– Is there a gender effect in the salary study?
H0: β2 = 0 (males and females are paid the same)
HA: β2 ≠ 0 (there is a difference due to gender)
– Given a value of the tdf test statistic of 4.86 and a p-value of approximately 0.00, we reject the null hypothesis and conclude that the gender dummy variable is significant.
• For the age coefficient, tdf is 0.94 and the p-value is 0.36, so we do not reject the null hypothesis. The evidence suggests that professors aged 60 or over do not have significantly different salaries from those under 60.
• Sometimes a qualitative variable may be described by more than two categories.
• In such cases we use multiple dummy variables to capture the effect of the variable.
– For example, suppose we divide the mode of transport used by commuters into three categories: public transport, driving and park-and-ride.
– We then define two dummy variables, d1 and d2, where d1 equals 1 to denote public transport and 0 otherwise, and d2 equals 1 to denote driving and 0 otherwise. Park-and-ride is captured when both d1 and d2 equal 0.
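The three-category coding for mode of transport can be sketched as follows. This is an illustration only: the function name `encode_mode` and the sample of commuters are assumptions, not part of the original example.

```python
MODES = ("public transport", "driving", "park-and-ride")


def encode_mode(mode):
    """Map a transport mode to the dummy pair (d1, d2).

    Park-and-ride is the reference category, so it maps to (0, 0).
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    d1 = 1 if mode == "public transport" else 0
    d2 = 1 if mode == "driving" else 0
    return d1, d2


# Made-up sample of commuters, for illustration only.
commuters = ["driving", "public transport", "park-and-ride", "driving"]
encoded = [encode_mode(m) for m in commuters]
print(encoded)  # [(0, 1), (1, 0), (0, 0), (0, 1)]
```

Three categories need only two dummies because the third category is identified by both dummies being 0.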
• Our regression model for the mode of transport example would then be
[equation not recoverable from the source]
• Given the intercept term, we exclude one of the dummy variables from the regression.
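One way to see why a dummy variable must be excluded: with an intercept and one dummy per category, the dummies add up to the intercept column in every row. That is an exact linear relationship among the explanatory variables (perfect multicollinearity). A minimal Python sketch with made-up observations follows; the third dummy d3 (park-and-ride) is hypothetical and is introduced only to show the problem.

```python
MODES = ["public transport", "driving", "park-and-ride"]


def full_dummies(mode):
    """Return one 0/1 dummy per category (d1, d2, d3), leaving no category out."""
    return [1 if mode == m else 0 for m in MODES]


# Made-up sample, for illustration only.
sample = ["driving", "park-and-ride", "public transport", "driving"]

# Columns: intercept, d1, d2, d3.
rows = [[1] + full_dummies(m) for m in sample]

# With all three dummies, d1 + d2 + d3 equals the intercept column in every row.
full_is_redundant = all(r[0] == sum(r[1:]) for r in rows)
print(full_is_redundant)  # True

# Dropping d3 breaks that identity: the park-and-ride row has d1 + d2 = 0.
reduced = [r[:3] for r in rows]
reduced_is_redundant = all(r[0] == sum(r[1:]) for r in reduced)
print(reduced_is_redundant)  # False
```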
Interval estimates for the response variable
LO 12.3 Calculate and interpret confidence intervals and prediction intervals, to allow inferences about the regression coefficients.
• Once we have developed a regression model, we often want to use it to make predictions.
• In the academic salary example, what salary would we predict for a male professor with 10 years of experience? Inserting these values into our estimated regression equation, we find:
Salary(predicted) = ŷ = 54.011 + 1.503(10) + 18.541(1) + 5.772(0) = 87.554, that is, $87,554.
LO 12.3 Interval estimates for the response variable
• But this is only a point estimate and ignores sampling error. We can also provide interval estimates.
• We will develop two types of interval estimates regarding y:
– A confidence interval for the expected value of y
– A prediction interval for an individual value of y.
• It is common to refer to the first as a confidence interval and the second as a prediction interval.
• The point estimate of E(y0) is just the ŷ value.
ŷ0 = b0 + b1x10 + b2x20 + … + bkxk0
• The confidence interval, as always, includes the point estimate, plus or minus the margin of error.
ŷ0 ± tα/2,df se(ŷ0)
• The term se(ŷ0) is the standard error of the prediction. Though difficult to compute by hand if there is more than one explanatory variable in the model, we will develop a procedure to compute it with a statistical package.
• Many statistics programs will compute confidence intervals, but Excel's data analysis tools do not.
• Here is a method you can use instead. Shift the value of each explanatory variable in your data set by the value of interest for that variable:
x1* = x1 - x10, x2* = x2 - x20, …, xk* = xk - xk0
• When we estimate this modified regression, the resulting estimate of the intercept and its standard error equal ŷ0 and se(ŷ0), respectively.
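The shifting method can be checked numerically. The sketch below is a minimal illustration for the one-variable case only, using made-up data and a hypothetical helper `simple_ols`. It compares the intercept row of the shifted regression with the point estimate and the textbook standard error formula for simple regression, se(ŷ0) = s_e √(1/n + (x0 - x̄)² / Σ(x - x̄)²).

```python
import math


def simple_ols(x, y):
    """OLS for y = b0 + b1*x; returns (b0, b1, se_b0, s_e)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))                     # standard error of the estimate
    se_b0 = s_e * math.sqrt(1 / n + x_bar ** 2 / sxx)  # standard error of the intercept
    return b0, b1, se_b0, s_e


# Made-up data, for illustration only.
x = [2.0, 4.0, 5.0, 7.0, 9.0, 12.0]
y = [58.0, 61.5, 60.0, 66.0, 68.5, 74.0]
x0 = 10.0  # value of interest

# Direct route: point estimate plus the se formula for simple regression.
b0, b1, _, s_e = simple_ols(x, y)
n = len(x)
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
y0_direct = b0 + b1 * x0
se_direct = s_e * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)

# Shifted route: regress y on x* = x - x0 and read off the intercept row.
b0_star, _, se_b0_star, _ = simple_ols([xi - x0 for xi in x], y)

print(round(y0_direct, 4), round(b0_star, 4))     # the two point estimates agree
print(round(se_direct, 4), round(se_b0_star, 4))  # the two standard errors agree
```

The slope and the residuals are unchanged by the shift; only the intercept row changes, which is why it can be read directly as ŷ0 and se(ŷ0).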
• Example: Evidence of gender pay discrimination?
– In the academic salary example, we first shift the data by our hypothesised values.
– Estimating the modified regression now reveals the confidence interval.
• Example:
– To summarise, after shifting the explanatory variables, the intercept row in the regression output gives us all the information we need. The 95% confidence interval is given in the same row.
– A 95% confidence interval for the mean salary of men with 10 years of experience:
ŷ0 ± tα/2,df se(ŷ0) = 87.406 ± 2.023 × 2.868 = 87.406 ± 5.802.
– With 95% confidence, we can state that the mean salary of all male professors with 10 years of experience falls between $81,604 and $93,208.
• If we want to compute an interval with a different confidence level, we simply need to find the correct tα/2,df statistic and insert the intercept and standard error of the intercept from the same regression, or alternatively, specify a different confidence level in Excel's Regression dialog box.
• The formula for the prediction interval is
ŷ0 ± tα/2,df √[(se(ŷ0))² + s_e²],
where s_e is the standard error of the estimate.
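The confidence interval arithmetic can be reproduced with a short Python sketch. The function name `confidence_interval` is an assumption for illustration. The inputs are the intercept 87.406 and t value 2.023 from the example, with the standard error taken as 2.868, the value that reproduces the stated margin of 5.802.

```python
def confidence_interval(y0_hat, t_crit, se_y0):
    """Return (lower, upper) for y0_hat ± t_crit * se(y0_hat)."""
    margin = t_crit * se_y0
    return y0_hat - margin, y0_hat + margin


# Salary example: point estimate, t critical value, standard error of the prediction.
lower, upper = confidence_interval(87.406, 2.023, 2.868)
print(f"{lower:.3f} to {upper:.3f}")  # 81.604 to 93.208
```

For a different confidence level, only `t_crit` changes; the point estimate and standard error come from the same regression.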
• The point estimate and the standard error of the prediction are computed using the same technique as for the confidence interval.
• Now we need to include the standard error of the estimate in the margin of error calculation.
• Example: Prediction interval for salary
– For the introductory case, to compute the prediction interval for a man with 10 years of experience, we simply insert the appropriate values from the previous example, plus the standard error of the estimate, 9.133.
87.406 ± 2.023 × √(2.868² + 9.133²) = (68.044, 106.768)
– With 95% confidence, we can state that the salary of a male professor with 10 years of experience falls between $68,044 and $106,768.
– Remember that the prediction interval is an interval estimate for one man with this experience, while the confidence interval pertains to the average of all men with this much experience.
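As a rough check of the prediction interval, the sketch below plugs the rounded values from the example into the formula. The function name `prediction_interval` is an assumption. Because the inputs are rounded to three decimals, the result can differ from the quoted bounds in the third decimal place.

```python
import math


def prediction_interval(y0_hat, t_crit, se_y0, s_e):
    """Return (lower, upper) for y0_hat ± t_crit * sqrt(se(y0_hat)^2 + s_e^2)."""
    margin = t_crit * math.sqrt(se_y0 ** 2 + s_e ** 2)
    return y0_hat - margin, y0_hat + margin


# Salary example: point estimate, t value, se of the prediction, se of the estimate.
lower, upper = prediction_interval(87.406, 2.023, 2.868, 9.133)
print(f"{lower:.2f} to {upper:.2f}")

# The prediction margin always exceeds the confidence margin, because s_e^2 is added.
pi_margin = (upper - lower) / 2
ci_margin = 2.023 * 2.868
print(pi_margin > ci_margin)  # True
```

The wider margin reflects the extra uncertainty in one individual's salary compared with the mean salary of the group.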
1. The model given by y = β0 + β1x1 + … + βkxk + ε is linear in the parameters β0, β1, …, βk.
2. Conditional on x1, …, xk, E(ε) = 0, thus E(y) = β0 + β1x1 + … + βkxk.
3. There is no exact linear relationship among the explanatory variables (i.e. no perfect multicollinearity).
6. The error term ε is not correlated with any of the predictors x1, …, xk. In other words, no relevant explanatory variable has been excluded from the model.
7. The error term ε is normally distributed. This assumption allows us to do hypothesis testing. If the errors are not normally distributed, our tests may not be valid.