Chapter: More On Regression Analysis

1/14/2019
CHAPTER 12
More on regression analysis

Learning objectives
12.1 Understand the role of dummy variables to represent qualitative explanatory variables and use them in regression.
12.2 Test for differences between the categories of a qualitative variable.
12.3 Calculate and interpret confidence intervals and prediction intervals, to allow inferences about the regression coefficients.
12.4 Explain the role of the assumptions on the OLS estimators.
12.5 Describe common violations of the assumptions and offer remedies.
Copyright © 2016 McGraw-Hill Education (Australia) Pty Ltd
Jaggia, Essentials of Business Statistics, 1e
Updated: Azizur Rahman
Is there evidence of gender pay discrimination?
• Worldwide studies have documented gender differences in wages and that female academics received lower pay than their male colleagues.
• Numerous studies have focused on salary differences between men and women, indigenous and non-indigenous, and young and old Australians.
• Joanna Smith works in human resources at a large university.
• After the release of the latest Australian Bureau of Statistics data, the university asked her to test for both gender and age discrimination in salaries.
• She gathers data on 42 professors, including the salary, experience, gender and age of each.
• Using this data set, Joanna hopes to:
– Determine whether there is evidence of gender discrimination in salaries
– Determine whether there is evidence of age discrimination in salaries.
Dummy variables
LO 12.1 Understand the role of dummy variables to represent qualitative explanatory variables and use them in regression.
• In Chapter 11, all the variables used in regression applications are quantitative.
• In empirical work it is common to have some variables that are qualitative: the values represent categories that may have no implied ordering.
• We can include these factors in a regression through the use of dummy variables.
• A dummy variable for a qualitative variable with two categories assigns a value of 1 for one of the categories and a value of 0 for the other.
• For example, suppose we are interested in teen behaviour. We might first define a dummy variable d that has the following structure:
Let d = 1 if age is between 13 and 19
and d = 0 if age is anything else.
• This would allow us to capture the role of being a teenager in a regression model and quantify its impact.
• For the sake of simplicity, consider a model containing one quantitative explanatory variable and one dummy variable.
y = β0 + β1x1 + β2d + ε
• Conducting a standard ordinary least squares (OLS) regression will yield an estimated equation of
ŷ = b0 + b1x1 + b2d.
• For a given x1, and d = 0, we compute ŷ as
ŷ = b0 + b1x1 + b2(0) = b0 + b1x1.
• Similarly, when d = 1
ŷ = b0 + b1x1 + b2(1) = (b0 + b2) + b1x1.
• The dummy variable allows a shift in the intercept term, enabling us to use a single regression equation to represent both categories.
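The intercept shift described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are hypothetical and noise-free (generated from y = 50 + 2x + 10d), and the helper names `solve` and `ols` are made up for this sketch rather than taken from the text or from any statistics package.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Least-squares coefficients; each design row starts with 1 for the intercept."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical, noise-free data generated from y = 50 + 2*x + 10*d
x = [1, 3, 5, 7, 2, 4, 6, 8]
d = [0, 0, 0, 0, 1, 1, 1, 1]
y = [50 + 2 * xi + 10 * di for xi, di in zip(x, d)]

b0, b1, b2 = ols([[1, xi, di] for xi, di in zip(x, d)], y)

intercept_d0 = b0        # intercept of the line for d = 0
intercept_d1 = b0 + b2   # intercept of the line for d = 1 (same slope b1)
print(round(intercept_d0, 3), round(intercept_d1, 3), round(b1, 3))
```

Both categories share the slope b1; only the intercept differs, by b2.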
• Graphically, the dummy variable shifts the intercept of the regression line: the lines for d = 0 and d = 1 are parallel, with intercepts b0 and b0 + b2.
• Example: Evidence of gender pay discrimination?
– The introductory case has two qualitative variables, gender and age group. To measure the impact of gender and age on salary, we need to create two dummy variables.
Let d1 = 1 if the professor is male; 0 if female
Let d2 = 1 if the professor is 60 or over; 0 if under 60.
• Example:
– The estimated equation is
ŷ = 54.011 + 1.503x + 18.541d1 + 5.772d2
– The difference in salary between a male and a female professor is captured in the coefficient of d1. A male professor, on average, makes $18,541 more than a female with comparable experience.
– The age coefficient, though statistically insignificant in this case, would have a similar interpretation.

Qualitative variables with two categories
LO 12.2 Test for differences between the categories of a qualitative variable.
• The statistical tests discussed in Chapter 11 remain valid for dummy variables as well.
• We can perform a t test for individual significance, form a confidence interval using the parameter estimate and its standard error, and conduct a partial F test for joint significance.
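To make the interpretation of the d1 coefficient concrete, the sketch below evaluates the estimated equation from the salary example for a male and a female professor with the same experience. The function name `predicted_salary` and the choice of 15 years of experience are assumptions made for illustration; salaries are in thousands of dollars, as in the slides.

```python
def predicted_salary(experience, male, aged_60_plus):
    """Predicted salary in $1,000s from the estimated equation in the salary example."""
    return 54.011 + 1.503 * experience + 18.541 * male + 5.772 * aged_60_plus


years = 15  # hypothetical experience level, identical for both professors
male_salary = predicted_salary(years, male=1, aged_60_plus=0)
female_salary = predicted_salary(years, male=0, aged_60_plus=0)

# With experience and age group held fixed, the gap equals the d1 coefficient.
gap = male_salary - female_salary
print(f"male: {male_salary:.3f}, female: {female_salary:.3f}, gap: {gap:.3f}")
```

Whatever experience level is chosen, the gap stays at 18.541, because the dummy variable only shifts the intercept.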
• Example: Evidence of gender pay discrimination?
– Is there a gender effect in the salary study?
H0: β2 = 0 (males and females are paid the same)
HA: β2 ≠ 0 (there is a difference due to gender)
– Given a value of the t_df test statistic of 4.86 and p-value of approximately 0.00, we reject the null hypothesis and conclude that the gender dummy variable is significant.
– For the age coefficient, t_df is 0.94 and the p-value is 0.36, so we do not reject the null hypothesis. The evidence suggests that professors aged 60 or over do not have significantly different salaries, compared to those under 60.

• Sometimes a qualitative variable may be described by more than two categories.
• In such cases we use multiple dummy variables to capture the effect of the variable.
– For example, suppose we divide the mode of transport used by commuters into three categories: public transport, driving and park-and-ride.
– We then define two dummy variables, d1 and d2, where d1 equals 1 to denote public transport and 0 otherwise, and d2 equals 1 to denote driving and 0 otherwise. Park-and-ride is captured when both d1 and d2 equal 0.
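The p-values quoted for the gender and age t tests can be approximated with a short numerical sketch. It assumes df = n – k – 1 = 42 – 3 – 1 = 38 (the slides do not state the degrees of freedom) and integrates the Student's t density with Simpson's rule; the function names are hypothetical, and a statistical package would normally do this for you.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_stat, df, steps=2000):
    """Two-sided p-value, integrating the density from 0 to |t| with Simpson's rule."""
    upper = abs(t_stat)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * (h / 3) * total  # P(-|t| < T < |t|)
    return 1 - central_area


df = 42 - 3 - 1  # assumption: 42 professors, 3 explanatory variables
p_gender = two_sided_p_value(4.86, df)
p_age = two_sided_p_value(0.94, df)
print(f"gender: p = {p_gender:.5f}, age: p = {p_age:.2f}")
```

Under these assumptions the gender p-value is far below 0.05 and the age p-value lands close to the 0.36 reported in the slides; small differences are expected because the reported t statistics are rounded.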
• Our regression model for the mode of transport example would then be
y = β0 + β1x + β2d1 + β3d2 + ε
and the estimated equation would be
ŷ = b0 + b1x + b2d1 + b3d2.
• Given the intercept term, we exclude one of the dummy variables from the regression.
• The excluded variable represents the reference category against which the others are assessed.
• If we included as many dummy variables as categories, this would create perfect multicollinearity in the data, and such a model cannot be estimated.
• So, we include one fewer dummy variable than the number of categories of the qualitative variable.
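The "one fewer dummy than categories" rule can be sketched as a small encoding helper. The function `dummy_columns` and the four sample commuters are hypothetical; park-and-ride is treated as the reference category, as in the transport example.

```python
def dummy_columns(values, reference):
    """Return one 0/1 column per category, excluding the reference category."""
    categories = sorted(set(values) - {reference})
    return {c: [1 if v == c else 0 for v in values] for c in categories}


# Hypothetical sample of four commuters
modes = ["public transport", "driving", "park-and-ride", "driving"]
dummies = dummy_columns(modes, reference="park-and-ride")

for name, column in dummies.items():
    print(name, column)
```

Three categories yield two dummy columns, and a park-and-ride commuter is the row where both columns are 0.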
Interval estimates for the response variable
LO 12.3 Calculate and interpret confidence intervals and prediction intervals, to allow inferences about the regression coefficients.
• Once we have developed a regression model, we often want to use it to make predictions.
• In the academic salary example, what salary would we predict for a male professor with 10 years of experience? Inserting these values into our estimated regression equation, we find:
Salary(predicted) = ŷ = 54.011 + 1.503(10) + 18.541(1) + 5.772(0) = 87.554, that is, $87,554.
• But this is only a point estimate and ignores sampling error. We can also provide interval estimates.
• We will develop two types of interval estimates regarding y:
– A confidence interval for the expected value of y
– A prediction interval for an individual value of y.
• It is common to refer to the first as a confidence interval and the second as a prediction interval.
• The point estimate of E(y⁰) is just the ŷ value.
ŷ⁰ = b0 + b1x1⁰ + b2x2⁰ + … + bkxk⁰
• The confidence interval, as always, includes the point estimate, plus or minus the margin of error.
ŷ⁰ ± tα/2,df se(ŷ⁰)
• The term se(ŷ⁰) is the standard error of the prediction. Though difficult to compute by hand if there is more than one explanatory variable in the model, we will develop a procedure to compute it with a statistical package.
• Many statistics programs will compute confidence intervals, but Excel's data analysis tools do not.
• Here is a method you can use instead. Shift the value of each explanatory variable in your data set by the value of interest for that variable:
x1* = x1 – x1⁰, x2* = x2 – x2⁰, …, xk* = xk – xk⁰
• When we estimate this modified regression, the resulting estimate of the intercept and its standard error equal ŷ⁰ and se(ŷ⁰), respectively.
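The shifting method can be checked on a small case. The sketch below uses a simple regression (one explanatory variable) with hypothetical data and a hypothetical helper `simple_ols`, so that everything has a closed form; it illustrates the idea and is not the multi-variable Excel procedure from the slides.

```python
import math


def simple_ols(x, y):
    """Return (b0, b1, se_b0) for the simple regression y = b0 + b1*x + e."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))                     # standard error of the estimate
    se_b0 = s_e * math.sqrt(1 / n + x_bar ** 2 / sxx)  # standard error of the intercept
    return b0, b1, se_b0


# Hypothetical data: experience (years) and salary ($1,000s)
x = [2, 5, 8, 12, 15, 20]
y = [60, 63, 70, 74, 83, 88]
x0 = 10  # value of interest

b0, b1, _ = simple_ols(x, y)
y_hat_0 = b0 + b1 * x0  # point estimate from the original regression

# Shift the explanatory variable and re-estimate
shifted = [xi - x0 for xi in x]
b0_star, b1_star, se_y_hat_0 = simple_ols(shifted, y)

print(round(y_hat_0, 4), round(b0_star, 4), round(se_y_hat_0, 4))
```

The intercept of the shifted regression matches the point estimate at x0, the slope is unchanged, and the intercept's standard error is the standard error of the prediction.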
• Example: Evidence of gender pay discrimination?
– In the academic salary example, we first shift the data by our hypothesised values.
– Estimating the modified regression now reveals the confidence interval.
• Example:
– To summarise, after shifting the explanatory variables, the intercept row in the regression output gives us all the information we need. The 95% confidence interval is given in the same row.
– A 95% confidence interval for the mean salary of a man with 10 years of experience:
ŷ⁰ ± tα/2,df se(ŷ⁰) = 87.406 ± 2.023 × 2.869 = 87.406 ± 5.802.
– With 95% confidence, we can state that the mean salary of all male professors with 10 years of experience falls between $81,603 and $93,209.
• If we want to compute an interval with a different confidence level, we simply need to find the correct tα/2,df statistic and insert the intercept and standard error of the intercept from the same regression, or alternatively, specify a different confidence level in Excel's Regression dialog box.
• The formula for the prediction interval is
$$\hat{y}^0 \pm t_{\alpha/2,df}\sqrt{\left(se(\hat{y}^0)\right)^2 + s_e^2}$$
where s_e is the standard error of the estimate.
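The confidence interval arithmetic from the salary example can be reproduced directly. The function name `confidence_interval` is hypothetical; the inputs (87.406, 2.023 and 2.869) are the values quoted in the slides, in thousands of dollars.

```python
def confidence_interval(y_hat, t_crit, se_pred):
    """Point estimate plus or minus t critical value times the standard error of the prediction."""
    margin = t_crit * se_pred
    return y_hat - margin, y_hat + margin


# Values quoted in the salary example (in $1,000s)
lower, upper = confidence_interval(y_hat=87.406, t_crit=2.023, se_pred=2.869)
print(f"95% confidence interval: [{lower:.3f}, {upper:.3f}]")
```

The result agrees with the reported bounds of $81,603 and $93,209 up to rounding of the inputs.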
• The point estimate and the standard error of the prediction are computed using the same technique as for the confidence interval.
• Now we need to include the standard error of the estimate in the margin of error calculation.
• Example: Prediction interval for salary
– For the introductory case, to compute the prediction interval for a man with 10 years of experience, we simply insert the appropriate values from the previous example, plus the standard error of the estimate, 9.133.
$$87.406 \pm 2.023\sqrt{2.868^2 + 9.133^2} = [68.044,\ 106.768]$$
– At the 95% prediction level, we can state that the salary of a male professor with 10 years of experience falls between $68,044 and $106,768.
– Remember that the prediction interval is an interval estimate for one man with this experience, while the confidence interval pertains to the average of all men with this much experience.
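The prediction interval calculation can be sketched the same way. The function name `prediction_interval` is hypothetical; the inputs are the values quoted in the slides (in thousands of dollars), and the confidence interval width is recomputed only for comparison.

```python
import math


def prediction_interval(y_hat, t_crit, se_pred, s_e):
    """Interval for an individual y: the margin also includes the standard error of the estimate."""
    margin = t_crit * math.sqrt(se_pred ** 2 + s_e ** 2)
    return y_hat - margin, y_hat + margin


# Values quoted in the salary example (in $1,000s)
lower, upper = prediction_interval(y_hat=87.406, t_crit=2.023, se_pred=2.868, s_e=9.133)
pi_width = upper - lower
ci_width = 2 * 2.023 * 2.868  # width of the matching confidence interval

print(f"95% prediction interval: [{lower:.3f}, {upper:.3f}]")
print(f"prediction interval width {pi_width:.2f} vs confidence interval width {ci_width:.2f}")
```

The prediction interval is much wider than the confidence interval because it must also cover the variability of an individual salary around the mean.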
Model assumptions and common violations
LO 12.4 Explain the role of the assumptions on the OLS estimators.
• The statistical properties of the OLS estimator, as well as the validity of the testing procedures, depend on a number of assumptions.
1. The model given by y = β0 + β1x1 + … + βkxk + ε is linear in the parameters β0, β1, …, βk.
2. Conditional on x1, …, xk, E(ε) = 0, thus E(y) = β0 + β1x1 + … + βkxk.
3. There is no exact linear relationship among the explanatory variables (i.e. no perfect multicollinearity).
4. The variance of the error term ε is the same for all x1, …, xk values. In other words, observations do not have a changing variability.
5. The error term ε is uncorrelated across observations. In other words, observations are not correlated.
6. The error term ε is not correlated with any of the predictors x1, …, xk. In other words, no relevant explanatory variables are excluded.
7. The error term ε is normally distributed. This assumption allows us to do hypothesis testing. If normality is not the case, our tests may not be valid.
• The true error terms ε cannot be observed because they exist only in the population. We can, however, look at the residuals, e = y – ŷ, where ŷ = b0 + b1x1 + b2x2 + … + bkxk, for each observation.
• It is common to plot the residuals on the vertical axis and an explanatory variable on the horizontal axis.
• When estimating a regression in Excel, the dialog box that opens when you select Data > Data Analysis > Regression allows you to choose Residuals and Residual Plots options.

Model assumptions and common violations
LO 12.5 Describe common violations of the assumptions and offer remedies.
Multicollinearity
• Perfect multicollinearity exists when two or more x variables exhibit an exact linear relationship.
• For example, suppose the x data includes total cost, fixed cost and variable cost. Because total cost is the sum of fixed cost and variable cost, the three variables are perfectly collinear.
• Other data sets may have a great degree of multicollinearity that is not perfect but still strong.
• In these cases we may see a high R² but individually insignificant explanatory variables. Other non-intuitive results may also be indicative.
• A sample correlation between explanatory variables greater than 0.80 or less than –0.80 suggests severe multicollinearity.
• A good remedy may be to simply drop one of the collinear variables if we can justify it as redundant.
• Alternatively, we could increase our sample size.
• Another option would be to try to transform our variables so that they are no longer collinear.
• Finally, especially if we are interested only in maintaining a high predictive power, it may make sense to do nothing.
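The 0.80 correlation guideline above can be turned into a quick screening check. This is a minimal sketch: the cost figures are hypothetical, and the helper names `correlation` and `severe_multicollinearity` are made up for illustration.

```python
import math


def correlation(a, b):
    """Sample (Pearson) correlation coefficient between two equal-length lists."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    var_a = sum((ai - a_bar) ** 2 for ai in a)
    var_b = sum((bi - b_bar) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def severe_multicollinearity(a, b, threshold=0.80):
    """Flag a pair of explanatory variables whose correlation exceeds the guideline."""
    return abs(correlation(a, b)) > threshold


# Hypothetical explanatory variables
fixed_cost = [10, 12, 11, 15, 14, 18]
variable_cost = [21, 25, 22, 31, 27, 37]
unrelated = [5, 1, 4, 2, 6, 3]

print(round(correlation(fixed_cost, variable_cost), 3),
      severe_multicollinearity(fixed_cost, variable_cost))
print(round(correlation(fixed_cost, unrelated), 3),
      severe_multicollinearity(fixed_cost, unrelated))
```

Taking the absolute value covers both halves of the guideline (greater than 0.80 or less than –0.80). A pairwise check like this cannot detect collinearity that involves three or more variables jointly.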

Changing variability
• Changing variability (heteroscedasticity) means that the variance of the error term changes for different values of at least one explanatory variable.
• Informal residual plots can gauge heteroscedasticity.
[Figure: residual plot for a model where none of the assumptions has been violated.]
• Heteroscedasticity results in inefficient estimators, and the hypothesis tests for significance are no longer valid.
• To get around the second problem, some researchers use OLS estimates along with corrected standard errors, called White's standard errors. Many statistical packages have this option available. Unfortunately, the current version of Excel does not.
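To give a feel for what a corrected standard error looks like, the sketch below computes the slope's conventional standard error and a White-type standard error for a simple regression. Several things here are assumptions rather than content from the slides: the data are hypothetical, the function name is made up, and the correction shown is the basic "HC0" form, Var(b1) = Σ(xᵢ – x̄)²eᵢ² / (Σ(xᵢ – x̄)²)²; statistical packages may apply further small-sample adjustments.

```python
import math


def slope_standard_errors(x, y):
    """Return (b1, conventional SE, White HC0 SE) for y = b0 + b1*x + e."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)
    b1 = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    # Conventional SE assumes one common error variance for all observations
    se_conventional = math.sqrt(sum(e * e for e in resid) / (n - 2) / sxx)
    # White (HC0) SE lets each observation contribute its own squared residual
    se_white = math.sqrt(sum((d * e) ** 2 for d, e in zip(dx, resid))) / sxx
    return b1, se_conventional, se_white


# Hypothetical data whose scatter grows with x
x = [1, 2, 3, 4, 5, 6, 7, 8]
noise = [0.1, -0.2, 0.4, -0.6, 0.9, -1.2, 1.6, -2.0]
y = [2 * xi + ni for xi, ni in zip(x, noise)]

b1, se_conventional, se_white = slope_standard_errors(x, y)
print(f"b1 = {b1:.3f}, conventional SE = {se_conventional:.3f}, White SE = {se_white:.3f}")
```

The coefficient estimate b1 is the same in both cases; only the standard error used for the t test changes.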

Correlated observations
• We assume that the error term is uncorrelated across observations when obtaining OLS estimates. But this often breaks down in time series data.
• In this example, we predict sales at a sushi restaurant and plot the residuals against time.
[Figure: residuals from the sushi restaurant sales model plotted against time.]
• Remedies are not easily accessible using Excel.

Excluded variables
• Endogeneity in the regression model refers to the error term being correlated with the explanatory variables. This commonly occurs due to an omitted explanatory variable.
• For example, a person's salary may be highly correlated with that person's innate ability. But since we cannot observe ability and so cannot include it in the model, ability gets incorporated in the error term. If we try to predict salary by years of education, which may also be correlated with innate ability, then we have an endogeneity problem.
• Endogeneity will result in biased estimators, and so is quite a serious problem. Unfortunately, it is difficult to fix.
• Most commonly, we try to find an instrumental variable. Discussion of the instrumental variable approach is beyond the scope of the text.