$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u$

$E(u \mid x_1, x_2, \dots, x_k) = 0$

• Suppose that the true model is

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$
• But instead of including x2 as a variable in our regression model, as we should, we estimate

$y = \beta_0 + \beta_1 x_1 + u$
• The OLS estimator of $\beta_1$ in this short regression is

$\hat\beta_1 = \frac{\sum_{i=1}^n (x_{i1} - \bar x_1)\, y_i}{\sum_{i=1}^n (x_{i1} - \bar x_1)^2}$

• Substituting the true model $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + u_i$ into the numerator gives

$\hat\beta_1 = \frac{\beta_0 \sum_{i=1}^n (x_{i1} - \bar x_1) + \beta_1 \sum_{i=1}^n x_{i1}(x_{i1} - \bar x_1) + \beta_2 \sum_{i=1}^n x_{i2}(x_{i1} - \bar x_1) + \sum_{i=1}^n u_i (x_{i1} - \bar x_1)}{\sum_{i=1}^n (x_{i1} - \bar x_1)^2}$
Multiple Regression Analysis: Estimation
• Taking the expectation, this simplifies to

$E(\hat\beta_1) = \beta_1 + \beta_2\, \frac{\widehat{Cov}(x_1, x_2)}{\widehat{Var}(x_1)}$
• This means that if you omit x2, and x2 is correlated with x1, your estimate of β1 will be biased.
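• The bias formula can be illustrated with a short simulation. Below is a minimal sketch in plain Python; all parameter values (β0 = 1, β1 = 2, β2 = 3, and Cov(x1, x2)/Var(x1) = 0.5) are invented for illustration. The short regression's slope centers on β1 + β2·Cov(x1, x2)/Var(x1) = 3.5 rather than on the true β1 = 2.

```python
import random

random.seed(42)
n = 100_000
b0, b1, b2 = 1.0, 2.0, 3.0

# x2 is positively correlated with x1: Cov(x1, x2)/Var(x1) = 0.5
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.5 * a + random.gauss(0, 1) for a in x1]
y = [b0 + b1 * a + b2 * c + random.gauss(0, 1) for a, c in zip(x1, x2)]

# OLS slope from the short regression of y on x1 alone
xbar = sum(x1) / n
ybar = sum(y) / n
slope = (sum((a - xbar) * (b - ybar) for a, b in zip(x1, y))
         / sum((a - xbar) ** 2 for a in x1))

# Centers on b1 + b2 * 0.5 = 3.5 rather than on b1 = 2
print(round(slope, 2))
```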
$wages = \beta_0 + \beta_1\, education + u$
• using the data we have available.
• Probably, since ability and education are positively correlated, and ability has a positive effect on wages, the estimate of β1 will be positively biased.
• As an example, we will look at the relationship between a child’s educational
attainment, his or her IQ, and his or her mother’s educational attainment.
$education = \beta_0 + \beta_1\, Meducation + \beta_2\, IQ + u$
• Question: what happens to the estimate of β1 if the second variable (IQ) is omitted from the equation?
IQ score
-------------------------------------------------------------
Percentiles Smallest
1% 64 50
5% 74 54
10% 82 55 Obs 935
25% 92 59 Sum of Wgt. 935
------------------------------------------------------------------------------
educ | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
meduc | .1669298 .0232489 7.18 0.000 .121298 .2125615
IQ | .0651911 .0044139 14.77 0.000 .0565278 .0738544
_cons | 5.155378 .4398148 11.72 0.000 4.292133 6.018622
------------------------------------------------------------------------------
• Note that β̂2 (the coefficient on IQ) is positive, as we hypothesized would be true for Fact 1.
• The next step is to see if mother's education is positively correlated with children's IQ (Fact 2).
. correlate IQ meduc
(obs=857)
| IQ meduc
-------------+------------------
IQ | 1.0000
meduc | 0.3318 1.0000
------------------------------------------------------------------------------
educ | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
meduc | .2808636 .0245594 11.44 0.000 .2326598 .3290674
_cons | 10.57491 .271523 38.95 0.000 10.04198 11.10783
------------------------------------------------------------------------------
• Note that without IQ, the estimate of the coefficient on mother's education is upwardly biased: 0.28 instead of 0.16.
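• The short and long regressions are linked by an exact OLS identity: β̃1 = β̂1 + β̂2·δ̃1, where δ̃1 is the slope from regressing IQ on meduc. Taking the printed coefficients at face value (and noting the identity is exact only when both regressions use the same sample), the implied δ̃1 can be backed out; it is not a number reported in the output, only an implication of the algebra.

```python
# Coefficients copied from the regression output above
beta1_long  = 0.1669298   # meduc, with IQ included
beta2_long  = 0.0651911   # IQ, with meduc included
beta1_short = 0.2808636   # meduc, with IQ omitted

# Omitted-variable algebra: beta1_short = beta1_long + beta2_long * delta1,
# where delta1 is the slope from regressing IQ on meduc
delta1 = (beta1_short - beta1_long) / beta2_long
bias = beta1_short - beta1_long

print(round(delta1, 2), round(bias, 3))
```

The bias of about 0.11 matches the 0.28-versus-0.16 comparison in the text, and the positive implied δ̃1 is consistent with the positive IQ/meduc correlation shown above.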
$Var(\hat\beta_j) = \frac{\sigma^2}{\sum_{i=1}^n (x_{ij} - \bar x_j)^2 \,(1 - R_j^2)}$

$\hat\sigma^2 = \frac{1}{n - k - 1} \sum_{i=1}^n \hat u_i^2$

$se(\hat\beta_j) = \sqrt{\frac{\hat\sigma^2}{\sum_{i=1}^n (x_{ij} - \bar x_j)^2 \,(1 - R_j^2)}}$
• The $R_j^2$ term refers to the R-squared value from a regression of $x_j$ on the other x variables.
• You can see why multicollinearity is a problem: when $x_j$ is perfectly correlated with the other x variables, this R-squared is one, and dividing by $(1 - R_j^2)$ means dividing by zero.
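• The penalty can be made concrete via the variance inflation factor, $1/(1 - R_j^2)$. Below is a minimal sketch in plain Python with an invented correlation structure: x2 is built so that regressing it on x1 gives $R_j^2 \approx 0.8$, inflating $Var(\hat\beta_j)$ by a factor of about 5.

```python
import random

random.seed(0)
n = 50_000

# x2 is strongly (but not perfectly) correlated with x1
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [a + 0.5 * random.gauss(0, 1) for a in x1]

# For a simple regression of x2 on x1, R^2 equals the squared correlation
m1, m2 = sum(x1) / n, sum(x2) / n
cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / n
v1 = sum((a - m1) ** 2 for a in x1) / n
v2 = sum((b - m2) ** 2 for b in x2) / n
r2 = cov ** 2 / (v1 * v2)

vif = 1 / (1 - r2)   # variance inflation factor
print(round(r2, 2), round(vif, 1))
```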
age (in years)
-------------------------------------------------------------
Percentiles Smallest
1% 22 17.5
5% 22 17.5
10% 22 17.5 Obs 601
25% 27 17.5 Sum of Wgt. 601
years married
-------------------------------------------------------------
Percentiles Smallest
1% .125 .125
5% .75 .125
10% 1.5 .125 Obs 601
25% 4 .125 Sum of Wgt. 601
number of |
affairs |
within last |
year | Freq. Percent Cum.
------------+-----------------------------------
0 | 451 75.04 75.04
1 | 34 5.66 80.70
2 | 17 2.83 83.53
3 | 19 3.16 86.69
7 | 42 6.99 93.68
12 | 38 6.32 100.00
------------+-----------------------------------
Total | 601 100.00
• It is unfortunately the case, however, that age and years of marriage are highly
correlated:
| age yrsmarr
-------------+------------------
age | 1.0000
yrsmarr | 0.7775 1.0000
------------------------------------------------------------------------------
naffairs | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .033822 .0144444 2.34 0.020 .0054541 .0621899
_cons | .357114 .4880375 0.73 0.465 -.6013585 1.315586
------------------------------------------------------------------------------
------------------------------------------------------------------------------
naffairs | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
yrsmarr | .1106286 .0237664 4.65 0.000 .0639529 .1573043
_cons | .5512198 .2351106 2.34 0.019 .0894785 1.012961
------------------------------------------------------------------------------
• However, when both are put together, the coefficient on age is inconclusive: its sign changes. This suggests that this coefficient is estimated with high variance.
. regress naffairs age yrsmarr
------------------------------------------------------------------------------
naffairs | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | -.0449423 .0226134 -1.99 0.047 -.0893536 -.000531
yrsmarr | .1688902 .0377022 4.48 0.000 .0948454 .242935
_cons | 1.534838 .5476808 2.80 0.005 .4592266 2.61045
------------------------------------------------------------------------------
Break
• Last time, we examined some consequences of misspecifying a multiple regression
model.
• In particular, we examined the implications of omitting an “important variable” from
the regression.
• x2 is an “important variable” if (a) it is correlated with x1 and (b) β2 is something other than zero.
• In this case, the sign of the omitted variable bias is determined by the signs of (a) and (b).
• So if the correlation is negative, and β2 is positive, the estimate of β1 will be negatively biased.
. generate lcost=ln(cost)
. generate llibvol=ln(libvol)
------------------------------------------------------------------------------
lsalary | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
llibvol | .1291507 .0325187 3.97 0.000 .0648471 .1934543
lcost | .0265127 .0295489 0.90 0.371 -.0319182 .0849437
rank | -.0041712 .0002976 -14.02 0.000 -.0047596 -.0035829
_cons | 9.880132 .3433113 28.78 0.000 9.201258 10.55901
------------------------------------------------------------------------------
• We expect the coefficient on rank (β3) to be negative, but the others should all be positive.
. regress lsalary llibvol lcost rank GPA LSAT
------------------------------------------------------------------------------
lsalary | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
llibvol | .0949932 .0332543 2.86 0.005 .0292035 .160783
lcost | .0375538 .0321061 1.17 0.244 -.0259642 .1010718
rank | -.0033246 .0003485 -9.54 0.000 -.004014 -.0026352
GPA | .2475239 .090037 2.75 0.007 .0693964 .4256514
LSAT | .0046965 .0040105 1.17 0.244 -.0032378 .0126308
_cons | 8.343226 .5325192 15.67 0.000 7.2897 9.396752
------------------------------------------------------------------------------
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u$
• We have not said anything about the distribution of the OLS estimators other than that

$E(\hat\beta_j) = \beta_j$

$Var(\hat\beta_j) = \frac{\sigma^2}{\sum_{i=1}^n (x_{ij} - \bar x_j)^2 \,(1 - R_j^2)}$
– Additionally, the OLS estimators are BLUE – that is, among all estimators that are
both linear and unbiased, they have the minimum sampling variance possible.
Multiple Regression Analysis: Inference
• These five assumptions are collectively known as the Gauss-Markov assumptions.
• With a sixth assumption, the entire sampling distribution of the estimator can be
characterized.
• The assumption, known as the normality assumption, states that u is independent of the explanatory variables and:

$u \sim Normal(0, \sigma^2)$

$y \mid x_1, x_2, \dots, x_k \sim Normal(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k,\ \sigma^2)$

• where the true values y are normal random variables with means equal to the fitted values constructed with the true parameters and variances all equal to $\sigma^2$.
$\hat\beta_j \sim Normal\left(\beta_j,\ Var(\hat\beta_j)\right)$

$\frac{\hat\beta_j - \beta_j}{sd(\hat\beta_j)} \sim Normal(0, 1)$

$\frac{\hat\beta_j - \beta_j}{\widehat{se}(\hat\beta_j)} \sim t_{n-k-1}$
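• The t-distribution result can be checked by simulation. Below is a minimal Monte Carlo sketch in plain Python with invented parameters: for a simple regression with n = 10 and k = 1, the standardized slope should follow a t distribution with n − 2 = 8 degrees of freedom, whose standard deviation √(8/6) ≈ 1.15 exceeds that of a standard normal.

```python
import random
import statistics

random.seed(1)
b0, b1, n, reps = 1.0, 2.0, 10, 2000

tstats = []
for _ in range(reps):
    x = [random.gauss(0, 1) for _ in range(n)]
    y = [b0 + b1 * xi + random.gauss(0, 1) for xi in x]
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1_hat = sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sxx
    b0_hat = sum(y) / n - b1_hat * xbar
    ssr = sum((yi - b0_hat - b1_hat * xi) ** 2 for xi, yi in zip(x, y))
    se = (ssr / (n - 2) / sxx) ** 0.5   # sigma-hat^2 uses n - k - 1 = n - 2
    tstats.append((b1_hat - b1) / se)

# The t(8) distribution has sd sqrt(8/6), fatter-tailed than N(0, 1)
print(round(statistics.stdev(tstats), 2))
```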
$H_0: \theta = 0.46$, where $\theta$ denotes the true vote share
• An example of an alternative hypothesis is that it is not:
$H_A: \theta \neq 0.46$
• Candidate B brings his poll results to the local magistrate, who devises a statistical test
to find out if candidate B has evidence beyond a reasonable doubt that the election was
rigged.
[Histogram of xb: Fraction on the vertical axis (up to about .079); xb from 32 to 60 on the horizontal axis.]
• A test statistic is a function of the random sample of data.
• The outcome of this function is used to create a rejection rule for the null hypothesis.
• Usually the rejection rule is to reject the null hypothesis if the outcome of the function
exceeds some critical value.
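• As a sketch of such a two-sided rejection rule, using the normal approximation for a sample proportion: the sample size and observed share below are hypothetical; only the null value 0.46 comes from the example above.

```python
from statistics import NormalDist

# Hypothetical poll: n voters, observed share for the candidate
n, theta0, observed = 500, 0.46, 0.53

# Test statistic: standardized distance of the observed share from H0
se = (theta0 * (1 - theta0) / n) ** 0.5
z = (observed - theta0) / se

# Two-sided rejection rule at the 5% level: reject H0 if |z| > 1.96
crit = NormalDist().inv_cdf(0.975)
reject = abs(z) > crit

print(round(z, 2), reject)
```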
xb
-------------------------------------------------------------
Percentiles Smallest
1% 34 24
5% 38 25
10% 40 26 Obs 100000
25% 43 26 Sum of Wgt. 100000