Seppo Pynnönen: Econometrics I
Part VIII: Model Specification and Data Problems
As of Oct 24, 2017

1 Model Specification and Data Problems

Functional Form Misspecification
RESET test
Non-nested alternatives
Using proxies for unobserved explanatory variables
Outliers

Functional Form Misspecification

A functional form misspecification generally means that the model does not account for some important nonlinearities.
Recall that omitting an important variable is also a model misspecification.
Functional form misspecification generally causes bias in the remaining parameter estimators.

Example 1
Suppose that the correct specification of the wage equation is
$$\log(\text{wage}) = \beta_0 + \beta_1\,\text{educ} + \beta_2\,\text{exper} + \beta_3\,\text{exper}^2 + u. \tag{1}$$
Then the return for an extra year of experience is
$$\frac{\partial \log(\text{wage})}{\partial\,\text{exper}} = \beta_2 + 2\beta_3\,\text{exper}. \tag{2}$$
If the second order term is dropped from (1), use of the resulting biased estimate of $\beta_2$ can be misleading.
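
A minimal R sketch of Example 1 on simulated data may help fix ideas. The sample size, the coefficient values and the variable ranges below are illustrative assumptions, not estimates from any wage data set.

```r
# Simulated wage data; every number here is an assumption for illustration.
set.seed(1)
n     <- 2000
educ  <- sample(8:18, n, replace = TRUE)
exper <- sample(0:40, n, replace = TRUE)
lwage <- 0.5 + 0.08 * educ + 0.04 * exper - 0.0007 * exper^2 +
  rnorm(n, sd = 0.3)

# Correct specification (1): includes the quadratic term.
fit_quad <- lm(lwage ~ educ + exper + I(exper^2))
# Misspecified model: the quadratic term is dropped.
fit_lin  <- lm(lwage ~ educ + exper)

# Return to an extra year of experience, equation (2): b2 + 2 * b3 * exper.
b <- coef(fit_quad)
return_exper <- function(x) unname(b["exper"] + 2 * b["I(exper^2)"] * x)

print(return_exper(c(1, 10, 30)))  # varies with the level of experience
print(coef(fit_lin)["exper"])      # one constant slope from the misspecified model
```

In this sketch the misspecified model reports a single slope for exper, whereas the quadratic model lets the return depend on the level of experience.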


RESET test


Ramsey (1969)² proposed a general functional form misspecification test, the Regression Specification Error Test (RESET), which has proven to be useful.
Estimate
$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u, \tag{3}$$
get the fitted values $\hat{y}$ and test in the augmented model
$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \delta_1 \hat{y}^2 + \delta_2 \hat{y}^3 + e. \tag{4}$$
Test the null hypothesis
$$H_0: \delta_1 = \delta_2 = 0 \tag{5}$$
with the F-test with numerator $df_1 = 2$ and denominator $df_2 = n - k - 3$.
² Ramsey, J.B. (1969). Tests for specification errors in classical linear least-squares regression analysis, Journal of the Royal Statistical Society, Series B, 31, 350–371.
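
The steps (3) to (5) can be sketched in base R as follows. The data are simulated under an assumed nonlinear model, so the variable names and parameter values are hypothetical and serve only to show the mechanics.

```r
# Simulated data: the true model is nonlinear in x1 (an assumption).
set.seed(2)
n  <- 200
x1 <- runif(n, 1, 5)
x2 <- runif(n, 1, 5)
y  <- 1 + 0.5 * x1^2 + x2 + rnorm(n)

# Step 1: estimate the linear model (3) and keep the fitted values.
fit0 <- lm(y ~ x1 + x2)
yhat <- fitted(fit0)

# Step 2: augmented model (4) with squared and cubed fitted values.
fit1 <- lm(y ~ x1 + x2 + I(yhat^2) + I(yhat^3))

# Step 3: F-test of H0 (5), comparing the two nested fits.
reset  <- anova(fit0, fit1)
F_stat <- reset$F[2]
df1    <- reset$Df[2]          # numerator df, equals 2
df2    <- reset$Res.Df[2]      # denominator df, n - k - 3 with k = 2
p_val  <- reset$`Pr(>F)`[2]

print(c(F = F_stat, df1 = df1, df2 = df2, p = p_val))
```

If the lmtest package is installed, `lmtest::resettest(fit0, power = 2:3, type = "fitted")` should give the same statistic; the manual version is shown here only because it mirrors the equations directly.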
Example 2
Consider the house price data (Exercise 3.1) and estimate
$$\text{price} = \beta_0 + \beta_1\,\text{lotsize} + \beta_2\,\text{sqrft} + \beta_3\,\text{bdrms} + u. \tag{6}$$
Estimation results are:

Dependent Variable: PRICE
Method: Least Squares
Sample: 1 88
Included observations: 88
============================================================
Variable      Coefficient   Std. Error   t-Statistic   Prob.
------------------------------------------------------------
C              -21.77031     29.47504     -0.738601    0.4622
LOTSIZE          0.002068     0.000642     3.220096    0.0018
SQRFT            0.122778     0.013237     9.275093    0.0000
BDRMS           13.85252      9.010145     1.537436    0.1279
============================================================
R-squared             0.672362   Mean dependent var      293.5460
Adjusted R-squared    0.660661   S.D. dependent var      102.7134
S.E. of regression    59.83348   Akaike info criterion   11.06540
Sum squared resid     300723.8   Schwarz criterion       11.17800
Log likelihood       -482.8775   F-statistic             57.46023
Durbin-Watson stat    2.109796   Prob(F-statistic)       0.000000
============================================================
Estimate next (6) augmented with $\widehat{\text{price}}^2$ and $\widehat{\text{price}}^3$ as in (4).
The F-statistic for the null hypothesis (5) becomes F = 4.67 with 2 and 82 degrees of freedom. The p-value is 0.012, so we reject the null hypothesis at the 5% level.
Thus, there is some evidence of non-linearity.

Estimate next
$$\log(\text{price}) = \beta_0 + \beta_1 \log(\text{lotsize}) + \beta_2 \log(\text{sqrft}) + \beta_3\,\text{bdrms} + u. \tag{7}$$
Estimation results:
Dependent Variable: LOG(PRICE)
Method: Least Squares
Date: 10/19/06 Time: 00:01
Sample: 1 88
Included observations: 88
============================================================
Variable       Coefficient   Std. Error   t-Statistic   Prob.
------------------------------------------------------------
C               -1.297042     0.651284     -1.991517    0.0497
LOG(LOTSIZE)     0.167967     0.038281      4.387714    0.0000
LOG(SQRFT)       0.700232     0.092865      7.540306    0.0000
BDRMS            0.036958     0.027531      1.342415    0.1831
============================================================
R-squared             0.642965   Mean dependent var      5.633180
Adjusted R-squared    0.630214   S.D. dependent var      0.303573
S.E. of regression    0.184603   Akaike info criterion  -0.496833
Sum squared resid     2.862563   Schwarz criterion      -0.384227
Log likelihood        25.86066   F-statistic             50.42374
Durbin-Watson stat    2.088996   Prob(F-statistic)       0.000000
============================================================

Applying the RESET test, the F-statistic for the null hypothesis (5) is now F = 2.56 with p-value 0.084, which implies that the hypothesis is not rejected at the 5% level.
Thus overall, on the basis of the RESET test the log-log model (7) is preferred.


Non-nested alternatives

Suppose, for example, that the model choices are
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u \tag{8}$$
and
$$y = \beta_0 + \beta_1 \log(x_1) + \beta_2 \log(x_2) + u. \tag{9}$$
Because the models are non-nested, the usual F-test does not apply.
A common approach is to estimate a combined model
$$y = \gamma_0 + \gamma_1 x_1 + \gamma_2 x_2 + \gamma_3 \log(x_1) + \gamma_4 \log(x_2) + u. \tag{10}$$
$H_0: \gamma_3 = \gamma_4 = 0$ is a hypothesis for (8) and $H_0: \gamma_1 = \gamma_2 = 0$ is a hypothesis for (9). The usual F-test applies again here.
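
As a rough illustration, the two F-tests in the combined model (10) could be run in R as below. The data are simulated with the log specification (9) as the assumed truth; all names and numbers are hypothetical.

```r
# Simulated data in which model (9) is the true one (an assumption).
set.seed(3)
n  <- 300
x1 <- runif(n, 1, 10)
x2 <- runif(n, 1, 10)
y  <- 2 + 1.5 * log(x1) + 0.8 * log(x2) + rnorm(n, sd = 0.5)

fit_lev  <- lm(y ~ x1 + x2)                       # model (8)
fit_log  <- lm(y ~ log(x1) + log(x2))             # model (9)
fit_comb <- lm(y ~ x1 + x2 + log(x1) + log(x2))   # combined model (10)

# Test of (8): H0 says the two log terms have zero coefficients.
p_test_8 <- anova(fit_lev, fit_comb)$`Pr(>F)`[2]
# Test of (9): H0 says the two level terms have zero coefficients.
p_test_9 <- anova(fit_log, fit_comb)$`Pr(>F)`[2]

print(c(p_test_8 = p_test_8, p_test_9 = p_test_9))
```

A small p-value for the first test is evidence against the level model (8); the second test plays the same role for the log model (9).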

Davidson and MacKinnon (1981)³ procedure:
For example, to test (8), estimate first
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \theta_1 \check{y} + v, \tag{11}$$
where $\check{y}$ is the fitted value of (9). A significant t value of the $\theta_1$-estimate is a rejection of (8).
Similarly, if $\hat{y}$ denotes the fitted values of (8), the test of (9) is the t-statistic of the $\theta_1$-estimate from
$$y = \beta_0 + \beta_1 \log(x_1) + \beta_2 \log(x_2) + \theta_1 \hat{y} + v. \tag{12}$$
³ Davidson, R. and J.G. MacKinnon (1981). Several tests for model specification in the presence of alternative hypotheses, Econometrica 49, 781–793.
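
The two auxiliary regressions (11) and (12) can be sketched in R as follows, again on simulated data in which the log model is assumed to be true. The object names `y_check` and `y_hat` are hypothetical labels for the two sets of fitted values.

```r
# Simulated data in which model (9) is the true one (an assumption).
set.seed(4)
n  <- 300
x1 <- runif(n, 1, 10)
x2 <- runif(n, 1, 10)
y  <- 2 + 1.5 * log(x1) + 0.8 * log(x2) + rnorm(n, sd = 0.5)

fit_lev <- lm(y ~ x1 + x2)               # model (8)
fit_log <- lm(y ~ log(x1) + log(x2))     # model (9)
y_hat   <- fitted(fit_lev)               # fitted values of (8)
y_check <- fitted(fit_log)               # fitted values of (9)

# Regression (11): test of (8) via the t-statistic on y_check.
test_8 <- lm(y ~ x1 + x2 + y_check)
t_8    <- coef(summary(test_8))["y_check", "t value"]

# Regression (12): test of (9) via the t-statistic on y_hat.
test_9 <- lm(y ~ log(x1) + log(x2) + y_hat)
t_9    <- coef(summary(test_9))["y_hat", "t value"]

print(c(t_8 = t_8, t_9 = t_9))
```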
Remark 8.1: A clear winner need not emerge. Both models may be rejected or neither may be rejected. In the latter case the adjusted R-squared can be used to select the better fitting one. If both models are rejected, more work is needed.⁴
⁴ For more complicated cases, see Wooldridge, J.M. (1994). A simple specification test for the predictive ability of transformation models, Review of Economics and Statistics 76, 59–65.

Using proxies for unobserved explanatory variables

As discussed earlier, an important source of bias in OLS is omitted variables that are correlated with the included explanatory variables.
Often the reason for omission is that these variables are unobservable.
A way to mitigate the problem is to collect data on proxy variables.
Consider the following regression
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u, \tag{13}$$
where $x_2$ is an unobservable variable (e.g. human ability).

Suppose that the primary interest is to estimate $\beta_1$, so that $x_2$ is a control variable.
However, as we know, the simple regression $y = \beta_0 + \beta_1 x_1 + v$ results in a biased and inconsistent OLS estimator $\tilde\beta_1$ of $\beta_1$ such that
$$\operatorname{plim} \tilde\beta_1 = \beta_1 + \delta_1 \beta_2,$$
where $\delta_1$ is the slope coefficient of the regression $x_2 = \delta_0 + \delta_1 x_1 + \text{error}$.
Suppose that we have a good proxy $x_2^*$ for $x_2$ such that
$E[x_2 \mid x_2^*, x_1] = E[x_2 \mid x_2^*]$, i.e., given the proxy $x_2^*$, $x_1$ does not help in predicting the unobserved variable $x_2$.
$E[u \mid x_2^*] = 0$ for the error term in regression (13).
These imply that in the regression $x_2 = \pi_0 + \pi_1 x_2^* + \rho x_1 + e$ we have $\rho = 0$, so that only the proxy $x_2^*$ is related to the unobserved variable $x_2$, and that the proxy $x_2^*$ is not correlated with the error term of the true regression in equation (13).
With this kind of a good proxy, instead of (13) the model to be estimated becomes
$$y = \alpha_0 + \beta_1 x_1 + \alpha_2 x_2^* + w. \tag{14}$$
Now OLS is an unbiased and consistent estimator of $\beta_1$, the parameter we are primarily interested in (the OLS estimators of $\alpha_0$ and $\alpha_2$ are also unbiased and consistent for these parameters, but $\alpha_0 = \beta_0 + \beta_2 \pi_0$ and $\alpha_2 = \pi_1 \beta_2$ differ from $\beta_0$ and $\beta_2$).
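
A small simulation may make the proxy argument concrete. It is only a sketch: the data generating process below is an assumption chosen so that both proxy conditions hold (the unobserved variable depends on the proxy plus independent noise, and $x_1$ is related to it only through the proxy), and all names and values are hypothetical.

```r
# Simulated data satisfying the two proxy conditions by construction.
set.seed(5)
n     <- 5000
proxy <- rnorm(n)                              # observed proxy
x1    <- 0.5 * proxy + rnorm(n)                # regressor of interest
x2    <- 1 + 0.8 * proxy + rnorm(n, sd = 0.5)  # unobserved variable
y     <- 1 + 0.5 * x1 + 0.7 * x2 + rnorm(n)    # true model, as in (13)

# Short regression without x2: the slope on x1 picks up omitted variable bias.
b_short <- coef(lm(y ~ x1))[["x1"]]

# Regression with the proxy in place of x2, as in (14).
b_proxy <- coef(lm(y ~ x1 + proxy))

print(b_short)   # should lie clearly above the true value 0.5
print(b_proxy)   # slope on x1 near 0.5; proxy slope near 0.7 * 0.8
```

In this design the coefficient on the proxy estimates the product 0.7 × 0.8 rather than 0.7 itself, which illustrates why the proxy coefficient and the intercept do not recover the original parameters.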

Example 3
Consider the return to education in wages (monthly) for men (wage2 data set).

lm(formula = log(wage) ~ educ + exper + tenure + married + south +
    urban + black, data = wdf)

Residuals:
     Min       1Q   Median       3Q      Max
-1.98069 -0.21996  0.00707  0.24288  1.22822

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.395497   0.113225  47.653  < 2e-16 ***
educ         0.065431   0.006250  10.468  < 2e-16 ***
exper        0.014043   0.003185   4.409 1.16e-05 ***
tenure       0.011747   0.002453   4.789 1.95e-06 ***
married      0.199417   0.039050   5.107 3.98e-07 ***
south       -0.090904   0.026249  -3.463 0.000558 ***
urban        0.183912   0.026958   6.822 1.62e-11 ***
black       -0.188350   0.037667  -5.000 6.84e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3655 on 927 degrees of freedom
Multiple R-squared: 0.2526, Adjusted R-squared: 0.2469
F-statistic: 44.75 on 7 and 927 DF, p-value: < 2.2e-16

The estimated return to education is 6.5%. However, if the omitted ability is positively correlated with educ, the estimate is too high.
Adding IQ as a proxy for ability into the equation reduces the estimate to 5.4%, which is consistent with the omitted variable bias assumption.

lm(formula = log(wage) ~ educ + exper + tenure + married + south +
    urban + black + iq, data = wdf)

Residuals:
     Min       1Q   Median       3Q      Max
-2.01203 -0.22244  0.01017  0.22951  1.27478

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.1764391  0.1280006  40.441  < 2e-16 ***
educ         0.0544106  0.0069285   7.853 1.12e-14 ***
exper        0.0141459  0.0031651   4.469 8.82e-06 ***
tenure       0.0113951  0.0024394   4.671 3.44e-06 ***
married      0.1997644  0.0388025   5.148 3.21e-07 ***
south       -0.0801695  0.0262529  -3.054 0.002325 **
urban        0.1819463  0.0267929   6.791 1.99e-11 ***
black       -0.1431253  0.0394925  -3.624 0.000306 ***
iq           0.0035591  0.0009918   3.589 0.000350 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Test whether the interaction of ability and education affects wages.

lm(formula = log(wage) ~ educ + exper + tenure + married + south +
    urban + black + iq + iq:educ, data = wdf)  # iq:educ adds the product of iq and educ

Residuals:
     Min       1Q   Median       3Q      Max
-2.00733 -0.21715  0.01177  0.23456  1.27305

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.6482478  0.5462963  10.339  < 2e-16 ***
educ         0.0184560  0.0410608   0.449 0.653192
exper        0.0139072  0.0031768   4.378 1.34e-05 ***
tenure       0.0113929  0.0024397   4.670 3.46e-06 ***
married      0.2008658  0.0388267   5.173 2.82e-07 ***
south       -0.0802354  0.0262560  -3.056 0.002308 **
urban        0.1835758  0.0268586   6.835 1.49e-11 ***
black       -0.1466989  0.0397013  -3.695 0.000233 ***
iq          -0.0009418  0.0051625  -0.182 0.855290
educ:iq      0.0003399  0.0003826   0.888 0.374564
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The added interaction iq × educ is not only insignificant, it also renders educ and iq insignificant!
This is due to the high correlation of the interaction term with its components:
> with(wdf, cor(cbind(educ, iq, educ*iq)))
             educ        iq   educ*iq
educ    1.0000000 0.5156970 0.8880035
iq      0.5156970 1.0000000 0.8453237
educ*iq 0.8880035 0.8453237 1.0000000
The implied collinearity can be materially reduced by defining the interaction term in terms of demeaned variables:
> with(wdf, cor(cbind(educ, iq, (educ - mean(educ))*(iq - mean(iq)))))
                       educ         iq (e-m(e))*(i-m(i))
educ              1.0000000  0.5156970         0.1864668
iq                0.5156970  1.0000000        -0.0133327
(e-m(e))*(i-m(i)) 0.1864668 -0.0133327         1.0000000

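The interaction term of the demeaned components also leads to a meaningful interpretation of the implied model.
We can write
$$\log(\text{wage}) = \beta_0 + \beta_1\,\text{educ} + \beta_2\,\text{iq} + \beta_{12}\,(\widetilde{\text{educ}} \times \widetilde{\text{iq}}) + \text{other factors}$$
as
$$\log(\text{wage}) = \alpha_0 + \beta_1\,\widetilde{\text{educ}} + \beta_2\,\widetilde{\text{iq}} + \beta_{12}\,(\widetilde{\text{educ}} \times \widetilde{\text{iq}}) + \text{other factors},$$
where $\widetilde{\text{educ}} = \text{educ} - \overline{\text{educ}}$ and $\widetilde{\text{iq}} = \text{iq} - \overline{\text{iq}}$ are demeaned educ and iq, and $\alpha_0 = \beta_0 + \beta_1\,\overline{\text{educ}} + \beta_2\,\overline{\text{iq}}$.
We can further write
$$\log(\text{wage}) = \alpha_0 + (\beta_1 + \beta_{12}\,\widetilde{\text{iq}})\,\widetilde{\text{educ}} + \beta_2\,\widetilde{\text{iq}} + \text{other factors}.$$
The slope coefficient $\beta_1 + \beta_{12}\,\widetilde{\text{iq}}$ of educ implies that the return to education depends on the level of ability (measured by IQ).
At the mean IQ, $\widetilde{\text{iq}} = 0$, so that $\beta_1$ indicates the return to education for a person with average ability, and $\beta_{12}$ indicates, per IQ point, the rate by which the return to education changes when ability (measured in terms of IQ) deviates from the average.
Assuming $\beta_{12} > 0$, above average ability implies a higher return to education and below average ability a lower return to education.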

Estimating the model, however, gives $\hat\beta_{12} = 0.00034$ with p-value 0.37, which is not at all statistically significant. Thus there is no evidence that variability in IQ as such affects the return to education.

lm(formula = log(wage) ~ educ + exper + tenure + married + south +
    urban + black + iq + I((iq - mean(iq)) * (educ - mean(educ))),
    data = wdf)

Residuals:
     Min       1Q   Median       3Q      Max
-2.00733 -0.21715  0.01177  0.23456  1.27305

Coefficients:
                                          Estimate Std. Error t value Pr(>|t|)
(Intercept)                               5.1846286  0.1283466  40.396  < 2e-16
educ                                      0.0528786  0.0071406   7.405 2.94e-13
exper                                     0.0139072  0.0031768   4.378 1.34e-05
tenure                                    0.0113929  0.0024397   4.670 3.46e-06
married                                   0.2008658  0.0388267   5.173 2.82e-07
south                                    -0.0802354  0.0262560  -3.056 0.002308
urban                                     0.1835758  0.0268586   6.835 1.49e-11
black                                    -0.1466989  0.0397013  -3.695 0.000233
iq                                        0.0036357  0.0009957   3.652 0.000275
I((iq - mean(iq)) * (educ - mean(educ)))  0.0003399  0.0003826   0.888 0.374564
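
The effect of demeaning can be checked on simulated data with the sketch below. The variables are generated under assumed values and only mimic the structure of the example (they are not the wage2 data), so the numbers will differ from the output above.

```r
# Simulated stand-ins for educ, iq and log(wage); all values are assumptions.
set.seed(6)
n     <- 1000
iq    <- rnorm(n, mean = 100, sd = 15)
educ  <- round(13 + 0.08 * (iq - 100) + rnorm(n, sd = 2))
lwage <- 5 + 0.05 * educ + 0.004 * iq + rnorm(n, sd = 0.35)

educ_c <- educ - mean(educ)   # demeaned educ
iq_c   <- iq - mean(iq)       # demeaned iq

fit_raw <- lm(lwage ~ educ + iq + educ:iq)            # raw interaction
fit_dem <- lm(lwage ~ educ + iq + I(educ_c * iq_c))   # demeaned interaction

# Correlation of educ with the raw and with the demeaned interaction term.
print(cor(educ, educ * iq))
print(cor(educ, educ_c * iq_c))

# The interaction coefficient (4th coefficient in both fits) is identical;
# only the coefficients on educ and iq and the intercept change.
print(rbind(raw = coef(fit_raw), demeaned = unname(coef(fit_dem))))
```

Both fits span the same set of regressors, so the fitted values and the interaction estimate coincide. The educ coefficient in the demeaned version equals the raw educ coefficient plus the interaction coefficient times the mean of iq, which is the return to education at average ability.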


Outliers

Particularly in small data sets, OLS estimates may be influenced by one or several observations (see figure).
Generally such observations are called outliers or influential observations.
Loosely, an observation is an outlier if dropping it changes the estimation results materially.
In detecting outliers, a usual practice is to investigate standardized (or studentized) residuals.
If an outlier is an obvious mistake in recording the data, it can be corrected. It is also usual practice to eliminate such observations.
Data transformations, like taking logarithms, often narrow the range of the data and hence may alleviate outlier problems, too.
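
The residual-based check described above might look as follows in R. The data are simulated with one deliberately contaminated observation, and the cutoff of 3 for the studentized residuals is a rule-of-thumb assumption rather than a fixed standard.

```r
# Small simulated sample with one contaminated observation (an assumption).
set.seed(7)
n <- 30
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n)
y[n] <- y[n] + 15                 # mimic a recording error in the last row

fit       <- lm(y ~ x)
r_student <- rstudent(fit)        # studentized residuals
flagged   <- which(abs(r_student) > 3)

# Re-estimate without the flagged observations and compare the estimates.
keep     <- setdiff(seq_len(n), flagged)
fit_drop <- lm(y ~ x, subset = keep)

print(flagged)
print(rbind(all = coef(fit), dropped = coef(fit_drop)))
```

Comparing the two rows of coefficients shows how much the flagged observation moves the estimates, which is the loose definition of an outlier given above.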
