
Model Specification and Data Problems

Part VIII


As of Oct 24, 2017


Seppo Pynnonen Econometrics I

1 Model Specification and Data Problems


Functional Form Misspecification
RESET test
Non-nested alternatives

Using proxies for unobserved explanatory variables


Outliers


Functional Form Misspecification

Functional form misspecification generally means that the model fails to account for some important nonlinearity.
Recall that omitting an important variable is also a form of model misspecification.
In general, functional form misspecification biases the estimators of the remaining parameters.

Example 1
Suppose that the correct specification of the wage equation is

$$\log(wage) = \beta_0 + \beta_1\, educ + \beta_2\, exper + \beta_3\, exper^2 + u. \tag{1}$$

Then the return to an extra year of experience is

$$\frac{\partial \log(wage)}{\partial\, exper} = \beta_2 + 2\beta_3\, exper. \tag{2}$$

If the second-order term is dropped from (1), use of the resulting biased estimate of $\beta_2$ can be misleading.
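
As a rough illustration of equation (2), the following R sketch evaluates the return to experience at a few experience levels. The coefficient values are hypothetical and chosen only for the illustration; they are not estimates from any data set used in these slides.

```r
# Hypothetical coefficient values, chosen only to illustrate equation (2).
b2 <- 0.040    # coefficient on exper (assumed)
b3 <- -0.0007  # coefficient on exper^2 (assumed)

# Approximate proportional change in wage from one more year of experience.
return_to_exper <- function(exper, b2, b3) b2 + 2 * b3 * exper

exper_grid <- c(0, 10, 20, 30)
returns <- return_to_exper(exper_grid, b2, b3)
print(data.frame(exper = exper_grid, return = returns))

# Experience level at which the marginal return reaches zero.
turning_point <- -b2 / (2 * b3)
print(turning_point)
```

With a negative coefficient on the squared term the return declines with experience, so a single slope from a model without the squared term cannot describe it at all experience levels.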

RESET test

Ramsey (1969) [2] proposed a general test for functional form misspecification, the Regression Specification Error Test (RESET), which has proven to be useful.
Estimate

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u, \tag{3}$$

obtain the fitted values $\hat{y}$, and test in the augmented model

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \delta_1 \hat{y}^2 + \delta_2 \hat{y}^3 + e \tag{4}$$

the null hypothesis

$$H_0: \delta_1 = \delta_2 = 0 \tag{5}$$

with the F-test, which has numerator degrees of freedom $df_1 = 2$ and denominator degrees of freedom $df_2 = n - k - 3$.

[2] Ramsey, J.B. (1969). Tests for specification errors in classical linear least-squares regression analysis, Journal of the Royal Statistical Society, Series B, 31, 350–371.
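
The RESET steps can be sketched in base R as follows. The data are simulated and the variable names are assumptions made for this illustration (this is not the house price data); the F-statistic is computed by hand from the two residual sums of squares and then compared with the result of `anova()`.

```r
set.seed(1)
n  <- 200
x1 <- runif(n, 1, 5)
x2 <- runif(n, 1, 5)
y  <- 1 + 0.5 * x1^2 + x2 + rnorm(n)   # simulated: true relation is nonlinear in x1

# Step 1: estimate the linear model (3) and keep the fitted values.
fit0 <- lm(y ~ x1 + x2)
yhat <- fitted(fit0)

# Step 2: estimate the augmented model (4).
fit1 <- lm(y ~ x1 + x2 + I(yhat^2) + I(yhat^3))

# Step 3: F-test of H0: delta1 = delta2 = 0 with df1 = 2 and df2 = n - k - 3.
k    <- 2
df1  <- 2
df2  <- n - k - 3
ssr0 <- sum(resid(fit0)^2)
ssr1 <- sum(resid(fit1)^2)
F_reset <- ((ssr0 - ssr1) / df1) / (ssr1 / df2)
p_reset <- pf(F_reset, df1, df2, lower.tail = FALSE)

# Cross-check against the built-in nested-model comparison.
print(anova(fit0, fit1))
print(c(F = F_reset, p.value = p_reset))
```

A small p-value here signals that the linear specification misses some nonlinearity, but it does not say which alternative functional form is appropriate.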

Example 2
Consider the house price data (Exercise 3.1) and estimate

$$price = \beta_0 + \beta_1\, lotsize + \beta_2\, sqrft + \beta_3\, bdrms + u. \tag{6}$$

Estimation results are:


Dependent Variable: PRICE
Method: Least Squares
Sample: 1 88
Included observations: 88
==========================================================
Variable Coefficient Std. Error t-Statistic Prob.
----------------------------------------------------------
C -21.77031 29.47504 -0.738601 0.4622
LOTSIZE 0.002068 0.000642 3.220096 0.0018
SQRFT 0.122778 0.013237 9.275093 0.0000
BDRMS 13.85252 9.010145 1.537436 0.1279
==========================================================

============================================================
R-squared 0.672362 Mean dependent var 293.5460
Adjusted R-squared 0.660661 S.D. dependent var 102.7134
S.E. of regression 59.83348 Akaike info criterion 11.06540
Sum squared resid 300723.8 Schwarz criterion 11.17800
Log likelihood -482.8775 F-statistic 57.46023
Durbin-Watson stat 2.109796 Prob(F-statistic) 0.000000
============================================================
Next, estimate (6) augmented with $\widehat{price}^2$ and $\widehat{price}^3$, as in (4).
The F-statistic for the null hypothesis (5) is F = 4.67 with 2 and 82 degrees of freedom. The p-value is 0.012, so we reject the null hypothesis at the 5% level.
Thus, there is some evidence of nonlinearity.


Estimate next

$$\log(price) = \beta_0 + \beta_1 \log(lotsize) + \beta_2 \log(sqrft) + \beta_3\, bdrms + u. \tag{7}$$

Estimation results:
Dependent Variable: LOG(PRICE)
Method: Least Squares
Date: 10/19/06 Time: 00:01
Sample: 1 88
Included observations: 88
============================================================
Variable Coefficient Std. Error t-Statistic Prob.
============================================================
C -1.297042 0.651284 -1.991517 0.0497
LOG(LOTSIZE) 0.167967 0.038281 4.387714 0.0000
LOG(SQRFT) 0.700232 0.092865 7.540306 0.0000
BDRMS 0.036958 0.027531 1.342415 0.1831
============================================================

==============================================================
R-squared 0.642965 Mean dependent var 5.633180
Adjusted R-squared 0.630214 S.D. dependent var 0.303573
S.E. of regression 0.184603 Akaike info criterion -0.496833
Sum squared resid 2.862563 Schwarz criterion -0.384227
Log likelihood 25.86066 F-statistic 50.42374
Durbin-Watson stat 2.088996 Prob(F-statistic) 0.000000
==============================================================


Applying the RESET test, the F -statistic for the null hypothesis (5) is
now F = 2.56 with p-value 0.084, which implies that the hypothesis is
not rejected at the 5% level.

Thus overall, on the basis of the RESET test the log-log model (7) is
preferred.

Non-nested alternatives


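For example, suppose the model choices are

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u \tag{8}$$

and

$$y = \beta_0 + \beta_1 \log(x_1) + \beta_2 \log(x_2) + u. \tag{9}$$

Because the models are non-nested, the usual F-test cannot be used to compare them directly.
A common approach is to estimate a combined model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 \log(x_1) + \beta_4 \log(x_2) + u. \tag{10}$$

In (10), $H_0: \beta_3 = \beta_4 = 0$ is the hypothesis that (8) is adequate, and $H_0: \beta_1 = \beta_2 = 0$ is the hypothesis that (9) is adequate. Both are restrictions within (10), so the usual F-test applies again.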
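
The combined-model approach might look as follows in base R. The data are simulated under the assumption that the logarithmic model (9) is the true one; the variable names and parameter values are illustrative only.

```r
set.seed(2)
n  <- 300
x1 <- runif(n, 1, 10)
x2 <- runif(n, 1, 10)
y  <- 2 + 1.5 * log(x1) + 0.8 * log(x2) + rnorm(n, sd = 0.5)  # (9) is true here

fit_lin  <- lm(y ~ x1 + x2)                       # model (8)
fit_log  <- lm(y ~ log(x1) + log(x2))             # model (9)
fit_comb <- lm(y ~ x1 + x2 + log(x1) + log(x2))   # combined model (10)

# H0: the log terms are not needed in (10), i.e. (8) is adequate.
test_lin <- anova(fit_lin, fit_comb)
# H0: the level terms are not needed in (10), i.e. (9) is adequate.
test_log <- anova(fit_log, fit_comb)

print(test_lin)
print(test_log)
```

Each call to `anova()` compares one candidate model with the combined model, so each F-test has two numerator degrees of freedom.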


Davidson and MacKinnon (1981) [3] procedure:
To test (8), for example, first estimate

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \theta_1 \check{y} + v, \tag{11}$$

where $\check{y}$ denotes the fitted values from (9). A significant t-value of the $\theta_1$-estimate is a rejection of (8).
Similarly, if $\hat{y}$ denotes the fitted values from (8), the test of (9) is the t-statistic of the $\theta_1$-estimate from

$$y = \beta_0 + \beta_1 \log(x_1) + \beta_2 \log(x_2) + \theta_1 \hat{y} + v. \tag{12}$$

[3] Davidson, R. and J.G. MacKinnon (1981). Several tests for model specification in the presence of alternative hypotheses, Econometrica 49, 781–793.
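
A minimal base R sketch of the Davidson and MacKinnon procedure is given below. It uses simulated data in which the logarithmic model (9) is assumed to be true; the object names (`yfit_log`, `yfit_lin`, and so on) are hypothetical.

```r
set.seed(3)
n  <- 300
x1 <- runif(n, 1, 10)
x2 <- runif(n, 1, 10)
y  <- 2 + 1.5 * log(x1) + 0.8 * log(x2) + rnorm(n, sd = 0.5)  # (9) is true here

fit_lin <- lm(y ~ x1 + x2)             # model (8)
fit_log <- lm(y ~ log(x1) + log(x2))   # model (9)

# Test of (8): add the fitted values of (9) to (8), as in (11).
yfit_log <- fitted(fit_log)
j_lin <- lm(y ~ x1 + x2 + yfit_log)
t_lin <- coef(summary(j_lin))["yfit_log", ]

# Test of (9): add the fitted values of (8) to (9), as in (12).
yfit_lin <- fitted(fit_lin)
j_log <- lm(y ~ log(x1) + log(x2) + yfit_lin)
t_log <- coef(summary(j_log))["yfit_lin", ]

print(rbind(test_of_8 = t_lin, test_of_9 = t_log))
```

Each row reports the estimate, standard error, t-value and p-value of the added fitted-value regressor; a significant t-value in the first row counts against (8), and one in the second row counts against (9).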

Remark 8.1: A clear winner need not emerge. Both models may be rejected, or neither may be rejected. In the latter case the adjusted R-squared can be used to select the better-fitting one. If both models are rejected, more work is needed. [4]

[4] For more complicated cases, see Wooldridge, J.M. (1994). A simple specification test for the predictive ability of transformation models, Review of Economics and Statistics 76, 59–65.

Using proxies for unobserved explanatory variables



As discussed earlier, an important source of bias in OLS is omitted variables that are correlated with the included explanatory variables.
Often the reason for the omission is that these variables are unobservable.
One way to mitigate the problem is to collect data on proxy variables.

Consider the regression

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2^* + u, \tag{13}$$

where $x_2^*$ is an unobservable variable (e.g., human ability).


Suppose that the primary interest is in estimating $\beta_1$, so that $x_2^*$ is a control variable.
However, as we know, the simple regression $y = \beta_0 + \beta_1 x_1 + v$ results in a biased and inconsistent OLS estimator $\tilde\beta_1$ of $\beta_1$, such that $\operatorname{plim} \tilde\beta_1 = \beta_1 + \tilde\delta_1 \beta_2$, where $\tilde\delta_1$ is the slope coefficient of the regression $x_2^* = \tilde\delta_0 + \tilde\delta_1 x_1 + \text{error}$.
Suppose that we have a good proxy $x_2$ for $x_2^*$ such that

$E[x_2^* \mid x_2, x_1] = E[x_2^* \mid x_2]$, i.e., given the proxy $x_2$, $x_1$ does not help in predicting the unobserved variable $x_2^*$.
$E[u \mid x_2] = 0$ for the error term in regression (13).

These imply that in the regression $x_2^* = \delta_0 + \delta_1 x_2 + \gamma x_1 + e$ we have $\gamma = 0$, so that only the proxy $x_2$ is related to the unobserved variable $x_2^*$, and that the proxy $x_2$ is not correlated with the error term of the true regression in equation (13).

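With this kind of good proxy, the model to be estimated instead of (13) becomes

$$y = \alpha_0 + \beta_1 x_1 + \alpha_2 x_2 + w. \tag{14}$$

This follows from substituting $x_2^* = \delta_0 + \delta_1 x_2 + e$ into (13), which gives $w = u + \beta_2 e$.
Now OLS is an unbiased and consistent estimator of $\beta_1$, the parameter we are primarily interested in. (The OLS estimators of $\alpha_0$ and $\alpha_2$ are also unbiased and consistent for these parameters, but $\alpha_0 = \beta_0 + \beta_2 \delta_0$ and $\alpha_2 = \delta_1 \beta_2$ differ from $\beta_0$ and $\beta_2$.)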
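
The effect of a good proxy can be checked with a small simulation. The sketch below assumes a data-generating process that satisfies the two proxy conditions; all parameter values and the names `x2_star`, `b_short` and `b_proxy` are hypothetical choices for the illustration.

```r
set.seed(4)
n  <- 10000
x2 <- rnorm(n)                       # observed proxy
x1 <- 0.5 * x2 + rnorm(n)            # regressor of interest, correlated with the proxy
x2_star <- 1 + 2 * x2 + rnorm(n)     # unobserved variable; x1 adds nothing given x2
y  <- 1 + 0.5 * x1 + 1 * x2_star + rnorm(n)   # true coefficient on x1 is 0.5

b_short <- coef(lm(y ~ x1))          # unobserved variable omitted entirely
b_proxy <- coef(lm(y ~ x1 + x2))     # proxy used in place of the unobserved variable

print(round(rbind(short = c(b_short, x2 = NA), proxy = b_proxy), 3))
```

Under these assumptions the short regression converges to 0.5 + 1 × (1/1.25) = 1.3 for the coefficient on x1, whereas the proxy regression recovers a value near 0.5. The coefficient on the proxy itself estimates the product of the true coefficient and the proxy slope (here 1 × 2 = 2), not the coefficient on the unobserved variable.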


Example 3
Consider the return to education in wages (monthly) for men (wage2
data set).
lm(formula = log(wage) ~ educ + exper + tenure + married + south +
urban + black, data = wdf)

Residuals:
Min 1Q Median 3Q Max
-1.98069 -0.21996 0.00707 0.24288 1.22822

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.395497 0.113225 47.653 < 2e-16 ***
educ 0.065431 0.006250 10.468 < 2e-16 ***
exper 0.014043 0.003185 4.409 1.16e-05 ***
tenure 0.011747 0.002453 4.789 1.95e-06 ***
married 0.199417 0.039050 5.107 3.98e-07 ***
south -0.090904 0.026249 -3.463 0.000558 ***
urban 0.183912 0.026958 6.822 1.62e-11 ***
black -0.188350 0.037667 -5.000 6.84e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3655 on 927 degrees of freedom
Multiple R-squared: 0.2526, Adjusted R-squared: 0.2469
F-statistic: 44.75 on 7 and 927 DF, p-value: < 2.2e-16


The estimated return to education is 6.5%. However, if the omitted ability is positively correlated with educ, this estimate is too high.
Adding IQ as a proxy for ability to the equation reduces the estimate to 5.4%, which is consistent with the presence of omitted variable bias.
lm(formula = log(wage) ~ educ + exper + tenure + married + south +
urban + black + iq, data = wdf)

Residuals:
Min 1Q Median 3Q Max
-2.01203 -0.22244 0.01017 0.22951 1.27478

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.1764391 0.1280006 40.441 < 2e-16 ***
educ 0.0544106 0.0069285 7.853 1.12e-14 ***
exper 0.0141459 0.0031651 4.469 8.82e-06 ***
tenure 0.0113951 0.0024394 4.671 3.44e-06 ***
married 0.1997644 0.0388025 5.148 3.21e-07 ***
south -0.0801695 0.0262529 -3.054 0.002325 **
urban 0.1819463 0.0267929 6.791 1.99e-11 ***
black -0.1431253 0.0394925 -3.624 0.000306 ***
iq 0.0035591 0.0009918 3.589 0.000350 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3632 on 926 degrees of freedom
Multiple R-squared: 0.2628, Adjusted R-squared: 0.2564
F-statistic: 41.27 on 8 and 926 DF, p-value: < 2.2e-16


Test whether the interaction of ability and education affects wages.


lm(formula = log(wage) ~ educ + exper + tenure + married + south +
urban + black + iq + iq:educ, data = wdf) # iq:educ introduces interaction iq*educ

Residuals:
Min 1Q Median 3Q Max
-2.00733 -0.21715 0.01177 0.23456 1.27305

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.6482478 0.5462963 10.339 < 2e-16 ***
educ 0.0184560 0.0410608 0.449 0.653192
exper 0.0139072 0.0031768 4.378 1.34e-05 ***
tenure 0.0113929 0.0024397 4.670 3.46e-06 ***
married 0.2008658 0.0388267 5.173 2.82e-07 ***
south -0.0802354 0.0262560 -3.056 0.002308 **
urban 0.1835758 0.0268586 6.835 1.49e-11 ***
black -0.1466989 0.0397013 -3.695 0.000233 ***
iq -0.0009418 0.0051625 -0.182 0.855290
educ:iq 0.0003399 0.0003826 0.888 0.374564
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3632 on 925 degrees of freedom
Multiple R-squared: 0.2634, Adjusted R-squared: 0.2563
F-statistic: 36.76 on 9 and 925 DF, p-value: < 2.2e-16


Adding the interaction iq × educ yields an insignificant interaction term and also renders educ and iq insignificant!
This is due to the high correlation of the interaction term with its components:
> with(wdf, cor(cbind(educ, iq, educ*iq)))
             educ        iq   educ*iq
educ    1.0000000 0.5156970 0.8880035
iq      0.5156970 1.0000000 0.8453237
educ*iq 0.8880035 0.8453237 1.0000000

The implied collinearity can be materially reduced by defining the interaction term in terms of demeaned variables:
> with(wdf, cor(cbind(educ, iq, (educ - mean(educ))*(iq - mean(iq)))))
                                educ         iq  (educ-m(educ))*(iq-m(iq))
educ                       1.0000000  0.5156970                  0.1864668
iq                         0.5156970  1.0000000                 -0.0133327
(educ-m(educ))*(iq-m(iq))  0.1864668 -0.0133327                  1.0000000
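
The same pattern can be reproduced without the wage2 data. The sketch below simulates two positively correlated variables with large means; the distributions and parameter values are assumptions chosen to loosely resemble educ and iq, and the resulting numbers will differ from those in the wage2 output.

```r
set.seed(5)
n    <- 935
iq   <- round(rnorm(n, mean = 100, sd = 15))                   # simulated, not wage2
educ <- round(13.5 + 0.07 * (iq - 100) + rnorm(n, sd = 1.9))   # simulated, not wage2

raw_inter      <- educ * iq
centered_inter <- (educ - mean(educ)) * (iq - mean(iq))

# Correlations of each interaction variant with its components.
cors <- cor(cbind(educ, iq, raw_inter, centered_inter))
print(round(cors, 3))
```

The raw product is dominated by the levels of its components because both means are far from zero, which is what drives the high correlations; demeaning removes that level effect.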


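An interaction term built from the demeaned components also leads to a meaningful interpretation of the implied model.
We can write

$$\log(wage) = \beta_0 + \beta_1\, educ + \beta_2\, iq + \beta_{12}\, (\widetilde{educ} \times \widetilde{iq}) + \text{other factors}$$

as

$$\log(wage) = \tilde\beta_0 + \beta_1\, \widetilde{educ} + \beta_2\, \widetilde{iq} + \beta_{12}\, (\widetilde{educ} \times \widetilde{iq}) + \text{other factors},$$

where $\widetilde{educ} = educ - \overline{educ}$ and $\widetilde{iq} = iq - \overline{iq}$ are demeaned educ and iq, and $\tilde\beta_0 = \beta_0 + \beta_1 \overline{educ} + \beta_2 \overline{iq}$.
We can further write

$$\log(wage) = \tilde\beta_0 + (\beta_1 + \beta_{12}\, \widetilde{iq})\, \widetilde{educ} + \beta_2\, \widetilde{iq} + \text{other factors}.$$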


The slope coefficient $\beta_1 + \beta_{12}\, \widetilde{iq}$ of educ implies that the return to education depends on the level of ability (measured by IQ).
At the mean IQ, $\widetilde{iq} = 0$, so $\beta_1$ is the return to education for a person of average ability, and $\beta_{12}$ is the rate, per IQ point, at which the return to education changes as ability (measured by IQ) deviates from the average.
Assuming $\beta_{12} > 0$, above-average ability implies a higher return to education and below-average ability a lower one.


Estimating the model, however, gives $\hat\beta_{12} = 0.00034$ with p-value 0.37, which is far from statistically significant. There is thus no evidence that the return to education varies with IQ.
lm(formula = log(wage) ~ educ + exper + tenure + married + south +
urban + black + iq + I((iq - mean(iq)) * (educ - mean(educ))),
data = wdf)

Residuals:
Min 1Q Median 3Q Max
-2.00733 -0.21715 0.01177 0.23456 1.27305

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.1846286 0.1283466 40.396 < 2e-16
educ 0.0528786 0.0071406 7.405 2.94e-13
exper 0.0139072 0.0031768 4.378 1.34e-05
tenure 0.0113929 0.0024397 4.670 3.46e-06
married 0.2008658 0.0388267 5.173 2.82e-07
south -0.0802354 0.0262560 -3.056 0.002308
urban 0.1835758 0.0268586 6.835 1.49e-11
black -0.1466989 0.0397013 -3.695 0.000233
iq 0.0036357 0.0009957 3.652 0.000275
I((iq - mean(iq)) * (educ - mean(educ))) 0.0003399 0.0003826 0.888 0.374564

Residual standard error: 0.3632 on 925 degrees of freedom
Multiple R-squared: 0.2634, Adjusted R-squared: 0.2563
F-statistic: 36.76 on 9 and 925 DF, p-value: < 2.2e-16
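
The raw and the demeaned interaction describe the same fitted model; only the meaning of the educ and iq coefficients changes. The following sketch checks this on simulated data (the variable `lwage` and all parameter values are assumptions for the illustration, not the wage2 data).

```r
set.seed(6)
n     <- 500
iq    <- rnorm(n, mean = 100, sd = 15)
educ  <- 13.5 + 0.07 * (iq - 100) + rnorm(n, sd = 2)
lwage <- 5.2 + 0.05 * educ + 0.004 * iq + rnorm(n, sd = 0.4)   # simulated log wage

fit_raw <- lm(lwage ~ educ + iq + educ:iq)
fit_ctr <- lm(lwage ~ educ + iq + I((iq - mean(iq)) * (educ - mean(educ))))

b_raw <- unname(coef(fit_raw))   # order: intercept, educ, iq, interaction
b_ctr <- unname(coef(fit_ctr))

# Same interaction estimate and same fitted values in both parameterizations.
print(c(raw = b_raw[4], centered = b_ctr[4]))
print(all.equal(fitted(fit_raw), fitted(fit_ctr)))

# The educ coefficient of the centered fit is the raw return evaluated at mean IQ.
print(c(centered_educ = b_ctr[2], raw_at_mean_iq = b_raw[2] + b_raw[4] * mean(iq)))
```

The same relation can be seen in the wage2 output: 0.0184560 + 0.0003399 × mean(iq) reproduces the educ coefficient 0.0528786 of the demeaned specification when the mean IQ is about 101.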



Outliers


Particularly in small data sets, OLS estimates may be heavily influenced by one or several observations (see figure).
Such observations are generally called outliers or influential observations.
Loosely speaking, an observation is an outlier if dropping it changes the estimation results materially.
A usual practice for detecting outliers is to inspect the standardized (or studentized) residuals.
If an outlier is an obvious mistake in recording the data, it can be corrected. It is also common practice to eliminate such observations.
Data transformations, such as taking logarithms, often narrow the range of the data and hence may also alleviate outlier problems.
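
Inspecting standardized and studentized residuals might look as follows in base R. The toy data are deterministic, and the corrupted observation is planted on purpose as an assumption of this sketch.

```r
# Deterministic toy data: a nearly exact line with one corrupted observation.
x <- 1:20
y <- 2 + 0.5 * x + 0.1 * sin(x)
y[10] <- y[10] + 10          # planted recording error (assumption for the demo)

fit   <- lm(y ~ x)
r_std <- rstandard(fit)      # standardized residuals
r_stu <- rstudent(fit)       # studentized (leave-one-out) residuals

# Observation with the largest studentized residual in absolute value.
suspect <- which.max(abs(r_stu))
print(round(cbind(r_std, r_stu)[suspect, ], 2))

# Compare the estimates with and without the suspect observation.
keep     <- seq_along(x) != suspect
fit_drop <- lm(y ~ x, subset = keep)
print(rbind(all = coef(fit), dropped = coef(fit_drop)))
```

The studentized residual of an observation is based on a fit that excludes that observation, which is why a gross error tends to stand out more clearly there than in the standardized residuals.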
