Econometrics With R

Econometrics using R
Rajat Tayal
Rmetrics Panel Data Workshop February 16-February 18, 2011 Indira Gandhi Institute of Development Research, Mumbai
February 17, 2012
Rajat Tayal
Outline of the presentation
Linear regression
Simple linear regression Multiple linear regression
Regression diagnostics
Leverage and standardized residuals Deletion diagnostics The function inuence.measures() Testing for heteroskedasticity Testing for functional form Testing for autocorrelation Robust standard errors and tests
Linear regression with panel data
Rajat Tayal
Part I Linear regression
Rajat Tayal
Introduction
The linear regression model, typically estimated by ordinary least squares (OLS), is the workhorse of applied econometrics. The model is yi = xiT + i , i = 1, 2, . . . , n. (1) y = X + For cross-sections: E ( |X ) = 0 Var ( |X ) = I For time series: E ( j |xi ) = 0, i j. (5)
2
(2)
(3) (4)
Rajat Tayal
Introduction
We estimate the OLS by: = (X T X )1 X T y (6)
The corresponding tted values are: y = X , the residuals are = y y and the residual sum of squares is T . In R, models are typically tted by calling a model-tting function, in this case lm(), with a formula object describing the model and a data.frame object containing the variables used in the formula. fm < lm(formula, data, . . .)
Rajat Tayal
The rst example

Data from Stock & Watson (2007) on subscriptions to economics journals at US libraries for the year 2000:
> > > > > > install.packages("AER", dependencies=TRUE) library(AER) data("Journals") ; names("Journals") journals <- Journals[, c("subs", "price")] journals$citeprice <- Journals$price/Journals$citations summary(journals) subs price citeprice Min. : 2.0 Min. : 20.0 Min. : 0.005223 1st Qu.: 52.0 1st Qu.: 134.5 1st Qu.: 0.464495 Median : 122.5 Median : 282.0 Median : 1.320513 Mean : 196.9 Mean : 417.7 Mean : 2.548455 3rd Qu.: 268.2 3rd Qu.: 540.8 3rd Qu.: 3.440171 Max. :1098.0 Max. :2120.0 Max. :24.459459
Rajat Tayal
The rst example
In view of the wide range of the variables, combined with a considerable amount of skewness, it is useful to take logarithms. The goal is to estimate the eect of the price per citation on the number of library subscriptions. To explore this issue quantitatively, we will t a linear regression model, log (subs)i = 1 + 2 log (citeprice)i +
i
(7)
Rajat Tayal
The rst example
i
(7)
Rajat Tayal
The rst example
i
(7)
Rajat Tayal
The rst example
Here, the formula of interest is log (subs) log (citeprice). This can be used both for plotting and for model tting:
> plot(log(subs) ~ log(citeprice), data = journals) > jour_lm <- lm(log(subs) ~ log(citeprice), data = journals) > abline(jour_lm)
abline() extracts the coecients of the tted model and adds the corresponding regression line to the plot.
Rajat Tayal
The rst example
Rajat Tayal
The rst example
The function lm() returns a tted-model object, here stored as jour lm. It is an object of class lm.
> class(jour_lm) [1] "lm" > names(jour_lm) [1] "coefficients" "residuals" "effects" "rank" "fitted. [7] "qr" "df.residual" "xlevels" "call" "terms" "model"
Rajat Tayal
The rst example

> summary(jour_lm) Call: lm(formula = log(subs) ~ log(citeprice), data = journals) Residuals: Min 1Q Median 3Q Max -2.72478 -0.53609 0.03721 0.46619 1.84808 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 4.76621 0.05591 85.25 <2e-16 *** log(citeprice) -0.53305 0.03561 -14.97 <2e-16 *** --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Residual standard error: 0.7497 on 178 degrees of freedom Multiple R-squared: 0.5573, Adjusted R-squared: 0.5548 F-statistic: 224 on 1 and 178 DF, p-value: < 2.2e-16
Rajat Tayal Econometrics using R
Generic functions for tted (linear) model objects

Function print() summary() coef() (or coecients()) residuals() (or resid()) tted() (or tted.values()) anova() predict() plot() connt() deviance() vcov() logLik() AIC() Function Description simple printed display standard regression output extracting the regression coecients extracting residuals extracting tted values comparison of nested models predictions for new data diagnostic plots condence intervals for the regression coecients residual sum of squares (estimated) variance-covariance matrix log-likelihood (assuming normally distributed errors) information criteria including AIC, BIC/SBC (assuming normally distributed errors)
Rajat Tayal
The rst example

It is instructive to take a brief look at what the summary() method returns for a tted lm object:
> jour_slm <- summary(jour_lm) > class(jour_slm) [1] "summary.lm" > names(jour_slm) [1] "call" "terms" "residuals" "coefficients" "aliased" "sigma" [7] "df" "r.squared" "adj.r.squared" "fstatistic" "cov.unscaled > jour_slm$coefficients Estimate Std. Error t value Pr(>|t|) (Intercept) 4.7662121 0.05590908 85.24934 2.953913e-146 log(citeprice) -0.5330535 0.03561320 -14.96786 2.563943e-33
Rajat Tayal
Analysis of variance
> anova(jour_lm) Analysis of Variance Table Response: log(subs) Df Sum Sq Mean Sq F value Pr(>F) log(citeprice) 1 125.93 125.934 224.04 < 2.2e-16 *** Residuals 178 100.06 0.562 --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
The ANOVA table breaks the sum of squares about the mean (for the dependent variable, here log(subs)) into two parts: a part that is accounted for by a linear function of log(citeprice) and a part attributed to residual variation.
Point and Interval estimates
To extract the estimated regression coecients , the function coef() can be used: > coef(jour_lm) (Intercept) log(citeprice) 4.7662121 -0.5330535 > confint(jour_lm, level = 0.95) 2.5 % 97.5 % (Intercept) 4.6558822 4.8765420 log(citeprice) -0.6033319 -0.4627751
Rajat Tayal
Prediction
Two types of predictions:

1 2
the prediction of points on the regression line and the prediction of a new data value.
The standard errors of predictions for new data take into account both the uncertainty in the regression line and the variation of the individual points about the line. Thus, the prediction interval for prediction of new data is larger than that for prediction of points on the line. The function predict() provides both types of standard errors.
Rajat Tayal
Prediction
> predict(jour_lm, newdata = data.frame(citeprice = 2.11), interval = "confidence") fit lwr upr 4.368188 4.247485 4.48889 > predict(jour_lm, newdata = data.frame(citeprice = 2.11), interval = "prediction") fit lwr upr 4.368188 2.883746 5.852629 The point estimates are identical (t) but the intervals dier. The prediction intervals can also be used for computing and visualizing condence bands.
Rajat Tayal
Prediction
The prediction intervals can also be used for computing and visualizing condence bands.
> lciteprice <- seq(from = -6, to = 4, by = 0.25) > jour_pred <- predict(jour_lm, interval = "prediction", newdata = data.frame(citeprice = exp(lciteprice))) > plot(log(subs) ~ log(citeprice), data = journals) > lines(jour_pred[, 1] ~ lciteprice, col = 1) > lines(jour_pred[, 2] ~ lciteprice, col = 1, lty = 2) > lines(jour_pred[, 3] ~ lciteprice, col = 1, lty = 2)
Rajat Tayal
Prediction
Figure: Scatterplot with prediction intervals for the journals data

Plotting lm objects
The plot() method for class lm() provides six types of diagnostic plots, four of which are shown by default. We set the graphical parameter mfrow to c(2, 2) using the par() function, creating a 2 * 2 matrix of plotting areas to see all four plots simultaneously: > par(mfrow = c(2, 2)) > plot(jour_lm) > par(mfrow = c(1, 1))
Rajat Tayal
Plotting lm objects
Figure: Diagnostic plots for the journals data

Testing a linear hypothesis

The standard regression output as provided by summary() only indicates individual signicance of each regressor and joint signicance of all regressors in the form of t and F statistics, respectively. Often it is necessary to test more general hypotheses. This is possible using the function linear.hypothesis() from the car package. Suppose we want to test the hypothesis that the elasticity of the number of library subscriptions with respect to the price per citation equals 0.5. H0 : 2 = 0.5 (8)
Rajat Tayal

Rajat Tayal

Rajat Tayal
> linear.hypothesis(jour_lm, "log(citeprice) = -0.5") Linear hypothesis test Hypothesis: log(citeprice) = - 0.5 Model 1: restricted model Model 2: log(subs) ~ log(citeprice) Res.Df RSS Df Sum of Sq F Pr(>F) 179 100.54 178 100.06 1 0.48421 0.8614 0.3546
1 2
Rajat Tayal
Part II Regression diagnostics
Rajat Tayal
Review
> > > > >
data("Journals") journals <- Journals[, c("subs", "price")] journals$citeprice <- Journals$price/Journals$citations journals$age <- 2000 - Journals$foundingyear jour_lm <- lm(log(subs) ~ log(citeprice), data = journals
Rajat Tayal
Testing for heteroskedasticity

For cross-section regressions, the assumption Var ( i |xi ) = 2 is typically in doubt. A popular test for checking this assumption is the Breusch-Pagan test (Breusch and Pagan 1979). For our model tted to the journals data, stored in jour lm, the diagnostic plots in suggest that the variance decreases with the tted values or, equivalently, it increases with the price per citation. Hence, the regressor log(citeprice) used in the main model should also be employed for the auxiliary regression. Under H0 , the test statistic of the Breusch-Pagan test approximately follows a 2 distribution, where q is the q number of regressors in the auxiliary regression (excluding the constant term).


The function bptest() implements all these avors of the Breusch-Pagan test. By default, it computes the studentized statistic for the auxiliary regression utilizing the original regressors X.
> bptest(jour_lm) studentized Breusch-Pagan test data: jour_lm BP = 9.803, df = 1, p-value = 0.001742
Rajat Tayal
Alternatively, the White test picks up the heteroskedasticity. It uses the original regressors as well as their squares and interactions in the auxiliary regression, which can be passed as a second formula to bptest().
> bptest(jour_lm, ~ log(citeprice) + I(log(citeprice)^2), + data = journals) studentized Breusch-Pagan test data: jour_lm BP = 10.912, df = 2, p-value = 0.004271
Rajat Tayal
Testing the functional form
The assumption E ( |X ) = 0 is crucial for consistency of the least-squares estimator. A typical source for violation of this assumption is a misspecication of the functional form; e.g., by omitting relevant variables. One strategy for testing the functional form is to construct auxiliary variables and assess their signicance using a simple F test. This is what Ramseys RESET does.
Rajat Tayal
The function resettest() defaults to using second and third powers of the tted values as auxiliary variables. > resettest(jour_lm) RESET test data: jour_lm RESET = 1.4409, df1 = 2, df2 = 176, p-value = 0.2395
Rajat Tayal
The rainbow test (Utts 1982) takes a dierent approach to testing the functional form. It ts a model to a subsample (typically the middle 50%) and compares it with the model tted to the full sample using an F test. > raintest(jour_lm, order.by = ~ age, data = journals) Rainbow test data: jour_lm Rain = 1.774, df1 = 90, df2 = 88, p-value = 0.003741 This appears to be the case, signaling that the relationship between the number of subscriptions and the price per citation also depends on the age of the journal.
Rajat Tayal
Testing for autocorrelation

Let us reconsider the rst model for the US consumption function.
> library(dynlm) > data("USMacroG") > consump1 <- dynlm(consumption ~ dpi + L(dpi), data = USMacroG)
A classical testing procedure suggested for assessing autocorrelation in regression relationships is the Durbin-Watson test (Durbin and Watson 1950).
> dwtest(consump1) Durbin-Watson test data: consump1 DW = 0.0866, p-value < 2.2e-16 alternative hypothesis: true autocorrelation is greater than 0
Testing for autocorrelation
Further tests for autocorrelation are the Box-Pierce test and the Ljung-Box test, both being implemented in the function Box.test() in base R.
> Box.test(residuals(consump1), type = "Ljung-Box") Box-Ljung test data: residuals(consump1) X-squared = 176.0698, df = 1, p-value < 2.2e-16
Rajat Tayal
Multiple linear regression

Labour economics example: Estimation of wage equation
> data("CPS1988") > summary(CPS1988) wage Min. : 50.05 1st Qu.: 308.64 Median : 522.32 Mean : 603.73 3rd Qu.: 783.48 Max. :18777.20
education Min. : 0.00 1st Qu.:12.00 Median :12.00 Mean :13.07 3rd Qu.:15.00 Max. :18.00
experience Min. :-4.0 1st Qu.: 8.0 Median :16.0 Mean :18.2 3rd Qu.:27.0 Max. :63.0
ethnicity cauc:25923 afam: 2232
The model of interest is log (wage) = 1 + 2 exp + 3 exp 2 + 4 education + 5 ethnicity + (9)
Multiple linear regression

> cps_lm <- lm(log(wage) ~ experience + I(experience^2) + education + ethnicity, data = CPS1988) > summary(cps_lm) Call: lm(formula = log(wage) ~ experience + I(experience^2) + education + ethnicity, data = CPS1988) Residuals: Min 1Q Median 3Q Max -2.9428 -0.3162 0.0580 0.3756 4.3830 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 4.321e+00 1.917e-02 225.38 <2e-16 *** experience 7.747e-02 8.800e-04 88.03 <2e-16 *** I(experience^2) -1.316e-03 1.899e-05 -69.31 <2e-16 *** education 8.567e-02 1.272e-03 67.34 <2e-16 *** ethnicity -2.434e-01 1.292e-02 -18.84 <2e-16 *** --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Residual standard error: 0.5839 on 28150 degrees of freedom Multiple R-squared: 0.3347, Adjusted R-squared: 0.3346 F-statistic: 3541 on 4 and 28150 DF, p-value: < 2.2e-16
Comparison of models
With more than a single explanatory variable, it is interesting to test for the relevance of subsets of regressors.
> cps_noeth <- lm(log(wage) ~ experience + I(experience^2) + education, data = CPS1988) > anova(cps_noeth, cps_lm) Analysis of Variance Table Model 1: log(wage) ~ experience + I(experience^2) + education Model 2: log(wage) ~ experience + I(experience^2) + education + ethnicity Res.Df RSS Df Sum of Sq F Pr(>F) 1 28151 9719.6 2 28150 9598.6 1 121.02 354.91 < 2.2e-16 *** --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 This reveals that the eect of ethnicity is signicant at any reasonable level.
Rajat Tayal
Comparison of models
> cps_noeth <- update(cps_lm, formula = . ~ . - ethnicity) > waldtest(cps_lm, . ~ . - ethnicity) from the package lmtest Wald test Model 1: log(wage) ~ experience + I(experience^2) + education + ethnicity Model 2: log(wage) ~ experience + I(experience^2) + education Res.Df Df F Pr(>F) 1 28150 2 28151 -1 354.91 < 2.2e-16 *** --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Rajat Tayal
Part III Linear regression with panel data
Rajat Tayal
Introduction
There has been considerable interest in panel data econometrics over the last two decades. The package plm (Croissant and Millo 2008) contains the relevant tting functions and methods for specications in R. Two types of panel data models:
1 2
Statis linear models Dynamic linear models
Rajat Tayal
Introduction
Grunfeld data comprising 20 annual observations on the three variables real gross investment (invest), real value of the rm (value), and real value of the capital stock (capital) for 11 large US rms for the years 1935-1954. > data("Grunfeld", package = "AER") > library("plm") > gr <- subset(Grunfeld, firm %in% c("General Electric", "General Motors", "IBM")) > pgr <- plm.data(gr, index = c("firm", "year")) # The last command tells R that the individuals are called "firm", whereas the time identifier is called "year". Pooled OLS can be run using: > gr_pool <- plm(invest ~ value + capital, data = pgr, model = "pooling")
One-way panel regression

investit = 1 values + 2 capital + i + it (10) where i = 1, . . . , n, t = 1, . . . , T, and the i denote the individual-specic eects. A xed-eects version is estimated by running OLS on a within-transformed model:
> gr_fe <- plm(invest ~ value + capital, data = pgr, model = "within") > summary(gr_fe) Oneway (individual) effect Within Model Call: plm(formula = invest ~ value + capital, data = pgr, model = "within") Balanced Panel: n=3, T=20, N=60 Residuals : Min. 1st Qu. Median 3rd Qu. Max. -167.00 -26.10 2.09 26.80 202.00 Coefficients : Estimate Std. Error t-value Pr(>|t|) value 0.104914 0.016331 6.4242 3.296e-08 *** capital 0.345298 0.024392 14.1564 < 2.2e-16 *** --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Total Sum of Squares: 1888900 Residual Sum of Squares: 243980 R-Squared : 0.87084 Adj. R-Squared : 0.79827

A two-way model could have been estimated upon setting eect = twoways. If xed eects need to be inspected, a xef() method and an associated summary() method are available. To check whether the xed eects are really needed, we compare the xed eects and the pooled OLS ts by means of pFtest().
> gr_fe <- plm(invest ~ value + capital, data = pgr, model = "within") > pFtest(gr_fe, gr_pool) F test for individual effects data: invest ~ value + capital F = 56.8247, df1 = 2, df2 = 55, p-value = 4.148e-14 alternative hypothesis: significant effects
Rajat Tayal
It is also possible to t a random-eects using the same tting function upon setting model = random and selecting a method for estimating the variance components. Four methods are available: Swamy-Arora, Amemiya, Wallace-Hussain, and Nerlove.
> gr_re <- plm(invest ~ value + capital, data = pgr, model = random.method = "walhus") > summary(gr_re)
Rajat Tayal

Oneway (individual) effect Random Effect Model (Wallace-Hussains transformation) Call: plm(formula = invest ~ value + capital, data = pgr, model = "random", random.method = "walhus") Balanced Panel: n=3, T=20, N=60 Effects: var std.dev share idiosyncratic 4389.31 66.25 0.352 individual 8079.74 89.89 0.648 theta: 0.8374 Residuals : Min. 1st Qu. Median 3rd Qu. Max. -187.00 -32.90 6.96 31.40 210.00 Coefficients : Estimate Std. Error t-value Pr(>|t|) (Intercept) -109.976572 61.701384 -1.7824 0.08001 . value 0.104280 0.014996 6.9539 3.797e-09 *** capital 0.344784 0.024520 14.0613 < 2.2e-16 *** Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 TSS: 1988300 RSS: 257520 R-Squared : 0.87048 Adj. R-Squared : 0.82696 F-statistic: 191.545 on 2 and 57 DF, p-value: < 2.22e-16

A comparison of the regression coecients shows that xedand randomeects methods yield rather similar results for these data. To check whether the random eects are really needed, a Lagrange multiplier test is available in plmtest(), defaulting to the test proposed by Honda (1985). > plmtest(gr_pool) Lagrange Multiplier Test - (Honda) data: invest ~ value + capital normal = 15.4704, p-value < 2.2e-16 alternative hypothesis: significant effects

Random-eects methods are more ecient than the xed-eects estimator under more restrictive assumptions, namely exogeneity of the individual eects. It is therefore important to test for endogeneity, and the standard approach employs a Hausman test. The relevant function phtest() requires two panel regression objects, in our case yielding > phtest(gr_re, gr_fe) Hausman Test data: invest ~ value + capital chisq = 0.0404, df = 2, p-value = 0.98 alternative hypothesis: one model is inconsistent In line with the rather similar estimates presented above, endogeneity does not appear to be a problem here.
Part IV Useful resources
Rajat Tayal
Econometrics Task View in R
Linear regression models: stats, lmtest, car, sandwich, AER. Non linear regression models: stats. Quantile regression: quantreg. Panel data: plm Linear structural equation model: sem.
Rajat Tayal
Books on Econometrics in R
Kleiber C. & A. Zeileis (2008), Applied Econometrics with R, Springer. Vinod H.R. (2008), Hands-on Econometrics using R, World Scientic Publishing. Hatekar N.R. (2010), Principles of Econometrics: An introduction using R, Sage Publications.
Rajat Tayal
Thank You
Rajat Tayal

Econometrics With R

Hochgeladen von

Dokumentinformationen

Originalbeschreibung:

Copyright

Verfügbare Formate

Dieses Dokument teilen

Dokument teilen oder einbetten

Freigabeoptionen

Stufen Sie dieses Dokument als nützlich ein?

Sind diese Inhalte unangemessen?

Copyright:

Verfügbare Formate

Econometrics With R

Hochgeladen von

Copyright:

Verfügbare Formate

Econometrics using R

February 17, 2012

Outline of the presentation

Linear regression with panel data

Part I Linear regression

We estimate the OLS by: = (X T X )1 X T y (6)

The rst example

The rst example

The rst example

The rst example

The rst example

The rst example

The rst example

The rst example

Generic functions for tted (linear) model objects

The rst example

Point and Interval estimates

Two types of predictions:

Figure: Scatterplot with prediction intervals for the journals data

Figure: Diagnostic plots for the journals data

Testing a linear hypothesis

Testing a linear hypothesis

Testing a linear hypothesis

Testing a linear hypothesis

Part II Regression diagnostics

> > > > >

Testing for heteroskedasticity

Testing for heteroskedasticity

Testing for heteroskedasticity

Testing for heteroskedasticity

Testing for heteroskedasticity

Testing the functional form

Testing the functional form

Testing the functional form

Testing for autocorrelation

Testing for autocorrelation

Multiple linear regression

ethnicity cauc:25923 afam: 2232

Multiple linear regression

Part III Linear regression with panel data

Statis linear models Dynamic linear models

One-way panel regression

One-way panel regression

One-way panel regression

One-way panel regression

One-way panel regression

One-way panel regression

Part IV Useful resources

Econometrics Task View in R

Das könnte Ihnen auch gefallen