Beruflich Dokumente
Kultur Dokumente
Rajat Tayal
Rmetrics Panel Data Workshop February 16-February 18, 2011 Indira Gandhi Institute of Development Research, Mumbai
Rajat Tayal
Econometrics using R
Linear regression
Simple linear regression Multiple linear regression
Regression diagnostics
Leverage and standardized residuals Deletion diagnostics The function inuence.measures() Testing for heteroskedasticity Testing for functional form Testing for autocorrelation Robust standard errors and tests
Rajat Tayal
Econometrics using R
Rajat Tayal
Econometrics using R
Introduction
The linear regression model, typically estimated by ordinary least squares (OLS), is the workhorse of applied econometrics. The model is yi = xiT + i , i = 1, 2, . . . , n. (1) y = X + For cross-sections: E ( |X ) = 0 Var ( |X ) = I For time series: E ( j |xi ) = 0, i j. (5)
2
(2)
(3) (4)
Rajat Tayal
Econometrics using R
Introduction
The corresponding tted values are: y = X , the residuals are = y y and the residual sum of squares is T . In R, models are typically tted by calling a model-tting function, in this case lm(), with a formula object describing the model and a data.frame object containing the variables used in the formula. fm < lm(formula, data, . . .)
Rajat Tayal
Econometrics using R
Rajat Tayal
Econometrics using R
In view of the wide range of the variables, combined with a considerable amount of skewness, it is useful to take logarithms. The goal is to estimate the eect of the price per citation on the number of library subscriptions. To explore this issue quantitatively, we will t a linear regression model, log (subs)i = 1 + 2 log (citeprice)i +
i
(7)
Rajat Tayal
Econometrics using R
In view of the wide range of the variables, combined with a considerable amount of skewness, it is useful to take logarithms. The goal is to estimate the eect of the price per citation on the number of library subscriptions. To explore this issue quantitatively, we will t a linear regression model, log (subs)i = 1 + 2 log (citeprice)i +
i
(7)
Rajat Tayal
Econometrics using R
In view of the wide range of the variables, combined with a considerable amount of skewness, it is useful to take logarithms. The goal is to estimate the eect of the price per citation on the number of library subscriptions. To explore this issue quantitatively, we will t a linear regression model, log (subs)i = 1 + 2 log (citeprice)i +
i
(7)
Rajat Tayal
Econometrics using R
Here, the formula of interest is log (subs) log (citeprice). This can be used both for plotting and for model tting:
> plot(log(subs) ~ log(citeprice), data = journals) > jour_lm <- lm(log(subs) ~ log(citeprice), data = journals) > abline(jour_lm)
abline() extracts the coecients of the tted model and adds the corresponding regression line to the plot.
Rajat Tayal
Econometrics using R
Rajat Tayal
Econometrics using R
The function lm() returns a tted-model object, here stored as jour lm. It is an object of class lm.
> class(jour_lm) [1] "lm" > names(jour_lm) [1] "coefficients" "residuals" "effects" "rank" "fitted. [7] "qr" "df.residual" "xlevels" "call" "terms" "model"
Rajat Tayal
Econometrics using R
Rajat Tayal
> jour_slm <- summary(jour_lm) > class(jour_slm) [1] "summary.lm" > names(jour_slm) [1] "call" "terms" "residuals" "coefficients" "aliased" "sigma" [7] "df" "r.squared" "adj.r.squared" "fstatistic" "cov.unscaled > jour_slm$coefficients Estimate Std. Error t value Pr(>|t|) (Intercept) 4.7662121 0.05590908 85.24934 2.953913e-146 log(citeprice) -0.5330535 0.03561320 -14.96786 2.563943e-33
Rajat Tayal
Econometrics using R
Analysis of variance
> anova(jour_lm) Analysis of Variance Table Response: log(subs) Df Sum Sq Mean Sq F value Pr(>F) log(citeprice) 1 125.93 125.934 224.04 < 2.2e-16 *** Residuals 178 100.06 0.562 --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
The ANOVA table breaks the sum of squares about the mean (for the dependent variable, here log(subs)) into two parts: a part that is accounted for by a linear function of log(citeprice) and a part attributed to residual variation.
Rajat Tayal Econometrics using R
To extract the estimated regression coecients , the function coef() can be used: > coef(jour_lm) (Intercept) log(citeprice) 4.7662121 -0.5330535 > confint(jour_lm, level = 0.95) 2.5 % 97.5 % (Intercept) 4.6558822 4.8765420 log(citeprice) -0.6033319 -0.4627751
Rajat Tayal
Econometrics using R
Prediction
the prediction of points on the regression line and the prediction of a new data value.
The standard errors of predictions for new data take into account both the uncertainty in the regression line and the variation of the individual points about the line. Thus, the prediction interval for prediction of new data is larger than that for prediction of points on the line. The function predict() provides both types of standard errors.
Rajat Tayal
Econometrics using R
Prediction
> predict(jour_lm, newdata = data.frame(citeprice = 2.11), interval = "confidence") fit lwr upr 4.368188 4.247485 4.48889 > predict(jour_lm, newdata = data.frame(citeprice = 2.11), interval = "prediction") fit lwr upr 4.368188 2.883746 5.852629 The point estimates are identical (t) but the intervals dier. The prediction intervals can also be used for computing and visualizing condence bands.
Rajat Tayal
Econometrics using R
Prediction
The prediction intervals can also be used for computing and visualizing condence bands.
> lciteprice <- seq(from = -6, to = 4, by = 0.25) > jour_pred <- predict(jour_lm, interval = "prediction", newdata = data.frame(citeprice = exp(lciteprice))) > plot(log(subs) ~ log(citeprice), data = journals) > lines(jour_pred[, 1] ~ lciteprice, col = 1) > lines(jour_pred[, 2] ~ lciteprice, col = 1, lty = 2) > lines(jour_pred[, 3] ~ lciteprice, col = 1, lty = 2)
Rajat Tayal
Econometrics using R
Prediction
Plotting lm objects
The plot() method for class lm() provides six types of diagnostic plots, four of which are shown by default. We set the graphical parameter mfrow to c(2, 2) using the par() function, creating a 2 * 2 matrix of plotting areas to see all four plots simultaneously: > par(mfrow = c(2, 2)) > plot(jour_lm) > par(mfrow = c(1, 1))
Rajat Tayal
Econometrics using R
Plotting lm objects
Rajat Tayal
Econometrics using R
Rajat Tayal
Econometrics using R
Rajat Tayal
Econometrics using R
> linear.hypothesis(jour_lm, "log(citeprice) = -0.5") Linear hypothesis test Hypothesis: log(citeprice) = - 0.5 Model 1: restricted model Model 2: log(subs) ~ log(citeprice) Res.Df RSS Df Sum of Sq F Pr(>F) 179 100.54 178 100.06 1 0.48421 0.8614 0.3546
1 2
Rajat Tayal
Econometrics using R
Rajat Tayal
Econometrics using R
Review
data("Journals") journals <- Journals[, c("subs", "price")] journals$citeprice <- Journals$price/Journals$citations journals$age <- 2000 - Journals$foundingyear jour_lm <- lm(log(subs) ~ log(citeprice), data = journals
Rajat Tayal
Econometrics using R
The function bptest() implements all these avors of the Breusch-Pagan test. By default, it computes the studentized statistic for the auxiliary regression utilizing the original regressors X.
> bptest(jour_lm) studentized Breusch-Pagan test data: jour_lm BP = 9.803, df = 1, p-value = 0.001742
Rajat Tayal
Econometrics using R
Alternatively, the White test picks up the heteroskedasticity. It uses the original regressors as well as their squares and interactions in the auxiliary regression, which can be passed as a second formula to bptest().
> bptest(jour_lm, ~ log(citeprice) + I(log(citeprice)^2), + data = journals) studentized Breusch-Pagan test data: jour_lm BP = 10.912, df = 2, p-value = 0.004271
Rajat Tayal
Econometrics using R
The assumption E ( |X ) = 0 is crucial for consistency of the least-squares estimator. A typical source for violation of this assumption is a misspecication of the functional form; e.g., by omitting relevant variables. One strategy for testing the functional form is to construct auxiliary variables and assess their signicance using a simple F test. This is what Ramseys RESET does.
Rajat Tayal
Econometrics using R
The function resettest() defaults to using second and third powers of the tted values as auxiliary variables. > resettest(jour_lm) RESET test data: jour_lm RESET = 1.4409, df1 = 2, df2 = 176, p-value = 0.2395
Rajat Tayal
Econometrics using R
The rainbow test (Utts 1982) takes a dierent approach to testing the functional form. It ts a model to a subsample (typically the middle 50%) and compares it with the model tted to the full sample using an F test. > raintest(jour_lm, order.by = ~ age, data = journals) Rainbow test data: jour_lm Rain = 1.774, df1 = 90, df2 = 88, p-value = 0.003741 This appears to be the case, signaling that the relationship between the number of subscriptions and the price per citation also depends on the age of the journal.
Rajat Tayal
Econometrics using R
A classical testing procedure suggested for assessing autocorrelation in regression relationships is the Durbin-Watson test (Durbin and Watson 1950).
> dwtest(consump1) Durbin-Watson test data: consump1 DW = 0.0866, p-value < 2.2e-16 alternative hypothesis: true autocorrelation is greater than 0
Rajat Tayal Econometrics using R
Further tests for autocorrelation are the Box-Pierce test and the Ljung-Box test, both being implemented in the function Box.test() in base R.
> Box.test(residuals(consump1), type = "Ljung-Box") Box-Ljung test data: residuals(consump1) X-squared = 176.0698, df = 1, p-value < 2.2e-16
Rajat Tayal
Econometrics using R
education Min. : 0.00 1st Qu.:12.00 Median :12.00 Mean :13.07 3rd Qu.:15.00 Max. :18.00
experience Min. :-4.0 1st Qu.: 8.0 Median :16.0 Mean :18.2 3rd Qu.:27.0 Max. :63.0
The model of interest is log (wage) = 1 + 2 exp + 3 exp 2 + 4 education + 5 ethnicity + (9)
Rajat Tayal Econometrics using R
Comparison of models
With more than a single explanatory variable, it is interesting to test for the relevance of subsets of regressors.
> cps_noeth <- lm(log(wage) ~ experience + I(experience^2) + education, data = CPS1988) > anova(cps_noeth, cps_lm) Analysis of Variance Table Model 1: log(wage) ~ experience + I(experience^2) + education Model 2: log(wage) ~ experience + I(experience^2) + education + ethnicity Res.Df RSS Df Sum of Sq F Pr(>F) 1 28151 9719.6 2 28150 9598.6 1 121.02 354.91 < 2.2e-16 *** --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 This reveals that the eect of ethnicity is signicant at any reasonable level.
Rajat Tayal
Econometrics using R
Comparison of models
> cps_noeth <- update(cps_lm, formula = . ~ . - ethnicity) > waldtest(cps_lm, . ~ . - ethnicity) from the package lmtest Wald test Model 1: log(wage) ~ experience + I(experience^2) + education + ethnicity Model 2: log(wage) ~ experience + I(experience^2) + education Res.Df Df F Pr(>F) 1 28150 2 28151 -1 354.91 < 2.2e-16 *** --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Rajat Tayal
Econometrics using R
Rajat Tayal
Econometrics using R
Introduction
There has been considerable interest in panel data econometrics over the last two decades. The package plm (Croissant and Millo 2008) contains the relevant tting functions and methods for specications in R. Two types of panel data models:
1 2
Rajat Tayal
Econometrics using R
Introduction
Grunfeld data comprising 20 annual observations on the three variables real gross investment (invest), real value of the rm (value), and real value of the capital stock (capital) for 11 large US rms for the years 1935-1954. > data("Grunfeld", package = "AER") > library("plm") > gr <- subset(Grunfeld, firm %in% c("General Electric", "General Motors", "IBM")) > pgr <- plm.data(gr, index = c("firm", "year")) # The last command tells R that the individuals are called "firm", whereas the time identifier is called "year". Pooled OLS can be run using: > gr_pool <- plm(invest ~ value + capital, data = pgr, model = "pooling")
Rajat Tayal Econometrics using R
Rajat Tayal
Econometrics using R
It is also possible to t a random-eects using the same tting function upon setting model = random and selecting a method for estimating the variance components. Four methods are available: Swamy-Arora, Amemiya, Wallace-Hussain, and Nerlove.
> gr_re <- plm(invest ~ value + capital, data = pgr, model = random.method = "walhus") > summary(gr_re)
Rajat Tayal
Econometrics using R
Rajat Tayal
Econometrics using R
Linear regression models: stats, lmtest, car, sandwich, AER. Non linear regression models: stats. Quantile regression: quantreg. Panel data: plm Linear structural equation model: sem.
Rajat Tayal
Econometrics using R
Books on Econometrics in R
Kleiber C. & A. Zeileis (2008), Applied Econometrics with R, Springer. Vinod H.R. (2008), Hands-on Econometrics using R, World Scientic Publishing. Hatekar N.R. (2010), Principles of Econometrics: An introduction using R, Sage Publications.
Rajat Tayal
Econometrics using R
Thank You
Rajat Tayal
Econometrics using R