Sie sind auf Seite 1von 56

Econometrics using R

Rajat Tayal
Rmetrics Panel Data Workshop February 16-February 18, 2011 Indira Gandhi Institute of Development Research, Mumbai

February 17, 2012

Rajat Tayal

Econometrics using R

Outline of the presentation

Linear regression
Simple linear regression Multiple linear regression

Regression diagnostics
Leverage and standardized residuals Deletion diagnostics The function inuence.measures() Testing for heteroskedasticity Testing for functional form Testing for autocorrelation Robust standard errors and tests

Linear regression with panel data

Rajat Tayal

Econometrics using R

Part I Linear regression

Rajat Tayal

Econometrics using R

Introduction
The linear regression model, typically estimated by ordinary least squares (OLS), is the workhorse of applied econometrics. The model is yi = xiT + i , i = 1, 2, . . . , n. (1) y = X + For cross-sections: E ( |X ) = 0 Var ( |X ) = I For time series: E ( j |xi ) = 0, i j. (5)
2

(2)

(3) (4)

Rajat Tayal

Econometrics using R

Introduction

We estimate the OLS by: = (X T X )1 X T y (6)

The corresponding tted values are: y = X , the residuals are = y y and the residual sum of squares is T . In R, models are typically tted by calling a model-tting function, in this case lm(), with a formula object describing the model and a data.frame object containing the variables used in the formula. fm < lm(formula, data, . . .)

Rajat Tayal

Econometrics using R

The rst example


Data from Stock & Watson (2007) on subscriptions to economics journals at US libraries for the year 2000:
> > > > > > install.packages("AER", dependencies=TRUE) library(AER) data("Journals") ; names("Journals") journals <- Journals[, c("subs", "price")] journals$citeprice <- Journals$price/Journals$citations summary(journals) subs price citeprice Min. : 2.0 Min. : 20.0 Min. : 0.005223 1st Qu.: 52.0 1st Qu.: 134.5 1st Qu.: 0.464495 Median : 122.5 Median : 282.0 Median : 1.320513 Mean : 196.9 Mean : 417.7 Mean : 2.548455 3rd Qu.: 268.2 3rd Qu.: 540.8 3rd Qu.: 3.440171 Max. :1098.0 Max. :2120.0 Max. :24.459459

Rajat Tayal

Econometrics using R

The rst example

In view of the wide range of the variables, combined with a considerable amount of skewness, it is useful to take logarithms. The goal is to estimate the eect of the price per citation on the number of library subscriptions. To explore this issue quantitatively, we will t a linear regression model, log (subs)i = 1 + 2 log (citeprice)i +
i

(7)

Rajat Tayal

Econometrics using R

The rst example

In view of the wide range of the variables, combined with a considerable amount of skewness, it is useful to take logarithms. The goal is to estimate the eect of the price per citation on the number of library subscriptions. To explore this issue quantitatively, we will t a linear regression model, log (subs)i = 1 + 2 log (citeprice)i +
i

(7)

Rajat Tayal

Econometrics using R

The rst example

In view of the wide range of the variables, combined with a considerable amount of skewness, it is useful to take logarithms. The goal is to estimate the eect of the price per citation on the number of library subscriptions. To explore this issue quantitatively, we will t a linear regression model, log (subs)i = 1 + 2 log (citeprice)i +
i

(7)

Rajat Tayal

Econometrics using R

The rst example

Here, the formula of interest is log (subs) log (citeprice). This can be used both for plotting and for model tting:
> plot(log(subs) ~ log(citeprice), data = journals) > jour_lm <- lm(log(subs) ~ log(citeprice), data = journals) > abline(jour_lm)

abline() extracts the coecients of the tted model and adds the corresponding regression line to the plot.

Rajat Tayal

Econometrics using R

The rst example

Rajat Tayal

Econometrics using R

The rst example

The function lm() returns a tted-model object, here stored as jour lm. It is an object of class lm.

> class(jour_lm) [1] "lm" > names(jour_lm) [1] "coefficients" "residuals" "effects" "rank" "fitted. [7] "qr" "df.residual" "xlevels" "call" "terms" "model"

Rajat Tayal

Econometrics using R

The rst example


> summary(jour_lm) Call: lm(formula = log(subs) ~ log(citeprice), data = journals) Residuals: Min 1Q Median 3Q Max -2.72478 -0.53609 0.03721 0.46619 1.84808 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 4.76621 0.05591 85.25 <2e-16 *** log(citeprice) -0.53305 0.03561 -14.97 <2e-16 *** --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Residual standard error: 0.7497 on 178 degrees of freedom Multiple R-squared: 0.5573, Adjusted R-squared: 0.5548 F-statistic: 224 on 1 and 178 DF, p-value: < 2.2e-16
Rajat Tayal Econometrics using R

Generic functions for tted (linear) model objects


Function print() summary() coef() (or coecients()) residuals() (or resid()) tted() (or tted.values()) anova() predict() plot() connt() deviance() vcov() logLik() AIC() Function Description simple printed display standard regression output extracting the regression coecients extracting residuals extracting tted values comparison of nested models predictions for new data diagnostic plots condence intervals for the regression coecients residual sum of squares (estimated) variance-covariance matrix log-likelihood (assuming normally distributed errors) information criteria including AIC, BIC/SBC (assuming normally distributed errors)
Econometrics using R

Rajat Tayal

The rst example


It is instructive to take a brief look at what the summary() method returns for a tted lm object:

> jour_slm <- summary(jour_lm) > class(jour_slm) [1] "summary.lm" > names(jour_slm) [1] "call" "terms" "residuals" "coefficients" "aliased" "sigma" [7] "df" "r.squared" "adj.r.squared" "fstatistic" "cov.unscaled > jour_slm$coefficients Estimate Std. Error t value Pr(>|t|) (Intercept) 4.7662121 0.05590908 85.24934 2.953913e-146 log(citeprice) -0.5330535 0.03561320 -14.96786 2.563943e-33

Rajat Tayal

Econometrics using R

Analysis of variance

> anova(jour_lm) Analysis of Variance Table Response: log(subs) Df Sum Sq Mean Sq F value Pr(>F) log(citeprice) 1 125.93 125.934 224.04 < 2.2e-16 *** Residuals 178 100.06 0.562 --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

The ANOVA table breaks the sum of squares about the mean (for the dependent variable, here log(subs)) into two parts: a part that is accounted for by a linear function of log(citeprice) and a part attributed to residual variation.
Rajat Tayal Econometrics using R

Point and Interval estimates

To extract the estimated regression coecients , the function coef() can be used: > coef(jour_lm) (Intercept) log(citeprice) 4.7662121 -0.5330535 > confint(jour_lm, level = 0.95) 2.5 % 97.5 % (Intercept) 4.6558822 4.8765420 log(citeprice) -0.6033319 -0.4627751

Rajat Tayal

Econometrics using R

Prediction

Two types of predictions:


1 2

the prediction of points on the regression line and the prediction of a new data value.

The standard errors of predictions for new data take into account both the uncertainty in the regression line and the variation of the individual points about the line. Thus, the prediction interval for prediction of new data is larger than that for prediction of points on the line. The function predict() provides both types of standard errors.

Rajat Tayal

Econometrics using R

Prediction

> predict(jour_lm, newdata = data.frame(citeprice = 2.11), interval = "confidence") fit lwr upr 4.368188 4.247485 4.48889 > predict(jour_lm, newdata = data.frame(citeprice = 2.11), interval = "prediction") fit lwr upr 4.368188 2.883746 5.852629 The point estimates are identical (t) but the intervals dier. The prediction intervals can also be used for computing and visualizing condence bands.

Rajat Tayal

Econometrics using R

Prediction

The prediction intervals can also be used for computing and visualizing condence bands.
> lciteprice <- seq(from = -6, to = 4, by = 0.25) > jour_pred <- predict(jour_lm, interval = "prediction", newdata = data.frame(citeprice = exp(lciteprice))) > plot(log(subs) ~ log(citeprice), data = journals) > lines(jour_pred[, 1] ~ lciteprice, col = 1) > lines(jour_pred[, 2] ~ lciteprice, col = 1, lty = 2) > lines(jour_pred[, 3] ~ lciteprice, col = 1, lty = 2)

Rajat Tayal

Econometrics using R

Prediction

Figure: Scatterplot with prediction intervals for the journals data


Rajat Tayal Econometrics using R

Plotting lm objects

The plot() method for class lm() provides six types of diagnostic plots, four of which are shown by default. We set the graphical parameter mfrow to c(2, 2) using the par() function, creating a 2 * 2 matrix of plotting areas to see all four plots simultaneously: > par(mfrow = c(2, 2)) > plot(jour_lm) > par(mfrow = c(1, 1))

Rajat Tayal

Econometrics using R

Plotting lm objects

Figure: Diagnostic plots for the journals data


Rajat Tayal Econometrics using R

Testing a linear hypothesis


The standard regression output as provided by summary() only indicates individual signicance of each regressor and joint signicance of all regressors in the form of t and F statistics, respectively. Often it is necessary to test more general hypotheses. This is possible using the function linear.hypothesis() from the car package. Suppose we want to test the hypothesis that the elasticity of the number of library subscriptions with respect to the price per citation equals 0.5. H0 : 2 = 0.5 (8)

Rajat Tayal

Econometrics using R

Testing a linear hypothesis


The standard regression output as provided by summary() only indicates individual signicance of each regressor and joint signicance of all regressors in the form of t and F statistics, respectively. Often it is necessary to test more general hypotheses. This is possible using the function linear.hypothesis() from the car package. Suppose we want to test the hypothesis that the elasticity of the number of library subscriptions with respect to the price per citation equals 0.5. H0 : 2 = 0.5 (8)

Rajat Tayal

Econometrics using R

Testing a linear hypothesis


The standard regression output as provided by summary() only indicates individual signicance of each regressor and joint signicance of all regressors in the form of t and F statistics, respectively. Often it is necessary to test more general hypotheses. This is possible using the function linear.hypothesis() from the car package. Suppose we want to test the hypothesis that the elasticity of the number of library subscriptions with respect to the price per citation equals 0.5. H0 : 2 = 0.5 (8)

Rajat Tayal

Econometrics using R

Testing a linear hypothesis

> linear.hypothesis(jour_lm, "log(citeprice) = -0.5") Linear hypothesis test Hypothesis: log(citeprice) = - 0.5 Model 1: restricted model Model 2: log(subs) ~ log(citeprice) Res.Df RSS Df Sum of Sq F Pr(>F) 179 100.54 178 100.06 1 0.48421 0.8614 0.3546

1 2

Rajat Tayal

Econometrics using R

Part II Regression diagnostics

Rajat Tayal

Econometrics using R

Review

> > > > >

data("Journals") journals <- Journals[, c("subs", "price")] journals$citeprice <- Journals$price/Journals$citations journals$age <- 2000 - Journals$foundingyear jour_lm <- lm(log(subs) ~ log(citeprice), data = journals

Rajat Tayal

Econometrics using R

Testing for heteroskedasticity


For cross-section regressions, the assumption Var ( i |xi ) = 2 is typically in doubt. A popular test for checking this assumption is the Breusch-Pagan test (Breusch and Pagan 1979). For our model tted to the journals data, stored in jour lm, the diagnostic plots in suggest that the variance decreases with the tted values or, equivalently, it increases with the price per citation. Hence, the regressor log(citeprice) used in the main model should also be employed for the auxiliary regression. Under H0 , the test statistic of the Breusch-Pagan test approximately follows a 2 distribution, where q is the q number of regressors in the auxiliary regression (excluding the constant term).
Rajat Tayal Econometrics using R

Testing for heteroskedasticity


For cross-section regressions, the assumption Var ( i |xi ) = 2 is typically in doubt. A popular test for checking this assumption is the Breusch-Pagan test (Breusch and Pagan 1979). For our model tted to the journals data, stored in jour lm, the diagnostic plots in suggest that the variance decreases with the tted values or, equivalently, it increases with the price per citation. Hence, the regressor log(citeprice) used in the main model should also be employed for the auxiliary regression. Under H0 , the test statistic of the Breusch-Pagan test approximately follows a 2 distribution, where q is the q number of regressors in the auxiliary regression (excluding the constant term).
Rajat Tayal Econometrics using R

Testing for heteroskedasticity


For cross-section regressions, the assumption Var ( i |xi ) = 2 is typically in doubt. A popular test for checking this assumption is the Breusch-Pagan test (Breusch and Pagan 1979). For our model tted to the journals data, stored in jour lm, the diagnostic plots in suggest that the variance decreases with the tted values or, equivalently, it increases with the price per citation. Hence, the regressor log(citeprice) used in the main model should also be employed for the auxiliary regression. Under H0 , the test statistic of the Breusch-Pagan test approximately follows a 2 distribution, where q is the q number of regressors in the auxiliary regression (excluding the constant term).
Rajat Tayal Econometrics using R

Testing for heteroskedasticity

The function bptest() implements all these avors of the Breusch-Pagan test. By default, it computes the studentized statistic for the auxiliary regression utilizing the original regressors X.
> bptest(jour_lm) studentized Breusch-Pagan test data: jour_lm BP = 9.803, df = 1, p-value = 0.001742

Rajat Tayal

Econometrics using R

Testing for heteroskedasticity

Alternatively, the White test picks up the heteroskedasticity. It uses the original regressors as well as their squares and interactions in the auxiliary regression, which can be passed as a second formula to bptest().
> bptest(jour_lm, ~ log(citeprice) + I(log(citeprice)^2), + data = journals) studentized Breusch-Pagan test data: jour_lm BP = 10.912, df = 2, p-value = 0.004271

Rajat Tayal

Econometrics using R

Testing the functional form

The assumption E ( |X ) = 0 is crucial for consistency of the least-squares estimator. A typical source for violation of this assumption is a misspecication of the functional form; e.g., by omitting relevant variables. One strategy for testing the functional form is to construct auxiliary variables and assess their signicance using a simple F test. This is what Ramseys RESET does.

Rajat Tayal

Econometrics using R

Testing the functional form

The function resettest() defaults to using second and third powers of the tted values as auxiliary variables. > resettest(jour_lm) RESET test data: jour_lm RESET = 1.4409, df1 = 2, df2 = 176, p-value = 0.2395

Rajat Tayal

Econometrics using R

Testing the functional form

The rainbow test (Utts 1982) takes a dierent approach to testing the functional form. It ts a model to a subsample (typically the middle 50%) and compares it with the model tted to the full sample using an F test. > raintest(jour_lm, order.by = ~ age, data = journals) Rainbow test data: jour_lm Rain = 1.774, df1 = 90, df2 = 88, p-value = 0.003741 This appears to be the case, signaling that the relationship between the number of subscriptions and the price per citation also depends on the age of the journal.

Rajat Tayal

Econometrics using R

Testing for autocorrelation


Let us reconsider the rst model for the US consumption function.
> library(dynlm) > data("USMacroG") > consump1 <- dynlm(consumption ~ dpi + L(dpi), data = USMacroG)

A classical testing procedure suggested for assessing autocorrelation in regression relationships is the Durbin-Watson test (Durbin and Watson 1950).
> dwtest(consump1) Durbin-Watson test data: consump1 DW = 0.0866, p-value < 2.2e-16 alternative hypothesis: true autocorrelation is greater than 0
Rajat Tayal Econometrics using R

Testing for autocorrelation

Further tests for autocorrelation are the Box-Pierce test and the Ljung-Box test, both being implemented in the function Box.test() in base R.
> Box.test(residuals(consump1), type = "Ljung-Box") Box-Ljung test data: residuals(consump1) X-squared = 176.0698, df = 1, p-value < 2.2e-16

Rajat Tayal

Econometrics using R

Multiple linear regression


Labour economics example: Estimation of wage equation
> data("CPS1988") > summary(CPS1988) wage Min. : 50.05 1st Qu.: 308.64 Median : 522.32 Mean : 603.73 3rd Qu.: 783.48 Max. :18777.20

education Min. : 0.00 1st Qu.:12.00 Median :12.00 Mean :13.07 3rd Qu.:15.00 Max. :18.00

experience Min. :-4.0 1st Qu.: 8.0 Median :16.0 Mean :18.2 3rd Qu.:27.0 Max. :63.0

ethnicity cauc:25923 afam: 2232

The model of interest is log (wage) = 1 + 2 exp + 3 exp 2 + 4 education + 5 ethnicity + (9)
Rajat Tayal Econometrics using R

Multiple linear regression


> cps_lm <- lm(log(wage) ~ experience + I(experience^2) + education + ethnicity, data = CPS1988) > summary(cps_lm) Call: lm(formula = log(wage) ~ experience + I(experience^2) + education + ethnicity, data = CPS1988) Residuals: Min 1Q Median 3Q Max -2.9428 -0.3162 0.0580 0.3756 4.3830 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 4.321e+00 1.917e-02 225.38 <2e-16 *** experience 7.747e-02 8.800e-04 88.03 <2e-16 *** I(experience^2) -1.316e-03 1.899e-05 -69.31 <2e-16 *** education 8.567e-02 1.272e-03 67.34 <2e-16 *** ethnicity -2.434e-01 1.292e-02 -18.84 <2e-16 *** --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Residual standard error: 0.5839 on 28150 degrees of freedom Multiple R-squared: 0.3347, Adjusted R-squared: 0.3346 F-statistic: 3541 on 4 and 28150 DF, p-value: < 2.2e-16
Rajat Tayal Econometrics using R

Comparison of models
With more than a single explanatory variable, it is interesting to test for the relevance of subsets of regressors.
> cps_noeth <- lm(log(wage) ~ experience + I(experience^2) + education, data = CPS1988) > anova(cps_noeth, cps_lm) Analysis of Variance Table Model 1: log(wage) ~ experience + I(experience^2) + education Model 2: log(wage) ~ experience + I(experience^2) + education + ethnicity Res.Df RSS Df Sum of Sq F Pr(>F) 1 28151 9719.6 2 28150 9598.6 1 121.02 354.91 < 2.2e-16 *** --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 This reveals that the eect of ethnicity is signicant at any reasonable level.

Rajat Tayal

Econometrics using R

Comparison of models
> cps_noeth <- update(cps_lm, formula = . ~ . - ethnicity) > waldtest(cps_lm, . ~ . - ethnicity) from the package lmtest Wald test Model 1: log(wage) ~ experience + I(experience^2) + education + ethnicity Model 2: log(wage) ~ experience + I(experience^2) + education Res.Df Df F Pr(>F) 1 28150 2 28151 -1 354.91 < 2.2e-16 *** --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Rajat Tayal

Econometrics using R

Part III Linear regression with panel data

Rajat Tayal

Econometrics using R

Introduction

There has been considerable interest in panel data econometrics over the last two decades. The package plm (Croissant and Millo 2008) contains the relevant tting functions and methods for specications in R. Two types of panel data models:
1 2

Statis linear models Dynamic linear models

Rajat Tayal

Econometrics using R

Introduction
Grunfeld data comprising 20 annual observations on the three variables real gross investment (invest), real value of the rm (value), and real value of the capital stock (capital) for 11 large US rms for the years 1935-1954. > data("Grunfeld", package = "AER") > library("plm") > gr <- subset(Grunfeld, firm %in% c("General Electric", "General Motors", "IBM")) > pgr <- plm.data(gr, index = c("firm", "year")) # The last command tells R that the individuals are called "firm", whereas the time identifier is called "year". Pooled OLS can be run using: > gr_pool <- plm(invest ~ value + capital, data = pgr, model = "pooling")
Rajat Tayal Econometrics using R

One-way panel regression


investit = 1 values + 2 capital + i + it (10) where i = 1, . . . , n, t = 1, . . . , T, and the i denote the individual-specic eects. A xed-eects version is estimated by running OLS on a within-transformed model:
> gr_fe <- plm(invest ~ value + capital, data = pgr, model = "within") > summary(gr_fe) Oneway (individual) effect Within Model Call: plm(formula = invest ~ value + capital, data = pgr, model = "within") Balanced Panel: n=3, T=20, N=60 Residuals : Min. 1st Qu. Median 3rd Qu. Max. -167.00 -26.10 2.09 26.80 202.00 Coefficients : Estimate Std. Error t-value Pr(>|t|) value 0.104914 0.016331 6.4242 3.296e-08 *** capital 0.345298 0.024392 14.1564 < 2.2e-16 *** --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Total Sum of Squares: 1888900 Residual Sum of Squares: 243980 R-Squared : 0.87084 Adj. R-Squared : 0.79827
Rajat Tayal Econometrics using R

One-way panel regression


A two-way model could have been estimated upon setting eect = twoways. If xed eects need to be inspected, a xef() method and an associated summary() method are available. To check whether the xed eects are really needed, we compare the xed eects and the pooled OLS ts by means of pFtest().
> gr_fe <- plm(invest ~ value + capital, data = pgr, model = "within") > pFtest(gr_fe, gr_pool) F test for individual effects data: invest ~ value + capital F = 56.8247, df1 = 2, df2 = 55, p-value = 4.148e-14 alternative hypothesis: significant effects

Rajat Tayal

Econometrics using R

One-way panel regression

It is also possible to t a random-eects using the same tting function upon setting model = random and selecting a method for estimating the variance components. Four methods are available: Swamy-Arora, Amemiya, Wallace-Hussain, and Nerlove.

> gr_re <- plm(invest ~ value + capital, data = pgr, model = random.method = "walhus") > summary(gr_re)

Rajat Tayal

Econometrics using R

One-way panel regression


Oneway (individual) effect Random Effect Model (Wallace-Hussains transformation) Call: plm(formula = invest ~ value + capital, data = pgr, model = "random", random.method = "walhus") Balanced Panel: n=3, T=20, N=60 Effects: var std.dev share idiosyncratic 4389.31 66.25 0.352 individual 8079.74 89.89 0.648 theta: 0.8374 Residuals : Min. 1st Qu. Median 3rd Qu. Max. -187.00 -32.90 6.96 31.40 210.00 Coefficients : Estimate Std. Error t-value Pr(>|t|) (Intercept) -109.976572 61.701384 -1.7824 0.08001 . value 0.104280 0.014996 6.9539 3.797e-09 *** capital 0.344784 0.024520 14.0613 < 2.2e-16 *** Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 TSS: 1988300 RSS: 257520 R-Squared : 0.87048 Adj. R-Squared : 0.82696 F-statistic: 191.545 on 2 and 57 DF, p-value: < 2.22e-16
Rajat Tayal Econometrics using R

One-way panel regression


A comparison of the regression coecients shows that xedand randomeects methods yield rather similar results for these data. To check whether the random eects are really needed, a Lagrange multiplier test is available in plmtest(), defaulting to the test proposed by Honda (1985). > plmtest(gr_pool) Lagrange Multiplier Test - (Honda) data: invest ~ value + capital normal = 15.4704, p-value < 2.2e-16 alternative hypothesis: significant effects
Rajat Tayal Econometrics using R

One-way panel regression


Random-eects methods are more ecient than the xed-eects estimator under more restrictive assumptions, namely exogeneity of the individual eects. It is therefore important to test for endogeneity, and the standard approach employs a Hausman test. The relevant function phtest() requires two panel regression objects, in our case yielding > phtest(gr_re, gr_fe) Hausman Test data: invest ~ value + capital chisq = 0.0404, df = 2, p-value = 0.98 alternative hypothesis: one model is inconsistent In line with the rather similar estimates presented above, endogeneity does not appear to be a problem here.
Rajat Tayal Econometrics using R

Part IV Useful resources

Rajat Tayal

Econometrics using R

Econometrics Task View in R

Linear regression models: stats, lmtest, car, sandwich, AER. Non linear regression models: stats. Quantile regression: quantreg. Panel data: plm Linear structural equation model: sem.

Rajat Tayal

Econometrics using R

Books on Econometrics in R

Kleiber C. & A. Zeileis (2008), Applied Econometrics with R, Springer. Vinod H.R. (2008), Hands-on Econometrics using R, World Scientic Publishing. Hatekar N.R. (2010), Principles of Econometrics: An introduction using R, Sage Publications.

Rajat Tayal

Econometrics using R

Thank You

Rajat Tayal

Econometrics using R

Das könnte Ihnen auch gefallen