
Econometric pills

Causality and correlation


The notion of ceteris paribus (that is, holding all other relevant factors fixed) is crucial to establishing a causal relationship. Simply finding that two variables are correlated is rarely enough to conclude that a change in one variable causes a change in another. This is due to the nature of economic data: rarely can we run a controlled experiment that would allow a simple correlation analysis to uncover causality. Instead, we use econometric methods to effectively hold other factors fixed.
Because economic variables are properly interpreted as random variables, we should use ideas from probability to formalize the sense in which a change in w causes a change in y. If we focus on the average, or expected, response, a ceteris paribus analysis entails estimating E(y | w, c), the expected value of y conditional on w and c. The vector c denotes a set of control variables that we would like to explicitly hold fixed when studying the effect of w on the expected value of y. The reason we control for these variables is that we think w is correlated with other factors that also influence y. If w is continuous, interest centers on ∂E(y | w, c)/∂w, which is usually called the partial effect of w on E(y | w, c). If w is discrete, we are interested in E(y | w, c) evaluated at different values of w, with the elements of c fixed at the same specified values.
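As a simple illustration (an assumption added here for exposition, not part of the original notes), suppose the conditional mean is linear: E(y | w, c) = β_0 + β_1 w + γ c. Then the partial effect of w is ∂E(y | w, c)/∂w = β_1: holding the controls c fixed, a one-unit increase in w changes the expected value of y by β_1.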

The linear regression model


To develop an econometric model, let's start with a deterministic model: C = α + βY, where C is consumption, Y is income, and α and β are two parameters to be estimated. We know that the relationship we have just written is not deterministic in the real world: fluctuations can occur that disturb it. We therefore add a stochastic component, to take into account the uncertainty incorporated in many economic variables: C = α + βY + ε, where ε is a random disturbance. When we want to investigate such a theoretical relation, we estimate a model of the form:

y_i = α + β x_i + ε_i

where y is the dependent variable, x is the independent or explanatory variable, and i = 1, …, n indexes the n observations in our sample.
To complete our model, on top of the linearity hypothesis, we add some assumptions:

- Zero mean of the disturbances: E[ε_i] = 0 for all i.
- Homoskedasticity: Var[ε_i] = σ², constant for all i.
- Non-autocorrelation: Cov[ε_i, ε_j] = 0 if i ≠ j.
- No correlation between the regressor and the disturbance: Cov[x_i, ε_j] = 0 for all i and j.
- Normality of the disturbances: ε_i ~ N(0, σ²).

In graphical terms, when we run a regression we get something very close to the following picture:

Figure 1

So, in our regression model, the parameter α captures the intercept of the function that represents the relationship between the dependent variable and the regressor, while β is the slope coefficient. When x increases by one unit, y increases by β: β captures the marginal effect of x on y. The same concept holds true when we turn to a linear regression with multiple explanatory variables:

y_i = α + β x_i + γ z_i + δ w_i + ε_i

Here the regressors are x, z and w. The parameters β, γ and δ capture the partial effects of each of these regressors on y, holding the others constant.
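As a minimal sketch of how such a regression can be estimated (the simulated data, the parameter values and the use of Python with the statsmodels package are assumptions added for illustration, not part of the original notes):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 200

    # Toy data satisfying the classical assumptions listed above (made-up "true" parameters)
    x = rng.normal(size=n)
    z = rng.normal(size=n)
    w = rng.normal(size=n)
    eps = rng.normal(scale=1.0, size=n)            # zero-mean, homoskedastic, normal disturbances
    y = 1.0 + 0.5 * x + 2.0 * z - 1.0 * w + eps    # alpha = 1, beta = 0.5, gamma = 2, delta = -1

    # OLS estimation of y_i = alpha + beta*x_i + gamma*z_i + delta*w_i + eps_i
    X = sm.add_constant(np.column_stack([x, z, w]))
    result = sm.OLS(y, X).fit()
    print(result.params)   # estimated alpha, beta, gamma, delta
    print(result.bse)      # their standard errors

Dropping z and w from X reproduces the simple regression y_i = α + β x_i + ε_i.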

Testing and significance

Coefficients can be estimated, but, given the presence of disturbances, we cannot be sure that the underlying true parameters are of the same magnitude and sign as we have hypothesized in our model. Think of Figure 1: the estimated coefficient may be close to one, but can we be sure that the true β is indeed 1?
To test this, we employ statistical inference and the tools of hypothesis testing.
Start with β, the true coefficient of the relation we are studying. After estimation, we obtain b, the estimated coefficient, and s², the estimated variance of the error terms ε. We can also get the sample variation of x, defined as S_xx = Σ(x_i − x̄)², where x̄ is the sample mean of x. We can then obtain the estimate of the variance of b as Var[b] = s²/S_xx. Taking the square root of the estimated variance of b, we get s_b, the standard error of the estimate b. So s_b = s/√S_xx.
It can be shown that

(b − β)/s_b ~ t(n − 2):

the ratio of the difference between the estimated parameter and its true value to the estimated standard error is distributed as a Student's t with (n − 2) degrees of freedom.
In practice, suppose we have a sample of n = 10 and estimate b = 1.96 and s_b = 0.384. If we hypothesize that the true β is, say, equal to 1, we compute

(1.96 − 1)/0.384 = 2.5

to test H_0: β = 1 against H_1: β > 1. If H_0 holds, the ratio just calculated is distributed as a t distribution with 8 degrees of freedom. Suppose that, before discarding our hypothesized value of β, we want to be confident at the 95% level that our estimated b is different from 1. We then check whether the value 2.5 exceeds the 95th percentile of the Student's t distribution. There are tables that show the values taken by the t distribution at different confidence levels. For 8 degrees of freedom, the critical value at the 95% confidence level is 1.860. Since 2.5 > 1.860, we reject the null hypothesis H_0: β = 1 and accept H_1: β > 1. Our baseline hypothesis probably has to be revised.
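The same calculation can be reproduced in a few lines (a sketch assuming the scipy package is available; the numbers are those of the worked example above):

    from scipy import stats

    b, b0, se, df = 1.96, 1.0, 0.384, 8    # estimate, hypothesized value, standard error, n - 2

    t_stat = (b - b0) / se                 # = 2.5
    crit = stats.t.ppf(0.95, df)           # one-sided 95% critical value, about 1.86
    p_value = 1 - stats.t.cdf(t_stat, df)  # one-sided p-value

    print(t_stat, crit, p_value)           # reject H0: beta = 1, since t_stat > crit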

In standard regression analysis, this procedure is employed to test whether the estimated coefficients are different from zero. The relevant test statistic is therefore t = b/s_b. Since we are testing the null of β equal to zero against the two-sided alternative that it is either greater or smaller, we take the ratio in absolute value and compare it with the upper α/2 critical value in the tables. We reject the null H_0: β = 0 if |b/s_b| > t_{α/2}, and the coefficient is then said to be statistically significant. In general, in large samples, the value 1.96, which corresponds to the 5 percent significance level (95 percent confidence), is used as a benchmark when tables are not available.
Finally, note that you can recover the t statistic even when a paper reports only coefficients and standard errors as post-estimation output: simply divide each estimated coefficient by its displayed standard error to get the t value and check the significance of each variable.
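For instance (a sketch with made-up numbers, purely for illustration):

    import numpy as np

    # Hypothetical coefficients and standard errors as reported in a paper
    coefs = np.array([0.42, -1.10, 0.03])
    ses   = np.array([0.20,  0.25, 0.05])

    t_values = coefs / ses
    significant_5pct = np.abs(t_values) > 1.96   # large-sample two-sided benchmark
    print(t_values, significant_5pct)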

Endogeneity and Instrumental Variables Estimation

Suppose we have a model of the form y_i = α + β x_1i + γ x_2i + δ x_3i + ε_i, but one of the assumptions of the linear regression model is violated for one regressor. Namely, one regressor is correlated with the disturbances: Cov[x_3, ε] ≠ 0. We can then say that, while x_1 and x_2 are exogenous, x_3 is potentially endogenous in our model.

The violation of the no-correlation assumption for one regressor disrupts the validity of the whole model: in other words, we cannot consistently estimate any parameter in our model if there is correlation between one regressor and the disturbances.
The correlation itself may be due to the presence in the real world of an omitted variable that has not been included in our model. If this variable is correlated with x_3, the coefficient of the latter captures both the effect of x_3 on the dependent variable and the effect of the omitted variable on the dependent variable. Note also that the source of endogeneity may be the dependent variable itself: there are often cases of reverse causation, where one regressor is actually influenced by the values taken by the dependent variable.
The method of instrumental variables is widely used to solve the problems caused by the endogeneity of one regressor. In practice, we must find an observable variable z that is correlated with x_3 but uncorrelated with the disturbances in our model (Cov[z, ε] = 0).
The correlation with x_3 must, on the other hand, be conditional on the exogenous regressors: it must not be a simple correlation, but a significant partial effect of z on x_3 once the effects of the other covariates have been netted out.
More formally, the condition requires that, in the linear projection of x_3 on all the exogenous variables, the coefficient of z is significantly different from zero:

x_3 = π_0 + π_1 x_1 + π_2 x_2 + θ z + r,   with θ ≠ 0.

Here, r is a random disturbance that fulfils the linear regression assumptions.


If the two conditions outlined above are respected, we can say that z is a valid instrument for x_3. It is important to stress, though, that the full list of instrumental variables consists of all the exogenous regressors in the base equation plus the instrument z.
Once we have identified z, consistent estimates of our model can be obtained by substituting the linear projection of x_3 into the original equation.
Closely related to the general instrumental variables technique (and the one mostly used in practice) is two-stage least squares (2SLS) estimation. The procedure consists in first running a regression of x_3 on the full set of instrumental variables, obtaining the estimates π̂_0, π̂_1, π̂_2 and θ̂. The estimated coefficients are then combined with the observed values of the variables to obtain the fitted value of x_3:

x̂_3 = π̂_0 + π̂_1 x_1 + π̂_2 x_2 + θ̂ z

This fitted value (which is a function of the instrumental variables only) is plugged into the original equation in place of the original x_3. To see why this identifies the parameters, substitute the linear projection of x_3 into the original equation:

y_i = α + β x_1i + γ x_2i + δ (π_0 + π_1 x_1i + π_2 x_2i + θ z_i) + u_i

where u_i = ε_i + δ r_i. Rearranging the terms of the equation above, we get:

y_i = (α + δπ_0) + (β + δπ_1) x_1i + (γ + δπ_2) x_2i + δθ z_i + u_i
y_i = λ_0 + λ_1 x_1i + λ_2 x_2i + λ_3 z_i + u_i

This last equation can be consistently estimated by OLS. The important thing is that it can be shown that we can recover not only the λ's, but also the original coefficients of interest: α, β, γ and δ. In other words, the assumptions made in the instrumental variables approach solve the identification problem for the original coefficients.
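The two stages can be sketched in code on simulated data (the data-generating process, the parameter values and the statsmodels package are assumptions for illustration; note also that the standard errors printed by the naive second-stage regression are not the correct 2SLS standard errors, so a dedicated IV routine should be used in real applications):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 5000

    # Simulate an endogenous regressor: x3 and the disturbance share a common component
    x1, x2, z = rng.normal(size=(3, n))
    common = rng.normal(size=n)
    x3 = 0.5 * x1 - 0.3 * x2 + 1.0 * z + common + rng.normal(size=n)   # relevant instrument z
    eps = common + rng.normal(size=n)                                  # correlated with x3
    y = 1.0 + 0.8 * x1 + 0.4 * x2 + 2.0 * x3 + eps                     # true delta = 2

    # First stage: regress x3 on all exogenous variables (x1, x2, z)
    Z = sm.add_constant(np.column_stack([x1, x2, z]))
    x3_hat = sm.OLS(x3, Z).fit().fittedvalues

    # Second stage: replace x3 with its fitted value and run OLS
    X2 = sm.add_constant(np.column_stack([x1, x2, x3_hat]))
    second = sm.OLS(y, X2).fit()
    print(second.params)   # delta estimate close to 2; plain OLS on x3 would be biased upward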

Duration analysis and the Cox hazard model

Sometimes we are interested in the duration of particular events, so we study the time elapsed until a certain event occurs (or a certain state is abandoned). Thus, we are interested in how some characteristics affect survival times. Many studies focus on the probability of exiting the initial state within a short interval, conditional on having survived up to the starting time of the interval. An approximation of this probability is given by the hazard function.
Suppose T is the time at which a person leaves the initial state. For example, if the initial state is
unemployment, T would be the time until a person becomes employed.
The cumulative distribution function (cdf) of T is defined as

F(t) = P(T ≤ t),   t ≥ 0.

The survivor function is defined as S(t) ≡ 1 − F(t) = P(T > t) and is the probability of surviving (not exiting the state) past time t.
For h > 0, the probability of leaving the initial state in the interval [t, t + h) given survival up to time t is P(t ≤ T < t + h | T ≥ t).
The hazard function for T is defined as

λ(t) = lim_{h→0} P(t ≤ T < t + h | T ≥ t) / h

So, for each t, λ(t) is the instantaneous rate of leaving per unit of time.
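A simple worked example (added here for illustration, not in the original notes): since P(t ≤ T < t + h | T ≥ t) ≈ f(t)h/S(t) for small h, the hazard can be written as λ(t) = f(t)/S(t), where f is the density of T. If T is exponentially distributed with parameter μ, then F(t) = 1 − exp(−μt), S(t) = exp(−μt) and f(t) = μ exp(−μt), so λ(t) = μ for all t: the exit rate does not depend on how long the spell has lasted.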
An important class of hazard functions is that of proportional hazard models. A proportional hazard model takes the form:

λ(t; x) = k(x) λ_0(t)

where k(·) > 0 is a positive function of x and λ_0(t) > 0 is the baseline hazard. The baseline hazard is common to all individuals, so each individual's hazard differs proportionally from the others' according to the different characteristics captured by the term k(x). If we impose k(x) = exp(xβ), where β is a vector of parameters, we can transform the equation of the hazard model into:

log λ(t; x) = xβ + log λ_0(t)

This model is named the Cox hazard model, since Cox (1972) first designed the procedure to correctly estimate its β. The estimation exploits the fact that, in most cases, we are not interested in the baseline hazard itself but want to focus on the effects of the covariates x on the hazard function.
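As a hedged sketch of how such a model can be estimated in practice (the lifelines package, the simulated data and all variable names are assumptions added for illustration, not part of the original notes):

    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter

    rng = np.random.default_rng(2)
    n = 1000

    # Toy data: exponential durations whose rate depends on two covariates (made-up effects)
    x1 = rng.normal(size=n)
    x2 = rng.binomial(1, 0.5, size=n)
    rate = np.exp(0.5 * x1 - 0.8 * x2)             # k(x) = exp(x'beta), baseline hazard = 1
    duration = rng.exponential(1.0 / rate)
    censor_time = rng.exponential(2.0, size=n)     # independent right-censoring
    observed = (duration <= censor_time).astype(int)
    time = np.minimum(duration, censor_time)

    df = pd.DataFrame({"time": time, "event": observed, "x1": x1, "x2": x2})

    cph = CoxPHFitter()
    cph.fit(df, duration_col="time", event_col="event")  # estimates beta for x1 and x2
    cph.print_summary()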

References:

Greene, W. H. (1991), Econometric Analysis, Maxwell Macmillan, ch. 5.

Wooldridge, J. M. (2001), Econometric Analysis of Cross Section and Panel Data, MIT Press, ch. 1.
