
(copyright by Scott M. Lynch, February 2003)


Simple Linear Regression (Soc 504)
The linear regression model is one of the most important and widely-used models in statistics.
Most of the statistical methods that are common in sociology (and other disciplines) today
are extensions of this basic model. The model is used to summarize large quantities of data
and to make inference about relationships between variables in a population. Before we
discuss this model in depth, we should consider why we need statistics in the first place, in
order to lay the foundation for understanding the importance of regression.
1 Why Statistics?
There are three goals of statistics:
1. Data summarization
2. Making inference about populations from which we have a sample
3. Making predictions about future observations
1.1 Data Summarization
You have already become familiar with this notion by creating univariate summary statis-
tics about sample distributions, like measures of central tendency (mean, median, mode)
and dispersion (variance, ranges). Linear regression is also primarily about summarizing
data, but regression is a way to summarize information about relationships between 2 or
more variables. Assume for instance, we have the following information:
[Figure 1. F(death rates) by Age. Scatterplot of the observed/predicted f(rates) (y-axis, roughly -8 to 0) against Age (x-axis, 40 to 100).]
This plot is of a function of 1995 death rates for the total US population ages 25-105, by age (see http://www.demog.berkeley.edu/wilmoth/mortality). [Side Note: Death rates are approximately exponential across age, so I've taken the ln() to linearize the relationship. Also, I've added a random normal variable to them [N(0, .5)], because at the population level, there is no sampling error, something we will discuss in the next lecture].
There is clearly a linear pattern between age and these 81 death rates. One goal of a
statistical model is to summarize a lot of information with a few parameters. In this example,
since there is a linear pattern, we could summarize this data with 3 parameters in a linear
model:
$$Y_i = \beta_0 + \beta_1\mathrm{Age}_i + \epsilon_i \qquad \text{(Population Equation)}$$

Or

$$Y_i = b_0 + b_1\mathrm{Age}_i + e_i \qquad \text{(Sample Equation)}$$

Or

$$\hat{Y}_i = b_0 + b_1\mathrm{Age}_i \qquad \text{(Sample Model)}$$
We will discuss the meaning of each of these equations soon, but for now, note that, if this model fits the data well, we will have reduced 81 pieces of information to 3 pieces of information (an intercept, $b_0$; a slope, $b_1$; and an error term - more specifically, an error variance, $s^2_e$).
1.2 Inference
Typically we have a sample, but we would like to infer things about the population from
which the sample was drawn. This is the same goal that you had when you conducted various
statistical tests in previous courses. Inference in linear regression isn't much different. Our goal is to let the estimates $b_0$ and $b_1$ be our best guess about the population parameters $\beta_0$ and $\beta_1$, which represent the relationship between two (or more) variables at the population level. Formalizing inference in this model will be the topic of the next lecture.
1.3 Prediction
Sometimes, a goal of statistical modeling is to predict future observations from observed
data. For example, given the data above, we might extrapolate our line out from age 105
to predict what the death rate should be for age 106, or 110. Or we might extrapolate our
line back from age 25 to predict the death rate for persons at age 15. Alternatively, suppose
I had a time series of death rates from 1950-2001, and I wanted to project death rates for
2002. Or, finally, suppose I had a model that predicted the death rates of smokers versus
nonsmokers, and I wanted to predict whether a person would die within the next 10 years
based on whether s/he were a smoker versus a nonsmoker.
1.4 Prediction and Problems with Causal Thinking
Inference and Prediction are not very different, but prediction tends to imply causal arguments. We often use causal arguments in interpreting regression coefficients, but we need to realize that statistical models, no matter how complicated, may never be able to demonstrate causality. There are three rules of causality:

1. Temporality. Cause must precede effect.

2. Correlation. Two variables must be correlated (in some way) if one causes the other.

3. Nonspuriousness. The relationship between two variables can't be attributable to the influence of a third variable.
1.4.1 Temporality
Many, if not most, social science data sets are cross-sectional. This makes it impossible to
determine whether A causes B or vice versa. Here is where theory (and some common sense)
comes in. Theory may tell us that A is causally prior to B. For example, social psychological
theory suggests that stress induces depression, and not that depression leads to stress. (note
that two theories, however, may posit opposite causal directions). Common sense may also
reveal the direction of the relationship. For example, in the mortality rate example, it is
unreasonable to assume that death rates make people older.
1.4.2 Correlation and Nonspuriousness
Two variables must be related (B) if there is a causal relationship between them (A), but
this does not imply the reverse statement that correlation demonstrates causation. Why?
Because there could be any number of alternate explanations for the relationship between
two variables. There are several types of alternate explanations. First, a variable C could be
correlated with A but be the true cause of B. In that case, the relationship between A and
B is spurious. A classic example of this is that ice cream consumption rates (in gallons
per capita) are related to rape rates (in rapes per 100,000 persons). This is not a causal
relationship, however, because both are ultimately driven by season. More rapes occur,
and more ice cream is consumed, during the summer. Regression modeling can help us rule
out such spurious relationships.
Second, A could affect C, with C affecting B. In that case, A may be considered a cause, but C is the more proximate cause (often called an intervening variable). For example, years of education is strongly linearly related to health. However, we seriously doubt that time spent in school is ultimately the cause of health; instead, education probably affects income, and income is more proximately related to health. Often, our goal is to find proximate causes of an outcome variable, and we will be discussing how the linear model is often used (albeit somewhat incorrectly) to find proximate causes.
Third, two variables may be correlated, but the relationship may not be a causal one. Generally, when we say that two variables are related, we are thinking about this at the within-individual level; that as a characteristic changes for an individual it will influence some other characteristic of the individual. Yet, our models generally capture covariance at the between-individual level. The fact that gender, for example, covaries with income DOES NOT imply that a sex change operation will automatically lead to an increase in pay. With fixed characteristics, like gender or race, we often use causal terminology, but because gender cannot change, it technically cannot be a cause of anything. Instead, there may be more proximate and changeable factors that are associated with gender which are also associated with the outcome variable in which we are interested. Experimentalists realize this, and experiments involve observing average within-individual change, with a manipulable intervention.
As another example, life course research emphasizes three different types of effects of time: age, period, and cohort effects. Age effects refer to biological or social processes that occur at the within-individual level as the individual ages. Period effects refer to historical processes that occur at the macro level at some point in time and influence individuals at multiple ages. Cohort effects, at least one interpretation of them, refer to the interaction of period with age. A period event at time t may affect persons age x at time t differently than persons age x+5 at t. Think, for example, about the difference between computer knowledge of persons currently age 20 versus those currently age 70. The fact that 70 year-olds know less about computers is not an artifact of some cognitive function decline across age - it's due to the differential effect of the period event of the invention of the home PC across birth cohorts.

Recognizing these different types of time effects demonstrates why our models may fail at determining causality. Imagine some sort of life course process that is stable across an individual's life course but may vary across birth cohorts - suppose it is decreasing across birth cohorts. Assume that we take a cross section and look at the age pattern - we will observe a linear age pattern, even though this is not true for any individual:
[Figure 2. Hypothetical Y by Age. Plot of Hypothetical Y (y-axis, roughly -0.4 to 0.6) against Age (x-axis, 0 to 100), with separate lines for the 1900, 1920, 1940, 1960, 1980, and 2000 birth cohorts and the 2000 period mean.]
This example, as well as the ice cream and rape example, highlights additional fallacies that we need to be aware of when thinking causally: ecological fallacies and individualistic fallacies. If we assume that, because ice cream consumption rates and rape rates were related, people who eat ice cream are rapists, this would be committing the ecological fallacy. Briefly stated, macro relationships aren't necessarily true at the micro level. To give a more reasonable example, we know that SES is related to heart disease mortality at the macro level (with richer countries having greater rates of heart disease mortality). This does not imply that having low SES at the individual level is an advantage - we know that the pattern is reversed at that level. The explanation for the former (macro) finding may be differences in competing risks or diet at the national level. The explanation for the latter (micro) finding may be differences in access to health care, diet, exercise, etc.

The individualistic fallacy is essentially the opposite fallacy - reverse the arguments above. We cannot infer macro patterns from micro patterns. Furthermore, exceptions to a pattern don't invalidate the pattern.

Because of the various problems with causality and its modeling, we will stay away from thinking causally, although to some extent, the semantics we use in discussing regression will be causal.
2 Back to Regression
Recalling the example above under data summarization, we have data on death rates by age, and we would like to summarize the data by using a linear model. We saw 3 equations above. In these equations, $\beta_0$ is the linear intercept term, $\beta_1$ is the (linear) effect of age on death rates, and $\epsilon_i$ is an error term that captures the difference (vertical distance) between the line implied by the coefficients and the actual data we observed - obviously every observation cannot fall exactly on a straight line through the data.

We assume that the first equation holds in the population. However, we don't have population data. So, we change our notation slightly to the second equation.

In the third equation, the error term has dropped out of the equation, because $\hat{Y}$ (y-hat) is the expected (mean) score for the rate applicable to a particular age. Note that simple algebra yields $e_i = Y_i - \hat{Y}_i = Y_i - (b_0 + b_1\mathrm{Age}_i)$.
3 Least Squares Estimation
How do we find estimates of the regression coefficients? We should develop some criteria that lead us to prefer one set of coefficients over another set.

One reasonable strategy is to find the line that gives us the least error. This must be done for the entire sample, so we would like the $b_0$ and $b_1$ that yield $\min\sum_{i=1}^{n}e_i$. However, the sum of the raw errors will always be 0, so long as the regression line passes through the point $(\bar{x}, \bar{y})$. An alternate strategy is to find the line that gives us the least absolute value error: $\min\sum_{i=1}^{n}|e_i|$. We will discuss this strategy later in the semester.

It is easier to work with squares (which are also positive), so we may consider: $\min\sum_{i=1}^{n}e_i^2$. In fact, this is the criterion generally used, which is why it is called Ordinary Least Squares regression.
We need to minimize this term, so, recalling from calculus that a maximum or minimum is reached where the derivative of a curve equals zero, we simply need to take the derivative of the least squares function, set it equal to 0, and solve for $b_0$ and $b_1$.
There is one catch - in this model we have two parameters that we need to solve for. So, we need to take two partial derivatives: one with respect to $b_0$ and the other with respect to $b_1$. Then we will need to solve the set of two equations:

$$\frac{\partial F}{\partial b_0} = \frac{\partial}{\partial b_0}\left[\sum_{i=1}^{n}\left(Y_i - (b_0 + b_1 X_i)\right)^2\right] = 2nb_0 - 2\sum Y_i + 2b_1\sum X_i$$

and:

$$\frac{\partial F}{\partial b_1} = \frac{\partial}{\partial b_1}\left[\sum_{i=1}^{n}\left(Y_i - (b_0 + b_1 X_i)\right)^2\right] = -2\sum X_i Y_i + 2b_0\sum X_i + 2b_1\sum X_i^2.$$
Setting these equations to 0 and solving for $b_0$ and $b_1$ yields:

$$b_0 = \bar{Y} - b_1\bar{X}$$

and

$$b_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$$

Notice that the denominator of $b_1$ is $(n-1)\,s^2_x$, and the numerator is $(n-1)\,\mathrm{cov}(X,Y)$. So, $b_1 = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)}$.
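These closed-form expressions are easy to verify numerically. Below is a minimal Python sketch (not part of the original notes) that computes $b_0$ and $b_1$ from the formulas above; the data values are made up for illustration.

```python
import numpy as np

# Hypothetical data: x could be age, y the log death rate (made-up values).
x = np.array([40., 50., 60., 70., 80., 90., 100.])
y = np.array([-6.1, -5.3, -4.4, -3.6, -2.7, -1.9, -1.0])

# Closed-form OLS estimates from the formulas above.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Equivalent expression: b1 = Cov(X, Y) / Var(X).
b1_alt = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

print(b0, b1, b1_alt)  # b1 and b1_alt agree
```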
4 Maximum Likelihood Estimation
An alternative approach to estimating the coefficients is to use maximum likelihood estimation. Recall that for ML estimation, we first establish a likelihood function. If we assume the errors ($e_i$) are $N(0, s_e)$, then we can establish a likelihood function based on the error. Once again, assuming observations are independent, the joint pdf for the data given the parameters (the likelihood function) is:

$$p(Y \mid b_0, b_1, X) \propto L(b_0, b_1 \mid X, Y) = \prod_{i=1}^{n}\frac{1}{s_e\sqrt{2\pi}}\exp\left[-\frac{\left(Y_i - (b_0 + b_1 X_i)\right)^2}{2s_e^2}\right]$$
This likelihood reduces to:

$$p(Y \mid b_0, b_1, X) \propto L(b_0, b_1 \mid X, Y) = \left(2\pi s_e^2\right)^{-\frac{n}{2}}\exp\left[-\frac{\sum_{i=1}^{n}\left(Y_i - (b_0 + b_1 X_i)\right)^2}{2s_e^2}\right]$$
Taking the log of the likelihood, we get:

$$LL(b_0, b_1, X) \propto -n\log(s_e) - \frac{1}{2s_e^2}\left[\sum_{i=1}^{n}\left(Y_i - (b_0 + b_1 X_i)\right)^2\right]$$
Taking the derivative of this function with respect to each parameter yields the following 3 equations:

$$\frac{\partial LL}{\partial b_0} = -\frac{1}{2s_e^2}\left[2nb_0 - 2\sum Y_i + 2b_1\sum X_i\right]$$

and

$$\frac{\partial LL}{\partial b_1} = -\frac{1}{2s_e^2}\left[-2\sum X_i Y_i + 2b_0\sum X_i + 2b_1\sum X_i^2\right]$$

and

$$\frac{\partial LL}{\partial s_e} = s_e^{-3}\sum\left(Y_i - (b_0 + b_1 X_i)\right)^2 - \frac{n}{s_e}$$
Setting these partial derivatives equal to 0 and solving for the parameters yields the same values as the OLS approach. Furthermore, the error variance is found to be:

$$s_e^2 = \frac{\sum e_i^2}{n}$$

However, due to estimation, we lose 2 degrees of freedom, making the unbiased denominator $n-2$, rather than $n$. Realize that this end result is really nothing more than the average squared error. If we take the square root, we get the standard error of the regression, which we can use to construct measures of model fit.
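To see the equivalence claimed here, the following hedged Python sketch minimizes the negative of the log-likelihood above numerically (using scipy) and compares the resulting estimates and error variance with the OLS formulas; the data and starting values are hypothetical, not the mortality data used in the notes.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data (same made-up style as the earlier sketch).
x = np.array([40., 50., 60., 70., 80., 90., 100.])
y = np.array([-6.1, -5.3, -4.4, -3.6, -2.7, -1.9, -1.0])

def neg_log_lik(theta):
    """Negative of the log-likelihood above: n*log(s_e) + SSE / (2 * s_e**2)."""
    b0, b1, log_se = theta
    se = np.exp(log_se)                  # work on the log scale so s_e stays positive
    resid = y - (b0 + b1 * x)
    return len(y) * np.log(se) + np.sum(resid ** 2) / (2 * se ** 2)

# Rough data-informed starting values; Nelder-Mead needs no derivatives.
fit = minimize(neg_log_lik, x0=np.array([y.mean(), 0.0, 0.0]), method="Nelder-Mead")
b0_ml, b1_ml, log_se_ml = fit.x

# The ML point estimates of b0 and b1 match the OLS formulas; the ML variance
# divides by n, while the unbiased estimate divides by n - 2.
resid = y - (b0_ml + b1_ml * x)
s2_ml = np.sum(resid ** 2) / len(y)
s2_unbiased = np.sum(resid ** 2) / (len(y) - 2)
print(b0_ml, b1_ml, s2_ml, s2_unbiased)
```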
Using the results above, the regression for the death rate data yields the following line:
[Figure 3. Observed and Predicted F(death rates). The observed points and the fitted regression line, observed/predicted f(rates) (y-axis, roughly -8 to 0) by Age (x-axis, 40 to 100).]
(copyright by Scott M. Lynch, February 2003)
Simple Regression II: Model Fit and Inference (Soc 504)
In the previous notes, we derived the least squares and maximum likelihood estimators for
the parameters in the simple regression model. Estimates of the parameters themselves,
however, are only useful insofar as we can determine a) how well the model fits the data and b) how good our estimates are, in terms of the information they provide about the possible relationship between variables in the population. In this set of notes, I discuss how to interpret the parameter estimates in the model, how to evaluate the fit of the model, and how to make inference to the population using the sample estimates.
1 Interpretation of Coefficients/Parameters

The interpretation of the coefficients from a linear regression model is fairly straightforward. The estimated intercept ($b_0$) tells us the value of y that is expected when x = 0. This is often not very useful, because many of our variables don't have true 0 values (or at least not relevant ones - like education or income, which rarely if ever have 0 values). The slope ($b_1$) is more important, because it tells us the relationship between x and y. It is interpreted as the expected change in y for a one-unit change in x. In the death rates example, the intercept estimate was -9.45, and the slope estimate was .084. This means that the log of the death rates is expected to be -9.45 at age 0 and is expected to increase .084 units for each year of age.

The other parameter in the model, the standard error of the regression ($s_e$), is important because it gives us (in y units) an estimate of the average error associated with a predicted score. In the model for death rates, the standard error of the regression was .576.
2 Evaluating Model Fit
Before we make inferences about parameters from our sample estimates, we would like to decide just how well the model fits the data. A first approach to this is to determine the amount of the variance in the dependent variable that is accounted for by the regression line. If we think of the formula for variance:

$$\frac{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}{n-1},$$

we can consider the numerator to be called the Total Sum of Squares (TSS). We have an estimate of the variance that is unexplained by the model - the Residual Sum of Squares (SSE), which is just the numerator of the standard error of the regression function:

$$\sum_{i=1}^{n}e_i^2 = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$$
The difference between these two is the Regression Sum of Squares (RSS), or the amount of variance accounted for by the model. RSS can be represented as:

$$\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2$$

Some basic algebra reveals that, in fact, RSS + SSE = TSS. A measure of model fit can be constructed by taking the ratio of the explained variance to the total variance:

$$R^2 = \frac{\sum(\hat{Y}_i - \bar{Y})^2}{\sum(Y_i - \bar{Y})^2} = 1 - \frac{\sum(Y_i - \hat{Y}_i)^2}{\sum(Y_i - \bar{Y})^2}$$

This measure ranges from 0 to 1. For a poor-fitting model, the error (SSE) will be large (possibly equal to the TSS), making RSS small. For example, if we were to allow the mean of y to be our best guess for y ($\hat{Y} = \bar{Y}$), then we would be supposing that the relationship between x and y did not exist. In this case, RSS would be 0, as would $R^2$. For a perfect model, on the other hand, RSS = TSS, so $R^2 = 1$. In the death rates example, the $R^2$ was .92, indicating a good linear fit.
Interestingly, in the simple linear regression model the signed square root of $R^2$ is the correlation between x and y (sign based on sign of $b_1$). In multiple regression, this computation yields the multiple correlation, which we will discuss later.
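A quick numerical illustration of these sums of squares and of $R^2$ (again on made-up data, not the mortality data used in the notes):

```python
import numpy as np

# Hypothetical data and fitted values from a simple regression (made-up).
x = np.array([40., 50., 60., 70., 80., 90., 100.])
y = np.array([-6.1, -5.3, -4.4, -3.6, -2.7, -1.9, -1.0])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

tss = np.sum((y - y.mean()) ** 2)        # Total Sum of Squares
sse = np.sum((y - y_hat) ** 2)           # Residual (error) Sum of Squares
rss = np.sum((y_hat - y.mean()) ** 2)    # Regression Sum of Squares

r2 = rss / tss                           # equivalently: 1 - sse / tss
r = np.sign(b1) * np.sqrt(r2)            # signed square root = correlation(x, y)
print(r2, 1 - sse / tss, r, np.corrcoef(x, y)[0, 1])
```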
These results are almost all that is needed to complete an ANOVA table for the regression model. The typical model ANOVA output looks like (this is from my death rate example):

ANOVA TABLE    Df    SS       MS       F        Sig
Regression     1     315.54   315.54   949.78   p = 0
Residual       79    26.25    .33223
Total          80    341.79
The general computation of the table is:

ANOVA TABLE    DF     SS                                        MS           F         Sig
Regression     k-1    $\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2$   RSS/Df(R)    MSR/MSE   (from F table)
Residual       n-k    $\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$       SSE/Df(E) (called MSE)
Total          n-1    $\sum_{i=1}^{n}(Y_i - \bar{Y})^2$

where k is the number of parameters, and n is the number of data points (sample size).
For the F test, the numerator and denominator degrees of freedom are $k-1$ and $n-k$, respectively. The F test is a joint test that all the coefficients in the model (aside from the intercept) are 0. In simple linear regression, $F = t^2$ (for $b_1$).

There are numerous ways to further assess model fit, specifically by examining the error terms. We will discuss these later, in the context of multiple regression.
3 Inference
Just as the mean in the models that you discussed in previous classes (and that we discussed earlier) has a sampling distribution from which you can assume your estimate $\bar{x}$ is a sampled value, regression coefficients also have a sampling distribution. For these sampling distributions to be general, there are several assumptions that the OLS (or MLE) regression model must meet. They include:
1. Linearity. This assumption says that the relationship between x and y must be linear. If this assumption is not met, then the parameters will be biased.

2. Homoscedasticity (constant variance of y across x). This assumption says that the variance of the error term cannot change across values of x. If this assumption is violated, then the standard errors for the parameters will be biased. Note that if the linearity assumption is violated, the homoscedasticity assumption need not be.

3. Independence of $e_i$ and $e_j$, for $i \neq j$. This assumption says that there cannot be a relationship between the errors for any two individuals. If this assumption is violated, then the likelihood function does not hold, because the probability multiplication rule says that joint probabilities are the product of the individual probabilities ONLY if the events are independent.

4. $e_i \sim N(0, \sigma_e)$. Because the structural part of y is fixed, given x, this is equivalent to saying that $y_i \sim N(b_0 + b_1 x_i, \sigma_e)$. This assumption is simply that the errors are normally distributed. If this assumption is not met, then the likelihood function is not appropriate. In terms of least squares, this assumption guarantees that the sampling distributions for the parameters are also normal. However, as n gets large, this assumption becomes less important, by the CLT.
If all of these assumptions are met, then the OLS estimates/MLE estimates are BLUE (Best Linear Unbiased Estimates), and the sampling distributions for the parameters can be derived. They are:

$$b_0 \sim N\left(\beta_0,\ \frac{\sigma^2\sum X_i^2}{n\sum(X_i - \bar{X})^2}\right)$$

$$b_1 \sim N\left(\beta_1,\ \frac{\sigma^2}{\sum(X_i - \bar{X})^2}\right)$$
With the standard errors obtained (square root of the sampling distribution variance estimators above), we can construct t-tests on the parameters just as we conducted t-tests on the mean in previous courses. Note that these are t-tests, because we must replace $\sigma_e$ with our estimate, $s_e$, for reasons identical to those we discussed before. The formulas for the standard errors are often expressed as (where $(MSE)^{\frac{1}{2}}$ and $s_e$ are identical):

$$S.E.(\hat{b}_0) = \sqrt{MSE\left[\frac{\sum X_i^2}{n\sum(X_i - \bar{X})^2}\right]}$$

$$S.E.(\hat{b}_1) = \sqrt{\frac{MSE}{\sum(X_i - \bar{X})^2}}$$
Generally, we are interested in whether x is, in fact, related to y. If x is not related to y, then $b_1$ will equal 0, and our best guess for the value of y is $\bar{y}$. Thus, we typically hypothesize that $b_1 = 0$ and construct the following t-test:

$$t = \frac{b_1 - (H_0\!:\ \beta_1 = 0)}{s.e.(b_1)}$$

In the death rate example, the standard error for $b_0$ was .189, and the standard error for $b_1$ was .0027. Computing the t-tests yields values of -49.92 and 30.82, respectively. Both of these are large enough (in absolute value) that we can reject the null hypothesis that the population parameters, $\beta_0$ and $\beta_1$, are 0. In other words, if the null hypothesis were true, these data would be almost impossible to obtain in a random sample. Thus, we reject the null hypothesis.
As we discussed before, constructing a confidence interval for this estimate is a simple algebraic contortion of the t-test in which we decide a priori the confidence level we desire for our interval:

$$(1-\alpha)\%\ C.I. = b_1 \pm t_{\frac{\alpha}{2}}\, s.e.(b_1)$$
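The standard errors, t-test, and confidence interval can be computed directly from these formulas; the sketch below is an illustration on the same made-up data as before (the 95% critical value comes from the t distribution with n - 2 degrees of freedom).

```python
import numpy as np
from scipy import stats

# Hypothetical data (same made-up values as the earlier sketches).
x = np.array([40., 50., 60., 70., 80., 90., 100.])
y = np.array([-6.1, -5.3, -4.4, -3.6, -2.7, -1.9, -1.0])
n = len(y)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
mse = np.sum(resid ** 2) / (n - 2)              # unbiased error variance (n - 2 df)

se_b0 = np.sqrt(mse * np.sum(x ** 2) / (n * np.sum((x - x.mean()) ** 2)))
se_b1 = np.sqrt(mse / np.sum((x - x.mean()) ** 2))

t_b1 = b1 / se_b1                               # t statistic for H0: beta_1 = 0
t_crit = stats.t.ppf(0.975, df=n - 2)           # two-sided 95% critical value
ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
print(se_b0, se_b1, t_b1, ci_b1)
```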
These tests work also for the intercept, although, as we discussed before, the intercept is often not of interest. However, there are some cases in which the intercept may be of interest. For example, if we are attempting to determine whether one variable is an unbiased measure of another, then the intercept should be 0. If there is a bias, then the intercept should pick this up, and the intercept will not be 0. For example, if people tend to under-report their weight by 5 pounds, then the regression model for observed weight regressed on reported weight will have an intercept of 5 (implying you can take an individual's reported weight and add 5 pounds to it to get their actual weight - see Fox).

Sometimes, beyond confidence intervals for the coefficients themselves, we would like to have prediction intervals for an unobserved y. Because the regression model posits that y is a function of both $b_0$ and $b_1$, a prediction interval for y must take variability in both parameter estimates into account. There are various ways to do such calculations, depending on exactly what quantity you are interested in (but these are beyond the scope of this material).

With the above results, you can pretty much complete an entire bivariate regression analysis.
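As a cross-check, a canned routine should reproduce all of the quantities above. This sketch uses statsmodels, which is not referenced in the notes; the attribute names (params, bse, tvalues, rsquared, conf_int) are my assumption about that library's API rather than anything prescribed here.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data (same made-up values as the earlier sketches).
x = np.array([40., 50., 60., 70., 80., 90., 100.])
y = np.array([-6.1, -5.3, -4.4, -3.6, -2.7, -1.9, -1.0])

X = sm.add_constant(x)          # adds the column of ones for the intercept
fit = sm.OLS(y, X).fit()

print(fit.params)               # b0, b1
print(fit.bse)                  # standard errors
print(fit.tvalues, fit.pvalues) # t statistics and p-values
print(fit.rsquared)             # R^2
print(fit.conf_int(alpha=0.05)) # 95% confidence intervals
```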
4 Deriving the Standard Errors for Simple Regression
The process of deriving the standard errors for the parameters in the simple regression problem using maximum likelihood estimation involves the same steps as we used in finding the standard errors in the normal mean/standard deviation problem before:

1. Construct the second derivative matrix of the log likelihood function with respect to each of the parameters. That is, in this problem, we have three parameters, and hence the Hessian matrix will contain 6 unique elements (I've replaced $s^2_e$ with $\tau$):

$$\begin{bmatrix}
\frac{\partial^2 LL}{\partial b_0^2} & \frac{\partial^2 LL}{\partial b_0\,\partial b_1} & \frac{\partial^2 LL}{\partial b_0\,\partial\tau} \\[4pt]
\frac{\partial^2 LL}{\partial b_1\,\partial b_0} & \frac{\partial^2 LL}{\partial b_1^2} & \frac{\partial^2 LL}{\partial b_1\,\partial\tau} \\[4pt]
\frac{\partial^2 LL}{\partial\tau\,\partial b_0} & \frac{\partial^2 LL}{\partial\tau\,\partial b_1} & \frac{\partial^2 LL}{\partial\tau^2}
\end{bmatrix}$$

2. Take the negative expectation of this matrix to obtain the Information matrix.

3. Invert the information matrix to obtain the variance-covariance matrix of the parameters. For the actual standard error estimates, we substitute the MLEs for the population parameters.
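A numerical companion to these three steps (not part of the original notes): the sketch below fills in the 3x3 information matrix using the analytic elements derived later in this section, with sample quantities plugged in for the population values, and inverts it with numpy. The data are the same hypothetical values used earlier, and tau denotes the error variance.

```python
import numpy as np

# Hypothetical data (same made-up values as the earlier sketches).
x = np.array([40., 50., 60., 70., 80., 90., 100.])
y = np.array([-6.1, -5.3, -4.4, -3.6, -2.7, -1.9, -1.0])
n = len(y)

# ML estimates of b0, b1, and the error variance tau (denominator n).
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
tau = np.sum((y - (b0 + b1 * x)) ** 2) / n

# Information matrix from the elements derived below (zeros in the b/tau blocks).
info = np.array([
    [n / tau,          np.sum(x) / tau,      0.0],
    [np.sum(x) / tau,  np.sum(x ** 2) / tau, 0.0],
    [0.0,              0.0,                  n / (2 * tau ** 2)],
])

vcov = np.linalg.inv(info)          # variance-covariance matrix of (b0, b1, tau)
print(np.sqrt(np.diag(vcov)[:2]))   # standard errors of b0 and b1 (using the ML tau)
```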
4.1 Deriving the Hessian Matrix
We can obtain the elements of the Hessian matrix using the first derivatives shown above. Using the first derivative of the log likelihood with respect to $b_0$, we can take the second derivative with respect to $b_0$. The first derivative with respect to $b_0$ was:

$$\frac{\partial LL}{\partial b_0} = -\frac{1}{2\tau}\left[2nb_0 - 2\sum Y + 2b_1\sum X\right].$$

The second derivative with respect to $b_0$ is thus simply:

$$\frac{\partial^2 LL}{\partial b_0^2} = -\frac{1}{2\tau}(2n) = -\frac{n}{\tau}.$$

The second derivative with respect to $b_1$ is:

$$\frac{\partial^2 LL}{\partial b_0\,\partial b_1} = -\frac{1}{2\tau}\left[2\sum X\right] = -\frac{\sum X}{\tau}$$

The second derivative with respect to $\tau$ is:

$$\frac{\partial^2 LL}{\partial b_0\,\partial\tau} = \frac{1}{\tau^2}\left[nb_0 - \sum Y + b_1\sum X\right]$$
The three remaining second partial derivatives require the other two first partial derivatives. Using the first partial derivative with respect to $b_1$:

$$\frac{\partial LL}{\partial b_1} = -\frac{1}{2\tau}\left[-2\sum XY + 2b_0\sum X + 2b_1\sum X^2\right],$$

we can easily take the second partial derivative with respect to $b_1$:

$$\frac{\partial^2 LL}{\partial b_1^2} = -\frac{1}{2\tau}\left[2\sum X^2\right] = -\frac{\sum X^2}{\tau}$$

We can also obtain the second partial derivative with respect to $\tau$:

$$\frac{\partial^2 LL}{\partial b_1\,\partial\tau} = \frac{1}{\tau^2}\left[-\sum XY + b_0\sum X + b_1\sum X^2\right]$$

For finding the second partial derivative with respect to $s^2_e$, we need to re-derive the first derivative with respect to $s^2_e$ rather than $s_e$. So, letting $\tau = s^2_e$ as we did above, we get the following:

$$\frac{\partial LL}{\partial\tau} = -\frac{n}{2\tau} + \frac{1}{2\tau^2}\sum\left(Y - (b_0 + b_1 X)\right)^2.$$

The second partial derivative, then, is:

$$\frac{\partial^2 LL}{\partial\tau^2} = \frac{1}{2}\left[\frac{n}{\tau^2} - \frac{2}{\tau^3}\sum\left(Y - (b_0 + b_1 X)\right)^2\right].$$
We now have all 6 second partial derivatives, giving us the following Hessian matrix (after a little rearranging of the terms):

$$\begin{bmatrix}
-\frac{n}{\tau} & -\frac{\sum X}{\tau} & \frac{nb_0 - \sum Y + b_1\sum X}{\tau^2} \\[6pt]
-\frac{\sum X}{\tau} & -\frac{\sum X^2}{\tau} & \frac{-\sum XY + b_0\sum X + b_1\sum X^2}{\tau^2} \\[6pt]
\frac{nb_0 - \sum Y + b_1\sum X}{\tau^2} & \frac{-\sum XY + b_0\sum X + b_1\sum X^2}{\tau^2} & \frac{n}{2\tau^2} - \frac{\sum\left(Y - (b_0 + b_1 X)\right)^2}{\tau^3}
\end{bmatrix}$$
4.2 Computing the Information Matrix
In order to obtain the information matrix, we need to negate the expectation of the Hessian matrix. There are a few tricks involved in this process. The negative expectation of the first element is simply $\frac{n}{\tau}$. The negative expectation of the second element (and also the fourth, given the symmetry of the matrix) is $\frac{n\mu_X}{\tau}$ (recall the trick we used earlier - that if $\bar{X} = \frac{\sum X}{n}$, then $\sum X = n\bar{X}$). We will skip the third and sixth (and hence also the seventh and eighth) elements for the moment. The negative of the expectation of the fifth element is $\frac{\sum X^2}{\tau}$. (We leave it unchanged, given that there is no simple way to reduce this quantity). Finally, to take the negative of the expectation of the ninth and last element, we must first note that the expression can be rewritten as: $\frac{n\tau - 2\sum\left(Y - (b_0 + b_1 X)\right)^2}{2\tau^3}$. The expectation of the sum, though, is nothing more than the error variance itself, $\tau$, taken n times. Thus, the expectation of the numerator is $n\tau - 2n\tau$. After some simplification, including some cancelling with terms in the denominator, we are left with $\frac{n}{2\tau^2}$ for the negative expectation.
All the other elements in this matrix (3, 6, 7, and 8) go to 0 in expectation. Let's take the case of the third (and seventh) element. The expectation of $b_0$ is $\beta_0$, which is equal to $\mu_Y - \beta_1\mu_X$. The expectation of $\sum Y$ is $n\mu_Y$, and the expectation of $\sum X$ is $n\mu_X$. Finally, the expectation of $b_1$ is $\beta_1$. Substituting these expressions into the numerator yields:

$$n(\mu_Y - \beta_1\mu_X) - n\mu_Y + n\beta_1\mu_X.$$

This term clearly sums to 0, and hence the entire third and seventh expressions are 0. A similar finding results for elements six and eight.
4.3 Inverting the Information Matrix
Our information matrix obtained in the previous section turns out to be:

$$I(\theta) = \begin{bmatrix}
\frac{n}{\tau} & \frac{n\mu_X}{\tau} & 0 \\[4pt]
\frac{n\mu_X}{\tau} & \frac{\sum X^2}{\tau} & 0 \\[4pt]
0 & 0 & \frac{n}{2\tau^2}
\end{bmatrix}$$
Although inverting $3\times 3$ matrices is generally not easy, the 0 elements in this one make the problem relatively simple. We can break this problem into parts. First, recall that the multiple of a matrix with its inverse produces an identity matrix. So, in this case, the sum of the multiple of the elements of the last row of the information matrix by the elements of the last column of the inverse of the information matrix must be 1. But, all of these elements are 0 except the last:

$$\begin{bmatrix} 0 & 0 & I(\theta)^{-1}_{33}\end{bmatrix}\begin{bmatrix} 0 \\ 0 \\ \frac{n}{2\tau^2}\end{bmatrix} = 1.$$

The only way for this to occur is for $I(\theta)^{-1}_{33}$ to be $\frac{2\tau^2}{n}$.
If we do a little more thinking about this problem, we will see that, because of the 0 elements, to invert the rest of the matrix, we can treat the remaining non-zero elements of the matrix as a $2\times 2$ sub-matrix, with the elements that are 0 in the information matrix also being 0 in the variance-covariance matrix. The inverse of a $2\times 2$ matrix M can be found using the following rule:

$$M^{-1} \equiv \begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} = \frac{1}{|M|}\begin{bmatrix} D & -B \\ -C & A \end{bmatrix}.$$

You can derive this result algebraically for yourself by setting up a system of four equations in four unknowns.
The determinant of the submatrix here is $\frac{n\sum X^2 - n^2\mu_X^2}{\tau^2}$, so, after inverting (the constant in front of the inverse formula above is $\frac{1}{|M|}$), we obtain $\frac{\tau^2}{n\sum X^2 - n^2\mu_X^2}$. This expression can be simplified by recognizing that an n can be factored from the numerator of the determinant, leaving us with n multiplied by the numerator for the so-called computation formula for the variance of X ($= \sum(X - \mu_X)^2$). Thus, we have $\frac{\tau^2}{n\sum(X - \mu_X)^2}$. The inverse matrix thus becomes:

$$\begin{bmatrix}
\frac{\tau\sum X^2}{n\sum(X - \mu_X)^2} & \frac{-\tau\mu_X}{\sum(X - \mu_X)^2} \\[6pt]
\frac{-\tau\mu_X}{\sum(X - \mu_X)^2} & \frac{\tau}{\sum(X - \mu_X)^2}
\end{bmatrix}$$
We now have all the elements of the variance-covariance matrix of the parameters. These terms should look familiar, after replacing $\tau$ with $s^2_e$ and the population-level parameters with the sample ML estimates:

$$\begin{bmatrix}
\frac{s^2_e\sum X^2}{n\sum(X - \bar{X})^2} & \frac{-s^2_e\bar{X}}{\sum(X - \bar{X})^2} \\[6pt]
\frac{-s^2_e\bar{X}}{\sum(X - \bar{X})^2} & \frac{s^2_e}{\sum(X - \bar{X})^2}
\end{bmatrix}$$
(copyright by Scott M. Lynch, February 2003)
Multiple Regression I (Soc 504)
Generally, we are not interested in examining the relationship between simply two variables.
Rather we may be interested in examining the relationship between multiple variables and
some outcome of interest. Or, we may believe that a relationship between one variable and
another is spurious on a third variable. Or, we may believe that the relationship between one
variable and another is being masked by some third variable. Or, still yet, we may believe
that a relationship between one variable and another may depend on another variable. In
these cases, we conduct multiple regression analysis, which is simply an extension of the
simple regression model we have discussed thus far.
1 The Multiple Regression Model
In scalar notation, the multiple regression model is:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_k X_{ik} + \epsilon_i$$

We rarely express the model in this fashion, however, because it is more compact to use matrix notation. In that case, we often use:

$$\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} 1 & X_{11} & \ldots & X_{1k} \\ 1 & X_{21} & \ldots & X_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & \ldots & X_{nk} \end{bmatrix}\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$$

or just $Y = X\beta + \epsilon$. At the sample level, the model is $Y = Xb + e$. In these equations, n is the number of observations in the data, and $k+1$ is the number of regression parameters (the +1 is for the intercept, $\beta_0$ or $b_0$).

Note that if you use what you know about matrix algebra, you can multiply out the X matrix by the $\beta$ matrix, add the error vector, and get the scalar result above for each observation. The column of ones gets multiplied by $\beta_0$, so that the intercept term stands alone without an X variable.
2 The OLS Solution in Matrix Form
The OLS solution for $\beta$ can be derived the same way we derived it in the previous lecture, but here we must use matrix calculus. Again, we need to minimize the sum of the squared error terms ($\sum_{i=1}^{n}e_i^2$). This can be expressed in matrix notation as:

$$F = \min\ (Y - Xb)^T(Y - Xb)$$

Notice that the summation symbol is not needed here, because $(Y - Xb)$ is an $n\times 1$ column vector. Transposing this vector and multiplying it by itself (untransposed) produces a scalar that is equal to the sum of the squared errors.

In the next step, we need to minimize F by taking the derivative of the expression and setting it equal to 0, as before:

$$\frac{\partial F}{\partial\beta} = -2X^T(Y - X\beta) = -2X^TY + 2X^TX\beta.$$

It may be difficult to see how this derivative is taken. Realize that the construction above is a quadratic form in $\beta$ (and X). We could think of the equation as: $(Y - X\beta)^2$. In that case, we would obtain: $-2X(Y - X\beta)$ for our derivative. This is exactly what this expression is. X is transposed so that the multiplication that is implied in the result is possible. Note that, using the distributive property of matrix multiplication, we are able to distribute the $-2X^T$ across the parenthetical.

Setting the derivative equal to 0 and dividing by -2 yields:

$$0 = X^TY - X^TX\beta$$

Obviously, if we move $X^TX\beta$ to the other side of the equation, we get:

$$X^TX\beta = X^TY$$

In order to isolate $\beta$, we need to premultiply both sides of the equation by $(X^TX)^{-1}$. This leaves us with $I\beta$ on the left, which equals $\beta$, and the OLS solution $\left(X^TX\right)^{-1}X^TY$ on the right.

The standard error of the regression looks much as before. Its analog in matrix form is:

$$\sigma^2_e = \frac{1}{n-k}\,e^Te.$$

Finally, the variance-covariance matrix of the parameter estimates can be obtained by:

$$Var(\beta) = \sigma^2_e\left(X^TX\right)^{-1}$$

You would need to square root the diagonal elements to obtain the standard errors of the parameters for hypothesis testing. Notice that this result looks similar to the bivariate regression result, if you think of the inverse function as being similar to division. We will derive these estimators for the standard errors using an ML approach in the next lecture.
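In code, the matrix solution is a few lines. The sketch below (illustrative only, with simulated data) computes $b = (X^TX)^{-1}X^TY$, the error variance with an $n-k$ denominator, and the variance-covariance matrix of the estimates.

```python
import numpy as np

# Hypothetical data with two regressors (made-up values).
n = 8
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])  # ones + 2 X's
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

# OLS solution: beta = (X'X)^{-1} X'y (solve() is preferred to an explicit inverse).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Error variance and the variance-covariance matrix of the estimates.
e = y - X @ beta_hat
k = X.shape[1]
sigma2_e = (e @ e) / (n - k)
vcov = sigma2_e * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(vcov))                     # standard errors for hypothesis tests
print(beta_hat, se)
```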
3 Matrix Solution for Simple Regression
We will demonstrate the OLS solution for the bivariate regression model. Before we do so, though, let me discuss a little further a type of matrix expression you will see often: $X^TX$. As we discussed above, this type of term is a quadratic form, more or less equivalent to $X^2$ in scalar notation. The primary difference between the matrix and the scalar form is that in the matrix form the off-diagonal elements of the resulting matrix will be the cross-products of the X variables - essentially their covariances - while the main diagonal will be the sum of $X^2$ for each X - essentially the variances.
The X matrix for the bivariate regression model looks like:

$$X = \begin{bmatrix} 1 & X_{11} \\ 1 & X_{21} \\ \vdots & \vdots \\ 1 & X_{n1} \end{bmatrix}$$

For the purposes of exposition, I will not change the subscripts when we transpose this matrix. If we compute $X^TX$, we will get:

$$\begin{bmatrix} 1 & 1 & \ldots & 1 \\ X_{11} & X_{21} & \ldots & X_{n1} \end{bmatrix}\begin{bmatrix} 1 & X_{11} \\ 1 & X_{21} \\ \vdots & \vdots \\ 1 & X_{n1} \end{bmatrix} = \begin{bmatrix} n & \sum X \\ \sum X & \sum X^2 \end{bmatrix}$$

To compute the inverse of this matrix, we can use the rule presented in the last chapter for inverting $2\times 2$ matrices:

$$M^{-1} = \frac{1}{|M|}\begin{bmatrix} m_{22} & -m_{12} \\ -m_{21} & m_{11} \end{bmatrix}.$$
In this case, the determinant of $X^TX$ is $n\sum X^2 - \left(\sum X\right)^2$, and so the inverse is:

$$\left(X^TX\right)^{-1} = \begin{bmatrix}
\frac{\sum X^2}{n\sum X^2 - (\sum X)^2} & \frac{-\sum X}{n\sum X^2 - (\sum X)^2} \\[6pt]
\frac{-\sum X}{n\sum X^2 - (\sum X)^2} & \frac{n}{n\sum X^2 - (\sum X)^2}
\end{bmatrix}$$

We now need to postmultiply this by $X^TY$, which is:

$$\begin{bmatrix} 1 & 1 & \ldots & 1 \\ X_{11} & X_{21} & \ldots & X_{n1} \end{bmatrix}\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} \sum Y \\ \sum XY \end{bmatrix}$$
So, $\left(X^TX\right)^{-1}X^TY$ is:

$$\begin{bmatrix}
\frac{\sum X^2}{n\sum X^2 - (\sum X)^2} & \frac{-\sum X}{n\sum X^2 - (\sum X)^2} \\[6pt]
\frac{-\sum X}{n\sum X^2 - (\sum X)^2} & \frac{n}{n\sum X^2 - (\sum X)^2}
\end{bmatrix}\begin{bmatrix} \sum Y \\ \sum XY \end{bmatrix} = \begin{bmatrix}
\frac{\sum X^2\sum Y - \sum X\sum XY}{n\sum X^2 - (\sum X)^2} \\[6pt]
\frac{-\sum X\sum Y + n\sum XY}{n\sum X^2 - (\sum X)^2}
\end{bmatrix}$$

Let's work first with the denominator of the elements in the vector on the right. The term $(\sum X)^2$ can be rewritten as $n^2\bar{X}^2$. Then, an n can be factored from the denominator, and we are left with $n\left(\sum X^2 - n\bar{x}^2\right)$. As we discussed before, this is equal to n times the numerator for the variance: $n\sum(X - \bar{X})^2$.
The numerator of the second element in the vector can be rewritten to be $n\sum XY - n^2\bar{X}\bar{Y}$. An n can be factored from this expression and will cancel with the n in the denominator. So, we are left with $\sum XY - n\bar{X}\bar{Y}$ in the numerator. For reasons that will become apparent in a moment, we can express this term as $\sum XY - 2n\bar{X}\bar{Y} + n\bar{X}\bar{Y}$. The first term here can be rewritten as $-\sum X\bar{Y} - \sum\bar{X}Y$, and the second term can be written as $\sum\bar{X}\bar{Y}$. All four terms can now be collected under a single summation as:

$$\sum\left(XY - X\bar{Y} - \bar{X}Y + \bar{X}\bar{Y}\right),$$

which factors into $\sum(X - \bar{X})(Y - \bar{Y})$, and so the whole expression (numerator and denominator) becomes:

$$\frac{\sum(X - \bar{X})(Y - \bar{Y})}{\sum(X - \bar{X})^2}.$$

This should look familiar. It is the same expression we obtained previously for the slope. Observe that we now have a new computational formula:

$$\sum XY - n\bar{X}\bar{Y} = \sum(X - \bar{X})(Y - \bar{Y}).$$

What is interesting about the result we just obtained is that we obtained the result without deviating each variable from its mean in the $X^TX$ matrix, but the means re-entered anyway.
Now, let's determine the first element in the solution vector. The numerator is $\sum X^2\sum Y - \sum X\sum XY$. This can be reexpressed as $n\bar{Y}\sum X^2 - n\bar{X}\sum XY$. Once again, the denominator is $n\sum(X - \bar{X})^2$. So, the n's in the numerator and denominator all cancel. Now, we can strategically add and subtract $\bar{X}n\bar{X}\bar{Y}$ in the numerator to obtain:

$$\frac{\bar{Y}\sum X^2 - \bar{X}n\bar{X}\bar{Y} - \bar{X}\sum XY + \bar{X}n\bar{X}\bar{Y}}{\sum(X - \bar{X})^2}.$$

With a minimal amount of algebraic manipulation, we can obtain:

$$\frac{\bar{Y}\left(\sum X^2 - n\bar{X}^2\right) - \bar{X}\left(\sum XY - n\bar{X}\bar{Y}\right)}{\sum(X - \bar{X})^2}.$$

If we now separate out the two halves of the numerator and make two fractions, we get:

$$\bar{Y}\,\frac{\sum X^2 - n\bar{X}^2}{\sum(X - \bar{X})^2} - \bar{X}\,\frac{\sum XY - n\bar{X}\bar{Y}}{\sum(X - \bar{X})^2},$$

which is just $\bar{Y} - b_1\bar{X}$.
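The algebra above can be checked numerically: the matrix solution and the scalar formulas give the same intercept and slope, and the computational formula holds. A small sketch with made-up data:

```python
import numpy as np

# Hypothetical bivariate data (made-up values).
x = np.array([40., 50., 60., 70., 80., 90., 100.])
y = np.array([-6.1, -5.3, -4.4, -3.6, -2.7, -1.9, -1.0])

# Matrix solution: (X'X)^{-1} X'Y with a column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])
b_matrix = np.linalg.solve(X.T @ X, X.T @ y)

# Scalar formulas derived earlier: b1 = cov/var, b0 = Ybar - b1 * Xbar.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(b_matrix, (b0, b1))          # the two solutions agree
# The computational formula also checks out numerically:
print(np.sum(x * y) - len(x) * x.mean() * y.mean(),
      np.sum((x - x.mean()) * (y - y.mean())))
```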
4 Why Do We Need Multiple Regression?
If we have more than 1 independent variable, the matrices become larger, and, as stated above, the off-diagonal elements of the $X^TX$ matrix contain information equivalent to the covariances among the X variables. This information is important, because the only reason we need to perform multiple regression is to control out the effects of other X variables when trying to determine the true effect of one X on Y. For example, suppose we were interested in examining racial differences in health. We might conduct a simple regression of health on race, and we would find a rather large and significant difference between whites and nonwhites. But, suppose we thought that part of the racial difference in health was attributable to income differences between racial groups. Multiple regression analysis would allow us to control out the income differences between racial groups to determine the residual race differences. If, on the other hand, there weren't racial differences in income (i.e., race and income were not correlated), then including income in the model would not have an effect on estimated race differences in health.
In other words, if the X's aren't correlated, then there is no need to perform multiple regression. Let me demonstrate this. For the sake of keeping the equations manageable (so they fit on a page) and so that I can demonstrate another point, let's assume that all the X variables have a mean of 0 (or, alternatively, that we have deviated them from their means so that the new set of X variables each have a mean of 0). Above, we derived a matrix expression for $X^TX$ in simple regression, but we now need the more general form with multiple X variables (once again, I have left the subscripts untransposed for clarity):

$$X^TX = \begin{bmatrix} 1 & 1 & \ldots & 1 \\ X_{11} & X_{21} & \ldots & X_{n1} \\ \vdots & \vdots & \ddots & \vdots \\ X_{1k} & X_{2k} & \ldots & X_{nk} \end{bmatrix}\begin{bmatrix} 1 & X_{11} & \ldots & X_{1k} \\ 1 & X_{21} & \ldots & X_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & \ldots & X_{nk} \end{bmatrix} = \begin{bmatrix} n & \sum X_1 & \sum X_2 & \ldots & \sum X_k \\ \sum X_1 & \sum X_1^2 & 0 & \ldots & 0 \\ \sum X_2 & 0 & \sum X_2^2 & \ddots & \vdots \\ \vdots & \vdots & \ddots & \ddots & 0 \\ \sum X_k & 0 & \ldots & 0 & \sum X_k^2 \end{bmatrix}.$$
.
Thus, the rst row and column of the matrix contain the sums of the variables, the main
diagonal contains the sums of squares of each variable, and all the cross-product positions
are 0. This matrix simplies considerably, if we realize that, if the means of all of the X
variables are 0, then

X must be 0 for each variable:

X
T
X

n 0 . . . 0
0

X
2
1
.
.
.
.
.
.
.
.
.
.
.
.
.
.
. 0
0 . . . 0

X
2
k

The matrix is now a diagonal matrix, and so its inverse is obtained simply by inverting each of the diagonal elements. The $X^TY$ matrix is:

$$X^TY = \begin{bmatrix} 1 & 1 & \ldots & 1 \\ X_{11} & X_{21} & \ldots & X_{n1} \\ \vdots & \vdots & \ddots & \vdots \\ X_{1k} & X_{2k} & \ldots & X_{nk} \end{bmatrix}\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} \sum Y \\ \sum X_1 Y \\ \vdots \\ \sum X_k Y \end{bmatrix}$$
So, the solution vector is:

$$\left(X^TX\right)^{-1}X^TY = \begin{bmatrix} \frac{1}{n} & 0 & \ldots & 0 \\ 0 & \frac{1}{\sum X_1^2} & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \ldots & 0 & \frac{1}{\sum X_k^2} \end{bmatrix}\begin{bmatrix} \sum Y \\ \sum X_1 Y \\ \vdots \\ \sum X_k Y \end{bmatrix} = \begin{bmatrix} \frac{\sum Y}{n} \\[4pt] \frac{\sum X_1 Y}{\sum X_1^2} \\ \vdots \\ \frac{\sum X_k Y}{\sum X_k^2} \end{bmatrix}$$
The last thing we need to consider is what $\sum XY$ and $\sum X^2$ are when the mean of X is 0. Let's take the denominators $\sum X^2$ first. This is the same as $\sum(X - 0)^2$, which, since the means of all the X variables are 0, means the denominator for each of the coefficients is $\sum(X - \bar{X})^2$. Now let's think about the numerator. As it turns out, the numerator for each coefficient can be rewritten as $\sum(X - \bar{X})(Y - \bar{Y})$. Why? Try substituting 0 for $\bar{X}$ and expanding:

$$\sum(X)(Y - \bar{Y}) = \sum XY - \bar{Y}\sum X = \sum XY - \bar{Y}\,n\bar{X} = \sum XY - \bar{Y}\cdot 0 = \sum XY$$
So, each of our coefficients can be viewed as: $\frac{\sum(X - \bar{X})(Y - \bar{Y})}{\sum(X - \bar{X})^2}$. Notice that this is exactly equal to the slope coefficient in the simple regression model. Thus, we have shown that, if the means of the X variables are equal to 0, and the X variables are uncorrelated, then the multiple regression coefficients are identical to what we would obtain if we conducted separate simple regressions for each variable. The results we obtained here also apply if the means of the X variables are not 0, but the matrix expressions become much more complicated.

What about the intercept term? Notice in the solution vector that the intercept term simply turned out to be the mean of Y. This highlights an important point: the intercept is simply an adjusted mean of the outcome variable. More specifically, it represents the mean of the dependent variable when all the X variables are set to 0 (in this case, their means). This interpretation holds when the X variables' means are not 0, as well, just as we discussed previously in interpreting the coefficients in the simple regression model.
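To see these results concretely, the following sketch (my own illustration, with constructed data) builds two mean-centered, exactly uncorrelated regressors and shows that the multiple regression slopes equal the separate simple-regression slopes while the intercept equals the mean of Y.

```python
import numpy as np

# Construct two hypothetical mean-centered, exactly uncorrelated regressors.
x1 = np.array([-2., -1., 0., 1., 2., -2., -1., 0., 1., 2.])
x2 = np.array([-1., 1., -1., 1., 0., 0., -1., 1., -1., 1.])
x1 -= x1.mean()
x2 -= x2.mean()
x2 -= (x1 @ x2) / (x1 @ x1) * x1          # force the cross-product to be exactly 0
rng = np.random.default_rng(1)
y = 3.0 + 1.5 * x1 - 2.0 * x2 + rng.normal(scale=0.2, size=len(x1))

# Multiple regression via the matrix solution.
X = np.column_stack([np.ones_like(x1), x1, x2])
b_multi = np.linalg.solve(X.T @ X, X.T @ y)

# Separate simple-regression slopes and the mean of y.
b1_simple = (x1 @ (y - y.mean())) / (x1 @ x1)
b2_simple = (x2 @ (y - y.mean())) / (x2 @ x2)
print(b_multi)                    # intercept ~ y.mean(); slopes match the simple slopes
print(y.mean(), b1_simple, b2_simple)
```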
(copyright by Scott M. Lynch, March 2003)
Multiple Regression II (Soc 504)
1 Evaluation of Model Fit and Inference
Previously, we discussed the theory underlying the multiple regression model, and we derived the OLS estimator for the model. In that process, we discussed the standard error of the regression, as well as the standard errors for the regression coefficients. We will now develop the t-tests for the parameters, as well as the ANOVA table for the regression and the regression F test.

However, first, we must be sure that our data meet the assumptions of the model. The assumptions of multiple regression are no different from those of simple regression:

1. Linearity

2. Homoscedasticity

3. Independence of errors across observations

4. Normality of the errors.
Typically, assumptions 1-4 are lumped together as: $Y \sim N(X\beta,\ \sigma^2_e I)$, or equivalently as $e \sim N(0,\ \sigma^2_e I)$. If linearity holds, then the expectation of the error is 0 (see preceding formula). If homoscedasticity holds, then the variance is constant across observations (see the identity matrix in the preceding formulas). If independence holds, then the off-diagonal elements of the error variance matrix will be 0 (also refer to the identity matrix in the preceding formulas). Finally, if the errors are normal, then both Y and e will be normally distributed, as the formulas indicate.

If the assumptions are met, then inference can proceed (if not, then we have some problems - we will discuss this soon). This information is not really new - much of what we developed for the simple regression model is still true. The basic ANOVA table is (k is the number of x variables INCLUDING the intercept):
ANOVA TABLE    DF     SS                                                     MS            F         Sig
Regression     k-1    $(X\hat{\beta} - \bar{Y})^T(X\hat{\beta} - \bar{Y})$   SSR/Df(Reg)   MSR/MSE   (from F table)
Residual       n-k    $e^Te$                                                 SSE/Df(E) (called MSE)
Total          n-1    $(Y - \bar{Y})^T(Y - \bar{Y})$

$R^2 = \frac{SSR}{SST}$ or $1 - \frac{SSE}{SST}$, and Multiple Correlation $= \sqrt{R^2}$.
Note that in the sums of squares calculations above, I have simply replaced the scalar notation with matrix notation. Also note that the mean of Y is a column vector ($n\times 1$) in which the elements are all the mean of Y. There is therefore no difference between these calculations and those presented in the simple regression notes - in fact, if you're more comfortable with scalar notation, you can still obtain the sums of squares as before.

The multiple correlation before was simply the bivariate correlation between X and Y. Now, however, there are multiple X variables, so the multiple R has a different interpretation. Specifically, it is the correlation between the observed values of Y and the model-predicted values of Y ($\hat{Y} = X\hat{\beta}$).

The F-Test is now formally an overall (global) test of the model's fit. If at least one variable has a significant coefficient, then the model F should be significant (except in cases in which multicollinearity is a problem - we will discuss this later). There is more use for F in the multiple regression context, as we will discuss shortly.
The t-tests for the parameters themselves are conducted just as before. Recall from the last set of notes that the standard errors (more specifically, the variances) for the coefficients are obtained using: $\sigma^2_e(X^TX)^{-1}$. This is a $k\times k$ variance-covariance matrix for the coefficients. However, we are generally only interested in the diagonal elements of this matrix. The diagonal elements are the covariances of the coefficients with themselves. Hence, their square roots are the standard errors. The off-diagonal elements of this matrix are the covariances of the coefficients with other coefficients and are not important for our t-tests. Since we don't know $\sigma^2_e$, we replace it with the MSE from the ANOVA table.
For example, suppose we have data that look like:
$$X = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \\ 1 & 5 \\ 1 & 6 \end{bmatrix}\qquad Y = \begin{bmatrix} 1 \\ 1 \\ 2 \\ 2 \\ 3 \\ 3 \end{bmatrix},\qquad\text{so } X^TY = \begin{bmatrix} 12 \\ 50 \end{bmatrix}$$

So,

$$X^TX = \begin{bmatrix} 6 & 21 \\ 21 & 91 \end{bmatrix}\qquad\text{and}\qquad\left(X^TX\right)^{-1} = \begin{bmatrix} \frac{91}{105} & -\frac{21}{105} \\[4pt] -\frac{21}{105} & \frac{6}{105} \end{bmatrix}$$

If we compute $(X^TX)^{-1}(X^TY)$, we get the OLS solution:

$$\begin{bmatrix} \frac{42}{105} \\[4pt] \frac{48}{105} \end{bmatrix},$$
which will ultimately yield an MSE of .085714 (if you compute the residuals, square them, sum them, and divide by 4).

If we compute $MSE\,(X^TX)^{-1}$, then we will get:

$$\begin{bmatrix} .0743 & -.0171 \\ -.0171 & .0049 \end{bmatrix},$$

the variance-covariance matrix of the coefficients. If we take the diagonal elements and square root them, we get .273 and .07 as the standard errors of $b_0$ and $b_1$, respectively. The off-diagonal elements tell us how the coefficients are related to each other. We can get the correlation between $b_0$ and $b_1$ by using the formula: $Corr(X,Y) = \frac{Cov(X,Y)}{S(X)S(Y)}$. In this case, the S values are the standard errors we just computed. So, $Corr(b_0, b_1) = \frac{-.0171}{(.273)(.07)} = -.895$.

This indicates a very strong negative relationship between the coefficients, which is typical in a bivariate regression setting. This tells us that if the intercept were large, the slope would be shallow (and vice versa), which makes sense.
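The numbers in this example are easy to reproduce; the sketch below recomputes the OLS solution, the MSE, the standard errors, and the correlation between $b_0$ and $b_1$ for the same small data set.

```python
import numpy as np

# The small worked example from the notes: x = 1..6, y = (1, 1, 2, 2, 3, 3).
X = np.column_stack([np.ones(6), np.arange(1, 7)])
y = np.array([1., 1., 2., 2., 3., 3.])

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                      # OLS solution (42/105, 48/105)

resid = y - X @ b
mse = resid @ resid / (len(y) - 2)         # .085714
vcov = mse * XtX_inv                       # variance-covariance matrix of b0, b1
se = np.sqrt(np.diag(vcov))                # .273 and .07
corr_b0_b1 = vcov[0, 1] / (se[0] * se[1])  # about -.895
print(b, mse, se, corr_b0_b1)
```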
If we wish to make inference about a single parameter, then the t test on the coefficient of interest is all we need to determine whether the parameter is significant. However, occasionally we may be interested in a test of a set of parameters, rather than a single parameter. For example, suppose our theory suggests that 3 variables are important in predicting an outcome, net of a host of control variables. Ideally we should be able to construct a joint test of the importance of all 3 variables. In this case, we can conduct nested F tests. There are numerous approaches to doing this, but here we will discuss two equivalent approaches. One approach uses $R^2$ from two nested models; the other uses the ANOVA table information directly.
The F-test approach is computed as:

$$F = \frac{\dfrac{SSE(R) - SSE(F)}{df(R) - df(F)}}{\dfrac{SSE(F)}{df(F)}}$$

Here, R refers to the reduced model (with fewer variables), and F refers to the full model (with all the variables). The degrees of freedom for the reduced model will be greater than for the full model, because the degrees of freedom for the error are $n-k$, where k is the number of regressors (including the intercept). The numerator df for the test is $df(R) - df(F)$; the denominator df is $df(F)$. Recognize that this test is simply determining the proportional increase in error the reduced model generates relative to the full model.

An equivalent formulation of this test can be constructed using $R^2$:

$$F = \frac{\dfrac{R^2_F - R^2_R}{df(R) - df(F)}}{\dfrac{1 - R^2_F}{df(F)}}$$

Simple algebra shows that these are equivalent tests; however, when there is no intercept in the models, we must use the first calculation.
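A hedged sketch of the nested F test using the SSE formulation (the data are simulated and the helper function is my own, not from the notes):

```python
import numpy as np
from scipy import stats

def ols_sse(X, y):
    """Fit by OLS and return the residual sum of squares and error df."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    return resid @ resid, len(y) - X.shape[1]

# Hypothetical data: y depends on x1; x2 and x3 are candidate additions.
rng = np.random.default_rng(2)
n = 50
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)

X_reduced = np.column_stack([np.ones(n), x1])
X_full = np.column_stack([np.ones(n), x1, x2, x3])

sse_r, df_r = ols_sse(X_reduced, y)
sse_f, df_f = ols_sse(X_full, y)

F = ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)
p = 1 - stats.f.cdf(F, df_r - df_f, df_f)    # joint test that the added coefficients are 0
print(F, p)
```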
2 Interpretation of the Model Parameters
The interpretation of the parameters in this model is ultimately no different than the interpretation of the parameters in the simple linear regression, with one key difference. The parameters are now interpreted as the relationship between X and Y, net of the effects of the other variables. What exactly does this mean? It means simply that the coefficient for X in the model represents the relationship between X and Y, holding all other variables constant. Let's look at this a little more in depth. Suppose we have a data set consisting of 3 variables: $X_1$, $X_2$, and Y, and the data set looks like:
X1   X2   Y
0    1    2
0    2    4
0    3    6
0    4    8
0    5    10
1    6    12
1    7    14
1    8    16
1    9    18
1    10   20

Suppose we are interested in the relationship between $X_1$ and Y. The mean of Y when $X_1 = 0$ is 6, and the mean of Y when $X_1 = 1$ is 16. In fact, when a regression model is conducted, here are the results:
ANOVA TABLE    Df   SS    MS    F    Sig
Regression     1    250   250   25   p = .001
Residual       8    80    10
Total          9    330

Effect       Coefficient   Standard Error   t      p-value
Intercept    6             1.41             4.24   .003
Slope        10            2                5.00   .001

The results indicate there is a strong relationship between $X_1$ and Y. But, now let's conduct a multiple regression that includes $X_2$. Those results are:

ANOVA TABLE    Df   SS    MS    F    Sig
Regression     2    330   165   ∞    p = 0
Residual       7    0     0
Total          9    330

Effect       Coefficient   Standard Error   t    p-value
Intercept    0             0                ∞    0
b_X2         2             0                ∞    0
b_X1         0             0                ∞    0
Notice how the coefficient for $X_1$ has now been reduced to 0. Don't be confused by the fact that the t-test for the coefficient is large: this is simply an artifact of the contrived nature of the data such that the standard error is 0 (and the computer considers $\frac{0}{0}$ to be $\infty$). The coefficient for $X_2$ is 2, and the $R^2$ for the regression (if you compute it) is 1, indicating perfect linear fit without error.

This result shows that, once we have controlled for the effect of $X_2$, there is no relationship between $X_1$ and Y: that $X_2$ accounts for all of the relationship between $X_1$ and Y. Why? Because the effect of $X_1$ we first observed is simply capturing the fact that there are differences in $X_2$ by levels of $X_1$. Let's look at this in the context of some real data.
Race differences in birthweight (specifically black-white differences) have been a hot topic of investigation over the last decade or so. African Americans tend to have lighter babies than whites. Below is a regression that demonstrates this. The data are from the National Center for Health Statistics, which has recorded every live birth in the country for over a decade. The data I use here are a subset of the total births for 1999.

ANOVA TABLE    Df       SS         MS          F        Sig
Regression     1        2.94e+8    2.94e+8     815.15   p = 0
Residual       36,349   1.31e+10   361,605.4
Total          36,350   1.34e+10

Effect       Coefficient   Standard Error   t        p-value
Intercept    3364.1        3.44             977.07   0
Black        -245.1        8.58             -28.55   0
These results indicate that the average white baby weighs 3364.1 grams, while the average African American baby weighs 3364.1 - 245.1 = 3119 grams. The racial difference in these birth weights is significant, as indicated by the t-test. The question is: why is there a racial difference? Since birth weight is an indicator of health status of the child, we may begin by considering that social class may influence access to prenatal care, and prenatal care may increase the probability of a healthy and heavier baby. Thus, if there is a racial difference in social class, this may account for the racial difference in birthweight. So, here are the results for a model that controls on education (a measure of social class):

ANOVA TABLE    Df       SS         MS          F        Sig
Regression     2        3.71e+8    1.85e+8     515.57   p = 0
Residual       36,348   1.31e+10   359,525.5
Total          36,350   1.34e+10

Effect       Coefficient   Standard Error   t        p-value
Intercept    3149.8        15.14            208.06   0
Black        -236.96       8.58             -27.62   0
Education    16.67         1.15             14.54    0
Education, in fact, is positively related to birthweight: more educated mothers produce heavier babies. The coefficient for Black has become smaller, indicating that, indeed, part of the racial difference in birthweight is attributable to racial differences in educational attainment. In other words, at similar levels of education, the birthweight gap between blacks and whites is about 237 grams. The overall mean difference, however, is about 245 grams, if educational differences in these populations are not considered.

Let's include one more variable: gestation length. Obviously, the longer a fetus is in utero, the heavier it will tend to be at birth. In addition to social class differences, there may be gestation-length differences between blacks and whites (perhaps another proxy for prenatal care). When we include gestation length, here are the results:

ANOVA TABLE    Df       SS        MS          F         Sig
Regression     3        4.17e+9   1.4e+9      5443.88   p = 0
Residual       36,347   9.3e+9    255,108.2
Total          36,350   1.34e+10

Effect       Coefficient   Standard Error   t         p-value
Intercept    -1695.4       41.7             -40.64    0
Black        -160.44       7.25             -22.12    0
Education    15.7          .97              16.26     0
Gestation    124.9         1.02             121.98    0
These results indicate that a large portion of the racial difference in birthweight is attributable to racial differences in gestation length: notice how the race coefficient has been reduced to -160.44. Interestingly, the effect of education has also been reduced slightly, indicating that there are some gestational differences by education level. Finally, notice now how the intercept has become large and negative. This is because there are no babies with gestation length 0. In fact, in these data, the minimum gestation time is 17 weeks (mean is just under 40 weeks). Thus, a white baby with a mother with no education who gestated for 17 weeks would be expected to weigh a little more than 400 grams (just under a pound). As we've discussed before, taken out of the context of the variables included in the model, the intercept is generally meaningless.
3 Comparing Coefficients in the Model

In multiple regression, it may be of interest to compare effects of variables in the model. Comparisons are fairly clear cut if one variable reaches significance while another one doesn't, but generally results are not that clear. Two variables may reach significance, but you may be interested in which one is more important. To some extent, perhaps even a large extent, this is generally more of a substantive or situational question than a statistical one. For example, from a policy perspective, if one variable is changeable, while another one isn't, then the more important finding may be that the changeable variable has a significant effect even after controlling for unchangeable ones. In that case, an actual comparison of coefficients may not be warranted. But let's suppose we are testing two competing hypotheses, each of which is measured by its own independent variable.
Let's return to the birth weight example. Suppose one theory indicates that the racial difference in birthweight is a function of prenatal care differences, while another theory indicates that birthweight differences are a function of genetic differences between races. (As a side note, there in fact is a somewhat contentious debate in this literature about racial differences in genetics). Suppose that our measure of prenatal care differences is education (reflecting the effect of social class on ability to obtain prenatal care), and that our measure of genetic differences is gestation length (perhaps the genetic theory explicitly posits that racial differences in genetics lead to fewer weeks of gestation). We would like to compare the relative merits of the two hypotheses. We have already conducted a model that included gestation length and education. First, we should note that, because the racial differences remained even after including these two variables, neither is a sufficient explanation. However, both variables have a significant effect, so how do we compare their effects?

It would be unreasonable to simply examine the relative size of the coefficients (about 15.7 for education and 124.9 for gestation), because these variables are measured in different units. Education is measured in years, while gestation is measured in weeks. Our first thought may be, therefore, to recode one or the other variable so that the units match. So, what are the results if we recode gestation length into years? If we do so, we will find that the effect of education does not change, but the effect of gestation becomes 6,494.9. Nothing else in the model changes (including the t-test, the model F, $R^2$, etc.). But now, the gestation length effect appears huge relative to the education effect. But, to demonstrate why this approach doesn't work, suppose we rerun the model after recoding education into centuries. In that case, the effect of education becomes 1,570.5. Now, you could argue that measuring education in centuries and comparing to gestation length in years does not place the two variables in the same units. However, I would argue that neither of these units is more inappropriate than the other. Practically no (human) fetus gestates for a year, and no one attends school for a century. The problem is thus a little more complicated than simply a difference in units of measurement. Plus, I would add that it is often impossible to make units comparable: for example, how would you compare years of education with salary in dollars?
What we are missing is some measure of the variability in the measures. As a first step, I often compute the effect of variables at the smallest and largest values possible in the measure. For example, gestation length has a minimum and maximum of 17 and 52 weeks, respectively. Thus, the net effect of gestation for a fetus that gestates for 17 weeks is 2,123 grams, while the net effect of gestation for a fetus that gestates for 52 weeks is 6,495 grams. Education, on the other hand, ranges from 0 to 17 years. Thus, the net effect of education at 0 years is 0 grams, while the net effect at 17 years is 267 grams. On its face, then, the effect of gestation length appears to be larger.
However, even though there is a wider range in the net effect of gestation versus education, we have not considered that these extremes may be just that: extreme values that are rare in the sample. A 52-week gestation period is a very rare (if real) event, as is having no education whatsoever. For this reason, we generally standardize our coefficients for comparisons using some function of the standard deviation of the X variables of interest, and sometimes the standard deviation of Y as well. For fully standardized coefficients (often denoted using β as opposed to b), we compute:

β_X = b_X × ( s.d.(X) / s.d.(Y) ).
If we do that for this example, we get standardized effects for education and gestation of .07 and .53, respectively. The interpretation is that, for a one standard deviation increase in education, we expect a .07 standard deviation increase in birth weight; whereas for gestation, we expect a .53 standard deviation increase in birth weight for a one standard deviation increase in gestation. From this perspective, the effect of gestation length indeed seems more important.
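To make the computation concrete, here is a minimal Python sketch of the fully standardized coefficient β_X = b_X × s.d.(X)/s.d.(Y). The data frame, variable names, and simulated values below are hypothetical stand-ins (the actual birth-weight data are not reproduced in these notes), so the printed numbers are illustrative only.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated stand-in for the birth-weight data; coefficients are illustrative only.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "black":     rng.binomial(1, 0.15, n),
    "education": rng.integers(0, 18, n),
    "gestation": rng.normal(39, 2.5, n),
})
df["birthweight"] = (-1700 - 160 * df["black"] + 16 * df["education"]
                     + 125 * df["gestation"] + rng.normal(0, 400, n))

X = sm.add_constant(df[["black", "education", "gestation"]])
fit = sm.OLS(df["birthweight"], X).fit()

# Fully standardized coefficient: beta_X = b_X * s.d.(X) / s.d.(Y)
sd_y = df["birthweight"].std()
for col in ["education", "gestation"]:
    print(col, fit.params[col] * df[col].std() / sd_y)
```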
As a final note on comparing coefficients, we must realize that it is not always useful to compare even the standardized coefficients. The standardized effect of an X variable that is highly nonnormal may not be informative. Furthermore, we cannot meaningfully compare standardized coefficients of dummy variables, because they take only two values (more on this in the next set of notes). Finally, our comparisons cannot necessarily help us determine which hypothesis is more correct: in part, this decision rests on how well the variables operationalize the hypotheses. For example, in this case, I would not conclude that genetics is more important than social class (or prenatal care). It could easily be argued (and perhaps more reasonably) that gestation length better represents prenatal care than genetics!
4 The Maximum Likelihood Solution
So far, we have derived the OLS estimator, and we have discussed the standard error estimator, for multiple regression. Here, I derive the Maximum Likelihood estimator and the standard errors.
Once again, we have a normal likelihood function, but we will now express it in matrix form:
p(Y | β, σ_e, X) ∝ L(β, σ_e | X, Y) = (2πσ²_e)^(−n/2) exp[ −(1/(2σ²_e)) (Y − Xβ)^T (Y − Xβ) ]
Taking the log of this likelihood yields:

LL(β, σ_e) ∝ −n log(σ_e) − (1/(2σ²_e)) (Y − Xβ)^T (Y − Xβ)
We now need to take the partial derivative of the log-likelihood with respect to each parameter. Here, we will consider that we have two parameters: the vector of coefficients β and the variance of the error. The derivative of the log-likelihood with respect to β should look somewhat familiar (much of it was shown in the derivation of the OLS estimator):

∂LL/∂β = −(1/(2σ²_e)) (−2X^T)(Y − Xβ) = (1/σ²_e) X^T (Y − Xβ).
If we set this expression equal to 0 and multiply both sides by σ²_e, we end up with the same result as we did for the OLS estimator. We can also take the partial derivative with respect to σ_e:
∂LL/∂σ_e = −n/σ_e + σ_e^(−3) (Y − Xβ)^T (Y − Xβ)

The solution, after setting the derivative to 0 and performing a little algebra, is:

σ̂²_e = (Y − Xβ)^T (Y − Xβ) / n.
These solutions are the same as those we found in the simple regression problem, only expressed in matrix form.
The next step is to take the second partial derivatives in order to obtain the standard errors. Let's first simplify the first partial derivatives (and exchange σ²_e with τ):

∂LL/∂β = (1/τ) (X^T Y − X^T Xβ)

and

∂LL/∂τ = −n/(2τ) + (1/(2τ²)) e^T e.
The second partial derivative of LL with respect to β is:

∂²LL/∂β² = −(1/τ) (X^T X).
The second partial derivative of LL with respect to τ is:

∂²LL/∂τ² = n/(2τ²) − (1/τ³) e^T e.
The off-diagonal elements of the Hessian matrix are:

∂²LL/∂β∂τ = ∂²LL/∂τ∂β = −(1/τ²) (X^T Y − X^T Xβ).
Thus, the Hessian matrix is:

[ −(1/τ)(X^T X)                 −(1/τ²)(X^T Y − X^T Xβ)     ]
[ −(1/τ²)(X^T Y − X^T Xβ)       n/(2τ²) − (1/τ³) e^T e      ].
We now need to take the negative expectation of this matrix to obtain the information matrix. The expectation of β̂ is β, and so, if we substitute the computation of β̂ (= (X^T X)^(−1) X^T Y), the numerator of the off-diagonal elements is 0.
The expectation of the second partial derivative with respect to β remains unchanged. However, the second partial derivative with respect to τ changes a little. First, the expectation of e^T e is nτ, just as we discussed while deriving the ML estimators for the simple regression model. This gives us:

n/(2τ²) − nτ/τ³.
Simple algebraic manipulation gives us:

−n/(2τ²).
Thus, after taking the negative of these elements, our information matrix is:

[ (1/τ)(X^T X)       0         ]
[ 0                  n/(2τ²)   ].
As we've discussed before, we need to invert this matrix and take the square root of the diagonal elements to obtain the standard errors of the parameters. Also as we've discussed before, the inverse of a (block-)diagonal matrix is simply a matrix with the diagonal blocks inverted. Thus, our variance-covariance matrix is:

[ τ (X^T X)^(−1)     0         ]
[ 0                  2τ²/n     ],

which should look familiar after substituting σ²_e in for τ.
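The matrix results above translate directly into a few lines of linear algebra. The following sketch (simulated data, hypothetical values) computes β̂ = (X^T X)^(−1) X^T Y, the ML error variance (with n rather than n − k in the denominator), and the variance-covariance matrix whose square-rooted diagonal gives the standard errors.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # design matrix with intercept
beta_true = np.array([3.0, 1.0, -0.5])
y = X @ beta_true + rng.normal(scale=2.0, size=n)

# ML (= OLS) point estimates: beta_hat = (X'X)^{-1} X'y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# ML error variance uses n (not n - k) in the denominator
e = y - X @ beta_hat
sigma2_ml = (e @ e) / n

# Variance-covariance matrix of beta_hat: sigma2 * (X'X)^{-1}
vcov = sigma2_ml * XtX_inv
se = np.sqrt(np.diag(vcov))
print(beta_hat, sigma2_ml, se)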
(copyright by Scott M. Lynch, March 2003)
Expanding the Model Capabilities: Dummy Variables,
Interactions, and Nonlinear Transformations (Soc 504)
Until now, although we have considered including multiple independent variables in our regression models, we have only considered continuous regressors and linear effects of them on the outcome variable. However, the linear regression model is quite flexible, and it is fairly straightforward to include qualitative variables, like race, gender, type of occupation, etc. Furthermore, it is quite easy to model nonlinear relationships between independent and dependent variables.
1 Dummy Variables
When we have dichotomous (binary) regressors, we construct something called indicator, or dummy, variables. For example, for gender, we could construct a binary variable called male and code it 1 if the person is male and 0 if the person is female. Suppose gender were the only variable in the model:

Ŷ_i = b_0 + b_1 Male_i.

The expected value of Y for a male, then, would be b_0 + b_1, and the expected value of Y for a female would be just b_0. Why? Because 0 × b_1 = 0, and 1 × b_1 is just b_1. Interestingly, the t-test on the b_1 parameter in this case would be identical to the t-test you could perform on the difference in the mean of Y for males and females.
When we have a qualitative variable with more than two categories (e.g., race coded as white, black, other), we can construct more than one dummy variable to represent the original variable. In general, the rule is that you construct k − 1 dummy variables for a qualitative variable with k categories. Why not construct k dummies? Let's assume we have a model in which race is our only predictor. If we construct a dummy for black and a dummy for other, then we have the following model:

Ŷ_i = b_0 + b_1 Black_i + b_2 Other_i.

When Black = 1, we get:

Ŷ_i = b_0 + b_1.

When Other = 1, we get:

Ŷ_i = b_0 + b_2.

But when we come across an individual who is white, both dummy coefficients drop from the model, and we're left with:

Ŷ_i = b_0.
If we also included a dummy variable for white, we could not separate the effect of that dummy variable from the intercept (more specifically, the white dummy would be perfectly collinear with the intercept and the other two dummies). We call the omitted dummy variable the reference category, because the other dummy coefficients are interpreted in terms of the expected mean change relative to it. (Note: as an alternative to having a reference group, we could simply drop the intercept from the model.)
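A minimal sketch of k − 1 dummy coding in Python/pandas (hypothetical data); it mirrors the black/other coding used here, with white as the reference category.

```python
import pandas as pd

# Hypothetical data: a three-category race variable.  With k = 3 categories we
# build k - 1 = 2 dummies and treat "white" as the reference category, as in the notes.
df = pd.DataFrame({"race": ["white", "black", "other", "white", "black", "other"]})
df["black"] = (df["race"] == "black").astype(int)
df["other"] = (df["race"] == "other").astype(int)
print(df)

# pd.get_dummies(df["race"], drop_first=True) does the same thing, except that it
# drops the first category alphabetically ("black"), making that the reference.
```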
When we have only dummy variables in a model representing a single qualitative variable, the model is equivalent to a one-way ANOVA. Recall from your previous coursework that ANOVA is used to detect differences in means across two or more groups. When there are only two groups, a simple t-test is appropriate for detecting differences in means (indeed, an ANOVA model, and a simple regression as discussed above, would yield an F value that would simply be t²). However, when there are more than two groups, the t-test becomes inefficient. We could, for example, conduct a t-test on all the possible two-group comparisons, but this would be tedious, because the number of these tests is C(k, 2) = k! / ((k − 2)! 2!), where k is the number of categories/groups. Thus, in those cases, we typically conduct an ANOVA, in which all group means are simultaneously compared. If the F statistic from the ANOVA is significant, then we can conclude that at least one group mean differs from the others.
The standard ANOVA model with J groups is constructed using the following computations:

Grand Mean: X̄ = Σ_ij X_ij / n
Sum of Squares Total (SST) = Σ_ij (X_ij − X̄)²
Sum of Squares Between (SSB) = Σ_j n_j (X̄_j − X̄)²
Sum of Squares Within (SSW) = Σ_ij (X_ij − X̄_j)²
The ANOVA table is then constructed as:

            SS      df       MS         F
Between     SSB     J − 1    SSB/df     MSB/MSW
Within      SSW     N − J    SSW/df
Total       SST     N − 1
This should look familiar: it is the same table format used in regression. Indeed, as I've said, the results will be equivalent as well. The TSS calculation is identical to that from regression. The degrees of freedom are also equivalent. Furthermore, if we realize that in a model with dummy variables only, Ŷ is simply the group mean, then the SSW calculation is identical to the Σ(Y − Ŷ)² calculation we use in regression. Thus, ANOVA and regression are equivalent models, and when dummy variables are used alongside continuous covariates, the regression model is sometimes called ANCOVA (analysis of covariance).
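As a quick check on the equivalence claimed above, the following sketch (simulated data, hypothetical group means) runs a one-way ANOVA and a regression on k − 1 dummies and prints the two F statistics, which match.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(2)
groups = np.repeat(["a", "b", "c"], 50)
y = np.concatenate([rng.normal(10, 2, 50), rng.normal(11, 2, 50), rng.normal(13, 2, 50)])

# One-way ANOVA F test
f_anova, p_anova = stats.f_oneway(y[groups == "a"], y[groups == "b"], y[groups == "c"])

# Regression on k - 1 dummies gives the same overall F
dummies = pd.get_dummies(pd.Series(groups), drop_first=True).astype(float)
fit = sm.OLS(y, sm.add_constant(dummies)).fit()
print(f_anova, fit.fvalue)   # identical up to rounding
```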
A limitation of the ANOVA approach is that the F-test only tells us whether at least one group mean differs from the others; it doesn't pinpoint which group mean is the culprit. In fact, if you use ANOVA and want to determine which group mean differs from the others, additional tests must be constructed. That is the important feature of regression with dummies: the t-tests on the individual dummy coefficients provide us with this information. So, why do psychologists (and clinicians, often) use ANOVA? A short answer is that there is a cultural inertia at work. Psychologists (and clinicians) historically have dealt with distinct groups in clinical trials/experimental settings in which the ANOVA model is perfectly appropriate, and regression theoretically unnecessary. Sociologists, on the other hand, generally don't use such data, and hence sociologists have gravitated toward regression analyses.
So far, we have discussed the inclusion of dummy variables in a relatively simple model. When we also have continuous covariates in the model, interpretation of the dummy variable effect is enhanced with graphics. A dummy coefficient in such a model simply tells us how much the regression line jumps for one group versus another. For example, suppose we were interested in how education relates to depressive symptoms, net of gender. Then, we might construct a model like:

Ŷ = b_0 + b_1 Education + b_2 Male.

I have estimated this model using data from the National Survey of Families and Households (NSFH), with the following results: b_0 = 25.64, b_1 = −.74, and b_2 = −3.27. These results suggest that education reduces symptoms (or, more correctly, that individuals with greater education in the sample have fewer symptoms, at a rate of .74 depressive symptoms per year of education) and that men have 3.27 fewer symptoms on average. Graphically, this implies:
[Figure omitted: depressive symptoms (y-axis, 0-30) plotted against years of education (x-axis, 0-20), with separate, parallel lines for males and females.]
Figure 1. Depressive Symptoms by Years of Education and Gender
If we had a qualitative variable in the model that had more than two categories (hence
more than one dummy variable), we would have more than two lines, but all would be
parallel.
2 Statistical Interactions
Notice that the lines in the above example are parallel: it is assumed in this model that education has the same effect on depressive symptoms for men and for women. This is often an unrealistic assumption, and often our theories provide us with reasons why we may expect different slopes for different groups. In these cases, we need to come up with some way of capturing this difference in slopes.
2.1 Interactions Between Dummy and Continuous Variables
The simplest interactions involve the interaction between a dummy variable and a continuous variable. In the education, gender, and depressive symptoms example, we may expect that education's effect varies across gender, and we may wish to model this differential effect. A model which does so might look like:

Ŷ = b_0 + b_1 Education + b_2 Male + b_3 (Education × Male).

This model differs from the previous one because it includes a statistical interaction term, Education × Male. This additional variable is easy to construct: it is simply the product of each individual's education value and their gender dummy. For women (Male = 0), this interaction term is 0, but for men, the interaction term is equal to their education level (Education × 1).
This yields the following equations for men and women:

Ŷ_men = (b_0 + b_2) + (b_1 + b_3) Education
Ŷ_women = b_0 + b_1 Education

In the equation for men, I have consolidated the effects of education into one coefficient, so that the difference in education slopes for men and women is apparent. We still have an expected mean difference for men and women (b_2), but we now also allow the effect of education to vary by gender, unless the interaction effect is 0. I have estimated this model, with the following results: b_0 = 26.29, b_1 = −.794, b_2 = −4.79, b_3 = .118 (n.s.):
[Figure omitted: depressive symptoms (y-axis, 0-30) by years of education (x-axis, 0-20), with lines for males and females that converge slightly as education increases.]
Figure 2. Depressive Symptoms by Years of Education and Gender (with Interaction)
Notice that in this plot, the lines for women and men appear to converge slightly as education increases. This implies that women gain more, in terms of reduction of depressive symptoms, from education than men do.
In this example, the interaction effect is not significant, and I didn't expect it to be. There is no theory that suggests an interaction between gender and education.
In the following example, I examine racial differences in depressive symptoms. The results for a model with race (white) only are: b_0 = 17.14, b_white = −2.92. In a second model, I examine the racial difference, net of education. The results for that model are: b_0 = 25.62, b_educ = −.73, b_white = −1.85:
[Figure omitted: depressive symptoms (y-axis, 0-30) by years of education (x-axis, 0-20), with parallel lines for whites and nonwhites.]
Figure 3. Depressive Symptoms by Years of Education and Race
The change in the coefficient for white between the two models indicates that education accounts for part of the racial difference in symptoms. More specifically, it suggests that the large difference observed in the first model is in part attributable to compositional differences in educational attainment between the white and nonwhite populations. The second model suggests that whites have lower levels of depressive symptoms than nonwhites, but that education reduces symptoms for both groups.
Theory might suggest that nonwhites (blacks specifically) get fewer returns from education, though, so in the next model I include a white × education interaction term. Those results (b_0 = 21.64, b_educ = −.39, b_white = 4.34, b_edw = −.51) yield a plot that looks like:
[Figure omitted: depressive symptoms (y-axis, 0-30) by years of education (x-axis, 0-20), with lines for whites and nonwhites that cross at low levels of education.]
Figure 4. Depressive Symptoms by Years of Education and Race (with Interaction)
Here, it is apparent that whites actually have greater depressive symptoms than nonwhites at very low levels of education, perhaps because nonwhites are more adaptive to adverse economic conditions than whites are. However, as education increases, symptoms decrease much faster for whites than for nonwhites. In fact, we could determine the precise level of education at which the lines cross, by setting the nonwhite and white results equal to each other and solving for Education:

White Symptoms = Nonwhite Symptoms
21.64 − .39E + 4.34 − .51E = 21.64 − .39E
E = 8.51
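The crossing point can be computed directly from the reported coefficients; a small sketch using the estimates above:

```python
# Setting the white and nonwhite predicted-symptom equations equal and solving
# for education gives E = -b_white / b_edw, since the b0 and b_educ terms cancel.
b_white, b_edw = 4.34, -0.51          # estimates reported in the notes
E_cross = -b_white / b_edw
print(round(E_cross, 2))              # about 8.51 years of education
```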
We may choose to get more elaborate. So far, I have alluded to the fact that these racial results may be attributable to differences between blacks and whites; we may not find the same results if we further subdivided our race variable, but this is an empirical question. Thus, I estimated an additional set of models in which I disaggregated the nonwhite category into black and other categories, and I constructed interaction terms between education and the white and other dummy variables. I obtained the following results:
Variable     Model 1     Model 2     Model 3
Intercept    17.4***     26.2***     22.8***
White        -3.18***    -2.32***     3.22#
Other         -.84       -1.53*      -2.22
Educ                      -.74***     -.45***
EW                                    -.45**
EO                                     .09
What do these results suggest? It appears, based on Model 1, that whites have significantly fewer symptoms than blacks, but that others don't differ much from blacks. The second model indicates that educational attainment is important, and, interestingly, the other coefficient is now significant. Here we observe a suppressor relationship: educational differences between blacks and others mask the differences in symptoms between these races. This result implies that other races have lower educational attainment than blacks. Model 3 confirms our earlier result that education provides greater returns to whites. It also dispels the hypothesis that this interaction effect is unique to blacks (blacks and others appear to have similar patterns, given the nonsignificance of the coefficients for others). It is important to note that the other category is composed (in this study) primarily of Hispanics (who do, in fact, have lower educational attainment than blacks). If we were to further subdivide the race variable, we might find that heterogeneity within the other category is masking important racial differences.
2.2 Interactions Between Continuous Variables
So far, we have been able to represent regression models with a single variable (continuous or dummy), models with a single continuous variable and a set of dummies, and models with interactions between a single continuous variable and a dummy variable, using two-dimensional graphs. However, we will now begin considering models which can be represented graphically only in three dimensions. Just as a continuous variable may have an effect that varies across levels of a dummy variable, we may have continuous variables whose effects vary across levels of another continuous variable. These interactions become somewhat more difficult to perceive visually.
Below, I have shown a two-variable model in which depressive symptoms were regressed on functional limitations (ADLs) and years of education. The regression function is a plane, rather than a line, as indicated in the figure.
[Figure omitted: three-dimensional surface of predicted depressive symptoms by years of education (0-20) and ADLs (0-6); the surface is a flat, tilted plane.]
Figure 5. Depressive Symptoms by Years of Education and ADLs
Notice that the plane is tilted in both the ADL and Education dimensions. This tells us that as we move down in education or up in ADLs, we get increases in depressive symptoms. Obviously, the lowest depressive symptoms occur for highly educated, unlimited individuals, and the greatest symptoms occur for low-educated, highly limited individuals. Suppose that we don't believe this pattern is linear in both dimensions, but rather that there may be some synergy between education and ADLs. For example, we might expect the combination of physical limitation and low education to be far worse, in terms of producing depressive symptoms, than this additive model suggests. In that case, we may include an interaction term between education and ADL limitations.
[Figure omitted: three-dimensional surfaces of predicted depressive symptoms by years of education (0-20) and ADLs (0-6), shown from several angles; with the interaction term the surface is twisted rather than flat.]
Figure 6. Depressive Symptoms by Years of Education and ADLs (with Interaction)
Now we are no longer dealing with a flat plane, but rather a twisted surface.
If you don't want to plot the model-predicted data, you can always algebraically factor the model to determine what the total effect of one variable is, but you may need to do this in both dimensions. For example, this model is:

Ŷ = b_0 + b_1 Education + b_2 ADLs + b_3 (Education × ADLs)

Thus, the total effect of education is:

Education Effect = b_1 + b_3 ADLs

Similarly, the total effect of ADLs is:

ADL Effect = b_2 + b_3 Education
If you want, you can plot these two-dimensional effects, but it is often useful to just plug in the numbers and observe what happens. I find the following table useful for making interpretations:

X×Z Interaction    Effect of X (+)             Effect of X (−)
(+)                Effect of X grows           Effect of X shrinks
                   across levels of Z          across levels of Z
(−)                Effect of X shrinks         Effect of X grows
                   across levels of Z          across levels of Z
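A small sketch of how one might "plug in the numbers"; the coefficient values below are placeholders chosen only to match the signs described in the text (b1 < 0, b2 > 0, b3 < 0), not the actual estimates from this model.

```python
# Total (conditional) effects implied by the interaction model
#   Yhat = b0 + b1*Education + b2*ADLs + b3*(Education*ADLs).
b1, b2, b3 = -0.75, 2.0, -0.15   # placeholder values, signs only

for adls in (0, 3, 6):
    print(f"effect of education at {adls} ADLs: {b1 + b3 * adls:.2f}")
for educ in (0, 10, 20):
    print(f"effect of ADLs at {educ} years of education: {b2 + b3 * educ:.2f}")
```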
In our example, the individual effect of education is negative, the effect of ADLs is positive, and the interaction effect is negative. This implies that the total (negative) effect of education grows across levels of disability, and that the total effect of disability shrinks across levels of education. We can discern this from the graph above: it appears that at very high levels of education, adding ADL limitations increases depressive symptoms more slowly (in fact, it actually reduces symptoms) than it does at low levels of education. In the other dimension, it appears that adding years of education has a relatively large effect on reducing depressive symptoms among persons with 6 ADLs compared to the effect among persons with 0 ADLs. Substantively, these results suggest that physical limitation is particularly troubling for lower-SES individuals, and that it is less troubling for persons with high SES. Certainly, we could attempt to pin down an explanation by including additional variables, and here we enter the realm of some complex interpretation. Without interactions, if one variable (Z), when entered into a model, reduces the effect of another (X), we may argue that Z partially explains the effect of X. We can make similar interpretations with interactions. If, for example, we entered some measure of sense of control into the above model, and the interaction effect disappeared or reversed, we would have some evidence that differential feelings of control explain why functional limitations have less effect on depressive symptoms for higher-SES individuals. Sometimes, however, we may need an interaction to explain an interaction.
2.3 Limitations of Interactions
In theory, there are no limits to the number of interaction terms that can be put into a model, and there are no limits on the degree of interaction. For example, you can construct three-way or even higher-order interactions. But there is a tradeoff in terms of parsimony and interpretability. Multi-way interactions become very difficult to interpret, and they also require complex theories to justify their inclusion in a model. As an example from my own research on education and health, it is clear that there is a life-course pattern in the education-health relationship. That is, education appears to become more important across age. Hence, education interacts with age. It is also clear that education is becoming more important in differentiating health across birth cohorts: education has a stronger relationship to health for more recent cohorts than for older birth cohorts, perhaps due to improvement in the quality of education (content-wise), or perhaps due to the greater role education plays today in getting a high-paying job. Taken together, these arguments imply a three-way interaction between age, birth cohort, and education. However, a simple three-way interaction would be very difficult to interpret.
A second difficulty with interactions occurs when all the variables that are thought to interact are dummy variables. For example, suppose, from the previous example, age were indicated by a dummy variable representing being 50 years old or older (versus younger than 50), education were indicated by a dummy representing having a high school diploma (versus not), and cohort were indicated by a dummy representing having a birth date pre-WWII (versus post-WWII). A simple three-way interaction between these variables would ONLY take a value of 1 for persons who were a 1 on all three of these indicators. Yet this may not be the contrast of interest.
There are no simple rules to guide you through the process of selecting reasonable interactions; it just takes practice and thoughtful consideration of the hypotheses you seek to address.
3 Nonlinear Transformations
We often find that the standard linear model is inappropriate for a given set of data, either because the dependent variable is not normally distributed, or because the relationship between the independent and dependent variables is nonlinear. In these cases, we may make suitable transformations of the independent and/or dependent variables to coax the dependent variable to normality or to produce a linear relationship between X and Y. Here, we will first discuss the case in which the dependent variable is not normally distributed.
3.1 Transformations of Dependent Variables
A dependent variable may not be normally distributed, either because it simply doesn't have a normal bell shape, or because its values are bounded, creating a skew. In other cases, it is possible that the variable has a nice symmetric shape, but that the interval on which the values of the variable exist is narrow enough to produce unreasonable predicted values (e.g., the dependent variable is a proportion). In terms of inference for parameters, while a nonnormal dependent variable doesn't guarantee a nonnormal distribution of errors, it generally will produce one. As stated previously, this invalidates the likelihood function, and it also invalidates the derivation of the sampling distributions that provide us with the standard errors for testing. Thus, we may consider a transformation to normalize the dependent variable.
There are four common transformations that are used for dependent variables: the logarithmic, exponential, power, and logistic transformations:

Transformation    New D.V.                   New Model
Logarithmic       Z = ln(Y)                  Z = Xb
Exponential       Z = exp(Y)                 Z = Xb
Power             Z = Y^p, p ≠ 0, 1          Z = Xb
Logistic          Z = ln(Y / (1 − Y))        Z = Xb
Notice that in all cases, the new model is the transformed dependent variable regressed on the regressors. This highlights that there will be a change in the interpretation of the effects of X on Y. Specifically, you can either interpret the effects of X on Z, or you can invert the transformation after running the model and computing predicted scores in Z units.
Each of these transformations has a unique effect on the distribution of Y. The logarithmic transformation, as well as power transformations in which 0 < p < 1, reduces right skew. Why? Because log_b(X) is a function whose return value is the power to which b must be raised to equal X. For example, log_10(100) = 2. Below is the pattern for log base 10:
X         Log X
1         0
10        1
100       2
1000      3
10000     4
You can see that, although X is increasing by powers of 10, log X is increasing linearly. We tend to use the natural log function, rather than the base-10 function, because it has nice properties. The base of the natural logs is e ≈ 2.718. This transformation will tend to pull large values of Y back toward the mean, reducing right skewness. By the same token, power transformations with 0 < p < 1 will have a similar effect, only such transformations generally aren't as strong as the log transformation.
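A minimal sketch of this tail-pulling effect, using a simulated right-skewed variable (not the depressive-symptoms data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
y = rng.lognormal(mean=1.0, sigma=0.9, size=5000)   # right-skewed, bounded below by 0

print(stats.skew(y))            # strongly positive
print(stats.skew(np.log(y)))    # roughly 0 after the log transformation
print(stats.skew(np.sqrt(y)))   # a power transformation (p = 1/2): milder reduction
```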
Recognize that changing the distribution of Y will also have an effect on the relationship between X and Y. In the listing above of the logarithms of the powers of 10, the transformation changes the structure of the variable, so that an X variable which was linearly related to the powers of 10 will now be nonlinearly related to the log of these numbers.
The log and power transformations are used very often, because we encounter right-skewness problems frequently. Any variable that is bounded on the left by 0 will tend to have a right-skewed distribution, especially if the mean of the variable is close to 0 (e.g., income, depressive symptoms, small attitude scales, count variables, etc.).
When distributions have left skewness, we may use power transformations in which p > 1, or we may use the exponential transformation. By the same token that roots and logs of variables pull in the right tail of a distribution, the inverse functions of roots and logs will expand the right tail of a distribution, correcting for a left skew.
What if p < 0 in a power transformation? This does two things. First, it inverts the distribution: what were large values of Y will be small values after the transformation (and vice versa). Then, it accomplishes the same tail-pulling/expanding that power transformations with positive p do. We don't often see these types of transformations, unless we are dealing with a variable in which interest really lies in its inverse. For example, we may take rates and invert them to get time-until-an-event. The main reason that we don't see this type of transformation frequently is that it doesn't affect skewness any more than a positive power transformation would.
We need to be careful when applying the log and power transformations. We can't take even roots when a variable takes negative values. Similarly, we can't take the logarithm of 0. When we are faced with these problems, we may simply add a constant to the variable before transforming it. For example, with depressive symptoms, we may wish to add 1 to the variable before taking the log to ensure that all of our logarithms exist.
The final transformation we will discuss is the logistic transformation. The logistic transformation is useful when the outcome variable of interest is a proportion, a number bounded on the [0, 1] interval. The distribution of such a variable might very well be symmetric, but the bounding poses a problem, because a simple linear model will very likely predict values less than 0 and greater than 1. In such a case, we may take:

Z = log( Y / (1 − Y) ).

This transformation yields a boundless variable, because Z → −∞ as Y approaches 0, and Z → +∞ as Y approaches 1. Of course, because of a division-by-0 problem, and because the log of 0 is undefined, something must be done to values of 0 or 1.
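A sketch of the logistic transformation with one possible adjustment for proportions of exactly 0 or 1; the epsilon value is an arbitrary choice, not something prescribed in the notes.

```python
import numpy as np

def logit(p, eps=1e-6):
    """Logistic (logit) transformation with a small adjustment so that
    proportions of exactly 0 or 1 do not produce undefined values."""
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

props = np.array([0.0, 0.1, 0.5, 0.9, 1.0])
print(logit(props))   # boundless scale suitable as a dependent variable
```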
3.2 Transformations of Independent Variables
Whereas we may be interested in transforming a dependent variable so that the distribution of the variable becomes normal, there is no such requirement for independent variables; they need not be normally distributed. However, we often perform transformations on independent variables in order to linearize the relationship between X and Y. Before we discuss methods for linearizing a relationship between X and Y, we should note that what makes the linear model linear is that the effects of the parameters are additive. Stated another way, the model is linear in the parameters. We may make all the transformations of the variables themselves that we like, and the linear model is still appropriate so long as the model parameters can be expressed in an additive fashion. For example:
Y = exp(b_0 + b_1 X)

is a linear model, because it can be re-expressed as:

ln(Y) = b_0 + b_1 X.

However, the model Y = b_0^(b_1 X) is not a linear model, because it cannot be re-expressed in an additive fashion.
The same transformations that we used for altering the distribution of the dependent variable can also be used to capture nonlinearity in a regression function. The most common such transformations are power transformations, specifically squaring and cubing the independent variable. When we conduct such transformations on independent variables, however, we generally include both the original variable and the transformed variables in the model, just as when we included interaction terms, we included the original main effects. (Note: this is a matter of convention, however, and not a mathematical requirement!) Thus, we often see models like:

Ŷ = b_0 + b_1 X + b_2 X² + b_3 X³ + . . .

Such models are called polynomial regression models, because they include polynomial expressions of the independent variables. These models are good for fitting regression models
when there is relatively mild curvature in the relationship between X and Y. For example, it is well known that depressive symptoms tend to decline across young adulthood, bottoming out after retirement, before rising again in late life before death. This suggests a quadratic relationship between age and depressive symptoms rather than a linear one. Thus, I estimated a regression model of depressive symptoms on age, and then a second model with an age-squared term included:

Variable     Model 1     Model 2
Intercept    18.47***    23.34***
Age          -.085***    -.322***
Age²                      .0025***
The first model indicates that age has a negative effect: every year of age contributes to a .085-symptom reduction. The second model reveals that age reduces symptoms, but also that each year of age contributes a smaller and smaller reduction. Ultimately, when age is large enough, age will actually begin to lead to increases in symptoms. We can see this by taking the first derivative of the regression function with respect to age, setting it equal to 0, and solving to find the age at which age begins to raise symptoms:

Ŷ = b_0 + b_1 Age + b_2 Age².
The derivative is:

∂Ŷ/∂Age = b_1 + 2 b_2 Age.

Setting this to 0 and solving for the age at which the effect reaches its minimum gives:

Age_min = −b_1 / (2 b_2).

In this case, if we substitute our regression estimates in, we find the age is 64.4. The plot below confirms this result.
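Before turning to the plot, note that the turning point is a one-line calculation from the Model 2 estimates:

```python
# Age at which the quadratic age effect on depressive symptoms turns around:
# dYhat/dAge = b1 + 2*b2*Age = 0  =>  Age_min = -b1 / (2*b2).
b1, b2 = -0.322, 0.0025      # Model 2 estimates from the table above
age_min = -b1 / (2 * b2)
print(round(age_min, 1))     # 64.4
```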
[Figure omitted: the net (quadratic) effect of age on depressive symptoms plotted against age (20-100); the curve reaches its minimum in the mid-60s and turns upward afterward.]
Figure 7. Net Effect of Age on Depressive Symptoms: Second Degree (Quadratic) Polynomial Regression
In the plot, it is clear that the effect of age reaches a minimum in the mid-60s and then becomes positive afterward.
We can use polynomial regression functions to capture more than a single point of inflection. The rule is that we can capture k − 1 points of inflection with a polynomial regression of degree k. As an example, I re-estimated the above regression model with an age-cubed term and got the following results:
[Figure omitted: the net (cubic) effect of age on depressive symptoms plotted against age (20-120); the curve bottoms out just before retirement age and turns down again past age 100.]
Figure 8. Net Effect of Age on Depressive Symptoms: Third Degree Polynomial Regression
In this case, it looks like the effect of age bottoms out just before retirement age, and the effect of age is positive but relatively trivial beyond retirement. Additionally, it appears that if one lives past 100 years of age, age again contributes to reducing symptoms (a second point of inflection). Of course, the age-cubed term was nonsignificant and really shouldn't be interpreted.
I should note that as you add polynomial terms, you can ultimately almost perfectly fit the data (and you can perfectly fit it if there is only one person at each value of X). However, the tradeoff of a perfect fit is a loss of parsimony.
3.3 Dummy Variables and Nonlinearity
Sometimes, nonlinearity is severe enough or odd-patterned enough that we can't easily capture it with a simple transformation of either (or both) the independent or dependent variables. In these cases, we may consider a spline function. Splines can be quite complicated, but for our purposes, we can accomplish a spline using dummy variables and polynomial expressions. In this case, we sometimes call the model a piecewise regression. Imagine that the regression function is linear for values of X less than C and also linear for values above C:

[Figure omitted: scatterplot of Y against X (0-20) showing two linear segments with different slopes that join around X = 10.]
Figure 9. Data That are Linear in Two Segments
In this figure, the relationship between X and Y is clearly linear, but in sections. Specifically, the function is:

Y = 5 + 2X              if X < 11
Y = 27 + 6(X − 11)      if X > 10

If we ran separate regression models for X < 11 and X > 10 (e.g., constructed two samples), we would find a perfectly linear fit for both pieces. However, each section would have a different slope and intercept. One solution is to estimate a second-degree polynomial regression model (Ŷ = b_0 + b_1 X + b_2 X²):
[Figure omitted: the same two-segment data with a fitted quadratic curve overlaid.]
Figure 10. Quadratic Fit to Two-Segment Data
This model offers a fairly decent fit in this case, but it tends to underpredict and overpredict systematically, which produces serial correlation of the errors (violating the independence assumption). Plus, these data could be predicted perfectly with another model.
We could construct a dummy variable indicating whether X is greater than 10, and include this along with X in the model:

Ŷ = b_0 + b_1 X + b_2 I(X > 10).

This model says Ŷ = b_0 + b_1 X when X < 11, and Ŷ = (b_0 + b_2) + b_1 X when X > 10. This model doesn't work, because only the intercept has changed; the slope is still the same across the pieces of X. But we could include an interaction term between the dummy variable and the X variable itself:

Ŷ = b_0 + b_1 X + b_2 I(X > 10) + b_3 (I(X > 10) × X).

This model says Ŷ = b_0 + b_1 X when X < 11, and Ŷ = (b_0 + b_2) + (b_1 + b_3) X when X > 10. We have now perfectly fit the data, because we have allowed both the intercepts and slopes to vary across the pieces of the data:
[Figure omitted: observed and predicted scores for the two-segment data; the piecewise fit passes through every point.]
Figure 11. Spline (with Single Fixed Knot and Degree=1) or Piecewise Regression Model: Observed and Predicted Scores
This model can be called either a piecewise regression or a spline regression. Notice that the title of the figure includes degree 1 and a single knot. The knots are the junctions in the function; in this case, two regression functions are joined in one place. The degree of the spline is 1, because the function within each piece is a polynomial of degree 1. We can, of course, include quadratic functions in each piece, and as we include more and more pieces, the function becomes more and more complex because of the multiple interaction terms that must be included. Technically, a spline function incorporates the location of the knot as a parameter to be estimated (and possibly the number of knots as well).
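A minimal sketch of this piecewise (single fixed knot, degree-1) regression, using the two-segment function from Figure 9 and a known knot at X = 10:

```python
import numpy as np

# Piecewise regression: Yhat = b0 + b1*X + b2*I(X > 10) + b3*(I(X > 10) * X)
x = np.arange(0, 21, dtype=float)
y = np.where(x < 11, 5 + 2 * x, 27 + 6 * (x - 11))   # the two-segment function in Figure 9

d = (x > 10).astype(float)                 # dummy for the upper segment
X = np.column_stack([np.ones_like(x), x, d, d * x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(b, 2))                      # fit is exact for these data
print(np.allclose(X @ b, y))               # True: residuals are all zero
```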
3.4 Final Note
For those who are interested in constructing a cognitive map of statistical methods, here are a few tidbits. First, applying the log transformation to a dependent variable leaves us with one form of a poisson regression model, which you would discuss in more depth in a class on generalized linear models (GLM). Also, the logistic transformation is used in a model called logistic regression, which you would also discuss in a GLM class. However, the logistic transformation we are discussing, while the same as the one used in logistic regression, is not applied to the same type of data. In logistic regression, the outcome variable is dichotomous (or polytomous). Finally, the piecewise/spline regression discussed above is approximately a special case of a neural network model.
(copyright by Scott M. Lynch, March 2003)
1 Outliers and Influential Cases (Soc 504)
Throughout the semester we have assumed that all observations are clustered around the means of x and y, and that variation in y is attributable to variation in x. Occasionally, we find that some cases don't fit the general pattern we observe in the data, but rather appear strange relative to the majority of cases. We call these cases outliers. We can think of outliers in (at least) two dimensions: outliers on x and outliers on y.
[Figure omitted: scatterplot of y against x (x roughly −2 to 3) with a tight linear cloud and no obvious outliers.]
Figure 1. Y regressed on X, no outliers.
In the above plot, there are few if any outliers present, in either the X or Y dimension. The regression coefficients for this model are: b_0 = 2.88, b_1 = 1.08. In the following plot, there is one Y outlier. The regression coefficients for that model are: b_0 = 2.93, b_1 = 1.08.
[Figure omitted: the same scatterplot with a single observation far above the cloud, marked "Y outlier"; the fitted line is essentially unchanged.]
Figure 2. Y regressed on X, One Outlier on Y.
Notice that the regression line is only slightly affected by this outlier. The same would be true if we had an observation that was an outlier only in the X dimension. However, if an observation is an outlier in both the X and Y dimensions, we have a problem, as indicated by the next figure.
[Figure omitted: the same scatterplot with one observation far to the right and above the cloud; the line fitted with the outlier included is pulled away from the line fitted with the outlier omitted.]
Figure 3. Y regressed on X, One Outlier on X and Y.
In this figure, it is apparent that the outlier at the far right edge of the figure is pulling the regression line away from where it should be. Outliers, therefore, have the potential to cause serious problems in regression analyses.
1.1 Detecting Outliers
There are numerous ways to detect outliers in data. The simplest method is to construct plots like the ones above. This procedure works well in two dimensions, but may not work well in higher dimensions. Additionally, what may appear to be an outlier in one dimension may in fact not be an outlier when all variables are jointly modeled in a regression.
Numerically, there are several statistics that you can compute to detect outliers, but we will concentrate on only two: studentized residuals, obtained via dummy-variable regression, and DFBetas. These statistics are produced by most software packages on request.
Studentized residuals can be obtained by constructing a dummy variable representing an observation that is suspected to be an outlier (or doing this for all observations, one at a time) and including the dummy variable in the regression model. If the dummy variable coefficient for a particular case is significant, it indicates that the observation is an outlier. The t-test on the dummy variable coefficient is the studentized residual, which can also be obtained in other ways.
Many packages will produce studentized residuals if you ask for them. However, it is important to realize that these residuals follow a t distribution, implying that (in a large sample) approximately 5% of them will appear extreme. We correct for this problem by adjusting our critical t value, using a Bonferroni correction. The correction is to replace the usual critical t_{α/2} with t_{α/2n}. This moves the critical t further out into the tails of the distribution, making it more difficult to obtain a significant residual.
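A small sketch of the Bonferroni adjustment; the sample size and residual degrees of freedom below are arbitrary illustrative values.

```python
from scipy import stats

# Bonferroni-adjusted critical value for studentized residuals:
# replace the usual t_{alpha/2} cutoff with t_{alpha/(2n)}.
alpha, n, df_resid = 0.05, 500, 496
t_usual = stats.t.ppf(1 - alpha / 2, df_resid)
t_bonf = stats.t.ppf(1 - alpha / (2 * n), df_resid)
print(round(t_usual, 2), round(t_bonf, 2))   # the adjusted cutoff is much further in the tail
```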
The above approach finds outliers, but it doesn't tell us how influential an outlier may be. Not all outliers are influential. An approach to examining the influence of a particular outlier is to estimate the regression model both with and without the offending observation. A statistic D_ij can then be computed as D_ij = B_j − B_j(i). This is simply the difference between the jth coefficient in the model using all observations and the jth coefficient when observation i is deleted from the data. This measure can be standardized using the following formula:

D*_ij = D_ij / SE_(i)(B_j),

where SE_(i)(B_j) is the standard error of B_j estimated with observation i deleted. This standardization makes these statistics somewhat more comparable across variables, but we may be more interested in comparing them across observations for the same coefficient/variable. In doing so, it may be helpful to simply plot the D statistics for all observations for each variable.
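A brute-force sketch of the D statistic on simulated data, refitting the model once per deleted observation (for large data sets one would typically use closed-form DFBETA formulas or a package routine instead).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.normal(size=n)
y = 3 + 1.1 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
b_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# D_ij = B_j (all cases) - B_j (case i deleted), computed by refitting without case i.
D = np.empty((n, 2))
for i in range(n):
    keep = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    D[i] = b_full - b_i

# Cases with the largest |D| for the slope are the most influential on b_1.
print(np.argsort(np.abs(D[:, 1]))[-3:])
```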
1.2 Compensating for Outliers
Probably the easiest solution for dealing with outliers is to delete them. This costs information, but it may be the best solution. Often outliers are miscoded values (either in the original data or in your own recoding schemes). Sometimes, however, they are important cases that should be investigated further. Finding clusters of outliers, for example, may lead you to discover that you have omitted something important from your model.
1.3 Cautions
Outliers can be very difficult to detect in high dimensions. You must also be very careful in using the approaches discussed above when looking for outliers. If there is a cluster of outliers, deleting only one of the observations in the cluster will not lead to a significant change in the coefficients, so the D statistic will not detect it.
(copyright by Scott M. Lynch, February 2003)
1 Multicollinearity (Soc 504)
Until now, we have assumed that the data are well-conditioned for linear regression. The next few sets of notes will consider what happens when the data are not so cooperative. The first problem we will discuss is multicollinearity.
1.1 Defining the Problem
Multicollinearity is a problem with being able to separate the effects of two (or more) variables on an outcome variable. If two variables are sufficiently alike, it becomes impossible to determine which of the variables accounts for variance in the dependent variable. As a rule of thumb, the problem primarily occurs when x variables are more highly correlated with each other than they are with the dependent variable.
Mathematically, the problem is that the X matrix is not of full rank. When this occurs, the X^T X matrix has determinant 0 and cannot be inverted. For example, take a general 3 × 3 matrix A:

A = [ a  b  b ]
    [ c  d  d ]
    [ e  f  f ]

The determinant of this matrix is:

adf + bde + bcf − (bde + adf + bcf) = 0.

Recall from the notes on matrix algebra that the inverse can be found using the determinant:

A^(−1) = (1 / det(A)) adj(A).

However, when det(A) = 0, all of the elements of the inverse are clearly undefined. Generally, the problem is not severe enough (e.g., not every element of one column of X will be identical to another) to crash a program, but it will produce other symptoms.
To some extent, multicollinearity is a problem of not having enough information. Additional data points, for example, will tend to produce more variation across the columns of the X matrix and allow us to better differentiate the effects of two variables.
1.2 Detecting/Diagnosing Multicollinearity
In order to lay the foundation for discussing the detection of multicollinearity problems, I conducted a brief simulation, using the following generated variables (n = 100 each):
u ~ N(0, 1)
x_2 ~ N(0, 1)
e ~ N(0, 1)
x_1 = .9 x_2 + (1 − .9²)^(1/2) u

The construction of the fourth variable gives us variables x_1 and x_2 that have a correlation of .9. I then created a series of y variables:
y_1 = 3 + x_1 + x_2 + (.01)e
y_2 = 3 + x_1 + x_2 + (.1)e
y_3 = 3 + x_1 + x_2 + (1)e
y_4 = 3 + x_1 + x_2 + (5)e
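A rough reconstruction of this simulation in Python; the random seed used in the notes is unknown, so the numbers will differ slightly from the tables below.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
u, x2, e = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
x1 = 0.9 * x2 + np.sqrt(1 - 0.9**2) * u        # corr(x1, x2) is about .9

ys = {s: 3 + x1 + x2 + s * e for s in (0.01, 0.1, 1, 5)}
print(np.corrcoef(x1, x2)[0, 1])
for s, y in ys.items():
    print(s, np.corrcoef(x1, y)[0, 1])         # correlation with y falls as the noise grows
```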
Changing the error variance has the effect of altering the amount of noise contained in y, reducing the relationship between each x and y. For example, the following are the correlations of each x with each y:

        X_1     X_2
X_1     1       .91
X_2     .91     1
Y_1     .98     .98
Y_2     .98     .98
Y_3     .87     .87
Y_4     .37     .38

Notice that all of the correlations are high (which is atypical of sociological variables), but that the X variables are more highly correlated with each other than they are with Y_3 and Y_4.
I conducted regression models of each y on x_1 and x_2 to examine the effect of the very high collinearity between x_1 and x_2, given different levels of noise (f(e)) disrupting the correlation between each x and y. I then reduced the data to size n = 20 and re-ran these regressions. Here are the results:
          y_1         y_2        y_3       y_4
N = 100
R²          1           1         .79       .15
F       1740033       17463      181.11     8.51
b_0        3.0         3.0        3.04      3.19
b_1        1.0          .99        .93       .63
b_2        1.0         1.01       1.11      1.57
N = 20
R²          1           1         .82       .27
F        333802        3378       38.63     3.08#
b_0        3.0         2.98       2.76      1.82
b_1        1.01        1.12       2.22      7.08
b_2         .99         .90       0        -4.0
Some classic symptoms of multicollinearity include: 1) having a significant F but no significant t-ratios; 2) wildly changing coefficients when an additional (collinear) variable is included in a model; and 3) unreasonable coefficients.
This example highlights some of these classic symptoms. First, in the final model (n = 20, y_4), we have a significant F (p < .1), but none of the coefficients is significant (based on the t-ratios). Second, if we were to re-estimate this final model with only x_1 included, the coefficient for x_1 would be 2.06. However, as the model reported above indicates, the coefficient jumps to 7.08 when x_2 is included. Finally, as the final model results indicate, the coefficients also appear to be unreasonable. If we examined the bivariate correlation between each x and y, we would find moderate and positive correlations, so it may be unreasonable for us to find regression coefficients that are opposite in sign and large, as in this model.
To summarize the findings of the simulation, it appears that when there is relatively little noise in y, collinearity between the x variables doesn't cause much of a problem. However, when there is considerable noise (as is typical in the social sciences), collinearity significantly influences the coefficients, and this effect is exacerbated when there is less information (e.g., when n is smaller). This highlights that, to some extent, multicollinearity is a problem of having too little information.
There are several classical tests for diagnosing collinearity problems, but we will focus on only one, the variance inflation factor, perhaps the most common. As Fox notes, the sampling variance for an OLS slope coefficient can be expressed as:

V(b_j) = [ 1 / (1 − R²_j) ] × [ σ²_e / ((n − 1) S²_j) ]
In this formula, R²_j is the explained variance we obtain when regressing x_j on the other x variables in the model, and S²_j is the variance of x_j. Recall that the variance of b_j is used in constructing the t-ratios that we use to evaluate significance. This variance is increased if σ²_e is large, S²_j is small, or R²_j is large. The first term of the expression above is called the variance inflation factor (VIF). If x_j is highly correlated with the other x variables, then R²_j will be large, making the denominator of the VIF small, and hence the VIF very large. This inflates the variance of b_j, making it difficult to obtain a significant t-ratio. To some extent, we can offset this problem if σ²_e is very small (e.g., there is little noise in the dependent variable, or, alternatively, the x's account for most of the variation in y). We can also offset some of the problem if S²_j is large. We've discussed this previously, in terms of gaining leverage on y, but here increasing the variance of x_j will also help generate more noise in the regression of x_j on the other x's, and will thus tend to make R²_j smaller.
What value of VIF should we use to determine whether collinearity is a problem? Typically, we use 10 as the threshold at which we consider it to be a problem, but this is simply a rule of thumb. The figure below shows what the VIF is at different levels of correlation between x_j and the other variables. VIF = 10 implies that the R-square for that regression must be .9.
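A minimal sketch of the VIF computation, regressing each x on the remaining x's; the simulated predictors here have a correlation of about .9, which corresponds to a VIF of roughly 5, still below the conventional threshold of 10.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X, computed by regressing
    that column on the remaining columns: VIF_j = 1 / (1 - R2_j)."""
    n, k = X.shape
    out = []
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ b
        r2 = 1 - resid.var() / X[:, j].var()
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(6)
x2 = rng.normal(size=200)
x1 = 0.9 * x2 + np.sqrt(1 - 0.81) * rng.normal(size=200)
print(vif(np.column_stack([x1, x2])))   # about 5 for each predictor
```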
[Figure omitted: the variance inflation factor (y-axis, 0-100) plotted against the R-squared from regressing X on the other variables (x-axis, 0-1), with a reference line at VIF = 10.]
Figure. Variance Inflation Factor by Level of Relationship between X Variables.
1.3 Compensating for Multicollinearity
There are several ways of dealing with multicollinearity when it is a problem. The first, and most obvious, solution is to eliminate some variables from the model. If two variables are highly collinear, then they contain highly redundant information. Thus, we can pick one variable to keep in the model and discard the other.
If collinearity is a problem but you can't decide on an appropriate variable to omit, you can combine the offending x variables into a reduced set of variables. One such approach would be to conduct an exploratory factor analysis of the data, determine the number of unique factors the x variables contain, and generate either (factor-)weighted or unweighted scales, based on the factors on which the variables load. For example, suppose we have 10 x variables that may be collinear. Suppose also that a factor analysis suggests that the variables really reflect only two underlying factors, and that variables 1-5 strongly correlate with the first factor, while variables 6-10 strongly correlate with the second factor. In that case, we could sum variables 1-5 to create one scale, and sum variables 6-10 to make a second scale. We can either sum the variables directly, or we can weight them based on their factor loadings. We then include these two scales in the regression model.
Another solution is to transform one of the offending x variables. We have already seen that multicollinearity becomes particularly problematic when two x variables have a stronger relationship with each other than they have with the dependent variable. Ideally, if we want to model the relationship between each x and y, we would like to see a strong relationship between the x variables and y. Transforming one or both x variables may yield a better relationship to y, and at the same time it may reduce the collinearity problem. Of course, be sure not to perform the same transformation on both x variables, or you will be back at square one.
A final approach to remedying multicollinearity is to conduct ridge regression. Ridge regression involves transforming all variables in the model and adding a biasing constant to the new (X^T X) matrix before solving the equation system for b. You can read about this technique in more depth in the book.
1.4 Final Note
As a final note, we should discuss why collinearity is an issue. As we've discussed before, the only reason we conduct multiple regression is to determine the effect of x on y, net of other variables. If there is no relationship between x and the other variables, then multiple regression is unnecessary. Thus, to some extent, collinearity is the basis for conducting multiple regression. However, when collinearity is severe, it leads to unreasonable coefficient estimates, large standard errors, and consequently bad interpretation/inference. Ultimately there is a very thin line between collinearity being problematic and collinearity simply necessitating the use of multiple regression.
(copyright by Scott M. Lynch, March 2003)
1 Non-normal and Heteroscedastic Errors (Soc 504)
The Gauss-Markov Theorem says that OLS estimates of the coefficients are BLUE when the errors are normal and homoscedastic. When errors are nonnormal, the E property (Efficient) no longer holds for the estimators, and in small samples the standard errors will be biased. When errors are heteroscedastic, the standard errors become biased. Thus, we typically examine the distribution of the errors to determine whether they are normal.
Some approaches to examining nonnormality of the errors include constructing a histogram of the error terms and constructing a Q-Q (Quantile-Quantile) plot. Certainly, there are other techniques, but these are two simple and effective methods. Constructing a histogram of errors is trivial, so I don't provide an example of it. Creating a Q-Q plot, on the other hand, is a little more difficult. The objective of a Q-Q plot is to compare the empirical error distribution to a theoretical distribution (i.e., the normal). If the errors are normally distributed, then the empirical and theoretical distributions will look identical, and a scatterplot of the two will fall along a straight line. The observed errors are obtained from the regression analysis. The steps for computing the theoretical distribution are as follows:
1. Order the errors from smallest to largest so that X_1 < X_2 < . . . < X_n.

2. Compute the statistic CDF_empirical(i) = (i - 1/2)/n, where i is the rank of the empirical error
after step 1. This gives us the empirical CDF for the data.

3. Compute the inverse of this CDF value under the theoretical distribution. That is, take
z_i = Φ^(-1)(CDF_empirical(i)).

4. Plot X against z.
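To make the steps concrete, here is a minimal C sketch under illustrative assumptions: the error values are made up, and the inverse normal CDF is obtained by bisection on Φ (computed with erf() from the C99 math library) rather than by a specialized routine. Compile with the math library (e.g., cc qq.c -lm) and plot the two printed columns against each other to obtain the Q-Q plot.

#include<stdio.h>
#include<stdlib.h>
#include<math.h>

/* standard normal CDF, using erf() from the C99 math library */
double norm_cdf(double z)
{
  return 0.5 * (1.0 + erf(z / sqrt(2.0)));
}

/* invert the normal CDF by bisection (simple, not the fastest approach) */
double norm_quantile(double p)
{
  double lo = -10.0, hi = 10.0, mid;
  while (hi - lo > 1e-8) {
    mid = 0.5 * (lo + hi);
    if (norm_cdf(mid) < p) lo = mid; else hi = mid;
  }
  return 0.5 * (lo + hi);
}

/* comparison function for qsort(): ascending order of doubles */
int cmp_double(const void *a, const void *b)
{
  double d = *(const double *)a - *(const double *)b;
  return (d > 0) - (d < 0);
}

int main(void)
{
  /* a few illustrative "errors"; in practice these come from the regression */
  double e[] = {0.3, -1.2, 0.7, 2.1, -0.4, 0.0, 1.5, -2.2, 0.9, -0.6};
  int n = sizeof(e) / sizeof(e[0]), i;

  qsort(e, n, sizeof(double), cmp_double);   /* step 1: order the errors */
  for (i = 0; i < n; i++) {
    double cdf = (i + 1 - 0.5) / n;          /* step 2: empirical CDF, (i - 1/2)/n */
    double z = norm_quantile(cdf);           /* step 3: theoretical quantile */
    printf("%f %f\n", e[i], z);              /* step 4: plot X against z */
  }
  return 0;
}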
The figure below provides an example of this plot. I simulated 100 observations (errors)
from a N(5, 2) distribution:
Figure 1. Q-Q Plot for 100 random draws (X) from a N(5, 2) distribution.
Notice that in this case, the distribution clearly falls on a straight line, indicating the errors
are probably normally distributed. The following figure is a histogram of exp(X). This
distribution is clearly right skewed.
Figure 2. Histogram of exp(X) from above.
The resulting Q-Q plot below reveals this skew clearly. The scatterplot points are not
on a straight line. Rather, the plot follows a curve. The plot reveals that there is too much
mass at the far left end of the distribution relative to what one would expect if the distribution
were normal.
Figure 3. Q-Q Plot for exp(X).
The next histogram shows the distribution of ln(X), which has a clear left skew.
Figure 4. Histogram of Log(X).
The resulting Q-Q plot also picks up this skew, as evidenced by the curvature at the top
edge of the plot. This curvature indicates that there is not enough cumulative mass in the
middle of the distribution relative to what would be expected under a normal distribution.
Figure 5. Q-Q Plot of Log(X).
1.1 Heteroscedasticity
The above plots are helpful diagnostic tools for nonnormal errors. What about heteroscedasticity?
Heteroscedasticity means non-constant variance of the errors across levels of X. The
main consequence of heteroscedasticity is that the standard errors of the regression coefficients
can no longer be trusted. Recall that the standard errors are a function of σ_e. When σ_e
is not constant, the typical standard error formula is incorrect.
The standard technique for detecting heteroscedasticity is to plot the error terms against
either the predicted values of Y or each of the X variables. Below is a plot of errors that
evidences heteroscedasticity across levels of X.
Figure 6. Example of Heteroscedasticity.
The plot has a classic horn shape in which the variance of the errors appears to increase as
X increases. This is typical of count variables, which generally follow a Poisson distribution.
The Poisson distribution is governed by one parameter, λ (its density is p(X) = (1/X!) λ^X exp(-λ)),
which is both the mean and the variance of X. When X is large, it implies that λ is probably
large, and hence that the variance of X is also large. Heteroscedasticity is not limited to this
horn-shaped pattern, but it often follows this pattern.
1.2 Causes of, and Solutions for, Nonnormality and Heteroscedasticity
Nonnormal errors and heteroscedasticity may be symptoms of an incorrect functional form
for the relationship between X and Y, an inappropriate level of measurement for Y, or an
omitted variable (or variables). If the functional form of the relationship between X and Y
is misspecified, errors may be nonnormal because there may be clusters of incredibly large
errors where the model doesn't fit the data. For example, the following plot shows the
observed and predicted values of Y from a model in which an incorrect functional form was
estimated.
Figure 7. Observed and (Improperly) Fitted Values.
In this case, the correct model is y = b_0 + b_1 exp(X), but the model y = b_0 + b_1 X was fitted.
The histogram of the errors is clearly nonnormal:
Figure 8. Histogram of Errors from Example.
and the Q-Q plot reveals the same problem:
Figure 9. Q-Q Plot of Errors in Example.
A plot of the errors against the X values is:
Figure 10. Plot of Errors against X in Example.
This plot suggests that heteroscedasticity is not a problem, because the range of the errors
doesn't really appear to vary across levels of X. However, the figure shows there is clearly a
problem with the functional form of the model: the errors reveal a clear pattern. This implies
that we may not only have the wrong functional form, but we also have a problem with serial
autocorrelation (which we will discuss in the context of time series analyses later).
If we do the appropriate transformation, we get the following.
Figure 11. Histogram of Errors in Revised Example.
Figure 12. Q-Q Plot of Errors in Revised Example.
Figure 13. Plot of Errors Against X in Revised Example.
These plots show clear normality and homoscedasticity (and no autocorrelation) of the errors.
In other cases, a simple transformation of one or more variables may not be sufficient. For
example, if we had a discrete, dichotomous outcome, the functional form of the model would
need to be significantly respecified. Nonnormal and heteroscedastic errors are very often the
consequence of measurement of the dependent variable at a level that is inappropriate for
the linear model.
As stated above, another cause of heteroscedasticity is omitting a variable that should
be included in the model. The obvious solution is to include the appropriate variable.
An additional approach to correcting for heteroscedasticity (weighted least squares) will be
discussed soon.
(copyright by Scott M. Lynch, March 2003)
1 Alternative Estimation Strategies (Soc 504)
When regression assumptions are violated to the point that they degrade the quality of the
OLS estimator, we may use alternative strategies for estimating the model (or use alternative
models). I discuss four types of alternative estimation strategies here: bootstrapping, robust
estimation with M-estimators, Weighted Least Squares (WLS) estimation, and Generalized
Least Squares (GLS) estimation.
2 Bootstrapping
Bootstrapping is useful when your sample size is small enough that the asymptotic properties
of MLE or OLS estimators are questionable. It is also useful when you know the errors (in a
small sample) aren't normally distributed.
The bootstrapping approach for a simple statistic is relatively simple. Given a sample
of n observations, we take m resamples with replacement of size n from the original sample.
For each of these resamples, we compute the statistic of interest and form the empirical
distribution of the statistic from the results.
For example, I took a sample of 10 observations from a U(0, 1) distribution. This size
sample is hardly large enough to justify using normal theory for estimating the standard error
of the mean. In this example, the sample mean was .525, and the estimated standard error
was .1155. After taking 1000 bootstrap samples, the mean of the distribution of bootstrap
sample means was .525, and the estimated standard error was .1082. The distribution of
means looked like:
Figure 1. Bootstrap Sample Means.
This distribution is approximately normal, as it should be. The empirical 95% confidence
interval for the mean was (.31, .74). This interval can be found by taking the 2.5th and
97.5th percentiles of the empirical bootstrap distribution.
In this case, the bootstrap results did not differ much from the original results. However,
we can better trust the bootstrap results, because normal theory really doesn't allow us to
be confident in our original estimate of the standard error.
Below is the C program that produces the bootstrap samples for the above example.
#include<stdio.h>
#include<math.h>
#include<stdlib.h>

double uniform(void);

int main(int argc, char *argv[])
{
  int samples,rep,pick;
  double mean,threshold;
  double replicate,r;
  /* the original sample of n = 10 observations */
  double y[10]={.382,.100681,.596484,.899106,.88461,.958464,.014496,.407422,.863247,.138585};
  FILE *fpout;

  for(samples=1;samples<=1000;samples++)     /* m = 1000 bootstrap resamples */
  {
    printf("doing sample %d\n",samples);
    mean=0;
    for(rep=0;rep<10;rep++)                  /* draw n = 10 observations with replacement */
    {
      r=uniform();
      threshold=0;
      for(pick=0;pick<10;pick++)             /* r in [pick/10,(pick+1)/10) selects y[pick] */
      {
        if(r>threshold){replicate=y[pick];}
        threshold+=.1;
      }
      mean+=replicate;
    }
    mean/=10;                                /* the statistic of interest: the resample mean */
    if ((fpout=fopen(argv[1],"a"))==NULL)
      {printf("couldnt open the file\n"); exit(0);}
    fprintf(fpout,"%d %f\n",samples,mean);   /* append the bootstrap mean to the output file */
    fclose(fpout);
  }
  return 0;
}

/* uniform(0,1) deviate based on random(); bounded away from 0 */
double uniform(void)
{
  double x;
  double deviate;
  x=random();
  deviate=x/(2147483647);
  if (deviate<.0000000000000001){deviate=.0000000000000001;}
  return deviate;
}
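Once the file of bootstrap means has been written out, the percentile interval can be computed by sorting them and reading off the 2.5th and 97.5th percentiles. Here is a minimal sketch along those lines; it assumes the two-column file format written by the program above, and the exact percentile indexing convention is one of several reasonable choices.

#include<stdio.h>
#include<stdlib.h>

static int cmp_double(const void *a, const void *b)
{
  double d = *(const double *)a - *(const double *)b;
  return (d > 0) - (d < 0);
}

int main(int argc, char *argv[])
{
  double means[1000];
  int id, m = 0;
  FILE *fpin;

  if (argc < 2 || (fpin = fopen(argv[1], "r")) == NULL)
  {
    printf("couldnt open the file\n");
    return 1;
  }
  /* each line of the file holds a sample number and a bootstrap mean */
  while (m < 1000 && fscanf(fpin, "%d %lf", &id, &means[m]) == 2)
    m++;
  fclose(fpin);

  /* sort the bootstrap means and take the 2.5th and 97.5th percentiles */
  qsort(means, m, sizeof(double), cmp_double);
  printf("95%% CI: (%f, %f)\n", means[(int)(0.025 * m)], means[(int)(0.975 * m) - 1]);
  return 0;
}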
In a regression setting, there are two ways to conduct bootstrapping. In one approach,
we assume the X variables are random (rather than fixed). Then, we can obtain bootstrap
estimates of the sampling distribution of β by taking samples of size n from the original
sample and forming the distribution of (X^(j)T X^(j))^(-1) (X^(j)T Y^(j)) (the OLS estimates from
each bootstrap sample j). In the other approach, we treat X as fixed. If X is fixed, then
we must resample the error term (the only random component of the model). We do this as
follows:

1. Compute the OLS estimates for the original sample.

2. Obtain e_i = Y_i - X_i b.

3. Draw bootstrap samples of the e (with replacement).

4. Compute Y_i^(j) = Ŷ_i + e_i^(j) for each bootstrap sample, where Ŷ_i = X_i b.

5. Compute the OLS estimates for each bootstrap sample Y^(j).
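Here is a minimal C sketch of the fixed-X (residual) bootstrap for a simple regression with one x variable. The data values and the number of replicates are only illustrative, and rand() is used for the resampling for brevity; the OLS estimates are computed from the usual sums.

#include<stdio.h>
#include<stdlib.h>

#define N 10
#define REPS 1000

/* OLS slope and intercept for a simple regression */
void ols(const double *x, const double *y, int n, double *b0, double *b1)
{
  double sx = 0, sy = 0, sxx = 0, sxy = 0;
  int i;
  for (i = 0; i < n; i++) {
    sx += x[i]; sy += y[i]; sxx += x[i] * x[i]; sxy += x[i] * y[i];
  }
  *b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx);
  *b0 = (sy - *b1 * sx) / n;
}

int main(void)
{
  /* illustrative data */
  double x[N] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
  double y[N] = {2.1, 3.9, 6.2, 8.1, 9.8, 12.3, 13.9, 16.2, 18.1, 19.7};
  double e[N], yhat[N], ystar[N], b0, b1, b0s, b1s;
  int i, j;

  /* steps 1-2: OLS on the original sample, then residuals e_i = y_i - yhat_i */
  ols(x, y, N, &b0, &b1);
  for (i = 0; i < N; i++) {
    yhat[i] = b0 + b1 * x[i];
    e[i] = y[i] - yhat[i];
  }

  /* steps 3-5: resample residuals, rebuild y*, re-estimate OLS each time */
  for (j = 0; j < REPS; j++) {
    for (i = 0; i < N; i++)
      ystar[i] = yhat[i] + e[rand() % N];   /* X held fixed */
    ols(x, ystar, N, &b0s, &b1s);
    printf("%d %f %f\n", j + 1, b0s, b1s);  /* save the bootstrap estimates */
  }
  return 0;
}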
3 Robust Estimation (with M-Estimators)
Fox discusses M-estimation as a supplemental method to OLS for examining the effect
of outliers. The book introduces the notion of influence plots by showing how a single
outlying observation influences a sample estimate (e.g., the mean, median, etc.):
Figure 2. Influence Plot Example.
The influence plot was produced by taking a sample of 9 N(0, 1) observations and adding a
10th observation taking values incrementally in the range of (-10, 10). The statistics (mean
and median) were then computed. The median is clearly more resistant to the outlying
observation, as indicated by the plot. The true mean and median of a N(0, 1) variable are
0, and the sample median remains much closer to this value as the 10th observation becomes
more extreme.
When outliers are a problem, OLS estimation may not be the best approach to estimating
regression coefficients, because the OLS estimator minimizes the mean squared error of the
observations. Outliers then exert undue influence on the coefficients, because the error
terms are squared. In order to determine whether our estimates are robust, we can try other
criteria rather than OLS. For example, a common alternative is the least absolute value
(LAV) estimator:

LAV = min Σ | Y_i - Ŷ_i |
Fox shows that this class of estimators can be estimated generally using iteratively
reweighted least squares (IRLS). In IRLS, the derivative of the objective function is re-expressed
in terms of weights that apply to each observation. These weights are a function
of the error term (for LAV, w_i = 1/|E_i|), which, of course, is a function of the regression
coefficients in a regression model, or, for a mean, simply of Y_i - μ. Thus, we can solve for the
regression coefficients by using a starting value for them, computing the weights, recomputing
the regression coefficients, and so on. While an estimate of a sample mean is:

μ̂ = Σ w_i Y_i / Σ w_i ,

the estimate of the regression coefficients is:

β̂ = (X^T W X)^(-1) (X^T W Y).
I illustrate the use of IRLS estimation with the same sample of 9 N(0, 1) observations
as used above. In this example, I use IRLS on the LAV objective function to estimate the
mean. Notice that the LAV estimate of the mean is bounded. If we think about the median
of a sample with an even number of observations, the median is the mean of the two centermost
observations. In this sample of data, when the 10th observation is an extreme negative value,
the centermost observations are -.23 and .24. When the 10th observation is an extreme positive
value, the centermost observations are .24 and 1.10. Thus, the LAV estimate of the mean will
stay in the range of (-.23, 1.10), as the figure below illustrates. Once the value becomes extreme
enough, the weight for that observation becomes very small in the IRLS routine, so the influence
of the observation is minimal. Below is a C program that estimates the LAV function using IRLS.
Figure 3. IRLS Results of Using LAV Function to Estimate a Mean.
#include<stdio.h>
#include<math.h>
#include<stdlib.h>

int main(int argc, char *argv[])
{
  int i,j,k,loop;
  double mean[100],weight[10],num,denom;
  /* 9 fixed N(0,1) observations; y[9] is the added "outlier" */
  double y[10]={-.30023, -1.27768, .244257, 1.276474, 1.19835,
                1.733133,-2.18359,-.23418, 1.095023, 0.0};
  FILE *fpout;

  for(i=-20;i<=20;i++)                  /* let the 10th observation vary incrementally */
  {
    y[9]=i*1.0;
    mean[0]=0;
    mean[1]=0;
    for(j=0;j<=9;j++)
      {mean[1]+=y[j];}
    mean[1]/=10;                        /* start IRLS from the ordinary sample mean */
    loop=1;
    while(fabs(mean[loop]-mean[loop-1])>.000001)   /* iterate to convergence */
    {
      printf("mean %d=%f\n",loop,mean[loop]);
      /* LAV weights: w_k = 1/|y_k - current mean| */
      for(k=0;k<=9;k++){weight[k]=1.0/(fabs(y[k]-mean[loop]));}
      num=0; denom=0;
      for(k=0;k<=9;k++){denom+=weight[k]; num+=(weight[k]*y[k]);}
      loop++;
      mean[loop]=num/denom;             /* new weighted mean */
    }
    if ((fpout=fopen(argv[1],"a"))==NULL)
    {
      printf("couldnt open the file\n"); exit(0);
    }
    /* write the outlier value, the ordinary mean, and the IRLS (LAV) estimate */
    fprintf(fpout,"%f %f %f\n",y[9],mean[1],mean[loop]);
    fclose(fpout);
  }
  return 0;
}
Within the while() loop, the previous value of the mean is used to compute the weights
for each observation (weight[k]=1/fabs(y[k]-mean[loop]);). Then, given the new weights, a
new value of the mean is computed (mean[loop]=num/denom;). If we wanted to use an
alternate objective function (e.g., Huber, bisquare), then we would simply replace the weight
calculation. I do not illustrate IRLS for a robust regression, but it is a straightforward
modification of this algorithm in which μ is replaced by Xβ in the weight calculation, and
the calculation of μ is replaced by the weighted OLS estimator shown above.
4 Weighted Least Squares and Generalized Least Squares Estimation
When errors are heteroscedastic, the error variance is no longer constant across all observations.
Thus, the assumption ε ~ N(0, σ²_e I) is no longer true. Rather, ε ~ N(0, Σ), where Σ is a
diagonal matrix (off-diagonal elements are still 0).
In this case, the likelihood function is modified to incorporate this altered error
variance term:

L(β | X, Y) = Π_i (2π σ²_i)^(-1/2) exp( -(Y_i - (Xβ)_i)² / (2σ²_i) ).
We can typically assume that the diagonal elements of the Σ matrix are weighted values of a
constant error variance, say:

         [ 1/w_1    0    . . .    0    ]
Σ = σ²_e [   0    1/w_2  . . .    0    ]
         [   .      .      .      .    ]
         [   0      0    . . .  1/w_n  ],
which gives us a matrix expression for the likelihood:

L(β | X, Y) = (2π)^(-n/2) |Σ|^(-1/2) exp( -(1/2) (Y - Xβ)^T Σ^(-1) (Y - Xβ) ).

The estimator for the parameters then becomes (X^T W X)^(-1)(X^T W Y), and the variance of
the estimator is σ²_e (X^T W X)^(-1).
The obvious question is: what do we use for weights? We may know that the error
variance is related to one of the observed variables, in which case we could either build this
function into the likelihood function above (and estimate it), or we could use the inverse of
that variable as the weights. Alternatively, we could bypass WLS estimation altogether
and simply divide every variable in the model by the offending X variable and use OLS
estimation.
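As an illustration of the first option, here is a minimal C sketch of WLS for a simple regression with one predictor, under the assumption (made up for this example) that the error variance is proportional to x, so that w_i = 1/x_i. The weighted sums reproduce the estimator (X^T W X)^(-1)(X^T W Y) for the one-predictor case.

#include<stdio.h>

#define N 5

/* WLS for a simple regression: minimize sum w_i (y_i - b0 - b1 x_i)^2 */
void wls(const double *x, const double *y, const double *w, int n,
         double *b0, double *b1)
{
  double sw = 0, swx = 0, swy = 0, swxx = 0, swxy = 0;
  int i;
  for (i = 0; i < n; i++) {
    sw   += w[i];
    swx  += w[i] * x[i];
    swy  += w[i] * y[i];
    swxx += w[i] * x[i] * x[i];
    swxy += w[i] * x[i] * y[i];
  }
  *b1 = (sw * swxy - swx * swy) / (sw * swxx - swx * swx);
  *b0 = (swy - *b1 * swx) / sw;
}

int main(void)
{
  /* illustrative data; suppose Var(e_i) is proportional to x_i, so w_i = 1/x_i */
  double x[N] = {1, 2, 4, 8, 16};
  double y[N] = {2.2, 3.1, 5.4, 8.6, 17.0};
  double w[N];
  double b0, b1;
  int i;

  for (i = 0; i < N; i++)
    w[i] = 1.0 / x[i];

  wls(x, y, w, N, &b0, &b1);
  printf("b0 = %f, b1 = %f\n", b0, b1);
  return 0;
}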
Generalized Least Squares estimation is the generalization of WLS. In WLS, we assume
the off-diagonal elements of the Σ matrix are 0. This assumption, if you recall from our notes
on multiple regression theory, implies that errors are independent across observations. Often,
however, this may be an unreasonable assumption (e.g., in time series, or in clustered data).
We can relax this assumption to obtain the GLS estimator: β̂ = (X^T Σ^(-1) X)^(-1) X^T Σ^(-1) Y.
This should look similar to the WLS estimator; in fact, it is the same (the WLS estimator
just has 0s on the off-diagonal elements of Σ). The variance estimator for β̂ is (X^T Σ^(-1) X)^(-1).
Obviously, we cannot estimate all the elements of Σ; there are n(n+1)/2 unique elements
in the Σ matrix. Thus, we may use functions or other simplifications to reduce the number of
elements to be estimated. This is the essence of basic time series models, as we will discuss in a few
weeks.
(copyright by Scott M. Lynch, March 2003)
1 Missing Data (Soc 504)
The topic of missing data has gained considerable attention in the last decade, as evidenced
by several recent trends. First, most graduating PhDs in statistics now claim missing data
as an area of interest or expertise. Second, it has become difficult to publish empirical work
in sociology without discussion of how missing data were handled. Third, more and more
methods for handling missing data have sprouted up over the last few years.
Although missing data have received a growing amount of attention, there are still some key
misunderstandings regarding the problems that missing data generate, as well as acceptable
solutions. Missing data are important to consider, because they may lead to substantial
biases in analyses. On the other hand, missing data are often harmless beyond reducing
statistical power. In these notes, I define various types of missingness and discuss methods
for handling it in your research.
2 Types of Missingness
Little and Rubin (1987) define three different classes of missingness. Here, I define the key
terms used in discussing missingness in the literature. I will then discuss how they relate to
some ways in which we encounter missing data.
Data missing on Y are observed at random (OAR) if missingness on Y is not a function
of X. Phrased another way, if X determines missingness on Y, the data are not OAR.
Data missing on Y are missing at random (MAR) if missingness on Y is not a function
of Y. Phrased another way, if Y determines missingness on Y, the data are not MAR.
Data are missing completely at random (MCAR) if missingness on Y is unrelated to X
or Y. In other words, MCAR = OAR + MAR. If the data are MCAR or at least MAR,
then the missing data mechanism is considered ignorable. Otherwise, the missing
data mechanism is considered nonignorable.
To make these ideas concrete, suppose we are examining the effect of education on income.
If missingness on income is a function of education (e.g., highly educated individuals don't
report their income), then the data are not OAR. If missingness on income is a function of
income (e.g., persons with high income do not report their income), then the data are not
MAR.
There are a number of ways in which we may encounter missing data in social science
research, including the following:
Individuals not followed up by design (meets MCAR assumption)
Item nonresponse (may not meet any assumption)
Loss to follow-up, or attrition (may not meet any assumption)
Mortality (respondents', not yours)
Sample selection (e.g., estimating a model that is only applicable to a subset of the
total sample) (may not meet any assumption)
All of these avenues of missingness are common in sociology, and indeed, it is generally more
surprising to find you have very little missing data when conducting an analysis than to find
you have much missing data. The good news is that the statistical properties of maximum
likelihood estimation obtain if the data are MCAR or MAR. That is, the data do not have
to be OAR. This is true only when the model is properly specified, however. In that case,
whatever piece of the data is observed is sufficient to produce unbiased parameter estimates.
Why is this true? Below is a plot of the relationship between some variable X and Y.
Figure 1. Regression Results for Different Types of Missing Data. Upper left = complete data;
upper right = MAR but not OAR; lower left = not MAR; lower right = incorrect functional form,
but data are MAR.
The first plot (upper left corner) shows the fit of the regression model when all the data
are observed. The second shows what happens to the fit of the regression model when the
data are MAR but not OAR. For this case, I deleted all observations with X values greater
than 1. Notice that the regression line is virtually identical to the one obtained with complete
data. The bottom left plot shows what happens when the data are neither MAR nor OAR.
For that case, I deleted all observations with Y values greater than 4. Observe that the
regression line is biased considerably in this case. Finally, the last plot shows what happens
if the data are MAR, but an incorrect functional form is specified for the model. In that
case, the model estimates will also be biased.
Although these results are somewhat encouraging in that they indicate that missing data
may not always lead to biases, the fact is that it is practically impossible to assess whether
data are MAR, exactly because the data are missing on the variable of interest! Furthermore,
methods for handling data that are not MAR are statistically difficult and rely on our ability
to correctly model the process that generates the missing data. This is also difficult to assess,
because the data are missing!
3 Traditional approaches to handling missing data.
A variety of approaches to handling missing data have emerged over the last few years. Below
is a list of some of them, along with a brief description. The bad news is that most of them
are not useful when the data are not MAR; they only help when the data are MAR but not
OAR. This is unfortunate, because, as we discussed above, parameter estimates are not biased
when the data are MAR but not OAR.
Listwise deletion. Simply delete an entire observation if it is missing on any item
used in the analyses.
Problems: Appropriate only when data are MCAR (or MAR). In that case, the only
loss is statistical power due to reduced n. If the data are not MAR, then results will
be biased. However, this is often the best method, even when the data are not MAR.
Pairwise deletion. Delete missing variables by pairs. This only works if a model is estimated
from covariance or correlation matrices. In that case, the covariances/correlations
are estimated pairwise, so you simply don't include an observation that is missing data
on one of the items in a pair.
Problems: Seems nice, but has a poor conceptual statistical foundation. What n should
tests be based on? Also leads to bias if the data are not MAR.
Dummy variable adjustment. Set missing values on a variable equal to some arbitrary
value. Then construct a dummy variable indicating missingness and include this
variable in the regression model.
Problems: This approach simply doesn't work and leads to biases in parameters and
standard errors. It also has no sound theoretical justification; it is simply an ad hoc
method to keep observations in an analysis.
Mean Imputation. Replace a missing observation with the sample mean of the
variable. When using longitudinal data, we can instead replace a missing score with
the mean of the individual's responses on the other waves. I think this makes more
sense than sample mean imputation in this case. An even better approach, I think, is
discussed below under regression imputation.
Problems: Sounds reasonable but isn't. Whether the data are OAR or MAR, this
approach leads to biases in both the standard errors and the parameters. The main
reasons are that it shifts possible extreme values back to the middle of the distribution,
and it reduces variance in the variable being imputed.
Hotdecking. Replace a missing observation with the score of a person who matches
the individual who is missing on the item on a set of other covariates. If multiple
individuals match the individual who is missing on the item, use the mean of the scores
of the persons with complete information. Alternatively, a random draw from a distribution
can be used.
Problems: Seems reasonable, but reduces standard errors because it (generally) ignores
variability in the x. That is, the x are not perfectly correlated, but this method
assumes they are. The method also assumes the data are MAR. It may be particularly
difficult to implement with lots of continuous covariates. Also, the more variables used
to match the missing observation, the better, but also the less likely you will be to find
a match. Finally, what do you do with multiple missing variables? Impute and impute?
Theoretically, you could replace all missing data through multiple passes through the
data, but this would definitely produce overconfident and suspect results.
Regression-based Imputation. Estimate a regression model predicting the missing
variable of interest for those in the sample with complete information. Then compute
predicted scores, using the regression coefficients, for the individuals who are missing
on the item. Use these predicted scores to replace the missing data. When longitudinal
data are used, and the missing variable is one with a within-individual pattern across
time, use an individual-specific regression to predict the missing score.
Problems: This is probably one of the best simple approaches, but it suffers from
the same main problem that hotdecking does: it underestimates standard errors by
underestimating the variance in x. A simple remedy is to add some random error to
the predicted score from the regression, but this begs another question: what distribution
should the error follow? This method also assumes the data are MAR.
Heckman Selection Modeling. The classic two-step method for which Heckman
won the Nobel Prize involves a) estimating a selection equation (I(Observed) = Xβ + e),
b) constructing a new variable, a hazard for sample inclusion, that is a function of the
predicted scores from this model (λ_i = φ(X_i β) / (1 - Φ(X_i β))), and c) including this new
variable in the structural model of interest (Y_i = Z_i γ + θ λ_i + u_i).
Problems: This method is an excellent method for handling data that are NOT MAR.
However, it has problems. One is that if there is significant overlap between X and
Z, then the method is inconsistent. Theoretically, there should be no overlap between
X and Z, but this is often unreasonable: variables related to Y are also relevant to
selection (otherwise, there would be no relationship between observing Y and Y, and
the data would thus be MAR). Another is that the standard errors in the structural
model are incorrect and must be adjusted. This is difficult. Fortunately, STATA has
two procedures that make this adjustment for you: Heckman (continuous structural
outcome) and Heckprob (dichotomous structural outcome).
Multiple Imputation. This method involves simulating possible values of the missing
data and constructing multiple datasets. We then estimate the model using each
new dataset and compute the means of the parameters across the samples. Standard errors
can be obtained via a combination of the between-imputation variance and the within-imputation
standard errors (with m imputations, the total variance of an estimate is the average
within-imputation variance plus (1 + 1/m) times the between-imputation variance).
Problems: Although this approach adjusts for the downward bias that some of the
previous approaches produce, its key drawbacks include that it assumes the data are
MAR, and it is fairly difficult to implement. To my knowledge, there are some standard
packages that do this, but they may be tedious to use. Another drawback is that
you must assume some distribution for the stochasticity. This can be problematic as
well.
The EM algorithm. The EM algorithm is a two-step process for estimating model
parameters. It integrates missing data into the estimation process, thus bypassing the
need to impute. The basic algorithm consists of two steps: expectation (E step) and
maximization (M step). First, separate the data into missing and nonmissing, and
establish starting values for the parameters. In the first step, using the parameters,
compute the predicted scores for the missing data (the expectation). In the second
step, using the predicted scores for the missing data, maximize the likelihood function
to obtain new parameter estimates. Repeat the process until convergence is obtained.
Problems: There are two key problems with this approach. One is that there is no
standard software (to my knowledge) that makes this accessible to the average user.
Second, the algorithm doesn't produce standard errors for the parameters. Other than
these problems, this is the best maximum likelihood estimation has to offer. Note that
this method assumes the data are MAR.
Direct ML Estimation. Direct estimation involves factoring the likelihood function
into components such that the missing data simply don't contribute to estimation of
parameters for which the data are missing. This approach retains all observations in
the sample and makes full use of the data that are observed.
Problems: Once again, this approach assumes the data are MAR. Not all likelihoods
factor easily, and not many standard packages allow such estimation.
Bayesian modeling with MCMC methods. Bayesian estimation is concerned
with simulating distributions of parameters so that simple descriptive statistics can be
used to summarize knowledge about a parameter. A simple example of a Bayesian
model would be a linear regression model (assume for simplicity that σ_e is known). We
already know that the sampling distribution for the regression coefficients in a linear
regression model has a mean vector equal to β and a covariance matrix σ²_e (X^T X)^(-1).
In an OLS model, the OLS estimate for β, β̂, is found using (X^T X)^(-1)(X^T Y). The
central limit theorem tells us that the sampling distribution is normal, so we can say:
β ~ N((X^T X)^(-1)(X^T Y), σ²_e (X^T X)^(-1)). That being the case, if we simply draw normal
variables with this distribution, we will have a sample of parameters, from which inference
can be made. (Note: if σ_e is not known, its distribution is inverse gamma. The
conditional distribution for the regression parameters is still normal, but the marginal
distribution for the parameters will be t.)
If there are missing data, we can use the assumption that the model is based on,
namely that Y ~ N(Xβ, σ²_e), and integrate this into the estimation. Now, rather
than simulating β only, we break estimation into two steps. After establishing starting
values for the parameters, first, simulate the missing data using the assumption
above regarding the distribution for Y. Second, given a complete set of data, simulate
the parameters using the formula discussed above regarding the distribution of β. Although
this approach seems much like the EM algorithm, it has a couple of distinct
advantages. One is that standard errors (technically, the standard deviation of the
posterior distribution for parameters) are obtained as a byproduct of the algorithm, so
there is no need to come up with some additional method to do so. Another is that, in
this process, a better estimate of uncertainty is obtained, given the missing data, than
would be obtained using EM.
Problems: Bayesian approaches are not easy to implement, because there is very little
packaged software in existence. Also, this method assumes the data are MAR. However,
I note that we can modify the model and estimation slightly to adjust for data
that aren't MAR. This can become complicated, though.
Selection and Pattern Mixture Models. These approaches are somewhat the opposite
of each other. We have already dealt with one type of selection model (Heckman).
In general, both approaches exploit the conditional probability rule, but they
do so in opposite fashions. Pattern mixture models model p(Y, Observed) = p(Y |
Observed) p(Observed), while selection models model p(Y, Observed) = p(Observed |
Y) p(Y). We have already seen an example of selection. A pattern mixture approach
would require us to specify a probability model for Y conditional on being observed,
multiplied by a model predicting whether an individual is observed. We would simultaneously
need to model Y conditional on being unobserved by this model for being
observed. This model is underidentified, because, without information on the Y that
are missing, we do not know any characteristics of their distribution; thus, some identifying
constraints are required.
Problems: The key problem with these approaches is that standard software
does not estimate them (with the exception of Heckman's method). However, these
are appropriate approaches when the data are not MAR.
4 A Simulation Demonstrating Common Approaches.
I generated 500 samples of size n = 100 each, consisting of 3 variables: X_1, X_2, and Y.
The correlation ρ between X_1 and X_2 was .4, and the error term, u, was drawn from N(0, 1).
Y was computed using: Y = 5 + 3X_1 + 3X_2 + u. First, I estimated the regression model on all
the samples with complete data. Second, I estimated the regression model on the samples after
causing some data to be missing. I used 4 different missing data patterns. First, I forced Y to
be missing if X_1 > 1.645. This induces approximately 5% to be missing in each sample and
generates samples in which the data are MAR but not OAR. Second, I forced Y to be missing if
X_1 > 1.037. This induces approximately 15% to be missing. For comparison, I also estimated
the models after making X_1 missing under the same conditions (rather than making
Y missing). These results emphasize that the real problem occurs when the dependent variable
is missing. Third, I forced Y to be missing if Y > 12.0735. Fourth, I forced Y to be
missing if Y > 9.4591. These latter two patterns make the data violate the MAR assumption
and generate approximately 5% and 15% missing data, respectively (the mean and variance
for Y differ from those for the X variables). I estimated regression models using various
approaches to handling the missing data. A short sketch of how one such sample can be
generated appears below, and the table in the next subsection summarizes the results of the
simulation.
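The following minimal C sketch shows how one sample of this kind could be generated (for the first missingness pattern only). It is illustrative rather than the code actually used: normal draws come from a Box-Muller transform, the correlation of .4 is induced by construction, and missing Y values are simply printed as NA.

#include<stdio.h>
#include<stdlib.h>
#include<math.h>

#define PI 3.141592653589793

/* one standard normal draw via the Box-Muller transform */
double rnorm(void)
{
  double u1 = (rand() + 1.0) / (RAND_MAX + 2.0);
  double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);
  return sqrt(-2.0 * log(u1)) * cos(2.0 * PI * u2);
}

int main(void)
{
  double rho = 0.4;
  int i;

  /* one sample of n = 100; the full simulation repeats this 500 times */
  for (i = 0; i < 100; i++)
  {
    double x1 = rnorm();
    double x2 = rho * x1 + sqrt(1.0 - rho * rho) * rnorm();  /* corr(X1, X2) = .4 */
    double y = 5.0 + 3.0 * x1 + 3.0 * x2 + rnorm();          /* u ~ N(0, 1) */

    /* first pattern: Y is missing (printed as NA) when X1 > 1.645 */
    if (x1 > 1.645)
      printf("%f %f NA\n", x1, x2);
    else
      printf("%f %f %f\n", x1, x2, y);
  }
  return 0;
}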
4.1 Simulation Results
Approach                   Int. (Emp/Est S.E.)   X1 (Emp/Est S.E.)   X2 (Emp/Est S.E.)
No Missing                 5.01 (.100/.101)      3.01 (.107/.111)    3.00 (.112/.111)

X1 missing if X1 > 1.645
  Listwise                 5.01 (.103/.104)      3.01 (.122/.125)    3.00 (.115/.114)
  Mean                     5.31 (.170/.176)      2.88 (.157/.215)    3.31 (.233/.187)
  Dummy                    5.01 (.103/.105)      3.00 (.122/.126)    3.01 (.117/.114)

X1 missing if X1 > 1.037
  Listwise                 5.01 (.113/.116)      3.01 (.140/.147)    3.00 (.121/.121)
  Mean                     5.76 (.219/.234)      2.82 (.212/.314)    3.57 (.271/.230)
  Dummy                    5.01 (.113/.130)      2.99 (.141/.163)    3.04 (.126/.125)

Y missing if X1 > 1.645
  Listwise                 same as above
  Mean                     4.57 (.230/.210)      2.12 (.368/.230)    2.86 (.236/.231)
  Dummy                    5.00 (.104/.122)      3.02 (.133/.145)    2.88 (.153/.130)

Y missing if X1 > 1.037
  Listwise                 same as above
  Mean                     3.87 (.321/.250)      1.28 (.342/.275)    2.57 (.316/.275)
  Dummy                    4.95 (.130/.173)      2.97 (.195/.213)    2.58 (.227/.166)

Y missing if Y > 12.0735
  Listwise                 4.96 (.102/.107)      2.97 (.119/.122)    2.96 (.118/.121)
  Mean                     4.19 (.257/.245)      2.07 (.359/.269)    2.08 (.341/.268)
  Dummy                    4.95 (.104/.121)      2.90 (.143/.134)    2.90 (.143/.133)
  Heckman                  11.41 (1.43/.314)     2.95 (.128/.134)    lambda = -2.23

Y missing if Y > 9.4591
  Listwise                 4.90 (.118/.122)      2.93 (.130/.135)    2.92 (.130/.135)
  Mean                     4.10 (.281/.267)      1.97 (.391/.293)    1.98 (.372/.292)
  Dummy                    4.75 (.139/.164)      2.67 (.197/.168)    2.67 (.187/.168)
  Heckman                  9.57 (.85/.274)       2.90 (.132/.136)    lambda = -2.29
4.2 Summary of Results
The results indicate that, when the data are MAR, only violating the OAR assumption,
listwise deletion only costs us efficiency. The dummy variable approach appears to work
well, at least with little missing data. As the percent missing increases, the biases in the
parameters and (especially) the standard errors begin to become apparent. Mean imputation
performs very poorly in all cases. The biases appear to be most problematic when the data
that are missing are the outcome data; missingness on the covariate is less troublesome.
In the results for the data that violate the MAR assumption, listwise deletion appears
to work about as well as any approach. Once again, mean imputation performs very poorly,
as does the dummy variable approach. Heckman selection recovers the slope coefficients well,
but the intercept and standard errors are substantially biased.
It is important to remember that this simulation study used better data than are typically
found in real datasets. The variables were all normally distributed, and the regression
relationship between the independent and dependent variables was quite strong. These
results might be more interesting if the signal-to-noise ratio were reduced (i.e., the error variance
were increased).
5 Recommendations for handling missing data.
The guidelines below are my personal recommendations for handling missing data. They are
based on a pragmatic view of the consequences of ignoring missing data, the consequences
of handling missing data inappropriately, and the likelihood of publication/rejection using a
particular method. These suggestions primarily apply when the outcome variable is missing
and not the covariates. When covariates are missing, the consequences of missingness are
somewhat less severe.
The standard method for reporting the extent of missingness in current sociology
articles is to a) construct a dummy variable indicating whether an individual is missing
on an item of interest, b) conduct logistic regression analyses predicting missingness using
covariates/predictors of interest, and c) report the results: do any variables predict missingness
at a significant level? There are at least two problems with this common approach: 1) What
do you do if there is some pattern? Most people ignore it, and this seems to be acceptable.
Some people make arguments for why any biases that will be created will be conservative. 2)
This approach only demonstrates whether the data are OAR, which, as I said above, doesn't
matter! I don't recommend this standard approach, but unfortunately, it is fairly standard.
Beyond simply reporting patterns of missingness, I recommend the following for dealing
with missing data:
1. The first rule I recommend is to try several approaches in your modeling. If the findings
are robust to various approaches, this should be comforting, not just to you, but also
to reviewers. So, report in a footnote at least all the various approaches you tried.
This will save you the trouble of having to respond to a reviewer by saying you already
tried his/her suggestions. If the findings are not robust, this may point you to the
source of the problem, or it may give you an idea for altering your research question.
2. If you have less than 5% missing, just listwise delete the missing observations. Any
sort of simple imputation or correction may be more likely to generate biases, as the
simulation showed.
3. If you have between 5% and 10% missing, think about using either a selection model or
some sort of technique like multiple imputation. Here, pay attention to whether
the data are missing on X or Y. If the data are missing on X, think about finding a
substitute for the X (assuming there is one in particular) that is causing the problem.
For example, use education rather than income to measure SES. Also consider whether
the data are OAR or MAR. If you have good reason to believe the data are MAR, then
just listwise delete but explain why you're doing so. If the data are not MAR, then
you should do something beyond listwise deletion, if possible. If nothing else, try to
determine whether the missingness leads to conservative or liberal results. If they
should be conservative, then make this argument in the text.
If the data are missing on Y, use either listwise deletion or Heckman selection. If you
use listwise deletion, you need to be able to justify it. This becomes more difficult as
the missing percentage increases. Use Heckman selection if you are pretty sure the data are not
MAR. Otherwise, a reviewer is likely to question your results. At the least, use Heckman
selection and report the results in a footnote indicating there was no difference between
a listwise deletion approach and a Heckman approach (assuming there isn't!). Heckman
selection is sensitive to the choice of variables in the selection equation and structural equation,
so try different combinations. Don't have too much overlap.
4. If you have more than 10% missing, you must use a selection model or some
sophisticated technique for handling the missing data, unless the data are clearly MAR. If
they are, then either listwise delete or use some smart method of imputation (e.g.,
multiple imputation, hotdecking, EM, etc.).
5. If you have more than 20% missing, find other data, drop problematic variables, get
help, or give up.
6. If you have a sample selection issue (on either the independent or dependent variables),
use Heckman selection. For example, if you are looking at happiness with marriage,
this item is only applicable to married persons. However, if you don't compensate for
differential propensities to marry (and remain so), your parameters will be biased. As
proof of the point, our divorce rates are higher now than they were in the past, yet
marital happiness is greater now than ever before.
6 Recommended Reading
Allison. (2002). Missing Data. Sage series monograph.
Little and Rubin. (1987). Statistical Analysis with Missing Data.
(copyright by Scott M. Lynch, April 2003)
1 Generalizations of the Regression Model (Soc 504)
As I said at the beginning of the semester, beyond the direct applicability of OLS regression
to many research topics, one of the reasons that a full-semester course on the linear model
is warranted is that the linear model lays a foundation for understanding most other models
used in sociology today. In these last sets of notes, I cover three basic generalizations of linear
regression modeling that, taken as a whole, probably account for over 90% of the methods
used in published research over the last few years. Specifically, we will discuss 1) generalized
linear models, 2) multivariate models, and 3) time series and longitudinal methods. I will
include discussions of fixed/random effects models in this process.
2 Generalized Linear Models
In sociological data, having a continuous outcome variable is rare. More often, we tend
to have dichotomous, ordinal, or nominal-level outcomes, or we have count data. In these
cases, the standard linear model that we have been discussing all semester is inappropriate
for several reasons. First, heteroscedasticity (and nonnormal errors) are guaranteed when
the outcome is not continuous. Second, the linear model will often predict values that are
impossible. For example, if the outcome is dichotomous, the linear model will predict scores
that are less than 0 or greater than 1. Third, the functional form specified by the linear
model will often be incorrect. For example, we should doubt that increases in a covariate will
yield the same returns on the dependent variable at the extremes as would be obtained
toward the middle.
2.1 Basic Setup of GLMs
Generalized linear models provide a way to handle these problems. The basic OLS model
can be expressed as:

Y ~ N(Xβ, σ²_e)
Y = Xβ + e

Generalized linear models can be expressed as:

F(μ) = Xβ
E(Y) = μ

That is, some function of the expected value of Y is equal to the linear predictor with which
we are already familiar. The function F that relates μ to Xβ is called the link function.
The choice of link function determines the name of the model we are using.
The most common GLMs used in sociology have the following link functions:
Link Function F(μ)        Model
μ                         Linear Regression
ln(μ / (1 - μ))           Logistic Regression
Φ^(-1)(μ)                 Probit Regression
ln(μ)                     Poisson Regression
ln(-ln(1 - μ))            Complementary Log-Log Regression
An alternate way of expressing this is in terms of probabilities. The logit and probit
models are used to predict probabilities of observing a 1 on the outcome. Thus, we could
write the model as:

p(y_i = 1) = F(X_i β).

In this notation, F is the link function. I will illustrate this with the probit regression model.
If our outcome variable is dichotomous, then the appropriate likelihood function for the
data is the binomial distribution:

L(p | y) = p(y | p) = Π_{i=1}^{n} p^{y_i} (1 - p)^{1 - y_i}
Our observed data constitute the y_i: if a person is a 1 on the dependent variable, then the
second term in the likelihood drops out (for that individual); if a person is a 0, then the
first term drops. We would like to link p to X, but as discussed at the beginning, this is
problematic because an identity link (i.e., p = Xβ) will predict illegitimate values for p. A
class of functions that can map the predictor from the entire real line onto the interval [0, 1] is
cumulative distribution functions. So, for example, in the probit case, we allow p = Φ(Xβ),
where Φ is the cumulative normal distribution function (i.e., ∫_{-∞}^{Xβ} N(0, 1)). Regardless of the
value of Xβ, p will fall in the acceptable range. To obtain a logistic regression model, one
would simply need to set p = e^{Xβ} / (1 + e^{Xβ}) (the cumulative logistic distribution function).
The approach discussed immediately above may seem different from what was presented
in the table; however, the only difference is in how the link function is expressed, whether
as a function of the expected value of Y, or in terms of the linear predictor. These are
equivalent (just inverses of one another). For example, the logistic regression model could
be written as:

ln( μ / (1 - μ) ) = Xβ,

where μ = p. Another way to think about GLMs is in terms of latent distributions. We
could express the probit model as:

Y* = Xβ + e,

using the link:

Y = 1 iff Y* > 0
Y = 0 iff Y* ≤ 0
Here, Y* is a latent (unobserved) propensity. However, due to crude measurement, we
only observe a dichotomous response. If the individual's latent propensity is strong enough,
it pushes him/her over the threshold (0), and we observe a 1. Otherwise,
we observe a 0.
From this perspective, we need to rearrange the model somewhat to allow estimation.
We can note that if Y* = Xβ + e, then the expressions in the link equation above can be
rewritten such that: if Y = 1 then e > -Xβ; if Y = 0 then e ≤ -Xβ. If we assume a
distribution for e (say, normal), then we can say that:

p(Y = 1) = P(e > -Xβ) = P(e < Xβ) = ∫_{-∞}^{Xβ} N(0, 1).

Observe that this is the same expression we placed into the likelihood function above. If
we assume a logistic distribution for the error, then we obtain the logistic regression model
discussed above.
I will use this approach to motivate the generalization of the dichotomous probit model
we've been discussing to the ordinal probit model. If our outcome variable is ordinal, rather
than dichotomous, OLS is still inappropriate. If we assume once again that a latent variable
Y* underlies our observed ordinal measure, then we can expand the link equation above:

Y = 1 iff -∞ = τ_0 ≤ Y* < τ_1
Y = 2 iff τ_1 ≤ Y* < τ_2
. . .
Y = k iff τ_{k-1} ≤ Y* < τ_k = ∞

Just as before, this link, given a specification for the error term, implies an integral over the
error distribution, but now the integral is bounded by the thresholds:

p(Y = j) = P(τ_{j-1} - Xβ ≤ e < τ_j - Xβ) = ∫_{τ_{j-1} - Xβ}^{τ_j - Xβ} N(0, 1).
2.2 Interpreting GLMs
GLMs are not as easy to interpret as the standard linear regression model. Because the link
function is nonlinear, the model is now nonlinear, even though the predictor is linear. This
complicates interpretation, because the effects of variables are no longer independent of the
effects of other variables. That is, the effect of X_j depends on the value of X_k. The probit
model is linear in Z (standard normal) units. That is, given that Xβ implies an increase in
the upper limit of the integral of the standard normal distribution, each β can be viewed in
terms of its effect on the Z score for the individual.
The logit model is linear in log-odds units. Recall that odds are computed as the ratio
p / (1 - p). The logistic link function, then, is a log-odds function. The coefficients from the model
can be interpreted in terms of their linear effect on the log-odds, but this is not of much
help. Instead, if we exponentiate the model, we obtain:

exp( ln( p / (1 - p) ) ) = exp(Xβ) = e^{β_0} e^{β_1 X_1} . . . e^{β_j X_j}
This says that the odds are equal to the product of the exponentiated coefficients. Suppose
we had an exponentiated coefficient for gender (male) of 2. This would imply that the
odds are twice as great for men as for women, net of the other variables in the model.
The interpretation is slightly more complicated for a continuous variable, but can be stated
as: the odds are multiplied by exp(β_j) for each unit increase in X_j. Be careful with this
interpretation: saying the odds are multiplied by 2 does NOT mean that men are twice as
likely to die as women. The word "likely" implies a ratio of probabilities, not odds.
The logistic regression model has become quite popular because of the odds-ratio interpretation.
However, the unfortunate aspect of it is that this interpretation tells us nothing
about the absolute risk (probability) of obtaining a 1 response. In order to make this interpretation,
we must compute the probabilities predicted by the model. It is in this process
that we can see how the effect of one variable depends on the values of the other variables in
the model. Below are a logistic regression and a probit regression of death on baseline age,
gender (male), race (nonwhite), and education.
Variable      Logistic Reg. Parameter   Exp(β)   Probit Reg. Parameter
Intercept     -6.2107                             -3.4661
Age             .1091                   1.115       .0614
Male            .7694                   2.158       .4401
Nonwhite        .5705                   1.769       .3341
Education      -.0809                    .922      -.0487
The results (for either model) indicate that age, being male, and being nonwhite increase
one's probability of death, while education reduces it. Although the coefficients differ
between the two models, this is simply a function of the difference in the variances of the
logistic and probit distributions. The variance of the probit distribution is 1 (N(0, 1)); the
variance of the logistic distribution is π²/3. The ratio of the corresponding standard deviations
(σ_L / σ_P) is about 1.81, and this is also approximately the ratio of the coefficients; there is
some slight deviation that is attributable to the slight differences in the shape of the distribution
functions (the probit is steeper than the logit in the middle).
If we wanted to determine the difference in probability of mortality for a person with a
high school diploma versus a college degree, we would need to fix the other covariates at
some value, compute Xβ, and perform the appropriate transformation to invert the link
function. Below are the predicted probabilities for 50-year-olds with different gender, race,
and education profiles.
Profile (Sex, Race, Education)             Probit   Logit
Male      White      12 yrs.                .29      .28
                     16 yrs.                .23      .22
          Nonwhite   12 yrs.                .42      .40
                     16 yrs.                .34      .33
Female    White      12 yrs.                .16      .15
                     16 yrs.                .12      .11
          Nonwhite   12 yrs.                .26      .24
                     16 yrs.                .20      .19
Notice that the estimated probabilities are very similar between the two models. For most
data, the models can be used interchangeably. Notice also that the change in probability
from altering one characteristic depends on the values of the other variables. For example,
the difference in probability of death between 12 and 16 years of education is .06 for white
males, .08 for nonwhite males, .04 for white females, and .06 for nonwhite females (all based
on the probit model results). The odds ratio, however, does not vary. For example, take the
odds ratio for white males with 12 versus 16 years of education (OR = 1.38) and the odds
ratio for nonwhite males with 12 versus 16 years of education (OR = 1.35). The difference
is only due to rounding of the probabilities.
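As a rough sketch of how these predicted probabilities can be obtained, the following C fragment computes Xβ for one profile (a 50-year-old white male with 12 years of education) from the coefficients in the table above and then inverts the probit and logit links; Φ is computed with erf() from the C99 math library.

#include<stdio.h>
#include<math.h>

/* standard normal CDF via erf() (C99) */
double Phi(double z) { return 0.5 * (1.0 + erf(z / sqrt(2.0))); }

/* inverse logit */
double inv_logit(double z) { return 1.0 / (1.0 + exp(-z)); }

int main(void)
{
  /* coefficients from the table above: intercept, age, male, nonwhite, education */
  double logit_b[5]  = {-6.2107, .1091, .7694, .5705, -.0809};
  double probit_b[5] = {-3.4661, .0614, .4401, .3341, -.0487};

  /* profile: 50-year-old white male with 12 years of education */
  double x[5] = {1, 50, 1, 0, 12};
  double xb_l = 0, xb_p = 0;
  int j;

  for (j = 0; j < 5; j++) {
    xb_l += logit_b[j] * x[j];
    xb_p += probit_b[j] * x[j];
  }
  printf("logit p  = %.2f\n", inv_logit(xb_l)); /* about .28 */
  printf("probit p = %.2f\n", Phi(xb_p));       /* about .29 */
  return 0;
}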
3 Multivariate Models
For the entire semester, we have discussed univariate models, that is, models with a single
outcome. Often, however, we may be interested in estimating models that have multiple
dependent variables. Let's take a very simple model first.
Modeled:

y_1 = Xβ + e_1
y_2 = Zγ + e_2
e_1 ~ N(0, σ²_e1)
e_2 ~ N(0, σ²_e2)

Not modeled, but true:

[ e_1 ]      ( [ 0 ]   [ σ²_e1    σ_e1e2 ] )
[ e_2 ]  ~ N ( [ 0 ] , [ σ_e1e2   σ²_e2  ] )
This model says that outcomes y_1 and y_2 are functions of several covariates (X and Z) plus
some error. The third and fourth components indicate that e_1 and e_2 are assumed to be
uncorrelated. However, in fact, the last expression indicates that the errors across equations
are correlated (as long as σ_e1e2 is nonzero). This model is sometimes called the seemingly
unrelated regression model. The regressions seem as though they could be estimated independently,
but if e_1 and e_2 are correlated, then it implies that there are variables (namely
y_2 and possibly some Z) that are omitted from the model for y_1 (and vice versa for the
model for y_2). Omitting relevant variables leads to omitted variable bias, which means that
our estimates for coefficients are incorrect. We could rewrite the model, either specifically
incorporating the error covariance portion of the model, or specifying a joint distribution for
y:
[ y_1 ]      ( [ Xβ ]   [ σ²_e1    σ_e1e2 ] )
[ y_2 ]  ~ N ( [ Zγ ] , [ σ_e1e2   σ²_e2  ] )
This model is the same as the model above, just expressed explicitly as a multivariate model.
3.1 Multivariate Regression
So far, we have dealt with a number of univariate distributions, especially the univariate
normal distribution:
f(Y) = (1 / (σ√(2π))) exp( -(Y - μ)² / (2σ²) )
The multivariate normal distribution is simply an extension of this:
f(Y) = (2π)^(-d/2) |Σ|^(-1/2) exp( -(1/2) [Y - μ]^T Σ^(-1) [Y - μ] )
Here, μ is a vector of means, and Σ is the covariance matrix of Y. If the Σ matrix is diagonal,
then the distribution could be rewritten simply as a set of univariate normal distributions; when
Σ has off-diagonal elements, it indicates that there is some (linear) relationship between
variables.
Just as with linear regression, we can assume the errors are (multivariately) normally
distributed, allowing us to place (Y - Xβ) into the numerator of the kernel of the density for
maximum likelihood estimation of β. In the multivariate case, each element of the [Y - μ]
vector is replaced with [Y - Xβ(j)], where I'm using (j) to index the set of parameters in
each dimension of the model (i.e., X does not have to have the same effect on all outcomes).
We can assume the X matrix is the same across equations in the model, and if an X does
not influence one of the outcomes, then its parameter is constrained to be 0.
3.2 Path Analysis
Some multivariate models can be called path analysis models. The requirements for path
analysis include that the variables must all be continuous, and the model must be recursive,
that is, following the paths through the variables, one cannot revisit a variable. Below is an
example of a path-analytic graph.
Figure: Path Diagram. (Variables: Education, Income, Health, Depression; paths labeled A through F.)
This path model says that depression is affected by physical health, education, and
income; health is influenced by education and income; and income is influenced by education.
If we estimated the model Depression = b_0 + b_1 Education, we would find the total effect of
education on depression. However, it is unlikely that education's effect is only direct. It is
more reasonable that income and physical health also affect depression, and that education
has direct effects (and possibly indirect effects) on income and health. Thus, the simple
model would produce a biased effect of education if income and health were ignored. If we
estimate the path model above, the coefficient for the direct effect of education (C) would
most likely be reduced.
The path model above can be estimated using a series of univariate regression models:

Depression = β_0 + β_1 education + β_2 health + β_3 income
Health = γ_0 + γ_1 education + γ_2 income
Income = δ_0 + δ_1 education

The lettered paths in the diagram can be replaced as: A = γ_1, B = δ_1, C = β_1, D = γ_2,
E = β_2, F = β_3.
Now the direct effect of education on depression is no longer equal to the total effect.
Rather, the direct effect is simply C, while the total effect is:

Total = Direct + Indirect
      = C + (A × E) + (B × F) + (B × D × E)
As these expressions indicate, the indirect effects are simply the products of the path
coefficients along the routes that lead indirectly from education to depression.
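As a small numerical illustration (the path values below are hypothetical, not estimates from the text), the decomposition can be computed directly:

#include<stdio.h>

int main(void)
{
  /* hypothetical standardized path coefficients (not from the text) */
  double A = 0.30;   /* education -> health              */
  double B = 0.40;   /* education -> income              */
  double C = -0.20;  /* education -> depression (direct) */
  double D = 0.25;   /* income -> health                 */
  double E = -0.35;  /* health -> depression             */
  double F = -0.15;  /* income -> depression             */

  double indirect = A * E + B * F + B * D * E;
  double total = C + indirect;

  printf("direct = %.3f, indirect = %.3f, total = %.3f\n", C, indirect, total);
  return 0;
}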
3.3 Structural Equation Models
When nonrecursive models (i.e., models with reciprocal effects) are needed, variables are not
measured on a continuous scale, and/or measurement error is to be considered, we can generalize
the path analysis model above. Structural equation models provide such a generalization.
These models are often also called LISREL models (after the first software package to estimate
them) and covariance structure models (because estimation is based on the covariance
matrix of the data). Using LISREL notation, these models consist of 3 basic equations and
4 covariance matrices:
η = Bη + Γξ + ζ
y = Λ_y η + ε
x = Λ_x ξ + δ

Here η = (η_1, . . . , η_j)^T is the vector of endogenous latent variables, ξ = (ξ_1, . . . , ξ_k)^T the
vector of exogenous latent variables, and y = (y_1, . . . , y_p)^T and x = (x_1, . . . , x_q)^T the observed
indicators. B is a j-by-j coefficient matrix with zeros on its diagonal, Γ is a j-by-k coefficient
matrix, Λ_y is a p-by-j matrix of loadings, and Λ_x is a q-by-k matrix of loadings. The four
covariance matrices are:

Φ = k-by-k covariance matrix of the ξ
Ψ = j-by-j covariance matrix of the ζ
Θ_δ = q-by-q covariance matrix of the δ
Θ_ε = p-by-p covariance matrix of the ε
In this model, latent (unobserved) variables are represented by the Greek symbols ξ and
η. The distinction between ξ and η is whether the variable is exogenous (ξ) or endogenous
(η) in the model, where endogenous means the variable is influenced by other variables
in the model. The coefficients that relate the η to each other are the β, while the coefficients
that relate the ξ to the η are the γ. The first equation is the structural equation that relates
the latent variables, with an error term ζ for each η. The second and third equations
are measurement equations that show how the observed y and x are related to the latent
variables η and ξ, respectively (via the λ coefficients). In these equations, ε and δ represent
measurement errors; that is, the part of each observed variable that is unaccounted for by
the latent variable(s) which influence it.
The Φ matrix models the covariances of the exogenous latent variables, while the Ψ
matrix models the covariances of the structural equation errors (allowing cross-equation
error correlation to exist, which is something univariate regressions do not allow). The two
Θ matrices allow correlation between errors in the measurement equations to exist (again,
something that univariate regression cannot handle).
This model is very general. If there is only one outcome variable (η), and all variables
are assumed to be measured without error, then the model reduces to OLS regression. If we
are uninterested in structural relations, but are only interested in the measurement portion
of the model (and possibly in estimating simple correlations between latent variables), then
we have a (confirmatory) factor analysis.
The model is estimated by recognizing that (1) the parameters are functions of the
covariances (or correlations) of the variables and (2) a multivariate normal likelihood can
be written in terms of these covariances. When some of the data are not continuous, but
rather are ordinal, we can estimate something called polychoric and polyserial correlations
between the observed variables, and the resulting correlation matrix can be used as the basis
for estimation. The resulting model could then be called a multivariate generalized linear
model.
Below is a graphic depiction of a relatively simple structural equation model. The equations for this SEM would be:

η_1 = γ_11 ξ_1 + ζ_1

\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} \lambda^y_1 \\ \lambda^y_2 \\ \lambda^y_3 \end{bmatrix} \eta_1 + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \end{bmatrix}

\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} \lambda^x_1 \\ \lambda^x_2 \\ \lambda^x_3 \end{bmatrix} \xi_1 + \begin{bmatrix} \delta_1 \\ \delta_2 \\ \delta_3 \end{bmatrix}

Φ = φ_11,  Ψ = ψ_11

Θ_δ = \begin{bmatrix} \theta^\delta_{11} & 0 & 0 \\ 0 & \theta^\delta_{22} & 0 \\ 0 & 0 & \theta^\delta_{33} \end{bmatrix},
Θ_ε = \begin{bmatrix} \theta^\epsilon_{11} & 0 & 0 \\ 0 & \theta^\epsilon_{22} & \theta^\epsilon_{23} \\ 0 & \theta^\epsilon_{32} & \theta^\epsilon_{33} \end{bmatrix}
Notice how most of the off-diagonal elements of the various covariance matrices are 0; this
is because we have only specified one error correlation (between ε_2 and ε_3). The top and
bottom portions of the figure constitute confirmatory factor analyses: the idea is that the
observed x and y variables reflect underlying and imperfectly measured constructs (factors).
We are really interested in examining the relationship between these constructs, but there is
measurement error in our measures for them. Thus, with this model, γ_11 is our estimate of
the relationship between the latent variables independent of any measurement error existing
in our measures.

Figure: SEM diagram. The exogenous latent variable ξ_1 is measured by x_1, x_2, and x_3 (loadings λ^x, measurement errors δ_1 to δ_3); the endogenous latent variable η_1 is measured by y_1, y_2, and y_3 (loadings λ^y, measurement errors ε_1 to ε_3); ξ_1 affects η_1 through γ_11; and the errors ε_2 and ε_3 are allowed to covary (θ^ε_23).
3.4 Final note about multivariate models
Something you must remember with multivariate (and latent variable) models that isn't
typically an issue in simple univariate models is the notion of identification, or identifiability.
We cannot estimate more parameters than we have pieces of information. In SEMs, the
pieces of information are the covariances and variances of the observed variables; there are
(p + q)(p + q + 1)/2 of them. We need to be sure that we are not attempting to estimate too many parameters.
If we are, then our model is under-identified, which means, loosely, that there is no unique
solution set for the parameters. The models we have dealt with to date have been just-identified,
meaning that there are exactly as many pieces of information as parameters.
With SEMs, we often have over-identified models, which means we have more than enough
information to estimate the model. This comes in handy when we would like to compare
models.
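As an illustration of the counting rule (a minimal sketch; all parameter values are hypothetical), the code below computes the number of available pieces of information for the 3-indicator, 3-indicator example above and the model-implied covariance matrix that maximum likelihood estimation works with:

```python
import numpy as np

p, q = 3, 3  # number of y- and x-indicators
n_info = (p + q) * (p + q + 1) // 2
print(n_info)  # 21 variances/covariances available to estimate parameters

# Hypothetical parameter values for the one-factor / one-factor example
lam_y = np.array([[1.0], [0.8], [0.9]])    # Lambda_y (3 x 1)
lam_x = np.array([[1.0], [0.7], [0.6]])    # Lambda_x (3 x 1)
gamma = np.array([[0.5]])                   # gamma_11
phi = np.array([[1.0]])                     # Phi (variance of xi_1)
psi = np.array([[0.4]])                     # Psi (variance of zeta_1)
theta_eps = np.diag([0.3, 0.3, 0.3])
theta_eps[1, 2] = theta_eps[2, 1] = 0.1     # the one allowed error correlation
theta_del = np.diag([0.2, 0.2, 0.2])

# Model-implied covariance matrix of (y, x), the quantity ML estimation fits
cov_eta = gamma @ phi @ gamma.T + psi
cov_yy = lam_y @ cov_eta @ lam_y.T + theta_eps
cov_xx = lam_x @ phi @ lam_x.T + theta_del
cov_yx = lam_y @ gamma @ phi @ lam_x.T
sigma = np.block([[cov_yy, cov_yx], [cov_yx.T, cov_xx]])
print(sigma.shape)  # (6, 6)
```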
4 Time Series and Longitudinal Methods
Analysis of time series and panel data is a very large area of methodology. Indeed, you can
take entire courses on basic time series, event history models, or other methods of analysis
for longitudinal data. In this brief introduction to these classes of models, I am exchanging
depth for breadth in an attempt to point you to particular types of models that you may
need to investigate further in doing empirical work.
I'll start with some basic terminology that's relevant to longitudinal data. First, the
term longitudinal data is somewhat vague. Generally, the term implies that one has panel
data, that is, data collected on multiple units across multiple points in time (like the PSID).
However, it is often also used to refer to repeated cross-sectional data, that is, data collected
on multiple different units at multiple points in time (like the GSS). For the purposes of our
discussion here, we will limit our usage to the first definition.
Time Series typically refers to repeated observations on a single observational unit
across time.
Multiple Time Series means multiple observational units observed across time. This
can also be considered a panel study, although the usage often differs depending on
the unit of analysis. Generally, micro data is considered panel data, while macro data
is considered time series data.
Multivariate Time Series means that there are multiple outcomes measured over time.
A Trend is a general pattern in time series data over time. For example, U.S. life
expectancy shows an increasing trend over the last 100 years.
Seasonality is a repeated pattern in time series data. For example, the sale of Christmas
trees evidences considerable seasonality: sales are low except in November and
December. No trend is necessarily implied, but the pattern repeats itself annually.
Stationarity. A stationary time series evidences no trending. A stationary series may
have seasonality, however. Stationarity is important for time series models, for reasons
that will be discussed shortly.
4.1 Problems that Time Series/Longitudinal Data Present
There are really very few differences between the approaches that are used to analyze time
series or panel data and the basic linear regression model with which you are already familiar.
However, there are four basic problems that such data present that necessitate the expansion
of the OLS model or the development of alternative models.
1. Error correlation within units across time. This problem requires alternative estimation
of the linear model (e.g., GLS estimation), or the development of a different model (e.g.,
fixed/random effects models).

2. Spuriousness due to trending. Two aggregate time series with similar trends may
appear to be causally related when, in fact, they aren't. For example, population size
(which has an increasing trend) may appear to be related to increases in life expectancy
(also with an increasing trend). However, this is probably not a causal relationship,
because, in fact, countries with the fastest growth in population don't necessarily have
the fastest increases (if any at all) in life expectancy. This may be an ecological fallacy,
or it may simply be that two similarly-trending time series aren't related.

3. Few units. Sometimes, time series data are relatively sparse. When dealing with a
single time series, for example, we often have relatively few measurements. This makes
it difficult to observe a trend and to include covariates/explanatory variables in models.

4. Censoring. Although our power to examine relationships is enhanced with time series
and panel data, such data present their own problems in terms of missing data.
What do we do with people who die or can't be traced for subsequent waves of study?
What do we do with people who are missing at some waves but not others? What do
we do when a person experiences an event that takes him/her out of the risk set? Etc.
4.2 Time Series Methods
Problems (1) and (2) above can often be resolved by placing some structure on the error
term. There are many ways to do this, and doing so forms the basis for a smorgasbord of
approaches to analyzing time series data.
The most basic models for time series are ones in which trends and seasonality are
modeled by including time as a variable (or a collection of variables, e.g., time dummies) in
a model. For example, if we wanted to examine a trend in birth rates across time, and we
observed birth rates on a monthly basis across, say, a 20-year span, we could first model
birth rates as a function of time (years), examine the residuals, and then include dummy
variables (or some other type of variables) to capture seasonality. The key problem with this
approach is that we must make sure that autocorrelation of the errors does not remain after
detrending and deseasoning.
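A minimal sketch of that strategy (with simulated monthly data; all variable names and values are made up): regress the series on a linear time term plus month dummies, then inspect the residuals for leftover autocorrelation.

```python
import numpy as np

rng = np.random.default_rng(1)
n_years, months = 20, 12
t = np.arange(n_years * months)                    # time index in months
month = t % months

# Simulated birth-rate series: linear trend + seasonal pattern + noise
y = 60 - 0.02 * t + 2 * np.sin(2 * np.pi * month / 12) + rng.normal(0, 0.5, t.size)

# Design matrix: intercept, linear trend, and 11 month dummies (January omitted)
month_dummies = (month[:, None] == np.arange(1, 12)).astype(float)
X = np.column_stack([np.ones(t.size), t, month_dummies])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Lag-1 residual autocorrelation: should be near 0 if detrending/deseasoning sufficed
r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
print(round(r1, 3))
```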
There are two basic domains of more complicated time series analysis: the time domain
and the frequency domain. In frequency domain models, measures are seen as being a
mixture of sine and cosine curves at different frequencies:

y_i ∼ N( X_i β + Σ_{j=1}^{J} [ a_j sin(ω_j t_i) + b_j cos(ω_j t_i) ], σ² )
These models are often called spectral models. I will not discuss these models, because (a)
they are not particularly common in sociology and demography and (b) these methods are
ultimately related to time domain models (i.e., they are simply an alternate parameterization
of time domain models).
More common in sociology are time domain models. Time domain models model a
time-t variable as a function of the outcome at time t − 1. In economics, these are called
dynamic models. A general class of time domain time series models is the ARMA
(AutoRegressive Moving Average) family, which can be represented as:
y_t = X_t β + φ_1 y_{t−1} + φ_2 y_{t−2} + ... + φ_p y_{t−p} + e_t + θ_1 e_{t−1} + θ_2 e_{t−2} + ... + θ_q e_{t−q}

In this equation, the autoregressive terms are the lagged y-values, while the moving average
terms are the lagged errors. e_t is assumed to be N(0, σ² I) under this model. As specified,
this model would be called an ARMA(p,q) model (although, technically, the classic ARMA
model does not contain regressors). Very often, we do not need more than one AR term or one
MA term to achieve stationarity of the error. (As a side note, a time series process in which
only y_{t−1} is needed is called a Markov process.) The model requires that φ be less than 1 in
absolute value, or the model is considered explosive (this can be observed by repeated substitution).
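A minimal simulation sketch (made-up parameter values) showing why |φ| < 1 matters for an AR(1) series: with φ = 0.7 the series wanders around a stable mean, while φ = 1.05 is explosive.

```python
import numpy as np

def simulate_ar1(phi, n=500, sigma=1.0, seed=2):
    """Simulate y_t = phi * y_{t-1} + e_t with normal shocks."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n)
    e = rng.normal(0, sigma, n)
    for t in range(1, n):
        y[t] = phi * y[t - 1] + e[t]
    return y

stationary = simulate_ar1(0.7)
explosive = simulate_ar1(1.05)

print(np.abs(stationary).max())  # stays bounded in a modest range
print(np.abs(explosive).max())   # grows without bound as t increases
```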
To put the ARMA(p,q) model into perspective, we can view this model as placing
structure on the error term in an OLS model. Recall from previous discussions that an
assumption of the OLS regression model is that e ∼ N(0, σ²_e I). In other words, the errors
for observations i and j are uncorrelated for i ≠ j. This assumption is violated by
time series data, and so a natural extension of the OLS model is to use GLS estimation.
GLS estimation involves estimating the parameters of the model with a modified estimator:

β_GLS = (X^T Ω^(−1) X)^(−1) (X^T Ω^(−1) Y).

Ω, however, is an n × n matrix, and all elements of this matrix cannot be estimated
without some simplifying constraints. One such constraint that can be imposed is that error
correlations only exist between errors for adjacent time periods, and that all adjacent time
periods have equal error correlation. In that case, we only need to estimate one additional
parameter, say σ_{i,i+1}, ∀ i, so our Ω matrix appears as:

\Omega =
\begin{bmatrix}
\sigma_{11} & \sigma_{12} & 0 & \dots & 0 \\
\sigma_{21} & \sigma_{22} & \sigma_{23} & \ddots & \vdots \\
0 & \sigma_{32} & \sigma_{33} & \sigma_{34} & 0 \\
\vdots & \ddots & \sigma_{43} & \sigma_{44} & \ddots \\
0 & \dots & 0 & \ddots & \ddots
\end{bmatrix},

where σ_11 = σ_22 = ... = σ_TT, and σ_12 = σ_21 = σ_23 = σ_32 = ... .
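A minimal numpy sketch of the GLS estimator under a known (here, made-up) banded Ω, compared with the OLS estimator:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
beta_true = np.array([1.0, 0.05])

# Hypothetical banded error covariance: equal variance, correlation only at lag 1
sigma2, rho = 1.0, 0.4
Omega = sigma2 * (np.eye(n) + rho * (np.eye(n, k=1) + np.eye(n, k=-1)))

# Draw errors with this covariance and build y
e = rng.multivariate_normal(np.zeros(n), Omega)
y = X @ beta_true + e

Omega_inv = np.linalg.inv(Omega)
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print(beta_gls, beta_ols)  # both near beta_true; GLS exploits the error structure
```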
An AR(1) model simplifies the error covariance matrix for GLS estimation by decomposing
the original OLS error term into two components: e_t = ρ e_{t−1} + v_t. Here the error at one
time point is simply a function of the error at the previous time point plus a random shock
at time t (v_t). Higher order AR models are obtained by allowing the error to be a function of
errors at lags > 1. Typically, autoregressive models are estimated by incorporating a lagged
y variable into the model. So long as the absolute value of the coefficient for the lagged
term(s) does not exceed 1, the series can be considered stationary.
An MA(1) model specifies structure on the random shocks: e_t = θ v_{t−1} + v_t. As with the
AR models, higher order MA models can be obtained by adding additional lagged terms.
Moving average models are more difficult to estimate than autoregressive models, however,
because the error term depends on the coefficients in the model, which, in turn, depend on
the error.
How do we determine what type of ARMA model we need? Typically, before we model
the data, we first construct an autocorrelation plot (also sometimes called a correlogram),
which is a plot of the autocorrelation of the data (or errors, if a model was previously
specified) at various lags. The function is computed as:

AC_L = [ m Σ_t (e_t − ē)(e_{t−L} − ē) ] / [ (m − L) Σ_t (e_t − ē)² ]

where m is the number of time series data points and L is the number of lags. The shape of
this function across L tells us what type of model we may need, as we will discuss below.
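A minimal sketch computing this autocorrelation function at several lags for a simulated AR(1) series (everything here is illustrative):

```python
import numpy as np

def acf(series, max_lag=10):
    """Sample autocorrelation AC_L for L = 1..max_lag."""
    e = np.asarray(series, dtype=float)
    m = e.size
    e_centered = e - e.mean()
    denom = np.sum(e_centered ** 2)
    out = []
    for L in range(1, max_lag + 1):
        num = np.sum(e_centered[L:] * e_centered[:-L])
        out.append((m / (m - L)) * num / denom)
    return np.array(out)

# Illustrative AR(1) series with phi = 0.7
rng = np.random.default_rng(4)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.7 * y[t - 1] + rng.normal()

print(np.round(acf(y, 5), 2))  # decays roughly geometrically (about 0.7, 0.49, ...)
```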
When autocorrelation between errors cannot be removed with an ARMA(p,q) model,
the next step may be to employ an ARIMA (AutoRegressive Integrated Moving Average)
model. The most basic ARIMA model is one in which there are no autoregressive terms
and no moving average terms, but the data are differenced once. That is, we simply take
the difference in all variables from time t − 1 to t, so that y_diff = y_t − y_{t−1}, ∀ t. We do
the same for all the covariates. We then regress (using OLS) the differences in y on the
differences in x. This model therefore ultimately relates change in x to change in y (this is
why the term integrated is used: from a calculus perspective, relating change to change is
matching the first derivatives of x and y; thus, the original variable is integrated relative
to the differences).
Sometimes, a first differences approach is not sufficient to remove autocorrelation. In
those cases, we may need to add autoregressive or moving average components, or we may
even need to take second or higher order differences.
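A minimal sketch of the first-difference (integrated) regression: difference y and x once, regress Δy on Δx with OLS, and recover the change-on-change relationship (simulated data, made-up coefficients):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
x = np.cumsum(rng.normal(0, 1, n))                      # trending covariate (random walk)
y = 2.0 + 0.5 * x + np.cumsum(rng.normal(0, 0.2, n))    # trending outcome

# First differences: relate change in x to change in y
dy = np.diff(y)
dx = np.diff(x)

X = np.column_stack([np.ones(dx.size), dx])
beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
print(beta)  # slope close to 0.5, the change-on-change effect
```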
5 Methods for Panel Data
The time series methods just discussed are commonly used in economics, but are used some-
what less often in sociology. Over the last twenty years, panel data have become quite com-
mon, and sociologists have begun to use methods appropriate for multiple time series/panel
data. In many cases, the methods that I will discuss here are very tightly related to the
time series methods discussed above; however, they are generally presented dierently than
the econometric approach to time series. In this section, I will discuss two general types of
panel methods: hazard/event history models and fixed/random effects hierarchical models
(including growth models). The key feature that distinguishes these models is whether the
outcome variable is the occurrence of an event or simply a repeated measure on individuals
over time.
5.1 Hazard and Event History Models
Hazard models are another class of models that go by several names. One is event history
models. In demography, life table methods accomplish the same goals. Other names
include discrete time hazard models and continuous time hazard models. Related approaches
are called survival models and failure time models. These related approaches
are so-called because they model the survival probabilities (S(t)) and the distribution of
times-to-event (f(t)), whereas hazard models model the hazard of an event. Hazard models
are distinct from time series models, because time series models generally model an outcome
variable that takes different values across time, but hazard models model the occurrence of
a discrete event. If the time units in which the respondents are measured for the event are
discrete, we can use discrete time event history methods; if the time units are continuous
(or very nearly so), we can use continuous time event history methods.
A hazard is a probability of experiencing an event within a specified period of time,
conditional on having survived to be at risk during the time interval. Mathematically, the
hazard can be represented as:

h(t) = lim_{Δt → 0} p( E(t, t + Δt) | S(t) ) / Δt

Here, p(E(t, t + Δt)) represents the probability of experiencing the event between time
t and t + Δt, S(t) indicates that the individual survived to time t, and Δt is an infinitesimally small
period of time. The hazard, unlike a true probability, is not bounded between 0 and 1, but
rather has no upper bound.
If we examine the hazard a little further, we find that the hazard can be viewed as

h(t) = (# who experience the event during the interval) / (Δt · # exposed in the interval)

The numerator represents the number of persons experiencing the event, while the denominator
is a measure of person-time-units of exposure. Given that individuals can experience
the event at any time during the interval, each individual who experiences the event can
only contribute as many time units of exposure as s/he existed in the interval. For example,
if the time interval is one year, and an individual experiences the event in the middle
of the interval, his/her contribution to the numerator is 1, while his/her contribution to the
denominator is .5.
In a hazard model, the outcome to be modeled is the hazard. If the time intervals are
sufficiently small (e.g., minutes, seconds, etc.), then we may use a continuous time hazard
model; if the intervals are sufficiently large, then we may use a discrete time hazard model.
There is no clear break point at which one should prefer a discrete time model to a continuous
time model. As time intervals become smaller, discrete time methods converge on continuous
time methods.
In this discussion, I will focus on hazard models rather than survival or failure time
models, although all three functions are related. For example, the hazard h(t) is equal to:

h(t) = f(t) / S(t),  with S(t) = 1 − F(t),

the density function (indicating the probability of the event at time t) conditional on (dividing
by) survival up to that point. Notice that the survival function is represented as 1 − F(t),
where F(t) is the integral of the density function. The density function gives the probabilities
of experiencing the event at each time t. So, if we want to know the probability that a person
will experience the event by time t, we simply need to know the area under the density
function from −∞ to t, which is ∫_{−∞}^{t} f(u) du = F(t). The survival probability beyond t,
therefore, is 1 − F(t).
We will focus on hazard models because they are more common in sociology and demography.
Survival analysis, beyond simple plotting techniques, is more common in clinical and
biological research.
5.1.1 A Demographic Approach: Life Tables
The earliest type of hazard model developed was the life table. The life table generally takes
as its input the hazard rate at each time (generally age), uses some assumptions to translate
the hazard rate into transition (death) probabilities, and uses a set of flow equations to apply
these probabilities to a hypothetical population to ultimately end up with a measure of the
expected time remaining for a person at age a. A basic table might look like:
Table: Single Decrement Life Table

Age   l_a                  h(a) = μ_a   q_a    d_a           L_a                e_a
20    l_20 = 100,000       μ_20         q_20   q_20 · l_20   (l_20 + l_21)/2    (Σ_{a≥20} L_a) / l_20
21    l_21 = l_20 − d_20   μ_21         q_21   q_21 · l_21   (l_21 + l_22)/2    (Σ_{a≥21} L_a) / l_21
...   ...                  ...          ...    ...           ...                ...
The columns in the life table are the following. The l column represents the number
of individuals remaining alive in a hypothetical cohort (radix = 100,000) at the start of the
interval. The μ column represents the hazard for the time interval. The q column is the
probability that an individual alive at the beginning of the interval will die before the end
of the interval. An assumption is used to transform μ into q. For example, if we assume
that individuals die, on average, in the middle of the interval, then the exposure (in person
years) is simply the average of the number of individuals alive at the beginning and the end
of the interval, and so:

μ = (q · l) / [ (1/2)(l + (1 − q) · l) ].

Some rearranging yields:

q = (μ · l) / (l + (1/2) · μ · l),

So, we can obtain the transition probabilities from the hazards. We then apply these probabilities
to the population surviving to the beginning of the interval to obtain the count of
deaths in the interval (d). We then have enough information to calculate the next l, and we
can proceed through the table like this. When we are done, we then construct the L column,
which is a count of the person years lived in the interval. Once again, if we assume individuals
died in the middle of the interval, then the person years lived is simply the average of
the number of persons alive at the beginning and end of the interval (the denominator of
the equation for μ). The next column, T, sums these person years lived from this interval
through the end of the table. Finally, the last column, e, divides the cumulative person years
remaining to be lived (T) by the number of individuals alive at the start of the interval (l) to
obtain the average number of years remaining for each person (life expectancy). The
value at the earliest age in the table is an approximation of ∫_0^∞ t · f(t) dt, the expectation
of the distribution of failure times.
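A minimal sketch of the flow computations just described (hypothetical hazard values; the radix and the mid-interval assumption follow the text):

```python
import numpy as np

ages = np.arange(20, 26)                 # a short illustrative age range
mu = np.array([0.002, 0.003, 0.004, 0.005, 0.006, 0.007])  # hypothetical hazards

q = mu / (1 + mu / 2)                    # q = mu*l / (l + 0.5*mu*l), mid-interval deaths
l = np.zeros(len(ages))
l[0] = 100_000                           # radix
d = np.zeros(len(ages))
for i in range(len(ages)):
    d[i] = q[i] * l[i]
    if i + 1 < len(ages):
        l[i + 1] = l[i] - d[i]

L = l - 0.5 * d                          # person-years lived: average of start/end counts
T = L[::-1].cumsum()[::-1]               # person-years remaining from each age onward
e = T / l                                # remaining life expectancy at each age

# Because this toy table is truncated at age 25, e is understated at every age
print(np.round(e, 2))
```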
The life table has been extended to handle multiple decrements (multiple types of exits),
as well as reverse and repeatable transitions (the multistate life table). A key limitation of
the life table has been the difficulty with which covariates can be incorporated in it and the
difficulty in performing statistical tests for comparing groups. For this reason, researchers
began using hazard regression models.
5.1.2 Continuous Time Hazard Models
The most basic continuous time hazard models include the exponential model (also called
the constant hazard model), the Gompertz model, and the Weibull model. The difference
between these models is their representation of the baseline hazard, which is the hazard
function when all covariate values are 0. I emphasize this definition, because some may be
misled by the name into thinking that the baseline hazard is the hazard at time t = 0, and
that is not the case.
Suppose for a minute that there are no covariates, and that we assume the hazard does
not change over time. In that case, the exponential model is:

ln(h(t)) = a

The hazard is logged because of the bounding at 0. The name exponential model
stems from (1) the fact that if we exponentiate each side of the equation, the hazard is
an exponential function of the constant, and (2) the fact that the density function for failure times that
corresponds to this hazard is the exponential distribution.
If we assume that the hazard is constant across time, but that different subpopulations
have different values of this constant, the exponential model is:

ln(h(t)) = a + βX

In this specification, a is the baseline hazard, which is time-independent, and βX is the
linear combination of covariates thought to raise or lower the hazard.
Generally, the assumption of a constant hazard is unreasonable; instead, the hazard is
often assumed to increase across time. The most common model representing such time-dependent
hazards is the Gompertz model, which says that the log of the hazard increases
linearly with time:

ln(h(t)) = βX + bt.

In demography, we often see this model as:

h(t) = α exp(bt),

with α = exp(βX).
Often, we do not think the log of the hazard increases linearly with time, but rather we
believe it increases more slowly. Thus, the Weibull model is:

ln(h(t)) = βX + b ln(t).

Each of these models is quite common in practice. However, sometimes we do not believe
that any of these specifications for the baseline hazard is appropriate. In those cases, we
can construct piecewise models that break the baseline hazard into intervals in which the
hazard may vary in its form.
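A minimal sketch contrasting the three baseline-hazard shapes at covariate values of zero (parameter values are made up):

```python
import numpy as np

t = np.linspace(0.1, 10, 50)
a, b = -3.0, 0.15   # hypothetical intercept and time coefficients

h_exponential = np.exp(a + 0 * t)       # constant hazard
h_gompertz = np.exp(a + b * t)          # log-hazard linear in t
h_weibull = np.exp(a + b * np.log(t))   # log-hazard linear in ln(t)

# The Gompertz hazard rises fastest; the Weibull rises but decelerates
print(np.round([h_exponential[-1], h_gompertz[-1], h_weibull[-1]], 4))
```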
A special and very common model in which the baseline hazard remains completely
unspecified, while covariate effects can still be estimated, is the Cox proportional hazards
model. Cox's great insight was that the likelihood function for a hazard model can be
factored into a baseline hazard and a portion that contains the covariate effects. The Cox
model looks like:

ln(h(t)) = g(t) + βX

where g(t) is an unspecified baseline hazard. Estimation of this model ultimately rests on
the ordering of the event times. Thus, a problem exists whenever there are lots of ties in
the data. This method, therefore, is generally most appropriate when the time intervals in
the data are very small, and hence the probability of ties is minimal.
A final note on continuous time methods: realize that the hazard, being an instantaneous
probability, is ultimately always unobserved. Thus, estimation of these models requires
special software/procedures; they cannot be estimated with OLS or other standard techniques.
5.1.3 Discrete Time Models
When time intervals are discrete, we may use discrete time models. The most common
discrete time method is the discrete time logit model. The discrete time logit model is the
same logit model that we have already discussed. The only difference in application is the
data structure to which the model is applied. The logit model is represented as:

ln( p / (1 − p) ) = βX

where p is the probability that the event occurs to the individual in the discrete time interval.
Construction of the data set for estimation involves treating each individual as multiple
person-time records in which an individual's outcome is coded 0 for the time intervals prior
to the occurrence of the event, and is coded 1 for the time interval in which the event does
occur. Individuals who do not experience the event over the course of the study are said
to be censored, but they do not pose a problem for the analyses: they are simply coded 0
on the outcome for all time intervals. To visualize the structure of the data, the first table
shows 10 hypothetical individuals.
Standard format for data
ID Time until event Experienced event?
1 3 1
2 5 0
3 1 1
4 1 1
5 2 1
6 4 1
7 4 1
8 3 1
9 1 0
10 2 1
The study ran for 5 time intervals. Persons 2 and 9 did not experience the event and
thus are censored. Person 2 is censored simply because s/he did not experience the event
before the end of the study. Person 9 is censored due to attrition. The new data structure
is shown in the second table.
Person-year format for data
Record ID Time Interval Experienced event?
1 1 1 0
2 1 2 0
3 1 3 1
4 2 1 0
5 2 2 0
6 2 3 0
7 2 4 0
8 2 5 0
9 3 1 1
10 4 1 1
11 5 1 0
12 5 2 1
13 6 1 0
14 6 2 0
15 6 3 0
16 6 4 1
17 7 1 0
18 7 2 0
19 7 3 0
20 7 4 1
21 8 1 0
22 8 2 0
23 8 3 1
24 9 1 0
25 10 1 0
26 10 2 1
Now we have 26 records rather than the 10 in the original data set. We can compute the
hazard by observing the proportion of persons at risk who experience the event during each
time period. The hazards are: h(1) = 2/10, h(2) = 2/7, h(3) = 2/5, h(4) = 2/3, h(5) = 0/1. Now,
when we run our logit model, the outcome variable is the logit of the hazard, ln[ h(t) / (1 − h(t)) ]. We
can include time-varying covariates very easily, by simply recording the value of the variable
on the respondent record to which it applies. As you can see, censoring is also handled very
easily. We can also specify whatever form we would like for the baseline hazard. If we want
the baseline hazard to be constant, we simply don't include a variable or function for time
in the model. If we want complete flexibility (a piecewise-constant baseline hazard), we would
simply include a dummy variable for every time interval (except one).
This model is very similar to the ones we have already discussed. As the time intervals get
smaller, the model converges on the continuous time hazard models. Which one it converges
to is simply a matter of how we specify time in the model. For example, if we construct a
variable equal to ln(t) and enter it into the model, we essentially have the Weibull model. If
we just enter t as a variable, we have a Gompertz model.
The interpretation of the model is identical to that of the standard logit model, the only
difference being that the outcome is the hazard rather than the probability.
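A minimal sketch of the person-period expansion and the resulting discrete-time hazards, using the 10 hypothetical individuals from the tables above:

```python
import pandas as pd

# Standard format: one row per person (from the first table above)
standard = pd.DataFrame({
    "id": range(1, 11),
    "time": [3, 5, 1, 1, 2, 4, 4, 3, 1, 2],
    "event": [1, 0, 1, 1, 1, 1, 1, 1, 0, 1],
})

# Expand to person-period format: one row per person per interval at risk
rows = []
for _, r in standard.iterrows():
    for t in range(1, r["time"] + 1):
        rows.append({"id": r["id"], "interval": t,
                     "y": int(r["event"] == 1 and t == r["time"])})
person_period = pd.DataFrame(rows)

# Discrete-time hazard: proportion of those at risk who experience the event
hazard = person_period.groupby("interval")["y"].mean()
print(len(person_period))  # 26 person-period records
print(hazard)
```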
5.2 Fixed/Random Effects Hierarchical Models
Hierarchical modeling is an approach to analyzing data that takes advantage of, and/or
compensates for, nesting structure in data. The approach can be used to compensate for multistage
sampling, which induces dependence among errors and leads to biased standard errors.
It can be used to take advantage of the hierarchical structuring of data by distinguishing
between effects that occur at multiple levels. For example, the approach can be used to
differentiate between within-individual change over time and between-individual heterogeneity.
It can be used to distinguish between family-level effects and neighborhood-level
effects on individual outcomes, etc. Thus, fixed/random effects hierarchical models are
not used exclusively for panel data, but they can be when the nesting structure for the data is
individuals (level 2) measured across time (level 1).
As alluded to above, hierarchical modeling is a quite flexible and general approach to
modeling data that are collected at multiple levels. Because of this flexibility and wide
applicability, hierarchical modeling has been called many things. Here is a list of some of
the terms that have been used in referring to these types of models:
Hierarchical modeling
Multilevel modeling
Contextual effects modeling
Random coefficient models
Random/Fixed effects models
Random intercept models
Random effects models with random intercepts and slopes
Growth curve modeling
Latent curve analysis
2-level models, 3-level models, etc.
Variance component models
Mixed (effects) models
Random effects ANOVA
This list is not exhaustive, but covers the most common labels applied to this type of
modeling. In this brief discussion, I am not going to give an in-depth mathematical treatment
of these models; instead, I will try to show how these names have arisen in different empirical
research contexts, but are all part of the general hierarchical model.
Y_ij = β_0 + β_1 x_ij + β_2 z_j + e_ij

Here, Y_ij is the individual-level outcome for individual i in group j, β_0 is the intercept,
β_1 is the effect of the individual-level variable x_ij, β_2 is the effect of the group-level variable z_j, and
e_ij is a random disturbance.
If there is clustering within groups (e.g., you have all family members in a family, or you
have repeated measures on an individual over time), this model is not appropriate, because
the e_ij are not independent (violating the OLS assumption that e ∼ N(0, σ² I)).
Two simple solutions to this dilemma are (1) to pull out the structure in the error term
(similar to ARMA models) by decomposing it into a group effect and truly random error,
and (2) to separate the intercept into two components: a grand mean and a group mean. The
former approach leads to a random effects model; the latter to a fixed effects model, but they
look the same:
Y_ij = β_00 + β_1 x_ij + α_j + e_ij

Here, the subscript on the intercept has changed to denote the difference between this
and the OLS intercept. α_j denotes either a fixed effect (the decomposition of the intercept
term) or a random effect (the decomposition of the error term). I have eliminated z, because in
a fixed effects approach, all fixed characteristics are not identifiable apart from the intercept.
If we treat the model as a random effects model, this model can be called a random
intercept model. Note that there are two levels of variance in this specification: true within-individual
variance (denoted σ²_e) and between-individual (level 2) variance (denoted τ²), the
variance of the random effects. The total variance can be computed as σ²_e + τ².
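A minimal sketch (simulated data, made-up variance values) of the variance decomposition just described: simulate a random-intercept structure and recover the two variance components, whose sum is the total variance. The intraclass correlation τ²/(σ²_e + τ²) is a common summary of how much of the total variance lies between groups.

```python
import numpy as np

rng = np.random.default_rng(6)
n_groups, n_per_group = 200, 20
tau2, sigma2_e = 2.0, 5.0              # hypothetical between- and within-group variances

alpha = rng.normal(0, np.sqrt(tau2), n_groups)        # group random intercepts
group = np.repeat(np.arange(n_groups), n_per_group)
y = 10.0 + alpha[group] + rng.normal(0, np.sqrt(sigma2_e), n_groups * n_per_group)

# Method-of-moments recovery of the two components
group_means = np.array([y[group == j].mean() for j in range(n_groups)])
within = np.mean([y[group == j].var(ddof=1) for j in range(n_groups)])   # estimates sigma2_e
between = group_means.var(ddof=1) - within / n_per_group                  # estimates tau2

icc = between / (between + within)
print(round(within, 2), round(between, 2), round(icc, 2))  # roughly 5.0, 2.0, 0.29
```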
We now have a model specification that breaks variance across two levels, and we can
begin to bring in variables to explain variance at both levels. Suppose we allow the random
intercept α_j to be a function of group-level covariates and a residual group-specific random
effect u_j, so that α_j = γ_0 + γ_1 z_j + u_j. Then substitution yields:

Y_ij = β_1 x_ij + (γ_0 + γ_1 z_j + u_j) + e_ij

Notice that I have eliminated the original intercept, β_00, as it is no longer identified as
a parameter distinct from γ_0, the new intercept after adjustment for the group-level differences
contained in z_j. Now, every group has a unique intercept that is a function of a grand mean
(γ_0), a fixed effect of a group-level variable (γ_1), and a random, unexplained component
(u_j). τ² should shrink as group-level variables are added to account for structure in u, and
measures of second-level model fit can be constructed from this information. Similarly, the
addition of more individual-level measures (x_ij) should reduce σ²_e, and measures of first-level model fit
can be constructed from this information.
The next extension of this model can be made by observing that, if the intercept can vary
between groups, so may the slope coefficients. We could, for example, assume that slopes
vary according to a random group factor. So, in the equation:

Y_ij = (γ_0 + γ_1 z_j + u_j) + β_1 x_ij + e_ij

we could allow the slope β_1 also to be a function of group-level characteristics:

β_1 = δ_0 + δ_1 z_j + v_j
Substitution yields:

Y_ij = (γ_0 + γ_1 z_j + u_j) + (δ_0 + δ_1 z_j + v_j) x_ij + e_ij

Simplification yields:

Y_ij = (γ_0 + γ_1 z_j + u_j) + (δ_0 x_ij + δ_1 z_j x_ij + v_j x_ij) + e_ij
This is the full hierarchical linear model, also called a random coefficients model, a
multilevel model, etc. Notice that this model almost appears as a regular OLS model
with simply the addition of a cross-level interaction between z_j and x_ij. Indeed, prior to
the development of software to estimate this model, many people did simply include the
cross-level interaction and estimate the model via OLS, possibly adjusting for standard error
bias using some robust estimator for standard errors.
However, this model is NOT appropriately estimated by OLS, because we still have the
random effect u_j and the term v_j x_ij. In a nutshell, this model now contains three sources of
variance: within-individual (residual) variance, σ²_e, and two between-individual variances
(τ²_intercept and τ²_slope).
We have now discussed the reason for many of the names for the hierarchical model,
including multilevel modeling, hierarchical modeling, random/fixed effects modeling, random
coefficient modeling, etc. We have not discussed growth curve modeling. I use growth
curve modeling extensively in my research, and approach the hierarchical model from that
perspective. I also approach the model from a probability standpoint, rather than a pure
equation/residual variance standpoint. That being the case, this brief discussion of growth
curve modeling will use a very different notation.
Until now, we have treated the two levels of analysis as individual and group. For growth
curve modeling, the two levels are based on repeated measures on individuals across time.
A basic growth model looks like:
Within-Individual equation:

y_it ∼ N( α_i + β_i · t , σ²_e )
This equation says that time-specific individual measures are a function of an individual-specific
intercept term and a slope term capturing change across time. The second-level
equations,
Between-Individual equations:

α_i ∼ N( γ_0 + Σ_{j=1}^{J} γ_j X_ij , σ²_α )
β_i ∼ N( δ_0 + Σ_{k=1}^{K} δ_k X_ik , σ²_β )
say that there may be between-individual differences in growth patterns, and that these
differences may be explained by individual-level characteristics. Realize that this model, aside from
notation, is no different from the hierarchical model discussed above.
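A minimal sketch that simulates data from this two-level growth model (all parameter values are hypothetical) and recovers the average intercept and slope from per-person OLS fits:

```python
import numpy as np

rng = np.random.default_rng(7)
n_people, n_waves = 500, 6
t = np.arange(n_waves, dtype=float)

# Between-individual level: person-specific intercepts and slopes
alpha_i = rng.normal(50.0, 3.0, n_people)     # mean intercept 50, sd 3
beta_i = rng.normal(1.5, 0.5, n_people)       # mean slope 1.5, sd 0.5

# Within-individual level: repeated measures with residual noise
y = alpha_i[:, None] + beta_i[:, None] * t + rng.normal(0, 2.0, (n_people, n_waves))

# Fit a separate OLS line to each person and summarize the growth parameters
X = np.column_stack([np.ones(n_waves), t])
coefs = np.linalg.lstsq(X, y.T, rcond=None)[0]   # 2 x n_people: intercepts, slopes

print(coefs[0].mean(), coefs[1].mean())   # close to 50 and 1.5
print(coefs[0].std(), coefs[1].std())     # overstate the true sds (they include noise)
```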