
Filling Holes in Your Data:

Multiple Imputation
in Education Research
Paul T. von Hippel
Harvard Graduate School of Education
Larsen G-06
Wednesday, April 22, 1-2:30 pm

I. Background
II. New Results

I. Background

Education Data

Are Full of Holes

Listwise Deletion

aka Case deletion, Complete-case analysis

Impute the Mean

Regression Imputation

Impute the conditional mean

Random Regression Imputation

Conditional mean + random variation
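
A minimal Python sketch of this idea, assuming a small hypothetical data set (the variable names sat and gpa are illustrative, not from the talk): each missing value is replaced by its regression prediction plus a normally distributed residual, so the imputations keep roughly the right spread.

```python
# A minimal illustrative sketch (not the presenter's code) of random
# regression imputation: conditional mean plus a random residual.
# Variable names (sat, gpa) are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def random_regression_impute(df, target, predictor):
    """Impute missing `target` values from a regression on `predictor`."""
    obs = df.dropna(subset=[target, predictor])
    slope, intercept = np.polyfit(obs[predictor], obs[target], deg=1)
    resid = obs[target] - (intercept + slope * obs[predictor])
    sigma = resid.std(ddof=2)                     # residual standard deviation

    out = df.copy()
    miss = out[target].isna() & out[predictor].notna()
    cond_mean = intercept + slope * out.loc[miss, predictor]
    # Conditional mean + random variation restores the lost residual spread.
    out.loc[miss, target] = cond_mean + rng.normal(0, sigma, size=miss.sum())
    return out

df = pd.DataFrame({"sat": [400, 500, 600, 700, 800],
                   "gpa": [2.0, np.nan, 3.0, np.nan, 3.8]})
print(random_regression_impute(df, target="gpa", predictor="sat"))
```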

Add Extra Regressors

Graduation rate. Sector (public vs private).

Multiple Imputation
Rubin 1987

Steps:
1. Replication
2. Imputation
3. Analysis
4. Recombination
(A schematic sketch of these four steps appears below.)
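
A schematic Python sketch of the four steps, offered only as an illustration of the structure: the toy one-variable "imputer" just redraws from the observed values, standing in for a real imputation model, and step 4 uses the standard combining rules described later in the deck.

```python
# Schematic sketch of the four MI steps; the "imputer" here is a crude
# stand-in (redraws from observed values), not a recommended model.
import numpy as np

rng = np.random.default_rng(1)
y = np.array([2.1, np.nan, 3.0, 2.7, np.nan, 3.5, 2.9])
M = 5                                              # number of imputations

estimates, variances = [], []
for m in range(M):                                 # 1. replication: M copies
    y_m = y.copy()
    miss = np.isnan(y_m)
    # 2. imputation: fill each hole with a random draw (stand-in model)
    y_m[miss] = rng.choice(y[~miss], size=miss.sum())
    # 3. analysis: run the usual analysis on each completed data set
    estimates.append(y_m.mean())
    variances.append(y_m.var(ddof=1) / len(y_m))   # squared standard error

# 4. recombination: pool the M estimates and standard errors
point = np.mean(estimates)
within = np.mean(variances)
between = np.var(estimates, ddof=1)
se = np.sqrt(within + (1 + 1 / M) * between)
print(point, se)
```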

1&2. Replication & Imputation


The imputed variable is not the original variable.


They just have similar statistical properties.

3&4. Analysis & Recombination

MI Point Estimate
MI Standard Error
(Rubin's combining rules, written out below.)
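
For reference, these are the standard combining rules (Rubin 1987), with $\hat{Q}_m$ the estimate and $U_m$ its squared standard error from the $m$-th completed data set:

$$
\bar{Q}=\frac{1}{M}\sum_{m=1}^{M}\hat{Q}_m,\qquad
\bar{U}=\frac{1}{M}\sum_{m=1}^{M}U_m,\qquad
B=\frac{1}{M-1}\sum_{m=1}^{M}\left(\hat{Q}_m-\bar{Q}\right)^2
$$

$$
\text{MI point estimate}=\bar{Q},\qquad
\text{MI standard error}=\sqrt{\bar{U}+\left(1+\tfrac{1}{M}\right)B}
$$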

How many imputations?

The larger M, the better. But how large is large enough?
Often enough (Rubin 1987; von Hippel 2005): M = 3 to 10

Surely enough (Bodner 2008):
M ≈ 100 × the fraction of missing information

Models used for imputation

SAS's MI procedure

Multivariate normal

Normal
Linear

Stata's ice command (Royston 2006)

Alternating regression, a.k.a. chained equations (see the sketch after this list)

Logistic, Poisson, normal

Other models (not widely implemented)

R's mix and pan packages (Schafer 1997)


Resampling
Non-normal models (He & Raghunathan 2008)
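
A rough Python sketch of the alternating-regression (chained equations) idea, assuming two continuous variables with hypothetical names. It is not Stata's ice itself, which also handles logistic and Poisson models, and in practice the whole procedure is repeated M times to produce M imputations.

```python
# Rough sketch of alternating regression (chained equations); an
# illustration only, not Stata's ice. Variable names are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

def chained_impute(df, n_cycles=10):
    out = df.copy()
    for col in out:                              # start from crude mean fills
        out[col] = out[col].fillna(out[col].mean())
    for _ in range(n_cycles):
        for col in df:                           # re-impute one variable at a time
            miss = df[col].isna()
            others = [c for c in df.columns if c != col]
            X = np.column_stack([np.ones(len(out)), out[others]])
            beta, *_ = np.linalg.lstsq(X[~miss], out.loc[~miss, col], rcond=None)
            resid = out.loc[~miss, col] - X[~miss] @ beta
            sigma = resid.std(ddof=len(beta))
            # Conditional mean + random draw, as in random regression imputation.
            out.loc[miss, col] = X[miss] @ beta + rng.normal(0, sigma, miss.sum())
    return out

df = pd.DataFrame({"read": [52.0, np.nan, 61.0, 47.0, np.nan, 58.0],
                   "math": [50.0, 55.0, np.nan, 45.0, 60.0, np.nan]})
print(chained_impute(df))
```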

II. New results:


A. Non-normality
1. Discrete variables (Horton et al. 2003; Allison 2005)
2. Skew (von Hippel 2008)
B. Missing Y (von Hippel 2007)
C. Nonlinearity
1. Interactions (von Hippel 2009; Allison 2002)
2. Curves (von Hippel 2009)
Theme:
Your data can look wrong,
so long as your estimates are right.

IIA. Non-normality

IIA1. Rounding discrete variables


Common advice (sketched below):
Impute the dummy as though normal
Round the normal imputations to 0 and 1

Horton, Lipsitz & Parzen 2003; Allison 2005; Bollen & Barb 1981
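
A small Python sketch of the procedure being described, using a hypothetical 0/1 variable. The cited papers discuss how the final rounding step can bias estimates such as the proportion of 1s, which is why the unrounded imputations are also kept here for comparison.

```python
# Sketch of the "common advice": impute a 0/1 dummy as though normal, then
# round the imputations to 0 and 1. The rounding step is what Horton et al.
# (2003) and Allison (2005) warn about; an illustration only.
import numpy as np

rng = np.random.default_rng(3)
dummy = np.array([1.0, 0.0, 1.0, 1.0, np.nan, 0.0, np.nan, 1.0, np.nan, 0.0])

obs = dummy[~np.isnan(dummy)]
miss = np.isnan(dummy)

# Impute as though normal: observed mean plus normally distributed noise.
normal_imputes = rng.normal(obs.mean(), obs.std(ddof=1), size=miss.sum())

rounded = dummy.copy()
rounded[miss] = np.clip(np.round(normal_imputes), 0, 1)   # the rounding step

unrounded = dummy.copy()
unrounded[miss] = normal_imputes                 # "wrong-looking" imputations, left unedited

print("proportion with rounded imputations:  ", rounded.mean())
print("proportion with unrounded imputations:", unrounded.mean())
```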

IIA2. Rounding skewed variables


Skewed variable.

Impute as though normal.

Truncate implausible values.


(von Hippel 2008)

IIA2. Transforming skewed variables

(von Hippel 2008)

IIA. Non-normality: Summary


Best: impute using a model that fits
Often OK: impute as though normal
Bad: trying to make the data normal; editing the imputations

Principle
Imputed data ≠ original data
Imputed estimates = original estimates

IIB. Missing Y

IIB. Missing Y
Missing Ys are useless for regression
But cases with missing Ys have information about X
Little 1992

von Hippel 2007

IIB. Missing Y:
Multiple Imputation, then Deletion (MID)
Steps:

von Hippel 2007

1. Replication
2. Imputation
2′. Deletion [of cases with imputed Y]
3. Analysis
4. Recombination
(A minimal sketch of MID appears below.)
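
A minimal Python sketch of the MID sequence, assuming a toy two-variable data set and a crude stand-in imputer. The point is only the ordering of the steps: cases with an imputed Y help impute X, but are deleted before the analysis.

```python
# Minimal sketch of Multiple Imputation, then Deletion (MID): impute, drop
# the cases whose Y was imputed, analyze, recombine. The imputer is a crude
# stand-in (redraws from observed values); an illustration only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
data = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0, 5.0, np.nan],
                     "y": [2.2, np.nan, 3.1, 4.0, np.nan, 5.9]})
M = 5
slopes, variances = [], []

for m in range(M):                                   # 1. replication
    d = data.copy()
    y_was_missing = d["y"].isna()
    for col in d:                                    # 2. imputation (stand-in)
        miss = d[col].isna()
        d.loc[miss, col] = rng.choice(d.loc[~miss, col], size=miss.sum())
    d = d.loc[~y_was_missing]                        # 2'. deletion of imputed-Y cases

    slope, intercept = np.polyfit(d["x"], d["y"], deg=1)   # 3. analysis
    resid = d["y"] - (intercept + slope * d["x"])
    se2 = resid.var(ddof=2) / ((d["x"] - d["x"].mean()) ** 2).sum()
    slopes.append(slope)
    variances.append(se2)

point = np.mean(slopes)                              # 4. recombination
se = np.sqrt(np.mean(variances) + (1 + 1 / M) * np.var(slopes, ddof=1))
print(point, se)
```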

IIC. Non-linearity

IIC1. Nonlinearity: Interactions


(Allison 2002, von Hippel 2009)

Complete data. Y regressed on X, D and DX

Impute, then Interact?

Impute (X,Y,D) as though linear (no interaction).


Then regress Y on X, D, and DX.

Stratify, then Impute

(Allison 2002, von Hippel 2009)

2 strata: public schools and private schools.


Impute (X,Y) as linear within each stratum.

Interact, then Impute!


Impute the interaction, like any other variable.
Then regress on the imputed interaction
(Allison 2002, von Hippel 2009).

IIC2. Nonlinearity: Curves


(von Hippel 2009)

Complete data. Y regressed on X and X²

Impute, then Square?

Impute (X,Y) as though linear (with other variables).


Then regress Y on X and X²?

Square, Then Impute (von Hippel 2009)


Impute the square like any other variable.
Then use the imputed square in regression

IIC. Nonlinearity: Summary

Transform, then impute. (von Hippel 2009)


1. Calculate transformation (square, interaction, etc).
2. Impute like any other variable.
3. Use the imputed transformation in the analysis (sketched below).
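
A minimal Python sketch of the transform-then-impute recipe, with hypothetical columns d, x, y: the square and the interaction are computed before imputation, imputed like ordinary variables, and then used as-is in the analysis, even though the imputed x2 and dx need not equal x² and d×x exactly. The imputer is again a crude stand-in.

```python
# Minimal sketch of "transform, then impute": square and interaction computed
# first, imputed as ordinary columns, then used as-is in the analysis. The
# imputer is a crude stand-in; column names are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
df = pd.DataFrame({"d": [0.0, 1.0, 0.0, 1.0, 1.0, 0.0],
                   "x": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0],
                   "y": [2.0, 3.5, np.nan, 7.9, 9.1, np.nan]})

# 1. calculate the transformations on the incomplete data
df["x2"] = df["x"] ** 2
df["dx"] = df["d"] * df["x"]

# 2. impute every column, transformations included (stand-in imputer)
imputed = df.copy()
for col in imputed:
    miss = imputed[col].isna()
    imputed.loc[miss, col] = rng.choice(imputed.loc[~miss, col], size=miss.sum())

# 3. use the imputed x2 and dx in the analysis; do NOT recompute them
X = np.column_stack([np.ones(len(imputed)), imputed[["x", "x2", "d", "dx"]]])
beta, *_ = np.linalg.lstsq(X, imputed["y"], rcond=None)
print(beta)
```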

Principle
Imputed data ≠ real data
Imputed statistics = real statistics

Conclusions
Plausible estimates more important than
plausible data
Normal imputations are versatile, but messy
Future research and software
Resampling (approximate Bayesian bootstrap)

Alternatives to imputation
full-information maximum likelihood estimation

Data quality

References

Allison, P. (2001). Missing Data. Thousand Oaks, CA: Sage.
Allison, P. (2005). Imputation of Categorical Variables with PROC MI. SAS Users Group International Conference, Philadelphia, PA, April 10-13.
Barnard, J., & Rubin, D. B. (1999). Small-Sample Degrees of Freedom with Multiple Imputation. Biometrika 86(4), 948-955.
He, Y., & Raghunathan, T. E. (2006). Tukey's gh Distribution for Multiple Imputation. The American Statistician 60(3), 251-256.
Horton, N. J., Lipsitz, S. P., & Parzen, M. (2003). A Potential for Bias When Rounding in Multiple Imputation. The American Statistician 57(4), 229-232.
Horton, N. J., & Kleinman, K. P. (2007). Much Ado About Nothing: A Comparison of Missing Data Methods and Software to Fit Incomplete Data Regression Models. The American Statistician 61(1), 79-90.
Kim, J. K. (2004). Finite Sample Properties of Multiple Imputation Estimators. The Annals of Statistics 32(2), 766-783.
Little, R. J. A. (1992). Regression with Missing X's: A Review. Journal of the American Statistical Association 87(420), 1227-1237.
Little, R. J. A., & Rubin, D. B. (2002). Statistical Analysis with Missing Data (2nd ed.). Hoboken, NJ: Wiley.
Meng, X. L. (1995). Multiple Imputation Inferences with Uncongenial Sources of Input. Statistical Science 10, 538-73.
Rubin, D. B. (1987). Multiple Imputation for Survey Nonresponse. New York: Wiley.
Schafer, J. L. (1997). Analysis of Incomplete Multivariate Data. New York: Chapman & Hall.
von Hippel, P. T. (2004). Biases in SPSS 12.0 Missing Value Analysis. The American Statistician 58(2), 160-164.
von Hippel, P. T. (2005). How Many Imputations Are Needed? A Comment on Hershberger and Fisher (2003). Structural Equation Modeling 12(2), 334-335.
von Hippel, P. T. (2007). Regression with Missing Ys. Sociological Methodology.
von Hippel, P. T. (2008). Imputing Skewed Variables. Under review.
von Hippel, P. T. (2009). How to Impute Squares and Interactions. Sociological Methodology, in press.

Assumption:
Ignorable missingness

missing at random (MAR), noninformative


The missing values are like the observed values in similar
cases.

Full information maximum likelihood (ML)

Suppose Y has a missing value.
Estimate the distribution of possible Y values.
In MI: impute 3-10 values from this distribution.

In ML: integrate across the full distribution of possible Y values.
[Figure: estimated density of the possible Y values, plotted against Weight (roughly 75 to 225).]
Like MI with an infinite number of imputations.
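
One standard way to write the contrast being drawn, for a case $i$ with $x_i$ observed and $y_i$ missing (general full-information ML notation, not specific to this weight example): MI approximates the integral below with a handful of random draws, while ML evaluates it exactly, which is why ML behaves like MI with infinitely many imputations.

$$
L_i(\theta)\;=\;\int f(x_i, y;\theta)\,dy\;=\;f(x_i;\theta)
$$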

ML in AMOS
[Screenshots: how to run the model and how to view the results.]
