
Missing data & how to handle it
Arooj Arshad
PhD Scholar


Goals
Discuss ways to evaluate and understand missing data
Discuss common missing data methods
Know the advantages and disadvantages of common methods
Treatment of the missing data
Efficient ways of missing data handling

Reasons for Missing Data

Missing data can occur for many reasons:
Participants can fail to respond to questions (legitimately or illegitimately; more on that later),
Equipment and data collecting or recording mechanisms can malfunction,
Subjects can withdraw from studies before they are completed,
Data entry errors can occur.

Difference between missing and legitimate missing data

Missing Data
If any data on any variable from any participant is not present, the researcher is dealing with missing or incomplete data.
Example: a missing response on a particular item that assesses a particular construct.

Legitimate Missing Data
Legitimate missing data is an absence of data when it is appropriate for there to be an absence.
Example: a survey asks whether you are married and, if so, how long you have been married. If you say you are not married, it is legitimate for you to skip the follow-up question.

(Cole, 2008)

Methods for analyzing missing data require assumptions about the nature of the data and about the reasons for the missing observations that are often not acknowledged.

Reviewing the stages of data collection, data preparation, data analysis, and interpretation of results will highlight the issues that researchers must consider in making a decision about how to handle missing data in their work.

Key Elements of Missingness

The number of cases missing per variable.
The number of variables missing per case.
The pattern of correlation among variables.

Point to be remembered
All researchers should examine their data for missingness, and researchers wanting the best (i.e., the most replicable and generalizable) results from their research need to be prepared to deal with missing data in the most appropriate and desirable way possible.
If the proportion of cases with missing data is small, say five percent or less, listwise deletion may be acceptable (Roth, 1994). Even then, if those cases are not missing completely at random, inconsistent parameter estimates can result. Otherwise, missing data experts (Little & Rubin, 1987) recommend using an ML method for analysis, a method that makes use of all available data points.

Nature of Missingness

Missing Completely at Random (MCAR)
The probability of missing data on Y is unrelated to Y and to X.
Missingness is purely random; it does not depend on anything.
Example: respondents fail to report income for reasons unrelated to their income or to any other variable.
Checked with the help of Little's MCAR test. The test is based on mean differences across groups of subjects with the same missing data pattern. Readers interested in more detail should read Shenoi et al. (2012).

Missing at Random (MAR)
The probability of missing data on Y is related to X but, once X is taken into account, not to Y itself.
Example: for really sick patients, clinicians may not draw blood for routine labs.

Missing Not at Random (MNAR)
The probability of missing data on Y depends on the value of Y itself.
Example: respondents with high income are less likely to report income.
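
The three mechanisms can be illustrated with a small simulation. This is only a sketch under stated assumptions: the variables (x as an always-observed age, y as an income that may be missing), the roughly 30% missingness rate, and the logistic form of the missingness probabilities are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical variables: x = age (always observed), y = income (may be missing).
x = rng.normal(40, 10, n)
y = 1000 * x + rng.normal(0, 5000, n)


def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))


# MCAR: every case has the same 30% chance of a missing y.
miss_mcar = rng.random(n) < 0.30
# MAR: the chance of a missing y depends only on the observed x.
miss_mar = rng.random(n) < logistic((x - x.mean()) / x.std() - 0.85)
# MNAR: the chance of a missing y depends on the value of y itself.
miss_mnar = rng.random(n) < logistic((y - y.mean()) / y.std() - 0.85)


def observed_mean(values, missing):
    """Mean of the values that remain after dropping the missing cases."""
    return values[~missing].mean()


results = {
    "full": y.mean(),
    "MCAR": observed_mean(y, miss_mcar),
    "MAR": observed_mean(y, miss_mar),
    "MNAR": observed_mean(y, miss_mnar),
}
for name, value in results.items():
    print(f"{name:>5}: mean of observed y = {value:,.0f}")
```

Under MCAR the observed mean stays close to the full-data mean. Under MNAR it is pulled downward, because high incomes are the ones that go unreported. Under MAR the raw observed mean is also off in this setup (x and y are correlated), but the distortion can be corrected by methods that condition on x.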

Missing Data Consequences

Bias
The estimate systematically deviates from the quantity of interest.
There is no bias if the data are MCAR, but bias can occur when they are not MCAR.
Lost data decrease statistical power.

Variance
Missing data can sometimes lead to wrong standard errors.
Wrong study conclusions about the relationship of variables to outcomes.
(Roth, 2001)

Commonly-Used Missing Data Handling Methods

Deletion Methods
Listwise/complete case deletion, pairwise deletion

Single Imputation Methods
Mean/mode substitution, dummy variable method, single regression, hot-deck imputation

Model-Based Methods
Maximum likelihood, multiple imputation
Deletion Method


Listwise Deletion (Complete Case Analysis)
Only analyze cases with complete data, dropping every case that has a missing value on any variable.
When a researcher is estimating a model, such as a linear regression, most statistical packages use listwise deletion by default.
(Cole, 2008)

Advantages
Ease of implementation.
Comparability across analyses.

Disadvantages
Reduces statistical power, because it lowers n; a researcher cannot anticipate whether an adequate amount of data will remain for the analysis.
Does not use all information.
Estimates may be biased if the data are not MCAR (complete case analysis assumes that the observed complete cases are a random sample of the originally targeted sample or, in Rubin's (1976) terminology, that the missing data are MCAR).

Pairwise Deletion (Available Case Analysis)
Each analysis uses all cases in which the variables of interest are present.
Advantages:
Keeps as many cases as possible for each analysis.
Uses all information possible with each analysis.
Disadvantage:
Analyses cannot be compared directly, because the sample is different each time.
(Cole, 2008)
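
The two deletion methods can be contrasted on a toy data set. This is a minimal sketch assuming pandas is available; the data frame and its column names are made up for illustration. It relies on the documented behaviour that `DataFrame.corr()` excludes missing values pair by pair.

```python
import numpy as np
import pandas as pd

# Hypothetical data set with scattered missing values.
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "b": [2.0, np.nan, 3.0, 5.0, 4.0, 7.0],
    "c": [1.0, 3.0, 2.0, np.nan, 6.0, 5.0],
})

# Listwise deletion: keep only rows with no missing value in any column.
complete = df.dropna()
listwise_corr = complete.corr()

# Pairwise deletion: corr() skips missing values pair by pair, so each
# correlation uses every row in which both variables are present.
pairwise_corr = df.corr()

# Number of cases behind each pairwise correlation.
present = df.notna().astype(int)
pairwise_n = present.T @ present

print("listwise n:", len(complete))
print(pairwise_n)
```

Here listwise deletion keeps 3 of the 6 cases, while each pairwise correlation is based on 4 cases, and on a different set of 4 for each pair, which is why the analyses are not directly comparable.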

Hot-Deck Imputation
The researcher replaces a missing value with the actual score from a similar case in the current data set.
The imputed score is termed "hot" because it comes from the data set currently in use by the computer.
Advantages
Tends to increase accuracy, because missing values are replaced by realistic values.
Particularly helpful when data are missing in certain patterns.

Disadvantages
The number of classification variables may become unmanageable in large surveys.
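
A minimal sketch of hot-deck imputation follows. It assumes that "similar case" means a case in the same class of a single classification variable and that the donor is drawn at random within that class; the function name `hot_deck_impute`, the data frame, and the column names are hypothetical.

```python
import numpy as np
import pandas as pd


def hot_deck_impute(df, target, by, seed=0):
    """Fill missing `target` values with an observed value drawn at random
    from a donor case in the same `by` class."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for _, idx in out.groupby(by).groups.items():
        group = out.loc[idx, target]
        donors = group.dropna().to_numpy()
        holes = group.index[group.isna()]
        if len(donors) == 0 or len(holes) == 0:
            continue  # no donor available, or nothing to fill
        out.loc[holes, target] = rng.choice(donors, size=len(holes))
    return out


survey = pd.DataFrame({
    "region": ["north", "north", "north", "south", "south", "south"],
    "score": [10.0, 12.0, np.nan, 30.0, np.nan, 34.0],
})
imputed = hot_deck_impute(survey, target="score", by="region")
print(imputed)
```

Every imputed value is a score that some respondent in the same class actually gave. With several classification variables the number of classes multiplies quickly, and some classes may end up with no donors at all.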

Single Imputation Methods

Mean/mode substitution
Dummy variable control
Conditional mean substitution

Mean/Mode Substitution
Replace each missing value with the sample mean or mode.
Run analyses as if all cases were complete.
Advantages
Can use complete case analysis methods.

Disadvantages
Reduces variability (underestimates standard errors).
Weakens covariance and correlation estimates in the data, because it ignores the relationships between variables.

Computed variance estimates decrease as more means are added to the calculations.

For example, a researcher might have 30 subjects, but 5 have missing data. Through mean substitution we add 5 means to the 25 scores; this would increase the N in the calculation of the variance but would not increase the deviations around the mean.
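
The arithmetic of this example can be checked directly. The sketch below uses 25 simulated scores (the values themselves are an assumption; only the counts 25 and 5 come from the example) and compares the sample variance before and after adding 5 copies of the mean.

```python
import numpy as np

rng = np.random.default_rng(1)
observed = rng.normal(50, 10, 25)  # the 25 observed scores (simulated)
filled = np.concatenate([observed, np.full(5, observed.mean())])

# The sum of squared deviations is unchanged: each added mean contributes 0.
ss_observed = ((observed - observed.mean()) ** 2).sum()
ss_filled = ((filled - filled.mean()) ** 2).sum()

var_observed = observed.var(ddof=1)  # SS / (25 - 1)
var_filled = filled.var(ddof=1)      # same SS / (30 - 1)

print("ratio of variances:", var_filled / var_observed)  # equals 24/29
```

Whatever the scores are, the variance after substitution is 24/29 (about 0.83) of the variance of the observed scores, so it is understated by roughly 17%.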

Mean substitution is worth considering when correlations between variables in the data are low and less than 10% of the data are missing (Donner, 1982).

Dummy Variable Adjustment

Create an indicator for the missing value (1 = value is missing for the observation; 0 = value is observed for the observation).
Impute missing values to a constant (such as the mean).
Advantage
Uses all available information about the missing observation.

Disadvantages
Results in biased estimates.
Not theoretically driven.
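
The two steps above can be sketched as follows. The data, the column names, and the use of ordinary least squares via NumPy are assumptions made for illustration; the sketch shows the mechanics only and does not remove the bias noted above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: outcome y and a predictor x with two missing values.
df = pd.DataFrame({
    "y": [3.1, 4.0, 5.2, 6.1, 6.8, 8.1],
    "x": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0],
})

# Step 1: indicator that is 1 where x is missing and 0 where it is observed.
df["x_missing"] = df["x"].isna().astype(int)
# Step 2: set the missing values to a constant, here the observed mean of x.
df["x_filled"] = df["x"].fillna(df["x"].mean())

# Regress y on an intercept, the filled predictor, and the indicator.
design = np.column_stack([np.ones(len(df)), df["x_filled"], df["x_missing"]])
coef, *_ = np.linalg.lstsq(design, df["y"].to_numpy(), rcond=None)
print(dict(zip(["intercept", "x_filled", "x_missing"], coef.round(3))))
```

All six cases enter the regression, which is the appeal of the method; the coefficient on the indicator absorbs the average difference between cases with and without x.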


Regression Imputation
Replaces missing values with the predicted score from a regression equation.
Advantage:
Uses information from the observed data.

Disadvantages:
Overestimates model fit and correlation estimates.
Underestimates variance.
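
Both disadvantages can be seen in a small simulation. This is a sketch under assumptions: the data are simulated from a simple linear model, the missingness is MCAR at about 30%, and a single predictor is used.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x = rng.normal(0, 1, n)
y = 2.0 + 1.5 * x + rng.normal(0, 1, n)
missing = rng.random(n) < 0.3  # MCAR holes in y
y_obs = np.where(missing, np.nan, y)
cc = ~missing                  # complete cases

# Fit the regression of y on x using the complete cases only.
slope, intercept = np.polyfit(x[cc], y_obs[cc], deg=1)

# Replace each missing y with its predicted score.
y_imputed = np.where(missing, intercept + slope * x, y_obs)

r_complete = np.corrcoef(x[cc], y_obs[cc])[0, 1]
r_imputed = np.corrcoef(x, y_imputed)[0, 1]
var_complete = y_obs[cc].var(ddof=1)
var_imputed = y_imputed.var(ddof=1)

print(f"correlation: complete cases {r_complete:.3f}, after imputation {r_imputed:.3f}")
print(f"variance of y: complete cases {var_complete:.3f}, after imputation {var_imputed:.3f}")
```

The imputed points lie exactly on the regression line, so they add no residual scatter: the correlation rises and the variance of y falls relative to the complete cases.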


Model-Based Methods

Maximum likelihood using the EM algorithm
Multiple imputation

These methods share two assumptions: that the joint distribution of the data is multivariate normal, and that the missing data mechanism is ignorable.

Maximum Likelihood Using the EM Algorithm

Identifies the set of parameter values that produces the highest log-likelihood.
ML estimate: the parameter value under which the observed data are most likely.
Conceptually, the process is the same with or without missing data.
Advantages:
Uses full information (both complete and incomplete cases) to calculate the log-likelihood.
Unbiased parameter estimates with MCAR/MAR data.

Disadvantages
SEs are biased downward; this can be adjusted by using the observed information matrix.

With missing values, we can base estimation on the likelihood of the observed data.
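
A minimal sketch of the EM algorithm is given below for the simplest case: two variables, x fully observed and y partly missing, under the bivariate normal and ignorability assumptions stated earlier. The function name `em_bivariate_normal`, the simulated data, and the MAR mechanism are illustrative assumptions, not a general-purpose implementation.

```python
import numpy as np


def em_bivariate_normal(x, y, n_iter=500, tol=1e-10):
    """ML estimates of the mean and covariance of (x, y) when some y are NaN.
    Assumes bivariate normality and an ignorable (MCAR/MAR) mechanism."""
    miss = np.isnan(y)
    # Starting values from the available data.
    mu = np.array([x.mean(), y[~miss].mean()])
    cov = np.cov(x[~miss], y[~miss], bias=True)
    for _ in range(n_iter):
        # E-step: expected y and y^2 for the incomplete cases, given x.
        beta = cov[0, 1] / cov[0, 0]
        resid_var = cov[1, 1] - beta * cov[0, 1]
        ey = np.where(miss, mu[1] + beta * (x - mu[0]), y)
        ey2 = np.where(miss, ey ** 2 + resid_var, y ** 2)
        # M-step: update the parameters from the expected sufficient statistics.
        new_mu = np.array([x.mean(), ey.mean()])
        sxx = (x ** 2).mean() - new_mu[0] ** 2
        sxy = (x * ey).mean() - new_mu[0] * new_mu[1]
        syy = ey2.mean() - new_mu[1] ** 2
        new_cov = np.array([[sxx, sxy], [sxy, syy]])
        change = max(np.abs(new_mu - mu).max(), np.abs(new_cov - cov).max())
        mu, cov = new_mu, new_cov
        if change < tol:
            break
    return mu, cov


rng = np.random.default_rng(3)
n = 2000
x = rng.normal(0, 1, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)
# MAR: y is more likely to be missing when the observed x is large.
miss = rng.random(n) < np.where(x > 0.5, 0.8, 0.1)
y_obs = np.where(miss, np.nan, y)

mu, cov = em_bivariate_normal(x, y_obs)
print("complete-case mean of y:", np.nanmean(y_obs))
print("EM estimate of the mean of y:", mu[1])
```

The E-step adds the residual variance to the expected squares rather than just plugging in predicted values; this is what distinguishes EM from plain regression imputation. In this MAR setup the complete-case mean of y is pulled downward, while the EM estimate uses the incomplete cases' x values to correct for it.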


Multiple Imputation
Impute: missing data are filled in with imputed values using a specified regression model. This step is repeated m times, resulting in a separate data set each time.
Analyze: analyses are performed within each data set.
Pool: results are pooled into one estimate.
The variances are pooled with Donald Rubin's formula:
V = W + (1 + 1/m) B,
where W and B are the within-imputation and between-imputation variances.
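
The pooling step can be sketched as a small helper. The function name `pool_rubin` and the example numbers are hypothetical; the pooled point estimate is taken to be the average of the m estimates, and V follows the formula above.

```python
import numpy as np


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    pooled = estimates.mean()    # pooled point estimate
    w = variances.mean()         # within-imputation variance W
    b = estimates.var(ddof=1)    # between-imputation variance B
    v = w + (1 + 1 / m) * b      # total variance V
    return pooled, v


# Hypothetical results from m = 3 imputed data sets.
est, total_var = pool_rubin([10.0, 12.0, 11.0], [4.0, 5.0, 3.0])
print(est, total_var)
```

For these numbers W = 4 and B = 1, so V = 4 + (4/3)(1), which is larger than the variance any single imputed data set would report.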

Advantages:
Variability is more accurate with multiple imputations for each missing value.
Considers variability due to sampling AND variability due to imputation.

Disadvantages:
Cumbersome coding.
Room for error when specifying models.

By comparison, the ML procedure uses the observed-data likelihood function to provide parameter estimates based on all available data, including the incomplete cases. However, simulation studies show that ML is an inadequate estimation technique for some small sample problems and results in biased estimates (Little and Rubin, 1989). For large samples ML is a preferred method for dealing with missing data (Schafer and Graham, 2002).

Difference between the EM Algorithm and MI
With the EM algorithm, we substituted a predicted value on the basis of the variables that were available for each case. In multiple imputation we do something similar, but add error components to counteract the tendency of EM and maximum likelihood to underestimate standard errors.
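
The role of the error components can be sketched with a simplified version of multiple imputation. Assumptions are labeled here and in the comments: the data are simulated, the regression coefficients are held fixed at their complete-case estimates (a full implementation would also draw the coefficients themselves), and the quantity being estimated is the mean of y.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 500, 20
x = rng.normal(0, 1, n)
y_true = 1.0 + 2.0 * x + rng.normal(0, 1, n)
missing = rng.random(n) < 0.3
y_obs = np.where(missing, np.nan, y_true)
cc = ~missing

# Regression of y on x from the complete cases, with its residual SD.
slope, intercept = np.polyfit(x[cc], y_obs[cc], deg=1)
resid_sd = np.std(y_obs[cc] - (intercept + slope * x[cc]), ddof=2)

means, variances = [], []
for _ in range(m):
    # Predicted value plus a random error component for each missing y.
    # Simplification: the regression coefficients are not redrawn.
    noise = rng.normal(0, resid_sd, missing.sum())
    y_imp = y_obs.copy()
    y_imp[missing] = intercept + slope * x[missing] + noise
    means.append(y_imp.mean())                # analyze: estimate the mean
    variances.append(y_imp.var(ddof=1) / n)   # and its squared standard error

means = np.array(means)
w = np.mean(variances)            # within-imputation variance
b = means.var(ddof=1)             # between-imputation variance
total = w + (1 + 1 / m) * b       # pooled variance
print(f"pooled mean {means.mean():.3f}, pooled SE {np.sqrt(total):.3f}")
```

Because each data set receives different random errors, the m estimates differ from one another, and that between-imputation spread is what raises the pooled standard error above the single-imputation value.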


Roth, 1994


References
Allison, P. D. (2001). Missing data. Sage University Papers Series on Quantitative Applications in the Social Sciences. Thousand Oaks, CA: Sage.
Cole, J. C. (2008). How to deal with missing data. In J. W. Osborne (Ed.), Best practices in quantitative methods (pp. 214-238). Thousand Oaks, CA: Sage.
Enders, C. (2010). Applied missing data analysis. New York: Guilford Press.
Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data. Hoboken, NJ: John Wiley & Sons.
Roth, P. (1994). Missing data: A conceptual review for applied psychologists. Personnel Psychology, 47, 537-560.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7, 147-177.
