Missing Data

Missing data and how to handle it
Arooj Arshad
PhD Scholar
Goals
Discuss ways to evaluate and understand missing data
Discuss common missing data methods
Know the advantages and disadvantages of common methods
Treatment of the missing data
Efficient ways of missing data handling
Reasons for Missing Data

Missing data can occur for many reasons:
Participants can fail to respond to questions (legitimately or illegitimately; more on that later).
Equipment and data collecting or recording mechanisms can malfunction.
Subjects can withdraw from studies before they are completed.
Data entry errors can occur.
Difference between missing and legitimate missing data

Missing Data
If any data on any variable from any participant is not present, the researcher is dealing with missing or incomplete data.
Example: a missing response on a particular item that assesses a particular construct.

Legitimate Missing Data
Legitimate missing data is an absence of data when it is appropriate for there to be an absence.
Example: a survey asks whether you are married and, if so, how long you have been married. If you say you are not married, it is legitimate for you to skip the follow-up question.
(Cole, 2008)
Methods for analyzing missing data require assumptions about the nature of the data and about the reasons for the missing observations, and these assumptions are often not acknowledged.

Reviewing the stages of data collection, data preparation, data analysis, and interpretation of results will highlight the issues that researchers must consider in making a decision about how to handle missing data in their work.
Key Elements of Missingness
The number of cases missing per variable.
The number of variables missing per case.
The pattern of correlation among variables.
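The first two elements can be counted directly. The sketch below is only an illustration on a small made-up array (the data and variable names are hypothetical, not from the study). It also correlates the missingness indicators, which is one possible reading of the third element.

```python
import numpy as np

# Hypothetical data: 6 cases (rows) by 3 variables (columns); np.nan marks a missing value.
data = np.array([
    [1.0, 2.0, np.nan],
    [2.0, np.nan, np.nan],
    [3.0, 1.0, 4.0],
    [np.nan, 5.0, 6.0],
    [5.0, 4.0, 2.0],
    [6.0, np.nan, np.nan],
])

missing = np.isnan(data)                     # True where a value is missing

missing_per_variable = missing.sum(axis=0)   # number of cases missing for each variable
missing_per_case = missing.sum(axis=1)       # number of variables missing for each case

# Assumption: "pattern of correlation" is read here as the correlation among the
# missingness indicators, i.e. whether variables tend to be missing together.
indicator_corr = np.corrcoef(missing.astype(float), rowvar=False)

print("missing per variable:", missing_per_variable)
print("missing per case:", missing_per_case)
print(indicator_corr.round(2))
```

A positive entry in the indicator correlation matrix suggests that two variables tend to be missing for the same cases.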
Point to be remembered
All researchers should examine their data for missingness, and researchers wanting the best (i.e., the most replicable and generalizable) results from their research need to be prepared to deal with missing data in the most appropriate and desirable way possible.
If the proportion of cases with missing data is small, say five percent or less, listwise deletion may be acceptable (Roth, 1994). However, even when 5% or fewer of the cases are affected, inconsistent parameter estimates can result if the data are not missing completely at random. Otherwise, missing data experts (Little & Rubin, 1987) recommend using an ML method for analysis, a method that makes use of all available data points.
Nature of Missingness
Missing Completely at Random (MCAR)
The probability of missing data on Y is unrelated to both Y and X.
Missingness is random and does not depend on anything.
Example: the reporting of income by the respondents.
Checked with Little's MCAR test. The test is based on mean differences across groups of subjects with the same missing data pattern. Readers interested in more detail should read Shenoi et al. (2012).
Missing at Random (MAR)
The probability of missing data on Y is related to X, but not to Y itself once X is taken into account.
Example: for really sick patients, clinicians may not draw blood for routine labs.
Missing Not at Random (MNAR)
The probability of missing data on Y depends on the value of Y itself.
Example: respondents with high income are less likely to report income.
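The three mechanisms can be illustrated with a small simulation. This is a sketch under assumed settings: the variables, coefficients, and logistic missingness model are hypothetical choices for illustration. It compares the mean of the observed Y values under each mechanism with the true mean of 0.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical variables: x is always observed, y (e.g. income) may be missing.
x = rng.normal(size=n)
y = 0.6 * x + rng.normal(scale=0.8, size=n)   # true mean of y is 0


def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))


# Probability that y is missing under each mechanism.
p_missing = {
    "MCAR": np.full(n, 0.3),            # unrelated to x and y
    "MAR": logistic(2.0 * x - 1.0),     # depends only on the observed x
    "MNAR": logistic(2.0 * y - 1.0),    # depends on the value of y itself
}

observed_means = {}
for name, p in p_missing.items():
    is_missing = rng.random(n) < p
    observed_means[name] = y[~is_missing].mean()   # mean of the observed y values only

for name, value in observed_means.items():
    print(name, round(value, 3))
```

In this setup the observed mean stays close to 0 only under MCAR. Under MAR the shift can in principle be corrected using X, whereas under MNAR the observed data alone do not contain the information needed to correct it.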
Missing Data Consequences

Bias
The estimate systematically deviates from the quantity of interest.
There is no bias if the data are MCAR, but bias can occur when the data are not MCAR.
Lost data decrease statistical power.
Variance
Missing data can sometimes lead to wrong standard errors.
Wrong study conclusions about the relationship of variables to outcomes.
(Roth, 2001)
Commonly-Used Missing Data Handling Methods

Deletion Methods
Listwise/complete case deletion, pairwise deletion
Single Imputation Methods
Mean/mode substitution, dummy variable method, single regression, hot-deck imputation
Model-Based Methods
Maximum likelihood, multiple imputation
Deletion Method
Listwise Deletion (Complete Case Analysis)
Analyze only the cases with complete data, dropping every case that has a missing value on any variable.
When a researcher is estimating a model, such as a linear regression, most statistical packages use listwise deletion by default.
(Cole, 2008)
Listwise Deletion (Complete Case Analysis)
Advantages
Ease of implementation.
Comparability across analyses.
Disadvantages
Reduces statistical power (because it lowers n, a researcher cannot anticipate whether an adequate amount of data will remain for the analysis).
Doesn't use all information.
Estimates may be biased if the data aren't MCAR (complete case analysis assumes that the observed complete cases are a random sample of the originally targeted sample or, in Rubin's (1976) terminology, that the missing data are MCAR).
Pairwise Deletion (Available Case Analysis)
Each analysis uses all cases in which the variables of interest are present.
Advantages:
Keeps as many cases as possible for each analysis.
Uses all information possible with each analysis.
Disadvantage:
Can't compare analyses because the sample is different each time.
(Cole, 2008)
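The two deletion methods can be contrasted on a small made-up array. This is a sketch with hypothetical data, not output from any particular statistical package. Listwise deletion yields one common sample, while pairwise deletion uses a different sample size for each pair of variables.

```python
import numpy as np

# Hypothetical data: 6 cases by 3 variables (a, b, c); np.nan marks a missing value.
data = np.array([
    [1.0, 2.0, 3.0],
    [2.0, np.nan, 1.0],
    [3.0, 4.0, np.nan],
    [4.0, 3.0, 5.0],
    [np.nan, 6.0, 4.0],
    [6.0, 5.0, 6.0],
])

# Listwise deletion: keep only the cases with no missing value on any variable.
complete = data[~np.isnan(data).any(axis=1)]
listwise_n = complete.shape[0]
listwise_corr = np.corrcoef(complete, rowvar=False)

# Pairwise deletion: for each pair of variables, use every case observed on both.
k = data.shape[1]
pairwise_n = np.zeros((k, k), dtype=int)
pairwise_corr = np.eye(k)
for i in range(k):
    for j in range(k):
        both = ~np.isnan(data[:, i]) & ~np.isnan(data[:, j])
        pairwise_n[i, j] = both.sum()
        if i != j:
            pairwise_corr[i, j] = np.corrcoef(data[both, i], data[both, j])[0, 1]

print("listwise n:", listwise_n)
print("pairwise n per pair:")
print(pairwise_n)
```

Here listwise deletion keeps 3 of the 6 cases, while each pairwise correlation is based on 4 cases, and not the same 4 each time. That is why results from pairwise deletion are hard to compare across analyses.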
Hot-Deck Imputation
The researcher replaces a missing value with the actual score from a similar case in the current data set.
The imputed score is termed "hot" because it comes from data currently in use by the computer.
Advantages
Tends to increase accuracy because missing values are replaced by realistic values.
Particularly helpful when data are missing in certain patterns.
Disadvantages
The number of classification variables may become unmanageable in large surveys.
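A minimal sketch of one common hot-deck variant follows: a random donor is drawn from the cases that match on the classification variables. The records, field names, and the hot_deck helper are hypothetical and only illustrate the idea.

```python
import random

# Hypothetical survey records; None marks a missing income.
records = [
    {"id": 1, "region": "north", "employed": True, "income": 52},
    {"id": 2, "region": "north", "employed": True, "income": None},
    {"id": 3, "region": "north", "employed": False, "income": 18},
    {"id": 4, "region": "south", "employed": True, "income": 47},
    {"id": 5, "region": "south", "employed": True, "income": 61},
    {"id": 6, "region": "south", "employed": True, "income": None},
]


def hot_deck(rows, target, class_vars, seed=0):
    """Fill missing `target` values with the observed value of a random donor
    from the same cell of the classification variables."""
    rng = random.Random(seed)
    # Group the observed values by classification cell.
    donors = {}
    for row in rows:
        if row[target] is not None:
            cell = tuple(row[v] for v in class_vars)
            donors.setdefault(cell, []).append(row[target])
    imputed = []
    for row in rows:
        new_row = dict(row)                  # leave the original records untouched
        if new_row[target] is None:
            cell = tuple(new_row[v] for v in class_vars)
            if cell in donors:               # keep the gap if no similar case exists
                new_row[target] = rng.choice(donors[cell])
        imputed.append(new_row)
    return imputed


completed = hot_deck(records, "income", ["region", "employed"])
for row in completed:
    print(row["id"], row["income"])
```

Each added classification variable splits the donor cells further, so some cells may end up with no donor at all. This is the practical side of the disadvantage noted above.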
Single Imputation Methods
Mean/mode substitution
Dummy variable control
Conditional mean substitution
Mean/Mode Substitution
Replace each missing value with the sample mean or mode.
Run analyses as if all cases were complete.
Advantages
Can use complete case analysis methods.
Disadvantages
Reduces variability (underestimates standard errors).
Weakens covariance and correlation estimates in the data (because it ignores the relationship between variables).
Computed variance estimates decrease as more means are added to the calculations.
For example, a researcher might have 30 subjects, but 5 have missing data. Through mean substitution we add 5 means to the 25 scores. This would increase the N in the calculation of the variance but would not increase the deviations around the mean.
Mean substitution is worth considering when correlations between variables in the data are low and less than 10% of the data are missing (Donner, 1982).
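The 30-subject example can be checked numerically. The scores below are made up for illustration. The point is only that the sum of squared deviations stays the same while the denominator N - 1 grows from 24 to 29.

```python
import statistics

# Hypothetical scores for the 25 subjects who responded (5 of the 30 are missing).
observed = [12, 15, 9, 20, 14, 11, 18, 16, 13, 10,
            17, 19, 8, 15, 14, 12, 16, 13, 11, 18,
            15, 14, 17, 10, 13]

mean_observed = statistics.mean(observed)
filled = observed + [mean_observed] * 5        # mean substitution for the 5 missing scores

var_observed = statistics.variance(observed)   # sum of squares / (25 - 1)
var_filled = statistics.variance(filled)       # same sum of squares / (30 - 1)

print("variance, observed only:", var_observed)
print("variance, after mean substitution:", var_filled)
print("shrinkage factor:", var_filled / var_observed)
```

The substituted means add nothing to the sum of squares, so the variance shrinks by the factor 24/29 regardless of the particular scores.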
Dummy Variable Adjustment
Create an indicator for the missing value (1 = value is missing for the observation; 0 = value is observed for the observation).
Impute missing values to a constant (such as the mean).
Advantage
Uses all available information about the missing observation.
Disadvantages
Results in biased estimates.
Not theoretically driven.
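The two steps can be sketched as follows, assuming a single predictor with gaps and a fully observed outcome. The data and variable names are hypothetical, and the regression is fitted with ordinary least squares purely for illustration.

```python
import numpy as np

# Hypothetical predictor x with missing values (np.nan) and a fully observed outcome y.
x = np.array([2.0, np.nan, 4.0, 5.0, np.nan, 7.0, 8.0, 3.0])
y = np.array([3.1, 4.0, 5.2, 6.1, 5.5, 8.0, 9.2, 4.1])

# Step 1: indicator for the missing value (1 = missing, 0 = observed).
missing_indicator = np.isnan(x).astype(float)

# Step 2: impute the missing values to a constant, here the observed mean.
x_filled = np.where(np.isnan(x), np.nanmean(x), x)

# Regress y on the filled predictor and the indicator (plus an intercept).
design = np.column_stack([np.ones_like(y), x_filled, missing_indicator])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, slope_x, shift_missing = coef

print("slope for x:", round(slope_x, 3))
print("shift for cases with missing x:", round(shift_missing, 3))
```

All eight cases stay in the analysis, but, as the disadvantages above note, the resulting coefficients are generally biased.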
Regression Imputation
Replaces missing values with the predicted score from a regression equation.
Advantage:
Uses information from the observed data.
Disadvantages:
Overestimates model fit and correlation estimates.
Underestimates variance.
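A minimal sketch of single regression imputation with one predictor follows; the data are hypothetical. It also shows why model fit is overstated: every imputed point lies exactly on the fitted line.

```python
import numpy as np

# Hypothetical data: x fully observed, y missing (np.nan) for some cases.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, np.nan, 8.2, 9.8, np.nan, 14.1, 16.0])

observed = ~np.isnan(y)

# Fit y = b0 + b1 * x on the complete cases only.
b1, b0 = np.polyfit(x[observed], y[observed], deg=1)

# Replace each missing y with its predicted score.
y_imputed = y.copy()
y_imputed[~observed] = b0 + b1 * x[~observed]

# Residuals around the fitted line: imputed cases contribute exactly zero.
resid_imputed_cases = y_imputed[~observed] - (b0 + b1 * x[~observed])

r_before = np.corrcoef(x[observed], y[observed])[0, 1]
r_after = np.corrcoef(x, y_imputed)[0, 1]

print("correlation, complete cases:", round(r_before, 4))
print("correlation, after imputation:", round(r_after, 4))
```

Because the imputed scores carry no residual scatter, the correlation after imputation can only stay the same or increase, and the residual variance is understated.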
Model-Based Methods
Maximum likelihood using the EM algorithm
Multiple imputation
These methods share two assumptions: that the joint distribution of the data is multivariate normal, and that the missing data mechanism is ignorable.
Maximum Likelihood Using the EM Algorithm
Identifies the set of parameter values that produces the highest log-likelihood.
ML estimate: the value that is most likely to have resulted in the observed data.
Conceptually, the process is the same with or without missing data.
Advantages:
Uses full information (both complete cases and incomplete cases) to calculate the log-likelihood.
Unbiased parameter estimates with MCAR/MAR data.
Disadvantages:
Standard errors are biased downward; this can be adjusted by using the observed information matrix.
We can base estimation on the likelihood of the observed data.
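As an illustration of the idea, here is a minimal EM sketch for the simplest case: two jointly normal variables, with x fully observed and y missing for some cases. The function name, the simulated data, and the MAR rule (y missing when x exceeds 0.5) are all assumptions made for this example. It is not a general-purpose implementation.

```python
import numpy as np


def em_bivariate_normal(x, y, n_iter=200):
    """EM estimates of the mean and covariance of (x, y) when x is fully
    observed and y contains np.nan for the missing cases."""
    miss = np.isnan(y)
    # x is complete, so its ML estimates are fixed from the start.
    mu_x, s_xx = x.mean(), x.var()
    # Start the y parameters from the complete cases.
    mu_y, s_yy = y[~miss].mean(), y[~miss].var()
    s_xy = np.mean((x[~miss] - x[~miss].mean()) * (y[~miss] - mu_y))
    for _ in range(n_iter):
        # E-step: expected y and y^2 for the missing cases, given x.
        slope = s_xy / s_xx
        cond_mean = mu_y + slope * (x - mu_x)
        cond_var = s_yy - slope * s_xy
        ey = np.where(miss, cond_mean, y)
        ey2 = np.where(miss, cond_mean ** 2 + cond_var, y ** 2)
        # M-step: update the parameters from the expected sufficient statistics.
        mu_y = ey.mean()
        s_xy = np.mean(x * ey) - mu_x * mu_y
        s_yy = ey2.mean() - mu_y ** 2
    return mu_x, mu_y, s_xx, s_xy, s_yy


# Simulated example: the true mean of y is 1.0; y is missing whenever x > 0.5 (MAR).
rng = np.random.default_rng(1)
n = 20_000
x = rng.normal(size=n)
y = 1.0 + 0.8 * x + rng.normal(scale=0.6, size=n)
y_obs = np.where(x > 0.5, np.nan, y)

complete_case_mean = np.nanmean(y_obs)
mu_x, mu_y_em, s_xx, s_xy, s_yy = em_bivariate_normal(x, y_obs)

print("complete-case mean of y:", round(complete_case_mean, 3))
print("EM estimate of the mean of y:", round(mu_y_em, 3))
```

In this setup the complete-case mean falls well below 1.0, while the EM estimate, which also uses the x values of the incomplete cases, should land close to it.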
Multiple Imputation
Impute: missing data are filled in with imputed values using a specified regression model. This step is repeated m times, resulting in a separate dataset each time.
Analyze: analyses are performed within each dataset.
Pool: results are pooled into one estimate.
The variance of the pooled estimate follows Donald Rubin's formula: V = W + (1 + 1/m)B, where W and B are the within-imputation and between-imputation variances.
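The pooling step can be sketched as below. The helper name pool_rubin and the five estimates are hypothetical. W is taken as the average of the squared standard errors, and B as the sample variance of the m point estimates.

```python
import statistics


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors."""
    m = len(estimates)
    pooled_estimate = statistics.mean(estimates)
    w = statistics.mean(variances)        # within-imputation variance W
    b = statistics.variance(estimates)    # between-imputation variance B (divides by m - 1)
    v = w + (1 + 1 / m) * b               # total variance V = W + (1 + 1/m) B
    return pooled_estimate, v


# Hypothetical results from m = 5 imputed datasets.
estimates = [2.0, 2.5, 1.5, 2.2, 1.8]
variances = [0.30, 0.28, 0.32, 0.29, 0.31]

pooled_estimate, total_variance = pool_rubin(estimates, variances)
print("pooled estimate:", pooled_estimate)
print("total variance:", total_variance)
```

For these numbers W = 0.30 and B = 0.145, so V = 0.30 + 1.2 × 0.145 = 0.474. The between-imputation term is what carries the extra uncertainty due to the missing data.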
Multiple Imputation
Advantages:
Variability is more accurate with multiple imputations for each missing value.
Considers variability due to sampling AND variability due to imputation.
Disadvantages:
Cumbersome coding.
Room for error when specifying models.
Multiple Imputation
Using this likelihood function, the ML procedure provides parameter estimates based on all available data, including the incomplete cases. However, simulation studies show that ML is an inadequate estimation technique for some small sample problems and results in biased estimates (Little and Rubin, 1989). For large samples, ML is a preferred method for dealing with missing data (Schafer and Graham, 2002).
Difference between the EM algorithm and MI
For the EM algorithm we substituted a predicted value on the basis of the variables that were available for each case. In multiple imputation we do something similar, but add error components to counteract the tendency of EM and maximum likelihood to underestimate standard errors.
(Roth, 1994)
References
Allison, P. D. (2001). Missing Data. Sage University Papers Series on Quantitative Applications in the Social Sciences. Thousand Oaks, CA: Sage.
Cole, J. C. (2008). How to deal with missing data. In J. W. Osborne (Ed.), Best practices in quantitative methods (pp. 214-238). Thousand Oaks, CA: Sage.
Enders, C. (2010). Applied Missing Data Analysis. New York: Guilford Press.
Little, R. J., & Rubin, D. B. (2002). Statistical Analysis with Missing Data. Hoboken: John Wiley & Sons.
Roth, P. (1994). Missing data: A conceptual review for applied psychologists. Personnel Psychology, 47, 537-560.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7, 147-177.