(AS09)
EPM304 Advanced Statistical Methods in Epidemiology
This document contains a copy of the study material located within the computer
assisted learning (CAL) session.
If you have any questions regarding this document or your course, please contact
DLsupport via DLsupport@lshtm.ac.uk.
Important note: this document does not replace the CAL material found on your
module CDROM. When studying this session, please ensure you work through the
CDROM material first. This document can then be used for revision purposes to
refer back to specific sessions.
These study materials have been prepared by the London School of Hygiene & Tropical Medicine as part of
the PG Diploma/MSc Epidemiology distance learning course. This material is not licensed either for resale
or further copying.
London School of Hygiene & Tropical Medicine September 2013 v2.0
Objectives
By the end of this session you will be able to:
This session should take you between 1.5 and 2.5 hours to complete.
SM05
SM07, SM08, SM09
Poisson regression
SM11, AS05
Can you match up each situation to the corresponding reason? Select the
reason for clustering from the dropdown box to the right of each situation.
Situation (select a reason for each):
Families
School
Residential areas
Private health clinics
Reasons available in each dropdown box: Financial, Ability, Genetic, Social
That's right, if samples are taken from within private health clinics, financial
clustering will occur, because only the relatively rich can afford private
health care.
Incorrect Response: Ability (text appears on bottom right-hand side):
No, there is no reason why clustering with respect to ability should occur
within private health clinics. Please try again.
Incorrect Response: Genetic (text appears on bottom right-hand side):
No, there is no reason why genetic clustering should occur within private
health clinics. Please try again.
Incorrect Response: Social (text appears on bottom right-hand side):
There may be social clustering within private health clinics, but there is a
more fundamental reason for such clustering. Please try again.
intra-cluster correlation
If some index cases are more infectious than others, or household members share
previous exposures to TB, then the outcome (i.e. the result of the Mantoux test)
should show some correlation within households.
You are going to look at the effect of 3 different approaches to the analysis of
correlated data to test this hypothesis:
1. Robust standard errors
2. Generalised estimating equations
3. Random effects models
(also called multi-level modelling)
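Tuberculin status of household contacts, by HIV status of the index case:

                       HIV -ve index case    HIV +ve index case    Total
                       n       %             n       %             n       %
Tuberculin negative    36      28.6          87      48.1          123     40.1
Tuberculin positive    90      71.4          94      51.9          184     59.9
Total                  126     100.0         181     100.0         307     100.0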
From this table, what can you say about the prevalence of tuberculin positivity and
odds of tuberculin positivity?
Interaction: Button: clouds picture (pop up box appears):
Overall, 60% (184 / 307) of household contacts were tuberculin-positive.
Odds = 184 / 123 = 1.50
The prevalence of tuberculin positivity appears lower among the contacts of HIV +ve
index cases (52%) than among contacts of HIV -ve index cases (71%).
The respective odds of tuberculin positivity are 1.08 (94 / 87) for contacts of
HIV +ve index cases and 2.50 (90 / 36) for contacts of HIV -ve index cases.
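The figures in this answer can be checked directly from the four counts of household contacts. The following is a minimal Python sketch; the variable and function names are illustrative and are not part of the course material.

```python
# Counts of household contacts quoted in the text above.
# Names are illustrative only.
pos_hiv_pos, neg_hiv_pos = 94, 87   # contacts of HIV +ve index cases
pos_hiv_neg, neg_hiv_neg = 90, 36   # contacts of HIV -ve index cases


def prevalence(pos, neg):
    """Proportion of contacts who are tuberculin-positive."""
    return pos / (pos + neg)


def odds(pos, neg):
    """Odds of tuberculin positivity: positives divided by negatives."""
    return pos / neg


total_pos = pos_hiv_pos + pos_hiv_neg
total_neg = neg_hiv_pos + neg_hiv_neg

print(f"Overall:  prevalence {prevalence(total_pos, total_neg):.1%}, "
      f"odds {odds(total_pos, total_neg):.2f}")
print(f"HIV +ve:  prevalence {prevalence(pos_hiv_pos, neg_hiv_pos):.1%}, "
      f"odds {odds(pos_hiv_pos, neg_hiv_pos):.2f}")
print(f"HIV -ve:  prevalence {prevalence(pos_hiv_neg, neg_hiv_neg):.1%}, "
      f"odds {odds(pos_hiv_neg, neg_hiv_neg):.2f}")
```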
Interaction: Button: Swap (table from previous page changes to the following):
Classical analysis
Odds ratio                 0.43
X² (chi-squared)           11.72
P > X²                     0.0006
95% confidence interval    0.26 to 0.71
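The odds ratio, chi-squared statistic and P-value of the classical analysis can be reproduced from the 2×2 counts (94, 87, 90, 36). This sketch assumes the chi-squared statistic is the Mantel-Haenszel form for a single table, which is the form that matches the value 11.72. The confidence interval is not reproduced because the method used to obtain it is not stated.

```python
import math

a, b = 94, 87   # HIV +ve index case: tuberculin +ve, tuberculin -ve
c, d = 90, 36   # HIV -ve index case: tuberculin +ve, tuberculin -ve
n = a + b + c + d

# Odds ratio: odds in contacts of HIV +ve cases / odds in contacts of HIV -ve cases
odds_ratio = (a / b) / (c / d)

# Mantel-Haenszel chi-squared for one 2x2 table (assumed form):
# Pearson's statistic multiplied by (n - 1) / n.
chi2 = (n - 1) * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Upper tail of a chi-squared distribution on 1 df: P(X > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"OR = {odds_ratio:.2f}, chi-squared = {chi2:.2f}, P = {p_value:.4f}")
```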
π × π × (1 − π) × ... × (1 − π)
Click here to review this from session SM05.
In other words, likelihood was used to derive the best estimate of the parameter and
its standard error. Standard errors like this are sometimes called "model-based"
standard errors; these are the ones you will be most familiar with from this course.
Variance: based on Σi ri², the sum of the squared residuals ri over units i.
The ri terms are the residuals.
The residuals are the difference between the outcome observed and the outcome
predicted by the model.
When observations are independent, then the summation is performed on the
individual-level residuals.
If data are "clustered", then cluster-level residuals are calculated and summed over
the clusters.
Note: This does not make any assumptions about independence within clusters but
does assume that there is independence between clusters.
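The difference between summing individual-level and cluster-level residuals can be sketched for the simplest possible "model", a single overall proportion. The data below are hypothetical, and the sketch omits the small-sample corrections that statistical software typically applies, so it illustrates the idea rather than any package's exact calculation.

```python
import math

# Hypothetical binary outcomes grouped by cluster (e.g. household).
clusters = {
    "A": [1, 1, 1],
    "B": [0, 0],
    "C": [1, 0],
    "D": [0, 0, 0],
}

outcomes = [y for ys in clusters.values() for y in ys]
n = len(outcomes)
p_hat = sum(outcomes) / n   # predicted outcome for everyone under this model

# Treating observations as independent: square each individual residual, then sum.
var_indep = sum((y - p_hat) ** 2 for y in outcomes) / n ** 2

# Allowing for clustering: sum the residuals within each cluster first,
# then square and sum the cluster-level totals.
cluster_resid = [sum(y - p_hat for y in ys) for ys in clusters.values()]
var_cluster = sum(r ** 2 for r in cluster_resid) / n ** 2

print(f"SE assuming independence: {math.sqrt(var_indep):.3f}")
print(f"SE using cluster-level residuals: {math.sqrt(var_cluster):.3f}")
```

Because outcomes in these hypothetical clusters tend to agree, residuals within a cluster reinforce rather than cancel, and the cluster-based standard error is the larger of the two.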
Mantoux     Coefficient   Standard err.   z        P>|z|     95% confidence interval
HIV         -0.838904     0.247025        -3.396   0.001     -1.323065   -0.354744
Constant     0.916291     0.197203         4.646   <0.001     0.529781    1.302801
Log likelihood = -200.70621
The table shows estimates of a logistic regression model. This model ignores any
household clustering. Click below for further explanation of the model.
Interaction: Button: Explanation (pop up box appears):
This model is similar to many models we have fitted before. All the calculations are
performed assuming that all observations are independent. The parameter estimate
of the log(OR) for the effect of HIV (-0.8389) is obtained by maximising the log
likelihood. The standard error of this estimate (0.247025) is obtained through the
quadratic approximation to the log likelihood.
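With a single binary exposure the logistic model fits the 2×2 table exactly, so the maximum-likelihood estimates and their model-based standard errors have a closed form in terms of the four counts. The sketch below uses that closed form (an assumption of this illustration, not the iterative fitting that software performs) to check the quoted coefficients and standard errors.

```python
import math

a, b = 94, 87   # HIV +ve index case: tuberculin +ve, tuberculin -ve
c, d = 90, 36   # HIV -ve index case: tuberculin +ve, tuberculin -ve

# Constant = log odds in the baseline group (contacts of HIV -ve index cases)
constant = math.log(c / d)
se_constant = math.sqrt(1 / c + 1 / d)

# HIV coefficient = log odds ratio, with its model-based standard error
log_or = math.log((a / b) / (c / d))
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

z = log_or / se_log_or

print(f"log(OR) = {log_or:.6f}, SE = {se_log_or:.6f}, z = {z:.3f}")
print(f"constant = {constant:.6f}, SE = {se_constant:.6f}")
```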
How does this estimate compare to the one we obtained earlier using classical
methods?
Interaction: Button: clouds picture (pop up box appears):
On the odds ratio scale we obtain the same estimate (0.432) and a similar
confidence interval to those we obtained using classical methods.
Mantoux     Coefficient   Standard err.   z        P>|z|     95% confidence interval
HIV1        -0.838904     0.247429        -3.390   0.001     -1.323855   -0.353953
Constant     0.916291     0.197525         4.639   <0.001     0.529150    1.303432
Log likelihood = -200.70621
Mantoux     Coefficient   Standard err.   z        P>|z|     95% confidence interval
HIV         -0.838904     0.332274        -2.525   0.012     -1.490150   -0.187658
Constant     0.916291     0.266485         3.438   0.001      0.393989    1.438593
Log likelihood = -200.70621
Interaction: Tabs: Question 1:
How do the standard errors compare to the two previous models and how will this
affect the inference made about the effect of HIV contact on tuberculin positivity?
Interaction: Button: clouds picture (pop up box appears):
The standard errors are now quite a lot bigger. The standard error of log(OR) has
increased from 0.25 to 0.33.
As a consequence, the P-value for the null hypothesis, that there is no association
between the HIV status of the index case and the odds of tuberculin positivity in
household contacts, has also got bigger.
Interaction: Tabs: Question 2:
Converting back to the odds ratio scale, we obtain OR = exp(-0.839) = 0.43 (95%
CI: 0.23, 0.83); P = 0.012.
So what can you conclude about the effect of HIV contact from this analysis?
Interaction: Button: clouds picture (pop up box appears):
After allowing for clustering within households (using robust standard errors),
there is still evidence of an association between HIV contact and tuberculin
positivity. The odds of tuberculin positivity for contacts of HIV +ve TB cases are
approximately half those for contacts of HIV -ve TB cases.
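The conversion from the log odds ratio and its robust standard error to the odds ratio, confidence interval and P-value follows the usual Wald calculation. A minimal sketch, using the coefficient (-0.838904) and cluster-adjusted standard error (0.332274) from the output; the function name is illustrative.

```python
import math


def wald_summary(log_or, se, z_crit=1.959964):
    """Return (OR, lower 95% limit, upper 95% limit, two-sided P)."""
    odds_ratio = math.exp(log_or)
    lower = math.exp(log_or - z_crit * se)
    upper = math.exp(log_or + z_crit * se)
    z = log_or / se
    # Two-sided P-value from the standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2))
    return odds_ratio, lower, upper, p


or_robust, lower, upper, p = wald_summary(-0.838904, 0.332274)
print(f"OR = {or_robust:.2f} (95% CI: {lower:.2f}, {upper:.2f}); P = {p:.3f}")
```

The point estimate is unchanged by the robust method; only the width of the interval and the P-value depend on which standard error is supplied.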
Mantoux     Coefficient   Standard err.   z        P>|z|     95% confidence interval
HIV         -0.838904     0.247025        -3.396   0.001     -1.323065   -0.354744
Constant     0.916291     0.197203         4.646   <0.001     0.529781    1.302801
You should notice that the parameter estimates and standard errors are identical to
those we obtained using the standard likelihood approach using a logistic regression
model.
This is because the model:
a) assumes independence
b) does not use robust standard errors
Mantoux     Coefficient   Standard err.   z        P>|z|     95% confidence interval
HIV         -0.838904     0.332274        -2.525   0.012     -1.490150   -0.187658
Constant     0.916291     0.266485         3.438   0.001      0.393989    1.438593
Click below to view the model that allows for correlation within households, which
you saw on the previous page.
Interaction: Button: Swap (table changes to the following):
Estimates from a logistic regression model with robust standard errors, adjusted for clustering
Mantoux     Coefficient   Standard err.   z        P>|z|     95% confidence interval
Ihiv_2      -0.838904     0.332274        -2.525   0.012     -1.490150   -0.187658
_cons        0.916291     0.266485         3.438   0.001      0.393989    1.438593
How does the standard error for the log(OR) in the GEE analysis with robust
standard errors compare to the earlier model?
Interaction: Button: clouds picture (pop up box appears):
The standard error obtained from this analysis is identical to that obtained using the
robust standard error approach that allows for correlation within households.
The GEE analysis automatically adjusts the standard errors to take account of the
within-household correlation. However, it has taken no account of the within-household
correlation when obtaining the parameter estimate (the log odds ratio).
Mantoux     Coefficient   Standard err.   z        P>|z|     95% confidence interval
HIV         -0.968266     0.327340        -2.958   0.003     -1.609840   -0.326692
Constant     1.010946     0.261726         3.863   <0.001     0.497972    1.523920
Do you know how this analysis accounts for the within household correlation in the
estimate of the log(OR)?
Interaction: Button: clouds picture (pop up box appears):
This analysis takes account of within-household correlation when estimating the log
odds ratio, i.e. it gives relatively less weight to contacts in large households.
On an odds ratio scale, the model estimates are:
OR = 0.38 (95% CI: 0.20, 0.72)
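One way to see why contacts in large households receive less weight: when estimating a simple mean under an exchangeable correlation ρ within clusters, the optimally weighted estimate gives each observation in a cluster of size n a weight proportional to 1 / (1 + (n − 1)ρ). The sketch below illustrates that linear special case only. It is not the GEE fitting algorithm for logistic regression, and ρ = 0.3 and the household sizes are assumed values chosen for illustration.

```python
def per_observation_weight(cluster_size, rho):
    """Relative weight of one observation in a cluster of the given size,
    under an exchangeable within-cluster correlation rho (linear case)."""
    return 1.0 / (1.0 + (cluster_size - 1) * rho)


rho = 0.3  # assumed within-household correlation, for illustration only

for size in (1, 2, 5, 10):
    w = per_observation_weight(size, rho)
    # size * w is the total weight carried by the whole household
    print(f"household size {size:2d}: weight per contact {w:.3f}, "
          f"total for household {size * w:.3f}")
```

With ρ = 0 every contact has weight 1 and the estimate reverts to the one that ignores clustering; as ρ increases, each additional contact in the same household contributes less new information.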
where (HIV status)j is an indicator variable which takes the value 0 if the index case
for household contact j is HIV -ve and 1 if the index case for household contact j is
HIV +ve.
For logistic regression models we usually assume that the ui are normally distributed,
with mean 0 and variance σu².
The only extra parameter that has to be estimated is σu², rather than trying to
estimate a specific value of ui for each household.
In this situation, the random effects model states that for individual j in household i
the log odds of tuberculin positivity are given by:
log(odds)ij = constant + (log odds ratio for HIV) × (HIV status)j + ui
                            Coefficient   Standard err.   z        P>|z|    95% confidence interval
HIV                         -1.148913     0.393322        -2.921   0.003    -1.919810   -0.378017
Constant                     1.198623     0.316097         3.792   0.000     0.579085    1.818162
Log of variance of ui       -0.034562     0.496553        -0.070   0.945    -1.007788    0.938665
Standard deviation of ui     0.982868     0.244023                           0.604173    1.598926
Intra-cluster correlation    0.491360     0.124101                           0.267413    0.718830
Log likelihood = -194.26
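Three of the values in the output above (0.034562, 0.982868 and 0.491360) describe the between-household variation. They are numerically consistent with being, respectively, the log of the random-effect variance (with a lost minus sign), the random-effect standard deviation, and a correlation computed as variance / (1 + variance). The sketch below checks that reading; it is an inference from the numbers, not a statement of how any particular software defines these quantities.

```python
import math

# Values as reported; the negative sign on the first is inferred, because
# only a negative value reproduces the reported standard deviation.
log_var_u = -0.034562
sigma_u = 0.982868
rho = 0.491360

# Standard deviation implied by the log variance
sigma_from_log = math.exp(log_var_u / 2)

# Correlation implied by the standard deviation (assumed relationship)
rho_from_sigma = sigma_u ** 2 / (1 + sigma_u ** 2)

print(f"sigma_u implied by log variance: {sigma_from_log:.6f}")
print(f"rho implied by sigma_u:          {rho_from_sigma:.6f}")
```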
The model gives an estimate of the log odds ratio for the effect of HIV (-1.1489)
and its standard error (0.39).
Converting back to the odds ratio scale we obtain OR = 0.32 (95% CI: 0.15, 0.69).
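To make the random effects model concrete, the fitted values can be turned into predicted probabilities for households with different values of the household effect ui. This is a minimal sketch: it assumes that 0.982868 in the output is the standard deviation of ui, and the function name is illustrative.

```python
import math

CONSTANT = 1.198623     # log odds for contacts of HIV -ve index cases, ui = 0
HIV_COEF = -1.148913    # log odds ratio for an HIV +ve index case
SIGMA_U = 0.982868      # assumed: standard deviation of the household effect ui


def prob_positive(hiv_positive_index, u):
    """Probability of tuberculin positivity for a contact in a household
    with random effect u (hiv_positive_index is 0 or 1)."""
    log_odds = CONSTANT + HIV_COEF * hiv_positive_index + u
    return 1 / (1 + math.exp(-log_odds))


for label, u in (("u = -1 SD", -SIGMA_U), ("u = 0", 0.0), ("u = +1 SD", SIGMA_U)):
    print(f"{label:10s} HIV -ve index: {prob_positive(0, u):.2f}   "
          f"HIV +ve index: {prob_positive(1, u):.2f}")

# Within any one household (fixed u) the odds ratio is the same:
print(f"household-specific OR = {math.exp(HIV_COEF):.2f}")
```

The predicted probabilities differ between households because of ui, but the odds ratio comparing HIV +ve with HIV -ve index cases is the same at every value of ui.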
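Log likelihood (logistic regression model ignoring clustering) = -200.71
Log likelihood (random effects model) = -194.26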
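These two values appear to be the log likelihoods of the ordinary logistic model and the random effects model, in which case they can be compared with a likelihood ratio test for the between-household variation. The sketch assumes both values are negative (the minus signs are missing above) and uses a simple chi-squared reference distribution on 1 degree of freedom. Because a variance cannot be negative, some texts halve this P-value; that refinement is not applied here.

```python
import math

ll_logistic = -200.71         # assumed: model ignoring clustering
ll_random_effects = -194.26   # assumed: random effects model

# Likelihood ratio statistic: twice the difference in log likelihoods
lrs = 2 * (ll_random_effects - ll_logistic)

# Upper tail of chi-squared on 1 df: P(X > x) = erfc(sqrt(x / 2))
p_naive = math.erfc(math.sqrt(lrs / 2))

print(f"LRS = {lrs:.2f}, P = {p_naive:.4f}")
```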
This is especially important when ρ (the intra-cluster correlation) is large
(> 0.25, say), or the number of observations per cluster (individuals per household
in our example) is large (> 20, say).
Interaction: Button: clouds picture (pop up box appears):
In the study of tuberculin positivity, a check suggests that the approximations may
not be reliable!
You will see how to check the reliability of the approximations in Practical 9.
As a result it may be safer to use the results from the GEE analysis, even though
this approach is less satisfying from a statistical point of view.
2. The point estimates, standard errors and log-likelihood obtained from a random
effects model all take account of the clustering (assuming that the random effects
distribution is correctly specified).
3. Likelihood ratio tests are valid.
4. Estimates of the between cluster variation and intra-cluster correlation are
obtained.
5. There needs to be a reasonable number of "clusters" in the dataset for the
method to be reliable.
6. When performing random effects logistic regression analysis, the reliability of the
estimates should be checked, especially when ρ (the intra-cluster correlation) is large.
Method                                       Standard error of log odds ratio   P-value
Logistic regression (ignoring clustering)    0.25                               0.001
Robust standard errors                       0.33                               0.01
GEE                                          0.33                               0.003
Random effects model                         0.39                               0.003
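One way to summarise these standard errors is as a variance inflation relative to the analysis that ignores clustering, i.e. the squared ratio of standard errors. The sketch below uses the rounded values above; the assignment of each value to a method follows the order in which the models were fitted in this session. The random effects figure is only a rough guide, since that model estimates a slightly different odds ratio.

```python
# Standard errors of the log odds ratio (rounded), by method.
se_by_method = {
    "logistic, ignoring clustering": 0.25,
    "robust standard errors": 0.33,
    "GEE": 0.33,
    "random effects": 0.39,
}

baseline = se_by_method["logistic, ignoring clustering"]

# Variance inflation = (SE allowing for clustering / SE ignoring clustering) squared
inflation = {method: (se / baseline) ** 2 for method, se in se_by_method.items()}

for method, factor in inflation.items():
    print(f"{method:32s} variance inflation {factor:.2f}")
```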
GEE and random effects models both take the above into account, but the analysis
using robust standard errors alone does not.
The odds ratio estimates produced by GEE and the random effects model are
different. This is because they actually estimate slightly different things. This is
explained on the tabs below.
Interaction: Tabs: GEE:
The odds ratio estimated by GEE is often called the population average odds ratio.
This represents:
Section 9: Summary
This is the end of AS09. When you are happy with the material covered here please
move on to session AS10.
The main points of this session will appear below as you click on the relevant title.
The problem with correlated data
When datasets contain observations that are correlated (clustered), such as:
- repeated outcome measurements on individuals
- outcome measurements on several individuals in the same household
then standard methods of data analysis are invalid, and produce confidence intervals
that are too narrow and P-values that are too small.
Robust standard errors
Robust standard errors can be obtained using residuals calculated at the cluster
level, to account for correlation. The clustering in the data is not taken into account
when estimating parameters, which is a disadvantage of this method. However, an
advantage of this method is that it always works - you do not get problems of model
convergence such as sometimes occur when using GEE or random effects models.
Another advantage is that it is a simple and intuitive approach.
Generalised estimating equations
GEEs incorporate robust standard errors and also take correlation into account when
estimating parameters, e.g. log(OR).
GEE is a pragmatic, relatively simple approach. However, sometimes there are
convergence problems with the model when using GEE, in which case the method
using robust standard errors alone (and not taking into account the within-cluster
correlation when estimating parameters), or a random-effects model, should be
used.
Random effects models
Random effects models take account of the clustering within the data when
estimating parameters and their standard errors. They are more satisfactory from a
statistical point of view than the method using robust standard errors alone, and
than GEE, because they specify a full probability model to explain the data.
They work well for Poisson models when the between-cluster variation is assumed to
follow a gamma distribution, and for quantitative outcome data when the between-cluster
variation is assumed to follow a normal distribution. So, in general, you
should use a random-effects model when you have correlated data and you are
modelling rates or a quantitative outcome.
However, for logistic regression (where the outcome variable is binary), a random
effects model may run into computational problems (the model may not converge,
or the approximations used in the model-fitting may not be reliable). When this
happens, it is preferable to use GEE or the method using robust standard errors
alone.
Constant exposure
These methods for analysing correlated data are needed when the exposure variable
is a "cluster-level" variable, i.e. it takes the same value for all individuals in a
cluster. When all exposure variables are "individual-level" variables, i.e. their values
vary among individuals in the same cluster, and the clusters are quite large, then
we may instead be able to take account of the between-cluster variation by
stratifying the analysis on cluster and using regression analysis in the usual way.