
Analysis of correlated data

(AS09)
EPM304 Advanced Statistical Methods in Epidemiology

Course: PG Diploma/ MSc Epidemiology

This document contains a copy of the study material located within the computer
assisted learning (CAL) session.
If you have any questions regarding this document or your course, please contact
DLsupport via DLsupport@lshtm.ac.uk.
Important note: this document does not replace the CAL material found on your module CD-ROM. When studying this session, please ensure you work through the CD-ROM material first. This document can then be used for revision purposes to refer back to specific sessions.
These study materials have been prepared by the London School of Hygiene & Tropical Medicine as part of
the PG Diploma/MSc Epidemiology distance learning course. This material is not licensed either for resale
or further copying.
London School of Hygiene & Tropical Medicine September 2013 v2.0

Section 1: Analysis of correlated data


Aim

To learn how to analyse correlated observations, i.e. data where observations are not independent, using robust standard errors, generalised estimating equations and random effects models.

Objectives
By the end of this session you will be able to:

Describe the effect of intra-cluster correlation


Explain why usual methods are not valid
Explain why parameter estimates need to be adjusted when data are correlated
Use 3 approaches to obtain parameter estimates allowing for correlated data.

This session should take you between 1.5 and 2.5 hours to complete.

Section 2: Planning your study


In this session you will learn how to obtain appropriate parameter estimates and
confidence intervals for data that are in some way correlated. So far in the course
you have mostly applied methods that make the assumption the data are statistically
independent, e.g. logistic regression, Poisson regression, Cox regression. With
correlated data the assumption of independence is invalid.
To begin you will look at why the usual methods are not appropriate. You will then
look at 3 different approaches to obtain estimates accounting for the fact that
observations are correlated.
During the session you will fit logistic and Poisson regression models to correlated data using 3 different approaches; you may wish to review the appropriate sessions.
Since initial concepts are based on likelihood estimation you could also review the
likelihood session.
Likelihood: SM05
Logistic regression: SM07, SM08, SM09
Poisson regression: SM11, AS05
Interaction: Hyperlink: SM05:
SM05 session opens in a new window.
Interaction: Hyperlink: SM07:
SM07 session opens in a new window.
Interaction: Hyperlink: SM08:
SM08 session opens in a new window.
Interaction: Hyperlink: SM09:
SM09 session opens in a new window.
Interaction: Hyperlink: SM10:
SM10 session opens in a new window.

2.1: Planning your study


To illustrate the ideas in this session, you will use a study carried out in Zambia on
the impact of HIV on the infectiousness of patients with pulmonary TB. Details are
given below.
Details
This was a study of the impact of HIV on the infectiousness of patients with
pulmonary TB. It was based on 70 pulmonary TB patients in Zambia, 42 of whom were HIV +ve and 28 of whom were HIV -ve.
The aim of the study was to determine whether HIV +ve patients were more or less
likely to transmit M.tuberculosis infection to their household contacts.
307 household contacts (181 contacts of HIV +ve index cases) were traced.

Section 3: When are data correlated?


When are data correlated?
The classical statistical techniques (Mantel-Haenszel) and regression models (logistic
regression, Poisson regression, Cox regression) that you have previously used all
share a common assumption about the data.
Can you remember what this assumption is?
Interaction: Button: clouds picture (pop up box appears, text and interaction
appears below):
The assumption made with the classical analyses and generalised linear regression
models we have looked at so far is that individual observations are statistically
independent of each other. In other words, there is no correlation between
individuals.
Can you think of an example where the assumption of statistical independence is not
valid?
Interaction: Button: clouds picture (text appears below):

In a cluster randomised trial groups of individuals, rather than individuals


themselves, are randomised to receive a particular intervention. Individuals in the
same cluster may, on average, be more similar to each other than to individuals in
other clusters. Therefore the assumption of independence can no longer be made.
Why may individuals in clusters be more similar? Some reasons are given below:
Individuals in a community tend to behave or respond more like other people in
the same community than others in a different community.
Individuals may have a level of exposure more like others in the same community
than individuals in a different community.
An infected individual is more likely to transmit their infection to an individual in
the same community than to an individual in a different community.

3.1: When are data correlated?


To illustrate clustering (and therefore correlation) within a population, consider the
diagram below.
The different coloured circles represent different characteristics for individuals. So
individuals with the same colour have similar characteristics.
We could select 2 groups randomly from this population. Click 'show' to do this.
Interaction: Button: Show (text appears and diagram changes as shown below):
Notice how each group contains a mixture of individuals of a certain type.

3.2: When are data correlated?


In certain situations, groups of individuals with similar characteristics will be
clustered, as shown below.
The diagram now represents 2 communities (divided by the line down the middle).
We can choose a random sample from each community. Click 'show' to do this.
Interaction: Button: Show (text appears and diagram changes as shown below):
Now, because of the clustering within communities, individuals in the same community are more similar to each other than they are to individuals in a different community. That is, individuals in the same community are correlated with each other.

3.3: When are data correlated?


The tabs below give some other examples of clustering.
Interaction: Tabs: Example 1:
Cohort studies which follow a group of individuals over a period of time, making
repeated outcome measurements on each individual. The repeated outcome
measurements on a particular individual are likely to be correlated with each other.
Interaction: Tabs: Example 2:
Ophthalmic studies in which outcome measurements are made on both eyes of each
individual in the study. The outcome measurements on the left and right eyes of a
particular individual are likely to be correlated with each other.
Interaction: Tabs: Example 3:
Family or household studies in which several members of each family/household are
studied. Outcome measurements on members of the same household may well be
correlated with each other.

3.4: When are data correlated?


The boxes below list a number of situations where there may be clustering (on the
left) and some reasons for this clustering (on the right).

Can you match up each situation to the corresponding reason? Select the
reason for clustering from the dropdown box to the right of each situation.
Situation: Families; School; Residential areas; Private health clinics
Reason for clustering (options in each dropdown): Financial; Ability; Genetic; Social

Interaction: Hotspot: Families:


Incorrect Response Financial (text appears on bottom right handside):
There may well be financial clustering within families, since the members of
a family are likely to share the same income. However for this exercise
there is a more fundamental answer. Please try again.
Incorrect Response Ability (text appears on bottom right handside):
There may be clustering with respect to ability within families, but for this
exercise there is a more fundamental answer. Please try again.
Correct Response Genetic (text appears on bottom right handside):
That's right, you would expect genetic clustering to occur within families.
Incorrect Response Social (text appears on bottom right handside):
There may well be social clustering within families, since the members of a
family are likely to belong to the same social group. However for this
exercise there is a more fundamental answer. Please try again.

Interaction: Hotspot: School:


Incorrect Response Financial (text appears on bottom right handside):
There may be financial clustering in schools, for example if it is a private
school. However, for this exercise there is another reason for clustering that
is more specific to schools. Please try again.
Correct Response Ability (text appears on bottom right handside):
That's right, clustering on the basis of ability can happen within schools, for
example in selective schools.
Incorrect Response Genetic (text appears on bottom right handside):
No, there is no reason why there should be genetic clustering within
schools. Please try again.
Incorrect Response Social (text appears on bottom right handside):
There may be social clustering in schools, especially where a school's
catchment area corresponds to a specific residential area. However, for this
exercise there is another reason for clustering that is more specific to
schools. Please try again.
Interaction: Hotspot: Residential Areas:
Incorrect Response Financial (text appears on bottom right handside):
There may be financial clustering within a given residential area, but more
often you would think of this as social clustering. Please try again.
Incorrect Response Ability (text appears on bottom right handside):
No, you would not necessarily expect clustering with respect to ability to
occur within a given residential area. Please try again.
Incorrect Response Genetic (text appears on bottom right handside):
No, you would not necessarily expect genetic clustering to occur within a
given residential area. Please try again.
Correct Response Social (text appears on bottom right handside):
That's right, if samples are taken from within a certain residential area,
social clustering may occur, because a given residential area will often be
home to a particular social class.
Interaction: Hotspot: Private health clinics:
Correct Response Financial (text appears on bottom right handside):

That's right, if samples are taken from within private health clinics, financial
clustering will occur, because only the relatively rich can afford private
health care.
Incorrect Response Ability (text appears on bottom right handside):
No, there is no reason why clustering with respect to ability should occur
within private health clinics. Please try again.
Incorrect Response Genetic (text appears on bottom right handside):
No, there is no reason why genetic clustering should occur within private
health clinics. Please try again.
Incorrect Response: Social (text appears on bottom right handside):
There may be social clustering within private health clinics, but there is a
more fundamental reason for such clustering. Please try again.

3.5: When are data correlated?


If the lack of independence of observations is not taken into account, the main problem in the analysis will be incorrect standard errors. In general, the standard errors will be too small. This leads to confidence intervals that are too narrow and P-values that are too small.
In other words, if the data are analysed as though they were independent, the
inference may be incorrect.
Interaction: Button: Why:
The data provided by two individuals who are similar (e.g. from the same
household) are less informative about the general study population than data from
two individuals from different households.
So, if we assume independence, we think we have more information than we really
do have and so the standard errors are too small. Therefore, the P-values will
provide stronger evidence than is really the case.

3.6: When are data correlated?


The correlation induced for whatever reason between individuals in a community can be measured by the intra-cluster correlation (also called within-cluster correlation).
ρ ('rho') = intra-cluster correlation coefficient
ρ = 0 means that responses of individuals within the same cluster are no more alike than those of individuals from different clusters.
ρ = 1 means that all responses of individuals within the same cluster are identical.

3.7: When are data correlated?


An alternative way of thinking about intra-cluster correlation is in terms of between-cluster variation.
Intra-cluster correlation means that individuals are more similar to others in the same cluster than to individuals in other clusters. This happens if and only if there are differences between clusters.
So intra-cluster correlation and between-cluster variation are two ways of measuring the same phenomenon: if observations are correlated within clusters, then there is variation between clusters.

between-cluster variation  ⇔  intra-cluster correlation
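As background (this formula is not in the original session text, but is standard): for a quantitative outcome the intra-cluster correlation can be written as the share of the total variance that lies between clusters,

\rho = \frac{\sigma^2_{\text{between}}}{\sigma^2_{\text{between}} + \sigma^2_{\text{within}}}

so ρ is 0 when clusters do not differ at all, and it approaches 1 as the between-cluster variation comes to dominate the within-cluster variation.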

3.8: When are data correlated?


So, observations on individuals within the same cluster are correlated. This requires
modifications in the statistical methods applied.

Section 4: When are data correlated? Example


Before you look at how to deal with correlated observations, consider this example
which illustrates the problem.
A study carried out in Zambia looked at the impact of HIV on the infectiousness of patients with pulmonary TB.
All contacts underwent a Mantoux test. An induration of diameter ≥ 5 mm was considered positive.
Information was recorded on a number of household-level variables, including:
HIV status of the TB index case
crowding
and a number of individual contact-level variables, including:
age of contact
degree of intimacy of contact
The mean numbers of contacts per index case were:
HIV -ve: 4.5 (range 1 to 11) [126 contacts / 28 HIV -ve index TB cases]
HIV +ve: 4.3 (range 1 to 13) [181 contacts / 42 HIV +ve index TB cases]

4.1: When are data correlated? Example

If some index cases are more infectious than others, or household members share
previous exposures to TB, then the outcome (i.e. the result of the Mantoux test)
should show some correlation within households.
You are going to look at the effect of 3 different approaches to the analysis of
correlated data to test this hypothesis:
1. Robust standard errors
2. Generalised estimating equations
3. Random effects models
(also called multi-level modelling)

4.2: When are data correlated? Example


The table below shows the distribution of contacts, by outcome of the Mantoux test
and HIV status of the index TB case.
Mantoux test status        HIV status of index case
of household contact    Negative          Positive          Total
                        n      %          n      %          n      %
Negative                36     28.6       87     48.1       123    40.1
Positive                90     71.4       94     51.9       184    59.9
Total                   126    100.0      181    100.0      307    100.0

From this table, what can you say about the prevalence of tuberculin positivity and
odds of tuberculin positivity?
Interaction: Button: clouds picture (pop up box appears):
Overall, 60% (184 / 307) of household contacts were tuberculin-positive.
Odds = 184 / 123 = 1.50
The prevalence of tuberculin positivity appears lower among the contacts of HIV +ve index cases (52%) than among contacts of HIV -ve index cases (71%).
The respective odds of tuberculin positivity are 1.08 (94 / 87) for contacts of HIV +ve index cases and 2.50 (90 / 36) for contacts of HIV -ve index cases.

4.3: When are data correlated? Example


First, we analyse the data, ignoring any within-household clustering. We can
estimate the odds ratio and calculate a 95% CI for the odds ratio using classical
methods.
Click the 'swap' button to see the classical results.

Interaction: Button: Swap (table from previous page changes to the following):
Classical analysis
Odds ratio    χ²       P > χ²     95% confidence interval
0.43          11.72    0.0006     0.26 to 0.71
How would you interpret the model estimates?


Interaction: Button: clouds picture (pop up box appears):
The results suggest that the odds of being tuberculin positive among household contacts of HIV +ve TB cases are around half (0.43) those among household contacts of HIV -ve index cases.
Ignoring the clustering, we would conclude that there was strong evidence that this association was not due to chance (P = 0.0006).
However, if there is clustering within households, this conclusion could be wrong.
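As an illustrative aside, the classical analysis above can be reproduced from the cell counts of the table in 4.2. The sketch below uses Python with numpy/scipy (not the session's own software) and a Woolf (log odds ratio) confidence interval, so the figures match those shown only up to rounding and the exact CI and test variant used.

import numpy as np
from scipy.stats import chi2_contingency, norm

# Cell counts from the table in 4.2
# rows: Mantoux negative / positive; columns: index case HIV -ve / HIV +ve
table = np.array([[36, 87],
                  [90, 94]])

# Odds of tuberculin positivity among contacts, by HIV status of the index case
odds_hiv_neg = 90 / 36                     # 2.50
odds_hiv_pos = 94 / 87                     # 1.08
or_hat = odds_hiv_pos / odds_hiv_neg       # approximately 0.43

# Approximate (Woolf) 95% CI for the odds ratio, calculated on the log scale
se_log_or = np.sqrt(1/36 + 1/87 + 1/90 + 1/94)
ci = np.exp(np.log(or_hat) + np.array([-1, 1]) * norm.ppf(0.975) * se_log_or)

# Pearson chi-squared test without continuity correction
chi2_stat, p_value, _, _ = chi2_contingency(table, correction=False)

print(f"OR = {or_hat:.2f}, 95% CI ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"chi-squared = {chi2_stat:.2f}, P = {p_value:.4f}")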

4.4: When are data correlated? Example


The methods of estimation you have used so far make the assumption that
observations are independent of each other. In EPM202, you saw that these
estimation methods are based on likelihood.
You can write down a likelihood in terms of the risk, log odds, rate or log rate
parameter by multiplying together the probabilities of each individual observation, as
follows:

Likelihood = π × π × (1 − π) × ... × (1 − π)

where each observation with the outcome contributes a factor of π (the probability of the outcome) and each observation without the outcome contributes a factor of (1 − π).
Click here to review this from session SM05.
Interaction: Hyperlink: review this:

A point estimate of the parameter of interest is obtained from choosing the


parameter value that maximises the likelihood.
An approximate confidence interval for the parameter estimate can be found using
the quadratic approximation to the log likelihood. Click here to review this from
session SM05.
Interaction: Hyperlink: review this:

In other words, likelihood was used to derive the best estimate of the parameter and its standard error. Standard errors like this are sometimes called "model-based" standard errors; these are what you will be most familiar with from this course.
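As a small worked illustration of this (using the overall counts from the table in 4.2, and writing π for the overall prevalence of tuberculin positivity): each of the 184 positive contacts contributes a factor π and each of the 123 negative contacts a factor (1 − π), so

L(\pi) = \pi^{184} \, (1 - \pi)^{123}, \qquad \hat{\pi} = \frac{184}{307} \approx 0.60

and the curvature of log L(π) around this maximum is what provides the model-based standard error.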

Section 5: Robust standard errors


One useful approach to derive standard errors that allow for the clustering is to use
what are called robust standard errors.
While model-based standard errors are based on predicted variability, robust
standard errors are based on observed variability.
Using robust standard errors, you can then obtain appropriate confidence intervals
and P-values.

5.1: Robust standard errors


Robust standard errors are based on the sum of the squared residuals:

Variance ∝ Σi ri²

The ri terms are the residuals.
The residuals are the difference between the outcome observed and the outcome predicted by the model.
When observations are independent, then the summation is performed on the
individual-level residuals.
If data are "clustered", then cluster-level residuals are calculated and summed over
the clusters.
Note: This does not make any assumptions about independence within clusters but
does assume that there is independence between clusters.
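Purely as an illustration of how such robust standard errors are requested in practice (this is not the session's own software), the sketch below uses Python with statsmodels; the file name and the variable names mantoux, hiv1 and household are hypothetical.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical dataset: one row per household contact, with a binary
# Mantoux result (mantoux), the index case's HIV status (hiv1, 0/1)
# and a household identifier (household).
df = pd.read_csv("tb_contacts.csv")

# Ordinary logistic regression: model-based standard errors
fit_model_based = smf.logit("mantoux ~ hiv1", data=df).fit()

# Robust (sandwich) standard errors based on individual-level residuals
fit_robust_indiv = smf.logit("mantoux ~ hiv1", data=df).fit(cov_type="HC0")

# Robust standard errors based on household-level residuals: the version
# that allows for correlation within households
fit_robust_cluster = smf.logit("mantoux ~ hiv1", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["household"]})

# The log odds ratio is identical in all three fits; only the standard
# errors (and hence the confidence intervals and P-values) change.
print(fit_model_based.bse)
print(fit_robust_indiv.bse)
print(fit_robust_cluster.bse)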

5.2: Robust standard errors


Let's look at how robust standard errors work in an example.
Consider again the study of tuberculin positivity and contact with an index case with or without HIV. The variable HIV1 indicates whether a contact's index case is HIV +ve (HIV1 = 1) or HIV -ve (HIV1 = 0).
Click the 'swap' button to see the estimates on an odds ratio scale.
Interaction: Button: Swap (the table changes from the second to the first below):

Estimates from logistic model (ignoring clusters) on an odds ratio scale
Mantoux     Odds ratio    Standard Err.   z        P > |z|    95% Conf. Interval
HIV1        0.43          0.106760        -3.396   0.001      0.27, 0.70

Estimates from logistic model (ignoring clusters) on a log scale
Mantoux     Coefficient   Standard Err.   z        P > |z|    95% Conf. Interval
HIV1        -0.838904     0.247025        -3.396   0.001      -1.323065, -0.354744
Constant    0.916291      0.197203        4.646    < 0.001    0.529781, 1.302801
Log likelihood = -200.70621

The table shows estimates of a logistic regression model. This model ignores any
household clustering. Click below for further explanation of the model.
Interaction: Button: Explanation (pop up box appears):
This model is similar to many models we have fitted before. All the calculations are performed assuming that all observations are independent. The parameter estimate of the log(OR) for the effect of HIV (-0.8389) is obtained by maximising the log likelihood. The standard error of this estimate (0.247025) is obtained through the quadratic approximation to the log likelihood.
How does this estimate compare to the one we obtained earlier using classical
methods?
Interaction: Button: clouds picture (pop up box appears):
On the odds ratio scale we obtain the same estimate (0.432) and a similar
confidence interval to those we obtained using classical methods.

5.3: Robust standard errors


To allow for correlation within households you need to use robust standard errors
that are calculated using residuals at the cluster level. The table below shows this.
Use the 'swap' button to see the previous model, with robust standard errors
calculated at the individual level.
Notice that the estimate for the effect of HIV on tuberculin positivity is the same as
before. This is because the only difference is in the calculation of the standard
errors.
Interaction: Button: Swap (the table changes from the second to the first below):
Estimates from a logistic regression model with robust standard errors calculated at the individual level
Mantoux     Coefficient   Standard err.   z        P > |z|    95% confidence interval
HIV1        -0.838904     0.247429        -3.390   0.001      -1.323855, -0.353953
Constant    0.916291      0.197525        4.639    < 0.001    0.529150, 1.303432
Log likelihood = -200.70621

Estimates from a logistic regression model with robust standard errors calculated at the household level
Mantoux     Coefficient   Standard err.   z        P > |z|    95% confidence interval
HIV1        -0.838904     0.332274        -2.525   0.012      -1.490150, -0.187658
Constant    0.916291      0.266485        3.438    0.001      0.393989, 1.438593
Log likelihood = -200.70621
Interaction: Tabs: Question 1:
How do the standard errors compare to the two previous models and how will this
affect the inference made about the effect of HIV contact on tuberculin positivity?
Interaction: Button: clouds picture (pop up box appears):
The standard errors are now quite a lot bigger. The standard error of log(OR) has
increased from 0.25 to 0.33.
As a consequence, the P-value for the null hypothesis, that there is no association
between the HIV status of the index case and the odds of tuberculin positivity in
household contacts, has also got bigger.
Interaction: Tabs: Question 2:
Converting back to the odds ratio scale, we obtain OR = exp(-0.839) = 0.43 (95% CI: 0.23, 0.83); P = 0.012.
So what can you conclude about the effect of HIV contact from this analysis?
Interaction: Button: clouds picture (pop up box appears):
After adjusting for the effect of clustering within households (using robust standard errors) it appears that there is still evidence of an association between HIV contact and tuberculin positivity. The odds of tuberculin positivity for contacts of HIV +ve TB cases are approximately half those for contacts of HIV -ve TB cases.

5.4: Robust standard errors

An important point to note is that the log-likelihood shown in the output (-200.7) is identical for both analyses:
Logistic regression with robust standard errors based on individual-level
residuals
Logistic regression with robust standard errors based on cluster-level residuals
Why should this be?
Interaction: Button: clouds picture (pop up box appears) :
Initially, standard maximum likelihood estimation is performed in each analysis, and
it is only afterwards that the standard errors for the parameter estimates are
computed using the robust approach.
The log-likelihood does not take account of the clustering.
Therefore you cannot use the log likelihood from the "robust" analysis to perform a
likelihood ratio test that takes correlations in the data into account.

5.5: Robust standard errors


Summary
The tabs opposite summarise the robust standard error approach in the analysis of
correlated data.
Interaction: Tabs: 1:
The robust standard error method uses the standard maximum likelihood approach
to obtain parameter estimates, ignoring any correlations in the data.
Therefore, robust standard errors do not affect the parameter estimate.
Interaction: Tabs: 2:
Instead of using the quadratic approximation to the log-likelihood to obtain a
standard error, this method calculates robust standard errors using household-level
(or cluster-level) residuals to take account of correlations between individuals in the
same household (or cluster).
Interaction: Tabs: 3:
The log-likelihood does not take account of clustering. Likelihood ratio tests are not
valid, since they ignore any correlations in the data.
Interaction: Tabs: 4:
Robust standard errors will be correct providing our model is correct, and we have a
reasonable number of clusters, say 30 or more.

Section 6: Generalised estimating equations


One weakness of the robust standard error approach is that it ignores clustering when calculating the effect estimates (e.g. the odds ratio); it is only the standard errors that are adjusted.
This means that, for the calculation of the effect estimate, the same weight is given to an individual in a household with many individuals as to an individual who is the only contact in a household.
Interaction: Button: More (text appears below):
To account for clustering, do you think the weight for individuals in a household with
many individuals should be lower or higher than households with only one individual
contact?
Interaction: Button: Lower:
Correct Response:
That's correct, if there is within-household correlation, relatively less weight should
be given to each individual in the household with many individuals than to the
individual who is an only contact. This is because the many individuals share the
same household information.
Interaction: Button: Higher:
Incorrect Response:
In fact, if there is within-household correlation, relatively less weight should be
given to each individual in the household with many individuals than to the
individual who is an only contact. This is because the many individuals share the
same household information.

6.1: Generalised estimating equations


Generalised estimating equations (GEE) use robust standard errors, but also take
account of correlations when estimating the measure of effect, e.g. the odds ratio.
Therefore, this method gives different weights to individuals, depending on how
many individuals are in the household.
When using GEEs, you must think about how the observations in a data set are
likely to be correlated with each other. The three standard options for this are given
opposite.
Interaction: Tabs: Independence:
This choice implies that you don't think the data are correlated.
If you don't think the data are correlated, you probably don't need to be using GEE.

Interaction: Tabs: Exchangeable:


This choice implies that within a "cluster", e.g. a household, any two observations
are equally correlated, but that there is no correlation between observations from
different "clusters". This is a common choice.
Interaction: Tabs: Autocorrelation:
This choice is useful for measures repeated over time, e.g. repeated measurements
on the same individual such as episodes of diarrhoea.
Repeated measurements on an individual are most likely to be most strongly
correlated when they are made a short time apart. The greater the time interval
between two measurements the smaller the correlation is likely to be.
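Putting these choices together: the sketch below (an illustration only, using statsmodels rather than the session's own software, and the same hypothetical file and variable names as before) fits a GEE logistic model with households as the clusters and an exchangeable working correlation; robust standard errors are reported by default.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("tb_contacts.csv")   # hypothetical file: one row per contact

# GEE logistic regression with an exchangeable working correlation:
# any two contacts in the same household are assumed equally correlated.
gee_exch = smf.gee("mantoux ~ hiv1",
                   groups="household",
                   data=df,
                   family=sm.families.Binomial(),
                   cov_struct=sm.cov_struct.Exchangeable())
print(gee_exch.fit().summary())   # robust ("sandwich") standard errors

# For comparison, an independence working correlation reproduces the
# point estimates of ordinary logistic regression.
gee_ind = smf.gee("mantoux ~ hiv1", groups="household", data=df,
                  family=sm.families.Binomial(),
                  cov_struct=sm.cov_struct.Independence())
print(gee_ind.fit().summary())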

6.2: Generalised estimating equations


Let's first look at a GEE analysis that makes the working assumption that all
observations are independent. The table below shows estimates from such a model.
Estimates from GEE analysis with non-robust standard errors
Mantoux     Coefficient   Standard Err.   z        P > |z|    95% confidence interval
HIV1        -0.838904     0.247025        -3.396   0.001      -1.323065, -0.354744
Constant    0.916291      0.197203        4.646    < 0.001    0.529781, 1.302801
You should notice that the parameter estimates and standard errors are identical to
those we obtained using the standard likelihood approach using a logistic regression
model.
This is because the model:
a) assumes independence
b) does not use robust standard errors

6.3: Generalised estimating equations


The results below were obtained from a GEE analysis with robust standard errors.
Notice the parameter estimate for the log(OR) is the same as before. This analysis
takes no account of the within-household correlation when estimating the effect of
exposure variables.
Estimates from GEE analysis with robust standard errors
Mantoux     Coefficient   Standard err.   z        P > |z|    95% confidence interval
HIV1        -0.838904     0.332274        -2.525   0.012      -1.490150, -0.187658
Constant    0.916291      0.266485        3.438    0.001      0.393989, 1.438593

Click below to view the model that allows for correlation within households, which
you saw on the previous page.
Interaction: Button: Swap (table changes to the following):
Estimates from a logistic regression model with robust standard errors, adjusted for clustering
Mantoux     Coefficient   Standard err.   z        P > |z|    95% confidence interval
Ihiv_2      -0.838904     0.332274        -2.525   0.012      -1.490150, -0.187658
_cons       0.916291      0.266485        3.438    0.001      0.393989, 1.438593

How does the standard error for the log(OR) in the GEE analysis with robust
standard errors compare to the earlier model?
Interaction: Button: clouds picture (pop up box appears):
The standard error obtained from this analysis is identical to that obtained using the
robust standard error approach that allows for correlation within households.
The GEE analysis automatically adjusts the standard errors to take account of the within-household correlation. However, it has taken no account of the within-household correlation when obtaining the parameter estimate (the log odds ratio).

6.4: Generalised estimating equations


You can also fit a model using GEE that assumes exchangeable correlations within
households, i.e. it accounts for within-household correlation in the estimation of the
parameter estimate, e.g. the log(OR).
How do the estimates compare to the previous model without correlation?
Interaction: Button: clouds picture (pop up box appears):
Notice that the standard errors are similar in magnitude to those of the previous analysis, but that now the parameter estimate of the log odds ratio has changed from -0.8389 to -0.9689.
Estimates from GEE analysis with robust standard errors, accounting for household correlation
Mantoux     Coefficient   Standard err.   z        P > |z|    95% confidence interval
HIV1        -0.968266     0.327340        -2.958   0.003      -1.609840, -0.326692
Constant    1.010946      0.261726        3.863    < 0.001    0.497972, 1.523920
Do you know how this analysis accounts for the within household correlation in the
estimate of the log(OR)?
Interaction: Button: clouds picture (pop up box appears):
This analysis takes account of within-household correlation when estimating the log
odds ratio, i.e. it gives relatively less weight to contacts in large households.
On an odds ratio scale, the model estimates are:
OR = 0.38 (95% CI: 0.20, 0.72)
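The odds ratio and confidence interval quoted here come from exponentiating the coefficient and its confidence limits in the table above:

\mathrm{OR} = e^{-0.968} \approx 0.38, \qquad 95\%\ \mathrm{CI} = \left( e^{-1.610},\; e^{-0.327} \right) \approx (0.20,\; 0.72)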

6.5: Generalised estimating equations


Summary
The main aspects of GEE analysis are:
1. GEE can include robust standard errors.
2. You need to specify how you think the data are correlated. The usual choice is
'exchangeable'.
3. If an exchangeable correlation is specified, point estimates, e.g., odds
ratio, rate ratio, are adjusted for correlations in the data.
4. Within a GEE analysis likelihood ratio tests are not valid.

Section 7: Random effects models


Robust standard errors and generalised estimating equations are two practical
approaches to dealing with correlated observations. However, they are not based on
a full (probability) model for the data.
Therefore statisticians usually prefer to use another approach. The third approach is
to use random effects models, also known as multilevel models.
Random effects models include the variation between clusters explicitly in the
likelihood and therefore take account of intra-cluster correlations.

7.1: Random effects models


Suppose for a moment that, in our study of tuberculin positivity in household
contacts, all our contacts lived in different households and could be considered to be
independent.
Interaction: Button: Show (text appears below):

Assuming independence, individuals' contributions to the likelihood can be multiplied


together to produce the full likelihood. We then maximise the full likelihood and find
quadratic approximations for it.
In a simple logistic regression model, for individual j, the log odds of positivity is given by:

log(odds)j = (baseline log odds) + log(ORHIV) × (HIV status)j

where (HIV status)j is an indicator variable which takes the value 0 if the index case for household contact j is HIV -ve and 1 if the index case for household contact j is HIV +ve.

7.2: Random effects models


In reality, individuals will live in the same household and will therefore be exposed
to the same index case and share other potential risk factors in common i.e. they
are clustered!
Interaction: Button: Show (text appears below):
In this situation, the random effects model states that for individual j in household i the log odds of tuberculin positivity are given by:

log(odds)ij = (baseline log odds) + log(ORHIV) × (HIV status)ij + ui

7.3: Random effects models


The additional term in this model is the last term ui. Each household is allowed its
own value of ui.
The log odds of all individuals in the household are shifted by this amount.
This makes them similar to each other to some extent (within-household
correlation), and different, to some extent, from individuals in other households
(between-household variation).
In the example, the ui would reflect both the infectivity of the particular tuberculosis
case in the household, and any other past shared household exposure to TB.
In this situation, the random effects model states that for individual j in household i the log odds of tuberculin positivity are given by:

log(odds)ij = (baseline log odds) + log(ORHIV) × (HIV status)ij + ui

7.4: Random effects models


The random effects model assumes that the household effects are drawn from a probability distribution, hence they are "random" (rather than fixed) effects.

For logistic regression models we usually assume that the ui are normally distributed, with mean 0 and variance σu².
The only extra parameter that has to be estimated is σu, rather than trying to estimate a specific value of ui for each household.
In this situation, the random effects model states that for individual j in household i the log odds of tuberculin positivity are given by:

log(odds)ij = (baseline log odds) + log(ORHIV) × (HIV status)ij + ui
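Collecting the pieces from 7.2 to 7.4, the random effects logistic model can be written as

\log\left(\mathrm{odds}_{ij}\right) = \beta_0 + \beta_1 \times (\text{HIV status})_{ij} + u_i, \qquad u_i \sim N\left(0, \sigma_u^2\right)

where β0 is the baseline log odds, β1 = log(ORHIV) and ui is the shared effect of household i.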

7.5: Random effects models


Estimates from a random effects model for the effect of HIV status on tuberculin
positivity are shown below.
Estimates from a random effects model
Mantoux     Coefficient   Standard err.   z        P > |z|    95% confidence interval
HIV1        -1.148913     0.393322        -2.921   0.003      -1.919810, -0.378017
Constant    1.198623      0.316097        3.792    0.000      0.579085, 1.818162
ln(σu²)     -0.034562     0.496553        -0.070   0.945      -1.007788, 0.938665
σu          0.982868      0.244023                            0.604173, 1.598926
ρ           0.491360      0.124101                            0.267413, 0.718830
Log likelihood = -194.26

The model gives an estimate of the log odds ratio for the effect of HIV (-1.1489) and its standard error (0.39).
Converting back to the odds ratio scale we obtain OR = 0.32 (95% CI: 0.15, 0.69).

7.6: Random effects models


Now consider the other estimates given in the model, σu and ρ.
Interaction: Tabs: σu:
With the random effects model an estimate of σu is given (0.98).
This is a measure of how much ui varies between households.
If there is no clustering within households, then σu will equal 0.
Interaction: Tabs: ρ:
ρ is a measure of the within-cluster correlation. It is also called the intra-class correlation coefficient. Its value depends on the relative size of the within-household and between-household variation.
If there is no clustering within households, then ρ will equal 0. The closer ρ is to 1, the greater is the clustering within households.
For the tuberculin data set, ρ = 0.49, which is quite large and indicates considerable within-household correlation.

7.7: Random effects models


In a random effects model the likelihood is fully specified and all results are derived from the likelihood; it is therefore valid to perform a likelihood ratio test.
Using this we can test the null hypothesis of no within-household clustering, versus the alternative of some within-household clustering, by testing the null hypothesis that ρ = 0.
Click below to see the likelihood ratio test.
Interaction: Button: Show (text appears below):
The likelihood ratio test of the null hypothesis, that ρ = 0, is obtained by comparing the log-likelihood of the original logistic regression model (in which σu and ρ are 0) with the log-likelihood for the random effects model.
Log-likelihood (original) = -200.71
Log-likelihood (random effects) = -194.26

LRS = 2 × (200.71 − 194.26) = 12.9

The resulting P-value for this is P = 0.0003.
What can you conclude from this test?
Interaction: Button: clouds picture (pop up box appears):
The result of this test indicates strong evidence of within-household clustering in this
model.

7.8: Random effects models


The likelihood for the random effects logistic regression model contains a mixture of
normal (for the additional variation) and binomial (for the individual outcome data)
distributions, and so it is very complicated.
Parameter estimates are obtained using numerical approximations, and the reliability
of these approximations should be checked.

This is especially important when ρ is large (> 0.25, say), or the number of observations per cluster (individuals per household in our example) is large (> 20, say).
Interaction: Button: clouds picture (pop up box appears):
In the study of tuberculin positivity, a check suggests that the approximations may
not be reliable!
You will see how to check the reliability of the approximations in Practical 9.
As a result it may be safer to use the results from the GEE analysis, even though
this approach is less satisfying from a statistical point of view.

7.9: Random effects models


The problem with approximations applies particularly to random effects logistic
regression models. The combination of binomial and normal distributions is especially
problematic.
The problem does not arise when the outcome is continuous (normally distributed)
with normally distributed random effects. In this case we are combining normal
distributions together and we can solve these equations without needing to use
approximations.
Interaction: Button: More (text appears below):
The same is true for random effects Poisson regression, if we assume the random
effects follow a Gamma distribution.
The combination of Poisson and Gamma distributions produces another distribution, the negative binomial, and again we can solve these equations without resorting to approximations. This is appropriate for dealing with overdispersion (click here to review this from AS05).
Interaction: Hyperlink: review this:
A window opens with AS05
However, if we specify in our model that the Poisson random effects are normally
distributed, then we run into the same problems we face with random effects logistic
regression, and the reliability of the estimates should be checked.

7.10: Random effects models


Summary
1. A random effects model specifies the form of the between-cluster variation and
includes it in the likelihood.

2. The point estimates, standard errors and log-likelihood obtained from a random
effects model all take account of the clustering (assuming that the random effects
distribution is correctly specified).
3. Likelihood ratio tests are valid.
4. Estimates of the between cluster variation and intra-cluster correlation are
obtained.
5. There needs to be a reasonable number of "clusters" in the dataset for the
method to be reliable.
6. When performing random effects logistic regression analysis, the reliability of the estimates should be checked, especially when ρ is large.
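Software aside (not part of the original session, and with hypothetical file and variable names): in Python, statsmodels does not provide the maximum likelihood fit used above, but it does offer an approximate (variational Bayes) binomial mixed model, so its results will differ somewhat from the tables in 7.5.

import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

df = pd.read_csv("tb_contacts.csv")   # hypothetical file: one row per contact

# A random intercept for each household: every household gets its own u_i
vc_formulas = {"household": "0 + C(household)"}

model = BinomialBayesMixedGLM.from_formula("mantoux ~ hiv1", vc_formulas, df)
result = model.fit_vb()      # variational Bayes approximation
print(result.summary())      # fixed effects plus the random-effect SD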

Section 8: Comparison of different approaches


Click on each of the analyses listed below to see the results obtained for the effect of
HIV on tuberculin positivity in contacts of the index TB case.
1. Ordinary logistic regression
2. Logistic regression with robust standard error adjusted for clustering
3. Logistic regression with GEE, exchangeable correlation matrix and robust
standard errors
4. Random effects logistic regression
Interaction: Hyperlink: Ordinary logistic regression (text appears below):
Ordinary logistic regression analysis fails to take account of the possibility that
individuals living in the same household are likely to be more similar to each other
than to individuals in other households.
This approach is invalid in this dataset.
Results
Odds ratio for HIV (95% CI): 0.43 (0.27, 0.70)
Standard error of log odds ratio: 0.25
P-value: 0.001

Interaction: Hyperlink: Logistic regression with robust standard error


adjusted for clustering (text appears below):
The use of robust standard errors improves an ordinary logistic regression analysis
by taking account of possible clustering when computing the standard errors, but it
ignores clustering when estimating the odds ratio. Therefore, the same estimate is
obtained for the odds ratio but the standard error is larger.

This approach is valid, but not optimal.


Results
Odds ratio for HIV (95% CI): 0.43 (0.23, 0.83)
Standard error of log odds ratio: 0.33
P-value: 0.01

Interaction: Hyperlink: Logistic regression with GEE, exchangeable


correlation matrix and robust standard errors (text appears below):
A GEE analysis with robust standard errors and an exchangeable correlation matrix
is a further improvement, since it takes account of clustering when estimating the
odds ratio.
This approach is valid, but somewhat unsatisfying from a statistical perspective.
Results
Odds ratio for HIV (95% CI): 0.38 (0.20, 0.72)
Standard error of log odds ratio: 0.33
P-value: 0.003

Interaction: Hyperlink: Random effects logistic regression (text appears


below):
A random effects model uses a different approach to deal with clustering. The
clustering is incorporated explicitly in the likelihood to obtain estimates of the odds
ratio and its standard error.
This approach is valid.
When fitting random effects logistic regression models, the reliability of the
estimates should be checked.
Results
Odds ratio for HIV (95% CI): 0.32 (0.15, 0.69)
Standard error of log odds ratio: 0.39
P-value: 0.003

8.1: Comparison of different approaches


It is preferable to use a method that takes into account the fact that data provided by two individuals who are similar (e.g. from the same household) are less informative about the general study population than data from two individuals from different households. Similarly, it is preferable to use a method that recognises that data provided by 2 individuals in a large cluster (e.g. a large household) are less informative about the general population than data provided by 2 individuals in a small cluster (e.g. a small household).

GEE and random effects models both take the above into account, but the analysis
using robust standard errors alone does not.
The odds ratio estimates produced by GEE and the random effects model are
different. This is because they actually estimate slightly different things. This is
explained on the tabs below.
Interaction: Tabs: GEE:
The odds ratio estimated by GEE is often called the population average odds ratio.
This represents:

(odds of the average household contact of an HIV +ve person being tuberculin positive) ÷ (odds of the average household contact of an HIV -ve person being tuberculin positive)

Interaction: Tabs: Random effects:
The odds ratio estimated from a random effects model is often called the cluster-specific odds ratio.
This represents:

(an individual's odds of being tuberculin positive if their index case is HIV +ve) ÷ (the same individual's odds of being tuberculin positive if their index case is HIV -ve)

8.2: Comparison of different approaches


The two measures will only be the same if there is no random effect.
What do we mean by this?
Interaction: Button: clouds picture (text appears below):
If there is no within-household clustering then there will be no random effect.
When there are random effects (clustering) then the estimate from the random effects model (also called the cluster-specific estimate) will be more extreme, i.e. further from the null value of 1.
GEE model estimates:
Odds ratio for HIV = 0.38 (95% CI: 0.20, 0.72)

Random effects model estimates:


Odds ratio for HIV = 0.32 (95% CI: 0.15, 0.69)

8.3: Comparison of different approaches


Which is the measure of choice?
Because the random-effects OR estimates the effect of the risk factor at the level of the individual, it is arguable that it is preferable to the "population-average" estimate given by GEE.
However, in most circumstances, there will be little difference between the cluster-specific and population-average results. In such situations it does not matter which estimate is used.
The difference between the 2 methods will only be large if there is substantial
between-cluster variation and, in at least a proportion of clusters, the outcome is
common.
In such circumstances, the odds ratio will not approximate the risk/rate ratio, and in communicating your results to a wider audience the biggest problem you are likely to face is explaining how to interpret an odds ratio, whether it be "population average" or "cluster specific".

Section 9: Summary
This is the end of AS09. When you are happy with the material covered here please
move on to session AS10.
The main points of this session will appear below as you click on the relevant title.
The problem with correlated data
When datasets contain observations that are correlated (clustered), such as:
repeated outcome measurements on individuals
outcome measurements on several individuals in the same household
then standard methods of data analysis are invalid, and produce confidence intervals
that are too narrow and P-values that are too small.
Robust standard errors
Robust standard errors can be obtained using residuals calculated at the cluster
level, to account for correlation. The clustering in the data is not taken into account
when estimating parameters, which is a disadvantage of this method. However, an
advantage of this method is that it always works - you do not get problems of model

convergence such as sometimes occur when using GEE or random effects models.
Another advantage is that it is a simple and intuitive approach.
Generalised estimating equations
GEEs incorporate robust standard errors and also take correlation into account when
estimating parameters, e.g. log(OR).
GEE is a pragmatic, relatively simple approach. However, sometimes there are
convergence problems with the model when using GEE, in which case the method
using robust standard errors alone (and not taking into account the within-cluster
correlation when estimating parameters), or a random-effects model, should be
used.
Random effects models
Random effects models take account of the clustering within the data when
estimating parameters and their standard errors. They are more satisfactory from a
statistical point of view than the method using robust standard errors alone, and
than GEE, because they specify a full probability model to explain the data.
They work well for Poisson models when the between-cluster variation is assumed to follow a gamma distribution, and for quantitative outcome data when the between-cluster variation is assumed to follow a normal distribution. So, in general, you should use a random-effects model when you have correlated data and you are modelling rates or a quantitative outcome.
However, for logistic regression (where the outcome variable is binary), a random effects model may run into computational problems (the model may not converge, or the approximations used in the model-fitting may not be reliable). When this happens, it is preferable to use GEE or the method using robust standard errors alone.
Constant exposure
These methods for analysing correlated data are needed when the exposure variable is a "cluster-level" variable, i.e. it takes the same value for all individuals in a cluster. When all exposure variables are "individual-level" variables (i.e. their value varies among individuals in the same cluster), and if clusters are quite large, then we may instead be able to take account of the between-cluster variation by stratifying the analysis on cluster and using regression analysis in the usual way.
