(AS10)
EPM304 Advanced Statistical Methods in Epidemiology
This document contains a copy of the study material located within the computer
assisted learning (CAL) session.
If you have any questions regarding this document or your course, please contact
DLsupport via DLsupport@lshtm.ac.uk.
Important note: this document does not replace the CAL material found on your
module CDROM. When studying this session, please ensure you work through the
CDROM material first. This document can then be used for revision purposes to refer
back to specific sessions.
These study materials have been prepared by the London School of Hygiene & Tropical Medicine as part of
the PG Diploma/MSc Epidemiology distance learning course. This material is not licensed either for resale or
further copying.
London School of Hygiene & Tropical Medicine September 2013 v1.0
To begin you will look at why the usual methods are not appropriate. You will then look at
different approaches to obtain estimates accounting for the fact that observations are correlated.
During the session you will fit random effects models that were covered in AS09. Hence, you
may wish to review this material. In addition, sample size calculations will be covered and so you
could also review this session.
Correlated data
AS09
PE06
A rural study area in Ghana was divided arbitrarily into 96 clusters of compounds. Households in
48 randomly selected clusters then received insecticide-treated bednets. A demographic
surveillance system recorded all births, deaths and migrations. Around 16,000 children aged 6-59
months were followed up for 2 years with all-cause child mortality rates as the outcome.
Correlation between measurements on individuals in the same cluster may arise in a number of
ways:
- Persons may behave or respond more like other persons in the same community than persons in
a different community.
- Persons may have a level of exposure more like other persons in the same community than
persons in a different community.
- An infected individual in a community may transmit an infection directly to another individual in
that same community.
A rural study area in Ghana was divided arbitrarily into 96 clusters of compounds, with
households in 48 randomly selected clusters receiving insecticide-treated bednets. A
demographic surveillance system recorded all births, deaths and migrations. Around 16,000
children aged 6-59 months were followed up for 2 years with all-cause child mortality rates as the
outcome.
Why do you think a CRT design was used, rather than randomising some households to receive
treated bednets?
Logistically it may be easier to distribute bednets to every household in a cluster. Education and
information campaigns around the use of bednets may be better delivered at a cluster level.
In addition, by impacting malaria prevalence and mosquito density there may be substantial
contamination between neighbouring households and so measuring the effect at the cluster level
may help reduce this problem.
Why do you think a CRT design was used, rather than randomising children to the intervention or
control?
Children who received the health promotion intervention may talk to their classmates about the
intervention. This may reduce the impact of the intervention among the children who received it
and may also lead to the intervention affecting the children who did not receive it. Hence, the
measured intervention effect may be diluted.
In addition, it may be easier to deliver such a campaign to the whole school using large meetings
as well as smaller group counselling.
medical facilities. Individuals from one community may commonly visit medical facilities in other
communities for treatment of stigmatising illnesses. Such contamination will generally dilute the
effect of any intervention.
2. Between intervention clusters and the wider community. This is likely to dilute the effect of the
intervention, but to a lesser extent: it may weaken the intervention, but is unlikely to lead to
any effect in the control clusters under study.
3. Between control clusters and the wider community. This is unlikely to have any substantial effect,
unless the control communities are receiving an improved standard of care due to the study.
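For a binary outcome, the intra-cluster correlation coefficient is $\rho = \sigma_B^2 / [\pi(1 - \pi)]$,
where $\sigma_B^2$ is the variance of the true cluster-level proportions and $\pi$ is the overall true proportion
across all the clusters in the study.
For a quantitative outcome, $\rho = \sigma_B^2 / (\sigma_B^2 + \sigma_W^2)$,
where $\sigma_B^2$ is the between-cluster variance and $\sigma_W^2$ is the within-cluster variance of the outcome of
interest.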
The intra-cluster correlation coefficient is not defined for person-years (rates) data.
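As a rough numerical illustration of these definitions, the sketch below assumes the standard forms $\rho = \sigma_B^2 / [\pi(1 - \pi)]$ for a binary outcome and $\rho = \sigma_B^2 / (\sigma_B^2 + \sigma_W^2)$ for a quantitative outcome, and takes the coefficient of variation to be $k = \sigma_B / \pi$. The function names and input values are hypothetical and are not taken from the session.

```python
def icc_binary(sigma_b2, pi):
    """ICC for a binary outcome: between-cluster variance of the true
    proportions divided by the total Bernoulli variance pi * (1 - pi)."""
    return sigma_b2 / (pi * (1 - pi))


def icc_quantitative(sigma_b2, sigma_w2):
    """ICC for a quantitative outcome: share of total variance that lies
    between clusters."""
    return sigma_b2 / (sigma_b2 + sigma_w2)


def icc_from_k(k, pi):
    """ICC implied by a coefficient of variation k of the true cluster
    proportions (assumes k = sigma_B / pi, so sigma_B^2 = (k * pi)^2)."""
    return icc_binary((k * pi) ** 2, pi)


# Hypothetical inputs: overall proportion 5%, k = 0.25
rho_binary = icc_from_k(k=0.25, pi=0.05)
rho_quant = icc_quantitative(sigma_b2=1.0, sigma_w2=3.0)
print(f"rho (binary, from k) = {rho_binary:.4f}")
print(f"rho (quantitative)   = {rho_quant:.2f}")
```

Under these assumptions a fairly large k corresponds to a numerically small $\rho$ when the outcome is rare, which is one reason the two measures are easy to misread against each other.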
For proportions,
$$c = 1 + f\,\frac{\pi_0(1 - \pi_0)/m + \pi_1(1 - \pi_1)/m + k^2(\pi_0^2 + \pi_1^2)}{(\pi_0 - \pi_1)^2}$$
where $k$ is the coefficient of variation in the (true) rates or proportions between clusters in each
treatment arm (see Page 1819).
k             c
(not shown)   4.2
0.25          5.8
0.35          8.2
In this example, we can see that the number of clusters required is quite sensitive to k. We can
also see that as k increases, the between cluster variation or intra-cluster correlation is
increasing and so the number of clusters required increases.
We can see that it would be helpful to get estimates of k for HIV incidence in these communities
while planning the study, but of course this may not be possible.
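To see how sensitive the number of clusters is to k, the formula for proportions can be evaluated directly. The sketch below makes two assumptions that are not spelled out in this extract: that $f = (z_{\alpha/2} + z_\beta)^2$, and that $m$ is the number of individuals per cluster. The input values are hypothetical and are not the ones behind the figures quoted above.

```python
from statistics import NormalDist


def design_factor(alpha=0.05, power=0.80):
    """f = (z_{alpha/2} + z_beta)^2 (assumed definition of f)."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2


def clusters_per_arm(pi0, pi1, m, k, alpha=0.05, power=0.80):
    """Clusters per arm for comparing two proportions in an unmatched CRT.

    pi0, pi1 : true proportions in the control and intervention arms
    m        : individuals per cluster (assumed meaning of m)
    k        : between-cluster coefficient of variation
    """
    f = design_factor(alpha, power)
    binomial_part = pi0 * (1 - pi0) / m + pi1 * (1 - pi1) / m
    between_part = k ** 2 * (pi0 ** 2 + pi1 ** 2)
    return 1 + f * (binomial_part + between_part) / (pi0 - pi1) ** 2


# Hypothetical scenario: 10% vs 5%, 100 individuals per cluster
for k in (0.15, 0.25, 0.35):
    c = clusters_per_arm(pi0=0.10, pi1=0.05, m=100, k=k)
    print(f"k = {k:.2f}: c = {c:.1f} clusters per arm")
```

In practice c would be rounded up to a whole number of clusters. Because the between-cluster term does not shrink as m grows, increasing cluster size alone cannot compensate for a large k.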
Note that with a pair-matched design, the sample size equations given should be amended to add 2
rather than 1 to the right-hand side. However, k now represents the between-cluster variation within
matched pairs, and if the matching is effective this may be much smaller than the unmatched k so
that the required sample size is reduced.
By having larger clusters, the variability among clusters in the proportions, rates or means is
reduced. Hence, there is more power to detect a difference between arms.
The second approach is to analyse the individual-level data using regression methods that make
allowance for correlated data. These methods were introduced in AS09.
We will now cover cluster-level analysis followed by individual-level analysis.
The proportion of children in each school (and overall) who reported tobacco use after 2 years in
the Smoke-free generation intervention trial:
Intervention   Proportion      Control    Proportion
0/42           0               5/103      0.049
1/84           0.012           3/174      0.017
9/149          0.060           6/83       0.072
11/136         0.081           6/75       0.080
4/58           0.069           2/152      0.013
1/55           0.018           7/102      0.069
10/219         0.046           7/104      0.067
4/160          0.025           3/74       0.041
2/63           0.032           1/55       0.018
5/85           0.059           23/225     0.102
1/96           0.010           16/125     0.128
10/194         0.052           12/207     0.058
Overall:
58/1341        0.0433          91/1479    0.0615
The main arguments for using the overall measure (of rates, risks or means) are:
(i) ease of estimation;
(ii) if each treatment arm is a random cluster sample from the population of interest, then the
overall measure gives consistent estimates of the population measure; and
(iii) clusters are weighted according to their size, giving equal weight to each individual.
The main arguments for using the means of the cluster measure (of rates, risks or means) are:
(i) this gives a better tie-in with simple inferential procedures based on the t-test (see later);
(ii) the clusters do not always form a random cluster sample from a well-defined population - for
example, the clusters (communities) may have been arbitrarily selected and then randomly
allocated to the two treatment arms; and
(iii) equal weight is given to each cluster.
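The two summary measures can be compared on the school data from the Smoke-free generation table. The sketch below computes, for each arm, the overall risk (which weights each child equally) and the mean and SD of the school-level risks (which weight each school equally). The variable and function names are illustrative only.

```python
from statistics import mean, stdev

# (children reporting tobacco use, children surveyed) for each school
intervention = [(0, 42), (1, 84), (9, 149), (11, 136), (4, 58), (1, 55),
                (10, 219), (4, 160), (2, 63), (5, 85), (1, 96), (10, 194)]
control = [(5, 103), (3, 174), (6, 83), (6, 75), (2, 152), (7, 102),
           (7, 104), (3, 74), (1, 55), (23, 225), (16, 125), (12, 207)]


def summarise(arm):
    """Return overall risk, mean of cluster risks, and SD of cluster risks."""
    events = sum(e for e, _ in arm)
    children = sum(n for _, n in arm)
    cluster_risks = [e / n for e, n in arm]
    return {
        "clusters": len(arm),
        "overall_risk": events / children,      # each child weighted equally
        "mean_cluster_risk": mean(cluster_risks),  # each school weighted equally
        "sd_cluster_risk": stdev(cluster_risks),   # sample SD (n - 1 divisor)
    }


summary_int = summarise(intervention)
summary_ctl = summarise(control)
for label, s in (("Intervention", summary_int), ("Control", summary_ctl)):
    print(label, {key: round(value, 4) for key, value in s.items()})
```

The overall risks reproduce the 0.0433 and 0.0615 shown in the table, while the means of the school-level risks are somewhat lower in both arms because the larger schools tend to have the higher proportions.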
Note this formula is the same whether $r_0$ and $r_1$ represent risks or rates.
Then $\mathrm{Var}(r_1)$ can be estimated approximately as $s_1^2/c_1$, where $s_1$ is the SD of the observed cluster
risks in the intervention arm, and $c_1$ is the number of clusters in that arm, and similarly for the
control arm.
Calculate the 95% CI for the risk ratio using overall risks (i.e. RR = 0.0433/0.0615 = 0.704).
Using the given formula we obtain:
$\mathrm{Var}(\log \mathrm{RR}) \approx (0.026^2/12)/0.043^2 + (0.035^2/12)/0.062^2 = 0.0571$
Then taking $\log(\mathrm{RR}) \pm 1.96\sqrt{\mathrm{Var}(\log \mathrm{RR})}$, we obtain an approximate 95% CI for log(RR):
$-0.351 \pm 1.96 \times 0.239$, or -0.819 to 0.117.
Exponentiating these values, we obtain the corresponding CI for the RR of 0.70 as 0.44 to 1.12.
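The calculation above can be checked with a few lines of code. The sketch assumes the approximation $\mathrm{Var}(\log \mathrm{RR}) \approx \mathrm{Var}(r_1)/r_1^2 + \mathrm{Var}(r_0)/r_0^2$ with $\mathrm{Var}(r_i) \approx s_i^2/c_i$, which is the form used in the worked answer. The function name is illustrative. It uses the overall risks to four decimal places, so intermediate values differ very slightly from the hand calculation, which rounds them to 0.043 and 0.062.

```python
import math


def ratio_ci(r1, r0, s1, s0, c1, c0, z=1.96):
    """Approximate CI for a risk or rate ratio r1/r0 from a CRT.

    s1, s0 : SD of the observed cluster-level risks (or rates) in each arm
    c1, c0 : number of clusters in each arm
    """
    var_log_rr = (s1 ** 2 / c1) / r1 ** 2 + (s0 ** 2 / c0) / r0 ** 2
    se_log_rr = math.sqrt(var_log_rr)
    log_rr = math.log(r1 / r0)
    lower = math.exp(log_rr - z * se_log_rr)
    upper = math.exp(log_rr + z * se_log_rr)
    return r1 / r0, lower, upper, var_log_rr


# Smoke-free generation trial: overall risks and SDs of the school-level risks
rr, lower, upper, var_log_rr = ratio_ci(
    r1=0.0433, r0=0.0615, s1=0.026, s0=0.035, c1=12, c0=12
)
print(f"RR = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
print(f"Var(log RR) = {var_log_rr:.4f}")
```

The interval is symmetric on the log scale, which is why it is calculated for log(RR) first and only then exponentiated.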
t-test of the cluster level mortality rates from the Ghana bednet study (0=control; 1=intervention).
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      48    27.92242    1.840102     12.7486    24.22061    31.62423
       1 |      48    23.97155    1.404359    9.729682    21.14635    26.79676
---------+--------------------------------------------------------------------
combined |      96    25.94699    1.168985    11.45367    23.62626    28.26772
---------+--------------------------------------------------------------------
    diff |            3.950866    2.314778               -.6451805    8.546912
------------------------------------------------------------------------------
    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9544         Pr(|T| > |t|) = 0.0912          Pr(T > t) = 0.0456
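The diff row of this output can be reproduced from the group summaries alone. The sketch below assumes an unpaired t-test with a pooled (equal) variance, which is consistent with the standard error reported for diff. The function name is illustrative, and the P-value is not recomputed because the Python standard library has no t-distribution.

```python
import math


def pooled_t(mean0, sd0, n0, mean1, sd1, n1):
    """Two-sample t statistic with pooled variance, from summary statistics."""
    df = n0 + n1 - 2
    pooled_var = ((n0 - 1) * sd0 ** 2 + (n1 - 1) * sd1 ** 2) / df
    se_diff = math.sqrt(pooled_var * (1 / n0 + 1 / n1))
    diff = mean0 - mean1
    return diff, se_diff, diff / se_diff, df


# Cluster-level mortality rates, Ghana bednet trial (0 = control, 1 = intervention)
diff, se_diff, t_stat, df = pooled_t(
    mean0=27.92242, sd0=12.7486, n0=48,
    mean1=23.97155, sd1=9.729682, n1=48,
)
print(f"diff = {diff:.4f}, SE = {se_diff:.4f}, t = {t_stat:.3f} on {df} df")
```

Note that the unit of analysis is the cluster, so the test has 94 degrees of freedom (96 clusters minus 2), not a number based on the roughly 16,000 children followed up.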
Then $\mathrm{Var}(r_1)$ can be estimated approximately as $s_1^2/c_1$, where $s_1$ is the SD of the observed cluster
rates in the intervention arm, and $c_1$ is the number of clusters in that arm, and similarly for the
control arm.
Calculate the approximate confidence interval for the rate ratio using cluster rates.
From the output in the table on the previous page, we know that $s_1$ and $s_0$ are 9.730 and 12.749
respectively. Hence
$\mathrm{Var}(\log \mathrm{RR}) \approx (9.730^2/48)/23.972^2 + (12.749^2/48)/27.922^2 = 0.00778$.
Taking $\log(\mathrm{RR}) \pm 1.96\sqrt{\mathrm{Var}(\log \mathrm{RR})}$, we obtain an approximate 95% CI for log(RR):
$\log(23.972/27.922) \pm 1.96\sqrt{0.00778}$, or -0.325 to 0.020.
Exponentiating, we obtain the CI for the RR of 0.86 as 0.72 to 1.02.
Finally, by dividing the observed number of deaths by the expected number, in each cluster, we
obtain the cluster level residuals. Would it matter here if we used the ratio of observed to
expected rates instead of the ratio of observed to expected deaths?
No, it would not have mattered. We could have calculated the observed and expected rate in each
cluster as the observed or expected deaths divided by total person time in that cluster. However,
as the person time in the cluster is the same for both the observed and the expected rate, the
ratio of observed to expected rates is the same as the ratio of observed to expected deaths.
However, it does matter if we want the rate difference, when we would have to use the difference
in rates, not the difference in deaths.
Cluster   Arm   Observed deaths   Expected deaths   Residual
1         1     12                6.44              1.86
2         0     11                6.11              1.80
3         0     6                 6.08              0.99
4         0     12                6.37              1.88
5         1     9                 8.73              1.03
6         0     9                 10.40             0.87
7         0     7                 7.86              0.89
8         1     10                9.53              1.05
9         1     8                 10.65             0.75
10        1     9                 7.51              1.20
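The residuals in this table are simply observed deaths divided by expected deaths. The sketch below recomputes them for the ten clusters listed and also illustrates the point made in the answer above: dividing both counts by the same person-time leaves the ratio unchanged. The person-years figure is hypothetical, since person-time is not shown in the table.

```python
import math

# (cluster, arm, observed deaths, expected deaths) for the ten clusters listed
clusters = [
    (1, 1, 12, 6.44), (2, 0, 11, 6.11), (3, 0, 6, 6.08), (4, 0, 12, 6.37),
    (5, 1, 9, 8.73), (6, 0, 9, 10.40), (7, 0, 7, 7.86), (8, 1, 10, 9.53),
    (9, 1, 8, 10.65), (10, 1, 9, 7.51),
]

# Ratio residual: observed / expected deaths in each cluster
residuals = [obs / exp for _, _, obs, exp in clusters]
for (cluster, arm, _, _), res in zip(clusters, residuals):
    print(f"cluster {cluster:2d} (arm {arm}): residual = {res:.2f}")

# Using rates instead of counts gives the same ratio, because the cluster's
# person-time cancels. 250.0 person-years is a made-up value for illustration.
person_years = 250.0
_, _, obs, exp = clusters[0]
ratio_of_deaths = obs / exp
ratio_of_rates = (obs / person_years) / (exp / person_years)
print(math.isclose(ratio_of_deaths, ratio_of_rates))
```

The same cancellation does not happen for a difference, so a rate difference has to be computed from rates rather than from counts of deaths.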
    diff |            .1701979    .0870485               -.0026389    .3430346
------------------------------------------------------------------------------
    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9732         Pr(|T| > |t|) = 0.0535          Pr(T > t) = 0.0268
Returning to the Ghana bednet trial, if we ignore the clustered design and fit a simple Poisson
regression model to the data, we obtain an unadjusted rate ratio for the intervention effect of 0.84
(95% CI: 0.74 to 0.96) with a P-value of 0.012.
Note however that this analysis is invalid as it does not take account of intra-cluster correlation.
                                                Number of obs      =     26342
                                                Number of groups   =        96
------------------------------------------------------------------------------
     outcome |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  _Ibednet_1 |   .8472471   .0729997    -1.92   0.054     .7155989    1.003114
      follyr | (exposure)
-------------+----------------------------------------------------------------
    /lnalpha |  -2.764239   .4069338                     -3.561814   -1.966663
-------------+----------------------------------------------------------------
       alpha |   .0630241   .0256466                      .0283873     .139923
------------------------------------------------------------------------------
Likelihood-ratio test of alpha=0: chibar2(01) =    11.82 Prob>=chibar2 = 0.000
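The z statistic, P-value and confidence interval in this output can be recovered from the rate ratio and its standard error, working on the log scale. The sketch below assumes a delta-method standard error, SE(log IRR) = SE(IRR)/IRR, and a normal reference distribution. It also assumes that alpha is the variance of a gamma-distributed random effect with mean 1, in which case its square root would be the between-cluster coefficient of variation. That interpretation is an assumption about the model behind the output and is not stated in this extract.

```python
import math
from statistics import NormalDist


def irr_inference(irr, se_irr, level=0.95):
    """Recover z, two-sided P and CI for a rate ratio from its SE."""
    normal = NormalDist()
    log_irr = math.log(irr)
    se_log = se_irr / irr                      # delta-method SE on the log scale
    z = log_irr / se_log
    p = 2 * (1 - normal.cdf(abs(z)))
    z_crit = normal.inv_cdf(1 - (1 - level) / 2)
    lower = math.exp(log_irr - z_crit * se_log)
    upper = math.exp(log_irr + z_crit * se_log)
    return z, p, lower, upper


# Values read from the random effects output for _Ibednet_1
z, p, lower, upper = irr_inference(irr=0.8472471, se_irr=0.0729997)
print(f"z = {z:.2f}, P = {p:.3f}, 95% CI {lower:.4f} to {upper:.4f}")

# alpha is reported on the log scale as /lnalpha
alpha = math.exp(-2.764239)
k_implied = math.sqrt(alpha)   # assumes alpha = variance of a mean-1 gamma effect
print(f"alpha = {alpha:.4f}, implied k = {k_implied:.2f}")
```

Compared with the simple Poisson analysis, the point estimate barely moves but the interval widens to include 1, which is the expected consequence of allowing for between-cluster variation.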
                                                Number of obs      =     26342
                                                Number of groups   =        96
------------------------------------------------------------------------------
     outcome |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  _Ibednet_1 |   .8374889    .070685    -2.10   0.036     .7098012    .9881467
   _Iagegp_1 |   .3969492   .0383497    -9.56   0.000     .3284728    .4797008
   _Iagegp_2 |   .2990543   .0308447   -11.70   0.000     .2443186    .3660527
   _Iagegp_3 |   .1715907   .0256085   -11.81   0.000     .1280734    .2298946
   _Iagegp_4 |   .1827192   .0416134    -7.46   0.000     .1169304    .2855229
     _Isex_1 |   .9422119   .0644793    -0.87   0.384     .8239436    1.077456
      follyr | (exposure)
-------------+----------------------------------------------------------------
    /lnalpha |  -2.881458   .4360814                     -3.736162   -2.026755
-------------+----------------------------------------------------------------
       alpha |    .056053   .0244437                      .0238454    .1317625
------------------------------------------------------------------------------
Likelihood-ratio test of alpha=0: chibar2(01) =     9.80 Prob>=chibar2 = 0.001
Section 9: Summary
This is the end of AS10. When you are happy with the material covered here, please move on to
session AS11.
The main points of this session are summarised below.
Background to CRTs
Randomised controlled trials are the gold standard for evaluating health interventions. Often it is
more appropriate to randomise groups of people (clusters) to study arms, rather than
individuals.
Possible reasons include: the intervention is naturally delivered at a community level; logistical
convenience; to reduce mixing between people in different arms; to capture mass effects of an
intervention; or to capture the impact on infectiousness of infected individuals.
Measuring intra-cluster correlation
Individuals from one cluster may be more alike than individuals from different clusters; this is termed
intra-cluster correlation and is measured by $\rho$ (rho). This happens if and only if there are
differences between clusters, termed between-cluster variation and measured by the
coefficient of variation of cluster-level responses ($k$). Hence, intra-cluster correlation and
between-cluster variation are two ways of measuring the same thing.
Design issues
Correlated data contain less information than uncorrelated data and so a CRT needs a larger
sample size than an individually randomised trial for the same power. The sample size
calculations are complicated by the need to determine both the number of clusters and the
number of individuals per cluster. In addition, assumptions must be made about the level of
between-cluster variation.
Analysis at the cluster level
Analysis of CRTs must take the clustering into account. There are two main approaches for this:
The first is to analyse the cluster-level summaries (usually risks, rates or means) as the unit of
observation with standard methods, such as the t-test. These methods are robust even for small
numbers of clusters, but they are inconvenient when adjusting for covariates.
Analysis at the individual level
The second method is to use the individual-level regression methods for correlated data
introduced in AS09, such as GEE and random effects models. These have a number of
advantages, including convenience and taking intra-cluster correlation into account in the
estimated effect of the intervention, but they are not robust with small numbers of clusters. As a
general rule, if there are fewer than 15 clusters per study arm, cluster-level methods are
recommended.