
American Journal of Epidemiology

Copyright 1997 by The Johns Hopkins University School of Hygiene and Public Health
All rights reserved

Vol. 146, No. 5


Printed in U.S.A.

Two-Stage Sampling for Etiologic Studies


Sample Size and Power

Douglas Schaubel,1,2 James Hanley,1 Jean-Paul Collet,1,3 Jean-Francois Boivin,1,3 Colin Sharpe,1-3
Howard I. Morrison,2 and Yang Mao2

biometry; case-control studies; confounding factors (epidemiology); epidemiologic methods; regression analysis; sample size; two-stage sampling

Many computerized data sources are potentially useful for epidemiologic research. Examples include physician claim files, hospital separation records, prepaid insurance plan databases, and occupational records. Studies using these databases can often be conducted more quickly and at a lower cost than those involving primary data collection. Unfortunately, such databases were seldom established with a view toward etiologic research. Although data on the exposure of interest may be available, with data on the endpoint of interest obtainable through record linkage, data on extraneous variables which could potentially confound the exposure-disease association are typically unavailable. The cost of surveying the entire study population may be prohibitive. A cost-effective alternative is to collect confounder data on a subset of the original study population, an approach which has been termed "two-stage sampling" (1-4), since information pertaining to the crude and covariable-adjusted exposure effects is obtained in two separate phases of the investigation.
For efficiency, stage 2 sample selection is typically based jointly on exposure and disease status. Although methods for analyzing two-stage data exist (1, 3, 4), issues of sample size estimation have not been explicitly addressed. The objective of this article is to provide a method for deciding on the number of subjects to be selected at the second stage of a two-stage study.
EXAMPLES

Consider a hypothetical occupational cohort study of railway workers. Information on diesel exhaust
exposure for cohort members could be estimated on
the basis of employment histories. The cancer incidence experience of the cohort could be determined
through linkage to vital statistics and cancer registry
databases. However, the validity of any association
would be compromised by the lack of information on
smoking, since exposure might vary directly with
smoking prevalence. If smoking data were not collected a priori, two-stage sampling could enhance the

Received for publication July 23, 1996, and in final form February 4, 1997.
Abbreviation: OR, odds ratio.
1 Department of Epidemiology and Biostatistics, Faculty of Medicine, McGill University, Montreal, Quebec, Canada.
2 Laboratory Centre for Disease Control, Health Canada, Ottawa, Ontario, Canada.
3 Centre for Clinical Epidemiology and Community Studies, Sir Mortimer B. Davis-Jewish General Hospital, Montreal, Quebec, Canada.
Reprint requests to Douglas Schaubel, Room 1388, LCDC Building 0601C1, Health Canada, Tunney's Pasture, Ottawa, Ontario K1A 0L2, Canada.


Downloaded from http://aje.oxfordjournals.org/ by RamaKrishna ch on August 25, 2016

Preexisting computerized databases are potentially valuable sources of epidemiologic data. Since such
databases are infrequently created specifically for etiologic research, data may be available for the exposure
of interest and, through record linkage, for the endpoint of interest, but lacking for potential confounders.
Because of the size of these databases, two-stage sampling is an efficient alternative to surveying the entire
study population for confounder data. At stage 1, information on exposure and disease status is obtained for
the entire study population. Confounder data are collected for probability-selected subsamples at stage 2.
Logistic regression is performed on the stage 2 sample, with the parameter estimates and variances appropriately corrected to account for the stage 1 data. In this paper, the authors present methods for determining
the required stage 2 sample size in the case of categorical exposure and confounding variables. Sample size
tables, power curves, and a computer program have been produced to accommodate a binary exposure and
a single binary confounder. With the increasing availability of preexisting yet incomplete databases, the
potential for use of two-stage sampling will greatly increase in the future. This investigation provides a basis
for estimating the number of participants to sample for the collection of confounder data at the second stage.
Am J Epidemiol 1997;146:450-8.

Sample Size Estimation for Two-Stage Studies

TWO-STAGE SAMPLING METHOD

The data layout for a two-stage study is depicted in figures 1 and 2. At stage 1, the investigator obtains data on exposure and disease status. This first-stage sample could represent the traditional case-control, historical cohort, or cross-sectional study. At stage 2, confounder data are collected for a subset of the stage 1 participants. For efficiency (i.e., inverse of variance or information per sample member), a "balanced" design is often utilized at stage 2. That is, to the extent possible, equal numbers of stage 1 members are selected from each cell of the cross-tabulation of disease and exposure. When a cell has been exhaustively sampled, the remaining cells are sampled equally. This sampling algorithm, also referred to as "equal allocation," is usually more efficient than random, disease-based, or exposure-based stage 2 sampling. Its efficiency derives from sampling fractions that are inversely proportional to cell size, exploiting the fact that observations from small groups contribute, on average, more information than those from large groups. The reason for the increased efficiency can be understood heuristically from Woolf's formula for the variance of the logarithm of the odds ratio (OR) (5).

    E          D = 0          D = 1
    0          N_{0,0}        N_{1,0}
    1          N_{0,1}        N_{1,1}
    ...        ...            ...
    J-1        N_{0,J-1}      N_{1,J-1}
    Total      N_0            N_1

FIGURE 1. Cross-tabulation of exposure level (E) and disease status (D) (2 x J table) for the stage 1 sample. At stage 1, data on exposure and disease status are available for $N = \sum_i \sum_j N_{ij}$ subjects.
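As a rough illustration of this heuristic, the sketch below applies Woolf's formula (the sum of reciprocal cell counts) to two ways of drawing 600 stage 2 subjects from a stage 1 table. The stage 1 counts are borrowed from the expected counts in figure 3; the function name `woolf_variance` and the comparison with proportional (simple random) sampling are assumptions made for illustration only, and no cell is exhausted in this example.

```python
def woolf_variance(cells):
    """Woolf's variance of the log odds ratio: sum of reciprocal cell counts."""
    return sum(1.0 / n for n in cells)

# Stage 1 counts for the 2 x 2 table of disease by exposure, ordered
# (D=0,E=0), (D=0,E=1), (D=1,E=0), (D=1,E=1); values taken from figure 3.
stage1 = [1800.0, 200.0, 808.1, 191.9]
n2 = 600

# Balanced design: equal numbers per cell (possible here because every
# cell holds at least n2 / 4 = 150 subjects).
balanced = [n2 / 4.0] * 4

# Simple random sampling: every cell is sampled with the same fraction.
fraction = n2 / sum(stage1)
proportional = [fraction * n for n in stage1]

v_balanced = woolf_variance(balanced)
v_proportional = woolf_variance(proportional)
print(f"balanced:     {v_balanced:.4f}")
print(f"proportional: {v_proportional:.4f}")
```

Under proportional sampling the two small exposed cells dominate the sum of reciprocals, which is the intuition behind oversampling them in the balanced design.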
ANALYSIS OF TWO-STAGE DATA
Preliminary issues

Assume that disease status (D) is dichotomous (1 = present, 0 = absent), exposure (E) is categorical with J levels (0, ..., J - 1), and there exists one confounder (C) which is categorical with K levels (0, ..., K - 1), with 0 serving as the reference level for both exposure and confounder. Let $N_{ij}$ denote the number of subjects in the stage 1 sample with D = i and E = j, where $N = \sum_i \sum_j N_{ij}$. Let $n_{ijk}$ be the number of stage 2 sample members with D = i, E = j, and C = k, where $n_2 = \sum_i \sum_j \sum_k n_{ijk}$ and $n_{ij} = \sum_k n_{ijk}$. Assume that disease incidence can be described by a multiplicative model. That is,

$$\log\frac{\Pr(D = 1 \mid E, C)}{\Pr(D = 0 \mid E, C)} = \alpha + \sum_{j=1}^{J-1} \beta_j E_j + \sum_{k=1}^{K-1} \gamma_k C_k, \tag{1}$$

where $\{E_j\}$ are indicator variables such that $E_j = 1$ when E = j and 0 otherwise, with the $\{C_k\}$ similarly defined.
Data analysis

As outlined by Breslow and Cain (3, 4), to analyze two-stage data using logistic regression, parameter estimates are obtained from the stage 2 data. These estimates and their corresponding covariance matrix are corrected to account for the information provided by the stage 1 data and the biased stage 2 sampling mechanism, the bias having been introduced by exposure-dependent sampling. The corrected $OR_j$, contrasting disease incidence between subjects with E = j versus E = 0, is given by

$$OR_j = OR_j^{(2)} \times \frac{s_{10}\, s_{0j}}{s_{1j}\, s_{00}}, \tag{2}$$

where "(2)" denotes estimation based on the stage 2 data alone and $s_{ij} = n_{ij}/N_{ij}$ denotes the sampling fractions. Thus, the adjusted $OR_j$ is $OR_j^{(2)}$ scaled by the cross-product of the sampling fractions for the 2 X 2 table relating disease status and exposure levels 0
and j , algebraically similar to the correction for selection bias described by Kleinbaum et al. (6). The vari-


study's validity while negating the need to survey the entire cohort.
It was during the development of a protocol for a
Canadian pharmacoepidemiologic study that the need
for work on sample size calculations for two-stage
studies was discovered. The investigation sought primarily to evaluate the effect of nonsteroidal antiinflammatory drugs and antidepressants on the risk of
breast cancer. Information on prescription medication
history and various demographic factors was obtained
using data from a provincial health organization. Case
ascertainment was accomplished through linkage to
the provincial cancer registry. However, the investigators required data on several variables which could
potentially confound the association of interest, such
as ages at menarche and menopause, family history,
nulliparity, obesity, and alcohol consumption. Efforts
to estimate the cost of interviews at the second stage
were thwarted by the unavailability of stage 2 sample
size estimates.


                                        D = 0
    E        C = 0          C = 1          ...    C = K-1          Total
    0        n_{0,0,0}      n_{0,0,1}      ...    n_{0,0,K-1}      n_{0,0} = s_{0,0} N_{0,0}
    1        n_{0,1,0}      n_{0,1,1}      ...    n_{0,1,K-1}      n_{0,1} = s_{0,1} N_{0,1}
    ...      ...            ...            ...    ...              ...
    J-1      n_{0,J-1,0}    n_{0,J-1,1}    ...    n_{0,J-1,K-1}    n_{0,J-1} = s_{0,J-1} N_{0,J-1}
    Total                                                          n_0

                                        D = 1
    E        C = 0          C = 1          ...    C = K-1          Total
    0        n_{1,0,0}      n_{1,0,1}      ...    n_{1,0,K-1}      n_{1,0} = s_{1,0} N_{1,0}
    1        n_{1,1,0}      n_{1,1,1}      ...    n_{1,1,K-1}      n_{1,1} = s_{1,1} N_{1,1}
    ...      ...            ...            ...    ...              ...
    J-1      n_{1,J-1,0}    n_{1,J-1,1}    ...    n_{1,J-1,K-1}    n_{1,J-1} = s_{1,J-1} N_{1,J-1}
    Total                                                          n_1

FIGURE 2. Cross-tabulation of exposure level (E), disease status (D), and confounder level (C) (2 x J x K table) for the stage 2 sample. Confounder (C) information is obtained for $n_2 = \sum_i \sum_j n_{ij}$ subjects. The $n_{ij}$'s are chosen at random from the $N_{ij}$ members of the stage 1 sample (see figure 1) with sampling fractions $s_{ij} = n_{ij}/N_{ij}$.

ance of the corrected exposure parameter estimate is


given by
$$V(\beta_j) = V_2(\beta_j)_{\text{logistic}} - \left\{ \left(\frac{1}{n_{1j}} - \frac{1}{N_{1j}}\right) + \left(\frac{1}{n_{0j}} - \frac{1}{N_{0j}}\right) + \left(\frac{1}{n_{10}} - \frac{1}{N_{10}}\right) + \left(\frac{1}{n_{00}} - \frac{1}{N_{00}}\right) \right\}, \tag{3}$$

where $V_2(\beta_j)_{\text{logistic}}$ is the variance from the logistic regression based on the stage 2 data alone. Because the correction factor is nonnegative (each $n_{ij} \le N_{ij}$), the correction reduces the variance of the exposure parameter estimate, since it incorporates additional information available on the crude association between exposure and disease. Breslow and Cain provide other relevant formulae with respect to the analysis of two-stage data (4) and their derivations (3).
SECOND-STAGE SAMPLE SIZE AND POWER
Background

For ease of presentation, we assume that exposure is binary (1 = present, 0 = absent). The null hypothesis, $H_0$: OR = $e^{\beta}$ = 1, is tested against the alternative hypothesis, $H_A$: OR = $e^{\beta} \neq 1$. The presence of only one nonreferent exposure level precludes the need to subscript $\beta$ or OR. Our goal is to estimate the required stage 2 sample size ($n_2$) such that we have power of at least $(1 - \epsilon)$ against a given value for OR $\neq$ 1, where $\epsilon$ refers to type II error. Thus, we seek the minimum $n_2$ such that the following inequality is satisfied:

$$\Phi\!\left(\frac{|\beta| - z_{\alpha/2}\sqrt{V_0(\beta)}}{\sqrt{V_A(\beta)}}\right) \ge 1 - \epsilon, \quad\text{i.e.,}\quad \frac{|\beta| - z_{\alpha/2}\sqrt{V_0(\beta)}}{\sqrt{V_A(\beta)}} \ge z_{\epsilon}, \tag{4}$$

where $\Phi$ is the cumulative distribution function for the standard normal distribution, $z_{\alpha/2}$ denotes the $100(1 - \alpha/2)$ percentile of the standard normal distribution corresponding to the type I error of size $\alpha$ (two-sided), $z_{\epsilon}$ is similarly defined for the type II error of size $\epsilon$, and the subscripts "0" and "A" refer to the sampling distribution for the logistic regression coefficient, $\beta$, under $H_0$ and $H_A$, respectively.
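A minimal sketch of this power calculation, assuming the projected variances under the null and alternative hypotheses are already available. The function name `two_stage_power` is illustrative and not taken from the authors' software; the inputs are the odds ratio and projected variances reported in the caption of figure 3.

```python
from math import log, sqrt
from statistics import NormalDist

def two_stage_power(odds_ratio, v0, va, alpha=0.05):
    """Approximate power of the two-sided test of beta = 0.

    v0 and va are the projected variances of the corrected log odds
    ratio under the null and alternative hypotheses, respectively.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha / 2.0)
    z = (abs(log(odds_ratio)) - z_alpha * sqrt(v0)) / sqrt(va)
    return std_normal.cdf(z)

# Values reported in figure 3: OR = 1.5, V0 = 0.02, VA = 0.017821.
power_fig3 = two_stage_power(1.5, v0=0.02, va=0.017821)
print(f"projected power: {power_fig3:.3f}")
```

With these inputs the result rounds to the 83 percent quoted in figure 3.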
Variances

As equation 4 indicates, to estimate power we must project $V(\beta)$ under the null and alternative hypotheses. For each scenario, we can write this variance as the sum of the variability associated with the first and second stages. That is,

$$V(\beta) = V(1) + V(2), \quad\text{with}\quad V(1) = V_1(\beta)_{\text{crude}} \quad\text{and}\quad V(2) = V_2(\beta)_{\text{logistic}} - V_2(\beta)_{\text{crude}}, \tag{5}$$

where V(1) and V(2) refer to the sampling variability at stages 1 and 2, respectively. V(2) represents the difference in the precision of the estimator of $\beta$ attributable to having adjusted for the confounder. The difference, $V_2(\beta)_{\text{crude}} - V(1)$, can be considered the gain in precision obtained by incorporating the stage 1 information into the stage 2 estimates. The crude variances are given by the sum of the reciprocals of the entries of the 2 X 2 table relating exposure and disease, as per Woolf's method (5). That is,

$$V_1(\beta)_{\text{crude}} = \frac{1}{N_{11}} + \frac{1}{N_{10}} + \frac{1}{N_{01}} + \frac{1}{N_{00}}, \qquad V_2(\beta)_{\text{crude}} = \frac{1}{n_{11}} + \frac{1}{n_{10}} + \frac{1}{n_{01}} + \frac{1}{n_{00}}, \tag{6}$$

and

$$V_2(\beta)_{\text{logistic}} = \left[\sum_{k=0}^{K-1} v_k^{-1}\right]^{-1}, \qquad v_k = \frac{1}{\hat{n}_{11k}} + \frac{1}{\hat{n}_{10k}} + \frac{1}{\hat{n}_{01k}} + \frac{1}{\hat{n}_{00k}}, \tag{7}$$

where the circumflexes indicate fitted values under the logistic model. Thus, the functional units of this variance $\{v_k\}$ are sums of the reciprocals of the frequencies in the confounder-stratum-specific 2 X 2 tables of fitted values. The regression-based variance has the same form as that obtained by Woolf's method, with the cell counts replaced by fitted values under the model. For $V_2(\beta_j)_{\text{logistic}}$ for a multilevel exposure, replace $\hat{n}_{i1k}$ with $\hat{n}_{ijk}$. In the case of more than one confounder, the summation is over all possible confounder level combinations. Thus, $V(\beta)$ can be completely specified in terms of the crude 2 X 2 tables at stages 1 and 2 and the confounder-stratum-specific fitted 2 X 2 tables at stage 2. A method of calculating these expected cell entries is outlined in the Appendix.
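The variance decomposition can be sketched in a few lines for a binary exposure. The function names below are illustrative, each 2 X 2 table is assumed to be passed as four counts in the order (D=0,E=0), (D=0,E=1), (D=1,E=0), (D=1,E=1), and the fitted values are taken to be the expected cell entries under the alternative hypothesis shown in figure 3.

```python
def woolf_variance(cells):
    """Sum of reciprocal cell counts of a 2 x 2 table (Woolf's method)."""
    return sum(1.0 / n for n in cells)

def logistic_variance(strata):
    """Inverse of the summed inverse stratum-specific Woolf variances."""
    return 1.0 / sum(1.0 / woolf_variance(table) for table in strata)

def projected_variance(stage1_cells, stage2_strata):
    """V = V(1) + V(2), with V(2) = V2(logistic) - V2(crude)."""
    # Collapse the stage 2 strata over the confounder to get the crude table.
    stage2_crude = [sum(cell) for cell in zip(*stage2_strata)]
    v1 = woolf_variance(stage1_cells)
    v2 = logistic_variance(stage2_strata) - woolf_variance(stage2_crude)
    return v1 + v2

# Expected cell entries under HA, as printed (rounded) in figure 3.
stage1_ha = [1800.0, 200.0, 808.1, 191.9]
stage2_ha = [
    [109.9, 61.0, 71.6, 27.9],   # stratum C = 0
    [40.1, 89.0, 78.4, 122.1],   # stratum C = 1
]
v_a = projected_variance(stage1_ha, stage2_ha)
print(f"projected variance under HA: {v_a:.4f}")
```

Because the inputs are the rounded entries from the figure, the result agrees with the reported value of 0.017821 only to about four decimal places.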
Having dealt with the variances, we can calculate power directly for a fixed value of $n_2$ and a given allocation on the $n_{ij}$. However, iteration is required to find the minimum $n_2$ for a prespecified level of power when the balanced design is employed, since the sampling fractions, and hence the adjustment to the variance, are unknown until $n_2$ is known. The sampling fractions are known in advance only when $n_{ij} = n_2/4$ (i.e., $n_2 \le \min\{N_{ij}\} \times 4$).
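One way to organize this iteration is sketched below for a binary exposure and a single binary confounder in a case-control setting. This is not the authors' SAS or C program: all function names are hypothetical, the fitted values are assumed equal to the expected cell entries described in the Appendix, controls are assumed to follow the source-population (E, C) distribution, and a simple upward scan over $n_2$ stands in for whatever search the original software uses. The prevalences in `p_fig3` are illustrative values consistent with figure 3 (margins of 10 and 30 percent, cross-product ratio close to 4).

```python
from math import exp, log, sqrt
from statistics import NormalDist

def allocate(cells, n2):
    """Equal allocation across cells, capping exhausted cells at their size."""
    alloc, open_cells, left = {}, dict(cells), float(n2)
    while open_cells:
        share = left / len(open_cells)
        small = [c for c, size in open_cells.items() if size <= share]
        if not small:
            alloc.update({c: share for c in open_cells})
            break
        for c in small:
            alloc[c] = open_cells.pop(c)
            left -= alloc[c]
    return alloc

def projected_variance(n2, n_controls, n_cases, p, beta, gamma):
    """V = V(1) + V(2); p[(j, k)] is the prevalence of E = j and C = k."""
    phi = {(j, k): p[(j, k)] * exp(beta * j + gamma * k) for (j, k) in p}
    weights = {0: p, 1: phi}            # controls follow p, cases follow phi
    totals = {0: n_controls, 1: n_cases}
    stage1 = {}
    for i in (0, 1):
        w_sum = sum(weights[i].values())
        for j in (0, 1):
            w_j = weights[i][(j, 0)] + weights[i][(j, 1)]
            stage1[(i, j)] = totals[i] * w_j / w_sum
    n_ij = allocate(stage1, n2)
    v_strata = []
    for k in (0, 1):
        v_k = 0.0
        for (i, j), n in n_ij.items():
            w = weights[i]
            v_k += 1.0 / (n * w[(j, k)] / (w[(j, 0)] + w[(j, 1)]))
        v_strata.append(v_k)
    v2_logistic = 1.0 / sum(1.0 / v for v in v_strata)
    v2_crude = sum(1.0 / n for n in n_ij.values())
    v1 = sum(1.0 / n for n in stage1.values())
    return v1 + v2_logistic - v2_crude

def power(n2, n_controls, n_cases, p, odds_ratio, conf_or, alpha=0.05):
    """Projected power for a given stage 2 sample size."""
    beta, gamma = log(odds_ratio), log(conf_or)
    v0 = projected_variance(n2, n_controls, n_cases, p, 0.0, gamma)
    va = projected_variance(n2, n_controls, n_cases, p, beta, gamma)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return NormalDist().cdf((abs(beta) - z_alpha * sqrt(v0)) / sqrt(va))

def minimum_n2(target, n_controls, n_cases, p, odds_ratio, conf_or):
    """Smallest n2 (scanning upward) whose projected power reaches target."""
    for n2 in range(4, n_controls + n_cases + 1):
        if power(n2, n_controls, n_cases, p, odds_ratio, conf_or) >= target:
            return n2
    return None  # target not reachable even if everyone is sampled

p_fig3 = {(0, 0): 0.65935, (0, 1): 0.24065, (1, 0): 0.04065, (1, 1): 0.05935}
power_600 = power(600, 2000, 1000, p_fig3, 1.5, 3.0)
n2_for_80 = minimum_n2(0.80, 2000, 1000, p_fig3, 1.5, 3.0)
print(f"power at n2 = 600: {power_600:.2f}; minimum n2 for 80% power: {n2_for_80}")
```

The scan recomputes the allocation at every candidate $n_2$, which is the point made in the text: the sampling fractions change with $n_2$ once a cell is exhausted, so the variance cannot be written as a fixed function of $n_2$ in advance.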
RESULTS AND ILLUSTRATION
Power with a given sample size

Figure 3 provides an example of a power calculation for a fixed stage 2 sample size for a binary exposure

Sample size for a given power

Table 1 lists the $n_2$'s required in order to achieve 90 percent power for various case-control studies under the balanced design. The user must anticipate the incidence density ratios for the exposure and the confounder; the prevalences of the exposure and the confounder ($p_E$ and $p_C$, respectively); and the degree of association between E and C, quantified by previous authors (3, 4) by the "control odds ratio," $\theta$, which is the cross-product ratio of cell prevalences in the 2 x 2 table relating E and C in the source population from which the cases arose (i.e., $\theta = p_{11}p_{00}/(p_{10}p_{01})$).
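In practice the four cell prevalences have to be recovered from the anticipated margins and the control odds ratio. The sketch below solves the resulting quadratic for $p_{11}$; the function name `joint_prevalences` is hypothetical, and the example inputs are the margins and control odds ratio quoted in figure 3.

```python
from math import sqrt

def joint_prevalences(p_e, p_c, theta):
    """Return {(j, k): p_jk} for E = j, C = k given margins and theta.

    theta = p11 * p00 / (p10 * p01) leads to a quadratic in p11:
    (theta - 1) p11^2 - [1 + (theta - 1)(p_e + p_c)] p11 + theta p_e p_c = 0.
    """
    if abs(theta - 1.0) < 1e-12:
        p11 = p_e * p_c                      # independence of E and C
    else:
        a = theta - 1.0
        b = -(1.0 + a * (p_e + p_c))
        c = theta * p_e * p_c
        p11 = (-b - sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return {
        (0, 0): 1.0 - p_e - p_c + p11,
        (0, 1): p_c - p11,
        (1, 0): p_e - p11,
        (1, 1): p11,
    }

p = joint_prevalences(p_e=0.10, p_c=0.30, theta=4.0)
for cell in sorted(p):
    print(cell, round(p[cell], 4))
```

Rounded to two decimals, these prevalences match the entries shown in section a of figure 3.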

Determinants of power/sample size

The factors affecting the power of a stage 2 sample were examined. Although the results displayed in figures 4 and 5 pertain to particular ranges or values of n2
and specific target population parameters, the observed trends apply generally. Power increases greatly
with the number of cases collected at stage 1, although, naturally, marginal returns eventually diminish (figure 4, top left panel). There appears to be little
gain in stage 2 power by sampling more than four
controls per case at stage 1 (figure 4, top right). The
strongest determinant of power is the exposure odds
ratio (figure 4, bottom left), which has a much greater
impact on power than the confounder odds ratio (figure 4, bottom right). The impact on study power of the
exposure prevalence exceeds that of the confounder
(figure 5, top panel). The values of exposure or confounder prevalence at which power is minimized and
maximized depend on the values of the remaining
parameters, although the curve is consistently
U-shaped for confounder prevalence and takes an inverted U shape for exposure prevalence. Stage 2
power is a decreasing function of the degree of association between exposure and confounder (figure 5,


The variance component associated with the logistic


regression-based parameter estimate is given by

and confounder, in the context of a case-control study with a stage 1 sample in which the number of cases and controls is prespecified and the stage 2 sample size is fixed in advance at $n_2 = 600$. If one collected confounder information for all subjects, the power would be approximately 92 percent. As is shown, approximately 90 percent of the power available from the corresponding single-stage design was retained even though the confounding was relatively strong (crude OR = 2.14; OR adjusted for C: $e^{\beta} = 1.5$) and only one fifth of the study population was sampled at stage 2, exemplifying the efficiency of two-stage sampling using the balanced design.

a. Exposure/Confounder Association

               C = 0            C = 1            Total
    E = 0      p_00 = 0.66      p_01 = 0.24      1 - p_E = 0.9
    E = 1      p_10 = 0.04      p_11 = 0.06      p_E = 0.1
    Total      1 - p_C = 0.7    p_C = 0.3

b. Expected 2 x 2 Table, Stage 1

               D = 0                                D = 1
    E = 0      N_0 (p_01 + p_00) = 1800 (1800)      N_1 (phi_01 + phi_00)/phi = 808.1 (863.3)
    E = 1      N_0 (p_11 + p_10) = 200 (200)        N_1 (phi_11 + phi_10)/phi = 191.9 (136.7)

c. Expected 2 x 2 x 2 Table, Stage 2

    D = 0      C = 0                                        C = 1
    E = 0      n_00 p_00/(p_01 + p_00) = 109.9 (113.1)      n_00 p_01/(p_01 + p_00) = 40.1 (41.3)
    E = 1      n_01 p_10/(p_11 + p_10) = 61.0 (62.8)        n_01 p_11/(p_11 + p_10) = 89.0 (91.7)

    D = 1      C = 0                                             C = 1
    E = 0      n_10 phi_00/(phi_01 + phi_00) = 71.6 (73.7)       n_10 phi_01/(phi_01 + phi_00) = 78.4 (80.7)
    E = 1      n_11 phi_10/(phi_11 + phi_10) = 27.9 (25.4)       n_11 phi_11/(phi_11 + phi_10) = 122.1 (111.3)

FIGURE 3. Example of power calculation for a two-stage study. A two-stage case-control study is planned to evaluate the effect on disease incidence (D) of an exposure (E) recorded on a binary scale after adjustment for a single binary confounder (C). The following quantities are known or estimated: cases ($N_1$ = 1,000), controls ($N_0$ = 2,000), exposure prevalence ($p_E$ = 10%), confounder prevalence ($p_C$ = 30%), exposure odds ratio ($e^{\beta}$ = 1.5) (crude exposure odds ratio = 2.1), confounder odds ratio ($e^{\gamma}$ = 3.0), and stage 2 sample size (fixed in advance) ($n_2$ = 600). The (E, C) distribution in the source population is described in section a, with $\theta = p_{00}p_{11}/(p_{01}p_{10}) = 4.0$ (the prevalences shown are rounded to two decimals). Expected cell entries for the cross-tabulation of D and E at stage 1 under $H_A$ ($H_0$ in parentheses) are given in section b. Expected cell entries for the cross-tabulation of D, E, and C at stage 2 under $H_A$ ($H_0$) are given in section c. The projected variances under $H_0$ and $H_A$ are $V_0(\beta) = 0.02$ and $V_A(\beta) = 0.017821$, respectively. Power is estimated at $1 - \Phi(-z_{\epsilon}) = 83\%$, where $z_{\epsilon} = (\log(1.5) - 1.96 \times \sqrt{0.02})/\sqrt{0.017821} = 0.961$. Power = 92% when the entire study population is sampled at stage 2. Note that $\phi_{jk} = p_{jk}\, e^{\beta j + \gamma k}$ and $\phi = \phi_{11} + \phi_{10} + \phi_{01} + \phi_{00}$ (written phi_jk and phi in the panels above).

bottom). Overall, the exposure attributes (odds ratio and prevalence) have the greatest impact on the power of the sample selected at stage 2, although confounder prevalence, the confounder odds ratio, and the control odds ratio also play a large role. The effect on power of varying one of these parameters depends on the values of the others. These results parallel those for the single-stage design, except that the stage 2 sample size required to achieve a fixed level of power in the two-stage design is more sensitive to the confounder attributes relative to that for the one-stage design (data not shown).
Available tools

Two tools for stage 2 sample size considerations have been developed for the case of a binary exposure and a single binary confounder. The first is a set of tables and power curves, generated by a program written in SAS (SAS Institute, Cary, North Carolina), which provides the required n2 for either 80 percent or 90 percent power. The second is an executable program (source code written in C) which can calculate either power for a given n2 or the minimum n2 needed to achieve a prespecified level of power. The relevant tables and software for this procedure are available from the first author upon request.

TABLE 1. Required stage 2 sample size* for two-stage case-control studies†

                        (N_1 = 1,000, N_0 = 2,000)      (N_1 = 2,000, N_0 = 4,000)
                         20%      30%      40%           20%      30%      40%

    10%       1.5         48       37       33            30       27       25
              3.0        136       99       84            85       72       64
              6.0        302      207      166           192      150      125
    30%       1.5         92       72       66            57       52       50
              3.0        245      195      179           156      143      136
              6.0        548      446      406           356      329      308
    50%       1.5         91       72       68            56       52       51
              3.0        220      184      179           139      134      135
              6.0        434      390      396           280      285      301

    10%       1.5        192      148      135           119      107      102
              3.0        311      236      206           200      173      156
              6.0        528      387      320           352      285      242
    30%       1.5        308      248      232           194      180      175
              3.0        460      390      373           303      288      282
              6.0        762      680      666           516      504      500
    50%       1.5        260      211      201           162      154      152
              3.0        352      308      310           228      226      235
              6.0        507      484      522           337      358      393

    10%       1.5        349      273      250           219      198      188
              3.0        490      381      339           323      282      257
              6.0        756      575      487           516      425      367
    30%       1.5        504      408      384           318      298      291
              3.0        639      560      548           428      414      414
              6.0        932      879      897           642      651      668
    50%       1.5        384      315      302           241      229      228
              3.0        443      397      408           291      293      308
              6.0        463      551      613           378      408      460

* Stage 2 sample size required to detect an exposure odds ratio (OR) of $e^{\beta}$ = 1.5 with 90% power and type I error of $\alpha$ = 0.05 (two-sided).
† A case-control study ($N_1$ cases and $N_0$ controls at stage 1) designed to evaluate the effect of exposure recorded on a binary scale, with adjustment for a single binary confounder with the following quantities anticipated: exposure prevalence = $p_E$, confounder prevalence = $p_C$, exposure OR = $e^{\beta}$, confounder OR = $e^{\gamma}$, and (E, C) cross-product ratio = $\theta$.

DISCUSSION

Although methods for analyzing two-stage data have been described, issues pertaining to sample size estimation have not been explicitly addressed. Here we provide a basis for determining an appropriate number of participants to select at the second stage when the exposure and confounders are categorical, and we have produced tables and software with which to accommodate a binary exposure with a single binary confounder.
Two-stage sampling was initially proposed for epidemiologic research independently by both White (1) and Walker (2) in 1982. Each presented a stratified analysis with a binary exposure. White derived the variance for the covariable-adjusted odds ratio, with estimation equivalent to weighted least squares. Breslow and Cain (3, 4) extended White's methods to incorporate multilevel and continuous variables using a pseudo-likelihood approach, with analysis based on logistic regression. Zhao and Lipsitz (7) have proposed a family of 12 two-stage sampling designs, of which those considered in this report constitute special cases. Wacholder et al. (8) coined the phrase "partial questionnaire design," wherein basic data are collected from all study participants through use of a brief

APPENDIX
Expected Cell Entries

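Stage 1. Assume that the numbers of diseased and nondiseased subjects at stage 1 are known, and that the data layout for the study population follows that of table 1. Under a multiplicative model, as in equation 1, with no interaction, the expected entries of the 2 X J table at stage 1 are:

$$E(N_{0j}) = N_0\, \frac{\sum_{k=0}^{K-1} p_{jk}}{\sum_{j=0}^{J-1}\sum_{k=0}^{K-1} p_{jk}}, \qquad E(N_{1j}) = N_1\, \frac{\sum_{k=0}^{K-1} \phi_{jk}}{\sum_{j=0}^{J-1}\sum_{k=0}^{K-1} \phi_{jk}},$$

where $p_{jk}$ is the prevalence of E = j and C = k in the source population and $\phi_{jk} = p_{jk}\, e^{\beta_j + \gamma_k}$, with $\beta_0 = \gamma_0 = 0$.

Stage 2. With the $n_{ij}$ known, the expected cell entries for the 2 X J X K table at stage 2 are given by:

$$E(n_{0jk}) = n_{0j}\, \frac{p_{jk}}{\sum_{k=0}^{K-1} p_{jk}}, \qquad E(n_{1jk}) = n_{1j}\, \frac{\phi_{jk}}{\sum_{k=0}^{K-1} \phi_{jk}}.$$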
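The expected cell entries can be checked numerically against figure 3. The sketch below handles the binary case only; the function names are hypothetical, and the prevalences in `p` are illustrative values consistent with the margins (10 and 30 percent) and control odds ratio (about 4) used in that figure.

```python
from math import exp, log

def expected_stage1(n_controls, n_cases, p, beta, gamma):
    """Expected 2 x 2 table of disease (i) by exposure (j) at stage 1."""
    phi = {(j, k): p[(j, k)] * exp(beta * j + gamma * k) for (j, k) in p}
    p_total, phi_total = sum(p.values()), sum(phi.values())
    cells = {}
    for j in (0, 1):
        cells[(0, j)] = n_controls * (p[(j, 0)] + p[(j, 1)]) / p_total
        cells[(1, j)] = n_cases * (phi[(j, 0)] + phi[(j, 1)]) / phi_total
    return cells, phi

def expected_stage2(n_ij, p, phi):
    """Expected 2 x 2 x 2 table at stage 2, given the stage 2 margins n_ij."""
    cells = {}
    for (i, j), n in n_ij.items():
        w = p if i == 0 else phi     # controls follow p, cases follow phi
        for k in (0, 1):
            cells[(i, j, k)] = n * w[(j, k)] / (w[(j, 0)] + w[(j, 1)])
    return cells

p = {(0, 0): 0.65935, (0, 1): 0.24065, (1, 0): 0.04065, (1, 1): 0.05935}
stage1, phi = expected_stage1(2000, 1000, p, beta=log(1.5), gamma=log(3.0))
# Balanced design with n2 = 600: 150 subjects from each (D, E) cell.
stage2 = expected_stage2({cell: 150.0 for cell in stage1}, p, phi)
for cell in sorted(stage2):
    print(cell, round(stage2[cell], 1))
```

Keys of `stage1` are (D, E) pairs and keys of `stage2` are (D, E, C) triples, so the output can be compared entry by entry with sections b and c of figure 3 under the alternative hypothesis.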
