Chris J. Skinner and Jon Wakefield

Introduction to the design and analysis of complex survey data

Article (Published version)
(Refereed)

Original citation: Skinner, Chris J. and Wakefield, Jon (2017) Introduction to the design and
analysis of complex survey data. Statistical Science, 32 (2). pp. 165-175. ISSN 0883-4237

DOI: 10.1214/17-STS614

© 2017 Institute of Mathematical Statistics

This version available at: http://eprints.lse.ac.uk/76991/


Available in LSE Research Online: May 2017

Statistical Science
2017, Vol. 32, No. 2, 165–175
DOI: 10.1214/17-STS614
© Institute of Mathematical Statistics, 2017

Introduction to the Design and Analysis of Complex Survey Data

Chris Skinner and Jon Wakefield

Abstract. We give a brief overview of common sampling designs used in a survey setting, and introduce the principal inferential paradigms under which data from complex surveys may be analyzed. In particular, we distinguish between design-based, model-based and model-assisted approaches. Simple examples highlight the key differences between the approaches. We discuss the interplay between inferential approaches and targets of inference and the important issue of variance estimation.
Key words and phrases: Design-based inference, model-assisted inference,
model-based inference, weights, variance estimation.

Chris Skinner is Professor, Department of Statistics, London School of Economics and Political Science, London, United Kingdom (e-mail: c.j.skinner@lse.ac.uk). Jon Wakefield is Professor, Departments of Statistics and Biostatistics, University of Washington, Seattle, Washington, USA (e-mail: jonno@uw.edu).

1. INTRODUCTION

Sampling has proved an essential tool over the last century to enable society to collect wide-ranging, accurate information about populations through cost-efficient survey data collection. Moreover, sample surveys, together with experiments, have provided core methods of data collection to support the development and application of modern statistical methods to scientific research. The central roles of surveys and sampling are seeing some challenges in the twenty-first century and "big data" era (e.g., Japec et al., 2015). Big data is taken here to refer to data sources which are generated as secondary outcomes of existing systems (rather than as a result of designed primary data collection) and which cover 100% of units to which the system applies (and thus involve no sampling). Such sources can be cheap and can provide information much more rapidly. Also, since they cover 100% of units, they may provide more granular estimates. In addition to competition from such sources, sample surveys now face threats to their accuracy from increasing nonresponse and major cost pressures. Nevertheless, they continue to have essential roles that big data sources cannot replace. Survey variables can be designed for research questions of interest rather than these questions having to be adapted to the, often very limited, sets of variables available from big data sources. Samples can be designed to represent populations of interest rather than study populations having to be adapted to the typically selective coverage of big data sources. In the light of such essential roles, the sample survey continues to be the method of choice in many settings and this special issue seeks to reflect the continuing vitality of developments in statistical methods in this field. We also aim to capture some of the evolution of the field as it advances. In an ideal situation, survey data can provide an important complement to alternative data sources. For example, estimation methods which combine carefully collected survey data and "big" data have the potential to leverage the advantages of both.

The design and analysis of surveys are fascinating enterprises. Unless one is trained in the field, however, they can be exercises shrouded in mystery. For instance, the expression "a weighted analysis is recommended" is a standard accompaniment to public release survey datasets but, unfortunately, weights can be constructed in many different and subtle ways which can leave the uninitiated scratching their heads in bewilderment. In this special issue, we hope to provide some enlightenment, beginning in this opening paper with a gentle introduction to the central themes of complex survey analysis.

The complexity of survey data alluded to in the title of this paper refers to the complex nature of sampling


designs, involving, for example, stratification and multistage sampling, together with associated complications such as nonresponse. Although we shall provide a brief outline of complex designs at the end of this section, we focus in this paper on the methodology of estimation and analysis and the question of how to account for the complex sampling design, largely treated here as given.

A further source of mystery for the many secondary analysts of complex survey data is that, whereas many novel developments in the methodology of survey design and estimation have been introduced by the agencies conducting large surveys, not all the information relating to these developments is made available when data are released for general analysis. For example, survey weights and imputed values may be made available but some of the complex features of the sampling frame, design and "raw" data underlying such released information may remain concealed.

Historically, survey sampling has often been seen as a rather separate topic from much of the rest of statistics. Not only have survey design and survey data analysis typically been undertaken by different people but the estimation methodology associated with survey design has often been centered on the design-based (or randomization) approach, quite separate from the model-based inference featuring in much mainstream applied statistics. While the former approach has a clear rationale, it can be confusing to those who have received a conventional model-based training in statistics. The slow rate of inclusion of complex survey methods in much applied statistical software for analysis has also contributed to this separation. Nevertheless, we have recently sensed a greater degree of cross-fertilization of ideas between survey sampling and applied statistical methods of analysis. A key purpose of this paper and special issue is to help support the sharing of such ideas by opening up developments in the survey sampling literature to a broader readership.

Before proceeding to consider inference, however, we set the scene by outlining some features of complex designs. We only consider probability sampling, in which the design is characterized via a probability distribution over the possible samples that may be collected. In particular, each unit in the population of interest has a nonzero probability of being selected. A complex design may be viewed as one deviating from the simplest design, simple random sampling (SRS), in which all subsets of $n$ from $N$ units are equally likely. Here, $N$ denotes the size of the population which is sampled and $n$ denotes the sample size. Why would one wish to deviate from SRS and, in particular, from one of its properties, that each unit in the population is selected with equal probability? One reason is for efficient estimation. Tillé and Wilhelm (2017) refer to the "false intuition that a sample must be similar to a population" and explain how more efficient estimation can often be achieved by sampling units with unequal probabilities. Other reasons include practical constraints imposed by the nature of frames from which the sample must be drawn and variable costs of data collection. See Valliant, Dever and Kreuter (2013) for a detailed account of a wide range of designs used in practice.

We now describe some basic designs. We emphasize that in practice these often act as the building blocks of more complex designs, because of the characteristics of the sampling frame or the population. A common design is stratified simple random sampling, in which a group label is available for each unit in the population, and SRSs are taken within each of the groups. For example, in studies of individuals, the groups may correspond to demographic strata and geographical regions. Stratified random sampling can provide appreciable gains in efficiency if the variables defining the strata are associated with the response. The main impediment to its use is the availability of strata information on all members of the population. In single-stage cluster sampling, the population is again partitioned, but this time into what are called "clusters", or primary sampling units (PSUs). The PSUs are often defined geographically.

The key difference between cluster sampling and stratified sampling is that only a sample, rather than all, of the clusters are selected and then information is obtained from all individuals within clusters. Cluster sampling in general reduces efficiency, because of within-cluster correlation, but logistically it is very convenient, particularly in nationwide sampling. In two-stage cluster sampling, random samples are taken within the sampled clusters (PSUs). In large national surveys, stratified multistage cluster sampling (in which cluster sampling and stratified SRS are executed) is the norm, since it balances efficiency, logistical constraints, and the requirement for estimates of sufficient precision to be obtained for subgroups of interest.

For the remainder of this paper, we turn to inference, but see Tillé and Wilhelm (2017) for additional probability sampling designs, with an emphasis on new developments. In Section 2, we introduce and compare

design-based and model-based inference for complex survey data. Section 3 brings in the role of auxiliary information and describes model-assisted inference, in which working models are adopted to suggest estimators/designs, but with the design-based approach being followed for inference. Sections 4 and 5 consider model parameters (as opposed to finite population characteristics) and nonprobability sampling. The important topic of variance estimation is the subject of Section 6. Final remarks conclude the paper in Section 7.

2. INFERENTIAL OVERVIEW

In this section, we provide a brief overview of inferential approaches to the analysis of survey data, assuming that probability sampling is used and there is no nonresponse. First, we define some notation. Let $y_k$, $k = 1, \ldots, N$, represent the values of a survey variable of interest on all $N$ units in a well-defined finite population (e.g., all individuals aged 18 or greater in a particular administrative region); we shall also write $k \in U$ to index this collection. A sample of these units, denoted $S \subset U$, is taken via some probabilistic mechanism, where $p(s)$ denotes the probability of selecting $S = s$, with $\sum_s p(s) = 1$, and the values of $y_k$ are only observed for $k \in S$. The probability of unit $k$ being selected is $\pi_k = \sum_{s: k \in s} p(s)$, and the so-called design weight is defined as $d_k = \pi_k^{-1}$. This weight is often interpreted loosely as the number of population units "represented" by the $k$th sampled unit.

We suppose here that finite population characteristics are the targets of inference. Other targets are considered in Section 4. It is common in the survey sampling literature to emphasize first the estimation of population totals, before moving on to targets which are functions of totals. There are a variety of reasons for this, beyond the obvious one that totals may indeed be targets of inference. One reason is that issues of bias can be dealt with more simply with totals than, say, means, especially when the population size $N$ is unknown, as it often is. However, to present basic ideas in this paper we shall treat the finite population mean $\bar{y}_U = \frac{1}{N} \sum_{k \in U} y_k$ as our target of inference, since it is a more natural "unit-level" parameter of interest in much survey data analysis and includes, for example, a proportion as a special case.

We next contrast two broad paradigms under which inference based on survey data may be performed: the design-based and model-based approaches. These refer to two different sources of randomness, either from the randomization associated with probability sampling or from a model assumed to generate the population values $y_k$. Inference based on models is likely to be familiar to most readers and so we leave it till second. First, we discuss design-based inference, sometimes also called randomization-based analysis. Lohr (2010) is a popular text that is primarily concerned with design-based inference. The latter is more distinctive to survey sampling, though inference based on randomization is sometimes used in randomized experiments (Cox, 2006, Chapter 9).

2.1 Design-Based and Model-Based Approaches to Inference

2.1.1 Design-based. The population values $y_1, \ldots, y_N$ are viewed as fixed constants, with the collection of units selected, $S$, treated as random. Assuming $N$ known, a standard weighted estimator of the population mean is

(1) $\bar{y}_{HT} = \frac{\sum_{k \in S} d_k y_k}{N}$.

This will be referred to as the HT estimator, since its numerator was proposed by Horvitz and Thompson (1952) for estimating the population total $\sum_{k \in U} y_k$. The primary motivation for such design weighting is to remove bias, as discussed in detail in Haziza and Beaumont (2017). Bias and other moments are evaluated in the design-based framework with respect to repeated sampling of units from the finite population $U$. We write $E_S[\bar{y}_{HT}]$, with the subscript $S$ on the expectation emphasizing that we are averaging over possible subsets that could have been selected. Similarly, the variance of the estimator will be written as $\mathrm{var}_S(\bar{y}_{HT})$. We informally define two particular criteria: an estimator is design unbiased if its expectation (over all possible samples) is equal to the true value, and an estimator is design consistent if both the design bias and the variance go to zero as the sample size increases. For the latter, one must consider a sequence of populations, with the finite population size and the sample size tending to infinity.

An alternative estimator of the mean, defined whether $N$ is known or not, is the Hájek estimator (Hájek, 1971):

(2) $\bar{y}_{HJ} = \frac{\sum_{k \in S} d_k y_k}{\hat{N}}$,

where $\hat{N} = \sum_{k \in S} d_k$, vindicating $d_k$'s interpretation earlier as the number of population units represented by the $k$th sampled unit. The estimator (2) is biased

in finite samples, but is design consistent. The Hájek estimator is often preferred to the HT estimator even if $N$ is known, and we give some rationale for this later when we consider models. This estimator illustrates another surprising aspect of traditional survey sampling, its preoccupation with the estimation of ratios. But many functions of interest (the mean, for example!) can be expressed as a ratio.

The key to deriving the properties of design-based estimators is to define binary indicators of selection $I_k$, such that $E_S[I_k] = \pi_k$, $\mathrm{var}_S(I_k) = \pi_k(1 - \pi_k)$, $E_S[I_k I_l] = \pi_{kl}$ and $\mathrm{cov}_S(I_k, I_l) = \pi_{kl} - \pi_k \pi_l = \Delta_{kl}$, for $k, l \in U$. Here, the $\pi_{kl}$ are the probabilities that units $k$ and $l$ are both selected; these are the key quantities required to determine the variances of estimators, as we demonstrate below. It is usual for designs to sample without replacement and for $\pi_{kl} \neq \pi_k \pi_l$ (this is in stark contrast to the model-based approach in which values are usually assumed to be independent, since they are drawn from a hypothetical infinite population). Desirable criteria from a design-based perspective are design unbiased (or design consistent) estimators with low variance.

It is straightforward to show that the design weighting in the HT estimator (1) does indeed remove bias:

$E_S[\bar{y}_{HT}] = E_S\left[\frac{1}{N} \sum_{k \in S} d_k y_k\right] = E_S\left[\frac{1}{N} \sum_{k \in U} d_k I_k y_k\right] = \frac{1}{N} \sum_{k \in U} \pi_k^{-1} E_S[I_k] y_k = \bar{y}_U.$

The trick in the above derivation is to introduce the binary random variables $I_k$, and consequently sum over $U$; before that point the sum is over units in the random set $S$.

The unbiasedness arises because of the inverse probability weighting, a technique that is now in common use beyond survey sampling, particularly to adjust for nonresponse (Seaman and White, 2013). A key point is that we require $\pi_k > 0$ for the estimator to be design unbiased. This makes complete sense, because we cannot hope to achieve an unbiased estimator of a finite population characteristic if some of the units can never be sampled.

The form of the variance of the HT estimator also follows straightforwardly:

(3) $\mathrm{var}_S(\bar{y}_{HT}) = \frac{1}{N^2} \mathrm{var}_S\left(\sum_{k \in S} d_k y_k\right) = \frac{1}{N^2} \mathrm{var}_S\left(\sum_{k \in U} d_k I_k y_k\right) = \frac{1}{N^2} \sum_{k \in U} \sum_{l \in U} d_k d_l \,\mathrm{cov}_S(I_k, I_l)\, y_k y_l = \frac{1}{N^2} \sum_{k \in U} \sum_{l \in U} \Delta_{kl} \frac{y_k y_l}{\pi_k \pi_l}.$

An unbiased estimator of the variance is

(4) $\widehat{\mathrm{var}}_S(\bar{y}_{HT}) = \frac{1}{N^2} \sum_{k \in S} \sum_{l \in S} \frac{\Delta_{kl}}{\pi_{kl}} \frac{y_k y_l}{\pi_k \pi_l}.$

Despite their ease of derivation, the forms of the variances in (3) and (4) can be quite mysterious to those raised in the model-based camp, since they do not appear to depend on the variances of the responses, $y_k$ (but see the end of this subsection for the emergence of a familiar form). A pivotal requirement in the derivation of (4) is that $\pi_{kl} > 0$, that is, all pairs of units must have a positive probability of being selected. Although unbiased, the estimator in (4), as well as the closely related Sen–Yates–Grundy estimator given in Tillé and Wilhelm (2017), has some undesirable properties. For example, they can be negative. More importantly, in practice, the $\pi_{kl}$ are often not available for all $k, l \in S$. This is usually the case for multistage, clustered designs, for example. In order to perform design-based inference, it is usual therefore to adopt alternative variance estimators. Approximations obtained by treating the design as "with replacement" are widely used, since the variance estimator is always nonnegative and it is not necessary to know $\pi_{kl}$ for all $k, l \in S$ (see, e.g., Lohr, 2010). The use of resampling methods, such as the jackknife or bootstrap, is also common. Section 6 provides a fuller discussion.

To illustrate some of the expressions above, consider SRS, for which

$p(s) = \binom{N}{n}^{-1}$ if $s$ has $n$ elements, and $p(s) = 0$ otherwise, with

$\pi_k = \frac{n}{N}, \quad d_k = \frac{N}{n}, \quad \pi_{kl} = \frac{n}{N} \cdot \frac{n-1}{N-1}.$

We find $\hat{N} = \sum_{k \in S} d_k = N$, so that $\bar{y}_{HT} = \bar{y}_{HJ} = \sum_{k \in S} y_k / n$, and the variance is $\mathrm{var}_S(\bar{y}_{HT}) = (1 - \frac{n}{N}) \frac{S_y^2}{n}$, where $S_y^2 = \frac{1}{N-1} \sum_{k \in U} (y_k - \bar{y}_U)^2$ and $1 - \frac{n}{N}$ is the finite population correction (if sampling from a hypothetical infinite population, this term would be 1). A design unbiased estimator of $S_y^2$ is

(5) $s_y^2 = \frac{1}{n-1} \sum_{k \in S} (y_k - \bar{y}_{HT})^2,$

so that we recover a familiar form for the variance of the estimator, albeit with a finite population correction.

2.1.2 Model-based. Under this, more mainstream, statistical approach the $y_k$ are treated as realized values of random variables $Y_k$, $k = 1, \ldots, N$, which follow some specified model, viewing the population as drawn from a hypothetical infinite superpopulation. Frequentist evaluation refers now to repeated realizations from the model.

In conventional statistical modeling methods of data analysis, model parameters are typically of interest, rather than finite population characteristics, such as the finite population mean discussed above. Here, however, we consider how inference for the latter can be carried out by reference to the modeling framework. We now write the target of inference as $\bar{Y}_U = \frac{1}{N} \sum_{k \in U} Y_k$ to convey that it is random, but we emphasize that it represents the same target of inference as $\bar{y}_U$ in the design-based framework. Since the target is a random variable, the standard frequentist estimation rules do not apply and we refer to a predictor, rather than an estimator, of $\bar{Y}_U$. The classic reference is Valliant, Dorfman and Royall (2000). Chambers and Clark (2012) provide an introduction. There are two main criteria considered to evaluate a predictor, denoted $\hat{Y}$. First is the bias, $E_M[\hat{Y} - \bar{Y}_U]$, where now both $\hat{Y}$ and $\bar{Y}_U$ are random. Second, the variance of the predictor with respect to the model is $E_M[(\hat{Y} - \bar{Y}_U)^2]$. These criteria are the same as those used when random effects are predicted in a frequentist framework. The model-based approach can also be formulated in a Bayesian framework with inference about $\bar{Y}_U$ based on its posterior distribution given the data. We do not have the space to discuss this here, but the interested reader can consult, for example, Gelman (2007) and Little (2013).

As a simple illustration of the prediction approach, consider a model for which:

(6) $\mu = E_M(Y_k), \quad \mathrm{var}_M(Y_k) = \sigma^2,$

with $Y_k$ and $Y_l$ independent. The predictor $\hat{Y}$ is taken as the sample mean $\bar{Y}_n = \frac{1}{n} \sum_{k=1}^n Y_k$. Our change of notation to $k = 1, \ldots, n$ acknowledges that the set of units $S$ selected from $N$ is no longer relevant. The sample mean is an unbiased predictor since

$E_M[\bar{Y}_n - \bar{Y}_U] = \frac{1}{n} \sum_{k=1}^n E_M[Y_k] - \frac{1}{N} \sum_{k=1}^N E_M[Y_k] = 0.$

The prediction variance is

$E_M\left[(\bar{Y}_n - \bar{Y}_U)^2\right] = E_M\left[\left(\frac{1}{n} \sum_{k=1}^n Y_k - \frac{1}{N} \sum_{k=1}^N Y_k\right)^2\right] = \left(1 - \frac{n}{N}\right) \frac{\sigma^2}{n}.$

Substitution of $\sigma^2$ by the finite population variance, $S_y^2$, gives the same variance as obtained earlier for design-based inference under SRS.

It may appear that the complex sampling scheme plays no role in the model-based approach. There are, in fact, two fundamental ways in which it does. First, a complex sampling scheme will depend on the structure of the population through, for example, stratification or clustering. It is essential that this structure is captured in the model, for example using fixed effects for strata and random effects for clusters, if model-based inference is to be valid.

Second, in conventional model-based inference as above, it is assumed that the model specified at the population level also applies to all sample observations, however they are sampled. This makes a strong implicit modelling assumption. Prediction under the model-based approach conditions on the selection indicators $I_k$. The assumption that the population model applies to the sample is therefore equivalent to assuming that the conditional distribution of $Y_k$ given $I_k = 1$ is the same as its conditional distribution given $I_k = 0$. Sampling is then said to be noninformative. If this assumption does not hold, that is, if $E_M[Y_k \mid I_k = 1] \neq E_M[Y_k \mid I_k = 0]$, where the subscript $M$ indicates that the expectations are under a model, there is the potential for bias to arise, so-called selection bias. A key advantage of probability sampling is that it may be used to ensure the independence of $I_k$ and $Y_k$, and hence to protect against selection bias. It is a risky endeavor to carry out inference from a nonprobability sample without such protection.

2.2 Switching the Paradigms

The two approaches may be compared and contrasted by examining design-based estimators using model-based criteria and vice-versa. Consider first the model bias of the HT and Hájek estimators under a model, as in (6), where $\mu = E_M[Y_k]$ and the $y_k$ in (1) and (2) are replaced by $Y_k$. Under this model,

$E_M[\bar{y}_{HT} - \bar{Y}_U] = \left(\sum_{k=1}^n d_k - N\right) \frac{\mu}{N}, \qquad E_M[\bar{y}_{HJ} - \bar{Y}_U] = 0,$

so that the Hájek estimator is always unbiased under corresponds to y HT , the HT estimator. Model (7) is
this model but the HT estimator is only unbiased under not required to be correct for the properties of the
 = n dk = N . This condition often
the condition N HT estimator to be valid, but it does suggest situa-
k=1
does hold, for example, under SRS when dk = N/n, tions in which we would expect the estimator to per-
but there may be problems in the performance of the form well (or not). This could be viewed as a model-
HT estimator for sampling designs where it does not. assisted approach, which we discuss in more detail
An example is a design in which the sample size n is in Section 3. The widely cited Basu elephant exam-
random and the weights are constant (i.e., dk = d). ple (Basu, 1971, Hájek, 1971) provides an extreme ex-
We switch now to consider the design bias of the ample in which the HT estimator performs poorly in a
 situation in which the responses yk are not related to
model-unbiased estimator Y n , where Yk is replaced by
yk . We have the sampling probabilities πk . Briefly, a fictional cir-
 
cus owner would like to estimate the weight of his 50
 1 N
1 N strong herd of elephants, based on measuring a sin-
ES [Y n ] = ES Ik yk = πk yk gle elephant. Drawing on 3-year old records, he pro-
n k=1 n k=1
poses to measure the weight of Sambo, an elephant
and so this estimator will generally be design biased. who was previously of average weight, and multiply
Haziza and Beaumont (2017), Section 3, show that the this weight by 50. This purposive design traumatizes
design bias will only disappear if the πk are uncor- the circus statistician, who is obsessed with using a
related with the yk , which corresponds to the notion design-unbiased estimator. To this end, he proposes an
of noninformative sampling discussed in Section 2.1.2. alternative plan in which Sambo is selected with prob-
This illustrates the (design) bias impact of model mis- ability πk = 10099
, and one of the remaining 49 crea-
 tures with probability πk = 100 1
× 49
1
. Letting y de-
specification. The unweighted estimator Y n is unbiased
for Y U under the model μ = EM [Yk ] (and the assump- note the weight of the selected elephant, the HT esti-
tion of noninformative sampling) but does not protect mator is dy where d = 100 99 if Sambo is selected, and
against bias if these assumptions fail, unlike the design d = 100 × 49 if any of the other elephants is selected.
weighted (HT and Hájek) estimators. Clearly, whichever elephant is selected, this estimator
So far as the variance is concerned, if we assume is unsatisfactory. Putting aside the wisdom of an n = 1
that varM (Yk ) = σk2 and that Yk and Yl are independent design, is it clear here that, by construction, yk is not
under the model, then the variance under the model of, proportional to πk . For further discussion, see Brewer
for example, the Hájek estimator may be expressed as (2002) and Lumley (2010), page 149.
From a modeling perspective, the Hájek estima-
  
n
  tor (2) arises from the model with EM [Yk ] = θ and
EM (y HJ − Y U )2 = −1 dk − N −1 2 σ 2
N k varM (Yk ) = πk2 σ 2 , k = 1, . . . , n, and so one would ex-
k=1
pect it to outperform the HT estimator when the re-

N sponse is approximately constant (as opposed to be-
+ N −2 σk2 ing proportional to the sampling probabilities), which
k=n+1 argues for its use in many instances, regardless of
(the same expression holds for the HT estimator with whether N is known,
N replaced by N ). This expression will almost cer- To summarize, it is informative to view model-based
tainly not equal (3). estimators from a design-based perspective and vice
Studying the model variance of design-based estima- versa, since it gives insight into situations in which the
tors may help in assessing their efficiency. Consider, respective estimators will perform well. Conclusions
for example, the model we have drawn here include that the HT estimator may
be design-unbiased, but biased with respect to particu-
(7) Yk = θ πk + εk , lar models, and the model variance may not correspond
to the design variance.
where the error terms εk are independent with
EM [εk ] = 0 and varM (εk ) = πk2 σ 2 , k = 1, . . . , n. The
3. AUXILIARY VARIABLES AND
weighted least squares estimator  θ , that minimizes
MODEL-ASSISTED ESTIMATION

n
(yk − θ πk )2
, In most survey settings, auxiliary information about
k=1 πk2 σ 2 the population units will be available to assist both de-
COMPLEX SURVEYS 171

sign and inference. From a predictive model-based per- design consistent for y U and x U , respectively. The ra-
spective, it is very natural and commonplace to include tionale for using the regression estimator rather than
auxiliary variables as covariates in regression models. the simple estimator y S is that it improves precision.
This enables more efficient predictions to be achieved. It does this because, under (8) with B1 = 0, the er-
Conditioning on auxiliary variables used in the design ror y S − y U is correlated with x S − x U . Under SRS,
also helps to ensure that sampling is noninformative, the large-sample variance of the regression estimator is
that is, that Ik is independent of Yk conditional on these given by
covariates; hence it helps avoid the kinds of selection  
bias mentioned in Section 2.1.2. n Se2
1− ,
From a design-based perspective, it is also possible N n
to include auxiliary variables to improve inference, in 
particular, inference may be “assisted” by considera- where Se2 = N−1 1
k∈U ek with ek = yk − (B0 + B1 xk )
2

tion of suitable covariates in a regression model for Yk . denoting the residual. By comparison with the expres-
The definitive text on model-assisted inference in sur- sion before (5) we see that the use of auxiliary infor-
veys is Särndal, Swensson and Wretman (1992). For mation has reduced the variance by a factor of approx-
simplicity, suppose we know the mean x U for a sin- imately R 2 , the squared correlation between the yk and
gle variable $x_k$ that we believe is associated with $y_k$. Consider the "working model",

(8) $Y_k = B_0 + B_1 x_k + \varepsilon_k$,

where the error terms $\varepsilon_k$ are independent with $E_M[\varepsilon_k] = 0$, $\mathrm{var}_M(\varepsilon_k) = \sigma^2$, $k = 1, \ldots, N$, and the intercept and slope are defined with respect to the finite population:

(9) $B_1 = \dfrac{\sum_{k=1}^{N} (x_k - \bar{x}_U)(y_k - \bar{y}_U)}{\sum_{k=1}^{N} (x_k - \bar{x}_U)^2} = \dfrac{\mathrm{cov}(x, y)}{\mathrm{var}(x)} = \dfrac{R S_y}{S_x}$,

(10) $B_0 = \bar{y}_U - B_1 \bar{x}_U$,

where $R$, $S_x$ and $S_y$ are the correlation, standard deviation of $x$ and standard deviation of $y$, in the population. This model motivates $\hat{B}_0 + \hat{B}_1 \bar{x}_U$ as an estimator of $\bar{y}_U$, where $\hat{B}_0$ and $\hat{B}_1$ are design-based estimators of $B_0$ and $B_1$. For simplicity, consider SRS and the estimators:

$\hat{B}_1 = \dfrac{\sum_{k \in S} (x_k - \bar{x}_S)(y_k - \bar{y}_S)}{\sum_{k \in S} (x_k - \bar{x}_S)^2} = \dfrac{r s_y}{s_x}, \qquad \hat{B}_0 = \bar{y}_S - \hat{B}_1 \bar{x}_S,$

where $r$ is the sample correlation, and $\bar{y}_S$, $\bar{x}_S$ and $s_x$, $s_y$ are, respectively, the means and standard deviations of $x$ and $y$ in the sample. Since ratios are involved, these estimators are not design unbiased for $B_0$ and $B_1$, but they are design consistent. The resulting estimator of $\bar{y}_U$ is

$\hat{B}_0 + \hat{B}_1 \bar{x}_U = \bar{y}_S + \hat{B}_1 (\bar{x}_U - \bar{x}_S).$

This is the traditional regression estimator and is design consistent for $\bar{y}_U$ under SRS since $\bar{y}_S$ and $\bar{x}_S$ are design consistent for the population means of the $y_k$ and $x_k$.

For general complex designs, we may use design weights in the estimators $\hat{B}_0$ and $\hat{B}_1$ and obtain, as discussed by Breidt and Opsomer (2017), the generalized regression (GREG) model-assisted estimator of $\bar{y}_U$ as

$\hat{\bar{y}}_{\mathrm{GREG}} = \hat{B}_0 + \hat{B}_1 \bar{x}_U = \bar{y}_{HT} + \hat{B}_1 (\bar{x}_U - \bar{x}_{HT}),$

so that the HT estimator of the mean of the $y_k$ is adjusted via the difference between the population mean of the $x_k$ and its sample estimator. The GREG estimator is design consistent with finite sample design bias, but for large samples its precision will be greater than that of the HT estimator. The estimators $\hat{B}_0$ and $\hat{B}_1$ can also include weighting for heteroskedasticity in model (8), giving, for example, when $B_0$ is taken as zero, the widely used ratio estimator as a special case (Breidt and Opsomer, 2017).

The overall model-assisted approach has a similar flavor to robust estimation using sandwich variance estimators, where a working model is specified but the consistency of the estimator is guaranteed under very weak assumptions; in particular, consistency does not depend on strong modeling assumptions. Breidt and Opsomer (2017) provide a much fuller account of model-assisted inference, including a wide range of extensions of the GREG approach.

We note that GREG and related estimators can be represented as weighted estimators, where the weights extend the simple idea of design weights introduced earlier by incorporating auxiliary population information. Various adjustments can be made and the construction of weights can be complex; the relevant issues are discussed in this issue by Haziza and Beaumont (2017) and Chen et al. (2017).
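As an illustration of the mechanics, the following is a minimal Python sketch of the traditional regression estimator under SRS. The population size, sample size and linear relationship are all hypothetical choices made for the example, not quantities from the text.

```python
import random

random.seed(42)

# Hypothetical finite population (illustrative values only): the survey
# variable y is roughly linear in a fully observed auxiliary variable x.
N, n = 10_000, 100
x = [random.gauss(50, 10) for _ in range(N)]
y = [2.0 + 0.5 * xi + random.gauss(0, 2) for xi in x]
x_bar_U = sum(x) / N           # population mean of x, assumed known
y_bar_U = sum(y) / N           # estimation target (unknown in practice)

# Simple random sample without replacement.
S = random.sample(range(N), n)
xs = [x[k] for k in S]
ys = [y[k] for k in S]
x_bar_S = sum(xs) / n
y_bar_S = sum(ys) / n

# Design-based slope estimate: sample covariance over sample variance of x.
num = sum((xk - x_bar_S) * (yk - y_bar_S) for xk, yk in zip(xs, ys))
den = sum((xk - x_bar_S) ** 2 for xk in xs)
b1_hat = num / den

# Traditional regression estimator: the sample mean of y, adjusted by the
# gap between the known population mean of x and its sample mean.
y_reg = y_bar_S + b1_hat * (x_bar_U - x_bar_S)
```

Because $x$ explains most of the variation in $y$ in this synthetic population, the adjustment term typically pulls $\bar{y}_S$ toward $\bar{y}_U$; with a weakly correlated auxiliary variable the gain over the unadjusted sample mean disappears.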
172 C. SKINNER AND J. WAKEFIELD
4. MODEL PARAMETERS

The previous two sections have focussed on the estimation of the finite population mean. Extensions to other finite population targets of inference can often be achieved by treating them as explicit functions of means and by estimating these targets by plugging the estimators of the means into the function; see, for example, Breidt and Opsomer (2017), Section 1. In this section, we focus instead on inference about model parameters, which raises additional issues. For a more detailed discussion, see Lumley and Scott (2017).

From a model-based perspective, if the model for the survey variables incorporates both the parameters of interest, $\theta$ say, and the auxiliary variables used in the design, as required to ensure the sampling scheme is noninformative, then it may be possible to treat the sampling scheme as ignorable for inference about $\theta$ and to employ standard unweighted approaches to inference. In this case, not only the sampling design but also the finite population may effectively play no role, and the only model requiring consideration is that assumed to generate the sample data. However, a problem with conditioning on the design variables is that it may move the model away from that which is of interest. See, for example, the discussion of Lumley (2010), page 105, in relation to regression. Skinner, Holt and Smith (1989) refer to the distinction between conditioning or not on the design variables as "disaggregated" versus "aggregated" analyses, and note that the two approaches may serve quite different analytic purposes.

From a design-based perspective, one may begin with the even more fundamental question of how to define the parameter of interest. A common approach is again to specify a (superpopulation) model of interest, including a parameter $\theta$ say, but where this is only treated as a "working model" for motivation and where the model will not be used for inference. With this purpose in mind, a "census parameter" $\theta_U$ may be defined, which is some estimator of $\theta$ were the whole finite population to be observed by conducting a hypothetical census. For example, suppose we are interested in modelling unemployment and that a parameter $\theta$ of this model represents the probability that a person in the labour force in a particular population of people $U$ is unemployed at a particular point of time. Then we might define $\theta_U$ as the actual proportion of the labour force in this population who are unemployed at this time. Under suitable modelling assumptions, it may be expected that $\theta_U$ will be a close approximation to $\theta$ if the population size is large. Taking $\theta_U$ as the parameter of interest (rather than $\theta$) is attractive since it is a finite population quantity and so one may make design-based inference about it directly (as we did in Section 3 in the context of linear regression). This kind of approach is discussed in Lumley and Scott (2017), particularly in the context of pseudolikelihood estimation (Binder, 1983), which provides the basis of most established statistical packages for survey analysis.

The definition of a census parameter in terms of a specific estimation approach is somewhat arbitrary, however (Skinner, 2003), and it is often still preferable to take a model parameter $\theta$ as the target. In this case, it may still be reasonable to take as a point estimator of $\theta$ the same estimator that would be used for $\theta_U$, but it will be necessary to modify variance estimation and related inference procedures by combining design-based and model-based inference.

For example, suppose we wish to make inference about $\mu = E_M[Y]$, the mean in the (infinite) superpopulation from which the population of size $N$ was sampled. We may treat the finite population mean $\bar{y}_U$ as the census parameter and, starting from a design-based perspective, take the HT estimator $\bar{y}_{HT}$ as a point estimator of both $\bar{y}_U$ and $\mu$. Not only is it design unbiased for $\bar{y}_U$ but it is also unbiased for $\mu$ with respect to a joint design/model-based framework. Thus,

$E[\bar{y}_{HT}] = E_M[E_S[\bar{y}_{HT}]] = E_M[\bar{Y}_U] = \mu$

(where we have replaced $\bar{y}_U$ by $\bar{Y}_U$ to emphasize that it is being treated as random). Turning to the variance, we need to consider the uncertainty due not only to the selection of a sample of size $n$ but also to the selection of the population of size $N$ from the superpopulation. For simplicity, suppose the design is SRS. We obtain

$\mathrm{var}(\bar{y}_{HT}) = E_M[\mathrm{var}_S(\bar{y}_{HT})] + \mathrm{var}_M(E_S[\bar{y}_{HT}]) = E_M\left[\left(1 - \dfrac{n}{N}\right)\dfrac{\sigma^2}{n}\right] + \mathrm{var}_M(\bar{Y}_U) = \left(1 - \dfrac{n}{N}\right)\dfrac{\sigma^2}{n} + \dfrac{\sigma^2}{N} = \dfrac{\sigma^2}{n}.$

In this case, the variance is just as if a random sample was drawn from the superpopulation. Moreover, if $N$ is much larger than $n$, then the second term may be negligible and it may be argued that design-based inference suffices in practice. For discussion of inference about model parameters in more general settings, see Graubard and Korn (2002).
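The two-stage variance decomposition above can be checked numerically. The following Python sketch simulates both stages of randomness — generating a finite population from the superpopulation (model stage), then drawing an SRS from it (design stage) — and compares the empirical variance of the sample mean with $\sigma^2/n$. All numerical settings are arbitrary illustrative choices.

```python
import random
import statistics

random.seed(1)

# Hypothetical superpopulation parameters and sizes (illustrative only).
mu, sigma = 10.0, 3.0      # superpopulation mean and standard deviation
N, n = 200, 25             # finite population size and SRS sample size
reps = 10_000              # Monte Carlo replicates over BOTH stages

means = []
for _ in range(reps):
    pop = [random.gauss(mu, sigma) for _ in range(N)]  # model stage
    sample = random.sample(pop, n)                     # design stage: SRS
    means.append(sum(sample) / n)

# Empirical variance over the joint model/design randomness, versus theory.
emp_var = statistics.variance(means)
theory = sigma ** 2 / n    # = (1 - n/N) * sigma**2 / n + sigma**2 / N
```

The finite population correction lost in the design stage is exactly recovered by the between-population term $\sigma^2/N$, so the empirical variance should match $\sigma^2/n$ up to Monte Carlo error.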
5. NONPROBABILITY SAMPLING AND NONRESPONSE

So far, we have assumed probability sampling and no nonresponse. In practice, nonresponse arises in most surveys of human populations and response rates have seen a relentless decline in many countries in recent decades. It thus becomes essential that inferential methods allow for missing data from nonresponse. Alongside the decline in response rates have been significant changes in survey practice, such as greater use of nonprobability sampling (Elliott and Valliant, 2017) associated, particularly, with web surveys; see Schonlau and Couper (2017).

Nonresponse can be with respect to items and/or the whole unit, and we consider only the latter. Nonresponse and nonprobability sampling share a common challenge for inference. They both involve forms of sample selection which are not fully under the control of the survey designer and, to proceed, both require modeling assumptions. Elliott and Valliant (2017) provide an in-depth discussion of two broad approaches in this context, and we here briefly introduce these ideas.

One approach is quasi-randomisation, which seeks to represent the sample as if it had been obtained from probability sampling. In the case of nonresponse, this often involves treating the nonresponse as a second phase of sampling, as discussed by Haziza and Beaumont (2017). Under the resulting quasi-random representation of the sample selection, design-based methods may be employed, for example, in the construction of survey weights. In the case of nonprobability sampling, Elliott and Valliant (2017) give particular attention to a case where an additional probability sample from the population is available for use in determining pseudo-weights for the nonprobability sample.

The second broad method is referred to as a superpopulation model approach by Elliott and Valliant (2017) and involves the model-based prediction approach outlined in Section 2.1.2. This depends critically on the auxiliary information available. The aim is to find auxiliary information so that, once conditioned upon, sample selection is noninformative. Moreover, the auxiliary variables are used to improve precision via regression-type models. A key concern with both of these approaches is the potential for bias as a result of departures from the modeling assumptions employed.

6. VARIANCE ESTIMATION

Interval estimation based on complex survey data is typically conducted by appealing to asymptotic normality. Thus, a conventional $100(1 - \alpha)\%$ confidence interval for $\bar{y}_U$ based on the HT estimator $\bar{y}_{HT}$ takes the form $(\bar{y}_{HT} - z_{\alpha/2}\sqrt{\widehat{\mathrm{var}}(\bar{y}_{HT})},\ \bar{y}_{HT} + z_{\alpha/2}\sqrt{\widehat{\mathrm{var}}(\bar{y}_{HT})})$, where $z_{\alpha/2}$ is the $1 - \alpha/2$ quantile of the standard normal distribution and $\widehat{\mathrm{var}}(\bar{y}_{HT})$ is a consistent estimator of the variance of $\bar{y}_{HT}$. The basic idea is that the sampling distribution of the point estimator can be approximated by a normal distribution for large samples. The asymptotic theory used to justify the validity of such an interval needs to take account of the complexity of the design and is discussed by Breidt and Opsomer (2017). Given such an approach to interval estimation, the key tasks are to identify a suitable point estimator for a specified parameter of interest and a suitable estimator of the variance of this point estimator.

We have already seen that design consistent point estimation is often available using survey weighting for a range of designs. Given the complexity of weight construction, as discussed, for example, in Haziza and Beaumont (2017), it is common to separate this task, as a single exercise often undertaken by the agency conducting the survey, from the task of incorporating weights in estimation, undertaken by a wide range of analysts. There may be further reasons for such separation of tasks. For example, confidentiality considerations may impose restrictions on what information can be supplied to the analyst.

Given that variance estimation is arguably an even more complex challenge than weighting, there can be a similar rationale for task separation: focussing first on adding design information to the data file, which can be used at a second stage by analysts to estimate variances for estimators of multiple targets. In principle, one could imagine adding to the data file joint probabilities of selection $\pi_{kl}$ for all pairs of sample units so that the variance estimator in (4) could be computed. This is rarely done, however, in particular because all $\pi_{kl}$ may simply be unavailable, as noted in Section 2.1.1. Indeed, the $\pi_{kl}$ may not even be computable for many commonly used methods for selecting clusters in multi-stage sampling. Likewise, for model-based inference, the full design information relating to sample selection will often also be unavailable, perhaps for confidentiality reasons. Instead, there are certain "standard" kinds of information made available in survey data files to enable variance estimation to be conducted.

One approach is to approximate the actual design by a similar one that samples with replacement, rather than without. This is a common approach with stratified multistage designs in which the PSUs are selected with unequal probabilities within strata. For this
design, just the strata identifiers, PSU identifiers and weights provide sufficient design information for constructing consistent variance estimators. Valliant, Dever and Kreuter (2013), Chapter 15, describe how this may be done for linear estimators such as the HT estimator. For more complex nonlinear estimators, the method of linearization (more commonly referred to as the delta method in mainstream statistics) may be employed. This approach depends on the nature of the estimator, but is implemented in most statistical survey software. Other approximations, free of joint selection probabilities, and usable for design-based variance estimation, are reviewed by Berger and Tillé (2009).

Another broad approach is replication variance estimation. The bootstrap and jackknife methods are perhaps the most well-known examples; each is a standard technique in statistics; see Shao and Tu (2012) for a book-length treatment. Suppose that survey weights, denoted $w_k$, $k \in S$, are used to estimate a parameter, $\bar{y}_U$ say, via a weighted estimator $\hat{y} = \sum_{k \in S} w_k y_k$. These weights may be simple design weights, or include post-stratification, nonresponse, etc., adjustments. Sets of $L$ replicate weights $w_k^{(l)}$, $k \in S$, for $l = 1, \ldots, L$, are constructed and the variance estimator takes the form

$\sum_{l=1}^{L} c_l \left(\hat{y}^{(l)} - \hat{y}\right)^2,$

where $\hat{y}^{(l)} = \sum_{k \in S} w_k^{(l)} y_k$. For suitable constants $c_l$, depending on the replication method, such replication weights can be constructed for a range of designs to achieve consistent variance estimation. The data file, released by the survey agency, now contains an additional $L$ fields corresponding to the replicate weights, alongside the basic weights $w_k$.

For both the bootstrap and the jackknife, the replication weights $w_k^{(l)}$, $k \in S$, contain zeros, either from systematic deletion (the jackknife) or as a result of random subsampling (the bootstrap). The implementation of each of these techniques requires care since one must acknowledge the complex design. For example, under multistage sampling one may remove a complete PSU, which preserves the dependence structure of responses in the same cluster, and the weights are adjusted so as to preserve the sum of the weights. A further replication technique is balanced repeated replication (BRR). Under BRR, symmetries within the design are exploited to produce variance estimates from partially independent splits of the data. The bootstrap, jackknife and BRR techniques are discussed more fully in Rust and Rao (1996).

Another approach to variance estimation in complex designs is to adopt a model-based approach and to accommodate the population complexity in the model. For example, the induced dependence between units in a clustered population may be acknowledged in a model-based approach using mixed models, an approach that was championed by Scott and Smith (1969). In general, sandwich estimation is often utilized when adopting a model-based approach (Pfeffermann et al., 1998; Rabe-Hesketh and Skrondal, 2006). The underlying idea behind sandwich estimation is the empirical construction of variances under a (usually) simple working model. Sandwich estimation produces consistent standard error estimates under reduced assumptions when compared with a fully model-based approach, and is robust to misspecification of the assumed variance model.

7. CONCLUDING REMARKS

In this short paper, we have given an overview of the design- and model-based approaches to inference for complex survey data. There are many important and emerging topics that we have not touched upon. For example, combining different sources of data is being increasingly carried out, and Lohr and Raghunathan (2017) review this endeavor.

A particular reason for the growing interest in combining data sources is the increased availability of "big data" sources, as noted in Section 1. There is also huge current interest in a landslide of associated new data analysis techniques, which often have their origins in machine learning. However, while many of these methods appear intoxicating, they must be carefully assessed to see whether they will provide valid inferences in the face of multiple sources of noncoverage and selection. If such aspects are ignored, there is a genuine possibility that big data analyses will produce really big inferential train wrecks.

ACKNOWLEDGEMENTS

Jon Wakefield was supported by award R01CA095994 from the National Institutes of Health.

REFERENCES

Basu, D. (1971). An essay on the logical foundations of survey sampling, part I. In Foundations of Statistical Inference (Proc. Sympos., Univ. Waterloo, Waterloo, Ont., 1970) 203–242. Holt, Rinehart and Winston, Toronto. MR0423625
Berger, Y. G. and Tillé, Y. (2009). Sampling with unequal probabilities. In Handbook of Statistics, Vol. 29A, Sample Surveys: Design, Methods and Applications (D. Pfeffermann and C. R. Rao, eds.) 39–54. North-Holland, Amsterdam. MR2654632

Binder, D. A. (1983). On the variances of asymptotically normal estimators from complex surveys. Int. Stat. Rev. 51 279–292. MR0731144

Breidt, J. and Opsomer, J. (2017). Model-assisted survey estimation with modern prediction techniques. Statist. Sci. 32 190–205.

Brewer, K. (2002). Combined Survey Sampling Inference: Weighing Basu's Elephants. Arnold, London.

Chambers, R. L. and Clark, R. G. (2012). An Introduction to Model-Based Survey Sampling with Applications. Oxford Univ. Press, Oxford. MR3186498

Chen, Q., Elliott, M. R., Haziza, D., Yang, Y., Ghosh, M., Little, R., Sedransk, J. and Thompson, M. (2017). Approaches to improving survey-weighted estimates. Statist. Sci. 32 227–248.

Cox, D. R. (2006). Principles of Statistical Inference. Cambridge Univ. Press, Cambridge. MR2278763

Elliott, M. and Valliant, R. (2017). Inference for nonprobability samples. Statist. Sci. 32 249–264.

Gelman, A. (2007). Struggles with survey weighting and regression modeling. Statist. Sci. 22 153–164. MR2408951

Graubard, B. I. and Korn, E. L. (2002). Inference for superpopulation parameters using sample surveys. Statist. Sci. 17 73–96. MR1910075

Hájek, J. (1971). Discussion of "An essay on the logical foundations of survey sampling, part I," by D. Basu. In Foundations of Statistical Inference (Proc. Sympos., Univ. Waterloo, Waterloo, Ont., 1970). Holt, Rinehart and Winston, Toronto.

Haziza, D. and Beaumont, J.-F. (2017). Construction of weights in surveys: A review. Statist. Sci. 32 206–226.

Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. J. Amer. Statist. Assoc. 47 663–685. MR0053460

Japec, L., Kreuter, F., Berg, M., Biemer, P., Decker, P., Lampe, C., Lane, J., O'Neil, C. and Asher, A. (2015). Big data in survey research: AAPOR task force report. Public Opin. Q. 79 839–880.

Little, R. J. (2013). Calibrated Bayes, an alternative inferential paradigm for official statistics (with discussion). J. Off. Stat. 28 309–372.

Lohr, S. L. (2010). Sampling: Design and Analysis, 2nd ed. Brooks/Cole Cengage Learning, Boston, MA.

Lohr, S. and Raghunathan, T. (2017). Combining survey data with other data sources. Statist. Sci. 32 293–312.

Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. Wiley, Hoboken, NJ.

Lumley, T. and Scott, A. (2017). Fitting regression models to survey data. Statist. Sci. 32 265–278.

Pfeffermann, D., Skinner, C. J., Holmes, D. J., Goldstein, H. and Rasbash, J. (1998). Weighting for unequal selection probabilities in multilevel models. J. R. Stat. Soc. Ser. B 60 23–40. MR1625668

Rabe-Hesketh, S. and Skrondal, A. (2006). Multilevel modelling of complex survey data. J. Roy. Statist. Soc. Ser. A 169 805–827. MR2291345

Rust, K. F. and Rao, J. N. K. (1996). Variance estimation for complex surveys using replication techniques. Stat. Methods Med. Res. 5 283–310.

Särndal, C.-E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling. Springer, New York. MR1140409

Schonlau, M. and Couper, M. (2017). Options for conducting web surveys. Statist. Sci. 32 279–292.

Scott, A. and Smith, T. M. F. (1969). Estimation in multi-stage surveys. J. Amer. Statist. Assoc. 64 830–840.

Seaman, S. R. and White, I. R. (2013). Review of inverse probability weighting for dealing with missing data. Stat. Methods Med. Res. 22 278–295.

Shao, J. and Tu, D. (2012). The Jackknife and Bootstrap. Springer, New York. MR1351010

Skinner, C. J. (2003). Introduction to Part B. In Analysis of Survey Data (R. L. Chambers and C. J. Skinner, eds.) 75–84. Wiley, Chichester. MR1978845

Skinner, C. J., Holt, D. and Smith, T. M. F., eds. (1989). Analysis of Complex Surveys. Wiley, Chichester. MR1049386

Tillé, Y. and Wilhelm, M. (2017). Probability sampling designs: Principles for the choice of design and balancing. Statist. Sci. 32 176–189.

Valliant, R., Dever, J. A. and Kreuter, F. (2013). Practical Tools for Designing and Weighting Survey Samples. Springer, Berlin. MR3088726

Valliant, R., Dorfman, A. H. and Royall, R. M. (2000). Finite Population Sampling and Inference: A Prediction Approach. Wiley-Interscience, New York. MR1784794