
VARIOUS ELEMENTARY CONCEPTS OF SAMPLE SURVEYS

Hukum Chandra
ICAR-Indian Agricultural Statistics Research Institute, New Delhi
Email: hchandra@iasri.res.in

About you
What experience (if any) do you have of survey sampling?

Objectives
To introduce various sampling schemes

Statistical Preliminaries
Definition of
Survey
Census
Sample Survey
Sample Survey Theory
Target population
Survey population
Sampling frame
Notation
Finite population parameter

Complete Enumeration (Census)

One way of obtaining the required information is to collect data on each and every unit belonging to the population. This procedure is termed complete enumeration (census).

The effort, money and time required to carry out a complete enumeration will generally be extremely large.

However, if the information is required for each and every unit in the domain of study, a complete enumeration is clearly necessary.

An example of such a situation is the preparation of a voter list for election purposes.

But there are many situations where only summary figures are required for the domain of study as a whole or for groups of units.

Need for Sampling

An effective alternative to a complete enumeration is a sample survey, where only some of the units selected from the population are surveyed and inferences are drawn about the population on the basis of the sample.

In certain investigations, it may be essential to use specialized equipment or highly trained field staff for data collection, making it almost impossible to carry out such investigations on a complete enumeration basis.

If a sample survey is carried out according to certain specified statistical principles, it is possible not only to estimate the value of the characteristic for the population, but also to get a valid estimate of the sampling error of the estimate.

What is sampling?

Sampling proceeds in several stages:

1. Define scope and objectives of the study, including
   Population to be studied (identify the population of interest)
   General information to collect
2. Choose tools and techniques for making observations, e.g.
   Questionnaire
   Diary
   Physical measurements
3. Select (sample) some members of the population (units)
4. Study the sample (gather data on the sample)
5. Draw inferences about the population (analyze the data and make inferences)

Examples:
Sampling pasta from a pan
Sampling apples from a market stall

Population
The population consists of the complete set of all observations of interest
It is necessary to identify what does and what does not belong to the population
Examples:
All households in India in 2000
All women aged 15-49 in India in 2000
All businesses in Delhi in 2014 with more than 1000 employees
All 15-year-olds in India in 2011

Populations and samples


[Figure: Population and Sample]

Sampling: the process of obtaining a sample from the population is referred to as sampling

Definitions
Element : An element is a unit about which we require information. For example, a
field growing a particular crop is an element for collecting information on the yield of a
crop.
Population : Complete set of all observations of interest.
It is the totality of elements under consideration on which inference is required.
Thus, all fields growing a particular crop in a region constitute a population.

Sampling units
A group of elements constitutes a sampling unit
Elements belonging to different sampling units are non-overlapping
A sampling unit may have one or more than one element
Sampling units should be identifiable, and convenient and relatively inexpensive to observe
For example, it is convenient to select households for collecting data on milk produced by animals rather than contacting the elements (the animals) directly

Definitions
Sampling frame
An exhaustive list of all the sampling units constitutes a sampling frame.
An example of a sampling frame is a list of cultivator fields growing a particular crop, or of households keeping animals, in a region.

Sample: A subset of the population.
A part of the population selected from a sampling frame for the purpose of making inference about the population is called a sample.
For example, a subset of the cultivator fields may be selected to estimate the yield of a crop in a region.
A random sample is a subset in which units are chosen by a probability mechanism (sampling).

Sampling Error

Sampling error is the error that arises from using a sample to estimate the population parameters
Whatever method of sampling is used, there will generally be a difference between the population value and its corresponding estimate
This error is unavoidable in every sampling scheme
A sample with the smallest sampling error is considered a good representative of the population
This error can be reduced by increasing the size of the sample

Non-Sampling Error

Besides sampling error, the sample estimate may be subject to other errors, which arise from failure to measure some of the units in the selected sample, observational errors, or errors introduced in editing, coding and tabulating the results
Census results may suffer from non-sampling error, although they are free from sampling error
Non-sampling error is likely to increase with sample size, while sampling error decreases as sample size increases

Alternatives to Sample Surveys

Analysis of administrative records (administrative data)
(for example Health Authority data, crime records held by the Home Office or police, School Authority data, tax records etc.)

Censuses
(all members of the population of interest are studied)

Sample Surveys vs Admin Data

Administrative data may not focus on the same population (as the one of interest)
May not contain all required information
Based on definitions devised for administrative purposes
May have incomplete coverage, be out of date, inaccurate etc.
Surveys can adopt desired definitions, collect desired data etc.

Sample Versus Census

Which is better? A census and a sample survey can be compared on:
Cost
Speed
Practicality and feasibility
Data quality
Detail (e.g. questionnaire)
Ability to analyse small subsets
Timeliness
Sampling error
Inference to population

From Population to Sample

A population parameter (e.g. a population mean such as average household income, or a population proportion such as the infant mortality rate) is based on population data
It refers to a summary value of a variable in the population
Draw a random sample from the population
Based on sample data, calculate a statistic (e.g. sample mean, sample proportion), also referred to as an estimator
It refers to a summary value of a variable based on the sample

From Population to Sample

Estimator: An estimator is a statistic obtained by a specified procedure for estimating a population parameter

The estimator is a random variable and its value differs from sample to sample

Estimate: The particular value, which the estimator takes for a given sample, is known as an estimate
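
The distinction between an estimator and an estimate can be illustrated with a small simulation. This is only a sketch: the population of household incomes is made up, and the function name `sample_mean` is introduced here for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the illustration is repeatable

# Hypothetical population: incomes of 1,000 households (made-up values)
population = [random.gauss(50_000, 12_000) for _ in range(1000)]
population_mean = statistics.mean(population)  # the population parameter

def sample_mean(pop, n):
    """Estimator: mean of a simple random sample of size n (without replacement)."""
    return statistics.mean(random.sample(pop, n))

# Each call draws a new sample, so each call yields a different estimate
estimates = [sample_mean(population, 50) for _ in range(5)]

print(f"population mean: {population_mean:.0f}")
for i, est in enumerate(estimates, start=1):
    print(f"estimate from sample {i}: {est:.0f}")
```

The rule `sample_mean` is the estimator; each printed number is one estimate, and the estimates scatter around the population mean.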

Example

Population parameter: population mean income
Sample statistic: mean income in the sample, denoted $\bar{x}$
The sample statistic $\bar{x}$ may be used as an estimate of the population parameter


Types of Samples: Different Sample Designs

Sample Design

A sample design is a plan, determined before any data are actually collected, for obtaining a sample from a given population.

Non-Probability versus Probability Samples

Non-probability sampling:
1. Convenience sampling
A sample selected because of its ease of access to sample members
2. Purposive sampling
A sample selected using a deliberate subjective choice in order to produce a sample which the researcher judges to be representative in some sense
Example: a quota sample
Represent the major characteristics of the population by sampling a proportional amount of each. You have to decide on which specific characteristics to base your quota

Probability sampling:
A sample that is selected by a random mechanism, where each member of the population has a known and non-zero probability of being in the sample (selection probability)
When choosing a random sample, it is important that the surveyor does not choose the sample personally. It has been repeatedly shown that the human investigator is not a satisfactory instrument for making random selections.

Pros and Cons

Convenience sampling:
Extremely cheap and quick, but potentially very large bias
Purposive (quota) sampling:
Cheaper and quicker than random sampling, but potential for availability/willingness bias even after weighting
Random (probability) sampling:
More expensive and slower; may have nonresponse bias (because of people refusing to take part)
With a good response rate, should have significantly less bias than a quota sample

Probability vs Quota samples


Probability Sampling
Method of selection is specified, objective and replicable
Inference to population based on mathematics
Protects (to some extent) against availability and willingness bias
Precision of estimates can be estimated
More expensive, requires more resources
Depending on nonresponse rate, likely to suffer less overall bias

Quota Sampling
Quota categories are specified and replicable, but interviewer preference typically rules on how to fulfil quotas
Inference based on subjective judgement
Prone to severe availability and willingness bias; weighting is essential but bias can remain
Confidence intervals cannot be calculated
Cheaper and quicker

Assessing a Sample Design

Virtually all surveys that are taken seriously by social scientists and policy makers use some form of probability sampling

One way to ruin an otherwise well-conceived survey is to use a convenience sample rather than one which is based on a probability design

Types of Probability Samples: An Overview

Probability sampling methods

1. Simple random sampling (SRS)
Randomly chosen selections using a random number table, computer-generated random numbers, lottery balls etc.
Probably the easiest way of obtaining a random sample
With replacement: each selected element is returned to the selection frame, so one unit could be selected several times
Without replacement: a selected element is not returned to the frame, so each unit can be selected at most once

Simple Random Sampling (SRS)

This is the simplest and most basic method of sampling, in which the sample is drawn unit by unit, with equal probability of selection for each unit at each draw.

Therefore, it is a method of selecting n units out of a population of size N by giving equal probability to all units, or

a sampling procedure in which all possible combinations of n units that may be formed from the population of N units have the same probability of selection.

Simple Random Sampling (SRS)

For selecting a simple random sample in practice, units from the population are drawn one by one
If a unit is selected, its observation is recorded, and it is returned to the population before the next draw is made, and this procedure is repeated n times, the procedure is known as simple random sampling with replacement (wr)
In such a selection procedure, one or more population units may be selected more than once
If this procedure is repeated until n distinct units are selected, and all repetitions are ignored, it is called simple random sampling without replacement (wor)
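
The two procedures can be sketched in a few lines of Python. The frame of 20 labelled units and the function name `srswor_by_redrawing` are assumptions made for illustration; `random.choice` and `random.sample` are standard library calls.

```python
import random

random.seed(42)  # fixed seed so the illustration is repeatable

N, n = 20, 5
frame = list(range(1, N + 1))  # hypothetical frame: units labelled 1..N

# With replacement (wr): every draw is made from the full frame,
# so the same unit can be selected more than once.
srs_wr = [random.choice(frame) for _ in range(n)]

def srswor_by_redrawing(frame, n):
    """Without replacement (wor): keep drawing with equal probability
    and ignore repetitions until n distinct units are held."""
    chosen = []
    while len(chosen) < n:
        unit = random.choice(frame)
        if unit not in chosen:
            chosen.append(unit)
    return chosen

srs_wor = srswor_by_redrawing(frame, n)

# The standard library offers the same design directly.
srs_wor_direct = random.sample(frame, n)

print("wr: ", srs_wr)
print("wor:", srs_wor)
```

In practice `random.sample` is the simpler choice; the redrawing loop is shown only because it mirrors the unit-by-unit description.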

Simple random sampling

Advantages:
Easy to understand
Used as a yardstick for assessing the efficiency of complex samples
Disadvantages:
Can be time consuming to implement
Can be costly
Statistically not the most efficient method of sampling (e.g. stratification can be used to improve efficiency)

Probability sampling methods (cont)

2. Systematic Sampling
A random start followed by successive application of the sampling interval

Example: Systematic Sampling

[Figure: population list of units numbered 1 to 100]

Determine the number of units: N = 100
Determine the sample size you want: n = 20
The interval size is therefore K = N/n = 100/20 = 5 (sample one fifth)
Select at random an integer from 1 to K: e.g. 4 is chosen
Then select every K-th unit, starting from the chosen unit: 4, 9, 14, ..., 99
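
The steps of this example can be sketched as follows. The function name `systematic_sample` and its `start` argument are introduced here for illustration, and the sketch assumes that N is an exact multiple of n.

```python
import random

def systematic_sample(N, n, start=None):
    """Select every K-th unit from a list labelled 1..N, with K = N // n.

    Assumes N is an exact multiple of n. If no start is given,
    a random start R between 1 and K is drawn.
    """
    if N % n != 0:
        raise ValueError("this sketch assumes N is a multiple of n")
    K = N // n
    R = start if start is not None else random.randint(1, K)
    return [R + j * K for j in range(n)]

sample = systematic_sample(100, 20, start=4)
print(sample[:3], sample[-1])  # [4, 9, 14] 99
```

Only the start is random; once it is fixed, the whole sample is determined.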

Systematic Sampling

Special methods are needed for systematic selection with a fractional interval
Use of fractional interval

The list from which to sample should ideally be randomly ordered

Systematic Sample

Disadvantage: periodicity in the population list
e.g. the sampling interval coincides with a periodic interval of the list
Example: suppose you select the 1st, 11th, 21st, etc. element, but the list is arranged so that the 1st is a man, the 2nd his wife, the 3rd a man, the 4th his wife, etc.
We would obtain a sample of males only, whereas the whole population is made up of males and females
Such periodicity may be easily avoided
Another way to solve this problem is to use stratification
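
The periodicity problem is easy to reproduce. In this sketch the list of 100 people alternating man, wife is made up, and the comparison with an interval of 5 is added here only to show that the problem depends on how the interval lines up with the period of the list.

```python
# Hypothetical list of 100 people arranged as man, wife, man, wife, ...
# (position 1 is a man, position 2 his wife, and so on)
frame = ["M" if i % 2 == 1 else "F" for i in range(1, 101)]

# Interval K = 10 with start R = 1: positions 1, 11, 21, ..., 91
positions = range(1, len(frame) + 1, 10)
sample = [frame[p - 1] for p in positions]  # convert to 0-based index
print(sample.count("M"), sample.count("F"))  # 10 0

# Interval K = 5 with the same start: positions alternate odd and even
sample_k5 = [frame[p - 1] for p in range(1, len(frame) + 1, 5)]
print(sample_k5.count("M"), sample_k5.count("F"))  # 10 10
```

With the even interval every selected position is odd, so only men are selected; with the odd interval the selected positions alternate between men and women.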

Systematic Sampling

In the other sampling methods, the units (whether elements or clusters) are selected with the help of random numbers
But a method of sampling in which only the first unit is selected with the help of a random number, while the rest of the units are selected according to a pre-determined pattern, is known as systematic sampling
Very useful in
forest surveys for estimating the volume of timber
fisheries surveys for estimating the total catch of fish
milk yield surveys for estimating the lactation yield

Systematic Sampling
Advantages
Easy to understand
Quick and easy to implement
Arranging the frame in stratified order will create implicit stratification
Disadvantages
Periodicity: if an ordering of the units goes unnoticed or unattended, this may result in an unusual sample

Probability sampling methods (cont)

3. Stratified Sampling
If we have information about the composition of a population, we may be able to improve on e.g. simple random sampling by using stratification
Units are aggregated (grouped) into different non-overlapping subgroups, called strata
Then a certain number of units are randomly selected from each stratum

Example: Stratified Sample

If a surveyor wants to find the most popular TV programmes, it would be advisable to first divide the population into 3 strata (men, women and children) and then select a random sample from each of the strata
Care must be taken to ensure that the strata are non-overlapping, i.e. that no element falls into more than one category.

Stratified Sampling

The basic idea in this sampling is to divide a heterogeneous population into sub-populations, usually known as strata
When strata are internally homogeneous, a precise estimate of any stratum mean can be obtained based on a sample from that stratum
By combining such estimates, a precise estimate for the whole population can be obtained
This sampling provides a better cross section of the population than the procedure of simple random sampling
For example, in a survey for income estimation, the whole population can be divided into three strata: low-income, medium-income and high-income

Stratified Sampling

It may also simplify the organization of the field work.
Geographical proximity is sometimes taken as the basis of stratification.
The assumption here is that geographically contiguous areas are often more alike than areas that are far apart.
Administrative convenience may also dictate the basis on which the stratification is made
Auxiliary information may be taken as the basis of stratification

Stratified Sampling

In stratified sampling, the variance of the estimator consists of only the within-strata variation
Thus, the larger the number of strata into which a population is divided, the higher the precision
For estimating the variance within strata, there should be a minimum of 2 sampled units in each stratum
The larger the number of strata, the higher the cost of the survey
So the number of strata has to be decided depending on administrative convenience, the cost of the survey and the variability of the characteristic under study in the area

Example: Stratified Sample

[Figure: the whole sampling frame (size N) is separated by region into 4 strata, North, South, East and West, of sizes N1, N2, N3 and N4; a random subsample of nh units out of Nh is then drawn from each stratum h = 1, ..., 4]

Stratified Sample
Can be
Proportionate (same sampling fraction for each stratum)
Disproportionate (different sampling fractions), which means differential probabilities of selection
e.g. often small subgroups are selected with a higher sampling fraction than the rest of the population to ensure a larger number of them in the final sample to facilitate analysis
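
The difference between the two kinds of allocation can be sketched numerically. The stratum sizes, the total sample size of 200 and the disproportionate allocation below are all made up for illustration, and `proportional_allocation` is a name introduced here.

```python
def proportional_allocation(stratum_sizes, n):
    """Allocate a total sample of n so that n_h / N_h is the same in every stratum.

    Rounding means the allocations may not add up exactly to n in general.
    """
    N = sum(stratum_sizes.values())
    return {h: round(n * N_h / N) for h, N_h in stratum_sizes.items()}

# Hypothetical stratum sizes N_h for four regions
sizes = {"North": 5000, "South": 3000, "East": 1500, "West": 500}

prop = proportional_allocation(sizes, n=200)
prop_fractions = {h: prop[h] / sizes[h] for h in sizes}
print(prop)            # {'North': 100, 'South': 60, 'East': 30, 'West': 10}

# Hypothetical disproportionate allocation that over-samples the small West stratum
disp = {"North": 80, "South": 50, "East": 30, "West": 40}
disp_fractions = {h: disp[h] / sizes[h] for h in sizes}

# Units selected with different probabilities need different weights N_h / n_h
weights = {h: sizes[h] / disp[h] for h in sizes}
print(weights)         # {'North': 62.5, 'South': 60.0, 'East': 50.0, 'West': 12.5}
```

Under the proportionate design every unit has the same selection probability (0.02 here); under the disproportionate design a West unit represents far fewer population units than a North unit.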

Proportionate Stratified Sample

Advantages
Guards against the more unusual samples that can be chosen by random chance
If stratifiers are related to the variables in your survey, stratification can reduce standard errors
Disadvantages
Stratification information has to be available

Disproportionate Stratified Sample

Advantages
Allows one to over-sample small groups so that a good statistical comparison can be made
Also used where the goal is to achieve an optimum allocation that balances variance against cost
Disadvantages
Estimates for the total population need to be derived using weighting (see later sessions)

Probability sampling methods (cont)

4. Cluster sampling
A cluster is a naturally occurring unit like a county (or a country or state)
Sampling units are selected as part of a cluster of units
The difference from stratified sampling is that the starting point is a natural cluster, rather than a group constructed by the surveyor as in stratified sampling.

Cluster sampling

A sampling procedure presupposes division of the population into a finite number of distinct and identifiable units called the sampling units.
The smallest units into which the population can be divided are called the elements of the population, and groups of elements are called clusters
A cluster may be a class of students or the cultivators' fields in a village
When the sampling unit is a cluster, the procedure of sampling is called cluster sampling

Cluster sampling

For many types of population, a list of elements is not available, so the use of an element as the sampling unit is not feasible.
The method of cluster sampling is available in such cases.
For example, in a city a list of all the houses may be available, but a list of persons rarely is; similarly, lists of farms are not available, but lists of villages or enumeration districts prepared for the census are.
Cluster sampling is, therefore, widely practiced in sample surveys.

Cluster sampling

For a given number of sampling units, cluster sampling is more convenient and less costly than simple random sampling, due to the saving of time in journeys, identification, contacts etc.
Cluster sampling is generally less efficient than simple random sampling, due to the tendency of the units in a cluster to be similar
In most practical situations, the loss in efficiency may be balanced by the reduction in cost, and the efficiency per unit cost may be higher in cluster sampling than in simple random sampling

Cluster sampling

Clearly, the size of the cluster will influence the efficiency of sampling
In general, the smaller the cluster, the more accurate the estimate of the population characteristic will usually be for a given number of elements in the sample
The optimum cluster is one which would estimate the characteristic under study with the smallest standard error for a given proportion of the population sampled or, more generally, for a given cost.

Probability sampling methods (cont)

5. Multi-stage sampling
Large units are selected first and then smaller units within the selected larger units are selected (results in clustering)

One of the main considerations in adopting cluster sampling is the reduction of travel cost
However, this method restricts the spread of the sample over the population, which increases the variance of the estimator
In order to increase the efficiency of the estimator for a given cost, it is natural to think of sub-sampling the clusters and selecting a larger number of clusters, so as to increase the spread of the sample over the population.
Sampling which consists of first selecting clusters and then selecting a specified number of elements from each selected cluster is known as two-stage sampling (sub-sampling)
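
A minimal sketch of two-stage selection is given below. The villages, field labels and sample sizes are invented for illustration, and simple random sampling is assumed at both stages, which is only one of several possible designs.

```python
import random

random.seed(7)  # fixed seed so the illustration is repeatable

# Hypothetical population: 30 villages (clusters), each with 40 to 80 fields (elements)
villages = {
    f"village_{i}": [f"v{i}_field_{j}" for j in range(random.randint(40, 80))]
    for i in range(1, 31)
}

def two_stage_sample(clusters, n_clusters, m_elements):
    """Stage 1: SRS of clusters. Stage 2: SRS of elements within each selected cluster."""
    selected = random.sample(sorted(clusters), n_clusters)
    return {
        c: random.sample(clusters[c], min(m_elements, len(clusters[c])))
        for c in selected
    }

sample = two_stage_sample(villages, n_clusters=6, m_elements=10)
for village, fields in sample.items():
    print(village, len(fields))
```

For the same total of 60 fields, one could instead take 2 villages completely (cluster sampling) or spread the 60 fields over all 30 villages (closer to simple random sampling); two-stage sampling sits between the two.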

Multi-stage sampling

Clusters are generally termed first stage units (fsus) or primary stage units (psus)
The elements within clusters, or ultimate observational units, are termed second stage units (ssus) or ultimate stage units (usus).
This procedure can be easily generalized to give multistage sampling
It can be expected to be
(i) more efficient than simple random sampling and less efficient than cluster sampling from the point of view of operational convenience and cost
(ii) less efficient than simple random sampling and more efficient than cluster sampling from the point of view of variability

Multi-Stage Cluster Sampling

Advantages
Huge cost savings if the survey is carried out with face-to-face interviews
Useful when no frame is available for the final sampling unit
Disadvantages
To the extent that clusters are homogeneous with respect to the survey variables you are studying, this may result in larger standard errors (less precise estimates)

Successive Sampling

Surveys are often repeated on many occasions (over years or seasons) to estimate the same characteristics at different points in time.
The information collected on a previous occasion can be used to study the change or the total value over occasions for the characteristic, and also to study the average value for the most recent occasion
For example, in a milk yield survey, we are interested in
1. Average milk yield for the current season
2. The change in milk yield between two different seasons
3. Total milk production for the year

Successive Sampling

The successive method of sampling consists of selecting sample units on different occasions such that some units are common with samples selected on previous occasions
If the objective is to estimate the change, then it is better to retain the same sample from occasion to occasion
If the basic objective is to study the total, it is better to select a fresh sample for every occasion
If the objective is to estimate the average value for the most recent occasion, then retain a part of the sample over occasions

Multiphase Sampling

It is well known that prior information on an auxiliary variable can be used to enhance the precision of the estimator.
Ratio, product and regression estimators require knowledge of the population mean or total of the auxiliary variable x.
When such information is lacking, it is sometimes less expensive to select a large sample on which the auxiliary variable alone is observed.
The purpose is to furnish a good estimate of the population mean of x

Multiphase Sampling

Subsequently, a subsample from the initial sample is selected for observing the variable of interest.
For example, consider the problem of estimating the total production of cow milk in a certain region. The village is taken as the sampling unit, but the number of milch cows in all the villages of the region may not be available
The investigator could then decide to take a large initial sample of villages and collect information on the number of milch cows in the sample villages
This information is used to build up an estimate of the total number of milch cows in the region
A subsample of villages is selected from the first-phase sample to observe the study variable, viz. cow milk yield in the village

Probability sampling methods (cont)

6. Probability Proportional to Size (PPS)
Units are sampled in two or more stages with probabilities proportional to their size (a clever solution to ensure equal-sized fieldwork assignments while maintaining equal probabilities of selection)

Sampling with Varying Probability

Under certain circumstances, selection of units with unequal probabilities provides more efficient estimators than equal probability sampling; this type of sampling is known as unequal or varying probability sampling
The units are selected with probability proportional to a given measure of size, where the size measure is the value of an auxiliary variable x
This sampling scheme is termed probability proportional to size (pps) sampling
In pps sampling, the units may be selected with or without replacement.
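
One simple way to draw a pps sample with replacement is the cumulative total method, sketched below. The slides do not say which selection method is intended, so this is an assumption; the unit labels and size measures are made up, and the function name `pps_wr` is introduced here.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def pps_wr(sizes, n, rng=random):
    """Cumulative total method for pps sampling with replacement.

    sizes maps each unit to a positive integer size measure x.
    A unit with size x is hit by x of the integers 1..total, so it is
    selected with probability x / total at each draw.
    """
    units = list(sizes)
    cum = list(accumulate(sizes[u] for u in units))  # running totals of x
    total = cum[-1]
    draws = []
    for _ in range(n):
        r = rng.randint(1, total)                 # random integer in 1..total
        draws.append(units[bisect_left(cum, r)])  # first unit whose running total >= r
    return draws

# Hypothetical size measures, e.g. number of households per village
sizes = {"A": 10, "B": 20, "C": 70}
rng = random.Random(0)
draws = pps_wr(sizes, 10_000, rng)
print({u: draws.count(u) / len(draws) for u in sizes})
```

Over many draws the selection frequencies should be close to 0.1, 0.2 and 0.7, the shares of each unit in the total size.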

Use of Auxiliary Information

In sampling theory, if auxiliary information related to the character under study is available on all the population units, then it may be advantageous to make use of this additional information in survey sampling
One way of using this additional information is in sample selection, with unequal probabilities of selection of units
Knowledge of auxiliary information may also be exploited at the estimation stage: the estimator can be developed in such a way that it makes use of this additional information

Use of Auxiliary Information (contd)

Examples of such estimators are the ratio estimator, difference estimator, regression estimator and generalized difference estimators
Here it is assumed that the auxiliary information is available on all the sampling units
Auxiliary information can also be used at the planning stage of the survey. An example of this is the stratification of the population units by making use of the auxiliary information

Stratification I

Outline
What is stratification?
Implicit and explicit stratification
Systematic sampling
Implementation of stratification
Some examples of stratification

Review

Note: in simple random sampling all units have the same probability of selection (the probabilities are known and positive)

But in general, random sampling does not need to be based on equal sampling probabilities (however, they need to be known and they all need to be positive), e.g. some units may have a higher probability of selection

Random Sampling

We sometimes sample with unequal probabilities

Think of the population as being divided into H subsets (h = 1, ..., H), with $N_h$ units in the h-th subset.

If we sample separately from each subset, then we call the subsets sampling strata. If we sample $n_h$ units from stratum h, then the sampling fraction (selection probability) in that stratum is

$$f_h = \frac{n_h}{N_h}$$

What is Stratified Sampling?

Stratified sampling involves sorting (stratifying) the sampling frame prior to selection

Implicit stratification involves sampling systematically from an ordered (stratified) list
Explicit stratification involves sorting the population list (frame) into distinct strata and then sampling independently from each stratum
It is possible (and often desirable) to combine explicit and implicit stratification, i.e. to stratify implicitly within explicit strata

Why Stratified Sampling?

The primary reason for stratification is that, with proportionate allocation, it ensures (unlike SRS) that the sample proportion from any particular stratum equals the population proportion.
This will increase precision if strata are correlated with survey measures (smaller SE and CI)
Cannot do statistical harm (estimates not less precise than under SRS)
This is true of both explicit and implicit stratification.
A secondary motivation for stratification is to permit the use of variable sampling fractions.

Systematic Sampling

Recall session 1
Involves sampling at a fixed interval down a list
If the list is ordered in some meaningful way, this has the effect of stratification
Advantage of being easy to implement
Procedure: calculate the required interval (K = N/n), then generate a random start R (a random number between 1 and K). The sampled units are then the R-th, (R+K)-th, (R+2K)-th etc. units on the list.

Systematic Sampling (2)

K = N/n, where N is the total number of units on the list and n the desired sample size.
R is a random number between 1 and K.

Note that K need not be an integer. E.g. if the desired n is 500 and N = 10,679, using K = 10,679/500 ≈ 21.36 will give exactly n = 500, but rounding to K = 21 will give n = 508 or 509, depending on the random start.
Do not use K = 21 and then stop once 500 are sampled: that is biased, because units near the end of the list could never be selected. Keep all 508 or 509 sampled cases.
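
One common way to handle a fractional interval is to draw a real-valued random start and round each selection point up to the next whole unit. The slides do not specify the method, so the sketch below is an assumption; the function name `systematic_fractional` is introduced here.

```python
import math
import random

def systematic_fractional(N, n, rng=random):
    """Systematic sample of exactly n units from a list labelled 1..N,
    using the fractional interval K = N / n."""
    K = N / n                          # e.g. 10679 / 500 = 21.358
    R = K * (1.0 - rng.random())       # random real start in (0, K]
    # Round each selection point up; min() guards against float round-off at the end
    return [min(N, math.ceil(R + j * K)) for j in range(n)]

sample = systematic_fractional(10_679, 500)
print(len(sample), len(set(sample)))   # 500 500

# With the rounded interval K = 21, the achieved sample size depends on the start
for R in (1, 21):
    print(R, len(range(R, 10_679 + 1, 21)))  # prints "1 509" then "21 508"
```

The fractional interval always returns the desired 500 units, whereas the rounded interval returns 508 or 509 and has to be used in full.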

Example: Systematic Sampling

[Figure: population list of units numbered 1 to 100]

Determine the number of units: N = 100
Determine the sample size you want: n = 20
The interval size is therefore K = N/n = 100/20 = 5 (sample one fifth)
Select at random an integer R from 1 to K: e.g. 4 is chosen
Then select every K-th unit, starting from unit R: 4, 9, 14, ..., 99

Stratum Construction

Choose factors so that strata are homogeneous
If strata are correlated with survey measures, precision increases
Strata examples: e.g. regions
We can sometimes estimate the precision achievable with different choices
Choice of number of strata:
More strata, more precision
But variance estimation is more difficult
And administration and sampling (and weighting) may be more complex

Stratum Construction (2)

Cross factors with few categories rather than using many categories for one factor
For example: stratify according to region and poor and rich areas

When using a continuous factor (e.g. tax payments, proportion of households with attribute A etc.), choose the stratum boundaries carefully (i.e. define sensible categories and cut-off points)

Stratum Construction (3)

Choose stratifiers such that they are correlated with a range of variables
For example, national household surveys tend to choose stratifiers that are related to
Area characteristics (e.g. rural, urban, population density etc.)
Income / occupation (e.g. socio-economic group, social class)

It is common to use 3-4 stratification variables hierarchically (see later example)

Example of Stratification: A General Population Survey

The Health Survey for England (DH)
Stage 1:
Postcode sectors stratified by:
14 Regional Health Authorities (1st-level explicit strata)
Proportion of adults with limiting long-term illness, in three bands (2nd-level explicit strata)
Proportion of households with non-manual head, in two bands (3rd-level explicit strata)
Proportion of households with no car, in two bands (4th-level explicit strata)
Proportion "non-white" (5th-level stratification: implicit)

Example of Stratification: A General Population Survey (2)

720 sectors were sampled systematically

Stage 2
Within each sector, addresses are in postcode order and are selected systematically. This provides some geographical stratification.

Example of Stratification: A Special Population Survey

Survey of Recipients of Job Seekers Allowance (DSS)
Stage 1
Postal sectors were stratified by region and number of recipients
200 sectors were selected with probability proportional to number of recipients

Example of Stratification: A Special Population Survey (2)

Stage 2
Recipients were stratified by sex (2 bands) x claim type (4 bands) x length of continuous unemployment prior to current claim (implicit)
25 recipients were selected systematically from each sampled sector

Stratified sample: some notation

Dividing, for example, a frame into distinct strata and then sampling independently from each stratum results in:
H strata (or groups), stratum h = 1, ..., H
In each stratum h there are $N_h$ units (at population level)
An independent sample of $n_h$ units is then selected from each stratum h
The sampling fraction (selection probability) in stratum h is:

$$f_h = \frac{n_h}{N_h}$$

Estimator Under Stratification (Example)

We have 2 strata (e.g. north and south GB)
Proportion of people aged 18+ in GB who use the internet: P
Estimator p:

$$p = \sum_{h=1}^{H} p_h \cdot \frac{N_h}{\sum_{h=1}^{H} N_h}$$

With two strata:

$$p = p_1 \cdot \frac{N_1}{N_1 + N_2} + p_2 \cdot \frac{N_2}{N_1 + N_2}$$
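The estimator above can be sketched in a few lines of Python. This is only an illustration: the function name and the stratum sizes and proportions are invented for the example and do not come from any survey.

```python
def stratified_proportion(p_h, N_h):
    """Weight each stratum proportion p_h by its population share N_h / N."""
    N = sum(N_h)
    return sum(p * Nh / N for p, Nh in zip(p_h, N_h))


# Hypothetical inputs: two strata with invented sizes and sample proportions.
p_north, p_south = 0.60, 0.70
N_north, N_south = 20_000, 30_000
p = stratified_proportion([p_north, p_south], [N_north, N_south])
print(f"stratified estimate p = {p:.3f}")
```

The larger stratum pulls the estimate towards its own proportion, which is the point of weighting by Nh/N.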

DEFF under Stratified Sample

The increase in precision under a stratified sample can be estimated using the design effect (DEFF):

$$DEFF = \frac{SE^2_{STRAT}}{SE^2_{SRS}}$$

The numerator is the variance under the stratified design
The denominator is the variance under SRS
How can $SE^2_{STRAT}$ be calculated?

Variance under Stratification

Variance of a mean:

$$\mathrm{var}(\bar{x}) = \sum_{h=1}^{H} \frac{N_h^2 s_h^2}{N^2 n_h}$$

... and for a proportion:

$$\mathrm{var}(p) = \sum_{h=1}^{H} \frac{N_h^2\, p_h (1 - p_h)}{N^2 n_h}$$

Variance under Stratification (2)

where
h is the stratum
$s_h^2$ is the sample variance in stratum h (estimated from the sample)
Nh is the population size in stratum h
nh is the sample size in stratum h
N is the total population size (N = N1 + N2 + ... + NH)
n is the total sample size (n = n1 + n2 + ... + nH)
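As a rough illustration of the two variance formulae (without the finite population correction), the sketch below evaluates them for a made-up two-stratum design. The function names and all numbers are assumptions chosen for the example.

```python
def var_stratified_mean(N_h, n_h, s2_h):
    """Sum over strata of N_h^2 * s_h^2 / (N^2 * n_h), ignoring the fpc."""
    N = sum(N_h)
    return sum(Nh**2 * s2 / (N**2 * nh) for Nh, nh, s2 in zip(N_h, n_h, s2_h))


def var_stratified_proportion(N_h, n_h, p_h):
    """Same formula with s_h^2 replaced by p_h * (1 - p_h)."""
    return var_stratified_mean(N_h, n_h, [p * (1 - p) for p in p_h])


# Hypothetical design: two equal strata, 50 sampled units in each.
N_h, n_h = [500, 500], [50, 50]
v_mean = var_stratified_mean(N_h, n_h, s2_h=[4.0, 9.0])
v_prop = var_stratified_proportion(N_h, n_h, p_h=[0.5, 0.2])
print(v_mean, v_prop)
```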

Practical Limitations to Stratification

Often only possible at PSU level (e.g. household surveys) rather than at individual level (PSU = primary sampling unit, e.g. postcode sector, school)
Correlation between strata and survey variables is typically modest
Depends on what information is available on the sampling frame
Multi-purpose nature of surveys: optimal stratification for one estimate may produce no benefit for another
Typically there is a lack of information about stratum variances

Comparisons between Stratification and Quota Sampling

Recall session 1
Imposing quotas has a similar effect to stratification, namely to reduce sampling variance
But quota sampling also has an inherent bias towards more accessible and more willing population members
This may manifest itself as a bias in the survey measures
Thus, quota sample estimates could have relatively high precision, but be biased and therefore have low accuracy (high mean squared error) (session 3)

Stratification II

Outline of session
Variable Sampling Fractions
Motivations
Optimal allocation
Design effects

Variable Sampling Fraction (VSF)

We sometimes sample with unequal probabilities
Think of the population as being divided into H subsets (h = 1, ..., H), with Nh units in the hth subset.
If we sample separately from each subset, then we call the subsets sampling strata. If we sample nh units from stratum h, then the sampling fraction (selection probability) in that stratum is nh/Nh:

$$f_h = \frac{n_h}{N_h}$$

Variable Sampling Fraction (VSF)

For unbiased estimation, each sampled unit i must be assigned a weight in inverse proportion to its selection probability.
This is usually referred to as the sampling weight or design weight: wi
An example of such a weight in the case of stratified sampling would be:

$$w_i = \frac{N_h}{n_h} \quad \text{for } i \in h$$

i.e. if sample unit i belongs to stratum h
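A minimal sketch of design weighting, assuming a hypothetical population of 1,000 units in which the small stratum is heavily oversampled. All values and function names are invented for illustration.

```python
def design_weights(N_h, n_h):
    """Weight for every unit in stratum h: w = N_h / n_h."""
    return [Nh / nh for Nh, nh in zip(N_h, n_h)]


def weighted_mean(values, weights):
    """Weighted mean: sum(w * y) / sum(w)."""
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


# Hypothetical sample: stratum 1 (N=900, n=3) and stratum 2 (N=100, n=2).
w1, w2 = design_weights([900, 100], [3, 2])
values = [1, 2, 3, 10, 12]
weights = [w1, w1, w1, w2, w2]

unweighted = sum(values) / len(values)
weighted = weighted_mean(values, weights)
print(unweighted, weighted)
```

Ignoring the weights lets the oversampled stratum dominate the estimate; applying them restores each stratum to its population share.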

Use of weights

So when certain types of units have been selected based on different selection probabilities (oversampling), the sample weights need to be taken into account in estimation
Corrective weighting is needed to get design-unbiased estimates
If weights are ignored then sample estimates are biased

Motivations for VSF

1. To increase the sample size of small groups (i.e. to get acceptable confidence intervals for estimates based on those groups)
2. Because the frame / selection method gives us no choice
3. To increase precision of estimates by oversampling more variable strata

Examples
1. A national survey where estimates are also required for each of the component countries / regions
E.g. a survey of the UK, where estimates for Scotland, Wales and NI are also needed separately
Then a larger sampling fraction might be used in Wales and Scotland compared to England.

2. Sampling minority ethnic groups:
A high proportion of the minority ethnic population live within a relatively small proportion of areas
Oversampling such (ethnically dense) areas will increase achieved sample sizes while reducing survey costs.

Use of Variable Sampling Fractions

Now we want to investigate further the effects of using variable sampling fractions
We have seen that we need to use weights
We want to investigate under which circumstances precision in survey estimates is increased and when precision is reduced after using VSF
Or in other words: what is the effect of oversampling on the precision of estimates?

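Standard Errors for Stratified Sampling

We have already introduced in the last session a formula for the variance
Generally, for a mean it is:

$$\mathrm{var}(\bar{x}) = \sum_{h=1}^{H} \frac{N_h^2 s_h^2}{N^2 n_h} \left(1 - \frac{n_h}{N_h}\right) \qquad (6.1)$$

And for a proportion:

$$\mathrm{var}(p) = \sum_{h=1}^{H} \frac{N_h^2\, p_h (1 - p_h)}{N^2 n_h} \left(1 - \frac{n_h}{N_h}\right) \qquad (6.2)$$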


The finite population correction

The expression

$$1 - \frac{n_h}{N_h}$$

is referred to as the finite population correction (fpc)
This term is only important if nh/Nh is not close to 0
Usually nh/Nh is very close to 0 (since N is very large, even if n is quite large) and the finite population correction can be ignored
Remember (standard error):

$$SE(\bar{x}) = \sqrt{\mathrm{Var}(\bar{x})}$$
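To see when the finite population correction matters, the sketch below compares standard errors with and without it for a single stratum. The function name and the numbers (s² = 25, n = 100) are hypothetical.

```python
import math


def se_mean(s2, n, N=None):
    """Standard error of a mean: sqrt(s2 / n), times the fpc if N is given."""
    var = s2 / n
    if N is not None:
        var *= 1 - n / N  # finite population correction
    return math.sqrt(var)


# Hypothetical values: sample variance 25 and sample size 100.
se_no_fpc = se_mean(25.0, 100)
se_large_N = se_mean(25.0, 100, N=1_000_000)  # n/N close to 0
se_small_N = se_mean(25.0, 100, N=200)        # n/N = 0.5
print(se_no_fpc, se_large_N, se_small_N)
```

With a very large population the two standard errors are practically identical; only when the sample is a sizeable share of the population does the correction shrink the standard error noticeably.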

Variance under Stratification

If we ignore the finite population correction (for every stratum) we can simplify this to:
Variance of a mean:

$$\mathrm{var}(\bar{x}) = \sum_{h=1}^{H} \frac{N_h^2 s_h^2}{N^2 n_h} \qquad (6.3)$$

Variance of a proportion:

$$\mathrm{var}(p) = \sum_{h=1}^{H} \frac{N_h^2\, p_h (1 - p_h)}{N^2 n_h} \qquad (6.4)$$

Standard Errors for Stratified Sampling

In addition to the simplification of the variance estimation formulae for a mean and a proportion if we ignore the finite population correction (fpc), we note:
Differences between strata do not contribute to variance. So, we should construct strata to be as homogeneous (small $s_h^2$) as possible

Standard Errors for Stratified Sampling

Note that in the special case where we use the same sampling fraction in each stratum, each of the variance formulae simplifies further.
We can substitute n/N in place of nh/Nh, and nh/n in place of Nh/N. (6.3) and (6.4) then become:
For a mean:

$$\mathrm{var}(\bar{x}) = \sum_{h=1}^{H} \frac{n_h s_h^2}{n^2} \qquad (6.5)$$

For a proportion:

$$\mathrm{var}(p) = \sum_{h=1}^{H} \frac{n_h\, p_h (1 - p_h)}{n^2} \qquad (6.6)$$

We will look more at (6.5) and (6.6) later.
First, we will concentrate on Variable Sampling Fractions. In the presence of VSFs, we need formulae (6.3) and (6.4), ignoring the fpc.

Example: Over-Sampling More Variable Strata

Sometimes, we can identify strata that have high population variances ($S_h^2$ large). Over-sampling these strata will tend to increase the precision of the survey estimates (reduce standard errors).
We can only do this if we have advance estimates of stratum variances.
Example to illustrate this:
Suppose H = 2 and N1 = N2 (= N/2).
Suppose we know (or estimate) that $S_1^2 = 2 S_2^2$

Example (cont)

Then we can substitute into expression (6.3) (ignoring the fpc and looking at the population variance rather than the estimated variance) and we get:

$$\mathrm{var}(\bar{x}) = \frac{2 N^2 S_2^2}{4 N^2 n_1} + \frac{N^2 S_2^2}{4 N^2 n_2} = \frac{S_2^2}{2 n_1} + \frac{S_2^2}{4 n_2}$$

Example (cont)
Now, consider two alternative sample designs:
a.) Proportional allocation, i.e. where

$$\frac{n_h}{n} = \frac{N_h}{N}$$

b.) A higher sampling fraction in stratum 1, i.e. n1 larger than n2

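It follows:
For a.) Substitute n1 = n2 = n/2:

$$\mathrm{var}(\bar{x}) = \frac{S_2^2}{n} + \frac{S_2^2}{2n} = 1.5\,\frac{S_2^2}{n}$$

For b.) Substitute e.g. n1 = 0.58n; n2 = 0.42n:

$$\mathrm{var}(\bar{x}) = \frac{S_2^2}{1.16\,n} + \frac{S_2^2}{1.68\,n} = 1.457\,\frac{S_2^2}{n}$$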

Example (cont)

So, the sampling variance is slightly smaller under design b)
It is smaller by a ratio of 1.457/1.5, i.e. 0.97
This is the design effect due to over-sampling the more variable stratum (VSF):

$$DEFF_{VSF} = \frac{SE^2_{VSF}}{SE^2_{SRS}} = \frac{1.457}{1.5} \approx 0.97$$

$$DEFT_{VSF} = \sqrt{DEFF_{VSF}} \approx 0.98$$
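The arithmetic of this example can be checked with a short script. It is a sketch that hard-codes the example's assumptions (H = 2, N1 = N2, S1² = 2·S2²) and works in units of S2²/n; the function name is invented.

```python
def var_mean_two_strata(share_1, n=1.0, S2_sq=1.0):
    """var(x-bar) = S2^2 / (2 * n1) + S2^2 / (4 * n2), with n1 = share_1 * n.

    Valid only under the example's assumptions: two equal-sized strata,
    S1^2 = 2 * S2^2, and no finite population correction.
    """
    n1 = share_1 * n
    n2 = (1 - share_1) * n
    return S2_sq / (2 * n1) + S2_sq / (4 * n2)


var_a = var_mean_two_strata(0.50)  # design a: proportional allocation
var_b = var_mean_two_strata(0.58)  # design b: stratum 1 over-sampled
deff_vsf = var_b / var_a
print(round(var_a, 3), round(var_b, 3), round(deff_vsf, 2))
```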

Example (cont)

This example illustrates how precision can be increased by the use of Variable Sampling Fractions (in the case of oversampling strata with high stratum variances)
This approach is quite common for repeated business and agriculture surveys, but rare for household surveys.

Note

We have seen when considering case b.) that a higher sampling fraction in the more variable stratum led to increased precision
Therefore it is important to consider which stratum allocation will maximise survey precision (when stratum variances are unequal)

Optimal Allocation

In general, the optimum allocation rule is to set:

$$\frac{n_h}{N_h} \propto \frac{S_h}{\sqrt{C_h}}$$

where Sh is the population standard deviation in stratum h and Ch is the unit cost of data collection for a unit in stratum h.
If data collection costs do not vary between strata, this simplifies to:

$$\frac{n_h}{N_h} \propto S_h$$

If stratum variances are equal, it further simplifies to a constant K:

$$\frac{n_h}{N_h} = K$$
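The allocation rule can be sketched as follows. The function is hypothetical and assumes the total sample size n is fixed in advance and simply shared out in proportion to Nh·Sh/√Ch; a budget-constrained design would need a further step to determine n.

```python
import math


def optimal_allocation(n, N_h, S_h, C_h=None):
    """Share a total sample n across strata in proportion to N_h * S_h / sqrt(C_h)."""
    if C_h is None:
        C_h = [1.0] * len(N_h)  # assumption: equal unit costs in every stratum
    scores = [Nh * Sh / math.sqrt(Ch) for Nh, Sh, Ch in zip(N_h, S_h, C_h)]
    total = sum(scores)
    return [n * s / total for s in scores]


# Earlier example: N1 = N2 and S1^2 = 2 * S2^2, so S1 = sqrt(2) * S2.
alloc = optimal_allocation(100, [500, 500], [math.sqrt(2), 1.0])
print([round(a, 1) for a in alloc])

# Equal variances and costs reduce to proportional allocation.
equal = optimal_allocation(100, [300, 700], [1.0, 1.0])
print(equal)
```

For the two-stratum example, the rule puts roughly 58.6% of the sample in stratum 1, close to the n1 = 0.58n used in design b).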

Optimal Allocation (cont)

The last case demonstrates that an equal probability selection method is optimum in the situation where variances and data collection costs are equal in all strata (other things being equal).

Example: VSFs with Equal Stratum Variances

Example:
Again suppose H = 2, and N1 = N2.
But now suppose that stratum variances are equal, i.e. $S_1^2 = S_2^2 = S^2$

Again consider two different sampling schemes:
a.) Proportional allocation ($n_h \propto N_h$)
b.) Sampling fraction in stratum 1 is twice that in stratum 2, i.e. n1 = 2n/3; n2 = n/3.

Example (cont)

Then, with design a), we find (from expression 6.3, again ignoring the fpc):

$$\mathrm{var}(\bar{x}) = \frac{(N/2)^2 S^2}{N^2 (n/2)} + \frac{(N/2)^2 S^2}{N^2 (n/2)} = \frac{2 N^2 S^2}{2 N^2 n} = \frac{S^2}{n}$$

(Note: this is the formula for the variance of a mean under SRS!)

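Example (cont)

With design b), we find:

$$\mathrm{var}(\bar{x}) = \frac{(N/2)^2 S^2}{N^2 (2n/3)} + \frac{(N/2)^2 S^2}{N^2 (n/3)} = \frac{S^2}{4 (2n/3)} + \frac{S^2}{4 (n/3)} = \frac{9 S^2}{8 n}$$

It follows:

$$DEFF_{VSF} = \frac{SE^2_{VSF}}{SE^2_{SRS}} = \frac{9 S^2 / 8n}{S^2 / n} = 9/8 = 1.125$$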

Example (cont)
This means:
The sampling variance under design b) is 9/8 (= 1.125) times that under design a).
By allocating disproportionately, we have lost precision (in the case of equal stratum variances)!

In general, precision will be lost whenever variable sampling fractions are used, if the stratum variances do not vary (much).
The level of precision loss depends on the range of the weights used

Design Effects due to VSFs

If we can assume stratum variances to be equal, there is an alternative and often-used way to estimate the effect of VSFs on sampling variance.
Expression (6.1) can be used to derive an expression for the effective sample size:

$$n_{eff}^{VSF} = \frac{\left(\sum_h n_h w_h\right)^2}{\sum_h n_h w_h^2}$$

where nh is the sample size in stratum h and wh is the weight given to each case in stratum h. (Remember that wh will be proportional to Nh/nh.)
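A small sketch of the effective sample size formula, applied to the equal-variance example (N1 = N2, n1 = 2n/3, n2 = n/3, so the weights are in the ratio 1:2). The function names and the total sample size of 300 are assumptions for illustration.

```python
def effective_sample_size(n_h, w_h):
    """n_eff = (sum of n_h * w_h)^2 / (sum of n_h * w_h^2)."""
    numerator = sum(n * w for n, w in zip(n_h, w_h)) ** 2
    denominator = sum(n * w * w for n, w in zip(n_h, w_h))
    return numerator / denominator


def deff_vsf(n_h, w_h):
    """Design effect due to variable sampling fractions: n / n_eff."""
    return sum(n_h) / effective_sample_size(n_h, w_h)


# Hypothetical n = 300: stratum 1 has 200 cases (weight 1), stratum 2 has 100 (weight 2).
n_eff = effective_sample_size([200, 100], [1, 2])
deff = deff_vsf([200, 100], [1, 2])
print(round(n_eff, 1), deff)
```

Only the relative sizes of the weights matter, so any weights proportional to Nh/nh give the same result; here the design effect reproduces the 9/8 found algebraically.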

Design Effects due to VSFs (cont)

Note that this expression only takes into account the effect of VSFs on effective sample size, not the effect of any other aspect of design.
The formula on the previous slide can be used at the design stage to predict the impact on precision of alternative allocations to strata!

Design Effects due to VSFs (cont)

In general, it will be found that:
a larger range of sampling fractions (weights) results in a smaller neff (i.e. greater loss of precision)
over-sampling a large subgroup results in greater loss of precision than over-sampling a small subgroup
when the main aim is to produce estimates for subgroups, equal sample sizes per subgroup will be an efficient design
when the main aim is to produce estimates for the total population, equal sampling fractions will be efficient.

Graphical illustration of neff

The following graph illustrates the effect of oversampling on survey precision for a sample with 2 strata (H=2)
The graph shows the relationship between
the proportion of the sample in stratum 1 (n1/n) (x-axis)
and the consequent loss of precision, as measured by the design effect (y-axis).

The three lines relate to three oversampling rates and the corresponding relative weights that need to be used: 2:1, 4:1 and 10:1 (i.e. w1=1 in all cases).
(2:1 means that stratum 1 is oversampled by a factor of 2)

[Figure: DEFF VSF (y-axis, scale 1 to 3.4) plotted against n1/n (x-axis), with one line each for w2=2, w2=4 and w2=10]

Graphical illustration of neff

The graph illustrates the two points made earlier:
a larger range of sampling fractions (weights) results in a smaller neff (i.e. greater loss of precision)
over-sampling a large subgroup results in greater loss of precision than over-sampling a small subgroup
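The shape of these curves can be reproduced numerically. The sketch below assumes two strata with w1 = 1 and evaluates the design effect implied by the effective sample size formula at a grid of n1/n values; the function name is invented and the output is a table of numbers rather than a plot.

```python
def deff_two_strata(share_1, w2, w1=1.0):
    """DEFF = n / n_eff for two strata, with share_1 = n1 / n."""
    share_2 = 1 - share_1
    sum_w = share_1 * w1 + share_2 * w2          # (sum of n_h * w_h) / n
    sum_w2 = share_1 * w1**2 + share_2 * w2**2   # (sum of n_h * w_h^2) / n
    return sum_w2 / sum_w**2


for w2 in (2, 4, 10):
    row = [round(deff_two_strata(x / 10, w2), 2) for x in range(1, 10)]
    print(f"w2={w2}: {row}")
```

Each curve rises as the oversampled stratum takes up more of the sample, and the curves for larger weight ratios lie higher throughout.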

Multi-Stage Sampling

Outline of session

What is multi-stage / cluster sampling
Motivations for multi-stage sampling
Choice of sampling units, sample sizes at each stage
Selection probabilities and weighting
Probability Proportional to Size (PPS) sampling
Design effects due to clustering

What is Multi-Stage Sampling?

The units in the population are arranged hierarchically
A 3-stage design would entail:
Primary sampling units (PSUs)
Secondary sampling units (SSUs)
Sample elements

It would be necessary to assign every element uniquely to one SSU and every SSU uniquely to one PSU

What is Multi-Stage Sampling?

Stage 1: select a sample of PSUs
Stage 2: select a sample of SSUs within each selected PSU
Stage 3: select a sample of elements within each selected SSU
Note that there could be any number of stages: 2, 3 or 4 are common

Examples:

General population survey:
PSUs might be postcode sectors
SSUs might be households
Elements might be persons

Business survey:
PSUs might be companies
SSUs might be workplaces
Elements might be employees

Why Multi-Stage Sampling?

No frame of elements available, but frame of PSUs available (examples: national sample of school pupils, where schools could be PSUs; US face-to-face survey where counties are PSUs)
Cost of data collection (example: general population sample involving face-to-face interviewing)
Access to elements may only be via gatekeepers (examples: students, employees, trainees)
Data quality (example: in the case of face-to-face interviewing, fieldwork can be better supervised if in clusters)

Design Choices (clustering): Example: Field interviewing

Constraint                                  Implication
Tight fieldwork periods                     Small workload per interviewer
Completion depends on slowest interviewer   Equal interviewer workloads
Efficient fieldwork                         Each workload in small area
Training / briefing / learning costs        Large workload per interviewer

Design Choices (clustering): Some General Points:

Larger clusters will generally result in larger design effects due to clustering (see later)
But larger clusters will also generally result in larger cost savings (e.g. field interviewers, gatekeepers)
Necessary to make an appropriate compromise: i.e. where cost saving outweighs loss in precision, to produce higher overall accuracy per unit cost (remember the key aim of sample design: minimising costs, maximising accuracy)

Selection Probabilities: Principle

With multi-stage sampling, the selection probability of each element is the product of the (conditional) selection probabilities at each stage
e.g. the probability of sampling unit i in SSU j in PSU k is
Pr(ijk) = Pr(k) x Pr(j|k) x Pr(i|j,k)

So, it is important to control and record the selection probabilities at each stage.

Selection Probabilities

Other things being equal, it is desirable to keep selection probabilities equal for all elements (remember: stratification; otherwise loss in precision).
If selection probabilities are not equal, we will need to weight each sampled element ijk by
w(ijk) = 1/Pr(ijk)
for unbiased estimation.
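As a brief illustration of multiplying stage probabilities and inverting the result to get a weight, the sketch below uses exact fractions. The three stage probabilities and the function name are invented for the example.

```python
from fractions import Fraction


def overall_probability(*stage_probs):
    """Multiply the (conditional) selection probabilities of every stage."""
    p = Fraction(1)
    for q in stage_probs:
        p *= q
    return p


# Hypothetical 3-stage design: Pr(k) = 1/10, Pr(j|k) = 1/5, Pr(i|j,k) = 1/2.
pr_ijk = overall_probability(Fraction(1, 10), Fraction(1, 5), Fraction(1, 2))
w_ijk = 1 / pr_ijk  # design weight for this element
print(pr_ijk, w_ijk)
```

This is why the probabilities at every stage need to be recorded: the weight cannot be reconstructed afterwards if any one of them is missing.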

Selection Options

With multi-stage sampling, there are many ways to achieve equal selection probabilities.
(epsem design = equal probability of selection method = self-weighting design)
In the (rare) case of equal-size PSUs and 2-stage sampling, we can easily select PSUs (j) and elements (i) with equal probability.
Example: Design (0):
Pr(j) = 1/3 and Pr(i|j) = 1/2, so the overall probability is
Pr(i) = 1/3 x 1/2 = 1/6 for all i.

Selection Options

In many sampling situations, equal-size PSUs are rare. In the case of unequal-size PSUs we are left with 3 alternative designs:
1. select PSUs with equal probabilities and then a fixed number of elements within each; this gives unequal selection probabilities (not an epsem design)
2. select PSUs with equal probabilities and then a variable number of elements within each, to give equal overall selection probabilities
3. select PSUs with PPS (probability proportional to size), then a fixed number of elements within each

Selection Options

Design 1) is undesirable because a non-epsem design will generally cause a loss in precision compared with an epsem design, and weighting is needed
Design 2) avoids this problem, but causes practical problems. The number of elements sampled per PSU will vary in proportion to the population size of the PSU. Elements in one PSU typically form one interviewer workload, so this is undesirable.
Also, with design 2) the sample size is not fixed in advance: it is a random variable. Very undesirable!

Selection Options

Design 3) overcomes all these problems, but it depends on the availability of a reasonably accurate measure of the number of elements in each PSU (and SSU, if a 3-stage design).
Note: when accurate measures of the number of elements within each PSU are not available, it may be possible to get a reasonably good estimate of the measure of size and to proceed with PPS sampling
The next slide discusses this design further:

Probability Proportional to Size (PPS) Selection

Example: A 2-stage design
Set Pr(j) proportional to Nj (the number of elements in the population in PSU j); this is PPS sampling.
So Pr(j) = C Nj.
We then select the same number of elements, D, from each sampled PSU, so Pr(i|j) = D/Nj.
Then,
Pr(i) = Pr(j) x Pr(i|j) = C Nj x D/Nj = CD, which is the same for every element

Implementation of a PPS Design

We do not need to calculate the selection probabilities at each stage in order to make the selection.
We need only create a cumulative total of sizes down the list of PSUs (ending at, e.g., 10,000) and then sample systematically down that list of totals, selecting each PSU within which a selection point falls

Implementation of a PPS Design

Example: Selection of 3 PSUs from 10 with PPS and 25 units from each selected PSU, so that n=75
Pr(j) is the probability of selecting the PSU
Pr(i|j) is the probability of selecting each unit, given that the PSU has been selected, and
Pr(i) is the overall probability of selecting each unit.
It can be seen that each of the 10,000 units in the population has the same selection probability:

Example of a PPS Design

PSU     Size (Nj)   Pr(j) = C*Nj    Pr(i|j) = D/Nj   Pr(i) = Pr(j) x Pr(i|j) = C*D
1       1000        3x1000/10000    25/1000          75/10000
2       900         3x 900/10000    25/ 900          75/10000
3       800         3x 800/10000    25/ 800          75/10000
4       1200        3x1200/10000    25/1200          75/10000
5       1500        3x1500/10000    25/1500          75/10000
6       1300        3x1300/10000    25/1300          75/10000
7       1100        3x1100/10000    25/1100          75/10000
8       500         3x 500/10000    25/ 500          75/10000
9       1000        3x1000/10000    25/1000          75/10000
10      700         3x 700/10000    25/ 700          75/10000
Total   10000

C = 3/10000, D = 25
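The table can be verified with exact arithmetic. The sketch below uses the PSU sizes from the example; the function name and its parameter names are invented for illustration.

```python
from fractions import Fraction


def pps_probabilities(sizes, n_psus, take):
    """Return (Pr(j), Pr(i|j), Pr(i)) for each PSU under PPS with a fixed take."""
    total = sum(sizes)
    C = Fraction(n_psus, total)  # Pr(j) = C * Nj
    rows = []
    for Nj in sizes:
        pr_j = C * Nj
        pr_i_given_j = Fraction(take, Nj)  # Pr(i|j) = D / Nj
        rows.append((pr_j, pr_i_given_j, pr_j * pr_i_given_j))
    return rows


sizes = [1000, 900, 800, 1200, 1500, 1300, 1100, 500, 1000, 700]
rows = pps_probabilities(sizes, n_psus=3, take=25)
for j, (pr_j, pr_i_given_j, pr_i) in enumerate(rows, start=1):
    print(j, pr_j, pr_i_given_j, pr_i)
```

Every row ends with the same overall probability, 75/10000 (printed in lowest terms as 3/400), whatever the size of the PSU.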

Example of a PPS Design (cont)

We would select the sample of PSUs as follows:
N = 10,000 and n = 3 (PSUs).
To select systematically (see session: stratification I), K = N/n = 3333 and R needs to be a random number between 1 and 3333. Suppose we happen to generate R = 1,050.
Then, we sample the PSUs that contain elements 1050, (1050 + 3333) and (1050 + 2x3333), i.e. PSUs 2, 5 and 7:

Example of a PPS Design (cont)

PSU   Size   Cum. size   Selection
1     1000   1000
2     900    1900        *
3     800    2700
4     1200   3900
5     1500   5400        *
6     1300   6700
7     1100   7800        *
8     500    8300
9     1000   9300
10    700    10000
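The systematic selection down the cumulative totals can be sketched as below. The function name is invented; the interval is taken as the integer part of N/n, as in the example, and the start R = 1050 is passed in rather than drawn at random so that the result is reproducible.

```python
from bisect import bisect_left
from itertools import accumulate


def pps_systematic(sizes, n_psus, start):
    """Select PSUs (numbered from 1) containing start, start + K, start + 2K, ..."""
    cumulative = list(accumulate(sizes))
    interval = cumulative[-1] // n_psus  # K = N / n, rounded down
    points = [start + k * interval for k in range(n_psus)]
    # The selected PSU is the first one whose cumulative size reaches the point.
    return [bisect_left(cumulative, point) + 1 for point in points]


sizes = [1000, 900, 800, 1200, 1500, 1300, 1100, 500, 1000, 700]
selected = pps_systematic(sizes, n_psus=3, start=1050)
print(selected)
```

In practice `start` would be a random integer between 1 and K; larger PSUs span more of the cumulative list and are therefore more likely to contain a selection point.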

Some Limitations of PPS Sampling of PSUs

We might have only imperfect estimates of the number of elements in each PSU (the size measure)
We could then adjust the sample size within each PSU to keep overall probabilities equal, or we might simply weight by 1/Pr(i)
The sampling interval might be smaller than the number of elements in some PSUs. (This will only happen if the sampling fraction of PSUs is large and/or the size of PSUs is highly variable.) Those PSUs will be certain to be sampled, and could be sampled more than once.

Some Limitations of PPS Sampling of PSUs

We might place these PSUs in a separate stratum and include them with certainty. We might also increase their sample size of elements, to keep overall probabilities equal, or we might weight

Design Effects due to Clustering

Clustering tends to increase sampling variance (but this is partly offset by the fact that a larger sample size can be obtained for any given cost).
This is because units within a cluster tend to be more homogeneous than units in the population as a whole.
Clustering therefore tends to have the opposite effect to stratification.

Example of Homogeneity of Clusters

Let us consider the following example to illustrate the effect of clustering:
Population of 6 people, with values: 1, 1, 2, 2, 3, 3.
Population mean = 12/6 = 2
Population variance:

$$\mathrm{var}(X) = \frac{1}{6} \sum_{i=1}^{6} (x_i - 2)^2 = \frac{4}{6} = \frac{2}{3}$$

Example (cont)
a) Divide the population into 3 clusters: (1,1), (2,2) and (3,3).
Then there is no variance within clusters (homogeneous clusters). But the variance between the cluster means is:

$$\mathrm{var}(X_B) = \frac{(1-2)^2 + (2-2)^2 + (3-2)^2}{3} = \frac{2}{3}$$

This implies that the sampling variance is greater than 0, since we get different estimates of the mean depending on which cluster is sampled.

Example (cont)
b) Divide the population into 2 clusters: (1,2,3) and (1,2,3).
There is no variance between cluster means. But the variance within each cluster is:

$$\mathrm{var}(X_W) = \frac{2 \times \left[ \frac{(1-2)^2 + (2-2)^2 + (3-2)^2}{3} \right]}{2} = \frac{2}{3}$$

The sampling variance is 0 since there is no variability in sample means.

With design a) all the variance is between clusters: clusters are perfectly homogeneous.
With design b), clusters are as heterogeneous as the population as a whole, so cluster sampling would not cause a loss in precision.

Example (cont)

If we sample one cluster (and then include all its elements), design a) has a sampling variance of 2/3; design b) has a sampling variance of 0.
This illustrates the general point that sampling variance will be greater if clusters are relatively homogeneous (i.e. as in a))
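The two cluster designs can be compared directly. The sketch assumes one cluster is drawn with equal probability and all of its elements are observed, so the sampling variance of the estimated mean is simply the population variance of the cluster means; the function name is invented.

```python
from statistics import mean, pvariance


def sampling_variance_one_cluster(clusters):
    """Variance of the sample mean when a single cluster is drawn at random."""
    return pvariance([mean(cluster) for cluster in clusters])


design_a = [(1, 1), (2, 2), (3, 3)]  # homogeneous clusters
design_b = [(1, 2, 3), (1, 2, 3)]    # clusters as varied as the population

var_a = sampling_variance_one_cluster(design_a)
var_b = sampling_variance_one_cluster(design_b)
print(var_a, var_b)
```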

Design Effects due to Clustering (cont)

Typically, the sorts of units that we tend to use as PSUs are relatively homogeneous, so in practice clustering nearly always results in a design effect due to clustering which is greater than one.

Examples:
people within postcode sectors,
pupils within schools,
students within classes,
employees within firms.

Intra-Cluster Correlation

The design effect due to clustering is

$$DEFF_{CL} = 1 + (b - 1)\rho$$

where b is the sample size per cluster (in practice b may vary slightly, in which case the mean cluster size provides an adequate approximation), and ρ (roh) is the intra-cluster correlation.

ρ = 0: randomly sorted clusters
ρ = 1: perfectly homogeneous clusters

Intra-Cluster Correlation (cont)

Note that ρ is a population characteristic relating to the chosen definition of PSU, but sample design should involve a careful choice of b.

Examples of possible values:
b=10: if ρ=0 then DEFF_CL = 1
b=10: if ρ=1 then DEFF_CL = 10
more realistically, b=10: if ρ=0.05 then DEFF_CL = 1.45
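These values follow directly from DEFF_CL = 1 + (b - 1)ρ, as the short sketch below shows. The function names are invented for illustration.

```python
import math


def deff_clustering(b, rho):
    """Design effect due to clustering: 1 + (b - 1) * rho."""
    return 1 + (b - 1) * rho


def deft_clustering(b, rho):
    """DEFT is the square root of DEFF."""
    return math.sqrt(deff_clustering(b, rho))


for rho in (0.0, 1.0, 0.05):
    print(rho, round(deff_clustering(10, rho), 2), round(deft_clustering(10, rho), 2))
```

Because b multiplies ρ, even a small intra-cluster correlation produces a sizeable design effect when many units are taken per cluster.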

Inflation due to clustering

Reminder: the square root of DEFF is DEFT
DEFT_CL inflates confidence intervals of the mean (or proportion) as follows:

$$\bar{x} \pm 1.96 \times SE \times DEFT_{CL}$$
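A minimal sketch of the inflated interval, assuming a hypothetical estimate of 0.50 with an SRS standard error of 0.02 and a DEFT of 1.5. The function name and all numbers are invented.

```python
def clustered_ci(estimate, se_srs, deft, z=1.96):
    """Confidence interval: estimate +/- z * SE * DEFT."""
    half_width = z * se_srs * deft
    return estimate - half_width, estimate + half_width


# Hypothetical inputs, not taken from any survey.
srs_interval = clustered_ci(0.50, 0.02, deft=1.0)
clustered_interval = clustered_ci(0.50, 0.02, deft=1.5)
print(srs_interval, clustered_interval)
```

With DEFT = 1.5 the interval is half as wide again as under simple random sampling.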

Example of Intra-Cluster Correlations

From the British Social Attitudes Survey:

Variable                  ρ       b      DEFT   DEFT if b=10
Household size            0.070   16.6   1.45   1.28
Owner-occupier            0.231   16.5   2.14   1.75
Has telephone             0.102   16.5   1.61   1.38
Asian                     0.334   8.3    1.86   1.53
Roman Catholic            0.037   16.4   1.25   1.15
Not racially prejudiced   0.021   8.4    1.08   1.03
Extra-marital sex wrong   0.044   8.3    1.15   1.08
Dodging VAT is OK         0.021   8.2    1.07   1.04

Example of Intra-Cluster Correlations

Note: ρ is low for attitudinal variables, so design effects are small (DEFT small). But ρ is large for variables related to ethnicity and housing type.
Thus, the most effective degree of clustering might be greater for an attitude survey (fewer, larger clusters) than for a housing survey.

