
Sampling Design and Analysis

MTH 494
Lecture-32
Ossam Chohan
Assistant Professor
CIIT Abbottabad
Two-stage Cluster Sampling

Review of The Course

Basics of Statistics

Definition: Statistics is the science of collection, presentation, analysis, and
reasonable interpretation of data.

Statistics presents a rigorous scientific method for gaining insight into data. For
example, suppose we measure the weight of 100 patients in a study. With so
many measurements, simply looking at the data fails to provide an informative
account. However, statistics can give an instant overall picture of the data through
graphical presentation or numerical summarization, irrespective of the number of
data points. Besides data summarization, another important task of statistics is to
make inferences and predict relations between variables.
A Taxonomy of Statistics

10
Statistical Description of Data
Statistics describes a numeric set of data by its
Center
Variability
Shape
Statistics describes a categorical set of data by
Frequency, percentage or proportion of each category

11
Probability is a measure of the likelihood of
a random phenomenon or chance behavior.
Probability describes the long-term
proportion with which a certain outcome will
occur in situations with short-term
uncertainty.
EXAMPLE
Simulate flipping a coin 100 times. Plot the proportion of heads against the
number of flips. Repeat the simulation.
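A minimal sketch of this simulation (Python with numpy and matplotlib; the 100 flips come from the example, everything else is illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()

n_flips = 100
flips = rng.integers(0, 2, size=n_flips)                 # 1 = heads, 0 = tails
running_prop = np.cumsum(flips) / np.arange(1, n_flips + 1)

plt.plot(np.arange(1, n_flips + 1), running_prop)
plt.axhline(0.5, linestyle="--")                         # long-run probability of heads
plt.xlabel("Number of flips")
plt.ylabel("Proportion of heads")
plt.title("Running proportion of heads in 100 coin flips")
plt.show()
```

Re-running the script repeats the simulation; each run wanders at first and settles near 0.5, which is the long-term predictability the next slide describes.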

12
Probability deals with experiments that yield random short-term results or
outcomes, yet reveal long-term predictability.
The long-term proportion with which a certain outcome is observed is
the probability of that outcome.

13
In probability, an experiment is any
process that can be repeated in which
the results are uncertain.

A simple event is any single outcome
from a probability experiment. Each
simple event is denoted eᵢ.

14
The sample space, S, of a
probability experiment is the
collection of all possible simple
events. In other words, the
sample space is a list of all
possible outcomes of a probability
experiment.

15
An event is any collection of
outcomes from a probability
experiment. An event may consist
of one or more simple events.
Events are denoted using capital
letters such as E.

16
The Variance

In statistics, the variance of a random variable or distribution is the
expected (mean) value of the squared deviation of that variable from
its expected value or mean.
Thus the variance is a measure of the amount of variation within the
values of that variable, taking account of all possible values and their
probabilities.
If a random variable X has the expected (mean) value E[X] = μ, then
the variance of X is given by:

Var(X) = E[(X − μ)²] = σ_X²
The Variance: Properties

Variance is non-negative because the squared deviations are positive or zero.
The variance of a constant a is zero, and the variance of a variable in a data
set is 0 if and only if all entries have the same value:

Var(a) = 0

Variance is invariant with respect to changes in a location parameter. That is,
if a constant is added to all values of the variable, the variance is unchanged:

Var(X + a) = Var(X)

If all values are scaled by a constant, the variance is scaled by the square of
that constant:

Var(aX) = a²·Var(X)
Var(aX + b) = a²·Var(X)
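These properties can be checked numerically; a small sketch (Python/numpy, with arbitrary illustrative values for a, b and simulated data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=10, scale=3, size=100_000)     # illustrative sample
a, b = 4.0, 7.0

var_x = x.var()
print(np.isclose(np.full_like(x, a).var(), 0))         # Var(a) = 0
print(np.isclose((x + a).var(), var_x))                # Var(X + a) = Var(X)
print(np.isclose((a * x).var(), a**2 * var_x))         # Var(aX) = a^2 * Var(X)
print(np.isclose((a * x + b).var(), a**2 * var_x))     # Var(aX + b) = a^2 * Var(X)
```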
The Standard Deviation

In statistics, the standard deviation of a random variable or
distribution is the square root of its variance.
If a random variable X has the expected value (mean)
E[X] = μ, then the standard deviation of X is given by:

σ_X = √(σ_X²) = √(E[(X − μ)²])

That is, the standard deviation σ (sigma) is the square root
of the average value of (X − μ)².
The Covariance

The covariance between two real-valued random variables X and Y,
with means (expected values) μ_X = μ and μ_Y = ν, is

Cov(X, Y) = E[(X − μ_X)·(Y − μ_Y)] = E[(X − μ)·(Y − ν)]
          = E[X·Y − μY − νX + μν]
          = E[X·Y] − μE[Y] − νE[X] + μν
          = E[X·Y] − μν − νμ + μν
          = E[X·Y] − μν

Cov(X, Y) can be negative, zero, or positive.
Random variables whose covariance is zero are called uncorrelated
(this is weaker than independence, as noted below).
Covariance

If X and Y are independent, then their covariance is zero. This
follows because under independence,

E[X·Y] = E[X]·E[Y] = μν

Recalling the final form of the covariance derivation given above,
and substituting, we get

Cov(X, Y) = μν − μν = 0

The converse, however, is generally not true: some pairs of random
variables have covariance zero although they are not independent.
The Covariance: Properties

If X and Y are real-valued random variables and a and b are
constants ("constant" in this context means non-random), then the
following facts are a consequence of the definition of covariance:

Cov(X, a) = 0
Cov(X, X) = Var(X)
Cov(X, Y) = Cov(Y, X)
Cov(aX, bY) = ab·Cov(X, Y)
Cov(X + a, Y + b) = Cov(X, Y)
Correlation Coefficient

A disadvantage of the covariance statistic is that its magnitude
cannot be easily interpreted, since it depends on the units in
which we measure X and Y.
The related and more widely used correlation coefficient remedies
this disadvantage by standardizing the deviations from the
mean:

ρ_{X,Y} = Cov(X, Y) / √(Var(X)·Var(Y)) = σ_{X,Y} / (σ_X·σ_Y)

The correlation coefficient is symmetric, that is,

ρ_{X,Y} = ρ_{Y,X}
Correlation Coefficient

The value of the correlation coefficient falls between −1 and +1:

−1 ≤ ρ_{X,Y} ≤ 1

ρ_{X,Y} = 0  =>  X and Y are uncorrelated
ρ_{X,Y} = +1 =>  X and Y are perfectly positively correlated
ρ_{X,Y} = −1 =>  X and Y are perfectly negatively correlated
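A short numerical illustration of these definitions (Python/numpy; the simulated data are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10_000)
y = 2.0 * x + rng.normal(size=10_000)          # y positively related to x

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))   # Cov(X,Y) = E[(X - mu)(Y - nu)]
corr_xy = cov_xy / (x.std() * y.std())               # rho = Cov / (sigma_X * sigma_Y)

print(cov_xy)                       # roughly 2.0 for these simulated data
print(corr_xy)                      # always falls between -1 and +1
print(np.corrcoef(x, y)[0, 1])      # numpy's built-in estimate, for comparison
```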
Review of Sampling
Population: the group of people, communities, or organizations
studied. Includes all possible objects of study.
Sampling frame: the list of people/organizations etc. in the
population who can be chosen for participation in the study.
Most sampling frames do not include all people in the
population (example: a phone book).
Sample: part of the population, reduced down to a
manageable size. Ideally we would want to draw a sample
that is representative of the population in terms of certain key
characteristics (for example, gender and age).

25
Concepts related to Sampling Error
Sampling Error: The degree to which a sample differs on a key variable
from the population.
Confidence Level:
The number of times out of 100 that the true value will fall within the
confidence interval.
Confidence Interval:
A calculated range for the true value, based on the relative sizes of the
sample and the population.
Why is Confidence Level Important? Confidence levels, which indicate the
level of error we are willing to accept, are based on the concept of the
normal curve and probabilities. Generally, we set this level of confidence
at either 90%, 95% or 99%. At a 95% confidence level, 95 times out of 100
the true value will fall within the confidence interval.

26
The term used to describe the
difference between sample
statistics and population
parameters is sampling error.

27
Various sampling designs - Simple Random
Sampling (SRS)
Simple Random Sampling (SRS)
A simple random sample is a sample in which all
units in the sampling frame have an equal
probability of selection.
Many statistical tests have certain assumptions that
they rely on and these assumptions are often met
when a simple random sample is taken.
If the researcher wanted to collect a simple random
sample of people in Bangkok, the researcher would
need a list of all people in Bangkok.
Where would this list come from?
A telephone list is only a list of all people in Bangkok with
a telephone.
28
Various sampling designs - Stratified Sampling

Stratified Sampling
The population is separated into groups or strata,
and from within each stratum an SRS is taken.
Again, where would the list come from for each
stratum, to perform an SRS within each stratum?

29
Various sampling designs - Convenience
Sampling
Convenience Sampling
A sample collected by what is convenient
For example, collecting surveys from a shopping mall,
yielding a lot of data at a low price.
Note: statistical tests are inappropriate when
performed on a convenience sample

30
SAMPLING:
REQUIREMENTS OF A GOOD SAMPLE

SELECTION BIAS
MEASUREMENT BIAS
SAMPLING CONTROVERSY
QUESTIONNAIRE DESIGN

31
REQUIREMENTS OF A GOOD SAMPLE

Will reproduce characteristics of interest in the population
as closely as possible.

Representative: each sampled unit will represent the
characteristics of a known number of units in the population.

32
DEFINITION OF TERMS
Sampling Unit: The unit we actually sample, e.g.,
households

Sampling Frame: The list of sampling units

33
Bias and variability
There is a multitude of sources of bias:

Publication bias: positive results tend to be published while negative or
inconclusive results tend not to be published.

Selection bias: the outcome is correlated with the exposure. As an example,
treatments tend to be prescribed to those thought to benefit from them.
Can be controlled by randomization.

Exposure bias: differences in exposure, e.g. compliance with treatment, could
be associated with the outcome, e.g. patients with side effects stop taking
their treatment.

Detection bias: the outcome is observed with different intensity depending on
the exposure. Can be controlled by blinding investigators and patients.

Analysis bias: essentially the type I error, but also bias caused by model
misspecification and the choice of estimation technique.

Interpretation bias: strong preconceived views can influence how analysis
results are interpreted.
34
Methods of Data Collection
Personal Interviews
Telephone Interviews
Self-Administered Questionnaires
Direct Observation

35
Why sampling?

Get information about large populations
Less cost
Less field time
More accuracy, i.e., can do a better job of data
collection
When it is impossible to study the whole
population

36
Types of sampling

Non-probability samples

Probability samples

37
Non probability samples

Convenience samples (ease of access):
the sample is selected from elements of a population that
are easily accessible
Snowball sampling (friend of a friend, etc.)
Purposive sampling (judgemental):
you choose whom you think should be in the
study
Quota samples

38
Probability samples

Random sampling
Each subject has a known probability of being
selected
Allows application of statistical sampling
theory to results to:
Generalise
Test hypotheses

39
Conclusions

Probability samples are the best

Ensure
Representativeness
Precision

40
Methods used in probability samples

Simple random sampling


Systematic sampling
Stratified sampling
Multi-stage sampling
Cluster sampling

41
Simple Random Sampling (SRS)
Simplest sampling design
Def-1: If a sample of size n is drawn from a population
of size N in such a way that every possible sample of
size n has the same chance (probability) of being
selected, the sampling procedure is called simple
random sampling. The sample thus obtained is called a
simple random sample.

We will use simple random sampling to obtain


estimators for population means, totals, and
proportions.

42
How to draw a SR Sample
This is not as difficult as it looks
But selection is important, because poor selection leads to
Investigator bias
Poor estimation
The procedure for selecting a simple random
sample is as follows:
List all the units in the population (construct a
sampling frame if one does not exist already), say
from 1, ..., N

43
How to select the sample
In other words, give each element a unique
identification (ID), starting from 1 up to the
number of elements in the population, N.
N is the total number of units in the
population.
Using random numbers or any other random
mechanism (e.g. lottery or goldfish bowl),
select the sample of n units from the list of N
units, one at a time, without replacement.

44
How to use a random number
table?
Decide on the minimum number of digits
Start anywhere in the table and, going in any
direction, choose a number (or numbers)
The sequence of reading the numbers should be
maintained until the desired sample size is
attained
If a particular number is not included in your
range of population values, choose another
number
Keep selecting numbers until you have the
required number of elements in your sample
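In practice a computer can stand in for the random number table; a minimal sketch of drawing an SRS without replacement (Python; the frame of N = 1000 labelled units and n = 20 are illustrative):

```python
import numpy as np

rng = np.random.default_rng()

N = 1000                                   # population size (illustrative)
n = 20                                     # desired sample size
frame = np.arange(1, N + 1)                # IDs 1..N, i.e. the sampling frame

srs = rng.choice(frame, size=n, replace=False)   # SRS without replacement
print(sorted(srs))
```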
45
Advantages/Disadvantages of Simple Random
Sampling
Advantages:
Sample is easy to select in cases where the
population is small
Disadvantages:
Costs of enumeration may be high because by the
luck of the draw, the sampled units may be widely
spread across the population
By bad luck, the sample may not be representative
because it may not be evenly spread across all
sections of the population

46
Estimation of a Population Proportion
First, have a look at what a proportion means.

47
Estimation of a Population Proportion
Researchers are frequently interested in the
proportion of a population possessing a specified
characteristic,
e.g., the proportion of female voters in the 2013
election.
Such situations exhibit the characteristics of a
binomial experiment.
The population proportion is represented by p and
its estimator by p̂.
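A small sketch of estimating a proportion from an SRS (Python; the 0/1 responses are invented for illustration, and the finite population correction is ignored):

```python
import numpy as np

# 1 = unit has the characteristic (e.g. female voter), 0 = it does not
sample = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0])   # illustrative SRS
n = len(sample)

p_hat = sample.mean()                            # estimator of p
se_p = np.sqrt(p_hat * (1 - p_hat) / (n - 1))    # estimated standard error
print(p_hat, se_p)
```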
48
Sampling with probabilities
proportional to size
So far, all the cases we have discussed depended
on the sample being a simple random sample.
In real life the probabilities of selection cannot
always be the same for all sampling units.
Varying the probabilities with which different
sampling units are selected is sometimes
advantageous.

49
What is meant by sampling weights?
Real surveys are generally multi-stage
At each stage, probabilities of selecting units at
that stage are not generally equal
When population parameters like a mean or
proportion are to be estimated, results from
lower levels need to be scaled up from the
sample to the population
This scaling-up factor, applied to each unit in
the sample, is called its sampling weight.
50
Why are weights needed?
Above was a trivial example with equal
probabilities of selection
In general, units in the sample have very
differing probabilities of selection, i.e. rare to get
a self-weighting design
To allow for unequal probabilities of selection,
each unit is weighted by the reciprocal of its
probability of selection
Thus sampling weight=(1/prob of selection)
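A minimal sketch of applying sampling weights (Python; the selection probabilities and observed values are invented for illustration):

```python
import numpy as np

y = np.array([12.0, 30.0, 7.0, 22.0])           # observed values for sampled units
prob_sel = np.array([0.10, 0.02, 0.10, 0.05])   # each unit's probability of selection

weights = 1.0 / prob_sel                         # sampling weight = 1 / prob of selection
total_est = np.sum(weights * y)                  # weighted (Horvitz-Thompson type) total
mean_est = total_est / np.sum(weights)           # weighted estimate of the mean
print(weights, total_est, mean_est)
```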

51
Other uses of weight
Weights are also used to deal with non-
responses and missing values
If measurements on all units are not available for
some reason, we may re-compute the sampling
weights to allow for this.
e.g. In conducting the Household Budget Survey
2000/2001 in Tanzania, not all rural areas
planned in the sampling scheme were visited. As
a result, sampling weights had to be re-
calculated and used in the analysis.
52
Difficulties in computations
Standard methods as illustrated in textbooks on
sampling, often do not apply in real surveys
Complex sampling designs are common
Computing correct probabilities of selection can
then be very challenging
Usually professional assistance is needed to
determine the correct sampling weights and to
use them correctly in the analysis

53
Confidence Intervals
Confidence Interval: An interval of values computed from
the sample, that is almost sure to cover the true
population value.

We make confidence intervals using values computed from the sample, not
the known values from the population

Interpretation: In 95% of the samples we take, the true


population proportion (or mean) will be in the interval.

This is also the same as saying we are 95% confident that the true population
proportion (or mean) will be in the interval
How do we compute the intervals?

We know that in 95% of the samples, the true population
proportion (or mean) will fall within 2 standard errors of
the sample proportion (or mean).
Where does the 2 come from?
For a bell curve, 95% of the data will be within +/− 1.96
standard deviations of the mean.

What is the standard error?
It is not the standard deviation of the sample; it is the
standard deviation of the sample proportion (or mean).
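A sketch of the interval just described, for a sample proportion (Python; n and p̂ are illustrative numbers):

```python
import numpy as np

n = 400                       # sample size (illustrative)
p_hat = 0.46                  # sample proportion (illustrative)
z = 1.96                      # for a 95% confidence level

se = np.sqrt(p_hat * (1 - p_hat) / n)        # standard error of the sample proportion
ci = (p_hat - z * se, p_hat + z * se)        # approximate 95% confidence interval
print(ci)
```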
Sample Size Estimation
An investigator might have a number of goals
when handling sampling issues:
Deciding on the amount of sampling error and
balancing the precision of estimates against the
cost of the survey, especially in SRS.
Estimating a sample size is one of the major
goals of a survey.

56
Steps involved in sample size
estimation
Step-1
What is expected of the sample and how much
precision do I need?
What are the consequences of the sample results?
How much error is tolerable?
A preliminary investigation, however, often needs
less precision than an ongoing survey.

57
A wrong approach
Many people ask what percentage of the
population they should include in their sample.
Ideally, the focus should be on the precision of estimates.
Precision is obtained through the absolute size
of the sample, not the proportion of the
population covered (except in very small
populations).

58
Step-2
Find an equation relating the sample size n and your
expectations of the sample
Step-3
Estimate any unknown quantities and solve for n.
Step-4
If you feel that the estimated sample size is too large to
handle, go back, adjust your expectations, and then
try again.
If your sample size is still too large, then think again
before initiating your study.

59
Specify the Tolerable Error
How much precision is needed (decided by the investigator
only).
The desired precision is often expressed in absolute
terms, as

Pr(|Estimator − Parameter| ≤ e) = 1 − α

where e is called the margin of error.
Reasonable values for α and e must be decided by the
investigator.
For many surveys of people in which a proportion is
measured, e = 0.03 and α = 0.05.

60
Sometimes you would like to achieve a desired
relative precision, controlling the coefficient of
variation (CV) rather than the absolute error.
In that case, if the parameter ȳ_U ≠ 0, the precision
may be expressed as

Pr( |ȳ − ȳ_U| / |ȳ_U| ≤ r ) = 1 − α
61
Find an Equation
The simplest equation relating the precision
and sample size comes from the confidence
intervals in the previous section. To obtain
absolute precision e, find a value of n that
satisfies

e = z_{α/2} · √(1 − n/N) · S/√n

To solve this equation for n, we first find the
sample size n₀ that we would use for an SRSWR,
that is

n₀ = z_{α/2}² · S² / e²
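Putting the two formulas together (a sketch in Python; N, S and e are illustrative values, and the finite population correction n = n₀/(1 + n₀/N) follows by solving the first equation for n):

```python
import numpy as np
from scipy import stats

N = 5000          # population size (illustrative)
S = 12.0          # guessed population standard deviation (illustrative)
e = 1.0           # tolerable margin of error (illustrative)
alpha = 0.05

z = stats.norm.ppf(1 - alpha / 2)        # z_{alpha/2}
n0 = (z * S / e) ** 2                    # sample size ignoring the fpc (SRSWR)
n = n0 / (1 + n0 / N)                    # adjusted for the finite population
print(int(np.ceil(n0)), int(np.ceil(n)))
```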
62
When should a Simple Random
Sample Be Used?
Avoid SRS in such situations
Before taking an SRS, you should consider
whether a survey sample is the best method for
studying your research question.
You may not have a list of the observation units,
or it may be expensive in terms of travel time to
take an SRS.
You may have additional information that can be
used to design a more cost effective sampling
scheme.

63
SRS should be used in following
situations
Little extra information is available that can be
used when designing the survey (e.g. in the sampling
frame).
The persons using the data insist on using SRS formulas,
whether they are appropriate or not.
The primary interest is in multivariate
relationships such as regression equations that
hold for the whole population, and there are no
compelling reasons to take a stratified or cluster
sample.
64
STRATIFIED SAMPLING
1. Stratification: The elements in the
population are divided into layers/groups/
strata based on their values on one or several
auxiliary variables. The strata must be non-
overlapping and together constitute the whole
population.
2. Sampling within strata: Samples are
selected independently from each stratum.
Different selection methods can be used in
different strata.
65
Ex. Stratification of individuals by age group

Stratum Age group


1 17 or younger
2 18-24
3 25-34
4 35-44
5 45-54
6 55-64
7 65 or older

66
Ex. Regional stratification

Stratum 1: Northern Sweden
Stratum 2: Mid-Sweden
Stratum 3: Southern Sweden

67
Ex. Stratification of individuals by age group and
region
Stratum Age group Region
1 17 or younger Northern
2 17 or younger Mid
3 17 or younger Southern
4 18-24 Northern
5 18-24 Mid
6 18-24 Southern
etc. etc. etc.
68
WHY STRATIFY?
Gain in precision. If the strata are more
homogeneous with respect to the study
variable(s) than the population as a whole,
the precision of the estimates will improve.
Strata = domains of study. Precision
requirements of estimates for certain
subpopulations/domains can be assured by
using domains as strata.

69
WHY STRATIFY? (cont'd)
Practical reasons. For instance,
nonresponse rates, methods of measurement
and the quality of auxiliary information may
differ between subpopulations, and can be
efficiently handled by stratification.
Administrative reasons. The survey
organization may be divided into
geographical districts, which makes it natural to
let each district be a stratum.

70
IMPORTANT DESIGN CHOICES IN STRATIFIED
SAMPLING

Stratification variable(s)
Number of strata
Sample size in each stratum (allocation)
Sampling design in each stratum
Estimator for each stratum

71
STRATIFIED SAMPLING.

Draw a sample from each stratum

72
How to draw a Stratified Random
Sample
The first step is to clearly specify the strata.
Place each sampling unit of the population in its
appropriate stratum. Not an easy task.
After the sampling units are divided into strata, we
select an SRS from each stratum by using the techniques
discussed in the last unit (see the sketch below).
We must be certain that the samples selected from the
strata are independent; that is, different random
sampling schemes should be used within each stratum
so that the observations chosen in one stratum do not
depend upon those chosen in another.
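A small sketch of drawing such a sample (Python; the strata, frame and per-stratum sample sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng()

# illustrative frame: unit IDs grouped by stratum
frame = {
    "stratum_1": np.arange(1, 201),      # 200 units
    "stratum_2": np.arange(201, 501),    # 300 units
    "stratum_3": np.arange(501, 1001),   # 500 units
}
n_h = {"stratum_1": 10, "stratum_2": 15, "stratum_3": 25}   # allocation (illustrative)

# independent SRS within each stratum
sample = {h: rng.choice(units, size=n_h[h], replace=False)
          for h, units in frame.items()}
for h, s in sample.items():
    print(h, sorted(s))
```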

73
Important to calculate n
We must obtain approximations of the population
variances σ₁², σ₂², …, σ_L² before we can use
equation 2.8.
One method of obtaining these approximations is to use
the sample variances s₁², s₂², …, s_L² from a previous
experiment to estimate σ₁², σ₂², …, σ_L².
A second method requires knowledge of the range of
the observations within each stratum; using
Tchebysheff's theorem, the range should roughly be four to
six standard deviations.
Choosing the fractions w₁, w₂, …, w_L will be discussed later.

74
Allocation of the Sample
Objective of the sample survey design is to
provide estimators with small variances at the
lowest possible cost.
After the sample size n is chosen, there are
many ways to divide n into the individual
stratum sample size n1, n2,, nL.
Each division may result in a different variance
for the sample mean.

75
Hence our objective is to use an allocation that
gives a specified amount of information at
minimum cost.
Best allocation scheme is affected by three
factors.
The total number of elements in each stratum.
The variability of observations within each stratum.
The cost of obtaining an observation from each
stratum.

76
Cost issue??
If the cost of obtaining an observation varies
from stratum to stratum, we will take small
samples from strata with high costs.
We will do so because our objective is to keep
the cost of sampling at a minimum.

77
Neymans allocation
In some stratified sampling problems the cost
of obtaining an observation is the same for all
strata.
If costs are unknown, we may be willing to
assume that the costs per observation are
equal.
If c₁ = c₂ = … = c_L, then the cost terms cancel in
equation (2.9), which becomes:
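The equation itself did not survive extraction; assuming (2.9) is the usual cost-optimal allocation, equal costs reduce it to the standard Neyman allocation n_h = n · N_h·σ_h / Σ_k N_k·σ_k. A sketch (Python, with illustrative stratum sizes and standard deviations):

```python
import numpy as np

N_h = np.array([200, 300, 500])        # stratum sizes (illustrative)
sigma_h = np.array([4.0, 9.0, 2.0])    # guessed stratum std. deviations (illustrative)
n = 60                                 # total sample size

# Neyman allocation: n_h proportional to N_h * sigma_h (equal costs per observation)
n_alloc = n * (N_h * sigma_h) / np.sum(N_h * sigma_h)
print(np.round(n_alloc).astype(int))
```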

78
Equality of variances
In addition to encountering equal costs, we
sometimes encounter approximately equal
variances, σ₁² = σ₂² = … = σ_L².
In that case the σᵢ cancel in equation 2.11, which
becomes

[2.13]
79
Additional comments on Stratified
Sampling
Stratified random sampling does not always
produce an estimator with a smaller variance
than that of the corresponding estimator in
simple random sampling.
Next example will illustrate this point.

80
Complications in more than one
measurements
In many sample survey problems more than
one measurement is taken on each sampling
unit in order to estimate more than one
population parameter.
This situation causes complications in
selecting the appropriate sample size and
allocation.
This is illustrated in the following example

81
An optimal rule for choosing strata
If our only objective of stratification is to produce
estimators with small variance, then the best criterion by
which to define strata is the set of values that the response
can take on.
For example, suppose we wish to estimate the average
income per household in a community. We could estimate
this average quite accurately if we could put all low-income
households in one stratum and all high income households
in another before actually sampling.
Of course this allocation is often impossible because
detailed knowledge of income before sampling might make
the statistical problem unnecessary in the first place.

82
However, we sometimes have some relative
frequency data on broad categories of the
variable of interest or on some highly correlated
variable.
In these cases the cumulative square root of
frequency method works well for delineating
strata.
Rather than attempt to explain this method in
theory, we will simply show how it works in
practice.
83
Ratio, Regression and Difference
Estimation
Estimation of the population mean and total in
previous units was based on a sample of response
measurements obtained by SRS and stratified random
sampling.
Sometimes other variables are closely related to the
response y.
By measuring y and one or more subsidiary variables,
we can obtain additional information for estimating
population characteristics like the mean.
You are probably familiar with the use of subsidiary
variables to estimate the mean of a response y.
84
It is basic to the concept of correlation and provides
means for development of a prediction equation
relating y and x by the method of least squares.
You can look basic statistics books for this concept.
In the previous discussion, primary emphasis was placed
on the design of the sample survey (SRS and stratified RS).
In contrast, this unit presents three new methods of
estimation based on the use of a subsidiary variable x.
The methods are called ratio, regression, and
difference estimation.

85
All three require the measurement of two
variables, y and x, on each element of the
sample.
A variety of sampling designs can be
employed in conjunction with ratio,
regression, or difference estimation, but we
will discuss mainly simple random sampling,
and touch on stratified random sampling
as well.

86
Survey that require the use of Ratio
Estimation
Estimating a population total sometimes
requires the use of subsidiary variables.
Let us take an example to explain this
situation.

87
Understanding example
The wholesale price paid for oranges in large
shipments is based on the sugar content of the load.
The exact sugar content cannot be determined prior to
purchase and extraction of the juice from the entire
load; however, it can be estimated.
One method of estimating this quantity is to first
estimate the mean sugar content per orange, μ_y, and
then to multiply by the number of oranges N in the
load.
Thus we could randomly sample n oranges from the
load to determine the sugar content y for each.

88
The average of these sample measurements,
y₁, y₂, …, yₙ, will estimate μ_y; N·ȳ will estimate
the total sugar content for the load, τ_y.
Unfortunately, this method is not feasible
because it is too time-consuming and costly to
determine N (that is, to count the total
number of oranges in the load).

89
We can avoid the need to know N by noting
the following two facts.
First, the sugar content of an individual orange, y,
is closely related to its weight x.
Second, the ratio of the total sugar content τ_y to
the total weight of the truckload τ_x is equal to the
ratio of the mean sugar content per orange, μ_y,
to the mean weight per orange, μ_x.
Thus

τ_y / τ_x = (N·μ_y) / (N·μ_x) = μ_y / μ_x
90
Solving for the total sugar content of the load, we
have

τ_y = (μ_y / μ_x) · τ_x

We can estimate μ_y and μ_x by using ȳ and x̄,
the averages of the sugar contents and weights
for the sample of n oranges. Also, we can
measure τ_x, the total weight of the oranges on
the truck. Then a ratio estimate of the total
sugar content τ_y is

τ̂_y = (ȳ / x̄) · τ_x
91
Or, equivalently (multiplying numerator and
denominator by n),

τ̂_y = (n·ȳ / n·x̄) · τ_x = ( Σᵢ₌₁ⁿ yᵢ / Σᵢ₌₁ⁿ xᵢ ) · τ_x

In this case the number of elements in the
population, N, is unknown, and therefore we
cannot use the simple estimator N·ȳ of the
population total τ_y.
92
Thus the ratio estimator or its equivalent is
necessary to accomplish the estimation
objective.
However, if N is known, we have the choice of
using the estimator N·ȳ or the ratio estimator
to estimate τ_y. If y and x are highly correlated,
that is, x contributes information for the
prediction of y, the ratio estimator should be
better than N·ȳ, which depends solely on ȳ.
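A sketch of the ratio estimator from the orange example (Python; the sample measurements and the measured total weight τ_x are invented for illustration):

```python
import numpy as np

y = np.array([0.021, 0.030, 0.025, 0.028, 0.024])   # sugar content per sampled orange
x = np.array([0.40, 0.48, 0.43, 0.45, 0.42])          # weight of each sampled orange
tau_x = 1800.0                                        # total weight of the load (measured)

r_hat = y.sum() / x.sum()          # sample ratio sum(y_i)/sum(x_i) = ybar/xbar
tau_y_hat = r_hat * tau_x          # ratio estimate of the total sugar content
print(r_hat, tau_y_hat)
```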

93
Other parameters of interest
In addition to the population total τ_y, there are
often other parameters of interest. We may want
to estimate the population mean μ_y by using a
ratio estimation procedure.
For example, suppose we wish to estimate the
average sugar content per orange in a large
shipment. We could use the sample mean ȳ to
estimate the population mean.
However, if x and y are correlated, a ratio
estimator that uses information from the
auxiliary variable x frequently provides a more
precise estimator of μ_y.
94
When to use ratio estimation
Use of the ratio estimator is most effective when
the relationship between the response y and a
subsidiary variable x is linear through the
origin and the variance of y is proportional to
x.

95
Regression Estimation
We observed that the ratio estimator is most
appropriate when the relationship between y and
x is linear through the origin.
If there is evidence of a linear relationship
between the observed y's and x's, but not
necessarily one that would pass through the
origin, then the extra information provided by
the auxiliary variable x may be taken into account
through a regression estimator of the mean μ_y.

96
One must still have knowledge of μ_x before
the estimator can be employed, as was the case
in ratio estimation of μ_y.
The underlying line that shows the basic
relationship between the y's and x's is sometimes
referred to as the regression line of y upon x.
Thus the subscript L in the ensuing formulas is
used to denote linear regression.

97
The estimator given in the next section assumes the
x's to be fixed in advance and the y's to be
random variables.
We can think of the x values as something that
has already been observed, like last year's first-
quarter earnings, and the y response as a random
variable yet to be observed, such as the current
quarterly earnings of a company for which x is
already known.
The probabilistic properties of the estimator then
depend only on y for a given set of x's.

98
Difference Estimation
The difference method of estimating a population
mean or total is similar to the regression method
in that it adjusts the ȳ value up or down by an
amount depending on the difference (μ_x − x̄).
However, the regression coefficient b is not
computed; in effect, b is set equal to unity.
The difference method is, then, easier to employ
than the regression method and frequently works
just as well.
99
Systematic Sampling
As we have seen in previous units , both simple
and stratified random sampling require very
detailed work in the sample selection process.
Sampling units on an adequate frame must be
numbered (or otherwise identified) so that a
randomization device, such as a random number
table, can be used to select specific units for the
sample.
A sample survey design that is widely used
primarily because it simplifies the sample
selection process is called systematic sampling.

100
The basic idea of systematic sampling is as
follows:
Suppose a sample of n names is to be selected
from a long list. A simple way to make this
selection is to choose an appropriate interval and
to select names at equal intervals along the list.
Thus every tenth name might be selected, for
example.
If the starting point for this regular selection
process is random, the result is a systematic sample.
101
Definition: A sample obtained by randomly
selecting one element from the first k
elements in the frame and every kth element
thereafter is called a 1-in-k systematic
sample.
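A minimal sketch of selecting a 1-in-k systematic sample (Python; N and k are illustrative):

```python
import numpy as np

rng = np.random.default_rng()

N = 150                              # population size (illustrative)
k = 10                               # take every k-th unit
frame = np.arange(1, N + 1)

start = rng.integers(0, k)           # random start among the first k units
sys_sample = frame[start::k]         # that unit and every k-th one after it
print(sys_sample)
```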
Systematic sampling provides a useful
alternative to simple random sampling for the
following reasons:

102
1. Systematic sampling is easier to perform in
the field, and hence is less subject to selection
errors by field-workers, than either simple
random or stratified random sampling,
especially if a good frame is not available.
2. Systematic sampling can provide greater
information per unit cost than simple random
sampling can provide.

103
How to draw a Systematic Sample
Although SRS and Systematic sampling both
provide useful alternatives to one another, the
methods of selecting the sample data are
different.
A simple random sample from a population is
selected by using a table of random numbers.
In contrast, various methods are possible in
systematic sampling.
The investigator can select a 1-in-3, a 1-in-5, or in
general, a 1-in-k systematic sample.
104
For example, a medical investigator is interested in
obtaining information about the average number of
times 15,000 specialists prescribed a certain drug in
the previous year (N = 15,000). To obtain an SRS of
n = 1600 specialists, we would use the methods described
earlier; however, this procedure would require a great
deal of work.
Alternatively, we could select one name at random
from the first k = 9 names appearing on the list and then
select every ninth name thereafter until a sample of
size 1600 is selected.
This sample is called a 1-in-9 systematic sample.

105
Perhaps you wonder how k is chosen in a given
situation. If the population size N is known, we can
determine an approximate sample size n for the survey
and then choose k to achieve that sample size.
There are N=15000 specialists in the population for the
medical survey. Suppose the required sample size is
n=100.
We must then choose k to be 150 or less. For k=150,
we will obtain exactly n=100 observations, while for
k<150, the sample size will be greater than 100.

106
In general, for a systematic sample of n
elements from a population of size N, k must
be less than or equal to N/n (that is, k ≤ N/n).
Note in the preceding example that
k ≤ 15000/100; that is, k ≤ 150.

107
We cannot accurately choose k when the
population size is unknown.
We can determine an approximate sample
size n, but we must guess the value of k
needed to achieve a sample of size n. If too
large a value of k is chosen, the required
sample size n will not be obtained by using a
1-in-k systematic sample from the population.

108
This result presents no problem if the
experimenter can return to the population and
conduct another 1-in-k systematic sample until
the required sample size is obtained.
However, in some situations obtaining a second
systematic sample is impossible.
For example, conducting another 1-in-20
systematic sample of shoppers is impossible if the
required sample of n=50 shoppers is not obtained
at the time they pass the corner.
109
If ρ is close to one, then the elements within the
sample are all quite similar with respect to the
characteristic being measured, and systematic
sampling will yield a higher variance of the sample
mean than will simple random sampling.
If ρ is negative, then systematic sampling may be better
than simple random sampling. The correlation may be
negative if elements within the systematic sample tend
to be extremely different.
Note that ρ cannot be so negative that the
variance expression becomes negative.

110
For ρ close to zero and N fairly large, systematic
sampling is roughly equivalent to simple random
sampling.
An unbiased estimate of V(ȳ_sy) cannot be
obtained by using the data from one systematic
sample. This statement does not imply that we
can never obtain an estimate of V(ȳ_sy).
When systematic sampling is equivalent to simple
random sampling, we can take V̂(ȳ_sy) to be
approximately equal to the estimated variance of
ȳ based on simple random sampling.

111
For which populations does this relationship
occur?
To answer this question, we must consider the
following three types of populations:
Random population
Ordered population
Periodic population

112
CLUSTER SAMPLING

113
Introduction
Just recall that the objective of a sample design is
to obtain a specified amount of information
about a population parameter at minimum cost.
Stratified random sampling is often better suited
for this than SRS for the three reasons indicated
in section 3, that is:
Smaller bound on the error of estimation
Reduced cost per observation
Population parameters for subgroups of the population

114
Definition: A cluster sample is a simple
random sample in which each sampling unit is
a collection, or cluster, of elements.
Cluster sampling is less costly than simple or
stratified random sampling if the cost of
obtaining a frame that lists all population
elements is very high, or if the cost of
obtaining observations increases as the
distance separating the elements increases.

115
Two Stage Cluster Sampling
Two-stage cluster sampling is an extension of the concept of cluster
sampling.
You will recall from the discussion of cluster sampling in the previous
unit that a cluster is usually a convenient or natural collection of
elements, such as blocks of households or cartons of flashbulbs.
A cluster often contains too many elements to obtain a
measurement on each, or it contains elements so nearly alike that
measurement of only a few elements provides information on the
entire cluster.
When either situation occurs, the experimenter can select a simple
random sample of clusters and then take a simple random sample
of elements within each cluster.
The result is a two-stage cluster sample.
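A small sketch of the two stages (Python; the cluster structure and sample sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng()

# illustrative population: 12 clusters (e.g. blocks), each a list of household IDs
clusters = {c: [f"block{c}_hh{i}" for i in range(1, 21)] for c in range(1, 13)}

n_clusters = 4        # first stage: SRS of clusters
m_per_cluster = 5     # second stage: SRS of elements within each sampled cluster

sampled_clusters = rng.choice(list(clusters.keys()), size=n_clusters, replace=False)
two_stage_sample = {
    int(c): list(rng.choice(clusters[int(c)], size=m_per_cluster, replace=False))
    for c in sampled_clusters
}
print(two_stage_sample)
```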

116
Sampling with unequal probabilities
Up to now, we have only discussed sampling
schemes in which the probabilities of choosing
sampling units are equal.
Equal probabilities give schemes that are often
easy to design and explain. Such schemes are not,
however, always possible or, if practicable, as
efficient as schemes using unequal probabilities.
A cluster sample drawn with equal probabilities may
result in a large variance for the design-unbiased
estimator of the population mean and total.
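One common unequal-probability scheme is selecting clusters with probability proportional to size (PPS) with replacement; a sketch (Python, with illustrative cluster sizes):

```python
import numpy as np

rng = np.random.default_rng()

cluster_sizes = np.array([50, 200, 10, 120, 80, 40])   # illustrative cluster sizes M_i
p = cluster_sizes / cluster_sizes.sum()                 # selection prob. proportional to size

n_draws = 3
chosen = rng.choice(len(cluster_sizes), size=n_draws, replace=True, p=p)
print(chosen, p[chosen])
```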
117
Primary Sampling Units(PSUS)
In sample surveys, a primary sampling unit (commonly abbreviated as
PSU) arises in samples in which population elements are grouped
into aggregates and the aggregates become units in sample selection.
The aggregates are, due to their intended usage, called "sampling
units." Primary sampling unit refers to sampling units that are
selected in the first (primary) stage of a multi-stage sample
ultimately aimed at selecting individual elements. In selecting a
sample, one may choose elements directly; in such a design, the
elements are the only sampling units. One may also choose to group
the elements into aggregates and choose the aggregates in a first
stage of selection and then elements at a later stage of selection. The
aggregates and the elements are both sampling units in such a
design. For example, if a survey is selecting households as elements,
then counties may serve as the primary sampling unit, with blocks
and households ...

118
Thank You
Very Much
119
