
Tutorial: Life Tables in Stata



Life tables list the death rates experienced by a population over a given period of time. They
have many practical uses. For example, insurance companies use them to determine premiums
and annuities; the government uses them to plan for social security.
Life tables are easy to compute in Stata through the use of the ltable command. To begin,
download the lifetable.dta data set from the course website and open it in Stata. This data
set was generated from one of the first life tables recorded, dating back to the late 17th
century.


1) What is the mean life span? What is the median?

Command: summ age, detail
The output reports the mean directly; the median is the 50% value in the percentiles column.







2) What does the histogram of age at death look like? Is it symmetric?

Graphics > Histogram > Select age as variable
Command: histogram age
Roughly symmetric, apart from an initial peak in deaths around ages 0-5














3) Use the ltable command to generate a life table.

a. What is the chance of surviving from birth until age 80?
Command: ltable age
But with this command all the intervals have length 1, which isn't very helpful.
So we use the intervals() option to get intervals of length 5.
Command: ltable age, intervals(0(5)85) (start at 0, end at 85, in steps of 5; see the answer below)
b. What is the proportion of individuals alive on their 50th birthday who die before
their 55th birthday?

People alive at age 50 = value in 50-55 row = 346 people. People who died at age 50-55 = 54.
Therefore, proportion = 54/346 = 0.156
c. What is the chance that a 25-year-old will survive 10 years?
How many people were alive at age 25 ? = value in 25-30 row = 567
How many people were alive at age 35 ? = value in 35-40 row = 490
Proportion = 490/567 = 0.864
d. What is the chance that a 10-year old will survive to age 60?

For (3a) above, find the number of people alive at age 80 (the 80-85 row). In this case, the value was 41.
Hence, our value for survival until age 80 = 41/1000 = 0.041. Alternatively, the value in the survival column
in the 75-80 row is 0.041 (answer for 3a).
(3d) Alive at age 10 (10-15 row) = 661; alive at age 60 (60-65 row) = 242. Proportion = 242/661 = 0.366 (36.6%)
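
The arithmetic in parts (a) to (d) can be checked with display. This is only a sketch: it assumes the counts alive at the start of each interval (1000, 661, 567, 490, 346, 242, 41) and the 54 deaths in the 50-55 interval were read off the ltable output exactly as quoted in the answers above.

```stata
* Counts are copied from the worked answers, not recomputed from lifetable.dta
* (a) survive from birth to age 80
display 41/1000
* (b) die between 50 and 55, given alive at 50
display 54/346
* (c) alive at 35, given alive at 25
display 490/567
* (d) alive at 60, given alive at 10
display 242/661
```

Each conditional survival probability is just the count alive at the later age divided by the count alive at the earlier age.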


Example: Probability of hypertension at baseline

(1) In the Framingham dataset, of the 4,434 participants, 3,004 did not have hypertension at
baseline and 1,430 did have hypertension at baseline. Using this information, what is the
probability that a randomly selected participant in the Framingham study had hypertension
at baseline?








(2) What is the probability that this participant did not have hypertension at baseline?









(3) Are these events mutually exclusive, exhaustive, neither, or both?









(4) What is the probability that three randomly selected participants all do not have
hypertension at baseline?














(1) P(H) = # with hyp / # total = 1430/4434 = 0.32
(2) Hc = H complement (= individual did not have hyp)
P(Hc) = 1 - P(H)
= 1 - 0.32 = 0.68
(3) Both: mutually exclusive, since complementary events cannot
happen at the same time, and exhaustive, since P(H) + P(Hc) = 1
(4) P(Hc) = P(one individual does not have hyp) = 0.68
P(all 3 without hyp) = P(first without hyp).P(second without hyp).P(third without hyp)
= P(Hc)^3 = 0.68^3 = 0.31


(5) Suppose we again randomly select two participants from this population. What is the
probability that both participants have hypertension at baseline, given that at least one of the
participants has hypertension?

















(5) A = both have hyp
B = at least 1 has hyp
We need to find P(A|B)
P(A|B) = P(A and B) / P(B) = P(A)/P(B), since A implies B
= prob that both have hyp / prob that at least 1 has hyp
P(both have hyp) = P(H).P(H)
P(at least 1 has hyp) = P(H)+P(H) - P(H).P(H)
P(H) = 0.32
Therefore,
P(H).P(H) / {P(H)+P(H) - P(H).P(H)} = 0.1024/0.5376 = 0.19
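
As a quick check, the same conditional probability can be evaluated in Stata using the unrounded P(H) = 1430/4434. This is a sketch, not part of the original solution; writing p for P(H), the expression p^2/(2p - p^2) simplifies to p/(2 - p).

```stata
* P(both | at least one) = p^2 / (2p - p^2), with p = 1430/4434
display (1430/4434)^2 / (2*(1430/4434) - (1430/4434)^2)
* algebraically the same quantity: p / (2 - p)
display (1430/4434) / (2 - 1430/4434)
```

Both lines should print the same value, roughly 0.19.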


Example: Relationship between hypertension and CHD using probability laws


We examine the relationship between hypertension and CHD at baseline in the
Framingham study population, using the concepts of probability learned this week.

a) What is the probability that a Framingham participant has hypertension or CHD
at baseline?







b) Are these two events independent? Would you expect these events to be
independent?










c) What is the probability that a participant has CHD at baseline? What is the
probability that a participant has CHD at baseline, given that he/she has
hypertension?

Find the counts with: tab prevhyp1 prevchd1 (output below)
H = hyp
C = CHD
a) P(H or C) = P(H) + P(C) - P(H and C)
= 1430/4434 + 194/4434 - 113/4434
= 0.34
b) Intuitively, we would not expect them to be independent.
For independence,
P(H and C) = P(H).P(C)
=> 113/4434 = 0.025 should equal P(H).P(C) = 1430/4434 X 194/4434
= 0.014
They are not equal, and hence the events are not independent.
c) P(C) = 194/4434 = 0.044
We want to find P(C|H)
= P(C and H) / P(H)
= (113/4434) / (1430/4434)
= 0.08 ==> which is more than P(C) alone, and hence
means that, given that you have hyp, you are more likely to have
CHD than when nothing is known about
hypertension
. tab prevhyp1 prevchd1

 Prevalent |  Prevalent coronary
hypertensi | heart disease, exam 1
on, exam 1 |        No        Yes |     Total
-----------+----------------------+----------
        No |     2,923         81 |     3,004
       Yes |     1,317        113 |     1,430
-----------+----------------------+----------
     Total |     4,240        194 |     4,434
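
The three answers can be reproduced directly from the cell counts in the table. The lines below are a sketch that types the counts in by hand rather than pulling them from the stored results of tab.

```stata
* (a) P(H or C) = P(H) + P(C) - P(H and C)
display (1430 + 194 - 113)/4434
* (b) independence check: P(H and C) versus P(H)*P(C)
display 113/4434
display (1430/4434)*(194/4434)
* (c) P(C) and P(C|H)
display 194/4434
display 113/1430
```

If the two numbers in (b) were equal, the events would be independent; here the joint probability is nearly twice the product.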


Tutorial: ROC Curves in Stata

ROC curves illustrate the inherent trade-off between sensitivity and specificity. We examine
ROC curves in the context of risk prediction.
Consider the following scenario: you are responsible for telling a patient that they are at high or
low risk for CHD, given some baseline prognostic factors. Using the Framingham dataset, you can
predict the probability that an individual gets CHD, given their baseline prognostic factors.
Constructing an ROC curve to evaluate a risk prediction model:
1) Using systolic blood pressure, number of cigarettes smoked per day, total cholesterol,
sex, and BMI at baseline, predict the probability that each individual in the Framingham
dataset had CHD. Call this probability p.

2) As in the diagnostic testing setting, select a cut-off probability c to distinguish high and
low risk patients. If p < c, the patient is low risk. If p ≥ c, the patient is high risk.

3) Classify all patients in the dataset as high risk or low risk using the cutoff c.

4) Calculate P(high risk | CHD) = sensitivity. Calculate P(high risk | no CHD) = 1 -
specificity.
Steps 1-4 are beyond the scope of this module. These values are provided for you in the
dataset [file name illegible]. Open the dataset in Stata.
5) For the various values of c, plot the false positive rate versus sensitivity. Connect the
points to generate your ROC curve.

Consider the following questions:
How do the sensitivity and specificity change as the cutoff increases from 0 to 1?








What value of c would you choose in distinguishing high risk versus low risk patients?
Why?





ROC = graph of false positive rate vs sensitivity
Graphics > Twoway graph
Looking at the graph, you want to balance sensitivity and specificity: keep the
false positive rate low with high sensitivity. It is better to tell a patient that he/she is
high risk and be wrong than to tell the patient that he/she is low risk
and be wrong.


Table: Points on ROC curve for risk prediction model.

Cut-off (c)   Sensitivity   Specificity   False Positive (1-Specificity)
1             0             1.0000        0.0000
0.7901        0.0051        1.0000        0.0000
0.7152        0.0071        0.9988        0.0012
0.6592        0.0111        0.9966        0.0034
0.6055        0.0547        0.9931        0.0069
0.5695        0.0993        0.9875        0.0125
0.5055        0.1682        0.9654        0.0346
0.4595        0.2381        0.9480        0.0520
0.4049        0.3506        0.9221        0.0779
0.3555        0.4205        0.8794        0.1206
0.3029        0.5228        0.7997        0.2003
0.2545        0.6383        0.6963        0.3037
0.2031        0.7285        0.5723        0.4277
0.1559        0.8379        0.4184        0.5816
0.1044        0.9119        0.2408        0.7592
0.0571        0.9899        0.0551        0.9449
0             1.0000        0             1



Plot: ROC curve for risk prediction model
[Figure: ROC Curve. x-axis: False positive rate (0 to 1); y-axis: Specificity (0 to 1), as labeled in the original plot.]


Tutorial: More on ROC curves and complicated graphs in Stata

We construct a new, simpler risk prediction model using only systolic blood pressure, diastolic
blood pressure, and age as our prognostic factors. We compare this risk prediction model to
the model in the previous tutorial.
Open the dataset [file name illegible] on the course website.
We construct a plot that includes:
- the ROC curve for the first model from the previous tutorial (with many prognostic
  factors), called Model 1,
- the second model, with only sex and blood pressure, Model 2, and
- a reference line representing arbitrary classification as high or low risk.
Overlaying lines in Stata is relatively easy within the Twoway graph window. Using the ROC
plot, consider the following questions:
Model 1 outperforms model 2. How can you tell this from the ROC curve?





Which model would you recommend?







Later in the course, we learn how to fit the model to obtain the predicted risks. With new
biomarkers and genetic risk factors popping up all the time, risk prediction is a hot topic in
statistics right now and ROC curves are used frequently!
Plot fpr vs sensitivity for model 1, overlaid with model 2.
Here you have to create 2 plots in the Graphics > Twoway dialogue box.


Table: Points on ROC curve for model 2.

Cut-off (c)   Sensitivity   Specificity   False Positive (1-Specificity)
1             0.0000        1.0000        0.0000
0.7901        0.0273        0.9997        0.0003
0.7152        0.0298        0.9991        0.0009
0.6592        0.0324        0.9975        0.0025
0.6055        0.0434        0.9942        0.0058
0.5695        0.0792        0.9804        0.0196
0.5055        0.1014        0.9699        0.0301
0.4595        0.2002        0.9460        0.0540
0.4049        0.2666        0.9064        0.0936
0.3555        0.4370        0.8224        0.1776
0.3029        0.5860        0.7006        0.2994
0.2545        0.6959        0.6046        0.3954
0.2031        0.7641        0.4736        0.5264
0.1559        0.8739        0.3371        0.6629
0.1044        0.9838        0.0865        0.9135
0.0571        1.0000        0.0000        1.0000
0             1.0000        0.0000        1.0000
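
One way to draw the overlaid plot from the command line instead of the dialogue box is sketched below. It types in a handful of rows from the two ROC tables rather than opening the course dataset, so the variable names (sens1, fpr1, sens2, fpr2) are made up for this illustration and will differ from those in the real file.

```stata
* Note: clear drops whatever data are currently in memory
clear
input cutoff sens1 fpr1 sens2 fpr2
1      0      0      0      0
0.5055 0.1682 0.0346 0.1014 0.0301
0.3029 0.5228 0.2003 0.5860 0.2994
0.1044 0.9119 0.7592 0.9838 0.9135
0      1      1      1      1
end

* Two ROC curves plus the 45-degree reference line
twoway (line sens1 fpr1, sort) (line sens2 fpr2, sort) ///
       (function y = x, range(0 1) lpattern(dash)), ///
       xtitle("False positive rate") ytitle("Sensitivity") ///
       legend(order(1 "Model 1" 2 "Model 2" 3 "Reference line")) ///
       title("ROC Curve")
```

The sort option orders each curve by its x variable before the points are connected, which matters because the two models have different false positive rates at the same cut-off.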

Plot: ROC curve for Models 1 and 2, with reference line
[Figure: ROC Curve. x-axis: False positive rate (0 to 1); y-axis: Sensitivity (0 to 1); legend: Model 1, Model 2.]

Example: Sensitivity, Specificity, PPV, NPV, and Bayes Theorem

The World Health Organization conducts surveys in countries to declare neonatal tetanus (NT)
elimination [1]. To diagnose NT deaths in rural locations, women are interviewed using the oral
autopsy method.

Notation:
D+: woman had a live infant who died of neonatal tetanus
D-: woman had a live infant who did not die of NT
T+: the oral autopsy concluded that an NT death occurred
T-: the oral autopsy concluded that an NT death did not occur

Using data from Kenya [2], the sensitivity of the oral autopsy method is 90%; the specificity was
found to be 79%. Suppose 0.1% of the women surveyed had an infant die of neonatal tetanus.



a) What is the probability that the oral autopsy method declares a neonatal tetanus death
when the woman had an infant die of neonatal tetanus?













b) What is the probability that the oral autopsy method does not declare a neonatal tetanus
death when the woman did not have an infant die of neonatal tetanus? What is this value
called?














For more information, see
[1] http://www.who.int/immunization_monitoring/diseases/MNTE_initiative/en/index.html
[2] Snow R, Armstrong JRM, Forster D, et al. Childhood deaths in Africa: uses and limitations of verbal autopsies.
Lancet 1992;340:351-355.
a) This is just the sensitivity:
P(T+|D+) = 0.90
b) This is just the specificity:
P(T-|D-) = 0.79


c) What is the probability that a woman had an infant die of neonatal tetanus, given that the
oral autopsy method declared a neonatal tetanus death? What is this value called?




















d) What is the probability that a woman did not have an infant die of neonatal tetanus when
the oral autopsy method does not declare a neonatal tetanus death? What is this value
called?


















e) What are the implications of parts (c) and (d) for the neonatal tetanus survey?


c) P(D+|T+) = Positive Predictive Value (PPV)
= P(T+|D+).P(D+) / {P(T+|D+).P(D+) + P(T+|D-).P(D-)}
= 0.9 X 0.001 / {0.9 X 0.001 + (1-0.79).(1-0.001)}     Note: P(D+) = prevalence = 0.001
= 0.004
d) P(D-|T-) = Negative Predictive Value (NPV)
= P(T-|D-).P(D-) / {P(T-|D-).P(D-) + P(T-|D+).P(D+)}
= 0.79.(1-0.001) / {0.79.(1-0.001) + (1-0.90).(0.001)}
= 0.9999
e) PPV = 0.004
NPV = 0.9999
Therefore, very low PPV, very high NPV.
Even though sens and spec are reasonably high, this disease is so rare in this
population that without near-perfect specificity we will have a very low PPV. So the vast majority
of deaths that the oral autopsy attributes to neonatal tetanus were not actually caused by this disease.
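
The Bayes' theorem calculations in (c) and (d) are easy to reproduce with display. This is a minimal sketch using the values given in the example (sensitivity 0.90, specificity 0.79, prevalence 0.001).

```stata
* PPV = P(D+|T+)
display 0.90*0.001 / (0.90*0.001 + (1 - 0.79)*(1 - 0.001))
* NPV = P(D-|T-)
display 0.79*(1 - 0.001) / (0.79*(1 - 0.001) + (1 - 0.90)*0.001)
```

Replacing 0.001 with a larger prevalence in both lines shows how strongly the PPV depends on how common the disease is.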
Tutorial: Binomial distribution in Stata
Using Stata to calculate binomial probabilities
Suppose X is a random variable that follows a binomial distribution; thus X represents the
number of successes out of n trials with success probability p.
binomialp(n,k,p) returns the probability of observing floor(k) successes
in floor(n) trials when the probability of a success on one trial is p.
binomial(n,k,p) returns the probability of observing floor(k) or fewer successes
in floor(n) trials when the probability of a success on one trial is p.
binomialtail(n,k,p) returns the probability of observing floor(k) or more successes
in floor(n) trials when the probability of a success on one trial is p.
Example: Uzbeki Flour Fortification Program
In 2003, a flour fortification program was implemented in Uzbekistan to attempt to lower the
rates of anemia among women of reproductive age. Before the program was implemented, the
prevalence of anemia was 60%. In 2007, four years after implementing the fortification program,
suppose 100 women of reproductive age were randomly selected to provide blood samples to test
for anemia. Let X be the random variable denoting how many of the 100 women were anemic.
Suppose that the prevalence of anemia in Uzbekistan did not change between 2003 and 2007.
1. Would the binomial distribution provide an appropriate model?
B: binary outcome
I: independent, because women were randomly selected
N: sample size is fixed
S: same p
2. What is the expected number of women with anemia?
$E(X) = np = 100 \times 0.6 = 60$
3. In a random sample of women in Uzbekistan, what is the typical departure of the number of
women with anemia from this mean number?
$\mathrm{sd}(X) = \sqrt{\mathrm{var}(X)} = \sqrt{np(1-p)} = \sqrt{100 \times 0.6 \times 0.4} = \sqrt{24} = 4.9$
IMPORTANT: Conditions for a binomial distribution (question 1)
Question 2: 100 * .6 = 60
IMPORTANT: Standard deviation for a binomial distribution (question 3)
4. What is the probability that exactly 60 women are anemic? (use the formula)
$P(X = k) = \binom{n}{k} p^{k} (1-p)^{n-k} = \binom{100}{60}\, 0.6^{60}\, 0.4^{40} = 0.081$
. di comb(100, 60)*0.6^60*0.4^40
.08121914
5. What is the probability that exactly 50 women are anemic?
. di binomialp(100, 50, 0.6)
.01033751
6. What is the probability that at least 50 women are anemic?
. di binomialtail(100, 50, 0.6)
.98323831
Alternatively, we could use the binomial function to calculate this probability, since
$P(X \ge 50) = 1 - P(X \le 49)$.
. di binomial(100, 49, 0.6)
.01676169
. di 1 - binomial(100, 49, 0.6)
.98323831
7. Now, assume that the prevalence of anemia actually dropped after implementation of the
program, and the prevalence of anemia was 40% in 2007. Now, what is the probability that
at least 50 women are anemic?
. di binomialtail(100, 50, 0.4)
.0270992
Note that under the assumption of no change in prevalence between 2003 and 2007, the
probability that at least fifty women had anemia was very high. If the prevalence of anemia
dropped to 40%, the probability that at least 50 women were anemic was then very low. So, if we
collected data on 100 women and observed fewer than 50 cases of anemia, this would suggest
that anemia prevalence dropped over time!
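
The relationships between the three binomial functions can be checked side by side. The sketch below only reuses numbers from the questions above.

```stata
* exactly 60: the formula versus the built-in function
display comb(100, 60)*0.6^60*0.4^40
display binomialp(100, 60, 0.6)
* at least 50: upper tail versus complement of the lower tail
display binomialtail(100, 50, 0.6)
display 1 - binomial(100, 49, 0.6)
* at least 50 under the alternative prevalence of 40%
display binomialtail(100, 50, 0.4)
```

Note that binomialtail() includes its cut-off value (k or more), while binomial() counts k or fewer, which is why 49 rather than 50 appears in the complement.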
Tutorial: Poisson distribution in Stata
Using Stata to calculate Poisson probabilities
Suppose X is a random variable that follows a Poisson distribution; X is a count of breast
cancer cases.
When X ~ Poisson(m),
poissonp(m,k) returns the probability of observing floor(k) successes
poisson(m,k) returns the probability of observing floor(k) or fewer successes
poissontail(m,k) returns the probability of observing floor(k) or more successes
Example: Ecological Cancer Study
In the United States, the National Cancer Institute (NCI) tracks cancer incidence through the
Surveillance Epidemiology and End Results (SEER) database. At various SEER sites, incident
cases of cancer, cancer type, and location are tracked. Using data from SEER, epidemiologists
can monitor patterns in disease risk and find factors, such as socioeconomic status, that are
correlated with disease.
For instance, Los Angeles County is divided into 2,056 census tracts in the 2000 census.
Using the SEER database, we can estimate the number of expected breast cancer cases in each
census tract, based on breast cancer incidence rates in California and the age distribution within
each tract (see standardization lectures). Then, we can compare the number of observed cases in
each census tract to the expected, to determine if census tracts have more cases of cancer than
expected. We can then try to correlate excess breast cancer cases with other area-level variables,
in an ecological study.
Below, we have data on breast cancer incidence for the African-American female population in
a census tract in LA County. We choose to model the observed number of breast cancer cases in
a census tract using the Poisson distribution, with mean equal to the expected number of breast
cancer cases in the census tract.
Age group   Observed   Population   Cancer rate (per 1,000 p-y)   Expected
15-24       0          188          0.008                         0.001
25-34       0          163          0.200                         0.033
35-44       0          216          0.875                         0.189
45-54       0          157          1.868                         0.293
55-64       0          137          2.633                         0.361
65-74       0          151          3.165                         0.478
75-84       0          121          3.452                         0.418
84+         0          57           3.313                         0.189
Total       0          1,190        1.648                         1.962
Table 1: Census tract 1
1. What is the expected number of women with breast cancer in census tract 1?
1.962
2. What is the typical departure of the number of women with breast cancer from this mean
number?
$\mathrm{sd}(X) = \sqrt{\mathrm{var}(X)} = \sqrt{1.962} = 1.400714$
3. Does the Poisson distribution provide an appropriate model?
Count data, so the Poisson distribution seems reasonable. It is difficult to assess
model fit any further without data on many census tracts.
4. What is the probability that exactly 0 women develop breast cancer in census tract 1? (use
the formula)
$P(X = 0) = \dfrac{e^{-1.962}\, 1.962^{0}}{0!} = e^{-1.962} = 0.1406$
Consider another census tract, with a similar total African-American female population
to the previous, but with 5 observed breast cancer cases.
Age group   Observed   Population   Cancer rate (per 1,000 p-y)   Expected
15-24       0          187          0.008                         0.001
25-34       0          187          0.200                         0.037
35-44       1          218          0.875                         0.191
45-54       0          193          1.868                         0.361
55-64       1          175          2.633                         0.461
65-74       1          141          3.165                         0.446
75-84       2          66           3.452                         0.228
84+         0          17           3.313                         0.056
Total       5          1,184        1.504                         1.781
Table 2: Census tract 2
5. What is the probability that exactly 5 women have breast cancer in census tract 2?
. di poissonp(1.781, 5)
.02515706
6. What is the probability that at least 5 women have breast cancer in census tract 2?
. di poissontail(1.781, 5)
.03504886
Alternatively, we could use the poisson function to calculate this probability, since
$P(X \ge 5) = 1 - P(X \le 4)$.
. di 1 - poisson(1.781, 4)
.03504886
Takeaway: Census tracts 1 and 2 have similar population sizes and consequently similar
expected breast cancer case counts. However, in census tract 1, we observe no cases; in census
tract 2, we observe 5 cases. Using the Poisson distribution, we can calculate the probability of
observing case counts as extreme as 0 or 5 in these tracts.
Remember that there are about 2,000 total tracts, so we expect to see some extreme
observations. We could also incorporate ecological covariates into our analysis, such as median
household income or land-use data, to try to explain some of the differences between observed
and expected breast cancer rates.
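
The Poisson results for both tracts can be cross-checked in a few lines. This sketch uses only the expected counts from the two tables (1.962 and 1.781).

```stata
* P(X = 0) in tract 1: the formula versus the built-in function
display exp(-1.962)
display poissonp(1.962, 0)
* P(X >= 5) in tract 2, computed two equivalent ways
display poissontail(1.781, 5)
display 1 - poisson(1.781, 4)
```

As with the binomial functions, poissontail() is inclusive of its cut-off, so the complement uses 4 rather than 5.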
Tutorial: Normal distribution in Stata
Using Stata to calculate Normal probabilities
Suppose Z is a standard normal random variable. When Z ~ Normal(0, 1),
normal(z) returns the cumulative standard normal distribution
normalden(z) returns the standard normal density
Example: Ozone Designation Following the Clean Air Act Amendments of 1997
From 2001-2003, the Environmental Protection Agency (EPA) monitored ozone levels at
monitors across the United States. One criterion for ozone was that the ozone levels (defined as
the average fourth highest daily maximum ozone over the three year period) could not exceed
80ppb. Regulatory actions were taken if the ozone levels exceeded this threshold.
Among monitors in the Southeast, the average ozone level was 45.2 ppb, with standard
deviation 6.3 ppb. Ozone levels are usually modeled using the normal distribution. We assume
that this distribution is reasonable in our application.
Define X as the ozone level at a monitor. $X \sim N(45.2, 39.7)$, or, equivalently, $X \sim N(45.2, 6.3^2)$.
1. What is the expected ozone level at a randomly sampled monitor?
45.2 ppb
2. What is the typical departure of ozone levels from this mean?
6.3 ppb
3. Why do you think Stata named the normal density function normalden, rather than normalp,
which would seemingly be more consistent with the binomial and Poisson commands?
The normal distribution is continuous, and therefore normalden does not return a
probability, but rather the value of a density function.
4. Why do you think Stata only calculates probabilities with respect to the standard normal,
or N(0,1), distribution?
I don't know the answer to this. It seems pretty inconvenient, although any normal probability
can be obtained by standardizing first, as in question 5.
5. What is the probability that a randomly selected monitor has ozone levels exceeding 80
ppb?
First, standardize:
$P\left(\dfrac{X - 45.2}{6.3} > \dfrac{80 - 45.2}{6.3}\right) = P(Z > 5.524)$
. di 1 - normal(5.524)
1.657e-08
IMPORTANT: normal(z) gives P(Z < z) (less than)
6. Provide an interpretation of the following command:
. di normalden(0)
.39894228
0.399 is the value of the normal density function at 0. It has no interpretation in terms of
probability.
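
The standardization can also be done inside the call to normal(), which avoids rounding the z-score by hand. The lines below are a sketch using the example's mean (45.2) and standard deviation (6.3); _pi is Stata's built-in constant.

```stata
* P(X > 80) for X ~ N(45.2, 6.3^2), standardizing inside the expression
display 1 - normal((80 - 45.2)/6.3)
* the standard normal density at 0 equals 1/sqrt(2*pi)
display normalden(0)
display 1/sqrt(2*_pi)
```

The first line should agree with the hand-standardized answer up to the rounding of 5.524.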

Example Problem: HIV prevalence in South Africa

According to UNAIDS*, HIV prevalence in South Africa was 17.8% among adults 15 to 49 years
old in 2009. Assume this prevalence estimate is accurate today, and we randomly sample 500
individuals in South Africa. Suppose X is the number of HIV positive individuals in the sample.
Model X using the binomial distribution.

1. How many individuals do we expect to be HIV positive in the sample?
E(X) = np = 500*0.178 = 89
2. What is the standard deviation of the number of HIV positive individuals in the sample?
sd(X) = sqrt(np(1-p)) = 8.553245
3. What is the probability of observing more than 100 HIV positive individuals?
. di 1 - binomial(500, 100, 0.178)
.09089616
. di binomialtail(500, 101, 0.178)
.09089616
4. What is the probability of observing between 85 and 95 HIV positive individuals?
. di binomial(500, 95, 0.178) - binomial(500, 84, 0.178)
.47533949
. di binomialtail(500, 85, 0.178) - binomialtail(500, 96, 0.178)
.47533949


IMPORTANT (binomial question 4 above): binomialtail() values are >= (i.e., inclusive);
hence we used 96 instead of 95 there.

Now, model X using the normal distribution instead.
1. What is E(X)?
E(X) = np = 500*0.178 = 89
2. What is sd(X)?
sd(X) = sqrt(np(1-p)) = 8.553245
3. What is the probability of observing more than 100 HIV positive individuals?
P(X > 100) = P(Z > (100 - 89)/8.55) = P(Z > 1.286)
. di 1-normal(1.286)
.09922153
4. What is the probability of observing between 85 and 95 HIV positive individuals?
P(85 < X < 95)
= P(X < 95) - P(X < 85)
= P(Z < (95 - 89)/8.55) - P(Z < (85 - 89)/8.55)
= P(Z < .702) - P(Z < -.468)
. di normal(0.702) - normal(-0.468)
.43876812
5. Do the normal and binomial models give similar results?
What is the probability of observing more than 100 HIV positive individuals?
Binomial: .09089616
Normal: .09922153
What is the probability of observing between 85 and 95 HIV positive individuals?
Binomial: .47533949
Normal: .43876812
Yes, they give similar results. The approximation is better "in the tails", i.e., for calculating the
probability of observing more than 100 HIV+ individuals, than in the center of the distribution
(between 85 and 95 HIV+).
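
The comparison can be rerun without rounding the z-scores by hand. The sketch below computes the mean (89) and the standard deviation inside each expression; the results should match the hand calculation up to rounding.

```stata
* standard deviation under the binomial model
display sqrt(500*0.178*(1 - 0.178))
* P(X > 100): exact binomial, then normal approximation
display 1 - binomial(500, 100, 0.178)
display 1 - normal((100 - 89)/sqrt(500*0.178*(1 - 0.178)))
* P(85 <= X <= 95): exact binomial, then normal approximation
display binomial(500, 95, 0.178) - binomial(500, 84, 0.178)
display normal((95 - 89)/sqrt(500*0.178*(1 - 0.178))) - normal((85 - 89)/sqrt(500*0.178*(1 - 0.178)))
```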









*http://www.unaids.org/en/regionscountries/countries/southafrica/
Tutorial: Central Limit Theorem in Stata
We examine BMI at baseline using the Framingham cohort as our reference population.
Specifically, we can think of the Framingham population as the population of interest and
consider sampling from this population to examine how statistics behave in samples from a
population where we know about everyone.
1. Calculate the mean and standard deviation of BMI in the Framingham dataset at baseline.
. summarize bmi1
$\mu = 25.8$ and $\sigma = 4.1$.
2. Take a sample of size 20 from the Framingham dataset. Calculate a sample mean BMI
at baseline, $\bar{x}_1$. Then take a second sample from the same population and calculate the
sample mean, $\bar{x}_2$. Would you expect $\bar{x}_1$ and $\bar{x}_2$ to be exactly the same? Why or why not?
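* load the data and keep only non-missing baseline BMI
use "fhs.dta", clear
drop if bmi1 == .
keep bmi1
* first sample of 20: preserve the full data, sample, then restore
preserve
sample 20, count
mean bmi1
restore
* second sample of 20 (restored at the start of the next block)
preserve
sample 20, count
mean bmi1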
We don't expect $\bar{x}_1$ and $\bar{x}_2$ to be exactly the same, because the sample mean varies randomly
from sample to sample.
3. Repeat this exercise, but with a sample size of 100. Are $\bar{x}_1$ and $\bar{x}_2$ closer together than
those from the samples of size 20? Are $\bar{x}_1$ and $\bar{x}_2$ always going to be closer together
using a sample size of 100 versus 20?
restore
preserve
sample 100, count
mean bmi1
restore
preserve
sample 100, count
mean bmi1
In my sample, the values of the sample mean are closer with the larger sample. This will
usually, but not always, be true.
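
To see the "usually, but not always" pattern more systematically, the sampling can be repeated many times. The sketch below is self-contained and therefore does NOT use the Framingham data: it generates a stand-in population of 4,434 normal draws with mean 25.8 and standard deviation 4.1 (an assumption made here for illustration), then records 200 sample means for each sample size. The names sims, means, n and xbar are arbitrary.

```stata
* Note: clear drops whatever data are currently in memory
clear
set seed 2
set obs 4434
* stand-in population, not the real bmi1 variable
generate bmi1 = rnormal(25.8, 4.1)

tempname sims
tempfile means
postfile `sims' n xbar using "`means'", replace
foreach n in 20 100 {
    forvalues i = 1/200 {
        preserve
        quietly sample `n', count
        quietly summarize bmi1
        post `sims' (`n') (r(mean))
        restore
    }
}
postclose `sims'

* spread of the sample means, by sample size
use "`means'", clear
tabstat xbar, by(n) statistics(mean sd)
```

The standard deviation of the sample means should come out near 4.1/sqrt(n), so noticeably smaller for n = 100 than for n = 20. To run this on the real data, replace the first five commands with the use, drop and keep lines from question 2.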
4. Compare histograms of BMI at baseline and prevalent MI at baseline. Would the central
limit theorem apply to the binary indicator prevalent MI at baseline?
[Figure: histogram of Body mass index, exam 1 (x-axis 10 to 60; y-axis Density)]
[Figure: histogram of Prevalent myocardial infarction, exam 1 (x-axis 0 to 1; y-axis Density)]
Yes, but the more skewed a distribution is, the larger sample size we need to collect before
the CLT kicks in.
Tutorial: Confidence and Predictive Intervals in Stata
1. Let X denote a random variable that represents BMI at baseline for the Framingham
cohort. Assume that X is normally distributed. What is the mean of X? The standard
deviation?
. summarize bmi1
2. Construct a 95% predictive interval for X. Pick a random observation from the dataset.
Does your interval contain the BMI for the randomly selected observation?
A 95% predictive interval for X is defined as $\mu \pm 1.96\sigma$.
3. Suppose we now draw repeated samples of size 100 from the Framingham cohort. What
is a 95% predictive interval for $\bar{X}$?
A 95% predictive interval for $\bar{X}$ is defined as $\mu \pm 1.96\sigma/\sqrt{n}$.
4. Take a sample of size 100 from the Framingham dataset. Does your predictive interval
for $\bar{X}$ contain the mean from the 100 person subsample?
. sample 100, count
. sum bmi1
5. Construct a 95% confidence interval for the mean BMI in this sample. Does the 95%
confidence interval contain the mean BMI for the entire cohort?
A 95% CI for $\mu$ is defined as $\bar{X} \pm 1.96\sigma/\sqrt{n}$.
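
As a worked illustration of the two predictive intervals, the lines below plug in the cohort values reported in the Central Limit Theorem tutorial (mean 25.8, standard deviation 4.1). This is a sketch; your own summarize output may differ slightly in later decimals.

```stata
* 95% predictive interval for a single observation X
display 25.8 - 1.96*4.1
display 25.8 + 1.96*4.1
* 95% predictive interval for the mean of n = 100 observations
display 25.8 - 1.96*4.1/sqrt(100)
display 25.8 + 1.96*4.1/sqrt(100)
```

The interval for the sample mean is ten times narrower than the interval for a single observation, because the standard deviation is divided by sqrt(100) = 10.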
Tutorial: Confidence intervals with the t-distribution in Stata
Suppose t is a random variable that follows a t-distribution with n degrees of freedom.
tden(n,t) returns the probability density function
of Student's t distribution
ttail(n,t) returns the reverse cumulative
(upper tail or survivor) Student's t distribution
invttail(n,p) returns the inverse reverse cumulative
(upper tail or survivor) Student's t distribution
Note that if ttail(n,t) = p, then invttail(n,p) = t.
Stata will calculate confidence intervals for you:
Calculator: cii n mean sd, level(95)
Function: ci varlist, level(95)
There is no Stata function for calculating confidence intervals for normally distributed data
when the standard deviation is known, since this scenario doesn't really happen in practice.
1. Calculate the mean and standard deviation of BMI at baseline.
. summarize bmi1
2. Take a sample of size 20 from the Framingham cohort. Calculate the mean and
standard deviation of BMI at baseline in the subsample (I use set seed 2, if you want
to get the same sample as me). We are interested in making inference about BMI at
baseline in the total Framingham cohort using only the sample of size 20.
. set seed 2
. drop if bmi1 == .
. sample 20, count
. sum bmi1
3. Assume that the population standard deviation is known (and equal to the standard deviation
in the Framingham cohort). Construct a 95% confidence interval for the mean BMI in your
subsample. Note that if normal(z) = p, then invnormal(p) = z.
95% CI: $\bar{x} \pm Z_{0.975}\,\sigma/\sqrt{n}$
. di 25.0 - invnormal(0.975)*4.1/sqrt(20)
. di 25.0 + invnormal(0.975)*4.1/sqrt(20)
4. Use invttail to construct a 95% confidence interval for the mean BMI in your subsample
by hand, now assuming that the population standard deviation is unknown.
. di 25.0 - invttail(19, 0.025)*3.2/sqrt(20)
. di 25.0 + invttail(19, 0.025)*3.2/sqrt(20)
5. Use cii to construct a 95% confidence interval for the mean BMI in your subsample.
. cii 20 25.0 3.2
6. Use ci to construct a 95% confidence interval for the mean BMI in your subsample.
. ci bmi1
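
It can help to look at the critical values themselves. The sketch below compares the t and normal multipliers for a 95% interval and confirms that ttail() and invttail() undo each other; the values 3.2 and 20 are the subsample standard deviation and size quoted in question 4.

```stata
* critical values for a 95% interval: t with 19 df versus standard normal
display invttail(19, 0.025)
display invnormal(0.975)
* invttail() and ttail() are inverses of each other
display ttail(19, invttail(19, 0.025))
* half-width of the t interval from question 4
display invttail(19, 0.025)*3.2/sqrt(20)
```

The t multiplier (about 2.09) is larger than the normal one (about 1.96), which is the price paid for estimating the standard deviation from only 20 observations.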
Tutorial: Hypothesis testing in Stata
In adults over 15 years of age, a resting heart rate around 80bpm is usually considered
average. Using a subset of the Framingham cohort, we are going to attempt to make inference
about heart rate among healthy young adults.
Specifically, we restrict our analysis to adults with the following characteristics at baseline:
non-smoker, younger than 40, BMI less than 25, diastolic blood pressure less than 80, and
systolic blood pressure less than 120. There are 61 participants who meet our criteria. We
hypothesize that heart rate at the follow up exam in 1962 would be lower than 80bpm, the
resting heart rate for adults with average health.
We are making the somewhat strong assumption that these Framingham participants are
generalizable to the broader population of healthy young adults (this assumption is necessary
if we want to make inference about heart rate in healthy young adults.) Use the dataset on this
webpage to answer the following questions:
1. Make a histogram of heart rate at exam 2. Is the normality assumption reasonable?
histogram heartrte2
histogram heartrte2 if heartrte2 < 200
2. You are interested in whether the mean heart rate at exam 2 among healthy young adults
is equal to 80bpm. Perform a hypothesis test at the $\alpha = 0.05$ level.
(a) What test are you using?
One-sample t-test
(b) State your null and alternative hypothesis.
$H_0: \mu = 80$, $H_A: \mu \neq 80$
(c) Perform the hypothesis test.
Hypothesis testing in Stata: To examine options for t-tests in Stata, type db ttest.
Or, using the dropdown menu, explore the options in
Summaries, tables, and tests/Classical tests of hypothesis/.
. ttest heartrte2 == 80

One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
heartr~2 |      61    76.55738    2.800032    21.86895    70.95648    82.15827
------------------------------------------------------------------------------
    mean = mean(heartrte2)                                        t =  -1.2295
Ho: mean = 80                                    degrees of freedom =       60

    Ha: mean < 80               Ha: mean != 80                 Ha: mean > 80
 Pr(T < t) = 0.1118         Pr(|T| > |t|) = 0.2237          Pr(T > t) = 0.8882

. ttesti 61 76.557 21.869 80

One-sample t test
------------------------------------------------------------------------------
         |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |      61      76.557    2.800039      21.869    70.95609    82.15791
------------------------------------------------------------------------------
    mean = mean(x)                                                t =  -1.2296
Ho: mean = 80                                    degrees of freedom =       60

    Ha: mean < 80               Ha: mean != 80                 Ha: mean > 80
 Pr(T < t) = 0.1118         Pr(|T| > |t|) = 0.2236          Pr(T > t) = 0.8882
What are:
i. your test statistic, t = -1.23
ii. the distribution of your test statistic under the null hypothesis, $t \sim t_{60}$
iii. the p-value, 0.2236
iv. your decision, and Fail to reject the null hypothesis.
v. your interpretation? We do not have enough evidence to suggest that the mean heart
rate differs from 80 bpm in healthy young adults at follow-up.
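As a check on the output above, the test statistic can be recomputed by hand from the reported summary statistics. The following is a minimal sketch in Python rather than Stata; the variable names are illustrative and not part of the tutorial.

```python
import math

# Summary statistics copied from the ttest output above
n, mean, sd, mu0 = 61, 76.55738, 21.86895, 80

se = sd / math.sqrt(n)      # standard error of the mean
t = (mean - mu0) / se       # one-sample t statistic
df = n - 1                  # degrees of freedom

print(round(se, 4), round(t, 4), df)
```

The standard error and t statistic agree with the Std. Err. and t columns that Stata reports.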
3. As a diligent statistician, you decide to investigate the issue of the outlier in your dataset.
List the information for the outlier.
. list if heartrte2 > 200
4. Repeat the hypothesis test, excluding this observation. What do you find?
. ttest heartrte2 == 80 if heartrte2 < 200
5. As the statistician, what results should you present in your analysis?
Example: Atherosclerosis and Physical Activity
Oxidation of components of LDL cholesterol (the "bad" cholesterol) can result in atherosclerosis,
or hardening of the arteries. Elosua et al. (2003) examine the impact of a 16-week physical activity
program on LDL resistance to oxidation in 17 healthy young adults. After completing the program,
the average maximum oxidation rate in the study participants, $\bar{x}$, was 8.2 μmol/min/g, and the
sample standard deviation of the maximum oxidation rate was s = 2.5 μmol/min/g. Assume that
the oxidation rate is normally distributed.
What is the distribution of $\bar{x}$?
$\frac{\bar{x} - \mu}{s/\sqrt{n}} \sim t_{16}$.
Suppose the average maximum oxidation rate in healthy young adults who did not complete
the program was $\mu_0 = 11.3$ μmol/min/g and the standard deviation was $\sigma = 2.3$. Define $\bar{x}_0$
as the sample mean maximum oxidation rate from a sample of size 17 from this population.
Construct a 99% predictive interval for $\bar{x}_0$. Is $\bar{x}$ in this interval?
. di 11.3 - invnormal(0.995)*2.3/sqrt(17)
. di 11.3 + invnormal(0.995)*2.3/sqrt(17)
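The two display commands above can be reproduced outside Stata. This is a minimal Python sketch of the same arithmetic; `NormalDist().inv_cdf` plays the role of `invnormal`, and the variable names are illustrative.

```python
import math
from statistics import NormalDist

mu0, sigma, n = 11.3, 2.3, 17
z = NormalDist().inv_cdf(0.995)        # same quantile as invnormal(0.995)
half = z * sigma / math.sqrt(n)        # half-width of the 99% predictive interval
lower, upper = mu0 - half, mu0 + half

xbar = 8.2                             # observed mean after the program
inside = lower <= xbar <= upper
print(round(lower, 3), round(upper, 3), inside)
```

The observed mean of 8.2 falls below the lower limit, so it is not in the interval.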
Construct a 99% confidence interval for $\mu$.
. cii 17 8.2 2.5, level(99)
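If you constructed the 99% confidence interval for $\mu$ assuming that the standard deviation
was known and equal to $\sigma = 2.3$, would your confidence interval be wider or narrower? Will
this result always be true?
Standard deviation known: $\bar{x} \pm Z_{0.99}\,\sigma/\sqrt{n}$
Standard deviation unknown: $\bar{x} \pm t_{0.99,16}\,s/\sqrt{n}$
(Here the subscript 0.99 denotes the critical value for a 99% interval.)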
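To compare the two intervals numerically, the half-widths can be computed directly. This is a hedged Python sketch: the normal quantile comes from the standard library, while the t critical value (2.921 for 16 degrees of freedom) is an assumption taken from a standard t table rather than computed.

```python
import math
from statistics import NormalDist

n = 17
s, sigma = 2.5, 2.3
z = NormalDist().inv_cdf(0.995)   # normal critical value for a 99% interval
t_crit = 2.921                    # t critical value, 16 df, from a t table (assumed)

half_known = z * sigma / math.sqrt(n)     # standard deviation known
half_unknown = t_crit * s / math.sqrt(n)  # standard deviation estimated by s
print(round(half_known, 3), round(half_unknown, 3))
```

Here the known-variance interval is narrower. That need not always hold: the t critical value is always larger than the normal one, but s can turn out smaller than $\sigma$ in a given sample.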
Let $\mu$ denote the mean maximum oxidation rate in young adults who participate in the program.
Test the hypothesis that $\mu = \mu_0$ against the alternative that $\mu \neq \mu_0$ at
the $\alpha = 0.01$ level.
What do you conclude?
$H_0: \mu = \mu_0$, $H_A: \mu \neq \mu_0$
. ttesti 17 8.2 2.5 11.3, level(99)
Using a one-sample t-test, we obtain a test statistic of -5.11, which follows a t-distribution
with 16 degrees of freedom under the null hypothesis, corresponding to a p-value of 0.0001.
We reject the null hypothesis at the $\alpha = 0.01$ level and conclude that the data suggest that
the 16-week physical activity program lowers the maximum oxidation rate in healthy young
individuals.
Elosua R., Molina L., Fito M., Arquer A., Sanchez-Quesada J.L., Covas M.I., Ordonez-Llanos J., Marrugat J. (2003). Response of
oxidative stress biomarkers to a 16-week aerobic physical activity program, and to acute physical activity, in healthy young men and
women. Atherosclerosis 167(2), 327-334.
Two Sample t-tests in Stata
Example: In the Framingham cohort, we want to examine the distribution of heart rate at exams
1 and 2. Specifically, we wish to test whether there is a difference in mean heart rate between
exam 1 and exam 2. Additionally, we are interested in whether the mean heart rate differs
between men and women at exam 2. We sample 100 people from the Framingham cohort.
For this example, use the dataset heartrate.dta on this webpage, which contains the random
sample of 100 participants.
Hypothesis testing with paired data in Stata:
. ttest heartrte1 == heartrte2
Paired t test
------------------------------------------------------------------------------
Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+--------------------------------------------------------------------
heartr~1 | 100 75.03 1.290247 12.90247 72.46987 77.59013
heartr~2 | 100 76.17 1.293031 12.93031 73.60435 78.73565
---------+--------------------------------------------------------------------
diff | 100 -1.14 1.344125 13.44125 -3.807035 1.527035
------------------------------------------------------------------------------
mean(diff) = mean(heartrte1 - heartrte2) t = -0.8481
Ho: mean(diff) = 0 degrees of freedom = 99
Ha: mean(diff) < 0 Ha: mean(diff) != 0 Ha: mean(diff) > 0
Pr(T < t) = 0.1992 Pr(|T| > |t|) = 0.3984 Pr(T > t) = 0.8008
. gen hdiff = heartrte2 - heartrte1
. ttest hdiff== 0
One-sample t test
------------------------------------------------------------------------------
Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+--------------------------------------------------------------------
hdiff | 100 1.14 1.344125 13.44125 -1.527035 3.807035
------------------------------------------------------------------------------
mean = mean(hdiff) t = 0.8481
Ho: mean = 0 degrees of freedom = 99
Ha: mean < 0 Ha: mean != 0 Ha: mean > 0
Pr(T < t) = 0.8008 Pr(|T| > |t|) = 0.3984 Pr(T > t) = 0.1992
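The commands ttest heartrte2 == heartrte1 and ttest hdiff == 0 lead to the same test. The paired
command shown above, ttest heartrte1 == heartrte2, reverses the sign of the difference and of t
but gives the same two-sided p-value.
This command can be found through the following drop-down menus: Statistics / Summaries,
tables, and tests / Classical tests of hypotheses / Mean-comparison test, paired data.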
Hypothesis testing with unpaired data and equal variances in Stata:
. ttest heartrte2, by(sex1)
Two-sample t test with equal variances
------------------------------------------------------------------------------
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+--------------------------------------------------------------------
Male | 39 76.82051 2.042025 12.75244 72.68665 80.95438
Female | 61 75.7541 1.681246 13.13095 72.39111 79.11709
---------+--------------------------------------------------------------------
combined | 100 76.17 1.293031 12.93031 73.60435 78.73565
---------+--------------------------------------------------------------------
diff | 1.066414 2.662326 -4.216884 6.349713
------------------------------------------------------------------------------
diff = mean(Male) - mean(Female) t = 0.4006
Ho: diff = 0 degrees of freedom = 98
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 0.6552 Pr(|T| > |t|) = 0.6896 Pr(T > t) = 0.3448
Hypothesis testing with unpaired data and unequal variances in Stata:
. ttest heartrte2, by(sex1) unequal
Two-sample t test with unequal variances
------------------------------------------------------------------------------
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+--------------------------------------------------------------------
Male | 39 76.82051 2.042025 12.75244 72.68665 80.95438
Female | 61 75.7541 1.681246 13.13095 72.39111 79.11709
---------+--------------------------------------------------------------------
combined | 100 76.17 1.293031 12.93031 73.60435 78.73565
---------+--------------------------------------------------------------------
diff | 1.066414 2.645081 -4.194674 6.327503
------------------------------------------------------------------------------
diff = mean(Male) - mean(Female) t = 0.4032
Ho: diff = 0 Satterthwaite's degrees of freedom = 82.8637
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 0.6561 Pr(|T| > |t|) = 0.6879 Pr(T > t) = 0.3439
This command can be found through the following drop-down menus: Statistics / Summaries, tables,
and tests / Classical tests of hypotheses / Two-group mean-comparison test.
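The difference between the equal-variance and unequal-variance tests is easiest to see by recomputing both from the group summaries in the output above. This is a minimal Python sketch, not Stata code; the variable names are illustrative.

```python
import math

# Group summaries copied from the ttest output above
n_m, mean_m, sd_m = 39, 76.82051, 12.75244   # men
n_f, mean_f, sd_f = 61, 75.7541, 13.13095    # women
diff = mean_m - mean_f

# Equal variances: pooled standard deviation, df = n_m + n_f - 2
sp = math.sqrt(((n_m - 1) * sd_m**2 + (n_f - 1) * sd_f**2) / (n_m + n_f - 2))
se_pooled = sp * math.sqrt(1 / n_m + 1 / n_f)
t_pooled = diff / se_pooled

# Unequal variances: separate variances, Satterthwaite degrees of freedom
v_m, v_f = sd_m**2 / n_m, sd_f**2 / n_f
se_welch = math.sqrt(v_m + v_f)
t_welch = diff / se_welch
df_welch = (v_m + v_f) ** 2 / (v_m**2 / (n_m - 1) + v_f**2 / (n_f - 1))

print(round(t_pooled, 4), n_m + n_f - 2)
print(round(t_welch, 4), round(df_welch, 2))
```

Because the two sample standard deviations are so close, the two versions of the test give nearly identical results here.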
Instead of the data structure above, suppose that, in your dataset, you have heart rate for men in one
variable/column and heart rate for women in another variable/column (instead of our situation where we
have heart rate in one variable and sex as another variable). How do you perform a t-test then? Use the
command ttest heartrtew == heartrtem, unpaired unequal, where heartrtew is the heart rate
variable for women and heartrtem is the heart rate variable for men. It is important to use the option unpaired.
If you do not use this option, Stata will perform a paired t-test. You may also choose to leave out the
unequal option if you wish to assume equal variances.
The following 4 lines of code transform the data to the situation where we have heart rate for men
in one variable (heartrtem) and heart rate for women in another variable (heartrtew). It is not necessary
to memorize or understand this portion of code. It is simply included for completeness. The fifth line of
code runs the two-sample t-test.
. gen id = _n
. reshape wide heartrte2, i(id) j(sex1)
. rename heartrte21 heartrtem
. rename heartrte22 heartrtew
. ttest heartrtew == heartrtem, unpaired unequal
Two-sample t test with unequal variances
------------------------------------------------------------------------------
Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+--------------------------------------------------------------------
heartr~w | 61 75.7541 1.681246 13.13095 72.39111 79.11709
heartr~m | 39 76.82051 2.042025 12.75244 72.68665 80.95438
---------+--------------------------------------------------------------------
combined | 100 76.17 1.293031 12.93031 73.60435 78.73565
---------+--------------------------------------------------------------------
diff | -1.066414 2.645081 -6.327503 4.194674
------------------------------------------------------------------------------
diff = mean(heartrtew) - mean(heartrtem) t = -0.4032
Ho: diff = 0 Satterthwaite's degrees of freedom = 82.8637
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 0.3439 Pr(|T| > |t|) = 0.6879 Pr(T > t) = 0.6561
This command can be found through the following drop-down menus: Statistics / Summaries, tables,
and tests / Classical tests of hypotheses / Two-sample mean-comparison test.
Exercises
1. Calculate the sample mean and sample standard deviation of heart rate at exam 1 and exam 2 in
the Framingham cohort.
2. Are these data dependent or independent?
3. Generate a new variable for the difference in heart rate between exam 1 and exam 2. Make a
histogram of this new variable.
4. Perform a hypothesis test at the $\alpha = 0.05$ level.
(a) What test are you using?
(b) State your null and alternative hypothesis.
(c) Perform the hypothesis test. What are:
i. your test statistic,
ii. the degrees of freedom,
iii. the p-value,
iv. your decision, and
v. your interpretation?
Now, assume that you are interested in whether the mean heart rate differs between men
and women at exam 2.
5. Are these data dependent or independent?
6. Calculate the sample mean and sample standard deviation of heart rate at exam 2 for men and
women.
7. Perform a hypothesis test at the $\alpha = 0.05$ level, assuming unequal variances.
(a) What test are you using?
(b) State your null and alternative hypothesis.
(c) Perform the hypothesis test. What are:
i. your test statistic,
ii. the degrees of freedom,
iii. the p-value,
iv. your decision, and
v. your interpretation?
8. Given the 95% confidence intervals, would you expect the hypothesis test to be significant?
Power and Sample Size in Stata
sampsi - Sample size and power for means and proportions
Power
sampsi 18.4 20.4, sd1(2.8) n1(20) onesample
Sample Size
sampsi 18.4 20.4, sd1(2.8) power(.90) onesample
The notation changes slightly for two-sample or one-sided tests. Type db sampsi to see all
options available within the sampsi command or select from the drop-down menus: Statistics /
Power and sample size / Tests of means and proportions.
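To see what goes into such a calculation, the textbook normal-approximation formulas for a two-sided one-sample test can be evaluated by hand. The sketch below is in Python and rests on the assumption that these standard formulas are close to what sampsi computes; it is not taken from the Stata documentation, and the variable names are illustrative.

```python
import math
from statistics import NormalDist

nd = NormalDist()
mu0, mu_a, sd, alpha = 18.4, 20.4, 2.8, 0.05
z_alpha = nd.inv_cdf(1 - alpha / 2)      # two-sided critical value

# Power for a fixed sample size (the negligible opposite tail is ignored)
n = 20
power = nd.cdf(abs(mu_a - mu0) * math.sqrt(n) / sd - z_alpha)

# Sample size needed for 90% power, rounded up
z_beta = nd.inv_cdf(0.90)
n_needed = math.ceil(((z_alpha + z_beta) * sd / (mu_a - mu0)) ** 2)

print(round(power, 4), n_needed)
```

For a one-sided test, the same formulas apply with `1 - alpha` in place of `1 - alpha / 2`.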
Example: Suppose we aim to implement a new physical activity program among school-aged
children between 6 and 11 years old at high risk for obesity. We define high-risk children as
those children who do less than 2 hours of physical activity per week. According to Ogden
(2012), mean BMI among children 6-11 years old in the United States was 18.4 between 2009
and 2010, with standard deviation 2.8. Before implementing this program, we want to perform
a baseline survey, to evaluate the state of the obesity epidemic among the high risk children.
We plan to design the survey to test whether the mean BMI in the high risk children is equal to
the mean BMI among 6-11 year olds in the United States at the $\alpha = 0.05$ level. To design the
study, assume the standard deviation of BMI is equal in the general population and the high
risk children.
Ogden C.L., Carroll M.D., Kit B.K., and Flegal K.M. (2012). Prevalence of Obesity and Trends in Body Mass Index Among US
Children and Adolescents, 1999-2010. JAMA: The Journal of the American Medical Association. 307 (5). 483-490.
1. State the null and alternative hypothesis for the test above.
$H_0: \mu = 18.4$
$H_A: \mu \neq 18.4$
2. Fill in the table below:
Sample Size    $\mu_A$    Power
100            19.4       ____
200            18.9       ____
10,000         18.4       ____
____           20.4       0.9
____           19.4       0.8
____           19.4       0.9
Now, suppose we powered our study for the one-sided test that the mean BMI is equal
to 18.4 versus the alternative that the mean is higher in the high risk children. Repeat
the calculations above and compare to the two-sided calculations.
Power: sampsi 18.4 20.4, sd1(2.8) n1(20) onesample onesided
Sample Size: sampsi 18.4 20.4, sd1(2.8) power(.90) onesample onesided
1. State the null and alternative hypothesis for the test above.
$H_0: \mu = 18.4$
$H_A: \mu > 18.4$
2. Fill in the table below:
Sample Size    $\mu_A$    Power
100            19.4       ____
200            18.9       ____
10,000         18.4       ____
____           20.4       0.9
____           19.4       0.8
____           19.4       0.9
Suppose we also wanted to investigate whether the BMI among high risk children differed
between boys and girls. Let us assume that the standard deviations of BMI among
high risk children are both equal to 2.8.
Power: sampsi 18.4 20.4, sd1(2.8) sd2(2.8) n1(20) n2(20)
Sample Size: sampsi 18.4 20.4, sd1(2.8) sd2(2.8) power(.90)
1. State the null and alternative hypothesis for the test above.
$H_0: \mu_B = \mu_G$
$H_A: \mu_B \neq \mu_G$
2. Let $\mu_B$ and $\mu_G$ denote the mean BMI in boys and girls, respectively; let $n_B$ and $n_G$ denote
the sample size required for boys and girls. Fill in the table below:
$n_B$    $n_G$    $\mu_B$    $\mu_G$    Power
20       20       20.4       18.4       ____
20       20       19.4       18.4       ____
____     ____     20.4       19.4       0.9
____     ____     22.4       18.4       0.8
Tutorial: ANOVA in Stata
In this example, we will use data from the California Health Interview Survey (CHIS). From
their website (http://www.chis.ucla.edu): "CHIS is the nation's largest state health survey.
Conducted every two years on a wide range of health topics, CHIS data gives a detailed picture
of the health and health care needs of California's large and diverse population. CHIS is
conducted by the UCLA Center for Health Policy Research in collaboration with many public
agencies and private organizations."
In 2009, CHIS surveyed more than 47,000 adults, more than 12,000 teens and children and
more than 49,000 households. We will use a sample of 500 adults for this lab (CHISANOVA.dta).
Suppose we are interested in the relationship between number of hours worked (per week) and
health, as measured by BMI. Would we expect those who worked longer hours to be healthier
than those who worked shorter hours, or vice versa? Number of hours worked per week is
divided into 5 categories: 0-10, 10-25, 25-35, 35-45, 45+.
1. How many people are in each category?
2. We now wish to run an ANOVA. Are the assumptions for ANOVA met?
3. What are the null and alternative hypotheses for this test?
4. Perform the hypothesis test at the $\alpha = 0.05$ level.
Conduct a oneway ANOVA in Stata using the oneway command:
. oneway bmi work_cat, tabulate
| Summary of bmi
work_cat | Mean Std. Dev. Freq.
------------+------------------------------------
0-10 | 26.431579 5.9410147 38
10-25 | 26.429189 5.7075504 74
25-35 | 24.3495 4.1477871 60
35-45 | 27.128351 5.647101 188
45+ | 27.854928 6.1797228 140
------------+------------------------------------
Total | 26.8419 5.7540637 500
Analysis of Variance
Source SS df MS F Prob > F
------------------------------------------------------------------------
Between groups 550.823688 4 137.705922 4.27 0.0021
Within groups 15970.6916 495 32.2640234
------------------------------------------------------------------------
Total 16521.5153 499 33.1092491
Bartlett's test for equal variances: chi2(4) = 11.7543 Prob>chi2 = 0.019
You may also use the following drop-down menus to access the oneway command: Statistics /
Linear models and related / ANOVA/MANOVA / One-way ANOVA.
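The F statistic in the ANOVA table is the ratio of the between-group and within-group mean squares. As a quick check, it can be recomputed from the sums of squares and degrees of freedom in the output above. This is a minimal Python sketch with illustrative variable names.

```python
# Sums of squares and degrees of freedom copied from the oneway output above
ss_between, df_between = 550.823688, 4      # k - 1 = 5 groups - 1
ss_within, df_within = 15970.6916, 495      # N - k = 500 - 5

ms_between = ss_between / df_between
ms_within = ss_within / df_within
F = ms_between / ms_within

print(round(ms_between, 4), round(ms_within, 4), round(F, 2))
```

Under the null hypothesis, F follows an F distribution with 4 and 495 degrees of freedom.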
What are:
(a) your test statistic,
(b) the degrees of freedom,
(c) the p-value,
(d) your decision, and
(e) your interpretation?
5. We have rejected the null hypothesis, thus we have evidence that at least one pair of
means is not equal. Perform all possible pairwise comparisons using the Bonferroni
correction.
6. Which pairs of means are significantly different?
7. A colleague of yours, who has the same dataset, calculates the means for each work
category. After looking at these means he takes the group with the largest mean (45+)
and the group with the smallest mean (25-35) and performs a t-test (without a Bonferroni
correction). He tells you that since he only did one test, he does not need to correct for
multiple comparisons and that his method is valid. Do you agree? Why or why not?

Tutorial: Methods for one-sample proportion inference

In this tutorial, we learn about Stata commands for one-sample proportion inference:

Confidence intervals:

ci and cii calculate binomial confidence intervals

Hypothesis Tests:

bitest and bitesti exact binomial one-sample proportion hypothesis test

prtest and prtesti large-sample one-sample proportion hypothesis test

Recall that the extra "i" at the end of a Stata command name denotes that the command is
"immediate" and does not use the data in memory.

Exercises

1. Estimate the proportion of California residents who visit the doctor at least once in the
previous year, denoted p.

. tabulate doctor

2. Construct a 95% confidence interval for p using three different methods (Can we use
the normal approximation to the binomial distribution?). How do the widths of these
three CIs compare?

. ci doctor, binomial
. ci doctor, binomial wald
. ci doctor, binomial wilson

Exact: never has lower than expected coverage, but is sometimes too conservative
Wald: Large-sample, bad coverage, easy to calculate/flexible
Wilson: Large-sample, good coverage, less flexible
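To make the comparison between the two large-sample intervals concrete, both can be computed from their closed-form formulas. The Python sketch below assumes 402 doctor visits among 500 respondents, counts inferred from the pooled proportion of 0.804 and the sample size of 500 mentioned elsewhere in these tutorials; they are an assumption, not output copied from Stata. The exact interval is omitted because it needs a beta quantile function.

```python
import math
from statistics import NormalDist

x, n = 402, 500                 # assumed counts: 0.804 * 500 respondents
p = x / n
z = NormalDist().inv_cdf(0.975)

# Wald interval: estimate plus or minus z times the estimated standard error
half_wald = z * math.sqrt(p * (1 - p) / n)
wald = (p - half_wald, p + half_wald)

# Wilson (score) interval: recentred and rescaled
denom = 1 + z**2 / n
center = (p + z**2 / (2 * n)) / denom
half_wilson = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
wilson = (center - half_wilson, center + half_wilson)

print([round(v, 4) for v in wald])
print([round(v, 4) for v in wilson])
```

With a sample this large and a proportion not too close to 0 or 1, the two intervals are nearly the same; the Wilson interval is shifted slightly toward 0.5.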

3. Using the 95% confidence level, is there evidence in the data that less than 80% of the
population visits the doctor once per year? Repeat this analysis, stratifying by
above/below poverty groups.

. bysort poverty: ci doctor, binomial


4. Let's formalize question 3 using a hypothesis test. Let $p_1$ denote the proportion of
California residents below the federal poverty level who visited the doctor at least once
in the past year. Test the hypothesis that $p_1 = 0.8$ versus the alternative that $p_1 \neq 0.8$ at
the $\alpha = 0.05$ level.

(a) First, use the exact binomial test. What is the p-value?

. bitest doctor == 0.8 if poverty == 1

(b) Next, use the normal approximation to the binomial distribution.

. prtest doctor == 0.8 if poverty==1



Is the normal approximation appropriate?

$n_1 p_1 > 5$; $n_1 (1 - p_1) > 5$

Therefore, the normal approximation to the binomial is appropriate.

What is the value of your test statistic?

Z = - 2.02

What is the distribution of your test statistic under the null hypothesis?

Z ~ N(0,1)

What is the p-value of your test?

p = 0.044

Do you reject or not reject the null hypothesis?

We reject the null hypothesis.

What do you conclude?

We conclude that there is evidence in the data that $p_1$ is less than 0.8.

5. Given that you got different results using the exact and large sample hypothesis tests,
what would you do if you were writing a paper?

There are no meaningful differences between a p-value of 0.049 and 0.051. Try to
include confidence intervals in practice, as p-values don't tell you anything about the
magnitude of an effect.

Two Sample Proportion Tests in Stata
Before delving into two-way associations using contingency (two-by-two) tables, we first
examine the structure of the two-sample test of proportions, using the normal approximation to
the binomial.
Exercises:
1. How might we define a test statistic for comparing two proportions? Specifically, we
would like to test the hypothesis $H_0: p_1 = p_0$ versus the alternative $H_A: p_1 \neq p_0$ at
the $\alpha = 0.05$ level. How does this test compare to the two-sample mean test for normally
distributed data from last week?
Recall the two-sample t-test for equal variances:
Assume $X_1 \sim N(\mu_1, \sigma^2)$, and the sample mean of multiple realizations of $X_1$ is $\bar{x}_1$ and the
sample standard deviation is $s_1$; and $X_2 \sim N(\mu_2, \sigma^2)$, and the sample mean of multiple
realizations of $X_2$ is $\bar{x}_2$ and the sample standard deviation is $s_2$.
To test $H_0: \mu_1 = \mu_2$ vs. $H_A: \mu_1 \neq \mu_2$, our test statistic for the two-sample t-test with
equal variances was:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \overset{H_0}{\sim} t_{n_1 + n_2 - 2}$$
Remember: the variance is independent of the mean for normally distributed data. For
binomial data, the variance is a function of the mean.
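For binomial data:
Assume $X_1 \sim \text{Binomial}(n_1, p_1)$ and $X_0 \sim \text{Binomial}(n_0, p_0)$.
Define $\hat{p}_1 = X_1/n_1$ and $\hat{p}_0 = X_0/n_0$.
Using the Central Limit Theorem, we know that, approximately, $\hat{p}_1 \sim N(p_1, p_1(1 - p_1)/n_1)$ and
$\hat{p}_0 \sim N(p_0, p_0(1 - p_0)/n_0)$.
Under the null hypothesis that $p_1 = p_0$, $\hat{p}_1 - \hat{p}_0 \sim N(0, V)$, where
$$V = \hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_0}\right) \quad \text{and} \quad \hat{p} = \frac{X_1 + X_0}{n_1 + n_0}.$$
Therefore, a natural test statistic for testing $H_0: p_1 = p_0$, $H_A: p_1 \neq p_0$ is:
$$\frac{\hat{p}_1 - \hat{p}_0}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_0}\right)}} \overset{H_0}{\sim} N(0, 1)$$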
For binomial data, the structure of the test statistic is similar to the two-sample t-test with
equal variances, because, under the null, the variances are equal in both groups.
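The pooled two-sample statistic can be written as a short function. The sketch below is in Python; the function name and the counts passed to it (70 of 100 versus 340 of 400) are purely hypothetical and are not the CHIS data.

```python
import math
from statistics import NormalDist

def two_prop_ztest(x1, n1, x0, n0):
    """Large-sample z test of H0: p1 = p0 using the pooled proportion."""
    p1_hat, p0_hat = x1 / n1, x0 / n0
    p_hat = (x1 + x0) / (n1 + n0)                        # pooled estimate under H0
    se = math.sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n0))
    z = (p1_hat - p0_hat) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))         # two-sided p-value
    return z, p_value

# Hypothetical counts, for illustration only
z, p_value = two_prop_ztest(70, 100, 340, 400)
print(round(z, 3), round(p_value, 5))
```

Note that only one variance estimate appears in the denominator, which mirrors the role of the pooled standard deviation in the equal-variance t-test.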
2. Let $p_1$/$p_0$ denote the proportion of CA residents below/above the federal poverty level who
visited the doctor at least once in the past year. Test the hypothesis that $p_1 = p_0$ versus
the alternative that $p_1 \neq p_0$ at the $\alpha = 0.05$ level. What do you conclude? Report a 95%
CI along with your results.
What test are you using? Is normality reasonable?
tabulate doctor
Check that $n_1 \hat{p} > 5$, $n_1 (1 - \hat{p}) > 5$, $n_0 \hat{p} > 5$, $n_0 (1 - \hat{p}) > 5$, where $\hat{p} = 0.804$.
Two-group proportion test in Stata
. prtest doctor, by(poverty)
What is the value of your test statistic?
Z = 2.3
What is the distribution of your test statistic?
$Z \sim N(0, 1)$
What is the p-value of your test?
p = 0.024
Do you reject or not reject the null hypothesis?
Reject $H_0$
What do you conclude?
There is evidence in the data that individuals in CA who are below poverty are less
likely to go to the doctor.
3. Based on these data, you decide to conduct an intervention among those below the
poverty line. You randomize individuals to intervention or no intervention. Suppose you
power your study to detect a 15% risk difference with 90% power, assuming the proportion
in the control group would equal the estimated proportion among those below poverty
(70%) in this study. What sample size would you need, with equal numbers of individuals
per arm, if you plan to conduct your test at the $\alpha = 0.05$ level?
. sampsi 0.7 0.85, power(0.9) alpha(0.05)
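The per-arm sample size can also be worked out by hand from the standard formula for comparing two proportions. The Python sketch below shows the calculation with and without a continuity correction. Treat the correspondence with sampsi as an assumption: sampsi applies a continuity correction unless nocontinuity is specified, but the formulas here are the textbook ones, not taken from Stata's documentation.

```python
import math
from statistics import NormalDist

nd = NormalDist()
p1, p2, alpha, power = 0.70, 0.85, 0.05, 0.90
z_a = nd.inv_cdf(1 - alpha / 2)
z_b = nd.inv_cdf(power)
p_bar = (p1 + p2) / 2
delta = abs(p2 - p1)

# Per-arm sample size without continuity correction
n_plain = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / delta**2

# Per-arm sample size with a continuity correction
n_cc = n_plain / 4 * (1 + math.sqrt(1 + 4 / (n_plain * delta))) ** 2

print(math.ceil(n_plain), math.ceil(n_cc))
```

The continuity correction adds roughly a dozen participants per arm in this setting.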
Tutorial: Contingency Tables

A well-known statistician once said, "A PhD student could write an entire dissertation on two-by-
two tables only." Continuing our health disparities research, we now consider the odds ratio and
the Pearson Chi-square test.

Exercises

1. Using data from the 500 respondents of the CHIS survey, construct a 2x2 table
comparing poverty level versus past doctor visit. Display the row frequencies and the
expected cell counts.

. tabulate poverty doctor, row expected

2. Construct the odds ratio and corresponding 95% confidence interval (CI) for visiting
the doctor in the past 12 months for those above and below the poverty line.

. gen nopov = 1 - poverty

. cs doctor nopov, or woolf

Notice that the Woolf option is used, denoting that we want standard errors calculated
using the formula presented in class.

3. Conduct a Pearson's chi-square test to examine the association between poverty and
prior doctor visit.

. cs doctor poverty, or woolf

OR use tabulate.

. tabulate poverty doctor, expected chi2

Note that tabulate extends nicely to R x C tables.

. tabulate racecat doctor, expected chi2

What are the null and alternative hypotheses?

Null: no association between above/below poverty line and whether an individual
visited the doctor in the past 12 months.
Alternative: there is an association.

Null: OR = 1
Alternative: OR $\neq$ 1

Are the expected cell counts sufficiently large?

All expected cell counts are greater than 5.

What is the value and distribution of the test statistic under the null hypothesis?

$X^2 = 5.1 \sim \chi^2_1$

What is the p-value?

p = 0.024.



Do you reject the null hypothesis? What is your conclusion?

We reject the null hypothesis and conclude that there is evidence in the data that the
odds of visiting the doctor in the past 12 months are higher in those who are above
the poverty line.
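For a 2x2 table, a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so the p-value can be recovered from the normal distribution. This is a minimal Python sketch using the rounded statistic reported above; the variable names are illustrative.

```python
import math
from statistics import NormalDist

chi2 = 5.1                                   # Pearson statistic reported above
z = math.sqrt(chi2)                          # equivalent |Z| on the normal scale
p_value = 2 * (1 - NormalDist().cdf(z))      # upper tail of chi-square(1)

print(round(z, 2), round(p_value, 3))
```

The implied |Z| of about 2.26 is consistent, up to rounding, with the Z statistic from the two-sample proportion test, which is why the two tests report the same p-value.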

4. For the above and below poverty groups, compare the following:
95% CI for the odds ratio
95% CI for the difference in two proportions (from the previous tutorial)
the p-value from the two-sample proportion test (from the previous tutorial)
the p-value from the Pearson Chi-square test

(a) Do you get the same general conclusion with each test?

Difference in proportions between above and below poverty groups:

12.0 with 95% CI (0.2, 24.0)

Odds ratio for above and below groups

2.0 with 95% CI (1.1, 3.5)

Pearson Chi-square

p = 0.024

Two sample proportion test

p = 0.024

(b) Which test do you find most useful?


It's always good to show both a p-value and a confidence interval. (Note that you can't
show a confidence interval for the risk difference if you have a case-control study!)

Tutorial: Inference for Paired Data using McNemar's Test
Part 1


Consider the following study from Dekkers et al. (2011) that compared two different screening
tests for determining adrenal insufficiency. Adrenal insufficiency is a condition in which the
adrenal glands do not produce adequate amounts of certain hormones. The screening test
involves measuring a patient's cortisol response after administration of an intravenous bolus of
adrenocorticotropic hormone (ACTH).

Currently, two doses of ACTH are used for diagnostic purposes in patients with suspected
adrenal insufficiency: 1 μg and 250 μg (Dekkers et al. 2011). There is an ongoing debate about
which dose should be used for the initial assessment of adrenal function (Dekkers et al. 2011).

The goal of this study was to compare the cortisol response of the 1 μg and 250 μg ACTH test
among patients with suspected adrenal insufficiency. Patients with cortisol concentrations of
at least 550 nmol/l after ACTH stimulation (considered a normal cortisol response) were classified as not
having adrenal insufficiency. This was a retrospective cohort study whereby patients who
received both the 1 μg and 250 μg ACTH test between January 2004 and December 2007 were
included for analysis. The data can be found in the AI.dta dataset.

Source: Dekkers OM, Timmermans JM, Smit JW, Romijn JA, Pereira AM. Comparison of the cortisol
responses to testing with two doses of ACTH in patients with suspected adrenal insufficiency. Eur J
Endocrinol 2011 Jan;164(1):83-7

1. Since this is paired data, we decide to use McNemar's test. State the null and alternative
hypothesis for McNemar's test.

Null: The proportion of patients classified as having adrenal insufficiency using the 1 μg
test is the same as the proportion of patients classified as having adrenal insufficiency
using the 250 μg test.
Alternative: Those proportions are not equal.

Is this the same as testing that the proportion of patients classified as not having adrenal
insufficiency using the 1 μg test is the same as the proportion of patients classified as
not having adrenal insufficiency using the 250 μg test?

2. Use the tabulate command to summarize the data.

. tabulate one two



| two
one | Abnormal Normal | Total
-----------+----------------------+----------
Abnormal | 42 19 | 61
Normal | 14 132 | 146
-----------+----------------------+----------
Total | 56 151 | 207


a. How many discordant pairs are there?



19 + 14 = 33

3. Carry out McNemar's test in Stata at the $\alpha = 0.05$ significance level.

. mcc one two


| Controls |
Cases | Exposed Unexposed | Total
-----------------+------------------------+------------
Exposed | 132 14 | 146
Unexposed | 19 42 | 61
-----------------+------------------------+------------
Total | 151 56 | 207

McNemar's chi2(1) = 0.76 Prob > chi2 = 0.3841
Exact McNemar significance probability = 0.4869

Proportion with factor
Cases .705314
Controls .7294686 [95% Conf. Interval]
--------- --------------------
difference -.0241546 -.0832778 .0349687
ratio .9668874 .8962794 1.043058
rel. diff. -.0892857 -.2991256 .1205541

odds ratio .7368421 .3418529 1.550025 (exact)





a. What is the test statistic? Null distribution? P-value?

The test statistic is 0.76. The null distribution of the test statistic is chi-squared
with 1 degree of freedom. The p-value is 0.3841. Note: there is an exact test
version of McNemar's test based on the binomial distribution leading to a p-value
of 0.4869, which was the p-value reported in the paper.




b. What is your conclusion?

Since our p-value is greater than 0.05 we fail to reject the null hypothesis. Thus,
we have no evidence that the proportion of patients classified as having adrenal
insufficiency using the 1 μg test is different from the proportion of patients
classified as having adrenal insufficiency using the 250 μg test.
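The large-sample McNemar statistic depends only on the two discordant cells, so it is easy to verify by hand. This is a minimal Python sketch using the discordant counts from the table above; the variable names are illustrative.

```python
import math
from statistics import NormalDist

b, c = 19, 14                         # discordant pairs from the 2x2 table above

chi2 = (b - c) ** 2 / (b + c)         # McNemar statistic, 1 degree of freedom
p_value = 2 * (1 - NormalDist().cdf(math.sqrt(chi2)))   # chi-square(1) upper tail
odds_ratio = c / b                    # ratio of the discordant counts

print(round(chi2, 2), round(p_value, 4), round(odds_ratio, 4))
```

The 174 concordant pairs (42 + 132) do not enter the calculation at all.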





Tutorial: Inference for Matched Data using McNemar's Test

To incorporate more individual information into our analysis, we match individuals who were
below the poverty line to an individual who was above the poverty line based on: age, urban vs.
rural location, race, and gender. (Note that we could incorporate more covariates to improve
the matches.)

We conduct McNemar's test to examine the relationship between poverty and doctor visits
among matched pairs. Open the dataset chis_matched.dta.

. mcc doctor_0 doctor_1


1. State the null and alternative hypothesis for McNemar's test.

Null: there is no association between poverty and visiting the doctor in the past 12
months
Alternative: there is an association between poverty and visiting the doctor in the past
12 months.

(A subtle sidenote: we are now generalizing to a slightly different population. Because of
the way we implemented our matching scheme, we are no longer making inference
about all California residents. Rather, we are making inference with respect to the
population with a covariate pattern (age, race, location, and gender) similar to the
population below the poverty level.)

2. How many pairs contribute to the test statistic?

Only discordant pairs contribute to the test statistic.

2 + 15 = 17

Due to the small sample size (number of discordant pairs less than 20), the normal
approximation is dubious in this instance. There is an exact test based on the binomial
distribution, which does not rely on large sample approximations.


3. Using a large sample test, what is the test statistic? Null distribution? P-value?
Compare to the exact test.

$X^2 = 9.94 \sim \chi^2_1$


p = 0.0016

Note that the more conservative exact test results in a p-value of 0.0023 (similar to the
large-sample result).


4. What is the odds ratio? Compare to the OR from the non-matched analysis.

7.5 (1.7, 67.6)


From the non-matched analysis, the OR was 2.0, with 95% CI (1.1, 3.5).
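With only 17 discordant pairs, the exact test is a two-sided binomial test with success probability 0.5 on the discordant pairs. The Python sketch below recomputes the exact p-value, the large-sample statistic and the matched odds ratio from the discordant counts 2 and 15. Doubling the smaller tail is the convention assumed here; the variable names are illustrative.

```python
from math import comb

b, c = 2, 15                          # discordant pairs in the matched data
n = b + c

# Exact test: under H0 the smaller discordant count is Binomial(n, 0.5)
tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2**n
p_exact = min(1.0, 2 * tail)

chi2 = (b - c) ** 2 / n               # large-sample McNemar statistic
odds_ratio = max(b, c) / min(b, c)    # ratio of the discordant counts

print(round(p_exact, 4), round(chi2, 2), odds_ratio)
```

The exact p-value is somewhat larger than the large-sample one, but both lead to the same decision.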


5. Comparing the results of the McNemar's test to the Pearson Chi-square test, consider
the following question. Even though we have decreased the sample size, do we gain
power by matching? Which test provides stronger evidence that poverty impacts
whether or not someone goes to the doctor each year?

Because we are comparing doctor visits among similar individuals, we gain some power
by matching.

Survey Data Analysis in Stata
Example
Real-world, publicly available survey data is often very complex (see the DHS example).
Consequently, we will contrive an example for this tutorial, estimating p, the prevalence of a
disease, say malaria, in a hypothetical country, called Inventia.
Country profile:
Province    Population size    Number of districts
1           225,000            50
2           150,000            42
3           100,000            32
4           25,000             23
Total       500,000            146
In Inventia, the climate differs between provinces; for instance, province 4 is more arid and
at a higher altitude than the rest of the country. Consequently, the prevalence of malaria p
differs between provinces. Also, access to malaria prevention is not consistent across
the country, and consequently p may also vary somewhat between districts. (For instance,
urban populations may have more resources to prevent malaria, and thus a lower prevalence.)
The true prevalence of malaria in Inventia is 13.1%.
Today, we review how to analyze data from several different survey designs:
Simple Random Sampling - We randomly sample 1,000 people from Inventia.
Stratified Sampling - We randomly sample 250 people from each of the 4 provinces of Inventia.
Cluster Sampling - We randomly sample 25 districts from Inventia and randomly sample 40 people
within each district.
Stratified Cluster Sampling - For each of the 4 provinces, we randomly sample 5 districts.
Within each of these 20 districts, we randomly sample 50 people.
Analyzing Survey Data in Stata
In order to analyze survey data in Stata, you must first svyset your data. This command
tells Stata what survey design was used to obtain the data. This includes specification of survey
weights, the finite population correction(s), and levels of clustering and stratification.
Once Stata has this information, it incorporates the specified design elements into its
calculations. You can then use the survey estimation procedures in Stata. For example, svy: mean
varname, svy: proportion varname, svy: regress ....
Before analyzing your survey data, you need to be able to answer the following questions:
1. What is the design of my survey?
2. Am I using a finite population correction? At which stage of the design?
3. What are the survey weights used in the design?
Once you know these things, you can start analyzing your data in Stata.
1 Simple Random Sampling
Design: We randomly sample 1,000 people from the entire country of Inventia.
Notation:
N is the total population size
n is the number of individuals sampled from the population without replacement
In our case, n = 1,000 and N = 500,000.
Finite Population Correction: $1 - f = 1 - \frac{n}{N}$
Survey Weights: $w_i = P(\text{individual } i \text{ is included in the survey})^{-1} = \frac{N}{n}$
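
As a quick numerical illustration of these two definitions, the sketch below plugs in the design constants of the simple random sample. The scalar names are illustrative only and are not variables in srs.dta.

```stata
* design constants for the simple random sample
scalar srs_N = 500000
scalar srs_n = 1000

* survey weight: number of people each respondent represents
scalar srs_w = srs_N / srs_n
* sampling fraction f and finite population correction 1 - f
scalar srs_f = srs_n / srs_N

display "weight N/n = " srs_w
display "f = n/N    = " srs_f
display "1 - f      = " 1 - srs_f
```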
Exercise: Estimate the prevalence of malaria in Inventia.
use "srs.dta", clear
generate weight_srs = pop_size/1000
* fpc holds the sampling rate n/N rather than 1 - f; fpc() accepts rates <= 1
generate fpc = 1000/pop_size
svyset id [pweight=weight_srs], fpc(fpc)
svy: proportion malaria
svyset id [pweight=weight_srs]
svy: proportion malaria
estat effects, deff
proportion malaria
Under simple random sampling (SRS), when will "proportion malaria" and "svy: proportion
malaria" give you the same results? Why?
Why does it not matter much if you use the finite population correction in this example?
Exercise: Estimate the prevalence of malaria in each of the four provinces.
svy, subpop(if province==1): proportion malaria
svy, subpop(if province==2): proportion malaria
svy, subpop(if province==3): proportion malaria
svy, subpop(if province==4): proportion malaria
Is there evidence of province-level variation in malaria prevalence?
2 Stratified Sampling
Design: We randomly sample 250 people from each of the 4 provinces of Inventia.
Notation:
N is the total population size
$N_j$ is the population in province j, j = 1, 2, 3, 4
$n_j$ individuals are sampled from province j
The important design question in stratified sampling is how to choose the sample size within
each stratum. In our case, $N_1 = 225{,}000$, $N_2 = 150{,}000$, $N_3 = 100{,}000$, and
$N_4 = 25{,}000$, with $n_j = 250$ for each j.
Finite Population Correction: $1 - f_j = 1 - \frac{n_j}{N_j}$
Survey Weights: $w_{ij} = P(\text{individual } i \text{ in stratum } j \text{ is in the survey})^{-1} = \frac{N_j}{n_j}$
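
To see what these weights look like in numbers, the sketch below rebuilds the province table as a four-row data set and applies the formulas. It reuses the variable names prov_size, weight_stratified, and fpc_stratified from the exercise; the variable "represented" is an illustrative addition. Note that "clear" drops whatever data are in memory.

```stata
clear
input province prov_size
1 225000
2 150000
3 100000
4 25000
end

* weight N_j/n_j and sampling fraction n_j/N_j with n_j = 250
generate weight_stratified = prov_size/250
generate fpc_stratified = 250/prov_size
list, noobs

* the 250 sampled people per province should together represent N = 500,000
generate represented = 250*weight_stratified
summarize represented
display "total represented = " r(sum)
```

The weights differ by a factor of nine between province 1 and province 4, which is why the design cannot be ignored when pooling the provinces.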
Exercise: Estimate the prevalence of malaria in Inventia.
use "stratified.dta", clear
proportion malaria
proportion malaria, over(province)
generate weight_stratified = prov_size/250
generate fpc_stratified = 1/weight_stratified
svyset id [pweight=weight_stratified], strata(province) fpc(fpc_stratified)
svydescribe weight
svy: proportion malaria
estat effects, deff
Exercise: Why is our estimate of p too low when we do not specify the survey design?
3 Cluster Sampling
Design: We randomly sample 25 districts (clusters) from Inventia; within each district, we
randomly sample 40 people.
Notation:
N is the total population size
$N_k$ is the population size in district k, k = 1, ..., 146
$n_I$ out of $N_I$ total districts are sampled for inclusion in the survey (primary sampling unit)
$n_k$ individuals in district k are selected for inclusion in the survey (secondary sampling unit)
In our survey, $n_I = 25$, $N_I = 146$, $n_k = 40$, and $N_k$ is the population size in district k.
Finite Population Correction:
Stage I: $1 - f_I = 1 - \frac{n_I}{N_I}$
Stage II: $1 - f_k = 1 - \frac{n_k}{N_k}$
Survey Weights:
$w_{ik} = P(\text{individual } i \text{ in cluster } k \text{ is in the survey})^{-1}$
$= [P(\text{cluster } k \text{ selected}) \cdot P(\text{individual } i \text{ in cluster } k \text{ selected} \mid \text{cluster } k \text{ selected})]^{-1}$
$= \frac{N_I}{n_I} \cdot \frac{N_k}{n_k}$
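
The two-stage weight can be worked through for a single district. The sketch below assumes a hypothetical district population of 3,000, since the tutorial does not list district sizes; the scalar names are illustrative.

```stata
* stage I: 25 of the 146 districts are sampled
scalar cl_f1 = 25/146
* stage II: 40 people sampled in a district of 3,000 (hypothetical size)
scalar cl_size = 3000
scalar cl_f2 = 40/cl_size

* inclusion probability is the product of the stage-wise sampling fractions
scalar cl_pincl = cl_f1*cl_f2
* the survey weight is its inverse: (146/25)*(3000/40)
scalar cl_w = 1/cl_pincl

display "P(inclusion) = " cl_pincl
display "weight       = " cl_w
```

Under this assumed district size, each sampled person represents 438 people.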
Exercise: Estimate the prevalence of malaria in Inventia, using only the first stage finite
population correction.
use "cluster.dta", clear
generate fpc1 = 25/146
generate fpc2 = 40/districtsize
generate weight_cluster = (fpc1*fpc2)^-1
svyset district [pweight=weight_cluster], fpc(fpc1) || id, fpc(fpc2)
svy: proportion malaria
estat effects, deff
4 Stratified Cluster Sampling
We could combine stratified, cluster and simple random sampling all into one design!
Design: For each of the 4 provinces, we randomly sample 5 districts. Within each of the 20
districts, we randomly sample 50 people.
Survey weights: The weight is the inverse of the inclusion probability. As an example, for
province 2:
$P(\text{person } i \text{ in district } j \text{ in province 2 is in the survey})$
$= P(\text{district } j \text{ in survey} \mid \text{province 2}) \cdot P(\text{person } i \text{ in survey} \mid \text{district } j)$
$= \frac{5}{42} \cdot \frac{50}{\text{districtsize}_j}$
Finite population correction (expressed as sampling fractions):
Stage I: $\frac{\#\,\text{sampled districts}}{\text{total } \#\,\text{districts in the province}}$
Stage II: $\frac{\#\,\text{sampled per district}}{\text{district population}} = \frac{50}{\text{districtsize}_j}$
for district j.
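
The same arithmetic can be tabulated for all four provinces. The sketch below rebuilds the district counts from the country profile and, because district populations are not listed in the tutorial, assumes a hypothetical district size of 3,000 throughout. It reuses the names ndistrict, fpc1, fpc2, and weight_stratcluster from the exercise. Note that "clear" drops whatever data are in memory.

```stata
clear
input province ndistrict
1 50
2 42
3 32
4 23
end

* stage I sampling fraction: 5 districts sampled per province
generate double fpc1 = 5/ndistrict
* stage II sampling fraction: 50 people in a district of 3,000 (hypothetical size)
generate double fpc2 = 50/3000
* survey weight: inverse of the overall inclusion probability
generate double weight_stratcluster = 1/(fpc1*fpc2)
list, noobs
```

For province 2 this gives (42/5) x (3000/50) = 504 under the assumed district size.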
Exercise: Estimate the prevalence of malaria in Inventia.
use "stratifiedcluster.dta", clear
generate fpc1 = 5/ndistrict
generate fpc2 = 50/districtsize
generate weight_stratcluster = (fpc1*fpc2)^-1
svyset district [pweight=weight_stratcluster], fpc(fpc1) strata(province) || id, fpc(fpc2)
svy: proportion malaria
estat effects, deff
Tutorial: Non-response bias in surveys
Non-response is a huge issue in many surveys (Groves and Peytcheva, 2008). Survey
non-response leads to significant bias if response is correlated with the survey indicators of
interest. We use a simple example from the Framingham study to illustrate this concept.
Source: Groves, R.M. and Peytcheva, E. (2008). The impact of nonresponse rates on nonresponse
bias. Public Opinion Quarterly, 72(2): 167-89.
(I found a free draft via Google.)
Example:
Suppose blood samples from the participants at baseline got lost; rather than measure
everyone in the population again, the study investigators decided to try to estimate the
baseline prevalence of high cholesterol (cholesterol > 240 mg/dL). They randomly sampled
400 individuals and asked them to return to the study center to have their cholesterol
measured again, knowing that not all 400 would return for the re-test.
The willingness of a participant to revisit the lab was correlated with the frailty of the
individual, sex, and prior knowledge of high cholesterol. With a lot of missing data, we
would expect to obtain biased estimates of high cholesterol prevalence.
Prevalence of high cholesterol at baseline was 43.1% in the Framingham cohort.
We consider three different scenarios:
A. Low response rate
B. Moderate response rate
C. High response rate
Exercise: Calculate the prevalence of high cholesterol for each of the three response rate
settings, as well as for the complete sample of 400 individuals.
proportion highchol
proportion highcholA highcholB highcholC
proportion highcholA
proportion highcholB
proportion highcholC
As suspected, bias increases with the amount of missingness.
We have baseline covariate data from the Framingham study. We can estimate the probability
that a sampled individual returns to have his/her cholesterol tested again as a function of
these covariates.
If we knew these probabilities exactly, we could obtain an unbiased estimate of high
cholesterol prevalence at baseline. In this example, we do have these probabilities (pA, pB,
and pC in the dataset).
Exercise: Calculate the prevalence of high cholesterol for each of the three response rate
settings using the survey weights, and compare to the complete-case data.
gen wA = 1/pA
gen wB = 1/pB
gen wC = 1/pC
proportion highchol
proportion highcholA [pweight=wA]
proportion highcholB [pweight=wB]
proportion highcholC [pweight=wC]
Here we recovered unbiased estimates. However, in practice we will never know pA, pB, and
pC exactly. Many methods have been developed to address survey non-response, including
multiple imputation and weighting for non-response. Maximizing the response rate is always
the best policy.
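
Why weighting by the inverse response probability removes the bias can be seen in a toy data set. The sketch below uses eight made-up individuals (not Framingham data): half have high cholesterol and respond with probability 0.5, and the other half always respond. The variable names p, responded, and w are illustrative. Note that "clear" drops whatever data are in memory.

```stata
clear
* hypothetical sample: highchol status, response probability p, response indicator
input highchol p responded
1 0.5 1
1 0.5 1
1 0.5 0
1 0.5 0
0 1 1
0 1 1
0 1 1
0 1 1
end

* full sample: true prevalence is 4/8 = 0.5
mean highchol
* respondents only, unweighted: 2/6, biased downward
mean highchol if responded
* respondents only, weighted by 1/p: (2*2)/(2*2 + 4*1) = 0.5
generate w = 1/p
mean highchol if responded [pweight=w]
```

Each responding person with high cholesterol stands in for one non-responder like them, which restores the full-sample prevalence.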

Tutorial: Correlation in Stata

The World Bank (http://data.worldbank.org/) is a great source of free public data on
trends in health and economics around the world. In this example, we use public data from the
World Bank to examine trends in immunization coverage for measles and DPT over time in low
income countries. Open the dataset WorldBank.dta.
Calculate the pairwise correlations between measles vaccination coverage, DPT
vaccination coverage, and time.

pwcorr measles dpt year

Make a scatterplot showing both measles and DPT immunization coverage against time on the
same plot. Does the plot explain the results above?

twoway (scatter dpt year) (scatter measles year)
Yes, there is a very strong positive relationship between time and immunization
coverage. Further, it seems evident that trends in scaling up in immunization were
similar for measles and DPT.
Test whether there is a linear relationship between time and measles vaccination
coverage. What are the null and alternative hypotheses? What is your conclusion?

pwcorr measles year, sig

Statistics > Summaries, tables, and tests > Summary and descriptive statistics >
Pairwise correlations

Test for a monotonic relationship between time and measles vaccination coverage.
What are the null and alternative hypotheses? What is your conclusion?

spearman measles year

Statistics > Summaries, tables, and tests > Nonparametric tests of hypotheses >
Spearman's rank correlation
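
The difference between a linear and a monotonic relationship can be seen in a small made-up example, unrelated to the World Bank data. In the sketch below, y always increases with x but not along a straight line; the variable names x and y are illustrative. Note that "clear" drops whatever data are in memory.

```stata
clear
set obs 5
generate x = _n
generate y = x^3

* Pearson correlation: below 1 because the relationship is not linear
correlate x y
display "Pearson r    = " %5.3f r(rho)

* Spearman correlation: equal to 1 because the ranks of x and y agree perfectly
spearman x y
display "Spearman rho = " %5.3f r(rho)
```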

Why do you think the correlations are so high in this example? Should you always have
such high aspirations regarding the magnitude of your correlation coefficients when
analyzing public health data?

Source: Created from: World Bank, World Development Indicators and Global Development
Finance. Vaccination coverage from WHO and UNICEF.

Tutorial: Non-Parametric Tests in Stata
The Sign Test and Wilcoxon Signed-Rank Test

Consider the following table taken from Whitley and Ball (2002) showing central venous oxygen
saturation in 10 patients at admission and 6 hours after admission to an intensive care unit
(ICU).

Table 1: Central Venous Oxygen Saturation (%)
Subject    At admission    6 hours after admission to ICU
1          39.7            52.9
2          59.1            56.7
3          56.1            61.9
4          57.7            71.4
5          60.6            67.7
6          37.8            50.0
7          58.2            60.7
8          33.6            51.3
9          56.0            59.5
10         65.3            59.8

E. Whitley and J. Ball. Statistics review 6: Nonparametric methods. Crit Care. 2002; 6(6): 509-513.

It is hypothesized that after 6 hours in the ICU central venous oxygen saturation should
increase. The authors are interested in whether the apparent increase in central venous oxygen
saturation is likely to reflect a genuine effect of admission and treatment or whether it is simply
due to chance.

The data are located in the CVOS.dta data set. In this example we want to know whether there
is a difference in central venous oxygen saturation at admission compared to 6 hours after
admission to the ICU. That is, we want to know whether 6 hours in the ICU has an effect on
central venous oxygen saturation.
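
If CVOS.dta is not at hand, the ten pairs in Table 1 can be typed in directly. The sketch below assumes the variable names t0 (at admission) and t6 (6 hours after admission) that appear in the Stata output later in this tutorial; "subject" and "diff" are illustrative names. Note that "clear" drops whatever data are in memory.

```stata
clear
input subject t0 t6
1 39.7 52.9
2 59.1 56.7
3 56.1 61.9
4 57.7 71.4
5 60.6 67.7
6 37.8 50.0
7 58.2 60.7
8 33.6 51.3
9 56.0 59.5
10 65.3 59.8
end

* paired differences: positive values mean saturation rose after admission
generate diff = t6 - t0
list, noobs

* number of increases and decreases among the 10 pairs
count if diff > 0
count if diff < 0
```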

1. Are the data independent or dependent? What parametric and nonparametric tests are
available for this type of data?

Dependent: We measure central venous oxygen saturation at admission and 6 hours
after admission on the same subject.
Parametric test: paired t-test
Non-parametric tests: sign test, Wilcoxon Signed-Rank Test


2. What type of statistical test is most appropriate for these data and why?

We should probably use a non-parametric test since we have a small sample size.
Furthermore, we have no information to suggest that the differences are normally
distributed. You could also make a histogram of the differences to inspect normality.


3. Suppose we decide to use the sign test. What are the null and alternative hypotheses?

The null hypothesis is that the median of the differences is equal to zero. The alternative
is that the median of the differences is not equal to zero.



4. Perform a sign test in Stata at alpha = 0.05. What is the value of your test statistic? Your
p-value? Your decision? Your interpretation?

You may use the following drop-down menus to access the signtest command:
Statistics / Summaries, tables, and tests / Nonparametric tests of hypotheses / Test
equality of matched pairs.


. signtest t6=t0

Sign test

sign | observed expected
-------------+------------------------
positive | 8 5
negative | 2 5
zero | 0 0
-------------+------------------------
all | 10 10

One-sided tests:
Ho: median of t6 - t0 = 0 vs.
Ha: median of t6 - t0 > 0
Pr(#positive >= 8) =
Binomial(n = 10, x >= 8, p = 0.5) = 0.0547

Ho: median of t6 - t0 = 0 vs.
Ha: median of t6 - t0 < 0
Pr(#negative >= 2) =
Binomial(n = 10, x >= 2, p = 0.5) = 0.9893

Two-sided test:
Ho: median of t6 - t0 = 0 vs.
Ha: median of t6 - t0 != 0
Pr(#positive >= 8 or #negative >= 8) =
min(1, 2*Binomial(n = 10, x >= 8, p = 0.5)) = 0.1094


Our test statistic is D = 8, since we have eight plus signs (positive differences). Stata uses
the binomial distribution to generate the p-value. Our two-sided p-value is 0.1094. Thus, we
fail to reject the null hypothesis at alpha = 0.05: using the sign test, we do not find evidence
that median central venous oxygen saturation differs between admission and 6 hours after
admission to the ICU.
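
As a cross-check outside Stata, the binomial p-values can be reproduced by hand. The following is a minimal Python sketch, assuming only the ten pairs listed in Table 1; the variable and function names (`t0`, `t6`, `upper_tail`) are illustrative and not part of the tutorial's data set.

```python
from math import comb

# Paired measurements from Table 1 (central venous oxygen saturation, %).
t0 = [39.7, 59.1, 56.1, 57.7, 60.6, 37.8, 58.2, 33.6, 56.0, 65.3]  # at admission
t6 = [52.9, 56.7, 61.9, 71.4, 67.7, 50.0, 60.7, 51.3, 59.5, 59.8]  # 6 hours later

# Differences t6 - t0, rounded to one decimal to avoid floating-point noise.
diffs = [round(after - before, 1) for before, after in zip(t0, t6)]
n_pos = sum(d > 0 for d in diffs)   # number of plus signs
n_neg = sum(d < 0 for d in diffs)   # number of minus signs
n = n_pos + n_neg                   # zero differences would be dropped


def upper_tail(n, k):
    """P(X >= k) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n


p_one_sided = upper_tail(n, n_pos)
p_two_sided = min(1.0, 2 * upper_tail(n, max(n_pos, n_neg)))

print(n_pos, n_neg)
print(round(p_one_sided, 4), round(p_two_sided, 4))
```

The counts and both p-values should agree with the signtest output above (8 positive, 2 negative, 0.0547 one-sided, 0.1094 two-sided).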




5. Suppose that instead of conducting the sign test we conduct the Wilcoxon signed-rank
test. Which test has more power? Why?

The signed-rank test has more power since it incorporates the magnitude of differences
via the ranks.

6. State the null and alternative hypotheses for the Wilcoxon signed-rank test.



The null hypothesis is that the median of the differences is equal to zero. The alternative
is that the median of the differences is not equal to zero.

7. Perform a signed-rank test in Stata at the alpha = 0.05 level. What is the value of your
test statistic? Your p-value? Your decision? Your interpretation?

You may use the following drop-down menus to access the signrank command:
Statistics / Summaries, tables, and tests / Nonparametric tests of hypotheses / Wilcoxon
matched-pairs signed-rank test



. signrank t6=t0

Wilcoxon signed-rank test

sign | obs sum ranks expected
-------------+---------------------------------
positive | 8 50 27.5
negative | 2 5 27.5
zero | 0 0 0
-------------+---------------------------------
all | 10 55 55

unadjusted variance 96.25
adjustment for ties 0.00
adjustment for zeros 0.00
----------
adjusted variance 96.25

Ho: t6 = t0
z = 2.293
Prob > |z| = 0.0218

Our test statistic is z = 2.293. The two-sided p-value is 0.0218, which is below alpha = 0.05.
Therefore, we reject the null hypothesis. Thus, we have evidence that median central venous
oxygen saturation is different at admission and 6 hours after admission to the ICU. Because
the positive rank sum (50) is well above its expected value (27.5), it appears that central
venous oxygen saturation is higher after 6 hours in the ICU.
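
The rank sums and the z statistic in the signrank output can also be rebuilt from Table 1. This is a sketch in Python under two stated assumptions: there are no tied absolute differences and no zero differences (both hold for these ten pairs), and the p-value uses the normal approximation, as the Stata output does. Names such as `rank_sum_pos` are illustrative.

```python
from math import erfc, sqrt

# Paired measurements from Table 1 (central venous oxygen saturation, %).
t0 = [39.7, 59.1, 56.1, 57.7, 60.6, 37.8, 58.2, 33.6, 56.0, 65.3]  # at admission
t6 = [52.9, 56.7, 61.9, 71.4, 67.7, 50.0, 60.7, 51.3, 59.5, 59.8]  # 6 hours later

diffs = [round(after - before, 1) for before, after in zip(t0, t6)]
nonzero = [d for d in diffs if d != 0]
n = len(nonzero)

# Rank the absolute differences (1 = smallest). This sketch assumes no ties;
# tied values would need average ranks and a variance adjustment.
ordered = sorted(nonzero, key=abs)
assert len({abs(d) for d in ordered}) == n, "ties present: use average ranks"
rank_sum_pos = sum(rank for rank, d in enumerate(ordered, start=1) if d > 0)
rank_sum_neg = sum(rank for rank, d in enumerate(ordered, start=1) if d < 0)

expected = n * (n + 1) / 4                  # expected rank sum under the null
variance = n * (n + 1) * (2 * n + 1) / 24   # variance with no ties or zeros
z = (rank_sum_pos - expected) / sqrt(variance)
p_two_sided = erfc(abs(z) / sqrt(2))        # equals 2 * (1 - Phi(|z|))

print(rank_sum_pos, rank_sum_neg, expected, variance)
print(round(z, 3), round(p_two_sided, 4))
```

Only subjects 2 and 10 have negative differences, and they carry ranks 1 and 4, which is why the negative rank sum is so small.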










Tutorial: Non-Parametric Tests in Stata
The Wilcoxon Rank Sum Test

In this tutorial we will use data from the Digitalis Investigation Group (DIG). Please read the
provided data documentation before continuing with this tutorial (see DIG_Documentation.pdf).
We will replicate one of the analyses from the New England Journal of Medicine paper (see
NEJM_DIG.pdf).

Garg R, Gorlin R, Smith T, Yusuf S, for the Digitalis Investigation Group. The effect of digoxin on mortality and
morbidity in patients with heart failure. N Engl J Med. 1997; 336: 525-533.

In this trial, patients were randomized to either digoxin or placebo. The Wilcoxon rank-sum test
was used to determine if there were any differences between groups in the number of
hospitalizations. The data are located in the dig.dta data set.




1. Examine the distribution of number of hospitalizations by treatment group. Are they
similar? Are they symmetric?




The two distributions are very similar. They are not symmetric, but rather right skewed.




[Figure: histograms of the number of hospitalizations (x-axis, 0 to 40) with density on the
y-axis (0 to .6), one panel per treatment group. Graphs by 0=placebo, 1=treatment.]

2. Does the rank sum test require any assumptions?

Yes. The two samples must be independent and the distributions should have the same
general shape.

3. What is the null hypothesis for the rank sum test? What is the alternative?

The null hypothesis is that the median number of hospitalizations is identical in the two
treatment groups. The alternative is that the median number of hospitalizations differs
between the two treatment groups.

Since we assume the two distributions have the same general shape, a difference of the
medians would imply that the two distributions have the same shape but are shifted in
location.


4. Perform a rank sum test in Stata with alpha = 0.05. What is your test statistic? Your p-
value? Your decision? Your interpretation?

You may use the following drop-down menus to access the ranksum command:
Statistics / Summaries, tables, and tests / Nonparametric tests of hypotheses / Wilcoxon
rank-sum test




. ranksum nhosp, by(trtmt)

Two-sample Wilcoxon rank-sum (Mann-Whitney) test

trtmt | obs rank sum expected
-------------+---------------------------------
0 | 3403 11767615 11571902
1 | 3397 11355786 11551499
-------------+---------------------------------
combined | 6800 23123400 23123400

unadjusted variance 6.552e+09
adjustment for ties -3.811e+08
----------
adjusted variance 6.171e+09

Ho: nhosp(trtmt==0) = nhosp(trtmt==1)
z = 2.491
Prob > |z| = 0.0127


Our test statistic is z = 2.491. The two-sided p-value is 0.0127. Note that this was the p-value
reported in the paper. We reject the null hypothesis. Thus, we conclude that we have evidence
that the median number of hospitalizations differs by treatment group. In fact, since the
observed rank sum in the placebo group exceeds its expected value, we have evidence that there
were significantly more hospitalizations in the placebo group.
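
The z statistic in the ranksum output follows directly from the tabulated rank sums. Here is a hedged Python sketch that uses only numbers printed in the output; the adjusted variance is taken as printed (6.171e+09, rounded to four significant digits), so the result is approximate, and the variable names are illustrative.

```python
from math import erfc, sqrt

# Values copied from the ranksum output above.
n0, n1 = 3403, 3397            # trtmt == 0 (placebo), trtmt == 1 (treatment)
rank_sum0 = 11767615           # observed rank sum for trtmt == 0
adjusted_variance = 6.171e9    # as printed, after the adjustment for ties

N = n0 + n1
expected0 = n0 * (N + 1) / 2                   # expected rank sum under H0
unadjusted_variance = n0 * n1 * (N + 1) / 12   # variance before the ties adjustment

z = (rank_sum0 - expected0) / sqrt(adjusted_variance)
p_two_sided = erfc(abs(z) / sqrt(2))           # normal approximation

print(expected0, f"{unadjusted_variance:.4g}")
print(round(z, 2), round(p_two_sided, 3))
```

The positive sign of z reflects that the placebo group's rank sum is above what the null hypothesis would predict.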
Tutorial: Simple Linear Regression
Open the dataset hospitaldata.dta.
Exercises:
1. Calculate the Pearson correlation for the percent of patients who say their nurse always
communicated well (nursealways) and the percent of patients who would always
recommend the hospital (recommendyes).
pwcorr recommendyes nursealways, sig
These two variables are correlated. However, simple linear regression gives us a more
intuitive measure of the relationship between the two variables. Specifically, we can state:
For a one percent increase in the percent of patients who say their nurse always
communicated well, we would, on average, expect to see a corresponding increase of B% of
patients who would always recommend the hospital. Here B is determined by fitting an
appropriate linear regression model.
2. Now that you have established that these variables are correlated, you decide to fit a linear
regression model to assess the relationship between recommendyes and nursealways.
State your model.
$Y_i$ = percent of patients who always recommend the hospital
$X_i$ = percent of patients who say that the nurse always communicates well
$Y_i = \alpha + \beta X_i + \varepsilon_i$, where $\varepsilon_i \sim N(0, \sigma^2)$. Equivalently, we could write:
$\mu_{y_i} = E(Y_i \mid X_i) = \alpha + \beta X_i$, where $Y_i \sim N(\mu_{y_i}, \sigma^2)$.
The goal is to estimate, and obtain measures of uncertainty for, $\alpha$ and $\beta$. We use the method
of least squares for estimation.
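For reference, the least squares estimates have a closed form. The expressions below are the standard textbook result rather than something stated in this tutorial; the intercept is written as $\alpha$ and the slope as $\beta$ (notation assumed here), and they minimise $\sum_i (Y_i - \alpha - \beta X_i)^2$.

```latex
\hat{\beta} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad
\hat{\alpha} = \bar{Y} - \hat{\beta}\,\bar{X}
```

The second expression shows that the fitted line always passes through the point of means $(\bar{X}, \bar{Y})$.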
3. Construct a scatter plot with nursealways on the x-axis and recommendyes on the y axis.
Use the scatterplot to evaluate the assumptions of simple linear regression.
twoway (scatter recommendyes nursealways)
Assumptions:
Independent observations
Y |X is normally distributed
Homoscedasticity (constant variance)
Linearity
4. Fit the linear regression model. Provide estimates, confidence intervals, and
interpretations of the regression coefficients $\alpha$ and $\beta$.
. regress recommendyes nursealways
Source | SS df MS Number of obs = 3570
-------------+------------------------------ F( 1, 3568) = 2723.72
Model | 144368.851 1 144368.851 Prob > F = 0.0000
Residual | 189118.972 3568 53.0041962 R-squared = 0.4329
-------------+------------------------------ Adj R-squared = 0.4327
Total | 333487.823 3569 93.4401297 Root MSE = 7.2804
------------------------------------------------------------------------------
recommendyes | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
nursealways | 1.159487 .0222169 52.19 0.000 1.115928 1.203046
_cons | -19.21559 1.712829 -11.22 0.000 -22.57381 -15.85737
------------------------------------------------------------------------------
Our fitted regression line is:
$Y_i = -19.2 + 1.16 X_i + \varepsilon_i$, where $\varepsilon_i \sim N(0, 7.3^2)$.
Confidence intervals for $\alpha$ and $\beta$, respectively, are (-22.57, -15.86) and (1.12, 1.20).
For a 1% increase in patients reporting their nurse always communicated well, $\hat{\beta}$ is the
corresponding average increase in the percent of patients who would always recommend the
hospital: 1.16%.
$\alpha$ is the mean value of the response $Y_i$ when $X_i = 0$ and for this example has no
meaningful interpretation. (However, it is necessary for constructing the regression line and
making subsequent predictions.)
5. Test the hypothesis that $H_0: \beta = 0$ versus the alternative that $H_A: \beta \neq 0$.
We find that $\hat{\beta} = 1.16$, $se(\hat{\beta}) = 0.02$, and $t = 52.2$. Under $H_0$,
$t = \hat{\beta} / se(\hat{\beta}) \sim t_{3570-2}$,
and our p-value < 0.0001. Therefore, we reject the null hypothesis and conclude that the
percent of patients who say a nurse always communicates well is positively correlated
with the percent of patients who would always recommend a hospital.
6. What is the value of $R^2$? Interpret this quantity.
$R^2 = 0.433$
43% of the variability among the observed values of recommendyes is explained by the
linear relationship with nursealways.
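The summary quantities in the regress output are linked by simple arithmetic. The sketch below, in Python and using only numbers copied from the output above, recomputes R-squared, the F statistic, the t statistic and the root MSE; the variable names are illustrative, and small discrepancies come from the rounding of the printed values.

```python
# Values copied from the regress output above.
ss_model, ss_total = 144368.851, 333487.823
ms_model, ms_residual = 144368.851, 53.0041962
coef, std_err = 1.159487, 0.0222169   # nursealways row

r_squared = ss_model / ss_total       # share of total variability explained
f_stat = ms_model / ms_residual       # overall F statistic
t_stat = coef / std_err               # t statistic for the slope
root_mse = ms_residual ** 0.5         # estimate of sigma

print(round(r_squared, 4), round(t_stat, 2), round(root_mse, 4))
print(round(f_stat, 1), round(t_stat ** 2, 1))
```

With a single covariate, the F statistic is the square of the slope's t statistic, so the overall F test and the t-test of the slope are the same test.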
7. Examine a residual plot. Using $R^2$ and the plot, does the model appear to fit well? (Are
there any outliers?)
rvfplot
rvpplot nursealways
We don't see any strong trends or outliers in the residual plots.
8. Using the regression line, predict the expected percent of patients who always
recommend the hospital when the reported percent of nurses who always communicate well is
80%. Construct the corresponding 95% confidence interval.
Denote $\hat{Y}_{80}$ as the predicted average percent of patients who always recommend a
hospital among hospitals with patients reporting that nurses always communicate well 80%
of the time.
$\hat{Y}_{80} = -19.2 + 1.16 \times 80 = 73.6$
. lincom _cons + 80*nursealways
( 1) 80*nursealways + _cons = 0
------------------------------------------------------------------------------
recommendyes | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | 73.54339 .1399631 525.45 0.000 73.26897 73.8178
------------------------------------------------------------------------------
A 95% confidence interval for $\hat{Y}_{80}$ is (73.2690, 73.8178).
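The hand calculation (73.6) and the lincom estimate (73.54) differ only because the hand calculation uses rounded coefficients. The Python sketch below illustrates this and rebuilds the interval; it assumes a normal quantile as a stand-in for the t quantile with 3568 degrees of freedom, so the endpoints match Stata's only to about three decimals. Variable names are illustrative.

```python
from statistics import NormalDist

intercept, slope = -19.21559, 1.159487   # _cons and nursealways from regress
x = 80

y_hat_full = intercept + slope * x        # unrounded coefficients
y_hat_rounded = -19.2 + 1.16 * x          # coefficients rounded as in the text

se_mean = 0.1399631                       # Std. Err. from the lincom output
z = NormalDist().inv_cdf(0.975)           # normal approximation to the t quantile
ci_mean = (y_hat_full - z * se_mean, y_hat_full + z * se_mean)

print(round(y_hat_full, 2), round(y_hat_rounded, 2))
print(tuple(round(v, 2) for v in ci_mean))
```

This interval describes uncertainty about the average over hospitals with nursealways = 80, which is why it is so narrow.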
9. For a new hospital with 80% of patients reporting that nurses always communicate well,
predict the percent of patients who will always recommend the hospital. Construct the
corresponding 95% confidence interval.
Denote $\hat{Y}_{80}$ as the predicted percent of patients who always recommend the hospital in
a new hospital where patients report that nurses always communicate well 80% of the
time.
$\hat{Y}_{80} = 73.54339$. To find a confidence interval, we need to account for the additional
uncertainty associated with predicting a new outcome.
$se(\hat{Y}_{80}) = \sqrt{var(\hat{Y}_{80}) + \hat{\sigma}^2} = \sqrt{0.1399631^2 + 7.2804^2} = 7.281745$
. di 73.54339 - invttail(3568, 0.025)*7.281745
59.266589
. di 73.54339 + invttail(3568, 0.025)*7.281745
87.820191
So, a 95% confidence interval for $\hat{Y}_{80}$ is $73.54339 \pm t_{3568, 0.975} \times 7.281745 = (59.27, 87.82)$.
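To see why this interval is so much wider than the one for the mean, it helps to split the prediction variance into its two parts. The Python sketch below uses the numbers quoted above and, as an assumption, replaces the t quantile with a normal quantile, so the limits are approximate. Names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

y_hat = 73.54339          # point prediction from lincom
se_mean = 0.1399631       # standard error of the estimated mean at x = 80
root_mse = 7.2804         # estimate of sigma from the regress output

se_new = sqrt(se_mean ** 2 + root_mse ** 2)
share_residual = root_mse ** 2 / se_new ** 2   # part due to hospital-to-hospital scatter

z = NormalDist().inv_cdf(0.975)   # approximation to invttail(3568, 0.025)
lower, upper = y_hat - z * se_new, y_hat + z * se_new

print(round(se_new, 6), round(share_residual, 4))
print(round(lower, 2), round(upper, 2))
```

Almost all of the prediction variance comes from the residual term, so collecting more hospitals would narrow the interval for the mean but barely change the interval for a single new hospital.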
Indicator Variables and Regression
Suppose a hospital is trying to set a benchmark goal of having patients report that nurses
always communicate well at least 75% of the time. We now define a nurse communication
indicator variable and use simple linear regression to further examine the relationship between
nurse communication and the percentage of patients always recommending the hospital.
Open the dataset hospitaldata.dta.
Exercises:
1. Generate a new variable, highnurse, that equals 1 if a hospital had nursealways ≥ 75%,
and equals 0 if nursealways < 75%.
gen highnurse = .
replace highnurse = 1 if nursealways >= 75 & nursealways <= 100
replace highnurse = 0 if nursealways < 75
2. State your model and evaluate the model assumptions.
$Y_i$ = percent of patients who recommend the hospital always
$D_i$ = 1 if at least 75% of patients at the hospital report that nurses communicate well, and
is 0 otherwise
$Y_i = \alpha + \beta D_i + \varepsilon_i$, where $\varepsilon_i \sim N(0, \sigma^2)$.
The model is identical to a one-way ANOVA; therefore, the assumptions we make are the
same. When we only have two groups, the assumptions are identical to those of the t-test with
equal variances.
3. Fit the model.
xi: regress recommendyes i.highnurse
or
regress recommendyes highnurse
Source | SS df MS Number of obs = 3570
-------------+------------------------------ F( 1, 3568) = 1004.37
Model | 73254.0735 1 73254.0735 Prob > F = 0.0000
Residual | 260233.749 3568 72.9354678 R-squared = 0.2197
-------------+------------------------------ Adj R-squared = 0.2194
Total | 333487.823 3569 93.4401297 Root MSE = 8.5402
------------------------------------------------------------------------------
recommendyes | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
highnurse | 9.980834 .3149346 31.69 0.000 9.363364 10.5983
_cons | 62.86486 .2653319 236.93 0.000 62.34465 63.38508
------------------------------------------------------------------------------
So, our fitted model is $Y_i = 62.9 + 10.0 D_i + \varepsilon_i$, where $\varepsilon_i \sim N(0, 8.5^2)$.
4. Interpret the coefficients.
$\hat{\alpha} = 62.9$ is $E(Y_i \mid D_i = 0)$. The average percent of patients who always recommend a
hospital when less than 75% of patients say nurses always communicated well is 62.9%.
$\hat{\beta} = 10.0$ is $E(Y_i \mid D_i = 1) - E(Y_i \mid D_i = 0)$. Comparing hospitals where at least 75% of
patients say nurses always communicated well with those where less than 75% of the
patients report that nurses always communicate well, the average difference in the percent of
patients who always recommend a hospital was 10 percentage points.
$\hat{\alpha} + \hat{\beta} = 72.8$ is $E(Y_i \mid D_i = 1)$. The average percent of patients who always recommend a
hospital when at least 75% of patients say nurses always communicated well is 72.8%.
5. Test the null hypothesis that there is no difference in the average percent of patients who
always recommend a hospital between hospitals with less than and at least 75% of
patients reporting that nurses always communicate well.
We test $H_0: \beta = 0$ versus $H_A: \beta \neq 0$ using a two-sided test at the 0.05 significance level.
We find that $\hat{\beta} = 10.0$, $se(\hat{\beta}) = 0.3$, and $t = 31.7$. Under $H_0$, $t \sim t_{n-2}$, and p < 0.0001.
We conclude that the average percent of patients who always recommend a hospital is
greater when at least 75% of patients report that nurses always communicate well.
6. Compare the results of the test above to a two-sample t-test with equal variances.
. ttest recommendyes, by(highnurse)
Two-sample t test with equal variances
------------------------------------------------------------------------------
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+--------------------------------------------------------------------
0 | 1036 62.86486 .272132 8.759099 62.33087 63.39886
1 | 2534 72.8457 .1678457 8.449162 72.51657 73.17483
---------+--------------------------------------------------------------------
combined | 3570 69.9493 .1617829 9.666443 69.6321 70.2665
---------+--------------------------------------------------------------------
diff | -9.980834 .3149346 -10.5983 -9.363364
------------------------------------------------------------------------------
diff = mean(0) - mean(1) t = -31.6918
Ho: diff = 0 degrees of freedom = 3568
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 0.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 1.0000
You should notice some striking similarities!
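The similarities can be made explicit: the regression coefficients, root MSE and standard error can all be rebuilt from the group summaries in the ttest output. This Python sketch uses only the numbers printed above; the variable names are illustrative.

```python
from math import sqrt

# Group summaries copied from the ttest output above.
n0, mean0, sd0 = 1036, 62.86486, 8.759099   # highnurse == 0
n1, mean1, sd1 = 2534, 72.8457, 8.449162    # highnurse == 1

intercept = mean0                 # matches _cons in the regression
slope = mean1 - mean0             # matches the coefficient on highnurse

df = n0 + n1 - 2
pooled_var = ((n0 - 1) * sd0 ** 2 + (n1 - 1) * sd1 ** 2) / df
root_mse = sqrt(pooled_var)                        # matches Root MSE
se_slope = sqrt(pooled_var * (1 / n0 + 1 / n1))    # matches Std. Err. of highnurse
t_stat = slope / se_slope

print(round(slope, 3), round(root_mse, 3), round(se_slope, 3), round(t_stat, 2))
```

The t statistic has the opposite sign to the ttest output only because ttest reports mean(0) - mean(1), while the regression coefficient is mean(1) - mean(0).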
Multiple Linear Regression
Now, the hospital aims to assess the impact of nurse communication and hospital noise
level on the percentage of patients who would always recommend the hospital.
Fit a linear regression model with recommendyes as the outcome and nursealways and
quietalways as the covariates.
1. Make a scatter plot of quietalways versus recommendyes.
twoway (scatter recommendyes quietalways)
While the relationship appears linear, note that we cannot assess any of the assumptions
of multiple linear regression using this plot.
2. State your model.
$Y_i$ = percent of patients who recommend the hospital always
$X_{1i}$ = percent of patients in a hospital who say that the nurse always communicates well
$X_{2i}$ = percent of patients who report that the hospital is always quiet
$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$, where $\varepsilon_i \sim N(0, \sigma^2)$. Equivalently, we could write:
$\mu_{y_i} = E(Y_i \mid X_{1i}, X_{2i}) = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i}$, where $Y_i \sim N(\mu_{y_i}, \sigma^2)$.
3. Fit the model.
regress recommendyes nursealways quietalways
Source | SS df MS Number of obs = 3570
-------------+------------------------------ F( 2, 3567) = 1363.40
Model | 144484.252 2 72242.126 Prob > F = 0.0000
Residual | 189003.571 3567 52.9867033 R-squared = 0.4333
-------------+------------------------------ Adj R-squared = 0.4329
Total | 333487.823 3569 93.4401297 Root MSE = 7.2792
------------------------------------------------------------------------------
recommendyes | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
nursealways | 1.133725 .0282517 40.13 0.000 1.078334 1.189116
quietalways | .0229694 .0155642 1.48 0.140 -.0075463 .053485
_cons | -18.58225 1.7655 -10.53 0.000 -22.04374 -15.12075
------------------------------------------------------------------------------
Our fitted model is $Y_i = -18.58 + 1.13 X_{1i} + 0.02 X_{2i} + \varepsilon_i$, where $\varepsilon_i \sim N(0, 7.28^2)$.
4. Evaluate the model assumptions.
The adjusted $R^2$ is 0.43 (compared to 0.43 from the simple linear regression model with
only nursealways).
rvfplot
rvpplot nursealways
rvpplot quietalways
5. Interpret the coefficients.
We estimate $\hat{\beta}_1 = 1.13$, with 95% confidence interval (1.08, 1.19). For a one percent
increase in the patients who say that the nurses always communicate well, we see on
average a 1.13 percent increase in the percent of patients who would always recommend
the hospital, when the percent of patients who say the hospital is always quiet is fixed
(does not vary).
We estimate $\hat{\beta}_2 = 0.02$, with 95% confidence interval (-0.01, 0.05). For a one percent
increase in the patients who say the hospital is always quiet, we see on average a 0.02
percent increase in the percent of patients who would always recommend the hospital,
fixing the percent of patients who say their nurse always communicates well.
We estimate that $\hat{\alpha} = -18.58$. $\alpha$ is the value of $E(Y_i)$ when $X_{1i}$ and $X_{2i}$ are set to 0. In
our dataset, the covariates never drop below 48% and 30% respectively, and therefore
$\alpha$ does not have a meaningful interpretation for this study.
6. Suppose we consider a new hospital, where the percentage of patients who say nurses always
communicate well is 90% and the percentage of those who say the hospital is always quiet is 70%.
What is the expected percent of patients who would always recommend this hospital?
$E(Y_i \mid X_{1i} = 90, X_{2i} = 70) = -18.58 + 1.13 \times 90 + 0.02 \times 70 = 84.5\%$.
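Because the covariate values are large (90 and 70), rounding the coefficients to two decimals shifts the prediction noticeably. The Python sketch below compares the calculation with rounded coefficients against one that uses the coefficients as printed in the regress output; the variable names are illustrative.

```python
# Coefficients copied from the regress output above.
intercept, b_nurse, b_quiet = -18.58225, 1.133725, 0.0229694
x_nurse, x_quiet = 90, 70

# Prediction with the coefficients as printed by Stata.
pred_full = intercept + b_nurse * x_nurse + b_quiet * x_quiet

# Prediction with coefficients rounded to two decimals, as in the text.
pred_rounded = -18.58 + 1.13 * x_nurse + 0.02 * x_quiet

print(round(pred_full, 2), round(pred_rounded, 2))
```

The two answers differ by about half a percentage point, so for reporting it is safer to let Stata do the arithmetic (for example with lincom, as in the simple regression tutorial) than to work from rounded values.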
7. Using the regression results above, perform the following three hypothesis tests at the 0.05
level of significance.
$H_0: \beta_1 = 0$, $H_A: \beta_1 \neq 0$
$\hat{\beta}_1 = 1.13$, $se(\hat{\beta}_1) = 0.03$, $t = 40.1$. Under $H_0$, $t \sim t_{3570-2-1}$, and p < 0.0001. We
reject $H_0$ and conclude that an increase in the percent of patients who say nurses
always communicate well is associated with an increase in the percent of patients who always
recommend the hospital, fixing the percent of patients who say the hospital is always
quiet.
$H_0: \beta_2 = 0$, $H_A: \beta_2 \neq 0$
$\hat{\beta}_2 = 0.02$, $se(\hat{\beta}_2) = 0.02$, $t = 1.5$. Under $H_0$, $t \sim t_{3570-2-1}$, and p = 0.14. We fail to
reject $H_0$ and conclude that we do not have evidence in the data that the
percent of patients who say the hospital is always quiet is associated with the percent
of patients who always recommend the hospital, fixing the percent of patients who
say that the nurses always communicate well.
$H_0: \beta_1 = \beta_2 = 0$, $H_A$: at least one of $\beta_1, \beta_2 \neq 0$
. test nursealways quietalways
( 1) nursealways = 0
( 2) quietalways = 0
F( 2, 3567) = 1363.40
Prob > F = 0.0000
Our F-statistic equals 1363.4. Under $H_0$, $F \sim F_{2,3567}$, and p < 0.0001. We reject $H_0$
and conclude that at least one of $\beta_1$ or $\beta_2$ is non-zero.
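Both F statistics implied by these outputs can be recomputed from the printed sums of squares. The Python sketch below rebuilds the overall F test and, as an extra check not run in the tutorial, the partial F test for adding quietalways to the model that contains only nursealways. Variable names are illustrative.

```python
# Sums of squares copied from the two regress outputs.
ss_model_full, df_model_full = 144484.252, 2
ss_resid_full, df_resid_full = 189003.571, 3567
ss_resid_reduced = 189118.972          # residual SS, model with nursealways only

ms_resid_full = ss_resid_full / df_resid_full

# Overall F test of beta_1 = beta_2 = 0.
f_overall = (ss_model_full / df_model_full) / ms_resid_full

# Partial F test for adding quietalways (one extra parameter).
f_partial = (ss_resid_reduced - ss_resid_full) / 1 / ms_resid_full
t_quiet = 0.0229694 / 0.0155642        # coefficient / standard error

print(round(f_overall, 1), round(f_partial, 2), round(t_quiet ** 2, 2))
```

The partial F statistic equals the square of the t statistic for quietalways, which is why adding that covariate leaves R-squared almost unchanged.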
8. Do we observe any collinearity between $X_{1i}$ and $X_{2i}$? How does this impact the result?
twoway (scatter nursealways quietalways)
Yes, the covariates are collinear. We would likely see an association between $X_{2i}$ and $Y_i$
if $X_{1i}$ were excluded from the model.
More Multiple Linear Regression
For those interested in delving a bit deeper into the world of linear regression, a few
additional examples are included below. In the first example, you can work through a multiple linear
regression model with one binary covariate and one continuous covariate. In the second
example, we add an interaction between these covariates to examine effect modification/interaction
between covariates in the context of linear regression. It is important to think about how the
interpretation of the regression coefficients changes in the presence of an interaction term.
Example 1:
Fit a linear regression model with recommendyes as the outcome and highnurse and quietalways
as the covariates.
1. Make a scatterplot with quietalways on the x-axis and recommendyes on the y-axis.
Stratify by highnurse when you are plotting, so that you can distinguish between hospitals with
highnurse = 0 and highnurse = 1. Overlay a linear prediction line for highnurse = 0 and
highnurse = 1.
Via the dropdown menus, go to Graphics > Twoway graph. Within the two-way window,
you will need to create four different plots: two scatter plots (go to Basic plots > Scatter
and then fill in Y and X variables) and two linear prediction lines (go to Fit plots > Linear
prediction and then fill in Y and X variables). Or, via command line:
twoway (scatter recommendyes quietalways if highnurse==1) ///
(scatter recommendyes quietalways if highnurse==0) ///
(lfit recommendyes quietalways if highnurse==1) ///
(lfit recommendyes quietalways if highnurse==0)
2. State your model.
$Y_i$ = percent of patients who recommend the hospital always
$X_{1i}$ = percent of patients in a hospital who say that the hospital is always quiet
$D_i$ = 1 if at least 75% of patients at the hospital report that nurses communicate well, and
is 0 otherwise
$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 D_i + \varepsilon_i$, where $\varepsilon_i \sim N(0, \sigma^2)$.
3. Fit the model.
regress recommendyes highnurse quietalways
4. Evaluate the model assumptions.
Suggestions: Check the residual plots to look for outliers and heteroskedasticity. Do the
residuals look approximately normal? Patterns in the residual plot could suggest that your
model for the mean of the outcome is misspecified (linearity is violated).
5. Interpret the coefficients.
$\alpha$ - the average percent of patients who always recommend a hospital when
highnurse is 0 and quietalways is 0. $\alpha$ does not have a meaningful interpretation for this
study since quietalways never drops to 0.
$\beta_1$ - the average increase in the percent of patients who always recommend a
hospital for a one percent increase in quietalways, for a given value of highnurse.
$\beta_2$ - the average increase in the percent of patients who always recommend a
hospital for hospitals with highnurse = 1 compared to hospitals with highnurse = 0, fixing
quietalways.
$\alpha + 80\beta_1 + \beta_2$ - the average percent of patients who recommend a hospital with
highnurse = 1 and quietalways = 80.
$\alpha + 80\beta_1$ - the average percent of patients who recommend a hospital with highnurse
= 0 and quietalways = 80.
Example 2 - Multiple Linear Regression with an Interaction
Now, we examine whether there is an interaction between highnurse and quietalways on
recommendyes. Equivalently, we look for evidence of effect modification of the relationship
between quietalways and recommendyes by highnurse.
1. Check out the scatter plot from the previous example. Does the plot suggest that an
interaction term might improve the model?
Yes. We can look for evidence of effect modification by comparing the slopes of the
overlaid lines in the scatter plot. Because the slopes appear to differ by highnurse,
there is evidence of effect modification.
2. State your model.
$Y_i$ = percent of patients who recommend the hospital always
$X_{1i}$ = percent of patients in a hospital who say that the hospital is always quiet
$D_i$ = 1 if at least 75% of patients at the hospital report that nurses communicate well, and
is 0 otherwise
$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 D_i + \beta_3 X_{1i} D_i + \varepsilon_i$, where $\varepsilon_i \sim N(0, \sigma^2)$.
3. Fit the model.
. xi: regress recommendyes i.highnurse*quietalways
4. Evaluate the model assumptions.
You can use the same approach as the previous question.
5. Interpret the coefficients.
$\alpha$ - the average percent of patients who always recommend a hospital when
highnurse is 0 and quietalways is 0. $\alpha$ does not have a meaningful interpretation for this
study since quietalways never drops to 0.
$\beta_1$ - the average increase in the percent of patients who always recommend a
hospital for a one percent increase in quietalways, when highnurse = 0.
$\beta_1 + \beta_3$ - the average increase in the percent of patients who always recommend a
hospital for a one percent increase in quietalways, when highnurse = 1.
$\beta_2$ - the average increase in the percent of patients who always recommend a
hospital for hospitals with highnurse = 1 compared to hospitals with highnurse = 0, when
quietalways equals 0. $\beta_2$ does not have a meaningful interpretation in this analysis.
Note that we could have centered the covariate quietalways around its mean, so that
the coefficient would be more interpretable.
$\beta_2 + 70\beta_3$ - the average increase in the percent of patients who always recommend a
hospital for hospitals with highnurse = 1 compared to hospitals with highnurse = 0,
when quietalways equals 70.
$\alpha + 80\beta_1 + \beta_2 + 80\beta_3$ - the average percent of patients who recommend a hospital
with highnurse = 1 and quietalways = 80.
$\alpha + 80\beta_1$ - the average percent of patients who recommend a hospital with highnurse
= 0 and quietalways = 80.
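All of the combinations above follow from writing out the mean of the outcome separately for each group. The short derivation below uses $\alpha$ for the intercept and $\beta_1, \beta_2, \beta_3$ for the coefficients on quietalways, highnurse and their product (notation assumed here), with $x$ standing for a value of quietalways.

```latex
\begin{aligned}
E(Y_i \mid X_{1i} = x, D_i = 0) &= \alpha + \beta_1 x \\
E(Y_i \mid X_{1i} = x, D_i = 1) &= (\alpha + \beta_2) + (\beta_1 + \beta_3)\, x \\
E(Y_i \mid X_{1i} = x, D_i = 1) - E(Y_i \mid X_{1i} = x, D_i = 0) &= \beta_2 + \beta_3 x
\end{aligned}
```

So $\beta_3$ is the difference in slopes between the two groups, and the gap between the groups depends on $x$; setting $x = 70$ gives the $\beta_2 + 70\beta_3$ quantity listed above.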
Simple Logistic Regression
Think back to Week 7, when we used the sample from the California Health Interview
Survey (CHIS) to examine the relationship between poverty and visiting the doctor within the past
12 months. This week, we use logistic regression to examine this relationship. Open the
chis healthdisparities.dta dataset.
Fit a logistic regression model with visiting the doctor in the past 12 months as the outcome
and the poverty indicator as your covariate.
1. List the assumptions for performing logistic regression.
We assume the responses are Bernoulli, and we assume linearity in the parameters on
the logit scale.
2. State your model.
Define $Y_i$ = 1 if individual i visited the doctor in the last 12 months, 0 otherwise. Define
$X_i$ = 1 if the individual is above the poverty line, 0 otherwise. Then, our model is
$Y_i \sim \text{Bernoulli}(p_i)$, where
$\text{logit}(p_i) = \alpha + \beta X_i$
3. Fit the model.
. logit doctor nopov
Iteration 0: log likelihood = -247.4035
Iteration 1: log likelihood = -245.14765
Iteration 2: log likelihood = -245.08244
Iteration 3: log likelihood = -245.08242
Logistic regression Number of obs = 500
LR chi2(1) = 4.64
Prob > chi2 = 0.0312
Log likelihood = -245.08242 Pseudo R2 = 0.0094
------------------------------------------------------------------------------
doctor | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
nopov | .6713351 .3013476 2.23 0.026 .0807047 1.261965
_cons | .83975 .2745156 3.06 0.002 .3017093 1.377791
------------------------------------------------------------------------------
The fitted regression model is $\text{logit}(\hat{p}_i) = 0.840 + 0.671 X_i$.
4. Interpret the coefficients. (Important)
$\alpha$ = log(odds of visiting the doctor when $X_i$ = 0)
$\beta$ = log(odds ratio of visiting the doctor for no poverty versus poverty) = log(odds of
visiting doctor when $X_i$ = 1) - log(odds of visiting doctor when $X_i$ = 0)
$\alpha + \beta$ = log(odds of visiting the doctor when $X_i$ = 1)
Notes:
On the assumptions: i.e., we have a binary (0 or 1) outcome and observations are independent.
On the model: our logistic regression model is essentially modelling the mean. Here $p_i$
is the mean of $Y_i$, just as in linear regression, where $\mu_{y_i|x_i} = a + b x_i$
and $y_i \sim N(\mu_{y_i|x_i}, \sigma^2)$.
To create the covariate: gen nopov = 1 - poverty
Menu: Statistics > Binary Outcomes > Logistic Regression
5. Provide an OR and a 95% confidence interval.
Hard way: exp($\hat{\beta}$) = 1.957 with 95% CI (exp(0.0807047), exp(1.261965)) = (1.084, 3.532).
Easy way:
. lincom nopov, eform
( 1) [doctor]nopov = 0
------------------------------------------------------------------------------
doctor | exp(b) Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | 1.956848 .5896914 2.23 0.026 1.084051 3.532357
------------------------------------------------------------------------------
Another easy way:
. logistic doctor nopov
Logistic regression Number of obs = 500
LR chi2(1) = 4.64
Prob > chi2 = 0.0312
Log likelihood = -245.08242 Pseudo R2 = 0.0094
------------------------------------------------------------------------------
doctor | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
nopov | 1.956848 .5896914 2.23 0.026 1.084051 3.532357
_cons | 2.315788 .63572 3.06 0.002 1.352168 3.96613
------------------------------------------------------------------------------
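The "hard way" can also be checked outside Stata. The following is a minimal Python sketch, not part of the original tutorial, that exponentiates the coefficient and its Wald limits from the logit output above. The variable names and the normal critical value 1.959964 are assumptions made for the illustration.

```python
import math

# Values copied from the `logit doctor nopov` output above.
beta_hat = 0.6713351   # coefficient for nopov
se_beta = 0.3013476    # its standard error
z_crit = 1.959964      # assumed 97.5th percentile of N(0, 1)

# Exponentiate the estimate and the Wald limits to move to the odds ratio scale.
odds_ratio = math.exp(beta_hat)
ci_low = math.exp(beta_hat - z_crit * se_beta)
ci_high = math.exp(beta_hat + z_crit * se_beta)

print(round(odds_ratio, 3), round(ci_low, 3), round(ci_high, 3))
```

The three numbers should agree with the lincom and logistic output to rounding.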
6. What is the probability of visiting the doctor in the past 12 months for those above poverty?
below poverty?
. predict phat
(option pr assumed; Pr(doctor))
Below poverty: 0.6984126
Above poverty: 0.819222
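Because the model has a single binary covariate, the two predicted probabilities follow directly from the fitted coefficients. A small Python sketch of that arithmetic, assuming only the coefficients printed in the logit output (the helper name inv_logit is hypothetical):

```python
import math

def inv_logit(x):
    """Inverse of the logit function: maps log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Values copied from the `logit doctor nopov` output.
alpha_hat = 0.83975    # _cons: log odds when nopov = 0
beta_hat = 0.6713351   # nopov: log odds ratio

p_below = inv_logit(alpha_hat)             # below poverty (nopov = 0)
p_above = inv_logit(alpha_hat + beta_hat)  # above poverty (nopov = 1)

print(round(p_below, 4), round(p_above, 4))
```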
7. Test the hypothesis $H_0: \beta = 0$ versus $H_A: \beta \neq 0$ at the 0.05 level of significance.
$\hat{\beta}$ = 0.6713351, se($\hat{\beta}$) = .3013476, Z = 2.23.
Under $H_0$, $Z \sim N(0, 1)$, and p = 0.026. We reject $H_0$ and conclude being above the
poverty level is associated with higher odds of visiting the doctor within the past 12
months.
Note that the 95% CI for $\beta$ excludes 0 and the 95% CI for the odds ratio excludes 1,
leading to the same conclusion (as will always be the case).
Notes:
log(p_i / (1 - p_i)) = a + b x_i, and log(p_i / (1 - p_i)) is the log odds, so the intercept is
the log odds when x_i = 0.
beta from above = 0.67, hence e^beta = e^0.67 = 1.96.
Using logistic instead of logit gives us the odds ratio directly. You can also use
lincom nopov, eform or the menu
Statistics > Binary Outcomes > Logistic Regression (Reporting Odds Ratio).
predict command: creates a new variable called phat which contains
the predicted probability of going to the doctor for every individual in the dataset.
tabulate nopov phat
Read the predicted probabilities from the column names.
For the test, you can simply check the p-value in the row for nopov;
if it is significant, there is evidence that beta is non-zero.
Alternatively, if you are looking at the odds ratio and the 95% CI
does not include 1, you can also state that beta is significant.
Note that for the odds ratio the null value is 1, not 0.
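The Wald statistic and its two-sided p-value can be reproduced by hand. This is a minimal Python sketch under the usual large-sample normal approximation; it uses only the estimate and standard error reported in the logit output, and the variable names are illustrative.

```python
import math

beta_hat = 0.6713351
se_beta = 0.3013476

# Wald statistic: estimate divided by its standard error.
z = beta_hat / se_beta

# Two-sided p-value under N(0, 1): P(|Z| > z) = erfc(z / sqrt(2)).
p_value = math.erfc(abs(z) / math.sqrt(2))

print(round(z, 2), round(p_value, 3))
```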
8. Compare your results to the 2 x 2 table analysis from week 7.
Yes, our results match up to the contingency table analysis, as they should! The beauty
of logistic regression is in its flexibility, as we see next.
Multiple Logistic Regression
Now, we expand the regression model, adding in more covariates. Add gender to your
model.
1. First, assume no effect modification by gender. State the model.
Define $Y_i$ = 1 if individual i visited the doctor in the last 12 months, 0 otherwise; $X_{1i}$ = 1
if the individual is above the poverty line, 0 otherwise; $X_{2i}$ = 1 if female, 0 if male. Then,
our model is $Y_i \sim \text{Bernoulli}(p_i)$, where
$\text{logit}(p_i) = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i}$
2. Fit the model.
. logit doctor nopov female
Iteration 0: log likelihood = -247.4035
Iteration 1: log likelihood = -229.36247
Iteration 2: log likelihood = -228.56747
Iteration 3: log likelihood = -228.56462
Iteration 4: log likelihood = -228.56462
Logistic regression Number of obs = 500
LR chi2(2) = 37.68
Prob > chi2 = 0.0000
Log likelihood = -228.56462 Pseudo R2 = 0.0761
------------------------------------------------------------------------------
doctor | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
nopov | .997763 .3245721 3.07 0.002 .3616134 1.633913
female | 1.384033 .2549714 5.43 0.000 .8842978 1.883767
_cons | -.0321554 .3246122 -0.10 0.921 -.6683837 .6040729
------------------------------------------------------------------------------
The fitted regression model is $\text{logit}(\hat{p}_i) = -0.0322 + 0.998 X_{1i} + 1.384 X_{2i}$.
3. Is there evidence of effect modification by gender?
Now, we fit the model
$\text{logit}(p_i) = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i}$
and test whether $\beta_3$ = 0.
. xi: logit doctor i.nopov*female
i.nopov _Inopov_0-1 (naturally coded; _Inopov_0 omitted)
i.nopov*female _InopXfemal_# (coded as above)
Iteration 0: log likelihood = -247.4035
Iteration 1: log likelihood = -229.89318
Iteration 2: log likelihood = -228.56544
Iteration 3: log likelihood = -228.55916
Iteration 4: log likelihood = -228.55916
Logistic regression Number of obs = 500
LR chi2(3) = 37.69
Prob > chi2 = 0.0000
Log likelihood = -228.55916 Pseudo R2 = 0.0762
-------------------------------------------------------------------------------
doctor | Coef. Std. Err. z P>|z| [95% Conf. Interval]
--------------+----------------------------------------------------------------
_Inopov_1 | .9619012 .472267 2.04 0.042 .036275 1.887528
female | 1.329136 .5835434 2.28 0.023 .1854119 2.47286
_InopXfemal_1 | .0678287 .6489728 0.10 0.917 -1.204135 1.339792
_cons | -3.76e-15 .4472136 -0.00 1.000 -.8765225 .8765225
-------------------------------------------------------------------------------
There is no evidence of effect modification by gender.
Notes:
State your model:
y_i = 1 visited doctor, y_i = 0 otherwise
x_1i = 1 above poverty level, x_1i = 0 below poverty level
x_2i = 1 female, x_2i = 0 male
y_i ~ Bernoulli(p_i), and we model p_i on a logit scale.
To create the covariate: gen female = gender - 1. In the dataset gender = 1 for males, so
all males will have female = 0. Check the coding with: table gender female
The model without an interaction does not allow for gender-specific odds ratios.
For that we need an interaction term; its coefficient b3 gives us the gender-specific
odds ratios.
A different way to fit the interaction model:
gen interaction = nopov * female
logit doctor nopov female interaction
4. Is there evidence of confounding by gender?
Without gender: $\hat{\beta}_1$ = 0.671
With gender: $\hat{\beta}_1$ = 0.998
Yes, there is evidence of confounding by gender.
Notes:
On effect modification: b3 has a p-value of 0.917, which means b3 is not significant, so
there is no evidence that the OR of visiting the doctor (above versus below poverty) varies
by gender. Therefore there is no evidence of effect modification.
On confounding: there is a big change in the value of b1 for nopov when we add the female
term, which suggests that gender confounds the relationship between poverty and visiting
the doctor.
You can also check the two associations directly. If you run
logit nopov female
the coefficient of female is -0.75, which shows that females are more likely to be in
poverty, so gender is associated with poverty level. And if you run
logit doctor female
the coefficient of female is 1.25, meaning that females are much more likely to go to the
doctor (note that female = 1 for females and 0 for males).
Logistic Regression with a Continuous Covariate
As in the previous tutorial, we fit a model to examine the relationship between visiting the doctor
in the past 12 months and whether an individual is above or below the federal poverty level,
conditional on gender. We fit a logistic regression model with doctor as the outcome, and with
nopov and female as covariates. But now we add a continuous covariate age to the model!
Open the chis healthdisparities.dta dataset.
1. Assume that, conditional on poverty status and gender, the probability of visiting the doctor
varies linearly on the logit scale with age. State your model.
Define $Y_i$ = 1 if individual i visited the doctor in the last 12 months, 0 otherwise; $X_{1i}$ = 1
if the individual is above the poverty line, 0 otherwise; $X_{2i}$ = 1 if female, 0 if male; and
$X_{3i}$ = age in years. Then, our model is $Y_i \sim \text{Bernoulli}(p_i)$, where
$\text{logit}(p_i) = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i}$
2. Fit the model.
. logit doctor nopov female age
Iteration 0: log likelihood = -247.4035
Iteration 1: log likelihood = -226.31928
Iteration 2: log likelihood = -225.22574
Iteration 3: log likelihood = -225.2222
Iteration 4: log likelihood = -225.2222
Logistic regression Number of obs = 500
LR chi2(3) = 44.36
Prob > chi2 = 0.0000
Log likelihood = -225.2222 Pseudo R2 = 0.0897
------------------------------------------------------------------------------
doctor | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
nopov | .9882978 .3271762 3.02 0.003 .3470442 1.629551
female | 1.334568 .2568062 5.20 0.000 .8312367 1.837899
age | .0187776 .0074311 2.53 0.012 .0042129 .0333423
_cons | -.8066067 .4469253 -1.80 0.071 -1.682564 .0693507
------------------------------------------------------------------------------
The fitted model is $\text{logit}(\hat{p}_i) = -0.807 + 0.988 X_{1i} + 1.335 X_{2i} + 0.019 X_{3i}$
3. Is there evidence that age is a confounder of the doctor-poverty relationship? Would you
expect age to be a confounder?
With gender only: $\hat{\beta}_1$ = 0.998
With age and gender: $\hat{\beta}_1$ = 0.988
No, there is no evidence of confounding by age.
Notes:
State your model:
y_i = 1 visited doctor, y_i = 0 otherwise
x_1i = 1 above poverty level, x_1i = 0 below poverty level
x_2i = 1 female, x_2i = 0 male
x_3i = age (continuous)
b1 is the coefficient for nopov.
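One way to make "a big change" versus "almost no change" in the nopov coefficient concrete is to compute the percent change between models. The sketch below does this in Python with the coefficients from the three logit fits. The 10% threshold is a commonly used rule of thumb and is an assumption here; it does not come from the tutorial.

```python
def pct_change(before, after):
    """Percent change in a coefficient when a covariate is added."""
    return 100.0 * (after - before) / before

# Coefficient of nopov in each fitted model (from the logit outputs).
b1_crude = 0.6713351        # nopov only
b1_gender = 0.997763        # nopov + female
b1_gender_age = 0.9882978   # nopov + female + age

THRESHOLD = 10.0  # assumed rule-of-thumb cut-off, in percent

change_gender = pct_change(b1_crude, b1_gender)
change_age = pct_change(b1_gender, b1_gender_age)

print(round(change_gender, 1), abs(change_gender) > THRESHOLD)
print(round(change_age, 2), abs(change_age) > THRESHOLD)
```

Under that assumed threshold, adding gender shifts the coefficient by roughly half, while adding age shifts it by about one percent, in line with the conclusions drawn in the text.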
4. Interpret the odds ratio.
. lincom nopov, eform
( 1) [doctor]nopov = 0
------------------------------------------------------------------------------
doctor | exp(b) Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | 2.686657 .8790104 3.02 0.003 1.414879 5.101586
------------------------------------------------------------------------------
Conditioning on age and gender, the odds of visiting the doctor are 2.69 times as high
(95% CI 1.41, 5.10) in those above the poverty line as in those below the
poverty line.
5. Test for an association between poverty and visiting the doctor in the past 12 months,
conditioning on age and gender, at the 0.05 level of significance.
We test $H_0: \beta_1 = 0$ versus $H_A: \beta_1 \neq 0$.
$\hat{\beta}_1$ = .988, se($\hat{\beta}_1$) = .327, Z = 3.02.
Under $H_0$, $Z \sim N(0, 1)$, and p = 0.003. (Note: the 95% CI for $\beta_1$ excludes 0 and the 95%
CI for the OR consequently excludes 1.)
We reject $H_0$ and conclude that there is evidence in the data that being above the poverty
line is associated with higher odds of visiting the doctor in the past 12 months, conditioning on
age and gender.
6. Predict the probability of visiting the doctor for everyone in your dataset.
predict phat
7. What is the predicted probability of visiting the doctor for a 45 year old woman above the
poverty level? below the poverty level?
. lincom _cons + age*45 + female + nopov
( 1) [doctor]nopov + [doctor]female + 45*[doctor]age + [doctor]_cons = 0
------------------------------------------------------------------------------
doctor | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | 2.361251 .2244562 10.52 0.000 1.921325 2.801177
------------------------------------------------------------------------------
. lincom _cons + age*45 + female + nopov*0
( 1) [doctor]female + 45*[doctor]age + [doctor]_cons = 0
------------------------------------------------------------------------------
doctor | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | 1.372953 .3101936 4.43 0.000 .7649849 1.980921
------------------------------------------------------------------------------
. di invlogit(2.361251)
.91382437
. di invlogit(1.372953)
.79785684
Above the poverty line: 91.4%
Below the poverty line: 79.8%
Notes:
For the test in question 5 you can also use lincom nopov. b1 is the coefficient of nopov,
which has a p-value < 0.05 and is hence significant.
The first lincom (_cons + nopov*1 + female*1 + age*45) is for a female above poverty, and
gives logit(phat_i) = 2.36. The second is for a female below poverty.
To find phat, take the inverse logit of the linear predictor.
So, for above poverty, the probability of visiting the doctor is about 91%,
and for below poverty, the probability of visiting the doctor is about 80%.
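The lincom and invlogit steps can be mirrored with plain arithmetic. The following Python sketch builds the linear predictor for a 45 year old woman from the coefficients of the age-adjusted model and converts it to a probability. The function names are hypothetical, and only the point estimates are reproduced, not the standard errors that lincom reports.

```python
import math

# Coefficients from `logit doctor nopov female age`.
coef = {"_cons": -0.8066067, "nopov": 0.9882978,
        "female": 1.334568, "age": 0.0187776}

def linear_predictor(nopov, female, age):
    """Log odds of visiting the doctor for the given covariate values."""
    return (coef["_cons"] + coef["nopov"] * nopov
            + coef["female"] * female + coef["age"] * age)

def inv_logit(x):
    """Convert log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

lp_above = linear_predictor(nopov=1, female=1, age=45)
lp_below = linear_predictor(nopov=0, female=1, age=45)
p_above = inv_logit(lp_above)
p_below = inv_logit(lp_below)

print(round(lp_above, 4), round(p_above, 3))
print(round(lp_below, 4), round(p_below, 3))
```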
Recap + Model Fit
Open the chis healthdisparities.dta dataset.
1. After fitting and considering several models, what are our conclusions about the
relationship between poverty and visiting the doctor in the past 12 months?
Those below the poverty line appear less likely to visit the doctor in the past 12 months.
2. Compare the fit of these models.
There are several options for assessing the fit of a logistic regression model. We don't
have time to look at all of them (if you are interested, look at Hosmer-Lemeshow and
deviance). But, to relate back to week 3, let's look at the ROC curve.
Fit the logistic regression model with doctor as the outcome and nopov, female, and
age as covariates.
We choose a cut-off c and construct a classification table:
                     Y_i = 1     Y_i = 0
p-hat_i > c          Correct     False +
p-hat_i <= c         False -     Correct
For example, when c = 0.8:
. estat classification, cutoff(0.8)
Logistic model for doctor
-------- True --------
Classified | D ~D | Total
-----------+--------------------------+-----------
+ | 243 25 | 268
- | 159 73 | 232
-----------+--------------------------+-----------
Total | 402 98 | 500
Classified + if predicted Pr(D) >= .8
True D defined as doctor != 0
--------------------------------------------------
Sensitivity Pr( +| D) 60.45%
Specificity Pr( -|~D) 74.49%
Positive predictive value Pr( D| +) 90.67%
Negative predictive value Pr(~D| -) 31.47%
--------------------------------------------------
False + rate for true ~D Pr( +|~D) 25.51%
False - rate for true D Pr( -| D) 39.55%
False + rate for classified + Pr(~D| +) 9.33%
False - rate for classified - Pr( D| -) 68.53%
--------------------------------------------------
Correctly classified 63.20%
--------------------------------------------------
To get the full ROC curve (and the area under the ROC curve), try lroc.
Plot the ROC curve for the three models above to visualize the improved classification of
the more complex models. We could likely add more covariates to further improve the
discriminatory ability of the model.
Notes:
We can't do residual plots like we did for linear regression and hence need to use
different methods. First fit the model: logit doctor nopov female age
In Stata, the classification table is under
Statistics > Binary Outcomes > PostEstimation > Goodness of Fit after logit ...
(check "Report various summary stats" in the reporting options window and change the
value of Positive Outcome Threshold to 0.8 instead of 0.5).
Here we used 0.8 since the probability of visiting the doctor is very high.
In practice, you should try different cutoff values (0.5, etc.).
The ROC curve is under Statistics > Binary Outcomes > PostEstimation > ROC Curve.
The steeper the curve rises above the diagonal line, the better the model discriminates.
You could then add more covariates and see which model gives you the best ROC curve.
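Every summary that estat classification prints is a ratio of cells in the 2 x 2 classification table. As a minimal sketch, assuming only the four counts shown for cutoff 0.8, the arithmetic looks like this in Python (variable names are illustrative):

```python
# Cell counts from `estat classification, cutoff(0.8)`.
true_pos = 243    # classified +, visited doctor (D)
false_pos = 25    # classified +, did not visit (~D)
false_neg = 159   # classified -, visited doctor (D)
true_neg = 73     # classified -, did not visit (~D)

total = true_pos + false_pos + false_neg + true_neg

sensitivity = true_pos / (true_pos + false_neg)   # Pr(+ | D)
specificity = true_neg / (true_neg + false_pos)   # Pr(- | ~D)
ppv = true_pos / (true_pos + false_pos)           # Pr(D | +)
npv = true_neg / (true_neg + false_neg)           # Pr(~D | -)
accuracy = (true_pos + true_neg) / total          # correctly classified

for name, value in [("sensitivity", sensitivity), ("specificity", specificity),
                    ("ppv", ppv), ("npv", npv), ("accuracy", accuracy)]:
    print(f"{name}: {100 * value:.2f}%")
```

Repeating the calculation over a grid of cutoffs and plotting sensitivity against 1 - specificity is what traces out the ROC curve.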
PH207X Fall 2012
Survey Data Demo Page 1 of 11

Objectives for Survey Results Module 1: Basic Statistics

Number of respondents at baseline and follow-up
Number of participants in longitudinal dataset
Differences between baseline and longitudinal dataset

I. Number of respondents at baseline and follow-up
a. Baseline survey: 9157 respondents
b. Follow-up survey 3700 respondents

II. Number of participants in longitudinal dataset
a. 596 participants provided unique identifiers in both the baseline and follow-up
survey

III. Differences between baseline and longitudinal dataset

The table below compares those who responded to the baseline survey with
those who were included in the longitudinal dataset, for some selected variables which
we will be using later in the demo.

                    Baseline (n=9157)    Longitudinal (n=596)
Sex
  female            3984 (44%)           282 (47%)
  male              4521 (49%)           310 (52%)
  missing           652 (7%)             4 (0.7%)

Age                 32 ± 10.1            33.78 ± 10.3

Computer
  Mac               1429 (16%)           84 (14%)
  PC                7080 (77%)           509 (85%)
  missing           648 (7%)             3 (0.5%)

Aptitude
  math              3773 (41%)           287 (48%)
  verbal            4735 (52%)           305 (51%)
  missing           649 (7%)             4 (0.7%)

Your Handedness
  righty            7582 (83%)           531 (89%)
  lefty             547 (6%)             44 (7%)
  ambidextrous      351 (4%)             18 (3%)


Notes:
Since the exposure and the outcome variables were measured at the same time, we can use a
cross-sectional study.
With a cross-sectional study there is no follow-up time involved, so we cannot
calculate a rate ratio. We can calculate a risk ratio or an odds ratio.
The risk ratio is easier to interpret. The odds ratio may not be similar to the risk ratio
unless the outcome is rare.
Therefore, it is preferable to present the risk ratio.
We'll just view the

Objectives for Survey Results Module 2: Factors Related to Mac vs PC Use

Choose a study design to examine the association between math and verbal aptitude
and Mac/PC use.
Calculate the appropriate measure of association comparing math versus verbal
aptitude and Mac/PC use.
Construct your own analysis to study the association between handedness and
Mac/PC use.

I. Choose a study design
a. Exposure: Math and Verbal aptitude
b. Outcome: Mac/PC Use
c. Study design: Cross-sectional study

II. Calculate the appropriate measure of association comparing the math and verbal
aptitude and Mac/PC use.
a. Measure of association: Risk Ratio or Odds Ratio
b. Dropdown:
i. Statistics > Epidemiology and Related > Tables for Epidemiologists >
Cohort study risk-ratio etc.
ii. Case variable: macpc
iii. Exposed variable: aptitude
iv. On the options tab, check box for Report odds ratio
v. Submit
c. Command Window Syntax: cs macpc aptitude, or

| aptitude |
| Exposed Unexposed | Total
-----------------+------------------------+------------
Cases | 39 45 | 84
Noncases | 248 258 | 506
-----------------+------------------------+------------
Total | 287 303 | 590
| |
Risk | .1358885 .1485149 | .1423729
| |
| Point estimate | [95% Conf. Interval]
|------------------------+------------------------
Risk difference | -.0126263 | -.0689729 .0437202
Risk ratio | .9149826 | .6150247 1.361235
Prev. frac. ex. | .0850174 | -.3612351 .3849753
Prev. frac. pop | .0413559 |
Odds ratio | .9016129 | .5690484 1.428626 (Cornfield)
+-------------------------------------------------
chi2(1) = 0.19 Pr>chi2 = 0.6609

People with stronger math abilities were about 9% less likely to use a Mac compared to
people with stronger verbal abilities (risk ratio 0.91). The 95% confidence interval for
the risk ratio, 0.62 to 1.36, includes 1, so the difference is not statistically significant.
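The risk ratio and its interval can be rebuilt from the four cell counts. The sketch below is a Python illustration, not Stata's own code: it uses the standard large-sample interval on the log risk ratio scale, which is an assumption about the method, although the result agrees with the cs output for this table.

```python
import math

# Cell counts from the cs table: exposed = math aptitude, case = Mac user.
cases_exp, noncases_exp = 39, 248
cases_unexp, noncases_unexp = 45, 258

n_exp = cases_exp + noncases_exp        # 287
n_unexp = cases_unexp + noncases_unexp  # 303

risk_exp = cases_exp / n_exp
risk_unexp = cases_unexp / n_unexp
risk_ratio = risk_exp / risk_unexp

# Standard error of log(RR), then a 95% interval on the log scale.
se_log_rr = math.sqrt(1 / cases_exp - 1 / n_exp
                      + 1 / cases_unexp - 1 / n_unexp)
z_crit = 1.959964
ci_low = math.exp(math.log(risk_ratio) - z_crit * se_log_rr)
ci_high = math.exp(math.log(risk_ratio) + z_crit * se_log_rr)

print(round(risk_ratio, 4), round(ci_low, 4), round(ci_high, 4))
```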


III. Construct your own analysis to study the association between Mac/PC use and
handedness.
a. Measure of association: Risk Ratio or Odds Ratio
b. Dropdown:
i. Statistics > Epidemiology and Related > Tables for Epidemiologists >
Cohort study risk-ratio etc.
ii. Case variable: macpc
iii. Exposed variable: lefty
iv. On the options tab, check box for Report odds ratio
v. Submit
c. Command Window Syntax: cs macpc lefty, or

| lefty |
| Exposed Unexposed | Total
-----------------+------------------------+------------
Cases | 6 77 | 83
Noncases | 38 452 | 490
-----------------+------------------------+------------
Total | 44 529 | 573
| |
Risk | .1363636 .1455577 | .1448517
| |
| Point estimate | [95% Conf. Interval]
|------------------------+------------------------
Risk difference | -.009194 | -.1149534 .0965653
Risk ratio | .9368359 | .4330182 2.026847
Prev. frac. ex. | .0631641 | -1.026847 .5669818
Prev. frac. pop | .0048503 |
Odds ratio | .9268626 | .3889418 2.214373 (Cornfield)
+-------------------------------------------------
chi2(1) = 0.03 Pr>chi2 = 0.8678

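The risk ratio for this study was 0.94 and the odds ratio was 0.93. In this sample,
people who are left-handed were slightly less likely to use a Mac compared to people who
are right-handed, but the 95% confidence interval for the risk ratio (0.43 to 2.03)
includes 1, so there is no statistically significant association.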
EXPLANATION FOR SURVEY RESULTS MODULE 3 (the module itself follows)
Exposure came first (tea or coffee).
Then outcome came later (sleep difficulty).
Therefore: cohort study.
The next question is what is the appropriate measure of association.
In a cohort study, we can use data on the number of exposed and unexposed cases and
non-cases to calculate a risk ratio, or we can collect information about exposed and
unexposed person-time in our cases and non-cases and calculate a rate ratio.
As you may recall from the beginning of the course, we use rates when we're concerned
about competing risks and loss to follow-up, in studies where people are followed for many
years and we're worried about not being able to observe all the outcomes. But in this study
we have information on tea and coffee consumption and sleep quality that same night, so
we're not concerned about competing risks and loss to follow-up.
Also, we often use rates when we want to understand how long it takes for the exposure to
result in an outcome. But here we're not asking how long it takes for tea and coffee
consumption to cause a change in sleep difficulties. We just want to know whether they have
sleep difficulties: yes or no, case or no case. Therefore, the appropriate measure of
association here is a risk ratio.
Our exposure variable is caff2hrb4, which equals one if the person reported drinking tea or
coffee within two hours of going to bed, equals zero if he or she did not report drinking tea
or coffee in the two hours before bed, and is missing if the person did not answer the
question. Our outcome variable is sleepdiff, which equals one if the participant reported
sleep difficulties that night and zero if the person did not report sleep difficulties.

Objectives for Survey Results Module 3: Risk Factors for Sleep Difficulties

Choose a study design to examine the association between tea/coffee consumption
before bed and sleep difficulties.
Calculate the appropriate measure of association comparing tea/coffee consumption
before bed and sleep difficulties.
Consider confounding and effect modification by sex.
Consider confounding and effect modification by age.
Construct your own analysis to study the association between handedness and sleep
difficulties. Consider confounding and effect modification by sex.

I. Choose a study design to examine the association between tea/coffee
consumption before bed and sleep difficulties.
a. Study design: Cohort study
b. Exposure: Tea and coffee consumption two hours before bed
c. Outcome: Sleep difficulties that night

II. Calculate the appropriate measure of association comparing tea/coffee
consumption before bed and sleep difficulties.
a. Measure of association: Risk ratio
b. Dropdown:
i. Statistics > Epidemiology and Related > Tables for Epidemiologists >
Cohort study risk-ratio etc.
ii. Case variable: sleepdiff
iii. Exposed variable: caff2hrb4
iv. Submit.
c. Command Window Syntax: cs sleepdiff caff2hrb4

| caff2hrb4 |
| Exposed Unexposed | Total
-----------------+------------------------+------------
Cases | 19 81 | 100
Noncases | 102 389 | 491
-----------------+------------------------+------------
Total | 121 470 | 591
| |
Risk | .1570248 .1723404 | .1692047
| |
| Point estimate | [95% Conf. Interval]
|------------------------+------------------------
Risk difference | -.0153156 | -.0885836 .0579524
Risk ratio | .9111315 | .5763827 1.440294
Prev. frac. ex. | .0888685 | -.4402942 .4236173
Prev. frac. pop | .0181947 |
+-------------------------------------------------
chi2(1) = 0.16 Pr>chi2 = 0.6886

Those who drank tea or coffee before bed had 0.91 times the risk of sleep difficulties
compared to those who did not drink tea or coffee.


III. Consider confounding and effect modification by sex.
a. Dropdown:
i. Statistics > Epidemiology and Related > Tables for Epidemiologists >
Cohort study risk-ratio etc.
ii. Case variable: sleepdiff
iii. Exposed variable: caff2hrb4
iv. Go to the Options tab; click the box next to stratify on variables;
use the dropdown menu to select male
Note: Under Within-stratum weights the button next to Use
Mantel-Haenszel should be automatically selected
v. Submit.
b. Command Window Syntax: cs sleepdiff caff2hrb4, by(male)

male | RR [95% Conf. Interval] M-H Weight
-----------------+-------------------------------------------------
no | .6689266 .3338178 1.34044 9.448399
yes | 1.241935 .6697457 2.302969 7.068404
-----------------+-------------------------------------------------
Crude | .9111315 .5763827 1.440294
M-H combined | .9141471 .5777317 1.446458
-------------------------------------------------------------------
Test of homogeneity (M-H) chi2(1) = 1.721 Pr>chi2 = 0.1895

The crude effect estimate is 0.911 while the Mantel-Haenszel adjusted risk ratio is
0.914. Since the crude and adjusted-risk ratios are so similar, we can conclude that
there is not strong confounding by sex in our study.

Although our risk ratios for males and females seem different, there is no evidence of
statistically significant effect modification by sex (test of homogeneity p = 0.19).

IV. Consider confounding and effect modification by age.
a. Dropdown:
i. Statistics > Epidemiology and Related > Tables for Epidemiologists >
Cohort study risk-ratio etc.
ii. Case variable: sleepdiff
iii. Exposed variable: caff2hrb4
iv. Go to the Options tab; click the box next to stratify on variables;
use the dropdown menu to select agecat
Note: Under Within-stratum weights the button next to Use
Mantel-Haenszel should be automatically selected
v. Submit.
b. Command Window Syntax: cs sleepdiff caff2hrb4, by(agecat)
Note: if the p-value for the test of homogeneity is > 0.05, it is not significant and
there is no evidence of effect modification.


agecat | RR [95% Conf. Interval] M-H Weight
-----------------+-------------------------------------------------
18-29 yrs old | 1.202553 .654413 2.209817 7.208
30-39 yrs old | .5083612 .1870783 1.381406 6.040404
40-49 yrs old | .8452381 .2120323 3.369427 1.976471
>=50 yrs old | 1.305556 .3431283 4.967457 1.309091
-----------------+-------------------------------------------------
Crude | .9111315 .5763827 1.440294
M-H combined | .9143836 .5795498 1.442667
-------------------------------------------------------------------
Test of homogeneity (M-H) chi2(3) = 2.389 Pr>chi2 = 0.4957

The crude risk ratio is 0.911 while the Mantel-Haenszel adjusted risk ratio is 0.914.
Since the crude and adjusted-risk ratios are so similar, there is not strong
confounding by age category in our study.

Despite the differences in the risk ratios by age category, there is no evidence of
statistically significant effect modification by age category.

V. Construct your own analysis to study the association between handedness and
sleep difficulties. Consider confounding and effect modification by sex.
a. Dropdown:
i. Statistics > Epidemiology and Related > Tables for Epidemiologists >
Cohort study risk-ratio etc.
ii. Case variable: sleepdiff
iii. Exposed variable: lefty
iv. Submit
b. Command Window Syntax: cs sleepdiff lefty

| lefty |
| Exposed Unexposed | Total
-----------------+------------------------+------------
Cases | 8 89 | 97
Noncases | 36 440 | 476
-----------------+------------------------+------------
Total | 44 529 | 573
| |
Risk | .1818182 .168242 | .1692845
| |
| Point estimate | [95% Conf. Interval]
|------------------------+------------------------
Risk difference | .0135762 | -.1047616 .131914
Risk ratio | 1.080695 | .5614644 2.080098
Attr. frac. ex. | .0746692 | -.7810567 .5192533
Attr. frac. pop | .0061583 |
+-------------------------------------------------
chi2(1) = 0.05 Pr>chi2 = 0.8175
People who are left-handed have a slightly higher (1.08-fold) risk of sleep
difficulties compared to people who are right-handed, although the 95% confidence
interval (0.56 to 2.08) includes 1.


c. Dropdown:
vi. Statistics > Epidemiology and Related > Tables for Epidemiologists >
Cohort study risk-ratio etc.
vii. Case variable: sleepdiff
viii. Exposed variable: lefty
ix. Go to the Options tab; click the box next to stratify on variables;
use the dropdown menu to select male
Note: Under Within-stratum weights the button next to Use
Mantel-Haenszel should be automatically selected
x. Submit.
d. Command Window Syntax: cs sleepdiff lefty, by(male)

male | RR [95% Conf. Interval] M-H Weight
-----------------+-------------------------------------------------
female | .9338374 .3691631 2.362241 3.918519
male | 1.333333 .531464 3.345058 2.8
-----------------+-------------------------------------------------
Crude | 1.080695 .5614644 2.080098
M-H combined | 1.100331 .5725662 2.114564
-------------------------------------------------------------------
Test of homogeneity (M-H) chi2(1) = 0.288 Pr>chi2 = 0.5918

Our results stratified by gender show slightly different results among males and
females. Left-handed males have 1.33 times the risk of sleep difficulties compared
to right-handed males, while left-handed females have 0.93 times the risk of sleep
difficulties compared to right-handed females.

Objectives for Survey Results Module 4: Risk Factors for Left and Right Handedness

Choose a study design to examine the association between mother's age at birth of
PH207x participant and handedness of the participant.
Calculate the appropriate measure of association comparing the mother's age among
those who are left-handed to those who are right-handed.
Construct your own analysis to study the association between having a left-handed
parent and child's handedness.

I. Choose a study design
a. Exposure: Mother's age at birth of PH207x participant
b. Outcome: Handedness of PH207x participant
c. Study design: Case-control

II. Calculate the appropriate measure of association comparing the mother's age among
those who are left-handed to those who are right-handed.
a. Measure of association: Odds ratio
b. Calculating the odds ratio in Stata
c. Dropdown:
i. Statistics > Epidemiology and Related > Tables for Epidemiologists >
Case control odds ratio.
ii. Case variable: lefty
iii. Exposed variable: momagecat
iv. Submit
d. Command Window Syntax: cc lefty momagecat

                                                         Proportion
                 |   Exposed   Unexposed  |      Total     Exposed
-----------------+------------------------+------------------------
           Cases |         9          35  |         44       0.2045
        Controls |        57         474  |        531       0.1073
-----------------+------------------------+------------------------
           Total |        66         509  |        575       0.1148
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
      Odds ratio |         2.138346       |    .8576737    4.829077 (exact)
 Attr. frac. ex. |         .5323488       |   -.1659446    .7929211 (exact)
 Attr. frac. pop |         .1088895       |
                 +-------------------------------------------------
                               chi2(1) =     3.78  Pr>chi2 = 0.0519

Mothers who were 35 years of age or older at the time of their child's birth have 2.14
times the odds of having a left-handed child compared to mothers who were younger than
35 at the time of their child's birth. We are 95% confident that the true odds ratio
ranges from 0.86 to 4.83. Because this interval includes 1, the association is not
statistically significant at the 0.05 level.

Whenever we do a case-control study, we use
the odds ratio as our measure of association.
Exposed = mom age >= 35
Unexposed = mom age < 35
Cases = Left-handed
Controls = Right-handed
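
The point estimates in the cc output can be reproduced by hand from the four cell counts. The following Python sketch is an independent cross-check, not part of the original demo. It assumes the exposed group is mothers aged 35 or older (momagecat = 1, per the data dictionary), and the variable names are illustrative.

```python
# Cell counts copied from the cc output above.
# Exposed = mother aged >= 35 at the birth (momagecat = 1); cases = left-handed.
exposed_cases, unexposed_cases = 9, 35
exposed_controls, unexposed_controls = 57, 474

# Odds ratio: the cross-product ratio of the 2x2 table.
odds_ratio = (exposed_cases * unexposed_controls) / (
    unexposed_cases * exposed_controls
)

# Attributable fraction among the exposed: (OR - 1) / OR.
attr_frac_exposed = (odds_ratio - 1) / odds_ratio

# Population attributable fraction: scale by the proportion of cases exposed.
prop_cases_exposed = exposed_cases / (exposed_cases + unexposed_cases)
attr_frac_pop = attr_frac_exposed * prop_cases_exposed

print(f"Odds ratio      = {odds_ratio:.6f}")
print(f"Attr. frac. ex. = {attr_frac_exposed:.7f}")
print(f"Attr. frac. pop = {attr_frac_pop:.7f}")
```

The exact confidence limits that Stata reports are not reproduced here; they require an iterative calculation that is beyond this sketch.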

III. Construct your own analysis to study the association between having at least one
left-handed parent and child's handedness.

. cc lefty parentlefty
                                                         Proportion
                 |   Exposed   Unexposed  |      Total     Exposed
-----------------+------------------------+------------------------
           Cases |         6          38  |         44       0.1364
        Controls |        42         487  |        529       0.0794
-----------------+------------------------+------------------------
           Total |        48         525  |        573       0.0838
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
      Odds ratio |         1.830827       |     .597326    4.708769 (exact)
 Attr. frac. ex. |         .4537988       |   -.6741276    .7876302 (exact)
 Attr. frac. pop |         .0618817       |
                 +-------------------------------------------------
                               chi2(1) =     1.72  Pr>chi2 = 0.1900

People with at least one left-handed parent have 1.83 times the odds of being left-
handed compared to those without a left-handed parent.
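
The chi2(1) line at the bottom of the output is consistent with a Pearson chi-squared test on the 2x2 table. The Python sketch below recomputes it from the cell counts, assuming no continuity correction; the variable names are illustrative and the code is a cross-check rather than part of the original demo. With a p-value of about 0.19 and a confidence interval that includes 1, the data do not show a statistically significant association at the 0.05 level.

```python
import math

# Cell counts copied from the cc output above (parentlefty as the exposure).
a, b = 6, 38     # cases: exposed, unexposed
c, d = 42, 487   # controls: exposed, unexposed
n = a + b + c + d

# Pearson chi-squared statistic for a 2x2 table (no continuity correction).
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Upper-tail probability of a chi-squared variable with 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi2(1) = {chi2:.2f}, Pr>chi2 = {p_value:.4f}")
```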

Conclusions
The appropriate measure of association depends on the type of exposure and outcome of
interest, the type of data available and the study design used to obtain the data.
In survey studies, one must always be concerned about issues of selection bias and
generalizability.

Data Dictionary for Survey Dataset

Variable | Description | Values
board | In the past two weeks how often did you use the chat room for this course to post a question | "0"; "2-3 times"; "4 or more times"; "Never"; "Once"
male | Sex | 0 no (female); 1 yes (male); . missing
degree | Highest level of education | 1 pre-college / university degree; 2 bachelor degree; 3 masters degree; 4 doctoral degree
precollege | Highest level of education is Pre-College/University Degree | 0 no; 1 yes; . missing
masters | Highest level of education is Masters Degree | 0 no; 1 yes; . missing
doctorate | Highest level of education is Doctoral Degree | 0 no; 1 yes; . missing
macpc | Which type of computer do you use most of the time? | 0 pc; 1 mac; . missing
aptitude | Which is stronger, your math aptitude or your verbal aptitude? | 0 verbal; 1 math; . missing
caff2hrb4 | Did you drink coffee or tea within two hours of bedtime yesterday? | 0 no; 1 yes; . missing
sleepdiff | Did you have trouble sleeping last night? | 0 no; 1 yes; . missing
shower | Do you face the shower head? | 0 no; 1 yes; . missing
longhair | Do you consider your hair to be long? | 0 no; 1 yes; . missing
facialhair | If you are a man, do you have a beard, mustache, or goatee? | 0 no; 1 yes; . missing
agecat | Age of participant | 1 18-29 yrs old; 2 30-39 yrs old; 3 40-49 yrs old; 4 >=50 yrs old
momagecat | How old was your mother at your birth? | 0 <35 yrs old; 1 >=35 yrs old
lefty | Are you left-handed? | 0 righty; 1 lefty; . missing
dadlefty | Is your father left-handed? | 0 righty; 1 lefty; . missing
momlefty | Is your mother left-handed? | 0 righty; 1 lefty; . missing
parentlefty | Is one (or both) of your parents left-handed? | 0 No left-handed parents; 1 Left-handed parent; . missing
allhourscat | On average, how many hours per week did you spend on all aspects of this course? | 0 0-7 hours; 1 >=8 hours
hwkhourscat | On average, how many hours per week did you spend working on the homework assignments for this course? | 0 0-2 hours; 1 >=3 hours
comphrscat | For how many hours did you use your computer last night | 0 0-1 hours; 1 >=2 hours




Tutorial: Survival Analysis in Stata

In this tutorial, we use data from the Digitalis Investigation Group (DIG). Recall that the DIG trial
was a randomized, double-blind, multicenter trial designed to examine the safety and
efficacy of Digoxin in treating patients with congestive heart failure. In this trial, patients were
randomized to either Digoxin or placebo. The log-rank test was used to compare overall
mortality between the two groups.
To begin, open the dig.dta data set. Before we can do any analyses, we must first tell Stata that
we are working with survival data (analogous to how we had to svyset our data and tell Stata
that we were working with survey data). You can do this using the stset command. The
command for this dataset is stset deathday, failure(death==1). This command tells Stata that
our time-to-death variable is deathday; and a value of 1 for the death variable means that
person died while any other value (in this case 0) means that person was censored. For survival
data, we need at least two variables: 1) a variable for the time to the event and 2) a variable to
indicate if the observation is censored or not.

1. Graph the Kaplan-Meier estimates of the survival curves for each treatment group.


sts graph, by(trtmt)




Note: you can also list the values of the survival function using the sts list, by(trtmt)
command.
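
To make concrete what sts list reports, the sketch below computes Kaplan-Meier estimates by hand in Python. It uses the same two ingredients that stset requires: a time variable and a failure indicator. The toy data are hypothetical and are not taken from dig.dta, and the function name kaplan_meier is illustrative.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  -- follow-up time for each subject
    events -- 1 if the subject died at that time, 0 if censored
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        # Subjects still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        # Deaths occurring exactly at time t.
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
    return curve


# Hypothetical toy data, not taken from dig.dta.
toy_times = [1, 2, 2, 3, 4]
toy_events = [1, 1, 0, 1, 0]

for t, s in kaplan_meier(toy_times, toy_events):
    print(f"t = {t}: S(t) = {s:.3f}, 1 - S(t) = {1 - s:.3f}")
```

Censored subjects contribute to the number at risk up to their censoring time but never trigger a drop in the curve. The last column is the failure function, 1 - S(t).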

2. In the New England Journal of Medicine paper (see handout NEJM_DIG), the authors
plotted 1 - S(t) in Figure 1. Graph 1 - S(t) for each treatment group.

sts graph, failure by(trtmt)
Figure: Kaplan-Meier survival estimates by treatment group (trtmt = 0, trtmt = 1).
x-axis: analysis time (0 to 2000); y-axis: 0.00 to 1.00.




3. Conduct a log-rank test at the 0.05 level of significance to test the hypothesis that the
survival distribution is the same in the two treatment groups. Use the following
command: sts test trtmt, logrank
a. What are your null and alternative hypotheses?

The null hypothesis is that the two groups have the same distribution of survival
times. The alternative is that they do not.

b. What is the value of your test statistic?

0.00

c. What distribution does your test statistic have under the null hypothesis?

Chi-squared distribution with 1 degree of freedom

d. What is your p-value?

0.9616. Note: using all 6800 observations yields a p-value of 0.8013. This is the
p-value the authors reported in Figure 1.
Figure: Kaplan-Meier failure estimates by treatment group (trtmt = 0, trtmt = 1).
x-axis: analysis time (0 to 2000); y-axis: 0.00 to 1.00.


e. What is your conclusion?

Since our p-value is greater than 0.05, we fail to reject the null hypothesis. Thus,
we conclude that we do not have evidence that the distribution of survival times
is different between the Digoxin group and the placebo group.
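
For readers who want to see where the chi-squared statistic comes from, the following Python sketch implements a basic two-group log-rank test. It compares observed and expected deaths in one group at each event time. It is an illustration under stated assumptions, not Stata's implementation: groups must be coded 0/1, both groups must contribute subjects at risk, and the toy data are hypothetical rather than taken from dig.dta.

```python
import math


def logrank_test(times, events, groups):
    """Two-group log-rank test.

    times  -- follow-up time for each subject
    events -- 1 = death, 0 = censored
    groups -- group label for each subject, coded 0 or 1
    Returns (chi2, p_value), where chi2 has 1 degree of freedom.
    """
    observed = expected = variance = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        # Numbers at risk just before t: overall and in group 1.
        n = sum(1 for u in times if u >= t)
        n1 = sum(1 for u, g in zip(times, groups) if u >= t and g == 1)
        # Deaths at t: overall and in group 1.
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        d1 = sum(
            1
            for u, e, g in zip(times, events, groups)
            if u == t and e == 1 and g == 1
        )
        observed += d1
        expected += d * n1 / n
        if n > 1:
            variance += n1 * (n - n1) * d * (n - d) / (n ** 2 * (n - 1))
    chi2 = (observed - expected) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail, 1 degree of freedom
    return chi2, p_value


# Hypothetical toy data: group 1 dies early, group 0 dies late.
toy_times = [1, 2, 3, 4, 5, 6]
toy_events = [1, 1, 1, 1, 1, 1]
toy_groups = [1, 1, 1, 0, 0, 0]

chi2, p_value = logrank_test(toy_times, toy_events, toy_groups)
print(f"chi2(1) = {chi2:.2f}, p-value = {p_value:.4f}")
```

When the two groups have similar survival experience, observed and expected deaths nearly coincide and the statistic is close to 0, which is the situation in the DIG output above (statistic 0.00, p-value 0.9616).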
