2. Epicycles of Analysis . . . . . . . . . . . . . . . . . . 4
2.1 Setting the Scene . . . . . . . . . . . . . . . . . 5
2.2 Epicycle of Analysis . . . . . . . . . . . . . . . 6
2.3 Setting Expectations . . . . . . . . . . . . . . . 8
2.4 Collecting Information . . . . . . . . . . . . . 9
2.5 Comparing Expectations to Data . . . . . . . 10
2.6 Applying the Epicycle of Analysis Process . 11
6. Inference: A Primer . . . . . . . . . . . . . . . . . . 78
6.1 Identify the population . . . . . . . . . . . . . 78
6.2 Describe the sampling process . . . . . . . . 79
6.3 Describe a model for the population . . . . . 79
6.4 A Quick Example . . . . . . . . . . . . . . . . . 81
6.5 Factors Affecting the Quality of Inference . 84
6.6 Example: Apple Music Usage . . . . . . . . . 86
6.7 Populations Come in Many Forms . . . . . . 89
7. Formal Modeling . . . . . . . . . . . . . . . . . . . . 92
7.1 What Are the Goals of Formal Modeling? . 92
7.2 General Framework . . . . . . . . . . . . . . . 93
7.3 Associational Analyses . . . . . . . . . . . . . 95
7.4 Prediction Analyses . . . . . . . . . . . . . . . 104
7.5 Summary . . . . . . . . . . . . . . . . . . . . . . 111
Epicycles of Analysis
1. Setting expectations,
2. Collecting information (data), comparing the data to
your expectations, and if the expectations don’t match,
3. Revising your expectations or fixing the data so your
data and your expectations match.
Epicycles of Analysis
Now that you have data in hand (the check at the restau-
rant), the next step is to compare your expectations to the
data. There are two possible outcomes: either your expec-
tations of the cost match the amount on the check, or they
do not. If your expectations and the data match, terrific, you
can move on to the next activity. If, on the other hand, your
expectations were a cost of 30 dollars, but the check was
40 dollars, your expectations and the data do not match.
There are two possible explanations for the discordance:
first, your expectations were wrong and need to be revised,
or second, the check was wrong and contains an error. You
review the check and find that you were charged for two
desserts instead of the one that you had, and conclude that
there is an error in the data, so you ask for the check to be
corrected.
One key indicator of how well your data analysis is going is
how easy or difficult it is to match the data you collected to
your original expectations. You want to set up your expectations
and your data so that matching the two up is easy.
In the restaurant example, your expectation was $30 and
the data said the meal cost $40, so it’s easy to see that (a)
your expectation was off by $10 and that (b) the meal was
more expensive than you thought. When you come back
to this place, you might bring an extra $10. If your original
expectation was that the meal would be between $0 and
$1,000, then it's true that your data fall into that range, but
it's not clear how much more you've learned. For example,
would you change your behavior the next time you came
back? The expectation of a $30 meal is sometimes referred
to as a sharp hypothesis because it states something very
specific that can be verified with the data.
race, body mass index, smoking status, and low income are
all positively associated with uncontrolled asthma.
However, you notice that female gender is inversely asso-
ciated with uncontrolled asthma, when your research and
discussions with experts indicate that among adults, female
gender should be positively associated with uncontrolled
asthma. This mismatch between expectations and results
leads you to pause and do some exploring to determine if
your results are indeed correct and you need to adjust your
expectations or if there is a problem with your results rather
than your expectations. After some digging, you discover
that you had thought that the gender variable was coded
1 for female and 0 for male, but instead the codebook indi-
cates that the gender variable was coded 1 for male and 0 for
female. So the interpretation of your results was incorrect,
not your expectations. Now that you understand what the
coding is for the gender variable, your interpretation of the
model results matches your expectations, so you can move
on to communicating your findings.
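The codebook check above can be sketched in R. This is a hypothetical illustration; the data frame and variable names (`dat`, `gender`) are made up and not from the actual asthma dataset.

```r
# Hypothetical sketch of verifying a coding assumption against a codebook.
dat <- data.frame(gender = c(1, 0, 1, 1, 0))

# Tabulate the raw codes before interpreting any model output.
table(dat$gender)

# The codebook says 1 = male and 0 = female, so attach those labels
# explicitly rather than assuming 1 = female.
dat$gender <- factor(dat$gender, levels = c(0, 1),
                     labels = c("female", "male"))
table(dat$gender)
```

Making the labels explicit in the data means the model output carries the correct interpretation automatically, so the mismatch cannot recur downstream.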
Lastly, you communicate your findings, and yes, the epicy-
cle applies to communication as well. For the purposes of
this example, let’s assume you’ve put together an informal
report that includes a brief summary of your findings. Your
expectation is that your report will communicate the infor-
mation your boss is interested in knowing. You meet with
your boss to review the findings and she asks two questions:
(1) how recently the data in the dataset were collected
and (2) how changing demographic patterns projected to
occur in the next 5-10 years would be expected to affect
the prevalence of uncontrolled asthma. Although it may be
disappointing that your report does not fully meet your
boss’s needs, getting feedback is a critical part of doing
a data analysis, and in fact, we would argue that a good
data analysis requires communication, feedback, and then
revision in response to the feedback.
1. Descriptive
2. Exploratory
3. Inferential
4. Predictive
5. Causal
6. Mechanistic
You can now use the information about the types of ques-
tions and characteristics of good questions as a guide to
refining your question. To accomplish this, you can iterate
through the 3 steps of:
data available to you, you will be able to adjust for this con-
founder and reduce the number of possible interpretations
of the answer to your question. As you refine your question,
spend some time identifying the potential confounders and
thinking about whether your dataset includes information
about these potential confounders.
Another type of problem that can occur when inappro-
priate data are used is that the result is not interpretable
because the underlying way in which the data were col-
lected leads to a biased result. For example, imagine that
you are using a dataset created from a survey of women
who had had children. The survey includes information
about whether their children had autism and whether they
reported eating sushi while pregnant, and you see an as-
sociation between report of eating sushi during pregnancy
and having a child with autism. However, because women
who have had a child with a health condition recall the
exposures, such as raw fish, that occurred during pregnancy
differently than those who have had healthy children, the
observed association between sushi exposure and autism
may just be the manifestation of a mother's tendency to
focus more on events during pregnancy when she has a child
with a health condition. This is an example of recall bias,
but there are many types of bias that can occur.
The other major bias to understand and consider when
refining your question is selection bias, which occurs when
the data you are analyzing were collected in such a way
as to inflate the proportion of people who have both charac-
teristics above what exists in the general population. If a
study advertised that it was a study about autism and diet
during pregnancy, then it is quite possible that women who
both ate raw fish and had a child with autism would be
more likely to respond to the survey than those who had
one of these conditions or neither of these conditions. This
sort of differential response would inflate the observed
association beyond what exists in the general population.
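A small simulation makes the mechanism concrete. This is a purely illustrative sketch, with made-up prevalences and response probabilities: exposure and outcome are independent in the simulated population, yet oversampling people with both characteristics manufactures an association.

```r
# Illustrative simulation of selection bias; all numbers are invented.
set.seed(42)
n <- 100000

# In the full population, exposure and outcome are independent.
sushi  <- rbinom(n, 1, 0.3)
autism <- rbinom(n, 1, 0.02)

# Women with BOTH characteristics are far more likely to respond.
p_respond <- ifelse(sushi == 1 & autism == 1, 0.9, 0.2)
responded <- rbinom(n, 1, p_respond) == 1

# Odds ratio helper: (both * neither) / (only-outcome * only-exposure).
odds_ratio <- function(x, y) {
  t <- table(x, y)
  (t[2, 2] * t[1, 1]) / (t[1, 2] * t[2, 1])
}

odds_ratio(sushi, autism)                        # close to 1 in the population
odds_ratio(sushi[responded], autism[responded])  # inflated among responders
```

The association among responders is an artifact of who answered the survey, not of any relationship in the population.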
Stating and Refining the Question 26
The answer that he will get at the end of his analysis (when
he translates his question into a data problem) should also
be interpretable.
He then thinks through what he knows about the question
and in his judgment, the question is of interest as his boss
expressed interest.
He also knows that the question could not have been an-
swered already since his boss indicated that it had not and
a review of the company’s previous data analyses reveals no
previous analysis designed to answer the question.
Next he assesses whether the question is grounded in a
plausible framework. The question, “Which Fit on Fleek
users don’t get enough sleep?”, seems to be grounded in
plausibility as it makes sense that people who get too little
sleep would be interested in trying to improve their sleep
by tracking it. However, Joe wonders whether the duration
of sleep is the best marker for whether a person feels that
they are getting inadequate sleep. He knows some peo-
ple who regularly get little more than 5 hours of sleep a
night and they seem satisfied with their sleep. Joe reaches
out to a sleep medicine specialist and learns that a better
measure of whether someone is affected by lack of sleep
or poor quality sleep is daytime drowsiness. It turns out
that his initial expectation that the question was grounded
in a plausible framework did not match the information
he received when he spoke with a content expert. So he
revises his question so that it matches his expectations of
plausibility and the revised question is: Which Fit on Fleek
users have drowsiness during the day?
Joe pauses to make sure that this question is, indeed, an-
swerable with the data he has available to him, and confirms
that it is. He also pauses to think about the specificity of the
question. He believes that it is specific, but goes through
> library(readr)
> ozone <- read_csv("data/hourly_44201_2014.csv",
+ col_types = "ccccinnccccccncnncccccc")
Have you ever gotten a present before the time when you
were allowed to open it? Sure, we all have. The problem
Exploratory Data Analysis 37
> nrow(ozone)
[1] 7147884
and columns.
> ncol(ozone)
[1] 23
> str(ozone)
Classes 'tbl_df', 'tbl' and 'data.frame': 7147884 obs. of 23 variables:
$ State.Code : chr "01" "01" "01" "01" ...
$ County.Code : chr "003" "003" "003" "003" ...
$ Site.Num : chr "0010" "0010" "0010" "0010" ...
$ Parameter.Code : chr "44201" "44201" "44201" "44201" ...
$ POC : int 1 1 1 1 1 1 1 1 1 1 ...
$ Latitude : num 30.5 30.5 30.5 30.5 30.5 ...
$ Longitude : num -87.9 -87.9 -87.9 -87.9 -87.9 ...
$ Datum : chr "NAD83" "NAD83" "NAD83" "NAD83" ...
$ Parameter.Name : chr "Ozone" "Ozone" "Ozone" "Ozone" ...
$ Date.Local : chr "2014-03-01" "2014-03-01" "2014-03-01" "2014-\
03-01" ...
$ Time.Local : chr "01:00" "02:00" "03:00" "04:00" ...
$ Date.GMT : chr "2014-03-01" "2014-03-01" "2014-03-01" "2014-\
03-01" ...
$ Time.GMT : chr "07:00" "08:00" "09:00" "10:00" ...
$ Sample.Measurement : num 0.047 0.047 0.043 0.038 0.035 0.035 0.034 0.0\
37 0.044 0.046 ...
$ Units.of.Measure : chr "Parts per million" "Parts per million" "Part\
s per million" "Parts per million" ...
$ MDL : num 0.005 0.005 0.005 0.005 0.005 0.005 0.005 0.0\
05 0.005 0.005 ...
$ Uncertainty : num NA NA NA NA NA NA NA NA NA NA ...
$ Qualifier : chr "" "" "" "" ...
$ Method.Type : chr "FEM" "FEM" "FEM" "FEM" ...
$ Method.Name : chr "INSTRUMENTAL - ULTRA VIOLET" "INSTRUMENTAL -\
ULTRA VIOLET" "INSTRUMENTAL - ULTRA VIOLET" "INSTRUMENTAL - ULTRA VIOLET"\
...
$ State.Name : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
$ County.Name : chr "Baldwin" "Baldwin" "Baldwin" "Baldwin" ...
$ Date.of.Last.Change: chr "2014-06-30" "2014-06-30" "2014-06-30" "2014-\
06-30" ...
For brevity I’ve only taken a few columns. And here’s the
bottom.
sure that you have data on all the people you thought you
would have data on.
In this example, we will use the fact that the dataset pur-
portedly contains hourly data for the entire country. These
will be our two landmarks for comparison.
Here, we have hourly ozone data that comes from monitors
across the country. The monitors should be monitoring
continuously during the day, so all hours should be repre-
sented. We can take a look at the Time.Local variable to see
what time measurements are recorded as being taken.
> head(table(ozone$Time.Local))
> library(dplyr)
> filter(ozone, Time.Local == "13:14") %>%
+ select(State.Name, County.Name, Date.Local,
+ Time.Local, Sample.Measurement)
# A tibble: 2 × 5
State.Name County.Name Date.Local Time.Local
<chr> <chr> <chr> <chr>
1 New York Franklin 2014-09-30 13:14
2 New York Franklin 2014-09-30 13:14
# ... with 1 more variables:
# Sample.Measurement <dbl>
Now we can see that this monitor just records its values at
odd times, rather than on the hour. It seems, from looking
at the previous output, that this is the only monitor in the
country that does this, so it’s probably not something we
should worry about.
Because the EPA monitors pollution across the country,
there should be a good representation of states. Perhaps we
should see exactly how many states are represented in this
dataset.
> unique(ozone$State.Name)
[1] "Alabama" "Alaska"
[3] "Arizona" "Arkansas"
[5] "California" "Colorado"
[7] "Connecticut" "Delaware"
[9] "District Of Columbia" "Florida"
[11] "Georgia" "Hawaii"
[13] "Idaho" "Illinois"
[15] "Indiana" "Iowa"
[17] "Kansas" "Kentucky"
[19] "Louisiana" "Maine"
[21] "Maryland" "Massachusetts"
[23] "Michigan" "Minnesota"
[25] "Mississippi" "Missouri"
[27] "Montana" "Nebraska"
[29] "Nevada" "New Hampshire"
[31] "New Jersey" "New Mexico"
[33] "New York" "North Carolina"
[35] "North Dakota" "Ohio"
[37] "Oklahoma" "Oregon"
[39] "Pennsylvania" "Rhode Island"
[41] "South Carolina" "South Dakota"
[43] "Tennessee" "Texas"
[45] "Utah" "Vermont"
[47] "Virginia" "Washington"
[49] "West Virginia" "Wisconsin"
[51] "Wyoming" "Puerto Rico"
> summary(ozone$Sample.Measurement)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00000 0.02000 0.03200 0.03123 0.04200 0.34900
From the plot, we can see that for most states the data are
within a pretty narrow range below 0.05 ppm. However,
for Puerto Rico, we see that the typical values are very low,
except for some extremely high values. Similarly, Georgia
and Hawaii appear to experience an occasional very high
value. These might be worth exploring further, depending
on your question.
> library(maps)
> map("state")
> abline(v = -100, lwd = 3)
> text(-120, 30, "West")
> text(-75, 30, "East")
Both the mean and the median ozone level are higher in the
western U.S. than in the eastern U.S., by about 0.004 ppm.
We can also make a boxplot of the ozone in the two regions
to see how they compare.
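The split-and-compare step can be sketched as follows. Here `oz` is a small simulated stand-in for the full ozone data frame, and the −100 degree longitude cutoff follows the line drawn on the map above.

```r
# Sketch of the east/west split; 'oz' is a toy stand-in for the real data.
library(dplyr)

set.seed(10)
oz <- data.frame(Longitude = runif(1000, -125, -70),
                 Sample.Measurement = runif(1000, 0, 0.08))

# Label each observation by which side of -100 longitude it falls on.
oz <- mutate(oz, region = factor(ifelse(Longitude < -100, "west", "east")))

# Compare mean and median ozone by region.
group_by(oz, region) %>%
  summarize(mean   = mean(Sample.Measurement),
            median = median(Sample.Measurement))

# Boxplot of the two regions.
boxplot(Sample.Measurement ~ region, data = oz, ylab = "Ozone (ppm)")
```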
The easy solution is nice because it is, well, easy, but you
should never allow those results to carry the day. You should
always be thinking of ways to challenge the results, espe-
cially if those results comport with your prior expectation.
Recall that previously we noticed that three states had some
25 20 15 5 30 7 5 10 12 40 30 30 10 25 10 20 10 10 25 5
[1] 0.1089893
Notice how closely the histogram bars and the blue curve
match. This is what we want to see with the data. If we see
Using Models to Explore Your Data 62
Okay, so the model and the data don’t match very well, as
was indicated by the histogram above. So what to do? Well,
we can either
the same data, which may impact decisions made down the
road.
Expectations
Note that if you choose any point on the blue line, there is
roughly the same number of points above the line as there
are below the line (this is also referred to as unbiased
errors). Also, the points on the scatterplot appear to increase
linearly as you move towards the right on the x-axis, even
if there is quite a bit of noise/scatter along the line.
If we are right about our linear model, and that is the model
that characterizes the data and the relationship between
ozone and temperature, then roughly speaking, this is the
picture we should see when we plot the data.
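That expected picture can be simulated directly: draw data from a linear model and plot it with the fitted line. The coefficient values below are illustrative, not estimates from the real ozone data.

```r
# Sketch: what data generated by a linear model should look like.
# Coefficients are illustrative.
set.seed(20)
temp <- runif(200, 50, 100)                      # temperature, degrees F
o3   <- 0.004 + 0.0006 * temp + rnorm(200, sd = 0.01)

plot(temp, o3, xlab = "Temperature (F)", ylab = "Ozone (ppm)")
abline(lm(o3 ~ temp), col = "blue", lwd = 2)     # fitted line
```

The simulated scatter shows the two features described above: errors balanced around the line, and a linear increase with noise.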
How does this picture compare to the picture that you were
expecting to see?
One thing is clear: There does appear to be an increasing
trend in ozone as temperature increases, as we hypothesized.
Refining expectations
5.6 Summary
How did the data make its way from the population to your
computer? Being able to describe this process is important
for determining whether the data are useful for making
inferences about features of the population. As an extreme
example, if you are interested in the average age of women
in a population, but your sampling process somehow is
designed so that it only produces data on men, then you
cannot use the data to make an inference about the average
age of women. Understanding the sampling process is key
to determining whether your sample is representative of
the population of interest. Note that if you have difficulty
describing the population, you will have difficulty describ-
ing the process of sampling data from the population. So
describing the sampling process hinges on your ability to
coherently describe the population.
y = β0 + β1 x + ε
The key point is that you never observe the full population
of penguins. Now what you end up with is your dataset,
which contains only three penguins.
Dataset of Penguins
Time series
Natural processes
Data as population
Primary model
Secondary models
y = α + βx + γz + ε
where
• y is the outcome
• x is the key predictor
• z is a potential confounder
• ε is independent random error
• α is the intercept, i.e. the value of y when x = 0 and z = 0
• β is the change in y associated with a 1-unit increase in x,
adjusting for z
• γ is the change in y associated with a 1-unit increase
in z, adjusting for x
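The role of z can be seen in a short simulation. This is an illustrative sketch with invented coefficients: x is correlated with the confounder z, so leaving z out of the model biases the estimate of β.

```r
# Sketch: how adjusting for a confounder z changes the estimate of beta.
# All numbers are simulated for illustration.
set.seed(30)
n <- 5000
z <- rnorm(n)                # confounder
x <- 0.5 * z + rnorm(n)      # key predictor, correlated with z
y <- 1 + 2 * x + 3 * z + rnorm(n)

coef(lm(y ~ x))["x"]         # unadjusted: biased away from the true beta
coef(lm(y ~ x + z))["x"]     # adjusted for z: close to the true beta = 2
```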
Expectations
The tick marks on the x-axis indicate the period when the
campaign was active. In this case, it’s pretty obvious what
effect the advertising campaign had on sales. Using just
your eyes, it’s possible to tell that the ad campaign added
about $100 per day to total daily sales. Your primary model
might look something like
y = α + βx + ε
Given these data and the primary model above, we'd estimate
β to be $96.78, which is not far off from our original guess
of $100.
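An estimate like that can be reproduced with simulated data. In this sketch the sales figures are invented, with the true campaign effect set to $100, and x is the on/off campaign indicator from the primary model.

```r
# Sketch: recovering the campaign effect with the primary model.
# Sales numbers are simulated; the true added effect is $100 per day.
set.seed(40)
days     <- 1:100
campaign <- as.numeric(days >= 40 & days <= 60)   # indicator x
sales    <- 200 + 100 * campaign + rnorm(100, sd = 20)

fit <- lm(sales ~ campaign)
coef(fit)["campaign"]   # estimate of beta, close to the true $100
```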
Setting Expectations. The discussion of this ideal scenario
is important not because it’s at all likely to occur, but rather
because it instructs on what we would expect to see if the
world operated according to a simpler framework and how
we would analyze the data under those expectations.
y = α + βx + γ1 t + γ2 t² + ε
Evaluation
Expectations
Here we can see that there isn’t quite the good separation
Reference
Prediction Bad Good
Bad 2 1
Good 73 174
Accuracy : 0.704
95% CI : (0.6432, 0.7599)
No Information Rate : 0.7
P-Value [Acc > NIR] : 0.4762
Kappa : 0.0289
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.99429
Specificity : 0.02667
Pos Pred Value : 0.70445
Neg Pred Value : 0.66667
Prevalence : 0.70000
Detection Rate : 0.69600
Detection Prevalence : 0.98800
Balanced Accuracy : 0.51048
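The headline numbers in that output can be recomputed by hand from the 2 × 2 table, treating "Good" as the positive class. This sketch uses only the counts shown above.

```r
# Recomputing accuracy, sensitivity, and specificity from the 2 x 2 table,
# with "Good" as the positive class.
tab <- matrix(c(2, 73, 1, 174), nrow = 2,
              dimnames = list(Prediction = c("Bad", "Good"),
                              Reference  = c("Bad", "Good")))

accuracy    <- (tab["Bad", "Bad"] + tab["Good", "Good"]) / sum(tab)
sensitivity <- tab["Good", "Good"] / sum(tab[, "Good"])   # 174 / 175
specificity <- tab["Bad", "Bad"]  / sum(tab[, "Bad"])     # 2 / 75

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 5)
# accuracy 0.704, sensitivity 0.99429, specificity 0.02667,
# matching the output above
```

Note how the near-perfect sensitivity and near-zero specificity together explain why the overall accuracy (0.704) barely beats the no-information rate (0.7).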
Evaluation
7.5 Summary
Note that there are many fewer points on the plot above
than there were on the plot of the mortality data. This is
because PM10 is not measured every day. Also note that
Inference vs. Prediction: Implications for Modeling Strategy 115
8.4 Summary
Directionality
Magnitude
Uncertainty
Now that you have a handle on what the model says about
the directionality and magnitude of the relationship be-
tween non-diet soda consumption and BMI, the next step
is to consider what the degree of uncertainty is for your
answer. Recall that your model has been constructed to fit
data collected from a sample of the overall population and
that you are using this model to understand how non-diet
soda consumption is related to BMI in the overall popula-
tion of adults in the US.
Let’s get back to our soda-BMI example, which does involve
using the results that are obtained on the sample to make
inferences about what the true soda-BMI relationship is in
the overall population of adults in the US. Let’s imagine that
the result from your analysis of the sample data indicates
that within your sample, people who drink an additional
ounce of non-diet soda per day have a BMI that is 0.28
kg/m² greater than those who drink an ounce less per day.
However, how do you know whether this result is simply
the “noise” of random sampling or whether it is a close
approximation of the true relationship among the overall
population?
To assess whether the result from the sample is simply
Interpreting Your Results 136
The interpretation is that you can be 95% confident that the
true result for the overall population is somewhere between
0.15 and 0.42 kg/m².
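Mechanically, an interval like that comes from the estimate and its standard error. In this sketch the standard error is back-solved from the reported interval purely for illustration.

```r
# Sketch: how a 95% confidence interval arises from an estimate and its
# standard error. The standard error here is illustrative, back-solved
# from the reported interval.
estimate <- 0.28          # kg/m^2 per additional daily ounce of soda
se       <- 0.069         # illustrative standard error

ci <- estimate + c(-1, 1) * qnorm(0.975) * se
round(ci, 2)              # close to the (0.15, 0.42) interval reported above
```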
Implications
10.3 Content
that it is coded 0, 1, and 2, but I don’t see any labels for those
codes. Do you know what these codes for the “education”
variable stand for?”
For the second type of communication, in which you are
seeking feedback because of a puzzling or unexpected is-
sue with your analysis, more background information will
be needed, but complete background information for the
overall project may not be. To illustrate this concept, let’s
assume that you have been examining the relationship be-
tween height and lung function and you construct a scat-
terplot, which suggests that the relationship is non-linear as
there appears to be curvature to the relationship. Although
you have some ideas about approaches for handling non-
linear relationships, you appropriately seek input from oth-
ers. After giving some thought to your objectives for the
communication, you settle on two primary objectives: (1)
To understand if there is a best approach for handling the
non-linearity of the relationship, and if so, how to deter-
mine which approach is best, and (2) To understand more
about the non-linear relationship you observe, including
whether this is expected and/or known and whether the
non-linearity is important to capture in your analyses.
To achieve your objectives, you will need to provide your
audience with some context and background, but providing
a comprehensive background for the data analysis project
and review of all of the steps you’ve taken so far is unnec-
essary and likely to absorb time and effort that would be
better devoted to your specific objectives. In this example,
appropriate context and background might include the fol-
lowing: (1) the overall objective of the data analysis, (2) how
height and lung function fit into the overall objective of
the data analysis, for example, height may be a potential
confounder, or the major predictor of interest, and (3)
what you have done so far with respect to height and lung
Communication 151
10.4 Style
10.5 Attitude
1. Always be checking
2. Always be challenging
3. Always be communicating
The best way for the epicycle framework and these activi-
ties to become second nature is to do a lot of data analysis,
so we encourage you to take advantage of the data analysis
opportunities that come your way. Although with practice,
many of these principles will become second nature to you,
we have found that revisiting these principles has helped to
resolve a range of issues we’ve faced in our own analyses.
We hope, then, that the book continues to serve as a useful
resource after you’re done reading it when you hit the
stumbling blocks that occur in every analysis.
About the Authors
Roger D. Peng is a Professor of Biostatistics at the Johns
Hopkins Bloomberg School of Public Health. He is also a
Co-Founder of the Johns Hopkins Data Science Specialization[1],
which has enrolled over 1.5 million students, the Johns
Hopkins Executive Data Science Specialization[2], the Simply
Statistics blog[3], where he writes about statistics and data
science for the general public, and the Not So Standard
Deviations[4] podcast. Roger can be found on Twitter and
GitHub under the user name @rdpeng[5].
Elizabeth Matsui is a Professor of Pediatrics, Epidemiology
and Environmental Health Sciences at Johns Hopkins
University and a practicing pediatric allergist/immunologist.
She directs a data management and analysis center with
Dr. Peng that supports epidemiologic studies and clinical
trials and is co-founder of Skybrude Consulting, LLC[6],
a data science consulting firm. Elizabeth can be found on
Twitter @eliza68[7].
[1] http://www.coursera.org/specialization/jhudatascience/1
[2] https://www.coursera.org/specializations/executive-data-science
[3] http://simplystatistics.org/
[4] https://soundcloud.com/nssd-podcast
[5] https://twitter.com/rdpeng
[6] http://skybrudeconsulting.com
[7] https://twitter.com/eliza68