
Biostatistics I: Basic for Public Health

Lecture No.: KUI 6111

Starting Date: 01/09/2017

Exploratory Data Analysis

Module: 2

Copyright 2017, S.A. Wilopo, Department of Biostatistics,
Epidemiology, and Population Health
Faculty of Medicine, Gadjah Mada University, Yogyakarta, Indonesia
1 Learning Objectives

2 Introduction for EDA

3 Exercises in the Class
3.1 Exercise 1
3.2 Exercise 2
3.3 Exercise 3
3.4 Exercise 4
3.5 Exercise 5
3.6 Exercise 6
3.7 Exercise 7
3.8 Exercise 8
3.9 Exercise 9
3.10 Exercise 10

4 Homework
4.1 Creating Table
4.2 Creating Tables

5 Output

6 References
6.1 Articles for Critical Appraisal
6.2 Required Reading

7 Log Sheet

1 Learning Objectives
1. Using appropriate numerical measures and/or visual displays, describe the
distribution of a categorical variable in context.

2. Using appropriate graphical displays and/or numerical measures, describe the
distribution of a quantitative variable in context: a) describe the overall pattern, and
b) describe striking deviations from the pattern.

3. Define and describe the features of the distribution of one quantitative variable
(shape, center, spread, outliers).

4. Apply the standard deviation rule to the special case of distributions having the
normal shape.

5. Define and interpret measures of position (percentiles, quartiles, the five-number
summary, z-scores).

6. Define and use the 1.5(IQR) and 3(IQR) criteria to identify potential outliers and
extreme outliers.

7. Write a scientific table in a correct format and interpret its contents.

2 Introduction for EDA

This summary provides a quick recap of the material in the Exploratory Data Analysis unit.
Please note that this summary does not provide complete coverage of the material, only
lists the main points. The purpose of exploratory data analysis (EDA) is to convert the
available data from their raw form to an informative one, in which the main features of
the data are illuminated.

When performing EDA, we should always:

use visual displays (graphs or tables) plus numerical summaries.

describe the overall pattern and mention any striking deviations from that pattern.

interpret the results we find in context.

When examining the distribution of a single variable, we distinguish between a cate-
gorical variable and a quantitative variable.

The distribution of a categorical variable is summarized using:

Display: pie chart or bar chart (variation: pictogram → can be misleading).

Numerical summaries: category (group) percentages.

The distribution of a quantitative variable is summarized using:

Display: histogram (or stemplot, mainly for small data sets). When describing
the distribution as displayed by the histogram, we should describe the:
a. Overall pattern → shape, center, spread.
b. Deviations from the pattern → outliers.

Numerical summaries: descriptive statistics (measure of center plus measure of spread).
a. If the distribution is symmetric with no outliers, use the mean and standard deviation.
b. Otherwise, use the five-number summary, in particular the median and IQR
(interquartile range).

The five-number summary and the 1.5(IQR) criterion for detecting outliers are the
ingredients we need to build the boxplot. Boxplots are most effective when used
side-by-side for comparing distributions (see also case C→Q in examining relationships).

In the special case of a distribution having the normal shape, the Standard Deviation
Rule applies. This rule tells us approximately what percent of the observations fall
within 1, 2, or 3 standard deviations of the mean. In particular, when a
distribution is approximately normal, almost all the observations (99.7%) fall within
3 standard deviations of the mean.
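
The Standard Deviation Rule can be checked empirically. The following is a minimal sketch in Python (an assumption; the course itself leaves the choice of software open) that simulates approximately normal values and counts the share falling within 1, 2, and 3 standard deviations of the mean. The simulated values, the seed, and the helper name share_within are all hypothetical and are not part of the BRFSS data.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Simulated, approximately normal values (hypothetical, not the BRFSS data)
values = [random.gauss(170, 10) for _ in range(10_000)]

mean = statistics.mean(values)
sd = statistics.stdev(values)


def share_within(k):
    """Proportion of values within k standard deviations of the mean."""
    inside = [v for v in values if abs(v - mean) <= k * sd]
    return len(inside) / len(values)


for k in (1, 2, 3):
    print(k, "SD:", round(100 * share_within(k), 1), "%")
```

The printed percentages should land close to 68%, 95%, and 99.7%, but only because the simulated data are normal; for a skewed variable the same count can differ noticeably.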

When examining the relationship between two variables, the first step is to classify
the two relevant variables according to their role and type, and only then to determine
the appropriate tools for summarizing the data. (We don't deal with case Q→C
in this course.)

Case C→Q: Exploring the relationship amounts to comparing the distributions of
the quantitative response variable for each category of the explanatory variable. To
do this, we use:

Display: side-by-side boxplots.

Numerical summaries: descriptive statistics of the response variable, for each
value (category) of the explanatory variable separately.

Case C→C: Exploring the relationship amounts to comparing the distributions of
the categorical response variable, for each category of the explanatory variable. To
do this, we use:

Display: two-way table.

Numerical summaries: conditional percentages (of the response variable for
each value (category) of the explanatory variable separately).
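
As an illustration of conditional percentages in case C→C, here is a minimal Python sketch. The ten (gender, exerany) pairs are hypothetical and only stand in for real rows; the names pairs, cell_counts, row_totals, and conditional_pct are illustrative.

```python
from collections import Counter

# Hypothetical (explanatory, response) pairs, e.g. (gender, exerany)
pairs = [("m", 1), ("m", 1), ("m", 0), ("m", 1),
         ("f", 1), ("f", 0), ("f", 0), ("f", 1), ("f", 1), ("f", 1)]

cell_counts = Counter(pairs)               # cells of the two-way table
row_totals = Counter(g for g, _ in pairs)  # totals per explanatory category

# Percentage of each response value within each explanatory category
conditional_pct = {
    (g, r): 100 * n / row_totals[g] for (g, r), n in cell_counts.items()
}

for key in sorted(conditional_pct):
    print(key, round(conditional_pct[key], 1))
```

Note that the percentages are computed within each category of the explanatory variable, so they sum to 100 per category rather than over the whole table.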

Case Q→Q: We examine the relationship using:

Display: scatterplot. When describing the relationship as displayed by the
scatterplot, be sure to consider:
* Overall pattern → direction, form, strength.
* Deviations from the pattern → outliers.

Labeling the scatterplot (that is, including a relevant third categorical variable in our
analysis) might add some insight into the nature of the relationship.

In the special case that the scatterplot displays a linear relationship (and only then),
we supplement the scatterplot with:

Numerical summaries: Pearson's correlation coefficient (r) measures the direction
and, more importantly, the strength of the linear relationship. The closer
r is to 1 (or -1), the stronger the positive (or negative) linear relationship. r is
unitless, influenced by outliers, and should be used only as a supplement to the
scatterplot.

When the relationship is linear (as displayed by the scatterplot, and supported
by the correlation r), we can summarize the linear pattern using the least
squares regression line. Remember that:
* The slope of the regression line tells us the average change in the response
variable associated with a 1-unit increase in the explanatory variable.
* When using the regression line for predictions, you should beware of extrapolation.
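
The correlation coefficient and the least squares line can be computed from the same three sums of squares and products. The sketch below is written in Python with hypothetical height and weight values (they are not taken from the BRFSS data), and the variable names are illustrative.

```python
import math

# Hypothetical heights (inches) and weights (pounds)
x = [62, 64, 66, 68, 70, 72]
y = [120, 136, 148, 160, 178, 188]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squared deviations and of cross-products
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

r = sxy / math.sqrt(sxx * syy)       # Pearson's correlation coefficient
slope = sxy / sxx                    # least squares slope
intercept = mean_y - slope * mean_x  # the line passes through (mean_x, mean_y)

print("r =", round(r, 3))
print("weight =", round(intercept, 1), "+", round(slope, 2), "* height")
```

In this sketch the slope is read as the average difference in weight (pounds) per additional inch of height. Predictions for heights far outside the observed 62 to 72 inches would be extrapolation.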

When examining the relationship between two variables (regardless of the case),
any observed relationship (association) does not imply causation, due to the pos-
sible presence of lurking variables.

When we include a lurking variable in our analysis, we might need to rethink the
direction of the relationship → Simpson's paradox.
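
A small numerical sketch may help to see how the direction of a relationship can reverse once a lurking variable is included. The counts below are illustrative only (they are not from the BRFSS data), and the labels "A", "B", "small", and "large" are placeholders for two groups and two levels of a lurking variable.

```python
# Illustrative (successes, total) counts per group within each stratum
strata = {
    "small": {"A": (81, 87), "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}


def rate(successes, total):
    """Success proportion for one cell."""
    return successes / total


# Within each stratum, group A has the higher success rate
for name, groups in strata.items():
    print(name, {g: round(rate(*cell), 3) for g, cell in groups.items()})

# Pooled over the strata, the comparison reverses
pooled = {}
for g in ("A", "B"):
    successes = sum(strata[name][g][0] for name in strata)
    total = sum(strata[name][g][1] for name in strata)
    pooled[g] = rate(successes, total)

print("pooled", {g: round(p, 3) for g, p in pooled.items()})
```

The reversal arises because the two groups are spread very unevenly across the strata, so the pooled rates mix the group effect with the stratum effect.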

3 Exercises in the Class

In this laboratory exercise, we will use data from the Behavioral Risk Factor Surveillance
System (BRFSS). This is an annual telephone survey of 350,000 people in the United
States. As its name implies, the BRFSS is designed to identify risk factors in the adult
population and report emerging health trends. For example, respondents are asked about their
diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even
their level of health care coverage. The BRFSS Web site contains a complete description of
the survey, including the research questions that motivate the study and many interesting
results derived from the data. We will focus on a random sample of 20,000 people from the
BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we
will work with a small subset of 9 variables.
I loaded this random sample of 20,000 people from the 2000 BRFSS survey into the R
workspace. After launching RStudio, I entered the following command.

I converted the active R data set into an ASCII text file named cdc.txt. I used the Rcmdr
utility provided by Prof. John Fox at my former school, McMaster University, Ontario,
Canada.
You should begin by loading the data set of 20,000 observations (cdc.txt) into your
preferred software, such as SPSS, STATA, or R. Alternatively, I have also provided the data
as a CSV file and in Excel 2007 or 2010 format. Each statistical package has its own native
file type and its own commands for importing other file types, so learn which command your
software uses to read each of these files.
To make sure that you can access the data correctly, first practice by creating a
small data set in any text editor (as an ASCII file). Then write the command that loads it
into your statistical software. Make sure that your small data set has the same format as
the cdc.txt file and that your command gives the correct result. Once the command runs
correctly, use it to load the cdc.txt file provided in this session.
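
The practice step above can be sketched as follows, here in Python for illustration. The two data rows are hypothetical, and the single-space delimiter is an assumption: check how cdc.txt is actually delimited before reusing this on the real file.

```python
import csv
import io

# A tiny stand-in for cdc.txt. The space delimiter is an assumption;
# a multi-word value such as "very good" would need tabs or quoting.
sample_text = """genhlth exerany hlthplan smoke100 height weight wtdesire age gender
good 0 1 0 70 175 175 77 m
fair 1 1 1 64 125 115 33 f
"""

# The first line holds the variable names, so DictReader uses it as the header
reader = csv.DictReader(io.StringIO(sample_text), delimiter=" ")
rows = list(reader)

numeric = ("exerany", "hlthplan", "smoke100", "height", "weight", "wtdesire", "age")
for row in rows:
    for name in numeric:
        row[name] = int(row[name])  # fields are read as text, so convert them

print(len(rows), "cases,", len(rows[0]), "variables")
print(rows[0])
```

To try the same steps on the real file, the io.StringIO(sample_text) object could be replaced by a file opened with open("cdc.txt", newline=""), adjusting the delimiter if needed.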
Your tutor will demonstrate how to read the text data file (cdc.txt) and will run a trial
program that reads this large data set into your preferred software. Please be aware that
the first line of this data set contains the name of each variable extracted from the
complete data set. He/she should also demonstrate several other file types, including CSV files.
Your software shows the variables in the data set with the names genhlth, exerany,
hlthplan, smoke100, height, weight, wtdesire, age, and gender. Each of these variables
corresponds to a question that was asked in the survey. For example, for genhlth,
respondents were asked to evaluate their general health, responding either excellent, very
good, good, fair, or poor. The exerany variable indicates whether the respondent exercised
in the past month (1) or did not (0). Likewise, hlthplan indicates whether the respondent
had some form of health coverage (1) or did not (0). The smoke100 variable indicates whether
the respondent had smoked at least 100 cigarettes in their lifetime. The other variables
record the respondent's height in inches, weight in pounds, desired weight (wtdesire),
age in years, and gender.

3.1 Exercise 1
Please create a code book for this cdc.txt file in tabular format with the following columns:
order number of the variable, variable name, variable definition, and scale of measurement.
To understand the meaning of these variables, you might go to the web site of this
data set for the complete information.

3.2 Exercise 2
How many cases are there in this data set? How many variables? For each variable, identify
its data type (e.g. categorical, discrete). Define your dependent and independent variables
in this data set. Justify your selection of those types of variables!

3.3 Exercise 3
Compute the relative frequency distribution for gender and exerany. How many males are
in the sample? What proportion of the sample reports being in excellent health? Can you
present general health status in graphical form? Consider using a pie chart and a bar chart.
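
A relative frequency distribution is simply each category count divided by the number of cases. A minimal Python sketch, using ten hypothetical genhlth responses rather than the real sample:

```python
from collections import Counter

# Hypothetical responses for one categorical variable (genhlth)
genhlth = ["good", "excellent", "very good", "good", "fair",
           "excellent", "very good", "very good", "poor", "good"]

counts = Counter(genhlth)  # frequency of each category
n = len(genhlth)
relative_freq = {level: count / n for level, count in counts.items()}

for level, share in relative_freq.items():
    print(level, counts[level], round(share, 2))
```

The relative frequencies always sum to 1, which is a quick check that no category has been dropped.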

3.4 Exercise 4
Create a new variable (object) called under23_andsmoke that contains all observations of
respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Give
the name of this new variable and put it in your codebook as well. Write the command
that you used to create the new variable (object) as the answer to this exercise.
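
For orientation only, subsetting on two conditions might look like this in Python. The four records are hypothetical, and the command in your own software will differ.

```python
# Hypothetical records; real rows would come from cdc.txt
records = [
    {"age": 21, "smoke100": 1},
    {"age": 45, "smoke100": 1},
    {"age": 19, "smoke100": 0},
    {"age": 22, "smoke100": 1},
]

# Keep respondents who are under 23 AND have smoked 100 cigarettes
under23_andsmoke = [
    r for r in records if r["age"] < 23 and r["smoke100"] == 1
]

print(len(under23_andsmoke), "of", len(records), "records selected")
```

Both conditions must hold at the same time, so the two tests are joined with a logical "and" rather than "or".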

3.5 Exercise 5
Create a numerical summary (descriptive statistics) for height and age, and make a histogram
for each of these two variables. What do you conclude from the results?

3.6 Exercise 6
Two common ways to visualize quantitative data are histograms and box plots. Your
software has a command for constructing a graph of a single variable. Please
create histograms of weight and height. Then compare males and females on their
weight and height. Do the histograms show any difference in the shape of the
distributions between males and females?

3.7 Exercise 7
The purpose of a boxplot is to provide a thumbnail sketch of a variable for the purpose of
comparing across several categories. So we can, for example, compare the heights of men
and women with boxplots. Here we are asking the software to give us box plots of height and
weight where the groups are defined by gender.

So far, in our discussion about measures of spread, some key players were:

a. the extremes (min and Max), which provide the range covered by all the data; and

b. the quartiles (Q1, M and Q3), which together provide the IQR, the range covered by
the middle 50% of the data.

Recall that the combination of all five numbers (min, Q1, M, Q3, Max) is called the
five number summary, and provides a quick numerical description of both the center and
spread of a distribution. You can compare the locations of the components of the box by
examining the summary statistics. Confirm that the median and upper and lower quartiles
reported in the numerical summary match those in the graph. An observation is considered
a suspected outlier or potential outlier if it is:

1. below Q1 - 1.5(IQR) or

2. above Q3 + 1.5(IQR)

Calculate the interquartile ranges and determine the outlier cut points using the
definition above. Exclude all outliers from the data set and create new variables for
weight and height. Make histograms for these two new variables. Does the interpretation
change compared with the earlier histograms?
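
The five-number summary and the 1.5(IQR) cut points can be sketched in Python as shown here. The eleven weights are hypothetical, and the "inclusive" quartile method is one of several conventions, so your software may report slightly different quartiles for the same data.

```python
import statistics

# Hypothetical weights in pounds (not the BRFSS data)
weights = [120, 135, 140, 150, 155, 160, 165, 170, 180, 190, 320]

# Quartiles; other software may use a different interpolation rule
q1, median, q3 = statistics.quantiles(weights, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr  # below this: suspected outlier
upper_fence = q3 + 1.5 * iqr  # above this: suspected outlier

five_number = (min(weights), q1, median, q3, max(weights))
kept = [w for w in weights if lower_fence <= w <= upper_fence]

print("five-number summary:", five_number)
print("fences:", lower_fence, upper_fence)
print("values kept:", len(kept), "of", len(weights))
```

Replacing 1.5 by 3 in the two fence lines gives the cut points for extreme outliers.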

3.8 Exercise 8
Next let's consider a new variable that doesn't show up directly in this data set: Body Mass
Index (BMI). Remember that in this data set weight was measured in pounds (lb) and height
in inches (in). Therefore BMI should be calculated with the following formula:
$$\mathrm{BMI} = \frac{\text{weight (lb)}}{[\text{height (in)}]^2} \times 703$$
The constant 703 is the approximate conversion factor that adjusts for the change of units
from metric (meters and kilograms) to imperial (inches and pounds).
Notice that you should create an arithmetic expression that will be applied to all 20,000
values in the cdc data set. That is, for each of the 20,000 participants, we take their
weight, divide by their height squared, and then multiply by 703. The result is 20,000 BMI
values, one for each respondent. Please perform the computation using very simple
expressions and produce descriptive statistics for BMI.
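
The element-by-element calculation can be sketched in Python as follows. The three height and weight pairs are hypothetical. The sketch also shows where the constant comes from, using the exact definitions 1 lb = 0.45359237 kg and 1 in = 0.0254 m.

```python
import statistics

# Hypothetical heights (inches) and weights (pounds)
heights_in = [70, 64, 60]
weights_lb = [175, 125, 105]

# Where 703 comes from: (kg per lb) / (m per in)^2
factor = 0.45359237 / 0.0254 ** 2

# Apply the same expression to every respondent
bmi = [703 * w / h ** 2 for w, h in zip(weights_lb, heights_in)]

print("conversion factor:", round(factor, 2))
print("BMI values:", [round(b, 1) for b in bmi])
print("mean BMI:", round(statistics.mean(bmi), 1))
```

Because the factor is rounded to 703, BMI values computed this way differ very slightly from those computed after converting to kilograms and meters.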

3.9 Exercise 9
Construct a box plot for the BMI data. What does this box plot show? Pick another categorical
variable from the data set and see how it relates to BMI. List the variable you chose, explain
why you think it might have a relationship to BMI, and indicate what the figure seems to
suggest.
3.10 Exercise 10
Here is some information that would be interesting to get from the BMI data:

1. Classify BMI into 3 categories according to your own classification (e.g. low, medium,
or high). You are thus recoding a quantitative variable into a qualitative one.

2. How is the sample divided across the three BMI categories? Is it equally divided?
If not, do the percentages follow some other kind of pattern? Justify your choice of
classification.

3. What percentage of the sampled BMI fall into each category?

4. Present BMI using bar chart and pie diagram. What is your interpretation?
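
Recoding a quantitative variable into categories can be sketched in Python as below. The cut points 18.5 and 25 are an assumption chosen only for illustration (the exercise asks you to justify your own), and the eight BMI values are hypothetical.

```python
from collections import Counter


def bmi_category(bmi, low_cut=18.5, high_cut=25.0):
    """Recode a BMI value into low, medium, or high (cut points assumed)."""
    if bmi < low_cut:
        return "low"
    if bmi < high_cut:
        return "medium"
    return "high"


# Hypothetical BMI values
bmi_values = [17.9, 21.5, 24.9, 25.1, 30.2, 22.0, 27.3, 19.4]

categories = [bmi_category(b) for b in bmi_values]
counts = Counter(categories)
percentages = {c: 100 * k / len(categories) for c, k in counts.items()}

for c in ("low", "medium", "high"):
    print(c, counts[c], percentages[c], "%")
```

Changing the two cut points changes the percentages, which is why the choice of classification needs to be justified.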

At this point, we've done a good first pass at analyzing the information in the BRFSS
questionnaire. We've found an interesting association between smoking and gender, and
we can say something about the relationship between people's assessment of their general
health and their own BMI. We've also picked up essential computing tools (summary
statistics, subsetting, and plots) that will serve us well throughout this course.

4 Homework
4.1 Creating Table
1. Recall the variable wdiff from the previous practical exercise. Describe the distribution
of wdiff in terms of its center, shape, and spread, and include any plots you use. Are
there any outliers in this variable? What does this tell us about how people feel about
their current weight?

2. Using numerical summaries and a side-by-side box plot, determine if men tend to
view their weight differently than women.

3. Now it's time to get creative. Find the mean and standard deviation of weight and
determine what proportion of the weights are within one standard deviation of the
mean. The Standard Deviation Rule:

(a) Approximately 68% of the observations fall within 1 standard deviation of the mean.
(b) Approximately 95% of the observations fall within 2 standard deviations of the mean.
(c) Approximately 99.7% (or virtually all) of the observations fall within 3 standard
deviations of the mean.
(d) Read an article with the following title: ... Can you decide what ....

4. Can you create z-scores for height and weight? Please construct a histogram of each
set of z-scores and compare it with a histogram of the standard normal distribution. Ask
your tutor to create this normal distribution for you.

5. You are asked to present the relationship between height and weight in this data set
using a graphical method. Can you identify possible outliers in weight and height?
Please discard the outliers of weight and height and recreate the scatterplot of
weight against height.

Please write your comments for all those graphs.
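
For item 4 above, a z-score is the observation minus the mean, divided by the standard deviation. A minimal Python sketch with seven hypothetical heights:

```python
import statistics

# Hypothetical heights in inches
heights = [60, 62, 64, 66, 68, 70, 72]

mean = statistics.mean(heights)
sd = statistics.stdev(heights)  # sample standard deviation

# z-score: how many standard deviations a value lies from the mean
z_scores = [(h - mean) / sd for h in heights]

print([round(z, 2) for z in z_scores])
print("mean of z:", round(statistics.mean(z_scores), 6))
print("SD of z:", round(statistics.stdev(z_scores), 6))
```

Standardizing changes the center to 0 and the spread to 1 but does not change the shape, so a skewed variable stays skewed after conversion to z-scores.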

4.2 Creating Tables

1. Read the following article by Annesley, T. M., entitled Bring Your Best to the Table.
Clin. Chem., 56(10), 1528-1534, 2010. Please create a scientific table using the data
presented in Figure 1 of this article. Your table should be submitted as a Word
file and must use a format similar to Table 2 in the article.

2. Look at Table 9 in Annesley's article. This table can be improved in several ways to
make it clearer and more informative. Please retype and modify this table. Compare
your suggested changes with those provided at the end of the article.

3. Read those tables and write your interpretation of them.

5 Output
1. Competence in performing exploratory data analysis (EDA) for a single categorical
or quantitative variable.

2. Competence in performing EDA of combined variables (categorical and/or quantitative
measures).

3. Understanding of the link between the type of EDA and the type of statistical distribution.

4. Competence in interpreting results of EDA, including how to identify and manage
outliers.

5. Competence in writing a scientific table

6 References
6.1 Articles for Critical Appraisal
1. Annesley, T. M. (2010). Bring Your Best to the Table. Clin. Chem., 56(10), 1528-1534.
doi: 10.1373/clinchem.2010.153502

2. Annesley, T. M. (2012). Now You Be the Judge. Clin. Chem., 58(11), 1520-1526.
doi: 10.1373/clinchem.2012.195529

6.2 Required Reading

1. Annesley, T. M. (2010). Bars and Pies Make Better Desserts than Figures. Clin. Chem.,
56(9), 1394-1400. doi: 10.1373/clinchem.2010.152298

2. Annesley, T. M. (2010). Put Your Best Figure Forward: Line Graphs and Scattergrams.
Clin. Chem., 56(8), 1229-1233. doi: 10.1373/clinchem.2010.150060

3. Franzblau, L. E., & Chung, K. C. (2012). Graphs, Tables, and Figures in Scientific
Publications: The Good, the Bad, and How Not to Be the Latter. The Journal of Hand
Surgery, 37(3), 591-596. doi:

4. Marshall, G., & Jonker, L. (2010). An introduction to descriptive statistics: A review
and practical guide. Radiography, 16(4), e1-e7. doi:

5. Chen, J. C., Cooper, R. J., McMullen, M. E., & Schriger, D. L. (2017). Graph Qual-
ity in Top Medical Journals. Annals of Emergency Medicine, 69(4), 453-461.e455.

6. Inskip, H., Ntani, G., Westbury, L., Di Gravio, C., D'Angelo, S., Parsons, C., & Baird, J.
(2017). Getting started with tables. Archives of Public Health, 75(1), 14. doi:10.1186/s13690-

7 Log Sheet
Name:                                    ID:

No   Activities                                                   Date   Signature   Comment

1.   Understanding data structure and appropriate format
     for analysis
2.   Group discussion on the EDA for single and combinations
     of categorical and quantitative data
3.   Assignment: Bar Chart, Histogram, Boxplot, and Scatterplots
4.   Assignment: Creating a scientific table

Score : ____________________