EDA Notes1

Course Information I
Course website:
Lecture 1: Review and Exploratory Data Analysis http://www.biostat.jhsph.edu/∼seckel/biostat2/
(EDA) Office hours
For questions and help
When? I’ll announce this tomorrow
Sandy Eckel
Homework
seckel@jhsph.edu
Three assignments
Department of Biostatistics,
Follow-up on material from class
The Johns Hopkins University, Baltimore USA
Written exam
When: Wednesday 21 May, 10.00 - 12.00
Where: Multimedia classroom (C wing, 2nd floor),
21 April 2008 Mannerheimintie 172/Kytosuontie 9
1 / 40 2 / 40
Course Information II Class goals
Biostat I
21 April 2008 to 21 May 2008 Numbers and probability
08.30 - 12.30 Monday, Tuesday, Thursday, Friday Sampling distributions and inference
08.30 - 10.15 Lecture Statistical models and association / causality
10.15 - 10.30 Break
10.30-12.30 Informal lecture, class exercise or computer lab Biostat II
Activities for the second half of class will vary; also time for Developing scientific questions
questions!
Translating questions into regression models
Interpreting results of regression
Critiquing the literature
3 / 40 4 / 40
Issues and recurring themes What is Biostatistics?
Populations are complicated... statistical techniques may not

capture all of the nuances Biostatistics is the use of data to describe and make inferences
about a scientific problem
Natural laws will not perfectly predict outcomes
Remember the “Bio” in Biostatistics!
Signal-to-Noise: Comparing a trend to its variability
Biostatistics has limitations: you can’t have it all
Bias-Variance trade-off: Unadjusted vs. adjusted estimates
Population vs. sample
5 / 40 6 / 40
Types of Biostatistics Exploratory Data Analysis (EDA)
1 Descriptive statistics
Exploratory data analysis (EDA): often not in literature ALWAYS look at your data!
Summaries: “Table 1” in a paper If you can’t see it, then don’t believe it!
Goal: to visualize relationships, generate hypotheses EDA allows us to:
1 Visualize distributions and relationships
2 Inferential statistics 2 Detect errors
Confirmatory data analysis 3 Assess assumptions for confirmatory analysis
Methods section of a paper EDA is the first step of data analysis
Goal: quantify relationships, test hypotheses
7 / 40 8 / 40
EDA methods (One-Way) Stem-and-Leaf Plots I
Age in years (10 observations):
25, 26, 29, 32, 35, 36, 38, 44, 49, 51

Ordering : Stem-and-Leaf plots
Grouping: frequency displays, distributions; histograms Age Interval Observations
Summaries: summary statistics, standard deviation, 20-29 569
box-and-whisker plots 30-39 2568
40-49 49
50-59 1
9 / 40 10 / 40
Stem-and-Leaf Plots II Stem-and-Leaf Plots III
Some statistical programs print output like this:

The age interval is the “stem”
The observations are the “leaves” Age Interval Observations
Rule of thumb: 2* 569
The number of stems should roughly equal the square root of 3* 2568
the number of observations 4* 49
Or the stems should be logical categories 5* 1
where 2* means 20-29.
11 / 40 12 / 40
Stem-and-Leaf Plots IV Frequency Distribution Tables
Output may also be shown like this:

Shows the number of observations for each range of data
Age Interval Observations Intervals can be chosen in ways similar to stem-and-leaf
2. 569 displays
3* 2
3. 568 Age Interval Frequency
4* 4 20-29 3
4. 9 30-39 4
5* 1 40-49 2
50-59 1
where 3* means 30-34 and 3. means 35-39.
13 / 40 14 / 40
Cumulative Frequency Distribution Tables Histograms
Show the frequency, the relative frequency, and cumulative Picture of the frequency or relative frequency distribution
frequency of observations
Histogram of Age
Age Interval Frequency Cum. Freq. Rel. Freq Cum. Rel. Freq.
20-29 3 3 0.3 0.3 3.0 Note: Graphs are generally
30-39 4 7 0.4 0.7 better to use in
2.0
Frequency
40-49 2 9 0.2 0.9 presentations than tables.

50-59 1 10 0.1 1.0 They allow your audience
to visualize a trend
1.0
This table shows an empirical distribution function quickly.

obtained from a sample
0.0
The true distribution function is the distribution of the 25 30 35 40 45 50 55

entire population Age
15 / 40 16 / 40
Summary Statistics Percentiles
The rth percentile, Pr

is the value that is greater than or equal to r percent of a
sample of n observations
Percentiles or less than or equal to (100-r) percent of the observations
Measures of central tendency
Measures of dispersion or variability Percentile Quartile Formula
n+1 th
P25 Q1 4 observation
n+1 th
3(n+1) th
17 / 40 18 / 40
Calculating quartiles I Calculating quartiles II
From the age data:
25, 26, 29, 32, 35, 36, 38, 44, 49, 51 Q1 = median of lower half of data
= third smallest value
with n=10 = 29
Q2 = median
= average of 5th and 6th observations Q3 = median of upper half of data
35 + 36 = third largest value
=
2 = 44
= 35.5
Note: If n is odd, include the median in the upper and lower half
Remember to order your data!
of the data.
19 / 40 20 / 40
Measures of Central Tendency Measures of spread: Range
Measure Formula
Pn
i=1 xi
Mean n = x̄
Median Middle observation Range = max-min
Mode Most frequent observation observation The difference between the maximum and minimum values
From age example:
Max = 51, Min = 25
From the age example the mean is: Range = 51-25 = 26
25+26+29+32+35+36+38+44+49+51
10 = 36.5
The mode is more helpful for categorical data, i.e. the most
frequent age interval is 30-39 and it has 4 observations.
21 / 40 22 / 40
Measures of spread: Variance Standard deviation
Standard deviation = Square root of the variance

q
Variance = Expected value of the squared deviation of the
σ = E [(X − X̄ )2 ]
observations from the true mean
Sample standard deviation = Square root of the sample variance
σ 2 = E [(X − X̄ )2 ]
sP
n 2
Sample variance = Average of the squared deviation of the i=1 (xi − x̄)
s=
observations from the sample mean n−1
Pn
2 (xi − x̄)2 √
s = i=1 From the age data: s = 82.9 = 9.1
n−1
Note: The units of the variance are years2 , while the units of
Sample variance from age example = 82.9 the standard deviation are years.
Interpretation: The standard deviation gives an idea of how
much observations differ from the mean
23 / 40 24 / 40
Box-and-whisker plots I Box-and-whisker plots II
Box-and-whisker plots display quartiles

Boxplot of Age
Some terminology:
50
Upper Hinge = Q3 = Third quartile
Lower Hinge = Q1 = First quartile IQR
35 40 45
Interquartile range (IQR) = Q3 − Q1 = 44-29 = 15
Age in Years
Contains the middle 50% of data Upper Fence
= 44 + 15*1.5 = 66.5
Upper Fence = Upper Hinge + 1.5 * (IQR)
Lower Fence = Lower Hinge - 1.5 * (IQR) Lower Fence
= 29 - 15*1.5 = 6.5
30
Outliers: Data values beyond the fences
“Whiskers” are drawn to the smallest and largest observations
25
within the fences
25 / 40 26 / 40
Pairwise EDA 2 Categorical Variables
Frequency Table
2 Categorical Variables Age Interval Gender Total

Frequency table Female Male
1 Categorical, 1 Continuous Variable 20-29 1 2 3
Stratified stem-and-leaf plots 30-39 2 2 4
Side-by-side box plots 40-49 1 1 2
2 Continuous variables 50-59 1 0 1
Scatterplot Total 5 5 10
It looks like the men tend to be younger than women in this
example.
27 / 40 28 / 40
1 Categorical and 1 Continuous Variable I 1 Categorical and 1 Continuous Variable II
Side-by-Side Box Plots
Stratified Stem-and-Leaf plots Boxplot of Age by Gender
Female Male
50
Age Interval Obs. Age Interval Obs. Allows us to compare the
35 40 45
Age in Years
20-29 6 20-29 59 distribution of the
30-39 56 30-39 28 continuous variable (age)
40-49 9 40-49 4 across values of the
50-59 1 50-59 categorical variable
30
Total 5 5 10 (gender)
25
Female Male
29 / 40 30 / 40
2 Continuous Variables EDA: What to notice
Scatterplot
Age by Height
185
●
●
Shape
Height in Centimeters
Center
175
Scatterplots visually
●
●
Spread
display the relationship
between two continuous
165
●
variables
●
155
25 30 35 40 45 50
Age in Years
31 / 40 32 / 40
Common Distribution Shapes Other Distribution Shapes
Symmetrical Positively skewed Negatively skewed

Bimodal Reverse J−shaped Uniform
and bell shaped or skewed to the right or skewed to the left
33 / 40 34 / 40
Measures of Center Skewness I
Positively skewed
Longer tail in the high values
Mean > Median > Mode
Positively skewed or skewed to the right

Mode: Peak(s)
Mode
Median: Equal areas point Median
Mean
Mean: Balancing point
35 / 40 36 / 40
Skewness II Symmetric
Negatively skewed Right and left sides are mirror images

Longer tail in the low values Left tail looks like right tail
Mean = Median = Mode
Mode > Median > Mean
Symmetric
Negatively skewed or skewed to the left
Mode
Median
Mean
37 / 40 38 / 40
EDA: What to notice Take Home Message
Outliers
Values that are “far” from the bulk of the data
Outliers can influence the value of some statistical measures
Look at your data FIRST!
Age example
Happy exploring!
Data Mean
Original 36.5
With 80-year-old added 40.5
39 / 40 40 / 40

EDA Notes1

Hochgeladen von

Dokumentinformationen

Copyright

Verfügbare Formate

Dieses Dokument teilen

Dokument teilen oder einbetten

Freigabeoptionen

Stufen Sie dieses Dokument als nützlich ein?

Sind diese Inhalte unangemessen?

Copyright:

Verfügbare Formate

EDA Notes1

Hochgeladen von

Copyright:

Verfügbare Formate

Course Information I

Course Information II Class goals

Populations are complicated... statistical techniques may not

Types of Biostatistics Exploratory Data Analysis (EDA)

Age in years (10 observations):

25, 26, 29, 32, 35, 36, 38, 44, 49, 51

Stem-and-Leaf Plots II Stem-and-Leaf Plots III

Some statistical programs print output like this:

where 2* means 20-29.

Output may also be shown like this:

Cumulative Frequency Distribution Tables Histograms

40-49 2 9 0.2 0.9 presentations than tables.

This table shows an empirical distribution function quickly.

The true distribution function is the distribution of the 25 30 35 40 45 50 55

The rth percentile, Pr

Calculating quartiles I Calculating quartiles II

From the age data:

Measures of spread: Variance Standard deviation

Standard deviation = Square root of the variance

Box-and-whisker plots display quartiles

“Whiskers” are drawn to the smallest and largest observations

Pairwise EDA 2 Categorical Variables

2 Categorical Variables Age Interval Gender Total

Side-by-Side Box Plots

Stratified Stem-and-Leaf plots Boxplot of Age by Gender

2 Continuous Variables EDA: What to notice

Symmetrical Positively skewed Negatively skewed

Measures of Center Skewness I

Positively skewed or skewed to the right

Negatively skewed Right and left sides are mirror images

EDA: What to notice Take Home Message

Das könnte Ihnen auch gefallen