Beruflich Dokumente
Kultur Dokumente
Course website:
Lecture 1: Review and Exploratory Data Analysis http://www.biostat.jhsph.edu/∼seckel/biostat2/
(EDA) Office hours
For questions and help
When? I’ll announce this tomorrow
Sandy Eckel
Homework
seckel@jhsph.edu
Three assignments
Department of Biostatistics,
Follow-up on material from class
The Johns Hopkins University, Baltimore USA
Written exam
When: Wednesday 21 May, 10.00 - 12.00
Where: Multimedia classroom (C wing, 2nd floor),
21 April 2008 Mannerheimintie 172/Kytosuontie 9
1 / 40 2 / 40
Biostat I
21 April 2008 to 21 May 2008 Numbers and probability
08.30 - 12.30 Monday, Tuesday, Thursday, Friday Sampling distributions and inference
08.30 - 10.15 Lecture Statistical models and association / causality
10.15 - 10.30 Break
10.30-12.30 Informal lecture, class exercise or computer lab Biostat II
Activities for the second half of class will vary; also time for Developing scientific questions
questions!
Translating questions into regression models
Interpreting results of regression
Critiquing the literature
3 / 40 4 / 40
Issues and recurring themes What is Biostatistics?
5 / 40 6 / 40
1 Descriptive statistics
Exploratory data analysis (EDA): often not in literature ALWAYS look at your data!
Summaries: “Table 1” in a paper If you can’t see it, then don’t believe it!
Goal: to visualize relationships, generate hypotheses EDA allows us to:
1 Visualize distributions and relationships
2 Inferential statistics 2 Detect errors
Confirmatory data analysis 3 Assess assumptions for confirmatory analysis
Methods section of a paper EDA is the first step of data analysis
Goal: quantify relationships, test hypotheses
7 / 40 8 / 40
EDA methods (One-Way) Stem-and-Leaf Plots I
9 / 40 10 / 40
11 / 40 12 / 40
Stem-and-Leaf Plots IV Frequency Distribution Tables
13 / 40 14 / 40
Show the frequency, the relative frequency, and cumulative Picture of the frequency or relative frequency distribution
frequency of observations
Histogram of Age
Age Interval Frequency Cum. Freq. Rel. Freq Cum. Rel. Freq.
20-29 3 3 0.3 0.3 3.0 Note: Graphs are generally
30-39 4 7 0.4 0.7 better to use in
2.0
Frequency
15 / 40 16 / 40
Summary Statistics Percentiles
17 / 40 18 / 40
25, 26, 29, 32, 35, 36, 38, 44, 49, 51 Q1 = median of lower half of data
= third smallest value
with n=10 = 29
Q2 = median
= average of 5th and 6th observations Q3 = median of upper half of data
35 + 36 = third largest value
=
2 = 44
= 35.5
Note: If n is odd, include the median in the upper and lower half
Remember to order your data!
of the data.
19 / 40 20 / 40
Measures of Central Tendency Measures of spread: Range
Measure Formula
Pn
i=1 xi
Mean n = x̄
Median Middle observation Range = max-min
Mode Most frequent observation observation The difference between the maximum and minimum values
From age example:
Max = 51, Min = 25
From the age example the mean is: Range = 51-25 = 26
25+26+29+32+35+36+38+44+49+51
10 = 36.5
The mode is more helpful for categorical data, i.e. the most
frequent age interval is 30-39 and it has 4 observations.
21 / 40 22 / 40
50
Upper Hinge = Q3 = Third quartile
Lower Hinge = Q1 = First quartile IQR
35 40 45
Interquartile range (IQR) = Q3 − Q1 = 44-29 = 15
Age in Years
Contains the middle 50% of data Upper Fence
= 44 + 15*1.5 = 66.5
Upper Fence = Upper Hinge + 1.5 * (IQR)
Lower Fence = Lower Hinge - 1.5 * (IQR) Lower Fence
= 29 - 15*1.5 = 6.5
30
Outliers: Data values beyond the fences
25
within the fences
25 / 40 26 / 40
Frequency Table
27 / 40 28 / 40
1 Categorical and 1 Continuous Variable I 1 Categorical and 1 Continuous Variable II
Female Male
50
Age Interval Obs. Age Interval Obs. Allows us to compare the
35 40 45
Age in Years
20-29 6 20-29 59 distribution of the
30-39 56 30-39 28 continuous variable (age)
40-49 9 40-49 4 across values of the
50-59 1 50-59 categorical variable
30
Total 5 5 10 (gender)
25
Female Male
29 / 40 30 / 40
Scatterplot
Age by Height
185
●
●
Shape
Height in Centimeters
Center
175
Scatterplots visually
●
●
Spread
display the relationship
between two continuous
165
●
variables
●
155
25 30 35 40 45 50
Age in Years
31 / 40 32 / 40
Common Distribution Shapes Other Distribution Shapes
33 / 40 34 / 40
Positively skewed
Longer tail in the high values
Mean > Median > Mode
35 / 40 36 / 40
Skewness II Symmetric
Mode
Median
Mean
37 / 40 38 / 40
Outliers
Values that are “far” from the bulk of the data
Outliers can influence the value of some statistical measures
Look at your data FIRST!
Age example
Happy exploring!
Data Mean
Original 36.5
With 80-year-old added 40.5
39 / 40 40 / 40