Statistics
Colm ODushlaine
Overview
Descriptive Statistics & Graphical Presentation of Data
Statistical Inference
Frequency Distributions/Histograms
Measures of data location
Measures of data spread
Box-plots
Scatter-plots
Clustering (Multivariate Data)
Two-Sample Inferences
Paired t-test
Two-sample t-test
Inferences for more than two samples
One-way ANOVA
Two-way ANOVA
Regression
Correlation
Multiple Regression
ANCOVA
Normality Checks
Non-parametrics
Sample Size Calculations
Useful tools and websites
1. Terminology
Populations & Samples
Variables
Categorical
Variables
2. Frequency Distributions
Example: Serum CK
Serum CK concentrations (U/l) for 36 males:
121  95  84 119  62  25  82 145  57 104  83 123
100 151  64 201 139  60 110 113  67  93  70  48
 68 101  78 118  92  95  58 163  94 203 110  42
CK (U/l)   Frequency   Relative Frequency   Cumulative Rel. Frequency
20-39          1             0.028                  0.028
40-59          4             0.111                  0.139
60-79          7             0.194                  0.333
80-99          8             0.222                  0.556
100-119        8             0.222                  0.778
120-139        3             0.083                  0.861
140-159        2             0.056                  0.917
160-179        1             0.028                  0.944
180-199        0             0.000                  0.944
200-219        2             0.056                  1.000
Total         36             1.000
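The frequency table can be rebuilt from the raw CK values. The following is a minimal sketch, assuming classes of width 20 U/l starting at 20 U/l (as in the table); the function name `frequency_table` is illustrative only. Cumulative values are computed from running counts rather than by adding up rounded relative frequencies.

```python
# Serum CK values (U/l) copied from the example (n = 36).
ck = [121, 95, 84, 119, 62, 25, 82, 145, 57, 104, 83, 123,
      100, 151, 64, 201, 139, 60, 110, 113, 67, 93, 70, 48,
      68, 101, 78, 118, 92, 95, 58, 163, 94, 203, 110, 42]

START = 20   # lower limit of the first class (assumed from the table)
WIDTH = 20   # class width (assumed from the table)

def frequency_table(values, start=START, width=WIDTH):
    """Return rows of (class label, frequency, relative freq., cumulative rel. freq.).

    Assumes every value is an integer >= start.
    """
    n = len(values)
    n_classes = (max(values) - start) // width + 1
    counts = [0] * n_classes
    for v in values:
        counts[(v - start) // width] += 1
    rows, running = [], 0
    for k, count in enumerate(counts):
        lo = start + k * width
        running += count
        rows.append((f"{lo}-{lo + width - 1}", count, count / n, running / n))
    return rows

table = frequency_table(ck)
for label, freq, rel, cum in table:
    print(f"{label:>8} {freq:3d} {rel:6.3f} {cum:6.3f}")
```

Adding the relative frequencies of the 60-79 and 80-99 classes gives 15/36, i.e. about 42% of the values lie between 60 and 100 U/l.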
Frequency Distribution
[Figure: histogram of CK concentration (U/l), 20 to 220 U/l, with frequency on the vertical axis, shown alongside a quantiles table (maximum, upper quartile, median, lower quartile, minimum).]
[Figure: relative frequency histogram of CK concentration (U/l), 20 to 220 U/l, with the mode, the left tail and the (skewed) right tail marked, shown alongside the quantiles table.]
The shaded area is the percentage of males with CK values between 60 and 100 U/l, i.e. 42%.
3. Measures of Central Tendency (Location)
Measures of location indicate where on the number
line the data are to be found. Common measures of
location are:
(i) the Arithmetic Mean,
(ii) the Median, and
(iii) the Mode
The Mean
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
Example
Example 2: The systolic blood pressures (mm Hg) of seven middle-aged men were as follows:
151, 124, 132, 170, 146, 124 and 113.
The mean is
\[ \bar{x} = \frac{151 + 124 + 132 + 170 + 146 + 124 + 113}{7} = \frac{960}{7} = 137.14 \]
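As a quick check of the arithmetic, the mean of the blood pressure data can be computed directly. This is a small sketch using only the Python standard library; the variable names are illustrative.

```python
from statistics import mean

# Systolic blood pressures (mm Hg) of the seven men in the example.
bp = [151, 124, 132, 170, 146, 124, 113]

# Arithmetic mean: sum of the observations divided by their number.
x_bar = sum(bp) / len(bp)

print(sum(bp))           # 960
print(round(x_bar, 2))   # 137.14

# The hand computation agrees with the standard-library function.
assert abs(x_bar - mean(bp)) < 1e-9
```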
Example 1: n is odd
The reordered systolic blood pressure data seen
earlier are:
113, 124, 124, 132, 146, 151, and 170.
The Median is the middle value of the ordered data,
i.e. 132.
Two individuals have systolic blood pressure = 124
mm Hg, so the Mode is 124.
Example 2: n is even
Six men with high cholesterol participated in a study to investigate
the effects of diet on cholesterol level. At the beginning of the study,
their cholesterol levels (mg/dL) were as follows:
366, 327, 274, 292, 274 and 230.
Rearrange the data in numerical order as follows:
230, 274, 274, 292, 327 and 366.
The Median is halfway between the middle two readings, i.e.
(274 + 292)/2 = 283.
Two men have the same cholesterol level, so the Mode is 274.
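Both median cases (n odd and n even) and the mode can be sketched in a few lines. The helper `median_by_hand` is a hypothetical name used here for illustration; the standard library's `statistics.median` follows the same rule.

```python
from statistics import median, mode

bp = [151, 124, 132, 170, 146, 124, 113]   # blood pressures, n = 7 (odd)
chol = [366, 327, 274, 292, 274, 230]      # cholesterol levels, n = 6 (even)

def median_by_hand(values):
    """Middle value of the sorted data, or the mean of the two middle values."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

print(median_by_hand(bp), mode(bp))       # 132 124
print(median_by_hand(chol), mode(chol))   # 283.0 274

# Same answers as the standard-library median.
assert median_by_hand(bp) == median(bp)
assert median_by_hand(chol) == median(chol)
```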
Large sample values tend to inflate the mean. This will happen if
the histogram of the data is right-skewed.
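The serum CK data illustrate this: the histogram has a long right tail, and the mean sits above the median. The sketch below also drops the two largest readings to show how strongly they pull the mean compared with the median; this trimming step is only an illustration, not part of the original example.

```python
from statistics import mean, median

# Serum CK values (U/l) from the frequency distribution example (n = 36).
ck = [121, 95, 84, 119, 62, 25, 82, 145, 57, 104, 83, 123,
      100, 151, 64, 201, 139, 60, 110, 113, 67, 93, 70, 48,
      68, 101, 78, 118, 92, 95, 58, 163, 94, 203, 110, 42]

print(round(mean(ck), 2))   # 98.28
print(median(ck))           # 94.5

# Remove the two largest values (201 and 203) and recompute.
trimmed = sorted(ck)[:-2]
print(round(mean(trimmed), 2))   # 92.18
print(median(trimmed))           # 93.5
```

The mean falls by about 6 U/l when the two largest values are removed, while the median moves by only 1 U/l.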
4. Measures of Dispersion
Range
Variance & Standard deviation
Coefficient of Variation (or relative standard deviation)
Inter-quartile range
Range
The range is the difference between the largest and the smallest observation.
Sample Variance
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
Standard Deviation
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]
Example
Data          Deviation     Deviation^2
151             13.86          192.02
124            -13.14          172.73
132             -5.14           26.45
170             32.86         1079.59
146              8.86           78.45
124            -13.14          172.73
113            -24.14          582.88
Sum = 960.0   Sum = 0.00    Sum = 2304.86
\bar{x} = 137.14
Example (contd.)
\[ \sum_{i=1}^{7} (x_i - \bar{x})^2 = 2304.86 \]
Therefore,
\[ s = \sqrt{\frac{2304.86}{7 - 1}} = 19.6 \]
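The deviation table and the resulting standard deviation can be reproduced step by step. A minimal sketch with the standard library; the variable names are illustrative.

```python
from math import sqrt
from statistics import stdev

# Systolic blood pressures (mm Hg) from the example.
bp = [151, 124, 132, 170, 146, 124, 113]
n = len(bp)
x_bar = sum(bp) / n

# Deviations from the mean and their squares.
deviations = [x - x_bar for x in bp]
sum_sq = sum(d ** 2 for d in deviations)

# Sample variance divides by n - 1, not n.
variance = sum_sq / (n - 1)
s = sqrt(variance)

print(round(sum_sq, 2))     # 2304.86
print(round(variance, 2))   # 384.14
print(round(s, 1))          # 19.6

# Deviations from the mean always sum to zero (up to rounding error).
assert abs(sum(deviations)) < 1e-9
# The hand computation agrees with the standard-library sample SD.
assert abs(s - stdev(bp)) < 1e-9
```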
Coefficient of Variation
\[ CV = \frac{s}{\bar{x}} \times 100\% \]
Example
The CV of the blood pressure data is:
\[ CV = \frac{19.6}{137.1} \times 100\% = 14.3\% \]
i.e., the standard deviation is 14.3% as large as
the mean.
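The same figure follows from the unrounded mean and standard deviation. A short sketch using the standard library:

```python
from statistics import mean, stdev

# Systolic blood pressures (mm Hg) from the example.
bp = [151, 124, 132, 170, 146, 124, 113]

# Coefficient of variation: sample SD as a percentage of the mean.
cv = 100 * stdev(bp) / mean(bp)

print(round(cv, 1))   # 14.3
```

Because the CV is a ratio, it has no units, which makes it convenient for comparing the spread of variables measured on different scales.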
Inter-quartile range
Example
The ordered blood pressure data are:
113 124 124 132 146 151 170
Q1 = 124 (median of the lower half) and Q3 = 151 (median of the upper half), so IQR = 151 - 124 = 27.
5. Box-plots
Minimum
Q1
Median
Q3
Maximum
Example 1
The pulse rates of 12 individuals arranged in
increasing order are:
62, 64, 68, 70, 70, 74, 74, 76, 76, 78, 78, 80
Q1 = (68 + 70)/2 = 69, Q3 = (76 + 78)/2 = 77
IQR = 77 - 69 = 8
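The five numbers a box-plot displays can be computed as follows. This sketch takes Q1 and Q3 as the medians of the lower and upper halves of the sorted data, which reproduces the worked values; the rule for odd n (leaving the overall median out of both halves) is an assumption, and other quartile conventions can give slightly different answers. The function name `quartiles` is illustrative.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    For odd n the overall median is excluded from both halves (assumed rule).
    """
    s = sorted(values)
    half = len(s) // 2
    lower, upper = s[:half], s[-half:]
    return median(lower), median(upper)

pulse = [62, 64, 68, 70, 70, 74, 74, 76, 76, 78, 78, 80]
q1, q3 = quartiles(pulse)

# Five-number summary: minimum, Q1, median, Q3, maximum.
five_number = (min(pulse), q1, median(pulse), q3, max(pulse))
print(five_number)   # (62, 69.0, 74.0, 77.0, 80)
print(q3 - q1)       # 8.0

# The same rule applied to the blood pressure data (n = 7, odd).
bp = [113, 124, 124, 132, 146, 151, 170]
print(quartiles(bp))   # (124, 151)
```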
Example 1: Box-plot
[Figure: box-plot of the pulse rates.]
[Figure: box-plots for the arrays AG_04659_AS.cel, AG_11745_AS.cel, KB_5828_AS.cel and KB_8840_AS.cel.]
Outliers
Outlier Boxplot
Example: CK data
[Figure: outlier box-plot of the CK data, with the outliers marked.]
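One way to flag outliers in the CK data is sketched below. It assumes the common 1.5 x IQR convention (points beyond Q1 - 1.5 IQR or Q3 + 1.5 IQR are flagged); the slides do not state which rule the outlier box-plot uses, so treat the fences as an assumption. Quartiles are taken as medians of the lower and upper halves of the sorted data.

```python
from statistics import median

# Serum CK values (U/l) from the frequency distribution example (n = 36).
ck = [121, 95, 84, 119, 62, 25, 82, 145, 57, 104, 83, 123,
      100, 151, 64, 201, 139, 60, 110, 113, 67, 93, 70, 48,
      68, 101, 78, 118, 92, 95, 58, 163, 94, 203, 110, 42]

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    s = sorted(values)
    half = len(s) // 2
    return median(s[:half]), median(s[-half:])

q1, q3 = quartiles(ck)
iqr = q3 - q1

# Fences under the assumed 1.5 x IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = sorted(x for x in ck if x < lower_fence or x > upper_fence)

print(q1, q3, iqr)                # 67.5 118.5 51.0
print(lower_fence, upper_fence)   # -9.0 195.0
print(outliers)                   # [201, 203]
```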
6. Scatter-plot
Pie-chart
Violin-plots
=boxplot+smooth density
Nice visual of data shape
Multivariate Data
7. Clustering
UPGMA
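Contrived Example
[Table: expression values of the genes p53, mdm2, bcl2, cyclinE and caspase 8 on each array.]
The distance between two genes x and y measured on three arrays is
\[ d_{xy} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2} \]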
Example
Distance matrix:
             mdm2     bcl2     cyclinE   caspase 8
p53          2.5      10.44    4.12      11.75
mdm2                  12.5     6.4       13.93
bcl2                           6.48      1.41
cyclinE                                  7.35

After merging caspase-8 and bcl-2:
             mdm2     cyclin E   {caspase-8 & bcl-2}
p53          2.5      4.12       10.9
mdm2                  6.4        9.1
cyclin E                         6.9

After merging p53 and mdm2:
                 cyclin E   {caspase-8 & bcl-2}
{p53 & mdm2}     3.7        9.2
cyclin E                    6.9
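The merging procedure can be sketched in plain Python. This is a minimal illustration of average linkage (UPGMA) run on the pairwise distances of the first distance matrix; the function names are hypothetical. Between-cluster distances are computed as arithmetic means of the original pairwise distances, so they are not guaranteed to reproduce the entries of the reduced matrices in the worked example.

```python
from itertools import combinations

# Pairwise distances copied from the first distance matrix of the example.
DIST = {
    ("p53", "mdm2"): 2.5, ("p53", "bcl2"): 10.44,
    ("p53", "cyclinE"): 4.12, ("p53", "caspase 8"): 11.75,
    ("mdm2", "bcl2"): 12.5, ("mdm2", "cyclinE"): 6.4,
    ("mdm2", "caspase 8"): 13.93, ("bcl2", "cyclinE"): 6.48,
    ("bcl2", "caspase 8"): 1.41, ("cyclinE", "caspase 8"): 7.35,
}

def gene_distance(a, b):
    """Look up a distance regardless of the order of the two gene names."""
    return DIST[(a, b)] if (a, b) in DIST else DIST[(b, a)]

def cluster_distance(c1, c2):
    """Average linkage: mean of all pairwise distances between two clusters."""
    total = sum(gene_distance(a, b) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))

def upgma(genes):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [(g,) for g in genes]
    history = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        history.append((c1, c2, round(cluster_distance(c1, c2), 2)))
        # Replace the two merged clusters by their union.
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return history

history = upgma(["p53", "mdm2", "bcl2", "cyclinE", "caspase 8"])
for left, right, height in history:
    print(left, "+", right, "at", height)
```

With these distances the first merge joins bcl2 and caspase 8 (distance 1.41), the second joins p53 and mdm2 (distance 2.5), and cyclinE then joins the {p53, mdm2} cluster before the two remaining clusters are merged.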
Example (contd)
Distance Metrics
Euclidean (as-the-crow-flies)
Manhattan
Minkowski (a whole class of metrics)
Correlation (similarity in profiles: called similarity metrics)
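The metrics listed above can be compared on a small example. The sketch below treats Euclidean and Manhattan distance as special cases of the Minkowski metric (p = 2 and p = 1) and uses the Pearson correlation as the similarity measure; the two expression profiles are hypothetical values chosen only for illustration.

```python
from math import sqrt

def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length profiles."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def euclidean(x, y):
    """Straight-line (as-the-crow-flies) distance: Minkowski with p = 2."""
    return minkowski(x, y, 2)

def manhattan(x, y):
    """Sum of absolute coordinate differences: Minkowski with p = 1."""
    return minkowski(x, y, 1)

def pearson_correlation(x, y):
    """Pearson correlation: measures similarity in the shape of two profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical expression profiles of two genes over three arrays.
gene_a = [1.0, 2.0, 3.0]
gene_b = [2.0, 4.0, 6.0]

print(manhattan(gene_a, gene_b))                    # 6.0
print(round(euclidean(gene_a, gene_b), 2))          # 3.74
print(round(pearson_correlation(gene_a, gene_b), 6))  # 1.0
```

The two profiles are far apart in Euclidean and Manhattan terms but perfectly correlated, which is why the choice of metric can change which genes end up clustered together.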
Linkage Rules
Clustering Summary