
Introduction to Statistics

Colm O'Dushlaine
Neuropsychiatric Genetics, TCD
codushlaine@gmail.com

Overview

Descriptive Statistics & Graphical Presentation of Data
Statistical Inference
Hypothesis Tests & Confidence Intervals
T-tests (Paired/Two-sample)
Regression (SLR & Multiple Regression)
ANOVA/ANCOVA

Intended as an interview. Will provide slides after lectures.
What's in the lectures?

Lecture 1: Descriptive Statistics and Graphical Presentation
1. Terminology of Data
2. Frequency Distributions/Histograms
3. Measures of data location
4. Measures of data spread
5. Box-plots
6. Scatter-plots
7. Clustering (Multivariate Data)

Lecture 2: Statistical Inference
1. Distributions & Densities
2. Normal Distribution
3. Sampling Distribution & Central Limit Theorem
4. Hypothesis Tests
5. P-values
6. Confidence Intervals
7. Two-Sample Inferences
8. Paired Data

Lecture 3: Sample Inferences
1. Two-Sample Inferences
   Paired t-test
   Two-sample t-test
2. Inferences for more than two samples
   One-way ANOVA
   Two-way ANOVA
   Interactions in Two-way ANOVA
3. DataDesk demo

Lecture 4
1. Regression
2. Correlation
3. Multiple Regression
4. ANCOVA
5. Normality Checks
6. Non-parametrics
7. Sample Size Calculations
8. Useful tools and websites

FIRST, A REALLY USEFUL SITE

Explanations of outputs
Videos with commentary
Help with deciding what test to use with what data

1. Terminology

Populations & Samples

Population: the complete set of individuals, objects or scores of interest.
  Often too large to sample in its entirety.
  It may be real or hypothetical (e.g. the results from an experiment repeated ad infinitum).

Sample: a subset of the population.
  A sample may be classified as random (each member has an equal chance of being selected from the population) or convenience (what's available).
  Random selection attempts to ensure the sample is representative of the population.
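
A minimal Python sketch of the difference between a random and a convenience sample. The population of 1000 numbered individuals, the sample size of 10 and the seed are illustrative assumptions, not part of the lecture.

```python
import random

# Hypothetical population: individuals labelled 0..999 (illustrative only).
population = list(range(1000))

# Simple random sample: every member has an equal chance of being selected.
rng = random.Random(42)  # fixed seed so the sketch is reproducible
random_sample = rng.sample(population, k=10)

# Convenience sample: whatever is easiest to reach, here the first 10 members.
convenience_sample = population[:10]

print("random:     ", sorted(random_sample))
print("convenience:", convenience_sample)
```

The convenience sample only ever sees the start of the list, which is why it may not be representative of the whole population.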

Variables

Variables are the quantities measured in a sample. They may be classified as:

Quantitative, i.e. numerical
  Continuous (e.g. pH of a sample, patient cholesterol levels)
  Discrete (e.g. number of bacteria colonies in a culture)

Categorical
  Nominal (e.g. gender, blood group)
  Ordinal (ranked, e.g. mild, moderate or severe illness). Often ordinal variables are re-coded to be quantitative.

Variables

Variables can be further classified as:

Dependent/Response: the variable of primary interest (e.g. blood pressure in an antihypertensive drug trial). Not controlled by the experimenter.
Independent/Predictor:
  called a Factor when controlled by the experimenter. It is often nominal (e.g. treatment).
  called a Covariate when not controlled.

If the value of a variable cannot be predicted in advance then the variable is referred to as a random variable.

Parameters & Statistics

Parameters: quantities that describe a population characteristic. They are usually unknown and we wish to make statistical inferences about parameters. Different to perimeters.

Descriptive Statistics: quantities and techniques used to describe a sample characteristic or illustrate the sample data, e.g. mean, standard deviation, box-plot.

2. Frequency Distributions

An (Empirical) Frequency Distribution or Histogram for a continuous variable presents the counts of observations grouped within pre-specified classes or groups.

A Relative Frequency Distribution presents the corresponding proportions of observations within the classes.

A Barchart presents the frequencies for a categorical variable.

Example: Serum CK

Blood samples were taken from 36 male volunteers as part of a study to determine the natural variation in CK concentration.

The serum CK concentrations, measured in U/l, are as follows:

Serum CK Data for 36 male volunteers

121   82  100  151   68   58
 95  145   64  201  101  163
 84   57  139   60   78   94
119  104  110  113  118  203
 62   83   67   93   92  110
 25  123   70   48   95   42

Relative Frequency Table

Serum CK (U/l)   Frequency   Relative Frequency   Cumulative Rel. Frequency
20-39                1             0.028                  0.028
40-59                4             0.111                  0.139
60-79                7             0.194                  0.333
80-99                8             0.222                  0.556
100-119              8             0.222                  0.778
120-139              3             0.083                  0.861
140-159              2             0.056                  0.917
160-179              1             0.028                  0.944
180-199              0             0.000                  0.944
200-219              2             0.056                  1.000
Total               36             1.000
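
The relative frequency table can be rebuilt from the raw data. The sketch below assumes the class boundaries used in the table (width 20, starting at 20 U/l); the variable names are illustrative.

```python
# Serum CK concentrations (U/l) for the 36 male volunteers.
ck = [121, 82, 100, 151, 68, 58,
      95, 145, 64, 201, 101, 163,
      84, 57, 139, 60, 78, 94,
      119, 104, 110, 113, 118, 203,
      62, 83, 67, 93, 92, 110,
      25, 123, 70, 48, 95, 42]

width = 20
n = len(ck)

table = []       # rows of (class label, frequency, relative freq., cumulative rel. freq.)
cumulative = 0
for lo in range(20, 220, width):            # classes 20-39, 40-59, ..., 200-219
    count = sum(lo <= x < lo + width for x in ck)
    cumulative += count
    table.append((f"{lo}-{lo + width - 1}", count, count / n, cumulative / n))

for label, count, rel, cum in table:
    print(f"{label:>8} {count:>3} {rel:6.3f} {cum:6.3f}")

# Proportion of males with CK between 60 and 100 U/l (classes 60-79 and 80-99).
share_60_99 = sum(60 <= x < 100 for x in ck) / n
print(f"{share_60_99:.0%}")
```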

Frequency Distribution

[Figure: histogram of CK concentration (U/l); x-axis from 20 to 220 U/l, y-axis Frequency, with a table of quantiles alongside]

Relative Frequency Distribution

[Figure: relative frequency histogram of CK concentration (U/l); x-axis from 20 to 220 U/l, y-axis Relative Frequency. Annotations: mode, left tail, right tail (skewed)]

The shaded area is the percentage of males with CK values between 60 and 100 U/l, i.e. 42%.

3. Measures of Central Tendency (Location)

Measures of location indicate where on the number line the data are to be found. Common measures of location are:
(i) the Arithmetic Mean,
(ii) the Median, and
(iii) the Mode

The Mean

Let x1, x2, x3, ..., xn be the realised values of a random variable X, from a sample of size n. The sample arithmetic mean is defined as:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Example

Example 2: The systolic blood pressures of seven middle-aged men were as follows:
151, 124, 132, 170, 146, 124 and 113.
The mean is

$$\bar{x} = \frac{151 + 124 + 132 + 170 + 146 + 124 + 113}{7} = 137.14$$

The Median and Mode

If the sample data are arranged in increasing order, the median is
(i) the middle value if n is an odd number, or
(ii) midway between the two middle values if n is an even number.

The mode is the most commonly occurring value.

Example 1: n is odd

The reordered systolic blood pressure data seen earlier are:
113, 124, 124, 132, 146, 151 and 170.
The Median is the middle value of the ordered data, i.e. 132.
Two individuals have systolic blood pressure = 124 mm Hg, so the Mode is 124.

Example 2: n is even

Six men with high cholesterol participated in a study to investigate the effects of diet on cholesterol level. At the beginning of the study, their cholesterol levels (mg/dL) were as follows:
366, 327, 274, 292, 274 and 230.
Rearrange the data in numerical order as follows:
230, 274, 274, 292, 327 and 366.
The Median is halfway between the middle two readings, i.e. (274 + 292)/2 = 283.
Two men have the same cholesterol level, so the Mode is 274.
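
The three measures of location can be checked for both examples with Python's standard `statistics` module. This is only a sketch; the variable names are illustrative.

```python
import statistics

blood_pressure = [151, 124, 132, 170, 146, 124, 113]  # n = 7 (odd)
cholesterol = [366, 327, 274, 292, 274, 230]          # n = 6 (even)

for name, data in [("blood pressure", blood_pressure),
                   ("cholesterol", cholesterol)]:
    mean = statistics.mean(data)
    # median() sorts internally and averages the two middle values when n is even
    median = statistics.median(data)
    # mode() returns the most commonly occurring value
    mode = statistics.mode(data)
    print(f"{name}: mean={mean:.2f}, median={median}, mode={mode}")
```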

Mean versus Median

Large sample values tend to inflate the mean. This will happen if the histogram of the data is right-skewed.

The median is not influenced by large sample values and is a better measure of centrality if the distribution is skewed.

Note: if the distribution is symmetrical and unimodal, then mean = median = mode.

e.g. in the CK measurement study, the sample mean = 98.28 and the median = 94.5, i.e. the mean is larger than the median, indicating that the mean is inflated by the two large data values 201 and 203.
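
A short sketch of the same comparison on the 36 serum CK values listed earlier. Dropping the two largest values is done here purely for illustration, to show how strongly they pull the mean compared with the median.

```python
import statistics

# Serum CK concentrations (U/l) for the 36 male volunteers.
ck = [121, 82, 100, 151, 68, 58,
      95, 145, 64, 201, 101, 163,
      84, 57, 139, 60, 78, 94,
      119, 104, 110, 113, 118, 203,
      62, 83, 67, 93, 92, 110,
      25, 123, 70, 48, 95, 42]

mean_all = statistics.mean(ck)
median_all = statistics.median(ck)

# Illustration only: remove the two largest values (201 and 203).
without_top_two = sorted(ck)[:-2]
mean_trimmed = statistics.mean(without_top_two)
median_trimmed = statistics.median(without_top_two)

print(f"all data:        mean={mean_all:.2f}, median={median_all}")
print(f"without 201/203: mean={mean_trimmed:.2f}, median={median_trimmed}")
```

The mean moves by about 6 U/l while the median moves by 1 U/l.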

4. Measures of Dispersion

Measures of dispersion characterise how spread out the distribution is, i.e., how variable the data are.
Commonly used measures of dispersion include:
1. Range
2. Variance & Standard deviation
3. Coefficient of Variation (or relative standard deviation)
4. Inter-quartile range

Range

The sample Range is the difference between the largest and smallest observations in the sample.
  Easy to calculate.
  Blood pressure example: min = 113 and max = 170, so the range = 57 mm Hg.
  Useful for "best" or "worst" case scenarios.
  Sensitive to extreme values.

Sample Variance

The sample variance, s², is the sum of the squared deviations from the sample mean divided by n - 1 (approximately their arithmetic mean):

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Standard Deviation

The sample standard deviation, s, is the square root of the variance:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

s has the advantage of being in the same units as the original variable x.

Example

Data          Deviation     Deviation²
151             13.86         192.02
124            -13.14         172.73
132             -5.14          26.45
170             32.86        1079.59
146              8.86          78.45
124            -13.14         172.73
113            -24.14         582.88
Sum = 960.0   Sum = 0.00    Sum = 2304.86

$$\bar{x} = 137.14$$

Example (contd.)

$$\sum_{i=1}^{7} (x_i - \bar{x})^2 = 2304.86$$

Therefore,

$$s = \sqrt{\frac{2304.86}{7 - 1}} = 19.6$$
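
The worked example can be reproduced step by step. This sketch follows the columns of the table (deviation, squared deviation) and then cross-checks against `statistics.stdev`, which also uses the n - 1 divisor.

```python
import math
import statistics

data = [151, 124, 132, 170, 146, 124, 113]   # systolic blood pressure, mm Hg
n = len(data)

x_bar = sum(data) / n                         # sample mean
deviations = [x - x_bar for x in data]        # sums to zero, up to rounding
sum_sq = sum(d ** 2 for d in deviations)      # sum of squared deviations

variance = sum_sq / (n - 1)                   # sample variance, divisor n - 1
s = math.sqrt(variance)                       # sample standard deviation

print(f"mean={x_bar:.2f}, sum of squares={sum_sq:.2f}, s={s:.1f}")
```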

Coefficient of Variation

The coefficient of variation (CV) or relative standard deviation (RSD) is the sample standard deviation expressed as a percentage of the mean, i.e.

$$CV = \frac{s}{\bar{x}} \times 100\%$$

The CV is not affected by multiplicative changes in scale.
Consequently, it is a useful way of comparing the dispersion of variables measured on different scales.

Example

The CV of the blood pressure data is:

$$CV = \frac{19.6}{137.1} \times 100\% = 14.3\%$$

i.e., the standard deviation is 14.3% as large as the mean.
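
A small sketch of the CV calculation, including a check that a multiplicative change of scale leaves it unchanged. The conversion to kPa (1 mm Hg is about 0.133322 kPa) is an added illustration, not part of the lecture example.

```python
import statistics

def coefficient_of_variation(data):
    """Sample standard deviation expressed as a percentage of the mean."""
    return statistics.stdev(data) / statistics.mean(data) * 100

bp_mmhg = [151, 124, 132, 170, 146, 124, 113]
bp_kpa = [x * 0.133322 for x in bp_mmhg]      # same readings on a different scale

cv_mmhg = coefficient_of_variation(bp_mmhg)
cv_kpa = coefficient_of_variation(bp_kpa)

print(f"CV in mm Hg: {cv_mmhg:.1f}%")
print(f"CV in kPa:   {cv_kpa:.1f}%")
```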

Inter-quartile range

The Median divides a distribution into two halves.

The first and third quartiles (denoted Q1 and Q3) are defined as follows:
  25% of the data lie below Q1 (and 75% lie above Q1),
  25% of the data lie above Q3 (and 75% lie below Q3).

The inter-quartile range (IQR) is the difference between the first and third quartiles, i.e.
IQR = Q3 - Q1

Example

The ordered blood pressure data are:
113 124 124 132 146 151 170
Q1 = 124 and Q3 = 151, so the inter-quartile range (IQR) is 151 - 124 = 27.

60% of slides complete!


5. Box-plots

A box-plot is a visual description of the distribution based on
  Minimum
  Q1
  Median
  Q3
  Maximum

Useful for comparing large sets of data

Example 1

The pulse rates of 12 individuals arranged in increasing order are:
62, 64, 68, 70, 70, 74, 74, 76, 76, 78, 78, 80
Q1 = (68 + 70)/2 = 69, Q3 = (76 + 78)/2 = 77
IQR = 77 - 69 = 8
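
Software packages use several different conventions for quartiles, so results can differ slightly from hand calculations. The sketch below assumes the convention that matches the worked examples: Q1 and Q3 are the medians of the lower and upper halves of the ordered data, leaving out the overall median when n is odd. The function names are illustrative.

```python
import statistics

def quartiles(data):
    """Q1 and Q3 as medians of the lower and upper halves (needs n >= 2)."""
    ordered = sorted(data)
    half = len(ordered) // 2          # the middle value is skipped when n is odd
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[len(ordered) - half:])
    return q1, q3

def five_number_summary(data):
    """Minimum, Q1, median, Q3, maximum: the ingredients of a box-plot."""
    q1, q3 = quartiles(data)
    return min(data), q1, statistics.median(data), q3, max(data)

blood_pressure = [113, 124, 124, 132, 146, 151, 170]
pulse = [62, 64, 68, 70, 70, 74, 74, 76, 76, 78, 78, 80]

for name, data in [("blood pressure", blood_pressure), ("pulse", pulse)]:
    q1, q3 = quartiles(data)
    print(f"{name}: Q1={q1}, Q3={q3}, IQR={q3 - q1}")

print(five_number_summary(pulse))
```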

Example 1: Box-plot

[Figure: box-plot for Example 1]

Example 2: Box-plots of intensities from 11 gene expression arrays

[Figure: one box-plot per array; visible array labels include AG_04659_AS.cel, AG_11745_AS.cel, KB_5828_AS.cel and KB_8840_AS.cel]

Outliers

An outlier is an observation which does not appear to belong with the other data.
Outliers can arise because of a measurement or recording error or because of equipment failure during an experiment, etc.
An outlier might be indicative of a sub-population, e.g. an abnormally low or high value in a medical test could indicate presence of an illness in the patient.

Outlier Boxplot

Re-define the upper and lower limits of the boxplots (the whisker lines) as:
Lower limit = Q1 - 1.5 × IQR, and
Upper limit = Q3 + 1.5 × IQR

Note that the lines may not go as far as these limits.

If a data point is < lower limit or > upper limit, the data point is considered to be an outlier.
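
A sketch of the outlier rule applied to the 36 serum CK values listed earlier. It assumes quartiles are taken as the medians of the lower and upper halves of the ordered data; other quartile conventions give slightly different limits.

```python
import statistics

def quartiles(data):
    """Q1 and Q3 as medians of the lower and upper halves (needs n >= 2)."""
    ordered = sorted(data)
    half = len(ordered) // 2
    return (statistics.median(ordered[:half]),
            statistics.median(ordered[len(ordered) - half:]))

# Serum CK concentrations (U/l) for the 36 male volunteers.
ck = [121, 82, 100, 151, 68, 58,
      95, 145, 64, 201, 101, 163,
      84, 57, 139, 60, 78, 94,
      119, 104, 110, 113, 118, 203,
      62, 83, 67, 93, 92, 110,
      25, 123, 70, 48, 95, 42]

q1, q3 = quartiles(ck)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = [x for x in sorted(ck) if x < lower_limit or x > upper_limit]

# The whisker lines stop at the most extreme data points inside the limits.
inside = [x for x in ck if lower_limit <= x <= upper_limit]
whiskers = (min(inside), max(inside))

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"limits: {lower_limit} to {upper_limit}")
print(f"outliers: {outliers}, whiskers: {whiskers}")
```

Under this convention the two flagged points are the same large values that inflate the CK mean.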

Example: CK data

[Figure: outlier box-plot of the CK data, with the outliers marked]

6. Scatter-plot

Displays the relationship between two continuous variables.

Useful in the early stage of analysis when exploring data and determining if a linear regression analysis is appropriate.

May show outliers in your data.

Example 1: Age versus Systolic Blood Pressure in a Clinical Trial

[Figure: scatter-plot]

Example 2: Up-regulation/Down-regulation of gene expression across an array (Control Cy5 versus Disease Cy3)

[Figure: scatter-plot]

Example of a Scatter-plot matrix (multiple pair-wise plots)

[Figure: scatter-plot matrix]

Other graphical representations

Dot-Plots, Stem-and-leaf plots
  Not visually appealing

Pie-chart
  Visually appealing, but hard to compare two datasets. Best for 3 to 7 categories. A total must be specified.

Violin-plots
  = boxplot + smooth density
  Nice visual of data shape

Multivariate Data

Clustering is useful for visualising multivariate data and uncovering patterns, often reducing its complexity.

Clustering is especially useful for high-dimensional data (p >> n): hundreds or perhaps thousands of variables.

Obvious areas of application are gel electrophoresis and microarray experiments, where the variables are protein abundances or gene expression ratios.

7. Clustering

Aim: find groups of samples or variables sharing similarity.

Clustering requires a definition of distance between objects, quantifying a notion of (dis)similarity.

Points are grouped on the basis of minimum distance apart (distance measures).

Once a pair is grouped, its members are combined into a single point (using a linkage method), e.g. take their average. The process is then repeated.

Clustering

Clustering can be applied to rows or columns of a data set (matrix), i.e. to the samples or variables.

A tree can be constructed with branch length proportional to distances between linked clusters, called a Dendrogram.

Clustering is an example of unsupervised learning: no use is made of sample annotations, i.e. treatment groups, diagnosis groups.

UPGMA

Unweighted Pair-Group Method Average
Most commonly used clustering method
Procedure:
1. Each observation forms its own cluster
2. The two with minimum distance are grouped into a single cluster representing a new observation: take their average
3. Repeat 2. until all data points form a single cluster

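Contrived Example

5 genes of interest on 3 replicate arrays/gels

[Table: expression values of p53, mdm2, bcl2, cyclinE and caspase 8 on Array1, Array2 and Array3]

Calculate the distance between each pair of genes:

$$d_{xy} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2}$$

e.g.

$$d(\mathrm{p53}, \mathrm{mdm2}) = \sqrt{(9 - 10)^2 + (3 - 2)^2 + (7 - 9)^2} = \sqrt{6} \approx 2.45$$

(entered as 2.5 in the distance matrix)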
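
The distance formula as a short Python sketch, using the p53 and mdm2 values from the worked example. The function name `euclidean` is illustrative; it works for profiles of any equal length.

```python
import math

def euclidean(x, y):
    """Square root of the summed squared differences between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Expression values on Array1, Array2, Array3 (from the worked example).
p53 = (9, 3, 7)
mdm2 = (10, 2, 9)

d = euclidean(p53, mdm2)
print(round(d, 2))
```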

Example

Construct a distance matrix of all pair-wise distances

             p53    mdm2    bcl2    cyclinE   caspase 8
p53          0      2.5     10.44   4.12      11.75
mdm2                0       12.5    6.4       13.93
bcl2                        0       6.48      1.41
cyclinE                             0         7.35
caspase 8                                     0

Cluster the 2 genes with smallest distance
Take their average & re-calculate distances to other genes

After merging caspase-8 and bcl-2:

                      p53    mdm2   cyclin E   {caspase-8 & bcl-2}
p53                   0      2.5    4.12       10.9
mdm2                         0      6.4        9.1
cyclin E                            0          6.9
{caspase-8 & bcl-2}                            0

After merging p53 and mdm2:

                      {p53 & mdm2}   cyclin E   {caspase-8 & bcl-2}
{p53 & mdm2}          0              3.7        9.2
cyclin E                             0          6.9
{caspase-8 & bcl-2}                             0
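
A minimal sketch of agglomerative clustering with average linkage, run on the pair-wise distance matrix from the example. It assumes the textbook UPGMA rule (the distance between two clusters is the mean of all gene-to-gene distances between them) rather than averaging the expression values of a merged pair, so intermediate distances can differ from those in the tables above. All function names are illustrative.

```python
from itertools import combinations

# Pair-wise distances taken from the distance matrix in the example.
pairwise = {
    ("p53", "mdm2"): 2.5, ("p53", "bcl2"): 10.44, ("p53", "cyclinE"): 4.12,
    ("p53", "caspase8"): 11.75, ("mdm2", "bcl2"): 12.5,
    ("mdm2", "cyclinE"): 6.4, ("mdm2", "caspase8"): 13.93,
    ("bcl2", "cyclinE"): 6.48, ("bcl2", "caspase8"): 1.41,
    ("cyclinE", "caspase8"): 7.35,
}

def gene_distance(a, b):
    """Look up the distance between two genes regardless of their order."""
    return pairwise.get((a, b), pairwise.get((b, a)))

def cluster_distance(c1, c2):
    """Average linkage: mean of all gene-to-gene distances between clusters."""
    dists = [gene_distance(a, b) for a in c1 for b in c2]
    return sum(dists) / len(dists)

def upgma(genes):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [(g,) for g in genes]      # step 1: each gene is its own cluster
    merges = []
    while len(clusters) > 1:              # steps 2-3: merge until one cluster is left
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        merges.append((c1, c2, round(cluster_distance(c1, c2), 2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

merges = upgma(["p53", "mdm2", "bcl2", "cyclinE", "caspase8"])
for left, right, height in merges:
    print(left, "+", right, "at distance", height)
```

The merge order (bcl2 with caspase 8, then p53 with mdm2, then cyclinE joining the p53/mdm2 pair) is the information a dendrogram displays, with the merge distances as branch heights.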

Example (contd.)

...and the final cluster:

[Figure: final clustering of the five genes]

Example of a gene expression dendrogram

[Figure: gene expression dendrogram]

Variety of approaches to clustering

Clustering techniques
  agglomerative: start with every element in its own cluster, and iteratively join clusters together
  divisive: start with one cluster and iteratively divide it into smaller clusters

Distance Metrics
  Euclidean (as-the-crow-flies)
  Manhattan
  Minkowski (a whole class of metrics)
  Correlation (similarity in profiles: called similarity metrics)

Linkage Rules
  average: use the mean distance between cluster members
  single: use the minimum distance (gives loose clusters)
  complete: use the maximum distance (gives tight clusters)
  median: use the median distance
  centroid: use the distance between the average member of each cluster
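
A sketch of the distance metrics listed above, applied to the p53 and mdm2 profiles from the contrived example. Taking the correlation distance as 1 minus the Pearson correlation is one common choice and is an assumption here; the function names are illustrative.

```python
import math

def minkowski(x, y, p):
    """Minkowski distance of order p: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def correlation_distance(x, y):
    """1 - Pearson correlation: small when two profiles rise and fall together."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1 - cov / (spread_x * spread_y)

p53 = (9, 3, 7)
mdm2 = (10, 2, 9)

manhattan = minkowski(p53, mdm2, 1)
euclid = minkowski(p53, mdm2, 2)
corr_dist = correlation_distance(p53, mdm2)

print(f"Manhattan={manhattan:.2f}, Euclidean={euclid:.2f}, "
      f"correlation distance={corr_dist:.3f}")
```

A profile and a rescaled copy of it are far apart in Euclidean terms but have a correlation distance of essentially zero, which is why the two kinds of metric can produce different clusters.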

Clustering Summary

The clusters & tree topology often depend highly on the distance measure and linkage method used.

Recommended to use two distance metrics, such as Euclidean and a correlation metric.

A clustering algorithm will always yield clusters, whether the data are organised in clusters or not!
