Introduction To Biostatistics: Part 1

Lecture 1.
Brief history, basic concepts and descriptive statistics

Biostatistics
Xinhai Li
Biological statistics
Li, Xinhai
Phone: 64807898
Email: lixh@ioz.ac.cn
Homepage: http://people.gucas.ac.cn/~LiXinhai
Blog: http://blog.sciencenet.cn/u/lixinhai
Miniblog: http://weibo.com/lixinhaiblog
How to learn statistics in this class
No preview needed before the class
Focus on listening and thinking (3 hours / week) in class
Don't take notes (it wastes your time)
Intensive review (1-2 hours / week) after the class
Do the homework (1 hour / week) after the class
Textbooks
Sokal, R. R. and F. J. Rohlf. 1995. Biometry: the principles and
practice of statistics in biological research. Third Edition. W. H.
Freeman and Co.: New York. 887 pp.
Zar, J. H. 1999. Biostatistical Analysis. Fourth Edition.
Prentice Hall: New Jersey, 663 pp.
From 1976 (the earliest year indexed) to mid-1997 (the date the search was
performed) the following counts were obtained: Darwin (all publications,
e.g. The Origin of Species) = 7,111. Sokal and Rohlf, Biometry = 31,757.
Overview
Biostatistics or biometry
"biostatistics" and "biometry" are sometimes used interchangeably
"biometry" is more often used of biological or agricultural applications
"biostatistics" is more often used of medical applications
What is statistics?
Statistics is the science of collection, analysis, interpretation, and
presentation of data.
Descriptive statistics are numerical estimates that organize, sum up or
present the data.
Inferential statistics is the process of inferring from a sample to the
population.
Statistical errors in publications
Underwood (1981) found statistical errors in 78% of the papers he
surveyed in marine ecology. Hurlbert (1984) reported that in two
separate surveys 26% and 48% of the ecological papers surveyed
showed the statistical error of pseudoreplication (Krebs 1999).
Charles J. Krebs. 1999. Ecological Methodology, 2nd ed. Addison-Wesley Educational Publishers, Inc.
50% of the medical literature has statistical flaws (Altman et al. 1991).
Serious statistical errors were found in 40% of 164 articles published
in a psychiatry journal (McGuigan 1995) (Ercan et al. 2007).
Ilker Ercan, Berna Yazıcı, Yaning Yang, Guven Özkaya, Sengul Cangur, Bulent Ediz, Ismet Kan.
Misusage of Statistics in Medical Research. Eur J Gen Med 2007; 4(3):128-134.
Contents
Brief history, basic concepts, and descriptive statistics
Probability distribution
Hypothesis testing
Analysis of variance (ANOVA)
Simple linear regression and correlation
Analysis of covariance (ANCOVA)
Nonparametric statistics
Multivariate analysis
Generalized linear model
Common mistakes
Statistical software R http://cran.r-project.org
R is a free software environment for statistical computing and graphics. It
compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.
R was initially written by Ross Ihaka and Robert Gentleman at the
Department of Statistics of the University of Auckland in Auckland, New Zealand,
and became free software in 1995.
Since mid-1997 there has been a core group (the R Core Team) who can
modify the R source code archive.
It is free software distributed under a GNU-style copyleft, and an official part of
the GNU project (GNU S).
It had over 2100 packages in 2010.
Citation
R Development Core Team. 2011. R: A Language and Environment
for Statistical Computing. R Foundation for Statistical Computing.
Vienna, Austria. ISBN: 3-900051-07-0. http://www.R-project.org.
Today's contents
Introduction to biological statistics
History
Data in biology
Descriptive statistics

History
John Graunt (1620-1674, British) and William Petty (1623-1687, British):
developed early human statistical and census methods that later provided a
framework for modern demography based on life table, mean value, census,
longevity, and mortality.
Blaise Pascal (1623-1662, French), Pierre de Fermat (1601-1665,
French) and Jacques Bernoulli (1654-1705, Swiss): probability theory (binomial
coefficients)
Abraham de Moivre (1667-1754, French): combined statistics with
probability theory; approximated the normal distribution through the expansion
of the binomial distribution
Carl Friedrich Gauss (1777-1855, German): least squares, normal distribution
Adolphe Quetelet (1796-1874, Belgian): significance of constancy
of large numbers (rate of criminal events)
Florence Nightingale (1820-1910, British): graphic presentation of statistics
Emergence of statistics in 1800s
Laplace wrote a book describing how to compute the
future positions of planets and comets on the basis of a
few observations from earth.
Napoleon: "I find no mention of God in your treatise, Mr.
Laplace."
Laplace replied: "I had no need for that hypothesis."
The observations of planets and comets from this earthly platform did
not fit the predicted positions exactly. Laplace and his fellow scientists
attributed this to errors in the observations, sometimes due to
perturbations in the earth's atmosphere, other times due to human
error.
By the end of the nineteenth century, the errors had mounted instead
of diminishing. As measurements became more and more precise,
more and more error cropped up.
Gaps between Darwinism and genetics in early 1900s
Mendel's law of segregation
By carrying out the monohybrid crosses,
Mendel determined that the 2 alleles for
each character segregate during gamete
production.
Core Evolution Concepts
Population: Organisms that share a
common gene pool (Species = actually or
potentially interbreeding organisms)
Variation: Modifications of forms are
produced by chance via mutations, genetic
coding errors of individual organisms
Natural Selection: Reproduction &
survival of organisms whose heritable traits
are better suited to existing environmental
conditions
Retention: Persistence within a
population of the selected variation(s) over
successive generations
Neo-Darwinian Modern evolutionary synthesis
in 1930s
Sir Ronald A. Fisher (1890-1962, British) developed
several basic statistical methods in support of his work
The Genetical Theory of Natural Selection
Sewall G. Wright (1889-1988, American) used statistics
in the development of modern population genetics
John B. S. Haldane (1892-1964, British)
reestablished natural selection as the premier
mechanism of evolution by explaining it in terms of the
mathematical consequences of Mendelian genetics in
his book The Causes of Evolution.
Francis Galton
Francis Galton (1822-1911, British) (father of biometry
and eugenics): regression, correlation
African explorer and elected Fellow of the Royal Geographical Society
Creator of the first weather maps and establisher of the meteorological
theory of anticyclones
Coined the term "eugenics" and the phrase "nature versus nurture"
Developed the statistical concepts of correlation and regression to the mean
Discovered that fingerprints were an index of personal identity and
persuaded Scotland Yard to adopt a fingerprinting system
First to utilize the survey as a method for data collection
Produced over 340 papers and books throughout his lifetime
Knighted in 1909
Galton, F. (1869/1892/1962). Hereditary Genius: An Inquiry into its Laws and Consequences. Macmillan/Fontana, London.
Galton, F. (1883/1907/1973). Inquiries into Human Faculty and its Development. AMS Press, New York.
Karl Pearson
Karl Pearson (1857-1936, British): continued in the tradition of Galton
and laid the foundation for much of descriptive statistics.
In 1884, Pearson became Professor of Applied Mathematics and Mechanics
at University College London.
In 1901, Pearson, Weldon and Galton founded Biometrika, a Journal for the
Statistical Study of Biological Problems.
In 1907, Pearson took over a research unit founded by Galton and
reconstituted it as the Francis Galton Laboratory of National Eugenics.
In 1911, Pearson founded the world's first university statistics department at
University College London.
method of moments
chi-square
correlation
Ronald A. Fisher
Sir Ronald Aylmer Fisher (1890-1962), an English statistician, evolutionary
biologist, and geneticist.
He was described by Anders Hald as "a genius who almost single-handedly
created the foundations for modern statistical science" [1] and Richard Dawkins
described him as "the greatest biologist since Darwin" [2]. (from Wikipedia)
In 1933 he became a Professor of Eugenics at University College London
In 1943 he was offered the Balfour Chair of Genetics at Cambridge
University
Analysis of variance
Maximum likelihood
Fisher information
Fisher, R.A. 1925. Statistical Methods for Research Workers
Fisher, R.A. 1935. The Design of Experiments
[1] Hald, Anders (1998). A History of Mathematical Statistics. New York: Wiley.
[2] Dawkins, Richard (1995). River out of Eden.
Society and publications in early years
In 1901, Pearson, Weldon and Galton founded
Biometrika, a Journal for the Statistical Study of
Biological Problems.
By the 1940s, the application of statistics to biological
questions had begun to have a profound impact on the
scientific community.
The Biometrics Section of the American Statistical
Association began to publish the Biometrics Bulletin in 1945.
In 1947, the International Biometric Society (IBS) was
established. Shortly thereafter, the IBS began publishing
Biometrics.
A story of statistics in industry
In 1980, the NBC television network aired a
documentary entitled "If Japan Can, Why Can't We?"
The documentary was really a description of the influence one
man, W. Edwards Deming, had on Japanese industry.
Deming's major point about quality control is that the
output of a production line is variable, because that is the
nature of all human activity. What the customer wants is
not a perfect product but a reliable product.
Deming's quality control
Deming proposed that the production line be seen as a stream of
activities that start with raw material and end with finished product.
Each activity can be measured, so each activity has its own
variability due to environmental causes.
Instead of waiting for the final product to exceed arbitrary limits of
variability, the managers should be looking at the variability of each
of these activities.
The most variable of the activities is the one that should be
addressed. Once that variability is reduced, there will be another
activity that is "most variable," and it should then be addressed.
Thus, quality control becomes a continuous process, where the
most variable aspect of the production line is constantly being
worked on.

Data
A datum is one observation about the variable
being measured.
Data are a collection of observations.
A population consists of all subjects about
whom the study is being conducted.
A sample is a sub-group of the population being
examined.
Parameters vs. Statistics
A parameter is a numerical quantity measuring
some aspect of a population of scores.
For example, the mean is a measure of central tendency
Parameters are usually written with Greek letters
A statistic computed in samples is used to estimate
parameters
Quantity             Parameter   Statistic
Mean                 μ           M
Standard deviation   σ           s
Proportion           π           p
Correlation          ρ           r
Variables
Nominal variable
classification data, e.g., male/female, 0/1, etc.
Ordinal variable
ordered, but differences between values are not important
e.g., Likert scales, rank on a scale of 1..5 (degree of
satisfaction); restaurant ratings
Interval scale variable
ordered, constant scale, but no natural zero
differences make sense, but ratios do not (e.g.,
30 - 20 = 20 - 10, but 20 degrees is not twice as hot as 10 degrees)
e.g., temperature (C, F), dates
Ratio scale variable
ordered, constant scale, natural zero
e.g., height, weight, age, length
Derived variables
Ratio: sex ratio
Index: S&P 500 index (stock market)
Rate: growth rate
Accuracy and precision of data
[Figure: accuracy, precision and inaccuracy]
Accuracy of data
Mean square error
for estimating population mean (μ)
using sample mean (M)
$$\mathrm{MSE}(M) = E[(M - \mu)^2] = \mathrm{Var}(M) + (E[M] - \mu)^2$$
Accuracy: $E[(M - \mu)^2]$
Precision: $\mathrm{Var}(M)$
Bias: $E[M] - \mu$
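The decomposition of the mean square error into variance plus squared bias can be checked numerically. The sketch below uses made-up estimates and an assumed true mean (`mu` and `est` are hypothetical values, not data from the lecture), and treats the listed estimates as the whole distribution of M, so the variance uses divisor n.

```r
# Hypothetical values for illustration only (not from the lecture).
mu  <- 10                              # assumed true population mean
est <- c(9.2, 10.5, 11.1, 9.8, 10.9)   # estimates of mu from repeated samples

mse      <- mean((est - mu)^2)          # accuracy: E[(M - mu)^2]
variance <- mean((est - mean(est))^2)   # precision: Var(M), divisor n
bias     <- mean(est) - mu              # bias: E[M] - mu

# The first value equals the sum of the other two.
c(mse = mse, variance = variance, bias_squared = bias^2)
```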
Summarizing Data
Frequency Distribution
Cumulative Distributions
Relative Frequency Distribution
Percent Frequency Distribution
Bar Graph
Histogram
Pie Chart
Dot Plot
Frequency Distribution for Qualitative data
A frequency distribution is a tabular summary of
data showing the frequency (or number) of items
in each of several nonoverlapping classes.
The objective is to provide insights about the data
that cannot be quickly obtained by looking only at
the original data.
Frequency Distribution
Guests staying at Holiday Inn were asked to
rate the quality of their accommodations as
being Poor, Below Average, Average, Above Average, or Excellent.
Rating          Frequency
Poor            2
Below Average   3
Average         5
Above Average   9
Excellent       1
Total           20
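A frequency table like this one can be produced in R with `table()`. The individual guest ratings are not given on the slide, so this sketch rebuilds 20 ratings from the published counts (an assumption made only to have raw data to tabulate).

```r
levels_in_order <- c("Poor", "Below Average", "Average",
                     "Above Average", "Excellent")

# Rebuild 20 individual ratings from the counts on the slide.
ratings <- rep(levels_in_order, times = c(2, 3, 5, 9, 1))

# A factor with explicit levels keeps the classes in rating order
# instead of alphabetical order.
ratings <- factor(ratings, levels = levels_in_order)

freq <- table(ratings)   # frequency of each nonoverlapping class
freq
sum(freq)                # total number of guests
```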
An example for quantitative data:
Hudson Auto Repair
Sample of Parts Cost for 50 Tune-ups
91 78 93 57 75 52 99 80 97 62
71 69 72 89 66 75 79 75 72 76
104 74 62 68 97 105 77 65 80 109
85 97 88 68 83 68 71 69 67 74
62 82 98 101 79 105 79 69 62 73
Guidelines for selecting number of classes
Use between 5 and 20 classes
Data sets with a larger number of elements usually require
a larger number of classes
Smaller data sets usually require fewer classes
Use classes of equal width
Approximate class width =
(Largest Data Value - Smallest Data Value) / Number of Classes
For Hudson Auto Repair, if we choose six classes:
Approximate Class Width = (109 - 52)/6 = 9.5 ≈ 10
Parts Cost ($)   Frequency
50-59            2
60-69            13
70-79            16
80-89            7
90-99            7
100-109          5
Total            50
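The class-width rule and the resulting frequency distribution can be reproduced from the 50 parts costs. This is a minimal sketch using `cut()` and `table()`; the break points 50, 60, ..., 110 and the class labels follow the table, and `right = FALSE` is chosen here so that each class includes its lower limit.

```r
cost <- c(91, 78, 93, 57, 75, 52, 99, 80, 97, 62,
          71, 69, 72, 89, 66, 75, 79, 75, 72, 76,
          104, 74, 62, 68, 97, 105, 77, 65, 80, 109,
          85, 97, 88, 68, 83, 68, 71, 69, 67, 74,
          62, 82, 98, 101, 79, 105, 79, 69, 62, 73)

n_classes <- 6
width <- (max(cost) - min(cost)) / n_classes   # (109 - 52) / 6
width                                          # rounded up to 10 in the lecture

# Classes [50, 60), [60, 70), ..., [100, 110)
classes <- cut(cost, breaks = seq(50, 110, by = 10), right = FALSE,
               labels = c("50-59", "60-69", "70-79",
                          "80-89", "90-99", "100-109"))
freq <- table(classes)
freq
```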
Relative Frequency Distribution
The relative frequency of a class is the fraction or
proportion of the total number of data items
belonging to the class.
A relative frequency distribution is a tabular
summary of a set of data showing the relative
frequency for each class.
Percent Frequency Distribution
The percent frequency of a class is the relative
frequency multiplied by 100.
A percent frequency distribution is a tabular
summary of a set of data showing the percent
frequency for each class.
Relative Frequency and Percent Frequency Distributions
Holiday Inn Quality Ratings
Rating          Relative Frequency   Percent Frequency
Poor            .10                  10
Below Average   .15                  15
Average         .25                  25
Above Average   .45                  45
Excellent       .05                  5
Total           1.00                 100
(e.g., Excellent: 1/20 = .05)
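Relative and percent frequencies follow directly from the counts. A short sketch, starting from the Holiday Inn frequencies (the variable names are illustrative):

```r
freq <- c("Poor" = 2, "Below Average" = 3, "Average" = 5,
          "Above Average" = 9, "Excellent" = 1)

rel <- freq / sum(freq)   # relative frequency: fraction of all items
pct <- 100 * rel          # percent frequency: relative frequency times 100

data.frame(Frequency = freq, Relative = rel, Percent = pct)
```

`prop.table(table(x))` gives the same relative frequencies when starting from raw observations `x` rather than counts.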
Hudson Auto Repair
Parts Cost ($)   Relative Frequency   Percent Frequency
50-59            .04                  4
60-69            .26                  26
70-79            .32                  32
80-89            .14                  14
90-99            .14                  14
100-109          .10                  10
Total            1.00                 100
(e.g., 50-59: 2/50 = .04)
Insights gained from the percent frequency distribution
Only 4% of the parts costs are in the $50-59 class.
30% of the parts costs are under $70.
The greatest percentage (32% or almost one-third)
of the parts costs are in the $70-79 class.
10% of the parts costs are $100 or more.
Our class
students <- read.csv('D:/ioz/statistics/2012/students.csv', header=TRUE)
students$ID <- as.character(students$ID)  # keep the long IDs as text
head(students)
# order ID              name visits email
# 33    201028007610020      2093   163.com
# 3     201028016215017      222    163.com
# 111   201028016215018      99     163.com
# 4     201028016215019      130    163.com
# 25    201128000206033      130    163.com
# 56    201128000206061      282    mails.gucas.ac.cn
nrow(students)  # 115
family.name <- substr(students$name, 1, 1)  # first character of each name
length(unique(family.name))  # 62
f.name <- table(family.name)[table(family.name) > 1]  # names shared by more than one student
f.name <- as.table(f.name)
barplot(f.name, ylab='Number')
[Figure: bar plot of family-name counts; y-axis: Number]
email <- table(students$email)[table(students$email) > 2]
class(email) # array
email <- as.table(email)
barplot(email, ylab='Number')
[Figure: bar plot of email-domain counts (126.com, 163.com, gmail.com, mails.gucas.ac.cn, qq.com, sina.com); y-axis: Number]
hist(students$visits, freq=TRUE, nclass=15, xlab='Times')
[Figure: Histogram of students$visits; x-axis: Times (0 to 2000); y-axis: Frequency]
Bar Graph
R function: barplot()
[Figure: example bar plot; y-axis: Number]
A bar graph is a graphical device for depicting qualitative data.
Specify the labels that are used for each of the classes on one axis
(usually the horizontal axis).
A frequency, relative frequency, or percent frequency scale can be used
for the other axis (usually the vertical axis).
Use a bar of fixed width drawn above each class label.
The bars are separated to emphasize the fact that each class is a
separate category.
Histogram
Another common graphical presentation of quantitative data is a
histogram.
The variable of interest is placed on the horizontal axis.
A rectangle is drawn above each class interval with its height
corresponding to the interval's frequency, relative frequency, or percent
frequency.
Unlike a bar graph, a histogram has no natural separation between
rectangles of adjacent classes.
R code
hist(rnorm(100), nclass=6)
Pie Chart
R code
x <- sample(1:100, 6, replace=TRUE)
names(x) <- c('A','B','C','D','E','F')
pie(x)
The pie chart is a commonly used graphical device for presenting relative
frequency distributions for qualitative data.
First draw a circle; then use the relative frequencies to subdivide the
circle into sectors that correspond to the relative frequency for each class.
Since there are 360 degrees in a circle, a class with a relative frequency
of .25 would consume .25(360) = 90 degrees of the circle.
Dot Plot
One of the simplest graphical summaries of data is a
dot plot.
A horizontal axis shows the range of data values.
Then each data value is represented by a dot placed
above the axis.
[Figure: dot plot of Tune-up Parts Cost; x-axis: Cost ($), 50 to 110]
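A dot plot of the tune-up parts costs can be drawn in base R with `stripchart()`. This is one possible way to draw it, not necessarily how the slide's figure was made; `method = "stack"` piles repeated values on top of each other.

```r
cost <- c(91, 78, 93, 57, 75, 52, 99, 80, 97, 62,
          71, 69, 72, 89, 66, 75, 79, 75, 72, 76,
          104, 74, 62, 68, 97, 105, 77, 65, 80, 109,
          85, 97, 88, 68, 83, 68, 71, 69, 67, 74,
          62, 82, 98, 101, 79, 105, 79, 69, 62, 73)

# One dot per observation; equal values are stacked vertically.
stripchart(cost, method = "stack", pch = 16, offset = 0.5,
           xlab = "Cost ($)", main = "Tune-up Parts Cost")

# Height of the tallest stack (the most frequent single value).
tallest <- max(table(cost))
tallest
```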
Cumulative Distributions
Cumulative frequency distribution: shows the
number of items with values less than or equal to
the upper limit of each class.
Cumulative relative/percent frequency distribution
R code
x <- seq(-5, 5, by=0.1)
plot(x, pnorm(x, mean=0, sd=1), type='l')  # pass x so the horizontal axis shows the values
Hudson Auto Repair
Cost ($)   Cumulative Frequency   Cumulative Relative Frequency   Cumulative Percent Frequency
≤ 59       2                      .04                             4
≤ 69       15                     .30                             30
≤ 79       31                     .62                             62
≤ 89       38                     .76                             76
≤ 99       45                     .90                             90
≤ 109      50                     1.00                            100
(e.g., ≤ 69: 2 + 13 = 15; 15/50 = .30)
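The cumulative columns are running totals of the class frequencies, which `cumsum()` computes directly. A small sketch starting from the Hudson Auto Repair class counts:

```r
freq <- c("50-59" = 2, "60-69" = 13, "70-79" = 16,
          "80-89" = 7, "90-99" = 7, "100-109" = 5)

cum_freq <- cumsum(freq)            # items at or below each class's upper limit
cum_rel  <- cum_freq / sum(freq)    # cumulative relative frequency
cum_pct  <- 100 * cum_rel           # cumulative percent frequency

data.frame(CumFreq = cum_freq, CumRel = cum_rel, CumPct = cum_pct)
```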
Stem-and-Leaf Display
A stem-and-leaf display shows both the rank order and shape of the
distribution of the data.
It is similar to a histogram on its side, but it has the advantage of
showing the actual data values.
The first digits of each data item are arranged to the left of a vertical
line.
To the right of the vertical line we record the last digit for each item in
rank order.
Each line in the display is referred to as a stem.
Each digit on a stem is a leaf.
Example: Leaf Unit = 0.1
If we have data with values such as
8.6 11.7 9.4 9.1 10.2 11.0 8.8
a stem-and-leaf display of these data will be
Leaf Unit = 0.1
 8 | 6 8
 9 | 1 4
10 | 2
11 | 0 7
Example: Leaf Unit = 10
If we have data with values such as
1806 1717 1974 1791 1682 1910 1838
a stem-and-leaf display of these data will be
Leaf Unit = 10
16 | 8
17 | 1 9
18 | 0 3
19 | 1 7
The 82 in 1682 is rounded down to 80 and is represented as an 8.
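Both examples can be reproduced with a few lines of R. The helper below, `stem_leaf()`, is a hypothetical function written for this illustration (it is not part of R); it expresses each value in leaf units, truncates, and splits off the last digit as the leaf. Base R also provides `stem()`, whose layout may differ from the slides.

```r
# Hypothetical helper: build stems and leaves for a given leaf unit.
stem_leaf <- function(x, leaf_unit) {
  # round() guards against floating-point noise, floor() rounds down
  units  <- floor(round(x / leaf_unit, 6))
  stems  <- units %/% 10   # all digits except the last
  leaves <- units %% 10    # the last digit
  tapply(leaves, stems, function(l) paste(sort(l), collapse = " "))
}

a <- stem_leaf(c(8.6, 11.7, 9.4, 9.1, 10.2, 11.0, 8.8), leaf_unit = 0.1)
b <- stem_leaf(c(1806, 1717, 1974, 1791, 1682, 1910, 1838), leaf_unit = 10)
a
b

# Built-in display for comparison
stem(c(8.6, 11.7, 9.4, 9.1, 10.2, 11.0, 8.8))
```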
Probability density function (PDF)
A probability density function (pdf) is a function that represents a
probability distribution in terms of integrals.
Formally, a probability distribution has density f(x), such that the
probability of the interval [a, b] is given by
$$\int_a^b f(x)\,dx$$
Intuitively, if a probability distribution has density f(x), then the infinitesimal
interval [x, x + dx] has probability f(x) dx.
x <- seq(-5, 5, by=0.1)
plot(x, dnorm(x, mean=0, sd=1), type='l')
The total area under the graph is 1:
$$\int_{-\infty}^{\infty} f(x)\,dx = 1$$
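Both statements about the density can be checked numerically for the standard normal distribution with `integrate()`. The interval [-1.96, 1.96] is an arbitrary choice for illustration.

```r
# Total area under the standard normal density
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value

# Probability of the interval [a, b] as the integral of f(x) from a to b
a <- -1.96
b <- 1.96
p_interval <- integrate(dnorm, lower = a, upper = b)$value

# The same probability from the cumulative distribution function
p_check <- pnorm(b) - pnorm(a)

c(total_area = total_area, p_interval = p_interval, p_check = p_check)
```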
Are the scores generally high or generally low?
Where the center of the distribution tends to be
located
Three measures of central tendency
Mode
Median
Mean
Mode
The most frequently occurring score
Report mode when using nominal scale, the
most frequently occurring category
Based on the simple frequency of each score
If you have a rectangular distribution, do not
report the mode
Unimodal, bimodal, multimodal, antimode
Example of Mode
Measurements x: 3, 5, 5, 1, 7, 2, 6, 7, 0, 4
In this case the data have two modes: 5 and 7
Both measurements are repeated twice
Example of Mode
Measurements x: 3, 5, 1, 1, 4, 7, 3, 8, 3
Mode: 3
Notice that it is possible for a data set not to have any mode.
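R has no built-in function for the statistical mode (`mode()` reports the storage type of an object), so a small helper is needed. The function `modes()` below is a hypothetical helper for illustration; it returns every value tied for the highest frequency and an empty vector when each value occurs only once. The second vector is our reading of the second example's measurements.

```r
# Hypothetical helper: all values that occur most often.
modes <- function(x) {
  counts <- table(x)                           # simple frequency of each score
  if (all(counts == 1)) return(numeric(0))     # every value occurs once: no mode
  as.numeric(names(counts)[counts == max(counts)])
}

modes(c(3, 5, 5, 1, 7, 2, 6, 7, 0, 4))   # bimodal example
modes(c(3, 5, 1, 1, 4, 7, 3, 8, 3))      # single mode
modes(c(1, 2, 3, 4))                     # no mode
```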
Median
Score at the 50th percentile
For a normal distribution the median is the same
as the mode
Arrange scores from lowest to highest; if there is an odd
number of scores the median is the one in the
middle, if there is an even number of scores then average
the two scores in the middle
Used with an ordinal scale and when the
distribution is skewed
Example of Median
Measurements x   Ranked x
3                0
5                1
5                2
1                3
7                4
2                5
6                5
7                6
0                7
4                7
Sum: 40          Sum: 40
Median: (4 + 5)/2 = 4.5
Notice that only the two
central values are used
in the computation.
The median is not
sensitive to extreme
values
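The same computation in R, with a quick check of the claim about extreme values. The outlier value 700 is made up for this illustration.

```r
x <- c(3, 5, 5, 1, 7, 2, 6, 7, 0, 4)

sort(x)        # ranked scores
median(x)      # even number of scores: average of the two central values

# Replace one of the largest scores with an extreme value (hypothetical).
x_outlier <- replace(x, which.max(x), 700)

median(x_outlier)   # unchanged by the extreme value
mean(x)             # compare with the mean ...
mean(x_outlier)     # ... which is pulled towards the extreme value
```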
Mean
Score at the exact mathematical center of the
distribution (average)
Used with interval and ratio scales, and when the
distribution is symmetrical and unimodal
Not accurate when the distribution is skewed
because it is pulled towards the tail
$$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Uses of the Mean
Describes scores
Deviations from the mean give us the error of our
estimate of the score, with total error equal to
zero
Predict scores
Describe a score's location
Describe the population mean (μ), which is a
parameter
Deviations around the Mean
The score minus the mean
Include plus or minus sign
Sum of deviations from the mean always
equals zero: $\sum (X - M) = 0$
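A short check of the mean and of the zero-sum property of deviations, reusing the ten measurements from the mode and median examples:

```r
x <- c(3, 5, 5, 1, 7, 2, 6, 7, 0, 4)

m <- sum(x) / length(x)   # the mean; same result as mean(x)
deviations <- x - m       # each score minus the mean, sign included

m
deviations
sum(deviations)           # positive and negative deviations cancel
```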
Range
Report the maximum difference between the
lowest and highest scores
Semi-interquartile range used with the median:
one half the distance between the scores at the
25th and 75th percentile
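Both measures are easy to compute in R. Note that several definitions of sample percentiles exist; `quantile()` uses its default (type 7) here, so other software or hand methods may give slightly different quartiles.

```r
x <- c(3, 5, 5, 1, 7, 2, 6, 7, 0, 4)

value_range <- max(x) - min(x)            # highest minus lowest score

q <- quantile(x, probs = c(0.25, 0.75))   # 25th and 75th percentiles
semi_iqr <- unname((q[2] - q[1]) / 2)     # half the interquartile distance

value_range
q
semi_iqr
```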
Measures of Variability
Extent to which the scores differ from each other
or how spread out the scores are
Tells us how accurately the measure of central
tendency describes the distribution
Shape of the distribution
Why do we care about variability?
Where would you rather vacation: LA Bungalows,
where the mean temperature is 24 degrees, or
Sahara Condos, where the mean temperature is
also 24 degrees?
LA temperature range:
day = 26
night = 22
Sahara temperature range:
day = 40
night = 8
Variance
Uses the deviation from the mean
Remember, the sum of the deviations always
equals zero, so you have to square each of the
deviations
$S_X^2$ = sum of squared deviations divided by the
number of scores
Provides information about the relative variability
Some Limits
It isn't the average deviation
Interpretation doesn't make sense because:
Number is too large
And it is a squared value
The standard deviation (SD)
Take the square root of the variance: $S_X$
Uses the same units of measurement as the raw
scores
How much scores deviate below and above the
mean
Standard deviation ~ the mean of
deviations from the mean (sort of)
$\sigma$ (lowercase sigma) is the population standard deviation.
$S$ is the sample standard deviation.
$\hat{s}$ (s-hat) is the sample estimate of $\sigma$.
The deviation (definitional) formula for
the population standard deviation
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}$$
The larger the standard deviation the more
variability there is in the scores
The standard deviation is somewhat less
sensitive to extreme outliers than the range
(as N increases)
The deviation (definitional) formula for
the sample standard deviation
$$S = \sqrt{\frac{\sum (X_i - \bar{X})^2}{N}}$$
What's the difference between this formula and
the population standard deviation?
In the first case, all the Xs represent the entire
population. In the second case, the Xs
represent a sample.
Standard Deviation: Example
X      $X - \bar{X}$   $(X - \bar{X})^2$
21     -5.8            33.64
25     -1.8            3.24
24     -2.8            7.84
30     3.2             10.24
34     7.2             51.84
Mean of X: 26.8; sum of deviations: 0; mean of squared deviations: 21.36
$$S = \sqrt{\frac{106.8}{5}} = \sqrt{21.36} = 4.62$$
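The worked example can be reproduced step by step in R. Note that R's built-in `sd()` divides by N - 1 rather than N, so it returns a larger value than the descriptive formula used here.

```r
x <- c(21, 25, 24, 30, 34)
n <- length(x)

deviations <- x - mean(x)    # X minus the mean (26.8)
ss <- sum(deviations^2)      # sum of squared deviations
s_n <- sqrt(ss / n)          # definitional formula, divisor N

ss
round(s_n, 2)
round(sd(x), 2)              # built-in sd() uses N - 1, so it differs
```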
Calculating S using the
raw-score formula
$$S = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N}}$$
To calculate $\sum X^2$ you square all the scores first and
then sum them
To calculate $(\sum X)^2$ you sum all the scores first and
then square them
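The distinction between the two sums, and the agreement of the raw-score and definitional formulas, can be verified with the five scores from the standard deviation example:

```r
x <- c(21, 25, 24, 30, 34)
n <- length(x)

sum_of_squares <- sum(x^2)   # square each score first, then sum
square_of_sum  <- sum(x)^2   # sum the scores first, then square

s_raw <- sqrt((sum_of_squares - square_of_sum / n) / n)   # raw-score formula
s_def <- sqrt(sum((x - mean(x))^2) / n)                   # definitional formula

c(sum_of_squares = sum_of_squares, square_of_sum = square_of_sum)
c(s_raw = s_raw, s_def = s_def)
```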

Population and sample variance and standard deviation
When we have data from the entire population we use $\mu$ (not $\bar{X}$) to compute $\sigma_X$ using the same formula
The variance and standard deviation of the sample are biased estimates of the population values

Estimating the population standard deviation from a sample
S, the sample standard deviation, is usually a little smaller than the population standard deviation. Why?
The sample mean minimizes the sum of squared deviations (SS). Therefore, if the sample mean differs at all from the population mean, then the SS from the sample will be an underestimate of the SS from the population
Therefore, statisticians alter the formula of the sample standard deviation by subtracting 1 from N
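
The underestimate can be seen exactly, without simulation, by listing every possible sample from a tiny population. The sketch below assumes a hypothetical population of five values (not from the slides) and samples of size 2 drawn with replacement. It checks the variance rather than the standard deviation, because the N - 1 correction is exact for the variance.

```r
# Hypothetical population (an assumption for illustration only)
pop    <- c(1, 2, 3, 4, 5)
sigma2 <- mean((pop - mean(pop))^2)   # population variance: 2

# Every ordered sample of size 2 drawn with replacement (25 samples)
n       <- 2
samples <- expand.grid(first = pop, second = pop)

# Sum of squared deviations around each sample's own mean
ss <- apply(samples, 1, function(s) sum((s - mean(s))^2))

mean_var_n   <- mean(ss / n)         # dividing by N averages 1: too small
mean_var_nm1 <- mean(ss / (n - 1))   # dividing by N - 1 averages 2
```

Averaged over all 25 samples, the N divisor gives half the population variance, while the N - 1 divisor recovers it.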

Formulas for s-hat (estimated)
Definitional formula:
$$\hat{s} = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}}$$
Raw-score formula:
$$\hat{s} = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}}$$
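
Both s-hat formulas use the N - 1 divisor, which is also what R's built-in `sd()` uses. A short sketch on the five scores from the standard deviation example (the names `s_hat_def` and `s_hat_raw` are illustrative):

```r
x <- c(21, 25, 24, 30, 34)
N <- length(x)

s_hat_def <- sqrt(sum((x - mean(x))^2) / (N - 1))        # definitional
s_hat_raw <- sqrt((sum(x^2) - sum(x)^2 / N) / (N - 1))   # raw-score

c(s_hat_def, s_hat_raw, sd(x))   # all three agree, about 5.17
```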

The estimated variance
The standard deviation squared
$$\hat{s}^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$$
$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$
The variance is not a very useful descriptive statistic, but it is a very important value you will use in other techniques (e.g., the analysis of variance)

For a normal distribution
The sample mean is a good estimate of the population mean
The estimate of the population variance and standard deviation tells us how spread out the scores are
About 68% of the scores are within +1 and -1 $S_X$ of the mean
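
The 68% figure follows from the normal cumulative distribution function. A minimal sketch using R's `pnorm()`; the two-standard-deviation line is added here only for comparison and is not on the slide.

```r
# Proportion of a normal distribution within 1 SD of the mean
within_1_sd <- pnorm(1) - pnorm(-1)   # about 0.6827

# For comparison: within 2 SDs of the mean
within_2_sd <- pnorm(2) - pnorm(-2)   # about 0.9545
```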

Standard error
The standard error of a sample of size n is the sample's standard deviation divided by $\sqrt{n}$. It therefore estimates the standard deviation of the sample mean, that is, how far the sample mean is likely to fall from the population mean.
$$SE_{\bar{x}} = \frac{s}{\sqrt{n}}$$
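
R has no built-in standard error function, so it is usually computed directly from the formula. A sketch on the five scores from the standard deviation example, assuming `s` is the N - 1 estimate returned by `sd()`:

```r
x <- c(21, 25, 24, 30, 34)

# Standard error of the mean: s / sqrt(n)
se <- sd(x) / sqrt(length(x))
round(se, 2)   # 2.31
```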

Coefficient of variation
In probability theory and statistics, the coefficient of variation (CV) is a normalized measure of dispersion of a probability distribution. It is defined as the ratio of the standard deviation $\sigma$ to the mean $\mu$, expressed here as a percentage:
$$CV = \frac{\sigma}{\mu} \times 100$$
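
Because the CV is a ratio, it does not depend on the measurement unit. The sketch below defines a small helper `cv()` (a hypothetical name, not a base R function) and assumes the sample estimate `sd()` stands in for $\sigma$.

```r
# Coefficient of variation as a percentage (illustrative helper)
cv <- function(x) 100 * sd(x) / mean(x)

x <- c(21, 25, 24, 30, 34)
cv_original <- cv(x)        # about 19.28
cv_rescaled <- cv(x * 10)   # same value after changing units
```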

Skewness
Symmetrical distribution
Left tail is the mirror image of the right tail
Examples: heights and weights of people
[Figure: histogram of a symmetric distribution; vertical axis: relative frequency]

Skewness
Asymmetrical distribution
Moderately skewed left
A longer tail to the left
Example: exam scores
[Figure: histogram of a left-skewed distribution; vertical axis: relative frequency]

Skewness
Asymmetrical distribution
Examples: income, populations of countries
[Figure: frequency against value]
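
Skewness
A measure of skewness based on the 3rd moment about the mean:
$$\text{skewness} = \frac{\sum_{i=1}^{N}(X_i - \bar{X})^3}{(N - 1)\,s^3}$$
A closely related form uses the divisor n for the third moment:
$$\text{skewness} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left[\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2\right]^{3/2}}$$
Approximations based on the mode and the median:
$$\text{skewness} \approx \frac{\text{mean} - \text{mode}}{s} \approx \frac{3(\text{mean} - \text{median})}{s}$$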
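
The skewness measures can be compared on a small right-skewed sample. The data below are hypothetical (not from the slides), and the variable names are illustrative. Base R has no mode function for data, so the most frequent value is taken from `table()`.

```r
# Hypothetical right-skewed sample (an assumption for illustration)
x <- c(1, 2, 2, 3, 3, 3, 4, 5, 9, 18)
n <- length(x)
s <- sd(x)

# Third-moment measures: divisor N - 1 and divisor n
skew_nm1 <- sum((x - mean(x))^3) / ((n - 1) * s^3)
skew_n   <- mean((x - mean(x))^3) / s^3

# Approximations from the mode and the median
mode_x         <- as.numeric(names(which.max(table(x))))   # most frequent value: 3
pearson_mode   <- (mean(x) - mode_x) / s
pearson_median <- 3 * (mean(x) - median(x)) / s
```

All four values are positive for this sample, which matches its long right tail. The two third-moment versions differ only by the factor (n - 1) / n.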

Skewness
[Figure: frequency against value]

Skewed Right - Positive Skewness
[Figure: histogram, "Number of Music CDs of Spring 1998 Stat 250 Students"; horizontal axis: Number of Music CDs (0 to 400); vertical axis: Frequency]

Kurtosis
Measures of Kurtosis
Kurtosis is a measure of the flatness or peakedness of a distribution
Normal Kurtosis - Mesokurtic
Flat Kurtosis - Platykurtic
Peaked Kurtosis - Leptokurtic
A measure of kurtosis based on the 4th moment about the mean
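
Kurtosis
$$\text{kurtosis} = \frac{\sum_{i=1}^{N}(X_i - \bar{X})^4}{(N - 1)\,s^4}$$
This value is about 3 for a sample from a normal distribution. Subtracting 3 gives the excess kurtosis, which is compared with 0:
Less than 0 = Platykurtic
More than 0 = Leptokurtic
Equal to 0 = Mesokurtic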
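
A small sketch of the fourth-moment formula in R. The helper name `kurtosis_raw()` and both samples are hypothetical, chosen so the results can be checked by hand.

```r
# Fourth-moment kurtosis with the N - 1 divisor (illustrative helper)
kurtosis_raw <- function(x) {
  n <- length(x)
  sum((x - mean(x))^4) / ((n - 1) * sd(x)^4)
}

flat   <- 1:10                      # evenly spread values
peaked <- c(-10, rep(0, 8), 10)     # mass at the center, two extreme values

excess_flat   <- kurtosis_raw(flat) - 3     # negative: platykurtic
excess_peaked <- kurtosis_raw(peaked) - 3   # 1.5: leptokurtic
```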

Kurtosis
[Figure: frequency against value for three curves: k > 3, k = 3 and k < 3, where k is the kurtosis before subtracting 3]

Describing data

          Statistic (mean based)                        Statistic (non-mean based)
Center    Mean                                          Mode, median
Spread    Variance, SD (standard deviation), SE, CV     Range, interquartile range
Skew      Skewness                                      --
Peaked    Kurtosis                                      --

R code
x <- rnorm(100)   # 100 random draws from a standard normal distribution
mean(x)
sd(x)
var(x)
min(x)
max(x)
median(x)
range(x)
quantile(x)
summary(x)
# Moment-based skewness and excess kurtosis (divisor n)
skewness <- sum((x - mean(x))^3 / sqrt(var(x))^3) / length(x); skewness
kurtosis <- sum((x - mean(x))^4 / var(x)^2) / length(x) - 3; kurtosis

SAS Example
/****************************************************************/
/* SAS SAMPLE LIBRARY */
/* */
/* NAME: UNIVAR */
/* TITLE: Simple Descriptive Statistics using PROC UNIVARIATE */
/* PRODUCT: SAS */
/* SYSTEM: ALL */
/* KEYS: DESCRIPTIVE STATISTICS, */
/* PROCS: UNIVARIATE */
/* DATA: */
/* */
/* REF: */
/* MISC: */
/* DESC: INPUT A SMALL DATA SET USING THE CARDS STATEMENT. */
/* RUN UNIVARIATE USING THE FREQ, PLOT AND NORMAL */
/* PROC OPTIONS. ANALYZE THE VARIABLE POP AND */
/* RETAIN THE VARIABLE STATE USING THE ID STATEMENT. */
/* NO OTHER OPTIONS ARE USED. */
/* */
/****************************************************************/
OPTIONS LS=75 NODATE;
DATA STATEPOP;
INPUT STATE $ POP @@;
LABEL POP ='1970 CENSUS POPULATION IN MILLIONS';
CARDS;
ALA 3.44 ALASKA 0.30 ARIZ 1.77 ARK 1.92 CALIF 19.95
COLO 2.21 CONN 3.03 DEL 0.55 FLA 6.79 GA 4.59
HAW 0.77 IDAHO 0.71 ILL 11.01 IND 5.19 IOWA 2.83
KAN 2.25 KY 3.22 LA 3.64 ME 0.99 MD 3.92
MASS 5.69 MICH 8.88 MINN 3.81 MISS 2.22 MO 4.68
MONT 0.69 NEB 1.48 NEV 0.49 NH 0.74 NJ 7.17
NM 1.02 NY 18.24 NC 5.08 ND 0.62 OHIO 10.65
OKLA 2.56 ORE 2.09 PA 11.79 RI 0.95 SC 2.59
SD 0.67 TENN 3.92 TEXAS 11.2 UTAH 1.06 VT 0.44
VA 4.65 WASH 3.41 W.VA 1.74 WIS 4.42 WYO 0.33
;
PROC UNIVARIATE FREQ PLOT NORMAL;
VAR POP; ID STATE;
run;

SAS results
The SAS System
The UNIVARIATE Procedure
Variable: POP (1970 CENSUS POPULATION IN MILLIONS)

Moments
N                        50            Sum Weights              50
Mean                     4.0472        Sum Observations         202.36
Std Deviation            4.32931867    Variance                 18.7430002
Skewness                 2.05521839    Kurtosis                 4.54561679
Uncorrected SS           1737.3984     Corrected SS             918.407008
Coeff Variation          106.970712    Std Error Mean           0.61225812
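
Several entries in the Moments table can be derived from just N, the sum of observations, and the uncorrected sum of squares. The R sketch below recomputes them from the printed values using the formulas from this lecture; the variable names are illustrative.

```r
# Values printed in the SAS Moments table
n              <- 50
total          <- 202.36      # Sum Observations
uncorrected_ss <- 1737.3984   # sum of squared observations

mean_pop     <- total / n                           # Mean
corrected_ss <- uncorrected_ss - total^2 / n        # Corrected SS
variance     <- corrected_ss / (n - 1)              # Variance (N - 1 divisor)
std_dev      <- sqrt(variance)                      # Std Deviation
std_error    <- std_dev / sqrt(n)                   # Std Error Mean
coeff_var    <- 100 * std_dev / mean_pop            # Coeff Variation
```

The corrected SS is the raw-score numerator, and dividing it by N - 1 shows that SAS reports the estimated (s-hat) variance by default.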

SAS
Basic statistics

Basic Statistical Measures
      Location                      Variability
Mean      4.047200      Std Deviation             4.32932
Median    2.710000      Variance                 18.74300
Mode      3.920000      Range                    19.65000
                        Interquartile Range       3.69000

SAS
Quantiles

Quantile        Estimate
100% Max        19.950
99%             19.950
95%             11.790
90%             10.830
75% Q3           4.680
50% Median       2.710
25% Q1           0.990
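
The same quantiles can be obtained in R from the state populations in the SAS example. Note that `quantile()` defaults to `type = 7`, which interpolates and gives different quartiles. The sketch below assumes that `type = 2` (averaging at discontinuities) is the method that corresponds to the SAS output shown here; with these data it reproduces the printed values.

```r
# 1970 state populations in millions, in the order of the SAS data lines
pop <- c(3.44, 0.30, 1.77, 1.92, 19.95,  2.21, 3.03, 0.55, 6.79, 4.59,
         0.77, 0.71, 11.01, 5.19, 2.83,  2.25, 3.22, 3.64, 0.99, 3.92,
         5.69, 8.88, 3.81, 2.22, 4.68,   0.69, 1.48, 0.49, 0.74, 7.17,
         1.02, 18.24, 5.08, 0.62, 10.65, 2.56, 2.09, 11.79, 0.95, 2.59,
         0.67, 3.92, 11.2, 1.06, 0.44,   4.65, 3.41, 1.74, 4.42, 0.33)

probs <- c(0.25, 0.50, 0.75, 0.90, 0.95, 0.99, 1)
q     <- quantile(pop, probs, type = 2)   # 0.99 2.71 4.68 10.83 11.79 19.95 19.95

iqr_pop   <- IQR(pop, type = 2)    # interquartile range: 3.69
range_pop <- diff(range(pop))      # range: 19.65
```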

Assignment
Be familiar with the following terms:
Probability density function (PDF)
Deviation
Variance
Standard deviation
Standard error
Range
Mode
Quantile
Coefficient of variation
Download and install R on your laptop
Plot histograms using
hist(rnorm(100), nclass=6)
