biostatistics
Gary M Gaddis, MD, PhD
Monica L Gaddis, PhD
Kansas City, Missouri

Descriptive statistics include measures of central tendency and variability. Measures of central tendency include mean, median, and mode. The mean is the arithmetic average of data from interval or ratio scales. The median reflects the 50th percentile score. The mode is the most frequently occurring value of a data distribution. Measures of variability include range, interquartile range, standard deviation, and standard error of the mean. The range describes the spread between the extreme values of data. Interquartile range is data included between the 25th and 75th percentile of a distribution. Standard deviation describes variability of data about the sample mean, while standard error of the mean helps describe the distribution of several sample means about a true population mean. Finally, confidence intervals, which are derived from the standard error of the mean, define an interval likely to include a true population value, based on sample statistical values and probability characteristics of data distributions. [Gaddis GM, Gaddis ML: Introduction to biostatistics: Part 2, descriptive statistics. Ann Emerg Med March 1990;19:309-315.]

From the Departments of Emergency Health Services and Surgery, Truman Medical Center, University of Missouri, Kansas City.
Received for publication September 1, 1989. Accepted for publication December 4, 1989.
Address for reprints: Monica L Gaddis, PhD, Department of Surgery, Truman Medical Center, 2301 Holmes, Kansas City, Missouri 64108.
INTRODUCTION
Statistical analysis is the process by which numerical data obtained by
scientific inquiry are transformed into a useable form for scientific inter-
pretation. This involves manipulation of data for describing characteristics
studied (descriptive statistics) and transformation of the data to help infer
conclusions from the data (inferential statistics).
This second of a six-part series on biostatistics focuses on descriptive
statistics. A thorough understanding of this topic is needed before advanc-
ing to discussions about inferential statistics. Familiarity with the con-
cepts regarding types of data and data distributions, as presented in part 1,1
is required for understanding the concepts presented herein. Numerical ex-
amples are provided to facilitate understanding. Finally, there exist many
common, yet inappropriate uses of statistics, which will be discussed in
this article.
[Figure residue: legend fragment "... denotes male subjects, F denotes fe..." and x-axis ticks 100 to 140.]
Median
The median is the "mid-most" value of a data distribution. It is the value above which or below which half of the data points lie.2,4,5 Alternatively, the median is the 50th percentile value of a distribution.
The median is unaffected by outliers and may be more useful than the mean to describe data when outliers exist2 or when continuous data are not normally distributed.4 The median is useful for describing ordinal data4 because the magnitude of difference between points of a data scale need not be consistent to determine the 50th percentile value.2 The median is not useful to describe nominal data2 because of the arbitrary selection of numbers used to generate this scale.

[Figure 2. Negatively skewed distribution of systolic blood pressure; x-axis 180 to 250 mm Hg. Mean = 228.7, Median = 230, Mode = 240.]
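The median's resistance to outliers can be checked with a short sketch. The blood pressure values and the helper name `summarize` below are hypothetical and are not taken from the article; the point is only that replacing one value with an extreme one shifts the mean but leaves the median where it was.

```python
from statistics import mean, median

def summarize(values):
    """Return (mean, median) of a list of numbers."""
    return mean(values), median(values)

# Hypothetical systolic blood pressures (mm Hg); assumed values, not study data.
baseline = [110, 115, 120, 120, 125, 130, 135]
# Same sample with the largest value replaced by an outlier.
with_outlier = baseline[:-1] + [210]

base_mean, base_median = summarize(baseline)
out_mean, out_median = summarize(with_outlier)

print(f"baseline:     mean={base_mean:.1f}, median={base_median}")
print(f"with outlier: mean={out_mean:.1f}, median={out_median}")
```

In this assumed sample the median stays at 120 in both cases, while the mean rises by more than 10 mm Hg.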
Mode
The mode is the most commonly obtained value or values on a data scale, or the highest point of a peak on a frequency distribution.2 The mode is most useful when two clusters of data exist (bimodal distribution), such that a group mean is misleading or meaningless.2 The mode is useful to describe nominal data, defining the most prevalent characteristic of a sample.

[Figure 3. Bimodal distribution of systolic blood pressure, with individual subjects marked F or M; x-axis 70 to 140 mm Hg. Mean = 106.1, Median = 105, Mode = 95, 120.]
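A brief sketch of both uses of the mode follows. It assumes Python 3.8 or later for `statistics.multimode`, and both data sets are hypothetical illustrations rather than data from the article.

```python
from statistics import mean, multimode

# Hypothetical bimodal systolic blood pressures (mm Hg); assumed values.
bimodal = [90, 95, 95, 95, 100, 105, 115, 120, 120, 120, 125]
modes = multimode(bimodal)      # every value tied for the highest count
group_mean = mean(bimodal)      # falls between the two clusters

# Hypothetical nominal data: the mode names the most prevalent category.
blood_types = ["A", "O", "O", "B", "O", "A"]
most_common_type = multimode(blood_types)

print(modes, round(group_mean, 1), most_common_type)
```

Here the two modes (95 and 120) identify the clusters, whereas the group mean of about 107 describes a value that few subjects in either cluster have.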
Numerical Examples
Three different distributions of data will be examined to determine how the type of data distribution obtained affects the previously defined measures of central tendency.
Figure 1 represents normally distributed data for systolic blood pressure of 30 men aged 31 to 40 years. For normally distributed data, the values of the mean, median, and mode are identical.
Figure 2 presents theoretical systolic blood pressure data of patients with untreated renovascular hypertension. The distribution is negatively skewed. In the absence of normality, the mean, median, and mode are not equal. Also, an outlier, such as a systolic blood pressure value of 150 mm Hg instead of 180 mm Hg, will alter the value of the mean, but not the median or mode.
Figure 3 presents systolic blood pressure for a sample that includes two groups, pregnant women in their second trimester and men. Again, the mean and median are unequal in this non-normally distributed data. Also, there exist two peaks of data cluster,
142/310 Annals of Emergency Medicine 19:3 March 1990
[Figure 4. Normal distribution, with the 15.9th, 50th, and 84.1th percentiles marked.]
TABLE 2. Estimates of variability of systolic blood pressure data of men aged 31 to 40 years

           Systolic
Subject    Blood Pressure    (X - Xi)    (X - Xi)^2
 1         135               15          225
 2         115                5           25
 3         110               10          100
 4         130               10          100
 5         125                5           25
 6         125                5           25
 7         105               15          225
 8         120                0            0
 9         120                0            0
10         120                0            0
11         125                5           25
12         110               10          100
13         115                5           25
14         115                5           25
15         135               15          225
16         100               20          400
17         120                0            0
18         125                5           25
19         120                0            0
20         130               10          100
21         140               20          400
22         120                0            0
23         115                5           25
24         110               10          100
25         130               10          100
26         105               15          225
27         120                0            0
28         115                5           25
29         125                5           25
30         120                0            0

Mean = $\sum X/n$ = 3,600/30 = 120
Median = 120
Mode = 120
Variance = $\sum (X - X_i)^2/(n-1)$ = 2,550/29 = 87.931
SD = $\sqrt{87.931}$ = 9.377
SEM = $9.377/\sqrt{30}$ = 1.712
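The summary values of Table 2 can be reproduced with a few lines of code. The sketch below uses Python's standard `statistics` module, which is a choice made for this illustration and not a tool used by the authors. The variable names are illustrative, and the SEM is computed as SD divided by the square root of n.

```python
import math
from statistics import mean, median, mode, variance, stdev

# Systolic blood pressures (mm Hg) of the 30 subjects listed in Table 2.
pressures = [135, 115, 110, 130, 125, 125, 105, 120, 120, 120,
             125, 110, 115, 115, 135, 100, 120, 125, 120, 130,
             140, 120, 115, 110, 130, 105, 120, 115, 125, 120]

n = len(pressures)
sample_mean = mean(pressures)
sample_median = median(pressures)
sample_mode = mode(pressures)
sample_variance = variance(pressures)   # sum of squared deviations / (n - 1)
sample_sd = stdev(pressures)            # square root of the sample variance
sem = sample_sd / math.sqrt(n)          # standard error of the mean

print(f"mean={sample_mean}, median={sample_median}, mode={sample_mode}")
print(f"variance={sample_variance:.3f}, SD={sample_sd:.3f}, SEM={sem:.3f}")
```

Note that `variance` and `stdev` divide by n - 1, matching the formula in the table; `pvariance` and `pstdev` would divide by n instead.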
without using statistical techniques improperly.

Standard Deviation
The standard deviation (SD) is one of the most commonly encountered estimates of data variability and is integral to performance of inferential statistical techniques.2 It provides an estimate of the degree of scatter of individual sample data points about the sample mean.
The usefulness of the SD lies in its properties as related to the Gaussian, or normal, distribution. The SD itself can be used to define an extreme score, such as the value that is exceeded by 5% or 95% of all scores from a sample of a population.2 Figure 4 shows that 68.26% of data points of a normally distributed population fall within plus or minus one SD of the mean, and 95.44% of points fall within plus or minus two SD of the mean.5
The SD is calculated as the square root of another term called the variance. Because individual data points will fall both above and below the mean, the effect of direction of difference will cause some deviations from the mean to be positive and some to be negative. To overcome this effect, deviations are squared to obtain a positive number. Individual squared deviations from the mean are then averaged to calculate the estimate of variability known as the variance. Numerically

$$\text{Variance} = \sum (X - X_i)^2/(n-1)$$

where X equals the mean, Xi equals each individual data point, and n equals the total number of data points.5-7 The variance represents the deviation from the mean, expressed as the square of the units used. For instance, in Figure 1, the variance of systolic blood pressure is expressed as mm Hg$^2$. However, these squared units are not meaningful. Therefore, the square root of the variance is then calculated, to bring the variability estimate back to the correct scale. This is the value known as the SD:

$$\text{SD} = \sqrt{\text{Variance}}$$

The SD is meaningful only when applied to data that are normally or nearly normally distributed.2,8,9 It is applicable to interval or ratio scale data.
The SD is useful in application to statistical inference techniques. The calculation of SD from the normally distributed systolic blood pressure data of Figure 1 is shown (Table 2).

Standard Error of the Mean
The standard error of the mean (SEM) is a statistic derived from the SD, and is simply calculated as

$$\text{SEM} = \text{SD}/\sqrt{n}$$

It is obvious from the calculation that, for any sample of more than one data point, the SEM is always smaller than the SD, and the greater the n, the smaller the SEM will be.
The SEM is an abstract concept. Imagine repeating an experiment numerous times. With each experiment, a different sample group would be drawn from the study population. Because each repetition of the experiment contains unique sample members, different mean values will be generated with each study. The collection of these mean values, as generated from repetitive sampling and experimentation, will reflect "scatter" about the true but unknown population mean. The SEM is simply a quantification of the variability of

[Figure 5. SD, SEM, and 95% CI shown about the mean of the systolic blood pressure data of Figure 1; x-axis 100 to 140 mm Hg. SD = 9.37, SEM = 1.71.]

TABLE 3. [Summary of the proper use of estimates of variability]

                                                    Interquartile
Characteristic                             Range    Range            SD     SEM
Useful to describe interval or ratio data  Yes      Yes              Yes    Yes
Used to describe ordinal data              Yes      Yes              No     No
Descriptive of sample variability          Yes      Yes              Yes    No
Assists in statistical inference           No       No               Yes    Yes
Used to calculate confidence intervals     No       No               No*    Yes

*$\text{SEM} = \text{SD}/\sqrt{n}$, thus SD is involved indirectly in calculating a confidence interval.
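The repeated-experiment idea behind the SEM can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the population is taken to be normal with mean 120 and SD 9.377 (values borrowed from Table 2 for convenience), and the function name `simulate_sem`, the number of repetitions, and the random seed are all hypothetical choices. It draws many samples of size n, records each sample mean, and compares the SD of those means with SD divided by the square root of n.

```python
import math
import random
from statistics import mean, stdev

def simulate_sem(mu=120.0, sigma=9.377, n=30, repetitions=2000, seed=0):
    """Return (observed SD of sample means, sigma / sqrt(n)).

    mu and sigma describe an assumed normal population; each repetition
    stands for one experiment that samples n subjects from it.
    """
    rng = random.Random(seed)
    sample_means = []
    for _ in range(repetitions):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        sample_means.append(mean(sample))
    return stdev(sample_means), sigma / math.sqrt(n)

observed, expected = simulate_sem()
print(f"SD of sample means: {observed:.3f}; SD/sqrt(n): {expected:.3f}")
```

With enough repetitions the two numbers should agree closely, and rerunning with a larger n should shrink both.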
Standard Deviation Versus Standard Error of the Mean
Both SD and SEM are measures of variability. However, the two statistics are different and are frequently confused or misused.12 The SD defines variability of sample data points about a sample mean. The SD is always greater than the SEM. The SEM is most commonly calculated to help derive confidence intervals.
Various authors have commented about the intellectual sleight of hand of incorrectly using SEM when only SD is appropriate to describe sample data variability.6,12,13 Bunce et al13 reviewed 608 articles in six journals in which mean ± SD or SEM were reported. In 50%, SEM values were reported when only the SD would have been appropriate. The authors concluded that "many workers may choose to report the SEM because it is simply smaller than the SD."13 The inappropriate use of SEM to describe sample data variability may be presented by authors in an attempt to imply that a significant difference exists between groups, when in fact no difference exists. Elenbaas et al12 were more blunt, concluding that authors who present data as mean ± SEM instead of mean ± SD may be trying to actively impair the reader's ability to accurately identify the variability in the study data.
Whether by error or by design, it is incorrect to underrepresent the variability of sample data as mean ± SEM. We suggest that readers multiply the SEM by $\sqrt{n}$ to obtain the SD when SEM is erroneously used to express sample variability. It is not an error to use SEM in speculating a range, or confidence interval, within which a true population mean is likely to fall. The SD and SEM of the data shown in Figure 1 are given (Figure 5). Table 3 summarizes the proper use of estimates of variability.

Confidence Intervals
When statistics derived from the sampling of a population are studied to infer values for population parameters, it would be useful to have confidence that the sample statistical value, such as a mean or SD, would be representative of the true population parameter. One cannot be certain that a sample statistical value is representative of the true population parameter, but one can calculate a range of values likely to be representative of the population parameter.4,14 That range of values is called a confidence interval (CI). Calculation of a CI is a method of estimating the range of values likely to include the true value of a population parameter. Since one cannot study all members of a population, a representative sample of the population is studied, and from this one uses the mean and SEM to work backward to estimate a CI.
The width of the CI depends on the SEM and the degree of confidence we arbitrarily choose. For instance, a 95% CI, which is the degree of confidence most commonly expressed,14 is a range of values broad enough that, if the entire population could be studied, 95% of the time the population mean would fall within the CI estimated from the sample of the population.15 Also, the closer a point lies to the middle of the CI, the more likely it is representative of the population.16
Though by convention the 95% CI is most commonly reported, the 95% level is not rigidly required. Wider CIs, such as a 99% or 99.9% CI, are even more likely to include the true population parameter value and are commonly used for critical appraisal of data. They are also advocated for examinations of data during ongoing accumulation of subjects in a clinical trial.15 Narrower CIs, such as the 90% CI, can be used when study authors find it acceptable that ten times out of 100, the true population parameter may not lie within the CI. However, the width of a CI depends not only on the variability of the data and the level of confidence selected, but also on the sample size.
When one broadens a CI by moving from a 95% to a 99% CI, accuracy is increased because the calculated CI becomes more likely to include a true population parameter. However, when the level of CI is held constant and sample size is increased, SEM is decreased and thus the CI is narrowed. This narrowing of the CI increases the precision of the CI. The effect of level of confidence selected and sample size on the width of a CI is shown (Table 4).
Calculation of the CI for estimation of true population mean values applies to continuous data from normal or near-normal distributions.4 Also, a CI can be estimated for such other statistics as medians, regression slopes, relative risk data, re-