Basic Concepts
Normal Distribution & Standard Deviation
Skewed Distribution
Descriptive Statistics (mean, median,mode, 95% confidence interval for a mean, standard deviation, standard error, ra
Epidemiology/Biostatistics Tools
+/- 2 SD =95%
Variability in the data can be quantified from the variance, which basically calculates the average dis
the mean (the x with the "bar" over it).
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
Standard deviation is just the square root of the variance, and it is convenient because the mean +
and the mean + 2 SD captures 95% of the observations.
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
Note the functions that are used to calculate variance and SD in Excel. Then compare the standard de
these affect the shape of the frequency distributions. Data
Using SD versus SEM: A standard deviation from a sample is an estimate of the population SD, e.g. t
weight in the population. The SEM is a measure of the precision of our estimate of the population’s me
will increase as the sample size increases, i.e. the SEM will be narrower with larger samples.
If the purpose is to describe a group of patients, for example, to see if they are typical in their variability
Tables 2 & 3 in Gottlieb et al.: N. Engl. J. Med. 1981; 305:1425-31). However, if the purpose is to estima
prevalence of disease, one should use SEM or a confidence interval.
Wayne W. LaMorte, MD, PhD, MPH
Copyright 2006
Data set 2:
rmal distribution fairly closely, BMI
cal around a mean or average value. The
#DIV/0! 23
atively little variability to short & wide for 24
e two datasets that show values of body 24
25
ach dataset by following these steps: 1) 25
nd choose "Sort"; 3) i indicate that it is to 25
"Ok." 3) With the data sorted I can easily 25
each of value in the range. Put these 25
he 2 column block of data in the "Counts 26
he miniature, multicolored vertical bar 26
equency as two separate entities, you 26
26
they are related, and then convert the
26
mation about mean, standard deviation, 26
26
Counts for Bar Chart (data set 2)

BMI   frequency
22    0
23    1
24    2
25    5
26    7
27    14
28    6
29    5
30    3
31    1
32    0
33    0

[Bar chart: frequency (0 to 16) by BMI (22 to 33), annotated with +/- 1 SD = 68% and +/- 2 SD = 95%]

Mean                 27.05
Variance             3.02
Standard Deviation   1.74
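As a cross-check on the variance and standard deviation formulas, the following minimal Python sketch rebuilds the individual BMI observations from the "Counts for Bar Chart" table and recomputes the summary values. It assumes the frequency table is a complete record of data set 2; the variable names are illustrative and not part of the workbook.

```python
import statistics

# BMI value -> frequency, copied from the "Counts for Bar Chart" table.
bmi_counts = {22: 0, 23: 1, 24: 2, 25: 5, 26: 7, 27: 14,
              28: 6, 29: 5, 30: 3, 31: 1, 32: 0, 33: 0}

# Expand the frequency table back into one entry per subject.
bmi_values = [bmi for bmi, count in bmi_counts.items() for _ in range(count)]

mean_bmi = statistics.mean(bmi_values)
variance_bmi = statistics.variance(bmi_values)  # sample variance: divides by n - 1
sd_bmi = statistics.stdev(bmi_values)           # square root of the sample variance

# Fraction of observations that fall within 1 SD and 2 SD of the mean.
n = len(bmi_values)
within_1sd = sum(abs(x - mean_bmi) <= sd_bmi for x in bmi_values) / n
within_2sd = sum(abs(x - mean_bmi) <= 2 * sd_bmi for x in bmi_values) / n

print(round(mean_bmi, 2), round(variance_bmi, 2), round(sd_bmi, 2))
print(round(within_1sd, 2), round(within_2sd, 2))
```

`statistics.variance` and `statistics.stdev` use the n - 1 denominator, which corresponds to Excel's sample functions (VAR and STDEV) rather than the population versions.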
Examining the frequency distribution of a data set is an important first step in analysis. It gives an overall picture of the data, an
distribution determines the appropriate statistical analysis. Many statistical tests rely on the assumption that the data are norm
this isn't always the case. Below in the green cells is a data set with hospital length of stay (days) for two sets of patients who
surgery. One data set was collected before instituting a new clinical pathway and one set was collected after instituting it.
Question: Was LOS different after instituting the pathway?

We can rapidly get a feel for what is going on here by creating a frequency histogram. The first step is to sort each of the data sets. Begin by selecting the "before" values of LOS. Then, from the top toolbar, click on "Data", "Sort" (if you get a warning about adjacent data, just indicate you want to continue with the current selection). Also, indicate that there is no "header" row and that you want to sort in ascending order. Repeat this procedure for the other data set.

LOS (days)
Before   After
3        3
12       1
2        1
1        5
11       1
4        6
2        1
2        5
3        2
1        3
8        3
2        1
3        5
6        2
1        2
13       2
3        3
8        3
10       7
6        3
4        4
12       1
9        3
7        3
1        2
3        2
3        2
2        4

Mean
5.07     2.86

[Frequency histogram of LOS with "before" and "after" series]
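The sort-and-count procedure described for the LOS data can also be sketched outside the spreadsheet. The Python below is a minimal illustration, not the worksheet's own method: it tallies how often each length of stay occurs and prints a crude text histogram. The helper name `frequency_table` is hypothetical.

```python
from collections import Counter
import statistics

# Length of stay (days), copied from the "Before" and "After" columns.
before = [3, 12, 2, 1, 11, 4, 2, 2, 3, 1, 8, 2, 3, 6,
          1, 13, 3, 8, 10, 6, 4, 12, 9, 7, 1, 3, 3, 2]
after = [3, 1, 1, 5, 1, 6, 1, 5, 2, 3, 3, 1, 5, 2,
         2, 2, 3, 3, 7, 3, 4, 1, 3, 3, 2, 2, 2, 4]

def frequency_table(values):
    """Count of subjects at each whole-day LOS from 1 to the maximum observed."""
    counts = Counter(values)
    return {day: counts.get(day, 0) for day in range(1, max(values) + 1)}

for label, values in (("before", before), ("after", after)):
    print(label, "mean:", round(statistics.mean(values), 2),
          "median:", statistics.median(values))
    for day, count in frequency_table(values).items():
        print(f"{day:>3} {'#' * count}")
```

Printing the two histograms side by side makes the long right tail of the "before" group visible, which is the kind of skew the mean alone does not reveal.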
N                      12       Median    19
Mean                   17.83    Mode      17
STD                    4.80     Minimum   7
Std Error              1.39     Maximum   23
T-critical             2.12
T-critical*std err     2.94
CONFIDENCE             2.72     (using the Excel 'CONFIDENCE' function)
1.96*Std Error         2.72     (gives the same thing as 1.96 x std err)

Note: This worksheet is currently under development.

Example Data:
14
17
22
18
22
17
12
7
20
21
21
23
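The summary values for the example data can be reproduced with a short Python sketch. It assumes, as the worksheet's "1.96 x std err" line suggests, a normal-approximation 95% interval; the variable names are illustrative. It does not try to reproduce the sheet's T-critical value of 2.12: standard t tables give about 2.20 for 11 degrees of freedom, so that cell may refer to a different sample size.

```python
import math
import statistics

# The twelve observations listed under "Example Data".
data = [14, 17, 22, 18, 22, 17, 12, 7, 20, 21, 21, 23]

n = len(data)
mean = statistics.mean(data)
sd = statistics.stdev(data)          # sample SD (n - 1 denominator)
sem = sd / math.sqrt(n)              # standard error of the mean

# Normal-approximation 95% half-width, i.e. roughly 1.96 x SEM.
z = statistics.NormalDist().inv_cdf(0.975)
half_width = z * sem

median = statistics.median(data)
mode = statistics.mode(data)         # Python 3.8+: first of several tied modes

print(n, round(mean, 2), round(sd, 2), round(sem, 2))
print(median, mode, min(data), max(data))
print("95% CI:", round(mean - half_width, 2), "to", round(mean + half_width, 2))
```

The SD describes the spread of the individual observations, while the SEM (and the interval built from it) describes how precisely the mean itself is estimated.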
T-tests calculate a "t" statistic that takes into account the difference between the means, the variability in the data, and the number of observations in each group. Based on the "t" statistic and the degrees of freedom (total observations in the two groups minus 2) one can look up the probability of observing a difference this great or greater if the null hypothesis were true.

From a practical point of view Excel provides built-in functions that make t-tests easy. Click on cell C44 to see the function used for a t-test with equal variance. One specifies:
• the cells where the first group's data is found,
• the cells where the second group's data is found,
• then whether it is a 2-tailed test or a 1-tailed test, and
• finally a "2" to indicate a test for equal variance.

If the variance is unequal, there is a modified calculation that one can get by specifying "3" as the last parameter in the function (compare the formulae in cells C44 & C45). As a rule of thumb, if one standard deviation is more than twice the other, you should use the unequal variance test.

Note also that the two groups do not have to have the same number of subjects.

Finally, note that in this case we are estimating the means in each [...] N are different; consequently, it is appropriate to calculate SEM, which is the SD divided by the square root of N.

Group 1   Group 2
4.5       4.2
5.0       7.2
5.3       8.0
5.3       3.5
6.0       6.3
6.0       5.1
7.6       4.6
7.7       4.8
6.4       2.0
7.2       5.0
7.0       5.4
5.6
8.4
8.3
9.5

15        11        N
6.7       5.1       Mean
2.10      2.77      Variance
1.45      1.66      SD
0.37      0.50      SEM (standard error of the mean)
0.02                Two-tailed p-value by t-test for equal variance
0.02                Two-tailed p-value by t-test for unequal variance

The t-test is a "parametric" test, because it relies on the legitimate use of the means and standard deviations, which are the parameters that define normally distributed continuous variables. If the groups you want to compare are clearly skewed (i.e. do not conform to a Normal distribution), you have two options:
1) Sometimes you can "transform" the data, e.g. by taking the log of each observation; if the log-transformed values are normally distributed, you can then do a t-test on the transformed data; this is legitimate.

failed    ok
56        19
37        25
57        38
39
35
40
66
19

43.6      27.3      Mean
227.4     94.3      Variance
15.1      9.7       SD

[Histogram: "Freq. failed" and "Freq OK" (0 to 4.5) by bins 10-20, 21-30, 31-40, 41-50]
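To make the Group 1 versus Group 2 comparison concrete, here is a minimal Python sketch of the two calculations that Excel's "2" (equal variance) and "3" (unequal variance) options correspond to. It computes only the t statistic and degrees of freedom; turning those into a p-value needs a t distribution, which the standard library lacks, so that step is left out. The function names are illustrative, and the unequal-variance version is the approximation commonly known as Welch's test.

```python
import math
import statistics

group1 = [4.5, 5.0, 5.3, 5.3, 6.0, 6.0, 7.6, 7.7, 6.4, 7.2, 7.0, 5.6, 8.4, 8.3, 9.5]
group2 = [4.2, 7.2, 8.0, 3.5, 6.3, 5.1, 4.6, 4.8, 2.0, 5.0, 5.4]

def pooled_t(a, b):
    """Equal-variance t statistic and degrees of freedom (n1 + n2 - 2)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / df
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se, df

def welch_t(a, b):
    """Unequal-variance t statistic with Welch-Satterthwaite degrees of freedom."""
    va = statistics.variance(a) / len(a)   # squared SEM of group a
    vb = statistics.variance(b) / len(b)   # squared SEM of group b
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb), df

t_equal, df_equal = pooled_t(group1, group2)
t_unequal, df_unequal = welch_t(group1, group2)
print(round(t_equal, 2), df_equal)
print(round(t_unequal, 2), round(df_unequal, 1))
```

Both statistics are about 2.5 on roughly 20 to 24 degrees of freedom, which is consistent with the two-tailed p-values of 0.02 shown in the worksheet.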
ANOVA

[...] click on "Tools" (above) and then on "Add-Ins" and select "Analysis Tool[...] you will see a new selection ("Data Analysis") at the bottom of the [...] other procedures.

The columns of data to the left are serum creatinine levels among 4 groups of subjects. A one-factor analysis of variance can be performed to determine whether there are significant differences in the means of these groups.

Select the block of data (including column labels) from B2:E27. Then, from the upper menu, select "Tools", then "Data Analysis", then "Single Factor Analysis of Variance". Check the box for labels, and specify the Output Range as G12. The result is shown in the box below.

The p-value (0.0764) indicates differences in means that do not meet the usual (p < 0.05) criterion for statistical significance.

Controls   Aortoiliac   Fem-AK Pop   Fem-Distal
0.7        1.1          1.5          1.2
1.2        1.3          1.1          0.8
1.1        0.9          0.8          0.7
0.7        0.7          0.9          0.7
1.0        0.8          1.1          8.4
0.5        1.4          0.9          1.8
1.6        0.5          7.0          0.8
0.8        1.1          1.4          1.0
0.6        2.0          0.8          0.7
0.6        0.8          1.1          2.8
0.6        0.7          0.6          1.5
1.3        1.4          1.2          0.6
0.5        1.1          0.6          1.3
1.0        1.5          1.2          0.5
1.0        1.0          0.6          1.2
0.8        0.9          0.8          8.2
0.8        0.9          0.8          0.4
0.6        0.6          1.3          0.6
0.5        0.9          1.3          1.6
0.9        0.9          1.5          0.5
0.7        1.2          1.5          11.4
0.7        1.2          0.4          0.8
0.7        1.3          12.9         0.7
0.7        0.4          1.1          0.6
1.1        0.7          8.6          0.9
Means:
0.828      1.012        2.040        1.988

Anova: Single Factor

SUMMARY
Groups        Count
Controls      25
Aortoiliac    25
Fem-AK Pop    25
Fem-Distal    25

ANOVA
Source of Variation   SS         df   MS         F            P-value        F crit
Between Groups        30.3779    3    10.12597   2.35794221   0.0764786914   2.699393
Within Groups         412.2632   96   4.294408
Total                 442.6411   99
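The sums of squares in the ANOVA box can be recomputed directly from the four creatinine columns. The sketch below is a plain-Python illustration of the single-factor calculation, not the Analysis ToolPak's code; `one_way_anova` is a hypothetical helper, and the p-value is omitted because it needs the F distribution.

```python
import statistics

# Serum creatinine values, one list per column of the worksheet.
groups = {
    "Controls":   [0.7, 1.2, 1.1, 0.7, 1.0, 0.5, 1.6, 0.8, 0.6, 0.6, 0.6, 1.3, 0.5,
                   1.0, 1.0, 0.8, 0.8, 0.6, 0.5, 0.9, 0.7, 0.7, 0.7, 0.7, 1.1],
    "Aortoiliac": [1.1, 1.3, 0.9, 0.7, 0.8, 1.4, 0.5, 1.1, 2.0, 0.8, 0.7, 1.4, 1.1,
                   1.5, 1.0, 0.9, 0.9, 0.6, 0.9, 0.9, 1.2, 1.2, 1.3, 0.4, 0.7],
    "Fem-AK Pop": [1.5, 1.1, 0.8, 0.9, 1.1, 0.9, 7.0, 1.4, 0.8, 1.1, 0.6, 1.2, 0.6,
                   1.2, 0.6, 0.8, 0.8, 1.3, 1.3, 1.5, 1.5, 0.4, 12.9, 1.1, 8.6],
    "Fem-Distal": [1.2, 0.8, 0.7, 0.7, 8.4, 1.8, 0.8, 1.0, 0.7, 2.8, 1.5, 0.6, 1.3,
                   0.5, 1.2, 8.2, 0.4, 0.6, 1.6, 0.5, 11.4, 0.8, 0.7, 0.6, 0.9],
}

def one_way_anova(samples):
    """Return (SS between, SS within, df between, df within, F)."""
    all_values = [x for s in samples for x in s]
    grand_mean = statistics.mean(all_values)
    means = [statistics.mean(s) for s in samples]
    ss_between = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    ss_within = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    df_between = len(samples) - 1
    df_within = len(all_values) - len(samples)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_stat

ss_b, ss_w, df_b, df_w, f_stat = one_way_anova(list(groups.values()))
print(round(ss_b, 4), round(ss_w, 4), df_b, df_w, round(f_stat, 3))
```

Note how a handful of very large values (7.0 to 12.9) in the last two columns inflate the within-group sum of squares, which is one reason the F ratio stays modest despite the difference in means.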
Case-Control Studies
Enter data into the blue cells to calculate a p-value with the chi squared test.
d Under H0
5
46
51
d cell is <5.
ad/T= 0 ad/T= 0
bc/T= 0 bc/T= 0
ad/T= 0 ad/T= 0
bc/T= 0 bc/T= 0
Incidence exposed=      0.0090
Incidence nonexposed=   0.0050
Relative Risk=          1.81
Risk Difference=        0.004032
Chi Sq=                 35.954
p-value=                0.000000

95% Confidence Interval for the Relative Risk (test-based)
Lower 95% CI=           1.49
Upper 95% CI=           2.20
xpected Under H0
No Disease
707.76 812
1906.24 2187
2614 2999
(Precision-based)
xpected Under H0
No Disease
-
-
ce (2-6 Substrata)
Mantel-Haenszel RR= 1.03
Mantel-Haenszel chi square= 0.937257
Stratum 3 Stratum 4
DiseasedNot Diseased Not
87 0 0
237 0 0
324 0 0 0 0 0 0
a(c+d) 0 a(c+d) 0
c(a+B) 0 c(a+B) 0
a(c+d) 0 a(c+d) 0
c(a+B) 0 c(a+B) 0
Enter data into the blue cells to calculate a p-value with the chi squared test.

Observed Data                              Expected Under H0
               + Outcome   - Outcome                      + Outcome   - Outcome
Exposed                                    Exposed
Non-exposed                                Non-exposed

Chi Sq=
p-value=

(Until data are entered, the worksheet shows totals of 0 and #DIV/0! in the calculated cells.)
The chi squared test can also be applied to situations with multiple groups and outcomes. For example, the number of runners who finished a marathon in less than 4 hours among those who trained not at all, a little, moderately, or a lot.
The Excel function CHITEST will calculate the p-value automatically, if you specify the range of actual (observed) frequencies and the range of expected observations. For example,

Observed Data
              Finished   Didn't finish   Total
Not at all    2          5               7
A little      8          30              38
Moderately    20         15              35
A lot         25         12              37
Total         55         62              117

Expected Under H0
              Finished   Didn't finish   Total
Not at all    3.29       3.71            7
A little      17.86      20.14           38
Moderately    16.45      18.55           35
A lot         17.39      19.61           37
Total         55         62              117

p-value= 0.000280
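The expected counts and p-value for the marathon example can be reproduced with a small Python sketch. Expected counts are row total x column total / grand total. For the p-value the sketch uses a closed-form upper-tail probability that holds only for 3 degrees of freedom (the case here); it is not a general replacement for CHITEST, and the names are illustrative.

```python
import math

# Observed counts: (finished, didn't finish) for each training level.
observed = {
    "Not at all": (2, 5),
    "A little":   (8, 30),
    "Moderately": (20, 15),
    "A lot":      (25, 12),
}

row_totals = {level: sum(cells) for level, cells in observed.items()}
col_totals = [sum(cells[j] for cells in observed.values()) for j in range(2)]
grand_total = sum(col_totals)

# Expected count under H0 = row total x column total / grand total.
expected = {level: tuple(row_totals[level] * c / grand_total for c in col_totals)
            for level in observed}

chi_square = sum((o - e) ** 2 / e
                 for level in observed
                 for o, e in zip(observed[level], expected[level]))
df = (len(observed) - 1) * (len(col_totals) - 1)

def chi2_upper_tail_3df(x):
    """P(chi square with exactly 3 df > x); closed form valid only for 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

p_value = chi2_upper_tail_3df(chi_square)
print(round(chi_square, 2), df, round(p_value, 6))
```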
The Chi Squared Test is based on the difference between the frequency distribution that was observed and the frequency distribution that would have been expected under the null hypothesis. In the example above, only 8 of 210 subjects had the outcome of interest (3.8095%). Under the null hypothesis, we would expect 3.8095% of the exposed group to have the outcome, and we would expect 3.8095% of the non-exposed group to have the outcome as well. The 2x2 table to the right calculates the frequencies expected under the null hypothesis for each cell. The chi squared statistic is computed for each cell and summed:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Fisher's Exact Test should be used if the number of expected observations in any cell is <5.
se(lnOR) 0.96073
Confidence Intervals for a Proportion

                                                Interval                          Interval
Numerator   Denominator "N"   Estimated         +/-     Lower     Upper           +/-     Lower     Upper
                              proportion                Limit     Limit                   Limit     Limit
1           79                0.01265823        0.021   0.00      0.03            0.025   0.00      0.04
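The two half-widths in the proportion table (0.021 and 0.025) match what a normal-approximation interval gives at the 90% and 95% levels. The sketch below shows that calculation; the choice of method and the confidence levels are inferred from the arithmetic, not stated in the extract, and `proportion_ci` is a hypothetical helper.

```python
import math
from statistics import NormalDist

def proportion_ci(numerator, denominator, confidence=0.95):
    """Normal-approximation interval for a proportion, clipped to [0, 1]."""
    p = numerator / denominator
    se = math.sqrt(p * (1 - p) / denominator)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * se
    return p, half_width, max(0.0, p - half_width), min(1.0, p + half_width)

for level in (0.90, 0.95):
    p, half_width, lower, upper = proportion_ci(1, 79, level)
    print(level, round(p, 8), round(half_width, 3), round(lower, 2), round(upper, 2))
```

With only one event in 79, the lower limit is truncated at zero, a sign that the normal approximation is strained for proportions this small.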
Random Number Generator

Number of groups=    4
Enter a seed #       7
random #             0.696398
Assign to Group:     3

Intermediate worksheet cells: 69; x number of groups = 276; /100 = 2.76; trunc = 3

This program uses a random number generator to assign subjects randomly to a group. You need to specify how many groups you want in the first blue cell. You then need to “spark” the random number generator by entering some number (ANY number) in the 2nd blue cell. Enter a number and click outside the cell; this will generate a random number and specify to which group the subject should be assigned, based on how many groups you specified.
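The assignment logic described for the random number generator can be sketched in Python as follows. This is an analogy rather than a port of the worksheet: Python's generator will not produce the same numbers as Excel for a given seed, and `assign_group` is a hypothetical helper. It scales a uniform random number by the number of groups and truncates, which appears to be what the intermediate worksheet cells do.

```python
import random

def assign_group(seed, n_groups):
    """Draw one uniform random number and map it to a group from 1 to n_groups."""
    rng = random.Random(seed)        # seeding makes the draw reproducible
    draw = rng.random()              # uniform on [0, 1)
    return draw, int(draw * n_groups) + 1

draw, group = assign_group(seed=7, n_groups=4)
print(round(draw, 6), group)

# The worksheet's displayed draw of 0.696398 maps to group 3 under the same rule.
print(int(0.696398 * 4) + 1)
```

For a real trial one would normally draw the whole allocation sequence from a single seeded generator rather than reseeding for each subject.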
T-Tests
Unpaired T-test

Consider the values of body mass index for the two groups to the left; group 1 was untreated & group 2 represents values in a group that was treated with a regimen of diet and exercise for 4 months. There is variability from person to person. Values range from 22-34, and there is considerable overlap between the two groups.

Group 1   Group 2
BMI       BMI
25        23
25        26
27        24
34        32

[Chart: BMI (y-axis, 20 to 40) for the two groups]

[...] substantial person-to-person [...]. However, what we are really [...] response to treatment. [...] just about all subjects reduced their BMI somewhat, and if you [...] person-to-person differences, it looks like the treatment regimen had an [...] the null hypothesis is that the means are the same, but in a paired [...] hypothesis is that the mean difference between the pairs is zero.
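To illustrate the paired null hypothesis (mean difference between pairs is zero), the sketch below computes a paired t statistic from the four BMI rows that survive in this extract. Treating each row as one subject's before and after value is an assumption made only for illustration, and four pairs are far too few for a meaningful test.

```python
import math
import statistics

# The four BMI rows visible in the extract, treated here as before/after pairs.
group1 = [25, 25, 27, 34]
group2 = [23, 26, 24, 32]

# A paired test works on the within-pair differences, not on the two group means.
differences = [a - b for a, b in zip(group1, group2)]
mean_diff = statistics.mean(differences)
se_diff = statistics.stdev(differences) / math.sqrt(len(differences))

t_paired = mean_diff / se_diff
df = len(differences) - 1
print(differences, mean_diff, round(t_paired, 3), df)
```

Because each subject serves as their own control, the person-to-person spread (23 to 34 here) drops out of the calculation and only the spread of the differences matters.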
(frequency)
ue boxes.
Period   Initial No.   Events   Lost to     Effective     Risk     Survival   Cumulative    95% Lower   95% Upper   sum q/pL
         at Risk                Follow-up   No. at Risk            Prob.      Surv. Prob.   Bound       Bound
0        100           6        4           98.0          0.0612   0.9388     0.9388        0.8728      0.9716      0.000665
1        90            6        5           87.5          0.0686   0.9314     0.8744        0.7931      0.9267      0.001507
2        79            3        2           78.0          0.0385   0.9615     0.8408        0.7535      0.9012      0.002020
3        74            5        7           70.5          0.0709   0.9291     0.7811        0.6854      0.8540      0.003102
4        62            4        7           58.5          0.0684   0.9316     0.7277        0.6254      0.8106      0.004357
5        51            5        2           50.0          0.1000   0.9000     0.6550        0.5459      0.7498      0.006579
6        44            3        6           41.0          0.0732   0.9268     0.6070        0.4947      0.7091      0.008505
7        35            0        3           33.5          0.0000   1.0000     0.6070        0.4947      0.7091      0.008505
8        32            7        3           30.5          0.2295   0.7705     0.4677        0.3493      0.5899      0.018271
9        22            5        4           20.0          0.2500   0.7500     0.3508        0.2364      0.4854      0.034938
10       13            6        7           9.5           0.6316   0.3684     0.1292        0.0517      0.2879      0.215389
11       0

(Periods 12 to 47 are empty in the worksheet.)

[Chart: Survival Curve; y-axis Survival Probability (0.0000 to 1.0000), x-axis period]
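The life-table columns follow a pattern that can be checked with a short Python sketch: the effective number at risk is the initial number minus half of those lost to follow-up, the risk is events divided by the effective number, and the cumulative survival is the running product of the per-period survival probabilities. This is an actuarial-style reconstruction inferred from the numbers, with illustrative names; the confidence bounds are not reproduced.

```python
def actuarial_life_table(initial_at_risk, events, lost):
    """Return one tuple per period:
    (period, at_risk, events, lost, effective, risk, survival, cumulative)."""
    rows = []
    at_risk, cumulative = initial_at_risk, 1.0
    for period, (d, w) in enumerate(zip(events, lost)):
        effective = at_risk - w / 2       # those lost count for half the period
        risk = d / effective              # probability of the event in this period
        cumulative *= 1 - risk            # product of survival probabilities so far
        rows.append((period, at_risk, d, w, effective, risk, 1 - risk, cumulative))
        at_risk -= d + w                  # survivors still followed enter next period
    return rows

# Events and losses per period, copied from the worksheet (periods 0 to 10).
events = [6, 6, 3, 5, 4, 5, 3, 0, 7, 5, 6]
lost = [4, 5, 2, 7, 7, 2, 6, 3, 3, 4, 7]

table = actuarial_life_table(100, events, lost)
for period, at_risk, d, w, effective, risk, survival, cumulative in table:
    print(period, at_risk, d, w, effective,
          round(risk, 4), round(survival, 4), round(cumulative, 4))
```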
Effective Size
Sensitivity Specificity
0.408 0.944
Standardized Incidence Ratios

SIR is useful for evaluating whether the number of observed cancers in a community exceeds the number that would be expected from the overall average rate for the entire state.

Calculation of Standardized Incidence Ratios (SIRs): To determine whether e[...] cases occurred in a com[...] data are tabulated by ag[...] compare the observed n[...] the number that would b[...] statewide cancer rate.

Example:
                     State Cancer   # People in   Expected #   Observed #
                     Rate           Community     Community    Community
e.g. age   Stratum   (Standard)     Strata        Cancers      Cancers
                                                  (Column CxD)
<20        1         0.00010        74657         7.5          11
20-44      2         0.00020        134957        27.0         25
45-64      3         0.00050        54463         27.2         30
65-74      4         0.00150        25136         37.7         40
75-84      5         0.00180        17012         30.6         30
85+        6         0.00100        6337          6.3          8
           7         0.00000        0             0.0
           8         0.00000        0             0.0
Totals                              312562        136.4        144

Standardized Incidence Ratio (SIR):   106
Lower 95% Confidence Limit:           88
Upper 95% Confidence Limit:           123

Confidence Limits for Observed Count (lower limit shown: 120), computed with nested if statements.

If the observed count is >30, the confidence interval for [the] observed count is calculated using the Poisson distribu[...] to approximate the distribution of the observed counts
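The SIR example can be reproduced with the sketch below. The expected count in each stratum is the state rate times the community population; the SIR is observed divided by expected, times 100. For the confidence limits the sketch uses observed +/- 1.96 x sqrt(observed), a normal approximation to a Poisson count that happens to reproduce the worksheet's 88 and 123; whether the worksheet uses exactly this rule is an assumption.

```python
import math

# One entry per age stratum (<20, 20-44, 45-64, 65-74, 75-84, 85+).
state_rates = [0.00010, 0.00020, 0.00050, 0.00150, 0.00180, 0.00100]
community_pop = [74657, 134957, 54463, 25136, 17012, 6337]
observed = [11, 25, 30, 40, 30, 8]

# Expected cases if the community experienced the statewide rates.
expected_total = sum(rate * pop for rate, pop in zip(state_rates, community_pop))
observed_total = sum(observed)

sir = 100 * observed_total / expected_total

# Approximate 95% limits for the observed count, then rescaled to the SIR.
half_width = 1.96 * math.sqrt(observed_total)
count_lower = observed_total - half_width
count_upper = observed_total + half_width
sir_lower = 100 * count_lower / expected_total
sir_upper = 100 * count_upper / expected_total

print(round(expected_total, 1), observed_total)
print(round(sir), round(sir_lower), round(sir_upper))
```

Since the interval (88 to 123) includes 100, the observed count is compatible with the statewide rates.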
Direct Standardization (for Adjusted Rates)
Adapted from Dr. Tim Heeren, Boston University School of Public Health, Dept. of Biostatistics

For specific strata of a population (e.g. age groups) indicate the number of observed events and the number of people in the stratum in columns E and F. Indicate the distribution of some standard reference population in column C. [Leave a "1" in column F for extra strata to prevent calculation error.]

Example: Suppose you want to compare Florida and Alaska with respect to death rates from cancer. The problem is that death rates are markedly affected by age, and Florida and Alaska have different age distributions. However, we can calculate age-adjusted rates by using a reference or "standard" distribution to determine what the overall rates for Florida and Alaska would have been if their populations had similar distributions. The calculation uses the age-specific rates observed for each population and calculates a weighted average using the "standard" population's distribution for weighting. In this case, the US age distribution in 1988 was used as a standard, but you can use any other standard. Note that the crude rates for Florida and Alaska differ substantially (1,061 per 100,000 vs. 391 per 100,000), but Florida has a higher percentage of old people. The standardized (age-adjusted) rates are very similar (797 vs. 750 per 100,000).

Florida
                     Distribution of
                     US Population     Number of   Number of   Proportion              Weight x
e.g. age   Stratum   in 1988           Events      Subjects    or "Rate"    SE         Rate
<5         1         0.07              2414        850000      0.00284      0.00006    0.00019880
5-19       2         0.22              1300        2280000     0.00057      0.00002    0.00012544
20-44      3         0.40              8732        4410000     0.00198      0.00002    0.00079202
45-64      4         0.19              21190       2600000     0.00815      0.00006    0.00154850
65+        5         0.12              97350       2200000     0.04425      0.00014    0.00531000
           6         0.00              0           1           0.00000      0.00000    0.00000000
           7         0.00              0           1           0.00000      0.00000    0.00000000
           8         0.00              0           1           0.00000      0.00000    0.00000000
Totals               1.00              130986      12340003                            0.00797476

Crude Rate                            0.01061
Standardized Proportion or "Rate"     0.00797
Standard Error                        0.00002
95% CI for Standardized Rate          0.00793   0.00802

Florida
                                   % of total   Rate per
Age     Deaths     Pop.            (Weight)     100,000
<5      2,414      850,000         7%           284
5-19    1,300      2,280,000       18%          57
20-44   8,732      4,410,000       36%          198
45-64   21,190     2,600,000       21%          815
>65     97,350     2,200,000       18%          4,425
Tot.    130,986    12,340,000      100%

Crude Rate= 130,986/12,340,000=1,061 per 100,000

Alaska (worksheet rows visible in this extract)
           Distribution of
           US Population     Number of   Number of   Proportion
Stratum    in 1988           Events      Subjects    or "Rate"    SE
1          0.07              164         60000       0.00273      0.00021
2          0.22              85          130000      0.00065      0.00007
3          0.40              450         240000      0.00188      0.00009

Alaska
Age     Deaths     Pop.
<5      164        60,000
5-19    85         130,000
20-44   450        240,000
45-64   503        80,000
>65     870        20,000
Tot.    2,072      530,000
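The Florida and Alaska comparison can be reproduced with the minimal Python sketch below. The directly standardized rate is the sum over strata of (standard weight x stratum-specific rate). The Alaska populations for the two oldest strata are cut off in the extract; the values of 80,000 and 20,000 used here are inferred from the 530,000 total and the quoted crude rate of 391 per 100,000, so treat them as an assumption. Function names are illustrative.

```python
# Standard weights: US age distribution in 1988 (<5, 5-19, 20-44, 45-64, 65+).
weights = [0.07, 0.22, 0.40, 0.19, 0.12]

florida = {"deaths": [2414, 1300, 8732, 21190, 97350],
           "pop": [850_000, 2_280_000, 4_410_000, 2_600_000, 2_200_000]}
# Last two Alaska populations are inferred (see the note above).
alaska = {"deaths": [164, 85, 450, 503, 870],
          "pop": [60_000, 130_000, 240_000, 80_000, 20_000]}

def crude_rate(state):
    """All deaths divided by the whole population."""
    return sum(state["deaths"]) / sum(state["pop"])

def standardized_rate(state, std_weights):
    """Weighted average of stratum-specific rates using the standard weights."""
    return sum(w * d / n
               for w, d, n in zip(std_weights, state["deaths"], state["pop"]))

for name, state in (("Florida", florida), ("Alaska", alaska)):
    print(name,
          "crude per 100,000:", round(1e5 * crude_rate(state)),
          "standardized per 100,000:", round(1e5 * standardized_rate(state, weights)))
```

The crude rates differ almost threefold, while the standardized rates are close, which shows that most of the crude difference comes from Florida's older age structure rather than from higher age-specific death rates.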