Summary Statistics

Summarizing Data
Mean
ΣX = 2127, N = 28
M = ΣX / N = 2127 / 28 = 75.96
Summarizing Data
Ordered data (N = 28):
32, 34, 40, 62, 63, 65, 68, 68, 70, 76, 76, 77, 78, 79, 80, 80, 81, 81, 87, 88, 88, 88, 89, 90, 93, 95, 97, 102
Summarizing Data
Median
With the 28 scores in order, the two middle scores (the 14th and 15th) are 79 and 80:
Mdn = (79 + 80) / 2 = 159 / 2 = 79.5
Summarizing Data
Mode

Score   Frequency
32      1
34      1
40      1
62      1
63      1
65      1
68      2
70      1
76      2
77      1
78      1
79      1
80      2
81      2
87      1
88      3
89      1
90      1
93      1
95      1
97      1
102     1

Mode = 88 (the score with the highest frequency, 3)
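The three measures of central tendency above can be checked with Python's standard library; a minimal sketch using the 28 ordered scores from the example:

```python
import statistics

# The 28 ordered scores from the example
scores = [32, 34, 40, 62, 63, 65, 68, 68, 70, 76, 76, 77, 78, 79,
          80, 80, 81, 81, 87, 88, 88, 88, 89, 90, 93, 95, 97, 102]

mean = sum(scores) / len(scores)     # M = ΣX / N = 2127 / 28
median = statistics.median(scores)   # average of the 14th and 15th ordered scores
mode = statistics.mode(scores)       # most frequent score

print(round(mean, 2))   # 75.96
print(median)           # 79.5
print(mode)             # 88
```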
[Histograms of the score distribution, Frequency vs. SCORE: one with 8 categories (bins centered 35.0 to 105.0) and one with 16 categories (bins centered 32.5 to 107.5)]
Histogram
Graph of frequencies
[Histogram of the scores: Frequency vs. SCORE, bins centered 32.5 to 107.5]
Salaries in 3 Companies

Company A   Company B   Company C
$10,000     $10,000     $10,000
$10,000     $10,000     $10,000
$12,000     $12,000     $12,000
$12,000     $12,000     $12,000
$12,000     $12,000     $12,000
$12,000     $12,000     $12,000
$12,000     $12,000     $12,000
$12,000     $12,000     $12,000
$14,000     $14,000     $15,000
$14,000     $380,000    $380,000

        Company A   Company B   Company C
M       $12,000     $48,600     $48,700
Mdn     $12,000     $12,000     $12,000
Mo      $12,000     $12,000     $12,000

Note the dramatic effect of one salary on the mean, especially comparing A to B. The means for Companies B and C are pulled in the direction of the one unusual score, but the Medians and Modes remain unaffected: the impact of a few unusually high or low scores is seen most clearly on the MEAN. Company A's distribution is not skewed; B's and C's are skewed.
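The outlier effect described above is easy to reproduce; a sketch using the Company A and Company B salaries, where one $380,000 salary moves the mean but not the median or mode:

```python
import statistics

# Company A: no unusual salary; Company B: one $380,000 outlier
company_a = [10_000, 10_000] + [12_000] * 6 + [14_000, 14_000]
company_b = [10_000, 10_000] + [12_000] * 6 + [14_000, 380_000]

for name, salaries in (("A", company_a), ("B", company_b)):
    m = statistics.mean(salaries)
    mdn = statistics.median(salaries)
    mo = statistics.mode(salaries)
    print(name, m, mdn, mo)
# A: mean 12000, median 12000, mode 12000
# B: mean 48600, median 12000, mode 12000
```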
[Histogram of IQ scores, Frequency vs. IQ (95.0 to 155.0): Median = 120.5, with the Mean marked separately]
[Histogram of murder rate (0.0 to 20.0): Mean = 7.3, N = 50.00]
[Histogram of AGE (15.0 to 90.0): Mode = 21, Median = 25, Mean = 31.34; the Mean gets dragged to the right]
[Histogram of score on exam (70.0 to 90.0): Mean = 82.48, Median = 84, Mode = 88.8]
[Histogram of GPA (2.81 to 3.94): Mean = 3.51, Median = 3.55, Mode = 3.71]
[Histogram of GPA (3.19 to 4.00): Mean = 3.78, Median = 3.82, Mode = 3.98; the Mean gets dragged to the left]
Skewed Distributions
Relative positions of measures of central tendency

Salaries in 2 Companies
The Means and Medians suggest that the employees at these 2 companies have similar pay:
M = $60,000, Mdn = $60,000 (in both companies)
Salaries in 2 Companies
But looking at the distribution of pay, the shapes are different.
[Two salary histograms with different shapes; both have M = $60,000 and Mdn = $60,000]
Range
R = Maximum − Minimum
For example: R = $110,000 − $10,000 = $100,000 in one company, and R = $80,000 − $40,000 = $40,000 in the other.
Standard Deviation
A more precise measure of dispersion
Comparable to the mean in that it takes all of the data points into account (not just the highest and lowest)
More commonly used than the Range
Is the basis for many later measures and tests we'll be using
Standard = average
Deviation = away from normal
Standard Deviation
Scores (X): 3, 4, 5, 6, 7
We are going to calculate a standard deviation (SD) for this set of 5 scores.
ΣX = 25, N = 5, M = ΣX / N = 25 / 5 = 5
The mean is 5, so how far away (on average) do these 5 scores deviate from the mean?
Standard Deviation

Scores (X)   X − M (deviation score)
3            3 − 5 = −2
4            4 − 5 = −1
5            5 − 5 = 0
6            6 − 5 = 1
7            7 − 5 = 2

Subtracting the mean from each score to get the deviation score.
Standard Deviation

Scores (X)   X − M (deviation score)
3            −2
4            −1
5            0
6            1
7            2

...then taking the average of these 5 numbers:
Σ(X − M) / N = 0 / 5 = 0
This always results in 0: the positive & negative numbers cancel each other out.
Standard Deviation
This method won't work: the positives & negatives always cancel each other out to zero!
Σ(X − M) / N = 0 / 5 = 0
We need another method to get the information we want.
Standard Deviation
Deviation method

Scores (X)   X − M (deviation score)   (X − M)²
3            −2                        4
4            −1                        1
5            0                         0
6            1                         1
7            2                         4

SS = Σ(X − M)² = 4 + 1 + 0 + 1 + 4 = 10
s² = SS / (n − 1) = 10 / 4 = 2.5
Standard Deviation
Deviation method
s² = 2.5
Taking the square root of the variance gets this measure back into the original scale:
s = √2.5 = 1.58
So for the scores (X) 3, 4, 5, 6, 7, the standard deviation is 1.58.
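The deviation-method steps above translate directly into code; a minimal sketch, using the n − 1 "sample" denominator the slides use:

```python
# Deviation method for the sample standard deviation
scores = [3, 4, 5, 6, 7]
n = len(scores)
m = sum(scores) / n                    # mean = 5.0

deviations = [x - m for x in scores]   # [-2, -1, 0, 1, 2]; sums to 0
ss = sum(d ** 2 for d in deviations)   # SS = 4 + 1 + 0 + 1 + 4 = 10
variance = ss / (n - 1)                # s² = 10 / 4 = 2.5
s = variance ** 0.5                    # s = √2.5 ≈ 1.58

print(sum(deviations), ss, variance, round(s, 2))  # 0.0 10.0 2.5 1.58
```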
Standard Deviation
Computational method for SS
The deviation method (the last few pages) will give you the standard deviation, and it is easier to see where that end number comes from (a measure of how far from the mean the numbers are).
However, the deviation method can be a bit tedious in hand calculations, especially with certain types of raw data.
We have another, more direct method (the computational method) of getting to the SS when using hand calculations:
SS = ΣX² − (ΣX)² / N
Standard Deviation
Computational method

Scores (X)   X²
3            9
4            16
5            25
6            36
7            49

ΣX = 25, N = 5, ΣX² = 135
SS = ΣX² − (ΣX)² / N = 135 − 25² / 5 = 135 − 125 = 10
s = √(SS / (n − 1)) = √(10 / 4) = 1.58
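The computational shortcut SS = ΣX² − (ΣX)²/N reaches the same SS without first computing any deviation scores; a sketch:

```python
scores = [3, 4, 5, 6, 7]
n = len(scores)

sum_x = sum(scores)                    # ΣX = 25
sum_x2 = sum(x ** 2 for x in scores)   # ΣX² = 9 + 16 + 25 + 36 + 49 = 135

ss = sum_x2 - sum_x ** 2 / n           # 135 - 625/5 = 10
s = (ss / (n - 1)) ** 0.5              # √(10/4) ≈ 1.58
print(ss, round(s, 2))                 # 10.0 1.58
```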
Standard Deviation
Second example: deviation method
Scores (X): 10, 12, 13, 15, 18, 20, 22, 25
Another dataset, showing you how to calculate the standard deviation using both methods.
Standard Deviation
Second example: deviation method

Scores (X)   X − M     (X − M)²
10           −6.875    47.27
12           −4.875    23.77
13           −3.875    15.02
15           −1.875    3.52
18           1.125     1.27
20           3.125     9.77
22           5.125     26.27
25           8.125     66.02

M = 135 / 8 = 16.875
SS = Σ(X − M)² = 192.91
s² = SS / (n − 1) = 192.91 / 7 = 27.56
s = √27.56 = 5.25
Standard Deviation
Second example: computational method

Scores (X)   X²
10           100
12           144
13           169
15           225
18           324
20           400
22           484
25           625

M = 135 / 8 = 16.875
ΣX² = 2471
SS = ΣX² − (ΣX)² / N = 2471 − 135² / 8 = 2471 − 2278.125 = 192.875
s = √(SS / (n − 1)) = √(192.875 / 7) = 5.249
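The two methods for this dataset can be checked against each other; a sketch confirming both routes give the same SS (the slides' 192.91 reflects rounding in the squared deviations; the exact value is 192.875):

```python
scores = [10, 12, 13, 15, 18, 20, 22, 25]
n = len(scores)
m = sum(scores) / n                        # 135 / 8 = 16.875

# Deviation method
ss_dev = sum((x - m) ** 2 for x in scores)
# Computational method
ss_comp = sum(x ** 2 for x in scores) - sum(scores) ** 2 / n

s = (ss_dev / (n - 1)) ** 0.5
print(ss_dev, ss_comp, round(s, 2))        # 192.875 192.875 5.25
```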
Standard Deviation
Third example: deviation method

Scores (X)   X − M     (X − M)²
6            0.857     0.735
2            −3.143    9.878
8            2.857     8.163
5            −0.143    0.020
4            −1.143    1.306
4            −1.143    1.306
7            1.857     3.449

M = 36 / 7 = 5.143
SS = Σ(X − M)² = 24.857
s² = SS / (n − 1) = 24.857 / 6 = 4.143
s = √4.143 = 2.035
Standard Deviation
Third example: computational method

Scores (X)   X²
6            36
2            4
8            64
5            25
4            16
4            16
7            49

M = 36 / 7 = 5.143
ΣX² = 210
SS = ΣX² − (ΣX)² / N = 210 − 36² / 7 = 210 − 185.143 = 24.857
s = √(SS / (n − 1)) = √(24.857 / 6) = 2.035
Kurtosis
Another check on normality
[Three frequency histograms of Score with different peakedness: one roughly normal (scores 16.0 to 32.0), one flatter and more spread out (scores 1.5 to 9.5), one sharply peaked (scores 23.00 to 25.00)]
Assessing Kurtosis
In a normal curve, s / R ≈ .17.
Anything LARGER than that (such as .38 or .47) would indicate a platykurtic distribution, one that has MORE variability than normal.
Anything SMALLER than that (such as .11 or .09) would suggest that the data are more closely packed than normal, a leptokurtic distribution.
Assessing Kurtosis
In a normal curve, the standard deviation is approximately 1/6 of the range, so s / R ≈ .17.
[Frequency histogram of scores 16.0 to 32.0]
In this example:
s / R = 2.93 / 15 ≈ .19
Verdict: Mesokurtic
Assessing Kurtosis
Normal: s / R ≈ .17
[Frequency histogram of Score 1.5 to 9.5]
In this example:
s / R = 2.42 / 8 ≈ .30
Verdict: Platykurtic
Assessing Kurtosis
Normal: s / R ≈ .17
[Frequency histogram of Score 22.95 to 25.35]
In this example:
s / R = .226 / 2 ≈ .11
Verdict: Leptokurtic
Assessing Kurtosis
Full example, ordered scores (N = 40):
40, 45, 50, 60, 60, 65, 70, 74, 75, 75, 77, 80, 80, 80, 85, 85, 85, 90, 90, 90, 93, 93, 93, 93, 94, 94, 95, 100, 100, 105, 105, 105, 105, 110, 120, 127, 130, 137, 140, 180

ΣX = 3675
ΣX² = 365971

M = ΣX / N = 3675 / 40 = 91.875
Mdn = 91.5
Mo = 93 & 105

SS = ΣX² − (ΣX)² / N = 365971 − 3675² / 40 = 365971 − 337640.625 = 28330.375
s = √(SS / (n − 1)) = √(28330.375 / 39) = 26.952

s / R = 26.952 / 140 = .193
Conclusion: Mesokurtic

[Histogram of the 40 scores, 40.0 to 180.0]
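The full example above brings all the pieces together, and every number can be reproduced from the raw scores; a sketch:

```python
import statistics

scores = [40, 45, 50, 60, 60, 65, 70, 74, 75, 75, 77, 80, 80, 80,
          85, 85, 85, 90, 90, 90, 93, 93, 93, 93, 94, 94, 95, 100,
          100, 105, 105, 105, 105, 110, 120, 127, 130, 137, 140, 180]
n = len(scores)                            # 40

m = sum(scores) / n                        # 3675 / 40 = 91.875
mdn = statistics.median(scores)            # 91.5
modes = statistics.multimode(scores)       # [93, 105] -- bimodal

ss = sum(x ** 2 for x in scores) - sum(scores) ** 2 / n   # 28330.375
s = (ss / (n - 1)) ** 0.5                                 # about 26.952
ratio = s / (max(scores) - min(scores))                   # about .193 -> mesokurtic

print(m, mdn, modes, ss, round(s, 3), round(ratio, 3))
```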
Chapter 5:
Probability Concepts and Screening

Definitions
Random variable: a numerical quantity that takes on different values depending on chance
Population: the set of all possible values for a random variable
Event: an outcome or set of outcomes
Probability: the proportion of times an event is expected to occur in the population
Ideas about probability are founded on relative frequencies (proportions) in populations.
Classical Model
A priori, or "before the fact": predictions, ideas about what will happen, using reason alone.
(Based on the conditions that all possible outcomes are equally likely and that only 1 can occur)
Probability = # of specific events out of total possible events:
p = s / t
This will give us a number between 0 and 1:
0 ------------------ .50 ------------------ 1.00
no chance of occurring                      certainty of occurring
Classical/Theoretical Model
Tossing a coin or rolling dice are good examples.
In flipping a coin:
Heads is just as likely as tails
Can only have 1 outcome (can't be both H & T)
In rolling a die:
Getting a 3 is just as likely as getting a 5
Can only have 1 outcome
Classical/Theoretical Model
What is the probability of flipping a coin and getting a Head as the outcome?
How many specific outcomes satisfy this condition? 1 (getting a Head)
How many total possible outcomes are there? 2 (H or T)
p = s / t = 1 / 2 = .50
The classical model predicts a 50% chance of getting a Head.
Classical/Theoretical Model
What is the probability of rolling a die and getting a 3?
How many specific outcomes satisfy this condition? 1 (i.e., rolling a 3)
How many total possible outcomes are there? 6
p = s / t = 1 / 6 ≈ .167
Classical Model
What is the probability of tossing a die and getting an even number as an outcome?
How many specific outcomes satisfy this condition? 3 (i.e., getting a 2, 4 or 6: mutually exclusive outcomes)
How many total possible outcomes are there? 6
p = s / t = 3 / 6 = .50
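The classical p = s/t counts can be enumerated rather than reasoned out by hand; a sketch for the two die examples, using exact fractions:

```python
from fractions import Fraction

outcomes = range(1, 7)   # the 6 faces of a die
t = len(outcomes)

# p = s / t: count the outcomes satisfying each condition
p_three = Fraction(sum(1 for o in outcomes if o == 3), t)       # 1/6
p_even = Fraction(sum(1 for o in outcomes if o % 2 == 0), t)    # 3/6 = 1/2

print(p_three, float(p_even))   # 1/6 0.5
```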
Long-run/Empirical model
A posteriori viewpoint: "after the fact"; the data have already been collected.
Rather than looking at what SHOULD occur, this is what HAS actually occurred over a large number of events.
Can compare this number to the expected outcomes as predicted by the classical model.
"Long-run" because the more data you have, the more accurate the results are going to be.
Long-run/Empirical model
Coin flip game: a coin will be flipped
Heads = I'll give you $5
Tails = you give me $5
So you'd be hoping for heads.
With multiple tosses, you might generally expect to break even, or maybe one person would be $5 ahead of the other.
Long-run/Empirical model
On 10 flips, the outcome is:
5 tails (sounds good)
6 tails (I got a bit lucky)
7 tails (more lucky)
8 tails (hm)
9 tails (???)
At what point do you (should you) become suspicious?
Long-run/Empirical model
Same scenario:
Heads = I'll give you $5
Tails = you give me $5
BUT instead of there being 10 tosses, there will be 100.
Long-run/Empirical model
On 100 flips, the outcome is:
50 tails
60 tails
70 tails
80 tails
90 tails
The percentages are the same as with 10 tosses, but now each is based on a larger number:
60% based on 100 is a lot more suspicious than 60% based on 10.
Expected long-run proportion: 50 / 100 = .50
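The long-run idea, that the observed proportion settles near the classical .50 as the number of flips grows, can be simulated; a sketch (the seed and flip counts are arbitrary choices of mine):

```python
import random

random.seed(42)  # arbitrary seed, for reproducibility only

for n_flips in (10, 100, 10_000):
    tails = sum(random.random() < 0.5 for _ in range(n_flips))
    # The proportion of tails drifts toward .50 as n grows
    print(n_flips, tails / n_flips)
```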
Conditional probability:
Pr(B | A) = Pr(A and B) / Pr(A)   (as long as Pr(A) > 0)
Pr(A and B) = Pr(A) × Pr(B | A)
Combining Probabilities
1 outcome
Finding the probability that ONE of several acceptable outcomes may occur.
For example, what is the probability that you will get a Head OR a Tail when you flip a coin?
In this type of situation, you add the probabilities together:
p = Head OR Tail
p = .50 + .50 = 1.00
Combining Probabilities
1 outcome
p(Head OR Tail) = .50 + .50 = 1.00
Combining Probabilities
1 outcome
Finding the probability that one of several acceptable outcomes may occur.
The outcomes must all be mutually exclusive.
What is the probability of rolling a die and getting an even number OR a 3?
ADD-OR rule:
p = even number OR 3 = .50 + .167 = .667
Combining Probabilities
Multiple outcomes
Finding the probability that multiple events will occur.
For example, what is the probability that you will flip 2 coins and get 2 Heads?
In this type of situation, you multiply the probabilities:
p = Head AND Head
p = .50 × .50 = .25
Combining Probabilities
Multiple outcomes
p(Head AND Head) = .50 × .50 = .25
Combining Probabilities
Multiple outcomes
You will draw one card from a deck, replace that card, and draw a second card from the deck (the replacement makes the two draws independent).
What is the probability that you will get an Ace on the first draw AND an Ace on the second draw?
MULT-AND rule:
p = (Ace) AND (Ace) = (4/52) × (4/52) = .077 × .077 = .006
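Both rules above can be checked numerically; a sketch (exact fractions avoid rounding, so the slides' .667 and .006 appear as 2/3 and 1/169):

```python
from fractions import Fraction

# ADD-OR rule: mutually exclusive outcomes on one die roll
p_even = Fraction(3, 6)
p_three = Fraction(1, 6)
p_even_or_three = p_even + p_three   # 2/3, about .667

# MULT-AND rule: independent draws with replacement
p_ace = Fraction(4, 52)
p_two_aces = p_ace * p_ace           # (4/52)^2 = 1/169, about .006

print(float(p_even_or_three), float(p_two_aces))
```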
Combining probabilities
When you have one outcome desired, and more than one way to achieve that outcome, it makes the outcome easier to achieve (higher probability).
When you have multiple outcomes that need to occur together, it makes the result more difficult to achieve than any one of the outcomes alone.
Applying probability
Vital statistics rates
We can combine these rules and properties to determine certain information of interest.
Mortality rate: the standard way to compare death rates across different circumstances.
Numerator: # of people who died during a given period of time
Denominator: # of people who were at risk of dying during the same period
The denominator may be difficult to calculate, so the number of people alive in the population halfway through the time period is sometimes used as an estimate.
Applying probability
Vital statistics rates
Morbidity rate: calculated like mortality, but conveys the rate of disease in relation to the population.
Prevalence: the number of individuals with the disease at one point in time divided by the population at risk at that time
Incidence: the number of new cases during a given time span divided by the population at risk (at the beginning of the interval)
Applying probability
Vital statistics rates
NOTE:
(Because prevalence does not involve a period of time, it is actually a proportion, but it is often mistakenly termed a rate.)
The term "incidence" is sometimes used erroneously when the term "prevalence" is meant. One way to distinguish between them is to look for units: an incidence rate should always be expressed in terms of a unit of time.
Applying probability
Vital statistics rates
Prevalence and incidence can be compared to cross-sectional and longitudinal studies.
Prevalence is like a snapshot (like a cross-sectional study); you may see cross-sectional studies referred to as prevalence studies.
Incidence needs a period of time to pass, like cohort studies (which begin at a given time and continue to examine outcomes over the specific span of the study).
Applying probability
Screening
Screening is used to distinguish those who are apparently well from those who have a decently high probability of having the disease or condition under study (with the goal of further testing and/or treatment).
Generally employed when:
The target disease is serious enough to warrant it
The test is proven and acceptable, and able to detect the disease early enough for intervention
There is treatment available
Two probabilities are used to measure the ability of a test to distinguish between those who have the disease and those who do not.
Compare the screening results to definitive diagnosis results.
Applying probability
Screening
Sensitivity: does the test return a positive result on those who actually have the disease? (Missing out on finding people who have it would reduce the sensitivity.)

Sensitivity = (# of people who tested (+) at screening / total # of people screened who have the disease) × 100

Specificity = (# of people without the disease who screened (−) / total # of people screened who are without the disease) × 100
Applying probability
Screening

Screening: Higher cutoff (27)
[Score distributions with the cutoff at 27: true negatives; some false negatives (misses) and some correct negatives]

Screening: Lower cutoff (22)
[Score distributions with the cutoff at 22: true negatives; some hits and some false positives]
Applying probability
Screening
[False positives with the low cutoff]
Which is worse: a false positive or a false negative? Where should we err in putting the cutoff?
It depends:
Is the disease rare? Then high sensitivity is valuable.
Is the disease silent for a while before symptoms?
Is the disease lethal?
Is there effective treatment available?
We will see a parallel between these questions and those relating to Type I/II errors and power (related to hypothesis/significance testing).
Applying probability
Predictive value (positive)
Predictive value of a test: the probability of disease given a positive result; the chance that a patient with a positive test has the disease.
This is a conditional probability in which the event of the disease being present is dependent (i.e., conditional) on having a positive test result:

P(D | T+) = [P(T+ | D) × P(D)] / [P(T+ | D) × P(D) + P(T+ | no D) × P(no D)]

The numerator uses the probability that a test is positive given that the disease is present (i.e., the sensitivity of the test).
Applying probability
Predictive value (negative)
Predictive value of a test: the probability of absence of disease given a negative result; the chance that a patient with a negative test is free of the disease.
This is a conditional probability in which the event of the disease being absent is dependent (i.e., conditional) on having a negative test result:

P(no D | T−) = [P(T− | no D) × P(no D)] / [P(T− | no D) × P(no D) + P(T− | D) × P(D)]
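Both predictive values are applications of Bayes' theorem to the sensitivity, specificity, and prevalence; a sketch (the numeric inputs are made-up values for illustration, not from the slides):

```python
def predictive_value_positive(sens, spec, prevalence):
    # P(D | T+) = P(T+|D)P(D) / [P(T+|D)P(D) + P(T+|no D)P(no D)]
    num = sens * prevalence
    return num / (num + (1 - spec) * (1 - prevalence))

def predictive_value_negative(sens, spec, prevalence):
    # P(no D | T-) = P(T-|no D)P(no D) / [P(T-|no D)P(no D) + P(T-|D)P(D)]
    num = spec * (1 - prevalence)
    return num / (num + (1 - sens) * prevalence)

# Illustrative inputs only
pvp = predictive_value_positive(sens=0.9, spec=0.9, prevalence=0.1)
pvn = predictive_value_negative(sens=0.9, spec=0.9, prevalence=0.1)
print(round(pvp, 3), round(pvn, 3))  # 0.5 0.988
```

Even with a quite accurate test (90% sensitivity and specificity), a low prevalence pulls the positive predictive value down, which is why prevalence matters when interpreting a positive screen.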
Applying probability
Screening exercise
A newly developed test produced positive results in 138 of 150 known diabetics and in 24 of 150 persons known not to have diabetes.
a.
b.
c.
d.
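Whatever parts (a) through (d) ask, the basic screening quantities for these counts follow from the definitions above; a sketch (a predictive value would additionally require a prevalence, which the problem as excerpted does not give):

```python
# 138 of 150 known diabetics tested positive;
# 24 of 150 persons known not to have diabetes tested positive
true_pos, diseased = 138, 150
false_pos, disease_free = 24, 150

sensitivity = 100 * true_pos / diseased                        # 138/150 -> 92.0%
specificity = 100 * (disease_free - false_pos) / disease_free  # 126/150 -> 84.0%
print(sensitivity, specificity)   # 92.0 84.0
```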