Chapter 1:
Descriptive Statistics
Chung, Li (SAAS , HKU ) STAT1600B Statistics: Ideas and Concepts 2017-2018 (Sem 2) Ch 1 1 / 80
What is Statistics?
Outline
1 What is Statistics?
Types of Variables
Variables are either categorical or quantitative.
Raw data from categorical variables consist of group or
category names that don't necessarily have a logical ordering.
Categorical variables without a logical ordering are called nominal variables.
Examples: eye color, country of residence.
Categorical variables for which the categories have a logical
ordering are called ordinal variables.
Examples: highest educational degree earned, tee shirt size (S,
M, L, XL), Likert scale responses.
Descriptive Statistics – Graphical Summaries
Graphical Summaries
A majority, 1686/3042 = 55.4%, said they always wear a seat belt, while
115/3042 = 3.8%, said they never wear a seat belt.
Rarely or never: 8.2% + 3.8% = 12%
Note: The study does not prove that sleeping with a light on actually caused
myopia in more children.
Stem-and-Leaf Display
Data: 78 65 90 86 79 51 79 62 84 101
N = 10, Leaf unit = 1
stem | leaf
   5 | 1
   6 | 25
   7 | 899
   8 | 46
   9 | 0
  10 | 1
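The construction of the first display can be sketched in a few lines of Python. This is only an illustration: the function name `stem_and_leaf` and the choice of tens as stems are assumptions, not part of the slides.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=1):
    """Group values into stems (tens) and leaves (units), leaves in sorted order."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v // leaf_unit, 10)
        stems[stem].append(leaf)
    return dict(stems)

# The ten values listed for the first stem-and-leaf display.
data = [78, 65, 90, 86, 79, 51, 79, 62, 84, 101]
display = stem_and_leaf(data)

# Print one row per stem, including stems that have no leaves.
for stem in range(min(display), max(display) + 1):
    leaves = "".join(str(leaf) for leaf in display.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```

Sorting before grouping is what makes the display double as a sorted listing of the data.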
Stem-and-Leaf Display
N = 15, Leaf unit = 1
stem | leaf
   3 | 8
   4 | 0014556779
   5 | 0138
The same data with each stem split in two (* for leaves 0 – 4, + for leaves 5 – 9):
stem | leaf
  3+ | 8
  4* | 0014
  4+ | 556779
  5* | 013
  5+ | 8
Histogram
**** Area represents frequency ****
\[ \text{density} = \frac{\text{frequency} / \text{data size}}{\text{width}} = \frac{\text{relative frequency}}{\text{width}} \]
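A minimal sketch of the density calculation follows. The class boundaries in `edges` are hypothetical (chosen only to show unequal widths), and the data are the ten values from the first stem-and-leaf display.

```python
def histogram_density(values, edges):
    """Return (lower, upper, frequency, density) for each class [lower, upper)."""
    n = len(values)
    rows = []
    for lo, hi in zip(edges, edges[1:]):
        freq = sum(lo <= v < hi for v in values)
        width = hi - lo
        # density = relative frequency / width
        rows.append((lo, hi, freq, freq / n / width))
    return rows

data = [78, 65, 90, 86, 79, 51, 79, 62, 84, 101]
edges = [50, 60, 80, 110]  # hypothetical class boundaries with unequal widths
rows = histogram_density(data, edges)

# Because area represents (relative) frequency, the bar areas add up to 1.
total_area = sum((hi - lo) * density for lo, hi, _, density in rows)
```

With unequal class widths, plotting density rather than raw frequency keeps the area of each bar proportional to the number of observations in it.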
[Figure: per capita income plotted on two vertical scales, one marked $12,000 to $36,000 and one marked $10,000 to $40,000.]
Box plot
Strength:
Give a direct look at central location and spread as it
summarizes the five-number summary.
Can identify outliers.
Side-by-side box plot is an excellent tool for comparing two or
more groups.
Weakness:
Not entirely useful for judging shape.
Cannot distinguish between bell-shaped and bimodal distributions.
Stem-and-Leaf plot
Strength:
Excellent for sorting data.
With a sufficient sample size, it can be used to judge shape.
Weakness:
With a large sample size, a stem-and-leaf plot may be too
cluttered because the display shows all individual data values.
More restricted in the choices for “intervals” when compared to
histograms.
Dot plot
Strength:
Can present all individual data values.
Easy to create.
Weakness:
With a large sample size, a dot plot may be too cluttered.
Misleading Graphs
Statistics can be misleading if not presented appropriately.
The same data can appear very different when graphed in different ways.
Suppose, for example, that two of your classmates are instructed
by your history professor to construct a graph of the number of
men and women who scored in the top half of the class on the
history exam.
Note: a 3D effect can also distort a graph, because the drawn sizes are no longer proportional to the quantities.
[Figure: two bar charts for Men and Women. (a) Correct: vertical axis from 0 to 15. (b) Incorrect: vertical axis from 10 to 13.]
FIGURE 3.12 Two bar diagrams showing the same results using different scales for frequency.
(a) The left graph shows the correct proportional relationship between men and women. (b) In the right
graph, putting a break in the vertical axis results in an incorrect proportional relationship.
Lying with Statistical Graphics
Market penetration of four brands of cigarette: A, B, C, D
[Figure: two pie charts of the same market shares (A 27%, B 37%, C 18%, D 18%) and two bar charts of brands A, B, C, D drawn on different vertical scales.]
Bad Graphic Designs
Representation not proportional to quantities.
Misleading alignment.
[Figure: graphic of the percentage of college enrolment with age 25 and over.]
[Figure: frequency curves for several distribution shapes, including the J-shaped distribution, the positively skewed distribution, and the negatively skewed distribution.]
Bell-Shaped Distributions
Data: a representative sample of 199 married British couples.
The histogram shows the wives' heights with a normal
curve superimposed. The mean height is 1602 millimeters.
Descriptive Statistics – Numerical Summaries
Summarizing Data
Numerical Summaries
FIGURE 5.1 Differences in central tendency, variability, and shape of frequency distributions (polygons).
Measures of Central Location (the center (or average) of the data): Mean,
Median, Mode
Mean is the arithmetic average
\[ \bar{x} = \frac{\sum x}{n} \]
Median is the middle value in the data
Mode is the value in the data set that occurs most frequently
[Figure: number line for X from 8 to 18 with the mean marked at \( \bar{X} = 14 \).]
Deviations from the mean, \( (X - \bar{X}) \): −6, −2, −2, −2, −1, +2, +3, +4, +4
\[ \bar{x} = \sum x / n \implies \sum (x - \bar{x}) = 0 \]
The mean earning was $1,253,638, but the median was
$837,886. The money earned by the best player, Tiger Woods,
was over $4 million greater than that of the second-ranked
player, Steve Stricker, and Stricker's earnings were over $1
million greater than the third-ranked player.
The earnings of these two players strongly affected the total, and
hence the mean, but their values did not affect the median.
Imagine the change if Tiger Woods had earned $5 million more.
Therefore, in distributions that are strongly asymmetrical
(skewed) or have a few very deviant scores, the median may be
the better choice for measuring the central tendency.
VII. Number of Full-Time Job Offers
The number of job offers received by Bachelor of Laws graduates is shown in the following table.
* HKU Average refers to the figure for the total HKU population and includes M.B.,B.S. and B.D.S. graduates.
Discussion topics:
e.g., What can we say about the statistics? Should we use the mean or the median in
measuring the central location? What other information should be added here to make the …
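\[ \bar{x} = \frac{\sum x}{n} = \frac{1}{n}(x_1 + x_2 + \cdots + x_n) \]
[Figure: a distribution with \( \bar{x} = 3 \) resting on a support.]
Unbalanced if supported at the peak.
Balanced if supported at the mean.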
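Measure of Central Location
Median – number with the middle rank
\[ \text{median} = \begin{cases} \left(\dfrac{n+1}{2}\right)\text{th number} & \text{if } n \text{ is odd} \\[1ex] \text{average of } \left(\dfrac{n}{2}\right)\text{th and } \left(\dfrac{n}{2}+1\right)\text{th numbers} & \text{if } n \text{ is even} \end{cases} \]
Example: \( \text{median} = \dfrac{4.6 + 5.4}{2} = 5 \)
Median as a Balance Point
median = 5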
Data: 13 14 15 19 20
\[ \bar{x} = \frac{13 + 14 + 15 + 19 + 20}{5} = 16.2, \qquad \text{median} = 15 \]
Data: 13 14 15 19 53
\[ \bar{x} = \frac{13 + 14 + 15 + 19 + 53}{5} = 22.8, \qquad \text{median} = 15 \]
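The two small data sets above can be checked with Python's standard `statistics` module. This is a quick sketch; the variable names are illustrative only.

```python
from statistics import mean, median

original = [13, 14, 15, 19, 20]
with_outlier = [13, 14, 15, 19, 53]  # largest value replaced by an extreme one

mean_original = mean(original)
mean_outlier = mean(with_outlier)
median_original = median(original)
median_outlier = median(with_outlier)

print(mean_original, median_original)  # mean and median of the original data
print(mean_outlier, median_outlier)    # the mean moves, the median does not
```

Changing a single extreme value shifts the mean by several units while the median stays at 15, which is why the median is often preferred for strongly skewed data.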
Measure of Central Location
Mode – peak of the distribution (modal class)
[Figure: a distribution with the mean, mode, and median marked; the median splits the area 50% / 50%.]
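Measures of Variation
[Figure: deviations from \( \bar{x} = 3 \), with the negative deviations marked; \( \sum (x - \bar{x}) = 0 \).]
For the sample of 10 values used in the examples that follow:
\[ \sum x = 506, \qquad \sum (x - \bar{x}) = 0, \qquad \sum (x - \bar{x})^2 = 1368.4 \]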
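Standard Deviation
Population s.d.:
\[ \sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{N}} \]
Sample s.d.:
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]
Sample s.d. of previous sample (n = 10):
\[ s = \sqrt{\frac{1368.4}{10 - 1}} = 12.33 \]
Sample variance:
\[ s^2 = \frac{1368.4}{10 - 1} = 152.04 \]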
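Standard Deviation
\[ s = \sqrt{\frac{\sum x^2 - (\sum x)^2 / n}{n - 1}} \]
   x      x^2
  32     1024
  71     5041
  64     4096
  50     2500
  48     2304
  63     3969
  38     1444
  41     1681
  47     2209
  52     2704
\[ \sum x = 506, \qquad \sum x^2 = 26972 \]
\[ \sum x^2 - \frac{(\sum x)^2}{n} = 1368.4, \qquad s = \sqrt{\frac{1368.4}{10 - 1}} = 12.33 \]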
Standard Deviation
[Figure: two data sets with \( \bar{x} = 3 \); in one the s.d. decreases, in the other the s.d. increases.]
Question: What minimizes the sum of absolute deviations?
Answer: the median.
\[ \text{Coefficient of Variation } (CV) = \frac{s}{\bar{x}} \times 100\% \]
This is not exactly ordinal either: does the scale then fall only
at the ordinal level of measurement? If so, the seven numbers
along the scale would indicate only a rank ordering from the
least favorable attitude to the most favorable. But there is
probably more information in the numbers than that. A
two-point difference between scores probably signifies a greater
difference in favorability than a one-point difference.
\[ IQR = Q_3 - Q_1 \]
where
Q1 = lower quartile = median of lower half of the ordered data values
Q3 = upper quartile = median of upper half of the ordered data values
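A minimal sketch of the "median of each half" rule is given below. The slides do not say how to treat the overall median when n is odd; this sketch assumes it is left out of both halves, and the function name `quartiles_by_halves` is illustrative. Other software may use different quartile conventions.

```python
from statistics import median

def quartiles_by_halves(values):
    """Q1 and Q3 as medians of the lower and upper halves of the ordered data.

    Assumption: for odd n the middle value is excluded from both halves.
    Requires at least two values.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[-half:]
    return median(lower), median(upper)

# The ten values from the first stem-and-leaf display.
data = [78, 65, 90, 86, 79, 51, 79, 62, 84, 101]
q1, q3 = quartiles_by_halves(data)
iqr = q3 - q1
print(q1, q3, iqr)
```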
  X     X − X̄                   (X − X̄)^2
  32    32 − 50.6 = −18.6       345.96
  71    71 − 50.6 = +20.4       416.16
  64    64 − 50.6 = +13.4       179.56
  50    50 − 50.6 = −0.6          0.36
  48    48 − 50.6 = −2.6          6.76
  63    63 − 50.6 = +12.4       153.76
  38    38 − 50.6 = −12.6       158.76
  41    41 − 50.6 = −9.6         92.16
  47    47 − 50.6 = −3.6         12.96
  52    52 − 50.6 = +1.4          1.96
\[ \sum X = 506, \qquad \sum (X - \bar{X})^2 = 1368.40 \]
\[ \bar{X} = \frac{\sum X}{n} = \frac{506}{10} = 50.6, \qquad S_X^2 = \frac{\sum (X - \bar{X})^2}{n} = \frac{1368.40}{10} = 136.84 \]
Sample s.d.:
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{1368.40}{10 - 1}} = \sqrt{152.04} = 12.33 \]
  X      X^2
  32    1,024
  71    5,041
  64    4,096
  50    2,500
  48    2,304
  63    3,969
  38    1,444
  41    1,681
  47    2,209
  52    2,704
\[ \sum X = 506, \qquad \sum X^2 = 26{,}972 \]
Actually, \( \sum (x - \bar{x})^2 = \sum x^2 - (\sum x)^2 / n \) (Try to prove)
Therefore,
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum x^2 - (\sum x)^2 / n}{n - 1}} \]
Sample s.d.:
\[ s = \sqrt{\frac{26972 - 506^2 / 10}{10 - 1}} = \sqrt{152.04} = 12.33 \]
Same result as before.
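Both routes to the sample standard deviation can be checked numerically. The sketch below uses the ten values from the tables and compares the result with `statistics.stdev`, which also divides by n − 1; the variable names are illustrative.

```python
from math import sqrt
from statistics import stdev

x = [32, 71, 64, 50, 48, 63, 38, 41, 47, 52]
n = len(x)
mean_x = sum(x) / n

# Definition: sum of squared deviations from the mean.
ss_definition = sum((v - mean_x) ** 2 for v in x)

# Computing formula: sum of squares minus (sum)^2 / n.
ss_shortcut = sum(v * v for v in x) - sum(x) ** 2 / n

s = sqrt(ss_shortcut / (n - 1))
print(round(ss_definition, 2), round(ss_shortcut, 2), round(s, 2))
```

The computing formula needs only the two running totals (sum of x and sum of x squared), which is convenient for hand calculation.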
3. Skewness, Kurtosis
where
\[ m_2 \text{ (the second sample moment about mean)} = \frac{\sum (x - \bar{x})^2}{n} \]
\[ m_3 \text{ (the third sample moment about mean)} = \frac{\sum (x - \bar{x})^3}{n} \]
Note: the cubic function magnifies the deviations.
The value of this measure generally lies between −3 and +3. The
closer the value lies to −3, the more the distribution is skewed left
(negatively skewed). The closer the value lies to +3, the more the
distribution is skewed right (positively skewed). A value close to 0
indicates a symmetric distribution.
A normal distribution is symmetric and has skewness of 0.
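The skewness formula itself did not survive on these slides. The sketch below assumes the common moment-based definition, skewness = m3 / m2^(3/2), built from the sample moments m2 and m3 defined earlier; the function names are illustrative.

```python
def sample_moment(values, k):
    """k-th sample moment about the mean (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** k for v in values) / n

def moment_skewness(values):
    """Assumed definition: m3 / m2**1.5."""
    return sample_moment(values, 3) / sample_moment(values, 2) ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 1, 10]   # long tail to the right
left_skewed = [-10, -1, -1, -1, -1]  # mirror image, long tail to the left

print(moment_skewness(symmetric))
print(moment_skewness(right_skewed))
print(moment_skewness(left_skewed))
```

The sign behaves as described above: zero for symmetric data, positive for a right tail, negative for a left tail.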
There are other, simpler measures of skewness:
1 Pearson mode skewness or first skewness coefficient
\[ \text{skewness} = \frac{\text{mean} - \text{mode}}{\text{standard deviation}} \]
[Figure: three frequency curves over Scores. (a) Skewed negatively (to the left): \( \bar{X} \) < Mdn < Mo. (b) Skewed positively (to the right): Mo < Mdn < \( \bar{X} \). (c) Normal distribution: \( \bar{X} \), Mdn, and Mo coincide.]
FIGURE 4.3 \( \bar{X} \), Mdn, and Mo in skewed distributions and in the normal distribution.
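Note: kurtosis is actually measuring the tails.
where
\[ m_4 \text{ (the fourth sample moment about mean)} = \frac{\sum (x - \bar{x})^4}{n} \]
\[ m_2 \text{ (the second sample moment about mean)} = \frac{\sum (x - \bar{x})^2}{n} \]
Excess kurtosis is defined as the kurtosis minus 3, i.e.,
\[ \text{excess kurtosis} = \text{kurtosis} - 3 \]
where 3 is the kurtosis of the normal distribution.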
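The kurtosis formula is missing from the extracted slide. The sketch below assumes the usual moment ratio, kurtosis = m4 / m2^2, with excess kurtosis obtained by subtracting 3 as stated above; the function names are illustrative.

```python
def sample_moment(values, k):
    """k-th sample moment about the mean (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** k for v in values) / n

def kurtosis(values):
    """Assumed definition: m4 / m2**2."""
    return sample_moment(values, 4) / sample_moment(values, 2) ** 2

def excess_kurtosis(values):
    """Kurtosis minus 3, the kurtosis of the normal distribution."""
    return kurtosis(values) - 3

two_point = [-1, 1, -1, 1]            # no tails at all
heavy_tailed = [-10, 0, 0, 0, 0, 0, 0, 10]  # most mass at the center, two far values

print(kurtosis(two_point), excess_kurtosis(two_point))
print(kurtosis(heavy_tailed), excess_kurtosis(heavy_tailed))
```

The two-point data set has the smallest possible kurtosis (1), while the data set with two far-out values has a positive excess kurtosis.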
Several well-known, uni-modal and symmetric distributions from different parametric
families are compared here.
Each has a mean and skewness of zero.
The parameters have been chosen to result in a variance equal to 1 in each case.
N denotes the standard normal curve, with excess kurtosis equal to 0.
Generally, if a distribution has a greater excess kurtosis, it has a higher peak and thicker
tails, compared to another distribution of the same kind.
4. Outlier
An outlier is a data point that is not consistent with the bulk of the
data.
Look for them via graphs.
Can have a big influence on conclusions.
Can cause complications in some statistical analyses.
Cannot be discarded without justification.
The general guideline used in a box plot to identify an outlier is:
If an observation is outside the range
[Q1 − 1.5 IQR, Q3 + 1.5 IQR], then it is regarded as an outlier.
Mild outlier: beyond 1.5 IQR from the quartiles.
Extreme outlier: beyond 3 IQR from the quartiles.
Question
Find the outliers in the following data of the ages of actresses at the
time they first won the Oscar:
21, 24, 25, 26, 26, 26, 26, 27, 28, 30, 30, 31, 31,
33, 33, 33, 34, 34, 34, 34, 35, 35, 35, 37, 37, 38,
39, 41, 41, 41, 42, 44, 49, 50, 60, 61, 61, 74, 80
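One way to work through this question is sketched below. It assumes quartiles are the medians of the lower and upper halves (with the overall median left out when n is odd) and uses the 1.5 IQR and 3 IQR fences from the guideline; the variable names are illustrative, and other quartile conventions could give slightly different fences.

```python
from statistics import median

ages = [21, 24, 25, 26, 26, 26, 26, 27, 28, 30, 30, 31, 31,
        33, 33, 33, 34, 34, 34, 34, 35, 35, 35, 37, 37, 38,
        39, 41, 41, 41, 42, 44, 49, 50, 60, 61, 61, 74, 80]

data = sorted(ages)
half = len(data) // 2
q1 = median(data[:half])    # median of the lower half
q3 = median(data[-half:])   # median of the upper half
iqr = q3 - q1

# Observations outside [Q1 - 1.5 IQR, Q3 + 1.5 IQR] are flagged as outliers.
outliers = [a for a in data if a < q1 - 1.5 * iqr or a > q3 + 1.5 * iqr]
# Observations outside [Q1 - 3 IQR, Q3 + 3 IQR] are flagged as extreme.
extreme = [a for a in data if a < q1 - 3 * iqr or a > q3 + 3 * iqr]

print(q1, q3, iqr)
print(outliers)
print(extreme)
```

Under these assumptions the fences are 13.5 and 57.5, so only the largest ages are flagged.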
5. Coefficient of Variation
In finance, CV measures the relative risk of a stock portfolio.
Assume portfolio A has a collection of stocks that average a
12% return with a standard deviation of 3% and portfolio B has
an average return of 6% with a standard deviation of 2%.
We can compute the CV values for each as follows:
\[ CV(A) = \frac{3}{12} (100\%) = 25\% \]
\[ CV(B) = \frac{2}{6} (100\%) = 33\% \]
Even though portfolio B has a lower standard deviation, it would
be considered riskier than portfolio A because B’s CV is 33%
and A’s CV is 25%.
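The portfolio comparison is simple arithmetic and can be sketched as follows; the function name `coefficient_of_variation` is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """CV as a percentage: standard deviation relative to the mean."""
    return sd / mean * 100

cv_a = coefficient_of_variation(sd=3, mean=12)  # portfolio A
cv_b = coefficient_of_variation(sd=2, mean=6)   # portfolio B

print(f"CV(A) = {cv_a:.0f}%, CV(B) = {cv_b:.0f}%")
riskier = "B" if cv_b > cv_a else "A"
```

Because the CV is unit-free, it allows the two portfolios to be compared even though their average returns differ.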
Normal Distribution and Other Statistics
Normal Distribution
Normal Distribution (or Gaussian Distribution). "Distribution" means the overall pattern of how often the
possible values occur.
Characteristics:
A random variable X following a normal distribution with mean \( \mu \)
and variance \( \sigma^2 \) is denoted as \( X \sim N(\mu, \sigma^2) \), where N stands for the normal distribution.
The probability density function (pdf) of X is
\[ f(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2}, \qquad -\infty < x < \infty, \]
where e = a mathematical constant (approximately 2.718282)
\( \mu \) = mean of the random variable X
\( \sigma \) = standard deviation of the random variable X
\( \pi \) = a mathematical constant (approximately 3.141593)
The Z-transformation for X gives the standard normal
distribution:
\[ Z = \frac{X - \mu}{\sigma} \sim N(0, 1) \]
Aim: One size fits all
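The density and the Z-transformation can be written directly from the formulas. In this sketch the function names `normal_pdf` and `z_score` are illustrative, and the result is cross-checked against `statistics.NormalDist` from the Python standard library.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def z_score(x, mu, sigma):
    """Z-transformation: number of standard deviations x lies from mu."""
    return (x - mu) / sigma

# Height of the standard normal curve at its peak.
peak = normal_pdf(0, 0, 1)

# Example values: an IQ-like scale with mean 100 and standard deviation 15.
density_110 = normal_pdf(110, 100, 15)
library_110 = NormalDist(mu=100, sigma=15).pdf(110)
z_136 = z_score(136, 100, 15)
```

Standardizing first means a single standard normal table serves every normal distribution, which is the "one size fits all" aim.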
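[Figure: standard normal curve with 0 and z marked on the horizontal axis, shown alongside part of the standard normal table.]
(This is only a part of the table; refer to the Appendix – Standard Normal Table for details.)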
Percentile
[Figure: two frequency distributions, (a) and (b), over scores from 60 to 110, with percentile ranks P0, P20, P40, P60, P80, and P100 marked along the score axis.]
FIGURE 5.9 Comparative location of percentile ranks in two distributions. The area under the curve
in each segment equals 10% of the whole, representing 10% of the scores (see Section 3.4).
Normal Distribution
Many data have distributions that are unimodal, almost
symmetric, and tailing off towards both sides.
Normal curve: a bell-shaped curve as an approximation to
many such data distributions.
[Figure: normal curve \( N(45, 2.25^2) \), with a distance of 2.25 marked on either side of the mean.]
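Characteristics of Normal Curves
Area = 1
[Figure: normal curve with the central 68% and 95% areas marked.]
\[ Z = \frac{X - \mu}{\sigma}, \qquad X = \mu + Z \times \sigma \]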
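Example
Human IQ scores: \( \mu = 100 \), \( \sigma = 15 \)
An IQ score of 136 is
\[ Z = \frac{136 - 100}{15} = 2.4 \]
standard deviations above the mean.
An IQ score of 80 is
\[ Z = \frac{80 - 100}{15} = -1.33, \]
i.e. 1.33 standard deviations below the mean.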
Standard Normal Distribution
\( N(0, 1) \)
\[ P(\text{between } a \text{ and } b) = \Phi(b) - \Phi(a) \]
\[ \Phi(-z) = 1 - \Phi(z), \qquad \text{e.g. } \Phi(-1) = 1 - \Phi(1) \]
Standard Normal Distribution Table
IQ scores: \( N(100, 15^2) \)
P(less than 80) = 9.18%
P(less than 110) = 74.86%
Normal scores:
\[ Z = \frac{80 - 100}{15} \approx -1.33, \qquad Z = \frac{110 - 100}{15} \approx 0.67 \]
\( \Phi(0.67) = 0.7486 \)
\( \Phi(1.33) = 0.9082 \)
\( \Phi(-1.33) = 1 - 0.9082 = 0.0918 \)
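The table look-ups for the IQ example can be reproduced with `statistics.NormalDist`. To match the table, this sketch rounds each z-score to two decimals before evaluating the standard normal cdf; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 100, 15
phi = NormalDist().cdf  # standard normal cdf, the quantity tabulated in the table

z_80 = round((80 - mu) / sigma, 2)    # rounded to 2 decimals, as for a table look-up
z_110 = round((110 - mu) / sigma, 2)

p_less_80 = phi(z_80)
p_less_110 = phi(z_110)

print(z_80, round(p_less_80, 4))
print(z_110, round(p_less_110, 4))
```

Without the rounding step the probabilities differ slightly in the third or fourth decimal, which is the usual gap between table values and software values.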
Standard Normal Distribution Table
Gasoline use (in mpg): \( N(30.5, 4.5^2) \)
P(25 < X < 40) = 87.14%
Normal scores:
\[ \frac{25 - 30.5}{4.5} = -1.22, \qquad \frac{40 - 30.5}{4.5} = 2.11 \]
\( \Phi(1.22) = 0.8888 \), \( \Phi(2.11) = 0.9826 \)
\[ \Phi(2.11) - \Phi(-1.22) = 0.9826 - 0.1112 = 0.8714 \]
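The interval probability for the gasoline example can be sketched in two ways: mimicking the table (z-scores rounded to two decimals, cdf values rounded to four) and computing directly without rounding. The variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 30.5, 4.5
phi = NormalDist().cdf  # standard normal cdf

# Table-style calculation: round z to 2 decimals and cdf values to 4 decimals.
z_low = round((25 - mu) / sigma, 2)
z_high = round((40 - mu) / sigma, 2)
p_table = round(phi(z_high), 4) - round(phi(z_low), 4)

# Direct calculation with no intermediate rounding.
mpg = NormalDist(mu, sigma)
p_exact = mpg.cdf(40) - mpg.cdf(25)

print(z_low, z_high, round(p_table, 4))
print(round(p_exact, 4))
```

The two answers agree to about three decimal places; the small gap comes only from rounding the z-scores.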
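Standard Normal Distribution Table
[Figure: standard normal curve with areas measured from 0: A2 = 0.3888 between −1.22 and 0, and A1 = 0.4826 between 0 and 2.11.]
kth percentile: k% of the distribution lies below it and (100 – k)% lies above it.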
Standard Normal Distribution Table
Salaries of MBA graduates (in $1000): \( N(45, 2.25^2) \)
95th percentile = ?
\[ P(Z < c) = 0.95, \qquad c \approx 1.645 \]
\[ \text{95th percentile} = 45 + 1.645 \times 2.25 = 48.7 \]
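The percentile calculation runs the table in reverse. A short sketch using `statistics.NormalDist.inv_cdf` is shown below; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 45, 2.25  # salaries in $1000

# c such that P(Z < c) = 0.95 for the standard normal distribution.
c = NormalDist().inv_cdf(0.95)

# Convert back to the salary scale: X = mu + Z * sigma.
p95 = mu + c * sigma

# The same value obtained directly from N(45, 2.25^2).
p95_direct = NormalDist(mu, sigma).inv_cdf(0.95)

print(round(c, 3), round(p95, 1))
```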
VaR (Value-at-Risk)
Suppose an investment has the following historical loss distribution:
[Figure: histogram titled "Loss Distribution". Horizontal axis: Loss Amount (in million dollars), 0 to 90. Vertical axis: Frequency, 0 to 60.]
z-score
The standardized score or z-score is a useful measure of the relative value
of any observation in a data set:
\[ z = \frac{\text{Observed Value} - \text{Mean}}{\text{Standard Deviation}} = \frac{x - \mu}{\sigma} \]
[Figure: normal curve of relative frequency, with nested central regions containing 68%, 95%, and 99.7% of cases.]
FIGURE 5.7 Relative frequencies of cases contained within certain limits in the normal distribution.
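The 68%, 95%, and 99.7% figures correspond to z-scores within 1, 2, and 3 standard deviations of the mean. A quick check with `statistics.NormalDist` is sketched below; the dictionary name `within` is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # N(0, 1)

# Proportion of cases with z-score between -k and +k, for k = 1, 2, 3.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in within.items():
    print(f"within {k} s.d.: {share:.1%}")
```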