
STAT1600B

Statistics: Ideas and Concepts


2017-2018 (2nd Semester)

Department of Statistics and Actuarial Science


The University of Hong Kong

Chapter 1:
Descriptive Statistics

Chung, Li (SAAS , HKU ) STAT1600B Statistics: Ideas and Concepts 2017-2018 (Sem 2) Ch 1 1 / 80
What is Statistics?

Outline

1 What is Statistics?

2 Descriptive Statistics – Graphical Summaries

3 Descriptive Statistics – Numerical Summaries

4 Normal Distribution and Other Statistics


What is Statistics?

Statistics is a collection of procedures and principles for gathering data and analyzing information in order to help people make decisions when faced with uncertainty.
Examples:
Does Aspirin reduce heart attack rates?
Does the Internet increase loneliness and depression?
Data are collected and used to make a judgment about a
situation.


Some Basic Statistical Terms

An observation is an individual entity in a study.
A variable is a characteristic that may differ among individuals.
Sample data are collected from a subset of a larger population.
Population data are collected when all individuals in a
population are measured.
A statistic is a summary measure of sample data.
A parameter is a summary measure of population data.


Types of Variables
Raw data from categorical variables consist of group or category names that don’t necessarily have a logical ordering. Such variables are measured on a nominal scale.
Examples: eye color, country of residence.
Categorical variables for which the categories have a logical ordering are called ordinal variables.
Examples: highest educational degree earned, tee shirt size (S, M, L, XL), Likert scale responses.
Raw data from quantitative variables consist of numerical values taken on each individual.
Examples: height, number of siblings.
Quantitative variables are measured on an interval scale or a ratio scale; a ratio scale has a meaningful zero.
Descriptive Statistics – Graphical Summaries


Graphical Summaries

Graphs or tables are used to visually display the data.


The graphical summaries that we are going to learn are:
Frequency Table
Pie Chart
Bar Chart
Box Plot
Side-by-Side Box Plot
Stem-and-Leaf Plot
Dot Plot
Histogram


Different Graphical Summaries for Different Types of Data

Not all graphical summaries are necessary when describing a set of data.
So which type of graphical summary should we use?
It depends on the type of data. Different graphical summaries are used for different types of data.
Basically,
frequency tables, pie charts, and bar charts are used for categorical variables.
box plots and histograms are used for quantitative variables.
side-by-side box plots are used for the combination of 1 quantitative variable and 1 categorical variable.


Frequency Table – One Categorical Variable


2003 nationwide survey of American HS students “How often do you wear a seat
belt when driving a car?”
Total sample size n = 3042 students.

A majority, 1686/3042 = 55.4%, said they always wear a seat belt, while
115/3042 = 3.8%, said they never wear a seat belt.
Rarely or never: 8.2% + 3.8% = 12%

Frequency Table – Rounding Error

TABLE 10.1 Education of people 25 years and over, 2006

Level of education          Number of persons (thousands)   Percent
Less than high school       27,896                          14.5
High school graduate        60,898                          31.7
Some college, no degree     32,611                          17.0
Associate’s degree          16,760                           8.7
Bachelor’s degree           35,153                          18.3
Advanced degree             18,567                           9.7
Total                       191,884                         100.0
Source: Census Bureau, Educational Attainment in the United States: 2006.
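
The rounding error in Table 10.1 can be checked directly. The following is a minimal sketch, not part of the original slides: it recomputes each percentage from the printed counts and then adds up the rounded figures. The variable names are illustrative.

```python
# Counts (in thousands) copied from Table 10.1.
counts = {
    "Less than high school": 27896,
    "High school graduate": 60898,
    "Some college, no degree": 32611,
    "Associate's degree": 16760,
    "Bachelor's degree": 35153,
    "Advanced degree": 18567,
}
reported_total = 191884  # total printed in the table

# Percent of the reported total, rounded to one decimal place as in the table.
percents = {level: round(100 * n / reported_total, 1) for level, n in counts.items()}

sum_of_counts = sum(counts.values())
sum_of_percents = round(sum(percents.values()), 1)

for level, pct in percents.items():
    print(f"{level:<25} {pct:5.1f}%")
print("Sum of counts:", sum_of_counts)
print("Sum of rounded percents:", sum_of_percents)
```

Each entry is rounded separately, so neither the counts nor the rounded percentages have to add up exactly to the printed totals.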


Frequency Table – Two Categorical Variables


Are females more likely to say they always wear a seat belt? Are males more likely to say they rarely or never wear one? A two-way frequency table of this kind is called a contingency table.

Females: 915/1467 = 62.4% said always wear seat belt.


Males: 771/1575 = 49.0% said always wear seat belt.
Males: 10.5% + 5.7% = 16.2% rarely or never wear one.
Females: 5.7% + 1.7% = 7.4% rarely or never wear one.
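
The conditional percentages above are row percentages: each count is divided by the total for its own sex, not by the overall sample size. A small sketch using only the counts quoted on these slides (the dictionary layout is an illustrative choice):

```python
# Counts quoted on the slides: students who said "always", and row totals by sex.
always = {"Female": 915, "Male": 771}
totals = {"Female": 1467, "Male": 1575}

# Row percentage: share of each sex that always wears a seat belt.
row_percent = {sex: round(100 * always[sex] / totals[sex], 1) for sex in always}

# Overall percentage, ignoring sex (matches the one-variable table).
overall_percent = round(100 * sum(always.values()) / sum(totals.values()), 1)

print(row_percent)
print(overall_percent)
```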


Frequency Table – Two Categorical Variables

Nightlights and Nearsightedness: survey of n = 479 children. Those who slept with a nightlight or in a fully lit room before age 2 had a higher incidence of nearsightedness (myopia) later in childhood.

Note: The study does not prove that sleeping with a light actually caused myopia in more children.


Pie Chart & Bar Chart – One Categorical Variable


Survey of n = 190 college students. “Randomly pick a number
between 1 and 10.”

Results: Most chose 7, very few chose 1 or 10.



Bar Chart – Two or More Categorical Variables


Revisit the Nightlights and Nearsightedness Survey of n = 479 children.
Response Variable: Degree of Myopia
Explanatory Variable: Amount of Sleep time Lighting.


Box Plot – One Quantitative Variable


Box covers the middle 50% of the data, from lower quartile (median of lower half of the ordered
data values) to upper quartile (median of upper half of the ordered data values).
Line within box marks the median (middle value in the data).
Possible outliers are marked with an asterisk.
Apart from outliers, the lines extending from the box reach to the minimum and maximum values.
Gives a good visual display of the spread of the data.
Also good for identifying outliers.

draw box plot


box plot to histogram


Side-by-Side Box Plot – One Quantitative Variable and One Categorical Variable

A side-by-side box plot displays two single box plots on the same graph.
Good for comparing different groups.


Stem-and-Leaf Plot – One Quantitative Variable


In a stem-and-leaf plot, every individual data value is shown.
This plot is a quick way to summarize small data sets and is also useful for ordering the data from lowest to highest.
A row in the plot starts with a “stem” and each stem gives the first part of a data value.
A value within a row is called a “leaf” and it gives information about the last part of a data value.
Good for sorting the data.
State the unit of the stem and leaf and the size of the data set.
We can see both the distribution and the raw data.
Not suitable for large data sets.

Stem-and-Leaf Display
Data: 78 65 90 86 79 51 79 62 84 101
N = 10, Leaf unit = 1

Stem | Leaf
   5 | 1
   6 | 25
   7 | 899
   8 | 46
   9 | 0
  10 | 1

Stem-and-Leaf Display (one line per stem)
N = 15, Leaf unit = 1

Stem | Leaf
   3 | 8
   4 | 0014556779
   5 | 0138

Stem-and-Leaf Display (two lines per stem: * for leaves 0 – 4, + for leaves 5 – 9)
N = 15, Leaf unit = 1

Stem | Leaf
  3+ | 8
  4* | 0014
  4+ | 556779
  5* | 013
  5+ | 8
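
The first display can be reproduced mechanically. This is a minimal sketch, assuming non-negative integer data and one line per stem; the function name `stem_and_leaf` and its `leaf_unit` parameter are illustrative and do not come from the slides.

```python
from collections import defaultdict


def stem_and_leaf(data, leaf_unit=1):
    """Return {stem: string of sorted leaves} for non-negative integers."""
    rows = defaultdict(list)
    for value in sorted(data):
        scaled = value // leaf_unit       # drop digits below the leaf unit
        stem, leaf = divmod(scaled, 10)   # the last remaining digit is the leaf
        rows[stem].append(str(leaf))
    # Include empty stems so that gaps in the data stay visible.
    return {s: "".join(rows.get(s, [])) for s in range(min(rows), max(rows) + 1)}


data = [78, 65, 90, 86, 79, 51, 79, 62, 84, 101]
display = stem_and_leaf(data)
for stem, leaves in display.items():
    print(f"{stem:>3} | {leaves}")
```

Because the values are sorted before they are split, the leaves in each row come out in ascending order, which is why the plot is also a convenient way to sort a small data set.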

Dot Plot – One Quantitative Variable


The horizontal axis in a dot plot covers the range from the smallest to the largest data value.
A dot is placed above the number line at the observation’s data value.
When there are multiple observations with the same value, the dots are stacked vertically.
The plot presents all the individual data values and is easy to create.


Histogram – One Quantitative Variable


Step 1: Decide how many equally spaced (same width) intervals to use for the horizontal
axis. Between 6 and 15 intervals is a good number.
Step 2: Decide to use frequencies (count) or relative frequencies (proportion) on the
vertical axis.
Step 3: Draw equally spaced intervals on horizontal axis covering entire range of the
data. Determine frequency or relative frequency of data values in each interval and draw
a bar with corresponding height. Decide rule to use for values that fall on the border
between two intervals.
Good for illustrating the shape of the distribution.

Histogram: area represents frequency.
With unequal bin sizes, the shape of the distribution is totally distorted if we use height to represent frequency. Instead, plot the density on the vertical axis, so that the area of each bar equals its relative frequency:

$$\text{density} = \frac{\text{frequency} / \text{data size}}{\text{width}} = \frac{\text{relative frequency}}{\text{width}}$$
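
The density formula can be illustrated with unequal bin widths. The sketch below uses a small hypothetical data set and hypothetical bin edges (neither is from the slides) and checks that the bar areas, density times width, add up to 1.

```python
def density_histogram(data, edges):
    """Return a list of (frequency, relative frequency, density), one per bin.

    Bins are [edges[i], edges[i+1]); the last bin also includes its right edge.
    """
    n = len(data)
    last = len(edges) - 2
    rows = []
    for i in range(len(edges) - 1):
        lo, hi = edges[i], edges[i + 1]
        freq = sum(1 for x in data if lo <= x < hi or (i == last and x == hi))
        rel = freq / n
        rows.append((freq, rel, rel / (hi - lo)))  # density = relative frequency / width
    return rows


data = [1, 2, 2, 3, 4, 6, 7, 9, 12, 18]  # hypothetical values
edges = [0, 5, 10, 20]                   # hypothetical bins of width 5, 5, and 10

rows = density_histogram(data, edges)
frequencies = [freq for freq, _, _ in rows]
total_area = sum(dens * (edges[i + 1] - edges[i]) for i, (_, _, dens) in enumerate(rows))

for (freq, rel, dens), lo, hi in zip(rows, edges, edges[1:]):
    print(f"[{lo}, {hi}): frequency={freq}, relative frequency={rel:.2f}, density={dens:.3f}")
```

The widest bin holds 2 of the 10 values, yet its bar is the shortest, because its relative frequency is spread over twice the width of the other bins.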

Histogram vs Bar Chart


Histograms and bar charts look very similar. What’s different about them?
With bar charts, each column represents a group defined by a categorical variable.

[Figure: bar chart of per capita income (vertical axis from $12,000 to $36,000) for New Jersey, New Hampshire, New York, and New Mexico.]

With histograms, each column represents a group defined by a quantitative variable.

[Figure: histogram of per capita income (vertical axis from $10,000 to $40,000) for the age groups 25-34, 35-44, 45-54, 55-64, and 65-74.]



One implication of this distinction: it is always appropriate to talk about the skewness of a histogram; that is, the tendency of the observations to fall more on the low end or the high end of the x-axis.
With bar charts, however, the x-axis does not have a low end or a high end, because the labels on the x-axis are categorical, not quantitative. As a result, it is less appropriate to comment on the skewness of a bar chart.


Pros and Cons of the Four Visual Displays for Quantitative Variables

Box plots, stem-and-leaf plots, dot plots, and histograms organize quantitative data in ways that let us begin to find the information in a data set.
As to the question of which type of display is the best, there is no unique answer.
The answer depends on what feature of the data may be of interest and, to a certain degree, on the sample size.


Box plot
Strength:
Gives a direct look at central location and spread, as it displays the five-number summary.
Can identify outliers.
Side-by-side box plot is an excellent tool for comparing two or more groups.
Weakness:
Not entirely useful for judging shape.
Cannot distinguish between bell-shaped and bimodal distributions.


Stem-and-Leaf plot
Strength:
Excellent for sorting data.
With a sufficient sample size, it can be used to judge shape.
Weakness:
With a large sample size, a stem-and-leaf plot may be too
cluttered because the display shows all individual data values.
More restricted in the choices for “intervals” when compared to
histograms.


Dot plot
Strength:
Can present all individual data values.
Easy to create.
Weakness:
With a large sample size, a dot plot may be too cluttered.

Histogram
Strength:
Excellent for judging the shape of a data set with moderate or
large sample sizes.
Flexible in choosing the number as well as the width of the intervals for the display.
Between 6 and 15 intervals usually gives a good picture of the
shape.
Weakness:
With a small sample size, a histogram may not “fill in”
sufficiently well to show the shape of the data.
With either too few intervals or too many, we may not see the
true shape of the data.

Misleading Graphs
Statistics can be misleading if not presented appropriately.
The same data can appear very different when graphed.
Suppose, for example, that two of your classmates are instructed by your history professor to construct a graph of the number of men and women who scored in the top half of the class on the history exam.
Note: a 3D effect also produces a representation that is not proportional to the quantities.

[Figure 3.12: two bar charts of the number of people scoring in the top half of the history exam, for men and women. (a) Correct: vertical axis from 0 to 15. (b) Incorrect: vertical axis from 10 to 15.]

FIGURE 3.12 Two bar diagrams showing the same results using different scales for frequency. (a) The left graph shows the correct proportional relationship between men and women. (b) In the right graph, putting a break in the vertical axis results in an incorrect proportional relationship.


Both students construct bar diagrams with the same relative scale for height and width (see Figure 3.12), but guess which one wishes to convey that the women were far superior to the men?
You will sometimes see graphs with a break in the vertical axis
as in Figure 3.12. This is not appropriate in this case. Frequency
on the vertical axis should be continuous from zero. When we
put a break in the axis, we lose the proportional relationship
among class interval frequencies.
In Figure 3.12 (b), for example, it incorrectly appears as if
women did over twice as well as men. Figure 3.12 (a) shows the
correct proportional relationship.

Lying with Statistical Graphics
[Figure: two pie charts of the market penetration of four brands of cigarette (A 27%, B 37%, C 18%, D 18%), drawn in different orientations.]
[Figure: two bar charts of the monthly sales of the four brands A, B, C, D (in $m), drawn on different vertical axes.]

Bad Graphic Designs
Representation not proportional to quantities.
Misleading alignment.
[Figure: percentage of college enrolment aged 25 and over.]

Shape of Frequency Distributions

What is the pattern of the distribution of scores over the range of possible values? Are most of the scores in the middle, at one end, or clustered in two distinct locations?
Certain shapes of frequency distributions occur with enough regularity in statistical work that they have names. The names effectively summarize the general characteristics of the distribution.



We illustrate several of them in Figure 3.13.

[Figure 3.13, panels (a) to (f): frequency (vertical axis) against scores (horizontal axis) for (a) J-shaped, (b) positively skewed, (c) negatively skewed, (d) rectangular, (e) bimodal, and (f) bell-shaped distributions.]

FIGURE 3.13 Shapes of some distributions that occur in statistical work.

Distribution name        Can result from
(a) J-shaped             plotting the speeds at which automobiles go through an intersection where a stop sign is present.
(b) Positively skewed    a test that is too difficult for most of the group taking it.
(c) Negatively skewed    a test that is too easy for most of the group taking it.
(d) Rectangular          an equal number of cases in all class intervals.
(e) Bimodal              measuring strength of grip in a group that contains both men and women.
(f) Bell-shaped          plotting a histogram of women’s heights in Hong Kong.

Bell-Shaped Distributions

Many measurements follow a predictable pattern:
Most individuals are clumped around the center.
The greater the distance a value is from the center, the fewer individuals have that value.
Variables that follow such a pattern are said to be “bell-shaped”.
A special case is called a normal distribution or normal curve.

Data: representative sample of 199 married British couples.
The figure shows a histogram of the wives’ heights with a normal curve superimposed. The mean height = 1602 millimeters.

Descriptive Statistics – Numerical Summaries


Summarizing Data

Suppose we are given the following sample data. How are we going to summarize it?
n = 6, x1 = 11.6, x2 = 7.2, x3 = -3.1, x4 = 4.6, x5 = -7.7, x6 = 5.4
Apart from graphical summaries, we can also use numerical
summaries.


Numerical Summaries

1 Measures of Central Location: Mean, Median, Mode


2 Measures of Variability: Standard Deviation, Variance, Range,
Interquartile Range
3 Measures of Shape: Skewness, Kurtosis
4 Outlier
5 Other Measure: Coefficient of Variation


Distinguish between Central Location, Variability, Shape


[Figure 5.1, three panels of frequency polygons: (a) equal means, unequal variability; (b) equal variability, unequal means; (c) equal variability, equal means, different shapes.]

FIGURE 5.1 Differences in central tendency, variability, and shape of frequency distributions (polygons).


1. Mean, Median, Mode

Measures of Central Location (the center (or average) of the data): Mean, Median, Mode
Mean is the arithmetic average: $\bar{x} = \frac{\sum x}{n}$
Median is the middle value in the data
Mode is the value in the data set that occurs most frequently


Mean, Median, Mode

For example, in this given set of data, n = 6, x1 = 11.6, x2 = 7.2, x3 = -3.1, x4 = 4.6, x5 = -7.7, x6 = 5.4.
Mean: $\bar{x} = \frac{11.6 + 7.2 - 3.1 + 4.6 - 7.7 + 5.4}{6} = \frac{18}{6} = 3$
Median: First rearrange the data in ascending/descending order.
Ascending order: -7.7, -3.1, 4.6, 5.4, 7.2, 11.6
There are six values, so the median is the average of the 3rd and 4th values.
$\text{median} = \frac{4.6 + 5.4}{2} = 5$
Mode: Since there are no repeated values, the mode is not applicable in this case.
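
The same three summaries can be checked with Python's standard `statistics` module. This sketch uses the signed data values as listed on the later "Measures of Central Location" slide (x3 = -3.1 and x5 = -7.7). Treating "no repeated value" as "no mode" is the convention used on this slide, implemented here by hand.

```python
import math
import statistics
from collections import Counter

data = [11.6, 7.2, -3.1, 4.6, -7.7, 5.4]

mean = statistics.mean(data)      # arithmetic average
median = statistics.median(data)  # average of the two middle values when n is even

# Mode: the most frequent value(s); report no mode if every value occurs once.
counts = Counter(data)
top = max(counts.values())
modes = [value for value, count in counts.items() if count == top] if top > 1 else []

print(f"mean = {mean:.1f}, median = {median:.1f}, modes = {modes}")
```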



Question 1
Find the mean, median, mode in this data set: n = 8, x1 = 3, x2 = 1,
x3 = 0, x4 = 7, x5 = 0, x6 = 2, x7 = 9, x8 = 0.



Question 2
Five integers, each less than 9, have the following properties. Find ALL five integers.
mean = 5, mode = 3, median = 4.


Mean as the Balance Point of a Distribution


Unlike the median and the mode, the mean is responsive to the exact position of each score in the distribution.
Inspect the basic formula $\sum x / n$. It shows that increasing or decreasing the value of any score changes $\sum x$ and thus also changes the value of the mean.
The mean is the balance point of a distribution.

[Figure 4.2: frequency distribution of scores X on an axis from 8 to 18, with mean $\bar{X} = 14$. Deviations $(X - \bar{X})$ below the mean: -6, -2, -2, -2, -1, so $\sum (X - \bar{X}) = -13$. Deviations above the mean: +2, +3, +4, +4, so $\sum (X - \bar{X}) = +13$.]

FIGURE 4.2 The mean as the balance point of a distribution.


$$\bar{x} = \sum x / n \implies \sum (x - \bar{x}) = 0$$

This says that if we express the scores in terms of the amount by which they deviate from their mean, taking into account the negative and positive deviations, their sum is zero.
To put it another way, the sum of the negative deviations from the mean exactly equals the sum of the positive deviations.
So when a measure of central tendency should reflect the total of the scores, the mean is the best choice.


Median in the Case with Outliers


The median is less sensitive than the mean to the presence of a few extreme
scores (called outliers).
Consider, for example, the money earned by the top 200 professionals on
the men’s PGA (golf) tour in 2009 (see Table 4.1).
TABLE 4.1 Money Earned by the Top 200 Players on the
2009 PGA Tour

RANK PLAYER MONEY ($)

1 Tiger Woods 10,508,163


2 Steve Stricker 6,332,636
3 Phil Mickelson 5,332,755
4 Zach Johnson 4,714,813
5 Kenny Perry 4,445,562
. . .
. . .
96 James Nitties 931,532
97 Kevin Stadler 925,514
98 Michael Letzig 896,478
99 Lee Janzen 871,187
100 Ted Purdy 838,707
. . .
. . .
196 Dudley Hart 158,399
197 Greg Kraft 156,686
198 Kirk Triplett 155,480
199 John Huston 135,476
200 Michael Sim 130,188


The mean earning was $1,253,638, but the median was $837,886. The money earned by the best player, Tiger Woods, was over $4 million greater than that of the second-ranked player, Steve Stricker, and Stricker’s earnings were about $1 million greater than those of the third-ranked player.
The earnings of these two players strongly affected the total, and hence the mean, but their values did not affect the median. Imagine the change if Tiger Woods had earned $5 million more.
Therefore, in distributions that are strongly asymmetrical (skewed) or have a few very deviant scores, the median may be the better choice for measuring the central tendency.


HKU LLB Graduate Employment Statistics 2013 – Mean or Median?

Below is extracted from the “Graduate Employment Statistics 2013 for LLB Graduates” published by HKU CEDARS (Centre of Development and Resources for Students).

VI. Basic Salary and Gross Income

The remuneration received by Bachelor of Laws graduates is shown below.

           Basic Salary                            Gross Income
           LLB                HKU Average*         LLB                HKU Average*
           2013     2012      2013     2012        2013     2012      2013     2012
Mean       $21,929  $20,547   $18,778  $18,662     $22,955  $21,224   $19,547  $19,598
Median     $16,000  $18,000   $15,000  $15,000     $17,958  $18,625   $16,000  $15,833
Minimum    $10,000  $9,000    $5,000   $2,000      $10,000  $9,750    $5,000   $6,000
Maximum    $60,000  $45,000   $83,333  $60,000     $60,000  $60,000   $83,333  $150,000

* HKU Average refers to the figure for the total HKU population and includes M.B.,B.S. and B.D.S. graduates.

Discussion topics:
e.g., What can we say about the statistics? Should we use the mean or the median in measuring the central location? What other information should be added here to make the summary more complete?

[Partly visible in the extract: Section VII, Number of Full-Time Job Offers. The number of job offers received by Bachelor of Laws graduates is shown in a table; first row: one offer, 5 graduates (38%).]
Measures of Central Location
n = 6: x1 = 11.6, x2 = 7.2, x3 = -3.1, x4 = 4.6, x5 = -7.7, x6 = 5.4

Mean / average / arithmetic mean:
$$\bar{x} = \frac{\sum x}{n} = \frac{1}{n}(x_1 + x_2 + \cdots + x_n)$$
$$\sum x = 11.6 + 7.2 - 3.1 + 4.6 - 7.7 + 5.4 = 18$$
$$\bar{x} = \frac{18}{6} = 3$$
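Mean as a Balance Point
[Figure: data plotted around $\bar{x} = 3$, with positive deviations on one side and negative deviations on the other.]
The mean balances the total deviations on both sides.
Mean as a Balance Point
[Figure: a distribution is unbalanced if supported at the peak, and balanced if supported at the mean.]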
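Measure of Central Location
Median – number with the middle rank
$$\text{median} = \begin{cases} \left(\frac{n+1}{2}\right)\text{th number} & \text{if } n \text{ is odd} \\ \text{average of the } \left(\frac{n}{2}\right)\text{th and } \left(\frac{n}{2}+1\right)\text{th numbers} & \text{if } n \text{ is even} \end{cases}$$

Data: 11.6 7.2 -3.1 4.6 -7.7 5.4
Sorted: -7.7 -3.1 4.6 5.4 7.2 11.6
$$\text{median} = \frac{4.6 + 5.4}{2} = 5$$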
Median as a Balance Point
[Figure: median = 5, with 50% of the data on each side.]
The median balances the number of observations on both sides, ignoring the exact positions.
Measure of Central Location
Starting salary of five graduates (in $1000)

13 14 15 19 20
$\bar{x} = \frac{13 + 14 + 15 + 19 + 20}{5} = 16.2$, median = 15

13 14 15 19 53
$\bar{x} = \frac{13 + 14 + 15 + 19 + 53}{5} = 22.8$, median = 15

The mean is sensitive to outliers; the median is robust.
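
The starting-salary comparison is easy to reproduce. In this short sketch, replacing the largest value by an outlier moves the mean but leaves the median unchanged; the variable names are illustrative.

```python
import math
import statistics

salaries = [13, 14, 15, 19, 20]      # starting salaries in $1000
with_outlier = [13, 14, 15, 19, 53]  # same data with the largest value replaced

mean_before = statistics.mean(salaries)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(salaries)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")
```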


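Measure of Central Location
Mode – value in the data set that occurs most frequently; the peak of the distribution.
[Figure: distribution of age at death; the most frequent age at death falls in the modal class 60 - 65.]
Measure of Central Location
[Figure: a distribution with the mean, mode, and median marked; the median splits the data 50% / 50%.]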
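Measures of Variation
[Figure: distributions with different spreads; one is spread out wider.]
Measure of Variation
[Figure: data around $\bar{x} = 3$, with positive deviations on one side and negative deviations on the other.]
$$\sum (x - \bar{x}) = 0$$
Measure the variation by $\sum (x - \bar{x})^2$.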
Measure of Variation
Data $x$    Deviation from the mean $x - \bar{x}$    Deviation squared $(x - \bar{x})^2$
32          32 - 50.6 = -18.6                        345.96
71          71 - 50.6 = +20.4                        416.16
64          64 - 50.6 = +13.4                        179.56
50          50 - 50.6 = -0.6                         0.36
48          48 - 50.6 = -2.6                         6.76
63          63 - 50.6 = +12.4                        153.76
38          38 - 50.6 = -12.6                        158.76
41          41 - 50.6 = -9.6                         92.16
47          47 - 50.6 = -3.6                         12.96
52          52 - 50.6 = +1.4                         1.96

$\sum x = 506$, $\sum (x - \bar{x}) = 0$, $\sum (x - \bar{x})^2 = 1368.4$
Standard Deviation
Population s.d.: $\sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$

Sample s.d.: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

Sample s.d. of the previous sample (n = 10): $s = \sqrt{\frac{1368.4}{10 - 1}} = 12.33$

Sample variance: $s^2 = \frac{1368.4}{10 - 1} = 152.04$
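
The calculation for the ten data values can be followed step by step. This is a minimal sketch of the definitional formula, cross-checked against `statistics.stdev`, which also divides by n - 1.

```python
import math
import statistics

data = [32, 71, 64, 50, 48, 63, 38, 41, 47, 52]
n = len(data)

mean = sum(data) / n                           # 50.6
deviations = [x - mean for x in data]          # these sum to zero
sum_sq = sum(d * d for d in deviations)        # sum of squared deviations

sample_variance = sum_sq / (n - 1)             # divide by n - 1 for a sample
sample_sd = math.sqrt(sample_variance)

print(f"sum of deviations         = {sum(deviations):.10f}")
print(f"sum of squared deviations = {sum_sq:.1f}")
print(f"sample variance           = {sample_variance:.2f}")
print(f"sample s.d.               = {sample_sd:.2f}")
```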
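Standard Deviation (computational formula)
$$s = \sqrt{\frac{\sum x^2 - (\sum x)^2 / n}{n - 1}}$$

Data $x$    Squared data $x^2$
32          1024
71          5041
64          4096
50          2500
48          2304
63          3969
38          1444
41          1681
47          2209
52          2704

$\sum x = 506$, $\sum x^2 = 26972$

$$\sum x^2 - \frac{(\sum x)^2}{n} = 26972 - \frac{506^2}{10} = 1368.4$$
$$s = \sqrt{\frac{1368.4}{10 - 1}} = 12.33$$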
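
The computational formula needs only the two running totals, the sum of x and the sum of x squared. A short sketch confirming that it gives the same sum of squared deviations as the definition:

```python
import math

data = [32, 71, 64, 50, 48, 63, 38, 41, 47, 52]
n = len(data)

sum_x = sum(data)                    # 506
sum_x2 = sum(x * x for x in data)    # 26972

# Shortcut for the sum of squared deviations: sum(x^2) - (sum x)^2 / n
sum_sq_shortcut = sum_x2 - sum_x ** 2 / n
s = math.sqrt(sum_sq_shortcut / (n - 1))

print(sum_x, sum_x2, round(sum_sq_shortcut, 1), round(s, 2))
```

The shortcut is convenient for hand calculation. In floating-point arithmetic it can lose accuracy when the mean is large relative to the spread, so software usually works from the deviations instead.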
Standard Deviation
[Figure: data around $\bar{x} = 3$; s.d. decreases.]
Measure the variation by $\sum (x - \bar{x})^2$.
Standard Deviation
[Figure: data around $\bar{x} = 3$; s.d. increases.]
Standard deviation is sensitive to outliers.


Standard Deviation
Mathematical result
The mean minimizes the sum of squared deviations:
$$\sum (x - \bar{x})^2 < \sum (x - \text{any other value})^2$$

Question: What minimizes the sum of absolute deviations?
$$\sum |x - \,?\,| \le \sum |x - \text{any other value}|$$
Answer: the median.

Standard Deviation
How does the sum of absolute deviations change when the vertical line is moved to the left? To the right?
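
One way to explore this question numerically is to slide a candidate centre c along a grid and record both sums. The sketch below reuses the ten data values from the standard deviation example; the grid from 30.0 to 75.0 in steps of 0.1 is an arbitrary illustrative choice.

```python
import statistics

data = [32, 71, 64, 50, 48, 63, 38, 41, 47, 52]


def sum_sq(c):
    """Sum of squared deviations of the data from c."""
    return sum((x - c) ** 2 for x in data)


def sum_abs(c):
    """Sum of absolute deviations of the data from c."""
    return sum(abs(x - c) for x in data)


mean = statistics.mean(data)      # 50.6
median = statistics.median(data)  # 49.0

candidates = [c / 10 for c in range(300, 751)]  # 30.0, 30.1, ..., 75.0
best_sq = min(candidates, key=sum_sq)
best_abs = min(candidates, key=sum_abs)

print(f"squared deviations are smallest at c = {best_sq} (mean = {mean})")
print(f"absolute deviations are smallest near c = {best_abs} (median = {median})")
```

With an even number of observations, every value between the two middle observations (48 and 50 here) gives the same smallest sum of absolute deviations, so the minimizer is not unique; the median is the conventional choice.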
Coefficient of Variation
Monthly revenues from two markets (in $m)

Market A: mean = 9.5, sd = 3.9
3 5 6 7 8 9 9 11 13 14 14 15

Market B: mean = 73.5, sd = 3.9
66 67 71 73 73 74 74 75 76 77 77 79

In which market does the company have the larger variation in revenues?
It would be better to compare relative spreads.

Coefficient of Variation
$$\text{Coefficient of Variation } (CV) = \frac{s}{\bar{x}} \times 100\%$$

Market A: CV = 41.0%
Market B: CV = 5.3%

The revenue flow from Market B is relatively more stable.
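
The two coefficients of variation can be recomputed from the raw revenues. In this sketch the helper name `coefficient_of_variation` is illustrative. Using the unrounded sample s.d. (about 3.92) gives roughly 41.3% for Market A, whereas the 41.0% on the slide corresponds to the rounded value 3.9.

```python
import statistics

market_a = [3, 5, 6, 7, 8, 9, 9, 11, 13, 14, 14, 15]
market_b = [66, 67, 71, 73, 73, 74, 74, 75, 76, 77, 77, 79]


def coefficient_of_variation(values):
    """Sample s.d. as a percentage of the mean."""
    return 100 * statistics.stdev(values) / statistics.mean(values)


sd_a = statistics.stdev(market_a)
sd_b = statistics.stdev(market_b)
cv_a = coefficient_of_variation(market_a)
cv_b = coefficient_of_variation(market_b)

print(f"Market A: mean = {statistics.mean(market_a)}, sd = {sd_a:.1f}, CV = {cv_a:.1f}%")
print(f"Market B: mean = {statistics.mean(market_b)}, sd = {sd_b:.1f}, CV = {cv_b:.1f}%")
```

Both markets have the same absolute spread, so only the relative measure shows that Market B is the more stable one.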



Is It Permissible to Calculate the Mean for Tests in the Behavioral Sciences?

Questionnaires with items like this are common. To indicate their attitudes, respondents circle numbers.

Colleges should be free to pay salaries to their athletes.
Strongly disagree   -3   -2   -1   0   +1   +2   +3   Strongly agree

How should we summarize a sample of such numbers? By calculating the mean?
(Refer to supplementary material: Is It Permissible to Calculate the Mean for Tests in the Behavioral Sciences? for detail.)


First of all, we have to ask ourselves a question: “Is the measurement on this scale interval or ordinal?”
This is not exactly interval: Consider two attitudes, one represented by -3 and the other by -2. The difference between those attitudes in favorability to salaries for college athletes is probably not the same as the difference in favorability between attitudes represented by, say, +1 and +2. So a one-point difference between scores does not necessarily signify an equal difference in attitudes all along the scale.


This is not exactly ordinal either: Does the scale then fall only at the ordinal level of measurement? If so, the seven numbers along the scale would indicate only a rank ordering from the least favorable attitude to the most favorable. But there is probably more information in the numbers than that. A two-point difference between scores probably signifies a greater difference in favorability than a one-point difference.

Measurement on this scale is therefore likely to lie somewhere
between the ordinal and the interval levels of sophistication.
So are we or are we not justified in calculating the mean to
summarize a sample of such scores? Some say yes and others
say no.
The same is true with many other measuring instruments used in
the behavioral sciences — with inventories of moods like anger
and elation, with assessments of personality traits like
extraversion and conscientiousness, with tests of aptitudes and
achievements. They do not yield scores that carry as much
intrinsic meaning as, say, temperatures and weights, but they tell
us more than ranks do.

2. Standard Deviation, Range, Interquartile Range


Measures of variability (the degree of spread, or dispersion, of
the data): Standard Deviation, Range, Interquartile Range
Standard Deviation (s.d.)

\[
\text{Population s.d.} = \sigma = \sqrt{\frac{\text{sum of squared deviations}}{\text{sample size}}} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}
\]

\[
\text{Sample s.d.} = s = \sqrt{\frac{\text{sum of squared deviations}}{\text{sample size} - 1}} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
\]

Notes: the divisor n − 1 is the degrees of freedom. When n = 1, the sample s.d. is undefined (there is no comparison to make); when n = 3, there are 2 comparisons.
Squaring the deviations takes away the sign, and ...
The value of the squared standard deviation is called the
variance.
The larger the standard deviation, the greater the dispersion of
the data.
Range is the difference between the maximum and minimum values.

Range = maximum value − minimum value

Interquartile Range is the difference between the upper quartile and lower quartile.

IQR = Q3 − Q1

where
Q1 = lower quartile = median of lower half of the ordered data values
Q3 = upper quartile = median of upper half of the ordered data values


Calculation of Standard Deviation


Consider a sample of 10 data values.
TABLE 5.1 Calculation of the Variance: Deviation-Score Method

① X     ③ (X − X̄)              ④ (X − X̄)²
32      32 − 50.6 = −18.6       345.96
71      71 − 50.6 = +20.4       416.16
64      64 − 50.6 = +13.4       179.56
50      50 − 50.6 = −0.6          0.36
48      48 − 50.6 = −2.6          6.76
63      63 − 50.6 = +12.4       153.76
38      38 − 50.6 = −12.6       158.76
41      41 − 50.6 = −9.6         92.16
47      47 − 50.6 = −3.6         12.96
52      52 − 50.6 = +1.4          1.96

ΣX = 506     Σ(X − X̄) = 0      ⑤ Σ(X − X̄)² = 1368.40

② X̄ = ΣX / n = 506 / 10 = 50.6
⑥ S_X² = Σ(X − X̄)² / n = 1368.40 / 10 = 136.84

\[
\text{Sample s.d.: } s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{1368.40}{10 - 1}} = \sqrt{152.04} = 12.33
\]

Calculation of Standard Deviation (A Quicker Way)


TABLE 5.2 Calculation of the Standard Deviation: Raw-Score Method

① X     ② X²
32      1,024
71      5,041
64      4,096
50      2,500
48      2,304
63      3,969
38      1,444
41      1,681
47      2,209
52      2,704

③ ΣX = 506     ③ ΣX² = 26,972

Actually, \(\sum (x - \bar{x})^2 = \sum x^2 - (\sum x)^2 / n\) (Try to prove)

Therefore,
\[
s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum x^2 - (\sum x)^2 / n}{n - 1}}
\]

\[
\text{Sample s.d.: } s = \sqrt{\frac{26972 - 506^2 / 10}{10 - 1}} = \sqrt{152.04} = 12.33
\]
Same result as before.
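
The two calculation routes can be checked numerically. The following is a minimal Python sketch using the ten scores from the tables; the variable names are illustrative and not part of the course material.

```python
import math

# The ten scores used in Tables 5.1 and 5.2.
scores = [32, 71, 64, 50, 48, 63, 38, 41, 47, 52]
n = len(scores)

# Deviation-score method: sum of squared deviations from the mean.
mean = sum(scores) / n
ss_deviation = sum((x - mean) ** 2 for x in scores)

# Raw-score method: sum of x^2 minus (sum of x)^2 / n.
ss_raw = sum(x * x for x in scores) - sum(scores) ** 2 / n

population_sd = math.sqrt(ss_deviation / n)       # divisor n
sample_sd = math.sqrt(ss_deviation / (n - 1))     # divisor n - 1
sample_sd_raw = math.sqrt(ss_raw / (n - 1))       # same value via raw scores

print(f"mean = {mean:.1f}")
print(f"sum of squares: deviation = {ss_deviation:.2f}, raw = {ss_raw:.2f}")
print(f"population s.d. = {population_sd:.2f}, sample s.d. = {sample_sd:.2f}")
```

Both sums of squares agree up to floating-point rounding, so the two sample standard deviations match.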

Properties of Standard Deviation

The standard deviation, like the mean, is responsive to the exact position of every score in the distribution.
Because it is calculated by taking deviations from the mean, if a
score is shifted to a position more deviant from the mean, the
standard deviation will increase. If the shift is to a position
closer to the mean, the standard deviation decreases.
When we calculate deviations from the mean, the sum of squares
of these values is smaller than if they had been taken about any
other point. Putting it another way, (Try to prove)
\[
\sum (x - \bar{x})^2 < \sum (x - \text{any other score})^2
\]
Note: the mean minimizes the sum of squared deviations; the median minimizes the sum of absolute deviations.
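
A numerical check (not a proof) of the two minimizing properties can be sketched in Python. The grid of candidate centres below is an arbitrary illustrative choice, and the function names are assumptions.

```python
from statistics import mean, median

scores = [32, 71, 64, 50, 48, 63, 38, 41, 47, 52]

def sum_squared_deviations(center):
    """Sum of squared deviations of the scores about `center`."""
    return sum((x - center) ** 2 for x in scores)

def sum_absolute_deviations(center):
    """Sum of absolute deviations of the scores about `center`."""
    return sum(abs(x - center) for x in scores)

x_bar = mean(scores)
mdn = median(scores)

# Candidate centres 30.0, 30.1, ..., 75.0 (illustrative grid only).
grid = [c / 10 for c in range(300, 751)]
best_for_squares = min(grid, key=sum_squared_deviations)
best_for_absolute = min(grid, key=sum_absolute_deviations)

print("mean:", x_bar, "| grid point minimizing squared deviations:", best_for_squares)
print("median:", mdn, "| grid point minimizing absolute deviations:", best_for_absolute)
```

With an even number of scores, every point between the two middle values (48 and 50 here) gives the same minimal sum of absolute deviations, so the grid search may report any point in that interval.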


Calculation of Range and Interquartile Range

Using the same data as in the previous slides.

We should rearrange the data in ascending order before
calculating the range and the interquartile range.
Data set (in ascending order): 32, 38, 41, 47, 48, 50, 52, 63, 64, 71
Range = max − min = 71 − 32 = 39
Since n = 10, the lower half is 32, 38, 41, 47, 48 and the upper half is 50, 52, 63, 64, 71, so lower quartile Q1 = 41 and upper quartile
Q3 = 63
Therefore, IQR = Q3 − Q1 = 63 − 41 = 22
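
The median-of-halves rule used above can be sketched as a small Python function. The function name is illustrative, and leaving the middle value out of both halves when n is odd is an assumption, since the slides do not state that case. Library routines such as NumPy percentiles interpolate differently and may give other quartile values.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median-of-halves rule.

    Needs at least two values. For odd n the middle value is left out
    of both halves (an assumed convention).
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[len(ordered) - half:]
    return median(lower_half), median(ordered), median(upper_half)

scores = [32, 71, 64, 50, 48, 63, 38, 41, 47, 52]
q1, q2, q3 = quartiles_by_halves(scores)
data_range = max(scores) - min(scores)
iqr = q3 - q1

print("Q1 =", q1, "median =", q2, "Q3 =", q3)
print("range =", data_range, "IQR =", iqr)
```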


3. Skewness, Kurtosis

Measures of Shape: Skewness, Kurtosis


Skewness is a measure of a data set’s deviation from symmetry.
\[
\text{skewness} = \frac{m_3}{m_2^{3/2}}
\]
where
\[
m_2 \text{ (the second sample moment about the mean)} = \frac{\sum (x - \bar{x})^2}{n}
\]
\[
m_3 \text{ (the third sample moment about the mean)} = \frac{\sum (x - \bar{x})^3}{n}
\]
Note: the cubic function magnifies the deviation.

The value of this measure generally lies between −3 and +3. The
closer the value lies to −3, the more the distribution is skewed left
(negatively skewed). The closer the value lies to +3, the more the
distribution is skewed right (positively skewed). A value close to 0
indicates a symmetric distribution.
A normal distribution is symmetric and has skewness of 0.

There are other measures of skewness:

1 Pearson mode skewness or first skewness coefficient (a simpler measure)
\[
\text{skewness} = \frac{\text{mean} - \text{mode}}{\text{standard deviation}}
\]
mean < (>) mode ⇒ distribution is negatively (positively) skewed

2 Pearson median skewness or second skewness coefficient
\[
\text{skewness} = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}}
\]
mean < (>) median ⇒ distribution is negatively (positively) skewed

3 Bowley skewness or quartile skewness coefficient (a more robust measure, used more often)
\[
\text{skewness} = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1} = \frac{Q_1 - 2Q_2 + Q_3}{Q_3 - Q_1}
\]
where Q2 = median and Q3 − Q1 = IQR
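
As an illustration, the second and third coefficients can be evaluated for the ten-score sample from the standard deviation slides. This is a sketch under assumptions: the quartiles Q1 = 41 and Q3 = 63 are taken from the interquartile range calculation in these notes, and the sample standard deviation (divisor n − 1) is used. The mode-based coefficient is skipped because every score occurs once, so there is no unique mode.

```python
from statistics import mean, median, stdev

scores = [32, 71, 64, 50, 48, 63, 38, 41, 47, 52]

# Quartiles from the median-of-halves calculation in these notes.
q1, q3 = 41, 63
q2 = median(scores)

# Pearson median skewness: 3 (mean - median) / standard deviation.
pearson_median_skewness = 3 * (mean(scores) - q2) / stdev(scores)

# Bowley (quartile) skewness: (Q1 - 2 Q2 + Q3) / (Q3 - Q1).
bowley_skewness = (q1 - 2 * q2 + q3) / (q3 - q1)

print(f"Pearson median skewness = {pearson_median_skewness:.3f}")
print(f"Bowley skewness = {bowley_skewness:.3f}")
```

Both coefficients come out positive for this sample, consistent with the mean (50.6) lying above the median (49).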
[FIGURE 4.3: Mean (X̄), median (Mdn), and mode (Mo) in skewed distributions and in the normal distribution. Panel (a), skewed negatively (to the left): X̄, Mdn, Mo from left to right. Panel (b), skewed positively (to the right): Mo, Mdn, X̄ from left to right. Panel (c), normal distribution: X̄, Mdn and Mo coincide. Horizontal axis: Scores; vertical axis: Frequency.]

Distribution           Coefficient of Skewness   Measures of Central Location
Symmetrical            0                         Mean = Median = Mode
Skewed to the right    > 0                       Mean > Median > Mode
Skewed to the left     < 0                       Mean < Median < Mode

Kurtosis is a measure of peakedness of a distribution. (Note: it is actually measuring the tails.)

\[
\text{kurtosis} = \frac{m_4}{m_2^2}
\]
where
\[
m_4 \text{ (the fourth sample moment about the mean)} = \frac{\sum (x - \bar{x})^4}{n}
\]
\[
m_2 \text{ (the second sample moment about the mean)} = \frac{\sum (x - \bar{x})^2}{n}
\]
Excess kurtosis is defined as the kurtosis minus 3 (3 is the kurtosis of the normal distribution), i.e.,

excess kurtosis = kurtosis − 3
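
The moment-based skewness and kurtosis formulas translate directly into code. The following Python sketch uses illustrative function names and divides by n in every moment, as in the definitions of m2, m3 and m4 in these notes.

```python
def central_moment(values, k):
    """k-th sample moment about the mean, with divisor n."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n

def skewness(values):
    """m3 / m2^(3/2)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """m4 / m2^2."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

def excess_kurtosis(values):
    """Kurtosis minus 3, the kurtosis of the normal distribution."""
    return kurtosis(values) - 3

scores = [32, 71, 64, 50, 48, 63, 38, 41, 47, 52]
print(f"skewness = {skewness(scores):.3f}")
print(f"kurtosis = {kurtosis(scores):.3f}")
print(f"excess kurtosis = {excess_kurtosis(scores):.3f}")
```

A perfectly symmetric sample such as 1, 2, 3, 4, 5 gives a skewness of 0 under these definitions.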

Several well-known, unimodal and symmetric distributions from different parametric
families are compared here.
Each has a mean and skewness of zero.
The parameters have been chosen to result in a variance equal to 1 in each case.
N denotes the standard normal curve, with excess kurtosis equal to 0.
Generally, if a distribution has a greater excess kurtosis, it has a higher peak and thicker
tails, compared to another distribution of the same kind.


4. Outlier

Outlier is a data point that is not consistent with the bulk of the
data
Look for them via graphs
Can have big influence on conclusions
Can cause complications in some statistical analysis
Cannot discard without justification
The general guideline used in a box plot to identify an outlier is:
If an observation is outside the range
[Q1 − 1.5 IQR, Q3 + 1.5 IQR], then it is regarded as an outlier.
Mild outlier: outside [Q1 − 1.5 IQR, Q3 + 1.5 IQR].
Extreme outlier: outside [Q1 − 3 IQR, Q3 + 3 IQR].


Possible reasons for outliers and what to do about them:


Outlier is legitimate data value and represents natural variability
for the group and variable(s) measured. Values may not be
discarded. They provide important information about location
and spread.
Mistake made while taking measurement or entering it into
computer. If verified, should be discarded or corrected.
Individual in question belongs to a different group than bulk of
individuals measured. Values may be discarded if summary is
desired and reported for the majority group only.

Question
Find the outliers in the following data of the ages of actresses at the
time they first won the Oscar:

21, 24, 25, 26, 26, 26, 26, 27, 28, 30, 30, 31, 31,
33, 33, 33, 34, 34, 34, 34, 35, 35, 35, 37, 37, 38,
39, 41, 41, 41, 42, 44, 49, 50, 60, 61, 61, 74, 80
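
One way to approach this question is to apply the 1.5 IQR box-plot guideline in code. The Python sketch below assumes the median-of-halves rule for the quartiles, leaving the middle value out of both halves because n = 39 is odd; other quartile conventions may move the fences slightly. Function and variable names are illustrative.

```python
from statistics import median

ages = [21, 24, 25, 26, 26, 26, 26, 27, 28, 30, 30, 31, 31,
        33, 33, 33, 34, 34, 34, 34, 35, 35, 35, 37, 37, 38,
        39, 41, 41, 41, 42, 44, 49, 50, 60, 61, 61, 74, 80]

def box_plot_fences(values, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[len(ordered) - half:])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

low, high = box_plot_fences(ages)
outliers = [a for a in ages if a < low or a > high]

# Fences at 3 IQR separate extreme outliers from mild ones.
low3, high3 = box_plot_fences(ages, k=3)
extreme_outliers = [a for a in ages if a < low3 or a > high3]

print("fences:", low, high)
print("outliers:", outliers)
print("extreme outliers:", extreme_outliers)
```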


5. Coefficient of Variation

The standard deviation measures the variation in a set of data.
For decision makers, the standard deviation indicates how spread
out a distribution is.
For distributions having the same mean, the distribution with the
largest standard deviation has the greatest relative spread.
When two or more distributions have different means, the
relative spread cannot be determined by merely comparing the
standard deviations.
Coefficient of variation (CV) is used to measure the relative
variation for distributions with different means.
Note: only use it for data of the same sign (all positive or all negative).


Sample coefficient of variation:
\[
CV = \frac{s}{\bar{x}} (100\%)
\]
where
s = sample standard deviation
x̄ = sample mean
When the coefficients of variation for two or more distributions
are compared, the distribution with the largest CV is said to
have the greatest relative spread.

In finance, CV measures the relative risk of a stock portfolio.
Assume portfolio A has a collection of stocks that average a
12% return with a standard deviation of 3% and portfolio B has
an average return of 6% with a standard deviation of 2%.
We can compute the CV values for each as follows:
\[
CV(A) = \frac{3}{12} (100\%) = 25\%
\]
\[
CV(B) = \frac{2}{6} (100\%) \approx 33\%
\]
Even though portfolio B has a lower standard deviation, it would
be considered riskier than portfolio A because B’s CV is 33%
and A’s CV is 25%.
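
The portfolio comparison can be reproduced with a short Python sketch. The function name is illustrative, and the check for a zero mean is an added safeguard, not something stated in the slides.

```python
def coefficient_of_variation(sd, mean):
    """Return the coefficient of variation in percent: (sd / mean) * 100."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean * 100

# Average return and standard deviation (both in %) from the example.
cv_a = coefficient_of_variation(sd=3, mean=12)
cv_b = coefficient_of_variation(sd=2, mean=6)

print(f"CV(A) = {cv_a:.0f}%, CV(B) = {cv_b:.0f}%")
riskier = "B" if cv_b > cv_a else "A"
print("Portfolio with the greater relative spread:", riskier)
```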
Normal Distribution and Other Statistics


Normal Distribution
Normal Distribution (or Gaussian Distribution). A distribution is the overall pattern of how often the
possible values occur. Characteristics:

The distribution is symmetrical about its mean, µ.
Other measures of central location, such as the median and the
mode, equal the mean.
A value taken from a normal distribution varies from −∞ to
+∞.
A normal distribution has two parameters, µ and σ.
The value µ locates the center of the distribution.
The value σ (the standard deviation) indicates the dispersion
within the distribution.
The normal distribution is really a family of distributions.
Different members of this family exist for each possible different
pair of values of µ and σ.

These curves are called probability densities.


A random variable X following a normal distribution with mean µ
and variance σ² is denoted as X ~ N(µ, σ²), where N stands for the normal distribution.
The probability density function (pdf) of X is
\[
f(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2}, \quad -\infty < x < \infty,
\]
where e = a mathematical constant (approximately 2.718282)
µ = mean of the random variable X
σ = standard deviation of the random variable X
π = a mathematical constant (approximately 3.141593)
The Z-transformation for X gives the standard normal
distribution:
\[
Z = \frac{X - \mu}{\sigma} \sim N(0, 1)
\]
Aim: One size fits all
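
The density and the Z-transformation can be written out directly. The following Python sketch uses only the standard library; the function names are illustrative.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x, following the formula above."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z ** 2) / (sigma * math.sqrt(2 * math.pi))

def z_transform(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

# The standard normal density peaks at the mean.
print(f"f(0; 0, 1) = {normal_pdf(0, 0, 1):.6f}")

# The curve is symmetric about mu: equal heights at mu - d and mu + d.
print(normal_pdf(3, 5, 2), normal_pdf(7, 5, 2))

# A value one standard deviation above the mean has Z = 1.
print(z_transform(7, 5, 2))
```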

How to Read the Standard Normal Table


TABLE A Areas under the Normal Curve Corresponding to Given Values of z

Column 2 gives the proportion of the area under the entire curve that is between the mean (z = 0) and the positive value of z. Areas for negative values of z are the same as for positive values because the curve is symmetrical.

Column 3 gives the proportion of the area under the entire curve that falls beyond the stated positive value of z. Areas for negative values of z are the same because the curve is symmetrical.

z      AREA BETWEEN    AREA        z      AREA BETWEEN    AREA
       MEAN AND z      BEYOND z           MEAN AND z      BEYOND z
(1)    (2)             (3)         (1)    (2)             (3)
0.00   .0000           .5000       0.27   .1064           .3936
0.01   .0040           .4960       0.28   .1103           .3897
0.02   .0080           .4920       0.29   .1141           .3859
0.03   .0120           .4880       0.30   .1179           .3821
0.04   .0160           .4840       0.31   .1217           .3783

(This is only a part of the table, refer to the Appendix – Standard Normal Table for detail.)
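
The two table columns can be reproduced from the standard normal cumulative distribution function. This Python sketch computes it with `math.erf` from the standard library; the helper names are illustrative.

```python
import math

def phi(z):
    """Standard normal cumulative probability P(Z < z)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def area_between_mean_and_z(z):
    """Column 2: area between the mean (z = 0) and z."""
    return phi(abs(z)) - 0.5

def area_beyond_z(z):
    """Column 3: area in the tail beyond z."""
    return 1 - phi(abs(z))

print(" z     col 2   col 3")
for z in (0.00, 0.01, 0.02, 0.27, 0.30, 0.31):
    print(f"{z:4.2f}  {area_between_mean_and_z(z):.4f}  {area_beyond_z(z):.4f}")
```

For any z, columns 2 and 3 add up to 0.5, the area of one half of the curve.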

Percentile

The k-th percentile is a number that has k% of the data values at or below it and (100 − k)% of the data values at or above it.
Lower quartile, median, upper quartile are special cases of
percentile.
lower quartile = 25th percentile
median = 50th percentile
upper quartile = 75th percentile

[FIGURE 5.9: Comparative location of percentile ranks (P0, P20, P40, P60, P80, P100) in two distributions, panels (a) and (b). Horizontal axis: Score (60 to 110); vertical axis: Frequency. The area under the curve in each segment equals 10% of the whole, representing 10% of the scores (see Section 3.4).]


Percentile in Standard Normal Distribution


Question
(a) Find the P95 of a standard normal distribution.
(b) Find the P95 of a normal random variable X ~ N(5, 10).
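
A possible way to check answers to this question is Python's `statistics.NormalDist`, whose `inv_cdf` method returns percentiles. The sketch assumes that N(5, 10) follows the N(µ, σ²) notation of these notes, so the second argument is the variance and σ = √10.

```python
import math
from statistics import NormalDist

# (a) 95th percentile of the standard normal distribution.
p95_standard = NormalDist().inv_cdf(0.95)

# (b) 95th percentile of X ~ N(5, 10), reading 10 as the variance.
mu, variance = 5, 10
sigma = math.sqrt(variance)
p95_x = NormalDist(mu, sigma).inv_cdf(0.95)

# The same value from the Z-transformation: x = mu + z * sigma.
p95_x_via_z = mu + p95_standard * sigma

print(f"(a) P95 of Z = {p95_standard:.3f}")
print(f"(b) P95 of X = {p95_x:.2f}")
```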

Normal Distribution
Many data have distributions that are unimodal, almost
symmetric, and tailing off towards both sides.
Normal curve: a bell-shaped curve used as an approximation to
many such data distributions.
[Figure: normal curve N(45, 2.25²), with the standard deviation 2.25 marked on each side of the mean.]

Characteristics of Normal Curves
The total area under the curve is 1.
About 68% of the area lies between µ − σ and µ + σ.
About 95% of the area lies between µ − 2σ and µ + 2σ.


Standardization
Normal score: how many standard deviations an observation is from the mean.

\[
Z = \frac{X - \mu}{\sigma}, \qquad X = \mu + Z \times \sigma
\]

Example
Human IQ scores: µ = 100, σ = 15
An IQ score of 136 is Z = (136 − 100)/15 = 2.4 standard deviations above the mean.
An IQ score of 80 gives Z = (80 − 100)/15 = −1.33, that is, 1.33 standard deviations below the mean.
Standard Normal Distribution
N(0, 1)

Φ(z) = P(less than z)
P(between a and b) = Φ(b) − Φ(a)
Φ(−z) = 1 − Φ(z), for example Φ(−1) = 1 − Φ(1)
Standard Normal Distribution Table
IQ scores: N(100, 15²)
P(less than 80) = ?   P(less than 110) = ?
Normal scores:
Z = (80 − 100)/15 ≈ −1.33 and Z = (110 − 100)/15 ≈ 0.67
Φ(0.67) = 0.7486, so P(less than 110) = 74.86%
Φ(1.33) = 0.9082
Φ(−1.33) = 1 − 0.9082 = 0.0918, so P(less than 80) = 9.18%
Standard Normal Distribution Table
Gasoline use (in mpg): N(30.5, 4.5²)
P(25 < X < 40) = ?
Normal scores:
(25 − 30.5)/4.5 = −1.22 and (40 − 30.5)/4.5 = 2.11
Φ(1.22) = 0.8888, Φ(2.11) = 0.9826
Φ(−1.22) = 1 − 0.8888 = 0.1112
P(25 < X < 40) = Φ(2.11) − Φ(−1.22) = 0.9826 − 0.1112 = 0.8714 = 87.14%
Standard Normal Distribution Table

P(0 < Z < 1.22) = 0.3888
P(Z < 1.22) = Φ(1.22) = 0.5 + 0.3888 = 0.8888
P(−1.22 < Z < 0) = 0.3888

Standard Normal Distribution Table
[Figure: standard normal curve with area A2 between −1.22 and 0 and area A1 between 0 and 2.11.]
A2 = 0.3888, A1 = 0.4826
P(−1.22 < Z < 2.11) = A2 + A1 = 0.3888 + 0.4826 = 0.8714 = 87.14%
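
The table look-ups for the IQ and gasoline examples can be cross-checked with `statistics.NormalDist` from the Python standard library. This is a sketch with illustrative variable names. The software does not round z to two decimals before the look-up, so its results differ from the table-based answers in the third decimal place.

```python
from statistics import NormalDist

# IQ scores: N(100, 15^2).
iq = NormalDist(mu=100, sigma=15)
p_iq_below_80 = iq.cdf(80)
p_iq_below_110 = iq.cdf(110)

# Gasoline use in mpg: N(30.5, 4.5^2).
mpg = NormalDist(mu=30.5, sigma=4.5)
p_mpg_25_to_40 = mpg.cdf(40) - mpg.cdf(25)

print(f"P(IQ < 80)      = {p_iq_below_80:.4f}  (table: 0.0918)")
print(f"P(IQ < 110)     = {p_iq_below_110:.4f}  (table: 0.7486)")
print(f"P(25 < X < 40)  = {p_mpg_25_to_40:.4f}  (table: 0.8714)")
```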
Percentile
[Figure: density curve split at the kth percentile, with k% of the area below it and (100 − k)% above it.]

Standard Normal Distribution Table
Salaries of MBA graduates (in $1000): N(45, 2.25²)
95th percentile = ?
P(Z < c) = 0.95
P(0 < Z < c) = 0.45
c ≈ 1.645
95th percentile = 45 + 1.645 × 2.25 = 48.7
VaR (Value-at-Risk)

One important application of percentile in risk management is VaR (Value-at-Risk).
VaR is defined as the worst loss over a target horizon that will
not be exceeded with a certain confidence level.
For instance, the VaR at the 95% confidence level gives a loss
value that will not be exceeded with probability of at least 95%.

Suppose an investment has the following historical loss distribution:
[Figure: histogram titled "Loss Distribution". Horizontal axis: Loss Amount (in million dollars), 0 to 90; vertical axis: Frequency, 0 to 60.]

Thus, the 95% VaR would be 70 million dollars.


What is the practical meaning?
How can we estimate the 90% VaR from the above distribution?
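
One simple way to estimate a historical VaR is to take an empirical percentile of the recorded losses. The Python sketch below uses made-up loss values (the integers 1 to 100, in million dollars) because the histogram's underlying data are not given in the slides; the function name and the integer-percent argument are illustrative assumptions.

```python
def historical_var(losses, confidence_percent):
    """Smallest recorded loss with at least `confidence_percent` % of the
    losses at or below it (an empirical percentile).

    `confidence_percent` is an integer from 1 to 100.
    """
    ordered = sorted(losses)
    n = len(ordered)
    # Ceiling of confidence_percent * n / 100, using integer arithmetic.
    k = -(-confidence_percent * n // 100)
    return ordered[max(k, 1) - 1]

# Hypothetical losses in million dollars, NOT read from the histogram.
hypothetical_losses = list(range(1, 101))

var_95 = historical_var(hypothetical_losses, 95)
var_90 = historical_var(hypothetical_losses, 90)
print("95% VaR:", var_95, "| 90% VaR:", var_90)
```

Read this way, a 95% VaR of 70 million dollars means that in about 95% of the recorded periods the loss did not exceed 70 million dollars. The 90% VaR is found the same way at the 90th percentile, so it cannot be larger than the 95% VaR.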

z -score
The standardized score or z-score is a useful measure of the relative value
of any observation in a data set:
\[
z = \frac{\text{Observed Value} - \text{Mean}}{\text{Standard Deviation}} = \frac{x - \mu}{\sigma}
\]

[FIGURE 5.7: Relative frequencies of cases contained within certain limits in the normal distribution: 68% of cases within µ ± 1σ, 95% of cases within µ ± 2σ, 99.7% of cases within µ ± 3σ. Horizontal axis: Scores; vertical axis: Relative frequency.]

z -score – Empirical Rules

In any normal distribution, the interval:

µ ± 1σ contains about 68% of the scores
µ ± 2σ contains about 95% of the scores
µ ± 3σ contains about 99.7% of the scores
The Empirical Rules for bell-shaped data can be restated for
standardized scores as follows:
About 68% of values have z-score between −1 and +1.
About 95% of values have z-score between −2 and +2.
About 99.7% of values have z-score between −3 and +3.
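
The three percentages can be confirmed from the standard normal distribution. This is a minimal Python sketch using `statistics.NormalDist`; the dictionary name is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Proportion of values with z-score between -k and +k, for k = 1, 2, 3.
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, proportion in coverage.items():
    print(f"within {k} s.d. of the mean: {proportion:.2%}")
```

The exact values (about 68.27%, 95.45% and 99.73%) are usually rounded to 68%, 95% and 99.7%.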


z -score – Women’s Heights Revisited


Mean height for the 199 British women is 1602 mm and standard
deviation is 62.4 mm.
68% of the 199 heights would fall in the range 1602 ± 62.4, or
[1539.6, 1664.4] mm
95% of the heights would fall in the interval 1602 ± 2(62.4), or
[1477.2, 1726.8] mm
99.7% of the heights would fall in the interval 1602 ± 3(62.4), or
[1414.8, 1789.2] mm

Note: Not perfect, but follows Empirical Rule quite well.

