2 Correlation and Regression

STAT1600B
Statistics: Ideas and Concepts

2017-2018 (2nd Semester)
Department of Statistics and Actuarial Science

The University of Hong Kong
Chapter 1:
Descriptive Statistics
Chung, Li (SAAS , HKU ) STAT1600B Statistics: Ideas and Concepts 2017-2018 (Sem 2) Ch 1 1 / 80
What is Statistics?
Outline
1 What is Statistics?
2 Descriptive Statistics – Graphical Summaries
3 Descriptive Statistics – Numerical Summaries
4 Normal Distribution and Other Statistics
What is Statistics?
Statistics is a collection of procedures and principles for
gathering data and analyzing information in order to help people
make decisions when faced with uncertainty.
Examples:
Does Aspirin reduce heart attack rates?
Does the Internet increase loneliness and depression?
Data are collected and used to make a judgment about a
situation.
Some Basic Statistical Terms
An observation is an individual entity in a study.

A variable is a characteristic that may differ among individuals.
Sample data are collected from a subset of a larger population.
Population data are collected when all individuals in a
population are measured.
A statistic is a summary measure of sample data.
A parameter is a summary measure of population data.
Types of Variables
Raw data from categorical variables consist of group or
category names that don’t necessarily have a logical ordering
(nominal variables).
Examples: eye color, country of residence.
Categorical variables for which the categories have a logical
ordering are called ordinal variables.
Examples: highest educational degree earned, tee shirt size (S,
M, L, XL), Likert scale.
Raw data from quantitative variables consist of numerical
values taken on each individual.
Examples: height, number of siblings.
Quantitative variables may be measured on an interval scale or on a
ratio scale; a ratio scale has a meaningful zero.
Descriptive Statistics – Graphical Summaries
Outline
Graphical Summaries
Graphs or tables are used to visually display the data.

The graphical summaries that we are going to learn are:
Frequency Table
Pie Chart
Bar Chart
Box Plot
Side-by-Side Box Plot
Stem-and-Leaf Plot
Dot Plot
Histogram
Different Graphical Summaries for Different Types of Data
Not all graphical summaries are necessary when describing a set
of data.
So which type of graphical summary should we use?
It depends on the type of data. Different graphical summaries
are used for different types of data.
Basically,
frequency tables, pie charts, and bar charts are used for categorical
variables.
box plots and histograms are used for quantitative variables.
a side-by-side box plot is used for the combination of 1
quantitative variable and 1 categorical variable.
Frequency Table – One Categorical Variable

2003 nationwide survey of American HS students “How often do you wear a seat
belt when driving a car?”
Total sample size n = 3042 students.
A majority, 1686/3042 = 55.4%, said they always wear a seat belt, while
115/3042 = 3.8%, said they never wear a seat belt.
Rarely or never: 8.2% + 3.8% = 12%
Frequency Table – Rounding Error
TABLE 10.1 Education of people 25 years and over, 2006

Level of education         Number of persons (thousands)   Percent
Less than high school                             27,896      14.5
High school graduate                              60,898      31.7
Some college, no degree                           32,611      17.0
Associate’s degree                                16,760       8.7
Bachelor’s degree                                 35,153      18.3
Advanced degree                                   18,567       9.7
Total                                            191,884     100.0
Source: Census Bureau, Educational Attainment in the United States: 2006.
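The rounding error in Table 10.1 can be checked directly. The sketch below recomputes each percent from the counts in the table; the variable names are illustrative only. The rounded percents add up to 99.9 rather than 100.0, and the rounded counts add up to 191,885 rather than the published total of 191,884.

```python
# Counts (in thousands) copied from Table 10.1.
counts = {
    "Less than high school": 27896,
    "High school graduate": 60898,
    "Some college, no degree": 32611,
    "Associate's degree": 16760,
    "Bachelor's degree": 35153,
    "Advanced degree": 18567,
}
published_total = 191884  # total as printed in the table

# Percent of the published total, rounded to one decimal place as in the table.
percents = {level: round(100 * n / published_total, 1) for level, n in counts.items()}

count_sum = sum(counts.values())
percent_sum = round(sum(percents.values()), 1)

for level, pct in percents.items():
    print(f"{level:<26}{pct:>6.1f}")
print("Sum of rounded counts:  ", count_sum)
print("Sum of rounded percents:", percent_sum)
```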
Frequency Table – Two Categorical Variables

Are females more likely to say they always wear a seat belt? Are males more likely to
say they rarely or never wear one? A two-way table (contingency table) is used.
Females: 915/1467 = 62.4% said always wear seat belt.

Males: 771/1575 = 49.0% said always wear seat belt.
Males: 10.5% + 5.7% = 16.2% rarely or never wear one.
Females: 5.7% + 1.7% = 7.4% rarely or never wear one.
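The percentages above are row percentages: each count is divided by the total for that sex, not by the overall sample size. A minimal sketch using only the counts quoted on this slide (the dictionary layout is an assumption made for illustration):

```python
# "Always" counts and group totals quoted on the slide.
always = {"Female": 915, "Male": 771}
totals = {"Female": 1467, "Male": 1575}

# Row percentage: share of each sex that answered "always".
row_percent = {sex: round(100 * always[sex] / totals[sex], 1) for sex in always}

overall = round(100 * sum(always.values()) / sum(totals.values()), 1)

print(row_percent)
print("Overall:", overall)
```

The two group totals add up to the overall sample size of 3042, and the two "always" counts add up to the 1686 reported for the one-variable table.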
Frequency Table – Two Categorical Variables
Nightlights and Nearsightedness: survey of n = 479 children. Those
who slept with a nightlight or in a fully lit room before age 2 had a higher
incidence of nearsightedness (myopia) later in childhood.
Note: The study does not prove that sleeping with a light actually caused
myopia in more children.
Pie Chart & Bar Chart – One Categorical Variable

Survey of n = 190 college students. “Randomly pick a number
between 1 and 10.”
Results: Most chose 7, very few chose 1 or 10.

Bar Chart – Two or More Categorical Variables

Revisit the Nightlights and Nearsightedness Survey of n = 479 children.
Response Variable: Degree of Myopia
Explanatory Variable: Amount of Sleep time Lighting.
Box Plot – One Quantitative Variable

Box covers the middle 50% of the data, from lower quartile (median of lower half of the ordered
data values) to upper quartile (median of upper half of the ordered data values).
Line within box marks the median (middle value in the data).
Possible outliers are marked with an asterisk.
Apart from outliers, the lines extending from the box reach to the min. and max. values.
Good visual display of the spread of data.
Also good to identify outliers.
draw box plot

box plot to histogram
Side-by-Side Box Plot – One Quantitative Variable and One Categorical Variable

A side-by-side box plot displays two single box plots on the same graph.
Good for comparing different groups.
Stem-and-Leaf Plot – One Quantitative Variable

In stem-and-leaf plot, every individual data value is shown.
This plot is a quick way to summarize small data sets and is also useful for
ordering the data from lowest to highest.
A row in the plot starts with a “stem” and each stem gives the first part of a
data value.
A value within a row is called a “leaf” and it gives information about the
last part of a data value.
Good to sort the data.
State the unit of the stem and leaf and the size of the data set.
The plot shows both the distribution and the raw data.
Not suitable for large data sets.
Stem-and-Leaf Display
Data: 78 65 90 86 79 51 79 62 84 101
N = 10, Leaf unit = 1
Stem | Leaf
   5 | 1
   6 | 25
   7 | 899
   8 | 46
   9 | 0
  10 | 1
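A stem-and-leaf display for the ten values above can be built by splitting each value into a stem (the tens) and a leaf (the units). The function below is a minimal sketch for non-negative integers with leaf unit 1; its name and output layout are illustrative choices, not part of the slides.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return display lines for non-negative integers, leaf unit = 1."""
    groups = defaultdict(list)
    for v in sorted(values):          # sorting orders the leaves within each stem
        groups[v // 10].append(v % 10)
    lines = []
    for stem in range(min(groups), max(groups) + 1):  # keep empty stems too
        leaves = "".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines

data = [78, 65, 90, 86, 79, 51, 79, 62, 84, 101]
display = stem_and_leaf(data)
print("\n".join(display))
```

Reading the leaves row by row returns the data in ascending order, which is why the plot is also useful for sorting.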
Stem-and-Leaf Display
N = 15, Leaf unit = 1
Stem | Leaf
   3 | 8
   4 | 0014556779
   5 | 0138
The same data with each stem split in two (* for leaves 0 – 4, + for leaves 5 – 9):
Stem | Leaf
  3+ | 8
  4* | 0014
  4+ | 556779
  5* | 013
  5+ | 8
Dot Plot – One Quantitative Variable

The horizontal axis in dot plot covers the range from the smallest to the
largest data value.
A dot is placed above the number line located at the observation’s data
value.
When there are multiple observations with the same value, the dots are
stacked vertically.
Presents all the individual data values and is easy to create.
Histogram – One Quantitative Variable

Step 1: Decide how many equally spaced (same width) intervals to use for the horizontal
axis. Between 6 and 15 intervals is a good number.
Step 2: Decide to use frequencies (count) or relative frequencies (proportion) on the
vertical axis.
Step 3: Draw equally spaced intervals on horizontal axis covering entire range of the
data. Determine frequency or relative frequency of data values in each interval and draw
a bar with corresponding height. Decide rule to use for values that fall on the border
between two intervals.
Good for illustrating the shape of a distribution. Area represents frequency.
Histogram
Unequal bin sizes
The shape of the distribution is totally distorted if we use height
to represent frequency.
Histogram
$$\text{density} = \frac{\text{frequency} / \text{data size}}{\text{width}} = \frac{\text{relative frequency}}{\text{width}}$$
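With unequal bin widths, plotting density makes the area of each bar equal to its relative frequency. The sketch below applies density = relative frequency / width to a small made-up data set; both the values and the bin edges are hypothetical and chosen only for illustration.

```python
def density_histogram(values, edges):
    """Return (low, high, count, density) for each bin.

    Bins are [low, high), except the last bin, which also includes its upper edge.
    """
    n = len(values)
    bins = []
    last = len(edges) - 2
    for i, (low, high) in enumerate(zip(edges, edges[1:])):
        count = sum(1 for v in values if low <= v < high or (i == last and v == high))
        density = count / n / (high - low)   # relative frequency divided by width
        bins.append((low, high, count, density))
    return bins

# Hypothetical data and unequal bin edges (not from the slides).
values = [1, 3, 5, 7, 9, 12, 15, 18, 25, 35]
edges = [0, 10, 20, 40]

bins = density_histogram(values, edges)
total_area = sum(density * (high - low) for low, high, _, density in bins)

for low, high, count, density in bins:
    print(f"[{low}, {high}): count = {count}, density = {density:.3f}")
print("Total area:", round(total_area, 6))
```

The last bin is twice as wide as the others, so its bar is drawn at half the height that its count alone would suggest, and the areas still add up to 1.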
Histogram vs Bar Chart

Histograms and bar charts look very similar. What’s different about them?
With bar charts, each column represents a group defined by a categorical
variable.
[Figure: bar chart of per capita income for New Jersey, New Hampshire, New York, and New Mexico.]
With histograms, each column represents a group defined by a quantitative
variable.
[Figure: histogram of per capita income for the age groups 25-34, 35-44, 45-54, 55-64, and 65-74.]

Histogram vs Bar Chart
One implication of this distinction: it is always appropriate to talk about
the skewness of a histogram; that is, the tendency of the observations to
fall more on the low end or the high end of the x-axis.
With bar charts, however, the x-axis does not have a low end or a high end,
because the labels on the x-axis are categorical, not quantitative. As a
result, it is less appropriate to comment on the skewness of a bar chart.
Pros and Cons of the Four Visual Displays for Quantitative Variables

Box plots, stem-and-leaf plots, dot plots, and histograms
organize quantitative data in ways that let us begin to find the
information in a data set.
As to the question of which type of display is the best, there is
no unique answer.
The answer depends on what feature of the data may be of
interest and, to a certain degree, on the sample size.

Box plot
Strength:
Give a direct look at central location and spread as it
summarizes the five-number summary.
Can identify outliers.
Side-by-side box plot is an excellent tool for comparing two or
more groups.
Weakness:
Not entirely useful for judging shape.
Cannot distinguish between bell-shaped and bimodal distributions.

Stem-and-Leaf plot
Strength:
Excellent for sorting data.
With a sufficient sample size, it can be used to judge shape.
Weakness:
With a large sample size, a stem-and-leaf plot may be too
cluttered because the display shows all individual data values.
More restricted in the choices for “intervals” when compared to
histograms.

Dot plot
Strength:
Can present all individual data values.
Easy to create.
Weakness:
With a large sample size, a dot plot may be too cluttered.

Histogram
Strength:
Excellent for judging the shape of a data set with moderate or
large sample sizes.
Flexible in choosing the number as well as the width of the intervals
for the display.
Between 6 and 15 intervals usually gives a good picture of the
shape.
Weakness:
With a small sample size, a histogram may not “fill in”
sufficiently well to show the shape of the data.
With either too few intervals or too many, we may not see the
true shape of the data.
Misleading Graphs
Statistics can be misleading if not presented appropriately.
The same data can appear very different when graphed.
Suppose, for example, that two of your classmates are instructed
by your history professor to construct a graph of the number of
men and women who scored in the top half of the class on the
history exam.
3D effects also cause distortion, because the representation is not
proportional to the quantities.
[Figure 3.12: two bar diagrams of the number of people scoring in the top half of the history exam, for men and women. (a) Correct: vertical axis running from 0 to 15. (b) Incorrect: vertical axis with a break, marked from 10 to 14.]
FIGURE 3.12 Two bar diagrams showing the same results using different scales for frequency.
(a) The left graph shows the correct proportional relationship between men and women. (b) In the right
graph, putting a break in the vertical axis results in an incorrect proportional relationship.
Misleading Graphs
Both students construct bar diagrams with the same relative
scale for height and width (see Figure 3.12), but guess which one
wishes to convey that the women were far superior to the men?
You will sometimes see graphs with a break in the vertical axis
as in Figure 3.12. This is not appropriate in this case. Frequency
on the vertical axis should be continuous from zero. When we
put a break in the axis, we lose the proportional relationship
among class interval frequencies.
In Figure 3.12 (b), for example, it incorrectly appears as if
women did over twice as well as men. Figure 3.12 (a) shows the
correct proportional relationship.
Lying with Statistical Graphics
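Market penetration of four brands of cigarette: A 27%, B 37%, C 18%, D 18%
[Figure: two pie charts of the same market shares, drawn with the slices arranged differently.]
Monthly sales of the four brands (in $m)
[Figure: two bar charts of the same sales for brands A, B, C, D; one vertical axis is marked 0, 0.5, 1.0, 1.5 and the other is marked 0, 1.0, 1.1, 1.2.]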
Bad Graphic Designs
Representation not proportional to quantities.
Misleading alignment.
[Figure: percentage of college enrolment aged 25 and over.]
Shape of Frequency Distributions
What is the pattern of the distribution of scores over the range
of possible values? Are most of the scores in the middle, at one
end, or clustered in two distinct locations?
Certain shapes of frequency distributions occur with enough
regularity in statistical work that they have names. The names
effectively summarize the general characteristics of the
distribution.
Shape of Frequency Distributions

We illustrate several of them in Figure 3.13.
[Figure: six frequency curves plotted against scores: (a) J-shaped distribution, (b) positively skewed distribution, (c) negatively skewed distribution, (d) rectangular distribution, (e) bimodal distribution, (f) bell-shaped distribution.]
FIGURE 3.13 Shapes of some distributions that occur in statistical work.

Distribution name       Can result from
(a) J-shaped            plotting the speeds at which automobiles go through an intersection where a stop sign is present.
(b) Positively skewed   a test that is too difficult for most of the group taking it.
(c) Negatively skewed   a test that is too easy for most of the group taking it.
(d) Rectangular         an equal number of cases in all class intervals.
(e) Bimodal             measuring strength of grip in a group that contained both men and women.
(f) Bell-shaped         plotting a histogram of women’s heights in Hong Kong.
Bell-Shaped Distributions
Many measurements follow a predictable pattern:
Most individuals are clumped around the center.
The greater the distance a value is from the center, the fewer
individuals have that value.
Variables that follow such a pattern are said to be “bell-shaped”.
A special case is called a normal distribution or normal curve.
Bell-Shaped Distributions
Data: representative sample of 199 married British couples.
Below shows a histogram of the wives’ heights with a normal
curve superimposed. The mean height = 1602 millimeters.
Descriptive Statistics – Numerical Summaries
Outline
Summarizing Data
Suppose we are given the following sample data. How are we
going to summarize it?
n = 6, x1 = 11.6, x2 = 7.2, x3 = -3.1, x4 = 4.6, x5 = -7.7, x6 = 5.4
Apart from graphical summaries, we can also use numerical
summaries.
Numerical Summaries
1 Measures of Central Location: Mean, Median, Mode

2 Measures of Variability: Standard Deviation, Variance, Range,
Interquartile Range
3 Measures of Shape: Skewness, Kurtosis
4 Outlier
5 Other Measure: Coefficient of Variation
Distinguish between Central Location, Variability, Shape

[Figure: three pairs of frequency polygons. (a) Equal means, unequal variability. (b) Equal variability, unequal means. (c) Equal variability, equal means, different shapes.]
FIGURE 5.1 Differences in central tendency, variability, and shape of frequency distributions (polygons).
1. Mean, Median, Mode
Measures of Central Location (the center (or average) of the data): Mean,
Median, Mode
Mean is the arithmetic average: $\bar{x} = \frac{\sum x}{n}$
Median is the middle value in the data
Mode is the value in the data set that occurs most frequently
Mean, Median, Mode
For example, in this given set of data, n = 6, x1 = 11.6,
x2 = 7.2, x3 = -3.1, x4 = 4.6, x5 = -7.7, x6 = 5.4.
Mean: $\bar{x} = \frac{11.6 + 7.2 - 3.1 + 4.6 - 7.7 + 5.4}{6} = 3$
Median: First rearrange the data in ascending/descending order.
Ascending order: -7.7, -3.1, 4.6, 5.4, 7.2, 11.6
There are six values, so the median is the average of the
3rd and 4th values.
$\text{median} = \frac{4.6 + 5.4}{2} = 5$
Mode: Since there are no repeated values, the mode is not
applicable in this case.
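The worked example can be reproduced with Python's standard library. This sketch assumes x3 = -3.1 and x5 = -7.7, the signs under which the stated mean of 3 and median of 5 hold; the helper `modes` is a hypothetical name that returns an empty list when no value repeats.

```python
from collections import Counter
from statistics import mean, median

data = [11.6, 7.2, -3.1, 4.6, -7.7, 5.4]

def modes(values):
    """Return the most frequent value(s), or [] if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)

data_mean = mean(data)
data_median = median(data)   # averages the 3rd and 4th ordered values when n = 6
data_modes = modes(data)

print("mean   =", round(data_mean, 4))
print("median =", round(data_median, 4))
print("modes  =", data_modes)
```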
Mean, Median, Mode

Question 1
Find the mean, median, mode in this data set: n = 8, x1 = 3, x2 = 1,
x3 = 0, x4 = 7, x5 = 0, x6 = 2, x7 = 9, x8 = 0.
Mean, Median, Mode

Question 2
Five integers, each less than 9, have
mean = 5, mode = 3, median = 4.
Find ALL five integers.
Mean as the Balance Point of a Distribution

Unlike the median and the mode, the mean is responsive to the exact
position of each score in the distribution.
Inspect the basic formula $\sum x / n$. It shows that increasing or
decreasing the value of any score changes $\sum x$ and thus also changes
the value of the mean.
The mean is the balance point of a distribution.
[Figure: scores plotted on an axis from 8 to 18 with mean $\bar{X} = 14$. Deviations $(X - \bar{X})$: -6, -2, -2, -2, -1 below the mean and +2, +3, +4, +4 above it, so $\sum (X - \bar{X}) = -13$ on the left and $+13$ on the right.]
FIGURE 4.2 The mean as the balance point of a distribution.
Mean as the Balance Point of a Distribution
$$\bar{x} = \sum x / n \implies \sum (x - \bar{x}) = 0$$
This says that if we express the scores in terms of the amount by
which they deviate from their mean, taking into account the
negative and positive deviations, their sum is zero.
To put it another way, the sum of the negative deviations from
the mean exactly equals, in magnitude, the sum of the positive deviations.
So when a measure of central tendency should reflect the total
of the scores, the mean is the best choice.
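The balance-point property is easy to verify numerically. The scores below are an assumption: they are read back from the deviations shown in Figure 4.2 (mean 14, deviations -6, -2, -2, -2, -1, +2, +3, +4, +4), not listed explicitly on the slide.

```python
# Scores inferred from the deviations in Figure 4.2 (an assumption).
scores = [8, 12, 12, 12, 13, 16, 17, 18, 18]

mean_score = sum(scores) / len(scores)
deviations = [x - mean_score for x in scores]

negative_sum = sum(d for d in deviations if d < 0)
positive_sum = sum(d for d in deviations if d > 0)

print("mean =", mean_score)
print("sum of negative deviations =", negative_sum)
print("sum of positive deviations =", positive_sum)
print("sum of all deviations      =", sum(deviations))
```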
Median in the Case with Outliers

The median is less sensitive than the mean to the presence of a few extreme
scores (called outliers).
Consider, for example, the money earned by the top 200 professionals on
the men’s PGA (golf) tour in 2009 (see Table 4.1).
TABLE 4.1 Money Earned by the Top 200 Players on the 2009 PGA Tour
RANK   PLAYER            MONEY ($)
1      Tiger Woods       10,508,163
2      Steve Stricker     6,332,636
3      Phil Mickelson     5,332,755
4      Zach Johnson       4,714,813
5      Kenny Perry        4,445,562
...    ...               ...
96     James Nitties        931,532
97     Kevin Stadler        925,514
98     Michael Letzig       896,478
99     Lee Janzen           871,187
100    Ted Purdy            838,707
...    ...               ...
196    Dudley Hart          158,399
197    Greg Kraft           156,686
198    Kirk Triplett        155,480
199    John Huston          135,476
200    Michael Sim          130,188
Median in the Case with Outliers
The mean earning was $1,253,638, but the median was
$837,886. The money earned by the best player, Tiger Woods,
was over $4 million greater than that of the second-ranked
player, Steve Stricker, and Stricker’s earnings were almost $1
million greater than those of the third-ranked player.
The earnings of these two players strongly affected the total, and
hence the mean, but their values did not affect the median.
Imagine the change if Tiger Woods had earned $5 million more.
Therefore, in distributions that are strongly asymmetrical
(skewed) or have a few very deviant scores, the median may be
the better choice for measuring the central tendency.
HKU LLB Graduate Employment Statistics 2013 – Mean or Median?

Below is extracted from the “Graduate Employment Statistics 2013 for LLB Graduates”
(“Where did the 2013 Bachelor of Laws graduates go?”) published by Careers and Placement,
HKU CEDARS (Centre of Development and Resources for Students).
VI. Basic Salary and Gross Income
The remuneration received by Bachelor of Laws graduates is shown below.
          Basic Salary                          Gross Income
          LLB               HKU Average*        LLB               HKU Average*
          2013     2012     2013     2012       2013     2012     2013     2012
Mean      $21,929  $20,547  $18,778  $18,662    $22,955  $21,224  $19,547  $19,598
Median    $16,000  $18,000  $15,000  $15,000    $17,958  $18,625  $16,000  $15,833
Minimum   $10,000  $9,000   $5,000   $2,000     $10,000  $9,750   $5,000   $6,000
Maximum   $60,000  $45,000  $83,333  $60,000    $60,000  $60,000  $83,333  $150,000
* HKU Average refers to the figure for the total HKU population and includes M.B.,B.S. and B.D.S. graduates.
VII. Number of Full-Time Job Offers
The number of job offers received by Bachelor of Laws graduates is shown in the following table.
No. of job offers received   No. of graduates (% of graduates)
One                          5 (38%)
Discussion topics:
e.g., What can we say about the statistics? Should we use the mean or the median in
measuring the central location? What other information should be added here to make the
summary more complete?
Measures of Central Location
n = 6: x1 = 11.6, x2 = 7.2, x3 = -3.1, x4 = 4.6, x5 = -7.7, x6 = 5.4
Mean (average, arithmetic mean):
$$\bar{x} = \frac{\sum x}{n} = \frac{1}{n}(x_1 + x_2 + \cdots + x_n)$$
$$\sum x = 11.6 + 7.2 - 3.1 + 4.6 - 7.7 + 5.4 = 18$$
$$\bar{x} = \frac{18}{6} = 3$$
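Mean as a Balance Point
[Figure: data on a number line with $\bar{x} = 3$; negative deviations on one side and positive deviations on the other.]
The mean balances the total deviations on both sides.
Mean as a Balance Point
[Figure: a distribution is unbalanced if supported at the peak and balanced if supported at the mean.]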
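Measure of Central Location
Median: the number with the middle rank
$$\text{median} = \begin{cases} \left(\frac{n+1}{2}\right)\text{th number} & \text{if } n \text{ is odd} \\ \text{average of the } \left(\frac{n}{2}\right)\text{th and } \left(\frac{n}{2}+1\right)\text{th numbers} & \text{if } n \text{ is even} \end{cases}$$
Data: 11.6 7.2 -3.1 4.6 -7.7 5.4
Sorted: -7.7 -3.1 4.6 5.4 7.2 11.6
$$\text{median} = \frac{4.6 + 5.4}{2} = 5$$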
Median as a Balance Point
[Figure: median = 5, with 50% of the data on each side.]
The median balances the number of observations on both sides,
ignoring their exact positions.
Starting salary of five graduates (in $1000)
Data: 13 14 15 19 20
$\bar{x} = \frac{13 + 14 + 15 + 19 + 20}{5} = 16.2$, median = 15
Data: 13 14 15 19 53
$\bar{x} = \frac{13 + 14 + 15 + 19 + 53}{5} = 22.8$, median = 15
The mean is sensitive to outliers; the median is robust.
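The salary example can be replayed in a few lines. This is a minimal sketch using the two data sets above; only the variable names are introduced here.

```python
from statistics import mean, median

salaries = [13, 14, 15, 19, 20]               # in $1000
salaries_with_outlier = [13, 14, 15, 19, 53]  # largest value replaced by 53

for label, values in [("original", salaries), ("with outlier", salaries_with_outlier)]:
    print(f"{label:<13} mean = {mean(values):.1f}, median = {median(values)}")

mean_shift = mean(salaries_with_outlier) - mean(salaries)
median_shift = median(salaries_with_outlier) - median(salaries)
print("mean shift =", round(mean_shift, 1), " median shift =", median_shift)
```

Replacing a single value moves the mean by 6.6 (thousand dollars) while the median does not move at all.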

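Mode: the value in the data set that occurs most frequently (the peak of the distribution).
Example: the most frequent age at death falls in the modal class 60 - 65.
[Figure: a skewed distribution marking the mode at the peak, the median with 50% of the data on each side, and the mean.]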
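Measures of Variation
[Figure: distributions compared by spread; one is spread out wider.]

Measure of Variation
[Figure: data on a number line with $\bar{x} = 3$; positive and negative deviations.]
$\sum (x - \bar{x}) = 0$
Measure the variation by $\sum (x - \bar{x})^2$.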
Measure of Variation
Data   Deviation from the mean   Deviation squared
$x$    $x - \bar{x}$             $(x - \bar{x})^2$
32     32 - 50.6 = -18.6         345.96
71     71 - 50.6 = +20.4         416.16
64     64 - 50.6 = +13.4         179.56
50     50 - 50.6 = -0.6            0.36
48     48 - 50.6 = -2.6            6.76
63     63 - 50.6 = +12.4         153.76
38     38 - 50.6 = -12.6         158.76
41     41 - 50.6 = -9.6           92.16
47     47 - 50.6 = -3.6           12.96
52     52 - 50.6 = +1.4            1.96
$\sum x = 506$, $\sum (x - \bar{x}) = 0$, $\sum (x - \bar{x})^2 = 1368.4$
Standard Deviation
Population s.d.: $\sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$
Sample s.d.: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
Sample s.d. of previous sample (n = 10): $s = \sqrt{\frac{1368.4}{10 - 1}} = 12.33$
Sample variance: $s^2 = \frac{1368.4}{10 - 1} = 152.04$
Standard Deviation
$$s = \sqrt{\frac{\sum x^2 - (\sum x)^2 / n}{n - 1}}$$
Data   Squared data
$x$    $x^2$
32     1024
71     5041
64     4096
50     2500
48     2304
63     3969
38     1444
41     1681
47     2209
52     2704
$\sum x = 506$, $\sum x^2 = 26972$
$$\sum x^2 - \frac{(\sum x)^2}{n} = 1368.4, \qquad s = \sqrt{\frac{1368.4}{10 - 1}} = 12.33$$
Standard Deviation
[Figure: data with $\bar{x} = 3$; s.d. decreases.]
Measure the variation by $\sum (x - \bar{x})^2$.

Standard Deviation
[Figure: data with $\bar{x} = 3$; s.d. increases.]
Standard deviation is sensitive to outliers.

Standard Deviation
Mathematical result
The mean minimizes the sum of squared deviations.
$$\sum (x - \bar{x})^2 < \sum (x - \text{any other value})^2$$
Question: What minimizes the sum of absolute deviations?
$$\sum |x - \,?\,| < \sum |x - \text{any other value}|$$
Answer: the median.

Standard Deviation
How does the sum of absolute deviations change when the vertical
line is moved to the left? To the right?
Coefficient of Variation
Monthly revenues from two markets (in $m)
Market A: mean = 9.5 sd = 3.9

3 5 6 7 8 9 9 11 13 14 14 15
Market B: mean = 73.5 sd = 3.9

66 67 71 73 73 74 74 75 76 77 77 79
In which market does the company have the larger variation in
revenues?
It would be better to compare relative spreads.
Market A: mean = 9.5, sd = 3.9, CV = 41.0%
Market B: mean = 73.5, sd = 3.9, CV = 5.3%
$$\text{Coefficient of Variation } (CV) = \frac{s}{\bar{x}} \times 100\%$$
The revenue flow from Market B is relatively more stable.
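The comparison of the two markets can be reproduced from the monthly revenues listed above. This sketch uses the sample standard deviation (divisor n - 1); `cv_percent` is a hypothetical helper name. Because it keeps the unrounded standard deviation, the CV it gives for Market A is slightly larger than the value obtained from the rounded sd of 3.9.

```python
from statistics import mean, stdev

market_a = [3, 5, 6, 7, 8, 9, 9, 11, 13, 14, 14, 15]
market_b = [66, 67, 71, 73, 73, 74, 74, 75, 76, 77, 77, 79]

def cv_percent(values):
    """Coefficient of variation: sample s.d. divided by the mean, in percent."""
    return stdev(values) / mean(values) * 100

for name, values in [("A", market_a), ("B", market_b)]:
    print(f"Market {name}: mean = {mean(values):.1f}, "
          f"sd = {stdev(values):.2f}, CV = {cv_percent(values):.1f}%")
```

Both markets have the same standard deviation, so only the relative measure separates them.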

Is It Permissible to Calculate the Mean for Tests in the Behavioral Sciences?

Questionnaires with items like this are common. To indicate
their attitudes, respondents circle numbers.
Colleges should be free to pay salaries to their athletes.

Strongly disagree -3 -2 -1 0 +1 +2 +3 Strongly agree
How should we summarize a sample of such numbers? By
calculating the mean?
(Refer to supplementary material: Is It Permissible to Calculate
the Mean for Tests in the Behavioral Sciences? for detail.)

First of all, we have to ask ourselves a question: “Is the
measurement on this scale interval or ordinal?”
This is not exactly interval: Consider two attitudes, one
represented by -3 and the other by -2. The difference between
those attitudes in favorability to salaries for college athletes is
probably not the same as the difference in favorability between
attitudes represented by, say, +1 and +2. So a one-point
difference between scores does not necessarily signify an equal
difference in attitudes all along the scale.

This is not exactly ordinal either: Does the scale then fall only
at the ordinal level of measurement? If so, the seven numbers
along the scale would indicate only a rank ordering from the
least favorable attitude to the most favorable. But there is
probably more information in the numbers than that. A
two-point difference between scores probably signifies a greater
difference in favorability than a one-point difference.

Measurement on this scale is therefore likely to lie somewhere
between the ordinal and the interval levels of sophistication.
So are we or are we not justified in calculating the mean to
summarize a sample of such scores? Some say yes and others
say no.
The same is true with many other measuring instruments used in
the behavioral sciences — with inventories of moods like anger
and elation, with assessments of personality traits like
extraversion and conscientiousness, with tests of aptitudes and
achievements. They do not yield scores that carry as much
intrinsic meaning as, say, temperatures and weights, but they tell
us more than ranks do.
2. Standard Deviation, Range, Interquartile Range

Measures of variability (the degree of spread (or dispersion) of
the data): Standard Deviation, Range, Interquartile Range
Standard Deviation (s.d.)
$$\text{Population s.d.} = \sqrt{\frac{\text{sum of squared deviations}}{\text{sample size}}} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$$
$$\text{Sample s.d.} = s = \sqrt{\frac{\text{sum of squared deviations}}{\text{sample size} - 1}} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
The divisor n - 1 is the degrees of freedom: when n = 1 there is no
comparison to make and the sample s.d. is undefined; when n = 3 there
are 2 comparisons.
Squaring takes away the sign of each deviation.
The value of the squared standard deviation is called the
variance.
The larger the standard deviation, the greater the dispersion of
the data.
2. Standard Deviation, Range, Interquartile Range
Range is the difference between the maximum and minimum
values.
Range = maximum value - minimum value
Interquartile Range is the difference between the upper
quartile and lower quartile.
IQR = Q3 - Q1
where
Q1 = lower quartile = median of lower half of the ordered data values
Q3 = upper quartile = median of upper half of the ordered data values
Calculation of Standard Deviation

Consider a sample of 10 data values.
TABLE 5.1 Calculation of the Variance: Deviation-Score Method
① $X$   ③ $(X - \bar{X})$        ④ $(X - \bar{X})^2$
32      32 - 50.6 = -18.6        345.96
71      71 - 50.6 = +20.4        416.16
64      64 - 50.6 = +13.4        179.56
50      50 - 50.6 = -0.6           0.36
48      48 - 50.6 = -2.6           6.76
63      63 - 50.6 = +12.4        153.76
38      38 - 50.6 = -12.6        158.76
41      41 - 50.6 = -9.6          92.16
47      47 - 50.6 = -3.6          12.96
52      52 - 50.6 = +1.4           1.96
$\sum X = 506$, $\sum (X - \bar{X}) = 0$, ⑤ $\sum (X - \bar{X})^2 = 1368.40$
② $\bar{X} = \frac{\sum X}{n} = \frac{506}{10} = 50.6$
⑥ $S_X^2 = \frac{\sum (X - \bar{X})^2}{n} = \frac{1368.40}{10} = 136.84$
Sample s.d.: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{1368.40}{10 - 1}} = \sqrt{152.04} = 12.33$
Calculation of Standard Deviation (A Quicker Way)

TABLE 5.2 Calculation of the Standard Deviation: Raw-Score Method
① $X$   ② $X^2$
32      1,024
71      5,041
64      4,096
50      2,500
48      2,304
63      3,969
38      1,444
41      1,681
47      2,209
52      2,704
③ $\sum X = 506$, $\sum X^2 = 26{,}972$
Actually, $\sum (x - \bar{x})^2 = \sum x^2 - (\sum x)^2 / n$ (Try to prove)
Therefore, $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum x^2 - (\sum x)^2 / n}{n - 1}}$
Sample s.d.: $s = \sqrt{\frac{26972 - 506^2 / 10}{10 - 1}} = \sqrt{152.04} = 12.33$
Same result as before.
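Both routes to the standard deviation can be checked against each other. The sketch below computes the sum of squared deviations with the deviation-score method and with the raw-score shortcut for the ten values in the tables; variable names are illustrative.

```python
import math
from statistics import stdev

data = [32, 71, 64, 50, 48, 63, 38, 41, 47, 52]
n = len(data)
mean_x = sum(data) / n

# Deviation-score method: sum of (x - mean)^2.
ss_deviation = sum((x - mean_x) ** 2 for x in data)

# Raw-score method: sum of x^2 minus (sum of x)^2 / n.
ss_raw = sum(x * x for x in data) - sum(data) ** 2 / n

sample_variance = ss_deviation / (n - 1)
sample_sd = math.sqrt(sample_variance)

print("mean =", mean_x)
print("sum of squares (deviation method) =", round(ss_deviation, 2))
print("sum of squares (raw-score method) =", round(ss_raw, 2))
print("sample variance =", round(sample_variance, 2))
print("sample s.d.     =", round(sample_sd, 2))
```

The standard library's `statistics.stdev` uses the same n - 1 divisor, so it can serve as a third check.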
Properties of Standard Deviation
The standard deviation, like the mean, is responsive to the exact
position of every score in the distribution.
Because it is calculated by taking deviations from the mean, if a
score is shifted to a position more deviant from the mean, the
standard deviation will increase. If the shift is to a position
closer to the mean, the standard deviation decreases.
When we calculate deviations from the mean, the sum of squares
of these values is smaller than if they had been taken about any
other point. Putting it another way, (Try to prove)
the mean minimizes the sum of squared deviations:
$$\sum (x - \bar{x})^2 < \sum (x - \text{any other score})^2$$
the median minimizes the sum of absolute deviations.
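These two minimization properties can be explored numerically before attempting a proof. The sketch below evaluates both sums over a grid of candidate centers for the ten values used earlier; the grid (30.0 to 71.0 in steps of 0.5) is an arbitrary choice made for illustration. With an even number of values, every point between the two middle values gives the same smallest sum of absolute deviations.

```python
from statistics import mean, median

data = [32, 71, 64, 50, 48, 63, 38, 41, 47, 52]

def sum_squared(center):
    return sum((x - center) ** 2 for x in data)

def sum_absolute(center):
    return sum(abs(x - center) for x in data)

# Candidate centers 30.0, 30.5, ..., 71.0 (arbitrary grid).
candidates = [c / 2 for c in range(60, 143)]

best_for_squares = min(candidates, key=sum_squared)
smallest_absolute = min(sum_absolute(c) for c in candidates)

print("mean =", mean(data), " grid point with smallest sum of squares =", best_for_squares)
print("sum of squares at the mean =", round(sum_squared(mean(data)), 2))
print("median =", median(data), " sum of absolute deviations there =", sum_absolute(median(data)))
print("smallest sum of absolute deviations on the grid =", smallest_absolute)
```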
Calculation of Range and Interquartile Range
Using the same data in the previous slide.

We should rearrange the data in ascending order before
calculating the range and the interquartile range.
Data set (in ascending order): 32, 38, 41, 47, 48, 50, 52, 63, 64, 71
Range = max - min = 71 - 32 = 39
Since n = 10, lower quartile Q1 = 41 and upper quartile
Q3 = 63
Therefore, IQR = Q3 - Q1 = 63 - 41 = 22
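The quartile rule used here (median of the lower half and median of the upper half) can be written as a short function. This is a sketch of that rule only; when n is odd it leaves the middle value out of both halves, which is an assumption, and other software may use different quartile conventions.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]     # lower half of the ordered data
    upper = ordered[-half:]    # upper half (middle value excluded when n is odd)
    return median(lower), median(ordered), median(upper)

data = [32, 71, 64, 50, 48, 63, 38, 41, 47, 52]
q1, q2, q3 = quartiles(data)
data_range = max(data) - min(data)
iqr = q3 - q1

print("Range =", data_range)
print("Q1 =", q1, " median =", q2, " Q3 =", q3)
print("IQR =", iqr)
```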
3. Skewness, Kurtosis
Measures of Shape: Skewness, Kurtosis

Skewness is a measure of a data set’s deviation from symmetry.
$$\text{skewness} = \frac{m_3}{m_2^{3/2}}$$
where
$m_2$ (the second sample moment about mean) $= \frac{\sum (x - \bar{x})^2}{n}$
$m_3$ (the third sample moment about mean) $= \frac{\sum (x - \bar{x})^3}{n}$
The cubic function magnifies the deviations.
Skewness, Kurtosis
The value of this measure generally lies between -3 and +3. The
closer the value lies to -3, the more the distribution is skewed left
(negatively skewed). The closer the value lies to +3, the more the
distribution is skewed right (positively skewed). A value close to 0
indicates a symmetric distribution.
A normal distribution is symmetric and has skewness of 0.
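The moment-based skewness can be computed directly from its definition. In this sketch the function names are illustrative; the symmetric and right-skewed lists are small made-up examples, and the salary list is the one with the outlier from the earlier mean-versus-median example.

```python
def central_moment(values, k):
    """k-th sample moment about the mean (divisor n)."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** k for x in values) / n

def skewness(values):
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

symmetric = [1, 2, 3, 4, 5]        # made-up symmetric data
right_skewed = [1, 1, 1, 1, 6]     # made-up data with a long right tail
salaries = [13, 14, 15, 19, 53]    # starting salaries with one outlier

print("symmetric:   ", skewness(symmetric))
print("right-skewed:", skewness(right_skewed))
print("salaries:    ", round(skewness(salaries), 3))
```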
Skewness, Kurtosis
There are other, simpler measures of skewness:
1 Pearson mode skewness or first skewness coefficient
$$\text{skewness} = \frac{\text{mean} - \text{mode}}{\text{standard deviation}}$$
mean < (>) mode ⟹ distribution is negatively (positively) skewed

2 Pearson median skewness or second skewness coefficient
$$\text{skewness} = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}}$$
mean < (>) median ⟹ distribution is negatively (positively) skewed

3 Bowley skewness or quartile skewness coefficient (a more robust measure, used more often)
$$\text{skewness} = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1} = \frac{Q_1 - 2Q_2 + Q_3}{Q_3 - Q_1}$$
where $Q_2$ = median and $Q_3 - Q_1$ = IQR
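As an illustration, the Pearson median skewness and the Bowley skewness can be evaluated for the ten values used in the standard deviation tables. This sketch assumes the sample standard deviation (divisor n - 1) and the median-of-halves quartiles (Q1 = 41, Q3 = 63) from the earlier calculation.

```python
from statistics import mean, median, stdev

data = sorted([32, 71, 64, 50, 48, 63, 38, 41, 47, 52])
half = len(data) // 2

q1 = median(data[:half])    # median of the lower half
q2 = median(data)
q3 = median(data[-half:])   # median of the upper half

pearson_median = 3 * (mean(data) - q2) / stdev(data)
bowley = (q1 - 2 * q2 + q3) / (q3 - q1)

print("Pearson median skewness =", round(pearson_median, 3))
print("Bowley skewness         =", round(bowley, 3))
```

Both coefficients are positive here, consistent with the mean (50.6) lying above the median (49).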
Skewness, Kurtosis
[Figure: three frequency curves plotted against scores. (a) Skewed negatively (to the left): $\bar{X}$, Mdn, Mo in that order from left to right. (b) Skewed positively (to the right): Mo, Mdn, $\bar{X}$ in that order from left to right. (c) Normal distribution: $\bar{X}$, Mdn, and Mo coincide.]
FIGURE 4.3 $\bar{X}$, Mdn, and Mo in skewed distributions and in the normal distribution.

Distribution          Coefficient of Skewness   Measures of Central Location
Symmetrical           0                         Mean = Median = Mode
Skewed to the right   > 0                       Mean > Median > Mode
Skewed to the left    < 0                       Mean < Median < Mode
Skewness, Kurtosis
Kurtosis is a measure of the peakedness of a distribution (it actually measures the tails).
$$\text{kurtosis} = \frac{m_4}{m_2^2}$$
where
$m_4$ (the fourth sample moment about mean) $= \frac{\sum (x - \bar{x})^4}{n}$
$m_2$ (the second sample moment about mean) $= \frac{\sum (x - \bar{x})^2}{n}$
Excess kurtosis is defined as the kurtosis minus 3 (3 is the kurtosis of a normal distribution), i.e.,
excess kurtosis = kurtosis - 3
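Kurtosis follows the same pattern as skewness, with the fourth moment in place of the third. A minimal sketch from the definition above; the function names and the two small data sets are made up for illustration.

```python
def central_moment(values, k):
    """k-th sample moment about the mean (divisor n)."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** k for x in values) / n

def kurtosis(values):
    return central_moment(values, 4) / central_moment(values, 2) ** 2

def excess_kurtosis(values):
    return kurtosis(values) - 3

flat = [1, 2, 3, 4, 5]                   # evenly spread values
heavy_tailed = [-10, 0, 0, 0, 0, 0, 10]  # mostly central values plus two far-out ones

print("flat:         kurtosis =", round(kurtosis(flat), 3),
      " excess =", round(excess_kurtosis(flat), 3))
print("heavy-tailed: kurtosis =", round(kurtosis(heavy_tailed), 3),
      " excess =", round(excess_kurtosis(heavy_tailed), 3))
```

The evenly spread values give a negative excess kurtosis, while the values with a few far-out observations give a positive one.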
Skewness, Kurtosis
Several well-known, uni-modal and symmetric distributions from different parametric
families are compared here.
Each has a mean and skewness of zero.
The parameters have been chosen to result in a variance equal to 1 in each case.
N denotes the standard normal curve, with excess kurtosis equal to 0.
Generally, if a distribution has a greater excess kurtosis, it has a higher peak and thicker
tails, compared to another distribution of the same kind.
4. Outlier
Outlier is a data point that is not consistent with the bulk of the
data
Look for them via graphs
Can have big influence on conclusions
Can cause complications in some statistical analysis
Cannot discard without justification
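The general guideline used in a box plot to identify an outlier is:
If an observation is outside the range
$[Q_1 - 1.5\,\mathrm{IQR},\ Q_3 + 1.5\,\mathrm{IQR}]$, then it is regarded as an outlier.
Mild outlier: outside this range but within $[Q_1 - 3\,\mathrm{IQR},\ Q_3 + 3\,\mathrm{IQR}]$.
Extreme outlier: outside $[Q_1 - 3\,\mathrm{IQR},\ Q_3 + 3\,\mathrm{IQR}]$.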
Outlier
Possible reasons for outliers and what to do about them:

Outlier is legitimate data value and represents natural variability
for the group and variable(s) measured. Values may not be
discarded. They provide important information about location
and spread.
Mistake made while taking measurement or entering it into
computer. If verified, should be discarded or corrected.
Individual in question belongs to a different group than the bulk of
individuals measured. Values may be discarded if summary is
desired and reported for the majority group only.
Outlier
Question
Find the outliers in the following data of the ages of actresses at the
time they first won the Oscar:
21, 24, 25, 26, 26, 26, 26, 27, 28, 30, 30, 31, 31,
33, 33, 33, 34, 34, 34, 34, 35, 35, 35, 37, 37, 38,
39, 41, 41, 41, 42, 44, 49, 50, 60, 61, 61, 74, 80
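One way to check an answer to this question is to apply the 1.5 IQR guideline in code. The sketch below assumes the median-of-halves quartile rule, with the middle value left out of both halves because n = 39 is odd; `fences` is a hypothetical helper name.

```python
from statistics import median

ages = [21, 24, 25, 26, 26, 26, 26, 27, 28, 30, 30, 31, 31,
        33, 33, 33, 34, 34, 34, 34, 35, 35, 35, 37, 37, 38,
        39, 41, 41, 41, 42, 44, 49, 50, 60, 61, 61, 74, 80]

def fences(values, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[-half:])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

low, high = fences(ages)
outliers = [a for a in ages if a < low or a > high]

print("fences:", low, high)
print("outliers:", outliers)
```

Calling `fences(ages, k=3)` gives the wider range used to separate extreme outliers from mild ones.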
5. Coefficient of Variation
The standard deviation measures the variation in a set of data.

For decision makers, the standard deviation indicates how spread
out a distribution is.
For distributions having the same mean, the distribution with the
largest standard deviation has the greatest relative spread.
When two or more distributions have different means, the
relative spread cannot be determined by merely comparing the
standard deviations.
The coefficient of variation (CV) is used to measure the relative
variation for distributions with different means.
It should only be used for data of the same sign (all positive or all negative).
Sample coefficient of variation: CV = (s / x̄) × 100%

where
s = sample standard deviation
x̄ = sample mean
When the coefficients of variation for two or more distributions
are compared, the distribution with the largest CV is said to
have the greatest relative spread.
In finance, CV measures the relative risk of a stock portfolio.
Assume portfolio A has a collection of stocks that average a
12% return with a standard deviation of 3% and portfolio B has
an average return of 6% with a standard deviation of 2%.
We can compute the CV values for each as follows:
CV(A) = (3/12)(100%) = 25%
CV(B) = (2/6)(100%) = 33%
Even though portfolio B has a lower standard deviation, it would
be considered riskier than portfolio A because B’s CV is 33%
and A’s CV is 25%.
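The portfolio comparison can be reproduced with a short sketch. The function name is illustrative, and the inputs are the percentages quoted above.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the CV as a percentage: (standard deviation / mean) * 100."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return std_dev / mean * 100


cv_a = coefficient_of_variation(3, 12)  # portfolio A: 3% SD, 12% mean return
cv_b = coefficient_of_variation(2, 6)   # portfolio B: 2% SD, 6% mean return
print(f"CV(A) = {cv_a:.0f}%, CV(B) = {cv_b:.0f}%")
print("Riskier portfolio:", "B" if cv_b > cv_a else "A")
```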
Normal Distribution and Other Statistics
Outline
Normal Distribution
Normal Distribution (or Gaussian Distribution). A distribution is the overall
pattern of how often the possible values occur. Characteristics:
The distribution is symmetrical about its mean, µ.

Other measures of central location, such as the median and the
mode, equal the mean.
A value taken from a normal distribution varies from −∞ to
+∞.
A normal distribution has two parameters, µ and σ.
The value µ locates the center of the distribution.
The value σ (the standard deviation) indicates the dispersion
within the distribution.
The normal distribution is really a family of distributions.
Different members of this family exist for each possible different
pair of values of µ and σ.
Normal Distribution
These curves are called probability densities.

Normal Distribution
A random variable X following a normal distribution with mean µ
and variance σ² is denoted as X ~ N(µ, σ²), where N stands for normal distribution.
The probability density function (pdf) of X is
$$ f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}, \qquad -\infty < x < \infty, $$
where e = a mathematical constant (approximately 2.718282)
µ = mean of the random variable X
σ = standard deviation of the random variable X
π = a mathematical constant (approximately 3.141593)
The Z-transformation for X gives the standard normal
distribution:
$$ Z = \frac{X - \mu}{\sigma} \sim N(0, 1) $$
Aim: One size fits all
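As a rough check of the density formula and the Z-transformation, the sketch below evaluates the pdf by hand and compares it with `statistics.NormalDist` from the Python standard library. The values µ = 100, σ = 15 and x = 110 are chosen only for illustration.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Normal density written out from the formula above."""
    z = (x - mu) / sigma  # Z-transformation
    return math.exp(-0.5 * z ** 2) / (sigma * math.sqrt(2 * math.pi))


mu, sigma = 100, 15  # illustrative parameters only
x = 110

by_formula = normal_pdf(x, mu, sigma)
by_library = NormalDist(mu, sigma).pdf(x)
print(by_formula, by_library)

# "One size fits all": the density of X is the standard normal density
# evaluated at z, rescaled by 1 / sigma.
z = (x - mu) / sigma
via_standard = normal_pdf(z, 0, 1) / sigma
print(via_standard)
```

All three printed values should agree up to floating-point rounding.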
How to Read the Standard Normal Table

TABLE A Areas under the Normal Curve Corresponding to Given Values of z
Column 2 gives the proportion of the area under the entire curve that is between the mean (z = 0) and the positive value of z. Areas for negative values of z are the same as for positive values because the curve is symmetrical.
Column 3 gives the proportion of the area under the entire curve that falls beyond the stated positive value of z. Areas for negative values of z are the same because the curve is symmetrical.

  z     AREA BETWEEN   AREA         z     AREA BETWEEN   AREA
        MEAN AND z     BEYOND z           MEAN AND z     BEYOND z
 (1)       (2)           (3)       (1)       (2)           (3)
 0.00     .0000         .5000      0.27     .1064         .3936
 0.01     .0040         .4960      0.28     .1103         .3897
 0.02     .0080         .4920      0.29     .1141         .3859
 0.03     .0120         .4880      0.30     .1179         .3821
 0.04     .0160         .4840      0.31     .1217         .3783
(This is only a part of the table, refer to the Appendix – Standard Normal Table for detail.)
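The two table columns can be regenerated from the standard normal cumulative distribution function. This is a minimal sketch using `statistics.NormalDist`, and the helper name `table_row` is made up for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(0, 1)


def table_row(z):
    """Return (area between mean and z, area beyond z) for z >= 0."""
    below = std_normal.cdf(z)      # area to the left of z
    return below - 0.5, 1 - below  # column 2, column 3


for z in (0.00, 0.01, 0.27, 0.30):
    between, beyond = table_row(z)
    print(f"{z:.2f}  {between:.4f}  {beyond:.4f}")
```

For any z the two columns add up to 0.5, because together they cover the half of the curve to the right of the mean.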
Percentile
The k-th percentile is a number that has k% of the data values at or
below it and (100 − k)% of the data values at or above it.
Lower quartile, median, upper quartile are special cases of
percentile.
lower quartile = 25th percentile
median = 50th percentile
upper quartile = 75th percentile
Percentile
[FIGURE 5.9 Comparative location of percentile ranks in two distributions, panels (a) and (b). Horizontal axis: score (60 to 110), with percentile ranks P0 to P100 marked; vertical axis: frequency. The area under the curve in each segment equals 10% of the whole, representing 10% of the scores (see Section 3.4).]
Percentile in Standard Normal Distribution

Question
(a) Find the P95 of a standard normal distribution.
(b) Find the P95 of a normal random variable X ~ N(5, 10).
Normal Distribution
Many data have distributions that are unimodal, almost
symmetric, and tailing off towards both sides.
[Figure: example data distribution, labelled N(45, 2.25²)]
Normal Distribution
Normal curve : a bell-shaped curve as an approximation to
many such data distributions.
[Figure: normal curve N(45, 2.25²), with the standard deviation 2.25 marked on either side of the mean]
Characteristics of Normal Curves
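[Figure: normal curve with total area = 1; about 68% of the area lies between µ − σ and µ + σ, and about 95% between µ − 2σ and µ + 2σ]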

Standardization
Normal score: how many standard deviations an
observation is from the mean
Z = (X − µ)/σ, equivalently X = µ + Z × σ
Example
Human IQ scores: µ = 100, σ = 15
An IQ score of 136 is Z = (136 − 100)/15 = 2.4 standard deviations above the mean.
An IQ score of 80 has Z = (80 − 100)/15 = −1.33, i.e. it is 1.33 standard deviations below the mean.
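The standardization and its inverse can be sketched as follows, using the IQ parameters from the example. The function names are illustrative.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma


def from_z(z, mu, sigma):
    """Inverse transformation: recover the raw value from a normal score."""
    return mu + z * sigma


mu, sigma = 100, 15  # IQ scores
for iq in (136, 80):
    z = z_score(iq, mu, sigma)
    side = "above" if z > 0 else "below"
    print(f"IQ {iq}: z = {z:.2f} ({abs(z):.2f} SD {side} the mean)")

print(from_z(2.4, mu, sigma))  # maps the normal score back to the IQ scale
```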
Standard Normal Distribution
Z ~ N(0, 1)
F(z) = P(less than z)
P(between a and b) = F(b) − F(a)
F(−z) = 1 − F(z), e.g. F(−1) = 1 − F(1)
Standard Normal Distribution Table
IQ scores: N(100, 15²)
P(less than 110) = ?
Normal score: Z = (110 − 100)/15 ≈ 0.67
F(0.67) = 0.7486, so P(less than 110) = 74.86%
P(less than 80) = ?
Normal score: Z = (80 − 100)/15 ≈ −1.33
F(1.33) = 0.9082
F(−1.33) = 1 − 0.9082
= 0.0918, so P(less than 80) = 9.18%
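The table lookup can be imitated in code. In this sketch the normal score is rounded to two decimals, as a printed table would require, and `statistics.NormalDist` plays the role of F. The rounding step is an assumption made to match the table, not a necessity.

```python
from statistics import NormalDist

std_normal = NormalDist()  # N(0, 1), so cdf plays the role of F
mu, sigma = 100, 15        # IQ scores


def prob_less_than(x):
    """P(X < x), with z rounded to two decimals like a table lookup."""
    z = round((x - mu) / sigma, 2)
    return std_normal.cdf(z)


p110 = prob_less_than(110)
p80 = prob_less_than(80)
print(f"P(less than 110) = {p110:.4f}")
print(f"P(less than 80)  = {p80:.4f}")
```

Without the rounding step the probabilities differ slightly in the third decimal place, since the exact normal scores are 0.666... and −1.333...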
Gasoline use (in mpg): N(30.5, 4.5²)
P(25 < X < 40) = ?
Normal scores:
(25 − 30.5)/4.5 = −1.22, (40 − 30.5)/4.5 = 2.11
F(1.22) = 0.8888, F(2.11) = 0.9826
F(−1.22) = 1 − 0.8888 = 0.1112
P(25 < X < 40) = F(2.11) − F(−1.22)
= 0.9826 − 0.1112 = 0.8714 = 87.14%

Alternatively, using the areas between the mean and z:
P(0 < Z < 1.22) = 0.3888
P(Z < 1.22) = F(1.22) = 0.5 + 0.3888
= 0.8888
P(−1.22 < Z < 0) = 0.3888

[Figure: standard normal curve with area A2 between −1.22 and 0 and area A1 between 0 and 2.11]
A2 = 0.3888, A1 = 0.4826
P(−1.22 < Z < 2.11)
= 0.3888 + 0.4826 = 87.14%
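Both routes to the answer can be checked numerically. The sketch below uses unrounded normal scores, so the result differs from the table-based 87.14% in the fourth decimal place. The variable names are illustrative.

```python
from statistics import NormalDist

mpg = NormalDist(mu=30.5, sigma=4.5)

# Route 1: difference of two cumulative probabilities, F(b) - F(a).
p_direct = mpg.cdf(40) - mpg.cdf(25)

# Route 2: add the two areas on either side of the mean (A2 + A1).
std_normal = NormalDist()
z_low, z_high = (25 - 30.5) / 4.5, (40 - 30.5) / 4.5
a2 = std_normal.cdf(0) - std_normal.cdf(z_low)   # area between z_low and 0
a1 = std_normal.cdf(z_high) - std_normal.cdf(0)  # area between 0 and z_high
p_areas = a1 + a2

print(f"{p_direct:.4f} {p_areas:.4f}")
```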
Percentile
[Figure: normal curve with k% of the area below the k-th percentile and (100 − k)% above it]
Salaries of MBA graduates (in $1000): N(45, 2.25²)
95th percentile = ?
P(Z < c) = 0.95
P(0 < Z < c) = 0.45
c ≈ 1.645
95th percentile = 45 + 1.645 × 2.25 = 48.7
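Reading the table backwards corresponds to the inverse cumulative distribution function. A minimal sketch with `statistics.NormalDist.inv_cdf`, using the salary parameters above:

```python
from statistics import NormalDist

# Step 1: find c with P(Z < c) = 0.95 on the standard normal scale.
c = NormalDist().inv_cdf(0.95)

# Step 2: map back to the salary scale, X = mu + Z * sigma.
p95_scaled = 45 + c * 2.25

# The same percentile taken directly from N(45, 2.25^2).
salaries = NormalDist(mu=45, sigma=2.25)
p95_direct = salaries.inv_cdf(0.95)

print(f"c = {c:.3f}")
print(f"95th percentile = {p95_scaled:.1f} (direct: {p95_direct:.1f})")
```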
VaR (Value-at-Risk)
One important application of percentile in risk management is

VaR (Value-at-Risk).
VaR is defined as the worst loss over a target horizon that will
not be exceeded with a certain confidence level.
For instance, the VaR at the 95% confidence level gives a loss
value that will not be exceeded with a probability of at least
95%.
VaR (Value-at-Risk)
Suppose an investment has the following historical loss distribution:
[Figure: histogram titled "Loss Distribution". Horizontal axis: loss amount (in million dollars), 0 to 90; vertical axis: frequency, 0 to 60.]
Thus, the 95% VaR would be 70 million dollars.

What is the practical meaning?
How can we estimate the 90% VaR from the above distribution?
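One simple way to estimate VaR from historical losses is to take an empirical percentile. The sketch below does this for a hypothetical list of 20 losses. These figures are invented for illustration and are not read off the histogram above, and the function name and the "smallest loss covering the required share" convention are assumptions.

```python
import math


def historical_var(losses, confidence_pct):
    """Smallest observed loss with at least confidence_pct% of losses at or below it."""
    ordered = sorted(losses)
    k = math.ceil(confidence_pct * len(ordered) / 100)  # observations to cover
    return ordered[k - 1]


# Hypothetical losses in million dollars (made-up data).
losses = [2, 5, 8, 10, 12, 15, 18, 20, 22, 25,
          28, 30, 33, 35, 40, 45, 50, 60, 70, 85]

var_95 = historical_var(losses, 95)
var_90 = historical_var(losses, 90)
print("95% VaR:", var_95)
print("90% VaR:", var_90)
```

A lower confidence level can only give the same or a smaller VaR, because fewer of the large losses need to be covered.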
z -score
The standardized score or z -score is a useful measure of the relative value
of any observation in a data set:
z = (Observed Value − Mean) / Standard Deviation = (x − µ)/σ
[FIGURE 5.7 Relative frequencies of cases contained within certain limits in the normal distribution: 68% of cases within µ ± 1σ, 95% within µ ± 2σ, 99.7% within µ ± 3σ. Horizontal axis: scores; vertical axis: relative frequency.]
z -score – Empirical Rules
In any normal distribution, the interval:

µ ± 1σ contains about 68% of the scores
µ ± 2σ contains about 95% of the scores
µ ± 3σ contains about 99.7% of the scores
The Empirical Rules for bell-shaped data can be restated for
standardized scores as follows:
About 68% of values have z-score between −1 and +1.
About 95% of values have z-score between −2 and +2.
About 99.7% of values have z-score between −3 and +3.
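These percentages can be verified from the standard normal cumulative distribution function. A minimal sketch, where the helper name `share_within` is illustrative:

```python
from statistics import NormalDist

std_normal = NormalDist()  # N(0, 1)


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.4f}")
```

The exact proportions are close to, but not exactly, 68%, 95% and 99.7%, which is why the rule says "about".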
z -score – Women’s Heights Revisited

Mean height for the 199 British women is 1602 mm and standard
deviation is 62.4 mm.
68% of the 199 heights would fall in the range 1602 ± 62.4, or
[1539.6, 1664.4] mm
95% of the heights would fall in the interval 1602 ± 2(62.4), or
[1477.2, 1726.8] mm
99.7% of the heights would fall in the interval 1602 ± 3(62.4), or
[1414.8, 1789.2] mm
Note: Not perfect, but follows Empirical Rule quite well.


