Descriptive Statistics Correlation and Regression: Patrick Breheny

Descriptive statistics
Correlation
Regression
Descriptive statistics; Correlation and regression
Patrick Breheny
September 16
Patrick Breheny, STA 580: Biostatistics I

Introduction
Histograms
Correlation
Numerical summaries
Regression
Percentiles
Tables and figures
Human beings are not good at sifting through large streams
of data; we understand data much better when it is
summarized for us
We often display summary statistics in one of two ways:
tables and figures
Tables of summary statistics are very common (we have
already seen several in this course) – nearly all published
studies in medicine and public health contain a table of basic
summary statistics describing their sample
However, figures are usually better than tables in terms of
distilling clear trends from large amounts of information

Types of data
The best way to summarize and present data depends on the
type of data
There are two main types of data:
Categorical data: Data that takes on distinct values (i.e., it
falls into categories), such as sex (male/female), alive/dead,
blood type (A/B/AB/O), stages of cancer
Continuous data: Data that takes on a spectrum of fractional
values, such as time, age, temperature, cholesterol levels
The distinction between categorical (also called discrete) and
continuous data is fundamental and we will return to it
throughout the course

Categorical data
Summarizing categorical data is pretty straightforward – you
just count how many times each category occurs
Instead of counts, we are often interested in percents
A percent is a special type of rate, a rate per hundred
Counts (also called frequencies), percents, and rates are the
three basic summary statistics for categorical data, and are
often displayed in tables or bar charts, as we saw in lab
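
As a minimal sketch of these summaries, the Python snippet below counts a small list of blood types and converts the counts to percents. The data and the helper name `summarize` are made up for illustration; they are not from the course.

```python
from collections import Counter

def summarize(values):
    """Return {category: (count, percent)} for a list of categorical values."""
    counts = Counter(values)
    n = len(values)
    return {cat: (k, 100 * k / n) for cat, k in counts.items()}

# Hypothetical sample of blood types (illustrative only)
blood_types = ["A", "O", "O", "B", "A", "O", "AB", "O", "A", "O"]
table = summarize(blood_types)
for cat in sorted(table):
    count, pct = table[cat]
    print(f"{cat:>2}: count = {count}, percent = {pct:.0f}%")
```

A percent is just the count rescaled to "per hundred", which is why the same table can be read either way.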

Continuous data
For continuous data, instead of a finite number of categories,
observations can take on a potentially infinite number of
values
Summarizing continuous data is therefore much less
straightforward
To introduce concepts for describing and summarizing
continuous data, we will look at data on infant mortality rates
for 111 nations on three continents: Africa, Asia, and Europe

Histograms
One very useful way of looking at continuous data is with
histograms
To make a histogram, we divide a continuous axis into equally
spaced intervals, then count and plot the number of
observations that fall into each interval
This allows us to see how our data points are distributed
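
The binning step can be sketched in a few lines of Python. The function name `histogram_counts` and the rates below are hypothetical; the sketch only illustrates dividing an axis into equally spaced intervals and counting.

```python
def histogram_counts(data, low, high, n_bins):
    """Count observations in n_bins equally spaced intervals on [low, high]."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for x in data:
        if not (low <= x <= high):
            continue  # ignore values outside the plotted axis
        # A value equal to `high` is placed in the last interval
        i = min(int((x - low) / width), n_bins - 1)
        counts[i] += 1
    return counts

# Hypothetical rates (deaths per 1,000 births), illustrative only
rates = [4, 6, 7, 9, 11, 12, 14, 18, 22, 37]
counts = histogram_counts(rates, low=0, high=40, n_bins=4)
for i, c in enumerate(counts):
    print(f"interval {i + 1}: {'#' * c}")
```

Plotting the counts as bars over their intervals gives the histogram.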

Histogram of European infant mortality rates
[Figure: histograms of infant mortality rates, with panels for Europe, Asia, and Africa; x-axis: Deaths per 1,000 births (0 to 200); y-axis: Count]

Summarizing continuous data
As we can see, continuous data comes in a variety of shapes
Nothing can replace seeing the picture, but if we had to
summarize our data using just one or two numbers, how
should we go about doing it?
The aspect of the histogram we are usually most interested in
is, “Where is its center?”
This is typically represented by the average

The average and the histogram
The average represents the center of mass of the histogram:
[Figure: the histograms for Europe, Asia, and Africa (y-axis: Count), illustrating the average as the center of mass]

Spread
The second most important bit of information from the
histogram to summarize is, “How spread out are the
observations around the center?”
This is most typically represented by the standard deviation
To understand how standard deviation works, let’s return to
our small example with the numbers {4, 5, 1, 9}
Each of these numbers deviates from the mean by some
amount:
4 − 4.75 = −0.75    5 − 4.75 = 0.25
1 − 4.75 = −3.75    9 − 4.75 = 4.25
How should we measure the overall size of these deviations?

Root-mean-square
Taking their mean isn’t going to tell us anything (why not?)
We could take the average of their absolute values:
\[ \frac{|-0.75| + |0.25| + |-3.75| + |4.25|}{4} = 2.25 \]
But it turns out that for a variety of reasons, the
root-mean-square works better as a measure of overall size:
\[ \sqrt{\frac{(-0.75)^2 + (0.25)^2 + (-3.75)^2 + (4.25)^2}{4}} \approx 2.86 \]
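
The three candidate summaries can be checked numerically. This is a small sketch using only the numbers {4, 5, 1, 9} from the example; the variable names are illustrative.

```python
import math

data = [4, 5, 1, 9]
mean = sum(data) / len(data)             # 4.75
deviations = [x - mean for x in data]    # -0.75, 0.25, -3.75, 4.25

# Plain mean of the deviations: positives and negatives cancel
mean_dev = sum(deviations) / len(data)

# Mean of the absolute deviations
mean_abs_dev = sum(abs(d) for d in deviations) / len(data)

# Root-mean-square: square, average, then take the square root
rms_dev = math.sqrt(sum(d ** 2 for d in deviations) / len(data))

print(mean_dev, mean_abs_dev, round(rms_dev, 2))
```

The first quantity is zero for any list of numbers, which is why the plain mean of the deviations cannot measure spread.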

The standard deviation
The formula for the standard deviation is
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \]
Wait a minute; why n − 1?

The reason (which we will discuss further in a few weeks) is
that dividing by n turns out to underestimate the true
standard deviation
Dividing by n − 1 instead of n corrects some of that bias
The standard deviation of {4, 5, 1, 9} is 3.30 (recall that we
got 2.86 if we divide by n)
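
As a quick check of the two versions, Python's standard `statistics` module provides both: `stdev` divides by n − 1 and `pstdev` divides by n. The snippet is a sketch for the example list only.

```python
import statistics

data = [4, 5, 1, 9]

sd_n_minus_1 = statistics.stdev(data)   # divides by n - 1
sd_n = statistics.pstdev(data)          # divides by n

print(round(sd_n_minus_1, 2), round(sd_n, 2))
```

The n − 1 version is always the larger of the two, and the gap shrinks as n grows.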

Meaning of the standard deviation
The standard deviation (SD) describes how far away numbers
in a list are from their average
The SD is often used as a “plus or minus” number, as in
“adult women tend to be about 5’4, plus or minus 3 inches”
Most numbers (roughly 68%) will be within 1 SD away from
the average
Very few entries (roughly 5%) will be more than 2 SD away
from the average
This rule of thumb works very well for a wide variety of data;
we’ll discuss where these numbers come from in a few weeks

Standard deviation and the histogram
Background areas within 1 SD of the mean are shaded:
[Figure: histograms for Europe, Asia, and Africa (y-axis: Count), with the region within 1 SD of the mean shaded]

The 68%/95% rule in action
             % of observations within
Continent    One SD    Two SDs
Europe       78        97
Asia         67        97
Africa       63        95
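
A table like this can be produced for any list of numbers. The sketch below uses a hypothetical helper `share_within` and a made-up, right-skewed sample (not the infant mortality data), so its percentages are not expected to match the table.

```python
import statistics

def share_within(data, k):
    """Percent of observations within k standard deviations of the mean."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    inside = [x for x in data if abs(x - mean) <= k * sd]
    return 100 * len(inside) / len(data)

# Hypothetical right-skewed sample, illustrative only
sample = [4, 6, 7, 9, 11, 12, 14, 18, 22, 37]
for k in (1, 2):
    print(f"within {k} SD: {share_within(sample, k):.0f}%")
```

With only ten skewed observations the shares differ from 68% and 95%; the rule is a rule of thumb, not a guarantee.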

Summaries can be misleading!
All of the following have the same mean and standard deviation:
[Figure: four histograms with different shapes; x-axis from −4 to 4; y-axis: Frequency]

Percentiles
The average and standard deviation are not the only ways to
summarize continuous data
Another type of summary is the percentile
A number is the 25th percentile of a list of numbers if it is
bigger than 25% of the numbers in the list and smaller than the other 75%
The 50th percentile is given a special name: the median
The median, like the mean, can be used to answer the
question, “Where is the center of the histogram?”
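
Percentiles can be computed with Python's standard `statistics` module. This is a sketch on the small example list {4, 5, 1, 9}; note that software packages use slightly different interpolation rules, so quartiles for small lists can differ between programs (the `method="inclusive"` choice here is one such rule, picked for illustration).

```python
import statistics

data = [4, 5, 1, 9]

# The median (50th percentile): average of the two middle values, 4 and 5
median = statistics.median(data)

# The 25th, 50th, and 75th percentiles; n=4 asks for quartiles
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print(median, q1, q2, q3)
```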

Median vs. mean
The dotted line is the median, the solid line is the mean:
[Figure: histograms for Europe, Asia, and Africa (y-axis: Count), each with the median and the mean marked]

Skew
Note that the histogram for Europe is not symmetric: the tail
of the distribution extends further to the right than it does to
the left
Such distributions are called skewed
The distribution of infant mortality rates in Europe is said to
be right skewed or skewed to the right
For asymmetric/skewed data, the mean and the median will
be different

Hypothetical example
Azerbaijan had the highest infant mortality rate in Europe at 37
What if, instead of 37, it was 200?
              Mean    Median
Real          14.1    11
Hypothetical  19.2    11
The mean is now higher than 72% of the countries
Note that the average is sensitive to extreme values, while the
median is not; statisticians say that the median is robust to
the presence of outlying observations
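
The same effect is easy to reproduce. The rates below are hypothetical stand-ins (not the actual European data), chosen only so that the largest value is 37 and the median is 11.

```python
import statistics

# Hypothetical rates, illustrative only
real = [5, 7, 8, 9, 11, 12, 15, 20, 37]
hypothetical = real[:-1] + [200]   # replace the largest value, 37, with 200

for label, rates in (("Real", real), ("Hypothetical", hypothetical)):
    print(f"{label:<13} mean = {statistics.mean(rates):.1f}, "
          f"median = {statistics.median(rates)}")
```

Changing one extreme value moves the mean substantially but leaves the median where it was.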

Box plots
Quantiles are used in a type of graphical summary called a
box plot
Box plots are constructed as follows:
Calculate the three quartiles (the 25th, 50th, and 75th)
Draw a box bounded by the first and third quartiles and with a
line in the middle for the median
Call any observation that is extremely far from the box an
“outlier” and plot these observations using a special symbol (this
is somewhat arbitrary and different rules exist for defining
outliers)
Draw a line from the top of the box to the highest observation
that is not an outlier; likewise, from the bottom of the box to the
lowest non-outlier
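
The construction steps above can be sketched as follows. The outlier rule used here (more than 1.5 times the box height beyond the box) is an assumption: it is one common convention, not necessarily the rule used for the plots in these slides. The function name and data are hypothetical.

```python
import statistics

def box_plot_summary(data, k=1.5):
    """Quartiles, whisker ends, and outliers using a k * IQR rule (assumed)."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1                                  # height of the box
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return {
        "quartiles": (q1, median, q3),
        "whiskers": (min(inside), max(inside)),    # lowest/highest non-outlier
        "outliers": outliers,
    }

# Hypothetical rates, illustrative only
rates = [5, 7, 8, 9, 11, 12, 15, 20, 37]
summary = box_plot_summary(rates)
print(summary)
```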

Box plots of the infant mortality rate data
[Figure: box plots of infant mortality rates for Africa, Asia, and Europe; y-axis from 0 to 150; outliers plotted as ●]

Introduction
Correlation
Correlation
Regression
Introduction
Box plots are a way to examine the relationship between a
continuous variable and a categorical variable
In lab, we saw bar charts as a way of comparing two (or more)
categorical variables
Now, we will discuss how to summarize and illustrate the
relationship between two continuous variables

Pearson’s height data
Statisticians in Victorian England were fascinated by the idea
of quantifying hereditary influences
Two of the pioneers of modern statistics, the Victorian
Englishmen Francis Galton and Karl Pearson, were quite
passionate about this topic
In pursuit of this goal, they measured the heights of 1,078
fathers and their (fully grown) sons

The scatter plot
As we’ve mentioned, it is important to plot continuous data –
this is especially true when you have two continuous variables
and you’re interested in the relationship between them
The most common way to plot the relationship between two
continuous variables is the two-way scatter plot
Scatter plots are created by setting up two continuous axes,
then creating a dot for every pair of observations

Scatter plot of Pearson’s height data
[Figure: scatter plot of Pearson’s height data; x-axis: Father's height (Inches), 60 to 80; y-axis: Son's height (Inches), 60 to 80]

Observations about the scatter plot
Taller fathers tend to have taller sons
The scatter plot shows how strong this association is – there
is a tendency, but there are plenty of exceptions

Standardizing a variable
Before we summarize this relationship numerically, we must
discuss the idea of standardizing a variable
In Pearson’s height data, one of the sons measured 63.2
inches tall
Because the average height of the sons in the sample was 68.7
inches, another way of describing his height is to say that he
was 5.5 inches below average
Furthermore, because the standard deviation of the sons was
2.8 inches, yet another way of describing his height is to say
that he was 1.9 standard deviations below the average

The standardization formula
Putting this into a formula, we standardize an observation $x_i$
by subtracting the average and dividing by the standard
deviation:
\[ z_i = \frac{x_i - \bar{x}}{\mathrm{SD}_x} \]
where $\bar{x}$ and $\mathrm{SD}_x$ are the mean and standard deviation of the
variable $x$
One virtue of standardizing a variable is interpretability:
If someone tells you that the concentration of urea in your
blood is 50 mg/dL, that likely means nothing to you
On the other hand, if you are told that the concentration of
urea in your blood is 4 standard deviations above average, you
can immediately recognize this as a very high value
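
As a small sketch of the formula, the snippet below standardizes the son's height from the earlier example (63.2 inches, with a mean of 68.7 and an SD of 2.8). The function name `standardize` is illustrative. Using these rounded summaries gives a value slightly different from the 1.9 quoted earlier, which presumably came from unrounded figures.

```python
def standardize(x, mean, sd):
    """Number of SDs that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z = standardize(63.2, mean=68.7, sd=2.8)
print(round(z, 2))   # negative: the son is below average
```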

More benefits of standardization
If you standardize all of the observations in your sample, the
resulting variable will be “standardized” in the sense of having
mean 0 and standard deviation 1
Standardization therefore brings all variables onto a common
scale – regardless of whether the heights were originally
measured in inches, centimeters, or miles, the standardized
heights will be identical
As we will see momentarily, this allows us to study the
relationship between two continuous variables without
worrying about the scale of measurement
The concept behind standardization – taking an observation,
then subtracting the expected value and dividing by the
variability – is fundamental to statistics and we will see variations
on this idea many times in this course
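
Both properties (mean 0 and SD 1, and independence from the units) can be verified numerically. The heights below are hypothetical and the helper name `standardize_all` is illustrative.

```python
import math
import statistics

def standardize_all(values):
    """Standardize every observation using the sample mean and SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

heights_in = [63.2, 66.0, 68.7, 70.1, 72.5]   # hypothetical heights, inches
heights_cm = [2.54 * h for h in heights_in]   # the same heights in centimeters

z_in = standardize_all(heights_in)
z_cm = standardize_all(heights_cm)

print(round(statistics.mean(z_in), 6), round(statistics.stdev(z_in), 6))
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(z_in, z_cm)))
```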

The correlation coefficient
The summary statistic for describing the strength of
association between two variables is the correlation coefficient,
denoted by r (and sometimes called Pearson’s correlation
coefficient)
The correlation coefficient is always between 1 (perfect
positive correlation) and -1 (perfect negative correlation), and
can take on any value in between
A positive correlation means that as one variable increases,
the other one tends to increase as well
A negative correlation means that as one variable increases,
the other one tends to decrease

Calculating the correlation coefficient
The correlation coefficient is simply the average of the
products of the standardized variables
In mathematical notation,
\[ r = \frac{\sum_{i=1}^{n} z_i^x z_i^y}{n-1}, \]
where $z_i^x$ and $z_i^y$ are the standardized values of x and y
Note: The “n versus n − 1” issue has nothing to do with
correlation; however, if n − 1 is used when standardizing, it
must be used again here
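
The formula translates directly into code. This is a sketch with hypothetical father/son heights; the function name `correlation` is illustrative, and n − 1 is used both when standardizing (via `statistics.stdev`) and when averaging the products, as the note above requires.

```python
import statistics

def correlation(x, y):
    """Pearson's r: sum of products of standardized values, divided by n - 1."""
    n = len(x)
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sd_x, sd_y = statistics.stdev(x), statistics.stdev(y)   # both use n - 1
    zx = [(a - mean_x) / sd_x for a in x]
    zy = [(b - mean_y) / sd_y for b in y]
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)

fathers = [65.0, 67.0, 68.0, 70.0, 72.0]   # hypothetical heights, inches
sons = [66.5, 68.0, 67.5, 70.5, 71.0]
r = correlation(fathers, sons)
print(round(r, 2))
```

Pairs where both values are above (or both below) their averages contribute positive products; pairs on opposite sides contribute negative ones.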

Meaning behind the correlation coefficient formula
[Figure: the scatter plot of Pearson’s height data, repeated; both axes run from 60 to 80 inches]

The correlation coefficient and the scatter plot
[Figure: five scatter plots of y against x, with correlation coefficients −0.88, −0.34, 0.02, 0.29, and 0.91]

More about the correlation coefficient
Because the correlation coefficient is based on standardized
variables, it does not depend on the units of measurement
Thus, the correlation between father’s and son’s heights would
be 0.5 even if the father’s height was measured in inches and
the son’s in centimeters
Furthermore, the correlation between x and y is the same as
the correlation between y and x

Interpreting the correlation coefficient
The correlation between heights of identical twins is around
0.95
The correlation between income and education in the United
States is about 0.44
The correlation between a woman’s education and the number
of children she has is about -0.2
When concrete physical laws determine the relationship
between two variables, their correlation can exceed 0.9
In the social sciences, this is rare – correlations of 0.3 to 0.7
are considered quite strong in these fields

Numerical summaries can be misleading!
From Cook & Swayne’s Interactive and Dynamic Graphics for Data
Analysis:
[Figure: four scatter plots of Y against X]
Fig. 6.1. Studying dependence between X and Y. All four pairs of variables have
correlation approximately equal to 0.7, but they all have very different patterns. Only
the top left plot shows two variables matching a dependence modeled by correlation.
The plot at bottom right shows two variables with some positive linear dependence,
but the obvious non-linear dependence is more interesting.

Ecological correlations
Epidemiologists often look at the correlation between two
variables at the ecological level – say, the correlation between
cigarette consumption and lung cancer deaths per capita
However, people smoke and get cancer, not countries
These correlations have the potential to be misleading
The reason is that by replacing individual measurements by
the averages, you eliminate a lot of the variability that is
present at the individual level and obtain a higher correlation
than there really is
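
The inflation caused by averaging can be seen in a small constructed example. Everything here is hypothetical: five groups of four individuals, where x and y share a group-level trend but the individual scatter around it is unrelated. The helper `correlation` uses the usual covariance form of Pearson's r, which is algebraically the same as averaging products of standardized values.

```python
import math
import statistics

def correlation(x, y):
    """Pearson's r (covariance form)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Individual scatter around each group's level g (made-up pattern)
offsets = [(-2, 2), (2, -2), (-2, -2), (2, 2)]
people = [(g + dx, g + dy, g) for g in range(5) for dx, dy in offsets]

# Correlation computed on individuals
r_individual = correlation([p[0] for p in people], [p[1] for p in people])

# Correlation computed on group averages (the "ecological" level)
group_x = [statistics.mean(p[0] for p in people if p[2] == g) for g in range(5)]
group_y = [statistics.mean(p[1] for p in people if p[2] == g) for g in range(5)]
r_ecological = correlation(group_x, group_y)

print(round(r_individual, 2), round(r_ecological, 2))
```

Averaging removes the individual scatter, so the group-level correlation is far higher than the individual-level one.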

Fat in the diet and cancer

From an article by Carroll in Cancer Research (1975):

Regression and correlation
Correlation
The regression fallacy
Regression
NHANES
Every few years, the CDC conducts a huge survey of randomly
chosen Americans called the National Health and Nutrition
Examination Survey (NHANES)
Hundreds of variables are measured on these individuals:
Demographic variables like age, education, and income
Physiological variables like height, weight, blood pressure, and
cholesterol levels
Dietary habits
Disease status
Lots more: everything from cavities to sexual behavior

Predicting weight from height
For the 2,649 adult women in the NHANES data set:
average height = 5 feet, 3.5 inches
average weight = 166 pounds
SD(height) = 2.75 inches
SD(weight) = 44.5 pounds
correlation between height and weight = 0.3
Suppose you were asked to predict a person’s weight from
their height
First, an easy case: suppose the woman was 5 feet, 3.5 inches
Since the woman is average height, we have no reason to
guess anything other than the average weight, 166 pounds

Predicting weight from height (cont’d)
How about a woman who is 5’6”?
She’s a bit taller than average, so she probably weighs a bit
more than average
But how much more?
To put the question a different way, she is almost one
standard deviation above the average height; how many
standard deviations above the average weight should we
expect her to be?

Using the correlation coefficient
The answer turns out to depend on the correlation coefficient
Since the correlation coefficient for this data is 0.3, and she is
roughly one standard deviation above the mean height, we would
expect the woman to be about 0.3 standard deviations above the
mean weight, or about 166 + 0.3(44.5) ≈ 179 pounds
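
The reasoning can be written as a three-step calculation: convert height to standard units, multiply by r, and convert back to pounds. The function name `predict_weight` is illustrative; the default values are the rounded NHANES summaries quoted above, with the average height of 5 feet, 3.5 inches written as 63.5 inches.

```python
def predict_weight(height_in, mean_h=63.5, sd_h=2.75,
                   mean_w=166.0, sd_w=44.5, r=0.3):
    """Predicted weight (pounds) for a given height (inches)."""
    z_height = (height_in - mean_h) / sd_h   # height in standard units
    z_weight = r * z_height                  # expected weight in standard units
    return mean_w + z_weight * sd_w          # back to pounds

for h in (63.5, 66.0, 66.25):
    print(h, round(predict_weight(h), 1))
```

A height of 66.25 inches is exactly one SD above average; 5’6” (66 inches) is slightly less than one SD, so its prediction is slightly lower.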

Graphical interpretation
[Figure: scatter plot of weight against height for the NHANES women, with a line drawn through the points; x-axis: Height (inches), 55 to 70; y-axis: Weight (lbs), 100 to 400]

The regression line
This line is called the regression line
It tells you, for any height, the average weight for women of
that height
Here, we were trying to predict one variable based on one
other variable; if we were instead trying to predict weight based on
height, dietary habits, and cholesterol levels, or trying to study
the relationship between cholesterol and weight while
controlling for height, this would be called multiple regression
Multiple regression is beyond the scope of this course, but is a
major topic in Biostatistics II

The equation of the regression line
Like all lines, the regression line may be represented by the
equation
\[ y = \alpha + \beta x, \]
where α is the intercept and β is the slope

For the height/weight NHANES data, the intercept is -137
pounds and the slope is 4.8 pounds/inch
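
Using the quoted intercept and slope, the line can be evaluated at any height. This is only a sketch; the constant and function names are illustrative.

```python
ALPHA = -137.0   # intercept in pounds (value quoted above)
BETA = 4.8       # slope in pounds per inch (value quoted above)

def regression_line(height_in, alpha=ALPHA, beta=BETA):
    """Predicted weight (pounds) at a given height (inches): y = alpha + beta * x."""
    return alpha + beta * height_in

for h in (60, 63.5, 66):
    print(h, round(regression_line(h), 1))
```

At the average height of 63.5 inches, the line returns a value close to the average weight of 166 pounds; the small gap reflects rounding in the quoted intercept and slope.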

β vs. r
Note the similarity and the difference between the slope of the
regression line (β) and the correlation coefficient (r):
The correlation coefficient says that if you go up in height by
one standard deviation, you can expect to go up in weight by
r = 0.3 standard deviations
The slope of the regression line tells you that if you go up in
height by one inch, you can expect to go up in weight by
β = 4.8 pounds
Essentially, they tell you the same thing, one in terms of
standard units, the other in terms of actual units
Therefore, if you know one, you can always figure out the
other simply by changing units (which here involves
multiplying by the ratio of the standard deviations)

β vs. r (cont’d)
Suppose a woman’s height is increased by one inch; what do we
expect to happen to her weight?
1 inch = 1/2.75 SDs
An increase of 1/2.75 SDs in height means an increase of
0.3/2.75 SDs in weight
0.3/2.75 SDs = 0.3(44.5/2.75) ≈ 4.85 pounds, which matches the
slope β = 4.8 up to rounding

β vs. r (cont’d)
Suppose a woman’s height is increased by one SD; what do
we expect to happen to her weight?
1 SD = 2.75 inches
An increase of 2.75 inches in height means an increase of
4.85(2.75) pounds in weight
4.85(2.75) pounds = 4.85(2.75)/44.5 SDs = 0.3 SDs
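
The unit conversion worked through on the last two slides can be sketched in a few lines of Python. The correlation and the two standard deviations are the rounded values from the slides; the function names are hypothetical and chosen only for readability.

```python
# Summary values quoted on the slides (all rounded).
R = 0.3            # correlation between height and weight
SD_HEIGHT = 2.75   # inches
SD_WEIGHT = 44.5   # pounds


def slope_from_r(r, sd_x, sd_y):
    """Slope of the regression of y on x, in y-units per x-unit."""
    return r * sd_y / sd_x


def r_from_slope(beta, sd_x, sd_y):
    """Correlation recovered from the slope by converting back to standard units."""
    return beta * sd_x / sd_y


beta = slope_from_r(R, SD_HEIGHT, SD_WEIGHT)
print(f"slope: {beta:.2f} pounds per inch")
print(f"back to r: {r_from_slope(beta, SD_HEIGHT, SD_WEIGHT):.2f}")
```

Going in either direction is a single multiplication by a ratio of standard deviations, which is all that "changing units" means here.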

There are two regression lines
We said that the correlation between weight and height is the
same as the correlation between height and weight
This is not true for regression
The regression of weight on height will give a different answer
than the regression of height on weight
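
To see why the two answers differ, the sketch below computes both slopes from the rounded summary values used earlier (r = 0.3, SDs of 2.75 inches and 44.5 pounds). It relies on the standard relationship slope = r × (SD of the predicted variable) / (SD of the predictor); the variable names are illustrative.

```python
R = 0.3
SD_HEIGHT = 2.75   # inches
SD_WEIGHT = 44.5   # pounds

# Regression of weight on height: pounds of weight per inch of height.
weight_on_height = R * SD_WEIGHT / SD_HEIGHT

# Regression of height on weight: inches of height per pound of weight.
height_on_weight = R * SD_HEIGHT / SD_WEIGHT

# Drawn on the same weight-vs-height axes, the second line rises by
# 1 / height_on_weight pounds per inch, which is far steeper than the first.
height_on_weight_as_lbs_per_inch = 1 / height_on_weight

print(f"weight on height: {weight_on_height:.2f} lbs per inch")
print(f"height on weight: {height_on_weight:.4f} inches per lb")
print(f"second line on the same axes: {height_on_weight_as_lbs_per_inch:.1f} lbs per inch")
```

The product of the two slopes is r squared, so the two lines would coincide only if the correlation were exactly 1 or -1.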

The two regression lines
[Figure: scatter plot of weight (lbs) against height (inches) with the two regression lines]

Regression and root-mean-square error
The amount by which the regression prediction is off is called
the residual
One way of looking at the quality of our predictions is by
measuring the size of the residuals
Out of all possible lines that you could draw, which one has
the lowest possible root-mean-square of the residuals?
The regression line
Because of this, the regression line is also called the “least
squares” fit
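
The least-squares property can be checked numerically. The sketch below uses a handful of made-up (height, weight) pairs, which are purely illustrative and not NHANES data, fits the line with the usual closed-form formulas, and compares its root-mean-square residual with that of a few nearby lines.

```python
import math

# Made-up (height, weight) pairs, purely for illustration; not NHANES data.
heights = [60, 62, 64, 66, 68]
weights = [120, 150, 140, 170, 160]


def rms_residual(alpha, beta, xs, ys):
    """Root-mean-square of the residuals y - (alpha + beta * x)."""
    squared = [(y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))


def least_squares(xs, ys):
    """Intercept and slope of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


alpha, beta = least_squares(heights, weights)
best = rms_residual(alpha, beta, heights, weights)
print(f"regression line: y = {alpha:.1f} + {beta:.1f}x, RMS residual {best:.2f}")

# Nudge the intercept and/or slope and recompute the RMS residual.
nudges = [(5, 0), (-5, 0), (0, 0.5), (0, -0.5), (3, -0.1)]
others = [rms_residual(alpha + da, beta + db, heights, weights) for da, db in nudges]
for (da, db), rms in zip(nudges, others):
    print(f"intercept {da:+}, slope {db:+}: RMS residual {rms:.2f}")
```

Every nudged line has a larger RMS residual than the fitted one, which is the sense in which the regression line is the "least squares" fit.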

Why only r standard deviations?

Only moving r standard deviations away from the average
may be counterintuitive; if height goes up by one SD,
shouldn’t weight too?
Here’s an example that I hope will help clarify this concept:
A student is taking her first course in statistics, and we want
to predict whether she will do well in the course or not
Suppose we know that last semester, she got an A in math
Now suppose that we know that last semester, she got an A in
pottery
These two pieces of information are not equally informative
for predicting how well she will do in her statistics class
We need to balance our baseline guess (that she will receive
an average grade) with this new piece of information, and the
correlation coefficient tells us how much weight the new
information should carry
Fathers and sons again
[Figure: scatter plot of father and son heights; both axes run from 60 to 80]

How regression got its name
Because the correlation coefficient is, in practice, always
less than 1 in absolute value, the regression line is always
flatter than the “x goes up by 1 SD, y goes up by 1 SD” rule
suggests: predictions are pulled toward the average
Galton called this phenomenon “regression to mediocrity,”
and this is where regression gets its name
People frequently read too much into the regression effect –
this is called the regression fallacy

The regression fallacy, example #1
A group of subjects are recruited into a study

Their initial blood pressure is taken, then they take an herbal
supplement for a month, and their blood pressure is taken
again
The mean blood pressure was the same, both before and after
However, subjects with high blood pressure tended to have
lower blood pressure one month later, and subjects with low
blood pressure tended to have higher blood pressure later
Does this supplement act to stabilize blood pressure?

Why does regression to the mean happen?
Not really; the same effect would occur if they took a placebo
Why?
Consider a person with a blood pressure 2 SDs above average
It’s possible that the person has a true blood pressure 1 SD
above average, but happened to have a high first
measurement; it’s also possible that the person has a true
blood pressure 3 SDs above average, but happened to have a
low first measurement
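However, the first explanation is much more likely, because
true blood pressures 1 SD above average are far more common
than true blood pressures 3 SDs above average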
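
To put a rough number on "much more likely", the sketch below assumes that true blood pressures follow a normal curve (in standard units) and that measurement error is symmetric, so reading 1 SD too high is as likely as reading 1 SD too low. Both are assumptions made for illustration, not facts about the study. Under them, the two explanations differ only in how common the two true values are.

```python
import math


def normal_density(z):
    """Height of the standard normal curve at z (in standard units)."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)


# How common is a true value 1 SD above average, relative to 3 SDs above?
# With symmetric measurement error, this ratio is also how much more likely
# the first explanation is than the second.
ratio = normal_density(1) / normal_density(3)
print(f"first explanation is about {ratio:.0f} times as likely as the second")
```

With these assumptions the ratio works out to e^4, a bit under 55 to 1.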

In professional sports, some first-year players have outstanding
years and win “Rookie of the Year” awards
They often fail to live up to expectations in their second years
Writers call this the “sophomore slump” and come up with
elaborate explanations for it

An instructor standardizes her midterm and final so that the
class average is 50 and the SD is 10 on both tests
She has taught this class many times and the correlation
between the tests is always around 0.5
This year, she decides to do something different: she takes
the 10 students with the lowest scores on the midterm and
gives them special tutoring
On the final, all 10 students score above 50; can this be
explained by the regression effect?
No!
The regression effect can only take these students closer to
the average; the fact that they all score above average
indicates that the tutoring really did work
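
The reasoning can be made concrete with the numbers in this example (class average 50, SD 10, correlation 0.5). The midterm scores below are hypothetical; the sketch shows what the regression effect alone would predict for low-scoring students.

```python
MEAN = 50.0   # class average on both tests
SD = 10.0     # standard deviation on both tests
R = 0.5       # correlation between midterm and final


def predicted_final(midterm):
    """Regression prediction for the final score, given a midterm score."""
    z_midterm = (midterm - MEAN) / SD      # midterm in standard units
    return MEAN + R * z_midterm * SD       # move only r SDs on the final


# Hypothetical low midterm scores.
for midterm in (25, 30, 35, 40):
    print(f"midterm {midterm} -> predicted final {predicted_final(midterm):.1f}")
```

Each predicted final is closer to 50 than the midterm was, but still below 50; regression to the mean by itself never predicts a below-average student finishing above average.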
