
Do heavier people burn more energy?

Does wine consumption cause a decrease in heart disease?
These questions reflect a desire to understand the
relationship between two variables.
What we need:
1. A plot/graph to view the relationship
2. Characteristics to describe
3. Measures of the characteristics
4. Method to make inferences about the relationship

Correlation & Regression


The graph: a scatter plot
Response variable Y (dependent variable) on the vertical axis
Explanatory variable X (independent variable) on the horizontal axis
Do heavier people burn more energy?
Response: metabolic rate
Explanatory: weight or mass

Does wine consumption cause a decrease in heart disease?
Response: death rate from heart disease
Explanatory: wine consumption


Do heavier people burn more energy?
[Scatter plot: Lean body mass vs. metabolic rate. x-axis: Mass (kg), 30 to 60; y-axis: Rate (cal), 1000 to 2000]


Is wine good for your heart?
[Scatter plot: wine consumption vs. heart disease death rate (per 100,000). x-axis: Alcohol (wine consumption), 0 to 9; y-axis: hrt_death rate, 100 to 300]


Interpreting…characteristics to look for:

• Patterns:
• Form (clusters, scatter, linear, ...)
• Direction (positive, negative)
• Strength (how closely the points follow the form)

• Deviations:
• Outliers

Interpret the last two scatter plots….



Options to consider:
Adding a categorical variable



Scatter plot: relationship between quantitative variables
Form: Linear is probably the most common form
Strength: We can measure the strength of a linear relationship, because our eyes can deceive us!
[Two scatter plots, each labelled "Strength?"]
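Correlation
Correlation measures the direction and strength of a linear relationship.

Standardised value of each x: $z_x = (x - \bar{x}) / s_x$
Standardised value of each y: $z_y = (y - \bar{y}) / s_y$

Correlation is an average product of standardised values: $r = \frac{1}{n-1} \sum z_x z_y$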
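
As a minimal sketch of that definition, the snippet below standardises x and y and averages the products, dividing by n - 1 (the usual sample convention, assumed here). The data are the 19 wine observations listed in this document's residual table; the function names are illustrative only.

```python
import math

# Wine consumption (Alcohol) and heart disease death rate per 100,000 (hrt_deat)
# for the 19 observations listed in this document's residual table.
alcohol = [2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
           1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7]
deaths = [211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
          167, 266, 227, 86, 207, 115, 285, 199, 172]


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def standardise(values):
    """Subtract the mean and divide by the standard deviation."""
    m, s = mean(values), sample_sd(values)
    return [(v - m) / s for v in values]


def correlation(xs, ys):
    """Sum of products of standardised values, divided by n - 1."""
    zx, zy = standardise(xs), standardise(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


r = correlation(alcohol, deaths)
print(round(r, 3))  # -0.843, the Pearson correlation reported for the wine data
```

Because each value is divided by its own standard deviation, the units cancel, which is why r has no units.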



Correlation = r

• Applies to quantitative variables
• Measures linear relationships only
• r has no units
• r can be between –1 and 1
• Positive r = positive association
• Negative r = negative association
• r = 0 means no linear association
• r is influenced by outliers
Do heavier people burn more energy?
[Scatter plot: Lean body mass vs. metabolic rate. x-axis: Mass (kg), 30 to 60; y-axis: Rate (cal), 1000 to 2000]

Correlations: Mass (kg), Rate (cal)
Pearson correlation of Mass(kg) and Rate(cal) = 0.865 (this is r)
P-Value = 0.000


Weight (mass) vs. metabolic rate
[Scatter plot: x-axis: Mass (kg), 30 to 60; y-axis: Rate (cal), 1000 to 2000; males plotted as +, females as o]

Correlations: Mass (kg)_F, Rate (cal)_F
Pearson correlation of Mass(kg)_F and Rate(cal)_F = 0.876
Correlations: Mass (kg)_M, Rate (cal)_M
Pearson correlation of Mass (kg)_M and Rate (cal)_M = 0.592
Is wine good for your heart?
[Scatter plot: wine consumption vs. heart disease death rate (per 100,000). x-axis: Alcohol (wine consumption), 0 to 9; y-axis: hrt_death rate, 100 to 300]

Correlations: Alcohol, heart_death rate
Pearson correlation of Alcohol and hrt_death rate = -0.843


heart disease death rate vs. wine consumption (outliers removed)
[Scatter plot: x-axis: Alc wine consumption, 1 to 4; y-axis: hrt death rate, 150 to 300]

Correlations: Alcohol Wine consumption, heart death rate
Pearson correlation of Alc Wine consumption and hrt death rate = -0.648


Linear relationships: using a line
Is wine good for your heart?
[Scatter plot: wine consumption vs. heart disease death rate (per 100,000). x-axis: Alcohol (wine consumption), 0 to 9; y-axis: hrt_death rate, 100 to 300]

We can summarise an overall linear form with a line. The best line is called the regression line.
A regression line describes how a response variable changes as an explanatory variable changes. We can now predict a value of y when given an x.

Fitted regression line: death rate vs. wine consumption
death rate = 260.563 - 22.9688 wine consumption
S = 37.8786 R-Sq = 71.0 % R-Sq(adj) = 69.3 %
[Fitted line plot: x-axis: wine consumption, 0 to 9; y-axis: death rate, 100 to 300]

What would be the death rate due to heart disease if the average daily consumption of wine was 3 glasses?
191.66 deaths per 100,000
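
The prediction above is just the fitted equation evaluated at x = 3. A small sketch, using the coefficients printed in the regression output (the function name is illustrative):

```python
# Coefficients copied from the fitted regression line shown above.
INTERCEPT = 260.563
SLOPE = -22.9688


def predict_death_rate(wine_consumption):
    """Predicted heart disease deaths per 100,000 for a given wine consumption."""
    return INTERCEPT + SLOPE * wine_consumption


print(round(predict_death_rate(3), 2))  # 191.66
```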



How do we determine the regression line?

We want the vertical distances from the points (observed) to the line (predicted) to be as small as possible. This means our error in predicting y is small.


Calculating the line…
We will use the method of least squares to calculate the line. The least-squares regression line is the line that makes the sum of the squares of the vertical distances as small as possible.

$\hat{y} = a + bx$  Equation of the line ($\hat{y}$ is read "y hat")

$b = r \frac{s_y}{s_x}$  b is the slope (change in $\hat{y}$ when x increases by 1)

$a = \bar{y} - b\bar{x}$  a is the y intercept (value of $\hat{y}$ when x is 0)



Fitted regression line: death rate vs. wine consumption
[Fitted line plot: x-axis: wine consumption, 0 to 9; y-axis: death rate, 100 to 300]

The regression equation is
death rate = 260.563 - 22.9688 wine consumption

S = 37.8786 R-Sq = 71.0 % R-Sq(adj) = 69.3 %

Analysis of Variance
Source       DF       SS       MS        F      P
Regression    1  59813.6  59813.6  41.6881  0.000
Error        17  24391.4   1434.8
Total        18  84204.9
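
The summary statistics in this output can be recovered from the analysis of variance table. The sketch below is a worked arithmetic check using only the sums of squares and degrees of freedom printed above; the variable names are illustrative.

```python
import math

# Values copied from the Analysis of Variance table.
ss_regression, df_regression = 59813.6, 1
ss_error, df_error = 24391.4, 17
ss_total, df_total = 84204.9, 18

ms_regression = ss_regression / df_regression   # mean square = SS / DF
ms_error = ss_error / df_error

f_statistic = ms_regression / ms_error          # F = MS(regression) / MS(error)
s = math.sqrt(ms_error)                         # S = square root of MS(error)
r_sq = ss_regression / ss_total                 # share of variation explained
r_sq_adj = 1 - ms_error / (ss_total / df_total)

print(round(f_statistic, 2))     # 41.69
print(round(s, 3))               # 37.879
print(round(100 * r_sq, 1))      # 71.0
print(round(100 * r_sq_adj, 1))  # 69.3
```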



Facts about regression….

1. Clear distinction between the response variable and the explanatory variable.
2. Correlation and slope: a change of one standard deviation in x corresponds to a change of r standard deviations in y.
3. The least-squares regression line passes through the point $(\bar{x}, \bar{y})$.
4. Some variation (spread) in y can be accounted for by changes in x when there is a linear relationship. The square of the correlation coefficient is the fraction of the variation in y values that is explained by changes in x.

$r^2 = \frac{\text{variation in } y \text{ due to } x}{\text{total variation in observed } y}$ = coefficient of determination
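
Facts 2 and 3 follow directly from the slope and intercept formulas of the least-squares line, $b = r\,s_y/s_x$ and $a = \bar{y} - b\bar{x}$. A short derivation, offered as a sketch:

```latex
\begin{align*}
% Fact 2: increase x by one standard deviation s_x
\hat{y}(x + s_x) - \hat{y}(x) &= b\,s_x = r\,\frac{s_y}{s_x}\,s_x = r\,s_y \\
% Fact 3: evaluate the line at x = \bar{x}
\hat{y}(\bar{x}) &= a + b\bar{x} = (\bar{y} - b\bar{x}) + b\bar{x} = \bar{y}
\end{align*}
```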
The regression equation is
death rate = 260.563 - 22.9688 wine consumption
S = 37.8786 R-Sq = 71.0 % R-Sq(adj) = 69.3 %

R-sq can have a value between 0 and 1 (0 % to 100 %). Here R-Sq = 71.0 %.


VARIATION OF DEPENDENT Y



Residuals…
the leftovers from least-squares regression

Deviations from the overall pattern are important. The deviations in regression are the "scatter" of points about the line. The vertical distances from the line to the points are called residuals, and they are the "left-over" variation after a regression line is fit.

Residual = observed y – predicted y

$\text{residual} = y - \hat{y}$


The regression equation is
death rate = 260.563 - 22.9688 wine consumption
s = 37.8786 R-Sq = 71.0 % R-Sq(adj) = 69.3 %

The residuals are:

Obs  Alcohol  hrt_deat     Fit  SE Fit  Residual  St Resid
  1     2.50    211.00  203.14    8.89      7.86      0.21
  2     3.90    167.00  170.99    9.23     -3.99     -0.11
  3     2.90    131.00  193.95    8.70    -62.95     -1.71
  4     2.40    191.00  205.44    8.97    -14.44     -0.39
  5     2.90    220.00  193.95    8.70     26.05      0.71
  6     0.80    297.00  242.19   11.76     54.81      1.52
  7     9.10     71.00   51.55   23.29     19.45      0.65 X
  8     0.80    211.00  242.19   11.76    -31.19     -0.87
  9     0.70    300.00  244.49   12.00     55.51      1.55
 10     7.90    107.00   79.11   19.39     27.89      0.86
 11     1.80    167.00  219.22    9.72    -52.22     -1.43
 12     1.90    266.00  216.92    9.57     49.08      1.34
 13     0.80    227.00  242.19   11.76    -15.19     -0.42
 14     6.50     86.00  111.27   15.11    -25.27     -0.73
 15     1.60    207.00  223.81   10.06    -16.81     -0.46
 16     5.80    115.00  127.34   13.15    -12.34     -0.35
 17     1.30    285.00  230.70   10.64     54.30      1.49
 18     1.20    199.00  233.00   10.85    -34.00     -0.94
 19     2.70    172.00  198.55    8.77    -26.55     -0.72

The mean of the least-squares residuals is always equal to 0.
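
To see the zero-mean property numerically, the sketch below refits the least-squares line to the Alcohol and hrt_deat columns of the table and computes each residual as observed minus fitted. The helper `least_squares` is illustrative; it uses the equivalent slope formula Sxy / Sxx.

```python
# Alcohol and hrt_deat columns from the residual table.
alcohol = [2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
           1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7]
deaths = [211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
          167, 266, 227, 86, 207, 115, 285, 199, 172]


def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope


a, b = least_squares(alcohol, deaths)
fitted = [a + b * x for x in alcohol]
residuals = [y - f for y, f in zip(deaths, fitted)]

print(round(residuals[0], 2))  # 7.86, observation 1 in the table
print(round(residuals[2], 2))  # -62.95, observation 3 in the table
print(abs(sum(residuals) / len(residuals)) < 1e-9)  # True: mean residual is 0 up to rounding
```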


Residual Plots
Things to look for:
1. A curved pattern means the relationship is not linear.
2. Increasing/decreasing spread about the line
3. Individual points with large residuals
4. Individual points that are extreme in the x direction

[Residual plot: Residuals Versus Alcohol (response is hrt_deat). x-axis: Alcohol, 0 to 9; y-axis: Residual, -50 to 50]
Do we have any influential points here?
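
One way to screen for points that are extreme in the x direction is leverage, h = 1/n + (x - mean of x)^2 / Sxx. The sketch below flags observations whose leverage exceeds 3p/n with p = 2 coefficients. That cutoff is a common rule of thumb and an assumption here; the source does not say what rule produced the X flag in the residual table.

```python
# Alcohol column from the residual table (observations 1 to 19).
alcohol = [2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
           1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7]

n = len(alcohol)
mean_x = sum(alcohol) / n
sxx = sum((x - mean_x) ** 2 for x in alcohol)

# Leverage of each observation in a simple linear regression.
leverage = [1 / n + (x - mean_x) ** 2 / sxx for x in alcohol]

# Assumed cutoff: 3 * p / n, with p = 2 (intercept and slope).
cutoff = 3 * 2 / n
flagged = [i + 1 for i, h in enumerate(leverage) if h > cutoff]

print(flagged)  # [7]: only observation 7 (Alcohol = 9.10) exceeds the cutoff
```

Under this assumed cutoff the flagged observation is the same one marked X in the table. High leverage alone does not make a point influential; it also matters how far the point falls from the line.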
[Three example residual plots:]
Ideal residual pattern
Curvature: a linear fit is not appropriate
Increasing variation


[Fitted line plot and residual plot (Residuals Versus Alcohol, response is hrt_deat), all data]
death rate = 260.563 - 22.9688 wine consumption
S = 37.8786 R-Sq = 71.0 % R-Sq(adj) = 69.3 %

[Regression plot and residual plot (Residuals Versus C5, response is C6), x-axis 1 to 4; R-Sq = 42.0 % is consistent with the outliers-removed correlation of -0.648]
C6 = 280.215 - 33.7666 C5
S = 40.0879 R-Sq = 42.0 % R-Sq(adj) = 37.5 %


Attention!! Caution!!
1. Correlation and regression describe only linear relationships.
2. r and r-sq are not resistant: a few outliers can change them substantially.
3. Do not extrapolate! Extrapolation means using the regression line to predict y for x values outside the range of the data.
4. Correlations based on averages are too high when applied to individuals. If the data have been "averaged", the values of correlation and regression cannot be used with un-averaged values (e.g., average alcohol consumption per country, not per individual).
5. Lurking variables, like the male/female variable in the weight vs. energy data and the possible Mediterranean variable in the wine data.
6. Correlation/association is not causation.
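
The extrapolation warning can be made concrete with the fitted wine equation. In the sketch below, the observed range (0.7 to 9.1) is read off the Alcohol column of the residual table, and the value 12 is a hypothetical input chosen only to illustrate the problem.

```python
# Coefficients from the fitted regression line for the wine data.
INTERCEPT = 260.563
SLOPE = -22.9688

# Smallest and largest Alcohol values in the residual table.
X_MIN, X_MAX = 0.7, 9.1


def predict(wine_consumption):
    """Return (predicted death rate, True if x is outside the observed range)."""
    is_extrapolation = not (X_MIN <= wine_consumption <= X_MAX)
    return INTERCEPT + SLOPE * wine_consumption, is_extrapolation


inside = predict(3)     # within the data range
outside = predict(12)   # hypothetical value beyond the data range

print(inside)   # prediction of about 191.66, not an extrapolation
print(outside)  # a negative death rate, flagged as extrapolation
```

A negative death rate is impossible, which shows why the line should not be trusted outside the range of x values it was fitted on.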