Correlation and Regression: Measuring and Predicting Relationships

Slide 11-1
Chapter 11
Correlation and Regression: Measuring and Predicting Relationships
2/10/2012
Slide 11-2
Bivariate Data: Relationships

Examples of relationships:
Sales and earnings
Cost and number produced
Microsoft and the stock market
Effort and results
Scatterplot
A picture to explore the relationship in bivariate data
Correlation r
Measures strength of the relationship (from -1 to 1)
Regression
Predicting one variable from the other
Slide 11-3
Interpreting Correlation
r = 1
A perfect straight line tilting up to the right
r = 0
No overall tilt
No relationship?
r = -1
A perfect straight line tilting down to the right
[Figure: three scatterplots of Y against X, one for each value of r]
Slide 11-4 Fig 11.1.3
Example: Internet Site Ratings

Time Spent vs. Internet Pages Viewed

Two measures of the abilities of 25 Internet sites
Correlation is r = 0.964
Very strong positive association (since r is close to 1)
Linear relationship
Straight line with scatter
Increasing relationship
Tilts up and to the right
[Figure 11.1.3: scatterplot of minutes per person against pages per person. At the top right are eBay, Yahoo!, and MSN.]
Slide 11-5 Fig 11.1.4
Example: Merger Deals

Dollars vs. Deals

For mergers and acquisitions by investment bankers
Correlation is r = 0.419
Positive association
Linear relationship
Straight line with scatter
Increasing relationship
Tilts up and to the right
[Figure 11.1.4: scatterplot of dollars (billions) against number of deals. Labeled point: 244 deals worth $756 billion by Goldman Sachs.]
Slide 11-6 Fig 11.1.5
Example: Mortgage Rates & Fees

If the interest rate is lower, does the bank make it up with a higher loan fee?
Interest Rate vs. Loan Fee

For mortgages
Correlation is r = -0.890
Strong negative association
Linear relationship
Straight line with scatter
Decreasing relationship
Tilts down and to the right
[Figure 11.1.5: scatterplot of interest rate (5.0% to 6.0%) against loan fee (0% to 4%)]
Slide 11-7 Fig 11.1.7
Example: The Stock Market

If the market was up yesterday, is it more likely to be up today? Or is each day's performance independent?
Today's vs. Yesterday's Percent Change

Is there momentum?
Correlation is r = 0.11
A weak relationship?
No relationship?
Tilt is neither up nor down
[Figure 11.1.7: scatterplot of today's change against yesterday's change, both from -3% to 3%]
Slide 11-8 Fig 11.1.10
Example: Stock Options

Call Price is the price of the option contract to buy stock at the Strike Price
The right to buy at a lower strike price has more value
Call Price vs. Strike Price

For stock options
A nonlinear relationship
Not a straight line: A curved relationship
Correlation r = -0.895
A negative relationship: Higher strike price goes with lower call price
[Figure 11.1.10: scatterplot of call price ($0 to $100) against strike price ($450 to $650)]
Slide 11-9 Fig 11.1.11
Example: Maximizing Yield

With a best (optimal) temperature setting
Output Yield vs. Temperature

For an industrial process
A nonlinear relationship
Not a straight line: A curved relationship
r suggests no relationship
It tilts neither up nor down
But relationship is strong
[Figure 11.1.11: scatterplot of yield of process (120 to 160) against temperature (500 to 900)]
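The point that r can suggest no relationship while a strong curved relationship exists can be illustrated with a small sketch. The temperatures and yields below are hypothetical values invented for illustration (they are not the data behind the figure), and the helper name `correlation` is an assumption of this sketch.

```python
import math

def correlation(xs, ys):
    """Pearson correlation r between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical process: yield peaks at temperature 700 and falls off
# symmetrically on both sides (an exact curve, with no scatter at all).
temperatures = [500, 550, 600, 650, 700, 750, 800, 850, 900]
yields = [160 - 2.5 * ((t - 700) / 50) ** 2 for t in temperatures]

r = correlation(temperatures, yields)
print(abs(r) < 1e-9)  # r is essentially zero despite a perfect curved relationship
```

Because the curve rises as much on the left as it falls on the right, the upward and downward tilts cancel, so r only measures the (absent) straight-line component.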
Slide 11-10 Fig 11.1.12,14
Example: Telecommunications
Circuit Miles vs. Investment (lower left)

For telecommunications firms
A relationship with unequal variability
More vertical variation at the right than at the left
Variability is stabilized by taking logarithms (lower right)
r = 0.957
[Figures 11.1.12 and 11.1.14: scatterplot of circuit miles (millions) against investment ($millions) at lower left; scatterplot of log of miles against log of investment at lower right]
Slide 11-11 Fig 11.1.15
Example: Bond Coupon and Price

Bonds paying a higher coupon generally cost more
Price vs. Coupon Payment

For trading in the bond market
Two clusters are visible
Ordinary bonds (value is from coupon)
Inflation-indexed bonds (payout rises with inflation)
[Figure 11.1.15: scatterplots of bid price against coupon rate (0% to 10%), for all bonds and for ordinary bonds only]
Slide 11-12 Fig 11.1.16,17
Example: Cost and Quantity
Cost vs. Number Produced

For a production facility
It usually costs more to produce more
An outlier is visible
A disaster (a fire at the factory)
High cost, but few produced
r = 0.623
Outlier removed: More details, r = 0.869
[Figures 11.1.16 and 11.1.17: scatterplots of cost against number produced, with the outlier and with the outlier removed]
Slide 11-13
Example: Salary and Experience

For n = 6 employees
Linear (straight line) relationship
Increasing relationship
Higher salary generally goes with higher experience
Salary vs. Years Experience

Experience (years)   Salary ($thousand)
15                   30
10                   35
20                   55
5                    22
15                   40
5                    27

[Figure: scatterplot of salary (20 to 60) against experience (0 to 20)]
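As a check on the salary and experience data above, the correlation for these six employees can be computed directly. This is a minimal sketch using only the Python standard library; the function name `correlation` is an illustrative choice, not something taken from the slides.

```python
import math

# Data from the Salary vs. Years Experience slide (salary in $thousand)
experience = [15, 10, 20, 5, 15, 5]
salary = [30, 35, 55, 22, 40, 27]

def correlation(xs, ys):
    """Pearson correlation r between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = correlation(experience, salary)
print(round(r, 4))  # 0.8667
```

The result agrees with the r = 0.8667 used in the later standard error and R² examples.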
Slide 11-14
The Least-Squares Line Y=a+bX

Summarizes bivariate data: Predicts Y from X
with the smallest sum of squared errors (in vertical direction, for Y axis)
Intercept is 15.32 salary (at 0 years of experience)
Slope is 1.673 salary (for each additional year of experience, on average)
[Figure: scatterplot of salary (Y) against experience (X) with the least-squares line]
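The intercept and slope quoted for the salary example can be reproduced from the six data points with the usual least-squares formulas, b = Sxy / Sxx and a = mean(Y) - b × mean(X). The sketch below assumes nothing beyond the standard library; `least_squares` is a hypothetical helper name.

```python
# Data from the Salary vs. Years Experience slide (salary in $thousand)
experience = [15, 10, 20, 5, 15, 5]
salary = [30, 35, 55, 22, 40, 27]

def least_squares(xs, ys):
    """Return (a, b) for the least-squares line Y = a + b*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept: the line passes through the means
    return a, b

a, b = least_squares(experience, salary)
print(round(a, 2), round(b, 3))  # 15.32 1.673
```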
Slide 11-15
Predicted Values and Residuals

Predicted Value comes from Least-Squares Line

For example, Mary (with 20 years of experience) has predicted salary 15.32 + 1.673(20) = 48.8
So does anyone with 20 years of experience
Residual is actual Y minus predicted Y

Mary's residual is 55 - 48.8 = 6.2
She earns about $6,200 more than the predicted salary for a person with 20 years of experience
A person who earns less than predicted will have a negative residual
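Mary's numbers can be reproduced, and extended to all six employees, with a few lines of Python. This sketch uses the rounded intercept and slope quoted on the slides; the names `predict` and `residuals` are illustrative assumptions.

```python
# Rounded least-squares coefficients quoted on the slides
a, b = 15.32, 1.673

# Data from the Salary vs. Years Experience slide (salary in $thousand)
experience = [15, 10, 20, 5, 15, 5]
salary = [30, 35, 55, 22, 40, 27]

def predict(years):
    """Predicted salary ($thousand) from the least-squares line."""
    return a + b * years

# Residual = actual Y minus predicted Y, one per employee
residuals = [y - predict(x) for x, y in zip(experience, salary)]

mary_predicted = predict(20)
mary_residual = 55 - mary_predicted
print(round(mary_predicted, 1), round(mary_residual, 1))  # 48.8 6.2
```

Employees with negative entries in `residuals` earn less than the line predicts for their experience.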
Slide 11-16
Predicted and Residual (continued)

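Mary earns 55 thousand
Mary's predicted value is 48.8
Mary's residual is 6.2
[Figure: scatterplot of salary (0 to 60) against experience (0 to 20) with the least-squares line, marking Mary's actual salary, predicted value, and residual]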
Slide 11-17
Standard Error of Estimate

$$S_e = S_Y \sqrt{(1 - r^2)\,\frac{n - 1}{n - 2}}$$
Approximate size of prediction errors (residuals)

Actual Y minus predicted Y: Y - [a + bX]
Example (Salary vs. Experience)

$$S_e = 11.686 \sqrt{(1 - 0.8667^2)\,\frac{6 - 1}{6 - 2}} = 6.52$$
Predicted salaries are about 6.52 (i.e., $6,520) away from actual salaries
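The standard error of estimate can be checked two ways: from the summary formula (using SY and r) and directly from the residuals as the square root of SSE / (n - 2). The following sketch is one way to do that under the assumption that the six salary data points are the full sample; variable names are illustrative.

```python
import math
import statistics

# Data from the Salary vs. Years Experience slide (salary in $thousand)
experience = [15, 10, 20, 5, 15, 5]
salary = [30, 35, 55, 22, 40, 27]
n = len(salary)

# Route 1: the summary formula, with the rounded r = 0.8667 from the slides
s_y = statistics.stdev(salary)  # sample standard deviation of Y, about 11.686
r = 0.8667
se_formula = s_y * math.sqrt((1 - r ** 2) * (n - 1) / (n - 2))

# Route 2: directly from the residuals of the least-squares line
mean_x = statistics.mean(experience)
mean_y = statistics.mean(salary)
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(experience, salary))
     / sum((x - mean_x) ** 2 for x in experience))
a = mean_y - b * mean_x
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(experience, salary))
se_residuals = math.sqrt(sse / (n - 2))

print(round(se_formula, 2), round(se_residuals, 2))  # 6.52 6.52
```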
Slide 11-18
Se (continued)
Interpretation: similar to standard deviation
Can move Least-Squares Line up and down by Se
About 68% of the data are within one standard error of estimate of the least-squares line
(For a bivariate normal distribution)
[Figure: scatterplot of salary (20 to 60) against experience (0 to 20) with the least-squares line]
Slide 11-19
Regression and Prediction Error

Predicting Y as Ȳ, the average of Y (not using regression)

Errors are approximately SY = 11.686
Predicting Y as a+bX (using regression)

Errors are approximately Se = 6.52
Errors are smaller when regression is used!
This is often the true payoff for using regression
Coefficient of Determination R²
Tells what percent of the variability (variance) of Y is explained by X
Example: R² = 0.8667² = 0.751
Experience explains 75.1% of the variation in salaries
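The "percent of variability explained" reading of R² can be made concrete by comparing the squared errors left over after regression (SSE) with the squared errors around the average of Y (SST). This is a sketch under that standard definition, R² = 1 - SSE / SST; the variable names are assumptions of the example.

```python
# Data from the Salary vs. Years Experience slide (salary in $thousand)
experience = [15, 10, 20, 5, 15, 5]
salary = [30, 35, 55, 22, 40, 27]
n = len(salary)

mean_x = sum(experience) / n
mean_y = sum(salary) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(experience, salary))
sxx = sum((x - mean_x) ** 2 for x in experience)
b = sxy / sxx
a = mean_y - b * mean_x

# Squared errors when predicting every salary as the average salary
sst = sum((y - mean_y) ** 2 for y in salary)
# Squared errors when predicting salary from the least-squares line
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(experience, salary))

r_squared = 1 - sse / sst
print(round(r_squared, 3))  # 0.751
```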
Slide 11-20
Linear Model
The foundation for statistical inference in regression
Observed Y is a straight line, plus randomness
Linear Model for the Population
$$Y = \alpha + \beta X + \varepsilon$$
α + βX: Population relationship, on average
ε: Randomness of individuals
Slide 11-21
Why Statistical Inference?

Because there can seem to be a relationship
when, in fact, the population is just random
Samples of size n = 10

from a population with no relationship (correlation 0)
Sample correlations are not zero!
Due to the randomness of sampling
[Figure: three sample scatterplots, labeled r = 0.471, r = 0.089, and r = 0.395]
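The effect described here is easy to simulate. The sketch below draws three samples of size n = 10 in which X and Y are generated independently (so the population correlation is 0) and computes each sample correlation. The normal distribution, the seed, and the helper name `correlation` are arbitrary assumptions of this example, not details from the slides.

```python
import math
import random

def correlation(xs, ys):
    """Pearson correlation r between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(11)  # arbitrary seed, only for repeatability

sample_rs = []
for _ in range(3):
    # X and Y are drawn independently: no relationship in the population
    xs = [rng.gauss(0, 1) for _ in range(10)]
    ys = [rng.gauss(0, 1) for _ in range(10)]
    sample_rs.append(correlation(xs, ys))

print([round(r, 3) for r in sample_rs])
```

The printed values will generally differ from zero, which is why a test is needed before concluding that an observed sample correlation reflects a real relationship.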
Slide 11-22
Standard Error of the Slope

$$S_b = \frac{S_e}{S_X \sqrt{n - 1}}$$
Approximately how far the observed slope b is from the population slope β
Example (Salary vs. Experience)
$$S_b = \frac{6.52}{6.06 \sqrt{6 - 1}} = 0.48$$
Observed slope, b = 1.673, is about 0.48 away from the unknown slope of the population
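The standard error of the slope for the salary example can be recomputed as follows. This sketch takes Se = 6.52 as given from the earlier slide and computes SX from the six experience values; the variable names are illustrative.

```python
import math
import statistics

# Experience values from the Salary vs. Years Experience slide
experience = [15, 10, 20, 5, 15, 5]
n = len(experience)

s_e = 6.52                          # standard error of estimate, from the slides
s_x = statistics.stdev(experience)  # sample standard deviation of X

# Standard error of the slope: Se / (SX * sqrt(n - 1))
s_b = s_e / (s_x * math.sqrt(n - 1))
print(round(s_x, 2), round(s_b, 2))  # 6.06 0.48
```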
Slide 11-23
Statistical Inference
Confidence Interval for the Slope
$$b \pm t\,S_b$$
where t has n - 2 degrees of freedom
Hypothesis Test

Is β different from β₀ = 0? Is the regression significant? Are X and Y significantly related?
YES
If 0 is not in the confidence interval
Or if |t statistic| = |b/Sb| > t table
NO
Otherwise
Slide 11-24
Example (Salary and Experience)

95% confidence interval for population slope β

use t = 2.776 from t table, 6 - 2 = 4 degrees of freedom
1.673 ± 2.776 × 0.48
From 0.34 to 3.00
We are 95% sure that the population slope is somewhere between 0.34 and 3.00 ($thousand per year of experience)
Hypothesis test result

Significant
because 0 is not in the confidence interval
Experience and Salary are significantly related
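The interval and the test decision can be reproduced from the three numbers quoted on the slides. The sketch below uses those rounded inputs, so its endpoints should land close to 0.34 and 3.00 but may differ from the slide in the last digit; all variable names are assumptions of this example.

```python
# Values quoted on the slides (salary in $thousand)
b = 1.673        # observed slope
s_b = 0.48       # standard error of the slope
t_table = 2.776  # t table value for 95% confidence, 6 - 2 = 4 degrees of freedom

# Confidence interval: b plus or minus t * Sb
margin = t_table * s_b
lower, upper = b - margin, b + margin

# Two equivalent ways to reach the test decision
zero_outside_interval = not (lower <= 0 <= upper)
t_statistic = b / s_b
significant = abs(t_statistic) > t_table

print(round(lower, 2), zero_outside_interval, significant)
```

Both checks agree here: 0 lies outside the interval, and the t statistic exceeds the table value.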
Slide 11-25
Regression Can Be Misleading

Linear Model May Be Wrong

Nonlinear? Unequal variability? Clustering?
Predicting Intervention from Experience is Hard

Relationship may become different if you intervene
Intercept May Not Be Meaningful

if there are no data near X = 0
Explaining Y from X vs. Explaining X from Y

Use care in selecting the Y variable to be predicted
Is there a hidden Third Factor?

Use it to predict better with multiple regression?
