
Slide 11-1

Chapter 11
Correlation and Regression: Measuring and Predicting Relationships

2/10/2012

Slide 11-2

Bivariate Data: Relationships


Examples of relationships:
Sales and earnings
Cost and number produced
Microsoft and the stock market
Effort and results

Scatterplot
A picture to explore the relationship in bivariate data

Correlation r
Measures strength of the relationship (from -1 to 1)

Regression
Predicting one variable from the other

Slide 11-3

Interpreting Correlation
r = 1
A perfect straight line tilting up to the right

r = 0
No overall tilt. No relationship?

r = -1
A perfect straight line tilting down to the right

[Figure: three scatterplots of Y against X, one for each case]
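The three cases on this slide can be checked numerically. The following is a minimal sketch: the helper `correlation` and the small data sets are made up for illustration and do not come from the slides.

```python
import math


def correlation(xs, ys):
    """Return the sample correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, chosen only to illustrate the three cases.
x = [1, 2, 3, 4, 5]
r_up = correlation(x, [2 * v + 1 for v in x])     # perfect line tilting up
r_flat = correlation(x, [2, 1, 3, 1, 2])          # no overall tilt
r_down = correlation(x, [10 - 3 * v for v in x])  # perfect line tilting down

print(r_up, r_flat, r_down)
```

Up to floating-point rounding, the three values come out as 1, 0, and -1.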

Slide 11-4 Fig 11.1.3

Example: Internet Site Ratings


Time Spent vs. Internet Pages Viewed
Two measures of the abilities of 25 Internet sites
At the top right are eBay, Yahoo!, and MSN

Correlation is r = 0.964
Very strong positive association (since r is close to 1)

Linear relationship
Straight line with scatter

Increasing relationship
Tilts up and to the right

[Figure 11.1.3: scatterplot of minutes per person (vertical axis) against pages per person (horizontal axis), with MSN, eBay, and Yahoo! labeled]

Slide 11-5 Fig 11.1.4

Example: Merger Deals


Dollars vs. Deals
For mergers and acquisitions by investment bankers
244 deals worth $756 billion by Goldman Sachs

Correlation is r = 0.419
Positive association

Linear relationship
Straight line with scatter

Increasing relationship
Tilts up and to the right

[Figure 11.1.4: scatterplot of dollars in billions (vertical axis) against number of deals (horizontal axis)]

Slide 11-6 Fig 11.1.5

Example: Mortgage Rates & Fees


Interest Rate vs. Loan Fee
For mortgages
If the interest rate is lower, does the bank make it up with a higher loan fee?

Correlation is r = -0.890
Strong negative association

Linear relationship
Straight line with scatter

Decreasing relationship
Tilts down and to the right

[Figure 11.1.5: scatterplot of interest rate (vertical axis) against loan fee (horizontal axis)]

Slide 11-7 Fig 11.1.7

Example: The Stock Market


Today's vs. Yesterday's Percent Change
Is there momentum?
If the market was up yesterday, is it more likely to be up today?
Or is each day's performance independent?

Correlation is r = 0.11
A weak relationship?

No relationship?
Tilt is neither up nor down

[Figure 11.1.7: scatterplot of today's change (vertical axis) against yesterday's change (horizontal axis), both in percent]

Slide 11-8 Fig 11.1.10

Example: Stock Options


Call Price vs. Strike Price
For stock options
Call Price is the price of the option contract to buy stock at the Strike Price
The right to buy at a lower strike price has more value

A nonlinear relationship
Not a straight line: A curved relationship

Correlation r = -0.895
A negative relationship: Higher strike price goes with lower call price

[Figure 11.1.10: scatterplot of call price (vertical axis) against strike price (horizontal axis)]

Slide 11-9 Fig 11.1.11

Example: Maximizing Yield


Output Yield vs. Temperature
For an industrial process
With a best (optimal) temperature setting

A nonlinear relationship
Not a straight line: A curved relationship

Correlation r = 0.0155
r suggests no relationship
But relationship is strong
It tilts neither up nor down

[Figure 11.1.11: scatterplot of yield of process (vertical axis) against temperature (horizontal axis)]
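A correlation near zero does not rule out a strong curved relationship. The sketch below illustrates this with hypothetical yield values placed on a symmetric curve. These numbers and the helper `correlation` are assumptions for illustration, not the data behind Fig 11.1.11.

```python
import math


def correlation(xs, ys):
    """Return the sample correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical temperatures and yields (not the data from the figure).
temperature = [500, 600, 700, 800, 900]
# Yield peaks at the middle temperature and falls off symmetrically,
# so yield is completely determined by temperature.
process_yield = [160 - 0.001 * (t - 700) ** 2 for t in temperature]

r = correlation(temperature, process_yield)
print(f"r = {r:.4f}")
```

Here the yield is an exact function of temperature, yet r is essentially zero because the rising and falling halves of the curve cancel out.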

Slide 11-10 Fig 11.1.12,14

Example: Telecommunications

Circuit Miles vs. Investment (lower left)
For telecommunications firms

A relationship with unequal variability
More vertical variation at the right than at the left
Variability is stabilized by taking logarithms (lower right)

Correlation r = 0.820 (original data)
r = 0.957 (after taking logarithms)

[Figures 11.1.12 and 11.1.14: scatterplot of circuit miles in millions against investment in $millions (left), and of log of miles against log of investment (right)]

Slide 11-11 Fig 11.1.15

Example: Bond Coupon and Price


Price vs. Coupon Payment
For trading in the bond market
Bonds paying a higher coupon generally cost more

Two clusters are visible
Ordinary bonds (value is from coupon)
Inflation-indexed bonds (payout rises with inflation)

Correlation r = 0.950 for all bonds
Correlation r = 0.994 for ordinary bonds only

[Figure 11.1.15: scatterplot of bid price (vertical axis) against coupon rate (horizontal axis)]

Slide 11-12 Fig 11.1.16,17

Example: Cost and Quantity

Cost vs. Number Produced


For a production facility
It usually costs more to produce more

An outlier is visible
A disaster (a fire at the factory)
High cost, but few produced

r = 0.623 with the outlier included
Outlier removed: More details, r = 0.869

[Figures 11.1.16 and 11.1.17: scatterplots of cost (vertical axis) against number produced (horizontal axis), with and without the outlier]

Slide 11-13

Example: Salary and Experience


Salary vs. Years Experience
For n = 6 employees

Linear (straight line) relationship

Increasing relationship
Higher salary generally goes with higher experience

Correlation r = 0.8667

Experience (years)    Salary ($thousand)
15                    30
10                    35
20                    55
5                     22
15                    40
5                     27

[Figure: scatterplot of salary (vertical axis) against experience (horizontal axis)]
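The correlation for this example can be reproduced from the six data pairs on the slide. The following is a minimal sketch in plain Python. The variable names are illustrative choices, and the pairing assumes the experience and salary values are listed in the same employee order.

```python
import math

experience = [15, 10, 20, 5, 15, 5]  # X, years
salary = [30, 35, 55, 22, 40, 27]    # Y, $thousand

n = len(experience)
mean_x = sum(experience) / n
mean_y = sum(salary) / n

# Sums of cross-deviations and squared deviations from the means.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(experience, salary))
sxx = sum((x - mean_x) ** 2 for x in experience)
syy = sum((y - mean_y) ** 2 for y in salary)

r = sxy / math.sqrt(sxx * syy)
print(round(r, 4))  # 0.8667
```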

Slide 11-14

The Least-Squares Line Y=a+bX


Summarizes bivariate data: Predicts Y from X
with smallest errors (in vertical direction, for Y axis)
Intercept is 15.32 salary (at 0 years of experience)
Slope is 1.673 salary (for each additional year of experience, on average)

[Figure: scatterplot of salary (Y) against experience (X) with the least-squares line]
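The intercept and slope quoted on this slide can be recovered from the salary data with the usual least-squares formulas. This is a sketch under the assumption that the six experience and salary values from the previous slide are paired in order. The variable names are illustrative.

```python
experience = [15, 10, 20, 5, 15, 5]  # X, years
salary = [30, 35, 55, 22, 40, 27]    # Y, $thousand

n = len(experience)
mean_x = sum(experience) / n
mean_y = sum(salary) / n

# Slope b: sum of cross-deviations divided by sum of squared X-deviations.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(experience, salary))
sxx = sum((x - mean_x) ** 2 for x in experience)
b = sxy / sxx

# Intercept a: the least-squares line passes through (mean_x, mean_y).
a = mean_y - b * mean_x

print(round(a, 2), round(b, 3))  # 15.32 1.673
```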

Slide 11-15

Predicted Values and Residuals


Predicted Value comes from Least-Squares Line
For example, Mary (with 20 years of experience) has predicted salary 15.32 + 1.673(20) = 48.8
So does anyone with 20 years of experience

Residual is actual Y minus predicted Y
Mary's residual is 55 - 48.8 = 6.2
She earns about $6,200 more than the predicted salary for a person with 20 years of experience
A person who earns less than predicted will have a negative residual
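Mary's predicted value and residual can be worked through in a few lines. This sketch uses the rounded intercept and slope quoted on the slides. The function names `predict` and `residual` are illustrative, not part of the source.

```python
a = 15.32  # intercept from the slides, $thousand
b = 1.673  # slope from the slides, $thousand per year of experience


def predict(years):
    """Predicted salary ($thousand) from the least-squares line."""
    return a + b * years


def residual(years, actual_salary):
    """Actual salary minus predicted salary ($thousand)."""
    return actual_salary - predict(years)


mary_predicted = predict(20)
mary_residual = residual(20, 55)
print(round(mary_predicted, 1), round(mary_residual, 1))  # 48.8 6.2

# The employee with 15 years of experience and a salary of 30
# earns less than predicted, so the residual is negative.
below_line = residual(15, 30)
print(round(below_line, 1))  # -10.4
```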

Slide 11-16

Predicted and Residual (continued)


Mary earns 55 thousand
Mary's predicted value is 48.8
Mary's residual is 6.2

[Figure: scatterplot of salary against experience with the least-squares line, marking Mary's actual salary, predicted value, and residual]

Slide 11-17

Standard Error of Estimate

\(S_e = S_Y \sqrt{(1 - r^2)\,\frac{n-1}{n-2}}\)

Approximate size of prediction errors (residuals)
Actual Y minus predicted Y: Y - (a + bX)

Example (Salary vs. Experience)
\(S_e = 11.686 \sqrt{(1 - 0.8667^2)\,\frac{6-1}{6-2}} = 6.52\)
Predicted salaries are about 6.52 (i.e., $6,520) away from actual salaries
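The standard error of estimate for the salary example can be recomputed from the summary values given on the slide (SY = 11.686, r = 0.8667, n = 6). A minimal sketch, with illustrative variable names:

```python
import math

s_y = 11.686  # standard deviation of salary, from the slide
r = 0.8667    # correlation, from the slide
n = 6         # number of employees

# S_e = S_Y * sqrt((1 - r^2) * (n - 1) / (n - 2))
s_e = s_y * math.sqrt((1 - r ** 2) * (n - 1) / (n - 2))
print(round(s_e, 2))  # 6.52
```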

Slide 11-18

Se (continued)
Interpretation: similar to standard deviation
Can move Least-Squares Line up and down by Se
About 68% of the data are within one standard error of estimate of the least-squares line
(For a bivariate normal distribution)

[Figure: scatterplot of salary against experience with the least-squares line shifted up and down by Se]

Slide 11-19

Regression and Prediction Error


Predicting Y as \(\bar{Y}\) (not using regression)
Errors are approximately SY = 11.686

Predicting Y as a + bX (using regression)
Errors are approximately Se = 6.52

Errors are smaller when regression is used!
This is often the true payoff for using regression

Coefficient of Determination \(R^2\)
Tells what percent of the variability (variance) of Y is explained by X
Example: \(R^2 = 0.8667^2 = 0.751\)
Experience explains 75.1% of the variation in salaries
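The comparison of the two error sizes, and the coefficient of determination, can be checked directly against the salary data. This is a sketch that assumes the six data pairs from the Salary and Experience slide. It computes R squared as one minus the unexplained share of the variation, which for a single X equals the squared correlation.

```python
import math

experience = [15, 10, 20, 5, 15, 5]  # X, years
salary = [30, 35, 55, 22, 40, 27]    # Y, $thousand

n = len(experience)
mean_x = sum(experience) / n
mean_y = sum(salary) / n

# Least-squares slope and intercept.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(experience, salary))
sxx = sum((x - mean_x) ** 2 for x in experience)
b = sxy / sxx
a = mean_y - b * mean_x

# Total variation of Y around its mean (predicting without regression).
sst = sum((y - mean_y) ** 2 for y in salary)
# Variation left over around the least-squares line (predicting with it).
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(experience, salary))

s_y = math.sqrt(sst / (n - 1))  # typical error without regression
s_e = math.sqrt(sse / (n - 2))  # typical error with regression
r_squared = 1 - sse / sst       # share of variation explained by X

print(round(s_y, 3), round(s_e, 2), round(r_squared, 3))  # 11.686 6.52 0.751
```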

Slide 11-20

Linear Model
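The foundation for statistical inference in regression
Observed Y is a straight line, plus randomness

Linear Model for the Population
\(Y = \alpha + \beta X + \varepsilon\)
\(\alpha + \beta X\): Population relationship, on average
\(\varepsilon\): Randomness of individuals

[Figure: scatterplot of Y against X around the population line]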

Slide 11-21

Why Statistical Inference?


Because there can seem to be a relationship
when, in fact, the population is just random

Samples of size n = 10
from a population with no relationship (correlation 0)
Sample correlations are not zero!
Due to the randomness of sampling

[Figure: three scatterplots of samples, labeled r = 0.471, r = 0.089, r = 0.395]
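The effect described on this slide is easy to reproduce by simulation. The sketch below draws three samples of size 10 in which X and Y are generated independently, so the population correlation is 0. The normal distribution, the seed value, and the helper `correlation` are assumptions made for illustration.

```python
import math
import random


def correlation(xs, ys):
    """Return the sample correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(11)  # arbitrary fixed seed so the run is repeatable

sample_rs = []
for _ in range(3):
    xs = [rng.gauss(0, 1) for _ in range(10)]
    ys = [rng.gauss(0, 1) for _ in range(10)]  # generated independently of xs
    sample_rs.append(correlation(xs, ys))

print([round(r, 3) for r in sample_rs])
```

The printed correlations differ from zero purely because of sampling randomness. That is why a formal test is needed before concluding that X and Y are related.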

Slide 11-22

Standard Error of the Slope

\(S_b = \frac{S_e}{S_X \sqrt{n-1}}\)

Approximately how far the observed slope b is from the population slope \(\beta\)

Example (Salary vs. Experience)
\(S_b = \frac{6.52}{6.06 \sqrt{6-1}} = 0.48\)
Observed slope, b = 1.673, is about 0.48 away from the unknown slope of the population
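The standard error of the slope for the salary example follows from the values on the slide (Se = 6.52, SX = 6.06, n = 6). A minimal sketch, with illustrative variable names:

```python
import math

s_e = 6.52  # standard error of estimate, from the slides
s_x = 6.06  # standard deviation of experience, from the slide
n = 6       # number of employees

# S_b = S_e / (S_X * sqrt(n - 1))
s_b = s_e / (s_x * math.sqrt(n - 1))
print(round(s_b, 2))  # 0.48
```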

Slide 11-23

Statistical Inference

Confidence Interval for the Slope
\(b \pm t\,S_b\)
where t has n - 2 degrees of freedom

Hypothesis Test
Is \(\beta\) different from \(\beta_0 = 0\)?
Is the regression significant?
Are X and Y significantly related?

YES
If 0 is not in the confidence interval
Or if \(|t_{\text{statistic}}| = |b / S_b| > t_{\text{table}}\)

NO
Otherwise

Slide 11-24

Example (Salary and Experience)


95% confidence interval for population slope \(\beta\)
Use t = 2.776 from t table, 6 - 2 = 4 degrees of freedom
\(1.673 \pm 2.776 \times 0.48\)
From 0.34 to 3.00
We are 95% sure that the population slope is somewhere between 0.34 and 3.00 ($thousand per year of experience)

Hypothesis test result
Significant
because 0 is not in the confidence interval
Experience and Salary are significantly related
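The confidence interval and both versions of the significance check can be sketched as follows. The inputs are the rounded values from the slides. The t table value 2.776 is taken from the slide, not computed, and the variable names are illustrative.

```python
b = 1.673        # observed slope, from the slides
s_b = 0.48       # standard error of the slope, from the slides
t_table = 2.776  # t table value for 6 - 2 = 4 degrees of freedom, 95% level

# Confidence interval: b plus or minus t * S_b.
margin = t_table * s_b
lower = b - margin
upper = b + margin

# Two equivalent ways to decide significance.
zero_outside_interval = lower > 0 or 0 > upper
t_statistic = b / s_b
significant = abs(t_statistic) > t_table

print(f"{lower:.2f} to {upper:.2f}")
print(f"t statistic = {t_statistic:.2f}, significant = {significant}")
```

With these rounded inputs the limits come out at about 0.34 and 3.01. The slide reports the upper limit as 3.00, a small rounding difference. Both checks agree that the slope is significantly different from 0.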

Slide 11-25

Regression Can Be Misleading


Linear Model May Be Wrong
Nonlinear? Unequal variability? Clustering?

Predicting Intervention from Experience is Hard
Relationship may become different if you intervene

Intercept May Not Be Meaningful
if there are no data near X = 0

Explaining Y from X vs. Explaining X from Y
Use care in selecting the Y variable to be predicted

Is there a hidden Third Factor?
Use it to predict better with multiple regression?
