
Introduction to Probability

and Statistics
Twelfth Edition
Robert J. Beaver Barbara M. Beaver William Mendenhall

Presentation designed and written by:


Barbara M. Beaver
Copyright 2006 Brooks/Cole
A division of Thomson Learning, Inc.

Introduction to Probability
and Statistics
Twelfth Edition
Chapter 12
Linear Regression and
Correlation
Some graphic screen captures from Seeing Statistics
Some images 2001-(current year) www.arttoday.com


Introduction
In Chapter 11, we used ANOVA to investigate the
effect of various factor-level combinations
(treatments) on a response x.
Our objective was to see whether the treatment
means were different.
In Chapters 12 and 13, we investigate a response y
which is affected by various independent variables, xi.
Our objective is to use the information provided by
the xi to predict the value of y.

Example
Let y be a student's college achievement,
measured by his/her GPA. This might be a
function of several variables:
x1 = rank in high school class
x2 = high school's overall rating
x3 = high school GPA
x4 = SAT scores
We want to predict y using knowledge of x1, x2,
x3 and x4.

Example
Let y be the monthly sales revenue for a
company. This might be a function of several
variables:
x1 = advertising expenditure
x2 = time of year
x3 = state of economy
x4 = size of inventory
We want to predict y using knowledge of x1, x2,
x3 and x4.

Some Questions
Which of the independent variables are
useful and which are not?
How could we create a prediction equation
to allow us to predict y using knowledge of
x1, x2, x3, etc.?
How good is this prediction?
We start with the simplest case, in which the
response y is a function of a single independent
variable, x.

A Simple Linear Model


In Chapter 3, we used the equation of
a line to describe the relationship between y
and x for a sample of n pairs, (x, y).
If we want to describe the relationship
between y and x for the whole population,
there are two models we can choose from:
Deterministic model: y = α + βx
Probabilistic model:
y = deterministic model + random error
y = α + βx + ε


A Simple Linear Model


Since the bivariate measurements that we
observe do not generally fall exactly on a
straight line, we choose to use:
Probabilistic model: y = α + βx + ε
Line of means: E(y) = α + βx
Points deviate from the line of means by an amount ε,
where ε has a normal distribution with mean 0 and
variance σ².

The Random Error


The line of means, E(y) = α + βx, describes the
average value of y for any fixed value of x.
The population of measurements is generated
as y deviates from the population line by ε.
We estimate α and β using sample information.


The Method of
Least Squares

The equation of the best-fitting line


is calculated using a set of n pairs (xi, yi).

We choose our estimates a and b to estimate α and β
so that the vertical distances of the points from the
line are minimized.
Best-fitting line: ŷ = a + bx
Choose a and b to minimize
SSE = Σ(y − ŷ)² = Σ(y − a − bx)²


Least Squares Estimators


Calculate the sums of squares:
Sxx = Σx² − (Σx)²/n
Syy = Σy² − (Σy)²/n
Sxy = Σxy − (Σx)(Σy)/n
Best-fitting line: ŷ = a + bx, where
b = Sxy/Sxx and a = ȳ − b·x̄
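These formulas translate directly into code. Below is a minimal Python sketch of the estimators on this slide; Python and the helper name least_squares are illustration choices of this transcript, not part of the textbook's software (the slides use Minitab).

```python
# A minimal sketch of the least-squares formulas on this slide
# (hypothetical helper, not the textbook's software).
def least_squares(x, y):
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    Sxx = sum(xi ** 2 for xi in x) - sum_x ** 2 / n
    Syy = sum(yi ** 2 for yi in y) - sum_y ** 2 / n
    Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    b = Sxy / Sxx                      # slope estimate
    a = sum_y / n - b * sum_x / n      # intercept: a = ybar - b*xbar
    return a, b, Sxx, Syy, Sxy
```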

Example
The table shows the math achievement test scores
for a random sample of n = 10 college freshmen,
along with their final calculus grades.
Student              1    2    3    4    5    6    7    8    9   10
Math test, x        39   43   21   64   57   47   28   75   34   52
Calculus grade, y   65   78   52   82   92   89   73   98   56   75

Use your calculator to find the sums and sums of squares:
Σx = 460      Σy = 760
Σx² = 23634   Σy² = 59816
Σxy = 36854
x̄ = 46        ȳ = 76

Example
Sxx = 23634 − (460)²/10 = 2474
Syy = 59816 − (760)²/10 = 2056
Sxy = 36854 − (460)(760)/10 = 1894
b = 1894/2474 = .76556 and a = 76 − .76556(46) = 40.78
Best-fitting line: ŷ = 40.78 + .77x
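For reference, the same numbers can be reproduced with the least_squares sketch defined earlier (assuming that helper; the results agree with the hand calculation up to rounding):

```python
# Calculus data from the slide, fed to the least_squares sketch above.
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]   # math test scores
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]   # calculus grades

a, b, Sxx, Syy, Sxy = least_squares(x, y)
print(Sxx, Syy, Sxy)              # 2474.0 2056.0 1894.0
print(round(b, 5), round(a, 2))   # 0.76556 40.78
```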

The Analysis of Variance

The total variation in the experiment is


measured by the total sum of squares:
Total SS = Syy = Σ(y − ȳ)²

The Total SS is divided into two parts:


SSR (sum of squares for regression):
measures the variation explained by using x in
the model.
SSE (sum of squares for error): measures the
leftover variation not explained by x.

The Analysis of Variance


We calculate:
SSR = (Sxy)²/Sxx = (1894)²/2474 = 1449.9741
SSE = Total SS − SSR = Syy − (Sxy)²/Sxx
    = 2056 − 1449.9741 = 606.0259

The ANOVA Table


Total df = n − 1
Regression df = 1
Error df = n − 1 − 1 = n − 2

Mean squares:
MSR = SSR/1
MSE = SSE/(n − 2)

Source        df       SS         MS            F
Regression    1        SSR        SSR/1         MSR/MSE
Error         n − 2    SSE        SSE/(n − 2)
Total         n − 1    Total SS

The Calculus Problem


SSR = (Sxy)²/Sxx = (1894)²/2474 = 1449.9741
SSE = Total SS − SSR = Syy − (Sxy)²/Sxx = 2056 − 1449.9741 = 606.0259

Source        df     SS           MS           F
Regression     1     1449.9741    1449.9741    19.14
Error          8     606.0259     75.7532
Total          9     2056.0000

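Continuing the same Python sketch, the ANOVA quantities for the calculus data can be computed as follows (an illustration only, not the slides' Minitab output):

```python
# ANOVA decomposition for the calculus fit, using Sxx, Syy, Sxy from above.
n = len(x)
total_ss = Syy                 # 2056.0
ssr = Sxy ** 2 / Sxx           # about 1449.9741
sse = total_ss - ssr           # about 606.0259
msr = ssr / 1                  # regression mean square
mse = sse / (n - 2)            # error mean square, about 75.7532
F = msr / mse                  # about 19.14
print(round(ssr, 4), round(sse, 4), round(mse, 4), round(F, 2))
```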

Testing the Usefulness


of the Model

The first question to ask is whether the


independent variable x is of any use in
predicting y.
If it is not, then the value of y does not change,
regardless of the value of x. This implies that
the slope of the line, β, is zero.
H0: β = 0 versus Ha: β ≠ 0

Testing the
Usefulness of the Model

The test statistic is a function of b, our best
estimate of β. Using MSE as the best estimate
of the random variation σ², we obtain a t statistic.
Test statistic: t = (b − 0) / √(MSE/Sxx)
which has a t distribution with df = n − 2,
or a confidence interval: b ± t(α/2) √(MSE/Sxx)

The Calculus Problem



Is there a significant relationship between
the calculus grades and the test scores at the
5% level of significance?
H0: β = 0 versus Ha: β ≠ 0
t = (b − 0) / √(MSE/Sxx) = .7656 / √(75.7532/2474) = 4.38
Reject H0 when |t| > 2.306. Since t = 4.38 falls into
the rejection region, H0 is rejected.
There is a significant linear relationship between the
calculus grades and the test scores for the population
of college freshmen.
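A sketch of the same t test in Python, continuing from the quantities above (scipy is an assumption for the critical value; the slides read it from a t table):

```python
# Slope t test for the calculus data.
from math import sqrt
from scipy import stats

t_stat = (b - 0) / sqrt(mse / Sxx)             # about 4.38
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)   # about 2.306 for df = 8
print(round(t_stat, 2), round(t_crit, 3), abs(t_stat) > t_crit)
```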

The F Test
You can test the overall usefulness of the
model using an F test. If the model is
useful, MSR will be large compared to
the unexplained variation, MSE.
To test H0: β = 0 (the model is not useful in predicting y), use
Test statistic: F = MSR/MSE
Reject H0 if F > Fα with 1 and n − 2 df.
This test is exactly equivalent to the t-test, with t² = F.
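The equivalent F test, continuing the sketch (scipy assumed as before):

```python
# F test for the calculus data; note t^2 = F up to rounding.
F_crit = stats.f.ppf(1 - 0.05, dfn=1, dfd=n - 2)   # about 5.32
p_value = stats.f.sf(F, dfn=1, dfd=n - 2)          # about 0.002
print(round(F, 2), round(F_crit, 2), round(p_value, 3))
print(round(t_stat ** 2, 2))                       # matches F
```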

Minitab Output

Regression Analysis: y versus x

The regression equation is y = 40.8 + 0.766 x

Predictor      Coef   SE Coef      T      P
Constant     40.784     8.507   4.79  0.001
x            0.7656    0.1750   4.38  0.002

S = 8.70363   R-Sq = 70.5%   R-Sq(adj) = 66.8%

Analysis of Variance
Source           DF      SS      MS      F      P
Regression        1  1450.0  1450.0  19.14  0.002
Residual Error    8   606.0    75.8
Total             9  2056.0

Notes: the regression equation is the least squares line; the Coef column
gives the regression coefficients a and b; the T and P entries for x test
H0: β = 0; the error MS is MSE; and t² = F.
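Output like this can also be reproduced in Python. One possible approach uses statsmodels, a choice made for this transcript rather than anything the slides use, and its summary layout differs from Minitab's:

```python
# Ordinary least squares fit of the calculus data with statsmodels.
import statsmodels.api as sm

X = sm.add_constant(x)        # adds the intercept column
model = sm.OLS(y, X).fit()
print(model.params)           # intercept about 40.78, slope about 0.7656
print(model.rsquared)         # about 0.705
print(model.summary())        # full table, analogous to the Minitab output
```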

Measuring the Strength


of the Relationship
If the independent variable x is useful in
predicting y, you will want to know how well
the model fits.
The strength of the relationship between x and y
can be measured using:
Correlation coefficient: r = Sxy / √(Sxx·Syy)
Coefficient of determination: r² = Sxy² / (Sxx·Syy) = SSR / Total SS

Measuring the Strength


of the Relationship

Since Total SS = SSR + SSE, r² measures
the proportion of the total variation in the
responses that can be explained by using the
independent variable x in the model.
the percent reduction in the total variation achieved by
using the regression equation rather than just
using the sample mean ȳ to estimate y.
For the calculus problem, r² = SSR/Total SS = .705 or
70.5%. The model is working well!
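Continuing the running Python sketch, r and r² for the calculus data:

```python
# Correlation coefficient and coefficient of determination.
from math import sqrt

r = Sxy / sqrt(Sxx * Syy)
r_squared = ssr / total_ss                 # same as r**2 here
print(round(r, 3), round(r_squared, 3))    # about 0.840 and 0.705
```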

Interpreting a
Significant Regression

Even if you do not reject the null hypothesis


that the slope of the line equals 0, it does not
necessarily mean that y and x are unrelated.

Type II error: falsely declaring that the slope is
0 and that x and y are unrelated.

It may happen that y and x are perfectly related


in a nonlinear way.

Some Cautions

You may have fit the wrong model.

Extrapolation: predicting values of y outside
the range of the fitted data.

Causality: do not conclude that x causes y.
There may be an unknown variable at work!

Checking the
Regression Assumptions
Remember that the results of a regression
analysis are only valid when the necessary
assumptions have been satisfied.
1. The relationship between x and y is linear,
given by y = α + βx + ε.
2. The random error terms ε are independent and,
for any value of x, have a normal distribution
with mean 0 and variance σ².

Diagnostic Tools
We use the same diagnostic tools used in
Chapter 11 to check the normality
assumption and the assumption of equal
variances.
1. Normal probability plot of residuals
2. Plot of residuals versus fit or
residuals versus variables

Residuals
The residual error is the leftover
variation in each data point after the
variation explained by the regression model
has been removed.
Residual = yᵢ − ŷᵢ = yᵢ − (a + bxᵢ)
If all assumptions have been met, these
residuals should be normal, with mean 0
and variance σ².

Normal Probability Plot

If the normality assumption is valid, the
plot should resemble a straight line,
sloping upward to the right.
If not, you will often see the pattern fail
in the tails of the graph.
[Normal Probability Plot of the Residuals (response is y): Percent versus Residual]

Residuals versus Fits

IfIf the
the equal
equal variance
variance assumption
assumption isis valid,
valid,
the
the plot
plot should
should appear
appear as
as aa random
random
scatter
scatter around
around the
the zero
zero center
center line.
line.

IfIf not,
not, you
you will
will see
see aa pattern
pattern in
in the
the
residuals.
residuals.
[Residuals Versus the Fitted Values (response is y): Residual versus Fitted Value]
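Both diagnostic plots can be sketched for the calculus fit as follows; matplotlib and scipy.stats.probplot are choices made for this transcript, while the slides show Minitab graphs:

```python
# Residual diagnostics for the calculus fit from the running sketch.
import matplotlib.pyplot as plt
from scipy import stats

fitted = [a + b * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(residuals, dist="norm", plot=ax1)   # normal probability plot
ax1.set_title("Normal probability plot of residuals")
ax2.scatter(fitted, residuals)                     # residuals versus fits
ax2.axhline(0, linestyle="--")
ax2.set_xlabel("Fitted value")
ax2.set_ylabel("Residual")
plt.show()
```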

Estimation and Prediction

Once you have


determined that the regression line is useful
used the diagnostic plots to check for
violation of the regression assumptions.
You are ready to use the regression line to
Estimate the average value of y for a
given value of x
Predict a particular value of y for a
given value of x.

Estimation and Prediction


Estimating a
particular value of y
when x = x0

Estimating the
average value of y
when x = x0


Estimation and Prediction

The best estimate of either E(y) or y for


a given value x = x0 is

ŷ = a + bx0

Particular values of y are more difficult to


predict, requiring a wider range of values in the
prediction interval.


Estimation and Prediction


To estimate the average value of y when x = x0:
ŷ ± t(α/2) √( MSE [ 1/n + (x0 − x̄)²/Sxx ] )

To predict a particular value of y when x = x0:
ŷ ± t(α/2) √( MSE [ 1 + 1/n + (x0 − x̄)²/Sxx ] )

The Calculus Problem


Estimate the average calculus grade for
students whose achievement score is 50 with a
95% confidence interval.

Calculate ŷ = 40.78424 + .76556(50) = 79.06
ŷ ± 2.306 √( 75.7532 [ 1/10 + (50 − 46)²/2474 ] )
79.06 ± 6.55, or 72.51 to 85.61.

The Calculus Problem


Estimate the calculus grade for a particular
student whose achievement score is 50 with a
95% prediction interval.

Calculate ŷ = 40.78424 + .76556(50) = 79.06
ŷ ± 2.306 √( 75.7532 [ 1 + 1/10 + (50 − 46)²/2474 ] )
79.06 ± 21.11, or 57.95 to 100.17.
Notice how much wider this interval is!

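Continuing the sketch, both intervals at x0 = 50 can be reproduced as follows (the numbers agree with the slides up to rounding):

```python
# 95% confidence interval for E(y) and prediction interval for y at x0 = 50.
from math import sqrt

x0 = 50
xbar = sum(x) / n
y_hat = a + b * x0                                            # about 79.06
half_ci = t_crit * sqrt(mse * (1 / n + (x0 - xbar) ** 2 / Sxx))
half_pi = t_crit * sqrt(mse * (1 + 1 / n + (x0 - xbar) ** 2 / Sxx))
print(round(y_hat - half_ci, 2), round(y_hat + half_ci, 2))   # about 72.51, 85.61
print(round(y_hat - half_pi, 2), round(y_hat + half_pi, 2))   # about 57.95, 100.17
```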

Minitab Output

Confidence and prediction intervals when x = 50:

Predicted Values for New Observations
New Obs      Fit   SE Fit         95.0% CI          95.0% PI
1          79.06     2.84   (72.51, 85.61)   (57.95, 100.17)

Values of Predictors for New Observations
New Obs       x
1          50.0

[Fitted Line Plot: y = 40.78 + 0.7656 x, showing the regression line with
95% CI and 95% PI bands; S = 8.70363, R-Sq = 70.5%, R-Sq(adj) = 66.8%]

Green prediction bands are always wider than red confidence bands.
Both intervals are narrowest when x = x̄.

Correlation Analysis

The strength of the relationship between x and y is


measured using the coefficient of correlation:
r = Sxy / √(Sxx·Syy)

Recall from Chapter 3 that


(1) −1 ≤ r ≤ 1
(2) r and b have the same sign
(3) r ≈ 0 means no linear relationship
(4) r ≈ 1 or −1 means a strong (+) or (−) relationship

Example
The table shows the heights and weights of
n = 10 randomly selected college football
players.
Player        1    2    3    4    5    6    7    8    9   10
Height, x    73   71   75   72   72   75   67   69   71   69
Weight, y   185  175  200  210  190  195  150  170  180  175

Use your calculator to find the sums and sums of squares:
Sxy = 328    Sxx = 60.4    Syy = 2610
r = 328 / √((60.4)(2610)) = .8261

Football Players
[Scatterplot of Weight vs Height]
r = .8261: strong positive correlation.
As the player's height increases, so does his weight.

Some Correlation Patterns

Use the Exploring Correlation applet to explore some correlation patterns:
r = 0: no correlation
r = .931: strong positive correlation
r = 1: linear relationship
r = −.67: weaker negative correlation


Inference using r

The population coefficient of correlation is


called ρ (rho). We can test for a significant
correlation between x and y using a t test:
To test H0: ρ = 0 versus Ha: ρ ≠ 0
Test statistic: t = r √( (n − 2) / (1 − r²) )
Reject H0 if t > t(α/2) or t < −t(α/2) with n − 2 df.
This test is exactly equivalent to the t-test for the slope β.


Example
r = .8261
Is there a significant positive correlation
between weight and height in the population
of all college football players?
H0: ρ = 0 versus Ha: ρ > 0
Test statistic: t = r √( (n − 2) / (1 − r²) )
  = .8261 √( 8 / (1 − .8261²) ) = 4.15
Use the t-table with n − 2 = 8 df to bound the p-value
as p-value < .005. There is a significant positive correlation.

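A self-contained sketch of the football-player correlation and its one-sided t test (scipy is assumed for the p-value; the slides bound it from a t table):

```python
# Correlation between height and weight and the test of H0: rho = 0.
from math import sqrt
from scipy import stats

height = [73, 71, 75, 72, 72, 75, 67, 69, 71, 69]
weight = [185, 175, 200, 210, 190, 195, 150, 170, 180, 175]
n = len(height)

Sxy = sum(h * w for h, w in zip(height, weight)) - sum(height) * sum(weight) / n
Sxx = sum(h ** 2 for h in height) - sum(height) ** 2 / n
Syy = sum(w ** 2 for w in weight) - sum(weight) ** 2 / n
r = Sxy / sqrt(Sxx * Syy)                 # about .8261
t = r * sqrt((n - 2) / (1 - r ** 2))      # about 4.15
p_one_sided = stats.t.sf(t, df=n - 2)     # well below .005
print(round(r, 4), round(t, 2), round(p_one_sided, 4))
```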

Key Concepts
I. A Linear Probabilistic Model
1. When the data exhibit a linear relationship, the
appropriate model is y = α + βx + ε.
2. The random error ε has a normal distribution with
mean 0 and variance σ².
II. Method of Least Squares
1. Estimates a and b, for α and β, are chosen to
minimize SSE, the sum of the squared deviations
about the regression line, ŷ = a + bx.
2. The least squares estimates are b = Sxy/Sxx and
a = ȳ − b·x̄.


Key Concepts
III. Analysis of Variance
1. Total SS = SSR + SSE, where Total SS = Syy and
SSR = (Sxy)²/Sxx.
2. The best estimate of σ² is MSE = SSE/(n − 2).
IV. Testing, Estimation, and Prediction
1. A test for the significance of the linear regression,
H0: β = 0, can be implemented using one of two test
statistics:
t = b / √(MSE/Sxx)   or   F = MSR/MSE

Key Concepts
2. The strength of the relationship between x and y can be
measured using
R² = SSR / Total SS,
which gets closer to 1 as the relationship gets stronger.
3. Use residual plots to check for nonnormality, inequality of
variances, and an incorrectly fit model.
4. Confidence intervals can be constructed to estimate the
intercept α and slope β of the regression line and to estimate
the average value of y, E(y), for a given value of x.
5. Prediction intervals can be constructed to predict a
particular observation, y, for a given value of x. For a given
x, prediction intervals are always wider than confidence
intervals.

Key Concepts
V. Correlation Analysis
1. Use the correlation coefficient to measure the
relationship between x and y when both variables are
random:
r = Sxy / √(Sxx·Syy)
2. The sign of r indicates the direction of the
relationship; r near 0 indicates no linear relationship,
and r near 1 or −1 indicates a strong linear
relationship.
3. A test of the significance of the correlation coefficient
is identical to the test of the slope β.
