
Chapter 11

Inferences for Regression Parameters

11.1 Simple Linear Regression (SLR) Model


This topic is covered in Chapter 2 (which we skipped). In these notes we cover sections 2.3 to 2.10 (which describe the relation between two variables and hence are often treated early, as descriptive statistics) as well as the material in Chapter 11 that covers inference about the regression parameters.

A simple linear regression model is a mathematical relationship between two quantitative variables, one of which, Y, is the variable we want to predict, using information on the second variable, X, which is assumed to be non-random. The model is written as

Y = β0 + β1X + ε.
In this model,
• Y is called the response variable or the dependent variable (because its value depends to some extent on the value of X).
• X is called the predictor (because it is used to predict Y) or the explanatory variable (because it explains the variation or changes in Y). It is also called the independent variable (because its value does not depend on Y).
• The true regression line is denoted by μY = β0 + β1x.
• The parameters of the true regression line are the constants β0 and β1.
• β0 is the intercept of the true regression line.
• β1 is the slope of the true regression line.
• The true values of the regression parameters, as well as the true regression line, are unknown. The true regression line shows the deterministic relationship between X and Y (since X is a non-random variable and β0 and β1 are (unknown) constants).
• The random error, ε, incorporates the effects on Y of all variables (factors) other than X, in such a way that their net effect is zero on average.
• An observation on the ith unit in the population, denoted by yi, is yi = β0 + β1xi + εi.
• Here, εi is the difference between the observed value of Y and the value on the true regression line that corresponds to X = xi.
• The εi are independent of each other and they all have the same normal distribution, with mean zero and variance σ², that is, εi ~ iid N(0, σ²).
• As a result of the above property, the yi = β0 + β1xi + εi are random variables that have normal distributions with a mean μY|X that depends on the value of X and a variance σ² that is the same for all X values, i.e., yi ~ N(μY|X, σ²).
• ŷ = β̂0 + β̂1x is called the prediction equation, and ŷ is the predicted value of Y for X = x.
• Residual = yi − ŷi = ε̂i (also written ei) = the difference between the observed and predicted values of Y.



• The parameters of the true regression line are estimated by the method of least squares (LSE), in which the sum of the squared residuals, Σ ei² (summed over i = 1, …, n), is minimized, subject to Σ ei = 0.

• The LSEs of β1 and β0 are β̂1 = b1 = r · (SY / SX) and β̂0 = b0 = Ȳ − b1 X̄.

In the above formulas, r is the sample correlation coefficient.
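As a quick illustration, here is a minimal Python sketch of these least-squares formulas. The x and y lists are hypothetical, and only the standard library is used (statistics.correlation requires Python 3.10+):

from statistics import mean, stdev, correlation

x = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical predictor values
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical responses

r = correlation(x, y)            # sample correlation coefficient
b1 = r * stdev(y) / stdev(x)     # slope: b1 = r * (Sy / Sx)
b0 = mean(y) - b1 * mean(x)      # intercept: b0 = ybar - b1 * xbar
print(f"yhat = {b0:.4f} + {b1:.4f} x")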

More on Correlation coefficient (r):

• A number that measures the strength and direction of the linear association between X and Y (both quantitative variables).

• −1 ≤ r ≤ +1 always.

• Correlation (r) = +1 when there is a perfectly linear increasing relationship between X and Y.

• Correlation (r) = −1 when there is a perfectly linear decreasing relationship between X and Y.

• Correlation has no units; it is a unitless quantity.

• R² = (r)² is called the coefficient of determination.

• R² measures the percent of variability in the response (Y) explained by the changes in X [or by the regression on X].

• What does R² = 0.81 (= 81%) mean?

• How do you find r when you are given R²? For example, what is r if R² = 0.81 = 81%?
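A worked answer to the last question: since R² = r², we have r = ±√R² = ±√0.81 = ±0.9, where the sign of r is the sign of the slope (read it from the scatter plot or from b1).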



Example (Problem 11.6) Does the price of crude oil ($/barrel) influence the price at the
pump ($/gallon)? To answer this question, let’s have a look at the data in that problem:
Year Crude Pump Year Crude Pump
1976 10.89 0.61 1991 19.06 1.14
1977 11.96 0.66 1992 18.43 1.13
1978 12.46 0.67 1993 16.41 1.11
1979 17.72 0.90 1994 15.59 1.11
1980 28.07 1.25 1995 17.23 1.15
1981 35.24 1.38 1996 20.71 1.23
1982 31.87 1.30 1997 19.04 1.23
1983 28.99 1.24 1998 12.52 1.06
1984 28.63 1.21 1999 17.51 1.17
1985 26.75 1.20 2000 28.26 1.51
1986 14.55 0.93 2001 22.95 1.46
1987 17.90 0.95 2002 24.10 1.36
1988 14.67 0.95 2003 28.53 1.59
1989 17.97 1.02 2004 36.98 1.88
1990 22.22 1.16
Source: Statistical Abstracts of the US, 2009.

The first step in regression analysis is to identify the

o Independent (explanatory) variable (X) and
o Dependent (response) variable (Y).

It is said that the crude oil prices influence the price at the pump. Hence,
X = Crude oil price = Independent variable
Y = Price at the pump = Dependent Variable.

Step Two: draw a scatter diagram of the data and interpret what you see (to get some ideas
about the relation between two variables).

[Scatter plot: Price at the Pump ($/Gallon) versus Crude Oil Prices ($/Barrel), 1976–2004. Source: Statistical Abstract of the US, 2009.]



1. What do you see in the scatter diagram?

2. Verify the following summary statistics using your calculator:

X̄ = 21.28, SX = 7.16
Ȳ = 1.1572, SY = 0.2732, r = 0.829
3. Compute the slope and intercept of the least squares regression line, given that r = 0.829:

Slope = b1 = r · SY / SX = 0.829 × 0.2732 / 7.16 = 0.031631676
Intercept = b0 = Ȳ − b1 X̄ = 1.1572 − 0.031631676 × 21.28 = 0.4841

Hence the prediction equation is ŷ = 0.48 + 0.03x.
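A quick check of this arithmetic in Python, using only the summary statistics above:

xbar, sx = 21.28, 7.16      # crude oil price: mean and standard deviation
ybar, sy = 1.1572, 0.2732   # pump price: mean and standard deviation
r = 0.829

b1 = r * sy / sx            # slope, ~0.0316
b0 = ybar - b1 * xbar       # intercept, ~0.4841
print(b1, b0)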


Are these results consistent with what you have observed in the scatter plot?
4. Interpret the numerical results:
 Correlation = r = 0.829, so there is a
o Strong (since r is close to +1)
o Increasing, (since r is positive)
o Linear relationship (from scatter diagram) between the price of crude oil and
the price at the pump.

• Slope = 0.0316 means that for every one-dollar increase in the crude oil price (X), the price at the pump increases by about 3 cents, on average.

• Intercept = 0.4841
In general, the intercept is the value of Y when X = 0. HOWEVER, DO NOT INTERPRET the intercept in this case because
a) zero is not within the range (10 to 38) of observed values of the crude oil price (X), and
b) X = 0 for the price of crude oil is not meaningful.

5. Compute R² (coefficient of determination) and interpret it.

R² = (r)² = (0.829)² = 0.687, i.e., 0.687 × 100% = 68.7%

Interpretation:
• 68.7% of the variation in PRICE AT THE PUMP (Y) is explained by the changes in PRICE OF CRUDE OIL (X).
• 68.7% of the variability in price at the pump is explained by the linear regression on crude oil prices.



6. Plot the estimated regression line on the scatter diagram. For this we choose any two values of X (as far from each other as meaningful) and predict the value of Y for those two values of X, using the prediction equation ŷ = 0.48 + 0.03x.

For x = 10 we have ŷ = 0.48 + 0.03 × 10 = 0.78.

For x = 40 we get ŷ = 0.48 + 0.03 × 40 = 1.68.

These give us the two points (10, 0.78) and (40, 1.68). Mark these points on the graph and join them with a ruler to get the fitted line on the scatter diagram.

[The same scatter plot, now with the fitted regression line drawn through the points (10, 0.78) and (40, 1.68).]



Preliminaries for Inferences for SLR

A linear relation between two quantitative variables (denoted by X and Y) was shown by
ŷ = b0 + b1x (called the prediction equation).
In this equation
• Y is called the response (dependent) variable,
• X is called the predictor (explanatory) variable,
• ŷ is called the predicted value of Y for some X = x,
• b0 = β̂0 = the estimate of the intercept (β0), and
• b1 = β̂1 = the estimate of the slope (β1).
Hence, ŷ = b0 + b1x is an estimate of the (true but unknown) regression line μY = β0 + β1X.
In the alternative notation, ŷ = a + bx is an estimate of the true (but unknown) regression line μY = α + βx, so we also write ŷ = μ̂Y.

What do the slope and intercept of the regression line tell us?
• The slope is the average amount of change in Y for one unit of increase in X.
Note: slope ≠ rise/run. Why?

• The intercept is the value of Y when X = 0.

Important Note: We DO NOT use the above interpretation when
a) X = 0 is not meaningful, or
b) zero is not within the range of, or near, the observed values of X.

Regression Model: A mathematical (or theoretical) equation that shows the linear relation between the explanatory variable and the response. The simple linear regression model we will use is Y = α + βx + ε, where ε is called the error term.

Let's see the error terms graphically:

• Total error for the ith observation = yi − Ȳ = (yi − ŷi) + (ŷi − Ȳ)
  = random error + regression error.

• In this relation, the total error is divided into two parts:
o The random error = y − ŷ, and
o The regression error = ŷ − Ȳ = the error due to using the regression model instead of the sample mean, Ȳ = 1.1572, as the predicted value of y.



[Scatter plot with the fitted regression line, illustrating the decomposition of the total error into random and regression errors.]

Assumptions of Simple Linear Regression:

1. A random sample of n pairs of observations,
(X1, Y1), (X2, Y2), …, (Xn, Yn).
2. The population of Y's has a normal distribution with mean μY = α + βX, which changes for each value of X.
3. The population of Y's has a constant standard deviation, σ, which is the same for every value of X.
4. The linear relation between X and Y may be formulated as Y = α + βX + ε.
5. As a result of the above assumptions, the error terms, ε, are iid (independent and identically distributed) random variables that have a normal distribution with mean zero and variance σ², i.e., ε ~ iid N(0, σ²).
6. These mean that both Y and ε are random variables [we may choose any value for X, hence it is assumed to be a non-random variable (even when it is random)].

Are these assumptions satisfied in the example above?

Assumption 1 (random sample) is not really satisfied. However, our main interest is in the
relation between X and Y and for that purpose we may assume that the relation in the years
1976 to 2004 is typical (representative) of the relation in other years.

To check assumptions 2 and 3 we look at the residuals, where

Residual = observed value of Y − predicted value of Y = y − ŷ.



If these residuals do not have any extreme values, that is, when all standardized residuals are between −3 and +3, we say assumptions 2 and 3 are justifiable (more later).

So let's calculate the residuals using the prediction equation ŷ = 0.48 + 0.03x.

Year   Crude   Pump   Fitted (ŷ)   Residual   St. Res.
1976 10.89 0.61 0.82852 -0.218517 -1.48764
1977 11.96 0.66 0.86236 -0.202360 -1.36640
1978 12.46 0.67 0.87817 -0.208175 -1.40078
1979 17.72 0.90 1.04454 -0.144544 -0.94924
1980 28.07 1.25 1.37190 -0.121904 -0.81048
1981 35.24 1.38 1.59868 -0.218685 -1.54210
1982 31.87 1.30 1.49209 -0.192095 -1.30984
1983 28.99 1.24 1.40100 -0.161003 -1.07580
1984 28.63 1.21 1.38962 -0.179617 -1.19773
1985 26.75 1.20 1.33015 -0.130154 -0.86015
1986 14.55 0.93 0.94428 -0.014280 -0.09491
1987 17.90 0.95 1.05024 -0.100237 -0.65797
1988 14.67 0.95 0.94808 0.001925 0.01279
1989 17.97 1.02 1.05245 -0.032451 -0.21298
1990 22.22 1.16 1.18687 -0.026875 -0.17573
1991 19.06 1.14 1.08693 0.053073 0.34756
1992 18.43 1.13 1.06700 0.063000 0.41304
1993 16.41 1.11 1.00311 0.106890 0.70481
1994 15.59 1.11 0.97717 0.132826 0.87863
1995 17.23 1.15 1.02905 0.120954 0.79541
1996 20.71 1.23 1.13911 0.090885 0.59419
1997 19.04 1.23 1.08629 0.143706 0.94112
1998 12.52 1.06 0.88007 0.179927 1.21021
1999 17.51 1.17 1.03790 0.132098 0.86800
2000 28.26 1.51 1.37791 0.132086 0.87903
2001 22.95 1.46 1.20996 0.250036 1.63613
2002 24.10 1.36 1.24634 0.113663 0.74515
2003 28.53 1.59 1.38645 0.203546 1.35655
2004 36.98 1.88 1.65372 0.226281 1.63142
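These fitted values, residuals, and standardized residuals can be reproduced with a short Python sketch. The data are the (crude, pump) pairs from the table above; dividing each residual by S·√(1 − h), where h is the observation's leverage, is my assumption about how the software computed the last column:

from math import sqrt

# (crude, pump) pairs for 1976-2004, from the data table above
data = [(10.89, 0.61), (11.96, 0.66), (12.46, 0.67), (17.72, 0.90),
        (28.07, 1.25), (35.24, 1.38), (31.87, 1.30), (28.99, 1.24),
        (28.63, 1.21), (26.75, 1.20), (14.55, 0.93), (17.90, 0.95),
        (14.67, 0.95), (17.97, 1.02), (22.22, 1.16), (19.06, 1.14),
        (18.43, 1.13), (16.41, 1.11), (15.59, 1.11), (17.23, 1.15),
        (20.71, 1.23), (19.04, 1.23), (12.52, 1.06), (17.51, 1.17),
        (28.26, 1.51), (22.95, 1.46), (24.10, 1.36), (28.53, 1.59),
        (36.98, 1.88)]
n = len(data)
xbar = sum(x for x, _ in data) / n
ybar = sum(y for _, y in data) / n
sxx = sum((x - xbar) ** 2 for x, _ in data)
b1 = sum((x - xbar) * (y - ybar) for x, y in data) / sxx   # least-squares slope
b0 = ybar - b1 * xbar                                      # least-squares intercept
resid = [y - (b0 + b1 * x) for x, y in data]
s = sqrt(sum(e ** 2 for e in resid) / (n - 2))             # S = sqrt(MSE), ~0.1557
for (x, _), e in zip(data, resid):
    h = 1 / n + (x - xbar) ** 2 / sxx                      # leverage of this x value
    print(f"{x:6.2f}  {e:9.6f}  {e / (s * sqrt(1 - h)):8.5f}")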

Then we will plot the residuals.



[Histogram of the standardized residuals, with the number of years on the vertical axis: mean = −0.0044, StDev = 1.030, N = 29.]

Do you think the assumption of normality is satisfied? Why or why not?

Inferences about Parameters of SLR (1)

The parameters of the regression model are β0, β1 and σ.
They are estimated by (2) b0, b1 and S, respectively.

In this section we will deal with inferences about the true SLR model, i.e., a regression model with one explanatory variable (X). When making inferences about the parameters of the regression model, we will determine
• if X is a "good predictor" of Y,
• if the regression line is useful for making predictions about Y,
• if the slope is different from zero,
• a prediction interval for one individual response, Y, at X = x, and
• confidence intervals for the mean of Y, that is, μY = mean response, at X = x.

(1) Later we will see how to make inferences about the parameters of a multiple regression model, i.e., a regression model with several (k ≥ 2) explanatory variables, X1, X2, …, Xk.
(2) Remember the estimators of β0 and β1 are b0 and b1, respectively. The estimator of σ² is
σ̂² = S² = Σ ei² / (n − 2) = SSE / (n − 2) = MSE.



ANOVA FOR SLR

Is X a good predictor of Y? This is equivalent to asking whether the slope of the line is significantly different from zero. [If not, we might as well use Ȳ as the predictor.]
We can answer these questions using an ANOVA table:

ANOVA for SLR

Source               df      SS      MS = SS/df          F
Regression (Model)   1       SSReg   MSReg = SSReg/1     F = MSReg/MSE
Residuals (Error)    n − 2   SSE     MSE = SSE/(n − 2)
Total                n − 1   SST

Total SS = Model SS + Error SS

SST = SSReg + SSE
Σ (yi − Ȳ)² = Σ (ŷi − Ȳ)² + Σ (yi − ŷi)²   (sums over i = 1, …, n)

df: (n − 1) = 1 + (n − 2)

The df1 = df for regression = 1 because there is only 1 independent variable.
The df2 = df for residuals = n − 2 because we estimate 2 parameters (β0 and β1).
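As a sanity check, these ANOVA quantities can be computed from the summary statistics of the oil example alone. A minimal Python sketch (small discrepancies from the Minitab output below come from rounding in r and SY):

n, sy, r = 29, 0.2732, 0.829

sst = (n - 1) * sy ** 2    # total SS, ~2.090
ssreg = r ** 2 * sst       # regression SS, ~1.436
sse = sst - ssreg          # error SS, ~0.654
msreg = ssreg / 1          # df1 = 1
mse = sse / (n - 2)        # df2 = n - 2 = 27
F = msreg / mse            # ~59.3, close to Minitab's 59.21
print(sst, ssreg, sse, F)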

We have the usual six steps of hypothesis testing:

1. Assumptions for ANOVA:
• Random sample
• Normal distribution (of ε and hence Y)
• Constant variance (of ε and Y)

2. The hypothesis of interest is Ho: β1 = 0 vs. Ha: β1 ≠ 0.

3. Test statistic = F = MSReg / MSE ~ F(df1, df2)

4. To find the p-value, we will use the tables of the F-distribution.
First find the tabulated F-value from the F-tables with df1 = 1 and df2 = n − 2;
then compare that value with the F in the ANOVA table.
5. Decision
6. Conclusion

The following is the output obtained from Minitab:



Regression Analysis: Pump versus Crude

Analysis of Variance

Source DF SS MS F P
Regression 1 1.4350 1.4350 59.21 0.000
Residual Error 27 0.6544 0.0242
Total 28 2.0894

To test Ho: β1 = 0 vs. Ha: β1 ≠ 0:

The calculated value of the test statistic is Fcal = 59.21 (from the ANOVA table).

The Fcal value is extremely large!!! What does it mean?

The p-value = 0.000. What does it mean?

Decision?

Conclusion?

Decision: Reject Ho, since the p-value (< 0.0005) is less than any reasonable level of significance.

Conclusion: The observed data indicate that the slope is significantly different from zero,
i.e., the observed data strongly indicate that the price of gas at the pump
depends on the price of crude oil.

Using the t-test

We may also use the t-test for testing the above hypotheses, as explained when testing hypotheses about the mean of a population.

Hypotheses: Ho: β1 = 0 vs. Ha: β1 ≠ 0

The test statistic is

T = (Estimator − value of the parameter in Ho) / SE(Estimator) = (β̂1 − 0) / se(β̂1) ~ t(df2)

To carry out the test we use the first block of the Minitab output:



Regression Analysis: Pump versus Crude

The regression equation is


Pump = 0.484 + 0.0316 Crude

Predictor Coef SE Coef T P


Constant 0.48408 0.09214 5.25 0.000
Crude 0.031629 0.004111 7.69 0.000

S = 0.155683 R-Sq = 68.7% R-Sq(adj) = 67.5%

In this case the parameter is β1:

Estimate of β1: β̂1 = b1 = 0.031629
SE(estimate) = SE(β̂1) = 0.004111

Calculated value of the test statistic: Tcal = (0.031629 − 0) / 0.004111 = 7.69
To find the p-value, go back and look at Ha.
We have a two-sided alternative, Ha: β1 ≠ 0, and hence
p-value = 2 × P(T(n−2) ≥ |Tcal|) = 2 × P(T ≥ 7.69) ≈ 0.

Note that the df of the t-distribution is df2 = dferror = df for error in ANOVA table.

The above p-value gives us the same decision and conclusion as the one we got from the
ANOVA table.

Compare the Tcal = 7.69 (above) with the Fcal = 59.21 in the ANOVA table.
We have the following general relation between Tcal and Fcal in SLR (only):
(Tcal)² = Fcal, and equivalently Tcal = ±√Fcal.

So the p-value for the t-test is the same as the p-value for the F-test. Hence, in SLR the two significance tests for the slope, the F-test (using the ANOVA table) and the t-test, give the same results.
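A numerical check of this relation, with scipy assumed to be available for the tail probabilities:

from scipy import stats

t_cal, f_cal, df2 = 7.69, 59.21, 27
print(t_cal ** 2)                  # 59.14, equal to f_cal up to rounding
p_t = 2 * stats.t.sf(t_cal, df2)   # two-sided p-value from the t-test
p_f = stats.f.sf(f_cal, 1, df2)    # p-value from the F-test
print(p_t, p_f)                    # both on the order of 1e-8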

Observe that the above conclusion does not tell us in what way β1 is different from zero. We could use the t-test for testing one-sided alternatives about β1. However, these should be decided before looking at the data.

When the null hypothesis, Ho: β1 = 0 is rejected we conclude that X, the explanatory
variable, explains a “good” percentage of the variation in Y. Equivalently, X is a “good”
predictor of Y.



Confidence Interval for the Slope:

Remember the general formula for the confidence interval:

CI = (Estimate ± ME) = (Estimate ± t × SE(estimate))

This is used in finding a CI for β1, where the estimate is b1 and SE(estimate) is given in the Minitab output. All we need to do is find t from the t-tables with df = (n − 2) = dferror = df2.

For the above example we had the following results from Minitab:

Predictor Coef SE Coef T P


Constant 0.48408 0.09214 5.25 0.000
Crude 0.031629 0.004111 7.69 0.000

That is, b1 = estimated slope = 0.031629 and SE(b1) = 0.004111.

Also, since df2 = dferror = 27 in the ANOVA table, we use the table of the t-distribution and read the value of t for a 95% CI on the row with df = 27 as t = 2.052, which gives
ME = t × SE(estimate) = (2.052)(0.004111) = 0.00844

Hence a 95% CI for β1 is

CI = (0.031629 ± 0.00844) = (0.02319, 0.04007) ≈ (0.02, 0.04)
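The same interval in Python (scipy assumed for the t critical value):

from scipy import stats

b1, se, df = 0.031629, 0.004111, 27
t_star = stats.t.ppf(0.975, df)   # ~2.052 for a 95% CI
me = t_star * se                  # margin of error, ~0.00844
print(b1 - me, b1 + me)           # ~(0.0232, 0.0400)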

As in previous chapters, we can use the CI to make a decision for a significance test. When zero is not in the CI we reject Ho and conclude that the observed data give evidence that the slope of the regression line is different from zero.

Actually, we can say more: since the CI for β1 in this example is (0.02, 0.04), both ends of the CI are positive, so we can conclude with 95% confidence that the slope of the true regression line is some number between 0.02 and 0.04.

Alternatively, we interpret the CI as follows:

We are 95% confident that, on average, as the crude oil price increases by one dollar, the price at the pump increases by somewhere between 2 and 4 cents.



CI for Mean Response

General formula for CIs:

CI = (Estimator ± t* × SE(estimator))

Additional symbols:
• μY|x* = α + βx*
  = the mean response for the population of ALL Y's that correspond to X = x*
  = the point on the true regression line that corresponds to X = x*.
• μ̂Y|x* = ŷ|x* = b0 + b1x* = the estimator of the mean response at X = x*.
• SE(ŷ|x*) = S · sqrt( 1/n + (x* − X̄)² / Σ (Xi − X̄)² ).
• Hence the CI for μY|x* = the mean response when X = x* is

CI(Mean Response) = ŷ|x* ± t(df, α/2) · S · sqrt( 1/n + (x* − X̄)² / Σ (Xi − X̄)² )
Prediction Interval (PI for one new response)

• ŷ|x* = b0 + b1x* = the predicted value for one new response at X = x*.
• SE(ŷ|x*) = SE(one new response) = S · sqrt( 1 + 1/n + (x* − X̄)² / Σ (Xi − X̄)² ).
• Hence the prediction interval (PI) for one new response is

PI(One New Response) = ŷ|x* ± t(df, α/2) · S · sqrt( 1 + 1/n + (x* − X̄)² / Σ (Xi − X̄)² )



• Compare the formulas for the CI and the PI to see the difference between them:

CI(Mean Response) = ŷ|x* ± t(df, α/2) · S · sqrt( 1/n + (x* − X̄)² / Σ (Xi − X̄)² )

PI(One New Response) = ŷ|x* ± t(df, α/2) · S · sqrt( 1 + 1/n + (x* − X̄)² / Σ (Xi − X̄)² )

In both of the above formulas
• S = the standard deviation of the points around the regression line = √MSE,
• df = df2 = dferror,
• x* = a particular value of X for which we are making a prediction,
• both the CI and the PI are centered around ŷ|x* = b0 + b1x* = the prediction at X = x*,
• the PI for a new response is always wider than the CI for the mean response at the same value of X = x* (why?), and
• the SEs, and hence the intervals, are narrower when x* is close to X̄ (the mean of the sample of X's) and wider when x* is far from X̄ (why?). A numerical sketch follows this list.
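A minimal Python sketch of both intervals for the oil example, using the summary values from these notes; the choice x* = 25 is hypothetical, and scipy is assumed for the t critical value:

from math import sqrt
from scipy import stats

n, xbar, sx = 29, 21.28, 7.16            # summary statistics of X
b0, b1, s = 0.48408, 0.031629, 0.155683  # fitted line and S from Minitab
x_star = 25.0                            # hypothetical value of X

yhat = b0 + b1 * x_star                  # center of both intervals
sxx = (n - 1) * sx ** 2                  # sum of (Xi - xbar)^2
t_star = stats.t.ppf(0.975, n - 2)       # 95% => alpha/2 = 0.025, df = 27
se_mean = s * sqrt(1 / n + (x_star - xbar) ** 2 / sxx)
se_pred = s * sqrt(1 + 1 / n + (x_star - xbar) ** 2 / sxx)
print("CI:", yhat - t_star * se_mean, yhat + t_star * se_mean)  # ~(1.21, 1.34)
print("PI:", yhat - t_star * se_pred, yhat + t_star * se_pred)  # ~(0.95, 1.60)

Note how the extra "1 +" under the square root makes the PI much wider than the CI.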

Look at Figure 11.19. Can you find the 95% confidence band for the mean response?
How about the prediction band?

Inferences about ρ:

Remember the estimator of the slope of the true regression line? It is β̂1 = b1 = r · SY/SX. We may rearrange the equation and write ρ̂ = r = b1 · SX/SY. Hence any inference on β1 gives the same result for ρ. This is because the test statistic for testing Ho: ρ = 0 vs. Ha: ρ ≠ 0 is

T = r · sqrt( (n − 2) / (1 − r²) ) = β̂1 / se(β̂1) ~ t(n−2).   [See the proof on page 589.]

Thus, a test of Ho: ρ = 0 vs. Ha: ρ ≠ 0 gives the same results as a test of Ho: β1 = 0 vs. Ha: β1 ≠ 0.

Similarly, if a confidence interval for β1 contains zero, the confidence interval for ρ will also contain zero.
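Checking this numerically for the oil example (values from these notes):

from math import sqrt

r, n = 0.829, 29
T = r * sqrt((n - 2) / (1 - r ** 2))
print(T)   # ~7.70, matching Tcal = 7.69 for the slope up to rounding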



More on R²:

We have seen that R² = (r)². It can also be defined and calculated from the following relation:

R² = SSReg / SST = (variation in Y explained by the regression) / (total variation in Y)

This leads to alternative interpretations of R²:

• R² is the proportion of variability in Y that is explained by the regression on X, or equivalently,

• R² is the proportional reduction in the prediction error, that is,

• R² is the percentage of reduction in prediction error we will see when the prediction equation is used instead of Ȳ (the sample mean of Y) as the predicted value of Y.

Example: In the ANOVA table for the analysis of guessed ages we had the following output:

S = 4.48483   R-Sq = 96.9%   R-Sq(adj) = 96.5%

Source            DF   SS      MS      F       P
Regression        1    5030.0  5030.0  250.08  0.000
Residual (Error)  8    160.9   20.1
Total             9    5190.9

Then, R² = SSReg / SST = 5030.0 / 5190.9 = 0.969 = 96.9%.

This is the same result we had from Minitab, as it should be. We may now interpret this as
follows:

The regression model yields a predicted value for Y that has 96.9% less error than we would
have if we used the sample mean of Y’s as a predicted value.



More on Residuals:

Residual = the vertical distance from an observed point to the predicted value for the same X
         = observed y − predicted y
         = y − ŷ,

where ŷ = −0.03 + 1.05x is the prediction equation (for the guessed-ages example).


Observed y   Predicted (ŷ)   Residual (y − ŷ)
20           21.06           −3.06
45           47.43            4.57
70           73.79           −8.79
85           89.61            0.39
25           26.33            1.67
50           52.70            5.30
15           15.79           −2.79
60           63.25            2.75
40           42.15            1.85
35           36.88           −1.88

Hence, for example, for someone whose actual age is 35, the predicted value of his/her age is
36.88. This means the prediction was 1.88 years higher than the true age.

• Positive residuals: observations above the regression line.

• Negative residuals: observations below the regression line.

• Sum of residuals = 0, ALWAYS.

• We (or computers) can make residual plots to see if there are any problems with the assumptions.

• Computers find "standardized residuals" = a z-score for each observation. Any point that has a z-score bigger than 3 in absolute value, i.e., |z| > 3, is called an outlier.



More on Correlation:

If x* − X̄ = k·SX,
that is, if the distance between a given value of X, say x*, and X̄ (in absolute value) is k standard deviations of X,
then ŷ − Ȳ = r·k·SY;
that is, the distance (in absolute value) between the predicted value of y (ŷ) at x* and Ȳ is r·k standard deviations of Y.

Example:
Suppose Y = heights of children and X = heights of their fathers, and the correlation between the two variables is r = 0.5.

Then,
• If a father's height is k = 2 standard deviations above the mean height of all fathers, then the predicted height of his child will be
r × k = 0.5 × 2 = 1 standard deviation
above the mean height of all children.

• If a father's height is 1.5 standard deviations below the mean height of all fathers, then his child's predicted height will be 0.5 × 1.5 = 0.75 standard deviations below the mean height of all children.

Some more on correlation

• Correlation is very much affected by outliers and influential points.

• Outliers weaken the correlation.

• Influential points (points far from the rest of the observations in the x-direction that do not follow the trend) may change the sign and value of the slope.



Residual Plots

Residuals are the estimators of the error term (ε) in the regression model. Thus, the assumption of normality of ε can be checked by looking at a histogram of the residuals.

• A histogram of residuals that is (almost) bell-shaped (symmetric) supports the assumption of normality of the residuals.

• A histogram or a dot plot that shows outliers is indicative of a violation of the assumption of normality.

• A normal probability plot or normal quantile plot can also be used to check the normality assumption. Points in a normal P-P or Q-Q plot that fall around a straight line support the assumption of normality.

Plots of the residuals against the explanatory variable (X) magnify any problems with the assumptions:

• If the residuals are randomly scattered around the line residuals = 0, this is good. It means nothing else is left after using X to predict Y.

• If the residual plot shows a curved pattern, this indicates that a curvilinear fit (quadratic?) will give better results.

• If the residual plot is funnel shaped, this means the assumption of constant variance is violated.

• If the residual plot shows an outlier, this may mean a violation of normality and/or constant variance, or indicate an influential point.
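A minimal matplotlib sketch of such a residual plot, using a hypothetical subset of the residuals computed earlier for the oil example:

import matplotlib.pyplot as plt

# A few (x, residual) values from the residual table; in practice use all of them
x = [10.89, 17.72, 28.07, 35.24, 22.22]
residuals = [-0.2185, -0.1445, -0.1219, -0.2187, -0.0269]

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")   # residuals should scatter randomly about this line
plt.xlabel("Crude Oil Price ($/Barrel)")
plt.ylabel("Residual")
plt.title("Residuals vs. X")
plt.show()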



Summary of SLR

Model: Y = α + βx + ε
Assumptions:
a) Random sample
b) Normal distribution
c) Constant variance
d) ε ~ N(0, σ²)

Parameters and Estimators:

• Intercept = α, estimated by a = Ȳ − b·X̄
• Slope = β, estimated by b = r · SY / SX
• St. deviation = σ, estimated by S = √MSE

Interpretation of
• the slope,
• the intercept,
• R², and
• r.

Testing if the model is good:
• ANOVA
• The t-test for the slope
• CI for the slope

PI and CI for a response

Residual plots and their interpretations.

