$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ is called the prediction equation, and $\hat{y}$ is the predicted value of Y for X = x.
Residual $= y_i - \hat{y}_i = e_i = \hat{\varepsilon}_i$ is the difference between the observed and predicted value of Y.
The LSEs of $\beta_0$ and $\beta_1$ are $b_1 = \hat{\beta}_1 = r \cdot \dfrac{S_Y}{S_X}$ and $b_0 = \hat{\beta}_0 = \bar{Y} - b_1 \bar{X}$.
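A minimal Python sketch of these two formulas, assuming numpy is available and using small hypothetical data arrays x and y (not the course data):

    import numpy as np

    def least_squares(x, y):
        # b1 = r * (S_Y / S_X); b0 = Ybar - b1 * Xbar
        r = np.corrcoef(x, y)[0, 1]               # sample correlation r
        b1 = r * (y.std(ddof=1) / x.std(ddof=1))  # slope estimate
        b0 = y.mean() - b1 * x.mean()             # intercept estimate
        return b0, b1

    x = np.array([10.0, 15.0, 20.0, 25.0, 30.0])  # hypothetical X values
    y = np.array([0.60, 0.80, 1.00, 1.30, 1.50])  # hypothetical Y values
    b0, b1 = least_squares(x, y)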
The correlation coefficient r is a number that measures the strength and direction of the linear association between X and Y (both quantitative variables).
Always $-1 \le r \le +1$.
R² measures the percent of variability in the response (Y) explained by the changes in X [or by the regression on X].
How do you find r when you are given R²? For example, what is r if R² = 0.81 = 81%?
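A one-step worked answer (the sign of r is the sign of the slope):

$$r = \pm\sqrt{R^2} = \pm\sqrt{0.81} = \pm 0.9$$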
It is said that the crude oil prices influence the price at the pump. Hence,
X = Crude oil price = Independent variable
Y = Price at the pump = Dependent Variable.
Step Two: draw a scatter diagram of the data and interpret what you see (to get some ideas
about the relation between two variables).
[Figure: A Scatter Plot of Crude Oil Prices vs. Prices at the Pump (1976–2004); x-axis: Crude Oil Prices ($/Barrel), roughly 10 to 40; y-axis: Price at the Pump ($/Gallon), roughly 0.50 to 2.00. Source: Statistical Abstract of the US, 2009.]
Intercept = 0.4841
In general, the intercept is the value of Y when X = 0. HOWEVER,
DO NOT INTERPRET the intercept in this case because
a) Zero is not within the range (10 to 38) of observed values of the crude oil
price (X) and
b) X = 0 for the price of crude oil is not meaningful.
These give us the two points (10, 0.51) and (40, 1.68) that we connect on the scatter diagram. Now mark these points on the graph and join them with a ruler to get the blue line on the graph.
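A minimal matplotlib sketch of this step, with hypothetical stand-in arrays for the oil and pump prices (not the actual 1976–2004 series):

    import numpy as np
    import matplotlib.pyplot as plt

    oil = np.array([11.0, 14.0, 22.0, 28.0, 36.0])    # hypothetical X data
    pump = np.array([0.55, 0.72, 1.05, 1.28, 1.60])   # hypothetical Y data

    plt.scatter(oil, pump)                      # the scatter diagram
    plt.plot([10, 40], [0.51, 1.68], "b-")      # line through the two points
    plt.xlabel("Crude Oil Prices ($/Barrel)")
    plt.ylabel("Price at the Pump ($/Gallon)")
    plt.show()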
[Figure: the same scatter plot of Crude Oil Prices ($/Barrel) vs. Price at the Pump ($/Gallon), now with the fitted (blue) line drawn through the points (10, 0.51) and (40, 1.68). Source: Statistical Abstract of the US, 2009.]
A linear relation between two quantitative variables (denoted by X and Y) was shown by
ŷ = b0 + b1 x (called the prediction equation)
In this equation
Y is called the response (dependent) variable,
X is called the predictor (explanatory) variable,
$\hat{y}$ is the predicted value of Y for some X = x,
$b_0 = \hat{\beta}_0$ = estimate of the intercept ($\beta_0$), and
$b_1 = \hat{\beta}_1$ = estimate of the slope ($\beta_1$).
Hence, $\hat{y} = b_0 + b_1 x$ is an estimate of the true (but unknown) regression line $\mu_Y = \beta_0 + \beta_1 x$.
Also, $\hat{y} = a + bx$ is called an estimate of the true (but unknown) regression line $\mu_Y = \alpha + \beta x$, so we also write $\hat{y} = \hat{\mu}_Y$.
What do the slope and intercept of the regression line tell us?
Slope is the average amount of change in Y for one unit of increase in X.
Note: slope ≠ rise / run. Why?
Regression Model: A mathematical (or theoretical) equation that shows the linear relation between the explanatory variable and the response. The simple linear regression model we will use is $Y = \alpha + \beta x + \varepsilon$, where $\varepsilon$ is called the error term.
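A minimal simulation sketch of this model; the parameter values α = 0.5, β = 0.03, σ = 0.1 are assumptions chosen for illustration, not estimates from the data:

    import numpy as np

    rng = np.random.default_rng(0)
    alpha, beta, sigma = 0.5, 0.03, 0.1    # assumed "true" parameters
    x = rng.uniform(10, 40, size=100)      # explanatory variable
    eps = rng.normal(0, sigma, size=100)   # error term, N(0, sigma)
    y = alpha + beta * x + eps             # the SLR model Y = alpha + beta*x + error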
[Figure: scatter plot of Crude Oil Prices ($/Barrel) vs. Price at the Pump ($/Gallon). Source: Statistical Abstract of the US, 2009.]
Assumption 1 (random sample) is not really satisfied. However, our main interest is in the
relation between X and Y and for that purpose we may assume that the relation in the years
1976 to 2004 is typical (representative) of the relation in other years.
So, let’s calculate the residuals using the prediction equation, $\hat{y} = 0.48 + 0.03x$.
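A minimal sketch of this computation, with hypothetical stand-in values for the observed data:

    import numpy as np

    x = np.array([12.0, 20.0, 35.0])   # hypothetical crude oil prices
    y = np.array([0.90, 1.05, 1.62])   # hypothetical pump prices

    y_hat = 0.48 + 0.03 * x            # prediction equation
    residuals = y - y_hat              # observed minus predicted
    # Simple z-score standardization, as in these notes; statistical
    # software also adjusts each residual for its leverage.
    std_resid = (residuals - residuals.mean()) / residuals.std(ddof=1)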
[Figure: histogram of the standardized residuals; x-axis: Standardized Residuals (−2 to 2), y-axis: Number of Years (0 to 5).]
In this section we will deal with inferences about the true SLR model, i.e., a regression model with one explanatory variable (X). When making inferences about the parameters of the regression model, we will determine
whether X is a “good predictor” of Y,
whether the regression line is useful for making predictions about Y, and
whether the slope is different from zero,
and we will construct
a prediction interval for one individual response Y at X = x, and
confidence intervals for the mean of Y, that is, µY = the mean response, at X = x.
Later we will see how to make inferences about the parameters of a multiple regression model, i.e., a regression model with several (k ≥ 2) explanatory variables, X1, X2, …, Xk.
Remember that the estimators of β0 and β1 are b0 and b1, respectively.
The estimator of $\sigma^2$ is
$$\hat{\sigma}^2 = s^2 = \frac{\sum_{i=1}^{n} e_i^2}{n-2} = \frac{SSE}{n-2} = MSE$$
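A minimal sketch of this estimator, using a hypothetical array of residuals:

    import numpy as np

    residuals = np.array([0.06, -0.03, 0.09, -0.05, -0.07])  # hypothetical e_i
    n = len(residuals)
    sse = np.sum(residuals**2)   # SSE = sum of squared residuals
    mse = sse / (n - 2)          # MSE = SSE/(n-2) estimates sigma^2
    s = np.sqrt(mse)             # s estimates sigma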
Is X a good predictor of Y? This is equivalent to asking whether the slope of the line is significantly different from zero. [If not, we might as well use $\bar{Y}$ as a predictor.]
We can answer these questions using an ANOVA table:
$$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad \text{i.e., } SST = SSReg + SSE$$
df: (n − 1) = 1 + (n − 2)
3. Test statistic: $F = \dfrac{MSReg}{MSE} \sim F_{(df_1,\, df_2)}$
Analysis of Variance
Source DF SS MS F P
Regression 1 1.4350 1.4350 59.21 0.000
Residual Error 27 0.6544 0.0242
Total 28 2.0894
Decision?
Conclusion?
Decision: Reject Ho since the p-value < 0.0005 is less than any reasonable level of
significance.
Conclusion: The observed data indicate that the slope is significantly different from zero,
i.e., the observed data strongly indicate that the price of gas at the pump
depends on the price of crude oil.
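As a quick check, here is a minimal scipy sketch that reproduces the F statistic and its p-value; the numbers are copied from the Minitab ANOVA table above:

    from scipy import stats

    ms_reg, mse = 1.4350, 0.0242                 # MS values from the table
    f_cal = ms_reg / mse                         # about 59.3 (59.21 before rounding)
    p_value = stats.f.sf(f_cal, dfn=1, dfd=27)   # upper-tail area of F(1, 27)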
To carry out the test we use the first block of the Minitab output:
Calculated value of the test statistic: $T_{cal} = \dfrac{0.031629 - 0}{0.004111} = 7.69$
To find the p-value, go back and look at Ha.
We have a 2-sided alternative, Ha: β1 ≠ 0 and hence,
P-value $= 2\,P\!\left(T_{(n-2)} \ge |T_{cal}|\right) = 2\,P(T_{27} \ge 7.69) = 0$ (almost).
Note that the df of the t-distribution is df2 = dferror = df for error in ANOVA table.
The above p-value gives us the same decision and conclusion as the one we got from the
ANOVA table.
Compare the Tcal = 7.69 (above) with the Fcal = 59.21 in ANOVA table.
We have the following general relation between Tcal and Fcal in SLR (only):
$$(T_{cal})^2 = F_{cal}, \quad \text{and equivalently} \quad T_{cal} = \pm\sqrt{F_{cal}}.$$
So the p-value for the t-test is the same as the p-value for the F-test. Hence, in SLR the two significance tests for the slope, the F-test (using the ANOVA table) and the t-test, give the same results.
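A short scipy sketch confirming the T–F relation for this example, with values copied from the output above:

    from scipy import stats

    t_cal = 0.031629 / 0.004111                  # 7.69, from the coefficient block
    p_value = 2 * stats.t.sf(abs(t_cal), df=27)  # two-sided p-value, about 0
    # t_cal**2 is about 59.2, matching F_cal = 59.21 up to rounding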
Observe that the above conclusion does not tell us in what way β1 is different from zero.
We could use the t-test for testing one-sided alternatives about β1. However, these should be decided before looking at the data.
When the null hypothesis, Ho: β1 = 0 is rejected we conclude that X, the explanatory
variable, explains a “good” percentage of the variation in Y. Equivalently, X is a “good”
predictor of Y.
For the above example we had the following results from Minitab:
Also, since df2 = dferror = 27 in the ANOVA table, we use the table of the t-distribution and read the value of t for a 95% CI on the row with df = 27 as t = 2.052, which gives
ME = t × SE(Estimate) = (2.052)(0.004111) = 0.00844
As in previous chapters, we can use the CI to make a decision for the significance test: when zero is not in the CI, we reject Ho and conclude that the slope is significantly different from zero.
Actually, we can say more: since the CI for β1 in this example is (0.02, 0.04), both ends of the CI are positive, so we can conclude with 95% confidence that the slope of the true regression line is some number between 0.02 and 0.04.
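A minimal sketch of this CI computation (values copied from the output above):

    from scipy import stats

    b1, se_b1 = 0.031629, 0.004111      # slope estimate and its standard error
    t_star = stats.t.ppf(0.975, df=27)  # 2.052 for a 95% CI
    me = t_star * se_b1                 # margin of error, about 0.0084
    ci = (b1 - me, b1 + me)             # roughly (0.023, 0.040)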
$$\text{CI (Mean Response)} = \hat{y}\mid x^* \;\pm\; t_{(df,\,\alpha/2)} \cdot S\,\sqrt{\frac{1}{n} + \frac{(x^* - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}}$$

$$\text{PI (One New Response)} = \hat{y}\mid x^* \;\pm\; t_{(df,\,\alpha/2)} \cdot S\,\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}}$$
In both of the above formulas
S = standard deviation of the points around the regression line = $\sqrt{MSE}$
df = df2 = dferror
x* = a particular value of X for which we are making the prediction.
Both the CI and the PI are centered around $\hat{y}\mid x^* = a + bx^*$ = the prediction at X = x*.
The PI for a new response is always wider than the CI for the mean response at the same value of X = x*. (Why?)
The SEs, and hence the intervals, will be narrower when x* is closer to $\bar{X}$ = the mean of the sample of X’s, and wider when x* is far from $\bar{X}$. (Why? See the computational sketch below.)
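A minimal sketch of both intervals under the formulas above, using small hypothetical arrays x and y (not the actual oil-price data):

    import numpy as np
    from scipy import stats

    x = np.array([12.0, 18.0, 24.0, 30.0, 36.0])   # hypothetical X data
    y = np.array([0.85, 1.00, 1.20, 1.45, 1.60])   # hypothetical Y data
    n = len(x)

    b1 = np.corrcoef(x, y)[0, 1] * y.std(ddof=1) / x.std(ddof=1)
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    s = np.sqrt(np.sum(resid**2) / (n - 2))        # S = sqrt(MSE)

    x_star = 25.0                                  # prediction point x*
    y_hat = b0 + b1 * x_star                       # center of both intervals
    t_star = stats.t.ppf(0.975, df=n - 2)          # t for 95% intervals
    core = 1/n + (x_star - x.mean())**2 / np.sum((x - x.mean())**2)
    ci = (y_hat - t_star*s*np.sqrt(core), y_hat + t_star*s*np.sqrt(core))
    pi = (y_hat - t_star*s*np.sqrt(1 + core), y_hat + t_star*s*np.sqrt(1 + core))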
Look at Figure 11.19. Can you find the 95% confidence band for β1?
How about the prediction band?
Inferences about ρ:
Remember the estimator of the slope of the true regression line? It is $\hat{\beta}_1 = b_1 = r \cdot S_Y/S_X$.
We may rearrange the equation and write $\hat{\rho} = r = b_1 \cdot S_X/S_Y$. Hence any inference on β1 gives the same result for ρ. This is because the test statistic for testing Ho: ρ = 0 vs. Ha: ρ ≠ 0 is
$$T = r\sqrt{\frac{n-2}{1-r^2}} = \frac{\hat{\beta}_1}{se(\hat{\beta}_1)} \sim t_{(n-2)}$$
[See proof on page 589.]
Thus, a Test of Ho: ρ = 0 vs. Ha: ρ ≠ 0 gives the same results as Ho: β1 = 0 vs. Ha: β1 ≠ 0.
Similarly if a confidence interval for β1 contains zero, the confidence interval for ρ will also
contain zero.
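A quick sketch: the ANOVA table above lets us recover r and reproduce Tcal (scipy assumed available):

    import numpy as np
    from scipy import stats

    r = np.sqrt(1.4350 / 2.0894)   # r^2 = SSReg/SST from the ANOVA table
    n = 29                         # Total df = 28 = n - 1
    t_cal = r * np.sqrt((n - 2) / (1 - r**2))   # about 7.7, matching Tcal
    p_value = 2 * stats.t.sf(t_cal, df=n - 2)   # two-sided p-value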
We have seen that R² = (r)². It can also be defined and calculated from the following relation:
$$R^2 = \frac{SSReg}{SST} = \frac{\text{Variation in } Y \text{ explained by the regression}}{\text{Total variation in } Y}$$
R² is the percentage of reduction in prediction error we will see when the prediction equation is used instead of $\bar{y}$ = the sample mean of Y as the predicted value of Y.
Example: In the ANOVA table for the analysis of guessed ages we had the following output:
Source            DF      SS      MS       F      P
Regression         1    5030.0  5030.0   250.08  0.000
Residual (Error)   8     160.9    20.1
Total              9    5190.9
Then, $R^2 = \dfrac{SSReg}{SST} = \dfrac{5030.0}{5190.9} = 0.969 = 96.9\%$.
This is the same result we had from Minitab, as it should be. We may now interpret this as
follows:
The regression model yields a predicted value for Y that has 96.9% less error than we would
have if we used the sample mean of Y’s as a predicted value.
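A one-line check of this arithmetic, with the sums of squares copied from the ANOVA table:

    ss_reg, sse, ss_total = 5030.0, 160.9, 5190.9   # from the ANOVA table
    r_squared = ss_reg / ss_total                   # 0.969 = 96.9%
    check = 1 - sse / ss_total                      # equivalent form: 1 - SSE/SST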
Residual = vertical distance from an observed point to the predicted value at the same X
= observed y − predicted y
= $y - \hat{y}$
Hence, for example, for someone whose actual age is 35, the predicted value of his/her age is
36.88. This means the prediction was 1.88 years higher than the true age.
We (or computers) can make residual plots to see if there are any problems with the
assumptions.
The computer finds “standardized residuals” (a z-score for each observation). Any point that has a z-score bigger than 3 in absolute value, i.e., |z| > 3, is called an outlier.
If $x^* - \bar{X} = kS_X$,
that is, if the distance between a given value of X, say x*, and $\bar{X}$ (in absolute value) is k standard deviations,
then $\hat{y} - \bar{Y} = r k S_Y$,
that is, the distance (in absolute value) between the predicted value of Y ($\hat{y}$) at x* and $\bar{Y}$ is r·k standard deviations of Y.
Example:
Suppose Y = Height of children and X = heights of their fathers and the correlation between
the two variables is r = 0.5.
Then,
If a father’s height is k = 2 standard deviations above the mean height of all fathers,
then the predicted height of his child will be
r × k = 0.5 × 2 = 1 standard deviation
above the mean height of children.
If the father’s height is 1.5 standard deviations below the mean height of all fathers, then his child’s predicted height will be 0.5 × 1.5 = 0.75 standard deviations below the mean height of all children.
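A tiny helper capturing this regression-toward-the-mean rule (the function name is hypothetical):

    def predicted_deviation_in_sds(r, k):
        # Predicted deviation of y-hat from Ybar, in SDs of Y,
        # when x is k SDs of X away from Xbar.
        return r * k

    predicted_deviation_in_sds(0.5, 2.0)    # 1.0 SD above the mean
    predicted_deviation_in_sds(0.5, -1.5)   # -0.75, i.e., 0.75 SD below the mean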
Influential points (points far from the rest of the observations in the x-direction that do not follow the trend) may change the sign and value of the slope.
Residuals are the estimators of the error term (ε) in the regression model. Thus, the assumption of normality of ε can be checked by looking at a histogram of the residuals.
A histogram or a dot plot that shows outliers is indicative of the violation of the
assumption of normality.
A normal probability plot or normal quantile plot can also be used to check the normality assumption. Points that fall close to a straight line in a normal PP or QQ plot support the assumption of normality.
Plots of the residuals against the explanatory variable (X) magnify any problems with the assumptions.
If the residuals are randomly scattered around the line residuals = 0, this is good. It
means nothing else is left after using X to predict Y.
If the residual plot shows a curved pattern this indicates that a curvilinear fit
(quadratic?) will give better results.
If the residual plot is funnel shaped this means the assumption of constant variance is
violated.
If the residual plot shows an outlier, this may mean the violation of normality and/or
constant variance or show an influential point.
Model: $Y = \alpha + \beta x + \varepsilon$
Assumptions:
a) Random sample
b) Normal distribution
c) Constant variance
d) $\varepsilon \sim N(0, \sigma)$.
Interpretation of
Slope
Intercept
R²
r