$y \approx \hat{y} = \beta_0 + \beta_1 x$
• Model fit is characterized by r, which describes the strength and
direction of the relationship between x and y, and by $r^2$, which
describes the proportion of variance explained by the model
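A minimal sketch of how r and $r^2$ are computed for a simple linear regression. The data values and variable names below are made up for illustration and do not come from the SAT example.

```python
import math

# Hypothetical toy data, chosen only to illustrate the computation
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Sums of squares and cross-products around the means
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_yy = sum((yi - mean_y) ** 2 for yi in y)

r = s_xy / math.sqrt(s_xx * s_yy)   # strength and direction, in [-1, 1]
r_squared = r ** 2                  # proportion of variance explained

print(f"r = {r:.4f}, r^2 = {r_squared:.4f}")
```

The sign of r gives the direction of the relationship; squaring it discards the sign and leaves the share of the variance in y that the fitted line accounts for.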
An Example
How does educational expenditure affect student performance?
Our Model:
$y = \beta_0 + \beta_1 x$
Residuals:
Min 1Q Median 3Q Max
-145.074 -46.821 4.087 40.034 128.489
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1089.294 44.390 24.539 < 2e-16 ***
Expend -20.892 7.328 -2.851 0.00641 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Our Model:
$y = 1089.29 - 20.89\,x$
Residuals:
Min 1Q Median 3Q Max
-79.158 -27.364 3.308 19.876 66.080
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1053.3204 8.2112 128.28 <2e-16 ***
PctSAT -2.4801 0.1862 -13.32 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Our Model:
$y = 1053.32 - 2.48\,x$
2nd Approach: Let’s model the effects of per-student expenditure and
test-taking percentage on SAT scores
Our Model:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
Residuals:
Min 1Q Median 3Q Max
-88.400 -22.884 1.968 19.142 68.755
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 993.8317 21.8332 45.519 < 2e-16 ***
Expend 12.2865 4.2243 2.909 0.00553 **
PctSAT -2.8509 0.2151 -13.253 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Our Model:
$y = 993.83 + 12.29\,x_1 - 2.85\,x_2$
Predictions
Example: You learn that, in some other year, 50% of students take the
SAT in a state that spends $5k per pupil. What is the expected mean
SAT score for this state (assuming that the relationship holds across
years)?
$y = 993.83 + 12.29(5) - 2.85(50) = 912.78$
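The prediction can be checked with a few lines of Python. This is a sketch: `predict_sat` is a hypothetical helper name, and it assumes that Expend is measured in thousands of dollars per pupil (so $5k enters as 5) and that the coefficients are rounded to two decimals.

```python
# Coefficients from the two-predictor fit, rounded to two decimals
# (the R output reports 993.8317, 12.2865 and -2.8509).
INTERCEPT = 993.83
B_EXPEND = 12.29    # per $1k of per-pupil expenditure (assumed unit)
B_PCTSAT = -2.85    # per percentage point of students taking the SAT


def predict_sat(expend_k, pct_sat):
    """Expected mean SAT score for a state (hypothetical helper)."""
    return INTERCEPT + B_EXPEND * expend_k + B_PCTSAT * pct_sat


prediction = predict_sat(expend_k=5, pct_sat=50)
print(round(prediction, 2))
```

With the unrounded coefficients the same calculation gives roughly 912.72, so the 912.78 figure reflects rounding of the coefficients rather than a different model.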
Interpretation of Coefficients
• Again, each estimated coefficient represents the amount that y is
expected to increase when the value of the corresponding predictor
(explanatory variable) is increased by one, while holding constant
the values of all other predictors.
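To illustrate "holding constant", the sketch below changes one predictor by one unit while leaving the other fixed. It uses the coefficients from the two-predictor fit; the function name and the starting values (5, 50) are chosen only for illustration.

```python
def predict_sat(expend_k, pct_sat):
    """Fitted two-predictor model, using the coefficients from the R output."""
    return 993.8317 + 12.2865 * expend_k - 2.8509 * pct_sat


base = predict_sat(5, 50)

# Raise Expend by one unit, PctSAT unchanged
effect_expend = predict_sat(6, 50) - base

# Raise PctSAT by one unit, Expend unchanged
effect_pctsat = predict_sat(5, 51) - base

print(round(effect_expend, 4), round(effect_pctsat, 4))
```

Each difference equals the corresponding coefficient, whatever starting values are used, because the model is linear in its predictors.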
• CI’s for coefficients are computed in the same way as for simple
linear regression (and for t-distributed variables generally)
$CI_{1-\alpha}(\beta_i) = b_i \pm \hat{\sigma}_{b_i}\, t_{\alpha/2}$
• The number of degrees of freedom for the t distribution is (n-k-1),
where n is the number of data points (records) and k is the number
of predictors (explanatory variables) in the model.
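A sketch of the interval calculation for the Expend coefficient in the two-predictor fit. The slides do not state the sample size, so n = 50 is an assumption here, and the critical value 2.012 is an approximate table value for 47 degrees of freedom rather than something computed by the code. `coef_ci` is a hypothetical helper name.

```python
def coef_ci(b, se, t_crit):
    """Return (lower, upper) for b +/- t_crit * se."""
    half_width = t_crit * se
    return b - half_width, b + half_width


# Estimate and standard error from the Expend row of the R output
b_expend, se_expend = 12.2865, 4.2243

# Assumption: n = 50 records and k = 2 predictors
n, k = 50, 2
df = n - k - 1          # degrees of freedom for the t distribution

# Assumption: approximate t_{0.025} quantile for 47 df, from a t table
T_CRIT_95 = 2.012

lower, upper = coef_ci(b_expend, se_expend, T_CRIT_95)
print(f"df = {df}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Under these assumptions the interval excludes zero, which agrees with the p-value of 0.00553 reported for Expend.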
$F = \frac{R^2\,(n-k-1)}{(1-R^2)\,k}$
• Which will be distributed approximately as an F(k,n-k-1) statistic
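The F statistic is a one-line calculation once $R^2$, n and k are known. The values passed in below are hypothetical, since the slides do not report $R^2$ or n for the SAT models.

```python
def f_statistic(r_squared, n, k):
    """Overall F statistic from R^2, n data points and k predictors."""
    return (r_squared * (n - k - 1)) / ((1.0 - r_squared) * k)


# Hypothetical inputs, for illustration only
f_value = f_statistic(r_squared=0.8, n=50, k=2)
print(round(f_value, 2))
```

The result would then be compared against an F(k, n-k-1) distribution, here F(2, 47), to obtain a p-value.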
Avoiding (Multi-)Collinearity
• When predictors are highly correlated, standard errors become
inflated
• Conceptual example:
– Suppose that two variables z and x are exactly the same.
– Suppose the population regression line of y is
$y = 10 + 5x$
– If you fit a regression using sample data of y on both x and z, you wind
up fitting
$y = 10 + \beta_1 x + \beta_2 z$
– You can see that any value will work for the two coefficients, as long as
they add up to 5. Equivalently, this means that the standard errors for
the coefficients are huge.
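The conceptual example can be made concrete with a small sketch. The x values are made up; the point is only that when z duplicates x, every coefficient pair summing to 5 gives the same fitted values, and the cross-product matrix of the two predictors is singular, so least squares has no unique solution.

```python
import math

# Hypothetical predictor values; z is an exact copy of x
x = [1.0, 2.0, 3.0, 4.0]
z = list(x)

# Values on the population line y = 10 + 5x
target = [10 + 5 * xi for xi in x]


def predict(b1, b2):
    """Fitted values for y = 10 + b1*x + b2*z."""
    return [10 + b1 * xi + b2 * zi for xi, zi in zip(x, z)]


# Very different coefficient pairs, all with b1 + b2 = 5
candidates = [(5.0, 0.0), (0.0, 5.0), (2.5, 2.5), (100.0, -95.0)]
all_match = all(
    all(math.isclose(p, t) for p, t in zip(predict(b1, b2), target))
    for b1, b2 in candidates
)

# Determinant of the 2x2 cross-product matrix of (x, z)
s_xx = sum(xi * xi for xi in x)
s_zz = sum(zi * zi for zi in z)
s_xz = sum(xi * zi for xi, zi in zip(x, z))
determinant = s_xx * s_zz - s_xz ** 2

print(all_match, determinant)
```

With predictors that are highly but not perfectly correlated the determinant is small rather than zero, which is what inflates the standard errors.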
Residuals:
Min 1Q Median 3Q Max
-87.040 -14.739 -5.112 20.255 72.428
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1038.7458 48.6843 21.336 <2e-16 ***
Expend 7.9098 4.5649 1.733 0.0900 .
PctSAT -3.0762 0.2361 -13.030 <2e-16 ***
PTratio -1.0618 2.1860 -0.486 0.6295 ← Remove and refit
is.northeast 33.8557 16.5452 2.046 0.0466 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residuals:
Min 1Q Median 3Q Max
-84.833 -18.528 -4.838 20.309 74.865
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1018.1301 23.6529 43.045 <2e-16 ***
Expend 8.3857 4.4214 1.897 0.0642 . ← Remove and refit
PctSAT -3.0888 0.2327 -13.273 <2e-16 ***
is.northeast 35.5920 16.0197 2.222 0.0313 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
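The "remove and refit" steps above follow a simple rule that can be sketched in code: drop the predictor with the largest p-value, as long as that p-value exceeds a threshold. The threshold of 0.05 and the helper name `next_to_drop` are assumptions; the slides show the removals but do not state the cutoff. Refitting itself is not shown, so the p-values are copied from the two R outputs.

```python
def next_to_drop(p_values, alpha=0.05):
    """Name of the least significant predictor above alpha, or None."""
    candidates = {name: p for name, p in p_values.items() if p > alpha}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# p-values from the four-predictor fit ("<2e-16" entered as 2e-16)
first_fit = {
    "Expend": 0.0900,
    "PctSAT": 2e-16,
    "PTratio": 0.6295,
    "is.northeast": 0.0466,
}

# p-values after removing PTratio and refitting
second_fit = {
    "Expend": 0.0642,
    "PctSAT": 2e-16,
    "is.northeast": 0.0313,
}

print(next_to_drop(first_fit), next_to_drop(second_fit))
```

Note that the p-values change after each refit, so the rule has to be applied to the new output each time rather than to the original table.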
Adjusted $R^2$
• $R^2$ means much the same thing in multiple linear regression (MLR) as it
does in simple linear regression: the percentage of variance explained by
the model.
$R^2 = 1 - \frac{SS_{y-\hat{y}}}{SS_{total}}$

$R^2_{adj} = 1 - \frac{SS_{y-\hat{y}}}{SS_{total}} \cdot \frac{n-1}{n-k-1}$
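Both quantities can be computed directly from the two sums of squares. The numbers below are hypothetical, since the slides do not report the sums of squares or n for the SAT models, and the function names are chosen for illustration.

```python
def r_squared(ss_resid, ss_total):
    """R^2 = 1 - SS(y - yhat) / SS(total)."""
    return 1.0 - ss_resid / ss_total


def adjusted_r_squared(ss_resid, ss_total, n, k):
    """Adjusted R^2, penalizing for k predictors with n data points."""
    return 1.0 - (ss_resid / ss_total) * (n - 1) / (n - k - 1)


# Hypothetical values, for illustration only
ss_resid, ss_total = 20.0, 100.0
n, k = 50, 2

r2 = r_squared(ss_resid, ss_total)
r2_adj = adjusted_r_squared(ss_resid, ss_total, n, k)
print(f"R^2 = {r2:.4f}, adjusted R^2 = {r2_adj:.4f}")
```

Because (n-1)/(n-k-1) is greater than 1 whenever k is at least 1, the adjusted value is always below the unadjusted one, and the gap widens as predictors are added.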