
Multiple Regression: EDA and model specification

Recap: simple linear regression


Two variables
OLS estimation
Minimize the residual sum of squares
All coefficients are calculated through this procedure

Assessing the resulting regression model
R-squared for the entire model; the F test assesses its significance.
t test for individual coefficients.

Multiple Regression
Same concepts as bivariate regression

Interpretations
R-squared: (1) the proportion of variability in y explained by X; (2) the squared maximal correlation between y and a weighted combination of X
Coefficients (b_p): the expected change in y for a unit increase in x_p, holding the other independent variables constant

MULTIPLE REGRESSION WITH TWO EXPLANATORY VARIABLES: EXAMPLE

$$Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + u_i$$
$$\hat{Y}_i = b_1 + b_2 X_{2i} + b_3 X_{3i}$$

The regression coefficients are derived using the same least squares principle used in simple regression
analysis. The fitted value of Y in observation i depends on our choice of b1, b2, and b3.
$$e_i = Y_i - \hat{Y}_i = Y_i - b_1 - b_2 X_{2i} - b_3 X_{3i}$$

The residual ei in observation i is the difference between the actual and fitted values of Y.

$$RSS = \sum e_i^2 = \sum (Y_i - b_1 - b_2 X_{2i} - b_3 X_{3i})^2$$

We define RSS, the sum of the squares of the residuals, and choose b1, b2, and b3 so as to minimize it.

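$$RSS = \sum e_i^2 = \sum (Y_i - b_1 - b_2 X_{2i} - b_3 X_{3i})^2$$
$$= \sum (Y_i^2 + b_1^2 + b_2^2 X_{2i}^2 + b_3^2 X_{3i}^2 - 2 b_1 Y_i - 2 b_2 X_{2i} Y_i - 2 b_3 X_{3i} Y_i + 2 b_1 b_2 X_{2i} + 2 b_1 b_3 X_{3i} + 2 b_2 b_3 X_{2i} X_{3i})$$
$$= \sum Y_i^2 + n b_1^2 + b_2^2 \sum X_{2i}^2 + b_3^2 \sum X_{3i}^2 - 2 b_1 \sum Y_i - 2 b_2 \sum X_{2i} Y_i - 2 b_3 \sum X_{3i} Y_i + 2 b_1 b_2 \sum X_{2i} + 2 b_1 b_3 \sum X_{3i} + 2 b_2 b_3 \sum X_{2i} X_{3i}$$

$$\frac{\partial RSS}{\partial b_1} = 0 \qquad \frac{\partial RSS}{\partial b_2} = 0 \qquad \frac{\partial RSS}{\partial b_3} = 0$$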

First we expand RSS as shown, and then we use the first order conditions for minimizing it.

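$$b_1 = \bar{Y} - b_2 \bar{X}_2 - b_3 \bar{X}_3$$

$$b_2 = \frac{\sum (X_{2i} - \bar{X}_2)(Y_i - \bar{Y}) \sum (X_{3i} - \bar{X}_3)^2 - \sum (X_{3i} - \bar{X}_3)(Y_i - \bar{Y}) \sum (X_{2i} - \bar{X}_2)(X_{3i} - \bar{X}_3)}{\sum (X_{2i} - \bar{X}_2)^2 \sum (X_{3i} - \bar{X}_3)^2 - \left( \sum (X_{2i} - \bar{X}_2)(X_{3i} - \bar{X}_3) \right)^2}$$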

We thus obtain three equations in three unknowns. Solving for b1, b2, and b3, we obtain the expressions
shown above. (The expression for b3 is the same as that for b2, with the subscripts 2 and 3 interchanged
everywhere.)

The expression for b1 is a straightforward extension of the expression for it in simple regression analysis.


However, the expressions for the slope coefficients are considerably more complex than that for the slope
coefficient in simple regression analysis.

For the general case when there are many explanatory variables, ordinary algebra is inadequate. It is
necessary to switch to matrix algebra.
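
As a minimal sketch of the matrix form, the normal equations (X'X) b = X'y can be solved directly in R and checked against lm(). The data are simulated purely for illustration; the names x2, x3, y, and X are assumptions and not part of the original example.

```r
# Simulated data (illustrative only): two explanatory variables
set.seed(1)
n  <- 50
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 1 + 2 * x2 - 3 * x3 + rnorm(n)

# Design matrix with a column of ones for the intercept b1
X <- cbind(1, x2, x3)

# Solve the normal equations (X'X) b = X'y
b <- solve(crossprod(X), crossprod(X, y))

# lm() minimizes the same RSS, so the two sets of coefficients should agree
fit <- lm(y ~ x2 + x3)
print(cbind(normal_eq = drop(b), lm = coef(fit)))
```

The same two lines (build X, solve the normal equations) work unchanged for any number of explanatory variables, which is why the matrix form is preferred over the scalar expressions.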

Interpretation

y: house price
x1: lot size
x2: number of bedrooms
Holding the number of bedrooms constant, each additional square foot of lot size adds an average of $20 to the price

Assumptions of OLS
The relationship between y and x is linear: there is an equation, $y = \beta_0 + \beta_1 x + \varepsilon$, that constitutes the population model.
The errors have mean zero and constant variance; that is, $E[\varepsilon] = 0$ and $V[\varepsilon] = \sigma^2$. The spread of the errors about the regression line does not vary with x; that is, $V[\varepsilon \mid x] = \sigma^2$.
The errors are independent; the value of one error is not affected by the value of another error.
For each value of x, the errors have a normal distribution about the regression line with mean 0 and variance $\sigma^2$. This normal distribution is centered on the regression line. This assumption may be written as $\varepsilon \sim N(0, \sigma^2)$.

Exploratory Analysis
Is the dependent variable normally distributed?
Are there outliers in the dependent variable?
Is the relationship between the dependent variable and each explanatory variable linear?
Are pairs of explanatory variables independent of each other?
Are the residuals IID (independent and identically distributed)?

Some simple diagnostics


Normality of the regression residual
Q-Q plot

Homoscedasticity
Scatter plot (predicted value on horizontal, residual on
vertical axis)

Linearity
Scatter plot

Independence of the errors
Moran coefficient (MC), Moran scatter plot

QQ plot for examining normality

Shape of Histogram

Note that it is the tail that characterizes the skewness.

Lepto- means slim in Greek
Platy- means fat (broad) in Greek
Meso- means middle (in between) in Greek

Shape of Histogram: skewness and kurtosis

$$\text{skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n s^3}$$

$$\text{kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n s^4}$$

Skewness measures the degree of asymmetry exhibited by the data.


When the long tail is on the right, so that more of the data lie below the mean, skewness is positive.
When the long tail is on the left, it is negative. When skewness is 0, the histogram is
symmetric about the mean.
Kurtosis measures how peaked the histogram is as compared to the
normal distribution.
Leptokurtic: Kurtosis > 3, data with high degree of peakedness
Platykurtic: Kurtosis < 3, indicating a flat histogram
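
The two shape measures can be sketched in a few lines of R. This assumes that s in the formulas is the sample standard deviation returned by sd(); the function names skewness and kurtosis are defined here for illustration and are not base R functions. The data are the 30 values from the stem-and-leaf slide of this deck.

```r
# Moment-based shape measures: sum((x - mean)^k) / (n * s^k)
# Assumption: s is the sample standard deviation, sd(x)
skewness <- function(x) {
  n <- length(x)
  sum((x - mean(x))^3) / (n * sd(x)^3)
}
kurtosis <- function(x) {
  n <- length(x)
  sum((x - mean(x))^4) / (n * sd(x)^4)
}

# The 30 values from the stem-and-leaf example
x <- c(5, 12, 14, 21, 22, 36, 21, 6, 77, 12, 21, 16, 10, 5, 11,
       42, 31, 31, 26, 24, 11, 19, 9, 44, 21, 17, 26, 21, 24, 23)

skewness(x)  # positive here: the long tail is on the right (the value 77)
kurtosis(x)  # compare with 3, the value for a normal distribution
```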

Alternative to Histogram: Stem-and-Leaf Plot

5, 12, 14, 21, 22, 36, 21, 6, 77, 12, 21, 16, 10, 5, 11,
42, 31, 31, 26, 24, 11, 19, 9, 44, 21, 17, 26, 21, 24, 23

Stem: leading digits
Leaf: trailing digits

  0 | 5569
  1 | 011224679
  2 | 11111234466
  3 | 116
  4 | 24
  5 |
  6 |
  7 | 7

The decimal point is 1 digit(s) to the right of the |

Variable transformation


Outliers: Boxplot
5, 5, 6, 9, 10, 11, 11, 12, 12, 14, 16, 17, 19, 21, 21, 21, 21, 21, 22, 23, 24, 24, 26, 26, 31, 31, 36, 42, 44, 77

Maximum
1.5 IQR above the upper hinge
Whiskers
Hinges
3rd quartile (75th percentile)
2nd quartile (median)
1st quartile (25th percentile)
Minimum (within 1.5 IQR)

IQR = 3rd quartile - 1st quartile
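
The boxplot quantities above can be checked numerically with base R's boxplot.stats(), which uses the hinges and the 1.5 IQR rule by default. This is a small sketch on the same 30 values; the object names x, bs, and iqr are chosen here for illustration.

```r
x <- c(5, 12, 14, 21, 22, 36, 21, 6, 77, 12, 21, 16, 10, 5, 11,
       42, 31, 31, 26, 24, 11, 19, 9, 44, 21, 17, 26, 21, 24, 23)

# coef = 1.5 is the default: whiskers extend at most 1.5 IQR beyond the hinges
bs <- boxplot.stats(x, coef = 1.5)

bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
          # -> 5 12 21 26 44
bs$out    # observations beyond the whiskers -> 77

# IQR from the hinges: 26 - 12 = 14, so the upper fence is 26 + 1.5 * 14 = 47
iqr <- bs$stats[4] - bs$stats[2]
iqr
```

Only the value 77 lies beyond the upper fence, so it is the single point drawn as an outlier.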

Linking and brushing

Bivariate relations: Scatterplots

Correlation matrix

Linear relationship between DV and IV

In regression analysis, we use EDA for model specification and diagnosis

GEODA demo

Variable selection

Police Expenditure Analysis

Initial run

> names(police)
 [1] "AREA"       "PERIMETER"  "CNTY_"      "CNTY_ID"    "NAME"       "STATE_NAME"
 [7] "STATE_FIPS" "CNTY_FIPS"  "FIPS"       "FIPSNO"     "POLICE"     "POP"
[13] "TAX"        "TRANSFER"   "INC"        "CRIME"      "UNEMP"      "OWN"
[19] "COLLEGE"    "WHITE"      "COMMUTE"
> police_full <- lm(police$POLICE ~ police$INC + police$TAX + police$CRIME + police$UNEMP + police$OWN + police$COLLEGE + police$WHITE + police$COMMUTE)
> summary(police_full)
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)    -3681.2602  1643.0551  -2.240 0.028102 *
police$INC         0.5707     0.1552   3.678 0.000447 ***
police$TAX        -0.1777     1.5500  -0.115 0.909063
police$CRIME       2.5243     0.5264   4.796 8.33e-06 ***
police$UNEMP     -47.6656    56.5926  -0.842 0.402394
police$OWN         1.5789    18.5719   0.085 0.932483
police$COLLEGE    39.7642    12.2447   3.247 0.001760 **
police$WHITE     -12.3306     8.3782  -1.472 0.145386
police$COMMUTE    -2.1177     9.8309  -0.215 0.830050
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 896 on 73 degrees of freedom
Multiple R-squared: 0.6797, Adjusted R-squared: 0.6445
F-statistic: 19.36 on 8 and 73 DF, p-value: 2.796e-15

Diagnostics of multicollinearity
> colldiag(police_full)
Condition
Index      Variance Decomposition Proportions
           intercept police$INC police$TAX police$CRIME police$UNEMP police$OWN police$COLLEGE police$WHITE police$COMMUTE
1   1.000  0.000     0.000      0.002      0.004        0.001        0.000      0.001          0.001        0.002
2   3.625  0.000     0.000      0.011      0.471        0.000        0.000      0.002          0.001        0.071
3   5.423  0.000     0.001      0.071      0.408        0.000        0.000      0.039          0.000        0.325
4   6.807  0.000     0.002      0.422      0.035        0.035        0.001      0.011          0.055        0.057
5   7.785  0.001     0.000      0.156      0.001        0.192        0.001      0.064          0.002        0.231
6   9.589  0.000     0.000      0.302      0.003        0.023        0.002      0.456          0.082        0.063
7  22.199  0.031     0.104      0.026      0.010        0.425        0.045      0.148          0.756        0.008
8  34.797  0.007     0.601      0.006      0.058        0.076        0.416      0.265          0.028        0.235
9  55.595  0.960     0.293      0.005      0.010        0.248        0.535      0.014          0.074        0.008

> vif(police_full)
    police$INC     police$TAX   police$CRIME   police$UNEMP     police$OWN police$COLLEGE   police$WHITE police$COMMUTE
      2.561469       1.344747       1.302452       1.865568       2.225621       1.999341       2.360316       1.491455

> cor(police[,c(11,13:20)])
              POLICE         TAX    TRANSFER        INC       CRIME
POLICE    1.00000000  0.36371791  0.96694077  0.6691571  0.57533772
TAX       0.36371791  1.00000000  0.32001111  0.2891814  0.33857309
TRANSFER  0.96694077  0.32001111  1.00000000  0.6381079  0.43073136
INC       0.66915713  0.28918142  0.63810786  1.0000000  0.32158124
CRIME     0.57533772  0.33857309  0.43073136  0.3215812  1.00000000
UNEMP    -0.20729823  0.08596563 -0.21383170 -0.4679109  0.09645134
OWN      -0.27590473 -0.25053979 -0.24399134 -0.0610233 -0.23300136
COLLEGE   0.66062780  0.38021943  0.65734105  0.5998025  0.29997201
WHITE     0.04537959 -0.17779552  0.03903367  0.3905673 -0.03203264
               UNEMP        OWN     COLLEGE       WHITE
POLICE   -0.20729823 -0.2759047  0.66062780  0.04537959
TAX       0.08596563 -0.2505398  0.38021943 -0.17779552
TRANSFER -0.21383170 -0.2439913  0.65734105  0.03903367
INC      -0.46791092 -0.0610233  0.59980246  0.39056734
CRIME     0.09645134 -0.2330014  0.29997201 -0.03203264
UNEMP     1.00000000 -0.2846407 -0.22604117 -0.59108844
OWN      -0.28464074  1.0000000 -0.36191204  0.54571601
COLLEGE  -0.22604117 -0.3619120  1.00000000  0.01510858
WHITE    -0.59108844  0.5457160  0.01510858  1.00000000
pairs(police[,c(11,13:20)], pch=1, cex=0.2)

[Figure: scatterplot matrix; panel labels POLICE, TAX, TRANSFER, INC, CRIME, UNEMP, OWN, COLLEGE, WHITE, COMMUTE]

Diagnostics: outliers, normality, heteroskedasticity

>plot(police_out, which=1:4)
[Figure: four diagnostic plots for lm(police$POLICE ~ police$INC + police$TAX + police$CRIME + police$UNEMP + ...): Residuals vs Fitted, Normal Q-Q, Scale-Location, and Cook's distance. Observations 32, 52, and 81 are labeled in the panels; observation 56 is also labeled in one panel.]

Relevant R functions
cor(police[,c(11,13:20)]) #correlation matrix
colldiag(OLS_output_file) #condition index (numbers), from package perturb
vif(OLS_output_file) #variance inflation factor, from package car
pairs(police[,c(11,13:20)], pch=1, cex=0.2) #scatterplot matrix
plot(OLS_output_file, which=1:4) #four plots for diagnostics

Concepts to review

Condition number: for assessing multicollinearity (threshold: 30)
Variance inflation factor: for assessing multicollinearity (threshold: 5)
Q-Q plot (quantile plot): for visual assessment of normality
Cook's distance: for identifying outliers
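
To make the variance inflation factor concrete, here is a minimal base-R sketch that computes it from its definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. The data are simulated and the names x1, x2, x3, dat, and vif_by_hand are assumptions made for illustration; this is not how the car package implements vif().

```r
# Simulated predictors (illustrative only); x2 is built to be collinear with x1
set.seed(2)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
dat <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2): regress each predictor on all the others
vif_by_hand <- sapply(names(dat), function(v) {
  others <- setdiff(names(dat), v)
  r2 <- summary(lm(reformulate(others, response = v), data = dat))$r.squared
  1 / (1 - r2)
})

print(round(vif_by_hand, 2))
```

In this setup x1 and x2 should show clearly inflated values, while x3, which is unrelated to the others, should stay close to 1.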

Use dummy variable to filter impact from outliers (#52, #32, #81)
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES ESTIMATION
Data set            : police
Dependent Variable  : POLICE       Number of Observations: 82
Mean dependent var  : 927.768      Number of Variables   : 10
S.D. dependent var  : 1493.61      Degrees of Freedom    : 72

R-squared           : 0.796163     F-statistic           : 31.247
Adjusted R-squared  : 0.770683     Prob(F-statistic)     : 1.868e-21
Sum squared residual: 3.72884e+07  Log likelihood        : -650.479
Sigma-square        : 517894       Akaike info criterion : 1320.96
S.E. of regression  : 719.649      Schwarz criterion     : 1345.03
Sigma-square ML     : 454736
S.E of regression ML: 674.341

-----------------------------------------------------------------------
Variable    Coefficient   Std.Error    t-Statistic   Probability
-----------------------------------------------------------------------
CONSTANT    -3155.889     1322.244     -2.386767     0.0196264
TAX          0.4208701    1.248475      0.3371073    0.7370149
INC          0.6768542    0.1257428     5.382846     0.0000009
CRIME        2.431794     0.4230122     5.748756     0.0000002
UNEMP       -59.4707      45.49251     -1.307264     0.1952825
OWN          3.643113     14.92046      0.2441689    0.8077919
COLLEGE     -5.767071     12.12848     -0.4754983    0.6358709
WHITE       -16.7632      6.764766     -2.478017     0.0155575
COMMUTE     -1.972053     7.89626      -0.2497452    0.8034976
DUMMY        3644.325     568.0769      6.415197     0.0000000
-----------------------------------------------------------------------

Create a dummy (indicator) variable in GeoDA

Add a new field (integer)


Assign 0 to the new field
Assign 1 to the outliers
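
For readers working in R rather than GeoDa, the same indicator can be sketched as follows. The data frame police_demo is a hypothetical 82-row stand-in for the county table (the real data are not reproduced here); the observation numbers 32, 52, and 81 are the ones flagged in the diagnostics above.

```r
# Hypothetical stand-in for the 82-county table used in the example
police_demo <- data.frame(id = 1:82)

# Observation numbers flagged as outliers in the diagnostic plots
outliers <- c(32, 52, 81)

# Integer field: 0 everywhere, 1 for the flagged rows
police_demo$DUMMY <- as.integer(seq_len(nrow(police_demo)) %in% outliers)

table(police_demo$DUMMY)
```

Building the field in one step from a logical test avoids leaving NA in the rows that are never assigned.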

Eliminate variables with large p


SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES ESTIMATION
Data set            : police
Dependent Variable  : POLICE       Number of Observations: 82
Mean dependent var  : 927.768      Number of Variables   : 5
S.D. dependent var  : 1493.61      Degrees of Freedom    : 77

R-squared           : 0.790559     F-statistic           : 72.6614
Adjusted R-squared  : 0.779679     Prob(F-statistic)     : 2.28191e-25
Sum squared residual: 3.83134e+07  Log likelihood        : -651.591
Sigma-square        : 497577       Akaike info criterion : 1313.18
S.E. of regression  : 705.391      Schwarz criterion     : 1325.22
Sigma-square ML     : 467237
S.E of regression ML: 683.547

-----------------------------------------------------------------------
Variable    Coefficient   Std.Error    t-Statistic   Probability
-----------------------------------------------------------------------
CONSTANT    -3962.951     588.2222     -6.737166     0.0000000
INC          0.7053126    0.09277907    7.602066     0.0000000
CRIME        2.312218     0.3912496     5.909829     0.0000001
WHITE       -12.32904     4.757191     -2.591664     0.0114225
DUMMY        3484.013     441.0964      7.898529     0.0000000
-----------------------------------------------------------------------

REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER   21.087093

Backward elimination
Use Akaike Information Criterion (AIC) for model selection
The smaller the AIC, the better
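
The AIC values printed by step() differ from the "Akaike info criterion" in the GeoDa output by an additive constant, which does not affect the ranking of models fitted to the same data. The arithmetic can be sketched as below, using the figures reported in this example for the final four-variable model (82 observations, RSS 38313426, 5 coefficients). The variable names are chosen here for illustration.

```r
n   <- 82         # number of observations
rss <- 38313426   # residual sum of squares of the final model
k   <- 5          # estimated coefficients, including the intercept

# Form used by step() for linear models: n * log(RSS / n) + 2k
aic_step <- n * log(rss / n) + 2 * k

# Adding the constant n * (1 + log(2 * pi)) gives -2 * logLik + 2k,
# which matches the criterion in the GeoDa summary
aic_full <- aic_step + n * (1 + log(2 * pi))

round(c(step = aic_step, full = aic_full), 2)  # 1080.48 and 1313.18
```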

> police2_bw <- step(lm(police2$POLICE ~ police2$INC + police2$TAX + police2$CRIME + police2$UNEMP + police2$OWN + police2$COLLEGE + police2$WHITE + police2$COMMUTE + police2$DUMMY), direction = "backward")

Step: AIC=1080.48
police2$POLICE ~ police2$INC + police2$CRIME + police2$WHITE +
police2$DUMMY
                 Df Sum of Sq      RSS    AIC
<none>                        38313426 1080.5
- police2$WHITE   1   3342086 41655513 1085.3
- police2$CRIME   1  17378415 55691841 1109.2
- police2$INC     1  28755675 67069102 1124.4
- police2$DUMMY   1  31042215 69355641 1127.1

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)   -3.963e+03  5.882e+02  -6.737 2.60e-09 ***
police2$INC    7.053e-01  9.278e-02   7.602 5.92e-11 ***
police2$CRIME  2.312e+00  3.912e-01   5.910 8.81e-08 ***
police2$WHITE -1.233e+01  4.757e+00  -2.592   0.0114 *
police2$DUMMY  3.484e+03  4.411e+02   7.899 1.59e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 705.4 on 77 degrees of freedom
Multiple R-squared: 0.7906, Adjusted R-squared: 0.7797
F-statistic: 72.66 on 4 and 77 DF, p-value: < 2.2e-16

Start: AIC=1088.25
police2$POLICE ~ police2$INC + police2$TAX + police2$CRIME +
police2$UNEMP + police2$OWN + police2$COLLEGE + police2$WHITE +
police2$COMMUTE + police2$DUMMY
                   Df Sum of Sq      RSS    AIC
- police2$OWN       1     30876 37319263 1086.3
- police2$COMMUTE   1     32302 37320689 1086.3
- police2$TAX       1     58854 37347241 1086.4
- police2$COLLEGE   1    117095 37405482 1086.5
- police2$UNEMP     1    885049 38173436 1088.2
<none>                          37288387 1088.2
- police2$WHITE     1   3180164 40468551 1093.0
- police2$INC       1  15006004 52294391 1114.0
- police2$CRIME     1  17115469 54403856 1117.2
- police2$DUMMY     1  21313811 58602197 1123.3

Step: AIC=1086.32 (if OWN eliminated)
police2$POLICE ~ police2$INC + police2$TAX + police2$CRIME +
police2$UNEMP + police2$COLLEGE + police2$WHITE + police2$COMMUTE +
police2$DUMMY
                   Df Sum of Sq      RSS    AIC
- police2$COMMUTE   1     13542 37332805 1084.3
- police2$TAX       1     62287 37381549 1084.5
- police2$COLLEGE   1    154982 37474245 1084.7
- police2$UNEMP     1    902951 38222214 1086.3
<none>                          37319263 1086.3
- police2$WHITE     1   3720805 41040068 1092.1
- police2$INC       1  15062883 52382146 1112.1
- police2$CRIME     1  17143866 54463129 1115.3
- police2$DUMMY     1  21288736 58607999 1121.3

Step: AIC=1084.35
police2$POLICE ~ police2$INC + police2$TAX + police2$CRIME +
police2$UNEMP + police2$COLLEGE + police2$WHITE + police2$DUMMY
                   Df Sum of Sq      RSS    AIC
- police2$TAX       1     62140 37394944 1082.5
- police2$COLLEGE   1    152099 37484904 1082.7
- police2$UNEMP     1    892960 38225764 1084.3
<none>                          37332805 1084.3
- police2$WHITE     1   3884963 41217768 1090.5
- police2$INC       1  16632077 53964882 1112.6
- police2$CRIME     1  17219408 54552212 1113.5
- police2$DUMMY     1  21306678 58639482 1119.4

Step: AIC=1082.49
police2$POLICE ~ police2$INC + police2$CRIME + police2$UNEMP +
police2$COLLEGE + police2$WHITE + police2$DUMMY
                   Df Sum of Sq      RSS    AIC
- police2$COLLEGE   1    121592 37516536 1080.8
- police2$UNEMP     1    857660 38252604 1082.3
<none>                          37394944 1082.5
- police2$WHITE     1   4167247 41562191 1089.2
- police2$INC       1  17415537 54810481 1111.8
- police2$CRIME     1  18275653 55670597 1113.1
- police2$DUMMY     1  21254605 58649549 1117.4

Step: AIC=1080.75
police2$POLICE ~ police2$INC + police2$CRIME + police2$UNEMP +
police2$WHITE + police2$DUMMY
                 Df Sum of Sq      RSS    AIC
- police2$UNEMP   1    796891 38313426 1080.5
<none>                        37516536 1080.8
- police2$WHITE   1   4130655 41647191 1087.3
- police2$CRIME   1  18154115 55670651 1111.1
- police2$INC     1  21931399 59447935 1116.5
- police2$DUMMY   1  30589229 68105765 1127.7

Step: AIC=1080.48
police2$POLICE ~ police2$INC + police2$CRIME + police2$WHITE +
police2$DUMMY
                 Df Sum of Sq      RSS    AIC
<none>                        38313426 1080.5
- police2$WHITE   1   3342086 41655513 1085.3
- police2$CRIME   1  17378415 55691841 1109.2
- police2$INC     1  28755675 67069102 1124.4
- police2$DUMMY   1  31042215 69355641 1127.1

Forward selection

The opposite of backward elimination
Still use AIC as the selection criterion
Start with the intercept-only model (regress on the vector 1), which has the largest AIC
At each step, add the variable that results in the smallest AIC
Stop when no further addition lowers the AIC

> step(lm(DV ~ 1), direction = "forward", scope = ~ IV1 + IV2 + ... + IVk)

> police2_fw <- step(lm(police2$POLICE ~ 1), direction = "forward",
scope = ~ police2$INC + police2$TAX + police2$CRIME + police2$UNEMP + police2$OWN + police2$COLLEGE + police2$WHITE + police2$COMMUTE + police2$DUMMY)

Start: AIC=1200.67
police2$POLICE ~ 1
                   Df Sum of Sq       RSS    AIC
+ police2$INC       1  81911671 101020281 1154.0
+ police2$COLLEGE   1  79836825 103095128 1155.6
+ police2$DUMMY     1  76851690 106080263 1158.0
+ police2$CRIME     1  60552944 122379008 1169.7
+ police2$TAX       1  24200200 158731753 1191.0
+ police2$COMMUTE   1  19595871 163336082 1193.4
+ police2$OWN       1  13925406 169006546 1196.2
+ police2$UNEMP     1   7861053 175070899 1199.1
<none>                          182931953 1200.7
+ police2$WHITE     1    376713 182555239 1202.5

Step: AIC=1153.98
police2$POLICE ~ police2$INC
                   Df Sum of Sq       RSS    AIC
+ police2$DUMMY     1  38628598  62391683 1116.5
+ police2$CRIME     1  26464464  74555817 1131.1
+ police2$COLLEGE   1  19206089  81814192 1138.7
+ police2$OWN       1  10146268  90874014 1147.3
+ police2$WHITE     1  10068483  90951799 1147.4
+ police2$TAX       1   5783456  95236825 1151.1
+ police2$UNEMP     1   2622045  98398237 1153.8
<none>                          101020281 1154.0
+ police2$COMMUTE   1   1802737  99217544 1154.5

Step: AIC=1116.46
police2$POLICE ~ police2$INC + police2$DUMMY
                   Df Sum of Sq      RSS    AIC
+ police2$CRIME     1  20736170 41655513 1085.3
+ police2$WHITE     1   6699842 55691841 1109.2
+ police2$OWN       1   3444367 58947316 1113.8
+ police2$TAX       1   3333611 59058072 1114.0
+ police2$UNEMP     1   2027396 60364287 1115.8
<none>                          62391683 1116.5
+ police2$COMMUTE   1    857584 61534099 1117.3
+ police2$COLLEGE   1    582073 61809610 1117.7

Step: AIC=1085.33
police2$POLICE ~ police2$INC + police2$DUMMY + police2$CRIME
                   Df Sum of Sq      RSS    AIC
+ police2$WHITE     1   3342086 38313426 1080.5
<none>                          41655513 1085.3
+ police2$OWN       1    845383 40810130 1085.7
+ police2$TAX       1    417622 41237891 1086.5
+ police2$COMMUTE   1    218384 41437129 1086.9
+ police2$COLLEGE   1     87241 41568272 1087.2
+ police2$UNEMP     1      8322 41647191 1087.3

Step: AIC=1080.48
police2$POLICE ~ police2$INC + police2$DUMMY + police2$CRIME +
police2$WHITE
                   Df Sum of Sq      RSS    AIC
<none>                          38313426 1080.5
+ police2$UNEMP     1    796891 37516536 1080.8
+ police2$COLLEGE   1     60823 38252604 1082.3
+ police2$OWN       1     52178 38261248 1082.4
+ police2$TAX       1     13488 38299939 1082.5
+ police2$COMMUTE   1      2848 38310578 1082.5

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)   -3.963e+03  5.882e+02  -6.737 2.60e-09 ***
police2$INC    7.053e-01  9.278e-02   7.602 5.92e-11 ***
police2$DUMMY  3.484e+03  4.411e+02   7.899 1.59e-11 ***
police2$CRIME  2.312e+00  3.912e-01   5.910 8.81e-08 ***
police2$WHITE -1.233e+01  4.757e+00  -2.592   0.0114 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 705.4 on 77 degrees of freedom
Multiple R-squared: 0.7906, Adjusted R-squared: 0.7797
F-statistic: 72.66 on 4 and 77 DF, p-value: < 2.2e-16

Stepwise selection

A combination of backward and forward
Start with a full model or a minimum one
Calculate the initial AIC
Drop the variable whose removal gives the smallest AIC
Run the new model without the dropped variable
Calculate the AIC
Calculate a new set of AICs with the dropped variable(s) added back in
> step(lm(DV ~ IV1 + IV2 + ... + IVk), direction = "both")
> police2_bth <- step(lm(police2$POLICE ~ police2$INC + police2$TAX + police2$CRIME + police2$UNEMP + police2$OWN + police2$COLLEGE + police2$WHITE + police2$COMMUTE + police2$DUMMY), direction = "both")

Start: AIC=1088.25
police2$POLICE ~ police2$INC + police2$TAX + police2$CRIME +
police2$UNEMP + police2$OWN + police2$COLLEGE + police2$WHITE +
police2$COMMUTE + police2$DUMMY
                   Df Sum of Sq      RSS    AIC
- police2$OWN       1     30876 37319263 1086.3
- police2$COMMUTE   1     32302 37320689 1086.3
- police2$TAX       1     58854 37347241 1086.4
- police2$COLLEGE   1    117095 37405482 1086.5
- police2$UNEMP     1    885049 38173436 1088.2
<none>                          37288387 1088.2
- police2$WHITE     1   3180164 40468551 1093.0
- police2$INC       1  15006004 52294391 1114.0
- police2$CRIME     1  17115469 54403856 1117.2
- police2$DUMMY     1  21313811 58602197 1123.3

Step: AIC=1086.32
police2$POLICE ~ police2$INC + police2$TAX + police2$CRIME +
police2$UNEMP + police2$COLLEGE + police2$WHITE + police2$COMMUTE +
police2$DUMMY
                   Df Sum of Sq      RSS    AIC
- police2$COMMUTE   1     13542 37332805 1084.3
- police2$TAX       1     62287 37381549 1084.5
- police2$COLLEGE   1    154982 37474245 1084.7
- police2$UNEMP     1    902951 38222214 1086.3
<none>                          37319263 1086.3
+ police2$OWN       1     30876 37288387 1088.2
- police2$WHITE     1   3720805 41040068 1092.1
- police2$INC       1  15062883 52382146 1112.1
- police2$CRIME     1  17143866 54463129 1115.3
- police2$DUMMY     1  21288736 58607999 1121.3

Step: AIC=1084.35
police2$POLICE ~ police2$INC + police2$TAX + police2$CRIME +
police2$UNEMP + police2$COLLEGE + police2$WHITE + police2$DUMMY
                   Df Sum of Sq      RSS    AIC
- police2$TAX       1     62140 37394944 1082.5
- police2$COLLEGE   1    152099 37484904 1082.7
- police2$UNEMP     1    892960 38225764 1084.3
<none>                          37332805 1084.3
+ police2$COMMUTE   1     13542 37319263 1086.3
+ police2$OWN       1     12115 37320689 1086.3
- police2$WHITE     1   3884963 41217768 1090.5
- police2$INC       1  16632077 53964882 1112.6
- police2$CRIME     1  17219408 54552212 1113.5
- police2$DUMMY     1  21306678 58639482 1119.4

Step: AIC=1082.49
police2$POLICE ~ police2$INC + police2$CRIME + police2$UNEMP +
police2$COLLEGE + police2$WHITE + police2$DUMMY
                   Df Sum of Sq      RSS    AIC
- police2$COLLEGE   1    121592 37516536 1080.8
- police2$UNEMP     1    857660 38252604 1082.3
<none>                          37394944 1082.5
+ police2$TAX       1     62140 37332805 1084.3
+ police2$OWN       1     14159 37380785 1084.5
+ police2$COMMUTE   1     13395 37381549 1084.5
- police2$WHITE     1   4167247 41562191 1089.2
- police2$INC       1  17415537 54810481 1111.8
- police2$CRIME     1  18275653 55670597 1113.1
- police2$DUMMY     1  21254605 58649549 1117.4

Step: AIC=1080.75
police2$POLICE ~ police2$INC + police2$CRIME + police2$UNEMP +
police2$WHITE + police2$DUMMY
                   Df Sum of Sq      RSS    AIC
- police2$UNEMP     1    796891 38313426 1080.5
<none>                          37516536 1080.8
+ police2$COLLEGE   1    121592 37394944 1082.5
+ police2$OWN       1     37398 37479137 1082.7
+ police2$TAX       1     31632 37484904 1082.7
+ police2$COMMUTE   1     10804 37505732 1082.7
- police2$WHITE     1   4130655 41647191 1087.3
- police2$CRIME     1  18154115 55670651 1111.1
- police2$INC       1  21931399 59447935 1116.5
- police2$DUMMY     1  30589229 68105765 1127.7

Step: AIC=1080.48
police2$POLICE ~ police2$INC + police2$CRIME + police2$WHITE +
police2$DUMMY
                   Df Sum of Sq      RSS    AIC
<none>                          38313426 1080.5
+ police2$UNEMP     1    796891 37516536 1080.8
+ police2$COLLEGE   1     60823 38252604 1082.3
+ police2$OWN       1     52178 38261248 1082.4
+ police2$TAX       1     13488 38299939 1082.5
+ police2$COMMUTE   1      2848 38310578 1082.5
- police2$WHITE     1   3342086 41655513 1085.3
- police2$CRIME     1  17378415 55691841 1109.2
- police2$INC       1  28755675 67069102 1124.4
- police2$DUMMY     1  31042215 69355641 1127.1

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)   -3.963e+03  5.882e+02  -6.737 2.60e-09 ***
police2$INC    7.053e-01  9.278e-02   7.602 5.92e-11 ***
police2$CRIME  2.312e+00  3.912e-01   5.910 8.81e-08 ***
police2$WHITE -1.233e+01  4.757e+00  -2.592   0.0114 *
police2$DUMMY  3.484e+03  4.411e+02   7.899 1.59e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 705.4 on 77 degrees of freedom
Multiple R-squared: 0.7906, Adjusted R-squared: 0.7797
F-statistic: 72.66 on 4 and 77 DF, p-value: < 2.2e-16

Summary on variable selection


Three common methods
Backward, forward, stepwise

Akaike Information Criterion as threshold


The smaller the better
Other methods possible
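
The three search directions can be compared side by side on a small simulated data set. This is only a sketch: the data frame dat and its columns are invented for illustration, with x1 and x2 truly related to y and x3 and x4 pure noise. It also passes a data frame through the data argument instead of the police2$ prefix style, which keeps the printed term names short.

```r
# Simulated data (illustrative only): x1 and x2 matter, x3 and x4 do not
set.seed(3)
n   <- 120
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 2 + 3 * dat$x1 - 2 * dat$x2 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3 + x4, data = dat)  # starting point for backward/both
null <- lm(y ~ 1, data = dat)                  # starting point for forward

# trace = 0 suppresses the step-by-step tables
bw <- step(full, direction = "backward", trace = 0)
fw <- step(null, direction = "forward", scope = ~ x1 + x2 + x3 + x4, trace = 0)
bt <- step(full, direction = "both", trace = 0)

# Terms retained by each search (intercept dropped from the listing)
sapply(list(backward = bw, forward = fw, both = bt),
       function(m) paste(sort(names(coef(m))[-1]), collapse = " + "))
```

As in the police example, the three searches will often arrive at the same model, but this is not guaranteed in general because each one explores a different path through the candidate models.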

CASE II

Baltimore housing price analysis

Data

> str(housing)
'data.frame': 211 obs. of 17 variables:
 $ STATION: int 1 2 3 4 5 6 7 8 9 10 ...
 $ PRICE  : num 47 113 165 104.3 62.5 ...
 $ NROOM  : num 4 7 7 7 7 6 6 8 6 7 ...
 $ DWELL  : num 0 1 1 1 1 1 1 1 1 1 ...
 $ NBATH  : num 1 2.5 2.5 2.5 1.5 2.5 2.5 1.5 1 2.5 ...
 $ PATIO  : num 0 1 1 1 1 1 1 1 1 1 ...
 $ FIREPL : num 0 1 1 1 1 1 1 0 1 1 ...
 $ AC     : num 0 1 0 1 0 0 1 0 1 1 ...
 $ BMENT  : num 2 2 3 2 2 3 3 0 3 3 ...
 $ NSTOR  : num 3 2 2 2 2 3 1 3 2 2 ...
 $ GAR    : num 0 2 2 2 0 1 2 0 0 2 ...
 $ AGE    : num 148 9 23 5 19 20 20 22 22 4 ...
 $ CITCOU : num 0 1 1 1 1 1 1 1 1 1 ...
 $ LOTSZ  : num 5.7 279.5 70.6 174.6 107.8 ...
 $ SQFT   : num 11.2 28.9 30.6 26.1 22 ...
 $ X      : num 907 922 920 923 918 900 918 907 918 897 ...
 $ Y      : num 534 574 581 578 574 577 576 576 562 576 ...
 - attr(*, "data_types")= chr "N" "N" "N" "N" ...

1. Data exploration
Dependent variable (DV)
Distribution: normality, skewness
Histogram, boxplot
Transformation

Independent variables (IVs)
Linear relationship with DV
Scatter plot
Transformation (bulging rule)
Linear relationships with each other
Correlation matrix
Scatter plot matrix

2. Transformation: lambda for DV


R package AID
> boxcoxnc(PRICE, method="sw")
$title
[1] "Implementation of Box-Cox Power Transformation when No Covariate Is Available"
$version
[1] "Version 1.0"
$method
[1] "Shapiro-Wilk"
$date
[1] "Thu Nov 14 17:23:01 2013"
$result
                     sw
lambda.hat 0.3400000000
sw.pvalue  0.0005484248
sf.pvalue  0.0005484248
jb.pvalue  0.0005484248
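
The idea behind choosing lambda this way can be sketched in base R: try a grid of lambda values and keep the one whose transformed values look most normal according to the Shapiro-Wilk statistic. This is a rough analogue under that assumption, not the AID package's implementation; the function box_cox, the grid, and the simulated right-skewed variable y (a stand-in for PRICE) are all invented for illustration.

```r
# Simulated right-skewed variable (illustrative stand-in for PRICE)
set.seed(4)
y <- rlnorm(200, meanlog = 4, sdlog = 0.5)

# Box-Cox power transformation; lambda = 0 is the log transformation
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Grid search: Shapiro-Wilk W statistic for each candidate lambda
lambdas <- seq(-2, 2, by = 0.01)
W <- sapply(lambdas, function(l) shapiro.test(box_cox(y, l))$statistic)

lambda_hat <- lambdas[which.max(W)]
lambda_hat

# Normality before and after the transformation
shapiro.test(y)$p.value
shapiro.test(box_cox(y, lambda_hat))$p.value
```

In practice the estimate is usually rounded to a convenient value, as the next step of this example does when it uses PRICE^0.3 in place of the estimated 0.34.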

Transformation of DV

Linear relationships among variables


> cor(cbind(PRICE^0.3,NROOM,NBATH,BMENT,NSTOR,GAR,AGE,LOTSZ,SQFT))
           PRICE^0.3      NROOM      NBATH       BMENT      NSTOR       GAR
PRICE^0.3  1.0000000 0.39855637  0.4466113  0.18969889 -0.1532602 0.3582917
NROOM      0.3985564 1.00000000  0.5522312  0.12218455  0.3714874 0.3758094
NBATH      0.4466113 0.55223124  1.0000000  0.23662778  0.1602845 0.2400910
BMENT      0.1896989 0.12218455  0.2366278  1.00000000  0.1261798 0.0685948
NSTOR     -0.1532602 0.37148741  0.1602845  0.12617982  1.0000000 0.1390975
GAR        0.3582917 0.37580943  0.2400910  0.06859480  0.1390975 1.0000000
AGE       -0.3734580 0.07951349 -0.1019667 -0.10078529  0.3368935 0.1121604
LOTSZ      0.5996575 0.30723963  0.3166627  0.03797552 -0.1596150 0.4434802
SQFT       0.4103358 0.68237255  0.5828008  0.06428611  0.5404123 0.3804563
                  AGE       LOTSZ       SQFT
PRICE^0.3 -0.37345798  0.59965753 0.41033577
NROOM      0.07951349  0.30723963 0.68237255
NBATH     -0.10196667  0.31666268 0.58280075
BMENT     -0.10078529  0.03797552 0.06428611
NSTOR      0.33689350 -0.15961499 0.54041230
GAR        0.11216037  0.44348023 0.38045633
AGE        1.00000000 -0.13200062 0.10549370
LOTSZ     -0.13200062  1.00000000 0.41380466
SQFT       0.10549370  0.41380466 1.00000000

Then use scatter plots to identify a proper transformation for each IV
plot(NROOM, PRICE^0.3)
lines(lowess(PRICE^0.3 ~ NROOM))

Only LOTSZ needs to be transformed, to log(LOTSZ)

3. OLS with all IVs

lm(formula = PRICE^0.3 ~ NROOM + NBATH + BMENT + NSTOR + GAR +
AGE + log(LOTSZ) + SQFT + DWELL + PATIO + FIREPL + AC + CITCOU)
Residuals:
     Min       1Q   Median       3Q      Max
-1.30170 -0.13678  0.02056  0.14965  0.93086
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.7446236  0.1875344   9.303  < 2e-16 ***
NROOM        0.0296308  0.0227112   1.305 0.193526
NBATH        0.1148046  0.0395685   2.901 0.004137 **
BMENT        0.0882614  0.0211390   4.175 4.47e-05 ***
NSTOR       -0.0087766  0.0602624  -0.146 0.884355
GAR          0.0684316  0.0354620   1.930 0.055079 .
AGE         -0.0005577  0.0012272  -0.454 0.650029
log(LOTSZ)   0.1182991  0.0396679   2.982 0.003223 **
SQFT        -0.0007289  0.0046376  -0.157 0.875266
DWELL        0.1350517  0.0628093   2.150 0.032759 *
PATIO        0.1107756  0.0571756   1.937 0.054118 .
FIREPL       0.1833169  0.0511999   3.580 0.000432 ***
AC           0.1558560  0.0499401   3.121 0.002074 **
CITCOU       0.2948313  0.0510607   5.774 2.97e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2616 on 197 degrees of freedom
Multiple R-squared: 0.7258, Adjusted R-squared: 0.7077
F-statistic: 40.12 on 13 and 197 DF, p-value: < 2.2e-16
> AIC(m4)
[1] 48.45174

4. Check multicollinearity
> colldiag(m4)
Condition
Index       Variance Decomposition Proportions
            intercept NROOM NBATH BMENT NSTOR GAR   AGE   log(LOTSZ) SQFT  DWELL PATIO FIREPL AC    CITCOU
1    1.000  0.000     0.000 0.001 0.001 0.000 0.002 0.001 0.000      0.000 0.001 0.002 0.002  0.002 0.001
2    2.896  0.000     0.000 0.000 0.003 0.001 0.054 0.022 0.000      0.000 0.002 0.164 0.064  0.083 0.002
3    3.200  0.000     0.000 0.000 0.001 0.000 0.291 0.010 0.000      0.000 0.001 0.000 0.060  0.149 0.023
4    3.871  0.000     0.000 0.000 0.002 0.001 0.042 0.006 0.000      0.000 0.014 0.709 0.067  0.046 0.013
5    4.067  0.000     0.000 0.002 0.005 0.000 0.062 0.004 0.000      0.000 0.070 0.014 0.186  0.166 0.072
6    4.492  0.000     0.000 0.000 0.003 0.000 0.415 0.000 0.000      0.000 0.031 0.008 0.389  0.221 0.002
7    6.312  0.000     0.000 0.000 0.095 0.001 0.013 0.112 0.000      0.002 0.188 0.001 0.059  0.179 0.219
8    6.982  0.000     0.000 0.037 0.227 0.001 0.000 0.217 0.000      0.001 0.068 0.011 0.038  0.040 0.224
9    7.984  0.001     0.002 0.054 0.276 0.007 0.004 0.151 0.000      0.106 0.033 0.025 0.046  0.030 0.001
10  12.151  0.001     0.000 0.632 0.104 0.044 0.004 0.035 0.001      0.141 0.043 0.034 0.000  0.052 0.022
11  13.337  0.035     0.040 0.064 0.262 0.029 0.014 0.363 0.030      0.129 0.000 0.016 0.017  0.015 0.350
12  21.813  0.000     0.177 0.179 0.014 0.546 0.024 0.006 0.113      0.196 0.357 0.006 0.071  0.010 0.010
13  23.044  0.027     0.743 0.005 0.004 0.001 0.000 0.000 0.216      0.072 0.116 0.000 0.000  0.001 0.060
14  41.055  0.935     0.036 0.026 0.001 0.368 0.076 0.072 0.640      0.352 0.075 0.009 0.002  0.006 0.002

> vif(m4)
     NROOM      NBATH      BMENT      NSTOR        GAR        AGE log(LOTSZ)       SQFT      DWELL      PATIO
  2.167749   2.015631   1.174733   2.950392   1.390686   1.817996   4.186010   4.448950   3.025165   1.263142
    FIREPL         AC     CITCOU
  1.481237   1.409244   1.918044

No actions

5. Check outliers

Add indicators
> bltm$dummy[1]<-1
> bltm$dummy[16]<-1
> bltm$dummy[53]<-1

6. Model with dummy

lm(formula = PRICE^0.3 ~ NROOM + NBATH + BMENT + NSTOR + GAR +
AGE + log(LOTSZ) + SQFT + DWELL + PATIO + FIREPL + AC + CITCOU +
dummy)
Residuals:
     Min       1Q   Median       3Q      Max
-0.91242 -0.13807  0.01454  0.14102  1.22804
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.7012122  0.1852023   9.186  < 2e-16 ***
NROOM        0.0271138  0.0223653   1.212  0.22685
NBATH        0.1193375  0.0389680   3.062  0.00250 **
BMENT        0.0862543  0.0208123   4.144 5.07e-05 ***
NSTOR       -0.0110848  0.0593002  -0.187  0.85191
GAR          0.0550100  0.0352354   1.561  0.12009
AGE          0.0009228  0.0013232   0.697  0.48638
log(LOTSZ)   0.1178271  0.0390309   3.019  0.00288 **
SQFT         0.0003459  0.0045799   0.076  0.93988
DWELL        0.1219116  0.0619865   1.967  0.05062 .
PATIO        0.1070707  0.0562733   1.903  0.05855 .
FIREPL       0.1874886  0.0504003   3.720  0.00026 ***
AC           0.1711843  0.0494561   3.461  0.00066 ***
CITCOU       0.3116858  0.0506166   6.158 4.10e-09 ***
dummy       -0.4676703  0.1709286  -2.736  0.00679 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2574 on 196 degrees of freedom
Multiple R-squared: 0.7359, Adjusted R-squared: 0.7171
F-statistic: 39.01 on 14 and 196 DF, p-value: < 2.2e-16

7. Backward elimination (BE) and goodness of fit

lm(formula = PRICE^0.3 ~ NROOM + NBATH + BMENT + GAR + log(LOTSZ) +
DWELL + PATIO + FIREPL + AC + CITCOU + dummy)
Residuals:
     Min       1Q   Median       3Q      Max
-0.93368 -0.14123  0.01368  0.14424  1.26457

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.73060    0.13238  13.073  < 2e-16 ***
NROOM        0.02788    0.01985   1.405 0.161690
NBATH        0.11679    0.03540   3.299 0.001149 **
BMENT        0.08390    0.02025   4.144 5.04e-05 ***
GAR          0.05983    0.03408   1.756 0.080684 .
log(LOTSZ)   0.11580    0.03679   3.148 0.001898 **
DWELL        0.12976    0.05876   2.208 0.028377 *
PATIO        0.10592    0.05481   1.932 0.054729 .
FIREPL       0.18950    0.04896   3.870 0.000147 ***
AC           0.16165    0.04574   3.534 0.000509 ***
CITCOU       0.29745    0.04589   6.482 6.99e-10 ***
dummy       -0.41866    0.15292  -2.738 0.006746 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2558 on 199 degrees of freedom
Multiple R-squared: 0.7352, Adjusted R-squared: 0.7206
F-statistic: 50.24 on 11 and 199 DF, p-value: < 2.2e-16
> AIC(m6)
[1] 37.09025

6. Model diagnostics
REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER   27.991764
TEST ON NORMALITY OF ERRORS
TEST                    DF     VALUE        PROB
Jarque-Bera             2      122.9635     0.0000000

DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST                    DF     VALUE        PROB
Breusch-Pagan test      11     282.276      0.0000000
Koenker-Bassett test    11     99.06831     0.0000000
SPECIFICATION ROBUST TEST
TEST                    DF     VALUE        PROB
White                   77     211          0.0000000

DIAGNOSTICS FOR SPATIAL DEPENDENCE
FOR WEIGHT MATRIX : baltply_r.gal (row-standardized weights)
TEST                          MI/DF      VALUE        PROB
Moran's I (error)             0.038901   N/A          N/A
Lagrange Multiplier (lag)     1          11.3346274   0.0007608
Robust LM (lag)               1          12.1628749   0.0004875
Lagrange Multiplier (error)   1          0.8921555    0.3448939
Robust LM (error)             1          1.7204031    0.1896412
Lagrange Multiplier (SARMA)   2          13.0550305   0.0014626

Residual plots

7. Spatial error model

R-squared           : 0.737190     R-squared (BUSE)      : -
Sq. Correlation     : -            Log likelihood        : -5.040731
Sigma-square        : 0.0612524    Akaike info criterion : 34.0815
S.E of regression   : 0.247492     Schwarz criterion     : 74.3038

-----------------------------------------------------------------------
Variable    Coefficient   Std.Error    z-value      Probability
-----------------------------------------------------------------------
CONSTANT     1.751199     0.1312221    13.34531     0.0000000
NROOM        0.02961356   0.01906336    1.553428    0.1203208
NBATH        0.1185139    0.0343727     3.447908    0.0005650
BMENT        0.08276084   0.0196993     4.201207    0.0000266
GAR          0.05483605   0.03321603    1.650891    0.0987607
LOGLOTSZ     0.1086438    0.03611104    3.008603    0.0026247
DWELL        0.1378764    0.05676524    2.428887    0.0151452
PATIO        0.09818663   0.05363735    1.830565    0.0671654
FIREPL       0.1779353    0.04749288    3.746569    0.0001793
AC           0.1578844    0.04443501    3.553152    0.0003807
CITCOU       0.297427     0.04709394    6.315611    0.0000000
DUMMY       -0.4276138    0.1484325    -2.880864    0.0039660
LAMBDA       0.121354     0.10976       1.10563     0.2688867
-----------------------------------------------------------------------

REGRESSION DIAGNOSTICS
DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST                    DF     VALUE        PROB
Breusch-Pagan test      11     307.1502     0.0000000

DIAGNOSTICS FOR SPATIAL DEPENDENCE
SPATIAL ERROR DEPENDENCE FOR WEIGHT MATRIX : baltply_r.gal
TEST                    DF     VALUE        PROB
Likelihood Ratio Test   1      1.00879      0.3151928

8. Maps (residuals, fitted)