
TREATING MULTICOLLINEARITY WITH SAS

William J. Wilson, University of North Florida



1. Introduction

A problem that must be considered in almost all multiple regression analyses is that of multicollinearity among the regressor variables. Many authors even suggest that an examination for the existence of multicollinearity should be routinely performed as an initial step in regression analysis (cf. Mansfield and Helms, 1982). Simply determining the existence of multicollinearity often is not enough to obtain effective remedies. The nature of the multicollinearity must often be closely examined.

Further, explanation of multicollinearity, either in a classroom setting or in a consulting situation where the experimenter has limited training in statistics, usually requires an in-depth exploration of the relationship between regressor variables. This exploration is easily accomplished through various procedures in SAS, especially PROC MATRIX. This paper is primarily concerned with the exploration techniques which can be used to detect the nature of the multicollinearity using SAS. The general objective is tutorial, and several examples are presented that the author has found effective in classroom situations.


2. Detection of Multicollinearity
The presence of multicollinearity is first felt (sometimes to the surprise of the investigator) when examining the statistics concerning the regression coefficient estimates. Since the least squares estimates have inflated variances in the presence of multicollinearity, some unusual results can occur. For example, a significant overall regression equation with a large R² may have none of the individual coefficients significant. Many numerical examples exist in the literature; one is a four-variable problem given by A. Hald on page 647 of his book Statistical Theory with Engineering Applications (Wiley, New York, 1952) and used extensively in Draper and Smith (1981). This problem, referred to as the Hald data, will be used throughout this portion of the paper and is included in Section 3. The paradox discussed above is apparent from the PROC REG output in Table I.
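The paper does not reproduce the statements that produced Table I; a minimal sketch, assuming the data set is named HALD (the name is this sketch's choice, the data are from Example 2 in Section 3):

DATA HALD;                               /* Hald data, Section 3, Example 2 */
   INPUT X1 X2 X3 X4 Y;
   CARDS;
 7 26  6 60  78.5
 1 29 15 52  74.3
11 56  8 20 104.3
11 31  8 47  87.6
 7 52  6 33  95.9
11 55  9 22 109.2
 3 71 17  6 102.7
 1 31 22 44  72.5
 2 54 18 22  93.1
21 47  4 26 115.9
 1 40 23 34  83.8
11 66  9 12 113.3
10 68  8 12 109.4
;
PROC REG DATA=HALD;                      /* the overall F is highly         */
   MODEL Y = X1 X2 X3 X4;                /* significant, yet no single      */
RUN;                                     /* t statistic is (Table I)        */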

Another typical situation in which multicollinearity causes problems occurs in analyses like that given by Gunst and Mason (1977), in which a regression analysis is performed with data on spinal cord injuries. In that analysis a backward elimination variable selection procedure and a maximum R² procedure gave widely different results and indicated models which did not agree with the initial regression analysis. That is, the selection procedures indicated that variables should be included in the model which had very small t statistics and eliminated several with relatively large ones. These two examples obviously represent extreme effects of multicollinearity. However, even in these extreme cases a method of finding a solution is not immediately apparent. Further exploration is needed.

The next logical step seems to be to examine correlations between the regressor variables and also between the coefficient estimates. This is done with PROC CORR and the CORRB option in PROC REG. The results for the Hald data are given in Table II. Notice that extremely large correlations are indicated between X1 and X3 and between X2 and X4.

Pairwise correlations are usually not sufficient to isolate the problem nor to indicate a possible solution. However, the correlation matrix can be used to obtain other useful statistics (Mansfield and Helms, 1982). The output from PROC CORR can be used in PROC MATRIX to obtain the following (a sketch of these computations follows the list):

(a) the determinant of C, |C|. A small value of the determinant is another indication of multicollinearity.

(b) the eigenvalues and eigenvectors of C. An eigenvalue of zero, of course, indicates an exact linear dependence. The relationship between the eigenvalues can be further examined by the condition index, which is given using the COLLIN option in PROC REG. The condition index is the square root of the ratio of the largest eigenvalue to the particular eigenvalue. Further, the eigenvectors can sometimes indicate where a problem exists. For example, a large value within an eigenvector corresponding to a small eigenvalue indicates that the corresponding regressor variable is contributing to the multicollinearity problem.

(c) the variance inflation factors, VIF(j), for each variable, obtained by taking the diagonal elements of the inverse of the C matrix. The values can also be obtained using the VIF option in PROC REG.

(d) the coefficient of determination, Rj², when Xj is regressed on the other predictor variables, by using Rj² = 1 - 1/VIF(j). Also, TOLERANCE(j) = 1 - Rj². The TOL option in PROC REG gives this value also.

(e) the overall design efficiency factor given by Willan and Watts (1978), |C|^(1/2). It also can be used to calculate an effective sample size, that is, the sample size required for an orthogonal design with the same rms values to yield the same power as the actual data in the parametric tests of hypothesis.
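The paper performs these computations with PROC MATRIX, whose source is not shown; the sketch below uses PROC IML, the successor to PROC MATRIX, and assumes the HALD data set created earlier. Note that the COLLIN option in PROC REG diagnoses the uncentered crossproducts matrix, intercept included, which is why its eigenvalues in Table III differ from those of C computed here.

PROC IML;                                /* PROC MATRIX in the original */
   USE HALD;
   READ ALL VAR {X1 X2 X3 X4} INTO X;
   N = NROW(X);

   /* correlation matrix C of the regressors */
   XC = X - J(N,1,1)*X[:,];              /* center each column          */
   S  = XC`*XC;                          /* corrected crossproducts     */
   C  = S / SQRT(VECDIAG(S)*VECDIAG(S)`);

   /* (a) determinant of C */
   DETC = DET(C);

   /* (b) eigenvalues (descending), eigenvectors, condition indexes */
   CALL EIGEN(LAMBDA, E, C);
   CONDX = SQRT(LAMBDA[1]/LAMBDA);

   /* (c) variance inflation factors: diagonal of the inverse of C */
   VIF = VECDIAG(INV(C));

   /* (d) R-square of each Xj on the others, and tolerance */
   RSQ = 1 - 1/VIF;
   TOL = 1 - RSQ;

   /* (e) design efficiency factor of Willan and Watts (1978) */
   EFF = SQRT(DETC);                     /* |C|^(1/2)                   */

   PRINT DETC, LAMBDA CONDX, E, VIF RSQ TOL, EFF;
QUIT;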

In addition to the options already discussed, the COLLIN option in PROC REG also displays the proportion of the variance of each estimate accounted for by each principal component. This not only aids in detecting multicollinearity, but helps if a principal component regression is to be performed. A multicollinearity problem exists when a component associated with a high condition index contributes strongly to the variance of two or more variables. All the statistics discussed above are displayed for the Hald data in Table III.
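All of the displays discussed in this section can be requested in one run; a minimal sketch, again assuming the HALD data set:

PROC CORR DATA=HALD;                     /* Table II: pairwise correlations */
   VAR X1 X2 X3 X4;
RUN;

PROC REG DATA=HALD;
   /* CORRB prints the correlations of the coefficient estimates,  */
   /* COLLIN the eigenvalues, condition indexes, and variance      */
   /* proportions, and VIF and TOL the variance inflation factors  */
   /* and tolerances (Tables II and III)                           */
   MODEL Y = X1 X2 X3 X4 / CORRB COLLIN VIF TOL;
RUN;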

Once the problem of multicollinearity has been recognized and completely explored, it usually is possible to select an appropriate technique to alleviate it. Some possible techniques suggested in the literature are

(1) transformation of the variables,
(2) reduction in the number of variables,
(3) multivariate techniques such as factor analysis,
(4) biased regression procedures such as ridge and principal component regression,
(5) use of additional regressor variables designed to break the pattern of the multicollinearity, or
(6) reparameterization or centering of the model.

All of these techniques can be performed using SAS; centering, for example, is sketched below. For a complete discussion of the effects of multicollinearity see Belsley, Kuh and Welsch (1980).
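To illustrate how little code technique (6) requires, here is a sketch of centering with PROC STANDARD (the choice of PROC STANDARD rather than a DATA step, and the data set names, are this sketch's assumptions). Centering reparameterizes the intercept and removes its extreme correlations with the slope estimates seen in Table II; it does not, of course, remove a dependence among the regressors themselves.

PROC STANDARD DATA=HALD MEAN=0 OUT=CENTERED;
   VAR X1 X2 X3 X4;                      /* replace each Xj by Xj minus its mean */
RUN;

PROC REG DATA=CENTERED;
   MODEL Y = X1 X2 X3 X4 / COLLIN;       /* slopes are unchanged; the intercept  */
RUN;                                     /* becomes the mean of Y                */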
3. Examples

This section gives three sets of data that illustrate various types of multicollinearity. The first is an example of orthogonal regressor variables from Neter and Wasserman (1974); this data, of course, has no multicollinearity. The second is the Hald data. The third is a set which was artificially generated with a correlation matrix of the form C = (1-ρ)I + ρJ, where I is the identity matrix of order 3 and J is a square matrix of ones. The value -.499 was assigned to ρ; the actual correlation matrix differs because of sampling variation. This example illustrates the point that Mansfield and Helms (1982) stress: the existence of extreme pairwise correlations may be sufficient for detecting multicollinearity, but it is not necessary.

Example 1: Orthogonal Regressor Variables

OBS    X1    X2     Y
 1      4     2    42
 2      4     2    39
 3      4     3    48
 4      4     3    51
 5      6     2    49
 6      6     2    53
 7      6     3    61
 8      6     3    60

The analysis of example 1 is shown in Table IV.

Example 2: Hald Data

OBS    X1    X2    X3    X4       Y
 1      7    26     6    60    78.5
 2      1    29    15    52    74.3
 3     11    56     8    20   104.3
 4     11    31     8    47    87.6
 5      7    52     6    33    95.9
 6     11    55     9    22   109.2
 7      3    71    17     6   102.7
 8      1    31    22    44    72.5
 9      2    54    18    22    93.1
10     21    47     4    26   115.9
11      1    40    23    34    83.8
12     11    66     9    12   113.3
13     10    68     8    12   109.4

Example 3: Data Generated with Correlation Matrix C = (1-ρ)I + ρJ

OBS     X1      X2      X3         Y
 1     -1.0     1.5     3.0     9.336
 2      1.5    -0.5   -12.0   -28.679
 3      1.0     2.5   -13.0   -45.118
 4      1.0    -2.5    -1.0    16.985
 5     -0.5    -1.5     5.5    33.521
 6      0.5    -1.0     0.0    14.420
 7      0.5     1.0    -5.5   -12.599
 8      0.0     0.0    -1.0     5.722
 9      1.0    -6.5     2.0    41.430
10      1.5    -2.5    -5.0     5.611

The analysis of example 3 is shown in Table V.
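The paper does not show the program that generated example 3; the following PROC IML sketch is one way to do it. The seed, the sample-size handling, and the model for Y are assumptions of this sketch (the paper does not state how Y was constructed).

PROC IML;
   RHO = -0.499;  K = 3;  N = 10;

   /* population correlation matrix C = (1-rho)*I + rho*J; its    */
   /* eigenvalues are 1-rho = 1.499 (twice) and 1+(k-1)*rho =     */
   /* 0.002, so C is barely positive definite -- a near-exact     */
   /* dependence although no pairwise correlation exceeds .499    */
   C = (1-RHO)*I(K) + RHO*J(K,K,1);

   CALL RANDSEED(1983);                  /* seed is arbitrary     */
   Z = J(N,K,.);
   CALL RANDGEN(Z, "NORMAL");            /* iid standard normals  */
   X = Z * ROOT(C);                      /* ROOT(C)`*ROOT(C) = C  */

   /* an arbitrary linear combination plus noise, for             */
   /* illustration only                                           */
   E = J(N,1,.);
   CALL RANDGEN(E, "NORMAL");
   Y = X * {1,-2,3} + E;

   XY = X || Y;
   CREATE EX3 FROM XY[COLNAME={"X1" "X2" "X3" "Y"}];
   APPEND FROM XY;
   CLOSE EX3;
QUIT;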

4. References

Belsley, D. A., Kuh, E., and Welsch, R. E., 1980, Regression Diagnostics, John Wiley and Sons, Inc., New York.

Draper, N. R. and Smith, H., 1981, Applied Regression Analysis, John Wiley and Sons, New York.

Freund, R. J. and Minton, P. D., 1979, Regression Methods, Marcel Dekker, Inc., New York.

Gunst, R. F. and Mason, R. L., 1977, "Advantages of Examining Multicollinearities in Regression Analysis", Biometrics, 33, pp 249-260.

Mansfield, E. R. and Helms, B. P., 1982, "Detecting Multicollinearity", The American Statistician, 36, pp 158-160.

Montgomery, D. C. and Peck, E. A., 1982, Introduction to Linear Regression Analysis, John Wiley and Sons, Inc., New York.

Neter, J. and Wasserman, W., 1974, Applied Linear Statistical Models, Richard D. Irwin, Inc., Homewood, Ill.

Willan, A. R. and Watts, D. G., 1978, "Meaningful Multicollinearity Measures", Technometrics, 20, pp 407-412.


DEP VARIABLE: Y

SOURCE     DF     SUM OF SQUARES     MEAN SQUARE     F VALUE     PROB>F
MODEL       4        2667.899          666.975       111.479     0.0001
ERROR       8          47.863639         5.982955
C TOTAL    12        2715.763

ROOT MSE    2.446008     R-SQUARE    0.9824
DEP MEAN   95.423077     ADJ R-SQ    0.9736
C.V.        2.56333

                PARAMETER     STANDARD      T FOR H0:
VARIABLE  DF    ESTIMATE      ERROR         PARAMETER=0    PROB > |T|
INTERCEP   1    62.405369     70.070959      0.891         0.3991
X1         1     1.551103      0.744770      2.083         0.0708
X2         1     0.510168      0.723788      0.705         0.5009
X3         1     0.101909      0.754709      0.135         0.8959
X4         1    -0.144061      0.709052     -0.203         0.8441

TABLE I
PROC REG Output for the Hald Data

CORRELATION COEFFICIENTS / PROB > |R| UNDER H0:RHO=0 / N = 13

          X1          X2          X3          X4
X1      1.00000     0.22858    -0.82413    -0.24545
        0.0000      0.4526      0.0005      0.4189
X2      0.22858     1.00000    -0.13924    -0.97295
        0.4526      0.0000      0.6501      0.0001
X3     -0.82413    -0.13924     1.00000     0.02954
        0.0005      0.6501      0.0000      0.9237
X4     -0.24545    -0.97295     0.02954     1.00000
        0.4189      0.0001      0.9237      0.0000

CORRELATION OF ESTIMATES

CORRB       INTERCEP     X1          X2          X3          X4
INTERCEP     1.0000     -0.9678     -0.9978     -0.9769     -0.9983
X1          -0.9678      1.0000      0.9510      0.9861      0.9568
X2          -0.9978      0.9510      1.0000      0.9624      0.9979
X3          -0.9769      0.9861      0.9624      1.0000      0.9659
X4          -0.9983      0.9568      0.9979      0.9659      1.0000

TABLE II
Correlations for the Hald Data

(a)  |C| = 0.00106766

(b)  Eigenvalues and eigenvectors of C, from PROC MATRIX (each column
     below is the eigenvector for the eigenvalue heading it):

     Eigenvalue    2.2357       1.57607      0.186606     0.00162375

     X1             .475955      .508979     -.6755        .241052
     X2             .56387      -.413931      .31442       .641756
     X3            -.394067     -.604969     -.637691      .268466
     X4            -.547931      .451235      .195421      .676734

     Collinearity diagnostics from PROC REG (COLLIN option):

                                CONDITION   ----------- VARIANCE PROPORTIONS -----------
     NUMBER    EIGENVALUE       INDEX       INTERCEP    X1        X2        X3        X4
     1         4.120              1.000     0.0000      0.0004    0.0000    0.0002    0.0000
     2         0.553894           2.727     0.0000      0.0100    0.0000    0.0027    0.0001
     3         0.288702           3.778     0.0006      0.0000    0.0003    0.0016    0.0017
     4         0.037638          10.462     0.0574      0.0001    0.0028    0.0457    0.0009
     5         0.000066138      249.578     0.9999      0.9316    0.9969    0.9498    0.9973

(c)  Variance inflation factors, from PROC MATRIX or PROC REG:

     INTERCEP      0.000000
     X1           38.496211
     X2          254.423166
     X3           46.868386
     X4          282.512865

(d)  Tolerances, from PROC MATRIX or PROC REG:

     X1    0.025977
     X2    0.003930
     X3    0.021336
     X4    0.003540

(e)  |C|^(1/2) = .0326751

TABLE III
Multicollinearity Statistics for the Hald Data

PROC REG:

DEP VARIABLE: Y

SOURCE     DF     SUM OF SQUARES     MEAN SQUARE     F VALUE     PROB>F
MODEL       2        402.250           201.125        57.057     0.0004
ERROR       5         17.625000          3.525000
C TOTAL     7        419.875

ROOT MSE    1.877498     R-SQUARE    0.9580
DEP MEAN   50.375000     ADJ R-SQ    0.9412
C.V.        3.727044

                PARAMETER    STANDARD     T FOR H0:                              VARIANCE
VARIABLE  DF    ESTIMATE     ERROR        PARAMETER=0   PROB > |T|   TOLERANCE   INFLATION
INTERCEP   1    0.375000     4.740451      0.079        0.9400       .           0.000000
X1         1    5.375000     0.663796      8.097        0.0005       1.000000    1.000000
X2         1    9.250000     1.327592      6.968        0.0009       1.000000    1.000000

PROC CORR and PROC MATRIX:

|C| = 1.0000

Eigenvalue    Eigenvector (X1, X2)
1              1    0
1              0    1

Collinearity diagnostics from PROC REG:

                             CONDITION     PORTION      PORTION     PORTION
NUMBER    EIGENVALUE         INDEX         INTERCEP     X1          X2
1         2.948                1.000       0.0022       0.0043      0.0043
2         0.038462             8.756       0.0000       0.5000      0.5000
3         0.013044            15.034       0.9978       0.4957      0.4957

|C|^(1/2) = 1.000

TABLE IV
Analysis of Example 1 (Orthogonal Regressor Variables)

PROC REG:

DEP VARIABLE: Y

SOURCE     DF     SUM OF SQUARES     MEAN SQUARE     F VALUE      PROB>F
MODEL       3       6336.529          2112.176       4039.168     0.0001
ERROR       6          3.137541          0.522924
C TOTAL     9       6339.666

ROOT MSE    0.723135     R-SQUARE    0.9995
DEP MEAN    4.063061     ADJ R-SQ    0.9993
C.V.       17.79777

                PARAMETER    STANDARD     T FOR H0:                              VARIANCE
VARIABLE  DF    ESTIMATE     ERROR        PARAMETER=0   PROB > |T|   TOLERANCE   INFLATION
INTERCEP   1    9.525137     0.280618      33.943       0.0001       .            0.000000
X1         1    4.591704     1.192333       3.851       0.0084       0.059089    16.923739
X2         1   -3.042167     0.292608     -10.397       0.0001       0.094653    10.564858
X3         1    3.916058     0.158279      24.742       0.0001       0.061284    16.317445

PROC CORR:

C =
          X1          X2          X3
X1      1.00000    -0.40291    -0.67650
        0.0000      0.2483      0.0317
X2     -0.40291     1.00000    -0.36223
        0.2483      0.0000      0.3037
X3     -0.67650    -0.36223     1.00000
        0.0317      0.3037      0.0000

|C| = .0513354

Eigenvalue     Eigenvector (X1, X2, X3)
1.67829         0.723058    -0.0620897   -0.687991
1.29815        -0.295539     0.872397    -0.389334
0.0235627      -0.624375    -0.484839    -0.612444

Collinearity diagnostics from PROC REG:

                           CONDITION    PORTION     PORTION    PORTION    PORTION
NUMBER    EIGENVALUE       INDEX        INTERCEP    X1         X2         X3
1         2.285              1.000      0.0776      0.0069     0.0034     0.0054
2         1.170              1.397      0.0035      0.0000     0.0468     0.0151
3         0.526483           2.083      0.9166      0.0115     0.0104     0.0055
4         0.018175          11.212      0.0023      0.9815     0.9394     0.9741

|C|^(1/2) = .226573

TABLE V
Analysis of Example 3 (Generated Data)
