
TREATING MULTICOLLINEARITY WITH SAS

William J. Wilson, University of North Florida



1. Introduction

A problem that must be considered in almost all multiple regression analyses is that of multicollinearity among the regressor variables. Many authors even suggest that an examination for the existence of multicollinearity should be routinely performed as an initial step in regression analysis (cf. Mansfield and Helms, 1982). Simply determining the existence of multicollinearity often is not enough to obtain effective remedies. The nature of the multicollinearity must often be closely examined.

Further, explanation of multicollinearity, either in a classroom setting or in a consulting situation where the experimenter has limited training in statistics, usually requires an in-depth exploration of the relationship between regressor variables. This exploration is easily accomplished through various procedures in SAS, especially PROC MATRIX. This paper is primarily concerned with the exploration techniques which can be used to detect the nature of the multicollinearity using SAS. The general objective is tutorial, and several examples are presented that the author has found effective in classroom situations.


2. Detection of Multicollinearity
The presence of multicollinearity is first felt (sometimes to the surprise of the investigator) when examining the statistics concerning the regression coefficient estimates. Since the least squares estimates have inflated variances in the presence of multicollinearity, some unusual results can occur. For example, a significant overall regression equation with a large R² may have none of the individual coefficients significant. Many numerical examples exist in the literature; one is a four-variable problem given by A. Hald on page 647 of his book Statistical Theory with Engineering Applications (Wiley, New York, 1952) and used extensively in Draper and Smith (1981). This problem, referred to as the Hald data, will be used throughout this portion of the paper and is included in Section 3. The paradox discussed above is apparent from the PROC REG output in Table I.
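The paper does not reproduce the statements that produced Table I; a minimal sketch, assuming the data set is named HALD (the name is this sketch's choice, the data are from Example 2 in Section 3):

DATA HALD;                               /* Hald data, Section 3, Example 2 */
   INPUT X1 X2 X3 X4 Y;
   CARDS;
 7 26  6 60  78.5
 1 29 15 52  74.3
11 56  8 20 104.3
11 31  8 47  87.6
 7 52  6 33  95.9
11 55  9 22 109.2
 3 71 17  6 102.7
 1 31 22 44  72.5
 2 54 18 22  93.1
21 47  4 26 115.9
 1 40 23 34  83.8
11 66  9 12 113.3
10 68  8 12 109.4
;
PROC REG DATA=HALD;                      /* the overall F is highly         */
   MODEL Y = X1 X2 X3 X4;                /* significant, yet no single      */
RUN;                                     /* t statistic is (Table I)        */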

Another typical situation in which multicollinearity causes problems occurs in analyses like that given by Gunst and Mason (1977), in which a regression analysis is performed with data on spinal cord injuries. In that analysis a backward elimination variable selection procedure and a maximum R² procedure gave widely different results and indicated models which did not agree with the initial regression analysis. That is, the selection procedures indicated that variables should be included in the model which had very small t statistics and eliminated several with relatively large ones. These two examples obviously represent extreme effects of multicollinearity. However, even in these extreme cases a method of finding a solution is not immediately apparent. Further exploration is needed.

The next logical step seems to be to examine correlations between the regressor variables and also between the coefficient estimates. This is done with PROC CORR and the CORRB option in PROC REG. The results for the Hald data are given in Table II. Notice that extremely large correlations are indicated between X1 and X3 and between X2 and X4.

Pairwise correlations are usually not sufficient to isolate the problem nor to indicate a possible solution. However, the correlation matrix can be used to obtain other useful statistics (Mansfield and Helms, 1982). The output from PROC CORR can be used in PROC MATRIX to obtain the following (a sketch of these computations follows the list):

(a) the determinant of C, |C|. A small value of the determinant is another indication of multicollinearity.

(b) the eigenvalues and eigenvectors of C. An eigenvalue of zero, of course, indicates an exact linear dependence. The relationship between the eigenvalues can be further examined by the condition index, which is given using the COLLIN option in PROC REG. The condition index is the square root of the ratio of the largest eigenvalue to the particular eigenvalue. Further, the eigenvectors can sometimes indicate where a problem exists. For example, a large value within an eigenvector corresponding to a small eigenvalue indicates that the corresponding regressor variable is contributing to the multicollinearity problem.

(c) the variance inflation factors, VIF(j), for each variable, obtained by taking the diagonal elements of the inverse of the C matrix. The values can also be obtained using the VIF option in PROC REG.

(d) the coefficient of determination, Rj², when Xj is regressed on the other predictor variables, by using Rj² = 1 - 1/VIF(j). Also, TOLERANCE(j) = 1 - Rj². The TOL option in PROC REG gives this value also.

(e) the overall design efficiency factor given by Willan and Watts (1978), |C|^(1/2). It also can be used to calculate an effective sample size, that is, the sample size required for an orthogonal design with the same rms values to yield the same power as the actual data in the parametric tests of hypothesis.
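The paper performs these computations with PROC MATRIX, whose source is not shown; the sketch below uses PROC IML, the successor to PROC MATRIX, and assumes the HALD data set created earlier. Note that the COLLIN option in PROC REG diagnoses the uncentered crossproducts matrix, intercept included, which is why its eigenvalues in Table III differ from those of C computed here.

PROC IML;                                /* PROC MATRIX in the original */
   USE HALD;
   READ ALL VAR {X1 X2 X3 X4} INTO X;
   N = NROW(X);

   /* correlation matrix C of the regressors */
   XC = X - J(N,1,1)*X[:,];              /* center each column          */
   S  = XC`*XC;                          /* corrected crossproducts     */
   C  = S / SQRT(VECDIAG(S)*VECDIAG(S)`);

   /* (a) determinant of C */
   DETC = DET(C);

   /* (b) eigenvalues (descending), eigenvectors, condition indexes */
   CALL EIGEN(LAMBDA, E, C);
   CONDX = SQRT(LAMBDA[1]/LAMBDA);

   /* (c) variance inflation factors: diagonal of the inverse of C */
   VIF = VECDIAG(INV(C));

   /* (d) R-square of each Xj on the others, and tolerance */
   RSQ = 1 - 1/VIF;
   TOL = 1 - RSQ;

   /* (e) design efficiency factor of Willan and Watts (1978) */
   EFF = SQRT(DETC);                     /* |C|^(1/2)                   */

   PRINT DETC, LAMBDA CONDX, E, VIF RSQ TOL, EFF;
QUIT;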

In addition to the options already discussed, the COLLIN option in PROC REG also displays the proportion of the variance of each estimate accounted for by each principal component. This not only aids in detecting multicollinearity, but helps if a principal component regression is to be performed. A multicollinearity problem exists when a component associated with a high condition index contributes strongly to the variance of two or more variables. All the statistics discussed above are displayed for the Hald data in Table III.
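All of the displays discussed in this section can be requested in one run; a minimal sketch, again assuming the HALD data set:

PROC CORR DATA=HALD;                     /* Table II: pairwise correlations */
   VAR X1 X2 X3 X4;
RUN;

PROC REG DATA=HALD;
   /* CORRB prints the correlations of the coefficient estimates,  */
   /* COLLIN the eigenvalues, condition indexes, and variance      */
   /* proportions, and VIF and TOL the variance inflation factors  */
   /* and tolerances (Tables II and III)                           */
   MODEL Y = X1 X2 X3 X4 / CORRB COLLIN VIF TOL;
RUN;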

Once the problem of multicollinearity has been recognized and completely explored, it usually is possible to select an appropriate technique to alleviate it. Some possible techniques suggested in the literature are

(1) transformation of the variables,
(2) reduction in the number of variables,
(3) multivariate techniques such as factor analysis,
(4) biased regression procedures such as ridge and principal component regression,
(5) use of additional regressor variables designed to break the pattern of the multicollinearity, or
(6) reparameterization or centering of the model.

All of these techniques can be performed using SAS; centering, for example, is sketched below. For a complete discussion of the effects of multicollinearity see Belsley, Kuh and Welsch (1980).
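To illustrate how little code technique (6) requires, here is a sketch of centering with PROC STANDARD (the choice of PROC STANDARD rather than a DATA step, and the data set names, are this sketch's assumptions). Centering reparameterizes the intercept and removes its extreme correlations with the slope estimates seen in Table II; it does not, of course, remove a dependence among the regressors themselves.

PROC STANDARD DATA=HALD MEAN=0 OUT=CENTERED;
   VAR X1 X2 X3 X4;                      /* replace each Xj by Xj minus its mean */
RUN;

PROC REG DATA=CENTERED;
   MODEL Y = X1 X2 X3 X4 / COLLIN;       /* slopes are unchanged; the intercept  */
RUN;                                     /* becomes the mean of Y                */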
3. Examples

This section gives three sets of data that illustrate various types of multicollinearity. The first is an example of orthogonal regressor variables from Neter and Wasserman (1974); this data, of course, has no multicollinearity. The second is the Hald data. The third is a set which was artificially generated with a correlation matrix of the form C = (1-ρ)I + ρJ, where I is the identity matrix of order 3 and J is a square matrix of ones. The value -.499 was assigned to ρ; the actual correlation matrix differs because of sampling variation. This example illustrates the point that Mansfield and Helms (1982) stress: the existence of extreme pairwise correlations may be sufficient for detecting multicollinearity, but it is not necessary.

Example 1: Orthogonal Regressor Variables

OBS    X1    X2     Y
 1      4     2    42
 2      4     2    39
 3      4     3    48
 4      4     3    51
 5      6     2    49
 6      6     2    53
 7      6     3    61
 8      6     3    60

The analysis of example 1 is shown in Table IV.

Example 2: Hald Data

OBS    X1    X2    X3    X4       Y
 1      7    26     6    60    78.5
 2      1    29    15    52    74.3
 3     11    56     8    20   104.3
 4     11    31     8    47    87.6
 5      7    52     6    33    95.9
 6     11    55     9    22   109.2
 7      3    71    17     6   102.7
 8      1    31    22    44    72.5
 9      2    54    18    22    93.1
10     21    47     4    26   115.9
11      1    40    23    34    83.8
12     11    66     9    12   113.3
13     10    68     8    12   109.4

Example 3: Data Generated with Correlation Matrix C = (1-ρ)I + ρJ

OBS     X1      X2      X3         Y
 1     -1.0     1.5     3.0     9.336
 2      1.5    -0.5   -12.0   -28.679
 3      1.0     2.5   -13.0   -45.118
 4      1.0    -2.5    -1.0    16.985
 5     -0.5    -1.5     5.5    33.521
 6      0.5    -1.0     0.0    14.420
 7      0.5     1.0    -5.5   -12.599
 8      0.0     0.0    -1.0     5.722
 9      1.0    -6.5     2.0    41.430
10      1.5    -2.5    -5.0     5.611

The analysis of example 3 is shown in Table V.
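The paper does not show the program that generated example 3; the following PROC IML sketch is one way to do it. The seed, the sample-size handling, and the model for Y are assumptions of this sketch (the paper does not state how Y was constructed).

PROC IML;
   RHO = -0.499;  K = 3;  N = 10;

   /* population correlation matrix C = (1-rho)*I + rho*J; its    */
   /* eigenvalues are 1-rho = 1.499 (twice) and 1+(k-1)*rho =     */
   /* 0.002, so C is barely positive definite -- a near-exact     */
   /* dependence although no pairwise correlation exceeds .499    */
   C = (1-RHO)*I(K) + RHO*J(K,K,1);

   CALL RANDSEED(1983);                  /* seed is arbitrary     */
   Z = J(N,K,.);
   CALL RANDGEN(Z, "NORMAL");            /* iid standard normals  */
   X = Z * ROOT(C);                      /* ROOT(C)`*ROOT(C) = C  */

   /* an arbitrary linear combination plus noise, for             */
   /* illustration only                                           */
   E = J(N,1,.);
   CALL RANDGEN(E, "NORMAL");
   Y = X * {1,-2,3} + E;

   XY = X || Y;
   CREATE EX3 FROM XY[COLNAME={"X1" "X2" "X3" "Y"}];
   APPEND FROM XY;
   CLOSE EX3;
QUIT;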

4. References

Belsley, D. A., Kuh, E., and Welsch, R. E., 1980, Regression Diagnostics, John Wiley and Sons, Inc., New York.

Draper, N. R. and Smith, H., 1981, Applied Regression Analysis, John Wiley and Sons, New York.

Freund, R. J. and Minton, P. D., 1979, Regression Methods, Marcel Dekker, Inc., New York.

Gunst, R. F. and Mason, R. L., 1977, "Advantages of Examining Multicollinearities in Regression Analysis", Biometrics, 33, pp 249-260.

Mansfield, E. R. and Helms, B. P., 1982, "Detecting Multicollinearity", The American Statistician, 36, pp 158-160.

Montgomery, D. C. and Peck, E. A., 1982, Introduction to Linear Regression Analysis, John Wiley and Sons, Inc., New York.

Neter, J. and Wasserman, W., 1974, Applied Linear Statistical Models, Richard D. Irwin, Inc., Homewood, Ill.

Willan, A. R. and Watts, D. G., 1978, "Meaningful Multicollinearity Measures", Technometrics, 20, pp 407-412.


DEP VARIABLE: Y

SOURCE     DF     SUM OF SQUARES     MEAN SQUARE     F VALUE     PROB>F
MODEL       4        2667.899          666.975       111.479     0.0001
ERROR       8          47.863639         5.982955
C TOTAL    12        2715.763

ROOT MSE    2.446008     R-SQUARE    0.9824
DEP MEAN   95.423077     ADJ R-SQ    0.9736
C.V.        2.56333

                PARAMETER     STANDARD      T FOR H0:
VARIABLE  DF    ESTIMATE      ERROR         PARAMETER=0    PROB > |T|
INTERCEP   1    62.405369     70.070959      0.891         0.3991
X1         1     1.551103      0.744770      2.083         0.0708
X2         1     0.510168      0.723788      0.705         0.5009
X3         1     0.101909      0.754709      0.135         0.8959
X4         1    -0.144061      0.709052     -0.203         0.8441

TABLE I
PROC REG Output for the Hald Data

CORRELATION COEFFICIENTS / PROB > |R| UNDER H0:RHO=0 / N = 13

          X1          X2          X3          X4
X1      1.00000     0.22858    -0.82413    -0.24545
        0.0000      0.4526      0.0005      0.4189
X2      0.22858     1.00000    -0.13924    -0.97295
        0.4526      0.0000      0.6501      0.0001
X3     -0.82413    -0.13924     1.00000     0.02954
        0.0005      0.6501      0.0000      0.9237
X4     -0.24545    -0.97295     0.02954     1.00000
        0.4189      0.0001      0.9237      0.0000

CORRELATION OF ESTIMATES

CORRB       INTERCEP     X1          X2          X3          X4
INTERCEP     1.0000     -0.9678     -0.9978     -0.9769     -0.9983
X1          -0.9678      1.0000      0.9510      0.9861      0.9568
X2          -0.9978      0.9510      1.0000      0.9624      0.9979
X3          -0.9769      0.9861      0.9624      1.0000      0.9659
X4          -0.9983      0.9568      0.9979      0.9659      1.0000

TABLE II
Correlations for the Hald Data

(a)  |C| = 0.00106766

(b)  Eigenvalues and eigenvectors of C, from PROC MATRIX (each column
     below is the eigenvector for the eigenvalue heading it):

     Eigenvalue    2.2357       1.57607      0.186606     0.00162375

     X1             .475955      .508979     -.6755        .241052
     X2             .56387      -.413931      .31442       .641756
     X3            -.394067     -.604969     -.637691      .268466
     X4            -.547931      .451235      .195421      .676734

     Collinearity diagnostics from PROC REG (COLLIN option):

                                CONDITION   ----------- VARIANCE PROPORTIONS -----------
     NUMBER    EIGENVALUE       INDEX       INTERCEP    X1        X2        X3        X4
     1         4.120              1.000     0.0000      0.0004    0.0000    0.0002    0.0000
     2         0.553894           2.727     0.0000      0.0100    0.0000    0.0027    0.0001
     3         0.288702           3.778     0.0006      0.0000    0.0003    0.0016    0.0017
     4         0.037638          10.462     0.0574      0.0001    0.0028    0.0457    0.0009
     5         0.000066138      249.578     0.9999      0.9316    0.9969    0.9498    0.9973

(c)  Variance inflation factors, from PROC MATRIX or PROC REG:

     INTERCEP      0.000000
     X1           38.496211
     X2          254.423166
     X3           46.868386
     X4          282.512865

(d)  Tolerances, from PROC MATRIX or PROC REG:

     X1    0.025977
     X2    0.003930
     X3    0.021336
     X4    0.003540

(e)  |C|^(1/2) = .0326751

TABLE III
Multicollinearity Statistics for the Hald Data

PROC REG:

DEP VARIABLE: Y

SOURCE     DF     SUM OF SQUARES     MEAN SQUARE     F VALUE     PROB>F
MODEL       2        402.250           201.125        57.057     0.0004
ERROR       5         17.625000          3.525000
C TOTAL     7        419.875

ROOT MSE    1.877498     R-SQUARE    0.9580
DEP MEAN   50.375000     ADJ R-SQ    0.9412
C.V.        3.727044

                PARAMETER    STANDARD     T FOR H0:                              VARIANCE
VARIABLE  DF    ESTIMATE     ERROR        PARAMETER=0   PROB > |T|   TOLERANCE   INFLATION
INTERCEP   1    0.375000     4.740451      0.079        0.9400       .           0.000000
X1         1    5.375000     0.663796      8.097        0.0005       1.000000    1.000000
X2         1    9.250000     1.327592      6.968        0.0009       1.000000    1.000000

PROC CORR and PROC MATRIX:

|C| = 1.0000

Eigenvalue    Eigenvector (X1, X2)
1              1    0
1              0    1

Collinearity diagnostics from PROC REG:

                             CONDITION     PORTION      PORTION     PORTION
NUMBER    EIGENVALUE         INDEX         INTERCEP     X1          X2
1         2.948                1.000       0.0022       0.0043      0.0043
2         0.038462             8.756       0.0000       0.5000      0.5000
3         0.013044            15.034       0.9978       0.4957      0.4957

|C|^(1/2) = 1.000

TABLE IV
Analysis of Example 1 (Orthogonal Regressor Variables)

PROC REG:

DEP VARIABLE: Y

SOURCE     DF     SUM OF SQUARES     MEAN SQUARE     F VALUE      PROB>F
MODEL       3       6336.529          2112.176       4039.168     0.0001
ERROR       6          3.137541          0.522924
C TOTAL     9       6339.666

ROOT MSE    0.723135     R-SQUARE    0.9995
DEP MEAN    4.063061     ADJ R-SQ    0.9993
C.V.       17.79777

                PARAMETER    STANDARD     T FOR H0:                              VARIANCE
VARIABLE  DF    ESTIMATE     ERROR        PARAMETER=0   PROB > |T|   TOLERANCE   INFLATION
INTERCEP   1    9.525137     0.280618      33.943       0.0001       .            0.000000
X1         1    4.591704     1.192333       3.851       0.0084       0.059089    16.923739
X2         1   -3.042167     0.292608     -10.397       0.0001       0.094653    10.564858
X3         1    3.916058     0.158279      24.742       0.0001       0.061284    16.317445

PROC CORR:

C =
          X1          X2          X3
X1      1.00000    -0.40291    -0.67650
        0.0000      0.2483      0.0317
X2     -0.40291     1.00000    -0.36223
        0.2483      0.0000      0.3037
X3     -0.67650    -0.36223     1.00000
        0.0317      0.3037      0.0000

|C| = .0513354

Eigenvalue     Eigenvector (X1, X2, X3)
1.67829         0.723058    -0.0620897   -0.687991
1.29815        -0.295539     0.872397    -0.389334
0.0235627      -0.624375    -0.484839    -0.612444

Collinearity diagnostics from PROC REG:

                           CONDITION    PORTION     PORTION    PORTION    PORTION
NUMBER    EIGENVALUE       INDEX        INTERCEP    X1         X2         X3
1         2.285              1.000      0.0776      0.0069     0.0034     0.0054
2         1.170              1.397      0.0035      0.0000     0.0468     0.0151
3         0.526483           2.083      0.9166      0.0115     0.0104     0.0055
4         0.018175          11.212      0.0023      0.9815     0.9394     0.9741

|C|^(1/2) = .226573

TABLE V
Analysis of Example 3 (Generated Data)
