
Quality & Quantity (2008) 42: 417–426
DOI 10.1007/s11135-006-9055-1
© Springer 2006

Solving Multicollinearity in the Process of Fitting Regression Model Using the Nested Estimate Procedure

FENG-JENQ LIN
Department of Applied Economics, National I-Lan University, 1, Shen-Nung Rd., Sec. 1, I-Lan, Taiwan, R.O.C. E-mail: fjlin@mail.niu.edu.tw

Abstract. In practical cases, we are often faced with the difficult problem of multicollinearity in a fitted regression model. Multicollinearity arises when there are approximate linear relationships between two or more independent variables. It may cause serious problems in the validation, interpretation, and analysis of the model, such as unstable estimates, unreasonable signs, high standard errors, and so on. Although there are some methods to solve or avoid this problem, in this paper we propose another alternative from a practical point of view, called the nested estimate procedure. The first half of the paper explains the concept and process of this procedure, and the second half provides two examples to illustrate the procedure's suitability and reliability.

Key words: multicollinearity, nested estimate procedure, variance inflation factors, tolerance

1. Introduction
In the process of fitting a regression model, when one independent variable is nearly a linear combination of other independent variables, the parameter estimates will be affected. This problem is called multicollinearity. Basically, multicollinearity is not a violation of the assumptions of regression, but it may cause serious difficulties (Neter et al., 1989): (1) variances of parameter estimates may be unreasonably large, (2) parameter estimates may not be significant, (3) a parameter estimate may have a sign different from what is expected, and so on.
To solve or alleviate this problem in a given regression model, the usual best way is to drop redundant variables from the model directly, that is, to avoid the problem by not including redundant variables in the regression model (Bowerman et al., 1993). Sometimes, however, it is hard to decide which variables are redundant. An alternative to deleting variables is to perform a principal component analysis (Maddala, 1977). With principal component regression, we create a set of artificial uncorrelated variables that can then be used in the regression model. Although some principal component variables are dropped from the model, other biases are introduced when the model is transformed back (Draper and Smith, 1981).
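For illustration only, here is a minimal sketch of the principal-component-regression alternative just described. It is our own example using scikit-learn, not code from the paper; the data are synthetic and the choice of three components is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a deliberately collinear pair of predictors.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=60)          # second column is almost the first
y = 3.0 + X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=60)

# Principal component regression: regress y on a few uncorrelated components of X.
pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
pcr.fit(X, y)
print("Training R^2:", pcr.score(X, y))
```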
In this paper, from a practical point of view, we provide another idea for avoiding the multicollinearity problem in the fitting process. It is based on the ordinary least squares (OLS) method and estimates the parameters of the independent variables individually and sequentially. This idea is called the nested estimate procedure, and the constructed model is called the nested regression model. To explain the feasibility of this method, the first half of this paper describes the concept and process of the nested estimate procedure, and the second half provides some examples to demonstrate its suitability and reliability.
2. Procedure Concept and Execution Flow
The nested estimate procedure is easy to execute. Suppose there are $k$ independent variables $x_1, x_2, \ldots, x_k$ and one dependent variable $y$. If we want to use them to construct a multiple regression model, the real full model will be

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon = \beta_0 + \sum_{m=1}^{k} \beta_m x_m + \varepsilon,$

and the estimated full model can be written as

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_k x_k = \hat{\beta}_0 + \sum_{m=1}^{k} \hat{\beta}_m x_m,$

where $\hat{\beta}_i$ is the estimator of $\beta_i$.
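As a small numerical illustration of the estimated full model (our own sketch, not the paper's code), the OLS estimates $\hat{\beta}_0, \ldots, \hat{\beta}_k$ can be obtained with a single least-squares solve; the data below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
X = rng.normal(size=(n, k))                               # columns x_1, ..., x_k
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=n)

# Estimated full model: y_hat = b0 + b1*x1 + ... + bk*xk, fitted by OLS.
X_design = np.column_stack([np.ones(n), X])                # prepend an intercept column
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
y_hat = X_design @ beta_hat
print("OLS estimates (beta_0, ..., beta_k):", beta_hat)
```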


As we know, although each variable might make a useful separate contribution to the model variation, high correlation among the variables only causes confusion when they are used together. Therefore, how to choose a set of independent variables from $x_1, x_2, \ldots, x_k$ so as to avoid or alleviate this problem in the fitting process becomes very important.
Actually, the nested estimate procedure is based completely on the OLS method; the only difference is that the parameters of the model are not estimated all at once but separately, step by step. That is, we divide the estimation work into several steps in the fitting process, and in each step we estimate the parameter of only one variable when several candidate variables are available. With this method, we can ensure that different independent variables make different contributions to the model variation. At the same time, we can reduce the number of independent variables as far as possible under the given significance level, and we can guarantee that there are no significant relationships between the error term and the independent variables.
The detailed idea and execution of the nested estimate procedure are described as follows:
(1) First, we identify the logical sign of each pair $(y, x_i)$. At the same time, we find the independent variable $x_i$ that has the maximum correlation with the dependent variable $y$. Let $x_{(1)}$ denote this variable.
(2) Then, we fit the simple regression model of $y$ on $x_{(1)}$ by OLS. The real model is
$y = \beta_{10} + \beta_{11} x_{(1)} + \varepsilon_1,$
and the estimated model of this step is
$\hat{y} = \hat{\beta}_{10} + \hat{\beta}_{11} x_{(1)}.$
If we now use $\hat{\beta}_{10}$ and $\hat{\beta}_{11}$ to estimate $\beta_{10}$ and $\beta_{11}$, the real estimated model becomes
$y = \hat{\beta}_{10} + \hat{\beta}_{11} x_{(1)} + \varepsilon_1.$
From the real values $y_j$ and the fitted values $\hat{y}_j$ we obtain the residuals $\varepsilon_{1j}$; that is, we calculate the variable $\varepsilon_1 = y - \hat{y}$.

(3) Next, we find the independent variable $x_i$, other than $x_{(1)}$, that has the maximum correlation with the residual variable $\varepsilon_1$, where this correlation has the same logical sign as its correlation with $y$. Let $x_{(2)}$ denote this variable.
(4) After that, we fit a simple regression model of $\varepsilon_1$ on $x_{(2)}$ by OLS. The estimated model is
$\hat{\varepsilon}_1 = \hat{\beta}_{20} + \hat{\beta}_{21} x_{(2)}.$
We then test the significance of the parameter $\hat{\beta}_{21}$ under the null hypothesis that the parameter is zero. If we fail to reject the null hypothesis, we stop the fitting process; the final estimated model is the simple regression model of step (2), and no collinearity problem exists. Otherwise, the real estimated model becomes
$y = \hat{\beta}_{10} + \hat{\beta}_{11} x_{(1)} + \varepsilon_1 = (\hat{\beta}_{10} + \hat{\beta}_{20}) + \hat{\beta}_{11} x_{(1)} + \hat{\beta}_{21} x_{(2)} + \varepsilon_2.$
Again, from the real values $y_j$ and the fitted values $\hat{y}_j$ we obtain the residuals $\varepsilon_{2j}$; that is, we calculate the variable $\varepsilon_2 = y - \hat{y}$.

(5) We continue in the same way as steps (3) and (4). In the $r$-th iteration ($r \le k$), we find the independent variable $x_i$, other than the variables already retained in the model, that has the maximum correlation with the new residual variable $\varepsilon_{r-1}$, and let $x_{(r)}$ denote this variable. We then fit the simple regression model of $\varepsilon_{r-1}$ on $x_{(r)}$ by OLS, so the estimated model of the $r$-th step is
$\hat{\varepsilon}_{r-1} = \hat{\beta}_{r0} + \hat{\beta}_{r1} x_{(r)}.$
After that, we test the significance of the parameter $\hat{\beta}_{r1}$ under the null hypothesis that the parameter is zero. If we fail to reject the null hypothesis, we stop the fitting process, and the final model is the real estimated model of the last iteration $(r-1)$, that is,
$y = \sum_{m=1}^{r-1} \hat{\beta}_{m0} + \sum_{m=1}^{r-1} \hat{\beta}_{m1} x_{(m)} + \varepsilon_{r-1}.$

If the null hypothesis is rejected, we go on to step (3), and the real model becomes
$y = \sum_{m=1}^{r} \hat{\beta}_{m0} + \sum_{m=1}^{r} \hat{\beta}_{m1} x_{(m)} + \varepsilon_{r}.$
Again, from the real values $y_j$ and the fitted values $\hat{y}_j$ we obtain the residuals $\varepsilon_{rj}$; that is, we calculate the variable $\varepsilon_r = y - \hat{y}$.

From the above description, suppose that all $k$ variables are significant at each step; then the real full model will be
$y = \hat{\beta}_{10} + \hat{\beta}_{11} x_{(1)} + \varepsilon_1$
$\;\;= \hat{\beta}_{10} + \hat{\beta}_{11} x_{(1)} + (\hat{\beta}_{20} + \hat{\beta}_{21} x_{(2)} + \varepsilon_2)$
$\;\;= \hat{\beta}_{10} + \hat{\beta}_{11} x_{(1)} + (\hat{\beta}_{20} + \hat{\beta}_{21} x_{(2)} + (\hat{\beta}_{30} + \hat{\beta}_{31} x_{(3)} + \varepsilon_3))$
$\;\;\;\;\vdots$
$\;\;= \hat{\beta}_{10} + \hat{\beta}_{11} x_{(1)} + (\hat{\beta}_{20} + \hat{\beta}_{21} x_{(2)} + (\hat{\beta}_{30} + \hat{\beta}_{31} x_{(3)} + (\cdots(\hat{\beta}_{k0} + \hat{\beta}_{k1} x_{(k)} + \varepsilon_k))))$
$\;\;= \sum_{m=1}^{k} \hat{\beta}_{m0} + \sum_{m=1}^{k} \hat{\beta}_{m1} x_{(m)} + \varepsilon_k.$
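To make the procedure concrete, the following is a minimal sketch that implements steps (1) through (5). It is our own illustration, not the author's code: the function name `nested_estimate`, the use of `scipy.stats.linregress` for the slope t-test, and the sign-matching rule written as a filter are choices made here for concreteness.

```python
import numpy as np
import pandas as pd
from scipy import stats

def nested_estimate(y, X: pd.DataFrame, alpha: float = 0.05):
    """Nested estimate procedure: at each step, regress the current residual
    (or y itself in the first step) on the single remaining predictor most
    correlated with it, keeping only candidates whose correlation sign agrees
    with the sign of corr(y, x). Stop when the new slope is not significant
    at level alpha. Returns the accumulated intercept and one slope per
    retained variable."""
    y = np.asarray(y, dtype=float)
    remaining = list(X.columns)
    sign_with_y = {c: np.sign(np.corrcoef(y, X[c])[0, 1]) for c in remaining}

    intercept, slopes = 0.0, {}
    resid = y.copy()

    while remaining:
        corrs = {c: np.corrcoef(resid, X[c])[0, 1] for c in remaining}
        candidates = [c for c in remaining
                      if not slopes or np.sign(corrs[c]) == sign_with_y[c]]
        if not candidates:
            break
        best = max(candidates, key=lambda c: abs(corrs[c]))

        res = stats.linregress(X[best], resid)             # simple OLS of resid on x_(r)
        if slopes and res.pvalue > alpha:                   # slope not significant: stop
            break
        intercept += res.intercept
        slopes[best] = res.slope
        remaining.remove(best)
        resid = resid - (res.intercept + res.slope * X[best].to_numpy())

    return intercept, slopes
```

Applied to data shaped like the examples in Section 3, a call such as `nested_estimate(df['SO2'], df[predictors])` would return the summed intercept and one slope per retained variable, mirroring the way Tables III and VI report the intercept as a sum of two step intercepts.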

For a clearer understanding, the flowchart of the nested estimate procedure is shown in Figure 1.

[Figure 1 presents the procedure as a flowchart: collect the variable data; identify the logical sign of each pair $(y, x_i)$; set $k = 1$, find the $x_i$ with the maximum correlation with $y$ and let $x_{(k)} = x_i$; construct the base model; calculate the residual $\varepsilon_k = y - \hat{y}$; find the $x_i$ with the maximum correlation with $\varepsilon_k$ and let $x_{(k+1)} = x_i$; construct the simple model for $\varepsilon_k$ and $x_{(k+1)}$; check whether the parameter is significant; if it is, set $k = k + 1$, arrange the nested model, and return to the residual step; if it is not, the last arranged model is the final model.]

Figure 1. The flowchart of constructing the nested regression model.

3. Two Examples
To validate the feasibility of the nested estimate procedure, two examples with different degrees of correlation among their independent variables are used for illustration and support. To examine whether multicollinearity is present in a model, we use two criteria:
(1) Variance inflation factors
It measures how much the variance of a parameter estimate is inflated above what would be expected if there were no correlation among the independent variables. The VIF for an independent variable $x_i$ is calculated as

422

FENG-JENQ LIN

$\mathrm{VIF}_i = \frac{1}{1 - R_i^2},$
where $R_i^2$ is the coefficient of determination obtained when $x_i$ is regressed on all the other independent variables in the model.
From the above formula, if $R_i^2$ equals 0, then $\mathrm{VIF}_i$ equals 1; as $R_i^2$ approaches 1, $\mathrm{VIF}_i$ approaches infinity. Marquardt (1980) suggests that a VIF greater than 10 indicates the presence of strong multicollinearity, although this cutoff is arbitrary.
(2) Tolerance (Belsley et al., 1980)
It is the reciprocal of the VIF and ranges between 0 and 1. The tolerance for an independent variable $x_i$ is defined by
$\mathrm{TOL}_i = \frac{1}{\mathrm{VIF}_i} = 1 - R_i^2.$
A tolerance close to 0 therefore indicates possible problems with multicollinearity. A rule of thumb is that a tolerance value less than 0.1 may indicate the presence of multicollinearity.
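A quick way to compute both diagnostics is sketched below. This is our illustration rather than code from the paper; the function name and the use of a plain least-squares solve are our choices.

```python
import numpy as np
import pandas as pd

def vif_and_tolerance(X: pd.DataFrame) -> pd.DataFrame:
    """For every column x_i of X, compute VIF_i = 1 / (1 - R_i^2) and
    TOL_i = 1 - R_i^2, where R_i^2 comes from regressing x_i on all the
    other columns (with an intercept)."""
    rows = {}
    for col in X.columns:
        target = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        design = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - resid.var() / target.var()
        rows[col] = {"VIF": 1.0 / (1.0 - r2), "TOL": 1.0 - r2}
    return pd.DataFrame(rows).T
```

With the rules of thumb above, any row with VIF greater than 10 (equivalently, TOL below 0.1) would be flagged, as POP and FACTORY are in Table I.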
Example 1: Air Pollution in US Cities (Sokal and Rohlf, 1981). A climatologist is interested in predicting air quality in 41 US cities. The mean concentration of sulfur dioxide in the air and information pertaining to seven explanatory variables were gathered over a 3-year period, as follows:

Dependent variable:
  SO2       Average SO2 content of the air in micrograms per cubic meter

Independent variables:
  FACTORY   Number of manufacturing enterprises employing 20 or more workers
  POP       1979 population in thousands
  TEMP      Average annual temperature in degrees Fahrenheit
  WIND      Average annual wind speed
  PRECIP    Average annual precipitation in inches
  DAYRAIN   Average number of days with precipitation per year
  DUST      Average concentration of dust particles in ppm

Analysis: First, if we fit a multiple regression model to the pollution data using all the independent variables, then the relevant statistics are shown in Table I.
Second, if we fit a multiple regression model to the pollution data using the stepwise selection method to retain significant variables in the model, then the relevant statistics for the significant variables are shown in Table II.


Table I. The relevant statistics obtained by using all independent variables for the pollution data

Variable     Parameter estimate   t-statistic   VIF        TOL
INTERCEPT    112.15865            2.338         0.00000
POP          0.03932              2.564         14.34186   0.06973
FACTORY      0.06436              4.008         14.88308   0.06719
TEMP         1.28230              2.032         3.78325    0.26432
WIND         3.22214              1.747         1.26159    0.79265
PRECIP       0.49681              1.340         3.46483    0.28861
DAYRAIN      0.04807              0.292         3.46361    0.28872
DUST         0.23317              0.319         1.27935    0.78165

The symbol * marks a variable that is significant at the 0.05 level.

Table II. The relevant statistics obtained by using the stepwise selection method for the pollution data

Variable     Parameter estimate   t-statistic   VIF        TOL
INTERCEPT    26.32508             6.855         0.00000
FACTORY      0.08243              5.609         11.43374   0.08746
POP          0.05661              3.959         11.43374   0.08746

The symbol * marks a variable that is significant at the 0.05 level.

From the full model, summarizing the nature of the multicollinearity among all the independent variables, we might conclude that POP and FACTORY are probably not needed in the same model. Unfortunately, with the stepwise selection method we find that FACTORY and POP are still both in the model, and their VIF and TOL values indicate that multicollinearity is present.
Now, if we instead fit another regression model to the pollution data using the nested estimate procedure to retain significant variables in the model, then the relevant statistics for the significant variables are given in Table III.
Obviously, when we use the nested estimate procedure, the variables in the model do not exhibit the problem of multicollinearity. Even though TEMP enters the model in place of POP, this does not cause multicollinearity either: because TEMP is used to fit another simple regression to the residual, not to SO2, in the fitting process, FACTORY and TEMP make separate contributions to the fitted model variation.
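Assuming the 41-city data set were available as a CSV file with the column names listed above (the file name `us_air_pollution.csv` is hypothetical), the `vif_and_tolerance` and `nested_estimate` sketches given earlier could be combined as follows to reproduce this kind of result.

```python
import pandas as pd

# Hypothetical file; the data come from Sokal and Rohlf (1981).
# Assumes vif_and_tolerance() and nested_estimate() from the sketches above are in scope.
df = pd.read_csv("us_air_pollution.csv")
predictors = ["FACTORY", "POP", "TEMP", "WIND", "PRECIP", "DAYRAIN", "DUST"]

# Collinearity diagnostics for the full model, then the nested fit.
print(vif_and_tolerance(df[predictors]))        # POP and FACTORY show VIF > 10 (Table I)
intercept, slopes = nested_estimate(df["SO2"], df[predictors])
print(intercept, slopes)                        # expected to retain FACTORY and TEMP (Table III)
```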


Table III. The relevant statistics obtained by using the nested estimate procedure for the pollution data

Variable     Parameter estimate     t-statistic   VIF       TOL
INTERCEPT    17.61057 + 56.33228                  0.00000
FACTORY      0.02686                5.268         1.03747   0.96388
TEMP         1.01020                2.782         1.03747   0.96388

The symbol * marks a variable that is significant at the 0.05 level.

Example 2: Labor Needs in US Navy Hospitals (Bowerman et al., 1993). We present a case concerning the labor hours needed by 17 US Navy hospitals in 1979, together with information pertaining to five explanatory variables, as follows:


Dependent variable:
  Y    Monthly labor hours

Independent variables:
  X1   Average daily patient load
  X2   Monthly X-ray exposures
  X3   Monthly occupied bed days
  X4   Eligible population in the area (divided by 1000)
  X5   Average length of patients' stays in days

Analysis: First, if we fit a multiple regression model to the labor needs data using all the independent variables, then the relevant statistics are shown in Table IV.
Second, if we fit a multiple regression model to the labor needs data using the stepwise selection method to retain significant variables in the model, then the relevant statistics for the significant variables are shown in Table V.
Table IV. The relevant statistics obtained by using all independent variables for the labor needs data

Variable     Parameter estimate   t-statistic   VIF       TOL
INTERCEPT    1962.948156          1.832         0.00
X1           15.851675            0.162         9597.57   0.00010
X2           0.055930             2.631         7.94      0.12593
X3           1.589624             0.514         8933.09   0.00011
X4           4.218668             0.588         23.29     0.04293
X5           394.314117           1.881         4.28      0.23365

The symbol * marks a variable that is significant at the 0.05 level.


Table V. The relevant statistics obtained by using the stepwise selection method for the labor needs data

Variable     Parameter estimate   t-statistic   VIF        TOL
INTERCEPT    68.3139590           0.299         0.000000
X2           0.0748659            3.913         5.647157   0.17708
X3           0.8228746            9.919         5.647157   0.17708

The symbol * marks a variable that is significant at the 0.05 level.

Table VI. The relevant statistics obtained by using the nested estimate procedure for the labor needs data

Variable     Parameter estimate      t-statistic   VIF       TOL
INTERCEPT    492.183 + (3235.971)                  0.00000
X2           0.2470                  11.209*       1.24921   0.80050
X5           549.0719                2.113*        1.24921   0.80050

The symbol * marks a variable that is significant at the 0.05 level.

From the full model, summarizing the nature of the multicollinearity among all the independent variables, we might conclude that X1 and X3 are probably not needed in the same model. Similarly, X4 might not be needed in a model utilizing X1 or in a model utilizing X3. From the stepwise selection method we find that, unlike Example 1, it indeed resolves the multicollinearity of the full model, and the VIFs are 5.647.
Now, if we instead fit another regression model to the labor needs data using the nested estimate procedure to retain significant variables in the model, then the relevant statistics for the significant variables are given in Table VI.
From the above information, as we expected, the nested estimate procedure also solves the problem of multicollinearity in this example (the VIFs are 1.249). Variables X2 and X5 make separate contributions to the fitted model variation.
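As a quick arithmetic check of our own (not from the paper): with only two retained predictors, the $R_i^2$ in the VIF formula is simply the squared correlation between X2 and X5, so the reported VIF pins that correlation down:

$\mathrm{VIF} = \frac{1}{1 - r_{X_2 X_5}^2} = 1.24921 \;\Longrightarrow\; r_{X_2 X_5}^2 = 1 - \frac{1}{1.24921} \approx 0.1995, \qquad |r_{X_2 X_5}| \approx 0.45,$

which is consistent with the tolerance of 0.80050 reported in Table VI.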
4. Conclusions
As we know, multicollinearity may cause some serious problems. In this paper we proposed the nested estimate procedure to try to solve or alleviate multicollinearity in a regression model. The concept of this procedure is very simple, and it is easy to execute. It is based on the OLS method, estimating the parameters of the independent variables individually and sequentially, one in each iteration. The examples in this paper illustrate that the nested estimate procedure can indeed avoid multicollinearity, and the method appears to be suitable and reliable.
Acknowledgment
The author acknowledges the support of the National Science Council under Contract NSC89-2118-M-197-001.
References
Belsley, D. A., Kuh, E. & Welsch, R. E. (1980). Regression Diagnostics. New York: John Wiley.
Bowerman, B. L. & O'Connell, R. T. (1993). Forecasting and Time Series: An Applied Approach. Belmont, CA: Wadsworth.
Draper, N. & Smith, H. (1981). Applied Regression Analysis. New York: Wiley.
Maddala, G. S. (1977). Econometrics. New York: McGraw-Hill Book Company.
Marquardt, D. W. (1980). You should standardize the predictor variables in your regression models. Journal of the American Statistical Association 75: 74–103.
Neter, J., Wasserman, W. & Kutner, M. H. (1989). Applied Linear Regression Models. Homewood, IL: Richard D. Irwin.
Sokal, R. R. & Rohlf, F. J. (1981). Biometry. San Francisco, CA: W. H. Freeman and Company.

Author's Biography
Feng-Jenq Lin is a Doctor of Management Sciences. He is interested in forecast modelling. His papers have appeared in the Asia-Pacific Journal of Operational Research, the Yugoslav Journal of Operations Research, the Journal of Information & Optimization Sciences, and various other journals.

