
The Nature of Multicollinearity

Multicollinearity occurs when two or more predictors/regressors/independent variables in the model are correlated and provide redundant information about the response/dependent variable [Hawking, 1983].

Assumption 10 of the classical linear regression model (CLRM) is that there is no multicollinearity among the regressors included in the regression model.

Ragnar Frisch coined the term multicollinearity; originally it meant the existence of a “perfect,” or exact, linear relationship among some or all explanatory variables in a regression model. If we have k independent variables X1, X2, . . . , Xk, then an exact linear relationship is said to exist if the following condition is satisfied:

λ1X1 + λ2X2 + · · · + λkXk = 0

where λ1, λ2, . . . , λk are constants such that not all of them are zero simultaneously. To accommodate imperfect (less than exact) correlation, a stochastic error term vi is introduced:

λ1X1 + λ2X2 + · · · + λkXk + vi = 0
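
As a minimal numerical illustration (a sketch, not part of the original text; Python with numpy assumed), an exact linear relationship among the columns makes X'X singular, while a near-exact one leaves it invertible but badly ill-conditioned:

import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3_exact = 2 * x1 + 3 * x2                            # exact linear combination: perfect collinearity
x3_near = x3_exact + rng.normal(scale=0.01, size=n)   # near (imperfect) collinearity

X_perfect = np.column_stack([np.ones(n), x1, x2, x3_exact])
X_near = np.column_stack([np.ones(n), x1, x2, x3_near])

# With perfect collinearity the design matrix is rank-deficient, so OLS has no unique solution.
print(np.linalg.matrix_rank(X_perfect))        # 3 instead of 4
# With near collinearity X'X is invertible but ill-conditioned (huge condition number).
print(np.linalg.cond(X_near.T @ X_near))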

Consequences of high multicollinearity:

If the goal is to understand how the various X variables impact Y, then multicollinearity is a big problem. Multicollinearity is a matter of degree, not a matter of presence or absence. In the presence of multicollinearity the ordinary least squares estimates are imprecise.

1. Increased standard error of estimates of the β’s (decreased reliability).

2. Often confusing and misleading results: t-tests for the individual coefficients β1, β2 might suggest that neither predictor is significantly associated with y, while the F-test indicates that the model as a whole is useful for predicting y.

If multicollinearity is perfect, the regression coefficients of the X variables are indeterminate and their standard errors are infinite. If multicollinearity is less than perfect, the regression coefficients, although determinate, possess large standard errors (in relation to the coefficients themselves), which means the coefficients cannot be estimated with great precision or accuracy.

Why is Multicollinearity a Problem?

If the goal is simply to predict Y from a set of X variables, then multicollinearity is not a
problem. The predictions will still be accurate, and the overall R2 (or adjusted R2 ) quantifies
how well the model predicts the Y values.

If the goal is to understand how the various X variables impact Y, then multicollinearity is
a big problem. One problem is that the individual P values can be misleading (a P value can
be high, even though the variable is important). The second problem is that the confidence
intervals on the regression coefficients will be very wide. The confidence intervals may even
include zero, which means one can’t even be confident whether an increase in the X value is
associated with an increase, or a decrease, in Y. Because the confidence intervals are so wide,
excluding a variable (or adding a new one) can change the coefficients dramatically and may
even change their signs.
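
A small hypothetical simulation sketch (Python with numpy and statsmodels assumed, not from the original text) makes this concrete: two nearly identical predictors both truly affect y, yet the estimated coefficient of the first flips sign in a large share of repeated samples.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
signs = []
for _ in range(200):
    n = 40
    x1 = rng.normal(size=n)
    x2 = 0.98 * x1 + 0.02 * rng.normal(size=n)    # x2 is almost a copy of x1
    y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)
    X = sm.add_constant(np.column_stack([x1, x2]))
    res = sm.OLS(y, X).fit()
    signs.append(np.sign(res.params[1]))          # sign of the x1 coefficient

# Although the true coefficient of x1 is +1, its estimated sign flips in many repeated samples.
print("share of negative x1 coefficients:", np.mean(np.array(signs) < 0))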

Sources of multicollinearity

Montgomery and Peck stated that multicollinearity may be due to the following factors:

1. The sampling method: sampling over a limited range of the values taken by the regressors in the population.

2. Model or population constraints: for example, when regressing electricity consumption on income and house size, there is a physical constraint in the population because families with higher incomes generally own larger houses.

3. Model specification: for example, adding polynomial terms to a regression model, especially when the range of the X variable is small.

4. An over-determined model. This happens when the model has more explanatory variables
than the number of observations. This could happen in medical research where there may be a
small number of patients about whom information is collected on a large number of variables.

5. Common trends, especially in time series data: the regressors included in the model may share a common trend, that is, they all increase or decrease over time. Thus, in the regression of consumption expenditure on income, wealth, and population, the regressors income, wealth, and population may all be growing over time at more or less the same rate, leading to collinearity among these variables.

Practical Consequences of Multicollinearity

In cases of near or high multicollinearity, one is likely to encounter the following consequences:

1. Although BLUE, the OLS estimators have large variances and covariances, making precise estimation difficult (see the variance expression after this list).

2. Confidence intervals tend to be much wider, leading to the acceptance of the “zero null
hypothesis” (i.e., the true population coefficient is zero) more readily.

3. The t ratio of one or more coefficients tends to be statistically insignificant.

4. Although the t ratio of one or more coefficients is statistically insignificant, R2, the overall
measure of goodness of fit, can be very high.

5. The OLS estimators and their standard errors can be sensitive to small changes in the data.
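
For the slope coefficient of the j-th regressor, the standard textbook expression for its sampling variance (with R_j^2 the R2 obtained by regressing Xj on the other regressors) makes consequence 1 explicit:

\operatorname{Var}(\hat{\beta}_j) \;=\; \frac{\sigma^2}{\sum_{i}(x_{ij}-\bar{x}_j)^2}\cdot\frac{1}{1-R_j^2} \;=\; \frac{\sigma^2}{\sum_{i}(x_{ij}-\bar{x}_j)^2}\cdot \mathrm{VIF}_j

As R_j^2 approaches 1 (near-perfect collinearity), the variance of the estimator blows up, which is exactly the pattern of wide confidence intervals and insignificant t ratios listed above.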

Detecting multicollinearity

By examining the correlation matrix, the variance inflation factors (VIF), and the eigenvalues of the correlation matrix, one can detect the presence of multicollinearity.
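
As a generic sketch of these three diagnostics (Python with numpy, pandas and statsmodels assumed; the data below is synthetic, purely for illustration):

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools import add_constant

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)    # deliberately collinear with x1
x3 = rng.normal(size=n)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# 1. Correlation matrix of the regressors.
print(X.corr())

# 2. Variance inflation factors (a constant is added so the VIFs refer to the slopes).
Xc = add_constant(X)
vifs = {col: variance_inflation_factor(Xc.values, i)
        for i, col in enumerate(Xc.columns) if col != "const"}
print(vifs)

# 3. Eigenvalues of the correlation matrix; eigenvalues near zero signal near collinearity.
print(np.linalg.eigvalsh(X.corr().values))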

Multicollinearity - Solutions

• If interest is only in estimation and prediction, multicollinearity can be ignored since it does not affect ŷ or its standard errors (either the standard error of ŷ or of the prediction error y − ŷ).

• The above is true only if the x values at which we want estimation or prediction are within the range of the data.

• If the wish is to establish association patterns between y and the predictors, then the analyst can:

– Eliminate some predictors from the model.

– Design an experiment in which the pattern of correlation is broken.

Dataset

There are 27 instances in the data file. The goal is to predict the consumption of cars from various characteristics (price, engine size, horsepower and weight).

Multiple linear regression

In the first instance I performed a multiple regression analysis using all the explanatory variables. I set Consumption as the dependent variable; Price, Cylinder, Horsepower and Weight as independent variables.

The coefficient of determination R2 is 0.9295. But when I consider the coefficients of the model, some results seem strange. Only the weight is significant for the explanation of the consumption (p-value = 0.009 < 0.05). Its sign is positive, implying that when the weight of the car increases, the consumption also increases, which is what our domain knowledge suggests. But neither the horsepower nor the engine size (cylinder) seems to influence the consumption: the p-values are greater than 0.05 in both cases, and the t statistics are below the 1.96 critical value. This is unusual, since our domain knowledge about car consumption indicates a direct relationship of horsepower and engine size with consumption. It leads me to conclude, in this case, that two cars with the same weight have similar consumption even if the engine size of the second is 4 times bigger than that of the first; this last result does not correspond at all with what we know about cars.
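
The following is a minimal sketch of this step, assuming Python with pandas and statsmodels. The 27-row cars DataFrame below is synthetic stand-in data (the original data file is not reproduced here), so the printed numbers will not match the R2 of 0.9295 reported above; only the procedure is illustrated.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 27
weight = rng.normal(1200, 250, n)                       # assumed unit: kg
cylinder = 1.4 * weight + rng.normal(0, 120, n)         # engine size, correlated with weight
horsepower = 0.05 * cylinder + rng.normal(0, 6, n)      # correlated with engine size
price = 0.02 * weight + rng.normal(0, 3, n)             # assumed unit: thousands
consumption = 0.004 * weight + 0.02 * horsepower + rng.normal(0, 0.7, n)

cars = pd.DataFrame({"consumption": consumption, "price": price,
                     "cylinder": cylinder, "horsepower": horsepower, "weight": weight})

# Regression of consumption on all four explanatory variables.
res = smf.ols("consumption ~ price + cylinder + horsepower + weight", data=cars).fit()
print(res.summary())   # R-squared, coefficients, t statistics and p-values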

Analysis of Variance Section

An analysis of variance (ANOVA) table summarizes the information related to the sources of
variation in data.

Source This represents the partition of the variation in the dependent variable Y (Consumption). There are four sources of variation listed: intercept, model, error, and total (adjusted for the mean).

DF The degrees of freedom are the number of dimensions associated with this term. Each observation can be interpreted as a dimension in n-dimensional space. The degrees of freedom are 1 for the intercept, p for the model, n-p-1 for the error, and n-1 for the adjusted total.

Sum of Squares These are the sums of squares associated with the corresponding sources of
variation.

Mean Square The mean square is the sum of squares divided by the degrees of freedom. This
mean square is an estimated variance. For example, the mean square error is the estimated
variance of the residuals (errors).

F-Ratio This is the F statistic for testing the null hypothesis that all βj = 0. This F statistic has p degrees of freedom for the numerator variance and n-p-1 degrees of freedom for the denominator variance.

Prob Level This is the p-value for the above F test. The p-value is the probability that the test statistic will take on a value at least as extreme as the observed value, assuming that the null hypothesis is true. If the p-value is less than 0.05, the null hypothesis is rejected; if it is greater than 0.05, the null hypothesis is not rejected.
If the model fits the data well, the overall R2 value will be high, and the corresponding p-value will be low (the p-value is the smallest significance level at which the null hypothesis can be rejected).

Root Mean Square Error This is the square root of the mean square error. It is an estimate of
σ, the standard deviation of the ei’s.

Mean of Dependent Variable This is the arithmetic mean of the dependent variable.

R-Squared This is the coefficient of determination.

Coefficient of Variation

The coefficient of variation is a relative measure of dispersion, computed by dividing the root mean square error by the mean of the dependent variable. By itself, it has little value, but it can be useful in comparative studies.
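
Continuing the sketch above (reusing the fitted model res and the synthetic cars data), these ANOVA-table quantities map onto statsmodels results attributes roughly as follows; the attribute names are statsmodels', not those of the software used in the original analysis.

import numpy as np

ss_model = res.ess            # model (regression) sum of squares
ss_error = res.ssr            # error (residual) sum of squares
ss_total = res.centered_tss   # total sum of squares, adjusted for the mean

df_model = res.df_model       # p
df_error = res.df_resid       # n - p - 1

ms_model = ss_model / df_model
ms_error = ss_error / df_error            # estimated variance of the residuals
f_ratio = ms_model / ms_error             # equals res.fvalue
p_value = res.f_pvalue                    # Prob Level

rmse = np.sqrt(ms_error)                  # root mean square error, estimate of sigma
cv = rmse / cars["consumption"].mean()    # coefficient of variation
print(f_ratio, p_value, res.rsquared, rmse, cv)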

Actual This is the actual value of Y for the ith row.

Predicted The predicted value of Y for the ith row. It is predicted using the levels of the X’s
for this row.

Residual This is the estimated value of ei. This is equal to the Actual minus the Predicted.

R-Squared

R-squared is the coefficient of determination. It represents the percent of variation in the dependent variable explained by the independent variables in the model.

Regression Coefficient These are the estimated values of the regression coefficients b0,
b1, ..., bp. The value indicates how much change in Y occurs for a one-unit change in x when
the remaining X’s are held constant. These coefficients are also called partial-regression
coefficients since the effect of the other X’s is removed.

Standard Error These are the estimated standard errors (precision) of the regression coefficients. The standard error of the regression coefficient, sbj, is the standard deviation of the estimate. In regular regression, we divide the coefficient by its standard error to obtain the t statistic used to test whether the coefficient differs from zero.

Detecting the multicollinearity

I suspect a multicollinearity phenomenon here. We know, for instance, that the engine size and the horsepower are often highly correlated. It influences the results in different ways. The model is very unstable; a small change in the dataset (by removing or adding instances) causes a large modification of the estimated parameters. The sign and the values of the coefficients are inconsistent with the domain knowledge. For instance, it seems here that the horsepower has a negative influence on the consumption. We know that this cannot be true. In short, we have an excellent model (according to the R2) but an unusable one, because we cannot draw a meaningful interpretation of the coefficients. It is impossible to understand the causal mechanism of the phenomenon studied.

We use very simple calculations to detect the multicollinearity problem.

Variance Inflation The variance inflation factor (VIF) is a measure of multicollinearity, a term coined by Marquardt (1970). It is the reciprocal of 1 − Rx2, where Rx2 is the R2 obtained when an independent variable is regressed on the remaining independent variables. A VIF of 10 or more for large data sets indicates a multicollinearity problem, since the Rx2 with the remaining X's is then at least 90 percent. For small data sets, even VIFs of 5 or more can signify multicollinearity. A high Rx2 indicates a lot of overlap in explaining the variation among the remaining independent variables.
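
A hedged sketch of this calculation, reusing the synthetic cars DataFrame from the regression sketch above (not the original data file): each predictor is regressed on the others, and VIF = 1 / (1 − Rx2).

import statsmodels.formula.api as smf

predictors = ["price", "cylinder", "horsepower", "weight"]
for var in predictors:
    others = " + ".join(p for p in predictors if p != var)
    # Auxiliary regression of one predictor on the remaining predictors.
    r2_x = smf.ols(f"{var} ~ {others}", data=cars).fit().rsquared
    tolerance = 1 - r2_x
    vif = 1 / tolerance
    print(f"{var}: Rx2={r2_x:.4f}, tolerance={tolerance:.4f}, VIF={vif:.2f}")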

Tolerance This is 1 − Rx2, the reciprocal of the variance inflation factor (and the denominator in its formula). When multicollinearity is not a problem, the VIFs will all be less than 10, which corresponds to tolerance values greater than 0.10.

Correlation Matrix Section

Pearson correlations are given for all pairs of variables; outliers, non-normality, non-constant variance, and non-linearities can all impact these correlations. These correlation coefficients show which independent variables are highly correlated with the dependent variable and with each other. Independent variables that are highly correlated with one another, as shown in the figure below, cause multicollinearity problems.

Sign consistency. The first strategy is very basic. We check if the sign of the coefficient is
consistent with the sign of the correlation of each explanatory variable with the target
variable (computed individually). If some of them are inconsistent, it means that other
variables interfere in the association between the explanatory variable and the dependent
variable.

To compute the correlation between each independent variable and the dependent variable,
we ………………………………….

Each explanatory variable is highly correlated with the dependent variable (0.8883). We also note that there is a problem with HORSEPOWER (horsepower): the correlation is positive, but the sign of its coefficient in the regression is negative. Another variable probably interferes with HORSEPOWER.
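
The check itself is a one-liner per variable. Below is a sketch reusing `cars` and the fitted model `res` from the regression sketch above; since that data is synthetic, the printed signs will not necessarily reproduce the HORSEPOWER discrepancy described here.

import numpy as np

for var in ["price", "cylinder", "horsepower", "weight"]:
    coef_sign = np.sign(res.params[var])                       # sign in the multiple regression
    corr_sign = np.sign(cars[var].corr(cars["consumption"]))   # sign of the marginal correlation
    flag = "OK" if coef_sign == corr_sign else "inconsistent"
    print(var, coef_sign, corr_sign, flag)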

Klein's rule. We compute the square of the correlation for each pair of explanatory variables. If one or more of these values is higher than (or at least near) the coefficient of determination (R2) of the regression, there is probably a multicollinearity problem. The advantage here is that we can identify the variables which are redundant in the regression.
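
A minimal sketch of Klein's rule, again reusing `cars` and `res` from the regression sketch above:

predictors = ["price", "cylinder", "horsepower", "weight"]
r2_model = res.rsquared
corr_sq = cars[predictors].corr() ** 2       # squared pairwise correlations
for i, a in enumerate(predictors):
    for b in predictors[i + 1:]:
        r2_pair = corr_sq.loc[a, b]
        flag = "suspect" if r2_pair >= r2_model else ""
        print(a, b, round(r2_pair, 4), flag)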

We note, among other things, that the square of the correlation between HORSEPOWER (horsepower) and CYLINDER (engine size) is very close to the coefficient of determination of the regression (0.9137 vs. 0.9295).
All these symptoms suggest that there is a collinearity problem in our study. We must adopt an appropriate strategy if we want to get usable results.

Variable selection

It helps to identify the relevant variables and gives an interpretable result. In the context of a multicollinearity problem, it can in particular remove redundant variables which interfere in the regression. I use a forward search: at each step, I select the most relevant explanatory variable according to the absolute value of the correlation coefficient.
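
Below is a hedged sketch of a forward selection loop in this spirit, reusing the synthetic `cars` DataFrame from the regression sketch above. The exact criterion (largest absolute correlation with the current residuals, entered only if significant at the 5% level) is an assumption on my part, since the text does not spell out the software's internal procedure.

import statsmodels.formula.api as smf

remaining = ["price", "cylinder", "horsepower", "weight"]
selected = []
residuals = cars["consumption"] - cars["consumption"].mean()   # start from the null model

while remaining:
    # Pick the candidate most correlated (in absolute value) with what is still unexplained.
    best = max(remaining, key=lambda v: abs(cars[v].corr(residuals)))
    formula = "consumption ~ " + " + ".join(selected + [best])
    fit = smf.ols(formula, data=cars).fit()
    if fit.pvalues[best] >= 0.05:        # stop when the best candidate is not significant
        break
    selected.append(best)
    remaining.remove(best)
    residuals = fit.resid

print("selected variables:", selected)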

I observe that:

• The selected explanatory variables are weight and cylinder (engine size). They are very significant.

• Compared to the initial regression, despite the elimination of two variables, the proportion of explained variance remains very good, with a coefficient of determination of R2 = 0.9277 (R2 = 0.9295 for the model with 4 variables).

• Weight and Cylinder both have a positive influence on the consumption, i.e. when the weight (or the engine size) increases, the consumption also increases. This is consistent with the domain knowledge.

• The signs of the coefficients agree with the signs of the correlation coefficients computed individually.

• Because its p-value is lower than the significance level, a variable is added into the regression.
