Primary Analysis:
Let us first fit an Ordinary Least Squares (OLS) model of the response variable on all of the explanatory
variables. This will give us some insight into the nature of the data, and we will then proceed to check the
validity of the assumptions. We observe that the fit appears to be very good in terms of the Adjusted R²
and the F-statistic.
Call:
lm(formula = y ~ 1 + x1 + x2 + x3 + x4)
Residuals:
Min 1Q Median 3Q Max
-5.586 -1.221 -0.118 1.320 5.106
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.01502 1.86131 0.545 0.59001
x1 -0.02861 0.09060 -0.316 0.75461
x2 0.21582 0.06772 3.187 0.00362 **
x3 -4.32005 2.85097 -1.515 0.14132
x4 8.97489 2.77263 3.237 0.00319 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual Analysis
Now we prepare the residual plot, which does not seem to be uniformly scattered around zero.
Figure 5: Residual Plot (residuals vs. observation index).
So, we also draw the plot of the residuals against the fitted values.
Figure 6: Plot of Residuals vs. Predicted Values.
This should have been a patternless, random scatter, which it is not. Hence we suspect that not all of the
assumptions hold.
Model Assumptions
The assumptions of the OLS model Y = Xβ + ε are the following: the errors have mean zero and a common
variance, they are uncorrelated and normally distributed, and the explanatory variables are not collinear.
We check each of these in turn.
Normality of Errors
In this case, we first draw the Quantile-Quantile plot of the residuals. The diagram closely resembles the
y = x line.
Figure 7: Normal Q-Q plot of the residuals.
Then we perform the Shapiro-Wilk normality test; the null hypothesis of the test, i.e. normality of the
errors, is not rejected, with a considerably high p-value. So we can conclude that the errors can be assumed
to come from a normal distribution.
data: e.1
W = 0.97847, p-value = 0.7539
Multicollinearity
Next, we plot each of the explanatory variables against one another. Some of these graphs, most notably the
graph of x3 vs. x4, show a clear linear pattern.
Figure 8: Petrol.temperature vs. Tank.temperature.
Figure 9: Initial.tank.pressure vs. Tank.temperature.
Figure 10: Petrol.pressure vs. Tank.temperature.
Figure 11: Initial.tank.pressure vs. Petrol.temperature.
Figure 12: Petrol.pressure vs. Petrol.temperature.
Figure 13: Petrol.pressure vs. Initial.tank.pressure.
We then compute the correlation matrix to see whether the explanatory variables are correlated.
x1 x2 x3 x4
x1 1.0000000 0.7742909 0.9554116 0.9337690
x2 0.7742909 1.0000000 0.7815286 0.8374639
x3 0.9554116 0.7815286 1.0000000 0.9850748
x4 0.9337690 0.8374639 0.9850748 1.0000000
Now, we strongly suspect multicollinearity and hence calculate the VIFs and obtain the following.
x1 x2 x3 x4
12.997379 4.720998 71.301491 61.932647
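The VIFs above can be reproduced without extra packages by regressing each explanatory variable on the others; a minimal sketch (the function name vif is ours, not from the report):

```r
# VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing
# the j-th explanatory variable on the remaining ones
vif <- function(X) {
  X <- as.matrix(X)
  sapply(seq_len(ncol(X)), function(j) {
    r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
    1 / (1 - r2)
  })
}
```

The vif() function in the car package computes the same quantity directly from a fitted model.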
The high values suggest collinearity too. For the final verification, we compute the Condition Number
of X*′X*, where X* is the scaled design matrix. Its large value leads us to conclude that the extent of
multicollinearity in the dataset is significant.
[1] 482.6577
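A sketch of this computation, using the convention that the condition number of X*′X* is the ratio of its largest to smallest eigenvalue (the function name is ours):

```r
# Condition number of t(X.star) %*% X.star, where X.star is the design
# matrix with centred, unit-length columns
cond.number <- function(X) {
  X <- scale(as.matrix(X), center = TRUE, scale = FALSE)
  X <- apply(X, 2, function(col) col / sqrt(sum(col^2)))
  ev <- eigen(crossprod(X), symmetric = TRUE)$values
  max(ev) / min(ev)
}
```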
Outliers in x-direction
We know that if there are high leverage points or influential points present in the dataset, those may lead
to pseudo-multicollinearity, for example by masking, swamping, etc. So, to avoid that situation, we first
try to detect these points and check whether removing them leads to a decrease in the extent of
multicollinearity.
Detection of Leverage Points
First, we detect the influential points by the hat diagonals and covariance ratios, and obtain the following
detected points:
[1] 2 3 4 15 17 18 20 23
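A sketch of this detection step with the usual rules of thumb (cut-offs 2p/n for the hat diagonals and |COVRATIO − 1| > 3p/n are common conventions; the data below are simulated stand-ins, not the report's dataset):

```r
set.seed(42)
n <- 30
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n); x4 <- rnorm(n)
y <- 1 + x2 + x4 + rnorm(n)
x1[2] <- 10                               # plant a high-leverage point

fit <- lm(y ~ x1 + x2 + x3 + x4)
p <- length(coef(fit))
h <- hatvalues(fit)                       # hat diagonals
cr <- covratio(fit)                       # covariance ratios
flagged <- which(h > 2 * p / n | abs(cr - 1) > 3 * p / n)
```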
Outliers in y-direction
Before proceeding to fitting models, we first detect the outliers by DFBETA, DFFIT and Cook’s Distance
criteria. The detected points are the following:
[1] 4 15 18 21 23 24 25 26
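This screening can be sketched with the common cut-offs 2√(p/n) for DFFITS, 4/n for Cook's distance and 2/√n for DFBETAS (the function name flag.outliers is ours):

```r
# Flag observations exceeding any of the usual rules of thumb
flag.outliers <- function(fit) {
  n <- nobs(fit); p <- length(coef(fit))
  big.dffits  <- abs(dffits(fit)) > 2 * sqrt(p / n)
  big.cook    <- cooks.distance(fit) > 4 / n
  big.dfbetas <- apply(abs(dfbetas(fit)) > 2 / sqrt(n), 1, any)
  which(big.dffits | big.cook | big.dfbetas)
}
```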
Then, to verify whether they are really outliers, we compare these points against the rest of the points,
which are assumed to be clean data points. Here we test whether the observations under testing come from a
distribution different from that of the normal observations. We consider the mean-shift model
Y = Xβ + Zγ + δ, where Z contains one indicator column for each suspected point.
If the null hypothesis H0 : γ = 0 gets rejected, we can conclude that at least some of these points are outliers,
and then we test for the significance of the individual γ coefficients. We include in the clean dataset those
points for which these coefficients are not significantly different from zero. Then we perform the test again
and continue in the same way until we get a set of points for which all the coefficients are significant. We
then treat those points as outliers.
k.1 <- length(potential.outlier.1) # NUMBER OF INITIALLY DETECTED OUTLIERS
y.mod.1 <- c(y[-potential.outlier.1], y[potential.outlier.1]) # INITIALLY MODIFIED RESPONSE
Z.1 <- rbind(matrix(0, n - k.1, k.1), diag(k.1)) # ONE INDICATOR COLUMN PER SUSPECTED POINT
X.mod.1 <- cbind(rbind(X.1[-potential.outlier.1, ], X.1[potential.outlier.1, ]), Z.1)
outlier.model.1 <- lm(y.mod.1 ~ 1 + X.mod.1) # INITIAL OUTLIER SHIFT MODEL
F.1 <- ((sum((residuals(model.1)) ^ 2) - sum((residuals(outlier.model.1)) ^ 2)) / k.1) /
  (sum((residuals(outlier.model.1)) ^ 2) / (n - p - k.1)) # F STATISTIC FOR H0: gamma = 0
F.1 > qf(0.05, k.1, n - p - k.1, lower.tail = FALSE)
[1] TRUE
summary(outlier.model.1)
Call:
lm(formula = y.mod.1 ~ 1 + X.mod.1)
Residuals:
Min 1Q Median 3Q Max
-4.4613 -0.6052 0.0000 0.4497 3.9471
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.75298 1.41066 -0.534 0.599688
X.mod.1x1 -0.11158 0.08277 -1.348 0.193478
X.mod.1x2 0.31378 0.07556 4.153 0.000541 ***
X.mod.1x3 0.10255 3.84141 0.027 0.978981
X.mod.1x4 4.71023 3.97008 1.186 0.250076
X.mod.1 -3.97634 2.67335 -1.487 0.153318
X.mod.1 2.85640 3.03530 0.941 0.358485
X.mod.1 3.30415 2.94678 1.121 0.276142
X.mod.1 -2.51355 2.23577 -1.124 0.274913
X.mod.1 -6.80742 2.46254 -2.764 0.012344 *
X.mod.1 -3.79404 2.16636 -1.751 0.096014 .
X.mod.1 4.63336 2.05466 2.255 0.036121 *
X.mod.1 5.58316 2.09783 2.661 0.015419 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[1] TRUE
summary(outlier.model.2)
Call:
lm(formula = y.mod.2 ~ 1 + X.mod.2)
Residuals:
Min 1Q Median 3Q Max
-3.5204 -0.8975 0.0000 1.0743 4.3300
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.16239 1.45093 0.112 0.911818
X.mod.2x1 -0.10068 0.07347 -1.370 0.183236
X.mod.2x2 0.21759 0.05235 4.157 0.000354 ***
X.mod.2x3 -0.31137 2.36207 -0.132 0.896226
X.mod.2x4 5.98182 2.26424 2.642 0.014282 *
X.mod.2 -7.12255 2.38953 -2.981 0.006496 **
X.mod.2 5.30706 2.21009 2.401 0.024441 *
X.mod.2 6.33178 2.23658 2.831 0.009238 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Checking Influence
Now, we consider the design matrix without the rows corresponding to the influential points and calculate
the condition number based on that. In this context, we must mention that if multicollinearity is present
in the dataset, this is not the correct approach.
[1] 1281.337
Even so, the condition number is still very large. So we can conclude that the problem of multicollinearity
is serious, and hence we will opt for suitable regression methods.
Modifying Data
Before that, we should decide upon our treatment with the outliers. Since our dataset is small and we have
already verified the presence of severe multicollinearity in the data, we cannot afford to completely remove
these observations. Instead, we predict these points by a OLS regression fitted to the rest of the points and
henceforth continue our analysis with these predicted observations.
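This replacement step can be sketched as follows (the simulated data and the outlier indices are illustrative stand-ins, not the report's values):

```r
set.seed(7)
n <- 30
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 2 + dat$x2 + dat$x4 + rnorm(n, sd = 0.1)
outliers <- c(4, 15, 18)
dat$y[outliers] <- dat$y[outliers] + 8            # contaminated responses

clean <- setdiff(seq_len(n), outliers)
fit.clean <- lm(y ~ x1 + x2 + x3 + x4, data = dat, subset = clean)
y.mod <- dat$y
y.mod[outliers] <- predict(fit.clean, newdata = dat[outliers, ])  # replace by OLS predictions
```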
Model Fitting
To handle multicollinearity, we can proceed either by removing some of the explanatory variables, or by
performing biased regression, where we minimise the Mean Square Error subject to some Penalty Term.
In this assignment, first we will try to select a model by the stepwise regression, and then we shall apply
Lasso regression.
Stepwise Regression
Here, we start with the null model, i.e. only with an intercept term. Then we shall add variables one by one
and calculate AIC. At each step, we see what gives us minimum AIC value:
1. Adding any further variable,
2. Removing the variable which is added,
3. Keep the model same.
Among these, if the last gives us minimum AIC, the algorithm stops there and that will be our final model.
Otherwise, we repeat the same process until we reach such a stage.
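In R, this search is exactly what step() performs; a minimal sketch on simulated data (variable names mirror the report's, but the data are stand-ins):

```r
set.seed(3)
n <- 50
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 1 + 2 * dat$x2 + 3 * dat$x4 + rnorm(n)

null.fit <- lm(y ~ 1, data = dat)                    # start from the intercept-only model
step.fit <- step(null.fit,
                 scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3 + x4),
                 direction = "both", trace = 0)      # add/drop variables by AIC
```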
Results
Start: AIC=48.27
y.mod.3 ~ 1 + x1 + x2 + x3 + x4
Step: AIC=46.3
y.mod.3 ~ x1 + x2 + x4
       Df Sum of Sq    RSS    AIC
+ x3 1 0.089 105.81 48.268
- x1 1 17.254 123.15 49.125
- x2 1 112.325 218.22 67.433
- x4 1 186.416 292.31 76.787
LASSO Regression
In this method, we minimise (1/n) Σ_{i=1}^{n} (Y_i − x_i′β)² + λ Σ_{j=1}^{p} |β_j|. This is justified
because multicollinearity inflates the variances of the estimated regression coefficients, and by including a
constraint of the form Σ_{j=1}^{p} |β_j| ≤ c, we force the regression coefficients to take small values and can
also drive some of these coefficients to zero. This is supported by the fact that, in the presence of
multicollinearity, not all of the explanatory variables are actually required. Thus, even though this method
no longer yields unbiased estimators, we still avoid loss, as the estimates have comparatively smaller MSEs
than the unbiased estimates.
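In practice one would use glmnet (with cv.glmnet to choose λ); to keep the sketch dependency-free, here is a minimal cyclic coordinate-descent LASSO for centred data, using glmnet's 1/(2n) scaling of the squared-error term (illustrative only, not the report's code):

```r
soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)   # soft-thresholding operator

# Minimise (1/(2n)) * ||y - X b||^2 + lambda * sum(|b|) by cyclic
# coordinate descent; X and y are assumed centred, so no intercept is needed
lasso.cd <- function(X, y, lambda, iters = 200) {
  b <- rep(0, ncol(X))
  n <- nrow(X)
  for (it in seq_len(iters)) {
    for (j in seq_len(ncol(X))) {
      r.j <- y - X %*% b + X[, j] * b[j]               # partial residual for coordinate j
      z <- sum(X[, j] * r.j) / n
      b[j] <- soft(z, lambda) / (sum(X[, j]^2) / n)
    }
  }
  b
}
```

With glmnet the corresponding calls would be cv.glmnet(X, y, alpha = 1) to pick λ and coef() on the fitted glmnet object, which returns the sparse coefficient matrix shown below.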
Results
Here, we first obtain an optimal λ.
[1] 0.0633387
Then we fit the model, and the regression coefficients are obtained as follows:
5 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) 0.26954539
x1 -0.05775523
x2 0.22050927
x3 .
x4 5.02592959
Figure 14: Plot of Original Observations and LASSO Predictions (y vs. index).
Normality Check
Initially, we check for normality of the errors, and both the QQ plot and the Shapiro-Wilk test conclude in
the affirmative.
Figure 15: Normal Q-Q plot of the LASSO residuals.
data: e.2
W = 0.96997, p-value = 0.4985
Figure 16: Autocorrelation plot of the LASSO residuals.
Figure 17: Partial autocorrelation plot of the LASSO residuals.
Figure 19: Moving variances with order 10.
Figure 20: Moving variances with order 15.
Figure 21: Moving variances with order 20.
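Such moving variances can be computed with a short helper; a minimal sketch, assuming the window slides one observation at a time (our assumption about how the plots were produced):

```r
# Sample variance over a sliding window of the given order
moving.var <- function(e, order) {
  sapply(seq_len(length(e) - order + 1),
         function(i) var(e[i:(i + order - 1)]))
}
```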
To confirm our suspicion, we test whether the variances can be modelled by the explanatory variables. We
use the squares of the residuals as estimates of the variances, i.e. we consider the model e² = Xμ + ν.
From this model, we observe that the null hypothesis is not rejected at the 5% level of significance, but the
p-value is fairly low. So, based on the data, we cannot firmly conclude that our data is homoscedastic, and
we would certainly prefer to test it with a larger number of observations. Alternatively, as the regression
coefficient of x2 is not significant, we drop that variable, and then the test becomes significant at the same
5% level of significance.
Call:
lm(formula = (e.2^2) ~ 1 + x1 + x2 + x4)
Residuals:
Min 1Q Median 3Q Max
-13.657 -6.869 -1.496 2.590 42.344
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.2856 7.8284 -0.292 0.7725
x1 0.7482 0.3029 2.470 0.0199 *
x2 0.1723 0.2460 0.700 0.4895
x4 -10.0453 4.9177 -2.043 0.0506 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.77 on 28 degrees of freedom
Multiple R-squared: 0.2066, Adjusted R-squared: 0.1216
F-statistic: 2.43 on 3 and 28 DF, p-value: 0.08611
Conclusion
After all these calculations, we see that the initial pressure of the tank is not included in our final model,
as, due to multicollinearity, the other explanatory variables explain it. If the temperature of the tank
increases, the amount of escaped hydrocarbon decreases, while an increase in the temperature or the pressure
of the petrol increases the waste. One should keep these points in mind while taking steps against pollution,
together with the fact that these conclusions are based on a biased model affected by the problem of
heteroscedasticity.