
Analysis of Hydrocarbon Data

Anirban Ray & Soumya Sahu


October 27, 2017

Description of the Dataset


When petrol is pumped into tanks, hydrocarbons escape. To evaluate the effectiveness of pollution controls,
experiments were performed. The following dataset is obtained from the experiments.
'data.frame': 32 obs. of 5 variables:
$ Tank.temperature : int 33 31 33 37 36 35 59 60 59 60 ...
$ Petrol.temperature : int 53 36 51 51 54 35 56 60 60 60 ...
$ Initial.tank.pressure: num 3.32 3.1 3.18 3.39 3.2 3.03 4.78 4.72 4.6 4.53 ...
$ Petrol.pressure : num 3.42 3.26 3.18 3.08 3.41 3.03 4.57 4.72 4.41 4.53 ...
$ Hydrocarbons.escaping: int 29 24 26 22 27 21 33 34 32 34 ...
Here, we have data on 32 observations of the response variable Hydrocarbons escaping (grams) and
4 explanatory variables: Tank temperature (degrees Fahrenheit), Petrol temperature (degrees
Fahrenheit), Initial tank pressure (pounds/square inch) and Petrol pressure (pounds/square
inch). Let us denote these by y, x1, x2, x3 and x4, respectively.

Primary Analysis
Let us first fit an Ordinary Least Squares model of the response variable on all of the explanatory variables.
This will give us some insight into the nature of the data, and we will then proceed to check the validity
of the assumptions. We observe that the fit seems to be very good in terms of the Adjusted $R^2$ and the
F-statistic.
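A minimal sketch of this step in R, assuming the five columns of the data frame have already been attached as y, x1, x2, x3 and x4 (the model object name model.1 reappears in the code later on):

model.1 <- lm(y ~ 1 + x1 + x2 + x3 + x4) # ORDINARY LEAST SQUARES FIT ON ALL FOUR REGRESSORS
summary(model.1) # PRODUCES THE COEFFICIENT TABLE BELOW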

Call:
lm(formula = y ~ 1 + x1 + x2 + x3 + x4)

Residuals:
Min 1Q Median 3Q Max
-5.586 -1.221 -0.118 1.320 5.106

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.01502 1.86131 0.545 0.59001
x1 -0.02861 0.09060 -0.316 0.75461
x2 0.21582 0.06772 3.187 0.00362 **
x3 -4.32005 2.85097 -1.515 0.14132
x4 8.97489 2.77263 3.237 0.00319 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.73 on 27 degrees of freedom


Multiple R-squared: 0.9261, Adjusted R-squared: 0.9151
F-statistic: 84.54 on 4 and 27 DF, p-value: 7.249e-15

Residual Analysis
Now we prepare the residual plot; the residuals do not seem to be uniformly scattered around zero.
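A minimal sketch of this plot, with e.1 denoting the residuals of model.1 (the name e.1 also appears in the Shapiro-Wilk output later):

e.1 <- residuals(model.1) # RESIDUALS OF THE FULL OLS FIT
plot(e.1, ylab = "Residuals", main = "Residual Plot") # RESIDUALS AGAINST OBSERVATION INDEX
abline(h = 0, lty = 2) # REFERENCE LINE AT ZERO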

[Figure 5: Residual plot (residuals against observation index)]
So, we also prepare the plot of the residuals against the fitted values.

[Figure 6: Plot of residuals vs. predicted values (fits)]
This plot should have shown no systematic pattern, but it does show one. Hence we suspect that not all of the
model assumptions hold.

Model Assumptions
The assumptions of the OLS model $Y = X\beta + \epsilon$ are the following:

1. Errors are unbiased, i.e. $E(\epsilon_i) = 0 \;\forall i$,
2. Errors have constant variance, i.e. $V(\epsilon_i) = \sigma^2 \;\forall i$,
3. Errors are uncorrelated, i.e. $\mathrm{cov}(\epsilon_i, \epsilon_j) = 0 \;\forall i \neq j$,
4. Errors are normally distributed, i.e. $\epsilon \sim N(0, \sigma^2 I_n)$,
5. Explanatory variables are independent, i.e. $X$ is of full column rank.
Now, we test these assumptions one by one.

Normality of Errors
In this case, we first draw the Quantile-Quantile plot of the residuals. The points closely follow the
y = x line.

[Figure 7: Normal Q-Q plot of the residuals]
Then we perform the Shapiro-Wilk normality test; its null hypothesis, i.e. normality of the
errors, is not rejected, with a considerably high p-value. So we can conclude that the errors can be assumed
to come from a normal distribution.
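A minimal sketch of these two checks:

qqnorm(e.1); qqline(e.1) # NORMAL Q-Q PLOT OF THE RESIDUALS (FIGURE 7)
shapiro.test(e.1) # SHAPIRO-WILK TEST; OUTPUT SHOWN BELOW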

Shapiro-Wilk normality test

data: e.1
W = 0.97847, p-value = 0.7539

Multicollinearity
Next, we plot each of the explanatory variables against one another. Some of these graphs, specifically the
plot of x3 vs. x4, show a clear linear pattern.

[Figures 8-13: Pairwise scatterplots of the explanatory variables: Petrol.temperature vs. Tank.temperature, Initial.tank.pressure vs. Tank.temperature, Petrol.pressure vs. Tank.temperature, Initial.tank.pressure vs. Petrol.temperature, Petrol.pressure vs. Petrol.temperature, and Petrol.pressure vs. Initial.tank.pressure]

We then compute the correlation matrix, to see whether the explanatory variables are correlated or not.
x1 x2 x3 x4
x1 1.0000000 0.7742909 0.9554116 0.9337690
x2 0.7742909 1.0000000 0.7815286 0.8374639
x3 0.9554116 0.7815286 1.0000000 0.9850748
x4 0.9337690 0.8374639 0.9850748 1.0000000
Now, we strongly suspect multicollinearity and hence calculate the VIFs and obtain the following.
x1 x2 x3 x4
12.997379 4.720998 71.301491 61.932647
The high values suggest collinearity too. For the final verification, we compute the Condition Number
of $X^{*\prime} X^{*}$, where $X^{*}$ is the scaled design matrix. Its large value leads us to conclude that the extent of
multicollinearity in the dataset is significant.
[1] 482.6577
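A sketch of how these quantities can be computed; the vif() call assumes the car package, and the condition number is taken here as the ratio of the extreme eigenvalues of the scaled cross-product matrix (both are assumptions, since the report does not state its exact choices):

library(car) # ASSUMED PACKAGE FOR VIF
round(cor(cbind(x1, x2, x3, x4)), 7) # CORRELATION MATRIX OF THE REGRESSORS
vif(model.1) # VARIANCE INFLATION FACTORS
X.star <- scale(cbind(x1, x2, x3, x4)) # SCALED DESIGN MATRIX X*
ev <- eigen(crossprod(X.star))$values # EIGENVALUES OF X*'X*
max(ev) / min(ev) # CONDITION NUMBER (ASSUMED DEFINITION)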

Outliers in x-directions
We know that if there are high leverage points or influential points present in the dataset, they may lead
to pseudo-multicollinearity, for example through masking, swamping, etc. So, to avoid that situation, we first
try to detect these points and check whether their removal leads to a decrease in the extent of
multicollinearity.

Detection of Leverage Points

First, we detect the high-leverage and influential points by the hat diagonals and covariance ratios, and obtain
the following detected points:
[1] 2 3 4 15 17 18 20 23
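A sketch of this screening step; the cut-offs are common rules of thumb and are assumptions, as the report does not state which thresholds were used:

n <- 32; p <- 5 # SAMPLE SIZE AND NUMBER OF COEFFICIENTS (ASSUMED TO BE SET EARLIER)
h <- hatvalues(model.1) # HAT DIAGONALS
cr <- covratio(model.1) # COVARIANCE RATIOS
which(h > 2 * p / n | abs(cr - 1) > 3 * p / n) # FLAGGED LEVERAGE/INFLUENTIAL POINTS (ASSUMED CUT-OFFS)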

Outliers in y-direction
Before proceeding to fitting models, we first detect the outliers by DFBETA, DFFIT and Cook’s Distance
criteria. The detected points are the following:
[1] 4 15 18 21 23 24 25 26
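A sketch of this screen; again the cut-offs are standard rules of thumb, not taken from the report:

db <- dfbetas(model.1); dff <- dffits(model.1); cd <- cooks.distance(model.1) # INFLUENCE MEASURES
which(apply(abs(db), 1, max) > 2 / sqrt(n) | abs(dff) > 2 * sqrt(p / n) | cd > 4 / n) # ASSUMED CUT-OFFS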

Outlier Shift Model

Then, to verify whether they are really outliers, we compare these points against the rest of the points, which are
assumed to be clean data points. Here we test whether the observations under scrutiny come from some
distribution different from that of the normal observations. We consider the mean-shift model $Y = X\beta + Z\gamma + \delta$,
where $Z$ contains one indicator column for each suspected outlier.
If the null hypothesis $H_0: \gamma = 0$ is rejected, we can conclude that at least some of these points are outliers,
and then we test for the significance of the individual $\gamma$ coefficients. We include in the clean dataset those points for
which these coefficients are not significantly different from zero. Then we perform the test again and continue
in the same way until we get a set of points for which all the coefficients are significant. We shall then treat
those points as outliers.
k.1 <- length(potential.outlier.1) # NUMBER OF INITIALLY DETECTED OUTLIERS
y.mod.1 <- c(y[-potential.outlier.1], y[potential.outlier.1]) # INITIALLY MODIFIED RESPONSE
X.mod.1 <- cbind(rbind(X.1[-potential.outlier.1, ], X.1[potential.outlier.1, ]), rbind(matrix(0, n - k.1, k.1), diag(k.1))) # INITIAL MODIFIED DESIGN MATRIX (SHIFT-INDICATOR BLOCK ASSUMED)
outlier.model.1 <- lm(y.mod.1 ~ 1 + X.mod.1) # INITIAL OUTLIER SHIFT MODEL
F.1 <- ((sum((residuals(model.1)) ^ 2) - sum((residuals(outlier.model.1)) ^ 2)) / k.1) / (sum((residuals(outlier.model.1)) ^ 2) / (n - p - k.1)) # F STATISTIC FOR H0: gamma = 0 (DENOMINATOR ASSUMED)
F.1 > qf(0.05, k.1, n - p - k.1, lower.tail = FALSE)

[1] TRUE
summary(outlier.model.1)

Call:
lm(formula = y.mod.1 ~ 1 + X.mod.1)

Residuals:
Min 1Q Median 3Q Max
-4.4613 -0.6052 0.0000 0.4497 3.9471

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.75298 1.41066 -0.534 0.599688
X.mod.1x1 -0.11158 0.08277 -1.348 0.193478
X.mod.1x2 0.31378 0.07556 4.153 0.000541 ***
X.mod.1x3 0.10255 3.84141 0.027 0.978981
X.mod.1x4 4.71023 3.97008 1.186 0.250076
X.mod.1 -3.97634 2.67335 -1.487 0.153318
X.mod.1 2.85640 3.03530 0.941 0.358485
X.mod.1 3.30415 2.94678 1.121 0.276142
X.mod.1 -2.51355 2.23577 -1.124 0.274913
X.mod.1 -6.80742 2.46254 -2.764 0.012344 *
X.mod.1 -3.79404 2.16636 -1.751 0.096014 .
X.mod.1 4.63336 2.05466 2.255 0.036121 *
X.mod.1 5.58316 2.09783 2.661 0.015419 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.912 on 19 degrees of freedom


Multiple R-squared: 0.9745, Adjusted R-squared: 0.9584
F-statistic: 60.46 on 12 and 19 DF, p-value: 1.606e-12
potential.outlier.2 <- potential.outlier.1[c(5, 7, 8)] # OUTLIERS DETECTED AFTER CROSSCHECK
k.2 <- length(potential.outlier.2) # NUMBER OF DETECTED OUTLIERS AFTER CROSSCHECK
y.mod.2 <- c(y[-potential.outlier.2], y[potential.outlier.2]) # MODIFIED RESPONSE AFTER CROSSCHECK
X.mod.2 <- cbind(rbind(X.1[-potential.outlier.2, ], X.1[potential.outlier.2, ]), rbind(matrix(0, n - k.2, k.2), diag(k.2))) # MODIFIED DESIGN MATRIX AFTER CROSSCHECK (SHIFT-INDICATOR BLOCK ASSUMED)
outlier.model.2 <- lm(y.mod.2 ~ 1 + X.mod.2) # OUTLIER SHIFT MODEL AFTER CROSSCHECK
F.2 <- ((sum((residuals(model.1)) ^ 2) - sum((residuals(outlier.model.2)) ^ 2)) / k.2) / (sum((residuals(outlier.model.2)) ^ 2) / (n - p - k.2)) # F STATISTIC AFTER CROSSCHECK (DENOMINATOR ASSUMED)
F.2 > qf(0.05, k.2, n - p - k.2, lower.tail = FALSE)

[1] TRUE
summary(outlier.model.2)

Call:
lm(formula = y.mod.2 ~ 1 + X.mod.2)

Residuals:
Min 1Q Median 3Q Max
-3.5204 -0.8975 0.0000 1.0743 4.3300

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.16239 1.45093 0.112 0.911818
X.mod.2x1 -0.10068 0.07347 -1.370 0.183236
X.mod.2x2 0.21759 0.05235 4.157 0.000354 ***
X.mod.2x3 -0.31137 2.36207 -0.132 0.896226
X.mod.2x4 5.98182 2.26424 2.642 0.014282 *
X.mod.2 -7.12255 2.38953 -2.981 0.006496 **
X.mod.2 5.30706 2.21009 2.401 0.024441 *
X.mod.2 6.33178 2.23658 2.831 0.009238 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.1 on 24 degrees of freedom


Multiple R-squared: 0.9611, Adjusted R-squared: 0.9498
F-statistic: 84.76 on 7 and 24 DF, p-value: 2.289e-15

Checking Influence
Now, we consider the design matrix without the rows corresponding to the influential points and calculate
the condition number based on that. But in this context, we must mention that if multicollinearity is present
in the dataset, this is not the correct approach.
[1] 1281.337
Anyway, even now the condition number is too large. So we can conclude that the problem of multicollinearity
is serious, and hence we will opt for suitable regression methods.

Modifying Data

Before that, we should decide upon our treatment of the outliers. Since our dataset is small and we have
already verified the presence of severe multicollinearity in the data, we cannot afford to completely remove
these observations. Instead, we replace these points by the predictions of an OLS regression fitted to the rest
of the points, and henceforth continue our analysis with these predicted observations.
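A sketch of this replacement, assuming potential.outlier.2 holds the confirmed outliers, X.1 is the matrix of explanatory variables (as in the code above) and y.mod.3 is the modified response used in the stepwise search below:

clean.fit <- lm(y[-potential.outlier.2] ~ X.1[-potential.outlier.2, ]) # OLS ON THE CLEAN POINTS
y.mod.3 <- y # START FROM THE ORIGINAL RESPONSE
y.mod.3[potential.outlier.2] <- cbind(1, X.1[potential.outlier.2, ]) %*% coef(clean.fit) # REPLACE OUTLIERS BY PREDICTIONS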

Model Fitting
To handle multicollinearity, we can proceed either by removing some of the explanatory variables, or by
performing biased regression, where we minimise the Mean Square Error subject to some penalty term.
In this assignment, we first try to select a model by stepwise regression, and then we apply
LASSO regression.

Stepwise Regression
Here, we start with the full model containing all the explanatory variables. At each step, we calculate the
AIC and see which of the following gives us the minimum AIC value:
1. Removing any of the variables currently in the model,
2. Adding back a variable that has been removed,
3. Keeping the model the same.
Among these, if the last gives the minimum AIC, the algorithm stops there and that will be our final model.
Otherwise, we repeat the same process until we reach such a stage; a sketch of the corresponding R call is given below.
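A minimal sketch of this search, assuming model.3 denotes the full OLS fit on the modified response (the trace it produces is reproduced below):

model.3 <- lm(y.mod.3 ~ 1 + x1 + x2 + x3 + x4) # FULL MODEL ON THE MODIFIED RESPONSE
step(model.3, direction = "both") # AIC-BASED STEPWISE SELECTION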

Results
Start: AIC=48.27
y.mod.3 ~ 1 + x1 + x2 + x3 + x4

Df Sum of Sq RSS AIC


- x3 1 0.089 105.90 46.295
<none> 105.81 48.268
- x1 1 9.204 115.01 48.937
- x4 1 34.690 140.50 55.342
- x2 1 76.949 182.76 63.757

Step: AIC=46.3
y.mod.3 ~ x1 + x2 + x4

Df Sum of Sq RSS AIC


<none> 105.90 46.295
+ x3 1 0.089 105.81 48.268
- x1 1 17.254 123.15 49.125
- x2 1 112.325 218.22 67.433
- x4 1 186.416 292.31 76.787

LASSO Regression
In this method, we minimise $\frac{1}{n}\sum_{i=1}^{n}(Y_i - x_i'\beta)^2 + \lambda \sum_{j=1}^{p}|\beta_j|$. This is justified because multicollinearity inflates the
variances of the estimated regression coefficients, and by including a constraint of the form $\sum_{j=1}^{p}|\beta_j| \le c$ we
force the regression coefficients to take small values and can also make some of these coefficients close to
zero. This is supported by the fact that, in the presence of multicollinearity, not all of the explanatory variables are
actually required. Thus, even though this method no longer yields unbiased estimators, we still
avoid loss, as the estimates have comparatively smaller MSEs than the unbiased estimates.
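The sparse coefficient matrix shown below is in the format produced by the glmnet package, so a sketch along the following lines is assumed; choosing λ by cross-validation is also an assumption:

library(glmnet) # ASSUMED PACKAGE FOR THE LASSO FIT
X.mat <- cbind(x1, x2, x3, x4) # REGRESSOR MATRIX
cv.fit <- cv.glmnet(X.mat, y.mod.3, alpha = 1) # ALPHA = 1 GIVES THE LASSO PENALTY
cv.fit$lambda.min # OPTIMAL LAMBDA (ASSUMED TO BE THE CROSS-VALIDATION MINIMISER)
coef(cv.fit, s = "lambda.min") # FITTED LASSO COEFFICIENTS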

Results
Here, we first obtain an optimal λ.
[1] 0.0633387
Then we fit the model, and the regression coefficients are obtained as follows:
5 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) 0.26954539
x1 -0.05775523
x2 0.22050927
x3 .
x4 5.02592959

Checking Goodness of Fitted LASSO Model


Now, we plot the original y observations together with the predictions obtained from this method. Since it seems
that the biased model yields good results, we now proceed to the residual analysis of this model. Here, we
check for normality, autocorrelation and heteroscedasticity of the residuals of the model.
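A minimal sketch of the comparison plot in Figure 14, reusing the assumed cv.fit object from the sketch above:

y.hat <- predict(cv.fit, newx = X.mat, s = "lambda.min") # LASSO PREDICTIONS
plot(y, type = "b", ylab = "y", main = "Plot of Original Observations and LASSO Predictions")
lines(as.vector(y.hat), type = "b", lty = 2, col = "red") # PREDICTED SERIES
legend("topleft", c("Original Observations", "Lasso Predictions"), lty = c(1, 2), col = c("black", "red"))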

[Figure 14: Plot of original observations and LASSO predictions against observation index]

Normality Check
Initially, we check for normality of the errors, and both the Q-Q plot and the Shapiro-Wilk test conclude in the
affirmative.

[Figure 15: Normal Q-Q plot of the LASSO residuals]

Shapiro-Wilk normality test

data: e.2
W = 0.96997, p-value = 0.4985

Checking for Multicollinearity


Now, we check the condition number and note that it is much lower than 100; hence the data can be thought of
as free from the effect of multicollinearity.
[1] 46.34992

Checking for Autocorrelation


Next, we prepare the ACF and PACF plots of the residuals, and note that none of the spikes is significant.
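A minimal sketch of these plots, with e.2 denoting the LASSO residuals as in the Shapiro-Wilk output above:

acf(e.2, main = "Autocorrelation Plot of LASSO residuals") # ACF OF THE LASSO RESIDUALS
pacf(e.2, main = "Partial Autocorrelation Plot of LASSO residuals") # PACF OF THE LASSO RESIDUALS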

[Figure 16: Autocorrelation (ACF) plot of the LASSO residuals]

[Figure 17: Partial autocorrelation (PACF) plot of the LASSO residuals]

Checking for Homoscedasticity


Now, we try to check for equal variances. We select successive subsets (of some fixed size) of the residuals and
compute the variance of each group. If the plot of these variances against the indices reveals some pattern,
we can suspect that heteroscedasticity is present in the dataset. We repeat this procedure for different sizes
of the subsets; a sketch of the computation is given below. From the plots, we can see that there is an increasing
pattern in each case, which indicates that the residuals are not homoscedastic.
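A minimal sketch of the moving-variance computation, assuming the window slides one observation at a time (the exact windowing scheme used in the report is not stated):

moving.var <- function(e, k) sapply(1:(length(e) - k + 1), function(i) var(e[i:(i + k - 1)])) # WINDOWED VARIANCES
for (k in c(10, 15, 20)) { # WINDOW SIZES MATCHING THE FIGURE TITLES
  plot(moving.var(e.2, k), type = "b", xlab = "Index", ylab = "Variance",
       main = paste("Moving variances with Order", k))
}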

[Figure 19: Moving variances with window size 10]

[Figure 20: Moving variances with window size 15]

[Figure 21: Moving variances with window size 20]

Breusch - Pagan Test

To confirm our suspicion, we test whether the variances can be modelled by the explanatory variables. We
use the squared residuals as estimates of the variances, i.e. we consider the model $e = X\mu + \nu$, where $e$ is the
vector of squared residuals. From this model, we observe that the null hypothesis of homoscedasticity is not
rejected at the 5% level of significance, but the p-value is fairly low. So, based on the data, we cannot firmly
conclude that our data is homoscedastic, and we would certainly prefer to test it with a larger number of
observations. Alternatively, as the regression coefficient of x2 is not significant, we drop that variable, and
the test then becomes significant at the same 5% level of significance.
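A minimal sketch of the auxiliary regression whose summary is reproduced below; the bptest() function from the lmtest package offers an equivalent formal test:

aux.model <- lm((e.2^2) ~ 1 + x1 + x2 + x4) # SQUARED LASSO RESIDUALS REGRESSED ON THE REGRESSORS
summary(aux.model) # OUTPUT SHOWN BELOW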

Call:
lm(formula = (e.2^2) ~ 1 + x1 + x2 + x4)

Residuals:
Min 1Q Median 3Q Max
-13.657 -6.869 -1.496 2.590 42.344

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.2856 7.8284 -0.292 0.7725
x1 0.7482 0.3029 2.470 0.0199 *
x2 0.1723 0.2460 0.700 0.4895
x4 -10.0453 4.9177 -2.043 0.0506 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.77 on 28 degrees of freedom
Multiple R-squared: 0.2066, Adjusted R-squared: 0.1216
F-statistic: 2.43 on 3 and 28 DF, p-value: 0.08611

Conclusion
After all these calculations, we can see that the initial tank pressure is not included in our final model,
as the other explanatory variables already explain it due to multicollinearity. If the temperature of the tank
increases, the amount of escaped hydrocarbon decreases, while an increase in the temperature or pressure of
the petrol increases the waste. So one should consider these points while taking steps against pollution,
keeping in mind that these conclusions are based on a biased model that is still affected by heteroscedasticity.

