Group 11 (Section B)
There have been many studies documenting that the average global temperature has been increasing over the last century. The
consequences of a continued rise in global temperature will be dire. Rising sea levels and an increased frequency of extreme
weather events will affect billions of people. In this problem, we will attempt to study the relationship between average global
temperature and several other factors. The file climate_change.csv contains climate data from May 1983 to December 2008.
The available variables include:
Year & Month
Temp
CO2, N2O, CH4, CFC.11, CFC.12
Aerosols
TSI
MEI
Format (top 5 rows, for example):
Year  Month  MEI    CO2     CH4      N2O      CFC-11   CFC-12   TSI       Aerosols  Temp
1983  5      2.556  345.96  1638.59  303.677  191.324  350.113  1366.102  0.0863    0.109
1983  6      2.167  345.52  1633.71  303.746  192.057  351.848  1366.121  0.0794    0.118
1983  7      1.741  344.15  1633.22  303.795  192.818  353.725  1366.285  0.0731    0.137
1983  8      1.13   342.25  1631.35  303.839  193.602  355.633  1366.42   0.0673    0.176
1983  9      0.428  340.17  1648.4   303.901  194.392  357.465  1366.234  0.0619    0.149
Solution
Based on the analysis of the data set answer the following questions:
[Figure: chart with a time axis from 1983 to 2008 and a vertical axis from -0.4 to 0.6]
SST: calculate the average of Temp (Y*), calculate the error Y - Y* for each observation,
square each error to get (Y - Y*)^2, and sum the squared errors.
SST = sum of (Y - Y*)^2 = 9.846641
R2 = 1 - (SSE/SST) = 0.78
(Please refer to the attached Excel sheet, tab Q3, for the calculation: Assignment_01_SEC_B_GROUP No._11.xlsx)
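The SST and R2 steps described above can be sketched in a few lines of Python. This is only an illustration, not the group's spreadsheet: the function names are hypothetical, the Temp values are the five sample rows from the format table, and the predictions are invented purely to make the example run.

```python
def sst(actual):
    """Total sum of squares: squared deviations of Y from its mean Y*."""
    mean_y = sum(actual) / len(actual)
    return sum((y - mean_y) ** 2 for y in actual)


def sse(actual, predicted):
    """Sum of squared errors between observed and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(actual, predicted))


def r_squared(actual, predicted):
    """R2 = 1 - SSE / SST."""
    return 1 - sse(actual, predicted) / sst(actual)


# Temp values from the five sample rows in the format table.
temp = [0.109, 0.118, 0.137, 0.176, 0.149]
# Hypothetical predictions, invented purely for illustration.
predicted = [0.110, 0.120, 0.140, 0.160, 0.150]

print("SST:", sst(temp))
print("R2 :", r_squared(temp, predicted))
```

Applied to the full Temp column and a model's fitted values, the same functions would produce the SST and R2 figures reported in the spreadsheet.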
4. Run the raw regression and note whether the regression model is better than the baseline or not.
Ans: We ran the raw regression in Excel; please refer to the attached sheet, tab Q4, for the calculation.
R2 = 0.737144
The raw regression model gives a lower R2 value than the baseline, which means the baseline model is better.
Using R Studio:
Output
5. Identify the significant variables in the raw regression (one variable in each line).
Ans: We consider a variable significant only if its p-value is below 0.05 (95% confidence level).
6. Which of the following assumptions are not fulfilled in the raw model, and how did you conclude it?
A component residual plot adds a line indicating where the line of best fit lies. A significant
difference between the residual line and the component line indicates that the predictor
does not have a linear relationship with the dependent variable.
If the blue dashed line coincides with the pink line, the predictor variable is linearly
related to the dependent variable (the component and residual lines should coincide for linearity).
The MEI, CO2, CH4, N2O, CFC.11 and CFC.12 predictors are linearly related to the dependent variable.
TSI and Aerosols show a slight deviation from the residual line.
Output:
Output:
For a given predictor (p), multicollinearity can be assessed by computing a score called the variance
inflation factor (or VIF), which measures how much the variance of a regression coefficient is inflated
due to multicollinearity in the model.
Output:
A VIF value greater than 4 indicates multicollinearity between variables. (We can see that CFC.12 is
highly correlated with CO2, CH4, N2O and CFC.11.)
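As a rough illustration of how a VIF can be obtained, the sketch below uses the standard identity that the VIF of each predictor equals the matching diagonal entry of the inverse of the predictors' correlation matrix. It is not the R output of this assignment: the function names are hypothetical, and only three predictors (CO2, CH4, N2O) are included, with correlations taken from the Excel correlation output in this document. VIFs computed on this subset will therefore differ from those of the full model.

```python
def invert(matrix):
    """Invert a square matrix with Gauss-Jordan elimination (partial pivoting)."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I].
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (perfect collinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif_from_correlation(names, corr):
    """VIF of predictor j = j-th diagonal entry of the inverse correlation matrix."""
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}


# Correlations among three predictors, copied from the Excel output.
names = ["CO2", "CH4", "N2O"]
corr = [
    [1.0000, 0.8723, 0.9811],
    [0.8723, 1.0000, 0.8944],
    [0.9811, 0.8944, 1.0000],
]

for name, value in vif_from_correlation(names, corr).items():
    flag = "multicollinear" if value > 4 else "ok"  # threshold of 4, as in the text
    print(f"{name}: VIF = {value:.2f} ({flag})")
```

Working from the correlation matrix avoids refitting one regression per predictor, at the cost of a matrix inversion that fails when two predictors are perfectly collinear.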
Excel Output:
               MEI      CO2      CH4      N2O   CFC-11   CFC-12      TSI Aerosols
MEI         1.0000
CO2        -0.1529   1.0000
CH4        -0.1056   0.8723   1.0000
N2O        -0.1624   0.9811   0.8944   1.0000
CFC-11      0.0882   0.4013   0.7135   0.4122   1.0000
CFC-12     -0.0398   0.8232   0.9582   0.8393   0.8314   1.0000
TSI        -0.0768   0.0179   0.1463   0.0399   0.2846   0.1893   1.0000
Aerosols    0.3524  -0.3693  -0.2904  -0.3535  -0.0323  -0.2438   0.0832   1.0000
Conclusion: N2O is highly correlated with CO2 (0.98), CH4 (0.89) and CFC-12 (0.84); its correlation with CFC-11 is only moderate (0.41).
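One way to read the correlation matrix systematically is to list every pair of predictors whose absolute correlation exceeds a cut-off. The sketch below does this for the Excel output above; the function name and the 0.8 cut-off are assumptions chosen for illustration, not values stated in the assignment.

```python
names = ["MEI", "CO2", "CH4", "N2O", "CFC-11", "CFC-12", "TSI", "Aerosols"]

# Lower triangle of the correlation matrix, copied from the Excel output.
lower = [
    [1.0000],
    [-0.1529, 1.0000],
    [-0.1056, 0.8723, 1.0000],
    [-0.1624, 0.9811, 0.8944, 1.0000],
    [0.0882, 0.4013, 0.7135, 0.4122, 1.0000],
    [-0.0398, 0.8232, 0.9582, 0.8393, 0.8314, 1.0000],
    [-0.0768, 0.0179, 0.1463, 0.0399, 0.2846, 0.1893, 1.0000],
    [0.3524, -0.3693, -0.2904, -0.3535, -0.0323, -0.2438, 0.0832, 1.0000],
]


def high_pairs(names, lower, threshold=0.8):
    """Return (var_a, var_b, r) for every pair with |r| >= threshold, strongest first."""
    pairs = []
    for i, row in enumerate(lower):
        for j, r in enumerate(row[:i]):  # row[:i] skips the diagonal entry
            if abs(r) >= threshold:
                pairs.append((names[j], names[i], r))
    return sorted(pairs, key=lambda p: -abs(p[2]))


for a, b, r in high_pairs(names, lower):
    print(f"{a:7s} {b:7s} {r:.4f}")
```

Lowering or raising the threshold changes which pairs are flagged, so the cut-off should be stated alongside any conclusion drawn from the list.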
9. Examine the residual plot and give your observations for it.
Residual Plot Analysis
1. Residual versus Fitted Values Plot
The plot of residuals versus predicted values is useful for checking the assumptions of linearity and
homoscedasticity. Here the residuals-versus-fitted-values plot shows randomly distributed data points, and
the red line is flat and horizontal along the y = 0 line.
Hence the model is linear and homoscedastic. R has flagged the data points that have high residuals (i.e.,
observations 190, 184 and 183).
3. The third plot is a scale-location plot (square root of the standardized residuals vs. predicted values).
This is useful for checking the assumption of homoscedasticity. In this plot we check whether there is a
pattern in the residuals.
Here the red line is horizontal and the data points are scattered randomly around it; hence the
homoscedasticity assumption is satisfied.
R again flagged 3 data points that have large residuals (observations 190, 184 and 183).
10. Modify the model and obtain your best model. What is its R2 and Adj R2?
Modification: Removing CO2, CH4, N2O and CFC.11, as they are highly correlated with the retained CFC.12
Output:
R2 = 0.7261
Adjusted R2 = 0.722
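The relationship between the two figures above can be checked with the usual adjusted R2 formula, Adj R2 = 1 - (1 - R2)(n - 1)/(n - p - 1). The sketch below is a consistency check under stated assumptions, not the R output itself: it assumes the model was fitted on all monthly rows from May 1983 to December 2008 with no gaps (n = 308) and that four predictors remain after the removal (MEI, CFC.12, TSI, Aerosols). The function name is hypothetical.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Assumption: one row per month from May 1983 to December 2008, no gaps.
n_obs = 8 + 25 * 12  # 8 months in 1983 plus 25 full years

# Assumption: MEI, CFC.12, TSI and Aerosols are the remaining predictors.
n_predictors = 4

print(round(adjusted_r2(0.7261, n_obs, n_predictors), 3))
```

Under these assumptions the formula gives roughly 0.722, which agrees with the reported adjusted R2.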
11. Now set the seed to the average of the numerals of the roll numbers of the group members. Using the data
mining approach, obtain your best model and test it on the testing data. Compare your models in terms of R2,
Adj R2 and RMSE. Share the results for the same.
Output:
R2 = 0.71
Adj R2 = 0.71
The RMSE values for the training set (0.09) and the test set (0.1) are nearly equal, which indicates a good model.
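The seeded split and the RMSE comparison can be sketched as follows. This is a minimal illustration rather than the group's R code: the function names are hypothetical, the 70/30 split fraction is an assumption (the assignment does not state one), and the seed of 53 is the value mentioned in the comparison under Question 12.

```python
import math
import random


def train_test_split(rows, train_fraction=0.7, seed=53):
    """Shuffle rows reproducibly and split them into training and test sets."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def rmse(actual, predicted):
    """Root mean squared error between observed and predicted values."""
    squared = [(y - p) ** 2 for y, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Row indices stand in for the monthly observations (assumed 308 rows).
rows = list(range(308))
train, test = train_test_split(rows)
print(len(train), "training rows,", len(test), "test rows")
```

Computing `rmse` separately on the training and test predictions, and checking that the two values are close, is the comparison made in the answer above.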
12. Compare the model obtained in Question 10 with that of Question 11 and give your observations for the
same.
Ans:
The regression model in Q10, after removing the multicollinear variables, gives the following values:
R2 = 0.7261
Adjusted R2 = 0.722
Observation: The R2 value decreased slightly (from 0.7261 to 0.71) once the model was rebuilt with the seed set to 53.
14. If the residual plot shows autocorrelation, then what steps can you take to overcome it?
Ans:
Durbin-Watson Statistic: The Durbin-Watson statistic ranges from 0 to 4, with a midpoint of 2. A value of 2 implies
no autocorrelation; a value below 2 indicates positive autocorrelation, and a value above 2 indicates negative autocorrelation.
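The statistic described above is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. A minimal sketch follows; the function name is hypothetical and the residual sequences are made up solely to show the two extremes.

```python
def durbin_watson(residuals):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2), ranging from 0 to 4."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Made-up residuals: long runs of the same sign (positive autocorrelation).
smooth = [1, 1, 1, -1, -1, -1]
# Made-up residuals: sign flips every step (negative autocorrelation).
alternating = [1, -1, 1, -1, 1, -1]

print("smooth     :", durbin_watson(smooth))
print("alternating:", durbin_watson(alternating))
```

The first sequence yields a value below 2 and the second a value above 2, matching the interpretation given in the answer.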
Output:
Output:
Observation: The RMSE values of the seasonal naïve model for the training set (0.15) and the test set (0.13) are
small and close to each other.
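A seasonal naïve model forecasts each future period with the value observed in the same period of the last full season (for monthly data, the same month one year earlier). The sketch below illustrates the idea; the function names are hypothetical, and the short series with a period of 4 is made up purely to keep the example readable.

```python
import math


def seasonal_naive(history, horizon, period=12):
    """Repeat the last full season of history to forecast `horizon` steps ahead."""
    if len(history) < period:
        raise ValueError("need at least one full season of history")
    last_season = history[-period:]
    return [last_season[h % period] for h in range(horizon)]


def rmse(actual, predicted):
    """Root mean squared error between observed and predicted values."""
    squared = [(y - p) ** 2 for y, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up series with a season length of 4, purely for illustration.
history = [1.0, 2.0, 3.0, 4.0, 1.1, 2.1, 3.1, 4.1]
future = [1.2, 2.2, 3.2, 4.2]

forecast = seasonal_naive(history, horizon=len(future), period=4)
print("forecast :", forecast)
print("test RMSE:", rmse(future, forecast))
```

For the monthly Temp series, `period=12` would be the natural choice, with the RMSE computed once on the training months and once on the held-out months.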
We also plotted Temp against time, and the trendline is a good fit.
[Figure: Temp plotted against time, 1983 to 2008, with trendline; vertical axis from -0.4 to 0.8]