
BAUDM Assignment 1

Group 11 (Section B)

Gaurav Ogrey -19PGP174


Gaurav Sukhwani -19PGP175
Nitin Jangra – 19PGP198
Rahul Kr Tiwari – 19PGP204
Shahrukh Siddiqi -19PGP212
CLIMATE CHANGE: GLOBAL WARMING

There have been many studies documenting that the average global temperature has been increasing over the last century. The
consequences of a continued rise in global temperature will be dire. Rising sea levels and an increased frequency of extreme
weather events will affect billions of people. In this problem, we will attempt to study the relationship between average global
temperature and several other factors. The file climate_change.csv contains climate data from May 1983 to December 2008.
The available variables include:
 Year & Month
 Temp
 CO2, N2O, CH4, CFC.11, CFC.12
 Aerosols
 TSI
 MEI
Format (top 5 rows shown as an example):

Year  Month  MEI    CO2     CH4      N2O      CFC-11   CFC-12   TSI       Aerosols  Temp
1983  5      2.556  345.96  1638.59  303.677  191.324  350.113  1366.102  0.0863    0.109
1983  6      2.167  345.52  1633.71  303.746  192.057  351.848  1366.121  0.0794    0.118
1983  7      1.741  344.15  1633.22  303.795  192.818  353.725  1366.285  0.0731    0.137
1983  8      1.13   342.25  1631.35  303.839  193.602  355.633  1366.42   0.0673    0.176
1983  9      0.428  340.17  1648.4   303.901  194.392  357.465  1366.234  0.0619    0.149

Solution

Based on the analysis of the data set answer the following questions:

1. Identify the dependent variable in the above data.


Ans: Temp

2. Is this a Time-Series Data? Why or Why Not?


Ans: Yes, this is time-series data, because the observations are collected from a process at equally spaced periods of
time (monthly, from May 1983 to December 2008).
[Line chart: Temp plotted against Year, 1983-2008, showing an overall upward trend.]

3. If you consider only the baseline, what is the R2 of the model?


Ans: In time-series data, the baseline at time t is the observed value at time t-1.
To calculate R2, we need to calculate SSE and SST.
SSE: set the baseline prediction Y' (the previous period's value) -> calculate the error Y - Y' -> square the error
-> sum the squared errors, Σ(Y - Y')^2
Value = 2.193735

SST: calculate the average of Temp, Y* -> calculate the error Y - Y* -> square the error
-> sum the squared errors, Σ(Y - Y*)^2
Value = 9.846641

R2 = 1 - (SSE/SST) = 0.78
(Please refer to the attached Excel sheet, tab Q3, for the calculation.)

Assignment_01_SEC_
B_GROUP No._11.xlsx
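The SSE/SST steps above can be sketched in Python on a short hypothetical Temp series (not the assignment's actual data, so the resulting R2 differs from the 0.78 computed in the Excel sheet):

```python
# Illustrative lag-1 baseline R^2 on a hypothetical Temp series.
temps = [0.109, 0.118, 0.137, 0.176, 0.149]  # hypothetical monthly Temp values

# SSE: baseline prediction Y' = previous period's value
sse = sum((y - y_prev) ** 2 for y_prev, y in zip(temps, temps[1:]))

# SST: squared errors around the series mean Y*
mean_temp = sum(temps) / len(temps)
sst = sum((y - mean_temp) ** 2 for y in temps)

r_squared = 1 - sse / sst
print(round(r_squared, 3))
```

On the full 1983-2008 series the same procedure yields the SSE and SST values reported above.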

4. Run the raw regression and note whether the regression model is better than the baseline or not?
Ans: We ran the raw regression in Excel; please refer to the attached sheet, tab Q4, for the calculation.

R2 = 0.737144

The raw regression model gives a lower R2 than the baseline (0.78), which means the baseline model is better.

Using R Studio:
Output

5. Identify the significant variables in the raw regression (one variable in each line)?
Ans: We consider a variable significant only if its p-value is below 0.05 (5% significance level).

            Coefficients   Standard Error  t Stat        P-value
Intercept   -127.6957758   19.1909145      -6.653970335  1.36E-10
MEI         0.066321799    0.006185667     10.72185019   6.55E-23
CO2         0.00520746     0.002192387     2.375246216   0.018168
CH4         6.37103E-05    0.000497699     0.128009694   0.898227
N2O         -0.016928544   0.007835403     -2.160519989  0.031527
CFC-11      -0.007277836   0.001461301     -4.980379862  1.07E-06
CFC-12      0.004271973    0.000876258     4.875245901   1.77E-06
TSI         0.095862092    0.014007568     6.843592727   4.38E-11
Aerosols    -1.581837443   0.209944883     -7.534536774  5.86E-13
From the above output of R Studio and the Excel regression, every predictor except CH4 (p = 0.898) is significant (p < 0.05):
MEI
CO2
N2O
CFC-11
CFC-12
TSI
Aerosols
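The significance screen applied above can be expressed as a simple filter over the p-values reported in the table:

```python
# Keep predictors whose p-value falls below the 0.05 threshold
# (p-values taken from the regression output table above).
p_values = {
    "MEI": 6.55e-23, "CO2": 0.018168, "CH4": 0.898227, "N2O": 0.031527,
    "CFC-11": 1.07e-06, "CFC-12": 1.77e-06, "TSI": 4.38e-11, "Aerosols": 5.86e-13,
}
significant = [name for name, p in p_values.items() if p < 0.05]
print(significant)
```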

6. Which of the following assumptions are not fulfilled in the raw model and how you concluded it?

Ans: Checking all five assumptions:

(A) Normality: Fulfilled

- The histogram of residuals is approximately normally distributed

- The values in the Q-Q plot lie along the diagonal

(B) Linearity: Not fulfilled

A component-plus-residual plot adds a line indicating where the line of best fit lies. A significant
difference between the residual line and the component line indicates that the predictor
does not have a linear relationship with the dependent variable.
If the blue dashed (residual) line coincides with the pink (component) line, the predictor is linearly
related to the dependent variable (the component and residual lines should coincide for linearity).
The predictors MEI, CO2, CH4, N2O, CFC-11 and CFC-12 are linearly related to the dependent variable,
while TSI and Aerosols show slight deviation from the residual line.

(C) No Autocorrelation: Not fulfilled


Durbin-Watson statistic: the Durbin-Watson statistic ranges from 0 to 4 with a midpoint of 2. A value
of 2 implies no autocorrelation; values below 2 indicate positive autocorrelation and values above 2
indicate negative autocorrelation.

Output:

There is positive autocorrelation.
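The Durbin-Watson statistic can be computed directly from the model's residuals; a minimal sketch on a hypothetical residual series (not the actual regression residuals):

```python
# Durbin-Watson: DW = sum((e_t - e_{t-1})^2) / sum(e_t^2).
# Values near 0 signal positive autocorrelation, near 4 negative, near 2 none.
residuals = [0.05, 0.04, 0.06, 0.05, -0.03, -0.04, -0.05, -0.02]  # hypothetical

num = sum((e1 - e0) ** 2 for e0, e1 in zip(residuals, residuals[1:]))
den = sum(e ** 2 for e in residuals)
dw = num / den
print(round(dw, 2))
```

The residuals above drift slowly (positive serial correlation), so the statistic comes out well below 2.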

(D) Homoscedasticity: Fulfilled

Output:

Heteroscedasticity is not present, as the p-value is greater than 0.05 (we fail to reject the null
hypothesis of constant variance), so homoscedasticity holds.

(E) No Multicollinearity: Not fulfilled

For a given predictor (p), multicollinearity can be assessed by computing a score called the variance
inflation factor (or VIF), which measures how much the variance of a regression coefficient is inflated
due to multicollinearity in the model.
Output:

A VIF value greater than 4 indicates multicollinearity between variables. (We can see that CFC-12 is
highly correlated with CO2, CH4, N2O and CFC-11.)
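The VIF definition above can be sketched on synthetic data (the variables here are illustrative stand-ins, not the climate predictors): regress each column on the others and convert that fit's R^2 into the inflation factor.

```python
import numpy as np

# Synthetic design matrix: x2 is deliberately collinear with x1, x3 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=100)
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing column j on the rest."""
    y = X[:, j]
    A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])  # add intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1 / (1 - r2)

print([round(vif(X, j), 1) for j in range(X.shape[1])])
```

The collinear pair (x1, x2) gets a large VIF, while the independent x3 stays near 1, mirroring how CFC-12 flags against CO2, CH4, N2O and CFC-11 here.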

Excel Output:
          MEI      CO2      CH4      N2O      CFC-11   CFC-12   TSI      Aerosols
MEI        1.0000
CO2       -0.1529   1.0000
CH4       -0.1056   0.8723   1.0000
N2O       -0.1624   0.9811   0.8944   1.0000
CFC-11     0.0882   0.4013   0.7135   0.4122   1.0000
CFC-12    -0.0398   0.8232   0.9582   0.8393   0.8314   1.0000
TSI       -0.0768   0.0179   0.1463   0.0399   0.2846   0.1893   1.0000
Aerosols   0.3524  -0.3693  -0.2904  -0.3535  -0.0323  -0.2438   0.0832   1.0000

Conclusion: N2O is highly correlated with CO2 (0.98), CH4 (0.89) and CFC-12 (0.84), and moderately with CFC-11 (0.41).

7. Which variables exhibit non-linearity and why?


Ans: TSI and Aerosols, refer Q6 – part B for detailed analysis.

8. Which variables exhibit multicollinearity and why?


Ans: Several variables exhibit high multicollinearity: CO2, CH4, N2O, CFC-11 and CFC-12.
Refer to Q6, part E, for the detailed analysis.

9. Examine the residual plot and give your observations for it.
Residual Plot Analysis
1. Residual versus Fitted Values Plot
The plot of residuals versus predicted values is useful for checking the assumptions of linearity and
homoscedasticity. Here, the residuals-versus-fitted-values plot shows randomly distributed data points, and
the red line is flat and horizontal along the y = 0 line.
Hence the model is linear and homoscedastic. R has flagged the data points that have high residuals (i.e.,
observations 190, 184 and 183).

2. Standardized Residuals vs. Theoretical Quantiles Plot (Normal Q-Q Plot)


The normality assumption is evaluated based on the residuals and can be evaluated using a QQ-plot by
comparing the residuals to "ideal" normal observations along the 45-degree line.
R automatically flagged those same 3 data points that have large residuals (observations 190, 184,
and 183). However, aside from those 3 data points, observations lie well along the 45-degree line in the
QQ-plot. So, we may say that normality holds here.

3. The third plot is a scale-location plot (square root of the standardized residuals vs. fitted values).
This is useful for checking the assumption of homoscedasticity. In this plot we check whether there is a
pattern in the residuals.
Here the red line is horizontal and the data points are scattered randomly around it, so the
homoscedasticity assumption is satisfied, although R again flagged the 3 data points that have large
residuals (observations 190, 184 and 183).
10. Modify the model and obtain your best model. What is its R2 and Adj R2?
Modification: removing CO2, CH4, N2O and CFC-11, as they are highly multicollinear (all strongly correlated with CFC-12).

Output:

R2 = 0.7261
Adjusted R2 = 0.722
11. Now set the seed as the average of the numerals of the roll numbers of the members in the group. Using the data
mining approach, obtain your best model and test it on the testing data. Compare your models in terms of R2,
Adj R2 and RMSE. Share the results for the same.

Ans: Average of roll no: 53

Output:

R2 = 0.71
Adj R2 = 0.71
RMSE of the training (0.09) and test (0.1) sets are nearly equal (hence a good model).
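A hedged sketch of the Q11 workflow on toy data (the single-predictor model and the data below are stand-ins for the actual climate regression; only the seed of 53 comes from the assignment):

```python
import random
import math

random.seed(53)  # seed = average of group roll-number numerals, as in Q11

# Hypothetical (x, y) pairs standing in for (predictors, Temp).
data = [(x, 0.5 * x + random.gauss(0, 0.1)) for x in range(100)]
random.shuffle(data)
split = int(0.7 * len(data))
train, test = data[:split], data[split:]

# Ordinary least squares slope/intercept for one predictor, fit on training data.
n = len(train)
mx = sum(x for x, _ in train) / n
my = sum(y for _, y in train) / n
slope = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x, _ in train)
intercept = my - slope * mx

def rmse(pairs):
    """Root mean squared error of the fitted line over (x, y) pairs."""
    return math.sqrt(sum((y - (slope * x + intercept)) ** 2 for x, y in pairs) / len(pairs))

print(round(rmse(train), 3), round(rmse(test), 3))
```

As in the assignment's result, training and test RMSE coming out close to each other is the sign that the model generalises rather than overfits.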

12. Compare the model obtained in Question 10 with that of Question 11 and give your observations for the
same.
Ans:
The regression model from Q10, after removing the multicollinear variables, gives the following values:

R2 = 0.7261
Adjusted R2 = 0.722

Model obtained in Q11 (With seed= 53):


R2 = 0.71
Adj R2 = 0.71
RMSE of the training (0.09) and test (0.1) sets are nearly equal (hence a good model).

Observation: the R2 value decreased slightly under the train/test (data mining) approach with seed 53.

13. Mention your Best Model and why?


Ans: The model obtained in Q3 using the baseline gave an R2 of 0.78, the highest among all the models, so we will
take that as our starting point and improve on it. We will build different models based on trend, exponential
trend, polynomial, seasonality, naïve and seasonal-naïve approaches.
Then we will select the model with the lowest RMSE, where the training and test RMSE are about the same.

14. If the residual plot shows autocorrelation, then what steps can you take to overcome it?
Ans:
Durbin-Watson statistic: the Durbin-Watson statistic ranges from 0 to 4 with a midpoint of 2. A value of 2 implies
no autocorrelation; values below 2 indicate positive autocorrelation and values above 2 indicate negative autocorrelation.

Output:

There is positive autocorrelation.


Solution: we need to include the omitted causal factor in the multivariate analysis.
We will identify the variable that is causing the autocorrelation and include it as an
independent variable (for example, a lagged term).
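One way to sketch that remedy: rebuild the design matrix with a lagged copy of the dependent variable as an extra predictor, so the serial structure the residuals were carrying enters the model (values here are hypothetical, and MEI stands in for whichever causal factor is identified):

```python
# Augment the design with Temp at t-1; the first observation is dropped
# because it has no lagged value.
temps = [0.109, 0.118, 0.137, 0.176, 0.149, 0.093]  # hypothetical Temp series
mei = [2.556, 2.167, 1.741, 1.130, 0.428, 0.002]    # hypothetical predictor

# Each row pairs (MEI_t, Temp_{t-1}) with the target Temp_t.
X = [(mei[t], temps[t - 1]) for t in range(1, len(temps))]
y = temps[1:]
print(len(X), X[0])
```

Refitting on this augmented design and re-checking the Durbin-Watson statistic shows whether the added term has absorbed the autocorrelation.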
15. Would you consider predicting using Time Series on this data? Why or Why Not?
Ans: As calculated in Q3, we get R2 = 0.78 when we consider the baseline only.

Naïve & seasonal-naïve models:

Output:

Observation: the RMSE values of the seasonal-naïve model for training (0.15) and test (0.13) are small and
close to each other.

Also, we plotted Temp against time, and we can see the trendline is a good fit.
[Line chart: Temp plotted against Year, 1983-2008, showing an overall upward trend.]

Hence, we can consider predicting using time series on this data.
