
ENVR 210: Linear Regression and Correlation

Analysis in R
As part of your narrative to each answer, please be sure to copy all
supporting graphs, statistics, and R code into this document.
Question 1: Linear regression analysis without data transformation

Load data
y=c(5.39,5.73,6.18,6.42,6.77,7.11,7.46,7.71,8.15,8.5);
x=c(4,5,6,7,8,9,10,11,12,13);
dat=cbind(x,y);
dat=as.data.frame(dat);

Calculate the linear regression (x is predictor variable, y is response variable)
o What is the equation of the regression line?
- Y = 2.939*X - 11.093
o What is the null hypothesis? What is the p-value of the
regression? Do you reject or fail to reject the null hypothesis?
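- The null hypothesis is that the slope of the regression line is
zero (no linear relationship between x and y). The p-value is
2.607e-12, which is far below the conventional 0.05 significance
level, so we reject the null hypothesis.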
o What is the r^2? Is it good or bad?
- R^2 is 0.998. This is good: a value close to 1 means the line
explains almost all of the variation in y, leaving very little
residual variation.
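
The statistics asked for above can be read from R's lm() output. A minimal sketch, assuming the response goes on the left of the formula (y ~ x; writing x ~ y would fit a different line) and a conventional significance level of 0.05, which the assignment does not state. The names fit, s, and p_slope are illustrative only:

```r
# Data from Question 1
x <- c(4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
y <- c(5.39, 5.73, 6.18, 6.42, 6.77, 7.11, 7.46, 7.71, 8.15, 8.5)
dat <- data.frame(x, y)

# Response on the left of ~, predictor on the right
fit <- lm(y ~ x, data = dat)
s <- summary(fit)

coef(fit)          # intercept and slope of the fitted line
s$r.squared        # multiple R^2
s$adj.r.squared    # adjusted R^2 (also printed by summary())

# p-value for the null hypothesis that the slope is zero
p_slope <- s$coefficients["x", "Pr(>|t|)"]
p_slope < 0.05     # TRUE means reject the null at the assumed 0.05 level
```

The p-value is compared with the chosen significance level, not with the slope estimate.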
Plot the best fit line on a scatterplot of the data
o How does this look in relation to the data?
- The best-fit line passes very close to every data point, which is
consistent with the R^2 of 0.998: the residuals are small.
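
A minimal sketch of how such a scatterplot with its best-fit line could be drawn in base R. The axis labels, title, and colours are illustrative choices, not part of the assignment:

```r
# Data from Question 1
x <- c(4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
y <- c(5.39, 5.73, 6.18, 6.42, 6.77, 7.11, 7.46, 7.71, 8.15, 8.5)

fit <- lm(y ~ x)

# Scatterplot of the raw data
plot(x, y, pch = 19,
     xlab = "x (predictor)", ylab = "y (response)",
     main = "Question 1: data and fitted line")

# abline() accepts an lm object and draws its intercept and slope
abline(fit, col = "red", lwd = 2)
```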

Plot the residual charts
o Linearity: Residuals vs Fitted
Do the data appear to be linear?
- Yes, the data appear to be linear, which indicates a linear
relationship between x and y.
o Independently sampled values, normally distributed errors:
Normal Q-Q
Do the points line up on the 45-degree line?
- Yes, the points line up on the 45-degree line, so the residuals
are approximately normally distributed.
o Constant variance: Scale-Location
Is the spread of the points the same?
- Yes. The spread of the points is roughly the same across the
fitted values, so the variance is constant.
o Provide 1 statistic and 1 chart that show that transforming
data cannot improve regression performance.
1. Statistic: after the transformation we still reject the null
hypothesis, but the fit did not improve: r2 went from 0.9886
(p = 2.067e-12) to 0.9882 (p = 3.347e-09).

2. Chart: the regression line through the transformed data still fits
well, but not noticeably better than the line through the
untransformed data.
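
One way such a comparison could be produced is sketched below. The source does not say which transformation was applied, so the log10 transform of both variables here is purely an assumption, as are the names fit_raw, fit_log, and r2:

```r
# Data from Question 1
x <- c(4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
y <- c(5.39, 5.73, 6.18, 6.42, 6.77, 7.11, 7.46, 7.71, 8.15, 8.5)

fit_raw <- lm(y ~ x)                  # untransformed model
fit_log <- lm(log10(y) ~ log10(x))    # ASSUMED transform, for illustration

# Statistic: R^2 of each model, side by side
r2 <- c(raw = summary(fit_raw)$r.squared,
        log = summary(fit_log)$r.squared)
r2

# Chart: Residuals vs Fitted for each model in adjacent panels
op <- par(mfrow = c(1, 2))
plot(fit_raw, which = 1, main = "Untransformed")
plot(fit_log, which = 1, main = "Log-transformed")
par(op)
```

If the two R^2 values are close and neither residual panel shows a clear pattern, the transformation has not bought anything.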

Question 2: Linear regression analysis with transformation

Load data
Number=c(8398,239,728,758,1453,75,27,915,67,4,28,1,168,7,16,7,3);
Distance=c(364,357,343,251,216,133,115,90,88,58,54,54,53,47,25,16,8);

dat=cbind(Number,Distance);
dat=as.data.frame(dat);

Calculate the linear regression (Distance is predictor variable,
Number is response variable)
o What is the equation of the regression line?
- Y = 0.03522*X + 106.93396
o What is the null hypothesis? What is the p-value of the
regression? Do you reject or fail to reject the null hypothesis?
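- The null hypothesis is that the slope of the regression line is
zero. The p-value is 0.01629, which is below the conventional 0.05
significance level, so we reject the null hypothesis.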
o What is the r^2? Is it good or bad?
- The r^2 is 0.2831. This is poor: the line explains only about 28%
of the variation in the response.
Plot the best fit line on a scatterplot of the data
o How does this look in relation to the data?

Plot the residual charts

o Linearity: Residuals vs Fitted
Do the data appear to be linear?
- No, the data are not linear.
o Independently sampled values, normally distributed errors:
Normal Q-Q
Do the points line up on the 45-degree line?
- The points do not line up on the 45-degree line.
o Constant variance: Scale-Location
Is the spread of the points the same?
- No, the spread of the points is not the same.
o What are your conclusions about the regression?
- Equation of the regression line is Y = 0.03522*X +
106.93396
- We reject the null hypothesis that the slope of the line
is 0, so the linear regression is statistically significant
(r2=0.2831, p=0.01629)
- The r2 is low, but the residual plots offer insight into
how the model could be improved
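
The three residual charts discussed above can be produced directly from the fitted model. A minimal sketch, assuming Distance is the predictor and Number is the response as the assignment states; the panel layout is an illustrative choice:

```r
# Data from Question 2
Number <- c(8398, 239, 728, 758, 1453, 75, 27, 915, 67, 4, 28, 1, 168,
            7, 16, 7, 3)
Distance <- c(364, 357, 343, 251, 216, 133, 115, 90, 88, 58, 54, 54, 53,
              47, 25, 16, 8)
dat <- data.frame(Number, Distance)

# Response ~ predictor
fit <- lm(Number ~ Distance, data = dat)

# plot.lm() panels: 1 = Residuals vs Fitted, 2 = Normal Q-Q,
# 3 = Scale-Location
op <- par(mfrow = c(1, 3))
plot(fit, which = 1:3)
par(op)
```

A curved trend in panel 1, points leaving the line in panel 2, or a rising trend in panel 3 each point to a violated assumption.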

Could you improve the model fit by transforming the data? (Hint:
Yes, you can). If so, answer the following questions based on
log-transformed data:

o What is the equation of your regression line?
- Y = 0.3380*X + 1.3022
o Did the significance of the regression change? Do you reject
or fail to reject the null hypothesis?
- Yes. The p-value dropped from 0.01629 to 0.0002429, which is
well below the conventional 0.05 significance level, so we
still reject the null hypothesis.
o Did the r^2 change? Is it good or bad?
- The r^2 changed from 0.2831 to 0.5773, which is an improvement:
the larger the value, the more of the variation the line explains.
o Plot the best fit line on a scatterplot of the transformed data.
How does it look?
- The best fit line is not perfect, but it follows the transformed
data more closely than the line fitted to the untransformed data.
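
A minimal sketch of a log-transformed fit and its scatterplot. The source does not say which logarithm or which variables were transformed, so log10 of both Number and Distance is an assumption here, and the column names logNumber and logDistance are hypothetical:

```r
# Data from Question 2
Number <- c(8398, 239, 728, 758, 1453, 75, 27, 915, 67, 4, 28, 1, 168,
            7, 16, 7, 3)
Distance <- c(364, 357, 343, 251, 216, 133, 115, 90, 88, 58, 54, 54, 53,
              47, 25, 16, 8)
dat <- data.frame(Number, Distance)

# ASSUMED transform: base-10 log of both variables
dat$logNumber <- log10(dat$Number)
dat$logDistance <- log10(dat$Distance)

fit_log <- lm(logNumber ~ logDistance, data = dat)
summary(fit_log)$r.squared   # compare with the untransformed model

# Scatterplot of the transformed data with the fitted line
plot(dat$logDistance, dat$logNumber, pch = 19,
     xlab = "log10(Distance)", ylab = "log10(Number)")
abline(fit_log, col = "red", lwd = 2)
```

The base of the logarithm changes the intercept but not the r^2 or the p-value.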

Step through examining each of the assumptions of linear regression

o Linearity: Residuals vs Fitted
Do the data appear to be linear?
- The red line is not straight, so the data do not have a
perfectly linear relationship. Observations 2, 12 and 17
have the largest residuals.

o Independently sampled values, normally distributed errors:
Normal Q-Q
Do the points line up on the 45-degree line?
- The points do not all line up on the 45-degree line, so the
residuals are not normally distributed.

o Constant variance: Scale-Location
Is the spread of the points the same?
- No. Some points lie far away from the others, and the highest
point falls on the other side of the Cook's distance line.
o Did transforming the data improve the linear fit?
- Yes, transforming the data improved the linear fit, but not
enough to produce a perfect linear relationship.
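
The assumption checks for the transformed model, including the Cook's distance mentioned above, could be reproduced along these lines. This is a sketch that again assumes a log10 transform of both variables; fit_log and cd are illustrative names:

```r
# Data from Question 2
Number <- c(8398, 239, 728, 758, 1453, 75, 27, 915, 67, 4, 28, 1, 168,
            7, 16, 7, 3)
Distance <- c(364, 357, 343, 251, 216, 133, 115, 90, 88, 58, 54, 54, 53,
              47, 25, 16, 8)

# ASSUMED transform: base-10 log of both variables
fit_log <- lm(log10(Number) ~ log10(Distance))

# Default plot.lm() panels: Residuals vs Fitted, Normal Q-Q,
# Scale-Location, and Residuals vs Leverage (with Cook's distance contours)
op <- par(mfrow = c(2, 2))
plot(fit_log)
par(op)

# Cook's distance per observation; the largest values flag the points
# with the most influence on the fitted line
cd <- cooks.distance(fit_log)
head(sort(cd, decreasing = TRUE), 3)
```

Note that the Cook's distance contours are drawn on the Residuals vs Leverage panel rather than on the Scale-Location panel.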