Linear Regression

Analysis in R

As part of your narrative to each answer, please be sure to copy all

supporting graphs, statistics, and R code into this document.

Question 1: Linear regression analysis without data transformation

Load data

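# Observed values for Question 1
y = c(5.39, 5.73, 6.18, 6.42, 6.77, 7.11, 7.46, 7.71, 8.15, 8.5);

x = c(4, 5, 6, 7, 8, 9, 10, 11, 12, 13);

# Combine the two vectors into a data frame with columns x and y
dat = cbind(x, y);

dat = as.data.frame(dat);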

variable)

o What is the equation of the regression line?

- Y = 2.939*X - 11.904
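A minimal sketch of how the coefficients could be obtained with `lm()`. The object names `fit` and `coefs` are illustrative. One assumption is worth stating plainly: the slope reported above (2.939) is what R returns when x is treated as the response and y as the predictor, so the sketch uses `lm(x ~ y)`. Fitting `lm(y ~ x)` instead would give a different slope and intercept, but the same r^2 and p-value.

```r
# Data from the "Load data" step
y <- c(5.39, 5.73, 6.18, 6.42, 6.77, 7.11, 7.46, 7.71, 8.15, 8.5)
x <- c(4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
dat <- data.frame(x, y)

# Assumption: x is the response and y the predictor, because this
# reproduces the slope reported in the answer
fit <- lm(x ~ y, data = dat)

# Named vector holding "(Intercept)" and the slope for y
coefs <- coef(fit)
print(coefs)
```

The regression equation is then read off as response = slope * predictor + intercept.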

o What is the null hypothesis? What is the p-value of the

regression? Do you reject or fail to reject the null hypothesis?

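- The null hypothesis is that the slope of the regression line is 0 (no linear relationship between the variables). The p-value is 2.607e-12. Since 2.607e-12 < 0.05, we reject the null hypothesis.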

o What is the r^2? Is it good or bad?

- R^2 is 0.998. This is good: values close to 1 mean the regression line explains almost all of the variation in the data, leaving very little residual variation.
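The p-value and r^2 quoted above can be pulled out of the model summary rather than read off the printed output. The sketch below assumes a significance level of 0.05, which the source does not state, and uses illustrative names (`s`, `p_value`, `alpha`). For a regression with a single predictor, the p-value of the slope equals the p-value of the overall F-test.

```r
y <- c(5.39, 5.73, 6.18, 6.42, 6.77, 7.11, 7.46, 7.71, 8.15, 8.5)
x <- c(4, 5, 6, 7, 8, 9, 10, 11, 12, 13)

# r^2 and the p-value are the same for lm(x ~ y) and lm(y ~ x)
fit <- lm(x ~ y)
s <- summary(fit)

# p-value for the test of H0: slope = 0
p_value <- coef(s)["y", "Pr(>|t|)"]

alpha <- 0.05                    # assumed significance level
reject_null <- p_value < alpha   # TRUE means we reject H0

r2 <- s$r.squared                # multiple R-squared
adj_r2 <- s$adj.r.squared        # adjusted R-squared

cat("p-value:", p_value, "\n")
cat("reject H0:", reject_null, "\n")
cat("R-squared:", r2, " adjusted:", adj_r2, "\n")
```

The decision rule compares the p-value with the significance level, not with the slope estimate.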

Plot the best fit line on a scatterplot of the data

o How does this look in relation to the data?

- The best fit line passes very close to every data point, which is consistent with the R^2 of 0.998: the residuals are small.
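One way the scatterplot with the best fit line could be drawn in base R is sketched below. The axis labels, title and colours are illustrative choices, and the model formula `x ~ y` is the same assumption as for the reported slope (x as response, y as predictor).

```r
y <- c(5.39, 5.73, 6.18, 6.42, 6.77, 7.11, 7.46, 7.71, 8.15, 8.5)
x <- c(4, 5, 6, 7, 8, 9, 10, 11, 12, 13)

fit <- lm(x ~ y)

# Predictor on the horizontal axis, response on the vertical axis
plot(y, x, pch = 19,
     xlab = "y (predictor)", ylab = "x (response)",
     main = "Data with fitted regression line")

# abline() accepts a simple-regression lm object and draws its line
abline(fit, col = "red", lwd = 2)
```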


Do the data appear to be linear?

- Yes, the data appear to be linear, which indicates a linear relationship between the two variables.

o Independently sampled values, normally distributed errors: Normal Q-Q

Do the points line up on the 45-degree line?

- Yes, the points line up on the 45-degree line, so the residuals are approximately normally distributed.

o Constant variance: Scale-Location

Is the spread of the points the same?

- Yes. The spread of the points is roughly the same across the range of fitted values, so the variance is constant.
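The diagnostic charts referred to above (linearity, Normal Q-Q, Scale-Location) can be produced directly from the fitted model. This is a sketch using base R's `plot()` method for `lm` objects; the three-panel layout is an illustrative choice, and the formula `x ~ y` is the same assumption used to reproduce the reported slope.

```r
y <- c(5.39, 5.73, 6.18, 6.42, 6.77, 7.11, 7.46, 7.71, 8.15, 8.5)
x <- c(4, 5, 6, 7, 8, 9, 10, 11, 12, 13)

fit <- lm(x ~ y)

# Three panels side by side; remember the old settings to restore them
op <- par(mfrow = c(1, 3))
plot(fit, which = 1)  # Residuals vs Fitted: checks linearity
plot(fit, which = 2)  # Normal Q-Q: checks normality of the residuals
plot(fit, which = 3)  # Scale-Location: checks constant variance
par(op)
```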

o Provide 1 statistic and 1 chart that show that transforming data cannot improve regression performance.

1. We still reject the null hypothesis, and the transformation has not improved regression performance: the fit goes from (r2 = 0.998, p = 2.607e-12) to (r2 = 0.9882, p = 3.347e-09).

2. The regression line for the transformed data still fits well, but the chart shows no visible improvement over the untransformed fit.
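A hedged sketch of how the two fits could be compared side by side. The source does not say which transformation was applied for this question, so the log10 transform of both variables used here is an assumption, as are the helper name `fit_stats` and the formula direction `x ~ y`.

```r
y <- c(5.39, 5.73, 6.18, 6.42, 6.77, 7.11, 7.46, 7.71, 8.15, 8.5)
x <- c(4, 5, 6, 7, 8, 9, 10, 11, 12, 13)

raw_fit <- lm(x ~ y)
# Assumed transformation: log10 of both variables
log_fit <- lm(log10(x) ~ log10(y))

# Hypothetical helper: adjusted R-squared and slope p-value for one fit
fit_stats <- function(fit) {
  s <- summary(fit)
  c(adj_r2 = s$adj.r.squared, p = coef(s)[2, "Pr(>|t|)"])
}

comparison <- rbind(untransformed = fit_stats(raw_fit),
                    log10 = fit_stats(log_fit))
print(comparison)
```

If the transformed row shows a lower r^2 and a larger p-value than the untransformed row, the transformation has not helped.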

Load data

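# Observed values for Question 2 (17 observations each)
Number = c(8398, 239, 728, 758, 1453, 75, 27, 915, 67, 4, 28, 1, 168, 7, 16, 7, 3);

Distance = c(364, 357, 343, 251, 216, 133, 115, 90, 88, 58, 54, 54, 53, 47, 25, 16, 8);

# Combine the two vectors into a data frame with columns Number and Distance
dat = cbind(Number, Distance);

dat = as.data.frame(dat);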

Number is response variable)

o What is the equation of the regression line?

- Y = 0.03522*X + 106.93396

o What is the null hypothesis? What is the p-value of the

regression? Do you reject or fail to reject the null hypothesis?

- The null hypothesis is that the slope of the regression line is 0. The p-value is 0.01629. Since 0.01629 < 0.05, we reject the null hypothesis.

o What is the r^2? Is it good or bad?

- The r^2 is 0.2831. This is poor: the regression line explains only about 28% of the variation in the data.
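The three statistics for this dataset can be reproduced with a short script. The formula `Distance ~ Number` is an assumption chosen because it reproduces the coefficients reported above; with the roles swapped the slope and intercept differ, while r^2 and the p-value do not. The 0.2831 quoted above appears to correspond to the adjusted R-squared, so the sketch extracts that value. Object names are illustrative.

```r
Number <- c(8398, 239, 728, 758, 1453, 75, 27, 915, 67, 4, 28, 1, 168, 7, 16, 7, 3)
Distance <- c(364, 357, 343, 251, 216, 133, 115, 90, 88, 58, 54, 54, 53, 47, 25, 16, 8)
dat <- data.frame(Number, Distance)

# Assumption: Distance as response, matching the reported coefficients
fit <- lm(Distance ~ Number, data = dat)
s <- summary(fit)

coefs <- coef(fit)                            # intercept and slope
p_value <- coef(s)["Number", "Pr(>|t|)"]      # test of H0: slope = 0
adj_r2 <- s$adj.r.squared                     # adjusted R-squared

print(coefs)
cat("p-value:", p_value, " adjusted R-squared:", adj_r2, "\n")
```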

Plot the best fit line on a scatterplot of the data

o How does this look in relation to the data?

Do the data appear to be linear?

- No, the data do not appear to be linear.

o Independently sampled values, normally distributed errors:

Normal Q-Q

Do the points line up on the 45-degree line?

- The points do not line up on the 45-degree line.

o Constant variance: Scale-Location

Is the spread of the points the same?

- No, the spread of the points is not the same, so the variance is not constant.

What are your conclusions about the regression?

- The equation of the regression line is Y = 0.03522*X + 106.93396

- We reject the null hypothesis that the slope of the line is 0, so the regression is statistically significant (p = 0.01629), but the line explains little of the variation in the data (r2 = 0.2831)

- The r^2 is low, and the residual plots show that the linearity, normality and constant-variance assumptions are not met, which suggests that transforming the data could improve the fit

o Could you improve the model fit by transforming the data? (Hint: Yes, you can). If so, answer the following questions based on log-transformed data:

o What is the equation of the regression line?

- Y = 0.3380*X + 1.3022, where X and Y are the log10-transformed variables

Did the significance of the regression change? Do you reject

or fail to reject the null hypothesis?

- Yes, the regression became more significant: the p-value dropped from 0.01629 to 0.0002429. Since 0.0002429 < 0.05, we still reject the null hypothesis.

Did the r^2 change? Is it good or bad?

- Yes, the r^2 increased from 0.2831 to 0.5773. This is an improvement, because the regression line now explains more of the variation in the data, although the fit is still only moderate.
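The change in fit can be checked by fitting both models and comparing their statistics. The sketch assumes a base-10 log of both variables and `Distance` as the response, since that combination reproduces the coefficients quoted above (slope 0.3380, intercept 1.3022). It compares adjusted R-squared values; object names are illustrative.

```r
Number <- c(8398, 239, 728, 758, 1453, 75, 27, 915, 67, 4, 28, 1, 168, 7, 16, 7, 3)
Distance <- c(364, 357, 343, 251, 216, 133, 115, 90, 88, 58, 54, 54, 53, 47, 25, 16, 8)

raw_fit <- lm(Distance ~ Number)
# Assumed transformation: log10 of both variables
log_fit <- lm(log10(Distance) ~ log10(Number))

log_coefs <- coef(log_fit)
adj_r2_raw <- summary(raw_fit)$adj.r.squared
adj_r2_log <- summary(log_fit)$adj.r.squared
p_log <- coef(summary(log_fit))[2, "Pr(>|t|)"]

print(log_coefs)
cat("adjusted R-squared, untransformed:", adj_r2_raw, "\n")
cat("adjusted R-squared, log10:        ", adj_r2_log, "\n")
cat("p-value, log10:", p_log, "\n")
```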

Plot best fit line on a scatterplot of the transformed data

how does it look?

- The best fit line is not perfect, but it follows the transformed data more closely than the line fitted to the untransformed data.

regression

Does the data appear to be linear?

- The red smoothed line in the residual plot is not straight, so the data do not have a perfectly linear relationship. Observations 2, 12 and 17 have the largest residuals.
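The observations with the largest residuals can also be identified numerically instead of being read off the plot labels. This sketch assumes the same log10 model as the reported equation (`log10(Distance) ~ log10(Number)`); the names `res` and `largest` are illustrative.

```r
Number <- c(8398, 239, 728, 758, 1453, 75, 27, 915, 67, 4, 28, 1, 168, 7, 16, 7, 3)
Distance <- c(364, 357, 343, 251, 216, 133, 115, 90, 88, 58, 54, 54, 53, 47, 25, 16, 8)

# Assumed model: log10 of both variables, Distance as response
log_fit <- lm(log10(Distance) ~ log10(Number))

res <- residuals(log_fit)

# Row numbers of the three residuals that are largest in absolute value
largest <- order(abs(res), decreasing = TRUE)[1:3]
print(res[largest])
```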

o Independently sampled values, normally distributed errors: Normal Q-Q

Do the points line up on the 45-degree line?

- No, the points do not all line up on the 45-degree line, so the residuals are not normally distributed.

Is the spread of the points the same?

- No, the spread is not the same. Some points lie far away from the others, and the most extreme point falls beyond the Cook's distance line, which marks it as an influential observation.

Did transforming the data improve the linear fit?

- Yes, transforming the data improved the linear fit, but not enough to produce a perfect linear relationship.

