
Case Study: Linear regression

Let us assume that you are the small-business owner of Regional Delivery Service, Inc. (RDS), which
offers same-day delivery for letters, packages, and other small cargo. You are able to use
Google Maps to group individual deliveries into one trip to reduce time and fuel costs.
Therefore, some trips will have more than one delivery. As the owner, you would like to be able
to estimate how long a delivery trip will take based on three factors: 1) the total trip distance
in miles, 2) the number of deliveries that must be made during the trip, and 3) the daily
price of gas/petrol in U.S. dollars.
For this predictive analysis, take a random sample of 10 past trips and record four pieces of
information for each trip.
# install packages
install.packages("car")
install.packages("SparseM")
install.packages("corrplot")
install.packages("PerformanceAnalytics")
# library
library(car)
# dataset
test <- data.frame(
  TravelTime_y     = c(7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4),
  MilesTravel_x1   = c(89, 66, 78, 111, 44, 77, 80, 66, 109, 76),
  NumDeliveries_x2 = c(4, 1, 3, 6, 1, 3, 3, 2, 5, 3),
  GasPrice_x3      = c(3.84, 3.19, 3.78, 3.89, 3.57, 3.57, 3.03, 3.51, 3.54, 3.25))
# scatter plot / correlation graph
plot(test)
# Other methods:
library(corrplot)
newdatacor <- round(cor(test), 2)
corrplot(newdatacor, method = "number")
library(PerformanceAnalytics)
chart.Correlation(test, histogram = FALSE, pch = 19)
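
The correlation table further below also reports a p-value for each pair, which plot() and
corrplot() do not provide. A minimal sketch for obtaining both r and p with base R's
cor.test() (no extra packages assumed):

# r and p-value for every pair of columns in the dataset
vars <- names(test)
for (i in 1:(length(vars) - 1)) {
  for (j in (i + 1):length(vars)) {
    ct <- cor.test(test[[vars[i]]], test[[vars[j]]])
    cat(vars[i], "vs", vars[j],
        ": r =", round(ct$estimate, 3),
        ", p =", round(ct$p.value, 4), "\n")
  }
}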


# AIC (models m1 to m7 are fitted in the "#Model" section below)
AIC(m1)
AIC(m2)
AIC(m3)
AIC(m4)
AIC(m5)
AIC(m6)
AIC(m7)
# Stepwise selection by AIC
step(m7, direction = "both")
step(m7, direction = "backward")
step(m7, direction = "forward")  # note: a no-op when starting from the full model;
                                 # forward selection needs a smaller start plus a scope
# RMSE (in-sample, for m7)
predicted  <- predict(m7, test)
difference <- predicted - test$TravelTime_y
rmse       <- sqrt(mean(difference^2))
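
The RMSE column of the results table below covers all seven models; one compact way to fill it
in, reusing the fitted objects above (a sketch equivalent to the predict-based calculation,
since residuals are observed minus fitted values):

# In-sample RMSE for every candidate model
models <- list(m1 = m1, m2 = m2, m3 = m3, m4 = m4, m5 = m5, m6 = m6, m7 = m7)
sapply(models, function(m) sqrt(mean(residuals(m)^2)))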
(Here Y = TravelTime_y, X1 = MilesTravel_x1, X2 = NumDeliveries_x2, X3 = GasPrice_x3.)

Relationship   Correlation coefficient (r)   p-value                   ACTION
Dependent variable vs independent variable
Y  vs X1       0.928 (high)                  0.0001 (significant)      Accept (r is high)
Y  vs X2       0.916 (high)                  0.0002 (significant)      Accept (r is high)
Y  vs X3       0.236 (low)                   0.5111 (not significant)  Reject (r is low)
Independent variable vs independent variable
X2 vs X1       0.956 (high)                  <0.0001 (significant)     Reject (r is high)
X3 vs X1       0.314 (low)                   0.3771 (not significant)  Accept (r is low)
X3 vs X2       0.453 (low)                   0.1881 (not significant)  Accept (r is low)
# Model, summary, VIF
# (car::vif() needs at least two predictors, so it applies only to m4 through m7)
m1 <- lm(TravelTime_y ~ MilesTravel_x1, data = test)
summary(m1)
m2 <- lm(TravelTime_y ~ NumDeliveries_x2, data = test)
summary(m2)
m3 <- lm(TravelTime_y ~ GasPrice_x3, data = test)
summary(m3)
m4 <- lm(TravelTime_y ~ MilesTravel_x1 + NumDeliveries_x2, data = test)
summary(m4); car::vif(m4)
m5 <- lm(TravelTime_y ~ MilesTravel_x1 + GasPrice_x3, data = test)
summary(m5); car::vif(m5)
m6 <- lm(TravelTime_y ~ NumDeliveries_x2 + GasPrice_x3, data = test)
summary(m6); car::vif(m6)
m7 <- lm(TravelTime_y ~ MilesTravel_x1 + NumDeliveries_x2 + GasPrice_x3, data = test)
summary(m7); car::vif(m7)

Result of all 7 models:

Model  Predictors  Mult. R-sq  Adj. R-sq  Resid. SE  RMSE   F-stat  p-value  Intercept  x1 coef  x2 coef  x3 coef  VIF (x1, x2, x3)    AIC
m1     x1          0.8615       0.8442    0.3423     0.306  49.77   0.00010  3.185      0.040    -        -        -                   10.7
m2     x2          0.8399       0.8199    0.3681     0.329  41.96   0.00019  4.845      -        0.498    -        -                   12.1
m3     x3          0.0714      -0.0446    0.8864     0.792  0.615   0.4555   3.536      -        -        0.81     -                   29.7
m4     x1, x2      0.8714       0.8347    0.3526     0.295  23.72   0.00076  3.732      0.026    0.184    -        11.59, 11.59, -     11.9
m5     x1, x3      0.8661       0.8278    0.3599     0.301  22.63   0.00087  3.867      0.041    -        -0.21    1.14, -, 1.14       12.3
m6     x2, x3      0.8876       0.8555    0.3297     0.275  27.63   0.00047  7.324      -        0.566    -0.76    -, 1.33, 1.33       10.6
m7     x1, x2, x3  0.8947       0.842     0.3447     0.266  16.99   0.00245  6.211      0.014    0.383    -0.606   14.93, 17.35, 1.71  11.9

Conclusion:

1. The correlation coefficient of X3 with Y is very low (0.236), so the models involving X3
(m3, m5, m6, m7) cannot be considered for model selection.
2. Models m7 and m4 are redundant due to VIF > 10, since the correlation between X1 and X2 is high.
3. Model m1 has a higher adjusted R-squared, a higher F-statistic, a lower RMSE, and a lower AIC
than model m2, so model m1 is the best model.
After excluding the third variable (GasPrice_x3) according to the correlation matrix, we can also
verify the model with the stepwise AIC method.

> step(m4,direction = "both")


Start: AIC=-18.41
TravelTime_y ~ MilesTravel_x1 + NumDeliveries_x2

Df Sum of Sq RSS AIC


- NumDeliveries_x2 1 0.066906 0.9374 -19.672
<none> 0.8705 -18.413
- MilesTravel_x1 1 0.213433 1.0839 -18.220

Step: AIC=-19.67
TravelTime_y ~ MilesTravel_x1

Df Sum of Sq RSS AIC


<none> 0.9374 -19.6723
+ NumDeliveries_x2 1 0.0669 0.8705 -18.4128
- MilesTravel_x1 1 5.8316 6.7690 -1.9023

Call:
lm(formula = TravelTime_y ~ MilesTravel_x1, data = test)

Coefficients:
(Intercept) MilesTravel_x1
3.18556 0.04026
Some plots for model m1:
par(mfrow = c(2, 2))
plot(m1)

• The first is Residuals vs Fitted. The residuals should be randomly distributed with no pattern; in the figure they look random, so this is fine.
• The Q-Q (quantile-quantile) plot checks normality; the points should follow the straight line. It doesn't look perfect here, but it is acceptable.
• The third (Scale-Location) is similar to the first but uses standardized residuals. It should also look random, and this chart looks OK.
• The Leverage plot flags extreme points, and here too it looks good.
• More inference could be drawn if the number of observations were higher. The chosen model can now be used for prediction, as shown below.
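
Since m1 is the chosen model, it can serve the original goal of estimating trip time. A minimal
sketch, using a made-up new trip of 90 miles (the trip value is illustrative only):

# Estimated travel time for a hypothetical new 90-mile trip
new_trip <- data.frame(MilesTravel_x1 = 90)
predict(m1, newdata = new_trip, interval = "prediction")
# point estimate: 3.186 + 0.040 * 90, roughly 6.81 (in the units of TravelTime_y)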
Implementing Logistic Regression using Titanic dataset in R

Data Source: https://www.kaggle.com/c/titanic/data

Data Dictionary

Variable Definition Key


survival Survival 0 = No, 1 = Yes
pclass Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
sex Sex
Age Age in years
sibsp # of siblings / spouses aboard the Titanic
parch # of parents / children aboard the Titanic
ticket Ticket number
fare Passenger fare
cabin Cabin number
embarked Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton

Variable Notes

pclass: A proxy for socio-economic status (SES)


1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way...


Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...


Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.



R Code
Loading the dataset

train <- read.csv("C:/Users/Shashi/Desktop/Regression/train.csv",
                  stringsAsFactors = FALSE, na.strings = c("", " ", "NA"))

test <- read.csv("C:/Users/Shashi/Desktop/Regression/test.csv",
                 stringsAsFactors = FALSE, na.strings = c("", " ", "NA"))

# Structure of both datasets
str(train)
str(test)  # Survived is absent

## Setting Survived column for test data to NA
test$Survived <- NA

## Combining Training and Testing datasets
complete_data <- rbind(train, test)

## Check data structure
str(complete_data)

> str(complete_data)
'data.frame': 1309 obs. of 12 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley
(Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle,
Mrs. Jacques Heath (Lily May Peel)" ...
$ Sex : chr "male" "female" "female" "female" ...
$ Age : num 22 38 26 35 35 28 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : chr NA "C85" NA "C123" ...
$ Embarked : chr "S" "C" "S" "S" ...

## Let's check for any missing values in the data
colSums(is.na(complete_data))

> colSums(is.na(complete_data))
PassengerId    Survived      Pclass        Name         Sex         Age
          0         418           0           0           0           0
      SibSp       Parch      Ticket        Fare       Cabin    Embarked
          0           0           0           1        1014           2



## Check the number of unique values in each column to find out which
## columns we can convert to factors
sapply(complete_data, function(x) length(unique(x)))

> sapply(complete_data, function(x) length(unique(x)))
PassengerId    Survived      Pclass        Name         Sex         Age
       1309           3           3        1307           2          98
      SibSp       Parch      Ticket        Fare       Cabin    Embarked
          7           8         929         282         187           4

## Missing values imputation
# na.strings in read.csv() already turned empty strings into NA,
# so the two missing Embarked values must be found with is.na()
complete_data$Embarked[is.na(complete_data$Embarked)] <- "S"
complete_data$Age[is.na(complete_data$Age)] <- median(complete_data$Age, na.rm = TRUE)

## Removing Cabin as it has very many missing values; PassengerId,
## Ticket and Name are not required
library(dplyr)
titanic_data <- complete_data %>% select(-c(Cabin, PassengerId, Ticket, Name))

## Converting "Survived", "Pclass", "Sex", "Embarked" to factors
for (i in c("Survived", "Pclass", "Sex", "Embarked")) {
  titanic_data[, i] <- as.factor(titanic_data[, i])
}

Dummy variable creation

For logistic regression, it is important to create dummy variables for all the categorical variables so that
all the factors are taken into account while creating the model. Dummy variables are simply one indicator
variable per category of a categorical variable. For example, our dataset has a variable 'Pclass', so
creating dummy variables for it adds three new variables to the dataset, one per class
(Pclass_1, Pclass_2, Pclass_3). I'll first convert all the categorical variables to factors and then use
the 'dummies' package in R to create dummy variables for each of these variables.
library(dummies)
titanic_data <- dummy.data.frame(titanic_data, names = c("Pclass", "Sex", "Embarked"), sep = "_")
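
The 'dummies' package has since been archived on CRAN, so it may not install on a current R.
An equivalent base-R sketch using model.matrix(), assuming the factor conversion above has run
and these columns contain no NAs:

# Base-R alternative: one indicator column per factor level
for (col in c("Pclass", "Sex", "Embarked")) {
  m <- model.matrix(~ 0 + titanic_data[[col]])
  colnames(m) <- paste(col, levels(titanic_data[[col]]), sep = "_")
  titanic_data <- cbind(titanic_data, m)   # append the indicator columns
  titanic_data[[col]] <- NULL              # drop the original factor column
}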

Data Splitting
Next, we split the dataset into training and testing sets. I'll split off the train part of the combined
data, as it is the part that has values for the target variable. There are 891 observations in the original
training dataset, and I'll split them in roughly a 75:25 ratio. This will help us calculate the model
accuracy; we can then again use predict() to predict the target variable of the test dataset.
## Splitting training and test data
train <- titanic_data[1:667, ]
test  <- titanic_data[668:889, ]
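
The slice above is sequential; a random 75:25 split would look like the sketch below (the seed
value is an arbitrary choice; note that all outputs shown further down were produced with the
sequential split above):

# Random 75:25 split of the 891 labelled rows
set.seed(42)  # for reproducibility
idx   <- sample(seq_len(891), size = round(0.75 * 891))
train <- titanic_data[idx, ]
test  <- titanic_data[setdiff(seq_len(891), idx), ]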

## Model Creation
model <- glm(Survived ~ ., family = binomial(link = "logit"), data = train)
## Model Summary
summary(model)

> summary(model)

Call:
glm(formula = Survived ~ ., family = binomial(link = "logit"),
data = train)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.3773 -0.6567 -0.4301 0.6397 2.3946

Coefficients: (3 not defined because of singularities)


Estimate Std. Error z value Pr(>|z|)
(Intercept) 10.037973 535.411351 0.019 0.985042
Pclass_1 2.170198 0.359422 6.038 1.56e-09 ***
Pclass_2 1.302429 0.271548 4.796 1.62e-06 ***
Pclass_3 NA NA NA NA
Sex_female 2.673335 0.227027 11.775 < 2e-16 ***
Sex_male NA NA NA NA
Age -0.031650 0.008942 -3.539 0.000401 ***
SibSp -0.248091 0.123290 -2.012 0.044193 *
Parch -0.090412 0.141914 -0.637 0.524069
Fare -0.001409 0.003175 -0.444 0.657116
Embarked_C -10.976346 535.411265 -0.021 0.983644
Embarked_Q -10.875858 535.411346 -0.020 0.983794
Embarked_S -11.410996 535.411254 -0.021 0.982996
Embarked_NA NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 891.99 on 666 degrees of freedom


Residual deviance: 605.57 on 656 degrees of freedom
AIC: 627.57

Number of Fisher Scoring iterations: 12

Interpreting Model Results

From the results, we can see that 'Parch', 'Fare' and 'Embarked' are not statistically significant,
and 'SibSp' is only borderline (p ≈ 0.044). Among the important variables, 'Sex_female' has the
lowest p-value, which means that the gender of the passenger has a strong association with the
target variable. The coefficient of 'Sex_female' is positive, which tells us that females have a
higher probability of surviving, holding all other variables constant. Being female increases the
log-odds of survival by 2.67 (equivalently, it multiplies the odds of survival by e^2.67 ≈ 14.5),
while a one-unit increase in age reduces the log-odds of survival by 0.032. We say that age
reduces the odds because its coefficient is negative.
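
Since glm() coefficients are on the log-odds scale, exponentiating them gives the odds ratios
directly; a quick check against the numbers above:

# Odds ratios from the fitted coefficients
exp(coef(model)["Sex_female"])  # e^2.673 ≈ 14.5, the odds multiplier for females
exp(coef(model)["Age"])         # e^-0.0317 ≈ 0.969, the odds multiplier per extra year of age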



Testing variable importance using Anova
## Using anova() to analyze the table of deviance
anova(model, test = "Chisq")

> anova(model, test="Chisq")


Analysis of Deviance Table
Model: binomial, link: logit

Response: Survived

Terms added sequentially (first to last)

Df Deviance Resid. Df Resid. Dev Pr(>Chi)


NULL 666 891.99
Pclass_1 1 39.603 665 852.39 3.112e-10 ***
Pclass_2 1 26.485 664 825.91 2.655e-07 ***
Pclass_3 0 0.000 664 825.91
Sex_female 1 197.978 663 627.93 < 2.2e-16 ***
Sex_male 0 0.000 663 627.93
Age 1 8.986 662 618.94 0.002721 **
SibSp 1 8.114 661 610.83 0.004393 **
Parch 1 0.998 660 609.83 0.317889
Fare 1 0.044 659 609.79 0.834588
Embarked_C 1 1.936 658 607.85 0.164139
Embarked_Q 1 2.067 657 605.78 0.150485
Embarked_S 1 0.218 656 605.57 0.640317
Embarked_NA 0 0.000 656 605.57
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

We use the anova() function to analyze the table of deviance. The anova() summary above tells
us what effect adding each variable has on the model relative to the null model; the difference
between the null deviance and the residual deviance is used to find that out. We can see that
adding 'Pclass', 'Sex', 'Age' and 'SibSp' reduces the residual deviance significantly, while the
other variables do not. The higher p-values for 'Parch', 'Fare' and 'Embarked' also indicate
that these variables are not statistically significant.
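
The same null-vs-residual deviance comparison can be done by hand for the model as a whole; a
small sketch referring the drop in deviance to a chi-square distribution:

# Overall model test: drop in deviance (891.99 - 605.57) on df.null - df.residual degrees of freedom
with(model, pchisq(null.deviance - deviance,
                   df.null - df.residual, lower.tail = FALSE))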



Predicting target variable and confusion matrix statistics

## Predicting Test Data
result <- predict(model, newdata = test, type = "response")
result <- ifelse(result > 0.5, 1, 0)

## Confusion matrix and statistics
library(caret)
# confusionMatrix() expects factors with matching levels, hence as.factor()
confusionMatrix(data = as.factor(result), reference = test$Survived)

> confusionMatrix(data = as.factor(result), reference = test$Survived)


Confusion Matrix and Statistics

Reference
Prediction 0 1
0 128 25
1 13 56

Accuracy : 0.8288
95% CI : (0.7727, 0.8759)
No Information Rate : 0.6351
P-Value [Acc > NIR] : 1.817e-10

Kappa : 0.6187
Mcnemar's Test P-Value : 0.07435

Sensitivity : 0.9078
Specificity : 0.6914
Pos Pred Value : 0.8366
Neg Pred Value : 0.8116
Prevalence : 0.6351
Detection Rate : 0.5766
Detection Prevalence : 0.6892
Balanced Accuracy : 0.7996

'Positive' Class : 0

From the confusion matrix we can see that for class 0 the misclassification count is 25, i.e. 25
passengers who actually survived (1) were predicted as 0 by the model. Similarly, for class 1 the
misclassification count is 13 (predicted 1, actually 0).
The model accuracy is 82.88%, which is good.
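
As a sanity check, the accuracy can be recomputed directly from the counts in the matrix:

# Accuracy = correct predictions / all predictions
(128 + 56) / (128 + 25 + 13 + 56)  # = 0.8288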



ROC (Receiver Operating Characteristic) Curve

## ROC Curve and calculating the area under the curve (AUC)
library(ROCR)
predictions <- predict(model, newdata = test, type = "response")
ROCRpred <- prediction(predictions, test$Survived)
ROCRperf <- performance(ROCRpred, measure = "tpr", x.measure = "fpr")

plot(ROCRperf, colorize = TRUE, text.adj = c(-0.2, 1.7), print.cutoffs.at = seq(0, 1, 0.1))

The ROC curve is used for calculating the AUC (area under the curve), which measures the
performance of a binary classifier. The ROC is a curve generated by plotting the true positive
rate (TPR) against the false positive rate (FPR) at various threshold settings, while the AUC is
the area under that curve. As a rule of thumb, a model with good predictive ability should have
an AUC closer to 1 (1 is ideal) than to 0.5.

> auc <- performance(ROCRpred, measure = "auc")


> auc <- auc@y.values[[1]]
> auc
[1] 0.8715086

