Let us assume that you are the owner of a small business, Regional Delivery Service, Inc. (RDS), which
offers same-day delivery for letters, packages, and other small cargo. You can use
Google Maps to group individual deliveries into one trip to reduce time and fuel costs.
Therefore, some trips will include more than one delivery. As the owner, you would like to be able
to estimate how long a delivery trip will take based on three factors: 1) the total trip distance
in miles, 2) the number of deliveries that must be made during the trip, and 3) the daily
price of gas/petrol in U.S. dollars.
For the predictive analysis, we take a random sample of 10 past trips and record four pieces of
information for each trip.
# install required packages
install.packages("car")
install.packages("SparseM")
install.packages("corrplot")
install.packages("PerformanceAnalytics")
# load libraries
library(car)
# dataset
test <- data.frame(
  TravelTime_y     = c(7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4),
  MilesTravel_x1   = c(89, 66, 78, 111, 44, 77, 80, 66, 109, 76),
  NumDeliveries_x2 = c(4, 1, 3, 6, 1, 3, 3, 2, 5, 3),
  GasPrice_x3      = c(3.84, 3.19, 3.78, 3.89, 3.57, 3.57, 3.03, 3.51, 3.54, 3.25))
# scatter-plot matrix to inspect pairwise correlations
plot(test)
Another method:
library(corrplot)
newdatacor = cor(test)
corrplot(newdatacor, method = "number")
library("PerformanceAnalytics")
chart.Correlation(test, histogram = TRUE)
[Model comparison table, garbled in extraction: each row is one candidate model, with the
predictors included (x1, x2, x3) marked, followed by its fit statistics, including R-squared,
adjusted R-squared, RMSE, F-statistic, p-value, AIC, and VIF.]
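The comparison statistics above come from fitting each candidate model separately. Below is a minimal sketch of how some of them could be fitted and compared in base R; the mapping of names m1, m2, and m7 to particular predictor subsets is my assumption, since the document does not spell it out:

```r
# Rebuild the dataset from above
test <- data.frame(
  TravelTime_y     = c(7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4),
  MilesTravel_x1   = c(89, 66, 78, 111, 44, 77, 80, 66, 109, 76),
  NumDeliveries_x2 = c(4, 1, 3, 6, 1, 3, 3, 2, 5, 3),
  GasPrice_x3      = c(3.84, 3.19, 3.78, 3.89, 3.57, 3.57, 3.03, 3.51, 3.54, 3.25))

# Candidate models (assumed numbering: m1 = x1 only, m2 = x2 only, m7 = all three)
m1 <- lm(TravelTime_y ~ MilesTravel_x1, data = test)
m2 <- lm(TravelTime_y ~ NumDeliveries_x2, data = test)
m7 <- lm(TravelTime_y ~ MilesTravel_x1 + NumDeliveries_x2 + GasPrice_x3, data = test)

# Collect the statistics cited in the conclusion below
data.frame(model  = c("m1", "m2", "m7"),
           adj.r2 = c(summary(m1)$adj.r.squared,
                      summary(m2)$adj.r.squared,
                      summary(m7)$adj.r.squared),
           AIC    = c(AIC(m1), AIC(m2), AIC(m7)))
```

VIFs for the multi-predictor models can then be obtained with car::vif(m7).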
Conclusion:
1. The correlation of x3 (GasPrice) with the response is very low (0.27), so the models that
include x3 (m3, m5, m6, m7) are not considered for model selection.
2. Models m7 and m4 are redundant because their VIF is > 10 and the correlation between x1
and x2 is high.
3. Model m1 has a higher adjusted R-squared, a higher F-statistic, a lower RMSE, and a lower
AIC than model m2, so model m1 is the best model.
After excluding the third variable (GasPrice_x3) based on the correlation matrix, we can also
verify the model with the stepwise AIC method.
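The output that follows can be produced with base R's step(), starting from the two-predictor model once x3 has been excluded; this call is a sketch of that search:

```r
# Rebuild the dataset for self-containment
test <- data.frame(
  TravelTime_y     = c(7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4),
  MilesTravel_x1   = c(89, 66, 78, 111, 44, 77, 80, 66, 109, 76),
  NumDeliveries_x2 = c(4, 1, 3, 6, 1, 3, 3, 2, 5, 3))

# Start from the full two-predictor model and let step() add/drop terms by AIC
full   <- lm(TravelTime_y ~ MilesTravel_x1 + NumDeliveries_x2, data = test)
m_step <- step(full, direction = "both")  # prints the AIC at each step
formula(m_step)
```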
Step: AIC=-19.67
TravelTime_y ~ MilesTravel_x1
Call:
lm(formula = TravelTime_y ~ MilesTravel_x1, data = test)
Coefficients:
(Intercept) MilesTravel_x1
3.18556 0.04026
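With the fitted coefficients above, a travel time can be predicted for a new trip. As an illustration (the 90-mile figure is my own invented example, not from the document):

```r
# Rebuild the dataset and refit the selected model
test <- data.frame(
  TravelTime_y   = c(7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4),
  MilesTravel_x1 = c(89, 66, 78, 111, 44, 77, 80, 66, 109, 76))
m1 <- lm(TravelTime_y ~ MilesTravel_x1, data = test)

# Predicted hours for a hypothetical 90-mile trip:
# 3.18556 + 0.04026 * 90, i.e. about 6.81 hours
predict(m1, newdata = data.frame(MilesTravel_x1 = 90))
```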
Some diagnostic plots for model m1:
par(mfrow = c(2, 2))
plot(m1)
• The first plot is Residuals vs Fitted. The residuals should be randomly distributed with no visible pattern. In the figure they look random, so the fit looks good.
• The Q-Q (quantile-quantile) plot checks normality of the residuals; the points should follow the straight line. It does not look perfect, but it is acceptable.
• The third plot is similar to the first but uses standardized residuals. It should also look random, and the chart looks OK.
• The Leverage plot identifies extreme (influential) points, and here it also looks good.
• More inferences could be drawn if the number of observations were higher.
Implementing Logistic Regression using Titanic dataset in R
Data Dictionary
Variable Notes
age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
test<-read.csv("C:/Users/Shashi/Desktop/Regression/test.csv",stringsAsFactors
= F,na.strings=c(""," ","NA"))
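The str(complete_data) output below implies that the train and test files were combined into one frame. A sketch of that step, assuming train.csv was read the same way and the test file lacks the Survived column (tiny stand-in frames are used here so the snippet is self-contained):

```r
# Stand-in frames; in the original, train.csv and test.csv would be
# read with read.csv() as shown above
train_raw <- data.frame(PassengerId = 1:2, Survived = c(0, 1), Pclass = c(3, 1))
test_raw  <- data.frame(PassengerId = 3:4, Pclass = c(3, 2))

test_raw$Survived <- NA  # the test set has no target variable
complete_data <- rbind(train_raw, test_raw[, names(train_raw)])
str(complete_data)
```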
> str(complete_data)
'data.frame': 1309 obs. of 12 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley
(Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle,
Mrs. Jacques Heath (Lily May Peel)" ...
$ Sex : chr "male" "female" "female" "female" ...
$ Age : num 22 38 26 35 35 28 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : chr NA "C85" NA "C123" ...
$ Embarked : chr "S" "C" "S" "S" ...
Data Splitting
Next, we split the dataset into training and testing sets. I'll split the train_data from the overall
data, as it contains the value of the target variable. There are 891 observations in the training
dataset, and I'll split them in a 75:25 ratio.
This will help us calculate the model accuracy. We can then use predict() to predict the target
variable of the test dataset.
## Splitting training and test data
train <- titanic_data[1:667,]
test <- titanic_data[668:889,]
## Model Creation
> summary(model)
Call:
glm(formula = Survived ~ ., family = binomial(link = "logit"),
data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3773 -0.6567 -0.4301 0.6397 2.3946
Response: Survived
We use the Anova() function to analyze the table of deviance. The anova() summary above tells
us the effect that adding each variable to the model has, compared against the null model.
The difference between the null deviance and the residual deviance is used to determine this.
We can see that adding 'Pclass', 'Sex', and 'Age' to the model reduces the residual deviance
significantly, whereas the other variables do not. The higher p-values for 'SibSp', 'Fare',
'Embarked', and 'Parch' also indicate that these variables are not statistically significant.
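A sketch of the model fit and deviance analysis described above, using a small synthetic stand-in for the Titanic training split (all values below are invented for illustration, not the real data):

```r
set.seed(1)
# Synthetic stand-in for the training split
train <- data.frame(
  Survived = rbinom(60, 1, 0.4),
  Pclass   = sample(1:3, 60, replace = TRUE),
  Sex      = factor(sample(c("male", "female"), 60, replace = TRUE)),
  Age      = runif(60, 1, 70))

# Logistic regression on all remaining columns, as in the Call: above
model <- glm(Survived ~ ., family = binomial(link = "logit"), data = train)

# Analysis of deviance: how much each added term reduces residual deviance
anova(model, test = "Chisq")
```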
Reference
Prediction 0 1
0 128 25
1 13 56
Accuracy : 0.8288
95% CI : (0.7727, 0.8759)
No Information Rate : 0.6351
P-Value [Acc > NIR] : 1.817e-10
Kappa : 0.6187
Mcnemar's Test P-Value : 0.07435
Sensitivity : 0.9078
Specificity : 0.6914
Pos Pred Value : 0.8366
Neg Pred Value : 0.8116
Prevalence : 0.6351
Detection Rate : 0.5766
Detection Prevalence : 0.6892
Balanced Accuracy : 0.7996
'Positive' Class : 0
From the confusion matrix we can see that for class 0 the misclassification count is 25, i.e. 25
observations were predicted as 1 by the model although their actual value was 0. Similarly, for
class 1 the misclassification count is 13.
The model accuracy is 82.88%, which is good.
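The confusion matrix above can be reproduced by thresholding the predicted probabilities at 0.5. A base-R sketch with invented labels and probabilities (the caret package's confusionMatrix() adds the extra statistics shown above, such as kappa and sensitivity):

```r
# Hypothetical true labels and predicted probabilities; in the original,
# prob would come from predict(model, test, type = "response")
actual <- c(0, 0, 0, 1, 1, 1)
prob   <- c(0.2, 0.4, 0.7, 0.3, 0.8, 0.9)

pred <- ifelse(prob > 0.5, 1, 0)          # classify at the 0.5 threshold
cm   <- table(Prediction = pred, Reference = actual)
cm

accuracy <- sum(diag(cm)) / sum(cm)       # fraction of correct predictions
accuracy
```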
The ROC curve is used for calculating the AUC (area under the curve), which measures the
performance of a binomial classifier. The ROC is a curve generated by plotting the true positive rate (TPR)
against the false positive rate (FPR) at various threshold settings, while the AUC is the area under the ROC
curve. As a rule of thumb, a model with good predictive ability should have an AUC closer to 1 (1 is ideal)
than to 0.5.
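The AUC can be computed without extra packages via its rank (Mann-Whitney) formulation; packages such as pROC provide roc() and auc() for the full curve. A sketch with toy labels and scores of my own invention:

```r
# Toy labels and classifier scores for illustration
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)

# AUC as the Mann-Whitney statistic: the probability that a randomly
# chosen positive outranks a randomly chosen negative
r  <- rank(scores)
n1 <- sum(labels == 1)
n0 <- sum(labels == 0)
auc <- (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc
```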