PREDICTION MODEL
KRISHNAVENI A -BACMAY19016
THUWAARAHAN RAGHUNATHAN -BACMAY19061
S.ANDREW-BACMAY19005
SRINEVETHA-BACMAY19036
NARMATHA B-BACMAY19023
MALAVIKA-BACMAY19019
Customer Churn in Telecom
Table of Contents
1. Project objective
2. Assumptions
3. Solution
3.1 Data Preparation
3.2 Exploratory Data Analysis
3.3 Model Building
3.4 Prediction
3.4 Conclusion
Appendix – Source Code
1. Project objective
The objective is to build a logistic regression model that predicts customer churn for the firm
from account information (AccountWeeks, ContractRenewal, DataPlan, DataUsage,
CustServCalls, DayMins, DayCalls, MonthlyCharge, OverageFee and RoamMins) and to interpret
the result.
The data is partitioned so that 70% is used for training and 30% for validating the
results.
2. Assumptions
3. Solution
The customer data covers 3333 unique customers. The structure of the dataset is shown
below.
Churn, ContractRenewal and DataPlan are stored as integers but take only binary values,
so they need to be converted to categorical variables (factors).
#Loading Data
data <- read.csv("Data-Table 1.csv", header = TRUE)
attach(data)
The variable “Churn” is the dependent variable and the remaining are independent variables.
str(data)
Before the variables are converted, the original data set is stored in another variable,
because the collinearity check (a correlation matrix) requires all variables to be numeric.
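# Keep a numeric copy of the data for the correlation check
data1 <- data
# Convert each binary integer column to a factor (one assignment per column)
data$Churn <- as.factor(data$Churn)
data$ContractRenewal <- as.factor(data$ContractRenewal)
data$DataPlan <- as.factor(data$DataPlan)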
Logistic regression requires the observations to be independent of one another, and the model
should have little or no multicollinearity; that is, the independent variables should not be
strongly correlated with each other. If multicollinearity exists, a factor analysis with
orthogonally rotated factors can be used to remove it.
library(corrplot)
corrplot(cor(data1))
From the above plot we can see that Data Usage and Data Plan are highly correlated. Monthly
Charge is also correlated with Data Usage, Data Plan and Day Mins.
Churn does not seem to be highly correlated with any of the variables but has some
correlation with Contract Renewal, Customer Service Calls and Day Mins.
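The strongly correlated pairs can also be listed numerically instead of being read off the plot. The following is a minimal sketch on simulated data: the column names mirror the dataset, but the values are made up and the 0.7 cutoff is an arbitrary assumption, not a value used in this report.

```r
# Simulated stand-in for the numeric telecom data (all values are made up).
set.seed(1)
n <- 200
DataUsage     <- runif(n, 0, 5)
DayMins       <- rnorm(n, 180, 50)
MonthlyCharge <- 20 + 10 * DataUsage + 0.1 * DayMins + rnorm(n, 0, 2)
RoamMins      <- rnorm(n, 10, 3)
sim <- data.frame(DataUsage, DayMins, MonthlyCharge, RoamMins)

cm <- cor(sim)                          # Pearson correlation matrix
cm[upper.tri(cm, diag = TRUE)] <- NA    # keep each pair only once

# Row/column positions of the pairs whose |r| exceeds the assumed cutoff.
idx <- which(abs(cm) > 0.7, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
print(high_pairs)
```

On the real data the same steps would be applied to data1, the numeric copy of the dataset.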
For predictive modelling, the data needs to be partitioned into train and test sets. 70% of
the data is used for training and the remaining 30% for testing.
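library(caTools)
set.seed(101)
split <- sample.split(data$Churn, SplitRatio = 0.70)
# Build the two partitions used by the model and the prediction steps
train <- subset(data, split == TRUE)
test <- subset(data, split == FALSE)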
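sample.split keeps the proportion of churners roughly the same in both partitions. The idea can be sketched in base R as below; the labels are simulated with an assumed churn rate, and names such as train_idx and is_train are hypothetical, so this illustrates the technique rather than the caTools implementation.

```r
# Simulated churn labels with an assumed churn rate of about 14.5%.
set.seed(101)
churn <- rbinom(1000, 1, 0.145)

# Draw 70% of the row indices separately within each class.
train_idx <- unlist(lapply(split(seq_along(churn), churn), function(rows) {
  rows[sample.int(length(rows), size = round(0.7 * length(rows)))]
}))
is_train <- seq_along(churn) %in% train_idx

# The churn rate should be nearly identical in the two partitions.
c(train = mean(churn[is_train]), test = mean(churn[!is_train]))
```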
Classification algorithms such as Logistic Regression, Decision Tree and Random Forest can
be used to predict churn. Several models can be fitted to the telecom dataset and compared
on performance and error rate to choose the best one. Here we use a logistic regression
model, fitted in R with the glm() function.
##
## Call:
## glm(formula = Churn ~ ., family = binomial, data = train)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.9567  -0.5245  -0.3533  -0.2085   3.0586
##
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -6.794667   0.664172 -10.230  < 2e-16 ***
## AccountWeeks     0.001942   0.001632   1.190  0.23388
## ContractRenewal -1.729601   0.175062  -9.880  < 2e-16 ***
## DataPlan        -1.023936   0.672866  -1.522  0.12807
## DataUsage        0.653364   2.292662   0.285  0.77566
## CustServCalls    0.516659   0.047201  10.946  < 2e-16 ***
## DayMins          0.024124   0.038632   0.624  0.53233
## DayCalls         0.004922   0.003199   1.539  0.12381
## MonthlyCharge   -0.066333   0.227172  -0.292  0.77029
## OverageFee       0.285698   0.387788   0.737  0.46128
## RoamMins         0.083332   0.025483   3.270  0.00108 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1930.4 on 2332 degrees of freedom
## Residual deviance: 1548.9 on 2322 degrees of freedom
## AIC: 1570.9
##
## Number of Fisher Scoring iterations: 6
From the model summary, we can see that the response variable Churn is significantly
affected by ContractRenewal, CustServCalls and RoamMins. The significance of each variable
is indicated by the codes (‘***’, ‘**’) in the summary report.
We can use the variance inflation factor (VIF) to detect multicollinearity between the
variables. Multicollinearity exists when two or more predictor variables are highly related to
each other, making it difficult to isolate the impact of an individual independent variable on
the dependent variable.
A variable with a VIF of 2 or less is generally considered safe and can be assumed not to be
correlated with the other variables. The higher the VIF, the stronger the correlation of that
independent variable with the others. In this analysis, values above 5 are treated as a sign of
problematic multicollinearity.
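For intuition, the classical VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. Below is a minimal base R sketch on simulated data; x1, x2 and x3 are hypothetical variables, not columns of the telecom dataset. The values that car::vif reports for a glm are derived from the fitted model and need not match this linear-regression version exactly.

```r
set.seed(1)
n  <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1, so highly collinear
x3 <- rnorm(n)                  # unrelated to the other two
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j
# on all remaining predictors.
vif_manual <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 2)
```

In this sketch x1 and x2 should show very large values, while x3 should stay close to 1.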
library(car)
vif(model)
## OverageFee RoamMins
## 212.736171 1.152772
The VIF results above show multicollinearity between the variables.
The variable with the highest VIF value, here MonthlyCharge, is removed from the model.
model <- glm(Churn ~ . - MonthlyCharge, data = train, family = binomial)
summary(model)
##
## Call:
## glm(formula = Churn ~ . - MonthlyCharge, family = binomial, data = train)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.9573  -0.5248  -0.3543  -0.2089   3.0479
##
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -6.827915   0.654530 -10.432  < 2e-16 ***
## AccountWeeks     0.001957   0.001631   1.200  0.23029
## ContractRenewal -1.730112   0.175059  -9.883  < 2e-16 ***
## DataPlan        -1.015929   0.672237  -1.511  0.13072
## DataUsage       -0.012695   0.230348  -0.055  0.95605
## CustServCalls    0.516474   0.047188  10.945  < 2e-16 ***
## DayMins          0.012850   0.001272  10.105  < 2e-16 ***
## DayCalls         0.004916   0.003197   1.538  0.12416
## OverageFee       0.172751   0.027129   6.368 1.92e-10 ***
## RoamMins         0.083536   0.025478   3.279  0.00104 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1930.4 on 2332 degrees of freedom
## Residual deviance: 1549.0 on 2323 degrees of freedom
## AIC: 1569
##
## Number of Fisher Scoring iterations: 6
The VIF for the model is computed again to check for multicollinearity.
library(car)
vif(model)
Since multicollinearity still exists, the variable with the next highest VIF value is removed.
Here that variable is DataUsage.
##
## Call:
## glm(formula = Churn ~ . - MonthlyCharge - DataUsage, family = binomial,
## data = train)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.9564  -0.5255  -0.3539  -0.2096   3.0471
##
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -6.824530   0.651568 -10.474  < 2e-16 ***
## AccountWeeks     0.001958   0.001631   1.201 0.229897
## ContractRenewal -1.729576   0.174769  -9.896  < 2e-16 ***
## DataPlan        -1.051666   0.178515  -5.891 3.83e-09 ***
## CustServCalls    0.516491   0.047182  10.947  < 2e-16 ***
## DayMins          0.012850   0.001271  10.106  < 2e-16 ***
## DayCalls         0.004917   0.003197   1.538 0.124051
## OverageFee       0.172764   0.027127   6.369 1.91e-10 ***
## RoamMins         0.083051   0.023903   3.474 0.000512 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1930.4 on 2332 degrees of freedom
## Residual deviance: 1549.0 on 2324 degrees of freedom
## AIC: 1567
##
## Number of Fisher Scoring iterations: 5
library(car)
vif(model)
The VIF values are now all below the threshold of 5. Multicollinearity has therefore been
reduced, and we stop with this model.
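library(lmtest)
lrtest(model)
The likelihood ratio test compares the fitted model with the intercept-only model and shows
that the fitted model has the significantly better log-likelihood.
library(pscl)
pR2(model)
McFadden's pseudo R-squared: about 19.75% of the uncertainty (deviance) in the intercept-only
model is explained by the full model.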
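Both figures can be cross-checked from the deviances printed in the model summary. For a 0/1 response the deviance equals -2 times the log-likelihood, so McFadden's pseudo R-squared is 1 minus the ratio of residual to null deviance, and the likelihood ratio statistic is their difference. A short worked sketch using the reported values (the variable names are chosen here for illustration):

```r
# Deviances and degrees of freedom copied from the model summary.
null_dev  <- 1930.4; null_df  <- 2332
resid_dev <- 1549.0; resid_df <- 2324

# McFadden's pseudo R-squared from the deviance ratio.
mcfadden_r2 <- 1 - resid_dev / null_dev

# Likelihood ratio statistic against the intercept-only model,
# referred to a chi-squared distribution.
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

round(c(McFadden = mcfadden_r2, LR = lr_stat, df = lr_df), 4)
p_value
```

The pseudo R-squared comes out at roughly 0.1976, in line with the reported 19.75%, and the test statistic of 381.4 on 8 degrees of freedom gives a very small p-value.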
#3 Coefficients importance
summary(model)
##
## Call:
## glm(formula = Churn ~ . - MonthlyCharge - DataUsage, family = binomial,
## data = train)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.9564  -0.5255  -0.3539  -0.2096   3.0471
##
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -6.824530   0.651568 -10.474  < 2e-16 ***
## AccountWeeks     0.001958   0.001631   1.201 0.229897
## ContractRenewal -1.729576   0.174769  -9.896  < 2e-16 ***
## DataPlan        -1.051666   0.178515  -5.891 3.83e-09 ***
## CustServCalls    0.516491   0.047182  10.947  < 2e-16 ***
## DayMins          0.012850   0.001271  10.106  < 2e-16 ***
## DayCalls         0.004917   0.003197   1.538 0.124051
## OverageFee       0.172764   0.027127   6.369 1.91e-10 ***
## RoamMins         0.083051   0.023903   3.474 0.000512 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1930.4 on 2332 degrees of freedom
## Residual deviance: 1549.0 on 2324 degrees of freedom
## AIC: 1567
##
## Number of Fisher Scoring iterations: 5
From the above summary, the AccountWeeks and DayCalls variables are found to be
insignificant, so we can remove them.
model <- glm(Churn ~ . - MonthlyCharge - DataUsage - AccountWeeks - DayCalls,
             data = train, family = binomial)
summary(model)
##
## Call:
## glm(formula = Churn ~ . - MonthlyCharge - DataUsage - AccountWeeks -
## DayCalls, family = binomial, data = train)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.9355  -0.5254  -0.3561  -0.2092   3.0291
##
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)
## (Intercept)      -6.09823    0.52534 -11.608  < 2e-16 ***
## ContractRenewal  -1.73567    0.17442  -9.951  < 2e-16 ***
## DataPlan         -1.05453    0.17832  -5.914 3.34e-09 ***
## CustServCalls     0.51404    0.04708  10.918  < 2e-16 ***
## DayMins           0.01285    0.00127  10.115  < 2e-16 ***
## OverageFee        0.16976    0.02706   6.273 3.55e-10 ***
## RoamMins          0.08438    0.02386   3.536 0.000406 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1930.4 on 2332 degrees of freedom
## Residual deviance: 1552.9 on 2326 degrees of freedom
## AIC: 1566.9
##
## Number of Fisher Scoring iterations: 5
Now the model has only variables that have significant influence on the churn variable.
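To interpret the size of each effect, the coefficients can be turned into odds ratios by exponentiating them. The sketch below copies the estimates from the final summary above; on the fitted object the same result would normally be obtained with exp(coef(model)).

```r
# Coefficient estimates copied from the final model summary.
coefs <- c(ContractRenewal = -1.73567, DataPlan = -1.05453,
           CustServCalls   =  0.51404, DayMins  =  0.01285,
           OverageFee      =  0.16976, RoamMins =  0.08438)

# exp(coefficient) = multiplicative change in the odds of churn for a
# one-unit increase in the predictor, holding the others fixed.
odds_ratios <- exp(coefs)
round(odds_ratios, 3)
```

For example, each additional customer service call multiplies the odds of churn by roughly 1.67, while a contract renewal multiplies them by roughly 0.18.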
3.4 Prediction
The model was built using the train dataset and now we can evaluate the model using the
test dataset. The behaviour of the model on the test dataset can be understood by analysing
its accuracy and error rate on that dataset.
A confusion matrix is used to understand the performance of the classification model on the
test data. It is a table that cross-tabulates the actual values with the predicted values, giving
the counts of correctly and wrongly classified customers.
True Positive: predicted as positive, and the prediction is correct.
True Negative: predicted as negative, and the prediction is correct.
False Positive (Type 1 Error): predicted as positive, but the prediction is wrong.
False Negative (Type 2 Error): predicted as negative, but the prediction is wrong.
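The four cells can be obtained in R by cross-tabulating actual and predicted classes with table(). The example below is a small sketch with made-up labels and probabilities for eight customers; the 0.5 cutoff and all variable names are assumptions for illustration.

```r
# Made-up actual labels (1 = churn) and predicted probabilities.
actual    <- c(0, 0, 0, 0, 1, 1, 1, 0)
prob      <- c(0.10, 0.35, 0.62, 0.20, 0.81, 0.44, 0.57, 0.05)
predicted <- as.integer(prob > 0.5)   # classify with a 0.5 cutoff

# Rows are the actual classes, columns the predicted classes.
cm <- table(Actual = actual, Predicted = predicted)
print(cm)

TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["0", "1"]; FN <- cm["1", "0"]
c(TP = TP, TN = TN, FP = FP, FN = FN)
```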
Accuracy of the model using the ROC curve and the Area Under the Curve: The ROC curve uses
the true positive (TP) and false positive (FP) rates to summarize the classifier’s performance.
It is a plot of sensitivity against 1 - specificity across all classification cut-offs.
The Area Under the Curve (AUC) is a performance metric for a ROC curve. The higher the area
under the curve, the better the predictive power of the model.
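The AUC can also be read as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. Under that reading it can be computed from ranks, as in the following base R sketch. The function name auc_rank and the toy data are hypothetical; this is not how the pROC package computes it internally.

```r
# Rank-based (Mann-Whitney) AUC for binary labels coded 0/1.
auc_rank <- function(labels, scores) {
  r  <- rank(scores)          # average ranks, so ties are handled
  n1 <- sum(labels == 1)      # number of positives
  n0 <- sum(labels == 0)      # number of negatives
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Made-up labels and scores for eight customers.
labels <- c(0, 0, 0, 0, 1, 1, 1, 0)
scores <- c(0.10, 0.35, 0.62, 0.20, 0.81, 0.44, 0.57, 0.05)
auc_rank(labels, scores)
```

In this toy example 13 of the 15 churner/non-churner pairs are ordered correctly, so the result is 13/15.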
Accuracy
# Using probability cutoff of 50%.
##
##     FALSE TRUE
##   0   835   20
##   1   113   32
#Sensitivity
32/(32+113)
## [1] 0.2206897
Accuracy: 0.867
Sensitivity: 0.2206897
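The reported figures follow directly from the counts in the table for the 50% cutoff. The sketch below assumes that the rows are the actual classes (0/1) and the columns the prediction (TRUE = predicted churn), which is consistent with the sensitivity calculation shown; specificity and precision are added here for context and are not reported in the original analysis.

```r
# Counts read from the confusion matrix at the 0.5 cutoff.
TN <- 835; FP <- 20
FN <- 113; TP <- 32

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)   # share of actual churners that are caught
specificity <- TN / (TN + FP)   # share of non-churners correctly kept
precision   <- TP / (TP + FP)   # share of predicted churners that churn

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```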
# ROC Curve
library(pROC)
plot.roc(test$Churn,prediction)
auc(test$Churn,prediction)
## Area under the curve: 0.8247
The area under the curve for our optimized model is 0.8247, which suggests that the model
discriminates well between churners and non-churners.
As seen above, a cutoff of 0.50 gives good accuracy, but the sensitivity is very low. Hence, we
need to find a probability cutoff that gives a better balance between accuracy and sensitivity.
Let us lower the threshold to improve the sensitivity.
Accuracy with a lower threshold
#Sensitivity
95/(95+50)
## [1] 0.6551724
Accuracy: 0.806
Sensitivity: 0.6551724
Here we see that lowering the threshold has improved the sensitivity, but the accuracy has
gone down. Lowering the threshold therefore helps improve the model's sensitivity at the cost
of some accuracy.
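The trade-off can be made visible by evaluating several cutoffs side by side. The sketch below uses simulated labels and scores, since the report does not state the lowered threshold or show the prediction code; the cutoff grid and the helper name sweep_cutoff are assumptions for illustration.

```r
set.seed(42)
n <- 1000
actual <- rbinom(n, 1, 0.15)   # simulated churn labels
# Made-up scores: churners tend to receive higher probabilities.
prob <- plogis(rnorm(n, mean = ifelse(actual == 1, -0.5, -2.5), sd = 1))

# Accuracy and sensitivity for one probability cutoff.
sweep_cutoff <- function(cutoff) {
  pred <- as.integer(prob > cutoff)
  c(cutoff      = cutoff,
    accuracy    = mean(pred == actual),
    sensitivity = sum(pred == 1 & actual == 1) / sum(actual == 1))
}

results <- t(sapply(seq(0.1, 0.5, by = 0.1), sweep_cutoff))
round(results, 3)
```

Sensitivity can only fall or stay the same as the cutoff rises, so the table makes it easy to pick a cutoff that keeps sensitivity acceptable without giving up too much accuracy.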
3.4 Conclusion:
The final model indicates that the response variable ‘churn’ is affected by the variables
ContractRenewal, DataPlan, CustServCalls, DayMins, OverageFee, and RoamMins.
The signs of the coefficients are consistent with expectation. ContractRenewal and
DataPlan have negative coefficients: a customer who has recently renewed the contract
(value = 1) or who has a data plan (value = 1) is less likely to churn.
With the lowered threshold, the overall performance of the model is good, with an accuracy of
0.806 and a sensitivity of 0.655.
Appendix – Source Code
R Markdown document: R_markdown.docx