CustomerChurn Assignment

CUSTOMER CHURN
PREDICTION MODEL
KRISHNAVENI A -BACMAY19016
THUWAARAHAN RAGHUNATHAN -BACMAY19061
S.ANDREW-BACMAY19005
SRINEVETHA-BACMAY19036
NARMATHA B-BACMAY19023
MALAVIKA-BACMAY19019
Customer Churn in Telecom
Table of Contents
1. Project objective
2. Assumptions
3. Solution
3.1 Data Preparation:
3.2 Exploratory Data Analysis
3.3 Model Building
3.4 Prediction
3.4 Conclusion:
Appendix – Source Code
1. Project objective
The objective is to build a logistic regression model that predicts customer churn for the firm
from account information (AccountWeeks, ContractRenewal, DataPlan, DataUsage,
CustServCalls, DayMins, DayCalls, MonthlyCharge, OverageFee and RoamMins) and to interpret
the result.
The data are partitioned into 70% for training and 30% for validating the results.
2. Assumptions
The logistic regression method assumes that:

• The outcome is binary or dichotomous, such as yes vs no, positive vs negative, or 1 vs 0.
• Since logistic regression models P(Y=1) as the probability of the event occurring, factor
level 1 of the outcome variable should represent the desired outcome.
• The logit of the outcome is linear in the independent variables. The logit function is
logit(p) = log(p/(1-p)), where p is the probability of the outcome.
• There are no influential values (extreme values or outliers) in the continuous predictors.
• There is little or no multicollinearity among the independent variables.
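As a small illustration of the logit function named in the assumptions, the following base R sketch maps a few probabilities to log-odds and back. The helper names logit and inv_logit are introduced here for illustration only and are not part of the original analysis.

```r
# Logit (log-odds) and its inverse, written out explicitly.
logit <- function(p) log(p / (1 - p))
inv_logit <- function(x) 1 / (1 + exp(-x))

p <- c(0.1, 0.5, 0.9)
log_odds <- logit(p)
print(log_odds)             # a probability of 0.5 maps to a log-odds of 0
print(inv_logit(log_odds))  # recovers the original probabilities

# Base R provides the same pair as qlogis() and plogis().
print(all.equal(log_odds, qlogis(p)))
```

The model is linear on the log-odds scale, which is why the coefficients are later read as changes in log-odds rather than changes in probability.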
3. Solution
We shall follow the steps below to build the model:

• Data Preparation
• Exploratory Data Analysis
• Model Building
• Prediction
• Conclusion
3.1 Data Preparation:
The customer data contains 3333 unique customers. The structure of the dataset is as follows.
• The number of rows (observations) in the dataset is 3333.
• The number of columns (variables) in the dataset is 11.
• Churn, ContractRenewal and DataPlan are integer variables with binary values, which
need to be converted to categorical variables.
# Loading data
data <- read.csv("Data-Table 1.csv", header = TRUE)
# attach() makes the columns available by name (e.g. Churn)
attach(data)
The given data includes the following variables:

Variable          Description
Churn             1 if customer cancelled service, 0 if not
AccountWeeks      number of weeks customer has had active account
ContractRenewal   1 if customer recently renewed contract, 0 if not
DataPlan          1 if customer has data plan, 0 if not
DataUsage         gigabytes of monthly data usage
CustServCalls     number of calls into customer service
DayMins           average daytime minutes per month
DayCalls          average number of daytime calls
MonthlyCharge     average monthly bill
OverageFee        largest overage fee in last 12 months
RoamMins          average number of roaming minutes
The variable “Churn” is the dependent variable and the remaining are independent variables.
str(data)
## 'data.frame': 3333 obs. of 11 variables:

## $ Churn : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AccountWeeks : int 128 107 137 84 75 118 121 147 117 141 ...
## $ ContractRenewal: int 1 1 1 0 0 0 1 0 1 0 ...
## $ DataPlan : int 1 1 0 0 0 0 1 0 0 1 ...
## $ DataUsage : num 2.7 3.7 0 0 0 0 2.03 0 0.19 3.02 ...
## $ CustServCalls : int 1 1 0 2 3 0 3 0 1 0 ...
## $ DayMins : num 265 162 243 299 167 ...
## $ DayCalls : int 110 123 114 71 113 98 88 79 97 84 ...
## $ MonthlyCharge : num 89 82 52 57 41 57 87.3 36 63.9 93.2 ...
## $ OverageFee : num 9.87 9.78 6.06 3.1 7.42 ...
## $ RoamMins : num 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
Before the variables are converted, the original data set is copied to another variable,
because the collinearity check (correlation matrix) requires numeric variables.
data1 <- data
As mentioned in the assumptions, logistic regression assumes a linear relationship between the
independent variables and the log odds. It does not require the dependent and independent
variables themselves to be linearly related. If the relationship with the log odds is not linear,
the test underestimates the strength of the relationship and may reject it as insignificant.
One solution to this problem is to categorize the independent variables, that is, to transform
numeric variables into categorical ones and then include them in the model.
The Churn, ContractRenewal and DataPlan variables are binary in nature but are stored in the
data as integers. They are therefore converted to factor variables so that R treats them as
categorical.
data$Churn <- as.factor(data$Churn)
data$ContractRenewal <- as.factor(data$ContractRenewal)
data$DataPlan <- as.factor(data$DataPlan)
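A minimal sketch of converting several binary columns in one step is given below. It runs on a small made-up data frame (toy), not on the report's data, and the column values are illustrative only.

```r
# Toy data frame standing in for the churn data (values are illustrative).
toy <- data.frame(
  Churn = c(0L, 1L, 0L, 0L),
  ContractRenewal = c(1L, 0L, 1L, 1L),
  DataPlan = c(1L, 0L, 0L, 1L),
  DayMins = c(265.1, 161.6, 243.4, 299.4)
)

# Convert only the binary columns; numeric predictors are left untouched.
binary_cols <- c("Churn", "ContractRenewal", "DataPlan")
toy[binary_cols] <- lapply(toy[binary_cols], factor, levels = c(0, 1))

str(toy)
```

Fixing the levels to 0 and 1 keeps level 1 as the event of interest even if a subset happens to contain only one class.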
3.2 Exploratory Data Analysis
Logistic regression requires each observation to be independent. The model should also have
little or no multicollinearity, that is, the independent variables should not be strongly
correlated with each other. If multicollinearity exists, a factor analysis with orthogonally
rotated factors can be done to remove the collinearity.
The corrplot is used to check for correlation between the variables.
library(corrplot)
corrplot(cor(data1))
From the above plot we can see that DataUsage and DataPlan are highly correlated.
MonthlyCharge is also correlated with DataUsage, DataPlan and DayMins.
Churn does not seem to be highly correlated with any of the variables but has some
correlation with ContractRenewal, CustServCalls and DayMins.
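Reading pairs off a correlation plot can be complemented by listing them directly. The sketch below uses simulated data (not the report's data) and a cut-off of 0.7 that is an assumption chosen purely for illustration.

```r
set.seed(1)
n <- 200
usage <- runif(n, 0, 5)
sim <- data.frame(
  DataUsage = usage,
  MonthlyCharge = 30 + 10 * usage + rnorm(n, sd = 2),  # built to track DataUsage
  DayCalls = rnorm(n, mean = 100, sd = 20)             # independent of the rest
)

r <- cor(sim)
# Keep each pair once (upper triangle) and flag |r| above a hypothetical 0.7.
idx <- which(abs(r) > 0.7 & upper.tri(r), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r = r[idx]
)
print(high_pairs)
```

On the real data the same steps would be applied to cor(data1), which holds the numeric copy of the data set.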
3.3 Model Building
In predictive modelling, the data need to be partitioned into training and test sets. 70% of
the data is used for training and 30% for testing.
library(caTools)
set.seed(101)
split <- sample.split(Churn, SplitRatio = 0.70)
# Get training and test data
train <- subset(data, split == TRUE)
test <- subset(data, split == FALSE)
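To show what a split that preserves the class ratio does, here is a base R sketch on simulated labels. The helper stratified_split is hypothetical and written for illustration; it is not the caTools implementation, although sample.split is likewise documented as preserving the relative ratios of the labels.

```r
# Hypothetical helper: stratified split that keeps the class ratio in both parts.
stratified_split <- function(y, ratio = 0.70) {
  in_train <- logical(length(y))
  for (cls in unique(y)) {
    members <- which(y == cls)
    n_pick <- round(ratio * length(members))
    picked <- members[sample.int(length(members), n_pick)]
    in_train[picked] <- TRUE
  }
  in_train
}

set.seed(101)
y <- rep(c(0, 1), times = c(850, 150))  # simulated labels, 15% churners
in_train <- stratified_split(y, 0.70)

print(table(y, in_train))
print(mean(y[in_train]))    # churn rate in the training part
print(mean(y[!in_train]))   # churn rate in the test part
```

Keeping the churn rate equal in both parts matters here because churners are a minority, so a purely random split could leave the test set with too few of them.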
Classification algorithms such as Logistic Regression, Decision Tree and Random Forest can
be used to predict churn. Multiple models can be run on the telecom dataset to compare their
performance and error rate and to choose the best model. Here we use a logistic regression
model, fitted in R with the glm function.
model <-glm (Churn ~., data = train, family = binomial)

summary(model)
##
## Call:
## glm(formula = Churn ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9567 -0.5245 -0.3533 -0.2085 3.0586
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.794667 0.664172 -10.230 < 2e-16 ***
## AccountWeeks 0.001942 0.001632 1.190 0.23388
## ContractRenewal -1.729601 0.175062 -9.880 < 2e-16 ***
## DataPlan -1.023936 0.672866 -1.522 0.12807
## DataUsage 0.653364 2.292662 0.285 0.77566
## CustServCalls 0.516659 0.047201 10.946 < 2e-16 ***
## DayMins 0.024124 0.038632 0.624 0.53233
## DayCalls 0.004922 0.003199 1.539 0.12381
## MonthlyCharge -0.066333 0.227172 -0.292 0.77029
## OverageFee 0.285698 0.387788 0.737 0.46128
## RoamMins 0.083332 0.025483 3.270 0.00108 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1930.4 on 2332 degrees of freedom
## Residual deviance: 1548.9 on 2322 degrees of freedom
## AIC: 1570.9
##
## Number of Fisher Scoring iterations: 6
From the model summary, we can see that the response variable Churn is significantly affected
by the ContractRenewal, CustServCalls and RoamMins variables. The significance level of each
variable is indicated by the asterisks ('***', '**') in the summary output.
We can use the variance inflation factor (VIF) to detect multicollinearity between the variables.
Multicollinearity exists when two or more predictor variables are highly related to each other,
making it difficult to understand the impact of an independent variable on the dependent
variable.
A variable with a VIF of 2 or less is generally considered safe, and it can be assumed that it is
not correlated with the other variables. The higher the VIF, the stronger the correlation of that
variable with the other independent variables. In this report, values above 5 are treated as a
sign of problematic multicollinearity.
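The idea behind the VIF can be sketched by hand: regress each predictor on the others and compute 1 / (1 - R-squared). The example below uses simulated predictors (x1, x2, x3 are made up for illustration); values reported by car::vif() on a fitted glm may differ somewhat from this linear-regression version.

```r
set.seed(42)
n <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1, so strongly collinear
x3 <- rnorm(n)                  # unrelated to x1 and x2
X <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on the others.
manual_vif <- sapply(names(X), function(v) {
  fit <- lm(reformulate(setdiff(names(X), v), response = v), data = X)
  1 / (1 - summary(fit)$r.squared)
})
print(round(manual_vif, 2))
```

The two collinear predictors get large values while the unrelated one stays close to 1, which is the same pattern as MonthlyCharge and DataUsage versus AccountWeeks in the report.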
library(car)
vif(model)
## AccountWeeks ContractRenewal DataPlan DataUsage
## 1.003354 1.053949 14.483791 1456.897591
## CustServCalls DayMins DayCalls MonthlyCharge
## 1.086643 962.504267 1.006257 2683.990056
## OverageFee RoamMins
## 212.736171 1.152772
From the above VIF results, we can see strong multicollinearity: DataUsage, DayMins,
MonthlyCharge and OverageFee all have very large VIF values.
The variable with the highest VIF value, MonthlyCharge, is removed from the model.
model <-glm (Churn ~.-MonthlyCharge, data = train, family = binomial)
summary(model)
##
## Call:
## glm(formula = Churn ~ . - MonthlyCharge, family = binomial, data = train)
##
## -1.9573 -0.5248 -0.3543 -0.2089 3.0479
##
## Coefficients:
## (Intercept) -6.827915 0.654530 -10.432 < 2e-16 ***
## AccountWeeks 0.001957 0.001631 1.200 0.23029
## DataPlan -1.015929 0.672237 -1.511 0.13072
## DataUsage -0.012695 0.230348 -0.055 0.95605
## CustServCalls 0.516474 0.047188 10.945 < 2e-16 ***
## DayMins 0.012850 0.001272 10.105 < 2e-16 ***
## DayCalls 0.004916 0.003197 1.538 0.12416
## OverageFee 0.172751 0.027129 6.368 1.92e-10 ***
## RoamMins 0.083536 0.025478 3.279 0.00104 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## AIC: 1569
##
The VIF for the model is computed again to check for multicollinearity
library(car)
vif(model)
## AccountWeeks ContractRenewal DataPlan DataUsage

## 1.002458 1.053911 14.451987 14.702171
## CustServCalls DayMins DayCalls OverageFee
## 1.086420 1.042337 1.006179 1.041532
## RoamMins
## 1.152124
Multicollinearity is still present: DataPlan and DataUsage both have VIF values above 14.
The variable with the next highest VIF value, DataUsage, is removed next.
model <- glm(Churn ~ . - MonthlyCharge - DataUsage, data = train, family = binomial)
summary(model)
##
## Call:
## glm(formula = Churn ~ . - MonthlyCharge - DataUsage, family = binomial,
## data = train)
##
## -1.9564 -0.5255 -0.3539 -0.2096 3.0471
##
## Coefficients:
## (Intercept) -6.824530 0.651568 -10.474 < 2e-16 ***
## AccountWeeks 0.001958 0.001631 1.201 0.229897
## DataPlan -1.051666 0.178515 -5.891 3.83e-09 ***
## CustServCalls 0.516491 0.047182 10.947 < 2e-16 ***
## DayMins 0.012850 0.001271 10.106 < 2e-16 ***
## DayCalls 0.004917 0.003197 1.538 0.124051
## OverageFee 0.172764 0.027127 6.369 1.91e-10 ***
## RoamMins 0.083051 0.023903 3.474 0.000512 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## AIC: 1567
##
library(car)
vif(model)
## AccountWeeks ContractRenewal DataPlan CustServCalls

## 1.002287 1.050565 1.019219 1.086336
## DayMins DayCalls OverageFee RoamMins
## 1.042283 1.006125 1.041400 1.014592
All VIF values are now below the threshold of 5. Hence the multicollinearity has been reduced
and we stop with this model.
• Model 2 - Without the Correlated Variables
Now we can check the fit of the model.

#1 Identify overall fitness of model using log likelihood ratio test
library(lmtest)
lrtest(model)
## Likelihood ratio test

##
## Model 1: Churn ~ (AccountWeeks + ContractRenewal + DataPlan + DataUsage +
## CustServCalls + DayMins + DayCalls + MonthlyCharge + OverageFee +
## RoamMins) - MonthlyCharge - DataUsage
## Model 2: Churn ~ 1
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 9 -774.51
## 2 1 -965.21 -8 381.41 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The likelihood ratio test is highly significant (p < 2.2e-16), so the fitted model fits the data
significantly better than the intercept-only model.
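The test statistic can be reproduced from the two log-likelihoods. The sketch below uses the more precise values printed in the pR2 output in this section; the variable names are introduced here for illustration.

```r
ll_model <- -774.5060773   # log-likelihood of the fitted model
ll_null  <- -965.2094900   # log-likelihood of the intercept-only model
df_diff  <- 8              # 9 parameters versus 1

# Likelihood ratio statistic and its chi-squared p-value.
lr_stat <- 2 * (ll_model - ll_null)
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

print(round(lr_stat, 2))   # 381.41, matching the Chisq column of lrtest
print(p_value)
```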
#2 Calculating McFadden's R-squared (minimum McFadden R-squared considered for model fitness is 10%)
library(pscl)
pR2(model)
## llh llhNull G2 McFadden r2ML

## -774.5060773 -965.2094900 381.4068254 0.1975772 0.1508194
## r2CU
## 0.2679647
19.76% of the uncertainty in the intercept-only model is explained by the fitted model
(McFadden R-squared = 0.1976).
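McFadden's R-squared follows directly from the two log-likelihoods in the pR2 output. The sketch below also recomputes r2ML, assuming 2333 training rows (the 2332 null degrees of freedom in the model summary plus one); that row count is inferred, not stated in the report.

```r
ll_model <- -774.5060773   # llh from the pR2 output
ll_null  <- -965.2094900   # llhNull from the pR2 output
n <- 2333                  # assumed number of training rows (2332 df + 1)

# McFadden: 1 minus the ratio of the log-likelihoods.
mcfadden <- 1 - ll_model / ll_null
# Maximum-likelihood R-squared, based on the likelihood ratio.
r2_ml <- 1 - exp(2 * (ll_null - ll_model) / n)

print(round(mcfadden, 4))
print(round(r2_ml, 4))
```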
#3 Coefficients importance
summary(model)
##
## Call:
## glm(formula = Churn ~ . - MonthlyCharge - DataUsage, family = binomial,
## data = train)
##
## -1.9564 -0.5255 -0.3539 -0.2096 3.0471
##
## Coefficients:
## (Intercept) -6.824530 0.651568 -10.474 < 2e-16 ***
## AccountWeeks 0.001958 0.001631 1.201 0.229897
## DataPlan -1.051666 0.178515 -5.891 3.83e-09 ***
## CustServCalls 0.516491 0.047182 10.947 < 2e-16 ***
## DayMins 0.012850 0.001271 10.106 < 2e-16 ***
## DayCalls 0.004917 0.003197 1.538 0.124051
## OverageFee 0.172764 0.027127 6.369 1.91e-10 ***
## RoamMins 0.083051 0.023903 3.474 0.000512 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## AIC: 1567
##
• Model 3 - Remove the insignificant variables
From the above summary result, the AccountWeeks and DayCalls variables are not statistically
significant (p-values of 0.23 and 0.12), hence we can remove them.
model <-glm (Churn ~.-MonthlyCharge -DataUsage -AccountWeeks -DayCalls, data =
train, family = binomial)
summary(model)
##
## Call:
## glm(formula = Churn ~ . - MonthlyCharge - DataUsage - AccountWeeks -
## DayCalls, family = binomial, data = train)
##
## -1.9355 -0.5254 -0.3561 -0.2092 3.0291
##
## Coefficients:
## (Intercept) -6.09823 0.52534 -11.608 < 2e-16 ***
## DataPlan -1.05453 0.17832 -5.914 3.34e-09 ***
## CustServCalls 0.51404 0.04708 10.918 < 2e-16 ***
## DayMins 0.01285 0.00127 10.115 < 2e-16 ***
## OverageFee 0.16976 0.02706 6.273 3.55e-10 ***
## RoamMins 0.08438 0.02386 3.536 0.000406 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## AIC: 1566.9
##
Now the model has only variables that have significant influence on the churn variable.
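To help interpret the result, the coefficients can be exponentiated into odds ratios. The sketch below copies the Model 3 estimates that appear in the summary output above; ContractRenewal is left out because its row is not shown in that output.

```r
# Model 3 estimates copied from the summary output above.
coefs <- c(
  DataPlan = -1.05453,
  CustServCalls = 0.51404,
  DayMins = 0.01285,
  OverageFee = 0.16976,
  RoamMins = 0.08438
)

# exp(coefficient) is the multiplicative change in the odds of churn
# for a one-unit increase in the variable, other variables held fixed.
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))
```

For example, each additional customer service call multiplies the odds of churn by about 1.67, while having a data plan multiplies them by about 0.35.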
3.4 Prediction
The model was built using the train dataset and now we can evaluate the model using the
test dataset. The behaviour of the model on the test dataset can be understood by analysing
the Accuracy and error rate of the model for the test dataset.
A confusion matrix is used to understand the performance of the classification model on test
data. It is a table that cross-tabulates the actual values against the predicted values, giving
the counts of correctly classified and wrongly classified customers.
True Positive (TP):
Interpretation: Predicted as positive, and the actual value is positive.
True Negative (TN):
Interpretation: Predicted as negative, and the actual value is negative.
False Positive (FP, Type 1 Error):
Interpretation: Predicted as positive, but the actual value is negative.
False Negative (FN, Type 2 Error):
Interpretation: Predicted as negative, but the actual value is positive.
Some metrics that are computed for a binary classifier:

Accuracy: Overall, how often the classifier predicts correctly.
(TP+TN)/Total
Misclassification Rate (aka Error Rate): Overall, how often the model is wrong.
(FP+FN)/Total
This is equivalent to 1 minus Accuracy.
True Positive Rate (Sensitivity): How often actual "yes" events are correctly predicted as "yes".
TP/actual yes
False Positive Rate: How often actual "no" events are wrongly predicted as "yes".
FP/actual no
Specificity: How often actual "no" events are correctly predicted as "no".
TN/actual no
Precision: When an event is predicted "yes", how often is it correct?
TP/predicted yes
Prevalence: How often does the "yes" event occur in the sample?
Actual yes/Total
Accuracy of the model using the ROC curve and Area Under the Curve: The ROC curve uses the
true positive rate and the false positive rate to summarize the classifier's performance. It is a
plot of sensitivity (true positive rate) against 1 - specificity (false positive rate) across all
classification cut-offs.
The Area Under the Curve (AUC) is a performance metric for a ROC curve. The higher the area
under the curve, the better the predictive power of the model.
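The AUC can also be read as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The base R sketch below computes it from ranks on four made-up observations; rank_auc is a hypothetical helper for illustration and is not the pROC implementation.

```r
# AUC as the probability that a random positive outranks a random negative.
rank_auc <- function(labels, scores) {
  r <- rank(scores)                 # ties receive average ranks
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)
print(rank_auc(labels, scores))   # 0.75: three of the four pairs are ordered correctly
```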
## Building confusion matrix
prediction <- predict(model, type = "response", newdata = test)
• Accuracy
# Using probability cutoff of 50%.
table(test$Churn, prediction > 0.5)
##
## FALSE TRUE
## 0 835 20
## 1 113 32
# Accuracy | Logistic Model

(835+32)/(835+32+113+20)
## [1] 0.867
#Sensitivity
32/(32+113 )
## [1] 0.2206897
#Baseline Model Accuracy

nrow(data[data$Churn ==0,])/nrow(data)
## [1] 0.8550855
Using 50% cutoff, the Accuracy and Sensitivity metrics are
Accuracy: 0.867
Sensitivity: 0.2206897
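The remaining metrics defined earlier in this section can be computed from the same table. The sketch below hard-codes the four counts from the 50% cutoff table above; the variable names are chosen here for illustration.

```r
# Counts from the 50% cutoff table above (rows = actual, columns = predicted).
tn <- 835; fp <- 20
fn <- 113; tp <- 32
total <- tn + fp + fn + tp

metrics <- c(
  accuracy = (tp + tn) / total,
  error_rate = (fp + fn) / total,
  sensitivity = tp / (tp + fn),
  specificity = tn / (tn + fp),
  false_positive_rate = fp / (tn + fp),
  precision = tp / (tp + fp),
  prevalence = (tp + fn) / total
)
print(round(metrics, 4))
```

The high specificity next to the low sensitivity shows that, at this cutoff, the model rarely flags loyal customers but misses most churners.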
# ROC Curve
library(pROC)
plot.roc(test$Churn,prediction)
auc(test$Churn,prediction)
## Area under the curve: 0.8247
The area under the curve for our final model is 0.8247, which suggests that the model
discriminates well between churners and non-churners.
As seen above, a cutoff of 0.50 gives good accuracy (0.867, slightly above the baseline of
0.855), but the sensitivity is very low (0.22). Hence, we need a probability cutoff that gives a
better balance between accuracy and sensitivity. Let us lower the threshold to improve the
sensitivity.
• Accuracy/lower threshold
We lower the cutoff to 20% (0.2) to improve the sensitivity.

table(test$Churn, prediction > 0.2)
##
## FALSE TRUE
## 0 711 144
## 1 50 95
#Logistic Model Accuracy

(711+95)/(711+95+144+50)
## [1] 0.806
#Sensitivity
95/(95+50 )
## [1] 0.6551724
Accuracy: 0.806
Sensitivity: 0.6551724
Here we see that lowering the threshold has improved the sensitivity (from 0.22 to 0.66), but
the accuracy has gone down (from 0.867 to 0.806). Lowering the threshold therefore trades
some accuracy for better model sensitivity.
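The trade-off can be examined over a range of cutoffs rather than only two. The sketch below uses simulated labels and scores (not the report's predictions), and sweep_cutoffs is a hypothetical helper introduced for illustration.

```r
set.seed(7)
n <- 1000
actual <- rbinom(n, 1, 0.15)   # simulated labels, roughly 15% churners
# Simulated scores: churners tend to score higher than non-churners.
score <- plogis(rnorm(n, mean = ifelse(actual == 1, -0.5, -2.5)))

# Accuracy and sensitivity at each cutoff, one row per cutoff.
sweep_cutoffs <- function(actual, score, cutoffs) {
  t(sapply(cutoffs, function(cut) {
    pred <- as.integer(score > cut)
    c(cutoff = cut,
      accuracy = mean(pred == actual),
      sensitivity = sum(pred == 1 & actual == 1) / sum(actual == 1))
  }))
}

result <- sweep_cutoffs(actual, score, seq(0.1, 0.9, by = 0.1))
print(round(result, 3))
```

Sensitivity can only fall as the cutoff rises, so the choice of cutoff depends on how costly a missed churner is compared with a false alarm.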
3.4 Conclusion:
The final model indicates that the response variable Churn is affected by the variables
ContractRenewal, DataPlan, CustServCalls, DayMins, OverageFee and RoamMins.
The signs of the coefficients are consistent with expectation. ContractRenewal and DataPlan
have negative coefficients: if the customer has recently renewed the contract (value = 1) or
has a data plan (value = 1), the customer is less likely to churn.
The overall performance of the model is good, with an accuracy of 0.806 and a sensitivity of
0.655 at the 0.2 cutoff.
Appendix – Source Code
R_markdown.docx
R- Markdown document: