Assessment
PRATIK ZANKE
OBJECTIVE:
Customer churn is a burning problem for telecom companies. In this project, we simulate one such case
of customer churn, working on data of postpaid customers with a contract. The data has
information about the customers' usage behavior, contract details and payment details. It also
indicates which customers canceled their service. Based on this past data, we will build a
model that can predict whether a customer will cancel their service in the future or not.
DATA DESCRIPTION:
Assumptions
The following assumptions are made for the inferential statistics:
iv. For Naïve Bayes: the variables are independent of each other and equally important.
Exploratory Data Analysis – Step by step approach
The steps followed to analyze the case study are explained below.
The R packages used to analyze the data are:
• readxl to read the xlsx data file
• dplyr to scale data
• corrplot for the correlation plot
• lattice for plots
• caret to calculate the confusion matrix
• ROCR to calculate AUC and K-S
• ineq to calculate Gini
• caTools to split the data
• naivebayes for the Naive Bayes model for numeric predictors
• e1071 for Naive Bayes
• class for the KNN classifier
• pscl for maximum likelihood estimation (pseudo R²)
• lmtest for diagnostic checking in linear regression models
• purrr for visualization
• tidyr for visualization
• ggplot2 for data visualization
• car for VIF
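For reference, these packages can be loaded at the top of the script as below (assuming they are all installed):

# Load all packages used in this analysis
library(readxl);     library(dplyr);  library(corrplot); library(lattice)
library(caret);      library(ROCR);   library(ineq);     library(caTools)
library(naivebayes); library(e1071);  library(class);    library(pscl)
library(lmtest);     library(purrr);  library(tidyr);    library(ggplot2)
library(car)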
Setting up the working directory helps to keep all the files related to the project in one place on
the system.
> setwd("F:/project")
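The dataset can then be loaded with readxl; a minimal sketch, assuming the workbook is named CellphoneData.xlsx (the actual file name is not given here):

# Read the postpaid customer data; the file name is hypothetical
Telecom <- read_excel("CellphoneData.xlsx")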
STR:
> str(Telecom)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 3333 obs. of 11 variables:
$ Churn : num 0 0 0 0 0 0 0 0 0 0 ...
$ AccountWeeks : num 128 107 137 84 75 118 121 147 117 141 ...
$ ContractRenewal: num 1 1 1 0 0 0 1 0 1 0 ...
$ DataPlan : num 1 1 0 0 0 0 1 0 0 1 ...
$ DataUsage : num 2.7 3.7 0 0 0 0 2.03 0 0.19 3.02 ...
$ CustServCalls : num 1 1 0 2 3 0 3 0 1 0 ...
$ DayMins : num 265 162 243 299 167 ...
$ DayCalls : num 110 123 114 71 113 98 88 79 97 84 ...
$ MonthlyCharge : num 89 82 52 57 41 57 87.3 36 63.9 93.2 ...
$ OverageFee : num 9.87 9.78 6.06 3.1 7.42 ...
$ RoamMins : num 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
SUMMARY:
> summary(Telecom)
     Churn          AccountWeeks   ContractRenewal     DataPlan
 Min.   :0.0000   Min.   :  1.0   Min.   :0.0000   Min.   :0.0000
 1st Qu.:0.0000   1st Qu.: 74.0   1st Qu.:1.0000   1st Qu.:0.0000
 Median :0.0000   Median :101.0   Median :1.0000   Median :0.0000
 Mean   :0.1449   Mean   :101.1   Mean   :0.9031   Mean   :0.2766
 3rd Qu.:0.0000   3rd Qu.:127.0   3rd Qu.:1.0000   3rd Qu.:1.0000
 Max.   :1.0000   Max.   :243.0   Max.   :1.0000   Max.   :1.0000
   DataUsage      CustServCalls      DayMins         DayCalls     MonthlyCharge
 Min.   :0.0000   Min.   :0.000   Min.   :  0.0   Min.   :  0.0   Min.   : 14.00
 1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:143.7   1st Qu.: 87.0   1st Qu.: 45.00
 Median :0.0000   Median :1.000   Median :179.4   Median :101.0   Median : 53.50
 Mean   :0.8165   Mean   :1.563   Mean   :179.8   Mean   :100.4   Mean   : 56.31
 3rd Qu.:1.7800   3rd Qu.:2.000   3rd Qu.:216.4   3rd Qu.:114.0   3rd Qu.: 66.20
 Max.   :5.4000   Max.   :9.000   Max.   :350.8   Max.   :165.0   Max.   :111.30
   OverageFee       RoamMins
 Min.   : 0.00   Min.   : 0.00
 1st Qu.: 8.33   1st Qu.: 8.50
 Median :10.07   Median :10.30
 Mean   :10.05   Mean   :10.24
 3rd Qu.:11.77   3rd Qu.:12.10
 Max.   :18.19   Max.   :20.00
Univariate Analysis
We analyze all 10 independent variables from the dataset ‘cellData’. The Churn variable is the
dependent variable. For ease of plotting, we convert the dataset to a data frame ‘cellDataEDA’ and
remove the factor variables. Then we perform univariate and bivariate analysis.
• All the variables, except CustServCalls and DataUsage, are approximately normally distributed, with
mean and median almost the same.
• CustServCalls is right-skewed. Most users have called customer service only once.
• DataUsage is right-skewed. Most users have used less than 1 GB of data.
• The scatter plots show a random distribution for all variables except CustServCalls and
DataUsage.
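The faceted plots below can be reproduced with purrr, tidyr and ggplot2; a minimal sketch, assuming the data frame cellDataEDA from above (the bin count is illustrative):

# Faceted histograms of all numeric variables
cellDataEDA %>%
  keep(is.numeric) %>%                  # numeric columns only
  gather() %>%                          # long format: key = variable name
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram(bins = 30)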
HISTOGRAM:
[Histograms of all numeric variables]
BOXPLOT:
[Boxplots of all numeric variables]
DENSITY PLOT:
[Density plots of all numeric variables]
• AccountWeeks, DayMins, OverageFee and RoamMins are almost normally distributed.
• DataUsage and CustServCalls are right-skewed.
• DayCalls has little data between 0 and 50; beyond that the histogram is normally distributed, with
the low values appearing as outliers.
• Almost all features show outliers.
Bivariate Analysis:
[Correlation plot of all numeric variables]
Based on the correlation plot above, we can say MonthlyCharge and DataUsage are highly correlated.
Hence, we will check for multicollinearity during model building and drop a variable if required; a sketch of the plot code follows.
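A minimal sketch of the correlation plot using corrplot, as listed in the packages (the display method is illustrative):

# Correlation matrix of the numeric variables, visualized
corrplot(cor(cellDataEDA), method = "number", type = "lower")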
Logistic regression
Logistic regression is a supervised learning technique. It is used to describe data and to
explain the relationship between one dependent binary variable and one or more nominal, ordinal,
interval or ratio-level independent variables.
Scaling the data can reduce the impact of outliers. While building the model, we also checked with
scaled data, but scaling had no impact on the model. Hence, we did not scale the
data.
Model Building
We have split the data in a 70:30 ratio.
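A minimal sketch of the split and the first model build, with the 70:30 split done via caTools (the seed is illustrative, not from the source):

set.seed(123)                                   # illustrative seed
split     <- sample.split(Telecom$Churn, SplitRatio = 0.70)
trainData <- subset(Telecom, split == TRUE)
testData  <- subset(Telecom, split == FALSE)

# Full logistic regression model on the training data
lmodel <- glm(Churn ~ ., family = binomial(link = "logit"), data = trainData)
summary(lmodel)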
Call:
glm(formula = Churn ~ ., family = binomial(link = "logit"), data = trainData)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0909 -0.5063 -0.3349 -0.1924 3.0563
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.298282 0.660926 -9.529 < 2e-16 ***
AccountWeeks 0.002147 0.001668 1.287 0.19820
ContractRenewal -2.148996 0.174575 -12.310 < 2e-16 ***
DataPlan -1.341417 0.642404 -2.088 0.03679 *
CustServCalls 0.502549 0.047515 10.577 < 2e-16 ***
DayMins 0.012647 0.003909 3.235 0.00122 **
DayCalls 0.003053 0.003316 0.920 0.35734
MonthlyCharge 0.011254 0.021727 0.518 0.60448
OverageFee 0.121220 0.046308 2.618 0.00885 **
RoamMins 0.084454 0.027049 3.122 0.00179 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1934.3 on 2333 degrees of freedom
Residual deviance: 1500.3 on 2324 degrees of freedom
AIC: 1520.3
Number of Fisher Scoring iterations: 6
vif(lmodel)
   AccountWeeks ContractRenewal        DataPlan   CustServCalls         DayMins
       1.001376        1.072246       13.618518        1.085827        9.251660
       DayCalls   MonthlyCharge      OverageFee        RoamMins
       1.005013       25.299488        2.999950        1.205865
exp(coefficients(lmodel))
    (Intercept)    AccountWeeks ContractRenewal        DataPlan   CustServCalls
    0.001839462     1.002148946     0.116601225     0.261474892     1.652929952
        DayMins        DayCalls   MonthlyCharge      OverageFee        RoamMins
    1.012727707     1.003057266     1.011317382     1.128873196     1.088122508
lrtest(lmodel)
Likelihood ratio test
Model 1: Churn ~ AccountWeeks + ContractRenewal + DataPlan + CustServCalls +
DayMins + DayCalls + MonthlyCharge + OverageFee + RoamMins
Model 2: Churn ~ 1
#Df LogLik Df Chisq Pr(>Chisq)
1 10 -750.16
2 1 -967.14 -9 433.95 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Multicollinearity: MonthlyCharge and DataUsage have very high VIF values. Moreover, in the correlation
matrix we found that they are highly correlated, so we can drop MonthlyCharge from the model and
rebuild.
Moreover, DataUsage is statistically insignificant and its odds ratio is more than 1, so we cannot drop
DataUsage from the model.
We tried dropping DataPlan, but there was no significant difference, and the AIC also increased. Hence,
we will only drop MonthlyCharge from the model.
MODEL2
lmodel2 <- glm(formula = Churn ~ AccountWeeks + ContractRenewal + CustServCalls +
    DayCalls + DayMins + OverageFee + RoamMins, family = binomial, data = trainData)
> summary(lmodel2)
Call:
glm(formula = Churn ~ AccountWeeks + ContractRenewal + CustServCalls +
DayCalls + DayMins + OverageFee + RoamMins, family = binomial,
data = trainData)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9514 -0.5065 -0.3537 -0.2185 3.0514
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.343452 0.647899 -9.791 < 2e-16 ***
AccountWeeks 0.001981 0.001654 1.198 0.231060
ContractRenewal -2.075243 0.170818 -12.149 < 2e-16 ***
CustServCalls 0.488537 0.046737 10.453 < 2e-16 ***
DayCalls 0.003147 0.003292 0.956 0.339160
DayMins 0.014061 0.001306 10.764 < 2e-16 ***
OverageFee 0.132637 0.026708 4.966 6.83e-07 ***
RoamMins 0.083248 0.024526 3.394 0.000688 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1934.3 on 2333 degrees of freedom
Residual deviance: 1539.2 on 2326 degrees of freedom
AIC: 1555.2
Number of Fisher Scoring iterations: 5
lrtest(lmodel2)
Likelihood ratio test
Model 1: Churn ~ AccountWeeks + ContractRenewal + CustServCalls + DayCalls +
DayMins + OverageFee + RoamMins
Model 2: Churn ~ 1
#Df LogLik Df Chisq Pr(>Chisq)
1 8 -769.58
2 1 -967.14 -7 395.11 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
There is no significant change in log likelihood from the previous model. Also, based on the p-value, we
can reject the null hypothesis; thus the model is valid.
pR2(lmodel2)
fitting null model for pseudo-r2
llh llhNull G2 McFadden r2ML r2CU
-769.5841060 -967.1400908 395.1119696 0.2042682 0.1557320 0.2764141
About 20.4% of the deviance in the intercept-only model is explained by our model (McFadden pseudo R²).
exp(coefficients(lmodel2))
    (Intercept)    AccountWeeks ContractRenewal   CustServCalls        DayCalls
    0.001758222     1.001983388     0.125525906     1.629929450     1.003151638
        DayMins      OverageFee        RoamMins
    1.014160662     1.141835233     1.086810918
Performance Metrics:
Confusion Matrix: For Training Dataset
[Confusion matrix output for the training and testing datasets]

Metrics (Testing Dataset)   Value
Accuracy                    0.79
Sensitivity                 0.64
Specificity                 0.82
AUC                         0.80
K-S                         0.50
Gini                        0.53
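A sketch of how these metrics can be computed with caret, ROCR and ineq, as listed in the packages (the 0.5 probability cut-off is illustrative; the report's actual cut-off is not stated):

# Predicted churn probabilities on the test data
predProb  <- predict(lmodel2, newdata = testData, type = "response")
predClass <- ifelse(predProb > 0.5, 1, 0)       # illustrative cut-off

# Confusion matrix with churn (1) as the positive class
confusionMatrix(factor(predClass), factor(testData$Churn), positive = "1")

# AUC and K-S from the ROCR performance objects
predObj <- prediction(predProb, testData$Churn)
perf    <- performance(predObj, "tpr", "fpr")
auc     <- performance(predObj, "auc")@y.values[[1]]
ks      <- max(perf@y.values[[1]] - perf@x.values[[1]])

# Gini coefficient of the predicted probabilities
gini <- ineq(predProb, type = "Gini")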
Interpretation
1. The model will catch 64% of the customers who will actually churn.
2. The model will catch 82% of the customers who will actually not churn.
3. Of the customers it predicts will churn, 37% will actually churn.
4. Of the customers it predicts will not churn, 93% will actually not churn.
5. The K-S statistic is 50%, so the model performs fairly well at separating the churn and no-churn cases.
K-Nearest Neighbour:
KNN, which stands for K-Nearest Neighbour, is a supervised machine learning algorithm that classifies a
new data point into the target class based on the features of its neighbouring data points.
Choosing the K value -
If K is too small, the model will overfit: it will do well on the data used to build it, but when it
comes across new observations it will perform poorly. If K is too large, the model will also perform
poorly. The best value of K is selected by picking one that is neither too high nor too low.
We tried the model on both scaled and normalized data; after checking the output, we built the final
model on normalized data (a sketch of the normalization follows).
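A minimal sketch of min-max normalization (the helper name normalize is illustrative):

# Min-max normalization: rescales a numeric vector to [0, 1]
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

# Applied column-wise to create the norm.* variables shown below, e.g.:
Telecom$norm.daymins <- normalize(Telecom$DayMins)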
Normalized data:
> Telecom
# A tibble: 3,333 x 22
   Churn AccountWeeks ContractRenewal DataPlan DataUsage CustServCalls DayMins
   <dbl>        <dbl>           <dbl>    <dbl>     <dbl>         <dbl>   <dbl>
 1     0          128               1        1      2.7              1    265.
 2     0          107               1        1      3.7              1    162.
 3     0          137               1        0      0                0    243.
 4     0           84               0        0      0                2    299.
 5     0           75               0        0      0                3    167.
 6     0          118               0        0      0                0    223.
 7     0          121               1        1      2.03             3    218.
 8     0          147               0        0      0                0    157
 9     0          117               1        0      0.19             1    184.
10     0          141               0        1      3.02             0    259.
# ... with 3,323 more rows, and 15 more variables: DayCalls <dbl>,
#   MonthlyCharge <dbl>, OverageFee <dbl>, RoamMins <dbl>, norm.Churn <dbl>,
#   norm.Accountweeks <dbl>, norm.daycalls <dbl>, norm.daymins <dbl>,
#   norm.overagefee <dbl>, norm.contractrenewal <dbl>, norm.dataplan <dbl>,
#   norm.datausage <dbl>, norm.Cust <dbl>, norm.monthlycharge <dbl>,
#   norm.roammins <dbl>
We built the model for various values of K and found K = 19 to be the optimal value.
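A minimal sketch of the final k-NN fit with the class package (the object names trainNorm, testNorm and trainLabels are illustrative; they stand for the normalized training and test predictors and the training Churn values):

# k-NN classification of the test observations using the 19 nearest neighbours
knnPred <- knn(train = trainNorm, test = testNorm, cl = trainLabels, k = 19)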
Performance Metrics:
Metrics       Value
Accuracy      0.89
Sensitivity   0.29
Specificity   0.99
Interpretation:
1. The model will catch 29% of the customers who will actually churn.
2. The model will catch 99% of the customers who will actually not churn.
3. Of the customers it predicts will churn, 89% will actually churn.
4. Of the customers it predicts will not churn, 89% will actually not churn.
Naive Bayes
Naive Bayes classifiers are a collection of classification algorithms based on Bayes' Theorem. It is not a
single algorithm but a family of algorithms that share a common principle: every pair of features being
classified is independent of the others. The prior probabilities are calculated from the counts in the
training data, so they follow the class proportions of the parent dataset.
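A minimal sketch of the model build with e1071, matching the call shown in the output below (TrainBayes is the training split with Churn as its first column; TestBayes is assumed to be the analogous test split):

# Naive Bayes model: priors and conditional distributions from the training data
NB <- naiveBayes(x = TrainBayes[-1], y = TrainBayes$Churn)
NB

# Class predictions on the test split
nbPred <- predict(NB, newdata = TestBayes[-1])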
NB
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = TrainBayes[-1], y = TrainBayes$Churn)
A-priori probabilities:
TrainBayes$Churn
0 1
0.8547558 0.1452442
Performance Metrics
Confusion Matrix For TEST Dataset:
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 776 85
1 79 59
Accuracy : 0.8358
95% CI : (0.8114, 0.8583)
No Information Rate : 0.8559
P-Value [Acc > NIR] : 0.9658
Kappa : 0.3229
Mcnemar's Test P-Value : 0.6962
Sensitivity : 0.9076
Specificity : 0.4097
Pos Pred Value : 0.9013
Neg Pred Value : 0.4275
Prevalence : 0.8559
Detection Rate : 0.7768
Detection Prevalence : 0.8619
Balanced Accuracy : 0.6587
'Positive' Class : 0
Metrics       Value
Accuracy      0.83
Sensitivity   0.90
Specificity   0.40
TRAIN DATASET:
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1848 203
1 147 136
Accuracy : 0.85
95% CI : (0.8349, 0.8643)
No Information Rate : 0.8548
P-Value [Acc > NIR] : 0.751542
Kappa : 0.3516
Metrics       Value
Accuracy      0.85
Sensitivity   0.92
Specificity   0.40
INTERPRETATIONS:
1. The model will catch 25% of the customers who will actually churn.
2. The model will catch 97% of the customers who will actually not churn.
3. Of the customers it predicts will churn, 57% will actually churn.
4. Of the customers it predicts will not churn, 88% will actually not churn.
Model Comparison using Model Performance metrics & Interpretation:
KNN has the highest accuracy and the highest specificity.
• For Naïve Bayes, the base assumption is that the predictor variables are independent and equally
important. For our data, we have seen that the predictors are correlated; hence Naïve Bayes does not
give correct predictions.
• Overall, the logistic regression model balances accuracy, sensitivity and specificity. Hence, we
conclude that the logistic regression model is the best.
• KNN does not give confidence levels (probabilities); it gives the class value directly.
CONCLUSION:
The model built using logistic regression is a good model: accuracy is about 80%, there is a balance
between sensitivity and specificity, and it has good predictive ability (AUC of 80%). We are able to
predict 71% of the customers who will churn. The telecom company can talk to the customers predicted
to churn, understand their point of view, and come up with a satisfactory resolution. About 19% are
wrongly classified as churn; if the company talks to these customers as well, there will be no negative
implication.
We may increase the accuracy by adding other predictors, such as whether the customer has more than
one connection or the bill payment mode. We can also use stepwise regression to shortlist the predictor
variables that contribute the most to the model.
The important variables, based on odds values and statistical significance, are CustServCalls, DayMins,
OverageFee and RoamMins; AccountWeeks, DataUsage and DayCalls are important as well.