
Telecom Customer Churn Prediction

Assessment

PRATIK ZANKE

TABLE OF CONTENTS
Objective & EDA and splitting of data

Logistic Regression

KNN Model

Naïve Bayes Model

Model Comparison

OBJECTIVE:
Customer churn is a burning problem for telecom companies. In this project, we simulate one such case of customer churn, working on data of postpaid customers with a contract. The data has information about customer usage behaviour, contract details and payment details, and it also indicates which customers canceled their service. Based on this past data, we will build models that can predict whether a customer will cancel their service in the future:

• Logistic Regression Model
• KNN Model
• Naive Bayes Model
• Model Comparison using Model Performance metrics & Interpretation

DATA Description:

Variable          Description
Churn             1 if customer cancelled service, 0 if not
AccountWeeks      number of weeks customer has had active account
ContractRenewal   1 if customer recently renewed contract, 0 if not
DataPlan          1 if customer has data plan, 0 if not
DataUsage         gigabytes of monthly data usage
CustServCalls     number of calls into customer service
DayMins           average daytime minutes per month
DayCalls          average number of daytime calls
MonthlyCharge     average monthly bill
OverageFee        largest overage fee in last 12 months
RoamMins          average number of roaming minutes

Assumptions
The following assumptions are made for the inferential statistics:

i. Observations are independent

ii. Samples are random

iii. Measurements are accurate

iv. For Naïve Bayes: the variables are independent and equally important.

v. For KNN: the continuous variables are normalized.

Exploratory Data Analysis – Step by step approach

The steps followed to analyze the case study are described below.

Environment Set up and Data Import


Install necessary Packages and Invoke Libraries

The R packages used to analyze the data are listed below:
• readxl to read the .xlsx data file
• dplyr for data manipulation and scaling
• corrplot for correlation plots
• lattice for plots
• caret to calculate the confusion matrix
• ROCR to calculate AUC and K-S
• ineq to calculate the Gini coefficient
• caTools to split the data
• naivebayes for a Naive Bayes model with numeric predictors
• e1071 for Naive Bayes
• class for the KNN classifier
• pscl for pseudo-R² (maximum likelihood estimation)
• lmtest for diagnostic checking of regression models (lrtest)
• purrr for visualization
• tidyr for visualization
• ggplot2 for data visualization
• car for VIF
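
The libraries can be loaded in one block (a minimal setup sketch; it assumes the packages are already installed, otherwise run install.packages() first):

# Invoke all libraries used in this analysis
library(readxl);     library(dplyr);   library(corrplot); library(lattice)
library(caret);      library(ROCR);    library(ineq);     library(caTools)
library(naivebayes); library(e1071);   library(class);    library(pscl)
library(lmtest);     library(purrr);   library(tidyr);    library(ggplot2)
library(car)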

Set up working Directory

Setting up the working directory will help to maintain all the files related to the project at one place in
the system.
> setwd("F:/project")

Import and Read the Dataset


The given dataset is in .xlsx format, so to import the data into R we use the read_excel command. The data in the file "Cellphone-1.xlsx" is stored in a variable called "Telecom".

> Telecom = read_excel("Cellphone-1.xlsx",sheet = "Data")

The number of rows in the dataset is 3333


The number of columns (Features) in the dataset is 11.

Variable Identification – Inferences


DIM:
> dim(Telecom)
[1] 3333 11

STR:
str(Telecom)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 3333 obs. of 11 variables:
$ Churn : num 0 0 0 0 0 0 0 0 0 0 ...
$ AccountWeeks : num 128 107 137 84 75 118 121 147 117 141 ...
$ ContractRenewal: num 1 1 1 0 0 0 1 0 1 0 ...
$ DataPlan : num 1 1 0 0 0 0 1 0 0 1 ...
$ DataUsage : num 2.7 3.7 0 0 0 0 2.03 0 0.19 3.02 ...
$ CustServCalls : num 1 1 0 2 3 0 3 0 1 0 ...
$ DayMins : num 265 162 243 299 167 ...
$ DayCalls : num 110 123 114 71 113 98 88 79 97 84 ...
$ MonthlyCharge : num 89 82 52 57 41 57 87.3 36 63.9 93.2 ...
$ OverageFee : num 9.87 9.78 6.06 3.1 7.42 ...
$ RoamMins : num 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...

SUMMARY:
> summary(Telecom)

Variable          Min.     1st Qu.  Median   Mean     3rd Qu.  Max.
Churn             0.0000   0.0000   0.0000   0.1449   0.0000   1.0000
AccountWeeks      1.0      74.0     101.0    101.1    127.0    243.0
ContractRenewal   0.0000   1.0000   1.0000   0.9031   1.0000   1.0000
DataPlan          0.0000   0.0000   0.0000   0.2766   1.0000   1.0000
DataUsage         0.0000   0.0000   0.0000   0.8165   1.7800   5.4000
CustServCalls     0.000    1.000    1.000    1.563    2.000    9.000
DayMins           0.0      143.7    179.4    179.8    216.4    350.8
DayCalls          0.0      87.0     101.0    100.4    114.0    165.0
MonthlyCharge     14.00    45.00    53.50    56.31    66.20    111.30
OverageFee        0.00     8.33     10.07    10.05    11.77    18.19
RoamMins          0.00     8.50     10.30    10.24    12.10    20.00

Univariate Analysis
We analyze all 10 independent variables from the dataset "Telecom"; Churn is the dependent variable. For ease of plotting, we convert the dataset to a data frame "cellDataEDA" and remove the factor variables. We then perform univariate and bivariate analysis.

• All the variables except CustServCalls and DataUsage are approximately normally distributed, with mean and median almost the same.

• CustServCalls is right-skewed. Most users have called customer service only once.

• DataUsage is right-skewed. Most users have used less than 1 GB of data.

• The box-plots show outliers in all the continuous variables.

• The scatter plots show a random distribution in all variables except CustServCalls and DataUsage.

• The density plots show that CustServCalls takes discrete values.
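
The histogram, box-plot and density-plot panels below were generated with purrr/tidyr/ggplot2; a minimal sketch of the approach (the exact plotting calls are an assumption):

# Reshape every numeric column to long format, then facet one histogram per variable
Telecom %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram(bins = 30)
# For the density plots, replace geom_histogram(bins = 30) with geom_density();
# for the box-plots, map the value to y, e.g. ggplot(aes(x = key, y = value)).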

HISTOGRAM:

BOXPLOT:

DENSITY PLOT:

Observations from the plots:
• AccountWeeks, DayMins, OverageFee and RoamMins are almost normally distributed.
• DataUsage and CustServCalls are right-skewed.
• DayCalls shows a small cluster of low values between 0 and 50; otherwise its distribution is approximately normal, with the low values appearing as outliers.
• Almost all features show outliers.

Bivariate Analysis:
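
The correlation plot referred to below can be reproduced along these lines (a sketch using the corrplot package loaded earlier; the plotting options are an assumption):

# Correlation matrix of all variables, rendered as a correlation plot
corMat <- cor(as.data.frame(Telecom))
corrplot(corMat, method = "number", type = "upper")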

Based on the correlation plot, MonthlyCharge and DataUsage are highly correlated, and MonthlyCharge and DayMins are also correlated.

Hence, we will check for multicollinearity during model building and drop variables if required.

Logistic regression
Logistic regression is a supervised learning technique. It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.

The dependent variable, Churn, is dichotomous in nature.

We can scale the data to reduce the impact of outliers. While building the model, we checked with scaled data as well, but scaling had no impact on the model. Hence, we do not scale the data.

Model Building
We have split the data in a 70:30 ratio, as sketched below.
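
A sketch of the split and the first model fit (the seed value and the trainData/testData names are assumptions; caTools provides sample.split):

# 70:30 train/test split, stratified on Churn
set.seed(123)  # assumed seed for reproducibility
split <- sample.split(Telecom$Churn, SplitRatio = 0.7)
trainData <- subset(Telecom, split == TRUE)
testData  <- subset(Telecom, split == FALSE)

# Full logistic regression model on the training data
lmodel <- glm(Churn ~ ., family = binomial(link = "logit"), data = trainData)
summary(lmodel)  # produces the output below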
Call:
glm(formula = Churn ~ ., family = binomial(link = "logit"), data = trainData)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0909 -0.5063 -0.3349 -0.1924 3.0563
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.298282 0.660926 -9.529 < 2e-16 ***
AccountWeeks 0.002147 0.001668 1.287 0.19820
ContractRenewal -2.148996 0.174575 -12.310 < 2e-16 ***
DataPlan -1.341417 0.642404 -2.088 0.03679 *
CustServCalls 0.502549 0.047515 10.577 < 2e-16 ***
DayMins 0.012647 0.003909 3.235 0.00122 **
DayCalls 0.003053 0.003316 0.920 0.35734
MonthlyCharge 0.011254 0.021727 0.518 0.60448
OverageFee 0.121220 0.046308 2.618 0.00885 **
RoamMins 0.084454 0.027049 3.122 0.00179 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1934.3 on 2333 degrees of freedom
Residual deviance: 1500.3 on 2324 degrees of freedom
AIC: 1520.3
Number of Fisher Scoring iterations: 6

> vif(lmodel)
   AccountWeeks ContractRenewal        DataPlan   CustServCalls         DayMins
       1.001376        1.072246       13.618518        1.085827        9.251660
       DayCalls   MonthlyCharge      OverageFee        RoamMins
       1.005013       25.299488        2.999950        1.205865

> exp(coefficients(lmodel))
    (Intercept)    AccountWeeks ContractRenewal        DataPlan   CustServCalls
    0.001839462     1.002148946     0.116601225     0.261474892     1.652929952
        DayMins        DayCalls   MonthlyCharge      OverageFee        RoamMins
    1.012727707     1.003057266     1.011317382     1.128873196     1.088122508

lrtest(lmodel)
Likelihood ratio test
Model 1: Churn ~ AccountWeeks + ContractRenewal + DataPlan + CustServCalls +
DayMins + DayCalls + MonthlyCharge + OverageFee + RoamMins
Model 2: Churn ~ 1
#Df LogLik Df Chisq Pr(>Chisq)
1 10 -750.16
2 1 -967.14 -9 433.95 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Multicollinearity: MonthlyCharge and DataPlan have very high VIF values. Moreover, the correlation matrix showed that MonthlyCharge is highly correlated with DataUsage and DayMins, so we can drop MonthlyCharge from the model and rebuild.

Moreover, although DataUsage is statistically insignificant, its odds ratio is greater than 1, so we do not drop DataUsage from the model.

We tried dropping DataPlan, but there was no significant difference, and the AIC increased. Hence, we only drop MonthlyCharge from the model.

MODEL2
> lmodel2 = glm(formula = Churn ~ AccountWeeks + ContractRenewal + CustServCalls +
      DayCalls + DayMins + OverageFee + RoamMins, family = binomial, data = trainData)
> summary(lmodel2)
Call:
glm(formula = Churn ~ AccountWeeks + ContractRenewal + CustServCalls +
DayCalls + DayMins + OverageFee + RoamMins, family = binomial,
data = trainData)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9514 -0.5065 -0.3537 -0.2185 3.0514
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.343452 0.647899 -9.791 < 2e-16 ***
AccountWeeks 0.001981 0.001654 1.198 0.231060
ContractRenewal -2.075243 0.170818 -12.149 < 2e-16 ***
CustServCalls 0.488537 0.046737 10.453 < 2e-16 ***
DayCalls 0.003147 0.003292 0.956 0.339160
DayMins 0.014061 0.001306 10.764 < 2e-16 ***
OverageFee 0.132637 0.026708 4.966 6.83e-07 ***
RoamMins 0.083248 0.024526 3.394 0.000688 ***
---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1934.3 on 2333 degrees of freedom
Residual deviance: 1539.2 on 2326 degrees of freedom
AIC: 1555.2
Number of Fisher Scoring iterations: 5

lrtest(lmodel2)
Likelihood ratio test
Model 1: Churn ~ AccountWeeks + ContractRenewal + CustServCalls + DayCalls +
DayMins + OverageFee + RoamMins
Model 2: Churn ~ 1
#Df LogLik Df Chisq Pr(>Chisq)
1 8 -769.58
2 1 -967.14 -7 395.11 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
There is no significant change in log-likelihood from the previous model. Also, based on the p-value, we can reject the null hypothesis; thus the model is valid.
pR2(lmodel2)
fitting null model for pseudo-r2
llh llhNull G2 McFadden r2ML r2CU
-769.5841060 -967.1400908 395.1119696 0.2042682 0.1557320 0.2764141
The McFadden pseudo-R² is 0.204, i.e. the model explains about 20.4% of the deviance of the intercept-only model.
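This follows directly from the log-likelihoods reported above:

McFadden R² = 1 − llh / llhNull = 1 − (−769.584) / (−967.140) ≈ 0.204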
> exp(coefficients(lmodel2))
    (Intercept)    AccountWeeks ContractRenewal   CustServCalls        DayCalls
    0.001758222     1.001983388     0.125525906     1.629929450     1.003151638
        DayMins      OverageFee        RoamMins
    1.014160662     1.141835233     1.086810918
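
As a worked example of reading these odds ratios: the value of about 1.63 for CustServCalls means that each additional customer-service call multiplies the odds of churn by roughly 1.63 (a 63% increase in the odds), holding the other predictors constant, while the value of about 0.13 for ContractRenewal means that a recent contract renewal cuts the odds of churn by roughly 87%.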

Performance Metrics:
Confusion Matrix: For Training Dataset

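The metrics in the table below were computed along the following lines (a sketch; the 0.5 classification cutoff and the object names are assumptions):

# Predicted churn probabilities and classes on the test set
predProb  <- predict(lmodel2, newdata = testData, type = "response")
predClass <- ifelse(predProb > 0.5, 1, 0)  # assumed cutoff

# Confusion matrix (caret)
confusionMatrix(factor(predClass), factor(testData$Churn), positive = "1")

# AUC and K-S statistic (ROCR)
predObj <- prediction(predProb, testData$Churn)
performance(predObj, "auc")@y.values[[1]]
perf <- performance(predObj, "tpr", "fpr")
max(perf@y.values[[1]] - perf@x.values[[1]])  # K-S

# Gini coefficient (ineq)
ineq(predProb, type = "Gini")
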
Metrics (Test Dataset)   Value
Accuracy                 0.79
Sensitivity              0.64
Specificity              0.82
AUC                      0.80
K-S                      0.50
Gini                     0.53

Interpretation
1. The model will catch 64% of the customers who will actually churn.

2. The model will catch 82% of the customers who will actually not churn.

3. Overall accuracy is 79%.

4. Out of the customers predicted to churn, 37% will actually churn.

5. Out of the customers predicted not to churn, 93% will actually not churn.

6. AUC is about 80%, so it is a good classifier.

7. K-S is 50%, so the model separates the churn and no-churn cases fairly well.

K-Nearest Neighbour:
KNN, which stands for K-Nearest Neighbour, is a supervised machine learning algorithm that classifies a new data point into the target class based on the features of its neighbouring data points.

Choosing the K value -

If K is too small, the model will overfit: it will do well on the data used to create it but will perform poorly on new observations. If K is too large, the model will also perform poorly. The best value of K is therefore selected by evaluating performance over a range of values that is neither too high nor too low (see the sketch after the normalized data below).

We tried the model on both scaled and normalized data; based on the outputs, we built the final model on normalized data.

Normalized data:
> Telecom
# A tibble: 3,333 x 22
   Churn AccountWeeks ContractRenewal DataPlan DataUsage CustServCalls DayMins
   <dbl>        <dbl>           <dbl>    <dbl>     <dbl>         <dbl>   <dbl>
 1     0          128               1        1      2.7              1    265.
 2     0          107               1        1      3.7              1    162.
 3     0          137               1        0      0                0    243.
 4     0           84               0        0      0                2    299.
 5     0           75               0        0      0                3    167.
 6     0          118               0        0      0                0    223.
 7     0          121               1        1      2.03             3    218.
 8     0          147               0        0      0                0    157.
 9     0          117               1        0      0.19             1    184.
10     0          141               0        1      3.02             0    259.
# ... with 3,323 more rows, and 15 more variables: DayCalls <dbl>,
#   MonthlyCharge <dbl>, OverageFee <dbl>, RoamMins <dbl>, norm.Churn <dbl>,
#   norm.Accountweeks <dbl>, norm.daycalls <dbl>, norm.daymins <dbl>,
#   norm.overagefee <dbl>, norm.contractrenewal <dbl>, norm.dataplan <dbl>,
#   norm.datausage <dbl>, norm.Cust <dbl>, norm.monthlycharge <dbl>,
#   norm.roammins <dbl>

We have built the model for various values of K and found K=19 as the optimal value.
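
A sketch of the normalization and the K search (the normalize helper and the trainX/testX/trainY/testY objects are assumptions):

# Min-max normalization of the predictors, as stated in the assumptions
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
normData  <- as.data.frame(lapply(Telecom[ , -1], normalize))

# Assumed objects: trainX/testX (normalized predictors), trainY/testY (Churn labels)
# Evaluate test accuracy over a range of odd K values
kGrid <- seq(3, 25, 2)
accs  <- sapply(kGrid, function(k) {
  pred <- knn(train = trainX, test = testX, cl = trainY, k = k)
  mean(pred == testY)
})

# Final model at the chosen K = 19
predKNN <- knn(train = trainX, test = testX, cl = trainY, k = 19)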

Performance Metrics:

Metrics Value
Accuracy 0.89
Sensitivity 0.29
Specificity 0.99

Interpretation:
1. The model will catch 29% of the customers who will actually churn.

2. The model will catch 99% of the customers who will actually not churn.

3. Overall accuracy is 89%.

4. Out of the customers predicted to churn, 89% will actually churn.

5. Out of the customers predicted not to churn, 89% will actually not churn.

Naive Bayes
Naive Bayes classifiers are a family of classification algorithms based on Bayes' Theorem. It is not a single algorithm but a collection of algorithms that share a common principle: every pair of features being classified is independent of each other. The prior probabilities are calculated from the class counts in the training data, so they should follow the class proportions of the parent dataset.
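
A sketch of the fit and its evaluation (assuming TrainBayes and TestBayes data frames with Churn as a factor in the first column):

# Fit Naive Bayes on all predictors (e1071)
NB <- naiveBayes(x = TrainBayes[ , -1], y = TrainBayes$Churn)

# Predict classes on the test set and tabulate against the truth
predNB <- predict(NB, newdata = TestBayes[ , -1])
confusionMatrix(predNB, TestBayes$Churn)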

NB
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = TrainBayes[-1], y = TrainBayes$Churn)
A-priori probabilities:
TrainBayes$Churn
0 1
0.8547558 0.1452442

Performance Metrics
Confusion Matrix for TEST Dataset:

Confusion Matrix and Statistics

Reference
Prediction 0 1
0 776 85
1 79 59
Accuracy : 0.8358
95% CI : (0.8114, 0.8583)
No Information Rate : 0.8559
P-Value [Acc > NIR] : 0.9658
Kappa : 0.3229
Mcnemar's Test P-Value : 0.6962
Sensitivity : 0.9076
Specificity : 0.4097
Pos Pred Value : 0.9013
Neg Pred Value : 0.4275
Prevalence : 0.8559
Detection Rate : 0.7768
Detection Prevalence : 0.8619
Balanced Accuracy : 0.6587
'Positive' Class : 0

Metrics       Value
Accuracy      0.83
Sensitivity   0.90
Specificity   0.40

TRAIN DATASET:
Confusion Matrix and Statistics

Reference
Prediction 0 1
0 1848 203
1 147 136
Accuracy : 0.85
95% CI : (0.8349, 0.8643)
No Information Rate : 0.8548
P-Value [Acc > NIR] : 0.751542
Kappa : 0.3516

Mcnemar's Test P-Value : 0.003283


Sensitivity : 0.9263
Specificity : 0.4012
Pos Pred Value : 0.9010
Neg Pred Value : 0.4806
Prevalence : 0.8548
Detection Rate : 0.7918
Detection Prevalence : 0.8787
Balanced Accuracy : 0.6637
'Positive' Class : 0

Metrics       Value
Accuracy      0.85
Sensitivity   0.92
Specificity   0.40
INTERPRETATIONS:
1. The model will catch 41% of the customers who will actually churn (59 of the 144 churners in the test set).

2. The model will catch 91% of the customers who will actually not churn.

3. Overall accuracy on the test set is 84%.

4. Out of the customers predicted to churn, 43% will actually churn.

5. Out of the customers predicted not to churn, 90% will actually not churn.

Model Comparison using Model Performance metrics & Interpretation:
• KNN has the highest accuracy and the highest specificity.

• Logistic Regression has the highest sensitivity.

• For Naïve Bayes, the base assumption is that the predictor variables are independent and equally important. For our data, we have seen that the predictors are correlated; hence Naïve Bayes does not give correct predictions.

• The logistic regression model shows the best overall balance between accuracy, sensitivity and specificity. Hence, we conclude that the logistic regression model is the best.

• KNN does not give confidence levels (probabilities); it gives the class value directly.

CONCLUSION:
The model built using logistic regression is a good model: accuracy is about 80%, there is a balance between sensitivity and specificity, and it has good predictive ability (AUC of about 80%). It identifies about 64% of the customers who will churn (the test-set sensitivity). The telecom company can talk to the customers predicted to churn, understand their point of view, and come up with a satisfactory resolution. About 19% of customers are wrongly classified as churn; if the company talks to these customers as well, there will be no negative implication.

We may increase the accuracy by adding other predictors, such as whether the customer has more than one connection, or the bill payment mode. We can also use stepwise regression to identify the predictor variables that contribute the most to the model, as sketched below.
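
A sketch of that stepwise selection using base R's step() on the full model (the direction and starting model are assumptions):

# AIC-based stepwise selection in both directions, starting from the full model
stepModel <- step(lmodel, direction = "both", trace = FALSE)
summary(stepModel)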

The important variables, based on odds values and statistical significance, are CustServCalls, DayMins, OverageFee and RoamMins; AccountWeeks, DataUsage and DayCalls are important as well.
