
Business Analyst: Case Study

Prepared for Namshi.com


July 7th, 2019

-Shubham Malviya

Case Study
AGENDA

 Where are we?

 Key discussion items for this meeting –


 Problem 1: Customer Retention
 Problem 2: Modeling
 Problem 3: Experiment Design

 Q&A

 APPENDIX

For this interview round, I worked to deliver the case study results
with three objectives: Agility, Transparency and Analytic Rigor

Agility
I worked to deliver results rapidly

Transparency
Throughout the case study problems, I have been open and honest about my ideas and the hurdles I
faced:
• I built supporting evidence for my case study solutions through robust code and models

Analytic Rigor
I have leveraged a range of models and analyses:
• I used multi-level regression at multiple time intervals and advanced algorithms such as
random forest to support my results
• I used various statistical techniques to validate my results


Problem 1

To build the customer retention table, I used R

 Where are we?

 I prepared dummy data of 1,000 records with random user IDs and dates to
validate the solution code

 Please refer to the attached code for details
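The attached solution is in R; the cohort logic behind a retention table can be sketched in Python like this, using a few invented (user_id, activity_date) records standing in for the 1,000 dummy rows:

```python
# Illustrative sketch only: invented records, not Namshi data.
from collections import defaultdict
from datetime import date

records = [
    (1, date(2019, 1, 5)), (1, date(2019, 2, 9)), (1, date(2019, 4, 1)),
    (2, date(2019, 1, 20)), (2, date(2019, 3, 2)),
    (3, date(2019, 2, 11)), (3, date(2019, 2, 25)),
]

def month_index(d):
    # Months since year 0, so consecutive months differ by exactly 1
    return d.year * 12 + d.month

# Cohort = month of each user's first activity
first_seen = {}
for uid, d in sorted(records, key=lambda r: r[1]):
    first_seen.setdefault(uid, month_index(d))

# active[cohort][k] = distinct users active k months after their first month
active = defaultdict(lambda: defaultdict(set))
for uid, d in records:
    cohort = first_seen[uid]
    active[cohort][month_index(d) - cohort].add(uid)

retention = {c: {k: len(u) for k, u in offs.items()} for c, offs in active.items()}
```

Each row of `retention` is one acquisition cohort; each column counts the distinct users still active k months later, which is exactly the shape of a retention table.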


Problem 2

2.1: I performed the exploratory analysis below to extract insights
from the data, using Excel to visualize the data summaries

 Account Length vs Churn: Maximum churn is observed in the 75-100 and 100-125 week
account-length brackets

 Account Length vs Customer Calls: The majority of customers who churned received
fewer customer service calls, so one of the reasons for churn could be a lack of attention

 Account Length vs International Plan vs Churn: Churn among customers with an
International Plan is higher than among customers without one

 Account Length vs Voicemail Plan: The Voicemail Plan does not have as significant an impact
on churn as the International Plan

Please refer to the attached Excel workbook for detailed summaries
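The bracketed churn summary can be sketched as follows; the actual summaries live in the attached Excel workbook, and the records below are invented toy data, not real figures:

```python
# Illustrative sketch: churn rate per account-length bracket (toy data).
from collections import defaultdict

# (account_length_weeks, churned) - hypothetical records
customers = [(30, False), (80, True), (95, True), (110, True),
             (110, False), (140, False), (60, False), (102, True)]

def bracket(weeks, width=25):
    # Map a week count onto a 25-week bracket label such as "75-100"
    lo = (weeks // width) * width
    return f"{lo}-{lo + width}"

totals, churned = defaultdict(int), defaultdict(int)
for weeks, is_churn in customers:
    b = bracket(weeks)
    totals[b] += 1
    churned[b] += is_churn

churn_rate = {b: churned[b] / totals[b] for b in totals}
```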

Problem 2

2.1: On checking for missing data, the data was found to be complete.
Additionally, I checked the correlation among all features to detect
multicollinearity

Correlation

• High correlation was seen between each of the four charges and its
corresponding call minutes, hence the four charge variables were
removed from the model
• No significant correlation was observed among the remaining variables

Please refer to the R code for the detailed implementation and correlation values

Problem 2

2.2: To predict churn, I trained models with the Random Forest and XG
Boost algorithms on a training data set (80%) and evaluated both
models on a test data set (20%)

I split the data into train and test sets such that the churn ratio remains consistent in both.
I used R's built-in createDataPartition() function and verified the split manually in the code
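createDataPartition() draws a stratified sample so that the churn ratio is preserved; a hand-rolled Python equivalent (illustrative only, on toy data) makes the mechanism explicit:

```python
# Stratified 80/20 split preserving the class ratio - a Python sketch of
# what R's createDataPartition() does (illustrative, toy data).
import random

def stratified_split(rows, label_fn, train_frac=0.8, seed=42):
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_fn(row), []).append(row)
    train, test = [], []
    # Split each class separately so both splits keep the same class ratio
    for members in by_class.values():
        rng.shuffle(members)
        cut = int(round(train_frac * len(members)))
        train += members[:cut]
        test += members[cut:]
    return train, test

# Toy data: 10 churners among 50 customers -> 20% churn in both splits
data = [{"id": i, "churn": i < 10} for i in range(50)]
train, test = stratified_split(data, lambda r: r["churn"])
```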

Key Inputs:
• state
• account length
• international plan
• voice mail plan
• number vmail messages
• total day minutes
• total day calls
• total eve minutes
• total eve calls
• total night minutes
• total night calls
• total intl minutes
• total intl calls
• customer service calls
• dum_area code_408, dum_area code_415, dum_area code_510 (dummy variables were added for 'area code')

Key Outputs:
• Predicted churn for the Test Data Set

Problem 2

2.3: I built predictive Random Forest and XG Boost models and evaluated
each of them
Why Random forest?

• Works well with categorical data and has feature-selection capability
• Handles even non-linear relationships between the independent and dependent variables
• Random forest builds multiple decision trees, which helps prevent overfitting

Why XG Boost?

• XGBoost has built-in L1 (Lasso) and L2 (Ridge) regularization, which prevents the model
from overfitting
• XGBoost utilizes parallel processing, which makes it much faster than many other algorithms. It uses
multiple CPU cores to execute the model, which will enhance the model's performance after
deployment

Problem 2

2.4: To compare the Random Forest and XG Boost models, we use the
performance metrics below

Confusion Matrix: Random Forest Model

              Predicted True   Predicted False
Actual True         69               27
Actual False        14              556

Confusion Matrix: XG Boost Model

              Predicted True   Predicted False
Actual True         72               24
Actual False         6              564

Performance Metrics   Random Forest   XG Boost
F1 Score                   0.77          0.82
Precision                  0.83          0.92
Recall                     0.71          0.75
Accuracy                   0.94          0.96

* Please refer to the attached code for the detailed analysis

Comparing on the basis of the above metrics, the XG Boost model performed better than the Random
Forest model

Problem 2

The variable importance plot shows the relative importance of each
independent variable

'total day minutes', 'customer service calls' and 'international plan' are the three most important features

Problem 2

2.5: Potential issues in deployment, depending on whether training is
batch or real-time, can be summarized as follows:

 Expensive: If we train on data in real time and predict churn within seconds, we might have
to use expensive cloud infrastructure

 Static to Dynamic Data: Model fitting is usually done on a static data set; however, as the model goes into
production, we might have to deal with unstructured live data with issues such as missing values,
undefined values, changed formats, etc.

 Consistent Access to Data: In deployment we need to ensure that all data is available in a programmatic and
trusted manner

 Performance: Defining the cluster configuration so that the model runs efficiently in the given time

 Consistency in Model Deployment: When deploying our model in real life, we assume that the data we
apply the model to is representative of the data we trained the model on; however, there might be issues, for
example: we trained the model for 'area code' 408, 415 and 510, and in the new dataset we receive
entirely different area codes

 Vendor may make changes in defining data: Third-party vendors who provide the data might make small
changes to feature definitions that could make our model inconsistent


Problem 3

We design an experiment to explore how to reduce customer churn,
leveraging the model built in the previous exercise

Why do we see customer churn?

 From our model we observe that ‘customer service calls’ was the second most important feature

 From our data visualization we also observed that the majority of customers who churned received
fewer customer service calls, so one of the reasons for churn could be a lack of attention

With the above factors in mind, we design an experiment to reduce customer churn by increasing
'Customer Service Calls'

Primary objective of the experiment: To check if increased customer service calls lead
to reduced churn with statistical significance

Metrics for Performance Measurement

 The statistical significance threshold


 The duration of the Test & Sample Size
 The number of variants tested
 The null hypothesis and alternate hypothesis
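The duration and sample size item above can be made concrete with a standard two-proportion power calculation. The 20% baseline churn and 15% target below are assumptions for illustration; the z-values correspond to the two-sided alpha of 0.10 used later and a conventional 80% power:

```python
import math

# Rough per-group sample size for a two-proportion churn test.
# Assumed figures for illustration: baseline churn p1 = 20%, target p2 = 15%;
# z_alpha = 1.645 (two-sided alpha = 0.10), z_beta = 0.842 (power = 0.80).
def sample_size(p1, p2, z_alpha=1.645, z_beta=0.842):
    effect = abs(p1 - p2)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

n = sample_size(p1=0.20, p2=0.15)  # customers needed in each of test and control
```

The smaller the churn reduction we want to detect, the larger each group must be, which in turn sets the minimum test duration.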

Problem 3

We perform a test-control analysis to check whether churn is reduced
after increasing customer service calls

Step 1: Set up
• Define the changes: the number of customer service calls is increased for the test group

Step 2: Define Test and Control Groups
• Test and control groups are selected such that they are similar in all aspects, except that the
test group is exposed to more customer service calls while no changes are made for the control group

Step 3: Determine controlling criteria
• Variables such as % clicks are used to control the matching between the two groups

Step 4: Quantify uplift
• Post event, compare churn in the control and test groups; the uplift is attributed to the increased calls

From our previous data visualization we observed that the customers with the highest churn (account
length 75-125 weeks) received the lowest number of customer service calls; hence we design our
experiment around these customers and evaluate whether increased customer service calls reduce
customer churn

Problem 3

We follow the below detailed experiment design

 We take the customer group with account length 75-125 weeks (say N customers) and divide this group into
two samples, a Test group and a Control group

 Test Group: The test group receives an increased number of customer service calls over the experiment period T

 Control Group: The control group receives the same number of customer service calls as before over the
experiment period T

 We then measure the number of churned customers in each of the two groups

To make sure that we measure only the churn reduction from the increased calls and not from any other
factor, we control for other variables so that all external promotions/offers remain the same for the test and
control groups and the only difference lies in the number of customer calls

Problem 3

We define the null and alternate hypotheses to evaluate our experiment
in a statistically significant way

𝐻0 (Null Hypothesis): Any reduction in churn in the Test group is due to random chance

𝐻1 (Alternate Hypothesis): The reduction in churn in the Test group is due to the increased
number of customer calls

If the p-value < 0.1, we reject the Null Hypothesis and conclude that the reduction in churn is
due to the increased customer calls

Problem 3

The experiment carries the risks below, along with ways to mitigate them:

 How long do we test to reach a result with a significant level of confidence? Additionally, while we are
testing our hypothesis, the control group keeps receiving the regular number of customer calls, so churn is
left unaddressed for that group, i.e. 50% of the customers: We should use a test-duration
calculator to find a time period that can give statistically significant results

 If we are already very certain that the experience we are about to test will outperform the control, then
testing becomes unnecessary overhead, both through the fixed costs it incurs and through the risk-adjusted
losses from delaying the implementation of the better experience: In our case, since the historical
data showed that the number of service calls has a direct impact on churn, we might increase the service
calls directly without testing their impact

 Novelty Effect: The novelty effect comes into play when we make an alteration that our typical customer is not
used to seeing. In our case, customers are not used to frequent customer service calls, so the increased
calls might lead them not to churn; it is then unclear whether the reduced churn comes from the number of
additional calls or from the novelty that attracts their attention: To test this effect, we can include new customers
in our experiment, since new customers are not accustomed to any particular pattern

 Even if we establish that increased service calls lead to reduced churn, we still need to find the
optimal number of additional service calls: the impact of customer service calls will saturate
beyond a certain number. This optimal number of calls could be found by
running a separate model

Questions?

