
Upsell Model Case

© EduPristine – www.edupristine.com
© EduPristine BA-Session-II
Modeling Test
Propensity Model for Up-Sell in Telecom Industry
Business Problem: Company X, a telecom operator, has a strong market share. The market has reached a point where
capturing new customers is expensive and yields low ROI, so Company X plans to focus on its existing customers for
revenue growth. To capture this growth, the company decided to target its existing customers with higher-value plans
through an up-sell campaign. Since the company has a very large base of existing customers, it is not feasible to target
each and every customer, so the marketing team proposed developing a propensity model to identify the customers
who should be targeted in this campaign.
Develop a model for identifying the right set of customers on plan A to up-sell to plan B.

Tasks to be performed
1. Import data
2. Identify and develop Dependent variable
3. Prepare Uni-Variate and Bi-Variate Report
4. Prepare missing value report
5. Perform initial variable reduction and missing value imputation
6. Perform extreme value treatment
7. Prepare correlation matrix and VIF chart
8. Perform Multicollinearity check and variable reduction through Multicollinearity
9. Develop IV report
10. Perform Binning to prepare modeling dataset
11. Perform sampling to prepare training and validation dataset
12. Run the model
13. Develop report for model outcomes
14. Write the Scoring or implementation strategy

Import data
Step: 1. Import data
Set the working directory
setwd("C:/Users/babycorn/Documents/Edupristine/Telecom upsell Case")
Import the raw data file into the cust_data data frame
cust_data <- read.csv("Rawdatafile.csv")
### See the first and last rows (verify the data)
head(cust_data)
tail(cust_data)

Identify and develop Dependent variable
Step: 2. Identify and develop Dependent variable

Look for a variable that can be used to identify historical responders and non-responders. This requires both business
understanding and data understanding. In the given case, the plan-change flag can be used to identify the customers who
went for the up-sell in the past. The model-building exercise will try to capture the characteristics of these customers
through the other independent variables.
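The responder flag described here can be derived directly from the plan-change flag. A minimal sketch on a small illustrative frame (with the real data, the same two lines apply to cust_data):

```r
## Illustrative frame standing in for cust_data
demo <- data.frame(Plan_Chg_Flag = c("Yes", "No", "No", "No", "Yes"))
## Binary dependent variable: 1 = responder (changed plan), 0 = non-responder
demo$Responder <- ifelse(demo$Plan_Chg_Flag == "Yes", 1, 0)
table(demo$Responder)   # responders vs non-responders
mean(demo$Responder)    # historical event rate
```

Checking the event rate early also tells us how imbalanced the classes are, which matters later for sampling.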

Prepare Uni-Variate and Bi-Variate Report (Cont'd…)
Step: 3. Prepare Uni-Variate and Bi-Variate Report
summ_cust_data <- summary(cust_data)
write.csv(summ_cust_data, "summ_cust_data.csv")
Check the file generated in the work folder and verify the summary details.
Identify the variables with missing values, e.g. Var1, Var27
Identify the variables with extreme values, e.g. Var4
Identify the variables with very high or very low granularity, e.g. Var10
Cust_id Plan_Chg_Flag Var1 Var2 Var3 Var4 Var5
Min. :1000 No :4500 Central:1599 High :1745 Govt :1497 Min. : 20.00 Female:2538
1st Qu.:2250 Yes: 500 North :1736 Low :1681 Private :1985 1st Qu.: 20.00 Male :2462
Median :3500 NA South :1640 Medium:1574 Unemployed:1518 Median : 30.00 NA
Mean :3500 NA NA's : 25 NA NA Mean : 33.14 NA
3rd Qu.:4749 NA NA NA NA 3rd Qu.: 50.00 NA
Max. :5999 NA NA NA NA Max. :120.00 NA

Var6 Var7 Var8 Var9 Var10 Var11 Var12


Min. :0.000 Min. : 200.0 Min. : 7.00 Card:2620 Postpaid:5000 Min. :1.000 Min. :1.000
1st Qu.:0.000 1st Qu.: 200.0 1st Qu.: 7.00 Cash:2380 NA 1st Qu.:1.000 1st Qu.:1.000
Median :3.000 Median : 600.0 Median :15.00 NA NA Median :1.000 Median :1.000
Mean :1.502 Mean : 647.3 Mean :11.13 NA NA Mean :1.486 Mean :1.487
3rd Qu.:3.000 3rd Qu.:1200.0 3rd Qu.:15.00 NA NA 3rd Qu.:2.000 3rd Qu.:2.000
Max. :3.000 Max. :1200.0 Max. :15.00 NA NA Max. :2.000 Max. :2.000

Prepare Uni-Variate and Bi-Variate Report (Cont'd…)
Var13 Var14 Var15 Var16 Var17 Var18 Var19
Min. :300.0 Min. :300.0 Min. :300.0 Min. :100.0 Min. : 10.0 Min. : 50.0 Min. : 500
1st Qu.:300.0 1st Qu.:300.0 1st Qu.:300.0 1st Qu.:100.0 1st Qu.: 10.0 1st Qu.: 50.0 1st Qu.: 500
Median :600.0 Median :600.0 Median :600.0 Median :300.0 Median :300.0 Median :200.0 Median :1500
Mean :462.4 Mean :455.7 Mean :455.5 Mean :206.7 Mean :167.2 Mean :128.9 Mean :1021
3rd Qu.:600.0 3rd Qu.:600.0 3rd Qu.:600.0 3rd Qu.:300.0 3rd Qu.:300.0 3rd Qu.:200.0 3rd Qu.:1500
Max. :600.0 Max. :600.0 Max. :600.0 Max. :300.0 Max. :300.0 Max. :200.0 Max. :1500

Var20 Var21 Var22 Var23 Var24 Var25 Var26 Var27


Min. :1.000 Min. :1.000 Min. :300.0 Min. :300.0 Min. : 20.0 Min. : 50 Min. :20.00 Good: 811
1st Qu.:1.000 1st Qu.:1.000 1st Qu.:300.0 1st Qu.:300.0 1st Qu.: 20.0 1st Qu.: 50 1st Qu.:20.00 Poor: 17
Median :2.000 Median :2.000 Median :600.0 Median :600.0 Median :100.0 Median :500 Median :20.00 NA's:4172
Mean :1.518 Mean :1.527 Mean :457.1 Mean :460.7 Mean : 62.1 Mean :286 Mean :34.36 NA
3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:600.0 3rd Qu.:600.0 3rd Qu.:100.0 3rd Qu.:500 3rd Qu.:50.00 NA
Max. :2.000 Max. :2.000 Max. :600.0 Max. :600.0 Max. :100.0 Max. :500 Max. :50.00 NA

Prepare missing value report
Step: 4. Prepare missing value report
The report above tells us that Var1 and Var27 need missing-value treatment. Further, the percentage of missing
values is very high in Var27:
% Missing = 4172/5000 = 83.44%
As a rule, we generally drop variables with more than 50% missing values. Hence Var27 should be dropped from the
dataset.
For Var1, % Missing = 25/5000 = 0.5%. We can impute this with either the mode of the variable or any meaningful
value we can think of.
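The per-variable missing percentage can also be computed programmatically rather than read off the summary. A sketch on a tiny illustrative frame; with the real data, the same calls apply to cust_data:

```r
## Illustrative frame: Var1 is 25% missing, Var27 is 75% missing
demo <- data.frame(Var1 = c(1, NA, 3, 4), Var27 = c(NA, NA, NA, 9))
## percentage of NA values per column
miss_pct <- colMeans(is.na(demo)) * 100
miss_pct
## variables above the 50% threshold are candidates for dropping
drop_vars <- names(miss_pct)[miss_pct > 50]
drop_vars
```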

Perform initial variable reduction and missing value imputation
Step: 5. Perform initial variable reduction and missing value imputation

## Code to drop the variables not required (Var10 is column 12, Var27 is column 29)
cust_dat1 <- cust_data[,-c(12,29)]
View(cust_dat1)
## Impute the blank Var1 values with the mode ("North"); assign the result back
cust_dat1$Var1 <- ifelse(cust_dat1$Var1=="", "North",
                  ifelse(cust_dat1$Var1=="South", "South",
                  ifelse(cust_dat1$Var1=="Central", "Central", "North")))

Perform extreme value treatment
Step: 6. Perform extreme value treatment
## Cap Var4 at 50 and bucket the lower values; assign the result back
cust_dat1$Var4 <- ifelse(cust_dat1$Var4 > 50, 50,
                  ifelse(cust_dat1$Var4 < 25, 20, 30))
Verify summary again:
cust_dat1$Var1<-ifelse(cust_dat1$Var1=="", "North",
ifelse(cust_dat1$Var1=="South", "South",
ifelse(cust_dat1$Var1=="Central", "Central", "North")))
cust_dat1$Var4<-ifelse(cust_dat1$Var4 >50, 50,
ifelse(cust_dat1$Var4< 25, 20 , 30))
summ_cust_data2 <- summary(cust_dat1)
write.csv(summ_cust_data2, "summ_cust_data2.csv")

Perform initial variable reduction and missing value imputation (Cont'd…)
Cust_id Plan_Chg_Flag Var1 Var2 Var3 Var4 Var5
Min. :1000 No :4500 Length:5000 High :1745 Govt :1497 Min. :20.00 Female:2538
1st Qu.:2250 Yes: 500 Class :character Low :1681 Private :1985 1st Qu.:20.00 Male :2462
Median :3500 NA Mode :character Medium:1574 Unemployed:1518 Median :30.00 NA
Mean :3500 NA NA NA NA Mean :26.78 NA
3rd Qu.:4749 NA NA NA NA 3rd Qu.:30.00 NA
Max. :5999 NA NA NA NA Max. :50.00 NA

Var6 Var7 Var8 Var9 Var11 Var12


Min. :0.000 Min. : 200.0 Min. : 7.00 Card:2620 Min. :1.000 Min. :1.000
1st Qu.:0.000 1st Qu.: 200.0 1st Qu.: 7.00 Cash:2380 1st Qu.:1.000 1st Qu.:1.000
Median :3.000 Median : 600.0 Median :15.00 NA Median :1.000 Median :1.000
Mean :1.502 Mean : 647.3 Mean :11.13 NA Mean :1.486 Mean :1.487
3rd Qu.:3.000 3rd Qu.:1200.0 3rd Qu.:15.00 NA 3rd Qu.:2.000 3rd Qu.:2.000
Max. :3.000 Max. :1200.0 Max. :15.00 NA Max. :2.000 Max. :2.000
Var13 Var14 Var15 Var16 Var17 Var18 Var19
Min. :300.0 Min. :300.0 Min. :300.0 Min. :100.0 Min. : 10.0 Min. : 50.0 Min. : 500
1st Qu.:300.0 1st Qu.:300.0 1st Qu.:300.0 1st Qu.:100.0 1st Qu.: 10.0 1st Qu.: 50.0 1st Qu.: 500
Median :600.0 Median :600.0 Median :600.0 Median :300.0 Median :300.0 Median :200.0 Median :1500
Mean :462.4 Mean :455.7 Mean :455.5 Mean :206.7 Mean :167.2 Mean :128.9 Mean :1021
3rd Qu.:600.0 3rd Qu.:600.0 3rd Qu.:600.0 3rd Qu.:300.0 3rd Qu.:300.0 3rd Qu.:200.0 3rd Qu.:1500
Max. :600.0 Max. :600.0 Max. :600.0 Max. :300.0 Max. :300.0 Max. :200.0 Max. :1500

Var20 Var21 Var22 Var23 Var24 Var25 Var26


Min. :1.000 Min. :1.000 Min. :300.0 Min. :300.0 Min. : 20.0 Min. : 50 Min. :20.00
1st Qu.:1.000 1st Qu.:1.000 1st Qu.:300.0 1st Qu.:300.0 1st Qu.: 20.0 1st Qu.: 50 1st Qu.:20.00
Median :2.000 Median :2.000 Median :600.0 Median :600.0 Median :100.0 Median :500 Median :20.00
Mean :1.518 Mean :1.527 Mean :457.1 Mean :460.7 Mean : 62.1 Mean :286 Mean :34.36
3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:600.0 3rd Qu.:600.0 3rd Qu.:100.0 3rd Qu.:500 3rd Qu.:50.00
Max. :2.000 Max. :2.000 Max. :600.0 Max. :600.0 Max. :100.0 Max. :500 Max. :50.00

Prepare correlation matrix and VIF chart
Step: 7. Prepare correlation matrix and VIF chart
## Drop the ID, the flag and the categorical variables; keep only numerics for correlation
Corr_data <- cust_dat1[,-c(1,2,3,4,5,7,11)]
corr_matrix <- cor(Corr_data)
write.csv(corr_matrix, "corr_matrix.csv")

CorrMatrix Var4 Var6 Var7 Var8 Var11 Var12 Var13 Var14 Var15 Var16 Var17 Var18 Var19 Var20 Var21 Var22 Var23 Var24 Var25 Var26
Var4 1 0.008487 -0.01643 -0.01189 -0.02143 -0.02919 -0.00748 0.008141 -0.0083 -0.02473 0.005861 0.016315 0.012893 -0.0006 0.009159 0.007905 0.008748 0.021432 0.011286 -0.02337
Var6 0.008487 1 -0.0166 -0.00164 -0.01877 -0.01838 -0.01816 0.00796 -0.00805 -0.01812 0.019167 0.003141 0.018366 -0.02526 -0.00287 0.011156 -0.0061 0.024771 -0.00847 -0.00075
Var7 -0.01643 -0.0166 1 -0.00826 0.013659 -0.00424 0.015166 0.005934 0.004718 -0.00739 -0.0169 0.021915 -0.00991 0.017714 -0.00586 -0.03519 0.014272 -0.01045 0.003649 0.049884
Var8 -0.01189 -0.00164 -0.00826 1 -0.02589 -0.00158 0.062777 0.088479 0.043672 0.076837 0.065534 0.068031 0.052337 0.094568 0.059557 0.04974 0.091975 0.075245 0.052509 0.0094
Var11 -0.02143 -0.01877 0.013659 -0.02589 1 0.093343 -0.00806 -0.01452 0.002251 0.001138 -0.01804 0.002714 -0.00322 -0.0142 -0.02488 -0.02467 -0.0164 0.00592 -0.01141 0.079273
Var12 -0.02919 -0.01838 -0.00424 -0.00158 0.093343 1 -0.01679 -0.01586 -0.0191 -0.01474 -0.02356 -0.03314 -0.01898 -0.00672 -0.00945 0.017225 -0.00181 -0.02753 -0.0168 0.091835
Var13 -0.00748 -0.01816 0.015166 0.062777 -0.00806 -0.01679 1 0.055096 0.07129 0.066841 0.070738 0.045901 0.074923 0.038856 0.065437 0.086469 0.072161 0.042686 0.0699 -0.01293
Var14 0.008141 0.00796 0.005934 0.088479 -0.01452 -0.01586 0.055096 1 0.033063 0.07487 0.061858 0.0437 0.084167 0.076765 0.083316 0.073528 0.060291 0.053321 0.069879 0.004033
Var15 -0.0083 -0.00805 0.004718 0.043672 0.002251 -0.0191 0.07129 0.033063 1 0.071764 0.078861 0.065426 0.069806 0.073613 0.061752 0.045549 0.060402 0.08226 0.077167 -0.00325
Var16 -0.02473 -0.01812 -0.00739 0.076837 0.001138 -0.01474 0.066841 0.07487 0.071764 1 0.064705 0.081976 0.066638 0.067429 0.071813 0.058199 0.056285 0.033801 0.056114 -0.01437
Var17 0.005861 0.019167 -0.0169 0.065534 -0.01804 -0.02356 0.070738 0.061858 0.078861 0.064705 1 0.067109 0.046327 0.064901 0.060112 0.04419 0.049895 0.067913 0.052531 -0.00843
Var18 0.016315 0.003141 0.021915 0.068031 0.002714 -0.03314 0.045901 0.0437 0.065426 0.081976 0.067109 1 0.069997 0.053061 0.072958 0.040808 0.052876 0.060621 0.062804 -0.00256
Var19 0.012893 0.018366 -0.00991 0.052337 -0.00322 -0.01898 0.074923 0.084167 0.069806 0.066638 0.046327 0.069997 1 0.067048 0.063106 0.083805 0.083349 0.053159 0.060112 -0.01987
Var20 -0.0006 -0.02526 0.017714 0.094568 -0.0142 -0.00672 0.038856 0.076765 0.073613 0.067429 0.064901 0.053061 0.067048 1 0.066224 0.072451 0.055268 0.064284 0.068807 -0.0045
Var21 0.009159 -0.00287 -0.00586 0.059557 -0.02488 -0.00945 0.065437 0.083316 0.061752 0.071813 0.060112 0.072958 0.063106 0.066224 1 0.069191 0.056353 0.075364 0.086376 -0.00609
Var22 0.007905 0.011156 -0.03519 0.04974 -0.02467 0.017225 0.086469 0.073528 0.045549 0.058199 0.04419 0.040808 0.083805 0.072451 0.069191 1 0.079703 0.080908 0.051797 -0.01559
Var23 0.008748 -0.0061 0.014272 0.091975 -0.0164 -0.00181 0.072161 0.060291 0.060402 0.056285 0.049895 0.052876 0.083349 0.055268 0.056353 0.079703 1 0.063317 0.060752 -0.00939
Var24 0.021432 0.024771 -0.01045 0.075245 0.00592 -0.02753 0.042686 0.053321 0.08226 0.033801 0.067913 0.060621 0.053159 0.064284 0.075364 0.080908 0.063317 1 0.09248 -0.00016
Var25 0.011286 -0.00847 0.003649 0.052509 -0.01141 -0.0168 0.0699 0.069879 0.077167 0.056114 0.052531 0.062804 0.060112 0.068807 0.086376 0.051797 0.060752 0.09248 1 -0.01033
Var26 -0.02337 -0.00075 0.049884 0.0094 0.079273 0.091835 -0.01293 0.004033 -0.00325 -0.01437 -0.00843 -0.00256 -0.01987 -0.0045 -0.00609 -0.01559 -0.00939 -0.00016 -0.01033 1

## Create the binary dependent variable from the plan-change flag
cust_dat1$Responder <- ifelse(cust_dat1$Plan_Chg_Flag == "Yes", 1, 0)
head(cust_dat1)
tail(cust_dat1)

Prepare correlation matrix and VIF chart (Cont'd…)
## vif() comes from the car package
library(car)
vif_Cust_data <- vif(lm(Responder ~ Var4 + Var6 + Var7 + Var8 + Var11 + Var12 + Var13 + Var14 + Var15 + Var16 +
                        Var17 + Var18 + Var19 + Var20 + Var21 + Var22 + Var23 + Var24 + Var25 + Var26,
                        data = cust_dat1))
write.csv(vif_Cust_data, "vif_Cust_data.csv")
Variable VIF
Var4 1.004064
Var6 1.004115
Var7 1.006625
Var8 1.040132
Var11 1.01728
Var12 1.020826
Var13 1.03402
Var14 1.037052
Var15 1.034849
Var16 1.035711
Var17 1.030768
Var18 1.03111
Var19 1.03765
Var20 1.036979
Var21 1.037472
Var22 1.037842
Var23 1.034547
Var24 1.037831
Var25 1.035905
Var26 1.017624

Perform Multicollinearity check and variable reduction through Multicollinearity
Step: 8. Perform Multicollinearity check and variable reduction through Multicollinearity

Since the VIF for all the factors is less than 2 and the correlation matrix shows low correlation values, we can safely
assume that multicollinearity is not present.
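The VIF screen can be expressed as a one-line filter. `vif_values` below is an illustrative stand-in; with the real model it would be the output of `vif()`:

```r
## Illustrative named vector standing in for the vif() output
vif_values <- c(Var4 = 1.004, Var8 = 1.040, VarX = 3.20)
## Flag variables exceeding the cutoff used in the text (VIF > 2);
## these would be dropped one at a time, refitting after each drop
high_vif <- names(vif_values)[vif_values > 2]
high_vif
```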

Develop IV report using Bi-variate analysis and variable reduction technique
Step: 9. Develop IV report using Bi-variate analysis and variable reduction technique
Perform a frequency analysis of each category in the data against the response variable (dependent variable).
The final analysis should produce a report highlighting the cumulative IV values, as shown below.
The following code generates and saves a CSV for the bi-variate analysis:

##Remove the 2nd variable which was used to create responder variable
cust_dat1<-cust_dat1[,c(-2)]
## Set the data in the new datafile
dat<-cust_dat1
###Verify the dataset
head(dat)

## create a table to identify each variable's type
## (sapply keeps per-column types; apply() would coerce everything to character)
bi_var <- sapply(dat, typeof)

## add the variable name against each type, with a flag defaulted to 1 for all variables
bi_var<-data.frame(colnames(dat),bi_var,flag=1)

Develop IV report using Bi-variate analysis and variable reduction technique (Cont'd…)
## set the row names as numbers
row.names(bi_var)<-1:nrow(bi_var)

## set the column names


colnames(bi_var)<-c("variable","var_type","flag")

## get the position for variables to set the flag as 0


bi_var$flag[which( bi_var$variable %in% c("Cust_id","Responder"))]<-0

## remove those with flag as 0


bi_var<-bi_var[bi_var$flag==1,]

## created an object to get bi var analysis


event_rate<-NULL

Develop IV report using Bi-variate analysis and variable reduction technique (Cont'd…)
## loop over all the variables in the table
for ( i in 1:nrow(bi_var))
{
## frequency table of the variable against the responder flag;
## indexing by name avoids the position-shift problems of dat[,i+2]
aa<-as.matrix(table(dat[[as.character(bi_var$variable[i])]], dat$Responder))
cc<-aa
## append var name and the categories in that variable
bb<-cbind(rep(as.character(bi_var$variable[i]),nrow(aa)),row.names(aa))
## merge the name, cat and freq table
aa<-data.frame(cbind(bb,aa))

## calc for ER, NER, WOE, IV and cum IV


aa[,5]<-as.numeric(cc[,1])/sum(as.numeric(cc[,1]))
aa[,6]<-as.numeric(cc[,2])/sum(as.numeric(cc[,2]))

Develop IV report using Bi-variate analysis and variable reduction technique (Cont'd…)
aa[,7]<-log(aa[,5]/aa[,6])
aa[,8]<-(aa[,5]-aa[,6])*aa[,7]
aa[,9]<-sum(aa[,8])

## append everything in new dataset


event_rate<-rbind(event_rate,aa)
}
## give the column names for the data created above, after the for loop
colnames(event_rate)<-c("variable","Factor","Res","Non-Res","ER","NER","WOE","IV","Cum_IV")
## view the event-rate table and save it for analysis
head(event_rate)
write.csv(event_rate, "event_rate_IV.csv")

Develop IV report using Bi-variate analysis and variable reduction technique (Cont'd…)
variable Factor Res Non-Res ER NER WOE IV Cum_IV
Var26 1 1 500 0.000222173 0.998003992 -8.410056871 8.391401844 14.59399668
Var6 1200 1500 1 0.333333333 0.002150538 5.043425117 1.67029563 1.98
Var2 Unemployed 1482 36 0.329333333 0.072 1.52040429 0.391250704 1.167269673
Var25 50 2292 101 0.509333333 0.202 0.924834984 0.284232618 0.433702214
Var19 2 2211 377 0.491333333 0.754 -0.428269584 0.112492144 0.303309335
Var13 600 2220 375 0.493333333 0.75 -0.418888128 0.10751462 0.288822004
Var23 100 2253 378 0.500666667 0.756 -0.412100833 0.105223079 0.288068722
Var15 300 2293 375 0.509555556 0.75 -0.38653432 0.09294003 0.254963781
Var14 600 2223 368 0.494 0.736 -0.398694602 0.096484094 0.253926285
Var24 500 2252 370 0.500444444 0.74 -0.391153594 0.093703016 0.250141702
Var21 600 2250 369 0.5 0.738 -0.389335726 0.092661903 0.246472638
Var20 2 2266 370 0.503555556 0.74 -0.384956141 0.091020741 0.243950632
Var18 1500 2237 366 0.497111111 0.732 -0.386966949 0.090894237 0.238729138
Var9 2 2291 137 0.509111111 0.274 0.619538179 0.14566031 0.23766686
Var12 600 2332 374 0.518222222 0.748 -0.366998827 0.084328175 0.233236555
Var17 200 2266 365 0.503555556 0.73 -0.371350489 0.084090255 0.222006159
Var22 600 2310 368 0.513333333 0.736 -0.360304712 0.080227849 0.216417534
Var7 15 2223 358 0.494 0.716 -0.37114465 0.082394112 0.210612972
Var16 300 2345 366 0.521111111 0.732 -0.33981723 0.071663678 0.194080803
Var11 2 2281 156 0.506888889 0.312 0.485288638 0.094577363 0.159485948
Var8 Cash 2226 154 0.494666667 0.308 0.473784352 0.088439746 0.147121712
Var3 50 2 1 0.000444444 0.003610108 -2.094667989 0.006631015 0.085352967
Var5 3 2261 242 0.502444444 0.484 0.037400169 0.000689825 0.001361195
Var4 Male 2211 251 0.491333333 0.502 -0.021477336 0.000229092 0.000455149
Var1 Medium 1413 161 0.314 0.322 -0.02515856 0.000201268 0.000376317

Perform Binning to prepare modeling dataset
Step: 10. Perform Binning to prepare modeling dataset

cust_dat1$GRPVar1<-ifelse(cust_dat1$Var1=="North",1,ifelse(cust_dat1$Var1=="South",2,3))
cust_dat1$GRPVar2<-ifelse(cust_dat1$Var2=="Low",1,ifelse(cust_dat1$Var2=="Medium",2,3))
cust_dat1$GRPVar3<-ifelse(cust_dat1$Var3=="Unemployed",1,ifelse(cust_dat1$Var3=="Govt",2,3))
cust_dat1$GRPVar4<-ifelse(cust_dat1$Var4 < 25,1,ifelse(cust_dat1$Var4< 40,2,3))
cust_dat1$GRPVar5<-ifelse(cust_dat1$Var5=="Male",1,2)
cust_dat1$GRPVar6<-ifelse(cust_dat1$Var6 < 2,1,2)
cust_dat1$GRPVar7<-ifelse(cust_dat1$Var7 < 500,1,ifelse(cust_dat1$Var7< 1000,2,3))
cust_dat1$GRPVar8<-ifelse(cust_dat1$Var8 < 13,1,2)
cust_dat1$GRPVar9<-ifelse(cust_dat1$Var9=="Cash",1,2)
cust_dat1$GRPVar11<-ifelse(cust_dat1$Var11 < 2,1,2)
cust_dat1$GRPVar12<-ifelse(cust_dat1$Var12 < 2,1,2)
cust_dat1$GRPVar13<-ifelse(cust_dat1$Var13 < 500,1,2)
cust_dat1$GRPVar14<-ifelse(cust_dat1$Var14 < 500,1,2)
cust_dat1$GRPVar15<-ifelse(cust_dat1$Var15 < 500,1,2)
cust_dat1$GRPVar16<-ifelse(cust_dat1$Var16 < 200,1,2)
cust_dat1$GRPVar17<-ifelse(cust_dat1$Var17 < 20,1,2)
cust_dat1$GRPVar18<-ifelse(cust_dat1$Var18 < 100,1,2)
cust_dat1$GRPVar19<-ifelse(cust_dat1$Var19 < 1000,1,2)

Perform Binning to prepare modeling dataset (Cont'd…)
cust_dat1$GRPVar20<-ifelse(cust_dat1$Var20 < 2,1,2)
cust_dat1$GRPVar21<-ifelse(cust_dat1$Var21 < 2,1,2)
cust_dat1$GRPVar22<-ifelse(cust_dat1$Var22 < 500,1,2)
cust_dat1$GRPVar23<-ifelse(cust_dat1$Var23 < 500,1,2)
cust_dat1$GRPVar24<-ifelse(cust_dat1$Var24 < 50,1,2)
cust_dat1$GRPVar25<-ifelse(cust_dat1$Var25 < 100,1,2)
cust_dat1$GRPVar26<-ifelse(cust_dat1$Var26 < 30,1,2)

## keep the responder flag and all binned (GRP) variables, selected by name
cust_modeldata <- cust_dat1[, c("Responder", grep("^GRPVar", names(cust_dat1), value = TRUE))]

write.csv(cust_modeldata, "cust_modeldata.csv")

Perform sampling to prepare training and validation dataset
Step: 11. Perform sampling to prepare training and validation dataset
## strata() and getdata() come from the sampling package
library(sampling)
## sort the data so the responder stratum comes first
cust_modeldata <- cust_modeldata[order(cust_modeldata$Responder, decreasing = TRUE),]
## 70% stratified sample without replacement: 350 responders, 3150 non-responders
training_data <- strata(cust_modeldata, c("Responder"), size = c(350,3150), method = "srswor")
training_data <- getdata(cust_modeldata, training_data)

View(training_data)
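The slide builds only the training sample; the validation sample is the complement of the selected rows. A package-free sketch of the same 70/30 stratified-split idea on an illustrative frame (the deck itself uses sampling::strata()):

```r
set.seed(42)
## Illustrative frame: 50 responders, 450 non-responders
demo <- data.frame(Responder = rep(c(1, 0), c(50, 450)))
## sample 70% of the row indices within each Responder stratum
idx <- unlist(lapply(split(seq_len(nrow(demo)), demo$Responder),
                     function(s) sample(s, round(length(s) * 0.7))))
training   <- demo[idx, , drop = FALSE]
validation <- demo[-idx, , drop = FALSE]   # complement = validation set
table(training$Responder)
```

The stratification preserves the event rate in both samples, so model performance comparisons between them are on equal footing.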

Run the model
Step: 12. Run the model
fit <- glm(Responder ~ as.factor(GRPVar2) + as.factor(GRPVar7) + as.factor(GRPVar8) +
           as.factor(GRPVar9) + as.factor(GRPVar11) + as.factor(GRPVar12) +
           as.factor(GRPVar1) + as.factor(GRPVar14) + as.factor(GRPVar15) +
           as.factor(GRPVar16) + as.factor(GRPVar17) + as.factor(GRPVar18) +
           as.factor(GRPVar19) + as.factor(GRPVar20) + as.factor(GRPVar21) +
           as.factor(GRPVar22) + as.factor(GRPVar23) + as.factor(GRPVar24) +
           as.factor(GRPVar25), family = binomial("logit"), data = training_data)

Develop report for model outcomes
Step: 13. Develop report for model outcomes
summary(fit) # display results
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4130 -0.4314 -0.2489 -0.0001 3.5287

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.39793 0.26792 -16.415 < 2e-16 ***
as.factor(GRPVar2)2 0.13524 0.16422 0.823 0.410228
as.factor(GRPVar2)3 0.06255 0.16170 0.387 0.698897
as.factor(GRPVar7)2 0.23670 0.14187 1.668 0.095237 .
as.factor(GRPVar7)3 -17.19727 317.33159 -0.054 0.956781
as.factor(GRPVar8)2 0.31399 0.14478 2.169 0.030108 *
as.factor(GRPVar9)2 0.30041 0.14080 2.134 0.032877 *
as.factor(GRPVar11)2 -1.01584 0.14414 -7.047 1.82e-12 ***
as.factor(GRPVar12)2 -0.67231 0.14026 -4.793 1.64e-06 ***

Develop report for model outcomes (Cont'd…)
as.factor(GRPVar1)2 -0.13777 0.16203 -0.850 0.395181
as.factor(GRPVar1)3 -0.14090 0.16094 -0.875 0.381333
as.factor(GRPVar14)2 0.49729 0.14443 3.443 0.000575 ***
as.factor(GRPVar15)2 0.40871 0.14483 2.822 0.004771 **
as.factor(GRPVar16)2 0.32636 0.14490 2.252 0.024302 *
as.factor(GRPVar17)2 0.34034 0.14502 2.347 0.018932 *
as.factor(GRPVar18)2 0.42106 0.14363 2.932 0.003373 **
as.factor(GRPVar19)2 0.35157 0.14416 2.439 0.014741 *
as.factor(GRPVar20)2 0.55552 0.14578 3.811 0.000139 ***
as.factor(GRPVar21)2 0.38559 0.14343 2.688 0.007180 **
as.factor(GRPVar22)2 0.36509 0.14402 2.535 0.011243 *
as.factor(GRPVar23)2 0.24158 0.14353 1.683 0.092343 .
as.factor(GRPVar24)2 0.36435 0.14737 2.472 0.013424 *
as.factor(GRPVar25)2 0.36496 0.14658 2.490 0.012781 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Develop report for model outcomes (Cont'd…)
(Dispersion parameter for binomial family taken to be 1)

Null deviance: 2275.6 on 3499 degrees of freedom


Residual deviance: 1552.9 on 3477 degrees of freedom
AIC: 1598.9

Number of Fisher Scoring iterations: 18

summary_pred_model <- predict(fit, type = "response")
write.csv(summary_pred_model, "summary_pred_model.csv")

See the generated CSV for the predicted scores.

Model Fit Criteria
1. Use the deviance or Hosmer-Lemeshow test statistic to check the fit of the model. For these goodness-of-fit tests, a
higher p-value indicates a better-fitting model. Proceed to the next steps only if the p-value is high.
##############################################################################
### Hosmer lemeshow goodness of fit test
hosmerlem <- function (y, yhat, g = 12)
{
cutyhat <- cut(yhat, breaks = quantile(yhat, probs = seq(0,1, 1/g)), include.lowest = T)
obs <- xtabs(cbind(1 - y, y) ~ cutyhat)
expect <- xtabs(cbind(1 - yhat, yhat) ~ cutyhat)
chisq <- sum((obs - expect)^2/expect)
P <- 1 - pchisq(chisq, g - 2)
c("X^2" = chisq, Df = g - 2, "P(>Chi)" = P)
}
##Run the above function after setting the values
R_hat<-as.vector(fitted(fit))
yhat<-R_hat
y<-training_data$Responder
hosmerlem(y, yhat)

Model Fit Criteria (Cont'd…)
2. Test the null hypothesis for the independent variables, i.e. all β = 0. The p-value should be significant (p < 0.05) to
reject the null hypothesis and conclude that the β values are not all zero.

3. Check the concordance and Tie. The rule of thumb test is (Concordance+ ½ Tie) should be greater than 60%.

4. Check the significance of the estimate for each variable. If any estimate is not significant, the variable
with the highest p-value is dropped and the preceding steps are repeated with the new set of variables. This process
continues until all the variables in the model have significant estimates.

5. Frame the equation with the significant variables. The odds ratio and probability value for each profile are
estimated.

6. Specificity and sensitivity of the model are assessed and the ROC (Receiver Operating Characteristic) curve is plotted.
The area under the ROC curve indicates how well the identified model classifies goods as good and bads as
bad.

7. Coefficient Stability: Coefficient stability is checked across the development and validation samples. Once the model
performs satisfactorily on the development sample, we use the same set of variables to model the validation sample. A
robust model should perform equally well on the validation sample too; hence, the coefficients should be in a close
range and of the same sign.

Model Fit Criteria (Cont'd…)
8. Concordance: Consider a set of 100 individuals of which 10 are responders (denoted by 1) and 90 are
non-responders (denoted by 0). We construct a pair of each responder with every non-responder, giving 900
such pairs (10*90 = 900). Using the model under development, we calculate the predicted response rate for each
responder and non-responder in every pair. If the responder's predicted probability is greater than the non-responder's
predicted probability, the pair is concordant. If it is the other way around, the pair is discordant, and if both are equal,
the pair is tied. For a good model, the percentage of concordant pairs lies above 65%.
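The pairing logic can be checked on a toy set of predicted scores. The values below are purely illustrative:

```r
## Predicted probabilities for responders and non-responders (illustrative)
ones  <- c(0.9, 0.8, 0.4)
zeros <- c(0.3, 0.5)
## all responder / non-responder pairs (3 * 2 = 6 pairs)
pairs <- expand.grid(p1 = ones, p0 = zeros)
conc_rate <- mean(pairs$p1 > pairs$p0)    # concordant share
disc_rate <- mean(pairs$p1 < pairs$p0)    # discordant share
tie_rate  <- mean(pairs$p1 == pairs$p0)   # tied share
conc_rate
```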

See the results for concordance test below
##############################################################
outcome_and_fitted_col <- data.frame(training_data$Responder, R_hat)
colnames(outcome_and_fitted_col) <- c("Responder", "fitted.values")
Concordance <- function(outcome_and_fitted_col)
{
  # subset of outcomes where the event actually happened
  ones <- outcome_and_fitted_col[outcome_and_fitted_col[,1] == 1,]
  # subset of outcomes where the event didn't happen
  zeros <- outcome_and_fitted_col[outcome_and_fitted_col[,1] == 0,]
  # equate the length of the event and non-event tables
  if (length(ones[,1]) > length(zeros[,1])) {
    ones <- ones[1:length(zeros[,1]),]
  } else {
    zeros <- zeros[1:length(ones[,1]),]
  }
  # columns are c(ones_outcome, ones_fitted, zeros_outcome, zeros_fitted)
  ones_and_zeros <- data.frame(ones, zeros)
  # initiate columns to store concordant, discordant and tied pair evaluations
  conc <- rep(NA, length(ones_and_zeros[,1]))
  disc <- rep(NA, length(ones_and_zeros[,1]))
  ties <- rep(NA, length(ones_and_zeros[,1]))
  for (i in 1:length(ones_and_zeros[,1])) {
    if (ones_and_zeros[i,2] > ones_and_zeros[i,4]) {
      # concordant pair
      conc[i] <- 1; disc[i] <- 0; ties[i] <- 0
    } else if (ones_and_zeros[i,2] == ones_and_zeros[i,4]) {
      # tied pair
      conc[i] <- 0; disc[i] <- 0; ties[i] <- 1
    } else {
      # discordant pair
      conc[i] <- 0; disc[i] <- 1; ties[i] <- 0
    }
  }
  # save the various rates
  conc_rate <- mean(conc, na.rm = TRUE)
  disc_rate <- mean(disc, na.rm = TRUE)
  tie_rate  <- mean(ties, na.rm = TRUE)
  return(list(concordance = conc_rate, num_concordant = sum(conc),
              discordance = disc_rate, num_discordant = sum(disc),
              tie_rate = tie_rate, num_tied = sum(ties)))
}

Concordance_test <- Concordance(outcome_and_fitted_col)
Concordance_test
$concordance
[1] 0.8571429
$num_concordant
[1] 300
$discordance
[1] 0.1428571
$num_discordant
[1] 50
$tie_rate
[1] 0
$num_tied
[1] 0

Model Fit Criteria
9. Gini Coefficient: The Gini coefficient is used to test model accuracy. It is calculated using the following
formula. For a good model the Gini coefficient should be in the range of 40-60%.
Gini = 2C - 1, where C = area under the curve (i.e. Concordance + ½ Ties)
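Applied to the concordance output reported earlier (concordance = 0.857, no ties), the formula gives:

```r
## C = Concordance + half the tie rate, from the Concordance_test output
C <- 0.8571429 + 0.5 * 0
gini <- 2 * C - 1
round(gini, 3)   # roughly 0.714, i.e. 71.4%
```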

10. Scoring: Satisfaction with the model comes when it performs well in terms of rank ordering, coefficient stability,
goodness of fit, concordance and capture rate on both the development and validation samples.
Now, take the coefficients of the variables obtained from the model run on the development sample and use them to
predict the response rate of the validation sample. This method is known as scoring the model. Scoring gives a good
idea of how the model will perform when applied to another dataset. Here, we are concerned with capturing the
responders, say in the first 40% of the population.
The model is also used to predict the response rate for a new dataset taken from a different time frame, to test the
validity of the rules suggested by the model. The model is applicable to profiles similar to the ones already
present in the sample data used for model development. Model validation is performed by taking the optimum threshold
level of probability.

Lift/gains chart for model case:

#### Preparing Gains/Lift chart


library(ROCR)

Model Fit Criteria (Cont'd…)
gain.chart <- function(y_hat,y) {
plot(performance(prediction(y_hat,y), "tpr", "rpp"),lwd = 7, main = "Lift Chart")
lines(ecdf((rank(-y_hat)[y == T]) / length(y)),verticals = T, do.points = F, col = "red", lwd = 3)
}
gain.chart(R_hat,training_data$Responder)

The model is implemented and refreshed periodically to generate scores for the customers.

Write the Scoring or implementation strategy
Step: 14. Write the Scoring or implementation strategy

The model can be used to score the existing customer base for up-sell propensity.

Implementation requires extracting the data for the significant variables (highlighted), grouping them as per the
model equation, and generating the scores. The scores are then sorted in descending order, and the top few deciles are
used for up-sell marketing.
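A sketch of that scoring flow: predict on the customer base, sort the scores in descending order and cut into deciles. The data and model below are synthetic stand-ins; with the real model, fit and the prepared customer data take their place:

```r
set.seed(1)
## Synthetic customer base and model, standing in for the real fit
newdata <- data.frame(x = rnorm(1000), Responder = rbinom(1000, 1, 0.1))
fit_demo <- glm(Responder ~ x, family = binomial("logit"), data = newdata)
## generate a propensity score for every customer
newdata$score <- predict(fit_demo, newdata = newdata, type = "response")
## rank customers, highest score first, and assign 10 equal deciles
newdata <- newdata[order(-newdata$score), ]
newdata$decile <- ceiling(10 * seq_len(nrow(newdata)) / nrow(newdata))
## the top deciles form the up-sell target list, e.g. the first 40%
target <- newdata[newdata$decile <= 4, ]
```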

Upsell Model Analysis Demonstration in SAS Language

WPS Code for Upsell model
Check the output in the work folder and the libpath after running the code below.
Following code performs the initial data extraction and verification

WPS Code for Upsell model contd…
Check the output tables in the work folder for analysis after running the code below.
Following code performs the EDA and drops and fixes some variables.

WPS Code for Upsell model contd…
Test for multicollinearity using the Variance Inflation Factor (VIF) and the correlation matrix.
The code below generates the correlation matrix and the VIF values. Look for high correlation values and VIF > 2
as conditions to drop a variable.

WPS Code for Upsell model contd…
Bi-variate analysis and the Information Value (IV) calculation.
The code below is partially automated and will generate multiple datasets for analysis.
The bi_var_ana2 dataset contains the event and non-event rates and
can be used to perform binning of the variables.
The bi_var_ana3 dataset contains the cumulative Information Value (IV), which can
be used to drop variables with an IV below 0.1.

WPS Code for Upsell model contd…
BiVariate analysis code contd…

WPS Code for Upsell model contd…
BiVariate analysis code contd…

WPS Code for Upsell model contd…
Drop the variables with low IV values and create training and validation datasets
using the surveyselect procedure.

WPS Code for Upsell model contd…
Run logistic regression model code to get the predicted values and the significant variables.
Check the output score, logistic and avg_score for details of the model.

WPS Output for Upsell model
Observe the significant variables with coefficients as below

WPS Output for Upsell model

The Gini coefficient can be calculated from the c value generated below.

This output comes along with proc logistic.
Gini coefficient = 2C - 1 = 2*0.94 - 1 = 0.88, i.e. 88% (quite high)

WPS Output for Upsell model
Observe the predicted scores for each customer as below

WPS Output for Upsell model
See the average predicted scores for each group, responders and non-responders, as below.
Generated using the following code:
proc sql;
create table avg_score as
select Responder as variable,avg(pred_score) as Avg_score
from datapath.score group by 1;
quit;

Thank You !

help@edupristine.com
www.edupristine.com
