PRATIK ZANKE
Contents
1. Project Objective ................................................................................................................................... 3
2. Assumptions .......................................................................................................................................... 3
4. Clustering ............................................................................................................................................ 14
1. Project Objective
The objective of this report is to build the model that best classifies the customers who have a higher
probability of purchasing the loan, using the data set “Thera Bank_Personal_Loan_Modelling-
dataset-1.xlsx” in R, to generate insights from that model, and to compare the performance of the
candidate models in order to find the best one.
This exploration report consists of the following:
➢ Importing the dataset in R
➢ Understanding the structure of dataset
➢ Graphical exploration
➢ Descriptive statistics
➢ Clustering
➢ CART and Random Forest Model
➢ Insights from the dataset
Variable            Description
ID                  Customer ID
Age                 Customer's age in years
Experience          Years of professional experience
Income              Annual income of the customer ($000)
ZIP Code            Home address ZIP code
Family              Family size of the customer
CCAvg               Avg. spending on credit cards per month ($000)
Education           Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
Mortgage            Value of house mortgage, if any ($000)
Personal Loan       Did this customer accept the personal loan offered in the last campaign?
Securities Account  Does the customer have a securities account with the bank?
CD Account          Does the customer have a certificate of deposit (CD) account with the bank?
Online              Does the customer use internet banking facilities?
CreditCard          Does the customer use a credit card issued by the bank?
3. Exploratory Data Analysis – Step by step approach
The steps followed to analyze the case study are described below.
Install necessary Packages and Invoke Libraries
The R packages used to analyze the data are listed below:
➢ readxl to read the xlsx data file
➢ dplyr for data manipulation
➢ corrplot for correlation plots
➢ lattice for data visualization
➢ fpc to plot the clusters
➢ rpart for the CART model
➢ rpart.plot to plot the CART model
➢ caret for the confusion matrix
➢ rattle to plot the CART model
➢ RColorBrewer for color palettes
➢ ROCR to calculate AUC and KS
➢ ineq to calculate the Gini coefficient
➢ NbClust to get the optimal number of clusters
➢ cluster to plot clusters
➢ data.table for the rank chart
➢ factoextra to plot k-means results
➢ caTools to split the data
➢ randomForest for the random forest model
The given dataset is in “.xlsx” format, so we import it into R using the “read_excel” command.
The data in “Thera Bank_Personal_Loan_Modelling-dataset-1.xlsx” is stored in a variable called
“loandata”.
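The import step can be sketched as follows (a minimal sketch; the file is assumed to sit in the working directory):

```r
# Load readxl and import the first sheet of the Excel workbook
library(readxl)
loandata <- read_excel("Thera Bank_Personal_Loan_Modelling-dataset-1.xlsx")

# Confirm the dimensions reported below (5000 rows, 14 columns)
dim(loandata)
```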
> dim(loandata)
[1] 5000 14
Variable Identification
➢ dim: to check the dimensions (number of rows/columns) of a data frame
➢ str: to display the internal structure of an R object
➢ head: to show the first n rows of a data frame or matrix (default is 6)
➢ summary: gives, for each numeric variable, the minimum, first quartile, median, mean, third
quartile, and maximum
➢ colnames: retrieves or sets the column names of a matrix
➢ names: to update the column names to a user-friendly format
➢ as.factor: to convert a variable to a factor
➢ as.data.frame: to convert an object to a data frame
➢ histogram: to plot histograms of the variables
➢ boxplot: to draw box plots, which show the five-number summary (minimum, quartiles, median, maximum)
➢ barplot: to draw bar plots
➢ is.na: to check whether there are any missing values
➢ sapply: to apply a function (here, is.na) to each column of the data frame
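The renaming and type-conversion steps listed above can be sketched as follows (the shortened column names are illustrative assumptions, not the names used later in the console output):

```r
# Rename a few columns to a user-friendly format (illustrative names)
names(loandata)[names(loandata) == "Age (in years)"] <- "Age"
names(loandata)[names(loandata) == "Income (in K/month)"] <- "Income"

# Convert categorical variables to factors
loandata$Education <- as.factor(loandata$Education)
loandata$`Personal Loan` <- as.factor(loandata$`Personal Loan`)

# Count missing values per column
sapply(loandata, function(x) sum(is.na(x)))
```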
STR
There are 14 variables.
> str(loandata)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 5000 obs. of 14 variables:
$ ID : num 1 2 3 4 5 6 7 8 9 10 ...
$ Age (in years) : num 25 45 39 35 35 37 53 50 35 34 ...
$ Experience (in years): num 1 19 15 9 8 13 27 24 10 9 ...
$ Income (in K/month) : num 49 34 11 100 45 29 72 22 81 180 ...
$ ZIP Code : num 91107 90089 94720 94112 91330 ...
$ Family members : num 4 3 1 1 4 4 2 1 3 1 ...
$ CCAvg : num 1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
$ Education : num 1 1 1 2 2 2 2 3 2 3 ...
$ Mortgage : num 0 0 0 0 0 155 0 0 104 0 ...
$ Personal Loan : num 0 0 0 0 0 0 0 0 0 1 ...
$ Securities Account : num 1 1 0 0 0 0 0 0 0 0 ...
$ CD Account : num 0 0 0 0 0 0 0 0 0 0 ...
$ Online : num 0 0 0 0 0 1 1 0 1 0 ...
$ CreditCard : num 0 0 0 0 1 0 0 1 0 0 ...
HEAD
> head(loandata)
# A tibble: 6 x 14
ID `Age (in years)` `Experience (in~ `Income (in K/m~ `ZIP Code`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 25 1 49 91107
2 2 45 19 34 90089
3 3 39 15 11 94720
4 4 35 9 100 94112
5 5 35 8 45 91330
6 6 37 13 29 92121
# ... with 9 more variables: `Family members` <dbl>, CCAvg <dbl>, Education <
dbl>,
# Mortgage <dbl>, `Personal Loan` <dbl>, `Securities Account` <dbl>, `CD
# Account` <dbl>, Online <dbl>, CreditCard <dbl>
SUMMARY
> summary(loandata)
ID Age (in years) Experience (in years) Income (in K/month)
Min. : 1 Min. :23.00 Min. :-3.0 Min. : 8.00
1st Qu.:1251 1st Qu.:35.00 1st Qu.:10.0 1st Qu.: 39.00
Median :2500 Median :45.00 Median :20.0 Median : 64.00
Mean :2500 Mean :45.34 Mean :20.1 Mean : 73.77
3rd Qu.:3750 3rd Qu.:55.00 3rd Qu.:30.0 3rd Qu.: 98.00
Max. :5000 Max. :67.00 Max. :43.0 Max. :224.00
CreditCard
Min. :0.000
1st Qu.:0.000
Median :0.000
Mean :0.294
3rd Qu.:1.000
Max. :1.000
Using RStudio, the missing values in the “Family members” column are filled with the mean of the
non-missing values in that column:
> loandata$`Family members`[is.na(loandata$`Family members`)]=mean(loandata$`
Family members`,na.rm=TRUE)
> sum(is.na(loandata))
[1] 0
Univariate Analysis
We analyze all 14 variables in the loandata data set. The ID variable is a unique identifier for each row.
➢ Age and Experience are normally distributed, with mean and median almost the same.
➢ Income is positively skewed; the majority of customers have an income between 45K and 55K.
➢ CCAvg is positively skewed.
➢ Most customers have a mortgage of less than 50K.
➢ The box plots show outliers in a few variables, such as Income, CCAvg, and Mortgage.
➢ The scatter plots show a random distribution for Age, Experience, Income, and CCAvg, while family
size is evenly distributed.
Bi-Variate Analysis
Family size does not have any effect on the loan.
Customers who do not have a CD account generally do not have a loan either, but almost all customers
who have a CD account also have a loan.
Outlier Identification
Although outlier values are seen in the “Income” and “Mortgage” columns, CART analysis is not much
affected by them.
Correlation
From the correlation plot above, we observe a strong correlation between “Age” and “Experience”, a
moderate correlation between “Income” and “CCAvg”, and partial correlations among a few other variables.
Since the correlation between “Age” and “Experience” is 0.99, they can be treated as equivalent
variables, and one of them can be eliminated. We remove the “Experience” variable.
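The correlation check and the removal of Experience can be sketched as follows (a sketch assuming loandata holds the imported data and the original long column name for Experience):

```r
library(corrplot)

# Correlation matrix on the numeric columns only
num.vars <- loandata[sapply(loandata, is.numeric)]
corrplot(cor(num.vars), method = "number", type = "upper")

# Drop Experience, which is almost perfectly correlated with Age (r = 0.99)
loandata$`Experience (in years)` <- NULL
```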
4. Clustering Analysis
First we perform a clustering analysis and check which clustering method (hierarchical or k-means) is best for
the given data set. Clustering groups a set of objects in such a way that objects in the same group are more
similar (in some sense) to each other than to those in other groups (clusters).
1. Hierarchical Clustering
Hierarchical clustering is performed using the Chebyshev and Euclidean distance methods, and the results
are shown as a top-to-bottom dendrogram by executing the following R code:
# Distance matrices: Chebyshev ("maximum") and Euclidean
d.chebyshev = dist(x = loans, method = "maximum")
d.euc = dist(x = loans, method = "euclidean")
# Agglomerative clustering with complete linkage
res.hclust.euc = hclust(d.euc, method = "complete")
res.hclust.ch = hclust(d.chebyshev, method = "complete")
# Sort and plot the merge heights to look for an "elbow"
cluster.height = res.hclust.euc$height
cluster.height = sort(cluster.height, decreasing = TRUE)
plot(cluster.height, pch = 20, col = "red", main = "cluster height", ylab = "cluster height")
lines(cluster.height, lty = 2, lwd = 2, col = "blue")
par(mfrow = c(2, 1))
We can clearly see that after a cluster height of about 100, the vertical distances between successive merges
are small, so there is a possibility of 4-5 clusters covering the maximum vertical distance. This can be seen in
the cluster plots below, drawn on both of the models defined above by executing the following R code:
# Draw cluster borders on the previously plotted dendrograms
rect.hclust(res.hclust.euc, k = 3, border = "red")
rect.hclust(res.hclust.ch, k = 3, border = "red")
Output:
From the graph above, the hierarchical clustering result is difficult to interpret because of overlapping
labels, so we proceed with k-means clustering, which can handle larger data sets and also gives us the
liberty to plot clusters in multiple dimensions.
str(loans)
# Scale the variables so that no single variable dominates the distance metric
loans.scaled = scale(loans)
loans.scaled
seed = 1000
set.seed(seed)
clust1=kmeans(x=loans.scaled,centers = 2,nstart=5)
clusplot(loans.scaled,clust1$cluster,color=TRUE,shade=TRUE,labels = 2,lines = 1)
Cluster plot with k = 2
# Total within-cluster sum of squares for k = 1..5 (elbow method)
twss = rep(0, 5)
for (k in 1:5) {
  set.seed(seed)
  clust2 = kmeans(x = loans.scaled, centers = k, nstart = 5)
  twss[k] = clust2$tot.withinss
}
print(twss)
plot(c(1:5), twss, type = "b")
set.seed(seed)
nc = NbClust(loans, min.nc = 2, max.nc = 5, method = "kmeans")
table(nc$Best.n[1,])
set.seed(seed)
clust3=kmeans(x=loans.scaled,centers=4,nstart=5)
By executing the R code above, we find (result below) that the data is classified into 4 clusters by majority
rule.
*** : The Hubert index is a graphical method of determining the number of clusters.
      In the plot of the Hubert index, we seek a significant knee that corresponds to a
      significant increase of the value of the measure, i.e. the significant peak in the
      Hubert index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
      In the plot of the D index, we seek a significant knee (the significant peak in the
      D index second differences plot) that corresponds to a significant increase of the
      value of the measure.
*******************************************************************
* Among all indices:
* 8 proposed 2 as the best number of clusters
* 2 proposed 3 as the best number of clusters
* 12 proposed 4 as the best number of clusters
* 2 proposed 5 as the best number of clusters
*******************************************************************
set.seed(111)
str(loans)
prop.table(table(loans$`Personal Loan`))
# Stratified 70/30 split on the target variable
sample = sample.split(loans$`Personal Loan`, SplitRatio = 0.7)
CARTtrain = subset(loans, sample == TRUE)
CARTtest = subset(loans, sample == FALSE)
table(CARTtrain$`Personal Loan`)
Output Analysis
The proportions of responders and non-responders in the full data set are 9.6% and 90.4% respectively.
The train data contains 3500 observations, of which 9.6% are responders and 90.4% are non-responders.
The test data contains 1500 observations, of which 9.7% are responders and 90.3% are non-responders.
The data is distributed between the training and test sets in almost the same proportions as before the
split.
Now that we have successfully partitioned the data, we can proceed with building the CART and random
forest models.
r.ctrl = rpart.control(minsplit = 100, minbucket = 10, cp = 0, xval = 10)
cart.model = rpart(formula = `Personal Loan` ~ ., data = CARTtrain, method = "class", control = r.ctrl)
cart.model
cart.model$variable.importance
Calculating Variable Importance (VI): Check the variable importance scores. CART measures the
improvement attributable to each variable in its role as either a primary or a surrogate splitter. These
improvements are summed over each node, totaled, and then scaled relative to the best performing
variable. Execute the code below:
cart.model$variable.importance
View(cart.model$variable.importance)
round(cart.model$variable.importance, 4)
Output analysis (VI): Income, Education, Family members, CD Account, and CCAvg contribute the
most to the classification of the target variable, while Mortgage makes a very minimal contribution to
the splitting decisions.
Variable         Importance
Education        229.48767
Income           166.9566
Family.Member    144.87414
CCAvg             86.62543
CD.Account        59.36536
Mortgage          20.73195
Calculating the Complexity Parameter (CP): Check the complexity parameter, which controls the
size of the decision tree and is used to select the optimal tree size. If the cost of adding another split to
the decision tree from the current node is above the value of CP, tree building does not continue.
Execute the code below:
cart.model$cptable
print(cart.model)
cptable.frame=as.data.frame(cart.model$cptable)
cptable.frame$cp.deci=round(cptable.frame$CP,4)
cptable.frame
plotcp(cart.model,main="Size of Tree")
From the plot above, the cross-validation error is lowest at the 4th split, and the corresponding CP is 0.
Pruning is done by randomly selecting a test sample and computing the error by running it down the large
tree and its subtrees. The tree with the smallest cross-validation error is the final tree, and we use its CP.
Since the selected CP is 0, no further pruning is required.
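Had pruning been necessary, the standard approach is to select the CP with the lowest cross-validation error from the CP table; a sketch:

```r
# Pick the CP value whose cross-validated error (xerror) is minimal
best.cp = cart.model$cptable[which.min(cart.model$cptable[, "xerror"]), "CP"]

# Prune the tree back to that complexity
pruned.model = prune(cart.model, cp = best.cp)
```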
predCTrain = predict(cart.model, CARTtrain[, -7], type = "class")
tab2 = table(CARTtrain$`Personal Loan`, predCTrain)
sum(diag(tab2)) / sum(tab2)
ROC Output Analysis: -
The CART decision tree model gives 97.94% prediction accuracy on the train data set and 98.69%
prediction accuracy on the test data set.
Since the model's accuracy is almost equal on the train and test data sets, it generalizes well and can be
used by Thera Bank management for decision making and for predicting which customers are likely to
take a personal loan.
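The AUC, KS, and Gini measures referenced in this analysis can be computed with the ROCR and ineq packages; a sketch, assuming cart.model and CARTtest from above and a 0/1 target:

```r
library(ROCR)
library(ineq)

# Predicted probability of the positive class ("1")
pred.prob = predict(cart.model, CARTtest, type = "prob")[, "1"]
pred.obj  = prediction(pred.prob, CARTtest$`Personal Loan`)

# Area under the ROC curve
auc = performance(pred.obj, "auc")@y.values[[1]]

# KS statistic: maximum separation between TPR and FPR
perf = performance(pred.obj, "tpr", "fpr")
ks = max(perf@y.values[[1]] - perf@x.values[[1]])

# Gini coefficient of the predicted probabilities
gini = ineq(pred.prob, type = "Gini")
```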
The OOB (out-of-bag) estimate of the error rate is used for tuning the random forest: we choose the
number of variables tried at each split (mtry) that gives the lowest OOB error.
tuneRF
We use the tuneRF function to find the mtry value and build the tuned random forest. As per the result
below, mtry = 9 gives the minimum out-of-bag error.
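The tuning step can be sketched with randomForest::tuneRF; a sketch assuming CARTtrain from above with `Personal Loan` as the target (the ntreeTry and stepFactor values are illustrative assumptions):

```r
library(randomForest)
set.seed(seed)

# Search over mtry values, keeping the one with the lowest OOB error;
# doBest = TRUE returns a forest fitted with that mtry
tuned = tuneRF(x = CARTtrain[, setdiff(names(CARTtrain), "Personal Loan")],
               y = as.factor(CARTtrain$`Personal Loan`),
               ntreeTry = 501, stepFactor = 1.5, improve = 0.01,
               trace = TRUE, plot = TRUE, doBest = TRUE)
```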
Important Variable
Based on the Mean Decrease Gini output, the top 4 variables for predicting whether a customer will
buy the loan are Education, Income, Family size, and Credit card average.
Since Thera Bank has encouraged the retail marketing department to devise campaigns with better
targeting to increase the success ratio on a minimal budget, it should target potential customers
based on Education, Income, Family size, and CC Average.
Performance Analysis
Train Dataset
Test dataset
Based on the confusion matrices for the training and test data sets, the accuracy is nearly the same in
both, and hence the model is stable.
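These confusion matrices can be produced with caret; a sketch, where rf.model is assumed to be the tuned random forest and CARTtest the hold-out set from above:

```r
library(caret)

# Predicted classes on the hold-out set
rf.pred = predict(rf.model, CARTtest)

# Accuracy, sensitivity, specificity, etc., with "1" as the positive class
confusionMatrix(rf.pred, as.factor(CARTtest$`Personal Loan`), positive = "1")
```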
Conclusion:
We have built CART and random forest models to classify the customers who have a higher probability
of purchasing the loan.
Based on the Mean Decrease Gini output, the top 4 variables for predicting whether a customer will buy
the loan are Education, Income, Family size, and Credit card average.
Performance Comparison Table –
After creating prediction models for Thera Bank customers with CART and random forest and
validating them through various model validation tests, I conclude that the RANDOM FOREST
model performed better on all validation results in both the train and test data sets and hence can
be taken as the appropriate model for predicting customer loan uptake.