
Thera Bank Loan Prediction Model

____________________________________________________
PRATIK ZANKE

Contents
1. Project Objective

2. Steps and Approach

3. Exploratory Data Analysis – Step by step approach

4. Clustering Analysis

5. Decision Trees using the CART Method

6. Random Forest Method

1. Project Objective
The objective of this report is to build a model that identifies the customers with a higher
probability of purchasing a personal loan, using the data set “Thera Bank_Personal_Loan_Modelling-
dataset-1.xlsx” in R, and to generate insights from the model.
The report also reflects on the performance of the candidate models and selects the best one.
This exploration report consists of the following:
➢ Importing the dataset into R
➢ Understanding the structure of the dataset
➢ Graphical exploration
➢ Descriptive statistics
➢ Clustering
➢ CART and Random Forest models
➢ Insights from the dataset

2. Steps and Approach


We follow a step-by-step approach to arrive at the conclusion:
➢ Exploratory data analysis
➢ Analysis of independent and dependent variables
➢ Creation of clusters using an appropriate methodology
➢ Building models using CART and Random Forest
➢ Checking model performance using all relevant measures
➢ Checking model performance on train and test data
➢ Identification of the best model
➢ Inferences and conclusions
Data Description:

ID                  Customer ID
Age                 Customer's age in years
Experience          Years of professional experience
Income              Annual income of the customer ($000)
ZIP Code            Home address ZIP code
Family              Family size of the customer
CCAvg               Average spending on credit cards per month ($000)
Education           Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
Mortgage            Value of house mortgage, if any ($000)
Personal Loan       Did this customer accept the personal loan offered in the last campaign?
Securities Account  Does the customer have a securities account with the bank?
CD Account          Does the customer have a certificate of deposit (CD) account with the bank?
Online              Does the customer use internet banking facilities?
CreditCard          Does the customer use a credit card issued by the bank?

3. Exploratory Data Analysis – Step by step approach
The various steps followed to analyze the case study are explained below.
Install necessary Packages and Invoke Libraries
The lists of R packages used to analyze the data are listed below:
➢ readxl to read the xlsx data file
➢ dplyr for data manipulation
➢ corrplot for the correlation plot
➢ lattice for data visualization
➢ fpc to plot the clusters
➢ rpart for the CART model
➢ rpart.plot to plot the CART model
➢ caret for confusionMatrix
➢ rattle to plot the CART model
➢ RColorBrewer for color palettes
➢ ROCR to calculate AUC and KS
➢ ineq to calculate the Gini coefficient
➢ NbClust to find the optimal number of clusters
➢ cluster to plot the clusters
➢ data.table for the rank chart
➢ factoextra to plot k-means results
➢ caTools to split the data
➢ randomForest for the random forest model

Set up working Directory


Setting up the working directory helps to keep all files related to the project in one place on the
system.
> setwd("F:/project")
> getwd()
[1] "F:/project"

The given dataset is in “.xlsx” format, so we use the “read_excel” command to import it into R.
The data in the file “Thera Bank_Personal_Loan_Modelling-dataset-1.xlsx” is stored in a variable called
“loandata”.
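A minimal sketch of the import step is shown below; a sheet argument may be needed if the data is not
on the first sheet of the workbook.

library(readxl)

# Read the Excel workbook into a data frame (tibble)
loandata = read_excel("Thera Bank_Personal_Loan_Modelling-dataset-1.xlsx")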
> dim(loandata)
[1] 5000 14

Variable Identification
➢ dim: to check the dimensions (number of rows/columns) of a data frame
➢ str: to display the internal structure of an R object
➢ head: to show the first n rows of a data frame or matrix (default is 6)
➢ summary: gives the five-number summary, i.e. the minimum, first quartile, median, third quartile
and maximum of each variable (along with the mean)
➢ colnames: retrieves or sets the column names of a matrix or data frame
➢ names: to update the column names to a user-understandable format
➢ as.factor: to convert a variable to a factor
➢ as.data.frame: to convert an object to a data frame
➢ histogram: to plot histograms of the variables
➢ boxplot: to draw box plots, which show the five-number summary and outliers
➢ barplot: to draw bar plots
➢ is.na: to check whether there are any missing values
➢ sapply: to apply a function (here is.na) to each column of the data frame

STR
There are 14 variables.
> str(loandata)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 5000 obs. of 14 variables:
$ ID : num 1 2 3 4 5 6 7 8 9 10 ...
$ Age (in years) : num 25 45 39 35 35 37 53 50 35 34 ...
$ Experience (in years): num 1 19 15 9 8 13 27 24 10 9 ...
$ Income (in K/month) : num 49 34 11 100 45 29 72 22 81 180 ...
$ ZIP Code : num 91107 90089 94720 94112 91330 ...
$ Family members : num 4 3 1 1 4 4 2 1 3 1 ...
$ CCAvg : num 1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
$ Education : num 1 1 1 2 2 2 2 3 2 3 ...
$ Mortgage : num 0 0 0 0 0 155 0 0 104 0 ...
$ Personal Loan : num 0 0 0 0 0 0 0 0 0 1 ...
$ Securities Account : num 1 1 0 0 0 0 0 0 0 0 ...
$ CD Account : num 0 0 0 0 0 0 0 0 0 0 ...
$ Online : num 0 0 0 0 0 1 1 0 1 0 ...
$ CreditCard : num 0 0 0 0 1 0 0 1 0 0 ...

HEAD
> head(loandata)
# A tibble: 6 x 14
ID `Age (in years)` `Experience (in~ `Income (in K/m~ `ZIP Code`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 25 1 49 91107
2 2 45 19 34 90089
3 3 39 15 11 94720
4 4 35 9 100 94112
5 5 35 8 45 91330
6 6 37 13 29 92121
# ... with 9 more variables: `Family members` <dbl>, CCAvg <dbl>, Education <dbl>,
#   Mortgage <dbl>, `Personal Loan` <dbl>, `Securities Account` <dbl>,
#   `CD Account` <dbl>, Online <dbl>, CreditCard <dbl>

SUMMARY
> summary(loandata)
       ID       Age (in years)  Experience (in years) Income (in K/month)
 Min.   :   1   Min.   :23.00   Min.   :-3.0          Min.   :  8.00
 1st Qu.:1251   1st Qu.:35.00   1st Qu.:10.0          1st Qu.: 39.00
 Median :2500   Median :45.00   Median :20.0          Median : 64.00
 Mean   :2500   Mean   :45.34   Mean   :20.1          Mean   : 73.77
 3rd Qu.:3750   3rd Qu.:55.00   3rd Qu.:30.0          3rd Qu.: 98.00
 Max.   :5000   Max.   :67.00   Max.   :43.0          Max.   :224.00

    ZIP Code     Family members      CCAvg          Education        Mortgage
 Min.   : 9307   Min.   :1.000   Min.   : 0.000   Min.   :1.000   Min.   :  0.0
 1st Qu.:91911   1st Qu.:1.000   1st Qu.: 0.700   1st Qu.:1.000   1st Qu.:  0.0
 Median :93437   Median :2.000   Median : 1.500   Median :2.000   Median :  0.0
 Mean   :93153   Mean   :2.397   Mean   : 1.938   Mean   :1.881   Mean   : 56.5
 3rd Qu.:94608   3rd Qu.:3.000   3rd Qu.: 2.500   3rd Qu.:3.000   3rd Qu.:101.0
 Max.   :96651   Max.   :4.000   Max.   :10.000   Max.   :3.000   Max.   :635.0
                 NA's   :18

 Personal Loan   Securities Account   CD Account        Online
 Min.   :0.000   Min.   :0.0000     Min.   :0.0000   Min.   :0.0000
 1st Qu.:0.000   1st Qu.:0.0000     1st Qu.:0.0000   1st Qu.:0.0000
 Median :0.000   Median :0.0000     Median :0.0000   Median :1.0000
 Mean   :0.096   Mean   :0.1044     Mean   :0.0604   Mean   :0.5968
 3rd Qu.:0.000   3rd Qu.:0.0000     3rd Qu.:0.0000   3rd Qu.:1.0000
 Max.   :1.000   Max.   :1.0000     Max.   :1.0000   Max.   :1.0000

   CreditCard
 Min.   :0.000
 1st Qu.:0.000
 Median :0.000
 Mean   :0.294
 3rd Qu.:1.000
 Max.   :1.000

Missing Value Identification


We use the is.na function to check for missing values. There are 18 missing values, all in the Family
members variable.
> sum(is.na(loandata))
[1] 18

The missing values are imputed with the mean of the "Family members" column:
> loandata$`Family members`[is.na(loandata$`Family members`)] = mean(loandata$`Family members`, na.rm = TRUE)
> sum(is.na(loandata))
[1] 0

Univariate Analysis
We analyze all 14 variables of the loandata data set. The ID variable is a unique identifier for each row.
The plots referred to below can be reproduced with the sketch that follows this list.
➢ Age and Experience are approximately normally distributed, with mean and median almost equal.
➢ Income is positively skewed; the majority of customers have an income between 45K and 55K.
➢ CCAvg is positively skewed.
➢ Most customers have a mortgage of less than 50K.
➢ The box plots show outliers in a few variables, such as Income, CCAvg and Mortgage.
➢ The scatter plots show a fairly random spread of Age, Experience, Income and CCAvg, while Family
size is evenly distributed.
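A minimal sketch of the univariate plots, assuming loandata has been loaded and the missing Family
members values imputed as above (the column names follow the original headers):

# Histograms of the main continuous variables
par(mfrow = c(2, 2))
hist(loandata$`Age (in years)`, main = "Age", xlab = "Age (years)", col = "lightblue")
hist(loandata$`Income (in K/month)`, main = "Income", xlab = "Income ($000)", col = "lightblue")
hist(loandata$CCAvg, main = "CCAvg", xlab = "Avg. credit card spend ($000)", col = "lightblue")
hist(loandata$Mortgage, main = "Mortgage", xlab = "Mortgage ($000)", col = "lightblue")

# Box plots to highlight outliers
par(mfrow = c(1, 3))
boxplot(loandata$`Income (in K/month)`, main = "Income")
boxplot(loandata$CCAvg, main = "CCAvg")
boxplot(loandata$Mortgage, main = "Mortgage")
par(mfrow = c(1, 1))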

Bi-Variate Analysis

Customers with Education level 1 have higher incomes on average.

Age and Experience are strongly positively related.

Family size does not appear to have much effect on loan acceptance.

Customers without a CD account mostly do not have a personal loan, whereas a large share of customers
who hold a CD account also took the loan. (A sketch of these comparisons is given below.)
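A minimal sketch of the bivariate comparisons described above, assuming the original column names:

# Income by education level
boxplot(`Income (in K/month)` ~ Education, data = loandata,
        xlab = "Education level", ylab = "Income ($000)", main = "Income by Education")

# Age vs. Experience
plot(loandata$`Age (in years)`, loandata$`Experience (in years)`,
     xlab = "Age (years)", ylab = "Experience (years)")

# Personal Loan vs. Family size and vs. CD Account
table(loandata$`Family members`, loandata$`Personal Loan`)
table(loandata$`CD Account`, loandata$`Personal Loan`)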

Outlier Identification
Outliers are visible in the "Income" and "Mortgage" columns. However, CART-based models are largely
insensitive to outliers, so no outlier treatment is applied.

Correlation

The correlation plot shows a strong correlation between "Age" and "Experience", a moderate correlation
between "Income" and "CCAvg", and only weak (partial) correlations among the remaining variables.

Since the correlation between "Age" and "Experience" is about 0.99, the two variables carry essentially
the same information and one of them can be eliminated. We remove the "Experience" variable.
(A sketch of the correlation plot is given below.)
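A minimal sketch of the correlation analysis, assuming the numeric columns of loandata are used (the
exact column selection may have differed in the original analysis):

library(corrplot)

# Correlation matrix over the numeric predictors (excluding ID and ZIP Code)
num.vars = loandata[, c("Age (in years)", "Experience (in years)",
                        "Income (in K/month)", "Family members", "CCAvg",
                        "Education", "Mortgage")]
corrplot(cor(num.vars), method = "number", type = "upper")

# Drop Experience, which is almost perfectly correlated with Age
loandata$`Experience (in years)` = NULL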

4. Clustering Analysis
We start with a clustering analysis and check which clustering method (hierarchical or k-means) works best
for the given data set. Clustering groups a set of objects in such a way that objects in the same group
(cluster) are more similar to each other than to those in other groups.

1. Hierarchical Clustering

Hierarchical clustering is performed using the Chebyshev and Euclidean distance measures; the results are
then shown as dendrograms, read from top to bottom, by executing the following R code:

d.chebyshev = dist(x = loans, method = "maximum")
d.euc = dist(x = loans, method = "euclidean")

res.hclust.euc = hclust(d.euc, method = "complete")
res.hclust.ch = hclust(d.chebyshev, method = "complete")
cluster.height = res.hclust.euc$height
cluster.height = sort(cluster.height, decreasing = TRUE)
plot(cluster.height, pch = 20, col = "red", main = "cluster height", ylab = "cluster height")
lines(cluster.height, lty = 2, lwd = 2, col = "blue")
par(mfrow = c(2, 1))

Cluster Height vs. Index Plot

After a cluster height of about 100, the vertical distances between successive merges are small, so 4–5
clusters would cover most of the vertical distance. This can be seen in the cluster plots below, which are
drawn for both of the models defined above by executing the following R code:

plot(res.hclust.euc, labels = as.character(loans[, 2]), main = "H clust Using Euclidean Method",
     xlab = "Euclidean distance", ylab = "Height")
rect.hclust(res.hclust.euc, k = 3, border = "red")

plot(res.hclust.ch, labels = as.character(loans[, 2]), main = "H clust Using Chebychev Method",
     xlab = "Chebychev Distance", ylab = "Height")
rect.hclust(res.hclust.ch, k = 3, border = "red")

Output:

H clust using the Euclidean method, divided into 3 clusters

H clust using the Chebyshev method, divided into 3 clusters

From the dendrograms above, we can see that the hierarchical clusters are difficult to interpret because
the labels overlap, so we proceed with k-means clustering, which handles larger data sets better.

K-means also allows the clusters to be plotted in multiple dimensions. We now proceed with k-means
clustering.
str(loans)

loans.scaled = scale(loans)
loans.scaled

seed = 1000
set.seed(seed)

clust1 = kmeans(x = loans.scaled, centers = 2, nstart = 5)
clusplot(loans.scaled, clust1$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 1)

2-Cluster Plot

Using the total within sum of squares (elbow method) to determine the right number of clusters:

twss = rep(0, 5)
for (k in 1:5) {
  set.seed(seed)
  clust2 = kmeans(x = loans.scaled, centers = k, nstart = 5)
  twss[k] = clust2$tot.withinss
}
print(twss)
plot(c(1:5), twss, type = "b")

set.seed(seed)
nc = NbClust(loans, min.nc = 2, max.nc = 5, method = "kmeans")
table(nc$Best.n[1, ])

Executing the R code above, NbClust classifies the data into 4 clusters by majority rule:

*** : The Hubert index is a graphical method of determining the number of clusters.
      In the plot of Hubert index, we seek a significant knee that corresponds to a
      significant increase of the value of the measure i.e the significant peak in
      Hubert index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
      In the plot of D index, we seek a significant knee (the significant peak in
      Dindex second differences plot) that corresponds to a significant increase of
      the value of the measure.

*******************************************************************
* Among all indices:
* 8 proposed 2 as the best number of clusters
* 2 proposed 3 as the best number of clusters
* 12 proposed 4 as the best number of clusters
* 2 proposed 5 as the best number of clusters

***** Conclusion *****

* According to the majority rule, the best number of clusters is 4

*******************************************************************

Based on the majority rule result, we re-run k-means with 4 clusters and plot the final clusters:

set.seed(seed)
clust3 = kmeans(x = loans.scaled, centers = 4, nstart = 5)
clusplot(loans.scaled, clust3$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 1, main = "Final Cluster")

K-Means Clustering Output: -

5. Decision Trees using the CART Method


A decision tree is a supervised predictive model that uses binary splitting rules to predict the target value.

CART supports both classification and regression tasks.

To build the decision tree, we proceed as follows:


1. Create the train/test sets
2. Build the model
3. Measure performance

Creating the Training and Test Data Sets

set.seed(111)
str(loans)
prop.table(table(loans$`Personal Loan`))

sample = sample.split(loans$`Personal Loan`, SplitRatio = 0.7)
CARTtrain = subset(loans, sample == TRUE)
CARTtest  = subset(loans, sample == FALSE)

table(CARTtrain$`Personal Loan`)
sum(CARTtrain$`Personal Loan` == "1") / nrow(CARTtrain)

Output Analysis
The proportion of responders and non-responders in the full data set is 9.6% and 90.4% respectively.

The train data contains 3500 observations, of which 9.6% are responders and 90.4% are non-responders.
The test data contains 1500 observations, of which 9.7% are responders and 90.3% are non-responders.

The data is therefore distributed across the training and test sets in almost the same proportions as in
the original data before the split.

Now that the data has been partitioned successfully, we can proceed with building the CART and
Random Forest models.

Building the CART Model


Now that we have the two data sets and a basic understanding of the data, we build a CART model.
We use the "caret" and "rpart" packages to build this model. However, the traditional representation of
the CART model is not graphically appealing in R, so we use the "rattle" package to draw the decision
tree; rattle produces cleaner, easier-to-interpret trees.

r.ctrl = rpart.control(minsplit = 100, minbucket = 10, cp = 0, xval = 10)

cart.model = rpart(formula = `Personal Loan` ~ ., data = CARTtrain,
                   method = "class", control = r.ctrl)
cart.model
cart.model$variable.importance
fancyRpartPlot(cart.model)

Calculating Variable Importance (VI): Check the variable importance scores. CART measures the
improvement attributable to each variable in its role as either a primary or a surrogate splitter; these
improvements are summed over all nodes, totaled, and then scaled relative to the best-performing
variable. Execute the code below:

cart.model$variable.importance
View(cart.model$variable.importance)
round(cart.model$variable.importance, 4)

Output analysis (VI): Income, Education, Family members, CD Account and CCAvg contribute the most
to the classification of the target variable, while Mortgage plays only a minimal role in the splitting
decisions.

Variable        Importance
Education        229.48767
Income           166.95660
Family.Member    144.87414
CCAvg             86.62543
CD.Account        59.36536
Mortgage          20.73195

Calculating the Complexity Parameter (CP): Check the complexity parameter, as CP is used to control
the size of the decision tree and to select the optimal tree size. If the cost of adding another split to
the decision tree from the current node is above the value of CP, then tree building does not continue.
Execute the code below:

cart.model$cptable
print(cart.model)

cptable.frame = as.data.frame(cart.model$cptable)
cptable.frame$cp.deci = round(cptable.frame$CP, 4)
cptable.frame

plotcp(cart.model, main = "Size of Tree")

Output Analysis (CP):

Sl. No          CP  nsplit  rel error     xerror        xstd  cp.deci
     1  0.33283582       0  1.0000000  1.0000000  0.05195537   0.3328
     2  0.12537313       2  0.3343284  0.4179104  0.03460627   0.1254
     3  0.01641791       3  0.2089552  0.2477612  0.02687093   0.0164
     4  0.00000000       5  0.1761194  0.2059701  0.02455026   0.0000

From the table and plot above, the cross-validation error (xerror) is lowest in the last row, at 5 splits,
where the corresponding CP is 0.
Pruning works by running a held-out sample down the large tree and its subtrees and computing the
error for each; the tree with the smallest cross-validation error becomes the final tree, and its CP is used.
Since the optimal CP here is 0, no further pruning is required.
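For completeness, a minimal sketch of how pruning would be applied if a non-zero CP had been optimal
(the selection below is illustrative only, since no pruning is needed here):

# Pick the CP with the lowest cross-validation error and prune the tree to it
best.cp = cart.model$cptable[which.min(cart.model$cptable[, "xerror"]), "CP"]
pruned.model = prune(cart.model, cp = best.cp)
fancyRpartPlot(pruned.model)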

CART Model Performance on Train Data set


1. Confusion Matrix:
Calculating the confusion matrix on the train data: we predict the 0/1 classification for each row and
then tabulate actual versus predicted values to build the confusion matrix and check how accurate the
model is, by executing the R code below.

predCT = predict(cart.model, CARTtrain[, -7], type = "class")
predCTrain = predict(cart.model, CARTtrain[, -7])
tab2 = table(CARTtrain$`Personal Loan`, predCT)
sum(diag(tab2)) / sum(tab2)

Confusion Matrix Output: -
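The class-level metrics (accuracy, sensitivity, specificity) reported in the comparison table at the end can
be obtained with caret's confusionMatrix; a minimal sketch, assuming "1" is the positive (responder) class:

library(caret)

# Confusion matrix with class-level statistics on the train data
confusionMatrix(table(predCT, CARTtrain$`Personal Loan`), positive = "1")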


2. ROC
The ROC curve is the plot of sensitivity against (1 − specificity).
(1 − specificity) is also known as the false positive rate, and sensitivity is also known as the true positive rate.

Calculating ROC on Test Data
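A minimal sketch of the ROC/AUC, KS and Gini calculations using the ROCR and ineq packages listed
earlier (shown here for the test data; the same steps apply to the train data):

library(ROCR)
library(ineq)

# Predicted probability of class "1" on the test data
predCTest = predict(cart.model, CARTtest, type = "prob")[, "1"]
pred.obj  = prediction(predCTest, CARTtest$`Personal Loan`)

# ROC curve and area under the curve
perf = performance(pred.obj, "tpr", "fpr")
plot(perf, main = "ROC Curve - CART (Test)")
auc = as.numeric(performance(pred.obj, "auc")@y.values)

# KS statistic: maximum separation between TPR and FPR
ks = max(perf@y.values[[1]] - perf@x.values[[1]])

# Gini coefficient of the predicted probabilities
gini = ineq(predCTest, type = "Gini")

auc; ks; gini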

ROC Output Analysis:
The CART decision tree model gives a predicted accuracy of 97.94% on the train data set and 98.69% on
the test data set.

We can infer that the model can be used by Thera Bank management for decision making and for
predicting which customers will take a personal loan, since its performance is almost equal on the train
and test data sets, indicating a stable predictive model.

6. Random Forest Method


In a random forest, a large number of decision trees are created. Every observation is fed into every
decision tree, and the final classification for each observation is taken by majority vote across the trees.

The OOB (out-of-bag) estimate of the error rate is used to tune the random forest: we choose the
setting with the lowest OOB error for tree building.

To build the model using the Random Forest method, we proceed as follows:


1. Create the train/test sets
2. Build the model
3. Measure performance

tuneRF
We use the tuneRF function to find the best mtry value and then build the tuned random forest. Based
on the result below, mtry = 9 has the minimum out-of-bag error.
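A minimal sketch of this tuning step, assuming the same 70/30 split as before; RFtrain is a hypothetical
name for the random forest training set and the tuning parameters are illustrative:

library(randomForest)

RFtrain = CARTtrain                                   # assumption: reuse the earlier train split
RFtrain$`Personal Loan` = as.factor(RFtrain$`Personal Loan`)

set.seed(111)
# Search for the mtry value with the lowest OOB error
tuned = tuneRF(x = RFtrain[, setdiff(names(RFtrain), "Personal Loan")],
               y = RFtrain$`Personal Loan`,
               mtryStart = 3, ntreeTry = 51, stepFactor = 1.5,
               improve = 0.0001, trace = TRUE, plot = TRUE, doBest = FALSE)

# Build the tuned forest with mtry = 9 (the value reported above)
rf.model = randomForest(`Personal Loan` ~ ., data = RFtrain,
                        ntree = 501, mtry = 9, importance = TRUE)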

Important Variables
Based on the Mean Decrease Gini output, the top 4 variables for predicting whether a customer will buy
the loan are Education, Income, Family size and credit card average spend (CCAvg).
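A minimal sketch of how this importance output might be produced, assuming the tuned forest is stored
in rf.model as in the sketch above:

# Mean Decrease Accuracy and Mean Decrease Gini for each predictor
importance(rf.model)
varImpPlot(rf.model, main = "Random Forest - Variable Importance")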

Since Thera Bank has asked the retail marketing department to devise campaigns with better-targeted
marketing to increase the success ratio with a minimal budget, the department should target potential
customers based on Education, Income, Family size and CCAvg.
Performance Analysis

Train Dataset

Test dataset

Based on the confusion matrix outputs for the training and test data sets, the accuracy is nearly the
same on both, so the model is stable. (A sketch of how these matrices can be produced is given below.)
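A minimal sketch of this performance check, assuming rf.model and the hypothetical RFtrain name from
the earlier sketch (RFtest mirrors the CART test split):

library(caret)

RFtest = CARTtest                                     # assumption: reuse the earlier test split
RFtest$`Personal Loan` = as.factor(RFtest$`Personal Loan`)

# Predicted classes on the train and test data
pred.train = predict(rf.model, newdata = RFtrain)
pred.test  = predict(rf.model, newdata = RFtest)

# Confusion matrices with "1" as the positive (responder) class
confusionMatrix(table(pred.train, RFtrain$`Personal Loan`), positive = "1")
confusionMatrix(table(pred.test,  RFtest$`Personal Loan`),  positive = "1")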

Conclusion:

We have designed CART and Random Forest models to classify the customers who have a higher
probability of purchasing the loan.
Based on the Mean Decrease Gini output, the top 4 variables for predicting whether a customer will buy
the loan are Education, Income, Family size and credit card average spend (CCAvg).
Performance Comparison Table

                                          CART Model                           Random Forest
Performance Measure              Train      Test      Deviation       Train      Test      Deviation
K-S Value                        0.913      0.924     -0.011          0.918      0.935     -0.017
Area Under the Curve             0.981      0.983     -0.002          0.981      0.996     -0.015
Gini Index                       0.871      0.874     -0.003          0.909      0.915     -0.006
Confusion Matrix: Accuracy       0.986      0.979      0.007          0.986      0.978      0.008
Confusion Matrix: Sensitivity    0.875      0.806      0.069          0.869      0.792      0.077
Confusion Matrix: Specificity    0.997      0.997      0.000          0.999      0.998      0.001
Misclassification Rate           50/3500    32/1500   -0.007          48/3500    33/1500   -0.008
                                 = 0.014    = 0.021                   = 0.014    = 0.022
Overall Deviation                                      0.053                                0.040

After building the CART and Random Forest prediction models for Thera Bank's customers and
validating them through the various model validation tests, I conclude that the RANDOM FOREST
model performed better on all validation results for both the train and test data sets, and it can
therefore be taken as the appropriate model for predicting customer loan purchases.
