
Loan Defaulter Prediction System

  

Team Members:
Aishwarya Shetty
Prachi Ghadge
Sankeeta Jha
Sharvari Chavan
Sheryl Saji
Table of Contents

1. Project Overview
2. Problem Statement
3. Exploratory Data Analysis
4. Process
5. Model Performance Measures
6. Models Used
7. Model Selection
8. Conclusion
9. Future Scope
Project Overview
Loans are one of the most important products of the banking sector. All banks try to figure out effective business strategies to persuade customers to apply for loans. However, some customers behave negatively after their applications are approved. To prevent such scenarios, banks have to find methods to predict customers' behaviour.

Machine learning algorithms perform well on this task and are widely used in the banking industry. Here, we describe our work on predicting loan behaviour using machine learning models. A company would want to automate the loan eligibility process (in real time) based on the customer details provided while filling in the online application form. These details include number of dependents, income, loan amount, credit history, etc.

 
Problem Statement
We use a loan defaulter dataset to predict whether current borrowers will repay their loans, since loan defaults cause huge losses for banks. Various methods are therefore needed to detect and predict customers' default behaviour, as the accuracy of default prediction has a great impact on banks' profitability.

We have applied several algorithms to make this prediction.

 
Exploratory Data Analysis
Fig. 1. Count of each loan_status in the historical data.

Fig. 2. Distribution of the loan amount for the Charged Off loan status.

Fig. 3. The majority of Charged Off loans were taken out for the debt_consolidation purpose.

Fig. 4. Most charged-off loans belong to borrowers from the state of CA.

Fig. 5. Borrowers with grade B5 most often have a Charged Off status.
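The counts and distributions above can be reproduced with a few lines of pandas/matplotlib. The sketch below is only illustrative: the file name and the column names (loan_status, loan_amnt, purpose, addr_state, sub_grade) are assumptions based on typical loan datasets and may differ from the actual data.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file name; replace with the actual dataset path.
df = pd.read_csv("loan_data.csv")

# Fig. 1: count of each loan_status
df["loan_status"].value_counts().plot(kind="bar", title="Loan status counts")
plt.show()

charged_off = df[df["loan_status"] == "Charged Off"]

# Fig. 2: distribution of loan amount for charged-off loans
charged_off["loan_amnt"].plot(kind="hist", bins=30, title="Loan amount (Charged Off)")
plt.show()

# Figs. 3-5: purpose, state and sub-grade breakdown of charged-off loans
for col in ["purpose", "addr_state", "sub_grade"]:
    charged_off[col].value_counts().head(10).plot(kind="bar", title=col)
    plt.show()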
 

 
Model Performance Measures

Accuracy
Accuracy is the number of correct predictions made by the model divided by the total number of records. The best possible accuracy is 100%, indicating that all predictions are correct.

Sensitivity or recall
Sensitivity, also called recall (REC) or true positive rate (TPR), is calculated as the number of correct positive predictions divided by the total number of positives.

Because our dataset is imbalanced, accuracy is not a valid measure of model performance. For a dataset where the default rate is 5%, even if all records are predicted as 0, the model still achieves an accuracy of 95%. But such a model ignores all defaults and can be very detrimental to the business.

So accuracy is not the right measure of model performance in this scenario.

Specificity
Specificity (true negative rate) is calculated as the number of correct negative predictions
divided by the total number of negatives.

Precision
Precision (positive predictive value) is calculated as the number of correct positive predictions divided by the total number of positive predictions.
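As a sketch, the four measures defined above can be computed from a confusion matrix as follows; y_true and y_pred are placeholder labels, with 1 standing for a defaulter (Charged Off).

from sklearn.metrics import confusion_matrix

def performance_measures(y_true, y_pred):
    # Counts of true negatives, false positives, false negatives, true positives
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),                          # recall / TPR
        "specificity": tn / (tn + fp),                          # TNR
        "precision":   tp / (tp + fp) if (tp + fp) else 0.0,    # PPV
    }

# A 5% default rate where every record is predicted as 0 still scores
# 95% accuracy but 0 sensitivity, illustrating why accuracy alone is misleading.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
print(performance_measures(y_true, y_pred))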
                                              
 Models Used in the Project

Decision Tree


In a decision tree, the major challenge is to identify the attribute for the root node at each level. This process is known as attribute selection. We have two popular attribute selection measures (a fitting sketch follows this list):
⦁ Information Gain (computed for each feature)

⦁ Gini Index (for y_pred) = 0.231441

⦁ Split description
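A minimal fitting sketch, continuing from the data-loading sketch above: the feature columns and the Gini criterion are assumptions for illustration, not necessarily the exact setup used in the project.

from sklearn.tree import DecisionTreeClassifier, export_text

cols = ["annual_inc", "loan_amnt", "dti", "int_rate"]       # assumed numeric features
data = df[cols + ["loan_status"]].dropna()
X = data[cols]
y = (data["loan_status"] == "Charged Off").astype(int)      # 1 = defaulter

tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=42)
tree.fit(X, y)

# Text description of the splits: chosen attribute, threshold and class per node.
print(export_text(tree, feature_names=cols))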
Naive Bayes

Naive Bayes is a classification algorithm for binary (two-class) and multi-class classification
problems. The technique is easiest to understand when described using binary or categorical
input values.
In our project we have two classes, i.e., Charged Off and Fully Paid.
Confusion matrix (NB_Predictions_2; rows = predicted class, columns = actual class):

                           Actual: Charged Off    Actual: Fully Paid
Predicted: Charged Off             371                    769
Predicted: Fully Paid             1317                   9116
Accuracy: 81.97529%
Precision: 0.3254386
Recall (sensitivity): 0.2197867
Specificity: 0.9222054
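These figures follow directly from the confusion matrix above (reading rows as predictions and treating Charged Off as the positive class), as this short check shows:

# Cells of the Naive Bayes confusion matrix reported above.
tp, fp = 371, 769        # predicted Charged Off: actually Charged Off / Fully Paid
fn, tn = 1317, 9116      # predicted Fully Paid:  actually Charged Off / Fully Paid

accuracy    = (tp + tn) / (tp + fp + fn + tn)   # ~0.8198
precision   = tp / (tp + fp)                    # ~0.3254
recall      = tp / (tp + fn)                    # ~0.2198 (= sensitivity)
specificity = tn / (tn + fp)                    # ~0.9222
print(accuracy, precision, recall, specificity)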

KNN

The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems.
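A minimal KNN sketch on the same assumed feature matrix; the train/test split, the feature scaling and k = 5 are illustrative choices rather than the project's actual settings.

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X and y as built in the decision tree sketch above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Distance-based methods benefit from scaling the features first.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))     # accuracy on the held-out data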

Random Forest Classifier:


It is a very popular classification algorithm. A random forest creates multiple decision trees and merges them together to obtain a more stable and accurate prediction. In general, the more trees in the forest, the more robust the prediction and the higher the accuracy.

In a random forest, we grow multiple trees in one model. To classify a new object based on its attributes, each tree gives a classification, and we say that the tree votes for that class. The forest chooses the classification with the most votes among all the trees (for regression, it takes the average of the outputs of the different trees). In short, a random forest builds multiple trees and combines them to get a more accurate and stable result.
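A minimal sketch of this voting idea using scikit-learn; the number of trees is an illustrative choice.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

# Each tree votes; predict_proba averages the per-tree class probabilities.
print(rf.predict_proba(X_test)[:5])
print(rf.score(X_test, y_test))      # accuracy on the held-out data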
XGBoost

XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. Gradient boosting is an approach in which new models are created that predict the residuals or errors of prior models and are then added together to make the final prediction. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.

This approach supports both regression and classification predictive modeling problems. We used XGBoost because it dominates structured or tabular datasets on classification and regression predictive modeling problems and because of its fast execution.
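A minimal sketch using the xgboost package (assumed to be installed); the hyperparameters are illustrative.

from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=300,     # boosting rounds; each new tree fits the errors of the previous ones
    learning_rate=0.1,    # shrinkage applied to every new tree
    max_depth=4,
)
xgb.fit(X_train, y_train)
print(xgb.score(X_test, y_test))     # accuracy on the held-out data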
Model Selection

After comparing all the models (Decision Tree, Random Forest, KNN and Naïve Bayes):

The area under the curve is highest for Random Forest, at 0.977.
The highest accuracy is for the Decision Tree, at 94%.

While selecting the best model we also have to take into account the sensitivity, which also happens to be highest for the Decision Tree, at 0.82.

In our dataset about 20% of the entries are defaulters, so there is some imbalance in the dataset. In such a scenario we cannot rely on accuracy alone to select our model, and thus we also consider the sensitivity.

Thus, we can say that the Decision Tree is the best model for our problem. A sketch of such a comparison is shown below.
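This sketch reuses the models fitted in the earlier sketches (Naïve Bayes omitted for brevity); the numbers it prints depend on the data and settings and are not the figures quoted above.

from sklearn.metrics import roc_auc_score, recall_score

for name, model in [("Decision Tree", tree), ("Random Forest", rf),
                    ("KNN", knn), ("XGBoost", xgb)]:
    model.fit(X_train, y_train)                  # refit each model on the same split
    proba = model.predict_proba(X_test)[:, 1]    # probability of Charged Off
    preds = model.predict(X_test)
    print(name,
          "AUC:", round(roc_auc_score(y_test, proba), 3),
          "sensitivity:", round(recall_score(y_test, preds), 3))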
Conclusion
Most binary classification models first predict a probability and then assign that probability to class 1 or 0 based on the default threshold of 0.5. To improve the recall of the model, we can take the probabilities predicted by the model and set the threshold ourselves. The threshold is chosen based on several factors, such as business objectives, and differs from case to case. In bank loan behaviour prediction, for example, banks want to control losses to an acceptable level, so they may use a relatively low threshold. This means more customers will be grouped as "potential bad customers" and their profiles will later be checked carefully by the credit risk management team. In this way, banks can detect default behaviour at an earlier stage and take the corresponding actions to reduce possible losses.
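As a sketch of this thresholding idea (using the random forest fitted earlier; the 0.3 cut-off is purely illustrative), lowering the threshold flags more potential defaulters and trades precision for recall:

from sklearn.metrics import precision_score, recall_score

proba = rf.predict_proba(X_test)[:, 1]           # probability of Charged Off

for threshold in (0.5, 0.3):                     # default vs. lower cut-off
    preds = (proba >= threshold).astype(int)
    print(f"threshold {threshold}:",
          "recall", round(recall_score(y_test, preds), 3),
          "precision", round(precision_score(y_test, preds), 3))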
Future Scope

In this project we predicted whether the current crop of borrowers will repay their loans. In the future, we can work on predicting whether a person who has just applied for a loan will repay it, which would help the bank avoid losses.

We have now worked with various models and concluded that the decision tree is the best model in this scenario. Thus, we can build a dynamic system where an employee enters the details of a borrower and predicts in no time whether the applicant will repay the loan.
