
Vehicle Loan Default Prediction
Sheryl Saji
Prachi Dhamale
Aishwarya Shetty
Sharvari Chavan
Sankeeta Jha
Table of Contents:

Business Understanding:
• Business objectives, problem statement

Data Understanding:
• Data description, exploration, quality assessment

Data Preparation:
• Data cleanup, preparation, formatting, feature selection

Modeling:
• Training and testing the models

Evaluation:
• Model evaluation, performance reporting, performance after tuning features
1. PROBLEM STATEMENT

• Predict the probability of a loanee/borrower defaulting on a vehicle loan in the first EMI (Equated Monthly Instalment) on the due date.
2. DATA UNDERSTANDING

• Features:
• There are 40 features in total, excluding the target variable.

• Target = loan_default

• Data imbalance (counts per loan_default label):
0 = 182543
1 = 50611

Minority label share: 0.2170711203753742

Majority label share: 0.7829288796246258

The data is skewed, with a minority/majority ratio of 21.71/78.29.
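
The shares above follow directly from the two label counts. A minimal sketch of that arithmetic, using only the counts reported on this slide (the variable names are illustrative, not taken from the project code):

```python
# Label counts for the target column loan_default, as reported above.
counts = {0: 182543, 1: 50611}

total = sum(counts.values())
# Share of each label in the full dataset.
shares = {label: n / total for label, n in counts.items()}

minority = min(counts, key=counts.get)
majority = max(counts, key=counts.get)

print(f"total rows: {total}")
print(f"minority label {minority}: {shares[minority]:.4f}")
print(f"majority label {majority}: {shares[majority]:.4f}")
print(f"majority/minority ratio: {counts[majority] / counts[minority]:.2f}")
```

With roughly one defaulter for every three to four non-defaulters, plain accuracy would reward a model that always predicts "no default", which is one reason to look at a ranking metric instead.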


UniqueID Identifier for customers
loan_default Payment default in the first EMI on due date
disbursed_amount Amount of Loan disbursed
asset_cost Cost of the Asset
ltv Loan to Value of the asset
branch_id Branch where the loan was disbursed
supplier_id Vehicle Dealer where the loan was disbursed
manufacturer_id Vehicle manufacturer (Hero, Honda, TVS etc.)
Current_pincode Current pincode of the customer
Date.of.Birth Date of birth of the customer
Employment.Type Employment Type of the customer (Salaried/Self Employed)
DisbursalDate Date of disbursement
State_ID State of disbursement
Employee_code_ID Employee of the organization who logged the disbursement
MobileNo_Avl_Flag Flagged as 1 if the customer shared a mobile number
Aadhar_flag Flagged as 1 if the customer shared an Aadhaar number
PAN_flag Flagged as 1 if the customer shared a PAN
VoterID_flag Flagged as 1 if the customer shared a voter ID
Driving_flag Flagged as 1 if the customer shared a driving licence (DL)
Passport_flag Flagged as 1 if the customer shared a passport
PERFORM_CNS.SCORE Bureau Score
PERFORM_CNS.SCORE.DESCRIPTION Bureau score description
PRI.NO.OF.ACCTS count of total loans taken by the customer at the time of disbursement
PRI.ACTIVE.ACCTS count of active loans taken by the customer at the time of disbursement
PRI.OVERDUE.ACCTS count of default accounts at the time of disbursement
PRI.CURRENT.BALANCE total Principal outstanding amount of the active loans at the time of disbursement
PRI.SANCTIONED.AMOUNT total amount that was sanctioned for all the loans at the time of disbursement
PRI.DISBURSED.AMOUNT total amount that was disbursed for all the loans at the time of disbursement
SEC.NO.OF.ACCTS count of total loans taken by the customer at the time of disbursement
SEC.ACTIVE.ACCTS count of active loans taken by the customer at the time of disbursement
SEC.OVERDUE.ACCTS count of default accounts at the time of disbursement
SEC.CURRENT.BALANCE total Principal outstanding amount of the active loans at the time of disbursement
SEC.SANCTIONED.AMOUNT total amount that was sanctioned for all the loans at the time of disbursement
SEC.DISBURSED.AMOUNT total amount that was disbursed for all the loans at the time of disbursement
PRIMARY.INSTAL.AMT EMI Amount of the primary loan
SEC.INSTAL.AMT EMI Amount of the secondary loan
NEW.ACCTS.IN.LAST.SIX.MONTHS New loans taken by the customer in the last 6 months before the disbursement
DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS Loans defaulted in the last 6 months
AVERAGE.ACCT.AGE Average loan tenure
CREDIT.HISTORY.LENGTH Time since first loan
NO.OF_INQUIRIES Enquiries made by the customer for loans
3. DATA PREPARATION

• Derived Age from Date.of.Birth
• Changed the format and type of the AVERAGE.ACCT.AGE and CREDIT.HISTORY.LENGTH columns from object to float
• Imputed missing values in the Employment.Type column with its mode
• Encoded the categorical features Employment.Type, PERFORM_CNS.SCORE.DESCRIPTION and DisbursalDate
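
The preparation steps above can be sketched with a few small helpers. This is an illustration under stated assumptions, not the project's actual code: the slides do not show the raw string format of AVERAGE.ACCT.AGE and CREDIT.HISTORY.LENGTH, so the `'1yrs 11mon'` layout is assumed here, and the reference date for the age calculation is also an assumption. All function names are hypothetical.

```python
import re
from collections import Counter
from datetime import date


def duration_to_years(text):
    """Convert a string such as '1yrs 11mon' (assumed format) to years as a float."""
    match = re.fullmatch(r"\s*(\d+)\s*yrs\s+(\d+)\s*mon\s*", text)
    if match is None:
        raise ValueError(f"unexpected duration format: {text!r}")
    years, months = int(match.group(1)), int(match.group(2))
    return years + months / 12


def age_in_years(birth, reference):
    """Whole years between a birth date and a reference date (assumed: disbursal date)."""
    had_birthday = (reference.month, reference.day) >= (birth.month, birth.day)
    return reference.year - birth.year - (0 if had_birthday else 1)


def impute_with_mode(values):
    """Replace None entries with the most frequent non-missing value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


print(duration_to_years("1yrs 11mon"))
print(age_in_years(date(1990, 6, 15), date(2018, 6, 14)))
print(impute_with_mode(["Salaried", None, "Self employed", "Self employed"]))
```

On a pandas DataFrame, helpers like these could be applied column by column, for example with `Series.map`.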
1. Feature Engineering:
• df['ACTIVE.ACCTS']=df['PRI.ACTIVE.ACCTS']+df['SEC.ACTIVE.ACCTS']
• df['CURRENT.BALANCE']=df['PRI.CURRENT.BALANCE']+df['SEC.CURRENT.BALANCE']
• df['DISBURSED.AMOUNT']=df['PRI.DISBURSED.AMOUNT']+df['SEC.DISBURSED.AMOUNT']
• df['NO.OF.ACCTS']=df['SEC.NO.OF.ACCTS']+df['PRI.NO.OF.ACCTS']
• df['OVERDUE.ACCTS']=df['PRI.OVERDUE.ACCTS']+df['SEC.OVERDUE.ACCTS']
• df['SANCTIONED.AMOUNT']=df['PRI.SANCTIONED.AMOUNT']+df['SEC.SANCTIONED.AMOUNT']
• df['INSTAL.AMT']=df['PRIMARY.INSTAL.AMT']+df['SEC.INSTAL.AMT']
• df['SANCTION_DISBURSED']=df['SANCTIONED.AMOUNT']-df['DISBURSED.AMOUNT']
• df['NO_DEACTIVE_ACCOUNTS']=df['NO.OF.ACCTS']-df['ACTIVE.ACCTS']
• df['NO.OF.ACC.BEF.SIX.MONTH']=df['NO.OF.ACCTS']-df['NEW.ACCTS.IN.LAST.SIX.MONTHS']
• df['OVERDUE.ACC.BEF.SIX.MONTHS']=df['OVERDUE.ACCTS']-df['DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS']
• df['CLEAN.ACC']=df['NO.OF.ACCTS']-(df['ACTIVE.ACCTS']+df['OVERDUE.ACCTS'])
• df['asset_value']=df['disbursed_amount']*(df['ltv']/100)
• df['value_cost']=df['asset_cost']-df['asset_value']
• df['value_per_cost']=df['value_cost']/df['asset_value']
• df['extra_finance']=df['asset_cost']*(df['ltv']/100)-df['disbursed_amount']
• df['asset_disburse']=(df['asset_cost']-df['disbursed_amount'])/df['disbursed_amount']
• df['sixmmonths_dfault']=df['NEW.ACCTS.IN.LAST.SIX.MONTHS']-df['DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS']
2. Feature Selection:
3. Dealing with data imbalance

• Shuffle the dataset randomly.
• Split the dataset into k groups.
• For each unique group:
  • Take the group as a hold-out or test data set.
  • Take the remaining groups as a training data set.
  • Fit a model on the training set and evaluate it on the test set.
  • Retain the evaluation score and discard the model.
• Summarize the skill of the model using the sample of model evaluation scores.

• Here we have selected k = 10.
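
The k-fold procedure listed above can be sketched in plain Python. This is a minimal illustration, not the project's code: the function names, the toy data, and the majority-class "model" are all stand-ins chosen so the example runs without external libraries.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle the row indices and split them into k nearly equal groups."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(X, y, fit, score, k=10, seed=0):
    """Return one evaluation score per fold; each fitted model is discarded."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        # Fit on the remaining k-1 groups, evaluate on the held-out group.
        model = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(score(model, [X[i] for i in held_out], [y[i] for i in held_out]))
    return scores


# Toy stand-ins (hypothetical): a "model" that always predicts the majority label.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def accuracy(model, X_test, y_test):
    return sum(label == model for label in y_test) / len(y_test)


X = [[i] for i in range(100)]
y = [1 if i % 5 == 0 else 0 for i in range(100)]  # 20% positives

scores = cross_validate(X, y, fit_majority, accuracy, k=10)
mean_score = sum(scores) / len(scores)
print(f"per-fold scores: {scores}")
print(f"mean score over {len(scores)} folds: {mean_score:.3f}")
```

For imbalanced targets such as this one, a stratified split that keeps the label ratio similar in every fold is a common variation; the slides do not say whether one was used.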


5. EVALUATION

• Evaluation and performance reporting of:

I. Decision Trees
II. Random Forests
III. Extreme Gradient Boosting
IV. CatBoost

The model evaluation metric considered is the ROC score (area under the ROC curve).

• TPR (True Positive Rate) = # true positives / # positives = Recall = TP / (TP + FN)
• FPR (False Positive Rate) = # false positives / # negatives = FP / (FP + TN)
• We look at the ROC curve when detecting both classes is equally important, that is, when we want to give equal weight to the prediction ability for each class.
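
To make the TPR and FPR definitions concrete, here is a small sketch on made-up labels and scores. It also computes the area under the ROC curve, on the assumption that the single "ROC score" reported per model is this area; the data and function names are illustrative only.

```python
def tpr_fpr(y_true, y_pred):
    """Return (TPR, FPR) from binary labels and binary predictions."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), fp / (fp + tn)


def roc_auc(y_true, y_score):
    """Area under the ROC curve, computed as the probability that a random
    positive is scored above a random negative (ties count as one half)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: four loans with predicted default probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Turning scores into labels at a 0.5 threshold gives one point on the ROC curve.
y_pred = [1 if s >= 0.5 else 0 for s in y_score]
tpr, fpr = tpr_fpr(y_true, y_pred)
auc = roc_auc(y_true, y_score)
print(f"TPR={tpr}, FPR={fpr}, AUC={auc}")
```

An area of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the per-model scores are read.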

ROC score of each model:

I. Decision Trees: 0.5294
II. Random Forests: 0.5696
III. Extreme Gradient Boosting: 0.6598
IV. CatBoost: 0.6666

Model selected: CatBoost
