Abstract
This paper presents the results of applying Information Gain feature selection to the Naive Bayes method for classifying the status of credit collectability at BRI. By applying Information Gain feature selection to Naive Bayes, data processing becomes shorter because only influential variables are used, without reducing the integrity of the calculation itself. Based on the results of this study, the Restruct Flag variable had the highest importance with a gain of 0.157806, followed by Loan Type, Place of Birth, Rate, Business Type, Number of Dependents, Duration, Revenue, Marital Status, Age, Education, Gender, and Ceiling. The Naive Bayes implementation combined with Information Gain in this study produced a minimum accuracy of 74.07% and a maximum of 75.93%.
1. INTRODUCTION
The target of financing or crediting is prioritized for debtors who are
considered capable of repaying all liabilities, including principal, interest, and
other costs. The bank therefore needs to predict credit smoothness so that it
can know whether the credit will later be repaid on time by the debtor and will
not develop into bad credit. A debtor is a person who borrows money from or
owes money to a bank, while a creditor is the party that gives the loan, in this
case the bank. The BRI Kuripan Unit in Banjarmasin is one of the banks that
provide business credit services. BRI therefore needs information before
giving credit to its customers, so that no bad credit endangers the stability of
the bank's financial turnover.
Data mining is the process of searching data for patterns that are unknown or
unexpected. In general, data mining functions are classified into two categories,
descriptive and predictive; the predictive type can be applied to classify the
credit collectability status at a bank.
Naive Bayes is a classification method based on probability and statistics that
predicts future outcomes from previous experience, hence its association with Bayes'
theorem. According to Cahya [1], Naive Bayes can be used to predict the feasibility
of current credit or credit defaults at banks. Using the Naive Bayes algorithm
produced an accuracy of 79.84% on the initial data, while data that had gone through
the preprocessing stage reached 88.61%.
Information Gain Pada Naive Bayes Untuk Klasifikasi Status Kelancaran Kredit Pada Bank (Cahya Karima) 1
International Conference on Mathematics, Science and Computer Education (IC-MSCEdu 2019)
2. RESEARCH METHODS
2.1. Research procedure
The steps taken in this research procedure are:
a. Data Collection. The data comprise 540 records obtained directly from the BRI
Kuripan Unit Banjarmasin, namely the collectability data of KUR (People's
Business Credit) Micro at that unit.
b. Pre-Processing. The collected data are processed through the stages of KDD
(Knowledge Discovery in Databases). The pre-processing steps carried out are
Data Selection and Data Transformation.
c. Data Mining Process. The technique used is Information Gain feature selection
for Naive Bayes, classifying the credit smoothness status at the bank as Smooth
or Jammed. The stages carried out in the mining process are as follows. The
data are divided into two groups, training data and testing data, with a 90:10
ratio (486 training records : 54 testing records). On the training data, the
Information Gain value is then calculated for each variable. After obtaining
the result for each variable, the variables are ranked by their Information
Gain values from largest to smallest. Naive Bayes then classifies the testing
data by adding variables one at a time in that ranked order. The next step is
to match the Naive Bayes classification results on the testing data against
the original data. After matching predictions against the factual data, the
accuracy of the algorithm in this case study can be computed.
d. Pattern Evaluation. This is the assessment stage for identifying results. The
data mining results are evaluated and their validity is tested. The testing
data used are the KUR collectability data of the BRI Micro Unit. Validity
testing is done by measuring Accuracy, Recall, and Precision levels using the
Confusion Matrix method.
e. Knowledge Presentation. The knowledge obtained is presented using
visualization and representation techniques implemented in a PHP program.
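The 90:10 split described in step c can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; the `records` list and the positional split are assumptions.

```python
# Minimal sketch of the 90:10 training/testing split described in step c.
# `records` and the positional split are illustrative assumptions.
def split_train_test(records, train_ratio=0.9):
    """Split a list of records into training and testing partitions."""
    cut = int(len(records) * train_ratio)
    return records[:cut], records[cut:]

records = list(range(540))               # 540 collectability records
train, test = split_train_test(records)  # 486 training, 54 testing
print(len(train), len(test))             # 486 54
```

With the paper's 540 records this reproduces the stated partition sizes of 486 training and 54 testing records.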
Explanation:
A : attribute
|D| : total number of data samples
|Dj| : number of samples with value j
v : a possible value of attribute A
The Information Gain value, used to measure the effectiveness of an attribute in
classifying the data, can then be calculated with the formula below:
Gain(A) = Info(D) − Info_A(D) ..........................................................................(3)
Explanation:
X : data with an unknown class
H : the hypothesis that X belongs to a specific class
P(H|X) : probability of hypothesis H given condition X (posterior probability)
P(H) : probability of hypothesis H (prior probability)
P(X|H) : probability of X given hypothesis H
P(X) : probability of X
Explanation:
a. True Positive (TP): in this research, data predicted Smooth that are in fact Smooth
b. True Negative (TN): in this research, data predicted Jammed that are in fact Jammed
c. False Positive (FP): in this research, data predicted Smooth that are in fact Jammed
d. False Negative (FN): in this research, data predicted Jammed that are in fact Smooth
The values of accuracy, precision, and recall are calculated with the following formulas:
Accuracy = (TP + TN) / (TP + FN + FP + TN) ..........................................................................(7)
Sensitivity = TP rate = Recall = TP / (TP + FN) ..........................................................................(8)
Precision = TP / (TP + FP) ..........................................................................(9)
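The three formulas above can be sketched directly in Python. The example counts used below (TP=38, FP=12, FN=1, TN=3) are the 1-to-12-variable counts reported later in the confusion matrix; the helper functions are illustrative, not part of the authors' program.

```python
# Confusion-matrix metrics from Eqs. (7)-(9).
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + fn + fp + tn)

def recall(tp, fn):            # sensitivity / TP rate, Eq. (8)
    return tp / (tp + fn)

def precision(tp, fp):         # Eq. (9)
    return tp / (tp + fp)

# Counts from the 1-to-12-variable run reported later (TP=38, FP=12, FN=1, TN=3):
print(round(accuracy(38, 3, 12, 1) * 100, 2))  # 75.93
print(round(recall(38, 1) * 100, 2))           # 97.44
```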
3.6. Implementation
3.6.1. Data Collection
The data used are the collectability data of the BRI Kuripan Unit Banjarmasin,
540 records obtained directly from the BRI Kuripan Unit. The data contain
information on prospective debtors collected by the bank from interviews,
completed forms, and files requested by the bank. There are 27 initial variables
before pre-processing, namely: Period, Currency, CIF (Customer Identification
File), LN Type, Debtor Name, JK (Sex), Place of Birth, Date of Birth, Education,
Status, Dependent, Business, Turnover, Ceiling, Rate, Realization Date, Maturity
Date, Period, Principal Installments, Interest Installments, Total Monthly
Installments, Restructured Flag, Current Collectibility, DPK Collectibility,
Substandard Collectibility, Doubtful Collectibility, and Loss Collectibility.
The data distribution can be seen in the table below.
3.6.2. Pre-Processing
At this stage a reduction of variables is intended to reduce information that
is not needed and only selected / selected variables will then be processed in the
data mining process. Of the 27 variables in the initial data, 19 variables were
selected. List of selected variables, namely: LN Type, JK (Gender), Place of Birth, Date
of Birth, Education, Status, Dependent, Business, Turnover, Ceiling, Rate, Realization
Date, Duration, Restructuring Flag, Current Collectability, Collectability of TPF,
Substandard Collectability, Doubtful Collectibility and Current Collectibility. After
the selection process is carried out, values are transformed for variables with very
diverse values. Transformation is done to group data. The stages of the data
transformation process are carried out so that the selected data becomes a form that
is ready to be processed in data mining. Data that is not in the form of categories will
first be categorized and then ready to be calculated for the data mining process.
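The categorization step can be sketched as a simple binning function. The Ceiling bin edges below are taken from the variable table in this section; the function itself is an illustrative assumption, not the authors' implementation.

```python
# Sketch of the transformation step: a raw ceiling amount (in rupiah) is
# mapped onto one of the categorical bins used in the mining process.
def categorize_ceiling(amount):
    if amount <= 14_000_000:
        return "<= 14.000.000"
    if amount <= 23_000_000:
        return "14.100.000 - 23.000.000"
    return ">= 23.100.000"

print(categorize_ceiling(15_000_000))  # 14.100.000 - 23.000.000
```

The same pattern applies to the other continuous variables (Revenue, Age, Duration) with their own bin edges.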
Variable categories after transformation (earlier rows cut off by the page
break; the last visible Revenue categories are 3.600.000 - 4.500.000 and
≥ 4.600.000):

No  Variable        Categories
10  Ceiling         ≤ 14.000.000; 14.100.000 - 23.000.000; ≥ 23.100.000
11  Rate            4.82; 4.84; 4.94; 6.48
12  Duration        12 Month; 18 Month; 24 Month; 36 Month
13  Flag Restruck   Y; N
14  Classification  Smooth; Jammed
Implementation:
InfoJammed(LN TYPE = SH) = -(54/254) . 𝐿𝑜𝑔2 (54/254) = 0.47490
InfoSmooth(LN TYPE = SH) = -(200/254) . 𝐿𝑜𝑔2 (200/254) = 0.27151
Info(LN TYPE = SH) = 0.47490 + 0.27151 = 0.74642
InfoJammed(LN TYPE = SM) = -(78/232) . 𝐿𝑜𝑔2 (78/232) = 0.528712
InfoSmooth(LN TYPE = SM) = -(154/232) . 𝐿𝑜𝑔2 (154/232) = 0.392431
Info(LN TYPE = SM) = 0.528712 + 0.392431 = 0.92114
c. Calculate the split entropy for each variable. The split entropy of each
variable is calculated with the formula:

Info_A(D) = Σ_{j=1}^{v} (|Dj| / |D|) · Info(Dj)

The information from the formula is:
A : attribute
|Dj| : number of samples for value j
|D| : number of all data samples
v : a possible value of attribute A
Implementation:
Info(LN TYPE = SH) = 0.74642
Info(LN TYPE = SM) = 0.92114
E(LN TYPE) = (254 / 486) . 0.74642 + (232 / 486) . 0.92114 = 0.82982
d. Calculate the Information Gain of each variable, Gain(A) = Info(D) − Info_A(D).
Implementation:
E(JK) = 0.84267
Gain(JK) = 0.843758 − 0.84267 = 0.00108
Calculations are performed on each variable in the same way.
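The entropy and gain steps above can be reproduced with a short script. The class counts (Jammed = 132, Smooth = 354 in training; LN TYPE SH: 54 Jammed / 200 Smooth, SM: 78 Jammed / 154 Smooth) are taken from the worked example; the code itself is an illustrative sketch.

```python
import math

def info(counts):
    """Entropy Info(D) over a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Training-data counts from the worked example: [Jammed, Smooth]
info_d = info([132, 354])                   # Info(D), about 0.843758
split = {"SH": [54, 200], "SM": [78, 154]}  # LN TYPE value -> class counts
n = sum(sum(c) for c in split.values())     # 486 training records
info_a = sum(sum(c) / n * info(c) for c in split.values())  # E(LN TYPE)
gain = info_d - info_a
print(round(gain, 4))  # 0.0139, the Gain(LN TYPE) value in the ranking
```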
e. Sort or rank variables based on the results of the Information Gain calculation
from the largest to the smallest.
Table 3 Information Gain ranking results
VARIABLE               GAIN      RANKING
FLAG RESTRUCT          0.157806  1
LN TYPE                0.013931  2
PLACE OF BIRTH         0.007194  3
RATE                   0.006399  4
BUSINESS TYPE          0.005179  5
NUMBER OF DEPENDENTS   0.005117  6
DURATION               0.004555  7
REVENUE                0.003468  8
MARITAL STATUS         0.003410  9
AGE                    0.003285  10
EDUCATION              0.002139  11
GENDER                 0.001083  12
CEILING                0.000609  13
P(H|X) = P(X|H) · P(H) / P(X)

Explanation:
X : data with an unknown class
H : the hypothesis that X belongs to a specific class
P(H|X) : probability of hypothesis H given condition X
P(H) : probability of hypothesis H
P(X|H) : probability of X given hypothesis H
P(X) : probability of X
P(H) represents the probability of the class appearing in the training data.
Implementation:
The number of Y values for the FLAG RESTRUCT variable in the Jammed class = 52
P(FLAG RESTRUCT = Y | Jammed) = 52 / 132 = 0.39393
The number of Y values for the FLAG RESTRUCT variable in the Smooth class = 9
P(FLAG RESTRUCT = Y | Smooth) = 9 / 354 = 0.02542
The number of N values for the FLAG RESTRUCT variable in the Jammed class = 80
P(FLAG RESTRUCT = N | Jammed) = 80 / 132 = 0.60606
The number of N values for the FLAG RESTRUCT variable in the Smooth class = 345
P(FLAG RESTRUCT = N | Smooth) = 345 / 354 = 0.97457
The number of SH values for the LN TYPE variable in the Jammed class = 54
P(LN TYPE = SH | Jammed) = 54 / 132 = 0.40909
The number of SH values for the LN TYPE variable in the Smooth class = 200
P(LN TYPE = SH | Smooth) = 200 / 354 = 0.56497
The number of SM values for the LN TYPE variable in the Jammed class = 78
P(LN TYPE = SM | Jammed) = 78 / 132 = 0.59090
The number of SM values for the LN TYPE variable in the Smooth class = 154
P(LN TYPE = SM | Smooth) = 154 / 354 = 0.43502
Calculations are performed on each variable in the same way.
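The conditional probabilities above are per-class frequency ratios, which can be sketched generically. The `rows` structure and field names below are illustrative assumptions; the toy data reproduce the FLAG RESTRUCT counts from the worked example.

```python
from collections import Counter

def conditional_probs(rows, variable, class_field="cls"):
    """P(variable = value | class) as a {(value, class): probability} dict."""
    class_sizes = Counter(r[class_field] for r in rows)
    pair_counts = Counter((r[variable], r[class_field]) for r in rows)
    return {(v, c): n / class_sizes[c] for (v, c), n in pair_counts.items()}

# Toy rows reproducing the FLAG RESTRUCT counts above
# (Jammed: 52 Y / 80 N, Smooth: 9 Y / 345 N).
rows = ([{"flag": "Y", "cls": "Jammed"}] * 52 + [{"flag": "N", "cls": "Jammed"}] * 80
        + [{"flag": "Y", "cls": "Smooth"}] * 9 + [{"flag": "N", "cls": "Smooth"}] * 345)
probs = conditional_probs(rows, "flag")
print(round(probs[("Y", "Jammed")], 5))  # 0.39394  (52 / 132)
```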
Step 4. Implement Naive Bayes on testing data
In this calculation, the record with code 170 is taken from the testing data, as follows:
Table 4 Data with code 170
VARIABLE               VALUE
NO                     170
FLAG RESTRUCT          N
LOAN TYPE              SH
PLACE OF BIRTH         BANJARMASIN
RATE                   4.82
BUSINESS TYPE          TRADER
NUMBER OF DEPENDENTS   2 Persons
DURATION               36 Months
REVENUE                3.600.000 - 4.500.000
MARITAL STATUS         MARRIED
AGE                    ≤ 33 years
EDUCATION              SMU/SMK
GENDER                 MALE
CEILING                14.100.000 - 23.000.000
Based on the data with code 170, the following values are obtained:
P(FLAG RESTRUCT = N | Smooth) = 0.97457
P(FLAG RESTRUCT = N | Jammed) = 0.60606
P(LOAN TYPE = SH | Smooth) = 0.56497
P(LOAN TYPE = SH | Jammed) = 0.40909
P(PLACE OF BIRTH = BANJARMASIN | Smooth) = 0.52542
P(PLACE OF BIRTH = BANJARMASIN | Jammed) = 0.63636
P(RATE = 4.82 | Smooth) = 0.73728
P(RATE = 4.82 | Jammed) = 0.75
P(BUSINESS TYPE = TRADERS | Smooth) = 0.62146
P(BUSINESS TYPE = TRADERS | Jammed) = 0.66666
Calculation with 2 variables:
P(Smooth) = P(Flag Restruct = N | Smooth) × P(Loan Type = SH | Smooth) × P(Smooth)
          = 0.974576271 × 0.564971751 × 0.728395062
          = 0.4010601939
P(Jammed) = P(Flag Restruct = N | Jammed) × P(Loan Type = SH | Jammed) × P(Jammed)
          = 0.606060606 × 0.409090909 × 0.271604938
          = 0.0673400673
Calculation with 13 variables:
P(Smooth) = P(Flag Restruct = N | Smooth) × P(Loan Type = SH | Smooth) ×
          P(Business Type = Trader | Smooth) × P(Rate = 4.82 | Smooth) ×
          P(Place of Birth = Banjarmasin | Smooth) × P(Duration = 36 Months | Smooth) ×
          P(Age ≤ 33 years | Smooth) × P(Number of Dependents = 2 Persons | Smooth) ×
          P(Marital Status = Married | Smooth) × P(Revenue = 3.600.000 - 4.500.000 | Smooth) ×
          P(Education = SMU/SMK | Smooth) × P(Gender = Male | Smooth) ×
          P(Ceiling = 14.100.000 - 23.000.000 | Smooth) × P(Smooth)
          = 0.974576271 × 0.564971751 × 0.525423729 × 0.737288136 ×
            0.621468927 × 0.288135593 × 0.460451977 × 0.347457627 ×
            0.776836158 × 0.336158192 × 0.542372881 × 0.505649718 ×
            0.324858757 × 0.728395062
          = 0.0001035558
P(Jammed) = P(Flag Restruct = N | Jammed) × P(Loan Type = SH | Jammed) ×
          P(Business Type = Trader | Jammed) × P(Rate = 4.82 | Jammed) ×
          P(Place of Birth = Banjarmasin | Jammed) × P(Duration = 36 Months | Jammed) ×
          P(Age ≤ 33 years | Jammed) × P(Number of Dependents = 2 Persons | Jammed) ×
          P(Marital Status = Married | Jammed) × P(Revenue = 3.600.000 - 4.500.000 | Jammed) ×
          P(Education = SMU/SMK | Jammed) × P(Gender = Male | Jammed) ×
          P(Ceiling = 14.100.000 - 23.000.000 | Jammed) × P(Jammed)
          = 0.606060606 × 0.409090909 × 0.636363636 × 0.75 × 0.666666667 ×
            0.303030303 × 0.522727273 × 0.401515152 × 0.772727273 ×
            0.409090909 × 0.545454545 × 0.462121212 × 0.340909091 ×
            0.271604938
          = 0.0000370178
The same calculation is done for each variable for both the Smooth and Jammed
classes. The result of the Jammed-class calculation is then compared with the
result of the Smooth-class calculation: if the Jammed value is higher, the record
is classified as Jammed; otherwise it is classified as Smooth.
Example:
Calculation with 1 variable: Smooth = 0.7098765432 and Jammed = 0.1646090535.
Based on the single-variable calculation, the record is therefore classified
as Smooth.
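The comparison rule can be sketched end-to-end for the two-variable case of record 170. The probability values are the ones computed above; the helper function is an illustrative assumption.

```python
# Naive Bayes score: product of the class-conditional probabilities and the prior.
def nb_score(conditionals, prior):
    score = prior
    for p in conditionals:
        score *= p
    return score

# Record 170, two variables (Flag Restruct = N, Loan Type = SH):
smooth = nb_score([0.974576271, 0.564971751], 354 / 486)  # ~0.40106
jammed = nb_score([0.606060606, 0.409090909], 132 / 486)  # ~0.06734
print("Smooth" if smooth > jammed else "Jammed")  # Smooth
```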
3.6.4. Pattern Evaluation
The evaluation stage is carried out using the Confusion Matrix method.
Table 5 Confusion matrix
1 to 12 variables:
             FACT SMOOTH   FACT JAMMED   TOTAL
NB SMOOTH    TP: 38        FP: 12        50
NB JAMMED    FN: 1         TN: 3         4
TOTAL        39            15            54

13 variables:
             FACT SMOOTH   FACT JAMMED   TOTAL
NB SMOOTH    TP: 38        FP: 13        51
NB JAMMED    FN: 1         TN: 2         3
TOTAL        39            15            54
Accuracy (1 to 12 variables) = (38 + 3) / 54 × 100% = 75.93%
Accuracy (13 variables) = (38 + 2) / 54 × 100% = 74.07%
Precision (1 to 12 variables) = 38 / (12 + 38) × 100% = 95%
Precision (13 variables) = 38 / (13 + 38) × 100% = 92.68%
Recall (1 to 12 variables) = 38 / (1 + 38) × 100% = 97.43%
Recall (13 variables) = 38 / (1 + 38) × 100% = 97.43%
3.6.5. Knowledge Presentation
3.7. EVALUATION
Based on the tests conducted, implementing variables 1 to 12 produced an
accuracy level corresponding to 41 of the testing records being predicted
correctly, but when the 13th variable was added (i.e., all variables were used),
accuracy decreased and the number of correctly predicted records dropped to 40.
This shows that, in this case study, optimal accuracy is obtained when using the
1 to 12 variables with the highest Information Gain values.
Based on the evaluation carried out using the confusion matrix, the
implementation of Naive Bayes combined with Information Gain in this study
produced a minimum accuracy of 74.07% and a maximum of 75.93%. The precision
for 1 to 12 variables is 95% and for 13 variables 92.68%, and the sensitivity
for 1 to 12 variables is 97.43% and for 13 variables 97.43%.
4. CONCLUSION
Based on the results of the research and discussion, it can be concluded that:
a. The variables with the greatest influence on the classification of credit
smoothness status at the bank, ordered by Information Gain, are Restructuring
Flag, LN Type, Place of Birth, Rate, Business, Dependents, Duration, Turnover,
Status, Age, Education, Gender, and Ceiling.
b. Classifying credit smoothness status with Naive Bayes and Information Gain
feature selection produced a minimum accuracy of 74.07% and a maximum of
75.93% in the case study examined.
c. The Ceiling variable has little influence on the classification of credit
smoothness status at the bank, since the initial accuracy of 75.93% decreased
to 74.07% after the Ceiling variable was added.
BIBLIOGRAPHY
[1] Aprilla C, Dennis, et al. 2013. Belajar Data Mining dengan RapidMiner. Jakarta.
[2] Andilala. 2016. Movie Review Sentimen Analisis Dengan Metode Naïve Bayes
Base On Feature Selection. Universitas Muhammadiyah Bengkulu : Jurnal
Pseudocode, Volume III Nomor 1, Februari 2016, ISSN 2355 – 5920.
[3] Betrisandi. 2017. Klasifikasi Nasabah Asuransi Jiwa Menggunakan Algoritma
Naive Bayes Berbasis Backward Elimination. ISSN print 2087-1716.
[4] Bustami. 2014. Penerapan Algoritma Naive Bayes Untuk Mengklasifikasi Data
Nasabah Asuransi. Teknik Informatika, Universitas Malikussaleh : Jurnal
Informatika Vol. 8, No. 1.
[5] Hidayatul, Syafitri Annur Aini, et al. 2017. Seleksi Fitur Information Gain untuk
Klasifikasi Penyakit Jantung Menggunakan Kombinasi Metode K-Nearest
Neighbor dan Naïve Bayes. Program Studi Teknik Informatika Universitas
Brawijaya : Jurnal Pengembangan Teknologi Informasi dan Ilmu Komputer.
Vol. 2, No. 9.
[6] Karima, Cahya. Penerapan Seleksi Fitur Information Gain pada Naive Bayes
untuk Klasifikasi Status Kelancaran Kredit Pada Bank. Universitas Lambung
Mangkurat. 2019.
[7] Liu, Bing. 2007. Web Data mining. New York. ISBN: 10 3-540-37881-2.
[8] Nurina, Betha Sari. 2016. Implementasi Teknik Seleksi Fitur Information Gain
Pada Algoritma Klasifikasi Machine Learning Untuk Prediksi Performa
Akademik Siswa. Fakultas Ilmu Komputer, Universitas Singaperbangsa
Karawang. ISSN. 2302-3805.