Abstract
This paper presents the results of applying Information Gain feature selection to the Naive Bayes method for classifying the status of credit collectability at BRI. By applying Information Gain feature selection to Naive Bayes, data processing becomes shorter because only influential variables are used, without reducing the integrity of the calculation itself. Based on the results of this study, the Restruct Flag variable had the highest importance with a gain of 0.157806, followed by Loan Type, Place of Birth, Rate, Business Type, Number of Dependents, Duration, Revenue, Marital Status, Age, Education, Gender, and Ceiling. The Naive Bayes implementation combined with Information Gain in this study produced a minimum accuracy of 74.07% and a maximum of 75.93%.
1. INTRODUCTION
The target of financing or crediting is prioritized for debtors who are
considered capable of repaying all liabilities, including principal, interest, and
other costs. The bank therefore needs to predict credit smoothness so that it
can know whether the credit will later be repaid on time by the debtor and will
not develop into bad credit. A debtor is a person who borrows money from or
owes money to a bank, while a creditor is the party that gives the loan, in this
case the bank. The BRI Kuripan Unit in Banjarmasin is one of the banks that
provide business credit services. BRI therefore needs information before
giving credit to its customers, so that no bad credit endangers the stability of
the bank's financial turnover.
Data mining is the process of searching data for patterns that are unknown or
unexpected. In general, data mining functions are classified into two categories,
descriptive and predictive; the predictive type can be applied to classify the
credit collectability status at a bank.
Naive Bayes is a classification method based on probability and statistics that
predicts future outcomes from previous experience, hence its association with Bayes'
theorem. According to Cahya [1], Naive Bayes can be used to predict the feasibility
of current credit or credit defaults at banks. Using the Naive Bayes algorithm
produced an accuracy of 79.84% on the initial data, while data that had gone through
the preprocessing stage reached 88.61%.
Information Gain Pada Naive Bayes Untuk Klasifikasi Status Kelancaran Kredit Pada Bank (Cahya Karima) 1
International Conference on Mathematics, Science and Computer Education (IC-MSCEdu 2019)
2. RESEARCH METHODS
2.1. Research procedure
The steps taken in this research procedure are:
a. Data Collection. The data comprise 540 records obtained directly from the BRI
Kuripan Unit Banjarmasin, namely the collectability data of KUR (People's
Business Credit) Micro at that unit.
b. Pre-Processing. The collected data are processed through the stages of KDD
(Knowledge Discovery in Databases). The pre-processing steps carried out are
Data Selection and Data Transformation.
c. Data Mining Process. The technique used is Information Gain feature selection
for Naive Bayes, classifying the credit smoothness status at the bank as Smooth
or Jammed. The stages carried out in the mining process are as follows. The
data are divided into two groups, training data and testing data, with a 90:10
ratio (486 training records : 54 testing records). On the training data, the
Information Gain value is then calculated for each variable. After obtaining
the result for each variable, the variables are ranked by their Information
Gain values from largest to smallest. Naive Bayes then classifies the testing
data by adding variables one at a time in that ranked order. The next step is
to match the Naive Bayes classification results on the testing data against
the original data. After matching predictions against the factual data, the
accuracy of the algorithm in this case study can be computed.
d. Pattern Evaluation. This is the assessment stage for identifying results. The
data mining results are evaluated and their validity is tested. The testing
data used are the KUR collectability data of the BRI Micro Unit. Validity
testing is done by measuring Accuracy, Recall, and Precision levels using the
Confusion Matrix method.
e. Knowledge Presentation. The knowledge obtained is presented using
visualization and representation techniques implemented in a PHP program.
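The 90:10 split described in step c can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; the `records` list and the positional split are assumptions.

```python
# Minimal sketch of the 90:10 training/testing split described in step c.
# `records` and the positional split are illustrative assumptions.
def split_train_test(records, train_ratio=0.9):
    """Split a list of records into training and testing partitions."""
    cut = int(len(records) * train_ratio)
    return records[:cut], records[cut:]

records = list(range(540))               # 540 collectability records
train, test = split_train_test(records)  # 486 training, 54 testing
print(len(train), len(test))             # 486 54
```

With the paper's 540 records this reproduces the stated partition sizes of 486 training and 54 testing records.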
Explanation:
A : attribute
|D| : total number of data samples
|Dj| : number of samples with value j
v : a possible value of attribute A
The Information Gain value, used to measure the effectiveness of an attribute in
classifying the data, can then be calculated with the formula below:
Gain(A) = Info(D) − Info_A(D) ..........................................................................(3)
Explanation:
X : data with an unknown class
H : the hypothesis that X belongs to a specific class
P(H|X) : probability of hypothesis H given condition X (posterior probability)
P(H) : probability of hypothesis H (prior probability)
P(X|H) : probability of X given hypothesis H
P(X) : probability of X
Explanation:
a. True Positive (TP): in this research, data predicted Smooth that are in fact Smooth
b. True Negative (TN): in this research, data predicted Jammed that are in fact Jammed
c. False Positive (FP): in this research, data predicted Smooth that are in fact Jammed
d. False Negative (FN): in this research, data predicted Jammed that are in fact Smooth
The values of accuracy, precision, and recall are calculated with the following formulas:
Accuracy = (TP + TN) / (TP + FN + FP + TN) ..........................................................................(7)
Sensitivity = TP rate = Recall = TP / (TP + FN) ..........................................................................(8)
Precision = TP / (TP + FP) ..........................................................................(9)
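The three formulas above can be sketched directly in Python. The example counts used below (TP=38, FP=12, FN=1, TN=3) are the 1-to-12-variable counts reported later in the confusion matrix; the helper functions are illustrative, not part of the authors' program.

```python
# Confusion-matrix metrics from Eqs. (7)-(9).
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + fn + fp + tn)

def recall(tp, fn):            # sensitivity / TP rate, Eq. (8)
    return tp / (tp + fn)

def precision(tp, fp):         # Eq. (9)
    return tp / (tp + fp)

# Counts from the 1-to-12-variable run reported later (TP=38, FP=12, FN=1, TN=3):
print(round(accuracy(38, 3, 12, 1) * 100, 2))  # 75.93
print(round(recall(38, 1) * 100, 2))           # 97.44
```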
3.6. Implementation
3.6.1. Data Collection
The data used are the collectability data of the BRI Kuripan Unit Banjarmasin,
540 records obtained directly from the BRI Kuripan Unit. The data contain
information on prospective debtors collected by the bank from interviews,
completed forms, and files requested by the bank. There are 27 initial variables
before pre-processing, namely: Period, Currency, CIF (Customer Identification
File), LN Type, Debtor Name, JK (Sex), Place of Birth, Date of Birth, Education,
Status, Dependent, Business, Turnover, Ceiling, Rate, Realization Date, Maturity
Date, Period, Principal Installments, Interest Installments, Total Monthly
Installments, Restructured Flag, Current Collectibility, DPK Collectibility,
Substandard Collectibility, Doubtful Collectibility, and Loss Collectibility.
The data distribution can be seen in the table below.
3.6.2. Pre-Processing
At this stage a reduction of variables is intended to reduce information that
is not needed and only selected / selected variables will then be processed in the
data mining process. Of the 27 variables in the initial data, 19 variables were
selected. List of selected variables, namely: LN Type, JK (Gender), Place of Birth, Date
of Birth, Education, Status, Dependent, Business, Turnover, Ceiling, Rate, Realization
Date, Duration, Restructuring Flag, Current Collectability, Collectability of TPF,
Substandard Collectability, Doubtful Collectibility and Current Collectibility. After
the selection process is carried out, values are transformed for variables with very
diverse values. Transformation is done to group data. The stages of the data
transformation process are carried out so that the selected data becomes a form that
is ready to be processed in data mining. Data that is not in the form of categories will
first be categorized and then ready to be calculated for the data mining process.
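The categorization step can be sketched as a simple binning function. The Ceiling bin edges below are taken from the variable table in this section; the function itself is an illustrative assumption, not the authors' implementation.

```python
# Sketch of the transformation step: a raw ceiling amount (in rupiah) is
# mapped onto one of the categorical bins used in the mining process.
def categorize_ceiling(amount):
    if amount <= 14_000_000:
        return "<= 14.000.000"
    if amount <= 23_000_000:
        return "14.100.000 - 23.000.000"
    return ">= 23.100.000"

print(categorize_ceiling(15_000_000))  # 14.100.000 - 23.000.000
```

The same pattern applies to the other continuous variables (Revenue, Age, Duration) with their own bin edges.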
Variable categories after transformation (earlier rows cut off by the page
break; the last visible Revenue categories are 3.600.000 - 4.500.000 and
≥ 4.600.000):

No  Variable        Categories
10  Ceiling         ≤ 14.000.000; 14.100.000 - 23.000.000; ≥ 23.100.000
11  Rate            4.82; 4.84; 4.94; 6.48
12  Duration        12 Month; 18 Month; 24 Month; 36 Month
13  Flag Restruck   Y; N
14  Classification  Smooth; Jammed
Implementation:
InfoJammed(LN TYPE = SH) = -(54/254) . 𝐿𝑜𝑔2 (54/254) = 0.47490
InfoSmooth(LN TYPE = SH) = -(200/254) . 𝐿𝑜𝑔2 (200/254) = 0.27151
Info(LN TYPE = SH) = 0.47490 + 0.27151 = 0.74642
InfoJammed(LN TYPE = SM) = -(78/232) . 𝐿𝑜𝑔2 (78/232) = 0.528712
InfoSmooth(LN TYPE = SM) = -(154/232) . 𝐿𝑜𝑔2 (154/232) = 0.392431
Info(LN TYPE = SM) = 0.528712 + 0.392431 = 0.92114
c. Calculate the split entropy for each variable. The split entropy of each
variable is calculated with the formula:

Info_A(D) = Σ_{j=1}^{v} (|Dj| / |D|) · Info(Dj)

The information from the formula is:
A : attribute
|Dj| : number of samples for value j
|D| : number of all data samples
v : a possible value of attribute A
Implementation:
Info(LN TYPE = SH) = 0.74642
Info(LN TYPE = SM) = 0.92114
E(LN TYPE) = (254 / 486) . 0.74642 + (232 / 486) . 0.92114 = 0.82982
d. Calculate the Information Gain of each variable, Gain(A) = Info(D) − Info_A(D).
Implementation:
E(JK) = 0.84267
Gain(JK) = 0.843758 − 0.84267 = 0.00108
Calculations are performed on each variable in the same way.
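The entropy and gain steps above can be reproduced with a short script. The class counts (Jammed = 132, Smooth = 354 in training; LN TYPE SH: 54 Jammed / 200 Smooth, SM: 78 Jammed / 154 Smooth) are taken from the worked example; the code itself is an illustrative sketch.

```python
import math

def info(counts):
    """Entropy Info(D) over a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Training-data counts from the worked example: [Jammed, Smooth]
info_d = info([132, 354])                   # Info(D), about 0.843758
split = {"SH": [54, 200], "SM": [78, 154]}  # LN TYPE value -> class counts
n = sum(sum(c) for c in split.values())     # 486 training records
info_a = sum(sum(c) / n * info(c) for c in split.values())  # E(LN TYPE)
gain = info_d - info_a
print(round(gain, 4))  # 0.0139, the Gain(LN TYPE) value in the ranking
```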
e. Sort or rank variables based on the results of the Information Gain calculation
from the largest to the smallest.
Table 3 Information Gain ranking results
VARIABLE               GAIN      RANKING
FLAG RESTRUCT          0.157806  1
LN TYPE                0.013931  2
PLACE OF BIRTH         0.007194  3
RATE                   0.006399  4
BUSINESS TYPE          0.005179  5
NUMBER OF DEPENDENTS   0.005117  6
DURATION               0.004555  7
REVENUE                0.003468  8
MARITAL STATUS         0.003410  9
AGE                    0.003285  10
EDUCATION              0.002139  11
GENDER                 0.001083  12
CEILING                0.000609  13
P(H|X) = P(X|H) · P(H) / P(X)

Explanation:
X : data with an unknown class
H : the hypothesis that X belongs to a specific class
P(H|X) : probability of hypothesis H given condition X
P(H) : probability of hypothesis H
P(X|H) : probability of X given hypothesis H
P(X) : probability of X
P(H) represents the probability of the class appearing in the training data.
Implementation:
The number of Y values for the FLAG RESTRUCT variable in the Jammed class = 52
P(FLAG RESTRUCT = Y | Jammed) = 52 / 132 = 0.39393
The number of Y values for the FLAG RESTRUCT variable in the Smooth class = 9
P(FLAG RESTRUCT = Y | Smooth) = 9 / 354 = 0.02542
The number of N values for the FLAG RESTRUCT variable in the Jammed class = 80
P(FLAG RESTRUCT = N | Jammed) = 80 / 132 = 0.60606
The number of N values for the FLAG RESTRUCT variable in the Smooth class = 345
P(FLAG RESTRUCT = N | Smooth) = 345 / 354 = 0.97457
The number of SH values for the LN TYPE variable in the Jammed class = 54
P(LN TYPE = SH | Jammed) = 54 / 132 = 0.40909
The number of SH values for the LN TYPE variable in the Smooth class = 200
P(LN TYPE = SH | Smooth) = 200 / 354 = 0.56497
The number of SM values for the LN TYPE variable in the Jammed class = 78
P(LN TYPE = SM | Jammed) = 78 / 132 = 0.59090
The number of SM values for the LN TYPE variable in the Smooth class = 154
P(LN TYPE = SM | Smooth) = 154 / 354 = 0.43502
Calculations are performed on each variable in the same way.
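The conditional probabilities above are per-class frequency ratios, which can be sketched generically. The `rows` structure and field names below are illustrative assumptions; the toy data reproduce the FLAG RESTRUCT counts from the worked example.

```python
from collections import Counter

def conditional_probs(rows, variable, class_field="cls"):
    """P(variable = value | class) as a {(value, class): probability} dict."""
    class_sizes = Counter(r[class_field] for r in rows)
    pair_counts = Counter((r[variable], r[class_field]) for r in rows)
    return {(v, c): n / class_sizes[c] for (v, c), n in pair_counts.items()}

# Toy rows reproducing the FLAG RESTRUCT counts above
# (Jammed: 52 Y / 80 N, Smooth: 9 Y / 345 N).
rows = ([{"flag": "Y", "cls": "Jammed"}] * 52 + [{"flag": "N", "cls": "Jammed"}] * 80
        + [{"flag": "Y", "cls": "Smooth"}] * 9 + [{"flag": "N", "cls": "Smooth"}] * 345)
probs = conditional_probs(rows, "flag")
print(round(probs[("Y", "Jammed")], 5))  # 0.39394  (52 / 132)
```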
Step 4. Implement Naive Bayes on testing data
In this calculation, the record with code 170 is taken from the testing data, as follows:
Table 4 Data with code 170
VARIABLE               VALUE
NO                     170
FLAG RESTRUCT          N
LOAN TYPE              SH
PLACE OF BIRTH         BANJARMASIN
RATE                   4.82
BUSINESS TYPE          TRADER
NUMBER OF DEPENDENTS   2 Persons
DURATION               36 Months
REVENUE                3.600.000 - 4.500.000
MARITAL STATUS         MARRIED
AGE                    ≤ 33 years
EDUCATION              SMU/SMK
GENDER                 MALE
CEILING                14.100.000 - 23.000.000
Based on the data with code 170, the following values are obtained:
P(FLAG RESTRUCT = N | Smooth) = 0.97457
P(FLAG RESTRUCT = N | Jammed) = 0.60606
P(LOAN TYPE = SH | Smooth) = 0.56497
P(LOAN TYPE = SH | Jammed) = 0.40909
P(PLACE OF BIRTH = BANJARMASIN | Smooth) = 0.52542
P(PLACE OF BIRTH = BANJARMASIN | Jammed) = 0.63636
P(RATE = 4.82 | Smooth) = 0.73728
P(RATE = 4.82 | Jammed) = 0.75
P(BUSINESS TYPE = TRADERS | Smooth) = 0.62146
P(BUSINESS TYPE = TRADERS | Jammed) = 0.66666
Calculation with 2 variables:
P(Smooth) = P(Flag Restruct = N | Smooth) × P(Loan Type = SH | Smooth) × P(Smooth)
          = 0.974576271 × 0.564971751 × 0.728395062
          = 0.4010601939
P(Jammed) = P(Flag Restruct = N | Jammed) × P(Loan Type = SH | Jammed) × P(Jammed)
          = 0.606060606 × 0.409090909 × 0.271604938
          = 0.0673400673
Calculation with 13 variables:
P(Smooth) = P(Flag Restruct = N | Smooth) × P(Loan Type = SH | Smooth) ×
          P(Business Type = Trader | Smooth) × P(Rate = 4.82 | Smooth) ×
          P(Place of Birth = Banjarmasin | Smooth) × P(Duration = 36 Months | Smooth) ×
          P(Age ≤ 33 years | Smooth) × P(Number of Dependents = 2 Persons | Smooth) ×
          P(Marital Status = Married | Smooth) × P(Revenue = 3.600.000 - 4.500.000 | Smooth) ×
          P(Education = SMU/SMK | Smooth) × P(Gender = Male | Smooth) ×
          P(Ceiling = 14.100.000 - 23.000.000 | Smooth) × P(Smooth)
          = 0.974576271 × 0.564971751 × 0.525423729 × 0.737288136 ×
            0.621468927 × 0.288135593 × 0.460451977 × 0.347457627 ×
            0.776836158 × 0.336158192 × 0.542372881 × 0.505649718 ×
            0.324858757 × 0.728395062
          = 0.0001035558
P(Jammed) = P(Flag Restruct = N | Jammed) × P(Loan Type = SH | Jammed) ×
          P(Business Type = Trader | Jammed) × P(Rate = 4.82 | Jammed) ×
          P(Place of Birth = Banjarmasin | Jammed) × P(Duration = 36 Months | Jammed) ×
          P(Age ≤ 33 years | Jammed) × P(Number of Dependents = 2 Persons | Jammed) ×
          P(Marital Status = Married | Jammed) × P(Revenue = 3.600.000 - 4.500.000 | Jammed) ×
          P(Education = SMU/SMK | Jammed) × P(Gender = Male | Jammed) ×
          P(Ceiling = 14.100.000 - 23.000.000 | Jammed) × P(Jammed)
          = 0.606060606 × 0.409090909 × 0.636363636 × 0.75 × 0.666666667 ×
            0.303030303 × 0.522727273 × 0.401515152 × 0.772727273 ×
            0.409090909 × 0.545454545 × 0.462121212 × 0.340909091 ×
            0.271604938
          = 0.0000370178
The same calculation is done for each variable for both the Smooth and Jammed
classes. The result of the Jammed-class calculation is then compared with the
result of the Smooth-class calculation: if the Jammed value is higher, the record
is classified as Jammed; otherwise it is classified as Smooth.
Example:
Calculation with 1 variable: Smooth = 0.7098765432 and Jammed = 0.1646090535.
Based on the single-variable calculation, the record is therefore classified
as Smooth.
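The comparison rule can be sketched end-to-end for the two-variable case of record 170. The probability values are the ones computed above; the helper function is an illustrative assumption.

```python
# Naive Bayes score: product of the class-conditional probabilities and the prior.
def nb_score(conditionals, prior):
    score = prior
    for p in conditionals:
        score *= p
    return score

# Record 170, two variables (Flag Restruct = N, Loan Type = SH):
smooth = nb_score([0.974576271, 0.564971751], 354 / 486)  # ~0.40106
jammed = nb_score([0.606060606, 0.409090909], 132 / 486)  # ~0.06734
print("Smooth" if smooth > jammed else "Jammed")  # Smooth
```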
3.6.4. Pattern Evaluation
The evaluation stage is carried out using the Confusion Matrix method.
Table 5 Confusion matrix
1 to 12 variables:
             FACT SMOOTH   FACT JAMMED   TOTAL
NB SMOOTH    TP: 38        FP: 12        50
NB JAMMED    FN: 1         TN: 3         4
TOTAL        39            15            54

13 variables:
             FACT SMOOTH   FACT JAMMED   TOTAL
NB SMOOTH    TP: 38        FP: 13        51
NB JAMMED    FN: 1         TN: 2         3
TOTAL        39            15            54
Accuracy (1 to 12 variables) = (38 + 3) / 54 × 100% = 75.93%
Accuracy (13 variables) = (38 + 2) / 54 × 100% = 74.07%
Precision (1 to 12 variables) = 38 / (12 + 38) × 100% = 95%
Precision (13 variables) = 38 / (13 + 38) × 100% = 92.68%
Recall (1 to 12 variables) = 38 / (1 + 38) × 100% = 97.43%
Recall (13 variables) = 38 / (1 + 38) × 100% = 97.43%
3.6.5. Knowledge Presentation
3.7. EVALUATION
Based on the tests conducted, implementing variables 1 to 12 produced an
accuracy level corresponding to 41 of the testing records being predicted
correctly, but when the 13th variable was added (i.e., all variables were used),
accuracy decreased and the number of correctly predicted records dropped to 40.
This shows that, in this case study, optimal accuracy is obtained when using the
1 to 12 variables with the highest Information Gain values.
Based on the evaluation carried out using the confusion matrix, the
implementation of Naive Bayes combined with Information Gain in this study
produced a minimum accuracy of 74.07% and a maximum of 75.93%. The precision
for 1 to 12 variables is 95% and for 13 variables 92.68%, and the sensitivity
for 1 to 12 variables is 97.43% and for 13 variables 97.43%.
4. CONCLUSION
Based on the results of the research and discussion, it can be concluded that:
a. The variables with the greatest influence on the classification of credit
smoothness status at the bank, ordered by Information Gain, are Restructuring
Flag, LN Type, Place of Birth, Rate, Business, Dependents, Duration, Turnover,
Status, Age, Education, Gender, and Ceiling.
b. Classifying credit smoothness status with Naive Bayes and Information Gain
feature selection produced a minimum accuracy of 74.07% and a maximum of
75.93% in the case study examined.
c. The Ceiling variable has little influence on the classification of credit
smoothness status at the bank, since the initial accuracy of 75.93% decreased
to 74.07% after the Ceiling variable was added.
BIBLIOGRAPHY
[1] Aprilla C, Dennis, et al. 2013. Belajar Data Mining dengan RapidMiner. Jakarta.
[2] Andilala. 2016. Movie Review Sentimen Analisis Dengan Metode Naïve Bayes
Base On Feature Selection. Universitas Muhammadiyah Bengkulu : Jurnal
Pseudocode, Volume III Nomor 1, Februari 2016, ISSN 2355 – 5920.
[3] Betrisandi. 2017. Klasifikasi Nasabah Asuransi Jiwa Menggunakan Algoritma
Naive Bayes Berbasis Backward Elimination. ISSN print 2087-1716.
[4] Bustami. 2014. Penerapan Algoritma Naive Bayes Untuk Mengklasifikasi Data
Nasabah Asuransi. Teknik Informatika, Universitas Malikussaleh : Jurnal
Informatika Vol. 8, No. 1.
[5] Hidayatul, Syafitri Annur Aini, et al. 2017. Seleksi Fitur Information Gain untuk
Klasifikasi Penyakit Jantung Menggunakan Kombinasi Metode K-Nearest
Neighbor dan Naïve Bayes. Program Studi Teknik Informatika Universitas
Brawijaya : Jurnal Pengembangan Teknologi Informasi dan Ilmu Komputer.
Vol. 2, No. 9.
[6] Karima, Cahya. Penerapan Seleksi Fitur Information Gain pada Naive Bayes
untuk Klasifikasi Status Kelancaran Kredit Pada Bank. Universitas Lambung
Mangkurat. 2019.
[7] Liu, Bing. 2007. Web Data mining. New York. ISBN: 10 3-540-37881-2.
[8] Nurina, Betha Sari. 2016. Implementasi Teknik Seleksi Fitur Information Gain
Pada Algoritma Klasifikasi Machine Learning Untuk Prediksi Performa
Akademik Siswa. Fakultas Ilmu Komputer, Universitas Singaperbangsa
Karawang. ISSN. 2302-3805.