
Classification:

Classification is a data mining function that assigns items in a collection to target categories or
classes. The goal of classification is to accurately predict the target class for each case in the data.
For example, a classification model could be used to identify loan applicants as low, medium, or
high credit risks.
A classification task begins with a data set in which the class assignments are known. For example, a
classification model that predicts credit risk could be developed based on observed data for many
loan applicants over a period of time. In addition to the historical credit rating, the data might track
employment history, home ownership or rental, years of residence, number and type of investments,
and so on. Credit rating would be the target, the other attributes would be the predictors, and the data
for each customer would constitute a case.
Classifications are discrete and do not imply order. Continuous, floating-point values would indicate
a numerical, rather than a categorical, target. A predictive model with a numerical target uses a
regression algorithm, not a classification algorithm.
In the model build (training) process, a classification algorithm finds relationships between the
values of the predictors and the values of the target. Different classification algorithms use different
techniques for finding relationships. These relationships are summarized in a model, which can then
be applied to a different data set in which the class assignments are unknown.
Classification models are tested by comparing the predicted values to known target values in a set of
test data. The historical data for a classification project is typically divided into two data sets: one for
building the model; the other for testing the model. Scoring a classification model results in class
assignments and probabilities for each case. For example, a model that classifies customers as low,
medium, or high value would also predict the probability of each classification for each customer.
Classification has many applications in customer segmentation, business modeling, marketing, credit
analysis, and biomedical and drug response modeling.
In general, data classification is a two-step process. In the first step, which is called the learning step,
a model that describes a predetermined set of classes or concepts is built by analyzing a set of
training database instances. Each instance is assumed to belong to a predefined class. In the second
step, the model is tested using a different data set that is used to estimate the classification accuracy
of the model. If the accuracy of the model is considered acceptable, the model can be used to classify
future data instances for which the class label is not known. In the end, the model acts as a classifier
in the decision-making process. Several techniques can be used for classification, such as
decision trees, Bayesian methods, rule-based algorithms, and neural networks.
Decision tree classifiers are popular because constructing a tree does not require domain expert
knowledge or parameter setting, which makes them appropriate for exploratory knowledge
discovery. A decision tree produces a model whose rules are human-readable and interpretable, so
decision makers can compare the rules with their own domain knowledge to validate the model and
justify their decisions. Some decision tree classifiers are C4.5/C5.0/J4.8, NBTree, and others.

The WEKA toolkit (Witten et al., 2011) is a widely used toolkit for machine learning and data mining
originally developed at the University of Waikato in New Zealand. It contains a large collection of
state-of-the-art machine learning and data mining algorithms written in Java. WEKA contains tools
for regression, classification, clustering, association rules, visualization, and data pre-processing.
WEKA has become very popular with academic and industrial researchers, and is also widely used
for teaching purposes.
A Sample Classification Problem
Suppose we want to find the gross salary of each employee of a company. We will build a model using
training data for some employees. In this model we will use a binary model.
The attribute set for each employee is as follows:
Name        Description                                         Possible values
Id          Identity number of employee                         1, 2, 3...
Basic       Basic salary of employee in taka                    10000, 25000....
H_rent      House rent for the employee. It may be given as     5000, 50%.....
            a total in taka or as a percent of basic salary
Medical     Medical allowance for employee                      1500, 15%....
Reduction   Money deducted from an employee's account           3000, 35%....
Gross       Gross salary of employee                            20005, 32155.54....

In this problem, we first determine whether h_rent, medical, and reduction are given as a
percentage of basic. If so, we convert them to an amount in taka. Finally, we calculate
gross = basic + h_rent + medical - reduction.

Code: We have written a C program for this classification model.


#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Convert one field to an amount in taka. A trailing '%' means the
   field is a percentage of the basic salary; the '%' is stripped in place. */
static float to_amount(char *field, float basic)
{
    size_t len = strlen(field);
    float value;
    if (len > 0 && field[len - 1] == '%')
    {
        field[len - 1] = '\0';
        value = atof(field);
        return basic * (value / 100);
    }
    return atof(field);
}

int main(void)
{
    char head[6][50], data[6][20], strain[50], stest[50];
    float basic, h_rent, medical, reduction, gross, total, diff;
    int i, j;
    FILE *ftrainning, *ftest, *fout;

    printf("Enter training data set name:");
    if (scanf("%49s", strain) != 1)
        return 1;
    printf("Enter test data set name:");
    if (scanf("%49s", stest) != 1)
        return 1;

    ftrainning = fopen(strain, "r");
    ftest = fopen(stest, "r");
    fout = fopen("classification_out.txt", "w");
    if (ftrainning == NULL || ftest == NULL || fout == NULL)
    {
        printf("Cannot open an input or output file.\n");
        return 1;
    }

    /* Skip the six column headings of the training file. */
    for (i = 0; i < 6; i++)
        fscanf(ftrainning, "%49s", head[i]);

    /* Training: i counts the rows read, j counts the rows for which
       gross = basic + h_rent + medical - reduction holds (within 2 taka). */
    i = 0;
    j = 0;
    while (fscanf(ftrainning, "%19s%19s%19s%19s%19s%19s",
                  data[0], data[1], data[2], data[3], data[4], data[5]) == 6)
    {
        i++;
        basic = atof(data[1]);
        h_rent = to_amount(data[2], basic);
        medical = to_amount(data[3], basic);
        reduction = to_amount(data[4], basic);
        gross = atof(data[5]);
        total = basic + h_rent + medical - reduction;
        diff = gross - total;
        if (diff > -2 && diff < 2)
            j++;
    }
    if (i == j)
        printf("Classification Model is Built......\n");
    else
    {
        printf("Training data does not fit the model.\n");
        return 1;
    }

    /* Copy the six column headings of the test file to the output file. */
    for (i = 0; i < 6; i++)
    {
        fscanf(ftest, "%49s", head[i]);
        fprintf(fout, "%-20s", head[i]);
    }
    fprintf(fout, "\n");

    /* Testing: each test row has five fields (no gross). Write the row
       as read, followed by the computed gross salary. */
    while (fscanf(ftest, "%19s%19s%19s%19s%19s",
                  data[0], data[1], data[2], data[3], data[4]) == 5)
    {
        for (i = 0; i < 5; i++)
            fprintf(fout, "%-20s", data[i]);
        basic = atof(data[1]);
        h_rent = to_amount(data[2], basic);
        medical = to_amount(data[3], basic);
        reduction = to_amount(data[4], basic);
        gross = basic + h_rent + medical - reduction;
        fprintf(fout, "%-30f\n", gross);
    }

    fclose(ftrainning);
    fclose(ftest);
    fclose(fout);
    printf("Classification has been done successfully.....\n");
    return 0;
}

Input-Output:


classification_trainning.txt
id    basic    h_rent    medical    reduction    gross
1     5000     2000      500        10%          7000
2     15000    60%       10%        2500         23000
3     5000     2000      500        10%          7000
4     15000    60%       10%        2500         23000
5     5000     2000      500        10%          7000
6     15000    60%       10%        2500         23000
7     5000     2000      500        10%          7000
8     15000    60%       10%        2500         23000
9     5000     2000      500        10%          7000
10    15000    60%       10%        2500         23000

classification_test.txt
id    basic    h_rent    medical    reduction    gross
1     5000     2000      500        20%
2     15000    65%       10%        5500
3     6000     2500      500        12%
4     17000    40%       10%        2800
5     5600     2000      500        40%
6     18000    50%       10%        12220
7     5000     2600      500        60%
8     15000    550%      10%        8500
9     5000     2000      500        10%
10    13000    60%       10%        2500
11    5030     2000      500        10%
12    15080    62%       10%        2500
13    1000     2060      500        10%
14    18000    30%       10%        2500
15    25000    12000     500        10%
16    15000    60%       10%        2500
17    25000    9000      500        10%
18    4000     60%       10%        2500
19    3000     8000      500        10%
20    12000    60%       10%        2500
classification_out.txt
id    basic    h_rent    medical    reduction    gross
1     5000     2000      500        20%          6500.000000
2     15000    65%       10%        5500         20750.000000
3     6000     2500      500        12%          8280.000000
4     17000    40%       10%        2800         22700.000000
5     5600     2000      500        40%          5860.000000
6     18000    50%       10%        12220        16580.000000
7     5000     2600      500        60%          5100.000000
8     15000    550%      10%        8500         90500.000000
9     5000     2000      500        10%          7000.000000
10    13000    60%       10%        2500         19600.000000
11    5030     2000      500        10%          7027.000000
12    15080    62%       10%        2500         23437.599609
13    1000     2060      500        10%          3460.000000
14    18000    30%       10%        2500         22700.000000
15    25000    12000     500        10%          35000.000000
16    15000    60%       10%        2500         23000.000000
17    25000    9000      500        10%          32000.000000
18    4000     60%       10%        2500         4300.000000
19    3000     8000      500        10%          11200.000000
20    12000    60%       10%        2500         17900.000000