Simple Tutorial Project

A Project on
Application & Analysis of Different Classification Algorithms on Cardiology Dataset Using Weka Data Mining Software
Submitted to
Dr. Hossen Asiful Mustafa
PhD, University of South Carolina, USA
Assistant Professor, Institute of Information and Communication Technology (IICT)
Bangladesh University of Engineering and Technology (BUET), Dhaka, Bangladesh
Submitted by
Syful Islam
Roll: 1015312038
Session: Oct2015
Institute of Information and Communication Technology (IICT)
Bangladesh University of Engineering and Technology (BUET), Dhaka, Bangladesh
Index
Problem
Abstract
1. Introduction
2. Overview of dataset
3. Data Preprocessing
4. Building Classification Model
5. Result and Comparison
6. Decision Tree of Model-2 (C4.5 Decision Tree)
7. Conclusion
Problem
Using Weka Data Mining software, we are required to:
1. Pre-process the dataset in order to select the 9 best attributes. Include in your report
screenshots showing the algorithm you have applied for pre-processing (include the
chosen parameter values if any).
2. On the dataset obtained at point (1), apply precisely 4 different classification
algorithms in order to produce 8 models (2 models per algorithm), that can be used to
automatically diagnose future patients. At least one of the produced models has to be a
decision tree. All the models will be learned and tested by splitting the dataset into a
training and a test dataset, consisting of 75% and 25% of the instances,
respectively. For each built model, include in your report the algorithm name and the
parameter values chosen for its application, and the confusion matrix and the measures of
performance of the model (in particular the accuracy). Include a screenshot with one
decision tree that you obtained.
3. Calculate for each model the precision, sensitivity, specificity and the lift, for the class
sick.
4. Choose the best, the second best and the third best model from (2). Justify your
answer.
5. Mention three characteristics of sick people based on the decision tree built and
displayed in (2). List the production rule(s) that you have used to mention these
characteristics.
Abstract
This project introduces data mining techniques using Weka, a widely used data
mining tool. The dataset used is cardiology.arff. Four classification algorithms are
applied to the dataset to build 8 models by tuning the parameters of the algorithms.
All models are trained and tested by splitting the dataset into a training set and a
test set containing 75% and 25% of the instances, respectively. For each model, the
report includes the algorithm name, the chosen parameter values, the confusion matrix
and the performance measures (in particular the accuracy). For each model, the
precision, sensitivity, specificity and lift for the class sick are calculated. In addition,
the best, second best and third best models are chosen based on their performance.
Finally, production rules describing three characteristics of sick people are derived
from the decision tree.
1. Introduction
Weka (pronounced to rhyme with Mecca) is a workbench that contains a collection of
visualization tools and algorithms for data analysis and predictive modeling, together
with graphical user interfaces for easy access to these functions. Weka supports several
standard data mining tasks, more specifically, data preprocessing, clustering,
classification, regression, visualization, and feature selection. In this project, I use the
cardiology dataset to mine the characteristics of sick and healthy people. Through the
project, I show how to use the software and how to extract the desired information
from it. In addition, I show how to find the accuracy and the misclassification rate of
each model, how to decide which model is best, and how parameters can be tuned to
improve performance.
2. Overview of data set:

The given dataset cardiology.arff contains 14 attributes and 303 instances. Of the
303 records, 138 people are classified as sick and the remaining 165 as healthy.
Attribute                  Data Type   Values
age                        numeric
sex                        nominal     {Male, Female}
chest pain type            nominal     {_Asymptomatic, Abnormal Angina, Angina, No Tang}
blood pressure             numeric
cholesterol                numeric
Fasting blood sugar <120   nominal     {FALSE, TRUE}
resting ecg                nominal     {Hyp, Normal, Abnormal}
maximum heart rate         numeric
angina                     nominal     {TRUE, FALSE}
peak                       numeric
slope                      nominal     {Flat, Up, Down}
#colored vessels           numeric
thal                       nominal     {Rev, Normal, Fix}
class                      nominal     {Sick, Healthy}
Table 1: All 14 attributes and their data type with values.
Figure 1: Graphical Representation of the relation between class attribute with other
attribute values.
3. Data Preprocessing
According to the problem statement, the 9 best attributes have to be selected from the
dataset. The dataset has 14 attributes, one of which is the class attribute, so 4 of the
13 remaining attributes are removed as low priority. Weka's attribute selection makes
this task straightforward. The process of selecting the best attributes using Weka is
shown below:
Evaluator: weka.attributeSelection.InfoGainAttributeEval
Search Method: Attribute ranking.
Parameter used: [screenshot]
Attribute Selection Screenshot: [screenshot]
Attributes after removing low priority attributes: [screenshot]
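The InfoGainAttributeEval evaluator scores each attribute by how much knowing its value reduces the entropy of the class. The calculation can be sketched in Python as follows. The class counts (138 sick, 165 healthy) are taken from the dataset overview; the per-value counts of the example attribute are hypothetical and chosen only to illustrate the arithmetic, and the function names are illustrative rather than part of Weka.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(class_counts, counts_per_value):
    """Information gain of a nominal attribute.

    class_counts: class counts over the whole dataset, e.g. [sick, healthy]
    counts_per_value: one [sick, healthy] list per attribute value
    """
    total = sum(class_counts)
    # Weighted average entropy of the class after splitting on the attribute.
    remainder = sum(sum(part) / total * entropy(part) for part in counts_per_value)
    return entropy(class_counts) - remainder

# Class distribution reported for cardiology.arff: 138 sick, 165 healthy.
class_counts = [138, 165]

# HYPOTHETICAL split of the 303 instances by a two-valued attribute.
hypothetical_split = [[100, 40], [38, 125]]

print(f"class entropy    = {entropy(class_counts):.4f} bits")
print(f"information gain = {info_gain(class_counts, hypothetical_split):.4f} bits")
```

Ranking the attributes by this score and keeping the top 9 corresponds to the attribute ranking search method named above.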
4. Building Classification Model

Four (4) classification algorithms have been chosen to produce 8 models (2 models per
algorithm) from our processed dataset.

S.L.  Algorithm name                  Weka Class Name
1     C4.5 decision tree              weka.classifiers.trees.J48
2     REPTree                         weka.classifiers.trees.REPTree
3     Multilayer Perceptron           weka.classifiers.functions.MultilayerPerceptron
4     K-nearest neighbors algorithm   weka.classifiers.lazy.IBk
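Every model below is evaluated on the same 75%/25% split of the 303 instances. The following sketch shows how the two set sizes follow from the percentage. It assumes the training size is rounded to the nearest whole instance, which is an assumption about how the split is computed; the function name is illustrative.

```python
def split_sizes(num_instances, train_percent):
    """Return (train_size, test_size) for a percentage split.

    Assumes the training size is rounded to the nearest whole instance.
    """
    train_size = round(num_instances * train_percent / 100)
    return train_size, num_instances - train_size

train_size, test_size = split_sizes(303, 75)
print(f"training instances = {train_size}, test instances = {test_size}")
```

Under this assumption the test set holds 76 instances, which agrees with the sum of correctly and incorrectly classified instances reported for Model 1 (67 + 9).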
Algorithm: 1  C4.5 decision tree
Model: 1
Parameter
  Confidence Factor = 0.25
  Reduced Error Pruning = False
  Unpruned = True
Performance measure
  Correctly classified instances = 67
  Incorrectly classified instances = 9
  Accuracy = 67 / 76 = 88.16%
Confusion matrix
    a    b   <-- classified as
   31    7   a = sick
    2   36   b = healthy
For sick people
  Precision   = 31 / (31 + 2) = 0.9393
  Sensitivity = 31 / (31 + 7) = 0.8157
  Specificity = 36 / (36 + 2) = 0.9473
Using Weka: [screenshot]
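The measures reported for Model 1 can be reproduced from its confusion matrix with a few lines of Python. This is a minimal sketch: the diagonal counts (31 and 36) are from the report, while the off-diagonal counts (7 and 2) are inferred from the 9 misclassified instances together with the reported precision and sensitivity, so they should be read as an assumption. The function name is illustrative.

```python
def sick_class_metrics(tp, fn, fp, tn):
    """Accuracy plus precision, sensitivity and specificity for the class sick.

    tp: sick classified as sick        fn: sick classified as healthy
    fp: healthy classified as sick     tn: healthy classified as healthy
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Model 1: 31 and 36 are reported; 7 and 2 are inferred (assumption).
model_1 = sick_class_metrics(tp=31, fn=7, fp=2, tn=36)
for name, value in model_1.items():
    print(f"{name:12s}= {value:.4f}")
```

Rounded to four decimals the values are 0.8816, 0.9394, 0.8158 and 0.9474. The figures in the report differ in the last digit because they appear to be truncated rather than rounded.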
Algorithm: 1  C4.5 decision tree
Model: 2
Parameter
  Confidence Factor = 0.20
  Reduced Error Pruning = True
  Unpruned = False
Performance measure
  Correctly classified instances = 70
  Incorrectly classified instances = 6
  Accuracy = 70 / 76 = 92.11%
Confusion matrix
    a    b   <-- classified as
   34    4   a = sick
    2   36   b = healthy
For sick people
  Precision   = 34 / (34 + 2) = 0.9444
  Sensitivity = 34 / (34 + 4) = 0.8947
  Specificity = 36 / (36 + 2) = 0.9473
Using Weka: [screenshot]
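Algorithm: 2  REPTree
Model: 3
Parameter
  No Pruning = False
  Num fold = 3
Performance measure
  Correctly classified instances = 62
  Incorrectly classified instances = 14
  Accuracy = 62 / 76 = 81.57%
Confusion matrix
    a    b   <-- classified as
   25   13   a = sick
    1   37   b = healthy
For sick people
  Precision   = 25 / (25 + 1) = 0.9615
  Sensitivity = 25 / (25 + 13) = 0.6578
  Specificity = 37 / (37 + 1) = 0.9736
Using Weka: [screenshot]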
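Algorithm: 2  REPTree
Model: 4
Parameter
  No Pruning = True
  Num fold = 2
Performance measure
  Correctly classified instances = 64
  Incorrectly classified instances = 12
  Accuracy = 64 / 76 = 84.21%
Confusion matrix
    a    b   <-- classified as
   31    7   a = sick
    5   33   b = healthy
For sick people
  Precision   = 31 / (31 + 5) = 0.8611
  Sensitivity = 31 / (31 + 7) = 0.8157
  Specificity = 33 / (33 + 5) = 0.8684
Using Weka: [screenshot]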
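Algorithm: 3  Multilayer Perceptron
Model: 5
Parameter
  Hidden layer = a
  Learning rate = 0.3
  Momentum = 0.2
Performance measure
  Correctly classified instances = 66
  Incorrectly classified instances = 10
  Accuracy = 66 / 76 = 86.84%
Confusion matrix
    a    b   <-- classified as
   30    8   a = sick
    2   36   b = healthy
For sick people
  Precision   = 30 / (30 + 2) = 0.9375
  Sensitivity = 30 / (30 + 8) = 0.7894
  Specificity = 36 / (36 + 2) = 0.9473
Using Weka: [screenshot]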
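Algorithm: 3  Multilayer Perceptron
Model: 6
Parameter
  Hidden layer = o
  Learning rate = 0.2
  Momentum = 0.3
Performance measure
  Correctly classified instances = 68
  Incorrectly classified instances = 8
  Accuracy = 68 / 76 = 89.47%
Confusion matrix
    a    b   <-- classified as
   34    4   a = sick
    4   34   b = healthy
For sick people
  Precision   = 34 / (34 + 4) = 0.8947
  Sensitivity = 34 / (34 + 4) = 0.8947
  Specificity = 34 / (34 + 4) = 0.8947
Using Weka: [screenshot]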
Algorithm: 4  K-nearest neighbors classifier
Model: 7
Parameter
  Number of nearest neighbors = 1
  Nearest neighbor searching algorithm = LinearNNSearch
  Mean Squared = False
Performance measure
  Correctly classified instances = 64
  Incorrectly classified instances = 12
  Accuracy = 64 / 76 = 84.21%
Confusion matrix
    a    b   <-- classified as
   31    7   a = sick
    5   33   b = healthy
For sick people
  Precision   = 31 / (31 + 5) = 0.8611
  Sensitivity = 31 / (31 + 7) = 0.8157
  Specificity = 33 / (33 + 5) = 0.8684
Using Weka: [screenshot]
Algorithm: 4  K-nearest neighbors classifier
Model: 8
Parameter
  Number of nearest neighbors = 5
  Nearest neighbor searching algorithm = LinearNNSearch
  Mean Squared = True
Performance measure
  Correctly classified instances = 68
  Incorrectly classified instances = 8
  Accuracy = 68 / 76 = 89.47%
Confusion matrix
    a    b   <-- classified as
   31    7   a = sick
    1   37   b = healthy
For sick people
  Precision   = 31 / (31 + 1) = 0.9687
  Sensitivity = 31 / (31 + 7) = 0.8157
  Specificity = 37 / (37 + 1) = 0.9736
Using Weka: [screenshot]
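The problem statement also asks for the lift for the class sick, which is not listed with the measures above. One common definition divides the precision for the class by the proportion of that class in the test set; the sketch below applies it to all eight models. The definition chosen, the function name and the off-diagonal counts (inferred from each model's reported error count, precision and sensitivity) are assumptions rather than values taken from the Weka output.

```python
# (TP, FN, FP, TN) for the class sick. Diagonal counts are from the report;
# off-diagonal counts are inferred from the reported measures (assumption).
MODELS = {
    "Model-1": (31, 7, 2, 36),
    "Model-2": (34, 4, 2, 36),
    "Model-3": (25, 13, 1, 37),
    "Model-4": (31, 7, 5, 33),
    "Model-5": (30, 8, 2, 36),
    "Model-6": (34, 4, 4, 34),
    "Model-7": (31, 7, 5, 33),
    "Model-8": (31, 7, 1, 37),
}

def lift(tp, fn, fp, tn):
    """Lift for the positive class: precision divided by the class base rate."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    base_rate = (tp + fn) / total  # share of sick instances in the test set
    return precision / base_rate

lifts = {name: lift(*counts) for name, counts in MODELS.items()}
for name, value in lifts.items():
    print(f"{name}: lift = {value:.4f}")
```

With these counts half of the 76 test instances are sick, so each lift is simply twice the corresponding precision.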
5. Result and Comparison:

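SI  Model    Algorithm                       Accuracy  Rank
1   Model-1  C4.5 decision tree              88.16%    4th
2   Model-2  C4.5 decision tree              92.11%    1st
3   Model-3  REPTree                         81.57%    7th
4   Model-4  REPTree                         84.21%    6th
5   Model-5  Multilayer Perceptron           86.84%    5th
6   Model-6  Multilayer Perceptron           89.47%    3rd
7   Model-7  K-nearest neighbors algorithm   84.21%    6th
8   Model-8  K-nearest neighbors algorithm   89.47%    2nd

Model-6 and Model-8 tie at 89.47% (68 of 76 test instances correct). Model-8 is placed
second, which is consistent with its higher precision (0.9687 vs 0.8947) and specificity
(0.9736 vs 0.8947) for the class sick.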
From the table above we can state that Model-2 (C4.5 decision tree) is the best,
Model-8 (K-nearest neighbors algorithm) is the second best, Model-6 (Multilayer
Perceptron) is the third best and Model-1 (C4.5 decision tree) is the fourth best model
based on accuracy.
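The ranking can be reproduced by sorting the models on the number of correctly classified test instances. The sketch below is illustrative only: the counts and precision values are those reported for each model, while breaking ties on the precision for the class sick is an assumption, since the report does not state a tie-breaking rule.

```python
TEST_SIZE = 76  # 67 correct + 9 incorrect for Model-1

# name: (correctly classified instances, precision for the class sick)
RESULTS = {
    "Model-1": (67, 0.9393),
    "Model-2": (70, 0.9444),
    "Model-3": (62, 0.9615),
    "Model-4": (64, 0.8611),
    "Model-5": (66, 0.9375),
    "Model-6": (68, 0.8947),
    "Model-7": (64, 0.8611),
    "Model-8": (68, 0.9687),
}

# Sort by correct count, then by precision (assumed tie-breaker), descending.
ranked = sorted(RESULTS.items(), key=lambda item: item[1], reverse=True)

for position, (name, (correct, precision)) in enumerate(ranked, start=1):
    accuracy = 100 * correct / TEST_SIZE
    print(f"{position}. {name}: accuracy = {accuracy:.2f}%, precision = {precision:.4f}")
```

Model-4 and Model-7 have identical counts and precision, so their relative order in the output carries no meaning.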
6. Decision Tree of Model-2(C4.5 Decision Tree)

The following tree is generated from Model-2. We prefer this model because, of the 8
models built, it shows the highest accuracy on our dataset.
Decision Tree (Model-2): [screenshot]
Production Rules
R1: (#colored vessels > 0) ^ (chest pain type = NoTang) ^ (thal = Fix) => Sick
R2: (#colored vessels > 0) ^ (chest pain type = NoTang) ^ (thal = Rev) => Sick
R3: (#colored vessels > 0) ^ (chest pain type = Angina) => Sick
R4: (#colored vessels > 0) ^ (chest pain type = Abnormal_Angina) ^ (sex = Male) => Sick
R5: (#colored vessels <= 0) ^ (angina = True) ^ (thal = Fix) => Sick
R6: (#colored vessels <= 0) ^ (angina = True) ^ (thal = Rev) => Sick
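The production rules can be read as a simple predicate over a patient record. The sketch below encodes the listed rules in Python; the dictionary keys and the function name are hypothetical, and because only the Sick rules are listed, a False result means that none of them fired, not that the tree would classify the patient as healthy.

```python
def matches_sick_rule(patient):
    """Return True if any of the listed Sick production rules fires.

    patient is a dict with the (hypothetical) keys colored_vessels,
    chest_pain_type, thal, sex and angina.
    """
    chest_pain = patient["chest_pain_type"]
    thal = patient["thal"]
    if patient["colored_vessels"] > 0:
        # Rules with chest pain type NoTang and thal Fix or Rev.
        if chest_pain == "NoTang" and thal in ("Fix", "Rev"):
            return True
        # Rule with chest pain type Angina.
        if chest_pain == "Angina":
            return True
        # Rule with chest pain type Abnormal_Angina and sex Male.
        if chest_pain == "Abnormal_Angina" and patient["sex"] == "Male":
            return True
        return False
    # Rules with no colored vessels, angina present and thal Fix or Rev.
    return bool(patient["angina"]) and thal in ("Fix", "Rev")

# Hypothetical patient record used only to exercise the rules.
example = {
    "colored_vessels": 2,
    "chest_pain_type": "Angina",
    "thal": "Normal",
    "sex": "Female",
    "angina": False,
}
print(matches_sick_rule(example))
```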
7. Conclusion
In this project, I have shown how to carry out a classification task using the Weka
software. In the first step, the data is preprocessed using the Info Gain attribute
evaluator with the Ranker search method to select the 9 best attributes. Then several
classification algorithms are applied to the processed dataset, and their parameters are
tuned to build 8 models. The models are compared on accuracy to identify the best,
second best, third best and fourth best. Finally, production rules describing three
characteristics of sick people are derived from the decision tree.
