
A Project on

Application & Analysis of Different Classification Algorithms on Cardiology Dataset
Using Weka Data Mining Software

Submitted to
Dr. Hossen Asiful Mustafa
PhD, University of South Carolina, USA
Assistant Professor, Institute of Information and Communication Technology (IICT)
Bangladesh University of Engineering and Technology (BUET), Dhaka, Bangladesh

Submitted by
Syful Islam
Roll: 1015312038
Session: Oct2015
Institute of Information and Communication Technology (IICT)
Bangladesh University of Engineering and Technology (BUET), Dhaka, Bangladesh

Index

Problem
Abstract
1. Introduction
2. Overview of dataset
3. Data Preprocessing
4. Building Classification Model
5. Result and Comparison
6. Decision Tree of Model-2 (C4.5 Decision Tree)
7. Conclusion

Problem
Using Weka Data Mining software, we are required to:
1. Pre-process the dataset in order to select the 9 best attributes. Include in your report
screenshots showing the algorithm you have applied for pre-processing (include the
chosen parameter values if any).
2. On the dataset obtained at point (1), apply precisely 4 different classification
algorithms in order to produce 8 models (2 models per algorithm) that can be used to
automatically diagnose future patients. At least one of the produced models has to be a
decision tree. All the models will be learned and tested by splitting the dataset into a
training and a test dataset, consisting of 75% and 25% of the instances,
respectively. For each built model, include in your report the algorithm name and the
parameter values chosen for its application, the confusion matrix and the measures of
performance of the model (in particular the accuracy). Include a screenshot with one
decision tree that you obtained.
3. Calculate for each model the precision, sensitivity, specificity and the lift, for the class
sick.
4. Choose the best, the second best and the third best model from (2). Justify your
answer.
5. Mention three characteristics of sick people based on the decision tree built and
displayed in (2). List the production rule(s) that you have used to mention these
characteristics.

Abstract
This project aims to introduce data mining techniques using Weka, a well-known data
mining software package. A dataset named cardiology.arff is used. Four
classification algorithms are applied to the dataset to build 8 models by tuning the
parameters of the algorithms. All the models are learned and tested by splitting the
dataset into a training and a test dataset, consisting of 75% and 25% of the instances,
respectively. For each built model, the report includes the algorithm name, the parameter
values chosen for its application, the confusion matrix and the measures of performance
of the model (in particular the accuracy). For each model, the precision,
sensitivity, specificity and lift for the class sick are calculated. In addition, the best,
the second best and the third best models are chosen based on their performance. Finally,
production rules describing characteristics of sick people are derived from the decision tree.

1. Introduction
Weka (pronounced to rhyme with Mecca) is a workbench that contains a collection of
visualization tools and algorithms for data analysis and predictive modeling, together
with graphical user interfaces for easy access to these functions. Weka supports several
standard data mining tasks: data preprocessing, clustering,
classification, regression, visualization, and feature selection. In this project, I use the
cardiology dataset to mine the characteristics of sick and healthy people. Through the
project, I try to show how the software is used and how to extract the desired information
with it. In addition, I show how to find the accuracy and the misclassifications of a
model, how to decide which model is best, and how parameters can be tuned to improve
performance.

2. Overview of data set:


The given dataset cardiology.arff contains 14 attributes and 303 instances. Of the 303
records, 138 people are labelled sick and the remaining 165 are labelled healthy.

Attribute                   Data type   Values
age                         numeric
sex                         nominal     {Male, Female}
chest pain type             nominal     {Asymptomatic, Abnormal Angina, Angina, No Tang}
blood pressure              numeric
cholesterol                 numeric
Fasting blood sugar <120    nominal     {FALSE, TRUE}
resting ecg                 nominal     {Hyp, Normal, Abnormal}
maximum heart rate          numeric
angina                      nominal     {TRUE, FALSE}
peak                        numeric
slope                       nominal     {Flat, Up, Down}
#colored vessels            numeric
thal                        nominal     {Rev, Normal, Fix}
class                       nominal     {Sick, Healthy}

Table 1: All 14 attributes and their data type with values.

Figure 1: Graphical Representation of the relation between class attribute with other
attribute values.

3. Data Preprocessing
According to the problem, the 9 best of the 14 attributes contained in the dataset have to
be selected. Since one of the 14 is the class attribute, this means removing the 4
lowest-ranked of the 13 predictor attributes. Weka provides methods for this task. The
process of selecting the best attributes using Weka is shown below:
Evaluator: weka.attributeSelection.InfoGainAttributeEval
Search method: Attribute ranking
Parameters used: (not reproduced)
Attribute selection screenshot: (not reproduced)
Attributes after removing low-priority attributes: (screenshot not reproduced)
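To make the ranking criterion concrete, here is a minimal Python sketch of information gain for nominal attributes. It is an illustration only, not Weka's implementation: the helper names and the toy records are hypothetical and are not taken from cardiology.arff, and numeric attributes would need to be discretized before this calculation applies.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def info_gain(values, labels):
    """Class entropy minus the expected class entropy after splitting
    on a nominal attribute."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [c for x, c in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical toy records, NOT taken from cardiology.arff.
toy = {
    "angina": ["TRUE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE"],
    "sex":    ["Male", "Female", "Male", "Female", "Male", "Female"],
}
toy_class = ["Sick", "Sick", "Healthy", "Healthy", "Sick", "Healthy"]

# Rank attributes from most to least informative.
ranking = sorted(toy, key=lambda a: info_gain(toy[a], toy_class), reverse=True)
print(ranking)
```

In this toy data the attribute angina separates the two classes perfectly, so it ranks above sex; keeping the top-ranked attributes and dropping the rest is the idea behind the selection step.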

4. Building Classification Model


Four (4) classification algorithms have been chosen to produce 8 models (2 models per
algorithm) from our processed dataset.

S.L.  Algorithm name                  Weka class name
1     C4.5 decision tree              weka.classifiers.trees.J48
2     REPTree                         weka.classifiers.trees.REPTree
3     Multilayer Perceptron           weka.classifiers.functions.MultilayerPerceptron
4     K-nearest neighbors algorithm   weka.classifiers.lazy.IBk
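As a quick check on the 75% / 25% split, the sketch below computes the sizes of the two partitions for the 303 instances. The function name is hypothetical, and rounding the training share to the nearest whole instance is an assumption; it agrees with the 76 test instances (correct plus incorrect) reported for every model below.

```python
def percentage_split(num_instances, train_percent):
    """Return (train_size, test_size) for a percentage split.

    Assumption: the training share is rounded to the nearest integer.
    """
    train_size = round(num_instances * train_percent / 100)
    return train_size, num_instances - train_size


train_size, test_size = percentage_split(303, 75)
print(train_size, test_size)
```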

Algorithm 1: C4.5 decision tree
Model 1

Parameters:
    Confidence Factor = 0.25
    Reduced Error Pruning = False
    Unpruned = True

Performance measures:
    Correctly classified instances   = 67
    Incorrectly classified instances = 9
    Accuracy = 67 / (67 + 9) = 88.16%

Confusion matrix:
     a    b     classified as
    31    7  |  a = sick
     2   36  |  b = healthy

For sick people:
    Precision   = 31 / (31 + 2) = 0.9393
    Sensitivity = 31 / (31 + 7) = 0.8157
    Specificity = 36 / (36 + 2) = 0.9473

Using Weka: (output screenshot not reproduced)
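The measures reported for each model follow directly from the four cells of its confusion matrix. The Python sketch below works them out for Model 1. Two points are assumptions: the off-diagonal counts (7 sick patients classified as healthy, 2 healthy patients classified as sick) are inferred from the reported sensitivity, specificity and error count, and lift is taken as the precision for sick divided by the share of sick instances in the test set, which is one common definition and may not be the one the assignment intends. The function name is hypothetical.

```python
def sick_class_metrics(tp, fn, fp, tn):
    """Performance measures for the class 'sick'.

    tp: sick classified as sick        fn: sick classified as healthy
    fp: healthy classified as sick     tn: healthy classified as healthy
    """
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        # Assumed definition: precision relative to the base rate of 'sick'.
        "lift": precision / ((tp + fn) / total),
    }


# Model 1; fn=7 and fp=2 are inferred values, see the note above.
model_1 = sick_class_metrics(tp=31, fn=7, fp=2, tn=36)
for name, value in model_1.items():
    print(f"{name}: {value:.4f}")
```

Because the test set holds 38 sick and 38 healthy instances under this reading, the base rate of sick is 0.5, so the lift is simply twice the precision.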

Algorithm 1: C4.5 decision tree
Model 2

Parameters:
    Confidence Factor = 0.20
    Reduced Error Pruning = True
    Unpruned = False

Performance measures:
    Correctly classified instances   = 70
    Incorrectly classified instances = 6
    Accuracy = 70 / (70 + 6) = 92.11%

Confusion matrix:
     a    b     classified as
    34    4  |  a = sick
     2   36  |  b = healthy

For sick people:
    Precision   = 34 / (34 + 2) = 0.9444
    Sensitivity = 34 / (34 + 4) = 0.8947
    Specificity = 36 / (36 + 2) = 0.9473

Using Weka: (output screenshot not reproduced)


Algorithm 2: REPTree
Model 3

Parameters:
    No Pruning = False
    Num Folds = 3

Performance measures:
    Correctly classified instances   = 62
    Incorrectly classified instances = 14
    Accuracy = 62 / (62 + 14) = 81.57%

Confusion matrix:
     a    b     classified as
    25   13  |  a = sick
     1   37  |  b = healthy

For sick people:
    Precision   = 25 / (25 + 1) = 0.9615
    Sensitivity = 25 / (25 + 13) = 0.6578
    Specificity = 37 / (37 + 1) = 0.9736

Using Weka: (output screenshot not reproduced)


Algorithm 2: REPTree
Model 4

Parameters:
    No Pruning = True
    Num Folds = 2

Performance measures:
    Correctly classified instances   = 64
    Incorrectly classified instances = 12
    Accuracy = 64 / (64 + 12) = 84.21%

Confusion matrix:
     a    b     classified as
    31    7  |  a = sick
     5   33  |  b = healthy

For sick people:
    Precision   = 31 / (31 + 5) = 0.8611
    Sensitivity = 31 / (31 + 7) = 0.8157
    Specificity = 33 / (33 + 5) = 0.8684

Using Weka: (output screenshot not reproduced)


Algorithm 3: Multilayer Perceptron
Model 5

Parameters:
    Hidden layer = a
    Learning rate = 0.3
    Momentum = 0.2

Performance measures:
    Correctly classified instances   = 66
    Incorrectly classified instances = 10
    Accuracy = 66 / (66 + 10) = 86.84%

Confusion matrix:
     a    b     classified as
    30    8  |  a = sick
     2   36  |  b = healthy

For sick people:
    Precision   = 30 / (30 + 2) = 0.9375
    Sensitivity = 30 / (30 + 8) = 0.7894
    Specificity = 36 / (36 + 2) = 0.9473

Using Weka: (output screenshot not reproduced)


Algorithm 3: Multilayer Perceptron
Model 6

Parameters:
    Hidden layer = o
    Learning rate = 0.2
    Momentum = 0.3

Performance measures:
    Correctly classified instances   = 68
    Incorrectly classified instances = 8
    Accuracy = 68 / (68 + 8) = 89.47%

Confusion matrix:
     a    b     classified as
    34    4  |  a = sick
     4   34  |  b = healthy

For sick people:
    Precision   = 34 / (34 + 4) = 0.8947
    Sensitivity = 34 / (34 + 4) = 0.8947
    Specificity = 34 / (34 + 4) = 0.8947

Using Weka: (output screenshot not reproduced)


Algorithm 4: K-nearest neighbors classifier
Model 7

Parameters:
    Number of nearest neighbors = 1
    Nearest neighbor searching algorithm = LinearNNSearch
    Mean Squared = False

Performance measures:
    Correctly classified instances   = 64
    Incorrectly classified instances = 12
    Accuracy = 64 / (64 + 12) = 84.21%

Confusion matrix:
     a    b     classified as
    31    7  |  a = sick
     5   33  |  b = healthy

For sick people:
    Precision   = 31 / (31 + 5) = 0.8611
    Sensitivity = 31 / (31 + 7) = 0.8157
    Specificity = 33 / (33 + 5) = 0.8684

Using Weka: (output screenshot not reproduced)


Algorithm 4: K-nearest neighbors classifier
Model 8

Parameters:
    Number of nearest neighbors = 5
    Nearest neighbor searching algorithm = LinearNNSearch
    Mean Squared = True

Performance measures:
    Correctly classified instances   = 68
    Incorrectly classified instances = 8
    Accuracy = 68 / (68 + 8) = 89.47%

Confusion matrix:
     a    b     classified as
    31    7  |  a = sick
     1   37  |  b = healthy

For sick people:
    Precision   = 31 / (31 + 1) = 0.9687
    Sensitivity = 31 / (31 + 7) = 0.8157
    Specificity = 37 / (37 + 1) = 0.9736

Using Weka: (output screenshot not reproduced)


5. Result and Comparison:


SI  Model    Algorithm                       Accuracy  Rank
1   Model-1  C4.5 decision tree              88.16%    4th
2   Model-2  C4.5 decision tree              92.11%    1st
3   Model-3  REPTree                         81.57%    7th
4   Model-4  REPTree                         84.21%    6th
5   Model-5  Multilayer Perceptron           86.84%    5th
6   Model-6  Multilayer Perceptron           89.47%    3rd
7   Model-7  K-nearest neighbors algorithm   84.21%    6th
8   Model-8  K-nearest neighbors algorithm   89.47%    2nd

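From the table above we can state that, based on accuracy, Model-2 (C4.5 decision tree)
is the best model, Model-8 (K-nearest neighbors algorithm) is the second best, Model-6
(Multilayer Perceptron) is the third best, and Model-1 (C4.5 decision tree) is the fourth
best. Model-8 and Model-6 classify the same number of test instances correctly (68 of 76
each); of the two, Model-8 has the higher precision (0.9687 against 0.8947) and
specificity (0.9736 against 0.8947) for the class sick.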
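The ordering can be reproduced with a short Python sketch that sorts the eight models by the number of correctly classified test instances. Breaking ties by the precision for the class sick is an assumption made here for illustration; the report itself states only that the ranking is based on accuracy. The precision for Model-5 is entered as 30/32 = 0.9375, and the names `results`, `TEST_SIZE` and `rank_models` are hypothetical.

```python
# (correctly classified test instances, precision for class sick) per model
results = {
    "Model-1": (67, 0.9393),
    "Model-2": (70, 0.9444),
    "Model-3": (62, 0.9615),
    "Model-4": (64, 0.8611),
    "Model-5": (66, 0.9375),
    "Model-6": (68, 0.8947),
    "Model-7": (64, 0.8611),
    "Model-8": (68, 0.9687),
}
TEST_SIZE = 76  # correct + incorrect instances reported for every model


def rank_models(results):
    """Sort by correct count, then by precision (the tie-break is an assumption)."""
    return sorted(results, key=lambda m: results[m], reverse=True)


ranked = rank_models(results)
for place, model in enumerate(ranked, start=1):
    correct, precision = results[model]
    print(f"{place}. {model}: accuracy {correct / TEST_SIZE:.2%}, "
          f"precision {precision:.4f}")
```

Model-4 and Model-7 have identical counts and precision, so no tie-break on these two measures separates them, which matches their shared rank in the table.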


6. Decision Tree of Model-2(C4.5 Decision Tree)


The following tree is generated from Model-2. We prefer this model because, using the
C4.5 decision tree algorithm, it shows the highest accuracy on our dataset.

Decision tree (Model-2): (figure not reproduced)

Production rules:

R1: (#colored vessels > 0) ^ (chest pain type = NoTang) ^ (thal = Fix) -> Sick
R2: (#colored vessels > 0) ^ (chest pain type = NoTang) ^ (thal = Rev) -> Sick
R3: (#colored vessels > 0) ^ (chest pain type = Angina) -> Sick
R4: (#colored vessels > 0) ^ (chest pain type = Abnormal_Angina) ^ (sex = Male) -> Sick
R5: (#colored vessels <= 0) ^ (angina = True) ^ (thal = Fix) -> Sick
R6: (#colored vessels <= 0) ^ (angina = True) ^ (thal = Rev) -> Sick
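The production rules can be written as predicates over a patient record, which makes it easy to check which rule, if any, a given patient satisfies. The Python sketch below numbers the six rules R1 to R6 in order of appearance. The dictionary keys (`vessels`, `chest_pain`, `thal`, `sex`, `angina`) and the function name are hypothetical, and the example records are invented. The rules cover only some paths to Sick, so a result of `None` means that no listed rule applies, not that the patient is healthy.

```python
RULES = [
    ("R1", lambda r: r["vessels"] > 0 and r["chest_pain"] == "NoTang"
                     and r["thal"] == "Fix"),
    ("R2", lambda r: r["vessels"] > 0 and r["chest_pain"] == "NoTang"
                     and r["thal"] == "Rev"),
    ("R3", lambda r: r["vessels"] > 0 and r["chest_pain"] == "Angina"),
    ("R4", lambda r: r["vessels"] > 0 and r["chest_pain"] == "Abnormal_Angina"
                     and r["sex"] == "Male"),
    ("R5", lambda r: r["vessels"] <= 0 and r["angina"] == "TRUE"
                     and r["thal"] == "Fix"),
    ("R6", lambda r: r["vessels"] <= 0 and r["angina"] == "TRUE"
                     and r["thal"] == "Rev"),
]


def matching_rule(record):
    """Return the name of the first listed rule that predicts Sick, else None."""
    for name, condition in RULES:
        if condition(record):
            return name
    return None


# Invented example records, not rows from cardiology.arff.
patient_a = {"vessels": 1, "chest_pain": "Angina", "thal": "Normal",
             "sex": "Female", "angina": "FALSE"}
patient_b = {"vessels": 0, "chest_pain": "Asymptomatic", "thal": "Rev",
             "sex": "Male", "angina": "TRUE"}
print(matching_rule(patient_a), matching_rule(patient_b))
```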


7. Conclusion
In this project, I have tried to show how to carry out data mining operations using the
Weka software. In the first step, the data is preprocessed using the Info Gain attribute
evaluator and the Ranker search method to select the 9 best attributes. Then four
classification algorithms are applied to the processed dataset, and their parameters are
tuned to build 8 models. The models are compared by accuracy to identify the first,
second, third and fourth best. Finally, some production rules describing sick people are
derived from the decision tree.
