Submitted to
Dr. Hossen Asiful Mustafa
PhD, University of South Carolina, USA
Assistant Professor, Institute of Information and Communication Technology (IICT)
Bangladesh University of Engineering and Technology (BUET), Dhaka, Bangladesh
Submitted by
Syful Islam
Roll: 1015312038
Session: Oct2015
Institute of Information and Communication Technology (IICT)
Bangladesh University of Engineering and Technology (BUET), Dhaka, Bangladesh
Index
Problem
Abstract
1. Introduction
2. Overview of dataset
3. Data Preprocessing
7. Conclusion
Problem
Using Weka Data Mining software, we are required to:
1. Pre-process the dataset in order to select the 9 best attributes. Include in your report
screenshots showing the algorithm you have applied for pre-processing (include the
chosen parameter values if any).
2. On the dataset obtained at point (1), apply precisely 4 different classification
algorithms in order to produce 8 models (2 models per algorithm), that can be used to
automatically diagnose future patients. At least one of the produced models has to be a
decision tree. All the models will be learned and tested by splitting the dataset into a
training and a test dataset, consisting of 75% and 25% of the instances,
respectively. For each built model, include in your report the algorithm name and the
parameter values chosen for its application, and the confusion matrix and the measures of
performance of the model (in particular the accuracy). Include a screenshot with one
decision tree that you obtained.
3. Calculate for each model the precision, sensitivity, specificity and the lift, for the class
sick.
4. Choose the best, the second best and the third best model from (2). Justify your
answer.
5. Mention three characteristics of sick people based on the decision tree built and
displayed in (2). List the production rule(s) that you have used to mention these
characteristics.
Abstract
This project introduces data mining techniques using Weka, a widely used data mining
software package. The dataset used is cardiology.arff. Four classification algorithms are
applied to the dataset to build 8 models by tuning the parameters of the algorithms. All
the models are trained and tested by splitting the dataset into a training and a test
dataset, consisting of 75% and 25% of the instances, respectively. For each model, the
report gives the algorithm name, the chosen parameter values, the confusion matrix and
the measures of performance (in particular the accuracy). For each model, the precision,
sensitivity, specificity and lift for the class sick are calculated. In addition, the best,
the second best and the third best models are chosen based on their performance. Finally,
production rules describing three characteristics of sick people are derived from the
decision tree.
1. Introduction
Weka (pronounced to rhyme with Mecca) is a workbench that contains a collection of
visualization tools and algorithms for data analysis and predictive modeling, together
with graphical user interfaces for easy access to these functions. Weka supports several
standard data mining tasks: data preprocessing, clustering, classification, regression,
visualization, and feature selection. In this project, I use the cardiology dataset to mine
the characteristics of sick and healthy people. Through the project, I show how to use
the software and how to extract the desired information with it. In addition, I show how
to measure the accuracy and the misclassifications of each model, how to decide which
model is best, and how parameters can be tuned to improve performance.
2. Overview of dataset
Attributes          Data Type   Value
age                 numeric
sex                 nominal     {Male, Female}
(name missing)      nominal
blood pressure      numeric
cholesterol         numeric
(name missing)      nominal     {FALSE, TRUE}
resting ecg         nominal
(name missing)      numeric
(name missing)      nominal     {TRUE, FALSE}
(name missing)      numeric
(name missing)      nominal
#colored vessels    numeric
thal                nominal
class               nominal     {Sick, Healthy}
Figure 1: Graphical representation of the relation between the class attribute and the
other attribute values.
3. Data Preprocessing
According to the problem, the 9 best attributes have to be selected. The dataset contains
14 attributes including the class attribute, so the 4 lowest-ranked of the 13 predictor
attributes have to be removed. Weka provides attribute selection methods that make this
task easy. The process of selecting the best attributes using Weka is shown below:
Evaluator: weka.attributeSelection.InfoGainAttributeEval
Search method: Attribute ranking
Parameters used: [screenshot]
Attribute selection screenshot: [screenshot]
Attributes after removing low priority attributes: [screenshot]
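The ranking idea behind an information gain evaluator can be illustrated outside Weka. The following is a minimal Python sketch, not Weka's implementation: it handles nominal attributes only, and the function names and the toy data are hypothetical and not taken from cardiology.arff.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def info_gain(values, labels):
    """Information gain of one nominal attribute with respect to the class."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Expected class entropy after splitting on the attribute.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_attributes(columns, labels, keep):
    """Return the `keep` attribute names with the highest information gain."""
    gains = {name: info_gain(col, labels) for name, col in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)[:keep]


# Hypothetical toy data, used only to exercise the functions.
labels = ["Sick", "Sick", "Healthy", "Healthy"]
columns = {
    "perfect": ["a", "a", "b", "b"],  # separates the two classes exactly
    "useless": ["x", "y", "x", "y"],  # carries no information about the class
}
print(rank_attributes(columns, labels, keep=1))
```

Keeping the 9 highest-ranked attributes corresponds to calling `rank_attributes` with `keep=9` on the predictor columns.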
Algorithm name
1. C4.5 decision tree
2. REPTree
3. Multilayer Perceptron
4. K-nearest neighbors algorithm
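Each model in this report is trained on 75% of the instances and tested on the remaining 25%. A minimal Python sketch of such a percentage split is given here. It assumes the instances are shuffled first and that the training size is rounded to the nearest integer. The instance count of 303 is an assumption, chosen because it yields a 76-instance test set, which matches the totals implied by the reported accuracies (for example 67 correct at 88.16%).

```python
import random


def percentage_split(instances, train_percent=75, seed=1):
    """Shuffle the instances and split them into (train, test) lists."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    train_size = round(len(shuffled) * train_percent / 100)
    return shuffled[:train_size], shuffled[train_size:]


# 303 is an assumed dataset size, not a figure stated in the report.
train, test = percentage_split(range(303))
print(len(train), len(test))  # prints "227 76"
```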
Algorithm 1: C4.5 decision tree
Model: 1
Parameters: Confidence Factor = 0.25, Reduced Error Pruning = False, Unpruned = True
Correctly classified instances = 67
Performance measure (accuracy) = 88.16%
Confusion matrix:
  a   b   <-- classified as
 31   7 | a = sick
  2  36 | b = healthy
For sick people:
Precision = 31/33 = 0.9393
Sensitivity = 31/38 = 0.8157
Specificity = 36/38 = 0.9473
Using Weka: [screenshot]
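The measures reported for each model follow directly from the four cells of the confusion matrix. The sketch below recomputes them for Model 1 in Python. The counts are inferred from the reported figures: 31 sick and 36 healthy instances classified correctly, with 38 test instances in each class, which gives 7 false negatives and 2 false positives. The lift is computed under one common definition, precision divided by the proportion of sick instances in the test set; that definition is an assumption, as the report does not state which one it uses.

```python
def sick_class_metrics(tp, fn, fp, tn):
    """Performance measures for the positive class (here: sick).

    tp: sick classified as sick        fn: sick classified as healthy
    fp: healthy classified as sick     tn: healthy classified as healthy
    """
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / total
    # Assumed definition: precision relative to the prior rate of the class.
    lift = precision / ((tp + fn) / total)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lift": lift,
    }


# Counts inferred for Model 1 from the reported measures.
model_1 = sick_class_metrics(tp=31, fn=7, fp=2, tn=36)
for name, value in model_1.items():
    print(f"{name}: {value:.4f}")
```

The printed values are rounded to four decimals, so they can differ in the last digit from the figures in the report, which are truncated (0.9394 against 0.9393 for the precision).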
Algorithm 1: C4.5 decision tree
Model: 2
Parameters: Confidence Factor = 0.20, Reduced Error Pruning = True, Unpruned = False
Correctly classified instances = 70
Performance measure (accuracy) = 92.11%
Confusion matrix:
  a   b   <-- classified as
 34   4 | a = sick
  2  36 | b = healthy
For sick people:
Precision = 34/36 = 0.9444
Sensitivity = 34/38 = 0.8947
Specificity = 36/38 = 0.9473
Using Weka: [screenshot]
Algorithm 2: REPTree
Model: 3
Parameters: No Pruning = False, Num fold = 3
Correctly classified instances = 62
Performance measure (accuracy) = 81.58%
Confusion matrix:
  a   b   <-- classified as
 25  13 | a = sick
  1  37 | b = healthy
For sick people:
Precision = 25/26 = 0.9615
Sensitivity = 25/38 = 0.6578
Specificity = 37/38 = 0.9736
Using Weka: [screenshot]
Algorithm 2: REPTree
Model: 4
Parameters: No Pruning = True, Num fold = 2
Correctly classified instances = 64
Performance measure (accuracy) = 84.21%
Confusion matrix:
  a   b   <-- classified as
 31   7 | a = sick
  5  33 | b = healthy
For sick people:
Precision = 31/36 = 0.8611
Sensitivity = 31/38 = 0.8157
Specificity = 33/38 = 0.8684
Using Weka: [screenshot]
Algorithm 3: Multilayer Perceptron
Model: 5
Parameters: Hidden layer = a, Learning rate = 0.3, Momentum = 0.2
Correctly classified instances = 66
Performance measure (accuracy) = 86.84%
Confusion matrix:
  a   b   <-- classified as
 30   8 | a = sick
  2  36 | b = healthy
For sick people:
Precision = 30/32 = 0.9375
Sensitivity = 30/38 = 0.7894
Specificity = 36/38 = 0.9473
Using Weka: [screenshot]
Algorithm 3: Multilayer Perceptron
Model: 6
Parameters: Hidden layer = o, Learning rate = 0.2, Momentum = 0.3
Correctly classified instances = 68
Performance measure (accuracy) = 89.47%
Confusion matrix:
  a   b   <-- classified as
 34   4 | a = sick
  4  34 | b = healthy
For sick people:
Precision = 34/38 = 0.8947
Sensitivity = 34/38 = 0.8947
Specificity = 34/38 = 0.8947
Using Weka: [screenshot]
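Algorithm 4: K-nearest neighbors algorithm
Model: 7
Parameters: [not shown]
Correctly classified instances = 64
Performance measure (accuracy) = 84.21%
Confusion matrix:
  a   b   <-- classified as
 31   7 | a = sick
  5  33 | b = healthy
For sick people:
Precision = 31/36 = 0.8611
Sensitivity = 31/38 = 0.8157
Specificity = 33/38 = 0.8684
Using Weka: [screenshot]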
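Algorithm 4: K-nearest neighbors algorithm
Model: 8
Parameters chosen: [not shown]
Correctly classified instances = 68
Performance measure (accuracy) = 89.47%
Confusion matrix:
  a   b   <-- classified as
 31   7 | a = sick
  1  37 | b = healthy
For sick people:
Precision = 31/32 = 0.9687
Sensitivity = 31/38 = 0.8157
Specificity = 37/38 = 0.9736
Using Weka: [screenshot]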
Model     Algorithm                        Accuracy   Rank
Model-1   C4.5 decision tree               88.16%     4th
Model-2   C4.5 decision tree               92.11%     1st
Model-3   REPTree                          81.58%     7th
Model-4   REPTree                          84.21%     6th
Model-5   Multilayer Perceptron            86.84%     5th
Model-6   Multilayer Perceptron            89.47%     3rd
Model-7   K-nearest neighbors algorithm    84.21%     6th
Model-8   K-nearest neighbors algorithm    89.47%     2nd
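From the table above, Model-2 (C4.5) is the best model, with an accuracy of 92.11%.
Model-8 (K-nearest neighbors algorithm) and Model-6 (Multilayer Perceptron) tie on
accuracy: each classifies 68 test instances correctly (89.47%). Model-8 is placed second,
as it has the higher precision (0.9687 against 0.8947) and specificity (0.9736 against
0.8947) for the class sick. Model-6, which has the higher sensitivity (0.8947 against
0.8157), is placed third. Model-1 (C4.5 decision tree) is the fourth best model based on
accuracy.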
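The accuracy ranking can be reproduced from the number of correctly classified instances reported for each model. The Python sketch below assumes a test set of 76 instances, the size implied by 67 correct at 88.16%, and uses a dense ranking in which tied models share a rank. It therefore places Model-6 and Model-8 together at rank 2 and does not break the tie by a second measure.

```python
# Correctly classified test instances, as reported for each model.
CORRECT = {
    "Model-1": 67, "Model-2": 70, "Model-3": 62, "Model-4": 64,
    "Model-5": 66, "Model-6": 68, "Model-7": 64, "Model-8": 68,
}
TEST_SIZE = 76  # assumed: implied by the reported accuracies


def rank_by_accuracy(correct, test_size):
    """Return (rank, model, accuracy %) tuples, best first, ties sharing a rank."""
    ordered = sorted(correct.items(), key=lambda item: item[1], reverse=True)
    ranking = []
    rank = 0
    previous = None
    for name, hits in ordered:
        if hits != previous:  # a new accuracy level starts the next rank
            rank += 1
            previous = hits
        ranking.append((rank, name, round(100 * hits / test_size, 2)))
    return ranking


ranking = rank_by_accuracy(CORRECT, TEST_SIZE)
for rank, name, accuracy in ranking:
    print(rank, name, f"{accuracy:.2f}%")
```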
Production Rule
7. Conclusion
In this project, I have shown how to carry out a data mining task with the Weka
software. First, the data was preprocessed with the Info Gain attribute evaluator and the
Ranker search method to select the best 9 attributes. Several classification algorithms
were then applied to the processed dataset, and their parameters were tuned to build 8
models. The models were compared by accuracy to identify the first, second, third and
fourth best. Finally, production rules for sick people based on three characteristics were
formed.