
INTRODUCTION:

The health care industry is one of the world's largest and fastest growing industries, and it generates huge amounts of healthcare data. This data includes relevant information about patients, their treatments, and resource management. The information is rich and massive, and hidden relationships and trends in healthcare data can be discovered by applying data mining techniques, which have proven effective in healthcare research. In this project we aimed to analyze several data mining classification techniques over healthcare datasets using the WEKA machine learning tools. In this study, the different classification techniques are tested on a diagnostic dataset for diabetes.
SIGNIFICANCE OF DATA MINING IN HEALTHCARE:
Healthcare organizations across the world generally store healthcare data in electronic format. This data mainly contains information about patients as well as about the parties involved in the healthcare industry, and its volume is increasing very rapidly. Due to this continuous growth, electronic healthcare data has become very complex, and it is very difficult to extract meaningful information from it using traditional methods. Thanks to advances in statistics, mathematics, and other disciplines, it is now possible to extract meaningful patterns from it. Data mining is especially beneficial in such a situation, where large collections of healthcare data are available.
Data mining extracts meaningful patterns that were previously unknown. These patterns can then be integrated into knowledge, and with the help of this knowledge essential decisions become possible. Data mining provides a number of benefits: it plays a very important role in the detection of fraud and abuse, provides better medical treatment at a reasonable price, enables detection of diseases at early stages, and supports intelligent healthcare decision support systems. Data mining techniques are very useful in the healthcare domain; they provide better medical services to patients and help healthcare organizations with various medical management decisions. Some of the services data mining supports in healthcare are: predicting the number of days of stay in a hospital, ranking hospitals, identifying more effective treatments, detecting fraudulent insurance claims by patients as well as by providers, predicting readmission of patients, identifying better treatment methods for a particular group of patients, and constructing effective drug recommendation systems. For all these reasons, researchers are greatly influenced by the capabilities of data mining and have used its techniques widely in the healthcare field. There are various data mining techniques, among them classification, clustering, and regression. Every piece of medical information related to patients and to healthcare organizations is useful, and data mining, as a powerful tool, plays a very important role in the healthcare industry. Recently, researchers have used data mining tools in distributed medical environments to provide better medical services to a large proportion of the population at very low cost, better customer relationship management, better management of healthcare resources, and so on. Data mining provides meaningful information in the field of healthcare which management can then use to take decisions such as estimation of medical staff, decisions regarding health insurance policy, selection of treatments, and disease prediction. Researchers have also proposed new data mining methodologies and frameworks that deal with the issues and challenges of data mining in healthcare, improve its results, and improve the healthcare system.
DATA MINING CLASSIFICATION TECHNIQUES:
The healthcare industry is information rich yet knowledge poor, so data-driven statistical research has become a complement to healthcare research. With the use of computers and automated tools, large volumes of healthcare data are being collected and made available to medical research groups. As a result, Knowledge Discovery in Databases (KDD), which includes data mining techniques, has become a popular research tool for healthcare researchers to identify and exploit patterns and relationships among large numbers of variables, and to predict the outcome of a disease using the historical cases stored within datasets. In this project, we applied various data mining classification techniques to healthcare data.

Classification is one of the most popular data mining methods in the healthcare sector. It divides data samples into target classes: the classification technique predicts the target class for each data point. With the help of the classification approach, a risk factor can be associated with patients by analyzing their patterns of disease. Classification is a supervised learning approach with known class categories. Binary and multiclass are the two main settings: in binary classification only two possible classes are considered (such as high or low risk patient), while the multiclass approach has more than two targets (for example, high, medium, and low risk patient). The dataset is partitioned into a training set and a testing set, and the task consists of predicting a certain outcome based on a given input. The training set consists of instances with a set of attributes together with the outcome to be predicted; the algorithm attempts to discover relationships between the attributes that predict the outcome, called the goal or prediction attribute. The prediction (testing) set consists of the same set of attributes, but there the prediction attribute is not yet known; the algorithm analyzes the input in order to produce the prediction. The accuracy of these predictions defines how good the algorithm is.
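As a rough illustration of this training/testing partition (a minimal sketch, not the project's exact code; the file name diabetes.arff is an assumption), the WEKA Java API can be used like this:

    import java.util.Random;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TrainTestSplit {
        public static void main(String[] args) throws Exception {
            // Load the dataset (path is hypothetical).
            Instances data = DataSource.read("diabetes.arff");
            // The last attribute is the class (goal) attribute.
            data.setClassIndex(data.numAttributes() - 1);

            // Shuffle, then hold out one third of the instances for testing.
            data.randomize(new Random(1));
            int trainSize = (int) Math.round(data.numInstances() * 0.66);
            Instances train = new Instances(data, 0, trainSize);
            Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

            System.out.println("Training instances: " + train.numInstances());
            System.out.println("Testing instances:  " + test.numInstances());
        }
    }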
DATABASE AND TOOLS USED IN PROJECT:
We used the PIMA Indian Diabetes dataset, taken from the UCI Machine Learning Repository, in WEKA. The WEKA machine learning tools are used to handle the classification problems. This study will help researchers determine better results from the data available within such datasets.
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can
either be applied directly to a dataset or called from your own Java code. Weka contains tools for
data pre-processing, classification, regression, clustering, association rules, and visualization. It
is also well-suited for developing new machine learning schemes. Found only on the islands of
New Zealand, the Weka is a flightless bird with an inquisitive nature. Weka is open source
software issued under the GNU General Public License.

WEKA:
WEKA is a data mining system developed by the University of Waikato in New Zealand that implements data mining algorithms in the Java language. WEKA is a state-of-the-art facility for developing machine learning (ML) techniques and applying them to real-world data mining problems. It is a collection of machine learning algorithms for data mining tasks, and the algorithms are applied directly to a dataset. WEKA implements algorithms for data preprocessing, classification, regression, clustering, and association rules; it also includes visualization tools. New machine learning schemes or algorithms can also be developed with this package. WEKA is open source software issued under the GNU General Public License. The data file normally used by WEKA is in the ARFF file format, which consists of special tags to indicate different things in the data file (foremost: attribute names, attribute types, attribute values, and the data). The main interface in WEKA is the Explorer. It has a set of panels, each of which can be used to perform a certain task. Once a dataset has been loaded, the other panels in the Explorer can be used to perform further analysis. Advantages of Weka include:
free availability under the GNU General Public License
portability, since it is fully implemented in the Java programming language and thus runs on almost any modern computing platform
a comprehensive collection of data preprocessing and modeling techniques
ease of use due to its graphical user interfaces
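For reference, the ARFF format mentioned above is plain text with @relation, @attribute, and @data tags; a minimal sketch of a two-attribute file (invented values, not the actual PIMA file) looks like this:

    @relation diabetes-sketch

    @attribute plas numeric
    @attribute age  numeric
    @attribute class {tested_negative, tested_positive}

    @data
    148, 50, tested_positive
    85,  31, tested_negative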
Weka's main user interface is the Explorer, but essentially the same functionality can be accessed
through the component-based Knowledge Flow interface and from the command line. There is
also the Experimenter, which allows the systematic comparison of the predictive performance of
Weka's machine learning algorithms on a collection of datasets.
The Explorer interface features several panels providing access to the main components of the
workbench:
The Preprocess panel has facilities for importing data from a database, a CSV file, etc.,
and for preprocessing this data using a so-called filtering algorithm. These filters can be
used to transform the data (e.g., turning numeric attributes into discrete ones) and make it
possible to delete instances and attributes according to specific criteria.
The Classify panel enables the user to apply classification and regression algorithms
(indiscriminately called classifiers in Weka) to the resulting dataset, to estimate the
accuracy of the resulting predictive model, and to visualize erroneous predictions, ROC
curves, etc., or the model itself (if the model is amenable to visualization like, e.g., a
decision tree).
The Associate panel provides access to association rule learners that attempt to identify
all important interrelationships between attributes in the data.

The Cluster panel gives access to the clustering techniques in Weka, e.g., the simple k-means algorithm. There is also an implementation of the expectation-maximization algorithm for learning a mixture of normal distributions.
The Select attributes panel provides algorithms for identifying the most predictive
attributes in a dataset.
The Visualize panel shows a scatter plot matrix, where individual scatter plots can be
selected and enlarged, and analyzed further using various selection operators.

CLASSIFICATION ALGORITHMS USED IN THE PROJECT:

1) NAÏVE BAYES

Bayesian classification is a probabilistic learning method built on Bayes' theorem of statistics. In the medical domain, attributes such as patient symptoms and health state are in fact correlated with each other, but the Naïve Bayes classifier assumes that all attributes are independent of each other; this is its major disadvantage. When the attributes really are independent, the Naïve Bayes classifier shows great performance in terms of accuracy. Bayesian classifiers play very important roles in the healthcare field, and researchers across the world have used them; among their advantages are that they keep the computation process simple and that they offer better speed and accuracy on huge datasets.

The Naïve Bayes classifier is a simple probabilistic classifier based on the assumption of mutual independence of the attributes: the variables provided to the classifier are treated as independent. The probabilities applied in the Naïve Bayes algorithm are calculated using Bayes' rule [11]: the probability of a hypothesis H given the evidence E about that hypothesis is

P(H | E) = P(E | H) P(H) / P(E)

The Naïve Bayes method works effectively in various real-world situations.
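As a sketch of how such a classifier might be run through WEKA's Java API (illustrative, with a hypothetical file name; 10-fold cross-validation chosen here as an example protocol):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class NaiveBayesExample {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("diabetes.arff"); // hypothetical path
            data.setClassIndex(data.numAttributes() - 1);

            // 10-fold cross-validation of the Naive Bayes classifier.
            NaiveBayes nb = new NaiveBayes();
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(nb, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }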


2) ZERO R CLASSIFIER:
ZeroR is the simplest classification method which relies on the target and ignores all
predictors. ZeroR classifier simply predicts the majority category (class). Although there
is no predictability power in ZeroR, it is useful for determining a baseline performance as
a benchmark for other classification methods.
In WEKA, ZeroR is a simple, trivial classifier, but it gives a lower bound on the performance on a given dataset which more complex classifiers should significantly improve upon. As such it is a reasonable test of how well the class can be predicted without considering the other attributes, and it can be used as a lower bound on performance. Any learning algorithm in WEKA is derived from the abstract WEKA classifier class. Given below is the flow chart of the ZeroR algorithm.
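Conceptually, ZeroR just counts class frequencies and always predicts the majority class; a minimal illustrative sketch (not the project's code):

    import java.util.HashMap;
    import java.util.Map;

    public class ZeroRSketch {
        // Returns the majority class label, which ZeroR predicts for every instance.
        static String majorityClass(String[] classLabels) {
            Map<String, Integer> counts = new HashMap<>();
            for (String label : classLabels) {
                counts.merge(label, 1, Integer::sum);
            }
            String best = null;
            int bestCount = -1;
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                if (e.getValue() > bestCount) {
                    bestCount = e.getValue();
                    best = e.getKey();
                }
            }
            return best;
        }

        public static void main(String[] args) {
            // On the PIMA class distribution (500 negative, 268 positive),
            // ZeroR predicts "tested_negative" for everyone, so its accuracy
            // is 500/768, about 65.1% -- the baseline to beat.
            String[] y = {"tested_negative", "tested_negative", "tested_positive"};
            System.out.println(majorityClass(y)); // tested_negative
        }
    }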

3) ONE R CLASSIFIER:
OneR, short for "One Rule", is a simple, yet accurate, classification algorithm that
generates one rule for each predictor in the data, and then selects the rule with the
smallest total error as its "one rule". To create a rule for a predictor, we have to construct
a frequency table for each predictor against the target. OneR Algorithm for each
predictor, for each value of that predictor, make rule as follows Count how often each value of target (class) appears
Find the most frequent class
Make the rule assign that class to this value of the predictors
Calculate the total error of the rules of each predictor
Choose the predictor with the smallest total error.
Find the best predictor which possess the smallest total error using OneR
algorithm
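As a rough sketch of this frequency-table procedure (illustrative only; it assumes nominal predictors already encoded as integer value indices):

    public class OneRSketch {
        // X[i][j] = value index of predictor j for instance i; y[i] = class index.
        // Returns the index of the predictor with the smallest total error.
        static int bestPredictor(int[][] X, int[] y, int numValues, int numClasses) {
            int n = X.length, p = X[0].length;
            int best = -1, bestErrors = Integer.MAX_VALUE;
            for (int j = 0; j < p; j++) {
                // Frequency table: counts[v][c] = how often class c occurs with value v.
                int[][] counts = new int[numValues][numClasses];
                for (int i = 0; i < n; i++) {
                    counts[X[i][j]][y[i]]++;
                }
                // Each value predicts its most frequent class; everything else is an error.
                int errors = 0;
                for (int v = 0; v < numValues; v++) {
                    int total = 0, max = 0;
                    for (int c = 0; c < numClasses; c++) {
                        total += counts[v][c];
                        max = Math.max(max, counts[v][c]);
                    }
                    errors += total - max;
                }
                if (errors < bestErrors) {
                    bestErrors = errors;
                    best = j;
                }
            }
            return best;
        }
    }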

4) J48 DECISION TREE (C4.5 in WEKA):


A decision tree partitions the input space of a data set into mutually exclusive regions,
each of which is assigned a label, a value or an action to characterize its data points. The
decision tree mechanism is transparent and we can follow a tree structure easily to see
how the decision is made. A decision tree is a tree structure consisting of internal and
external nodes connected by branches. An internal node is a decision making unit that
evaluates a decision function to determine which child node to visit next. The external
node, on the other hand, has no child nodes and is associated with a label or value that
characterizes the given data that leads to its being visited. However, many decision tree construction algorithms involve a two-step process. First, a very large decision tree is grown. Then, to reduce its size and avoid overfitting the data, the tree is pruned in the second step. The pruned decision tree that is used for classification purposes is called the classification tree. To build a decision tree, we need to calculate entropy and information gain:

E(S) = - Σi pi log2(pi)

where pi is the proportion of instances in S belonging to class i. The information gain depends on the decrease in entropy after a dataset is split on a selected attribute; constructing a decision tree means finding, at each step, the attribute with the highest information gain value:

Gain(T, X) = Entropy(T) - Entropy(T, X)

where Entropy(T, X) is the weighted average entropy of the partitions of T induced by splitting on attribute X.
Algorithm: Generate decision tree. Generate a decision tree from the training tuples of data partition D.
Input: Data partition D, which is a set of training tuples and their associated class labels; attribute list, the set of candidate attributes; and Attribute selection method, a procedure to determine the splitting criterion that best partitions the data tuples into individual classes. This criterion consists of a splitting attribute and, possibly, either a split point or a splitting subset.

Figure: Decision Tree Algorithm
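To make the entropy and gain formulas concrete, here is a small illustrative sketch (not tied to any particular library):

    public class EntropySketch {
        // Entropy of a class distribution given as counts per class.
        static double entropy(int[] classCounts) {
            int total = 0;
            for (int c : classCounts) total += c;
            double e = 0.0;
            for (int c : classCounts) {
                if (c == 0) continue;
                double p = (double) c / total;
                e -= p * (Math.log(p) / Math.log(2)); // log base 2
            }
            return e;
        }

        // Gain(T, X) = Entropy(T) - weighted sum of entropies of the
        // partitions of T produced by splitting on attribute X.
        static double gain(int[] parentCounts, int[][] childCounts) {
            int total = 0;
            for (int c : parentCounts) total += c;
            double weighted = 0.0;
            for (int[] child : childCounts) {
                int n = 0;
                for (int c : child) n += c;
                weighted += ((double) n / total) * entropy(child);
            }
            return entropy(parentCounts) - weighted;
        }
    }

Comparing gain() across the candidate attributes is what lets a C4.5-style algorithm decide which attribute to split on next.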


LOGISTIC REGRESSION
Logistic Regression is a probabilistic, statistical classifier used to predict the outcome of a
categorical dependent variable based on one or more predictor variables. The algorithm measures
the relationship between a dependent variable and one or more independent variables.
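For intuition (an illustrative sketch; the coefficient names are invented here), a fitted logistic model turns a weighted sum of the predictors into a probability via the logistic function:

    public class LogisticSketch {
        // P(class = 1 | x) for a fitted model with intercept b0 and one
        // coefficient per predictor in b (all values hypothetical).
        static double probability(double b0, double[] b, double[] x) {
            double z = b0;
            for (int i = 0; i < b.length; i++) {
                z += b[i] * x[i];
            }
            return 1.0 / (1.0 + Math.exp(-z));
        }

        public static void main(String[] args) {
            double[] b = {0.035, 0.025};   // e.g., plasma glucose, age (invented)
            double[] x = {148.0, 50.0};    // one patient's values (invented)
            System.out.println(probability(-6.0, b, x));
        }
    }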
ADABOOSTING (In WEKA)
Boosting is an ensemble method that starts out with a base classifier that is prepared on the
training data. A second classifier is then created behind it to focus on the instances in the training
data that the first classifier got wrong. The process continues to add classifiers until a limit is
reached in the number of models or accuracy.
Boosting is provided in Weka in the AdaBoostM1 (adaptive boosting) algorithm.
1. Click Add new in the Algorithms section.
2. Click the Choose button.
3. Click AdaBoostM1 under the meta selection.
4. Click the Choose button for the classifier, select J48 under the tree section, and click the Choose button.
5. Click the OK button on the AdaBoostM1 configuration.
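The same configuration can also be set up programmatically (a sketch under the same assumptions as the earlier snippets):

    import weka.classifiers.meta.AdaBoostM1;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class BoostingExample {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("diabetes.arff"); // hypothetical path
            data.setClassIndex(data.numAttributes() - 1);

            // AdaBoostM1 with J48 as the base classifier, mirroring the GUI steps.
            AdaBoostM1 boost = new AdaBoostM1();
            boost.setClassifier(new J48());
            boost.buildClassifier(data);
            System.out.println(boost);
        }
    }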
PROBLEM: PREDICT THE ONSET OF DIABETES IN PIMA INDIANS
Data mining and machine learning are helping medical professionals make diagnoses by bridging the gap between huge data sets and human knowledge. We can begin to apply machine learning techniques for classification to a dataset that describes a population at high risk of the onset of diabetes.

Diabetes Mellitus affects 382 million people in the world, and the number of people with type-2 diabetes is increasing in every country. Untreated, diabetes can cause many complications.

The population for this study was the Pima Indian population near Phoenix, Arizona. The
population has been under continuous study since 1965 by the National Institute of Diabetes and
Digestive and Kidney Diseases because of its high incidence rate of diabetes.
For the purposes of this dataset, diabetes was diagnosed according to World Health Organization criteria: a patient was positive if the 2 hour post-load glucose was at least 200 mg/dl at any survey exam, or if the Indian Health Service Hospital serving the community found a glucose concentration of at least 200 mg/dl during the course of routine medical care.
Given the medical data we can gather about people, we should be able to make better predictions
on how likely a person is to suffer the onset of diabetes, and therefore act appropriately to help.

We can start analyzing data and experimenting with algorithms that will help us study the onset
of diabetes in Pima Indians.
We took the data from the UCI repository; it covers female patients of PIMA Indian heritage aged at least 21 years. The dataset documentation is reproduced below:
1. Title: Pima Indians Diabetes Database

2. Sources:
   (a) Original owners: National Institute of Diabetes and Digestive and Kidney Diseases
   (b) Donor of database: Vincent Sigillito (vgs@aplcen.apl.jhu.edu), Research Center, RMI Group Leader, Applied Physics Laboratory, The Johns Hopkins University, Johns Hopkins Road, Laurel, MD 20707, (301) 953-6231
   (c) Date received: 9 May 1990

3. Past Usage:
   Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C., & Johannes, R. S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261-265). IEEE Computer Society Press.

   The diagnostic, binary-valued variable investigated is whether the patient shows signs of diabetes according to World Health Organization criteria (i.e., if the 2 hour post-load plasma glucose was at least 200 mg/dl at any survey examination or if found during routine medical care). The population lives near Phoenix, Arizona, USA.

   Results: Their ADAP algorithm makes a real-valued prediction between 0 and 1. This was transformed into a binary decision using a cutoff of 0.448. Using 576 training instances, the sensitivity and specificity of their algorithm was 76% on the remaining 192 instances.

4. Relevant Information: Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. ADAP is an adaptive learning routine that generates and executes digital analogs of perceptron-like devices. It is a unique algorithm; see the paper for details.

5. Number of Instances: 768

6. Number of Attributes: 8 plus class

7. For Each Attribute: (all numeric-valued)
   1. Number of times pregnant
   2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test
   3. Diastolic blood pressure (mm Hg)
   4. Triceps skin fold thickness (mm)
   5. 2-Hour serum insulin (mu U/ml)
   6. Body mass index (weight in kg/(height in m)^2)
   7. Diabetes pedigree function
   8. Age (years)
   9. Class variable (0 or 1)

8. Missing Attribute Values: None

9. Class Distribution: (class value 1 is interpreted as "tested positive for diabetes")

   Class Value    Number of instances
   0              500
   1              268

10. Brief statistical analysis:

    Attribute number:    Mean:    Standard Deviation:
    1.                     3.8     3.4
    2.                   120.9    32.0
    3.                    69.1    19.4
    4.                    20.5    16.0
    5.                    79.8   115.2
    6.                    32.0     7.9
    7.                     0.5     0.3
    8.                    33.2    11.8

Relabeled values in attribute 'class':
    From: 0    To: tested_negative
    From: 1    To: tested_positive

A particularly interesting attribute used in the study was the Diabetes Pedigree Function, pedi. It provided some data on diabetes mellitus history in relatives and the genetic relationship of those relatives to the patient. This measure of genetic influence gives us an idea of the hereditary risk one might have for the onset of diabetes mellitus. Based on observations in the following section, it is unclear how well this function predicts the onset of diabetes.

Initially we did preprocessing and observed a few things.


Observations from the PIMA Indians Data

For this we used the WEKA software and checked the distribution parameters of each attribute through the WEKA Explorer.

All attributes are numeric, and the class variable is 0 or 1.
The female population is aged more than 21 years and less than 50 years.
Some values of plasma glucose, skin fold thickness, and insulin are zero, which means there are some errors in the data.

After examining the distribution of the class variable (positive or negative), we found 268 positive instances (34.9%) and 500 negative instances (65.1%).

From the figure of histograms we understood that some of the attributes are roughly normally distributed (plasma, skin, mass, blood pressure) and some exponentially distributed (pregnancy, insulin, pedigree, age). As age normally follows a normal distribution, it seems there is some problem with the dataset, which is why its distribution is skewed.
From the scatter chart we observed that:

Interestingly, the PEDIGREE variable does not show any clear relationship with diabetes.
Interestingly, larger values of plasma glucose (PGC) combined with larger values of age, pedigree, BMI, insulin, blood pressure, and pregnancies are associated with positive tests.

SCATTER CHART

Evaluation

After performing a cross-validation on the dataset, I will focus on analyzing the algorithms
through the lens of three metrics: accuracy, ROC area, and F1 measure.
Based on testing, accuracy will determine the percentage of instances that were correctly
classified by the algorithm. This is an important start of our analysis since it will give us a
baseline of how each algorithm performs.
The ROC curve is created by plotting the fraction of true positives vs. the fraction of false
positives. An optimal classifier will have an ROC area value approaching 1.0, with 0.5 being
comparable to random guessing. I believe it will be very interesting to see how our
algorithms predict on this scale.
Finally, the F1 measure will be an important statistical analysis of classification since it will
measure test accuracy. F1 measure uses precision (the number of true positives divided by
the number of true positives and false positives) and recall (the true positives divided by the
number of true positives and the number of false negatives) to output a value between 0
and 1, where higher values imply better performance.
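These quantities can be read directly off a binary confusion matrix; a small illustrative sketch with invented counts:

    public class MetricsSketch {
        public static void main(String[] args) {
            // Hypothetical confusion-matrix counts: true/false positives/negatives.
            int tp = 150, fp = 60, fn = 118, tn = 440;

            double accuracy  = (double) (tp + tn) / (tp + tn + fp + fn);
            double precision = (double) tp / (tp + fp);
            double recall    = (double) tp / (tp + fn);
            // F1 is the harmonic mean of precision and recall.
            double f1 = 2 * precision * recall / (precision + recall);

            System.out.printf("accuracy=%.3f precision=%.3f recall=%.3f F1=%.3f%n",
                    accuracy, precision, recall, f1);
        }
    }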

I strongly believe that all algorithms will perform rather similarly because we are dealing with a small dataset for classification. However, the algorithms should all perform better than the baseline prediction of the majority class, which gave an accuracy of about 65.1%.

IMPROVE RESULTS BY BOOSTING
