
International conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud)

(I-SMAC 2017)

Predictive Analysis of Diabetic Patient Data Using Machine Learning and Hadoop

Gauri D. Kalyankar, Shivananda R. Poojara, Nagaraj V. Dharwadkar
Email: gaurikalyankar93@gmail.com, shivananda.poojara@ritindia.edu, nagaraj.dharwadkar@ritindia.edu
Dept. of Computer Science and Engineering, Rajarambapu Institute of Technology, Sakhrale, Sangli Dist

978-1-5090-3243-3/17/$31.00 ©2017 IEEE

Abstract: Nowadays the health care industry generates large volumes of data. It is necessary to collect, store and process this data to discover knowledge from it and to use that knowledge to make significant decisions. Diabetes Mellitus (DM) is a Non-Communicable Disease (NCD), and many people suffer from it. For developing countries such as India, DM has become a major health issue. DM is a critical disease that has long-term complications and is accompanied by various other health problems. With the help of technology, it is necessary to build a system that stores and analyzes diabetic data and predicts possible risks accordingly. Predictive analysis is a method that integrates various data mining techniques, machine learning algorithms and statistics, using current and past data sets to gain insight and predict future risks. In this work, machine learning algorithms are implemented in a Hadoop MapReduce environment for the Pima Indian diabetes data set to find missing values in it and to discover patterns from it. This work will be able to predict which types of diabetes are widespread and the related future risks, so that the type of treatment can be chosen according to the risk level of the patient.

Keywords: Healthcare industry, Hadoop, MapReduce, Machine Learning, Predictive Analysis

I. INTRODUCTION

Nowadays the health care industry generates large volumes of data. This data may be structured or unstructured. Hospitals preserve various records such as patients' profile information, X-ray reports and reports of various medical tests, and together these form big data [1][16]. Big data analytics is the process of examining such large data sets to uncover hidden information and hidden patterns and so discover knowledge from the data. Big data analytics can be used to identify a disease and its associated risk at an early stage and can help to provide medical care accordingly.

In developing countries such as India, Diabetes Mellitus (DM) has become a major health hazard. DM is categorized as a Non-Communicable Disease (NCD), and many people suffer from it. There are three main types of DM. The first is Type 1 DM, which is caused by the body's inability to produce insulin and currently requires the patient to inject insulin. This type of diabetes is called Insulin-Dependent Diabetes Mellitus (IDDM). The second is Type 2 DM, which arises when body cells are unable to use insulin properly. Type 2 DM is also known as Non-Insulin-Dependent Diabetes Mellitus (NIDDM). The third and most important type is gestational diabetes, which occurs when a pregnant woman without a previous diagnosis of DM develops high blood sugar. Such patients may later progress to Type 2 DM [1]. DM is a critical disease that has long-term complications and is accompanied by various other health problems. A large volume of diabetic data is generated every day, and hence it is necessary to analyze this data and make efficient decisions [1][5].

Predictive analysis is a method that integrates various data mining techniques, machine learning algorithms and statistics, using current and past data sets to discover knowledge and use it to predict future occurrences. By applying analytics to health care data, important decisions and predictions can be made. In this work we use machine learning algorithms in a Hadoop MapReduce environment to predict which types of diabetes are widespread and the related complications, so that treatment can be provided accordingly. On the basis of this analysis, the system will be able to provide a competent solution for the early diagnosis of a patient's risk level. The system is able to provide a reasonable solution with reliable availability.

II. PRIOR WORK

A review of prior work shows many results on the analysis of healthcare data carried out with different methods and techniques. Many researchers have developed and implemented analysis and prediction models using data mining, data management and Hadoop techniques, or a combination of these.

Eswari et al. (2015) proposed a Hadoop and MapReduce based approach for the analysis of diabetic data. Through this analysis the developed system can predict the prevalent types of diabetes and the complications associated with them. On the basis of such analysis, the system supports giving the patient appropriate treatment as early as possible. The system is based on Hadoop, hence it is affordable to any healthcare
organization [1].

V. H. Bhat et al. (2009) proposed an approach that integrates regression, classification, genetic and neural network techniques. It deals with missing values as well as outlier values in the diabetic data set and replaces the missing values by the corresponding attribute domain. For prediction they used a classical neural network model and applied it to the preprocessed data set [2].

Aiswarya Iyer et al. (2015) used classification techniques to find patterns in diabetes data sets. They employed Naive Bayes and Decision Tree algorithms using the Weka tool. The authors also compared the performance of both algorithms on the Pima Diabetes data set. Experimental results showed the effectiveness of each proposed classification model [3].

Sabibullah M. et al. (2013) developed a prediction model based on soft computing to find the accumulated risks of diabetic patients. They used a genetic algorithm for experimentation on a real-time medical data set. From the results of the experiments, the risk level of a patient and accordingly the risk of heart stroke can be predicted. The developed system is intended to help the doctor diagnose the patient correctly [4].

K. Rajesh and V. Sangeetha (2012) used classification techniques to find useful information in a diabetic data set. They used the C4.5 algorithm to find patterns in the data set and for efficient classification. They used the Pima Indian Diabetes Data set (PIDD) for experimentation. While performing the classification the authors did not consider missing values in the data set; they treated PIDD as a complete data set [5].

Vaishnav and Dr. Patel (2015) reviewed different methods for handling missing data. They reviewed methods such as K-Means, KNN and classification used for missing value imputation and compared their advantages and disadvantages [7].

Wei Dai and Wei Ji (2014) implemented a MapReduce based C4.5 decision tree algorithm. They fitted the original algorithm into the Map and Reduce process. The authors conducted experiments on massive synthetic data sets and compared performance and accuracy on single-node versus multi-node Hadoop clusters. From the results of this work the authors suggest that the implemented algorithm can provide time efficiency as well as scalability [8].

Much research using techniques such as data mining, Weka, and Hadoop and its ecosystem has been carried out successfully for analyzing healthcare data and developing good analysis models [15]. For the analysis of diabetic data many authors preferred decision trees for classification, rule generation and pattern recognition [12]. In this work we use C4.5, a decision tree algorithm, for pattern recognition on the diabetes data set to find the risk level of a patient.

III. PROPOSED METHOD

A. Predictive Analysis System Architecture

The predictive analysis system architecture comprises various phases: data collection, missing value imputation, pattern discovery, pattern matching and result analysis. Figure 1 describes the overall architecture of the proposed work.

Fig. 1 Predictive Analysis System Architecture

B. Machine Learning

Machine learning is a collection of methods that can automatically identify patterns in data and then use those patterns to predict future outcomes, or to perform other types of decision making under certain conditions. Machine learning provides various algorithms that enable machines to understand the current situation and, on that basis, take appropriate decisions. Machine learning works independently and takes decisions on its own [9].

The two main types of machine learning are supervised learning and unsupervised learning.

1. Supervised Learning: In supervised learning, the input and its corresponding output are already known. It is called supervised learning because it learns from a training data set and creates a model from it; when this model is applied to a new data set it gives predicted results. Decision Tree and Naive Bayes are examples of supervised learning.
2. Unsupervised Learning: In unsupervised learning we have only input data and no corresponding output variable. The main job of unsupervised learning is to build up class labels automatically. Unsupervised learning algorithms look for relationships in the data to discover whether the data can be characterized as forming groups. These groups are known as clusters, so unsupervised learning can also be described as cluster analysis. K-Means clustering is a typical example of unsupervised learning [9][10].

In this work we integrate a supervised learning algorithm with the Hadoop MapReduce environment to perform predictive analysis on large diabetic data sets.

C. Apache Hadoop and MapReduce


Apache Hadoop is an open source framework written in Java that can be used to process massive data sets on clusters of computers in a distributed manner. Hadoop provides distributed storage and processing of data. According to the needs of the application, Hadoop can scale from a single server to multiple servers. In Hadoop each node provides local storage and computation. The Hadoop Distributed File System (HDFS) is used to store the data. Through HDFS, massive volumes of data can be accessed with less time and effort. Parallel processing of such data is done with the MapReduce framework [6].

The MapReduce framework can be used to write applications that process massive data sets in a reliable manner. In a MapReduce environment large data can be processed in a distributed and parallel way on a number of commodity machines. The MapReduce framework contains two phases. The first is the Map phase, in which the input data is converted into intermediate data in the form of key-value pairs. In the Reduce phase this intermediate data is converted into the final output by combining all the values associated with each key [14]. The overall architecture of the MapReduce framework is shown in Fig. 2.

Fig. 2 MapReduce Architecture

The most important phase of this work is data collection. We have collected standard and local data sets to perform the experimentation.

D. Standard Data set

The Pima Indians Diabetes Data set is used for the experimentation. The data set is available at the UCI Machine Learning Repository. It describes Pima Indian diabetic and non-diabetic women whose age is above 21 years. There are 768 records in total. The data set contains 8 attributes plus one class variable, and all attributes are numeric. The data set has some missing values in it. The attributes in the data set are listed in Table 1.

Table 1: Attribute Names

Sr. No.  Attribute Name
1        Number of times pregnant
2        Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3        Diastolic blood pressure (mm Hg)
4        Triceps skin fold thickness (mm)
5        2-Hour serum insulin (mu U/ml)
6        Body mass index (weight in kg / (height in m)^2)
7        Diabetes pedigree function
8        Age (years)
9        Class variable (0 or 1)

E. Missing Value Imputation

Missing data create various problems when analyzing and processing data in a database, and they are a well-known issue in data mining research. Missing data occur when no value is recorded for an attribute of an instance, when the values are not relevant, or when the values were not collected properly at the time of data collection. The missing values in a database can affect the accuracy and performance of the classifier, which makes it difficult to extract meaningful information and causes a loss of efficiency. It may be very difficult to obtain good-quality mining results from incomplete data sets, hence these gaps need to be treated [7][13].

In the Pima Indian diabetes data set various attributes have null values, recorded as 0. No patient can have a blood pressure or plasma glucose level of 0, so such values must be missing. Hence it is necessary to impute missing values to obtain good classification results; otherwise they will cause wrong classification results. Missing values can be imputed using different techniques such as classification and clustering. We used a classification-based technique that replaces missing values with the attribute mean.

The following missing values imputation (MVI) algorithm is used to impute missing values in the Pima Indian diabetes data set.

Algorithm 1: MVI

Input: Pima Diabetes Data set (X) having missing values
Output: Pima Diabetes Data set (X) after missing values imputation.

1. From the attributes Ai, i = 1,...,N (where N is the total number of attributes) in data set X, retrieve the attribute A that holds the binary class variable.
2. According to the class variable, divide the data set X into Xd and Xnd.
   i. Xd = Sub data set having all diabetic records
   ii. Xnd = Sub data set having all non-diabetic records
3. Identify missing values in all attributes A ∈ {A1,A2,A3,...,AN} in each of the data sets Xd and Xnd (consider 0 as a missing value).
4. For each attribute A ∈ {A1,A2,A3,...,AN} in data set Xd, calculate the attribute mean Mi, i = 1,...,N, where N is the total number of attributes.
5. For each attribute A ∈ {A1,A2,A3,...,AN} in data set Xnd, calculate the attribute mean Mni, i = 1,...,N, where N is the total number of attributes.
6. Impute missing values in data set Xd with the attribute mean Mi.
7. Impute missing values in data set Xnd with the attribute mean Mni.
8. Combine data sets Xd and Xnd.
9. Return (X)

After imputing the missing values in the data set, numeric values are converted into strings according to the following ranges for each attribute. These standard ranges are decided on the basis of medical references.

Table 2: Description of Attribute Ranges

Attribute Name                 Low     Medium      High
Plasma glucose concentration   <95     95-141      141+
Diastolic blood pressure       <80     80-90       90+
Diabetes pedigree function     <0.42   0.42-0.82   0.82+
Age                            <40     40-60       60+

This prepared data set is given as input to the C4.5 algorithm for pattern discovery.

F. Pattern Discovery

Pattern discovery is a task that recognizes a pattern in a set of sequences of strings or values. The input to a pattern discovery program contains a number of such sequences from which the pattern can be recognized. The input data may contain a particular sequenced pattern. Classification is the most useful and appropriate technique for pattern discovery [11]. A decision tree is a classification system that generates a tree and a set of rules, representing the model of different classes, from a given set of data [12]. The decision tree algorithm follows supervised learning. In supervised learning, both inputs and outputs can be observed. On the basis of this training data, the algorithm must generalize in such a way that it is able to respond correctly to all possible inputs.

Ross Quinlan developed the C4.5 algorithm, which builds Decision Trees (DT). Classification problems can be solved by decision tree algorithms. C4.5 improves on the ID3 algorithm by handling both continuous and discrete attributes, handling missing values, and pruning trees after construction. C4.5 is based on the information gain ratio, which is calculated from entropy. The information gain ratio is used to select the test attribute at each node of the tree; such a measure is called an attribute selection measure. The attribute with the highest information gain ratio is chosen as the test attribute for the current node [8]. The following is pseudo-code for building a C4.5 decision tree.

Algorithm 2: Decision Tree C4.5

Input: Training data set (T); attributes (S).
Output: Decision Tree (Tree)

1. if T is NULL then
2.     return failure
3. end if
4. set Tree = {}
5. for a ∈ S do
6.     set SplitInfo(a,T) = 0, Entropy(a) = 0
7.     compute Entropy(a), SplitInfo(a,T), Gain(a,T), GainRatio(a,T)
8. end for
9. abest = argmax{GainRatio(a,T)}
10. attach abest to Tree
11. for v ∈ Values(abest,T) do
12.     call C4.5(Ta,v), where a = abest
13. end for
14. return Tree

Let c denote the number of classes, and let p(S, j) be the proportion of instances in S that are assigned to the j-th class. The entropy of attribute S is then calculated as:

$$ Entropy(S) = -\sum_{j=1}^{c} p(S, j) \times \log p(S, j) $$

Accordingly, the information gain on a training data set T is defined as:

$$ Gain(S, T) = Entropy(S) - \sum_{v \in Values(T_S)} \frac{|T_{S,v}|}{|T_S|} \, Entropy(T_{S,v}) $$


where Values(T_S) is the set of values of S in T, T_S is the subset of T induced by S, and T_{S,v} is the subset of T in which attribute S has the value v.

The information gain ratio of attribute S is therefore defined as:

$$ GainRatio(S, T) = \frac{Gain(S, T)}{SplitInfo(S, T)} $$

where SplitInfo(S, T) is calculated as:

$$ SplitInfo(S, T) = -\sum_{v \in Values(T_S)} \frac{|T_{S,v}|}{|T_S|} \times \log \frac{|T_{S,v}|}{|T_S|} $$

Table 3: Attribute Name and Index

Attribute Name                 Attribute Index
Plasma glucose concentration   0
Blood pressure                 1
Diabetes pedigree function     2
Age                            3

The steps followed in the C4.5 algorithm are described in Algorithm 2. The entropy, information gain and gain ratio are calculated using the above equations and computed recursively on each subtree.
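
As an illustration of the entropy, information gain, split information and gain ratio formulas above, the following is a minimal Python sketch of the attribute selection step of Algorithm 2. It is not the authors' implementation. The function names, the representation of records as dictionaries, the toy records and the use of base-2 logarithms are all assumptions made for this example (the formulas only say "log").

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels (base-2 logarithm assumed)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain_ratio(rows, attr, target):
    """Gain ratio obtained by splitting rows (a list of dicts) on attr."""
    total = len(rows)
    # Group the class labels by the value v that attr takes (the subsets T_{S,v}).
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attr], []).append(row[target])
    base = entropy([row[target] for row in rows])
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    split_info = -sum(len(s) / total * math.log2(len(s) / total)
                      for s in subsets.values())
    gain = base - remainder
    # An attribute with a single value has SplitInfo = 0; treat its ratio as 0.
    return gain / split_info if split_info > 0 else 0.0

def best_attribute(rows, attrs, target):
    """Step 9 of Algorithm 2: pick the attribute with the highest gain ratio."""
    return max(attrs, key=lambda a: gain_ratio(rows, a, target))

# Hypothetical toy records, not taken from the Pima data set.
toy_rows = [
    {"plasma": "high", "age": "low", "cls": "diabetic"},
    {"plasma": "high", "age": "high", "cls": "diabetic"},
    {"plasma": "low", "age": "low", "cls": "non-diabetic"},
    {"plasma": "low", "age": "high", "cls": "non-diabetic"},
]
root = best_attribute(toy_rows, ["plasma", "age"], "cls")
```

In this toy set the plasma attribute separates the two classes perfectly while age carries no information, so plasma would be attached as the root before the procedure recurses on each subset.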
The C4.5 algorithm has been implemented using Hadoop MapReduce so that it can run in parallel and give results in less time.
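
The paper does not spell out how the C4.5 computation is divided between the Map and Reduce phases. One plausible decomposition, sketched below under that assumption, counts how often each (attribute, value, class) combination occurs, since these counts are all that the entropy and gain ratio formulas need. The sketch simulates the two phases in plain Python rather than calling Hadoop, and every function name and record in it is hypothetical.

```python
from collections import defaultdict

def map_record(record, attrs, target):
    """Map phase: emit ((attribute, value, class), 1) for every attribute."""
    for attr in attrs:
        yield (attr, record[attr], record[target]), 1

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(key, values):
    """Reduce phase: combine all values of one key into a single count."""
    return key, sum(values)

def count_job(records, attrs, target):
    """Run map, shuffle and reduce in sequence on an in-memory list."""
    pairs = [kv for rec in records for kv in map_record(rec, attrs, target)]
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())

# Hypothetical discretized records, not taken from the Pima data set.
records = [
    {"plasma": "high", "age": "low", "cls": "diabetic"},
    {"plasma": "high", "age": "high", "cls": "diabetic"},
    {"plasma": "low", "age": "low", "cls": "non-diabetic"},
]
counts = count_job(records, ["plasma", "age"], "cls")
```

On a real cluster the map calls would run on different nodes over different blocks of the input, and only the small table of counts would have to reach the node that evaluates the gain ratios.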

IV. RESULTS AND DISCUSSION

A. Input Data set: In order to discover patterns in the data set, we gave the Pima diabetes data set to the C4.5 algorithm. Before giving the input to the C4.5 algorithm, we applied Algorithm 1 to the data set for missing value imputation. After that, all numeric values in the data set were converted into strings according to the standard range of each attribute, so that the output of the C4.5 algorithm is easy to interpret.

Fig. 3 Plasma Low

B. Selection of Attributes: Four attributes are selected to reduce the complexity of the results. Attributes are selected on the basis of the following criteria. 1) The attribute is helpful for diagnosing a diabetic patient and the risk associated with the patient. 2) The attribute has fewer missing values, which gives better accuracy.
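
The input preparation described in A and B (class-wise mean imputation as in Algorithm 1, followed by conversion to the ranges of Table 2 for the four selected attributes) can be sketched as follows. This is an illustrative sketch, not the authors' code. The short attribute keys, the function names and the sample records are hypothetical. Two details are assumptions because the source leaves them open: the mean is taken over the non-zero values only, and a value exactly on the upper cut point (for example 141) is counted as medium.

```python
# Cut points from Table 2: (upper bound of "low", upper bound of "medium").
RANGES = {
    "plasma": (95, 141),
    "pressure": (80, 90),
    "pedigree": (0.42, 0.82),
    "age": (40, 60),
}

def impute_by_class_mean(rows, attrs, target="cls"):
    """Replace 0 by the mean of the non-zero values within the same class."""
    result = [dict(r) for r in rows]  # work on copies, leave the input intact
    for label in {r[target] for r in rows}:
        group = [r for r in result if r[target] == label]
        for attr in attrs:
            known = [r[attr] for r in group if r[attr] != 0]
            if not known:
                continue  # nothing to average; leave the zeros in place
            mean = sum(known) / len(known)
            for r in group:
                if r[attr] == 0:
                    r[attr] = mean
    return result

def discretize(value, attr):
    """Map a numeric value to 'low', 'medium' or 'high' using Table 2."""
    low_cut, high_cut = RANGES[attr]
    if value < low_cut:
        return "low"
    if value <= high_cut:
        return "medium"
    return "high"

def prepare(rows, target="cls"):
    """Impute, then discretize the four selected attributes."""
    attrs = list(RANGES)
    imputed = impute_by_class_mean(rows, attrs, target)
    return [{**{a: discretize(r[a], a) for a in attrs}, target: r[target]}
            for r in imputed]

# Hypothetical records; 0 marks a missing measurement.
sample = [
    {"plasma": 150, "pressure": 0, "pedigree": 0.5, "age": 45, "cls": 1},
    {"plasma": 170, "pressure": 92, "pedigree": 0.9, "age": 61, "cls": 1},
    {"plasma": 0, "pressure": 70, "pedigree": 0.3, "age": 25, "cls": 0},
    {"plasma": 90, "pressure": 78, "pedigree": 0.2, "age": 30, "cls": 0},
]
prepared = prepare(sample)
```

Imputing within each class keeps a missing value of a diabetic record from being filled with a mean that is pulled down by non-diabetic records, which appears to be the motivation for splitting X into Xd and Xnd.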

C. C4.5 Algorithm Results:

The algorithm generates a total of 57 rules from the standard data set. Because it has the highest information gain ratio, the attribute plasma glucose (0) is selected as the root node. From these rules we built trees for plasma low, plasma high and plasma medium, as shown in Fig. 3, Fig. 4 and Fig. 5 respectively.

In these graphs, the root node represents the decision attribute. Each branch describes a decision output, every internal node describes a test attribute, and each leaf node describes a class label.

In the tree graphs, the attribute indexes listed in Table 3 are used to represent the respective attributes.

Fig. 4 Plasma High


Fig. 5 Plasma Medium

From the analysis of all the patterns it is clear that when plasma glucose is high the patient is most often of the diabetic class, and when plasma glucose is low the patient is most often of the non-diabetic class. When plasma glucose is medium, both diabetic and non-diabetic patients are found.

From the above graphs we can write patterns that will be helpful for classifying patients into the diabetic or non-diabetic class and for predicting the risk level of each patient.

V. CONCLUSION

Predictive analysis is a method that integrates various data mining techniques, machine learning algorithms and statistics, using current and past data sets to discover knowledge and use it to predict future occurrences. In this work we have implemented Hadoop MapReduce based machine learning algorithms for the Pima Indian diabetes data set to find missing values in it and to discover patterns from it. This work suggests that the implemented algorithms are able to impute missing values and to recognize patterns in the data set. In future work, pattern matching will be employed by applying the discovered patterns to a testing data set to predict the prevalence of diabetes and the risk levels associated with it.

REFERENCES

[1] Dr Saravanakumar, Eswari, Sampath, Lavanya, “Predictive Methodology for Diabetic Data Analysis in Big Data,” Elsevier, ISBCC 2015.
[2] V. H. Bhat, P. G. Rao, P. D. Shenoy, “An Efficient Prediction Model for Diabetic Database Using Soft Computing Techniques, Architecture,” Springer-Verlag Berlin Heidelberg, pp. 328-335, 2009.
[3] Aiswarya Iyer, S. Jeyalatha, Ronak Sumbaly, “Diagnosis of Diabetes Using Classification Mining Techniques,” IJDKP, Vol. 5, No. 1, January 2015.
[4] Sabibullah M, Shanmugasundaram V, Raja Priya K, “Diabetes Patient’s Risk through Soft Computing Model,” International Journal of Emerging Trends Technology in Computer Science, Vol. 2(6), 2013.
[5] K. Rajesh, V. Sangeetha, “Application of Data Mining Methods and Techniques for Diabetes Diagnosis,” International Journal of Engineering and Innovative Technology (IJEIT), Vol. 2(3), 2012.
[6] Apache Hadoop and its ecosystems: http://hadoop.apache.org/
[7] Rajnik L. Vaishnav, Dr. K. M. Patel, “Analysis of Various Techniques to Handling Missing Value in Data set,” International Journal of Innovative and Emerging Research in Engineering, Volume 2, Issue 2, 2015.
[8] Wei Dai, Wei Ji, “A MapReduce Implementation of C4.5 Decision Tree Algorithm,” International Journal of Database Theory and Application, Vol. 7, No. 1 (2014), pp. 49-60.
[9] Machine Learning tutorials and examples: https://www.toptal.com/machine-learning/machine-learning-theory-an-introductory-primer
[10] Anish Talwar, Yogesh Kumar, “Machine Learning: An artificial intelligence methodology,” International Journal of Engineering and Computer Science, ISSN: 2319-7242, Volume 2, Issue 12, Dec. 2013.
[11] Brona Brejova, Tomas Vina, Ming Li, “Pattern Discovery: Methods and Software,” Technical Report CS-2000-22, Dept. of Computer Science, University of Waterloo.
[12] Dr. Rajni Jain, “Rule Generation Using Decision Trees,” IASRI.
[13] Md. Geaur Rahman, Md. Zahidul Islam, “A Decision Tree-based Missing Value Imputation Technique for Data Pre-processing,” Proceedings of the 9-th Australasian Data Mining Conference (AusDM’11), Ballarat, Australia.
[14] Gauri D. Kalyankar, Shivananda R Poojara, N V Dharwadkar, “Weblog Analysis Using Hadoop,” National Research Symposium on Computing - RSC 2016, ISBN: 978-81-931456-1-8, Dec. 19-20, 2016.
[15] Sadhana, Savitha Shetty, “Analysis of Diabetic Data Set Using Hive and R,” International Journal of Emerging Technology and Advanced Engineering, Vol. 4(7), 2014.
[16] A. Ravishankar Rao, Atul Chhabra, Rajarshi Das, Vikash Ruhil, “A framework for analyzing publicly available healthcare data,” IEEE 2015.
