(I-SMAC 2017)
Dept. of Computer Science and Engineering
Rajarambapu Institute of Technology
Sakhrale, Sangli Dist
Abstract—The health care industry now generates large volumes of data. This data must be collected, stored and processed to discover knowledge from it and to support significant decisions. Diabetes Mellitus (DM) is one of the Non-Communicable Diseases (NCD), and many people suffer from it. For developing countries such as India, DM has become a major health issue. DM is a critical disease with long-term complications and a range of associated health problems. With the help of technology, it is necessary to build a system that stores and analyzes diabetic data and predicts possible risks accordingly. Predictive analysis is a method that integrates data mining techniques, machine learning algorithms and statistics, using current and past data sets to gain insight and predict future risks. In this work, a machine learning algorithm is implemented in a Hadoop MapReduce environment for the Pima Indian diabetes data set to find missing values in it and to discover patterns from it. This work aims to predict which types of diabetes are widespread and the related future risks, so that the type of treatment can be chosen according to the risk level of the patient.

Keywords— Healthcare industry, Hadoop, MapReduce, Machine Learning, Predictive Analysis

I. INTRODUCTION

The health care industry now generates large volumes of data, which may be structured or unstructured. Hospitals preserve various records such as patient profile information, x-ray reports and medical test reports, and together these form big data [1][16]. Big data analytics is the process of examining such large data sets to uncover hidden information and patterns and so discover knowledge from the data. Big data analytics can be used to identify a disease and its associated risks at an early stage and can help to provide medical care accordingly.

In developing countries such as India, Diabetes Mellitus (DM) has become a major health hazard. DM is categorized as a Non-Communicable Disease (NCD), and many people suffer from it. There are three main types of DM. The first is Type 1 DM, which is caused by the body's inability to produce insulin and currently requires insulin to be injected into the patient. This type of diabetes is called Insulin-Dependent Diabetes Mellitus (IDDM). The second is Type 2 DM, which arises when body cells are unable to use insulin properly. Type 2 DM is also known as Non-Insulin-Dependent Diabetes Mellitus (NIDDM). The third, and most important, type of DM is gestational diabetes, which is caused by the development of high blood sugar in pregnant women without a previous diagnosis of DM. Such patients can progress to Type 2 DM [1]. DM is a critical disease with long-term complications and a range of associated health problems. A large volume of diabetic data is generated every day, so it is necessary to analyze this data and make efficient decisions [1][5].

Predictive analysis is a method that integrates data mining techniques, machine learning algorithms and statistics, using current and past data sets to discover knowledge and to predict future occurrences. By applying analytics to health care data, important decisions and predictions can be made. In this work, we use machine learning algorithms in a Hadoop MapReduce environment to predict which types of diabetes are widespread and the related complications, so that treatment can be provided accordingly. On the basis of this analysis, the system will be able to provide a competent solution for the early diagnosis of a patient's risk level. The system is able to provide a reasonable solution with reliable availability.

II. PRIOR WORK

The review of prior work shows many results on the analysis of healthcare data, carried out with different methods and techniques. Many researchers have developed and implemented analysis and prediction models using data mining, data management and Hadoop techniques, or combinations of these techniques.

Eswari et al. (2015) proposed a Hadoop and MapReduce based approach for the analysis of diabetic data. Through this analysis, the developed system can predict the prevalent types of diabetes and the complications associated with them. On the basis of such analysis, the system can help treat the patient by giving appropriate treatment as early as possible. The system is based on Hadoop, hence it is affordable to any healthcare
Apache Hadoop is an open source framework written in Java that can process massive data sets across clusters of computers in a distributed manner. Hadoop provides distributed storage and processing of data. Depending on the needs of the application, Hadoop can scale from a single server to multiple servers, and each node provides local storage and computation. The Hadoop Distributed File System (HDFS) is used to store the data. Through HDFS, massive volumes of data can be accessed with less time and effort. Parallel processing of such data is done with the MapReduce framework [6].

The MapReduce framework can be used to write applications that process massive data sets in a reliable manner. In a MapReduce environment, large data sets are processed in a distributed and parallel way on a number of commodity machines. The framework has two phases. In the Map phase, the input data is converted into intermediate data in the form of key-value pairs. In the Reduce phase, this intermediate data is converted into the final output by combining all the values associated with each key [14]. The overall architecture of the MapReduce framework is shown in the following figure.

2    Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3    Diastolic blood pressure (mm Hg)
4    Triceps skin fold thickness (mm)
5    2-Hour serum insulin (mu U/ml)
6    Body mass index (weight in kg / (height in m)^2)
7    Diabetes pedigree function
8    Age (years)
9    Class variable (0 or 1)

E. Missing Value Imputation

Missing data create various problems when analyzing and processing data in a database, and they are a recurring issue in data mining research. Data may be missing because no value is associated with an attribute for some instance, because the values are not relevant, or because the values were not collected properly when the data was recorded. Missing values in a database can affect the accuracy and performance of the classifier, which makes it difficult to extract meaningful information and reduces efficiency. It can be very difficult to obtain quality data mining results from incomplete data sets, hence these missing gaps need to be treated [7][13].
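
As a rough illustration of how the Map and Reduce phases described earlier could serve missing-value treatment, the following Python sketch computes per-class attribute means with a map step that emits key-value pairs and a reduce step that averages the values of each key, then fills each missing value with the mean for its class. It runs in memory rather than on Hadoop. The function names, the sample records and the choice of (class label, attribute index) as the key are assumptions made for illustration, not the implementation used in this work.

```python
from collections import defaultdict

MISSING = 0  # the text treats 0 as the marker for a missing value


def map_phase(records):
    """Map: emit ((class_label, attribute_index), value) for each observed value."""
    for *attributes, label in records:
        for index, value in enumerate(attributes):
            if value != MISSING:
                yield (label, index), value


def reduce_phase(pairs):
    """Reduce: combine all values that share a key into their mean."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) / len(values) for key, values in grouped.items()}


def impute(records):
    """Replace each missing value with the mean of the same attribute and class."""
    means = reduce_phase(map_phase(records))
    imputed = []
    for *attributes, label in records:
        filled = [
            # If a class has no observed value for an attribute, the value stays as-is.
            means.get((label, index), value) if value == MISSING else value
            for index, value in enumerate(attributes)
        ]
        imputed.append(filled + [label])
    return imputed


# Hypothetical records: (glucose, blood pressure, class label)
records = [(100, 70, 1), (0, 80, 1), (90, 0, 0), (110, 60, 0)]
for row in impute(records):
    print(row)
```

Keying on the class label keeps the diabetic and non-diabetic means separate, so a missing value is never filled with a mean taken from the other class.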
   ii. Xnd = sub data set having all non-diabetic records
3. Identify missing values for all attributes A ∈ Ai = {A1, A2, A3, ..., AN} in each data set Xd and Xnd (consider 0 as a missing value).
4. For each attribute A ∈ Ai = {A1, A2, A3, ..., AN} in data set Xd, calculate the attribute mean Mi, i = 1, ..., N, where N is the total number of attributes.
5. For each attribute A ∈ Ai = {A1, A2, A3, ..., AN} in data set Xnd, calculate the attribute mean Mni, i = 1, ..., N, where N is the total number of attributes.
6. Impute missing values in data set Xd with attribute mean Mi.
7. Impute missing values in data set Xnd with attribute mean Mni.
8. Combine data sets Xd and Xnd.
9. Return (X)

After imputing the missing values in the data set, numeric values are converted into strings according to the following ranges for each attribute. These standard ranges are decided on the basis of medical references.

Table 2: Description of Attribute Ranges

Attribute Name                  Low      Medium       High
Plasma glucose concentration    <95      95-141       141+
Diastolic blood pressure        <80      80-90        90+
Diabetes pedigree function      <0.42    0.42-0.82    0.82+
Age                             <40      40-60        60+

This prepared data set is given as input to the C4.5 algorithm for pattern discovery.

F. Pattern Discovery

Pattern discovery is the task of recognizing a pattern in a set of sequences of strings or values. The input to a pattern discovery program consists of a number of such sequences, from which the pattern can be recognized. The input data may contain a particular sequenced pattern. Classification is the most useful and appropriate technique for pattern discovery [11]. A decision tree is a classification system that generates a tree and a set of rules, representing the model of the different classes, from a given set of data [12]. The decision tree algorithm follows supervised learning, in which both the inputs and the outputs are known. On the basis of this training data, the algorithm must generalize in such a way that it is able to respond correctly to all possible inputs.

Ross Quinlan developed the C4.5 algorithm, which builds Decision Trees (DT). Classification problems can be solved by decision tree algorithms. C4.5 improves on the ID3 algorithm by handling both continuous and discrete attributes, handling missing values, and pruning trees after construction. C4.5 is based on the information gain ratio, which is calculated from entropy. The information gain ratio is used to select the test attribute at each node of the tree, and such a measure is called an attribute selection measure. The attribute with the highest information gain ratio is chosen as the test feature for the current node [8]. The following is pseudo-code for building a C4.5 decision tree.

Algorithm 2: Decision Tree C4.5

Input: Training data set (T); attributes (S).
Output: Decision Tree (Tree)

1. if T is NULL then
2.    return failure
3. end if
4. set Tree = {}
5. for a ∈ S do
6.    set SplitInfo(a, T) = 0, Entropy(a) = 0
7.    compute Entropy(a), SplitInfo(a, T), Gain(a, T), GainRatio(a, T)
8. end for
9. a_best = argmax{GainRatio(a, T)}
10. attach a_best to Tree
11. for v ∈ Values(a_best, T) do
12.    call C4.5(T_{a,v})
13. end for
14. return Tree

Let c denote the number of classes, and let p(S, j) be the proportion of instances in S that are assigned to the j-th class. The entropy of attribute S is then calculated as:

$$\mathrm{Entropy}(S) = -\sum_{j=1}^{c} p(S, j) \times \log p(S, j)$$

Accordingly, the information gain on a training data set T is defined as:

$$\mathrm{Gain}(S, T) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(T_S)} \frac{|T_{S,v}|}{|T_S|} \, \mathrm{Entropy}(S_v)$$

where Values(T_S) is the set of values of S in T, T_S is the subset of T induced by S, T_{S,v} is the subset of T in which attribute S has the value v, and Entropy(S_v) is the entropy of that subset.

The information gain ratio of attribute S is therefore defined as:

$$\mathrm{GainRatio}(S, T) = \frac{\mathrm{Gain}(S, T)}{\mathrm{SplitInfo}(S, T)}$$

where SplitInfo(S, T) is calculated as:

$$\mathrm{SplitInfo}(S, T) = -\sum_{v \in \mathrm{Values}(T_S)} \frac{|T_{S,v}|}{|T_S|} \times \log \frac{|T_{S,v}|}{|T_S|}$$

Table 3: Attribute Name and Index

Attribute Name                  Attribute Index
Plasma glucose concentration    0
Blood pressure                  1
Diabetes pedigree function      2
Age                             3
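
The attribute selection measures can be checked on a small example. The following Python sketch is one possible reading of them, not the implementation used in this work. It discretizes numeric values with the cut points of Table 2, orders the attributes by the indices of Table 3, and computes entropy, gain, split information and gain ratio over the class labels. The base-2 logarithm, the handling of values that fall exactly on a cut point, the function names and the sample records are all assumptions.

```python
import math
from collections import Counter

# (low/medium cut, medium/high cut) from Table 2, in the index order of Table 3.
# Assumption: a value equal to a cut point falls into the higher range.
CUT_POINTS = [(95, 141), (80, 90), (0.42, 0.82), (40, 60)]


def discretize(record):
    """Turn (glucose, pressure, pedigree, age, label) into range names plus label."""
    *values, label = record
    names = []
    for value, (low_cut, high_cut) in zip(values, CUT_POINTS):
        if value < low_cut:
            names.append("low")
        elif value < high_cut:
            names.append("medium")
        else:
            names.append("high")
    return names + [label]


def entropy(labels):
    """Entropy of a list of class labels (base-2 logarithm assumed)."""
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())


def partition(rows, index):
    """Group the class labels of the rows by the value of attribute `index`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[index], []).append(row[-1])
    return groups


def gain(rows, index):
    """Entropy of all labels minus the size-weighted entropy of each subset."""
    groups = partition(rows, index)
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - weighted


def split_info(rows, index):
    """Entropy of the partition sizes induced by attribute `index`."""
    groups = partition(rows, index)
    return -sum(len(g) / len(rows) * math.log2(len(g) / len(rows)) for g in groups.values())


def gain_ratio(rows, index):
    """Gain divided by split information; 0.0 when the attribute has a single value."""
    info = split_info(rows, index)
    return gain(rows, index) / info if info > 0 else 0.0


# Hypothetical records: (glucose, blood pressure, pedigree, age, class label)
raw = [(90, 70, 0.3, 30, 0), (92, 85, 0.5, 35, 0), (150, 85, 0.5, 50, 1), (160, 70, 0.3, 65, 1)]
rows = [discretize(record) for record in raw]
best = max(range(len(CUT_POINTS)), key=lambda index: gain_ratio(rows, index))
print("attribute index with highest gain ratio:", best)
```

Dividing by the split information penalizes attributes that break the data into many small subsets, which is why an attribute with three values can score lower than one with two values even when both separate the classes perfectly.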
REFERENCES
[1] Saravanakumar, Eswari, Sampath, Lavanya, “Predictive
Methodology for Diabetic Data Analysis in Big Data,”
Elsevier, ISBCC 2015.