
International Conference on Computing and Intelligence Systems
Pages: 1222–1225
Volume: 04, Special Issue: March 2015
ISSN: 2278-2397

Machine Learning Approaches and Their Challenges

Raj Mohan Kumaravel¹, Ilango Paramasivam²
¹Research Scholar, School of Information Technology & Engineering, VIT University, Vellore
²Professor, School of Computing Science & Engineering, VIT University, Vellore
E-Mail: k.rajmohan90@gmail.com, pilango@vit.ac.in

Abstract—Real-world data sets are rarely in a proper form: they often contain incomplete or missing values, and identifying a missing attribute is a challenging task. To impute the missing data, data preprocessing has to be carried out; data preprocessing is the data mining step that cleanses the data, and handling missing data is a crucial part of any data mining technique. Major industries and many real-time applications worry about their data, because loss of data holds back a company's growth. For example, the health care industry keeps a great deal of data about patients, and to diagnose a particular patient we need exact data; when missing attribute values exist, it is very difficult to retain the data. Considering the drawback of missing values in the data mining process, many techniques and algorithms have been implemented, and many of them are not efficient. This paper elaborates the various techniques and machine learning approaches to handling missing attribute values and makes a comparative analysis to identify the efficient method.

Keywords—Data mining, data set, impute, missing attributes, preprocessing

I. INTRODUCTION
Incomplete data is very common in large databases. Technically, missing attribute values lead the database into an inconsistent state, and data preprocessing is an essential process for addressing them. Typically the missing values can be replaced by one of many possible approaches, and certain knowledge is needed to predict whether a value is missing or not [1]. Many real-world applications take complicated decisions to handle missing data. For example, in the health care industry, a doctor examining a patient has to check the patient's history to predict the result; and it is not only health care, as many corporate concerns are also worried about their missing data. There are many approaches and techniques for handling incomplete data. Missing attribute values bring several drawbacks, including loss of efficiency, complications in managing and analysing the data, and bias resulting from differences between missing and complete data [2]. In order to avoid these negative effects on the analysis, different approaches are employed to prepare and cleanse the data when missing values are present. This is critical, as many existing industrial and research data sets contain missing values. Missing data lead to imperfection, and hence to a preprocessing stage in which the data can be cleaned completely; this step improves the extraction process and, in consequence, the results obtained by any data mining algorithm. The simplest way of dealing with the problem is to discard the examples with missing values, and the analysis of the remaining complete examples then does not lead to serious problems during inference. In this paper, we compare various machine learning approaches to handling incomplete data.

Types of missing data
MCAR (Missing Completely At Random): values are said to be MCAR if the chance of a data item being missing depends on neither observable nor non-observable parameters; the values go missing purely at random, which occurs rarely.
MAR (Missing At Random): the missingness is related to a particular observed variable, but not to the value of the variable that has missing data; under MAR, which attributes or variables of the data set go missing can be decided from the observed data.
NMAR (Not Missing At Random): the data are missing for a specific reason, i.e. the missingness depends on the missing value itself. This is a common type of missing data to handle.

Action       MAR                  NMAR
Assumption   Weaker               Violated
Parameter    Partial / distinct   Good
Data         Information          No information
Test         Not fit              Not fit
Result       Plausible            Sensitive

Table 1.1 Comparison of MAR and NMAR

International Journal of Computing Algorithm (IJCOA)
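The three mechanisms can be made concrete with a short simulation; the toy records, cutoff and probabilities below are illustrative assumptions, not values from the paper:

```python
import random

random.seed(0)

# Complete toy records: (age, income). We knock out `income` under each
# mechanism so the difference between MCAR, MAR and NMAR is concrete.
data = [(age, 1000 + 40 * age + random.gauss(0, 50)) for age in range(20, 70)]

def mcar(rows, p=0.2):
    # MCAR: missingness is independent of all values, observed or not.
    return [(a, None if random.random() < p else inc) for a, inc in rows]

def mar(rows, cutoff=50):
    # MAR: missingness depends only on the observed variable (age here),
    # not on the value that goes missing.
    return [(a, None if a > cutoff and random.random() < 0.5 else inc)
            for a, inc in rows]

def nmar(rows, threshold=3000):
    # NMAR: missingness depends on the missing value itself
    # (high incomes are withheld), so it cannot be ignored.
    return [(a, None if inc > threshold else inc) for a, inc in rows]

for mech in (mcar, mar, nmar):
    out = mech(data)
    print(mech.__name__, sum(1 for _, inc in out if inc is None))
```

Only under NMAR does the fact that a value is missing carry information about the value itself, which is why NMAR is the hardest case for imputation.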

II. MACHINE LEARNING TECHNIQUES
Missing data is a major problem in many real-time applications, and it can arise in many ways, including non-response to a questionnaire. Many new approaches have been proposed and developed for handling incomplete data [3]. Generally, the simplest treatment of missing data is the family of ignoring techniques, which
simply omit the cases that contain missing data. Rather than removing these cases, however, we can repair them by imputing the data, replacing the holes with accurate values. Many imputation techniques and methods have been proposed for data with missing values, including regression imputation and multiple imputation.
A. Regression Imputation: This is a useful imputation technique, especially for single imputation in regression-based analysis. Here, predicted values replace as much of the missing data as possible. The method is thoroughly a prediction, resting on an assumed linear relationship between attributes [4]; mostly, however, we cannot expect the relationship to be exactly linear. We use the completers to calculate the regression of any incomplete variable on the other, complete variables, which makes regression imputation a good technique for handling incomplete data. Regression here splits into classification and regression: addressing a missing attribute from the complete attributes is called classification, while for continuous incomplete data sets we have to take all of the attributes into account to reconstruct complete data sets. The algorithm's main job is therefore to use the complete attributes to present the output with efficient data.
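A minimal sketch of the procedure, in pure Python with made-up (x, y) pairs: fit the line on the completers, then fill each hole from its observed x. The function names and data are illustrative:

```python
# Single regression imputation: fit y = a + b*x on the complete cases
# ("completers"), then fill each missing y from its observed x.

def fit_line(xs, ys):
    # ordinary least squares for one predictor
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b   # intercept, slope

def regression_impute(rows):
    # rows: (x, y) pairs where y may be None
    complete = [(x, y) for x, y in rows if y is not None]
    a, b = fit_line([x for x, _ in complete], [y for _, y in complete])
    return [(x, y if y is not None else a + b * x) for x, y in rows]

rows = [(1, 2.0), (2, 4.1), (3, None), (4, 8.0), (5, None)]
print(regression_impute(rows))
```

Because every hole is filled from the same fitted line, single regression imputation understates the variance of the imputed attribute, which is one motivation for multiple imputation.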
B. Multiple Imputation: As the name implies, the data are imputed multiple times [5]: each missing value is replaced by a set of n plausible values drawn from its predictive distribution. The overall estimate is then evaluated by analysing each completed data set with complete-data methods, which avoids the problems faced in single imputation. The technique relieves the distortion of the sample variance and produces unbiased estimates, but the data must meet normal-distribution assumptions and the storage requirements grow. Many other techniques and methods have been proposed in the machine learning literature, several of which are also efficient.
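As a rough sketch of the idea (not the full predictive-distribution machinery), the fragment below draws m = 5 plausible fills per hole from the observed mean and spread, analyses each completed set, and pools the estimates; all numbers are invented:

```python
import random
import statistics

random.seed(1)

def impute_once(values):
    # one completed data set: each hole gets a plausible draw
    obs = [v for v in values if v is not None]
    mu, sd = statistics.mean(obs), statistics.stdev(obs)
    return [v if v is not None else random.gauss(mu, sd) for v in values]

def multiple_impute(values, m=5):
    # analyse each completed data set, then pool the m estimates
    estimates = [statistics.mean(impute_once(values)) for _ in range(m)]
    pooled = statistics.mean(estimates)
    between = statistics.variance(estimates)   # between-imputation spread
    return pooled, between

values = [10.0, 12.0, None, 11.0, None, 13.0, 9.0]
pooled, between = multiple_impute(values)
print(round(pooled, 2), round(between, 4))
```

The between-imputation spread is what single imputation throws away: it quantifies how uncertain the analysis is about the values that were never observed.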
C. K-Nearest Neighbour Imputation (KNNI): This is an instance-based algorithm: each occurrence of missing data is treated as an instance, and the imputation computes a value from the instances around it. For nominal attributes, KNNI completes the hole from the nearest neighbours, so a proximity measure has to be taken. KNNI fills the missing attribute by taking the complete, nearest records in the data set: after the neighbouring records are found, the actual attribute value is identified by taking the probable and possible values to replace the missing entry. The approach is not an especially efficient one, because it can lead to replicated data: it substitutes the value

with the maximum-likelihood value already present in the particular data set, and can therefore mislead the attribute towards merely plausible values.
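A small sketch of KNNI over numeric records, with hypothetical rows: distance is taken only over the attributes the incomplete record actually has, and the hole gets the mean of its k nearest complete neighbours (for a nominal attribute one would take the mode instead):

```python
def knn_impute(rows, k=2):
    filled = []
    for i, row in enumerate(rows):
        if None not in row:
            filled.append(list(row))
            continue
        # distance to each complete record, over the observed attributes only
        dists = []
        for j, other in enumerate(rows):
            if j == i or None in other:
                continue
            d = sum((a - b) ** 2 for a, b in zip(row, other) if a is not None)
            dists.append((d, other))
        dists.sort(key=lambda t: t[0])
        neighbours = [o for _, o in dists[:k]]
        # fill each hole with the neighbours' mean for that attribute
        filled.append([v if v is not None
                       else sum(o[c] for o in neighbours) / len(neighbours)
                       for c, v in enumerate(row)])
    return filled

rows = [[1.0, 2.0], [1.1, 2.1], [5.0, 6.0], [1.05, None]]
print(knn_impute(rows))
```

The last record sits next to the first two, so its hole is filled from their values; the distant record [5.0, 6.0] is ignored, which illustrates both the strength of the method and its tendency to replicate values that already exist.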
D. Fuzzy K-Means Clustering: Clustering is the technique of grouping the data into clusters. In fuzzy clustering, each data object has a membership function that describes the degree to which it belongs to a certain cluster [6]; fuzzy K-means clustering is required to update these membership functions. In this process, a data object is not assigned to a single concrete cluster represented by a cluster centroid; instead, the incomplete attributes of each incomplete data object are replaced based on the information in its membership degrees.
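The membership-weighted fill can be sketched as follows; the two centroids are fixed by hand rather than learned, all numbers are illustrative, and the sketch assumes the observed value does not coincide exactly with a centroid:

```python
def memberships(x, centroids, fuzzifier=2.0):
    # Standard fuzzy c-means membership of scalar x in each cluster centre:
    # u_i = 1 / sum_j (d_i / d_j)^(2 / (m - 1))
    out = []
    for c in centroids:
        denom = sum((abs(x - c) / max(abs(x - d), 1e-9)) ** (2 / (fuzzifier - 1))
                    for d in centroids)
        out.append(1.0 / denom)
    return out

def fuzzy_impute(x_obs, centroids_x, centroids_y):
    # fill the missing y as a membership-weighted mix of cluster centroids
    u = memberships(x_obs, centroids_x)
    return sum(ui * cy for ui, cy in zip(u, centroids_y))

# Two clusters: (x ~ 1, y ~ 10) and (x ~ 5, y ~ 50).
# The record (1.2, ?) leans heavily towards the first cluster.
y_hat = fuzzy_impute(1.2, centroids_x=[1.0, 5.0], centroids_y=[10.0, 50.0])
print(round(y_hat, 2))
```

Because memberships sum to one, the fill is a convex combination of centroids: an object between two clusters gets a value between them rather than a hard assignment.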
E. Expectation Maximization Technique: By estimating the mean and covariance matrix, missing data can be imputed with the Expectation Maximization (EM) technique [7]. The steps for EM are: first, estimate from the records the regression parameters among the variables with missing values and compute the mean and covariance matrix. Second, replace each missing value in every record with its expected value, the product of the available values and the estimated regression coefficients. Third, re-estimate the mean and covariance matrix from the completed data set, compute the sample mean of the completed data, and use it to estimate the imputation error.
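For a single pair of variables, the three steps reduce to the sketch below: an E-step that fills each hole with its expectation under the current line, and an M-step that re-estimates the line from the completed data; the data and iteration count are illustrative:

```python
import statistics

def em_impute(rows, iters=20):
    # rows: (x, y) pairs where y may be None
    ys_obs = [y for _, y in rows if y is not None]
    # start from the observed mean of y
    fill = {i: statistics.mean(ys_obs)
            for i, (_, y) in enumerate(rows) if y is None}
    for _ in range(iters):
        xs = [x for x, _ in rows]
        ys = [fill.get(i, y) for i, (_, y) in enumerate(rows)]
        mx, my = statistics.mean(xs), statistics.mean(ys)
        b = sum((x - mx) * (yv - my) for x, yv in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
        a = my - b * mx                                 # M-step: refit line
        fill = {i: a + b * rows[i][0] for i in fill}    # E-step: expected y
    return [fill.get(i, y) for i, (_, y) in enumerate(rows)]

rows = [(1, 2.0), (2, 4.0), (3, None), (4, 8.0)]
print(em_impute(rows))
```

With the observed pairs lying on y = 2x, the loop converges to the consistent fill y = 6 for x = 3; the full method does the same with a mean vector and covariance matrix instead of one line.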
F. Support Vector Machine: SVMs analyse data and recognise patterns, and are used for classification- and regression-based analysis. Given a set of training examples, the training algorithm builds a model that assigns new examples to one category or the other, making it a form of binary linear classifier. Formally, an SVM constructs a hyperplane in a high- or infinite-dimensional space, which can then be used for various tasks [8]. To keep the computational load reasonable, the mappings used by SVM schemes are designed so that dot products can be computed easily in terms of the variables in the original space, by defining a kernel function K(x, y).
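The kernel-function point can be checked directly: for the polynomial kernel K(x, y) = (x·y + 1)², the kernel value equals the dot product of explicit quadratic feature maps, so the high-dimensional product never has to be formed. The feature map below is the standard one for this kernel in two dimensions:

```python
import math

def kernel(x, y):
    # polynomial kernel of degree 2, computed in the original space
    return (sum(a * b for a, b in zip(x, y)) + 1) ** 2

def phi(v):
    # explicit 6-dimensional feature map matching the kernel above
    x1, x2 = v
    return [x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2,
            math.sqrt(2) * x1, math.sqrt(2) * x2, 1.0]

x, y = [1.0, 2.0], [3.0, 0.5]
lhs = kernel(x, y)                                  # cheap, 2-D arithmetic
rhs = sum(a * b for a, b in zip(phi(x), phi(y)))    # explicit 6-D product
print(lhs, rhs)
```

A full SVM would plug K into its optimiser in place of every dot product; this is what lets it separate data with a hyperplane in the mapped space while only ever touching the original variables.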
G. Outlier Detection: Data analysis typically involves a large number of variables being recorded or sampled, and one of the steps towards a coherent analysis is the detection of outlying observations. An outlier is an observation that appears to deviate markedly from the other observations in the sample, and it may indicate wrong data: the value may have been coded incorrectly, or an experiment may not have been run correctly. If an outlying point can be determined to be erroneous, it should be deleted from the analysis; in some cases, however, it is not possible to decide whether an outlying point is bad data [9]. Outliers may be due to random variation, or may indicate something scientifically interesting. Labeling, accommodation and identification are the three issues in outlier detection.

Outlier detection is thus a technique for judging which values fit into the data set, in order to avoid missing attributes.
Labeling flags potential outliers for further investigation; this is nothing but identifying an unusual value in the data set.
Accommodation uses robust statistical techniques that are not affected by outliers, for the case where we cannot determine whether the potential outliers are erroneous observations.
Identification formally tests whether observations are outliers, and can identify an attribute whose value is actually missing. Through outlier detection, values and attributes that do not fit the data set are ignored and eliminated.
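The labeling step can be sketched with the common IQR fence rule, one of many possible rules; the sample, and the miscoded 42.0 in it, are invented:

```python
import statistics

def label_outliers(values, k=1.5):
    # Flag values outside [Q1 - k*IQR, Q3 + k*IQR]. Quartiles are robust,
    # so the outlier cannot drag the fences the way it would drag a mean.
    qs = statistics.quantiles(values, n=4)
    q1, q3 = qs[0], qs[2]
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 42.0]   # 42.0 is miscoded
print(label_outliers(sample))
```

A flagged value is only a candidate: as the text notes, it must still be investigated before being deleted, accommodated, or kept as something scientifically interesting.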
H. Iterative Linear Fitting Method (ILF): This method belongs to the category of regression-based methods, which substitute the missing data based on the maximum-likelihood function under specific modelling assumptions [10]. For simplicity, a linear regression model is assumed for the data set, and the method predicts the missing attribute values for each attribute in turn. An iterative method is a mathematical procedure that generates a sequence of approximate values for a class of problems from an initial approximation; one whose sequence settles is called a convergent method. Iterative linear fitting is an efficient algorithm for predicting the assigned values and addressing missing data. The machine learning approaches above are thus firmly directed at avoiding missing attributes and cleansing the data.
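Under these assumptions, a toy version of the method looks as follows: mean-fill as the initial approximation, then refit a line for each incomplete attribute in turn and overwrite its holes, iterating until the sequence of approximations settles. The data and iteration count are illustrative:

```python
import statistics

def fit(xs, ys):
    # least-squares line ys ~ a + b * xs
    mx, my = statistics.mean(xs), statistics.mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def ilf(rows, iters=25):
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_holes = [i for i, v in enumerate(xs) if v is None]
    y_holes = [i for i, v in enumerate(ys) if v is None]
    # initial approximation: fill each hole with the observed mean
    mx0 = statistics.mean([v for v in xs if v is not None])
    my0 = statistics.mean([v for v in ys if v is not None])
    for i in x_holes:
        xs[i] = mx0
    for i in y_holes:
        ys[i] = my0
    for _ in range(iters):
        a, b = fit(xs, ys)            # refit y on x, overwrite y-holes
        for i in y_holes:
            ys[i] = a + b * xs[i]
        c, d = fit(ys, xs)            # refit x on y, overwrite x-holes
        for i in x_holes:
            xs[i] = c + d * ys[i]
    return list(zip(xs, ys))

rows = [(1.0, 3.0), (2.0, 5.0), (None, 7.0), (4.0, 9.0), (5.0, None)]
print(ilf(rows))
```

With the complete pairs lying on y = 2x + 1, the fills converge towards the consistent values x = 3 and y = 11, illustrating the convergent sequence of approximations the text describes.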
Technique           Variable type   Substitution     Possibility
Regression          Incomplete      Prediction       No
Multiple            Incomplete      Estimation       Yes
KNNI                Incomplete      Prediction       Yes
Fuzzy k-means       Probable        Initialization   Yes
E-M                 Iterative       Initialization   Yes
SVM                 Incomplete      Prediction       Yes
Outlier detection   Incomplete      Distribution     Yes
ILF                 Numerical       Iteration        Yes

Table 2.1 Comparison of MLT

III. REAL WORLD APPLICATIONS
A. Bioinformatics: Bioinformatics has become an important part of many areas of biology. Data that include images and signals allow, with suitable processing, the extraction of useful results from large amounts of raw data; interpreting biological data with computational tools is, broadly, what bioinformatics does. In the medical industry it generally spans databases, analysis and statistical algorithms. Biological data include DNA, RNA, protein, 3D structure, genomic DNA, metabolic data and so on.


The bioinformatics industry has many applications, such as molecular medicine, personalized medicine, gene therapy, drug development, waste cleanup, biotechnology and antibiotic resistance. Maintaining databases such as protein-sequence databases, secondary databases, protein-pattern databases and structural-classification databases is a major challenge. If any of these databases has missing attributes, we should apply data mining algorithms to follow up the missing data efficiently, and many new machine learning approaches have been proposed to make maintaining the data easier. Bioinformatics is a major application of machine learning for various processes: there are many real-time data sets in the various repositories, and many sample data sets have been tested with machine learning techniques in order to recover the actual attribute values.
B. Database Marketing: Database marketing (DBM) is a major trend and an improved form of direct marketing. It is an interactive approach to marketing that uses individually addressable marketing media and channels to reach a company's target audience, estimate their demand, and maintain the database electronically. Marketing draws on many sources of data, including consumer data, business data, analytics and modeling. As the name implies, DBM can be used by any organization whose customer data are available. It is an important real-world application, since marketing serves varied needs: companies maintain their strategy with many techniques, each differing across the corporate world.
Users often build elaborate databases for maintaining customer information, which may include a variety of data such as name and address. In B2B (business-to-business) marketing, the customers are themselves companies, which keep and maintain such databases.
C. Pattern Recognition: Pattern recognition systems are generally categorized by the type of learning procedure used to generate the output value. In supervised learning, a set of training data is provided, consisting of instances that have been properly hand-labeled with the correct output. Within medical science, pattern recognition is the basis of computer-aided diagnosis systems. Many machine learning algorithms have been proposed, including clustering, neural networks, regression-based methods and sequence-labeling algorithms, to raise data quality so that no missing data remain. In the health care industry, various parameters and data sets are available, so an attribute that is missing can be identified by considering the patterns suited to the data set in the form of exact cluster groups. These cluster groups can be used to identify the various missing attributes in
the data sets. Pattern recognition is a major real-world application, mainly in the medical industry.
D. Robot Locomotion: The word robot suggests removing human intervention from data handling. Robot locomotion work was undertaken especially to develop the capability of robots to decide autonomously how to move. Many types of robots have been developed for many human needs, along with ways of predicting how a task gets completed. How can machine learning help here? The techniques are fed with large amounts of data, and if any direction or other potential value is missing, it causes real problems; new machine learning approaches can therefore be used to make good use of robots. Locomotion is simply the movement a robot must make to perform a task: there are various movements and actions a robot can take, and many dimensional and approximate values to identify. Machine learning approaches make this process easier: if any of the dimensional values are missing, they can be identified by predicting them with regression and classification. Various supervised and unsupervised learning methods have been developed for such real-world applications.

IV. CHALLENGES
As data grow larger, and even with many machine learning approaches and techniques available, some loss of quality remains. Formally, many challenges arise from missing attributes, and they are mainly reflected in the quality of data. Many real-world applications work with huge amounts of data, and if any of it is missing, that becomes a major concern. Filling the missing values with equivalent probable values, simply eliminating the affected group, or ignoring the missing data may all lead to a loss of efficiency, so the data should be declared missing before data preprocessing begins. Although many new techniques have impressed companies, and companies are picking some of them up, drawbacks remain. The main challenge in addressing missing attribute values is the loss of quality, which drags the data down. To make the data more formal, we aim to predict and replace the values exactly; but replacement is a route we may not be happy with, and other techniques can be pursued instead. We have identified the major challenges faced by many real-time applications, and some drawbacks of the present machine learning approaches.

V. CONCLUSION
In this paper we have briefly discussed the various techniques for missing attributes, and the applications that broadly face this type of missing attribute value. We have discussed many machine learning approaches, since many mechanisms exist to handle missing attribute values. Many techniques came with added advantages, yet there are drawbacks in many machine learning approaches. Taking this as the major perspective, researchers may move on to evaluating the missing data by calculating it manually with mathematical formulae, or with statistical software that can recover the data actually missing from the data set. Many of the algorithms, however, do not meet the metric of efficiency in data; considering this, new algorithms and techniques can be implemented to eradicate missing attributes completely, and new, efficient methods proposed to achieve quality data. The presence of missing attributes can lead the database into an inconsistent state; to avoid this, the data must be processed and cleaned accordingly. Cleansing the data is the most efficient way to eradicate missing attribute values, and with every approach proposed to that end, we can achieve quality data with no missing attributes.

REFERENCES
[1] Y. S. Su, A. Gelman, J. Hill and M. Yajima, "Multiple Imputation with Diagnostics (mi) in R: Opening Windows into the Black Box," Journal of Statistical Software, 2014.
[2] R. J. A. Little and D. B. Rubin, Statistical Analysis with Missing Data, Wiley, New Jersey, 2013.
[3] A. Misrli, A. Benes and R. Kale, "AI-based software defect predictors: Applications and benefits in a case study," AI Magazine, 2013.
[4] S. Y. Wang and C. C. Lin, "NCTUns 5.0: A Network Simulator for IEEE 802.11(p) and 1609 Wireless Vehicular Network Researches," Second IEEE Int. Symp. on Wireless Vehicular Communications, Calgary, Canada, 2013.
[5] E. Acar and B. Yener, "Unsupervised multiway data analysis: A literature survey," 2012.
[6] E. Acuna and C. Rodriguez, Classification, Clustering and Data Mining Applications, Springer, Berlin, pp. 639–648, 2011.
[7] J. Alcalá-Fdez, L. Sánchez, S. García, M. J. del Jesus, S. Ventura, J. M. Garrell, J. Otero, J. Bacardit, V. M. Rivas, J. C. Fernández and F. Herrera, "KEEL: a software tool to assess evolutionary algorithms for data mining problems," Soft Computing 13(3):307–318, 2011.
[8] J. Luengo, S. García and F. Herrera, "A study on the use of imputation methods for experimentation with Radial Basis Function Network classifiers handling missing attribute values: the good synergy between RBFNs and Event Covering method," Neural Networks 23(3):406–418, 2010.
[9] B. Qin, Y. Xia and S. Prabhakar, "Rule induction for uncertain data," Knowledge and Information Systems, doi:10.1007/s10115-010-0335-7, pp. 1–2, 2010.
[10] H. Wang and S. Wang, "Mining incomplete survey data through classification," Knowledge and Information Systems 24(2):221–233, 2010.
[11] C. Peng and J. Zhu, "Comparison of two approaches for handling missing covariates in logistic regression," 68(1):58–77, 2008.
[12] A. Farhangfar et al., "A novel framework for imputation of missing values in databases," IEEE.