

Performance Analysis of Machine Learning
Classifiers for ASD Screening
Alrence Santiago Halibas, Faculty of Computing Sciences, Gulf College, Muscat, Oman – alrence@gulfcollege.edu.om
Leslyn Bonachita Reazol, College of Computer Studies, La Salle University, Ozamiz City, Philippines – leslyn.reazol@lsu.edu.ph
Erbeth Gerald Tongco Delvo, College of Computer Studies, La Salle University, Ozamiz City, Philippines – erbethgerald.delvo@lsu.edu.ph
Jannette Cui Tibudan, College of Engineering, St. Peter's College, Iligan City, Philippines – jctibudan@gmail.com

Abstract— Several machine learning classifiers have been used for Autism Spectrum Disorder screening; however, the literature on finding the best classifier for this application domain is inadequate. Hence, this paper presents a comparison of five (5) supervised machine learning algorithms, Decision Tree, Naïve Bayes, k-nn, Random Tree, and Deep Learning, using small datasets (n=1100) on child, adolescent, and adult ASD screening to find the most appropriate classifier. These algorithms are evaluated and compared against each other using a broad set of prediction performance metrics, including accuracy, precision/recall measures, and Receiver Operating Characteristics. The experimental results suggest that the Deep Learning classifier gives the best performance (more than 96%) in almost all metrics, while the Random Tree classifier came out as the least performing classifier in all the performance metrics.

Keywords— Autism, Bioinformatics, Machine Learning, Binary Classification, Deep Learning

I. INTRODUCTION

Autism, otherwise known as Autism Spectrum Disorder (ASD), is a complex mental condition that is exhibited from early childhood and is primarily characterized by communication and social difficulties. It affects 1 in 150 children worldwide [1]. Early diagnosis and intervention are seen to be the best treatment for this disorder; hence, increasing interest in and approaches to autism understanding and diagnosis are prevalent [2].

Nowadays, machine learning algorithms are increasingly used in medical science [3], transforming biomedical data into valuable knowledge. Machine learning is widely used in bioinformatics to build predictive models for the detection and diagnosis of diseases [4], medical image segmentation [5], gene finding, protein folding prediction, and many other tasks [3]. Supervised machine learning is now increasingly applied to various bioinformatics problems [6]. Research on machine learning to enhance autism diagnosis is seen to have potential and usefulness [7]. In fact, several studies on autism diagnostics using machine learning have already been carried out [8], [9]. The ongoing medical research in this field is attributed to the increasing availability of online datasets and low-cost computing [10]. In this regard, this study referenced the work of [11] on ASD screening. An artefact of the author's work is a mobile application called ASD Test that is available on Google Play and the Mac App Store [12]. The application allows users to answer ten autism-related questions and suggests the likelihood of having autistic traits. The datasets of this application are utilized for this study.

Furthermore, this study attempts to determine the most appropriate binary classification model for the given datasets. It explores different machine learning algorithms that can effectively classify whether a child, adolescent, or adult is likely an ASD candidate. Likewise, this paper tries to obtain a good estimate and comparison of the algorithms' prediction performance in terms of accuracy and classification error, precision and class recall, and Receiver Operating Characteristics (ROC).

The succeeding sections of this paper are organized as follows: Section II presents background information on the chosen machine learning algorithms and performance metrics, Section III presents the experimental method, including the datasets and software used, Section IV presents the performance analysis and results, and Section V presents the conclusion and future work.

II. BACKGROUND INFORMATION

A. Supervised Machine Learning

Machine Learning is a branch of Artificial Intelligence (AI) that learns and discovers meaningful patterns in data to make predictions [13]. It is built on mathematical principles of probability and statistics as well as computer science. Supervised learning is a machine learning technique that learns a mapping from input variables to output variables and uses this mapping for prediction [14]. Simply put, it is learning from examples using two sets of data, a training set and a test set [15].

A typical supervised learning problem, as shown in Fig. 1, contains an instance space X of objects or attributes, a label space Y, and a prediction space Y'. [16] defines classification as the task of learning a model that maps each attribute set x to one of the predefined class labels y.

Fig. 1 A Classification Model (Input: Attribute Set (x) → Classification Model → Output: Class or Label (y))

A training sample (attribute) set is denoted as S = ((x1,y1),…,(xm,ym)) ∈ (X × Y)^m, which contains a predetermined number of examples where each xi ∈ X. The output is a model hS : X → Y' that learns from the sample set [17].



In a binary classification problem, objects in the sample set are associated with two classes (or labels), Y = {±1} = {−1, +1}, and the model predicts the class of new instances, Y' = Y = {±1}.

B. Classification Algorithms

The supervised classification algorithms that are relevant to this study are discussed in the succeeding subsections.

1. Decision Tree

A decision tree consists of three (3) types of nodes: a root node, internal nodes with one incoming edge and two or more outgoing edges, and leaf or terminal nodes with only one incoming edge and no outgoing edges. The root and internal nodes contain the attribute test conditions, whereas the leaf nodes contain the class labels. A decision tree is illustrated in Fig. 2.

Fig. 2 Decision Tree classifier (a trained classification model routing a new example to a class)

A test example is classified in the decision tree by repeatedly applying the test conditions to the example, starting from the root node and following the respective branches of the tree until it terminates at a leaf node.

A decision tree affords a fast and effective method of categorizing data sets [18]. It is usually used for classification when the response variable contains only two (2) categories. The major advantage of a decision tree is its ability to present a visual representation of the decision structure based on actual values of the attributes; thus it can be very useful for extracting valuable insights. Furthermore, a decision tree can handle categorical and numerical variables, both of which are present in the datasets.
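For readers who want to experiment outside Rapidminer, a minimal sketch using scikit-learn is given below; the toy attribute matrix, feature names, and depth limit are illustrative assumptions rather than the configuration used in this study.

```python
# Minimal decision tree sketch (scikit-learn), not the Rapidminer operator used in this study.
from sklearn.tree import DecisionTreeClassifier, export_text

# Placeholder screening-style data: each row is an attribute vector, each label an ASD likelihood.
X = [[1, 0, 1, 0], [0, 0, 1, 1], [1, 1, 1, 0], [0, 0, 0, 0]]
y = ["YES", "NO", "YES", "NO"]

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X, y)

# The learned structure mirrors the root/internal/leaf description above.
print(export_text(tree, feature_names=["Q1", "Q2", "Q3", "Q4"]))
print(tree.predict([[1, 0, 0, 0]]))  # classify a new example by walking root -> leaf
```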
2. Naïve Bayes

Naïve Bayes is a learning classifier that is based on Bayes' rule for solving classification problems [16]. The term "naïve" comes from the main assumption of conditional independence of its features. Simply put, this classifier assumes that the features independently contribute to the probability of the model, as illustrated in the succeeding formula, where the variable c is the dependent class label and F1, …, Fn are the feature variables.

P(c | F1, F2, …, Fn) = P(c) P(F1 | c) P(F2 | c) ⋯ P(Fn | c) / P(F1, F2, …, Fn)        (1)

Equation (1) is based on Bayes' Theorem for classification, where each attribute and the class label are considered random variables. The goal is to predict the class c, choosing the value of c that maximizes P(c | F1, F2, …, Fn) for an example with attributes (F1, F2, …, Fn).

According to [19], Naïve Bayes is one of the most effective and efficient learning algorithms for machine learning and data mining, with good prediction performance. It can be efficiently trained in a supervised setting. It also requires little training data and can likewise be used for moderate and large training data sets with numerous attributes. Lastly, this classifier can handle missing values.
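A minimal sketch of this rule, assuming a toy table of binary screening answers and add-one smoothing, is shown below; it simply multiplies the class prior by the per-feature likelihoods as in (1).

```python
# Hedged sketch of the naive Bayes rule in (1): P(c | F1..Fn) is proportional to P(c) * prod P(Fi | c).
# Toy binary screening answers; the counts and add-one smoothing are illustrative choices.
from collections import Counter, defaultdict

data = [([1, 0, 1], "YES"), ([1, 1, 1], "YES"), ([0, 0, 1], "NO"), ([0, 0, 0], "NO")]
classes = Counter(label for _, label in data)
feature_counts = defaultdict(lambda: defaultdict(Counter))
for features, label in data:
    for i, value in enumerate(features):
        feature_counts[label][i][value] += 1

def posterior(features):
    scores = {}
    for c, n_c in classes.items():
        p = n_c / len(data)                       # prior P(c)
        for i, value in enumerate(features):      # likelihoods P(Fi | c), add-one smoothed
            p *= (feature_counts[c][i][value] + 1) / (n_c + 2)
        scores[c] = p
    return scores                                 # unnormalized P(c | F1..Fn)

print(posterior([1, 0, 1]))  # the class with the larger score is predicted
```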
3. K-nearest neighbors (k-nn)

k-nn is an instance-based machine learning method for classification and regression. It is one of the most common and simplest classification algorithms [20], with no explicit training or model; thus, it is referred to as "lazy learning". The algorithm simply stores the training observations, compares new observations to them using a similarity measure, and references the saved data at test time to make predictions [21]. This algorithm can easily handle missing values.

There are several choices for computing distances, depending on the type of the features in the data. A mixed distance measure is used if the data contain both real and binary values: Euclidean distance is used for the real part, whereas Hamming distance is used for the binary part. Rai [22] provides the distance computations found in (2) and (3). The sample data set contains j attributes that form a j-dimensional vector, x = (x1, x2, …, xj), and the given training data is represented by D = {(x1, y1), …, (xj, yj)}, where xi is the input and yi is the output/label. Each xi is a vector of j features, and xim denotes the m-th feature of xi. The output is yi ∈ {1, …, C}.

Euclidean distance:
d(xi, xj) = sqrt( Σm (xim − xjm)² )        (2)

Hamming distance:
d(xi, xj) = Σm |xim − xjm|        (3)

The algorithm requires the distance function d(xi, xj) and the value k, which is the number of nearest neighbors. Kozma [23] summarizes the k-nn algorithm as follows (a short Python sketch follows the list):
i. Calculate the distances of the test point from each training point
ii. Sort the distances in ascending order
iii. Select the k nearest neighbors
iv. Take the majority vote among them
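The steps above can be sketched in Python as follows; the split between real and binary features, the value k = 3, and the sample points are illustrative assumptions, and the mixed distance simply adds (2) and (3).

```python
# Hedged k-nn sketch following steps i-iv, with a mixed distance: Euclidean on real
# features, Hamming on binary features. The feature split and k = 3 are illustrative.
import math
from collections import Counter

def mixed_distance(a, b, n_real):
    real = math.sqrt(sum((a[i] - b[i]) ** 2 for i in range(n_real)))   # Euclidean part, as in (2)
    binary = sum(a[i] != b[i] for i in range(n_real, len(a)))          # Hamming part, as in (3)
    return real + binary

def knn_predict(train, query, k=3, n_real=1):
    # i. compute distances, ii. sort ascending, iii. take the k nearest, iv. majority vote
    distances = sorted((mixed_distance(x, query, n_real), y) for x, y in train)
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

train = [((21.0, 1, 0, 1), "YES"), ((30.0, 0, 0, 0), "NO"),
         ((18.0, 1, 1, 1), "YES"), ((40.0, 0, 1, 0), "NO")]
print(knn_predict(train, (25.0, 1, 0, 1)))
```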
4. Random Tree

A Random Tree is much like a Decision Tree, except that for each split it performs, only a random subset of attributes is available [24].

5. Deep Learning
Deep learning is a multi-layer feed-forward neural network that learns from complex patterns of data. It is an upgrade to the neural network that contains an input layer, a large number of hidden layers, and an output layer to model data, much like the one illustrated in Fig. 3.

Fig. 3 Feed-forward back propagation neural network

The input layer contains the input vectors xi = [xi1, …, xip]^T, where p is the number of input variables. Likewise, the hidden layer contains s neurons having a sigmoid transfer function (f1). Another sigmoid activation function (f2) is found in the output layer. Deep Learning uses a backpropagation algorithm that defines the changes in the internal parameters [25], calculated based on the weight matrices of the first and second layers (W1 and W2) and the biases (b1 and b2). The output of the neural network is given in (4). More details about this network are available in [26, Fig. 1].

y = f2(W2 · f1(W1 · x + b1) + b2)        (4)

Deep Learning exhibits outstanding performance in supervised learning tasks [27]. Hence, its use is widespread in medical research, and it has become a new breakthrough because of its remarkable results at discovering representations [28].
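A toy NumPy sketch of the mapping in (4) is given below; the layer sizes, random weights, and single hidden layer are placeholder assumptions, no training or backpropagation is shown, and the study itself uses Rapidminer's Deep Learning operator rather than this hand-written network.

```python
# Toy NumPy sketch of the feed-forward mapping in (4): y = f2(W2 * f1(W1 * x + b1) + b2).
# Layer sizes and random weights are placeholders; no training or backpropagation is shown.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
p, s = 16, 8                       # p input variables, s hidden neurons (illustrative sizes)
W1, b1 = rng.normal(size=(s, p)), np.zeros(s)
W2, b2 = rng.normal(size=(1, s)), np.zeros(1)

x = rng.random(p)                  # one input vector x = [x1, ..., xp]
hidden = sigmoid(W1 @ x + b1)      # f1 applied to the hidden layer
y = sigmoid(W2 @ hidden + b2)      # f2 gives the predicted probability of the positive class
print(y)
```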
C. Performance Metrics

This sub-section discusses the evaluation criteria that are used to measure the quality of a classifier. In evaluating a model's performance, several factors come into play, namely accuracy, classification error, precision and recall, ROC, and others. These metrics are computed using the four (4) outcomes mentioned below:

• True Positive (tp) – the number of positive examples in the test data set that are classified as positive
• False Positive (fp) – the number of negative examples in the test data set that are classified as positive
• True Negative (tn) – the number of negative examples in the test data set that are classified as negative
• False Negative (fn) – the number of positive examples in the test data set that are classified as negative

There are no predefined metrics for evaluating a classification problem. Different metrics provide different insights, and the choice of metrics is relative to the learning problem. Nonetheless, this experiment used the following four (4) performance metrics:

1. Accuracy and Classification Error

Accuracy is often the most common performance metric for prediction. Achieving high accuracy in modeling is the topmost factor, and in most cases when accuracy is difficult to achieve, using approximation techniques will suffice. The training time of a model increases as the number of parameters increases; however, having numerous parameters in a model is also indicative of greater flexibility [29]. The formulas for Accuracy and Classification Error are given in (5) and (6).

Accuracy = (tp + tn) / (tp + tn + fp + fn)        (5)

Classification Error = (fp + fn) / (tp + tn + fp + fn)        (6)

2. Precision and Recall

Precision is also referred to as the Positive Predictive Value (PPV). The precision for a class is computed as the number of true positives divided by the sum of the true positives and false positives. The formulas for Precision and Recall are given in (7) and (8).

Precision = tp / (tp + fp)        (7)

Recall is also referred to as the True Positive Rate (TPR), Sensitivity, or Hit Rate. The recall for a class is computed as the number of true positives divided by the sum of the true positives and false negatives.

Recall = tp / (tp + fn)        (8)
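The four outcomes defined above are sufficient to compute (5) through (8) directly; a short helper is sketched below with purely illustrative counts.

```python
# Hedged helpers for (5)-(8), computed from the four outcomes defined above.
def metrics(tp, fp, tn, fn):
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,               # (5)
        "classification_error": (fp + fn) / total,   # (6)
        "precision": tp / (tp + fp),                 # (7) positive predictive value
        "recall": tp / (tp + fn),                    # (8) true positive rate / sensitivity
    }

# Illustrative counts only, not results from this study.
print(metrics(tp=50, fp=5, tn=40, fn=5))
```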
3. Receiver Operating Characteristic (ROC)

ROC graphs were first used in medical research and have gradually gained popularity in machine learning and data mining research. ROC plots or graphs are utilized for evaluating the performance of, and visualizing, binary classification models [30].

An ROC curve is a graphical plot that demonstrates the analytical ability of a classifier. The curve is plotted with the True Positive Rate (TPR) on the y-axis and the False Positive Rate (FPR), also known as Fallout (1 – Specificity), on the x-axis. The ROC curve is a preferred choice for selecting models that give optimal performance because it is less sensitive to the class sizes.
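A hedged sketch of such a plot, using toy labels and scores with scikit-learn and matplotlib, is shown below; the values are illustrative, not results from this study.

```python
# Hedged ROC sketch: TPR vs FPR computed from toy labels and scores (scikit-learn).
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [1, 1, 1, 0, 0, 1, 0, 0]                       # illustrative labels (1 = ASD-like)
y_score = [0.9, 0.8, 0.35, 0.4, 0.2, 0.7, 0.6, 0.1]     # illustrative classifier confidences

fpr, tpr, _ = roc_curve(y_true, y_score)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_true, y_score):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")                 # chance diagonal
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Recall)")
plt.legend()
plt.show()
```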
III. EXPERIMENTAL METHOD

A. Selection of Algorithms

This study used five (5) classification algorithms to predict a category, namely Decision Tree, Naïve Bayes, k-nn, Random Tree, and Deep Learning. This list is not exhaustive of the applicable algorithms; there are other algorithms that can be applied to this two-class prediction problem. Furthermore, these algorithms were chosen using the following criteria: the ability to handle different column types (numerical, binary, and categorical), a binary target type, a small number of columns and rows, and the ability to handle missing values.

B. Software Specification

1. Datasets Used

This experiment used three (3) standard datasets that can be freely downloaded from the UCI Machine Learning Repository [31]. These are as follows:

• ASD Screening Data for Child Dataset, containing 292 examples (151 non-ASD cases, 141 ASD-like cases)
• ASD Screening Data for Adolescent Dataset, containing 104 examples (41 non-ASD cases, 63 ASD-like cases)
• ASD Screening Data for Adult Dataset, containing 704 examples (515 non-ASD cases, 189 ASD-like cases)

These datasets have been cleaned and prepared for machine learning. From the three (3) datasets, only 16 of the 21 attributes are used for this study, with ASD_Likelihood chosen as the class label. The other attributes are irrelevant to the learning problem. The common description of the datasets is summarized in Table 1.

Table 1 Problem Description
Attribute           Data Type     Role
Q1–Q10              Integer       Attribute
Age                 Polynomial    Attribute
Gender              Polynomial    Attribute
Ethnicity           Polynomial    Attribute
Jaundice_History    Polynomial    Attribute
ASD_History         Polynomial    Attribute
ASD_Likelihood      Binomial      Label
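A possible loading step in Python is sketched below; the file name and the column names (taken from Table 1) are assumptions, since the raw UCI files may use different headers.

```python
# Hedged loading sketch using pandas. The file name and column names follow Table 1's
# naming and are assumptions; the raw UCI files may use different headers.
import pandas as pd

columns = [f"Q{i}" for i in range(1, 11)] + [
    "Age", "Gender", "Ethnicity", "Jaundice_History", "ASD_History", "ASD_Likelihood"
]

df = pd.read_csv("Autism-Child-Data.csv")     # assumed local export of one UCI dataset
df = df[columns].dropna()                     # keep the 16 attributes used in this study

X = df.drop(columns="ASD_Likelihood")         # attribute set (x)
y = df["ASD_Likelihood"]                      # binomial label (y)
print(X.shape, y.value_counts())
```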

2. Software Resources

This experiment used Rapidminer, a commercial and open-source software platform for data science. It performs machine learning, data mining, predictive analytics, and much more, and it has an integrated environment for data preparation, modeling, evaluation, and deployment.

3. Process Model

The Rapidminer process model, as shown in Fig. 4, involves four operators: the Read Excel operator, which loads the example set; the Set Role operator, which sets the label role; the Generate Weight (Stratification) operator, which distributes the weight over the example set; and the Validation operator, which trains, tests, and measures the performance of a learning operator.

Inside the Validation operator are three operators: the classifier operator; the Apply Model operator, which applies a model; and the Performance operator, which evaluates the performance of the classifier. This experiment used different classifier operators, namely Decision Tree, Naïve Bayes, k-nn, Random Tree, and Deep Learning.

Fig. 4 Process Model
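Although the study performs these steps in Rapidminer, a rough scikit-learn analogue is sketched below; the mapping of operators (one-hot encoding for the attribute typing, stratified cross-validation for the Validation operator) and the choice of Naïve Bayes as the classifier are assumptions made for illustration only.

```python
# Rough scikit-learn analogue of the Rapidminer process: load the example set, one-hot
# encode the categorical attributes, and score a classifier with stratified 10-fold
# cross-validation. Operator correspondences are approximate, not exact.
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("Autism-Child-Data.csv")          # assumed local file, as in the loading sketch
X = df.drop(columns="ASD_Likelihood")              # attribute set (Set Role analogue)
y = df["ASD_Likelihood"]                           # binomial label

categorical = ["Age", "Gender", "Ethnicity", "Jaundice_History", "ASD_History"]  # assumed names
encode = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), categorical),
    remainder="passthrough",                       # Q1-Q10 pass through unchanged
)
model = make_pipeline(encode, BernoulliNB())       # swap in any of the five classifiers here

scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")  # Validation analogue
print(scores.mean())
```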
IV. PERFORMANCE ANALYSIS AND RESULTS

1. Accuracy

Using the three (3) datasets, Deep Learning, with 96.38% average accuracy, outperformed the other classifiers, as shown in Fig. 5. It is followed by Naïve Bayes with 90.30% average accuracy, k-nn with 88.89%, Decision Tree with 85.87%, and lastly, Random Tree with 72.74%. Random Tree is the least accurate classifier compared with the others.

Fig. 5 Comparison of Accuracy Levels

2. Precision and Recall

The confusion matrix (also known as a contingency table) presents the number of correct and incorrect predictions as well as the precision and recall values. The matrix, as shown in Fig. 6, displays a visualization of the classification results and summarizes the precision and recall percentages using the three (3) datasets.
Class precision (%)
                             Decision Tree   Naïve Bayes    k-nn    Random Tree   Deep Learning
Pred. NO class precision          87.13          82.79      97.15       72.86          92.21
Pred. YES class precision         85.77          95.72      86.59       40.00         100.00
Overall average                   86.45          89.25      91.87       56.43          96.10

Class recall (%)
                             Decision Tree   Naïve Bayes    k-nn    Random Tree   Deep Learning
True NO class recall              85.22          95.51      81.71       83.33         100.00
True YES class recall             84.21          82.45      92.10       42.10          92.98
Overall average                   84.71          88.98      86.90       62.72          96.49

Fig. 6 Confusion Matrix Summary

As shown in the class precision table, the Deep Learning classifier achieved the best average class precision (96.10%) and class recall (96.49%). It is followed by the Naïve Bayes, k-nn, and Decision Tree classifiers, which have close precision and class recall percentages. On the contrary, Random Tree trailed behind the other classifiers with only 56.43% class precision and 62.72% class recall.
3. ROC

In Rapidminer, the Compare Receiver Operating Characteristic (ROC) operator is used to produce the ROC curves of several models in the same plotter diagram. The plot of the curve indicates the true positive rate against the false positive rate. Likewise, the ROC curve is plotted in an optimistic manner: it identifies the examples with high confidence first and thereafter plots the succeeding examples with decreasing confidence.

As seen in Fig. 7, the ROC curves for the different models using the three datasets are graphically plotted for comparison and ranked accordingly. A curve plotted at the upper left of the ROC space, with coordinates (0, 1), signifies the best possible prediction. This means that the classifier has achieved a 100% true positive rate and a 0% false positive rate and is thus considered the most accurate classifier.

ROC Performance
Dataset       1st             2nd             3rd           4th           5th
Child         Deep Learning   k-nn            Naïve Bayes   Random Tree   Decision Tree
Adolescent    k-nn            Deep Learning   Naïve Bayes   Random Tree   Decision Tree
Adult         Deep Learning   k-nn            Naïve Bayes   Random Tree   Decision Tree

Fig. 7 ROC curves for the ASD datasets
As summarized in the ROC Performance table, Deep Learning and k-nn are the top-performing classifiers according to their ROC curves. Conversely, Naïve Bayes, Random Tree, and Decision Tree consistently hold the 3rd, 4th, and 5th places, respectively, in terms of ROC performance.

V. CONCLUSION AND FUTURE WORK

This paper has provided a comparative study of the prediction performance of several learning classifiers. It investigated which model(s) provide high accuracy, good precision and recall percentages, and good ROC performance for the given machine learning problem. This study concludes that the Deep Learning classifier achieved the highest classification performance in terms of accuracy, precision and recall percentages, and ROC, whereas the Random Tree classifier is the least performing classifier in all metrics. Good classification performances are also achieved by the k-nn, Naïve Bayes, and Decision Tree classifiers. These results can be used as references for future work on ASD screening.

This paper hopes to promote subsequent work on performance evaluation for the given binary classification problem by exploring other machine learning algorithms such as CHAID, Decision Stump, Rule Induction, and Random Forest. Further, the small datasets used in this experiment may affect the generalization of the models that are built; thus, it is recommended to provide more training data to improve the performance of the classifiers.

Lastly, high prediction performance is essential in medical research, especially in ASD screening, so that the best intervention strategies can be provided.

ACKNOWLEDGMENT

The authors thank the UCI Machine Learning Repository for allowing the use of the three (3) Autism Screening datasets.

REFERENCES

[1] I. Rapin and R. F. Tuchman, "Autism: Definition, Neurobiology, Screening, Diagnosis," Pediatric Clinics of North America, vol. 55, no. 5, pp. 1129–1146, 2008.
[2] F. Hauck and N. Kliewer, "Machine Learning for Autism Diagnostics: Applying Support Vector Classification," Int'l Conf. Heal. Informatics Med. Syst., pp. 120–123, 2017.
[3] Y. Q. Zhang and J. C. Rajapakse, Machine Learning in Bioinformatics (Vol. 4). John Wiley & Sons, 2009.
[4] A. R. Olivera et al., "Comparison of machine-learning algorithms to build a predictive model for detecting undiagnosed diabetes - ELSA-Brasil: accuracy study," Sao Paulo Med. J., vol. 135, no. 3, pp. 234–246, 2017.
[5] S. Wang and R. M. Summers, "Machine learning and radiology," Medical Image Analysis, vol. 16, no. 5, pp. 933–951, 2012.
[6] X. Chen, M. Wang, and H. Zhang, "The use of classification trees for bioinformatics," Wiley Interdiscip. Rev. Data Min. Knowl. Discov., vol. 1, no. 1, pp. 55–63, 2011.
[7] D. Bone, M. S. Goodwin, M. P. Black, C. C. Lee, K. Audhkhasi, and S. Narayanan, "Applying Machine Learning to Facilitate Autism Diagnostics: Pitfalls and Promises," J. Autism Dev. Disord., vol. 45, no. 5, pp. 1121–1136, 2015.
[8] M. Duda, J. A. Kosmicki, and D. P. Wall, "Testing the accuracy of an observation-based classifier for rapid detection of autism risk," Translational Psychiatry, vol. 5, p. e556, 2015.
[9] D. P. Wall, J. Kosmicki, T. F. Deluca, E. Harstad, and V. A. Fusaro, "Use of machine learning to shorten observation-based screening and diagnosis of autism," Transl. Psychiatry, vol. 2, 2012.
[10] M. I. Jordan and T. M. Mitchell, "Machine learning: Trends, perspectives, and prospects," Science, vol. 349, no. 6245, pp. 255–260, 2015.
[11] F. F. Thabtah, "Autism Spectrum Disorder Screening: Machine Learning Adaptation and DSM-5 Fulfillment," in Proceedings of the 1st International Conference on Medical and Health Informatics 2017, 2017, pp. 1–6.
[12] F. F. Thabtah, "Autism Spectrum Disorder Tests App." 2017.
[13] W.-L. Chao, "Machine Learning Tutorial." Graduate Institute of Communication Engineering, National Taiwan University, Taiwan, 2011.
[14] D. Greene, P. Cunningham, and R. Mayer, "Machine Learning Techniques for Multimedia: Case Studies on Organization and Retrieval (Cognitive Technologies)," in Machine Learning Techniques for Multimedia, 2008, pp. 51–90.
[15] E. Learned-Miller, "Introduction to Supervised Learning." Department of Computer Science, University of Massachusetts, Amherst, 2014.
[16] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. 2006.
[17] S. Agarwal, "Introduction. Binary Classification and Bayes Error." pp. 3–4, 2013.
[18] A. T. Azar and S. M. El-Metwally, "Decision tree classifiers for automated medical diagnosis," Neural Comput. Appl., vol. 23, no. 7–8, pp. 2387–2403, 2013.
[19] H. Zhang, "The Optimality of Naive Bayes," in Proc. Seventeenth Int. Florida Artif. Intell. Res. Soc. Conf. (FLAIRS 2004), vol. 1, no. 2, pp. 1–6, 2004.
[20] S. Gadat, T. Klein, and C. Marteau, "Classification in general finite dimensional spaces with the k-nearest neighbor rule," Ann. Stat., vol. 44, no. 3, pp. 982–1009, 2016.
[21] "Modern Machine Learning Algorithms: Strengths and Weaknesses," EliteDataScience, 2017. [Online]. Available: https://elitedatascience.com/machine-learning-algorithms.
[22] P. Rai, "Supervised Learning: k-Nearest Neighbors and Decision Trees." School of Computing, The University of Utah, Utah, 2011.
[23] L. Kozma, "k Nearest Neighbors algorithm (kNN)." Helsinki University of Technology, Helsinki, 2008.
[24] Rapidminer, "Random Tree," Rapidminer Documentation, 2018. [Online]. Available: https://docs.rapidminer.com/latest/studio/operators/modeling/predictive/trees/random_tree.html. [Accessed: 13-Jan-2018].
[25] Y. A. LeCun, Y. Bengio, and G. E. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[26] Y. Lv, J. Liu, and T. Yang, "Comparative studies of model performance based on different data sampling methods," in 2013 25th Chinese Control and Decision Conference (CCDC), 2013, pp. 2731–2735.
[27] J. Wu, Y. Yu, C. Huang, and K. Yu, "Deep multiple instance learning for image classification and auto-annotation," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3460–3469.
[28] Y. Bengio, "Deep learning of representations: Looking forward," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 7978 LNAI, 2013, pp. 1–37.
[29] G. Ericson and W. A. Rohm, "How to choose algorithms for Microsoft Azure Machine Learning," Microsoft Azure, 2017. [Online]. Available: https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-choice. [Accessed: 12-Jan-2017].
[30] T. Fawcett, "An introduction to ROC analysis," Pattern Recognit. Lett., vol. 27, no. 8, pp. 861–874, 2006.
[31] M. Lichman, "UCI Machine Learning Repository." University of California, School of Information and Computer Science, Irvine, CA, 2013.

