Intelligent Failure Prediction Models For Scientific Workflows

Expert Systems with Applications 42 (2015) 980–989

Anju Bala, Inderveer Chana
Computer Science and Engineering Department, Thapar University, Patiala, India
Article info

Article history: Available online 21 September 2014

Keywords: Cloud Computing; Workflows; Failure prediction; Scientific workflows; Machine learning

Abstract

The ever-growing demand for and heterogeneity of Cloud Computing is making it popular with scientific communities, which use Cloud services to execute large-scale scientific applications in the form of sets of tasks known as workflows. Scientific workflows specify a process or computation to be executed as data flows and task dependencies, which lets users express complex multi-step computational tasks simply. Proactive fault tolerance is therefore required for the execution of scientific workflows. To reduce the effect of workflow task failures on Cloud resources during execution, task failures can be intelligently predicted by proactively analyzing the data of multiple scientific workflows using state-of-the-art machine learning approaches for failure prediction. This paper therefore focuses on the research problem of designing intelligent task failure prediction models that facilitate proactive fault tolerance by predicting task failures for scientific workflow applications. First, failure prediction models are implemented with machine learning approaches and evaluated using performance metrics; Naive Bayes demonstrates the maximum prediction accuracy. Then, the proposed failure models are validated using Pegasus and Amazon EC2 by comparing actual task failures with predicted task failures.
© 2014 Elsevier Ltd. All rights reserved.
Corresponding author. E-mail addresses: anjubala@thapar.edu (A. Bala), inderveer@thapar.edu (I. Chana).
http://dx.doi.org/10.1016/j.eswa.2014.09.014
0957-4174/© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

Cloud Computing has transformed the Information and Communication Technology industry by facilitating on-demand services, elasticity, flexibility and provisioning of computing resources based on utility (Beloglazov & Buyya, 2013). Cloud Computing adopts virtualization technologies to provide various services to the user, such as Infrastructure as a Service (IaaS), Hardware as a Service (HaaS), Platform as a Service (PaaS), Software as a Service (SaaS) and Workflow as a Service (WFaaS) (Cushing, Koulouzis, Belloum, & Bubak, 2014; Wang, Korambath, Altintas, Davis, & Crawl, 2014; Zhao, Melliar, & Moser, 2010). The layered architecture of the Cloud services, the tools required for these services and their challenges are shown in Fig. 1. The Cloud offers WFaaS and SaaS on the top layer, where the user can use services over the internet without buying the proprietary rights to software and applications such as Workflow Management Systems (WMS), scientific applications, e-mail, business and multimedia applications. Some of the key challenges that need to be addressed at the software layer are scheduling, data heterogeneity, fault tolerance, fault prediction, interoperability and security. The second layer of the architecture acts as core middleware between the software layer and the infrastructure layer; it provides PaaS for testing, deploying and controlling web applications and comprises platforms such as Google App Engine, Microsoft Azure, Hadoop, Aneka and Heroku (Buyya, Shin Yeo, Venugopal, Broberg, & Ivona, 2009). The issues that need to be resolved at this layer include the management of big data applications, data analytics and intelligence. The infrastructure layer at the bottom consists of IaaS and HaaS, which offer various hardware resources and infrastructures on demand without purchase. Amazon EC2, Eucalyptus, OpenNebula, Nimbus, OpenStack and Rackspace are examples of infrastructure and hardware providers. Because Cloud Computing offers high availability of these services, we evaluate the use of Cloud services for deploying scientific applications.

Scientific applications are represented as workflows that consist of a few to millions of interdependent tasks (Ramakrishnan, Reutiman, Chandra, & Weissman, 2013). Workflow applications for real-world business processes are identified as Cloud Workflows, and WFaaS has been used by some researchers for deploying and executing these workflows in the Cloud environment. Although Tan et al. (2009) enhanced the performance of real-world tumor analysis using Workflow-as-a-Service in Grid Computing, they did not define any architecture for WFaaS in the Cloud. Then, Pathirage, Perera, S,
Kumara, and Weerawarana (2011) proposed an architecture for WFaaS to host workflow applications securely in the Cloud, but they did not consider workflow scheduling. Furthermore, Wang et al. (2014) defined a WFaaS architecture for scheduling workflow applications to increase scalability and extensibility. Cushing et al. (2014) also proposed a WFaaS approach for task framing of scientific applications in the Cloud. Juve and Deelman (2010) evaluated Cloud infrastructures as an execution platform for deploying scientific workflows as WFaaS because of benefits such as on-demand provisioning, elasticity, provenance and reproducibility. Deelman, Livny, Berriman, and Good (2008) and Pandey, Karunamoorthy, and Buyya (2011) also reported various advantages of executing scientific workflows on Cloud infrastructure, such as cost effectiveness, scalability, decreased runtime, on-demand resource provisioning and ease of resource management. Although WFaaS has been used by only a few researchers in the Cloud, some open challenges remain, such as efficient and scalable management of workflows, handling of application and resource failures, heterogeneity of data, failure prediction of tasks and fault-tolerant scheduling (Deelman, 2009; Gil & Deelman, 2007). Among these research issues, the key challenge is to handle resource and task failures through intelligent prediction of failures for scientific workflows, which has not been implemented by any of these authors so far.

Fig. 1. Layered architecture of Cloud along with WFaaS.

Most existing works employ fault-tolerance techniques for scientific workflows such as replication, checkpointing, job migration, retry and task resubmission (Bala & Chana, 2012; Ganga, Karthik, & Christopher Paul, 2012). Less research has been done on predicting and detecting task failures intelligently by adopting machine learning approaches for proactive fault tolerance. Workflows are used for simulation, high energy physics, astronomy and many other scientific applications; hence, reliability models for software and hardware failures cannot simply be applied to handle task failures proactively (Bala & Chana, 2013; Xie, Dai, & Poh, 2004). A fault-tolerant approach that is purely reactive might not be able to handle failures intelligently (Varghese, McKee, & Alexandrov, 2010). Intelligent task failure detection and prediction is therefore particularly challenging for scientific workflow applications that have many job and data dependencies, such as Montage, Cybershake, Sipht, Inspiral, Epigenomics and Broadband. These workflows are used in the scientific community for various applications: Montage for astronomical physics, Cybershake for earthquake hazards, Sipht for bioinformatics, Inspiral for detecting gravitational waves, Epigenome for genome sequence operations and Broadband for simulating the impact of an earthquake.

Thus, the goal of the proposed models is to predict task failures intelligently, using machine learning approaches, before the failures occur during the execution of scientific workflow applications. Task failures can occur because of overutilization of resources, unavailable resources, execution time or execution cost exceeding a threshold value, required libraries not being installed properly, the system running out of memory or disk space, and so on. In the present paper, task failures are generated by overutilization of resources such as CPU, RAM, disk storage and network bandwidth. The historical data of task failure parameters were gathered for training and testing the prediction model in Weka by running multiple scientific workflow applications at different intervals in WorkflowSim. The results of the failure prediction models were then compared, using evaluation metrics, across machine learning algorithms such as Naive Bayes, Random Forest, LR and ANN; the comparison indicates that Naive Bayes is the best machine learning approach for task failure prediction of multiple scientific workflow applications. To validate the accuracy of the proposed model, actual failures were compared with predicted failures using Amazon EC2 and Pegasus. Finally, the experimental results of the proposed model were compared with an existing model after implementing Broadband, Epigenome and Montage.

1.1. Motivation for the work

- The motivation for implementing task failure prediction comes from the research challenges of scientific workflows, their applications and the benefits of using Cloud services for these workflows.
- Our work aims to analyze the problem of failure prediction so that Cloud systems become capable of making autonomic fault-tolerant decisions by predicting task failures from various resource utilization parameters such as CPU utilization, RAM, disk storage and bandwidth utilization.
- Most of the existing works have implemented fault prediction using statistical approaches that would not be useful for predicting failures intelligently; therefore, our proposed approach would be useful for predicting task failures proactively for
  scientific workflow applications using various machine learning approaches, which, to the best of our knowledge, has not been implemented in existing research works.
- Our proposed approach is more effective because task failures caused by resource overutilization are predicted through machine learning based approaches.

The rest of the paper is organized as follows: Section 2 discusses the related work and Section 3 presents the methodology. Section 4 describes the approaches used to implement the task failure prediction models. Section 5 details the evaluation of various parameters. Section 6 reports the experimental results. Section 7 presents the conclusion.

2. Related work

Significant work has been done by a number of researchers on software fault prediction using a variety of approaches to obtain better prediction results, and some authors have used machine learning approaches for software fault prediction and resource provisioning. However, these approaches have not been employed for predicting task failures using scientific workflow data in the Cloud environment. Therefore, in this section, we survey the work related to predicting and detecting failures using general approaches, Cloud-specific approaches and machine learning approaches, considering the state-of-the-art techniques and related approaches along with research challenges.

2.1. General approaches for detecting and predicting failures

Fault detection and prediction approaches have been proposed and implemented by many authors in distributed environments. Duan, Prodan, and Fahringer (2006) proposed a fault classification and data mining approach to predict different types of faults, and Jitsumoto, Endo, and Matsuoka (2007) used a fault detector to differentiate between hardware, process and transmission faults. Fu and Xu (2007) built a neural network to approximate the number of failures in a given time interval, and Fu and Xu (2010) also implemented a proactive failure prediction framework for predicting component failures based on temporal and spatial correlations, using features such as resource utilization, packet count and system information. They used supervised learning approaches to predict failures with an average accuracy of 70.1% with online prediction and 74% with offline prediction. Further, Sindrilaru, Costan, and Cristea (2010) implemented a technique for detecting faults before they reach the actual workflow engine by intercepting SOAP (Simple Object Access Protocol) messages, in order to provide better reaction times. Yu, Wang, and Shi (2010) proposed a failure-aware workflow scheduling approach that predicts online resource failures, but it still needs to be implemented in Pegasus for predicting task failures as well. Guan, Zhang, and Song (2012) also proposed a failure detection and prediction approach with health care data using decision trees and Bayesian networks. Although a number of authors have proposed various approaches for fault prediction using health care data, they have not implemented task failure prediction that considers data from multiple scientific workflows, which can be used to increase the performance of scheduling.

2.2. Specific approaches for detecting and predicting failures within Cloud environment

Zhao et al. (2010) implemented a heartbeat message protocol for detecting replica failures, and similar work has also been proposed by Jhawar, Piuri, and Santambrogio (2012), who used a heartbeat message protocol for detecting crash failures among VM instances in the Cloud environment but did not consider the detection of task failures for scientific applications. Guan, Zhang, and Song (2011) also presented a fault prediction mechanism for building dependable Cloud systems with Bayesian classifiers and decision trees using health data collection. Poola, Ramamohanarao, and Buyya (2014) proposed fault-tolerant workflow scheduling to reduce the cost by 70% and suggested in their future work that failure prediction approaches can be used to reduce the cost for scientific workflows. Further machine learning and statistical techniques are needed to increase the performance of these approaches; so far, very few researchers have worked on failure prediction in the Cloud.

2.3. Machine learning approaches for predicting task failures: ANN, Naive Bayes, Random Forest and LR

Catal and Diri (2009) showed in their review that machine learning models have better characteristics for prediction than statistical methods. Therefore, we need to explore machine learning algorithms for predicting task failures proactively. Malhotra and Jain (2012) verified that Random Forest gave better results for fault prediction. Earlier, Ohlsson, Zhao, and Helander (1998) applied multivariate analysis techniques for predicting fault-prone classes using software design metrics, with data from Ericsson Telecom AB. More recently, Suresh, Kumar, Ku, and Rath (2014) demonstrated that multivariate logistic regression is an efficient approach for software fault prediction. Islam, Keunga, Lee, and Liu (2012) showed that the accuracy of ANN is superior for the prediction of resource utilization in the Cloud environment. In similar work, Kousiouris, Menychtasa, Kyriazis, Gogouvitis, and Varvarigou (2014) estimated resource provisioning using Artificial Neural Networks (ANN) in Cloud platforms and also confirmed the better prediction accuracy of ANN. Catal (2011) found the Naive Bayes model to be a robust machine learning algorithm for software fault prediction. Although most existing work has used machine learning approaches such as Naive Bayes, ANN, Random Forest and LR for resource provisioning and software fault prediction, very little work has been reported on predicting task failures of workflow applications. Only Samak et al. (2012) presented a failure analysis of jobs with a Naive Bayes model, which can predict the failure probability of workflow jobs with an average accuracy of 85% for scientific applications such as Broadband, Epigenome and Montage. However, they did not compare the performance of Naive Bayes with other failure prediction models and did not consider task failures caused by resource overutilization or resource failures. Therefore, to the best of our knowledge, none of the approaches discussed above has predicted task failures intelligently using a variety of machine learning approaches.

2.4. Research challenges along with their solutions in the state of the art of failure detection and prediction in Cloud Computing

Accurate failure predictions can help in mitigating the impact of failures for scientific applications, and resources, applications and services can be scheduled efficiently to limit the effect of failures. However, providing accurate predictions sufficiently far ahead is a challenging task for intricate applications such as workflows, and an accurate prediction of task failure is a prerequisite for implementing an intelligent fault-tolerant approach for scientific workflows. The state of the art of existing approaches for failure prediction reveals that some of the research challenges can be resolved using intelligent failure prediction models on large
infrastructure such as the Cloud, and these challenges are sketched out as follows.

- Although Samak et al. (2012) implemented a job failure prediction model for application failures and data management component failures, they did not address the prediction of task failures caused by resource overutilization, which can further enhance the accuracy of failure prediction models.
- For efficient forecasting, enhanced failure prediction techniques are required that can increase the accuracy of predictions. These would also help to handle current research challenges of the Cloud such as proactive fault tolerance, scientific data management and scheduling.
- Cost reduction is an important challenge for scientific applications. Efficient mechanisms are therefore required that proactively predict resource failures along with execution cost, to reduce the maintenance cost of scientific applications.
- Scientific workflows may consist of thousands of tasks along with their dependencies. The main challenge for these workflows is to schedule and execute workflow tasks on Cloud resources that are not currently overutilized or unavailable. Therefore, before a task is scheduled on these resources, prediction-based techniques need to be used to intelligently predict overutilized or unavailable resources.
- As scientific workflows consist of heterogeneous data, data management problems can be resolved by predicting resource utilization parameters.

2.5. Our contributions

In contrast to the existing work, intelligent task failure prediction models have been proposed and implemented for workflow applications in the Cloud environment. First, because a proactive fault-tolerant approach requires prior knowledge of task failures, we have implemented and compared the performance metrics of task failure prediction models such as Naive Bayes, Random Forest, Logistic Regression and ANN. Secondly, the predicted task failures have been compared with actual task failures using Pegasus and Amazon EC2 for the proposed failure prediction models. The experimental results show that the Naive Bayes model outperforms the other models, predicting task failures with maximum accuracy. Finally, the results have been compared with existing failure prediction models, which validates that the proposed model using Naive Bayes is effective in terms of accuracy. Furthermore, the proposed models make it possible to handle resource failures as well as task failures before they occur.

3. Methodology

The methodology for designing the proposed models is depicted in Fig. 2. The proposed models have two modules: the first module predicts task failures with machine learning approaches, and the second module locates the actual failures after the workflows are executed in a Cloud testbed.

Fig. 2. Methodology of the proposed approach.

3.1. Task failure prediction module

The task failure prediction unit explores task failure data gathered from the workflow execution of scientific applications at intervals of 50 s. The historical data are accumulated from the data repository of the Cloud. Data pre-processing involves a thorough assessment of the raw data, in which distorted values or missing readings can be misleading in the formal analysis; interpolation is therefore used to deal with missing values. Principal Component Analysis (PCA) is applied to select the more relevant attributes and to reduce the dimensions required for the intelligent fault prediction model. Tasks are then classified, based on the data extracted by PCA, as task failure if the value of a utilization parameter exceeds the maximum threshold value, and as task not failure otherwise. Machine learning approaches such as Naive Bayes, ANN, LR and Random Forest are implemented to predict task failures intelligently from the dataset of scientific workflows. Different evaluation measures are also used to examine and compare the accuracy of the proposed models, and the predicted failures are saved into a database.

3.2. Actual task failure module

The actual task failure unit assesses the actual task failures using Pegasus and Amazon EC2 (Deelman et al., 2005; Juve et al., 2009, 2010). Pegasus maps abstract workflows to concrete workflows that are deployed and executed on the Amazon EC2 Cloud, with the goal of making reasonable predictions for scientific applications; it is also used for the actual prediction of task failures among scientific workflows. Then, the log files of actual task failures are monitored using the Pegasus monitored class and analyzed by Amazon
CloudWatch, and the data are stored in a relational archive, which standardizes the log files into a close approximation and can be sent back to the workflow management system. Then, the actual failure records are compared with the stored predicted failures to confirm the accuracy of the proposed failure prediction models.

4. Approaches used for implementing task failure prediction models

4.1. Logistic regression

Logistic regression is a commonly used approach for predicting a dependent variable from a set of independent variables. In this paper, we predict failure-prone classes using multivariate logistic regression with multiple independent variables of resource utilization. There are two selection methods for finding the optimal set of independent variables: forward selection, which checks the variables at entry time, and backward elimination, which starts with all the independent variables and deletes them one by one until the stopping criterion is satisfied. We have used the forward stepwise selection method. The multivariate logistic regression formula by Hosmer and Lemeshow (1989) is shown in Eq. (1).

\[ F(t) = \frac{e^{t}}{e^{t} + 1} = \frac{1}{1 + e^{-t}} \tag{1} \]

The logistic function $F(t)$ takes values between 0 and 1. Written as $F(x)$, where $x$ is the vector of explanatory variables, it becomes Eq. (2).

\[ F(x) = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4)}} \tag{2} \]

The dependent variable can be task success or task failure, so the inverse of the logistic function, $g(x)$, is defined as shown in Eq. (3).

\[ g(x) = \ln\frac{F(x)}{1 - F(x)}, \qquad g(x) = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4 \tag{3} \]

Here $x_1$ represents the CPU utilization, $x_2$ the RAM utilization, $x_3$ the bandwidth utilization and $x_4$ the disk utilization, and $b_0, b_1, b_2, b_3, b_4$ are constants. Hence, the probability of task failure based on the four independent variables is calculated by Eq. (4):

\[ \mathrm{Prob}(x_1, x_2, x_3, x_4) = \frac{e^{g(x)}}{1 + e^{g(x)}} \tag{4} \]

4.2. Artificial Neural Network (ANN)

An ANN can be trained using multivariate functions. In this approach, a Neural Network model is trained with the back-propagation method on the historical data from scientific workflow execution in WorkflowSim. A typical three-layer Neural Network is shown in Fig. 3.

Fig. 3. Three layer neural network.

The Neural Network consists of multiple layers: an input layer, a hidden layer and an output layer. The input layer has four input neurons, $x = (x_1, x_2, x_3, x_4)$, the output layer has two output neurons, $o = (o_1, o_2)$, and one hidden layer with three hidden neurons, $h = (h_1, h_2, h_3)$, lies between them. The neurons at each layer are linked to the neurons of the next layer with a weight $w_i$ that is determined during training. For each training point $x_i$ and weight $w_i$, the network computes the resulting output $y_j$. If $y$ (actual output) and $y_j$ (predicted output) differ, the network weights are modified to reduce the calculated error. In this paper, weights are updated using a learning rate of 0.7 and a momentum of 0.2. The error, measured by the Root Mean Square Error (RMSE) and the Mean Absolute Percentage Error (MAPE), is minimized with a gradient-descent technique until a local optimum is reached. When the calculated error over all training data is adequately small, the network has reached a local optimum. The resulting Artificial Neural Network with backpropagation is thus expected to generate a better prediction output after observing the input data in real time. The proposed algorithm is shown in Algorithm 1.

Algorithm 1. FPNN with backpropagation learning
1. Initialize network with random weight vector $w_i$
2. For all training examples from j = 1 to m, do
3.   For i = 1 to n do
4.     Evaluate output $y_j = \sum_{i=1}^{n} x_i w_i$
5.     Compare network output $y_j$ with correct output $y$
6.     Error $e = y - y_j$
7.     Use gradient descent to minimize the error
8.     Adapt weights in current layer
9.     Repeat until the RMSE and MAPE error has been minimized.
10.   EndFor
11. EndFor

4.3. Random Forest

The Random Forest (RF) approach can be used to generate thousands of trees. RF combines the advantages of bagging and random feature selection, as proposed by Breiman (2001). Bagging repeatedly draws samples from the data sets with a uniform probability distribution, while random feature selection searches at each node for the best split over a random subset of the features. The Random Forest classifies a new object from an input vector by passing the input vector down every tree in the forest. Each tree casts a unit vote for a classification of the input vector, and the forest selects the classification that has the most votes over all the trees in the forest (Guo, Ma, Cukic, & Singh, 2004). Each random tree is constructed with the following steps:

- For M cases in the training set, sample M cases at random.
- For every node, $y$ fault predictors are randomly selected out of the $Y$ input variables, where $y \ll Y$ and $y = \sqrt{Y}$.
- Each tree is grown to the largest extent possible, with no pruning.

4.4. Naive Bayes approach

The Naive Bayes model has been used for fault prediction. It assumes that the attributes are independent of each other, which is why it is called naive. It uses Bayes' theorem to calculate the
probability of unknown instance Y is classied as class T with True positives (TP): Cloud users labeled as task failure but also
possible outcomes. Where T fT 1 ; T 2 g, T is considered as evaluated as task failure.
fSuccessjFailureg class and the probability of task failure or success True negatives (TN): Cloud users labeled as not failure task but
is dened by Eq. (5). evaluated as not failure.
False positives (FP): Cloud users labeled as failure task but eval-
PYjTPT
PfTjYg 5 uated as not failure.
PY
False negatives (FN): Cloud users labeled as non failure task but
Because Naive assumes the conditional probabilities of the indepen- evaluated as task failure.
dent variables, therefore, PfTjYg can be decomposed into product of
sums by using Eq. (6), where Y j represents n attributes as: 5.1.2. Recall and precision
Y 1 ; Y 2 . . . Y n represents n attributes which are conditionally inde- Precision is dened as the ratio of correctly predictable failures
pendent on one another given T as output. to the number of all the recognized failures where Recall is dene
as the ratio of correctly predicted failures to the number of true
Y
n
PTjY PT PY j jT 6 failures which is dened as follows in Eq. (10).
j1
TP TP
Recall ; Precision 10
The probability that T will take on any point k is dened in the Eq. TP FN TP FP
(7). Here sum is considered as all the possible values t j of T. The Eq.
(7) can be rewritten in Eq. (8) which is a basic equation of Naive
Bayes Classier. This equation also describes to calculate the prob- 5.1.3. MAPE and RMSE
ability that T will take on any given value and can be estimated from In our experiments MAPE and RMSE are evaluated to measure
training data with the distributions PT and PY i jT. error in percentage. MAPE is used for evaluating the prediction
accuracy (Malhotra & Jain, 2012) as percentage. Where yj is the
PT tk PY 1 . . . Y n jT tk actual output, b
y j is the predicted output and m indicates total num-
PT t k jY 1 . . . Y n P 7
j PT t j PY 1 . . . Y n jT t j ber of observations in the dataset and MAPE is dened in Eq. (11)
Q
PT tk i PY i jT t k and lower value of MAPE indicates a more precise prediction
PT t k jY 1 . . . Y n P Q 8 technique.
j PT t j i PY i jT t j
1X m b
j y j yj j
MAPE 11
5. Evaluation criteria m j1 yj
The objectives of the paper are twofold. Firstly, it needs to cor- The metric RMSE (Aggarwal, Singh, Kaur, & Malhotra, 2009) is
rectly identify which tasks could be failed on Virtual Machines due described by the following formula in Eq. (12) and smaller value
to overutilization of resources using ANN, Random Forest, LR and of RMSE signies a more effective prediction scheme.
Naive Bayes models. Secondly, to validate the accuracy of predic- v
u X
tion models, predicted failure tasks has been compared with actual u1 m 2
RMSE t by j yj 12
task failure in Amazon EC2 Cloud. Finally, to evaluate and validate m j1
the performance of proposed failure prediction approaches, results
have been compared with existing failure prediction model.
5.1. Predictor performance

For predicting the performance of the failure prediction models, data of scientific workflows with 25, 30, 50 and 100 tasks have been collected for different scientific workflows at fixed intervals in WorkflowSim. CloudSim classes have been used to read and trace the data values for the workload traces. Then, the prediction accuracy of task failures has been evaluated using the following evaluation metrics.

5.1.1. Sensitivity and specificity

These metrics measure the correctness of the predicted model, where Sensitivity specifies the percentage of actual faulty tasks (Task Failed) which are correctly classified, whereas Specificity is the percentage of non-faulty tasks (Task Success) which are correctly identified (Salfner, Lenk, & Malek, 2010). The relation between these metrics is depicted in Table 1. The measures are calculated using the formulas given below in Eq. (9).

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Specificity} = \frac{TN}{FP + TN} \qquad (9)$$

Table 1
Relation between sensitivity and specificity.

Classified class     Task failure     Task not failure
Task failure         TP               FP
Task not failure     FN               TN
                     Sensitivity      Specificity

5.1.4. F-Measure and ROC

The F-Measure values are computed as the harmonic mean of the precision and recall, as shown in Eq. (13). The highest value of F-Measure is considered the best case, whereas the lowest value is considered the worst case.

$$F\text{-}\mathrm{Measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (13)$$

The accuracy of the prediction approaches can also be estimated by comparing their ROC curves graphically. The area under curve method is used to assess the ROC curve, as shown in Eq. (14), where tpr specifies the true positive rate and fpr indicates the false positive rate (Salfner et al., 2010). The maximum value of the area under curve signifies the best predictor.

$$\mathrm{AreaUnderCurve} = \int_{0}^{1} tpr(fpr)\, d\,fpr \in [0, 1] \qquad (14)$$

5.2. Methods used to validate the accuracy

We have used percentage split and Standard Error of Mean to validate the experimental results in Amazon EC2.

5.2.1. Percentage-split

Percentage split is used to divide the data set into a training set and a test set; in this paper we have used a 66 percentage split.
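The 66 percentage split can be sketched as follows. This is a simplified illustration and not the WEKA implementation used in the paper: the function name `percentage_split`, the shuffling with a fixed seed, and the placeholder records are all assumptions.

```python
import random

def percentage_split(records, train_percent=66, seed=0):
    """Shuffle the records and split them into (training set, test set)."""
    if not 0 < train_percent < 100:
        raise ValueError("train_percent must be strictly between 0 and 100")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]

# Placeholder records standing in for rows of workflow task data.
records = list(range(100))
train, test = percentage_split(records)
print(len(train), len(test))  # 66 34
```

With 100 records, 66 go to the training set and the remaining 34 form the test set; every record lands in exactly one of the two sets.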
5.2.2. Standard Error of Mean (SEOM)

The metric SEOM is evaluated from the deviation of the predicted task failures from the actual task failures; for high accuracy, SEOM should be minimum. It is expressed by the formula in Eq. (15), where SD indicates the standard deviation and m is the number of samples.

$$\mathrm{SEOM} = \frac{SD}{\sqrt{m}} \qquad (15)$$

Table 3
Comparison of Performance Measures using percentage split.

Metric         Naive Bayes    Random Forest    LR        ANN
Sensitivity    0.935          0.891            0.848     0.913
Specificity    0.849          0.417            0.127     0.42
Recall         0.935          0.891            0.848     0.913
Precision      0.94           0.876            0.754     0.921
RMSE (%)       19             27               35.16     29.94
MAPE (%)       29.63          72.0051          64.023    57.041
F-Measure      0.937          0.875            0.798     0.893
ROC            0.983          0.975            0.661     0.779
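The classification metrics of Eqs. (9) and (13) and the SEOM of Eq. (15) can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the input values are hypothetical, precision and recall are assumed to follow their standard definitions TP / (TP + FP) and TP / (TP + FN), and the sample standard deviation is assumed for SD because the text does not state which variant is used.

```python
import math
import statistics

def classification_metrics(tp, fp, fn, tn):
    """Sensitivity and specificity (Eq. 9), precision, recall, F-Measure (Eq. 13)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (fp + tn)
    precision = tp / (tp + fp)
    recall = sensitivity  # recall is the same ratio as sensitivity
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }

def seom(samples):
    """Standard Error of Mean (Eq. 15): SD / sqrt(m), assuming the sample SD."""
    m = len(samples)
    return statistics.stdev(samples) / math.sqrt(m)

# Hypothetical counts and samples, not taken from the paper's experiments.
metrics = classification_metrics(tp=40, fp=5, fn=10, tn=45)
print(metrics)
print(seom([2.0, 4.0, 4.0, 6.0]))
```

The Sensitivity and Recall rows of Table 3 coincide, which is consistent with both being computed as TP / (TP + FN).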
6. Experimental results and discussion

The experimental results measuring the performance of the prediction models have been evaluated by implementing all of the approaches mentioned in Section 4. WorkflowSim (Chen & Deelman, 2012) classes have been used to execute workflow applications and CloudSim (Calheiros, Ranjan, Beloglazov, Rose, & Buyya, 2011) classes have been utilized for storing the log files. WEKA (Hall et al., 2009) is used to implement the machine learning approaches using the data of scientific workflow applications. The evaluated results have been validated after running scientific workflow applications on Amazon EC2, which affords a large selection of instance types with varying CPU, memory, storage and networking capacity to deploy and execute scientific workflows. We have used the c3.xlarge instance type; c3 instances are the most recent generation of compute-optimized instances, providing the highest performing processors at the lowest price. It is equipped with two quad cores, 7.5 GB RAM, an Intel Xeon E5-2680 v2 processor and 1680 GB local storage, and runs Pegasus 4.2 with CentOS 6.0 on Virtual Box. Our experimental setup has been categorized into four steps:

- Data collection and extraction
- Performance Measures using percentage split
- Performance Measures using SEOM
- Comparison of proposed model with existing model.

6.1. Data collection and extraction

Several attributes containing task failure information have been gathered using WorkflowSim and CloudSim classes, and PCA has been applied to reduce the dimensions to the nine attributes that are required for the intelligent fault prediction model. Each attribute along with its description is defined in Table 2. Resource utilization parameters have been evaluated using threshold values of CPU utilization, Bw utilization, RAM utilization and Disk utilization. The threshold value is selected based on the previous history of task failures due to overutilization of VMs. If a utilization parameter has a value greater than the threshold, the status is classified as Task Failure, otherwise as Not Failure. Task id indicates which task fails on the resource utilization, VM id signifies the VM on which the task has failed, and similarly datacenter id shows the datacenter where the task would fail or succeed. As there are multiple levels in the workflows, depth specifies the level of tasks in the workflows.

Table 2
Attribute description.

S. No.    Attribute name      Description
1         Task id             Id of Task
2         VM id               Id of VM
3         Datacenter id       Id of datacenter
4         CPU utilization     Utilization of CPU (%)
5         Bw Utilization      Utilization of Bandwidth (%)
6         RAM utilization     Utilization of RAM (%)
7         Disk utilization    Utilization of Disk (%)
8         Depth               Level of task
9         Status              Failure class or not failure class

6.2. Performance Measures using 66 percentage split

Table 3 summarizes the results of the prediction models under the 66 percentage split, in which 66 percent of the data is used as training data and 34 percent as test data. It shows the sensitivity, specificity, Recall, Precision, RMSE, MAPE, F-Measure and ROC values, and the performance metrics are compared in tabular and graphical form. The experimental results show that Naive Bayes performs best on the task failure predictions of scientific workflow data, and ANN also performs better than Random Forest and LR.

6.2.1. Analysis of failure prediction model using percentage split

Fig. 4 also details the performance of the machine learning approaches in graphical form with the 66 percentage split, using the data from Table 3. We can infer from Fig. 4(a) that Naive Bayes has the highest specificity (0.849) and sensitivity (0.935), while ANN also performs better than LR and Random Forest. It is apparent from Fig. 4(b) that Recall and Precision are maximum for Naive Bayes (93% and 94%) and minimum for LR (84% and 75%). Fig. 4(c) shows that the calculated RMSE and MAPE are minimum for Naive Bayes (19% and 29.63%), while RMSE is maximum for LR (35.16%) and MAPE is maximum for Random Forest (72.0051% in Table 3). Fig. 4(d) indicates that the F-Measure and ROC values are also highest for the Naive Bayes model (0.937 and 0.983), whereas LR has the minimum values (0.798 and 0.661). Hence, the results clearly validate that our model using Naive Bayes is accurate enough to predict task failures with minimum values of MAPE and RMSE.

Hence, Naive Bayes based models deliver higher prediction accuracy with percentage split as compared to the others, namely Random Forest, Logistic Regression and ANN. Therefore, to achieve more accuracy in prediction, the task failure prediction for workflow type applications can be implemented using the Naive Bayes approach.

6.3. Performance Measures using SEOM

To evaluate the accuracy of the prediction models, machine learning based models have been used to predict failures precisely for different sizes of scientific workflows. Then, the experimental results have been validated in Pegasus and Amazon EC2. We have focused on Montage workflow data with 25 tasks, Cybershake with 30 tasks, Inspiral with 50 tasks and Sipht with 100 tasks at different time intervals of 50, 100, 150 and 200 s. Fig. 5 compares the actual task failures and the predicted task failures for different sizes of scientific workflow based on the performance metrics at different time intervals. Fig. 5(a) depicts the results of failure predictions using the Naive Bayes algorithm, which has higher accuracy (94%) than Random Forest, LR and ANN for predicting task failures. The effect of failure
predictions using Random Forest is also displayed in Fig. 5(b). The accuracy of LR is lower than that of Naive Bayes and Random Forest, as shown in Fig. 5(c). The accuracy of ANN is also high for all the workflows as compared to LR and Random Forest, as shown in Fig. 5(d). Therefore, according to the predictions, as the workflow size grows, the number of correct predictions goes up for all the machine learning models.

Fig. 4. Evaluation metrics using percentage split.

Fig. 5. Comparison of actual failures with predicted failures.

SEOM has been calculated using Eq. (15) above. Fig. 6 compares the SEOM of the fault prediction approaches, and the results clearly validate that the SEOM is minimum (6%) for Naive Bayes, with an accuracy of approximately 94%, and maximum (33%) for LR, with the minimum accuracy of approximately 75%. Random Forest and ANN have lower SEOM (20% and 11%) than LR but higher than Naive Bayes. Therefore, it emerges that the predicted model based on Naive Bayes might lead to the construction of optimum prediction models for developing intelligent failure prediction in the Cloud environment.

6.4. Comparison of proposed model with existing model

Very few authors have worked on failure prediction for tasks in workflows. As discussed in Section 2.3, some authors (Samak et al., 2011, 2012) have used machine learning approaches to analyze the performance of various scientific applications such as Broadband, Epigenome and Montage. They have used Naive Bayes
classifier to evaluate the accuracy of job failures that occur due to application failures and component failures with scientific workflows. Therefore, to evaluate the usefulness of our proposed model, we have compared the experimental results with the existing ones by implementing Broadband with 81 tasks, Epigenome with 320 tasks and Montage with 10,429 tasks using the proposed failure prediction models. Fig. 7 shows that the proposed model using Naive Bayes has a higher accuracy (97%) with Epigenome workflows, as these workflows are more CPU intensive than Broadband and Montage workflows, whereas the existing model (NB1) has an accuracy of 96%. Broadband also has its maximum accuracy (87%) with the Naive Bayes model, whereas LR has the minimum accuracy (79%) among the proposed models and NB1 has an accuracy of 74%. The Montage workflows are I/O intensive and have an accuracy of 94% with Naive Bayes and a minimum accuracy of 83% with Random Forest, whereas NB1 has an accuracy of 84%. Therefore, the average accuracy for all the workflows using Naive Bayes is the maximum (93%) among the models ANN, LR and Random Forest, whereas the existing model has an average accuracy of 85%. Thus, it can be validated that the proposed model is effective in terms of accuracy for predicting task failures as well as resource failures by considering resource utilization for scientific applications.

Fig. 6. SEOM of failure prediction models.

Fig. 7. Comparison of existing and proposed failure prediction models.

7. Conclusion

This paper provides an effective approach to implementing intelligent failure prediction models using scientific workflow data in a Cloud environment. The proposed models have been exemplified with the dataset obtained after running Montage, Cybershake, Inspiral and Sipht workflow applications using WorkflowSim. Then, the experimental results for the failure prediction models have been evaluated through Naive Bayes, Random Forest, LR and ANN in Weka using the dataset of workflows, and validated for the maximum accuracy of Naive Bayes. Secondly, the analysis of results in Pegasus and Amazon EC2 has also proved the accuracy of the task failure prediction models and verified that the maximum accuracy of 94% was achieved with Naive Bayes by comparing actual task failures and predicted task failures. Finally, the failure prediction model using Naive Bayes has shown its effectiveness with an average accuracy of 93% as compared to the existing model for Epigenome, Broadband and Montage workflows.

7.1. Future directions

We suggest the following future directions for the research community.

- The incorporation of the Naive Bayes model with other fault tolerant techniques would be able to enhance the efficiency of Cloud services and to improve the performance of existing proactive fault-tolerant approaches.
- Failure prediction models can also be useful for resource provisioning and scheduling by predicting time and cost based parameters. Intelligent prediction models can be helpful to increase the scheduling efficiency of workflow applications of different sizes.
- More QoS attributes including scalability, availability and reliability can be considered in the future and their impact on the overall performance can be examined.
- The intelligent prediction models would also be valuable for predicting and analyzing security threats, which can further increase the performance of various expert systems applications such as project management, risk assessment and engineering.

References

Aggarwal, K. K., Singh, Y., Kaur, A., & Malhotra, R. (2009). Empirical analysis for investigating the effect of object-oriented metrics on fault proneness: A replicated case study. Software Process: Improvement and Practice, 16, 39-62.
Bala, A., & Chana, I. (2013). VM migration approach for autonomic fault tolerance in cloud computing. In GCA'13, international conference, Las Vegas, USA (pp. 3-9).
Bala, A., & Chana, I. (2012). Fault tolerance-challenges, techniques and implementation in cloud computing. IJCSI, 9(1), 288-293.
Beloglazov, A., & Buyya, R. (2013). Managing overloaded hosts for dynamic consolidation of virtual machines in cloud data centers under quality of service constraints. IEEE Transactions on Parallel and Distributed Systems, 24(7), 1366-1379.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32.
Buyya, R., Shin Yeo, C., Venugopal, S., Broberg, J., & Ivona, B. (2009). Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility. Future Generation Computer Systems, 25(6), 599-616.
Calheiros, R. N., Ranjan, R., Beloglazov, A., Rose, C. A. F. D., & Buyya, R. (2011). CloudSim: A toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms. Software: Practice and Experience, 41, 23-50.
Catal, C. (2011). Software fault prediction: A literature review and current trends. Expert Systems with Applications, 38(4), 4626-4636.
Catal, C., & Diri, B. (2009). A systematic review of software fault prediction studies. Expert Systems with Applications, 36, 7346-7354.
Chen, W., & Deelman, E. (2012). WorkflowSim: A toolkit for simulating scientific workflows in distributed environments. In E-Science, IEEE 8th international conference (pp. 1-8).
Cushing, R., Koulouzis, S., Belloum, A., & Bubak, M. (2014). Applying workflow as a service paradigm to application farming. Concurrency Computation: Practice and Experience, 26, 1297-1312.
Deelman, E. et al. (2005). Pegasus: A framework for mapping complex scientific workflows onto distributed systems. Scientific Programming, 13, 219-237.
Deelman, E. (2009). Grids and clouds: Making workflow applications work in heterogeneous distributed environments. International Journal of High Performance Computing Applications Online First, 1-15.
Deelman, E., Singh, G., Livny, M., Berriman, B., & Good, J. (2008). The cost of doing science on the cloud: The montage example. In Proceedings of the 2008 ACM/IEEE conference on supercomputing, NJ (pp. 1-12).
Duan, R., Prodan, R., & Fahringer, T. (2006). Data mining-based fault prediction and detection on the grid. Proceedings of the 15th IEEE international symposium on high performance distributed computing. Los Alamitos, CA: IEEE Computer Society Press.
Fu, S., & Xu, C.-Z. (2007). Exploring event correlation for failure prediction in coalitions of clusters. Proceedings of the 2007 ACM/IEEE conference on supercomputing, SC '07 (41(1), pp. 1-12). New York, NY, USA: ACM.
Fu, S., & Xu, C.-Z. (2010). Quantifying event correlations for proactive failure management in networked computing systems. Journal of Parallel and Distributed Computing, 70(11), 1100-1109.
Ganga, K., Karthik, S., & Christopher Paul, A. (2012). A survey on fault tolerance in workflow management and scheduling. International Journal of Advanced Research in Computer Engineering & Technology, 1(8), 176-179.
Gil, Y., & Deelman, E. (2007). Examining the challenges of scientific workflows. IEEE Computer, 40(12), 243.
Guan, Q., Zhang, Z., & Song, F. (2011). Ensemble of bayesian predictors for autonomic failure management in cloud computing. IEEE, 1-6.
Guan, Q., Zhang, Z., & Song, F. (2012). A failure detection and prediction mechanism for enhancing dependability of data centers. International Journal of Computer Theory and Engineering, 4(5), 726-730.
Guo, L., Ma, Y., Cukic, B., & Singh, H. (2004). Robust prediction of fault-proneness by random forests. In Proceedings of the 15th international symposium on software reliability engineering (pp. 417-428).
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. SIGKDD Explorations, 11.
Hosmer, D., & Lemeshow, S. (1989). Applied logistic regression. New York, NY, USA: John Wiley & Sons.
Islam, A., Keung, J., Lee, K., & Liu, A. (2012). Empirical prediction models for adaptive resource provisioning in the cloud. Future Generation Computer Systems, 28, 155-162.
Jhawar, R., Piuri, V., & Santambrogio, M. D. (2012). A comprehensive conceptual system-level approach to fault tolerance in cloud computing. In IEEE international systems conference (pp. 1-5).
Jitsumoto, H., Endo, T., & Matsuoka, S. (2007). ABARIS: An adaptable fault detection/recovery component framework for MPI. In Proceedings of the IEEE international parallel and distributed processing symposium (pp. 1-8). Los Alamitos, CA: IEEE Computer Society Press.
Juve, G., & Deelman, E. (2010). Scientific workflows in the cloud. Cloud Chapter, 56-74.
Juve, G. et al. (2009). Scientific workflow applications on Amazon EC2. In E-science workshops, 5th IEEE international conference (pp. 59-66).
Juve, G. et al. (2010). Data sharing options for scientific workflows on Amazon EC2. In ACM/IEEE international conference on HPC, networking, storage and analysis, SC '10, Washington, DC, USA (p. 19).
Kousiouris, G., Menychtas, A., Kyriazis, D., Gogouvitis, S., & Varvarigou, T. (2014). Dynamic behavioral-based estimation of resource provisioning based on high-level application terms in cloud platforms. Future Generation Computer Systems, 32, 27-40.
Malhotra, R., & Jain, A. (2012). Fault prediction using statistical and machine learning methods for improving software quality. Journal of Information Processing Systems, 8, 241-262.
Ohlsson, N., Zhao, M., & Helander, M. (1998). Application of multivariate analysis for software fault prediction. Software Quality Journal, 7, 51-66.
Pandey, S., Karunamoorthy, D., & Buyya, R. (2011). Workflow engine for clouds. In R. Buyya, J. Broberg, & A. Goscinski (Eds.), Cloud computing: Principles and paradigms (pp. 321-334). Hoboken, NJ, USA: John Wiley & Sons Inc.
Pathirage, M., Perera, S., Kumara, I., & Weerawarana, S. (2011). A multi-tenant architecture for business process executions. In Proceedings 2011 IEEE international conference on web services (ICWS) (pp. 121-128).
Poola, D., Ramamohanarao, K., & Buyya, R. (2014). Fault-tolerant workflow scheduling using spot instances on clouds. In Procedia computer science, ICCS 2014. 14th international conference on computational science (Vol. 29, pp. 523-533).
Ramakrishnan, S., Reutiman, R., Chandra, A., & Weissman, J. (2013). Accelerating distributed workflows with edge resources. In IEEE international conference, IPDPS workshop (pp. 2129-2138).
Salfner, F., Lenk, M., & Malek, M. (2010). A survey of online failure prediction methods. ACM Computing Surveys, 42, 1-42.
Samak, T., Gunter, D., Goode, M., Deelman, E., Juve, G., Fabio, S., & Karan, V. (2012). Failure analysis of distributed scientific workflows executing in the cloud. In 8th International conference on network and service management (CNSM), Las Vegas, Nevada (pp. 1-9).
Samak, T., Gunter, D., Deelman, E., Juve, G., Mehta, G., Silva, F., et al. (2011). Online fault and anomaly detection for large-scale scientific workflows. 13th IEEE international conference on high performance computing and communications (HPCC-2011). Banff, Alberta, Canada: IEEE Computer Society.
Sindrilaru, E., Costan, A., & Cristea, V. (2010). Fault tolerance and recovery in grid workflow management systems. In International conference on complex, intelligent and software intensive systems, Krakow (pp. 475-480).
Suresh, Y., Kumar, L., Ku, S., & Rath (2014). Statistical and machine learning methods for software fault prediction using ck metric suite: A comparative analysis. ISRN Software Engineering, 1-16.
Tan, W., Chard, K., Sulakhe, D., Madduri, R., Foster, I., Soiland-Reyes, S., & Goble, C. (2009). Scientific workflows as services in caGrid: A Taverna and gRAVI approach. In Web services, ICWS 2009. IEEE international conference (pp. 413-420).
Varghese, B., McKee, G., & Alexandrov, V. (2010). In Intelligent agents for fault tolerance: From multi-agent simulation to cluster-based implementation, 24th international conference. WA: IEEE.
Wang, J., Korambath, P., Altintas, I., Davis, J., & Crawl, D. (2014). Workflow as a service in the cloud: Architecture and scheduling algorithms. In 14th International conference on computational science, procedia computer science (Vol. 29, pp. 546-556).
Xie, M., Dai, Y. S., & Poh, K. L. (2004). Computing system reliability: Models and analysis. Kluwer Academic Publishers.
Yu, Z., Wang, C., & Shi, W. (2010). FLAW: FaiLure-Aware Workflow scheduling in high performance computing systems. Journal of Cluster Computing, 13(4), 421-434. Hingham, MA, USA: Kluwer Academic Publishers.
Zhao, W., Melliar, P., & Moser, L. E. (2010). Fault tolerance middleware for cloud computing. Proceedings of the 2010 IEEE 3rd international conference on cloud computing. Washington, DC, USA: IEEE Computer Society.
