Moth-Flame Optimization for Training Multi-layer Perceptrons

Waleed Yamany a,*, Mohammed Fawzy a, Alaa Tharwat b,c,*, Aboul Ella Hassanien d,e,*

a Fayoum University, Faculty of Computers and Information, Fayoum, Egypt
  Email: wsyOO@fayoum.edu.eg
b Electrical Department, Faculty of Engineering, Suez Canal University, Ismailia, Egypt
c Faculty of Engineering, Ain Shams University, Cairo, Egypt
d Faculty of Computers and Information, Cairo University, Cairo, Egypt
e Faculty of Computers and Information, Beni Suef University, Egypt
* Scientific Research Group in Egypt (SRGE), http://www.egyptscience.net
Abstract-Multi-Layer Perceptron (MLP) is one of the Feed Forward Neural Network (FFNN) types. Searching for the weights and biases of an MLP is important to achieve minimum training error. In this paper, the Moth-Flame Optimizer (MFO) is used to train a Multi-Layer Perceptron (MLP). MFO-MLP searches for the weights and biases of the MLP to achieve minimum error and a high classification rate. Five standard classification datasets are utilized to evaluate the performance of the proposed method. Moreover, three function-approximation datasets are used to test the performance of the proposed method. The proposed method (i.e. MFO-MLP) is compared with four well-known optimization algorithms, namely, Genetic Algorithm (GA), Particle Swarm Optimization (PSO), Ant Colony Optimization (ACO), and Evolution Strategy (ES). The experimental results prove that the MFO algorithm is very competitive, solves the local optima problem, and achieves a high accuracy.

... optimization techniques to optimize the performance of the NN. On the other hand, stochastic trainers employ stochastic optimization methods to increase the performance of the NN.

In general, an NN consists of three layers, namely, the input, hidden, and output layers. Moreover, a network of connections links the nodes of all layers. The weight of each connection is adjusted during the training process. The trainer can be considered the most important element of any NN. The main goal of the trainer is to train the NN by searching for the optimal weights and biases that obtain the maximal accuracy for new sets (i.e. unknown sets or patterns) of given inputs. In other words, the trainer changes the structural parameters of the NN in every training step (iteration) to enhance the accuracy. When the training stage is finished, the model of the NN is used to predict or estimate the value of a new pattern.
Conclusions and future work are provided in Section V.
II. PRELIMINARIES

Multi-Layer Perceptron (MLP) is an FFNN that consists of one hidden layer, as shown in Fig. (1). After the inputs, weights, and biases are given, the output of the MLP is calculated in the following steps.

1) The weighted sums of the inputs are initially computed by Equation (1):

$$t_j = \sum_{i=1}^{n} (W_{ij} \cdot X_i) - B_j, \quad j = 1, 2, \ldots, h \qquad (1)$$

where $n$ is the number of input nodes, $W_{ij}$ represents the weight from the $i$th node in the input layer ($X_i$) to the $j$th node in the hidden layer ($h_j$), $X_i$ indicates the $i$th input, and $B_j$ represents the bias or threshold of the $j$th hidden node.

2) The output of each hidden node is computed as follows:

$$T_j = \mathrm{sigmoid}(t_j) = \frac{1}{1 + \exp(-t_j)}, \quad j = 1, 2, \ldots, h \qquad (2)$$

3) The final outputs are characterized based on the computed outputs of the hidden nodes:

$$o_k = \sum_{j=1}^{h} (W_{jk} \cdot T_j) - B'_k, \quad k = 1, 2, \ldots, m \qquad (3)$$

$$O_k = \mathrm{sigmoid}(o_k) = \frac{1}{1 + \exp(-o_k)}, \quad k = 1, 2, \ldots, m \qquad (4)$$

As may be seen in Equations (1), (2), (3), and (4), the weights and biases are in charge of characterizing the final output of the MLP from the given inputs. Finding suitable values for the weights and biases so as to accomplish a desirable relation between the inputs and the outputs is the precise meaning of training MLPs.

... the moths are represented in a matrix such that:

$$M = \begin{bmatrix} m_{1,1} & m_{1,2} & \cdots & m_{1,d} \\ m_{2,1} & m_{2,2} & \cdots & m_{2,d} \\ \vdots & \vdots & \ddots & \vdots \\ m_{n,1} & m_{n,2} & \cdots & m_{n,d} \end{bmatrix} \qquad (5)$$

Assume there exists an array for sorting the moths according to the value of the objective function:

$$OM = \begin{bmatrix} OM_1 \\ OM_2 \\ \vdots \\ OM_n \end{bmatrix} \qquad (6)$$

where $n$ is the number of moths.

Secondly, the other components of the MFO algorithm are the flames, which create another matrix similar to the moth matrix:

$$F = \begin{bmatrix} f_{1,1} & f_{1,2} & \cdots & f_{1,d} \\ f_{2,1} & f_{2,2} & \cdots & f_{2,d} \\ \vdots & \vdots & \ddots & \vdots \\ f_{n,1} & f_{n,2} & \cdots & f_{n,d} \end{bmatrix} \qquad (7)$$

where $n$ is the number of flames and $d$ is the number of parameters (dimensions). Moreover, assume that there exists an array for sorting the flames according to the value of the objective function:

$$OF = \begin{bmatrix} OF_1 \\ OF_2 \\ \vdots \\ OF_n \end{bmatrix} \qquad (8)$$

where $n$ is the number of flames.

Actually, the moths are the search agents which move around the search space, and the MFO algorithm is defined by the following triple:

$$\mathrm{MFO} = (A, B, C) \qquad (9)$$

where $A$ is a function which creates the random population of moths and their fitness values:

$$A : \phi \rightarrow \{M, OM\} \qquad (10)$$
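To make the forward computation of Equations (1)-(4) concrete, the following minimal NumPy sketch evaluates a one-hidden-layer MLP for a single input vector. The bias-subtraction convention and the example 3-7-1 structure follow the equations and Table (I); the function and array names (`mlp_forward`, `W1`, `B1`, `W2`, `B2`) and the random initialization are illustrative assumptions, not the authors' code.

```python
import numpy as np

def sigmoid(z):
    # Equations (2) and (4): logistic activation 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, B1, W2, B2):
    """Forward pass of a one-hidden-layer MLP.

    x  : input vector of length n
    W1 : (h, n) weights from input layer to hidden layer
    B1 : (h,)   hidden-layer biases (subtracted, as in Eq. (1))
    W2 : (m, h) weights from hidden layer to output layer
    B2 : (m,)   output-layer biases (subtracted, as in Eq. (3))
    """
    t = W1 @ x - B1          # Eq. (1): weighted sums of the inputs
    T = sigmoid(t)           # Eq. (2): hidden-node outputs
    o = W2 @ T - B2          # Eq. (3): weighted sums of the hidden outputs
    return sigmoid(o)        # Eq. (4): final MLP outputs

# Example with the 3-7-1 structure used for the XOR dataset (random weights).
rng = np.random.default_rng(0)
W1, B1 = rng.uniform(-10, 10, (7, 3)), rng.uniform(-10, 10, 7)
W2, B2 = rng.uniform(-10, 10, (1, 7)), rng.uniform(-10, 10, 1)
print(mlp_forward(np.array([1.0, 0.0, 1.0]), W1, B1, W2, B2))
```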
In the $B$ function, the position of each moth is updated with respect to its corresponding flame using Equations (13), (14), and (15):

$$M_i = P(M_i, F_j) \qquad (13)$$

where $P$ is the spiral function, $M_i$ refers to the $i$th moth, and $F_j$ indicates the $j$th flame.

$$P(M_i, F_j) = D_j \cdot e^{bt} \cdot \cos(2\pi t) + F_j \qquad (14)$$

where $D_j$ refers to the distance between the $i$th moth and the $j$th flame, $b$ is a constant for defining the spiral function, and $t$ is a random number between $-1$ and $1$. $D_j$ is computed as:

$$D_j = |F_j - M_i| \qquad (15)$$

Fig. 1: The structure of MLP with $n$ inputs, one hidden layer, and $m$ outputs.

The position of each agent represents the fitness of that particle. The fitness of the $i$th agent is expressed in terms of the average Mean Square Error (MSE). The MSE is used to measure how far the value of the desired output deviates from the value of the actual output, as follows:

$$MSE = \sum_{i=1}^{m} (o_i^k - d_i^k)^2 \qquad (16)$$

where $m$ represents the number of outputs, and $d_i^k$ and $o_i^k$ are the desired and actual outputs, respectively, of the $i$th output unit when the $k$th training sample is used. Hence, the average MSE is calculated by averaging the MSEs of all training samples as follows:

$$\overline{MSE} = \sum_{k=1}^{N} \frac{\sum_{i=1}^{m} (o_i^k - d_i^k)^2}{N} \qquad (17)$$

where $N$ is the total number of training samples. The objective function of the MFO algorithm aims to minimize the average MSE, i.e. $\min F(\vec{V}) = \overline{MSE}$. Generally, MFO iteratively moves the weights and biases of the MLP to minimize the average MSE and converges to a solution that is better than the random initial solutions. Hence, in each iteration the weights and biases are changed and the moths' positions are updated. However, there is no absolute guarantee of finding the global or most optimal solution for the MLP due to the stochastic nature of the MFO algorithm.
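The flame-approach update of Equations (13)-(15) can be sketched as follows. This is a simplified, single-moth illustration: the flame assignment, the sorting of Equations (6)-(8), and the adaptive reduction of the number of flames used in the full MFO algorithm [11] are omitted, and drawing $t$ independently per dimension is an implementation assumption.

```python
import numpy as np

def spiral_update(moth, flame, b=1.0, rng=None):
    """Move one moth towards its flame along a logarithmic spiral (Eqs. 13-15).

    moth, flame : position vectors, e.g. candidate MLP weight/bias vectors
    b           : spiral shape constant (Table II uses b = 1)
    """
    rng = rng or np.random.default_rng()
    D = np.abs(flame - moth)                  # Eq. (15): distance from moth to flame
    t = rng.uniform(-1.0, 1.0, moth.shape)    # random t in [-1, 1], drawn per dimension here
    return D * np.exp(b * t) * np.cos(2 * np.pi * t) + flame   # Eqs. (13)-(14)

# Example: one update step for a 5-dimensional moth.
rng = np.random.default_rng(42)
moth, flame = rng.uniform(-10, 10, 5), rng.uniform(-10, 10, 5)
print(spiral_update(moth, flame, rng=rng))
```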
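To evaluate a moth, its position vector is decoded into the MLP weights and biases, and the average MSE of Equations (16)-(17) is computed over the training samples. The sketch below assumes a simple flat parameter encoding and random example data; both are illustrative choices rather than the authors' exact encoding.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def average_mse(vector, X, D, n, h, m):
    """Objective of Eqs. (16)-(17): per-sample squared error averaged over N samples.

    vector : flat moth position holding all MLP weights and biases
    X, D   : training inputs (N, n) and desired outputs (N, m)
    """
    i = 0
    W1 = vector[i:i + h * n].reshape(h, n); i += h * n   # input-to-hidden weights
    B1 = vector[i:i + h];                   i += h       # hidden biases
    W2 = vector[i:i + m * h].reshape(m, h); i += m * h   # hidden-to-output weights
    B2 = vector[i:i + m]                                 # output biases
    total = 0.0
    for x, d in zip(X, D):
        T = sigmoid(W1 @ x - B1)           # Eqs. (1)-(2)
        O = sigmoid(W2 @ T - B2)           # Eqs. (3)-(4)
        total += np.sum((O - d) ** 2)      # Eq. (16): squared error of one sample
    return total / len(X)                  # Eq. (17): average over the N samples

# Example: fitness of a random moth for a 4-9-1 structure (55 parameters).
rng = np.random.default_rng(1)
n, h, m = 4, 9, 1
moth = rng.uniform(-10, 10, n * h + h + h * m + m)
X = rng.random((16, n)); D = rng.integers(0, 2, (16, m)).astype(float)
print(average_mse(moth, X, D, n, h, m))
```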
... ranged from a simple dataset, such as the XOR dataset, to a more complicated dataset, such as the breast cancer dataset. The XOR dataset consists of eight training and eight testing samples, only two classes, and each sample is represented by three attributes. On the other hand, the breast cancer dataset has 599 training samples, 100 testing samples, two classes, and each sample is represented by nine features. Moreover, another three function-approximation datasets, namely, sigmoid, cosine, and sine, are obtained from [13]. All three datasets have the same MLP structure (1-15-1) and have one attribute. The sigmoid dataset consists of 61 training samples and 121 testing samples, the cosine dataset consists of 31 training samples and 38 testing samples, and the sine dataset consists of 126 training samples and 252 testing samples. The training and testing samples are chosen from each dataset to evaluate the performance of the proposed model.

TABLE I: Datasets description [14].

Dataset         # Attributes   # Training Samples   # Testing Samples   # Classes   MLP Structure
3-bits XOR      3              8                    8                   2           3-7-1
Iris            4              150                  150                 3           4-9-3
Heart           22             80                   187                 2           22-45-1
Breast Cancer   9              599                  100                 2           9-19-1
Balloon         4              16                   16                  2           4-9-1

B. Experimental Setup

The initial parameters of all optimization algorithms are summarized in Table (II). Moreover, the weights and biases are randomly initialized in the range [-10, 10] for all datasets. The population size of all algorithms is 50 for the XOR dataset and 200 for the rest of the datasets. Further, the maximum number of iterations is 250. Furthermore, the structure of the MLP for each dataset is presented in Table (I).

TABLE II: Initial parameters of the optimization algorithms.

Optimization Algorithm   Parameter                          Value
GA                       Crossover                          Single point (probability = 1.0)
                         Mutation                           Uniform (probability = 0.01)
                         Type                               Real coded
PSO                      Topology                           Fully connected
                         Social constant (c2)               1
                         Cognitive constant (c1)            1
                         Inertia constant (w)               0.3
ACO                      Initial pheromone (τ)              1e-06
                         Pheromone update constant          20
                         Pheromone constant (q)             1
                         Global pheromone decay rate (pg)   0.9
                         Local pheromone decay rate         0.5
                         Pheromone sensitivity (a)          1
ES                       λ                                  10
                         σ                                  1
MFO                      b                                  1
                         t                                  [-1, 1]

Due to the different ranges of the attributes, a normalization step is essential for the MLP. In this work, the min-max normalization method is used, which is calculated as in Equation (18). In the min-max normalization method, the variable x is mapped from the interval [a, b] to the interval [c, d] (a short code sketch of this step is given at the end of this section). Moreover, in this research, the number of hidden nodes of the MLPs is assumed to be equal to 2 × N + 1, where N represents the number of attributes of the dataset.

$$x' = \frac{(x - a) \times (d - c)}{(b - a)} + c \qquad (18)$$

Each algorithm is run five times on each dataset, and the average (AVG) and standard deviation (STD) of the best Mean Square Errors (MSEs) in the last iteration of each algorithm are calculated. The best classification rates or test errors of each algorithm are also calculated.

C. Experimental Scenarios

In order to verify the performance of the proposed algorithm (i.e. MFO), four well-known optimization algorithms, namely, PSO, ACO, ES, and GA, are compared with the MFO algorithm on five standard benchmarks and three function-approximation datasets. In this section, two experimental scenarios are performed. In the first experimental scenario, five sub-experiments are performed, in which all optimization algorithms are applied to the five standard MLP datasets. In the second experimental scenario, three sub-experiments are performed, in which all optimization algorithms are applied to the three function-approximation datasets. In each sub-experiment, all optimization algorithms are applied to one dataset.

According to the first scenario, in the first sub-experiment, the XOR dataset is used. As shown in Table (I), the XOR dataset consists of three attributes, eight training samples, eight testing samples, two classes, and one output. In the second sub-experiment, the Iris dataset, which is one of the most common standard datasets, is used. The Iris dataset consists of four attributes, 150 training samples, 150 testing samples, three classes, and three outputs, as shown in Table (I). The Heart dataset is used in the third sub-experiment. As shown in Table (I), the heart dataset consists of 22 attributes, 80 training samples, 187 testing samples, two classes, and one output, and the structure of the MLP is 22-45-1. The fourth sub-experiment is applied to the Breast cancer dataset. The breast cancer dataset consists of nine attributes, 599 training samples, 100 testing samples, two classes, and one output, and the structure of the MLP is 9-19-1. Thus, 209 variables are optimized. In the fifth sub-experiment, the balloon dataset is used. As shown in Table (I), the balloon dataset consists of four attributes, 16 training samples, 16 testing samples, two classes, and one output, and the structure of the MLP is 4-9-1; thus, the dimension of the trainer is 55. The MSEs and classification rates of all sub-experiments are summarized in Table (III) and Fig. (3), respectively.

In the second scenario, the function-approximation datasets are used. In the first sub-experiment, the sigmoid function, which is the simplest function in the function-approximation datasets, is used. The sigmoid dataset consists of one attribute, 61 training samples, 121 testing samples, and one output, and the structure of the MLP is 1-15-1. The cosine function, which is more difficult than the sigmoid function, is used in the second sub-experiment. This dataset consists of one attribute, 31 training samples, 38 testing samples, and one output, and the structure of the MLP is 1-15-1. In the third sub-experiment, the sine function is used. The sine function-approximation dataset consists of one attribute, 126 training samples, 252 testing samples, and one output, and the structure of the MLP is 1-15-1. The results of all sub-experiments in this scenario are summarized in Table (IV) and Fig. (4).
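Referring back to the min-max normalization of Equation (18), a minimal sketch is given below. The function name and the default target interval [0, 1] are assumptions, since the paper does not state which [c, d] interval it uses.

```python
import numpy as np

def min_max_normalize(x, a, b, c=0.0, d=1.0):
    """Map x from the original interval [a, b] to [c, d] (Equation (18))."""
    return (x - a) * (d - c) / (b - a) + c

# Example: scale one attribute column of a dataset to [0, 1].
column = np.array([2.0, 5.0, 9.0])
print(min_max_normalize(column, column.min(), column.max()))
```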
TABLE III: MSE for the XOR, iris, heart, breast cancer, and balloon datasets.

Algorithm   XOR                       Iris                    Heart                   Breast Cancer            Balloon
MFO         1.0189e-09 ± 1.5111e-09   0.0221 ± 0.0028         0.1982 ± 0.006879       0.00022 ± 4.8597e-07     1.3033e-20 ± 2.8898e-20
PSO         0.084050 ± 0.035945       0.228680 ± 0.057235     0.188568 ± 0.008939     0.034881 ± 0.002472      0.000585 ± 0.000749
GA          0.000181 ± 0.000413       0.089912 ± 0.123638     0.093047 ± 0.022460     0.003026 ± 0.001500      5.08e-24 ± 1.06e-23
ACO         0.180328 ± 0.025268       0.405979 ± 0.053775     0.228430 ± 0.004979     0.013510 ± 0.002137      0.004854 ± 0.007760
ES          0.118739 ± 0.011574       0.314340 ± 0.052142     0.192473 ± 0.015174     0.040320 ± 0.002470      0.019055 ± 0.170260
Fig. 3: Classification rates for XOR, iris, heart, breast cancer, and balloon datasets.
TABLE IV: MSE for the sigmoid, cosine, and sine datasets.
Algorithm Sigmoid Cosine Sine
MFO 0.000198 ± 0.000018 0.00035 ± 0.00012 0.192 ± 0.001
PSO 0.023 ± 0.0093 0.0591 ± 0.0211 0.61 ± 0.0711
GA 0.00139 ± 0.001 0.0112 ± 0.00613 0.442 ± 0.06
ACO 0.0241 ± 0.0101 0.0509 ± 0.0111 0.56 ± 0.0512
ES 0.0772 ± 0.0172 0.0872 ± 0.0221 0.73 ± 0.0751
The MFO algorithm achieved superior results compared with the other four algorithms. According to the MSE, the MFO algorithm achieved a relatively minimal MSE, which reflects the high local optima avoidance of this algorithm. The reason for the minimal MSE of the MFO algorithm is its high exploratory behavior, which helps in local optima avoidance. In other words, in the MFO algorithm, half of the iterations are devoted to the exploration of the search space, which changes for every dataset when training MLPs, while the rest of the iterations are devoted to exploitation. High exploitation behavior leads to a rapid convergence towards the global optimum, hence solving the local optima problem. According to the classification rate, the MFO algorithm achieved the highest rate among all other algorithms. The reason for the high classification rate is that MFO has adaptive parameters to smoothly balance between exploitation and exploration. In general, the MFO algorithm is more suitable and effective in the case of difficult and complicated datasets, and it is recommended for optimizing the training process of MLPs.

V. CONCLUSIONS

In this paper, the MFO algorithm is proposed to search for the weights and biases to train an MLP. The proposed algorithm (i.e. MFO) is applied to five standard classification datasets, namely, the XOR, iris, heart, breast cancer, and balloon datasets, and three function-approximation datasets, namely, sigmoid, sine, and cosine. Four well-known optimization algorithms, namely, PSO, GA, ES, and ACO, are also used to train the MLP, and the results of the MFO algorithm are compared with these four optimization algorithms. The results showed that the MFO algorithm is effective in training MLPs and solves the local minimum problem efficiently. Hence, MFO helps in finding the optimal weights and biases and achieves a low MSE and a high classification rate.

REFERENCES

[3] T. Kohonen, "The self-organizing map," Neurocomputing, vol. 21, no. 1, pp. 1-6, 1998.
[4] S. Ghosh-Dastidar and H. Adeli, "Spiking neural networks," International Journal of Neural Systems, vol. 19, no. 04, pp. 295-308, 2009.
[5] T. Kohonen, "Improved versions of learning vector quantization," in IJCNN International Joint Conference on Neural Networks, 1990. IEEE, 1990, pp. 545-550.
[6] G. Bebis and M. Georgiopoulos, "Feed-forward neural networks," IEEE Potentials, vol. 13, no. 4, pp. 27-31, 1994.
[7] A. Tharwat, T. Gaber, M. M. Fouad, V. Snasel, and A. E. Hassanien, "Towards an automated zebrafish-based toxicity test model using machine learning," Proceedings of the International Conference on Communications, Management, and Information Technology (ICCMIT'2015), Procedia Computer Science, vol. 65, pp. 643-651, 2015.
[8] J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the Theory of Neural Computation. Basic Books, 1991, vol. 1.
[9] R. J. Williams and J. Peng, "An efficient gradient-based algorithm for on-line training of recurrent network trajectories," Neural Computation, vol. 2, no. 4, pp. 490-501, 1990.
[10] A. Tharwat, T. Gaber, A. E. Hassanien, M. Shahin, and B. Refaat, "SIFT-based Arabic sign language recognition system," in Afro-European Conference for Industrial Advancement, Villejuif, France, September 9-11. Springer, 2015, pp. 359-370.
[11] S. Mirjalili, "Moth-flame optimization algorithm: A novel nature-inspired heuristic paradigm," Knowledge-Based Systems, 2015.
[12] A. I. Hafez, H. M. Zawbaa, A. E. Hassanien, and A. A. Fahmy, "Networks community detection using artificial bee colony swarm optimization," in Proceedings of the Fifth International Conference on Innovations in Bio-Inspired Computing and Applications (IBICA), Ostrava, Czech Republic, June 23-25. Springer, 2014, pp. 229-239.
[13] S. Mirjalili, S. M. Mirjalili, and A. Lewis, "Let a biogeography-based optimizer train your multi-layer perceptron," Information Sciences, vol. 269, pp. 188-209, 2014.
[14] C. Blake and C. J. Merz, "UCI repository of machine learning databases," 1998.