
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 17, NO. 6, NOVEMBER 2006

Neural Networks for Continuous Online Learning and Control

Min Chee Choy, Dipti Srinivasan, Senior Member, IEEE, and Ruey Long Cheu

Abstract: This paper proposes a new hybrid neural network (NN) model that employs a multistage online learning process to solve the distributed control problem with an infinite horizon. Various techniques such as reinforcement learning and evolutionary algorithms are used to design the multistage online learning process.
For this paper, the infinite horizon distributed control problem
is implemented in the form of real-time distributed traffic signal
control for intersections in a large-scale traffic network. The hybrid
neural network model is used to design each of the local traffic signal
controllers at the respective intersections. As the state of the traffic
network changes due to random fluctuation of traffic volumes,
the NN-based local controllers will need to adapt to the changing
dynamics in order to provide effective traffic signal control and to
prevent the traffic network from becoming overcongested. Such a
problem is especially challenging if the local controllers are used
for an infinite horizon problem where online learning has to take
place continuously once the controllers are implemented into the
traffic network. A comprehensive simulation model of a section of
the Central Business District (CBD) of Singapore has been developed using PARAMICS microscopic simulation program. As the
complexity of the simulation increases, results show that the hybrid
NN model provides significant improvement in traffic conditions
when evaluated against an existing traffic signal control algorithm
as well as a new, continuously updated simultaneous perturbation
stochastic approximation-based neural network (SPSA-NN). Using
the hybrid NN model, the total mean delay of each vehicle has been
reduced by 78% and the total mean stoppage time of each vehicle
has been reduced by 84% compared to the existing traffic signal
control algorithm. This shows the efficacy of the hybrid NN model
in solving large-scale traffic signal control problems in a distributed manner. It also indicates the possibility of using the hybrid NN model for other applications similar in nature to the infinite horizon distributed control problem.
Index Terms: Distributed control, hybrid model, neural control, online learning, traffic signal control.

I. INTRODUCTION

PROVIDING effective real-time control of a dynamically changing system involves implementing a control system
that has the ability to constantly adapt itself. This can be
achieved by making appropriate changes to its control parameters so that it can provide a set of appropriate plans for the
dynamic system it is trying to control. The implementation of
such a control system is not a trivial matter even if the dynamic
system has only a few degrees of freedom if the set of plans are
Manuscript received May 17, 2005; revised September 15, 2005.
M. C. Choy and D. Srinivasan are with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore 117576, Singapore (e-mail: engp1637@nus.edu.sg; dipti@nus.edu.sg).
R. L. Cheu is with the Department of Civil Engineering, National University
of Singapore, Singapore 117576, Singapore (e-mail: cheu@nus.edu.sg).
Color versions of Figs. 1, 6, 8-13, and 15-19 are available at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNN.2006.881710

to be globally optimal for the system. This is made even more challenging if the control process is a continuous one lasting
indefinitely (e.g., a set of traffic signal controllers implemented
in a traffic network should run for years with little or no offline
tuning). Over the years, neural networks (NNs) have been
applied to solve different complex problems given their ability
to provide a good, nonlinear mapping between the inputs and
the desired outputs of a system. As such, NNs have been widely
used as a model for such a control system.
A. Continuous Online Learning of a NN
Online learning is a major aspect in designing an NN-based
controller for dealing with dynamically changing problems
given that the control parameters need to be updated as the
system changes. One of the major concerns is the ability of the
NN to converge to a set of optimal plans for the dynamically
changing systems they are trying to control while using certain
online learning algorithms. For problems that can be modeled
as a Markov decision process (MDP) or semi-Markov decision
process (SMDP), research works [1][4] have shown that
certain online learning algorithms can actually be applied to
yield a set of optimal solutions. Unfortunately, not all dynamic
systems can be accurately modeled as an MDP or SMDP. In
addition, good plant models for some complex systems (such
as that of the large-scale traffic network) are often not available.
Given that many of the offline learning methods such as
backpropagation of error cannot be applied directly in an NN for
online learning, studies have been carried out over the years to
integrate some of the online learning techniques with existing
offline methods so that connectionist networks can be used as
real-time controllers. Examples include the Q-learning form of weight updates [5], [6] and the integration of reinforcement learning with backpropagation [6]. For some applications, the use of certain online learning methods in an NN actually yields suboptimal results [7] and inconsistent performance. Hence, the challenge of finding an effective online learning method for an NN that is generic enough for applications in different problem domains still exists.
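As a concrete illustration of such an integration, the sketch below combines a Q-learning temporal-difference signal with backpropagation through a small two-layer network. All sizes, learning rates, and names here are illustrative assumptions, not the exact schemes of [5] and [6].

```python
import numpy as np

# Minimal sketch: a Q-learning-style temporal-difference (TD) update
# backpropagated through a two-layer network (illustrative assumptions).
rng = np.random.default_rng(0)

n_in, n_hid, n_act = 4, 8, 3                   # state dim, hidden units, actions
W1 = rng.normal(0.0, 0.1, (n_hid, n_in))       # input -> hidden weights
W2 = rng.normal(0.0, 0.1, (n_act, n_hid))      # hidden -> Q-value weights

def q_values(s):
    h = np.tanh(W1 @ s)
    return W2 @ h, h

def td_update(s, a, r, s_next, eta=0.05, gamma=0.9):
    """One online Q-learning step: the TD error plays the role of the
    backpropagated error signal for the chosen action's output unit."""
    global W1, W2
    q, h = q_values(s)
    q_next, _ = q_values(s_next)
    td_err = r + gamma * np.max(q_next) - q[a]  # scalar TD error
    gW2 = td_err * h                            # gradient for output row a
    gW1 = td_err * np.outer(W2[a] * (1.0 - h**2), s)
    W2[a] += eta * gW2                          # move q[a] toward the TD target
    W1 += eta * gW1
    return td_err

s, s_next = rng.normal(size=n_in), rng.normal(size=n_in)
err = td_update(s, a=0, r=1.0, s_next=s_next)
```

The gradient signal here is not a supervised target but the TD error, which is exactly the kind of bootstrapped, online signal that makes convergence analysis harder than in offline backpropagation.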
This challenge is made even more difficult if the problem has
an infinite horizon and it is continuously changing. The online
learning process will have to take place continuously for an
indefinite amount of time when dealing with such problems.
Also, issues will arise concerning the adaptability of the neural controllers, their ability to avoid or escape local minima, good stochastic exploration, etc.
B. Real-Time Traffic Signal Control
Control of traffic signals for efficient movement of traffic on
urban streets constitutes a challenging part of an Urban Traffic

1045-9227/$20.00 © 2006 IEEE

Fig. 1. Breakdown of a three-phase cycle.

Control System (UTCS). A common form of traffic signal control is preset timings. These preset timings are optimized offline for the traffic patterns at a particular time of day. At a predetermined time, a new set of timings is downloaded to each
traffic signal in the network. However, these preset timings may
not work well for a complex traffic network with continuously
changing traffic volumes throughout the day and between days.
This presents the need to implement a control system that can
perform real-time update of traffic signals in the traffic network
based on the changes in traffic volume. Such an idea is possible only if the local controllers can adapt to the changing dynamics of the traffic network. For the case where individual local controllers control the traffic signals for an indefinite amount of time after they are installed into the traffic network, the problem of real-time traffic signal control takes the form of an infinite horizon distributed control problem. Hence, for effective traffic signal control, such controllers need to adapt themselves continuously.
Different techniques exist in designing these real-time traffic
signal controllers. Examples include the use of fuzzy logic and
fuzzy sets [8], [9], as well as the use of genetic algorithm and reinforcement learning [10]. Most of these ideas are based on the
distributed approach where a local controller is assigned to update the traffic signals of a single intersection based on the traffic
flow in all the approaches of that intersection. Recent research
works also include the use of NN-based controllers [11][13].
Some of these approaches such as [11] and [13] use a simplified traffic network model consisting of a single intersection.
Hence, it is unclear if the neural controller can effectively control a large-scale traffic network with multiple intersections.
C. Objectives
NN-based local controllers implemented for the infinite
horizon distributed control problem have to constantly adapt

themselves to the changing dynamics of the problem domain in order to come up with good approximations of the optimal
control functions. In this paper, a new hybrid NN model is
used to implement the local controllers for providing effective
distributed control of a large-scale traffic network with multiple intersections. A multistage online learning process has
been introduced and implemented in the hybrid NN model.
Using a comprehensive model of a real-world traffic network,
the performance of the hybrid NN model will be compared
with a new, continuously updated simultaneous perturbation
stochastic approximation-based neural network (SPSA-NN) as
well as an existing traffic signal control green link determining
algorithm (GLIDE). This study can then be used to demonstrate
the efficacy of the hybrid NN model in solving the infinite
horizon distributed control problem.
II. TRAFFIC SIGNAL CONTROL PROBLEM
A. Problem Description
Traffic signal operations at the intersections of an arterial network are one of the ways in which traffic conditions within the network can be influenced and controlled. For this research, the signal control policies formulated by the NN-based local controllers involve three types of control actions: cycle time adjustment; split adjustment, where split is the fraction of the cycle time allocated to a particular phase for a set of traffic movements [14] (Fig. 1); and offset adjustment, where offset is the time difference between the beginning of the green phases for a continuous traffic movement at successive intersections, which may give rise to a green wave along an arterial [14].
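The three control actions can be pictured as bounded adjustments to a signal plan. The sketch below is illustrative only; the class, bounds, and parameter names are assumptions, not from the paper.

```python
from dataclasses import dataclass

# Illustrative encoding of the three control actions on a signal plan:
# cycle time, per-phase splits, and offset (names/bounds are assumptions).
@dataclass
class SignalPlan:
    cycle: float     # cycle time in seconds
    splits: list     # fraction of the cycle per phase, sums to 1.0
    offset: float    # seconds relative to a reference intersection

def adjust(plan, d_cycle=0.0, d_splits=None, d_offset=0.0,
           min_cycle=40.0, max_cycle=120.0):
    """Apply bounded adjustments; splits are renormalized to sum to 1."""
    cycle = min(max(plan.cycle + d_cycle, min_cycle), max_cycle)
    splits = plan.splits if d_splits is None else [
        max(s + d, 0.05) for s, d in zip(plan.splits, d_splits)]
    total = sum(splits)
    splits = [s / total for s in splits]
    offset = (plan.offset + d_offset) % cycle   # offset wraps within a cycle
    return SignalPlan(cycle, splits, offset)

p = SignalPlan(cycle=90.0, splits=[0.5, 0.3, 0.2], offset=10.0)
p2 = adjust(p, d_cycle=10.0, d_splits=[0.05, -0.03, -0.02], d_offset=5.0)
```

For a green wave, the offset of a downstream intersection is typically set near the travel time between intersections, i.e., roughly intersection spacing divided by the progression speed.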
For a large, complex traffic network with multiple intersections, setting the values of these traffic signal parameters for


each intersection in the network in a traffic-responsive manner is an extremely difficult task, especially given the interdependency of each intersection and its neighbors. Moreover, the actual number of vehicles that arrive and leave the traffic network each day is stochastic in nature, resulting in varying levels of traffic demand. Hence, due to the complicated nature of the
traffic signal control problem, the distributed control technique
involving multiple local controllers is applied with the objective of achieving coordinated traffic signal control for a complex traffic network so as to reduce the likelihood of traffic
congestion.
B. Performance Measures
The performance of the NN-based distributed control model
is evaluated using three performance measures: total mean
delay of vehicles, total mean stoppage time of vehicles,
and current vehicle mean speed. The microscopic traffic
simulation platform of PARAMICS allows us to take detailed measurements of various parameters associated with each vehicle that enters and leaves the traffic network. As such, during the course of the simulation, the delay faced by each vehicle entering and leaving the network was stored in memory, and the total mean delay of the vehicles was calculated as follows:

  D_mean = D_total / n    (1)

where D_mean is the total mean delay of vehicles, D_total is the total amount of delay faced by all vehicles that entered and left the traffic network during the time when the measurement was taken, and n is the total number of vehicles that entered and left the traffic network during that time.

Similarly, the total stoppage time for each vehicle was also stored in memory to facilitate the calculation of the mean stoppage time. The equation for the calculation of the total mean stoppage time is given as follows:

  S_mean = S_total / n    (2)

where S_mean is the total mean stoppage time for the vehicles, S_total is the total amount of stoppage time faced by all vehicles that entered and left the traffic network during the time when the measurement was taken, and n is the total number of vehicles that entered and left the traffic network during that time.

Finally, the current vehicle mean speed is the average speed of all the vehicles that are currently in the traffic network. These three performance measures are reflective of the overall traffic condition in the traffic network. For an overcongested traffic network, the vehicles are likely to suffer from high stoppage time as they pass from one street to another and, consequently, the delay will be high and their current mean speed will be low.

C. NNs for Distributed Control of a Traffic Network

A large-scale control problem such as controlling the traffic signals in a traffic network can be divided into subproblems, where each subproblem is handled by a local controller. For such a distributed approach, each local controller generates its own control variables based on the local information it receives. Also, exchange of information can be present among the local controllers, either laterally (for controllers in the same hierarchical level) or vertically (between lower level controllers and higher level controllers). This is a form of cooperation, and it can be used to affect the generation of control variables. [15] presents an analytical approach for dealing with control problems of such nature by modeling the traffic network as graphs consisting of nodes and links. Based on the approach in [15], the following are defined:

G = (N, L): directed graph with a set N of nodes and a set L of links describing the traffic network;
LC_i: local controller acting at node i;
M(t): total number of temporal stages at time t;
x_i(t): local input information vector of LC_i at stage t;
w_i(t): weight vector of LC_i at stage t;
w_i*(t): optimal weight vector of LC_i at stage t;
R_i(x_i(t), w_i(t)): neural control function of LC_i at stage t;
R_i*(x_i(t), w_i*(t)): optimal neural control function of LC_i at stage t;
s(t): state vector of the traffic network at stage t;
J: cost function for the entire traffic network.

The distributed control problem of the traffic network can thus be rephrased as follows.

Distributed Control Problem: Find the set of neural control functions R_i* for each LC_i that minimizes the cost function J, where J is a function of the states of the traffic network at different stages.

Due to the difficulty of obtaining the optimal solution for the distributed control problem in an analytical manner, an alternative approach involving an approximated optimal solution is used. The optimal neural control function R_i* for each local controller will be used as an approximation to the optimal control function for the traffic network.
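In implementation terms, the performance measures of Section II-B reduce to simple averages over per-vehicle records. The sketch below uses illustrative field names and numbers.

```python
# Sketch of the three performance measures of Section II-B, computed from
# per-vehicle simulation records (field names and values are illustrative).
vehicles = [
    {"delay": 120.0, "stoppage": 45.0},   # completed trips (entered and left)
    {"delay": 80.0,  "stoppage": 30.0},
    {"delay": 100.0, "stoppage": 60.0},
]
current_speeds = [35.0, 42.0, 28.0]       # vehicles still in the network, km/h

n = len(vehicles)
total_mean_delay = sum(v["delay"] for v in vehicles) / n        # eq. (1)
total_mean_stoppage = sum(v["stoppage"] for v in vehicles) / n  # eq. (2)
current_mean_speed = sum(current_speeds) / len(current_speeds)
```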
For this paper, the local controllers LC_i of the traffic network are designed for continuous traffic control. Hence, the optimization problem becomes an infinite-horizon one. A reasonable approximation of such a problem has been presented in [15], based on [16], in the form of the receding-horizon limited-memory problem, where the requirement for infinite memory storage capacity can be overlooked.

For the receding-horizon limited-memory optimization problem, at stage t, one has to solve problem A for the stages t to t + M − 1. The approximated optimal control variables generated by the optimal neural control functions are applied at stage t. For the next stage t + 1, problem A is restated again for stages t + 1 to t + M. Once again, the approximated optimal control variables generated by the neural control functions are applied at t + 1.
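The receding-horizon scheme described above can be sketched as a generic control loop; the M-stage solver and the plant below are illustrative stand-ins, not the traffic models of [15] and [16].

```python
# Minimal receding-horizon control loop: plan over M stages, apply only the
# first stage's control, advance the plant, then re-solve (a sketch of the
# scheme in [15], [16] with stand-in solver and plant).
def receding_horizon(initial_state, step, solve_m_stage, M=5, horizon=20):
    """step(state, u) advances the plant one stage; solve_m_stage(state, M)
    returns a list of M control vectors approximately minimizing the M-stage
    cost from `state`. Both are problem-specific stand-ins."""
    state, applied = initial_state, []
    for t in range(horizon):
        plan = solve_m_stage(state, M)   # problem A for stages t .. t+M-1
        u = plan[0]                      # apply only the first control
        state = step(state, u)
        applied.append(u)
    return state, applied

# Toy plant: a scalar state decays toward 0 when the control cancels half of it.
final, us = receding_horizon(
    10.0,
    step=lambda s, u: s + u,
    solve_m_stage=lambda s, M: [-0.5 * s] * M,
    M=3, horizon=10)
```

Only finite memory is needed because each re-solve looks at most M stages ahead, which is the practical appeal of the limited-memory formulation.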


Given the aforementioned, the problem now arises concerning the following:
1) the approximating ability of the NNs that are involved;
2) the ability of each LC_i to derive a good approximation of the optimal solution in a timely manner.
The first issue is mainly a structural one concerning the layout of the NNs. [15] has supplied various proofs concerning the approximating properties of NNs with a single hidden layer for solving the distributed control problem. For other NNs, such as fuzzy NNs with two or more hidden layers, there exist proofs showing that those NNs can work as universal approximators under certain conditions, e.g., [17].
The second issue, however, may not be easily solved even if the first issue has been resolved (recall that [15] did not give the rate of convergence for its propositions). The difficulty in obtaining a reasonably good approximated optimal solution for stage t + 1 (or any other future stage) is due to computational limitations as well as the limitations of various existing parameter update algorithms. Computational limitation refers to the limited number of stages that each LC_i can consider at any particular stage due to finite memory storage space as well as the computational speed of the processor. The limitations of existing parameter update algorithms are mainly due to the fact that most parameter update algorithms for connectionist networks are designed for finite horizon learning processes. Hence, if these algorithms are employed for an online, infinite horizon learning process, problems such as solutions getting stuck in local minima and inadequate stochastic exploration can become very severe. As such,
the focus of this paper is to develop a new hybrid NN model as
well as a new, continuously updated SPSA-NN model that are
suited for such a problem and to compare their performances.
III. NN MODELS FOR TRAFFIC SIGNAL CONTROL
Several NN-based traffic signal control models have been presented in previous research works. Three representative examples taken from [11], [12], and [13] will be discussed in this
section.
The research works in [11] and [13] involve designing an
NN-based controller for updating the traffic signal of an isolated
traffic intersection. Both [11] and [13] incorporate the concept
of fuzzy logic and their fuzzy-NNs are of the five-layer type
(inputs, fuzzification, inference, consequence, defuzzification).
The inputs for the fuzzy-NNs that are implemented in [11] and
[13] consist mainly of two types as follows:
1) the number of vehicles that pass through the different approaches of the intersection;
2) the number of vehicles waiting in the various queues.
The outputs of the fuzzy-NN controllers are various traffic signal plans that involve adjusting certain components of the traffic signal (refer to the problem description in Section II-A). For example, [13] uses several sets of fuzzy-neural rule bases to generate different types of green-split adjustments based on the inputs. In [13], reinforcement learning and gradient descent methods are used to adjust the shape of the fuzzy membership functions (through updating the weights of the fuzzy NNs).
There are several limitations to the approach adopted in [13]
as reported by its authors. First, the neural learning is not effective under certain circumstances due to the lack of stochastic

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 17, NO. 6, NOVEMBER 2006

exploration. This is especially important if the learning process


is a continuous one. Second, it is reported in [13] that the time
needed to adjust the membership functions is too long and,
hence, the algorithm cannot be applied in the field. Finally, it
is not known whether the fuzzy NNs implemented in [11] and [13] can yield good performance if they are implemented in a more
complex traffic network consisting of more than one intersection. Bearing these in mind, the hybrid NN model presented in
this paper seeks to overcome the aforementioned limitations.
The research work covered in [12] involves the application
of simultaneous perturbation stochastic approximation (SPSA)
in modeling the weight update process of an NN. The SPSA
algorithm is a viable option for online weight update given that
it presents some form of stochastic exploration and the SPSA
algorithm can converge to a set of optimal values under certain
conditions. The SPSA-NN model initially developed in [12] was applied to control traffic signals for a large traffic network with multiple intersections. Unlike the control models in [11] and [13], [12] uses only a three-layer NN. The inputs of the
NNs consist of relevant traffic variables, similar to that of [11]
and [13].
From [12], it can be seen that the approach has two minor
shortcomings.
1) The approach involves the use of heuristics in the form of
manually identifying the general traffic patterns (morning
peaks, evening peaks) and assigning time periods for each
pattern. The robustness of the system may come into question if the fluctuations of the traffic volume in the traffic
network are not periodic.
2) An NN is assigned to each time period, and the weights of the NN are updated only during that time period, on a daily basis, whenever the same traffic pattern and time period arise. As such, the traffic controllers may not be able to respond well to changes in the traffic network within the same time period.
Results in [12] show that the SPSA-NN is able to converge to
a set of optimal signal plans. However, as noted in [12], the NN
has to be updated time and again to take into consideration
changes in the long-term dynamics of the traffic network even
after the convergence. Also, no formal proof of convergence to
a set of globally optimal solutions is presented in [12] as well
as in [18] and [19]. Nevertheless, given that the research work
in [12] yields good results for traffic signal control of a large
traffic network, and coupled with the favorable characteristics
of the SPSA algorithm, a new, continuously updated SPSA-NN
that overcomes the aforementioned limitations has been implemented in this study and its performance will be compared to
that of the hybrid model.
A. Fuzzy NN
In this paper, fuzzy NNs are used as the framework for implementing the NN controllers, both for the hybrid NN model as
well as for the continuously updated SPSA-NN model. Fuzzy
NNs have been widely used to control complex systems and
their advantages have been noted in many previous research
works. Fig. 2 shows the structure of the fuzzy NN which will
be used throughout this research.


Fig. 2. Five-layered fuzzy NN.

The five-layered fuzzy NN shown in Fig. 2 follows the popular convention used in the literature, and it can be used for a wide variety of applications. The operator between the fuzzification layer (second layer) and the implication layer (third layer) is taken to be a t-norm. The operator between the implication layer (third layer) and the consequent layer (fourth layer) is taken to be an s-norm. Membership functions of the terms of the fuzzy output are singletons, since complicated membership functions and complex algorithms for defuzzification may affect the real-time performance of the NN controller with no significant improvement in its behavior. Section IV gives a brief introduction to and a description of the advantages of stochastic approximation as well as SPSA. Following that, it describes how SPSA can be applied to update the weights of an NN.
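Before moving on, a forward pass through such a five-layered fuzzy NN can be sketched as follows; the membership functions, rule base, and singleton positions below are illustrative assumptions, not the controllers of this paper.

```python
import numpy as np

# Sketch of a forward pass through a five-layer fuzzy NN
# (inputs -> fuzzification -> implication -> consequent -> defuzzification),
# with product as the t-norm, max as the s-norm, and singleton output terms.
def gaussian_mf(x, c, s):
    return np.exp(-0.5 * ((x - c) / s) ** 2)

def fuzzy_nn(x, centers, sigmas, rules, rule_to_term, singletons):
    # Layer 2: fuzzification -- membership degree of each input in each term
    mu = gaussian_mf(x[:, None], centers, sigmas)        # (n_inputs, n_terms)
    # Layer 3: implication -- t-norm (product) over each rule's antecedents
    firing = np.array([np.prod([mu[i, t] for i, t in rule]) for rule in rules])
    # Layer 4: consequent -- s-norm (max) over the rules mapped to each term
    term_strength = np.array([
        max(firing[r] for r in rule_to_term[j]) for j in range(len(singletons))])
    # Layer 5: defuzzification with singleton output membership functions
    return float(term_strength @ singletons / term_strength.sum())

x = np.array([0.3, 0.7])
centers = np.array([0.0, 0.5, 1.0]); sigmas = np.array([0.25, 0.25, 0.25])
rules = [[(0, 0), (1, 1)], [(0, 1), (1, 2)], [(0, 2), (1, 2)]]
rule_to_term = {0: [0], 1: [1, 2]}          # output term -> contributing rules
singletons = np.array([-1.0, 1.0])          # singleton positions of output terms
y = fuzzy_nn(x, centers, sigmas, rules, rule_to_term, singletons)
```

The singleton consequents make defuzzification a single weighted average, which is why they are attractive for real-time control.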

IV. CONTINUOUSLY UPDATED SPSA-NN MODEL

Stochastic optimization is the problem of finding a minimum point w* of a real-valued function L(w) in the presence of noise. Stochastic approximation is a popular technique of stochastic optimization, and it is used in situations where the gradient g(w) of the cost function is not readily available. The relations (e.g., convergence properties) between some of the online learning techniques such as Q-learning and stochastic approximation theory have been established in [20] and [21]. [22] introduces a new form of stochastic approximation termed SPSA and establishes the conditions under which SPSA converges and becomes asymptotically normally distributed. Given these favorable properties of SPSA, research works have also been conducted to use SPSA to update the weights of NNs [12], [18], [19].

A. Using SPSA to Update the Neurons' Weights

For the case where stochastic approximation is applied to update the weights of an NN in order to find the minimum point of the loss function, the distributed control problem can be redefined as follows.

Distributed Control Problem for Continuously Updated SPSA-NN: Find the set of NN weight parameters w_i* for each LC_i that minimizes the approximated cost function, where the cost function is a function of R_i(x_i(k), w_i(k)).

For clarity and without loss of generality, some of the notations used in Section II can be redefined as follows:

ŵ(k): estimate of w* at stage k for each LC_i;
ĝ(ŵ(k)): estimate of the gradient of the loss function at stage k for each LC_i.

The Robbins-Monro stochastic approximation (RMSA) [23] can then be written as follows:

  ŵ(k+1) = ŵ(k) − a_k ĝ(ŵ(k))    (3)

where a_k is the gain sequence, and it has to satisfy the following well-known convergence conditions:

  a_k > 0;  a_k → 0 as k → ∞;  Σ_{k=1}^{∞} a_k = ∞;  Σ_{k=1}^{∞} a_k² < ∞.    (4)

Based on that, several gradient approximation methods have been developed, including the popular finite-difference approximation method of Kiefer-Wolfowitz [24]. Spall [22] adopted another approach (SPSA), using the idea of simultaneous perturbation to estimate the gradient. The formal proof of convergence of SPSA and the asymptotic normality of ŵ(k) can be found in [22] and will not be presented in this paper. However, some of the expressions of the SPSA algorithm are given as follows to facilitate the understanding of how SPSA can be applied in an NN (which will be discussed in Section V).

Fig. 3. Structure of a LC using the continuously updated SPSA-NN model.

Let Δ(k) = [Δ_1(k), Δ_2(k), …, Δ_p(k)]^T be a vector of p mutually independent mean-zero random variables satisfying the various conditions given in [22] (an example would be a symmetrical Bernoulli distribution) at stage k. Let c_k be a positive scalar. The noisy measurement of the loss function is given as follows:

  y^±(k) = L(ŵ(k) ± c_k Δ(k)) + ε^±(k)    (5)

where c_k Δ(k) is the stochastic perturbation applied to ŵ(k) during stage k, and ε^+(k), ε^−(k) represent measurement noise terms; they must satisfy the following condition, resembling a martingale difference:

  E[ε^+(k) − ε^−(k) | ŵ(k), Δ(k)] = 0  a.s.    (6)

The estimate of the gradient can thus be written as follows:

  ĝ(ŵ(k)) = [ (y^+(k) − y^−(k)) / (2 c_k Δ_1(k)), …, (y^+(k) − y^−(k)) / (2 c_k Δ_p(k)) ]^T    (7)

Defining the error term as

  e(k) = ĝ(ŵ(k)) − E[ĝ(ŵ(k)) | ŵ(k)]    (8)

(3) can be rewritten in a more generalized form

  ŵ(k+1) = ŵ(k) − a_k [g(ŵ(k)) + b(k) + e(k)]    (9)

where b(k) is the bias in ĝ(ŵ(k)). Unbiasedness (or near unbiasedness) holds if the loss function L is sufficiently smooth, the measurement noise satisfies (6), and the conditions for Δ(k) are met.

Strong convergence of (9) to w* has been proven in [22] if five other assumptions are satisfied (refer to [22]). As such, the iterative forms presented in (3) and (9) can be used to model the iterative weight update process in an NN controller.
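The SPSA iteration of (3)-(9) can be sketched as follows on a toy quadratic loss with measurement noise; the gain-sequence constants and the loss are illustrative, not the traffic cost of this paper.

```python
import numpy as np

# Sketch of one SPSA stage: two noisy loss evaluations per stage, symmetric
# Bernoulli perturbations, and standard decaying gain sequences.
rng = np.random.default_rng(1)

def spsa_step(w, loss, k, a=0.1, c=0.1, alpha=0.602, gamma=0.101):
    a_k = a / (k + 1) ** alpha                       # gain sequence, cf. (4)
    c_k = c / (k + 1) ** gamma
    delta = rng.choice([-1.0, 1.0], size=w.shape)    # symmetric Bernoulli
    y_plus = loss(w + c_k * delta)                   # noisy measurements, (5)
    y_minus = loss(w - c_k * delta)
    g_hat = (y_plus - y_minus) / (2.0 * c_k * delta) # gradient estimate, (7)
    return w - a_k * g_hat                           # RMSA update, (3)

# Toy quadratic loss with measurement noise; minimum at w* = [1, -2].
w_star = np.array([1.0, -2.0])
loss = lambda w: float(np.sum((w - w_star) ** 2) + 0.01 * rng.normal())

w = np.zeros(2)
for k in range(2000):
    w = spsa_step(w, loss, k)
```

Note that only two loss evaluations are needed per stage regardless of the dimension of w, which is what makes SPSA attractive for online weight updates.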
B. The Structure of a Local Controller Using Continuously Updated SPSA-NN

As mentioned in the beginning, the continuously updated SPSA-NN model used in this study strives to avoid the two limitations of [12]. The structure of each LC_i is formulated in such a way that each LC_i can ideally be left on its own once it has been implemented. The five-layered fuzzy NN in Fig. 2 is used as the basic building block for each component in LC_i. The structure of a LC_i is shown in Fig. 3.

As can be seen in Fig. 3, the LC_i consists of several decision makers, a single state estimator, as well as a delay estimator (details of which will be given later). The LC_i takes in traffic parameters as its inputs and generates a set of signal plans as the output via a two-stage process. The state estimation stage generates the estimated current state of the traffic network. The number of decision makers used in the decision-making stage will depend on the complexity of the problem. Based on the estimated current state of the traffic network, the appropriate decision maker will be selected to generate a set of signal plans.

The decision-making and learning processes of each LC_i are enabled by the NNs and, at stage k, they are as follows.
1) Weights belonging to the state estimator (SE) are perturbed randomly by c_k Δ(k). This follows the use of a stochastic perturbation as shown in (5).
2) The SE takes in traffic parameters as inputs and estimates the current state of the intersection.

3) Based on the estimated state, a decision maker (DM) is chosen to generate the signal plans for the intersection. The weights for the DM are also perturbed by c_k Δ(k). This follows the use of a stochastic perturbation as shown in (5).
4) The signal plans are implemented into the simulated traffic network.
5) The average delay is calculated for all the vehicles passing through that intersection after the implementation of the new signal plan.
6) If the weights of the SE have not been updated in the previous round, they will be updated based on the scaled difference between the delay during stage k and stage k − 1. The weight update equation is similar to that of (9), and the gradient can be estimated using (7). Otherwise, the average delay will be stored.
7) If the DM is the same as the one chosen in stage k − 1 and its weights have not been updated, then the weights will be updated based on the scaled difference between the delay in stage k and stage k − 1. The weight update equation is similar to that of (9), and the gradient can be estimated using (7). Otherwise, the average delay will be stored.
8) Repeat step 1) for stage k + 1.

The delay estimator shown in Fig. 3 is used for computing the average delay of each intersection during each stage k. The computation of the average delay is based on well-known delay equations [14]. Based on that, the cost or error function of the SE at stage k is as follows:

  Error_SE(k) = average delay at stage k − average delay at stage k − 1    (10)

The cost or error function of the DM at stage k is as follows:

  Error_DM(k) = average delay at stage k − average delay at stage k − p    (11)

where p is a constant, and stage k − p is the stage prior to stage k in which the chosen DM at stage k was previously selected.

V. HYBRID NN MODEL

Unlike the continuously updated SPSA-NN model, the hybrid NN model attempts to solve the distributed control problem directly. Each LC_i is made up of a five-layered fuzzy NN that facilitates its decision-making process. As stated in problem A, the cost function J is a function of s(t), the state of the traffic network at stage t. Similar to the continuously updated SPSA-NN model, the state estimation process is also performed by using an NN, which is called the online reinforcement learning (ORL) module. The structure of each LC_i using the hybrid NN model is as shown in Fig. 4.

Fig. 4. Structure of a LC using hybrid NN model.

In order to overcome some of the possible difficulties (e.g., lack of stochastic exploration, getting stuck in local minima) that are associated with performing online learning for an infinite-horizon control problem, a multistage online learning process has been designed to update the weights and connections of the neurons in the NN in real time. This process primarily consists of three subprocesses: reinforcement learning, weight adjustment, and adjustment of fuzzy relations (or the way neurons in different layers of the fuzzy NN are connected to one another), and it is depicted in Fig. 5.

Reinforcement learning is first performed. The reinforcement obtained from this process is backpropagated to each LC_i, following which each LC_i proceeds to adjust the learning rate for each neuron and activate the forgetting mechanism if necessary (as determined by the value of the reinforcement that the LC_i received). When that is done, each LC_i adjusts the weights of the neurons according to the topological weight update method. Finally, the reinforcement is used to update the fitness value of each neuron in the LC_i's NN. If the fitness values of the neurons fall below some prespecified values, the fuzzy relations (represented by how the outputs of a layer of neurons are connected to the inputs of the next layer of neurons) will be updated using the evolutionary algorithm fuzzy relation generator (EAFRG). Section V-A will describe in detail the various mechanisms of the multistage online learning process.

A. Online Reinforcement Learning (ORL) Process
Various advantages of reinforcement learning have been


highlighted in previous research works [25], [26]. As such,
reinforcement learning is used to develop an unsupervised,
online learning mechanism for the NNs. Fig. 6 shows the
schematic diagram of the ORL module.
The ORL module is designed to generate reinforcement based on a comparison of the current estimated state of the zone with the previous state. The cost function used to generate the reinforcement signal is defined in (12), where one empirically determined parameter is the state change sensitivity constant and another is the best state value. For the reinforcement to be positive, the current state must improve on the previous state. Note that the reinforcement equals zero if the best state has already been achieved in the previous stage; that is, no reinforcement is sent for the current stage in that case.
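The exact form of the cost function in (12) is not reproduced here; as a rough illustration, the reinforcement logic described above (positive when the state improves, zero once the best state was already reached) can be sketched as follows. The function name, the scalar state encoding, and the default sensitivity constant are all assumptions for illustration, not the paper's actual formulation:

```python
def reinforcement(prev_state, curr_state, best_state, sensitivity=1.0):
    """Hypothetical sketch of the ORL reinforcement signal: positive
    when the estimated state moves toward the best state, negative when
    it moves away, and zero once the best state was already achieved in
    the previous stage (cf. the discussion of (12))."""
    if prev_state == best_state:
        return 0.0  # best state already achieved: no reinforcement sent
    # improvement of the current state over the previous one, scaled by
    # the (empirically determined) state change sensitivity constant
    return sensitivity * ((best_state - prev_state) - (best_state - curr_state))
```

Here a larger state value is assumed to be better; with that convention the expression reduces to the sensitivity-scaled change of state between stages.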


Fig. 5. Multistage online learning process.

Fig. 6. ORL module.

Using the backpropagation technique and denoting w_ji as the weight between neuron i and the activated output neuron j, the change of weight Δw_ji is computed as follows:

Δw_ji = η δ_j y_i (13)

where η is the learning rate, δ_j is the gradient (reinforcement received with respect to the weight) for the output neuron j, and y_i is the output of neuron i. Note that

v_j = Σ_{i=1..N} w_ji y_i (14)

δ_j = r_j φ'_j(v_j) (15)

in which N is the number of inputs for neuron j, r_j is the backpropagated reinforcement value at output node j, and φ_j is the transfer function for neuron j. The prime in φ'_j denotes the first-order derivative of φ_j. For the hidden layers of the NN, the local gradient δ_i for a hidden neuron i with M neurons on its right is defined as follows:

δ_i = φ'_i(v_i) Σ_{k=1..M} δ_k w_ki (16)

where

v_i = Σ_{j=1..N} w_ij y_j (17)

in which N is the number of inputs for neuron i. Hence, if a fuzzy relation represented by a neuron is appropriate for a particular traffic condition, a positive reinforcement in the form of a positive δ will be received, and vice versa. Upon receiving a reinforcement, each neuron in the NN of each LC can proceed to adjust its learning rate and weight. The details of this process are described in Section V-B.
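Equations (13)-(17) follow the standard backpropagation pattern. A minimal numerical sketch is given below, assuming tanh as the transfer function φ (an assumption for illustration; the transfer function is not specified here):

```python
import math

def dtanh(v):
    """First-order derivative of the tanh transfer function."""
    return 1.0 - math.tanh(v) ** 2

def output_delta(r, v):
    # (15): delta_j = r_j * phi'(v_j), with tanh assumed as phi
    return r * dtanh(v)

def hidden_delta(v_i, deltas_right, w_right):
    # (16): delta_i = phi'(v_i) * sum_k delta_k * w_ki over the
    # neurons to the right of hidden neuron i
    return dtanh(v_i) * sum(d * w for d, w in zip(deltas_right, w_right))

def weight_change(eta, delta_j, y_i):
    # (13): Delta w_ji = eta * delta_j * y_i
    return eta * delta_j * y_i
```

For example, an output neuron with backpropagated reinforcement 1.0 at activation 0 and input output 0.5 yields a weight change of eta * 0.5.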

Fig. 7. Topological neighborhood h as a function of the lateral distance d.

TABLE I
PARAMETERS FOR EAFRG

B. Learning Rate and Weight Adjustment Process

The weight of each neuron in the NN of each LC can be adjusted dynamically, both by the topological weight update and through the activation of the forgetting mechanism. The learning rate of each neuron can also be adjusted dynamically according to some well-known methods [27], [28]. The following describes each of these techniques in more detail.

Learning Rate Adaptation: According to Fig. 4, the learning rate of each neuron is adjusted first, before the topological weight update. This is done dynamically according to the sign of the backpropagated gradient, following the guidelines in [27] and [28]. As such, denoting w_ji(k) as the weight between neuron i and neuron j at the kth iteration of a stage, the weight update equation for neuron j is

w_ji(k+1) = w_ji(k) + h_{c,j} η_j δ_j y_i (18)

where h_{c,j} is the topological neighborhood of the activated neuron c with respect to neuron j, η_j is the dynamic learning rate of neuron j, y_i is the output of neuron i, and δ_j is the gradient of neuron j.
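The sign-based adjustment of the learning rate can be sketched as follows. This is a delta-bar-delta/Rprop-style rule in the spirit of the guidelines of [27] and [28]; the growth/shrink factors and bounds are illustrative constants, not values taken from the paper:

```python
def adapt_learning_rate(eta, grad, prev_grad, up=1.05, down=0.5,
                        eta_min=1e-4, eta_max=1.0):
    """Sketch of sign-based learning-rate adaptation: grow the rate
    while the backpropagated gradient keeps its sign, cut it sharply
    when the sign flips (a sign flip suggests overshooting)."""
    if grad * prev_grad > 0:
        eta *= up      # consistent direction: accelerate
    elif grad * prev_grad < 0:
        eta *= down    # direction flipped: decelerate
    return min(max(eta, eta_min), eta_max)  # keep the rate bounded
```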
Topological Weight Update: Unlike the conventional backpropagation method, not all neurons have their weights updated during the backward pass. Based on the neurobiological evidence for lateral interaction among a set of excited neurons, a firing neuron tends to excite the neurons within its immediate neighborhood more than those farther away from it. This observation has also been made in research works on self-organizing maps (SOM) [29].

Let h denote the topological neighborhood centered on the winning neuron and encompassing a set of excited neurons, a typical one of which is denoted by j. Let d denote the lateral distance between the winning neuron and the excited neuron j. Fig. 7 depicts h for the winning neuron (taken to be the middle one in this example) as a function of d. Unlike in SOM, the topological neighborhood h is only symmetrical for the neuron in the middle of the layer, due to the nature of the fuzzy reasoning mechanism and the position of each neuron. The function for h shown in Fig. 7 is chosen for convenience; other functions, such as the Gaussian function, can be used instead. It should be noted that the amplitude of h decreases with increasing d. This is a necessary condition for convergence. However, since the learning process of each LC continues indefinitely in the dynamic traffic network, the size of the topological neighborhood does not shrink with time.
Hence, using this concept, only weights belonging to neurons within the topological neighborhood of the winning/activated neuron are updated, and the process of backpropagation can be accelerated. The winning neuron is decided by the center-of-area defuzzification process in the fourth layer and the fuzzy norm operators in the third and second layers.

Fig. 8. Map of the Suntec City area.
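The neighborhood-restricted update can be sketched as follows, using a simple triangular neighborhood function of fixed (non-shrinking) radius over a one-dimensional layer of neurons. The function shape and parameter values are illustrative assumptions, in line with the remark that the choice in Fig. 7 is a matter of convenience:

```python
def neighborhood(d, radius=2):
    """Triangular neighborhood h(d): amplitude decreases with the
    lateral distance d and the radius does not shrink over time."""
    return max(0.0, 1.0 - abs(d) / (radius + 1))

def topological_update(weights, winner, eta, delta, y, radius=2):
    """Update only the weights of neurons within the winner's
    topological neighborhood; neurons outside it are left unchanged,
    which is what accelerates the backward pass."""
    return [w + neighborhood(i - winner, radius) * eta * delta * y
            if abs(i - winner) <= radius else w
            for i, w in enumerate(weights)]
```

The winning neuron receives the full update, its immediate neighbors a scaled-down one, and distant neurons none at all.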
Forgetting Mechanism: A forgetting mechanism is implemented in the fuzzy NNs of each LC, as well as in the ORL module, to affect the weight adjustment process. The principle behind the forgetting mechanism is to enable the decision module to search through the solution space in an explorative manner rather than a purely exploitative manner [30], in order to reduce the number of instances in which the search is trapped in a local minimum. This is similar to the concept of simulated annealing, whose objective is to find the global minimum of a cost function that characterizes large and complex systems. In doing so, simulated annealing proposes that instead of going downhill all the while to favor low-energy-ordered states, it is good to go downhill most of the time; in other words, an uphill search is needed at certain times. Results in [30] have shown that this approach provides a robust framework for reinforcement learning in a changing problem domain, where the improvised algorithm with the forgetting mechanism outperformed the conventional Q-learning approach. For this research, a variation of the forgetting mechanism is used.

Equation (19) shows the additional weight adjustment (besides the one using backpropagation) that is implemented in the fuzzy NNs:

(19)

where the forgetting term is bounded and scaled by a positive constant to be determined empirically. Using this approach, the search for the optimal solution does not get stuck in a local minimum, since the transition out of it is always possible. The forgetting mechanism for each neural weight is activated after a prespecified number of negative reinforcements is received.
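The forgetting mechanism described above can be sketched as follows. The threshold on consecutive negative reinforcements and the forgetting rate are hypothetical placeholders for the empirically determined constants of (19):

```python
class ForgettingWeight:
    """Sketch of the forgetting mechanism: after a prespecified number
    of negative reinforcements, the weight is partially decayed toward
    zero so that the search can escape a local minimum."""

    def __init__(self, w, threshold=3, forget_rate=0.1):
        self.w = w                      # current neural weight
        self.threshold = threshold      # negative reinforcements needed
        self.forget_rate = forget_rate  # fraction of the weight forgotten
        self.neg_count = 0

    def observe(self, reinforcement):
        """Count negative reinforcements and apply forgetting when the
        prespecified threshold is reached."""
        self.neg_count = self.neg_count + 1 if reinforcement < 0 else 0
        if self.neg_count >= self.threshold:
            self.w -= self.forget_rate * self.w  # partial forgetting
            self.neg_count = 0
        return self.w
```

A run of negative reinforcements thus perturbs the weight away from its current value, playing the role of the occasional "uphill" move in simulated annealing.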
Following the learning rate and weight adjustment process, the last stage of the multistage online learning process involves using an evolutionary algorithm to adjust the fuzzy relations according to the fitness values of individual neurons in the fuzzy NN of each LC. Section V-C describes this stage in detail.

Fig. 9. Three-dimensional (3-D) screenshot of the simulated Suntec City area.

C. Evolutionary Algorithm Fuzzy Relations Generator

Fuzzy rules lack precision, and their membership functions need to be updated regularly in order for the rules to remain valid in a dynamically changing problem domain. Invalid rules may even need to be discarded and new rules generated to replace them. Obtaining training data for rule optimization problems can be time consuming, and it may not even be feasible in certain problem domains due to issues of validity and accuracy. In this paper, the fuzzy rule adjustment process using an evolutionary algorithm is performed online throughout the running of the simulation, in order to accommodate possible fluctuations of the system dynamics. The EAFRG is used to generate new fuzzy relations based on the reinforcements received by the agents, thus changing the knowledge representation of each agent as the traffic network evolves. The chromosome used by the EAFRG determines the way neurons in layer two of the fuzzy NN are linked to the implication nodes in layer three.

Obtaining a suitable fitness function to evaluate the generated fuzzy relations is not an easy task, since an ideal fitness function should produce a fuzzy relation that is generally satisfying (i.e., deviates little from well-known guidelines or rules of thumb for the chosen problem domain) as well as contextually valid/eligible (i.e., valid according to the current context of the problem state), so that it can accommodate exceptions in the current problem state to a reasonable extent. As such, the fitness function should take into consideration the current eligibility of the fuzzy relation as well as the degree to which the rule is generally satisfying. The fitness function for the EAFRG is defined as follows:
(20)

where e is the current eligibility of the antecedent-implication relation, g is the measure of whether the relation is generally satisfying, and λ is the sensitivity factor for g (determined empirically). Hence, as can be seen, e and g have a counterbalancing influence on each other. A relation may be generally satisfying, with a high g value, but due to the changing system dynamics it may not be eligible; hence, a low e value will result. Adding them up produces the overall fitness of the relation. The eligibility e is further defined as follows:

(21)

where the eligibility sensitivity factor is determined empirically and the eligibility trace at the current stage is computed as follows:

(22)

where the decay constant is determined empirically, r is the reinforcement, and the activation value is zero (0) if the rule is not activated and one (1) if it is activated. The difference function denotes taking the difference between two chromosome vectors: the first is the current chromosome used by the fuzzy NN, and the second is the chromosome generated by the EAFRG. For this research, the chromosome is defined as follows:

(23)

where each element relates a node in the antecedent (second layer) to a node in the implication (third layer), such that the relation is zero (0) if the two nodes are not linked and one (1) if it denotes a correlation between them.

TABLE II
NUMBER OF PEAK PERIODS FOR DIFFERENT SIMULATION RUNS

For each update to obtain the best chromosome (representing a set of fuzzy relations between layers two and three), the configuration shown in Table I is used. As can be seen, the best 80 members of the previous population are carried forward into the new population before the processes of mutation and crossover repeat. Tests have been carried out to optimize these parameters according to the requirements of the system. Overall, the computational speed of the EAFRG does not hinder the real-time performance of the controller agents, since the chromosome is a binary vector and the population size is relatively small.
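One generation of an elitist evolutionary update of this kind can be sketched as follows. The fitness function, mutation probability, and elite size used here are placeholders; the paper's actual configuration is given in Table I:

```python
import random

def evolve(population, fitness, elite=80, p_mut=0.01, rng=None):
    """Sketch of one EAFRG-style generation over binary chromosomes:
    the best `elite` members are carried forward unchanged, and the
    rest of the new population is produced by single-point crossover
    between elite parents followed by bit-flip mutation."""
    rng = rng or random.Random(0)
    ranked = sorted(population, key=fitness, reverse=True)
    new_pop = ranked[:elite]                      # elitism
    while len(new_pop) < len(population):
        a, b = rng.sample(ranked[:elite], 2)      # pick two elite parents
        cut = rng.randrange(1, len(a))            # single-point crossover
        child = a[:cut] + b[cut:]
        child = [g ^ 1 if rng.random() < p_mut else g for g in child]
        new_pop.append(child)
    return new_pop
```

Because the elite members survive intact, the best fitness in the population can never decrease from one generation to the next, and the binary encoding keeps each generation cheap to compute.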
VI. EXPERIMENTS ON TRAFFIC SIGNAL CONTROL MODELS
A. Modeling the Traffic Network
The traffic network used for this paper is modeled after the
Suntec City area, which is a section of the Central Business
District (CBD) of Singapore. This is represented realistically by
PARAMICS Modeler using a total of 330 links and 130 nodes. The network's traffic operations are simulated using version 4.0 of PARAMICS [31]. The necessary data for simulation, such as traffic flow at different times of the day for several days of the week, the number of approaches at each intersection, etc., are obtained from the Land Transport Authority of Singapore (LTA).
The total number of signalized intersections in the simulated traffic network is 25. As such, 25 local controllers are
implemented to control the traffic signals of each intersection.
Given that PARAMICS is a microscopic simulator (i.e., it has the ability to model individual vehicles), it is able to accurately depict various traffic behaviors such as congestion formation and dispersion. Fig. 8 shows the map of the Suntec City area
and dispersion. Fig. 8 shows the map of the Suntec City area
and the 25 intersections which are being controlled by the local
controllers.
A three-dimensional (3-D) screenshot of the simulated
Suntec City area taken from PARAMICS Modeler is shown in
Fig. 9.
Three types of simulations are used to evaluate the performance of the different NN-based local controllers, namely the typical scenario with a morning peak (3 h), the typical scenario with morning and evening peaks (24 h), and the extreme scenario with multiple peaks (24 h). The typical scenario with a single morning peak is used to test the response of the two NN control models before they are used for the infinite horizon problem. The typical scenario containing morning and evening peaks and lasting 24 h is essentially a representation of the normal daily traffic situation in the CBD area. The extreme scenario with multiple peaks lasting 24 h simulates a fictitious situation in which numerous peak periods are cascaded closely together within a 24-h period. This greatly increases the level of difficulty of the dynamic problem, given that the local controllers now have to deal with more frequent fluctuations of traffic volume. As such, it can be said that the 24-h (extreme) simulation is an approximation of an infinite horizon problem. The number of peak periods for the three types of simulation runs is shown in Table II.
Traffic volume increases substantially during these peak periods. Even though the increase in demand during the different peak periods is largely similar, the number of vehicles actually released into the traffic network varies according to the random seeds that are set before the simulations. Given this, and the fact that PARAMICS is able to model various characteristics of vehicles on an individual basis (such as gap acceptance, lane changing, driver aggression, car following, etc.), the outcome of each simulation run varies with the use of different random seeds. Figs. 10-12 show how the current number of vehicles in the traffic network typically changes for the three different types of simulations. Note that a grey shaded region denotes a single peak period.
B. Interaction Between the Local Controllers and the Traffic
Network
Inductive loop detectors are coded in the simulated traffic
network at stop lines of the intersection approaches, similar
to the real-world installations (refer to Fig. 13). Using the
PARAMICS application programming interface (API), the
following three traffic parameters are extracted in real time
from the loop detectors:
1) lane occupancy;
2) flow;
3) rate of change of flow.
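The extraction of these three parameters from raw detector samples can be sketched as follows. The function and its inputs (vehicle counts and occupied time per polling interval) are hypothetical, since the paper obtains the parameters directly through the PARAMICS API:

```python
def detector_parameters(counts, occupied_s, interval_s=10.0):
    """Hypothetical post-processing of loop-detector samples over
    successive polling intervals (the controllers poll every 10 s of
    simulation time). Returns lane occupancy, flow, and the rate of
    change of flow for the latest interval."""
    occupancy = occupied_s[-1] / interval_s   # fraction of time occupied
    flow = counts[-1] / interval_s            # vehicles per second
    prev_flow = counts[-2] / interval_s if len(counts) > 1 else flow
    dflow = (flow - prev_flow) / interval_s   # rate of change of flow
    return occupancy, flow, dflow
```

For instance, 10 vehicles counted in the latest 10-s interval with the loop occupied for 4 s gives an occupancy of 0.4 and a flow of 1 vehicle/s.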
These traffic parameters are then used as the inputs to the local controllers. Subsequently, the local controllers generate the signal plans for their respective intersections, and these plans are implemented in the simulated traffic network using the PARAMICS API. The overall interaction diagram between the local controllers and the simulated traffic network is shown in Fig. 14.

Fig. 10. Current number of vehicles for typical scenario with morning peak (3 h).

Fig. 11. Current number of vehicles for typical scenario with morning and evening peaks (24 h).

Fig. 12. Current number of vehicles for extreme scenario with multiple peaks (24 h).

As shown in Fig. 14, the signal plans generated by each local controller are first deposited in the signal plan repository before they are decoded by the signal plan interpreter and implemented in the traffic network. Each local controller has eight different types of signal plans with which it can control the traffic signals at its intersection. These eight types of signal plans are designed to cater for different amounts of traffic loading at the intersection, and it is up to each local controller to choose an appropriate signal plan based on its own perception of the traffic loading at its intersection.

Fig. 13. Screenshot of the installation of inductive loop detectors in the simulated traffic network (right-hand drive).

Fig. 14. Overall interaction diagram of the local controllers and the traffic network.
The 25 local controllers are coded entirely using Java multithreading technology. The structure of each individual local controller depends on the NN model that is being used. As mentioned in Sections IV and V, the structure of the local controller for the continuously updated SPSA-NN model is shown in Fig. 3, and the structure of the local controller for the hybrid NN model is shown in Fig. 4.

The sampling rate of the local controllers can be coordinated in order to ensure that the agents make timely responses to the dynamically changing traffic network. For this study, the local controllers are tuned to sample the traffic network for the traffic parameters once every 10 s (simulation time).

TABLE III
TOTAL MEAN DELAY FOR TYPICAL SCENARIO WITH MORNING PEAK (3 h)
C. Using GLIDE for Benchmarking
It is difficult to find a good benchmark for this large-scale
traffic signal control problem given the following factors.
1) Existing algorithms or control methodologies have been developed for controlling the traffic networks of other cities [11]-[13] with different traffic patterns; hence, the results obtained from those works cannot be applied directly to this problem.
2) Some of the existing algorithms [11], [13] are developed
for simplified scenarios and, hence, they are not suitable
for benchmarking.
3) Commercial traffic signal control programs which are
known to have worked well are not easily available due to
proprietary reasons.
In all the experiments for this research, the signal settings used for benchmarking are derived from the actual signal plans implemented by LTA's GLIDE traffic signal control system. GLIDE is the local name of the Sydney Coordinated Adaptive Traffic System (SCATS), one of the state-of-the-art adaptive traffic signal control systems [32], currently used in over 70 urban traffic centers in 15 countries worldwide. As such, for simulation scenarios without the local NN-based controllers, the signal plans selected and executed by GLIDE are implemented in the traffic network at the respective intersections as the traffic loading at each intersection changes with time. The traffic loading is derived from GLIDE's traffic counts from the loop detectors.
VII. RESULTS
The results for the three types of simulations are presented as
follows.
A. Typical Scenario With Morning Peak (3 h)
For the typical scenario with morning peak (3 h), six separate simulation runs using different random seeds were carried out for each control technique (continuously updated SPSA-NN, hybrid NN, and GLIDE). The variances of the outcomes of the simulations are small; hence, the average values are taken to be a reasonable representation of a typical outcome. Table III shows the total mean delay at the end of the first peak period (averaged over six separate runs for each technique) for the three different control techniques (where hybrid NN represents the hybrid multiagent system, and SPSA-NN represents the continuously updated SPSA-NN).

Table III shows that the continuously updated SPSA-NN-based controllers as well as the hybrid NN controllers outperform the GLIDE signal plans at the end of the first peak period (at 0900 h), which is essentially near the end of the simulation run. The hybrid NN controllers achieved the smallest total mean delay for all vehicles, followed by the continuously updated SPSA-NN-based controllers. Compared to GLIDE, the reduction in total mean delay when using the hybrid NN is approximately 23%. Using the continuously updated SPSA-NN, the reduction in total mean delay compared to GLIDE is around 7%. These results show that the two NN models can be applied to control the traffic signals of a complex traffic network for a short period of time. Hence, the simulation time can be extended to 24 h to further evaluate their performances.
B. Typical Scenario With Morning and Evening Peaks (24 h)
For the typical scenario with morning and evening peaks
(24 h), six separate simulation runs using different random
seeds were carried out for each control technique (continuously updated SPSA-NN, hybrid NN, and GLIDE). Again,
the average values of the six simulation runs are taken into
consideration when evaluating the performance of each control technique. Table IV shows the total mean delay at the
end of selective time periods (average over six separate runs
for each technique) for the three different control techniques
(where hybrid NN represents the hybrid multiagent system, and
SPSA-NN represents the continuously updated SPSA-NN).
For the typical scenario with morning and evening peaks (24 h), it can be seen that the hybrid NN controllers have the best performance among the three techniques used. Compared to GLIDE, the hybrid NN controllers managed to reduce the total mean delay by approximately 50% at the end of the simulation. Using the continuously updated SPSA-NN, the total mean delay is reduced by approximately 30% compared to GLIDE. It can also be seen that the value of the total mean delay fluctuates the least for the hybrid neural controllers. Also, all three control techniques improve with time, which implies that they manage to maintain a congestion-free traffic network for this typical 24-h scenario. This can be further verified in Fig. 15, which shows the current mean speed of all vehicles in the traffic network.

As shown in Fig. 15, the current mean speed of all vehicles is restored to the initial level of between 25-40 mph at the end of the simulations. This shows again that all three techniques are capable of maintaining a congestion-free traffic network for the typical scenario with morning and evening peaks lasting 24 h. Also, it should be noted that with the hybrid NN controllers, there is less fluctuation in the current mean speed of all vehicles during the peak periods compared to the other two techniques.


TABLE IV
TOTAL MEAN DELAY FOR THE TYPICAL SCENARIO WITH MORNING AND EVENING PEAKS (24 h)

Fig. 15. Current vehicle mean speed for typical scenario with morning and evening peaks (24 h).

C. Online Selection of Signal Plans by the Two Models

The results in Sections VII-A and VII-B are analyzed based on the two performance measures. Besides those results, it is interesting to look into the way in which the control techniques select the signal plans in real time, and into the issue of convergence (to a set of optimal signal plans). A more in-depth analysis of the choice of signal plans by the local controllers, and of the possibility of them converging to sets of optimal signal plans, can be done by plotting a 3-D graph in which the following holds.
1) One axis of the graph represents the eight different types of signal plans that can be chosen by the local controllers (numbered from 1 to 8).
2) Another axis represents the identification numbers of the 25 local controllers (numbered from 1 to 25).
3) The third axis represents the number of stages within a single simulation run; this is essentially the number of times the agents take in new inputs from the traffic network and generate a new set of signal plans for their intersections.
Using the continuously updated SPSA-NN-based controllers, a certain trend indicating the possibility of convergence can be observed in a number of simulation runs. The choice of signal plans by the 25 continuously updated SPSA-NN-based controllers for a particular successful six-hour simulation run (where the traffic network at the end of the six hours is free of congestion) can be seen in Fig. 16.

Fig. 16. Signal plans selected by the continuously updated SPSA-NN-based local controllers for a particular simulation run.

TABLE V
TOTAL MEAN DELAY FOR THE EXTREME SCENARIO WITH MULTIPLE PEAKS (24 h)
As shown in Fig. 16, the set of signal plans selected by the continuously updated SPSA-NN-based controllers over the last 50 rounds of the simulation run is largely similar. This implies that some form of convergence to a set of signal plans has been achieved for that successful simulation run. Given the short duration of the simulation, it is not known whether this set of signal plans is indeed the optimal set.

In contrast, the signal plan selections by the hybrid NN and GLIDE are not found to achieve any form of convergence for the various simulation scenarios. Hence, the 3-D plots are not presented for their cases.
D. Extreme Scenario With Multiple Peaks (24 h)
For the extreme scenario with multiple peaks (24 h), five separate runs were carried out using different random seeds for each of the three control techniques. It has been observed that the variances of all the simulation runs performed for a single control technique are small; hence, taking the mean of the values gives a good representation of a typical outcome for that particular control technique. For this scenario, two performance measures are shown. Table V shows the total mean delay at the end of selective time periods (average over five separate runs for each technique) for the three different control techniques, and Table VI shows the corresponding total mean stoppage time. Note that the simulation run for the extreme scenario ends after the eighth peak period (as shown in Fig. 12).
From Tables V and VI, it can be seen that the hybrid NN controllers achieve the best performance for the extreme scenario with multiple peaks lasting 24 h. Using the GLIDE signal plans, the total mean delay and the total mean stoppage time of the vehicles increase steadily after the fifth peak period. The continuously updated SPSA-NN-based controllers obtain a better performance than the GLIDE signal plans, despite the fact that the total mean delay and total mean stoppage time increase steadily after the sixth peak period. In comparison, the total mean delay and total mean stoppage time increase at a significantly slower rate when the hybrid NN controllers are used. Overall, at the


TABLE VI
TOTAL MEAN STOPPAGE TIME FOR THE EXTREME SCENARIO WITH MULTIPLE PEAKS (24 h)

Fig. 17. Traffic network controlled by continuously updated SPSA-NN-based controllers after 24 h (extreme scenario).

end of the simulation runs, the total mean delay is reduced by approximately 78% when using the hybrid NN model compared to GLIDE. The total mean stoppage time per vehicle is also reduced by approximately 84%.
In addition, it should be noted that the steady increase of the total mean delay for GLIDE and the continuously updated SPSA-NN, as well as the slight increase in total mean delay for the hybrid NN, is essentially due to the large number of vehicles injected during the multiple peak periods of the extreme scenario. The frequency at which the traffic volume fluctuates is much higher in the fictitious extreme scenario (as shown in Fig. 12) than in the typical 24-h scenario with only one morning peak and one evening peak (refer to Fig. 11). As such, all three techniques have to adapt quickly to cope with the frequent fluctuations in traffic volume. Given this difficulty, as well as the existence of other constraints such as the actual capacity of the traffic network, it is largely expected that the total mean delay will increase for the extreme scenario.

Finally, in order to better illustrate the conditions of the traffic network at the end of the extreme scenario, two two-dimensional (2-D) screenshots (refer to Figs. 17 and 18) of the traffic network were taken from PARAMICS Modeler for the case where the traffic network is controlled by the hybrid NN and where it is controlled by the continuously updated SPSA-NN-based controllers. The two screenshots were captured at 0030 h (at the end of the 24 h for the extreme scenario).

The PARAMICS modeling environment is preset to denote 13 stopped or queued vehicles with a hotspot or red circle. As can be seen from Fig. 17, the traffic network evolves into a pathological, oversaturated state at 0030 h when using the continuously updated SPSA-NN-based controllers. This is likely to be the result of the steady increase of total mean delay and total mean stoppage time for the vehicles after the sixth peak period. The number of congested links is well over thirty. Using the hybrid NN controllers, congestion is confined to the upper section of the traffic network and the


Fig. 18. Traffic network controlled by hybrid NN controllers after 24 h (extreme scenario).

TABLE VII
PERCENTAGE REDUCTION IN TOTAL MEAN DELAY COMPARED TO GLIDE

number of congested links is reduced to less than ten (as shown in Fig. 18).
VIII. DISCUSSIONS
A. Continuous Online Learning Capabilities of the NN Models
Recall the approximate version of the infinite horizon problem using the receding-horizon, limited-memory concept mentioned in Section II-C. For this problem, it is very important that the neural control functions produce good approximations of the optimal control function each time the distributed control problem is restarted. Hence, the control algorithm needs to have a high level of adaptability, given that the optimal control functions may change each time the distributed control problem is restarted. This implies that the learning period of the neural controllers has to vary from time to time depending on the nature of the dynamic problem.
The overall results for the different types of simulations are shown in Table VII (where hybrid NN represents the hybrid multiagent system, and SPSA-NN represents the continuously updated SPSA-NN).
As shown in Table VII, the hybrid NN controllers can control the traffic signals effectively in a dynamically changing environment even as the complexity of the simulation increases (in the form of extending the duration of the simulation runs and adding more peak periods). This indicates that the hybrid NN controllers are able to adjust their weight parameters effectively throughout the duration of the simulation, so that the signal plans they generate can accommodate the periodic as well as random fluctuations of traffic volumes.
The continuously updated SPSA-NN-based controllers perform very well for the typical scenario with a single morning peak (3 h) as well as for the typical scenario with morning and evening peaks (24 h). Some form of convergence is even observed in a number of successful three-hour simulation runs. However, for the five separate simulation runs of the extreme scenario with multiple peaks (24 h), the continuously updated SPSA-NN-based controllers are unable to accommodate the changing dynamics (changes in traffic volumes at certain hours)
of the traffic network. On the average, the state of the traffic

1530

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 17, NO. 6, NOVEMBER 2006

Fig. 19. Effect of different values of F on the total mean delay.

network begins to degenerate after the sixth peak period and


the traffic network becomes overcongested at the end of the
simulation runs.
The differences in performance between the hybrid NN controllers and the continuously updated SPSA-NN-based controllers may be due to their individual weight update algorithms. For the SPSA-NN-based controllers, the weight update algorithm follows the form of (9), and the gain sequence (or learning rate) has to satisfy the classical stochastic approximation conditions defined in (4). Choosing an appropriate form for the gain sequence is not trivial, as it affects the long-run performance of the NN: the SPSA algorithm converges under the conditions stated in [12], and when SPSA is applied to adjust the neural weights, this convergence property may result in premature convergence of the weights.
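To illustrate the role of the gain sequence, here is a minimal SPSA weight-update sketch in the standard form of [22] (the decay constants are typical illustrative choices, not the paper's actual settings in (9)):

```python
import numpy as np

def spsa_step(w, loss, k, a=0.1, c=0.1, A=10.0, alpha=0.602, gamma=0.101):
    """One SPSA iteration: two loss evaluations along one random
    simultaneous perturbation estimate the gradient in every
    dimension, followed by a gradient-descent step."""
    a_k = a / (k + 1 + A) ** alpha   # decaying gain sequence, a_k -> 0
    c_k = c / (k + 1) ** gamma       # decaying perturbation size, c_k -> 0
    delta = np.where(np.random.rand(*w.shape) < 0.5, -1.0, 1.0)  # Bernoulli +/-1
    g_hat = (loss(w + c_k * delta) - loss(w - c_k * delta)) / (2.0 * c_k * delta)
    return w - a_k * g_hat

# Toy usage: minimize a static quadratic. Because a_k decays to zero,
# the weights eventually freeze -- fine for a fixed optimum, but a
# liability for an online controller whose optimum keeps drifting
# (the premature-convergence risk noted above).
np.random.seed(0)
w = np.array([5.0, -3.0])
loss = lambda v: float(np.sum(v ** 2))
for k in range(2000):
    w = spsa_step(w, loss, k)
```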
For the hybrid NN controllers, the weight update algorithm proceeds in several stages, each involving tuning of the weight parameters, the learning rate, and the neural connections in response to changes in the environment. The weights of the NN are not updated by the reinforcement learning scheme if (12) is equal to zero (one such scenario is mentioned in Section V). The weights are also not updated if the forgetting mechanism is not activated (i.e., if the number of negative reinforcements received does not exceed a prespecified value). Hence, once the hybrid NN becomes a good approximation of the optimal control function, the neural weights are updated by a smaller amount (or possibly not at all). If, on the other hand, external random dynamics produce a significant change in the existing dynamics of the environment (e.g., random injection of a large number of vehicles into the traffic network), negative reinforcements are received, since the existing control functions are no longer good approximations of the optimal control function, and weight updates take place again.
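The update gating described above can be sketched as follows (schematic only: the reward signal, the negative-reinforcement threshold, and the forgetting factor are placeholders rather than the paper's (12) and (19)):

```python
def gated_update(weights, reward, lr, neg_count, neg_threshold=5, forget=0.9):
    """Apply a reinforcement-driven weight update only when warranted.

    - reward == 0: the controller is near the optimum, so skip the update.
    - reward < 0 repeatedly: after more than neg_threshold negative
      reinforcements, the forgetting mechanism shrinks stale weights so
      they can be relearned for the changed environment.
    """
    if reward == 0:
        return weights, neg_count              # no update needed
    if reward < 0:
        neg_count += 1
        if neg_count > neg_threshold:          # forgetting mechanism fires
            weights = [forget * w for w in weights]
            neg_count = 0
    else:
        neg_count = 0                          # positive feedback resets the count
    weights = [w + lr * reward for w in weights]   # reinforcement update
    return weights, neg_count
```

The effect is the behavior described in the text: near the optimum the weights drift little or not at all, while a sustained run of negative reinforcements reactivates learning.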
B. Difficulty in Obtaining the Optimal Set of Control Parameters
Not all parameters of the NN controllers can be adjusted online for effective real-time control; some have to be tuned offline before the neural controllers are implemented in the traffic network. In this respect, the continuously updated SPSA-NN has fewer control parameters to tune than the hybrid NN model, owing to its simpler weight update algorithm. The tuning of these parameters is done empirically, given the difficulty of doing it analytically. For example, Fig. 19 shows how the performance of the hybrid NN (measured in terms of the total mean delay of the vehicles) varies with different values of the forgetting term F of (19). The value of F can thus be chosen based on the results shown in the graph. Once these parameters are set offline, they are not adjusted further during the simulations.
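A minimal sketch of such an empirical offline sweep, assuming a convex delay-versus-F curve as a cheap stand-in for full simulation runs (the candidate values and the `mean_delay` function are hypothetical):

```python
def tune_offline(candidates, evaluate):
    """Pick the parameter value minimizing a measured cost
    (e.g., total mean delay), evaluating each candidate once."""
    results = {f: evaluate(f) for f in candidates}
    best = min(results, key=results.get)
    return best, results

# Hypothetical stand-in for one simulation run per candidate:
# delay as a convex function of F with its minimum near 0.3.
mean_delay = lambda f: (f - 0.3) ** 2 + 120.0
best_f, sweep = tune_offline([0.1, 0.2, 0.3, 0.4, 0.5], mean_delay)
```

In the paper's setting each `evaluate` call would be a full PARAMICS run, which is why this tuning is done once offline rather than online.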
IX. CONCLUSION
A new hybrid NN model has been successfully developed to solve the infinite horizon distributed control problem. The model involves a novel multistage online learning process that incorporates techniques such as reinforcement learning and evolutionary algorithms. An approximated version of the infinite horizon distributed control problem is implemented in the form of distributed traffic signal control of a large-scale traffic network. For this problem, the NN-based local controllers have to learn continuously for an indefinite amount of time after they are implemented in the system. Real-world traffic data used for modeling the traffic network was obtained from LTA, and the traffic network model was built using the PARAMICS modeler. In the experiments, the traffic signals in the traffic network were controlled by three different control techniques (the hybrid NN model, the continuously updated SPSA-NN model, and GLIDE) in separate simulation runs. Results from the experiments showed that the hybrid NN controllers and the continuously updated SPSA-NN-based controllers achieved an overall better performance than GLIDE for the 3- and 24-h (typical) simulation runs. For the extreme scenario with multiple peaks (24 h), experimental results showed that the hybrid NN controllers outperform the continuously updated SPSA-NN-based controllers as well as GLIDE. From the tables of results and the screenshots of the traffic network, it can be inferred that the hybrid NN controllers provide effective control of the large-scale traffic network even as the complexity of the simulation increases substantially. This research has extended the application

of NN and other related computational intelligence techniques to a large-scale real-world application. For such applications, the learning period often varies according to the changing nature of the problem and, thus, the concept of effective continuous learning is of utmost importance. This is especially true given the undesirability of having to retune the controllers from time to time. This research has also shown that the hybrid NN can be used for other applications that are similar to the infinite horizon distributed control problem.
ACKNOWLEDGMENT
The authors would like to thank the Land Transport Authority of Singapore (LTA) for providing the traffic data necessary for modeling the traffic network and traffic flow of the Central Business District (CBD) of Singapore.
REFERENCES
[1] A. G. Barto and S. Mahadevan, "Recent advances in hierarchical reinforcement learning," Discrete Event Dyn. Syst. (Special Issue on Reinforcement Learning), vol. 13, pp. 41–77, 2003.
[2] M. Kearns and S. Singh, "Near-optimal reinforcement learning in polynomial time," in Proc. Int. Conf. Mach. Learn., 1999, pp. 260–268.
[3] M. Littman and C. Szepesvari, "A generalized reinforcement learning model: Convergence and applications," in Proc. 13th Int. Conf. Mach. Learn., 1996, pp. 310–318.
[4] R. E. Parr, "Hierarchical control and learning for Markov decision processes," Ph.D. dissertation, Univ. California Berkeley, Berkeley, CA, 1998.
[5] G. Rummery and M. Niranjan, "Online Q-learning using connectionist systems," Cambridge Univ. Eng. Dept., Tech. Rep. CUED/F-INFENG/TR 166, 1994.
[6] I. Johnson and M. D. Plumbley, "On-line connectionist Q-learning produces unreliable performance with a synonym finding task," in Proc. Int. Joint Conf. Neural Netw., Jul. 2000, pp. 24–27.
[7] R. Jaksa, P. Majernik, and P. Sincak, "Reinforcement learning based on back propagation for mobile robot navigation," Computational Intelligence Group, Dept. Cybern. Artif. Intell., Technical Univ. Kosice, Kosice, Slovakia, 2000.
[8] S. Chiu and S. Chand, "Self-organizing traffic control via fuzzy logic," in Proc. 32nd IEEE Conf. Decision Control, 1993, pp. 1897–1902.
[9] G. Nakamiti and F. Gomide, "Fuzzy sets in distributed traffic control," in Proc. 5th IEEE Int. Conf. Fuzzy Syst., 1996, pp. 1617–1623.
[10] S. Mikami and Y. Kakazu, "Genetic reinforcement learning for cooperative traffic signal control," in Proc. 1st IEEE Conf. Evol. Comput., 1994, vol. 1, pp. 223–228.
[11] W. Wei and Y. Zhang, "FL-FN based traffic signal control," in Proc. IEEE Int. Conf. Fuzzy Syst., May 2002, vol. 1, pp. 296–300.
[12] J. C. Spall and D. C. Chin, "Traffic-responsive signal timing for system-wide traffic control," Transp. Res. C, vol. 5, no. 3/4, pp. 153–163, 1997.
[13] E. Bingham, "Reinforcement learning in neural fuzzy traffic signal control," Eur. J. Oper. Res., vol. 131, no. 2, pp. 232–241, 2001.
[14] N. J. Garber and L. A. Hoel, Traffic and Highway Engineering, 2nd ed. Boston, MA: PWS-Kent, 1997, pp. 281–329.
[15] M. Baglietto, T. Parisini, and R. Zoppoli, "Distributed-information neural control: The case of dynamic routing in traffic networks," IEEE Trans. Neural Netw., vol. 12, no. 3, pp. 485–502, May 2001.
[16] T. Parisini, M. Sanguineti, and R. Zoppoli, "Nonlinear stabilization by receding-horizon neural regulators," Int. J. Control, vol. 70, no. 3, pp. 341–362, 1998.
[17] J. J. Buckley and U. Hayashi, "Hybrid fuzzy neural nets are universal approximators," in Proc. 3rd IEEE Conf. Fuzzy Syst., IEEE World Congr. Comput. Intell., Jun. 1994, vol. 1, pp. 238–243.
[18] A. V. Wouwer, C. Renotte, and M. Remy, "On the use of simultaneous perturbation stochastic approximation for neural network training," in Proc. Amer. Control Conf., 1999, pp. 388–392.
[19] E. Gomez-Ramirez, P. L. Najim, and E. Ikonen, "Stochastic learning control for nonlinear systems," in Proc. Int. Joint Conf. Neural Netw. (IJCNN'02), May 2002, vol. 1, pp. 171–176.
[20] T. Jaakkola, M. I. Jordan, and S. P. Singh, "On the convergence of stochastic iterative dynamic programming algorithms," Neural Comput., vol. 6, no. 6, pp. 1185–1201, 1994.
[21] M. Littman and C. Szepesvari, "A generalized reinforcement learning model: Convergence and applications," in Proc. 13th Int. Conf. Mach. Learn., 1996, pp. 310–318.
[22] J. C. Spall, "Multivariate stochastic approximation using a simultaneous perturbation gradient approximation," IEEE Trans. Autom. Control, vol. 37, no. 3, pp. 332–341, Mar. 1992.
[23] H. Robbins and S. Monro, "A stochastic approximation method," Ann. Math. Statist., vol. 22, pp. 400–407, 1951.
[24] J. Kiefer and J. Wolfowitz, "Stochastic estimation of the maximum of a regression function," Ann. Math. Statist., vol. 23, pp. 462–466, 1952.
[25] R. Jaksa, P. Majernik, and P. Sincak, "Reinforcement learning based on back propagation for mobile robot navigation," in Proc. Comput. Intell. Modeling, Control, Autom. (CIMCA), Vienna, Austria, 1999.
[26] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[27] R. A. Jacobs, "Increased rates of convergence through learning rate adaptation," Neural Netw., vol. 1, pp. 295–307, 1988.
[28] Z. Luo, "On the convergence of the LMS algorithm with adaptive learning rate for linear feedforward networks," Neural Comput., vol. 3, pp. 226–245, 1991.
[29] T. Kohonen, Self-Organizing Maps, 2nd ed. Berlin, Germany: Springer-Verlag, 1997.
[30] G. Yan, F. Yang, T. Hickey, and M. Goldstein, "Coordination of exploration and exploitation in a dynamic environment," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), 2001, pp. 1014–1018.
[31] Quadstone, PARAMICS Modeller v4.0 User Guide and Reference Manual. Edinburgh, U.K.: Quadstone Ltd., 2002.
[32] P. B. Wolshon and W. C. Taylor, "Analysis of intersection delay under real-time adaptive signal control," Transp. Res. C, Emerging Technol., vol. 7C, no. 1, pp. 53–72, Feb. 1999.

Min Chee Choy is currently working towards the Ph.D. degree at the National University of Singapore (NUS), Singapore.
His research interests include distributed object computing, hybrid computational intelligence techniques, online reinforcement learning, and intelligent transportation systems.

Dipti Srinivasan (M'89–SM'02) received the Ph.D. degree in engineering from the National University of Singapore (NUS), Singapore.
She worked as a Postdoctoral Researcher at the University of California, Berkeley, from 1994 to 1995, before joining NUS, where she is an Associate Professor in the Department of Electrical and Computer Engineering. Her research interest is in the application of soft computing techniques to engineering optimization and control problems.

Ruey Long Cheu received the B.Eng. and M.Eng. degrees from the National University of Singapore (NUS), Singapore, and the Ph.D. degree from the University of California, Irvine.
He is currently an Associate Professor in the Department of Civil Engineering, NUS. His research interests are in intelligent transportation systems, with emphasis on the applications of artificial intelligence and emerging computing techniques in transportation.
