IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS, VOL. 32, NO. 6, DECEMBER 2002

Colonies of Learning Automata


Katja Verbeeck and Ann Nowé

Abstract—Originally, learning automata (LAs) were introduced to describe human behavior from both a biological and psychological point of view. In this paper, we show that a set of interconnected LAs is also able to describe the behavior of an ant colony, capable of finding the shortest path from its nest to food sources and back. The field of ant colony optimization (ACO) models ant colony behavior using artificial ant algorithms. These algorithms find their application in a whole range of optimization problems and have experimentally been shown to work very well. It turns out that a known model of interconnected LAs, used to control Markovian decision problems (MDPs) in a decentralized fashion, matches perfectly with these ant algorithms.

The field of LAs can thus both contribute to the understanding of why ant algorithms work so well and become an important theoretical tool for learning in multiagent systems (MAS) in general. To illustrate this, we give an example of how LAs can be used directly in common Markov game problems.

Index Terms—Ant colony optimization (ACO), interconnected learning automata, multiagent systems (MAS).

I. LEARNING IN MULTIAGENT SYSTEMS (MAS) AND ANTS

AGENT-BASED programming and multiagent systems (MAS) are becoming more and more important in existing software technology. An agent is a computational entity which can observe some parts of the environment and can make decisions and take actions, with some degree of autonomy, based on these observations and its own knowledge. A MAS is a system where several of those agents act in the same environment to accomplish some task. Important examples of MAS include real-time monitoring and management of telecommunication networks, optimization of industrial manufacturing and production processes, improving the flow of urban or air traffic, etc. A good overview of this relatively young field of MAS can be found in [1].
The need for learning and adaptation in a MAS is principally due to the fact that the environment an agent experiences changes over time. The difficulty here lies not only in the fact that an agent is exposed to external environment changes (e.g., load changes in a telecommunication network setting), but also in the decisions taken by other agents, with which the agent might have to cooperate, communicate, or compete. One cannot always predict in advance how many other agents will be in the environment, what their strategy will be, or even what their goals are. Therefore, for the agents to act optimally in a MAS, they must be able to adapt to the changing demands of the dynamic environment.

Manuscript received November 15, 2001; revised March 1, 2002. This paper
was recommended by Guest Editors M. S. Obaidat, G. I. Papadimitriou, and A.
S. Pomportsis.
The authors are with the Computational Modeling Lab (COMO), Vrije Universiteit Brussels, Brussels 1050, Belgium (e-mail: kaverbee@vub.ac.be; asnowe@info.vub.ac.be).
Publisher Item Identifier S 1083-4419(02)06466-X.

The field of reinforcement learning (RL) [2] has already established a profound theoretical framework for learning as a centralized and isolated process occurring in intelligent stand-alone systems. However, extending this to MAS is not straightforward. RL guarantees convergence to the optimal strategy as long as the agent can experiment enough and the environment in which it is operating has the Markov property. This means that the optimal action in a certain state is independent of the previous states or actions taken. As soon as the environment becomes nonstationary,1 however, which is the case for most MAS, the Markov property and therefore the guarantees of convergence and optimality are lost. Most of the applications of RL for such environments found so far ignore the nonstationarity aspect. In some cases, this leads to unstable policies, and stabilizing features have to be added (e.g., [3], [4]).
The collective behavior of LAs is one of the first examples of multiagent RL that has been studied. A learning automaton (LA) describes the internal state of an agent as a probability distribution according to which actions are chosen. These probabilities are adjusted according to the success or failure of the chosen actions. This form of RL, which also has its roots in psychology, can be viewed as hill-climbing in probability space. The work of Narendra and Thathachar [5] analytically treats learning not only in the single-automaton case, but also in the case of hierarchies of automata and distributed interconnected automata interacting in complex environments. Automata games were introduced to see whether automata could be interconnected in useful ways and could work in a decentralized fashion, so as to exhibit a group behavior that is attractive for either modeling or controlling complex systems. For instance, an interconnected model of LAs is capable of controlling a Markovian decision problem (MDP) in a decentralized fashion [6]. Reference [5] also proposes mathematical models of LAs in nonstationary environments.
Recently, an interesting type of MAS was introduced by the
field of ant colony optimization (ACO). ACO studies artificial
systems, used for discrete optimization problems, that are inspired by the behavior of real ant colonies. The first such system, introduced by Dorigo, was called the ant system (AS) [7]. AS tried to find good solutions for the traveling salesman problem (TSP). It turned out to be a good prototype for a number of other ant algorithms which have found many interesting and successful applications. Problems like sequential ordering, quadratic assignment, partitioning, graph coloring, and routing in communication networks have already been addressed successfully. A meta-heuristic unifying the existing ant algorithms has been defined. A good overview of the state of the art in the field is given in [8].

1 This means that the probabilities of making state transitions or receiving some reinforcement signals from the environment change over time. This is obviously fulfilled for MAS, given the dependency on the behavior of other agents.
The main observation on which ACO is based is that real ants are capable of finding the shortest paths from their nest to food sources and back. They can perform this behavior thanks to a simple pheromone-laying mechanism. In fact, while walking, ants deposit small amounts of pheromone on the ground. When ants move from their nest to the food source they move mostly at random, but their random movements are biased by pheromone trails left on the ground by preceding ants. Because the ants that initially chose the shortest path to the food arrive first, this path will be seen as more desirable by the same ants during their journey back to the nest.2 This, in turn, will increase the amount of pheromone deposited on the shortest path. Eventually, this auto-catalytic process causes all the ants to take the shortest path.
Artificial ants take advantage of the differential length as well as of the auto-catalytic aspects of the real ants' behavior to solve discrete optimization problems. The problem description is represented by a graph. Artificial ants are software agents in this graph, who modify some variables so as to favor the emergence of good solutions. In practice, a variable is associated with each edge of the graph, which is called a pheromone trail in analogy with the real ants. Ants add pheromone to those edges they pass and by doing so they increase the probability with which future ants will take these edges. Artificial ants, like real ones, move according to a probabilistic decision policy biased by the amount of pheromone trail they smell on the graph edges.
This makes an ant fit the definition of an agent, and thus ASs are examples of MAS. Since some ant algorithms have already been tested extensively and proved to perform well (see, for example, [9]), they should be studied more theoretically. Although many good results are achieved by these algorithms, many open questions still remain. How and why do these algorithms work? What are the principles of ACO algorithms? What controls them?
It turns out that the way ant algorithms work is not that different from interconnected LAs. In this paper, we wish to point
out that although both fields came from a different perspective
and motivation (human behavior as opposed to ants), they came
up with the same kind of algorithms for the same applications,
cf. routing in telecommunication networks [9], [5]. As far as the
aforementioned questions are concerned, the field of LAs may
help out with a theoretical basis for ant algorithms and MAS
in general. The potential of using LAs for learning in MAS was
also pointed out by others [10], [11]. At the end of this paper, we
use the analogy with the ACO algorithms to construct an interconnected model of LAs. This model is able to handle common
problems in current MAS research (i.e., Markov games) directly
[12], [13].
In the next section, ACO will be discussed. We give two ant
algorithms, which are representatives of the two problem types
ant algorithms handle: 1) static optimization problems and 2)
dynamic optimization problems. Next, we summarize some basics from LA theory. Since a graph can be modeled as an MDP, and ACO problems are modeled as graphs, we are especially interested in the interconnected model of LAs which is capable of controlling an MDP. Section IV then shows the link between the ACO algorithms and the interconnected model of LAs. Section V suggests the use of LAs directly in a dynamic grid-world game. We end with a discussion in Section VI.

2 This is called the differential length effect.
II. ANT COLONY OPTIMIZATION (ACO)
As already mentioned in the introduction, the field of ACO
studies artificial ASs for solving discrete optimization problems. The problem description is represented in a graph and
the artificial ants3 are software agents who can move in this
graph. The problem is usually formulated so that solving it amounts to finding one or more shortest paths in the graph. This is the task for the ants. For example, in the TSP, the ants have to find the shortest path which visits every node of the graph exactly once. One ant is capable of finding a tour, but good solutions emerge because of the interactions in the colony. A good overview of the existing ant algorithms, their applications, and common characteristics is given in [8].
The construction of the shortest path is motivated by how real
ants find the shortest path from their nest to a food source and
back. By analogy with the pheromone trail real ants use, each edge (i, j) of the graph, connecting node i with node j, has a variable \tau_{ij} associated with it. This variable represents the amount of pheromone on the edge. While walking, ants add pheromone to the edges on their way. They move according to a probabilistic decision policy based on the amount of pheromone trail they smell on the graph's edges. Positive feedback is thus implemented by reinforcing the trail on those edges which are used. To avoid premature convergence, this pheromone trail evaporates over time and the ants' transitions to other nodes in the graph are stochastic.
A second main effect at play in the construction of the shortest path the ants are trying to find is how this trail information is communicated in the colony. Communication of this feedback happens locally and indirectly. It is mediated by physical modifications of environmental values; here, these are the pheromone trail values of the edges. This model is called the stigmergetic communication model. Ants thus cooperate by leaving information for each other at certain places in the graph.
Two different types of optimization problem can be considered. Static combinatorial optimization can be translated to the
problem of shortest path tracking in static graphs. As opposed to
static optimization problems, dynamic problems change during
run time. The algorithms must be capable of adapting on-line to
the changing characteristics of the problem environment. These
changes may be unrelated to other ants acting in the environment.4 However, when the problem changes due to other agents acting in the same environment, one has to deal with the hardest case of nonstationary scenarios. For example, in urban traffic, a noncongested alternative route may be a good solution as long as it is not discovered by other drivers. More ants make the problem more complicated in this case. This means that cooperation can also lead to a form of competition, when several shortest paths have to be found in parallel, as in the routing application for instance.

Unlike real ants, artificial ants can also be enriched with extra information like memory, local heuristic information, lookahead, etc.

3 In the remainder of this paper, they will be referred to simply as ants.

4 For example, in [14], strategies for the dynamic TSP are studied. In this variant of the TSP, cities or edges may be inserted or removed. After a change has taken place, the ant algorithm can be restarted for the new situation. The question here is whether to initialize everything again or to use the results from the previous problem for the new one.
As representatives of static and dynamic combinatorial optimization, respectively, we consider here the ACO algorithms AntCycle and AntNet.
A. Traveling Salesman
AntCycle is one of the first ant algorithms and belongs to the AS group which was defined in [7] for solving the TSP. In this problem, a minimal-length tour that visits a given set of towns has to be found. The mapping of the problem onto a graph is trivial in this case.
AntCycle accomplishes the task as follows. At a time instance t, every town has a number of ants, who choose the next town to go to with a probability that is a function of the town distance and of the amount of pheromone trail present on the connecting edge [see (1)]. Transitions to already visited towns are disallowed (forced by a list of visited towns which is kept in the memory of the ant). Every edge chosen will be updated with a value depending on the length of the tours it was part of. To avoid premature convergence, this pheromone trail slowly evaporates [see (2)]. For the same reason, state transitions are stochastic. Communication is handled by the update of the pheromone trail, which is locally available to the other ants.

The probability p_{ij}^k(t) with which an ant k in node i will go to node j is

p_{ij}^k(t) = \frac{[\tau_{ij}(t)]^{\alpha} [\eta_{ij}]^{\beta}}{\sum_{l \in allowed_k} [\tau_{il}(t)]^{\alpha} [\eta_{il}]^{\beta}}  if j \in allowed_k, and 0 otherwise    (1)

where \eta_{ij} = 1/d_{ij}, d_{ij} is the distance between town i and town j, \tau_{ij}(t) is the intensity of the trail on edge (i, j), and \alpha and \beta control the relative importance of trail and visibility. After the m ants in the system have ended their tours, the trail values \tau_{ij} of every edge (i, j) are updated

\tau_{ij}(t+n) = \rho \, \tau_{ij}(t) + \Delta\tau_{ij}(t, t+n)    (2)

where \rho < 1 is a trail decay coefficient and

\Delta\tau_{ij}(t, t+n) = \sum_{k=1}^{m} \Delta\tau_{ij}^k(t, t+n)    (3)

where \Delta\tau_{ij}^k(t, t+n) is the quantity per unit of length of trail substance laid on edge (i, j) by the kth ant between time t and t+n

\Delta\tau_{ij}^k(t, t+n) = Q / L_k  if the kth ant used edge (i, j) in its tour, and 0 otherwise    (4)

where L_k is the length of the tour done by the kth ant and Q is a constant. The total number of ants in the system is m.

Ants are fully cooperative, as they have a common goal, i.e., to find the shortest path. The use of a link by one ant does not influence the usefulness of that link for another agent. This is also reflected by the fact that, when using more ants in this case, good solutions will evolve more quickly.5 Some heuristic information is added to the action-selection rule (1) via the term \eta_{ij} = 1/d_{ij}, of which the importance can be tuned by the parameter \beta.
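To make the interplay of rules (1)-(4) concrete, the following Python sketch implements one AntCycle-style iteration. It is only a minimal illustration under our own assumptions (a symmetric distance matrix stored as a list of lists, a trail matrix tau initialized with small positive values, and illustrative parameter defaults), not the reference implementation of [7].

import random

def build_tour(start, tau, dist, alpha=1.0, beta=5.0):
    """Construct one ant's tour with transition rule (1)."""
    n = len(dist)
    tour, visited = [start], {start}
    current = start
    while len(tour) < n:
        allowed = [j for j in range(n) if j not in visited]
        # weight of town j: trail^alpha times visibility^beta, visibility = 1/d_ij
        weights = [(tau[current][j] ** alpha) * ((1.0 / dist[current][j]) ** beta)
                   for j in allowed]
        nxt = random.choices(allowed, weights=weights)[0]
        tour.append(nxt)
        visited.add(nxt)
        current = nxt
    return tour

def tour_length(tour, dist):
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def update_trails(tau, tours, dist, rho=0.5, Q=100.0):
    """Trail update (2)-(4): decay of the old trail plus deposits of Q/L_k."""
    n = len(tau)
    delta = [[0.0] * n for _ in range(n)]
    for tour in tours:
        L_k = tour_length(tour, dist)
        for i in range(len(tour)):
            a, b = tour[i], tour[(i + 1) % len(tour)]
            delta[a][b] += Q / L_k      # rule (4), summed over the ants as in (3)
            delta[b][a] += Q / L_k      # symmetric TSP instance assumed
    for i in range(n):
        for j in range(n):
            tau[i][j] = rho * tau[i][j] + delta[i][j]   # rule (2)

One iteration then launches the m ants (one call to build_tour per ant), collects the resulting tours, and calls update_trails; the shortest tour seen so far is recorded outside this loop.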
AS was compared with other general-purpose heuristics [15]. For small TSP problems, the results were very interesting. AS was able to find and improve the best solution found so far for a known 30-city problem. For problems of growing dimensions, AS quickly converged to good solutions; however, it did not reach the best known solutions within the allowed number of iterations. Later on, AS was extended to the ACS system6 [8]. The performance of ACS turned out to be the best one, both in terms of the quality of the solutions generated and of CPU time, on standard problems of various sizes.

ACO algorithms for other static optimization problems were introduced, i.e., the quadratic assignment problem, job-shop scheduling, graph coloring, sequential ordering, etc. [8]. They all proved to be competitive with the best known methods in the literature.

5 Because of the computational load there could be some tradeoff (see [7]).

6 One of the differences with the original AS is that, next to a local update of the edges visited, there is a global update of the best tour from the beginning of the trial. Furthermore, a pseudorandom proportional rule is used for the ants to move in the graph. A candidate list provides additional heuristic information.
B. Distributed Routing in Communication Networks
The AntNet system was introduced in [9] as a distributed,
adaptive, mobile-agent-based algorithm for load-based shortest
path routing in connection-less communication networks.
Routing is the distributed activity of building and using routing tables, one for each node in the network, which tell incoming data packets which outgoing link to use to continue their travel to their destination node.
At regular time intervals, forward ants F_{s \to d} are launched in every node s, concurrently with the data traffic, toward a randomly selected destination node d. The ants' goal is to find a feasible low-cost path to the destination and to check the load status of the network. To accomplish this, they have to use the same network queues as normal data packets. While traveling, ants keep track of the nodes visited and the time elapsed since launching time.

Ants' decisions to move forward are taken on the basis of a combination of a long-term learning process and an instantaneous heuristic prediction. Neighbor n of the current node k is selected with a probability P'_{nd}

P'_{nd} = \frac{P_{nd} + \alpha \, l_n}{1 + \alpha \, (|N_k| - 1)}    (5)

where N_k is the set of neighbor nodes of node k that the ant has not already visited, l_n is a heuristic correction based on the length of the queue of the link connecting the current node k with node n, and P_{nd} is the probability with which node n should be chosen for destination d, which is stored in the routing table of the current node k.

The value of \alpha weights the importance of the heuristic correction with respect to the routing table information. Therefore, the action probability P'_{nd} depends on a long-term learning process and an instantaneous heuristic prediction.7
Furthermore, the ant sees to it that cycles are detected, so that irrelevant information collected on a cycle can be erased, or the ant destroys itself when the cycle was too long. When eventually the destination node d is reached, a backward ant B_{d \to s} is created. The forward ant transfers its memory to it and dies. The backward ant now moves in the opposite direction and, at each node along the path, it updates the statistical model of the node as well as its routing table. Backward ants use priority queues to continue their travel, so the information is quickly propagated.
Every node k keeps a statistical model M_k of the traffic distribution by computing sample means and variances over the trip times experienced by the mobile ants. A moving observation window W is used to compute the value W_{best} of the best trip time seen in that window. The model is updated as follows:

\mu_d \leftarrow \mu_d + \eta \, (o_{k \to d} - \mu_d)    (6)

\sigma_d^2 \leftarrow \sigma_d^2 + \eta \, ((o_{k \to d} - \mu_d)^2 - \sigma_d^2)    (7)

where o_{k \to d} is the new observed trip time from node k to destination d and \eta is a learning factor. The statistical model is used in the routing-table updating process by assigning a goodness to the trip time. This goodness value r, with 0 < r \le 1, is seen in the current node as a positive reinforcement signal for the neighbor node f from which the backward ant returns

P_{fd} \leftarrow P_{fd} + r \, (1 - P_{fd})    (8)

The probabilities P_{nd} for destination d associated with the other neighboring nodes n of k implicitly receive a negative reinforcement by normalization

P_{nd} \leftarrow P_{nd} - r \, P_{nd}, \quad n \in N_k, \; n \ne f    (9)
Di Caro and Dorigo [9] recognized the importance of the reinforcement r. It is carefully chosen to be a squashed function of a sum of two terms. The first term, which is the most important one, evaluates the ratio between the current trip time and the best trip time observed over the current window. The second term considers the stability of the last trip times, favoring paths with low variations.

If the trip time of a subpath is statistically good, then the statistics and the routing-table entries corresponding to every node on that subpath are also updated.
Routing tables are used in a probabilistic way, not only by the
ants, but also by the data packets.
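The following sketch shows how a single node could process an arriving backward ant, combining the model update (6)-(7) with the routing-table update (8)-(9). It is only an illustration: the data layout is our own, and the simple ratio used for r is a crude stand-in for AntNet's squashed two-term reinforcement described above.

from dataclasses import dataclass

@dataclass
class NodeState:
    """Per-destination data kept by one router (illustrative layout)."""
    P: dict        # P[dest][neighbor] -> routing probability
    mu: dict       # mu[dest]          -> mean trip time, rule (6)
    sigma2: dict   # sigma2[dest]      -> trip-time variance, rule (7)
    w_best: dict   # w_best[dest]      -> best trip time in the observation window

def backward_ant_update(node, dest, prev_hop, trip_time, eta=0.1):
    # local statistical model M_k, rules (6)-(7)
    err = trip_time - node.mu[dest]
    node.mu[dest] += eta * err
    node.sigma2[dest] += eta * (err * err - node.sigma2[dest])

    # goodness of this trip time; AntNet's actual reinforcement also
    # rewards paths with stable (low-variance) trip times
    r = min(node.w_best[dest] / max(trip_time, 1e-9), 1.0)

    # rule (8): positive reinforcement for the neighbor the ant came back from;
    # rule (9): implicit negative reinforcement of the others by normalization
    for n in node.P[dest]:
        if n == prev_hop:
            node.P[dest][n] += r * (1.0 - node.P[dest][n])
        else:
            node.P[dest][n] -= r * node.P[dest][n]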
Versions of the AntNet algorithm were tested in [16] against
some state-of-the-art algorithms, using the NTT Japanese backbone network, randomly generated networks of 100 and 150
nodes, and benchmark problems. The performance of AntNet
algorithms was among the best concerning packet delays and
7 Experimentally, Dorigo et al. [9] found that the best value of the weight \alpha can vary between 0.2 and 0.5, depending on the problem characteristics. For lower values, the reactive effect vanishes, while for higher values, oscillations of the resulting routing tables appear.

throughput. Stable behavior was reached fairly quickly and the


ant algorithms were observed to be robust under different traffic
conditions [16].
C. Decentralized Control of a Markovian Decision Problem (MDP)
As mentioned earlier, ant algorithms were already extensively
tested and proven to perform very well. However, why they
work and what the basic principles are behind them is not deeply
understood. To clarify the relationship with LAs, we first reformulate the problem. ACO algorithms could be seen as shortest
path tracking in static or dynamic8 graphs. For a start, assume
the graph is static. The graph can then be modeled as an MDP,
because shortest path problems are special cases of MDPs (cf.
[17]).
An MDP is defined by a set of states S, a set of actions A, a transition function9 T : S \times A \to \Pi(S) that outputs a probability distribution on S, and a reward function R : S \times A \to \mathbb{R}, which implicitly specifies the agent's task. In our case, the states of the MDP are the nodes of the graph, the actions lead from one node in the graph to another connected node, and thus the transition function is deterministic. The rewards or costs are the values associated with the edges.

8 The graph here is dynamic in the sense that the costs associated with the edges change.

9 This function models the probability of ending up in a next state when an agent takes an action in a certain state.
The way LAs are able to control an MDP matches perfectly
well with what ants are doing in their graphs, even in the case of
dynamic graphs. In the next section, we revisit some essentials
concerning LAs, including how an interconnected framework of
LAs is able to control an MDP.
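As a small illustration of this reformulation, the sketch below encodes a shortest-path graph as a deterministic MDP: states are nodes, an action is the choice of an outgoing edge, and the reward is the negative edge cost. The class and the toy instance are our own, purely illustrative constructions.

from dataclasses import dataclass

@dataclass
class GraphMDP:
    edges: dict   # edges[node] -> {neighbor: cost}

    def actions(self, s):
        # the available actions in state s are the outgoing edges
        return list(self.edges[s])

    def step(self, s, a):
        # deterministic transition along the chosen edge; reward = -cost,
        # so maximizing total reward minimizes path length
        return a, -self.edges[s][a]

# toy instance: two routes from the nest to the food source
g = GraphMDP(edges={"nest": {"A": 1.0, "B": 2.0},
                    "A": {"food": 1.0},
                    "B": {"food": 0.5},
                    "food": {}})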
III. LEARNING AUTOMATA
An LA formalizes a general stochastic system in terms of states, actions, state or action probabilities, and environment responses [5]. The concept has some roots in psychology and operations research. The design objective of an automaton is to guide the action selection at any stage by past actions and environment responses, so that some overall performance function is improved. At each stage, the automaton chooses a specific action from its finite action set and the environment provides a random response (see Fig. 1).

Fig. 1. Learning automata–environment pair.
In a variable-structure stochastic automaton, the probabilities of the various actions are updated on the basis of the information the environment provides. Action probabilities are updated at every stage using a reinforcement scheme. It is defined by a quadruple \{\alpha, \beta, p, T\}, for which \alpha = \{\alpha_1, \ldots, \alpha_r\} is the action or output set of the automaton, \beta is a random variable in the interval [0, 1], p is the action probability vector of the automaton or agent, and T denotes an update scheme. The output \alpha of the automaton is actually the input to the environment. The input \beta of the automaton is the output of the environment, which is modeled through penalty probabilities c_i with c_i = Pr\{\beta(n) = 0 \mid \alpha(n) = \alpha_i\}, i = 1, \ldots, r.
Important examples of linear update schemes are linear reward–penalty, linear reward–inaction, and linear reward–\epsilon-penalty. The philosophy of these schemes is essentially to increase the probability of an action when it results in a success and to decrease it when the response is a failure. The general reward–penalty algorithm is given by

p_i(n+1) = p_i(n) + a \, \beta(n) \, (1 - p_i(n)) - b \, (1 - \beta(n)) \, p_i(n)  if \alpha_i is chosen at time n    (10)

p_j(n+1) = p_j(n) - a \, \beta(n) \, p_j(n) + b \, (1 - \beta(n)) \left[ \frac{1}{r-1} - p_j(n) \right]  if \alpha_j \ne \alpha_i    (11)

where \beta(n) \in [0, 1] is the environment response, with \beta(n) = 1 the most favorable response. The constants a and b are the reward and penalty parameters, respectively. When a = b, the algorithm is referred to as linear reward–penalty (L_{R-P}); when b = 0, it is referred to as reward–inaction (L_{R-I}); and when b is small compared to a, it is called linear reward–\epsilon-penalty (L_{R-\epsilon P}).
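A variable-structure automaton with the update rules (10) and (11) takes only a few lines of Python. The sketch below follows the convention used above (\beta = 1 is the most favorable response); the class name and default parameter values are our own choices.

import random

class LinearLA:
    """Linear automaton: b = a gives L_R-P, b = 0 gives L_R-I,
    and 0 < b << a gives the linear reward-epsilon-penalty scheme."""
    def __init__(self, n_actions, a=0.05, b=0.0):
        self.r = n_actions
        self.a, self.b = a, b
        self.p = [1.0 / n_actions] * n_actions   # action probability vector

    def choose(self):
        # sample an action according to the current probability vector
        return random.choices(range(self.r), weights=self.p)[0]

    def update(self, i, beta):
        """i is the index of the action chosen at this stage,
        beta in [0, 1] is the environment response."""
        a, b, r, p = self.a, self.b, self.r, self.p
        for j in range(r):
            if j == i:   # rule (10)
                p[j] += a * beta * (1 - p[j]) - b * (1 - beta) * p[j]
            else:        # rule (11)
                p[j] += -a * beta * p[j] + b * (1 - beta) * (1.0 / (r - 1) - p[j])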

If the penalty probabilities of the environment are constant, the probability p(n+1) is completely determined by p(n), and hence p(n) is a discrete-time homogeneous Markov process. Convergence results for the different schemes are obtained under the assumption of constant penalty probabilities [5]. In the case of state-dependent nonstationary environments, the penalty probability c_i(n) will no longer be constant. Models are proposed in which c_i(n) depends on the action probability p_i(n) and/or the past values of the action probability. In these models, the overall system can be described by a homogeneous Markov process, so that the asymptotic behavior of the schemes can still be analyzed. The above schemes still behave well in these models; an L_{R-P} scheme tends to equalize the penalty rates of the various actions, while an L_{R-I} scheme tends to equalize the penalty probabilities.

The question, of course, is whether the proposed models adequately agree with the real environment. Hence, a principal question in designing LAs is: what information should be fed back, and how can this be achieved in practice?
A. Interconnected Learning Automata: Decentralized Control
of MDPs
In [5], learning is not only considered in the single-automaton case; hierarchies of automata and distributed interconnections of automata, such as automata games and MDPs, are studied as well.
The important problem of controlling a Markov chain can be formulated as a network of automata in which control passes from one automaton to another. In this set-up, every action state in the Markov chain has an LA that tries to learn the optimal action probabilities in that state with the learning update rules (10) and (11). Only one LA is active at each time step, and the transition to the next state triggers the LA of that state to become active and take some action. The LA_i active in state s_i is not informed of the one-step reward resulting from its action \alpha_i, leading to state s_j. When state s_i is visited again, LA_i receives two pieces of data: 1) the cumulative reward generated by the process up to the current time step and 2) the current global time. From these, LA_i computes the incremental reward generated since this last visit and the corresponding elapsed global time. The environment response or the input to LA_i is then taken to be

\beta_i(n+1) = \frac{\rho_i(n)}{\eta_i(n)}    (12)

where \rho_i(n) is the cumulative total reward generated for action \alpha_i in state s_i and \eta_i(n) is the cumulative total time elapsed.10 Wheeler and Narendra [6] denote the updating rules (10) and (11) with the environment response of (12) as learning scheme T1. They also prove that this interconnected LA model is capable of solving the MDP.
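A sketch of learning scheme T1, reusing the LinearLA class from the previous sketch: one reward-inaction automaton per state, a single token walking through the chain, and the response (12) computed from the bookkeeping described above. The step function is a hypothetical stand-in for the unknown MDP, and rewards are assumed to be scaled so that the response stays in [0, 1].

def control_mdp(states, n_actions, step, start, n_steps, a=0.01):
    """Decentralized control of an MDP with one L_R-I automaton per state."""
    las = {s: LinearLA(n_actions[s], a=a, b=0.0) for s in states}
    rho = {s: [0.0] * n_actions[s] for s in states}   # cumulative reward per action
    eta = {s: [0.0] * n_actions[s] for s in states}   # cumulative elapsed time per action
    last_visit = {s: None for s in states}            # (action, cum_reward, global_time)
    cum_reward, global_time = 0.0, 0

    s = start
    for _ in range(n_steps):
        la = las[s]
        if last_visit[s] is not None:
            act, reward_then, time_then = last_visit[s]
            rho[s][act] += cum_reward - reward_then    # incremental reward since last visit
            eta[s][act] += global_time - time_then     # elapsed global time since last visit
            beta = min(max(rho[s][act] / max(eta[s][act], 1.0), 0.0), 1.0)  # response (12)
            la.update(act, beta)
        action = la.choose()
        last_visit[s] = (action, cum_reward, global_time)
        s, reward = step(s, action)                    # hypothetical environment call
        cum_reward += reward
        global_time += 1
    return las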
Theorem 1 (Wheeler and Narendra [6]): Let with every action state s_i of an N-state Markov chain an automaton LA_i, using learning scheme T1 and having r_i actions, be associated. Assume that the Markov chain corresponding to each policy \alpha is ergodic.11 Then for any \epsilon > 0 and any \delta \in (0, 1), there exists an a^* > 0 such that for all a \in (0, a^*) and all n sufficiently large

Pr\{ \, |J(\alpha(n)) - \max_{\alpha} J(\alpha)| < \epsilon \, \} > 1 - \delta

where J(\alpha) is the expected reward per step for policy \alpha and can be written in terms of the limiting stationary probabilities \pi_i(\alpha)

J(\alpha) = \sum_{i=1}^{N} \pi_i(\alpha) \sum_{j=1}^{N} p_{ij}(\alpha) \, r_{ij}

where r_{ij} and p_{ij} are the rewards and transition probabilities, respectively, depending on starting state s_i and ending state s_j.

Proof: Wheeler and Narendra [6] prove that the Markov chain control problem under the assumptions above can be asymptotically approximated by an identical payoff game12 of automata. This game is shown to have a unique equilibrium. For the corresponding automata game, with every automaton using an L_{R-I} updating scheme, the above result is proved. As long as the ratio of the updating frequencies of any two controllers does not tend to zero, the result also holds for the asynchronous updating scheme.

The principal result derived is that, without prior knowledge of transition probabilities or rewards, the network of independent decentralized LA controllers is able to converge to the set of actions that maximizes the long-term expected reward [5].

10 The one-step reward is normalized so that \beta stays in [0, 1].

11 A Markov chain x(n) is said to be ergodic when the distribution of the chain converges to a limiting distribution as n \to \infty.

12 Games are a formalization of interactions between players. In an identical payoff game, every player receives the same payoff for the joint action taken by the players. For an overview of game theory, see [18]. For an introduction to automata games, see [5].


IV. ANTS AND LA—HOW DO THEY MATCH?


In this section, ant algorithms are compared with the interconnected network of LAs used for controlling an MDP. An ant
can be viewed as a dummy mobile agent that walks around in
the graph of interconnected LAs, makes states/LAs active, and
brings information so that the LAs involved can update their
local state. The only difference is that, in ant algorithms, several ants are walking around simultaneously, and thus several
LAs can be active at the same time. In the model of Wheeler
and Narendra (see Theorem 1), there is only one LA active at a
time.
However, adding multiple mobile agents to the system does not harm the convergence. Define update scheme T2 to be the same as update scheme T1, only in T2 multiple LAs can be active at the same time and thus update their actions. The active LAs are those that are currently visited by a mobile ant. The automata still use the same update scheme [(10) and (11)] and environment response [(12)]. The following extension to Theorem 1 can be made.
Theorem 2 (Extension): Let with every action state s_i of an N-state Markov chain an automaton LA_i, using learning scheme T2 and having r_i actions, be associated. Assume that the Markov chain corresponding to each policy \alpha is ergodic. Then for any \epsilon > 0 and any \delta \in (0, 1), there exists an a^* > 0 such that for all a \in (0, a^*) the same convergence result as in Theorem 1 can be proved.
Proof: As long as all the mobile ants have the same nonconflicting objective, and thus the actions of the LAs are not updated by conflicting environment responses, the convergence result of Theorem 1 still holds. This is because asynchronous updating is allowed and, thanks to the multiple ants walking around in the environment, the ratio of the updating frequencies of any two LAs will certainly not tend to zero.
As for the ant algorithms, this result holds in the static case, because all the ants are looking for the same shortest path and no competition is involved. This is also reflected in the fact that, when using more ants, good solutions evolve more quickly [7].
The above theorem gives a formal justification for the use of
ant algorithms in the static case. Therefore, LAs give insight into
why ACO algorithms work. In the dynamic case, meaning the
transition probabilities in the MDP may depend on the action
probabilities of the other nodes, the model can still be used. As
the experimental results of AntNet are very good, an ACO algorithm gives an experimental ground for an LA model. Therefore, our message here is that these two fields may influence
each other in a positive way.
A. AntCycle and LAs
When comparing the trail update rule (2) with the update scheme of the LAs in the interconnected model of Section III-A, the commonalities are obvious. In the view of interconnected LAs, the update scheme of the automata is given by the trail update rule (2), which is a form of reward–penalty (L_{R-P}) updating. This is because the trail information reflects the goodness of the tours. The trail in AntCycle is updated with an amount which depends on the total length or cost of the tour, while the environment response for the LAs also depends on the total reward generated since the last visit. When the interconnected model of LAs is used for solving the TSP, heuristic information, such as favoring the closest towns, could also be included in the environment response.
B. AntNet and LAs
In AntNet, there are two kinds of mobile agents: 1) forward
ants and 2) backward ants. Their job is to bring information
to the local nonmobile agents, who keep the local routing tables and who are actually LAs. An evaluation and improvement
process can clearly be distinguished in AntNet. The evaluation
process is responsible for learning the local statistical models,
while the improvement process is actually the update of the local
routing tables. Comparing rules (11) and (10) with rules (9) and
(8), respectively, shows that the policy improvement process is
an LA update.
The probabilistic policy in every node is improved with the reward–inaction LA scheme, as suggested by Theorem 2. Since 1 - \beta plays the role of a penalty term, the environment response \beta corresponds exactly to the reinforcement term r. Furthermore, the penalty parameter b is zero, whereas the reward parameter a is one.

The statistical model M_k is learned via every-visit Monte Carlo updates. \mu_d and \sigma_d^2 are sampled averages with learning factor \eta, where the actual return is, respectively, o_{k \to d} and (o_{k \to d} - \mu_d)^2. Moreover, one ant exploring the network corresponds to one trial in the Monte Carlo evaluation. Trials are thus generated in parallel.
Since this algorithm has already been tested thoroughly, it provides the model of interconnected LAs with a solid starting point for being used as a learning framework for MAS in general. In the next section, it is shown that this model of interconnected LAs is also capable of handling cases currently studied in MAS research [12], [13].
V. INTERCONNECTED LA FOR MARKOV GAMES
The Markov game view of a MAS is a sequence of games that have to be played by multiple players, with each game belonging to a different state of the system. The LA model that we propose takes a different view.
As in the directed interconnected network of LAs which was
used for modeling an MDP (see Section III-A) we will distribute
the knowledge of the players (or mobile agents) over the states
of the system. Again, every state has an LA, which is nonmobile and learns the action probabilities for the possible actions in
that state. Thus, we assume that actions belong to a state and not
to the players. Furthermore, a state belongs to the problem/environment description, meaning that the players can each be in
a different state.13 The difference with the MDP model in Section III-A is that there are multiple players or mobile agents,
and thus several states/LAs can be active simultaneously. Mobile agents do not have to model their opponents because they
use the knowledge from the state which they are currently visiting. This information can also be seen by the opponents when
they arrive at that same state later on. The LAs are updated by the
knowledge they receive from the visiting mobile agents. Knowledge is thus shared by the players. Intuitively, this means that
coordination may be easier in this case.
In the next subsection, we explain this LA model in more
detail in a grid-world game.
A. An Experiment
The Markov game we consider is a sequential grid-world game, introduced in [13]. The game consists of a 3 x 3 grid and two agents who are trying to find a common goal state. The agents start in the two bottom corners and are trying to find the goal square, which is the middle top square (see Fig. 2). If two players attempt to move to the same square, different from the goal state, both moves fail.

Fig. 2. Grid-world game with two mobile agents in their initial state and nonmobile LAs in every nongoal state of the grid.

13 In the Markov game framework, the players jointly define one state.
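The sketch below shows one plausible way to instantiate this experiment with the LinearLA class from Section III. The episodic reward scheme (an environment response of 1 for every state automaton on the path of an agent that reached the goal, 0 otherwise), the step limit, and the data layout are our own assumptions; the paper does not spell these details out.

MOVES = {0: (0, 1), 1: (0, -1), 2: (-1, 0), 3: (1, 0)}   # up, down, left, right
GOAL, STARTS = (1, 2), [(0, 0), (2, 0)]                  # 3 x 3 grid, goal at middle top

def clip(x, y):
    return (min(max(x, 0), 2), min(max(y, 0), 2))

def run_episode(las, max_steps=50):
    pos = list(STARTS)
    paths = [[], []]                      # (state, action) pairs per mobile agent
    for _ in range(max_steps):
        if all(p == GOAL for p in pos):
            break
        targets = []
        for i, p in enumerate(pos):
            if p == GOAL:
                targets.append(p)
                continue
            act = las[p].choose()         # the nonmobile LA of the visited state acts
            dx, dy = MOVES[act]
            targets.append(clip(p[0] + dx, p[1] + dy))
            paths[i].append((p, act))
        # if both agents attempt to move to the same nongoal square, both moves fail
        if targets[0] == targets[1] and targets[0] != GOAL:
            targets = pos
        pos = targets
    for i in range(2):                    # assumed episodic environment response
        beta = 1.0 if pos[i] == GOAL else 0.0
        for state, act in paths[i]:
            las[state].update(act, beta)

# one L_R-I automaton per nongoal state, a = 0.006 as in Fig. 3
las = {(x, y): LinearLA(4, a=0.006, b=0.0)
       for x in range(3) for y in range(3) if (x, y) != GOAL}
for _ in range(10000):
    run_episode(las)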

Fig. 3. Grid-world game with deterministic transitions, a = 0.006.

Fig. 4. Grid-world game with deterministic transitions, a = 0.006; probabilities for the actions in the initial states.

When we use L_{R-I} schemes for our nonmobile state automata, the automata find two optimal nonconflicting paths for the mobile agents. Both agents take the action up in their initial state. This solution is the Nash equilibrium of the game, consisting of the mobile agents' dominating strategies (see Figs. 3 and 4).

The game can be made more interesting by letting the action up in the two starting states be executed with probability 0.5 only. Figs. 5 and 6 show that an L_{R-I} scheme with a = 0.06 finds a solution involving one player taking the lateral move and the other trying to move north. Again, this is an equilibrium for this game.14
Therefore, in this example, our LA model is able to find the
equilibrium solutions, yet without explicit communication or
the need for an agent to model his opponents.
14 The other equilibrium is the same, only with the roles of the agents switched.

Fig. 5. Grid-world game with stochastic transitions, a = 0.06.

Fig. 6. Grid-world game with stochastic transitions, a = 0.06; probabilities for the actions in the initial states.

VI. DISCUSSION
In this paper, we compared the field of ACO with an interconnected LA model which is capable of controlling an MDP. We extended this LA model to allow for multiple LAs being active simultaneously, and we discovered that this model matches perfectly with the ant colony paradigm.

How can we use these results? From the ACO point of view, this means that the theory of LAs can serve as a theoretical analysis tool for ACO algorithms. The problem in the field of ACO is that, although good results, both in terms of quality and convergence time, are achieved by the algorithms, no underlying formalism or convergence results for them exist. Why do these algorithms work? The convergence proof of the LA model in Section IV is a justification for the use of ant algorithms in the case of static optimization problems. In dynamic optimization problems, ant algorithms still work fine; however, in this case, no proof of convergence is given for the interconnected model of LAs. This confirms the importance of the heuristic information used in ant algorithms. From the LA point of view, this may be a suggestion coming from the ACO field, that is, the use of heuristic information can guide learning and improve convergence results.
Apart from the practical influence of using a heuristic, learning in one-agent problems or static environments seems to be theoretically well understood in the broad RL framework [2]. However, for learning in nonstationary environments, such as in MAS, a theoretical ground is missing. The ant algorithms, and therefore also the interconnected model of LAs, give several useful insights in this case. Currently, a lot of attention is going to the Markov game model for learning in MAS. The Markov game model [13], [19] is a direct extension of the MDP model and the game theory model for MAS. This model augments the MDP model with actions that are distributed over the different agents, as in the game theory model. At every step in the process, the system is in a certain state and a corresponding game has to be played. Although this model gives a natural mapping of the problem, learning in it is not trivial, because the Markovian property no longer holds due to the other agents in the environment. RL, or more precisely the technique of Q-learning [20], has already been used for learning in stochastic games [4], [12], [13], [19]; however, the proposed solutions are limited by some conditions. Oscillating behavior may arise and stabilizing features have to be added [4]. Moreover, in MDPs, agents learn a value for an action in a certain state, while in stochastic games, values are learned for combinations of actions, and learning is thus done in a product space. ACO methods seem to work around this problem. We showed how the similar model of interconnected LAs can be used directly on a Markov game problem currently studied in the MAS community. For now, the LA model gives the same results, however without modeling others, and thus without the need for the product space, and without explicit communication. Moreover, by analogy with ACO applications and results, scaling this technique to larger problems should be possible.
The use of LAs for learning in MAS has also been proposed by others. Schmidhuber and Zhao designed a MAS model [21] which augments the action vector with an additional action, namely a change in strategy. The key idea there is to continually test, for every modification made, whether it accelerates the reinforcement the agents receive. In this model, there is also no explicit communication or modeling of other agents.
In [10], distributed game automata and the effect of delayed communication are studied. Explicit communication is used, but it is limited so that overhead costs are reduced while good decisions still result. Simulation and analytic results reported in [10] show that there exists a maximum communication delay before decision quality begins to suffer. However, with sufficient communication, the agents adapt to a coordinated policy.
In [22], LAs are used for playing stochastic games at multiple
levels.
To conclude, we believe that the similarities between ACO
and LAs mean that the theory of LA can serve as a good theoretical tool for analyzing ACO algorithms and learning in MAS
in general.

REFERENCES

[1] G. Weiss, Ed., Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence. Cambridge, MA: MIT Press, 1999.
[2] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[3] J. Boyan and M. Littman, "Packet routing in dynamically changing networks: A reinforcement learning approach," Adv. Neural Inf. Process. Syst., vol. 6, pp. 671-678, 1994.
[4] A. Nowé and K. Verbeeck, "Distributed reinforcement learning, load-based routing: A case study," in Notes of the Neural, Symbolic, and Reinforcement Methods for Sequence Learning Workshop at IJCAI, Stockholm, Sweden, 1999, pp. 85-91.
[5] K. Narendra and M. Thathachar, Learning Automata: An Introduction. Englewood Cliffs, NJ: Prentice-Hall, 1989.
[6] R. M. Wheeler and K. S. Narendra, "Decentralized learning in finite Markov chains," IEEE Trans. Automat. Contr., vol. AC-31, pp. 519-526, June 1986.
[7] M. Dorigo, V. Maniezzo, and A. Colorni, "The ant system: Optimization by a colony of cooperating agents," IEEE Trans. Syst., Man, Cybern. B, vol. 26, pp. 29-41, Feb. 1996.
[8] M. Dorigo, G. Di Caro, and L. M. Gambardella, "Ant algorithms for discrete optimization," Artif. Life, vol. 5, no. 2, pp. 137-172, 1999.
[9] G. Di Caro and M. Dorigo, "AntNet: Stigmergetic control for communications networks," J. Artif. Intell. Res., vol. 9, pp. 317-365, 1998.
[10] E. A. Billard and J. C. Pasquale, "Adaptive coordination in distributed systems with delayed communication," IEEE Trans. Syst., Man, Cybern., vol. 25, pp. 546-554, Apr. 1995.
[11] A. Glockner and J. C. Pasquale, "Coadaptive behavior in a simple distributed job scheduling system," IEEE Trans. Syst., Man, Cybern., vol. 23, pp. 902-907, May/June 1993.
[12] C. Boutilier, "Sequential optimality and coordination in multi-agent systems," in Proc. IJCAI, Stockholm, Sweden, 1999, pp. 478-485.
[13] J. Hu and M. P. Wellman, "Multi-agent reinforcement learning: Theoretical framework and an algorithm," in Proc. 15th Int. Conf. Machine Learning, 1998, pp. 242-250.
[14] M. Guntsch, J. Branke, M. Middendorf, and H. Schmeck, "ACO strategies for dynamic TSP," in Proc. 2nd Int. Workshop Ant Algorithms, 2000, pp. 59-62.
[15] M. Dorigo and L. M. Gambardella, "Ant colony system: A cooperative learning approach to the traveling salesman problem," IEEE Trans. Evol. Comput., vol. 1, pp. 53-66, Jan. 1997.
[16] G. Di Caro and M. Dorigo, "Two ant colony algorithms for best-effort routing in datagram networks," in Proc. 10th Int. Conf. Parallel and Distributed Computing and Systems, 1998, pp. 541-546.
[17] J. N. Tsitsiklis, "Asynchronous stochastic approximation and Q-learning," Mach. Learn., vol. 16, pp. 185-202, 1994.
[18] M. J. Osborne and A. Rubinstein, A Course in Game Theory. Cambridge, MA: MIT Press, 1994.
[19] M. L. Littman, "Markov games as a framework for multiagent reinforcement learning," in Proc. 11th Int. Conf. Machine Learning, 1994, pp. 157-163.
[20] C. Watkins and P. Dayan, "Q-learning," Mach. Learn., vol. 8, no. 3, pp. 279-292, 1992.
[21] J. Schmidhuber and J. Zhao, "Direct policy search and uncertain policy evaluation," in Proc. AAAI Spring Symp. Search Under Uncertain and Incomplete Information, Stanford, CA, 1999, pp. 119-124.
[22] E. A. Billard and S. Lakshmivarahan, "Learning in multilevel games with incomplete information—Part I," IEEE Trans. Syst., Man, Cybern. B, vol. 29, pp. 329-339, June 1999.

Katja Verbeeck received the M.S. degree in mathematics in 1995, and the M.S. degree in computer
science in 1997, both from Vrije Universiteit
Brussels (VUB), Brussels, Belgium, where she is
currently pursuing the Ph.D. degree.
She is also currently a Teaching Assistant in the
Computational Modeling Lab, COMO, at VUB.
Her research interests are reinforcement learning,
learning automata, and learning in multiagent
systems.

Ann Nowé received the M.S. degree from Universiteit Gent, Gent, Belgium, in 1987, where
she studied mathematics with optional courses in
computer science, and the Ph.D. degree from Vrije
Universiteit Brussels (VUB), Brussels, Belgium,
in collaboration with Queen Mary and Westfield
College, University of London, London, U.K., in
1994. The subject of her dissertation is located in
the intersection of computer science (AI), control
theory (fuzzy control), and mathematics (numerical
analysis, stochastic approximation).
She was a Teaching Assistant and is now a Professor at VUB. Her major areas
of interest are AI-learning techniques, in particular, reinforcement learning and
learning in multiagent systems. She is a member of the Computational Modeling
Lab, COMO, at VUB.
