
Chemotaxis emerges as the optimal solution to cooperative search games

Alberto Pezzotta,1 Matteo Adorisio,1 and Antonio Celani2


1 International School for Advanced Studies (SISSA)
2 The Abdus Salam International Centre for Theoretical Physics (ICTP)

arXiv:1801.02965v2 [physics.bio-ph] 16 May 2018
Cooperative search games are collective tasks where all agents share the same goal of reaching a target in the shortest time while limiting energy expenditure and avoiding collisions. Here we show that the equations that characterize the optimal strategy are identical to a long-known phenomenological model of chemotaxis, the directed motion of microorganisms guided by chemical cues. Within this analogy, the substance to which searchers respond acts as the memory over which agents share information about the environment. The actions of writing, erasing and forgetting are equivalent to production, consumption and degradation of chemoattractant. The rates at which these biochemical processes take place are tightly related to the parameters that characterize the decision-making problem, such as learning rate, costs for time, control, collisions and their trade-offs, as well as the attitude of agents toward risk. We establish a dictionary that maps notions from decision-making theory to biophysical observables in chemotaxis, and vice versa. Our results offer a fundamental explanation of why search algorithms that mimic microbial chemotaxis can be very effective and suggest how to optimize their performance.

Introduction. Individuals in a group often have to face complex situations which require concerted actions [1–3]. Among the various collective intelligence problems, here we focus our attention on cooperative navigation tasks, where all agents share the common goal of locating a target and reaching it in the most efficient way. For instance, a crowd may need to quickly escape from an enclosed space while averting stampedes. Similarly, birds in a flock or fish in a school try to reduce exposure to predators and avoid harmful collisions. In addition, individuals are also confronted with the limits posed by the energetic costs of locomotion. The very same kind of objectives and challenges lie at the heart of multi-agent autonomous robotics [4–7].

Intelligent agents should aim at acting optimally in these contexts. That is, they should cooperate in order to minimize some cost function that compounds the many objectives at play: short time for completing the task, small energy spent in the process, and reduced damage by collisions. What is the optimal strategy? How universal is it across environments and agents? How is information shared by agents? How is it translated into actions? Can the optimal behavior be reliably and quickly learned by agents facing unknown environments? Is the optimal strategy actually employed by living organisms?

In this paper we answer these questions by formulating the cooperative search game in terms of stochastic optimal control. We first discuss how optimal solutions can be mapped into quantum states of an interacting many-body system. Unfortunately, the exact solution of this quantum problem is very difficult even in simple geometries. However, in the limit of very large collectives, a mean-field theory yields very simple and well-known effective equations.

Indeed, the mean-field equations for optimal cooperative search turn out to be identical to a long-known phenomenological model of chemotaxis, the celebrated Patlak–Keller–Segel model [8, 9] with Weber–Fechner logarithmic response (see e.g. [10] for a general discussion about fold-change detection). The chemical attractant can therefore be interpreted as the medium that agents use to share information about the location of the target and the density of individuals in the group. The biophysical processes by which the concentration is altered, namely production, consumption and degradation, correspond to the actions of writing information on the memory, erasing and forgetting, respectively. We show that there is a dictionary that maps concepts from decision-making theory – strategies, desirability, costs for control and for collisions, cost per time elapsed, attitude toward risk – into precise physico-chemical and biological correlates – concentration levels, diffusion coefficients, degradation and consumption rates, chemotactic coefficients (see Table I for the detailed analogy).

Optimal cooperative search. Let us consider a group of agents whose goal is to reach some target while minimizing a cost function that is a sum over several contributions: time to reach the target, energy expenditure and a penalty for collisions. The dynamics of the agents is given by a drift-diffusion equation

dX_i/dt = u_i + √(2D) η_i(t) ,   (1)

where the subscript i labels the agent, X_i are the positions, u_i are the individual controls, and η_i are independent standard white noises, ⟨η_i(t) η_j(t′)⟩ = δ_{ij} δ(t − t′). Uncontrolled motility is characterized by the constant diffusion coefficient D. In general, the controls u_i depend on the spatial configuration of all agents X_1 … X_N. The cost per unit time paid by the agent i is

c_i = q(X_i) + (γ/2) u_i² + (g/2) Σ_{j≠i} δ(X_i − X_j) ,   (2)

where the three terms penalize, respectively, time, energy and collisions.
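The controlled dynamics of Eqs. (1) and (2) can be simulated directly. The following is a minimal Euler–Maruyama sketch in Python; the drift-toward-target control, the parameter values and the Gaussian regularization of the contact delta are illustrative placeholders, not the optimal control derived below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative values, not those of the paper's simulations
N, dim = 50, 2
D, gamma, q, g = 1.0, 1.0, 10.0, 0.0
dt, steps = 1e-3, 4000
R_target, eps = 0.5, 0.1          # target radius; width regularizing the delta

def control(X):
    """Placeholder control field: a unit drift toward the target at the origin,
    standing in for the optimal control of Eq. (3)."""
    r = np.linalg.norm(X, axis=1, keepdims=True)
    return -X / np.maximum(r, 1e-9)

X = rng.normal(2.0, 0.3, size=(N, dim))     # agents start away from the target
active = np.ones(N, dtype=bool)             # agents that have not yet arrived
cost = np.zeros(N)

for _ in range(steps):
    u = control(X)
    # Euler-Maruyama step of Eq. (1): dX_i = u_i dt + sqrt(2 D) dW_i
    X[active] += u[active] * dt + np.sqrt(2.0 * D * dt) * rng.normal(size=(active.sum(), dim))
    # Running cost of Eq. (2); the contact delta is smoothed by a Gaussian kernel
    diff = X[:, None, :] - X[None, :, :]
    kern = np.exp(-np.sum(diff**2, axis=-1) / (2.0 * eps**2)) / (2.0 * np.pi * eps**2)
    np.fill_diagonal(kern, 0.0)
    c = q + 0.5 * gamma * np.sum(u**2, axis=1) + 0.5 * g * kern.sum(axis=1)
    cost[active] += c[active] * dt
    # Agents inside the target stop accumulating cost (here they are simply frozen)
    active &= np.linalg.norm(X, axis=1) > R_target

print(f"arrived: {(~active).sum()}/{N}, mean cost per agent: {cost.mean():.2f}")
```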

  Decision making                              |  Chemotaxis
  ζ      Desirability                          |  s                       Chemoattractant concentration
  D      Uncontrolled dynamics                 |  D                       Random motility
  u      Optimal control                       |  χ∇log s                 Chemotactic drift with logarithmic sensing
  γ, α   Weight for the cost of control;       |  χ = 2D/(1 − 2Dαγ)       Chemotactic coefficient
         risk sensitivity                      |
  q      Time cost rate                        |  k = q(1 − 2Dαγ)/2Dγ     Degradation rate of chemoattractant
  g      Collision cost rate                   |  β = g(1 − 2Dαγ)/2Dγ     Consumption rate of chemoattractant per cell
  1/ε    Learning rate                         |  D_s = D/ε               Diffusion coefficient of chemoattractant
  Eq. (11)  Hamilton–Jacobi–Bellman equation   |  Eq. (9)                 Patlak–Keller–Segel equations

TABLE I. A dictionary between optimal cooperative search and chemotaxis. The table illustrates the correspondence between quantities in mean-field optimal control and their counterparts in chemotaxis.

The total cost accumulated along the search process by the agent i is the integral of the cost rate c_i over time. When the agent reaches the target the cost does not increase anymore. The cooperative search is completed when all the agents have reached the target.

The cost features three contributions. The first one is the penalty for the time spent before reaching the target. We denote the associated cost per unit time as q. The second one is the cost of control, that we take as γu²/2. This is reminiscent of the power dissipation due to motion in a viscous medium, but can also be interpreted as the Kullback–Leibler divergence from a random strategy in the decision-making context [11]. Other choices are possible that leave the scenario below largely unchanged (see SI, Sec. 1.5). Finally, the last term arises from collisions. The general case of longer-range interactions between agents is discussed in the Supporting Information (SI, Section 1). This combination of factors embodies the trade-offs between different costs that lead to nontrivial solutions of the optimization problem: for instance, a fast search and a low collision risk cannot be achieved without a consequent expenditure in control cost.

The optimal control for the multi-agent system is the set of vector fields u*_i that minimize the total cost averaged over all the possible trajectories of the agents under the controlled dynamics, ⟨Σ_i ∫ c_i dt⟩, where the average is taken over the paths described by Eq. (1). This is the usual risk-neutral formulation of the optimal control problem, which corresponds to setting α = 0 in Table I — see below for the risk-sensitive case. See Fig. 1 in the SI for a schematic representation of the control problem.

The minimization of the cost functional can be performed by the Pontryagin minimum principle [12]. This leads to the (non-linear) optimality Bellman equations, which can then be cast into a linear form by means of a Hopf–Cole transform [11] (see Section 1.1 of SI for a full derivation). The optimal control turns out to be

u*_i = 2D ∇_i log Z ,   (3)

where the desirability Z(x_1 ⋯ x_N) satisfies the linearized Bellman equation

−D Σ_i ∇_i² Z + (1/2Dγ) Σ_i h_i Z = 0 ,   (4)

where h_i(x_1 ⋯ x_N) is the contribution to the single-agent cost c_i given by time expenditure and collisions, h_i = q(x_i) + (g/2) Σ_{j≠i} δ(x_i − x_j). The desirability Z is a non-negative function of the configuration which is closely related to the optimal cost function C*(x_1 … x_N). The latter is defined as the minimum expected value of the total cost for an initial configuration x_1 … x_N, which is achieved under the optimal control u*. Explicitly, one has Z = exp[−C*/2Dγ]. It is then clear that the optimal control biases the motion of the agents towards configurations with lower expected cost. Eq. (4) has to be supplemented by appropriate boundary conditions, i.e. Z = 0 for forbidden configurations, Neumann conditions on rigid walls, and more complicated ones at the target, which involve the solutions of the control problem for any number of agents less than N (see SI, Sec. 1.1.1). Note that Eq. (4) is equivalent to the stationary Schrödinger equation of a quantum-mechanical many-body system of identical particles [13, 14]. To the best of our knowledge, an exact solution that satisfies the appropriate boundary conditions is not known for a generic N, even for simple geometries. Moreover, a numerical approach appears to be a daunting task already for three agents in a two-dimensional domain. Approximation schemes are therefore very valuable in order to proceed further.
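For a single agent (N = 1), however, the linearized Bellman equation is readily solved numerically. Below is a finite-difference sketch of the stationary version of Eq. (4) in one dimension, with the target at x = 0 (Z = 1) and Z → 0 far away; parameter values are illustrative. The resulting control u* = 2D ∂_x log Z can be checked against the constant-amplitude one-dimensional solution discussed in the SI.

```python
import numpy as np

D, gamma, q = 1.0, 1.0, 10.0
L, M = 10.0, 1000                  # domain length and number of grid points
x = np.linspace(0.0, L, M)
h = x[1] - x[0]

# Stationary Eq. (4) for N = 1 in 1D: D Z'' - (q / 2 D gamma) Z = 0,
# with Z = 1 on the target (x = 0) and Z ~ 0 far away (x = L)
A = np.zeros((M, M))
b = np.zeros(M)
A[0, 0] = 1.0; b[0] = 1.0          # target boundary condition
A[-1, -1] = 1.0; b[-1] = 0.0       # far-field boundary condition
for i in range(1, M - 1):
    A[i, i - 1] = A[i, i + 1] = D / h**2
    A[i, i] = -2.0 * D / h**2 - q / (2.0 * D * gamma)
Z = np.linalg.solve(A, b)

# Optimal control, Eq. (3): u* = 2D d/dx log Z (values near x = L are
# polluted by the artificial Z = 0 boundary, so we probe the interior)
u = 2.0 * D * np.gradient(np.log(np.maximum(Z, 1e-300)), x)

# Analytic check: Z = exp(-kappa x), kappa = sqrt(q / (2 D^2 gamma)),
# so u* = -2 D kappa = -sqrt(2 q / gamma), constant and pointing to the target
print(u[M // 2], -np.sqrt(2.0 * q / gamma))
```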

Mean-field cooperative search and the emergence of chemotaxis. Guided by the interpretation of the linear Bellman equation as a quantum many-body problem of identical particles with short-range interaction, we adopted a mean-field approximation scheme, which is often successful in capturing the large-scale features of interacting systems [15]. We remark in passing that the mean-field approach that we take here is exactly equivalent to the game-theoretical notion of cooperative mean-field game [16], which has been applied to crowd dynamics in a fast evacuation scenario [17].

Mean-field solutions are in general suboptimal, since a certain amount of information is discarded by the agents in the evaluation of the optimal action. However, if N is large and the system is diluted enough, a mean-field approximation for Eq. (4) yields an excellent approximation – it actually becomes exact in the closely related problem of a confined, repulsive Bose gas [18]. We note in passing that when the agents are all identical the best mean-field solution is the cooperative one (see SI, Sec. 2.1.1).

This approximation consists in solving Eq. (4) with the ansatz that the many-agent desirability Z can be factorized in N copies of a single function ζ,

Z(x_1 … x_N) = ζ(x_1) ⋯ ζ(x_N) .   (5)

In this ansatz, the control exerted by each agent is only determined by its own position x. Indeed, by combining Eqs. (3) and (5) it follows that

u* = 2D ∇log ζ .   (6)

The unique function ζ, which can be read as the desirability of a spatial location for a single agent immersed in a crowd, satisfies the mean-field Bellman equation

D∇²ζ − (1/2Dγ) [ q + g(N − 1)ρ ] ζ = 0 ,   (7)

where ρ is the single-agent probability density that obeys the Fokker–Planck equation

∂_t ρ + ∇·(ρu*) = D∇²ρ .   (8)

Remarkably, the set of equations (6), (7) and (8) is identical, within proportionality factors, to a limiting case of the well-known Patlak–Keller–Segel equations, which were first introduced to model microbial chemotaxis at the population level [8, 9]:

∂_t n + ∇·(χn ∇log s) = D∇²n ,
D_s ∇²s − ks − βns = 0 ,   (9)

where n is the number density of microbes, s is the concentration of chemoattractant and D_s is its molecular diffusivity. Comparing the Bellman equation Eq. (7) with the second row of Eq. (9) one sees that the desirability ζ is proportional to the chemoattractant concentration s, to which agents respond logarithmically – they sense only fold-changes in levels, in accord with the Weber–Fechner law [10]. The chemotactic coefficient is χ = 2D in this case. The chemoattractant is degraded with rate k proportional to q/(2Dγ) and consumed at rate β proportional to g/(2Dγ) per cell. We note in passing that perfect adaptation is an implicit feature of Eqs. (6) and (8), in that there is no chemokinesis – the random motility D does not depend on ζ.

Learning to search optimally: scouts, beacons and recruitment. The main results of the previous sections are that optimal cooperative search can be realized by biophysical systems in which the target emits a diffusible chemical cue in the environment, and that agents respond chemotactically to this signal and actively modify it. However, it would be useful to extend this setting to the relevant case when the location of the target is a priori unknown and the target does not spontaneously send out signals to facilitate the work of the agents. In other words, we seek a way to include the process of discovery of the target's location and the successive construction of the solution to Eq. (7). As we show below, this can be accomplished by adding a production term in the equation for concentration, which is the exact analog of the process by which information is written on memory.

Our solution to the learning problem goes as follows. Initially, the concentration is set to a constant everywhere in space. In the first part of the search process, agents wander at random and the concentration decays and is consumed. As a result, agents explore space away from their initial location. This is called the scouting process. When agents eventually reach the target, they start the production of chemoattractant on site, either releasing it themselves, e.g. in the form of a pheromone-like cue [19, 20], or inducing its production by the target, which may happen in practice by triggering specific gene expression [21] or by transforming it into attractive waste material. The net effect is that a beacon signal is emitted from the target, and it leads to the recruitment of all other agents towards it. A mathematically precise description of the process outlined above requires only the addition of two terms – relaxation and production – to the optimality equation (7):

ε∂_t ζ − D∇²ζ + (1/2Dγ) [ q + g(N − 1)ρ ] ζ = f(t) 1_target ,   (10)

where f(t) = ∫_0^t dt′ ∫_target ds·J_ρ is the cumulated number of agents which have reached the target up to time t, and J_ρ = (2D∇log ζ)ρ − D∇ρ is the spatial flux of agents. The indicator function 1_target specifies that production takes place only at the target. The relaxation term ε∂_t ζ is interpreted as a delay in writing information on memory, i.e. a learning rate. When production and diffusion balance, the optimal solution, Eq. (7), is attained. Notice that ε is the proportionality factor between Eq. (7) and the second equation in (9).
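A one-dimensional caricature of this scouting–recruitment dynamics can be written down in a few lines. The sketch below integrates Eqs. (8) and (10) explicitly on a grid; grid sizes, time step, the crude absorption bookkeeping at the target node and all parameter values are illustrative choices, not those used in the paper's simulations.

```python
import numpy as np

# Illustrative 1D integration of the coupled dynamics: Fokker-Planck Eq. (8) for
# the agent density rho and learning dynamics Eq. (10) for the concentration zeta,
# with the target at x = 0. The explicit scheme is stabilized by crude guards.
D, gamma, eps, q, gN = 1.0, 1.0, 0.1, 10.0, 100.0   # gN stands for g(N - 1)
L, M = 20.0, 400
x = np.linspace(0.0, L, M); h = x[1] - x[0]
dt, T = 5e-5, 5.0

zeta = np.ones(M)                                   # uniform initial concentration
rho = np.exp(-0.5 * ((x - 5.0) / 0.5) ** 2)
rho /= rho.sum() * h                                # normalized agent density
arrived = 0.0                                       # cumulated arrivals, f(t)

def lap(f):
    """1D Laplacian with zero-flux (Neumann) boundaries."""
    out = np.empty_like(f)
    out[1:-1] = (f[2:] - 2.0 * f[1:-1] + f[:-2]) / h**2
    out[0], out[-1] = out[1], out[-2]
    return out

for step in range(int(T / dt)):
    u = 2.0 * D * np.gradient(np.log(zeta), x)      # optimal drift, Eq. (6)
    u = np.clip(u, -50.0, 50.0)                     # guard against steep log-gradients
    J = rho * u - D * np.gradient(rho, x)           # agent flux J_rho
    rho = np.maximum(rho - dt * np.gradient(J, x), 0.0)    # Eq. (8)
    arrived += rho[0] * h                           # crude absorption at the target
    rho[0] = 0.0
    production = np.zeros(M); production[0] = arrived / h  # f(t) 1_target
    zeta += (dt / eps) * (D * lap(zeta)
                          - (q + gN * rho) * zeta / (2.0 * D * gamma)
                          + production)
    zeta = np.maximum(zeta, 1e-30)                  # keep the concentration positive

print(f"fraction arrived: {arrived:.3f}; zeta at target: {zeta[0]:.3e}")
```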

In the remainder of this section we illustrate how the optimal solution is achieved in two examples of cooperative search games. The first example features a circular target in a two-dimensional domain and can be thought of as a basic model for bacterial predation [21, 22]. In Fig. 1 we show the simulation of a large number of agents under the controlled dynamics with optimal mean-field drift, which can be computed exactly in this case (see SI, Sec. 3), and compare them with the uncontrolled dynamics. From visual inspection, the gain in the number of agents that have reached the target is apparent. More quantitatively, the time average cost per agent as a function of time (Panels c and d of Fig. 1), which is proportional to the number of agents which have not reached the target at a given time, falls off exponentially for the controlled case while it exhibits a very slow decay for uncontrolled diffusion.

FIG. 1. Optimal cooperative predation. Comparison between the uncontrolled a) and controlled dynamics b), in the non-interacting case (g = 0). The agents are initially localized in a small region of space and are required to reach the target (grey disk). They initially undergo unbiased diffusion during the scouting phase and when some reach the target, the recruiting phase begins. The chemical cue is emitted from the target and degraded at constant rate, resulting in a gradient (grey contour lines, in logarithmic scale) which elicits a drift toward the target in all other agents. In these simulations the parameters are: γ = 1, q = 10, D = 1, g = 0, ε = 0.1. c): Average cost rate for time (uncontrolled: dashed orange line; controlled: solid blue line) and for control (dash-dotted blue line). The scouting phase (S, shaded) and the recruiting (R) phase are dominated by time cost and by control cost, respectively. d): Probability density function of the time to reach the target (color code as in c). For small times the distributions are similar, while at large times controlled agents display an exponential decay against a −3/2 power law for uncontrolled ones.

The second example is crowd evacuation from a complicated domain. Agents, initially localized in the center of a maze, are required to find the exit with the minimal cost. The domain in which we performed this numerical experiment is a reproduction of the historical maze in the gardens of Villa Pisani (Stra, Italy). In this example, agents are introduced at the center of the maze at a constant injection rate. In Fig. 2 we see the emergence of the phases of scouting and recruitment, and eventually, we observe that the agents trace out the optimal path to the exit. Notice that, during the scouting phase, the density of agents propagates as a front with a speed proportional to √N (see SI, Sec. 2.1.2), so that the collective is much faster in finding the target than the single agent (which instead reaches it diffusively).

FIG. 2. Optimal crowd evacuation. Agents are injected at a constant rate at the center of the maze and have to find the exit (on the right side of the maze), as quickly as possible while minimizing collisions. The panels show a numerical simulation of Eqs. (6), (8) and (10). The desirability (= concentration, see Table I) is shown in the top panels, while the flux of agents J_ρ is displayed in the bottom panels. During scouting (left column), the population consumes the chemical, leading to an outward-driven scouting process, faster than pure diffusion. Upon reaching the target, agents lay the beacon signal and recruit those who lag behind to the target (middle column). Eventually, since agents are continuously injected in this case, a stationary state is reached where agents track the optimal path from the center to the exit (right column). The parameters are D = 1, γ = 1, ε = 10⁻¹, q = 10 and g(N − 1) = 100.

Extension to risk-sensitive control. A convenient way of incorporating the notion of risk in decision making is to introduce a parameter α which exponentially weighs the fluctuations of the cost [23, 24]. In this setting the functional to be minimized becomes F_α = α⁻¹ log⟨exp(α Σ_i ∫ c_i dt)⟩ (see SI for details). This choice ensures the invariance of the optimal control under a global offset of the costs. It is easy to verify that as α → 0 one recovers the risk-neutral case previously described. When α is positive the optimal control is such that fluctuations with cost higher than average are suppressed, and one refers to it as a risk-averse controller. Conversely, when α is negative, the optimal controller feels optimistic and is risk-seeking. In this case low-cost fluctuations are enhanced. The procedure described in the previous section, including the mean-field approximation, can be extended to the risk-sensitive setting (see Sections 1.2 and 2.2 of SI for a full derivation).
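In practice, F_α can be estimated from sampled episode costs by a log-mean-exp average. The snippet below is a small illustration using synthetic Gamma-distributed costs as a stand-in for actual search episodes; it shows F_α rising above the mean for a risk-averse α > 0 and falling below it for a risk-seeking α < 0.

```python
import numpy as np
from scipy.special import logsumexp

def risk_sensitive_cost(costs, alpha):
    """Estimate F_alpha = (1/alpha) log < exp(alpha * C) > from sampled
    total costs C of independent episodes; alpha -> 0 recovers the mean."""
    costs = np.asarray(costs, dtype=float)
    if abs(alpha) < 1e-12:
        return costs.mean()                       # risk-neutral limit
    # log-mean-exp for numerical stability at large |alpha * C|
    return (logsumexp(alpha * costs) - np.log(len(costs))) / alpha

rng = np.random.default_rng(1)
samples = rng.gamma(shape=2.0, scale=5.0, size=100_000)   # stand-in cost samples
for a in (-0.05, 0.0, 0.05):
    print(f"alpha={a:+.2f}  F_alpha={risk_sensitive_cost(samples, a):.2f}")
```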
We find that the optimal solution to the risk-sensitive cooperative search game is also described by the chemotaxis equations, and a direct comparison between Eq. (11) (with the addition of learning) and Eq. (9) yields the dictionary in Table I. The risk-sensitive optimality equations that generalize Eqs. (6) and (7) are

u* = [ 2D/(1 − 2Dαγ) ] ∇log ζ ,
D∇²ζ − [ (1 − 2Dαγ)/2Dγ ] [ q + g(N − 1)ρ ] ζ = 0 ,   (11)

where 2Dαγ < 1.

Discussion. From the standpoint of search theory, our findings provide a solid theoretical rationale for the many solution methods inspired by chemotaxis, from computational [25–27], to biological [28, 29] and physico-chemical ones [30, 31]. At the practical level, we offer explicit expressions for the optimal choice of the parameters that appear in these biomimetic approaches, allowing one to shortcut the painstaking procedure of parameter tuning. Conversely, from the viewpoint of chemotaxis, we remark that the dictionary in Table I can also be read in reverse, which allows one to solve the inverse problem of retrieving the decision-making parameters from biophysical observations. For example, bacterial chemotaxis experiments (Fig. 6 in Ref. [32]) give χ/D ≈ 12, which translates into 2Dαγ ≈ 5/6. This value is very close to the upper limit for risk aversion, suggesting that bacteria try to minimize the impact of unfavorable fluctuations – a conclusion that has also been reached by other means [33].

Acknowledgments. AC acknowledges innumerable inspiring discussions with Massimo Vergassola. We are grateful to Rami Pugatch for useful comments and suggestions.

[1] E. Bonabeau, G. Theraulaz, J.-L. Deneubourg, S. Aron, and S. Camazine, Tr. in Ecol. & Evol. 12, 188 (1997).
[2] E. Bonabeau, M. Dorigo, and G. Theraulaz, Swarm intelligence: from natural to artificial systems, 1 (Oxford, 1999).
[3] S. Garnier, J. Gautrais, and G. Theraulaz, Sw. Intel. 1, 3 (2007).
[4] L. Panait and S. Luke, Aut. Ag. and Mul.-Ag. Sys. 11, 387 (2005).
[5] C. Virágh, G. Vásárhelyi, N. Tarcai, T. Szörényi, G. Somorjai, T. Nepusz, and T. Vicsek, Bioinsp. & Biomim. 9, 025012 (2014).
[6] V. Gómez, S. Thijssen, A. Symington, S. Hailes, and H. J. Kappen, Robotics and Aut. Sys., arXiv:1502.04548 (2015).
[7] M. A. Hsieh et al., "Small and adrift with self-control: Using the environment to improve autonomy," in Robotics Research: Volume 2 (2018) pp. 387–402.
[8] C. S. Patlak, Bull. Math. Biol. 15, 311 (1953).
[9] E. F. Keller and L. A. Segel, J. Theor. Biol. 30, 225 (1971).
[10] M. Adler, P. Szekely, A. Mayo, and U. Alon, Cell Systems 4, 171 (2017).
[11] E. Todorov, Proc. Nat. Acad. Sc. 106, 11478 (2009).
[12] L. S. Pontryagin, Mathematical theory of optimal processes (CRC Press, 1987).
[13] E. H. Lieb and W. Liniger, Phys. Rev. 130, 1605 (1963).
[14] E. H. Lieb, Phys. Rev. 130, 1616 (1963).
[15] G. Parisi, Statistical field theory (Addison-Wesley, 1988).
[16] J.-M. Lasry and P.-L. Lions, Jap. J. Math. 2, 229 (2007).
[17] M. Burger, M. Di Francesco, P. A. Markowich, and M.-T. Wolfram, in Dec. & Cont. (CDC), 2013 IEEE 52nd Ann. Conf. (IEEE, 2013) pp. 3128–3133.
[18] E. H. Lieb, R. Seiringer, and J. Yngvason, in The Stability of Matter: From Atoms to Stars (Springer, 2001) pp. 685–697.
[19] T. D. Wyatt, Pheromones and animal behaviour: communication by smell and taste (Cambridge, 2003).
[20] B. Hrolenok, S. Luke, K. Sullivan, and C. Vo, in Proc. 9th Int. Conf. on Autonomous Agents and Multiagent Systems: Vol. 3 (International Foundation for Autonomous Agents and Multiagent Systems, 2010) pp. 1197–1204.
[21] D. Humphreys, A. Davidson, P. J. Hume, and V. Koronakis, Cell Host & Microbe 11, 129 (2012).
[22] J. Pérez, A. Moraleda-Muñoz, F. J. Marcos-Torres, and J. Muñoz-Dorado, Env. Microbiol. 18, 766 (2016).
[23] R. A. Howard and J. E. Matheson, Man. Sc. 18, 356 (1972).
[24] K. Dvijotham and E. Todorov, Artificial Intelligence (UAI), 1 (2011).
[25] K. M. Passino, IEEE Contr. Syst. 22, 52 (2002).
[26] S. D. Müller, J. Marchetto, S. Airaghi, and P. Koumoutsakos, IEEE Tr. Evol. Comp. 6, 16 (2002).
[27] A. Reynolds, Phys. Rev. E 81, 062901 (2010).
[28] T. Nakagaki, H. Yamada, and Á. Tóth, Nature 407, 470 (2000).
[29] T. Nakagaki, M. Iima, T. Ueda, Y. Nishiura, T. Saigusa, A. Tero, R. Kobayashi, and K. Showalter, Phys. Rev. Lett. 99, 068104 (2007).
[30] I. Lagzi, S. Soh, P. J. Wesson, K. P. Browne, and B. A. Grzybowski, J. Am. Chem. Soc. 132, 1198 (2010).
[31] C. Jin, C. Krüger, and C. C. Maass, Proc. Nat. Acad. Sc. 114, 5089 (2017).
[32] Y. V. Kalinin, L. Jiang, Y. Tu, and M. Wu, Biophys. J. 96, 2439 (2009).
[33] A. Celani and M. Vergassola, Proc. Nat. Acad. Sc. 107, 1391 (2010).
Supplemental Information

1. DERIVATION OF THE BELLMAN EQUATION FOR OPTIMAL COOPERATIVE SEARCH

Let us consider a system of N agents¹ in d dimensions following the stochastic dynamics²

dX̄^t = ū(X̄^t, t) dt + √(2D) dW̄^t ,   (1)

where W̄ is the standard vector Wiener process, representing the uncontrolled dynamics, and ū is the control. In Eq. (1) D is the diffusion coefficient, which we choose to be constant. Notice that in this case there is no ambiguity about the regularization, in that the Ito and Stratonovich conventions are equivalent. Part of the domain boundary is absorbing and we refer to it as the target. We define the cost functional as

C_0^T = Σ_{i=1}^N ∫_0^{min{T,T_i}} dt [ (γ/2) u_i(X̄^t, t)² + h_i(X̄^t) ] ,   (2)

where

h_i(X̄^t) = q(X_i^t) + (g/2) Σ_{j≠i} v(X_i^t, X_j^t) ,

q > 0, g > 0, γ > 0, and T_i is the exit time (first passage at the target) for the i-th agent. The upper extreme of integration in time indicates that an agent stops contributing to the total cost as soon as it hits the target. The quadratic form of the cost for the control is directly related to the Kullback–Leibler divergence of the controlled path measure from the uncontrolled (pure diffusion) one; therefore, the cost of the control has a natural probabilistic interpretation in terms of "distance" between path measures [1].

We first discuss the finite-horizon setup, in which the objective of the optimization is a functional of the cost accumulated over the fixed interval of time [0, T], C_0^T. In this case the problem is generally time-dependent. The finite-horizon setup is more general than the terminal-state setup discussed in the main text, which essentially corresponds to the limit T → ∞ of Eq. (2), i.e. the infinite-horizon limit in the presence of terminal states. Fig. 1 shows a schematic representation of the terminal-state problem defined in the main text, where v(x_1, x_2) = g δ(x_1 − x_2).

FIG. 1. Scheme of the cooperative search game for N = 2 agents. The agents are driven by the controls u_1 and u_2, which depend on both their positions x_1 and x_2. They pay individual costs associated with control and time expenditure, with rates γu²/2 and q (possibly position-dependent) respectively, and a pairwise cost associated with the interaction – collision – with a rate proportional to a Dirac δ-function. Optimal controls minimize the average cost (or an exponential average in the risk-sensitive case).

¹ In the context of decision making, this is the term we use referring to controlled particles. Indeed, controlled diffusion is a limit of a Markov decision process in which the term "agent" is more natural. In these Supporting Information notes we will use the words "agent" and "particle" interchangeably.
² Symbols expressed with a bar indicate N-tuples whose index corresponds to the label of the agent; e.g. x̄ ≡ {x_1 … x_N}. The superscript t indicates time.

1.1. Risk-neutral case

Here we derive the Bellman equation for the optimal control which minimizes the average of the N-particle cost function (2). In terms of the N-particle density function, p, the average cost is expressed as

F[ū] = ⟨C_0^T⟩ = ∫_0^T dt ∫ d^N x Σ_i [ (γ/2) u_i(x̄, t)² + h_i(x̄) ] p(x̄, t) .   (3)

Since p has an implicit dependence on u through the Fokker–Planck equation associated with Eq. (1), it is convenient to couch the minimization problem by including the dynamics as a constraint and seek to minimize the auxiliary functional

L[p, ū, Φ] = F + ∫_0^T dt ∫ d^N x Φ(x̄, t) [ ∂_t p(x̄, t) + Σ_i ∇_i · ( u_i(x̄, t) p(x̄, t) − D∇_i p(x̄, t) ) ] ,   (4)

where Φ(x̄, t) is a Lagrange multiplier. This is an application of the so-called Pontryagin minimum principle [2]. The condition of null variation of L with respect to Φ yields the Fokker–Planck equation for the N-particle density p. The variation of L with respect to u_i, at the saddle point,

δL/δu_i |_* = ( γ u*_i − ∇_i Φ ) p = 0 ,   (5)

gives the optimal control

u*_i(x̄, t) = γ⁻¹ ∇_i Φ(x̄, t) .   (6)

Finally, the variation with respect to p gives

δL/δp |_* = −∂_t Φ + Σ_i [ (γ/2)(u*_i)² − u*_i · ∇_i Φ − D∇_i² Φ + h_i ] = 0 ,   (7)

which, together with Eq. (6), leads to the Hamilton–Jacobi–Bellman (HJB) equation:

∂_t Φ = −(1/2γ) Σ_i (∇_i Φ)² − D Σ_i ∇_i² Φ + Σ_i h_i .   (8)

As pointed out in the main text, the function Φ(x̄, t) is the optimal value function at time t and in the state x̄, up to an additive constant: this is (minus) the expected cost-to-go under the optimal control ū* when the system is conditioned to be in state x̄ at time t, namely

Φ(x̄, t) ≡ −⟨C_t^T | X̄^t = x̄⟩_* = −∫_t^∞ dt′ ∫ d^N x′ Σ_i [ (γ/2) u*_i(x̄′, t′)² + h_i(x̄′) ] p(x̄′, t′) ,   (9)

where p satisfies the Fokker–Planck equation with control ū* and has initial condition p(x̄′, t) = δ^N(x̄′ − x̄). Indeed, it can be directly verified that the r.h.s. of Eq. (9) satisfies the saddle-point equation (7).

The HJB equation (8) is non-linear. However, it is possible to make it linear by the Hopf–Cole transformation,

Z = exp(Φ/2Dγ) ,   (10)

which turns Eq. (8) into

∂_t Z = −D Σ_i ∇_i² Z + (1/2Dγ) Σ_i h_i Z ,   (11)

and the optimal control into

u*_i = 2D ∇_i log Z .   (12)

In the main text we focused on a terminal-state optimal control problem. In such a setup, when the uncontrolled dynamics is time-homogeneous and the costs h_i do not depend explicitly on time, the HJB equation admits stationary solutions, Φ(x̄) and Z(x̄).

1.1.1. Boundary conditions

The optimal control is found from the stationary solution of Eq. (11) with appropriate boundary conditions. Here we discuss the structure of these boundary conditions at the target. We start by recalling the relationship of the desirability Z with the optimal cost-to-go function,

Z(x̄) = exp( −⟨C_0^∞ | X̄^0 = x̄⟩_* / 2Dγ ) ,

obtained from Eq. (9) and the Hopf–Cole transformation. From the definition of the cost function, once an agent reaches the target it does not contribute to the cost any longer. Let us assume that all the agents are inside the domain except one, which we choose to be agent N without loss of generality, which sits at the target: X_N^0 = x_N ∈ target. Then, since agent N does not contribute to the cost, the desirability is a function of the remaining N − 1 agents only³

Z^(N)(x_1 … x_{N−1}, x_N) |_{x_N ∈ target} = Z^(N−1)(x_1 … x_{N−1}) ;   (13)

more generally, if the agents 1 … i (up to relabeling) are not yet at the target, while the others have already reached it, one has

Z^(N)(x_1 … x_i, x_{i+1} … x_N) |_{x_{i+1} … x_N ∈ target} = Z^(i)(x_1 … x_i) .   (14)

³ For the sole purpose of illustrating the boundary conditions defining the many-particle optimal control problem, we introduce the notation Z^(n) to indicate the desirability function Z for the problem with n agents (when Z is a function of n spatial variables).

1.2. Risk-sensitive case

Here we show that the risk-sensitive Hamilton–Jacobi–Bellman equation can also be linearized. In the risk-sensitive setup, agents aim at the minimization of the average of an exponentially weighted cost [3]

F_α = (1/α) log⟨e^{αC_0^T}⟩ = (1/α) log G_α ,   (15)

where C_0^T is defined in Eq. (2). In this section we only consider α > 0, so the problem is equivalent to the minimization of the functional G_α = exp(αF_α). It is easy to generalize the derivation for α < 0.

The parameter α defines the risk sensitivity of the optimal control problem: one recognizes that in the limit α → 0, F_α → F = ⟨C_0^T⟩, and the control problem is known as risk-neutral; if α > 0 the optimal solution is the one that reduces the most the fluctuations towards high values of the cost, corresponding to a risk-averse strategy; finally, α < 0 corresponds to an optimization in which more weight is given to the values of C_0^T which are smaller than average, therefore leading to risk-seeking strategies. The limits α → ∞ and α → −∞ are referred to as min-max and min-min optimization problems, respectively.

We apply the Pontryagin principle to the minimization of the functional G_α. It is convenient to consider the Markov process {X̄^t, C_0^t}_t, where X̄^t follows the stochastic dynamics in Eq. (1), while from Eq. (2)

dC_0^t = Σ_i [ (γ/2) u_i(X̄^t, t)² + h_i(X̄^t) ] dt ≡ c(X̄^t) dt .

The Fokker–Planck equation associated to this system of equations, which describes the evolution of the probability density function p(x̄, C, t) for the process {X̄^t, C_0^t}, is

∂_t p + ∂_C [ c(x̄) p ] + Σ_i ∇_i · ( u_i p − D∇_i p ) = 0 .   (16)

The functional G_α can be then expressed as a linear functional of p:

G_α = ∫ d^N x ∫ dC p(x̄, C, T) e^{αC} ≡ ∫ d^N x G_α(x̄, T) ;   (17)

the function G_α(x̄, t) is the average of exp(αC_0^t) over all trajectories which have arrived at x̄ at time t. By multiplying Eq. (16) by exp(αC) and integrating over C, one recovers the forward Feynman–Kac equation for G_α:

∂_t G_α + Σ_i ∇_i · ( u_i G_α − D∇_i G_α ) = α c G_α .   (18)

The Pontryagin principle is then applied to the minimization of the functional F_α subject to the constraint imposed by Eq. (18); this is equivalent to performing the unconstrained minimization of the Lagrange functional

L_α = ∫ d^N x G_α(x̄, T) + ∫_0^T dt ∫ d^N x Ψ_α(x̄, t) [ ∂_t G_α + Σ_i ∇_i · ( u_i G_α − D∇_i G_α ) − α Σ_i ( (γ/2) u_i² + h_i ) G_α ] .   (19)

At the saddle point, variation with respect to the control u_i yields

δL_α/δu_i |_* = −G_α ( ∇_i Ψ_α + αγ u*_i Ψ_α ) = 0 ,   (20)

so the optimal control is

u*_i = −(1/γα) ∇_i log Ψ_α .   (21)

The variation with respect to G_α gives

δL_α/δG_α |_* = δ(t − T) − ∂_t Ψ_α − Σ_i u*_i · ∇_i Ψ_α − D Σ_i ∇_i² Ψ_α − α Σ_i ( (γ/2) u*_i² + h_i ) Ψ_α = 0 ,   (22)

which is the backward Feynman–Kac equation for the function Ψ_α, which can be interpreted (up to multiplicative constants) as

Ψ_α(x̄, t) ≡ ⟨e^{αC_t^T} | X̄^t = x̄⟩ ;   (23)

the δ-function in time sets the condition at the final time, if a finite-horizon problem is considered; in our terminal-state setup, T → ∞ and the δ-function disappears, so we omit it in the following. Note that Φ_α ≡ −α⁻¹ log Ψ_α plays the role of the value, in that u*_i = γ⁻¹ ∇_i Φ_α. Indeed, in the limit α → 0, Φ_α reduces to the (risk-neutral) value function Φ⁴. We therefore identify Φ_α with the risk-sensitive value function [4, 5]. The HJB equation for the risk-sensitive value function then is

∂_t Φ_α = −( 1/2γ − Dα ) Σ_i (∇_i Φ_α)² − D Σ_i ∇_i² Φ_α + Σ_i h_i ;   (24)

this equation has the same form as Eq. (8), where the parameter γ is replaced by γ̃ ≡ γ/(1 − 2Dαγ), i.e. 1/2γ̃ ≡ 1/2γ − Dα. In the same way as for the risk-neutral case, a linearized version of the HJB equation can be found by performing the Hopf–Cole transformation

Z_α = exp(Φ_α/2Dγ̃)   (25)

to obtain

∂_t Z_α = −D Σ_i ∇_i² Z_α + (1/2Dγ̃) Σ_i h_i Z_α .   (26)

The optimal control is hence obtained from Z_α as

u*_i = (2Dγ̃/γ) ∇_i log Z_α .   (27)

It is straightforward to see that for α = 0, the optimal control equations (26) and (27) exactly reduce to the risk-neutral ones, respectively (11) and (12). As for the boundary conditions for Eq. (26), the same considerations made for the risk-neutral case hold here: Eqs. (13) and (14) are valid also in this case, with Z being replaced by Z_α.

⁴ Ψ_α, as defined in Eq. (23), is the moment generating function for the statistics of the cost C_t^T conditioned to X̄^t = x̄. When α → 0, Φ_α = −α⁻¹ log Ψ_α → −∂Ψ_α/∂α |_{α=0} = −⟨C_t^T | X̄^t = x̄⟩ ≡ Φ.

1.3. Exact solution for the 1D non-interacting case

In this subsection we offer the exact analytic calculation for the non-interacting search in one dimension. The results also provide an approximation to the solution for the multi-dimensional case at large distances from the target.

A single agent is initially at x on the real axis and the target is at the origin; the generating function of the cost under the control u, G̃_s(x) = ⟨exp(−sC_0^T) | X^0 = x⟩, satisfies the (stationary) Feynman–Kac equation

u G̃′_s + D G̃″_s = s ( (γ/2) u² + q ) G̃_s ,   (28)

with boundary conditions G̃_s(0) = 1 and G̃_s(±∞) = 0 if s > 0, and G̃_s(±∞) = +∞ if s < 0. Assuming that the control is constant and pointing toward the origin⁵, u = −sign(x) U, with U positive, one can easily find that Eq. (28) is solved by

G̃_s = exp{ β|x| ( 1 − √(1 + Γs) ) } ,   (29)

where β = U/(2D) and Γ = 2D(γ + 2q/U²). We can indeed check that the optimal control U* obtained from the one-dimensional HJB equation is the one which minimizes F_α |_{X^0 = x} = α⁻¹ log G̃_{−α}:

∂/∂U F_α |_{X^0 = x} = ∂/∂U [ (U|x|/2Dα) ( 1 − √( 1 − 2Dα(γ + 2q/U²) ) ) ] = 0 ,   (30)

solved by

U* = √( 2q / (γ(1 − 2Dαγ)) ) .   (31)

The probability distribution of the cost C_0^T under the control U is found by applying the inverse Laplace transform to Eq. (29),

p(c | X^0 = x) = (1/2πi) ∫_{−i∞}^{+i∞} ds e^{sc} G̃_s = ( e^{β|x| − c/Γ} / β²x²Γ ) (1/2πi) ∫_{0⁺−i∞}^{0⁺+i∞} dt e^{−√t} e^{tc/(β²x²Γ)}
= ( β|x| √Γ e^{β|x|} / 2√π ) c^{−3/2} e^{ −(β²x²Γ)/(4c) − c/Γ } .   (32)

To obtain the second equality the change of variable t = (βx)²(1 + Γs) has been made, while in the last equality one makes use of Eq. [318] from Ref. [6]. The important remark is that in Eq. (32) the right tail of the probability density of the cost has an exponential cutoff with rate

1/Γ = 1 / ( 2D(γ + 2q/U²) ) < 1/2Dγ = α_max .   (33)

This result implies that, for any value of U, ⟨e^{αC_0^T}⟩ diverges when α > α_max. In particular, in the limit α → α_max, the functional ⟨e^{αC_0^T}⟩ diverges also for controls arbitrarily close to the optimal one, for which Γ = α_max⁻¹.

⁵ We know from the exact solution that the optimal control in one dimension is constant: this follows from the fact that the HJB equation D∇²Z_α = q/(2Dγ̃) Z_α with the boundary conditions specified above is solved by Z_α = exp{ −[q/(2D²γ̃)]^{1/2} |x| }, which produces u* = (2Dγ̃/γ) ∇log Z_α = −sign(x) (2qγ̃)^{1/2}/γ, whose amplitude is independent of the coordinate x.
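The optimality of Eq. (31) can also be checked by brute force. The following Monte Carlo sketch simulates first-passage episodes under a constant control u = −sign(x) U and compares the empirical cost at U* with nearby amplitudes; the discretization, sample size and parameters are illustrative, and the risk-neutral case α = 0 is used by default.

```python
import numpy as np

rng = np.random.default_rng(2)
D, gamma, q, alpha, x0 = 1.0, 1.0, 10.0, 0.0, 2.0
dt, n_paths = 1e-3, 1000

def episode_cost(U):
    """Total cost of one first-passage episode under the constant control
    u = -sign(x) U, starting from x0, with running cost gamma U^2 / 2 + q."""
    x, c = x0, 0.0
    while x > 0.0:
        x += -U * dt + np.sqrt(2.0 * D * dt) * rng.normal()
        c += (0.5 * gamma * U**2 + q) * dt
    return c

def F_alpha(U):
    costs = np.array([episode_cost(U) for _ in range(n_paths)])
    if alpha == 0.0:
        return costs.mean()                      # risk-neutral limit
    return np.log(np.mean(np.exp(alpha * costs))) / alpha

U_star = np.sqrt(2.0 * q / (gamma * (1.0 - 2.0 * D * alpha * gamma)))   # Eq. (31)
for U in (0.5 * U_star, U_star, 1.5 * U_star):
    print(f"U = {U:5.2f}   F_alpha ~ {F_alpha(U):6.2f}")
```

The estimated cost should be lowest at U = U*, with both the slower and the faster controller paying more, in line with the quadratic expansion of the next subsection.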

1.4. Robustness of the optimal solution

The analytical solution in one dimension also allows us to address the question of the robustness of the cost against perturbations of the control away from optimality. For controls U close to the optimal value U*, the risk-sensitive cost F_α can be approximated by a quadratic function,

F_α − F*_α ≃ (1/2) F″_α(U*) (U − U*)² ,

where we denote δF_α ≡ F_α − F*_α and δU ≡ U − U*. In this approximation one can calculate the maximum tolerance on the control amplitude U, given an allowed level of suboptimality. Using the results from the previous subsection one obtains

δF_α/F*_α = ( 1/2(1 − 2Dαγ) ) (δU/U*)² = (χ/4D) (δU/U*)²   (34)

(see also Fig. 2). In the risk-neutral case, a relative error in the choice of U* of 10% results in a small increase of 0.5% in the total cost incurred. Risk-averse strategies tend to be less tolerant to errors, while risk-seeking ones are more forgiving. We also remark that for α > 0, the control amplitude is bound to be larger than a minimum value U_min = √(2Dαγ) U*, below which the risk-sensitive cost diverges.
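As a quick numerical reading of Eq. (34), the relative cost increase for a 10% error in the control amplitude at a few values of the risk-sensitivity parameter (taking D = γ = 1 for illustration):

```python
# Relative cost increase of Eq. (34) for a 10% error in the control amplitude;
# D = gamma = 1 are illustrative choices
D = gamma = 1.0
for alpha in (-0.4, 0.0, 0.4):
    rel = 0.1**2 / (2.0 * (1.0 - 2.0 * D * alpha * gamma))
    print(f"alpha = {alpha:+.1f}:  dF/F* = {100.0 * rel:.2f}%")
```

This prints 0.28%, 0.50% and 2.50% respectively, making the asymmetry between risk-seeking and risk-averse tolerances explicit.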

1.5. Other forms for the cost of the control

The assumption that the cost for control is quadratic is particularly useful for two reasons. First, as we already remarked, it has a direct interpretation in terms of entropy and distance (Kullback–Leibler divergence) between ensembles of paths. Second, it has the property that the optimal control problem is linearly solvable (through the Hopf–Cole transformation the optimality equations can be cast into a linear form). Moreover, it has a physical interpretation, as the power dissipated moving in a viscous medium. Obviously, this is not the most general form for the control cost which can be physically motivated. One can use a generic function of |u|. For instance, a control cost of the form η|u| corresponds, in the low-noise limit, to the minimization of the path length to the target.

Here we derive the optimal HJB equation for a target location problem with a single agent in the risk-neutral case, where the control cost has an extra contribution which is linear in the control amplitude, η|u|. The minimization of the cost function constrained to the dynamics given by Eq. (1) translates into the unconstrained minimization of the Lagrange functional

L[u, p, φ] = ∫_0^∞ dt ∫dx [ (γ/2) u(x,t)² + η|u(x,t)| + q(x) ] p(x,t) + ∫_0^∞ dt ∫dx φ(x,t) [ ∂_t p + ∇·(u p) − D∇²p ] ,

where the first bracket collects the control and time costs. The stationarity with respect to u yields the equation for the optimal control

γ u* + η u*/|u*| = ∇φ ,   (35)

FIG. 2. Robustness of the optimal solution. The risk-sensitive cost function F_α is plotted as a function of the control amplitude U, in a risk-neutral (solid blue), risk-averse (dashed red) and risk-seeking (dashed-dotted green) situation. The parabolic approximation around the minimum is shown for the risk-neutral case (dotted blue line). A deviation of the control from the optimum by a quantity δU corresponds to an increase in the cost δF (in the parabolic approximation). These two quantities are related by Eq. (34). Curves have been obtained with the same parameters as in Figs. 2 and 4 of the main text.

and the stationarity with respect to p gives the optimality HJB equation

∂_t φ + D∇²φ + (1/2γ)(∇φ)² − (η/γ)|∇φ| = q − η²/2γ .   (36)

Through the Hopf–Cole transformation φ = 2Dγ log Z, the HJB equation becomes

∂_t Z + D∇²Z − (η/γ)|∇Z| = (1/2Dγ) ( q − η²/2γ ) Z .   (37)

The dynamics of the desirability (chemoattractant) Z acquires a ballistic contribution, such that in addition to the diffusive motion it also propagates as a front.

2. MEAN-FIELD APPROXIMATION

The Bellman equation (26) is equivalent to the stationary Schrödinger equation with zero energy for a system of N identical interacting particles. It seems impossible to solve it exactly with the boundary conditions discussed above. We therefore propose a mean-field approximation scheme, which is motivated both physically (for a large number of particles and reasonably dilute systems) and from the game-theoretical point of view (inspired by mean-field games).

2.1. Risk-neutral case

For α = 0 the mean-field approximation to this equation consists in factorizing the N-point solution Z into the product of identical functions of the individual variables:

Z(x_1 … x_N) =^MF Π_{i=1}^N ζ(x_i) ;   (38)

from this mean-field ansatz it follows that the controls are

u*_i = 2D ∇log ζ(x_i) ,   (39)

and that the N-particle density function is also factorized in single-particle distribution functions,

p(x_1 … x_N) = Π_{i=1}^N ρ(x_i)   (40)

(provided that the initial positions of the N particles are independent); the single-particle density ρ then satisfies

∂_t ρ + ∇·(u ρ) = D∇²ρ .   (41)

One can then replace the ansatz of Eqs. (39) and (40) in the cost functional F to obtain the cost-per-particle functional

F̃ = ∫dt ∫dx [ (γ/2) u(x,t)² + q(x) + ((N−1)/2) ∫dy v(x,y) ρ(y,t) ] ρ(x,t) .   (42)

The mean-field optimal control equations follow by applying the Pontryagin principle to the functional F̃ under the constraint (41), i.e. as the saddle-point equations for the Lagrange functional

L̃ = F̃ + ∫dt ∫dx φ(x,t) [ ∂_t ρ + ∇·(u ρ) − D∇²ρ ] ,   (43)

where variations have to be calculated with respect to the single-particle functions u, ρ (and φ, yielding the constraint). This leads to

δL̃/δu(x,t) |_* = ρ ( γu* − ∇φ ) = 0  ⇒  u*(x,t) = γ⁻¹ ∇φ(x,t) ,   (44)

and

δL̃/δρ(x,t) |_* = (γ/2) u*² + h_mf − ∂_t φ − u*·∇φ − D∇²φ = −∂_t φ − (1/2γ)(∇φ)² − D∇²φ + h_mf = 0 ,   (45)

where h_mf is the mean-field cost

h_mf(x, t) = q(x) + (N − 1) ∫dy v(x, y) ρ(y, t) .   (46)

When v is a contact interaction potential, v(x, y) = g δ(x − y), one has

h_mf = q + g(N − 1) ρ .   (47)

By applying the Hopf–Cole transformation φ = 2Dγ log ζ, one finally gets the HJB equation for the mean-field desirability ζ,

∂_t ζ + D∇²ζ = (1/2Dγ) h_mf ζ ,   (48)

which for the contact potential reads

∂_t ζ + D∇²ζ = (1/2Dγ) ( q + g(N − 1)ρ ) ζ .   (49)
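The stationary, risk-neutral mean-field problem can be solved self-consistently by iterating between the two equations: given ρ, solve the stationary version of Eq. (49) for ζ; given ζ, update ρ. The sketch below does this in a 1D box under the additional zero-flux idealization ρ ∝ ζ² (the steady state of Eq. (41) with vanishing flux, cf. the relation R = A e^{−κz} Z² used in Sec. 2.1.2); the geometry, boundary conditions and parameter values are illustrative assumptions, not the setting of the paper's figures.

```python
import numpy as np

# Self-consistent stationary mean-field pair: given rho, solve the stationary
# Eq. (49) for zeta; given zeta, set rho ~ zeta^2 (zero-flux steady state of
# Eq. (41)). 1D box, zeta = 1 at the target (x = 0), reflecting wall at x = L.
D, gamma, q, gN = 1.0, 1.0, 1.0, 50.0       # gN stands for g(N - 1)
L, M = 5.0, 200
x = np.linspace(0.0, L, M); h = x[1] - x[0]

rho = np.full(M, 1.0 / L)                    # initial guess: uniform density
for it in range(100):
    # Assemble and solve D zeta'' - (q + gN rho) zeta / (2 D gamma) = 0
    A = np.zeros((M, M)); b = np.zeros(M)
    A[0, 0] = 1.0; b[0] = 1.0                # target: zeta = 1
    A[-1, -1], A[-1, -2] = 1.0, -1.0         # reflecting wall: zeta' = 0
    for i in range(1, M - 1):
        A[i, i - 1] = A[i, i + 1] = D / h**2
        A[i, i] = -2.0 * D / h**2 - (q + gN * rho[i]) / (2.0 * D * gamma)
    zeta = np.linalg.solve(A, b)
    rho_new = zeta**2 / (np.sum(zeta**2) * h)   # zero-flux steady state
    if np.max(np.abs(rho_new - rho)) < 1e-10:
        break
    rho = 0.5 * rho + 0.5 * rho_new          # damped fixed-point update

print(f"stopped after {it + 1} iterations; zeta at wall: {zeta[-1]:.4f}")
```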

2.1.1. Optimality of the cooperative solution

In this section we show that the best mean-field solution is indeed the cooperative one, i.e. the one in which the individual controls are identical.

Let us assume that out of the total N agents, N_1 are of one species and N_2 = N − N_1 are of a second species, with different control. We can, for the sake of generality, introduce different collision costs depending on the species of the agents involved. Therefore, the cost rate for an agent of species α colliding with an agent of species β is g_{αβ}/2. The mean-field costs incurred by individual agents of species 1 and 2 are

C_1 = C̄_1 + (1/2) ∫dt dx [ g_{11}(N_1 − 1) ρ_1² + g_{12}(N − N_1) ρ_1 ρ_2 ] ,   (50a)
C_2 = C̄_2 + (1/2) ∫dt dx [ g_{21} N_1 ρ_1 ρ_2 + g_{22}(N − N_1 − 1) ρ_2² ] ,   (50b)

where C̄_α is the control and time cost for an agent of species α,

C̄_α = ∫dt dx ρ_α [ (γ/2) u_α² + q_α ] .   (51)

The goal of each agent of species α is to minimize the cost C_α. We shall see that if the collision costs do not depend on the species involved, i.e. g_{ij} = g, the best partition of the system is N_1 = 0 or N_1 = N.

One observes that C_1 and C_2 both have a linear dependence on N_1:

∂C_1/∂N_1 = (g/2) ∫dt dx [ ρ_1² − ρ_1 ρ_2 ] ,   (52a)
∂C_2/∂N_1 = (g/2) ∫dt dx [ ρ_1 ρ_2 − ρ_2² ] .   (52b)

If C_1 decreases with N_1 and C_2 increases with N_1 (i.e. decreases with N_2), or vice versa, then the two species should coexist. Assuming ∂C_1/∂N_1 < 0 and ∂C_2/∂N_1 > 0 would imply

0 > ∂C_1/∂N_1 − ∂C_2/∂N_1 = (g/2) ∫dt dx ( ρ_1 − ρ_2 )² ,

which is not possible. The same conclusion holds for ∂C_1/∂N_1 > 0 and ∂C_2/∂N_1 < 0. Therefore, one must have both C_1 and C_2 decreasing (or increasing) functions of N_1, which makes it more desirable to have N_1 → N (or N_1 → 0) for both species.

2.1.2. Effect of the collision cost: travelling wave solution of PKS in 1D

As remarked in the main text, the optimal control equations in the mean-field approximation are equivalent to the Patlak–Keller–Segel (PKS) equations with logarithmic response. In this section we show that in the case where q = 0 (no time-expenditure cost) it is possible to find a travelling-wave solution to the PKS equations in one dimension [7]. We shall see that the combination g(N − 1) enters the definition of the travelling-wave velocity.

In one dimension, the optimal control equations for q = 0 are

D∂_x²ζ − (1/2Dγ) g(N − 1) ρ ζ = 0 ,
∂_t ρ + 2D ∂_x ( ρ ∂_x ζ/ζ ) − D∂_x²ρ = 0 .   (53)

We impose the boundary conditions ζ(−∞) = 0, ζ(+∞) = 1, ρ(−∞) = ρ_∞ and ρ(+∞) = 0. Such boundary conditions correspond to the situation in which a constant supply of agents is provided very far on the left and the target is far on the right. The system admits a travelling-wave solution, ζ(x, t) = Z(x − ct) and ρ(x, t) = R(x − ct), with velocity c > 0. With this ansatz, Eqs. (53) become

DZ″ − ( g(N − 1)/2Dγ ) RZ = 0 ,
−cR′ + 2D ( R Z′/Z )′ − DR″ = 0 ,   (54)

where the prime indicates the derivative with respect to the single variable z = x − ct. The second equation in the system (54) can be straightforwardly integrated once and gives

R′ = ( 2Z′/Z − κ ) R + β ,

where κ = c/D. If we impose the boundary conditions R|_∞ = 0 and R′|_∞ = 0, the integration constant β vanishes. One further integration gives

log R = 2 log Z − κz + α  ⇒  R(z) = A e^{−κz} Z(z)² .

By replacing this solution into the first equation of the system (54), and by defining Z(z) = e^{κz/2} χ(z), one obtains

Dχ″ + cχ′ + (1/2D) ( c²/2 − (g(N − 1)/γ) A χ² ) χ = 0 .   (55)

Notice that R = A χ², and therefore the boundary conditions for χ follow from the ones for ρ. From the stability condition c²/2 − g(N − 1) A χ²(−∞)/γ = 0, it follows that the speed of the wave front is

c = √( 2ρ_∞ g(N − 1) / γ ) .   (56)

The number of agents therefore influences the speed at which the agent density profile propagates: more agents consume the chemoattractant more rapidly, hence giving rise to steeper gradients of its concentration ζ, which in turn yield a stronger drift.

2.2. Risk-sensitive case

We now derive the mean-field equations for the risk-sensitive case, in particular the risk-averse one, α > 0 (easily extended to the risk-seeking one, α < 0). In this case, the mean-field ansatz consists in assuming that the single-particle contributions to the total cost are independent and identically distributed. We start by writing the evolution of the joint stochastic process of particle positions and individual costs as the 2N coupled equations

dX_i^t = u_i(X_1^t, … X_N^t) dt + √(2D) dW_i^t ,
dC_i^t = [ (γ/2) u_i(X_1^t, … X_N^t, t)² + q(X_i^t) + (1/2) Σ_{j≠i} v(X_i^t, X_j^t) ] dt ≡ c_i(X_1^t, … X_N^t, t) dt .   (57)

The N-particle position-cost density p_xc(x_1, C_1 … x_N, C_N, t) associated to Eqs. (57) is also factorized,

p_xc(x_1, C_1 … x_N, C_N, t) =^MF Π_i ρ_xc(x_i, C_i, t) ,   (58)

following from the assumption that the controls are given by a unique function of the individual one-particle positions, namely

u_i(X_1^t, … X_N^t, t) =^MF u(X_i^t, t) .   (59)

It follows that the cost functional G_α also factorizes:

G_α = ∫dx_1 dC_1 … dx_N dC_N p_xc(x_1, C_1 … x_N, C_N, T) e^{α Σ_i C_i} =^MF [ ∫dx dC ρ_xc(x, C, T) e^{αC} ]^N .   (60)

The Fokker–Planck equation associated with Eqs. (57) is

∂_t p_xc + Σ_i ∂_{C_i} [ c_i(x̄, t) p_xc ] + Σ_i ∇_i · ( u_i p_xc − D∇_i p_xc ) = 0 ,   (61)

and can be marginalized to the single-particle one by integrating over all particles but one:

∂_t ρ_xc + ∇·(u ρ_xc) + ∂_C { [ (γ/2) u² + q + ((N−1)/2) ∫dx′ dC′ v(x, x′) ρ_xc(x′, C′) ] ρ_xc } − D∇²ρ_xc = 0 .   (62)

The mean-field optimal control equations are then derived (applying the Pontryagin principle) as the saddle-point equations of the functional

L̃_α = ∫dx dC ρ_xc(x, C, T) e^{αC} + ∫dx dC dt χ_α(x, C, t) { ∂_t ρ_xc(x, C, t) + ∇·( u(x, t) ρ_xc(x, C, t) ) − D∇²ρ_xc(x, C, t)
+ ∂_C [ ( (γ/2) u(x,t)² + q(x) + ((N−1)/2) ∫dx′ dC′ v(x, x′) ρ_xc(x′, C′, t) ) ρ_xc(x, C, t) ] } .   (63)

Variation with respect to the control yields

δL̃_α/δu(x, t) |_* = −∫dC ρ_xc(x, C, t) [ ∇χ_α(x, C, t) + γ u(x, t) ∂_C χ_α(x, C, t) ] = 0 ,   (64)

and with respect to ρ_xc

δL̃_α/δρ_xc(x, C, t) |_* = e^{αC} δ(t − T) − [ ∂_t χ_α + u*·∇χ_α + D∇²χ_α + ( γu*²/2 + h_mf ) ∂_C χ_α ] = 0 ,   (65)

which is the (backward) equation for the functional χ_α(x, C, t) = ⟨exp(αC_0^T) | X^t = x, C_0^t = C⟩_*, where h_mf is the mean-field cost of Eq. (46),

h_mf(x, t) = q(x) + (N − 1) ∫dx′ dC′ ρ_xc(x′, C′, t) v(x, x′) ,   (66)

in which ∫dC′ ρ_xc(x′, C′, t) = ρ(x′, t). The δ-function at the final time sets the condition χ_α|_{t=T} = e^{αC}. We notice that the function χ_α can be expressed as

χ_α(x, C, t) = ⟨e^{α(C_0^t + C_t^T)} | X^t = x, C_0^t = C⟩_* = e^{αC} ⟨e^{αC_t^T} | X^t = x⟩_* ≡ e^{αC} ψ_α(x, t) .   (67)

We therefore see that the optimal control can be written in terms of ψ_α(x, t),

u*(x, t) = (1/γ) ∇ ( −(1/α) log ψ_α(x, t) ) ,   (68)

i.e. as the gradient of the risk-sensitive (mean-field) value function φ_α = −α⁻¹ log ψ_α, which satisfies the HJB equation

∂_t φ_α + (1/2γ̃) (∇φ_α)² + D∇²φ_α = h_mf ,   (69)

where we recall from the previous section that γ̃ = γ/(1 − 2Dαγ). The mean-field desirability ζ_α = exp(φ_α/2Dγ̃) then satisfies the linear HJB equation

∂_t ζ_α + D∇²ζ_α = (1/2Dγ̃) h_mf ζ_α .   (70)

3. EXACT SOLUTION FOR THE CIRCULAR TARGET IN THE NON-INTERACTING CASE

It is possible to solve analytically the HJB equation (for the desirability, in the terminal-state setup) for the search problem of a circular target in the infinite two-dimensional space, in the absence of interactions. In this case, the mean-field ansatz is trivially exact, provided that the particles are independently distributed at the initial time. If the target has radius R and we choose the origin of the coordinate system at its center, the HJB equation in cylindrical coordinates reads

(D/r) ∂_r ( r ∂_r ζ_α ) − (q/2Dγ̃) ζ_α = 0 ,   (71)

where the desirability ζ_α depends only on the radial coordinate, and, from the previous sections, γ̃ = γ/(1 − 2Dαγ).

FIG. 3. Risk-neutral vs risk-sensitive. Two-dimensional histograms of time-expenditure and control costs, from simulations of Eq. (1) with the mean-field control in Eq. (73), for the risk-averse (α = 0.4), risk-neutral (α = 0.0) and risk-seeking (α = −0.4) situations. The costs for time expenditure and control are positively and almost linearly correlated. The optimal control for the risk-neutral problem is such that control and time-expenditure costs are very similar. Instead, the risk-averse optimal controller tends to pay more on control (reducing possible dangerous fluctuations towards high values of the cost), whereas the risk-seeking controller allows for a large time-expenditure cost while reducing the control.

Given the connection of the desirability with the expected cost-to-go function (see main text), the boundary conditions for Eq. (71) are ζ(r → ∞) = 0 and ζ(R) = 1. The solution to this problem is

ζ_α(r) = K_0( √(q/2D²γ̃) r ) / K_0( √(q/2D²γ̃) R ) ,   (72)

which yields the optimal control

u = (2Dγ̃/γ) ∇log ζ_α = −( √(2γ̃q)/γ ) [ K_1( √(q/2D²γ̃) r ) / K_0( √(q/2D²γ̃) r ) ] ê_r ,   (73)

where K_ν are the modified Bessel functions of the second kind and ê_r ≡ x/r is the outward unit vector pointing to x from the origin.
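Equations (72) and (73) are straightforward to evaluate with standard special-function libraries. A short sketch (parameter values are illustrative; scipy's kv implements K_ν):

```python
import numpy as np
from scipy.special import kv   # modified Bessel function K_nu

# Evaluation of the exact desirability (72) and radial control (73) for the
# circular target; illustrative parameters, R is the target radius
D, gamma, q, alpha, R = 1.0, 1.0, 10.0, 0.0, 1.0
gamma_t = gamma / (1.0 - 2.0 * D * alpha * gamma)        # gamma-tilde
k = np.sqrt(q / (2.0 * D**2 * gamma_t))

r = np.linspace(R, 5.0 * R, 9)
zeta = kv(0, k * r) / kv(0, k * R)                       # Eq. (72)
u_r = -(np.sqrt(2.0 * gamma_t * q) / gamma) * kv(1, k * r) / kv(0, k * r)  # Eq. (73)

for ri, zi, ui in zip(r, zeta, u_r):
    print(f"r = {ri:4.1f}   zeta = {zi:10.3e}   u_r = {ui:7.3f}")
# Far from the target K_1/K_0 -> 1, so |u| -> sqrt(2 q gamma_t)/gamma, which
# for alpha = 0 recovers the constant one-dimensional amplitude U* of Eq. (31)
```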
In Fig. 3, one can see that the risk-sensitivity parameter sets an imbalance between the control cost and the time-expenditure cost: risk-averse strategies are prone to pay considerably more for control than for time expenditure, while for risk-neutral strategies the difference is much less pronounced; risk-seeking controllers, instead, pay less for control, confident of being driven to the target by the noise.

[1] E. Todorov, Proc. Nat. Acad. Sc. 106, 11478 (2009).


[2] L. S. Pontryagin, Mathematical theory of optimal processes (CRC Press, 1987).
[3] K. Dvijotham and E. Todorov, arXiv preprint arXiv:1202.3715 (2012).
[4] R. A. Howard and J. E. Matheson, Man. Sc. 18, 356 (1972).
[5] C. J. Maddison, D. Lawson, G. Tucker, N. Heess, A. Doucet, A. Mnih, and Y. W. Teh, arXiv preprint arXiv:1703.05820
(2017).
[6] I. S. Gradshteyn and I. M. Ryzhik, Table of integrals, series, and products (Academic, 2014).
[7] Z.-A. Wang, Discrete and Continuous Dynamical Systems Series B 13, 601 (2013).
