Tree-Based Reinforcement Learning For Optimal Water Reservoir Operation

WATER RESOURCES RESEARCH, VOL. ???, XXXX, DOI:10.
1029/,
Tree-based reinforcement learning for optimal water

reservoir operation
1, 1 1 1
A. Castelletti, S. Galelli, M. Restelli, R. Soncini-Sessa
Andrea Castelletti, Dipartimento di Elettronica e Informazione, Politecnico di Milano, P.zza
Leonardo da Vinci, 32, 20133 Milano, Italy (castelle@elet.polimi.it)
1
Dipartimento di Elettronica e
Informazione, Politecnico di Milano, Milan,
Italy.
D R A F T April 9, 2010, 9:46am D R A F T

X-2 CASTELLETTI ET AL.: FITTED Q-ITERATION FOR WATER RESERVOIR OPERATION
Abstract. Although being one of the most popular and extensively stud-
ied approaches to design water reservoir operations, Stochastic Dynamic Pro-
gramming is plagued by a dual curse that makes it unsuitable to cope with
large water systems: the computational requirement grows exponentially with
the number of state variables considered (curse of dimensionality) and an
explicit model must be available to describe every system transition and the
associated rewards/costs (curse of modeling). A varieties of simplications and
approximations have been devised in the past, which, in many cases, make
the resulting operating policies inefficient and of scarce relevance in practi-
cal contexts. In this paper, a reinforcement-learning approach, called fitted
Q-iteration, is presented: it combines the principle of continuous approxima-
tion of the value functions with a process of learning off-line from experience
to design daily, cyclo-stationary operating policies. The continuous approx-
imation, performed via tree-based regression, makes it possible to mitigate
the curse of dimensionality by adopting a very coarse discretization grid with
respect to the dense grid required to design an equally performing policy via
Stochastic Dynamic Programming. The learning experience, in the form of
a data-set generated combining historical observations and model simulations,
allows to overcome the curse of modeling. Lake Como water system (Italy)
is used as study site to infer general guidelines on the appropriate setting
for the algorithm parameters and to demonstrate the advantages of the ap-
proach in terms of accuracy and computational effectiveness compared to tra-
ditional Stochastic Dynamic Programming.

CASTELLETTI ET AL.: FITTED Q-ITERATION FOR WATER RESERVOIR OPERATION X-3
1. Introduction
Despite the great progress made in the last decades, optimal operation of water reservoir
systems still remains a very active research area (see the recent review by Labadie [2004]).
The combination of multiple, conflicting water uses, non-linearities in the model and the
objectives, strong uncertainties in the inputs, and high dimensional state space make the
problem challenging and intriguing (Castelletti et al. [2008] and references therein).
Stochastic Dynamic Programming (SDP) is one of the most suitable method for de-
signing (Pareto) optimal reservoir operating policies (see, e.g., Soncini-Sessa et al. [2007]
and references therein). SDP is based on the formulation of the operating policy design
problem as a sequential decision making process. The key idea is to use value functions to
organize and structure the search for optimal policies in stochastic domains. A decision
taken now can produce not only an immediate reward, but also affect the next system
state and, through that, all the subsequent rewards. SDP is thus based on looking ahead
to future events and computing a backed-up value, which is then used to update the value
function.
The first application of Dynamic Programming to water systems management is prob-
ably owed to Hall and Buras [1961]. Since then, the method has been systematically
applied to reservoir management, particularly for hydropower production (see, among
others, Hall et al. [1968]; Heidari et al. [1971]; Trott and Yeh [1973]; Turgeon [1980]; Es-
ogbue [1989]). Beginning in the early 1980s, the interest expands to the stochastic version
of dynamic programming for multi-purpose reservoirs operation and networks of reser-
voirs (see the reviews by Yakowitz [1982]; Yeh [1985] and the contributions by Gilbert

and Shane [1982]; Read [1989]; Hooper et al. [1991]; Piccardi and Soncini-Sessa [1991];
Vasiliadis and Karamouz [1994]; Castelletti and Soncini-Sessa [2007]).
Despite being studied so extensively in the literature, SDP suffers from a dual curse
which, de facto, prevents its practical application to even reasonably complex water sys-
tems. (i) The computational complexity grows exponentially with state, decision and
disturbance dimensions (Bellman’s curse of dimensionality [Bellman, 1957]), so that SDP
cannot be used with water systems where the number of reservoirs is greater than a few
(2-3) units. (ii) An explicit model of each component of the water system is required
(curse of modeling [Bertsekas and Tsitsiklis, 1996]) to anticipate the effects of the sys-
tem transitions. Any information included into the SDP framework can only be either
a state variable described by a dynamic model or a stochastic disturbance, independent
in time, with the associated pdf. Exogenous information, such as temperature, precipita-
tion, snowpack depth, which could effectively improve reservoir operation [Tejada-Guibert
et al., 1995; Hejazi et al., 2008], cannot be explicitly considered in taking the release de-
cision, unless a dynamic model is identified for each additional information, thus adding
to the curse of dimensionality (additional state variables). Further, in large reservoir
networks, disturbances are very likely to be spatially and temporally correlated. While
including space variability in the identification of the disturbance’s pdf can be sometimes
rather complicated, it does not add to the computational complexity. Conversely, tempo-
ral correlation can be properly accounted for by using a dynamic stochastic model, which
could be a cumbersome contribution to the curse of dimensionality.
Attempts to overcome the curse of dimensionality are ubiquitous in the literature, e.g.
Dynamic Programming based on Successive Approximations [Bellman and Dreyfus, 1962],

Incremental Dynamic Programming [Larson, 1968], Differential Dynamic Programming
[Jacobson and Mayne, 1970], and problem-specific heuristics [Wong and Luenberger , 1968;
Luenberger , 1971]. However, these methods have been conceived mainly for deterministic
problems and are of scarce interest for the optimal operation of reservoirs networks, where
the uncertainty associated with the underlying hydro-meteorological processes cannot be
neglected. Alternative approaches can be classified in two main classes (see Castelletti
et al. [2008] and references therein for further details) depending on the strategy they
adopt to alleviate the dimensionality burden: methods based on the simplification of the
water system model and methods based on the restriction of the degrees of freedom of
the policy design problem.
The first includes both the decomposition of the system into smaller and tractable
subsystems (e.g., decomposition based on physical or functional structure of the system,
hierarchical multilevel decomposition) with the subsequent use of an iterative procedure
to solve the problem, and the aggregation of the sub-system, or parts thereof, into a com-
posite, computationally tractable system. For instance, Turgeon [1981] (see also Turgeon
[1980]) proposed an algorithm based on the decomposition of a N -reservoirs operation
problem into N subproblems, each considering two reservoirs: one among the actual
reservoirs plus an equivalent reservoir accounting for all the remaining downstream stor-
ages. The resulting overall computing time for the solution of the problem grows linearly
with N . A similar idea was exploited by Archibald et al. [1997], who suggested a de-
composition/aggregation technique where each subproblem includes an actual reservoir
and two equivalent reservoirs for upstream and downstream storages respectively. The
computational complexity is reduced to a quadratic function of the state dimension. Saad

and Turgeon [1988] and Saad et al. [1992] proposed a method based on Principal Com-
ponent Analysis to reduce the complexity in a five-reservoir hydropower system from a
ten to a four-state variable problem, which was then solvable by SDP. Better perfor-
mance was obtained on the same system by Saad et al. [1994], who used a disaggregation
technique based on neural networks. A major contribution to hierarchical multilevel de-
composition comes from Haimes [Haimes, 1977]. The idea behind such approach is that
different decomposition levels are separately modeled and analyzed, but some information
is transmitted from lower to higher levels in the decomposition hierarchy.
The second class of approaches to avert the curse of dimensionality is based on the
introduction of some hypotheses on the regularity of the SDP optimal value function. Since
SDP requires discretization of the feasible state and decision spaces, one way to mitigate
(but not vanquish) the dimensionality problem is to combine a coarser discretization grid
with a continuous approximation of the value function. Different classes of approximators
have been explored, including linear polynomials [Bellman et al., 1963; Tsitsiklis and Roy,
1996], cubic Hermite polynomials [Foufoula-Georgiou and Kitanidis, 1988] and splines
[Johnson et al., 1993; Philbrick and Kitanidis, 2001]. As universal function approximators,
artificial neural networks are particularly suited for this purpose, as discussed in Bertsekas
and Tsitsiklis [1996] and practically demonstrated by Castelletti et al. [2005]; Cervellera
et al. [2006]; Castelletti et al. [2007].
SDP’s curse of modeling has received less attention than its dimensionality twin. In
SDP, models are required to anticipate and evaluate the effects of any feasible decision
on the state dynamics by computing the associated reward. An alternative approach for
performing such evaluation is to rely directly on (i.e., learning from) experience. This

is the central idea of Reinforcement Learning (RL), a well-known framework for sequen-
tial decision-making (e.g., Barto and Sutton [1998]) that combines concepts from SDP,
stochastic approximation via simulation, and function approximation. The learning ex-
perience can be acquired on-line, by directly experimenting decisions on the real system
without any model, or generated off-line, either by using an external simulator or histori-
cal observations. While the first option is clearly impracticable on real reservoir networks,
off-line learning has been already experimented in the operation of water systems. Castel-
letti et al. [2001] (see also Soncini-Sessa et al. [2007]) proposed a partially model-free
version of classical Q-learning [Watkins and Dayan, 1992] to design the daily operation of
a multi-purpose regulated lake. The storage dynamics was simulated via the mass balance
equation. The catchment was described using the historical inflow sequence. Using both
the storage and previous day inflow as state variables, Q-learning outperformed SDP with
the inflow modelled as an autoregressive process of order one. Bhattacharya et al. [2003]
developed a neural version of Q-learning for controlling pumps in a large polder system
in the Netherlands. Lee and Labadie [2007] compared Q-learning algorithm with Implicit
SDP and Sampling SDP [Kelman et al., 1990] on the monthly operation of a two-reservoir
system in Korea, processing the previous month inflow information as proposed in Castel-
letti et al. [2001]. Q-learning was shown to outperform the other two approaches. RL
methods alleviate to some extent also the curse of dimensionality, as the search space over
the range of feasible release decisions is not exhaustively explored at each iteration step.
However, like SDP, they do require a discretization grid over the state space, which again
leads to an exponential explosion of the computational costs.

Lately, a new approach, called fitted Q-iteration, which combines RL concepts of off-line
learning and functional approximation of the value function, has been proposed [Ernst
et al., 2005]. Unlike traditional stochastic approximation algorithms [Bellman et al.,
1963; Bertsekas and Tsitsiklis, 1996; Tsitsiklis and Roy, 1996], which use parametric
function approximators and thus require a time consuming parameter estimation process
at each iteration step, fitted Q-iteration uses tree-based approximation [Breiman et al.,
1984]. The use of tree-based regressors offers a twofold advantage: first, a great modeling
flexibility, which is a paramount characteristic in the typical multi-objective context of
water reservoir systems with multi-dimensional states, where the value functions to be
approximated are unpredictable in shape; second, a higher computational efficiency as no
optimal parameter estimation is required for the value function approximation at each
iteration step. On the other hand, even if tree-based methods infer the model structure
directly from data, some parameters need to be specified to drive the tree construction
process, such as, for instance, the minimum number of data per leaf or the number of trees
when ensembles of trees are used. Fixing the value of such parameters can only be done
empirically and does require a fine, ad hoc analysis as any inaccuracy might ultimately
have negative effects on the policy performance. Further, while traditional Q-learning
has been provably shown to converge only when the value function updates are performed
incrementally, following the state trajectory produced by the sequence of optimal decisions
selected at each iteration step, fitted Q-iteration processes the information in a batch
mode, by simultaneously using all the learning experience in making an update of the
value function. This has been shown to speed up the convergence rate [Kalyanakrishnan
and Stone, 2007].

In this paper, the fitted Q-iteration is demonstrated on Lake Como, a multi-purpose
regulated lake in Italy. As originally proposed in Ernst et al. [2005], fitted Q-iteration
yields a stationary policy, which is perfectly suited for the artificial systems the algorithm
has been conceived for, while it is less conforming to natural resources systems dealt with
in this paper. An improved version is therefore proposed that includes non-stationary
policies, which are more effective in adapting to the natural seasonal variability. The
focus of the paper is first on studying the properties of the algorithm, with an analysis
of the results sensitivity to the tree-based method parameters. The potential advantages
of the approach are then explored and evaluated against traditional SDP, which is the
natural term of comparison.
2. Problem and Well Established Solutions
The problem of designing the optimal operation of a water reservoir can be schematized
with a feedback control framework applied to a discrete-time stochastic system, periodic
with period T equal to one year [Castelletti et al., 2008]. For each time t of the plan-
ning horizon, given the storage volume st available in the reservoir (i.e., the state), the
operating policy p returns the volume ut (i.e., the release decision) to be released over
the time interval [t, t + 1) (e.g., in the next 24 hours when a daily policy is considered).
In some cases, improved operation can be obtained by conditioning the policy on any
other meteorological information (e.g., precipitation, temperature) and/or hydrological
information (e.g., previous period inflow, soil moisture, evapotranspiration, snow pack
depth) It = |It1 , . . . , ItL |, which might be appropriate to partly anticipate the effects of the
stochastic disturbance εt+1 affecting the system.
In the notation adopted in this paper, the time subscript in the symbol of a variable

X - 10 CASTELLETTI ET AL.: FITTED Q-ITERATION FOR WATER RESERVOIR OPERATION
denotes the time instant at which such variable assumes a deterministic value, e.g. the
lake storage is measured at time t an thus is denoted with st , while the disturbance in the
interval [t, t + 1) is denoted with εt+1 since it can be deterministically known only at the
end of the interval [Piccardi and Soncini-Sessa, 1991].
2.1. Model of the Water System
The reservoir dynamics is governed by the mass conservation equation:
st+1 = st + at+1 − rt+1 (1)
where at+1 is the net inflow volume in the time interval [t, t + 1), which includes net
evaporation and other losses; and rt+1 is the release over the same period, which is a
function of the release decision ut made at time t, the storage st and the inflow at+1 , i.e.,
rt+1 = Rt (st , ut , at+1 ) (2)
The release function Rt (·) is a non-linear, periodic (with period T ) function describing
the stochastic relation between the decision ut and the actual release rt+1 [Piccardi and
Soncini-Sessa, 1991]. Indeed, between the time t at which the decision ut is taken and the
time t + 1 at which the release rt+1 it determines is completed, the inflow at+1 is affecting
the system, and the actual release rt+1 may not be equal to the decision ut , for instance
because of the activation of the spillways (for more details see Soncini-Sessa et al. [2007]).
A number of alternative options can be adopted to model the inflow at+1 , depending on
the meteorological and hydrological data available and the requirements posed by the ap-
proach used to design the operating policies. While traditional process-based (especially
spatially-distributed) models are usually too complex (high number of state variables) to
be used within a feedback control framework, statistical models provide a reasonable bal-

CASTELLETTI ET AL.: FITTED Q-ITERATION FOR WATER RESERVOIR OPERATION X - 11
ance between compactness and accuracy (e.g., Young [2006]), and are generally preferred
over the first in designing optimal reservoir operation. In the most general formulation
the inflow can be described as
at+1 = At (It , εt+1 ) (3)
where At (·) is a periodic function with period T . For example, at+1 can be modeled as a
cyclostationary, log-normal autoregressive process of order d (i.e., a log-PAR(d)):
at+1 = exp (yt+1 σt + µt ) (4a)

∑
d
yt+1 = αi,t yt−i+1 + εt+1 (4b)
i=1
where µt and σt are the periodic mean and standard deviation of the process, αi,t is the
parameter associated to the i-th autoregressive term at time t, and εt+1 is a zero-mean,
Gaussian white noise with constant variance. In this case the information vector It is
composed of the d autoregressive terms yt−i+1 (i = 1, . . . , d).
The model of the water system, composed of the catchment and the reservoir, can be
represented compactly with the following vector difference equation
xt+1 = ft (xt , ut , εt+1 ) (5)
where the state vector xt ∈ Sxt ⊂ Rnx , with nx = 1 + L, includes the state variables st
and It ; ut ∈ Ut (st ) ⊆ Sut ⊂ R, Ut (st ) being the set of the feasible decisions, which only
depends upon the storage st ; the stochastic disturbance εt+1 ∈ Sεt+1 ⊆ R is described by
its pdf ϕt (·|), which is periodic of period T as well as the function ft (·) and the set Ut (·).
When a network of N reservoirs is considered, instead of a single reservoir, the state
t | of the N storages and the P

vector is enlarged to include the vector st = |s1t , . . . , sN
information vectors Ilt from the associated P meteo/catchment systems (l = 1, . . . , P ),

where P can be equal, smaller, or greater than N and nx = N + L · P . The disturbance
vector εt+1 ∈ Sεt+1 ⊆ Rnε is composed of P disturbances εlt+1 (i.e., nε = P ) with associated
pdf ϕlt (·). Finally, the release decision vector ut ∈ Ut (st ) ⊆ Sut ⊂ Rnu , whose components
are the release decision ujt from each reservoir j (with j = 1, . . . , N and nu = N ), replaces
the scalar decision ut in equation (5).
The presence of multiple, say q, operating objectives, corresponding to different wa-
ter users and other social and environmental interests, can be formalized by defining a
periodic, with period T , step reward function gt+1 = gt (xt , ut , εt+1 ) associated to the
stochastic state transition from xt to xt+1 . According to the multi-objective nature of the
problem, this function can be obtained as a weighted sum (Weighting Method) of the q
step reward functions gti (xt , ut , εt+1 ), i = 1, . . . , q describing the whole set of users and
interests considered, i.e.
∑
q
gt (xt , ut , εt+1 ) = λi gti (xt , ut , εt+1 ) (6)
i=1
∑q
where i=1 λi = 1 with λi ≥ 0 ∀i.
An operating policy p is defined as a sequence p = {m0 (·), m1 (·), . . .} of operating rules
of the form
ut = mt (xt ) (7)
2.2. Problem Formulation
The time horizon h over which the operating policy is designed can be either finite
or infinite. Dealing with the management of natural resources, the second assumption
should be preferred over the first, which - to be effective - would require the definition
of a state dependent penalty function on the final instant of the time horizon. And this

can be critical in most of cases. Conversely, when an infinite time horizon is assumed, a
discount factor must be fixed to ensure convergence of the policy design algorithm (Total
Discount Cost (TDC) formulation).
For a given value of the weights λi , with i = 1, . . . , q, the total reward function associated
with the operating policy p over infinite time horizon can be defined as
[ h−1 ]
∑
J (p) = lim E γ t gt (xt , ut , εt+1 ) (8)
h→∞ ε1 ,...,εh
t=0
where 0 < γ < 1 and the expected value is used as criterion to filtering the stochastic
disturbances (see Orlovski et al. [1984]; Nardini et al. [1992]; Soncini-Sessa et al. [2007]
for details and alternative solutions). The optimal policy p∗ is obtained by solving the
following optimal control problem:
p∗ = arg max J (p) (9)

p
subject to the model equations.
Equation (9) can be rewritten on a finite horizon h

[ h−1 ]
∑
∗ t h
ph = arg max E γ gt (xt , ut , εt+1 ) + γ Hh (xh ) (10)
ph ε1 ,...,εh
t=0
where Hh (xh ) is a penalty function that expresses the total expected reward one would
incur in starting from xh and applying optimal release decisions over the period [h, ∞).
Since γ h vanishes for h going to infinity the solution to problem (10) is equivalent to the
limit of the following sequence of policies for the horizon h going to infinity
[ h−1 ]
∑
p∗h = arg max E γ t gt (xt , ut , εt+1 ) (11a)
ph ε1 ,...,εh
t=0

xt+1 = ft (xt , ut , εt+1 ) t = 0, . . . h − 1 (11b)
mt (xt ) = ut ∈ Ut (xt ) t = 0, . . . h − 1 (11c)
εt+1 ∼ ϕt (·|xt , ut ) t = 0, . . . h − 1 (11d)
x0 given (11e)
ph , {mt (·) ; t = 0, . . . h − 1} (11f)
By reformulating and solving the problem for some different values of λi (with i =
1, . . . , q), a finite subset of the generally infinite Pareto optimal policy set is obtained.
Since the system (equations (11b-d)) and the total reward function (11a) are periodic
of period T , the optimal policy p∗ turns out to be periodic with the same period, i.e.
p∗ = {m∗0 (·), m∗1 (·), . . . , m∗T −1 (·)}.
2.3. Stochastic Dynamic Programming
The formulation of the optimal problem (11) already includes a fundamental assumption
of Stochastic Dynamic Programming (SDP): an explicit model of the system is available,
through which the effects of any state transition can be fully anticipated (curse of mod-
eling). Precisely,
1. All the system dynamics are known and must be explicitly modeled in equation
(11b), which means that meteorological and/or hydrological information It can only be
included into the SDP formulation as state variables described by appropriate models. It
is not possible to consider exogenous deterministic inputs, whose values are known in real
time (e.g., precipitation, temperature): input to the models can only be either release
decisions or stochastic disturbances.

2. The disturbance vector is known (equation (11d)) and either the disturbances are
independent in time or any dependency upon the past at time t can be account for by the
value of the state at the same time.
3. The step reward functions are known and separable, i.e., gt (·) only depends on
variables defined for the time interval [t, t + 1).
The solution to problems (9) and (11) is computed by recursively solving the following
Bellman equation formulated according to the TDC framework:

[ ]
Qt (xt , ut ) = E gt (xt , ut , εt+1 ) + γ max Qt+1 (xt+1 , ut+1 ) ∀(xt , ut ) ∈ Sxt × Sut (12)
εt+1 ut+1
where Qt (·, ·) is the so-called Q-function or value function, i.e., the cumulative expected
reward resulting from applying the release decision ut at time t in state xt and assuming
optimal decisions (i.e., a greedy policy) in any subsequent system transition. The rela-
tionship between the Q-function Qt (·, ·) and the cost-to-go function Ht (·), as originally
introduced by Bellman [1957], is given by the following formula:
Ht (xt ) = max Qt (xt , ut ) (13)

ut
with the second being a more compact representation than the former, but requiring
an explicit model to derive the optimal release decision associated to each state value,
according to the SDP requirements above. More precisely, the solution to problems (9)
and (11) is obtained by iteratively solving equation (12) as a backward looking solution
process over the period T −1, . . . , 0 and repeating the cycle until a suitable termination test
is satisfied, say after k cycles. Then, the last T Q-functions are the optimal Q∗ -functions,
from which the optimal operating rule at any time is derived as
m∗t (xt ) = arg max Q∗t (xt , ut ) (14)

ut

To determine the right hand side of equation (12), the domains Sxt , Sut , and Sεt+1 , of
state, release decision, and disturbance must be discretized and, at each iteration step of
the resolution process, explored exhaustively. The choice of the domain discretization is
essential as it reflects on the algorithm complexity, which is combinatorial in the number
of states, release decisions, and disturbances, and in their domain discretization. Let Nxt ,
Nut , and Nεt+1 be the number of elements in the discretized state, release decision, and
disturbance sets Sxt ⊂ Rnx , Sut ⊂ Rnu , and Sεt+1 ⊆ Rnε : the recursive resolution of (12)
for kT iteration steps (where k is usually lower than ten) requires
( )
kT · Nxntx · Nuntu · Nεntε (15)
evaluations of the operator E[·] in (12). Equation (15) shows the so-called curse of di-
mensionality, i.e., an exponential growth of computational complexity with the state and
decision dimension. It follows that SDP cannot be applied to design daily operating poli-
cies for water systems with a number of reservoirs greater than a few units, say 2 or 3,
and/or when too many hydro-meteorological information variables are accounted for in
the vector It .
3. Tree-based Batch Mode Reinforcement Learning
As anticipated in the introduction, Reinforcement Learning (RL) provides a conceptual
framework for overcoming the curse of modeling, since it does not presume the knowledge
of an explicit model to describe state transitions, disturbance’s pdf and rewards. However,
it only relatively alleviates the curse of dimensionality expressed by equation (15).
The fitted Q-iteration algorithm proposed by Ernst et al. [2005], which builds on early
works on fitted value iteration [Ormoneit and Sen, 2002], combines the RL idea of learn-

ing from experience with the concept of continuous approximation of the value function
developed for large-scale dynamic programming (see for example Gordon [1995]; Tsitsiklis
and Roy [1996]). This results into an improved reduction of the computational burden.
Indeed, a continuous mapping of state-decision pair into the value function should permit
the same level of accuracy as a look-up table representation based on an extremely dense
grid, but using a definitely coarser grid for the state-decision space. Further, the learning
process is performed off-line, without the need for directly experimenting on the real sys-
tem, which is a fundamental requirement when dealing with water resources systems, as
experiments would led to unsustainable costs in terms both of time, social and economic
loss.
3.1. Fitted Q-iteration Algorithm
As the other RL algorithms, fitted Q-iteration does not require explicit modeling of
the system. The operating policy is determined by learning from experience. Strictly,
such experience is represented as a finite data-set F of four-tuples of the form <
xt , ut , xt+1 , gt+1 >, i.e.
F = {< xlt , ult , xlt+1 , gt+1

l
>, l = 1, . . . , #F}
where #F is the cardinality of F. Each four-tuple is a sample of the one-step transition
dynamics of the system. The set F is the sole information required to determine an
operating policy, regardless the way it is generated (see Section 3.2), whether the four-
tuples are obtained from one single trajectory of the system (e.g., the historical one) or
from several, independently generated, one-step or multi-step simulations of the system
dynamics. Since, except for very special cases, an optimal policy cannot be determined

from a finite set of transition samples, the policy generated by fitted Q-iteration will be
an approximation of the optimal policy p∗ that solves problem (11). Precisely, the fitted
Q-iteration yields an approximation of the optimal Q-functions of the TDC problem (9),
by iteratively extending the optimization horizon h, i.e. by iteratively solving problem
(11).
The deterministic and stationary (T = 1) case is useful to describe the algorithm.
Under these simplifying assumptions the state transition (11b) and associated reward
depend only on the state xt and decision ut . It can be shown [Ernst, 1999] that the
following sequence of Qh -functions, defined for all (xt , ut ) ∈ Sx × Su
Q0 (xt , ut ) = 0 (16a)
Qh (xt , ut ) = g(xt , ut ) + γ max Qh−1 (xt+1 , ut+1 ) ∀h > 0 (16b)

ut+1
converge, in the infinity norm, to the optimal Q-function Q∗ (·) that solves the determin-
istic and stationary equivalent to equation (12). Assuming the function Qh−1 (·) is known,
the value of Qh (·) can be computed for all the state-decision pairs (xlt , ult ), l = 1, . . . , #F,
using equation (16b) and the learning data-set F. The #F values so obtained can be
then used to get a continuous approximation Q̂h (·) of Qh (·) over the whole state-decision
set Sx × Su by applying a regression algorithm (i.e., by fitting a function approximator)
to the training set
T S = {< (xlt , ult ), Qh (xlt , ut ) >, l = 1, . . . , #F}
where the pairs (xlt , ult ) are the regressor inputs and the values Qh (xlt , ut ) the regressor
output. By substituting Qh (·) for Q̂h (·) and applying the same reasoning, the subsequent
approximations Q̂h+1 (·), Q̂h+2 (·), . . . can be determined iteratively.

In the stochastic case, the right hand side of equation (16b) is a realization of a random
variable and Qh (xt , ut ) is redefined as its expectation. However, the expectation has not to
be operationally applied when Q̂h (·) is approximated with a regression function based on
the least squares method, because this latter generates an approximation of the conditional
expectation of the output variables given the input. Its application to the training set
constructed considering stochastic transitions provides a continuous approximation of
Qh (·) over the whole state-decision set.
As originally proposed by Ernst et al. [2005], fitted Q-iteration generates a stationary policy,
i.e., just one operating rule of the form ut = m(xt ), which is the optimal policy for a stationary
system. However, natural systems are not stationary and thus a periodic policy (i.e., a sequence of
T − 1 operating rules of the form (7)) is much more indicated to adapt to the underlying seasonal
variability. A way to extend the fitted Q-iteration framework to the non-stationary case, is to
consider the time as a component of the state vector, which evolves driven by a deterministic,
autonomous transition function: t + 1 always follows from t. Accordingly, the notation of the
learning data-set F can be rewritten as F = { ¡(t,xt )l , ult , (t + 1, xt+1 )l , gt+1

l
>, l = 1, . . . , #F}In
this way, all the properties of the stationary formulation are preserved and convergence proofs
hold under the same assumptions. Therefore, the fitted Q-iteration algorithm applied to this
new set yields the determination of an approximate periodic policy, composed of a sequence of
T − 1 operating rules of form (7).
A tabular version of the fitted Q-iteration algorithm so modified is the following:
Input: a learning set F and a regression algorithm.
Initialization:
Set h = 0.

Set Q̂0 (·) = 0 over the whole state-decision space Sx × Su .
Iterations: repeat until stopping conditions are met
Set h = h + 1.
Build the training set T S = {< il , ol >, l = 1, . . . , #F}
where il = ((t, xt )l , ult ) and ol = gt+1

l
+ γ max Q̂h−1 ((t + 1, xt+1 )l , ut+1 )
ut+1
Run the regression algorithm on T S to get Q̂h (·), from which the policy p̂h is derived.
Fitted Q-iteration is said to be a batch RL algorithm, because the whole learning set F is
processed in a batch mode, in contrast to traditional RL algorithms that perform incremental
updates of the value function using the four-tuples sequentially. Iterations can be stopped when
the distance between Q̂h (·) and Q̂h−1 (·) drops below a pre-assigned threshold, even if this criterion
does not ensure convergence with some function approximators (see Section 3.3). When the
algorithm stops, whatever the stopping conditions selected, the final policy p̂ is an approximation
of the optimal policy p∗ . The policy p̂h associated to the h-th algorithm iteration is composed of
a sequence of T − 1 operating policies of the form (7), each one looking ahead over the horizon
[t, t + h). In other words, for each value of h the algorithm solves a receding h-steps horizon
problem of form (11).
3.2. The Learning Data-Set
According to the RL concept of learning off-line from experience, the most simple idea to
generate the learning data-set F is to employ a historical record of system transitions and thus
let the algorithm learn from the real experience. If the objectives selected for the problem fit the
actual operation targets, the policy derived will be very close to the historical one, with small
benefit when the system is currently managed well below its potential. One way to refine and

improve this near-historical policy is to enlarge release decisions exploration to a small set of
different values around the historical one (see Gaskett [2002]), for each past value of the state
(Figure 1a). This is, however, a risky approach: if the state-decision set has been scarcely sampled
during the historical operation (typically, in poorly controlled systems [Tsitsiklis and Roy, 1996]),
the informative content of the learning data-set can be low and the resulting operation policy
is very likely to be quite far from optimality. Further, the approach is impracticable when the
water system has never been operated before (e.g., in planning problems).
An alternative approach is to explore the behavior of the water system, via model simulation,
for different state values and under different operating policies, namely to adopt a model-based
approach. However, the modeling effort does not need to involve the whole water system, but
just the components directly controlled (i.e., the reservoir(s)) and any downstream part, which
is affected by the release decisions (e.g., the downstream users). Indeed, the upstream part
(i.e., the meteo/catchment systems) is not influenced by release decisions and thus a model is
not required to explore the processes dynamics and the disturbance realizations. This is the
idea underlying the so-called partial model-free approach [Castelletti et al., 2001]: to use the
historical time series for each inflow and, when available, for any other hydro-meteorological
information that can be usefully included among the state variables, while describing the storage
dynamics with simple mass balance equations and any downstream user with an appropriate
dynamic model (see, e.g., Galelli et al. [2010]). As far as the controlled parts are concerned, a
discretization of the corresponding state space and the decision space is required to run either
one-step or multi-step simulations of the relevant dynamics. Even when pruned from the hydro-
meteorological information components, the dense grid discretization (Figure 1b) adopted in the
SDP formulation might still led to prohibitive computational requirements. On the other hand,

it would not take any advantage of the continuous approximation of the Q-functions provided
by fitted Q-iteration. Rather, a coarse grid can exponentially reduce the computational burden
by linearly reducing Nxt and Nut in equation (15). Such a coarse grid can be either obtained
as a uniform sub-sampling of the SDP dense grid (Figure 1c) or generated with more efficient
discretization methods (Figure 1d), such as orthogonal arrays, Latin hypercube designs, and low-
discrepancy sequences (see Cervellera et al. [2006] and reference therein). Whatever the approach
adopted to build the learning data-set, this might contain redundancies, which only add to the
computational requirements with no advantages in terms of policy performance. A way to filter
the data-set is to adopt active learning techniques [Cohn et al., 1996], based on which only the
samples that mostly improve the performance of the learning algorithm (see Ernst [2005]) are
retained.
3.3. Function Approximator
In principle, fitted Q-iteration algorithm can be combined with any function approximator
based on least squares designed for regression problems. In practice, the approximator adopted
should have several desirable features [Ernst, 1999]:
Modelling flexibility. For very simple problems, involving, for example, one single reservoir op-
erated for one single purpose, the class of functions, to which the Q-functions to be approximated
belong, can be, to some extent, anticipated (e.g., for flood control it must be a function mono-
tonically increasing with the storage). However, when dealing with reservoir networks and/or
multi-objective problems, this class could be totally unpredictable in shape. As a consequence,
the function approximator must be able to adapt its structure to the problem.
Computational efficiency. The regression algorithm is run at every iteration step of the fitted
Q-iteration. It should ensure accurate approximations, without adding too much to the overall

computational requirements. Further, no human tuning of the function approximator parameters
must be ensured (fully automated approximation).
Some parametric function approximators can provide a great modeling flexibility; artificial
neural networks, for instance, are provably able to approximate any continuous, multivariate
function to any desired degree of accuracy. This modeling flexibility, however, comes to a price,
since it is often reflected in a large number of parameters requiring explicit calibration, thus
strongly affecting the computational efficiency (see Castelletti et al. [2005]) and increasing the
risk of over-parameterization. As the problem size scales up, neural networks require more and
more neurons, thus increasing the computational cost of the training phase. Non-parametric
function approximators, particularly tree-based methods, ensure modeling flexibility and, at the
same time, computationally efficiency, since no traditional parameter estimation is required in
their building process.
Tree-based methods provide non-parametric estimates based on a recursive binary partition of
the training data-set T S (tree building algorithm). At the first step, the space of inputs (root)
is partitioned into two subsets (nodes), by applying an appropriate splitting rule to T S. The
operation is iteratively repeated on the two subsets resulting from each splitting until a given
termination test is satisfied. Each subset of the final partition (leaf) is then associated with
a value of the output or a function of the input (association rule). In some methods the tree
building procedure is repeated more than once to construct an ensemble of trees (forest) and the
values estimated by the trees are aggregated, according to an aggregation rule, to produce the
final estimate.
3.3.1. Extra-Trees

Tree-based methods include KD-Tree, Classification and Regression Trees [Breiman et al.,
1984], Tree Bagging [Breiman, 1996], Totally Randomized Trees and Extremely Randomized
Trees (Extra-Trees) [Geurts et al., 2006]. These methods basically differ by the splitting rule,
the termination test they adopt, and the number of trees they grow. Extra-Trees (described
later) were demonstrated to perform better than other tree-based methods combined with the
fitted Q-iteration algorithm [Ernst et al., 2005] and are therefore adopted in this study. Par-
ticularly, they provide great scalability by adapting the trees’ structure to the training set at
each iteration, thus resulting in a better accuracy of the final policy. The drawback with these
continuous changes in the structure is that Extra-Trees do not ensure convergence of fitted Q-
iteration and so the algorithm cannot simply be stopped based on the distance between two
consecutive approximations of the Q-functions. However, contrary to many parametric function
approximators, they do not lead to divergence when the problem horizon is infinite, which is a
fundamental property in dealing with water resources systems.
The Extra-Tree building algorithm grows ensemble of M trees. Nodes are split using the
following rule: K alternative cut-directions (regressor input) are randomly selected and, for each
one, a random cut-point is chosen; a score (explained variance) is then associated to each cut-
direction and the one maximizing the score is adopted to split the node (for more details, see
Geurts et al. [2006]). The algorithm stops partitioning a node if its cardinality is smaller than
nmin (termination test) and the node is therefore a leaf. To each leaf a value is assigned, obtained
as the average of the regressor outputs ol associated to the inputs il that fall in the leaf. The
estimates produced by the M trees are finally aggregated with arithmetic average (aggregation
rule).

Three parameters are thus associated to Extra-Trees, whose values can be fixed on the basis
of empirical evaluations:
K, the number of alternative cut-directions, can be chosen in the interval [1, . . . , n], where
n is the number of regressor inputs. When K is equal to n, the choice of the cut-direction is
not randomized and the randomization acts only through the choice of the cut-point. On the
contrary, low values of K increase the randomization of the trees and weaken the dependence
of their structure on the output of the training data-set. Geurts et al. [2006] have empirically
demonstrated that, for regression problems, the optimal default value for K is n.
nmin , the minimum cardinality for splitting a node. Large values of nmin lead to small trees
(few leaves), with high bias and small variance. Conversely, low values of nmin lead to fully-
grown trees, which may over-fit the data. The optimal value of nmin depends not only on the
risk aversion to over-fitting, but also on the level of noise in the outputs of the training data-set:
the noisier are the outputs, the higher should be the optimal value of nmin .
M , the number of trees in the forest, influences the strength of the variance reduction and
the behavior of the estimation error, which is a decreasing function of M [Breiman, 2001]. The
estimation accuracy thus increases with M and the choice of its value depends on a trade-off
between the desired model accuracy and available computing power.
4. Case Study: Lake Como, Italy
Lake Como water system (Figure 2) was selected as study site to evaluate the potential of
the fitted Q-iteration algorithm. The rationale behind the choice is twofold. First, given the
relatively simple water system topology, the shape of the operating policies and the associated
values of the objectives at the extreme points of the Pareto front (the coordinate of the utopia
point) can be inferred a priori, and this can be very useful for studying the sensitivity of the

algorithm to Extra-Trees parameters and the stopping conditions. Second, the dimension of the
system makes the control problem solvable with SDP, and this is key to perform a comparative
evaluation of the algorithm. Based on this analysis, the advantages of the fitted Q-iteration over
SDP can be easily extrapolated to more complex cases, where SDP requirements would turn out
prohibitive for a comparison.
4.1. Description
Lake Como is the third biggest regulated lake in Italy with a surface area of 145 Km2 and
an active storage of 260 Mm3 . The 4500 Km2 lake’s catchment area produces an yearly average
inflow of 4.73 Gm3 /year with the typical two-peak (spring and autumn) subalpine hydrological
flow pattern. The regulation was introduced in 1946 with the double purpose of providing
flood protection on the lake shores and supplying water to the downstream users (5 irrigation
districts for a total irrigated area of 1400 Km2 and 9 run-of-river power plants with a total
installed capacity of 92 MW). The lake regulation has been extensively studied following the
work by Guariso et al. [1985, 1986] and the multifaceted nature of the conflict over its water
use analyzed from different points of view, including the combined regulation of the lake and
the alpine hydropower reservoirs and the integrated management of blue and green water for
improved agricultural production [Galelli and Soncini-Sessa, 2010].
4.2. Problem Setting
The water system (Figure 2) is composed of a catchment feeding the lake, which serves the
downstream irrigation districts and hydropower plants. The operating policy provides the release
decision ut based on the current value of the lake storage st . This latter is the only component
of the state vector xt and its dynamics is governed by the mass balance equation (1). The

release decision ut is the volume to be released in the next 24 hours from the lake dam and,
finally, according to (6), the step reward function gt (·) is a linear combination of two step costs
(negative rewards) accounting for flood damage and downstream water deficits. The learning
data-set F of four-tuples < (t, xt ), ut , (t + 1, xt+1 ), gt+1 > required by the fitted Q-iteration
algorithm was built adopting a partial model-free approach. A 80-points coarse discretization
was used for the state-decision space (Figure 6); precisely, 10 points for the storage st and 8
points for the release decision ut , the first six of which correspond to downstream water demand
values, plus two greater values. For the inflow at at time t, which plays the role of a disturbance
to the system, 15-years daily streamflow data (1965-1979) were directly used (model free). The
state transitions were performed by running a one-step simulation of equation (1) for each of the
10 points of the storage grid and each of the 8 points of the release decision grid against the 15
possible realizations of the inflow at each time step t, t = 1, . . . , T with T = 365. The resulting
number #F of 4-tuples in the learning data-set was equal to 438,160.
The cost (objective) J f associated to flood damage (to be minimized!) is formulated as the total
discounted (γ = 0.9997) number of days of flooding, computed with equation (8), whit




0 if st ≤ s̄
f
gt (st , ut , at+1 ) = (17)



1 otherwise
where s̄ is the storage corresponding to the flood threshold (1.24 m) at the lowest shore line
point (downtown Como). The cost J w associated to downstream water deficits is computed in
analogous way with the following step cost function:





0 if rt+1 ≥ wt
w
gt (st , ut , at+1 ) = (18)



w − r otherwise
t t+1

where wt is the aggregated agricultural and hydropower demand and rt+1 is the actual release
from the lake given by equation (2).
The policy is designed by solving an equivalent to problem (11) where, according to the nature
of the objectives, the operator max is substituted for min and the aggregated step cost function
gt (·) in equation (8) is computed as
gt (·) = λgtf (·) + (1 − λ)gtw (·) (19)
with 0 ≤ λ ≤ 1.
5. Analysis of Fitted Q-iteration Properties
In order to fix an appropriate setting for the fitted Q-iteration algorithm, the definition of a
suitable stopping condition and the influence of the Extra-Trees parameters (K, nmin , and M )
on the performance of the operating policies were analyzed. The policy performance is evaluated
in terms of the values of the objective Jˆh obtained by simulating the policy p̂h , computed with
the data-set F, on an 18-years validation scenario (1980-1997) for different values of h.
5.1. Stopping Condition
The definition of a suitable stopping condition is not straightforward. As anticipated, the
criterion proposed in Section 3 does not make practical sense combined with Extra-Trees. Indeed,
the randomization in the tree building algorithm refreshes the tree structure of the Q-function
approximator at each iteration, so that the distance between two consecutive approximations in
the Q̂h (·) sequence rapidly decreases, but never vanishes. By way of illustration, consider the
trajectory of Jˆh obtained for a given value of λ (Figure 3). After an initial decrease, the value of
Jˆh randomly fluctuates, with an amplitude that reaches 35% of the initial decrease. If the tree
structure were frozen at the first iteration, Jˆh would asymptotically go to a value more or less

far from optimality depending on the accuracy of the approximation. However, because of the
randomization, as h increases the value of Jˆh fluctuates. The recursive nature of the algorithm
filters the pure random fluctuation (high frequency) of the approximation and therefore Jˆh shows
smooth fluctuations. For small h, fluctuations are dominated by the performance improvement
due to the policy learning process and are not evident. When increasing h does not add any
useful information for improving the policy, oscillations become the dominant effect.
To choose the number h̄ of iterations at which to stop the algorithm, it is therefore necessary
to resort to some empirical criterion. In principle, h̄ should be the value of h for which the policy
learning process is nearly over and the improvement in performance is so low to be overtopped
by random fluctuations due to the Extra-Trees approximation. As far as the authors know, no
indicator exists to identify the iteration from which no further improvement can be achieved, but,
heuristically, the lowest value of h corresponding to a minimum of Jˆh (150 days in Figure 3) can
be a proper choice. The reversal should actually be a good signal of the prevalence of random
fluctuations over learning. The following observation supports, in the case study, this choice.
Remember that problem (9) can be rewritten as (10). Stopping the Qh (·) sequence to a finite
value h is the same as ignoring the penalty Hh (xh ). In this case, the problem solution does not
change only when Hh (·) is a constant function, since the policy (9) is insensitive to any addictive
constant. Given the meaning of Hh (·), this happens when a time instant exists, in the period
T , from which the future performance of the system is independent from its past behavior. For
water reservoir systems this happens when the total storage capacity does not allow for multi-
annual operation, as in the case of Lake Como. There, the irrigation season begins on middle
April, just after the snowmelt, and ends around the third week of September. In autumn the lake
is re-filled by floods and during springtime mainly by snowmelt. As a result, the storage at the

beginning of the new irrigation season is completely independent from the operation performed
in the previous season and, thus, for what concerns the irrigation component, Hh (·) takes a
constant value. Similar reasoning applies to floods, as they occur at the end of October and
the lake can be emptied in 15 days. Therefore, to design a receding horizon policy, as fitted Q-
iteration algorithm de facto does, 5 months (about 150 days) are enough, and the first minimum
in Figure 3 is close to this value. This observation not only supports the empirical criterion
proposed above, but also suggests another stopping criterion, to some extent, more efficient (it
does not require to compute Jˆh after each algorithm iteration): whenever the problem can be
re-framed as a receding h̄-steps horizon problem, h̄ is the natural stopping limit of the algorithm,
since the policy learning process does not improve anymore when the number of iterations is
nearly equal to h̄. Beyond this limit, performance begins to oscillate due to the randomization
in the fitted Q-iteration and therefore it is rational to stop at h̄. When the problem must be
solved over an infinite horizon, up to now no criterion is available.
How to act in order to reduce the mean approximation error due to the tree building process
and smooth the amplitude of random fluctuations, which are both affected by the Extra-Trees
parameters, has not yet been clarified and will be dealt with in the next section.
5.2. Sensitivity to Extra-Trees Parameters
5.2.1. Setting K
The parameter K fixes the number of regression inputs randomly selected at each node and
should be equal to the number of inputs (see Section 3.3), i.e., the arguments of the Q-function:
time t, storage st , and release decision ut .
5.2.2. Setting nmin

The value nmin , the minimum sample size for splitting a node, determines the number of leaves
in a tree and thus the ensemble’s trade-off between bias and variance. By way of illustration,
in Figure 4 (top panel) the experiment in Figure 3 is replicated for nmin = 15 (dashed line).
Reducing nmin decreases the bias (in the average the performance is nearer to the optimal one) and
negatively affects the variance (higher amplitude of the fluctuations). As anticipated in Section 3,
dealing with stochastic function approximation, the regression algorithm should provide the
conditional expectation of the output given the input. nmin should therefore be, at least, equal
to the number of disturbance realizations available for each state-decision pair ((t, xt )l , ult ). Since
the learning data-set F of Lake Como water system was generated using a 15-years long scenario,
the best performance is expected to be obtained for nmin ≥ 15. This is confirmed in Figure 4
(bottom panel).
5.2.3. Setting M
The larger the number M of trees in the forest, the smaller the variance and thus the higher
the smoothing effect on the fluctuations (top panel of Figure 5). The reduction in the variance
has a positive effect on the Extra-Trees estimation error, which reflects in a strong reduction in
the distance between the performance in calibration (dashed line) and validation (solid line) as
M grows from 1 to 10 (bottom panel of Figure 5). From the bottom panel it also appears that
increasing M slightly reduces the bias. Nonetheless, the computation time linearly increases with
M and a balance must be found between accuracy and time requirements. The saturation effect
might help in deciding a proper value: the improvement in the value of Jˆ150 on validation from
values greater than 30-40 is definitely negligible (solid line in the bottom panel of Figure 5) .
6. Fitted Q-iteration vs SDP

The potential of the fitted Q-iteration was analyzed via comparison with an equivalent SDP
formulation. The learning data-set F of fitted Q-iteration was generated using a partial model-
free approach (see Section 4.2). As for SDP, according to the requirement of explicit modeling all
the system components, the inflow at+1 was described as a cyclostationary (with period T =365),
log-normal, stochastic process, whose pdf is defined by the parameters µt and σt , i.e., a log-
PAR(0) model was assumed (for more details see Pianosi and Soncini-Sessa [2009]):
at+1 = eσtmodT εt+1 +µtmodT εt+1 ∼ N (0, 1) (20)
where εt+1 is a Gaussian white noise. The state-decision domain was discretized using a dense
grid of 27,048 points (Nst = 161; Nut = 168, see Figure 6), while a 9-points grid was used for the
inflow (Nεt+1 = 9). Since each class of the storage (excluding the last highest 5 classes, which
were required to set the upper boundary condition) is nearly 4.2 × 103 m3 (corresponding to
2.8 cm of level variation) and each class of release decision is 3 m3 , the grid can be reasonably
considered very close to a continuum.
Simulation analysis was used to comparatively evaluate the efficient daily operating policies
obtained with fitted Q-iteration and SDP against the historical operation. A number of poli-
cies were designed, corresponding to different values of the weights and evaluated on both the
calibration and validation scenarios (starting from the historical storage). The resulting images
of the Pareto front are shown in Figure 7. It can be seen that the operations based on fitted
Q-iteration (solid line & black circles) slightly outperform those obtained with SDP (dashed line
& white circles), while the historical operation is noticeably far from efficiency. The improvement
is more considerable where the front has the strongest curvature (the so-called knee of the front).
Around that area, the objectives are more conflicting and the optimal Q∗ -functions are strongly
non-linear: the continuous approximation of the fitted Q-iteration, though based on a very coarse

grid, is more accurate than the SDP look-up table, even if this is based on a nearly continuous
state-decision discretization grid.
By way of demonstration, the policy associated with point A in Figure 7, derived with the
fitted Q-iteration, dominates the corresponding policy A′ , obtained with SDP, for one day of
floodings per year and nearly 3.5 ×106 m3 of deficit per year. Both the policies suggest to supply
right the water demand (front flat area in panels (a) and (c) in Figure 8) over all the year for a
relatively wide range of storage values and strongly increase the release rate during the two flood
seasons. In so doing they create a time-varying flood buffer zone, whose dimension is optimally
designed as it is either learnt from the flood events and the associated effects available in the
data-set (fitted Q-iteration) or implicitly inferred from the stochastic inflow model (SDP). Such
a buffer is, however, distinctively larger with policy A (panel (a)) as a result of a more accurate
approximation of the Q-function (compare panel (b) and panel (d), where max Qt (xt , ut ) is
ut
plotted), particularly for high values of the storage. The improved ability of policy A to deal
with floods is remarkable on normal floods (Figure 9), when - thanks to its time-varying nature
- it is able to release in advance a volume sufficiently large to buffer all the flood. Clearly, the
difference between the two policies vanishes for extreme floods (Figure 10), when the lake level
rapidly rises over the upper bound of the regulation range and the dam must be completely
opened. In these cases, both policies can just delay floodings of one or two days and slightly
reduce the peak flood with respect to the historical operation. As far as the water deficit is
concerned, the relatively small improvement of policy A over policy A′ is basically due to small
differences on a limited number of droughts, where fitted Q-iteration shows some better ability
to anticipate the inflow and therefore keeps supplying the water demand for longer than SDP

does. An example is provided in Figure 11. The same example shows that both the policies
significantly outperform the historical operation, which appears to be much more risk averse.
By moving toward the left extreme of the Pareto front (point B and B′ in Figure 7), i.e. by
increasing the relative importance of irrigation over floods, SDP performs better than fitted Q-
iteration. This is basically due to the approximation error in the tree-based interpolation of the
Q-functions. Indeed, as the importance of the irrigation increases, the conflict with floodings
becomes negligible and the optimal policy simply suggests to release the water demand. As
anticipated, water demand values belong to the release decision discretization grid for both the
algorithms. However, while the release decision chosen by SDP is necessarily a grid point, and
thus a water demand value, fitted Q-iteration uses a continuous approximation, which sometimes
fails in determining the exact demand value.An interesting property of fitted Q-iteration is that
time t is among the arguments of the Q function: the continuous value function approximation
is computed also with respect to t. An implicit clusterization of the operating rules with respect
to time is automatically performed by the Extra-Trees building algorithm and thus the resulting
operating policies do not necessarily change on a daily basis. This is evident in Figure 8 (panel
(a)): during periods characterized by a reduced variance of the inflow process and a constant
downstream water demand, the operating rules show the same behavior. For example, during
the first 30 days of the year they are monotonically increasing with the storage and provide the
water demand (equal to 99 m3 /s) as release decision for most of the storage values, until large
values are reached. At this stage, all rules suddenly increase the flow to be released, with a small
difference from rule to rule. This is nothing else than an operating policy unchanged within one
month.

6.1. Computational Requirements
A comparative analysis of the computational requirements by fitted Q-iteration and SDP can
be empirically performed by inferring some general rules from the computing times registered on
the Lake Como case study.
As anticipated, the time tSDP required to design an operating policy with SDP is proportional
to the number of evaluations of the operator E[·], which is given by equation (15): by splitting
the state dimensionality (nx ) in two ns storages and nI hydro-meteorological information, the
time tSDP can be expressed as
( )
tSDP = a · kT · Nsnts · NIntI · Nuntu · Nεntε (21)
where a is a constant, machine-dependent parameter. An estimate of a (on an Intel Xeon 3.16
Ghz QuadCore with 16GB Ram) was obtained from the time tSDP required to compute a policy
for Lake Como (k = 4; T = 365; Nst = 161; ns = 1; NIt = 0; nI = 0; Nut = 168; nu = 1;
Nεt = 9; nε = 1). Notice that the estimate so obtained is more conservative than one that would
be obtained on a more complex system, since there are a number of operations performed by
the coded algorithm that are only done once, independently from the system complexity. This
makes the estimate of a (and of b and c below) on a simple system much more greater than the
equivalent on a larger system.
The computing time tQ associated to fitted Q-iteration is the combination of the time tQ1
required to build the learning data-set F and the time tQ2 for running h̄ times (h̄ = 150 as
explained in Section 5.1) the tree-based regression algorithm on the training data-set. When a
model-free approach is adopted, as in Lake Como case, tQ1 only depends upon the number of
components of the states (storages) and release decisions for which a model is identified, and not
on the number of model-free components of the states (hydro-meteorological information) and

disturbances. Assuming that the fitted Q-iteration coarse grid is obtained by reducing the dense
grid of an equally performing SDP of a factor rs and ru respectively for state and decision, tQ1 is
( )
Ns Nu
tQ1 = b · T · ( t )ns · ( t )nu · Na (22)
rs ru
where b is a constant, machine-dependent parameter, and Na is the number of disturbance
realizations (i.e., the number of years in the historical data set used for the inflow and any other
hydro-meteorological information). Time tQ2 grows linearly in the time horizon h̄, in the number
of regressors k (i.e., ns + nI + nu + 1) and trees M , and superlinearly in the number #F (i.e.,

Nst ns Nut nu
( rs
) ·( ru
) · Na · T ) of four-tuples in the data-set. Precisely
( )
Nst ns Nut nu Ns Nu
tQ2 = c· ( ) ·( ) · Na · T · log(( t )ns · ( t )nu · Na · T ) · M · (ns + nI + nu + 1) · h̄ (23)
rs ru rs ru
where c is a constant, machine-dependent parameter. The parameters b and c were obtained
from the times tQ1 and tQ2 registered on Lake Como problem (T = 365; Nst = 10; ns = 1;
NIt = 0; nI = 0; Nut = 8; nu = 1; NI = 15). Also the reduction rates (rs = 16 and ru = 21)
were derived from the Lake Como case study. Figure 12 shows the computing times (given by
the above relationships with the estimated values of a, b, and c) plotted for increasing values
of the state vector dimension for two different system configurations. The top panel refers to
a reservoir network of ns reservoirs, each one with a single outlet (i.e., nu = ns ). All the
reservoirs have similar size and are therefore modeled with the same state and release decision
discretization grid (say Nst = 50 and Nut = 20, which are reasonable values for artificial reservoir
networks). Each reservoir is assumed to be fed by its own catchment (i.e., nε = ns ) described as
a disturbance, with Nεt = 10. The bottom panel was obtained assuming a single reservoir system
(i.e., ns = nu = nε = 1) and an increasing number of additional hydro-meteorological exogenous
information. The advantage of fitted Q-iteration (solid line & black circles) over SDP (dashed

lines & white circles) is evident from both the panels. Fitted Q-iteration goes well beyond the
computational limits of SDP (i.e., ns ≃ 2) on complex networks including up to 5 reservoirs
(top panel). The improvement is even more remarkable when the operating policy depends upon
exogenous information (bottom panel), as the model-free (uncontrolled!) components comes at
nearly no computational additional time for fitted Q-iteration: while SDP requires more than 5
days for a configuration with 1 reservoir and 2 exogenous information, fitted Q-iteration only 1.5
hours.
7. Conclusions
One major technical challenge in expanding the scope of water resources management across
sectors (social, economic, environmental) is to develop new methodologies and tools to cope
with the increasing complexity of water systems. When dealing with large water resources
systems, the dual curse of dimensionality and modeling makes the adoption of Stochastic Dynamic
Programming (SDP) definitely impracticable without resorting to one of the available varieties of
simplications and approximations, which usually make the resulting operating policies inefficient
in practice. This paper provides an encouraging evidence that Reinforcement Learning (RL)
may provide an important and robust alternative to mitigate these SDP plagues. A recent RL
approach is presented, called fitted Q-iteration, which combines continuous approximation of the
value functions with an iterative, batch-mode, learning process from an off-line generated data-
set to design daily, cyclostationary operating policies. The continuous approximation makes it
possible to mitigate the curse of dimensionality by adopting a very coarse discretization grid with
respect to the dense grid required to design an equally performing policy via SDP. The use of a
learning data-set, with basically no requirements on the way this is generated, allows overcoming
the curse of modeling.

The application to Lake Como water system was used to infer general guidelines on the ap-
propriate setting for the algorithm parameters, to define an empirical stopping condition, and
to demonstrate the potential of the approach compared to traditional SDP. The policy obtained
with fitted Q-iteration on an extremely coarse state-decision discretization grid was shown to
generally outperform an equivalent SDP-derived policy computed on a very dense grid. The
dominance is particularly remarkable on flood events (Figure 9), when the time-varying nature
of both the policies, which is key to anticipate and buffer floods when no inflow predictions are
considered, is more effectively exploited by the fitted Q-iteration. Based on the application to
Lake Como, a general rule was also derived to quantify the computational advantages of the
fitted Q-iteration over SDP in designing daily operating policies for large water systems. The
current SDP’s limit of 2-3 state variables can be moved forward to 5-6 state variables when the
state variables are all reservoir storages and to many more when several storages and a large
amount of exogenous information are considered or the exogenous information is strongly tem-
porally correlated. For instance, a network of 4 reservoirs with the operating policy depending
on 12 exogenous information (e.g., the inflow to each reservoir is described by three variables:
2 autoregressive terms and the precipitation) does require less than 5 days. These bounds can
be further improved by exploiting the intrinsically parallel nature of tree ensemble which are
composed of independent trees. By using a multi-threaded implementation, the total computing
time can be reduced up to M times. This will be the subject of future research. Future com-
putational improvements also include the adoption of efficient discretization techniques, to more
effectively explore the state-decision space in generating the learning data-set [Cervellera et al.,
2006]; the use of clustering algorithms [Nguyen and Smeulders, 2004] to clean up the learning
data-set for redundancies and the exploration of policy refinement approaches [Bonarini et al.,

2007] combined with policy reconstruction procedure [Schneegaß et al., 2007], where the current
operating policy is first identified and then iteratively improved by the algorithm.
While the Extra-Trees used by the fitted Q-iteration have been shown to offer a good accu-
racy/efficiency trade-off, this comes at the price of a lack of a well-defined and consistent stopping
condition, which, in turn, might negatively affect both the accuracy (the policy obtained is not
the best one explored) and the efficiency (a better policy could have been found by stopping the
algorithm earlier). Strictly, this happens because of the tree structures refreshed at each iteration
of the fitted Q-algorithm, which is key to build an accurate approximation in the early stages
of the algorithms, but prevents the approximated Q-functions from stabilizing, even when the
improvement in the policy performance is marginal. Further investigations are required in this
direction, including the freezing of the tree structure after some iterations [Guez et al., 2008].
An important feature of the algorithm, which has been theoretically investigated in the pa-
per, but surely deserves further studies, is the great flexibility it offers in dealing with operating
policies conditioned on any kind of exogenous information, even not necessarely expressed in a
quantitative form. Subsequent researches will include the evaluation of the potential improve-
ment in the operation perfomances both by the use of traditional inflow predictions and by the
direct employment of the information which could be useful in formulating inflow predictions
(e.g., precipitation, temperature, snow pack depth). The batch nature of the fitted Q-iteration
has an other important implication on the applicability of optimal control to a wide range of water
related applications, especially to water quality management, integration of quality and quantity
(e.g., selective withdrawals systems), groundwater management and surface-groundwater inter-
action, and, more generally, to all those applications involving spatially distributed states and
rather intricate process-based models, which constitute a major barrier to the use of traditional

control approaches. These also include process-based modeling of rainfall-runoff as required to
generate climate change scenarios and investigate adaptive management strategies. Particularly,
the combined use of fitted Q-iteration and model reduction techniques [Castelletti et al., 2009]
is worth be explored for these purposes.
Acknowledgments. The work was completed while Andrea Castelletti, Stefano Galelli and
Rodolfo Soncini-Sessa were on leave at the Centre for Water Research, University of Western
Australia. This paper forms CWR reference 2329 AC.
References
Archibald, T., K. McKinnon, and L. Thomas (1997), An aggregate stochastic dynamic program-
ming model of multireservoir systems, Water Resources Research, 33 (2), 333–340.
Aufiero, A., R. Soncini-Sessa, and E. Weber (2001), Set-valued control laws in minmax control
problem, in Proceedings of IFAC Workshop Modelling and Control in Environmental Issues,
August 22-23, Yokohama, J.
Barto, A., and R. Sutton (1998), Reinforcement Learning: an introduction, MIT Press, Boston,
MA.
Bellman, R. (1957), Dynamic Programming, Princeton University Press, Princeton, NJ.
Bellman, R., and S. Dreyfus (1962), Applied Dynamic Programming, Princeton University Press,
Princeton, NJ.
Bellman, R., R. Kabala, and B. Kotkin (1963), Polynomial approximation - a new computational
technique in dynamic programming, Mathematics of Computation, 17 (8), 155–161.
Bertsekas, D., and J. Tsitsiklis (1996), Neuro-Dynamic Programming, Athena Scientific, Boston,
MA.

Bhattacharya, A., A. Lobbrecht, and D. Solomatine (2003), Neural networks and reinforcement
learning in control of water systems, Journal of Water Resources Planning and Management-
ASCE, 129 (6), 458–465.
Bonarini, A., A. Lazaric, and M. Restelli (2007), Piecewise constant reinforcement learning for
robotic applications, in Proceedings of the 4th International Conference on Informatics in
Control, Automation and Robotics (ICINCO 2007).
Breiman, L. (1996), Bagging predictors, Machine Learning, 24 (2), 123–140.
Breiman, L. (2001), Random forests, Machine Learning, 45 (1), 5–32.
Breiman, L., J. Friedman, R. Olsen, and C. Stone (1984), Classification and regression trees.,
Wadsworth & Brooks, Pacific Grove, CA.
Castelletti, A., and R. Soncini-Sessa (2007), Coupling real-time control and socio-economic issues
in participatory river basin planning, Environmental Modelling & Software, 22 (8), 1114–1128.
Castelletti, A., G. Corani, A. Rizzoli, R. Soncini-Sessa, and E. Weber (2001), A reinforcement
learning approach for the operational management of a water system, in Proceedings of IFAC
Workshop Modelling and Control in Environmental Issues, August 22-23, Yokohama, J.
Castelletti, A., D. de Rigo, R. Soncini-Sessa, A. Rizzoli, and E. Weber (2005), An improved tech-
nique for neuro-dynamic programming applied to the efficient and integrated water resources
management, in Proceedings of 16th IFAC World Congress, July 4-8, Prague, CZ.
Castelletti, A., D. de Rigo, A. Rizzoli, R. Soncini-Sessa, and E. Weber (2007), Neuro-dynamic
programming for designing water reservoir network management policies, Control Engineering
Practice, 15 (8), 1001–1011.
Castelletti, A., F. Pianosi, and R. Soncini-Sessa (2008), Water reservoir control under economic,
social and environmental constraints, Automatica, 44 (6), 1595–1607.

Castelletti, A., M. De Zaiacomo, S. Galelli, M. Restelli, P. Sanavia, R. Soncini-Sessa, and
J. Antenucci (2009), An emulation modelling approach to reduce the complexity of a 3D
hydrodynamic-ecological model of a reservoir, in Proceedings of International Symposium on
Environmental Software Systems (ISESS2009), October 2-9, Venice, I.
Cervellera, C., V. Chen, and A. Wen (2006), Optimization of a large-scale water reservoir network
by stochastic dynamic programming with efficient state space discretization, European Journal
of Operational Research, 171 (3), 1139–1151.
Cohn, D., Z. Ghahramani, and M. Jordan (1996), Active learning with statistical models, Journal
of artificial intelligence research, 4, 129–145.
Ernst, D. (1999), Near optimal closed-loop control. application to electric power systems, Ph.D.
thesis, University of Liege, Belgium.
Ernst, D. (2005), Selecting concise sets of samples for a reinforcement learning agent, in Pro-
ceedings of the 3rd International Conference on Computational Intelligence, Robotics and Au-
tonomous Systems (CIRAS 2005), December 10-14, Singapore.
Ernst, D., P. Geurts, and L. Wehenkel (2005), Tree-based batch mode reinforcement learning,
Journal of Machine Learning Research, 6, 503–556.
Esogbue, A. (1989), Dynamic Programming for Optimal Water Resources Systems Analysis,
chap. Dynamic programming and water resources: Origins and interconnections, Prentice-Hall,
Englewood Cliffs, NJ.
Foufoula-Georgiou, E., and P. Kitanidis (1988), Gradient dynamic programming for stochastic
optimal control of multidimensional water resources systems, Water Resources Research, 24,
1345–1359.

Galelli, S., and R. Soncini-Sessa (2010), Combining metamodelling and stochastic dynamic pro-
gramming for the design of reservoirs release policies, Environmental Modelling & Software,
25 (2), 209–222.
Galelli, S., C. Gandolfi, R. Soncini-Sessa, and D. Agostani (2010), Building a metamodel of
an irrigation district distributed-parameter model, Agricultural Water Management, 97 (2),
187–200.
Gaskett, C. (2002), Q-learning for robot control, Ph.D. thesis, Australian National University,
Canberra, AUS.
Geurts, P., D. Ernst, and L. Wehenkel (2006), Extremely randomized trees, Machine Learning,
63 (1), 3–42.
Gilbert, K., and R. Shane (1982), TVA hydroscheduling model: theoretical aspects, Journal of
Water Research Planning and Management - ASCE, 108 (1), 21–36.
Gordon, G. (1995), Online tted reinforcement learning, in Proceedings of the Workshop on Value
Function Approximation at the 12th International Conference on Machine Learning, July 9,
Tahoe City, CA.
Guariso, G., S. Rinaldi, and R. Soncini-Sessa (1985), Decision support systems for water manage-
ment - the Lake Como case study, European Journal of Operational Research, 21 (3), 295–306.
Guariso, G., S. Rinaldi, and R. Soncini-Sessa (1986), The management of Lake Como - a multi-
objective analysis, Water Resources Research, 22 (2), 109–120.
Guez, A., R. Vincent, M. Avoli, and J. Pineau (2008), Adaptive treatment of epilepsy via batch-
mode reinforcement learning, in Proceedings of the 23rd AAAI Conference on Artificial Intel-
ligence, July 13-17, pp. 1671–1678, Chicago, IL.

Haimes, Y. (1977), Hierarchical Analyses of Water Resources Systems, McGraw-Hill, New York,
NY.
Hall, W., and N. Buras (1961), The dynamic programming approach to water resources devel-
opment, Journal of Geophysical Research, 66 (2), 510–520.
Hall, W., W. Butcher, and A. Esogbue (1968), Optimization of the operation of a multi-purpose
reservoir by dynamic programming, Water Resources Research, 4 (3), 471–477.
Heidari, M., V. Chow, P. Kokotovic, and D. Meredith (1971), Discrete differential dynamic
programming approach to water resources systems optimisation, Water Resources Research,
7 (2), 273–282.
Hejazi, M., X. Cai, and B. Ruddell (2008), The role of hydrologic information in reservoir oper-
ation - learning from historical releases, Advances in Water Resources, 31 (12), 1636–1650.
Hooper, E., A. Georgakakos, and D. Lettenmaier (1991), Optimal stochastic operation of Salt
River Project, Arizona, Journal of Water Research Planning and Management - ASCE, 117 (5),
556–587.
Jacobson, H., and Q. Mayne (1970), Differential dynamic programming, American Elsevier, New
York, NY.
Johnson, S., J. Stedinger, C. Shoemaker, Y. Li, and J. Tejada-Guibert (1993), Numerical solu-
tion of continuous-state dynamic programs using linear and spline interpolation, Operations
Research, 41, 484–500.
Kalyanakrishnan, S., and P. Stone (2007), Batch reinforcement learning in a complex domain,
in The Sixth International Joint Conference on Autonomous Agents and Multiagent Systems,
May.

Kelman, J., J. Stedinger, L. Cooper, E. Hsu, and S. Yuan (1990), Sampling Stochastic Dynamic
Programming applied to reservoir operation, Water Resources Research, 26 (3), 447–454.
Labadie, J. (2004), Optimal operation of multireservoir systems: State-of-the-art review, Journal
of Water Research Planning and Management - ASCE, 130 (2), 93–111.
Larson, R. (1968), State Incremental Dynamic Programming, American Elsevier, New York, NY.
Lee, J.-H., and J. W. Labadie (2007), Stochastic optimization of multireservoir systems via
reinforcement learning, Water Resources Research, 43 (11), 1–16.
Luenberger, D. (1971), Cyclic dynamic programming: a procedure for problems with fixed delay,
Operations Research, 19 (4), 1101–1110.
Nardini, A., C. Piccardi, and R. Soncini-Sessa (1992), On the integration of risk-aversion and
average-performance optimization in reservoir control, Water Resources Research, 28 (2), 487–
497.
Nguyen, H., and A. Smeulders (2004), Active learning using pre-clustering, in Proceedings of the
21st International Conference on Machine Learning, ACM New York, NY.
Orlovski, S., S. Rinaldi, and R. Soncini-Sessa (1984), A min-max approach to reservoir manage-
ment, Water Resources Research, 20 (11), 1506–1514.
Ormoneit, D., and S. Sen (2002), Kernel-based reinforcement learning, Machine Learning, 49 (2-
3), 161–178.
Philbrick, C., and P. Kitanidis (2001), Improved dynamic programming methods for optimal
control of lumped-parameter stochastic systems, Operations Research, 49, 398–412.
Pianosi, F., and R. Soncini-Sessa (2009), Real-time management of a multipurpose water
reservoir with a heteroscedastic inflow model, Water Resources Research, 45 (10), W10430,
doi:10.1029/2008WR007335.

Piccardi, C., and R. Soncini-Sessa (1991), Stochastic dynamic programming for reservoir opti-
mal control: dense discretization and inflow correlation assumption made possible by parallel
computing, Water Resources Research, 27 (5), 729–741.
Read, E. (1989), Dynamic Programming for Optimal Water Resources Systems Analysis, chap. A
dual approach to stochastic dynamic programming for reservoir release scheduling, pp. 361–372,
Prentice-Hall, Englewood Cliffs.
Saad, M., and A. Turgeon (1988), Application of principal component analysis to long-term
reservoir management, Water Resources Research, 24 (7), 907–912.
Saad, M., A. Turgeon, , and J. Stedinger (1992), Censored-data correlation and principal com-
ponent dynamic programming, Water Resources Research, 28 (8), 2135–2140.
Saad, M., A. Turgeon, P. Bigras, and R. Duquette (1994), Learning disaggregation technique for
the operation of long-term hydroelectric power systems, Water Resources Research, 30 (11),
3195–3203.
Schneegaß, D., S. Udluft, and T. Martinetz (2007), Improving optimality of neural rewards
regression for data-efficient batch near-optimal policy identification, Lecture Notes in Computer
Science, 4668, 109.
Soncini-Sessa, R., A. Castelletti, and E. Weber (2007), Integrated and participatory water re-
sources management. Theory, Elsevier, Amsterdam, NL.
Tejada-Guibert, J., S. Johnson, and J. Stedinger (1995), The value of hydrologic information in
stochastic dynamic programming models of a multireservoir system, Water Resources Research,
31 (10), 2571–2579.
Trott, W., and W. Yeh (1973), Optimization of multiple reservoir systems, Journal of the Hy-
draulic Division ASCE, 99, 1865–1884.

X !!-
!! Please write \lefthead{<AUTHOR NAME(s)>} in file !!: !! Please write \righthead{<(Shortened) Article Title>} in file 47
Figure 1. Alternative approaches to explore the state-decision set: refinement of the historical
policy (a); dense grid discretization (b); coarse grid discretization (c); efficient discretization (d).
Figure 2. The Lake Como water system.
Tsitsiklis, J., and B. V. Roy (1996), Feature methods for large scale dynamic programming,
Machine Learning, 22, 59–94.
Turgeon, A. (1980), Optimal operation of multi-reservoir power systems with stochastic inflows,
Water Resources Research, 16 (2), 275–283.
Turgeon, A. (1981), A decomposition method for the long-term scheduling of reservoirs in series,
Water Resources Research, 17 (6), 1565–1570.
Vasiliadis, H., and M. Karamouz (1994), Demand-driven operation of reservoirs using
uncertainty-based optimal operating policies, Journal of Water Research Planning and Man-
agement - ASCE, 120 (1), 101–114.
Watkins, C., and P. Dayan (1992), Q-learning, Machine Learning, 8 (3-4), 279–292.
Wong, P., and D. Luenberger (1968), Reducing the memory requirements of dynamic program-
ming, Operations Research, 16 (6), 1115–1125.
Yakowitz, S. (1982), Dynamic programming applications in water resources, Water Resources
Research, 18 (4), 673–696.
Yeh, W. (1985), Reservoir management and operations models: a state of the art review, Water
Resources Research, 21 (12), 1797–1818.
Young, P. C. (2006), The data-based mechanistic approach to the modelling, forecasting and
control of environmental systems, Annual Reviews in Control, 30 (2), 169–182.

X -!!48Please write \lefthead{<AUTHOR NAME(s)>} in file !!: !! Please write \righthead{<(Shortened) Article Title>} in file !!
Figure 3. Trajectory of Jˆh obtained with λ = 0.5, K = 3, nmin = 50 and M = 50 on the
validation scenario (top panel ). Trajectories of Jˆh obtained with the same setting and λ = 1 (i.e.,
Jˆh = Jˆhw ) running twice the fitted Q-iteration (bottom panel ).
Figure 4. Trajectories of Jˆh obtained with nmin = 50 (solid line) and nmin = 15 (dashed line)
(top panel ). The value of Jˆ150 for different nmin (bottom panel ). In both panels the validation
scenario was considered and λ = 0.5, K = 3 and M = 50.
Figure 5. Trajectories of Jˆh obtained with M = 50 (black circles) and M = 15 (white
circles) (top panel ). The value of Jˆ150 for different values of M on the validation (solid line) and
calibration (dashed line) scenario (bottom panel ). In both panels λ = 0.5, K = 3 and nmin = 50.
Figure 6. SDP dense discretization grid (27,048 points) and fitted Q-iteration coarse dis-
cretization grid (80 points).
Figure 7. Images of the Pareto fronts obtained with fitted Q-iteration (solid line & black
circles) and SDP (dashed line & white circles) via simulation on the calibration (top panel ) and
validation (bottom panel ) scenarios.
Figure 8. The operating policy corresponding to point A (obtained with fitted Q-iteration) in
Figure 7 (panel (a)) and the associated cost-to-go function derived with equation (13) from the
Q-function (panel (b)). In the bottom panel the same plots for point A′ (obtained with SDP).
Figure 9. Fitted Q-iteration (solid line and black circles), SDP (dashed line and white circles)
and historical operation (solid line) in a flood event in 1981 (validation scenario). In the top
panel the daily release and the inflow (gray line); in the bottom panel the lake level and flood
threshold.

X !!-
!! Please write \lefthead{<AUTHOR NAME(s)>} in file !!: !! Please write \righthead{<(Shortened) Article Title>} in file 49
and historical operation (solid line) in the biggest flood event of the century (validation scenario).
In the top panel the daily release and the inflow (gray line); in the bottom panel the lake level
and flood threshold.
and historical operation (solid line) in a drought event in 1982 (validation scenario). In the top
panel the daily release, the inflow (gray line), and the water demand (dashed line); in the bottom
panel the lake storage.
Figure 12. Comparison of the computational requirements of SDP (dashed lines & white
circles) and fitted Q-iteration (solid line & black circles) for increasing number of state variables,
obtained on an Intel Xeon 3.16 Ghz QuadCore with 16GB Ram machine for two different water
system configurations as explained in the text.

Tree-Based Reinforcement Learning For Optimal Water Reservoir Operation

Hochgeladen von

Dokumentinformationen

Originalbeschreibung:

Originaltitel

Copyright

Verfügbare Formate

Dieses Dokument teilen

Dokument teilen oder einbetten

Freigabeoptionen

Stufen Sie dieses Dokument als nützlich ein?

Sind diese Inhalte unangemessen?

Copyright:

Verfügbare Formate

Tree-Based Reinforcement Learning For Optimal Water Reservoir Operation

Hochgeladen von

Copyright:

Verfügbare Formate

WATER RESOURCES RESEARCH, VOL. ???, XXXX, DOI:10.

Tree-based reinforcement learning for optimal water

Andrea Castelletti, Dipartimento di Elettronica e Informazione, Politecnico di Milano, P.zza

Leonardo da Vinci, 32, 20133 Milano, Italy (castelle@elet.polimi.it)

Informazione, Politecnico di Milano, Milan,

D R A F T April 9, 2010, 9:46am D R A F T

ied approaches to design water reservoir operations, Stochastic Dynamic Pro-

gramming is plagued by a dual curse that makes it unsuitable to cope with

large water systems: the computational requirement grows exponentially with

the number of state variables considered (curse of dimensionality) and an

associated rewards/costs (curse of modeling). A varieties of simplications and

the resulting operating policies ineﬃcient and of scarce relevance in practi-

cal contexts. In this paper, a reinforcement-learning approach, called ﬁtted

Q-iteration, is presented: it combines the principle of continuous approxima-

to design daily, cyclo-stationary operating policies. The continuous approx-

imation, performed via tree-based regression, makes it possible to mitigate

the curse of dimensionality by adopting a very coarse discretization grid with

Stochastic Dynamic Programming. The learning experience, in the form of

a data-set generated combining historical observations and model simulations,

is used as study site to infer general guidelines on the appropriate setting

proach in terms of accuracy and computational eﬀectiveness compared to tra-

ditional Stochastic Dynamic Programming.

D R A F T April 9, 2010, 9:46am D R A F T

The ﬁrst application of Dynamic Programming to water systems management is prob-

applied to reservoir management, particularly for hydropower production (see, among

of dynamic programming for multi-purpose reservoirs operation and networks of reser-

D R A F T April 9, 2010, 9:46am D R A F T

Vasiliadis and Karamouz [1994]; Castelletti and Soncini-Sessa [2007]).

disturbance dimensions (Bellman’s curse of dimensionality [Bellman, 1957]), so that SDP

a state variable described by a dynamic model or a stochastic disturbance, independent

to the curse of dimensionality (additional state variables). Further, in large reservoir

could be a cumbersome contribution to the curse of dimensionality.

Dynamic Programming based on Successive Approximations [Bellman and Dreyfus, 1962],

D R A F T April 9, 2010, 9:46am D R A F T

Incremental Dynamic Programming [Larson, 1968], Diﬀerential Dynamic Programming

the uncertainty associated with the underlying hydro-meteorological processes cannot be

the policy design problem.

subsystems (e.g., decomposition based on physical or functional structure of the system,

hierarchical multilevel decomposition) with the subsequent use of an iterative procedure

[1980]) proposed an algorithm based on the decomposition of a N -reservoirs operation

composition/aggregation technique where each subproblem includes an actual reservoir

computational complexity is reduced to a quadratic function of the state dimension. Saad

D R A F T April 9, 2010, 9:46am D R A F T

ponent Analysis to reduce the complexity in a ﬁve-reservoir hydropower system from a

technique based on neural networks. A major contribution to hierarchical multilevel de-

is transmitted from lower to higher levels in the decomposition hierarchy.

with a continuous approximation of the value function. Diﬀerent classes of approximators

et al. [2006]; Castelletti et al. [2007].

D R A F T April 9, 2010, 9:46am D R A F T

leads to an exponential explosion of the computational costs.

D R A F T April 9, 2010, 9:46am D R A F T

et al., 2005]. Unlike traditional stochastic approximation algorithms [Bellman et al.,

ﬂexibility, which is a paramount characteristic in the typical multi-objective context of

approximated are unpredictable in shape; second, a higher computational eﬃciency as no

and Stone, 2007].

D R A F T April 9, 2010, 9:46am D R A F T

In this paper, the ﬁtted Q-iteration is demonstrated on Lake Como, a multi-purpose

in this paper. An improved version is therefore proposed that includes non-stationary

natural term of comparison.

2. Problem and Well Established Solutions

with a feedback control framework applied to a discrete-time stochastic system, periodic

other meteorological information (e.g., precipitation, temperature) and/or hydrological