
WATER RESOURCES RESEARCH, VOL. ???, XXXX, DOI:10.1029/

Tree-based reinforcement learning for optimal water reservoir operation

A. Castelletti,1 S. Galelli,1 M. Restelli,1 R. Soncini-Sessa1

Andrea Castelletti, Dipartimento di Elettronica e Informazione, Politecnico di Milano, P.zza Leonardo da Vinci, 32, 20133 Milano, Italy (castelle@elet.polimi.it)

1Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milan, Italy.


Abstract. Although it is one of the most popular and extensively studied approaches to designing water reservoir operations, Stochastic Dynamic Programming is plagued by a dual curse that makes it unsuitable to cope with large water systems: the computational requirement grows exponentially with the number of state variables considered (curse of dimensionality), and an explicit model must be available to describe every system transition and the associated rewards/costs (curse of modeling). A variety of simplifications and

approximations have been devised in the past, which, in many cases, make

the resulting operating policies inefficient and of scarce relevance in practi-

cal contexts. In this paper, a reinforcement-learning approach, called fitted

Q-iteration, is presented: it combines the principle of continuous approxima-

tion of the value functions with a process of learning off-line from experience

to design daily, cyclo-stationary operating policies. The continuous approx-

imation, performed via tree-based regression, makes it possible to mitigate

the curse of dimensionality by adopting a very coarse discretization grid with

respect to the dense grid required to design an equally performing policy via

Stochastic Dynamic Programming. The learning experience, in the form of

a data-set generated by combining historical observations and model simulations, makes it possible to overcome the curse of modeling. The Lake Como water system (Italy) is used as the study site to infer general guidelines on the appropriate setting for the algorithm parameters and to demonstrate the advantages of the approach in terms of accuracy and computational effectiveness compared to traditional Stochastic Dynamic Programming.


1. Introduction

Despite the great progress made in the last decades, optimal operation of water reservoir

systems still remains a very active research area (see the recent review by Labadie [2004]).

The combination of multiple, conflicting water uses, non-linearities in the model and the

objectives, strong uncertainties in the inputs, and high dimensional state space make the

problem challenging and intriguing (Castelletti et al. [2008] and references therein).

Stochastic Dynamic Programming (SDP) is one of the most suitable methods for designing (Pareto) optimal reservoir operating policies (see, e.g., Soncini-Sessa et al. [2007]

and references therein). SDP is based on the formulation of the operating policy design

problem as a sequential decision making process. The key idea is to use value functions to

organize and structure the search for optimal policies in stochastic domains. A decision

taken now can produce not only an immediate reward, but also affect the next system

state and, through that, all the subsequent rewards. SDP is thus based on looking ahead

to future events and computing a backed-up value, which is then used to update the value

function.

The first application of Dynamic Programming to water systems management is prob-

ably owed to Hall and Buras [1961]. Since then, the method has been systematically

applied to reservoir management, particularly for hydropower production (see, among

others, Hall et al. [1968]; Heidari et al. [1971]; Trott and Yeh [1973]; Turgeon [1980]; Es-

ogbue [1989]). Beginning in the early 1980s, the interest expanded to the stochastic version

of dynamic programming for multi-purpose reservoirs operation and networks of reser-

voirs (see the reviews by Yakowitz [1982]; Yeh [1985] and the contributions by Gilbert


and Shane [1982]; Read [1989]; Hooper et al. [1991]; Piccardi and Soncini-Sessa [1991];

Vasiliadis and Karamouz [1994]; Castelletti and Soncini-Sessa [2007]).

Despite being studied so extensively in the literature, SDP suffers from a dual curse

which, de facto, prevents its practical application to even reasonably complex water sys-

tems. (i) The computational complexity grows exponentially with state, decision and

disturbance dimensions (Bellman’s curse of dimensionality [Bellman, 1957]), so that SDP

cannot be used with water systems where the number of reservoirs is greater than a few

(2-3) units. (ii) An explicit model of each component of the water system is required

(curse of modeling [Bertsekas and Tsitsiklis, 1996]) to anticipate the effects of the sys-

tem transitions. Any information included into the SDP framework can only be either

a state variable described by a dynamic model or a stochastic disturbance, independent

in time, with the associated pdf. Exogenous information, such as temperature, precipita-

tion, snowpack depth, which could effectively improve reservoir operation [Tejada-Guibert

et al., 1995; Hejazi et al., 2008], cannot be explicitly considered in taking the release de-

cision, unless a dynamic model is identified for each additional piece of information, thus adding

to the curse of dimensionality (additional state variables). Further, in large reservoir

networks, disturbances are very likely to be spatially and temporally correlated. While

including space variability in the identification of the disturbance's pdf can sometimes be

rather complicated, it does not add to the computational complexity. Conversely, tempo-

ral correlation can be properly accounted for by using a dynamic stochastic model, which

could be a cumbersome contribution to the curse of dimensionality.

Attempts to overcome the curse of dimensionality are ubiquitous in the literature, e.g.

Dynamic Programming based on Successive Approximations [Bellman and Dreyfus, 1962],


Incremental Dynamic Programming [Larson, 1968], Differential Dynamic Programming

[Jacobson and Mayne, 1970], and problem-specific heuristics [Wong and Luenberger , 1968;

Luenberger , 1971]. However, these methods have been conceived mainly for deterministic

problems and are of limited interest for the optimal operation of reservoir networks, where

the uncertainty associated with the underlying hydro-meteorological processes cannot be

neglected. Alternative approaches can be classified in two main classes (see Castelletti

et al. [2008] and references therein for further details) depending on the strategy they

adopt to alleviate the dimensionality burden: methods based on the simplification of the

water system model and methods based on the restriction of the degrees of freedom of

the policy design problem.

The first includes both the decomposition of the system into smaller and tractable

subsystems (e.g., decomposition based on physical or functional structure of the system,

hierarchical multilevel decomposition) with the subsequent use of an iterative procedure

to solve the problem, and the aggregation of the sub-systems, or parts thereof, into a com-

posite, computationally tractable system. For instance, Turgeon [1981] (see also Turgeon

[1980]) proposed an algorithm based on the decomposition of an N-reservoir operation

problem into N subproblems, each considering two reservoirs: one among the actual

reservoirs plus an equivalent reservoir accounting for all the remaining downstream stor-

ages. The resulting overall computing time for the solution of the problem grows linearly

with N . A similar idea was exploited by Archibald et al. [1997], who suggested a de-

composition/aggregation technique where each subproblem includes an actual reservoir

and two equivalent reservoirs for upstream and downstream storages respectively. The

computational complexity is reduced to a quadratic function of the state dimension. Saad


and Turgeon [1988] and Saad et al. [1992] proposed a method based on Principal Com-

ponent Analysis to reduce the complexity in a five-reservoir hydropower system from a

ten to a four-state variable problem, which was then solvable by SDP. Better perfor-

mance was obtained on the same system by Saad et al. [1994], who used a disaggregation

technique based on neural networks. A major contribution to hierarchical multilevel de-

composition comes from Haimes [Haimes, 1977]. The idea behind such approach is that

different decomposition levels are separately modeled and analyzed, but some information

is transmitted from lower to higher levels in the decomposition hierarchy.

The second class of approaches to avert the curse of dimensionality is based on the

introduction of some hypotheses on the regularity of the SDP optimal value function. Since

SDP requires discretization of the feasible state and decision spaces, one way to mitigate

(but not vanquish) the dimensionality problem is to combine a coarser discretization grid

with a continuous approximation of the value function. Different classes of approximators

have been explored, including linear polynomials [Bellman et al., 1963; Tsitsiklis and Roy,

1996], cubic Hermite polynomials [Foufoula-Georgiou and Kitanidis, 1988] and splines

[Johnson et al., 1993; Philbrick and Kitanidis, 2001]. As universal function approximators,

artificial neural networks are particularly suited for this purpose, as discussed in Bertsekas

and Tsitsiklis [1996] and practically demonstrated by Castelletti et al. [2005]; Cervellera

et al. [2006]; Castelletti et al. [2007].

SDP’s curse of modeling has received less attention than its dimensionality twin. In

SDP, models are required to anticipate and evaluate the effects of any feasible decision

on the state dynamics by computing the associated reward. An alternative approach for

performing such evaluation is to rely directly on (i.e., learning from) experience. This


is the central idea of Reinforcement Learning (RL), a well-known framework for sequen-

tial decision-making (e.g., Barto and Sutton [1998]) that combines concepts from SDP,

stochastic approximation via simulation, and function approximation. The learning ex-

perience can be acquired on-line, by directly experimenting decisions on the real system

without any model, or generated off-line, either by using an external simulator or histori-

cal observations. While the first option is clearly impracticable on real reservoir networks,

off-line learning has already been experimented with in the operation of water systems. Castel-

letti et al. [2001] (see also Soncini-Sessa et al. [2007]) proposed a partially model-free

version of classical Q-learning [Watkins and Dayan, 1992] to design the daily operation of

a multi-purpose regulated lake. The storage dynamics was simulated via the mass balance

equation. The catchment was described using the historical inflow sequence. Using both

the storage and previous day inflow as state variables, Q-learning outperformed SDP with

the inflow modelled as an autoregressive process of order one. Bhattacharya et al. [2003]

developed a neural version of Q-learning for controlling pumps in a large polder system

in the Netherlands. Lee and Labadie [2007] compared the Q-learning algorithm with Implicit

SDP and Sampling SDP [Kelman et al., 1990] on the monthly operation of a two-reservoir

system in Korea, processing the previous month inflow information as proposed in Castel-

letti et al. [2001]. Q-learning was shown to outperform the other two approaches. RL

methods also alleviate, to some extent, the curse of dimensionality, as the search space over

the range of feasible release decisions is not exhaustively explored at each iteration step.

However, like SDP, they do require a discretization grid over the state space, which again

leads to an exponential explosion of the computational costs.


Lately, a new approach, called fitted Q-iteration, which combines RL concepts of off-line

learning and functional approximation of the value function, has been proposed [Ernst

et al., 2005]. Unlike traditional stochastic approximation algorithms [Bellman et al.,

1963; Bertsekas and Tsitsiklis, 1996; Tsitsiklis and Roy, 1996], which use parametric

function approximators and thus require a time consuming parameter estimation process

at each iteration step, fitted Q-iteration uses tree-based approximation [Breiman et al.,

1984]. The use of tree-based regressors offers a twofold advantage: first, a great modeling

flexibility, which is a paramount characteristic in the typical multi-objective context of

water reservoir systems with multi-dimensional states, where the value functions to be

approximated are unpredictable in shape; second, a higher computational efficiency as no

optimal parameter estimation is required for the value function approximation at each

iteration step. On the other hand, even if tree-based methods infer the model structure

directly from data, some parameters need to be specified to drive the tree construction

process, such as, for instance, the minimum number of data per leaf or the number of trees

when ensembles of trees are used. Fixing the value of such parameters can only be done

empirically and does require a fine, ad hoc analysis as any inaccuracy might ultimately

have negative effects on the policy performance. Further, while traditional Q-learning

has been provably shown to converge only when the value function updates are performed

incrementally, following the state trajectory produced by the sequence of optimal decisions

selected at each iteration step, fitted Q-iteration processes the information in a batch

mode, by simultaneously using all the learning experience in making an update of the

value function. This has been shown to speed up the convergence rate [Kalyanakrishnan

and Stone, 2007].


In this paper, the fitted Q-iteration is demonstrated on Lake Como, a multi-purpose

regulated lake in Italy. As originally proposed in Ernst et al. [2005], fitted Q-iteration

yields a stationary policy, which is perfectly suited for the artificial systems the algorithm

has been conceived for, but is less suited to the natural resources systems dealt with

in this paper. An improved version is therefore proposed that includes non-stationary

policies, which are more effective in adapting to the natural seasonal variability. The

focus of the paper is first on studying the properties of the algorithm, with an analysis

of the sensitivity of the results to the tree-based method parameters. The potential advantages

of the approach are then explored and evaluated against traditional SDP, which is the

natural term of comparison.

2. Problem and Well Established Solutions

The problem of designing the optimal operation of a water reservoir can be schematized

with a feedback control framework applied to a discrete-time stochastic system, periodic

with period T equal to one year [Castelletti et al., 2008]. For each time t of the plan-

ning horizon, given the storage volume st available in the reservoir (i.e., the state), the

operating policy p returns the volume ut (i.e., the release decision) to be released over

the time interval [t, t + 1) (e.g., in the next 24 hours when a daily policy is considered).

In some cases, improved operation can be obtained by conditioning the policy on any

other meteorological information (e.g., precipitation, temperature) and/or hydrological

information (e.g., previous period inflow, soil moisture, evapotranspiration, snow pack

depth) $I_t = |I_t^1, \ldots, I_t^L|$, which may help to partly anticipate the effects of the

stochastic disturbance εt+1 affecting the system.

In the notation adopted in this paper, the time subscript in the symbol of a variable


denotes the time instant at which such variable assumes a deterministic value, e.g. the

lake storage is measured at time t and thus is denoted with $s_t$, while the disturbance in the

interval [t, t + 1) is denoted with εt+1 since it can be deterministically known only at the

end of the interval [Piccardi and Soncini-Sessa, 1991].

2.1. Model of the Water System

The reservoir dynamics is governed by the mass conservation equation:

$s_{t+1} = s_t + a_{t+1} - r_{t+1}$   (1)

where at+1 is the net inflow volume in the time interval [t, t + 1), which includes net

evaporation and other losses; and rt+1 is the release over the same period, which is a

function of the release decision ut made at time t, the storage st and the inflow at+1 , i.e.,

$r_{t+1} = R_t(s_t, u_t, a_{t+1})$   (2)

The release function Rt (·) is a non-linear, periodic (with period T ) function describing

the stochastic relation between the decision ut and the actual release rt+1 [Piccardi and

Soncini-Sessa, 1991]. Indeed, between the time t at which the decision ut is taken and the

time t + 1 at which the release rt+1 it determines is completed, the inflow at+1 is affecting

the system, and the actual release rt+1 may not be equal to the decision ut , for instance

because of the activation of the spillways (for more details see Soncini-Sessa et al. [2007]).
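As a minimal illustration of equations (1)-(2), the following sketch advances the storage one step forward. The capacity value and the clipping logic standing in for $R_t(\cdot)$ are illustrative assumptions, not the actual Lake Como release function.

```python
S_MAX = 260.0e6  # hypothetical active storage capacity [m^3], illustration only

def release(s_t, u_t, a_t1):
    """Hypothetical release function R_t(s_t, u_t, a_{t+1}) of equation (2):
    the actual release equals the decision u_t, clipped between the forced
    spill (capacity exceeded) and the water actually available."""
    available = s_t + a_t1                      # volume available over [t, t+1)
    forced_spill = max(available - S_MAX, 0.0)  # spillway activation
    return min(max(u_t, forced_spill), available)

def mass_balance_step(s_t, u_t, a_t1):
    """One-step transition of equation (1): s_{t+1} = s_t + a_{t+1} - r_{t+1}."""
    r_t1 = release(s_t, u_t, a_t1)
    return s_t + a_t1 - r_t1, r_t1
```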

A number of alternative options can be adopted to model the inflow at+1 , depending on

the meteorological and hydrological data available and the requirements posed by the ap-

proach used to design the operating policies. While traditional process-based (especially

spatially-distributed) models are usually too complex (high number of state variables) to

be used within a feedback control framework, statistical models provide a reasonable bal-


ance between compactness and accuracy (e.g., Young [2006]), and are generally preferred

over the first in designing optimal reservoir operation. In the most general formulation

the inflow can be described as

$a_{t+1} = A_t(I_t, \varepsilon_{t+1})$   (3)

where At (·) is a periodic function with period T . For example, at+1 can be modeled as a

cyclostationary, log-normal autoregressive process of order d (i.e., a log-PAR(d)):

$a_{t+1} = \exp(y_{t+1}\,\sigma_t + \mu_t)$   (4a)

$y_{t+1} = \sum_{i=1}^{d} \alpha_{i,t}\, y_{t-i+1} + \varepsilon_{t+1}$   (4b)

where µt and σt are the periodic mean and standard deviation of the process, αi,t is the

parameter associated to the i-th autoregressive term at time t, and εt+1 is a zero-mean,

Gaussian white noise with constant variance. In this case the information vector It is

composed of the d autoregressive terms yt−i+1 (i = 1, . . . , d).
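For illustration, equations (4a)-(4b) with d = 1 can be simulated as follows; the parameter arrays mu, sigma, and alpha are placeholders that would in practice be estimated from the inflow record.

```python
import numpy as np

def simulate_log_par1(mu, sigma, alpha, n_years=10, seed=0):
    """Simulate a cyclostationary log-normal AR(1) inflow (log-PAR(1)), eqs. (4a)-(4b).
    mu, sigma, alpha: arrays of length T (one value per day of the year)."""
    T = len(mu)
    rng = np.random.default_rng(seed)
    y = 0.0
    inflow = np.empty(n_years * T)
    for k in range(n_years * T):
        t = k % T                                  # periodic (day-of-year) index
        y = alpha[t] * y + rng.normal(0.0, 1.0)    # eq. (4b) with d = 1
        inflow[k] = np.exp(y * sigma[t] + mu[t])   # eq. (4a)
    return inflow

# Example with placeholder seasonal parameters (T = 365)
T = 365
mu = 1.0 + 0.5 * np.sin(2 * np.pi * np.arange(T) / T)  # hypothetical log-inflow mean
sigma = 0.3 * np.ones(T)
alpha = 0.7 * np.ones(T)
a = simulate_log_par1(mu, sigma, alpha)
```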

The model of the water system, composed of the catchment and the reservoir, can be

represented compactly with the following vector difference equation

$x_{t+1} = f_t(x_t, u_t, \varepsilon_{t+1})$   (5)

where the state vector $x_t \in S_{x_t} \subset \mathbb{R}^{n_x}$, with $n_x = 1 + L$, includes the state variables $s_t$ and $I_t$; $u_t \in U_t(s_t) \subseteq S_{u_t} \subset \mathbb{R}$, $U_t(s_t)$ being the set of the feasible decisions, which only depends upon the storage $s_t$; the stochastic disturbance $\varepsilon_{t+1} \in S_{\varepsilon_{t+1}} \subseteq \mathbb{R}$ is described by its pdf $\phi_t(\cdot)$, which is periodic of period $T$, as well as the function $f_t(\cdot)$ and the set $U_t(\cdot)$.

When a network of N reservoirs is considered, instead of a single reservoir, the state vector is enlarged to include the vector $\mathbf{s}_t = |s_t^1, \ldots, s_t^N|$ of the N storages and the P information vectors $\mathbf{I}_t^l$ from the associated P meteo/catchment systems ($l = 1, \ldots, P$),

where P can be equal to, smaller than, or greater than N, and $n_x = N + L \cdot P$. The disturbance

vector εt+1 ∈ Sεt+1 ⊆ Rnε is composed of P disturbances εlt+1 (i.e., nε = P ) with associated

pdf ϕlt (·). Finally, the release decision vector ut ∈ Ut (st ) ⊆ Sut ⊂ Rnu , whose components

are the release decision ujt from each reservoir j (with j = 1, . . . , N and nu = N ), replaces

the scalar decision ut in equation (5).

The presence of multiple, say q, operating objectives, corresponding to different wa-

ter users and other social and environmental interests, can be formalized by defining a

periodic, with period T , step reward function gt+1 = gt (xt , ut , εt+1 ) associated to the

stochastic state transition from xt to xt+1 . According to the multi-objective nature of the

problem, this function can be obtained as a weighted sum (Weighting Method) of the q

step reward functions $g_t^i(x_t, u_t, \varepsilon_{t+1})$, $i = 1, \ldots, q$, describing the whole set of users and interests considered, i.e.

$g_t(x_t, u_t, \varepsilon_{t+1}) = \sum_{i=1}^{q} \lambda_i\, g_t^i(x_t, u_t, \varepsilon_{t+1})$   (6)

where $\sum_{i=1}^{q} \lambda_i = 1$ with $\lambda_i \ge 0\ \forall i$.

An operating policy p is defined as a sequence p = {m0 (·), m1 (·), . . .} of operating rules

of the form

$u_t = m_t(x_t)$   (7)

2.2. Problem Formulation

The time horizon h over which the operating policy is designed can be either finite

or infinite. Dealing with the management of natural resources, the second assumption

should be preferred over the first, which, to be effective, would require the definition of a state-dependent penalty function at the final instant of the time horizon, and this can be critical in most cases. Conversely, when an infinite time horizon is assumed, a

discount factor must be fixed to ensure convergence of the policy design algorithm (Total

Discount Cost (TDC) formulation).

For a given value of the weights $\lambda_i$, with $i = 1, \ldots, q$, the total reward function associated with the operating policy $p$ over an infinite time horizon can be defined as

$J(p) = \lim_{h \to \infty} \mathop{\mathrm{E}}_{\varepsilon_1, \ldots, \varepsilon_h}\left[\sum_{t=0}^{h-1} \gamma^t\, g_t(x_t, u_t, \varepsilon_{t+1})\right]$   (8)

where 0 < γ < 1 and the expected value is used as the criterion for filtering the stochastic

disturbances (see Orlovski et al. [1984]; Nardini et al. [1992]; Soncini-Sessa et al. [2007]

for details and alternative solutions). The optimal policy p∗ is obtained by solving the

following optimal control problem:

$p^* = \arg\max_{p} J(p)$   (9)

subject to the model equations.

Equation (9) can be rewritten on a finite horizon h


$p_h^* = \arg\max_{p_h} \mathop{\mathrm{E}}_{\varepsilon_1, \ldots, \varepsilon_h}\left[\sum_{t=0}^{h-1} \gamma^t\, g_t(x_t, u_t, \varepsilon_{t+1}) + \gamma^h H_h(x_h)\right]$   (10)

where $H_h(x_h)$ is a penalty function that expresses the total expected reward one would obtain starting from $x_h$ and applying optimal release decisions over the period $[h, \infty)$. Since $\gamma^h$ vanishes as $h$ goes to infinity, the solution to problem (10) is equivalent to the limit of the following sequence of policies as the horizon $h$ goes to infinity
$p_h^* = \arg\max_{p_h} \mathop{\mathrm{E}}_{\varepsilon_1, \ldots, \varepsilon_h}\left[\sum_{t=0}^{h-1} \gamma^t\, g_t(x_t, u_t, \varepsilon_{t+1})\right]$   (11a)

$x_{t+1} = f_t(x_t, u_t, \varepsilon_{t+1})$,   $t = 0, \ldots, h-1$   (11b)

$m_t(x_t) = u_t \in U_t(x_t)$,   $t = 0, \ldots, h-1$   (11c)

$\varepsilon_{t+1} \sim \phi_t(\cdot \mid x_t, u_t)$,   $t = 0, \ldots, h-1$   (11d)

$x_0$ given   (11e)

$p_h \triangleq \{m_t(\cdot);\ t = 0, \ldots, h-1\}$   (11f)

By reformulating and solving the problem for some different values of λi (with i =

1, . . . , q), a finite subset of the generally infinite Pareto optimal policy set is obtained.

Since the system (equations (11b-d)) and the total reward function (11a) are periodic

of period T , the optimal policy p∗ turns out to be periodic with the same period, i.e.

p∗ = {m∗0 (·), m∗1 (·), . . . , m∗T −1 (·)}.

2.3. Stochastic Dynamic Programming

The formulation of the optimal problem (11) already includes a fundamental assumption

of Stochastic Dynamic Programming (SDP): an explicit model of the system is available,

through which the effects of any state transition can be fully anticipated (curse of mod-

eling). Precisely,

1. All the system dynamics are known and must be explicitly modeled in equation

(11b), which means that meteorological and/or hydrological information It can only be

included into the SDP formulation as state variables described by appropriate models. It

is not possible to consider exogenous deterministic inputs, whose values are known in real

time (e.g., precipitation, temperature): input to the models can only be either release

decisions or stochastic disturbances.


2. The disturbance vector is known (equation (11d)) and either the disturbances are

independent in time or any dependency upon the past at time t can be accounted for by the

value of the state at the same time.

3. The step reward functions are known and separable, i.e., gt (·) only depends on

variables defined for the time interval [t, t + 1).

The solution to problems (9) and (11) is computed by recursively solving the following

Bellman equation formulated according to the TDC framework:


$Q_t(x_t, u_t) = \mathop{\mathrm{E}}_{\varepsilon_{t+1}}\left[\, g_t(x_t, u_t, \varepsilon_{t+1}) + \gamma \max_{u_{t+1}} Q_{t+1}(x_{t+1}, u_{t+1}) \right]$   $\forall (x_t, u_t) \in S_{x_t} \times S_{u_t}$   (12)

where Qt (·, ·) is the so-called Q-function or value function, i.e., the cumulative expected

reward resulting from applying the release decision ut at time t in state xt and assuming

optimal decisions (i.e., a greedy policy) in any subsequent system transition. The rela-

tionship between the Q-function Qt (·, ·) and the cost-to-go function Ht (·), as originally

introduced by Bellman [1957], is given by the following formula:

$H_t(x_t) = \max_{u_t} Q_t(x_t, u_t)$   (13)

with the latter being a more compact representation than the former, but requiring

an explicit model to derive the optimal release decision associated to each state value,

according to the SDP requirements above. More precisely, the solution to problems (9)

and (11) is obtained by iteratively solving equation (12) as a backward looking solution

process over the period T −1, . . . , 0 and repeating the cycle until a suitable termination test

is satisfied, say after k cycles. Then, the last T Q-functions are the optimal Q∗ -functions,

from which the optimal operating rule at any time is derived as

$m_t^*(x_t) = \arg\max_{u_t} Q_t^*(x_t, u_t)$   (14)


To determine the right hand side of equation (12), the domains Sxt , Sut , and Sεt+1 , of

state, release decision, and disturbance must be discretized and, at each iteration step of

the resolution process, explored exhaustively. The choice of the domain discretization is

essential as it reflects on the algorithm complexity, which is combinatorial in the number

of states, release decisions, and disturbances, and in their domain discretization. Let Nxt ,

Nut , and Nεt+1 be the number of elements in the discretized state, release decision, and

disturbance sets Sxt ⊂ Rnx , Sut ⊂ Rnu , and Sεt+1 ⊆ Rnε : the recursive resolution of (12)

for kT iteration steps (where k is usually lower than ten) requires

$kT \cdot \left( N_{x_t}^{\,n_x} \cdot N_{u_t}^{\,n_u} \cdot N_{\varepsilon_{t+1}}^{\,n_\varepsilon} \right)$   (15)

evaluations of the operator E[·] in (12). Equation (15) shows the so-called curse of di-

mensionality, i.e., an exponential growth of computational complexity with the state and

decision dimension. It follows that SDP cannot be applied to design daily operating poli-

cies for water systems with a number of reservoirs greater than a few units, say 2 or 3,

and/or when too many hydro-meteorological information variables are accounted for in

the vector It .
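For comparison with the approach introduced next, the nested-loop structure that equation (15) counts can be made explicit with a one-dimensional illustrative sketch of one backward step of the Bellman recursion (12)-(13); the grid arrays, transition model, reward function, and disturbance probabilities are hypothetical placeholders.

```python
import numpy as np

def sdp_backward_sweep(Q_next, states, decisions, disturbances, probs,
                       step_model, step_reward, gamma=0.9997):
    """One backward step of the Bellman recursion (12) on a discretized grid.
    Q_next[j, k]: Q_{t+1} for state j and decision k; probs[m]: weight of
    disturbance m. step_model and step_reward are user-supplied callables."""
    H_next = Q_next.max(axis=1)                  # cost-to-go of eq. (13)
    Q = np.zeros((len(states), len(decisions)))
    for i, x in enumerate(states):               # exhaustive exploration of the
        for k, u in enumerate(decisions):        # grids: the source of the
            for m, eps in enumerate(disturbances):   # count in eq. (15)
                x_next = step_model(x, u, eps)
                j = int(np.argmin(np.abs(states - x_next)))   # nearest grid point
                Q[i, k] += probs[m] * (step_reward(x, u, eps) + gamma * H_next[j])
    return Q
```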

3. Tree-based Batch Mode Reinforcement Learning

As anticipated in the introduction, Reinforcement Learning (RL) provides a conceptual

framework for overcoming the curse of modeling, since it does not presume the knowledge

of an explicit model to describe state transitions, disturbance pdfs, and rewards. However, it only partially alleviates the curse of dimensionality expressed by equation (15).

The fitted Q-iteration algorithm proposed by Ernst et al. [2005], which builds on early

works on fitted value iteration [Ormoneit and Sen, 2002], combines the RL idea of learn-


ing from experience with the concept of continuous approximation of the value function

developed for large-scale dynamic programming (see for example Gordon [1995]; Tsitsiklis

and Roy [1996]). This results in a further reduction of the computational burden.

Indeed, a continuous mapping of state-decision pairs into the value function should permit the same level of accuracy as a look-up table representation based on an extremely dense grid, while using a much coarser grid for the state-decision space. Further, the learning

process is performed off-line, without the need for directly experimenting on the real sys-

tem, which is a fundamental requirement when dealing with water resources systems, as

experiments would lead to unsustainable costs in terms of time and of social and economic losses.

3.1. Fitted Q-iteration Algorithm

Like the other RL algorithms, fitted Q-iteration does not require explicit modeling of the system. The operating policy is determined by learning from experience. Strictly, such experience is represented as a finite data-set $\mathcal{F}$ of four-tuples of the form $\langle x_t, u_t, x_{t+1}, g_{t+1} \rangle$, i.e.

$\mathcal{F} = \{\langle x_t^l, u_t^l, x_{t+1}^l, g_{t+1}^l \rangle,\ l = 1, \ldots, \#\mathcal{F}\}$

where #F is the cardinality of F. Each four-tuple is a sample of the one-step transition

dynamics of the system. The set F is the sole information required to determine an

operating policy, regardless of the way it is generated (see Section 3.2), whether the four-

tuples are obtained from one single trajectory of the system (e.g., the historical one) or

from several, independently generated, one-step or multi-step simulations of the system

dynamics. Since, except for very special cases, an optimal policy cannot be determined


from a finite set of transition samples, the policy generated by fitted Q-iteration will be

an approximation of the optimal policy p∗ that solves problem (11). Precisely, the fitted

Q-iteration yields an approximation of the optimal Q-functions of the TDC problem (9),

by iteratively extending the optimization horizon h, i.e. by iteratively solving problem

(11).

The deterministic and stationary (T = 1) case is useful to describe the algorithm.

Under these simplifying assumptions the state transition (11b) and associated reward

depend only on the state xt and decision ut . It can be shown [Ernst, 1999] that the

following sequence of Qh -functions, defined for all (xt , ut ) ∈ Sx × Su

$Q_0(x_t, u_t) = 0$   (16a)

$Q_h(x_t, u_t) = g(x_t, u_t) + \gamma \max_{u_{t+1}} Q_{h-1}(x_{t+1}, u_{t+1})$   $\forall h > 0$   (16b)

converges, in the infinity norm, to the optimal Q-function $Q^*(\cdot)$ that solves the determin-

istic and stationary equivalent to equation (12). Assuming the function Qh−1 (·) is known,

the value of $Q_h(\cdot)$ can be computed for all the state-decision pairs $(x_t^l, u_t^l)$, $l = 1, \ldots, \#\mathcal{F}$, using equation (16b) and the learning data-set $\mathcal{F}$. The $\#\mathcal{F}$ values so obtained can then be used to get a continuous approximation $\hat{Q}_h(\cdot)$ of $Q_h(\cdot)$ over the whole state-decision set $S_x \times S_u$ by applying a regression algorithm (i.e., by fitting a function approximator) to the training set

$\mathcal{TS} = \{\langle (x_t^l, u_t^l), Q_h(x_t^l, u_t^l) \rangle,\ l = 1, \ldots, \#\mathcal{F}\}$

where the pairs $(x_t^l, u_t^l)$ are the regressor inputs and the values $Q_h(x_t^l, u_t^l)$ the regressor outputs. By substituting $\hat{Q}_h(\cdot)$ for $Q_h(\cdot)$ and applying the same reasoning, the subsequent approximations $\hat{Q}_{h+1}(\cdot), \hat{Q}_{h+2}(\cdot), \ldots$ can be determined iteratively.


In the stochastic case, the right-hand side of equation (16b) is a realization of a random variable and $Q_h(x_t, u_t)$ is redefined as its expectation. However, the expectation does not have to be computed explicitly when $\hat{Q}_h(\cdot)$ is approximated with a least-squares regression function, because the latter generates an approximation of the conditional expectation of the output variables given the input. Its application to a training set constructed from stochastic transitions therefore provides a continuous approximation of $Q_h(\cdot)$ over the whole state-decision set.

As originally proposed by Ernst et al. [2005], fitted Q-iteration generates a stationary policy, i.e., just one operating rule of the form $u_t = m(x_t)$, which is the optimal policy for a stationary system. However, natural systems are not stationary, and thus a periodic policy (i.e., a sequence of T − 1 operating rules of the form (7)) is better suited to adapt to the underlying seasonal variability. A way to extend the fitted Q-iteration framework to the non-stationary case is to consider time as a component of the state vector, which evolves driven by a deterministic, autonomous transition function: t + 1 always follows from t. Accordingly, the notation of the learning data-set $\mathcal{F}$ can be rewritten as $\mathcal{F} = \{\langle (t, x_t)^l, u_t^l, (t+1, x_{t+1})^l, g_{t+1}^l \rangle,\ l = 1, \ldots, \#\mathcal{F}\}$. In this way, all the properties of the stationary formulation are preserved and the convergence proofs hold under the same assumptions. Therefore, the fitted Q-iteration algorithm applied to this new set yields an approximate periodic policy, composed of a sequence of T − 1 operating rules of form (7).

A tabular version of the fitted Q-iteration algorithm so modified is the following:

Input: a learning set $\mathcal{F}$ and a regression algorithm.

Initialization:

Set $h = 0$.

Set $\hat{Q}_0(\cdot) = 0$ over the whole state-decision space $S_x \times S_u$.

Iterations: repeat until the stopping conditions are met

Set $h = h + 1$.

Build the training set $\mathcal{TS} = \{\langle i^l, o^l \rangle,\ l = 1, \ldots, \#\mathcal{F}\}$, where $i^l = ((t, x_t)^l, u_t^l)$ and $o^l = g_{t+1}^l + \gamma \max_{u_{t+1}} \hat{Q}_{h-1}((t+1, x_{t+1})^l, u_{t+1})$.

Run the regression algorithm on $\mathcal{TS}$ to get $\hat{Q}_h(\cdot)$, from which the policy $\hat{p}_h$ is derived.
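The iteration above can be sketched in a few lines of code. The sketch below is illustrative only: it assumes the four-tuples are stored as NumPy arrays with the time index already folded into the state, uses scikit-learn's ExtraTreesRegressor as the tree-based approximator (the authors rely on their own Extra-Trees implementation), and takes a finite set u_grid of candidate decisions for the maximization in the training-output computation.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(X, U, X_next, G, u_grid, h_max=150,
                       gamma=0.9997, M=50, n_min=15):
    """Fitted Q-iteration on a data-set of four-tuples <x, u, x', g>.
    X, X_next: (n, dim_x) arrays of time-augmented states; U: (n,) decisions;
    G: (n,) step rewards; u_grid: candidate decisions for the max in (16b)."""
    n = len(G)
    XU = np.column_stack([X, U])          # regressor inputs i^l = (x^l, u^l)
    Q_hat = None
    for h in range(1, h_max + 1):
        if Q_hat is None:
            o = G.copy()                  # Q_0 = 0, so o^l = g^l at the first pass
        else:
            # o^l = g^l + gamma * max_u' Q_{h-1}(x'^l, u')
            q_next = np.column_stack([
                Q_hat.predict(np.column_stack([X_next, np.full(n, u)]))
                for u in u_grid])
            o = G + gamma * q_next.max(axis=1)
        # refit a fresh ensemble at every iteration (tree structure is refreshed)
        Q_hat = ExtraTreesRegressor(n_estimators=M, min_samples_split=n_min)
        Q_hat.fit(XU, o)
    return Q_hat

def greedy_rule(Q_hat, x, u_grid):
    """Greedy operating rule of equation (14): pick the decision maximizing Q_hat."""
    xu = np.column_stack([np.tile(x, (len(u_grid), 1)), u_grid])
    return u_grid[int(np.argmax(Q_hat.predict(xu)))]
```

Refitting a new ensemble at every iteration mirrors the refresh of the tree structure discussed in Section 5.1.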

Fitted Q-iteration is said to be a batch RL algorithm, because the whole learning set F is

processed in a batch mode, in contrast to traditional RL algorithms that perform incremental

updates of the value function using the four-tuples sequentially. Iterations can be stopped when

the distance between Q̂h (·) and Q̂h−1 (·) drops below a pre-assigned threshold, even if this criterion

does not ensure convergence with some function approximators (see Section 3.3). When the

algorithm stops, whatever stopping condition is selected, the final policy p̂ is an approximation of the optimal policy p∗. The policy p̂h associated with the h-th algorithm iteration is composed of a sequence of T − 1 operating rules of the form (7), each one looking ahead over the horizon [t, t + h). In other words, for each value of h the algorithm solves a receding h-step horizon problem of form (11).

3.2. The Learning Data-Set

According to the RL concept of learning off-line from experience, the simplest idea for generating the learning data-set F is to employ a historical record of system transitions and thus

let the algorithm learn from the real experience. If the objectives selected for the problem fit the

actual operation targets, the policy derived will be very close to the historical one, with small

benefit when the system is currently managed well below its potential. One way to refine and


improve this near-historical policy is to enlarge the exploration of release decisions to a small set of

different values around the historical one (see Gaskett [2002]), for each past value of the state

(Figure 1a). This is, however, a risky approach: if the state-decision set has been scarcely sampled

during the historical operation (typically, in poorly controlled systems [Tsitsiklis and Roy, 1996]),

the informative content of the learning data-set can be low and the resulting operation policy

is very likely to be quite far from optimality. Further, the approach is impracticable when the

water system has never been operated before (e.g., in planning problems).

An alternative approach is to explore the behavior of the water system, via model simulation,

for different state values and under different operating policies, namely to adopt a model-based

approach. However, the modeling effort does not need to involve the whole water system, but

just the components directly controlled (i.e., the reservoir(s)) and any downstream part, which

is affected by the release decisions (e.g., the downstream users). Indeed, the upstream part

(i.e., the meteo/catchment systems) is not influenced by release decisions and thus a model is

not required to explore the processes dynamics and the disturbance realizations. This is the

idea underlying the so-called partial model-free approach [Castelletti et al., 2001]: to use the

historical time series for each inflow and, when available, for any other hydro-meteorological

information that can be usefully included among the state variables, while describing the storage

dynamics with simple mass balance equations and any downstream user with an appropriate

dynamic model (see, e.g., Galelli et al. [2010]). As far as the controlled parts are concerned, a

discretization of the corresponding state space and the decision space is required to run either

one-step or multi-step simulations of the relevant dynamics. Even when pruned from the hydro-

meteorological information components, the dense grid discretization (Figure 1b) adopted in the

SDP formulation might still lead to prohibitive computational requirements. On the other hand,


it would not take any advantage of the continuous approximation of the Q-functions provided

by fitted Q-iteration. Rather, a coarse grid can exponentially reduce the computational burden

by linearly reducing Nxt and Nut in equation (15). Such a coarse grid can be either obtained

as a uniform sub-sampling of the SDP dense grid (Figure 1c) or generated with more efficient

discretization methods (Figure 1d), such as orthogonal arrays, Latin hypercube designs, and low-

discrepancy sequences (see Cervellera et al. [2006] and references therein). Whatever the approach

adopted to build the learning data-set, this might contain redundancies, which only add to the

computational requirements with no advantages in terms of policy performance. A way to filter

the data-set is to adopt active learning techniques [Cohn et al., 1996], based on which only the samples that most improve the performance of the learning algorithm (see Ernst [2005]) are

retained.
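As an example of the more efficient discretization schemes mentioned above, a coarse state-decision grid can be drawn with a Latin hypercube design. The sketch below uses SciPy's quasi-Monte Carlo sampler with purely illustrative bounds for a single storage and a single release decision.

```python
from scipy.stats import qmc

# Illustrative bounds: storage [m^3] and daily release decision [m^3]
lower = [0.0, 0.0]
upper = [260.0e6, 20.0e6]

sampler = qmc.LatinHypercube(d=2, seed=1)
unit_points = sampler.random(n=80)            # 80 points in the unit square
grid = qmc.scale(unit_points, lower, upper)   # coarse state-decision grid
storages, decisions = grid[:, 0], grid[:, 1]
```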

3.3. Function Approximator

In principle, the fitted Q-iteration algorithm can be combined with any least-squares function approximator designed for regression problems. In practice, the approximator adopted

should have several desirable features [Ernst, 1999]:

Modelling flexibility. For very simple problems, involving, for example, one single reservoir op-

erated for one single purpose, the class of functions, to which the Q-functions to be approximated

belong, can be, to some extent, anticipated (e.g., for flood control it must be a function mono-

tonically increasing with the storage). However, when dealing with reservoir networks and/or

multi-objective problems, this class could be totally unpredictable in shape. As a consequence,

the function approximator must be able to adapt its structure to the problem.

Computational efficiency. The regression algorithm is run at every iteration step of the fitted

Q-iteration. It should ensure accurate approximations, without adding too much to the overall


computational requirements. Further, no human tuning of the function approximator parameters

should be required (fully automated approximation).

Some parametric function approximators can provide a great modeling flexibility; artificial

neural networks, for instance, are provably able to approximate any continuous, multivariate

function to any desired degree of accuracy. This modeling flexibility, however, comes at a price,

since it is often reflected in a large number of parameters requiring explicit calibration, thus

strongly affecting the computational efficiency (see Castelletti et al. [2005]) and increasing the

risk of over-parameterization. As the problem size scales up, neural networks require more and

more neurons, thus increasing the computational cost of the training phase. Non-parametric

function approximators, particularly tree-based methods, ensure modeling flexibility and, at the

same time, computational efficiency, since no traditional parameter estimation is required in

their building process.

Tree-based methods provide non-parametric estimates based on a recursive binary partition of

the training data-set T S (tree building algorithm). At the first step, the space of inputs (root)

is partitioned into two subsets (nodes), by applying an appropriate splitting rule to T S. The

operation is iteratively repeated on the two subsets resulting from each splitting until a given

termination test is satisfied. Each subset of the final partition (leaf) is then associated with

a value of the output or a function of the input (association rule). In some methods the tree

building procedure is repeated more than once to construct an ensemble of trees (forest) and the

values estimated by the trees are aggregated, according to an aggregation rule, to produce the

final estimate.

3.3.1. Extra-Trees


Tree-based methods include KD-Tree, Classification and Regression Trees [Breiman et al.,

1984], Tree Bagging [Breiman, 1996], Totally Randomized Trees and Extremely Randomized

Trees (Extra-Trees) [Geurts et al., 2006]. These methods basically differ by the splitting rule,

the termination test they adopt, and the number of trees they grow. Extra-Trees (described

later) were demonstrated to perform better than other tree-based methods combined with the

fitted Q-iteration algorithm [Ernst et al., 2005] and are therefore adopted in this study. Par-

ticularly, they provide great scalability by adapting the trees’ structure to the training set at

each iteration, thus resulting in a better accuracy of the final policy. The drawback with these

continuous changes in the structure is that Extra-Trees do not ensure convergence of fitted Q-

iteration and so the algorithm cannot simply be stopped based on the distance between two

consecutive approximations of the Q-functions. However, contrary to many parametric function

approximators, they do not lead to divergence when the problem horizon is infinite, which is a

fundamental property in dealing with water resources systems.

The Extra-Tree building algorithm grows an ensemble of M trees. Nodes are split using the following rule: K alternative cut-directions (regressor inputs) are randomly selected and, for each

one, a random cut-point is chosen; a score (explained variance) is then associated to each cut-

direction and the one maximizing the score is adopted to split the node (for more details, see

Geurts et al. [2006]). The algorithm stops partitioning a node if its cardinality is smaller than

nmin (termination test) and the node is therefore a leaf. To each leaf a value is assigned, obtained

as the average of the regressor outputs ol associated to the inputs il that fall in the leaf. The

estimates produced by the M trees are finally aggregated with arithmetic average (aggregation

rule).


Three parameters are thus associated to Extra-Trees, whose values can be fixed on the basis

of empirical evaluations:

K, the number of alternative cut-directions, can be chosen in the interval [1, . . . , n], where

n is the number of regressor inputs. When K is equal to n, the choice of the cut-direction is

not randomized and the randomization acts only through the choice of the cut-point. On the

contrary, low values of K increase the randomization of the trees and weaken the dependence

of their structure on the output of the training data-set. Geurts et al. [2006] have empirically

demonstrated that, for regression problems, the optimal default value for K is n.

nmin , the minimum cardinality for splitting a node. Large values of nmin lead to small trees

(few leaves), with high bias and small variance. Conversely, low values of nmin lead to fully-

grown trees, which may over-fit the data. The optimal value of nmin depends not only on the

risk aversion to over-fitting, but also on the level of noise in the outputs of the training data-set:

the noisier are the outputs, the higher should be the optimal value of nmin .

M , the number of trees in the forest, influences the strength of the variance reduction and

the behavior of the estimation error, which is a decreasing function of M [Breiman, 2001]. The

estimation accuracy thus increases with M and the choice of its value depends on a trade-off

between the desired model accuracy and available computing power.
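For readers wishing to experiment, the three parameters map directly onto the constructor arguments of a standard Extra-Trees implementation. The sketch below uses scikit-learn for illustration (the paper relies on its own implementation following Geurts et al. [2006]), with K corresponding to max_features, nmin to min_samples_split, and M to n_estimators; the numerical values are arbitrary.

```python
from sklearn.ensemble import ExtraTreesRegressor

# Illustrative setting: K = n (all inputs tried at each split), n_min = 15, M = 50
regressor = ExtraTreesRegressor(
    n_estimators=50,       # M: number of trees in the forest
    max_features=None,     # K: cut-directions tried at each node (None = all inputs)
    min_samples_split=15,  # n_min: minimum cardinality for splitting a node
)
```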

4. Case Study: Lake Como, Italy

The Lake Como water system (Figure 2) was selected as the study site to evaluate the potential of

the fitted Q-iteration algorithm. The rationale behind the choice is twofold. First, given the

relatively simple water system topology, the shape of the operating policies and the associated

values of the objectives at the extreme points of the Pareto front (the coordinates of the utopia

point) can be inferred a priori, and this can be very useful for studying the sensitivity of the


algorithm to Extra-Trees parameters and the stopping conditions. Second, the dimension of the

system makes the control problem solvable with SDP, and this is key to perform a comparative

evaluation of the algorithm. Based on this analysis, the advantages of the fitted Q-iteration over

SDP can be easily extrapolated to more complex cases, where SDP requirements would turn out

prohibitive for a comparison.

4.1. Description

Lake Como is the third biggest regulated lake in Italy, with a surface area of 145 km² and an active storage of 260 Mm³. The lake's 4500 km² catchment area produces a yearly average inflow of 4.73 Gm³, with the typical two-peak (spring and autumn) subalpine hydrological flow pattern. The regulation was introduced in 1946 with the double purpose of providing flood protection on the lake shores and supplying water to the downstream users (5 irrigation districts with a total irrigated area of 1400 km² and 9 run-of-river power plants with a total installed capacity of 92 MW). The lake regulation has been extensively studied following the

work by Guariso et al. [1985, 1986] and the multifaceted nature of the conflict over its water

use analyzed from different points of view, including the combined regulation of the lake and

the alpine hydropower reservoirs and the integrated management of blue and green water for

improved agricultural production [Galelli and Soncini-Sessa, 2010].

4.2. Problem Setting

The water system (Figure 2) is composed of a catchment feeding the lake, which serves the

downstream irrigation districts and hydropower plants. The operating policy provides the release

decision ut based on the current value of the lake storage st. The latter is the only component

of the state vector xt and its dynamics is governed by the mass balance equation (1). The


release decision ut is the volume to be released in the next 24 hours from the lake dam and,

finally, according to (6), the step reward function gt (·) is a linear combination of two step costs

(negative rewards) accounting for flood damage and downstream water deficits. The learning

data-set F of four-tuples < (t, xt ), ut , (t + 1, xt+1 ), gt+1 > required by the fitted Q-iteration

algorithm was built adopting a partial model-free approach. An 80-point coarse discretization was used for the state-decision space (Figure 6); precisely, 10 points for the storage $s_t$ and 8 points for the release decision $u_t$, the first six of which correspond to downstream water demand values, plus two greater values. For the inflow $a_t$ at time t, which plays the role of a disturbance to the system, 15 years of daily streamflow data (1965-1979) were directly used (model free). The

state transitions were performed by running a one-step simulation of equation (1) for each of the

10 points of the storage grid and each of the 8 points of the release decision grid against the 15

possible realizations of the inflow at each time step t, t = 1, . . . , T with T = 365. The resulting

number #F of 4-tuples in the learning data-set was equal to 438,160.
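The construction just described can be sketched as follows: every (storage, decision) grid point is simulated one step forward against each historical inflow observed on the same calendar day, producing the four-tuples of the learning data-set. The step_model and step_cost callables are hypothetical stand-ins for the mass balance (1)-(2) and the step costs defined below.

```python
def build_learning_set(storage_grid, decision_grid, inflow_years, step_model, step_cost):
    """Build the four-tuple data-set F = {<(t, s), u, (t+1, s'), g>} with a
    partial model-free approach: storages and decisions come from a coarse grid,
    inflows are taken directly from the historical record (one value per year and
    per day), and the transition is a one-step simulation of eq. (1)."""
    n_years, T = inflow_years.shape      # e.g., 15 years x 365 days
    tuples = []
    for t in range(T):
        for s in storage_grid:
            for u in decision_grid:
                for a in inflow_years[:, t]:        # historical realizations of day t
                    s_next, r = step_model(s, u, a)
                    g = step_cost(t, s, u, r)
                    tuples.append(((t, s), u, ((t + 1) % T, s_next), g))
    return tuples
```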

The cost (objective) $J^f$ associated with flood damage (to be minimized!) is formulated as the total discounted ($\gamma = 0.9997$) number of days of flooding, computed with equation (8), with

$g_t^f(s_t, u_t, a_{t+1}) = \begin{cases} 0 & \text{if } s_t \le \bar{s} \\ 1 & \text{otherwise} \end{cases}$   (17)

where $\bar{s}$ is the storage corresponding to the flood threshold (1.24 m) at the lowest shoreline point (downtown Como). The cost $J^w$ associated with downstream water deficits is computed in an analogous way with the following step cost function:

$g_t^w(s_t, u_t, a_{t+1}) = \begin{cases} 0 & \text{if } r_{t+1} \ge w_t \\ w_t - r_{t+1} & \text{otherwise} \end{cases}$   (18)


where wt is the aggregated agricultural and hydropower demand and rt+1 is the actual release

from the lake given by equation (2).

The policy is designed by solving an equivalent to problem (11) where, according to the nature of the objectives, the operator max is replaced by min and the aggregated step cost function $g_t(\cdot)$ in equation (8) is computed as

$g_t(\cdot) = \lambda\, g_t^f(\cdot) + (1 - \lambda)\, g_t^w(\cdot)$   (19)

with $0 \le \lambda \le 1$.
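Equations (17)-(19) translate directly into code; the flood-threshold storage used below is a placeholder, not the actual Lake Como value.

```python
S_FLOOD = 180.0e6   # hypothetical storage at the flood threshold [m^3]

def g_flood(s_t):
    """Step cost of equation (17): one unit per day of flooding."""
    return 0.0 if s_t <= S_FLOOD else 1.0

def g_deficit(r_t1, w_t):
    """Step cost of equation (18): downstream water deficit."""
    return 0.0 if r_t1 >= w_t else w_t - r_t1

def g_aggregate(s_t, r_t1, w_t, lam):
    """Weighted aggregation of equation (19), with 0 <= lam <= 1."""
    return lam * g_flood(s_t) + (1.0 - lam) * g_deficit(r_t1, w_t)
```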

5. Analysis of Fitted Q-iteration Properties

In order to fix an appropriate setting for the fitted Q-iteration algorithm, the definition of a

suitable stopping condition and the influence of the Extra-Trees parameters (K, nmin , and M )

on the performance of the operating policies were analyzed. The policy performance is evaluated

in terms of the values of the objective Jˆh obtained by simulating the policy p̂h , computed with

the data-set F, on an 18-year validation scenario (1980-1997) for different values of h.
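The evaluation just described amounts to simulating the policy over the validation inflow record while accumulating discounted step costs, as in equation (8). The sketch below assumes user-supplied step_model and step_cost callables along the lines of the earlier sketches and an operating rule policy_rule(t, s), e.g. the greedy rule derived from the fitted Q-functions.

```python
def evaluate_policy(policy_rule, s0, inflows, step_model, step_cost,
                    gamma=0.9997, T=365):
    """Simulate an operating policy over a validation inflow scenario and return
    the total discounted cost, i.e., an estimate of the objective of eq. (8)."""
    s, J = s0, 0.0
    for k, a in enumerate(inflows):
        t = k % T                        # day of the year (cyclostationary policy)
        u = policy_rule(t, s)            # release decision from the operating rule (7)
        s_next, r = step_model(s, u, a)  # one-step mass balance, eq. (1)
        J += (gamma ** k) * step_cost(t, s, u, r)
        s = s_next
    return J
```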

5.1. Stopping Condition

The definition of a suitable stopping condition is not straightforward. As anticipated, the

criterion proposed in Section 3 does not make practical sense combined with Extra-Trees. Indeed,

the randomization in the tree building algorithm refreshes the tree structure of the Q-function

approximator at each iteration, so that the distance between two consecutive approximations in

the Q̂h (·) sequence rapidly decreases, but never vanishes. By way of illustration, consider the

trajectory of Jˆh obtained for a given value of λ (Figure 3). After an initial decrease, the value of

Jˆh randomly fluctuates, with an amplitude that reaches 35% of the initial decrease. If the tree

structure were frozen at the first iteration, Jˆh would asymptotically go to a value more or less


far from optimality depending on the accuracy of the approximation. However, because of the

randomization, as h increases the value of Jˆh fluctuates. The recursive nature of the algorithm

filters the pure random fluctuation (high frequency) of the approximation and therefore Jˆh shows

smooth fluctuations. For small h, fluctuations are dominated by the performance improvement

due to the policy learning process and are not evident. When increasing h does not add any

useful information for improving the policy, oscillations become the dominant effect.

To choose the number h̄ of iterations at which to stop the algorithm, it is therefore necessary

to resort to some empirical criterion. In principle, h̄ should be the value of h for which the policy

learning process is nearly over and the improvement in performance is so small as to be swamped

by random fluctuations due to the Extra-Trees approximation. As far as the authors know, no

indicator exists to identify the iteration from which no further improvement can be achieved, but,

heuristically, the lowest value of h corresponding to a minimum of Jˆh (150 days in Figure 3) can

be a proper choice. The reversal should actually be a good signal of the prevalence of random

fluctuations over learning. The following observation supports, in the case study, this choice.

Remember that problem (9) can be rewritten as (10). Stopping the Qh(·) sequence at a finite value h is the same as ignoring the penalty Hh(xh). In this case, the problem solution is unchanged only if Hh(·) is a constant function, since the policy (9) is insensitive to any additive constant. Given the meaning of Hh(·), this happens when a time instant exists, in the period

T , from which the future performance of the system is independent from its past behavior. For

water reservoir systems this happens when the total storage capacity does not allow for multi-

annual operation, as in the case of Lake Como. There, the irrigation season begins in mid-April, just after the snowmelt, and ends around the third week of September. In autumn the lake

is re-filled by floods and during springtime mainly by snowmelt. As a result, the storage at the


beginning of the new irrigation season is completely independent of the operation performed in the previous season and, thus, as far as the irrigation component is concerned, Hh(·) takes a

constant value. Similar reasoning applies to floods, as they occur at the end of October and

the lake can be emptied in 15 days. Therefore, to design a receding horizon policy, as the fitted Q-iteration algorithm de facto does, 5 months (about 150 days) are enough, and the first minimum

in Figure 3 is close to this value. This observation not only supports the empirical criterion

proposed above, but also suggests another, to some extent more efficient, stopping criterion (it does not require computing Jˆh after each algorithm iteration): whenever the problem can be

re-framed as a receding h̄-step horizon problem, h̄ is the natural stopping limit of the algorithm,

since the policy learning process does not improve anymore when the number of iterations is

nearly equal to h̄. Beyond this limit, performance begins to oscillate due to the randomization

in the fitted Q-iteration and therefore it is rational to stop at h̄. When the problem must be

solved over an infinite horizon, no criterion is available to date.

How to reduce the mean approximation error due to the tree-building process and how to smooth the amplitude of the random fluctuations, both of which are affected by the Extra-Trees parameters, has not yet been clarified and is dealt with in the next section.

5.2. Sensitivity to Extra-Trees Parameters

5.2.1. Setting K

The parameter K fixes the number of regression inputs randomly selected at each node and

should be equal to the number of inputs (see Section 3.3), i.e., the arguments of the Q-function:

time t, storage st , and release decision ut .

5.2.2. Setting nmin


The value nmin , the minimum sample size for splitting a node, determines the number of leaves

in a tree and thus the ensemble’s trade-off between bias and variance. By way of illustration,

in Figure 4 (top panel) the experiment in Figure 3 is replicated for nmin = 15 (dashed line).

Reducing nmin decreases the bias (on average, the performance is closer to the optimal one) but negatively affects the variance (larger amplitude of the fluctuations). As anticipated in Section 3,

when dealing with stochastic function approximation the regression algorithm should provide the conditional expectation of the output given the input. nmin should therefore be at least equal to the number of disturbance realizations available for each state-decision pair ((t, x_t)^l , u_t^l ). Since the learning data-set F of the Lake Como water system was generated using a 15-year-long scenario,

the best performance is expected to be obtained for nmin ≥ 15. This is confirmed in Figure 4

(bottom panel).

5.2.3. Setting M

The larger the number M of trees in the forest, the smaller the variance and thus the higher

the smoothing effect on the fluctuations (top panel of Figure 5). The reduction in the variance

has a positive effect on the Extra-Trees estimation error, which is reflected in a strong reduction in the distance between the performance in calibration (dashed line) and validation (solid line) as

M grows from 1 to 10 (bottom panel of Figure 5). From the bottom panel it also appears that

increasing M slightly reduces the bias. Nonetheless, the computation time increases linearly with M and a balance must be struck between accuracy and time requirements. The saturation effect can help in choosing a suitable value: the improvement in the value of Jˆ150 on validation for M greater than 30-40 is negligible (solid line in the bottom panel of Figure 5).
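As a hedged illustration of how the three parameters discussed above interact, the snippet below maps them onto a standard Extra-Trees implementation (scikit-learn's ExtraTreesRegressor); the paper relies on the original algorithm of Geurts et al. [2006], so this is an analogy rather than the authors' code, and the numerical values simply echo the Lake Como setting described in the text.

from sklearn.ensemble import ExtraTreesRegressor

K = 3        # inputs randomly drawn at each node: time t, storage s_t, decision u_t
n_min = 50   # minimum sample size for splitting a node (at least the 15 realizations per pair)
M = 50       # number of trees: larger M smooths fluctuations, at linearly higher cost

q_regressor = ExtraTreesRegressor(
    n_estimators=M,           # ensemble size
    max_features=K,           # K equal to the number of regression inputs
    min_samples_split=n_min,  # bias/variance trade-off of each tree
    bootstrap=False,          # Extra-Trees grow each tree on the full learning set
)
# At every fitted Q-iteration step the ensemble would be re-fitted on the
# current targets, e.g., q_regressor.fit(X, y) with X = (t, s_t, u_t) samples.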

6. Fitted Q-iteration vs SDP


The potential of the fitted Q-iteration was analyzed via comparison with an equivalent SDP

formulation. The learning data-set F of fitted Q-iteration was generated using a partially model-free approach (see Section 4.2). As for SDP, in accordance with the requirement of explicitly modeling all

the system components, the inflow at+1 was described as a cyclostationary (with period T =365),

log-normal, stochastic process, whose pdf is defined by the parameters µt and σt , i.e., a log-

PAR(0) model was assumed (for more details see Pianosi and Soncini-Sessa [2009]):

a_{t+1} = e^{\sigma_{t \bmod T}\, \varepsilon_{t+1} + \mu_{t \bmod T}}, \qquad \varepsilon_{t+1} \sim N(0, 1) \qquad (20)

where εt+1 is Gaussian white noise. The state-decision domain was discretized using a dense grid of 27,048 points (Nst = 161; Nut = 168, see Figure 6), while a 9-point grid was used for the inflow (Nεt+1 = 9). Since each storage class (excluding the 5 highest classes, which were required to set the upper boundary condition) is nearly 4.2 × 103 m3 (corresponding to 2.8 cm of level variation) and each release decision class is 3 m3 , the grid can reasonably be considered very close to a continuum.
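For illustration only, the following sketch samples synthetic inflows from the cyclostationary log-normal model of equation (20); the function name and the constant parameter vectors are hypothetical placeholders, not the estimates used for Lake Como.

import numpy as np

def simulate_logpar0_inflow(mu, sigma, n_days, rng=None):
    # Equation (20): a_{t+1} = exp(sigma[t mod T] * eps_{t+1} + mu[t mod T]),
    # with eps_{t+1} ~ N(0, 1); mu and sigma are the day-of-year parameters.
    rng = np.random.default_rng() if rng is None else rng
    T = len(mu)
    eps = rng.standard_normal(n_days)
    t = np.arange(n_days) % T
    return np.exp(sigma[t] * eps + mu[t])

# Hypothetical, seasonally constant parameters (not the Lake Como estimates):
T = 365
mu = np.full(T, 4.0)      # illustrative log-mean of the daily inflow
sigma = np.full(T, 0.5)   # illustrative log-standard deviation
inflow = simulate_logpar0_inflow(mu, sigma, n_days=10 * T)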

Simulation analysis was used to comparatively evaluate the efficient daily operating policies

obtained with fitted Q-iteration and SDP against the historical operation. A number of policies were designed, corresponding to different values of the weights, and evaluated on both the

calibration and validation scenarios (starting from the historical storage). The resulting images

of the Pareto front are shown in Figure 7. It can be seen that the operations based on fitted

Q-iteration (solid line & black circles) slightly outperform those obtained with SDP (dashed line

& white circles), while the historical operation is noticeably far from efficiency. The improvement is most marked where the front has the strongest curvature (the so-called knee of the front).

Around that area, the objectives are more conflicting and the optimal Q∗ -functions are strongly

non-linear: the continuous approximation of the fitted Q-iteration, though based on a very coarse


grid, is more accurate than the SDP look-up table, even though the latter is based on a nearly continuous state-decision discretization grid.

By way of demonstration, the policy associated with point A in Figure 7, derived with the

fitted Q-iteration, dominates the corresponding policy A′ , obtained with SDP, by one day of flooding per year and nearly 3.5 × 106 m3 of deficit per year. Both policies suggest supplying exactly the water demand (front flat area in panels (a) and (c) of Figure 8) throughout the year for a

relatively wide range of storage values and strongly increase the release rate during the two flood

seasons. In so doing they create a time-varying flood buffer zone, whose size is optimally designed, being either learnt from the flood events and the associated effects available in the data-set (fitted Q-iteration) or implicitly inferred from the stochastic inflow model (SDP). Such a buffer is, however, distinctly larger with policy A (panel (a)), as a result of a more accurate

approximation of the Q-function (compare panels (b) and (d), where max_{u_t} Q_t (x_t , u_t ) is plotted), particularly for high values of the storage. The improved ability of policy A to deal

with floods is most evident for ordinary floods (Figure 9), when, thanks to its time-varying nature, it is able to release in advance a volume large enough to buffer the entire flood. Clearly, the

difference between the two policies vanishes for extreme floods (Figure 10), when the lake level

rapidly rises over the upper bound of the regulation range and the dam must be completely

opened. In these cases, both policies can only delay flooding by one or two days and slightly reduce the flood peak with respect to the historical operation. As far as the water deficit is

concerned, the relatively small improvement of policy A over policy A′ is basically due to small differences in a limited number of droughts, in which fitted Q-iteration shows a somewhat better ability to anticipate the inflow and therefore keeps supplying the water demand for longer than SDP


does. An example is provided in Figure 11. The same example shows that both policies significantly outperform the historical operation, which appears to be much more risk-averse.

By moving toward the left extreme of the Pareto front (points B and B′ in Figure 7), i.e., by increasing the relative importance of irrigation over floods, SDP performs better than fitted Q-iteration. This is basically due to the approximation error in the tree-based interpolation of the Q-functions. Indeed, as the importance of irrigation increases, the conflict with flooding becomes negligible and the optimal policy simply suggests releasing the water demand. As anticipated, water demand values belong to the release decision discretization grid for both algorithms. However, while the release decision chosen by SDP is necessarily a grid point, and thus a water demand value, fitted Q-iteration uses a continuous approximation, which sometimes fails to determine the exact demand value.

An interesting property of fitted Q-iteration is that

time t is among the arguments of the Q-function: the continuous value function approximation is also computed with respect to t. An implicit clustering of the operating rules with respect to time is automatically performed by the Extra-Trees building algorithm and thus the resulting

operating policies do not necessarily change on a daily basis. This is evident in Figure 8 (panel

(a)): during periods characterized by a reduced variance of the inflow process and a constant

downstream water demand, the operating rules show the same behavior. For example, during

the first 30 days of the year they are monotonically increasing with the storage and provide the water demand (equal to 99 m3 /s) as the release decision for most of the storage values, until large

values are reached. At this stage, all rules suddenly increase the flow to be released, with a small

difference from rule to rule. This amounts to an operating policy that remains unchanged within the month.
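To illustrate how an operating rule such as those in Figure 8 is read off the approximated Q-function, the sketch below evaluates a fitted regressor on the coarse release-decision grid and picks the best decision for the current day and storage. The function name, the regressor interface, and the minimize flag are assumptions made for illustration; depending on whether the objectives are formulated as costs or rewards, the best decision is the minimizer or the maximizer of the Q-values.

import numpy as np

def greedy_release(q_model, t, storage, u_grid, minimize=True):
    # Evaluate Q(t, s_t, u) on the candidate release decisions and return the
    # best one; q_model is any fitted regressor exposing .predict on
    # (t, s_t, u_t) triples, e.g., the Extra-Trees ensemble sketched earlier.
    X = np.column_stack([
        np.full(len(u_grid), t),         # day of the year
        np.full(len(u_grid), storage),   # current storage
        np.asarray(u_grid, dtype=float)  # candidate release decisions
    ])
    q_values = q_model.predict(X)
    best = np.argmin(q_values) if minimize else np.argmax(q_values)
    return u_grid[best]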


6.1. Computational Requirements

A comparative analysis of the computational requirements of fitted Q-iteration and SDP can be performed empirically by inferring some general rules from the computing times recorded for the Lake Como case study.

As anticipated, the time tSDP required to design an operating policy with SDP is proportional

to the number of evaluations of the operator E[·], which is given by equation (15): by splitting

the state dimensionality (nx ) into ns storages and nI hydro-meteorological variables, the time tSDP can be expressed as

t_{SDP} = a \cdot k \, T \cdot N_{s_t}^{n_s} \cdot N_{I_t}^{n_I} \cdot N_{u_t}^{n_u} \cdot N_{\varepsilon_t}^{n_\varepsilon} \qquad (21)

where a is a constant, machine-dependent parameter. An estimate of a (on an Intel Xeon 3.16 GHz quad-core with 16 GB RAM) was obtained from the time tSDP required to compute a policy for Lake Como (k = 4; T = 365; Nst = 161; ns = 1; NIt = 0; nI = 0; Nut = 168; nu = 1; Nεt = 9; nε = 1). Notice that the estimate so obtained is conservative with respect to one obtained on a more complex system, since a number of operations performed by the coded algorithm are carried out only once, independently of the system complexity. This makes the estimate of a (and of b and c below) on a simple system much larger than the equivalent on a larger system.

The computing time tQ associated with fitted Q-iteration is the combination of the time tQ1 required to build the learning data-set F and the time tQ2 for running the tree-based regression algorithm h̄ times (h̄ = 150, as explained in Section 5.1) on the training data-set. When a partially model-free approach is adopted, as in the Lake Como case, tQ1 depends only upon the number of components of the states (storages) and release decisions for which a model is identified, and not

on the number of model-free components of the states (hydro-meteorological information) and


disturbances. Assuming that the fitted Q-iteration coarse grid is obtained by reducing the dense grid of an equally performing SDP by a factor rs and ru for state and decision, respectively, tQ1 is
t_{Q1} = b \cdot T \cdot \left( \frac{N_{s_t}}{r_s} \right)^{n_s} \cdot \left( \frac{N_{u_t}}{r_u} \right)^{n_u} \cdot N_a \qquad (22)

where b is a constant, machine-dependent parameter, and Na is the number of disturbance

realizations (i.e., the number of years in the historical data set used for the inflow and any other

hydro-meteorological information). Time tQ2 grows linearly in the time horizon h̄, in the number of regressors k (i.e., ns + nI + nu + 1) and in the number of trees M , and superlinearly in the number #F (i.e., (N_{s_t}/r_s)^{n_s} \cdot (N_{u_t}/r_u)^{n_u} \cdot N_a \cdot T) of four-tuples in the data-set. Precisely,

t_{Q2} = c \cdot \left( \frac{N_{s_t}}{r_s} \right)^{n_s} \cdot \left( \frac{N_{u_t}}{r_u} \right)^{n_u} \cdot N_a \cdot T \cdot \log\!\left( \left( \frac{N_{s_t}}{r_s} \right)^{n_s} \cdot \left( \frac{N_{u_t}}{r_u} \right)^{n_u} \cdot N_a \cdot T \right) \cdot M \cdot (n_s + n_I + n_u + 1) \cdot \bar{h} \qquad (23)

where c is a constant, machine-dependent parameter. The parameters b and c were obtained from the times tQ1 and tQ2 recorded for the Lake Como problem (T = 365; Nst = 10; ns = 1; NIt = 0; nI = 0; Nut = 8; nu = 1; Na = 15). Also the reduction rates (rs = 16 and ru = 21)

were derived from the Lake Como case study. Figure 12 shows the computing times (given by

the above relationships with the estimated values of a, b, and c) plotted for increasing values

of the state vector dimension for two different system configurations. The top panel refers to

a reservoir network of ns reservoirs, each one with a single outlet (i.e., nu = ns ). All the

reservoirs have similar size and are therefore modeled with the same state and release decision

discretization grid (say Nst = 50 and Nut = 20, which are reasonable values for artificial reservoir

networks). Each reservoir is assumed to be fed by its own catchment (i.e., nε = ns ) described as

a disturbance, with Nεt = 10. The bottom panel was obtained assuming a single reservoir system (i.e., ns = nu = nε = 1) and an increasing number of additional hydro-meteorological exogenous variables. The advantage of fitted Q-iteration (solid line & black circles) over SDP (dashed


lines & white circles) is evident from both panels. Fitted Q-iteration goes well beyond the computational limit of SDP (i.e., ns ≃ 2), handling complex networks of up to 5 reservoirs (top panel). The improvement is even more remarkable when the operating policy depends upon exogenous information (bottom panel), as the model-free (uncontrolled) components come at nearly no additional computational time for fitted Q-iteration: while SDP requires more than 5 days for a configuration with 1 reservoir and 2 exogenous variables, fitted Q-iteration requires only 1.5 hours.
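Equations (21)-(23) are simple enough to be coded directly; the sketch below (with placeholder values for the machine-dependent constants a, b, and c, which in the paper are estimated from the Lake Como runs) reproduces the kind of comparison shown in Figure 12.

import numpy as np

def t_sdp(a, k, T, N_s, n_s, N_I, n_I, N_u, n_u, N_eps, n_eps):
    # Equation (21): SDP time, proportional to the number of E[.] evaluations.
    return a * k * T * N_s**n_s * N_I**n_I * N_u**n_u * N_eps**n_eps

def t_q1(b, T, N_s, n_s, N_u, n_u, r_s, r_u, N_a):
    # Equation (22): time to build the learning data-set F on the coarse grid.
    return b * T * (N_s / r_s)**n_s * (N_u / r_u)**n_u * N_a

def t_q2(c, T, N_s, n_s, N_u, n_u, r_s, r_u, N_a, M, n_I, h_bar):
    # Equation (23): h_bar tree-based regressions on the #F four-tuples.
    n_F = (N_s / r_s)**n_s * (N_u / r_u)**n_u * N_a * T
    k = n_s + n_I + n_u + 1  # number of regressors
    return c * n_F * np.log(n_F) * M * k * h_bar

# Placeholder constants (purely illustrative, not the estimated values):
a, b, c = 1e-6, 1e-4, 1e-7
# Single-reservoir configuration (N_I = 1 is a neutral value since n_I = 0 here).
print(t_sdp(a, k=4, T=365, N_s=161, n_s=1, N_I=1, n_I=0, N_u=168, n_u=1, N_eps=9, n_eps=1))
print(t_q1(b, T=365, N_s=161, n_s=1, N_u=168, n_u=1, r_s=16, r_u=21, N_a=15))
print(t_q2(c, T=365, N_s=161, n_s=1, N_u=168, n_u=1, r_s=16, r_u=21, N_a=15, M=50, n_I=0, h_bar=150))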

7. Conclusions

One major technical challenge in expanding the scope of water resources management across

sectors (social, economic, environmental) is to develop new methodologies and tools to cope

with the increasing complexity of water systems. When dealing with large water resources

systems, the dual curse of dimensionality and modeling makes the adoption of Stochastic Dynamic

Programming (SDP) definitely impracticable without resorting to one of the available varieties of simplifications and approximations, which usually make the resulting operating policies inefficient in practice. This paper provides encouraging evidence that Reinforcement Learning (RL)

may provide an important and robust alternative for mitigating these curses of SDP. A recent RL

approach is presented, called fitted Q-iteration, which combines continuous approximation of the

value functions with an iterative, batch-mode, learning process from an off-line generated data-

set to design daily, cyclostationary operating policies. The continuous approximation makes it

possible to mitigate the curse of dimensionality by adopting a very coarse discretization grid with

respect to the dense grid required to design an equally performing policy via SDP. The use of a

learning data-set, with basically no requirements on the way this is generated, allows overcoming

the curse of modeling.


The application to Lake Como water system was used to infer general guidelines on the ap-

propriate setting for the algorithm parameters, to define an empirical stopping condition, and

to demonstrate the potential of the approach compared to traditional SDP. The policy obtained

with fitted Q-iteration on an extremely coarse state-decision discretization grid was shown to

generally outperform an equivalent SDP-derived policy computed on a very dense grid. The

dominance is particularly remarkable for flood events (Figure 9), when the time-varying nature of both policies, which is key to anticipating and buffering floods when no inflow predictions are considered, is more effectively exploited by fitted Q-iteration.

Lake Como, a general rule was also derived to quantify the computational advantages of the

fitted Q-iteration over SDP in designing daily operating policies for large water systems. The

current SDP limit of 2-3 state variables can be extended to 5-6 state variables when the state variables are all reservoir storages, and to many more when several storages and a large amount of exogenous information are considered or the exogenous information is strongly temporally correlated. For instance, a network of 4 reservoirs with the operating policy depending on 12 exogenous variables (e.g., the inflow to each reservoir is described by three variables: 2 autoregressive terms and the precipitation) requires less than 5 days. These bounds can

be further improved by exploiting the intrinsically parallel nature of tree ensembles, which are composed of independent trees. By using a multi-threaded implementation, the total computing time can be reduced by up to a factor of M . This will be the subject of future research. Future com-

putational improvements also include the adoption of efficient discretization techniques, to more

effectively explore the state-decision space in generating the learning data-set [Cervellera et al.,

2006]; the use of clustering algorithms [Nguyen and Smeulders, 2004] to clean the learning data-set of redundancies; and the exploration of policy refinement approaches [Bonarini et al.,


2007] combined with a policy reconstruction procedure [Schneegaß et al., 2007], where the current operating policy is first identified and then iteratively improved by the algorithm.

While the Extra-Trees used by the fitted Q-iteration have been shown to offer a good accu-

racy/efficiency trade-off, this comes at the price of lacking a well-defined and consistent stopping

condition, which, in turn, might negatively affect both the accuracy (the policy obtained is not

the best one explored) and the efficiency (a better policy could have been found by stopping the

algorithm earlier). Strictly speaking, this happens because the tree structures are refreshed at each iteration of the fitted Q-iteration algorithm, which is key to building an accurate approximation in the early stages of the algorithm, but prevents the approximated Q-functions from stabilizing, even when the

improvement in the policy performance is marginal. Further investigations are required in this

direction, including the freezing of the tree structure after some iterations [Guez et al., 2008].

An important feature of the algorithm, which has been theoretically investigated in the paper but surely deserves further study, is the great flexibility it offers in dealing with operating policies conditioned on any kind of exogenous information, even information not necessarily expressed in a quantitative form. Future research will include evaluating the potential improvement in operation performance obtained both by using traditional inflow predictions and by directly employing the information that could be useful in formulating such predictions (e.g., precipitation, temperature, snow pack depth). The batch nature of fitted Q-iteration has another important implication for the applicability of optimal control to a wide range of water-

related applications, especially to water quality management, integration of quality and quantity (e.g., selective withdrawal systems), groundwater management and surface-groundwater inter-

action, and, more generally, to all those applications involving spatially distributed states and

rather intricate process-based models, which constitute a major barrier to the use of traditional


control approaches. These also include process-based rainfall-runoff modeling, as required to generate climate change scenarios and investigate adaptive management strategies. In particular, the combined use of fitted Q-iteration and model reduction techniques [Castelletti et al., 2009] is worth exploring for these purposes.

Acknowledgments. The work was completed while Andrea Castelletti, Stefano Galelli and

Rodolfo Soncini-Sessa were on leave at the Centre for Water Research, University of Western

Australia. This paper forms CWR reference 2329 AC.

References

Archibald, T., K. McKinnon, and L. Thomas (1997), An aggregate stochastic dynamic program-

ming model of multireservoir systems, Water Resources Research, 33 (2), 333–340.

Aufiero, A., R. Soncini-Sessa, and E. Weber (2001), Set-valued control laws in minmax control

problem, in Proceedings of IFAC Workshop Modelling and Control in Environmental Issues,

August 22-23, Yokohama, J.

Barto, A., and R. Sutton (1998), Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA.

Bellman, R. (1957), Dynamic Programming, Princeton University Press, Princeton, NJ.

Bellman, R., and S. Dreyfus (1962), Applied Dynamic Programming, Princeton University Press,

Princeton, NJ.

Bellman, R., R. Kalaba, and B. Kotkin (1963), Polynomial approximation - a new computational

technique in dynamic programming, Mathematics of Computation, 17 (8), 155–161.

Bertsekas, D., and J. Tsitsiklis (1996), Neuro-Dynamic Programming, Athena Scientific, Boston,

MA.


Bhattacharya, A., A. Lobbrecht, and D. Solomatine (2003), Neural networks and reinforcement

learning in control of water systems, Journal of Water Resources Planning and Management-

ASCE, 129 (6), 458–465.

Bonarini, A., A. Lazaric, and M. Restelli (2007), Piecewise constant reinforcement learning for

robotic applications, in Proceedings of the 4th International Conference on Informatics in

Control, Automation and Robotics (ICINCO 2007).

Breiman, L. (1996), Bagging predictors, Machine Learning, 24 (2), 123–140.

Breiman, L. (2001), Random forests, Machine Learning, 45 (1), 5–32.

Breiman, L., J. Friedman, R. Olshen, and C. Stone (1984), Classification and Regression Trees, Wadsworth & Brooks, Pacific Grove, CA.

Castelletti, A., and R. Soncini-Sessa (2007), Coupling real-time control and socio-economic issues

in participatory river basin planning, Environmental Modelling & Software, 22 (8), 1114–1128.

Castelletti, A., G. Corani, A. Rizzoli, R. Soncini-Sessa, and E. Weber (2001), A reinforcement

learning approach for the operational management of a water system, in Proceedings of IFAC

Workshop Modelling and Control in Environmental Issues, August 22-23, Yokohama, J.

Castelletti, A., D. de Rigo, R. Soncini-Sessa, A. Rizzoli, and E. Weber (2005), An improved tech-

nique for neuro-dynamic programming applied to the efficient and integrated water resources

management, in Proceedings of 16th IFAC World Congress, July 4-8, Prague, CZ.

Castelletti, A., D. de Rigo, A. Rizzoli, R. Soncini-Sessa, and E. Weber (2007), Neuro-dynamic

programming for designing water reservoir network management policies, Control Engineering

Practice, 15 (8), 1001–1011.

Castelletti, A., F. Pianosi, and R. Soncini-Sessa (2008), Water reservoir control under economic,

social and environmental constraints, Automatica, 44 (6), 1595–1607.


Castelletti, A., M. De Zaiacomo, S. Galelli, M. Restelli, P. Sanavia, R. Soncini-Sessa, and

J. Antenucci (2009), An emulation modelling approach to reduce the complexity of a 3D

hydrodynamic-ecological model of a reservoir, in Proceedings of International Symposium on

Environmental Software Systems (ISESS2009), October 2-9, Venice, I.

Cervellera, C., V. Chen, and A. Wen (2006), Optimization of a large-scale water reservoir network

by stochastic dynamic programming with efficient state space discretization, European Journal

of Operational Research, 171 (3), 1139–1151.

Cohn, D., Z. Ghahramani, and M. Jordan (1996), Active learning with statistical models, Journal

of artificial intelligence research, 4, 129–145.

Ernst, D. (1999), Near optimal closed-loop control. application to electric power systems, Ph.D.

thesis, University of Liege, Belgium.

Ernst, D. (2005), Selecting concise sets of samples for a reinforcement learning agent, in Pro-

ceedings of the 3rd International Conference on Computational Intelligence, Robotics and Au-

tonomous Systems (CIRAS 2005), December 10-14, Singapore.

Ernst, D., P. Geurts, and L. Wehenkel (2005), Tree-based batch mode reinforcement learning,

Journal of Machine Learning Research, 6, 503–556.

Esogbue, A. (1989), Dynamic Programming for Optimal Water Resources Systems Analysis,

chap. Dynamic programming and water resources: Origins and interconnections, Prentice-Hall,

Englewood Cliffs, NJ.

Foufoula-Georgiou, E., and P. Kitanidis (1988), Gradient dynamic programming for stochastic

optimal control of multidimensional water resources systems, Water Resources Research, 24,

1345–1359.


Galelli, S., and R. Soncini-Sessa (2010), Combining metamodelling and stochastic dynamic pro-

gramming for the design of reservoirs release policies, Environmental Modelling & Software,

25 (2), 209–222.

Galelli, S., C. Gandolfi, R. Soncini-Sessa, and D. Agostani (2010), Building a metamodel of

an irrigation district distributed-parameter model, Agricultural Water Management, 97 (2),

187–200.

Gaskett, C. (2002), Q-learning for robot control, Ph.D. thesis, Australian National University,

Canberra, AUS.

Geurts, P., D. Ernst, and L. Wehenkel (2006), Extremely randomized trees, Machine Learning,

63 (1), 3–42.

Gilbert, K., and R. Shane (1982), TVA hydroscheduling model: theoretical aspects, Journal of

Water Research Planning and Management - ASCE, 108 (1), 21–36.

Gordon, G. (1995), Online fitted reinforcement learning, in Proceedings of the Workshop on Value

Function Approximation at the 12th International Conference on Machine Learning, July 9,

Tahoe City, CA.

Guariso, G., S. Rinaldi, and R. Soncini-Sessa (1985), Decision support systems for water manage-

ment - the Lake Como case study, European Journal of Operational Research, 21 (3), 295–306.

Guariso, G., S. Rinaldi, and R. Soncini-Sessa (1986), The management of Lake Como - a multi-

objective analysis, Water Resources Research, 22 (2), 109–120.

Guez, A., R. Vincent, M. Avoli, and J. Pineau (2008), Adaptive treatment of epilepsy via batch-

mode reinforcement learning, in Proceedings of the 23rd AAAI Conference on Artificial Intel-

ligence, July 13-17, pp. 1671–1678, Chicago, IL.


Haimes, Y. (1977), Hierarchical Analyses of Water Resources Systems, McGraw-Hill, New York,

NY.

Hall, W., and N. Buras (1961), The dynamic programming approach to water resources devel-

opment, Journal of Geophysical Research, 66 (2), 510–520.

Hall, W., W. Butcher, and A. Esogbue (1968), Optimization of the operation of a multi-purpose

reservoir by dynamic programming, Water Resources Research, 4 (3), 471–477.

Heidari, M., V. Chow, P. Kokotovic, and D. Meredith (1971), Discrete differential dynamic

programming approach to water resources systems optimisation, Water Resources Research,

7 (2), 273–282.

Hejazi, M., X. Cai, and B. Ruddell (2008), The role of hydrologic information in reservoir oper-

ation - learning from historical releases, Advances in Water Resources, 31 (12), 1636–1650.

Hooper, E., A. Georgakakos, and D. Lettenmaier (1991), Optimal stochastic operation of Salt

River Project, Arizona, Journal of Water Research Planning and Management - ASCE, 117 (5),

556–587.

Jacobson, H., and Q. Mayne (1970), Differential dynamic programming, American Elsevier, New

York, NY.

Johnson, S., J. Stedinger, C. Shoemaker, Y. Li, and J. Tejada-Guibert (1993), Numerical solu-

tion of continuous-state dynamic programs using linear and spline interpolation, Operations

Research, 41, 484–500.

Kalyanakrishnan, S., and P. Stone (2007), Batch reinforcement learning in a complex domain,

in The Sixth International Joint Conference on Autonomous Agents and Multiagent Systems,

May.


Kelman, J., J. Stedinger, L. Cooper, E. Hsu, and S. Yuan (1990), Sampling Stochastic Dynamic

Programming applied to reservoir operation, Water Resources Research, 26 (3), 447–454.

Labadie, J. (2004), Optimal operation of multireservoir systems: State-of-the-art review, Journal

of Water Research Planning and Management - ASCE, 130 (2), 93–111.

Larson, R. (1968), State Incremental Dynamic Programming, American Elsevier, New York, NY.

Lee, J.-H., and J. W. Labadie (2007), Stochastic optimization of multireservoir systems via

reinforcement learning, Water Resources Research, 43 (11), 1–16.

Luenberger, D. (1971), Cyclic dynamic programming: a procedure for problems with fixed delay,

Operations Research, 19 (4), 1101–1110.

Nardini, A., C. Piccardi, and R. Soncini-Sessa (1992), On the integration of risk-aversion and

average-performance optimization in reservoir control, Water Resources Research, 28 (2), 487–

497.

Nguyen, H., and A. Smeulders (2004), Active learning using pre-clustering, in Proceedings of the

21st International Conference on Machine Learning, ACM New York, NY.

Orlovski, S., S. Rinaldi, and R. Soncini-Sessa (1984), A min-max approach to reservoir manage-

ment, Water Resources Research, 20 (11), 1506–1514.

Ormoneit, D., and S. Sen (2002), Kernel-based reinforcement learning, Machine Learning, 49 (2-

3), 161–178.

Philbrick, C., and P. Kitanidis (2001), Improved dynamic programming methods for optimal

control of lumped-parameter stochastic systems, Operations Research, 49, 398–412.

Pianosi, F., and R. Soncini-Sessa (2009), Real-time management of a multipurpose water

reservoir with a heteroscedastic inflow model, Water Resources Research, 45 (10), W10430,

doi:10.1029/2008WR007335.


Piccardi, C., and R. Soncini-Sessa (1991), Stochastic dynamic programming for reservoir opti-

mal control: dense discretization and inflow correlation assumption made possible by parallel

computing, Water Resources Research, 27 (5), 729–741.

Read, E. (1989), Dynamic Programming for Optimal Water Resources Systems Analysis, chap. A

dual approach to stochastic dynamic programming for reservoir release scheduling, pp. 361–372,

Prentice-Hall, Englewood Cliffs.

Saad, M., and A. Turgeon (1988), Application of principal component analysis to long-term

reservoir management, Water Resources Research, 24 (7), 907–912.

Saad, M., A. Turgeon, and J. Stedinger (1992), Censored-data correlation and principal com-

ponent dynamic programming, Water Resources Research, 28 (8), 2135–2140.

Saad, M., A. Turgeon, P. Bigras, and R. Duquette (1994), Learning disaggregation technique for

the operation of long-term hydroelectric power systems, Water Resources Research, 30 (11),

3195–3203.

Schneegaß, D., S. Udluft, and T. Martinetz (2007), Improving optimality of neural rewards

regression for data-efficient batch near-optimal policy identification, Lecture Notes in Computer

Science, 4668, 109.

Soncini-Sessa, R., A. Castelletti, and E. Weber (2007), Integrated and participatory water re-

sources management. Theory, Elsevier, Amsterdam, NL.

Tejada-Guibert, J., S. Johnson, and J. Stedinger (1995), The value of hydrologic information in

stochastic dynamic programming models of a multireservoir system, Water Resources Research,

31 (10), 2571–2579.

Trott, W., and W. Yeh (1973), Optimization of multiple reservoir systems, Journal of the Hy-

draulic Division ASCE, 99, 1865–1884.



Tsitsiklis, J., and B. Van Roy (1996), Feature-based methods for large scale dynamic programming,

Machine Learning, 22, 59–94.

Turgeon, A. (1980), Optimal operation of multi-reservoir power systems with stochastic inflows,

Water Resources Research, 16 (2), 275–283.

Turgeon, A. (1981), A decomposition method for the long-term scheduling of reservoirs in series,

Water Resources Research, 17 (6), 1565–1570.

Vasiliadis, H., and M. Karamouz (1994), Demand-driven operation of reservoirs using

uncertainty-based optimal operating policies, Journal of Water Research Planning and Man-

agement - ASCE, 120 (1), 101–114.

Watkins, C., and P. Dayan (1992), Q-learning, Machine Learning, 8 (3-4), 279–292.

Wong, P., and D. Luenberger (1968), Reducing the memory requirements of dynamic program-

ming, Operations Research, 16 (6), 1115–1125.

Yakowitz, S. (1982), Dynamic programming applications in water resources, Water Resources

Research, 18 (4), 673–696.

Yeh, W. (1985), Reservoir management and operations models: a state of the art review, Water

Resources Research, 21 (12), 1797–1818.

Young, P. C. (2006), The data-based mechanistic approach to the modelling, forecasting and

control of environmental systems, Annual Reviews in Control, 30 (2), 169–182.

Figure 1. Alternative approaches to explore the state-decision set: refinement of the historical policy (a); dense grid discretization (b); coarse grid discretization (c); efficient discretization (d).

Figure 2. The Lake Como water system.

Figure 3. Trajectory of Jˆh obtained with λ = 0.5, K = 3, nmin = 50 and M = 50 on the

validation scenario (top panel ). Trajectories of Jˆh obtained by running the fitted Q-iteration twice with the same setting and λ = 1 (i.e., Jˆh = Jˆ_h^w ) (bottom panel ).

Figure 4. Trajectories of Jˆh obtained with nmin = 50 (solid line) and nmin = 15 (dashed line)

(top panel ). The value of Jˆ150 for different nmin (bottom panel ). In both panels the validation

scenario was considered and λ = 0.5, K = 3 and M = 50.

Figure 5. Trajectories of Jˆh obtained with M = 50 (black circles) and M = 15 (white

circles) (top panel ). The value of Jˆ150 for different values of M on the validation (solid line) and

calibration (dashed line) scenario (bottom panel ). In both panels λ = 0.5, K = 3 and nmin = 50.

Figure 6. SDP dense discretization grid (27,048 points) and fitted Q-iteration coarse dis-

cretization grid (80 points).

Figure 7. Images of the Pareto fronts obtained with fitted Q-iteration (solid line & black

circles) and SDP (dashed line & white circles) via simulation on the calibration (top panel ) and

validation (bottom panel ) scenarios.

Figure 8. The operating policy corresponding to point A (obtained with fitted Q-iteration) in

Figure 7 (panel (a)) and the associated cost-to-go function derived with equation (13) from the

Q-function (panel (b)). In the bottom panel the same plots for point A′ (obtained with SDP).

Figure 9. Fitted Q-iteration (solid line and black circles), SDP (dashed line and white circles)

and historical operation (solid line) in a flood event in 1981 (validation scenario). In the top

panel the daily release and the inflow (gray line); in the bottom panel the lake level and flood

threshold.


Figure 10. Fitted Q-iteration (solid line and black circles), SDP (dashed line and white circles)

and historical operation (solid line) in the biggest flood event of the century (validation scenario).

In the top panel the daily release and the inflow (gray line); in the bottom panel the lake level

and flood threshold.

Figure 11. Fitted Q-iteration (solid line and black circles), SDP (dashed line and white circles)

and historical operation (solid line) in a drought event in 1982 (validation scenario). In the top

panel the daily release, the inflow (gray line), and the water demand (dashed line); in the bottom

panel the lake storage.

Figure 12. Comparison of the computational requirements of SDP (dashed lines & white

circles) and fitted Q-iteration (solid line & black circles) for increasing number of state variables,

obtained on an Intel Xeon 3.16 GHz quad-core machine with 16 GB RAM for two different water system configurations, as explained in the text.
