
Proceedings of the 34th Conference on Decision & Control
New Orleans, LA, December 1995
WM03 1:30

Neuro-Dynamic Programming: An Overview


Dimitri P. Bertsekas John N. Tsitsiklis
bertsekas@lids.mit.edu          jnt@mit.edu

Laboratory for Information and Decision Systems


Massachusetts Institute of Technology
Cambridge, MA 02139, USA

ABSTRACT

We discuss a relatively new class of dynamic programming methods for control and sequential decision making under uncertainty. These methods have the potential of dealing with problems that for a long time were thought to be intractable due to either a large state space or the lack of an accurate model. The methods discussed combine ideas from the fields of neural networks, artificial intelligence, cognitive science, simulation, and approximation theory. We delineate the major conceptual issues, we survey a number of recent developments, we describe some computational experience, and we address a number of open questions.

1 Introduction

In this presentation we consider systems where decisions are made in stages. The outcome of each decision is not fully predictable but can be anticipated to some extent before the next decision is made. Each decision results in some immediate cost but also affects the context in which future decisions are to be made and therefore affects the cost incurred in future stages. Dynamic programming (DP for short) provides a mathematical formalization of the tradeoff between immediate and future costs.

Generally, in DP formulations we introduce a discrete-time dynamic system whose state evolves according to given transition probabilities that depend on a decision/control u. In particular, if we are in state i and we choose decision u, we move to state j with given probability p_ij(u). Simultaneously with this transition, we incur a cost g(i, u, j). In comparing, however, the available decisions u, it is not enough to look at the magnitude of the cost g(i, u, j); we must also take into account how desirable the next state j is. We thus need a way to rank or rate states j. This is done by using the optimal cost (over all remaining stages) starting from state j, which is denoted by J*(j). These costs can be shown to satisfy some form of Bellman's equation

    J*(i) = min_u E{ g(i, u, j) + J*(j) | i, u },   for all i,

where j is the state subsequent to i, and E{ . | i, u } denotes expected value with respect to j, given i and u. Generally, at each state i, it is optimal to use a control u that attains the minimum above. Thus, decisions are ranked based on the sum of the expected cost of the present period, and the optimal expected cost of all subsequent periods.

The objective of DP is to calculate numerically the optimal cost function J*. This computation can be done off-line, i.e., before the real system starts operating. An optimal policy, that is, an optimal choice of u for each i, is computed either simultaneously with J*, or in real time by minimizing in the right-hand side of Bellman's equation. It is well known, however, that for many important problems the computational requirements of DP are overwhelming, mainly because of a very large number of states and controls (Bellman's "curse of dimensionality"). In such situations a suboptimal solution is required.

1.1 Cost Approximations in Dynamic Programming

In this presentation, we focus on suboptimal methods that center around the approximate evaluation of the optimal cost function J*, possibly through the use of neural networks and/or simulation. In particular, we replace the optimal cost J*(j) with a suitable approximation J̃(j, r), where r is a vector of parameters, and we use at state i the (suboptimal) control μ̃(i) that attains the minimum in the (approximate) right-hand side of Bellman's equation

    μ̃(i) = arg min_u E{ g(i, u, j) + J̃(j, r) | i, u }.
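To make the exact DP computation above concrete, here is a minimal value-iteration sketch on a small finite-state example; the transition probabilities, costs, and the discount factor (added so that the iteration provably converges) are illustrative assumptions, not data from the paper.

```python
import numpy as np

# Hypothetical problem data for illustration only: n states, m controls,
# P[u, i, j] = p_ij(u), G[u, i, j] = g(i, u, j). A discount factor alpha is
# assumed here so that the simple iteration below provably converges; the
# paper's equation is stated without discounting.
n, m, alpha = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((m, n, n))
P /= P.sum(axis=2, keepdims=True)        # rows of p_ij(u) sum to one
G = rng.random((m, n, n))                # stage costs g(i, u, j)

def value_iteration(P, G, alpha, tol=1e-8):
    """Compute J* by repeated application of the Bellman operator
    J(i) <- min_u E{ g(i, u, j) + alpha * J(j) | i, u }."""
    J = np.zeros(P.shape[1])
    while True:
        # Q[u, i] = expected stage cost plus discounted cost-to-go under u.
        Q = np.einsum('uij,uij->ui', P, G) + alpha * (P @ J)
        J_new = Q.min(axis=0)
        if np.max(np.abs(J_new - J)) < tol:
            return J_new, Q.argmin(axis=0)   # J* and a greedy (optimal) policy
        J = J_new

J_star, policy = value_iteration(P, G, alpha)
print("J*:", J_star)
print("optimal policy:", policy)
```

With a lookup table of n values this computation is exact; the methods that follow concern what to do when the number of states makes such a table, and the expectation itself, impractical.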

The function J̃ will be called the scoring function, and the value J̃(j, r) will be called the score of state j. The general form of J̃ is known and is such that once the parameter vector r is determined, the evaluation of J̃(j, r) for any state j is fairly simple.

We note that in some problems the minimization over u of the expression

    E{ g(i, u, j) + J̃(j, r) | i, u }

may be too complicated or too time-consuming for making decisions in real time, even if the scores J̃(j, r) are simply calculated. In such problems we may use a related technique, whereby we approximate the expression minimized in Bellman's equation,

    Q(i, u) = E{ g(i, u, j) + J*(j) | i, u },

which is known as the Q-factor corresponding to (i, u). In particular, we replace Q(i, u) with a suitable approximation Q̃(i, u, r), where r is a vector of parameters. We then use at state i the (suboptimal) control that minimizes the approximate Q-factor corresponding to i:

    μ̃(i) = arg min_u Q̃(i, u, r).

Much of what will be said about approximation of the optimal cost function also applies to approximation of Q-factors. In fact, we will see later that the Q-factors can also be viewed as optimal costs of a related problem. We thus focus primarily on approximation of the optimal cost function J*.

We are interested in problems with a large number of states and in scoring functions J̃ that can be described with relatively few numbers (a vector r of small dimension). Scoring functions involving few parameters are called compact representations, while the tabular description of J* is called the lookup table representation. Thus, in a lookup table representation, the values J*(j) are stored in a table for all states j. In a typical compact representation, only the vector r and the general structure of the scoring function J̃(·, r) are stored; the scores J̃(j, r) are generated only when needed. For example, J̃(j, r) may be the output of some neural network in response to the input j, and r is the associated vector of weights or parameters of the neural network; or J̃(j, r) may involve a lower dimensional description of the state j in terms of its "significant features", and r is the associated vector of relative weights of the features. Thus determining the scoring function J̃(j, r) involves two complementary issues: (1) deciding on the general structure of the function J̃(j, r), and (2) calculating the parameter vector r so as to minimize in some sense the error between the functions J*(·) and J̃(·, r).

Approximations of the optimal cost function have been used in the past in a variety of DP contexts. Chess playing programs represent a successful example. A key idea in these programs is to use a position evaluator to rank different chess positions and to select at each turn a move that results in the position with the best rank. The position evaluator assigns a numerical value to each position, according to a heuristic formula that includes weights for the various features of the position (material balance, piece mobility, king safety, and other factors). Thus, the position evaluator corresponds to the scoring function J̃(j, r) above, while the weights of the features correspond to the parameter vector r. Usually, some general structure of the position evaluator is selected (this is largely an art that has evolved over many years, based on experimentation and human knowledge about chess), and the numerical weights are chosen by trial and error or (as in the case of the champion program Deep Thought) by "training" using a large number of sample grandmaster games.

As the chess program paradigm suggests, intuition about the problem, heuristics, and trial and error are all important ingredients for constructing cost approximations in DP. However, it is important to supplement heuristics and intuition with more systematic techniques that are broadly applicable and retain as much as possible the nonheuristic aspects of DP.

In this presentation we will describe several recent efforts to develop a methodological foundation for combining dynamic programming, compact representations, and simulation to provide the basis for a rational approach to complex stochastic decision problems.

1.2 Approximation Architectures

An important issue in function approximation is the selection of architecture, that is, the choice of a parametric class of functions J̃(·, r) or Q̃(·, ·, r) that suits the problem at hand. One possibility is to use a neural network architecture of some type. We should emphasize here that in this presentation we use the term "neural network" in a very broad sense, essentially as a synonym to "approximating architecture." In particular, we do not restrict ourselves to the classical multilayer perceptron structure with sigmoidal nonlinearities. Any type of universal approximator of nonlinear mappings could be used in our context. The nature of the approximating structure is left open in our discussion, and it could involve, for example, radial basis functions, wavelets, polynomials, splines, etc.

Cost approximation can often be significantly enhanced through the use of feature extraction, a process that maps the state i into some vector f(i), called the feature vector associated with the state i. Feature vectors summarize, in a heuristic sense, what are considered to be important characteristics of the state, and they are very useful in incorporating the designer's prior knowledge or intuition about the problem and about the structure of the optimal controller. For example, in a queueing system involving several queues, a feature vector may involve for each queue a three-value indicator that specifies whether the queue is "nearly empty", "moderately busy", or "nearly full". In many cases, analysis can complement intuition to suggest the right features for the problem at hand.

Feature vectors are particularly useful when they can capture the "dominant nonlinearities" in the optimal cost function J*. By this we mean that J*(i) can be approximated well by a "relatively smooth" function J̃(f(i)); this happens, for example, if through a change of variables from states to features, the function J* becomes a (nearly) linear or low-order polynomial function of the features. When a feature vector can be chosen to have this property, one may consider approximation architectures where both features and (relatively simple) neural networks are used together. In particular, the state is mapped to a feature vector, which is then used as input to a neural network that produces the score of the state. More generally, it is possible that both the state and the feature vector are provided as inputs to the neural network.

A simple method to obtain more sophisticated approximations is to partition the state space into several subsets and construct a separate cost function approximation in each subset. For example, by using a linear or quadratic polynomial approximation in each subset of the partition, one can construct piecewise linear or piecewise quadratic approximations over the entire state space. An important issue here is the choice of the method for partitioning the state space. Regular partitions (e.g., grid partitions) may be used, but they often lead to a large number of subsets and very time-consuming computations. Generally speaking, each subset of the partition should contain "similar" states so that the variation of the optimal cost over the states of the subset is relatively smooth and can be approximated with smooth functions. An interesting possibility is to use features as the basis for partition. In particular, one may use a more or less regular discretization of the space of features, which induces a possibly irregular partition of the original state space. In this way, each subset of the irregular partition contains states with "similar features."

1.3 Simulation and Training

Some of the most successful applications of neural networks are in the areas of pattern recognition, nonlinear regression, and nonlinear system identification. In these applications the neural network is used as a universal approximator: the input-output mapping of the neural network is matched to an unknown nonlinear mapping F of interest using a least-squares optimization. This optimization is known as training the network. To perform training, one must have some training data, that is, a set of pairs (i, F(i)), which is representative of the mapping F that is approximated.

It is important to note that in contrast with these neural network applications, in the DP context there is no readily available training set of input-output pairs (i, J*(i)), which can be used to approximate J* with a least squares fit. The only possibility is to evaluate (exactly or approximately) by simulation the cost functions of given (suboptimal) policies, and to try to iteratively improve these policies based on the simulation outcomes. This creates analytical and computational difficulties that do not arise in classical neural network training contexts. Indeed, the use of simulation to evaluate approximately the optimal cost function is a key new idea that distinguishes the methodology of this presentation from earlier approximation methods in DP.

Using simulation offers another major advantage: it allows the methods of this presentation to be used for systems that are hard to model but easy to simulate; that is, in problems where an explicit model is not available, and the system can only be observed, either as it operates in real time or through a software simulator. For such problems, the traditional DP techniques are inapplicable, and estimation of the transition probabilities to construct a detailed mathematical model is often cumbersome or impossible.

There is a third potential advantage of simulation: it can implicitly identify the "most important" or "most representative" states of the system. It appears plausible that if these states are the ones most often visited during the simulation, the scoring function will tend to approximate better the optimal cost for these states, and the suboptimal policy obtained will perform better.
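To illustrate the simulation-based alternative just described, the following is a rough sketch that fits a linear scoring function to sampled costs-to-go of a fixed (suboptimal) policy; the `simulate_trajectory` interface and the feature map are hypothetical stand-ins for whatever simulator the application provides.

```python
import numpy as np

def fit_scoring_function(simulate_trajectory, features, start_states):
    """Least-squares fit of J~(i, r) = r . features(i) to simulated costs.

    There is no ready-made training set of pairs (i, J*(i)); instead,
    `simulate_trajectory(i0)` is assumed to return a list of (state, stage_cost)
    pairs generated under some fixed policy until termination, and the sampled
    cumulative cost-to-go from each visited state serves as the target."""
    rows, targets = [], []
    for i0 in start_states:
        trajectory = simulate_trajectory(i0)
        costs = [c for _, c in trajectory]
        for k, (state, _) in enumerate(trajectory):
            rows.append(features(state))
            targets.append(sum(costs[k:]))     # observed cost-to-go from state
    Phi, y = np.asarray(rows, dtype=float), np.asarray(targets, dtype=float)
    r, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # "training" by least squares
    return r
```

Improving the policy on the basis of such fits, and iterating, is exactly where the analytical and computational difficulties mentioned above appear.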

1.4 Neuro-Dynamic Programming

In view of the reliance on both DP and neural network concepts, we use the name neuro-dynamic programming (NDP for short) to describe collectively the methods of this presentation. In the artificial intelligence community, where the methods originated, the name reinforcement learning is also used. In common artificial intelligence terms, the methods of this presentation allow systems to "learn how to make good decisions by observing their own behavior, and use built-in mechanisms for improving their actions through a reinforcement mechanism." In the less anthropomorphic DP terms used in this presentation, "observing their own behavior" relates to simulation, and "improving their actions through a reinforcement mechanism" relates to iterative schemes for improving the quality of approximation of the optimal cost function, or the Q-factors, or the optimal policy. There has been a gradual realization that reinforcement learning techniques can be fruitfully motivated and interpreted in terms of classical DP concepts such as value and policy iteration; see the nice survey by Barto, Bradtke, and Singh [BBS95], which points out the connections between the artificial intelligence/reinforcement learning viewpoint and the control theory/DP viewpoint, and gives many references.

In this presentation, we will attempt to clarify some aspects of the current NDP methodology, we will suggest some new algorithmic approaches, and we will identify some open questions. Despite the great interest in NDP, there is little solid theory at present to guide the user, and the corresponding literature is often confusing.

The currently most popular methodology iteratively adjusts the parameter vector r of the scoring function J̃(j, r) as it produces sample state trajectories (i_0, i_1, ..., i_k, i_{k+1}, ...) by using simulation. These trajectories correspond to either a fixed stationary policy, or to a "greedy" policy that applies, at state i, the control u that minimizes the expression

    E{ g(i, u, j) + J̃(j, r) | i, u },

where r is the current parameter vector. A central notion here is the notion of a temporal difference, defined by

    d_k = g(i_k, u_k, i_{k+1}) + J̃(i_{k+1}, r) - J̃(i_k, r),

and expressing the difference between our expected cost estimate J̃(i_k, r) at state i_k and the predicted cost estimate g(i_k, u_k, i_{k+1}) + J̃(i_{k+1}, r) based on the outcome of the simulation. If the cost approximations were exact, the average temporal difference would be zero by Bellman's equation. Thus, roughly speaking, the values of the temporal differences can be used to make incremental adjustments to r so as to bring about an approximate equality (on the average) between expected and predicted cost estimates along the simulated trajectories. This viewpoint, formalized by Sutton in [Sut88], can be implemented through the use of gradient descent/stochastic approximation methodology. Sutton proposed a family of methods of this type, called TD(λ), and parameterized by a scalar λ ∈ [0, 1]. One extreme, TD(1), is closely related to Monte-Carlo simulation and least-squares parameter estimation, while the other extreme, TD(0), is closely related to stochastic approximation. A related method is Q-learning, introduced by Watkins [Wat89], which is a stochastic approximation-like method that iterates on the Q-factors. While there is convergence analysis of TD(λ) and Q-learning for the case of lookup table representations (see [Tsi94], [JJS94]), the situation is much less clear in the case of compact representations. In our presentation, we will describe results that we have derived for approximate policy and value iteration methods, which are obtained from the traditional DP methods after compact representations of the various cost functions involved are introduced.

While the theoretical support for the NDP methodology is only now emerging, there have been over the last five years quite a few reports of successes with problems too large and complex to be treated in any other way. A particularly impressive success is the development of a backgammon playing program as reported by Tesauro [Tes92]. Here a neural network provided a compact representation of the optimal cost function of the game of backgammon by using simulation and TD(λ). The training was performed by letting the program play against itself. After training for several months, the program nearly defeated the human world champion. Variations of the method used by Tesauro have been used with success by us and several other researchers in a variety of applications. In our presentation we will provide some analysis that explains the success of this method, and we will also point to some unanswered questions.

Our own experience, involving several engineering applications, has confirmed that NDP methods can be impressively effective in problems where traditional DP methods would be hardly applicable and other heuristic methods would have a limited chance of success. We note, however, that the practical application of NDP is computationally very intensive, and often requires a considerable amount of trial and error. Fortunately, all the computation and experimentation with different approaches can be done off-line. Once the approximation is obtained off-line, it can be used to generate decisions fast enough for use in real time. In this context, we mention that in the machine learning literature, reinforcement learning is often viewed as an "on-line" method, whereby the cost approximation is improved as the system operates in real time. This is reminiscent of the methods of traditional adaptive control. We will not discuss this viewpoint in our presentation, as we prefer to focus on applications involving a large and complex system. A lot of training data is required for such a system. These data typically cannot be obtained in sufficient volume as the system is operating; even if they can, the corresponding processing requirements are typically too large for effective use in real time.
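For concreteness, here is a minimal sketch of the temporal-difference updates described in this section, for a linear scoring function J̃(i, r) = r . φ(i); the eligibility-trace form, the constant step size, and the feature map are illustrative assumptions rather than the precise algorithm of [Sut88].

```python
import numpy as np

def td_lambda_pass(r, trajectory, features, lam, step=0.01):
    """One pass of a TD(lambda)-style update along a simulated trajectory.

    `trajectory` is a list of transitions (i_k, u_k, g_k, i_next) generated by
    simulation under a fixed or greedy policy. For the linear architecture
    J~(i, r) = r . features(i), each temporal difference
        d_k = g_k + J~(i_next, r) - J~(i_k, r)
    drives an incremental adjustment of r along an eligibility trace z."""
    r = np.array(r, dtype=float)
    z = np.zeros_like(r)                              # eligibility trace
    for i_k, _, g_k, i_next in trajectory:
        phi_k = features(i_k)
        d_k = g_k + r @ features(i_next) - r @ phi_k  # temporal difference
        z = lam * z + phi_k      # lambda = 1 ~ Monte Carlo, lambda = 0 ~ one-step
        r = r + step * d_k * z   # would average to zero if J~ were exact
    return r
```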
An extensive reference for the material of this presentation is the research monograph of the authors [BeT96]. A more limited textbook discussion is given in [Ber95]. The survey [BBS95] provides much interesting material and the point of view of the machine learning community.

2 REFERENCES
[BBS95] Barto, A. G., Bradtke, S. J., and Singh, S. P., 1995. "Real-Time Learning and Control Using Asynchronous Dynamic Programming," Artificial Intelligence, Vol. 72, pp. 81-138.
[BeT96] Bertsekas, D. P., and Tsitsiklis, J. N., 1996. Neuro-Dynamic Programming, Athena Scientific, Belmont, MA.
[Ber95] Bertsekas, D. P., 1995. Dynamic Programming and Optimal Control, Vol. II, Athena Scientific, Belmont, MA.
[JJS94] Jaakkola, T., Jordan, M. I., and Singh, S. P., 1994. "On the Convergence of Stochastic Iterative Dynamic Programming Algorithms," Neural Computation, Vol. 6, pp. 1185-1201.
[Sut88] Sutton, R. S., 1988. "Learning to Predict by the Methods of Temporal Differences," Machine Learning, Vol. 3, pp. 9-44.
[Tes92] Tesauro, G., 1992. "Practical Issues in Temporal Difference Learning," Machine Learning, Vol. 8, pp. 257-277.
[Tsi94] Tsitsiklis, J. N., 1994. "Asynchronous Stochastic Approximation and Q-Learning," Machine Learning, Vol. 16, pp. 185-202.
[Wat89] Watkins, C. J. C. H., 1989. "Learning from Delayed Rewards," Ph.D. Thesis, Cambridge Univ., England.

