called the feature vector associated with the state i. Feature vectors summarize, in a heuristic sense, what are considered to be important characteristics of the state, and they are very useful in incorporating the designer's prior knowledge or intuition about the problem and about the structure of the optimal controller. For example, in a queueing system involving several queues, a feature vector may involve for each queue a three-value indicator that specifies whether the queue is "nearly empty", "moderately busy", or "nearly full". In many cases, analysis can complement intuition to suggest the right features for the problem at hand.

Feature vectors are particularly useful when they can capture the "dominant nonlinearities" in the optimal cost function J*. By this we mean that J*(i) can be approximated well by a "relatively smooth" function Ĵ(f(i)); this happens, for example, if through a change of variables from states to features, the function J* becomes a (nearly) linear or low-order polynomial function of the features. When a feature vector can be chosen to have this property, one may consider approximation architectures where both features and (relatively simple) neural networks are used together. In particular, the state is mapped to a feature vector, which is then used as input to a neural network that produces the score of the state. More generally, it is possible that both the state and the feature vector are provided as inputs to the neural network.

A simple method to obtain more sophisticated approximations is to partition the state space into several subsets and construct a separate cost function approximation in each subset. For example, by using a linear or quadratic polynomial approximation in each subset of the partition, one can construct piecewise linear or piecewise quadratic approximations over the entire state space. An important issue here is the choice of the method for partitioning the state space. Regular partitions (e.g., grid partitions) may be used, but they often lead to a large number of subsets and very time-consuming computations. Generally speaking, each subset of the partition should contain "similar" states, so that the variation of the optimal cost over the states of the subset is relatively smooth and can be approximated with smooth functions. An interesting possibility is to use features as the basis for partition. In particular, one may use a more or less regular discretization of the space of features, which induces a possibly irregular partition of the original state space. In this way, each subset of the irregular partition contains states with "similar features."

1.3 Simulation and Training

Some of the most successful applications of neural networks are in the areas of pattern recognition, nonlinear regression, and nonlinear system identification. In these applications the neural network is used as a universal approximator: the input-output mapping of the neural network is matched to an unknown nonlinear mapping F of interest using a least-squares optimization. This optimization is known as training the network. To perform training, one must have some training data, that is, a set of pairs (i, F(i)), which is representative of the mapping F that is approximated.

It is important to note that, in contrast with these neural network applications, in the DP context there is no readily available training set of input-output pairs (i, J*(i)) which can be used to approximate J* with a least-squares fit. The only possibility is to evaluate (exactly or approximately) by simulation the cost functions of given (suboptimal) policies, and to try to iteratively improve these policies based on the simulation outcomes. This creates analytical and computational difficulties that do not arise in classical neural network training contexts. Indeed, the use of simulation to evaluate approximately the optimal cost function is a key new idea that distinguishes the methodology of this presentation from earlier approximation methods in DP.

Using simulation offers another major advantage: it allows the methods of this presentation to be used for systems that are hard to model but easy to simulate; that is, in problems where an explicit model is not available, and the system can only be observed, either as it operates in real time or through a software simulator. For such problems, the traditional DP techniques are inapplicable, and estimation of the transition probabilities to construct a detailed mathematical model is often cumbersome or impossible.

There is a third potential advantage of simulation: it can implicitly identify the "most important" or "most representative" states of the system. It appears plausible that if these states are the ones most often visited during the simulation, the scoring function will tend to approximate better the optimal cost for these states, and the suboptimal policy obtained will perform better.

1.4 Neuro-Dynamic Programming

In view of the reliance on both DP and neural network concepts, we use the name neuro-dynamic programming (NDP for short) to describe collectively the methods of this presentation. In the artificial intelligence community, where the methods originated, the name reinforcement learning is also used. In common artificial intelligence terms, the methods of this presentation allow systems to "learn how to make good decisions by observing their own behavior, and use built-in mechanisms for improving their actions
through a reinforcement mechanism." In the less anthropomorphic DP terms used in this presentation, "observing their own behavior" relates to simulation, and "improving their actions through a reinforcement mechanism" relates to iterative schemes for improving the quality of approximation of the optimal cost function, or the Q-factors, or the optimal policy. There has been a gradual realization that reinforcement learning techniques can be fruitfully motivated and interpreted in terms of classical DP concepts such as value and policy iteration; see the nice survey by Barto, Bradtke, and Singh [BBS95], which points out the connections between the artificial intelligence/reinforcement learning viewpoint and the control theory/DP viewpoint, and gives many references.

In this presentation, we will attempt to clarify some aspects of the current NDP methodology, we will suggest some new algorithmic approaches, and we will identify some open questions. Despite the great interest in NDP, there is little solid theory at present to guide the user, and the corresponding literature is often confusing.

The currently most popular methodology iteratively adjusts the parameter vector r of the scoring function Ĵ(i, r) as it produces sample state trajectories (i_0, i_1, ..., i_k, i_{k+1}, ...) by using simulation. These trajectories correspond to either a fixed stationary policy, or to a "greedy" policy that applies, at state i, the control u that minimizes the expression

    Σ_j p_ij(u) ( g(i, u, j) + Ĵ(j, r) ),

where r is the current parameter vector. A central notion here is the notion of a temporal difference, defined by

    d_k = g(i_k, u_k, i_{k+1}) + Ĵ(i_{k+1}, r) − Ĵ(i_k, r),

and expressing the difference between our expected cost estimate Ĵ(i_k, r) at state i_k and the predicted cost estimate g(i_k, u_k, i_{k+1}) + Ĵ(i_{k+1}, r) based on the outcome of the simulation. If the cost approximations were exact, the average temporal difference would be zero by Bellman's equation. Thus, roughly speaking, the values of the temporal differences can be used to make incremental adjustments to r so as to bring about an approximate equality (on the average) between expected and predicted cost estimates along the simulated trajectories. This viewpoint, formalized by Sutton in [Sut88], can be implemented through the use of gradient descent/stochastic approximation methodology. Sutton proposed a family of methods of this type, called TD(λ), parameterized by a scalar λ ∈ [0, 1]. One extreme, TD(1), is closely related to Monte-Carlo simulation and least-squares parameter estimation, while the other extreme, TD(0), is closely related to stochastic approximation. A related method is Q-learning, introduced by Watkins [Wat89], which is a stochastic approximation-like method that iterates on the Q-factors. While there is convergence analysis of TD(λ) and Q-learning for the case of lookup table representations (see [Tsi94], [JJS94]), the situation is much less clear in the case of compact representations. In our presentation, we will describe results that we have derived for approximate policy and value iteration methods, which are obtained from the traditional DP methods after compact representations of the various cost functions involved are introduced.

While the theoretical support for the NDP methodology is only now emerging, there have been, over the last five years, quite a few reports of successes with problems too large and complex to be treated in any other way. A particularly impressive success is the development of a backgammon playing program, as reported by Tesauro [Tes92]. Here a neural network provided a compact representation of the optimal cost function of the game of backgammon by using simulation and TD(λ). The training was performed by letting the program play against itself. After training for several months, the program nearly defeated the human world champion. Variations of the method used by Tesauro have been used with success by us and several other researchers in a variety of applications. In our presentation we will provide some analysis that explains the success of this method, and we will also point to some unanswered questions.

Our own experience, involving several engineering applications, has confirmed that NDP methods can be impressively effective in problems where traditional DP methods would be hardly applicable and other heuristic methods would have a limited chance of success. We note, however, that the practical application of NDP is computationally very intensive, and often requires a considerable amount of trial and error. Fortunately, all the computation and experimentation with different approaches can be done off-line. Once the approximation is obtained off-line, it can be used to generate decisions fast enough for use in real time. In this context, we mention that in the machine learning literature, reinforcement learning is often viewed as an "on-line" method, whereby the cost approximation is improved as the system operates in real time. This is reminiscent of the methods of traditional adaptive control. We will not discuss this viewpoint in our presentation, as we prefer to focus on applications involving a large and complex system. A lot of training data is required for such a system. These data typically cannot be obtained in sufficient volume as the system is operating; even if they can, the corresponding processing requirements are typically too large for effective use in real time.
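The temporal difference iteration described above can be made concrete with a small sketch. The code below simulates trajectories of a hypothetical two-transition stochastic shortest path problem under a fixed policy, and adjusts the parameter vector r by the TD(0) rule r(i_k) += stepsize * d_k. The chain, its costs, and the one-hot (lookup table) scoring function Ĵ(i, r) = r[i] are all invented for illustration; they are not an example from this presentation.

```python
import random

# Hypothetical stochastic shortest path problem under a fixed policy.
# State 2 is terminal (cost-to-go 0).  For each nonterminal state we
# list (successor, transition cost, probability) triples.
TRANSITIONS = {
    0: [(1, 1.0, 0.5), (2, 4.0, 0.5)],
    1: [(2, 2.0, 1.0)],
}
TERMINAL = 2

def step(i, rng):
    """Simulate one transition from state i, returning (j, g(i, j))."""
    x = rng.random()
    acc = 0.0
    for j, g, p in TRANSITIONS[i]:
        acc += p
        if x <= acc:
            return j, g
    j, g, _ = TRANSITIONS[i][-1]
    return j, g

def td0(num_trajectories=20000, seed=0):
    """TD(0): adjust r along simulated trajectories via temporal differences.

    The scoring function is J~(i, r) = r[i] (one-hot features, i.e. a
    lookup table), so each update moves J~(i_k, r) toward the predicted
    estimate g(i_k, i_{k+1}) + J~(i_{k+1}, r)."""
    rng = random.Random(seed)
    r = [0.0, 0.0, 0.0]            # parameter vector; r[TERMINAL] stays 0
    visits = [0, 0, 0]
    for _ in range(num_trajectories):
        i = 0                       # every trajectory starts at state 0
        while i != TERMINAL:
            j, g = step(i, rng)
            d = g + r[j] - r[i]     # temporal difference d_k
            visits[i] += 1
            r[i] += d / visits[i]   # decreasing stepsize (stochastic approximation)
            i = j
    return r

if __name__ == "__main__":
    r = td0()
    print(r[0], r[1])  # should approach the true costs-to-go 3.5 and 2.0
```

For this toy chain the exact costs-to-go are J(1) = 2 and J(0) = 0.5(1 + 2) + 0.5(4) = 3.5, so the average temporal difference vanishes at the solution, in line with Bellman's equation as noted above.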
An extensive reference for the material of this pre-
sentation is the research monograph of the authors
[BeT96]. A more limited textbook discussion is given
in [Ber95]. The survey [BBS95] provides much inter-
esting material and the point of view of the machine
learning community.
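The lookup-table Q-learning iteration mentioned in Section 1.4, which iterates on the Q-factors rather than on the cost function, can be sketched in the same spirit. The small controlled chain below is hypothetical, chosen only so that the exact Q-factors can be checked by hand; the stepsize and exploration scheme are illustrative choices, not a prescription from this presentation.

```python
import random

# Hypothetical controlled chain.  State 1 is terminal (cost-to-go 0).
# For each (state, control) pair: (successor, cost, probability) triples.
MODEL = {
    (0, 0): [(1, 5.0, 1.0)],                  # pay 5, terminate
    (0, 1): [(0, 1.0, 0.5), (1, 1.0, 0.5)],   # pay 1, terminate w.p. 1/2
}
CONTROLS = {0: [0, 1]}
TERMINAL = 1

def step(i, u, rng):
    """Simulate one transition under control u, returning (j, g(i, u, j))."""
    x = rng.random()
    acc = 0.0
    for j, g, p in MODEL[(i, u)]:
        acc += p
        if x <= acc:
            return j, g
    j, g, _ = MODEL[(i, u)][-1]
    return j, g

def q_learning(num_steps=50000, seed=1):
    """Watkins-style Q-learning with a lookup table representation."""
    rng = random.Random(seed)
    Q = {(0, 0): 0.0, (0, 1): 0.0}
    n = {(0, 0): 0, (0, 1): 0}
    i = 0
    for _ in range(num_steps):
        u = rng.choice(CONTROLS[i])   # explore: pick a random control
        j, g = step(i, u, rng)
        qnext = 0.0 if j == TERMINAL else min(Q[(j, v)] for v in CONTROLS[j])
        n[(i, u)] += 1
        # stochastic-approximation step toward g + min_v Q(j, v)
        Q[(i, u)] += (g + qnext - Q[(i, u)]) / n[(i, u)]
        i = 0 if j == TERMINAL else j  # restart a trajectory after termination
    return Q

if __name__ == "__main__":
    Q = q_learning()
    print(Q[(0, 0)], Q[(0, 1)])  # should approach the exact Q-factors 5.0 and 2.0
```

Here the exact Q-factors solve Q(0,0) = 5 and Q(0,1) = 1 + 0.5 min(Q(0,0), Q(0,1)), giving Q(0,1) = 2; this lookup-table setting is the case covered by the convergence analyses of [Tsi94] and [JJS94] cited above.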
2 REFERENCES
[BBS95] Barto, A. G., Bradtke, S. J., and Singh, S. P., 1995. "Real-Time Learning and Control Using Asynchronous Dynamic Programming," Artificial Intelligence, Vol. 72, pp. 81-138.
[BeT96] Bertsekas, D. P., and Tsitsiklis, J. N., 1996. Neuro-Dynamic Programming, Athena Scientific, Belmont, MA.
[Ber95] Bertsekas, D. P., 1995. Dynamic Programming and Optimal Control, Vol. II, Athena Scientific, Belmont, MA.
[JJS94] Jaakkola, T., Jordan, M. I., and Singh, S. P., 1994. "On the Convergence of Stochastic Iterative Dynamic Programming Algorithms," Neural Computation, Vol. 6, pp. 1185-1201.
[Sut88] Sutton, R. S., 1988. “Learning to Predict
by the Methods of Temporal Differences,” Machine
Learning, Vol. 3, pp. 9-44.
[Tes92] Tesauro, G., 1992. “Practical Issues in Tem-
poral Difference Learning,” Machine Learning, Vol. 8,
pp. 257-277.
[Tsi94] Tsitsiklis, J. N., 1994. "Asynchronous Stochastic Approximation and Q-Learning," Machine Learning, Vol. 16, pp. 185-202.
[Wat89] Watkins, C. J. C. H., 1989. "Learning from Delayed Rewards," Ph.D. Thesis, Cambridge Univ., England.