Outline
Expected total reward criterion
Optimality equations and the principle of optimality
Optimality of deterministic Markov policies
Backward induction
Applications
Starting at a state s, using policy \pi leads to a sequence of state-action pairs \{X_t, Y_t\}. The sequence of rewards is given by \{R_t = r_t(X_t, Y_t) : t = 1, \dots, N-1\}, with terminal reward R_N = r_N(X_N). The expected total reward from policy \pi starting in state s is given by
v_N^\pi(s) = \mathbb{E}_s^\pi \Big[ \sum_{t=1}^{N-1} r_t(X_t, Y_t) + r_N(X_N) \Big].
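As a concrete illustration (not from the slides), this expectation can be computed exactly for a deterministic Markov policy by propagating the state distribution forward and accumulating expected rewards, instead of sampling trajectories. A minimal NumPy sketch on a hypothetical two-state, two-action MDP; all names and data below are illustrative assumptions:

```python
import numpy as np

# Hypothetical finite-horizon MDP (illustrative, not from the slides):
# N = 4 decision epochs, 2 states, 2 actions.
# p[t, s, a] is the next-state distribution p_t(. | s, a),
# r[t, s, a] is the reward r_t(s, a), r_term[s] the terminal reward r_N(s).
N = 4
n_states, n_actions = 2, 2
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(n_states), size=(N - 1, n_states, n_actions))
r = rng.uniform(0.0, 1.0, size=(N - 1, n_states, n_actions))
r_term = rng.uniform(0.0, 1.0, size=n_states)

def expected_total_reward(policy, s0):
    """E_s^pi[ sum_{t=1}^{N-1} r_t(X_t, Y_t) + r_N(X_N) ] for a
    deterministic Markov policy, computed by forward propagation of
    the state distribution of X_t."""
    dist = np.zeros(n_states)
    dist[s0] = 1.0                                 # X_1 = s with probability 1
    total = 0.0
    for t in range(N - 1):
        a = policy[t]                              # a[s]: action taken in state s
        total += dist @ r[t, np.arange(n_states), a]
        # next-state distribution: sum_s P(X_t = s) p_t(. | s, a_s)
        dist = dist @ p[t, np.arange(n_states), a]
    return total + dist @ r_term                   # add terminal reward r_N(X_N)

policy = np.zeros((N - 1, n_states), dtype=int)    # always choose action 0
print(expected_total_reward(policy, s0=0))
```

Since every reward lies in [0, 1], the result is bounded by N; the forward pass costs O(N |S|^2) for a fixed policy, compared with exponentially many trajectories.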
Optimal Policy
A policy \pi^* \in \Pi^{HR} is optimal if

v_N^{\pi^*}(s) \geq v_N^\pi(s) for all s \in S and all \pi \in \Pi^{HR}.

The value of the MDP is

v_N^*(s) = \sup_{\pi \in \Pi^{HR}} v_N^\pi(s), s \in S.

For a history h_t, the expected total reward of policy \pi from decision epoch t onward is

u_t^\pi(h_t) = \mathbb{E}^\pi \Big[ \sum_{n=t}^{N-1} r_n(X_n, Y_n) + r_N(X_N) \,\Big|\, H_t = h_t \Big].
Restricting attention to deterministic policies \pi \in \Pi^{HD}, the optimality equations can be solved by the following recursion:

1. Set t = N and u_N(s_N) = r_N(s_N) for all s_N \in S.
2. Substitute t - 1 for t and compute, for each s_t \in S,

u_t(s_t) = \sup_{a \in A_{s_t}} \Big\{ r_t(s_t, a) + \sum_{j \in S} p_t(j \mid s_t, a) \, u_{t+1}(j) \Big\}.

3. If t = 1, stop; otherwise return to 2.

If the supremum is attained for each s_t \in S and t = 1, \dots, N-1, there exists an optimal policy which is deterministic and Markovian.
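On a small example this can be checked numerically: solving the optimality equations and comparing against brute-force enumeration of every deterministic Markov policy. The toy MDP below is my own illustrative construction, not from the slides:

```python
import itertools
import numpy as np

# Toy MDP (illustrative): T = 2 decision epochs, 2 states, 2 actions.
T, S, A = 2, 2, 2
rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(S), size=(T, S, A))   # p[t, s, a, j] = p_t(j | s, a)
r = rng.uniform(size=(T, S, A))                 # r[t, s, a] = r_t(s, a)
r_term = rng.uniform(size=S)                    # r_N(s)

# Optimality equations: u_t(s) = max_a { r_t(s,a) + sum_j p_t(j|s,a) u_{t+1}(j) }.
u = r_term.copy()
for t in range(T - 1, -1, -1):
    u = np.max(r[t] + p[t] @ u, axis=1)

def value(policy):
    """Exact value of a deterministic Markov policy, by backward recursion."""
    v = r_term.copy()
    for t in range(T - 1, -1, -1):
        a = np.asarray(policy[t])
        v = r[t, np.arange(S), a] + p[t, np.arange(S), a] @ v
    return v

# Brute force over all (A^S)^T deterministic Markov policies.
best = np.full(S, -np.inf)
for d in itertools.product(itertools.product(range(A), repeat=S), repeat=T):
    best = np.maximum(best, value(d))

print(np.allclose(u, best))   # the optimality equations attain the brute-force optimum
```

The agreement holds state by state: for every starting state, the value produced by the recursion equals the best achievable value over the enumerated deterministic Markov policies.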
Backward Induction
1. Set t = N and u_N(s_N) = r_N(s_N) for all s_N \in S.
2. Substitute t - 1 for t and compute u_t(s_t) for each s_t \in S by

u_t(s_t) = \max_{a \in A_{s_t}} \Big\{ r_t(s_t, a) + \sum_{j \in S} p_t(j \mid s_t, a) \, u_{t+1}(j) \Big\}.

3. If t = 1, stop; otherwise return to step 2.
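The steps above translate directly into code. A minimal sketch, assuming array layouts of my own choosing (rewards `r[t, s, a]`, terminal rewards `r_term[s]`, transitions `p[t, s, a, j]`); the demo MDP at the end is likewise illustrative:

```python
import numpy as np

def backward_induction(r, r_term, p):
    """Backward induction for a finite-horizon MDP.

    r[t, s, a]     reward r_t(s, a) for the N-1 decision epochs (0-indexed)
    r_term[s]      terminal reward r_N(s)
    p[t, s, a, j]  transition probability p_t(j | s, a)
    Returns all value functions u_t and a deterministic Markov policy.
    """
    T, n_states, _ = r.shape
    u = np.zeros((T + 1, n_states))
    policy = np.zeros((T, n_states), dtype=int)
    u[T] = r_term                          # step 1: u_N = r_N
    for t in range(T - 1, -1, -1):         # step 2: substitute t - 1 for t
        q = r[t] + p[t] @ u[t + 1]         # q[s, a] = r_t(s,a) + sum_j p_t(j|s,a) u_{t+1}(j)
        policy[t] = np.argmax(q, axis=1)   # maximizing action in each state
        u[t] = np.max(q, axis=1)           # u_t(s_t)
    return u, policy                       # step 3: stop once t = 1

# Demo on a tiny chain (illustrative): action 1 pays 1 and keeps the state,
# action 0 pays 0; both actions leave the state unchanged.
T, S, A = 3, 2, 2
r = np.zeros((T, S, A)); r[:, :, 1] = 1.0
p = np.zeros((T, S, A, S))
p[:, np.arange(S), :, np.arange(S)] = 1.0
u, policy = backward_induction(r, np.zeros(S), p)
print(u[0])   # → [3. 3.]: action 1 at each of the 3 epochs
```

One sweep costs O(N |S|^2 |A|), and the returned `policy` is exactly the deterministic Markov policy whose existence the previous section asserts when the maximum is attained.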