
Finite-Horizon Markov Decision Processes

Dan Zhang, Leeds School of Business, University of Colorado at Boulder

Dan Zhang, Spring 2012


Outline

Expected total reward criterion
Optimality equations and the principle of optimality
Optimality of deterministic Markov policies
Backward induction
Applications


Expected Total Reward Criterion


Let $\pi$ be a randomized history-dependent policy, i.e., $\pi \in \Pi^{HR}$;
$\pi = (d_1, \ldots, d_{N-1})$, where $d_t : H_t \to \mathcal{P}(A)$.

Starting at a state $s$, using policy $\pi$ leads to a sequence of state-action pairs $\{(X_t, Y_t)\}$. The sequence of rewards is given by $\{R_t = r_t(X_t, Y_t) : t = 1, \ldots, N-1\}$, with terminal reward $R_N = r_N(X_N)$. The expected total reward of policy $\pi$ starting in state $s$ is given by
\[
v_N^{\pi}(s) = E_s^{\pi}\left[ \sum_{t=1}^{N-1} r_t(X_t, Y_t) + r_N(X_N) \right].
\]
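
To make this definition concrete, here is a minimal Python sketch that estimates $v_N^{\pi}(s)$ by Monte Carlo simulation under a fixed deterministic Markov policy. The horizon, rewards, transition probabilities, and decision rule below are hypothetical placeholders, not data from these slides.

```python
import random

# Hypothetical two-state example with horizon N = 3; none of these numbers
# come from the slides -- they only illustrate the definition of v_N^pi(s).
N = 3
S = [0, 1]

def r(t, s, a):                 # reward r_t(s, a) for t < N (placeholder values)
    return 1.0 if (s, a) == (0, 1) else 0.5

def r_terminal(s):              # terminal reward r_N(s)
    return 0.0

def p(t, s, a):                 # transition distribution p_t(. | s, a)
    return {0: 0.7, 1: 0.3} if a == 0 else {0: 0.4, 1: 0.6}

def policy(t, s):               # a deterministic Markov decision rule d_t(s)
    return 0 if s == 0 else 1

def estimate_value(s0, n_samples=50_000):
    """Monte Carlo estimate of v_N^pi(s0) = E[sum_{t<N} r_t(X_t, Y_t) + r_N(X_N)]."""
    total = 0.0
    for _ in range(n_samples):
        s, reward = s0, 0.0
        for t in range(1, N):                       # decision epochs 1, ..., N-1
            a = policy(t, s)
            reward += r(t, s, a)
            dist = p(t, s, a)
            s = random.choices(list(dist), weights=list(dist.values()))[0]
        reward += r_terminal(s)                     # terminal reward at epoch N
        total += reward
    return total / n_samples

print(estimate_value(0))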


Optimal Policy

A policy $\pi^*$ is an optimal policy if
\[
v_N^{\pi^*}(s) \geq v_N^{\pi}(s), \quad \forall s \in S, \; \pi \in \Pi^{HR}.
\]

The value of a Markov decision problem is defined by
\[
v_N^*(s) = \sup_{\pi \in \Pi^{HR}} v_N^{\pi}(s), \quad s \in S.
\]
We have $v_N^{\pi^*}(s) = v_N^*(s)$ for all $s \in S$.


Finite-Horizon Policy Evaluation

Let $\pi \in \Pi^{HR}$ be a randomized history-dependent policy.

Let $u_t^{\pi} : H_t \to \mathbb{R}$ be the total expected reward obtained by using policy $\pi$ at decision epochs $t, t+1, \ldots, N-1$.

Given $h_t \in H_t$ for $t < N$, let
\[
u_t^{\pi}(h_t) = E_{h_t}^{\pi}\left[ \sum_{n=t}^{N-1} r_n(X_n, Y_n) + r_N(X_N) \right].
\]

Furthermore, let $u_N^{\pi}(h_N) = r_N(s_N)$ for $h_N = (h_{N-1}, a_{N-1}, s_N)$. For a given initial state $s$, we have $u_1^{\pi}(s) = v_N^{\pi}(s)$.


The Finite-Horizon Policy Evaluation Algorithm

Assume $\pi \in \Pi^{HD}$.

1. Set $t = N$ and $u_N^{\pi}(h_N) = r_N(s_N)$ for all $h_N = (h_{N-1}, a_{N-1}, s_N) \in H_N$.

2. If $t = 1$, stop; otherwise go to step 3.

3. Substitute $t - 1$ for $t$ and compute $u_t^{\pi}(h_t)$ for each $h_t = (h_{t-1}, a_{t-1}, s_t) \in H_t$ by
\[
u_t^{\pi}(h_t) = r_t(s_t, d_t(h_t)) + \sum_{j \in S} p_t(j \mid s_t, d_t(h_t)) \, u_{t+1}^{\pi}(h_t, d_t(h_t), j).
\]
Return to step 2.
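
Below is a minimal Python sketch of this evaluation, written as a recursion over histories rather than the explicit backward loop above, but computing the same quantities $u_t^{\pi}(h_t)$. The horizon, rewards, transitions, and decision rule are hypothetical placeholders.

```python
# Sketch of finite-horizon policy evaluation for a deterministic
# history-dependent policy pi.  All problem data are placeholders.
N = 4                           # decision epochs 1, ..., N-1; epoch N is terminal
S = [0, 1]

def r(t, s, a):                 # r_t(s, a), t < N
    return float(s == a)

def r_terminal(s):              # r_N(s)
    return 0.0

def p(t, j, s, a):              # p_t(j | s, a)
    return 0.8 if j == s else 0.2

def d(t, h):                    # deterministic rule d_t(h_t); here it only uses
    return h[-1]                # the current state, but any function of h works

def u(t, h):
    """Total expected reward u_t^pi(h_t) from epoch t onward; h ends in s_t."""
    s = h[-1]
    if t == N:
        return r_terminal(s)
    a = d(t, h)
    return r(t, s, a) + sum(p(t, j, s, a) * u(t + 1, h + (a, j)) for j in S)

# u_1^pi(s) equals v_N^pi(s) for each initial state s.
for s in S:
    print(s, u(1, (s,)))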


The Principle of Optimality

Let $u_t^*(h_t) = \sup_{\pi \in \Pi^{HR}} u_t^{\pi}(h_t)$.

Consider the following optimality equations:
\[
u_t(h_t) = \sup_{a \in A_{s_t}} \left\{ r_t(s_t, a) + \sum_{j \in S} p_t(j \mid s_t, a) \, u_{t+1}(h_t, a, j) \right\},
\]
for $t = 1, \ldots, N-1$ and $h_t = (h_{t-1}, a_{t-1}, s_t) \in H_t$, and
\[
u_N(h_N) = r_N(s_N), \quad h_N = (h_{N-1}, a_{N-1}, s_N) \in H_N.
\]


The Principle of Optimality

Theorem. Suppose $u_t$ is a solution to the optimality equations for all $t$. Then

(a) $u_t(h_t) = u_t^*(h_t)$ for all $h_t \in H_t$, $t = 1, \ldots, N$;
(b) $u_1(s_1) = v_N^*(s_1)$ for all $s_1 \in S$.


Optimality of Deterministic Markov Policies


Theorem. Let $u_t^*$ be a solution to the optimality equations for all $t$. Then

(a) For each $t = 1, \ldots, N$, $u_t^*(h_t)$ depends on $h_t$ only through $s_t$;

(b) If there exists an $a^* \in A_{s_t}$ such that
\[
r_t(s_t, a^*) + \sum_{j \in S} p_t(j \mid s_t, a^*) \, u_{t+1}^*(h_t, a^*, j)
= \sup_{a \in A_{s_t}} \left\{ r_t(s_t, a) + \sum_{j \in S} p_t(j \mid s_t, a) \, u_{t+1}^*(h_t, a, j) \right\}
\]
for each $s_t \in S$ and $t = 1, \ldots, N-1$, then there exists an optimal policy which is deterministic and Markovian.


Backward Induction
1. Set $t = N$ and $u_N^*(s_N) = r_N(s_N)$ for all $s_N \in S$.

2. Substitute $t - 1$ for $t$ and compute $u_t^*(s_t)$ for each $s_t \in S$ by
\[
u_t^*(s_t) = \max_{a \in A_{s_t}} \left\{ r_t(s_t, a) + \sum_{j \in S} p_t(j \mid s_t, a) \, u_{t+1}^*(j) \right\}.
\]
Set
\[
A_{s_t, t}^* = \arg\max_{a \in A_{s_t}} \left\{ r_t(s_t, a) + \sum_{j \in S} p_t(j \mid s_t, a) \, u_{t+1}^*(j) \right\}.
\]

3. If $t = 1$, stop; otherwise go to step 2.
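
A compact Python sketch of this backward-induction recursion on a generic finite-horizon MDP follows; it returns both $u_t^*$ and the optimal action sets $A_{s_t,t}^*$. The small instance at the bottom uses made-up placeholder numbers purely for illustration.

```python
# Backward induction for a finite-horizon MDP.
# Inputs: horizon N, states S, action sets A[s], rewards r[t][s][a],
# terminal rewards r_term[s], transition probabilities p[t][s][a][j].
# All numbers in the example instance are placeholders for illustration only.

def backward_induction(N, S, A, r, r_term, p):
    u = {N: {s: r_term[s] for s in S}}           # u_N^*(s) = r_N(s)
    best = {}                                    # optimal action sets A*_{s,t}
    for t in range(N - 1, 0, -1):                # t = N-1, ..., 1
        u[t], best[t] = {}, {}
        for s in S:
            q = {a: r[t][s][a] + sum(p[t][s][a][j] * u[t + 1][j] for j in S)
                 for a in A[s]}
            u[t][s] = max(q.values())
            best[t][s] = [a for a in A[s] if q[a] == u[t][s]]
    return u, best

# A tiny hypothetical instance: 2 states, 2 actions, horizon N = 3.
N = 3
S = [0, 1]
A = {0: [0, 1], 1: [0, 1]}
r = {t: {0: {0: 0.0, 1: 1.0}, 1: {0: 0.5, 1: 0.0}} for t in range(1, N)}
r_term = {0: 0.0, 1: 2.0}
p = {t: {s: {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}} for s in S}
     for t in range(1, N)}

u, best = backward_induction(N, S, A, r, r_term, p)
print(u[1], best[1])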


e-Rite-Way: An MDP Formulation

Decision epochs: $T = \{1, 2, 3, 4, 5\}$.
States: $S = \{1, 2\}$.
Actions: $A_s = \{0, 1, 2\}$:
  0: Do nothing
  1: Gift and minor price promotion
  2: Gift and major price promotion
Expected rewards: $r_t(s, a)$ (see handout).
Terminal rewards: $r_N(s) = 0$.
Transition probabilities: $p_t(i \mid s, a) = p_{si}^a$, $i = 1, 2$.
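
As an illustration of how this formulation plugs into backward induction, the sketch below lays out the e-Rite-Way data structures and solves them. Since the handout values for $r_t(s, a)$ and $p_{si}^a$ are not reproduced here, every number in the sketch is a hypothetical placeholder, not the actual problem data.

```python
# e-Rite-Way laid out as a finite-horizon MDP and solved by backward induction.
# Every reward and transition number below is a hypothetical placeholder,
# NOT the value from the handout.
N = 5                                          # T = {1, ..., 5}; decisions at t = 1, ..., 4
S = [1, 2]
A = {s: [0, 1, 2] for s in S}                  # 0: do nothing, 1: minor promotion, 2: major promotion

r = {t: {1: {0: 4.0, 1: 3.5, 2: 3.0},          # placeholder r_t(s, a)
         2: {0: 1.0, 1: 0.8, 2: 0.5}} for t in range(1, N)}
r_term = {1: 0.0, 2: 0.0}                      # r_N(s) = 0

p = {t: {1: {0: {1: 0.7, 2: 0.3}, 1: {1: 0.8, 2: 0.2}, 2: {1: 0.9, 2: 0.1}},
         2: {0: {1: 0.2, 2: 0.8}, 1: {1: 0.4, 2: 0.6}, 2: {1: 0.6, 2: 0.4}}}
     for t in range(1, N)}                     # placeholder p_t(i | s, a) = p^a_{si}

# Backward induction (same recursion as in the previous sketch, inlined here).
u = {N: dict(r_term)}
decision_rule = {}
for t in range(N - 1, 0, -1):
    u[t], decision_rule[t] = {}, {}
    for s in S:
        q = {a: r[t][s][a] + sum(p[t][s][a][j] * u[t + 1][j] for j in S)
             for a in A[s]}
        decision_rule[t][s] = max(q, key=q.get)
        u[t][s] = q[decision_rule[t][s]]

print("optimal values u_1^*:", u[1])
print("optimal decision rules:", decision_rule)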
