Solutions Vol. II, Chapter 1
1.5
(a) We have
\[
\sum_{j=1}^n \tilde p_{ij}(u) = \sum_{j=1}^n \frac{p_{ij}(u) - m_j}{1 - \sum_{k=1}^n m_k}
= \Big(\sum_{j=1}^n p_{ij}(u) - \sum_{j=1}^n m_j\Big)\frac{1}{1 - \sum_{k=1}^n m_k} = 1.
\]
Therefore, the $\tilde p_{ij}(u)$ are transition probabilities.

(b) We have for the modified problem
\[
\tilde J(i) = \min_{u\in U(i)}\Big[\, g(i,u) + \alpha\Big(1 - \sum_{j=1}^n m_j\Big) \sum_{j=1}^n \frac{p_{ij}(u) - m_j}{1 - \sum_{k=1}^n m_k}\,\tilde J(j)\Big]
= \min_{u\in U(i)}\Big[\, g(i,u) + \alpha\sum_{j=1}^n p_{ij}(u)\tilde J(j) - \alpha\sum_{k=1}^n m_k \tilde J(k)\Big].
\]
So
\[
\tilde J(i) + \frac{\alpha\sum_{k=1}^n m_k\tilde J(k)}{1-\alpha}
= \min_{u\in U(i)}\Big[\, g(i,u) + \alpha\sum_{j=1}^n p_{ij}(u)\tilde J(j) + \alpha\Big(\frac{1}{1-\alpha}-1\Big)\sum_{k=1}^n m_k\tilde J(k)\Big],
\]
\[
\tilde J(i) + \frac{\alpha\sum_{k=1}^n m_k\tilde J(k)}{1-\alpha}
= \min_{u\in U(i)}\Big[\, g(i,u) + \alpha\sum_{j=1}^n p_{ij}(u)\Big(\tilde J(j) + \frac{\alpha\sum_{k=1}^n m_k\tilde J(k)}{1-\alpha}\Big)\Big].
\]
Thus $\tilde J(i) + \frac{\alpha\sum_{k=1}^n m_k\tilde J(k)}{1-\alpha}$ satisfies Bellman's equation for the original problem, so
\[
\tilde J(i) + \frac{\alpha\sum_{k=1}^n m_k\tilde J(k)}{1-\alpha} = J^*(i), \qquad \forall\, i.
\]
Q.E.D.
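The identity of part (b) can be checked numerically on a small random MDP. The instance, the discount factor, and the choice $m_j = \tfrac{1}{2}\min_{i,u} p_{ij}(u)$ (which keeps all $\tilde p_{ij}(u)$ nonnegative) are assumptions made only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, nu, alpha = 4, 3, 0.9
P = rng.random((nu, n, n))
P /= P.sum(axis=2, keepdims=True)        # p_ij(u): one stochastic matrix per control u
g = rng.random((nu, n))                  # g(i, u)
m = 0.5 * P.min(axis=(0, 1))             # m_j <= p_ij(u) for all i, u

def value_iterate(P, g, disc, iters=4000):
    # J(i) = min_u [ g(i,u) + disc * sum_j p_ij(u) J(j) ]
    J = np.zeros(P.shape[1])
    for _ in range(iters):
        J = (g + disc * P @ J).min(axis=0)
    return J

J_star = value_iterate(P, g, alpha)

s = m.sum()
P_mod = (P - m) / (1.0 - s)              # modified transition probabilities
J_mod = value_iterate(P_mod, g, alpha * (1.0 - s))   # modified discount

shift = alpha * (m @ J_mod) / (1.0 - alpha)
assert np.max(np.abs(J_mod + shift - J_star)) < 1e-6
```

The rows of `P_mod` sum to one (part (a)), and the constant-shifted modified costs reproduce the original optimal costs (part (b)).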
1.7
We show that for any bounded function $J : S \to \Re$, we have
\[ J \le T(J) \ \Longrightarrow\ T(J) \le F(J), \tag{1} \]
\[ J \ge T(J) \ \Longrightarrow\ T(J) \ge F(J). \tag{2} \]
For any $\mu$, define
\[ F_\mu(J)(i) = \frac{g(i,\mu(i)) + \alpha\sum_{j\ne i} p_{ij}(\mu(i))J(j)}{1 - \alpha p_{ii}(\mu(i))} \]
and note that
\[ F_\mu(J)(i) = \frac{T_\mu(J)(i) - \alpha p_{ii}(\mu(i))J(i)}{1 - \alpha p_{ii}(\mu(i))}. \tag{3} \]
Fix $\epsilon > 0$. If $J \le T(J)$, let $\mu$ be such that $F_\mu(J)(i) \le F(J)(i) + \epsilon$ for all $i$. Then
\[ F(J)(i) + \epsilon \ge F_\mu(J)(i) = \frac{T_\mu(J)(i) - \alpha p_{ii}(\mu(i))J(i)}{1 - \alpha p_{ii}(\mu(i))} \ge \frac{T(J)(i) - \alpha p_{ii}(\mu(i))T(J)(i)}{1 - \alpha p_{ii}(\mu(i))} = T(J)(i). \]
Since $\epsilon > 0$ is arbitrary, we obtain $F(J)(i) \ge T(J)(i)$. Similarly, if $J \ge T(J)$, let $\mu$ be such that $T_\mu(J)(i) \le T(J)(i) + \epsilon$ for all $i$. Then
\[ F(J)(i) \le F_\mu(J)(i) = \frac{T_\mu(J)(i) - \alpha p_{ii}(\mu(i))J(i)}{1 - \alpha p_{ii}(\mu(i))} \le \frac{T(J)(i) + \epsilon - \alpha p_{ii}(\mu(i))T(J)(i)}{1 - \alpha p_{ii}(\mu(i))} \le T(J)(i) + \frac{\epsilon}{1-\alpha}. \]
Since $\epsilon > 0$ is arbitrary, we obtain $F(J)(i) \le T(J)(i)$.
From (1) and (2) we see that $F$ and $T$ have the same fixed points, so $J^*$ is the unique fixed point of $F$. Moreover, from the definition of $F$ it can be verified that
\[ J \le J' \ \Longrightarrow\ F(J) \le F(J'), \tag{4} \]
\[ F(J + re) \le F(J) + \alpha re, \qquad F(J - re) \ge F(J) - \alpha re, \qquad \forall\, r > 0. \tag{5} \]
For any bounded function $J$, let $r > 0$ be such that
\[ J - re \le J^* \le J + re. \]
Applying $F$ repeatedly to this relation and using Eqs. (4) and (5), we obtain
\[ F^k(J) - \alpha^k re \le J^* \le F^k(J) + \alpha^k re. \]
Therefore $F^k(J)$ converges to $J^*$. Furthermore, if $J \le T(J)$, then
\[ T^k(J) \le F^k(J) \le J^*, \qquad \forall\, k. \]
These equations demonstrate the faster convergence property of F over T.
As a final result (not explicitly required in the problem statement), we show that for any two bounded functions $J : S \to \Re$, $J' : S \to \Re$, we have
\[ \max_j |F(J)(j) - F(J')(j)| \le \alpha \max_j |J(j) - J'(j)|, \tag{6} \]
so $F$ is a contraction mapping with modulus $\alpha$. Indeed, we have
\[
F(J)(i) = \min_{u\in U(i)} \frac{g(i,u) + \alpha\sum_{j\ne i} p_{ij}(u)J(j)}{1 - \alpha p_{ii}(u)}
= \min_{u\in U(i)} \Big[ \frac{g(i,u) + \alpha\sum_{j\ne i} p_{ij}(u)J'(j)}{1 - \alpha p_{ii}(u)} + \frac{\alpha\sum_{j\ne i} p_{ij}(u)\big[J(j) - J'(j)\big]}{1 - \alpha p_{ii}(u)} \Big]
\le F(J')(i) + \alpha \max_j |J(j) - J'(j)|, \qquad \forall\, i,
\]
where we have used the fact that $\sum_{j\ne i} p_{ij}(u) = 1 - p_{ii}(u)$, so that
\[ \frac{\alpha\sum_{j\ne i} p_{ij}(u)}{1 - \alpha p_{ii}(u)} = \frac{\alpha\big(1 - p_{ii}(u)\big)}{1 - \alpha p_{ii}(u)} \le \alpha. \]
Thus, we have
\[ F(J)(i) - F(J')(i) \le \alpha \max_j |J(j) - J'(j)|, \qquad \forall\, i. \]
The roles of $J$ and $J'$ may be interchanged, so that also
\[ F(J')(i) - F(J)(i) \le \alpha \max_j |J(j) - J'(j)|, \qquad \forall\, i. \]
Combining the last two inequalities, we see that
\[ |F(J)(i) - F(J')(i)| \le \alpha \max_j |J(j) - J'(j)|, \qquad \forall\, i. \]
By taking the maximum over $i$, Eq. (6) follows.
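The faster convergence of $F$ over $T$ can also be observed numerically. The random discounted MDP below is an assumed instance; `T` and `F` are exactly the two operators of this exercise:

```python
import numpy as np

rng = np.random.default_rng(1)
n, nu, alpha = 5, 3, 0.9
P = rng.random((nu, n, n))
P /= P.sum(axis=2, keepdims=True)
g = rng.random((nu, n))

def T(J):
    return (g + alpha * P @ J).min(axis=0)

def F(J):
    # F(J)(i) = min_u [g(i,u) + alpha*sum_{j!=i} p_ij(u) J(j)] / (1 - alpha*p_ii(u))
    pii = P[:, np.arange(n), np.arange(n)]        # p_ii(u), shape (nu, n)
    off = alpha * (P @ J - pii * J)               # alpha * sum_{j!=i} p_ij(u) J(j)
    return ((g + off) / (1.0 - alpha * pii)).min(axis=0)

J_star = np.zeros(n)
for _ in range(5000):                             # essentially exact fixed point
    J_star = T(J_star)

JT = np.zeros(n)
JF = np.zeros(n)
for _ in range(50):
    JT, JF = T(JT), F(JF)

errT = np.max(np.abs(JT - J_star))
errF = np.max(np.abs(JF - J_star))
assert errF <= errT + 1e-12                       # F is at least as close after k steps
```

Starting from $J = 0 \le T(0)$ (costs are nonnegative here), the relation $T^k(J) \le F^k(J) \le J^*$ guarantees the assertion.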
1.9
(a) Since $J, J' \in B(S)$, i.e., are real-valued, bounded functions on $S$, we know that the infimum and the supremum of their difference are finite. We shall denote
\[ m = \min_{x\in S}\big[ J(x) - J'(x) \big] \qquad\text{and}\qquad M = \max_{x\in S}\big[ J(x) - J'(x) \big]. \]
Thus
\[ m \le J(x) - J'(x) \le M, \qquad \forall\, x \in S, \]
or
\[ J'(x) + m \le J(x) \le J'(x) + M, \qquad \forall\, x \in S. \]
Now we apply the mapping $T$ to the above inequalities. By property (1) we know that $T$ will preserve the inequalities. Thus
\[ T(J' + me)(x) \le T(J)(x) \le T(J' + Me)(x), \qquad \forall\, x \in S. \]
By property (2) we know that
\[ T(J)(x) + \min[a_1 r,\, a_2 r] \le T(J + re)(x) \le T(J)(x) + \max[a_1 r,\, a_2 r]. \]
If we replace $r$ by $m$ or $M$, we get the inequalities
\[ T(J')(x) + \min[a_1 m,\, a_2 m] \le T(J' + me)(x) \le T(J')(x) + \max[a_1 m,\, a_2 m] \]
and
\[ T(J')(x) + \min[a_1 M,\, a_2 M] \le T(J' + Me)(x) \le T(J')(x) + \max[a_1 M,\, a_2 M]. \]
Thus
\[ T(J')(x) + \min[a_1 m,\, a_2 m] \le T(J)(x) \le T(J')(x) + \max[a_1 M,\, a_2 M], \]
so that
\[ |T(J)(x) - T(J')(x)| \le \max\big[a_1|M|,\, a_2|M|,\, a_1|m|,\, a_2|m|\big]. \]
We also have
\[ \max\big[a_1|M|,\, a_2|M|,\, a_1|m|,\, a_2|m|\big] \le a_2\max\big[|M|,\,|m|\big] \le a_2\sup_{x\in S}|J(x) - J'(x)|. \]
Thus
\[ |T(J)(x) - T(J')(x)| \le a_2\max_{x\in S}|J(x) - J'(x)|, \]
from which
\[ \max_{x\in S}|T(J)(x) - T(J')(x)| \le a_2\max_{x\in S}|J(x) - J'(x)|. \]
Thus $T$ is a contraction mapping, since we know by the statement of the problem that $0 \le a_1 \le a_2 < 1$. Since the set $B(S)$ of bounded real-valued functions is a complete linear space, we conclude that the contraction mapping $T$ has a unique fixed point $J^*$, and $\lim_{k\to\infty}T^k(J)(x) = J^*(x)$.
(b) We shall first prove the lower bounds for $J^*$. Proceeding as in part (a), we obtain
\[
J(x) + \min\Big[\sum_{m=0}^k a_1^m c,\ \sum_{m=0}^k a_2^m c\Big]
\le T(J)(x) + \min\Big[\sum_{m=1}^k a_1^m c,\ \sum_{m=1}^k a_2^m c\Big]
\le \cdots \le T^k(J)(x) + \min[a_1^k c,\ a_2^k c] \le T^{k+1}(J)(x). \tag{3}
\]
By taking the limit as $k \to \infty$ and noting that the quantities in the minimization are monotone, and either nonnegative or nonpositive, we conclude that
\[
J(x) + \min\Big[\frac{c}{1-a_1},\ \frac{c}{1-a_2}\Big]
\le T(J)(x) + \min\Big[\frac{a_1 c}{1-a_1},\ \frac{a_2 c}{1-a_2}\Big]
\le \cdots \le T^k(J)(x) + \min\Big[\frac{a_1^k c}{1-a_1},\ \frac{a_2^k c}{1-a_2}\Big]
\le T^{k+1}(J)(x) + \min\Big[\frac{a_1^{k+1} c}{1-a_1},\ \frac{a_2^{k+1} c}{1-a_2}\Big]
\le \cdots \le J^*(x). \tag{4}
\]
Finally we note that
\[ \min[a_1^k c,\ a_2^k c]\,e \le T^{k+1}(J) - T^k(J). \]
Thus
\[ \min[a_1^k c,\ a_2^k c] \le \min_{x\in S}\big(T^{k+1}(J)(x) - T^k(J)(x)\big). \]
Let $b_{k+1} = \min_{x\in S}\big(T^{k+1}(J)(x) - T^k(J)(x)\big)$, so that $\min[a_1^k c,\ a_2^k c] \le b_{k+1}$. From the above relation we infer that
\[ \min\Big[\frac{a_1^{k+1}c}{1-a_1},\ \frac{a_2^{k+1}c}{1-a_2}\Big] \le \min\Big[\frac{a_1 b_{k+1}}{1-a_1},\ \frac{a_2 b_{k+1}}{1-a_2}\Big] = c_{k+1}. \]
Therefore
\[ T^k(J)(x) + \min\Big[\frac{a_1^k c}{1-a_1},\ \frac{a_2^k c}{1-a_2}\Big] \le T^{k+1}(J)(x) + c_{k+1}. \]
This relationship gives for $k = 1$
\[ T(J)(x) + \min\Big[\frac{a_1 c}{1-a_1},\ \frac{a_2 c}{1-a_2}\Big] \le T^2(J)(x) + c_2. \]
Let
\[ c = \min_{x\in S}\big(T(J)(x) - J(x)\big). \]
Then the above inequality still holds. From the definition of $c_1$ we have
\[ c_1 = \min\Big[\frac{a_1 c}{1-a_1},\ \frac{a_2 c}{1-a_2}\Big]. \]
Therefore
\[ T(J)(x) + c_1 \le T^2(J)(x) + c_2 \qquad\text{and}\qquad T(J)(x) + c_1 \le J^*(x). \]
Then, with $J_1 = T(J)$,
\[ \min[a_1 b_2,\ a_2 b_2] \le \min_{x\in S}\big[T^2(J_1)(x) - T(J_1)(x)\big] = \min_{x\in S}\big[T^3(J)(x) - T^2(J)(x)\big] = b_3. \]
Thus
\[ \min\Big[\frac{a_1^2 b_2}{1-a_1},\ \frac{a_2^2 b_2}{1-a_2}\Big] \le \min\Big[\frac{a_1 b_3}{1-a_1},\ \frac{a_2 b_3}{1-a_2}\Big], \]
so that
\[ T(J_1)(x) + \min\Big[\frac{a_1 b_2}{1-a_1},\ \frac{a_2 b_2}{1-a_2}\Big] \le T^2(J_1)(x) + \min\Big[\frac{a_1 b_3}{1-a_1},\ \frac{a_2 b_3}{1-a_2}\Big], \]
or
\[ T^2(J)(x) + c_2 \le T^3(J)(x) + c_3 \qquad\text{and}\qquad T^2(J)(x) + c_2 \le J^*(x). \]
Proceeding similarly, the result is proved.
The reverse inequalities can be proved by a similar argument.
(c) Let us first consider the state $x = 1$:
\[ F(J)(1) = \min_{u\in U(1)}\Big[\, g(1,u) + \alpha\sum_{j=1}^n p_{1j}J(j) \Big]. \]
Thus
\[ F(J+re)(1) = \min_{u\in U(1)}\Big[\, g(1,u) + \alpha\sum_{j=1}^n p_{1j}(J+re)(j) \Big]
= \min_{u\in U(1)}\Big[\, g(1,u) + \alpha\sum_{j=1}^n p_{1j}J(j) + \alpha r \Big] = F(J)(1) + \alpha r, \]
so that
\[ \frac{F(J+re)(1) - F(J)(1)}{r} = \alpha. \tag{1} \]
Next consider state 2, for which
\[ F(J)(2) = \min_{u\in U(2)}\Big[\, g(2,u) + \alpha p_{21}F(J)(1) + \alpha\sum_{j=2}^n p_{2j}J(j) \Big], \]
and
\[ F(J+re)(2) = \min_{u\in U(2)}\Big[\, g(2,u) + \alpha p_{21}F(J+re)(1) + \alpha\sum_{j=2}^n p_{2j}(J+re)(j) \Big]
= \min_{u\in U(2)}\Big[\, g(2,u) + \alpha p_{21}F(J)(1) + \alpha^2 r p_{21} + \alpha\sum_{j=2}^n p_{2j}J(j) + \alpha\sum_{j=2}^n p_{2j}\,r \Big], \]
where, for the last equality, we used relation (1). Thus we conclude
\[ F(J+re)(2) = F(J)(2) + \alpha^2 r p_{21} + \alpha r(1 - p_{21}), \]
which yields
\[ \frac{F(J+re)(2) - F(J)(2)}{r} = \alpha^2 p_{21} + \alpha(1 - p_{21}). \tag{2} \]
Now let us study the behavior of the right-hand side of Eq. (2). We have $0 < \alpha < 1$ and $0 \le p_{21} \le 1$, so since $\alpha^2 \le \alpha$, and $\alpha^2 p_{21} + \alpha(1-p_{21})$ is a convex combination of $\alpha^2$ and $\alpha$, it is easy to see that
\[ \alpha^2 \le \alpha^2 p_{21} + \alpha(1 - p_{21}) \le \alpha. \tag{3} \]
If we combine Eq. (2) with Eq. (3) we get
\[ \alpha^2 \le \frac{F(J+re)(2) - F(J)(2)}{r} \le \alpha. \]
Proceeding by induction, assume that
\[ \alpha^j \le \frac{F(J+re)(j) - F(J)(j)}{r} \le \alpha, \qquad j \le i. \]
We have
\[ F(J+re)(i+1) = \min_{u\in U(i+1)}\Big[\, g(i+1,u) + \alpha\sum_{j=1}^i p_{i+1,j}F(J+re)(j) + \alpha\sum_{j=i+1}^n p_{i+1,j}(J+re)(j) \Big], \]
so that, with
\[ p = \sum_{j=1}^i p_{i+1,j}, \]
the induction hypothesis gives
\[ F(J)(i+1) + \alpha r\sum_{j=1}^i \alpha^j p_{i+1,j} + \alpha r(1-p) \le F(J+re)(i+1) \le F(J)(i+1) + \alpha^2 rp + \alpha r(1-p). \]
Obviously
\[ \sum_{j=1}^i \alpha^j p_{i+1,j} \ge \alpha^i \sum_{j=1}^i p_{i+1,j} = \alpha^i p. \]
Thus
\[ \alpha^{i+1}p + \alpha(1-p) \le \frac{F(J+re)(i+1) - F(J)(i+1)}{r} \le \alpha^2 p + \alpha(1-p). \]
Since $0 < \alpha^{i+1} \le \alpha^2 < 1$ and $0 \le p \le 1$, we conclude that $\alpha^{i+1} \le \alpha^{i+1}p + \alpha(1-p)$ and $\alpha^2 p + \alpha(1-p) \le \alpha$. Thus
\[ \alpha^{i+1} \le \frac{F(J+re)(i+1) - F(J)(i+1)}{r} \le \alpha, \]
which completes the induction.
Finally, consider a mapping of the form $T(J)(x) = g(x) + (MJ)(x)$, where $M$ is a monotone linear mapping. If $J'(x) \ge J(x)$ for all $x$, then $(MJ')(x) \ge (MJ)(x)$, so that
\[ g(x) + (MJ')(x) \ge g(x) + (MJ)(x), \]
i.e., $T(J')(x) \ge T(J)(x)$; thus property (1) holds.
For property (2) we note that
\[ T(J+re)(x) = g(x) + M(J+re)(x) = g(x) + (MJ)(x) + r(Me)(x) = T(J)(x) + r(Me)(x). \]
We have
\[ a_1 \le (Me)(x) \le a_2, \]
so that
\[ \frac{T(J+re)(x) - T(J)(x)}{r} = (Me)(x) \]
and
\[ a_1 \le \frac{T(J+re)(x) - T(J)(x)}{r} \le a_2. \]
Thus property (2) also holds if $a_2 < 1$.
1.10
(a) If there is a unique $\mu$ such that $T_\mu(J) = T(J)$, then there exists an $\epsilon > 0$ such that for all $\delta \in \Re^n$ with $\max_i|\delta(i)| \le \epsilon$ we have
\[ F(J + \delta) = T(J+\delta) - (J+\delta) = g_\mu + \alpha P_\mu(J+\delta) - (J+\delta) = g_\mu + (\alpha P_\mu - I)(J+\delta). \]
It follows that $F$ is linear around $J$ and its Jacobian is $\alpha P_\mu - I$.
(b) We first note that the equation defining Newton's method is the first order Taylor series expansion of $F$ around $J_k$. If $\mu^k$ is the unique $\mu$ such that $T_\mu(J_k) = T(J_k)$, then $F$ is linear near $J_k$ and coincides with its first order Taylor series expansion around $J_k$. Therefore the vector $J_{k+1}$ obtained by the Newton iteration satisfies
\[ F(J_{k+1}) = 0, \]
or
\[ T_{\mu^k}(J_{k+1}) = J_{k+1}. \]
This equation yields $J_{k+1} = J_{\mu^k}$, where $\mu^k = \arg\min_\mu T_\mu(J_k)$.
This is precisely the policy iteration algorithm.
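The identification of policy iteration with Newton's method applied to $F(J) = T(J) - J$ can be traced numerically; the random discounted MDP below is an assumed instance, and each Newton step is solved as the linear policy-evaluation system:

```python
import numpy as np

rng = np.random.default_rng(2)
n, nu, alpha = 4, 3, 0.9
P = rng.random((nu, n, n))
P /= P.sum(axis=2, keepdims=True)
g = rng.random((nu, n))

J = np.zeros(n)
for _ in range(30):
    mu = (g + alpha * P @ J).argmin(axis=0)        # greedy policy: T_mu(J) = T(J)
    P_mu = P[mu, np.arange(n), :]                  # rows P[mu(i), i, :]
    g_mu = g[mu, np.arange(n)]
    # Newton step: F is linear near J with Jacobian alpha*P_mu - I, so
    # F(J') = 0 reduces to the policy-evaluation equation (I - alpha*P_mu) J' = g_mu
    J = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)

J_star = np.zeros(n)
for _ in range(5000):
    J_star = (g + alpha * P @ J_star).min(axis=0)  # plain value iteration
assert np.max(np.abs(J - J_star)) < 1e-8
```

Each iterate is exactly the cost of the current greedy policy, and the sequence terminates at $J^*$ after finitely many policy changes.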
1.12
For simplicity, we consider the case where $U(i)$ consists of a single control. The calculations are very similar for the more general case. We first show that $\sum_{j=1}^n \tilde M_{ij} = \alpha$. We apply the definition of the quantities $\tilde M_{ij}$:
\[
\sum_{j=1}^n \tilde M_{ij} = \sum_{j=1}^n \Big[ \delta_{ij} + \frac{(1-\alpha)(M_{ij} - \delta_{ij})}{1 - m_i} \Big]
= \sum_{j=1}^n \delta_{ij} + \sum_{j=1}^n \frac{(1-\alpha)(M_{ij} - \delta_{ij})}{1 - m_i}
= 1 + \frac{(1-\alpha)\sum_{j=1}^n M_{ij}}{1 - m_i} - \frac{1-\alpha}{1 - m_i}\sum_{j=1}^n \delta_{ij}
= 1 + \frac{(1-\alpha)\,m_i}{1 - m_i} - \frac{1-\alpha}{1 - m_i} = 1 - (1-\alpha) = \alpha.
\]
Let $J^*_1, \dots, J^*_n$ satisfy
\[ J^*_i = g_i + \sum_{j=1}^n M_{ij}J^*_j. \tag{1} \]
We substitute $J^*$ in the modified equation and manipulate it until we reach a relation that holds trivially:
\[
J^*_i = \frac{g_i(1-\alpha)}{1 - m_i} + \sum_{j=1}^n \delta_{ij}J^*_j + \frac{1-\alpha}{1-m_i}\sum_{j=1}^n (M_{ij} - \delta_{ij})J^*_j
= \frac{g_i(1-\alpha)}{1-m_i} + J^*_i + \frac{1-\alpha}{1-m_i}\sum_{j=1}^n M_{ij}J^*_j - \frac{1-\alpha}{1-m_i}J^*_i,
\]
or
\[ J^*_i = J^*_i + \frac{1-\alpha}{1-m_i}\Big( g_i + \sum_{j=1}^n M_{ij}J^*_j - J^*_i \Big). \]
This relation follows trivially from Eq. (1) above. Thus $J^*$ is a solution of
\[ J_i = \frac{g_i(1-\alpha)}{1-m_i} + \sum_{j=1}^n \tilde M_{ij}J_j. \]
1.17
The form of Bellman's equation for the tax problem is
\[ J(x) = \min_i \Big[ \sum_{j\ne i} c_j(x_j) + \alpha E_{w_i}\big\{ J[x_1, \dots, x_{i-1}, f_i(x_i,w_i), x_{i+1}, \dots, x_n] \big\} \Big]. \]
Let $\tilde J(x) = -J(x)$. Then
\[ \tilde J(x) = \max_i \Big[ -\sum_{j=1}^n c_j(x_j) + c_i(x_i) + \alpha E_{w_i}\{ \tilde J[\,\cdot\,] \} \Big]. \]
Let
\[ \hat J(x) = (1-\alpha)\tilde J(x) + \sum_{j=1}^n c_j(x_j). \]
By substitution we obtain
\[ \hat J(x) = \max_i \Big[ -(1-\alpha)\sum_{j=1}^n c_j(x_j) + (1-\alpha)c_i(x_i) + \alpha E_{w_i}\{(1-\alpha)\tilde J[\,\cdot\,]\} \Big] + \sum_{j=1}^n c_j(x_j) \]
\[ = \max_i \Big[ c_i(x_i) - \alpha E_{w_i}\{ c_i(f_i(x_i,w_i)) \} + \alpha E_{w_i}\{ \hat J(\,\cdot\,) \} \Big]. \]
Thus $\hat J$ satisfies Bellman's equation of a multi-armed bandit problem with
\[ R_i(x_i) = c_i(x_i) - \alpha E_{w_i}\big\{ c_i(f_i(x_i,w_i)) \big\}. \]
1.18
Bellman's equation for the restart problem is
\[ J(x) = \max\big[\, R(x_0) + \alpha E\{J[f(x_0,w)]\},\ R(x) + \alpha E\{J[f(x,w)]\} \,\big]. \tag{A} \]
Now, consider the one-armed bandit problem with reward $R(x)$:
\[ J(x,M) = \max\big\{\, M,\ R(x) + \alpha E[J(f(x,w),M)] \,\big\}. \tag{B} \]
We have
\[ J(x_0,M) = R(x_0) + \alpha E\big[J(f(x_0,w),M)\big] > M \]
if $M < m(x_0)$, and $J(x_0,M) = M$ otherwise. This implies that
\[ R(x_0) + \alpha E\big[ J\big(f(x_0,w),\, m(x_0)\big) \big] = m(x_0). \]
Therefore the forms of both Bellman's equations (A) and (B) are the same when $M = m(x_0)$.
Solutions Vol. II, Chapter 2
2.1
(a) (i) First, we need to define a state space for the problem. The obvious choice for a state variable
is our location. However, this does not encapsulate all of the necessary information. We also need to
include the value of c if it is known. Thus, let the state space consist of the following 2m + 2 states:
$\{S, S_1, \dots, S_m, I_1, \dots, I_m, D\}$, where $S$ is associated with being at the starting point with no information, $S_i$ and $I_i$ are associated with being at $S$ and $I$, respectively, and knowing that $c = c_i$, and $D$ is the termination state.
At state $S$, there are two possible controls: go directly to $D$ (direct) or go to an intermediate point (indirect). If control direct is selected, we go to state $D$ with probability 1, and the cost is $g(S, \text{direct}, D) = a$. If control indirect is selected, we go to state $I_i$ with probability $p_i$, and the cost is $g(S, \text{indirect}, I_i) = b$.
At state $S_i$, for $i \in \{1,\dots,m\}$, we have the same controls as at state $S$. Again, if control direct is selected, we go to state $D$ with probability 1, and the cost is $g(S_i, \text{direct}, D) = a$. If, on the other hand, control indirect is selected, we go to state $I_i$ with probability 1, and the cost is $g(S_i, \text{indirect}, I_i) = b$.
At state $I_i$, for $i \in \{1,\dots,m\}$, there are also two possible controls: go back to the start (start) or go to the destination (dest). If control start is selected, we go to state $S_i$ with probability 1, and the cost is $g(I_i, \text{start}, S_i) = b$. If control dest is selected, we go to state $D$ with probability 1, and the cost is $g(I_i, \text{dest}, D) = c_i$.
We have thus formulated the problem as a stochastic shortest path problem. Bellman's equation for this problem is
\[ J^*(S) = \min\Big[ a,\ b + \sum_{i=1}^m p_i J^*(I_i) \Big], \]
\[ J^*(S_i) = \min\big[ a,\ b + J^*(I_i) \big], \]
\[ J^*(I_i) = \min\big[ c_i,\ b + J^*(S_i) \big]. \]
We assume that $b > 0$. Then, Assumptions 5.1 and 5.2 hold since all improper policies have infinite cost. As a result, if $\mu^*(I_i) = \text{start}$, then $\mu^*(S_i) = \text{direct}$ (otherwise the system would cycle between $S_i$ and $I_i$ forever). If $\mu^*(I_i) \ne \text{start}$, then we never reach state $S_i$, and so it doesn't matter what the control is in this case. Thus, $J^*(S_i) = a$ and $\mu^*(S_i) = \text{direct}$. From this, it is easy to derive the optimal costs and controls for the other states:
\[ J^*(I_i) = \min[c_i,\ b + a], \]
\[ \mu^*(I_i) = \begin{cases} \text{dest}, & \text{if } c_i < b + a \\ \text{start}, & \text{otherwise,} \end{cases} \]
\[ J^*(S) = \min\Big[ a,\ b + \sum_{i=1}^m p_i \min(c_i,\ b+a) \Big], \]
\[ \mu^*(S) = \begin{cases} \text{direct}, & \text{if } a < b + \sum_{i=1}^m p_i \min(c_i,\ b+a) \\ \text{indirect}, & \text{otherwise.} \end{cases} \]
For the numerical case given, we see that $a < b + \sum_{i=1}^m p_i \min(c_i, b+a)$, since $a = 2$ and $b + \sum_{i=1}^m p_i \min(c_i, b+a) = 2.5$. Hence $\mu^*(S) = \text{direct}$. We need not consider the other states since they will never be reached.
(ii) In this case, every time we are at the starting location, our available information is the same. We thus no longer need the states $S_i$ from part (i). Our state space for this part is then $\{S, I_1, \dots, I_m, D\}$.
At state $S$, the possible controls are $\{\text{direct}, \text{indirect}\}$. If control direct is selected, we go to state $D$ with probability 1, and the cost is $g(S, \text{direct}, D) = a$. If control indirect is selected, we go to state $I_i$ with probability $p_i$, and the cost is $g(S, \text{indirect}, I_i) = b$ [same as in part (i)].
At state $I_i$, for $i \in \{1,\dots,m\}$, the possible controls are $\{\text{start}, \text{dest}\}$. If control start is selected, we go to state $S$ with probability 1, and the cost is $g(I_i, \text{start}, S) = b$. If control dest is selected, we go to state $D$ with probability 1, and the cost is $g(I_i, \text{dest}, D) = c_i$.
Bellman's equation for this stochastic shortest path problem is
\[ J^*(S) = \min\Big[ a,\ b + \sum_{i=1}^m p_i J^*(I_i) \Big], \]
\[ J^*(I_i) = \min\big[ c_i,\ b + J^*(S) \big]. \]
The optimal policy can be described by
\[ \mu^*(S) = \begin{cases} \text{direct}, & \text{if } a < b + \sum_{i=1}^m p_i J^*(I_i) \\ \text{indirect}, & \text{otherwise,} \end{cases} \qquad
\mu^*(I_i) = \begin{cases} \text{dest}, & \text{if } c_i < b + J^*(S) \\ \text{start}, & \text{otherwise.} \end{cases} \]
We will solve the problem for the numerical case by guessing an optimal policy and then showing that the resulting cost satisfies Bellman's equation. Guess
\[ \mu(S) = \text{direct}, \qquad \mu(I_1) = \text{dest}, \qquad \mu(I_2) = \text{start}. \]
Then
\[ J(S) = a = 2, \qquad J(I_1) = c_1 = 0, \qquad J(I_2) = b + J(S) = 1 + 2 = 3. \]
From Bellman's equation, we have
\[ J(S) = \min\big(2,\ 1 + 0.5(3 + 0)\big) = 2, \]
\[ J(I_1) = \min(0,\ 1 + 2) = 0, \]
\[ J(I_2) = \min(5,\ 1 + 2) = 3. \]
Thus, our policy is optimal.
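The fixed-point check above is easy to reproduce; the snippet iterates Bellman's equation for the numerical data $a = 2$, $b = 1$, $p = (0.5, 0.5)$, $c = (0, 5)$:

```python
a, b = 2.0, 1.0
c = [0.0, 5.0]
p = [0.5, 0.5]

# iterate Bellman's equation from an arbitrary large initial guess
J_S, J_I = 10.0, [10.0, 10.0]
for _ in range(100):
    J_S = min(a, b + sum(pi * ji for pi, ji in zip(p, J_I)))
    J_I = [min(ci, b + J_S) for ci in c]

assert (J_S, J_I[0], J_I[1]) == (2.0, 0.0, 3.0)
```

The iteration settles on $J(S) = 2$, $J(I_1) = 0$, $J(I_2) = 3$, confirming that the guessed policy is optimal.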
(b) The state space for this problem is the same as for part (a)(ii): $\{S, I_1, \dots, I_m, D\}$.
At state $S$, the possible controls are $\{\text{direct}, \text{indirect}\}$. If control direct is selected, we go to state $D$ with probability 1, and the cost is $g(S, \text{direct}, D) = a$. If control indirect is selected, we go to state $I_i$ with probability $p_i$, and the cost is $g(S, \text{indirect}, I_i) = b$ [same as in parts (a)(i) and (ii)].
At state $I_i$, for $i \in \{1,\dots,m\}$, we have an additional option of waiting. So the possible controls are $\{\text{start}, \text{dest}, \text{wait}\}$. If control start is selected, we go to state $S$ with probability 1, and the cost is $g(I_i, \text{start}, S) = b$. If control dest is selected, we go to state $D$ with probability 1, and the cost is $g(I_i, \text{dest}, D) = c_i$. If control wait is selected, we go to state $I_j$ with probability $p_j$, and the cost is $g(I_i, \text{wait}, I_j) = d$.
Bellman's equation is
\[ J^*(S) = \min\Big[ a,\ b + \sum_{i=1}^m p_i J^*(I_i) \Big], \]
\[ J^*(I_i) = \min\Big[ c_i,\ b + J^*(S),\ d + \sum_{j=1}^m p_j J^*(I_j) \Big]. \]
We can describe the optimal policy as follows:
\[ \mu^*(S) = \begin{cases} \text{direct}, & \text{if } a < b + \sum_{i=1}^m p_i J^*(I_i) \\ \text{indirect}, & \text{otherwise.} \end{cases} \]
If direct was selected, we do not need to consider the other states (other than $D$) since they will never be reached. If indirect was selected, then defining $k = \min(2b, d)$, we see that
\[ \mu^*(I_i) = \begin{cases} \text{dest}, & \text{if } c_i < k + \sum_{j=1}^m p_j J^*(I_j) \\ \text{start}, & \text{if } c_i > k + \sum_{j=1}^m p_j J^*(I_j) \text{ and } 2b < d \\ \text{wait}, & \text{if } c_i > k + \sum_{j=1}^m p_j J^*(I_j) \text{ and } 2b > d. \end{cases} \]
2.2
Let's define the following states:
H: last flip outcome was heads;
T: last flip outcome was tails;
C: caught (this is the termination state).
(a) We can formulate this problem as a stochastic shortest path problem with state C being the termination state. There are four possible policies: $\pi_1$ = {always flip the fair coin}, $\pi_2$ = {always flip the two-headed coin}, $\pi_3$ = {flip the fair coin if the last outcome was heads / flip the two-headed coin if the last outcome was tails}, and $\pi_4$ = {flip the fair coin if the last outcome was tails / flip the two-headed coin if the last outcome was heads}. The only way to reach the termination state is to be caught cheating. Under all policies except $\pi_1$, this is inevitable. Thus $\pi_1$ is an improper policy, and $\pi_2$, $\pi_3$, and $\pi_4$ are proper policies.
(b) Let $J_{\pi_1}(H)$ and $J_{\pi_1}(T)$ be the costs corresponding to policy $\pi_1$, where the starting state is H and T, respectively. The expected benefit starting from state T up to the first return to T (and always using the fair coin) is
\[ \frac{1}{2}\Big( 1 + \frac{1}{2} + \frac{1}{2^2} + \cdots \Big) - \frac{m}{2} = \frac{1}{2}(2 - m). \]
Therefore
\[ J_{\pi_1}(T) = \begin{cases} +\infty & \text{if } m < 2 \\ 0 & \text{if } m = 2 \\ -\infty & \text{if } m > 2. \end{cases} \]
Also we have
\[ J_{\pi_1}(H) = \frac{1}{2}\big(1 + J_{\pi_1}(H)\big) + \frac{1}{2}J_{\pi_1}(T), \]
so
\[ J_{\pi_1}(H) = 1 + J_{\pi_1}(T). \]
It follows that if $m > 2$, then $\pi_1$ results in infinite cost for any initial state.
(c,d) The expected one-stage rewards at each stage are:
Play fair in state H: $\frac{1}{2}$;
Cheat in state H: $1 - p$;
Play fair in state T: $\frac{1-m}{2}$;
Cheat in state T: 0.
We show that any policy that cheats at H at some stage cannot be optimal. As a result we can eliminate
cheating from the control constraint set of state H.
Indeed, suppose we are at state H at some stage and consider a policy $\pi$ which cheats at the first stage and then follows the optimal policy $\pi^*$, so that
\[ J_\pi(H) = (1-p)\big[1 + J^*(H)\big]. \]
Consider instead playing fair at the first stage, cheating at T if the outcome is tails, and following $\pi^*$ thereafter, which yields
\[ \frac{1}{2}\big(1 + J^*(H)\big) + \frac{1}{2}\Big[(1-p)\big[1 + J^*(H)\big]\Big]
= \frac{1}{2} + \frac{1}{2}\big[J^*(H) + J_\pi(H)\big] \ge \frac{1}{2} + J_\pi(H), \]
where the inequality follows from the fact that $J^*(H) \ge J_\pi(H)$. Hence cheating at H is strictly worse.
Under $\pi_1$, we have (as computed in part (b))
\[ J_{\pi_1}(T) = \begin{cases} +\infty & \text{if } m < 2 \\ 0 & \text{if } m = 2 \\ -\infty & \text{if } m > 2, \end{cases} \qquad
J_{\pi_1}(H) = \begin{cases} +\infty & \text{if } m < 2 \\ 1 & \text{if } m = 2 \\ -\infty & \text{if } m > 2. \end{cases} \]
Under $\pi_3$, we have
\[ J_{\pi_3}(T) = (1-p)J_{\pi_3}(H), \]
\[ J_{\pi_3}(H) = \frac{1}{2}\big[1 + J_{\pi_3}(H)\big] + \frac{1}{2}J_{\pi_3}(T). \]
Solving these two equations yields
\[ J_{\pi_3}(T) = \frac{1-p}{p}, \qquad J_{\pi_3}(H) = \frac{1}{p}. \]
Thus if $m > 2$, it is optimal to cheat if the last flip was tails and play fair otherwise, and if $m < 2$, it is optimal to always play fair.
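The solution of the two linear equations for $\pi_3$ can be verified directly; numpy is used here only to solve the $2\times 2$ system for a few assumed values of $p$:

```python
import numpy as np

for p in (0.1, 0.3, 0.7):
    # unknowns x = (J(H), J(T)), from
    #   J(H) = 1/2 (1 + J(H)) + 1/2 J(T)   ->  0.5 J(H) - 0.5 J(T) = 0.5
    #   J(T) = (1 - p) J(H)                ->  -(1-p) J(H) + J(T) = 0
    A = np.array([[0.5, -0.5],
                  [-(1.0 - p), 1.0]])
    rhs = np.array([0.5, 0.0])
    JH, JT = np.linalg.solve(A, rhs)
    assert abs(JH - 1.0 / p) < 1e-10
    assert abs(JT - (1.0 - p) / p) < 1e-10
```

For every catch probability $p \in (0,1)$ the closed forms $J_{\pi_3}(H) = 1/p$ and $J_{\pi_3}(T) = (1-p)/p$ are recovered.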
2.7
(a) Let $i$ be any state in $S_m$. Then,
\[
J(i) = \min_{u\in U(i)} E\big\{ g(i,u,j) + J(j) \big\}
= \min_{u\in U(i)} \Big[ \sum_{j\in S_m} p_{ij}(u)\big[g(i,u,j) + J(j)\big] + \sum_{j\in S_{m-1}\cup\cdots\cup S_1\cup\{t\}} p_{ij}(u)\big[g(i,u,j) + J(j)\big] \Big]
\]
\[
= \min_{u\in U(i)} \Big[ \sum_{j\in S_m} p_{ij}(u)\big[g(i,u,j) + J(j)\big] + \Big(1 - \sum_{j\in S_m} p_{ij}(u)\Big)
\frac{\sum_{j\in S_{m-1}\cup\cdots\cup S_1\cup\{t\}} p_{ij}(u)\big[g(i,u,j) + J(j)\big]}{1 - \sum_{j\in S_m} p_{ij}(u)} \Big].
\]
In the above equation, we can think of the union of $S_{m-1}, \dots, S_1$, and $t$ as an aggregate termination state $t_m$ associated with $S_m$. The probability of a transition from $i \in S_m$ to $t_m$ (under $u$) is given by
\[ p_{it_m}(u) = 1 - \sum_{j\in S_m} p_{ij}(u). \]
The corresponding cost of a transition from $i \in S_m$ to $t_m$ (under $u$) is given by
\[ \bar g(i,u,t_m) = \frac{\sum_{j\in S_{m-1}\cup\cdots\cup S_1\cup\{t\}} p_{ij}(u)\big[g(i,u,j) + J(j)\big]}{p_{it_m}(u)}. \]
Thus, for $i \in S_m$, Bellman's equation can be written as
\[ J(i) = \min_{u\in U(i)} \Big[ \sum_{j\in S_m} p_{ij}(u)\big[g(i,u,j) + J(j)\big] + p_{it_m}(u)\big[\bar g(i,u,t_m) + 0\big] \Big]. \]
Note that with respect to $S_m$, the termination state $t_m$ is both absorbing and of zero cost. Let $t_m$ and $\bar g(i,u,t_m)$ be similarly constructed for $m = 1, \dots, M$.
The original stochastic shortest path problem can be solved as $M$ stochastic shortest path subproblems. To see how, start by evaluating $J(i)$ for $i \in S_1$ (where $t_1 = t$). With the values of $J(i)$, for $i \in S_1$, in hand, the $\bar g$ cost-terms for the $S_2$ problem can be computed. The solution of the original problem continues in this manner as the solution of $M$ stochastic shortest path problems in succession.
(b) Suppose that in the finite horizon problem there are $n$ states. Define a new state space $S_{\text{new}}$ and sets $S_m$ as follows:
\[ S_{\text{new}} = \big\{(k,i) \,\big|\, k \in \{0,1,\dots,M-1\} \text{ and } i \in \{1,2,\dots,n\}\big\}, \]
\[ S_m = \big\{(k,i) \,\big|\, k = M - m \text{ and } i \in \{1,2,\dots,n\}\big\} \]
for $m = 1,2,\dots,M$. (Note that the $S_m$'s do not overlap.) By associating $S_m$ with the state space of the original finite-horizon problem at stage $k = M - m$, we see that if $i_k \in S_m$, then $i_{k+1} \in S_{m-1}$ under all policies. By augmenting a termination state $t$ which is absorbing and of zero cost, we see that the original finite-horizon problem can be cast as a stochastic shortest path problem with the special structure indicated in the problem statement.
2.8
Let $J^*$ be the optimal cost of the original problem, so that
\[ J^*(i) = \min_u \sum_{j=1}^n p_{ij}(u)\big(g(i,u,j) + J^*(j)\big), \]
and let $\tilde J$ be the optimal cost of the modified problem, so that
\[ \tilde J(i) = \min_u \sum_{j=1,\,j\ne i}^n \frac{p_{ij}(u)}{1 - p_{ii}(u)} \Big( g(i,u,j) + \frac{g(i,u,i)\,p_{ii}(u)}{1 - p_{ii}(u)} + \tilde J(j) \Big). \]
For each $i$, let $\mu^*(i)$ attain the minimum in the first equation:
\[ J^*(i) = \sum_{j=1}^n p_{ij}\big(\mu^*(i)\big)\big(g(i,\mu^*(i),j) + J^*(j)\big). \]
Then
\[ J^*(i) = \Big[ \sum_{j=1,\,j\ne i}^n p_{ij}\big(\mu^*(i)\big)\big(g(i,\mu^*(i),j) + J^*(j)\big) \Big] + p_{ii}\big(\mu^*(i)\big)\big(g(i,\mu^*(i),i) + J^*(i)\big). \]
By collecting the terms involving $J^*(i)$,
\[ J^*(i) = \frac{1}{1 - p_{ii}(\mu^*(i))} \Big[ \Big( \sum_{j=1,\,j\ne i}^n p_{ij}\big(\mu^*(i)\big)\big(g(i,\mu^*(i),j) + J^*(j)\big) \Big) + p_{ii}\big(\mu^*(i)\big)\,g(i,\mu^*(i),i) \Big]. \]
Since $\sum_{j=1,\,j\ne i}^n \frac{p_{ij}(\mu^*(i))}{1 - p_{ii}(\mu^*(i))} = 1$, we have
\[ J^*(i) = \sum_{j=1,\,j\ne i}^n \frac{p_{ij}(\mu^*(i))}{1 - p_{ii}(\mu^*(i))} \Big( g(i,\mu^*(i),j) + J^*(j) + \frac{p_{ii}(\mu^*(i))\,g(i,\mu^*(i),i)}{1 - p_{ii}(\mu^*(i))} \Big). \]
Thus $J^*$ is the cost of the stationary policy $\{\mu^*, \mu^*, \dots\}$ in the modified problem, and therefore $J^*(i) \ge \tilde J(i)$ for all $i$.
Similarly, for each $i$, let $\bar\mu(i)$ be a control such that
\[ \tilde J(i) = \sum_{j=1,\,j\ne i}^n \frac{p_{ij}(\bar\mu(i))}{1 - p_{ii}(\bar\mu(i))} \Big( g(i,\bar\mu(i),j) + \frac{g(i,\bar\mu(i),i)\,p_{ii}(\bar\mu(i))}{1 - p_{ii}(\bar\mu(i))} + \tilde J(j) \Big). \]
Then, using a reverse argument from before, we see that $\tilde J$ is the cost of the stationary policy $\{\bar\mu, \bar\mu, \dots\}$ in the original problem. Thus $\tilde J(i) \ge J^*(i)$ for all $i$. Combining the two results, we have $\tilde J(i) = J^*(i)$, and thus the two problems have the same optimal costs.
If $p_{ii}(u) = 1$ for some $i \ne t$, we can eliminate $u$ from $U(i)$ without affecting $J^*(j)$, $j \ne i$. If that were not so, every optimal stationary policy would have to use $u$ at state $i$, and would therefore be improper, which is a contradiction.
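The equality of the two optimal costs can be checked numerically. The two-state SSP below, with a single control per state (so both equations reduce to linear fixed-point iterations) and unit transition costs, is an assumed instance:

```python
n = 2
# transition probabilities among states {1, 2}; the remaining mass goes to t
P = [[0.3, 0.4],
     [0.2, 0.1]]
g = 1.0                                   # cost of every transition

def fixed_point(update, iters=2000):
    J = [0.0] * n
    for _ in range(iters):
        J = update(J)
    return J

# original problem: J(i) = g + sum_j p_ij J(j)   (J(t) = 0)
orig = fixed_point(lambda J: [g + sum(P[i][j] * J[j] for j in range(n))
                              for i in range(n)])

# modified problem without self-transitions:
# Jt(i) = sum_{j != i, incl. t} p_ij/(1-p_ii) * (g + g*p_ii/(1-p_ii) + Jt(j))
def modified(J):
    out = []
    for i in range(n):
        pii = P[i][i]
        stage = g + g * pii / (1.0 - pii)     # per-transition cost of the modified problem
        tot = sum(P[i][j] / (1.0 - pii) * (stage + J[j])
                  for j in range(n) if j != i)
        p_t = (1.0 - sum(P[i])) / (1.0 - pii)  # modified transition to t
        tot += p_t * stage
        out.append(tot)
    return out

modi = fixed_point(modified)
assert all(abs(a - b) < 1e-9 for a, b in zip(orig, modi))
```

Here $J(i)$ is the expected number of transitions until termination, and eliminating self-transitions simply folds the expected self-loop time $p_{ii}/(1-p_{ii})$ into the stage cost.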
2.17
Consider a modified stochastic shortest path problem where the state space is $S' = S_S \cup S_{SU}$, where
$S_S = \{1, \dots, n, t\}$, where each $i \in S_S$ corresponds to a state $i$ of the original problem;
$S_{SU} = \{(i,u) \mid i \in S,\ u \in U(i)\}$, where each $(i,u) \in S_{SU}$ corresponds to choosing control $u \in U(i)$ at state $i \in S$.
For $i \in S_S$, we define $U'(i) = U(i)$; choosing $u \in U'(i)$ at state $i$ moves the system to state $(i,u)$ with probability 1 and zero cost. For $(i,u) \in S_{SU}$ and $j \in S_S$, the only possible control is $u' = u$; the transition probabilities and costs are $p'_{(i,u)j}(u') = p_{ij}(u)$ and $g(i,u,j)$.
Since trajectories originating from a state $i \in S_S$ are equivalent to trajectories in the original problem, the optimal cost-to-go of state $i$ in the modified problem is $J^*(i)$, the optimal cost of the original problem. Denote the optimal cost-to-go of $(i,u) \in S_{SU}$ by $J'(i,u)$. Then $J^*(i)$ and $J'(i,u)$ satisfy
\[ J^*(i) = \min_{u\in U(i)} \sum_{j=1}^n p_{ij}(u)\big(g(i,u,j) + J^*(j)\big), \tag{1} \]
\[ J'(i,u) = \sum_{j=1}^n p_{ij}(u)\big(g(i,u,j) + J^*(j)\big). \tag{2} \]
The Q-factors for the original problem are defined as
\[ Q(i,u) = \sum_{j=1}^n p_{ij}(u)\big(g(i,u,j) + J^*(j)\big), \tag{3} \]
so from Eq. (2), we have $Q(i,u) = J'(i,u)$. Moreover, from Eqs. (1) and (3),
\[ J^*(i) = \min_{u\in U(i)} Q(i,u), \tag{4} \]
so that
\[ Q(i,u) = \sum_{j=1}^n p_{ij}(u)\Big( g(i,u,j) + \min_{u'\in U(j)} Q(j,u') \Big). \tag{5} \]
There remains to show that there is no other solution to Eq. (5). Indeed, if $\tilde Q(i,u)$ were such that
\[ \tilde Q(i,u) = \sum_{j=1}^n p_{ij}(u)\Big( g(i,u,j) + \min_{u'\in U(j)} \tilde Q(j,u') \Big), \qquad \forall\,(i,u), \tag{6} \]
then by defining
\[ \tilde J(i) = \min_{u\in U(i)} \tilde Q(i,u) \tag{7} \]
we obtain from Eq. (6)
\[ \tilde Q(i,u) = \sum_{j=1}^n p_{ij}(u)\big(g(i,u,j) + \tilde J(j)\big), \qquad \forall\,(i,u). \tag{8} \]
By combining Eqs. (7) and (8), we have
\[ \tilde J(i) = \min_{u\in U(i)} \sum_{j=1}^n p_{ij}(u)\big(g(i,u,j) + \tilde J(j)\big), \qquad \forall\, i. \tag{9} \]
Thus $\tilde J(i)$ and $\tilde Q(i,u)$ satisfy Bellman's equations (1)-(2) for the modified problem. Since this Bellman equation is solved uniquely by $J^*(i)$ and $J'(i,u)$, it follows that $\tilde Q(i,u) = Q(i,u)$ for all $(i,u)$.
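A quick numerical illustration of the Q-factor fixed point: iterating Eq. (5) (here in a discounted variant on a random MDP, both assumptions for simplicity) drives any starting $Q$ to the unique solution, whose state-wise minimum matches ordinary value iteration:

```python
import numpy as np

rng = np.random.default_rng(3)
n, nu, alpha = 4, 3, 0.9
P = rng.random((nu, n, n))
P /= P.sum(axis=2, keepdims=True)
g = rng.random((nu, n))

# Q-factor iteration: Q(i,u) <- g(i,u) + alpha * sum_j p_ij(u) min_u' Q(j,u')
Q = np.zeros((nu, n))
for _ in range(3000):
    Q = g + alpha * P @ Q.min(axis=0)

# ordinary value iteration for J*
J = np.zeros(n)
for _ in range(3000):
    J = (g + alpha * P @ J).min(axis=0)

assert np.max(np.abs(Q.min(axis=0) - J)) < 1e-8
```

This reproduces Eq. (4): $J^*(i) = \min_u Q(i,u)$.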
(J
) T(J
) +e = J
, we obtain
T
2
(J
) T
(J
) +e J
+e +e.
Proceeding similarly, we obtain
T
k
(J
) T
(J
) +
_
k2
i=0
i
_
e J
+
k1
i=0
i
e
and by taking limit as k , the desired result J
re.
Then, applying T
k
to this inequality, we have
J
= T
k
(J
) T
k
(J
)
k
re.
Taking the limit as k , we obtain J
,
yields J
= J
(x) J
(x) +
i=0
i
Let
i
=
2
i+1
i
> 0.
Thus,
J
(x) J
(x) +
i=0
1
2
i+1
= J
(x) + xS.
57
If < 1, choose
i
=
i=0
i
which is independent of i. In this case,
. In particular, let us consider a system with only one state, i.e. S = {0}, U = (0, ), J
0
(0) = 0, and
g(0, u) = u. Then J
(0) = inf
k=0
u = .
3.9
Let $\pi^* = \{\mu^*_0, \mu^*_1, \dots\}$ be an optimal policy. Then we know that
\[ J^*(x) = J_{\pi^*}(x) = \lim_{k\to\infty}\big(T_{\mu^*_0}T_{\mu^*_1}\cdots T_{\mu^*_k}\big)(J_0)(x) = \lim_{k\to\infty} T_{\mu^*_0}\Big(\big(T_{\mu^*_1}\cdots T_{\mu^*_k}\big)(J_0)\Big)(x). \]
From monotone convergence we know that
\[ J^*(x) = \lim_{k\to\infty} T_{\mu^*_0}\Big(\big(T_{\mu^*_1}\cdots T_{\mu^*_k}\big)(J_0)\Big)(x) = T_{\mu^*_0}\Big( \lim_{k\to\infty}\big(T_{\mu^*_1}\cdots T_{\mu^*_k}\big)(J_0) \Big)(x)
\ge T_{\mu^*_0}(J^*)(x) \ge T(J^*)(x) = J^*(x). \]
Thus $T_{\mu^*_0}(J^*)(x) = J^*(x)$, and the stationary policy $\{\mu^*_0, \mu^*_0, \dots\}$ is optimal.
3.12
We shall make an analysis similar to that of Section 3.1. In particular, let
\[ J_0(x) = 0, \]
\[ T(J_0)(x) = \min_u\,[x'Qx + u'Ru] = x'Qx = x'K_0x, \]
\[ T^2(J_0)(x) = \min_u\,[x'Qx + u'Ru + (Ax+Bu)'Q(Ax+Bu)] = x'K_1x, \]
where $K_1 = Q + L_1'RL_1 + D_1'K_0D_1$, with $D_1 = A + BL_1$ and $L_1 = -(R + B'K_0B)^{-1}B'K_0A$. Thus
\[ T^k(J_0)(x) = x'K_kx, \]
where $K_k = Q + L_k'RL_k + D_k'K_{k-1}D_k$, with $D_k = A + BL_k$ and $L_k = -(R + B'K_{k-1}B)^{-1}B'K_{k-1}A$. By the analysis of Chapter 4 we conclude that $K_k \to K$, with $K$ being the solution of the algebraic Riccati equation. Thus $J_\infty(x) = x'Kx = \lim_{N\to\infty}T^N(J_0)(x)$. Then it is easy to verify that $J_\infty(x) = T(J_\infty)(x)$, and by Prop. 1.5 in Chapter 1, we have that $J_\infty(x) = J^*(x)$.
For the periodic problem the controllability assumption is that there exists a finite sequence of controls $\{u_0, \dots, u_r\}$ such that $x_{r+1} = 0$. Then the optimal control sequence is periodic:
\[ \pi^* = \{\mu^*_0, \mu^*_1, \dots, \mu^*_{p-1}, \mu^*_0, \mu^*_1, \dots, \mu^*_{p-1}, \dots\}, \]
where
\[ \mu^*_i(x) = -(R_i + B_i'K_{i+1}B_i)^{-1}B_i'K_{i+1}A_i\,x, \qquad i = 0, \dots, p-2, \]
\[ \mu^*_{p-1}(x) = -(R_{p-1} + B_{p-1}'K_0B_{p-1})^{-1}B_{p-1}'K_0A_{p-1}\,x, \]
and $K_0, \dots, K_{p-1}$ satisfy the coupled set of $p$ algebraic Riccati equations
\[ K_i = A_i'\big[K_{i+1} - K_{i+1}B_i(R_i + B_i'K_{i+1}B_i)^{-1}B_i'K_{i+1}\big]A_i + Q_i, \qquad i = 0,\dots,p-2, \]
\[ K_{p-1} = A_{p-1}'\big[K_0 - K_0B_{p-1}(R_{p-1} + B_{p-1}'K_0B_{p-1})^{-1}B_{p-1}'K_0\big]A_{p-1} + Q_{p-1}. \]
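The Riccati iteration implicit in $T^k(J_0)(x) = x'K_kx$ can be run directly; the controllable $2\times 2$ system below is an assumed example:

```python
import numpy as np

A = np.array([[1.1, 0.2],
              [0.0, 0.9]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

def riccati_step(K):
    # K <- A'[K - K B (R + B'K B)^{-1} B'K] A + Q
    return A.T @ (K - K @ B @ np.linalg.inv(R + B.T @ K @ B) @ B.T @ K) @ A + Q

K = np.zeros((2, 2))
for _ in range(500):
    K = riccati_step(K)

# K is (approximately) a fixed point of the algebraic Riccati equation
assert np.max(np.abs(riccati_step(K) - K)) < 1e-8
assert np.all(np.linalg.eigvalsh(K) > 0)      # positive definite
```

Since $(A, B)$ is controllable and $Q > 0$, the iterates converge even though $A$ itself is unstable.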
3.14
The formulation of the problem falls under Assumption P for periodic policies. Moreover, the problem is discounted. Since the $w_k$ are independent with zero mean, the optimality equation for the equivalent stationary problem reduces to the following system of equations:
\[ \tilde J(x_0, 0) = \min_{u_0\in U(x_0)} E_{w_0}\big\{ x_0'Q_0x_0 + u_0(x_0)'R_0u_0(x_0) + \alpha\tilde J(A_0x_0 + B_0u_0 + w_0,\ 1) \big\}, \]
\[ \tilde J(x_1, 1) = \min_{u_1\in U(x_1)} E_{w_1}\big\{ x_1'Q_1x_1 + u_1(x_1)'R_1u_1(x_1) + \alpha\tilde J(A_1x_1 + B_1u_1 + w_1,\ 2) \big\}, \]
\[ \vdots \]
\[ \tilde J(x_{p-1}, p-1) = \min_{u_{p-1}\in U(x_{p-1})} E_{w_{p-1}}\big\{ x_{p-1}'Q_{p-1}x_{p-1} + u_{p-1}(x_{p-1})'R_{p-1}u_{p-1}(x_{p-1}) + \alpha\tilde J(A_{p-1}x_{p-1} + B_{p-1}u_{p-1} + w_{p-1},\ 0) \big\}. \tag{1} \]
From the analysis in Section 7.8 in Ch. 7 on periodic problems we see that there exists a periodic policy
\[ \pi^* = \{\mu^*_0, \mu^*_1, \dots, \mu^*_{p-1}, \mu^*_0, \mu^*_1, \dots, \mu^*_{p-1}, \dots\} \]
which is optimal. In order to obtain the solution we argue as follows: let us assume that the solution is of the same form as the one for the general quadratic problem. In particular, assume that
\[ \tilde J(x, i) = x'K_ix + c_i, \]
where $c_i$ is a constant and $K_i$ is positive definite. This is justified by applying the successive approximation method and observing that the sets
\[ U_k(x_i, \lambda, i) = \big\{ u_i \in \Re^m \ \big|\ x'Qx + u_i'Ru_i + \alpha(Ax + Bu_i)'K^k_{i+1}(Ax + Bu_i) \le \lambda \big\} \]
are compact. The latter claim can be seen from the fact that $R > 0$ and $K^k_{i+1} \ge 0$. Then by Proposition 7.7, $\lim_{k\to\infty} J_k(x_i, i) = \tilde J(x_i, i)$, and the form of the solution obtained from successive approximation is as described above.
In particular, we have for $0 \le i \le p-1$
\[
\tilde J(x,i) = \min_{u_i\in U(x_i)} E_{w_i}\big\{ x'Q_ix + u_i(x)'R_iu_i(x) + \alpha\tilde J(A_ix + B_iu_i + w_i,\ i+1) \big\}
\]
\[
= \min_{u_i\in U(x_i)} E_{w_i}\big\{ x'Q_ix + u_i(x)'R_iu_i(x) + \alpha\big[(A_ix + B_iu_i + w_i)'K_{i+1}(A_ix + B_iu_i + w_i) + c_{i+1}\big] \big\}
\]
\[
= \min_{u_i\in U(x_i)} E_{w_i}\big\{ x'(Q_i + \alpha A_i'K_{i+1}A_i)x + u_i'(R_i + \alpha B_i'K_{i+1}B_i)u_i + 2\alpha x'A_i'K_{i+1}B_iu_i + 2\alpha w_i'K_{i+1}B_iu_i + 2\alpha x'A_i'K_{i+1}w_i + \alpha w_i'K_{i+1}w_i + \alpha c_{i+1} \big\}
\]
\[
= \min_{u_i\in U(x_i)} \big\{ x'(Q_i + \alpha A_i'K_{i+1}A_i)x + u_i'(R_i + \alpha B_i'K_{i+1}B_i)u_i + 2\alpha x'A_i'K_{i+1}B_iu_i + \alpha E\{w_i'K_{i+1}w_i\} + \alpha c_{i+1} \big\},
\]
where we have taken into consideration the fact that $E(w_i) = 0$. Minimizing the above quantity gives
\[ u^*_i = -\alpha(R_i + \alpha B_i'K_{i+1}B_i)^{-1}B_i'K_{i+1}A_i\,x. \tag{2} \]
Thus
\[ \tilde J(x,i) = x'\big[\, Q_i + A_i'\big(\alpha K_{i+1} - \alpha^2K_{i+1}B_i(R_i + \alpha B_i'K_{i+1}B_i)^{-1}B_i'K_{i+1}\big)A_i \,\big]x + c_i = x'K_ix + c_i, \]
where
\[ c_i = \alpha E_{w_i}\{w_i'K_{i+1}w_i\} + \alpha c_{i+1} \]
and
\[ K_i = Q_i + A_i'\big(\alpha K_{i+1} - \alpha^2K_{i+1}B_i(R_i + \alpha B_i'K_{i+1}B_i)^{-1}B_i'K_{i+1}\big)A_i. \]
Now, for this solution to be consistent we must have $K_p = K_0$. This leads to the following system of equations:
\[ K_0 = Q_0 + A_0'\big(\alpha K_1 - \alpha^2K_1B_0(R_0 + \alpha B_0'K_1B_0)^{-1}B_0'K_1\big)A_0, \]
\[ \vdots \]
\[ K_i = Q_i + A_i'\big(\alpha K_{i+1} - \alpha^2K_{i+1}B_i(R_i + \alpha B_i'K_{i+1}B_i)^{-1}B_i'K_{i+1}\big)A_i, \]
\[ \vdots \]
\[ K_{p-1} = Q_{p-1} + A_{p-1}'\big(\alpha K_0 - \alpha^2K_0B_{p-1}(R_{p-1} + \alpha B_{p-1}'K_0B_{p-1})^{-1}B_{p-1}'K_0\big)A_{p-1}. \tag{3} \]
This system of equations has a positive definite solution since (from the description of the problem) the system is controllable, i.e., there exists a sequence of controls $\{u_0, \dots, u_r\}$ such that $x_{r+1} = 0$. Thus the result follows.
3.16
(a) Consider the stationary policy $\{\mu_0, \mu_0, \dots\}$, where $\mu_0(x) = L_0x$. We have
\[ J_0(x) = 0, \]
\[ T_{\mu_0}(J_0)(x) = x'Qx + x'L_0'RL_0x, \]
\[ T^2_{\mu_0}(J_0)(x) = x'Qx + x'L_0'RL_0x + \alpha E\big\{(Ax + BL_0x + w)'Q(Ax + BL_0x + w)\big\} = x'M_1x + \text{constant}, \]
where $M_1 = Q + L_0'RL_0 + \alpha(A + BL_0)'Q(A + BL_0)$,
\[ T^3_{\mu_0}(J_0)(x) = x'Qx + x'L_0'RL_0x + \alpha E\big\{(Ax + BL_0x + w)'M_1(Ax + BL_0x + w)\big\} + \alpha\,(\text{constant}) = x'M_2x + \text{constant}. \]
Continuing similarly, we get
\[ M_{k+1} = Q + L_0'RL_0 + \alpha(A + BL_0)'M_k(A + BL_0). \]
Using a very similar analysis as in Section 8.2, we get $M_k \to K_0$, where
\[ K_0 = Q + L_0'RL_0 + \alpha(A + BL_0)'K_0(A + BL_0). \]
(b)
\[ J_{\mu_1}(x) = \lim_{N\to\infty} \mathop{E}_{\substack{w_k\\ k=0,\dots,N-1}} \Big\{ \sum_{k=0}^{N-1}\alpha^k\big( x_k'Qx_k + \mu_1(x_k)'R\mu_1(x_k) \big) \Big\} = \lim_{N\to\infty} T^N_{\mu_1}(J_0)(x). \]
Proceeding as in the proof of the validity of policy iteration (Section 7.3, Chapter 7), we have
\[ T_{\mu_1}(J_{\mu_0}) = T(J_{\mu_0}), \]
\[ J_{\mu_0}(x) = x'K_0x + \text{constant} = T_{\mu_0}(J_{\mu_0})(x) \ge T_{\mu_1}(J_{\mu_0})(x). \]
Hence, we obtain
\[ J_{\mu_0}(x) \ge T_{\mu_1}(J_{\mu_0})(x) \ge \cdots \ge T^k_{\mu_1}(J_{\mu_0})(x) \ge \cdots, \]
implying
\[ J_{\mu_0}(x) \ge \lim_{k\to\infty} T^k_{\mu_1}(J_{\mu_0})(x) = J_{\mu_1}(x). \]
(c) As in part (b), we show that
\[ J_{\mu_k}(x) = x'K_kx + \text{constant} \le J_{\mu_{k-1}}(x). \]
Now since
\[ 0 \le x'K_kx \le x'K_{k-1}x, \qquad \forall\, x, \]
we have $K_k \to K$. The form of $K$ is
\[ K = \alpha(A + BL)'K(A + BL) + Q + L'RL, \qquad L = -\alpha(\alpha B'KB + R)^{-1}B'KA. \]
To show that $K$ is indeed the optimal cost matrix, we have to show that it satisfies
\[ K = A'\big[\alpha K - \alpha^2KB(\alpha B'KB + R)^{-1}B'K\big]A + Q = \alpha A'[KA + KBL] + Q. \]
Let us expand the formula for $K$, using the formula for $L$:
\[ K = \alpha(A'KA + A'KBL + L'B'KA + L'B'KBL) + Q + L'RL. \]
Substituting $L'(R + \alpha B'KB)L = -\alpha L'B'KA$, we get
\[ K = \alpha(A'KA + A'KBL + L'B'KA) + Q - \alpha L'B'KA = \alpha A'KA + \alpha A'KBL + Q. \]
Thus $K$ is the optimal cost matrix.
A second approach: (a) We know that
\[ J_{\mu_0}(x) = \lim_{n\to\infty} T^n_{\mu_0}(J_0)(x). \]
Following the analysis of Section 8.1 we have
\[ J_0(x) = 0, \]
\[ T_{\mu_0}(J)(x) = E\{x'Qx + \mu_0(x)'R\mu_0(x)\} = x'Qx + \mu_0(x)'R\mu_0(x) = x'(Q + L_0'RL_0)x, \]
\[ T^2_{\mu_0}(J)(x) = E\big\{ x'Qx + \mu_0(x)'R\mu_0(x) + \alpha(Ax + B\mu_0(x) + w)'Q(Ax + B\mu_0(x) + w) \big\}
= x'\big( Q + L_0'RL_0 + \alpha(A + BL_0)'Q(A + BL_0) \big)x + \alpha E\{w'Qw\}. \]
Define
\[ K^0_0 = Q, \qquad K^{k+1}_0 = Q + L_0'RL_0 + \alpha(A + BL_0)'K^k_0(A + BL_0). \]
Then
\[ T^{k+1}_{\mu_0}(J)(x) = x'K^{k+1}_0x + \sum_{m=0}^{k-1}\alpha^{k-m}E\{w'K^m_0w\}. \]
The convergence of $K^{k+1}_0$ follows from the analysis of Section 4.1. Thus
\[ J_{\mu_0}(x) = x'K_0x + \frac{\alpha}{1-\alpha}E\{w'K_0w\} \]
(as in 8.1) which proves the required relation.
(b) Let $\mu_1(x)$ be the solution of
\[ \min_u \big\{ u'Ru + \alpha(Ax + Bu)'K_0(Ax + Bu) \big\}, \]
which yields
\[ u_1 = -\alpha(R + \alpha B'K_0B)^{-1}B'K_0A\,x = L_1x. \]
Thus
\[ L_1 = -\alpha(R + \alpha B'K_0B)^{-1}B'K_0A = -\alpha M^{-1}\Pi, \]
where $M = R + \alpha B'K_0B$ and $\Pi = B'K_0A$. Let us consider the cost associated with $u_1$ if we ignore $w$:
\[ J_{\mu_1}(x) = \sum_{k=0}^\infty \alpha^k\big( x_k'Qx_k + \mu_1(x_k)'R\mu_1(x_k) \big) = \sum_{k=0}^\infty \alpha^k x_k'(Q + L_1'RL_1)x_k. \]
However, we know that
\[ x_{k+1} = (A + BL_1)^{k+1}x_0 + \sum_{m=1}^{k+1}(A + BL_1)^{k+1-m}w_m. \]
Thus, if we ignore the disturbance $w$, we get
\[ J_{\mu_1}(x) = x_0'\sum_{k=0}^\infty \alpha^k\big((A + BL_1)'\big)^k(Q + L_1'RL_1)(A + BL_1)^k\,x_0. \]
Let us call
\[ K_1 = \sum_{k=0}^\infty \alpha^k\big((A + BL_1)'\big)^k(Q + L_1'RL_1)(A + BL_1)^k. \tag{1} \]
We know that
\[ Q = K_0 - \alpha(A + BL_0)'K_0(A + BL_0) - L_0'RL_0. \]
Substituting in (1), we have
\[ K_1 = \sum_{k=0}^\infty \alpha^k\big((A+BL_1)'\big)^k\big( K_0 - \alpha(A+BL_1)'K_0(A+BL_1) \big)(A+BL_1)^k \]
\[ \qquad + \sum_{k=0}^\infty \alpha^k\big((A+BL_1)'\big)^k\big[\, \alpha(A+BL_1)'K_0(A+BL_1) - \alpha(A+BL_0)'K_0(A+BL_0) + L_1'RL_1 - L_0'RL_0 \,\big](A+BL_1)^k. \]
However, we know that
\[ K_0 = \sum_{k=0}^\infty \alpha^k\big((A+BL_1)'\big)^k\big( K_0 - \alpha(A+BL_1)'K_0(A+BL_1) \big)(A+BL_1)^k. \]
Thus we conclude that
\[ K_1 - K_0 = \sum_{k=0}^\infty \alpha^k\big((A+BL_1)'\big)^k\,\Delta\,(A+BL_1)^k, \]
where
\[ \Delta = \alpha(A+BL_1)'K_0(A+BL_1) - \alpha(A+BL_0)'K_0(A+BL_0) + L_1'RL_1 - L_0'RL_0. \]
We manipulate the above equation further and obtain
\[ \Delta = L_1'ML_1 - L_0'ML_0 + \alpha L_1'\Pi + \alpha\Pi'L_1 - \alpha L_0'\Pi - \alpha\Pi'L_0 \]
\[ = -(L_0 - L_1)'M(L_0 - L_1) - (\alpha\Pi + ML_1)'(L_0 - L_1) - (L_0 - L_1)'(\alpha\Pi + ML_1). \]
However, it is seen that
\[ \alpha\Pi + ML_1 = 0. \]
Thus
\[ \Delta = -(L_0 - L_1)'M(L_0 - L_1). \]
Since $M > 0$, we conclude that
\[ K_0 - K_1 = \sum_{k=0}^\infty \alpha^k\big((A+BL_1)'\big)^k(L_0 - L_1)'M(L_0 - L_1)(A+BL_1)^k \ge 0. \]
Similarly, the optimal solution for the case where there are no disturbances satisfies the equation
\[ K = Q + L'RL + \alpha(A + BL)'K(A + BL), \qquad L = -\alpha(R + \alpha B'KB)^{-1}B'KA, \]
and by a similar argument
\[ K_1 - K = \sum_{k=0}^\infty \alpha^k\big((A+BL_1)'\big)^k(L_1 - L)'\bar M(L_1 - L)(A+BL_1)^k \ge 0, \]
where $\bar M = R + \alpha B'KB$. Thus $K \le K_1 \le K_0$. Since $K_1$ is bounded, we conclude that $A + BL_1$ is stable (otherwise $K_1 \to \infty$). Thus, the sum converges and $K_1$ is the solution of
\[ K_1 = \alpha(A + BL_1)'K_1(A + BL_1) + Q + L_1'RL_1. \]
Now, returning to the case with the disturbances $w$, we conclude as in case (a) that
\[ J_{\mu_1}(x) = x'K_1x + \frac{\alpha}{1-\alpha}E\{w'K_1w\}. \]
Since $K_1 \le K_0$, we conclude that $J_{\mu_1}(x) \le J_{\mu_0}(x)$, which proves the result.
(c) The policy iteration is defined as follows: let
\[ L_k = -\alpha(R + \alpha B'K_{k-1}B)^{-1}B'K_{k-1}A. \]
Then $\mu_k(x) = L_kx$ and
\[ J_{\mu_k}(x) = x'K_kx + \frac{\alpha}{1-\alpha}E\{w'K_kw\}, \]
where $K_k$ is obtained as the solution of
\[ K_k = \alpha(A + BL_k)'K_k(A + BL_k) + Q + L_k'RL_k. \tag{2} \]
If we follow the steps of (b), we can prove that
\[ K \le K_k \le \cdots \le K_1 \le K_0. \]
Thus by the theorem of monotone convergence of positive operators (Kantorovich and Akilov, *Functional Analysis in Normed Spaces*, p. 189) we conclude that
\[ K_\infty = \lim_{k\to\infty} K_k \]
exists. Then if we take the limit in both sides of Eq. (2), we have
\[ K_\infty = \alpha(A + BL_\infty)'K_\infty(A + BL_\infty) + Q + L_\infty'RL_\infty, \]
with
\[ L_\infty = -\alpha(R + \alpha B'K_\infty B)^{-1}B'K_\infty A. \]
However, according to Section 4.1, $K$ is the unique solution of the above equation. Thus $K_\infty = K$, and
the result follows.
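The policy iteration of this exercise can be traced on an assumed scalar example; each $K_k$ solves the (here scalar) policy-evaluation equation (2), and the sequence decreases monotonically to the Riccati fixed point:

```python
alpha = 0.9
A, B, Q, R = 1.2, 1.0, 1.0, 1.0          # scalar system, assumed for illustration
L = -1.0                                  # initial stabilizing gain: alpha*(A+B*L)^2 < 1

def evaluate(L):
    # scalar policy evaluation: K = Q + L'RL + alpha*(A+BL)'K(A+BL)
    return (Q + R * L * L) / (1.0 - alpha * (A + B * L) ** 2)

K = evaluate(L)
Ks = []
for _ in range(30):
    L = -alpha * B * K * A / (R + alpha * B * K * B)   # policy improvement
    K = evaluate(L)                                     # policy evaluation
    Ks.append(K)

# monotone nonincreasing costs ...
assert all(k2 <= k1 + 1e-12 for k1, k2 in zip(Ks, Ks[1:]))
# ... converging to the Riccati fixed point
residual = K - (Q + alpha * A * K * A
                - alpha**2 * (A * B * K) ** 2 / (R + alpha * B * K * B))
assert abs(residual) < 1e-9
```

Starting from any stabilizing gain, the iterates satisfy $K \le \cdots \le K_1 \le K_0$ exactly as in part (c).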
Solutions Vol. II, Chapter 4
4.4
(a) We have
\[ T^{k+1}h_0 = T\big(T^kh_0\big) = T\Big( h^k_i + \big(T^kh_0\big)(i)\,e \Big) = Th^k_i + \big(T^kh_0\big)(i)\,e. \]
The $i$th component of this equation yields
\[ \big(T^{k+1}h_0\big)(i) = \big(Th^k_i\big)(i) + \big(T^kh_0\big)(i). \]
Subtracting these two relations, we obtain
\[ T^{k+1}h_0 - \big(T^{k+1}h_0\big)(i)\,e = Th^k_i - \big(Th^k_i\big)(i)\,e, \]
from which
\[ h^{k+1}_i = Th^k_i - \big(Th^k_i\big)(i)\,e. \]
Similarly, we have
\[ T^{k+1}h_0 = T\big(T^kh_0\big) = T\Big( \bar h^k + \frac{1}{n}\sum_{i=1}^n\big(T^kh_0\big)(i)\,e \Big) = T\bar h^k + \frac{1}{n}\sum_{i=1}^n\big(T^kh_0\big)(i)\,e. \]
From this equation, we obtain
\[ \frac{1}{n}\sum_{i=1}^n\big(T^{k+1}h_0\big)(i) = \frac{1}{n}\sum_{i=1}^n\big(T\bar h^k\big)(i) + \frac{1}{n}\sum_{i=1}^n\big(T^kh_0\big)(i). \]
By subtracting these two relations, we obtain
\[ \bar h^{k+1} = T\bar h^k - \frac{1}{n}\sum_{i=1}^n\big(T\bar h^k\big)(i)\,e. \]
The proof for $\tilde h^k$ is similar.
(b) We have
\[ \bar h^k = T^kh_0 - \Big( \frac{1}{n}\sum_{i}\big(T^kh_0\big)(i) \Big)e = \frac{1}{n}\sum_{i=1}^n h^k_i. \]
So since $h^k_i$ converges, the same is true for $\bar h^k$. Also,
\[ \tilde h^k = T^kh_0 - \min_i\big(T^kh_0\big)(i)\,e \]
and
\[ \tilde h^k(j) = \big(T^kh_0\big)(j) - \min_i\big(T^kh_0\big)(i) = \max_i\Big[ \big(T^kh_0\big)(j) - \big(T^kh_0\big)(i) \Big] = \max_i h^k_i(j). \]
Since $h^k_i$ converges, the same is true for $\tilde h^k$.
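The recursion $h^{k+1}_i = Th^k_i - (Th^k_i)(i)\,e$ of part (a) can be run on a random unichain MDP (an assumed instance, with all transition probabilities positive so the iteration converges); the subtracted offset converges to the optimal average cost $\lambda$, and the limit pair satisfies $\lambda e + h = T(h)$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, nu = 4, 3
P = 0.1 + rng.random((nu, n, n))
P /= P.sum(axis=2, keepdims=True)        # all entries positive -> unichain, aperiodic
g = rng.random((nu, n))

def T(h):
    return (g + P @ h).min(axis=0)       # average-cost DP operator (no discounting)

h = np.zeros(n)
lam = 0.0
for _ in range(5000):
    Th = T(h)
    lam = Th[0]                          # offset (T h)(i) with fixed state i = 0
    h = Th - lam

# at convergence: lambda*e + h = T(h), the average-cost Bellman equation
assert np.max(np.abs(lam + h - T(h))) < 1e-8
assert abs(h[0]) < 1e-12                 # the fixed component is pinned to zero
```

The same experiment works with the mean- or min-offset variants $\bar h^k$ and $\tilde h^k$, which differ from $h^k_i$ only by the constant being subtracted.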
4.8
Bellman's equation for the auxiliary $(1-\beta)$-discounted problem is as follows:
\[ \tilde J(i) = \min_{u\in U(i)}\Big[\, g(i,u) + (1-\beta)\sum_j \tilde p_{ij}(u)\tilde J(j) \Big]. \tag{1} \]
Using the definition of $\tilde p_{ij}(u)$, we obtain
\[ \sum_j \tilde p_{ij}(u)\tilde J(j) = \sum_{j\ne t}(1-\beta)^{-1}p_{ij}(u)\tilde J(j) + (1-\beta)^{-1}\big(p_{it}(u) - \beta\big)\tilde J(t), \]
or
\[ \sum_j \tilde p_{ij}(u)\tilde J(j) = \sum_j(1-\beta)^{-1}p_{ij}(u)\tilde J(j) - (1-\beta)^{-1}\beta\,\tilde J(t). \]
This together with (1) leads to
\[ \tilde J(i) = \min_{u\in U(i)}\Big[\, g(i,u) + \sum_j p_{ij}(u)\tilde J(j) - \beta\tilde J(t) \Big], \]
or, equivalently,
\[ \beta\tilde J(t) + \tilde J(i) = \min_{u\in U(i)}\Big[\, g(i,u) + \sum_j p_{ij}(u)\tilde J(j) \Big]. \tag{2} \]
Returning to the problem of minimizing the average cost per stage, we notice that we have to solve the equation
\[ \lambda + h(i) = \min_{u\in U(i)}\Big[\, g(i,u) + \sum_j p_{ij}(u)h(j) \Big]. \tag{3} \]
Using (2), it follows that (3) is satisfied for $\lambda = \beta\tilde J(t)$ and $h(i) = \tilde J(i)$ for all $i$. Thus, by Proposition 2.1, we conclude that $\beta\tilde J(t)$ is the optimal average cost and $\tilde J(i)$ is a corresponding differential cost at state $i$.