
Variational methods for Reinforcement Learning

Thomas Furmston David Barber


Computer Science Department, University College London, London WC1E 6BT, UK.

Appearing in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP 9. Copyright 2010 by the authors.

Abstract

We consider reinforcement learning as solving a Markov decision process with unknown transition distribution. Based on interaction with the environment, an estimate of the transition matrix is obtained from which the optimal decision policy is formed. The classical maximum likelihood point estimate of the transition model does not reflect the uncertainty in the estimate of the transition model and the resulting policies may consequently lack a sufficient degree of exploration. We consider a Bayesian alternative that maintains a distribution over the transition so that the resulting policy takes into account the limited experience of the environment. The resulting algorithm is formally intractable and we discuss two approximate solution methods, Variational Bayes and Expectation Propagation.

1 Introduction

Reinforcement Learning (RL) is the problem of learning to act optimally through interaction and simulation in an unknown environment (Sutton and Barto, 1998) and may be applied to sequential decision problems where the underlying dynamics of the environment is unknown, for example helicopter control (Abbeel et al., 2007), the cart-pole problem (Rasmussen and Deisenroth, 2008) and elevator scheduling (Crites and Barto, 1995). We assume a model-based approach for which we need to estimate the parameters of the transition model based on limited interaction with the environment. A classical approach to learning an environment model is to use a point estimator, such as the maximum likelihood estimator. However, these may result in myopic policies since only the known observed transitions are assumed possible. As an alternative, we describe a Bayesian approach in which a prior distribution is placed over the environment model and updated as data from the environment is received. This environment distribution maintains the possibility of transitions to parts of the space that have not yet been observed but nevertheless may prove rewarding. The optimal policy is then obtained by integrating over all possible environment models. To deal with the difficulties of carrying out this integral we discuss two approximate methods, Variational Bayes (VB) (see for example (Beal and Ghahramani, 2003)) and Expectation Propagation (EP) (Wainwright and Jordan, 2008; Minka, 2001). For simplicity of exposition, we assume throughout that the reward model is known, but that the transition model needs to be learned from experience. Extending the approach to an unknown reward model is essentially straightforward.

2 Variational MDPs

An MDP can be described by an initial state distribution p_1(s_1), transition distributions p(s_{t+1}|s_t, a_t), and a reward function r_t(s_t, a_t), where the state and action at time t are denoted by s_t and a_t respectively. For a discount factor γ the reward is defined as r_t(s_t, a_t) = γ^{t-1} r(s_t, a_t) for a stationary reward r(s_t, a_t). We assume a stationary policy, π, defined as a set of conditional distributions over the action space, π_{a,s} = p(a_t = a | s_t = s, π). (More generally, one may consider policies which depend on the belief, π_{a,s,D} = p(a | s, p(θ|D), π), similar to the encoding of RL as a POMDP (Duff, 2002), though we leave this case for future study.) The total expected reward of the MDP (the policy utility) is

    U(π) = Σ_{t=1}^{H} Σ_{s_t, a_t} r_t(s_t, a_t) p(s_t, a_t | π)    (1)

where H is the horizon, which can be either finite or infinite, and p(s_t, a_t | π) is the marginal of the joint state-action trajectory distribution

    p(s_{1:H}, a_{1:H} | π) = p(a_H | s_H, π) p_1(s_1) Π_{t=1}^{H-1} p(s_{t+1} | s_t, a_t) p(a_t | s_t, π).    (2)

In this paper we consider the episodic case, so that the horizon is finite.
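As a concrete illustration of (1) and (2), the following is a minimal sketch of evaluating the policy utility for a tabular finite-horizon MDP by propagating the state marginals forward. It is not taken from the paper; the array layout and function name are our own.

```python
import numpy as np

def policy_utility(p1, P, R, pi, H, gamma=1.0):
    """Evaluate U(pi) of equation (1) for a tabular finite-horizon MDP.

    p1 : (S,)      initial state distribution p_1(s)
    P  : (S, A, S) transition tensor, P[s, a, s'] = p(s'|s, a)
    R  : (S, A)    stationary reward r(s, a)
    pi : (S, A)    policy, pi[s, a] = p(a|s)
    """
    U = 0.0
    f = p1.copy()                              # f[s] = p(s_t = s | pi)
    for t in range(1, H + 1):
        joint = f[:, None] * pi                # p(s_t = s, a_t = a | pi)
        U += gamma ** (t - 1) * np.sum(joint * R)
        f = np.einsum('sa,sap->p', joint, P)   # marginal of s_{t+1}, as in (2)
    return U
```

The same forward recursion reappears in the sketches further below.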


Graphically we can represent this using an influence diagram, figure 1.

Figure 1: RL represented as a model-based MDP transition and policy learning problem. Rewards depend on the current and past state and the past action, r_t(s_t, a_t). The policy p(a_t|s_t, π) determines the decision and the environment is modeled by the transition p(s_{t+1}|s_t, a_t). Based on a history of actions, states and reward, the task is to maximize the expected summed rewards with respect to the policy π. In the MDP setup, the state transition and utilities are known; in our RL setup we have a distribution over these quantities.

Given a transition model p(s_{t+1}|s_t, a_t), the MDP learning problem is to find a policy that maximizes (1). By expressing the utility (1) as the likelihood function of an appropriately constructed mixture model the MDP can be solved using techniques from probabilistic inference, such as EM (Toussaint et al., 2006) or MCMC (Hoffman et al., 2008). We follow a construction equivalent to (Toussaint et al., 2006) but which has the advantage of not requiring auxiliary variables, see e.g. (Dayan and Hinton, 1997; Kober and Peters, 2009; Furmston and Barber, 2009). Without loss of generality, we assume the reward is non-negative and define the reward weighted path distribution

    p(s_{1:t}, a_{1:t}, t | π) = r_t(s_t, a_t) p(s_{1:t}, a_{1:t} | π) / U(π)    (3)

This distribution is properly normalised, as can be seen from (1) and (2). We now define a variational distribution q(s_{1:t}, a_{1:t}, t), and take the Kullback-Leibler divergence between the q-distribution and (3). Since

    KL( q(s_{1:t}, a_{1:t}, t) || p(s_{1:t}, a_{1:t}, t | π) ) ≥ 0    (4)

we obtain a lower bound on the log utility

    log U(π) ≥ H(q(s_{1:t}, a_{1:t}, t)) + ⟨log p(s_{1:t}, a_{1:t}, t | π)⟩_q    (5)

where ⟨·⟩_q denotes the average w.r.t. q(s_{1:t}, a_{1:t}, t) and H(·) is the entropy function. An EM algorithm can be obtained from the bound in (5) by iterative coordinate-wise maximisation:

E-step  For fixed π^old find the best q that maximises the r.h.s. of (5). For no constraint on q, this gives q = p(s_{1:t}, a_{1:t}, t | π^old).

M-step  For fixed q find the best π that maximises the r.h.s. of (5). This is equivalent to maximising the energy ⟨log p(s_{1:t}, a_{1:t}, t | π)⟩_q w.r.t. π.

Maximisation of the energy term w.r.t. π, under the constraint that the policy is a distribution, gives

    π^new_{a,s} ∝ Σ_{t=1}^{H} Σ_{τ=1}^{t} q(s_τ = s, a_τ = a, t)    (6)

For this M-step the required marginals of the q-distribution can be calculated in linear time using message passing since the distribution is chain structured (Wainwright and Jordan, 2008). The EM algorithm is run until the policy converges to a (possibly local) optimum.
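To make the E- and M-steps concrete, here is a minimal sketch of one EM sweep for a tabular MDP with known transitions. Collapsing the double sum over t and τ in (6) gives π^new_{a,s} ∝ Σ_τ f_τ(s) π^old(a|s) Q_τ(s, a), where f_τ is the forward state marginal and Q_τ the expected remaining discounted reward; this reformulation, and the function and array names, are ours rather than the paper's.

```python
import numpy as np

def em_policy_update(p1, P, R, pi, H, gamma=1.0):
    """One policy update (6) for a tabular finite-horizon MDP with known P."""
    # Backward recursion: Q[tau][s, a] = expected remaining discounted reward
    # when taking action a in state s at time tau and following pi afterwards.
    Q = [None] * (H + 1)
    Q[H] = gamma ** (H - 1) * R
    for tau in range(H - 1, 0, -1):
        V_next = np.sum(pi * Q[tau + 1], axis=1)                  # value of s' at time tau+1
        Q[tau] = gamma ** (tau - 1) * R + np.einsum('sap,p->sa', P, V_next)
    # Forward recursion for the state marginals f_tau(s) under the old policy.
    f = p1.copy()
    pi_new = np.zeros_like(pi)
    for tau in range(1, H + 1):
        pi_new += f[:, None] * pi * Q[tau]                        # accumulate (6)
        f = np.einsum('s,sa,sap->p', f, pi, P)
    norm = pi_new.sum(axis=1, keepdims=True)
    safe = np.where(norm > 0, norm, 1.0)
    return np.where(norm > 0, pi_new / safe, pi)                  # unreached states keep the old policy
```

Iterating this update until the policy stops changing corresponds, in sketch form, to the EM solver used as the ML EM baseline in Section 6.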


3 Variational Reinforcement Learning

In the RL problem we assume the transition distributions formed from θ^{s'}_{s,a} = p(s'|s, a) are unknown and need to be estimated on the basis of interaction with the environment. These interactions are observed transitions D = {(s_n, a_n) → s_{n+1}, n = 1, ..., N}. A classical approach is to use a point estimate of the transition model, such as the maximum likelihood (ML) estimator. However, for small amounts of observed transitions, these estimators harshly assume that unobserved transitions will simply never occur. Such an over-confident estimate can adversely affect the overall policy solution and result in myopic policies that are unaware of potentially beneficial state-action pairs. Whilst this over-confidence can be ameliorated by adding pseudo-counts, this still does not reflect the uncertainty in the estimate of the transition.

We propose an alternative Bayesian solution that maintains a distribution over transitions. The posterior of θ is formed from Bayes' rule

    p(θ | D) ∝ p(D | θ) p(θ).    (7)

As θ is a set of independent categorical distributions a natural conjugate prior p(θ) is the product of independent Dirichlet distributions, i.e.

    p(θ) = Π_{s,a} Dir(θ_{s,a} | α_{s,a})    (8)

where α are hyper-parameters. This gives a posterior

    p(θ | D) = Π_{s,a} Dir(θ_{s,a} | c_{s,a} + α_{s,a})    (9)

where c is the count of observed transitions:

    c^{s'}_{s,a} = Σ_{n=1}^{N} I[s_n = s, a_n = a, s_{n+1} = s'].    (10)

The task now is to find the policy that maximizes the expected utility given the environmental data

    U(π | D) = ∫ U(π | θ) p(θ | D) dθ    (11)

where U(π | θ) is given by (1) with transitions θ.

Our aim is to form an EM style approach to learning π. Assuming the reward is non-negative we construct a probability distribution for which the normalization constant is equal to (11). Consider the following unnormalised distribution defined over state-action paths and times t = 1, ..., H,

    p(s_{1:t}, a_{1:t}, t | θ, π) = r(s_t, a_t) p(s_{1:t}, a_{1:t} | θ, π)    (12)

where p(s_{1:t}, a_{1:t} | θ, π) is the marginal of (2) given the transitions θ. Using (12) we now define a joint distribution over state-action paths, times and transitions

    p(s_{1:t}, a_{1:t}, t, θ | π, D) = p(s_{1:t}, a_{1:t}, t | θ, π) p(θ | D) / U(π | D).    (13)

This distribution is properly normalised, which can be verified through use of (1) and (11). The Kullback-Leibler divergence between a variational distribution q(s_{1:t}, a_{1:t}, t, θ), and (13) gives the bound

    KL( q(s_{1:t}, a_{1:t}, t, θ) || p(s_{1:t}, a_{1:t}, t, θ | π, D) ) ≥ 0    (14)

from which we obtain

    log U(π | D) ≥ H(q(s_{1:t}, a_{1:t}, t, θ)) + ⟨log p(θ | D)⟩_q + ⟨log p(s_{1:t}, a_{1:t}, t | θ, π)⟩_q    (15)

where ⟨·⟩_q denotes the average w.r.t. q(s_{1:t}, a_{1:t}, t, θ). An EM algorithm for optimising the bound with respect to π is:

E-step  For fixed π^old find the best q that maximises the r.h.s. of (15). For no constraint on q, this gives q = p(s_{1:t}, a_{1:t}, t, θ | π^old, D).

M-step  For fixed q find the best π that maximises the r.h.s. of (15). This is equivalent to maximising the energy ⟨log p(s_{1:t}, a_{1:t}, t | θ, π)⟩_q w.r.t. π.

To perform the M-step we need the maximum of ⟨log p(s_{1:t}, a_{1:t}, t | θ, π)⟩_q w.r.t. π. As the policy is independent of the transitions this maximisation gives updates of the form

    π^new_{a,s} ∝ Σ_{t=1}^{H} Σ_{τ=1}^{t} q(s_τ = s, a_τ = a, t)    (16)

Calculating the policy update is now a matter of calculating the marginals of the q-distribution from the previous E-step. If no functional restriction is placed on the q-distribution then it will take the form of (13), where π will equal the policy of the previous M-step. However, examining the form of (13), the exact state-action marginals of this distribution are computationally intractable. This can be understood by first carrying out the integral over θ, which has the effect of coupling together all time slices of the path distribution p(s_{1:t}, a_{1:t}, t).

In the following we discuss two approaches to dealing with this intractability. The first, Variational Bayes, restricts the functional form of the q-distribution in the E-step such that the updates in the M-step become tractable. The second approximates the marginals of the q-distribution directly using Expectation Propagation.
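Before turning to those approximations, the objects involved can be made concrete. The posterior (9) is easy to maintain: its Dirichlet parameters are just the prior hyper-parameters plus the counts (10). And although the integral (11) has no convenient closed form, it can always be approximated by sampling transition models from this posterior and averaging exact utilities, which is essentially the idea behind the stochastic EM baseline of Section 6.2. A minimal sketch under the tabular conventions of the earlier example (function and array names are ours):

```python
import numpy as np

def dirichlet_posterior(D, S, A, alpha=1.0):
    """Posterior Dirichlet parameters of (9): prior alpha plus the counts (10)."""
    counts = np.zeros((S, A, S))
    for s, a, s_next in D:                 # D is a list of observed (s, a, s') triples
        counts[s, a, s_next] += 1.0
    return alpha + counts

def utility(p1, P, R, pi, H, gamma=1.0):
    """Forward evaluation of U(pi) in (1), as in the earlier sketch."""
    U, f = 0.0, p1.copy()
    for t in range(1, H + 1):
        joint = f[:, None] * pi
        U += gamma ** (t - 1) * np.sum(joint * R)
        f = np.einsum('sa,sap->p', joint, P)
    return U

def bayesian_utility_mc(p1, post, R, pi, H, n_samples=500, seed=0):
    """Monte Carlo estimate of (11): average U(pi|theta) over posterior draws."""
    rng = np.random.default_rng(seed)
    S, A, _ = post.shape
    draws = []
    for _ in range(n_samples):
        # One Dirichlet draw per (s, a) row gives a full transition model theta.
        theta = np.array([[rng.dirichlet(post[s, a]) for a in range(A)] for s in range(S)])
        draws.append(utility(p1, theta, R, pi, H))
    return np.mean(draws)
```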


4 Variational Bayes

To ensure computational tractability, a suitable restriction on the functional form of the q-distribution is to make the factorised approximation:

    q(s_{1:t}, a_{1:t}, t, θ) = q(s_{1:t}, a_{1:t}, t) q(θ).    (17)

This approximation maintains the lower bound in (15), which now takes the form

    log U(π | D) ≥ H(q_x) + H(q_θ) + ⟨log p(θ | D)⟩_{q_θ} + ⟨log p(s_{1:t}, a_{1:t}, t | θ, π)⟩_{q_θ q_x}    (18)

where we have used the notation q_θ ≡ q(θ) and q_x ≡ q(s_{1:t}, a_{1:t}, t). The variational Bayes procedure now iteratively maximizes (18) with respect to the distributions q_x and q_θ. Taking the functional derivative of (18) with respect to q_x and q_θ, whilst holding the other fixed, gives the following update equations:

    q(s_{1:t}, a_{1:t}, t) ∝ e^{⟨log p(s_{1:t}, a_{1:t}, t | θ, π)⟩_{q_θ}}    (19)

    q(θ) ∝ p(θ | α, D) e^{⟨log p(s_{1:t}, a_{1:t}, t | θ, π)⟩_{q_x}}    (20)

Expansion of the log p(s_{1:t}, a_{1:t}, t | θ, π) term in (19) shows that q(s_{1:t}, a_{1:t}, t) is proportional to

    r(s_t, a_t) π_{a_t,s_t} p_1(s_1) Π_{τ=1}^{t-1} π_{a_τ,s_τ} e^{⟨log θ^{s_{τ+1}}_{s_τ,a_τ}⟩_{q_θ}}.    (21)

This is the same form as the original MDP (1,2) with the transitions replaced with unnormalised transitions

    θ̃(s', s, a) ≡ e^{⟨log θ^{s'}_{s,a}⟩_{q_θ}}.    (22)

The averages of log θ in the exponent can be computed using standard digamma functions. Given q_θ, the marginals q(s_τ, a_τ, t) can then be calculated using message passing on the corresponding factor graph (Kschischang et al., 2001).

A similar calculation for the transition parameters gives the update

    q(θ) ∝ p(θ | α, D) e^{Σ_{t=1}^{H} Σ_{τ=1}^{t-1} ⟨log θ^{s_{τ+1}}_{s_τ,a_τ}⟩_{q_x}}.

The summation of the states and actions in the exponent means that we may write

    q(θ) = Π_{s,a} Dir(θ_{s,a} | α_{s,a} + c_{s,a} + r̂_{s,a})    (23)

where

    r̂^{s'}_{s,a} = Σ_{t} Σ_{τ} q(s_{τ+1} = s', s_τ = s, a_τ = a).    (24)

Equation (23) has an intuitive interpretation: for each triple (s', s, a) we have the prior α^{s'}_{s,a} term and the observed counts c^{s'}_{s,a} which deal with the posterior of the transitions. The term r̂^{s'}_{s,a} encodes an approximate expected reward obtained from starting in state s, taking action a, entering state s' and then following π afterwards. The posterior q(θ) is therefore a standard Dirichlet posterior on transitions but biased towards transitions that are likely to lead to higher expected reward. Under the approximation (17) the E-step consists of calculating the distributions (21) and (23). As these distributions are coupled we need to iterate them until convergence.

The form of the M-step is calculated by maximising the bound (18) with respect to π. This leads to the same updates as (16) except the q-distribution now takes the form of (21). A summary of VB-EM is given in algorithm (1).

Algorithm 1 VB EM Algorithm
  Input: policy π, reward r, prior α and transition counts c.
  repeat
    For fixed policy π
    repeat
      Calculate the q-marginals (21) and (23).
    until Convergence of the marginals.
    Update the policy according to (16).
  until Convergence of the policy.

4.1 Hierarchical Variational Bayes

So far we have assumed that the hyper-parameters, α, are fixed. However the quality of the policy learned can be strongly dependent on α. If the components of α are set too low any initial data points will dominate the transition posterior and the probability of unobserved transitions will be small. On the other hand if α is set too high an excessively large number of data points will be required to dilute the prior effect on the posterior. To overcome this problem we can extend the model by placing a prior distribution over α and then update the posterior as data from the environment is received. This extension is straightforward under the variational approximation q_x q_θ q_α. In our experiments we use the hyper-parameter distribution independently for each component of α:

    p(α) ∝ e^{-20(α-1)^2},

which has the effect of retaining significant posterior variance in the transition model, damping overly greedy exploitation.
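To summarise the VB quantities of this section concretely, the following sketch computes the unnormalised transitions (22) from the current Dirichlet parameters of q(θ), and the q(θ) update (23) from the prior, the counts and the reward-weighted statistics (24). It is a sketch only, with our own array conventions, and omits the chain message passing that produces the marginals r̂ of (24).

```python
import numpy as np
from scipy.special import digamma

def unnormalised_transitions(q_theta):
    """Equation (22): theta_tilde[s, a, s'] = exp(<log theta>_{q_theta}).

    q_theta : (S, A, S) Dirichlet parameters of q(theta), one row per (s, a).
    For Dir(alpha_hat), <log theta_j> = digamma(alpha_hat_j) - digamma(sum_j alpha_hat_j).
    """
    return np.exp(digamma(q_theta) - digamma(q_theta.sum(axis=-1, keepdims=True)))

def update_q_theta(alpha, counts, r_hat):
    """Equation (23): q(theta) is Dirichlet with parameters alpha + c + r_hat.

    alpha  : (S, A, S) prior hyper-parameters
    counts : (S, A, S) observed transition counts, equation (10)
    r_hat  : (S, A, S) reward-weighted pair marginals of q_x, equation (24)
    """
    return alpha + counts + r_hat
```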


5 Expectation Propagation

In order to implement the Variational Reinforcement Learning approach of Section 3 we require the marginals of the intractable distribution q = p(s_{1:t}, a_{1:t}, t, θ | π^old, D). As an alternative to the variational Bayes factorised approach we here consider an approximate message passing (AMP) approach that approximates the required marginals directly.

The graphical structure of q(s_{1:t}, a_{1:t}, θ, t) is loopy but sparse, so that a sum-product algorithm may provide reasonable approximate marginals, see figure 2. The messages for the factor graph version of the sum-product algorithm take the following form:

    μ_{x→f}(x) = Π_{h∈n(x)\{f}} μ_{h→x}(x)    (25)

    μ_{f→x}(x) = Σ_{X\{x}} f(X) Π_{y∈n(f)\{x}} μ_{y→f}(y)    (26)

where Σ_{X\{x}} means the sum over all variables except x, n(·) is the set of neighbouring nodes and X are the variables of the factor f. At convergence the singleton marginals are approximated by

    p(x) = Π_{f∈F_x} μ_{f→x}(x)    (27)

where F_x means the set of functions in the factor graph that depend on x.

Figure 2: A factor graph representation of q(s_{1:t}, a_{1:t}, t, θ) for transition factors T, reward factors R and policy factor π, for a H = 3 horizon. The square nodes represent the various factors (functions) of the distribution and the circle nodes represent the variables. The initial time has no transition. The t-th chain is the t-th row of this diagram for fixed θ.

As can be seen from (26) and (25), all the messages that involve the factors p_1, π, and R are trivial, requiring only summations of discrete functions. Also, as the factor node p(θ|D) is a leaf node this message is also trivial. However, the messages between θ and the transition factors T are intractable. To see this we examine a message from T to an action node a (we have dropped the time dependence on the factors and the variables to ease the notation):

    μ_{T→a}(a) = Σ_{s,s'} μ_{s→T}(s) μ_{s'→T}(s') ∫ dθ μ_{θ→T}(θ) θ^{s'}_{s,a}.    (28)

In order for (28) to be tractable we need μ_{θ→T}(θ) to be the product of independent Dirichlets. However, using (25) we have that μ_{θ→T}(θ) takes the form

    μ_{θ→T}(θ) = p(θ|D) Π_{T'≠T} μ_{T'→θ}(θ)    (29)

where μ_{T'→θ}(θ) is given by

    μ_{T'→θ}(θ) = Σ_{s',a,s} μ_{a→T'}(a) μ_{s→T'}(s) μ_{s'→T'}(s') θ^{s'}_{s,a}.    (30)

From (29) and (30), μ_{θ→T}(θ) is a mixture of Dirichlets where the number of mixture components is exponential in the planning horizon H. This makes messages such as (28) computationally intractable. Following the general approach outlined in (Minka, 2001), to make a tractable approximate implementation we therefore project the messages μ_{θ→T}(θ) to a product of independent Dirichlets by moment matching. Given the projection q(θ) we use (26) and (27) to obtain the approximate message

    μ_{T→θ}(θ) = q(θ) / ( p(θ|D) Π_{T'≠T} μ_{T'→θ}(θ) ).    (31)
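The projection in (31) requires collapsing a mixture of Dirichlets onto a product of independent Dirichlets. As an illustration, the sketch below projects a mixture onto a single Dirichlet over one transition row by matching the mixture mean and the aggregate second moment. This is only one simple choice of moments, not necessarily the one used in the paper; Minka (2001) discusses alternative projections based on expected sufficient statistics, and the function and variable names here are ours.

```python
import numpy as np

def project_dirichlet_mixture(weights, betas):
    """Project a mixture  sum_k w_k Dir(theta; beta_k)  onto a single Dir(theta; alpha).

    weights : (K,) mixture weights (need not be normalised)
    betas   : (K, S) Dirichlet parameters of the components
    Matches the mixture mean m_j = E[theta_j] and the aggregate second moment
    M2 = sum_j E[theta_j^2], then solves for the total concentration A with
    alpha_j = m_j * A.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    B = betas.sum(axis=1, keepdims=True)                         # (K, 1) component totals
    m = (w[:, None] * betas / B).sum(axis=0)                     # mixture mean, (S,)
    second = (w[:, None] * betas * (betas + 1) / (B * (B + 1))).sum(axis=0)
    M2 = second.sum()
    A = (1.0 - M2) / (M2 - np.sum(m ** 2))                       # total concentration
    return m * A                                                 # projected Dirichlet parameters
```

Applied independently to each (s, a) row, a projection of this kind provides the product-of-Dirichlets q(θ) that (31) requires.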

Given a message initialisation and a message-passing schedule S, the AMP algorithm can be summarized as in algorithm (2). For our experiments we used the schedule S outlined in algorithm (3).

Algorithm 2 AMP EM Algorithm
  Input: policy π, reward r, prior α, transition counts c and message-passing schedule S.
  repeat
    For fixed policy π
    repeat
      Perform message-passing according to S using EP to approximate messages μ_{T→θ}(θ).
    until Convergence of the messages.
    Update the policy according to (16).
  until Convergence of the policy.

Algorithm 3 AMP message-passing Schedule
  repeat
    for t = 1 to H do
      Perform message-passing along the t-th chain, q(s_1, a_1, ..., s_t, a_t), figure 2, holding all the messages μ_{T→θ}(θ) fixed.
    end for
    repeat
      for each μ_{T→θ}(θ) do
        Perform Expectation Propagation to obtain q(θ), then use (31) to update μ_{T→θ}(θ).
      end for
    until Convergence of all the messages μ_{T→θ}(θ).
  until Convergence of the q-distribution.

6 Experiments

6.1 Incorporation of uncertainty

The first experiment is designed to demonstrate that our objective function indeed incorporates uncertainty in the knowledge of the environment into the policy optimisation process. The experiment is performed on a problem small enough that for short horizons the objective function (11) and the EM update (16) can be calculated exactly. This allows for characteristics of the objective function to be gleaned without the complicating issue of approximations.

The experiment was performed on a toy two-state problem, with the transition and reward matrices given in figure 3. The horizon was set to H = 5 and the initial state is 1. The aim of the experiment is to compare the average total expected utility of the policies obtained from the Bayesian and point-based objective functions. The average is taken over the true transition model, θ^true, and we compare these averages for increasing numbers of observed transitions, N. We set the distribution over the true transition model to be uniform.

Figure 3: The transition and reward matrices for the two-state toy problem,

    T_i = ( θ_i  1-θ_i ; 1-θ_i  θ_i ),    R = ( 4  10 ; 1  1 )

(rows separated by semicolons). T_i represents the transition matrix from state s_i, where the columns correspond to actions and the rows correspond to the next state. The reward matrix R is defined so that the actions run along the rows and the states run along the columns.

Writing the quantities of interest down algebraically, we have for the Bayesian objective function

    E_{p(θ^true)} [ E_{p(D|θ^true,N)} [ U(π*_D | θ^true) ] ]
        = ∫ dθ^true dD  U(π*_D | θ^true) p(D | θ^true, N) p(θ^true)    (32)

where π*_D is the optimal policy of the Bayesian objective function. For the ML objective function we have

    E_{p(θ^true)} [ E_{p(π*_ML|θ^true,N)} [ U(π*_ML | θ^true) ] ]
        = ∫ dθ^true dπ*_ML  U(π*_ML | θ^true) p(π*_ML | θ^true, N) p(θ^true)    (33)

where similarly π*_ML is the optimal policy of the ML objective function.

As we can calculate the objective function U(π|D) exactly, we can also calculate (32) for reasonable values of N. It remains to calculate (33), where the difficult term is the probability distribution over the optimal policy, which we now detail.

The settings of the reward matrix and the horizon are such that, given (θ_1, θ_2) are known, the optimal action in state s_2 is a_1 for all values of θ_2. This means that when the transition dynamics are known the optimal policy can be given by a single parameter, π_{s_1,a_1}. In the experiment we set θ_1 = θ_2 = θ, so that π_{s_1,a_1} = 1 when θ < θ*, and π_{s_1,a_1} = 0 otherwise, where θ* = 0.7021. The fact that we know the point, θ*, at which the optimal policy of the MDP changes means that we can form a distribution over π^ML_{s_1,a_1}. Given the sample size and the true value of the transition parameter we have the distribution

    p(π^ML_{s_1,a_1} = 1 | N, θ^true) = Σ_{n ≤ N : n/N < θ*} B_{N,θ^true}(n)

where B_{N,θ^true} is the density function of the Binomial distribution with parameters (N, θ^true). Having obtained the distribution over the optimal policy it is now possible to calculate (33).
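This Binomial sum is simple to evaluate; a sketch (with θ* = 0.7021 taken from the text above and a function name of our own):

```python
from scipy.stats import binom

def prob_ml_policy_is_one(N, theta_true, theta_star=0.7021):
    """p(pi_ML_{s1,a1} = 1 | N, theta_true): the ML estimate n/N falls below theta*."""
    return sum(binom.pmf(n, N, theta_true) for n in range(N + 1) if n / N < theta_star)

# Example: with N = 10 samples and theta_true = 0.5, most of the Binomial mass
# lies below theta* = 0.7021, so the ML policy is usually the correct one.
```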


We calculated (32) and (33) for increasing values of N, the results of which are shown in figure 4. It can be observed that the Bayesian objective function consistently outperforms the point-based objective function. We expect a more dramatic difference in larger problems for which the amount of uncertainty in the transition parameters is greater.

Figure 4: The average total expected reward of the policies obtained from the Bayesian objective function, U(π|D), and the maximum likelihood objective function, U(π|θ^ML). The sample size is plotted against the average total expected reward.

It should be noted that while the point-based objective function will always produce a deterministic policy the Bayesian objective function can produce a stochastic policy. This naturally incorporates an explorative type behaviour into the policy that will lead to a reduction in the uncertainty in the environment.

6.2 The chain problem

We compare the EM RL algorithms on the standard chain benchmark RL problem (Dearden et al., 1998) which has 5 states each having 2 possible actions, as shown in figure 5. The initial state is 1 and every action is flipped with slip probability p_slip = 0.2, making the environment stochastic. The optimal policy is to travel down the chain towards state 5, which is achieved by always selecting action a.

Figure 5: The single-chain problem state-action transitions with rewards r(s_t, a_t). The initial state is state 1. There are two actions a, b, with each action being flipped with probability 0.2. (The edge labels in the figure give reward 0 for action a along the chain, reward 10 for action a at state s_5, and reward 2 for action b.)
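For reference, here is a minimal sketch of this chain as a tabular model in the format used in the earlier sketches. The slip and reward conventions follow our reading of figure 5 (action a earns 0 along the chain and 10 at state s_5, action b earns 2), so treat the details as assumptions rather than the paper's exact specification.

```python
import numpy as np

def chain_mdp(p_slip=0.2, n_states=5):
    """The 5-state chain of figure 5. Action 0 ('a') moves right, action 1 ('b')
    returns to state 1; each chosen action is flipped with probability p_slip."""
    S, A = n_states, 2
    P = np.zeros((S, A, S))
    R = np.zeros((S, A))
    for s in range(S):
        forward = min(s + 1, S - 1)            # intended effect of action a
        P[s, 0, forward] += 1 - p_slip         # action a succeeds
        P[s, 0, 0] += p_slip                   # action a slips to the 'b' outcome
        P[s, 1, 0] += 1 - p_slip               # action b returns to state 1
        P[s, 1, forward] += p_slip             # action b slips to the 'a' outcome
        R[s, 0] = 10.0 if s == S - 1 else 0.0  # reward for a: 10 at the final state
        R[s, 1] = 2.0                          # reward for b
    p1 = np.zeros(S)
    p1[0] = 1.0                                # initial state is state 1
    return p1, P, R
```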
In the experiments the total of 1000 time-steps is split into 10 episodes each of 100 time-steps. During each episode the policy and transition model are fixed, and the transitions and rewards from the RL environment are collated. At the end of each episode the policy and transition model are updated. All policies are initialised randomly from a uniform distribution. For the methods based on fixed hyper-parameters α, we set each component of α to 1.

Convergence of all MDP solvers was determined when the L1 norm of the policy between successive iterations is less than 0.01. The methods we compared are described below.

ML EM  The mean of θ is computed from the Dirichlet posterior p(θ|D, α). This is used as a point-based estimate of the transition model in the MDP EM algorithm of Section 2.

SEM  At the end of each episode we obtained an approximation to the optimal policy using sampling. We first draw samples θ_i, i = 1, ..., I from the posterior p(θ|D, α). For each sample θ_i we then compute the exact conditional marginals p(s_τ, a_τ, t | θ_i) by message passing on the chain. Averaging over the samples gives the Stochastic EM update

    π^new_{a,s} ∝ Σ_{i=1}^{I} Σ_{t=1}^{H} Σ_{τ=1}^{t} p(s_τ = s, a_τ = a, t | θ_i).

In the experiments we set I so that this method has roughly the same runtime as the AMP EM algorithm.

VB EM  At the end of each episode the approach described in Section 4 is used. The hyper-parameter α is fixed throughout to 1.

AMP EM  At the end of each episode, the approach described in Section 5 is run, which approximates the marginal statistics required for EM learning using Expectation Propagation.

HVB-EM  As for VB-EM but extended to the hyper-parameter distribution, as described in Section 4.1.


Figure 6: Results from the chain problem in figure 5 with average reward (1/t) Σ_{τ=1}^{t} r_τ plotted against time t. The plot shows the results for approximate message passing (light blue), hierarchical variational Bayes (purple), variational Bayes (red), stochastic EM (green) and the EM algorithm of Section 2 using the maximum likelihood estimator (dark blue). The results represent performance averaged over 100 runs of the experiment.

The results, averaged over 100 experiments, are shown in figure 6. The AMP and stochastic EM algorithms consistently outperform the ML EM algorithm. This is in agreement with our previous results and suggests that both of these algorithms are able to make reasonable approximations to the true marginals of the q-distribution. Despite the encouraging initial performance of the variational Bayes algorithms, the ML EM algorithm eventually performs better than both the fixed hyper-parameter and hierarchical VB variants. This suggests that the factorised approximation inherent in the VB leads to difficulties. One potential issue is that under the factorisation assumptions, the unnormalised transitions (22) have the form

    θ̃^{s'}_{s,a} = e^{ψ(α̂^{s'}_{s,a})} / e^{ψ(Σ_{s''} α̂^{s''}_{s,a})}    (34)

where ψ represents the digamma function and the α̂ are the Dirichlet parameters of q(θ). For Σ_{s''} α̂^{s''}_{s,a} < 1 the contributions of the first time points in the unnormalised distribution (21) exponentially dominate. As a result there is a bias towards the initial time-steps, forcing both of the variational Bayes algorithms to focus on only locally optimal policies. Finally we note that the prior on the hyper-parameters, α, was beneficial to the variational Bayes algorithm. This is unsurprising since it maintains posterior variance. We would expect a similar improvement in performance for a hierarchical Expectation Propagation approach.

In the variational Bayes algorithm the q-distributions had to be iterated around 15 times on average. The approximate message passing algorithm had to repeat the message-passing schedule around 10 times on average, where the Expectation Propagation section of the schedule had to be repeated around 2 times for convergence. Under the current implementation the variational Bayes algorithm is able to perform an EM step in approximately 0.15 seconds, while the approximate message passing algorithm takes approximately 5 seconds.

7 Conclusions

Framing Markov Decision Problems as inference in a related graphical model has been recently introduced and has the potential advantage that methods in approximate inference can be exploited to help overcome difficulties associated with classical MDP solvers in large-scale problems. In this work, we performed some groundwork theory that extends these techniques to the case of reinforcement learning in which the parameters of the MDP are unknown and need to be learned from experience. An exact implementation of such a Bayesian formulation of RL is formally intractable and we considered two approximate solutions, one based on variational Bayes, and the other on Expectation Propagation, our initial findings suggesting that the latter approach is to be generally preferred.

References

P. Abbeel, A. Coates, M. Quigley, and A. Ng. An Application of Reinforcement Learning to Aerobatic Helicopter Flight. NIPS, 19:1-8, 2007.

M. J. Beal and Z. Ghahramani. The Variational Bayesian EM Algorithm for Incomplete Data: with Application to Scoring Graphical Model Structures. In Bayesian Statistics, volume 7, pages 453-464. Oxford University Press, 2003.

R. Crites and A. Barto. Improving Elevator Performance Using Reinforcement Learning. NIPS, 8:1017-1023, 1995.

P. Dayan and G. E. Hinton. Using Expectation-Maximization for Reinforcement Learning. Neural Computation, 9:271-278, 1997.

R. Dearden, N. Friedman, and S. Russell. Bayesian Q-learning. AAAI, 15:761-768, 1998.

M. Duff. Optimal Learning: Computational Procedures for Bayes-Adaptive Markov Decision Processes. PhD thesis, University of Massachusetts Amherst, 2002.

T. Furmston and D. Barber. Solving deterministic policy (PO)MDPs using Expectation-Maximisation and Antifreeze. European Conference on Machine Learning (ECML), 1:50-65, 2009. Workshop on Learning and Data Mining for Robotics.

M. Hoffman, A. Doucet, N. de Freitas, and A. Jasra. Trans-dimensional MCMC for Bayesian Policy Learning. NIPS, 20:665-672, 2008.

J. Kober and J. Peters. Policy search for motor primitives in robotics. NIPS, 21:849-856, 2009.

F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47:498-519, 2001.

T. P. Minka. Expectation Propagation for approximate Bayesian inference. In UAI '01: Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, pages 362-369, 2001.

C. Rasmussen and M. Deisenroth. Probabilistic inference for fast learning in control. In S. Girgin, M. Loth, R. Munos, P. Preux, and D. Ryabko, editors, Recent Advances in Reinforcement Learning, pages 229-242, 2008.

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

M. Toussaint, S. Harmeling, and A. Storkey. Probabilistic inference for solving (PO)MDPs. Research Report EDI-INF-RR-0934, University of Edinburgh, School of Informatics, 2006.

M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.
