
Carnegie Mellon

School of Computer Science

Deep Reinforcement Learning and Control

Natural Policy Gradients, TRPO, PPO

CMU 10703

Katerina Fragkiadaki
Part of the slides adapted from John Schulman and Joshua Achiam
Stochastic policies

Continuous actions: the policy network with parameters θ outputs a mean and a standard deviation, usually of a multivariate Gaussian:

θ → μ_θ(s), σ_θ(s),    a ∼ N(μ_θ(s), σ_θ(s)²)

Discrete actions: almost always a categorical distribution:

θ → p_θ(s),    a ∼ Cat(p_θ(s))
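As a concrete sketch (not from the slides; PyTorch and the network sizes are assumptions made only for illustration), here is what these two policy heads typically look like, together with sampling an action and its log-probability:

```python
# A minimal sketch: policy heads for the two cases above, using torch.distributions.
import torch
import torch.nn as nn


class GaussianPolicy(nn.Module):
    """Continuous actions: a ~ N(mu_theta(s), sigma_theta(s)^2), diagonal covariance."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, act_dim)       # state-dependent mean
        self.log_std = nn.Linear(hidden, act_dim)  # state-dependent (log) std

    def dist(self, obs):
        h = self.body(obs)
        return torch.distributions.Normal(self.mu(h), self.log_std(h).exp())


class CategoricalPolicy(nn.Module):
    """Discrete actions: a ~ Cat(p_theta(s))."""

    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_actions))

    def dist(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))


# Sampling an action and its log-probability (both are needed for policy gradients):
obs = torch.randn(1, 4)
pi = GaussianPolicy(obs_dim=4, act_dim=2).dist(obs)
a = pi.sample()
logp = pi.log_prob(a).sum(-1)  # sum over action dimensions for a diagonal Gaussian
```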
What Loss to Optimize?

Policy gradients:

Monte Carlo policy gradient (REINFORCE), gradient direction: ĝ = Ê_t[ ∇_θ log π_θ(a_t | s_t) Â_t ]

Actor-critic policy gradient: ĝ = Ê_t[ ∇_θ log π_θ(a_t | s_t) A_w(s_t) ]

We can differentiate the following loss:

L^PG(θ) = Ê_t[ log π_θ(a_t | s_t) Â_t ]

Equivalently, we can differentiate the importance-sampled form

L^IS_{θ_old}(θ) = Ê_t[ (π_θ(a_t | s_t) / π_θold(a_t | s_t)) Â_t ]

at θ = θ_old, where state-actions are sampled using θ_old. This is just the chain rule: ∇_θ log f(θ)|_{θ_old} = ∇_θ f(θ)|_{θ_old} / f(θ_old). (A numerical check of this claim follows below.)

The vanilla policy gradient loop:
1. Collect trajectories for policy π_θ
2. Estimate advantages Â
3. Compute policy gradient ĝ
4. Update policy parameters θ_new = θ + ε ⋅ ĝ
5. GOTO 1

We don't want to optimize the policy too far from the one that collected the data: the update moves the output distribution from μ_θold(s), σ_θold(s) to μ_θnew(s), σ_θnew(s). This lecture is all about the stepsize.
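The following sketch (my own, with PyTorch assumed; the one-state categorical policy and the random advantages are placeholders) verifies numerically that the log-probability loss L^PG and the importance-ratio loss L^IS have identical gradients at θ = θ_old:

```python
# Sketch: the log-prob surrogate and the importance-ratio surrogate have the same
# gradient at theta = theta_old (the chain rule). The identity holds sample by sample.
import torch

torch.manual_seed(0)
theta_old = torch.randn(3)               # logits of the data-collecting policy
actions = torch.randint(0, 3, (256,))    # placeholder action batch
adv = torch.randn(256)                   # placeholder advantage estimates A_hat

def log_probs(theta):
    return torch.log_softmax(theta, dim=0)[actions]

theta = theta_old.clone().requires_grad_(True)

# L^PG(theta) = E_hat[ log pi_theta(a) * A_hat ]
g_pg, = torch.autograd.grad((log_probs(theta) * adv).mean(), theta)

# L^IS_{theta_old}(theta) = E_hat[ (pi_theta(a) / pi_theta_old(a)) * A_hat ]
ratio = (log_probs(theta) - log_probs(theta_old)).exp()
g_is, = torch.autograd.grad((ratio * adv).mean(), theta)

print(torch.allclose(g_pg, g_is, atol=1e-6))  # True: identical gradients at theta_old
```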
What is the underlying objective function?

Policy gradients:

ĝ ≈ (1/N) Σ_{i=1}^N Σ_{t=1}^T ∇_θ log π_θ(a_t^(i) | s_t^(i)) A(s_t^(i), a_t^(i)),    τ_i ∼ π_θ

What is our objective? This gradient is the result of differentiating the objective function

J^PG(θ) = (1/N) Σ_{i=1}^N Σ_{t=1}^T log π_θ(a_t^(i) | s_t^(i)) A(s_t^(i), a_t^(i)),    τ_i ∼ π_θ
Is this really our objective? We cannot both maximize over a variable and sample from it. We also cannot optimize it too far: our advantage estimates come from samples of π_θold. However, this constraint of "do not move too far from θ_old" does not appear anywhere in the objective.

Compare to supervised learning and maximum likelihood estimation (MLE). Imagine we have access to expert actions; then the loss we want to optimize is

J^SL(θ) = (1/N) Σ_{i=1}^N Σ_{t=1}^T log π_θ(ã_t^(i) | s_t^(i)) + regularization,    τ_i ∼ π*,

which maximizes the probability of expert actions in the training set.

Is this our SL objective? Strictly we care about test error, but that is a longer story; the short answer is yes, this is good enough to optimize as long as we regularize.
What Loss to Optimize? (continued)

This lecture is also about writing down an objective that we can actually optimize with policy gradients, such that the procedure 1-5 above falls out as the result of maximizing that objective.
What Loss to Optimize? (continued)

Two problems with the vanilla formulation:
1. It is hard to choose the stepsize ε.
2. It is sample inefficient: we cannot reuse data collected with policies of previous iterations.
Hard to choose stepsizes

• Step too big: a bad policy collects data under that bad policy, and we may never recover. (In supervised learning, the data does not depend on the network weights.)
• Step too small: we make inefficient use of experience. (In supervised learning, data can be trivially re-used.)

Gradient descent in parameter space does not take into account the resulting distance in the (output) policy space between π_θold(s) and π_θnew(s).
The problem is more than step size

Consider a family of policies with parametrization

π_θ(a = 1) = σ(θ),    π_θ(a = 2) = 1 − σ(θ).

Figure: Small changes in the policy parameters can unexpectedly lead to big changes in the policy.
Notation

We will use the following to denote values of parameters and corresponding policies before and after an update:

θ_old → θ_new,    π_old → π_new,    or equivalently    θ → θ′,    π → π′
Gradient Descent in Parameter Space

The stepsize in gradient descent results from solving the following optimization problem, e.g., using line search:

d* = arg max_{∥d∥ ≤ ε} J(θ + d)        (Euclidean distance in parameter space)

SGD: θ_new = θ_old + d*

It is hard to predict the effect of this step on the parameterized distribution θ → μ_θ(s), σ_θ(s).
Gradient Descent in Distribution Space

With the Euclidean constraint ∥d∥ ≤ ε in parameter space, it is hard to predict the effect on the parameterized distribution, and hence hard to pick the threshold ε.

Natural gradient descent: the stepsize in parameter space is determined by considering the KL divergence between the distributions before and after the update:

d* = arg max_{d : KL(π_θ ∥ π_{θ+d}) ≤ ε} J(θ + d)        (KL divergence in distribution space)

It is much easier to pick the distance threshold in distribution space! (A small numeric illustration follows.)
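A small numeric illustration (a sketch, using the closed-form KL between univariate Gaussians): the same Euclidean step on the mean parameter is an almost negligible change in distribution space when σ = 1, but an enormous one when σ = 0.01.

```python
# Sketch: the same parameter-space step can mean very different things in
# distribution space. KL between 1-D Gaussians N(mu0, s0^2) and N(mu1, s1^2):
#   KL = log(s1/s0) + (s0^2 + (mu0 - mu1)^2) / (2 * s1^2) - 1/2
import math

def kl_gauss(mu0, s0, mu1, s1):
    return math.log(s1 / s0) + (s0**2 + (mu0 - mu1)**2) / (2 * s1**2) - 0.5

step = 0.1  # identical Euclidean step on the mean parameter
print(kl_gauss(0.0, 1.0, step, 1.0))     # sigma = 1    -> KL = 0.005
print(kl_gauss(0.0, 0.01, step, 0.01))   # sigma = 0.01 -> KL = 50.0
```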
Solving the KL Constrained Problem

Unconstrained penalized objective (here the current parameters are θ = θ_old):

d* = arg max_d  J(θ_old + d) − λ ( D_KL[π_θold ∥ π_{θold+d}] − ε )

Using a first-order Taylor expansion for the objective and a second-order expansion for the KL:

d* ≈ arg max_d  J(θ_old) + ∇_θ J(θ)|_{θ=θold} ⋅ d − (λ/2) d⊤ ∇²_θ D_KL[π_θold ∥ π_θ]|_{θ=θold} d + λε
Taylor expansion of the KL

D_KL(p_θold ∥ p_θ) ≈ D_KL(p_θold ∥ p_θold) + d⊤ ∇_θ D_KL(p_θold ∥ p_θ)|_{θ=θold} + (1/2) d⊤ ∇²_θ D_KL(p_θold ∥ p_θ)|_{θ=θold} d,    d = θ − θ_old

The zeroth-order term vanishes, D_KL(p_θold ∥ p_θold) = 0, directly from the definition KL(p_θold ∥ p_θ) = 𝔼_{x∼p_θold}[ log( p_θold(x) / p_θ(x) ) ].

The first-order term also vanishes:

∇_θ D_KL(p_θold ∥ p_θ)|_{θ=θold} = −∇_θ 𝔼_{x∼p_θold} log p_θ(x)|_{θ=θold}
  = −𝔼_{x∼p_θold} ∇_θ log p_θ(x)|_{θ=θold}
  = −𝔼_{x∼p_θold} (1 / p_θold(x)) ∇_θ p_θ(x)|_{θ=θold}
  = −∫_x p_θold(x) (1 / p_θold(x)) ∇_θ p_θ(x)|_{θ=θold}
  = −∫_x ∇_θ p_θ(x)|_{θ=θold}
  = −∇_θ ∫_x p_θ(x)|_{θ=θold} = 0,

since ∫_x p_θ(x) = 1 for every θ.
Taylor expansion of the KL: the second-order term

∇²_θ D_KL(p_θold ∥ p_θ)|_{θ=θold} = −𝔼_{x∼p_θold} ∇²_θ log p_θ(x)|_{θ=θold}
  = −𝔼_{x∼p_θold} ∇_θ ( ∇_θ p_θ(x) / p_θ(x) )|_{θ=θold}
  = −𝔼_{x∼p_θold} [ ( ∇²_θ p_θ(x) p_θ(x) − ∇_θ p_θ(x) ∇_θ p_θ(x)⊤ ) / p_θ(x)² ]|_{θ=θold}
  = −𝔼_{x∼p_θold} [ ∇²_θ p_θ(x)|_{θ=θold} / p_θold(x) ] + 𝔼_{x∼p_θold} [ ∇_θ log p_θ(x) ∇_θ log p_θ(x)⊤ ]|_{θ=θold}
  = 𝔼_{x∼p_θold} [ ∇_θ log p_θ(x) ∇_θ log p_θ(x)⊤ ]|_{θ=θold},

where the first term vanishes for the same reason as before: 𝔼_{x∼p_θold}[ ∇²_θ p_θ(x) / p_θold(x) ] = ∇²_θ ∫_x p_θ(x) = 0.
Fisher Information Matrix

F(θ) = 𝔼_θ[ ∇_θ log p_θ(x) ∇_θ log p_θ(x)⊤ ]

This is exactly the Hessian of the KL divergence at θ_old:

F(θ_old) = ∇²_θ D_KL(p_θold ∥ p_θ)|_{θ=θold}

so the Taylor expansion becomes

D_KL(p_θold ∥ p_θ) ≈ (1/2) d⊤ F(θ_old) d = (1/2) (θ − θ_old)⊤ F(θ_old) (θ − θ_old).

Since KL divergence is roughly analogous to a distance measure between distributions, the Fisher information serves as a local distance metric between distributions: it measures how much the distribution changes if you move the parameters a little bit in a given direction.
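A minimal sketch (not from the slides) of estimating F(θ) by Monte Carlo as the expected outer product of score vectors, for a categorical distribution where the exact Fisher of the softmax, diag(p) − p p⊤, is available for comparison:

```python
# Sketch: Monte Carlo estimate of F(theta) = E_x[ grad log p_theta(x) grad log p_theta(x)^T ]
# for a categorical distribution with logits theta, compared to the exact softmax Fisher.
import torch

torch.manual_seed(0)
theta = torch.randn(4, requires_grad=True)
dist = torch.distributions.Categorical(logits=theta)

n = 5000
F = torch.zeros(4, 4)
for x in dist.sample((n,)):
    g, = torch.autograd.grad(dist.log_prob(x), theta, retain_graph=True)
    F += torch.outer(g, g) / n

p = dist.probs.detach()
F_exact = torch.diag(p) - torch.outer(p, p)
print((F - F_exact).abs().max())  # small; the error shrinks like O(1/sqrt(n))
```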
Solving the KL Constrained Problem

Starting again from the penalized objective and its Taylor expansion,

d* ≈ arg max_d  J(θ_old) + ∇_θ J(θ)|_{θ=θold} ⋅ d − (λ/2) d⊤ ∇²_θ D_KL[π_θold ∥ π_θ]|_{θ=θold} d + λε,

substitute the Fisher information matrix and drop the terms that do not depend on d:

d* = arg max_d  ∇_θ J(θ)|_{θ=θold} ⋅ d − (λ/2) d⊤ F(θ_old) d
   = arg min_d  −∇_θ J(θ)|_{θ=θold} ⋅ d + (λ/2) d⊤ F(θ_old) d
Natural Gradient Descent

Setting the gradient with respect to d to zero:

0 = ∂/∂d [ −∇_θ J(θ)|_{θ=θold} ⋅ d + (λ/2) d⊤ F(θ_old) d ]
  = −∇_θ J(θ)|_{θ=θold} + λ F(θ_old) d

⇒  d = (1/λ) F⁻¹(θ_old) ∇_θ J(θ)|_{θ=θold}

The natural gradient:

∇̃J(θ) = F⁻¹(θ_old) ∇_θ J(θ)

The penalty coefficient λ only scales the step, so in practice we take θ_new = θ_old + α ⋅ F⁻¹(θ_old) ĝ and pick the stepsize α so that the quadratic KL estimate D_KL(π_θold ∥ π_θ) ≈ (1/2)(θ − θ_old)⊤ F(θ_old)(θ − θ_old) hits the trust-region radius. With g_N = F⁻¹(θ_old) ĝ:

(1/2) (α g_N)⊤ F(θ_old) (α g_N) = ε   ⇒   α = sqrt( 2ε / (g_N⊤ F(θ_old) g_N) )
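A minimal sketch of one natural-gradient update under these formulas, assuming a gradient estimate ĝ and a Fisher estimate F are already available (the damping term is a standard practical addition, not part of the derivation above):

```python
# Sketch: one natural-gradient step that (approximately) moves the policy by at most
# eps in KL, given estimates g_hat and F at theta_old.
import numpy as np

def natural_gradient_step(theta_old, g_hat, F, eps=0.01, damping=1e-3):
    # Damping keeps F positive definite when it is estimated from samples.
    F_damped = F + damping * np.eye(F.shape[0])
    g_nat = np.linalg.solve(F_damped, g_hat)        # F^-1 g_hat, the natural gradient
    # Choose alpha so that 0.5 * (alpha * g_nat)^T F (alpha * g_nat) = eps
    alpha = np.sqrt(2.0 * eps / (g_nat @ F_damped @ g_nat + 1e-12))
    return theta_old + alpha * g_nat
```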
Natural Gradient Descent

[Figure omitted.] Both the vanilla and the natural policy gradient update use samples from the current policy π_k.

[Figure omitted.] The catch: F⁻¹ is very expensive to compute for a large number of parameters!
What Loss to Optimize? (recap)

Recall the vanilla policy gradient loop: collect trajectories with π_θold, estimate advantages Â, compute the policy gradient ĝ, update θ_new = θ_old + ε ⋅ ĝ, and repeat.
What Loss to Optimize? (recap, continued)

• On-policy learning can be extremely sample inefficient.
• The policy changes only a little bit with each gradient step.
• We would like to be able to reuse earlier data. How can we do that?
Off-policy learning with Importance Sampling

J(θ) = 𝔼_{τ∼π_θ}[ R(τ) ]
     = Σ_τ π_θ(τ) R(τ)
     = Σ_τ π_θold(τ) (π_θ(τ) / π_θold(τ)) R(τ)
     = 𝔼_{τ∼π_θold}[ (π_θ(τ) / π_θold(τ)) R(τ) ]

∇_θ J(θ) = 𝔼_{τ∼π_θold}[ (∇_θ π_θ(τ) / π_θold(τ)) R(τ) ]

∇_θ J(θ)|_{θ=θold} = 𝔼_{τ∼π_θold}[ ∇_θ log π_θ(τ)|_{θ=θold} R(τ) ]    ← the gradient evaluated at θ_old is unchanged
Off-policy learning with Importance Sampling

The trajectory ratio factorizes over time steps (the dynamics terms cancel):

π_θ(τ) / π_θold(τ) = Π_{t=1}^T π_θ(a_t | s_t) / π_θold(a_t | s_t)

so, using per-timestep advantage estimates,

J(θ) = 𝔼_{τ∼π_θold}[ Σ_{t=1}^T ( Π_{t′=1}^t π_θ(a_{t′} | s_{t′}) / π_θold(a_{t′} | s_{t′}) ) Â_t ]

Now we can use data from the old policy, but the variance has increased by a lot: those products of ratios can explode or vanish! (A small simulation below illustrates this.)
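A small simulation (a sketch with made-up lognormal per-step ratios standing in for two nearby policies) showing how the cumulative product of ratios spreads out as the horizon grows, even though each per-step ratio is close to 1 in expectation:

```python
# Sketch: products of per-step importance ratios explode or vanish with horizon.
import numpy as np

rng = np.random.default_rng(0)
T = 200
# Stand-in for pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t) for two nearby policies:
# lognormal ratios with E[ratio] = 1.
ratios = rng.lognormal(mean=-0.005, sigma=0.1, size=(1000, T))
cumprod = ratios.cumprod(axis=1)

for t in (10, 50, 200):
    w = cumprod[:, t - 1]
    print(f"t={t:3d}  mean={w.mean():.2f}  std={w.std():.2f}  max={w.max():.1f}")
```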
Trust Region Policy Optimization

Define the following trust region update:

maximize_θ   Ê_t[ (π_θ(a_t | s_t) / π_θold(a_t | s_t)) Â_t ]
subject to   Ê_t[ KL[π_θold(⋅ | s_t), π_θ(⋅ | s_t)] ] ≤ δ

Also worth considering is a penalty instead of a constraint:

maximize_θ   Ê_t[ (π_θ(a_t | s_t) / π_θold(a_t | s_t)) Â_t ] − β Ê_t[ KL[π_θold(⋅ | s_t), π_θ(⋅ | s_t)] ]

Again the KL-penalized problem! By the method of Lagrange multipliers, an optimality point of the δ-constrained problem is also an optimality point of the β-penalized problem for some β. In practice, δ is easier to tune, and a fixed δ works better than a fixed β. (References: Kakade 2001; Kakade and Langford 2002; Peters and Schaal 2008; Schulman et al. 2015 — see Further Reading at the end.)
Solving the KL Penalized Problem

maximize_θ   L_{π_θold}(π_θ) − β ⋅ KL_{π_θold}(π_θ)

Make a linear approximation to L_{π_θold} and a quadratic approximation to the KL term:

maximize_θ   g ⋅ (θ − θ_old) − (β/2) (θ − θ_old)⊤ F (θ − θ_old),

where g = ∂/∂θ L_{π_θold}(π_θ)|_{θ=θold} and F = ∂²/∂θ² KL_{π_θold}(π_θ)|_{θ=θold}.

• The quadratic part of L is negligible compared to the KL term.
• F is positive semidefinite, but not if we include the Hessian of L.
• Solution: θ − θ_old = (1/β) F⁻¹ g, where F is the Fisher information matrix and g is the policy gradient. This is called the natural policy gradient (Kakade, 2001) — exactly what we saw with natural gradient descent. One important detail remains!
Trust Region Policy Optimization: line search

Small problems with the NPG update:
• It might not be robust to the trust region size δ; at some iterations δ may be too large and performance can degrade.
• Because of the quadratic approximation, the KL-divergence constraint may be violated.

What if we just do a line search to find the best stepsize, making sure that
• we are improving the surrogate objective (require L_{θ_k}(θ_{k+1}) ≥ 0), and
• the KL constraint is not violated?

How? Backtracking line search with exponential decay (decay coefficient α ∈ (0, 1), budget L):

Algorithm 2: Line Search for TRPO
  Compute proposed policy step Δ_k = sqrt( 2δ / (ĝ_k⊤ Ĥ_k⁻¹ ĝ_k) ) Ĥ_k⁻¹ ĝ_k
  for j = 0, 1, 2, ..., L do
    Compute proposed update θ = θ_k + α^j Δ_k
    if L_{θ_k}(θ) ≥ 0 and D̄_KL(θ ∥ θ_k) ≤ δ then
      accept the update and set θ_{k+1} = θ_k + α^j Δ_k
      break
    end if
  end for
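A sketch of Algorithm 2 in code. `surrogate_improvement` and `mean_kl` are hypothetical callables that would evaluate L_{θ_k}(θ) and the mean KL on the collected batch; they are assumptions made for illustration:

```python
# Sketch of the TRPO backtracking line search (Algorithm 2).
def trpo_line_search(theta_k, full_step, surrogate_improvement, mean_kl,
                     delta=0.01, alpha=0.8, budget=10):
    for j in range(budget):
        theta = theta_k + (alpha ** j) * full_step
        if surrogate_improvement(theta) >= 0 and mean_kl(theta) <= delta:
            return theta        # accept the first step that passes both checks
    return theta_k              # no acceptable step found: keep the old parameters
```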
Trust Region Policy Optimization

Trust Region Policy Optimization is implemented as truncated natural policy gradient (TNPG) plus a line search. Putting it all together: TRPO = NPG + line search.

Algorithm 3: Trust Region Policy Optimization
  Input: initial policy parameters θ_0
  for k = 0, 1, 2, ... do
    Collect a set of trajectories D_k on policy π_k = π(θ_k)
    Estimate advantages Â_t^{π_k} using any advantage estimation algorithm
    Form sample estimates for the policy gradient ĝ_k (using the advantage estimates) and for the KL-divergence Hessian-vector product function f(v) = Ĥ_k v
    Use CG with n_cg iterations to obtain x_k ≈ Ĥ_k⁻¹ ĝ_k
    Estimate the proposed step Δ_k ≈ sqrt( 2δ / (x_k⊤ Ĥ_k x_k) ) x_k
    Perform a backtracking line search with exponential decay to obtain the final update θ_{k+1} = θ_k + α^j Δ_k
  end for
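The step x_k ≈ Ĥ_k⁻¹ ĝ_k is computed with conjugate gradient so that Ĥ_k is never formed explicitly; only Hessian-vector products f(v) = Ĥ_k v are needed. A minimal sketch (assuming such an `hvp` callable is available):

```python
# Sketch: conjugate gradient for x ~= H^-1 g using only Hessian-vector products hvp(v) = H v.
import numpy as np

def conjugate_gradient(hvp, g, n_iters=10, tol=1e-10):
    x = np.zeros_like(g)
    r = g.copy()                  # residual g - H x, with x = 0
    p = g.copy()
    rr = r @ r
    for _ in range(n_iters):
        Hp = hvp(p)
        alpha = rr / (p @ Hp + 1e-12)
        x += alpha * p
        r -= alpha * Hp
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

# Full TRPO step before the line search would then be
#   Delta_k = sqrt(2 * delta / (x @ hvp(x))) * x
```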
The same algorithm will also come with a theoretical justification: TRPO = NPG + line search + monotonic improvement theorem, developed next.
Relating objectives of two policies

Policy objective:    J(π_θ) = 𝔼_{τ∼π_θ}[ Σ_{t=0}^∞ γ^t r_t ]

The objective of a new policy can be written in terms of the old one:

J(π_θ′) − J(π_θ) = 𝔼_{τ∼π_θ′}[ Σ_{t=0}^∞ γ^t A^{π_θ}(s_t, a_t) ]

or, more succinctly,

J(π′) − J(π) = 𝔼_{τ∼π′}[ Σ_{t=0}^∞ γ^t A^π(s_t, a_t) ]
Proof of the Relative Policy Performance Identity

𝔼_{τ∼π′}[ Σ_{t=0}^∞ γ^t A^π(s_t, a_t) ]
  = 𝔼_{τ∼π′}[ Σ_{t=0}^∞ γ^t ( R(s_t, a_t, s_{t+1}) + γ V^π(s_{t+1}) − V^π(s_t) ) ]
  = J(π′) + 𝔼_{τ∼π′}[ Σ_{t=0}^∞ γ^{t+1} V^π(s_{t+1}) − Σ_{t=0}^∞ γ^t V^π(s_t) ]
  = J(π′) + 𝔼_{τ∼π′}[ Σ_{t=1}^∞ γ^t V^π(s_t) − Σ_{t=0}^∞ γ^t V^π(s_t) ]
  = J(π′) − 𝔼_{τ∼π′}[ V^π(s_0) ]
  = J(π′) − J(π)

The last step uses the fact that the initial state distribution is the same for both policies.

(Approximately Optimal Approximate Reinforcement Learning, Kakade and Langford, 2002.)
Relating objectives of two policies: a useful approximation

Discounted state visitation distribution:

d^π(s) = (1 − γ) Σ_{t=0}^∞ γ^t P(s_t = s | π)

With this definition,

J(π′) − J(π) = 𝔼_{τ∼π′}[ Σ_{t=0}^∞ γ^t A^π(s_t, a_t) ]
             = (1/(1 − γ)) 𝔼_{s∼d^{π′}, a∼π′}[ A^π(s, a) ]
             = (1/(1 − γ)) 𝔼_{s∼d^{π′}, a∼π}[ (π′(a | s) / π(a | s)) A^π(s, a) ]

But how are we supposed to sample states from the policy π′ that we are still trying to optimize for? What if we just said d^{π′} ≈ d^π and didn't worry about it, i.e., used the previous policy to sample the states?

J(π′) − J(π) ≈ (1/(1 − γ)) 𝔼_{s∼d^π, a∼π}[ (π′(a | s) / π(a | s)) A^π(s, a) ] =: ℒ_π(π′)

It turns out this approximation is pretty good when π′ and π are close. But why, and how close do they have to be?

Relative policy performance bound (we can bound the approximation error):

J(π′) ≥ J(π) + ℒ_π(π′) − C sqrt( 𝔼_{s∼d^π}[ D_KL(π′ ∥ π)[s] ] )

If the policies are close in KL divergence, the approximation is good!

(Constrained Policy Optimization, Achiam et al., 2017.)
Relating objectives of two policies: the surrogate objective

ℒ_π(π′) = (1/(1 − γ)) 𝔼_{s∼d^π, a∼π}[ (π′(a | s) / π(a | s)) A^π(s, a) ]
        = 𝔼_{τ∼π}[ Σ_{t=0}^∞ γ^t (π′(a_t | s_t) / π(a_t | s_t)) A^π(s_t, a_t) ]

This is something we can optimize using trajectories from the old policy!

Compare to importance sampling over whole trajectories:

J(θ) = 𝔼_{τ∼π_θold}[ Σ_{t=1}^T ( Π_{t′=1}^t π_θ(a_{t′} | s_{t′}) / π_θold(a_{t′} | s_{t′}) ) Â_t ]

Now we do not have the product, so the gradient will have much smaller variance. (Yes, but only because we have approximated; that is exactly why!) What is the gradient?

∇_θ ℒ_{θ_k}(θ)|_{θ=θ_k} = 𝔼_{τ∼π_θk}[ Σ_{t=0}^∞ γ^t ( ∇_θ π_θ(a_t | s_t)|_{θ=θ_k} / π_θk(a_t | s_t) ) A^{π_θk}(s_t, a_t) ]
                        = 𝔼_{τ∼π_θk}[ Σ_{t=0}^∞ γ^t ∇_θ log π_θ(a_t | s_t)|_{θ=θ_k} A^{π_θk}(s_t, a_t) ]

i.e., at θ = θ_k the gradient of the surrogate is exactly the policy gradient.
Monotonic Improvement Theorem

| J(π′) − ( J(π) + ℒ_π(π′) ) |  ≤  C sqrt( 𝔼_{s∼d^π}[ KL(π′ ∥ π)[s] ] )

⇒  J(π′) − J(π)  ≥  ℒ_π(π′) − C sqrt( 𝔼_{s∼d^π}[ KL(π′ ∥ π)[s] ] )

Given policy π, we want to optimize over policies π′ to maximize J(π′) − J(π).

• If we maximize the right-hand side, we maximize a lower bound on the left-hand side.
• We know how to maximize the right-hand side: both quantities can be estimated for any π′ using samples from π.
• But will we actually obtain a better policy π′? Maximizing the lower bound is not enough by itself; we also need the maximized bound to be greater than or equal to zero.
Monotonic Improvement Theory

Proof of the improvement guarantee. Suppose π_{k+1} and π_k are related by

π_{k+1} = arg max_{π′}  ℒ_{π_k}(π′) − C sqrt( 𝔼_{s∼d^{π_k}}[ D_KL(π′ ∥ π_k)[s] ] ).        (3)

π_k is a feasible point, and the objective at π_k is equal to 0:

ℒ_{π_k}(π_k) ∝ 𝔼_{s∼d^{π_k}, a∼π_k}[ A^{π_k}(s, a) ] = 0,    D_KL(π_k ∥ π_k)[s] = 0

⇒ the optimal value is ≥ 0  ⇒  by the performance bound, J(π_{k+1}) − J(π_k) ≥ 0.

Approximate
Problem:
Monotonic
C provided by theory is quite high when is near 1
Improvement
=) steps from (3) are too small.
Solution:
• Theory is very conservative (high value of C) and we will use KL distance of pi’ and
piInstead
as a constraint (trust use
of KL penalty, region)
KL as opposed(called
constraint to a penalty:
trust region).
Can control worst-case error through constraint upper limit!

0
⇡k+1 = arg max0
L ⇡k (⇡ )

⇥ 0 ⇤ (4)
s.t. E⇡ DKL (⇡ ||⇡k )[s] 
s⇠d k

Joshua Achiam (UC Berkeley, OpenAI) Advanced Policy Gradient Methods October 11, 2017 22 / 41
Trust Region Policy Optimization (recap)

Putting it all together: TRPO = NPG + line search + monotonic improvement theorem (Algorithm 3 above).
Proximal Policy Optimization

Can we achieve similar performance without second-order information (no Fisher matrix)?

Proximal Policy Optimization (PPO) is a family of methods that approximately enforce the KL constraint without computing natural gradients. Two variants:

Adaptive KL penalty. The policy update solves an unconstrained optimization problem

θ_{k+1} = arg max_θ  L_{θ_k}(θ) − β_k D̄_KL(θ ∥ θ_k),

where the penalty coefficient β_k changes between iterations to approximately enforce the KL-divergence constraint.

Clipped objective. New objective function: let r_t(θ) = π_θ(a_t | s_t) / π_θk(a_t | s_t). Then

L^CLIP_{θ_k}(θ) = 𝔼_{τ∼π_k}[ Σ_{t=0}^T min( r_t(θ) Â_t^{π_k}, clip(r_t(θ), 1 − ε, 1 + ε) Â_t^{π_k} ) ],

where ε is a hyperparameter (e.g., ε = 0.2). The policy update is θ_{k+1} = arg max_θ L^CLIP_{θ_k}(θ).
PPO: Adaptive KL Penalty

Algorithm 4: PPO with Adaptive KL Penalty
  Input: initial policy parameters θ_0, initial KL penalty β_0, target KL-divergence δ
  for k = 0, 1, 2, ... do
    Collect a set of partial trajectories D_k on policy π_k = π(θ_k)
    Estimate advantages Â_t^{π_k} using any advantage estimation algorithm
    Compute the policy update
      θ_{k+1} = arg max_θ  L_{θ_k}(θ) − β_k D̄_KL(θ ∥ θ_k)
    by taking K steps of minibatch SGD (via Adam)
    if D̄_KL(θ_{k+1} ∥ θ_k) ≥ 1.5 δ then
      β_{k+1} = 2 β_k
    else if D̄_KL(θ_{k+1} ∥ θ_k) ≤ δ / 1.5 then
      β_{k+1} = β_k / 2
    end if
  end for

We do not use the (expensive) second-order approximation of the KL; standard first-order gradient descent is enough.

• The initial KL penalty is not that important; it adapts quickly.
• Some iterations may violate the KL constraint, but most don't.
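The β adaptation is just a couple of comparisons per iteration; a sketch (the parameter names are illustrative):

```python
# Sketch of the penalty-coefficient adaptation in Algorithm 4: after each policy
# update, grow beta if the measured KL overshot the target, shrink it if it undershot.
# `measured_kl` would be the mean KL(theta_{k+1} || theta_k) estimated on the batch.
def adapt_kl_penalty(beta, measured_kl, target_kl):
    if measured_kl >= 1.5 * target_kl:
        beta *= 2.0      # policy moved too far: penalize the KL more next iteration
    elif measured_kl <= target_kl / 1.5:
        beta /= 2.0      # policy barely moved: relax the penalty
    return beta
```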
PPO: Clipped Objective

Recall the surrogate objective

L^IS(θ) = Ê_t[ (π_θ(a_t | s_t) / π_θold(a_t | s_t)) Â_t ] = Ê_t[ r_t(θ) Â_t ].        (1)

Form a lower bound via clipped importance ratios:

L^CLIP(θ) = Ê_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ].        (2)

This forms a pessimistic bound on the objective and can be optimized using SGD.
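A sketch of L^CLIP as it is usually implemented (PyTorch assumed; written as a loss to minimize), given the log-probabilities under the current policy, the stored log-probabilities from the data-collecting policy, and advantage estimates. Each PPO iteration then takes several minibatch Adam steps on this loss.

```python
# Sketch: the PPO clipped surrogate as a loss to minimize (hence the minus sign).
import torch

def ppo_clip_loss(logp, logp_old, adv, clip_eps=0.2):
    ratio = (logp - logp_old).exp()                       # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * adv, clipped * adv).mean()  # pessimistic (lower) bound
```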
PPO: Clipped Objective

But how does clipping keep the policy close? By making the objective as pessimistic as possible about performance far away from θ_k.

Figure: Various objectives as a function of the interpolation factor α between θ_{k+1} and θ_k after one update of PPO-Clip (Schulman, Wolski, Dhariwal, Radford, Klimov, 2017).
PPO with Clipped Objective

Algorithm 5: PPO with Clipped Objective
  Input: initial policy parameters θ_0, clipping threshold ε
  for k = 0, 1, 2, ... do
    Collect a set of partial trajectories D_k on policy π_k = π(θ_k)
    Estimate advantages Â_t^{π_k} using any advantage estimation algorithm
    Compute the policy update θ_{k+1} = arg max_θ L^CLIP_{θ_k}(θ) by taking K steps of minibatch SGD (via Adam), where
      L^CLIP_{θ_k}(θ) = 𝔼_{τ∼π_k}[ Σ_{t=0}^T min( r_t(θ) Â_t^{π_k}, clip(r_t(θ), 1 − ε, 1 + ε) Â_t^{π_k} ) ]
  end for

• Clipping removes the incentive for the policy to move far away from θ_k.
• Clipping seems to work at least as well as PPO with the KL penalty, but it is simpler to implement.
Empirical Performance of PPO

Figure: Performance comparison between PPO with the clipped objective and various other deep RL methods on a slate of MuJoCo tasks (Schulman, Wolski, Dhariwal, Radford, Klimov, 2017).
Summary

• Gradient descent in parameter space vs. in distribution space.
• Natural gradients: the parameter-space step is chosen by keeping track of how much the KL divergence of the policy changes from iteration to iteration.
• Natural policy gradients (and TRPO).
• The clipped objective (PPO) works well and is simpler.

Further Reading

• S. Kakade. "A Natural Policy Gradient." NIPS 2001.
• S. Kakade and J. Langford. "Approximately Optimal Approximate Reinforcement Learning." ICML 2002.
• J. Peters and S. Schaal. "Natural Actor-Critic." Neurocomputing, 2008.
• J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. "Trust Region Policy Optimization." ICML 2015.
• Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. "Benchmarking Deep Reinforcement Learning for Continuous Control." ICML 2016.
• J. Martens and I. Sutskever. "Training Deep and Recurrent Networks with Hessian-Free Optimization." Springer, 2012.
• Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, et al. "Sample Efficient Actor-Critic with Experience Replay." 2016.
• Y. Wu, E. Mansimov, S. Liao, R. Grosse, and J. Ba. "Scalable Trust-Region Method for Deep Reinforcement Learning Using Kronecker-Factored Approximation." 2017.
• J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. "Proximal Policy Optimization Algorithms." 2017.
• J. Achiam, D. Held, A. Tamar, and P. Abbeel. "Constrained Policy Optimization." 2017.
• blog.openai.com: recent posts on baselines releases.
