
Carnegie Mellon

School of Computer Science

Deep Reinforcement Learning and Control

Natural Policy Gradients, TRPO, PPO

CMU 10703

Katerina Fragkiadaki
Part of the slides adapted from John Schulman and Joshua Achiam
Stochastic policies

Continuous actions: the policy network with parameters θ outputs a mean and a standard deviation, usually of a multivariate Gaussian:

θ → μ_θ(s), σ_θ(s),    a ∼ N(μ_θ(s), σ_θ(s)²)

Discrete actions: almost always a categorical distribution:

θ → p_θ(s),    a ∼ Cat(p_θ(s))
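As a concrete sketch (not from the slides; PyTorch and the network sizes are assumptions made only for illustration), here is what these two policy heads typically look like, together with sampling an action and its log-probability:

```python
# A minimal sketch: policy heads for the two cases above, using torch.distributions.
import torch
import torch.nn as nn


class GaussianPolicy(nn.Module):
    """Continuous actions: a ~ N(mu_theta(s), sigma_theta(s)^2), diagonal covariance."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, act_dim)       # state-dependent mean
        self.log_std = nn.Linear(hidden, act_dim)  # state-dependent (log) std

    def dist(self, obs):
        h = self.body(obs)
        return torch.distributions.Normal(self.mu(h), self.log_std(h).exp())


class CategoricalPolicy(nn.Module):
    """Discrete actions: a ~ Cat(p_theta(s))."""

    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_actions))

    def dist(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))


# Sampling an action and its log-probability (both are needed for policy gradients):
obs = torch.randn(1, 4)
pi = GaussianPolicy(obs_dim=4, act_dim=2).dist(obs)
a = pi.sample()
logp = pi.log_prob(a).sum(-1)  # sum over action dimensions for a diagonal Gaussian
```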
What Loss to Optimize?

Policy gradients:

Monte Carlo policy gradient (REINFORCE), gradient direction: ĝ = Ê_t[ ∇_θ log π_θ(a_t | s_t) Â_t ]

Actor-critic policy gradient: ĝ = Ê_t[ ∇_θ log π_θ(a_t | s_t) A_w(s_t) ]

We can differentiate the following loss:

L^PG(θ) = Ê_t[ log π_θ(a_t | s_t) Â_t ]

Equivalently, we can differentiate the importance-sampled form

L^IS_{θ_old}(θ) = Ê_t[ (π_θ(a_t | s_t) / π_θold(a_t | s_t)) Â_t ]

at θ = θ_old, where state-actions are sampled using θ_old. This is just the chain rule: ∇_θ log f(θ)|_{θ_old} = ∇_θ f(θ)|_{θ_old} / f(θ_old). (A numerical check of this claim follows below.)

The vanilla policy gradient loop:
1. Collect trajectories for policy π_θ
2. Estimate advantages Â
3. Compute policy gradient ĝ
4. Update policy parameters θ_new = θ + ε ⋅ ĝ
5. GOTO 1

We don't want to optimize the policy too far from the one that collected the data: the update moves the output distribution from μ_θold(s), σ_θold(s) to μ_θnew(s), σ_θnew(s). This lecture is all about the stepsize.
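The following sketch (my own, with PyTorch assumed; the one-state categorical policy and the random advantages are placeholders) verifies numerically that the log-probability loss L^PG and the importance-ratio loss L^IS have identical gradients at θ = θ_old:

```python
# Sketch: the log-prob surrogate and the importance-ratio surrogate have the same
# gradient at theta = theta_old (the chain rule). The identity holds sample by sample.
import torch

torch.manual_seed(0)
theta_old = torch.randn(3)               # logits of the data-collecting policy
actions = torch.randint(0, 3, (256,))    # placeholder action batch
adv = torch.randn(256)                   # placeholder advantage estimates A_hat

def log_probs(theta):
    return torch.log_softmax(theta, dim=0)[actions]

theta = theta_old.clone().requires_grad_(True)

# L^PG(theta) = E_hat[ log pi_theta(a) * A_hat ]
g_pg, = torch.autograd.grad((log_probs(theta) * adv).mean(), theta)

# L^IS_{theta_old}(theta) = E_hat[ (pi_theta(a) / pi_theta_old(a)) * A_hat ]
ratio = (log_probs(theta) - log_probs(theta_old)).exp()
g_is, = torch.autograd.grad((ratio * adv).mean(), theta)

print(torch.allclose(g_pg, g_is, atol=1e-6))  # True: identical gradients at theta_old
```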
What is the underlying objective function?

Policy gradients:

ĝ ≈ (1/N) Σ_{i=1}^N Σ_{t=1}^T ∇_θ log π_θ(a_t^(i) | s_t^(i)) A(s_t^(i), a_t^(i)),    τ_i ∼ π_θ

What is our objective? This gradient is the result of differentiating the objective function

J^PG(θ) = (1/N) Σ_{i=1}^N Σ_{t=1}^T log π_θ(a_t^(i) | s_t^(i)) A(s_t^(i), a_t^(i)),    τ_i ∼ π_θ
Is this really our objective? We cannot both maximize over a variable and sample from it. We also cannot optimize it too far: our advantage estimates come from samples of π_θold. However, this constraint of "do not move too far from θ_old" does not appear anywhere in the objective.

Compare to supervised learning and maximum likelihood estimation (MLE). Imagine we have access to expert actions; then the loss we want to optimize is

J^SL(θ) = (1/N) Σ_{i=1}^N Σ_{t=1}^T log π_θ(ã_t^(i) | s_t^(i)) + regularization,    τ_i ∼ π*,

which maximizes the probability of expert actions in the training set.

Is this our SL objective? Strictly we care about test error, but that is a longer story; the short answer is yes, this is good enough to optimize as long as we regularize.
What Loss to Optimize? (continued)

This lecture is also about writing down an objective that we can actually optimize with policy gradients, such that the procedure 1-5 above falls out as the result of maximizing that objective.
What Loss to Optimize? (continued)

Two problems with the vanilla formulation:
1. It is hard to choose the stepsize ε.
2. It is sample inefficient: we cannot reuse data collected with policies of previous iterations.
Hard to choose stepsizes

• Step too big: a bad policy collects data under that bad policy, and we may never recover. (In supervised learning, the data does not depend on the network weights.)
• Step too small: we make inefficient use of experience. (In supervised learning, data can be trivially re-used.)

Gradient descent in parameter space does not take into account the resulting distance in the (output) policy space between π_θold(s) and π_θnew(s).
The problem is more than step size

Consider a family of policies with parametrization

π_θ(a = 1) = σ(θ),    π_θ(a = 2) = 1 − σ(θ).

Figure: Small changes in the policy parameters can unexpectedly lead to big changes in the policy.
Notation

We will use the following to denote values of parameters and corresponding policies before and after an update:

θ_old → θ_new,    π_old → π_new,    or equivalently    θ → θ′,    π → π′
Gradient Descent in Parameter Space

The stepsize in gradient descent results from solving the following optimization problem, e.g., using line search:

d* = arg max_{∥d∥ ≤ ε} J(θ + d)        (Euclidean distance in parameter space)

SGD: θ_new = θ_old + d*

It is hard to predict the effect of this step on the parameterized distribution θ → μ_θ(s), σ_θ(s).
Gradient Descent in Distribution Space

With the Euclidean constraint ∥d∥ ≤ ε in parameter space, it is hard to predict the effect on the parameterized distribution, and hence hard to pick the threshold ε.

Natural gradient descent: the stepsize in parameter space is determined by considering the KL divergence between the distributions before and after the update:

d* = arg max_{d : KL(π_θ ∥ π_{θ+d}) ≤ ε} J(θ + d)        (KL divergence in distribution space)

It is much easier to pick the distance threshold in distribution space! (A small numeric illustration follows.)
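A small numeric illustration (a sketch, using the closed-form KL between univariate Gaussians): the same Euclidean step on the mean parameter is an almost negligible change in distribution space when σ = 1, but an enormous one when σ = 0.01.

```python
# Sketch: the same parameter-space step can mean very different things in
# distribution space. KL between 1-D Gaussians N(mu0, s0^2) and N(mu1, s1^2):
#   KL = log(s1/s0) + (s0^2 + (mu0 - mu1)^2) / (2 * s1^2) - 1/2
import math

def kl_gauss(mu0, s0, mu1, s1):
    return math.log(s1 / s0) + (s0**2 + (mu0 - mu1)**2) / (2 * s1**2) - 0.5

step = 0.1  # identical Euclidean step on the mean parameter
print(kl_gauss(0.0, 1.0, step, 1.0))     # sigma = 1    -> KL = 0.005
print(kl_gauss(0.0, 0.01, step, 0.01))   # sigma = 0.01 -> KL = 50.0
```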
Solving the KL Constrained Problem

Unconstrained penalized objective (here the current parameters are θ = θ_old):

d* = arg max_d  J(θ_old + d) − λ ( D_KL[π_θold ∥ π_{θold+d}] − ε )

Using a first-order Taylor expansion for the objective and a second-order expansion for the KL:

d* ≈ arg max_d  J(θ_old) + ∇_θ J(θ)|_{θ=θold} ⋅ d − (λ/2) d⊤ ∇²_θ D_KL[π_θold ∥ π_θ]|_{θ=θold} d + λε
Taylor expansion of the KL

D_KL(p_θold ∥ p_θ) ≈ D_KL(p_θold ∥ p_θold) + d⊤ ∇_θ D_KL(p_θold ∥ p_θ)|_{θ=θold} + (1/2) d⊤ ∇²_θ D_KL(p_θold ∥ p_θ)|_{θ=θold} d,    d = θ − θ_old

The zeroth-order term vanishes, D_KL(p_θold ∥ p_θold) = 0, directly from the definition KL(p_θold ∥ p_θ) = 𝔼_{x∼p_θold}[ log( p_θold(x) / p_θ(x) ) ].

The first-order term also vanishes:

∇_θ D_KL(p_θold ∥ p_θ)|_{θ=θold} = −∇_θ 𝔼_{x∼p_θold} log p_θ(x)|_{θ=θold}
  = −𝔼_{x∼p_θold} ∇_θ log p_θ(x)|_{θ=θold}
  = −𝔼_{x∼p_θold} (1 / p_θold(x)) ∇_θ p_θ(x)|_{θ=θold}
  = −∫_x p_θold(x) (1 / p_θold(x)) ∇_θ p_θ(x)|_{θ=θold}
  = −∫_x ∇_θ p_θ(x)|_{θ=θold}
  = −∇_θ ∫_x p_θ(x)|_{θ=θold} = 0,

since ∫_x p_θ(x) = 1 for every θ.
Taylor expansion of the KL: the second-order term

∇²_θ D_KL(p_θold ∥ p_θ)|_{θ=θold} = −𝔼_{x∼p_θold} ∇²_θ log p_θ(x)|_{θ=θold}
  = −𝔼_{x∼p_θold} ∇_θ ( ∇_θ p_θ(x) / p_θ(x) )|_{θ=θold}
  = −𝔼_{x∼p_θold} [ ( ∇²_θ p_θ(x) p_θ(x) − ∇_θ p_θ(x) ∇_θ p_θ(x)⊤ ) / p_θ(x)² ]|_{θ=θold}
  = −𝔼_{x∼p_θold} [ ∇²_θ p_θ(x)|_{θ=θold} / p_θold(x) ] + 𝔼_{x∼p_θold} [ ∇_θ log p_θ(x) ∇_θ log p_θ(x)⊤ ]|_{θ=θold}
  = 𝔼_{x∼p_θold} [ ∇_θ log p_θ(x) ∇_θ log p_θ(x)⊤ ]|_{θ=θold},

where the first term vanishes for the same reason as before: 𝔼_{x∼p_θold}[ ∇²_θ p_θ(x) / p_θold(x) ] = ∇²_θ ∫_x p_θ(x) = 0.
Fisher Information Matrix

F(θ) = 𝔼_θ[ ∇_θ log p_θ(x) ∇_θ log p_θ(x)⊤ ]

This is exactly the Hessian of the KL divergence at θ_old:

F(θ_old) = ∇²_θ D_KL(p_θold ∥ p_θ)|_{θ=θold}

so the Taylor expansion becomes

D_KL(p_θold ∥ p_θ) ≈ (1/2) d⊤ F(θ_old) d = (1/2) (θ − θ_old)⊤ F(θ_old) (θ − θ_old).

Since KL divergence is roughly analogous to a distance measure between distributions, the Fisher information serves as a local distance metric between distributions: it measures how much the distribution changes if you move the parameters a little bit in a given direction.
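A minimal sketch (not from the slides) of estimating F(θ) by Monte Carlo as the expected outer product of score vectors, for a categorical distribution where the exact Fisher of the softmax, diag(p) − p p⊤, is available for comparison:

```python
# Sketch: Monte Carlo estimate of F(theta) = E_x[ grad log p_theta(x) grad log p_theta(x)^T ]
# for a categorical distribution with logits theta, compared to the exact softmax Fisher.
import torch

torch.manual_seed(0)
theta = torch.randn(4, requires_grad=True)
dist = torch.distributions.Categorical(logits=theta)

n = 5000
F = torch.zeros(4, 4)
for x in dist.sample((n,)):
    g, = torch.autograd.grad(dist.log_prob(x), theta, retain_graph=True)
    F += torch.outer(g, g) / n

p = dist.probs.detach()
F_exact = torch.diag(p) - torch.outer(p, p)
print((F - F_exact).abs().max())  # small; the error shrinks like O(1/sqrt(n))
```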
Solving the KL Constrained Problem

Starting again from the penalized objective and its Taylor expansion,

d* ≈ arg max_d  J(θ_old) + ∇_θ J(θ)|_{θ=θold} ⋅ d − (λ/2) d⊤ ∇²_θ D_KL[π_θold ∥ π_θ]|_{θ=θold} d + λε,

substitute the Fisher information matrix and drop the terms that do not depend on d:

d* = arg max_d  ∇_θ J(θ)|_{θ=θold} ⋅ d − (λ/2) d⊤ F(θ_old) d
   = arg min_d  −∇_θ J(θ)|_{θ=θold} ⋅ d + (λ/2) d⊤ F(θ_old) d
Natural Gradient Descent

Setting the gradient with respect to d to zero:

0 = ∂/∂d [ −∇_θ J(θ)|_{θ=θold} ⋅ d + (λ/2) d⊤ F(θ_old) d ]
  = −∇_θ J(θ)|_{θ=θold} + λ F(θ_old) d

⇒  d = (1/λ) F⁻¹(θ_old) ∇_θ J(θ)|_{θ=θold}

The natural gradient:

∇̃J(θ) = F⁻¹(θ_old) ∇_θ J(θ)

The penalty coefficient λ only scales the step, so in practice we take θ_new = θ_old + α ⋅ F⁻¹(θ_old) ĝ and pick the stepsize α so that the quadratic KL estimate D_KL(π_θold ∥ π_θ) ≈ (1/2)(θ − θ_old)⊤ F(θ_old)(θ − θ_old) hits the trust-region radius. With g_N = F⁻¹(θ_old) ĝ:

(1/2) (α g_N)⊤ F(θ_old) (α g_N) = ε   ⇒   α = sqrt( 2ε / (g_N⊤ F(θ_old) g_N) )
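A minimal sketch of one natural-gradient update under these formulas, assuming a gradient estimate ĝ and a Fisher estimate F are already available (the damping term is a standard practical addition, not part of the derivation above):

```python
# Sketch: one natural-gradient step that (approximately) moves the policy by at most
# eps in KL, given estimates g_hat and F at theta_old.
import numpy as np

def natural_gradient_step(theta_old, g_hat, F, eps=0.01, damping=1e-3):
    # Damping keeps F positive definite when it is estimated from samples.
    F_damped = F + damping * np.eye(F.shape[0])
    g_nat = np.linalg.solve(F_damped, g_hat)        # F^-1 g_hat, the natural gradient
    # Choose alpha so that 0.5 * (alpha * g_nat)^T F (alpha * g_nat) = eps
    alpha = np.sqrt(2.0 * eps / (g_nat @ F_damped @ g_nat + 1e-12))
    return theta_old + alpha * g_nat
```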
Natural Gradient Descent

[Figure omitted.] Both the vanilla and the natural policy gradient update use samples from the current policy π_k.

[Figure omitted.] The catch: F⁻¹ is very expensive to compute for a large number of parameters!
What Loss to Optimize? (recap)

Recall the vanilla policy gradient loop: collect trajectories with π_θold, estimate advantages Â, compute the policy gradient ĝ, update θ_new = θ_old + ε ⋅ ĝ, and repeat.
What Loss to Optimize? (recap, continued)

• On-policy learning can be extremely sample inefficient.
• The policy changes only a little bit with each gradient step.
• We would like to be able to reuse earlier data. How can we do that?
Off-policy learning with Importance Sampling

J(θ) = 𝔼_{τ∼π_θ}[ R(τ) ]
     = Σ_τ π_θ(τ) R(τ)
     = Σ_τ π_θold(τ) (π_θ(τ) / π_θold(τ)) R(τ)
     = 𝔼_{τ∼π_θold}[ (π_θ(τ) / π_θold(τ)) R(τ) ]

∇_θ J(θ) = 𝔼_{τ∼π_θold}[ (∇_θ π_θ(τ) / π_θold(τ)) R(τ) ]

∇_θ J(θ)|_{θ=θold} = 𝔼_{τ∼π_θold}[ ∇_θ log π_θ(τ)|_{θ=θold} R(τ) ]    ← the gradient evaluated at θ_old is unchanged
Off-policy learning with Importance Sampling

The trajectory ratio factorizes over time steps (the dynamics terms cancel):

π_θ(τ) / π_θold(τ) = Π_{t=1}^T π_θ(a_t | s_t) / π_θold(a_t | s_t)

so, using per-timestep advantage estimates,

J(θ) = 𝔼_{τ∼π_θold}[ Σ_{t=1}^T ( Π_{t′=1}^t π_θ(a_{t′} | s_{t′}) / π_θold(a_{t′} | s_{t′}) ) Â_t ]

Now we can use data from the old policy, but the variance has increased by a lot: those products of ratios can explode or vanish! (A small simulation below illustrates this.)
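A small simulation (a sketch with made-up lognormal per-step ratios standing in for two nearby policies) showing how the cumulative product of ratios spreads out as the horizon grows, even though each per-step ratio is close to 1 in expectation:

```python
# Sketch: products of per-step importance ratios explode or vanish with horizon.
import numpy as np

rng = np.random.default_rng(0)
T = 200
# Stand-in for pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t) for two nearby policies:
# lognormal ratios with E[ratio] = 1.
ratios = rng.lognormal(mean=-0.005, sigma=0.1, size=(1000, T))
cumprod = ratios.cumprod(axis=1)

for t in (10, 50, 200):
    w = cumprod[:, t - 1]
    print(f"t={t:3d}  mean={w.mean():.2f}  std={w.std():.2f}  max={w.max():.1f}")
```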
Trust Region Policy Optimization

Define the following trust region update:

maximize_θ   Ê_t[ (π_θ(a_t | s_t) / π_θold(a_t | s_t)) Â_t ]
subject to   Ê_t[ KL[π_θold(⋅ | s_t), π_θ(⋅ | s_t)] ] ≤ δ

Also worth considering is a penalty instead of a constraint:

maximize_θ   Ê_t[ (π_θ(a_t | s_t) / π_θold(a_t | s_t)) Â_t ] − β Ê_t[ KL[π_θold(⋅ | s_t), π_θ(⋅ | s_t)] ]

Again the KL-penalized problem! By the method of Lagrange multipliers, an optimality point of the δ-constrained problem is also an optimality point of the β-penalized problem for some β. In practice, δ is easier to tune, and a fixed δ works better than a fixed β. (References: Kakade 2001; Kakade and Langford 2002; Peters and Schaal 2008; Schulman et al. 2015 — see Further Reading at the end.)
Solving the KL Penalized Problem

maximize_θ   L_{π_θold}(π_θ) − β ⋅ KL_{π_θold}(π_θ)

Make a linear approximation to L_{π_θold} and a quadratic approximation to the KL term:

maximize_θ   g ⋅ (θ − θ_old) − (β/2) (θ − θ_old)⊤ F (θ − θ_old),

where g = ∂/∂θ L_{π_θold}(π_θ)|_{θ=θold} and F = ∂²/∂θ² KL_{π_θold}(π_θ)|_{θ=θold}.

• The quadratic part of L is negligible compared to the KL term.
• F is positive semidefinite, but not if we include the Hessian of L.
• Solution: θ − θ_old = (1/β) F⁻¹ g, where F is the Fisher information matrix and g is the policy gradient. This is called the natural policy gradient (Kakade, 2001) — exactly what we saw with natural gradient descent. One important detail remains!
Trust Region Policy Optimization: line search

Small problems with the NPG update:
• It might not be robust to the trust region size δ; at some iterations δ may be too large and performance can degrade.
• Because of the quadratic approximation, the KL-divergence constraint may be violated.

What if we just do a line search to find the best stepsize, making sure that
• we are improving the surrogate objective (require L_{θ_k}(θ_{k+1}) ≥ 0), and
• the KL constraint is not violated?

How? Backtracking line search with exponential decay (decay coefficient α ∈ (0, 1), budget L):

Algorithm 2: Line Search for TRPO
  Compute proposed policy step Δ_k = sqrt( 2δ / (ĝ_k⊤ Ĥ_k⁻¹ ĝ_k) ) Ĥ_k⁻¹ ĝ_k
  for j = 0, 1, 2, ..., L do
    Compute proposed update θ = θ_k + α^j Δ_k
    if L_{θ_k}(θ) ≥ 0 and D̄_KL(θ ∥ θ_k) ≤ δ then
      accept the update and set θ_{k+1} = θ_k + α^j Δ_k
      break
    end if
  end for
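A sketch of Algorithm 2 in code. `surrogate_improvement` and `mean_kl` are hypothetical callables that would evaluate L_{θ_k}(θ) and the mean KL on the collected batch; they are assumptions made for illustration:

```python
# Sketch of the TRPO backtracking line search (Algorithm 2).
def trpo_line_search(theta_k, full_step, surrogate_improvement, mean_kl,
                     delta=0.01, alpha=0.8, budget=10):
    for j in range(budget):
        theta = theta_k + (alpha ** j) * full_step
        if surrogate_improvement(theta) >= 0 and mean_kl(theta) <= delta:
            return theta        # accept the first step that passes both checks
    return theta_k              # no acceptable step found: keep the old parameters
```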
Trust Region Policy Optimization

Trust Region Policy Optimization is implemented as truncated natural policy gradient (TNPG) plus a line search. Putting it all together: TRPO = NPG + line search.

Algorithm 3: Trust Region Policy Optimization
  Input: initial policy parameters θ_0
  for k = 0, 1, 2, ... do
    Collect a set of trajectories D_k on policy π_k = π(θ_k)
    Estimate advantages Â_t^{π_k} using any advantage estimation algorithm
    Form sample estimates for the policy gradient ĝ_k (using the advantage estimates) and for the KL-divergence Hessian-vector product function f(v) = Ĥ_k v
    Use CG with n_cg iterations to obtain x_k ≈ Ĥ_k⁻¹ ĝ_k
    Estimate the proposed step Δ_k ≈ sqrt( 2δ / (x_k⊤ Ĥ_k x_k) ) x_k
    Perform a backtracking line search with exponential decay to obtain the final update θ_{k+1} = θ_k + α^j Δ_k
  end for
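The step x_k ≈ Ĥ_k⁻¹ ĝ_k is computed with conjugate gradient so that Ĥ_k is never formed explicitly; only Hessian-vector products f(v) = Ĥ_k v are needed. A minimal sketch (assuming such an `hvp` callable is available):

```python
# Sketch: conjugate gradient for x ~= H^-1 g using only Hessian-vector products hvp(v) = H v.
import numpy as np

def conjugate_gradient(hvp, g, n_iters=10, tol=1e-10):
    x = np.zeros_like(g)
    r = g.copy()                  # residual g - H x, with x = 0
    p = g.copy()
    rr = r @ r
    for _ in range(n_iters):
        Hp = hvp(p)
        alpha = rr / (p @ Hp + 1e-12)
        x += alpha * p
        r -= alpha * Hp
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

# Full TRPO step before the line search would then be
#   Delta_k = sqrt(2 * delta / (x @ hvp(x))) * x
```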
The same algorithm will also come with a theoretical justification: TRPO = NPG + line search + monotonic improvement theorem, developed next.
Relating objectives of two policies

Policy objective:    J(π_θ) = 𝔼_{τ∼π_θ}[ Σ_{t=0}^∞ γ^t r_t ]

The objective of a new policy can be written in terms of the old one:

J(π_θ′) − J(π_θ) = 𝔼_{τ∼π_θ′}[ Σ_{t=0}^∞ γ^t A^{π_θ}(s_t, a_t) ]

or, more succinctly,

J(π′) − J(π) = 𝔼_{τ∼π′}[ Σ_{t=0}^∞ γ^t A^π(s_t, a_t) ]
Proof of the Relative Policy Performance Identity

𝔼_{τ∼π′}[ Σ_{t=0}^∞ γ^t A^π(s_t, a_t) ]
  = 𝔼_{τ∼π′}[ Σ_{t=0}^∞ γ^t ( R(s_t, a_t, s_{t+1}) + γ V^π(s_{t+1}) − V^π(s_t) ) ]
  = J(π′) + 𝔼_{τ∼π′}[ Σ_{t=0}^∞ γ^{t+1} V^π(s_{t+1}) − Σ_{t=0}^∞ γ^t V^π(s_t) ]
  = J(π′) + 𝔼_{τ∼π′}[ Σ_{t=1}^∞ γ^t V^π(s_t) − Σ_{t=0}^∞ γ^t V^π(s_t) ]
  = J(π′) − 𝔼_{τ∼π′}[ V^π(s_0) ]
  = J(π′) − J(π)

The last step uses the fact that the initial state distribution is the same for both policies.

(Approximately Optimal Approximate Reinforcement Learning, Kakade and Langford, 2002.)
Relating objectives of two policies: a useful approximation

Discounted state visitation distribution:

d^π(s) = (1 − γ) Σ_{t=0}^∞ γ^t P(s_t = s | π)

With this definition,

J(π′) − J(π) = 𝔼_{τ∼π′}[ Σ_{t=0}^∞ γ^t A^π(s_t, a_t) ]
             = (1/(1 − γ)) 𝔼_{s∼d^{π′}, a∼π′}[ A^π(s, a) ]
             = (1/(1 − γ)) 𝔼_{s∼d^{π′}, a∼π}[ (π′(a | s) / π(a | s)) A^π(s, a) ]

But how are we supposed to sample states from the policy π′ that we are still trying to optimize for? What if we just said d^{π′} ≈ d^π and didn't worry about it, i.e., used the previous policy to sample the states?

J(π′) − J(π) ≈ (1/(1 − γ)) 𝔼_{s∼d^π, a∼π}[ (π′(a | s) / π(a | s)) A^π(s, a) ] =: ℒ_π(π′)

It turns out this approximation is pretty good when π′ and π are close. But why, and how close do they have to be?

Relative policy performance bound (we can bound the approximation error):

J(π′) ≥ J(π) + ℒ_π(π′) − C sqrt( 𝔼_{s∼d^π}[ D_KL(π′ ∥ π)[s] ] )

If the policies are close in KL divergence, the approximation is good!

(Constrained Policy Optimization, Achiam et al., 2017.)
Relating objectives of two policies: the surrogate objective

ℒ_π(π′) = (1/(1 − γ)) 𝔼_{s∼d^π, a∼π}[ (π′(a | s) / π(a | s)) A^π(s, a) ]
        = 𝔼_{τ∼π}[ Σ_{t=0}^∞ γ^t (π′(a_t | s_t) / π(a_t | s_t)) A^π(s_t, a_t) ]

This is something we can optimize using trajectories from the old policy!

Compare to importance sampling over whole trajectories:

J(θ) = 𝔼_{τ∼π_θold}[ Σ_{t=1}^T ( Π_{t′=1}^t π_θ(a_{t′} | s_{t′}) / π_θold(a_{t′} | s_{t′}) ) Â_t ]

Now we do not have the product, so the gradient will have much smaller variance. (Yes, but only because we have approximated; that is exactly why!) What is the gradient?

∇_θ ℒ_{θ_k}(θ)|_{θ=θ_k} = 𝔼_{τ∼π_θk}[ Σ_{t=0}^∞ γ^t ( ∇_θ π_θ(a_t | s_t)|_{θ=θ_k} / π_θk(a_t | s_t) ) A^{π_θk}(s_t, a_t) ]
                        = 𝔼_{τ∼π_θk}[ Σ_{t=0}^∞ γ^t ∇_θ log π_θ(a_t | s_t)|_{θ=θ_k} A^{π_θk}(s_t, a_t) ]

i.e., at θ = θ_k the gradient of the surrogate is exactly the policy gradient.
Monotonic Improvement Theorem

| J(π′) − ( J(π) + ℒ_π(π′) ) |  ≤  C sqrt( 𝔼_{s∼d^π}[ KL(π′ ∥ π)[s] ] )

⇒  J(π′) − J(π)  ≥  ℒ_π(π′) − C sqrt( 𝔼_{s∼d^π}[ KL(π′ ∥ π)[s] ] )

Given policy π, we want to optimize over policies π′ to maximize J(π′) − J(π).

• If we maximize the right-hand side, we maximize a lower bound on the left-hand side.
• We know how to maximize the right-hand side: both quantities can be estimated for any π′ using samples from π.
• But will we actually obtain a better policy π′? Maximizing the lower bound is not enough by itself; we also need the maximized bound to be greater than or equal to zero.
Monotonic Improvement Theory

Proof of the improvement guarantee. Suppose π_{k+1} and π_k are related by

π_{k+1} = arg max_{π′}  ℒ_{π_k}(π′) − C sqrt( 𝔼_{s∼d^{π_k}}[ D_KL(π′ ∥ π_k)[s] ] ).        (3)

π_k is a feasible point, and the objective at π_k is equal to 0:

ℒ_{π_k}(π_k) ∝ 𝔼_{s∼d^{π_k}, a∼π_k}[ A^{π_k}(s, a) ] = 0,    D_KL(π_k ∥ π_k)[s] = 0

⇒ the optimal value is ≥ 0  ⇒  by the performance bound, J(π_{k+1}) − J(π_k) ≥ 0.

Approximate
Problem:
Monotonic
C provided by theory is quite high when is near 1
Improvement
=) steps from (3) are too small.
Solution:
• Theory is very conservative (high value of C) and we will use KL distance of pi’ and
piInstead
as a constraint (trust use
of KL penalty, region)
KL as opposed(called
constraint to a penalty:
trust region).
Can control worst-case error through constraint upper limit!

0
⇡k+1 = arg max0
L ⇡k (⇡ )

⇥ 0 ⇤ (4)
s.t. E⇡ DKL (⇡ ||⇡k )[s] 
s⇠d k

Joshua Achiam (UC Berkeley, OpenAI) Advanced Policy Gradient Methods October 11, 2017 22 / 41
Trust Region Policy Optimization (recap)

Putting it all together: TRPO = NPG + line search + monotonic improvement theorem (Algorithm 3 above).
Proximal Policy Optimization

Can we achieve similar performance without second-order information (no Fisher matrix)?

Proximal Policy Optimization (PPO) is a family of methods that approximately enforce the KL constraint without computing natural gradients. Two variants:

Adaptive KL penalty. The policy update solves an unconstrained optimization problem

θ_{k+1} = arg max_θ  L_{θ_k}(θ) − β_k D̄_KL(θ ∥ θ_k),

where the penalty coefficient β_k changes between iterations to approximately enforce the KL-divergence constraint.

Clipped objective. New objective function: let r_t(θ) = π_θ(a_t | s_t) / π_θk(a_t | s_t). Then

L^CLIP_{θ_k}(θ) = 𝔼_{τ∼π_k}[ Σ_{t=0}^T min( r_t(θ) Â_t^{π_k}, clip(r_t(θ), 1 − ε, 1 + ε) Â_t^{π_k} ) ],

where ε is a hyperparameter (e.g., ε = 0.2). The policy update is θ_{k+1} = arg max_θ L^CLIP_{θ_k}(θ).
PPO: Adaptive KL Penalty

Algorithm 4: PPO with Adaptive KL Penalty
  Input: initial policy parameters θ_0, initial KL penalty β_0, target KL-divergence δ
  for k = 0, 1, 2, ... do
    Collect a set of partial trajectories D_k on policy π_k = π(θ_k)
    Estimate advantages Â_t^{π_k} using any advantage estimation algorithm
    Compute the policy update
      θ_{k+1} = arg max_θ  L_{θ_k}(θ) − β_k D̄_KL(θ ∥ θ_k)
    by taking K steps of minibatch SGD (via Adam)
    if D̄_KL(θ_{k+1} ∥ θ_k) ≥ 1.5 δ then
      β_{k+1} = 2 β_k
    else if D̄_KL(θ_{k+1} ∥ θ_k) ≤ δ / 1.5 then
      β_{k+1} = β_k / 2
    end if
  end for

We do not use the (expensive) second-order approximation of the KL; standard first-order gradient descent is enough.

• The initial KL penalty is not that important; it adapts quickly.
• Some iterations may violate the KL constraint, but most don't.
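The β adaptation is just a couple of comparisons per iteration; a sketch (the parameter names are illustrative):

```python
# Sketch of the penalty-coefficient adaptation in Algorithm 4: after each policy
# update, grow beta if the measured KL overshot the target, shrink it if it undershot.
# `measured_kl` would be the mean KL(theta_{k+1} || theta_k) estimated on the batch.
def adapt_kl_penalty(beta, measured_kl, target_kl):
    if measured_kl >= 1.5 * target_kl:
        beta *= 2.0      # policy moved too far: penalize the KL more next iteration
    elif measured_kl <= target_kl / 1.5:
        beta /= 2.0      # policy barely moved: relax the penalty
    return beta
```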
PPO: Clipped Objective

Recall the surrogate objective

L^IS(θ) = Ê_t[ (π_θ(a_t | s_t) / π_θold(a_t | s_t)) Â_t ] = Ê_t[ r_t(θ) Â_t ].        (1)

Form a lower bound via clipped importance ratios:

L^CLIP(θ) = Ê_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ].        (2)

This forms a pessimistic bound on the objective and can be optimized using SGD.
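A sketch of L^CLIP as it is usually implemented (PyTorch assumed; written as a loss to minimize), given the log-probabilities under the current policy, the stored log-probabilities from the data-collecting policy, and advantage estimates. Each PPO iteration then takes several minibatch Adam steps on this loss.

```python
# Sketch: the PPO clipped surrogate as a loss to minimize (hence the minus sign).
import torch

def ppo_clip_loss(logp, logp_old, adv, clip_eps=0.2):
    ratio = (logp - logp_old).exp()                       # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * adv, clipped * adv).mean()  # pessimistic (lower) bound
```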
PPO: Clipped Objective

But how does clipping keep the policy close? By making the objective as pessimistic as possible about performance far away from θ_k.

Figure: Various objectives as a function of the interpolation factor α between θ_{k+1} and θ_k after one update of PPO-Clip (Schulman, Wolski, Dhariwal, Radford, Klimov, 2017).
PPO with Clipped Objective

Algorithm 5: PPO with Clipped Objective
  Input: initial policy parameters θ_0, clipping threshold ε
  for k = 0, 1, 2, ... do
    Collect a set of partial trajectories D_k on policy π_k = π(θ_k)
    Estimate advantages Â_t^{π_k} using any advantage estimation algorithm
    Compute the policy update θ_{k+1} = arg max_θ L^CLIP_{θ_k}(θ) by taking K steps of minibatch SGD (via Adam), where
      L^CLIP_{θ_k}(θ) = 𝔼_{τ∼π_k}[ Σ_{t=0}^T min( r_t(θ) Â_t^{π_k}, clip(r_t(θ), 1 − ε, 1 + ε) Â_t^{π_k} ) ]
  end for

• Clipping removes the incentive for the policy to move far away from θ_k.
• Clipping seems to work at least as well as PPO with the KL penalty, but it is simpler to implement.
Empirical Performance of PPO

Figure: Performance comparison between PPO with the clipped objective and various other deep RL methods on a slate of MuJoCo tasks (Schulman, Wolski, Dhariwal, Radford, Klimov, 2017).
Summary

• Gradient descent in parameter space vs. in distribution space.
• Natural gradients: the parameter-space step is chosen by keeping track of how much the KL divergence of the policy changes from iteration to iteration.
• Natural policy gradients (and TRPO).
• The clipped objective (PPO) works well and is simpler.

Further Reading

• S. Kakade. "A Natural Policy Gradient." NIPS 2001.
• S. Kakade and J. Langford. "Approximately Optimal Approximate Reinforcement Learning." ICML 2002.
• J. Peters and S. Schaal. "Natural Actor-Critic." Neurocomputing, 2008.
• J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. "Trust Region Policy Optimization." ICML 2015.
• Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. "Benchmarking Deep Reinforcement Learning for Continuous Control." ICML 2016.
• J. Martens and I. Sutskever. "Training Deep and Recurrent Networks with Hessian-Free Optimization." Springer, 2012.
• Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, et al. "Sample Efficient Actor-Critic with Experience Replay." 2016.
• Y. Wu, E. Mansimov, S. Liao, R. Grosse, and J. Ba. "Scalable Trust-Region Method for Deep Reinforcement Learning Using Kronecker-Factored Approximation." 2017.
• J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. "Proximal Policy Optimization Algorithms." 2017.
• J. Achiam, D. Held, A. Tamar, and P. Abbeel. "Constrained Policy Optimization." 2017.
• blog.openai.com: recent posts on baselines releases.
