CMU 10703
Katerina Fragkiadaki
Part of the slides adapted from John Schulman and Joshua Achiam
Stochastic policies
Continuous actions: usually a multivariate Gaussian,
a ∼ N(μθ(s), σθ(s)²)

Discrete actions: almost always categorical,
a ∼ Cat(pθ(s))
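A minimal sketch (not from the slides; PyTorch, with hypothetical network heads mu_net, log_std, logits_net) of how actions are sampled and scored under these two policy classes:

```python
import torch
from torch.distributions import Normal, Categorical

# Continuous actions: diagonal Gaussian policy head.
def sample_gaussian_action(state, mu_net, log_std):
    mu = mu_net(state)                    # mean mu_theta(s)
    std = log_std.exp()                   # stddev sigma_theta(s)
    dist = Normal(mu, std)
    action = dist.sample()
    return action, dist.log_prob(action).sum(-1)   # log pi_theta(a | s)

# Discrete actions: categorical policy head over action logits.
def sample_categorical_action(state, logits_net):
    dist = Categorical(logits=logits_net(state))    # Cat(p_theta(s))
    action = dist.sample()
    return action, dist.log_prob(action)
```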
What Loss to Optimize?

Monte Carlo policy gradients (REINFORCE), gradient direction:

ĝ = Ê_t [ ∇θ log πθ(a_t | s_t) Â_t ]
What is the underlying objective function?

Policy gradients:

ĝ ≈ (1/N) Σ_{i=1}^N Σ_{t=1}^T ∇θ log πθ(a_t^(i) | s_t^(i)) A(s_t^(i), a_t^(i)),   τ_i ∼ πθ
Is this our objective? Not quite: we cannot both maximize over a variable and sample from it. We also cannot optimize too far, since our advantage estimates come from samples of πθold. However, this constraint of "do not move too far from θold" does not appear anywhere in the objective.
Compare to supervised learning and maximum likelihood estimation (MLE). Imagine we have access to expert actions; then the loss function we want to optimize is:

J_SL(θ) = (1/N) Σ_{i=1}^N Σ_{t=1}^T log πθ(ã_t^(i) | s_t^(i)) + regularization,   τ_i ∼ π*,

which maximizes the probability of the expert actions in the training set.
Is this our SL objective? Strictly speaking we care about test error, but that is a longer story; the short answer is yes, this objective is good enough to optimize as long as we regularize.
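The parallel between the two objectives is easiest to see in code. Below is a minimal sketch (PyTorch; log_probs, advantages, and expert_log_probs are illustrative tensor names, not from the slides) contrasting the policy-gradient surrogate with the supervised MLE loss:

```python
import torch

def policy_gradient_loss(log_probs, advantages):
    # Surrogate whose gradient is the Monte Carlo policy gradient:
    # E[ grad log pi_theta(a|s) * A_hat ]. Advantages are treated as constants.
    return -(log_probs * advantages.detach()).mean()

def behavior_cloning_loss(expert_log_probs):
    # Supervised MLE on expert actions: every expert action is weighted by 1.
    return -expert_log_probs.mean()
```

The two losses share the log-likelihood form; the policy gradient simply reweights each log-probability by the estimated advantage.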
What Loss to Optimize?

Gradient descent in parameter space does not take into account the resulting distance in the (output) policy space between πθold(s) and πθnew(s).

[Figure: Gaussian policy before the update, N(μθold(s), σθold(s)), and after the update, N(μθnew(s), σθnew(s)).]
Hard to choose stepsizes

Figure: Small changes in the policy parameters can unexpectedly lead to big changes in the policy.
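A toy illustration of this effect (not from the slides; a logistic policy over two actions with a single scalar parameter theta):

```python
import math

def pi_theta(theta):
    # Probability of action a=1 under a logistic policy for a fixed state.
    return 1.0 / (1.0 + math.exp(-theta))

# A step of +2 in parameter space near saturation barely moves the policy ...
print(pi_theta(5.0), pi_theta(7.0))   # ~0.993 -> ~0.999
# ... while the same step near theta = -1 changes the policy drastically.
print(pi_theta(-1.0), pi_theta(1.0))  # ~0.269 -> ~0.731
```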
Notation
We will use the following to denote values of parameters and corresponding policies before
and after an update:
θold → θnew
πold → πnew
θ → θ′
π → π′
Gradient Descent in Parameter Space

The stepsize in gradient descent results from solving the following optimization problem, e.g., using line search:

d* = arg max_{‖d‖ ≤ ε} J(θold + d)

It is hard to predict the effect of a parameter-space step on the parameterized distribution, and therefore hard to pick the threshold ε.
First-order Taylor expansion for the loss and second-order for the KL:

d* ≈ arg max_d  J(θold) + ∇θ J(θ)|θ=θold · d − λ ( (1/2) d^⊤ ∇²θ D_KL[πθold ‖ πθ]|θ=θold d ) + λε
Taylor expansion of KL

D_KL(pθold ‖ pθ) ≈ D_KL(pθold ‖ pθold) + d^⊤ ∇θ D_KL(pθold ‖ pθ)|θ=θold + (1/2) d^⊤ ∇²θ D_KL(pθold ‖ pθ)|θ=θold d

where D_KL(pθold ‖ pθ) = 𝔼_{x∼pθold} [ log ( Pθold(x) / Pθ(x) ) ]. The zeroth-order term is zero, and so is the first-order term:

∇θ D_KL(pθold ‖ pθ)|θ=θold = −𝔼_{x∼pθold} [ (1/Pθold(x)) ∇θ Pθ(x)|θ=θold ]
  = −∫_x Pθold(x) (1/Pθold(x)) ∇θ Pθ(x)|θ=θold
  = −∫_x ∇θ Pθ(x)|θ=θold
  = −∇θ ∫_x Pθ(x) |θ=θold = 0.
Taylor expansion of KL

D_KL(pθold ‖ pθ) ≈ D_KL(pθold ‖ pθold) + d^⊤ ∇θ D_KL(pθold ‖ pθ)|θ=θold + (1/2) d^⊤ ∇²θ D_KL(pθold ‖ pθ)|θ=θold d

For the second-order term, again with D_KL(pθold ‖ pθ) = 𝔼_{x∼pθold} [ log ( Pθold(x) / Pθ(x) ) ]:

∇²θ D_KL(pθold ‖ pθ)|θ=θold = −𝔼_{x∼pθold} [ ∇θ ( ∇θ Pθ(x) / Pθ(x) )^⊤ |θ=θold ]

Expanding the derivative and using ∫_x ∇²θ Pθ(x) = 0 leaves 𝔼_{x∼pθold}[ ∇θ log Pθ(x) ∇θ log Pθ(x)^⊤ ]|θ=θold, the Fisher information matrix F(θold).
Fisher Information Matrix

Exactly equivalent to the Hessian of the KL divergence! Since the zeroth- and first-order terms vanish,

D_KL(pθold ‖ pθ) ≈ (1/2) d^⊤ ∇²θ D_KL(pθold ‖ pθ)|θ=θold d
  = (1/2) d^⊤ F(θold) d
  = (1/2) (θ − θold)^⊤ F(θold) (θ − θold)
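As a quick sanity check (not from the slides): for a one-dimensional Gaussian with fixed standard deviation σ and mean parameter θ, both the Fisher information and the Hessian of the KL come out to 1/σ². The finite-difference sketch below verifies this numerically:

```python
import numpy as np

sigma = 0.5
theta_old = 1.3

def kl(theta):
    # KL( N(theta_old, sigma^2) || N(theta, sigma^2) ) = (theta - theta_old)^2 / (2 sigma^2)
    return (theta - theta_old) ** 2 / (2 * sigma ** 2)

# Hessian of the KL at theta = theta_old via central finite differences.
h = 1e-4
hessian = (kl(theta_old + h) - 2 * kl(theta_old) + kl(theta_old - h)) / h ** 2

# Fisher information of the mean parameter: E[(d/dtheta log p(x))^2] = 1 / sigma^2.
fisher = 1.0 / sigma ** 2

print(hessian, fisher)   # both ~4.0 for sigma = 0.5
```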
First-order Taylor expansion for the loss and second-order for the KL:

d* ≈ arg max_d  J(θold) + ∇θ J(θ)|θ=θold · d − λ ( (1/2) d^⊤ ∇²θ D_KL[πθold ‖ πθ]|θ=θold d ) + λε
   = arg max_d  ∇θ J(θ)|θ=θold · d − (λ/2) d^⊤ F(θold) d
   = arg min_d  −∇θ J(θ)|θ=θold · d + (λ/2) d^⊤ F(θold) d
Natural Gradient Descent

Setting the gradient with respect to d to zero:

0 = ∂/∂d ( −∇θ J(θ)|θ=θold · d + (λ/2) d^⊤ F(θold) d )
  = −∇θ J(θ)|θ=θold + λ F(θold) d

d = (1/λ) F⁻¹(θold) ∇θ J(θ)|θ=θold
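A minimal sketch of one natural-gradient step (not from the slides; plain NumPy, assuming a softmax policy over a few discrete actions for a single state, with illustrative variable names). It estimates F(θold) from score-function outer products and then solves F d = ∇J:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4
theta = rng.normal(size=n_actions)        # logits of a softmax policy for one state

def probs(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def score(theta, a):
    # grad_theta log pi_theta(a) for a softmax policy: one_hot(a) - probs(theta)
    g = -probs(theta)
    g[a] += 1.0
    return g

# Exact expectations over the small action set (stand-ins for Monte Carlo estimates).
advantages = rng.normal(size=n_actions)   # pretend per-action advantage estimates
p = probs(theta)
grad_J = sum(p[a] * advantages[a] * score(theta, a) for a in range(n_actions))
F = sum(p[a] * np.outer(score(theta, a), score(theta, a)) for a in range(n_actions))

# Natural-gradient direction: solve F d = grad_J (damping keeps F invertible).
d = np.linalg.solve(F + 1e-3 * np.eye(n_actions), grad_J)
theta_new = theta + 0.1 * d
```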
Off-policy learning with Importance Sampling

J(θ) = 𝔼_{τ∼πθ} [R(τ)]
     = Σ_τ πθ(τ) R(τ)
     = Σ_τ πθold(τ) (πθ(τ) / πθold(τ)) R(τ)
     = 𝔼_{τ∼πθold} [ (πθ(τ) / πθold(τ)) R(τ) ]

∇θ J(θ) = 𝔼_{τ∼πθold} [ (∇θ πθ(τ) / πθold(τ)) R(τ) ]

∇θ J(θ)|θ=θold = 𝔼_{τ∼πθold} [ ∇θ log πθ(τ)|θ=θold R(τ) ]   ← the gradient evaluated at θold is unchanged
Off-policy learning with Importance Sampling

The trajectory ratio decomposes into a product of per-step action-probability ratios:

πθ(τ) / πθold(τ) = Π_{t=1}^T πθ(a_t | s_t) / πθold(a_t | s_t)

so the importance-sampled objective can be written per timestep as

J(θ) = 𝔼_{τ∼πθold} [ Σ_{t=1}^T ( Π_{t'=1}^t πθ(a_{t'} | s_{t'}) / πθold(a_{t'} | s_{t'}) ) Â_t ]

Now we can use data from the old policy, but the variance has increased by a lot: those products of ratios can explode or vanish!
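A tiny simulation (not from the slides; NumPy, with made-up per-step ratios) illustrates how the product of per-step importance weights spreads out over a long horizon:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100                      # horizon
n_traj = 10_000              # number of simulated trajectories

# Pretend each per-step ratio pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t) is
# lognormal with mean 1 (a mild per-step mismatch between the two policies).
ratios = rng.lognormal(mean=-0.02, sigma=0.2, size=(n_traj, T))
traj_weights = ratios.prod(axis=1)    # product over the whole trajectory

print(traj_weights.mean())            # close to 1 in expectation ...
print(traj_weights.std())             # ... but with a huge spread
print(np.quantile(traj_weights, [0.5, 0.99]))   # most weights tiny, a few enormous
```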
Natural policy gradient, trust-region form:

maximize_θ  g^⊤ (θ − θold)    s.t.  (1/2) (θ − θold)^⊤ F (θ − θold) ≤ δ

where  g = ∂/∂θ L_{πθold}(πθ)|θ=θold,   F = ∂²/∂θ² D_KL_{πθold}(πθ)|θ=θold
Trust Region Policy Optimization

Trust Region Policy Optimization is implemented as TNPG plus a line search. Putting it all together: TRPO = NPG + line search (+ a monotonic improvement theorem, derived next):

θ_{k+1} = θ_k + α^j Δ_k

where Δ_k is the natural-gradient step and α^j (α ∈ (0, 1), smallest j) is the largest backtracking step that satisfies the KL constraint and improves the surrogate objective.
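A minimal sketch of that backtracking loop (not the full TRPO algorithm; surrogate_improvement and mean_kl are hypothetical helpers that evaluate candidate parameters on the collected batch):

```python
def trpo_line_search(theta_k, delta_k, surrogate_improvement, mean_kl,
                     kl_limit=0.01, alpha=0.8, max_backtracks=10):
    """Backtracking line search: shrink the natural-gradient step until the
    KL constraint holds and the surrogate objective improves."""
    for j in range(max_backtracks):
        theta_candidate = theta_k + (alpha ** j) * delta_k
        if mean_kl(theta_candidate) <= kl_limit and surrogate_improvement(theta_candidate) > 0:
            return theta_candidate        # accept theta_{k+1} = theta_k + alpha^j * Delta_k
    return theta_k                        # all candidates rejected: keep the old parameters
```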
Relating objectives of two policies

Policy objective:

J(πθ) = 𝔼_{τ∼πθ} [ Σ_{t=0}^∞ γ^t r_t ]

Relative performance of a new policy π′ with respect to π:

𝔼_{τ∼π′} [ Σ_{t=0}^∞ γ^t A^π(s_t, a_t) ] = J(π′) − 𝔼_{τ∼π′} [ V^π(s_0) ] = J(π′) − J(π)

Approximating the state distribution by that of the old policy gives the surrogate

ℒ_π(π′) = 𝔼_{τ∼π} [ Σ_{t=0}^∞ γ^t (π′(a_t | s_t) / π(a_t | s_t)) A^π(s_t, a_t) ]

This is something we can optimize using trajectories from the old policy! Now we do not have the product of per-step ratios, so the gradient has much smaller variance (yes, but only because we have approximated; that is why). What is the gradient?

∇θ ℒ_{πθk}(πθ)|θ=θk = 𝔼_{τ∼πθk} [ Σ_{t=0}^∞ γ^t ∇θ log πθ(a_t | s_t)|θ=θk A^{πθk}(s_t, a_t) ]
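A minimal PyTorch-style sketch of this surrogate (illustrative tensor names; logp_old and advantages come from the θ_k rollouts and are treated as constants):

```python
import torch

def surrogate_loss(logp, logp_old, advantages):
    # L(theta) = E[ (pi_theta(a|s) / pi_theta_k(a|s)) * A^{pi_theta_k}(s, a) ]
    # A single per-step ratio, instead of a product over the whole trajectory.
    ratio = torch.exp(logp - logp_old.detach())
    return -(ratio * advantages.detach()).mean()   # negated: minimize with SGD/Adam
```

At θ = θ_k the ratio equals 1, so the gradient of this loss coincides with the policy gradient above.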
Monotonic Improvement Theorem
| J(π′) − (J(π) + ℒ_π(π′)) | ≤ C 𝔼_{s∼d^π} [ D_KL(π′ ‖ π)[s] ]

• But will I actually get a better policy π′? Knowing that the surrogate is maximized is not enough; we need the improvement J(π′) − J(π) to be greater than or equal to zero.
Monotonic Improvement Theory

The bound suggests updating by maximizing a KL-penalized surrogate:

π_{k+1} = arg max_{π′}  ℒ_{πk}(π′) − C 𝔼_{s∼d^{πk}} [ D_KL(π′ ‖ πk)[s] ].    (3)
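One step the slides leave implicit (standard in the TRPO analysis, added here for completeness): update (3) can never decrease performance, because π′ = πk is always an admissible candidate.

```latex
% pi' = pi_k achieves value 0 in (3), since L_{pi_k}(pi_k) = 0 and the KL term vanishes,
% so the maximizer pi_{k+1} scores at least 0. Combining with the bound above:
\[
J(\pi_{k+1}) \;\ge\; J(\pi_k) + \mathcal{L}_{\pi_k}(\pi_{k+1})
- C\,\mathbb{E}_{s\sim d^{\pi_k}}\!\left[ D_{\mathrm{KL}}(\pi_{k+1}\,\|\,\pi_k)[s] \right]
\;\ge\; J(\pi_k).
\]
```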
Approximate Monotonic Improvement

Problem: the C provided by the theory is quite high when γ is near 1, so the steps from (3) are too small.

Solution: the theory is very conservative (high value of C), so instead of a KL penalty we use the KL divergence between π′ and πk as a constraint (a trust region). We can control the worst-case error through the constraint's upper limit:

π_{k+1} = arg max_{π′}  ℒ_{πk}(π′)
          s.t.  𝔼_{s∼d^{πk}} [ D_KL(π′ ‖ πk)[s] ] ≤ δ    (4)
Trust Region Policy Optimization

Putting it all together: TRPO = NPG + line search + monotonic improvement theorem, with the update θ_{k+1} = θ_k + α^j Δ_k accepted by the backtracking line search.
Proximal Policy Optimization

Can we achieve similar performance without second-order information (no Fisher matrix)?

Proximal Policy Optimization (PPO) is a family of methods that approximately enforce the KL constraint without computing natural gradients. Two variants: an adaptive KL penalty and a clipped surrogate objective.
Adaptive KL Penalty

The policy update solves an unconstrained optimization problem:

θ_{k+1} = arg max_θ  L_{θk}(θ) − β_k D̄_KL(θ ‖ θk)

where the penalty coefficient β_k is adapted between iterations to approximately enforce the KL constraint.
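A sketch of the coefficient-adaptation heuristic from Schulman et al. (2017) (the threshold 1.5 and factor 2 are the paper's choices):

```python
def adapt_kl_penalty(beta, measured_kl, kl_target):
    # Grow the penalty when the measured KL overshoots the target,
    # shrink it when the KL undershoots.
    if measured_kl > 1.5 * kl_target:
        beta *= 2.0
    elif measured_kl < kl_target / 1.5:
        beta /= 2.0
    return beta
```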
But how does clipping (the second variant, below) keep the policy close? By making the objective as pessimistic as possible about performance far away from θk.

Figure: Various objectives as a function of the interpolation factor α between θ_{k+1} and θk after one update of PPO-Clip (Schulman, Wolski, Dhariwal, Radford, Klimov, 2017).
PPO: Clipped Objective

Proximal Policy Optimization with a clipped objective drops the KL term entirely: the per-step probability ratio is clipped to [1 − ε, 1 + ε] inside the surrogate.

Clipping prevents the policy from having an incentive to move far away from θk.

Clipping seems to work at least as well as PPO with a KL penalty, but is simpler to implement.
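For reference, a minimal sketch of the clipped surrogate from Schulman et al. (2017) (PyTorch-style, illustrative tensor names; clip_eps plays the role of ε):

```python
import torch

def ppo_clip_loss(logp, logp_old, advantages, clip_eps=0.2):
    # r_t(theta) = pi_theta(a_t | s_t) / pi_theta_k(a_t | s_t)
    ratio = torch.exp(logp - logp_old.detach())
    adv = advantages.detach()
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    # Pessimistic (elementwise minimum) objective, negated for gradient descent.
    return -torch.min(unclipped, clipped).mean()
```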
Empirical Performance of PPO

Figure: Performance comparison between PPO with the clipped objective and various other deep RL methods on a slate of MuJoCo tasks (Schulman, Wolski, Dhariwal, Radford, Klimov, 2017).
Summary

• Gradient descent in parameter space vs. distribution space
• Natural gradients: we need to keep track of how the KL changes from iteration to iteration
• Natural policy gradients
• Clipped objective works well
Further Reading

• S. Kakade. "A Natural Policy Gradient." NIPS, 2001.
• S. Kakade and J. Langford. "Approximately Optimal Approximate Reinforcement Learning." ICML, 2002.
• J. Peters and S. Schaal. "Natural Actor-Critic." Neurocomputing, 2008.
• J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. "Trust Region Policy Optimization." ICML, 2015.
• Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. "Benchmarking Deep Reinforcement Learning for Continuous Control." ICML, 2016.
• J. Martens and I. Sutskever. "Training Deep and Recurrent Networks with Hessian-Free Optimization." Springer, 2012.
• Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, et al. "Sample Efficient Actor-Critic with Experience Replay." 2016.
• Y. Wu, E. Mansimov, S. Liao, R. Grosse, and J. Ba. "Scalable Trust-Region Method for Deep Reinforcement Learning Using Kronecker-Factored Approximation." 2017.
• J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. "Proximal Policy Optimization Algorithms." 2017.
• J. Achiam, D. Held, A. Tamar, and P. Abbeel. "Constrained Policy Optimization." 2017.
• blog.openai.com: recent posts on baselines releases.