
Tutorial: Deep Reinforcement Learning

David Silver, Google DeepMind


Outline

Introduction to Deep Learning

Introduction to Reinforcement Learning

Value-Based Deep RL

Policy-Based Deep RL

Model-Based Deep RL
Reinforcement Learning in a nutshell

RL is a general-purpose framework for decision-making


- RL is for an agent with the capacity to act
- Each action influences the agent's future state
- Success is measured by a scalar reward signal
- Goal: select actions to maximise future reward

Deep Learning in a nutshell

DL is a general-purpose framework for representation learning


- Given an objective
- Learn representation that is required to achieve objective
- Directly from raw inputs
- Using minimal domain knowledge

Deep Reinforcement Learning: AI = RL + DL

We seek a single agent which can solve any human-level task


- RL defines the objective
- DL gives the mechanism
- RL + DL = general intelligence

Examples of Deep RL @DeepMind

- Play games: Atari, poker, Go, ...
- Explore worlds: 3D worlds, Labyrinth, ...
- Control physical systems: manipulate, walk, swim, ...
- Interact with users: recommend, optimise, personalise, ...

Outline

Introduction to Deep Learning

Introduction to Reinforcement Learning

Value-Based Deep RL

Policy-Based Deep RL

Model-Based Deep RL
Deep Representations

- A deep representation is a composition of many functions

  $x \rightarrow h_1 \rightarrow \dots \rightarrow h_n \rightarrow y \rightarrow l$, with weights $w_1, \dots, w_n$ at each stage

- Its gradient can be backpropagated by the chain rule

  $\frac{\partial l}{\partial h_k} = \frac{\partial l}{\partial h_{k+1}} \frac{\partial h_{k+1}}{\partial h_k}, \qquad \frac{\partial l}{\partial w_k} = \frac{\partial l}{\partial h_k} \frac{\partial h_k}{\partial w_k}$
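
A minimal NumPy sketch (not from the tutorial) of backpropagating a loss gradient through a two-stage composition by the chain rule; the shapes, layer count and squared-error loss are illustrative assumptions.

```python
import numpy as np

# Composition l(y(h(x; w1); w2)): two linear stages, squared-error loss.
rng = np.random.default_rng(0)
x = rng.normal(size=(4,))          # input
w1 = rng.normal(size=(3, 4))       # weights of first stage
w2 = rng.normal(size=(2, 3))       # weights of second stage
y_target = rng.normal(size=(2,))

# Forward pass
h = w1 @ x                         # h = w1 x
y = w2 @ h                         # y = w2 h
l = 0.5 * np.sum((y - y_target) ** 2)

# Backward pass: chain rule, propagating dl/d(.) from the loss back to the input
dl_dy = y - y_target               # dl/dy
dl_dw2 = np.outer(dl_dy, h)        # dl/dw2 = dl/dy * dy/dw2
dl_dh = w2.T @ dl_dy               # dl/dh  = dl/dy * dy/dh
dl_dw1 = np.outer(dl_dh, x)        # dl/dw1 = dl/dh * dh/dw1
dl_dx = w1.T @ dl_dh               # dl/dx  = dl/dh * dh/dx

print(l, dl_dw1.shape, dl_dw2.shape)
```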
Deep Neural Network

A deep neural network is typically composed of:


- Linear transformations

  $h_{k+1} = W h_k$

- Non-linear activation functions

  $h_{k+2} = f(h_{k+1})$

- A loss function on the output, e.g.
  - Mean-squared error $l = \|y^* - y\|^2$
  - Log likelihood $l = \log \mathbb{P}[y^*]$
Training Neural Networks by Stochastic Gradient Descent
!"#$%"&'(%")*+,#'-'!"#$%%" (%")*+,#'.+/0+,#

!"#$%&'('%$&#()&*+$,*$#&&-&$$$$."'%"$
I '*$%-/0'*,('-*$.'("$("#$1)*%('-*$
Sample gradient of expected loss L(w) = E [l]
,22&-3'/,(-&
%,*$0#$)+#4$(-$
@l @l @L(w)
%&#,(#$,*$#&&-&$1)*%('-*$$$$$$$$$$$$$
E =
@w @w @w
I Adjust w down the sampled gradient
!"#$2,&(',5$4'11#&#*(',5$-1$("'+$#&&-&$
1)*%('-*$$$$$$$$$$$$$$$$6$("#$7&,4'#*($
@l
w/
%,*$*-.$0#$)+#4$(-$)24,(#$("#$
@w
'*(#&*,5$8,&',05#+$'*$("#$1)*%('-*$
,22&-3'/,(-& 9,*4$%&'('%:;$$$$$$
<&,4'#*($4#+%#*($=>?
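
A minimal sketch (not from the tutorial) of stochastic gradient descent on a sampled loss; the one-layer regressor, learning rate and data are illustrative assumptions.

```python
import numpy as np

# SGD on a one-layer regressor with squared-error loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))                    # inputs
true_w = np.array([1.0, -2.0, 0.5, 3.0])
Y = X @ true_w + 0.1 * rng.normal(size=256)      # noisy targets

w = np.zeros(4)
lr = 0.05
for step in range(2000):
    i = rng.integers(len(X))                     # sample one example (stochastic)
    x, y = X[i], Y[i]
    pred = x @ w
    grad = (pred - y) * x                        # d/dw of 0.5 * (pred - y)^2
    w -= lr * grad                               # step down the sampled gradient

print(np.round(w, 2))                            # roughly recovers true_w
```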
Weight Sharing
- Recurrent neural network shares weights between time-steps

  [Diagram: ... → h_t → h_{t+1} → ..., each step taking input x_t and emitting output y_t, with the same weights w applied at every time-step]

- Convolutional neural network shares weights between local regions

  [Diagram: hidden units h_1, h_2 computed from overlapping local patches of the input x, reusing the same filter weights w_1, w_2]
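
A minimal sketch (not from the tutorial) of weight sharing in the convolutional case: the same filter is reused at every position of the input. The 1-D signal and filter values are illustrative assumptions.

```python
import numpy as np

# The same filter w is applied to every local region of the input
# (convolution), instead of learning separate weights per region.
x = np.array([1.0, 2.0, -1.0, 0.5, 3.0, -2.0])   # 1-D input signal
w = np.array([0.5, -0.25, 1.0])                  # shared filter weights

h = np.array([x[i:i + len(w)] @ w                # same w reused at each position
              for i in range(len(x) - len(w) + 1)])
print(h)
```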
Outline

Introduction to Deep Learning

Introduction to Reinforcement Learning

Value-Based Deep RL

Policy-Based Deep RL

Model-Based Deep RL
Many Faces of Reinforcement Learning

[Diagram: reinforcement learning at the intersection of many fields: machine learning (computer science), optimal control (engineering), reward systems (neuroscience), classical/operant conditioning (psychology), operations research (mathematics), and rationality/game theory (economics)]
Agent and Environment

[Diagram: agent-environment loop; the agent emits action a_t, the environment returns observation o_t and scalar reward r_t]

- At each step t the agent:
  - Executes action $a_t$
  - Receives observation $o_t$
  - Receives scalar reward $r_t$
- The environment:
  - Receives action $a_t$
  - Emits observation $o_{t+1}$
  - Emits scalar reward $r_{t+1}$
State

- Experience is a sequence of observations, actions, rewards

  $o_1, r_1, a_1, \dots, a_{t-1}, o_t, r_t$

- The state is a summary of experience

  $s_t = f(o_1, r_1, a_1, \dots, a_{t-1}, o_t, r_t)$

- In a fully observed environment

  $s_t = f(o_t)$
Major Components of an RL Agent

- An RL agent may include one or more of these components:
  - Policy: the agent's behaviour function
  - Value function: how good is each state and/or action
  - Model: the agent's representation of the environment
Policy

- A policy is the agent's behaviour
- It is a map from state to action:
  - Deterministic policy: $a = \pi(s)$
  - Stochastic policy: $\pi(a|s) = \mathbb{P}[a|s]$
Value Function

- A value function is a prediction of future reward
  - "How much reward will I get from action a in state s?"
- Q-value function gives expected total reward
  - from state s and action a
  - under policy $\pi$
  - with discount factor $\gamma$

  $Q^\pi(s, a) = \mathbb{E}\left[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \mid s, a\right]$

- Value functions decompose into a Bellman equation

  $Q^\pi(s, a) = \mathbb{E}_{s', a'}\left[r + \gamma Q^\pi(s', a') \mid s, a\right]$
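
A minimal NumPy sketch (not from the tutorial) of evaluating $Q^\pi$ by iterating the Bellman expectation equation on a tiny MDP; the transition probabilities, rewards and policy below are made-up illustrative numbers.

```python
import numpy as np

# Evaluate Q^pi on a 2-state, 2-action MDP by the fixed-point iteration
#   Q(s,a) <- E[ r + gamma * Q(s', a') ]   with a' ~ pi(.|s').
gamma = 0.9
P = np.zeros((2, 2, 2))            # P[s, a, s'] transition probabilities
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.0, 1.0]
R = np.array([[1.0, 0.0],          # R[s, a] expected immediate reward
              [2.0, -1.0]])
pi = np.array([[0.5, 0.5],         # pi[s, a] stochastic policy
               [0.9, 0.1]])

Q = np.zeros((2, 2))
for _ in range(200):               # converges for gamma < 1
    V = (pi * Q).sum(axis=1)       # V(s') = sum_a' pi(a'|s') Q(s', a')
    Q = R + gamma * P @ V          # Bellman expectation backup

print(np.round(Q, 3))
```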
Optimal Value Functions
- An optimal value function is the maximum achievable value

  $Q^*(s, a) = \max_\pi Q^\pi(s, a) = Q^{\pi^*}(s, a)$

- Once we have $Q^*$ we can act optimally,

  $\pi^*(s) = \operatorname*{argmax}_a Q^*(s, a)$

- Optimal value maximises over all decisions. Informally:

  $Q^*(s, a) = r_{t+1} + \gamma \max_{a_{t+1}} r_{t+2} + \gamma^2 \max_{a_{t+2}} r_{t+3} + \dots = r_{t+1} + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1})$

- Formally, optimal values decompose into a Bellman equation

  $Q^*(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]$
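
A minimal NumPy sketch (not from the tutorial) of Q-value iteration with the Bellman optimality backup, on the same style of toy MDP; all numbers are illustrative assumptions.

```python
import numpy as np

# Repeatedly apply Q(s,a) <- E[ r + gamma * max_a' Q(s', a') ].
gamma = 0.9
P = np.zeros((2, 2, 2))                  # P[s, a, s']
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.0, 1.0]
R = np.array([[1.0, 0.0], [2.0, -1.0]])  # R[s, a]

Q = np.zeros((2, 2))
for _ in range(200):
    Q = R + gamma * P @ Q.max(axis=1)    # Bellman optimality backup

pi_star = Q.argmax(axis=1)               # act greedily once Q* is known
print(np.round(Q, 3), pi_star)
```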
Value Function Demo
Model

[Diagram: agent-environment loop with observation o_t, action a_t, reward r_t]

- Model is learnt from experience
- Acts as proxy for environment
- Planner interacts with model
  - e.g. using lookahead search
Approaches To Reinforcement Learning

Value-based RL
- Estimate the optimal value function $Q^*(s, a)$
- This is the maximum value achievable under any policy

Policy-based RL
- Search directly for the optimal policy
- This is the policy achieving maximum future reward

Model-based RL
- Build a model of the environment
- Plan (e.g. by lookahead) using model
Deep Reinforcement Learning

- Use deep neural networks to represent
  - Value function
  - Policy
  - Model
- Optimise loss function by stochastic gradient descent
Outline

Introduction to Deep Learning

Introduction to Reinforcement Learning

Value-Based Deep RL

Policy-Based Deep RL

Model-Based Deep RL
Q-Networks

Represent value function by Q-network with weights w

  $Q(s, a, w) \approx Q^*(s, a)$

[Diagram: left, a network taking (s, a) as input and outputting Q(s, a, w); right, a network taking s as input and outputting Q(s, a_1, w), ..., Q(s, a_m, w), one value per action]
Q-Learning

- Optimal Q-values should obey the Bellman equation

  $Q^*(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]$

- Treat the right-hand side $r + \gamma \max_{a'} Q(s', a', w)$ as a target
- Minimise MSE loss by stochastic gradient descent

  $l = \left(r + \gamma \max_{a'} Q(s', a', w) - Q(s, a, w)\right)^2$

- Converges to $Q^*$ using table lookup representation
- But diverges using neural networks due to:
  - Correlations between samples
  - Non-stationary targets
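
A minimal sketch (not from the tutorial) of the table-lookup case: tabular Q-learning on a toy chain environment. The environment, step counts and hyperparameters are illustrative assumptions.

```python
import numpy as np

# Tabular Q-learning on a 5-state chain; reward +1 for reaching the right end.
rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha, eps = 5, 2, 0.9, 0.1, 0.1

def step(s, a):
    """Move left (a=0) or right (a=1); episode ends at the right end."""
    s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    r = 1.0 if s2 == n_states - 1 else 0.0
    return s2, r

Q = np.zeros((n_states, n_actions))
s = 0
for _ in range(5000):
    a = rng.integers(n_actions) if rng.random() < eps else Q[s].argmax()
    s2, r = step(s, a)
    done = s2 == n_states - 1
    target = r if done else r + gamma * Q[s2].max()   # bootstrapped Bellman target
    Q[s, a] += alpha * (target - Q[s, a])             # move Q(s,a) towards the target
    s = 0 if done else s2                             # reset at the terminal state

print(np.round(Q, 2))
```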
Deep Q-Networks (DQN): Experience Replay

To remove correlations, build a data-set from the agent's own experience

  $s_1, a_1, r_2, s_2$
  $s_2, a_2, r_3, s_3 \quad \rightarrow \quad s, a, r, s'$
  $s_3, a_3, r_4, s_4$
  $\dots$
  $s_t, a_t, r_{t+1}, s_{t+1} \quad \rightarrow \quad s_t, a_t, r_{t+1}, s_{t+1}$

Sample experiences from the data-set and apply the update

  $l = \left(r + \gamma \max_{a'} Q(s', a', w^-) - Q(s, a, w)\right)^2$

To deal with non-stationarity, the target parameters $w^-$ are held fixed
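
A minimal sketch (not from the tutorial) of the two DQN ingredients above: a replay buffer and a periodically frozen copy of the parameters used to compute targets. The linear "Q-network", feature sizes and update schedule are illustrative assumptions, not the convolutional network from the paper.

```python
import random
from collections import deque

import numpy as np

gamma, n_features, n_actions = 0.99, 8, 4
w = np.zeros((n_actions, n_features))          # online parameters
w_target = w.copy()                            # frozen target parameters w^-
buffer = deque(maxlen=10_000)                  # replay memory of (s, a, r, s', done)

def q_values(s, params):
    return params @ s                          # Q(s, ., params)

def dqn_update(batch, lr=1e-3):
    for s, a, r, s2, done in batch:
        target = r if done else r + gamma * q_values(s2, w_target).max()
        td_error = target - q_values(s, w)[a]
        w[a] += lr * td_error * s              # gradient step on the MSE loss

# Usage: push transitions as they are generated, learn from random minibatches,
# and copy w into w_target every few hundred steps.
rng = np.random.default_rng(0)
for t in range(1000):
    s, s2 = rng.normal(size=n_features), rng.normal(size=n_features)
    buffer.append((s, rng.integers(n_actions), rng.normal(), s2, False))
    if len(buffer) >= 32:
        dqn_update(random.sample(buffer, 32))
    if t % 500 == 0:
        w_target = w.copy()
```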


Deep Reinforcement Learning in Atari

[Diagram: agent-environment loop for Atari; the agent observes the screen as state s_t, selects joystick action a_t, and receives the change in game score as reward r_t]
DQN in Atari

- End-to-end learning of values Q(s, a) from pixels s
- Input state s is stack of raw pixels from last 4 frames
- Output is Q(s, a) for 18 joystick/button positions
- Reward is change in score for that step

Network architecture and hyperparameters fixed across all games
DQN Results in Atari
DQN Atari Demo

DQN paper:
www.nature.com/articles/nature14236

DQN source code:
sites.google.com/a/deepmind.com/dqn/

[Image: Nature cover, 26 February 2015 (Vol. 518, No. 7540): "Self-taught AI software attains human-level performance in video games"]
Improvements since Nature DQN
- Double DQN: Remove upward bias caused by $\max_a Q(s, a, w)$
  - Current Q-network $w$ is used to select actions
  - Older Q-network $w^-$ is used to evaluate actions

  $l = \left(r + \gamma Q(s', \operatorname*{argmax}_{a'} Q(s', a', w), w^-) - Q(s, a, w)\right)^2$

- Prioritised replay: Weight experience according to surprise
  - Store experience in priority queue according to DQN error

  $\left|r + \gamma \max_{a'} Q(s', a', w^-) - Q(s, a, w)\right|$

- Duelling network: Split Q-network into two channels
  - Action-independent value function $V(s, v)$
  - Action-dependent advantage function $A(s, a, w)$

  $Q(s, a) = V(s, v) + A(s, a, w)$

Combined algorithm: 3x mean Atari score vs Nature DQN
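
A minimal sketch (not from the tutorial) of the Double DQN target above: the online parameters select the next action, the older target parameters evaluate it. The linear Q-function and shapes are illustrative assumptions.

```python
import numpy as np

gamma = 0.99

def q_values(s, params):
    return params @ s

def double_dqn_target(r, s2, done, w, w_minus):
    if done:
        return r
    a_star = int(np.argmax(q_values(s2, w)))          # select with w
    return r + gamma * q_values(s2, w_minus)[a_star]  # evaluate with w^-

# Example usage with random parameters and a random next state.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))
w_minus = rng.normal(size=(4, 8))
print(double_dqn_target(1.0, rng.normal(size=8), False, w, w_minus))
```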


Gorila (General Reinforcement Learning Architecture)

- 10x faster than Nature DQN on 38 out of 49 Atari games
- Applied to recommender systems within Google
Asynchronous Reinforcement Learning

- Exploits multithreading of standard CPU
- Execute many instances of agent in parallel
- Network parameters shared between threads
- Parallelism decorrelates data
- Viable alternative to experience replay
- Similar speedup to Gorila - on a single machine!
Outline

Introduction to Deep Learning

Introduction to Reinforcement Learning

Value-Based Deep RL

Policy-Based Deep RL

Model-Based Deep RL
Deep Policy Networks

- Represent policy by deep network with weights u

  $a = \pi(a|s, u) \quad \text{or} \quad a = \pi(s, u)$

- Define objective function as total discounted reward

  $L(u) = \mathbb{E}\left[r_1 + \gamma r_2 + \gamma^2 r_3 + \dots \mid \pi(\cdot, u)\right]$

- Optimise objective end-to-end by SGD
  - i.e. adjust policy parameters u to achieve more reward
Policy Gradients

How to make high-value actions more likely:

- The gradient of a stochastic policy $\pi(a|s, u)$ is given by

  $\frac{\partial L(u)}{\partial u} = \mathbb{E}\left[\frac{\partial \log \pi(a|s, u)}{\partial u} Q^\pi(s, a)\right]$

- The gradient of a deterministic policy $a = \pi(s)$ is given by

  $\frac{\partial L(u)}{\partial u} = \mathbb{E}\left[\frac{\partial Q^\pi(s, a)}{\partial a} \frac{\partial a}{\partial u}\right]$

  - if a is continuous and Q is differentiable
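
A minimal sketch (not from the tutorial) of a single-sample stochastic policy gradient estimate for a softmax policy over discrete actions; the stand-in return q_hat used in place of Q(s, a) and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 4, 3
u = np.zeros((n_actions, n_features))            # policy parameters

def policy(s, u):
    logits = u @ s
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_pi(s, a, u):
    p = policy(s, u)
    g = -np.outer(p, s)                          # d log pi(a|s,u)/du for softmax
    g[a] += s
    return g

# One sampled estimate of dL/du = E[ d log pi(a|s,u)/du * Q(s,a) ].
s = rng.normal(size=n_features)
a = rng.choice(n_actions, p=policy(s, u))
q_hat = 1.7                                      # e.g. a sampled return
grad_L = grad_log_pi(s, a, u) * q_hat            # single-sample estimator
u += 0.01 * grad_L                               # stochastic gradient ascent
print(np.round(u, 3))
```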
Actor-Critic Algorithm

- Estimate value function $Q(s, a, w) \approx Q^\pi(s, a)$
- Update policy parameters u by stochastic gradient ascent

  $\frac{\partial l}{\partial u} = \frac{\partial \log \pi(a|s, u)}{\partial u} Q(s, a, w)$

  or

  $\frac{\partial l}{\partial u} = \frac{\partial Q(s, a, w)}{\partial a} \frac{\partial a}{\partial u}$
Asynchronous Advantage Actor-Critic (A3C)
- Estimate state-value function

  $V(s, v) \approx \mathbb{E}\left[r_{t+1} + \gamma r_{t+2} + \dots \mid s\right]$

- Q-value estimated by an n-step sample

  $q_t = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n}, v)$

- Actor is updated towards target

  $\frac{\partial l_u}{\partial u} = \frac{\partial \log \pi(a_t|s_t, u)}{\partial u} (q_t - V(s_t, v))$

- Critic is updated to minimise MSE w.r.t. target

  $l_v = (q_t - V(s_t, v))^2$

- 4x mean Atari score vs Nature DQN
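
A minimal sketch (not from the tutorial) of computing the n-step return used as the A3C target; the rewards and bootstrap value below are illustrative numbers.

```python
import numpy as np

# q_t = r_{t+1} + gamma r_{t+2} + ... + gamma^{n-1} r_{t+n} + gamma^n V(s_{t+n}, v)
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """rewards = [r_{t+1}, ..., r_{t+n}], bootstrap_value = V(s_{t+n}, v)."""
    q = bootstrap_value
    for r in reversed(rewards):          # accumulate backwards in time
        q = r + gamma * q
    return q

rewards = [0.0, 0.0, 1.0, 0.0, 2.0]      # 5-step rollout
print(n_step_return(rewards, bootstrap_value=3.0))
```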


Deep Reinforcement Learning in Labyrinth

A3C in Labyrinth

[Diagram: recurrent A3C network unrolled over time; each step consumes observation o_t, updates the recurrent state s_t, and outputs both the policy π(a|s_t) and the value V(s_t)]

- End-to-end learning of softmax policy $\pi(a|s_t)$ from pixels
- Observations $o_t$ are raw pixels from current frame
- State $s_t = f(o_1, \dots, o_t)$ is a recurrent neural network (LSTM)
- Outputs both value $V(s)$ and softmax over actions $\pi(a|s)$
- Task is to collect apples (+1 reward) and escape (+10 reward)
A3C Labyrinth Demo

Demo:
www.youtube.com/watch?v=nMR5mjCFZCw&feature=youtu.be

Labyrinth source code (coming soon):


sites.google.com/a/deepmind.com/labyrinth/
Deep Reinforcement Learning with Continuous Actions

How can we deal with high-dimensional continuous action spaces?


- Can't easily compute $\max_a Q(s, a)$
- Actor-critic algorithms learn without max
- Q-values are differentiable w.r.t. a
  - Deterministic policy gradients exploit knowledge of $\frac{\partial Q}{\partial a}$
Deep DPG

DPG is the continuous analogue of DQN


- Experience replay: build data-set from the agent's experience
- Critic estimates value of current policy by DQN

  $l_w = \left(r + \gamma Q(s', \pi(s', u^-), w^-) - Q(s, a, w)\right)^2$

  To deal with non-stationarity, the targets $u^-, w^-$ are held fixed

- Actor updates policy in direction that improves Q

  $\frac{\partial l_u}{\partial u} = \frac{\partial Q(s, a, w)}{\partial a} \frac{\partial a}{\partial u}$

- In other words, the critic provides a loss function for the actor
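
A minimal sketch (not from the tutorial) of the deterministic policy gradient actor update: push the policy parameters along dQ/da times da/du. The linear actor a = u s and the toy critic Q(s, a) = -||a - M s||^2 (whose greedy policy is a = M s) are illustrative assumptions, not the DPG networks.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_action_dims = 6, 2
M = rng.normal(size=(n_action_dims, n_features))   # greedy policy of the toy critic
u = np.zeros((n_action_dims, n_features))          # actor parameters

def critic_grad_a(s, a):
    return -2.0 * (a - M @ s)                      # dQ/da for the toy critic

lr = 0.02
for _ in range(500):
    s = rng.normal(size=n_features)
    a = u @ s                                      # deterministic action
    dq_da = critic_grad_a(s, a)
    u += lr * np.outer(dq_da, s)                   # chain rule: dQ/du = dQ/da * da/du

print(np.round(np.abs(u - M).max(), 2))            # shrinks towards 0 as u approaches M
```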
DPG in Simulated Physics
- Physics domains are simulated in MuJoCo
- End-to-end learning of control policy from raw pixels s
- Input state s is stack of raw pixels from last 4 frames
- Two separate convnets are used for Q and $\pi$
- Policy is adjusted in direction that most improves Q

[Diagram: actor convnet computing π(s), producing action a fed into a critic convnet computing Q(s, a)]
DPG in Simulated Physics Demo

- Demo: DPG from pixels

A3C in Simulated Physics Demo

- Asynchronous RL is viable alternative to experience replay
- Train a hierarchical, recurrent locomotion controller
- Retrain controller on more challenging tasks
Fictitious Self-Play (FSP)

Can deep RL find Nash equilibria in multi-agent games?


- Q-network learns best response to opponent policies
  - By applying DQN with experience replay
  - c.f. fictitious play
- Policy network $\pi(a|s, u)$ learns an average of best responses

  $\frac{\partial l}{\partial u} = -\frac{\partial \log \pi(a|s, u)}{\partial u}$

- Actions a sample mix of policy network and best response
Neural FSP in Texas Hold'em Poker

- Heads-up limit Texas Hold'em
- NFSP with raw inputs only (no prior knowledge of Poker)
- vs SmooCT (3x medal winner 2015, handcrafted knowledge)

[Plot: win rate in mbb/h over training iterations, comparing SmooCT against NFSP's best-response, greedy-average and average strategies]
Outline

Introduction to Deep Learning

Introduction to Reinforcement Learning

Value-Based Deep RL

Policy-Based Deep RL

Model-Based Deep RL
Learning Models of the Environment

- Demo: generative model of Atari
- Challenging to plan due to compounding errors
  - Errors in the transition model compound over the trajectory
  - Planning trajectories differ from executed trajectories
  - At end of long, unusual trajectory, rewards are totally wrong
Deep Reinforcement Learning in Go

What if we have a perfect model? e.g. game rules are known

AlphaGo paper:
www.nature.com/articles/nature16961

AlphaGo resources:
deepmind.com/alphago/

[Image: Nature cover, 28 January 2016 (Vol. 529, No. 7587): "At last - a computer program that can beat a champion Go player"]


Conclusion

- General, stable and scalable RL is now possible
- Using deep networks to represent value, policy, model
- Successful in Atari, Labyrinth, Physics, Poker, Go
- Using a variety of deep RL paradigms
