
Chapter 1

Abstract
Reinforcement Learning (RL) is a general-purpose framework for designing controllers (agents) that learn a policy by trial and error. This makes it highly suitable for systems that are difficult to control with conventional control methodologies, such as driving cars, which remains a challenging problem, especially in dense city traffic. Traditionally, RL has only been applicable to problems with a low-dimensional state space, but the use of Deep Neural Networks as function approximators within RL has shown impressive results for the control of high-dimensional systems. This approach is known as Deep Reinforcement Learning (DRL).

When we think about how an agent could help us, the first thought that comes to mind is autonomous driving. It could greatly reduce the number of lives lost to causes such as speeding, driver fatigue, or inattention to the road. Moreover, during the trip from A to B we could work or relax.
Recent works have proposed treating the learning of an autonomous driving agent as a Reinforcement Learning problem (Sallab, Abdou, Perot, & Yogamani, 2017), motivated by the interactive nature of the relationship between the autonomous vehicle and its driving environment.
Because testing an autonomous car on real roads is very expensive, most of the relevant decision-making algorithms are developed in simulation. In this thesis, we aim to prototype agents using both Q-Learning (Deep Q-Learning in our case) and Policy Gradient methods.
We first experiment with a toy problem environment, CarRacing-v0 from OpenAI Gym ( https://gym.openai.com/envs/CarRacing-v0/ ), and then move to a more complex environment that is closer to the real world: the Carla Simulator ( https://carla.org/ ).
Introduction
Machine learning is a part of Artificial Intelligence in which machines learn how to make decisions and recognize patterns from data without being explicitly programmed. There are three main types of learning: supervised learning, unsupervised learning and reinforcement learning. Supervised learning attempts to learn a mapping between an input space and its label space, where the correct label is given during training. In contrast, unsupervised learning aims to extract information from unlabeled input data. Unsupervised learning can provide a better representation for a supervised task or come up with its own task.

Reinforcement learning (RL) aims to learn from interaction with the environment in such a way that an accumulated reward is maximized. Reinforcement learning provides solutions to goal-oriented tasks in place of supervised learning methods when a simulator of the environment is available instead of input-label training pairs: the RL algorithm interacts with the environment by taking actions.
For example, for the task of learning to drive, a supervised learning algorithm would leverage a dataset consisting of pairs of observations and their corresponding optimal actions. However, collecting such a dataset is highly challenging. On the other hand, given a driving simulator, an RL approach can perform training directly based on the reward signal from the simulator. Deep reinforcement learning (DRL) integrates recent advances in deep neural networks (DNNs) into RL, for example by parametrizing value approximators with DNNs (Mnih et al., 2013) (Mnih et al., 2015). A real-world autonomous driving setup involves substantially more complicated components, and on top of that it is very costly to evaluate a proposed RL algorithm in such a setup. Therefore, by prototyping on a toy example like ours, we hope to gain useful insights into designing effective models and potentially avoid investing an excessive amount of resources when experimenting with real-world scenarios.
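
To make this interaction concrete, the following is a minimal sketch of the standard agent-environment loop on the CarRacing-v0 environment used later in this thesis. It assumes the classic gym API and the gym[box2d] dependencies; the random action is only a placeholder for a learned policy.

    import gym

    # Minimal sketch of the agent-environment loop on CarRacing-v0 (classic gym API).
    # The random action below is only a stand-in for a learned policy that would map
    # the 96x96 RGB observation to an action.
    env = gym.make("CarRacing-v0")

    for episode in range(3):
        observation = env.reset()
        done = False
        total_reward = 0.0
        while not done:
            action = env.action_space.sample()            # [steering, gas, brake]
            observation, reward, done, info = env.step(action)
            total_reward += reward                        # reward signal from the simulator
        print("episode %d: return = %.1f" % (episode, total_reward))

    env.close()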

There are many factors in the success of an RL solution. In this thesis, we selectively investigate and optimize three fundamental aspects, namely the design of the action space, the exploration mechanisms, and the utilization of experience replay. We also want to examine the different steps involved in value function approximation methods and policy gradient methods.

2 Background
In this section, we present the most relevant Reinforcement Learning (RL) background knowledge required to grasp the content of this thesis. We start with general RL concepts (Section 2.1), followed by more elaboration on the recently popularized Deep Learning (DL) based approaches to RL (Section 2.2). In addition, Section 2.3 provides more information on driving simulators, including their roles in building an autonomous driving system, several popular open-source examples in the research community, and the reasons why we chose the CarRacing and Carla Simulator environments.
2.1 Reinforcement Learning
Reinforcement learning is a subdomain of machine learning in which the goal is for the agent to learn to maximize an accumulated reward in an environment. The decision-making process by which the agent takes an action in the environment is modeled as a Markov Decision Process (MDP) (Sutton and Barto, 1998). A Markov Decision Process consists of a set of states, which can be finite, infinite or continuous; actions that control the environment, which can be discrete or continuous; a transition in which the agent moves from state St to state St+1 by taking an action At; and a reward function that gives the reward obtained for taking action At in state St.
In RL, the problem is often mathematically formulated as a Markov decision process (MDP). A
MDP is a way of representing the "dynamics" of the environment, that is, the way the
environment will react to the possible actions the agent might take, at a given state. More
precisely, an MDP is equipped with a transition function (or "transition model"), which is a
function that, given the current state of the environment and an action (that the agent might take),
outputs a probability of moving to any of the next states. A reward function is also associated
with an MDP. Intuitively, the reward function outputs a reward, given the current state of the
environment (and, possibly, an action taken by the agent and the next state of the environment).
Collectively, the transition and reward functions are often called the model of the environment. To
conclude, the MDP is the problem and the solution to the problem is a policy. Furthermore, the
"dynamics" of the environment are governed by the transition and reward functions (that is, the
"model").
However, we often do not have the MDP, that is, we do not have the transition and reward
functions (of the MDP associated with the environment). Hence, we cannot estimate a policy from the
MDP, because it is unknown. Note that, in general, if we had the transition and reward functions
of the MDP associated with the environment, we could exploit them and retrieve an optimal
policy (using dynamic programming algorithms).
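
As an illustration of how a known model can be exploited, the sketch below runs value iteration, one such dynamic programming algorithm, on a hypothetical tabular MDP; the transition table P and reward table R are assumed inputs and are not part of the environments studied in this thesis.

    import numpy as np

    # Value iteration on a hypothetical tabular MDP with known dynamics.
    # P[s][a] is a list of (probability, next_state) pairs, R[s][a] is the expected reward.
    def value_iteration(P, R, gamma=0.99, tol=1e-6):
        n_states, n_actions = len(P), len(P[0])
        V = np.zeros(n_states)
        while True:
            Q = np.array([[R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                           for a in range(n_actions)]
                          for s in range(n_states)])
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        return V, Q.argmax(axis=1)    # state values and the greedy (optimal) policy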

In the absence of these functions (that is, when the MDP is unknown), to estimate the optimal
policy, the agent needs to interact with the environment and observe the responses of the
environment. This is often referred to as the "reinforcement learning problem", because the agent
will need to estimate a policy by reinforcing its beliefs about the dynamics of the environment.
Over time, the agent starts to understand how the environment responds to its actions, and it can
thus start to estimate the optimal policy. Thus, in the RL problem, the agent estimates the
optimal policy to behave in an unknown (or partially known) environment by interacting with it
(using a "trial-and-error" approach).

In this context, a model-based algorithm (Sutton and Barto, 1998) is an algorithm that uses the
transition function (and the reward function) in order to estimate the optimal policy. The agent
might have access only to an approximation of the transition function and reward functions,
which can be learned by the agent while it interacts with the environment or it can be given to
the agent (e.g. by another agent). In general, in a model-based algorithm, the agent can
potentially predict the dynamics of the environment (during or after the learning phase), because
it has an estimate of the transition function (and reward function). However, note that the
transition and reward functions that the agent uses in order to improve its estimate of the optimal
policy might just be approximations of the "true" functions. Hence, the optimal policy might
never be found (because of these approximations).

A model-free algorithm (Sutton and Barto, 1998) is an algorithm that estimates the optimal
policy without using or estimating the dynamics (transition and reward functions) of the
environment. In practice, a model-free algorithm either estimates a "value function" or the
"policy" directly from experience (that is, the interaction between the agent and environment),
without using either the transition function or the reward function. A value function can be
thought of as a function which evaluates a state (or an action taken in a state), for all states. From
this value function, a policy can then be derived.

The training of an RL algorithm can be done on-policy or off-policy. On-policy methods attempt to
evaluate or improve the policy that is used to make decisions. In contrast, off-policy methods
evaluate or improve a policy different from the one used to generate the data: the learning is
independent of the behavior the agent actually follows, so the optimal policy can be found
regardless of how the agent explores. For example, Q-learning is an off-policy learner.
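
As a concrete illustration of the off-policy Q-learning update, the following is a minimal tabular sketch with epsilon-greedy exploration. It assumes a Gym-style environment with discrete, hashable states (which CarRacing is not; Deep Q-Learning replaces the table with a neural network), and the hyperparameters are placeholders.

    import random
    from collections import defaultdict

    # Tabular Q-learning (off-policy): the update bootstraps from max_a Q(s', a),
    # regardless of the epsilon-greedy action the agent will actually take next.
    def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
        Q = defaultdict(float)                            # Q[(state, action)] -> value
        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                if random.random() < epsilon:             # explore
                    action = random.randrange(n_actions)
                else:                                     # exploit
                    action = max(range(n_actions), key=lambda a: Q[(state, a)])
                next_state, reward, done, _ = env.step(action)
                best_next = max(Q[(next_state, a)] for a in range(n_actions))
                Q[(state, action)] += alpha * (reward + gamma * best_next * (not done)
                                               - Q[(state, action)])
                state = next_state
        return Q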

The direct objectives of RL algorithms generally fall into two categories: value function
approximation and policy function approximation.

Value function approximation tries to build a function that estimates the true value function by
creating a compact representation that uses a smaller number of parameters, i.e. an approximation
v̂(s, w) ≈ vπ(s) parametrized by a weight vector w with far fewer components than there are states.

A common practice is to use deep learning; in that case, the weights of the neural network form the
vector of weights w used to estimate the value function across the entire state/state-action space.
This vector of weights is updated using methods such as Monte Carlo or Temporal-Difference learning.
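
To make the weight update concrete, here is a minimal sketch of a semi-gradient TD(0) update for a state-value network v(s; w); PyTorch and the 8-dimensional state are assumed purely for illustration and are not prescribed by this thesis.

    import torch
    import torch.nn as nn

    # Semi-gradient TD(0): move v(s; w) toward the bootstrapped target r + gamma * v(s'; w).
    value_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
    optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
    gamma = 0.99

    def td0_update(state, reward, next_state, done):
        state = torch.as_tensor(state, dtype=torch.float32)
        next_state = torch.as_tensor(next_state, dtype=torch.float32)
        with torch.no_grad():                             # the target is not differentiated
            target = reward + gamma * value_net(next_state) * (1.0 - float(done))
        loss = nn.functional.mse_loss(value_net(state), target)
        optimizer.zero_grad()
        loss.backward()                                   # gradient w.r.t. the weights w only
        optimizer.step()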

2.2 Deep Reinforcement Learning


2.3 Driving Simulators
