
9. Neural Networks - Learning

9.1 Cost Function


Neural networks are one of the most powerful learning algorithms that we have today. In this and the next few lectures, I'd like to start talking about a learning algorithm for fitting the parameters of a neural network given a training set. As with most learning algorithms, we're going to begin by talking about the cost function for fitting the parameters of the network. I'm going to focus on the application of neural networks to classification problems. So, suppose we have a network like this

And suppose we have a training set of m labeled examples {(x^(1), y^(1)), ..., (x^(m), y^(m))}.

*obs.: in the case of the network above, the hypothesis output h_Θ(x) and the label y are both 4-dimensional.

I'm going to use upper case L to denote the total number of layers in this network; so, in this case, L = 4. And I'm going to use s_l to denote the number of units, that is the number of neurons, not counting the bias unit, in layer l of the network. So, in our example, the output layer has s_L = 4 units.

We're going to consider two types of classification problems. The first is binary classification, where the labels y are either zero or one (y ∈ {0, 1}).

In this case, we would have one output unit, like


And the output of the neural network would be h_Θ(x), which is going to be a real number; for binary classification we always have one output unit, s_L = 1. In this case, to simplify notation later, I'm also going to set K = 1 (where K denotes the number of units in the output layer).

The second type of classification problem we'll consider is the multiclass classification problem, where we may have K distinct classes. So, in our earlier example, we had K = 4 classes, with each label y represented as a K-dimensional vector containing a single 1 that indicates the class. So our hypothesis will output vectors that are K-dimensional, and the number of output units will be equal to K. And usually we will have K greater than or equal to three in multiclass classification, because if we had two classes we would have a binary classification problem.

Now, let's define the cost function for our neural network. The cost function we use for the neural network is going to be a generalization of the one that we use for logistic regression. For logistic regression, we minimized the following cost function:
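(in the notation from the logistic regression lectures)

J(θ) = −(1/m) Σ_{i=1}^{m} [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log (1 − h_θ(x^(i))) ] + (λ/2m) Σ_{j=1}^{n} θ_j²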

For a neural network our cost function is going to be a generalization of this, where instead of having just one logistic regression output unit, we may instead have K of them. So here's our cost function:
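(written with K output units, in the same notation)

J(Θ) = −(1/m) Σ_{i=1}^{m} Σ_{k=1}^{K} [ y_k^(i) log (h_Θ(x^(i)))_k + (1 − y_k^(i)) log (1 − (h_Θ(x^(i)))_k) ] + (λ/2m) Σ_{l=1}^{L−1} Σ_{i=1}^{s_l} Σ_{j=1}^{s_{l+1}} (Θ_{ji}^(l))²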
The neural network now outputs vectors in R^K, where K might be equal to 1 if we have a binary classification problem. I'm going to use the notation (h_Θ(x))_k to denote the k-th output entry. So this subscript k just selects the k-th element of the vector that is output by my neural network - a value between zero and one that estimates the probability that the input x corresponds to class k.

My cost function is now similar to what we had in logistic regression, except that we have this sum over k from 1 through K. The summation is basically a sum over my K output units: we are summing the error associated with all output predictions for one training example. For example, if I have four units in the final layer of my neural network, then this sum goes from k = 1 through 4 - it is basically the logistic regression cost function, but summing that cost function over each of my four output units in turn (for each training example - see the formula above).

And so, you notice that for each term in this sum we compare (h_Θ(x^(i)))_k and y_k^(i): we're basically taking the k-th output unit of the hypothesis and comparing it to the actual value of the k-th entry of the label y^(i).

And finally, we have the regularization term

that is similar to what we had for logistic regression. In this term we are regularizing all the parameters Θ_{ji}^(l) between layers l and l+1. The outer sum, over l from 1 to L−1, guarantees that we traverse all parameter matrices between the first and final (L-th) layers. Remember that the number of rows of Θ^(l) is given by the number of units in layer l+1 (which has no bias term to map into), which is why the sum over j runs from 1 to s_{l+1}. The number of columns is given by the number of units in layer l plus one, but the sum over i runs only from 1 to s_l, skipping the bias column, because we don't regularize the bias parameters Θ_{j0}^(l) (a different situation from the row index above). This is just one possible convention, though: even if you were to start that sum from i = 0, the algorithm would work about the same, and it doesn't make a big difference. Using this setup, we regularize all of the neural network parameters.

So, that's the cost function we're going to use to fit our neural network.

9.2 Backpropagation Algorithm


In this lecture, let's start to talk about an algorithm for trying to minimize the cost function that we just saw. In particular, we'll talk about the backpropagation algorithm - strictly speaking, the backpropagation algorithm does not itself minimize J(Θ), it just computes the needed partial derivatives; but within the overall machine learning problem, backpropagation is an important step toward minimizing J(Θ). We will work with the cost function J(Θ) that we wrote down in the previous lecture.

What we'd like to do is find parameters Θ that minimize J(Θ). Again, in order to use either gradient descent or one of the advanced optimization algorithms, we need code that takes the parameters Θ as input and computes J(Θ) and the partial derivative terms ∂J(Θ)/∂Θ_{ij}^(l) (which now vary with l, i and j).

In order to compute the cost function J(Θ), we just use the cost function formula itself; but how do we get the derivative terms?

Let's start by talking about the case of a single training example (x, y), that is, when we have only one training example in the entire training set. Now, to get our output for this example, given a set of parameters, we just run forward propagation on this training example, like this:
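For the four-layer network used as the running example, the vectorized forward propagation steps are

a^(1) = x
z^(2) = Θ^(1) a^(1)
a^(2) = g(z^(2))   (then add the bias unit a_0^(2) = 1)
z^(3) = Θ^(2) a^(2)
a^(3) = g(z^(3))   (then add a_0^(3) = 1)
z^(4) = Θ^(3) a^(3)
a^(4) = h_Θ(x) = g(z^(4))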

*obs.: h_Θ(x) is just the model that is used for prediction with this choice of parameters Θ. As we apply gradient descent or another optimization algorithm to minimize the cost function, this model changes, predicting better values for new training examples.

So, this is our vectorized implementation of forward propagation, and it allows us to compute the activation values for all of the neurons in our neural network. With h_Θ(x) evaluated, I can compute my cost function from my data (which provides the real outputs y).

Next, in order to compute the derivatives, we're going to use an algorithm called back propagation. The intuition of the back propagation algorithm is that each node (a unit in a particular layer l) will have an associated delta value, δ_j^(l), that represents the error in that node.

So, this delta term is, in some sense, going to capture our error in the activation of that neural unit. Concretely, given our neural network, we start from the back and move towards the front of the network calculating these delta (error) terms - we are, in a sense, back-propagating the errors from the output layer to the first layer. So, given our example, where L = 4, for each output unit we will calculate the delta term like this:
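In the notation above,

δ_j^(4) = a_j^(4) − y_j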
This error is just the activation of that unit minus the actual value from the training example (which we obviously have, since it comes from the training set). And if you think of δ and y as vectors, then you can also come up with a vectorized implementation of it, which is just
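δ^(4) = a^(4) − y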

where each of these vectors is 4-dimensional (the number of output units in our network, s_L = 4).

What we do next is compute the delta terms for the earlier layers in our network. Here's the formula for computing δ^(3) (and, analogously, δ^(2)):
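δ^(3) = (Θ^(3))^T δ^(4) .* g′(z^(3))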

*obs.: .* denotes element-wise multiplication - in the above equation both sides of .* are vectors of the same dimension; the element-wise multiplication is not the inner product between those vectors but the multiplication of entries in the same position, and it returns a vector of the same dimension as its inputs.

where δ^(3) (the vector of errors for the units in layer 3) is given by the set of parameters between layers 3 and 4, transposed, multiplied by the previous error vector δ^(4), with each element then multiplied by g′(z^(3)). This last term is the derivative of the activation function evaluated at the input values given by z^(3); it can be seen as the derivative of a^(3), since a^(3) = g(z^(3)). And this derivative term, when computed, returns
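g′(z^(3)) = a^(3) .* (1 − a^(3))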

where a^(3) is the vector of activations for that layer (in this case, layer 3), and '1' is a vector of ones.

For δ^(2) we have a similar relationship:
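δ^(2) = (Θ^(2))^T δ^(3) .* g′(z^(2)) = (Θ^(2))^T δ^(3) .* a^(2) .* (1 − a^(2))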


And finally, there is no δ^(1) term, because the first layer corresponds to the input layer: those are just the features we observe in our training set, so there is no error associated with them - we don't really want to try to change those values.

The name back propagation comes from the fact that we start by computing the delta terms for the
output layer and then we go back a layer and compute the delta terms for the previous ones.

Finally, although the derivation is surprisingly involved, it is possible to prove that (ignoring regularization, i.e. with λ = 0)
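∂J(Θ)/∂Θ_{ij}^(l) = a_j^(l) δ_i^(l+1)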

The partial derivative terms we want are exactly given by the product of the activation terms and these delta terms (ignoring, for now, the regularization term λ in J(Θ); we'll fix that detail about regularization later). So, by performing back propagation and computing these delta terms, you can quite quickly compute the partial derivative terms for all of your parameters.

So, summarizing, let's take everything and put it all together to talk about how to implement back propagation to compute derivatives with respect to your parameters, now for the case where we have a large training set.

Suppose we have a training set of m examples, {(x^(1), y^(1)), ..., (x^(m), y^(m))}.

The first thing we're going to do is set Δ_{ij}^(l) = 0 for all values of l, i and j (it's like an initialization step). Eventually, this capital delta will be used to compute the partial derivative terms ∂J(Θ)/∂Θ_{ij}^(l). As we'll see in a second, these Δ terms are going to be used as accumulators (across training examples) that will slowly add things up in order to compute those partial derivatives. Next, we're going to loop through our training set, and for each training example we are going to do the following.

So, in the i-th iteration we're going to be working with the training example (x^(i), y^(i)). For each training example we will do:

1. Set the activations of the input layer, a^(1), to the features of the training example, x^(i).

2. Perform forward propagation using the current values of theta to compute the activations for layer
two, layer three and so on up to the final layer, layer capital L.

3. Start backpropagation - calculate the error of the output layer for this training example, δ^(L) = a^(L) − y^(i) (where a^(L) is the hypothesis vector h_Θ(x^(i))).

4. Back propagate the errors to compute δ^(L−1), δ^(L−2), and so on, down to δ^(2).

5. Finally, use all the error terms already calculated for this particular training example to update Δ_{ij}^(l) ("accumulating" the contribution to the partial derivative of the cost with respect to Θ_{ij}^(l)). Note that to update Δ in this iteration (training example) we need all the activations (the a terms) and the errors (the δ terms) for this training example - which we have, because we calculated them in the previous steps. And by the way, if you look at the update expression shown below, it is exactly the a_j^(l) δ_i^(l+1) product we saw earlier, but now accumulated over all training examples (each update inside the loop handles one particular training example). And it's possible to vectorize it too:
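Per entry, and in vectorized form, the updates are

Δ_{ij}^(l) := Δ_{ij}^(l) + a_j^(l) δ_i^(l+1)
Δ^(l) := Δ^(l) + δ^(l+1) (a^(l))^T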

So that's a vectorized implementation of the accumulation step that automatically does the update for all values of i and j, for this particular training example. Finally, after executing the body of the for loop (after iterating over all training examples and updating the Δ terms accordingly), we compute the following:
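With λ the regularization parameter,

D_{ij}^(l) = (1/m) Δ_{ij}^(l) + (λ/m) Θ_{ij}^(l)   if j ≠ 0
D_{ij}^(l) = (1/m) Δ_{ij}^(l)                      if j = 0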

where capital D is (after some fairly involved calculus) exactly the partial derivative term that we are looking for: ∂J(Θ)/∂Θ_{ij}^(l) = D_{ij}^(l).

*obs.: remember that we saw that the expression a_j^(l) δ_i^(l+1) gave the partial derivatives without the regularization term; so we add that term at the end of back prop.

And notice that we have two separate cases, because the case j = 0 corresponds to the bias terms, which do not get the extra regularization term. Having calculated D (the partial derivatives), we just pass it to an optimization algorithm to get the values of Θ that minimize J(Θ). So that's the back propagation algorithm and how you compute the derivatives of your cost function for a neural network.
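As a rough illustration of the procedure above, here is a minimal MATLAB/Octave sketch of the accumulation loop for a network with a single hidden layer. The layer sizes, the variable names (X, Y, Theta1, Theta2, m, lambda) and the inline sigmoid helper are assumptions made for this sketch, not the course's provided code.

% X is m x n (one example per row), Y is m x K (one-hot labels),
% Theta1 is s2 x (n+1), Theta2 is K x (s2+1).
sigmoid = @(z) 1 ./ (1 + exp(-z));
Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));
for i = 1:m
  % Forward propagation for example i
  a1 = [1; X(i, :)'];              % input activations plus bias unit
  z2 = Theta1 * a1;
  a2 = [1; sigmoid(z2)];           % hidden activations plus bias unit
  z3 = Theta2 * a2;
  a3 = sigmoid(z3);                % output activations, h_Theta(x^(i))

  % Back propagation of the error terms
  d3 = a3 - Y(i, :)';              % delta for the output layer
  d2 = (Theta2' * d3) .* (a2 .* (1 - a2));
  d2 = d2(2:end);                  % discard the delta for the bias unit

  % Accumulate the gradient contributions
  Delta1 = Delta1 + d2 * a1';
  Delta2 = Delta2 + d3 * a2';
end
% Regularized partial derivatives (the bias column, j = 0, is not regularized)
D1 = Delta1 / m;  D1(:, 2:end) = D1(:, 2:end) + (lambda/m) * Theta1(:, 2:end);
D2 = Delta2 / m;  D2(:, 2:end) = D2(:, 2:end) + (lambda/m) * Theta2(:, 2:end);

D1 and D2 are then the derivative matrices that would be handed to the optimization routine.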

9.3 Backpropagation Intuition


In the previous lecture, we talked about the back propagation algorithm. Back propagation is, unfortunately, a less mathematically clean and simple algorithm compared to linear regression or logistic regression, but it is quite possible to implement it without knowing in detail exactly what it is doing. So, what I want to do in this lecture is look a little more closely at the mechanical steps of back propagation and try to give you a little more intuition about what those mechanical steps are doing, to hopefully convince you that it is at least a reasonable algorithm.

In order to better understand back propagation, let's take another, closer look at what forward propagation is doing. Here's a neural network with two input units, two hidden units in each of the next two hidden layers, and finally one output unit. The whole idea of forward propagation can be illustrated as follows.

So, for each training example I will do the above procedure. When performing forward propagation, we might take some particular example, say (x^(i), y^(i)), feed it into the input layer, and forward propagate it to the first hidden layer, and so on. To propagate, we first compute the z terms, which are the weighted sums of the units in the previous layer (the weights being the Θ values); then, we compute the activation of each unit by applying the sigmoid function. The figure shows the detailed calculation for one of these z values. I keep doing this until I reach the activation of my output unit, which is my hypothesis. So, that's forward propagation. And it turns out that, as we will see later in this lecture, what back propagation does is a process very similar to this, except that instead of the computations flowing from the left to the right of the network, the computations flow from the right to the left, using a very similar kind of computation.
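For one unit in the third layer, for example, that computation would look like this (the particular indices here are an assumption for illustration):

z_1^(3) = Θ_10^(2) · 1 + Θ_11^(2) a_1^(2) + Θ_12^(2) a_2^(2),   and then   a_1^(3) = g(z_1^(3))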

Now, let's go to back propagation. To better understand what back propagation is doing, let's look at the cost function as if we had only one output unit (so h_Θ(x) is a real-valued number); note that in that case there is no summation over the K output classes. More than that, let's just focus on the single example (x^(i), y^(i)) and let's ignore regularization, so λ equals zero. Given these three assumptions, if you look inside the summation term, you find that the cost term associated with the i-th training example is
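cost(i) = −( y^(i) log h_Θ(x^(i)) + (1 − y^(i)) log (1 − h_Θ(x^(i))) )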

And what this cost function does (as we saw in the logistic regression lectures) is play a role similar to the squared error (h_Θ(x^(i)) − y^(i))²: it measures the difference between the hypothesis and the actual value of the training example - that is, how well the network is doing on example i.

Now, let's actually look at what back propagation is doing. One useful intuition is that back propagation is computing the delta terms δ_j^(l), where
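δ_j^(l) = ∂ cost(i) / ∂ z_j^(l)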

So, the error of a particular unit j in a particular layer l is actually the partial derivative of cost(i) (which is a single quantity for this particular training example) with respect to the linear combination, i.e. the weighted sum of inputs, z_j^(l) of that unit (which varies from unit to unit across the network). These z terms, of course, directly affect how close my hypothesis is to the actual value y.

So, the backpropagation process can be viewed in the following figure


*obs.: all the following back prop calculations are made after the forward propagation calculations for
this training example.

First, we calculate the error in the output unit (it's the difference between the actual value and the value that was predicted). Next we propagate these values backwards, in a process very similar to the forward propagation algorithm, but done backwards. To see that, the figure shows the evaluation of δ_2^(3) and δ_2^(2). We can see, for example, that δ_2^(2) is actually the weighted sum of the δ_1^(3) and δ_2^(3) values (weighted by the strength of the corresponding edges). And by the way, depending on how you define the back propagation algorithm, or depending on how you implement it, you may end up computing delta values for the bias units as well. The bias units always output the value +1 and there's no way for us to change that. So, depending on your implementation of back prop, you may end up computing these delta values, but we just discard them and don't use them, because they don't end up being part of the calculation needed to compute the derivatives.
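Written out for a figure like the one described (the specific indices and weights here are an assumption for illustration), those backward weighted sums would look like:

δ_2^(3) = Θ_12^(3) δ_1^(4)
δ_2^(2) = Θ_12^(2) δ_1^(3) + Θ_22^(2) δ_2^(3)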

So, hopefully, that gives you a little bit of intuition about what back propagation is doing. In case all of this still seems like a black box, in a later lecture, the "Putting It Together" lecture, I'll try to give a little more intuition about what back propagation is doing. But, unfortunately, this is a difficult algorithm to visualize and to understand what it is really doing. Fortunately, many people have been using it very successfully for many years, and if you implement the algorithm you will have a very effective learning algorithm, even though the inner workings of exactly how it works can be harder to visualize.

9.4 Implementation Note - Unrolling Parameters


In the previous lectures, we talked about how to use back propagation to compute the derivatives of your cost function. In this lecture, I want to quickly tell you about one implementation detail: unrolling your parameters from matrices into vectors, which we need in order to use the advanced optimization routines.
Concretely, let's say you have this (already studied) setup:
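A minimal sketch of that setup (using a toy quadratic cost just to show the interface fminunc expects; this is not the course's provided code). In a file costFunction.m:

function [jVal, gradient] = costFunction(theta)
  % cost J(theta) and its gradient for a simple quadratic example
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);
  gradient(2) = 2 * (theta(2) - 5);
end

and then, at the prompt:

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);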

Both of these assume that theta, gradient and initialTheta are parameter vectors (column vectors in R^(n+1)). This worked fine when we were using logistic regression, but now that we're using a neural network, our parameters are no longer vectors but matrices. For example, for a four-layer network we would have the weight matrices Theta1, Theta2, Theta3 and the gradient matrices D1, D2, D3.

In this lecture I want to quickly tell you about the idea of how to take these matrices and unroll them into vectors, so that they end up in a format suitable for passing into those functions.

Concretely, let's say we have a neural network with 10 units in the input layer, 10 units in each of two hidden layers, and 1 output unit. So, our matrix dimensions are Theta1: 10x11, Theta2: 10x11, Theta3: 1x11 (and the same for D1, D2, D3). So, if you want to convert these matrices into vectors, what you can do is unroll them and put all the elements into one big, long column vector, using MATLAB code like this:
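For example (the matrices here are dummy placeholders with the dimensions above):

Theta1 = ones(10, 11);
Theta2 = 2 * ones(10, 11);
Theta3 = 3 * ones(1, 11);

% Unroll all parameter matrices into one long column vector
thetaVec = [ Theta1(:); Theta2(:); Theta3(:) ];

% The gradient matrices would be unrolled the same way:
% DVec = [ D1(:); D2(:); D3(:) ];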

And, of course, if you want to go back from the vector representation to the matrix representation, you can do this:
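Theta1 = reshape(thetaVec(1:110), 10, 11);
Theta2 = reshape(thetaVec(111:220), 10, 11);
Theta3 = reshape(thetaVec(221:231), 1, 11);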
For example, the (1:110) argument to reshape for Theta1 is there because the first 110 elements of thetaVec correspond exactly to the elements of the original Theta1 matrix. Using reshape, I reorganize those elements back into a 10x11 matrix, exactly the original one.

To make this process really concrete, here's how we use the unrolling idea to implement our learning algorithm. Let's say that you have some initial values of the parameters Theta1, Theta2, Theta3; first, we unroll them into a long vector called initialTheta, and this unrolled initialTheta is what gets passed to the fminunc function. Summarizing:
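Roughly (the initialTheta1..3 names are assumptions for this sketch):

initialTheta = [ initialTheta1(:); initialTheta2(:); initialTheta3(:) ];
optTheta = fminunc(@costFunction, initialTheta, options);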

The other thing we need to do is implement the costFunction. Here's the idea of that implementation:
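A skeleton of what that function might look like for the network dimensions above (the placeholder zeros stand in for the real forward/back propagation computations):

function [jVal, gradientVec] = costFunction(thetaVec)
  % Reshape the unrolled parameter vector back into the weight matrices
  Theta1 = reshape(thetaVec(1:110), 10, 11);
  Theta2 = reshape(thetaVec(111:220), 10, 11);
  Theta3 = reshape(thetaVec(221:231), 1, 11);

  % Placeholders: in a real implementation, forward propagation and back
  % propagation (using Theta1..Theta3 and the training data) compute these.
  jVal = 0;
  D1 = zeros(10, 11);  D2 = zeros(10, 11);  D3 = zeros(1, 11);

  % Unroll the gradient matrices into one vector for the optimizer
  gradientVec = [ D1(:); D2(:); D3(:) ];
end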

The costFunction receives thetaVec, the unrolled vector of all the current values of Θ. Inside costFunction, the first thing I do is use the reshape function to recover my Theta matrices, so that I can compute the cost and the derivatives using forward and back propagation. After that, we unroll the derivative matrices into one big vector, so that it can be returned to fminunc to minimize J(Θ); since jVal is a scalar, we don't need to change it.

Note that the advantage of the matrix representation is that when your parameters are stored as matrices it's more convenient to do forward propagation and back propagation, and you can take advantage of vectorized implementations. In contrast, the advantage of the vector representation is that it is what the advanced optimization algorithms expect: those algorithms tend to assume that you have all of your parameters unrolled into one big, long vector. And with the vector-matrix mappings we just went through, hopefully you can now quickly convert between the two as needed.

9.5 Gradient Checking


In the last few lectures, we talked about how to do forward propagation and back propagation in a neural network in order to compute derivatives. But back prop, as an algorithm, has a lot of details, and it can look like it's working - your cost function may end up decreasing on every iteration of gradient descent or of some advanced algorithm - while you actually have a neural network with a high level of error, nowhere near the minimum it could reach. So, what can we do about this? There's an idea called gradient checking (which can be viewed as a numerical estimation, i.e. an approximation, of the gradients) that eliminates almost all of these problems. So today, every time I implement back propagation or a similar gradient computation for a neural network or any other reasonably complex model, I always implement gradient checking (to check that the partial derivatives are indeed being calculated correctly). If you do this, it will help you make sure, and gain high confidence, that your implementation of forward prop and back prop, or whatever, is 100% correct. Therefore, once you implement numerical gradient checking, you'll be able to verify for yourself that the code you're writing is indeed computing the derivatives of the cost function J(Θ), even for a complex learning algorithm like backpropagation, where it is not very clear what the algorithm is doing.

So here's the idea. Suppose I have the function J(θ), and I have some particular value of θ; for this example, I'm going to assume that θ is just a real number. And let's say I want to estimate the derivative of this function at this point. The derivative is equal to the slope of the tangent line at this point. To numerically approximate the derivative, I use the following construction, looking at the two nearby points θ − ε and θ + ε:

So, my approximation of the derivative is the slope of the line connecting (θ − ε, J(θ − ε)) and (θ + ε, J(θ + ε)): the vertical height, J(θ + ε) − J(θ − ε), divided by the horizontal width, 2ε. The actual derivative is the slope of the blue (tangent) line. So, the approximation can be calculated as
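d/dθ J(θ) ≈ ( J(θ + ε) − J(θ − ε) ) / (2ε)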
So I want to approximate the slope of the tangent line by the slope of the hypotenuse of the red triangle. And this seems like it would be a pretty good approximation since, usually, I use a pretty small value for ε, like ε = 10^-4. In fact, if you let epsilon become really small then, mathematically, the approximation term becomes exactly the derivative - exactly the slope of the function at that point.

*obs.: we don't want to use epsilon that's too small because then you might run into numerical
problems related to the computational accuracy of the computer.

And by the way, some of you may have seen an alternative formula for estimating the derivative, which is this one:
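d/dθ J(θ) ≈ ( J(θ + ε) − J(θ) ) / ε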

That's called the one-sided difference. But the first approximation, the two-sided difference, gives us a slightly more accurate estimate, so I usually use that one to approximate the derivative.

So, concretely, what you implement in MATLAB is something that computes gradApprox, an approximation of the derivative:
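Assuming theta is a scalar here and J(...) is a function that evaluates the cost, this is just:

gradApprox = (J(theta + EPSILON) - J(theta - EPSILON)) / (2 * EPSILON);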

And this will give you a numerical estimate of the gradient at that point; in this example it seems like a pretty good estimate. Now, let's look at the more general case where θ is a vector of parameters (for example, θ ∈ R^n, the unrolled vector of our parameters) rather than a real number.

In this case, we can use a similar idea to approximate all of the partial derivative terms:
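For each parameter θ_j in turn,

∂J/∂θ_1 ≈ ( J(θ_1 + ε, θ_2, ..., θ_n) − J(θ_1 − ε, θ_2, ..., θ_n) ) / (2ε)
∂J/∂θ_2 ≈ ( J(θ_1, θ_2 + ε, ..., θ_n) − J(θ_1, θ_2 − ε, ..., θ_n) ) / (2ε)
...
∂J/∂θ_n ≈ ( J(θ_1, θ_2, ..., θ_n + ε) − J(θ_1, θ_2, ..., θ_n − ε) ) / (2ε)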
So, for each θ_j I approximate the derivative by changing only that component by a small amount ε. These equations give you a way to numerically approximate the partial derivative of J with respect to each of your parameters. Concretely, what you implement is the following:
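A sketch of that loop (assuming theta is the unrolled parameter vector and J(...) is a function that returns the cost for a given parameter vector):

EPSILON = 1e-4;
n = length(theta);
gradApprox = zeros(n, 1);
for i = 1:n
  thetaPlus = theta;
  thetaPlus(i) = thetaPlus(i) + EPSILON;
  thetaMinus = theta;
  thetaMinus(i) = thetaMinus(i) - EPSILON;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
end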

I set the auxiliary variables thetaPlus and thetaMinus equal to theta so that the evaluations inside gradApprox are done correctly: thetaPlus just takes the current value of theta and adds ε to one component (and, similarly, thetaMinus subtracts ε from it). Implementing this simple for loop then gives you, in one unrolled vector, the approximations of the partial derivatives with respect to each parameter θ_i.

Given these approximate values, what I usually do is take my numerically computed derivative, gradApprox, and make sure that it is equal, or approximately equal, to DVec, the "actual" derivatives I am computing with back prop. If these two ways of computing the derivative give me the same answer, or at least very similar answers up to a few decimal places, then I'm much more confident that my implementation of back prop is correct, which means that gradient descent or some advanced optimization algorithm will run correctly and do a good job optimizing J(Θ).
Finally, I want to put everything together and tell you how to implement this numerical gradient
checking. Here's what I usually do

First, implement back-propagation to compute DVec. Then, implement numerical gradient checking to compute gradApprox. Then make sure that DVec and gradApprox give similar values, say up to a few decimal places. And finally - and this is the important step - turn off gradient checking and stop computing gradApprox via the numerical derivative for further iterations. The reason is that the gradient checking code is computationally very expensive; it's a very slow way to approximate the derivative. In contrast, the back-propagation algorithm that we talked about earlier computes D1, D2, D3, or DVec, in a much more computationally efficient way.

So once you've verified that your implementation of back-propagation is correct, you should turn off gradient checking and stop using it. Another point: in gradient descent or in an advanced optimization algorithm, we always use the DVec values of the derivatives, never the approximate values (even in the first, "checking", iteration).

So that's how you take gradients numerically. And that's how you can verify that your implementation of
back-propagation is correct. Whenever I implement back-propagation or similar gradient descent
algorithm for a complicated model, I always use gradient checking. This really helps me make sure that
my code is indeed correct.

9.6 Random Initialization


In the previous lectures, we put together almost all the pieces we need in order to implement and train a neural network. There's just one last idea I need to share with you: the idea of random initialization. When you're running an algorithm like gradient descent, or one of the advanced optimization algorithms, we need to pick some initial value for the parameters Θ. The advanced optimization algorithms assume you will pass them some initial value for the parameters.
Now, let's consider gradient descent. For that we also need to initialize Θ to something, and then we can slowly take steps downhill, using gradient descent, to minimize the function J(Θ). So what do we set the initial value of Θ to? Is it possible to set the initial value of Θ to the vector of all zeros? Whereas this worked okay when we were using logistic regression, initializing all of your parameters to zero actually does not work when you're training a neural network. Why?

Consider training the following neural network; and let's say we initialized all of the parameters in the
network to zero

If you do that, all the weights are going to be zero, which means that, for example, the two hidden units compute the same value, a_1^(2) = a_2^(2). At initialization the two blue weights are equal (both zero); the same is true of the red weights and of the green weights. So, for every training example, both hidden units are going to be computing the same values. You can also show that both delta values for these units are going to be the same, and the same holds for the partial derivatives of the cost function with respect to the two blue weights in the network. What this means is that even after one gradient descent update, when you change the values of the blue weights, you end up with the same (now non-zero) value for each of them. And similarly, after one gradient descent update, the red weights and the green weights will both change, but each pair will end up with the same value as each other. So after each update, the parameters corresponding to the inputs going into each of the two hidden units are identical.

That's just saying that the two green weights stay equal to each other, the two red weights stay equal, and the two blue weights stay equal; what that means is that even after many iterations you will find that your two hidden units are still computing exactly the same function of the input. As you keep running gradient descent, the blue weights will stay the same as each other, the two red weights will stay the same as each other, the two green weights will stay the same as each other, and so will the activations of the hidden units a_1^(2) and a_2^(2) - and that's because the identical partial derivatives change each pair of weights by the same amount. What this means is that your neural network really can't compute very interesting functions. Imagine that you had not only two hidden units but many hidden units; then this is saying that all of your hidden units are computing the exact same function of the input. This is a highly redundant representation, because it means that your final logistic regression unit really only gets to see one feature in each layer.

In order to get around this problem, we initialize the parameters of a neural network with random initialization. Concretely, the problem we saw above is sometimes called the problem of symmetric weights - that is, the weights all being the same - and random initialization is how we perform symmetry breaking. So what we do is initialize each value of Θ (each weight) to a random number between −ε and +ε, i.e. close to zero.

The MATLAB code for a neural network like the one pictured is:
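A sketch, assuming Theta1 is 10x11 and Theta2 is 1x11 (the matrix sizes and the value of INIT_EPSILON are assumptions here):

% rand(r, c) gives values uniformly in [0, 1]; this maps them to [-eps, +eps]
INIT_EPSILON = 0.12;
Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(1, 11)  * (2 * INIT_EPSILON) - INIT_EPSILON;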
*obs.: note that this epsilon here has nothing to do with the epsilon that we were using when we were
doing gradient checking. They are different things!

So, to summarize: to train a neural network, what you should do is randomly initialize the weights to small values close to 0, between minus epsilon and plus epsilon, then implement forward propagation and back-propagation, do gradient checking, and use either gradient descent or one of the advanced optimization algorithms to try to minimize J(Θ). And by doing symmetry breaking, which is this random initialization process, hopefully gradient descent or the advanced optimization algorithms will be able to find a good value of Θ, even for pretty complex functions.

9.7 Putting It Together


So, it's taken us a lot of lectures to get through the neural network learning algorithm. In this lecture,
what I'd like to do is try to put all the pieces together, to give an overall summary or a bigger picture view
of how to implement a neural network learning algorithm.

When training a neural network, the first thing you need to do is pick some network architecture and by
architecture I just mean connectivity pattern between the neurons. So, here we have some examples
with the number of units in each layer

So, how do you make these choices? Well, first, the number of input units is pretty well defined: it's the number of features in our training examples. The number of output units is going to be the number of classes in your training set - remember, this is supervised learning, so the number of different outputs is given to us through the labels of the training examples. Now, for the number of hidden layers, a reasonable default is to use a single hidden layer, and this type of neural network, shown on the left with just one hidden layer, is probably the most common. If you use more than one hidden layer, a reasonable default is to have the same number of hidden units in every hidden layer. Finally, for the number of hidden units: usually, the more hidden units the better; it's just that if you have a lot of hidden units, it becomes more computationally expensive, but very often having more hidden units is a good thing. Usually the number of hidden units in each layer will be comparable to the dimension of the input features - anywhere from the same number as the input features up to maybe three or four times that number. So having the number of hidden units somewhat bigger (by a factor of roughly one to four, say) than the number of input features is often a useful thing to do. Hopefully this gives you one reasonable set of default choices for the network architecture; in a later set of lectures I will talk specifically about advice for how to choose a neural network architecture.

Now, let's go to the algorithm itself. In order to train the neural network, there are actually six steps that
I have to follow; let's see the first four steps

1. First step is to set up the neural network and to randomly initialize the values of the weights; and we
usually initialize the weights to small values near zero.

2. Implement forward propagation to compute h_Θ(x^(i)), which is my prediction.

3. and 4. are the code to compute jVal and gradient inside costFunction, which will then be used by the fminunc function. Concretely, to implement back prop and compute the partial derivatives, we will usually do that with a for loop over the training examples.

*obs.: actually, there exist some advanced vectorized methods that do not iterate through the training examples, but the straightforward way to do back prop is to iterate over the examples.

And this for loop will update the accumulator terms Δ over the iterations. Then, with Δ, outside the loop we can compute the actual derivative terms, including the regularization term (the D matrices, unrolled into a vector).

Next, we have the steps 5 and 6 to train the neural network


5. I do gradient checking to make sure that my algorithm is calculating the partial derivatives correctly.
And, as we said, it is very important that we then disable gradient checking, because the gradient checking code is computationally very slow.

6. And finally, we use an optimization algorithm such as gradient descent, or one of the advanced optimization methods such as L-BFGS or conjugate gradient, as embodied in fminunc, to try to minimize J(Θ) as a function of the parameters Θ. By the way, for neural networks this cost function is non-convex, so it can theoretically be susceptible to local minima; in fact, algorithms like gradient descent and the advanced optimization methods can, in theory, get stuck in local optima. But it turns out that in practice this is not usually a huge problem: even though we can't guarantee that these algorithms will find a global optimum, algorithms like gradient descent will usually do a very good job minimizing this cost function and will find a very good local minimum, even if they don't get to the global optimum.

Finally, gradient descent for a neural network might still seem a little bit magical or unintuitive. So, let
me just show one more figure to try to get that intuition about what gradient descent for a neural
network is doing

So, we have some cost function J(Θ), and a number of parameters in our neural network; in the figure I've just written down two of the parameters, but in reality, of course, a neural network has many parameters, so J(Θ) is really a function over a very high dimensional space. Now, this cost function measures how well the neural network fits the training data. If you take a point where J(Θ) is low, then for most of the training examples the output of the hypothesis, h_Θ(x^(i)), is pretty close to the actual output y^(i). Whereas, in contrast, if you were to take a point in the red (high-cost) region, that corresponds to the situation where, for many training examples, the output of the neural network is far from the actual value observed in the training set.

So, what gradient descent does is start from some random initial point and repeatedly go downhill. And what back propagation is doing is computing the direction of the gradient - the direction that gradient descent will use to successfully descend to a low point.

So, this ends the theoretical part of neural networks; and you will find that with back propagation added to the other techniques and concepts that we saw, you can construct a neural network that is able to fit very complex, powerful, non-linear functions to your data - and this is one of the most effective learning algorithms we have today.

9.8 Autonomous Driving


In this lecture, I'd like to show you a fun and historically important example of neural network learning,
of using a neural network for autonomous driving - that is, getting a car to learn to drive itself. The main visualization of this problem is this figure.
In the lower left corner we have the view seen by the car of what's in front of it (in this case a road that's
maybe going a bit to the left and then a little bit to the right).

On top, this first horizontal bar shows the direction selected by the human driver, where the location of
the bright white band shows the steering direction selected (in this case the steering selected is a little
bit to the left). The second bar indicates the steering direction selected by the learning algorithm (which, in this case, selected a reasonable direction, since it had already been trained).

We can see from the video that initially the white band in the second bar is very fuzzy (initially the agent has no idea how to drive the car); the algorithm starts with randomly initialized parameters. But once it starts to learn from observing several examples of the digitized image (the input) and the steering direction selected by the human driver (a continuous output, the steering angle), it can indeed learn to generate outputs (steering angles) for new inputs (road images). So, only after it has learned for a while does it start to output a solid white band in just a small part of the region, corresponding to choosing a particular steering direction. That corresponds to the neural network becoming more confident in selecting a particular location, rather than outputting a sort of light gray fuzz.

ALVINN is a system of artificial neural networks that learns to steer by watching a person drive. ALVINN is designed to control a modified Army Humvee fitted with sensors, computers and actuators for autonomous navigation experiments. The first step in configuring ALVINN is training, which means that a person drives the car while ALVINN watches. Once every two seconds, ALVINN digitizes a video image of the road ahead and records the person's steering direction. This training image is reduced in resolution to 30 by 32 pixels and provided as input to ALVINN's three-layer network. Using the back propagation learning algorithm, ALVINN is trained to output the same steering direction as the human driver for that image. Initially, the network's steering response is random. After about two minutes of training, the network learns to accurately imitate the steering reactions of the human driver.

This same training procedure can work with more than one type of road. In that case we have several neural networks in parallel, one for each of the different road types. So, when a new image comes in, ALVINN digitizes it and feeds it to each of its neural networks. Each network, running in parallel, produces a steering direction and a measure of its confidence in its response. The steering direction from the most confident network is picked; for example, if we are on a one-lane road, the network trained for one-lane roads is used to control the vehicle, because it should have a high confidence value. If an intersection appears ahead of the vehicle, then as the vehicle approaches the intersection the confidence of the one-lane network decreases. As it crosses the intersection and the two-lane road ahead comes into view, the confidence of the two-lane network rises. When its confidence becomes greater than that of the one-lane network, the two-lane network is selected to steer, safely guiding the vehicle into its lane on the two-lane road.

So that was autonomous driving using a neural network. Of course, there are more recent and more modern attempts at autonomous driving that provide more robust driving controllers than this, but I think it's still pretty remarkable and pretty amazing how a simple neural network trained with back-propagation can actually learn to drive a car somewhat well.
