
e-Chapter 7

Neural Networks

Neural networks are a biologically inspired¹ model which has had considerable engineering success in applications ranging from time series prediction to vision. Neural networks are a generalization of the perceptron which uses a feature transform that is learned from the data. As such, they are a very powerful and flexible model.

Neural networks are a good candidate model for learning from data because they can efficiently approximate complex target functions and they come with good algorithms for fitting the data. We begin with the basic properties of neural networks, and how to train them on data. We will introduce a variety of useful techniques for fitting the data by minimizing the in-sample error. Because neural networks are a very flexible model, with great approximation power, it is easy to overfit the data; we will study a number of techniques to control overfitting specific to neural networks.

7.1 The Multi-layer Perceptron (MLP)


The perceptron cannot implement simple classification functions. To illustrate, we use the target on the right, which is related to the Boolean xor function. In this example, f cannot be written as sign(wᵗx). However, f is composed of two linear parts. Indeed, as we will soon see, we can decompose f into two simple perceptrons, corresponding to the lines in the figure, and then combine the outputs of these two perceptrons in a simple way to get back f. The two perceptrons are shown next.

[Figure: an xor-like target f over (x1, x2): two lines partition the plane into regions labeled +1 and −1, with opposite labels in diagonally adjacent regions.]
¹The analogy with biological neurons, though inspiring, should not be taken too far; after all, we build planes with wings that do not flap. In much the same way, neural networks, when applied to learning from data, do not much resemble their biological counterparts.

© AM L Yaser Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin: Jan-2015. All rights reserved. No commercial use or re-distribution in any format.

[Figure: the two component perceptrons, h1(x) = sign(w1ᵗx) and h2(x) = sign(w2ᵗx), each splitting the (x1, x2) plane into a +1 and a −1 half-plane.]

The target f equals +1 when exactly one of h1, h2 equals +1. This is the Boolean xor function: f = xor(h1, h2), where +1 represents true and −1 represents false. We can rewrite f using the simpler or and and operations: or(h1, h2) = +1 if at least one of h1, h2 equals +1, and and(h1, h2) = +1 if both h1, h2 equal +1. Using standard Boolean notation (multiplication for and, addition for or, and overbar for negation),

    f = h1h̄2 + h̄1h2.

Exercise 7.1
Consider a target function f whose + and − regions are illustrated below.

[Figure: a target over (x1, x2) with three + regions and − elsewhere.]

The target f has three perceptron components h1, h2, h3:

[Figure: the three perceptrons h1, h2, h3, each a line splitting the (x1, x2) plane into a + and a − region.]

Show that
    f = h1h2h3 + h1h2h3 + h1h2h3
(with each term negating the appropriate h's). Is there a systematic way of going from a target which is a decomposition of perceptrons to a Boolean formula like this? [Hint: consider only the regions of f which are + and use the disjunctive normal form (or of ands).]

Exercise 7.1 shows that a complicated target, which is composed of perceptrons, is a disjunction of conjunctions (or of ands) applied to the component perceptrons. This is a useful insight, because or and and can be implemented by the perceptron:

    or(x1, x2) = sign(x1 + x2 + 1.5);
    and(x1, x2) = sign(x1 + x2 − 1.5).
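As a quick concreteness check, here is a tiny Python sketch (our own, not from the text; the component weight vectors below are illustrative placeholders, not the ones in the figure) that builds f = xor(h1, h2) out of exactly these or and and perceptrons:

    import numpy as np

    def perceptron(w):
        # h(x) = sign(w^t x), with the convention x0 = 1
        return lambda x: np.sign(w @ np.concatenate(([1.0], x)))

    h1 = perceptron(np.array([-0.5, 1.0, 1.0]))   # placeholder weights
    h2 = perceptron(np.array([0.5, 1.0, 1.0]))    # placeholder weights

    def OR(a, b):  return np.sign(a + b + 1.5)
    def AND(a, b): return np.sign(a + b - 1.5)

    def f(x):
        a, b = h1(x), h2(x)
        # negating a +/-1 value implements the overbar: f = h1*not(h2) + not(h1)*h2
        return OR(AND(a, -b), AND(-a, b))

Negating a ±1 input is the same as negating the weight that multiplies it, which is how the network built below handles the overbars.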

This implies that these more complicated targets are ultimately just combinations of perceptrons. To see how to combine the perceptrons to get f, we introduce a graph representation of perceptrons, starting with or and and:

[Figure: graph representations of or(x1, x2) and and(x1, x2): inputs 1, x1, x2 with weights 1.5, 1, 1 feeding the or node, and weights −1.5, 1, 1 feeding the and node.]

A node outputs a value to an arrow. The weight on an arrow multiplies this output and passes the result to the next node. Everything coming to this next node is summed and then transformed by sign(·) to get the final output.

Exercise 7.2
(a) The Boolean or and and of two inputs can be extended to more than two inputs: or(x1, …, xM) = +1 if any one of the M inputs is +1; and(x1, …, xM) = +1 if all the inputs equal +1. Give graph representations of or(x1, …, xM) and and(x1, …, xM).
(b) Give the graph representation of the perceptron: h(x) = sign(wᵗx).
(c) Give the graph representation of or(x̄1, x2, x3).
The MLP for a Complex Target. Since f = h1h̄2 + h̄1h2, which is an or of the two inputs h1h̄2 and h̄1h2, we first use the or perceptron, to obtain:

[Figure: the or combiner: bias weight 1.5 and weights 1 on the two inputs h1h̄2 and h̄1h2, producing f.]


The two inputs h1h̄2 and h̄1h2 are ands. As such, they can be simulated by the output of two and perceptrons. To deal with negation of the inputs to the and, we negate the weights multiplying the negated inputs (as you have done in Exercise 7.2(c)). The resulting graph representation of f is:

[Figure: two and nodes, each with bias weight −1.5 and input weights +1 and −1 (shown in blue and red) on h1 and h2, feeding the or node with bias weight 1.5 to produce f.]

The blue and red weights are simulating the required two ands. Finally, since h1 = sign(w1ᵗx) and h2 = sign(w2ᵗx) are perceptrons, we further expand the h1 and h2 nodes to obtain the graph representation of f.

[Figure: the full network for f: inputs 1, x1, x2 feed two perceptron nodes computing sign(w1ᵗx) and sign(w2ᵗx), whose outputs feed the two and nodes and then the or node.]

The next exercise asks you to compute an explicit algebraic formula for f. The visual graph representation is much neater and easier to generalize.

Exercise 7.3
Use the graph representation to get an explicit formula for f and show that:

    f(x) = sign[ sign(h1(x) − h2(x) − 3/2) − sign(h1(x) − h2(x) + 3/2) + 3/2 ],

where h1(x) = sign(w1ᵗx) and h2(x) = sign(w2ᵗx).

Let's compare the graph form of f with the graph form of the simple perceptron, shown to the right. More layers of nodes are used between the input and output to implement f, as compared to the simple perceptron; hence we call it a multilayer perceptron (MLP). The additional layers are called hidden layers.

[Figure: the simple perceptron as a graph: inputs 1, x1, x2 with weights w0, w1, w2 feeding a single node that outputs sign(wᵗx).]

Notice that the layers feed forward into the next layer only (there are no backward-pointing arrows and no jumps to other layers). The input (leftmost) layer is not counted as a layer, so in this example, there are 3 layers (2 hidden layers with 3 nodes each, and an output layer with 1 node). The simple perceptron has no hidden layers, just an input and output. The addition of hidden layers is what allowed us to implement the more complicated target.

Exercise 7.4
For the target function in Exercise 7.1, give the MLP in graphical form, as
well as the explicit algebraic form.

If f can be decomposed into perceptrons using an or of ands, then it can be implemented by a 3-layer perceptron. If f is not strictly decomposable into perceptrons, but the decision boundary is smooth, then a 3-layer perceptron can come arbitrarily close to implementing f. A proof-by-picture illustration for a disc target function follows:

[Figure: a disc target function (+ inside, − outside) approximated by MLPs built from 8 perceptrons and from 16 perceptrons; more perceptrons give a finer polygonal approximation.]

The formal proof is somewhat analogous to the theorem in calculus which says that any continuous function on a compact set can be approximated arbitrarily closely using step functions. The perceptron is the analog of the step function.

We have thus found a generalization of the simple perceptron that looks much like the simple perceptron itself, except for the addition of more layers. We gained the ability to model more complex target functions by adding more nodes (hidden units) in the hidden layers; this corresponds to allowing more perceptrons in the decomposition of f. In fact, a suitably large 3-layer MLP can closely approximate just about any target function, and fit any data set, so it is a very powerful learning model. Use it with care: if your MLP is too large, you may lose generalization ability.

Once you fix the size of the MLP (number of hidden layers and number of hidden units in each layer), you learn the weights on every link (arrow) by fitting the data. Let's consider the simple perceptron,

    h(x) = θ(wᵗx).


When θ(s) = sign(s), learning the weights was already a hard combinatorial problem and had a variety of algorithms, including the pocket algorithm, for fitting data (Chapter 3). The combinatorial optimization problem is even harder with the MLP, for the same reason, namely that the sign(·) function is not smooth; a smooth, differentiable approximation to sign(·) will allow us to use analytic methods, rather than purely combinatorial methods, to find the optimal weights. We therefore approximate, or soften, the sign(·) function by using the tanh(·) function. The MLP is sometimes called a (hard) threshold neural network because the transformation function is a hard threshold at zero. Here, we choose θ(x) = tanh(x), which is in between linear and the hard threshold: nearly linear for x ≈ 0 and nearly ±1 for |x| large. The tanh(·) function is another example of a sigmoid (because its shape looks like a flattened-out 's'), related to the sigmoid we used for logistic regression.² Such networks are called sigmoidal neural networks. Just as we could use the weights learned from linear regression for classification, we could use weights learned using the sigmoidal neural network with tanh(·) activation function for classification by replacing the output activation function with sign(·).

[Figure: the linear, tanh, and sign functions; tanh interpolates between the linear function near zero and the hard threshold for large |x|.]

Exercise 7.5
Given w1 and ε > 0, find w2 such that |sign(w1ᵗxn) − tanh(w2ᵗxn)| ≤ ε for xn ∈ D. [Hint: For large enough a, sign(x) ≈ tanh(ax).]
The previous example shows that the sign(·) function can be closely approximated by the tanh(·) function. A concrete illustration of this is shown in the figure to the right. The figure shows how the in-sample error Ein varies with one of the weights in w on an example problem for the perceptron (blue curve) as compared to the sigmoidal version (red curve). The sigmoidal approximation captures the general shape of the error, so that if we minimize the sigmoidal in-sample error, we get a good approximation to minimizing the in-sample classification error.

[Figure: Ein versus one weight component; the sign-based error (blue, piecewise constant) and its smooth tanh-based approximation (red).]

²In logistic regression, we used the sigmoid because we wanted a probability as the output. Here, we use the soft tanh(·) because we want a friendly objective function to optimize.


7.2 Neural Networks


The neural network is our softened MLP. Let's begin with a graph representation of a feed-forward neural network (the only kind we will consider).

[Figure: a feed-forward network. The input layer (ℓ = 0) holds 1, x1, x2, …, xd; the hidden layers (0 < ℓ < L) each sum incoming signals s and apply θ(s); the output layer (ℓ = L) produces h(x).]

The graph representation depicts a function in our hypothesis set. While this graphical view is aesthetic and intuitive, with information flowing from the inputs on the far left, along links and through hidden nodes, ultimately to the output h(x) on the far right, it will be necessary to algorithmically describe the function being computed. Things are going to get messy, and this calls for a very systematic notation; bear with us.
A
7.2.1 Notation

There are layers labeled by ℓ = 0, 1, 2, …, L. In our example above, L = 3; i.e., we have three layers (the input layer ℓ = 0 is usually not considered a layer and is meant for feeding in the inputs). The layer ℓ = L is the output layer, which determines the value of the function. The layers in between, 0 < ℓ < L, are the hidden layers. We will use superscript (ℓ) to refer to a particular layer.

Each layer ℓ has dimension d^(ℓ), which means that it has d^(ℓ) + 1 nodes, labeled 0, 1, …, d^(ℓ). Every layer has one special node, which is called the bias node (labeled 0). This bias node is set to have an output 1, which is analogous to the fictitious x0 = 1 convention that we had for linear models.


Every arrow represents a weight or connection strength from a node in a
layer to a node in the next higher layer. Notice that the bias nodes have no
incoming weights. There are no other connection weights.3 A node with an
incoming weight indicates that some signal is fed into this node. Every such
node with an input has a transformation function . If (s) = sign(s), then we
have the MLP for classification. As we mentioned before, we will be using a soft
version of the MLP with (x) = tanh(x) to approximate the sign() function.
The tanh() is a soft threshold or sigmoid, and we already saw a related sigmoid
3 In a more general setting, weights can connect any two nodes, in addition to going

backward (i.e., one can have cycles). Such networks are called recurrent neural networks,
and we do not consider them here.


Ultimately, when we do classification, we replace the output sigmoid by the hard threshold sign(·). As a comment, if we were doing regression instead, our entire discussion goes through with the output transformation being replaced by the identity function (no transformation) so that the output is a real number. If we were doing logistic regression, we would replace the output tanh(·) sigmoid by the logistic regression sigmoid.

The neural network model Hnn is specified once you determine the architecture of the neural network, that is, the dimension of each layer d = [d^(0), d^(1), …, d^(L)] (L is the number of layers). A hypothesis h ∈ Hnn is specified by selecting weights for the links. Let's zoom into a node in hidden layer ℓ to see what weights need to be specified.

[Figure: zoom on node j in layer ℓ: the outputs x_i^(ℓ−1) of layer (ℓ−1) are multiplied by the weights w_ij^(ℓ) and summed into the signal s_j^(ℓ), which node j transforms into its output x_j^(ℓ).]
A node has an incoming signal s and an output x. The weights on links into the node from the previous layer are w^(ℓ), so the weights are indexed by the layer into which they go. Thus, the output of the nodes in layer ℓ − 1 is multiplied by weights w^(ℓ). We use subscripts to index the nodes in a layer. So, w_ij^(ℓ) is the weight into node j in layer ℓ from node i in the previous layer, the signal going into node j in layer ℓ is s_j^(ℓ), and the output of this node is x_j^(ℓ). There are some special nodes in the network. The zero nodes in every layer are constant nodes, set to output 1. They have no incoming weight, but they have an outgoing weight. The nodes in the input layer ℓ = 0 are for the input values, and have no incoming weight or transformation function.
For the most part, we only need to deal with the network on a layer-by-layer basis, so we introduce vector and matrix notation for that. We collect all the input signals to nodes 1, …, d^(ℓ) in layer ℓ in the vector s^(ℓ). Similarly, collect the outputs from nodes 0, …, d^(ℓ) in the vector x^(ℓ); note that x^(ℓ) ∈ {1} × R^(d^(ℓ)) because of the bias node 0. There are links connecting the outputs of all nodes in the previous layer to the inputs of layer ℓ. So, into layer ℓ, we have a (d^(ℓ−1) + 1) × d^(ℓ) matrix of weights W^(ℓ). The (i, j)-entry of W^(ℓ) is w_ij^(ℓ), going from node i in the previous layer to node j in layer ℓ.



[Figure: layer ℓ receives the signal vector s^(ℓ) through the weights W^(ℓ) from the outputs of layer (ℓ−1); its outputs x^(ℓ) feed layer (ℓ+1) through W^(ℓ+1).]

Layer ℓ parameters:
    signals in:   s^(ℓ)       d^(ℓ)-dimensional input vector
    outputs:      x^(ℓ)       (d^(ℓ) + 1)-dimensional output vector
    weights in:   W^(ℓ)       (d^(ℓ−1) + 1) × d^(ℓ) dimensional matrix
    weights out:  W^(ℓ+1)     (d^(ℓ) + 1) × d^(ℓ+1) dimensional matrix

After you fix the weights W^(ℓ) for ℓ = 1, …, L, you have specified a particular neural network hypothesis h ∈ Hnn. We collect all these weight matrices into a single weight parameter w = {W^(1), W^(2), …, W^(L)}, and sometimes we will write h(x; w) to explicitly indicate the dependence of the hypothesis on w.

7.2.2 Forward Propagation

The neural network hypothesis h(x) is computed by the forward propagation algorithm. First observe that the inputs and outputs of a layer are related by the transformation function,

    x^(ℓ) = [ 1 ; θ(s^(ℓ)) ],    (7.1)

where θ(s^(ℓ)) is a vector whose components are θ(s_j^(ℓ)). To get the input vector into layer ℓ, we compute the weighted sum of the outputs from the previous layer, with weights specified in W^(ℓ): s_j^(ℓ) = Σ_{i=0}^{d^(ℓ−1)} w_ij^(ℓ) x_i^(ℓ−1). This process is compactly represented by the matrix equation

    s^(ℓ) = (W^(ℓ))ᵗ x^(ℓ−1).    (7.2)

All that remains is to initialize the input layer to x^(0) = x (so d^(0) = d, the input dimension)⁴ and use Equations (7.2) and (7.1) in the following chain,

    x = x^(0) → s^(1) → x^(1) → s^(2) → x^(2) → ⋯ → s^(L) → x^(L) = h(x),

where each s^(ℓ) is obtained from x^(ℓ−1) via the weights W^(ℓ).

⁴Recall that the input vectors are also augmented with x0 = 1.


Forward propagation to compute h(x):

1: x^(0) ← x   [Initialization]
2: for ℓ = 1 to L do   [Forward Propagation]
3:     s^(ℓ) ← (W^(ℓ))ᵗ x^(ℓ−1)
4:     x^(ℓ) ← [ 1 ; θ(s^(ℓ)) ]
5: h(x) = x^(L)   [Output]

[Figure: one step of forward propagation: the outputs x^(ℓ−1) of layer (ℓ−1) are multiplied by the weights W^(ℓ) and summed to give s^(ℓ); each node applies θ(·), and the bias 1 is prepended to give x^(ℓ).]
A
After forward propagation, the output vector x^(ℓ) at every layer ℓ = 0, …, L has been computed.
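The algorithm translates almost line for line into code. The following is a minimal sketch in Python with NumPy, not the book's implementation: each matrix in weights has shape (d^(ℓ−1)+1) × d^(ℓ) as in the table above, every hidden layer's output gets the bias 1 prepended, and the output layer (which feeds no further layer) does not. The flag output_identity anticipates the identity output transformation discussed later.

    import numpy as np

    def forward_prop(x, weights, output_identity=False):
        """Forward propagation: weights = [W1, ..., WL], each of shape
        (d_prev + 1, d_l). Returns the per-layer signals s and outputs x."""
        xs = [np.concatenate(([1.0], np.atleast_1d(x)))]   # x^(0) with bias
        ss = []
        L = len(weights)
        for l, W in enumerate(weights, start=1):
            s = W.T @ xs[-1]                # (7.2): s^(l) = (W^(l))^t x^(l-1)
            ss.append(s)
            out = s if (output_identity and l == L) else np.tanh(s)
            if l < L:                       # (7.1): prepend the bias node
                out = np.concatenate(([1.0], out))
            xs.append(out)
        return ss, xs                        # h(x) = xs[-1]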

Exercise 7.6
Let V and Q be the number of nodes and weights in the neural network,

    V = Σ_{ℓ=0}^{L} d^(ℓ),      Q = Σ_{ℓ=1}^{L} d^(ℓ) (d^(ℓ−1) + 1).

In terms of V and Q, how many computations are made in forward propagation (additions, multiplications and evaluations of θ)?
[Answer: O(Q) multiplications and additions, and O(V) θ-evaluations.]

If we want to compute Ein, all we need is h(xn) and yn. For the sum of squares,

    Ein(w) = (1/N) Σ_{n=1}^{N} (h(xn; w) − yn)²
           = (1/N) Σ_{n=1}^{N} (x_n^(L) − yn)².


We now discuss how to minimize Ein to obtain the learned weights. It will be a
direct application of gradient descent, with a special algorithm that computes
the gradient efficiently.

7.2.3 Backpropagation Algorithm

We studied an algorithm for getting to a local minimum of a smooth in-sample error surface in Chapter 3, namely gradient descent: initialize the weights to w(0), and for t = 1, 2, … update the weights by taking a step in the negative gradient direction,

    w(t + 1) = w(t) − η∇Ein(w(t));

we called this (batch) gradient descent. To implement gradient descent, we need the gradient.

Exercise 7.7
For the sigmoidal perceptron, h(x) = tanh(wᵗx), let the in-sample error be Ein(w) = (1/N) Σ_{n=1}^{N} (tanh(wᵗxn) − yn)². Show that

    ∇Ein(w) = (2/N) Σ_{n=1}^{N} (tanh(wᵗxn) − yn)(1 − tanh²(wᵗxn)) xn.

If w → ∞, what happens to the gradient? How is this related to why it is hard to optimize the perceptron?
A
We now consider the sigmoidal multi-layer neural network with θ(x) = tanh(x). Since h(x) is smooth, we can apply gradient descent to the resulting error function. To do so, we need the gradient ∇Ein(w). Recall that the weight vector w contains all the weight matrices W^(1), …, W^(L), and we need the derivatives with respect to all these weights. Unlike the sigmoidal perceptron in Exercise 7.7, for the multilayer sigmoidal network there is no simple closed form expression for the gradient. Consider an in-sample error which is the sum of the point-wise errors over the data points (as is the squared in-sample error),
    Ein(w) = (1/N) Σ_{n=1}^{N} en,

where en = e(h(xn), yn). For the squared error, e(h, y) = (h − y)². To compute the gradient of Ein, we need its derivative with respect to each weight matrix:

    ∂Ein/∂W^(ℓ) = (1/N) Σ_{n=1}^{N} ∂en/∂W^(ℓ).    (7.3)
The basic building block in (7.3) is the partial derivative of the error on a data point, e, with respect to W^(ℓ). A quick and dirty way to get ∂e/∂W^(ℓ) is to use the numerical finite-difference approach.
The complexity of obtaining the partial derivatives with respect to every weight is O(Q²), where Q is the number of weights (see Problem 7.6). From (7.3), we have to compute these derivatives for every data point, so the numerical approach is computationally prohibitive. We now derive an elegant dynamic programming algorithm known as backpropagation.⁵ Backpropagation allows us to compute the partial derivatives with respect to every weight efficiently, using O(Q) computation. We describe backpropagation for getting the partial derivative of the error e, but the algorithm is general and can be used to get the partial derivative of any function of the output h(x) with respect to the weights.
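As an aside, the finite-difference idea remains handy as a correctness check for backpropagation. A minimal sketch (our own helpers, reusing the forward_prop sketch above) that perturbs one weight at a time:

    import numpy as np

    def squared_error(weights, x, y):
        return float((forward_prop(x, weights)[1][-1][0] - y) ** 2)

    def numerical_gradient(weights, x, y, eps=1e-6):
        # Central differences on every entry of every weight matrix:
        # O(Q) error evaluations, each costing O(Q) work, so O(Q^2) overall.
        grads = []
        for l in range(len(weights)):
            G = np.zeros_like(weights[l])
            for idx in np.ndindex(*weights[l].shape):
                old = weights[l][idx]
                weights[l][idx] = old + eps
                e_plus = squared_error(weights, x, y)
                weights[l][idx] = old - eps
                e_minus = squared_error(weights, x, y)
                weights[l][idx] = old
                G[idx] = (e_plus - e_minus) / (2 * eps)
            grads.append(G)
        return grads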
Backpropagation is based on several applications of the chain rule to write partial derivatives in layer ℓ using partial derivatives in layer (ℓ + 1). To describe the algorithm, we define the sensitivity vector for layer ℓ, which is the sensitivity (gradient) of the error e with respect to the input signal s^(ℓ) that goes into layer ℓ. We denote the sensitivity by δ^(ℓ),

    δ^(ℓ) = ∂e/∂s^(ℓ).

The sensitivity quantifies how e changes with s^(ℓ). Using the sensitivity, we can write the partial derivatives with respect to the weights W^(ℓ) as

    ∂e/∂W^(ℓ) = x^(ℓ−1) (δ^(ℓ))ᵗ.    (7.4)
We will derive this formula later, but for now let's examine it closely. The partial derivatives on the left form a matrix with dimensions (d^(ℓ−1) + 1) × d^(ℓ), and the outer product of the two vectors on the right gives exactly such a matrix. The partial derivatives have contributions from two components. (i) The output vector of the layer from which the weights originate; the larger the output, the more sensitive e is to the weights in the layer. (ii) The sensitivity vector of the layer into which the weights go; the larger the sensitivity vector, the more sensitive e is to the weights in that layer.
The outputs x^(ℓ) for every layer ℓ ≥ 0 can be computed by a forward propagation. So to get the partial derivatives, it suffices to obtain the sensitivity vectors δ^(ℓ) for every layer ℓ ≥ 1 (remember that there is no input signal to layer ℓ = 0). It turns out that the sensitivity vectors can be obtained by running a slightly modified version of the neural network backwards, and hence the name backpropagation. In forward propagation, each layer outputs the vector x^(ℓ), and in backpropagation, each layer outputs (backwards) the vector δ^(ℓ). In forward propagation, we compute x^(ℓ) from x^(ℓ−1), and in backpropagation, we compute δ^(ℓ) from δ^(ℓ+1). The basic idea is illustrated in the following figure.
⁵Dynamic programming is an elegant algorithmic technique in which one builds up a solution to a complex problem using the solutions to related but simpler problems.


[Figure: one step of backpropagation: layer (ℓ+1) outputs δ^(ℓ+1) backwards; it is multiplied by the weights W^(ℓ+1) and summed, and each node in layer ℓ multiplies by θ′(s^(ℓ)) to produce δ^(ℓ).]

As you can see in the figure, the neural network is slightly modified only in that we have changed the transformation function for the nodes. In forward propagation, the transformation was the sigmoid θ(·). In backpropagation, the transformation is multiplication by θ′(s^(ℓ)), where s^(ℓ) is the input to the node. So the transformation function is now different for each node, and it depends on the input to the node, which depends on x. This input was computed in the forward propagation. For the tanh(·) transformation function, tanh′(s^(ℓ)) = 1 − tanh²(s^(ℓ)) = 1 − x^(ℓ) ⊗ x^(ℓ), where ⊗ denotes component-wise multiplication.
In the figure, layer (ℓ+1) outputs (backwards) the sensitivity vector δ^(ℓ+1), which gets multiplied by the weights W^(ℓ+1), summed, and passed into the nodes in layer ℓ. Nodes in layer ℓ multiply by θ′(s^(ℓ)) to get δ^(ℓ). Using ⊗, a shorthand notation for this backpropagation step is:

    δ^(ℓ) = θ′(s^(ℓ)) ⊗ [W^(ℓ+1) δ^(ℓ+1)]_{1..d^(ℓ)},    (7.5)

where the vector [W^(ℓ+1) δ^(ℓ+1)]_{1..d^(ℓ)} contains components 1, …, d^(ℓ) of the vector W^(ℓ+1) δ^(ℓ+1) (excluding the bias component, which has index 0). This formula is not surprising. The sensitivity of e to inputs of layer ℓ is proportional to the slope of the activation function in layer ℓ (a bigger slope means a small change in s^(ℓ) will have a larger effect on x^(ℓ)), the size of the weights going out of the layer (bigger weights mean a small change in x^(ℓ) will have more impact on s^(ℓ+1)), and the sensitivity in the next layer (a change in layer ℓ affects the inputs to layer ℓ+1, so if e is more sensitive to layer ℓ+1, then it will also be more sensitive to layer ℓ).
We will derive this backward recursion later. For now, observe that if you know δ^(ℓ+1), then you can get δ^(ℓ). We use δ^(L) to seed the backward process, and we can get that explicitly because e = (x^(L) − y)² = (θ(s^(L)) − y)².


Therefore,

    δ^(L) = ∂e/∂s^(L)
          = ∂(x^(L) − y)²/∂s^(L)
          = 2(x^(L) − y) · ∂x^(L)/∂s^(L)
          = 2(x^(L) − y) θ′(s^(L)).

When the output transformation is tanh(·), θ′(s^(L)) = 1 − (x^(L))² (classification); when the output transformation is the identity (regression), θ′(s^(L)) = 1. Now, using (7.5), we can compute all the sensitivities:

    δ^(1) ← δ^(2) ← ⋯ ← δ^(L−1) ← δ^(L).

Note that since there is only one output node, s^(L) is a scalar, and so too is δ^(L). The algorithm box below summarizes backpropagation.

Backpropagation to compute sensitivities δ^(ℓ):

Input: a data point (x, y).
0: Run forward propagation on x to compute and save:
       s^(ℓ) for ℓ = 1, …, L;
       x^(ℓ) for ℓ = 0, …, L.
1: δ^(L) ← 2(x^(L) − y) θ′(s^(L))   [Initialization]
       where θ′(s^(L)) = 1 − (x^(L))² if θ(s) = tanh(s), and θ′(s^(L)) = 1 if θ(s) = s.
2: for ℓ = L − 1 down to 1 do   [Back-Propagation]
3:     Let θ′(s^(ℓ)) = [1 − x^(ℓ) ⊗ x^(ℓ)]_{1..d^(ℓ)}.
4:     Compute the sensitivity δ^(ℓ) from δ^(ℓ+1):
           δ^(ℓ) ← θ′(s^(ℓ)) ⊗ [W^(ℓ+1) δ^(ℓ+1)]_{1..d^(ℓ)}
In step 3, we assumed tanh hidden-node transformations. If the hidden-unit transformation functions are not tanh(·), then the derivative in step 3 should be updated accordingly. Using forward propagation, we compute x^(ℓ) for ℓ = 0, …, L, and using backpropagation, we compute δ^(ℓ) for ℓ = 1, …, L. Finally, we get the partial derivative of the error on a single data point using Equation (7.4). Nothing illuminates the moving parts better than working an example from start to finish.


Example 7.1. Consider the following neural network.

[Figure: one input x; two tanh hidden units in the first hidden layer; one tanh hidden unit in the second hidden layer; a tanh output node producing h(x). The link weights are the entries of W^(1), W^(2), W^(3) below.]

There is a single input, and the weight matrices are:

    W^(1) = [ 0.1  0.2 ]       W^(2) = [ 0.2 ]       W^(3) = [ 1 ]
            [ 0.3  0.4 ],              [  1  ],              [ 2 ].
                                       [ −3  ]

For the data point x = 2, y = 1, forward propagation gives:

    x^(0) = [1, 2]ᵗ;
    s^(1) = (W^(1))ᵗ x^(0) = [0.1·1 + 0.3·2, 0.2·1 + 0.4·2]ᵗ = [0.7, 1]ᵗ;
    x^(1) = [1, 0.60, 0.76]ᵗ;
    s^(2) = [−1.48];    x^(2) = [1, −0.90]ᵗ;
    s^(3) = [−0.8];     x^(3) = [−0.66].

We show above how s^(1) = (W^(1))ᵗx^(0) is computed. Backpropagation gives

    δ^(3) = [−1.855];
    δ^(2) = [(1 − (−0.90)²) · 2 · (−1.855)] = [−0.69];
    δ^(1) = [−0.44, 0.88]ᵗ.

We have explicitly shown how δ^(2) is obtained from δ^(3). It is now a simple matter to combine the output vectors x^(ℓ) with the sensitivity vectors δ^(ℓ) using (7.4) to obtain the partial derivatives that are needed for the gradient:

    ∂e/∂W^(1) = x^(0)(δ^(1))ᵗ = [ −0.44  0.88 ]      ∂e/∂W^(2) = [ −0.69 ]      ∂e/∂W^(3) = [ −1.85 ]
                                [ −0.88  1.75 ];                 [ −0.42 ]                 [  1.67 ].
                                                                 [ −0.53 ]
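To see these numbers come out of code, here is a minimal backpropagation sketch in Python with NumPy that pairs with the forward_prop sketch above (our own helper, not the book's); running it on this example should reproduce the matrices above up to rounding.

    import numpy as np

    def backprop(x, y, weights):
        """Gradients [de/dW1, ..., de/dWL] of e = (h(x) - y)^2 for a
        network with tanh transformations everywhere (incl. the output)."""
        ss, xs = forward_prop(x, weights)
        L = len(weights)
        delta = 2 * (xs[-1] - y) * (1 - xs[-1] ** 2)    # delta^(L)
        grads = [None] * L
        for l in range(L, 0, -1):
            grads[l - 1] = np.outer(xs[l - 1], delta)   # (7.4): x^(l-1) (delta^(l))^t
            if l > 1:
                back = weights[l - 1] @ delta           # W^(l) delta^(l)
                delta = (1 - xs[l - 1][1:] ** 2) * back[1:]   # (7.5), bias dropped
        return grads

    W1 = np.array([[0.1, 0.2], [0.3, 0.4]])
    W2 = np.array([[0.2], [1.0], [-3.0]])
    W3 = np.array([[1.0], [2.0]])
    grads = backprop(np.array([2.0]), 1.0, [W1, W2, W3])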

Exercise 7.8
Repeat the computations in Example 7.1 for the case when the output transformation is the identity. You should compute s^(ℓ), x^(ℓ), δ^(ℓ) and ∂e/∂W^(ℓ).


Let's derive (7.4) and (7.5), which are the core equations of backpropagation. There's nothing to it but repeated application of the chain rule. If you wish to trust our math, you won't miss much by moving on.

    Begin safe skip: If you trust our math, you can skip this part
    without compromising the logical sequence. A similar green box
    will tell you when to rejoin.

To begin, let's take a closer look at the partial derivative ∂e/∂W^(ℓ). The situation is illustrated in Figure 7.1.

[Figure 7.1: Chain of dependencies from W^(ℓ) to x^(L): the weights W^(ℓ) produce s^(ℓ) from x^(ℓ−1); s^(ℓ) produces x^(ℓ), which feeds forward through W^(ℓ+1) and eventually determines x^(L) and the error e = (x^(L) − y)².]

We can identify the following chain of dependencies by which W^(ℓ) influences the output x^(L), and hence the error e:

    W^(ℓ) → s^(ℓ) → x^(ℓ) → s^(ℓ+1) → ⋯ → x^(L) = h.

To derive (7.4), we drill down to a single weight and use the chain rule. For a single weight w_ij^(ℓ), a change in w_ij^(ℓ) only affects s_j^(ℓ), and so by the chain rule,

    ∂e/∂w_ij^(ℓ) = (∂s_j^(ℓ)/∂w_ij^(ℓ)) · (∂e/∂s_j^(ℓ)) = x_i^(ℓ−1) δ_j^(ℓ),

where the last equality follows because s_j^(ℓ) = Σ_{α=0}^{d^(ℓ−1)} w_αj^(ℓ) x_α^(ℓ−1) and by the definition of δ_j^(ℓ). We have derived the component form of (7.4).
We now derive the component form of (7.5). Since e depends on s_j^(ℓ) only through x_j^(ℓ) (see Figure 7.1), by the chain rule, we have:

    δ_j^(ℓ) = ∂e/∂s_j^(ℓ) = (∂x_j^(ℓ)/∂s_j^(ℓ)) · (∂e/∂x_j^(ℓ)) = θ′(s_j^(ℓ)) · ∂e/∂x_j^(ℓ).

To get the partial derivative ∂e/∂x^(ℓ), we need to understand how e changes due to changes in x^(ℓ). Again, from Figure 7.1, a change in x^(ℓ) only affects


s^(ℓ+1) and hence e. Because a particular component of x^(ℓ) can affect every component of s^(ℓ+1), we need to sum these dependencies using the chain rule:

    ∂e/∂x_j^(ℓ) = Σ_{k=1}^{d^(ℓ+1)} (∂s_k^(ℓ+1)/∂x_j^(ℓ)) · (∂e/∂s_k^(ℓ+1)) = Σ_{k=1}^{d^(ℓ+1)} w_jk^(ℓ+1) δ_k^(ℓ+1).

Putting all this together, we have arrived at the component version of (7.5):

    δ_j^(ℓ) = θ′(s_j^(ℓ)) Σ_{k=1}^{d^(ℓ+1)} w_jk^(ℓ+1) δ_k^(ℓ+1).    (7.6)

Intuitively, the first term comes from the impact of s^(ℓ) on x^(ℓ); the summation is the impact of x^(ℓ) on s^(ℓ+1), and the impact of s^(ℓ+1) on h is what gives us back the sensitivities in layer (ℓ+1), resulting in the backward recursion.
    End safe skip: Those who skipped are now rejoining us to discuss
    how backpropagation gives us ∇Ein.

Backpropagation works with a data point (x, y) and weights w = {W^(1), …, W^(L)}. Since we run one forward and one backward propagation to compute the outputs x^(ℓ) and the sensitivities δ^(ℓ), the running time is order of the number of weights in the network. We compute once for each data point (xn, yn) to get the gradient on that data point and, using the sum in (7.3), we aggregate these single-point gradients to get the full batch gradient ∇Ein. We summarize the algorithm below.

Algorithm to Compute Ein(w) and g = ∇Ein(w):

Input: w = {W^(1), …, W^(L)}; D = (x1, y1), …, (xN, yN).
Output: error Ein(w) and gradient g = {G^(1), …, G^(L)}.
1: Initialize: Ein = 0, and G^(ℓ) = 0 · W^(ℓ) for ℓ = 1, …, L.
2: for each data point (xn, yn), n = 1, …, N, do
3:     Compute x^(ℓ) for ℓ = 0, …, L.   [forward propagation]
4:     Compute δ^(ℓ) for ℓ = L, …, 1.   [backpropagation]
5:     Ein ← Ein + (1/N)(x^(L) − yn)².
6:     for ℓ = 1, …, L do
7:         G^(ℓ)(xn) = x^(ℓ−1)(δ^(ℓ))ᵗ
8:         G^(ℓ) ← G^(ℓ) + (1/N) G^(ℓ)(xn)

(G^(ℓ)(xn) is the gradient on data point xn.) The weight update for a single iteration of fixed learning rate gradient descent is W^(ℓ) ← W^(ℓ) − η G^(ℓ), for ℓ = 1, …, L. We do all this for one iteration of gradient descent, a costly computation for just one little step.


In Chapter 3, we discussed stochastic gradient descent (SGD) as a more efficient alternative to the batch mode. Rather than wait for the aggregate gradient G^(ℓ) at the end of the iteration, one immediately updates the weights as each data point is sequentially processed, using the single-point gradient in step 7 of the algorithm: W^(ℓ) ← W^(ℓ) − η G^(ℓ)(xn). In this sequential version, you still run a forward and backward propagation for each data point, but make N updates to the weights. A comparison of batch gradient descent with SGD is shown to the right. We used 500 training examples from the digits data and a 2-layer neural network with 5 hidden units and learning rate η = 0.01. The SGD curve is erratic because one is not minimizing the total error at each iteration, but the error on a specific data point. One method to dampen this erratic behavior is to decrease the learning rate as the minimization proceeds.

[Figure: log10(error) versus log10(iteration) for batch gradient descent and SGD; SGD reaches much lower error in far fewer iterations, with an erratic curve.]
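A minimal sketch of this sequential version, reusing the backprop sketch above (the learning rate, epoch count, and random ordering are our own illustrative choices):

    import numpy as np

    def sgd(X, y, weights, eta=0.01, epochs=100, rng=np.random.default_rng(0)):
        """Sequential (stochastic) gradient descent on the squared error."""
        N = len(X)
        for _ in range(epochs):
            for n in rng.permutation(N):        # process points in random order
                grads = backprop(X[n], y[n], weights)
                for l in range(len(weights)):   # one update per data point
                    weights[l] -= eta * grads[l]
        return weights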

The speed at which you minimize Ein can depend heavily on the optimization algorithm you use. SGD appears significantly better than plain vanilla gradient descent, but we can do much better; even SGD is not very efficient. In Section 7.5, we discuss some other powerful methods (for example, conjugate gradients) that can significantly improve upon gradient descent and SGD, by making more effective use of the gradient.
A
Initialization and Termination. Choosing the initial weights and deciding when to stop the gradient descent can be tricky, as compared with logistic regression, because the in-sample error is not convex anymore. From Exercise 7.7, if the weights are initialized too large so that tanh(wᵗxn) ≈ ±1, then the gradient will be close to zero and the algorithm won't get anywhere. This is especially a problem if you happen to initialize the weights to the wrong sign. It is usually best to initialize the weights to small random values where tanh(wᵗxn) ≈ 0, so that the algorithm has the flexibility to move the weights easily to fit the data. One good choice is to initialize using Gaussian random weights, wi ~ N(0, σw²), where σw is small. But how small should σw be? A simple heuristic is that we want |wᵗxn|² to be small. Since E_w[|wᵗxn|²] = σw²‖xn‖², we should choose σw so that σw² · max_n ‖xn‖² ≪ 1.
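A sketch of this initialization heuristic in NumPy (the factor 0.05 standing in for the "≪ 1" margin is our own illustrative choice):

    import numpy as np

    def init_weights(dims, X, rng=np.random.default_rng(0)):
        """dims = [d0, d1, ..., dL]; returns small Gaussian weight matrices
        with sigma_w chosen so that sigma_w^2 * max_n ||x_n||^2 << 1."""
        max_norm2 = max(np.sum(x ** 2) + 1 for x in X)   # +1 for the bias coordinate
        sigma_w = np.sqrt(0.05 / max_norm2)
        return [sigma_w * rng.standard_normal((dims[l] + 1, dims[l + 1]))
                for l in range(len(dims) - 1)]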

Exercise 7.9
What can go wrong if you just initialize all the weights to exactly zero?


When do we stop? It is risky to rely solely on the size of the gradient to stop. As illustrated on the right, you might stop prematurely when the iteration reaches a relatively flat region (which is more common than you might suspect). A combination of stopping criteria is best in practice: for example, stopping only when there is marginal error improvement coupled with small error, plus an upper bound on the number of iterations.

[Figure: Ein versus weights w, showing a flat plateau well above the minimum where the gradient is nearly zero.]

7.2.4 Regression for Classification

In Chapter 3, we mentioned that you could use the weights resulting from linear regression as perceptron weights for classification, and you can do the same with neural networks. Specifically, fit the classification data (yn = ±1) as though it were a regression problem. This means you use the identity function as the output-node transformation, instead of tanh(·). This can be a great help because of the flat regions which the network is susceptible to when using gradient descent, which happen often in training. The reason for these flat periods in the optimization is the exceptionally flat nature of the tanh function when its argument gets large. If for whatever reason the weights get large toward the beginning of the training, then the error surface begins to look flat, because the tanh has been saturated. Now, gradient descent cannot make any progress, and you might think you are at a minimum, when in fact you are far from a minimum. The problem of a flat error surface is considerably mitigated when the output transformation is the identity, because you can recover from an initial bad move if it happens to take you to large weights (the linear output never saturates). For a concrete example of a prematurely flat in-sample error, see the figure in Example 7.2.

7.3 Approximation versus Generalization



A large enough MLP with 2 hidden layers can approximate smooth decision functions arbitrarily well. It turns out that a single hidden layer suffices.⁶ A neural network with a single hidden layer having m hidden units (d^(1) = m) implements a function of the form

    h(x) = w_01^(2) + Σ_{j=1}^{m} w_j1^(2) θ( Σ_{i=0}^{d} w_ij^(1) x_i ).

[Figure: a single-hidden-layer network with inputs 1, x1, x2, …, xd, m hidden units, and output h.]

⁶Though one hidden layer is enough, it is not necessarily the most efficient way to fit the data; for example, a much smaller 2-hidden-layer network may exist.


This is a cumbersome representation for such a simple network. A simplified notation for this special case is much more convenient. For the second-layer weights, we will just use w0, w1, …, wm, and we will use vj to denote the jth column of the first-layer weight matrix W^(1), for j = 1, …, m. With this simpler notation, the hypothesis becomes much more pleasant looking:

    h(x) = w0 + Σ_{j=1}^{m} wj θ(vjᵗx).

Neural Network versus Nonlinear Transforms. Recall the linear model from Chapter 3, with nonlinear transform Φ(x) that transforms x to z:

    x → z = Φ(x) = [1, φ1(x), φ2(x), …, φM(x)]ᵗ.

The linear model with nonlinear transform is a hypothesis of the form

    h(x) = w0 + Σ_{j=1}^{M} wj φj(x).

The φj(·) are called basis functions. On face value, the neural network and the linear model look nearly identical, by setting θ(vjᵗx) = φj(x). There is a subtle difference, though, and this difference has a big practical impact. With the nonlinear transform, the basis functions φj(·) are fixed ahead of time, before you look at the data. With the neural network, the basis function θ(vjᵗx) has a parameter vj inside, and we can tune vj after seeing the data. First, this has a computational impact because the parameter vj appears inside the nonlinearity θ(·); the model is no longer linear in its parameters. We saw a similar effect with the centers of the radial basis function network in Chapter 6. Models which are nonlinear in their parameters pose a significant computational challenge when it comes to fitting to data. Second, it means that we can tune the basis functions to the data. Tunable basis functions, although computationally harder to fit to data, do give us considerably more flexibility to fit the data than do fixed basis functions. With m tunable basis functions one has roughly the same approximation power to fit the data as with md fixed basis functions. For large d, tunable basis functions have considerably more power.

Exercise 7.10
It is no surprise that adding nodes in the hidden layer gives the neural network more approximation ability, because you are adding more parameters. How many weight parameters are there in a neural network with architecture specified by d = [d^(0), d^(1), …, d^(L)], a vector giving the number of nodes in each layer? Evaluate your formula for a 2-hidden-layer network with 10 hidden nodes in each hidden layer.


Approximation Capability of the Neural Network. It is possible to quantify how the approximation ability of the neural network grows as you increase m, the number of hidden units. Such results fall into the field known as functional approximation theory, a field which, in the context of neural networks, has produced some interesting results. Usually one starts by making some assumption about the smoothness (or complexity) of the target function f. On the theoretical side, you have lost some generality as compared with, for example, the VC analysis. However, in practice, such assumptions are OK because target functions are often smooth. If you assume that the data are generated by a target function with complexity⁷ at most Cf, then a variety of bounds exist on how small an in-sample error is achievable with m hidden units. For regression with squared error, one can achieve in-sample error

    Ein(h) = (1/N) Σ_{n=1}^{N} (h(xn) − yn)² ≤ (2RCf)²/m,

where R = max_n ‖xn‖ is the radius of the data. The in-sample error decreases inversely with the number of hidden units. For classification, a similar result with slightly worse dependence on m exists. With high probability,

    Ein ≤ Eout* + O(Cf/√m),

where Eout* is the out-of-sample error of the optimal classifier that we discussed in Chapter 6. The message is that Ein can be made small by choosing a large enough hidden layer.

Generalization and the VC-Dimension. For sufficiently large m, we can get Ein to be small, so what remains is to ensure that Ein ≈ Eout. We need to look at the VC dimension. For the two-layer hard-threshold neural network (MLP), where θ(x) = sign(x), we show a simple bound on the VC dimension:

    dvc ≤ (const) · md log(md).    (7.7)

For a general sigmoid neural network, dvc can be infinite. For the tanh(·) sigmoid with sign(·) output node, one can show that dvc = O(V Q), where V is the number of hidden nodes and Q is the number of weights; for the two-layer case,

    dvc = O(md(m + d)).

The tanh(·) network has higher VC dimension than the 2-layer MLP, which is not surprising because tanh(·) can approximate sign(·) by choosing large enough weights. So every dichotomy that can be implemented by the MLP can also be implemented by the tanh(·) neural network.

⁷We do not describe details of how the complexity of a target can be measured. One measure is the size of the high-frequency components of f in its Fourier transform. Another, more restrictive, measure is the number of bounded derivatives f has.


To derive (7.7), we will actually show a more general result. Consider the hypothesis set illustrated by the network on the right. Hidden node i in the hidden layer implements a function hi ∈ Hi which maps Rᵈ to {+1, −1}; the output node implements a function hC ∈ HC which maps Rᵐ to {+1, −1}. This output node combines the outputs of the hidden-layer nodes to implement the hypothesis

    h(x) = hC(h1(x), …, hm(x)).

(For the 2-layer MLP, all the hypothesis sets are perceptrons.)

[Figure: input x feeds m hidden hypotheses from H1, …, Hm; their outputs feed a combining hypothesis from HC to produce h(x).]
Suppose the VC-dimension of Hi is di and the VC-dimension of HC is dc. Fix x1, …, xN, and the hypotheses h1, …, hm implemented by the hidden nodes. The hypotheses h1, …, hm are now fixed basis functions defining a transform to Rᵐ,

    xn → zn = [h1(xn), …, hm(xn)]ᵗ,    n = 1, …, N.

The transformed points are binary vectors in Rᵐ. Given h1, …, hm, the points x1, …, xN are transformed to an arrangement of points z1, …, zN in Rᵐ. Using our flexibility to choose h1, …, hm, we now upper bound the number of possible different arrangements z1, …, zN we can get.
The first components of all the zn are given by h1(x1), …, h1(xN), which is a dichotomy of x1, …, xN implemented by h1. Since the VC-dimension of H1 is d1, there are at most N^d1 such dichotomies.⁸ That is, there are at most N^d1 different ways of choosing assignments to all the first components of the zn. Similarly, an assignment to all the ith components can be chosen in at most N^di ways. Thus, the total number of possible arrangements for z1, …, zN is at most

    Π_{i=1}^{m} N^di = N^(Σ_{i=1}^{m} di).

Each of these arrangements can be dichotomized in at most N^dc ways, since the VC-dimension of HC is dc. Each such dichotomy for a particular arrangement gives one dichotomy of the data x1, …, xN. Thus, the maximum number of different dichotomies we can implement on x1, …, xN is upper bounded by the product: the number of possible arrangements times the number of ways

⁸Recall that for any hypothesis set with VC-dimension dvc and any N ≥ dvc, m(N) (the maximum number of implementable dichotomies) is bounded by (eN/dvc)^dvc ≤ N^dvc (for the sake of simplicity we assume that dvc ≥ 2).


of dichotomizing a particular arrangement. We have shown that

    m(N) ≤ N^dc · N^(Σ_{i=1}^{m} di) = N^(dc + Σ_{i=1}^{m} di).

Let D = dc + Σ_{i=1}^{m} di. After some algebra (left to the reader), if N ≥ 2D log₂ D, then m(N) < 2^N, from which we conclude that dvc ≤ 2D log₂ D. For the 2-layer MLP, di = d + 1 and dc = m + 1, and so we have that D = dc + Σ_{i=1}^{m} di = m(d + 2) + 1 = O(md). Thus, dvc = O(md log(md)).
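For the algebra left to the reader, here is one possible sketch (our own, assuming D ≥ 5, and checking the inequality at N = 2D log₂ D):

    % Since m(N) <= N^D, it suffices to show N^D < 2^N, i.e. D log2 N < N.
    % At N = 2D log2 D:
    \begin{align*}
    D\log_2 N \;=\; D\log_2\!\left(2D\log_2 D\right)
      &= D\left(1 + \log_2 D + \log_2\log_2 D\right)\\
      &< 2D\log_2 D \;=\; N,
    \end{align*}
    % using 1 + log2(log2 D) < log2 D, which holds for D >= 5.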

Our analysis looks very crude, but it is almost tight: it is possible to shatter Ω(md) points with m hidden units (see Problem 7.16), and so the upper bound can be loose by at most a logarithmic factor. Using the VC-dimension, the generalization error bar from Chapter 2 is O(√((dvc log N)/N)), which for the 2-layer MLP is O(√((md log(md) log N)/N)).

We will get good generalization if m is not too large, and we can fit the data if m is large enough. A balance is called for. For example, choosing m = (1/d)√N: as N → ∞, Eout ≈ Ein and Ein → Eout*. That is, Eout → Eout* (the optimal performance) as N grows, and m grows sub-linearly with N. In practice the asymptotic regime is a luxury, and one does not simply set m ∝ √N. These theoretical results are a good guideline, but the best out-of-sample performance usually results when you control overfitting using validation (to select the number of hidden units) and regularization to prevent overfitting.

We conclude with a note on where neural networks sit in the parametric versus nonparametric debate. There are explicit parameters to be learned, so parametric seems right. But distinctive features of nonparametric models also stand out: the neural network is generic and flexible, and can realize optimal performance when N grows. Neither parametric nor nonparametric captures the whole story. We choose to label neural networks as semi-parametric.
7.4 Regularization and Validation

The multi-layer neural network is powerful, and, coupled with gradient descent (a good algorithm to minimize Ein), we have a recipe for overfitting. We discuss some practical techniques to help.
7.4.1 Weight Based Complexity Penalties

As with linear models, one can regularize the learning using a complexity penalty by minimizing an augmented error (penalized in-sample error). The squared weight decay regularizer is popular, having augmented error:

    Eaug(w) = Ein(w) + (λ/N) Σ_{ℓ,i,j} (w_ij^(ℓ))².

The regularization parameter λ is selected via validation, as discussed in Chapter 4. To apply gradient descent, we need ∇Eaug(w). The penalty term adds to the gradient a term proportional to the weights,

    ∂Eaug(w)/∂W^(ℓ) = ∂Ein(w)/∂W^(ℓ) + (2λ/N) W^(ℓ).

We know how to obtain ∂Ein/∂W^(ℓ) using backpropagation. The penalty term adds a component to the weight update that is in the negative direction of w, i.e., towards zero weights; hence the term weight decay.
Another similar regularizer is weight elimination, having augmented error:

    Eaug(w, λ) = Ein(w) + (λ/N) Σ_{ℓ,i,j} (w_ij^(ℓ))² / (1 + (w_ij^(ℓ))²).

For a small weight, the penalty term is much like weight decay, and will decay that weight to zero. For a large weight, the penalty term is approximately a constant, and contributes little to the gradient. Small weights decay faster than large weights, and the effect is to eliminate those smaller weights.
Exercise 7.11
For weight elimination, show that

    ∂Eaug/∂w_ij^(ℓ) = ∂Ein/∂w_ij^(ℓ) + (2λ/N) · w_ij^(ℓ) / (1 + (w_ij^(ℓ))²)².

Argue that weight elimination shrinks small weights faster than large ones.

7.4.2 Early Stopping

Another method for regularization, which on face value does not look like regularization, is early stopping. An iterative method such as gradient descent does not explore your full hypothesis set all at once. With more iterations, more of your hypothesis set is explored. This means that by using fewer iterations, you explore a smaller hypothesis set and should get better generalization.⁹
Consider fixed-step gradient descent with step size η. At the first step, we start at weights w0 and take a step of size η to w1 = w0 − η g0/‖g0‖. Because we have taken a step in the direction of the negative gradient, we have looked at all the hypotheses in the shaded region shown on the right. This is because a step in the negative gradient leads to the sharpest decrease in Ein(w), and so w1 minimizes Ein(w) among all weights with ‖w − w0‖ ≤ η. We indirectly searched the entire hypothesis set

    H1 = {w : ‖w − w0‖ ≤ η},

and picked the hypothesis w1 ∈ H1 with minimum in-sample error.

[Figure: the ball of radius η around w0, with w1 on its boundary in the direction of the negative gradient.]
⁹If we are to be sticklers for correctness, the hypothesis set explored could depend on the data set, and so we cannot directly apply the VC analysis, which requires the hypothesis set to be fixed ahead of time. Since we are just illustrating the main idea, we will brush such technicalities under the rug.


Now consider the second step, as illustrated to the right, which moves to w2. We indirectly explored the hypothesis set of weights with ‖w − w1‖ ≤ η, picking the best. Since w1 was already the minimizer of Ein over H1, this means that w2 is the minimizer of Ein among all hypotheses in H2, where

    H2 = H1 ∪ {w : ‖w − w1‖ ≤ η}.

Note that H1 ⊆ H2. Similarly, we define hypothesis set

    H3 = H2 ∪ {w : ‖w − w2‖ ≤ η},

and in the 3rd iteration, we pick weights w3 that minimize Ein over w ∈ H3. We can continue this argument as gradient descent proceeds, and define a nested sequence of hypothesis sets

    H1 ⊆ H2 ⊆ H3 ⊆ H4 ⊆ ⋯

[Figure: growing unions of η-balls around w0, w1, w2, … tracing the gradient descent path.]

As t increases, Ein(wt) is decreasing, and dvc(Ht) is increasing. So, we expect to see the approximation-generalization trade-off which was illustrated in Figure 2.3 (reproduced here with iteration t a proxy for dvc):

[Figure: error versus iteration t (a proxy for dvc(Ht)): Ein(wt) decreases monotonically, while Eout(wt) first decreases and then rises, with a minimum at an intermediate t*.]

The figure suggests it may be better to stop early at some t*, well before reaching a minimum of Ein. Indeed, this picture is observed in practice.

Example 7.2. We revisit the digits task of classifying '1' versus all other digits, with 70 randomly selected data points and a small sigmoidal neural network with a single hidden unit and tanh(·) output node. The figure below shows the in-sample error and test error versus iteration number.


[Figure: log10(error) versus iteration t: Ein keeps decreasing, while Etest attains a minimum at an intermediate iteration t* and then rises.]

The curves reinforce our theoretical discussion: the test error initially decreases as the approximation gain overcomes the worse generalization error bar; then, the test error increases as the generalization error bar begins to dominate the approximation gain, and overfitting becomes a serious problem.

In the previous example, despite using a parsimonious neural network with just a single hidden node, overfitting was an issue because the data are noisy and the target function is complex, so both stochastic and deterministic noise are significant. We need to regularize.
In the example, it is better to stop early at t* and constrain the learning to the smaller hypothesis set Ht*. In this sense, early stopping is a form of regularization. Early stopping is related to weight decay, as illustrated to the right. You initialize w0 near zero; if you stop early at wt, you have stopped at weights closer to w0, i.e., smaller weights. Early stopping indirectly achieves smaller weights, which is what weight decay directly achieves. To determine when to stop training, use a validation set to monitor the validation error at iteration t as you minimize the training-set error. Report the weights wt* that have minimum validation error when you are done training.

[Figure: contours of constant Ein and the gradient descent path starting near w0; stopping early at wt leaves the weights smaller than running to the minimum.]
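A minimal early-stopping loop along these lines, reusing the sgd and forward_prop sketches above (here one "iteration" is a full pass over Dtrain, and the iteration cap is an illustrative choice):

    import copy
    import numpy as np

    def train_early_stopping(Xtr, ytr, Xval, yval, weights, eta=0.01, T=10000):
        """Track the validation error each pass; report the best weights w_{t*}."""
        def err(W, X, y):
            return np.mean([(forward_prop(x, W)[1][-1][0] - yn) ** 2
                            for x, yn in zip(X, y)])
        best_w, best_val = copy.deepcopy(weights), err(weights, Xval, yval)
        for t in range(T):
            weights = sgd(Xtr, ytr, weights, eta=eta, epochs=1)
            val = err(weights, Xval, yval)
            if val < best_val:               # remember the minimum-Eval weights
                best_val, best_w = val, copy.deepcopy(weights)
        return best_w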
After selecting t*, it is tempting to use all the data to train for t* iterations. Unfortunately, adding back the validation data and training for t* iterations can lead to a completely different set of weights. The validation estimate of performance only holds for wt* (the weights you should output). This appears to go against the wisdom of the decreasing learning curve from Chapter 4: if you learn with more data, you get a better final hypothesis.¹⁰

¹⁰Using all the data to train to an in-sample error of Etrain(wt*) is also not recommended. Further, an in-sample error of Etrain(wt*) may not even be achievable with all the data.


Exercise 7.12
Why does outputting wt*, rather than training with all the data for t* iterations, not go against the wisdom that learning with more data is better?
[Hint: "More data is better" applies to a fixed model (H, A). Early stopping is model selection on nested hypothesis sets H1 ⊆ H2 ⊆ ⋯ determined by Dtrain. What happens if you were to use the full data D?]

When using early stopping, the usual trade-off exists for choosing the size of the validation set: too large, and there is little data to train on; too small, and the validation error will not be reliable. A rule of thumb is to set aside a fraction of the data (one-tenth to one-fifth) for validation.

Exercise 7.13
Suppose you run gradient descent for 1000 iterations. You have 500 examples in D, and you use 450 for Dtrain and 50 for Dval. You output the weights from iteration 50, with Eval(w50) = 0.05 and Etrain(w50) = 0.04.
(a) Is Eval(w50) = 0.05 an unbiased estimate of Eout(w50)?
(b) Use the Hoeffding bound to get a bound for Eout using Eval plus an error bar. Your bound should hold with probability at least 0.1.
(c) Can you bound Eout using Etrain, or do you need more information?
Example 7.2 also illustrates another common problem with the sigmoidal output node: gradient descent often hits a flat region where Ein decreases very little.¹¹ You might stop training, thinking you found a local minimum. This early stopping by mistake is sometimes called the 'self-regularizing' property of sigmoidal neural networks. Accidental regularization due to misinterpreted convergence is unreliable. Validation is much better.

7.4.3 Experiments With Digits Data

Let's put theory to practice on the digits task (to classify '1' versus all other digits). We learn on 500 randomly chosen data points using a sigmoidal neural network with one hidden layer and 10 hidden nodes. There are 41 weights (tunable parameters), so more than 10 examples per degree of freedom, which is quite reasonable. We use the identity output transformation θ(s) = s to reduce the possibility of getting stuck at a flat region of the error surface. At the end of training, we use the output transformation θ(s) = sign(s) for actually classifying data. After more than 2 million iterations of gradient descent, we manage to get close to a local minimum. The result is shown in Figure 7.2. It doesn't take a genius to see the overfitting. Figure 7.2 attests to the approximation capabilities of a moderately sized neural network.

¹¹The linear output transformation function helps avoid such excessively flat regions.


[Figure 7.2: A 10-hidden-unit neural network trained with gradient descent on 500 examples from the digits data (no regularization), plotted as symmetry versus average intensity. Blue circles are the digit '1' and the red ×'s are the other digits. Overfitting is rampant.]


decay to fight the overfitting. We minimize Eaug (w, ) = Ein (w) + N wt w,
with = 0.01. We get a much more believable separator, shown below.
A Symmetry
H
e-C

Average Intensity

As a final illustration, let's try early stopping with a validation set of size 50 (one-tenth of the data); so the training set will now have size 450. The training dynamics of gradient descent are shown in Figure 7.3(a). The linear output transformation function has helped, as there are no extremely flat periods in the training error. The classification boundary with early stopping at t* is shown in Figure 7.3(b). The result is similar to weight decay. In both cases, the regularized classification boundary is more believable. Ultimately, the quantitative statistics are what matter, and these are summarized below.


[Figure 7.3: Early stopping with 500 examples from the digits data. (a) Training dynamics: Etrain and Eval versus iteration t for gradient descent with a training set of size 450 and a validation set of size 50. (b) The regularized final hypothesis obtained by early stopping at t*, the minimum validation error.]
obtained by early stopping at t , the minimum validation error.

                       Etrain    Eval     Ein     Eout
    No Regularization    -        -       0.2%    3.1%
    Weight Decay         -        -       1.0%    2.1%
    Early Stopping      1.1%     2.0%     1.2%    2.0%
7.5 Beefing Up Gradient Descent
Gradient descent is a simple method to minimize Ein that has problems converging, especially with flat error surfaces. One solution is to minimize a friendlier error instead, which is why training with a linear output node helps. Rather than change the error measure, there is plenty of room to improve the algorithm itself. Gradient descent takes a step of size η in the negative gradient direction. How should we determine η, and is the negative gradient the best direction in which to move?

Exercise 7.14
Consider the error function E(w) = (w − w*)ᵗQ(w − w*), where Q is an arbitrary positive definite matrix. Set w = 0.
Show that, at w = 0, the gradient is ∇E(w) = −2Qw*. What weights minimize E(w)? Does gradient descent move you in the direction of these optimal weights? Reconcile your answer with the claim in Chapter 3 that the gradient is the best direction in which to take a step. [Hint: How big was the step?]

The previous exercise shows that the negative gradient is not necessarily the
best direction for a large step, and we would like to take larger steps to increase


the efficiency of the optimization. The next figure shows two algorithms: our old friend gradient descent and our soon-to-be friend conjugate gradient descent. Both algorithms are minimizing Ein for a 5 hidden unit neural network fitting 200 examples of the digits data. The performance difference is dramatic.

(Figure: log10(error) versus optimization time in seconds, from 0.1 to 10⁴, for gradient descent and conjugate gradients; conjugate gradients drives the error many orders of magnitude lower in the same time.)

We now discuss methods for beefing up gradient descent, but only scratch
the surface of this important topic known as numerical optimization. The two
main steps in an iterative optimization procedure are to determine:
1. Which direction should one search for a local optimum?
2. How large a step should one take in that direction?
7.5.1 Choosing the Learning Rate
In gradient descent, the learning rate η multiplies the negative gradient to give the move −η∇Ein. The size of the step taken is proportional to η. The optimal step size (and hence learning rate η) depends on how wide or narrow the error surface is near the minimum.

(Figure: two one-dimensional error surfaces, Ein(w) versus the weights w. On a wide surface, use a large η; on a narrow one, use a small η.)

When the surface is wider, we can take larger steps without overshooting; since ‖∇Ein‖ is small, we need a large η. Since we do not know ahead of time how wide the surface is, it is easy to choose an inefficient value for η.


Variable learning rate gradient descent. A simple heuristic that adapts the learning rate to the error surface works well in practice. If the error drops, increase η; if not, the step was too large, so reject the update and decrease η. For little extra effort, we get a significant boost to gradient descent.

Variable Learning Rate Gradient Descent:
1: Initialize w(0), and η0 at t = 0. Set α > 1 and β < 1.
2: while stopping criterion has not been met do
3:   Let g(t) = ∇Ein(w(t)), and set v(t) = −g(t).
4:   if Ein(w(t) + ηt v(t)) < Ein(w(t)) then
5:     accept: w(t + 1) = w(t) + ηt v(t); ηt+1 = α ηt.
6:   else
7:     reject: w(t + 1) = w(t); ηt+1 = β ηt.
8:   Iterate to the next step, t ← t + 1.

It is usually best to go with a conservative increment parameter, for example α ≈ 1.05–1.1, and a bit more aggressive decrement parameter, for example β ≈ 0.5–0.8. This is because, if the error doesn't drop, then one is in an unusual situation and more drastic action is called for.
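Here is the heuristic as a short sketch; the `Ein` and `grad_Ein` functions are assumed to be supplied, for example by backpropagation on your network.

```python
def variable_rate_gd(w, Ein, grad_Ein, eta=0.1,
                     alpha=1.1, beta=0.7, max_iters=10000):
    """Gradient descent with adaptive learning rate: accept the step
    and grow eta if the error drops; otherwise reject and shrink eta."""
    e = Ein(w)
    for _ in range(max_iters):
        v = -grad_Ein(w)                  # descent direction
        w_new = w + eta * v
        e_new = Ein(w_new)
        if e_new < e:                     # accept: the error dropped
            w, e, eta = w_new, e_new, alpha * eta
        else:                             # reject: the step was too large
            eta = beta * eta
    return w
```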
After a little thought, one might wonder why we need a learning rate at all. Once the direction in which to move, v(t), has been determined, why not simply continue along that direction until the error stops decreasing? This leads us to steepest descent: gradient descent with line search.
Steepest Descent. Gradient descent picks a descent direction v(t) = −g(t) and updates the weights to w(t + 1) = w(t) + ηv(t). Rather than pick η arbitrarily, we will choose the optimal η that minimizes Ein(w(t + 1)). Once you have the direction to move, make the best of it by moving along the line w(t) + ηv(t) and stopping when Ein is minimum (hence the term line search). That is, choose the step size η*, where

η*(t) = argminη Ein(w(t) + ηv(t)).

Steepest Descent (Gradient Descent + Line Search):
1: Initialize w(0) and set t = 0;
2: while stopping criterion has not been met do
3:   Let g(t) = ∇Ein(w(t)), and set v(t) = −g(t).
4:   Let η* = argminη Ein(w(t) + ηv(t)).
5:   w(t + 1) = w(t) + η* v(t).
6:   Iterate to the next step, t ← t + 1.
The line search in step 4 is a one-dimensional optimization problem. Line search is an important step in most optimization algorithms, so an efficient algorithm is called for. Write E(η) for Ein(w(t) + ηv(t)). The goal is to find a minimum of E(η). We give a simple algorithm based on binary search.


Line Search. The idea is to find an interval on the line which is guaranteed to contain a local minimum. Then, rapidly narrow the size of this interval while maintaining as an invariant the fact that it contains a local minimum. The basic invariant is a U-arrangement:

η1 < η2 < η3 with E(η2) < min{E(η1), E(η3)}.

(Figure: a U-arrangement of η1 < η2 < η3 on the curve E(η), and its bisection.)

Since E is continuous, there must be a local minimum in the interval [η1, η3]. Now, consider the midpoint of the interval,

η̄ = ½(η1 + η3),

hence the name bisection algorithm. Suppose that η̄ < η2. If E(η̄) < E(η2), then {η1, η̄, η2} is a new, smaller U-arrangement; and, if E(η̄) > E(η2), then {η̄, η2, η3} is the new, smaller U-arrangement. In either case, the bisection process can be iterated with the new U-arrangement. If η̄ happens to equal η2, perturb η̄ slightly to resolve the degeneracy. We leave it to the reader to determine how to obtain the new, smaller U-arrangement for the case η̄ > η2.
An efficient algorithm to find an initial U-arrangement is to start with η1 = 0 and η2 = η for some step η. If E(η2) < E(η1), consider the sequence

η = 0, η, 2η, 4η, 8η, . . .

(each time the step doubles). At some point, the error must increase. When the error increases for the first time, the last three steps give a U-arrangement. If, instead, E(η1) < E(η2), consider the sequence

η = η, 0, −η, −2η, −4η, −8η, . . .

(the step keeps doubling, but in the reverse direction). Again, when the error increases for the first time, the last three steps give a U-arrangement.12

Exercise 7.15
Show that |η3 − η1| decreases exponentially in the bisection algorithm.
[Hint: show that two iterations at least halve the interval size.]

The bisection algorithm continues to bisect the interval and update to a new U-arrangement until the length of the interval |η3 − η1| is small enough, at which
12 We do not worry about E(η1) = E(η2); such ties can be broken by small perturbations.


point you can return the midpoint of the interval as the approximate local
minimum. Usually 20 iterations of bisection are enough to get an acceptable
solution. A better quadratic interpolation algorithm is given in Problem 7.8,
which only needs about 4 iterations in practice.
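A sketch of the full line search follows: double the step to find an initial U-arrangement, then bisect. Here `E` is the one-dimensional function E(η) = Ein(w(t) + ηv(t)), the starting step `eta0` is an arbitrary choice, and we assume a minimum exists along the line.

```python
def bracket(E, eta0=1e-3):
    """Find a U-arrangement (e1, e2, e3) with E(e2) < min(E(e1), E(e3))
    by doubling the step (reversing direction if E increases at first)."""
    a, b = 0.0, eta0
    if E(b) >= E(a):
        a, b, eta0 = b, a, -eta0          # search in the reverse direction
    while True:
        c = b + eta0
        if E(c) >= E(b):                  # error increased: bracket found
            return (a, b, c) if a < c else (c, b, a)
        a, b, eta0 = b, c, 2 * eta0       # the step doubles each time

def bisect(E, e1, e2, e3, iters=20):
    """Shrink the U-arrangement by repeated bisection."""
    for _ in range(iters):
        m = 0.5 * (e1 + e3)               # midpoint of the interval
        if m == e2:
            m += 1e-12 * (e3 - e1)        # perturb to break the tie
        if m < e2:
            (e1, e2, e3) = (e1, m, e2) if E(m) < E(e2) else (m, e2, e3)
        else:
            (e1, e2, e3) = (e2, m, e3) if E(m) < E(e2) else (e1, e2, m)
    return 0.5 * (e1 + e3)
```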

Example 7.3. We illustrate these three heuristics for improving gradient descent on our digit recognition problem (classifying '1' versus other digits). We use 200 data points and a neural network with 5 hidden units. We show the performance of gradient descent, gradient descent with variable learning rate, and steepest descent (line search) in Figure 7.4; stochastic gradient descent (SGD) is included for comparison. The table below summarizes the in-sample error at various points in the optimization.
                                       Optimization Time
Method                          10 sec    1,000 sec    50,000 sec
Gradient Descent                0.079     0.0206       0.00144
Stochastic Gradient Descent     0.0213    0.00278      0.000022
Variable Learning Rate          0.039     0.014        0.00010
Steepest Descent                0.043     0.0189       0.000012

Figure 7.4: Gradient descent, variable learning rate, SGD, and steepest descent using digits data and a 5 hidden unit 2-layer neural network with linear output; the curves show log10(error) versus optimization time in seconds. For variable learning rate, α = 1.1 and β = 0.8.

Note that SGD is quite competitive. The figure illustrates why it is hard to know when to stop minimizing. A flat region trapped all the methods, even though we used a linear output node transform. It is very hard to differentiate between a flat region (which is typically caused by a very steep valley that leads to inefficient zig-zag behavior) and a true local minimum. □


7.5.2 Conjugate Gradient Minimization


Conjugate gradient is a queen among optimization methods because it leverages a simple principle: don't undo what you have already accomplished. When you end a line search, because the error cannot be further decreased by moving back or forth along the search direction, it must be that the new gradient and the previous line search direction are orthogonal. What this means is that you have succeeded in setting one of the components of the gradient to zero, namely the component along the search direction v(t) (see the figure). If the next search direction is the negative of the new gradient, it will be orthogonal to the previous search direction.

(Figure: contours of constant Ein in the (w1, w2)-plane; the line search along v(t) ends at w(t + 1), where the new gradient g(t + 1) is orthogonal to v(t); the optimum w* lies inside the innermost contour.)
You are at a local minimum when the gradient is zero, and setting one component to zero is certainly a step in the right direction. As you move along the next search direction (for example, the new negative gradient), the gradient will change and may not remain orthogonal to the previous search direction, a task you laboriously accomplished in the previous line search. The conjugate gradient algorithm chooses the next direction v(t + 1) so that the gradient along this direction will remain perpendicular to the previous search direction v(t). This is called the conjugate direction, hence the name. After a line search along this new direction v(t + 1) to minimize Ein, you will have set two components of the gradient to zero. First, the gradient remained perpendicular to the previous search direction v(t). Second, the gradient will be orthogonal to v(t + 1) because of the line search (see the figure). The gradient along the new direction v(t + 1) is shown by the blue arrows in the figure. Because v(t + 1) is conjugate to v(t), observe how the gradient as we move along v(t + 1) remains orthogonal to the previous direction v(t).

(Figure: the same contours of constant Ein; from w(t + 1), the conjugate direction v(t + 1) is followed, and the gradient along it, shown by blue arrows, stays orthogonal to v(t).)

Exercise 7.16
Why does the new search direction pass through the optimal weights?

We made progress! Now two components of the gradient are zero. In two
dimensions, this means that the gradient itself must be zero and we are done.


In higher dimensions, if we could continue to set a component of the gradient to zero with each line search, maintaining all previous components at zero, we would eventually set every component of the gradient to zero and be at a local minimum. Our discussion is true for an idealized quadratic error function. In general, conjugate gradient minimization implements our idealized expectations approximately. Nevertheless, it works like a charm, because the idealized setting is a good approximation once you get close to a local minimum, and this is where algorithms like gradient descent become ineffective.
Now for the details. The algorithm constructs the current search direction as a linear combination of the previous search direction and the current gradient,

v(t) = −g(t) + βt v(t − 1),

where

βt = g(t + 1)ᵗ(g(t + 1) − g(t)) / g(t)ᵗg(t).

The term βt v(t − 1) is called the momentum term, because it asks you to keep moving in the same direction you were moving in. The multiplier βt is called the momentum parameter. The full conjugate gradient descent algorithm is summarized in the following algorithm box.

Conjugate Gradient Descent:
1: Initialize w(0) and set t = 0; set v(−1) = 0.
2: while stopping criterion has not been met do
3:   Let v(t) = −g(t) + βt v(t − 1), where
       βt = g(t + 1)ᵗ(g(t + 1) − g(t)) / g(t)ᵗg(t).
4:   Let η* = argminη Ein(w(t) + ηv(t)).
5:   w(t + 1) = w(t) + η* v(t);
6:   Iterate to the next step, t ← t + 1;

The only difference between conjugate gradient descent and steepest descent is in step 3, where the line search direction is different from the negative gradient. Contrary to intuition, the negative gradient direction is not always the best direction to move, because it can undo some of the good work you did before. In practice, for error surfaces that are not exactly quadratic, the v(t)'s are only approximately conjugate, and it is recommended that you restart the algorithm by setting βt to zero every so often (for example, every d iterations). That is, every d iterations you throw in a steepest descent iteration.
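A sketch of the algorithm with periodic restarts follows; the `line_search` argument is any one-dimensional minimizer of E(η) = Ein(w + ηv), such as the bracketing and bisection routine sketched earlier, and all names here are our own.

```python
def conjugate_gradient(w, Ein, grad_Ein, line_search, max_iters=1000):
    """Conjugate gradient descent with the beta of the text and a
    steepest descent restart every d iterations."""
    d = len(w)
    g = grad_Ein(w)
    v = -g                                     # first direction: -gradient
    for t in range(max_iters):
        eta = line_search(lambda s: Ein(w + s * v))
        w = w + eta * v
        g_new = grad_Ein(w)
        beta = g_new @ (g_new - g) / (g @ g)   # momentum parameter
        if (t + 1) % d == 0:
            beta = 0.0                         # restart: pure gradient step
        v = -g_new + beta * v
        g = g_new
    return w
```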

Example 7.4. Continuing with the digits example, we compare conjugate


gradient and the previous champion steepest descent in the next table and
Figure 7.5.


Figure 7.5: Steepest descent versus conjugate gradient descent using 200 examples of the digits data and a 2-layer sigmoidal neural network with 5 hidden units; the curves show log10(error) versus optimization time in seconds.

                               Optimization Time
Method                  10 sec     1,000 sec       50,000 sec
Steepest Descent        0.043      0.0189          1.2 × 10⁻⁵
Conjugate Gradients     0.0200     1.13 × 10⁻⁶     2.73 × 10⁻⁹
The performance difference is dramatic. □
7.6 Deep Learning: Networks with Many Layers


Universal approximation says that a single hidden layer with enough hidden units can approximate any target function. But that may not be a natural way to represent the target function. Often, many layers more closely mimic human learning. Let's get our feet wet with the digit recognition problem of classifying '1' versus '5'. A natural first step is to decompose the two digits into basic components, just as one might break down a face into two eyes, a nose, a mouth, two ears, etc. Here is one attempt for a prototypical 1 and 5.

(Figure: six basic shapes extracted from the prototypical digits, labeled φ1 through φ6.)


Indeed, we could plausibly argue that every '1' should contain a φ1, φ2 and φ3; and every '5' should contain a φ3, φ4, φ5, φ6 and perhaps a little φ1. We have deliberately used the notation φi, which we used earlier for the coordinates of the feature transform Φ. These basic shapes are features of the input and, for example, we would like φ1 to be large (close to +1) if its corresponding feature is in the input image and small (close to −1) if not.

Exercise 7.17
The basic shape φ3 is in both the '1' and the '5'. What other digits do you expect to contain each basic shape φ1, . . . , φ6? How would you select additional basic shapes if you wanted to distinguish between all the digits? (What properties should useful basic shapes satisfy?)

We can build a classifier for '1' versus '5' from these basic shapes. Remember how, at the beginning of the chapter, we built a complex Boolean function from the basic functions and and or? Let's mimic that process here. The complex function we are building is the digit classifier, and the basic functions are our features. Assume, for now, that we have feature functions φi which compute the presence (+1) or absence (−1) of the corresponding feature. Take a close look at the following network and work it through from input to output.

(Figure: a feed-forward network. The input feeds into feature nodes φ1, . . . , φ6; these feed, with positive or negative weights, into two nodes z1 ('is it a 1?') and z5 ('is it a 5?'), which are combined by a final output node.)
Ignoring details like the exact values of the weights, node z1 answers the question 'is the image a 1?' and, similarly, node z5 answers 'is the image a 5?'. Let's see why. If they have done their job correctly, when we feed in a '1', φ1, φ2, φ3 compute +1, and φ4, φ5, φ6 compute −1. Combining φ1, . . . , φ6 with the signs of the weights on outgoing edges, all the inputs to z1 will be positive, hence z1 outputs +1; all but one of the inputs into z5 are negative, hence z5 outputs −1. A similar analysis holds if you feed in the '5'. The final


node combines z1 and z5 into the final output. At this point, it is useful to fill in all the blanks with an exercise.

Exercise 7.18
Since the input x is an image, it is convenient to represent it as a matrix [xij] of its pixels, which are black (xij = 1) or white (xij = 0). The basic shape φk identifies a set of these pixels which are black.

(a) Show that feature φk can be computed by the neural network node

φk(x) = tanh( w0 + Σij wij xij ).

(b) What are the inputs to the neural network node?
(c) What do you choose as values for the weights? [Hint: consider separately the weights of the pixels for those xij ∈ φk and those xij ∉ φk.]
(d) How would you choose w0? (Not all digits are written identically, and so a basic shape may not always be exactly represented in the image.)
(e) Draw the final network, filling in as many details as you can.

You may have noticed that the output of z1 is all we need to solve our problem. This would not be the case if we were solving the full multi-class problem with nodes z0, . . . , z9 corresponding to all ten digits. Also, we solved our problem with relative ease: our deep network has just 2 hidden layers. In a more complex problem, like face recognition, the process would start just as we did here, with basic shapes. At the next level, we would constitute more complicated shapes from the basic shapes, but we would not be home yet. These more complicated shapes would constitute still more complicated shapes, until at last we had realistic objects like eyes, a mouth, ears, etc. There would be a hierarchy of basic features until we solve our problem at the very end.
Now for the punch line and crux of our story. The punch line first. Shine your floodlights back on the network we constructed, and scrutinize what the different layers are doing. The first layer constructs a low-level representation of basic shapes; the next layer builds a higher-level representation from these basic shapes. As we progress up more layers, we get more complex representations in terms of simpler parts from the previous layer: an intelligent decomposition of the problem, starting from simple and getting more complex, until finally the problem is solved. This is the promise of the deep network: it provides some human-like insight into how the problem is being solved, based on a hierarchy of more complex representations for the input. While we might attain a solution of similar accuracy with a single hidden layer, we would gain no such insight. The picture is rosy for our intuitive digit recognition problem, but here is the crux of the matter: for a complex learning problem, how do we automate all of this in a computer algorithm?


Figure 7.6: Greedy deep learning algorithm, in four panels. (a) First layer weights are learned. (b) First layer is fixed and second layer weights are learned. (c) First two layers are fixed and third layer weights are learned. (d) Learned weights can be used as a starting point to fine-tune the entire network.
7.6.1 A Greedy Deep Learning Algorithm

Historically, the shallow (single hidden layer) neural network was favored over the deep network because deep networks are hard to train, suffer from many local minima and, relative to the number of tunable parameters, have a very large tendency to overfit (a composition of nonlinearities is typically much more powerful than a linear combination of nonlinearities). Recently, some simple heuristics have shown good performance empirically and have brought deep networks back into the limelight. Indeed, the current best algorithm for digit recognition is a deep neural network trained with such heuristics.

The greedy heuristic has a general form. Learn the first layer weights W(1) and fix them.13 The output of the first hidden layer is a nonlinear transformation of the inputs, xn → xn(1). These outputs xn(1) are used to train the second layer weights W(2), while keeping the first layer weights fixed. This is the essence of the greedy algorithm: greedily pick the first layer weights, fix them, and then move on to the second layer weights. One ignores the possibility that better first layer weights might exist if one takes into account what the second layer is doing. The process continues, with the outputs xn(2) used to learn the weights W(3), and so on.

13 Recall that we use the superscript (·)(ℓ) to denote layer ℓ.


Greedy Deep Learning Algorithm:
1: for ℓ = 1, . . . , L do
2:   W(1), . . . , W(ℓ−1) are given from previous iterations.
3:   Compute the layer ℓ − 1 outputs xn(ℓ−1) for n = 1, . . . , N.
4:   Use {xn(ℓ−1)} to learn weights W(ℓ) by training a single hidden layer neural network. (W(1), . . . , W(ℓ−1) are fixed.)

(Figure: the single hidden layer network trained in step 4. The inputs xn(ℓ−1), n = 1, . . . , N, feed through hidden layer weights W(ℓ), then through output weights V, to produce the output on which the error measure is computed.)

We have to clarify step 4 in the algorithm. The weights W(ℓ) and V are learned, though V is not needed in the algorithm. To learn the weights, we minimize an error (which will depend on the output of the network), and that error is not yet defined. To define the error, we must first define the output, and then how to compute the error from the output.

Unsupervised Auto-encoder. One approach is to take to heart the notion that the hidden layer gives a high-level representation of the inputs. That is, we should be able to reconstruct all the important aspects of the input from the hidden layer output. A natural test is to reconstruct the input itself: the output will be x̂n, a prediction of the input xn; and the error is the difference between the two. For example, using squared error,

en = ‖x̂n − xn‖².

When all is said and done, we obtain the weights without using the targets yn, and the hidden layer gives an encoding of the inputs, hence the name unsupervised auto-encoder. This is reminiscent of the radial basis function network in Chapter 6, where we used an unsupervised technique to learn the centers of the basis functions, which provided a representative set of inputs as the centers. Here, we go one step further and dissect the input space itself into pieces that are representative of the learning problem. At the end, the targets have to be brought back into the picture (usually in the output layer).
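For concreteness, here is a sketch of training one auto-encoder layer by plain gradient descent, with tanh hidden units, linear outputs, squared reconstruction error, and bias terms omitted for brevity; these are illustrative choices, not prescribed by the text.

```python
import numpy as np

def train_autoencoder_layer(X, n_hidden, eta=0.01, iters=5000, seed=0):
    """Minimize sum_n ||xhat_n - x_n||^2 for the one-hidden-layer
    auto-encoder xhat = tanh(X W) V. Returns W; tanh(X W) is the
    learned representation passed on to the next layer."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W = rng.normal(scale=0.1, size=(d, n_hidden))
    V = rng.normal(scale=0.1, size=(n_hidden, d))
    for _ in range(iters):
        Z = np.tanh(X @ W)                    # hidden representation
        R = Z @ V - X                         # reconstruction residual
        grad_V = 2 * Z.T @ R
        grad_W = 2 * X.T @ ((1 - Z**2) * (R @ V.T))  # tanh' = 1 - tanh^2
        W -= eta * grad_W                     # small fixed learning rate
        V -= eta * grad_V
    return W
```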

Supervised Deep Network. The previous approach adheres to the philosophical goal that the hidden layers provide an intelligent hierarchical representation of the inputs. A more direct approach is to train the two-layer network on the targets. In this case, the output is the predicted target ŷn, and the error measure en(ŷn, yn) would be computed in the usual way (for example, squared error, cross-entropy error, etc.).


In practice, there is no verdict on which method is better, with the unsupervised auto-encoder camp being slightly more crowded than the supervised camp. Try them both and see what works for your problem; that's usually the best way. Once you have your error measure, you just reach into your optimization toolbox and minimize the error using your favorite method (gradient descent, stochastic gradient descent, conjugate gradient descent, . . . ). A common tactic is to use the unsupervised auto-encoder first to set the weights and then fine-tune the whole network using supervised learning. The idea is that the unsupervised pass gets you to the right local minimum of the full network. But, no matter which camp you belong to, you still need to choose the architecture of the deep network (number of hidden layers and their sizes), and there is no magic potion for that. You will need to resort to old tricks like validation, or a deep understanding of the problem (our hand made network for the '1' versus '5' task suggests a deep network with six hidden nodes in the first hidden layer and two in the second).
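Putting the pieces together, here is a sketch of the greedy pass, using the hypothetical `train_autoencoder_layer` sketched above; a supervised fine-tune of the whole stack, with the targets brought in at the output layer, would follow.

```python
import numpy as np

def greedy_pretrain(X, hidden_sizes):
    """Learn the layers one at a time: train an auto-encoder on the
    current representation, fix its weights, and feed its hidden
    outputs forward as the inputs to the next layer."""
    weights, Z = [], X
    for n_hidden in hidden_sizes:         # e.g. [6, 2] for [256, 6, 2, 1]
        W = train_autoencoder_layer(Z, n_hidden)
        weights.append(W)                 # this layer is now fixed
        Z = np.tanh(Z @ W)                # representation for next layer
    return weights                        # starting point for fine-tuning
```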

Exercise 7.19
Previously, for our digit problem, we used symmetry and intensity. How do
these features relate to deep networks? Do we still need them?

Example 7.5. Deep Learning For Digit Recognition. Let's revisit the digits classification problem ('1' versus '5') using a deep network with architecture

[d(0), d(1), d(2), d(3)] = [256, 6, 2, 1].

(The same architecture we constructed by hand earlier, with 16 × 16 input pixels and 1 output.) We will use gradient descent to train the two-layer networks in the greedy algorithm. A convenient matrix form for the gradient of the two-layer network is given in Problem 7.7. For the unsupervised auto-encoder, the target output is the input matrix X. For the supervised deep network, the target output is just the target vector y. We used the supervised approach with 1,000,000 gradient descent iterations for each supervised greedy step, using a sample of 1500 examples from the digits data. Here is a look at what the 6 hidden units in the first hidden layer learned. For each hidden node in the first hidden layer, we show the pixels corresponding to the top 20 incoming weights.

(Figure: for each of the six first-layer hidden units, labeled 1 through 6, the pixels corresponding to its top 20 incoming weights.)


Real data is not as clean as our idealized analysis; don't be surprised. Nevertheless, we can discern that φ2 has picked out the pixels (shapes) in the typical '1' that are unlikely to be in a typical '5'. The other features seem to focus on the '5' and, to some extent, match our hand constructed features. Let's not dwell on whether the representation captures human intuition; it does to some extent. The important thing is that this result is automatic and purely data driven (other than our choice of the network architecture); and what matters is out-of-sample performance. For different architectures, we ran more than 1000 validation experiments, selecting 500 random training points each time and using the remaining data as a test set.
Deep Network Architecture     Ein     Etest
[256, 3, 2, 1]                 0      0.170%
[256, 6, 2, 1]                 0      0.187%
[256, 12, 2, 1]                0      0.187%
[256, 24, 2, 1]                0      0.183%

Ein is always zero because there are so many parameters, even with just 3 hidden units in the first hidden layer. This smells of overfitting. But the test performance is impressive, at 99.8% accuracy, which is all we care about. Our hand constructed features of symmetry and intensity were good, but not quite this good. □

7.7 Problems

Problem 7.1 Implement the decision function below using a 3-layer perceptron.

(Figure: a decision region in the plane, with both axes marked from −2 to 2; a '+' region in the middle and '−' outside.)

Problem 7.2 A set of M hyperplanes will generally divide the space into some number of regions. Every point in Rd can be labeled with an M-dimensional vector that determines which side of each plane it is on. Thus, for example, if M = 3, then a point with a vector (−1, +1, +1) is on the −1 side of the first hyperplane, and on the +1 side of the second and third hyperplanes. A region is defined as the set of points with the same label.

(a) Prove that the regions with the same label are convex.
(b) Prove that M hyperplanes can create at most 2^M distinct regions.
(c) [hard] What is the maximum number of regions created by M hyperplanes in d dimensions?
[Answer: Σ_{i=0}^{d} (M choose i).]
[Hint: Use induction, and let B(M, d) be the number of regions created by M (d−1)-dimensional hyperplanes in d-space. Now consider adding the (M+1)-th hyperplane. Show that this hyperplane intersects at most B(M, d−1) of the B(M, d) regions. For each region it intersects, it adds exactly one region, and so B(M+1, d) ≤ B(M, d) + B(M, d−1). (Is this recurrence familiar?) Evaluate the boundary conditions, B(M, 1) and B(1, d), and proceed from there. To see that the (M+1)-th hyperplane only intersects B(M, d−1) regions, argue as follows. Treat the (M+1)-th hyperplane as a (d−1)-dimensional space, and project the initial M hyperplanes into this space to get M hyperplanes in a (d−1)-dimensional space. These M hyperplanes can create at most B(M, d−1) regions in this space. Argue that this means that the (M+1)-th hyperplane is only intersecting at most B(M, d−1) of the regions created by the M hyperplanes in d-space.]


Problem 7.3 Suppose that a target function f (for classification) is represented by a number of hyperplanes, where the different regions defined by the hyperplanes (see Problem 7.2) could be either classified +1 or −1, as with the 2-dimensional examples we considered in the text. Let the hyperplanes be h1, h2, . . . , hM, where hm(x) = sign(wmᵗx). Consider all the regions that are classified +1, and let one such region be r+. Let c = (c1, c2, . . . , cM) be the label of any point in the region (all points in a given region have the same label); the label cm = ±1 tells which side of hm the point is on. Define the AND-term corresponding to region r by

tr = h̃1 h̃2 · · · h̃M,   where   h̃m = hm if cm = +1, and h̃m = h̄m (the negation) if cm = −1.

Show that f = tr1 + tr2 + · · · + trk, where r1, . . . , rk are all the positive regions. (We use multiplication for the AND and addition for the OR operators.)

Problem 7.4 Referring to Problem 7.3, any target function which can be decomposed into hyperplanes h1, . . . , hM can be represented by f = tr1 + tr2 + · · · + trk, where there are k positive regions.
What is the structure of the 3-layer perceptron (number of hidden units in each layer) that will implement this function, proving the following theorem?

Theorem. Any decision function whose ±1 regions are defined in terms of the regions created by a set of hyperplanes can be implemented by a 3-layer perceptron.

Problem 7.5 [Hard] State and prove a version of a Universal Approximation Theorem:

Theorem. Any target function f (for classification) defined on [0, 1]d, whose classification boundary surfaces are smooth, can be arbitrarily closely approximated by a 3-layer perceptron.

[Hint: Decompose the unit hypercube into ε-hypercubes (1/εᵈ of them); the total volume of those ε-hypercubes which intersect the classification boundaries must tend to zero (why? use smoothness). Thus, the function which takes on the value of f on any ε-hypercube that does not intersect the boundary, and an arbitrary value on these boundary ε-hypercubes, will approximate f arbitrarily closely as ε → 0.]

Problem 7.6 The finite difference approximation to obtaining the gradient is based on the following formula from calculus:

∂h/∂wij(ℓ) = [ h(wij(ℓ) + ε) − h(wij(ℓ) − ε) ] / 2ε + O(ε²),


where h(wij(ℓ) + ε) denotes the function value when all weights are held at their values in w except for the weight wij(ℓ), which is perturbed by ε. To get the gradient, we need the partial derivative with respect to each weight.
Show that the computational complexity of obtaining all these partial derivatives is O(W²). [Hint: you have to do two forward propagations for each weight.]

Problem 7.7 Consider the 2-layer network below, with output vector ŷn. This is the two-layer network used for the greedy deep network algorithm.

(Figure: the input xn feeds through first layer weights W to the hidden outputs zn, which feed through second layer weights V to the output ŷn; the error on example n is en = ‖yn − ŷn‖².)

Collect the input vectors xn (together with a column of ones) as rows in the input data matrix X, and similarly form Z from zn. The target matrices Y and Ŷ are formed from yn and ŷn respectively. Assume a linear output node and that the hidden layer activation is θ(·).

(a) Show that the in-sample error is

Ein = (1/N) trace( (Y − Ŷ)(Y − Ŷ)ᵗ ),

where Ŷ = ZV and Z = [1, θ(XW)], with dimensions

X : N × (d + 1),   W : (d + 1) × d(1),   Z : N × (d(1) + 1),   V : (d(1) + 1) × dim(y),   Y, Ŷ : N × dim(y).

(It is convenient to decompose V = [V0; V1] into its first row V0, corresponding to the biases, and its remaining rows V1; 1 is the N × 1 vector of ones.)
(b) Derive the gradient matrices:

∂Ein/∂V = 2ZᵗZV − 2ZᵗY,

∂Ein/∂W = 2Xᵗ [ θ′(XW) ⊗ ( θ(XW)V1V1ᵗ + 1V0V1ᵗ − YV1ᵗ ) ],

where ⊗ denotes element-wise multiplication. Some of the matrix derivatives of functions involving the trace from the appendix may be useful.


Problem 7.8 Quadratic Interpolation for Line Search
Assume that a U-arrangement has been found, as illustrated below.

(Figure: E(η) evaluated at η1 < η2 < η3, together with the quadratic interpolant through the three points.)

Instead of using bisection to construct the point η̄, quadratic interpolation fits a quadratic curve Ẽ(η) = aη² + bη + c to the three points and uses the minimum of this quadratic interpolant as η̄.

(a) Show that the minimum of the quadratic interpolant for a U-arrangement is within the interval [η1, η3].
(b) Let e1 = E(η1), e2 = E(η2), e3 = E(η3). Obtain the quadratic function that interpolates the three points {(η1, e1), (η2, e2), (η3, e3)}. Show that the minimum of this quadratic interpolant is given by:

η̄ = (1/2) [ (e1 − e2)(η1² − η3²) − (e1 − e3)(η1² − η2²) ] / [ (e1 − e2)(η1 − η3) − (e1 − e3)(η1 − η2) ].

[Hint: e1 = aη1² + bη1 + c, e2 = aη2² + bη2 + c, e3 = aη3² + bη3 + c. Solve for a, b, c; the minimum of the quadratic is given by η̄ = −b/2a.]
(c) Depending on whether E(η̄) is less than E(η2), and on whether η̄ is to the left or right of η2, there are 4 cases. In each case, what is the smaller U-arrangement?
(d) What if η̄ = η2, a degenerate case?

Note: in general, the quadratic interpolations converge very rapidly to a locally optimal η. In practice, 4 iterations are more than sufficient.
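The η̄ formula in part (b) is easy to check numerically: on an exact quadratic, it recovers the true minimizer in one step. A small illustrative sketch (our own example, not from the text):

```python
def eta_bar(p1, p2, p3):
    """Minimizer of the quadratic through (eta_i, e_i), i = 1, 2, 3,
    using the formula of Problem 7.8(b)."""
    (n1, e1), (n2, e2), (n3, e3) = p1, p2, p3
    num = (e1 - e2) * (n1**2 - n3**2) - (e1 - e3) * (n1**2 - n2**2)
    den = (e1 - e2) * (n1 - n3) - (e1 - e3) * (n1 - n2)
    return 0.5 * num / den

E = lambda n: 3 * (n - 0.7)**2 + 1           # minimum at eta = 0.7
pts = [(n, E(n)) for n in (0.0, 0.5, 2.0)]   # a U-arrangement
print(eta_bar(*pts))                         # 0.7, exact for a quadratic
```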

Problem 7.9 [Convergence of Monte-Carlo Minimization]
Suppose the global minimum w* is in the unit cube and the error surface is quadratic near w*. So, near w*,

E(w) = E(w*) + (1/2)(w − w*)ᵗH(w − w*),

where the Hessian H is positive definite and symmetric.


(a) If you uniformly sample w in the unit cube, show that

P[E ≤ E(w*) + ε] = ∫_{xᵗHx ≤ 2ε} dᵈx = Sd(√(2ε)) / √(det H),

where Sd(r) is the volume of the d-dimensional sphere of radius r,

Sd(r) = π^{d/2} r^d / Γ(d/2 + 1).

[Hints: P[E ≤ E(w*) + ε] = P[ (1/2)(w − w*)ᵗH(w − w*) ≤ ε ]. Suppose the orthogonal matrix A diagonalizes H: AᵗHA = diag[λ1², . . . , λd²]. Change variables to u = Aᵗx and use det H = λ1²λ2² · · · λd².]
(b) Suppose you sample N times and choose the weights with minimum error, wmin. Show that

P[E(wmin) > E(w*) + ε] ≤ ( 1 − (1/√(πd)) (ξ√ε / (λ̄√d))^d )^N,

where ξ ≡ √(8e/π) and λ̄ is the geometric mean of the eigenvalues of H. (You may use Γ(x + 1) ≈ xˣe⁻ˣ√(2πx).)
(c) Show that if N ≥ (λ̄√d / (ξ√ε))^d log(1/δ), then, with probability at least 1 − δ, E(wmin) ≤ E(w*) + ε. (You may use log(1 − a) ≤ −a for small a and (πd)^{1/d} ≈ 1.)
Problem 7.10 For a neural network with at least 1 hidden layer and tanh(·) transformations in each non-input node, what is the gradient (with respect to the weights) if all the weights are set to zero?
Is it a good idea to initialize the weights to zero?

Problem 7.11 [Optimal Learning Rate] Suppose that we are in the vicinity of a local minimum, w*, of the error surface, or that the error surface is quadratic. The expression for the error function is then given by

E(wt) = E(w*) + (1/2)(wt − w*)ᵗH(wt − w*),   (7.8)

from which it is easy to see that the gradient is given by gt = H(wt − w*). The weight updates are then given by wt+1 = wt − ηH(wt − w*), and subtracting w* from both sides, we see that

εt+1 = (I − ηH)εt,   (7.9)

where εt = wt − w*. Since H is symmetric, one can form an orthonormal basis with its eigenvectors. Projecting εt and εt+1 onto this basis, we see that, in this basis, each component


decouples from the others, and letting ε(α) be the α-th component in this basis, we see that

εt+1(α) = (1 − ηλα) εt(α),   (7.10)

so each component exhibits linear convergence with its own coefficient of convergence kα = 1 − ηλα. The worst component will dominate the convergence, so we are interested in choosing η so that the kα with largest magnitude is minimized. Since H is positive definite, all the λα's are positive, so it is easy to see that one should choose η so that 1 − ηλmin = −(1 − ηλmax), equalizing the magnitudes of the two extreme coefficients. Solving for the optimal η, one finds that

ηopt = 2 / (λmin + λmax),   kopt = (1 − c) / (1 + c),   (7.11)

where c = λmin/λmax is the condition number of H, and is an important measure of the stability of H. When c ≈ 0, one usually says that H is ill-conditioned. Among other things, this affects one's ability to numerically compute the inverse of H.
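Equation (7.11) is easy to verify numerically: iterate εt+1 = (I − ηopt H)εt on a random positive definite H and compare the asymptotic per-step shrink factor with kopt. A quick sketch (our own example):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))
H = A @ A.T + 5 * np.eye(5)                  # positive definite Hessian
lam = np.linalg.eigvalsh(H)
eta_opt = 2 / (lam.min() + lam.max())        # equation (7.11)
c = lam.min() / lam.max()
k_opt = (1 - c) / (1 + c)

eps = rng.normal(size=5)
M = np.eye(5) - eta_opt * H
for _ in range(200):
    eps = M @ eps                            # eps_{t+1} = (I - eta H) eps_t
# the per-step shrink factor approaches the worst coefficient k_opt
print(k_opt, np.linalg.norm(M @ eps) / np.linalg.norm(eps))
```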
Problem 7.12 [Hard] With a variable learning rate, suppose that ηt → 0 satisfying Σt ηt = ∞ and Σt ηt² < ∞; for example, one could choose ηt = 1/(t + 1). Show that gradient descent will converge to a local minimum.
Problem 7.13 [Finite Difference Approximation to Hessian]

(a) Consider the function E(w1, w2). Show that the finite difference approximations to the second order partial derivatives are given by

∂²E/∂w1² = [ E(w1 + 2h, w2) + E(w1 − 2h, w2) − 2E(w1, w2) ] / 4h²,

∂²E/∂w2² = [ E(w1, w2 + 2h) + E(w1, w2 − 2h) − 2E(w1, w2) ] / 4h²,

∂²E/∂w1∂w2 = [ E(w1 + h, w2 + h) + E(w1 − h, w2 − h) − E(w1 + h, w2 − h) − E(w1 − h, w2 + h) ] / 4h².
(b) Give an algorithm to compute the finite difference approximation to the Hessian matrix for Ein(w), the in-sample error for a multilayer neural network with weights w = [W(1), . . . , W(L)].
(c) Compute the asymptotic running time of your algorithm in terms of the number of weights in your network and the number of data points.
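As a sketch, here are the three approximations from part (a), checked on a function whose Hessian is known (our own example, not from the text):

```python
def fd_hessian_2d(E, w1, w2, h=1e-3):
    """Finite difference Hessian approximations of Problem 7.13(a);
    the last entry is the mixed partial."""
    d11 = (E(w1 + 2*h, w2) + E(w1 - 2*h, w2) - 2*E(w1, w2)) / (4*h*h)
    d22 = (E(w1, w2 + 2*h) + E(w1, w2 - 2*h) - 2*E(w1, w2)) / (4*h*h)
    d12 = (E(w1 + h, w2 + h) + E(w1 - h, w2 - h)
           - E(w1 + h, w2 - h) - E(w1 - h, w2 + h)) / (4*h*h)
    return [[d11, d12], [d12, d22]]

# example: E(w1, w2) = w1^2 w2 has Hessian [[2*w2, 2*w1], [2*w1, 0]]
print(fd_hessian_2d(lambda a, b: a*a*b, 1.0, 2.0))   # [[4, 2], [2, 0]]
```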

Problem 7.14 Suppose we take a fixed step of size η in some direction; we ask what is the optimal direction for this fixed step, assuming that the quadratic model for the error surface is accurate:

Ein(wt + Δw) = Ein(wt) + gtᵗΔw + (1/2)ΔwᵗHtΔw.


So we want to minimize Ein(wt + Δw) with respect to Δw, subject to the constraint that the step size is η, i.e., that ΔwᵗΔw = η².

(a) Show that the Lagrangian for this constrained minimization problem is:

L = Ein(wt) + gtᵗΔw + (1/2)Δwᵗ(Ht + 2λI)Δw,   (7.12)

where λ is the Lagrange multiplier.
(b) Solve for Δw and λ, and show that they satisfy the two equations:

Δw = −(Ht + 2λI)⁻¹gt,
ΔwᵗΔw = η².

(c) Show that λ satisfies the implicit equation:

λ = −(1/2η²) (Δwᵗgt + ΔwᵗHtΔw).

Argue that the second term is O(1) and the first is O(‖gt‖/η). So, λ is large for a small step size η.
(d) Assume that λ is large. Show that, to leading order in 1/λ,

λ = ‖gt‖ / 2η.

Therefore λ is large, consistent with expanding Δw to leading order in 1/λ. [Hint: expand Δw to leading order in 1/λ.]
(e) Using (d), show that Δw = −( Ht + (‖gt‖/η) I )⁻¹ gt.

Problem 7.15 The outer-product Hessian approximation is H = Σ_{n=1}^{N} gn gnᵗ. Let Hk = Σ_{n=1}^{k} gn gnᵗ be the partial sum to k, and let Hk⁻¹ be its inverse.

(a) Show that

Hk+1⁻¹ = Hk⁻¹ − ( Hk⁻¹ gk+1 gk+1ᵗ Hk⁻¹ ) / ( 1 + gk+1ᵗ Hk⁻¹ gk+1 ).

[Hints: Hk+1 = Hk + gk+1 gk+1ᵗ; and, (A + zzᵗ)⁻¹ = A⁻¹ − (A⁻¹zzᵗA⁻¹) / (1 + zᵗA⁻¹z).]
(b) Use part (a) to give an O(NW²) algorithm to compute HN⁻¹, the same time it takes to compute H. (W is the number of dimensions in g.)

Note: typically, this algorithm is initialized with H0 = εI for some small ε. So the algorithm actually computes (H + εI)⁻¹; the results are not very sensitive to the choice of ε, as long as ε is small.
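A sketch of the resulting O(NW²) recursion, using the rank-one update of part (a) and initialized with H0 = εI as in the note:

```python
import numpy as np

def outer_product_inverse(G, eps=1e-6):
    """Incrementally compute (sum_n g_n g_n^T + eps*I)^{-1}.
    G holds the gradients g_n as rows (N x W)."""
    W = G.shape[1]
    H_inv = np.eye(W) / eps                 # (H_0)^{-1} = (eps*I)^{-1}
    for g in G:
        Hg = H_inv @ g                      # O(W^2) work per data point
        H_inv -= np.outer(Hg, Hg) / (1.0 + g @ Hg)
    return H_inv
```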


Problem 7.16 In the text, we computed an upper bound on the VC-dimension of the 2-layer perceptron: dvc = O(md log(md)), where m is the number of hidden units in the hidden layer. Prove that this bound is essentially tight by showing that dvc = Ω(md). To do this, show that it is possible to find md points that can be shattered when m is even, as follows.
Consider any set of N points x1, . . . , xN in general position, with N = md. N points in d dimensions are in general position if no subset of d + 1 points lies on a (d − 1)-dimensional hyperplane. Now, consider any dichotomy on these points with r of the points classified +1. Without loss of generality, relabel the points so that x1, . . . , xr are +1.

(a) Show that without loss of generality, you can assume that r ≤ N/2. For the rest of the problem you may therefore assume that r ≤ N/2.
(b) Partition these r positive points into groups of size d. The last group may have fewer than d points. Show that the number of groups is at most N/2. Label these groups Di for i = 1, . . . , q ≤ N/2.
(c) Show that for any subset of k points, with k ≤ d, there is a hyperplane containing those points and no others.
(d) By the previous part, let wi, bi be the hyperplane through the points in group Di, and containing no others. So

wiᵗxn + bi = 0

if and only if xn ∈ Di. Show that it is possible to find h small enough so that for xn ∈ Di,

|wiᵗxn + bi| < h,

and for xn ∉ Di,

|wiᵗxn + bi| > h.

(e) Show that for xn ∈ Di,

sign(wiᵗxn + bi + h) + sign(−wiᵗxn − bi + h) = 2,

and for xn ∉ Di,

sign(wiᵗxn + bi + h) + sign(−wiᵗxn − bi + h) = 0.

(f) Use the results so far to construct a 2-layer MLP with 2q hidden units which implements the dichotomy (which was arbitrary). Complete the argument to show that dvc ≥ md.
