
4 Multilayer perceptrons and Radial Basis Functions

4.1 Multilayer perceptrons


The simple perceptron in Fig. 3.2 can be expected to handle problems that are linearly separable. The classification approach in Fig. 3.2 may be seen to be similar to a neural network, connecting inputs to outputs by weighted connections. In general, note that not every output needs to be connected to every input.
To tackle more complicated (nonlinear) situations, we have two options: we either increase the set of connections for a given set of neurons, or else we increase the number of neurons (this is the preferred approach). The additional neurons are added in a hidden layer, between the input and output neuron layers. This is therefore a multi-layer perceptron (MLP) (Fig. 4.1).

[Fig. 4.1: A multilayer perceptron]

The inputs and outputs to the hidden layer are unknown, and need to be computed from the training data. When the input training data is propagated forwards, we finally generate a set of outputs which are compared to the known outputs; the error in modelling is then identified.
A simple example where the perceptron does not work is the XOR function. The XOR function has two inputs with values 0 or 1, and an output which has value 0 if the two inputs are equal, and 1 otherwise. An MLP with a hidden layer of 3 neurons which solves this is shown in Fig. 4.2.

[Fig. 4.2: An MLP for the XOR function]
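To make the XOR example concrete, here is a minimal sketch in Python. The weights are hand-picked for illustration and a hard step activation is used; this particular two-hidden-unit solution is an assumption chosen for brevity, whereas Fig. 4.2 uses three hidden neurons.

```python
import numpy as np

def step(x):
    # Hard threshold: 1 if the input is positive, else 0.
    return (x > 0).astype(int)

# Illustrative weights: hidden unit 1 acts as OR, hidden unit 2 as AND;
# the output fires for "OR but not AND", which is exactly XOR.
V = np.array([[1.0, 1.0],          # weights from x1 to the two hidden units
              [1.0, 1.0]])         # weights from x2 to the two hidden units
b_hidden = np.array([-0.5, -1.5])  # hidden thresholds (bias terms)
W = np.array([1.0, -1.0])          # hidden -> output weights
b_out = -0.5                       # output threshold

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    a = step(np.array(x) @ V + b_hidden)   # hidden activations
    y = int(a @ W + b_out > 0)             # output activation
    print(x, "->", y)                      # prints 0, 1, 1, 0
```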

Backpropagation
Let $h_j$ be the sum of inputs coming in from the input layer into node j of the hidden layer:

$$h_j = \sum_i x_i v_{ij} \tag{4.1}$$
We assume that this hidden node then gets activated depending on the magnitude of its input, $h_j$. The activation function is typically chosen to be a sigmoid, in which case its derivative is easy to find. In general, if g(·) denotes the activation function, (a) its derivative g′ must be computable, and (b) g should be fairly constant at the extremes, but should change rapidly in the middle of the range. This allows us to have a small range over which the neuron changes state (fires or stops firing). A sigmoid function serves as an approximation to a step function: for positive β, the output of node j then is

$$a_j = g(h_j) = \frac{1}{1 + \exp(-\beta h_j)} \tag{4.2}$$

This conveniently gives

$$g'(h) = \beta\, a(1 - a) \tag{4.3}$$
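As a quick check (a one-line derivation using the definition in Eq. 4.2):

$$g'(h) = \frac{d}{dh}\left(1 + e^{-\beta h}\right)^{-1} = \frac{\beta\, e^{-\beta h}}{\left(1 + e^{-\beta h}\right)^2} = \beta\, g(h)\,\frac{e^{-\beta h}}{1 + e^{-\beta h}} = \beta\, g(h)\bigl(1 - g(h)\bigr) = \beta\, a(1 - a)$$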

An alternate activation function is the tanh function:

$$a = g(h) = \tanh(h) = \frac{\exp(h) - \exp(-h)}{\exp(h) + \exp(-h)} \tag{4.4}$$

The tanh function saturates at ±1; the sigmoid function saturates at 0 and 1.


For a node k in the output layer, we have its input $h_k$ and output (denoted $y_k$ instead of $a_k$) as follows:

$$h_k = \sum_j a_j w_{jk} \tag{4.5}$$

$$y_k = g(h_k) = \frac{1}{1 + \exp(-\beta h_k)} \tag{4.6}$$

After proceeding through the network in the forward direction, we end up predicting a set of outputs which need to be compared to the true output values (i.e. the targets t). The values of the computed outputs depend on (a) the current input x, (b) the activation function g(·) of the nodes of the network, and (c) the weights of the network (denoted v for the first layer and w for the second layer). The error then is

$$E(\mathbf{w}) = \frac{1}{2}\sum_{k=1}^{n}(t_k - y_k)^2 = \frac{1}{2}\sum_{k}\left[t_k - g\!\left(\sum_j w_{jk} a_j\right)\right]^2 \tag{4.7}$$

Notation: i, j and k will be used to indicate the input, hidden, and output nodes, respectively. Then $a_j$ and $y_k$ are the activation levels of the hidden and output nodes.

Let $h_k$ be the input to the output-layer neuron k. Then, summing over all the hidden-layer neurons feeding into output neuron k gives

$$h_k = \sum_l w_{lk} a_l \tag{4.8}$$

We wish to minimize the error E by manipulating the weights $w_{jk}$. We use the chain rule:

$$\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial h_k}\,\frac{\partial h_k}{\partial w_{jk}} \tag{4.9}$$

To determine the second term above, we first note that $\partial w_{lk}/\partial w_{jk} = 0$ for all values of l except l = j:

$$\frac{\partial h_k}{\partial w_{jk}} = \frac{\partial}{\partial w_{jk}}\sum_l w_{lk} a_l = a_j \tag{4.10}$$

The first term in Eq. 4.9 is denoted $\delta_o$ (subscript o for output layer) and is also computed using the chain rule:

$$\delta_o = \frac{\partial E}{\partial h_k} = \frac{\partial E}{\partial y_k}\,\frac{\partial y_k}{\partial h_k} \tag{4.11}$$

Neuron k in the output layer has its output (g(·) is the activation function)

$$y_k = g\!\left(h_k^{\text{output}}\right) = g\!\left(\sum_j w_{jk}\, a_j^{\text{hidden}}\right) \tag{4.12}$$

Notation: a is the activation of a hidden neuron; y is the activation of an output neuron. The superscripts to h (hidden, or output) help keep track of which neurons we compute inputs to.

Then

$$\begin{aligned}
\delta_o &= \frac{\partial E}{\partial g\!\left(h_k^{\text{output}}\right)}\,\frac{\partial g\!\left(h_k^{\text{output}}\right)}{\partial h_k^{\text{output}}}
= \frac{\partial}{\partial g\!\left(h_k^{\text{output}}\right)}\left[\frac{1}{2}\sum_k\left(t_k - g\!\left(h_k^{\text{output}}\right)\right)^2\right] g'\!\left(h_k^{\text{output}}\right) \\
&= -\left(t_k - g\!\left(h_k^{\text{output}}\right)\right) g'\!\left(h_k^{\text{output}}\right)
= (y_k - t_k)\, g'\!\left(h_k^{\text{output}}\right)
\end{aligned} \tag{4.13}$$

The update rule for the weights is now written in analogous form to Eq. 3.36, where the minus sign ensures we go downhill and reduce the error:

$$w_{jk} \leftarrow w_{jk} - \eta\,\frac{\partial E}{\partial w_{jk}} \tag{4.14}$$


where for a sigmoid activation function, with the scaling factor β set to 1,

$$\frac{\partial E}{\partial w_{jk}} = \delta_o\, a_j = (y_k - t_k)\, y_k (1 - y_k)\, a_j \tag{4.15}$$

In similar fashion we can compute the gradients for the weights $v_{ij}$ which connect the inputs to the hidden nodes. We compute $\delta_h$ as follows (the summation is over the output nodes):

$$\delta_h = \sum_k \frac{\partial E}{\partial h_k^{\text{output}}}\,\frac{\partial h_k^{\text{output}}}{\partial h_j^{\text{hidden}}} = \sum_k \delta_o\,\frac{\partial h_k^{\text{output}}}{\partial h_j^{\text{hidden}}} \tag{4.16}$$

Since

$$h_k^{\text{output}} = \sum_l w_{lk}\, g\!\left(h_l^{\text{hidden}}\right) \tag{4.17}$$

it follows that

$$\frac{\partial h_k^{\text{output}}}{\partial h_j^{\text{hidden}}} = \frac{\partial}{\partial h_j^{\text{hidden}}}\sum_l w_{lk}\, g\!\left(h_l^{\text{hidden}}\right) \tag{4.18}$$

Using the fact that $\partial h_l/\partial h_j = 0$ for $l \neq j$ gives

$$\frac{\partial h_k^{\text{output}}}{\partial h_j^{\text{hidden}}} = w_{jk}\, g'\!\left(h_j^{\text{hidden}}\right) = w_{jk}\, a_j (1 - a_j) \tag{4.19}$$

Hence we get the updates

$$\delta_h = a_j (1 - a_j) \sum_k \delta_o\, w_{jk} \tag{4.20}$$

$$\frac{\partial E}{\partial v_{ij}} = a_j (1 - a_j)\left(\sum_k \delta_o\, w_{jk}\right) x_i \tag{4.21}$$

$$v_{ij} \leftarrow v_{ij} - \eta\,\frac{\partial E}{\partial v_{ij}} \tag{4.22}$$

The MLP backpropagation algorithm


Initialization: the weights are set to small random values (positive or negative). A typical trick is to assume that each neuron has n inputs with unit variance, in which case (with w being the initialization value of the weights) the cumulative input to the neuron is of the form $w\sqrt{n}$. The weights may then be chosen as random values in the range $-1/\sqrt{n} < w < 1/\sqrt{n}$, which consequently means that the total input to a neuron has an approximate magnitude of 1. Note that unit variances can be achieved for your inputs by standardizing them: subtract the mean and then divide by the standard deviation. Hence good practice is to standardize the inputs.
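As a rough illustration, a Python/numpy sketch of the standardization and the $1/\sqrt{n}$ initialization might look as follows (the function names and the example layer sizes are assumptions, not part of the notes):

```python
import numpy as np

def standardize(X):
    # Zero-mean, unit-variance columns: subtract the mean, divide by the std.
    return (X - X.mean(axis=0)) / X.std(axis=0)

def init_weights(n_in, n_out, rng=np.random.default_rng(0)):
    # Uniform values in (-1/sqrt(n_in), +1/sqrt(n_in)), so the total input
    # to each neuron has an approximate magnitude of 1.
    bound = 1.0 / np.sqrt(n_in)
    return rng.uniform(-bound, bound, size=(n_in, n_out))

# Example: m = 4 inputs, n = 5 hidden nodes, p = 2 outputs (arbitrary sizes)
V = init_weights(4, 5)   # input -> hidden weights
W = init_weights(5, 2)   # hidden -> output weights
```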
Training: for each input vector (observation) we have a forwards phase where we predict the output values using the current weights, followed by a backwards phase, where we adjust the weights.

Forwards phase: determine the activation of each neuron j in the hidden layer using

$$h_j = \sum_i x_i v_{ij}, \qquad a_j = g(h_j) = \frac{1}{1 + \exp(-h_j)} \tag{4.23}$$

Repeat next for the output layer using

$$h_k = \sum_j a_j w_{jk}, \qquad y_k = g(h_k) = \frac{1}{1 + \exp(-h_k)} \tag{4.24}$$


Backwards phase: compute the error at the output:

$$\delta_{ok} = (t_k - y_k)\, y_k (1 - y_k) \tag{4.25}$$

Compute the error in the hidden layer using

$$\delta_{hj} = a_j (1 - a_j) \sum_k w_{jk}\, \delta_{ok} \tag{4.26}$$

Update the output layer weights next:

$$w_{jk} \leftarrow w_{jk} + \eta\, \delta_{ok}\, a_j^{\text{hidden}} \tag{4.27}$$

Update the hidden layer weights:

$$v_{ij} \leftarrow v_{ij} + \eta\, \delta_{hj}\, x_i \tag{4.28}$$

This process (across all input vectors) is repeated until learning stops. The inputs are fed in a randomized order in the different iterations, to avoid bias.

Recall: we use the forwards phase as described above for a new observation.
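Putting the forwards and backwards phases together, here is a minimal sequential-mode sketch in Python/numpy of Eqs. 4.23-4.28. The bias handling (a constant input of 1 appended to the input and hidden layers), the learning rate eta, the epoch count, and the XOR usage example are assumptions for illustration, not a tuned implementation.

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def train_mlp(X, T, n_hidden=3, eta=0.5, epochs=5000, seed=0):
    """Sequential (per-observation) backpropagation, Eqs. 4.23-4.28.
    A constant bias input of 1 is appended to the input and hidden layers."""
    rng = np.random.default_rng(seed)
    X = np.hstack([X, np.ones((len(X), 1))])                 # bias input column
    m, p = X.shape[1], T.shape[1]
    V = rng.uniform(-1/np.sqrt(m), 1/np.sqrt(m), (m, n_hidden))
    W = rng.uniform(-1/np.sqrt(n_hidden + 1), 1/np.sqrt(n_hidden + 1), (n_hidden + 1, p))
    for _ in range(epochs):
        for i in rng.permutation(len(X)):                    # randomized order of inputs
            x, t = X[i], T[i]
            a = np.append(sigmoid(x @ V), 1.0)               # hidden activations + bias (4.23)
            y = sigmoid(a @ W)                               # outputs (4.24)
            delta_o = (t - y) * y * (1 - y)                  # output error (4.25)
            delta_h = a[:-1] * (1 - a[:-1]) * (W[:-1] @ delta_o)  # hidden error (4.26)
            W += eta * np.outer(a, delta_o)                  # output weight update (4.27)
            V += eta * np.outer(x, delta_h)                  # hidden weight update (4.28)
    return V, W

# Illustrative usage on the XOR data (targets as a column vector):
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
V, W = train_mlp(X, T)

# Forward pass on the training data to inspect the learned outputs:
Xb = np.hstack([X, np.ones((4, 1))])
A = np.hstack([sigmoid(Xb @ V), np.ones((4, 1))])
print(np.round(sigmoid(A @ W), 2))
```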

Some notes on practical usage of MLPs


There is theory which proves that at most two hidden layers are needed if we use sigmoidal activation functions. For radial basis functions (discussed below), we need just the one hidden layer.

There is no good theory which explains the number of neurons per hidden layer, however. Trial and error works best, with an eye on prediction accuracy.

Given m, n, and p as the number of nodes in the input, hidden, and output layers, there are (m + 1)n + (n + 1)p weights (the +1 comes from the bias nodes). A good rule of thumb is to have at least 10 times as many training data points as the number of weights that have to be fit by backpropagation.
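For illustration (with assumed layer sizes): m = 10 inputs, n = 20 hidden nodes, and p = 3 outputs give (10 + 1) × 20 + (20 + 1) × 3 = 220 + 63 = 283 weights, so the rule of thumb asks for roughly 2,830 training observations.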
The sigmoid activation function (with its 0 and 1 states) facilitates classification. If instead we had linear output nodes (g(h) = h), we would simply be summing up the inputs. This is useful in regression problems, where we want outputs which are not just 0s and 1s. Note that in our algorithm above, we then use

$$\delta_{ok} = (t_k - y_k), \qquad y_k = g(h_k) = h_k \tag{4.29}$$

We can also use a soft-max activation function to rescale the outputs so that they lie between 0 and 1:

$$y_k = g(h_k) = \frac{\exp(h_k)}{\sum_{k'} \exp(h_{k'})} \tag{4.30}$$

The update equation continues to be the same as for the linear output: $\delta_{ok} = t_k - y_k$.
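A minimal sketch of the soft-max of Eq. 4.30 in numpy; subtracting the maximum before exponentiating is a standard numerical trick assumed here, not something required by the equation:

```python
import numpy as np

def softmax(h):
    # Eq. 4.30: rescale the output-layer inputs h_k to values in (0, 1) that sum to 1.
    # Subtracting max(h) leaves the result unchanged but avoids overflow in exp().
    e = np.exp(h - np.max(h))
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # three class "scores" -> values summing to 1
```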
The MLP normally involves all of the training data being fed in and the training error being computed before the weights are updated (the batch method). The algorithm above instead operates in sequential (loop) mode. It does the job, but it would be slower than the batch method. The sequential method, however, has a better chance (given the random order in which points are used for training) of avoiding local minima when trying to minimize the training error. One pass through all training data = one epoch.
We can speed up the optimization by adding a momentum term:

$$w_{ij}^{t} \leftarrow w_{ij}^{t-1} + \eta\, \delta_o\, a_j^{\text{hidden}} + \alpha\, \Delta w_{ij}^{t-1} \tag{4.31}$$

where the superscripts t and t−1 keep track of iterations, and $0 < \alpha < 1$ controls the momentum ($\alpha = 0.9$ usually). In general this may be tweaked to avoid large changes in the weights as t becomes large.
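In code, the momentum update of Eq. 4.31 only requires remembering the previous weight change. A self-contained sketch (numpy; the shapes, the made-up activations and errors, and the values of eta and alpha are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, eta = 0.9, 0.25                  # momentum and learning rate (assumed values)
W = rng.uniform(-0.5, 0.5, (3, 1))      # hidden -> output weights (illustrative shape)
dW_prev = np.zeros_like(W)              # previous weight change, Delta w^{t-1}

# One illustrative update step, given hidden activations a and output error delta_o:
a = np.array([0.2, 0.7, 0.5])           # made-up hidden activations
delta_o = np.array([0.1])               # made-up output error
dW = eta * np.outer(a, delta_o) + alpha * dW_prev   # Eq. 4.31
W += dW
dW_prev = dW                            # carried into the next iteration
```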


There are important issues of over-fitting, validation of the MLP, and deciding thresholds for when to stop learning. Since these impact practically any classification method, we will discuss them later.

4.2 Radial Basis Functions


The MLP as described in the previous section allows one input vector to cause several neurons in a hidden layer to fire (activate); this distributed pattern of activation in turn triggers activity in subsequent hidden and output layers. Below, we use a different approach, where a neuron responds only if the inputs lie in a specific (and particular) range of input space. This ensures that similar inputs result in similar outputs. This involves visualizing a neuron in weight space: use an axis for each weight coming into a neuron, and plot a neuron using $w_i$ as the position along axis i. When the weights are trained, the location of the neurons changes in weight space. Notice that if we omit the bias node in this graphical depiction, the weight space has the same number of dimensions as the input space, in which case the inputs themselves can also be plotted in this (weight) space. Now, if we desire that a neuron respond specifically to inputs only in a certain range, then this implies that the Euclidean distance from the inputs to the neuron must be small. Using a probabilistic description, the probability of a neuron producing a non-zero output goes to 0 as we move away from the node: the activation function is of the form

$$g(\mathbf{x}, \mathbf{w}, \sigma) = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{w}\|^2}{2\sigma^2}\right) \tag{4.32}$$

Strictly speaking, real neurons do not behave this way, and behave more like in the MLP discussion above. We therefore stop calling them neurons, and instead call them nodes.

If σ is large, the node responds to practically every input. The regression problem now is therefore one of picking σ values for the individual nodes. It is important that input vectors be normalized: the activations generated otherwise could be incorrect.
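A direct transcription of Eq. 4.32 as a numpy sketch; the vectorization over several node centres and the example values are added conveniences, not part of the equation:

```python
import numpy as np

def rbf_activation(x, centres, sigma):
    # Eq. 4.32: Gaussian response of each node, centred at its weight vector.
    # `centres` holds one weight (position) vector per row.
    d2 = np.sum((centres - x) ** 2, axis=1)        # squared Euclidean distances
    return np.exp(-d2 / (2.0 * sigma ** 2))

centres = np.array([[0.0, 0.0], [1.0, 1.0]])       # two illustrative node locations
print(rbf_activation(np.array([0.9, 1.1]), centres, sigma=0.5))
```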

The Radial Basis Function network

More than one hidden layer is unnecessary, since the RBFN is a universal approximator, just as the two-hidden-layer MLP was.

The RBF network is practically an MLP with one hidden layer. This hidden layer has RBFs describing the activation of the nodes; the output layer can continue to have non-Gaussian (e.g. sigmoidal) activations. A bias input is usually added for the output layer, to deal with the scenario where none of the RBF nodes fire. This last (output) layer is then just a perceptron, and training it is straightforward. What needs attention is how to train the weights into the RBF layer.

The RBF layer and the output nodes have not only different activation functions but also different purposes: the RBF layer generates a nonlinear representation of the inputs, while the output layer uses these nonlinear inputs in a linear classifier (assuming a simple perceptron has been used). Practically, there are two tasks: locating the RBF nodes in weight space, and using the activations of the RBF nodes to train the linear outputs. There are several ways to set up the RBF nodes: assuming that we have good training data, we can randomly use some of the inputs as locations for our nodes. Alternately, we can use an unsupervised approach like the k-means algorithm to determine the node locations.
For each input vector, we compute the activation of all the hidden nodes, as represented by a matrix G; then $G_{ij}$ describes the activation of hidden node j for input i. The outputs are then $\mathbf{y} = G\mathbf{W}$ for a set of weights $\mathbf{W}$. We need to minimize the deviation of $\mathbf{y}$ from $\mathbf{t}$, the output targets. Then, using the pseudo-inverse (i.e. solving in the least-squares sense),

$$\mathbf{W} = \left(G^{\top} G\right)^{-1} G^{\top}\, \mathbf{t} \tag{4.33}$$
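Putting the pieces together, here is a sketch of RBF-network training in Python/numpy: node locations chosen as a random subset of the inputs (one of the options mentioned above), σ supplied by the user (a rule for choosing it is given below), and the output weights from the least-squares solution of Eq. 4.33. Using np.linalg.lstsq rather than forming (GᵀG)⁻¹ explicitly is a numerical convenience, and the bias column and the toy data are assumptions for illustration.

```python
import numpy as np

def train_rbf(X, t, M=10, sigma=1.0, seed=0):
    """Fit an RBF network: pick M centres from the data, build the activation
    matrix G, then solve for the output weights in the least-squares sense (Eq. 4.33)."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=M, replace=False)]   # random training inputs as centres
    d2 = np.sum((X[:, None, :] - centres[None, :, :]) ** 2, axis=2)
    G = np.exp(-d2 / (2.0 * sigma ** 2))                     # G[i, j]: activation of node j for input i
    G = np.hstack([G, np.ones((len(X), 1))])                 # bias column for the output layer
    W, *_ = np.linalg.lstsq(G, t, rcond=None)                # least-squares solve of G W = t
    return centres, W

def predict_rbf(Xnew, centres, W, sigma=1.0):
    d2 = np.sum((Xnew[:, None, :] - centres[None, :, :]) ** 2, axis=2)
    G = np.hstack([np.exp(-d2 / (2.0 * sigma ** 2)), np.ones((len(Xnew), 1))])
    return G @ W

# Illustrative usage: fit a noisy sine curve (made-up data).
X = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)
t = np.sin(X).ravel() + 0.05 * np.random.default_rng(1).standard_normal(100)
centres, W = train_rbf(X, t, M=10, sigma=0.8)
print(predict_rbf(X[:5], centres, W, sigma=0.8))
```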


The value of σ is also important, as indicated before. Since we ideally desire good coverage of weight space, the width σ of the Gaussians should be a function of the maximum distance d between hidden node locations and of the number of hidden nodes. If we use M RBFs, then σ is chosen as $\sigma = d/\sqrt{2M}$.

A potential problem with the Gaussian function is that if a new input turns up far away from the current locations of the nodes, none of them fire. To prevent this, the Gaussians themselves are typically normalized using the soft-max function

$$g(\mathbf{x}, \mathbf{w}, \sigma) = \frac{\exp\!\left(-\|\mathbf{x} - \mathbf{w}\|^2 / 2\sigma^2\right)}{\sum_i \exp\!\left(-\|\mathbf{x} - \mathbf{w}_i\|^2 / 2\sigma^2\right)} \tag{4.34}$$

This ensures that, in a relative sense, at least one of the terms is large and hence at least one node should fire.
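The normalization in Eq. 4.34 is a one-line change to the earlier activation sketch (numpy; the example centres and the far-away test input are assumptions):

```python
import numpy as np

def normalized_rbf_activation(x, centres, sigma):
    # Eq. 4.34: soft-max-normalized Gaussian activations, so the responses
    # sum to 1 and at least one node fires appreciably for any input.
    d2 = np.sum((centres - x) ** 2, axis=1)
    e = np.exp(-d2 / (2.0 * sigma ** 2))
    return e / e.sum()

centres = np.array([[0.0, 0.0], [1.0, 1.0]])
# A far-away input: raw Gaussians would all be tiny, yet the normalized responses sum to 1.
print(normalized_rbf_activation(np.array([5.0, 5.0]), centres, sigma=0.5))
```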
