Neural networks: an introduction

Peter Ross

AI Applications Institute
University of Edinburgh
September 1999

Contents

1 Introduction
  1.1 Mathematical preliminaries
    1.1.1 Vectors
    1.1.2 Matrices
    1.1.3 Basic combinatorics
    1.1.4 Basic probability and distributions
    1.1.5 Partial differentiation
    1.1.6 Optimisation: Lagrange multipliers
  1.2 About the nervous system

2 Pattern recognition and statistical tools
  2.1 Clustering techniques
  2.2 Principal components analysis
  2.3 Canonical discriminant analysis

3 Perceptrons
  3.1 Threshold units
  3.2 Perceptrons and learning
  3.3 The single-layer perceptron
  3.4 The Perceptron Convergence Theorem
  3.5 Multiple outputs
  3.6 The two-layer perceptron
  3.7 What can be represented?
  3.8 Are linearly separable functions common or rare?

4 Backpropagation
  4.1 Smooth activation
  4.2 The error function
  4.3 The maths of gradient descent in weight space
  4.4 The back-propagation algorithm
  4.5 An example of back-propagation
  4.6 Some practice
  4.7 Non-layered feed-forward networks
  4.8 Variations on back-propagation
    4.8.1 Other error functions
    4.8.2 The Fahlman variation
    4.8.3 Momentum
    4.8.4 Dynamically adjusted parameters
    4.8.5 Line search and conjugate gradient methods
  4.9 What is learned by a net?
  4.10 Training and test data
  4.11 What can be learned by a feed-forward network?

5 Applying the technology
  5.1 The T-C problem: a simple example
  5.2 Simple digit recognition
  5.3 Some observations
  5.4 Finding a network topology
    5.4.1 Weight decay
    5.4.2 Frean's Upstart algorithm
    5.4.3 Mezard and Nadal's Tiling algorithm
  5.5 Observations about input data preparation
  5.6 Applications
    5.6.1 NETtalk
    5.6.2 Playing backgammon
    5.6.3 Hyphenation
    5.6.4 Cytodiagnosis of cervical cancer
    5.6.5 Time series prediction
    5.6.6 Recognising handwritten digits
    5.6.7 Active sonar target classification
    5.6.8 Other applications

6 Recurrent networks
  6.1 Introduction
  6.2 Jordan and Elman networks
  6.3 Backpropagation through time
  6.4 Recurrent backpropagation

7 Associative networks
  7.1 Hamming Distance
  7.2 The Hopfield Net
    7.2.1 Operation of Hopfield Nets
  7.3 Energy functions in the Hopfield Net
    7.3.1 Programming the net
    7.3.2 Searching and optimisation problems
    7.3.3 Stability in the Hopfield net
  7.4 The Boltzmann machine
    7.4.1 The update rule for the Boltzmann machine
    7.4.2 Using and Training the Boltzmann machine
    7.4.3 Analysing the Boltzmann machine
    7.4.4 Summary
  7.5 Willshaw's Associative Net
    7.5.1 The simple matrix memory model
    7.5.2 Estimating the memory capacity
    7.5.3 Partially connected nets
    7.5.4 Information efficiency
  7.6 Kanerva's Sparse Distributed Memory

8 Self-organizing networks
  8.1 The Kohonen net
    8.1.1 Normalised weight vectors
    8.1.2 Caveats for the Kohonen net
    8.1.3 Applications
  8.2 LVQ: a supervised variation

9 Genetic Algorithms
  9.1 Introduction to genetic algorithms
  9.2 Basics of genetic algorithms
  9.3 Other variations
  9.4 Why does it work? The schema theorem
  9.5 Deceptive problems
  9.6 Applications
  9.7 Classifier systems

A The rbp back-propagation simulator
  A.1 Introduction
  A.2 A Simple Example
    A.2.1 Listing Patterns
    A.2.2 Examining Weights
    A.2.3 The Help Command
    A.2.4 To Quit the Program
  A.3 The Format Command
    A.3.1 Input Patterns
    A.3.2 Output of Patterns
    A.3.3 Breaking up the Output Values
    A.3.4 Pattern Formats
    A.3.5 Controlling Summaries
    A.3.6 Ringing the Bell
    A.3.7 Echoing Input
    A.3.8 Paging
    A.3.9 Making a Copy of Your Session
    A.3.10 Up-To-Date Statistics
  A.4 Taking Training and Testing Patterns from Files
  A.5 Saving and Restoring Weights and Related Values
  A.6 Initializing Weights and Giving the Network a Kick
  A.7 Setting the Algorithm to Use
    A.7.1 Activation Functions
    A.7.2 Sharpness (or Gain)
    A.7.3 The Derivatives
    A.7.4 Update Methods
    A.7.5 Other Algorithm Options
  A.8 The Delta-Bar-Delta Method
  A.9 Quickprop
  A.10 Recurrent Networks
  A.11 The Benchmarking Command
  A.12 Miscellaneous Commands
  A.13 Limitations

B The cluster and pca programs

C Genetic algorithms: the pga program

D Hopfield Net Simulator

E Bibliography


Chapter 1
Introduction
The symbol-processing approach in AI has a long history, and some serious unresolved
problems. A symbol by itself, such as squink, is meaningless; it only starts to mean
something when related to other symbols. For example, the Prolog code:
greng(X,Y) :- % squink-ness is a special case of greng-ness
squink(X,Y).
greng(X,Y) :- % greng (whatever that is!) is transitive
greng(X,Z),
greng(Z,Y).
defines greng to be the name of a transitive relationship and squink to be the name
of a relation which is a special case of it. So greng and squink could be the relations
we would speak of as "is a parent of" and "is the father of", although nothing in the
above code expresses the fact that everyone has one and only one father. If the above
code is a complete definition rather than a fragment then of course squink is unlikely
to be "is the father of", since motherhood ought to figure in parenthood too.
Symbol names are usually chosen to be meaningful to humans, unfortunately, so
that humans can tolerate and allow for the representational inadequacies in such code
describing relationships. It's incredibly hard, indeed practically impossible, to write
code which so tightly defines and interlinks a set of relationships that the whole thing
can only possibly model some unique aspect of our world. Symbol-based programs
are also driven by some specific engine which knows how to manipulate symbols
and their relations. Obviously you can write programs in which symbols identify
relations and there are relations about relations and so on, but at the core of it all
is that engine which contains some base-level knowledge inaccessible to the running
program. Many of the more elaborate symbol-processing programs have some explicit
hierarchical structure, so that a part of the program at (say) level N in the hierarchy
treats parts at level N-1 as data.
However, it's not clear, or even remotely likely, on the evidence now available
to us, that human brains work in such a fashion. There is no master controller; there
is not even some bureaucratic component that has access to all that is happening
everywhere in the brain and shunts summaries of it to various parts, without being in
control. As the saying goes, "inside your head there is nobody watching the movie".
The evidence is that virtually all the interactions going on inside your head that let
you think, act and be intelligent are extremely localised ones; and yet it works. For
example, there does not appear to be any one very highly localised place in your
brain where your grandmother's name is stored, nor is there a cell which responds if and only if you are looking at your grandmother. The evidence for this is that localised brain damage doesn't seem to cause any such specific loss. There are no such "grandmother cells". Even more perplexingly, all the base-level brain events
that together seem to produce your consciousness and thought, appear to happen
on a significantly shorter time-scale than consciousness and thought. In a modern
digital computer the most basic events happen in microseconds or nanoseconds; in a
brain they happen in milliseconds. But it's possible for you to make genuine logical
deductions (rather than, say, massively parallel situation retrieval and match from
some ingenious memory system) very fast, such that only a few hundred consecutive
sets of brain events can have been involved. If each member of each such set utilises
only highly localised information then how can this be reconciled with the observations
that memory and indeed the mechanisms of consciousness are not that localised?
The aim of these notes is to introduce you to the study of the properties of
systems composed of many very simple processing elements with purely local interconnections. The initial motivation is that human brains seem to fit this description
tolerably well. The main questions will be:
- do such systems have interesting emergent properties - that is, properties which are not properties of any of the individual components?
- even if such a study turns out to tell us little about brains and intelligence, are there any properties which might be of practical use?
There are some important categorisations to grasp about the study of such
systems. First, when considering how such systems might learn some behaviour it is
important to distinguish between supervised and unsupervised learning. In supervised
learning the system is presented with some input and told what output is expected,
and its task is then to make internal modifications so that the actual output becomes
closer in some sense to what is expected. In unsupervised learning the system is
presented with inputs but is told nothing about the outputs; through the operation
of some internal algorithm the system re-organises itself under the influence of some
regularities in the set of inputs, so that it eventually learns to classify inputs in some
way. Second, it is important to distinguish between systems which self-modify purely
on the basis of local information (much as the brain must do) and systems which rely on
algorithms that utilise global information; the latter may be biologically implausible
but easier to analyse. Third, it will be important to distinguish between deterministic
and stochastic systems; in the latter, there will be probabilistic components so that
(in theory at least) behaviour will never be perfectly repeatable. Fourth, it will be
important to distinguish between synchronous and asynchronous systems. In the
former events happen only according to the ticking of some notional clock and all
actions happen together, whereas in the latter events happen one at a time and there
is not necessarily any external clock governing just when they happen.
This study will involve some mathematics, so a few mathematical preliminaries
are given below in order to help you appreciate the proofs and arguments in later
chapters. There is also a short account of the human nervous system, which should
help you to appreciate how simplistic this study really is. However, as in all scientific
endeavour it is best to start by understanding the most basic ideas thoroughly before
tackling more complex systems. The study of connectionism, as it is called, is still at
this stage and it has already had useful commercial by-products, so you should not
be downcast by the colossal differences in scale, sophistication and detail between the
human nervous system and the highly simplistic systems currently being considered.

1.1 Mathematical preliminaries

1.1.1 Vectors

The notes assume that you are familiar with the basics of vectors and vector calculations. Let x denote the n-dimensional vector with components
\[ (x_1, x_2, \ldots, x_n) \]
Then |x| denotes the length of this vector, using the usual Euclidean definition:
\[ |\mathbf{x}| = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2} \]
The inner product w.x is defined as:
\[ \mathbf{w}.\mathbf{x} = \sum_{i=1}^{n} w_i x_i \]
and has a natural geometric interpretation as:
\[ \mathbf{w}.\mathbf{x} = |\mathbf{w}|\,|\mathbf{x}| \cos(\theta) \]
where $\theta$ is the angle between the two vectors. Thus if the lengths of two vectors are fixed their inner product is largest when $\theta = 0$, whereupon one is just some constant multiple of the other.
Exercise 1.1    This natural geometric interpretation is not black magic. Show that
\[ \sum_{i=1}^{n} w_i x_i = |\mathbf{w}|\,|\mathbf{x}| \cos(\theta) \]
by using the Theorem of Pythagoras. Start with a triangle whose corners are the origin and the points w and x. This triangle is not necessarily right-angled, of course; the angle at the origin is of size $\theta$.
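
As an illustrative aside (this sketch is not part of the original notes, and assumes Python with the numpy library), the identity can be checked numerically for a pair of example vectors by computing the angle at the origin via the cosine rule, exactly as the exercise suggests:

import numpy as np

w = np.array([1.0, 2.0, 3.0])
x = np.array([4.0, -1.0, 0.5])

# Angle at the origin of the triangle (origin, w, x), via the cosine rule:
# |w - x|^2 = |w|^2 + |x|^2 - 2 |w| |x| cos(theta)
cos_theta = (np.linalg.norm(w)**2 + np.linalg.norm(x)**2
             - np.linalg.norm(w - x)**2) / (2 * np.linalg.norm(w) * np.linalg.norm(x))

print(np.dot(w, x))                                       # 3.5
print(np.linalg.norm(w) * np.linalg.norm(x) * cos_theta)  # 3.5 as well
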

1.1.2 Matrices

The notes assume some familiarity with matrices, which are shown as upper-case bold letters such as A. If the element of the i-th row and j-th column is $a_{ij}$, then $\mathbf{A}^T$ denotes the matrix that has $a_{ji}$ there instead - the transpose of A. So, for example, if A is a $3 \times 3$ matrix:
\[ \mathbf{A} = \begin{pmatrix} 2 & 3 & 4 \\ 4 & 5 & 9 \\ 6 & 7 & 1 \end{pmatrix} \]
then the transpose (written $\mathbf{A}^T$) is:
\[ \mathbf{A}^T = \begin{pmatrix} 2 & 4 & 6 \\ 3 & 5 & 7 \\ 4 & 9 & 1 \end{pmatrix} \]
The product of two matrices A and B has $\sum_k a_{ik} b_{kj}$ in the i-th row and j-th column. The matrix I is the identity or unit matrix, necessarily square, with 1s on the diagonal and 0s everywhere else. If det(A) denotes the determinant of a square matrix A then the equation
\[ \det(\mathbf{A} - \lambda\mathbf{I}) = 0 \]
is called the characteristic polynomial of A. Using the example above, the characteristic polynomial would be:
\[ \begin{vmatrix} 2-\lambda & 3 & 4 \\ 4 & 5-\lambda & 9 \\ 6 & 7 & 1-\lambda \end{vmatrix} = 0 \]
which is
\[ (2-\lambda)\big((5-\lambda)(1-\lambda) - 63\big) - 3\big(4(1-\lambda) - 54\big) + 4\big(28 - 6(5-\lambda)\big) = 0 \]
which simplifies to:
\[ -\lambda^3 + 8\lambda^2 + 82\lambda + 26 = 0 \]
Note that a square matrix must satisfy its own characteristic polynomial (this is the Cayley-Hamilton theorem), so (pre- or post-multiplying through by $\mathbf{A}^{-1}$) it provides a way to calculate the inverse of a matrix using only matrix multiplication, if that inverse exists. Clearly the inverse exists if and only if the matrix is square and $\det(\mathbf{A}) \neq 0$ (note that $\det(\mathbf{A})$ is the constant term in the characteristic polynomial).
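
As an illustrative aside (not from the original notes; it assumes Python with numpy), the characteristic polynomial of the example matrix and the inverse obtained from it can be checked like so:

import numpy as np

A = np.array([[2., 3., 4.],
              [4., 5., 9.],
              [6., 7., 1.]])

# np.poly gives the monic characteristic polynomial lambda^3 - 8 lambda^2 - 82 lambda - 26,
# i.e. the polynomial in the text multiplied through by -1.
print(np.poly(A))                                # approximately [1, -8, -82, -26]

# Cayley-Hamilton: A^3 - 8 A^2 - 82 A - 26 I = 0, so A^{-1} = (A^2 - 8A - 82I) / 26
A_inv = (A @ A - 8 * A - 82 * np.eye(3)) / 26
print(np.allclose(A_inv, np.linalg.inv(A)))      # True
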


The roots of the characteristic polynomial are called the eigenvalues of the matrix. Note that if A is an $m \times n$ matrix and x is an n-dimensional (column) vector, then
\[ \mathbf{y} = \mathbf{A}\mathbf{x} \]
represents a linear map into an m-dimensional space. If A happens to be a square matrix then any vector which is transformed by the linear map into a scalar multiple of itself is called an eigenvector of that matrix. Obviously, in that case $\mathbf{A}\mathbf{x} = \lambda\mathbf{x}$ for some $\lambda$. The eigenvectors can be found by finding the eigenvalues and then solving the linear equation set:
\[ (\mathbf{A} - \lambda\mathbf{I})\mathbf{x} = 0 \]
An orthogonal matrix is a square matrix A such that $\mathbf{A}^T = \mathbf{A}^{-1}$. Such matrices represent a mapping from one rectangular co-ordinate system to another. For such a matrix,
\[ \mathbf{A}\mathbf{A}^T = \mathbf{I} \]
- the inner product of any two different rows is 0 and the inner product of any row with itself is 1.
1.1.3 Basic combinatorics

The number of ways of selecting k items from a collection of n items is
\[ \binom{n}{k} = \frac{n!}{k!\,(n-k)!} \]
if the ordering of the selection doesn't matter. This quantity is also the coefficient of $x^k$ in the expansion of $(1+x)^n$. Stirling's formula provides a useful approximation for dealing with large factorials:
\[ n! \approx n^n e^{-n} \sqrt{2\pi n} \]
There are a huge number of formulae involving combinations. For example, since $(1+x)^{n+1} = (1+x)^n (1+x)$ it is clear that
\[ \binom{n}{k} + \binom{n}{k+1} = \binom{n+1}{k+1} \]
and so on.
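
As a quick illustration (not from the original notes; it assumes Python's standard math module), both the binomial coefficient formula and Stirling's approximation can be checked numerically:

import math

n, k = 20, 6
# Binomial coefficient n! / (k! (n-k)!)
n_choose_k = math.factorial(n) // (math.factorial(k) * math.factorial(n - k))
print(n_choose_k)                       # 38760, the same value math.comb(20, 6) gives

# Stirling's approximation n! ~ n^n e^-n sqrt(2 pi n)
stirling = n**n * math.exp(-n) * math.sqrt(2 * math.pi * n)
print(stirling / math.factorial(n))     # about 0.996, i.e. already within half a percent
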
1.1.4 Basic probability and distributions

A random variable X is a variable which, in different experiments carried out under the same conditions, assumes different values $x_i$, each of which then represents a random event. A discrete random variable can take one of only a finite, or perhaps a countably infinite, set of values. A continuous random variable can take any value in a finite or infinite interval. Random variables are completely characterised by their probability, density and distribution functions.

For a discrete random variable, if $p(X = x)$ is the probability that it takes the value x then
\[ F(x) = p(X < x) \]
is the distribution function of X. For a continuous random variable, there is a probability density function $f(x)$ such that
\[ \int_{-\infty}^{\infty} f(x)\, dx = 1 \]
and the distribution function is then:
\[ F(x) = \int_{-\infty}^{x} f(t)\, dt \]
For a discrete random variable, the mean value is
\[ \mu = \sum_i x_i\, p(X = x_i) \]
and for a continuous variable it is
\[ \mu = \int_{-\infty}^{\infty} t f(t)\, dt \]
The variance $\sigma^2$ is, for a discrete variable:
\[ \sigma^2 = \sum_i (x_i - \mu)^2\, p(X = x_i) \]
and for a continuous variable:
\[ \sigma^2 = \int_{-\infty}^{\infty} (t - \mu)^2 f(t)\, dt \]
There are several widely-occurring distributions that are worth knowing about. Suppose that some event will happen with fixed probability p. Then the probability that it will happen exactly k times in n trials is
\[ \binom{n}{k} p^k (1-p)^{n-k} \]
and this is the binomial distribution. It has mean np and variance np(1-p). If one lets $n \to \infty$ one gets the Gaussian or normal distribution, typically parameterised by two constants a and b; it has density function
\[ \frac{1}{a\sqrt{2\pi}}\, e^{-(x-b)^2/(2a^2)} \]
with mean b and variance $a^2$. If one starts with the binomial distribution and lets $n \to \infty$ and $p \to 0$ with the extra assumption that np = a, where a is some constant, then one gets the Poisson distribution with density function
\[ \frac{a^k e^{-a}}{k!} \]
with mean and variance both a.
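
As an illustrative aside (not in the original notes; it assumes Python with numpy), the binomial mean np and variance np(1-p) can be checked by simple sampling:

import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 0.3

# Draw many binomial samples and compare with the theoretical mean and variance.
samples = rng.binomial(n, p, size=100_000)
print(samples.mean(), n * p)             # roughly 6.0  vs 6.0
print(samples.var(),  n * p * (1 - p))   # roughly 4.2  vs 4.2
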

1.1.5 Partial differentiation

If $z = f(x_1, x_2, \ldots, x_n)$ is a function of n independent variables then one can form the partial derivative of the function with respect to one variable (say $x_i$),
\[ \frac{\partial f}{\partial x_i} \]
by treating all other variables as constant. For example, if
\[ f = xy + y^3 \]
then
\[ \frac{\partial f}{\partial x} = y \qquad\qquad \frac{\partial f}{\partial y} = x + 3y^2 \]
The geometric significance of a quantity such as $\partial f/\partial x$ is as follows. If the function f is plotted and represents some suitably well-behaved surface, then this partial derivative represents the slope of the surface in a direction parallel to the x-axis at any given point (x, y). The total derivative dz is given by
\[ dz = \sum_i \frac{\partial z}{\partial x_i}\, dx_i \]
and clearly, if all the $x_i$ are functions of one variable t then
\[ \frac{dz}{dt} = \sum_i \frac{\partial z}{\partial x_i} \frac{dx_i}{dt} \]
There is a directly analogous version of this chain rule for the case where the $x_i$ are each functions of several variables and you wish to find the partial derivative of z with respect to one of those variables.
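
As an aside (not from the original notes; it assumes Python with the sympy library), the worked example above can be checked symbolically:

import sympy as sp

x, y = sp.symbols('x y')
f = x*y + y**3

# Partial derivatives of the example f = xy + y^3 used in the text:
print(sp.diff(f, x))   # y
print(sp.diff(f, y))   # x + 3*y**2
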
Exercise 1.2    Find the partial derivatives of the function
\[ f(x, y, z) = (x + 2y)^2 \sin(xy) \]

1.1.6 Optimisation: Lagrange multipliers

Suppose that you wish to find the stationary points (maxima or minima) of some n-argument function $f(\mathbf{x}) = f(x_1, \ldots, x_n)$, subject to the m constraints $g_1(\mathbf{x}) = 0, \ldots, g_m(\mathbf{x}) = 0$. Lagrange showed that they could be found as the solution of the $(n+m)$ equations in the $(n+m)$ variables $x_1, \ldots, x_n, \lambda_1, \ldots, \lambda_m$:
\[ \frac{\partial f}{\partial x_1} - \sum_{j=1}^{m} \lambda_j \frac{\partial g_j}{\partial x_1} = 0 \]
\[ \vdots \]
\[ \frac{\partial f}{\partial x_n} - \sum_{j=1}^{m} \lambda_j \frac{\partial g_j}{\partial x_n} = 0 \]
\[ g_1(\mathbf{x}) = 0 \]
\[ \vdots \]
\[ g_m(\mathbf{x}) = 0 \]
where the $\lambda_j$ are m specially-introduced variables called Lagrange multipliers. This theorem is certainly not obvious, but should at least be fairly natural-looking, and it provides a handy way to tackle a range of optimisation problems. Notice that the above equations are the $(n+m)$ partial derivatives of the function
\[ f - \sum_{j=1}^{m} \lambda_j g_j \]
each set to zero.


For example, to find the maximum of $f(x, y) = x + y$ subject to the constraint $x^2 + y^2 = 1$, solve:
\[ 1 - 2\lambda x = 0 \]
\[ 1 - 2\lambda y = 0 \]
\[ x^2 + y^2 - 1 = 0 \]
to get $x = y = \lambda = \pm 1/\sqrt{2}$, after which you should then check to determine which of these two solutions is the true maximum.
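
As a small illustration (not part of the original notes; it assumes Python with sympy), the same three equations can be generated and solved mechanically:

import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)
L = x + y - lam * (x**2 + y**2 - 1)        # f minus lambda times the constraint

# Set all three partial derivatives to zero and solve the resulting system.
eqs = [sp.diff(L, v) for v in (x, y, lam)]
print(sp.solve(eqs, [x, y, lam]))
# two solutions: x = y = lambda = +sqrt(2)/2 and x = y = lambda = -sqrt(2)/2
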
Exercise 1.3    Find the maximum of $y - x$ subject to the constraint that $y + x^2 = 4$.

Exercise 1.4    Find the answer to the same problem experimentally as follows. You can plot the graph of $y = 4 - x^2$ and the graph of $y = x + m$, to find the largest value of m such that the two graphs
still intersect. You can use the Unix command gnuplot (from
within X-windows) for this, like so:
% gnuplot
gnuplot> m=2
gnuplot> plot [-2:2] [0:5] 4-x*x, x+m
gnuplot> m=2.5
gnuplot> replot
gnuplot> m=3
gnuplot> replot
..etc..
gnuplot> quit
%
The plot command asks gnuplot to plot the two functions given,
and to limit the picture to values of x between -2 and 2 and
values of y between 0 and 5. It is always possible to leave these
range specifiers out or to omit the second of them (the one for
y) in which case gnuplot will choose ranges for you. However,
its choices will turn out to depend on m; you should keep both
range specifiers as shown above, so that you will be looking at
the same region of the plane whatever values of m you happen
to specify. (NB: for PC owners, there is a version of gnuplot
that runs under MS-DOS.)

1.2 About the nervous system

The human nervous system has been intensively studied for a very long time. About
a hundred years ago Camillo Golgi developed a technique for the chemical staining
of nerve fibers using a bichromate of silver, and Santiago Ramon y Cajal was able to
apply it to prove for the first time that nerves had tiny gaps between them. Until
then some had argued that the nervous system consisted of a single continuous but
vastly complicated network. Study techniques have developed enormously since then,
and it is now possible to study the activity of a single nerve cell in the living brain as
that brain undertakes some everyday task. Indeed, it is possible to use special multipronged electrodes to record the simultaneous activity of perhaps 50 closely-sited cells
at a time.
It has been found that all neurons, from the smallest which are about 10 microns ($10^{-5}$ meters) in size to the very largest of about a meter long, are constructed
from the same basic ingredients. There is a central cell body, or soma. Many root-like
extensions, the dendrites, project from it. There is a single, long tubular extension
from the cell body called the axon and this spreads at its far end into a number of
branches. The dendrites receive signals from other cells, the axon sends signals to
other cells. However, there is no direct connection. The tiny gap between the end
of a branch at the end of an axon and a dendrite or cell body of some other cell is
called a synapse. Signals propagate electrically or chemically: Within a neuron signal
propagation is mainly electrical: the interior of the neuron is negatively charged with
respect to exterior. A wave of depolarisation causes the membrane separating the
interior and exterior of the neuron to lose its impermeability briefly, during which
time positive sodium ions can enter. The membrane regains its impermeability and
the cell interior returns to its negative charge during a period of several milliseconds;
thereafter the cell is ready to fire again, but during that period it cannot fire. The
wave of depolarisation travels with a speed of between 0.5 meters/second and 2 meters/second. In fact many such waves make up a single signal, so that really the cell
generates a train of pulses when it fires. The frequency can range from 1 Hertz to
100 Hertz or more. However, the slow speed of the depolarisation wave suggests that
it might take a couple of seconds for information to travel from your brain to your
foot. In fact it happens much faster because the nerves are covered in an insulating sheath called myelin that acts as a form of waveguide; the sheath is periodically
punctuated by the nodes of Ranvier where the cell membrane can become permeable
to ions. Speeds of up to 100 meters/second are thus possible. When the wave reaches
a synapse at the end of the cell axon, minute amounts of neurotransmitter chemical
are released which travel across the gap in perhaps 0.5 milliseconds and stimulate the
dendrites, cell body or muscle fibre at the other side. Enzymes rapidly break up these
chemicals so that the process is swiftly ready to restart. Substances such as curare
or cocaine can inhibit the effect of the neurotransmitter chemicals.
Synapses can be excitatory or inhibitory; the experimental evidence suggests
that all synapses at the end of any particular axon are of the same type; this is called
Dale's law, although it is not properly speaking a law at all. The signals that then
travel in from dendrites to cell body are, in a sense, summed by the cell body but
it forgets over a period of a very few milliseconds unless more stimulation arrives.
If sufficient stimulation arrives the cell fires and passes waves of depolarisation down
its axon. A cell may receive signals from 10,000-80,000 synapses, and transmit to
hundreds of other neurons.
Most of the complexity of the human nervous system resides in the outer heavily
convoluted layer of the brain called the cortex. The total number of neurons can only
be crudely estimated. Counts in tiny regions produce a pretty constant estimate of
around 150,000 neurons per square millimeter of surface of the cortex, with neurons
residing in the outer layer (the grey matter). Since the surface area of the cortex is
around 200,000 square millimeters this suggests a count of perhaps $3 \times 10^{10}$ neurons in a human brain, although some have argued that it may be as high as $10^{11}$. The
complexity of the nervous system is not entirely determined by genetics. There is very
strong evidence that the nervous system develops rapidly before and just after birth,
and then stabilises, and that the development is influenced by environmental factors.
For example, in one disagreeable experiment a newborn kitten was kept blindfolded
for a while after birth; even after the blindfold was later removed it never developed
the ability to perceive many kinds of optical pattern, and laboratory studies suggested
that the relevant neural hardware had not appeared.
Exercise 1.5    You will find that a few books suggest that there could be as many as $10^{14}$ or even $10^{15}$ neurons in the brain. Show that these estimates must be wrong. All the information you need can be found in this section.

Chapter 2
Pattern recognition and statistical tools
Much of the technology of neural nets presented in the three chapters after this one is
devoted to pattern recognition or pattern classification tasks: getting some kind of
neural net to learn to output something about the input, such as which of a finite set
of classes it best represents, or whether it resembles a standard pattern closely. For
example, the input to some (large) net might be a digitised image of a handwritten
digit and there might be ten outputs corresponding to the classes "it's a zero", "it's a one", ..., "it's a nine". It would be very pleasing if it were always the case that nine of
those ten outputs were zero and one was one, but that is a level of performance that
is extremely hard to obtain. In practice, it might be that one of the outputs is a clear
winner in the sense of being close to one while all the others are fairly small; so that
some thresholding device can decide which is the winner. Such a net is performing a
form of classification of its input. Another (large) net might have as input a digitised
image of a face, and a single output which is to be close to one if that face is the
face of a person authorised to enter some secretive establishment, and close to zero
otherwise. Such a net is performing pattern recognition; and obviously there is no
clear distinction between recognition and classification.
The three chapters following this one consider networks which can learn from
examples to do such tasks. Such learning is supervised learning; the net, during
learning, is presented with known instances from which to learn. If it is presented
with all possible instances then it is essentially memorising rather than learning.
Clearly a digit- or face-recognition net cannot be presented with all possible instances
but a simple net whose task was to learn to output the parity of a 16-binary-bit input
could be presented with all cases; such learning is not very interesting, although it
is a good place to start because it is such a well-understood task. Face-recognition
is not well understood; and so it is much harder to know whether a network that
performs well on training and test cases is going to continue to perform well when
presented with a brand-new image.


This chapter is about conventional pattern recognition techniques, and about
statistical tools for analysing patterns. It presents just the rudiments of the art. There
are two reasons for presenting this in these notes. First, there is no point in using
neural net methods for pattern recognition or classification if conventional techniques
already do the task perfectly well; so what do those conventional techniques do well
and what do they do badly? Second, it might be all very well to train a net to
recognise approved faces (if possible, and if conventional techniques don't work well
on this task), but it would be better to then find out just what it was doing. It is
hard to have confidence in some mysterious black box that appears to work well. So
if a neural net can be trained to do the task, then the well-understood conventional
techniques might be applied to information obtained from the internals of the network
in order to find out just what it is that the net has learned to do. For example, in
the digit-recognition task, it might turn out that some part of the network learns to
recognize a closed or nearly closed loop, another part learns to recognise a slightly
curved tail, so that between them these two parts, if both activated, are telling you
it's a six or a nine, which would suggest that a third part is distinguishing between
these cases. This kind of analysis would obviously be extremely useful for the study
of AI, because it would help us to understand something of the internal structure of
at least some kinds of knowledge.

2.1 Clustering techniques

If you look at figure 2.1 you can easily see that the points group naturally into two
clusters. Your eye and brain are brilliant at performing this kind of recognition
naturally.
Figure 2.1: Two clumps

However, consider the same set of points in the form of data:


(26,71) (18,72) (22,76) (23,66)
(38,59) (21,71) (36,49) (18,62)
(25,74) (34,53) (38,52) (40,62)
(36,58) (22,61) (41,56) (37,64)
(21,64)

It is very hard to tell, by looking at this data alone, that there are two clumps present.
Hierarchical clustering algorithms can process this data to group them, as follows:
- Find the closest pair of points (in some sense; see below).
- Form a new set of points by replacing that pair by a single new clump.
- Repeat, until some suitable stopping criterion is met.
This algorithm is underspecified. First, what measure of distance is to be used? The
customary one is the Euclidean distance:
\[ d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i |x_i - y_i|^2} \]
but one could choose any $L_p$ norm:
\[ d_p(\mathbf{x}, \mathbf{y}) = \sqrt[p]{\sum_i |x_i - y_i|^p} \]

For example, L1 is the Manhattan or city-block distance. Second, when two points
are to be merged, what are they to be replaced by? One choice is a new point which is
the average of the two. Another is a genuine clump, whose distance to any other point
or clump is the minimum of the distances between any points involved; or perhaps
the maximum of the distances between any points involved; or perhaps the average
of all the distances involved. Third, when does it all stop? Obviously, it must stop
when there is finally just one clump, but that is the least interesting option. Better
choices might be:
- when there are two clumps;
- or, when the histogram of distances between the points and/or clumps remaining shows clearly separated peaks;
- or, when the width of each clump is significantly smaller than the distance of that clump from its nearest neighbour.
The various choices lead to algorithms that perform somewhat differently in various
cases. Often, the results of such an algorithm can be displayed visually as a dendrogram: a binary tree showing how the clumps are composed of smaller clumps and ultimately points. The tree is binary because the algorithm combines only two clumps
or points at a time. The length of the branches in the tree is used to express an idea of
the distance between clumps or points; the longer are the two branches coming from
a node, the more separated are the two clumps or points at the other ends of the two
branches. Figure 2.2 shows a dendrogram for the points in figure 2.1, as generated
by the UNIX cluster program. The points were labeled p01-p17 in the order in
which they were given in the above data.
Figure 2.2: A dendrogram
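
As an illustrative aside (not part of the original notes), a rough modern equivalent of what the cluster program does can be sketched in Python, assuming the scipy library is available; this uses the single-linkage variant, where the distance between clumps is the minimum over member pairs:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# The 17 points listed above, in the same order.
points = np.array([
    (26, 71), (18, 72), (22, 76), (23, 66),
    (38, 59), (21, 71), (36, 49), (18, 62),
    (25, 74), (34, 53), (38, 52), (40, 62),
    (36, 58), (22, 61), (41, 56), (37, 64),
    (21, 64),
])

# Agglomerative clustering: repeatedly merge the closest pair of clumps.
Z = linkage(points, method='single', metric='euclidean')

# Stop when there are two clumps and report which clump each point is in;
# this should recover the two visually obvious clumps of figure 2.1.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
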


However, all such nearest-neighbour hierarchic cluster techniques can fail in
certain cases. For example, figure 2.3 shows two clumps that are each large, and there
are points in each clump which are much further apart than the clumps themselves.
Figure 2.4 shows two clumps that pose an even worse problem unless one first
transforms the data to polar coordinates with the origin shifted suitably, in which
case a hierarchic clustering algorithm is going to work easily. But how does one decide
that the transformation to polar co-ordinates is the right thing to do? A suitable data
transformation will always work, but usually the choice of transformation depends on
knowing or guessing the answer beforehand!
Figure 2.3: Clumps that fool the nearest-neighbour methods

Figure 2.4: Two clumps, easy to see but awkward for algorithms

Exercise 2.1    Use the cluster program described in an appendix to experiment with hierarchic clustering techniques. You can create suitable data by using the xfig program as follows. First, use xfig to place text strings, each just a + sign, where you like on the drawing area. Then save the drawing to a suitable file, in the default FIG format. This file can then be edited with your
favourite text editor. You will find lines ending in a control-A character, rather like this:

..... 413 279 +^A

which are instructions to xfig to place a text string consisting of a + at x = 413, y = 279. The units are hundredths of an inch. Just use your editor to delete everything but those pairs of co-ordinates; a suitable keyboard macro in any Emacs-like editor can make the process almost painless. Then add a
unique name for each line, e.g. p1, p2 etc (see the description
of the cluster program) to get a file containing lines such as:
413 279 p23

Clearly you could just use text strings such as p23 when using
xfig, but using + to mark points may be easier to look at if
some points are very close together.
Exercise 2.2    Try using cluster -t to create a dendrogram of your test data, and see how small changes to your data can affect the shape of the dendrogram.

There are other hierarchical clustering techniques which are divisive rather than
agglomerative; that is, they start by regarding the set of points as one cluster and then
dividing them, typically by starting with the pair of points furthest apart. However,
such techniques will not be described in these notes.

2.2 Principal components analysis

Given a set of data points, it may be that there is some more concise representation
to be found. For example, eight random points on the unit circle would be described
by sixteen numbers if using Cartesian co-ordinates but only eight numbers (plus
the knowledge that r = 1 in each case) if using polar co-ordinates. However, that
compression can only be done in the light of the fore-knowledge that the points do
lie on the unit circle. If a small amount of noise has been added to each coordinate,
the situation becomes harder to analyse, but it would still be useful to know that
the points lay close to the circle. In many other cases it is unnecessary to find such
a non-linear transformation: some suitable linear transformation is good enough in
some sense. For example, consider these points:
(3,3) (5, 5) (1,1) (2,2)
Clearly they lie on a line; this data is essentially one-dimensional rather than two-dimensional. Consider this data instead, obtained by adjusting the co-ordinates slightly:
(3.1,3.2) (5.1,4.9) (1.2,1.2) (1.8,1.9)
This data is very nearly one-dimensional.
Principal components analysis (PCA) is a technique for reducing the dimensionality of data by finding some linear transformation of the co-ordinate system such
that the variance of the data along some of the new dimensions is suitably small and
so those particular new dimensions can be ignored. It is an extremely useful and (once
understood) simple technique in the study of neural nets and many other domains.

PCA first finds some dimension along which the data varies as much as possible. That is, it seeks a transformation of the axes:
\[ y = \sum_i a_i x_i \]
such that the variance of the original data points in direction y is as large as possible. Since the transformation can be arbitrarily rescaled, it is usual to add the constraint that
\[ \sum_i a_i^2 = 1 \]
Having found such a direction, it then finds another direction, orthogonal to the first, along which the data varies as much as possible. Then it finds a third direction, orthogonal to the first two, ... and so on. The method is as follows. Suppose we have a set of m points, each n-dimensional:
\[ \mathbf{p}_1 = (p_{11}, p_{12}, \ldots, p_{1n}) \]
\[ \mathbf{p}_2 = (p_{21}, p_{22}, \ldots, p_{2n}) \]
\[ \vdots \]
\[ \mathbf{p}_m = (p_{m1}, p_{m2}, \ldots, p_{mn}) \]
First compute the average point by averaging each co-ordinate separately:
\[ \boldsymbol{\mu} = (\mu_1, \mu_2, \ldots, \mu_n) = \left( \frac{p_{11} + \cdots + p_{m1}}{m}, \ldots, \frac{p_{1n} + \cdots + p_{mn}}{m} \right) \]
Now form an $m \times n$ matrix M:
\[ \mathbf{M} = \begin{pmatrix} p_{11} - \mu_1 & \cdots & p_{1n} - \mu_n \\ \vdots & & \vdots \\ p_{m1} - \mu_1 & \cdots & p_{mn} - \mu_n \end{pmatrix} \]
Each row of this matrix is a vector representing the difference between one of the original data points and the mean of them all. The matrix $\mathbf{C} = \mathbf{M}^T\mathbf{M}/(m-1)$ is then the covariance matrix of the original points. It is an $n \times n$ symmetric matrix. The entry in the i-th row and j-th column is:
\[ c_{ij} = \frac{1}{m-1} \sum_{k=1}^{m} (p_{ki} - \mu_i)(p_{kj} - \mu_j) \]
The value $c_{ij}$ is a measure (not the measure) of how well the i-th and j-th co-ordinates of all the points agree with respect to their mean values. The sum gets a positive contribution from the k-th point if both the i-th and the j-th co-ordinates are above or both are below the mean value for that co-ordinate. Observe also that an entry on the diagonal:
\[ c_{ii} = \frac{1}{m-1} \sum_{k=1}^{m} (p_{ki} - \mu_i)^2 \]
represents the variance of the i-th co-ordinate from its mean over all points.

Because this matrix is symmetric (and real-valued) it has n eigenvectors and eigenvalues and the eigenvectors corresponding to different eigenvalues are mutually orthogonal to each other. To see this last point: if
\[ \mathbf{C}\mathbf{x}_a = \lambda_a \mathbf{x}_a \qquad\qquad \mathbf{C}\mathbf{x}_b = \lambda_b \mathbf{x}_b \]
then, pre-multiplying the second equation by $\mathbf{x}_a^T$ gives:
\[ \mathbf{x}_a^T \mathbf{C} \mathbf{x}_b = \lambda_b\, \mathbf{x}_a^T \mathbf{x}_b \]
and since $\mathbf{C} = \mathbf{C}^T$ and $\mathbf{x}_a^T \mathbf{C}^T = \lambda_a \mathbf{x}_a^T$ it follows that:
\[ \lambda_a\, \mathbf{x}_a^T \mathbf{x}_b = \lambda_b\, \mathbf{x}_a^T \mathbf{x}_b \]
If the eigenvalues are distinct then $\mathbf{x}_a^T \mathbf{x}_b = 0$, that is, the eigenvectors are orthogonal.

The eigenvector corresponding to the largest eigenvalue is in fact the direction along which the original points have their largest variance. The eigenvector corresponding to the second largest eigenvalue is the orthogonal direction along which the points have their largest variation; and so on. That is, the eigenvectors taken in the order of size of the eigenvalues are the directions we are looking for. Moreover, the size of an eigenvalue shows the proportionate variation of the original points in the direction of the corresponding eigenvector; thus dimensional reduction can be achieved by ignoring those eigenvectors corresponding to suitably small eigenvalues.
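
As an illustrative aside (not part of the original notes), the recipe above is only a few lines of Python if the numpy library is assumed; the helper function below is just a sketch for experimentation, applied to the nearly one-dimensional data mentioned earlier:

import numpy as np

def pca(points):
    """Eigenvalues (largest first) and matching eigenvectors of the
    covariance matrix, exactly as described above."""
    points = np.asarray(points, dtype=float)
    m = points.shape[0]
    M = points - points.mean(axis=0)          # subtract the average point
    C = M.T @ M / (m - 1)                     # covariance matrix
    evals, evecs = np.linalg.eigh(C)          # eigh handles symmetric matrices
    order = np.argsort(evals)[::-1]           # sort by decreasing eigenvalue
    return evals[order], evecs[:, order]

data = [(3.1, 3.2), (5.1, 4.9), (1.2, 1.2), (1.8, 1.9)]
evals, evecs = pca(data)
print(evals)        # one large eigenvalue, one tiny one
print(evecs[:, 0])  # roughly +-(0.71, 0.71): the direction of the line y = x
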
Exercise 2.3    Use the pca program to experiment with principal components analysis; see the manual page in an appendix of these notes. As in the previous exercise, you can use xfig to help you create a suitable 2-dimensional set of test data points. Pick them to lie close to some sloping line and check that pca finds a vector parallel to that line and another orthogonal to it.

Exercise 2.4    Try forming two or three clusters of points, using xfig, and see whether PCA helps to identify the clusters.

Exercise 2.5    PCA assumes that the transformation of the original data should be a linear one. See how well it performs if the data points form a rough crescent shape or S-shape.

Clearly PCA can also be used as the basis of a clustering technique, by finding
the direction that most spreads out the points and then grouping the points according to how they clump in that direction. Each clump can then be treated separately,
in a recursive way.

Where does all this black magic come from, you might well ask? The theory behind it is not very difficult. Suppose that p is a point, that $\boldsymbol{\mu}$ is the average of all the points and x is a unit vector lying along a line that you would like to be the direction along which the set of points has largest variance. First, observe that $(\mathbf{p} - \boldsymbol{\mu}).\mathbf{x}$ is (because x is a unit vector) the length of the vector $(\mathbf{p} - \boldsymbol{\mu})$ projected onto the direction x. Thus Mx is a column of such lengths, one per point, and $(\mathbf{M}\mathbf{x})^T(\mathbf{M}\mathbf{x}) = \mathbf{x}^T\mathbf{M}^T\mathbf{M}\mathbf{x}$ is the sum of squared lengths by which the points differ from the average point in the direction given by x.

Therefore the direction wanted is the one which maximises this quantity, subject to the condition that x has unit length, that is, subject to $\mathbf{x}^T\mathbf{x} = 1$. This can be tackled directly by the method of Lagrange multipliers outlined in section 1.1.6; just solve:
\[ \frac{d}{d\mathbf{x}} \left( \mathbf{x}^T\mathbf{M}^T\mathbf{M}\mathbf{x} - \lambda(\mathbf{x}^T\mathbf{x} - 1) \right) = 0 \]
which happens to be:
\[ 2(\mathbf{M}^T\mathbf{M}\mathbf{x} - \lambda\mathbf{x}) = 0 \]
that is, the desired x is an eigenvector of $\mathbf{M}^T\mathbf{M}$. Understanding this derivation should help you to remember and understand PCA.

2.3 Canonical discriminant analysis

This section is optional, included purely for those who want to delve a little deeper.
PCA is a fairly simple technique, but useful for the process of dimensional
reduction of a set of data. There is a related technique called canonical discriminant
analysis (CDA) that is useful if a set of data points has already been classified into
groups, but not necessarily in some obviously geometric way. The aim of CDA is to see
whether there is some linear transformation of the data that can bring out a natural
geometric clustering of the data. More precisely, it finds a direction along which the
points within a group are tightly clustered but the groups are well separated.
The method is as follows. Suppose that the points within group g are
\[ \mathbf{p}_1^{(g)} = (p_{11}^{(g)}, p_{12}^{(g)}, \ldots, p_{1n}^{(g)}) \]
\[ \mathbf{p}_2^{(g)} = (p_{21}^{(g)}, p_{22}^{(g)}, \ldots, p_{2n}^{(g)}) \]
\[ \vdots \]
\[ \mathbf{p}_m^{(g)} = (p_{m1}^{(g)}, p_{m2}^{(g)}, \ldots, p_{mn}^{(g)}) \]
Compute the average value $\boldsymbol{\mu}^{(g)}$ of each co-ordinate for the points in this group and, as before, let $\mathbf{M}^{(g)}$ be the $m \times n$ matrix:
\[ \mathbf{M}^{(g)} = \begin{pmatrix} p_{11}^{(g)} - \mu_1^{(g)} & \cdots & p_{1n}^{(g)} - \mu_n^{(g)} \\ \vdots & & \vdots \\ p_{m1}^{(g)} - \mu_1^{(g)} & \cdots & p_{mn}^{(g)} - \mu_n^{(g)} \end{pmatrix} \]
Now let W (the within-groups sum of squares matrix) be the $n \times n$ matrix:
\[ \mathbf{W} = \sum_g (\mathbf{M}^{(g)})^T \mathbf{M}^{(g)} \]
and let H be the $n \times G$ matrix, where G is the number of groups, as follows. If $\boldsymbol{\mu}$ is the average of all the $\boldsymbol{\mu}^{(g)}$, then:
\[ \mathbf{H} = \begin{pmatrix} \mu_1^{(1)} - \mu_1 & \cdots & \mu_1^{(G)} - \mu_1 \\ \vdots & & \vdots \\ \mu_n^{(1)} - \mu_n & \cdots & \mu_n^{(G)} - \mu_n \end{pmatrix} \]
Then B, the between-groups sum of squares matrix, is defined as:
\[ \mathbf{B} = \mathbf{H}\mathbf{H}^T \]
The main step is then to find the eigenvectors of the matrix $\mathbf{W}^{-1}\mathbf{B}$. The eigenvector associated with the largest eigenvalue will be the direction along which the ratio of between-group distances to within-group distances is maximised. It is the direction which represents the best compromise between clustering points within groups tightly and separating groups widely. As with PCA, the eigenvector corresponding to the smallest eigenvalue represents the poorest compromise. However, unlike in PCA, the eigenvectors are not necessarily orthogonal (note that $\mathbf{W}^{-1}\mathbf{B}$ is not necessarily symmetric). Moreover, if one or more points are put into the wrong group the results may well be meaningless.
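
As an illustrative aside (not part of the original notes), the recipe can be sketched directly in Python, assuming numpy; the two made-up groups below are only there to exercise the code:

import numpy as np

def cda_directions(groups):
    """Eigenvectors of W^-1 B for a list of groups of points,
    sorted so that the best discriminating direction comes first."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    group_means = [g.mean(axis=0) for g in groups]
    overall_mean = np.mean(group_means, axis=0)

    # Within-groups sum of squares: W = sum over g of M(g)^T M(g)
    W = sum((g - mu).T @ (g - mu) for g, mu in zip(groups, group_means))
    # Between-groups: B = H H^T, where the columns of H are the group-mean deviations
    H = np.stack([mu - overall_mean for mu in group_means], axis=1)
    B = H @ H.T

    evals, evecs = np.linalg.eig(np.linalg.inv(W) @ B)
    order = np.argsort(evals.real)[::-1]
    return evals.real[order], evecs.real[:, order]

# Two small invented groups whose means differ along a sloping direction.
g1 = [(1, 2), (2, 3), (1, 3), (2, 2)]
g2 = [(3, 0), (4, 1), (3, 1), (4, 0)]
evals, evecs = cda_directions([g1, g2])
print(evecs[:, 0])   # roughly +-(0.71, -0.71), the best discriminating direction
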

Chapter 3
Perceptrons
3.1 Threshold units

The earliest neural network models go back to the 1940s. McCulloch and Pitts
[22] proposed a very crude model of a neuron as a simple thresholding element (or
thresholding unit) with n inputs $x_1 \ldots x_n$, each of which is either 1 or 0. Each
input is modified by a multiplicative weight wi on the arc from the input xi to the
thresholding element itself. In this model, and nearly all subsequent models, it is
supposed that the thresholding element computes the weighted sum of its inputs:
\[ \sum_{i=1}^{n} w_i x_i \]
and gives an output of 0 or 1 according to whether this sum is less than some threshold s or not. That is, the output y is:
\[ y = \theta\left( \sum_{i=1}^{n} w_i x_i - s \right) \]
where
\[ \theta(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases} \]

Rather than complicating the equation by having a distinctive-looking threshold s, it is customary to rename it to be $-w_0$ and to suppose that there is an extra input $x_0 = 1$, so that the above formula can be rewritten as:
\[ y = \theta\left( \sum_{i=0}^{n} w_i x_i \right) \]
(note that the sum now runs from 0 rather than 1). In later chapters, in which
the inputs will not always be just 0 or 1, it would be equally convenient to suppose
that there is some extra input which is permanently -1 rather than +1, so that the
threshold is +w0 .
McCulloch and Pitts showed that any arbitrary logical function can be constructed as some suitable combination of such elements. The proof is straightforward:
the key is to observe that a two-input AND-gate and a one-input NOT-gate can each
be represented by a single such element, using some appropriate weights, and that
any other basic logical function such as XOR or NAND can be expressed in terms
of AND- and NOT-gates.
Exercise 3.1    Find suitable weights that would make such a threshold element behave like an AND-gate. Find suitable weights for a NOT-gate.

Exercise 3.2    Imagine that such a threshold element has n inputs. Find a set of weights such that the element will output 0 if and only if the inputs are symmetric about the middle; that is, if and only if $x_i = x_{n+1-i}$ for $i = 1 \ldots n$.

Exercise 3.3    In each case, can the weights be varied much, or at all, without changing the behaviour?

However, McCulloch and Pitts did not address two very important questions:
- How do interconnections between these simplistic model neurons form? For the sake of simplicity you could suppose that everything is connected to everything else but that most such connections will have an associated weight of 0 (equivalent to the absence of a connection). The question is thus how the weights might be adjusted by some learning process in order to get the network to perform some specific input-output mapping.
- Biological neural nets can be remarkably error-tolerant. The simplified models seem to require that the elements all perform reliably. How can error-tolerant performance be produced?
An attempt at an answer to the first question was proposed by Donald Hebb
in 1949 [8]. The idea was roughly as follows: if we want the output of an element to
be 1 then we ought to increase any wi for which xi = 1, because this will increase
the weighted sum. On the other hand if we want the output to be 0 we ought to
decrease any wi for which xi = 1; this will decrease the weighted sum. Thus if we
want to train a 4-input element to behave as shown in table 3.1, we might adjust
the weights according to (a) first, by increasing weights w1 , w3 and w4 a bit; and
then adjust the weights according to (b), by decreasing weights w2 and w4 a bit; and
then repeating this until perfect performance is (hopefully) obtained. This process
of nudging the weights a little for each input/output pattern in turn is called batch update.

        Input   Output
  (a)   1011    1
  (b)   0101    0

  Table 3.1: Simple training data

Alternatively one might try concentrating entirely on (a) until the weights
are suitable for it; and then concentrating entirely on (b) until that is learned, even
though this may have wrecked the element's performance of (a); and then returning to
concentrate on (a) again; and so on, until (hopefully) both are learned. This method is
rarer; you might suppose that the network would forget all about (a) when learning
about (b), because the knowledge of (a) is just the initial configuration of weights
when it comes to learning about (b) and it is not clear that anything of the initial
configuration would necessarily be retained.
The process outlined here is actually not quite what Hebb suggested; he proposed only ever increasing weights, by small amounts each time. The above (batch update) process can be more formally expressed as the following algorithm. Let the weights at cycle t be $w_i(t)$. Suppose there are P input/output patterns to be learned: for the p-th pattern the input will be $\{x_1^{(p)}, \ldots, x_n^{(p)}\}$ and the desired output will be $d^{(p)}$. Let a(t) be the actual output at cycle t. Let $\eta$ be some small constant, the amount by which the weights are to be nudged; traditionally it is called the learning rate. Then:

(1) Initialise the weights $w_i(0)$, $(0 \leq i \leq n)$ to small random values. (Recall that $w_0$ is the threshold and $x_0$ is always 1).

(2) Present the next pattern $\{x_1^{(p)}, \ldots, x_n^{(p)}\}$.

(3) Calculate
\[ a(t) = \theta\left( \sum_{i=0}^{n} w_i(t)\, x_i^{(p)} \right) \]

(4) Adjust the weights:
\[ w_i(t+1) = w_i(t) + \eta\, x_i^{(p)} \left( d^{(p)} - a(t) \right) \]

(5) Move on to the next pattern, go to 2.

The process stops if all patterns have been learned. However, it is not yet clear that
the patterns can be learned at all!
The above formulation of the algorithm uses the factor $(d^{(p)} - a(t))$ in the weight update. If a(t) and $d^{(p)}$ are each just 0 or 1 then this quantity is 0 if the actual output is the desired one (so that the weights don't change); otherwise it is $\pm 1$. Remember that $x_i^{(p)}$ is 0 or 1, so that the weights will change by $\pm\eta$.
Exactly the same formulation can be used if we are using some other function in place of the threshold function $\theta(z)$, in which case the factor $(d^{(p)} - a(t))$ represents
the error between the desired and actual outputs. If the error is large, it makes sense
to change the weights by a large amount; if small, it should be changed by a small
amount; just as this formulation suggests. This trivially generalised version of the
weight update rule is called the Widrow-Hoff rule ([31]).
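
As an illustrative aside (this sketch is not part of the original notes; it assumes plain Python, and the function name train_threshold_unit is invented here purely for the example), steps (1)-(5) for a single threshold unit can be written directly:

import random

def train_threshold_unit(patterns, eta=0.1, max_cycles=1000, seed=0):
    """Steps (1)-(5) above for a single threshold unit.
    `patterns` is a list of (inputs, desired) pairs; inputs are 0/1 tuples."""
    rng = random.Random(seed)
    n = len(patterns[0][0])
    w = [rng.uniform(-0.5, 0.5) for _ in range(n + 1)]   # w[0] plays the role of w0

    for _ in range(max_cycles):
        all_correct = True
        for inputs, desired in patterns:
            x = (1,) + tuple(inputs)                      # x0 is always 1
            a = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
            if a != desired:
                all_correct = False
                w = [wi + eta * xi * (desired - a) for wi, xi in zip(w, x)]
        if all_correct:
            return w                                      # every pattern learned
    return None                                           # did not settle in time

# The two patterns of table 3.1.
print(train_threshold_unit([((1, 0, 1, 1), 1), ((0, 1, 0, 1), 0)]))
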

3.2 Perceptrons and learning

The above algorithm is for one single element, but clearly if there are many elements
the algorithm can just be applied to each provided that the desired input/output
patterns for each are known. The most-studied kind of assemblage is a layered, feedforward network: the inputs all connect to a set of elements which form the first layer,
the outputs of the elements in that first layer form the inputs for the elements in the
next layer, and so on. Such an assemblage of elements is called a perceptron, a term
coined by Frank Rosenblatt who carried out many investigations of the properties of
such assemblages in the early 1960s (see e.g. [26]).
Several questions naturally arise at this point:
- does this algorithm work?
- how big should $\eta$ be? If it is tiny it will take a great deal of nudging to get the weights collectively to where they ought to be. If it is too large then maybe the weights will be nudged past where they should be; and would that matter?
- what can a perceptron learn? Can it learn an arbitrary set of input/output pairs, provided that there are enough elements?
Let us start by confining our attention to a single-layer network. This is simplest. In a multi-layer network we will know what the inputs to the first layer and
outputs from the last layer are supposed to be, but we won't know what the outputs
from the first or any other intermediate layer should be, so we cannot directly consider
the above algorithm for such a network.

3.3 The single-layer perceptron

Let us look again at the problem of trying to get a single n-input threshold element to learn a set of input/output patterns. Obviously the set must be consistent; the element cannot learn to produce both a 0 and a 1 for the same input pattern. For an input $x^{(p)} = \{x_1^{(p)}, \ldots, x_n^{(p)}\}$ the output will be 1 if
$$\sum_{i=0}^{n} w_i x_i^{(p)} > 0$$


In terms of vector geometry this means that the point x(p) must lie anywhere in
that half-space bounded by a plane through the origin normal to the vector w and
containing w.
All the inputs x(p) whose output is to be 1 must lie in this half-space, and all
the inputs x(p) whose output is to be 0 must lie in the other half-space. Thus it
must be possible to find a plane which separates all the 1-points from the 0-points.
Any vector normal to this plane (that is, at a right angle to it) which points into the
1-points half-space will then define a suitable w. See figure 3.1.

[Figure 3.1: A separating plane (in this case, just a line): the 1-points lie on one side of the line and the 0-points on the other.]

However, it is easy to think up examples for which no such plane exists, even
when each co-ordinate of each x(p) vector is restricted to be 0 or 1. For instance, figure
3.2 shows the XOR function in two dimensions: the output value is the exclusive-or
of x1 and x2 . There is no plane which can separate the 0-points from the 1-points.
Thus this simple function cannot be represented or learned by a single threshold unit.
Functions for which there does exist some plane which separates the 0-points
from the 1-points are called linearly separable functions. If you think about the
geometric interpretation of the learning algorithm, it should be reasonably intuitive
that it will work for any linearly separable function. For suppose we have some
unsuitable value of the weight vector w which causes some x(p) input to give the
wrong result. The algorithm changes w to be w + x(p) if x(p) is producing a 0 but
should be producing a 1. This change causes the plane to swing towards x(p) . If
x(p) is producing a 1 but should be producing a 0, the algorithm changes w to be
w x(p) : the plane swings away from it.


[Figure 3.2: Exclusive OR (XOR): the points (0,1) and (1,0) have output 1, while (0,0) and (1,1) have output 0; no single line separates them.]

In fact the learning algorithm is guaranteed to find a suitable weight vector w in a finite time if the set of input/output pairs is linearly separable; that is, if there is such a w to be found at all. Here is a proof.

3.4 The Perceptron Convergence Theorem

The Perceptron Convergence Theorem states that the learning algorithm must terminate and find a suitable weight vector, if there is a suitable weight vector to be found
at all.
First, we can simplify the problem a little. Suppose there is some input vector
$$x^{(z)} = \{x_1^{(z)}, \ldots, x_n^{(z)}\}$$
which is supposed to produce an output of 0. Thus we want:
$$\sum_{i=0}^{n} w_i x_i^{(z)} < 0$$
This is the same as wanting:
$$\sum_{i=0}^{n} w_i \big(-x_i^{(z)}\big) > 0$$
that is, $-x^{(z)}$ must output a 1. We can therefore modify the set of input/output
pairs to produce an equivalent set, for which the same vector w would be a solution,
but now every input vector is to produce an output of 1. (Note carefully: the argument above is not quite correct, since you are not free to change the sign of $x_0$; it has to be the same value for every input/output pair. However you can repair it easily, for example by having two threshold inputs with values 1 and $-1$ instead of just a single one. So for the sake of simplicity let's just suppose that you can mess about with the sign of $x_0$.)
By adjusting the set of input/output pairs like this, we can therefore suppose
that we are looking for a vector w which defines a plane bounding a half-space that
contains all the input vectors x(p) . Now that every input vector is to produce an
output of 1, the algorithm can be expressed more concisely as:
(1) Choose an initial random w.
(2) For any input vector $x^{(p)}$: if $w \cdot x^{(p)} > 0$ then GOTO 2, else GOTO 3.
(3) Replace w by $w + x^{(p)}$ and GOTO 2.
Figure 3.3 illustrates the idea.

[Figure 3.3: A step on the road to finding a weight vector: replacing w by the new w brings a wrongly classified point onto the correct side of the line.]


The theorem claims that step 3 will only be done a finite number of times. The proof finds an expression for the maximum number of times that it can be done, as follows.

We are given that there is a vector $w^*$, which we can take to be a unit vector, such that
$$w^* \cdot x^{(p)} > \delta > 0 \quad \text{for all the } x^{(p)}$$
where $\delta$ is some small positive constant. Since it is only the direction of $w^*$ that really matters (it defines a plane through the origin normal to it, and that plane is to be the boundary of the half-space), we can suppose that
$$|w^*| = 1$$


Now let us consider the function:
$$g(w) = \frac{w^* \cdot w}{|w|}$$
Because $|w^*| = 1$ it follows that
$$g(w) \le 1$$
When we apply the above algorithm we start by choosing a random initial weight vector; call it $w_0$. After $t$ steps, call it $w_t$. Consider how the numerator of $g(w_t)$ is changed by step 3 of the algorithm:
$$w^* \cdot w_{t+1} = w^* \cdot (w_t + x^{(p)}) = w^* \cdot w_t + w^* \cdot x^{(p)} \ge w^* \cdot w_t + \delta$$
Therefore, after $m$ applications of step 3 of the algorithm,
$$w^* \cdot w_m \ge m\delta + w^* \cdot w_0$$
Now consider how the square of the denominator of $g(w_t)$ is affected by step 3 of the algorithm. It is easier to work with the square of the denominator because:
$$|w_{t+1}|^2 = w_{t+1} \cdot w_{t+1} = (w_t + x^{(p)}) \cdot (w_t + x^{(p)}) = |w_t|^2 + 2\, w_t \cdot x^{(p)} + |x^{(p)}|^2$$
But the algorithm only got into step 3 because $w_t \cdot x^{(p)} \le 0$. If we define $X$ to be the size of the largest of the input vectors:
$$X = \max_p |x^{(p)}|$$
then we can say that:
$$|w_{t+1}|^2 \le |w_t|^2 + X^2$$
So, after $m$ applications of step 3 of the algorithm:
$$|w_m|^2 \le |w_0|^2 + mX^2 \le (|w_0| + \sqrt{m}\,X)^2$$
and so:
$$|w_m| \le |w_0| + \sqrt{m}\,X$$
Combining these inequalities for numerator and denominator, we get:
$$g(w_m) = \frac{w^* \cdot w_m}{|w_m|} \ge \frac{m\delta + w^* \cdot w_0}{|w_0| + \sqrt{m}\,X} \ge \frac{m\delta - |w_0|}{|w_0| + \sqrt{m}\,X}$$


Since we also know that $g(w_m) \le 1$ it is clear that $m$ cannot be arbitrarily large, and a little basic algebra would even let you find an explicit upper bound for $m$. Thus the algorithm can only go to step 3 a finite number of times, and the proof is therefore complete.

Notice that nowhere in this proof is it required that the input vectors $x^{(p)}$ are to have components which are just 0 or 1. The main requirements are that there is a solution to be found (so that $w^*$ exists) and that there are a finite number of input vectors. The proof suggests that $\delta$ should be large and $|w_0|$ should be small ($\delta$ is fixed by the choice of $w^*$ and $X$ by the set of inputs). However, this is only suggestive, since it is not clear that the bound is ever a precise one.
Exercise 3.4

Imagine that you had to invent such a proof for yourself. Explain why it would make sense to be considering a function such
as g(w) in the first place.

Exercise 3.5

Because the components of the input vectors do not have to be just 0 or 1, it seems that there could be an infinite number of input vectors. In such a case, would the proof still be valid provided that $X = \max_p |x^{(p)}|$ existed and was finite?

Exercise 3.6

Suppose that you have a $2^n$-input single-layer perceptron composed of $n$ threshold units. Find a set of weights which will make the perceptron output the binary encoding of $k$ when only the $k$-th input is 1 and all the rest are 0. How much can the weights vary? What happens if more than one input is 1 at the same time?

3.5 Multiple outputs

The discussion above considers a single-layer, single-output network; such a network


contains a single threshold unit. If you have a problem with multiple outputs you can
simply regard it as a superposition of completely separate networks, one per output,
each with the same set of inputs but a separate single output.

3.6 The two-layer perceptron

Let us briefly consider a two-layer feed-forward network composed of threshold units. Such a system can represent the two-input XOR function, unlike the single-layer perceptron. For example, if there is one output unit and two units in the middle layer, then one of the middle-layer units can represent the OR of the inputs, the other can represent the AND of the inputs, and the output unit can thus compute the OR-but-not-AND of the inputs, that is, the XOR of the inputs. Figure 3.4 shows this network (without weight values) and another network (which is not layered, strictly speaking) which could also be used to represent XOR.

[Figure 3.4: Two nets that can represent XOR]

It was mentioned earlier that the learning
algorithm given above cannot be used for a two-layer or multi-layer perceptron which
uses threshold units, because the inputs to the second layer are the outputs from the
first layer, but we don't know what they should be. For example, in the two-layer XOR network we want the behaviour shown in table 3.2. However, we don't know what each of a-h could or should be.

Input | Layer 1 output | Layer 2 input | Output
 00   |      a b       |      a b      |   0
 01   |      c d       |      c d      |   1
 10   |      e f       |      e f      |   1
 11   |      g h       |      g h      |   0

Table 3.2: I/O for a simple two-layer XOR network

We could perhaps make each of them be 0 or 1 at random, thus producing an explicit input/output mapping for each layer so that the learning algorithm might be used; but we couldn't be sure that each mapping was linearly separable. More generally, the trouble is that the units are just threshold
units, so we have no information about how to adjust the input weights in order to
correct output errors; the output is not a continuous function of the weights, and
some tiny weight adjustment in the first layer may tip the weighted sum in a unit
in the middle layer past its threshold and so cause a major difference to the output
of the second layer. In general there is no known algorithm for training an arbitrary
multi-layer perceptron comprised of threshold units.
Before we leave the topic of threshold units, two other questions naturally arise:

- What functions can a two-layer (or a multi-layer) perceptron possibly represent?
- Each unit can learn to output 0 or 1 for various input vectors by using the perceptron learning algorithm, but this input/output mapping needs to be known in advance because the algorithm depends on it. The proof shows that only linearly separable mappings can be learned. Are linearly separable mappings common or rare, among all possible mappings?

3.7 What can be represented?

The general question of what functions a multi-layer perceptron composed of threshold units can represent is too hard to attack head-on. It is best to consider some
specific case first. In particular we shall consider a network with two real-valued
inputs, three threshold units in the first layer and one threshold unit in the second
(output) layer. The commonly-used shorthand for such a configuration is a 2-3-1 network. For geometric convenience, call the two inputs x and y.
Each of the three threshold units in the first layer outputs a 0 or 1 according
to whether the point (x, y) lies on one side or the other of some line in the plane; call
them lines L1 , L2 and L3 . Each specific line is determined by the three weights (the
two inputs plus the threshold weight) to that unit. These three units each contribute
an input to the single output unit, which itself outputs a 0 or 1 according to whether
its three inputs specify some point in 3-dimensional space lying on one side or the
other of some plane. The specific plane is determined by the four weights (three on
the arcs coming from the first layer, plus the threshold weight) to that unit.
The three inputs to the single output unit, being each 0 or 1, form the coordinates of some point in 3-dimensional space. However, it is not an arbitrary point.
The only possible points are the corners of the unit cube: (0, 0, 0) or (0, 0, 1) or (1, 0, 1)
and so on, eight possibilities in all. So the output unit just has to separate some of
these points from the others. Now a (1, 0, 1) input to the output unit means that the
(x, y) input point lies on the 1-side of L1 , on the 0-side of L2 and the 1-side of L3 .
It is often stated INCORRECTLY in books on connectionism or neural networks that this means that the point (x, y) must lie in some convex region of the
plane, defined as the intersection of certain half-planes bounded by the lines, or in
the complement of such a region. The rationale for this claim is as follows. The
output unit is separating certain corners of the unit cube from the others. The corners that lie on one side of this plane are connected by edges of the unit cube; for
example, if (1, 0, 1) and (0, 1, 1) lie on the same side of the plane then either (0, 0, 1) or (1, 1, 1) must lie on the same side too, or else we'd have an analogue of the XOR problem in the z = 1 surface. If, say, it is (1, 1, 1) that lies on the same side as
(1, 0, 1) and (0, 1, 1) and all the other points happen to lie on the other side, then the
corresponding (x, y) inputs must lie on the 1-side of L3 intersected with the region
defined as those points not on the 0-side of L1 and the 0-side of L2 . This defines a
convex region, as does the intersection of any finite number of half-spaces, or indeed
the intersection of any finite number of other convex regions.
However, the claim that the (x, y) points must all lie in some convex region
(or its complement) is wrong. This is because not all the points of the unit cube are
possible as inputs to the output unit. You know that three linear expressions in two
variables x and y must be linearly dependent, so the outputs of the three units in
the first layer cannot be independent. In fact the three lines L1 , L2 and L3 divide
the (x, y) plane into at most 7 distinct regions, so at least one corner of the unit
cube must be unattainable as a possible input to the output unit. Thus it might,


for example, be possible to have (0, 1, 1) and (1, 0, 1) be the only two corners on one
side of its plane if it happened to be the (1, 1, 1) corner that was unattainable! And
this corresponds to a non-convex region of the (x, y) plane, whose complement is also
non-convex.
If you have followed this argument carefully you should see that the convexity
argument does apply provided that the second layer of units is no larger than the first
layer, and the first layer of units is no larger than the number of inputs. It can be
generalised to networks other than the specific 231 network considered here.
Exercise 3.7

Show by induction that in two dimensions, n lines can divide the plane into at most $(n^2 + n + 2)/2$ regions. (This number is $2 + 2 + 3 + 4 + \cdots + n$.)

Exercise 3.8

Extend the analysis informally to consider what functions a


three-layer network of threshold units could possibly represent.

3.8 Are linearly separable functions common or rare?

In this section we consider the following problem: given a set of $P$ points in $n$-dimensional space, each of which is to be labelled either 0 or 1, how many of those $2^P$ labellings are linearly separable? That is, for how many of them can we find some $(n-1)$-dimensional hyper-plane which separates the 1-points from the 0-points? Call this number $c(P, n)$. First, we will require that every subset of $n$ or fewer of the $P$ points be linearly independent; for a truly random set of $P$ points this will be true except for infinitely unlikely coincidences. Now $c(P, n)$ can be determined by induction as follows.

Add a $(P+1)$-th point. For some of the $c(P, n)$ linearly separable labellings of the original $P$ points, a separating plane can be chosen so that it passes through the new point, and it can then be nudged infinitesimally to either side, allowing the new point to be labelled either way and still give a linearly separable labelling; so there are two labellings for each such labelling of the original $P$ points. If the separating plane could NOT be made to pass through the new point, for example because the new point lay in the centre of some clump of identically-labelled points, then there will be only one way of labelling it. So:
$$c(P+1, n) = c(P, n) + Q$$
where $Q$ is the number of the original $c(P, n)$ labellings for which the separating plane could be made to pass through the new point. However, this is just $c(P, n-1)$ because in this case the problem reduces to the $(n-1)$-dimensional case by projecting every point along some direction parallel to the separating plane, down onto any $(n-1)$-dimensional plane perpendicular to the separating plane, as illustrated in figure 3.5. The lines of projection are parallel to the separating plane, so the projected points are still separated. This produces a linearly separable labelling of $P$ points in that $(n-1)$-dimensional plane.

[Figure 3.5: Projecting down to the (n-1)-dimensional case: projection parallel to the separating line in two dimensions gives a linearly separable labelling, with a separating point, in one dimension.]
So:
$$c(P+1, n) = c(P, n) + c(P, n-1)$$
Note that for all $n$,
$$c(1, n) = 2$$
so we can solve the recurrence relation numerically at least. A bit of fancy maths, beyond the level of this course, can be applied to show that
$$c(P, n) = 2 \sum_{i=0}^{n-1} \binom{P-1}{i}$$
which holds for all $P$ and $n$ provided that we use the usual convention that any binomial coefficient $\binom{j}{k} = 0$ if $k > j$. In particular it can be shown that $c(2n, n) = 2^{2n-1}$, so that just half the labellings are linearly separable if $P = 2n$; and that $c(P, n) = 2^P$ if $P \le n$, but $c(P, n) < 2^P$ if $P > n$; and that $c(P, n)/2^P \to 0$ rapidly as $P$ grows significantly larger than $2n$.
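These counts are easy to check numerically. The short Python script below (an illustrative sketch, not from these notes, and one which partly anticipates Exercise 3.10) evaluates c(P, n) from both the recurrence and the closed form, and prints the fraction of labellings that are linearly separable:

from math import comb
from functools import lru_cache

def c_closed(P, n):
    """Closed form: c(P, n) = 2 * sum_{i=0}^{n-1} C(P-1, i)."""
    return 2 * sum(comb(P - 1, i) for i in range(n))

@lru_cache(maxsize=None)
def c_rec(P, n):
    """Recurrence c(P+1, n) = c(P, n) + c(P, n-1), with c(1, n) = 2 for n >= 1
    and c(P, 0) = 0 taken as the base case in zero dimensions."""
    if n == 0:
        return 0
    if P == 1:
        return 2
    return c_rec(P - 1, n) + c_rec(P - 1, n - 1)

n = 5
for P in (n, 2 * n, 4 * n, 8 * n):
    assert c_rec(P, n) == c_closed(P, n)
    print(P, n, c_closed(P, n) / 2 ** P)   # fraction of labellings that are separable

For P = 2n the printed fraction is exactly 0.5, as claimed above, and it falls away rapidly for larger P.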
Exercise 3.9

Some people prefer to use bipolar threshold units, which output +1 or $-1$ rather than 1 or 0. How does this affect anything discussed in this chapter?


Exercise 3.10 Calculate $c(P, n)$ for a range of $P$ and $n$ and then plot $c(P, n)/2^P$ against $P/n$. The former quantity always lies in the range 0 to 1, of course.

Exercise 3.11 In the light of this result about linear separability, is it a good idea to solve the training problem for multi-layer perceptrons with threshold units by randomly assigning targets for layers other than the output layer?
In the next chapter we switch from considering threshold units to considering
units whose output is a smooth function of the weighted sum of inputs.

Chapter 4

Backpropagation

4.1 Smooth activation

It was argued in the last chapter that multi-layer perceptron networks using threshold
units may be able to represent a wide variety of functions, but there is a major problem
in getting them to learn such a function. The difficulty is that the output is not a
continuous function of the weights, and certainly not a well-behaved differentiable
function of the weights, so very small weight changes could result in major changes
at the output.
The obvious remedy is to make the output be a continuous differentiable function of the weights somehow, by replacing the threshold function $\Theta(z)$ with something else. But what? Ideally we are looking for some activation function g(z) (also known as a transfer function) to apply to the weighted sum of inputs that has the following properties:

- It should be continuous and differentiable everywhere, so that we can easily calculate the effect of a very small change to any weight.
- It should have a bounded range of values. The function should not be able to output arbitrarily large positive or negative values, for at least two reasons. First, real neurons don't exhibit extreme values of performance such as impossibly high (or low) spiking frequencies, so the model should be able to survive without such a feature. Second, if some $x_i$ in $\sum_i w_i x_i$ is huge then the corresponding $w_i$ might need to be very small to counteract its effect, and so we'd need great delicacy in adjusting the $w_i$ (this is sometimes referred to as the gain problem in the literature).
- It would be useful if the function were monotonic, so that if some desired value of g(z) is to be attained there is no doubt about what value of z should be aimed for.
- In order to cut down on calculation, it would be nice to use an activation function whose derivative was fairly easy to calculate.

(Chapter written by Peter Ross, Dec 92; revised by him, Dec 93, Dec 94, Sep 95.)
The function which is usually chosen is called the sigmoid function, known in physics as the Fermi function and sometimes called the squashing function:
$$g(z) = \frac{1}{1 + e^{-Dz}} \qquad (4.1)$$
where D is some convenient constant, often just 1, and usually termed the sharpness of the function, because the larger you make this constant the sharper and more like the threshold function this g(z) becomes. This function has all the desirable properties:

- $g(z) \to 1$ as $z \to \infty$
- $g(z) \to 0$ as $z \to -\infty$
- $g(z) + g(-z) = 1$ for all $z$
- it is monotonic, and increasing
- $\dfrac{dg(z)}{dz} = D\,g(z)\,(1 - g(z))$
It is illustrated in figure 4.1.

[Figure 4.1: The sigmoid function (for D = 1)]

An alternative which is sometimes used is the hyperbolic tangent:
$$\tanh(Kz) = \frac{e^{Kz} - e^{-Kz}}{e^{Kz} + e^{-Kz}} = \frac{1 - e^{-2Kz}}{1 + e^{-2Kz}} = 2g(z) - 1 \quad \text{if } D = 2K$$
which is just the sigmoid function rescaled and slid downward by 1. Its output lies between +1 and $-1$ rather than 1 and 0.
So, let us now suppose that a unit computes the weighted sum of its inputs first. Since we are going to be considering learning in multi-layer networks it will be more convenient to refer to individual weights as $w_{ij}$, meaning the weight on the arc to unit $i$ from unit $j$. Unit $i$ then outputs $a_i$, the sigmoid of this weighted sum:
$$a_i = g\Big(\sum_{j=0}^{n} w_{ij}\, a_j\Big) \qquad (4.2)$$

4.2 The error function

We need some measure of how well a feed-forward network is performing before we can try to get it to learn by adjusting weights. Let us initially confine our attention to a single input/output pair of the training data. If $d_j$ is the desired output of unit $j$ and (as above) $a_j$ is its actual output, then define
$$E = \frac{1}{2} \sum_j (d_j - a_j)^2 \qquad (4.3)$$
where the sum is taken over all the output units (the units in the final layer). E is a function of all the weights, and note that the $d_j$ are constants.

If you were able to plot a graph of E against all the weights in the network, in some high-dimensional space (one dimension per weight) called the weight space, you would be looking at some very elaborate surface. The aim of learning would be to head towards the absolute minimum of this surface, where E is as small as possible and perhaps even 0. Being a sum of squares it cannot be negative. Notice, however, that if some $d_j$ is exactly 0 or 1 then the ideal of zero error can never be attained in practice, because the function g(z) is not 0 or 1 for any finite value of z. In practice, therefore, learning stops when E becomes smaller than some suitably small value, or perhaps when all $a_j$ are within some suitably small distance of the corresponding $d_j$.

In the absence of any information at all about the general shape or other characteristics of this error surface, it seems that a gradient descent process (hill-descending rather than hill-climbing) would be the best we could do to try to find that absolute minimum. The search should start from some random point and keep heading as steeply downhill as possible. In the next section the details of this gradient descent process are worked out.
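For a single training pair, equation 4.3 is only a couple of lines of code (an illustrative sketch):

import numpy as np

def quadratic_error(desired, actual):
    """E = 0.5 * sum_j (d_j - a_j)^2 over the output units, for one pattern."""
    desired = np.asarray(desired, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return 0.5 * np.sum((desired - actual) ** 2)

print(quadratic_error([0, 1], [0.52, 0.54]))   # e.g. a net with two output units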

4.3 The maths of gradient descent in weight space

At any point in weight space the slope of the surface of E is going to be specified by the partial derivatives of E with respect to each weight $w_{mi}$. By definition, $\partial E / \partial w_{mi}$ will be the slope of E in the direction parallel to the $w_{mi}$ axis. Thus a step downward in the direction of steepest descent can be taken by:
$$w_{mi} \mapsto w_{mi} - \eta \frac{\partial E}{\partial w_{mi}}$$
where $\eta$ is (again) a small constant controlling the step size and called the learning rate. Thus we need to calculate these partial derivatives somehow.

Let $s_j$ be the weighted sum of inputs to unit $j$, so that:
$$s_j = \sum_k w_{jk}\, a_k \qquad (4.4)$$
and:
$$a_j = g(s_j) \qquad (4.5)$$
Now $w_{mi}$ is the weight to unit $m$ from $i$ and so it appears only in the expression for $s_m$ and not in the expression for $s_q$ for any other $q$ than $m$. Therefore by the chain rule for differentiation:
$$\frac{\partial E}{\partial w_{mi}} = \frac{\partial E}{\partial s_m}\,\frac{\partial s_m}{\partial w_{mi}} \qquad (4.6)$$
From equation 4.4, we know that:
$$\frac{\partial s_m}{\partial w_{mi}} = a_i \qquad (4.7)$$
What about $\partial E / \partial s_m$? Clearly $s_m$ as an explicit quantity only appears in the expression for $a_m$ (see equation 4.5), and so by the chain rule again we can say that:
$$\frac{\partial E}{\partial s_m} = \frac{\partial E}{\partial a_m}\,\frac{\partial a_m}{\partial s_m} \qquad (4.8)$$
so that
$$\frac{\partial E}{\partial w_{mi}} = \frac{\partial E}{\partial a_m}\,\frac{\partial a_m}{\partial s_m}\, a_i \qquad (4.9)$$
Now there are two cases to consider, depending on whether $a_m$ appears explicitly in the expression for E (which would happen if the unit $m$ was one of the output units) or only implicitly (which would be the case if the unit $m$ was not one of the output units).

Case 1: the output units

From equation 4.3 it follows that:
$$\frac{\partial E}{\partial a_m} = -(d_m - a_m) \qquad (4.10)$$


and from equation 4.5 it follows that:
$$\frac{\partial a_m}{\partial s_m} = g'(s_m) \qquad (4.11)$$
(the dash denoting the derivative of g() as is customary). Since
$$g'(z) = D\,g(z)\,(1 - g(z))$$
it follows from equations 4.11 and 4.5 that:
$$\frac{\partial a_m}{\partial s_m} = D\,a_m(1 - a_m)$$
Therefore, for an output unit, using equations 4.6-4.11:
$$\frac{\partial E}{\partial w_{mi}} = -(d_m - a_m)\,g'(s_m)\,a_i \qquad (4.12)$$
$$\phantom{\frac{\partial E}{\partial w_{mi}}} = -D\,(d_m - a_m)\,a_m(1 - a_m)\,a_i \qquad (4.13)$$

Case 2: units other than output units

In this case $a_m$ appears in the expressions for certain $s_k$; in fact, for just those units $k$ to which the non-output unit $m$ sends its output. So we can look at E as a function of those $s_k$ and use the chain rule to get:
$$\frac{\partial E}{\partial a_m} = \sum_k \frac{\partial E}{\partial s_k}\,\frac{\partial s_k}{\partial a_m} \qquad (4.14)$$
$$\phantom{\frac{\partial E}{\partial a_m}} = \sum_k \frac{\partial E}{\partial s_k}\, w_{km} \qquad (4.15)$$
using equation 4.4. But this sum is over those units $k$ in the layer just after unit $m$ to which $m$ sends its output. For example, if unit $m$ is in the second-last layer, sending its output as input to various output units, then the sum is over output units and equations 4.8-4.11 enable us to calculate $\partial E / \partial s_k$ for all such units $k$.

So by using equation 4.15 we can calculate $\partial E / \partial a_m$ for all units in the second-last layer, which is the last hidden layer. Therefore by using equations 4.8, 4.11 and 4.15 we can calculate $\partial E / \partial s_m$ for all units $m$ in the last hidden layer. Then, we can use this information and equation 4.15 to calculate $\partial E / \partial a_m$ for all units $m$ in the second-last hidden layer (that is, the third-last layer), and so by using equations 4.8 and 4.11 we can determine $\partial E / \partial s_m$ for these units $m$ too. Clearly therefore we can find $\partial E / \partial s_m$ for every unit in the network and, by using equations 4.6 and 4.7, we can calculate $\partial E / \partial w_{mi}$ for every weight.

This process is called back-propagation for reasons that should be obvious. It was originally discussed in [30] and was rediscovered in [27]. Notice that, in equation 4.15, the $w_{km}$ are the weights on the arcs going forward from unit $m$ to later units $k$ in the layered network. Equation 4.15 bears an interesting resemblance to equation 4.4.
Remember that this whole derivation has been for one single input/output pair
of the training data. The process can be repeated for each input/output pair in turn,
thus adjusting the weights to reduce the error for that pair alone before moving on to
the next pair. Sometimes, rather than cycling through the set of input/output pairs
adjusting weights each time, an implementor will choose to make a random selection
of input/output pair instead.
Or, we could use a modified error function which was the sum of the errors E
for every input/output pair in the training data, and so adjust the weights to reduce
the total error on the whole training set in each cycle. However, this often results in
slower convergence in practice and so it is less often used than weight-update-per-pair.

4.4 The back-propagation algorithm

To summarise: for each input/output training pair in turn, do:

(1) calculate the actual output of each output unit of the net;
(2) use equation 4.12 to calculate $\partial E / \partial w_{mi}$ for all weights on arcs to output nodes;
(3) use equations 4.8, 4.10 and 4.11 to calculate $\partial E / \partial s_m$ for these nodes;
(4) use equations 4.8, 4.11 and 4.15 to calculate $\partial E / \partial s_m$ for the last hidden layer;
(5) ... and so on, propagating $\partial E / \partial s_m$ backward towards the first layer;
(6) finally, compute all the weight changes:
$$w_{mi} \mapsto w_{mi} - \eta \frac{\partial E}{\partial w_{mi}}$$

Repeat this cycling through the training set until the sum of errors is no longer decreasing and has therefore reached a minimum in weight space, hopefully 0. Or stop when the error for each pair is below some small tolerance value.

Equation summary

For all weights:
$$\frac{\partial E}{\partial w_{mi}} = \frac{\partial E}{\partial s_m}\, a_i$$
For output units m:
$$\frac{\partial E}{\partial s_m} = -(d_m - a_m)\,g'(s_m) = -(d_m - a_m)\,D\,a_m(1 - a_m)$$
For all units m except output units:
$$\frac{\partial E}{\partial s_m} = g'(s_m) \sum_k \frac{\partial E}{\partial s_k}\, w_{km} = D\,a_m(1 - a_m) \sum_k \frac{\partial E}{\partial s_k}\, w_{km}$$
where the sum is over those units k to which unit m sends its output.
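Putting the summary together, here is a compact per-pair back-propagation sketch for a fully connected layered net in Python/NumPy (illustrative only: the class name, initial weight range, learning rate and stopping rule are my own choices; D = 1 throughout, and equations 4.2, 4.3 and the summary above are what is implemented):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))            # g(z) with D = 1

class FeedForwardNet:
    def __init__(self, sizes, seed=0):
        """sizes, e.g. [2, 2, 1]: inputs, hidden layer(s), outputs.
        Each unit has a bias weight, fed by a fixed input of 1."""
        rng = np.random.default_rng(seed)
        self.W = [rng.uniform(-2, 2, (m, n + 1)) for n, m in zip(sizes[:-1], sizes[1:])]

    def forward(self, x):
        """Return the activations of every layer (input layer first)."""
        acts = [np.asarray(x, dtype=float)]
        for W in self.W:
            a_ext = np.concatenate(([1.0], acts[-1]))   # prepend the bias input a0 = 1
            acts.append(sigmoid(W @ a_ext))             # equation 4.2
        return acts

    def train_pattern(self, x, d, eta=0.5):
        """One weight update for one input/output pair (per-pair updating)."""
        acts = self.forward(x)
        a_out = acts[-1]
        # Output layer: dE/ds_m = -(d_m - a_m) * a_m * (1 - a_m)
        delta = -(np.asarray(d) - a_out) * a_out * (1 - a_out)
        for layer in reversed(range(len(self.W))):
            a_ext = np.concatenate(([1.0], acts[layer]))
            grad = np.outer(delta, a_ext)               # dE/dw_mi = dE/ds_m * a_i
            if layer > 0:
                # Hidden layer: dE/ds_m = g'(s_m) * sum_k dE/ds_k * w_km
                back = self.W[layer][:, 1:].T @ delta   # skip the bias column
                delta = back * acts[layer] * (1 - acts[layer])
            self.W[layer] -= eta * grad                 # w_mi <- w_mi - eta * dE/dw_mi
        return 0.5 * np.sum((np.asarray(d) - a_out) ** 2)   # error before this update

net = FeedForwardNet([2, 2, 1])
data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
for cycle in range(5000):
    total = sum(net.train_pattern(x, d) for x, d in data)
    if total < 0.01:
        break
print(cycle, [net.forward(x)[-1] for x, _ in data])

As the exercises below show, even XOR can need a few thousand cycles with a plain learning rate and no momentum.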

4.5 An example of back-propagation

In order to see how this works, consider the simple example of trying to get a 2-2-1 network to learn the XOR function. The network is illustrated in figure 4.2.

[Figure 4.2: Learning XOR with a 2-2-1 network: inputs 1 and 2 feed hidden units 3 and 4, which feed output unit 5; unit 0 supplies the bias input to units 3, 4 and 5.]


In this figure, there is one hidden layer (neither input nor output) consisting of
units 3 and 4. Unit 5 is the sole output unit. Units 1 and 2 are the actual inputs;
they deliver x1 and x2 to units 3 and 4. Unit 0 is permanently set to deliver 1, so
that w30 , w40 and w50 are the biases for units 3, 4 and 5 respectively. These biases
are weights like any other, and get adjusted like any other.
Let us suppose that the net has just been created with random weights which
for convenience will actually be taken to be as shown in table 4.1. Remember that
x0 is always 1, and is a fake input used for letting biases masquerade as weights.
The performance of the net, using the initial weights, is shown in table 4.2; the
numbers are correct to four decimal places. For example, in the last line of this table,


To unit 5: w50 = -0.1, w53 = 0.2, w54 = 0.3
To unit 4: w40 = -0.4, w41 = 0.5, w42 = 0.6
To unit 3: w30 = -0.7, w31 = 0.8, w32 = 0.9

Table 4.1: Initial weights

x1  x2 |    s3      a3   |    s4      a4   |    s5      a5   | d
 0   0 | -0.7000  0.3318 | -0.4000  0.4013 |  0.0868  0.5217 | 0
 0   1 |  0.2000  0.5498 |  0.2000  0.5498 |  0.1749  0.5436 | 1
 1   0 |  0.1000  0.5250 |  0.1000  0.5250 |  0.1625  0.5405 | 1
 1   1 |  1.0000  0.7311 |  0.7000  0.6682 |  0.2467  0.5614 | 0

Table 4.2: How the net performs so far

$s_3 = -0.7 \times 1 + 0.8 \times 1 + 0.9 \times 1 = 1.0$. The a5 column is the actual output and the d column is the desired output; so this net is not performing at all well yet.

Let us consider how to update the weights in order to reduce the error for the last line of this table.
Step 2 of the algorithm involves calculating:
$$\frac{\partial E}{\partial w_{50}} = -(d - a_5)\,a_5(1 - a_5)\,a_0 = 0.5614 \times 0.5614 \times (1 - 0.5614) \times 1 = 0.1382$$
$$\frac{\partial E}{\partial w_{53}} = -(d - a_5)\,a_5(1 - a_5)\,a_3 = 0.1011$$
$$\frac{\partial E}{\partial w_{54}} = -(d - a_5)\,a_5(1 - a_5)\,a_4 = 0.0924$$
Step 3 is:
$$\frac{\partial E}{\partial s_5} = -(d - a_5)\,a_5(1 - a_5) = 0.1382$$
Step 4 is:
$$\frac{\partial E}{\partial s_3} = \frac{\partial E}{\partial a_3}\,\frac{\partial a_3}{\partial s_3} = \frac{\partial E}{\partial s_5}\, w_{53}\, g'(s_3) = 0.1382 \times 0.2 \times 0.7311 \times 0.2689 = 0.0054$$

and similarly for $\partial E / \partial s_4$. Thus:
$$\frac{\partial E}{\partial w_{30}} = \frac{\partial E}{\partial s_3}\, a_0 = 0.0054$$
and so on. The full calculations are tedious!
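The arithmetic above is easy to check with a throwaway Python script (assuming D = 1 and the weights of table 4.1; the variable names are my own):

import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weights from table 4.1 (w_i0 is the bias weight, fed by x0 = 1)
w3 = np.array([-0.7, 0.8, 0.9])    # to unit 3: w30, w31, w32
w4 = np.array([-0.4, 0.5, 0.6])    # to unit 4: w40, w41, w42
w5 = np.array([-0.1, 0.2, 0.3])    # to unit 5: w50, w53, w54

x = np.array([1.0, 1.0, 1.0])      # x0, x1, x2 for the last line of table 4.2
d = 0.0                            # desired output

s3, s4 = w3 @ x, w4 @ x
a3, a4 = g(s3), g(s4)
s5 = w5 @ np.array([1.0, a3, a4])
a5 = g(s5)
print(s3, a3, s4, a4, s5, a5)      # 1.0 0.7311 0.7 0.6682 0.2467 0.5614

dE_ds5 = -(d - a5) * a5 * (1 - a5)
print(dE_ds5)                      # 0.1382 (also dE/dw50, since a0 = 1)
print(dE_ds5 * a3, dE_ds5 * a4)    # dE/dw53 = 0.1011, dE/dw54 = 0.0924
dE_ds3 = dE_ds5 * w5[1] * a3 * (1 - a3)
print(dE_ds3)                      # 0.0054 (also dE/dw30)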

4.6 Some practice

The following exercises ask you to train a network to learn some function specified
by training data, and to investigate how fast it learns. You do not have to be too
precise when it comes to trying different values of parameters such as the learning
rate. After all, if a network did turn out to be incredibly sensitive to such a parameter
it would be useless because it would be too hard to discover the suitable values. A
few parameter values will do: for example, try altering the learning rates in steps of
0.1 or even 0.2.
WARNING: the rbp program can produce a lot of output if you are not careful. Don't try to be clever by setting up some script to run the experiments for you automatically. That way, you could use up the file space very rapidly! Do them interactively and think about what you are doing.
Exercise 4.1

Use the rbp program, described in an appendix, to investigate how a feed-forward network learns the XOR function. You should create a file containing (say):

m 2 2 1       * make a 2-2-1 network
s 13 17 23    * seed the random number generator
k 0 2         * initial weights to be in [-2,2]
e 0.1         * learning rate (eta) = 0.1
t 0.1         * pattern learned if within 0.1
A as          * use genuine sigmoid, not approx
n 4           * 4 lines of training data follow:
0 0 0
0 1 1
1 0 1
1 1 0

and start the program by:

% rbp filename

You can then give many commands; for example, e 0.3 will
change the learning rate, and r 200 5 will cause the program
to cycle 200 times, printing a report every 5 cycles. The report
indicates how many of the patterns have been learned, and the
average error per output unit.
You will find that it may take hundreds of cycles to learn the
function to within the tolerance specified (by the t command).
You may have to use several r commands in a row; it does
not reset anything. You can safely over-estimate the number
of cycles, the program will stop as soon as all the training data
has been learned.
Investigate how the choice of learning rate affects the number
of cycles needed.

Exercise 4.2

Investigate how the tolerance affects the number of cycles needed.

Exercise 4.3

Repeat both investigations using a 2-3-1 network instead.

Exercise 4.4

Investigate the learning of simple parity, using a 3-3-1 network and the following training data:

n 8          * Output is 0 or 1 so as
0 0 0 0      * to make number of 1s be
0 0 1 1      * an even number.
0 1 0 1      * Parity for just 2 inputs
0 1 1 0      * is just XOR; 3 inputs is
1 0 0 1      * more interesting.
1 0 1 0
1 1 0 0
1 1 1 1

Exercise 4.5

Try the same task with a 3-2-1 network.

Exercise 4.6

Try the 4-2-4 encoder problem, using a 4-2-4 network and the following training data:

n 4
0 0 0 1   0 0 0 1    * Outputs same as inputs;
0 0 1 0   0 0 1 0    * only one is 1.
0 1 0 0   0 1 0 0
1 0 0 0   1 0 0 0


Exercise 4.7

Try the analogous 8-3-8 encoder problem, and see if an 8-2-8 network can also learn it! In the 8-3-8 encoder problem, since $8 = 2^3$, you might suppose that the net is learning a binary representation of the position of the single 1 bit. If so, you might expect that the 8-2-8 network might fail to learn.

Exercise 4.8

Consider the continuous XOR problem: you are given a large number of points in the unit square $0 \le x \le 1$, $0 \le y \le 1$. None of the points have an x or y co-ordinate which is exactly 0.5. The square is divided into four smaller squares, and all points in the squares $0 \le x < 0.5$, $0 \le y < 0.5$ and $0.5 < x \le 1$, $0.5 < y \le 1$ should output 1. All points in the other two quarter-squares should output 0. This problem is very hard for a network with one hidden layer, especially if there are points which are very close to the central point (0.5, 0.5). What kind of network ought to be able to solve this problem?

4.7 Non-layered feed-forward networks

The derivation of the back-propagation algorithm does not use the property that the
net is layered; in fact it applies to all feed-forward networks whether layered or not.
The important property is that there should be no loops. So the same algorithm can
be used to train the non-layered network in figure 3.4.
Exercise 4.9

Use the rbp simulator to train that network. You can add extra
connections between non-adjacent layers by using the c command, interactively or in the initial file. For example,
c 1 1 3 1
adds an arc from layer 1 unit 1 (layer 1 being the inputs) to
layer 3 unit 1 (in a net with just one hidden layer, layer 3 will
consist of the output units).

4.8 Variations on back-propagation

As you should have found if you tried some of the exercises above, learning using backpropagation can be surprisingly slow. Considerable research has gone into methods
of speeding up the search for the global minimum of the error function. The risk
of getting trapped in some purely local minimum remains, of course, as with all
versions of some hill-climbing search strategy. Some examples of variations are
briefly discussed below.


Exercise 4.10 When training a net using the standard backpropagation algorithm, you will sometimes find that the error is decreasing satisfactorily for a while but then it suddenly jumps up to around
0.5 and stays there. Explain why this can happen.
4.8.1 Other error function

It is not essential to use a sum of squared errors as the measure of performance. The
trouble with such a sum is that since the outputs must lie between 0 and 1 (because
g(z) does), the error function has a maximum equal to the number of output units
since it is a sum over output units and the squared error can be at most 1. The main
requirements for an error function are that it should be continuous, differentiable and
have its global minimum when the desired outputs are all equal to the actual outputs.
So, for example, you might use a sum of higher powers than two of the units' errors, provided that the power is even; for example, a sum of fourth powers. Using the sum
of squared errors, or any higher power, means that the network will try to reduce
any single large error because such an error contributes a lot to the sum. However,
this is not always desirable. Suppose you are trying to train a network to predict
electricity demand in Edinburgh one day ahead and one week ahead, given data
about the demand for the past fourteen days. Thus the network might have fourteen
real-valued inputs and two real-valued outputs. However, the week-ahead values in
the training data will inherently contain much more noise than the day-ahead values
(that is, there will naturally be a limit to how accurately they can ever be predicted).
Using sum-of-squares error, the network is likely to spend much effort trying to reduce
large errors in its week-ahead predictions, and relatively little effort trying to get the
day-ahead predictions right. In such a case it might be more sensible to use merely
a sum of absolute values of output errors, modifying the back-propagation algorithm
to suit.
Exercise 4.11 How else might you deal with the day-ahead/week-ahead problem, other than by changing the error measure?
If, on the other hand, it is very important to deal with large errors, then a choice that is sometimes used is the relative entropy function, together with the activation function $g(z) = \tanh(z)$ instead of the sigmoid. This function ranges over $-1 \ldots 1$, and so the desired outputs can be allowed to lie in this range too. The relative entropy function is:
$$E = \frac{1}{2} \sum_m \Big[ (1 + d_m)\log\Big(\frac{1 + d_m}{1 + a_m}\Big) + (1 - d_m)\log\Big(\frac{1 - d_m}{1 - a_m}\Big) \Big]$$
This is always positive except when $d_m = a_m$, when it is zero. Also, it is very large if any $a_m$ is close to an extreme of the range $-1 \ldots 1$ without being close to $d_m$ too, so the gradients involved are typically much larger than with the quadratic error function.


For the output layer, using this relative entropy function,
$$\frac{\partial E}{\partial w_{mi}} = -(d_m - a_m)\, a_i$$
so that it is similar to the case of quadratic error except that the $g'$ factor is missing. This is useful, because if $s_m$ is a large positive or negative number (so that $a_m$ is close to an extreme of its range) then $g'$ would be very close to zero, giving very small weight changes in the case of quadratic error. Such small weight changes are perfectly acceptable if $a_m$ is close to $d_m$, but if, as can often happen, some $a_m$ is close to an extreme of the range but not close to the corresponding $d_m$ then it will take very many steps to correct it.

The relative entropy error function is not some arbitrary function plucked from thin air. It happens to have a natural interpretation in probabilistic terms. Imagine that each output unit represents one of a complete and disjoint set of hypotheses about the input, for example that the input belongs to a certain class. If $\frac{1}{2}(1 + a_m)$ represents the probability that hypothesis $m$ is true of the input, so that if $a_m = 1$ the hypothesis is definitely true and if $a_m = -1$ it is definitely false, and if $\frac{1}{2}(1 + d_m)$ is the target probability, then information theory suggests that the above function E is a natural measure of the difference between the desired and actual probability distributions; cf. the entropic measure used in the ID3 machine learning algorithm (see for example [19]).

Note, by the way, that if the activation function is linear, say g(z) = z, then the back-propagation algorithm is just like perceptron learning! The rbp simulator allows you to choose a linear activation function, by the command A l.
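As a concrete illustration (my own sketch, with targets kept strictly inside the interval so the logarithms stay finite), the relative entropy error can be written as:

import numpy as np

def relative_entropy(d, a):
    """E = 0.5 * sum_m [(1+d)ln((1+d)/(1+a)) + (1-d)ln((1-d)/(1-a))].
    Zero when a == d; very large when a sits near +/-1 far from d."""
    d, a = np.asarray(d, float), np.asarray(a, float)
    return 0.5 * np.sum((1 + d) * np.log((1 + d) / (1 + a)) +
                        (1 - d) * np.log((1 - d) / (1 - a)))

# Output-layer gradient is then dE/dw_mi = -(d_m - a_m) * a_i: no g' factor.
d = np.array([0.9])
for a in (0.0, 0.999, -0.999):
    print(a, relative_entropy(d, [a]))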
4.8.2 The Fahlman variation

Scott Fahlman in [5] was the first to point out the problem of having $g'$ as a factor in the expression for weight updating, if g is the sigmoid function; as just mentioned, some output value can get stuck at the wrong extreme because $g'$ is very close to zero there. He suggested the very simple idea of just adding a small positive constant, say 0.1, to $g'$ in order to ensure that this factor is never small! In practice it seems to work well.
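In code the change is a one-liner (an illustrative sketch; 0.1 is the constant Fahlman suggested):

def fahlman_sigmoid_prime(a, D=1.0, c=0.1):
    """Derivative factor used in the weight update: D*a*(1-a) plus a small
    constant, so it never vanishes when a unit saturates at 0 or 1."""
    return D * a * (1.0 - a) + c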
Exercise 4.12 Use the rbp program to examine the effect of this variation on
learning times for some of the previous exercises. You can opt
to use it just in the output layer, by the command A dF, or in
all layers by the command A df. You can revert to using the
ordinary derivative by the command A do.
4.8.3 Momentum

The gradient descent implemented by the backpropagation algorithm is not quite true gradient descent, because the step size is not infinitesimal. Thus it is always possible that some step will actually step over the true global minimum and miss it altogether. If $\eta$ is made very small indeed this risk is reduced but it will take a very long time to reach the minimum.

For example, imagine that the quadratic error function is being used and there are only two weights $w_1$ and $w_2$ and the error function just happens to be the very simple function:
$$E = w_1^2 + 10 w_2^2$$
Obviously this would never be true in any real application; the aim of this assumption is just to make the gradient descent process for it extremely easy to grasp. So in the gradient descent:
$$w_1 \mapsto w_1 - \eta \frac{\partial E}{\partial w_1} = w_1 (1 - 2\eta)$$
$$w_2 \mapsto w_2 - \eta \frac{\partial E}{\partial w_2} = w_2 (1 - 20\eta)$$
So if the initial values of $w_1$ and $w_2$ are 1 and 1, then after k steps:
$$w_1 = (1 - 2\eta)^k \qquad w_2 = (1 - 20\eta)^k$$
The true global minimum is of course at (0, 0). Thus if $\eta$ is very small, say 1/100, it will take a long time to get $w_1$ even down to 0.05; in fact it will take $\log(0.05)/\log(0.98) = 148.2$ steps to get there. On the other hand if $20\eta > 2$ then $w_2$ is going to diverge and it will never find the global minimum.
The simplest technique that helps somewhat to counteract slow convergence is called momentum ([27]). The idea is that at each time step a proportion of the weight change from the previous time step should also be added in. Thus, rather than saying
$$\Delta w_{mi} = -\eta \frac{\partial E}{\partial w_{mi}}$$
one uses
$$\Delta w_{mi}(t+1) = -\eta \frac{\partial E}{\partial w_{mi}} + \alpha\, \Delta w_{mi}(t)$$
where $\alpha$ is the momentum coefficient. Typically it is taken to be nearly 1; for example, 0.9. Notice that if (say) $\partial E / \partial w_{mi}$ is nearly constant, so that the weight landscape is pretty evenly sloping in the $w_{mi}$ direction, then over time:
$$\Delta w_{mi}(t) \to -\frac{\eta}{1 - \alpha} \frac{\partial E}{\partial w_{mi}}$$
provided that $\alpha < 1$, so that the learning rate has been effectively multiplied by $1/(1 - \alpha)$. This suggests that it is important to choose $\alpha$ to lie between 0 and 1 and to be quite close to 1. However, you can still usefully take $\alpha \ge 1$ in a few applications; in this case the steps just get larger and larger. If the search is proceeding along the floor of a very long straight valley in weight space this will be very beneficial; if the valley curves even slightly, then like a runaway sledge in the Winter Olympics bobsleigh event the search will rise up the wall of the valley and shoot off the side. But in practice, the use of a momentum term often makes the choice of a suitable learning rate harder, and the choice often gets harder for larger values of $\alpha$.
Exercise 4.13 Write a short program to plot successive points $(w_1, w_2)$, starting from (1, 1), for the trivial error function $E = w_1^2 + 10 w_2^2$ used above, but using momentum with various values of $\alpha$. Use either the simple program xgraph or the considerably more sophisticated gnuplot to display a graph in an X window. Although it is straightforward to solve the recurrence relations involved, a graph makes the process much clearer.

Exercise 4.14 Can the use of momentum ever be a bad thing to do? Justify your answer.
4.8.4 Dynamically adjusted parameters

It can be hard to choose the right values of the learning rate $\eta$ and the momentum coefficient $\alpha$. Perhaps these parameters can be adjusted automatically as learning progresses? The most commonly used idea is to see whether a weight update did actually decrease E. If not, then perhaps the step overshot the minimum and so $\eta$ should be reduced. On the other hand, if the error function is decreasing steadily over several steps perhaps $\eta$ should be increased a bit. For example, one scheme is to increase $\eta$ by some small constant after each successful step and decrease it geometrically after any unsuccessful step. Sometimes the weighted moving average of the changes in E over some fixed number of most recent steps is used for this purpose.

If a bad step which increases E is made, it may be worthwhile to reverse the step and to set the momentum coefficient to 0 until the next good step. Jacobs, who first investigated this idea in [14], has even suggested using a different learning rate for every weight, and adapting them as just outlined. This is the basis of his delta-bar-delta method.
Exercise 4.15 Use the rbp program to see whether Jacobs' delta-bar-delta algorithm significantly affects learning times for some of the test problems used in previous exercises.
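One simple way to express the "increase eta after a good step, cut it back after a bad one" scheme in code (an illustrative sketch with made-up constants; it is neither the rbp implementation nor full delta-bar-delta, which keeps a separate rate per weight):

def adapt_eta(eta, old_error, new_error, up=0.01, down=0.5):
    """Additive increase after a step that reduced E, geometric decrease
    after a step that increased E."""
    return eta + up if new_error < old_error else eta * down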
4.8.5 Line search and conjugate gradient methods

An obvious extension is to choose the direction of steepest descent, as indicated by the partial derivatives $\partial E / \partial w_{mi}$, but to calculate the value of the step to take which most reduces the value of E when moving in that direction. Take that step and then repeat
the process. However, finding the size of the step takes a little work; for example, you might find three points along the line such that the error at the intermediate point is less than at the other two, so that some minimum along the line lies between the first and second or between the second and third, and some kind of interval-halving approach can then be used to find it. (The minimum found in
this way, just as with any sort of gradient-descent algorithm, may not be a global
minimum of course.) There are several variants of this theme. Notice that if the
step size is chosen to reduce E as much as it can in that direction, then no further
improvement in E can be made by moving in that direction for the moment. Thus
the next step will have no component in that direction; that is, the next step will be
at right angles to the one just taken.
The direction of steepest descent is not always the best direction to choose
either. Imagine you are standing at the edge of a deep valley whose floor slopes
gently down to the right. The direction of steepest descent will take you first almost
straight ahead and down to the valley floor, after which you will head down the valley.
It might be more sensible to take a first step which both takes you to the valley floor
and a long way to the right, provided you can work out the direction of such a step;
the line minimisation idea would then help you to figure out the size of the step.
Or suppose you choose a direction for line minimisation which takes you diagonally
down into the valley and a point somewhere on the valley floor. If you start from
this new point and choose a direction at right angles to the previous one, then you
will be choosing a direction that heads you diagonally out of the valley again so
presumably line minimisation will now produce a tiny step, so avoiding climbing back
up the side.
In such cases what you really want is to find a new direction which in some
way helps to preserve the downward progress that previous steps have already made.
Conjugate-gradient techniques can be used to find such directions, and are widely
used in real-world applications of backpropagation. The idea is to choose a new
direction which changes as little as possible the component of the gradient along the
previous direction. What does this mean? Suppose the search starts at w0 and heads
along direction $p_0$. Line minimisation therefore means finding a real number $\lambda$ such that
$$\frac{\partial E(w_0 + \lambda p_0)}{\partial \lambda} = 0$$
By doing the differentiation you can see that direction $p_0$ must indeed be orthogonal to the gradient of E at the new point (call it $w_1$), so that gradient descent from there must indeed be at right angles. But we don't now want strict gradient descent, we want a new direction that disturbs the previous gain as little as possible. This means finding a direction $p_1$ starting from $w_1$ such that the component of the gradient in direction $p_0$ is still, as near as possible, what it was at $w_0$. Sparing you the business of expanding in terms of Taylor series and then making the customary assumption that $\lambda$ is going to be so small that terms involving its square or higher powers can be ignored, what this comes down to is finding a direction $p_1$ such that
$$p_1^T H p_0 = 0$$
where H is a matrix of the second derivatives of the error term; such a matrix of second derivatives of a function is called the Hessian of the function, named after the mathematician Otto Hesse (1811-1874).
This is all very well, you may correctly say, if we are dealing with abstract equations or if we are dealing with functions which we can differentiate easily. However, in practice the second derivatives can only be estimated from numeric data, and even then the matrix calculations required to determine $p_1$ must surely be expensive to do? This is where a wonderful simplifying assumption saves the day. In practice, the error landscape in the general vicinity of a minimum can almost always be well approximated by a suitable quadratic function, and if the function is quadratic all the entries in the Hessian matrix will be constant. This doesn't seem like progress: what is the approximating quadratic, and what are the terms of the Hessian? Almost magically, it turns out that you don't need to know!
First, consider what happens with a genuinely quadratic function. For example, consider the (trivial) problem of trying to find the minimum of the function:
$$f(x, y) = x^2 + xy + y^2$$
starting from the point (3, 0). Figure 4.3 shows contour lines of this function, and point A is the starting point. Now:
$$\frac{\partial f}{\partial x} = 2x + y \qquad \frac{\partial f}{\partial y} = x + 2y$$
so the gradient at (3, 0) runs in the direction (6, 3). Line minimisation therefore means finding the value of $\lambda$ such that $f(3 - 6\lambda,\ 0 - 3\lambda)$ is a minimum. It turns out to be at $\lambda = 5/14$, that is, at the point $(12/14, -15/14)$, which is point B in figure 4.3. As you can see, proceeding from B at right angles to AB would take the search along the dashed line BD, and therefore only a tiny way along it to a point (not shown) that was very close to B itself. Instead, the conjugate-gradient direction from B is the vector $p = (x, y)$ such that:
$$\begin{pmatrix} 6 & 3 \end{pmatrix} \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = 0$$
which multiplies out to be:
$$15x + 12y = 0$$
This is the line BC in the figure, which passes straight through the true minimum of the function at (0, 0), and minimisation along this line therefore goes straight to it.
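The numbers in this example can be reproduced with a few lines of Python (a check script of my own; for a quadratic, exact line minimisation along direction p from point w uses a step length of -grad(w).p / (p.H.p)):

import numpy as np

H = np.array([[2.0, 1.0], [1.0, 2.0]])    # Hessian of f(x, y) = x^2 + xy + y^2

def grad(p):
    x, y = p
    return np.array([2 * x + y, x + 2 * y])

A = np.array([3.0, 0.0])
g0 = grad(A)                               # (6, 3)
lam = (g0 @ g0) / (g0 @ H @ g0)            # exact line minimisation along -g0
B = A - lam * g0
print(lam, B)                              # 5/14, (12/14, -15/14)

# Conjugate direction p1 at B: g0 . H . p1 = 0, e.g. p1 = (12, -15) up to scale
p1 = np.array([12.0, -15.0])
print(g0 @ H @ p1)                         # 0: p1 is conjugate to the first direction
mu = -(grad(B) @ p1) / (p1 @ H @ p1)       # line minimisation along p1
print(B + mu * p1)                         # lands exactly on the minimum (0, 0)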


[Figure 4.3: Conjugate-gradient example: contour lines of f(x, y), showing the starting point A, the point B reached by line minimisation, and the conjugate-gradient line BC through the minimum.]

4.9 What is learned by a net?

Page 4:19

When dealing with a quadratic function of (say) n variables, this conjugate-gradient technique guarantees to get to the minimum in at most n steps; for n = 2 in the above example it took 2 steps. But it still seems as though you need to be able to get your hands on the Hessian somehow. But it turns out that for quadratic functions this is not necessary at all. The basic idea is as follows: update is done according to:
$$w_{t+1} = w_t + \lambda_t p_t$$
and the very first step direction $p_0$ is the usual one, the direction of steepest descent $-\frac{\partial E}{\partial w}(0)$. However, thereafter the direction is a linear combination of the current direction of steepest descent and the previous step direction:
$$p_{t+1} = -\frac{\partial E}{\partial w}(t+1) + \beta_t\, p_t$$
There are two algorithms for computing a suitable $\beta_t$, whose derivation is beyond the scope of these notes. The first is the Fletcher-Reeves formula dating from 1964:
$$\beta_t = \frac{\frac{\partial E}{\partial w}(t+1)^T\, \frac{\partial E}{\partial w}(t+1)}{\frac{\partial E}{\partial w}(t)^T\, \frac{\partial E}{\partial w}(t)}$$
The second is a variant, now the standard one, called the Polak-Ribiere formula and dating from 1969:
$$\beta_t = \frac{\frac{\partial E}{\partial w}(t+1)^T\, \big(\frac{\partial E}{\partial w}(t+1) - \frac{\partial E}{\partial w}(t)\big)}{\frac{\partial E}{\partial w}(t)^T\, \frac{\partial E}{\partial w}(t)}$$
If you followed the maths in fine detail you would be able to assert correctly that these two formulae are in theory identical. Therein lies the difference between theory and practice! Usually roundoff errors mean that you don't get the conjugate gradient exactly right, and anyway any realistic error surface is not exactly quadratic. Under these conditions the second formula behaves itself much better. And if the function E was indeed a quadratic, it turns out that the succession of steps produced by this formula are exactly those you would have taken if you had worked out the Hessian and gone through all the conjugate gradient calculations illustrated above. This is the near-magical fact referred to above. In fact a proof of it is not too hard, but it is outside the scope of these notes.
Studies suggest that although use of the Polak-Ribiere formula is computationally more complicated, training time can be much shorter than if using one of the
more basic forms of back-propagation.
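In code, the Polak-Ribiere direction update is only a few lines (an illustrative sketch; I write the coefficient as beta, and the gradients are flattened weight-gradient vectors):

import numpy as np

def polak_ribiere_direction(grad_new, grad_old, p_old):
    """New search direction p_new = -grad_new + beta * p_old, with beta
    given by the Polak-Ribiere formula."""
    beta = grad_new @ (grad_new - grad_old) / (grad_old @ grad_old)
    return -grad_new + beta * p_old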

4.9 What is learned by a net?

It is all very well to get a feed-forward network to learn to reproduce test data, but it
is just as important to try to understand just what it is doing. As mentioned at the


start of chapter 2, if a net is provided with all possible cases of the input/output data for training, then it is essentially just memorising. If it is provided with just a subset of cases it may or may not generalise correctly from that data. For example, suppose that a net is to have n binary-valued (0 or 1) inputs and one binary-valued output. The complete set of $2^n$ inputs, and the output required in each case, defines a binary function. Thus there are $2^{2^n}$ possible binary functions. Thus for a net with just four binary inputs and one binary output there are 65536 possible binary functions, of which (even) parity is just one. Thus if you train a network to learn to generate even parity, using perhaps 12 of the 16 possible inputs, and the net learns to perform satisfactorily, then there will still be $2^{2^4 - 12} = 16$ possible binary functions that agree with the even parity function about the outputs required for those 12 inputs; even parity itself will be one of those 16 possible functions.
Why should the net thus learn to compute one specific function when given incomplete training data? The answer is, usually it won't; but it may learn some function whose performance is in some quantifiable way not too different from that of even parity. There is nothing very special about the even parity function which singles it out from the other 15 which agree with it about the desirable outputs from the 12 training cases. The fact that it is a function which can be succinctly stated in English has no bearing on the network. Similarly, consider the simple symmetry function mentioned in the previous chapter, where the output is to be 1 if and only if the set of input bits is symmetric about its midpoint: 01100110 should output 1 but 01101110 should output 0. A fully-connected feed-forward network with one hidden layer can learn this function, given all possible training examples. But as before, with incomplete data the net may generalise incorrectly. But in this case, unlike in the case of even parity, there is an added peculiarity. Given a trained network, and knowing all the weights, you can instantly set up another network in which the only difference is that the input nodes have been permuted somehow. Thus if nodes 1-8 are renumbered as 13452768, then the input 01100110 becomes 01001110: this trivial change (as far as the net is concerned) has completely destroyed the natural appearance of the function.
You can start investigating what a simple net seems to be learning by looking
again at some of the very basic problems already used in past exercises.
Exercise 4.16 Use the rbp program to train a 2-2-1 net to compute XOR (you
should find that it trains appreciably faster than the 2-1-1 net
with extra connections from input to output). You can then use
the l and p commands to see the activations of the nodes in
the layers for given input patterns, like so:
p 1 0
causes the two input nodes to be set to 1 and 0. Then the
command:


l 1

shows you the values on the input layer:


1.000 0.000

and the command:


l 2

shows the values of the nodes in the hidden layer, for example:
0.804 0.000

The command:
w 2 1

will show you the weights to layer 2 node 1 (the node that had
the value 0.804):
layer  unit   unit value    weight       input from unit
  1     1      1.00000      -3.95524      -3.95524
  1     2      0.00000      -3.56139      -0.00000
  2     t      1.00000       5.44262       5.44262
                                   sum =   1.48738
The t unit is the bias (or threshold) input, which is permanently set to 1. The
weighted sum is shown, and 0.804 is the output of the sigmoid of this.
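The arithmetic is easily checked in Python. (Note that the exact logistic sigmoid of 1.48738 is nearer 0.816 than the 0.804 shown; rbp presumably uses a slightly different or table-approximated activation, so a small discrepancy of this kind is not alarming.)

import math

inputs  = [1.0, 0.0, 1.0]                        # the two input units plus the bias unit
weights = [-3.95524, -3.56139, 5.44262]
s = sum(x * w for x, w in zip(inputs, weights))
print(s)                                          # 1.48738
print(1.0 / (1.0 + math.exp(-s)))                 # about 0.816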
Exercise 4.17 Try using the same techniques to investigate what the hidden
units are doing in a simple parity problem (such as 4-bit parity)
and in the 4-2-4 encoder problem.

4.10  Training and test data

In general, and in almost all practical problems, you will not know what sort of
function your network ought to be learning. So how can you tell when it has done
its job? The usual approach is to have separate sets of data for training and for
testing. The network is trained on the training data until it appears to have learned
it all, or perhaps has learned it well enough, in the sense that a few of the training
cases remain misclassified but the overall error is small and further training produces
no improvement. Then the network can be tested on the test data; hopefully it
should do about as well on the test data as on the training data provided both are
truly representative in some sense. This raises a lot of further questions: how much
data is needed; how should the data be divided into training and test sets; what if
performance is markedly worse on the test set than on the training set?
Usually, the question of how much data is needed is academic: you use what
you've got! As for division into training and test sets, the crudest procedure is to
divide the data at random into two equal-sized sets. This is of course risky; the test
set might contain all or most of a certain class of examples, so that the net couldn't
learn about that class (because the training set contains few or no examples of it).
A sounder approach is called cross-validation: pick some integer k, divide the data
randomly into k equal or nearly equal sized sets. Then take one set as the test set,
the other k - 1 together as the training set. Repeat this, taking each of the k sets
in turn as the test set. If k is the size of the whole set of available data, so that the
test set is of size 1 each time, the procedure is called jack-knifing. The value of k
is usually chosen by common sense, but sometimes by some form of search instead,
looking for the value of k which delivers best results.
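In code, the splitting step of cross-validation might look like the following Python sketch; the data, the value of k, and the training and evaluation routines are placeholders for whatever you are actually using.

import random

def k_fold_splits(data, k, seed=0):
    items = list(data)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]        # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        yield train, test

# for train, test in k_fold_splits(all_examples, k=5):
#     net = train_network(train)       # hypothetical training routine
#     evaluate(net, test)              # hypothetical evaluation routine

Setting k equal to the size of the data set gives the jack-knifing procedure described above.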
What counts as good results? It would be naive to suppose that you should
train the net until it seems to be doing as well as possible on the training data, and
then take its performance on the test data as the measure of overall performance. The
problem with this is that the net may come to learn the training data specifically, so
that after a while performance on the test data starts to go down again. This is a
classic symptom of overfitting to the training data. The more usual procedure is to
run the test set frequently during training. Typically performance on the training set
will improve reasonably steadily but at some point performance on the test set will
begin to worsen. The overall performance at the specific moment where this downturn in test set performance occurs is usually taken to be the net's best generalisation
from the data. (Note, rbp has a benchmarking command B which makes it easy to
gather the necessary data). Be warned however: even this is not a foolproof criterion.
Very occasionally you will find a problem on which test set performance decreases
seriously for a while as training progresses, and then improves again, perhaps reaching
a much higher overall performance. All you can do is report the criterion you are
using, openly and honestly.
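The bookkeeping for this early-stopping procedure is simple. In the Python sketch below, train_some and test_error are hypothetical stand-ins for your simulator's training and benchmarking commands (for example, rbp's B command can supply the test-set figures).

def train_with_early_stopping(weights, training_data, test_data,
                              train_some, test_error, max_cycles=1000):
    """Keep the weights from the point where test-set error was lowest."""
    best_error, best_weights = float("inf"), weights
    for _ in range(max_cycles):
        weights = train_some(weights, training_data)   # a short burst of training
        err = test_error(weights, test_data)           # run the test set frequently
        if err < best_error:
            best_error, best_weights = err, weights
    return best_weights, best_error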
When assessing performance, the proportion by which performance is worse on
the test data than on the training data is sometimes called the shrinkage, especially


in medical circles. Ideally, shrinkage should be 0%, even if the net does not manage
to get all examples correct. It is also important to remember that performance is
going to depend on the threshold used to decide what an output value represents. A
sigmoid unit is going to output a value between 0 and 1, and the value will never be
exactly 0 or exactly 1, it will lie between these extreme values. Let us suppose that
the desired outputs are 0 or 1. In practice it is common to set some threshold t, such
that any output larger than t is deemed to be a 1 and any lower than t is deemed to
be 0. What should t be?
The best value of t can be found by plotting a receiver operating characteristic (ROC)
curve. For any given t the false alarm rate is the probability of the net's saying that
the output should be 1, among all those cases for which it should be 0. The hit rate
is the probability of the net's saying that the output should be 1, among all those
cases for which it should indeed be 1. Ideally you would like the hit rate to be 1 and
the false alarm rate to be 0. If the threshold t is set to be just larger than 1 the net
will claim that all cases give an output of 0, so the false alarm rate and the hit rate
will both be 0. If the threshold t is set to just below 0, the net will claim that all
cases output 1 so the hit rate and the false alarm rate will be 1. Thus if you first
ascertain what true value the net outputs for each case, and then classify each as 0
or 1 according to the threshold, you can plot a curve of hit rate (the vertical axis)
against false alarm rate (the horizontal axis) as the threshold varies between 0 and 1.
This curve will run from (0, 0) (at t = 1) to (1, 1) (at t = 0) and ideally should run
straight up from (0, 0) to (0, 1) and straight across to (1, 1). This graph is the ROC
curve. The worst that can happen is that the hit rate is always equal to the false
alarm rate, meaning that the net is completely unable to classify cases correctly; in
this case the ROC curve will be a straight line between (0, 0) and (1, 1). The area
under the ROC curve is sometimes used as a measure of the net's effectiveness; this
quantity will lie somewhere between 0.5 (useless) and 1.0 (perfect).
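A Python sketch of the ROC bookkeeping described above follows; outputs and targets are assumed to be parallel lists of raw sigmoid outputs and true 0/1 labels, with both classes present in the data.

def roc_points(outputs, targets, n_thresholds=101):
    n_pos = sum(1 for d in targets if d == 1)
    n_neg = len(targets) - n_pos
    points = []
    for i in range(n_thresholds):
        t = i / (n_thresholds - 1)
        hits = sum(1 for o, d in zip(outputs, targets) if d == 1 and o > t)
        fas  = sum(1 for o, d in zip(outputs, targets) if d == 0 and o > t)
        points.append((fas / n_neg, hits / n_pos))      # (false alarm rate, hit rate)
    return sorted(points)

def area_under_curve(points):
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0             # trapezium rule
    return area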
Of course such a measure is not always what you want. Suppose that you are
training a net to decide whether a patient has (say) a brain tumour or not, based on
a number of fairly basic clinical observations, and that the net's output will be used
to decide whether to order more expensive tests such as a brain scan. Other things
being equal you would rather that the net erred on the side of deciding that a tumour
was present than on the side of deciding that one was not present; the costs of the two
sorts of mistake are wildly different. In such cases the costs of the different sorts of
errors should be built into the calculations somehow.

4.11  What can be learned by a feed-forward network?

The question naturally arises, what kind of function can a feed-forward network learn?
Obviously, if the activation function of the output units is the sigmoid function, then
the output values will be restricted to lie strictly between 0 and 1, never quite reaching
the extremes of this range. However, if the output layer alone was changed to use a linear activation


function, then there would be no restriction on possible output values. In fact it has
been shown that, with linear activation in the output layer, at most two hidden layers
are sufficient to be able to represent any function to an arbitrary degree of accuracy.
The informal proof (there is a formal one) goes like this.
First, any function can be approximated by a suitably huge linear combination
of bumps that are each significantly non-zero only in some small region of the input
space. Second, such bumps can be constructed with at most two hidden layers. If g(z)
is the sigmoid function, then g(x) - g(x - p) has a peak at x = p/2 and is virtually
zero when x is far from there. Thus a sharp peak can be produced anywhere in one
dimension with g(ax + b) - g(ax + c). In more than one dimension, we can create a
ridge running in any direction we please, for example by:

g(2x + 3y - 4) - g(2x + 3y - 5)

and the crest of this ridge is the line

2x + 3y = 4.5
Moreover, we can create two such ridges running in non-parallel directions which
intersect at the target location where we wish to set up a purely local bump. For
example, the two ridges created by

g(2x + 3y - 4) - g(2x + 3y - 5) + g(-3x + 5y - 2) - g(-3x + 5y - 3)

intersect at roughly the point (0.789, 0.974). Thus there will be a maximum there
where the two ridges intersect, but the ridges themselves remain. However, we can
add an extra sigmoid unit with a suitable threshold to slice off everything but the top of this bump,
thus getting a purely localised bump. In the specific case here, it turns out that the
ridges are no more than 0.3 high whereas the bump is twice that height. So a function
such as:

g(100(g(2x + 3y - 4) - g(2x + 3y - 5) + g(-3x + 5y - 2) - g(-3x + 5y - 3)) - 45)

will have a bump of height nearly 1 (around g(15) in fact) at the point of intersection
of the ridges, and will have a height of nearly 0 (around g(-15) in fact) even on the
crest of the ridges when even a modest distance away from the point of intersection.
Thus, we can build a fragment of network with inputs x and y as follows. The
weighted sum of inputs to the first node in the first hidden layer will be 2x + 3y - 4,
and that node outputs g(2x + 3y - 4). The other three nodes in the first hidden layer
compute the other terms. The second hidden layer has one node, whose weighted
sum of inputs is 100g(.) - 100g(.) + 100g(.) - 100g(.) - 45, and it outputs the
bump. The output layer has a linear activation function, so can rescale the bump to
any desired height.
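The construction can be checked numerically. The following Python sketch uses the coefficients above; the two evaluation points (one at the intersection of the crests, one far away along the first crest) are arbitrary choices.

import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def ridges(x, y):
    return (g(2*x + 3*y - 4) - g(2*x + 3*y - 5)
            + g(-3*x + 5*y - 2) - g(-3*x + 5*y - 3))

def bump(x, y):
    return g(100 * ridges(x, y) - 45)

print(bump(0.789, 0.974))   # close to 1 at the intersection of the two crests
print(bump(3.0, -0.5))      # close to 0 on the first crest, far from the intersection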
And finally, if we can represent a localised bump anywhere with just the two
hidden layers, then we can add enormous numbers of further hidden nodes to represent


vast numbers of extra bumps, to approximate the desired function as closely as we


please.
However, this only says that an arbitrary function is representable. It doesn't
say that it is learnable, nor does it suggest how few hidden units are needed. It
has been proved that, to represent any continuous function, just one hidden layer is
needed; and even that if there are n inputs and m outputs then at most 2n + 1 hidden
units are required. But once more, this doesn't say whether such a net can learn the
function in any reasonable time.

Chapter 5
Applying the technology
This chapter is about applying the ideas introduced in the previous chapter. It begins
with some discussion of how one might decide on the topology of a network, and some
algorithms to help discover a suitable topology, as well as some exercises to let you
try for yourself. This is followed by descriptions of various example applications of
feed-forward neural networks, to help give you some feel for the uses of this technology.

5.1  The T-C problem: a simple example

The T-C problem, not quite in the form presented here, is a simple but classic example
of a visual recognition problem. The aim is to train a neural net to distinguish
between a stylised binary image of the letter T and a stylised binary image of
the letter C. Figure 5.1 shows these crude images. In the original version of this
problem studied by Rumelhart and McClelland, the C had shorter cross-pieces and
so differed from the T by the siting of a single square. However, the net must be
T = 111     C = 111
    010         100
    010         111

Figure 5.1: The T-C problem: the stylised letters


able to distinguish between these letters whichever orientation (of the four possible
in this crude problem) a letter may have.
You can try this problem for yourself using the rbp program. To set up the
training data, create a 3x3 grid using your favourite editor:

000
000
000
and make eight copies with a blank line between each, by marking the start (ESC-space in microEmacs or ctrl-@ in GNUemacs), moving to the end and taking a copy
(ESC-W), then yank the copy back (ctrl-Y) seven times. You can then change some
of the zeros to ones in each of the eight grids, to produce each letter in each of four
orientations. Use overwrite mode for convenience (ctrl-X M over in microEmacs,
ESC-x overwrite-mode in GNUemacs), but don't forget to cancel overwrite mode
afterward (ctrl-X ctrl-M over in microEmacs, a second ESC-x overwrite-mode in
GNUemacs). You can then turn a grid such as:
111 010 010
into a line of training data for rbp by deleting the newlines to get:
111010010
and then adding what the output should be (0 for T or 1 for C) to get a line:
111010010 0
Try a 9-2-1 network; you should find that it learns very quickly, for various values of
the learning rate and momentum parameters.
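If you prefer, the eight training lines can be generated programmatically rather than by hand. Here is a small Python/numpy sketch; it assumes the file format just described (nine 0/1 digits, a space, then the target).

import numpy as np

T = np.array([[1, 1, 1],
              [0, 1, 0],
              [0, 1, 0]])
C = np.array([[1, 1, 1],
              [1, 0, 0],
              [1, 1, 1]])

lines = []
for label, grid in [(0, T), (1, C)]:          # 0 for T, 1 for C, as above
    for k in range(4):                        # the four orientations
        rotated = np.rot90(grid, k)
        bits = "".join(str(b) for b in rotated.flatten())
        lines.append(f"{bits} {label}")
print("\n".join(lines))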
Exercise 5.1  This particular version of the problem is trivial. Why?

Exercise 5.2  Use the w command to look at the weights to the two hidden units (for example, w 2 1, w 2 2) and the output unit (w 3 1) to see whether the net has noticed the obvious feature which distinguishes the two letters.

This example should show you that the network has learned to distinguish a particular feature of the problem. Now consider a more complicated version of this problem,
in which the 3 x 3 grid representing a T or C in any of four orientations is located
in a 4 x 4 grid:
0000
0100 = T on its side, lower left
0111
0100
As before you can use your editor to create the training data easily, or use a supplied
set. This time there are 16 inputs, there will still be just one output, but it is not
clear how many hidden nodes there should be (although you should assume that there
is just one hidden layer).

Exercise 5.3  Try 4 nodes in the hidden layer to start with. Try using a momentum parameter of 0.9 initially (a 0.9), and then try it again with a 0.0. You can re-initialise the network by:

C        * Clears the network
k 0 2    * Initial weights in [-2...2]
a 0.0    * Momentum coefficient = 0, so no momentum

Exercise 5.4  If the network does not seem to be converging to low error, add another node to the hidden layer:

H 2 2    * In layer 2, weights to it lie in [-2...2]

and continue. It may take of the order of 1100-1500 cycles to reach success.
The 16-5-1 network contains a reasonable number of weights (16 x 5 + 5 x 1 +
5 + 1 = 91, the last two terms being for the threshold weights in the hidden and
output layers). But it is not clear just how the network is doing what it does, and
it may be that many of the weights are redundant, even if none of the hidden units
are. In some systems the network is not fully connected between adjacent layers. For
example, in the 4 x 4 version of the T-C problem, it may be enough to have four
hidden units, each of which looks at a 3 x 3 patch of the input grid.
Exercise 5.5  Try using a reduced network for the 4 x 4 T-C problem, as follows. First create the network and use the W command to get rid of all but the threshold weights. When a network is first created all weights are zero, and W 1 deletes all weights of absolute value less than 1 apart from threshold weights, which can never be deleted:

m 16 4 1      * A 16-4-1 network
s 31 29 17    * Random number seeds
W 1           * Get rid of weights

Now add individual connections like so:

c 1 1 2 1     * Weights to layer 2 node 1 come
c 1 2 2 1     * from top-left 3x3 grid:
c 1 3 2 1     *    1  2  3  +
c 1 5 2 1     *    5  6  7  +
c 1 6 2 1     *    9 10 11  +
c 1 7 2 1     *    +  +  +  +
c 1 9 2 1
c 1 10 2 1
c 1 11 2 1
...(another 3x9 lines to go)...

and don't forget to re-connect layer 2 (the hidden layer) to layer 3 (the output layer):

c 2 1 3 1
c 2 2 3 1
c 2 3 3 1
c 2 4 3 1

Try training this network. Again, it is helpful to set the momentum parameter to 0, and don't forget to initialise the weights
randomly by, say, k 0 4. You should find that this reduced network trains significantly faster than the fully-connected version.
Use the w command to inspect the weights and see if you can
decide what the hidden units are now doing.

5.2  Simple digit recognition

Many people have applied neural nets to some kind of image processing task. The
T-C problem is a very simple instance. A (very) slightly more interesting example
concerns the recognition of digitised digits. For example, a stylised image of the digits
3 and 4 and 5 on a low-resolution 6 x 4 grid might be:
0110
1001
3=0010
0001
1001
0110

1000
1000
4=1010
1111
0010
0010

1111
1000
5=1110
0001
1001
0110

There will be 10 lines of training data, one for each of the digits 0-9, and presumably
10 outputs, exactly one of which should be close to 1, indicating which digit is present
in the input image.

Exercise 5.6  Set up the training data for yourself, or use a supplied copy. Then try to train a 24-3-10 network; you should find that the error per output unit oscillates a bit but stops decreasing. This happens even if the momentum coefficient is set to 0. Add another hidden unit by H 2 2 (say), and keep the momentum coefficient at 0. You should now find that the error starts to decrease again. If you start with a 24-4-10 network and no momentum, the net trains reasonably, in several hundred cycles. Now try entering noisy patterns using the p command, for example a 4 with one extra bit set:

p 100010001010111101100010

Study how well the network performs at recognising digits with a little added (or subtracted) noise.
Exercise 5.7  If you have the patience, try a reduced network of some kind for this task. Guess what features of the input are important for the recognition task, and set up weights so that the hidden units look only at where those features might be. Does such a network perform well with noisy inputs?

5.3  Some observations

If you tried the exercises up to this point, you should have reached some valuable
conclusions:
- One way to develop a topology for a network is to start with too few hidden units, train the network until the error stops decreasing, add a further hidden unit; and repeat until the network is fully trained.

- Sometimes a fully connected network is too general, and some kind of reduced network may work rather better.

- Momentum is not always a desirable feature.

- Even a network trained on a tiny subset of the possible data may perform fairly well with noisy forms of that data.
There are various qualifiers to be added. First, incremental addition of hidden
units may not always work, or may not work well. For example, once you have found
a feasible topology by this incremental process, it may be worthwhile to start again
with random weights and train that network from scratch. Also, you may end up


with so many hidden units that the network is merely memorising the training data
instead of doing something more useful.
If you guess what the interesting features are and engineer a reduced network,
you may get something which is a bit less tolerant of noise. This is because your
guesses might not be quite perfect, and it can be very hard to tell. The fully-connected
digit-recognition example works reasonably well on test data that has a little noise,
but not necessarily because the network has generalised well and now understands
digits. It is more likely that humans don't understand the task very well, and the
network is doing something fairly simple. This is where the statistical clustering
techniques described in chapter 2 can be very useful. For example, you could feed
a large number of noisy examples into the fully-connected digit-recognition network,
and in each case treat the set of values output by the hidden units as the co-ordinates
of some point in hidden-unit-value space. You can use PCA to see whether these
points cluster. Also, these points can be grouped according to which digit, with
added noise, was presented at the input and then CDA can be used to look for
distinctions between groups in order to try to guess whether particular hidden units
are looking at specific features.
Remember also that the network will even try to classify a completely random
input, and may come up with a clear winner although the input has no resemblance to
any of the patterns on which it was trained. It may be helpful, therefore, to include a
'rejected' category in the classification scheme, but the system will need to be trained
on suitable 'reject' inputs, of course.
In the examples so far, the input has been coded in a fairly natural way. Often
the method chosen for representing the input can make the task very easy or completely impossible for a neural network. For instance, imagine that an n-bit number
is encoded in binary for presentation to an n-input network. Then it is almost trivial
to train such a network to output a 1 if and only if the number is odd; it only has to
output the last input. On the other hand, it seems impossible to train a network to
output the smallest prime factor of the input. The network has no direct method of
doing multiplication or division, and no working memory. Obviously, since a feed-forward network can represent any basic logic gate (XOR, for instance, or NAND) and a complete digital computer can be built
out of such gates, it ought to be possible to build a network that could output the
smallest prime factor of its input; but it seems a hopeless task to train a network
to do that. An incremental approach, training fragments of network to do subtasks
and fossilising them when they work, might eventually succeed; but the task is pretty
pointless. Note, however, that it might be much easier to train a network to output
a 2 if the input is divisible by 2, a 3 if it is divisible by 3 and not by 2, or a 5 if it is
divisible by 5 but not by 2 or 3. Such a network will output the smallest prime factor
for 11 out of every 15 possible inputs (that being the fraction of integers divisible by 2, 3 or 5), a 73% success rate already! Therefore:
- Be thoughtful about how the inputs are to be represented.

- Be cautious about the nature of what the network is doing; it may be solving a much simpler problem than you suppose, because of a bad choice of training data or even because you cannot see the hidden simplicity of the problem.
The following sections look at other approaches to developing a network topology, and at some much more sophisticated applications of feed-forward neural network
technology.

5.4  Finding a network topology

In addition to the approach suggested above, there are several other techniques which
can be employed in order to find a reasonable network topology for a given task.
5.4.1  Weight decay

One method is to start with an overly-large network and prune it during training, by
making the weights decay with time. The simplest approach, in which at each cycle:
w_ij <- (1 - eps) w_ij

looks promising but has drawbacks. This corresponds exactly to having an extra term

(eps/2) Sum_{i,j} w_ij^2

added to the error which gradient descent is trying to minimise, which means that
large values of w_ij are disproportionately penalised by this scheme. A better variant
would be to have an extra error term such as

(eps/2) Sum_{i,j} w_ij^2 / (1 + w_ij^2)

The effect of this is that, after a lot more maths:

w_ij <- (1 - eps/(1 + w_ij^2)^2) w_ij

compared to the standard backpropagation; that is, small weights decay faster than
large ones. Variants of this, such as

w_ij <- (1 - eps/(1 + Sum_k w_ik^2)) w_ij

mean that the weight decay is the same for all weights feeding unit i, and the weights
to a unit with small output decay faster than those to a unit with large output.
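As a concrete (if simplified) illustration, one training cycle with the basic decay scheme might look like this in Python; grad_E, eta and eps are placeholders rather than values taken from any particular simulator.

import numpy as np

def update_with_decay(w, grad_E, eta=0.1, eps=0.001):
    """One cycle of gradient descent followed by simple multiplicative weight decay."""
    w = w - eta * grad_E(w)      # ordinary back-propagation step
    w = (1.0 - eps) * w          # then decay every weight slightly towards zero,
                                 # the quadratic-penalty scheme described above
    return w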

5.4.2  Frean's Upstart algorithm

This algorithm grows a network, but is specifically for networks consisting of 0/1
threshold units. It is due to Marcus Frean. The description makes the simplifying
assumption that the network is to be trained on a number of known input patterns,
for each of which the output is just a single value. If the output is to be a vector of
values, then the algorithm can be extended to consider each component separately.
Start with just a single output unit connected to the inputs, and train it to
do the best it can, using the perceptron learning algorithm. The algorithm may not
terminate, but will settle down into a sort of oscillatory behaviour getting about as
many input/output patterns right as it can; at which point it is stopped. Now look
at those input/output patterns which are still misclassified. Some will produce a 1
when 0 is desired. Train a subsidiary unit in the same way to get as many of these
exceptional cases as possible right and to output 0 for all the correctly classified cases,
and connect this subsidiary unit to the output unit with a large negative weight, so
it corrects the output for those cases. Likewise, add a subsidiary unit that corrects as
many as possible of the misclassified patterns for which the output produces a 0 when
1 is desired and outputs 0 for all correctly classified cases; connect this to the output
unit with a large positive weight, so it corrects the output for those cases. The two
subsidiary units may themselves misclassify, so each can recursively be given correct
subsidiary units, until eventually all input/output patterns have been dealt with.
Suppose, for instance, that when training the first unit on a training set of size
100, it has been possible to get 83 of them right, but the unit outputs 1 instead of
0 on 11 of the other cases and 0 instead of 1 on the remaining 6 cases. The idea is
to try to train a first subsidiary unit to output 0 for each of the 83 correctly handled
cases and to output a 1 for the 11 cases which are producing a 1 rather than 0 from
the initial unit. Note that it is always possible to get this subsidiary unit to output
0 for all 94 cases; and in fact it is always possible to get it to output 1 for at least
one of the 11 troublesome cases too. This is because all the inputs are 0/1; all the
inputs are thus located at the corners of a hypercube, and it is always possible to find
a plane which separates a single corner of the hypercube from all the rest. Thus the
first subsidiary unit can at least be trained to output 0 for the 83 correctly handled
cases and a 1 for at least one of the 11 cases of 1-instead-of-0. If this subsidiary unit
is now connected to the initial unit with a big negative weight, then the subsidiary
unit will not upset the initial unit's behaviour on the 83 correctly handled cases, and
will correct its behaviour on at least one of the mishandled cases.
Proceeding recursively, eventually all the originally mishandled cases will be
corrected, because each new unit added fixes at least one more case. The resulting tree-structured network is unusual, but it has been shown that it can later be
transformed into a single-layer net.
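The recursive structure of the algorithm can be sketched in code. The following Python/numpy version is only an illustration: the pocket-style perceptron trainer, the epoch limit, the depth guard and the size of the corrective weight are all assumed choices, not Frean's exact prescription.

import numpy as np

def train_unit(X, y, epochs=200, eta=0.1):
    """Train one 0/1 threshold unit and keep the best weights seen (pocket style)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])          # append a bias input of 1
    w, best_w, best = np.zeros(Xb.shape[1]), None, -1
    for _ in range(epochs):
        for xi, ti in zip(Xb, y):
            out = 1 if xi @ w > 0 else 0
            w += eta * (ti - out) * xi                  # perceptron learning rule
        score = int(np.sum(((Xb @ w) > 0).astype(int) == y))
        if score > best:
            best, best_w = score, w.copy()
    return best_w

def upstart(X, y, depth=0, max_depth=10):
    """Return a function mapping 0/1 patterns to 0/1 outputs, built recursively."""
    w = train_unit(X, y)
    def raw(Xq):
        Xb = np.hstack([Xq, np.ones((len(Xq), 1))])
        return Xb @ w
    out = (raw(X) > 0).astype(int)
    wrongly_on  = (out == 1) & (y == 0)                 # cases needing pushing down
    wrongly_off = (out == 0) & (y == 1)                 # cases needing pushing up
    if depth >= max_depth or not (wrongly_on.any() or wrongly_off.any()):
        return lambda Xq: (raw(Xq) > 0).astype(int)
    big = np.abs(raw(X)).max() + 1.0                    # large enough to override the parent
    daughters = []
    for mask, sign in [(wrongly_on, -1.0), (wrongly_off, +1.0)]:
        if mask.any():
            # daughter targets: 1 only on the cases this daughter must correct
            daughters.append((upstart(X, mask.astype(int), depth + 1, max_depth), sign))
    def net(Xq):
        s = raw(Xq)
        for d, sign in daughters:
            s = s + sign * big * d(Xq)                  # negative or positive correction
        return (s > 0).astype(int)
    return net

# e.g. learn 3-bit parity from all eight patterns:
# X = np.array([[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)])
# y = X.sum(axis=1) % 2
# f = upstart(X, y); print(f(X))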

5.4.3  Mezard and Nadal's Tiling algorithm

This algorithm, devised by Mezard and Nadal, generates a multi-layer network. As


above, the description makes the simplifying but unnecessary assumption that there
is a single output value required. It is for the learning of Boolean functions. The
explanation below is couched in terms of units outputting +1 or -1 rather than 0 or 1, but
can be recast if needed.
The first observation is that, in any correctly trained network, if two inputs
produce different outputs then those two inputs cannot produce identical sets of
values for the nodes in any hidden layer. If (say) the input (1, 1, -1, 1) is meant to output
a +1, and the input (1, -1, 1, 1) is meant to output a -1, then these two inputs cannot
both produce the same triple of values, say (1, -1, 1), on some three-node intermediate
layer. Instead, the representation of the input as a set of values on each intermediate
layer must be faithful, in the sense of preserving distinctions wherever the required outputs differ.
The idea is then as follows. Start with just a single output node, train it using
the perceptron learning algorithm to do as well as possible. If this doesn't completely
solve the problem, we need to add more units to this layer, because as yet the output
layer is not producing a faithful representation. Add a new output unit and train
it just on those inputs for which the original output was unfaithful. Now the two
output units may still not be generating a faithful representation, so keep on adding
more output units until a completely faithful representation has been generated,
probably using far too many output units. Each addition makes at least one more
input pattern be represented in a faithful way, so this stage must terminate. Now,
the output units produce a faithful representation of the inputs, and can be treated
as inputs to the next layer which gets generated in the same kind of way. This whole
idea will eventually produce a multi-layer net with a single final output (the ultimate
output layer having just one node in it) provided that the faithful representations
being produced during the construction of each layer are somehow getting simpler.
It is possible to choose weights between one layer and the first unit in the next
layer so that that first unit in the next layer classifies at least one more pattern
correctly. The proof goes like this. Consider figure 5.2. Suppose unit V1 classifies
[Figure 5.2: The tiling algorithm: proof. A new unit in layer L+1 receives weights w_0, w_1, w_2, ..., w_n from a threshold unit permanently set to 1 and from the units V_1, V_2, ..., V_n of layer L.]


only q of the p patterns correctly. Set w_1 = 1, and choose some pattern k that it fails to
classify correctly. Thus V_1^(k) = -d_k, where d_k is the desired output for that pattern.
Choose the other weights (i not equal to 1, including the threshold weight w_0) to be

w_i = gamma d_k V_i^(k)

where gamma is chosen to satisfy

1/n < gamma < 1/(n - 2)

Now pattern k is correctly classified, since

Sum_j w_j V_j^(k) = -d_k + gamma n d_k

and gamma n > 1. Also, all the patterns correctly classified by unit V_1 are still classified
correctly since, where l is any one of these correctly classified patterns:

Sum_j w_j V_j^(l) = V_1^(l) + gamma d_k Sum_{j not equal to 1} V_j^(l) V_j^(k)

The sum on the right-hand side cannot be -n because V_0^(l) = V_0^(k) = 1, and it cannot
be +n if level L is producing a faithful representation. Since gamma (n - 2) < 1, this means
that the second term on the right-hand side cannot outweigh the first, and so pattern
l is still classified correctly. This means that the first unit in layer L + 1 is classifying
at least one more pattern correctly. So the whole algorithm must terminate, because
eventually a layer will be constructed in which the first unit classifies all the inputs
correctly.

5.5  Observations about input data preparation

It is usually important to give a network the best chance of learning to solve a problem
by preprocessing the data suitably. Input fields which are binary-valued can be presented as 0 and 1, of course. Input fields which can take any one of (say) N values,
such as the day of the week, can be represented by N distinct inputs exactly one of
which is 1 and all the rest of which are 0. An input field drawn from some numeric
real-valued range can be represented by rescaling the range to be some suitable small
interval, perhaps 0 to 1. It is important not to present a network with real numbers
of large absolute magnitude: these can make the weighted sum of inputs to a node
be very large in absolute value, so that the derivative of the sigmoid will be virtually
0 and back-propagation will then take a very long time to alter the weights suitably.
If the set of inputs is a vector of real-valued components, a simple and natural thing to do is to normalise the vector. This will ensure that all the weighted
sums are suitably small, assuming that the weights are initialised to smallish values.
But remember that normalisation would reduce both (5, 5) and (50, 50) to the value
(0.7071, 0.7071), thus losing all distinction between them. If this is not what you
want, you can apply Z-axis normalisation instead: add one extra component, such

5.6 Applications

Page 5:11

as a constant 1, to each input vector (getting (5, 5, 1) and (50, 50, 1) in this example)
and normalise these instead.
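A Python sketch of two of these preprocessing ideas follows; the function names are illustrative.

import numpy as np

def one_of_n(value, categories):
    """e.g. one_of_n('Tue', days_of_week) -> [0, 1, 0, 0, 0, 0, 0]"""
    return [1 if c == value else 0 for c in categories]

def z_axis_normalise(vec):
    """Append a constant 1 and scale to unit length, so that (5, 5) and (50, 50)
    remain distinguishable after normalisation."""
    v = np.append(np.asarray(vec, dtype=float), 1.0)
    return v / np.linalg.norm(v)

print(z_axis_normalise([5, 5]))     # about [0.700, 0.700, 0.140]
print(z_axis_normalise([50, 50]))   # about [0.7070, 0.7070, 0.0141]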
Sometimes your data will have a few missing fields. A common way to handle
these is to add an extra input for each field that may be missing in some cases.
The extra input is usually 0, meaning that the associated input is valid, but can
be 1, meaning that the associated input is junk. An alternative approach is just
to supply some suitable value for a missing field, such as the median of all possible
values; use your common sense about this!
You may find it beneficial to apply more sophisticated data transformations as
well. For example, suppose the input data consists of points in the XY plane, and the
task is to predict which of two concentric circles any given point lies on. This task
is very hard if the data is given as Cartesian co-ordinates; it becomes trivial if the
data is transformed to polar co-ordinates. Consider a different example: suppose you
are trying to get a net to predict the height of sea-level at Craobh Haven in Argyll
given the time of day and the date. This is rather tricky, because the twice-daily tidal
surges come round both sides of the long island of Jura and reach Craobh Haven at
slightly different times. However, without any pre-processing the net will spend a
lot of time discovering what you already know, namely that there is a major cyclic
component of high and low tides. You should factor this out of the data, for example
by getting the net to predict instead how the sea-level at Craobh Haven differs from
the sea level at, say, Oban which the tidal surge reaches before it splits to pass round
either side of Jura and so is unaffected by Jura. In general, don't ask a net to find
out what you already know!
Suppose you wanted to create a net that took 100 binary inputs, representing
a digitised image of a hand-drawn digit. It is tempting to have 10 outputs, one
per possible digit. It is better to have 11 the extra one means the input is not
recognisable as a digit. Without such a category the net is obviously going to try to
classify any junk input as some digit. However, it becomes much harder to prepare
suitably representative training data.

5.6  Applications

5.6.1  NETtalk

NETtalk, by Sejnowski and Rosenberg [28], is perhaps the most frequently cited example of a neural net application. Its task was to learn how to pronounce English words,
or to be more precise how to map input text to phonemes; the phonemes selected by
the output of the net were then fed into a phoneme-to-sound converter derived from
the commercial DECtalk system. The input encoding was fairly straightforward. Using an alphabet of 29 characters (26 letters plus space and punctuation), the net was
presented with seven characters at a time (a central character and three on either
side for context) and this seven-character window was scanned over some written
text, to generate successive training examples with the aid of an English/phoneme


dictionary:
N:    What can t|he Mat|ter be?
N+1:  What can th|e mAtt|er be?
N+2:  What can the| maTter| be?

Thus there were 7 x 29 = 203 input units. There was a single hidden layer containing
80 units, and the output layer consisted of 26 units each representing some phoneme.
The network was trained on a text containing 1024 different words. After
10 epochs (complete passes through the training data), the network was producing
intelligible speech, and had achieved about 95% accuracy on the training data after
50 epochs. It was then tested on a different text (a continuation of the original
text, in fact), and was able to achieve about 78% accuracy. It was reported to sound
somewhat like a child learning to talk, as it progressed through the 50 training epochs;
first learning crude features such as the division between words, and then refining its
discrimination progressively. However, it would be unwise to read too much into this
observation.
Damaging the trained network by adding noise to weights or removing units produced degraded speech, but the degradation happened in a way that varied smoothly
with the extent of the damage. The commercial DECtalk system, based on hand-coded linguistic rules, produces considerably better speech than NETtalk and, unlike
NETtalk, does the complete text-to-speech task, and for a much wider vocabulary.
However, it is worth remembering that DECtalk is the product of nearly 10 years of
sustained effort, whereas NETtalk was trained from scratch in just a few hours, if you
don't count the research effort needed to produce the English/phoneme dictionary.
5.6.2  Playing backgammon

In the two-person game of backgammon, a player rolls two dice and then has a choice
of perhaps 20 moves consistent with the numbers shown. The game is a combination
of luck (the dice throws) and skill (selecting good moves while defending against the
other player's luck and skill, perhaps taking risks about the other's luck). Because of
the combinatorics, a conventional look-ahead approach is not very good as the basis
for a computer program to play the game. Position-based judgement and pattern
recognition are regarded as much more important aspects for good play.
Tesauro and Sejnowski [29] devised a neural net to play backgammon, training
it to select moves by outputting a score for a possible move presented at the input; a
conventional program generated possible moves and presented them to the network
for assessment. The training set consisted of just over 3000 board positions, each composed of current position, dice value, possible move, and some higher-level positional
assessments such as degree of trapping; each possible move for each such training example was given a score, on a scale of -100 (awful) to +100 (excellent). Some of
these scores were assigned by an expert player, for interesting moves, but most were
assigned at random from the scale -65 +- 35, the negative bias being deliberate as

5.6 Applications

Page 5:13

a counterbalance to the expert's tendency to choose good moves to hand-score. The


network itself consisted of 459 input units, two hidden layers each containing 24 hidden units, and a single continuous-valued output unit representing the score. It was
trained on the data produced as described, but with the expert's hand-scored choices
being presented rather more often than the random-scored choices. The training set
was modified somewhat during training to compensate for some obvious deficiencies
that showed up as training progressed.
Considering that there are of the order of 10^20 legal board positions in backgammon (so that considerable generalisation is needed from just 3000+ board positions),
the trained network performed very well, winning about 59% of the time against
a commercial backgammon program (Sun's gammontool). The higher-level features
turned out to be a necessary part of the input; without them, the network's score
dropped to around 41%. A later version of this network won the gold medal at the
computer olympiad in London in 1989, defeating all other computer backgammon
programs. But it still loses to a really good human expert player.
5.6.3  Hyphenation

In this application the task is to determine where a given long word may be hyphenated. This depends on whole syllables:
prop.a.ga.tion
pro.tect
pro.pa.gan.da
pro.pane
and is not a simple algorithmic task. Good algorithms, with exception dictionaries,
do exist for English but not as yet for languages such as German and Danish. In
a study by Brunak et al [3], words were presented using a six-letter window and a
26-character alphabet, thus using 6 x 26 = 156 input units. There were 20 units in
the single hidden layer, and a single output unit to indicate whether hyphenation
should be allowed at the middle of the six-letter window. After training on 17,228
long words the network was able to perform with 99% accuracy on test data.
5.6.4  Cytodiagnosis of cervical cancer

In 1920, G. N. Papanicolaou developed a test for the early detection of cervical cancer
by looking for abnormal cells in a mucus smear: the Pap smear test. However,
a single microscope slide with a mucus smear on it may contain over 100,000 cells,
only a very few of which may be abnormal. Inspecting such smears is thus a very
boring but expert task. In the commercial PAPNET system, a conventional vision
program identifies unusual cells, digitised cell images then being presented to a neural
net for its assessment of whether the cell is a malignant abnormality or not. A human
pathologist makes the final assessment of the few cases selected by the neural net; it
is reported that this system avoids virtually all false negative results.

5.6.5  Time series prediction

The prediction of the behaviour of time series is of great interest in weather forecasting,
economics, business planning and many other areas. Consider, for example, the
Mackey-Glass equation, a non-linear differential-difference equation derived when
trying to describe a certain physiological process:

dx(t)/dt = 0.2 x(t - tau) / (1 + x(t - tau)^10) - 0.1 x(t)

Because there is a delay involved, one would have to know the value of x(t) on a
continuous interval of length tau in order to be able to fully characterise the solution.
For larger values of tau the behaviour grows more chaotic, in the sense that arbitrarily
small variations of the initial conditions lead to arbitrarily large variations in the
long-term behaviour.
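For experimentation, the series is easy to generate by simple Euler integration; the following Python/numpy sketch (step size, sampling interval and initial history are arbitrary choices) also shows how it can be turned into training pairs with a few past values as inputs and the next value as target.

import numpy as np

def mackey_glass(n_steps, tau=17, dt=0.1, x0=1.2):
    delay = int(tau / dt)
    x = np.full(n_steps + delay, x0)                 # constant initial history
    for t in range(delay, n_steps + delay - 1):
        x_tau = x[t - delay]
        dx = 0.2 * x_tau / (1.0 + x_tau ** 10) - 0.1 * x[t]
        x[t + 1] = x[t] + dt * dx                    # Euler step
    return x[delay:]

series = mackey_glass(20000)[::60]                   # sample every 6 time units, say
X = np.array([series[i:i + 4] for i in range(len(series) - 4)])   # 4 past values as inputs
y = series[4:]                                       # the next value as target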
Lapedes and Farber [20] trained a neural net to predict future values of the
solution, investigating the particular cases tau = 17 and tau = 30. They used a net with
four inputs, two hidden layers of ten units each, and a single linear-activation output
unit (so that, as discussed at the end of the previous chapter, the network could output
arbitrarily large or small values). The network took huge amounts of computing power
to train, using a variant of line-searching gradient descent; but the trained network
outperformed other methods available for predicting the behaviour of the solution
of such an equation. More conventional methods have since been developed which
can match the performance of the trained network, but those conventional methods
require, if anything, even more computing power.
Other studies have shown that neural networks can do well at the task of
predicting the behaviour of noisy time series such as
x(t) = e(t) + 3 tanh(3x(t - 1)) + 3 sin(x(t - 2))

where e(t) is a normally distributed random variable, with variance 1. Even training
a net using only two past values (so two inputs) and eight units in a single hidden
layer provided much better prediction than the standard linear predictor methods.
5.6.6  Recognising handwritten digits

Le Cun et al [21] developed a neural network capable of very reasonable performance


on the task of recognising hand-written digits comprising ZIP codes (postal codes)
on US mail. The network input consisted of a 16 x 16 = 256 pixel image of a handwritten digit, rescaled to be a standard size. Their network had three hidden layers.
The first layer consisted of 768 units, logically organised into 12 groups of 8 x 8. Each
unit in a group had connections to a 5 x 5 square of the original image, and the 25
weights involved were the same for every unit in a group. The second hidden layer
consisted of 192 units, logically arranged into 12 groups of 4 x 4 units. As in the first
layer, each unit in a group looked at a 5 x 5 square in a group in the first layer, and


each unit in a group used the same set of weights. The third hidden layer consisted
of 30 units, fully connected to the second hidden layer. The output layer consisted of
10 units corresponding to the ten digits.
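The idea of a group of units sharing one set of 5 x 5 weights can be sketched in Python/numpy as follows; the stride and the resulting map size here are illustrative guesses rather than the exact arrangement used in [21].

import numpy as np

def shared_feature_map(image16, patch_weights, bias, stride=2):
    """image16: a 16x16 array; patch_weights: one 5x5 array shared by every unit in the group."""
    out = []
    for r in range(0, 16 - 5 + 1, stride):
        row = []
        for c in range(0, 16 - 5 + 1, stride):
            s = np.sum(image16[r:r + 5, c:c + 5] * patch_weights) + bias
            row.append(1.0 / (1.0 + np.exp(-s)))      # sigmoid activation
        out.append(row)
    return np.array(out)   # roughly a 6x6 map with these choices; the paper's groups are 8x8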
The network was trained on 7300 digits and tested on a further 2000, with an
error rate of around 1% on the training set and 5% on the test set. In about 12% of
cases the network expressed significant doubt about the result by producing significant
activation for two or more outputs with no clear winner; its error rate on the other
88% was a very creditable 1%. Further techniques used to prune unnecessary weights
from the network produced even better performance, with a reject rate of 9% rather
than 12%. This kind of performance is very encouraging, and experiments by postal
services are continuing.
5.6.7  Active sonar target classification

One of the earliest applications in this area was an experiment by Gorman and Sejnowski [6, 7], using a three-layer feed-forward network to try to distinguish between
sonar returns from rocks and sonar returns from metal cylinders, the rocks and cylinders being of a similar size and shape. The data was gathered from scans of the floor
of Chesapeake Bay, scanning various objects from various aspects at close range.
From an initial sample set of 1200 returns, 97 rock returns and 111 cylinder returns
were selected for training. The signal-to-noise ratio varied from 4 dB to 15 dB. Each
sonar return was processed into a one-dimensional spectral image containing 60 components, and this was fed into a network with 60 inputs, between 0 and 24 hidden
units and one output (rock or cylinder?). The results varied between 89% and 99.8%
correct on the training set, and 77.1% to 84.5% on an unseen test set. For comparison, a statistical nearest-neighbour technique could do no better than 82.7% on the
test set. In a contest between an experienced human sonar operator and the trained
network, the human scored 88% and the network scored 97%.
There have been several other studies of the utility of neural networks in sonar
classification. A fairly recent one by the Ontario Hydro Research Division tried to use
a neural net to classify seven types of underwater target: a ping-pong ball, bubbles,
a leaf, and four kinds of fish: walleye, sturgeon, brown bullhead and rainbow trout.
The sonar data consisted of returns from various aspects at close range (6-9 meters);
each return was processed into a 192-component spectral image. Forty-eight returns
per target type were collected, and thirty-six of each were used for training with the
other ten being reserved for testing. The feed-forward network used had 192 inputs,
between 10 and 30 hidden units and one output per target type. Best performance
was obtained with 25 hidden units (20 and 30 being poorer but as good as each other),
obtaining 91% correct classification on the test set. An experienced human managed
75%. In a second experiment, three neural nets were used: the first classified objects
as fish or not, the second classified those identified by the first as fish into fish types,
the third tried to determine the angular displacement of the fish from the sensor!
Accuracy improved to about 93%; this time 20 hidden units provided the best results.
5.6.8  Other applications

Neural networks have been applied to a remarkable range of tasks, in many cases
proving to be commercially worthwhile. To illustrate, here are just a few examples;
there are many more:
- scheduling and planning; for example, planning airline crew schedules.
- quality control; for example, in a parts inspection system, checking that disk-drive parts have been manufactured within specified tolerances.
- fault diagnosis; for example, a system used at Fort Detrick for diagnosing problems with a satellite communication system.
- signature verification; for example, a PC-based system with an accuracy of around 95% at distinguishing a real signature from a forgery of it.
- financial forecasting; for example, predicting the price of gas or of certain stocks in the near future.
- financial trading; for example, a system to analyse foreign currency trading.
- credit evaluation; for example, a system to vet applications for mortgages.
- medical diagnosis; for example, systems to diagnose causes of low back pain, of epilepsy, of dementia, of papulosquamous skin diseases and so on.
The Japanese have developed many practical applications [16]. For example,
Matsushita have used neural nets to control air conditioning via image processing.
Human-like objects are located in image data and features are computed for input to
a net. The net estimates the distance between the human and the sensor and whether
the person is sitting or standing, and adjusts the air conditioning controls to suit. Toshiba has a
neural net in a toaster, to control toasting of bread. Gas flow while cooking is sensed,
and the neural net estimates the temperature and area of the bread from the gas flow
pattern over time. A fuzzy system uses the nets information, plus temperature data,
to control the toaster.
Sanyo have a neural net system that learns, by back-propagation, the daily
usage pattern of a kerosene fan heater in a home taking account of time of day, season etc. It learns to switch on just before it is likely to be needed; this reduces
wastage considerably. Nikko Securities use a net-based convertible bond rating system. Fujitsu and Nippon Steel have applied neural net-based process control in steel
production, for example in continuous casting. Komatsu have developed a 3-layer
net for nondestructive weld testing. Sharp now sell a microwave oven (LogiCook)
that senses the water vapour produced by the cooking food and uses a neural net to


adjust the power level and cooking time, so that the user no longer needs to set these
by hand.
Exercise 5.8  Suggest how you might go about creating a neural net for signature recognition, or for fault diagnosis in (say) a photocopier.

Chapter 6
Recurrent networks
6.1  Introduction

So far, the networks considered have been feed-forward: input units send their values
to hidden units, hidden units send their values onward to further units and eventually
values arrive at output units. There are no feedback connections. This means that
a feed-forward network, once trained, is essentially static. The output for any given
input depends only on that input and not on the immediately preceding input too,
nor any earlier ones; the system has no memory of its use.
Therefore, although such networks may be very useful for many pattern recognition and classification tasks, they certainly cannot be the whole basis of anything
resembling an intelligent system. The next step is therefore to consider networks that
include some kind of feedback connections too: recurrent networks. This chapter
introduces a variety of such networks.
First: if any unit can be connected to any other, then there are no start and
end points to the network, no natural choices for what are to be input and output
units. Instead, some units are merely designated to be inputs, with unit i receiving
external input xi(t) at time t = 0, 1, 2, ...; for those units which receive no external
input, xi (t) = 0 for all time. Input units may also have connections coming from other
units. Similarly, some units are designated as providing output as well as receiving
connections from other units; unit j provides output aj (t) at time t, and as before
dj(t) will stand for the desired output of unit j at time t. Clearly dj(t) is only defined
for those units designated as output units.
If a network has feedback connections it may be possible for it to have some
sort of a memory of its past use. Thus, rather than just having a set of input/output
pairs for training purposes, it might be feasible to have sequences of input/output
pairs for training. For example, it might be possible to construct a network that
would only output a 1 when a certain input pattern was presented, and then only if
some other input pattern had previously been presented. That is, to get a 1 output

it would be necessary to present input vector Xa at time t1 , and then any number
of irrelevant input vectors, and then input vector Xb at a later time t2 ; the actual
times t1 and t2 being immaterial, providing only that t2 is later. The output value of
1 would appear just at t2 , or perhaps at t2 and some finite or infinite number of later
steps. Note, however, that it is not (yet) obvious that a network could be trained to
do such a task, or any other time-dependent job.
Suppose, for the moment, that it was possible to train a recurrent network to
do some time-based tasks. What would the training sequences look like, and how
would they be distinguished? Obviously each individual training sequence is just an
ordered set of input/output pairs, but what separates sequences? The commonest
answer is nothing at all; the network is just trained to output some function of a
fixed number of past inputs for example, a network with one input and one output
might be trained to output the XOR of x1(t) and x1(t - 1), or the even parity of
x1(t), ..., x1(t - 7). Clearly the successive inputs have to be carefully chosen so that
the network gets to see all that it is supposed to learn regularly and without undue
bias towards certain subsequences. Even if the network is being trained to compute
a function of some varying number of past inputs, as in the Xa -and-later-Xb problem
described above, there need not always be an identifiable separator between sequences;
the network just has to learn to reset when it has just seen Xb . Observe, however,
that for any problem such as the Xa -and-later-Xb task which allows an arbitrary
amount of time between significant input features, it is impossible to produce a truly
representative collection of training data. If the gap between Xa and Xb in your
training data is no larger than (say) 200 time steps then you are really training
the network to solve a more specialised problem: Xa -and-at-most-200-later-Xb. The
hope is that the network might make the right generalisation because the general
problem was easier to solve than the specific one. This is just another manifestation
of the classic philosophical problem of induction: how can the leap to an inductive
generalisation of data be justified when it is physically impossible to consider all
instances of that generalisation? As an aside, this problem is a lot less troublesome
than it used to be since it was discovered that the size of the smallest computer program
able to produce a given set of data is, in a clearly definable sense, essentially independent of the language in which it is expressed;
the best representation of a set of data might then be defined as the smallest
program which can output it all. This doesn't address the problem of the correctness
of any inductive generalisation, but does provide a metric as a basis for discussion
and comparison.
The next sections describe some specialised forms of recurrent network and how
they could be trained.

6.2  Jordan and Elman networks

Jordan [15] suggested a restricted form of recurrent network in 1986, in which there
was a conventional feed-forward network but also some extra context units alongside

6.2 Jordan and Elman networks

Page 6:3

the inputs, which also fed the first hidden layer. Context unit i would have, as its
value ci(t) at time t:

ci(t) = alpha ci(t - 1) + oi(t - 1)

where oi(t - 1) is the value of the i-th output unit at time t - 1, and 0 <= alpha <= 1 is some
constant governing the decay of a context unit's sense of its own past. The larger the
value of alpha is, the more the value of a context unit depends on past network outputs.
Figure 6.1 illustrates the idea. The network is trained by backpropagation or any of

[Figure 6.1: A stylised Jordan network. Input units and context units together feed the hidden layer, which feeds the output layer; the context units hold decaying copies of past output values.]


its variations; the context units are treated exactly like inputs for this purpose.
Elman [4] introduced a variation on this idea in 1990, in which the context
units were copies of the hidden units at the previous time step and there was no
decay term:
ci(t) = hi(t - 1)

where hi(t - 1) is the value of the i-th hidden unit at time t - 1. If there is more than
one hidden layer you can pick which layer to copy to the context units. The idea is
illustrated in figure 6.2. As before, the weights are trained by backpropagation or any
of its variations. The copying back of the hidden unit values provides the memory of
past inputs.
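One forward step of an Elman network is easy to express in code. Here is a minimal Python/numpy sketch; the weight-matrix names and shapes are assumptions for illustration, not rbp's internals.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elman_step(x, context, W_in, b_in, W_out, b_out):
    """One time step: the context units are a copy of the previous hidden values."""
    z = np.concatenate([x, context])          # real inputs plus context units
    hidden = sigmoid(W_in @ z + b_in)
    output = sigmoid(W_out @ hidden + b_out)
    return output, hidden                     # the new hidden values become the next context

# usage sketch: start with ctx = np.zeros(n_hidden), then at each time step
#   out, ctx = elman_step(x_t, ctx, W_in, b_in, W_out, b_out)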
Exercise 6.1  The rbp program allows you to experiment with simple Elman networks. Read the section in the first appendix, and try the example suggested there to get a 7-3-4 network to memorise the two sequences acb and bcd so that the network outputs the next term of whichever sequence is being input.

Exercise 6.2  Use rbp to get a 3-2-1 Elman network (that is, one real input
and two context units, plus a hidden layer of two units and a

Page 6:4

Recurrent networks

output layer

hidden layer

context units

input units

Figure 6.2: A stylised Elman network

single output) to learn to compute the XOR of the current and


previous inputs.
Exercise 6.3

6.3

Experiment with using rbp to get a 2-2-1 Jordan-type without


decay ( = 0) to compute the XOR of the current and previous
inputs.

Backpropagation through time

In 1969 Minsky and Papert [23] pointed out that a fully recurrent network could
be re-represented as a fairly conventional feed-forward network provided the training
sequences had a (small) maximum length. The idea is shown in figure 6.3; the right-hand
network is the left-hand one unfolded for three timesteps, and V1' is the unit
which represents V1 at the second timestep, and so on.
The unfolded network has one unusual constraint compared to an ordinary
feed-forward network: each image of each connection at each timestep must carry
the same weight. The simplest way to do this is somewhat ad-hoc but is reported to
work well: just train the unfolded network by backpropagation or whatever, but at
each cycle replace the weight on each image of a connection by the averaged weight
on all its images including itself.
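A minimal sketch of that weight-sharing step (Python/NumPy; the array shapes and
names are assumptions, not from the notes): after each training cycle, every copy of a
recurrent weight matrix in the unfolded network is replaced by the average over all its
copies.

    import numpy as np

    def tie_unfolded_weights(weight_copies):
        """weight_copies: list of T arrays, one copy of the same recurrent
        weight matrix per timestep of the unfolded network.  Replace each
        copy by the average over all copies, so they stay identical."""
        avg = np.mean(weight_copies, axis=0)
        return [avg.copy() for _ in weight_copies]

    # e.g. a 2-unit recurrent layer unfolded for 3 timesteps, with copies
    # that have drifted apart during a backpropagation cycle:
    copies = [np.array([[0.1, 0.2], [0.3, 0.4]]),
              np.array([[0.2, 0.1], [0.2, 0.5]]),
              np.array([[0.0, 0.3], [0.4, 0.3]])]
    copies = tie_unfolded_weights(copies)
    print(copies[0])        # all three copies are now identical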
The limitation of this technique to tasks with a small maximum sequence length
has meant that it has not been widely used. The maximum length must be fairly
small, otherwise backpropagation on the unfolded network will take a very long time
indeed.


    Figure 6.3: Unfolding a recurrent network for three steps

6.4 Recurrent backpropagation

It turns out to be possible to extend the idea of backpropagation to recurrent networks
[24]. As in chapter 4, let a_i be the output of unit i. Let x_i be the external input to
unit i, perhaps 0 for all time if i is not an input unit. Then the total input to unit i is

    s_i = Σ_j w_ij a_j + x_i                                           (6.1)

Let the network evolve according to the equation

    da_i/dt = -a_i + g(s_i)                                            (6.2)

so that if there is a suitable stable attractor of this dynamical system then at the fixed
points:

    a_i = g(s_i)                                                       (6.3)

The error function will be as before:

    E = (1/2) Σ_{k in output units} E_k^2                              (6.4)

where

    E_k = d_k - a_k   if k is an output unit
    E_k = 0           otherwise

The weight change for w_ij will be

    Δw_ij = -η ∂E/∂w_ij = η Σ_k E_k ∂a_k/∂w_ij                         (6.5)

where η is the learning rate, as in chapter 4.

Differentiating equation 6.3 gives

    ∂a_k/∂w_ij = g'(s_k) [ δ_ki a_j + Σ_m w_km ∂a_m/∂w_ij ]            (6.6)

so that, collecting up like terms:

    δ_ki g'(s_k) a_j = Σ_m L_km ∂a_m/∂w_ij                             (6.7)

where

    L_km = δ_km - g'(s_k) w_km                                         (6.8)

Inverting the linear equations in 6.7 gives

    ∂a_k/∂w_ij = (L^-1)_ki g'(s_i) a_j                                 (6.9)

so that, putting this into equation 6.5, the weight change for w_ij is:

    Δw_ij = η Σ_k E_k (L^-1)_ki g'(s_i) a_j                            (6.10)

Backpropagation for feed-forward networks is a special case of this, of course. As
it stands this seems to require the expensive and global computation of the matrix
inverse L^-1. However, this can be neatly avoided as follows. Let:

    Y_i = Σ_k E_k (L^-1)_ki                                            (6.11)

Undoing the inversion gives:

    Σ_j L_ji Y_j = E_i                                                 (6.12)

which, using the definition in equation 6.8, is:

    Y_i - Σ_j g'(s_j) w_ji Y_j = E_i                                   (6.13)

This is the fixed point of the analogous dynamical equation:

    dY_i/dt = -Y_i + ( Σ_j g'(s_j) w_ji Y_j + E_i )                    (6.14)

By direct analogy with equation 6.2 this can be solved by a new network with the
same topology as the previous network, using inputs E_i at what used to be output
nodes, using a linear activation function, and using reversed connections: instead
of w_ij from j to i, use g'(s_j) w_ji from i to j.
Thus the training procedure becomes:


(1) Run the original network until stable, to find the a_i's;

(2) Find the E_i's;

(3) Use these as inputs to the second network, and run that until stable to find the
Y_i's;

(4) Update the weights using 6.10 and 6.11.
However, it is important to remember that this can only work if there is a stable
attractor of the original network in the first place. This recurrent backpropagation
algorithm has been successfully applied to a number of interesting problems.
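The following is a minimal sketch of this procedure (Python/NumPy, not from the
notes); the tiny network, learning rate and iteration counts are illustrative assumptions.
It relaxes the forward equations to a fixed point, relaxes the error-propagation network,
and then applies the weight update of equations 6.10 and 6.11.

    import numpy as np

    rng = np.random.default_rng(1)
    N = 4                                   # units 0..3; unit 3 is the output
    W = 0.1 * rng.normal(size=(N, N))       # w[i, j]: weight to i from j
    x = np.array([0.5, -0.3, 0.0, 0.0])     # external inputs (0 for non-inputs)
    d = np.array([0.0, 0.0, 0.0, 1.0])      # target for the output unit
    is_out = np.array([False, False, False, True])

    g  = lambda s: 1.0 / (1.0 + np.exp(-s))       # activation
    gp = lambda s: g(s) * (1.0 - g(s))            # its derivative

    eta, dt = 0.5, 0.1
    for epoch in range(200):
        # (1) relax the original network: da/dt = -a + g(Wa + x)
        a = np.zeros(N)
        for _ in range(200):
            a += dt * (-a + g(W @ a + x))
        s = W @ a + x                       # input at the (approximate) fixed point
        # (2) errors at the output units
        E = np.where(is_out, d - a, 0.0)
        # (3) relax the error network: dY/dt = -Y + sum_j g'(s_j) w_ji Y_j + E
        Y = np.zeros(N)
        for _ in range(200):
            Y += dt * (-Y + W.T @ (gp(s) * Y) + E)
        # (4) weight update: delta w_ij = eta * Y_i * g'(s_i) * a_j
        W += eta * np.outer(Y * gp(s), a)

    print("output after training:", a[3])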

Chapter 7
Associative networks
The networks we have been studying up to now have, for the most part, been ones whose
purpose could be described as learning functions or choices. We shall now turn to
consider networks that function as associative memories, storing data in a way that
allows it to be retrieved with cues comprising part of the original data or a corrupted
version of the original data.

In this chapter, then, we shall examine the Hopfield Net, the Boltzmann machine
derived from it, and the associative net devised by Willshaw. Each of these
networks operates on inputs and outputs which are binary vectors: in other words,
each of the components in the vector can take one of two values. For Hopfield Nets
and Boltzmann machines it is convenient to make those values +1 and -1, while for
the Willshaw associative net it's more natural to use 1 and 0. Outputs of units which
take values in {0, 1} will be denoted by b or b_i (b for binary) while those which
take values in {+1, -1} will be denoted s or s_i (s for spin, because of a connection
with the physics of magnetic materials whose properties are explained by spinning
subatomic particles).

(This chapter was written by John Hallam, Feb 93; updated by him, Dec 94; extended
by Peter Ross, Sep 99.)

7.1 Hamming Distance

We'll begin this chapter with a few mathematical preliminaries on binary vectors.
These are for the most part simple consequences of the application of the ideas from
section 1.1.1 of the first chapter to vectors whose components are drawn from either
{0, 1} or {+1, -1}.

The major new idea is that of Hamming Distance: the natural way to assess
how similar two binary vectors are is to count how many components differ, and
this number is the Hamming distance between the vectors. The Hamming distance
has a straightforward relationship with the ordinary Euclidean distance:

For binary vectors, each place where b^(1) differs from b^(2) contributes one to
the squared Euclidean distance, and places where the vectors match contribute zero.
The Euclidean distance is thus the square root of the Hamming distance.
For spin vectors, each place where s^(1) differs from s^(2) contributes four to the
squared distance (since the components are ±1 and so differ by 2), and the Euclidean
distance is therefore twice the square root of the Hamming distance.

For spin vectors it's also interesting to consider the vector dot product as a
measure of overlap between the vectors. This can take integer values in the range
[-N, +N] where N is the dimension of the vector. Suppose that two vectors differ in
P places. Then the dot product, given by

    s^(1) . s^(2) = Σ_{i=1}^{N} s_i^(1) s_i^(2)

takes the value N - 2P, since each individual product of components is either +1
(same: N - P of them) or -1 (different: P of them). Of course, P is just the
Hamming distance.
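A small Python check of these relationships (not from the notes; the example vectors
are arbitrary):

    import numpy as np

    s1 = np.array([+1, -1, +1, +1, -1, +1])
    s2 = np.array([+1, +1, +1, -1, -1, +1])

    N = len(s1)
    hamming = int(np.sum(s1 != s2))            # number of differing components
    overlap = int(s1 @ s2)                     # dot product, in [-N, +N]
    euclid  = float(np.linalg.norm(s1 - s2))   # ordinary Euclidean distance

    print(hamming, overlap, N - 2 * hamming)   # overlap == N - 2P
    print(euclid, 2 * np.sqrt(hamming))        # distance == 2*sqrt(P) for spin vectors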

7.2 The Hopfield Net

Imagine a collection of N units, like simple perceptrons, each interconnected with all
the others and possessing a threshold. All the units are equivalent, in the sense that
the net is fully recurrent and no units are distinguished as inputs or outputs. Suppose
that each connection works both ways and carries a weight (the same weight for
traffic in either direction) which we'll write as w_ij (representing the weight to unit
i from unit j, though since the connections are symmetric, w_ij = w_ji). All units are
spin units, emitting either +1 or -1, and decide which on the basis of the weighted
sum of their inputs in a way which we'll come to in a moment.

These units constitute a Hopfield net (see e.g. [11, 12, 13]). The vector of
current outputs of the units, denoted by s, represents the current state of the network.
However, some of the units in the net may be emitting a value inconsistent with the
input weighted sum. Such units are permitted to change their output, resulting in a
new state s' of the net. Of course, the change in outputs affects the input weighted
sums of all units and some others may now change their outputs, and so on. The
upshot of this is that the net operates by passing through a sequence of states as
the outputs of units change to try to follow the values of the weighted sums: the net
executes a dynamical process.
Hopfield nets are interesting for a number of reasons:

They recall data in a dynamic way, allowing pattern completion or correction
to occur. For example, a network might be programmed (somehow) to recall
the pattern ---+++--- (we use + and - to denote components of the vectors which
are +1 and -1 respectively) but started with a corrupted version of that pattern
(such as -++++----). It can progressively correct the presented pattern, as a
result of the dynamic recall process, until it is equal to the programmed pattern.
(Table 7.1 shows examples of the sequences of states the net might pass through.
Note that the negative of the programmed pattern is also stable.)

             Sequence 1    Sequence 2
    Start    -++++----     +----++++
             ...           ...
    Stable   ---+++---     +++---+++

    Table 7.1: Two Examples of State Sequences in a Hopfield Net
The net can be used to solve optimization problems, such as the Travelling
Salesman problem, by appropriate programming. In that case its dynamical
evolution of state is exploited to find the minimum of some cost function.
The Hopfield net shows some similarity with a physical system called a spin
glass, which is a kind of magnetic material, and we can see certain kinds of phase
change in such systems which are mirrored in the behaviour of the Hopfield net.
Thus it offers a simple example to illustrate the kind of statistical analysis,
inspired by a physical model, that one can do on connectionist systems to
discover their qualitative properties.
(In some textbooks, the units in a Hopfield net are presented as having binary
outputs rather than spin outputs. We follow the latter convention because it makes
the presentation simpler. The two are formally equivalent, since the spin and binary
vectors are related by a linear transformation.)
7.2.1 Operation of Hopfield Nets

Let us now move to look at the details of the operation of a Hopfield Net. First,
we note that, as with perceptrons, we can streamline the equations by subsuming
the thresholds into the weights. To do this, suppose that unit i has a threshold θ_i.
Create a new unit, numbered 0, and connect it to all units so that unit i connects
with weight w_i0 = -θ_i. Now suppose that the output of unit 0 is frozen to +1, and
start all the sums for computing input from zero rather than one.

With that convention, we can write the Hopfield Net update process as follows.
Given the N fully interconnected units and the connection weights w_ij (i = 1..N,
j = 0..N), we


(1) Inject a pattern into the net, by setting the outputs of the N units from some
initial state vector s^(0). In other words, we set the output of unit i to the value
of the i-th component of s^(0).

(2) For each unit we compute the weighted sum of input which reaches it, that is,

    Σ_{j=0}^{N} w_ij s_j

and look at its sign. This determines the desired state of unit i as follows: if
the sign is positive, the desired state is +1; if it's negative, the desired state is
-1; and if the input is zero, the desired state is the same as the current state s_i.

(3) If the desired states of all units match their actual states, we have reached a
stable pattern, and we stop. The answer (the state the network has recalled)
is the stable state we have now reached.

(4) Otherwise we change the state of some of the units. There are two possible ways
of doing this: synchronously or asynchronously. For a synchronous update,
we change the outputs of all the units in the net to the desired values computed
at step (2). For asynchronous update, we choose one of the units in the net (at
random) and change just its state to the desired state.

(5) We go to step (2) and continue from there.
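A minimal sketch of this update loop (Python/NumPy, not from the notes); the weight
matrix here is arbitrary but symmetric with zero diagonal, and unit 0 is the frozen bias
unit described above.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 9                                         # units 1..N, plus bias unit 0
    W = rng.normal(size=(N + 1, N + 1))
    W = (W + W.T) / 2                             # symmetric weights
    np.fill_diagonal(W, 0.0)                      # no self-connections

    def hopfield_recall(W, s0, max_sweeps=100):
        """Asynchronous update from initial state s0 (entries +/-1)."""
        s = np.concatenate(([1], s0)).astype(float)   # s[0] frozen at +1
        for _ in range(max_sweeps * N):
            i = rng.integers(1, N + 1)                # pick a unit at random
            h = W[i] @ s                              # its weighted input
            s[i] = s[i] if h == 0 else np.sign(h)
            # stop once every unit already matches the sign of its input
            h_all = W[1:] @ s
            if np.all(np.where(h_all == 0, s[1:], np.sign(h_all)) == s[1:]):
                break
        return s[1:]

    start = np.array([-1, 1, 1, 1, 1, -1, -1, -1, -1])   # a corrupted pattern
    print(hopfield_recall(W, start))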
Exercise 7.1

Derive the equivalent procedure for a Hopfield net built of binary-valued units.

The procedure above raises some interesting questions. Will the net ever reach
a stable state? Will it only arrive at a stable state for certain input states? Can we
program the state to which the net evolves? How long will it take to reach the stable
state if one can be reached? We will now give some answers to these questions.

7.3 Energy functions in the Hopfield Net

A useful trick for many problems of the kind posed by the Hopfield Net is to think
in terms of energy. The analogy is with the everyday idea of energy: something
you run out of as a result of activity. Essentially, the idea is this: if we could find
a measure of energy such that the operation of the Hopfield Net always resulted
in the energy falling, we would have shown that the Net would always reach a stable
state, one at a local minimum of energy. We know that such a state exists since
there are only a finite number of states (2^N of them). The dynamics of the Hopfield
net then acts like a ball rolling downhill, with each transition in the network taking
it closer to some minimum.


Unfortunately, it's generally hard to find energy functions (also called Lyapunov
functions, after the mathematician who used them in the study of differential
equations) since they are derived by integrating the dynamics being considered. We
will guess an energy function and show that it works.

Consider

    E(s) = -(1/2) Σ_{i=0}^{N} Σ_{j=0}^{N} w_ij s_i s_j                 (7.1)

which takes a state s and calculates from it a real number E representing the energy
of that state. The factor 1/2 is the result of 20-20 hindsight! We could also write the
equation in matrix form as

    E = -(1/2) s^T W s

if we think of the state as a column vector and W represents the weight matrix. We
assume that the connection between a unit and itself has weight w_ii equal to zero, so
W has a zero diagonal.

Let us say that unit i is stable if its current state s_i is equal to the sign of its
input

    h_i = Σ_{j=0}^{N} w_ij s_j

or h_i is zero. Now to prove that the function in equation 7.1 is the energy we want,
we must show that making an unstable unit stable (by changing its output) results
in a state with a lower value of E.
Suppose that it's unit k whose output is changed. Then the only change in the
state of the net is that s_k has become -s_k. Call this new state s'. What is the change
in energy?

    ΔE = E(s') - E(s)
       = -(1/2) Σ_{i,j} w_ij s'_i s'_j + (1/2) Σ_{i,j} w_ij s_i s_j
       = -(1/2) Σ_{i,j} w_ij { s'_i s'_j - s_i s_j }

Now, only those terms in the sum which have either i = k or j = k are non-zero,
since for all other values of i we know that s'_i = s_i (i ≠ k). Thus the sum simplifies
to

    ΔE = -(1/2) [ Σ_{i≠k} w_ik (s'_i s'_k - s_i s_k) + Σ_{j≠k} w_kj (s'_k s'_j - s_k s_j) ]
       = -Σ_{i≠k} w_ik (s'_i s'_k - s_i s_k)
       = -Σ_{i≠k} w_ik s_i (s'_k - s_k)
       = -Δs_k Σ_{i≠k} w_ik s_i
       = -Δs_k . h_k

where Δs_k is s'_k - s_k, which equals -2s_k since we changed the state of unit k. Note
that we used the symmetry of the connections in the second simplification above, in
which we replaced subscript j by i and swapped the order of subscripts on w_ij to
make the two sums identical.

Finally, then, we have that the change in energy resulting from the update is

    ΔE = 2 s_k h_k.
This ΔE is always negative: the unit was unstable, so s_k has the opposite sign to h_k
and their product is less than zero (note that h_k is non-zero, since our definition of
stability guarantees that for an unstable unit). Thus the energy decreases when an
unstable unit's state is changed, E(s) is the energy function we want, and we can state
the result we have proved:

    For a Hopfield Net with symmetric weights and no self-connections, an
    energy function E(s) exists and the update algorithm for the net decreases
    the energy; therefore such a net, if updated asynchronously, will always
    reach a stable state.
It can be shown that if the Hopfield net converges to a stable state it does so
after at most N updates, where N is the number of units. Thus it is possible to
terminate the update after N units have been updated, and test whether a stable
state has been reached: if any unit is unstable at that point, the net will never settle,
but will oscillate in some cyclic pattern of states, if further updates are done.
Exercise 7.2

Why is the qualification 'if updated asynchronously' needed in the result stated above?
Construct an example with synchronous update that illustrates what might happen.

Exercise 7.3

What happens if we leave out the condition that w_ii is zero for all i? (This is done
in some versions of the Hopfield net.) Try to modify the proof above to take that
case into account. What happens if w_ii is positive? If it's negative?

Exercise 7.4

What might happen if the condition that the weight matrix be symmetric is relaxed?
Consider some simple examples...

7.3.1 Programming the net

Now we know that the net will evolve to a stable state from its starting configuration.
The next question is how do we program that stable state, since it is the pattern that
the net recalls.
Since the net, as it runs, performs gradient descent on its energy function,
one way of programming the net is to choose the energy function so that its minimum
is at the pattern to be recalled. We can then read off the weights from the energy
function by matching terms.
For example, suppose that the net is to be trained/programmed to recall the
pattern a^(1). Consider the energy defined by

    E^(1)(s) = -(1/(2N)) ( a^(1) . s )^2                               (7.2)

which measures the overlap between the current state and the trained pattern. If the
overlap is high, the energy will be large and negative; as the overlap decreases, until
the state is orthogonal to the trained pattern, the energy becomes less negative,
i.e. it increases. The operation of a net with this energy function should result in
the trained pattern, or its negative (which is also a global minimum of E^(1)(s)),
being recalled. Notice that the energy is minimised (becomes large but negative) by
both a^(1), the desired pattern, and -a^(1), its negative. In general, the negative of any
attractor is also an attractor because the energy is a quadratic function of the state
of the net.
Exercise 7.5

Show that in a net programmed with a single attractor pair, ±a, every possible initial
state moves to one or other of the two attractors when the net runs. Describe the
boundary between the basins of attraction of the two attractors (the set of initial
states that move to a given attractor forms its basin of attraction).

Unpacking the energy function by expanding the dot products allows us to read
off the weights:

    E^(1)(s) = -(1/(2N)) ( a^(1) . s )^2
             = -(1/(2N)) ( Σ_{i=1}^{N} a_i^(1) s_i ) ( Σ_{j=1}^{N} a_j^(1) s_j )
             = -(1/(2N)) Σ_{i,j} a_i^(1) a_j^(1) s_i s_j
             = -(1/2) Σ_{i,j} w_ij s_i s_j

The last line is the energy function from equation 7.1 that we used when proving that
the net would reach a stable state. By matching terms together we see that

    w_ij = (1/N) a_i^(1) a_j^(1).

The manipulations above can also be done using the vector notation:

    E^(1)(s) = -(1/(2N)) ( a^(1) . s )^2
             = -(1/(2N)) ( s . a^(1) ) ( a^(1) . s )
             = -(1/(2N)) s^T a^(1) (a^(1))^T s
             = -(1/2) s^T W s

and by comparing the last two lines we see that W = (1/N) a^(1) (a^(1))^T. (Note that
the outer product of a^(1) with itself is a square matrix.)
The foregoing shows how to program a single stable state, or attractor, into the
network, and illustrates an important principle: we can program the net by matching
its energy function to some desired energy function, in this case E^(1)(s). With this
technique we could use the network as a function minimisation tool for any energy
function which is at most quadratic in the components of the state. We'll return to
this point in a moment.

If the intention is to use the Hopfield net as an associative memory, able to
reconstruct complete patterns from partial or noisy ones, then it is obviously vital to
be able to program in more than one pattern. A memory with a single remembrance
is not likely to be of much use. This programming is done by the simple expedient of
taking energy functions like E^(1)() for each pattern to be programmed, and forming
a composite energy function to minimise by adding the individual ones together. The
rationale is that, provided the patterns to store are not closely inter-related, each
component of the composite energy should contribute a local minimum into which
the net can descend if started near to the corresponding remembered state. The
energy function for a collection of P patterns a^(p), for p = 1..P, is thus
    E^tot(s) = Σ_{p=1}^{P} E^(p)(s)

and the corresponding weight matrix is given by

    W = (1/N) Σ_{p=1}^{P} a^(p) (a^(p))^T.

Exercise 7.6

Derive the equivalent expression for w_ij for the case where P patterns are
simultaneously programmed.
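A small demonstration of this outer-product (Hebbian) programming rule and of
dynamic recall, in Python/NumPy (not part of the notes; the patterns and the corrupted
probe are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(2)
    N, P = 50, 3
    patterns = rng.choice([-1, 1], size=(P, N))       # P random spin patterns

    # Program the net: W = (1/N) * sum_p a_p a_p^T, with zero diagonal.
    W = sum(np.outer(a, a) for a in patterns) / N
    np.fill_diagonal(W, 0.0)

    def recall(W, probe, sweeps=20):
        s = probe.astype(float).copy()
        for _ in range(sweeps * len(s)):
            i = rng.integers(len(s))
            h = W[i] @ s
            if h != 0:
                s[i] = np.sign(h)
        return s

    # Corrupt 10% of the bits of pattern 0 and see whether it is restored.
    probe = patterns[0].copy()
    flip = rng.choice(N, size=N // 10, replace=False)
    probe[flip] *= -1
    restored = recall(W, probe)
    print("overlap with stored pattern:", int(restored @ patterns[0]), "out of", N)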


The method of programming in multiple patterns raises a number of questions.
Is it really true that the separate terms in the composite energy won't interfere with
each other? What happens if the local minima run together? Are there conditions
under which the local minima are no longer minima? And, perhaps most important,
how many patterns could we expect to program into a net of size N and achieve good
recall performance?

We shall look at the answers to some of those questions in section 7.3.3. To
conclude this section, we state in a little more detail the procedure for programming
a net by matching energy functions, unpacking the details of what happens to the
unit thresholds currently concealed in the weight matrices.
Suppose we have some function of N variables, f(x_1, ..., x_N), for which each x_i
can take only two values and no products of more than two x variables occur (i.e.
the function is a quadratic form in the x_i). To program a Hopfield net with such a
function:

(1) Write each x_i, using a suitable linear transformation, in terms of the output s_i
of unit i of the Hopfield net.

(2) Substitute the transformations into f(x_1, ..., x_N) so that it becomes a quadratic
form in the s_i.

(3) Sort out the terms in f() which are linear in some s_i. Make all such terms
quadratic by replacing s_i with s_i s_0, where s_0 is the invented unit with output
frozen at +1.

(4) Now f() consists of quadratic and constant terms. Ignore all the constant terms.
Read off the connection weight w_ij, by matching terms with equation 7.1, as
minus twice the coefficient of s_i s_j in the quadratic form f().
7.3.2 Searching and optimisation problems

Hopfield nets can be used, amongst other things, to solve optimisation and combinatorial
problems such as the Travelling Salesman problem. The net can't be programmed
to solve these problems using the desired attractors: that begs the question. Thus it
has to be programmed directly from the energy function of which we wish to find the
minimum, using the procedure outlined above.

To illustrate this, we'll consider using a Hopfield net to solve the Knight's
Tour problem. Consider a chessboard of side M, on which a chess knight can move
according to the normal rules of chess. Starting from any given square, what path
should the knight follow to visit every square on the board exactly once? Since
the knight can reach eight places from most squares, given a large enough board,
the search tree has a large branching factor and the problem is intractable for large
boards by naive methods.

We shall consider using a Hopfield net to find the tours by minimising a suitable
energy function. (Of course, good heuristic search methods can solve the problem well,

Page 7:10

Associative networks

1
..

2
..

..

..

Board 1

its

.. M

M*

un

Board 2
..

..

..

..

..

..

.. M 2

..
M*M-2
further
boards

1
..

2
..

..

..

.. M

..
..

Board N-1
M

..

..
..

M*

..

..

..

..

.. M 2

un

its

Board N

Figure 7.1: Representing the Knights Tour in a Hopfield net


but we have our connectionist hat on today. Incidentally, anyone seriously thinking
of using a neural net to solve a problem such as this, and intending to publish the
results, should compare the net with the best of the conventional methods!) First
we need to decide on the representation for the problem. We'll do the following.

Represent the board as a collection of N = M^2 units, one per square.

Represent a tour, which consists of at most N moves, as N boards, each with a
single active unit showing where the knight is at that point in the tour.

This gives us a very simple representation for the problem, shown in figure 7.1, and
using only M^4 units!

    Figure 7.1: Representing the Knight's Tour in a Hopfield net
Now to construct the energy. The constraints on the tour are:
(1) the knight can occupy only one place at a time, so only one unit per board may
be active in a legal tour;
(2) the knight should visit each place only once, so each board must have a different
unit active;
(3) the knight may make only legal moves, so coactive units in adjacent boards
should satisfy the chess rule for knight motion.
To convert this into an energy function, we could say that units breaking any constraint result in a positive contribution to the energy, while units meeting the third


constraint result in a negative contribution. A solution which meets all the constraints
should then have a negative (minimum) energy.

For simplicity, let's suppose that the units are binary rather than spin units.
We'll develop an energy for a binary state vector b and then convert it to a function
of s using the relation that

    b = (1 + s)/2.

Let's suppose that the units are numbered so that vector b consists of N sets of
N units, each set representing a board, with the first board being the first N units, the
second board (after the first move) being the next N units, and so on. This situation
is depicted in figure 7.1.

Our aim is to construct an energy function of the form

    E^kt(b) = (1/2) b^T B b

with the matrix B an N^2-by-N^2 matrix defining the energy of state b. Because of the
symmetry of the problem, we expect B to consist of a collection of N-by-N blocks
(N^2 of them) arranged as shown in figure 7.2. As the figure reveals, X encodes the
connections between units in the same board, Y encodes connections between units
in adjacent boards, and Z encodes connections between units in non-adjacent boards.
To determine B we just need to find X, Y and Z.

    Figure 7.2: The energy matrix for the Knight's Tour
X encodes the constraint that at most one unit per board be active at a time.
Thus for any two units b_i and b_j in a single board, the energy should be raised when
b_i and b_j are both active. The entry x_ij in the X matrix multiplies b_i b_j in the
energy function, so that entry should be positive. Thus X is a matrix with zero diagonal
(no self-coupling) and each off-diagonal element equal to the same positive number
(since all pairs of units within a board should be equally penalised).
Z encodes the constraint that the knight should visit each place only once.
Thus coactivity of corresponding units in non-adjacent boards should increase the
energy. Coactivity in any other units should have no effect on the energy. Because of


the unit numbering scheme we have chosen, corresponding units are associated with
diagonal entries in Z while non-corresponding units are represented in the off-diagonal
elements. Thus Z is a multiple of the identity matrix, with positive diagonal elements
and zeroes elsewhere.
Since we wish to find legal tours, such tours should have low energy, and we
arrange this by including a negative contribution to the energy whenever coactive
units in adjacent boards represent a legal single move. This determines some of
the entries in the Y matrices which record coupling between adjacent boards. The
remainder of the entries in the Y matrix are positive, since only legal moves should
be encouraged.
Without filling in all the details, I hope you now have the general idea: the B
matrix for the whole problem can be assembled out of the individual X, Y and Z
pieces; the energy function E^kt(b) can be converted to a function of s by substituting
for b in terms of s, and the result matched against the net energy from equation 7.1
to determine the weights and thresholds.
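To make the construction concrete, here is a sketch in Python/NumPy of assembling
B from the X, Y and Z blocks. This is not from the notes: the penalty and reward
magnitudes A, B_SAME and C are arbitrary assumptions, and no attempt is made here
to convert B to weights or to run the net.

    import numpy as np

    M = 5                       # board side
    N = M * M                   # units per board, and number of boards in the tour

    def knight_move(a, b):
        """True if squares a and b (flattened 0..N-1) are a knight's move apart."""
        (r1, c1), (r2, c2) = divmod(a, M), divmod(b, M)
        return sorted((abs(r1 - r2), abs(c1 - c2))) == [1, 2]

    A, B_SAME, C = 2.0, 2.0, 1.0        # assumed penalty/reward magnitudes

    # X: penalise two active units on the same board (zero diagonal).
    X = A * (np.ones((N, N)) - np.eye(N))

    # Z: penalise the same square being active on two non-adjacent boards.
    Z = B_SAME * np.eye(N)

    # Y: reward legal knight moves between adjacent boards, penalise the rest.
    Y = np.array([[-C if knight_move(i, j) else A for j in range(N)]
                  for i in range(N)])

    # Assemble the full (N*N) x (N*N) energy matrix out of the blocks.
    B = np.zeros((N * N, N * N))
    for p in range(N):
        for q in range(N):
            block = X if p == q else (Y if abs(p - q) == 1 else Z)
            B[p * N:(p + 1) * N, q * N:(q + 1) * N] = block

    # The energy of a binary state vector b (length N*N) is then 0.5 * b @ B @ b.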
7.3.3 Stability in the Hopfield net

The questions about the presence and stability of local minima which arose above,
when considering nets programmed with many patterns, are in general hard to answer,
for the answers depend on the particular patterns stored. As an example, consider
the case where the patterns are orthogonal (N must be even for this case to arise;
why?). For an orthogonal pair of patterns a^(p) and a^(q) the dot product between them
is zero, so E^(p)(a^(q)) vanishes. For any given one of the stored patterns, all terms but
one in the composite energy function vanish. The patterns are in a way independent
of each other.

It turns out we can store N such patterns in a network of size N, but the
resulting network is not very useful! In fact, its weight matrix is the identity matrix,
and every pattern is stable. Clearly this is an extreme case, and we would expect to
be able to store some number αN of patterns in the network, where α is less than
one.
Can we make any general statements about α? It turns out that we can,
provided we assume that the patterns being stored are random. The key question to
ask is then whether the attractors we have programmed are stable. That is, suppose
we set the initial state of the net to one of the programmed attractors: under what
circumstances would the net update rule move the state away from that starting
point? Clearly, if the starting point is an energy minimum all units will be stable and
the state won't change. So when does that happen?

Consider a net programmed with P patterns as described above, and choose
one of them, say a^(r), as the starting state. For the net in this state, the vector of
inputs to the units is given by
    h = W a^(r)
      = (1/N) Σ_{p=1}^{P} a^(p) (a^(p))^T a^(r)
      = (1/N) Σ_{p=1}^{P} a^(p) ( a^(p) . a^(r) )
      = (1/N) { a^(r) ( a^(r) . a^(r) ) + Σ_{p≠r} a^(p) ( a^(p) . a^(r) ) }

where we have taken the term depending on pattern r out of the sum over the patterns.
Now the first term in the braces is a^(r) times the square of its size, namely N, since
a^(r) is a spin vector. Using this gives us a simpler formula

    h = a^(r) + (1/N) Σ_{p≠r} a^(p) ( a^(p) . a^(r) )                  (7.3)

and, since the state which follows a^(r) is determined by the sign of h, we know from
this whether pattern r is initially stable. In fact, if pattern a^(r) is stable when first
presented to the net it must satisfy the following equation:

    a^(r) = sign{ a^(r) + (1/N) Σ_{p≠r} a^(p) ( a^(p) . a^(r) ) }

The first term in the braces is just the desired pattern; the second, more complex, term
is the crosstalk between the desired pattern and all the other patterns programmed
into the net.
Notice that the crosstalk term consists of a sum of amounts of the other patterns
in proportion to their overlap (as measured by the dot product) with a(r) . Thus orthogonal patterns are always initially stable, since their overlaps are zero by definition
and the crosstalk term vanishes.
We now have the basic answer to the question of initial stability. Pattern r
will be initially stable provided the crosstalk term is sufficiently small that it does not
change the sign of the expression in braces.
Consider the i-th component of pattern r. For this component, the term in
braces is given by

    a_i^(r) + (1/N) Σ_{p≠r} a_i^(p) ( a^(p) . a^(r) )

and the second, crosstalk, term can only change the sign of the expression if its
magnitude is greater than one (since it must overpower the first term) and its sign
is opposite to that of the first term. A little thought shows that this means that,
provided

    -a_i^(r) (1/N) Σ_{p≠r} a_i^(p) ( a^(p) . a^(r) )  <  1,

component i of pattern r will be initially stable.
We specified at the start that the attractors were random patterns, so the
components of the patterns are random values chosen from ±1. Products of such
values are also random values of the same kind. Now each term in the sum above, if
we were to multiply everything out, would consist of a product of four components
chosen from the attractors p (for each p ≠ r) and r. Thus each of the terms in the
sum above is made of random values and we can compute the distribution of values
expected for the sum. The probability that pattern r is initially unstable is then the
probability that the value of the sum exceeds one.

Each dot product in the sum is responsible for N of the terms in the expanded
sum, and there are P - 1 such products. Thus there are N(P - 1), or roughly NP,
random ±1 values in the sum. If we assume that +1 and -1 are equally likely,
probability theory tells us that, for large NP, 1/N times the sum of NP random ±1
values is normally distributed, has zero mean, and has variance P/N. The distribution
is pictured in figure 7.3 and the shaded area represents the probability that an
individual component of pattern r is initially unstable.

    Figure 7.3: Distribution of the sum in the initial stability criterion (a Gaussian
    with zero mean and variance P/N)
Further analysis, using an approximate expansion for the area under the normal
distribution curve, would allow us to derive the following two results for α:

(1) If we allow a 1% probability of an unstable bit in a single individual pattern,
the network capacity α (which is P/N) is at most 1/(2 log N). We'll call this the
'single pattern perfect' case.

(2) If we require there to be a chance of at most 1% that a single bit in the whole set
of P patterns be unstable, the capacity drops to 1/(4 log N). This more stringent
condition will be called the 'all patterns perfect' case.
This is the best we can do, to answer the questions of stability, without more
sophisticated tools. However, there is one further point to make here. The results
above apply only to initial stability of patterns. It may be that the few bits which are
unstable in a pattern when presented initially cause other bits to become unstable
(remember that recall in the Hopfield net is a dynamic process and we have just been
thinking about the very first step), and those upset other bits, and the result is an
avalanche away from the supposedly memorised attractor. In fact, it turns out that
if P/N > 0.138... this avalanche effect is likely, and the network will fail completely as
an associative memory. The static analysis above won't tell us this; it comes out of
the more sophisticated analysis we'll glimpse in the next section.
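A rough numerical illustration of these capacity limits (Python/NumPy, not from the
notes): it programs αN random patterns into a net of N units and counts how many
bits of each stored pattern are initially unstable.

    import numpy as np

    rng = np.random.default_rng(3)
    N = 200

    def fraction_unstable(alpha):
        P = max(1, int(alpha * N))
        a = rng.choice([-1, 1], size=(P, N))
        W = (a.T @ a) / N                      # (1/N) sum_p a_p a_p^T
        np.fill_diagonal(W, 0.0)
        h = a @ W.T                            # inputs when each pattern is presented
        return np.mean(np.sign(h) != a)        # bits whose sign would flip at once

    for alpha in (0.05, 0.10, 0.138, 0.20):
        print(alpha, fraction_unstable(alpha))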

7.4 The Boltzmann machine

One serious problem with the Hopfield net, particularly when used as an associative
memory, is that there are certain combinations of patterns it cannot correctly
remember. Suppose we try to program a network with 3 units to memorise the patterns
+-+, ++-, -++ and --- and require it to ignore the other four possible patterns on
three units. (This is just the two-bit parity problem, otherwise known as learning
exclusive-or.) It turns out that this gives a weight matrix proportional to the identity
matrix, and all the possible patterns are stable attractors of the net. The problem
is that the sum of the products a_i^(p) a_j^(p) over the set of patterns is zero for all
cases except i = j, as it is for the complete set of three-bit patterns.

The heart of the problem is that the Hopfield net energy function in equation 7.1
is quadratic in the states of the units, and so the net can only be programmed
with quadratic energy functions. Furthermore, a complete energy function is needed
for programming, since all the weights must be specified. When we program using
attractors, all the products a_i^(p) a_j^(p) must be known, and their sums over the
patterns determine the correlations between the activities of each pair of units i and j.
The upshot of this is that the Hopfield net can only be taught to memorise attractors
whose collective behaviour is completely determined by their second and first order
correlations; on this basis, for example, the set of N-bit patterns of even parity
and the set of all N-bit patterns are indistinguishable.

A second potentially serious problem arises with the Hopfield Net, particularly
when it's being used as an optimising tool: what about the local minima in the
energy function? It is really very unlikely that in the general case the net will find
the global minimum of energy from an arbitrary starting state. This problem affects
the associative memory function of the net too, since programming in more and more
attractors introduces local minima (mixtures of the programmed states) to which
the net can settle.


The Boltzmann machine is a simple generalisation of the Hopfield net which addresses
these problems. It tackles the problem of needing to specify all the correlation
products between the components of the attractors by allowing units to be designated
as 'hidden', in the sense that the user is not required to specify anything about
their states.

The problem of local minima is tackled by introducing noise. In general,
when a system might get stuck in local minima, one way to prevent that is to 'shake'
or 'kick' the system (add energy to it), the idea being that a suitably sized
kick will push the state out of a (relatively shallow) local minimum but not out of
a (relatively deep) global one. Thus a system kicked regularly should tend to end
up settling in a global minimum of energy.

Exercise 7.7

Use the Hopfield net simulator to convince yourself of the statement that the set of
N-bit even parity patterns are indistinguishable from the complete set of N-bit
patterns.

7.4.1 The update rule for the Boltzmann machine

The Boltzmann machine [9, 1, 10] applies the strategy of introducing noise in a
sophisticated way. It is based on a property of thermodynamic systems: such systems
distribute their time between states at different energies in a way that depends on
the energy and a physical parameter, the temperature. The hotter the system is,
the more time it will spend in high energy states; if the temperature is lowered, the
system loses energy and spends more time in lower energy states. However, high
energy states are not impossible for the cool system, just improbable.

This idea is incorporated into the Hopfield net by making the update rule
probabilistic. Instead of the units changing state to match the sign of their input,
they change state with a probability dependent on their input. Given an input h_i,
unit i changes state so that its output is +1 with probability

    P{s_i -> +1} = 1 / (1 + e^(-2h_i/T))

and -1 with probability

    P{s_i -> -1} = 1 / (1 + e^(+2h_i/T))

where T is the network 'temperature' and 2h_i is the difference in the energy
(equation 7.1) of the whole system between the states with s_i = -1 and s_i = +1. Each
unit applies this stochastic update rule independently, and asynchronously as for the
Hopfield net.
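A minimal sketch of this stochastic update rule, with a simple annealing schedule
(Python/NumPy; the weights, schedule and sweep counts are illustrative assumptions,
not from the notes):

    import numpy as np

    rng = np.random.default_rng(4)
    N = 20
    W = rng.normal(size=(N, N)); W = (W + W.T) / 2
    np.fill_diagonal(W, 0.0)

    def boltzmann_sweep(W, s, T):
        """One asynchronous sweep of the stochastic update rule at temperature T."""
        for _ in range(len(s)):
            i = rng.integers(len(s))
            h = W[i] @ s
            p_plus = 1.0 / (1.0 + np.exp(np.clip(-2.0 * h / T, -60, 60)))
            s[i] = 1 if rng.random() < p_plus else -1
        return s

    def energy(W, s):
        return -0.5 * s @ W @ s

    s = rng.choice([-1, 1], size=N).astype(float)
    for T in np.geomspace(5.0, 0.05, 40):        # simulated annealing: slowly cool
        for _ in range(10):                      # let the net approach equilibrium
            s = boltzmann_sweep(W, s, T)
    print("final energy:", energy(W, s))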
We can make a number of general comments about the update rule above.

First, the two probabilities add up to 1: the unit must output something.

Second, if T is very large, the two probabilities will be almost identical: a very
hot system will have states that are completely random, each unit having an
equal probability of outputting +1 or -1.

Third, as the temperature is reduced, it becomes increasingly hard for a unit
to make a change to higher energy. Consider: if the unit is currently receiving
input greater than zero, its output ought (in the Hopfield model) to be +1, and
this state has lower energy. The denominator of P{s_i -> +1} is the sum of 1
and a fraction less than one, so the probability is greater than 1/2. The smaller
T is, the larger will be the negative argument to the exponential and the closer
to 1 the denominator will come, making it more likely that the unit will output
a +1.

Fourth, in the limit, as T -> 0, the update rule reduces to the Hopfield net rule:
the units always choose the output that results in lower energy.
Exercise 7.8

Show that the two probabilities above sum to 1, as they ought.

7.4.2 Using and Training the Boltzmann machine

The Boltzmann machine, then, is essentially just a Hopfield net with the
temperature-dependent stochastic unit update rule defined above. The change raises a
number of questions: how do we use a Boltzmann machine, how can we characterise the
(fluctuating) state the net is in, and how can we train or program it?

First, how do we use the Boltzmann machine? We used the Hopfield net by
starting it from some desired state and allowing it to run according to its update rule
until it settled to one of its attractors. However, the Boltzmann machine with non-zero
temperature will never settle to a constant stable state, since the random choices
of outputs will cause the state to vary. The key idea here is the idea of dynamic
equilibrium: although the Boltzmann machine will not reach a static stable state,
it will reach an equilibrium where the probability of its being in a particular state
is a function of the energy of that state and the temperature, but independent of the
time since the network began running. For lower temperatures, lower energy states
are more probable.

This suggests one way of using the Boltzmann machine. We could start it with
a fairly high temperature, so that all the outputs fluctuate randomly, and reduce the
temperature slowly so that it remains in equilibrium at all times. As the temperature
falls, so the likelihood of the net adopting high energy states diminishes and gradually
the states the net can adopt will be confined to the valleys in the energy function.
Eventually the temperature will be low enough that (a) the system spends much of its
time at states near a global energy minimum, and (b) it has barely enough energy
to jump out of that minimum to neighbouring local minima. As the temperature falls
further, the net will be trapped in the global minimum, and reducing the temperature
to zero will freeze the state on to that attractor as it would in the Hopfield net.


This process of slowly reducing the temperature, allowing the network to reach
equilibrium at each temperature before moving on, is called simulated annealing, after
the physical process to which it corresponds. (Physical annealing, slow cooling over a
long period, is used in the working of metals and glasses to produce uniform crystalline
properties in the material.)
The second question, that of characterising the states of a Boltzmann machine,
is also fairly straightforward. Given that the net fluctuates between different states in
a way depending on temperature, we can characterise the behaviour by looking at the
average state. This amounts to running the machine for a long time at equilibrium
and computing the average output of each unit. It's necessary that the machine be
at equilibrium since that is defined as the situation in which the average activity of
any unit is independent of time. We shall denote this time-independent average state
using the notation ⟨s⟩.

Since the units are spin units, outputting ±1, their average state is a real
number in the interval [-1, +1]. A value near +1 signifies a unit whose output is
nearly constant: it spends almost all its time outputting +1, and similarly for
an average near -1. An average near 0, on the other hand, signifies a unit whose
output fluctuates at random: it spends about as much time at +1 as it does at -1.
We'll see in a later subsection that we can use this characterisation of the state of the
network to explore how the net behaviour varies as a function of temperature and,
with a much more sophisticated analysis (which we'll not present here), derive limits
on the capacity of Hopfield nets and Boltzmann machines.
This brings us to the third question: how is the Boltzmann machine to be
programmed or trained? Clearly, if we are using the net for optimisation and have a
complete (but complicated) quadratic energy function to minimise, we can program
the net by matching energy and run it using simulated annealing to find the minimum.
The problem arises when we have a net with hidden units, whose activity we do
not want to specify.

To train a Boltzmann machine with hidden units, we employ a rule to change
the weights as a function of the correlations between the activity of units when the
network is at equilibrium. Suppose that the units in the net are grouped into visible
(input and output) and hidden classes. Then we run the network to equilibrium
in two different ways: first, with the states of the visible units frozen; and second,
with all the units free. For each of these two cases we observe, by averaging over a
suitably long period of time, the correlations in activity between pairs of units in the
network. This gives us two sets of data for each pair of units i and j: the average
coactivity ⟨s_i s_j⟩_clamped with the visible units clamped, and the corresponding
measure ⟨s_i s_j⟩_free for the situation where the net is allowed to run free. The
weights are then adjusted using the rule

    Δw_ij = (η/T) { ⟨s_i s_j⟩_clamped - ⟨s_i s_j⟩_free }

where η is the learning rate, and the process is repeated until the weight adjustments
drop close to zero.


This procedure works, informally, by encouraging in the network coactivity of
the units consistent with the correct state of the visible units (good correlations when
clamped strengthen the weights connecting the correlated units) and by discouraging
the free, internally generated, accidental correlations (good correlations when free
weaken the interconnection of the units concerned). Both parts are essential. The
net must learn the imposed patterns and unlearn the internally generated disturbances
due to initial connection strengths and random variation.
To summarise the algorithm for training the Boltzmann machine, then, we have:

(1) Choose a desired state for the visible units and clamp them to that pattern.

(2) Run the net to equilibrium at some non-zero temperature, using simulated
annealing to speed up the process of reaching equilibrium.

(3) Compute the correlations ⟨s_i s_j⟩_clamped by averaging the products of the
states of pairs of units over a suitable period of time.

(4) Free the visible units.

(5) Run the net to equilibrium again at the same temperature.

(6) Compute the free correlations ⟨s_i s_j⟩_free by time-averaging.

(7) Update the weights using the correlations computed on this cycle. If the weight
changes are small enough, stop.

(8) Go back to step 1.
This procedure clearly takes an enormous amount of time, which limits the usefulness
of the Boltzmann machine in practice. However, despite the long training time, it
has been applied to statistical decision tasks (at which it is very good), to speech
recognition, and to certain kinds of vision problems.
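A compact sketch of this training loop (Python/NumPy, not from the notes; the
network size, temperature, learning rate and sampling lengths are all illustrative
assumptions, and the visible units are clamped to a single training pattern for brevity):

    import numpy as np

    rng = np.random.default_rng(5)
    n_vis, n_hid = 4, 2
    N = n_vis + n_hid                      # units 0..n_vis-1 visible, rest hidden
    W = 0.1 * rng.normal(size=(N, N)); W = (W + W.T) / 2
    np.fill_diagonal(W, 0.0)

    T, eta = 1.0, 0.05
    pattern = np.array([1, -1, 1, -1])     # one training pattern for the visible units

    def sweep(W, s, T, frozen):
        for _ in range(N):
            i = rng.integers(N)
            if frozen[i]:
                continue
            h = W[i] @ s
            p = 1.0 / (1.0 + np.exp(np.clip(-2.0 * h / T, -60, 60)))
            s[i] = 1.0 if rng.random() < p else -1.0
        return s

    def correlations(W, T, clamp):
        """Time-averaged <s_i s_j> with visible units optionally clamped."""
        s = rng.choice([-1.0, 1.0], size=N)
        frozen = np.zeros(N, dtype=bool)
        if clamp:
            s[:n_vis] = pattern
            frozen[:n_vis] = True
        for _ in range(50):                # settle towards equilibrium
            sweep(W, s, T, frozen)
        acc = np.zeros((N, N))
        for _ in range(200):               # then average coactivities
            sweep(W, s, T, frozen)
            acc += np.outer(s, s)
        return acc / 200

    for cycle in range(100):
        dW = (eta / T) * (correlations(W, T, True) - correlations(W, T, False))
        np.fill_diagonal(dW, 0.0)
        W += dW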
7.4.3 Analysing the Boltzmann machine

How will a system such as the Boltzmann machine behave in general? Previously,
we were able to view the network as a dynamical system moving in the direction of
(locally) minimum energy and say that the state would stop changing when the (local)
minimum was reached. Now, however, the state of the network fluctuates randomly
(though with the property that changes toward lower energy are more likely).
As we saw above, we can characterise the state of the network using the mean
output of the units. Let's pursue this idea a little further and investigate the kind of
behaviour we might expect at different temperatures from a Boltzmann machine.


Unit i can be in one of two states, ±1, with probabilities given above, so the
expected state of unit i is

    (+1) . P{s_i -> +1} + (-1) . P{s_i -> -1}

into which we substitute and simplify thus:

    ⟨s_i⟩ = (+1) . 1/(1 + e^(-2h_i/T)) + (-1) . 1/(1 + e^(+2h_i/T))
          = ( e^(h_i/T) - e^(-h_i/T) ) / ( e^(h_i/T) + e^(-h_i/T) )
          = tanh( h_i / T )

Recall that tanh x has the same shape as the sigmoid activation function g(x) used
in the back-propagation net, but is scaled and shifted: its output varies from -1 to
+1 and is zero for an input of 0. Thus we obtain, finally, the result that

    ⟨s_i⟩ = tanh( (1/T) Σ_{j=0}^{N} w_ij s_j ).                        (7.4)

To get any further, we need a way of relating the inputs of the units to the mean
outputs that we are using to characterise the net state. Let's therefore approximate,
and suppose that the average output of unit i is due to input derived from the
average outputs of all the other units in the net; in other words, we'll suppose,
quite reasonably, that the average output results from the average input, which is a
weighted sum of the average outputs. This is called a Mean Field Approximation, by
analogy with the similar method for treating statistical systems in physics.

Using this assumption, we can replace the sum over actual outputs of the units
in equation 7.4 with one over the average outputs and get the relationships

    ⟨s_i⟩ = tanh( (1/T) Σ_{j=0}^{N} w_ij ⟨s_j⟩ )

a set of N equations that interrelate the average behaviour of the units in the
Boltzmann machine. They are, of course, non-linear equations, because of the tanh
function.
Solving coupled non-linear equations is complex in general, but here we can
make another approximation and guess an answer. Suppose that the Boltzmann
machine has been programmed, by setting the connection strengths, with P attractor
patterns a^(r), for r = 1..P. If the net is in equilibrium and there aren't too many
attractors (i.e. P is not too big relative to N), we might expect that the average
state of the units would be proportional to one of the attractors we have programmed.

7.4 The Boltzmann machine

Page 7:21

In view of this, let's guess that the average state of the net will be proportional to
attractor r, that is

    ⟨s⟩ = m a^(r)

for some value of m in [-1, +1] and some attractor pattern a^(r). Then the Mean Field
equation above tells us that

    m a_i^(r) = tanh( (1/T) Σ_{j=0}^{N} w_ij m a_j^(r) )

when we substitute for ⟨s_i⟩ from our guess solution.

Since the weights are determined from the attractors, we can expand the argument
of the tanh function exactly as we did for deriving the crosstalk between
patterns, using equation 7.3, to get

    m a_i^(r) = tanh( (m/T) { a_i^(r) + (1/N) Σ_{p≠r} a_i^(p) ( a^(p) . a^(r) ) } ).

Since we are assuming that P is relatively small, the crosstalk term is negligible and
can be dropped. The resulting set of equations

    m a_i^(r) = tanh( (m/T) a_i^(r) )

reduce to a single equation, since tanh(-x) is equal to -tanh(x) and a_i^(r) is ±1 for
all i. That equation is

    m = tanh( m / T ).
What we have achieved in this equation is a way of determining m on the
assumption that the net is in equilibrium and that the equilibrium response is
proportional to (correlated with) one of the programmed attractors of the net. The
value of m tells us how correlated the equilibrium state is: if m is ±1, the net spends
all its time in the attractor state (no fluctuations occur), while if m is zero the net
state is not correlated with any attractor, and the unit outputs vary randomly. Clearly,
the former case can only occur for zero temperature, while the latter corresponds
to the behaviour of the net at high temperatures.

We can solve the equation m = tanh(m/T) for different values of T by plotting
the two graphs m = tanh x and m = Tx (where x = m/T) on the same axes:
intersections of the graphs, if any, will be solutions of the equation.

A set of such graphs is shown in figure 7.4. A number of things are obvious.
First, m = 0 is always a solution of the equation. It is not, however, a particularly
interesting solution since it means the net state is random. Second, provided T is
less than 1 the graphs intersect at two other places, giving non-zero values for m, and
in these cases the network can stabilise to an attractor or its negative. The larger m
is, the more firmly the net will stabilise. Third, and perhaps most interesting, is
the case when T ≥ 1. Now the only point where the graphs intersect is at the origin,
and only the trivial solution for m is possible. The Boltzmann machine state is
completely random.

    Figure 7.4: Solving m = tanh(m/T) graphically: m = tanh x plotted against the
    lines m = Tx for T = 0.4, 0.5 and 0.6
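The same equation can be solved numerically by fixed-point iteration; a small Python
check (not from the notes) that non-zero solutions appear only for T < 1:

    import numpy as np

    def solve_m(T, iters=500):
        """Iterate m <- tanh(m/T) from a non-zero starting point."""
        m = 0.9
        for _ in range(iters):
            m = np.tanh(m / T)
        return m

    for T in (0.4, 0.5, 0.6, 0.9, 1.1, 2.0):
        print(T, round(solve_m(T), 4))   # tends to 0 once T >= 1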
This kind of behaviour corresponds to that observed in magnetic materials.
They retain their magnetic properties only while the (physical) temperature is below
some critical value, above which all magnetic properties disappear.
The upshot of the above is that for temperatures greater than 1 in the Boltzmann
machine, only random states will be seen, while for temperatures less than one
the attractors of the net will be stable. It turns out that a more complex analysis
(too complex for these notes) reveals that programming more and more attractors
into the net is very like raising the temperature: at a certain moment, when α
is around 0.138, the net reaches a critical point. Pushing the number of attractors
stored beyond that limit results in a catastrophic loss of memory, just as raising the
temperature above 1 results in the disappearance of stable states based on the
attractors. The change in performance at the critical α is dramatic: the net goes from
correctly recalling 97% of the bits stored, on average, to correctly recalling none!
The general analysis, applied to both Hopfield nets and Boltzmann machines,
yields a phase diagram for the networks like that shown in figure 7.5. The horizontal
axis measures the used capacity α of the network, while the vertical axis records the
temperature. Within the diagram there are four regions: above T = 1 only random
states are found in the net; below T = 1 there are three regions, with boundaries
that depend on T and α, within which different standards of performance obtain. In
the innermost region (closest to the origin) the net functions as an effective memory,
with the programmed attractors being the global minima. In the next region out, the
programmed attractors are local minima, the global minima being states uncorrelated
with any programmed attractor; such states are called spin glass states and are
effectively stable random states. Finally, in the third region, the only stable states
are the spin glass states: programmed attractors are no longer minima.

    Figure 7.5: Phase diagram for the Hopfield Net/Boltzmann Machine, with regions
    for random states (T > 1), spin-glass-only minima, programmed attractors as
    local minima, and programmed attractors as global minima; the critical capacity
    is around α = 0.138. (From Amit et al.)
7.4.4 Summary

So far, we have looked at the Hopfield net as a dynamical system suitable for
associative recall, or for solving search problems by means of energy minimisation. We
saw how to program the net as a memory or optimisation tool using the energy function,
and discussed the system's capacity as an associative memory.

In the face of problems with Hopfield nets, we introduced the Boltzmann machine,
extending the Hopfield net by adding a stochastic update rule and the notion
of hidden units. Its training regime was explained, and its general properties as a
memory device explored using mean field theory.

In the next section we shall look at an alternative form of associative memory
which is more efficient in space and time than the Hopfield net or Boltzmann machine.


7.5 Willshaw's Associative Net

In the final section of this chapter, we'll look at a rather different kind of associative
memory: that proposed by Willshaw, Buneman and Longuet-Higgins [32, 33]. This
memory, designed to model the regular structure found in certain parts of the brain,
is structured as a matrix of binary synapses feeding binary output units whose
function is to threshold the sum of their input. The general structure of such a memory
is shown in figure 7.6. Aside from biological plausibility, the model is easy to analyse
and turns out to be a more efficient associative memory scheme than the Hopfield
Net.

    Figure 7.6: The associative net model of Willshaw, Buneman and Longuet-Higgins:
    N_in input lines (of which M_in are on) drive a matrix of binary excitatory synapses;
    each of the N_out output units (of which M_out are on) sums and thresholds its inputs.
7.5.1 The simple matrix memory model

In the simplest version of the model, (see figure 7.6), there are Nin input lines which
carry a binary (0-1) pattern with Min bits set. Each input pattern is to be associated
with an output comprising Nout bits of which Mout will be on. Note that the sizes of
the patterns can be different unlike the Hopfield net, the matrix memory can be
hetero-associative (i.e. associate different things) as well as auto-associative. Between
input and output is a matrix of binary excitatory synapses (weight 0 or 1 only).
Initially all synapses are off and they turn on during training.
In the more general case, which well consider briefly later, the matrix of binary
synapses between input and output may be incomplete: typically, we might suppose
that only a proportion of the outputs connect to each input; this is more realistic,
neurophysiologically speaking, than total connectivity, and makes only a quantitative
difference to the performance of the memory.
To train the net, we present each input pattern to the input lines and present
its corresponding output to the output lines. Then, at each synapse where both the input
and the output unit are active, we set the value of that synapse to be one. If the synapse
was already one, we do nothing; nor do we change any synapse other than those with
input and output one. Furthermore, synapses once activated never revert to the off
state. Thus, as we train in patterns, more and more synapses are turned on. Notice
that the training rule is a variant of Hebb's rule: synapses between coactive units are
turned on.
For recall, the input pattern is presented to the net and each output sums up
the contribution from all activated synapses connecting it to active lines in the input
pattern. This value is called the dendritic sum for the unit. The output units then
threshold their dendritic sums using Min as their threshold: if the dendritic sum is
strictly less than threshold they output 0, otherwise 1.
Clearly, any output unit which has been trained to be on for a particular input
pattern will receive exactly Min inputs, since all the synapses between it and the
active input lines will have been set. Thus there can be no false negatives: units
that should be on failing to be so. However, false positives are possible. It may be
that other patterns trained into the matrix result in all the synapses for one of the
patterns being activated. For that pattern, then, the unit will falsely output a 1.
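
The training and recall rules above are easy to express in code. The following is a minimal sketch (not part of the original notes; the pattern sizes and the number of stored pairs are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(0)

N_in, M_in = 200, 10     # input lines, and active bits per input pattern
N_out, M_out = 100, 5    # output lines, and active bits per output pattern
R = 50                   # number of pattern pairs to store

def random_pattern(n, m):
    """Binary vector of length n with exactly m bits set."""
    p = np.zeros(n, dtype=int)
    p[rng.choice(n, m, replace=False)] = 1
    return p

inputs = [random_pattern(N_in, M_in) for _ in range(R)]
outputs = [random_pattern(N_out, M_out) for _ in range(R)]

# Training: a synapse turns on whenever its input line and output unit are
# both active, and stays on thereafter (a clipped Hebbian rule).
W = np.zeros((N_out, N_in), dtype=int)
for x, y in zip(inputs, outputs):
    W |= np.outer(y, x)

# Recall: each output unit forms its dendritic sum and thresholds it at M_in.
def recall(x):
    return (W @ x >= M_in).astype(int)

# There are no false negatives; any errors counted here are false positives,
# which become more likely as more pairs are stored.
errors = sum(int(np.sum(recall(x) != y)) for x, y in zip(inputs, outputs))
print("spurious bits over all stored pairs:", errors)
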
Note that the above assumes that the net is presented with perfect input patterns. It is also possible to present the net with partial input patterns, where only
some of the input lines are used, or noisy input patterns, where the state of some
input lines differs from the true pattern's values. In the former case, the simple recall
strategy would still work, since the partial input pattern would still satisfy the condition that all active input lines hit modified synapses on the appropriate output units.
For noisy patterns, however, the thresholding strategy must be more sophisticated:
in general, the units that are supposed to be on will not receive a dendritic sum of
exactly Min because the noise will alter the pattern of bits. We'll mention briefly
what might be done for noisy patterns later on.
7.5.2 Estimating the memory capacity

This false positive phenomenon is what will set a limit to the capacity of the associative memory. If we store too many patterns, the number of modified (activated)
synapses will be large, and spurious positive responses will be very likely. How can
we analyse this?
Let us assume that the patterns are random, as we did for the Hopfield net.
This allows us to estimate the probability that a particular synapse is modified during
training. Assume that we train with R patterns. Then the probability of a synapse
being modified during training of a single pattern is just the probability that it is at
the intersection of two active lines. For the input, the probability of an active line is
Min /Nin and for the output it is Mout /Nout . Thus, the modification probability for a
single pattern trained is
\frac{M_{in} M_{out}}{N_{in} N_{out}} ,
provided we assume that the input and output patterns are statistically independent.


If the synapse is not modified by any of the R training steps, it remains inactive after
training. The probability that it is not modified during a single training step is one
minus the probability of its being modified, and each training step is independent of
previous ones, so the probability of surviving unchanged through the entire training
is


\left( 1 - \frac{M_{in} M_{out}}{N_{in} N_{out}} \right)^R .
Finally, a synapse that doesn't escape modification is modified, so the probability
that a synapse is activated by training is one minus that above; in other words
\rho = 1 - \left( 1 - \frac{M_{in} M_{out}}{N_{in} N_{out}} \right)^R
where ρ represents the probability of activation we are after.


Armed with this probability, we can calculate the chance that an output unit
fires falsely. There are Min ones in the input pattern and each has a probability ρ of
hitting an active synapse. The dendritic sum for a unit that should be off is therefore
binomially distributed with mean ρMin; the probability that the dendritic sum is Min
will be ρ^{Min}, which is small when ρ is small or Min is large.
Using this, we can now predict the capacity of the network given a definition
of "good recall"; as for the Hopfield net, capacity depends on the amount of error
in the output which can be tolerated. Let's suppose that "good recall" means that,
on average, one bit per pattern will be in error. This implies that
N_{out} \rho^{M_{in}} \approx 1
and we could calculate, for given parameters, the value of R.
As an example, consider a net with 8000 input lines of which 240 are on in any
pattern, and suppose there are 1024 output lines with 30 of them on in any pattern.
The desired value of ρ to meet our good recall criterion above is then
\rho = \left( \frac{1}{N_{out}} \right)^{1/M_{in}} \approx 0.9715
which can be inserted into the derivation for ρ above to give a capacity of about 4050
patterns.
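
As a rough numerical check of this estimate, the following sketch simply plugs the example figures into the expressions above, using the approximation (1 - x)^R ≈ e^{-Rx}; it is an illustration only, not part of the original notes:

import math

N_in, M_in = 8000, 240
N_out, M_out = 1024, 30

# Good-recall criterion: about one spurious output bit per pattern,
# i.e. N_out * rho**M_in is roughly 1.
rho = (1.0 / N_out) ** (1.0 / M_in)
print("required activation probability:", round(rho, 4))   # about 0.9715

# Invert rho = 1 - (1 - M_in*M_out/(N_in*N_out))**R for R.
x = (M_in * M_out) / (N_in * N_out)
R = -math.log(1.0 - rho) / x
print("estimated capacity R:", round(R))                    # about 4050
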
Exercise 7.9  Derive the figure of 4050 cited above. Use the approximation that
(1 - x)^R \approx e^{-Rx}
to simplify the calculation.

In fact, a simulation test shows that the net can learn and recall only about
3600 patterns and still meet our good recall criterion. Why should this be? It is,
after all, substantially lower than we predicted. The problem is that in the analysis
above we assumed that the probability of a synapse being modified was constant.
However, if more output patterns use a particular unit then its synapses will be more
likely to be modified than those of less popular units. The number of patterns that
use an output unit is called the unit usage, and it is clearly binomially distributed:
each of R patterns has a probability of Mout /Nout of using any given output unit.
Taking this into account complicates the analysis beyond the scope of these
notes; however, the qualitative effect is to broaden the distribution of dendritic sums,
making high values more likely, and to encourage spurious firing of more heavily used
output units.
7.5.3 Partially connected nets

The assumption that every input connects via a synapse to every output is an unrealistic one for plausible modelling of the brain. An important generalisation of the
model, therefore, is one in which not all synapses are present. Instead, of the possible
synapses, a proportion z are chosen at random to represent those actually present
and the others are ignored. Interestingly, this generalisation to reduced connection
density makes no serious qualitative difference to the performance of the network as
an associative memory.
Missing connections affect the training and recall procedures for the net. In
training, the problem is that synapses that ought to be modified may not exist. The
solution is not to worry about it and just activate synapses that do exist when their
input and output lines are coactive. In recall, the problem is more serious: we
can no longer use a simple thresholding strategy, testing for a dendritic sum of at
least Min , since some of the synapses that would have contributed to that sum will
probably be absent. This spreads the dendritic sums for all the output units. Where
previously all the units that ought to be on, for a given pattern, had dendritic sum
Min , the dendritic sum of these units is now binomial with mean Min z.
Four thresholding strategies are worth a mention here.
(1) Given unit usage and net parameters (Min , Nin , ..., z) calculate a threshold for a
fixed probability of spurious firing. This method needs to know the distribution
of dendritic sums but not the number of output bits that should be on.
(2) The progressive recall strategy (Gardner-Medwin) is a dynamic recall strategy.
It works for an auto-associative net. The idea is to keep feeding the pattern
through the net, taking the output from one cycle and using it as input for
the next, while raising the threshold to keep the probability of spurious firing
low as more bits turn on. This strategy of progressive pattern completion is
appropriate for partial inputs, but also works for noisy input since the net
cleans up the pattern somewhat with each recall cycle.
(3) We could calculate the dendritic sum distribution for known input activity and
then compute the threshold based on that. Various strategies are possible here,
for example to consider the fraction of input synapses that are active.
(4) If the input patterns are noise-corrupted, we could attempt to guess the
amount of noise, then threshold to minimise spurious outputs. (Buckingham,
1991). This strategy needs to know Mout and uses usage and activity information.
The general conclusions drawn from a study of a number of these thresholding
strategies (Buckingham, 1991) are that better strategies build in more knowledge; that all
except strategy 1 assume knowledge of Mout; and that, if the information is available, the
fourth strategy seems to work best.
7.5.4 Information efficiency

One useful technique for assessing the performance of associative (and other) memories is that of information efficiency. The idea is to calculate how much information
can be stored in the memory as a fraction of the amount of information it takes to
specify the memory.
As an example, consider the Hopfield net. A net with N units can store approximately 0.138N random patterns. Each pattern has 50% of its components +1 and
the other 50% are -1 on average. There are \binom{N}{N/2} such patterns, so the amount
of information each contains (the number of binary choices needed to determine it)
is given by
\log_2 \binom{N}{N/2}
which is approximately N for large N. (A better approximation can be calculated
using Stirling's formula to expand the binomial coefficient.) The net thus stores
0.138 N^2 bits of information.
On the other hand, the net is specified by the values of about N^2/2 connection
strengths, each of which takes one of about 2P different values for P trained patterns.
To represent a weight takes \log_2 2P bits, and the total information needed to specify
the net is thus about \frac{1}{2} N^2 \log_2 0.276N. The information efficiency is the ratio of stored
information to the information needed for specification, so is about
\frac{0.276}{\log_2 0.276N}
interestingly, it decreases as the net grows.
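
To see how quickly this efficiency falls, here is a small numerical sketch (not from the original notes; it simply evaluates the expression above for a few values of N):

import math

def hopfield_info_efficiency(N):
    """Ratio of the 0.138 * N**2 bits stored to the bits needed to specify
    the roughly N**2/2 weights, each with about 0.276 * N distinct values."""
    stored = 0.138 * N ** 2
    to_specify = 0.5 * N ** 2 * math.log2(0.276 * N)
    return stored / to_specify

for N in (100, 1000, 10000):
    print(N, round(hopfield_info_efficiency(N), 3))
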
Note that we assumed a criterion of recall in the analysis (we need one
to derive the net capacity) and thus the information efficiency depends on what
counts as good recall. For the Hopfield net we chose the criterion allowing the
largest number of patterns to be stored, namely that the net be operating within the
"good memory" region of its phase space. Recall that about 1.6% of the bits will be in
error at this extreme of performance. Still, the efficiency falls as we increase the net
size, and the Hopfield net is unattractive as a serious associative storage tool because
of that.
Interestingly enough, for the matrix memory described in this section, an analysis of information efficiency suggests two things. First, that the net is being best
used when the probability of synapse modification is one half, the information efficiency then being independent of the network size. Second, in order to achieve this
best use, the patterns should be coded sparsely with a number of bits on of the order
of the logarithm of the pattern size. Thus, in conclusion, the Willshaw, Buneman
and Longuet-Higgins matrix memory is a much more efficient storage device, for large
patterns, than the Hopfield net and, in fact, having large patterns with few bits on
is a considerable advantage for this type of architecture.
Exercise 7.10 Calculate the information efficiency of the example net above
with 8000 inputs etc.
Exercise 7.11 Derive a general formula for the information efficiency of the
matrix associative memory (quite hard) and derive the conclusions summarized above.

7.6 Kanerva's Sparse Distributed Memory

There is one more kind of associative memory that is briefly worth considering: Pentti
Kanerva's Sparse Distributed Memory (SDM) [25]. The idea is simple and ingenious.
Imagine that you wish to store items of data, each 1000 bits wide, in some huge
address space in which each address is also 1000 bits wide. If you were using real
computer memory then there would be 2^1000 locations. This 302-digit number is
larger than anybody's estimate of the number of atoms in the universe, and so is not
a practical proposition. However, SDM handles such addresses and data by having
only very few real addresses. To store something at one of the very large number of
addresses which don't then exist, the idea is to store some information about it at all of
those real addresses which are near enough to the non-existent address.
Each real address contains an array of counters, one per data bit. If an attempt
is made to write to a non-existent address and some other address A is suitably close
to it in Hamming distance, then each of the counters at A gets modified. If the k-th
bit of the data is 1 then the k-th counter is incremented; if the data bit is 0 then
the counter gets decremented. This happens for all counters at A, and for all other
addresses suitably close to the non-existent address. Thus if the sixth counter at A is,
at some moment in time, equal to 13 then that means that, on those occasions when
A was suitably close, the sixth bit of the datum was 1 on thirteen more occasions
than it was 0.


How is SDM read? To read from a non-existent address, find all those real
addresses which are suitably close in Hamming distance and look at their counters.
Sum the first counter of each such real address; if the sum is positive then this means
that it was more likely that the stored bit had been a 1 than a 0. If the sum is
negative it means that the stored bit was more likely to have been a 0. Do the same
for every bit/counter. Figure 7.7 shows an example of reading using 8-bit addresses,
where only ten of the possible 2^8 = 256 addresses have been chosen to exist, and
using 9-bit data. The chosen Hamming radius is 2 in this case: all real addresses no
more than 2 distant from the desired address get selected, which in this figure means
two real addresses got selected. The counters of these two real addresses are summed
column by column.

Desired address: 1 0 1 0 1 1 0 0        Radius: 2

Real addresses        dist   select?   Counters
1 1 0 1 0 0 0 1        6       n        4 -2  0  3  3 -6  1 -3  2
0 1 1 0 1 0 1 1        5       n        5  1 -4 -3  2  3 -1 -1  5
1 1 0 1 1 1 1 0        4       n       -2  2  1 -3 -1 -2  4  2  3
1 0 1 0 1 0 1 0        2       y       -2  1  3  2 -4  3  2  1 -1
0 0 1 1 0 1 0 1        4       n       -1 -1  2 -4 -5 -2  3  1 -2
0 1 1 1 0 0 1 1        7       n       -2 -6  2  4 -1 -1  0  2  4
1 1 0 1 1 1 0 0        3       n        3 -3 -1  2  4  1  1 -2  2
1 0 0 0 1 1 0 1        2       y        3 -3  4 -3 -1  2 -4  3 -1
0 0 1 1 0 0 0 1        5       n       -4  5  2 -3 -3  1  4 -4  3
0 1 1 0 1 1 1 0        3       n       -6 -3  4  2 -2 -2 -1 -3  2

Sum over selected rows:                  1 -2  7 -1 -5  5 -2  4 -2
Output data:                             1  0  1  0  0  1  0  1  0

(When writing rather than reading, the data to be stored would instead be applied to the counters of the selected rows.)

Figure 7.7: SDM: reading from non-existent address 10101100
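
The reading and writing rules are easy to state in code. Here is a minimal sketch of an SDM (not from the original notes; the address and data widths, the number of real addresses and the radius are illustrative choices, much smaller than the 1000-bit example discussed in the text):

import numpy as np

rng = np.random.default_rng(1)

ADDR_BITS, DATA_BITS = 64, 32    # illustrative sizes
N_REAL, RADIUS = 200, 28         # number of real addresses, Hamming radius

real_addresses = rng.integers(0, 2, size=(N_REAL, ADDR_BITS))
counters = np.zeros((N_REAL, DATA_BITS), dtype=int)

def selected(address):
    """Indices of real addresses within the Hamming radius of `address`."""
    dists = np.sum(real_addresses != address, axis=1)
    return np.where(dists <= RADIUS)[0]

def write(address, data):
    # Increment a counter where the data bit is 1, decrement where it is 0.
    for i in selected(address):
        counters[i] += np.where(data == 1, 1, -1)

def read(address):
    # Sum the counters of all selected addresses; positive sums read as 1.
    sums = counters[selected(address)].sum(axis=0)
    return (sums > 0).astype(int)

# Store one item, then read it back from a slightly corrupted address.
addr = rng.integers(0, 2, ADDR_BITS)
data = rng.integers(0, 2, DATA_BITS)
write(addr, data)
noisy = addr.copy()
noisy[:5] ^= 1                   # flip five address bits
print("recalled correctly:", np.array_equal(read(noisy), data))
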
Of course there is no real need to make a distinction between address and data.
Since SDM can handle very long addresses one can just make the data itself be the
address. Suppose you were using 1000-bit addresses. Given any two random addresses
their expected distance apart is 500. Some very tedious computation shows that,
given one address (say 000...000 for simplicity, although it doesn't matter), about
10.87% of all other addresses are within Hamming distance 480 of it, but only 0.08%
are within a distance of 450 of it. If you chose 480 as the Hamming radius for selection
purposes you would only need around 10 real addresses in order to make sure that at
least one would be selected whatever other address you chose. In practice you might
use 100 or 1000 real addresses, each chosen entirely at random, so that a reasonable
number of them are close to any specific non-existent address. As you decrease
the Hamming radius the number of real addresses required for adequate performance
goes up fast. However, there is some optimum tradeoff, assuming that the pattern of
non-existent addresses used in future storage and recall is essentially random, and it
can be shown that the capacity of the SDM under such circumstances is exponential
in the length of the address!
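
The "very tedious computation" mentioned above is just a binomial tail sum, which a few lines of code can reproduce exactly (a sketch, not part of the original notes):

from math import comb

def fraction_within(n_bits, radius):
    """Fraction of all n-bit addresses lying within the given Hamming
    distance of any fixed address."""
    return sum(comb(n_bits, k) for k in range(radius + 1)) / 2 ** n_bits

# Compare with the 10.87% and 0.08% figures quoted above.
print(100 * fraction_within(1000, 480))
print(100 * fraction_within(1000, 450))
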
This doesn't mean that it's a practical device, unfortunately. For realistic data
storage requirements you will need very long addresses indeed; and so far the idea
has not been put to any real use. However, both experiment and theory show that
the recall behaviour can be excellent: the SDM can return the correct data even if
given a substantially wrong address to read it from.
Exercise 7.12 An SDM might be used to clean up the noise inherent in astronomers' photographs of (say) a distant star. Explain how,
and speculate under what conditions it might work.
Finally, note that each real address could use a simple threshold unit, whose
threshold was the Hamming radius, to decide whether it was itself sufficiently close
to any desired address. Thus it would seem that an SDM could be implemented very
efficiently in hardware.

Chapter 8
Self-organizing networks
Most of the networks we have considered up to now have been trained by a supervised
learning procedure. However, in many cases there is no supervisor able to specify, for
each input, the right output; in such circumstances we need a network which is able
to configure itself without supervision in a manner appropriate to the inputs being
presented. Such networks are, in general, called self-organising, since the organisation
of the network is not imposed by the programmer but is devised by the network itself
as a reflection of the organisation in the input data presented to the net. The example
we shall consider in detail in this chapter, perhaps the best known, is the Kohonen
net.
Before we consider the technical details of that network, we'll spend a moment
looking at the motivation behind it. One of the many interesting observations about
the mammalian nervous system is the frequency with which topographic mappings are
found there. As an example, consider the visual system. The retina (the light-sensitive surface in the eye) is connected to the brain via the optic nerve. Neighbouring
parts of the retina connect to neighbouring parts of the lateral geniculate nucleus (a
relay station) and from there are connected to neighbouring parts of the visual cortex.
In creatures with good stereo vision, such as people, parts of the retinae in both eyes
that look at the same part of the visual field are mapped close together in the visual
cortex. In other words, the mappings between the (in this case) two-dimensional
structures are such that neighbourhoods are preserved.
Why should this be? It is not known in detail why the nervous system likes to
spread sensory stimuli out in the brain so that related stimuli map to related places,
but it is a scheme the nervous system employs often. More fascinating is the fact that
there is insufficient information coded in the genome to specify the connections: the
neighbourhood-preserving mapping must be set up by the nervous system in response
to the stimuli it receives from the world. In other words, topographic maps are derived
by self-organising processes in the nervous system.
Early work by Willshaw and von der Malsburg [33] suggested a number of
models whereby retinotopic maps (maps from the retina to optic centres in the brain)
could be built by self-organisation, and Kohonen proposed and popularised a very
general self-organising scheme based on similar ideas; see for example [17]. It is that
scheme we shall now look at in detail.

(This chapter was written by John Hallam, Mar 93, and extended by Peter Ross, Sep 95.)

Figure 8.1: Structure of a Kohonen net. [All M input lines go to each unit; the units form an array, typically 1- or 2-dimensional, and during training a winning unit and its neighbours in the array are updated.]

8.1 The Kohonen net

The structure of a Kohonen net is essentially simple. It comprises N units, each
connected to all of the M input lines and equipped with its own M-dimensional
weight vector wi. The units are arranged in some topologically regular pattern:
a line, rectangular grid, sphere or torus, for example. Figure 8.1 shows an example
structure. Each unit computes a function of the distance between the current input
and its weight vector; the unit whose weight vector is closest to the input is updated,
together with some of its neighbours, to increase its similarity to the input concerned.
This process is repeated until the net has organised itself to form a topographic map
of the input space.
The two key ideas in this procedure are competition and neighbourhood. The
units compete for each input vector and the winner of the competition is adjusted to
increase its advantage in that and similar situations later on. This results in similar
input vectors being clustered. The adjustment of the winner's neighbours makes it
more likely that similar inputs will be won by those neighbours, so similar inputs end
up mapped to neighbourhoods in the net.
The network therefore computes weights for the units such that a topographic
map is built between the organisation of the input space and that of the Kohonen
net. Similar inputs map close together and distinct inputs map further apart. In
addition, the number of units committed by the network to representing a region of
the input space depends on the frequency with which inputs come from that region:
parts of the input space well-represented in the training set receive larger numbers of
units in the map than those sparsely represented. This is illustrated in figure 8.2.

Figure 8.2: Topographic mapping in a Kohonen net. [A densely sampled region of the M-dimensional input space maps to a large area of the net, a sparsely sampled region maps to a small area, and neighbours largely stay neighbours.]

The description above of the training procedure is somewhat sketchy. In particular, we didn't define how neighbourhoods and adjustments were to be implemented.
Here is the complete description of the operation of the Kohonen network.
(1) Initialise the weights: the initial weights may be random values or values designed to cover the space of inputs (for example, uniformly distributed weights).
(2) Present an input x to the net.
(3) For each node, calculate the distance between the input and its weight vector
using the Euclidean distance formula
|x - wi| .
(4) Find the closest node: this global computation determines the winner of the
competition. (Kohonen argues that the winner could be found by means of a
local lateral inhibition process in which each unit inhibits others nearby.)


(5) Update the weights of the winner and nodes in a near neighbourhood of it using
the rule
wi(t + 1) = wi(t) + η(t) {x - wi(t)}
where η(t) is a learning rate or gain between zero and one and the formula is
applied to each unit i in the neighbourhood N(t). Neighbourhoods are typically
rectangular and consist of those nodes which lie within a certain distance of the
winner in the array of units. Note that both η and the neighbourhood vary
with time.
(6) Decrease the size of the neighbourhood and the learning gain a little.
(7) Unless the net is adequately trained (for example η is now very small) go back
to step (2).
The neighbourhood and learning rate start off large, so that each input presentation affects a large number of nodes and makes a large difference to their weights.
The net thus adjusts to the gross structure of the input space early on. As the neighbourhood and learning rate decrease, the changes made in the net are more localised
and the finer detail of the input space is learned.
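
A compact sketch of this training loop, for a one-dimensional array of units and two-dimensional inputs, is given below (not the course software; the sizes and parameter schedules are illustrative choices):

import numpy as np

rng = np.random.default_rng(0)

N_UNITS, M = 20, 2                            # units in a 1-D array, input dimension
weights = rng.uniform(0, 1, (N_UNITS, M))     # step (1): initialise the weights

gain, neighbourhood = 0.5, N_UNITS // 2
for step in range(2000):
    x = rng.uniform(0, 1, M)                          # step (2): present an input
    dists = np.linalg.norm(weights - x, axis=1)       # step (3): distances
    winner = int(np.argmin(dists))                    # step (4): find the winner
    lo = max(0, winner - neighbourhood)               # step (5): update the winner
    hi = min(N_UNITS, winner + neighbourhood + 1)     #          and its neighbours
    weights[lo:hi] += gain * (x - weights[lo:hi])
    gain *= 0.998                                     # step (6): shrink the gain...
    if step % 200 == 199 and neighbourhood > 0:       # ...and the neighbourhood
        neighbourhood -= 1
    # step (7): stop when the gain is very small (here, after a fixed budget)

print(np.round(weights, 2))   # the weights should now roughly cover the unit square
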
8.1.1 Normalised weight vectors

The weight vectors in the Kohonen net are often normalised to unit length. This
allows a somewhat cheaper computation when determining the winning unit, results
in a somewhat more expensive update, but gives a fairer competition. In effect the
network tries to adjust the direction of the weight vectors to line up with that of the
input, since their length is fixed.
The basic idea is that the distance between two vectors is minimal when the
two are aligned, so minimising the Euclidean distance is equivalent to maximising the
vector dot product. Consider the quantity
x · wi
for each unit: the input vector x is not dependent on i, since all the units receive
the same input, so the size of the dot product is proportional to the size of wi
and varies monotonically with the angle between the input and weight vector. If the weight
vectors are normalised, all the wi have the same size, so the unit with the maximum
dot product has weights best aligned with the input. (Note that the weights must
be normalised, or you could confuse a long poorly-aligned weight vector with a short
well-aligned one.)
Using the dot product makes a small saving over computing the distance, and
costs a little more on update since the weight vectors affected must be renormalised.
It is, however, quite commonly used.

8.1.2 Caveats for the Kohonen net

The use of Kohonen nets is not as straightforward as it might appear. Although the
net does much of the work in sorting out the organisation implicit in the input space,
problems can arise.
The first difficulty is in choosing the initial weights. These are often set at
random, but if the distribution of weights actually selected doesn't match well with
the organisation of the input space, the net will converge very slowly. One way round
this is to set the initial weights deliberately to reflect the structure of the input space if
that is known. For example, the unit weight vectors could be smoothly and uniformly
distributed across the input space.
A second problem is in deciding how to vary the neighbourhood and learning
rate to achieve best organisation. The answer really depends on the application. Note,
though, that the standard Kohonen training algorithm assumes that the training and
performance phases of the net are separate: first you train the net and, when that's
complete, then you use it to classify inputs. For some applications (for example
in robot control) it's more appropriate that the net continue learning and that it
generate classifications even though only partially trained.
A third problem concerns the presentations of inputs. The training procedure
above presupposes that inputs are presented fairly and at random; in other words,
there is no systematic bias in favour of one part of the input space against others.
Without this condition the net may not produce a topographic organisation but
rather a single clump of units all associated with the over-represented parts of the
input space. This problem too arises in robotic applications, where the world is
experienced sequentially and consecutive inputs are not randomly chosen from the
input space. It can be solved by batching up inputs and presenting them randomly
from those batches or, in some cases, by ignoring the problem!
8.1.3 Applications

Once the training is complete, the net may be used to classify input vectors according to the mapping it has learned and the units in the net can be labelled (by an
outside agent) with the names of the classes to which inputs belong. For example, in
the phonetic typewriter [18] a Kohonen network is trained to classify phonemes from
Finnish using sixteen input features (fifteen frequency-band and one volume measures). The network is trained with many such feature vectors derived from ordinary
speech. Then the network is presented with data for which the correct phoneme name
is known, and the units stimulated by that data are labelled with the phoneme names.
The net can then be used for classification of phonemes which are later processed by
rule-based methods embodying grammatical and other knowledge. Such a system
achieves between 80% and 90% correct classification of phonemes and the rule-based
component adds a further 2% to 7% to the success rate.
Amongst many other applications, the Kohonen net has also been used in
robotics for learning maps of the environment. Patterns of actions or sensor signals derived from the robot controller are used to construct input vectors for the net.
It learns to classify the recurring patterns in the inputs and can then be labelled with
location names for recognising where a robot is. In effect, the robot learns to recognise the places in its environment based on either how they appear (for sensor-based
systems) or how it feels to get there (for action-based systems).

8.2 LVQ: a supervised variation

A Kohonen network is an unsupervised learner: it is given no information about
the class into which each training input should be placed. Kohonen suggested a
supervised variant of the basic idea that produces a network which classifies; this
variant is called Learning Vector Quantization, or LVQ for short. As before there is
some M-dimensional input space, and each of the M inputs is connected to every
unit in the learning layer, and every unit has its own M-dimensional vector as before.
However, each unit is pre-assigned to belong to a class, and typically there are equal
numbers of units assigned to each class. For example if there are to be 5 classes there
might be 5 units per class and so 25 units in all in the learning layer.
The basic training regime is as follows. A training input vector is compared
with all the units' vectors, and the closest unit is the winner. If it is of the same
class as the input vector then the winner's vector is moved a little closer to that input
vector. If it is of a different class then it is moved away a little instead. This attract-if-same, repel-if-different process is the key to forming a topographic classificatory
map. However, you should be able to figure out that by itself this process is unlikely
to work. For example, suppose all the units assigned to class 1 happen to be way
out on their own, far from the other units. Then training instances from class 1 will
just push the other units away, and the units in class 1 won't be attracted until all
the others have been pushed away sufficiently for the units in class 1 to stand a chance of
winning occasionally.
To get round this, a version called LVQ1 introduces a conscience mechanism:
units which win too often get penalised. This is usually implemented by using a
distance bias: rather than using just a Euclidean or inner-product measure of distance
between a unit and the input vector, an extra term is added whose size is proportional
to the difference between the unit's win frequency and the average win frequency for
the class. This bias could grow very large, so it is also usual to make the constant
of proportionality get smaller with time. There are further variants, such as LVQ2
and Extended LVQ, which refine the basic idea still further but these notes will not
plague you with the details.
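
A sketch of a single training step of this basic kind is shown below (the conscience bias is omitted for brevity, and the function and argument names are invented for illustration):

import numpy as np

def lvq_step(units, unit_classes, x, x_class, gain=0.05):
    """Move the winning unit towards x if its class matches, away otherwise.

    units        : (n_units, M) array of unit weight vectors
    unit_classes : (n_units,) array of pre-assigned class labels
    x, x_class   : a training vector and its class label
    """
    winner = int(np.argmin(np.linalg.norm(units - x, axis=1)))
    direction = 1.0 if unit_classes[winner] == x_class else -1.0
    units[winner] += direction * gain * (x - units[winner])
    return winner
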
Exercise 8.1  It was said above that in LVQ, typically there are an equal number of units per class. Is this a good rule of thumb?

Chapter 9
Genetic Algorithms
This course is called Connectionist Computing rather than Neural Networks, because connectionism is somewhat wider in scope. A connectionist system is essentially
a dynamical system where the interactions between variables, nodes or whatever are
limited to a finite list of connections, and the connections can change strength or even
connectivity with time. Neural networks are by far the most studied of connectionist
systems, but there are several other kinds that have been extensively studied; for
example, cellular automata, classifier systems, and to a much lesser extent immune
and autocatalytic networks.
This chapter introduces genetic algorithms and classifier systems. Genetic algorithms are very important in the study of connectionist systems because they provide
a simple form of simulated evolution. As you will have seen if you tackled most of
the exercises in earlier chapters, neural networks can be very sensitive to the choice
of parameters, topology and so on. When you try to create a network to do some
specified task, you must either experiment or use your knowledge of the task. Both
approaches rely on you, as an external agent, to judge how well the network is doing.
However, there is no strong evidence to suggest that biological neural networks were
overtly designed by some conscious entity. Instead, the process of evolution seems to
have done the job blindly but amazingly well considering the functionality of existing
organisms. Given the very considerable difficulty of designing a neural network
or for that matter any non-trivial knowledge-based system it is worth considering
whether some kind of simulated evolution could be used.
A classifier system is a collection of rules, using a very stylised and simplistic
representation of rules, that evolves as the system runs to produce a collection of
rules that work well together (in a sense to be defined later). A genetic algorithm is
an essential part of a classifier system. It turns out that classifier systems and neural
networks are exactly equivalent, provided you take a reasonably general view of what
a neural network is; a classifier system can be represented by a neural network and
vice-versa. However, it is still important to learn about classifier systems because they
provide a much more concise way to express certain kinds of self-organising system.
After all, Lisp, Prolog and 6502 assembly language are equivalent but you would do
well to learn at least two (I wonder which two?) because they have such different
expressive powers.

(This chapter was written by Peter Ross, Feb 93; revised by him, Dec 93 and Sep 95.)

9.1 Introduction to genetic algorithms

There are many misconceptions about evolution. For example, it is often asserted that
evolution acting over geological time has led to ever-greater diversity and complexity
of species. But as Stephen Gould has convincingly argued in his book Wonderful Life,
the diversity of basic designs for organic life was far greater during the pre-Cambrian
period around 590 million years ago than it ever was before or since. Within a few
million years of that time most of that range of different designs died out; since then
just a very few basic designs have been embellished in many different ways. It is also
sometimes asserted that evolution produces better designs, or even optimal designs.
However, what evolution seems mainly to do is cause species to adapt to each other.
For example, if you are a giraffe you have a long neck because if your neck is too short
to reach your food you die; but the trees you eat are tall because if you eat the seeds
they don't get the chance to reproduce. This contest between giraffes and trees has
not produced mile-high trees or mile-high giraffes; at some point they both stopped,
perhaps because gravity and physics were the enemy of both. It is also important to
remember that we don't yet know very much about evolution; it has been happening
on a much longer time scale than the history of the human race.
Genetic algorithms are to evolution what neural networks are to biological neurons: an extremely crude and simplistic analogue of the real thing.

9.2 Basics of genetic algorithms

In human genetics the totality of genetic information is called the genotype; the information takes the abstract form of instructions for building something (an organism,
in fact). What it produces on being decoded (when the instructions are executed, so
to speak) is called the phenotype. The genetic information is parcelled into chunks
called chromosomes; these are composed of genes. Each gene has a locus, or position
within the chromosome, and can take any of a set of values; the possible values are
called alleles.
In the typical genetic algorithm there is a pool of potential solutions to a problem; each solution is somehow expressed as a fixed-length chromosome, often but
not always in the form of a simple bit-string. Each such solution may be good or
bad as a solution, and its quality can be measured numerically by means of some
problem-specific fitness function: given a chromosome, the fitness function returns a
single number indicating in some way how good that chromosome is as a solution to

the problem. Normally, the higher the number, the better that chromosome is
as a solution.
The algorithm operates somewhat as follows; many variations are possible.
(1) A pool of random chromosomes is created. This step is done just once.
(2) The fitness of each chromosome is calculated. If the pool has converged,
because all its members are of high fitness or because all its members are the
same now, then stop. There are many such choices of stopping criterion.
(3) Select (somehow) pairs of chromosomes to breed, producing children, in a manner described below.
(4) Perhaps mutate the children a little.
(5) Insert them into the pool somehow.
(6) Go to 2.
In order to see how this operates, consider the following trivial example taken
from Goldbergs excellent book Genetic Algorithms in Search, Optimization and Machine Learning (Addison-Wesley). Consider a GA using chromosomes each just 5 bits
long, each representing an integer x, using a fitness function f(x) = x^2:
No(i)   Chromosome    fi     fi / Σ fi
1       01101         169    0.1444
2       11000         576    0.4923
3       01000          64    0.0547
4       10011         361    0.3085

Table 9.1: A simple GA: at step 2 (from Goldberg 1989)


The sum of the fi is 1170, the average fitness is 293, and the largest is 576. Suppose
now that strings 1 and 2 breed two children, strings 3 and 4 breed two children, and the
children replace their parents. Breeding is done either by a process called crossover,
applied with probability pc, or by mere copying of the parents with probability (1 - pc).
Crossover comes in a variety of flavours. The simplest is single-point crossover : a
random point between the start and end of the chromosomes is selected. For the
first child, copy parent 1 up to that point and parent 2 afterward. For the second
child, copy parent 2 up to that point and parent 1 afterward. In the very widely
used two-point crossover, two random points are chosen. Copy one parent up to the first
point, the other parent between the points, and the first parent again to the end. In
uniform crossover each bit is copied from one of the two parents, but the choice of
which of the two parents is random. The unselected bit for the first child is used for
the second child. Obviously, whichever type of crossover is used, if the two parents
have the same bit value at a given position, then the children will have the same bit
value there too. When the children are created, mutation is applied. For example,
each bit of a child is flipped with low probability, say 0.001. Since there are 20 bits
in the whole pool, the probability that no bits at all will be flipped is still around
1 - 20 × 0.001 = 0.98; imagine, for the sake of clarity, that mutation therefore produced
no changes this time. If one-point crossover is used in the above example and a
vertical bar shows the crossover site chosen then the results might be as follows. Note
No(i)   Old chromosome   New chromosome    fi
1       0110|1           01100             144
2       1100|0           11001             625
3       11|000           11011             729
4       10|011           10000             256

Table 9.2: A simple GA: step 4


that the average value is now 439 and the largest is now 729; both have increased.
The absolute maximum is of course 31^2 = 961.
How are chromosomes chosen for breeding? There are two methods normally
employed. The first is called roulette-wheel selection in which chromosomes are chosen with a chance proportional to their relative fitness. The name comes from the
idea that a chromosome occupies a sector of a roulette wheel, whose angular size is
proportional to its fitness. This method is sensitive to the range of fitness values. If
fitnesses range from 1 to 100 then a highly fit chromosome has much greater chance
of being chosen than a low fitness one. If fitnesses range instead from 1000 to 1100
then all chromosomes are nearly equally likely to be chosen; merely adding a constant
of 1000 to the fitness function changes the process radically. The second method is
designed to avoid this problem: it is called rank-based selection. Chromosomes are
placed in rank order of fitness, and then selection depends upon ranking rather than
fitness. The selection can be simply roulette-wheel based on ranking, so that the
highest-ranked of N chromosomes gets N chances and the lowest-ranked gets 1, or
there can be an even stronger bias in favour of highly-ranked ones.
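
To make the mechanics concrete, here is a small sketch of this kind of GA (roulette-wheel selection, single-point crossover and bit-flip mutation) applied to the 5-bit x^2 problem above. It is an illustration only: being stochastic, it will not reproduce tables 9.1 and 9.2 exactly, and the parameter values are arbitrary.

import random

random.seed(0)
N_BITS, P_CROSS, P_MUT = 5, 0.9, 0.001

def fitness(chrom):
    return int(chrom, 2) ** 2          # f(x) = x squared

def roulette(pool):
    # Pick a chromosome with probability proportional to its fitness.
    total = sum(fitness(c) for c in pool)
    r = random.uniform(0, total)
    for c in pool:
        r -= fitness(c)
        if r <= 0:
            return c
    return pool[-1]

def crossover(p1, p2):
    if random.random() > P_CROSS:
        return p1, p2                  # mere copying of the parents
    point = random.randint(1, N_BITS - 1)
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def mutate(chrom):
    return "".join(b if random.random() > P_MUT else "10"[int(b)] for b in chrom)

pool = ["01101", "11000", "01000", "10011"]
for cycle in range(10):
    children = []
    while len(children) < len(pool):
        c1, c2 = crossover(roulette(pool), roulette(pool))
        children += [mutate(c1), mutate(c2)]
    pool = children
    print(cycle, pool, max(fitness(c) for c in pool))
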
Exercise 9.1  Continue the above example by hand, using roulette-wheel selection. Can this trivial GA with too small a population size (that is, so few chromosomes) find the absolute maximum? If so, estimate the number of cycles needed, very roughly.

9.3 Other variations

In generation-based GAs, the entire population of chromosomes gets replaced by children at each cycle. Pairs of parents are repeatedly chosen according to the chosen
selection method, producing pairs of children. More usually, one pair of parents is
selected at each cycle and their children either replace the least fit members or the
pool grows by two. In an elitist generation-based GA, the fittest member from the
previous cycle is copied into the new generation. Some systems use adaptive mutation, in which the mutation rate is adjusted according to how similar the parents are
to each other: if two parents are very similar then the mutation rate is made to
be high. Some systems, besides using crossover and mutation, also use an inversion
operator: two points in a chromosome are selected at random and that portion of the
chromosome between them is replaced by its reversal. For example, 11000 would
become 10010. In a parallel GA there might be several populations which evolve
separately, except that every so often the fittest member of one population migrates
(that is, is copied) into all the other populations.
There are few rules of thumb to guide you in making any of these choices when
designing a GA; often you have to resort to experiment. Nevertheless, GAs often
work amazingly well.

9.4 Why does it work? The schema theorem

Consider the example above. Any chromosome of the form 1#### (where a # is a
wildcard standing for 0 or 1) is going to have a fitness of at least 256, which is greater
than the fitness of anything of the form 0####. Thus there is a much higher chance,
using either kind of selection rule, that a parent of the form 1#### will be chosen.
If two such parents are chosen, the children will both be of the form 1####. If only
one such parent is chosen then at least one of the children will be of that form. Thus
in time the whole population will be of this form, except perhaps for the occasional
throwback briefly introduced by mutation. It should therefore be clear that unless
the mutation rate is so high as to destroy the gains created by crossover, the average
and greatest fitnesses will tend to increase. Mutation is still valuable because it
provides a way to reintroduce gene sequences that have been altogether lost from the
population as an accidental side-effect of crossover or inversion.
To try to formalise the ideas, introduce the notion of a schema: a pattern
consisting of a string made up of 0s, 1s and #s, of the same length as a chromosome.
In a large pool of chromosomes each of length 5, there may be many instances of the
schema 1###1 but any such instance stands a very good chance of being destroyed by
crossover. On the other hand any instance of the schema 01### could be destroyed
only if a crossover point falls between the 0 and the 1. This suggests that the notion
of defining length δ(H) of a schema H is important: the defining length is the distance
between the first non-wildcard and the last non-wildcard. For example, #1#0# has
defining length 2 (not 3, note; it is useful to call it 2 because there are just two places
where a damaging crossover point might fall). Also, the order o(H) of a schema H is
defined to be the number of non-wildcards: #1#0# has order 2.
The idea of a schema is valuable because you can think of the GA as combining
schemas experimentally in order to find the fittest possible combination. The GA


could then be thought of as getting its remarkable efficacy from the fact that any one
chromosome is an instance of a very large number of schemas. For example, 01100 is
an instance of 0####, #1###, . . . and of 01###, 0#1##, #11##, . . . and so
on. At each of N positions of a given chromosome you can have the actual bit value
or a wildcard, so a chromosome of length N is an instance of 2^N different schemas.
This idea is an over-simplification, but it is a valuable one. Consider a particular
schema, call it H. Suppose that at cycle t there are
m(H, t)
instances of H in the population. Any single instance having fitness fi will be chosen
by roulette-wheel selection with probability
f_i / \sum_j f_j

If such an instance is chosen, and the defining length of schema H is called δ(H), then a
randomly-chosen crossover point will fall within the schema with probability
δ(H)/(N - 1)
where N is the length of the chromosome as before. Crossover may not damage the
schema if the other parent happens to have a suitable form, but in the worst case
crossover will destroy H. Thus if crossover is applied with probability pc , the chance
that the schema H will survive crossover is at least
1 - pc δ(H)/(N - 1)

Also, if mutation is applied to each bit position with very small probability pm ,
then each bit is unchanged with probability (1 - pm) and the whole schema survives
mutation with probability
(1 - pm)^{o(H)} ≈ 1 - o(H) pm
Thus if f(H) is the average fitness of all instances of schema H in the current population and f̄ is the average fitness of the whole population, it should be clear that

m(H, t+1) \ge m(H, t) \, \frac{f(H)}{\bar{f}} \left( 1 - p_c \frac{\delta(H)}{N-1} - o(H) p_m \right)

Thus if o(H) and δ(H) are small and f(H) > f̄, the proportion of instances of H in
the population should increase geometrically. This is called the schema theorem.
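
These quantities are easy to compute. The sketch below evaluates the worst-case survival factor from the bound above for a few schemas (the parameter values are illustrative, not from the notes):

def defining_length(schema):
    """Distance between the first and last non-wildcard positions."""
    fixed = [i for i, ch in enumerate(schema) if ch != "#"]
    return fixed[-1] - fixed[0]

def order(schema):
    """Number of non-wildcard positions."""
    return sum(ch != "#" for ch in schema)

def survival_factor(schema, p_cross=0.9, p_mut=0.001):
    """Lower bound on the chance the schema survives crossover and mutation."""
    n = len(schema)
    return 1 - p_cross * defining_length(schema) / (n - 1) - order(schema) * p_mut

for s in ("1###1", "01###", "#1#0#"):
    print(s, defining_length(s), order(s), round(survival_factor(s), 3))

As expected from the discussion above, 1###1 is very likely to be disrupted by crossover while 01### is comparatively safe.
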
However, this theorem is almost useless. For one thing, schemas interfere with
each other. Because of this, the value of f (H) in the current population may differ
significantly from the value of f (H) in the next; there may be more instances of H
but their average fitness may get much lower. Also, if instances of H have average
f (H) but a very high variance of fitness, then the fittest will get very much more
chance to reproduce. However, the theorem does offer a suggestive account of why
GAs work in many cases. It is tempting to guess that f (H) is approximated by the
average fitness of H taken over all possible instances of H rather than the instances
in the current population, but this is certainly not always true.

9.5 Deceptive problems

In many cases it is easy to understand why the GA is working, because in such cases
the fitness of a chromosome is attributable to separate parts of the chromosome. If
a part has one value the fitness is lower than if it has another value. For example,
a GA finds it easy to solve the trivial problem of maximising the number of 1s in
a chromosome using as fitness just the numbers of 1s. Every additional bit set just
increases the fitness.
However, this is not always the case. In some problems good-looking parts
of an answer do not combine to provide the best answer. Instead the best answer
is a combination of unlikely looking parts. It is possible to construct such deceptive
problems. For instance, consider this exceptionally trivial problem using chromosomes
of length 2 and the following fitness values. Thus the average fitness of schema 0#

Chromosome   Fitness
00            8
01            6
10            2
11           10

Table 9.3: A trivial but deceptive problem

is (8 + 6)/2 = 7 and the average fitness of schema 1# is (10 + 2)/2 = 6. Thus in


a competition between instances of these two schemas, the fittest chromosome of all
is an instance of the less fit of these two schemas. Thus it is unclear whether the
GA will find the fittest chromosome it will depend on the initial population and
parameter such as the crossover and mutation rate. Experiment bears this out.
It is possible to construct much more deceptive problems than this, for example
by creating a slightly tougher version of the above problem using 3 bits instead of
two, and then replicating it 10 times to get a problem with 30-bit chromosomes. If
the sub-problems are interleaved within the chromosome as follows:
abcdefghijabcdefghijabcdefghij


such that the three a bits define the first copy, the three b bits define the second
copy, and so on, the problem becomes surprisingly hard for a GA to solve because
the schemas for each sub-problem now have a substantial defining length.
However, it is still possible for a GA to solve such problems, despite what the
naive interpretation of the schema theorem would suggest. For consider the schemas
from some problem shown in table 9.4. We cannot predict how these two competitions

     Schema      % of population
A:   1###...     90
B:   0###...     10
C:   ##0#...     60
D:   ##1#...     40

Table 9.4: The joint convergence problem

will proceed, because they will affect each other. For instance, we cannot even say
that C will win against D, because it may be that the many instances of A contain
many instances of D and very few instances of C, so that as A wins over B it causes
D to win over C.
Finally, note that problems which are not deceptive can be very hard for a GA.
Consider the following (suggested by Grefenstette): a problem using bit-strings of
fixed length N ≥ 10 to represent 2^N equally-spaced points in the interval [0, 1]. That
is, given a bit-string you determine the integer M that it represents in binary, then
this string is taken to represent the real number x = M/2^N. Suppose that 000...0
has the greatest fitness, and that
f(000...0) = 2^{N+1}
f(x) = x^2 otherwise
There are only 2^N possible chromosomes. Thus the average fitness of any schema
which has 000...0 as an instance of it is at least 2. Moreover, the average fitness of
any schema which does not have 000...0 as an instance is less than 1 since x ≤ 1 for
all other chromosomes. Thus the solution is an instance of every highly fit schema
and of no others. But it should be clear that a GA will have tremendous difficulty
finding this needle in a haystack.
The conclusion to be drawn from all this is that the choice of representation and
of fitness function have a tremendous effect on whether a GA can solve a problem.
However, it is very hard to tell whether particular choices are good without trying
them. GAs have proved amazingly effective when used to tackle a wide range of hard
practical problems.

9.6 Applications

There are many very effective applications of GAs. Goldberg showed that a GA
could find a near-optimal solution to the problem of controlling the compressors
that propel natural gas along a long pipeline, and that the GA only explored a
tiny portion of the search space to do so. Many other specific practical problems
have been effectively handled by GAs, including for example keyboard design, VLSI
layout, recursive adaptive filter design, certain scheduling problems, communications
network design and many others.
A considerable number of GA successes have been in areas where the real difficulty is to explore a large parameter space (assuming you tried the exercises concerned
with feed-forward networks, you will have some appreciation of the difficulty!). GAs
have also been used for designing neural networks. In some cases a neural network
is represented in chromosome form merely by rescaling the weights and representing
each as an integer, and then concatenating all the weights to form a long bit-string.
The fitness of a chromosome is then determined by seeing how well the network it
represents performs on the test data. An alternative higher-level representation might
be to express the topology of the network (including parameters such as learning rate
and momentum and details of which connections in each layer even exist at all, but
not the actual weights) as a bit-string of some kind. The fitness of such a chromosome
might be determined by trying to train the network it represents and seeing how well
the trained net does on test data.
This implies that enormously more time is spent in the evaluation of chromosome fitness than in any other part of the GA. This seems to be a general observation
about successful applications of GAs; if the fitness of a chromosome is easy to determine, the problem probably does not merit the use of a GA and other techniques may
work much better. This is true of the various function optimisation problems built
into the pga program; however, the functions are convenient as simple educational
illustrations of GAs.
Exercise 9.2

Are any of the function encodings used within the pga program
deceptive?

Exercise 9.3

Use pga to see whether using one population of size 500 performs better than five populations of size 50 on any of the problems, using the default migration rate for the five-population
cases.

Exercise 9.4

Compare generation-based reproduction (-rgen) with reproduction in which only one mating happens per cycle (-rone, which
is the default), and see whether this makes much difference to
the number of function evaluations it takes to find the solution
(or get very close at least).


Exercise 9.5  Does adaptive mutation perform better than the default fixed mutation rate of 0.05 on any of the problems? If so, do some exploration to see whether some other fixed mutation rate does as well or better. Use a single population of size 100 in every case.

Exercise 9.6  Using a single population of size 100, see whether rank-based or fitness-based selection is better on each problem. Leave the other parameters set at their defaults.

Exercise 9.7  Using a single population of size 100, see what effect changes to the crossover rate have in each problem. Leave the other parameters at their default values.

9.7 Classifier systems

A classifier system contains a population of stylised rules, sufficiently unlike the rules in a conventional knowledge-based system to have a different name: classifiers. There is also
a counterpart of working memory in a conventional production system, here called
the message list. The elements of the message list are messages, expressed as fixed-length sequences; usually a message is just a bit-string of fixed length n, using 0s
and 1s. Each classifier consists of two parts, the condition and action parts, each
represented by a fixed-length sequence of length n too; usually, if messages are just
bit-strings then the condition and action parts are composed of 0s, 1s and #s. In a
condition part the # is a wild-card, which can stand for a 0 or 1 when that condition
part is matched against members of the message list. Each # in the action part has
a somewhat different meaning. If the classifier gets to fire, it puts a message onto the
message list: the message is composed by using the 0s and 1s from the action part,
and each # in the action part (called a pass-through) is replaced by the corresponding
part of the message which had matched the condition part. Thus if
###1#00 01##110
is a classifier, and the condition matches the following message on the message list:
1111000
then, if the classifier gets to fire, it would put the following message on the message
list:
0111110
Every classifier has a numeric strength too. The strengths change dynamically as the
system runs, in a manner outlined below.
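
A minimal sketch of the matching and pass-through rules just described (the helper names are invented for illustration):

def matches(condition, message):
    """A condition matches if every non-# position agrees with the message."""
    return all(c in ("#", m) for c, m in zip(condition, message))

def fire(action, message):
    """Build the posted message: each # in the action passes through the
    corresponding bit of the message that matched the condition."""
    return "".join(m if a == "#" else a for a, m in zip(action, message))

condition, action = "###1#00", "01##110"
message = "1111000"
if matches(condition, message):
    print(fire(action, message))   # prints 0111110, as in the example above
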
The classifier system runs within some environment; the environment also adds
messages to the message list, and sometimes rewards the placing of certain messages
on the message list. The overall operation of the whole system is as follows.


(1) The message list (which in most systems is of a fixed size) is initialised in some
way, with the inclusion of any messages from the environment.
(2) Any classifiers that have a condition which matches something on the message
list get to bid to fire. Each bids a fixed proportion of its strength, scaled
by how specific the condition is (how few wild-cards it contains; the fewer,
the better), and finally perhaps with the addition of some modest noise. In
recent classifier systems messages themselves also have an associated numeric
intensity, and bids are further scaled by the intensity of the matched message.
Any classifier may match more than one message of course, and enters a bid
per enabling message.
(3) Winners are chosen: typically, enough to replace all messages on the message
list, after reserving enough space for incoming messages from the environment.
(4) The successful bids are paid to whatever put the enabling messages onto the
message list; this could include payment to the environment.
(5) The old message list is discarded and the new one is created. Any reward earned
from the environment is paid to the classifier(s) that put the rewarded message
there.
This is, in effect, a trickle-down information economy. Many classifier systems include
a tax too, in which every classifier loses a very small proportion of its strength at
each cycle. Such a tax eventually kills off the useless classifiers. The payment of bids
to classifiers that enabled a classifier to fire causes a trickling-down of rewards earned
from the environment; the process is glorified by the name of the bucket-brigade
algorithm.
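The flow of strength can be sketched roughly as follows. This is an illustration of the idea only, not the implementation of any particular classifier system, and the bid fraction and tax rate are invented values.

BID_FRACTION = 0.1   # proportion of strength bid by a firing classifier (invented)
TAX_RATE = 0.005     # small per-cycle tax (invented)

def bucket_brigade_cycle(strength, suppliers, rewards):
    # strength:  dict classifier -> numeric strength.
    # suppliers: dict mapping each winning classifier to whatever supplied its
    #            enabling message (another classifier, or 'env').
    # rewards:   dict classifier -> reward earned from the environment this cycle.
    for c, src in suppliers.items():
        bid = BID_FRACTION * strength[c]
        strength[c] -= bid                  # the winner pays its bid...
        if src != 'env':
            strength[src] += bid            # ...to the classifier that enabled it
    for c, r in rewards.items():
        strength[c] += r                    # environmental reward
    for c in strength:
        strength[c] *= (1.0 - TAX_RATE)     # everyone pays a small tax
    return strength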
New rules get introduced by a genetic algorithm, which runs every so often
(deterministically in some systems, stochastically in others) and breeds new rules
selecting parents according to rule strength. In some systems the children enlarge the
classifier set; in others they replace the least fit. Usually, only a fixed proportion of
the rules get to breed.
As you may readily appreciate, the trickling-down of rewards may be fairly
slow, so if the system is to work the environmental messages and rewards have to
arrive on a time scale that is slow compared to the basic cycle of operations.
As yet there are very few commercial applications of classifier systems, but quite a
few intriguing and suggestive prototypes. As mentioned at the start of this chapter, it
turns out that classifier systems are completely equivalent to feed-forward networks.
Messages correspond to nodes. Classifiers correspond to connections between nodes,
but the existence of wild-cards and pass-throughs means that connections can be
represented in a more powerful way than is possible in a neural network. The bidding
and flow of strength and rewards corresponds to the passage of numbers from node
to node in a neural network. Two papers by Lawrence Davis show exactly how to
create a classifier system equivalent to a given neural network, and how to create a
neural network equivalent to a classifier system; see Advances in Neural Information
Processing Systems 1.

Appendix A
The rbp back-propagation simulator
This chapter was written by Donald R. Tveter and is copyright by him, 1991 and
1992. It may be reproduced for educational purposes only. Within the Department
of AI, only the rbp program is currently used.

A.1 Introduction

The programs described below were produced for my own use in studying back-propagation and for doing some examples that are found in my introductory Artificial Intelligence textbook, The Basis of Artificial Intelligence, to be published by
Computer Science Press. (I hope some time before the sun burns out or before the
Cubs win the World Series or before Congress balances the budget or ... .) I have
copyrighted these files but I hereby give permission to anyone to use them for experimentation, educational purposes or to redistribute them on a not-for-profit basis. All
others who want to use these programs for business or commercial purposes will be
charged a small fee. You should contact me by mail at:
Dr. Donald R. Tveter
5228 N. Nashville Ave.
Chicago, Illinois 60656
USENET: drt@chinet.chi.il.us
Note: this is "use at your own risk" software: there is no guarantee that it is
bug-free. Use of this software constitutes acceptance for use in an "as is" condition.
There are no warranties with regard to this software. In no event shall the author
be liable for any damages whatsoever arising out of or in connection with the use or
performance of this software.
There are four simulators that can be constructed from the included files. The
program, rbp, does back-propagation using real weights and arithmetic. The program, bp, does back-propagation using 16-bit integer weights, 16 and 32-bit integer
arithmetic and some floating point arithmetic. The program, sbp, uses 16-bit integer
symmetric weights but only allows two-layer networks. The program srbp does the
same using real weights. The purpose of sbp and srbp is to produce networks that
can be used with the Boltzmann machine relaxation algorithm (not included).
In general, the 16-bit integer programs are the most useful, because they are
the fastest. Unfortunately, sometimes 16-bit integer weights don't have enough range
or precision and then using the floating point versions may be necessary. Many other
speed-up techniques are included in these programs.

A.2 A Simple Example

Each version would normally be called with the name of a file to read commands
from, as in:
rbp xor
When no file name is specified, rbp will take commands from the keyboard (UNIX
stdin file). After the data file is read, commands are then taken from the keyboard.
The commands are one-letter commands and most of them have optional parameters. The A, B, d and f commands allow a number of sub-commands on a
line. The maximum length of any line is 256 characters. An * introduces a comment:
it can be used to make the remainder of the line a comment. Here is an example of a
data file to do the xor problem:
* input file for the xor problem
m 2 1 1      * make a 2-1-1 network
c 1 1 3 1    * add this extra connection
c 1 2 3 1    * add this extra connection
s 7          * seed the random number function
k 0 1        * give the network random weights
n 4          * read four new patterns into memory
1 0 1
0 0 0
0 1 1
1 1 0
e 0.5        * set eta to 0.5 (and eta2 to 0.5)
a 0.9        * set alpha to 0.9

First, in this example, the m command will make a network with 2 units in the
input layer, 1 unit in the second layer and 1 unit in the third layer. The following
c commands create extra connections from layer 1 unit 1 to layer 3 unit 1 and from
layer 1 unit 2 to layer 3 unit 1. The s command sets the seed for the random
number function. The k command then gives the network random weights. The
k command has another use as well. It can be used to try to kick a network out
of a local minimum. Here, the meaning of k 0 1 is to examine all the weights in
the network and for every weight equal to 0 (and they all start out at 0), add in a
random number between -1 and +1. The n command specifies four new patterns
to be read into memory. With the n command, any old patterns that may have
been present are removed. There is also an x command that behaves like the n
command, except that the x command ADDS the extra patterns to the current training
set. The input pattern comes first, followed by the output pattern. The statement, e
0.5, sets eta, the learning rate for the upper layer to 0.5 and eta2 for the lower layers
to 0.5 as well. The last line sets alpha, the momentum parameter, to 0.9.
After these commands are executed, the following messages and prompt appear:
Fast Back-Propagation Copyright (c) 1990, 1991, 1992 by Donald R. Tveter
taking commands from stdin now
[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]?
The characters within the square brackets are a list of the possible commands. To
run 100 iterations of back-propagation and print out the status of the learning every
20 iterations type r 100 20 at the prompt:
[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]? r 100 20
This gives:
running . . .
 20 iterations    0.00% right  (0 right  4 wrong)  0.49927 error/unit
 40 iterations    0.00% right  (0 right  4 wrong)  0.43188 error/unit
 60 iterations   75.00% right  (3 right  1 wrong)  0.09033 error/unit
 62 iterations  100.00% right  (4 right  0 wrong)  0.07129 error/unit
patterns learned to within 0.10 at iteration^G 62

The program immediately prints out the running . . . message. After each
20 iterations, a summary of the learning process is printed, giving the percentage of
patterns that are right, the number right and wrong and the average value of the
absolute values of the errors of the output units. The program stops when each
output for each pattern has been learned to within the required tolerance, in this
case the default value of 0.1. A ctrl-G is normally printed out as well to sound the
bell. If the second number defining how often to print out a summary is omitted,
the summaries will not be printed. Sometimes the integer versions will do a few
extra iterations before declaring the problem done because of truncation errors in
the arithmetic done to check for convergence. The status figures for iteration i are
computed when making the forward pass of the iteration and before the weights are
updated, so these values are one iteration out of date. This saves on CPU time;
however, if you really need up-to-date statistics, use the u+ option described in the
format specifications.
A.2.1 Listing Patterns

To get a listing of the status of each pattern, use the P command to give:
[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]? P
1 0.90 (0.098) ok
2 0.05 (0.052) ok
3 0.94 (0.062) ok
4 0.07 (0.074) ok
62 iterations 100.00% right  (4 right  0 wrong)  0.07129 error/unit

The numbers in parentheses give the sum of the absolute values of the output errors
for each pattern. An ok is given to every pattern that has been learned to within
the required tolerance. To get the status of one pattern, say, the fourth pattern, type
P 4 to give:
0.07  (0.074) ok

To get a summary without the complete listing, use P 0. To get the output targets
for a given pattern, say pattern 3, use O 3.
A particular test pattern can be input to the network with the p command,
as in:
[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]? p 1 0
0.90
A.2.2 Examining Weights

It is often interesting to see the values of some particular weights in the network.
To see a listing of all the weights in a network, use the save weights command described below and then cat the file weights. However, to see the weights leading into
a particular node, say unit 1 in layer 2, use the w command as in:
[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]? w 2 1
layer  unit   unit value     weight      input from unit
  1      1      1.00000      9.53516       9.53516
  1      2      0.00000     -8.40332       0.00000
  2      t      1.00000      4.13086       4.13086
                                    sum = 13.66602
This listing also gives data on how the current activation value of the node is computed
using the weights and the activation values of the nodes feeding into unit 1 of layer
2. The t unit is the threshold unit.

A.2.3 The Help Command

To get a short description of any command, type h followed by the letter of the
command. Here, we type h h for help with help:
[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]? h h
h <letter> gives help for command <letter>.
To list the status of all the parameters in the program, use ?.
A.2.4 To Quit the Program

Finally, to end the program, the q (for quit) command is entered:


[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]? q

A.3 The Format Command

There are several ways to input and output patterns, numbers and other values and
there is one format command, f, that is used to set these options. In the format
command, a number of options can be specified on a single line, as for example in:
f b+ ir oc s- wB
A.3.1 Input Patterns

The programs are able to read pattern values in two different formats. The default
input format is the compressed format. In it, each value is one character and it is not
necessary to have blanks between the characters. For example, in compressed format,
the patterns for xor could be written out in either of the following ways:
101        10 1
000        00 0
011        01 1
110        11 0

The second example is preferable because it makes it easier to see the input and the
output patterns. Compressed format can also be used to input patterns with the p
command. In addition to using 1 and 0 as input, the character, ? can be used. This
character is initially defined to be 0.5, but it can be redefined using the Q command
like so:
Q -1

This sets the value of ? to -1. Other valid input characters are the letters, h, i,
j and k. The h stands for hidden. Its meaning in an input string is that the
value at this point in the string should be taken from the next unit in the second
layer of the network. This notation is useful for specifying simple recurrent networks.
Naturally, i, j and k stand for taking input values from the third, fourth and fifth
layers (if they exist). A simple example of a recurrent network is given later.
The other input format for numbers is real. The number portion must start
with a digit (.35 is not allowed, but 0.35 is). Exponential notation is not allowed.
Real numbers have to be separated by a space. The h, i, j, k and ? characters
are also allowed with real input patterns. To take input in the real format, it is
necessary to set the input format to be real using the f (format) command as in:
f ir
To change back to the compressed format, use:
f ic
A.3.2 Output of Patterns

Output format is controlled with the f command as in:

f or
f oc
f oa
The first sets the output to real numbers. The second sets the output to be compressed
mode, where the value printed will be a 1 when the unit value is greater than (1.0 - tolerance), a ^ when the value is above 0.5 but less than (1.0 - tolerance), a v when
the value is less than 0.5 but greater than the tolerance. Below the tolerance value,
a 0 is printed. The tolerance can be changed using the t command (not a part of
the format command). For example, to make all values greater than 0.8 print as 1
and all values less than 0.2 print as 0, use:
t 0.2
Of course, this same tolerance value is also used to check to see if all the patterns have
converged. The third output format is meant to give analog compressed output. In
this format, a c is printed when a value is close enough to its target value. Otherwise,
if the answer is close to 1, a 1 is printed, if the answer is close to 0, a 0 is printed,
if the answer is above the target but not close to 1, a ^ is printed and if the answer
is below the target but not close to 0, a v is printed. This output format is designed
for problems where the output is a real number, as for instance, when the problem is
to make a network learn sin(x).
For the sake of convenience, the output format (and only the output format)
can be set without using the f, so that:

or
will also make the output format real.
A.3.3 Breaking up the Output Values

In the compressed formats, the default is to print a blank after every 10 values.
This can be altered using the b (for inserting breaks) command. The use for this
command is to separate output values into logical groups to make the output more
readable. For instance, you may have 24 output units where it makes sense to insert
blanks after the 4th, 7th and 19th positions. To do this, specify:
b 4 7 19
Then for example, the output will look like:
1   10^0 10^ ^000v00000v0 01000   (0.17577)
2   1010 01v 0^0000v00000 ^1000   (0.16341)
3   0101 10^ 00^00v00000v 00001   (0.16887)
4   0100 0^0 000^00000v00 00^00   (0.19880)

The b command allows up to 20 break positions to be specified. The default output
format is the real format with 10 numbers per line. For the output of real values, the
b command specifies when to print a carriage return rather than when to print a
blank. (Note: the b command is not part of the f command.)
A.3.4 Pattern Formats

There are two different types of problems that back-propagation can handle, the
general type of problem where every output unit can take on an arbitrary value and
the classification type of problem where the goal is to turn on output unit i and turn
off all the other output units when the pattern is of class i. The xor problem is an
example of the general type of problem. For an example of a classification problem,
suppose you have a number of data points scattered about through two-dimensional
space and you have to classify the points as either class 1, class 2 or class 3. For a
pattern of class 1 you can always set up the output: 1 0 0, for class 2: 0 1 0 and
for class 3: 0 0 1; however, doing the translation to bit patterns can be annoying, so
another notation can be used. Instead of specifying the bit patterns you can set the
pattern format option to classification (as opposed to the default value of general)
like so:
f pc
and then the program will read data in the form:

 1.33    3.61    1   * shorthand for 1 0 0
 0.42   -2.30    2   * shorthand for 0 1 0
-0.31    4.30    3   * shorthand for 0 0 1

and translate it to the bit string form when the pattern is loaded onto the output
units. To switch to the general form, use f pg.
In addition to controlling the input of data, the p command within the format
command is used to control the output of patterns from a set of test patterns kept on
a file. If the format is either c or g then when the test set is run thru the network
you will only get a summary of how many patterns are correct. If the format is either
C or G you will get a listing of the output values for each pattern as well as the
summary. When reading patterns, C works the same as c and G works the same
as g.
A.3.5 Controlling Summaries

When the program is learning patterns you can have it print out the status of the
learning process at regular intervals. The default is to print out only a summary of
how learning is going; however, you can also print out the status of every pattern at
regular intervals. To get the whole set of patterns, use f s- to turn off the summary
format and f s+ to go back to summarizing.
A.3.6 Ringing the Bell

To ring the bell when the learning has been completed use f b+ and to turn off the
bell, use f b-.
A.3.7 Echoing Input

When you are reading commands from a file, it is often worthwhile to see those
commands echoed on the screen. To do this, use f e+ and to turn off the echoing,
use f e-.
A.3.8 Paging

The program is set up to write 24 lines to the screen and then pause. At the pause,
the program prints out a :. At this point, typing a carriage return will get you one
more page. Typing a q followed by a carriage return will quit the process you're
working on and give you another prompt. So, if you're running, for example, the xor
problem and you type r 100 1, you will run 24 iterations through the program,
these will print out and then there will be a pause. Notice that the program will not
be busy computing anything more during the pause. To reset the number of lines
written to the screen, to say, 12, use f P 12. Setting the value to 0 will drop the
paging altogether.

Note that the program will not be paging at all if you take a series of commands
from the original data file or some other input file and the output produced by these
commands is less than the page size. That is to say, a new line count is started every
time a new command is read and if the output of that command is less than the page
size there won't be any paging.
A.3.9 Making a Copy of Your Session

To make a copy of what appears on the screen, use f c+ to start writing to the
file copy and f c- to stop writing to this file. Ending the session automatically
closes this file as well.
A.3.10 Up-To-Date Statistics

During the i-th pass thru the network the program will collect statistics on how many
patterns are correct and how much error there is overall. These numbers are gathered
before the weights are updated and so the results listed for iteration i really show
the situation after the weight update for iteration i - 1. To complicate matters more,
the weight updates can be done continuously instead of after all the patterns have
been presented so the statistics you get here are skewed even more. If you want to
have up-to-date statistics with either method, use f u+ and to go back to statistics
that are out of date, use f u-. The penalty with f u+ is that the program needs
to do another forward pass. When using the continuous update method it is highly
advisable to use f u+, at least when you get close to complete convergence because
the default method of checking may claim the learning is finished when it isn't, or it
may continue training after the tolerances have been met.

A.4 Taking Training and Testing Patterns from Files

In the xor example given above the four patterns were part of the data file and to
read them in the following lines were used:
n 4
1 0 1
0 0 0
0 1 1
1 1 0

However, it is also convenient to take patterns from a file that contains nothing but
a list of patterns (and possibly comments). To read a new set of patterns from some
file, patterns, use:
n f patterns
To add an extra bunch of patterns to the current set you can use:

x f patterns
In addition to keeping a set of training patterns you can take testing patterns
from a file as well. To specify the file you can invoke the program with a second file
name, as in:
bp xor xtest
In addition, if you do the following:
t f xtest
the program will set xtest as the test file and immediately do the testing. Once the
file has been defined you can test the patterns on the test file by t f or just t. (This
leaves the t command doing double duty since t 0.2 will set the tolerance level for
each output to be 0.2.) In addition, the test file can be set without being tested
by using
B t f xtest
as explained in the benchmarking section.

A.5 Saving and Restoring Weights and Related Values

Sometimes the amount of time and effort needed to produce a set of weights to solve
a problem is so great that it is more convenient to save the weights rather than
constantly recalculate them. Weights can be saved as real values in an ASCII format
(the default) or as binary, to save space. The old way to save the weights (which still
works) is to enter the command, S. The weights are then written on a file called
weights or to the last file name you have specified, if you're using the new version
of the command. The following file comes from the xor problem:
62r   file = ../xor3
9.5351562500
-8.4033203125
4.1308593750
5.5800781250
-4.9755859375
-11.3095703125
8.0527343750

To write the weights, the program starts with the second layer, writes out the weights
leading into these units in order with the threshold weight last. Then it moves on
to the third layer, and so on. To restore these weights, type an R for restore. At
this time, the program reads the header line and sets the total number of iterations
the program has gone through to be the first number it finds on the header line. It
then reads the character immediately after the number. The r indicates that the
weights will be real numbers represented as character strings. If the weights were
binary, the character would be a b rather than an r. Also, if the character is b,
the next character is read. This next character indicates how many bytes are used
per value. The integer versions, bp and sbp write files with 2 bytes per weight, while
the real versions, rbp and srbp write files with 8 bytes per weight for double precision
reals and 4 bytes per weight for single precision reals. With this notation, weight files
written by one program can be read by the other. A binary weight format is specified
within the f command by using f wb. A real format is specified by using f wr. If
your program specifies that weights should be written in one format, but the weight
file you read from is different, a notice will be given. There is no check made to see
if the number of weights on the file equals the number of weights in the network.
The above formats specify that only weights are written out and this is all you
need once the patterns have converged. However, if you're still training the network
and want to break off training and pick up the training from exactly the same point
later on, you need to save the old weight changes when using momentum, and the
parameters for the delta-bar-delta method if you are using this technique. To save
these extra parameters on the weights file, use f wR to write the extra values as real
and f wB to write the extra values as binary.
In the above example, the command S was used to save the weights immediately. Another alternative is to save weights at regular intervals. The command S
100 will automatically save weights every 100 iterations. The default rate at which to save weights is set at 32767 for 16-bit compilers and 2147483647
for 32-bit compilers which generally means that no weights will ever be saved.
To save weights to a file other than weights, you can say: s w filename.
To continue saving to the same file, you can just do s w. Naturally if you restore
weights it will be from this current weights file as well. You can restore weights from
another file by using: r w filename. Of course, this also sets the name of the file to
write to, so if you're not careful you could lose your original weights file.

A.6 Initializing Weights and Giving the Network a Kick

All the weights in the network initially start out at 0. In symmetric networks then,
no learning may result because error signals cancel themselves out. Even in nonsymmetric networks, the training process will usually converge faster if the weights
start out at small random values. To do this, the k command will take the network
and alter the weights in the following ways. Suppose the command given is:
k 0 0.5
Now, if a weight is exactly 0, then the weight will be changed to a random value
between -0.5 and +0.5. The above command can therefore be used to initialize the
weights in the network. A more complex use of the k command is to decrease the
magnitude of large weights in the network by a certain random amount. For instance,
in the following command:
k 2 8
all the weights in the network that are greater than or equal to 2 will be decreased
by a random number between 0 and 8. Weights less than or equal to -2 will be
increased by a random number between 0 and 8. The seed to the random number
generator can be changed using the s command as in s 7. The integer parameter
in the s command is of type unsigned.
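A rough sketch of what the two uses of the k command described above amount to (illustrative only, not the rbp source):

import random

def kick(weights, threshold, size):
    # 'k 0 size': weights that are exactly 0 become random values in [-size, +size].
    # 'k threshold size' with threshold > 0: weights >= +threshold are decreased,
    # and weights <= -threshold increased, by a random amount in [0, size].
    new = []
    for w in weights:
        if threshold == 0 and w == 0:
            new.append(random.uniform(-size, size))
        elif threshold > 0 and w >= threshold:
            new.append(w - random.uniform(0, size))
        elif threshold > 0 and w <= -threshold:
            new.append(w + random.uniform(0, size))
        else:
            new.append(w)
    return new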
Another method of giving a network a kick is to add hidden layer units. The
command:
H 2 0.5
adds one unit to layer 2 of the network and all the weights that are created are
initialized to between -0.5 and +0.5.
The subject of kicking a back-propagation network out of local minima has
barely been studied and there is no guarantee that the above methods are very useful
in general.

A.7 Setting the Algorithm to Use

A number of different variations on the original back-propagation algorithm have
been proposed in order to speed up convergence and some of these have been built
into these simulators. These options are set using the A command and a number of
options can go on the one line.
A.7.1 Activation Functions

To set the activation function, use:

A al   the linear activation function
A ap   the piece-wise activation function
A as   the smooth activation function
A at   the piecewise near-tanh function that runs from -1 to +1
A aT   the continuous near-tanh function that runs from -1 to +1
When using the linear activation function, it is only appropriate to use the differential
step-size derivative and a two-layer network. The smooth activation function is:

f = 1 / (1 + e^-x)

where x is the input to a unit. The piece-wise function is an approximation to the
function f and it will normally save some CPU time even though it may increase
the number of iterations you need to solve the problem. The continuous near-tanh
function is 2f - 1 and the piece-wise version approximates this function with a series
of straight lines.
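For reference, the smooth activation and the continuous near-tanh variant (together with the sharpness parameter D discussed in the next subsection) can be sketched as follows; this is an illustration, not code from the simulator:

import math

def smooth(x, D=1.0):
    # The smooth activation function, with optional sharpness (gain) D.
    return 1.0 / (1.0 + math.exp(-D * x))

def near_tanh(x, D=1.0):
    # The continuous near-tanh function 2f - 1, running from -1 to +1.
    return 2.0 * smooth(x, D) - 1.0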

A.7.2 Sharpness (or Gain)

The sharpness (or gain) is the parameter D in the function:

f = 1 / (1 + e^-Dx)
A sharper sigmoid shaped activation function (larger D) will produce faster convergence (see Speeding Up Back Propagation by Yoshio Izui and Alex Pentland in the
Proceedings of IJCNN-90-WASH-DC, Lawrence Erlbaum Associates, 1990). To set
this parameter, to say, 8, use A D 8. The default value is 1. Unfortunately, too
large a value for D will hurt convergence so this is not a perfect solution to speeding
up learning. Sometimes the best value for D may be less than 1.0. A larger D is
also useful in the integer version of back-propagation where the weights are limited
to between -32 and +31.999. A larger D value in effect magnifies the weights and
makes it possible for the weights to stay smaller. Values of D less than one may
be useful in extracting a network from a local minimum (see Handwritten Numeral
Recognition by Multi-layered Neural Network with Improved Learning Algorithm
by Yamada, Kami, Temma and Tsukumo in Proceedings of the 1989 IJCNN, IEEE
Press). A smaller value of D will also force the weights and the weight changes to be
larger and this may be of value when the weight changes become less than the weight
resolution of 0.001 in the integer version.
A.7.3 The Derivatives

The correct derivative for the standard activation function is s(1 - s), where s is the
activation value of a unit; however, when s is near 0 or 1 this term will give only very
small weight changes during the learning process. To counter this problem, Fahlman
proposed the following one for the output layer:

0.1 + s(1 - s)
(For the original description of this method, see Faster-Learning Variations on Back-Propagation: An Empirical Study, by Scott E. Fahlman, in Proceedings of the 1988
Connectionist Models Summer School, Morgan Kaufmann, 1989.) Besides Fahlman's
derivative and the original one, the differential step size method (see Stepsize Variation Methods for Accelerating the Back-Propagation Algorithm, by Chen and Mars,
in IJCNN-90-WASH-DC, Lawrence Erlbaum, 1990) takes the derivative to be 1 in the
layer going into the output units and uses the correct derivative for all other layers.
The learning rate for the inner layers is normally set to some smaller value. To set a
value for eta2, give two values in the e command as in:
e 0.1 0.01
To set the derivative, use the A command as in:

A dd   use the differential step size derivative (default)
A dF   use Fahlman's derivative in only the output layer
A df   use Fahlman's derivative in all layers
A do   use the original, correct derivative
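As an illustration of the three derivative choices (not the simulator's source; s is the activation value of a unit):

def deriv_original(s):
    # The correct derivative of the standard activation function.
    return s * (1.0 - s)

def deriv_fahlman(s):
    # Fahlman's variant: the added 0.1 keeps weight changes from vanishing
    # when s is near 0 or 1.
    return 0.1 + s * (1.0 - s)

def deriv_differential_step(s):
    # Differential step size: the derivative is taken to be 1 for the layer
    # feeding the output units; other layers use the correct derivative.
    return 1.0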

A.7.4 Update Methods

The choices are the periodic (batch) method, the continuous (online) method, delta-bar-delta and quickprop. The following commands set the update methods:

A uc   for the continuous update method
A ud   for the delta-bar-delta method
A up   for the original periodic update method (default)
A uq   for the quickprop algorithm

The delta-bar-delta method uses a number of special parameters and these are set
using the d command. Delta-bar-delta can be used with any of the derivatives and
the algorithm will find its own value of eta for each weight.
A.7.5 Other Algorithm Options

The b command controls whether or not to backpropagate error for units that have
learned their response to within a given tolerance. The default is to always backpropagate error. The advantage to not backpropagating error is that this can save
computer time. This parameter can be set like so:
A b+   always backpropagate error
A b-   don't backpropagate error when close
The s sub-command will set the number of times to skip a pattern when the
pattern has been learned to within the desired tolerance. To skip 3 iterations, use A
s 3, to reset to not skip any patterns use A s 0.
The t sub-command will take the given pattern (only one at a time) out of
the training set so that you can then train the other patterns and test the network's
response to this one pattern that was removed. To test pattern 3, use A t 3 and to
reset to use all the patterns use A t 0.

A.8 The Delta-Bar-Delta Method

The delta-bar-delta method attempts to find a learning rate, eta, for each individual
weight. The parameters are the initial value for the etas, the amount by which to
increase an eta that seems to be too small, the rate at which to decrease an eta that
is too large, a maximum value for each eta and a parameter used in keeping a running
average of the slopes. Here are examples of setting these parameters:

d d 0.5      sets the decay rate to 0.5
d e 0.1      sets the initial etas to 0.1
d k 0.25     sets the amount to increase etas by (kappa) to 0.25
d m 10       sets the maximum eta to 10
d n 0.005    an experimental noise parameter
d t 0.7      sets the history parameter, theta, to 0.7

These settings can all be placed on one line:

d d 0.5 e 0.1 k 0.25 m 10 t 0.7

The version implemented here does not use momentum. The symmetric versions, sbp
and srbp do not implement delta-bar-delta.
The idea behind the delta-bar-delta method is to let the program find its own
learning rate for each weight. The e sub-command sets the initial value for each
of these learning rates. When the program sees that the slope of the error surface
averages out to be in the same direction for several iterations for a particular weight,
the program increases the eta value by an amount, kappa, given by the k parameter.
The network will then move down this slope faster. When the program finds the slope
changes signs, the assumption is that the program has stepped over to the other side
of the minimum and so it cuts down the learning rate, by the decay factor, given by
the d parameter. For instance, a d value of 0.5 cuts the learning rate for the weight
in half. The m parameter specifies the maximum allowable value for an eta. The
t parameter (theta) is used to compute a running average of the slope of the weight
and must be in the range 0 <= t < 1. The running average at iteration i, a_i, is defined
as:

a_i = (1 - t) * slope_i + t * a_(i-1)

so small values for t make the most recent slope more important than the previous
average of the slope. Determining the learning rate for back-propagation automatically is, of course, very desirable and this method often speeds up convergence by
quite a lot. Unfortunately, bad choices for the delta-bar-delta parameters give bad
results and a lot of experimentation may be necessary. If you have N patterns in the
training set try starting e and k around 1/N. The n parameter is an experimental
noise term that is only used in the integer version. It changes a weight in the wrong
direction by the amount indicated when the previous weight change was 0 and the
new weight change would be 0 and the slope is non-zero. (I found this to be effective
in an integer version of quickprop so I tossed it into delta-bar-delta as well. If you find
this helps, please let me know.) For more on delta-bar-delta see Increased Rates of
Convergence by Robert A. Jacobs, in Neural Networks, Volume 1, Number 4, 1988.
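A minimal sketch of the per-weight eta adjustment just described (illustrative only, not the simulator's code; the default values simply echo the example d settings above):

def delta_bar_delta(eta, avg_slope, slope,
                    kappa=0.25, decay=0.5, max_eta=10.0, theta=0.7):
    # Grow eta additively by kappa while the slope keeps the same sign as its
    # running average; cut eta by the decay factor when the sign flips.
    if avg_slope * slope > 0:
        eta = min(eta + kappa, max_eta)
    elif avg_slope * slope < 0:
        eta = eta * decay
    new_avg = (1.0 - theta) * slope + theta * avg_slope   # running average of the slope
    return eta, new_avg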

A.9 Quickprop

Quickprop (see Faster-Learning Variations on Back-Propagation: An Empirical Study,
by Scott E. Fahlman, in Proceedings of the 1988 Connectionist Models Summer
School, Morgan Kaufmann, 1989) is similar to delta-bar-delta in that the algorithm
attempts to increase the size of a weight change for each weight while the process continues to go downhill in terms of error. The main acceleration technique is to make
the next weight change mu times the previous weight change. Fahlman suggests mu
= 1.75 is generally quite good so this is the initial value for mu but slightly larger or
slightly smaller values are sometimes better. In addition when this is done Fahlman
also adds in a value, eta times the current slope, in the traditional backprop fashion.
I had to wonder if this was a good idea so in this code I've included a capability to
add it in or not add it in. So far it seems to me that sometimes adding in this extra
term helps and sometimes it doesn't. The default is to always use the extra term.
A second key property of quickprop is that when a weight change changes the
slope of the error curve from positive to negative or vice-versa, quickprop attempts
to jump to the minimum by assuming the curve is a parabola.
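The core of the quickprop step can be sketched as follows; this is a simplified illustration of the ideas above, not the code used in rbp, the weight-decay term is omitted and slope means dE/dw:

def quickprop_step(prev_dw, prev_slope, slope, eta=0.5, mu=1.75, use_eta_term=True):
    # Jump towards the minimum of the parabola through the previous and current
    # slopes, but never move more than mu times the previous weight change;
    # optionally add the traditional -eta * slope term as well.
    if prev_dw != 0.0 and prev_slope != slope:
        factor = slope / (prev_slope - slope)
        factor = max(-mu, min(mu, factor))
        dw = factor * prev_dw
        if use_eta_term:
            dw -= eta * slope
    else:
        dw = -eta * slope        # a plain gradient step to get things moving
    return dw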
A third factor involved in quickprop comes about from the fact that the weights
often grow very large very quickly. To minimize this problem there is a decay factor
designed to keep the weights small. Fahlman does not mention a value for this
parameter but I usually try 0.0001 to 0.00001. I've found that too large a decay
factor can stall out the learning process, so if your network isn't learning fast
enough or isn't learning at all, one possible fix is to decrease the decay factor. The
small values you need present a problem for the integer version since the smallest
value you can represent is about 0.001. To get around this problem the decay value
you input should be 1000 times larger than the value you intend to use. So to get
0.0001, input 0.1, to get 0.00001, input 0.01, etc. The code has been written so that
the factor of 1000 is taken into account during the calculations involving the decay
factor. To keep the values consistent both the integer and floating point versions use
this convention. If you use quickprop elsewhere you need to take this factor of 1000
into account. The default value for the decay factor is 0.1 (=0.0001).
I built in one additional feature for the integer version. I found that by adding
small amounts of noise the time to convergence can be brought down and the number
of failures can be decreased somewhat. This seems to be especially true when the
weight changes get very small. The noise consists of moving uphill in terms of error
by a small amount when the previous weight change was zero. Good values for the
noise seem to be around 0.005.
The parameters for quickprop are all set in the qp command like so:
qp d 0.1     * the default decay factor
qp e 0.5     * the default value for eta
qp m 1.75    * the default value for mu
qp n 0       * the default value for noise
qp s+        * always include the slope

or a whole series can go on one line:

qp d 0.1 e 0.5 m 1.75 n 0 s+

A.10 Recurrent Networks

Recurrent back-propagation networks take values from higher level units and use
them as activation values for lower level units. This gives a network a simple kind
of short-term memory, possibly a little like human short-term memory. For instance,
suppose you want a network to memorize the two short sequences, acb and bcd.
In the middle of both of these sequences is the letter, c. In the first case you want a
network to take in a and output c, then take in c and output b. In the second
case you want a network to take in b and output c, then take in c and output d.
To do this, a network needs a simple memory of what came before the c.
Let the network be a 7-3-4 network where input units 1-4 and output units 1-4
stand for the letters a-d. Furthermore, let there be 3 hidden layer units. The hidden
units will feed their values back down to the input units 5-7, where they become input
for the next step. To see why this works, suppose the patterns have been learned by
the network. Inputting the a from the first string produces some random pattern of
activation, p1, on the hidden layer units and c on the output units. The pattern p1
is copied down to units 5-7 of the input layer. Second, the letter, c is presented to
the network together with p1 now on units 5-7. This will give b on the output units.
However, if the b from the second string is presented first, there will be a different
random pattern, p2, on the hidden layer units. These values are copied down to input
units 5-7. These values combine with the c to produce the output, d.
The training patterns for the network can be:
1000 000 0010   * "a" prompts the output, "c"
0010 hhh 0100   * inputting "c" should produce "b"

0100 000 0010   * "b" prompts the output, "c"
0010 hhh 0001   * inputting "c" should produce "d"

where the first four values on each line are the normal input, the middle three either
start out all zeros or take their values from the previous values of the hidden units.
The code for taking these values from the hidden layer units is h. The last set of
values represents the output that should be produced. To take values from the third
layer of a network, the code is i. For the fourth and fifth layers (if they exist) the
codes are j and k. Training recurrent networks can take much longer than training
standard networks and the average error can jump up and down quite a lot.
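The copy-back idea behind the h codes can be pictured with a small sketch (illustrative only; forward_pass and encode are stand-ins for the network itself, not part of rbp):

def run_sequence(letters, hidden_size, forward_pass, encode):
    # Present a sequence one letter at a time; the previous hidden-layer
    # activations are appended to the next input, giving the network a
    # simple short-term memory (the 'hhh' slots start out at zero).
    hidden = [0.0] * hidden_size
    outputs = []
    for letter in letters:
        inputs = encode(letter) + hidden   # e.g. 'a' -> [1, 0, 0, 0], plus copied values
        hidden, output = forward_pass(inputs)
        outputs.append(output)
    return outputs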

A.11 The Benchmarking Command

The main purpose of the benchmarking command is to make it possible to run a
number of tests of a problem with different initial weights and average the number of
iterations and CPU time for networks that converged. A second purpose is to run a
training set thru the network a number of times, and for each try, a test pattern or
a test set can be checked at regular intervals.


A typical command to simply test the current parameters on a number of
networks is:
B g 5 m 15 k 1 r 1000 200
The g 5 specifies that you'd like to set the goal of getting 5 networks to converge but
the m 15 sets a maximum of 15 tries to reach this goal. The k specifies that each
initial network will get a kick by setting each weight to a random number between -1
and 1. The r 1000 200 portion specifies that you should run up to 1000 iterations on a
network and print the status of learning every 200 iterations. This follows the normal
run command and the second parameter defining how often to print the statistics
can be omitted. For example, here is some output from benchmarking with the xor
problem:
[?!*AaBbCcdefHhiklmnOoPpQqRrSstWwx]? B g 5 m 5 k 1 r 200
seed =      7; running . . .
patterns learned to within 0.10 at iteration 62
seed =      7; running . . .
seed =      7; running . . .
patterns learned to within 0.10 at iteration 54
seed =      7; running . . .
patterns learned to within 0.10 at iteration 39
seed =      7; running . . .
patterns learned to within 0.10 at iteration 44
1 failures; 4 successes; average = 49.750000   0.333320 sec/network

The time/network includes however much time is used to print messages so to time
the process effectively, all printing should be minimized. The timing is done using
the UNIX clock(3C) function and on a UNIX PC at least, the time returned by clock
will overflow after 2147 seconds, or about 36 minutes. If your system has the same
limitation, take care that ALL of the benchmarking you do in a single call of the
program adds up to less than 36 minutes.
In the above example, the seed that was used to set the random values for
the weights was set to 7 (outside the benchmarking command); however, if you set a
number of seeds, as in:
s 3 5 7 18484 99
the seeds will be taken in order for each network. When there are more networks to
try than there are seeds, the random values keep coming from the last seed value, so
actually you can get by using a single seed. The idea behind allowing multiple seeds
is so that if one network does something interesting you can use that seed to run a
network with the same initial weights outside of the benchmarking command.
Once the benchmarking parameters have been set, it is only necessary to include
the run portion in order to start the benchmarking process again, thus, B r 200 will
run benchmarking again using the current set of parameters. Also, the parameters
can be set without starting the benchmarking process by just not including the r
parameters in the B command as in:


B g 5 m 5 k 1
In addition to getting data on convergence, you can have the program run test
patterns thru the network at the print statistics rate given in the r sub-command.
To specify the test file, test100, use:
B t f test100
To run the training data thru for up to 1000 iterations and test every 200 iterations
use:
B r 1000 200
If the pattern format specification p is set to either c or g you will get a summary
of the patterns on the test file. If p is either C or G you will get the results for
each pattern listed as well as the summary. To stop testing the data on the data file,
use B t 0.
Sometimes you may have so little data available that it is difficult to separate
the patterns into a training set and a test set. One solution is to remove each pattern
from the training set, train the network on the remaining patterns and then test the
network on the pattern that was removed. To remove a pattern, say pattern 1 from
the training set use:
B t 1
To systematically remove each pattern from the training set, use a data file with the
following commands:
B t 1 r 200 50
B t 2 r 200 50
B t 3 r 200 50
... etc.
and the pattern will be tested every 50 iterations. If, in the course of training the
network, all the patterns converge, the program will print out a line starting with
a capital S followed by a test of the test pattern. If the program hits the point
where statistics on the learning process have to be printed and the network has not
converged, then a capital F will print out followed by a test of the test pattern. To
stop this testing, use B t 0.
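Such a leave-one-out command file is easy to generate with a short script; the file name and pattern count below are made up for this example:

# Write one B command per training pattern: remove pattern k, train for up to
# 200 iterations and test the removed pattern every 50 iterations.
n_patterns = 4                        # e.g. the four xor patterns
with open("loo.commands", "w") as f:
    for k in range(1, n_patterns + 1):
        f.write("B t %d r 200 50\n" % k)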
It would be nice to have the program average up and tabulate all the data that
comes out of the benchmarking command, but I thought I'd leave that to users for
now. You can use the record command to save the output from the entire session
and then run it thru some other program, such as an awk program in order to sort
everything out.

A.12 Miscellaneous Commands

Below is a list of some miscellaneous commands, a short example of each and a short
description of the command.
!   Anything after ! will be passed on to UNIX as a command to execute.

The C command will clear the network of values, reset the number of iterations
to 0 and reset other values so that another run can be made. The seed value is
reset so you can run the same network over with the same initial weights but
with different learning parameters. Even though the seed is reset, the weights
are not initialized, so you must do this step after clearing the network.

Entering i filename will read commands from the file. When there are no more
commands on the file, the program resumes reading from the previous file being
used. You can also have an i command within the file, however the depth to
which you can nest the number of active files is 4 and stdin itself counts as the
first one. Once an input file has been specified, you can simply type i to read
from the file again.

Entering l 2 will print the values of the units on layer 2, or whatever layer is
specified.

In sbp and srbp only, T -3 sets all the threshold weights to -3 or whatever value
is specified and freezes them at this value.

Entering W 0.9 will remove (whittle away) all the weights with absolute values
less than 0.9.

In addition, when a user-generated interrupt occurs, the program will drop its current
task and take the next command from the keyboard.

A.13 Limitations

Weights in the bp and sbp programs are 16-bit integer weights, where the real value of
the weight has been multiplied by 1024. The integer versions cannot handle weights
less than -32 or greater than +31.999. The weight changes are all checked for overflow
but there are other places in these programs where calculations can possibly overflow
as well and none of these places are checked. Input values for the integer versions
can run from -31.994 to 31.999. Due to the method used to implement recurrent
connections, input values in the real version are limited to -31994.0 and above.

Appendix B
The cluster and pca programs
NAME
cluster, pca - Hierarchical Cluster Analysis and Principal Component Analysis
SYNOPSIS
cluster [options] [vectorfile [namesfile]]
pca [options] [vectorfile [namesfile]]
DESCRIPTION
Cluster performs Hierarchical Cluster Analysis (HCA) on a set of vectors and outputs
the result in a variety of formats on standard output.
Pca performs Principal Component Analysis (PCA) on a set of vectors and prints
the transformed set of vectors on standard output.
If vectorfile is given, it is read as the file containing the vector data, one vector per
line, components separated by whitespace. An optional namesfile can be given to
assign names (arbitrary strings) to these vectors. Names must be specified one per
line, matching the number of vectors in vectorfile. Names are either contiguous nonwhitespace characters or arbitrary strings delimited by an initial double quote and
the end of line. Vector names may also be given in vectorfile itself, following the
vector components on each line. If no names are provided, vectors in the output are
identified by their input sequence number instead.
Either of these files may be given as -, indicating that the corresponding information
should be read from standard input. If no arguments are given standard input is read,
allowing cluster to be used as a filter.
Cluster and pca also provide a simple scaling facility. If the first line of the input is
terminated by the keyword SCALE it is interpreted as a vector of scaling factors.
The following lines are then read as data as usual, except that vector components are
multiplied by their corresponding scaling factor.


Yet another potentially useful feature is that vector components may be specified
as D/C (don't care), meaning that that component will always contribute zero in
computing distances to other vectors. In PCA mode, each D/C value is replaced by
the mean of all non-D/C values along its dimension.
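As a rough illustration of how scaling, D/C components and the distance metric fit together (a sketch only, not the programs' actual source; the 'DC' marker and function name are invented):

def lp_distance(u, v, scale=None, p=2):
    # L_p distance between two vectors; D/C components contribute zero, and
    # components are multiplied by their scaling factors first if a SCALE
    # line was given.
    total = 0.0
    for i, (a, b) in enumerate(zip(u, v)):
        if a == 'DC' or b == 'DC':
            continue
        s = scale[i] if scale is not None else 1.0
        total += abs(s * a - s * b) ** p
    return total ** (1.0 / p)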
OPTIONS
-p
Force PCA mode, even when the program is called as cluster. (cluster and
pca are different incarnations of the same program, depending on the zeroth
argument.)
-s
Suppress scaling. Vector components are not scaled, even if a SCALE line
was found. This is useful to produce both scaled and unscaled analyses from
the same input file.
-v
Verbose output. Reports the number and dimension of vectors read and precedes each output section with an explanatory message. For pca, progress
through the computational steps involved is reported.
-d (cluster only)
Output all pairs of clusters formed, along with their respective inter-cluster
distances. Clusters are given as lists of vectors.
-t (cluster only)
Represent the hierarchical clusters as a tree lying on its side. The leaves of the
tree are formed by vector names, and the horizontal spacing between nodes is
proportional to the distances between clusters. The output uses only ASCII
characters, resulting in a rough approximation of the true proportions.
-wwidth (cluster only)
Set the width of the tree representation used by -t to width characters. The
default width is 80 or the terminal width as determined by curses(3). Wider
trees are more difficult to view but give a more accurate picture of relative
distances.
-g (cluster only)
Same as -t, but the graphical output is specified in a format suitable for the
UNIX graph(1G) utility, which allows further formatting such as bounding
box, axes labels, rotation, and scaling. Graph(1G) in turn produces plotting
instructions according to the plot(5) format, for which a variety of output filters
exist. The following are typical command lines. For previewing on a standard
terminal:


cluster -g | graph -g1 | plot -Tcrt


Previewing under X windows:
cluster -g | xgraph
Converting to postscript:
cluster -g | graph -g1 | psplot
Printing on a printer supporting plot (5) format:
cluster -g | graph -g1 | lpr -g
-b (cluster only)
Same as -g, except that double drawing of lines is avoided, thus saving space
and time. This requires however that graph be called with the -b option to
correctly assemble the tree from pieces:
cluster -b | graph -b
-np (cluster only)
Norm to be used as distance metric between vectors. A positive integer p
specifies a metric based on the Lp norm. The value 0 selects the maximum
norm. The default is 2 (Euclidean distance).
For compatibility with an earlier version of the program, the default behavior of
cluster corresponds to the combination of options -dtv.
The following options apply only to pca:
-eeigenbase (pca only)
Use eigenbase as a file with precomputed eigenvectors. If the file exists, it is read
and the relatively costly eigenvalue computation is avoided. It also allows to
transform a set of vectors independent from the ones originally used to compute
the PCA dimensions. If the file does not exist, an eigenbase is computed from
the current input and saved in the file.
-cpc1,pc2,... (pca only)
Select a subset of the principal components for output, as typically used for
dimensionality reduction of vector sets. Components of the transformed vectors are listed in the order specified by the comma-separated list of numbers
pc1,pc2,... For example, -c4,2 prints the fourth and second principal components (in that order).
-E (pca only)
Output the eigenvalues instead of the transformed input vectors. Eigenvalues
are printed in descending order or as specified by the -c option. This option
forces recomputation of the eigenbase even if an existing file is specified with
the -e option.

Page B:4

The cluster and pca programs

BUGS
Halfhearted error handling. If vectors and names are given in the same file, the name
at the end of the first line must be a non-numerical string, or it will be mistaken for
a vector component.
The vector names at the leaves of the cluster tree tend to stretch beyond the bounding
box of the plot. This is a feature since cluster leaves the graphing process entirely
to graph(1G), which doesn't care about the length of strings. This can be corrected
by explicitly specifying an upper limit for the x coordinate.
The clustering algorithm used is simple-minded and slow.
SEE ALSO
graph(1G), plot(5), plot(1G), xplot(1), xgraph(1), xterm(1), psplot(1), curses(3),
lpr(1).
AUTHORS
Original version by Yoshiro Miyata (miyata@boulder.colorado.edu).
Minor fixes, various options, curses(3) support, graph(1G) output and PCA addition
by Andreas Stolcke (stolcke@icsi.berkeley.edu).
Scaling suggested by Steve Omohundro (om@icsi.berkeley.edu), don't care values suggested by Kim Daugherty (kimd@gizmo.usc.edu).
The algorithms for eigenvalue computation and Gaussian elimination were adapted
(but not copied) from Numerical Recipes in C by Press, Flannery, Teukolsky & Vetterling.
Finally, this program is freely distributable, but nobody should try to make money off
of it, and it would be nice if researchers using it acknowledged the people mentioned
above.

Appendix C
Genetic algorithms: the pga program
PGA was written by Peter Ross, based on an original version by Geoffrey Ballinger,
both at the University of Edinburgh. It is a simple testbed for multi-population
genetic algorithms, with a choice of some well-known problems and a range of options.
It is useful as a first introduction to genetic algorithms, but a package such as John
Grefenstette's Genesis or Art Corcoran's LibGA may provide more flexibility.

The problems
In each of the problems below a chromosome (potential solution of the problem) is a
string which is decoded in a problem-specific way and evaluated by a problem-specific
evaluation function. The function value itself is taken as the fitness function of a
chromosome, so the maximum of the function is also the maximum fitness. The -e
flag is used to select the evaluation function, and hence the problem.
max The problem is to maximise the number of 1-bits in the chromosome. The
fitness of a chromosome is just the number of 1-bits.
dj1 De Jong's first function: the problem is to maximise the function

f(x1, x2, x3) = 100 - x1^2 - x2^2 - x3^2

An N-bit chromosome is interpreted as three N/3-bit integers (last two bits
are ignored) and each integer is then rescaled to be a real number in the range
-5.12 to +5.12. The theoretical maximum fitness is of course 100.
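To make the encoding concrete, here is a rough sketch of how an N-bit chromosome might be decoded into scaled reals and evaluated for dj1 (illustrative only; pga's own decoding details may differ):

def decode(bits, n_vars, lo, hi):
    # Split the bit-string into n_vars equal-length fields, read each as an
    # unsigned integer and rescale it into the interval [lo, hi]; any left-over
    # bits at the end are ignored.
    size = len(bits) // n_vars
    values = []
    for i in range(n_vars):
        x = int(bits[i * size:(i + 1) * size], 2)
        values.append(lo + (hi - lo) * x / (2 ** size - 1))
    return values

def dj1(bits):
    # De Jong's first function evaluated on a bit-string chromosome.
    x1, x2, x3 = decode(bits, 3, -5.12, 5.12)
    return 100.0 - x1 * x1 - x2 * x2 - x3 * x3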
(Footnote: chapter written by Peter Ross, Dec 91; revised by him, Dec 93, Dec 94.)

dj2 De Jong's second function: the problem is to maximise the function

f(x1, x2) = 1000 - 100(x1^2 - x2)^2 - (1 - x1)^2

An N-bit chromosome is interpreted as two N/2-bit numbers, which are then
rescaled to be reals lying in the range -2.048 to +2.048. The theoretical maximum fitness is 1000.
dj3 De Jong's third function: the problem is to maximise the function

    f(x_1, ..., x_5) = 25 - \sum_{i=1}^{5} floor(x_i)

where the floor function returns the largest integer not larger than its argument.
An N-bit chromosome is interpreted as five N/5-bit numbers, which are then
rescaled to lie in the range -5.12 to 5.12. This function has many flat plateaux;
it would cause a hill-climbing search algorithm severe problems. The theoretical
maximum fitness is 55.
dj5 De Jong's fifth function: the problem is to maximise the function

    f(x_1, x_2) = 500 - 1 / (0.002 + \sum_{j=1}^{25} 1 / (j + (x_1 - a_{1j})^6 + (x_2 - a_{2j})^6))

where an N-bit chromosome is interpreted as two N/2-bit numbers, which are
then rescaled to lie in the range -65.536 to 65.536. The numbers a_{ij} are
constants, typically taken from the set {-32, -16, 0, 16, 32}, so that the function
has 25 equally-spaced and very pronounced maxima, only one of which is the
true global maximum. A hill-climbing search algorithm would be very unlikely
to find the global maximum. The theoretical maximum fitness is about
499.001997.
bf6 A modified form of the function binary F6 used by Schaffer et al (see Proceedings
of the Third International Conference on Genetic Algorithms, 1989) as a test
for comparing the efficiency of different forms of genetic algorithm. The problem
is to maximise the function

    f(x_1, x_2) = (\sin(\sqrt{x_1^2 + x_2^2}))^2 / (1 + 0.001(x_1^2 + x_2^2)^2)

where an N-bit chromosome is interpreted as two N/2-bit numbers, which are
then rescaled to lie in the range -100 to 100. This function has many maxima,
arranged in concentric rings about the origin; all points on the ring closest to
the origin (but not the origin itself) are the true global maxima. Each ring
of local maxima is separated from the next by a ring of local minima. The
theoretical maximum fitness of this function is about 0.994007.
mcbK where K is an integer in the range 1..9 inclusive. Alleles lie in the range
0..K inclusive. The fitness is the size of the longest sequence of equal alleles
(mcb stands for maximum contiguous block). Therefore there are (K + 1) equal
maxima. This function provides a simple way to explore how different maxima
invade territory in spatially-structured (cellular) GAs.
knap A one-dimensional knapsack problem. A target integer and a further set of
integers are read in from a file called weights or as specified by the -Fdatafile
flag; the size of the set can be as large as the size of a chromosome. The input
routine skips non-integers, so you can include textual comments if you like;
however, the very first item in the file must be the target value. The task is to
find a subset whose sum is as close to the given target as possible, preferably
exactly equal to it. The fitness function is

    f(x_1, ..., x_n) = 1 / (1 + |target - \sum_{i=1}^{n} x_i|)

so that it has maximum value 1.0, and is 0.5 or less unless the target is exactly
obtained.
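A minimal C sketch of this fitness calculation might look as follows. It is
illustrative only: the bit-per-item encoding and the weights array are assumptions
made for the example, not a description of pga's internals.

    #include <stdlib.h>

    /* Knapsack fitness: each chromosome bit says whether the corresponding
       weight is included in the subset.  Returns 1/(1 + |target - sum|),
       so the fitness is exactly 1.0 when the target is hit. */
    double knap_fitness(const char *chrom, const long *weights, int n, long target)
    {
        long sum = 0;
        int i;
        for (i = 0; i < n; i++)
            if (chrom[i] == '1')
                sum += weights[i];
        return 1.0 / (1.0 + (double) labs(target - sum));
    }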
rr The Royal Road function(s). In this problem each chromosome is regarded as
being composed of regularly-spaced non-overlapping blocks of K bits, separated
by a number of irrelevant bits. Various parameters control the scoring: b, g
and m* are integers and u*, u and v are real numbers. The low-level blocks are
of size b bits, with g irrelevant bits making up a gap between each. Each
low-level block scores 0 if completely filled, mv if it contains m set bits with
m <= m*, or -(m - m*)v if more than m* bits are set but the whole block is not
filled. Thus the low-level blocks are mildly deceptive. In addition there is a
hierarchy of completed blocks which earn bonus points. The hierarchy has a
number of levels; level 1 is the lowest, and level (j + 1) has half the number of
blocks that level j has - its blocks are adjacent pairs from the level below. If
level j has n_j > 0 filled blocks then it earns a bonus score of u* + (n_j - 1)u;
thus u* is a special bonus for getting at least one filled block at that level. The
topmost layer of the hierarchy has one block, which is filled if and only if every
block in the lowest level is filled. (An illustrative sketch of this scoring appears
at the end of this description.)
See the paper by M. Mitchell, S. Forrest and J. Holland, The royal road for
genetic algorithms, in (eds) F. J. Varela and P. Bourgine, Towards a Practice of
Autonomous Systems: Proceedings of the First European Conference on Artificial
Life, MIT Press, 1992, and a later paper by Mitchell and Holland available
by FTP from santafe.edu in /pub/mm and in Foundations of Genetic Algorithms
2, ed. L. D. Whitley, Morgan Kaufmann, 1993. These papers report that
such royal road functions are very hard for some GAs and analyse why. See
also the challenge issued by John Holland in the GA list, vol 7 no 22.
The various parameters can be specified in a file named rrdata or as specified
by the -Fdatafile flag; the program ignores non-numeric text and expects the
parameters to occur in a specific order. For example, the file might contain
this:
    These are the defaults specified in Holland's challenge:
    Blocksize (int)     8       Bits in a block
    Gapsize (int)       7       Irrelevant bits between blocks
    m* (int)            4       Threshold for reward in low-level blocks
    u* (double)         1.0     First block in a higher level earns this
    u (double)          0.3     Later blocks in a higher level earn this
    v (double)          0.02    Reward-or-penalty/bit in low-level block

    Note that Holland specifies number of levels (k) rather than
    a gapsize so that there are two-to-the-k blocks at the lowest
    level.

    This file is read by pga when you specify -err for Royal Road
    functions. It looks for numbers in this file, in the order
    and of the types shown above. Non-numeric text is ignored.
If the file is absent, or there are fewer than six numbers, the missing numbers
default to the values shown here.
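To make the Royal Road scoring described above concrete, here is a simplified C
sketch. It is one plausible reading of the description, not Holland's reference
implementation or pga's source; in particular, how an odd number of blocks is paired
at higher levels is left rough, and the parameter names simply mirror the rrdata file.

    /* Simplified Royal Road scoring sketch (illustrative only).
       nblocks >= 1 low-level blocks of b bits each, separated by g
       irrelevant bits; assumes nblocks <= 64. */
    double rr_fitness(const char *chrom, int nblocks, int b, int g,
                      int mstar, double ustar, double u, double v)
    {
        double score = 0.0;
        int filled[64];            /* 1 if low-level block i is complete */
        int i, j, n;

        for (i = 0; i < nblocks; i++) {
            int m = 0;
            for (j = 0; j < b; j++)
                if (chrom[i * (b + g) + j] == '1')
                    m++;
            filled[i] = (m == b);
            if (m == b)           ;                       /* rewarded via hierarchy */
            else if (m <= mstar)  score += m * v;         /* mild reward            */
            else                  score -= (m - mstar) * v;  /* mild penalty        */
        }

        /* Hierarchy bonus: level 1 is the low-level blocks; each higher
           level pairs up adjacent blocks of the level below.  A level with
           nj > 0 complete blocks earns ustar + (nj - 1) * u. */
        n = nblocks;
        for (;;) {
            int nj = 0;
            for (i = 0; i < n; i++)
                if (filled[i]) nj++;
            if (nj > 0)
                score += ustar + (nj - 1) * u;
            if (n == 1) break;
            for (i = 0; i < n / 2; i++)
                filled[i] = filled[2 * i] && filled[2 * i + 1];
            n /= 2;
        }
        return score;
    }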
tt Simple timetabling problems. The data describing the problem will be read from
the file ttdata (or the file specified by the -Fdatafile flag), whose format is
described separately below. This uses a penalty-function approach; the fitness
function is
    f(timetable) = 1 / (1 + \sum_i w_i c_i)

where w_i is the penalty for violating a constraint of type i and c_i is the number
of violated constraints of type i. How c_i is counted is somewhat under your
control. The problem description language is sufficiently flexible to be able to
handle a good many real timetabling problems; indeed, this version has been
able to solve some real-world problems and produce better results than humans
did!
The stand-alone decode filter program, if given the -t flag, will turn each chromosome it reads from standard input back into a human-readable timetable.

tt:E+S Better timetabling problems. This is like the previous problem, but uses
better mutation than random mutation. When mutating a timetable, an event
is selected and then a slot is chosen to which to shift that event. The E component sets the event selection: use r for random, w for roulette-wheel selection
based on how much of the total penalty each event is causing, and tM for
tournament selection with a tournament of size M to find an event which is
causing a lot of the total penalty. The S component sets the slot selection:
use r for random, f for a random free slot (that is, not subject to constraints
on the chosen event) or tN for tournament selection of size N to find a
good slot which reduces the penalty contribution from the chosen event. We
find that -ett:r+t5 is pretty good and reasonably fast. This smart mutation
only changes a single allele, so the mutation rate set by -m is not per bit but
per chromosome.
Please note: At the University of Edinburgh, Dave Corne and I are doing some
research on using GAs for timetabling. The approach used here in PGA is not
our current state-of-the-art, it is merely a simple version. A separate GA tool
called GATT specifically for timetabling will appear for FTP in due course.
However, we would be very pleased to hear of your successes or failures using PGA for timetabling, and we would be very pleased to receive the data
you used, to add to a collection of interesting timetabling problems also to be
made available by FTP. Please mail comments and/or timetabling problems to
peter@aisb.ed.ac.uk; it would help my mail sorter if you would put PGA
in the subject line. This research is supported by grant GR/J44513 from the
UK's Engineering and Physical Sciences Research Council (formerly SERC).
In the function optimisation problems (dj1, dj2, dj3, dj5, bf6) each chromosome represents a set of real numbers. The actual chromosomes can be dumped to a
file. A separate program called decode is provided which will allow you to recover the
real numbers from each chromosome, so that you can use one of the many standard
graph-plotting programs in order to see the distribution of these numbers and how it
changes as the GA runs. As noted above, decode can also be used to turn timetabling
chromosomes back into explicit timetables.

Running the program


If you run the program with the -h option you get brief help as shown in figure C.1.
The default values are shown in brackets. The program displays the chosen values
of the various parameters, and runs the genetic algorithm using the chosen problem
and parameters for the appropriate number of generations per population as set by
the -l (for limit) option. You can set this to a large number initially if you know
that not much of interest is going to happen for a while, since the number of cycles
to run for can be changed interactively (see below). This allows you to reduce the
number as, for example, the populations move closer to convergence. The program
will pause anyway if all populations appear to have converged, in the sense that the
average fitness of each population is within 10^-9 of the highest fitness of the same
population.
At each reporting interval, as set by the -i option, the display of the average
and best fitness is updated.

% pga -h
PGA: parallel genetic algorithm testbed, version 2.7
-P<n>     Set number of populations. (5)
-p<n>     Set number of chromosomes per population. (50)
-n<n>     Set chromosome length. (32)
-l<n>     Set # of generations per stage. (100)
-i<n>     Set reporting interval in generations. (10)
-M<n>     Interval between migrations. (10)
-m<n>     Set bit mutation rate. (0.02)
-c<n>     Set crossover rate (only for gen). (0.6)
-b<n>     Set selection bias. (1.5)
-a        Adaptive mutation flag (FALSE)
-t        Twins: crossover produces pairs (FALSE)
-g        In function optimisation, when decoding treat
          bit pattern as Gray code (FALSE)
-C<op>    Set crossover operator. (two)
-s<op>    Set selection operator. (rank)
-r<op>    Set reproduction operator. (one)
-e<fn>    Set evaluation function. (max)
-S<n>     Seed the random number generator. (from clock)
-NO<n>    Non-interactive, stop when One reaches <n>.
-No<n>    Like -NO<n> but also save final chromosomes.
-NA<n>    Non-interactive, stop when All reach <n>.
-Na<n>    Like -NA<n> but also save final chromosomes.
-F<file>  Use problem datafile <file> instead of default.
-h        Display this information.
<file>    Also log output in <file>. (none)
Crossover operators ... one, two, uniform, none.
Selection operators ... rank, fitprop,
                        tnK (K=integer > 1),
                        tmK (K=integer > 1).
Reproduction operators ... one, gen,
                           ssoneN (N=integer > 0),
                           ssgenN (N=integer > 0).
Evaluation functions ... max, dj1, dj2, dj3,
                         dj5, bf6, knap, tt,
                         tt:E+S (E=r/w/tM; S=r/f/tN),
                         mcbK (K in 1..9), rr.

Figure C.1: PGA help summary


If you name a file on the command line and it can be opened, the same information
at each reporting interval is appended to the file, for your later analysis. The file is
a plain text file and contains enough information, including the random number seed
used, to allow you to replicate that run precisely at any later date. Note that it is
possible to generate large files quickly by this means, so please tidy up after your
experiments by deleting or compressing such files. Simple AWK scripts are provided
which show how one might use such a file to plot a graph of average or best fitness,
in one or all populations, using xgraph.
PGA allows you to have multiple populations of chromosomes which evolve
separately, with occasional migration of a chromosome from one population to all
the others. Migration is controlled by the -M option, which selects the interval (in
generations) between migrations. At each migration, a chromosome is chosen from
one population (the first population is chosen for the first migration, the second for
the second and so on) according to the current selection procedure and copied into all
the other populations. The populations remain fixed in size, so this insertion causes
the least fit in a population to be destroyed. A migration interval of zero means that
no migrations ever happen.
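The migration step itself is simple. The following C fragment sketches it under the
assumption that each population is kept sorted fittest-first (as pga does for the
non-spatial operators); purely to keep the sketch short, the migrant is taken to be
the fittest member rather than one picked by the current selection procedure, so
this is an illustration of the scheme, not pga's source.

    #define NPOPS    5
    #define POPSIZE  50
    #define CHROMLEN 32

    typedef struct {
        char   bits[CHROMLEN + 1];
        double fitness;
    } Chrom;

    /* One migration step: copy a member of population 'src' over the
       least-fit member of every other population.  The caller is assumed
       to re-sort the receiving populations afterwards. */
    void migrate(Chrom pops[NPOPS][POPSIZE], int src)
    {
        int p;
        for (p = 0; p < NPOPS; p++) {
            if (p == src) continue;
            pops[p][POPSIZE - 1] = pops[src][0];   /* overwrite least fit */
        }
    }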
The selection procedure is determined by the -s option. There are four choices.
The fitprop option causes chromosomes in a population to be selected for reproduction or migration in proportion to their fitness. Obviously, the effect of this will
depend on whether fitness values lie in the range 0 to 100 (say) or 1000 to 1100 (say).
To counter this occasionally undesirable dependence on the range you can select the
rank procedure instead. This puts the chromosomes in rank ordering of fitness and
then selects from that ordered set with a probability proportional to ranking. If
the bias, as set by -b, is 1.0 then the probability is a linear function of the ranking
(first-ranked is most-favoured, and it behaves as fitprop would if the rankings were
the actual fitnesses); a bias of 2.0 favours the first-ranked markedly more strongly,
and the probability distribution is essentially parabolic. Bias values between 1.0 and
2.0 vary the distribution smoothly between these extremes. Tournament selection
can be chosen using the tnN option, where N is some positive integer. In tournament selection, N chromosomes are chosen uniformly randomly (that is, irrespective
of fitness; and a chromosome could be chosen more than once), and the fittest of
these is selected. For large N this process can be expensive, so a modified form of
tournament selection is also available using the tmN option. In this, one chromosome
is chosen at random, and then up to N tries are made at random to pick another
which is fitter than the first. The first such chromosome that is found is selected; if none is found, the first-chosen is selected. This procedure is modelled on the
classic marriage problem in dynamic programming textbooks in mathematics, and
is intended to overcome to some extent the bias that normal tournament selection
imposes, especially for anything other than very small values of N. For instance, if N=6
then normal tournament selection will only choose something in the lower half of the
population if all its choices are in the lower half; this happens with a probability of 1
in 2^6, because the fittest of the six chosen always wins. With tm6 selection, the fittest
of six chosen wins 54.7% of the time, the second-fittest wins 21.3% of the time, the
third-fittest wins 13% of the time and so on. Of course, tmN does not always have to
actually choose N chromosomes. As N increases, the proportion of the time that tmN
would choose the same fittest winner as tnN decreases steadily, although the expected
fitness of that fittest will also rise.
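The two tournament schemes can be sketched in a few lines of C. This is an
illustration, not pga's source: rnd() is an assumed helper returning a uniform random
integer in 0..n-1, and whether the initial pick in tmN counts as one of the N tries is
a detail the sketch does not settle.

    #include <stdlib.h>

    #define POPSIZE 50

    static int rnd(int n) { return rand() % n; }   /* uniform in 0..n-1 */

    /* tnN selection: pick N members uniformly at random (repeats allowed)
       and return the index of the fittest of them. */
    int select_tn(const double fitness[POPSIZE], int N)
    {
        int best = rnd(POPSIZE), i;
        for (i = 1; i < N; i++) {
            int c = rnd(POPSIZE);
            if (fitness[c] > fitness[best]) best = c;
        }
        return best;
    }

    /* tmN ("marriage problem") selection: pick one member at random, then
       make up to N further random picks and return the first one that
       beats it; if none does, return the original pick. */
    int select_tm(const double fitness[POPSIZE], int N)
    {
        int first = rnd(POPSIZE), i;
        for (i = 0; i < N; i++) {
            int c = rnd(POPSIZE);
            if (fitness[c] > fitness[first]) return c;
        }
        return first;
    }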
At each generation, reproduction occurs in each population. There are four
methods of reproduction, set by the -r option. The one method selects two chromosomes according to the current selection procedure, performs two-point crossover
on them to obtain a child or two, perhaps mutates it/them as well, and installs the
result back in that population; the least fit of the population is destroyed. Mutation
consists of randomly altering each bit with a probability determined by the -m option,
or adaptively if the -a option is used. Adaptive mutation happens with a probability
governed by the degree of similarity of the parents: the probability would be 0.0 for
two totally dissimilar parents, running up to the value specified by -m if the parents
are identical.
In gen reproduction, the whole of a population is potentially replaced at each
generation. The procedure is to loop N times where N is the population size, selecting
two chromosomes each time according to the current selection procedure. These are
crossed with a probability governed by the -c option. If crossover didn't happen
the parent is just retained. If crossover happened the child or children are also
mutated with probability determined by the -m option and the adaptive mutation
flag. The resulting chromosomes are inserted back into the population, the least fit
being destroyed as usual.
There are also two spatially-oriented reproduction operators, ssoneN and ssgenN,
where N is a positive integer. In these two, the populations are not sorted; instead
they are regarded as a two-dimensional grid, roughly square (the population size is
adjusted if necessary to ensure a rectangle, and the dimensions of the rectangle are
shown alongside the population size in the output). This grid is also toroidal: leave
one edge and you reappear at the opposite edge. In either, the chromosome at a given
location is replaced by the child of two parents, each found by seeking the fittest on
a random walk of length N starting at that location. Replacement only happens if
that child is fitter than the chromosome already there. In ssoneN a single random
location is chosen at each generation. In ssgenN a complete replacement population is
constructed by doing this for every location; the newly-constructed population then
replaces the old one. If -t (see below) is also specified then the parents produce
two children and the second gets put in the location immediately below the first in
the grid. If either of these operators is used, the selection procedure is forced to
be this special spatial selection process, shown as *spatial* in the output; it is an
error to specify something different. These spatially-oriented reproduction operators
are loosely modelled on Wright's shifting-balance model of evolution, whereas all the
others are based loosely on Fisher's panmictic model.
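The heart of both spatial operators is the random-walk parent search, which might
be sketched as follows. This is illustrative only: the exact walk pga uses, its length
bookkeeping and how it breaks ties may differ, and the grid dimensions are invented
for the example.

    #include <stdlib.h>

    #define ROWS 7
    #define COLS 7          /* population laid out as a toroidal grid */

    static int rnd(int n) { return rand() % n; }

    /* Find a parent for the cell (r, c): take a random walk of length len
       over the toroidal grid, starting at (r, c), and return the index of
       the fittest cell visited. */
    int walk_select(const double fitness[ROWS * COLS], int r, int c, int len)
    {
        static const int dr[4] = { -1, 1, 0, 0 };
        static const int dc[4] = { 0, 0, -1, 1 };
        int best = r * COLS + c, step;

        for (step = 0; step < len; step++) {
            int d = rnd(4);
            r = (r + dr[d] + ROWS) % ROWS;     /* wrap around the torus */
            c = (c + dc[d] + COLS) % COLS;
            if (fitness[r * COLS + c] > fitness[best])
                best = r * COLS + c;
        }
        return best;
    }

Two such walks give the two parents of the cell's replacement child; the child is only
installed if it is fitter than the chromosome already at that location, as described above.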
If you choose spatially-oriented reproduction and also specify an output file,
you will be given the option of writing out a map of the population to a file with
extension .map; the map is appended if the file exists already. In the map, the fittest
chromosome is labelled A, the next (or equal-) fittest is labelled B and so on. Up
to 26 distinct chromosomes may appear in the map; all the rest will be labelled '.'.
Remember that the 27th fittest will appear as '.' even though it may be exactly as fit
as the 26th, and so on. A dictionary, showing what each letter means, appears below
the map. The format of the dictionary is designed to make it easy to do a range of
simple processing tasks using a standard Unix tool such as awk.
The crossover procedure is selected by the -C option. The one option selects
one-point crossover: a random point along the chromosome is selected, and the child
contains a copy of one parent up to that point and the other parent thereafter. The
two option selects two-point crossover: two random points are selected, and the child
contains a copy of one parent between these two points and of the other parent before
and after these points. The uniform option selects uniform crossover: at each point
of the chromosome, a random choice is made between the two parents as to which
to copy at that point. There are important differences between these three kinds
of crossover; to see them, consider the unlikely event that two maximally dissimilar
parents are chosen (such as 000..0 and 111..1), and that the chromosome length is
L. With one-point crossover there are only 2(L+1) possible children. With two-point
crossover there are L(L + 1) possible children. With uniform crossover any child is
possible: there are 2^L possibilities. You can also use the none option to see what
happens when there is no crossover. Mutation and selection will then be the sole
sources of change. Even if you set the mutation rate to zero, selection will still drive
out the less fit because copies of the more fit will displace them; there will just not
be any new chromosomes.
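For reference, the three crossover operators can be sketched as follows in C. This is
an illustration of the definitions above, not pga's source; rnd() is an assumed helper
returning a uniform random integer in 0..n-1, and each routine produces a single child.

    #include <stdlib.h>

    static int rnd(int n) { return rand() % n; }

    /* One-point crossover: copy parent a up to a random cut point and
       parent b thereafter. */
    void cross_one(const char *a, const char *b, char *child, int len)
    {
        int cut = rnd(len + 1), i;
        for (i = 0; i < len; i++)
            child[i] = (i < cut) ? a[i] : b[i];
    }

    /* Two-point crossover: copy parent a between two random points and
       parent b before and after them. */
    void cross_two(const char *a, const char *b, char *child, int len)
    {
        int p1 = rnd(len + 1), p2 = rnd(len + 1), t, i;
        if (p1 > p2) { t = p1; p1 = p2; p2 = t; }
        for (i = 0; i < len; i++)
            child[i] = (i >= p1 && i < p2) ? a[i] : b[i];
    }

    /* Uniform crossover: at each position copy a randomly chosen parent. */
    void cross_uniform(const char *a, const char *b, char *child, int len)
    {
        int i;
        for (i = 0; i < len; i++)
            child[i] = rnd(2) ? a[i] : b[i];
    }

The twin of a child (see the -t flag below) is simply the complementary choice of
parent at every position.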
If the -t flag is used, crossover produces two complementary children (t for
twins) rather than one; this is often claimed to be useful to retain genetic diversity
in the population.
It is possible to change the length of a chromosome by using the -n option.
The default length is 32. The system will run slower if you increase the length
much; each chromosome is held as a character array and long chromosomes therefore
take longer to manipulate and decode. The upper limit on chromosome length is
dictated by the amount of memory available and the resolution of the chromosome
decoding algorithm if the problem chosen is such that the chromosome represents one
or more numbers. For shortish chromosomes, the decoding to two or more numbers
is done using integer arithmetic for speed; if you ask for long chromosomes floating
point arithmetic may be necessary, which further reduces the speed. The decoding
algorithm should cope with quite long chromosomes: each number could occupy up
to 1022 chromosome positions (bits).
In the function optimisation problems, a chromosome is interpreted as a number
of equal-sized chunks. For example, in the dj3 problem a chromosome is regarded as 5
chunks each of N/5 bits. A chunk is normally treated as an integer and used to locate
a value within the appropriate range for the problem. However, this leads to modest
deceptiveness since (say) the pattern 01111 represents a number adjacent to 10000
even though all five bits changed. One widely used cure for this is to employ Gray
coding, a reordering of the normal binary sequence such that adjacent integers are
always exactly one bit different. If you use the -g option, the program will suppose
that each chunk is a Gray-coded integer instead of a normal binary-coded integer.
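The usual binary-reflected Gray code does exactly this, and the conversions in both
directions are short. The sketch below is the standard construction; pga's own decoder
may differ in details such as bit ordering, so treat it as an illustration only.

    /* Binary-reflected Gray code: adjacent integers differ in exactly one
       bit of their Gray representation, removing the "Hamming cliff"
       between patterns such as 01111 and 10000. */
    unsigned long to_gray(unsigned long binary)
    {
        return binary ^ (binary >> 1);
    }

    /* Inverse mapping: recover the ordinary binary integer from its Gray
       representation (the direction a chromosome decoder needs). */
    unsigned long from_gray(unsigned long gray)
    {
        unsigned long binary = gray;
        while (gray >>= 1)
            binary ^= gray;
        return binary;
    }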
An example of the system's display is shown in figure C.2.

    (A)gain, (Q)uit, (C)ontinue:
    Populations:            5      Chromosomes per pop:  50
    Generations per stage:  100    Chromosome length:    32
    Reporting interval:     10     Reproduction:         one
    Migration interval:     10     Crossover type:       uniform, twins
    Eval function:          max    Crossover rate:       n/a
    Selection:              rank   Mutation rate:        0.020000
    Selection bias:         1.50

    Generation: 100        Evaluations so far: 500
    Pop.......Average..........Best.(max = 32)
    0         21.5200000      27.0000000
    1         20.7800000      24.0000000
    2         21.7800000      24.0000000
    3         22.1800000      25.0000000
    4         20.4400000      25.0000000

Figure C.2: Example of PGA screen display

An = will appear
to the right of the population number (Pop) whenever the average and best fitness
values become equal, to within 10^-9. The Evaluations so far shows the number of
times that the problem function has been called (after initialisation of the population(s)) to find the fitness of some chromosome; it gives some feel for the work done
by the genetic algorithm in finding the maximum fitness. Note that there are very
many more evaluations of the problem function per generation in gen reproduction
than in one reproduction, so the display will be updated much more slowly. You
may wish to reduce the generations per stage (-l) and reporting interval (-i) to
compensate.
The program cycles for the number of generations shown against Generations
per stage (the -l option), updating the average/best information every reporting
interval generations (the -i option), provided that not all populations have yet converged. It then pauses, and asks you
(A)gain, (Q)uit, (C)ontinue:
Type a to start again with the same parameters, basically to see if a random reinitialisation of the populations produces a significantly different result this time.
Type c to continue evolving for further generations. Type q to quit. If you want to
change the number of generations per stage before continuing, type a positive integer
and then c. The integer itself does not get echoed to the screen. To abort this
change, just type any command other than c, even an invalid command.
If you have specified an output file too, you will be asked
(A)gain, (Q)uit, (S)ave, (C)ontinue:
instead. Type s to append information about the chromosomes in a file with extension .chr; the program will not let you write out the same population twice by
mistake. The chromosomes are printed out in population order. A typical line looks
like:
    3    17    100    25.0000000    1001110...1    179

The first number is the population. The second is the number within this population
(except for spatially-structured reproduction, the population is ordered, fittest is
first). The third is the generation number when this line was written out. The fourth
is the fitness, this is followed by the chromosome itself. The last three numbers are
used to track a chromosome's parentage: each chromosome has a unique ID number,
starting from 1, and the first of these three numbers is this chromosomes ID. The
next two give the IDs of its parents. If both are zero it is an aboriginal chromosome
created at initialisation time.
If you have not specified an output file, you can still save the chromosomes
by typing s, although the prompt will not tell you so. This allows you to save
interesting-looking states even though you were dutifully trying not to waste file
space! The chromosome information is appended to a file called unnamed.chr.
If you have specified an output file and have also chosen one of the spatially-structured reproduction operators, the prompt will be:
(A)gain, (Q)uit, (S)ave, (M)ap, (C)ontinue:
Type m to get a map of each of the current populations appended to a file with the
extension .map. Information about the format of a map was given above.
Note that chromosome and map files, like the log file, can grow very fast. With
default settings, the chromosome file will increase in size by about 19k bytes every
time you press s! A separate document contains some basic ideas for analysing
populations using standard Unix tools.
PGA uses a random number generator that is seeded from the system clock.
However, the -S option allows you to seed it with a number of your choice so that you
can repeat runs exactly. The initial set of chromosomes will depend only on the seed
value and the -n, -P and -p parameters. This allows you to run precise experiments
on the effects of changing other parameters.
The -N (for noninteractive) flag suppresses screen output. PGA will merely
run for the number of generations specified by the -l flag or until a stopping condition
is met, whichever happens first, and will then exit. It will write reports to file every so
often, as specified by the -i flag in the normal way. Thus you can set up batch jobs.
Use -NO<n> to ask PGA to stop when the fittest member of any one population is at
least the (real) number <n> or when one has converged; use -NA<n> to ask it to stop
when the fittest member of all populations is at least the (real) number <n> or when
all have converged. In these tests, convergence is treated as though the population
had passed the threshold; this is a defence against you setting an unreasonable
threshold. If you use -No<n> or -Na<n> (that is, lower-case o and a) then the final
set(s) of chromosomes will also be saved.

Timetabling with PGA

The timetable problem is assumed to be in the file ttdata in the current directory;
you could use the -Fdatafile flag if you wanted to use a different file name. The
problem is one of assigning slots, such as timeslots, to each of a number of events.
Slots are assumed not to overlap; if two events are assigned different slots then there
can be no clash between them. A chromosome is an array of slot values, one per event;
thus the chromosome abcde.. is interpreted as meaning that event 0 happens in slot
number a (actually, in slot 'a'-'0', in the hope that chromosomes will not contain
control characters!), event 1 happens in slot b and so on. The fitness is calculated
by totalling the penalties for each constraint that this timetable violates, and then
computing 1/(1 + totalPenalty). The maximum fitness is thus 1.
Several kinds of constraints can be stated:
that each of a given set of events must happen in a separate slot from all the
others. If N events are mentioned, this really translates to N(N - 1)/2 distinct
constraints between pairs of events. Violations are called clashes, for obvious
reasons.
that an event must happen before every one of a set of N other events. This
translates to N distinct constraints between that event and one of the others.
that an event must not happen in any of N given slots. This translates to N
distinct constraints on the value of that events slot.
that an event must happen in a given slot. This does not contribute to the total
penalty; instead, a chromosome is altered to ensure that this happens, before
it is evaluated.
You can specify the penalty to be awarded for each violation of a constraint; the
penalty is the same for all violations of a given type, but can differ between types,
so that (typically) an exclusion violation attracts a large penalty, each clash attracts
a modest penalty and so on.
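As an illustration of this penalty scheme, the following C sketch evaluates a timetable
chromosome against clash constraints only; the other constraint types are handled
analogously. It is not pga's source, and the Clash structure is invented for the example.

    /* Each chromosome character encodes the slot of one event as
       (character - '0'); the fitness is 1/(1 + total penalty). */
    typedef struct { int ev1, ev2; } Clash;   /* "must be in different slots" */

    double tt_fitness(const char *chrom,
                      const Clash *clashes, int nclashes, double p_clash)
    {
        double penalty = 0.0;
        int i;
        for (i = 0; i < nclashes; i++) {
            int s1 = chrom[clashes[i].ev1] - '0';
            int s2 = chrom[clashes[i].ev2] - '0';
            if (s1 == s2)
                penalty += p_clash;           /* the two events clash */
        }
        return 1.0 / (1.0 + penalty);
    }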

The problem should be described using the following simple language. PGA
looks for a keyword at the start of a line (perhaps with white space beforehand) and
then reads the appropriate number of integers following that keyword, ignoring any
non-numeric stuff it finds. If the first word in a line does not look like a keyword but
PGA is looking for one (because it found all the arguments for the previous keyword),
then the whole line gets skipped. Thus you can put (say) a # at the start of a line
if you want PGA to ignore that line for the moment.
The keywords are:
events expects one integer: the number of events to timetable.
slots expects one integer: the number of slots in the whole timetable.
days the number of days. It assumes that there are the same number of slots per
day, except perhaps for the last day if slots is not exactly divisible by days.
near-clash expects an integer. Events that are not supposed to clash, and dont,
will still be penalised if they are this close or closer and are on the same day.
The penalty is p-near (the maximum penalty for a near-clash) divided by the
distance between the slots of the two events.
p-clash expects a real number. The penalty for a clash.
p-order expects a real number. The penalty for an order violation.
p-exclude expects a real number. The penalty for an exclusion violation.
p-near expects a real number. The maximum penalty for a near-miss by a pair of
events that are not supposed to clash. The penalty is p-near/gap.
separate expects any number of integers ON THE SAME LINE. All these are events;
all the events should occur in separate slots. If there are N integers, this is
treated as N(N - 1)/2 distinct clash constraints.
clash-type expects an integer which should be 1, 2 or 3. If the type is 1, then PGA
will assume that each clash constraint should appear no more than once, and
it will print a warning message each time it finds a duplicate, and it will ignore
such duplicates. You will be asked to press RETURN just before PGA finally
gets down to business; this gives you a chance to read the messages just before
the program clears the screen.
If the type is 2, PGA will still ignore duplicated clash constraints but it won't
print any warning messages. Use this to stop the program harassing you!
If the type is 3, PGA will accept duplicated clash constraints. This is useful if
you wish a constraint such as

separate events 2, 3 and 7

to mean (say) that a specific student takes exams 2, 3 and 7 and that clash
penalties should be awarded per student affected, so that a clash involving many
students attracts a much higher penalty than one involving just a few.
The default value is 1. The clash-type ought to be specified before the constraints, since the checking is done as each constraint is read in.
order expects any number of integers ON THE SAME LINE. All are events: the
first event must precede all of the others mentioned.
exclude expects any number of integers ON THE SAME LINE. The first is an event,
the rest are slots. The event must not happen in any of those slots.
preset expects two integers, an event and a slot. The event will be forced to happen
in that slot (in fact what the chromosome says for that event will be ignored).
You must specify the number of events and slots before you specify any constraints; PGA needs them in order to check that the numbers mentioned in constraints
are within legal range. PGA does not try to check that the constraints are consistent
- after all, you might even want to add inconsistent constraints for some reason.
Keywords can be abbreviated - it is up to you to make the abbreviation unambiguous, PGA will not check for ambiguity. The keywords and the number of
numbers each expects are defined in a single struct in eval.c, so you can change or
add keywords just by editing that struct.
Because PGA will ignore everything except (possibly abbreviated) keywords
and numbers, you have considerable flexibility about the layout. For example the file
might contain:
    This file defines a simple timetable problem.
    events      = 30
    slots       = 11
    days        = 1
    near-clash if same-day events are at most 1 apart
    # near-clash if same-day events are 2 or less apart
    The various penalties are as follows:
    p-clash     = 1.0
    p-near      = 0.0
    p-exclude   = 0.0
    p-order     = 0.0
    clash-type  = 1    so gripe about non-unique clash constraints
    The various constraints are:
    separate events 0 and 1 and 11 and 12 and 13
    separate events 0 and 11 and 17
    separate events 3 and 6
    order    event  0 before events 9 and 10
    exclude  event  0 from slots 2, 5 and 7
    preset   event  1 to be in slot 1
    separate events 2, 5, 8, 14, 23 and 26
    ..etc..

There can be some difference in success rate between clash-types 1 or 2 and
clash-type 3. Even if your data is in type 3 form, you may find it convenient to set the
clash-type to 2 initially. If PGA finds a maximum-fitness timetable then no student
will be affected by a clash anyway. If PGA cannot find a timetable of fitness 1, you
may then want to switch to clash-type 3 so that PGA tries to cut down on the number
of students affected. The resulting imperfect timetable may then be hand-craftable.
The decode program enables you to recover actual timetables from chromosomes,
for example by using the s command to save chromosomes to a *.chr file (perhaps
even in unnamed.chr), and then a Unix command such as:

    % awk '{print $5}' foo.chr | decode -t

will print each timetable for you to standard output.
The file ttdata.edai93 contains MSc exam timetabling data for the Department of AI and several other departments who are involved in the same modular MSc
programme. A perfect score is possible. The file ttdata.ael contains data from a
sixth-form college in England, who wished to timetable mock public exams for their
students over a six-day period, with two slots per day for the first five days and one
on the last day. A perfect score is not possible; the best we have achieved is to have
two students each taking two of the exams in one day.
Remember that in timetabling, alleles are not just binary-valued but can have
many different values; you may find that it pays to increase the population size
significantly.
If you use -ett:E+S you can adjust the timetable mutation process; see the
description at the start of this document. In this case you might like to specify
-m1.0 -a as well; this asks for mutation which is adaptive, rising to certainty if the
two parents are identical. Mutation in this case is not bit-mutation; just one allele is
changed per call of the mutation operator.
As mentioned earlier, it would be taken as a kindness if you would report your
data and results on any real timetabling problem to peter@aisb.ed.ac.uk, putting
PGA (with or without quotes) somewhere in the subject line.


Technical points
As mentioned above, each chromosome is held as a character array, so it should be easy
to adapt the source to a good range of other problems. Function pointers are used for
such operations as selection, reproduction and decoding in order to avoid the overhead
of conditional switching at each cycle. A simple reference-counting scheme is used to
decide which chromosomes are still in use and which can be freed, so memory usage is
independent of the number of cycles and/or number of restarts. This would be trivial
were it not for the periodic migration of chromosomes between populations; such
migrating chromosomes are not copied, for the sake of speed and memory turnover.

Appendix D
Hopfield Net Simulator
This appendix, written by John Hallam, describes his Hopfield Net Simulator, version 2.3.
The simulator is started by the Unix command hop.
The simulator accepts one- or two-letter commands as described below. It can
read commands from a file or from the terminal. Commands are executed at once.
The simulator implements a fully connected Hopfield Net with ±1 output units
and zero thresholds. A number of patterns can be entered for experimentation, and
individual patterns or the whole set can be trained into the net or tested for retrieval
by the net. Random patterns can be made and patterns used for retrieval can be
damaged by random noise before presentation to the net.
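The simulator's own source (in ~john/public/hop) is not reproduced here, but the
standard Hopfield prescription it is built around can be sketched briefly. The
following C fragment is a generic illustration: NUNITS and the unscaled Hebbian
weights are chosen for simplicity, and the simulator's own details may differ.

    #define NUNITS 20

    /* Train: add the Hebbian outer-product term for one +/-1 pattern to
       the weight matrix, keeping the diagonal (self-coupling) at zero. */
    void train(double w[NUNITS][NUNITS], const int pattern[NUNITS])
    {
        int i, j;
        for (i = 0; i < NUNITS; i++)
            for (j = 0; j < NUNITS; j++)
                if (i != j)
                    w[i][j] += (double) pattern[i] * pattern[j];
    }

    /* One asynchronous update of unit i with zero threshold: set the
       unit's output to the sign of its weighted input.  Returns 1 if the
       unit changed state (i.e. it was unstable). */
    int update_unit(const double w[NUNITS][NUNITS], int state[NUNITS], int i)
    {
        double net = 0.0;
        int j, new_out;
        for (j = 0; j < NUNITS; j++)
            net += w[i][j] * state[j];
        new_out = (net >= 0.0) ? 1 : -1;
        if (new_out == state[i]) return 0;
        state[i] = new_out;
        return 1;
    }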
The single letter commands accepted are listed below. <num> indicates an
integer; <real> indicates a decimal number between 0 and 1. Most commands are
accepted in either upper or lower case.
Commands can be read from a file or from the terminal; the program will also
read a file named as its first argument on the command line, after which it returns
to terminal input. Thus, if you have a prepared file of commands which you want to
run and collect output from, you could do:
hop my_cmd_file > my_result_file
provided you end your command file with a q command (otherwise the program will
read from your terminal but send prompts to your result file!). hop is the name of the
simulator. The source code is available, for those interested, in ~john/public/hop.
Q Quit from the program (maybe the most important command!)
N <sz> Set the number of units in the net; cannot be changed unless the network
is cleared. <sz> has one of two formats: <num> sets up a linear array of units;
<num>*<num> sets up a rectangular array (the first <num> is the count of rows,
the second that of columns) and causes pattern display to be formatted rectangularly too.

P <num> Set the number of pattern stores to be used. At most this many patterns
will be held by the simulator for testing the network with.
A Update in asynchronous mode: at each step choose an unstable unit at random,
update its output, and propagate the consequences (adjust the inputs to other
units).
S Synchronous update: update all the unstable units at each step and recalculate
input to each unit.
B<real> Set the badness of presentation patterns during retrieval. This determines how badly the presented patterns are corrupted by noise. It works by
generating a random pattern with badness*Nunits bits set, then for each such
set bit the corresponding bit in the presented pattern is chosen to be +1 or -1
with equal chance. Thus even with a badness of 1.00 it's unlikely that all the
bits in a pattern will be affected. (Why is this a good idea?) Unlike most
<real>s, the badness can be set to 1.
V (note, upper case) turn on debugging/tracing output
v (note, lower case) suppress tracing output. When debugging/tracing is on, the
simulator pauses after each display during a g<num> command output, waiting
for a <return>. This is to allow you to examine the display before it scrolls off
the screen. Pausing can be turned on and off using the v or dz commands.
The program prompts with : when it pauses. Typing q before <return> at
this prompt will turn pausing off.
F<file> Read input commands from <file> until it is exhausted. The <file>
should be a valid path name terminated with either a caret ^ or a newline
character; white space (even inside the path) is ignored.
R<num><ps> Read in pattern <num> into the pattern store. This doesn't alter the
state of the network. <num> must be between 1 and Npatterns. <ps> is a
pattern specifier which can take one of two forms: a <real> in [0,1) which
determines bit density in a pattern to be generated at random; or an explicit
string of bit states represented by + and -, of which there must be exactly
Nunits entries. The string can be split over multiple lines (interfoliated white
space is ignored) and the program prompts for continuations with the number
of bits entered so far.
Any pattern in the range 1..Npatterns can be entered as many times as desired; later entries overwrite earlier ones.
R0 <real> A pattern number of zero processes all patterns which are not yet defined;
it initialises them to random patterns with bit density given by <real>.


Z<file> Write the current pattern set out into <file> so that it can be read back
at a later time. The patterns are written out as r commands.
L <num> Train the network with pattern <num>; you'll get a complaint if you haven't
defined that pattern. A zero <num> trains with all patterns.
G <num> Present pattern <num> to the network and report what happens. The pattern is set onto the network units, corrupted if required by noise, and the network dynamics are iterated at most Nunits times. The actual input presented
and the output obtained are remembered for later investigation. If tracing is
on each iteration reports on its progress, printing the unit updated, the number
of stable units, and the hamming distance from the true pattern.
A <num> of zero presents all the defined patterns one after another and then
prints out a set of summary statistics number of iterations needed, hamming
distance of input and final output with respect to the true input pattern are
printed for each input pattern.
O<num> Compute the Hamming distance between the current (last tested) pattern
and pattern <num> for both actual input and final net output. If <num> is zero,
tabulate the distances from all defined patterns.
The simulator also supports the following two letter commands.
C Clear things; subcommands are
N Network: zero the number of units, clear the weights, don't alter the pattern
store.
P Patterns: zero the number of patterns, clear the store, don't alter the network.
W Clear the weight matrix; leave everything else alone.
D Clear the weight matrix diagonal (the self-coupling term for each unit); leave
everything else alone.
D Display things; subcommands are
O Display the current network output (last pattern used) and the input associated with it.
I<num> Display the input (weighted sum) to unit <num>; display the inputs for
all units if <num> is zero.
W<num> Display the input weights of unit <num>; if <num> is zero display all
the weights.
P<num> Display pattern <num>; if <num> is zero, display all defined patterns.


Z Enable display pausing. The simulator will pause after each display in trace
mode while running the net or while showing the summary for a g0
command. It emits a colon : prompt and waits for a <return>. Typing
q<return> turns pausing off at this prompt; useful if your hand is
tiring.
z Disable the pausing mechanism.
An example set of commands can be seen in the file synch in the program
directory. It does problem two of the tutorial exercises. The file asynch does
problem one.
A more complex example can be found in the file test in the program directory. This sets up a network with 20 units, asks for a 10-pattern store, initialises the
first four patterns to specified values, trains the network with all these patterns, tests
retrieval with them, fills out the pattern store with random patterns, and presents all
the patterns to the network for retrieval with zero badness. It then clears the weight
matrix diagonal and repeats the test.

Bibliography
[1] D.H. Ackley, G.E. Hinton, and T.J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9, 1985. Reprinted in [2].
[2] J.A. Anderson and E. Rosenfeld, editors. Neurocomputing: Foundations of Research. MIT Press, Cambridge, 1988.
[3] S. Brunak and B. Lautrup. Liniedeling med et neuralt netværk. Skrifter for Anvendt Matematik og Lingvistik, 14:55–74, 1989.
[4] J.L. Elman. Finding structure in time. Cognitive Science, 14:179–211, 1990.
[5] S.E. Fahlman. Fast-learning variations on back-propagation: An empirical study. In D. Touretzky, G. Hinton, and T. Sejnowski, editors, Proceedings of the 1988 Connectionist Models Summer School, pages 38–51, Pittsburgh 1988, 1989. Morgan Kaufmann, San Mateo.
[6] R.P. Gorman and T.J. Sejnowski. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks, 1:75–89, 1988.
[7] R.P. Gorman and T.J. Sejnowski. Learned classification of sonar targets using a massively-parallel network. IEEE Transactions on Acoustics, Speech, and Signal Processing, 36:1135–1140, 1988.
[8] D.O. Hebb. The Organization of Behavior. Wiley, New York, 1949. Partially reprinted in [2].
[9] G.E. Hinton and T.J. Sejnowski. Optimal perceptual inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 448–453, Washington 1983, 1983. IEEE, New York.
[10] G.E. Hinton and T.J. Sejnowski. Learning and relearning in Boltzmann machines. In D.E. Rumelhart and J.L. McClelland, editors, Parallel Distributed Processing, volume 1, chapter 7, pages 282–317. MIT Press, Cambridge, 1986.
[11] J.J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, USA, 79, 1982. Reprinted in [2].

[12] J.J. Hopfield and D.W. Tank. Neural computation of decisions in optimization problems. Biological Cybernetics, 52:141–152, 1985.
[13] J.J. Hopfield and D.W. Tank. Computing with neural circuits: A model. Science, 233:625–633, 1986.
[14] R.A. Jacobs. Increased rates of convergence through learning rate adaptation. Neural Networks, 1:295–307, 1988.
[15] M.I. Jordan. Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pages 531–546, Amherst 1986, 1986. Lawrence Erlbaum, Hillsdale.
[16] K. Asakawa and H. Takagi. Neural nets in Japan. Communications of the ACM, 37, 1994.
[17] T. Kohonen. Self-Organization and Associative Memory. Springer-Verlag, Berlin, 3rd edition, 1989.
[18] T. Kohonen, K. Makisara, and T. Saramaki. Phonotopic maps - insightful representation of phonological features for speech recognition. In Proceedings of the Seventh International Conference on Pattern Recognition, pages 182–185, Montreal 1984, 1984. IEEE, New York.
[19] S. Kullback. Information Theory and Statistics. Wiley, New York, 1959.
[20] A. Lapedes and R. Farber. Nonlinear signal processing using neural networks: Prediction and system modelling. Technical Report LA-UR-87-2662, Los Alamos National Laboratory, Los Alamos, NM, 1987.
[21] Y. Le Cun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, and L.D. Jackel. Handwritten digit recognition with a back-propagation network. In D.S. Touretzky, editor, Advances in Neural Information Processing Systems, volume 2, pages 396–404, Denver 1989, 1990. Morgan Kaufmann, San Mateo.
[22] W.S. McCulloch and W. Pitts. A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 1943. Reprinted in [2].
[23] M.L. Minsky and S.A. Papert. Perceptrons. MIT Press, Cambridge, 1969. Partially reprinted in [2].
[24] F.J. Pineda. Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59:2229–2232, 1987.
[25] P. Kanerva. Sparse Distributed Memory. MIT Press, 1988.
[26] F. Rosenblatt. Principles of Neurodynamics. Spartan, New York, 1962.

[27] D.E. Rumelhart, G.E. Hinton, and R.J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986. Reprinted in [2].
[28] T.J. Sejnowski and C.R. Rosenberg. Parallel networks that learn to pronounce English text. Complex Systems, 1:145–168, 1987.
[29] G. Tesauro and T.J. Sejnowski. A neural network that learns to play backgammon. In D.Z. Anderson, editor, Neural Information Processing Systems, pages 442–456, Denver 1987, 1988. American Institute of Physics, New York.
[30] P. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974.
[31] B. Widrow and M.E. Hoff. Adaptive switching circuits. In 1960 IRE WESCON Convention Record, volume 4, pages 96–104. IRE, New York, 1960.
[32] D.J. Willshaw, O.P. Buneman, and H.C. Longuet-Higgins. Non-holographic associative memory. Nature, 222, 1969. Reprinted in [2].
[33] D.J. Willshaw and C. von der Malsburg. How patterned neural connections can be set up by self-organization. Proceedings of the Royal Society of London B, 194:431–445, 1976.
