Engineering Applications
of Bio-Inspired
Artificial Neural Networks
International Work-Conference on
Artificial and Natural Neural Networks, IWANN'99
Alicante, Spain, June 2-4, 1999
Proceedings, Volume II
Springer
Series Editors
Gerhard Goos, Karlsruhe University, Germany
Juris Hartmanis, Cornell University, NY, USA
Jan van Leeuwen, Utrecht University, The Netherlands
Volume Editors
José Mira
Universidad Nacional de Educación a Distancia
Departamento de Inteligencia Artificial
Senda del Rey, s/n, E-28040 Madrid, Spain
E-mail: jmira@dia.uned.es
Juan V. Sánchez-Andrés
Universidad Miguel Hernández, Departamento de Fisiología
Centro de Bioingeniería, Campus de San Juan, Apdo. 18
Ctra. Valencia, s/n, E-03550 San Juan de Alicante, Spain
E-mail: juanvi@umh.es
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are
liable for prosecution under the German Copyright Law.
Fifty years after the publication of Norbert Wiener's book on Cybernetics and
a hundred years after the birth of Warren S. McCulloch (1898), we still have
a deeply-held conviction of the value of the interdisciplinary approach in the
understanding of the nervous system and in the engineering use of the results of
this understanding. In the words of N. Wiener, "The mathematician (nowadays
also the physicist, the computer scientist, or the electronic engineer) need not
have the skill to conduct a physiological experiment, but he must have the skill
to understand one, to criticize one, and to suggest one. The physiologist need
not be able to prove a certain mathematical theorem (or to program a model
of a neuron or to formulate a signaling code...) but he must be able to grasp
its physiological significance and to tell the mathematician for what he should
look". We, like Wiener, had dreamed for years of a team of interdisciplinary
scientists working together to understand the interplay between Neuroscience
and Computation, and "to lend one another the strength of that understanding".
The basic idea during the initial Neurocybernetics stage of Artificial Intelli-
gence and Neural Computation was that both living beings and man-made
machines could be understood using the same organizational and structural
principles, the same experimental methodology, and the same theoretical
and formal tools (logic, mathematics, knowledge modeling, and computation
languages).
This interdisciplinary approach has been the basis of the organization of
all the IWANN biennial conferences, with the aim of promoting the interplay
between Neuroscience and Computation, without disciplinary boundaries.
IWANN'99, the fifth International Work-Conference on Artificial and Natural
Neural Networks, which took place in Alicante (Spain), June 2-4, 1999, focused on
the following goals:
I. Developments on Foundations and Methodology.
II. From Artificial to Natural: How can Systems Theory, Electronics, and
Computation (including AI) aid in the understanding of the nervous system?
III. From Natural to Artificial: How can understanding the nervous system
help in obtaining bio-inspired models of artificial neurons, evolutionary
architectures, and learning algorithms of value in Computation and Engineering?
IV. Bio-inspired Technology and Engineering Applications: How can we ob-
tain bio-inspired formulations for sensory coding, perception, memory, decision
making, planning, and control?
IWANN'99 was organized by the Asociación Española de Redes Neuronales,
the Universidad Nacional de Educación a Distancia, UNED (Madrid), and the
Instituto de Bioingeniería of the Universidad Miguel Hernández, UMH (Alicante),
in cooperation with IFIP (Working Group in Neural Computer Systems,
WG10.6), and the Spanish RIG of the IEEE Neural Networks Council.
Sponsorship was obtained from the Spanish CICYT and DGICYT (MEC),
the organizing universities (UNED and UMH), and the Fundación Obra Social
of the CAM.
The papers presented here correspond to talks delivered at the conference.
After the evaluation process, 181 papers were accepted for oral or poster presentation,
according to the recommendations of the reviewers and the authors' preferences.
We have organized these papers in two volumes arranged basically following
the topics list included in the call for papers. The first volume, entitled
"Foundations and Tools in Neural Modeling", is divided into three main parts
and includes the contributions on:
1. Neural Modeling (Biophysical and Structural Models).
2. Plasticity Phenomena (Maturing, Learning and Memory).
3. Artificial Intelligence and Cognitive Neuroscience.
In the second volume, with the title, "Engineering Applications of Bioin-
spired Artificial Neural Nets", we have included the contributions dealing with
applications. These contributions are grouped into four parts:
1. Artificial Neural Nets Simulation and Implementation.
2. Bio-inspired Systems.
3. Images.
4. Engineering Applications (including Data Analysis and Robotics).
We would like to express our sincere gratitude to the members of the orga-
nizing and program committees, in particular to F. de la Paz and J.R. Álvarez,
to the reviewers, and to the organizers of invited sessions (Bahamonde, Barro,
Benjamins, Cabestany, Dorronsoro, Fukushima, González-Cristóbal, Jutten,
Millán, Moreno-Arostegui, Taddei-Ferretti, and Vellasco) for their invaluable effort
in helping with the preparation of this conference. Thanks also to the invited
speakers (Abeles, Gordon, Marder, Poggio, and Schiff) for their effort in prepar-
ing the plenary lectures.
Last, but not least, the editors would like to thank Springer-Verlag, in partic-
ular Alfred Hofmann, for the continuous and excellent collaboration
from the first IWANN in Granada (1991, LNCS 540) through the successive meetings
in Sitges (1993, LNCS 686), Torremolinos (1995, LNCS 930), and Lanzarote
(1997, LNCS 1240), and now in Alicante.
The theme for the 1999 conference (from artificial to natural and back again),
focused on the interdisciplinary spirit of the pioneers in Neurocybernetics
(N. Wiener, A. Rosenblueth, J. Bigelow, W.S. McCulloch, W. Pitts, H. von
Foerster, J.Y. Lettvin, J. von Neumann, ...) and the thought-provoking meet-
ings of the Macy Foundation. We hope that these two volumes will contribute
to a better understanding of the nervous system and, equally, to an expansion
of the field of bio-inspired technologies. For that, we rely on the future work of
the authors of these volumes and on our potential readers.
Field Editors
Bio-inspired Systems
EEG-Based Cognitive Task Classification with ICA and Neural Networks . 265
D.A. Peterson, C. W. Anderson
Images
Application of the Fuzzy Kohonen Clustering Network to Biological
Macromolecules Images Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
A. Pascual, M. Bárcena, J.J. Merelo, J.-M. Carazo
Engineering Applications
How to Select the Inputs for a Multilayer Feedforward Network by Using
the Training Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
M. Fernández Redondo, C. Hernández Espinosa
Dendritic [Ca2+] Dynamics in the Presence of Immobile Buffers and of Dyes 43
M. Maravall, Z.F. Mainen, K. Svoboda
Gaze Control with Neural Networks: A Unified Approach for Saccades and
Smooth Pursuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
M. Pauly, K. Kopecz, R. Eckhorn
The Neural Net of Hydra and the Modulation of Its Periodic Activity . . . . . 123
C. Taddei-Ferretti, C. Musio
Solving the Packing and Strip-Packing Problems with Genetic Algorithms. 709
A. Gómez, D. de la Fuente
Alfred Strey
1 Introduction
Many artificial neural network models have been proposed and successfully ap-
plied to technical problems. They always use simple rate-coded neurons. Also,
several neurosimulators have been implemented to simplify the development of
neural applications. Often they are based on neural simulation kernels contain-
ing optimized realizations of a few artificial neural network models and learning
algorithms (e.g. SNNS [17], NeuralWorks [9]). Alternatively, several neural spec-
ification languages (like AXON [5], CONNECT [8], EpsiloNN [12]) allow a more
or less flexible description of artificial neural networks. However, there is a
current trend in neural network research towards more biologically plausible neu-
ral networks. Several experimental results and theoretical studies show that the
timing and temporal correlation of neuron activity are relevant in neural signal
processing [2] [11]. To study the behaviour of such neural networks, only a few
specialized simulation tools exist. They support the neural network simulation
on only one of several abstraction levels.
Neurosimulators like GENESIS [1] or NEURON [6] are specialized for the
simulation of multi-compartment models. Here the biophysical processes of each
neuron are simulated on a microscopic level. The spatial extension of a neuron
is considered by partitioning the neuron model into several compartments: soma,
axon, and many dendritic compartments. Each compartment is modelled by a dif-
ferential equation which represents the behaviour of the internal cell membrane.
For the simulation of the complete neural network, a system of many coupled dif-
ferential equations has to be solved numerically. Due to the high computational
effort, only small neural networks can be simulated.
On a more abstract level, the neuron is modelled by a single compartment.
The detailed internal mechanism of the cell membrane is hidden. Each neuron
generates a spike if its input signals fulfil a certain condition (e.g. if the total
input potential exceeds a certain threshold). The spike impulse is propagated to
succeeding neurons by synaptic connections. Here the information is weighted
and a postsynaptic potential according to an impulse response function is gener-
ated. The spatial aspect is reflected in delays: an impulse of a distant neuron may
be more delayed than the impulse of a neighbouring neuron. A typical simulator
supporting this abstraction level is SimSPiNN [15].
The simulation of pulse-coded neural networks can be further simplified if
delays are not supported, so that the spatial extension of the neural network is
fully ignored. This is realized in several neurosimulators like MDL [14] or NEXUS
[10]. Neurosimulators for artificial neural networks like the SNNS mentioned
above are suitable only for rate-coded neuron models. They do not support
temporal behaviour. Also, a restricted underlying neural network model often
allows only the simulation of a limited subset of artificial neural networks.
So the user must first determine the abstraction level and then select an
appropriate simulation tool. A change of level either requires the use of an-
other simulation tool or results in an inefficient simulation. Also, hybrid network
architectures consisting of artificial and biology-oriented models cannot be sim-
ulated. To overcome this problem, a universal neurosimulator capable of efficient
neural network simulation on different abstraction levels is highly desirable.
In this paper a unified neural network model for the simulation of neural
networks on several abstraction levels is presented. The focus is on the modelling
of neurons consisting of a single compartment, although multi-compartmental
neurons can easily be mapped onto the model too. The resulting unified model,
which is described in Sect. 2, represents a basis for the design of a universally
applicable neurosimulator. The neural specification language EpsiloNN [12] [13],
originally developed by the author for the simulation of artificial network models,
has been redesigned to incorporate the new unified model. Its extensions to
also support biology-oriented neural networks are summarized in Sect. 3.
are used for the description of the model behaviour. After discretization, they
can be simulated by a difference equation of the form

    v(t) = α · v(t − Δt) + u(t),   with decay constant α = exp(−Δt/τ).

Throughout this paper it is assumed that such difference equations with expo-
nential decay constants α are sufficient for a description and a correct computer
simulation of the model behaviour.
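As an illustration, a minimal sketch of such a discretized leaky integrator (the function names and the particular first-order form are illustrative assumptions, not part of the paper):

```python
import math

def decay_constant(tau, dt):
    # alpha = exp(-dt/tau): per-step decay factor of the discretized integrator
    return math.exp(-dt / tau)

def step(v_prev, u, alpha):
    # v(t) = alpha * v(t - dt) + u(t)
    return alpha * v_prev + u

# Example: a potential with tau = 10 ms, simulated at dt = 1 ms
alpha = decay_constant(10.0, 1.0)
v = 1.0
for _ in range(10):
    v = step(v, 0.0, alpha)   # pure decay, no input
```

After 10 steps of pure decay, v equals exp(−10·Δt/τ) = exp(−1) of its initial value, matching the continuous exponential solution at the sampling points.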
Each neuron (see Fig. 1) consists of several input groups (also called sites), by
which it receives signals from other neurons or external inputs. A site contains all
inputs with similar characteristics from a certain part of the dendritic tree. The
signals u^(j)(t) = (u_1^(j), ..., u_maxj^(j)) of each site j are combined by an arbitrary func-
tion, resulting in an input potential p^(j)(t) = f^(j)(u^(j)(t), p^(j)(t − Δt)), which
may be excitatory (p^(j) > 0) or inhibitory (p^(j) < 0). The total neural activity x
(also called neuron potential or membrane potential) is calculated by the follow-
ing function from all k input potentials: x(t) = fact(p^(1)(t), ..., p^(k)(t), x(t − Δt)).
The neuron's output potential y(t) (also called axonal potential) is com-
puted from the neuron's activity x by applying an arbitrary output function:
y(t) = fout(x(t), β(t)). The sigmoidal function y(t) = 1/(1 + e^(−x(t))), the Gaus-
sian function y(t) = exp(−x(t)²/β(t)²), or a threshold function are often used here. In
biology-oriented simulations the output y may be delayed by an axonal delay
Fig. 1. Model of a neuron.
Δ^(n) = d · Δt, which is a multiple d of the time step. The output functions often
need a parameter β or a parameter vector β = (β_1, ..., β_max). It may repre-
sent e.g. a threshold Θ or a variance σ². The parameter β may also be adapted
by a function β(t + 1) = f_β(β(t), x(t), y(t), l(t)). In biology-oriented neural net-
works the parameter β often describes a dynamic threshold by which a refractory
mechanism is realized: after the generation of a spike for x > Θ at time step t^(s), the
threshold is raised to a high value Θ^(s) to prevent the neuron from spiking again.
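The dynamic-threshold mechanism can be sketched as follows; the exponential decay of the raised threshold back to its resting value is an assumption for illustration (the paper leaves the exact adaptation function f_β to the user), and all names and constants are illustrative:

```python
import math

def simulate_neuron(inputs, theta0=1.0, theta_spike=5.0, tau_theta=5.0, dt=1.0):
    """Single-compartment spiking neuron with a dynamic threshold.
    After each spike the threshold jumps to theta_spike and (assumed
    here) decays exponentially back toward the resting value theta0."""
    alpha = math.exp(-dt / tau_theta)
    theta = theta0
    spikes = []
    for x in inputs:
        y = 1 if x > theta else 0                       # threshold output function
        if y:
            theta = theta_spike                         # refractory: raise threshold
        else:
            theta = theta0 + alpha * (theta - theta0)   # decay back to rest
        spikes.append(y)
    return spikes

# constant supra-threshold input: the neuron fires, then stays silent
# until the raised threshold has decayed below the input level again
spikes = simulate_neuron([2.0] * 20)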
In adaptive neural networks, also a learning potential has to be computed:
l(t) = f_learn(x(t), y(t), teach(t), e(t)). It is required by the incoming synapses for
learning (compare Sect. 2.2). In the case of Hebbian learning, l(t) is identical with
the activity x(t). If supervised learning algorithms are used, l(t) either depends
on an externally supplied teacher signal (e.g. l(t) = teach(t) − y(t)) or on an
internal error potential e(t) = f_d(d(t)) which is calculated from the elements of
a difference vector d = (d_1, ..., d_max) received from succeeding neurons
via synaptic connections (e.g. e(t) = Σ_j d_j).
Each synapse (see Fig. 2) represents a connection between two neurons or be-
tween an external network input node and a neuron. It has at least one in-
put in, one output out, and at most one weight value w which represents the
synaptic strength. In artificial neural network models, the output out(t) =
f_prop(in(t), w(t)) is computed from the presynaptic potential in and the weight w.
Fig. 2. Model of a synapse.

In biology-oriented models, each presynaptic spike induces a postsynaptic
potential described by a response function

    ε(t) = exp(−(t − Δ^(s))/τ_1) − exp(−(t − Δ^(s))/τ_2)  if t ≥ Δ^(s),  else 0   (4)
with sharp rise and exponential decay. Here Δ^(s) represents the synaptic delay.
The actual output value out(t) = Σ_i w · ε(t − t_i^(s)) is a superposition of the
response functions induced by all previous spikes at time steps t_i^(s). However,
not all time steps of previous spikes must be stored in each synapse. The output
value can more easily be calculated by the following recursive equations:

    out_1(t) = a_1 · out_1(t − Δt) + w · in(t − Δ^(s))
    out_2(t) = a_2 · out_2(t − Δt) + w · in(t − Δ^(s))
    out(t) = out_1(t) − out_2(t)   (5)
Here out_1(t) and out_2(t) represent the parts of the output signal that result from
the first and second exponential term of Eq. 4. The values a_1 = exp(−Δt/τ_1)
and a_2 = exp(−Δt/τ_2) indicate the corresponding decay constants.
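The equivalence of the two formulations can be sketched numerically; this is an illustrative reconstruction (names and the unit synaptic delay are assumptions), comparing the recursive form against the explicit superposition:

```python
import math

def psp_response(spike_times, w, tau1, tau2, dt, steps):
    # Recursive form: out1/out2 carry the two exponential terms of the
    # response function, so no spike history is stored in the synapse.
    a1, a2 = math.exp(-dt / tau1), math.exp(-dt / tau2)
    out1 = out2 = 0.0
    spikes = set(spike_times)
    trace = []
    for t in range(steps):
        impulse = w if t in spikes else 0.0
        out1 = a1 * out1 + impulse
        out2 = a2 * out2 + impulse
        trace.append(out1 - out2)
    return trace

def psp_direct(spike_times, w, tau1, tau2, dt, steps):
    # Explicit superposition of the response functions of all previous spikes.
    return [sum(w * (math.exp(-(t - ts) * dt / tau1)
                     - math.exp(-(t - ts) * dt / tau2))
                for ts in spike_times if t >= ts)
            for t in range(steps)]
```

Both routines produce the same trace, but the recursive version needs only two state variables per synapse instead of the full spike history.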
More generally, the actual synaptic output value out(t) can be described by a
function f_prop(in(t − Δ^(s)), out(t − Δt), w(t)) that depends on the synaptic delay
and the past output value. The synaptic delay of each synapse is modelled by a
(not adjustable) multiple of the time step: Δ^(s) = d · Δt. It can be realized by
an internal FIFO buffer containing at least the input signals of the last d time
steps and a demultiplexer for selecting the correct value of time step t − Δ^(s).
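Such a FIFO delay buffer might be sketched like this (the class name and interface are illustrative assumptions):

```python
from collections import deque

class DelayLine:
    """FIFO buffer realizing a delay of d time steps:
    the value written at step t is read back at step t + d."""
    def __init__(self, d):
        # a bounded deque keeps exactly the last d input values
        self.buf = deque([0.0] * d, maxlen=d) if d > 0 else None

    def push(self, value):
        if self.buf is None:          # zero delay: pass through
            return value
        delayed = self.buf[0]         # oldest entry = in(t - d*dt)
        self.buf.append(value)        # maxlen discards the oldest entry
        return delayed
```

For example, with d = 2 the first two reads return the initial zeros, and the value pushed at step t reappears two steps later.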
Learning depends on the presynaptic potential in(t) and a postsynaptic po-
tential post(t) (usually the learning potential l(t) of the postsynaptic neuron,
compare Sect. 2.1): Δw(t) = f_learn(in(t − Δ^(s)), post(t), w(t − Δt)). Often Heb-
bian learning is used here: Δw(t) = η · in(t − Δ^(s)) · post(t). It may be combined
with a decay term w(t) = γ · w(t − Δt) + Δw(t) to realise a forgetting mechanism.
Many learning functions depend on a local parameter γ or on a local parameter
vector γ = (γ_1, ..., γ_max). These parameters may also be updated during learning
by a local function f_γ(γ(t − Δt), in(t − Δ^(s)), post(t)).
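A one-line sketch of the Hebbian rule combined with the decay term (the parameter names η and γ and their values are illustrative assumptions):

```python
def hebbian_update(w, pre, post, eta=0.1, gamma=0.99):
    """Hebbian weight change with multiplicative decay (forgetting):
    delta_w = eta * pre * post, then w(t) = gamma * w(t - dt) + delta_w."""
    return gamma * w + eta * pre * post

w = 0.5
w = hebbian_update(w, pre=1.0, post=1.0)  # correlated activity strengthens w
```

With γ < 1, a weight that no longer receives correlated pre/postsynaptic activity decays toward zero, realizing the forgetting mechanism.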
In most neural network models, a synapse represents a unidirectional connec-
tion. However, in several supervised learning algorithms for multi-layer networks
there is also a flow of (weighted) error information in the reverse direction. So
each synapse requires an additional output back(t) = f_back(post(t), w(t)).
    c(i_x, i_y) = (i_x, i_y)                          if s_x = d_x and s_y = d_y
    c(i_x, i_y) = (⌈i_x·s_x/d_x⌉, ⌈i_y·s_y/d_y⌉)      else   (6)
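Equation 6 translates directly into a small helper (the function name is an illustrative assumption):

```python
import math

def corresponding_node(ix, iy, sx, sy, dx, dy):
    """Map destination node (ix, iy) to its corresponding source node.
    If the source population (sx x sy nodes) and the destination
    population (dx x dy nodes) have the same size, the index is kept;
    otherwise it is rescaled with a ceiling (Eq. 6)."""
    if sx == dx and sy == dy:
        return (ix, iy)
    return (math.ceil(ix * sx / dx), math.ceil(iy * sy / dy))
```

For example, mapping from a 50 x 50 destination onto a 100 x 100 source doubles the indices, while equal sizes leave them unchanged.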
[Figure: connection topologies: full, map (intra), and corresponding map]
The unified neural network model presented in the previous sections can either
directly be implemented in a neural simulation kernel or it can be used as an
underlying model for a neural specification language.
A neural simulation kernel allows an efficient simulation because all required
activation, output, propagate, and learning functions of the synapse and neuron
models can be optimally implemented in the simulation kernel. The user must
only select such functions, several parameters, and (possibly) the network topol-
ogy, which can be done by a configuration file or a graphical interface. However,
the flexibility is limited: only the parameters and functions predefined in the
kernel can be used. For each new parameter or new function the kernel source
code has to be extended and recompiled.
A neural specification language allows the description of all neural networks
that conform to the underlying formal model. A compiler translates the
specification into simulation code. This methodology is rather flexible because
the specification language allows the definition of any arbitrary neural function
that can be expressed by the language and an arbitrary number of internal pa-
rameters. The abstract high-level syntax follows the neural network terminology
and allows a concise and unambiguous neural network specification. Thus it also
simplifies the interchange of specifications between neural network researchers from
different disciplines. Furthermore, the specification is also independent of the
target computer architecture. Compilers for parallel computers can be imple-
mented too. Thus, it represents the preferable approach for the implementation
of a neurosimulator.
Usually a time-driven simulation is realized. In each simulation step the vari-
ables of all neural objects are updated, so the network behaviour is simulated
exactly. However, in many simulations of spiking neural network models the
mean network activity is low: at most a few neurons generate a spike in each
time step. Here an event-driven simulation can be used to improve the efficiency
of the simulation [16]. Each spike is considered as an event which is character-
ized by the time step t^(s) and the index i of the spike-generating neuron. A
central event list contains all spikes in temporal order. Only those synapses w_ij
connected to spiking neurons must be simulated for a certain period of time,
starting at time step t^(s) + Δ_i^(n) + Δ_ij^(s) and ending when the induced postsynap-
tic potential is again negligibly small (i.e. below a certain threshold). Selecting
an appropriate threshold represents a compromise between a high efficiency and
a high simulation accuracy. The event-driven simulation is especially interesting
for special-purpose hardware [4] [7]. It can also be included in a neurosimula-
tor if the compiler can generate event-driven simulation code from a network
specification.
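A minimal sketch of this scheme, under simplifying assumptions not in the paper (unit delay, a single-exponential postsynaptic response, and all names illustrative):

```python
import heapq
import math

def event_driven_run(spikes, weights, tau, dt=1.0, cutoff=1e-3, horizon=100):
    """Event-driven sketch: each spike (t, i) is queued in a central event
    list; only the synapses of spiking neurons are evaluated, and each
    induced PSP is followed only until it falls below `cutoff`."""
    events = list(spikes)           # events are (time_step, neuron_index)
    heapq.heapify(events)           # central event list in temporal order
    potentials = {}                 # (target, time_step) -> accumulated PSP
    alpha = math.exp(-dt / tau)
    while events:
        t_s, i = heapq.heappop(events)
        for j, w in weights.get(i, {}).items():
            psp, t = w, t_s + 1     # PSP starts one step after the spike
            while abs(psp) > cutoff and t < t_s + horizon:
                potentials[(j, t)] = potentials.get((j, t), 0.0) + psp
                psp *= alpha        # exponential decay of the response
                t += 1
    return potentials
```

The `cutoff` parameter is the threshold mentioned above: a larger value ends each PSP trace earlier (faster, less accurate), a smaller value follows it longer.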
In the unified neural network model the postsynaptic potential is modelled (in
accordance with biology) by the synaptic propagate function f_prop. The neuron
simply adds all incoming potentials. For the implementation, however, it is more
efficient to combine the calculation of all postsynaptic potentials with identical
time constants in the postsynaptic neuron: it adds the weighted input signals
and computes all impulse response functions locally.
The neural specification language EpsiloNN has been designed especially for the
simulation of artificial neural networks on different computers [12] [13]. To also
support biology-oriented neural networks, the language has been redesigned.
First, the underlying neural network model has been extended in accordance
with the unified model presented in the previous section. Secondly, several new
language constructs have been included in the latest EpsiloNN release to support
all new features of the underlying model:
- Two-dimensional populations of neurons or input/output nodes can be spec-
ified (e.g. spiking_neuron pop1[50][50]) and connected by all topologies.
- The new field topology is available for a simple specification of topographic
maps. Here the user specifies the names of the source and destination populations,
the size k_x x k_y of the neighborhood (in the example below 7 x 11), and the
neuron input/output variables that are connected by the map, e.g.:
map_synapse net = {field, pop1, pop2, 7, 11, "init.map", zero,
in = pop1.y, out = pop2.y}
Initial weights w_xy (identical for all instances of the map) may be read
from an optional initialization file. Alternatively, the weights can be set by
a user-defined function (randomly or dependent on the indices of the source
and destination nodes). Also, arbitrary learning functions can be defined for
updating the weights according to presynaptic and postsynaptic potentials.
Thus, the weights can differ in different instances of each map.
All connections required for the topographic map will automatically be built
by the simulator (also if the sizes of the source and destination population are
different, see Sect. 2.3). At the borders of the source population only a partial
map can be realized, because the source nodes (i_x + x, i_y + y) do not exist for all
x ∈ {−(k_x − 1)/2, ..., (k_x − 1)/2} and all y ∈ {−(k_y − 1)/2, ..., (k_y − 1)/2}.
The connections to the missing nodes are either truncated by the option
zero (default), or the source nodes ((i_x + x) mod s_x, (i_y + y) mod s_y) at the
opposite side of the source population are used instead (option cyclic).
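The two border-handling options can be sketched as follows (the function name and interface are illustrative assumptions):

```python
def source_index(i, offset, size, mode="zero"):
    """Border handling for a topographic map: the source node index
    i + offset may fall outside the population of `size` nodes.
    Mode 'zero' truncates the connection; 'cyclic' wraps around."""
    j = i + offset
    if 0 <= j < size:
        return j
    if mode == "cyclic":
        return j % size
    return None          # 'zero': connection is omitted

# a 7-wide neighbourhood centred on node 1 in a population of 10 nodes:
# the two leftmost connections are truncated under the default option
neigh = [source_index(1, o, 10) for o in range(-3, 4)]
```

Under the cyclic option, the same out-of-range offsets instead wrap to the opposite side of the population.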
- All network delays are mapped into the synapse object. So the delay Δ_ij
of the synapse with weight w_ij represents the sum of the delay Δ_i^(n) of the
presynaptic neuron i and the synaptic delay Δ_ij^(s). It can be specified by the
user as a multiple d_ij of the time step: Δ_ij := d_ij · Δt. The delay can be
different for each synapse of the same network and can be set by a user-
defined function (dependent on the indices of the source and destination
nodes). Internally, the FIFO buffer required for storing the d_ij last input
values is implemented in the presynaptic neuron and not in each synapse (as
assumed in the underlying model). Thus, the output values must be stored
only once in a FIFO buffer of size max_j d_ij and the efficiency is improved.
4 Conclusions
The presented unified neural network model incorporates the features of all ar-
tificial and many spiking neural network models. Especially the most important
characteristics of biology-oriented neural network models (postsynaptic poten-
tials, delays, spike generation, and topographic maps) are included. Thus, im-
portant models like the integrate-and-fire neuron or the spike-response model
[3] can easily be described. Because of the common notation, artificial and spiking
neural networks can be combined to model complex hybrid neural architectures.
Acknowledgements
This work is partially supported by the DFG (SFB 527, subproject B3).
References
1. Bower, J., and Beeman, D. The Book of GENESIS: Exploring Realistic Neural
Models with the GEneral NEural SImulation System. Springer, New York, 1995.
2. Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reit-
boeck, H. Coherent Oscillations: A Mechanism of Feature Linking in the Visual
Cortex? Biological Cybernetics 60 (1988), 121-130.
3. Gerstner, W. Spiking Neurons. In Pulsed Neural Networks, W. Maass and C. Bishop,
Eds. MIT Press, 1998, ch. 1, pp. 3-54.
4. Hartmann, G., Frank, G., Schäfer, M., and Wolff, C. SPIKE128K - An Accel-
erator for Dynamic Simulation of Large Pulse-Coded Networks. In Proceedings
MicroNeuro'97 (1997), H. Klar, A. Koenig, and U. Ramacher, Eds., pp. 130-139.
5. Hecht-Nielsen, R. Neurocomputing. Addison-Wesley, 1990.
6. Hines, M., and Carnevale, N. The NEURON Simulation Environment. Neural
Computation 9 (1997), 1179-1209.
7. Jahnke, A., Roth, U., and Schönauer, T. Digital Simulation of Spiking Neural
Networks. In Pulsed Neural Networks, W. Maass and C. Bishop, Eds. MIT Press,
1998, ch. 9.
8. Kock, G., and Serbedžija, N. Artificial Neural Networks: From Compact Descrip-
tions to C++. In Proceedings of the International Conference on Artificial Neural
Networks ICANN'94 (1994), Springer, pp. 1372-1375.
9. NeuralWare, Inc., Pittsburgh (PA). NeuralWorks Reference Guide, 1995.
10. Sajda, P., and Finkel, L. NEXUS: A Simulation Environment for Large-Scale Neural
Systems. SIMULATION 59, 6 (1992), 358-364.
11. Singer, W., and Gray, C. Visual Feature Integration and the Temporal Correlation
Hypothesis. Annual Review of Neuroscience 18 (1995), 555-586.
12. Strey, A. EpsiloNN - A Specification Language for the Efficient Parallel Imple-
mentation of Neural Networks. In Biological and Artificial Computation: From
Neuroscience to Technology, LNCS 1240 (Berlin, 1997), J. Mira, R. Moreno-Diaz,
and J. Cabestany, Eds., Springer, pp. 714-722.
13. Strey, A. EpsiloNN - A Tool for the Abstract Specification and Parallel Simulation
of Neural Networks. Systems Analysis Modelling Simulation (SAMS), Gordon &
Breach, 1999, in print.
14. Teeters, J. MDL: A System for Fast Simulation of Layered Neural Networks. SIMU-
LATION 56, 6 (June 1991), 369-379.
15. Walker, M., Wang, H., Kartamihardjo, S., and Roth, U. SimSPiNN - A Simulator
for Spike-Processing Neural Networks. In Proceedings of the 15th IMACS World
Congress on Scientific Computation, Modelling and Applied Mathematics (Berlin,
1997), A. Sydow, Ed., Wissenschaft & Technik Verlag.
16. Watts, L. Event-Driven Simulation of Networks of Spiking Neurons. In Advances in
Neural Information Processing Systems (1994), J. Cowan, G. Tesauro, and J. Al-
spector, Eds., vol. 6, Morgan Kaufmann Publishers, Inc., pp. 927-934.
17. Zell, A., et al. SNNS - Stuttgart Neural Network Simulator, User Manual, Version
4.0. Report 6/95, University of Stuttgart, 1995.
Weight Freezing in Constructive Neural
Networks: A Novel Approach
1 Introduction
Multi-layer perceptrons (MLP) trained by error backpropagation are widely used
for function approximation. Given a data set:
subspace spanned by the different units, the weights of the output layer must be
retrained. [2] and [3] are the most popular algorithms of this category.
Two main criteria may be used to compare freezing and non-freezing algo-
rithms: the final network size and the convergence speed.
In general, the freezing algorithms lead to larger networks. Indeed, these algo-
rithms try to find the optimal solution in a small subset of the parameter space and
not in the whole space. Consequently, with respect to non-freezing algorithms,
they need more parameters to achieve the same performance. This problem espe-
cially depends on the estimation capacity of the new unit. For a simple sigmoidal
unit, this capacity is very limited, especially when a single-hidden-layer network
is used. When the network locks into a "hard state", one needs to add a consid-
erable number of single sigmoidal units to exit it [6]. A solution proposed by many
researchers is to use more complicated units [3] or the cascade architecture [2]. We
present a freezing algorithm which constructs the main network by adding
small accessory networks trained by a non-freezing algorithm.
2 Algorithm
where φ is a sigmoidal function. Considering the data model (1), suppose we al-
ready have K accessory networks providing the estimation f_K(x) = Σ_{l=1}^{K} β_l φ_l(x)
of g(x). Hence, the residue of estimation is ε_K = y − f_K(x), and we want to add
another accessory network φ_{K+1}(x) to minimize ||ε_K − β_{K+1} φ_{K+1}(x)||². It has
been shown [7] that the above expression achieves its minimum by maximizing

    E = (Σ_{i=1}^{N} ε_K(x_i) φ_{K+1}(x_i))² / Σ_{i=1}^{N} φ_{K+1}²(x_i)   (4)

and choosing

    β_{K+1} = Σ_{i=1}^{N} ε_K(x_i) φ_{K+1}(x_i) / Σ_{i=1}^{N} φ_{K+1}²(x_i)   (5)
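A numeric sketch of Eq. 4 and Eq. 5 as reconstructed here (the function and variable names are illustrative assumptions):

```python
def objective_and_beta(residues, phi_values):
    """Given the residues eps_K(x_i) and the accessory-network outputs
    phi_{K+1}(x_i) on the training set, return the objective E (Eq. 4)
    and the optimal initial output weight beta_{K+1} (Eq. 5)."""
    cross = sum(e * p for e, p in zip(residues, phi_values))
    norm = sum(p * p for p in phi_values)
    return cross * cross / norm, cross / norm
```

With this choice of beta, the squared residue norm ||eps_K − beta·phi_{K+1}||² is minimized over beta, which is exactly what the least-squares projection onto phi_{K+1} gives.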
In our method the accessory network φ_{K+1} is constructed using a non-
freezing algorithm. In fact, after computing the residue of the main network,
ε_K, we first try to estimate it by maximizing (4) with a single neuron. If this
neuron succeeds in significantly decreasing the error of residue approximation,
that is, the objective function E mentioned in (4) is greater than a predefined
threshold, it will be added to the main network. Otherwise, we add another neu-
ron to the first one and these two neurons, this time together, try to estimate the
residue. After convergence, we verify again whether there is a significant reduction
of error or not. If yes, the construction of the accessory network will be stopped;
otherwise, the process of construction continues until there is a sensible reduction
of error. Afterwards, the weights of the accessory network will be frozen and its
output will be added to the main network with its output weight, β_{K+1}, whose
initial value can be computed using (5). Then, in order that the residue remains
orthogonal to the subspace spanned by the different accessory networks, the
weights of the output layer will be updated. Finally, we compute once more the
residue and another accessory network is constructed for estimating it. The algo-
rithm continues until a good estimation of the target function, satisfying a stopping
criterion, is obtained. If there is enough data, cross-validation on a test data
base can be used as the stopping criterion; otherwise, other methods of generalization
evaluation may be considered. Figure 1 shows the network construction scheme.
Fig. 1. Network construction method: a) Residue computation for the main network.
b) Estimation of the residue with an accessory network using a non-freezing algorithm.
c) The fusion of the accessory network into the main network.
BEGIN
  Initialization: main-network-size = 0, residue = target-function.
  DO {
    new-accessory-network-size = 0.
    DO {
      new-accessory-network-size++.
      Train accessory network to estimate the residue.
    } WHILE ((E / residue-power) < TH1).
    Connect the accessory network to the main network.
    Optimize the output layer weights.
    Compute new residue.
  } WHILE (stopping criterion is not satisfied).
END
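The construction scheme in Fig. 1 and the pseudo-code above can be sketched as an ordinary loop. In this sketch all the callables are hypothetical stand-ins for the paper's routines: training an accessory network of a given size, measuring the relative error reduction E/residue-power, and fusing the frozen accessory network into the main network while re-optimising the output-layer weights.

```python
def anbfa(target, train_accessory, fuse_and_update, error_gain, th1, stop):
    """Sketch of the ANBFA construction loop (callables are hypothetical).

    train_accessory(residue, size) trains an accessory network of `size`
    neurons on the residue; error_gain(net, residue) returns the relative
    error reduction E / residue-power; fuse_and_update(main, net, residue)
    freezes the accessory network, connects it to the main network,
    re-optimises the output-layer weights and returns the new residue;
    stop(residue) is the global stopping criterion.
    """
    main, residue, sizes = [], target, []
    while not stop(residue):
        size = 0
        while True:
            size += 1                        # add one neuron and retrain
            net = train_accessory(residue, size)
            if error_gain(net, residue) >= th1:
                break                        # significant reduction reached
        sizes.append(size)
        main.append(net)                     # freeze the accessory network
        residue = fuse_and_update(main, net, residue)
    return main, sizes
```

The two nested loops mirror the two DO/WHILE loops of the pseudo-code: the inner one grows a single accessory network, the outer one accumulates accessory networks until the stopping criterion holds.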
Fig. 2. Approximation of a function with ANBFA. Target function (solid line), its noisy
samples (dots) and the approximation (dashed line).
In Fig. 2, the result of a sample run of our algorithm is shown. Table 1
illustrates the results of the experiments for the 3 algorithms. In this table, "Combination" indicates the sizes of the different accessory networks constructed by ANBFA.
As can be seen, ANBFA is on average 4.4 times faster than SSBFA and 3.1
times faster than NFA, and it leads to slightly larger networks. Moreover, in
30% of the experiments, the SSBFA method was locked in a plateau, so that even
with 100 neurons it was not able to satisfy the stopping criterion and we had to
stop the construction procedure. In the row "Average", we neglected these runs.
4 Simulation results with real data
In the second experiment, we would like to find suitable models for geostatistical
data concerning the different pollution factors measured in Lake Léman. At
Fig. 3. Data map.
Table 2 illustrates the mean and the standard deviation of the final network
size, of the normalized training time and of the final training error, computed
on the 8 pollution variables for each of the three algorithms. As can be seen,
for nearly the same performance (approximation error measured on the training base),
our method is the fastest (10 times faster than the non-freezing algorithm). The
final network size is on average much smaller than that of the single-sigmoid-based
freezing algorithm, but larger than the networks constructed by the non-freezing
algorithm. Considering the discussion of Section 1, these results are not surprising.
Table 2. Estimation of pollution variables. For each algorithm, the mean and the
standard deviation of the final number of hidden units, of the normalized time necessary
for convergence and of the final training error are given.
5 Conclusion
References
1. T.Y. Kwok and D.Y. Yeung, Constructive Algorithms for Structure Learning in
Feedforward Neural Networks for Regression Problems. IEEE Trans. on Neural
Networks, vol. 8, no. 3, pp. 630-645, May 1997.
2. S.E. Fahlman and C. Lebiere, The Cascade-Correlation Learning Architecture,
in Advances in Neural Information Processing Systems 2, D.S. Touretzky Ed., pp.
524-532, Morgan Kaufmann, Los Altos CA, 1990.
3. J.H. Friedman and W. Stuetzle, Projection Pursuit Regression, Journal of the
American Statistical Association, vol. 76, no. 376, pp. 817-823, 1981.
4. T. Ash, Dynamic Node Creation in Backpropagation networks, Connection Sciences,
vol. 1, no. 4, pp. 365-375, 1989.
5. Ch. Jutten and R. Chentouf, A New Scheme for Incremental Learning. Neural Pro-
cessing Letters, vol. 2, no. 1, pp. 1-4, 1995.
6. T.Y. Kwok and D.Y. Yeung, Experimental Analysis of Input Weight Freezing in
Constructive Neural Networks, In Proceedings of the IEEE International Conference
on Neural Networks, vol. 1, pp. 511-516, San Francisco, California, USA, 1993.
7. T.Y. Kwok and D.Y. Yeung, Objective Functions for Training New Hidden Units in
Constructive Neural Networks, IEEE Transaction on Neural Networks, vol. 8, no.
5, pp. 1131-1148, September 1997.
8. N.A.C. Cressie, Statistics for Spatial Data, John Wiley & Sons, New York, 1991.
9. Sh. Hosseini and Ch. Jutten, Simultaneous Estimation of Signal and Noise by Constructive Neural Networks, In Proceedings of the International ICSC/IFAC Symposium
on Neural Computation, Vienna, Austria, September 1998.
10. G. Baillargeon, Méthodes Statistiques de l'Ingénieur, volume 1. Les Éditions SMG,
1994.
11. Y. Le Cun, J.S. Denker, and S.A. Solla, Optimal Brain Damage, in Advances
in Neural Information Processing Systems 2, D.S. Touretzky Ed., pp. 598-605, Morgan
Kaufmann, 1990.
12. M. Lehtokangas, J. Saarinen and K. Kaski, Fine-tuning cascade-correlation feedforward network trained with backpropagation, Neural Processing Letters, vol. 2,
no. 2, pp. 10-12, 1995.
Can General Purpose Micro-processors Simulate
Neural Networks in Real-Time?
1 Introduction
2 Real time
Neural networks are often used in real-time applications, for example the
recognition of the amount on a bank check or of a postal zip code. In these
applications the simulation time is strictly limited. In this article we have
taken a time constraint of 40 ms, which corresponds to the CCIR video rate.
In this article we consider the two most widely used kinds of neural networks:
the Multi-Layer Perceptrons (MLP) and the Radial Basis Function networks (RBF).
To determine whether general purpose micro-processors can perform real-time
simulation of artificial neural networks, we simulated two neural nets: an MLP
called LENET and an RBF called RBF3.
3.1 LENET
3.2 RBF3
RBF3 [6, 5] is a Radial Basis Function network. Its 3 layers contain respectively
256, 10 and 4 neurons, and it uses the Mahalanobis distance. This distance makes
RBF3 a very hard benchmark, because the number of computations increases as the
square of the number of neurons in the input layer.
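The quadratic cost mentioned above is visible in the definition of the distance itself. The sketch below uses the standard Mahalanobis formula (not code from the paper); the explicit double loop makes the d² multiply-accumulates clear, where d is the input-layer size.

```python
def mahalanobis_sq(x, mu, s_inv):
    """Squared Mahalanobis distance (x - mu)^T S_inv (x - mu).

    x, mu are length-d vectors; s_inv is the d x d inverse covariance.
    The nested loop performs d^2 multiply-accumulates, which is why the
    cost grows as the square of the number of input neurons.
    """
    d = len(x)
    diff = [x[i] - mu[i] for i in range(d)]
    total = 0.0
    for i in range(d):
        for j in range(d):
            total += diff[i] * s_inv[i][j] * diff[j]
    return total
```

With the identity matrix as inverse covariance, the distance reduces to the squared Euclidean distance, which is a convenient sanity check.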
4 Evaluation
4.1 Method
The usual method to predict the simulation time of neural networks on an electronic architecture is based on measuring an average speed S for the processing
of connections. The simulation time of an MLP or an RBF with C connections
is then simply taken as S * C. We have demonstrated in [7] that this
method cannot be applied to a general neural network architecture because it
leads to very high prediction errors. Thus we introduced a new method for this
prediction.
Primitives for MLP Equations 1 and 2 give the primitives associated with
the MLP model.
g(x_j, j ∈ E_i) = Σ_{j ∈ E_i} x_j · w_{ij}    (1)

f(v_i) = m · (1 − e^{−β v_i}) / (1 + e^{−β v_i})    (2)

m determines the range of the neuron state, contained in [−1, 1], and β is
the slope of f.
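The two primitives amount to a weighted sum followed by a symmetric sigmoid. The exact algebraic form of f is partly illegible in the source, so the form below (an odd sigmoid with range scaled by m and slope set by β) is an assumption consistent with the surrounding text rather than a verbatim transcription.

```python
import math

def weighted_sum(x, w):
    # primitive (1): v_i = sum over the incoming layer of x_j * w_ij
    return sum(xj * wj for xj, wj in zip(x, w))

def activation(v, m=1.0, beta=1.0):
    # primitive (2), assumed form: odd sigmoid with outputs in (-m, m),
    # slope controlled by beta
    e = math.exp(-beta * v)
    return m * (1.0 - e) / (1.0 + e)
```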
where F is the CPU frequency and T is the simulation time measured for this
primitive. To approximate the CPI_p, we have made numerous simulations of
the primitives, measured the simulation time and determined the CPI_p with
formula 5.
At this point we have two functions: a function NBI_p which provides the
number of instructions executed for the primitive as a function of the sizes of the
layers of a neural network, and a function CPI_p which provides the number of
cycles per instruction for the primitive as a function of the sizes of the layers
of a neural network.
Let us now take a neural network characterized by the primitives p ∈ Π
and by the layers (a_p, b_p, ..., c_p, d_p) for each primitive p. We can compute the
simulation time TS of this neural network with the formula:
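The formula itself is lost to the page break. The natural reading of the definitions of NBI_p, CPI_p and F is that TS sums, over the primitives, the instruction count times the cycles per instruction, divided by the clock frequency; the hypothetical helper below assumes exactly that form.

```python
def simulation_time(primitives, layer_sizes, freq_hz):
    """Assumed reconstruction: TS = sum_p NBI_p * CPI_p / F.

    primitives: list of (nbi, cpi) pairs, where each element is a
    callable taking the layer sizes and returning, respectively, the
    number of instructions and the cycles per instruction for that
    primitive; freq_hz: CPU clock frequency F in Hz.
    """
    cycles = sum(nbi(layer_sizes) * cpi(layer_sizes)
                 for nbi, cpi in primitives)
    return cycles / freq_hz   # seconds
```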
Firstly we evaluated two processors of the SPARC family: the SUPERSPARC and
the ULTRASPARC II.
5.2 The X86 processors
Hardware
With all the analytical models we can perform both evaluation and prediction.
SPARC family Table 2 shows that the measured times are smaller than the maximum predicted time and larger than the minimum predicted time: this confirms the
validity of our methodology.
For the real-time simulation of the neural networks, this table shows that the
SUPERSPARC processor cannot satisfy the 40 ms time constraint.
On the other hand, the ULTRASPARC II can manage the real-time simulation of LENET. We have a maximum time of 8.3 ms for the integer version and
a maximum time of 14.621 ms for the floating-point version. Because LENET
is one of the biggest MLPs ever designed, we can state that current MLPs can
be simulated in real time on general purpose micro-processors when the time
constraint is 40 ms.
However, Table 2 shows that the real-time simulation of RBF3 cannot
always be achieved.
X86 family Similarly to the SPARC family, Table 3 shows that our methodology is valid, and that MLPs can be simulated in real time on these architectures.
Our methodology can evaluate actual electronic architectures, but it can also
predict the simulation time of future evolutions of these architectures. We used
it to predict the simulation time of the neural networks LENET and RBF3 on four
possible future evolutions of the ULTRASPARC II and PENTIUM II. For the sake
of simplicity, we modified only one parameter: the clock frequency. The prediction
will be pessimistic, because progress in microelectronics technology may lead to
a speedup larger than the ratio of the clock frequencies, as we saw when we compared
the SUPERSPARC and the ULTRASPARC.
The four evolutions for which we predict the simulation time of the LENET and
RBF3 networks are:
Table 4. Predicted times for the ULTRASPARC II and PENTIUM II with 400 MHz and 1 GHz
clock frequencies
This table shows that with 400 MHz and 1 GHz clock frequencies, simulation
of neural networks will be possible in real time for both kinds of neural networks
when the time constraint is 40 ms.
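Since only the clock frequency is varied, the prediction behind Table 4 is a simple clock-ratio rescale of a measured time; as noted above, this is pessimistic. A minimal sketch (the function name is ours):

```python
def predict_time(measured_ms, f_old_mhz, f_new_mhz):
    """Pessimistic prediction: only the clock frequency changes, so the
    predicted simulation time scales inversely with the clock ratio."""
    return measured_ms * f_old_mhz / f_new_mhz
```

In reality, architectural improvements usually accompany a frequency bump, so the true speedup tends to exceed the clock ratio, which is exactly why the text calls the prediction pessimistic.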
7 Conclusion
In this article we propose a new methodology to evaluate and predict the simulation time of MLP and RBF neural networks on general purpose micro-processors.
With this methodology we evaluated two processor families, SPARC and X86,
and we demonstrated that general purpose micro-processors can now simulate
Multi-Layer Perceptrons with a 40 ms real-time constraint.
We also used our methodology to predict the simulation time of neural networks on two possible future evolutions of the SPARC and X86 families, and we showed
that these architectures would simulate Radial Basis Function networks with
the Mahalanobis distance in real time with a 40 ms time constraint. They could be
available in the next three years.
References
1. Introduction
The software simulation of large neural networks has a problem in its large
requirements for hardware resources (especially memory for storing the weights), due
to which, until a short time ago, the simulation of this type of neural network was
restricted to the use of neurocomputers [11][13]. However, these neurocomputers
have a high cost, and are expensive to keep updated.
Over the last years, there have been some advances that have changed the
panorama of simulation and of scientific calculation in general.
• In the first place, standard hardware cost is falling while its power is increasing.
By standard hardware we are of course referring to computers built around
Intel x86 processors. This evolution has narrowed the distance separating this
standard hardware from the workstation, and in some fields it is already a
serious competitor to the latter.
• Also, the hardware for interconnecting computers has undergone a major
evolution, it now being quite normal to have a switched 100 Mbit Ethernet
network at a low cost.
• The other important development is the appearance on the scene of the Linux
operating system, which, as we know, is a complete UNIX, with excellent
performance and great interconnection possibilities. In addition to these
excellent characteristics, we must bear in mind its price (it is freeware).
These three facts together allow us to build Beowulf systems at a low cost
[2][15][16], i.e., PC clusters connected by fast Ethernet and using Linux as the
operating system. This class of system has been used for scientific calculations
(high-energy physics, plasma studies, etc.) with great success [20], obtaining an
excellent performance/cost ratio.
The use of such systems for neural network simulation, although they do not offer
neurocomputer performance, is certainly a good alternative, as we shall show in a
following section.
Over the last years, we have been using multiprocessor systems for neural network
simulation, in particular VME backplanes with Motorola processors (MC680X0 and
PowerPC) and VxWorks as the operating system. However, keeping our VME
multiprocessor up to date is far too expensive and, as we observed in the previous
section, the performance of Intel x86 processors is better and their cost keeps
decreasing. Therefore, we decided to implement our neural network simulator on a
Beowulf system built around Pentium processors.
The simulation system is built on a client-server structure, which is based on
object-oriented modeling of neural networks using the OMT methodology [8]. In this
model, we consider the layers and the connections between layers as the units that
make up the neural network, with connections in all directions (feed-forward,
feedback and lateral interaction), as well as the possibility of choosing the
recognition and learning algorithms.
The server, called NeuSim, is responsible for the simulation itself, and is
the part that runs on the Beowulf system. On one cluster station runs the subsystem
that we denote the master, in charge of coordinating and supervising the simulation
and monitoring the state of the other cluster stations, on which the other server
subsystem runs: the slave part. This has the task of doing the actual simulation. The
communication between subsystems is done using TCP/IP sockets.
The client part does not run on the Beowulf system, but on another UNIX
workstation. Instead of developing an end-user application as the client, we decided
to develop a library (which we call NNLIB) for programming in the C language. We
believe this gives the final users more flexibility to adapt the simulation software to
their needs. The library has functions to create and delete objects, set and get
attributes, control the simulation, handle events, etc.
The NNLIB library can easily be used with widget libraries for X-Window (such
as GNU/GTK) for straightforward customization of any application with a graphics
interface.
Another important aspect of NeuSim-NNLIB is the possibility of developing new
models (recognition and learning algorithms, connection patterns, etc.) using
extensions (plug-ins), which can be written using a defined protocol [8].
3. Performance analysis
Our purpose is to get a simulation time estimate as a function of the neural network's
characteristic topology and the number and type of installed processors. Although
approximate, this prior estimate of simulation time will help us to decide if it is
necessary to use all the cluster processors or if it is better to use only a part of the
cluster for that simulation.
In recent years, many investigators around the world have been working on
automatically dividing the execution of some algorithms into atomic blocks, with the
idea of exploiting cache memories and systems with more than one processor (either
multiprocessor systems [1] or clusters of workstations [4]).
However, in general the problem is quite difficult, and efforts have focussed on
parallelizing nested-loop algorithms with uniform dependencies [3][14][16][18]. In
practice, there are many problems that can be solved using this method.
The optimization study of nested-loop algorithms with uniform dependencies is
approached using a technique called Supernode Transformation, or Tiling [10][3].
To describe this technique without going into mathematical detail, let us
suppose we have an n-times nested loop, and denote by the index space a
subset J of Z^n, i.e., a block in which every point represents an iteration of the
n-times nested loop. Then, to execute the whole loop, we must execute every point
in the iteration index space. If there were no dependencies between points in the
iteration space, we could execute every point in parallel until the whole iteration
space is completed.
However, the dependencies may require one block to be executed before other
blocks. Uniform dependencies [18] help us to simplify the problem, because they
are the same for all points in the iteration space. Mathematically, the dependencies
can be characterized by a matrix in which every column represents a dependency,
which is a vector of dimension n.
Tiling consists of dividing the iteration space into blocks (called tiles) using a
transformation. This transformation gives us a new iteration index space, in which
each point represents one of these blocks (or tiles). Each block can be executed
in a practically autonomous way, needing at most the tiles executed immediately
before. The values of the components of the dependency vectors will be 0 or 1.
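As an illustration (our own toy example, not from the paper), consider a 2-D iteration space of size N × M tiled into ti × tj blocks, with dependencies (1, 0) and (0, 1). After tiling, the inter-tile dependencies are again 0-or-1 vectors, so the tiles can be scheduled as wavefronts: all tiles on the same anti-diagonal are independent and can run in parallel.

```python
def tile_wavefronts(n, m, ti, tj):
    """Map each tile (bi, bj) of an n x m iteration space (tile size
    ti x tj) to its wavefront index bi + bj.  Tiles that share a
    wavefront index have no dependency on one another and can execute
    in parallel; successive wavefronts must run in order."""
    fronts = {}
    for bi in range((n + ti - 1) // ti):       # tile rows
        for bj in range((m + tj - 1) // tj):   # tile columns
            fronts.setdefault(bi + bj, []).append((bi, bj))
    return fronts
```

For a 4 × 4 space with 2 × 2 tiles this yields three wavefronts: the corner tile, then the two tiles adjacent to it, then the opposite corner.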
But since the dependencies still exist, we still have the problem of not being able
to execute the tiles independently in parallel, so that it becomes necessary to plan the
execution with respect to the dependencies [6][17].
With the purpose of optimizing the execution in a multiprocessor environment, we
should choose the size and shape of the tile appropriately [9][12][5]. We
must take into account that the time needed to execute a tile is made up of two terms:
one due to the computation itself, and another due to the time needed for the data
communication. The computation time is proportional to the volume of the tile, and
the communication time also grows with the size of the tile, because the data exchanged
with neighboring tiles will be larger. If we also take into account that the time needed
to initiate a communication is much greater than that needed to transmit one item of
data, we reach the conclusion that the larger the tile, the better the performance.
Since the dependencies force us to execute the tiles in a certain order, choosing a
very large tile size means we will not be able to take maximum advantage of
parallel execution [7]. This leads us to look for an optimum tile size.
Focusing now on our problem of neural network simulation, we find that the
recognition (or learning) algorithms for each layer (connection) correspond to nested
loops in which the dependencies are not uniform, since each layer can receive
information from several connections, and the dependencies also differ between
layers. Since we execute the net as a whole, assigning to each processor the neurons
of each layer that it should process (and therefore the weights it must store),
executing each layer individually, connection by connection, would involve an
excessive increase in communication between processors, and hence would make it
unprofitable in terms of total calculation time.
If we analyze the form of the algorithms used in neural networks, we should
notice the following facts:
• Firstly, in the recognition phases, there exist dependencies that are fixed by the
exact form of the connection patterns. These dependencies affect the values that we
want to calculate: the new states of the neurons. If we took these dependencies into
account, we would be forced to also respect the order in which we process the
neurons. Since the memory occupied by the neuron states is insignificant relative to
the memory occupied by the weights, we can maintain two copies of the neuron
states, alternating their use. This eliminates the obligation of executing the neurons
in a certain order.
• Secondly, during the learning phases, the learning algorithms are generally such
that the values of the new weights are not affected by the values of the neighboring
weights. This eliminates any problem with the dependencies in so far as the
execution order is concerned.
These considerations obviate our having to plan the order of execution of the
blocks of neurons, thus giving us the possibility to exploit the parallelism better.
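The double-buffering trick from the first point can be sketched as follows: every new neuron state is computed from a read-only copy of the old states, so the neurons can be updated in any order (or split across processors) without changing the result. This is our own minimal illustration, not code from the simulator.

```python
def synchronous_step(states, weights):
    """One recognition step with two state buffers.

    weights[i][j] connects neuron j to neuron i.  All reads go to the
    old buffer `states`, all writes go to the new buffer, so the update
    order of the neurons is irrelevant.
    """
    old = states                           # read-only copy of current states
    new = [sum(w * s for w, s in zip(row, old)) for row in weights]
    return new                             # becomes the old buffer next step
```

Because no neuron reads a freshly written state, partitioning the rows of `weights` across processors requires exchanging only the (small) state vectors, never the weights, which is exactly the situation exploited in the performance model of the next section.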
In estimating the simulation time in the recognition phase we will follow the
exposition of other work [9][12]. We seek an estimate that is reliable when the neural
net is large, without worrying unduly if the prediction is less accurate with a small
network. We will divide the simulation time into two contributions: one due to the
calculation itself and the other due to data communication between processors.
The calculation contribution will be proportional to the time needed to execute a
connection.
For the second contribution (that due to communication), we shall neglect the
particularities of the physical medium used. We will assume that the communication
time is linear in form (first-order approximation). The constant represents the time
that it takes to establish communication, while the slope represents the cost of
transmitting a datum.
Let us now define the parameters that we will use in the expressions to model the
simulation time.
n = f(c, p)    (1)

where c is the number of connections of the network, p is the number of processors, and n is the resulting performance (connections processed per unit time).

In principle, we will also consider the time needed to make one iteration (t). This will also be a function of the previous two variables, and will be the sum, as noted above, of a calculation time and a communication time, i.e.:

t(p, c) = t_calc(p, c) + t_com(p, c)    (2)

For each processor i, with performance index P_i, unit calculation time t_ui and assigned connections c_i:

P_i = α / t_ui  ⇒  t_ui = α / P_i   and   c_i = c / p    (3)

Taking into account that the computation time should be the same for all processors in the cluster, we have that:

t_calc = t_ui · c_i  ⇒  t_calc(p, c) = (α / p) · c    (4)

t_com,i = β · n_i + γ    (5)

where n_i is the number of data that processor i must exchange across the communication network.

In a Beowulf system, these data are the neuron states. We must take into account that in this case the weights needed by a processor are in its local memory, and therefore it is not necessary to exchange them. In this situation the data to exchange are only some of the neuron states of the network, which implies in practice that we can neglect the term β · n_i relative to γ, and therefore have:

t_com,i ≈ γ    (6)

i.e.,

t_WS,com = Σ_{j=1..p} t_com,j = γ · p    (7)

so that the total simulation time is

t_WS(p, c) = (α / p) · c + γ · p    (8)

If we take into account that n(p, c) = c / t(p, c), we have that:

n_WS(p, c) = p · c / (γ · p² + α · c)    (9)

Analyzing the function n(p, c) with a constant number of connections, we find that the function n(p, c = const) has a maximum at:

p_WS,max(c) = √(α · c / γ)    (10)

If we keep the number of processors constant, we find that the function n(p = const, c) has a horizontal asymptote given by the expression:

n_WS(p, c → ∞) = p / α    (11)
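The performance model recoverable from expression (9), n_WS(p, c) = p·c/(γ·p² + α·c), together with its optimum over p and its asymptote in c, can be checked numerically; the sketch below assumes that reconstructed form and illustrative parameter values of our own choosing.

```python
import math

def n_ws(p, c, alpha, gamma):
    """Performance (connections per unit time) of a p-processor cluster
    simulating a network with c connections, per expression (9)."""
    return p * c / (gamma * p ** 2 + alpha * c)

def p_optimal(c, alpha, gamma):
    """Number of processors maximizing n_ws for fixed c (expression (10)):
    found by setting the derivative of n_ws with respect to p to zero."""
    return math.sqrt(alpha * c / gamma)

# Illustrative values (assumptions, not measurements from the paper):
alpha, gamma, c = 2.0, 0.5, 1e6
p_star = p_optimal(c, alpha, gamma)
```

For large c at fixed p, n_ws tends to p/α: beyond that point, adding connections no longer changes the throughput per processor, which is the horizontal asymptote the text refers to.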
4. Results
Table 1. Characteristics of the eight neural networks used for performance analysis.
Fig. 1. Graphical comparison between experimental and estimated results for each of our eight
neural networks.
Table 2. Simulation times of the eight neural nets for different values of the cluster
performance index.
Table 3. Non-linear regression results using the model given by expression (9).
Lastly, in Table 5 we list the estimated values of the optimal number of processors
and of the performance limit.
5. Conclusions
We have proven the viability of a cluster of PCs running the Linux operating
system (a Beowulf system) for large neural network simulation.
From the interpretation of the results, we find that, as the number of processors
grows, the performance limit follows expression (11). This is because the data that
represent the weights are kept in the processor that works with them, obviating
the need to use the communication network. This underlines the importance of
equipping each component of the cluster of PCs with enough memory.
Also, our estimate of the simulation time was found to be correct when the neural
network is sufficiently large.
Acknowledgements
This work has been partially supported by project PRI9606D007, financed by the
Junta de Extremadura.
References
1. Agarwal A., Kranz D.A, Natarajan V.: Automatic partitioning of parallel loops and data
arrays for distributed shared-memory multiprocessors. IEEE Trans. Parallel Distributed
Systems, 6(9):943-962, 1995.
2. Becker D.J., Sterling T., Savarese D., Dorband J.E., Ranawak U.A., Packer C.V. Beowulf:
A Parallel workstation for scientific computation. Proceedings, International Conference on
Parallel Processing, 1995.
3. Boulet P., Darte A., Risset T. and Robert Y. (Pen)-ultimate tiling?. Integration, the VLSI
Journal, 17:33-51, 1994.
4. Pierre Boulet, Jack Dongarra, Yves Robert and Frederic Vivien, Tiling for Heterogeneous
Computing Platforms, Report UT-CS-97-373, Jul 1997
39
5. Calland P.Y., Dongarra J. and Robert Y. Tiling with limited resources. Application Specific
Systems, Architectures and Processors. ASAP'97, pp 229-238. IEEE Computer Society
Press, 1997.
6. Darte A., Khachiyan L. and Robert Y. Linear Scheduling is Nearly Optimal. Parallel
Processing Letters, vol 1.2, pp. 73-81, 1991.
7. Desprez F., Dongarra J., Rastello F. and Robert Y. Determining the Idle Time of a Tiling:
New Results Journal of Information Science and Engineering, pp. 167-190, Vol.14 No.1.
March 1997.
8. Garcia Orellana, C.J. Modelado y Simulación de Grandes Redes Neuronales. Doctoral
Thesis, University of Extremadura, October 1998.
9. Hodzic E. and Shang W. On Supernode Transformation with Minimized Total Running
Time. IEEE Transactions on Parallel and Distributed Systems, Vol. 9, No. 5, May 1998, pp.
417-428.
10. Irigoin F. and Triolet R. Supernode partitioning. In Proc. 15th Annual ACM Symp.
Principles of Programming Languages, pages 319-329, CA, January 1988.
11. A. Müller, A. Gunzinger and W. Guggenbühl, Fast Neural Net Simulation with a DSP
Processor Array. IEEE Transactions on Neural Networks, Vol. 6, No. 1, January 1995.
12. Ohta H., Saito Y., Kainaga M. and Ono H. Optimal Tile Size Adjustment in Compiling
General DOACROSS Loop Nets. Proc. 1995 Int'l Conf. Supercomputing, pp. 270-279.
ACM Press, 1995.
13. U. Ramacher et al., SYNAPSE-1 -- A General Purpose Neurocomputer, Siemens AG,
available on request Feb. 1994.
14. Ramanujam J. and Sadayappan P. Tiling Multidimensional Iteration Spaces for
Multicomputers. J. Parallel and Distributed Computing, vol. 16, pp. 108-120, 1992.
15. Reschke C., Sterling T., Ridge D., Savarese D., Becker D., Merkey P. A Design Study of
Alternative Network Topologies for the Beowulf Parallel Workstation. Proceedings, High
Performance and Distributed Computing, 1996.
16. Schreiber R. and Dongarra J.J. Automatic Blocking of Nested Loops. Technical Report 90.38,
RIACS, Aug. 1990.
17. Shang W. and Fortes J.A.B. Time Optimal Linear Schedules for Algorithms with Uniform
Dependencies. IEEE Trans. Computers, vol. 40, no. 6, pp. 723-742, Jun 1991.
18. Shang W. and Fortes J.A.B. Independent Partitioning of Algorithms with Uniform
Dependencies. IEEE Trans. Computers, vol. 41, no. 2, pp. 190-206, Feb 1992.
19. Sterling T., Becker D.J., Savarese D., Berry M.R., Reschke C. Achieving a Balanced Low-Cost Architecture for Mass Storage Management through Multiple Fast Ethernet Channels
on a Beowulf Parallel Workstation. Proceedings, International Parallel Processing
Symposium, 1996.
20. Warren M.S., Salmon J.K., Becker D.J., Goda M.P., Sterling T., Winckelmans G.S.:
Pentium Pro inside: I. A treecode at 430 Gigaflops on ASCI Red, II. Price/performance of
$50/Mflop on Loki and Hyglac. In Supercomputing '97, Los Alamitos, 1997. IEEE
Computer Society.
A Constructive Cascade Network with Adaptive
Regularisation
1 Introduction
The casper algorithm [1, 2] has been shown to be a powerful method for training
feedforward neural networks. It is a constructive algorithm that inserts hidden
neurons one at a time to form a cascade architecture, similar to Cascade Correlation (cascor) [3]. The amount of regularisation in casper is set by a parameter.
The optimal value for this parameter is difficult to estimate prior to training, and
is generally obtained through trial and error. An inherent problem for the regularisation of constructive networks is that the number of weights in the network
is continually changing, and thus even an optimal regularisation level for a given
network size will become redundant as the network grows. This work explores
the use of a method which adaptively sets the regularisation level as the network
is constructed. This paper will first give an introduction to the casper algorithm,
then describe the adaptive regularisation method and provide the results of some
comparative simulations. Finally, the algorithm is benchmarked on the Proben1
[4] series of data sets and its performance is compared to an optimised Cascade
Correlation algorithm.
The casper algorithm uses a version of the RPROP algorithm [5] for network
training. RPROP is a gradient descent algorithm which uses separate adaptive
learning rates for each weight. Each weight begins with an initial learning rate,
which is then adapted depending on the sign of the error gradient seen by the
weight as it traverses the error surface. This results in the update value for each
weight adaptively growing or shrinking as a result of the sign of the gradient
seen by that weight.
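A minimal sketch of the per-weight RPROP update described above. The constants (η+ = 1.2, η− = 0.5, and the step bounds) are the commonly cited RPROP defaults, assumed here rather than stated in this paper:

```python
def rprop_step(w, grad, prev_grad, step,
               eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=50.0):
    """One RPROP update for a single weight.

    The step size grows while the gradient keeps its sign and shrinks
    when the sign flips; the weight then moves opposite the gradient
    sign by the (sign-only) step.  Returns the new weight, step size,
    and the gradient to remember for the next call.
    """
    if grad * prev_grad > 0:                 # same sign: accelerate
        step = min(step * eta_plus, step_max)
    elif grad * prev_grad < 0:               # sign flip: slow down and
        step = max(step * eta_minus, step_min)
        grad = 0.0                           # skip this update
    if grad > 0:
        w -= step
    elif grad < 0:
        w += step
    return w, step, grad
```

Note that only the sign of the gradient is used to move the weight; its magnitude never enters the update, which is the defining feature of RPROP.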
Casper constructs a cascade network in a similar manner to cascor: it begins
with all inputs connected directly to the outputs, and successively inserts neurons
which receive inputs from all prior hidden neurons and inputs. RPROP is used
to train the whole network each time a hidden neuron is added. The use of
RPROP is modified, however, such that when a new neuron is inserted, the
initial learning rates for the weights in the network are reset to values that
depend on the position of the weight in the network. The network is divided into
three separate regions, each with its own initial learning rate: L1, L2 and L3. The
first region is made up of all weights connecting to the new neuron from previous
hidden and input neurons. The second region consists of all weights connecting
the output of the new neuron to the output neurons. The third region is made up
of the remaining weights, which consist of all weights connected to, and coming
from, the old hidden and input neurons.
The values of L1, L2 and L3 are set such that L1 >> L2 > L3. The reason for
these settings is similar to the reason that cascor uses the correlation measure:
the high value of L1 as compared to L2 and L3 allows the new hidden neuron
to learn the remaining network error. Similarly, having L2 larger than L3 allows
the new neuron to reduce the network error without too much interference from
the other weights. Importantly, however, no weights are frozen, and hence if the
network can gain benefit by modifying an old weight, this occurs, albeit at an
initially slower rate than for the weights connected to the new neuron. In addition,
the L1 weights are trained by a variation of RPROP termed SARPROP [6]. The
SARPROP algorithm is based on RPROP, but uses a noise factor to enhance
the ability of the network to escape from local minima.
In casper a new hidden neuron is installed after the decrease of the validation
error has fallen below a set amount. All hidden neurons use a symmetric logistic
activation function ranging between −0.5 and 0.5. The output neuron activation
function depends on the type of analysis performed. Regression problems use a
linear activation function. Classification tasks use the standard logistic function
for single-output classification tasks. For tasks with multiple outputs the softmax
activation function [7] is used. Similarly, the error function selected depends on
the problem. Regression problems use the standard sum-of-squares error function.
Classification problems use the cross-entropy function [8]. For classification tasks,
a 1-of-c coding scheme for c classes is used, where the output for the class to
be learnt is set to 1, and all other class outputs are set to 0. For a two-class
classification task, a single output is used, with the values 1 and 0 representing
the two classes. For multiple classes a winner-takes-all strategy is used in which
the output with the highest value designates the selected class.
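The 1-of-c coding and the winner-takes-all decision just described can be sketched in two lines each (standard constructions, not code from the paper):

```python
def one_of_c(label, c):
    # target vector for a c-class task: 1 for the true class, 0 elsewhere
    return [1 if i == label else 0 for i in range(c)]

def winner_takes_all(outputs):
    # the output with the highest value designates the selected class
    return max(range(len(outputs)), key=lambda i: outputs[i])
```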
The regularisation used in casper is implemented through a penalisation term
added to the error function as shown below:
One method that would allow the amount of regularisation to change in con-
structive algorithms is to adapt this parameter as the network is trained. This
was done using the following method as applied to the casper algorithm. The
adaptation process relies on using three training stages for each new hidden neu-
ron added, instead of the usual single training stage. The validation results taken
after the completion of each training stage are then used to adapt the regulari-
sation levels for future network training. This process repeats as the network is
constructed.
For each new hidden neuron inserted into the network, three training stages
are performed. Each training stage is performed using the same method as the
casper algorithm, and is halted using the same criterion. The commencement
of a new training stage results in all RPROP and SA parameters being reset
to their initial values. Importantly, however, the final weights from the previous
training stage are retained and act as the starting point for the next training
stage. The motivation for this is that it is likely to increase convergence speed,
and thereby construct smaller networks.
The regularisation level for the network once a new neuron is added is set
to an initial value, λ_i, termed the initial decay value. This parameter takes the
form λ_i = 10^-a. It is this initial decay value that is adapted as the network is
constructed. The first training stage uses the initial decay value. Each successive
stage uses a regularisation level that has been reduced by a factor of ten from the
previous stage. After each training stage the performance of the network on the
validation set is measured, and the network weights recorded. On completion of
the third training stage, the initial decay value is adapted as follows: if the best
performing regularisation level occurred during the first two training stages, the
initial decay value is increased by a factor of ten, else it is decreased by a factor
of ten. At this point the weights that produced the best validation results are
restored to the network. When the next neuron is added, the process repeats
using the newly adapted initial decay value.
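Under the reading above (λ_i = 10^-a, with λ reduced tenfold in each of the three stages, and the a = 1 to 4 limits stated below), the adaptation of the exponent a after one neuron insertion can be sketched as:

```python
def adapt_initial_decay(a, stage_val_errors):
    """Adapt the initial decay exponent a (lambda_i = 10**-a) after the three
    training stages of one hidden-neuron insertion. Stage k (0-based) uses
    lambda = 10**-(a + k), so a best result in the first two stages means
    stronger regularisation helped: lambda_i is increased tenfold (a - 1);
    otherwise it is decreased tenfold (a + 1). Illustrative sketch; a is
    clamped to the limits a = 1..4 described in the text."""
    best_stage = min(range(len(stage_val_errors)),
                     key=lambda i: stage_val_errors[i])
    if best_stage < 2:
        a -= 1   # increase the initial decay value by a factor of ten
    else:
        a += 1   # decrease it by a factor of ten
    return max(1, min(4, a)), best_stage
```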
The initial network with no hidden neurons is trained using a single training
stage with a regularisation level of λ = 0. The adaptation scheme begins with
the addition of the first hidden neuron, which is given an initial decay value of
a = 2. The initial decay value is chosen to give a relatively high regularisation
level, as this can easily be reduced through network growth and the adaptation
process. The limits placed on the initial decay value are a = 1 to 4, which
gives a total possible regularisation range of a = 1 to 6 (since there are three
training stages). The lower initial decay limit (a = 4) was selected to stop the
regularisation level falling too low, which can occur in early stages of training
when the network is still learning the general features of the data set. The top
initial decay limit (a = 1) was selected since convergence becomes difficult with
excessive regularisation levels.
For reasons of efficiency, if the validation result of the second stage is worse
than the first, the third training stage is not performed. In addition, if the vali-
dation results of the first training stage are worse than the best validation results
of the previous network architecture, the weights are reset to their previous val-
ues before this training stage was commenced. The regularisation level is then
reduced as normal, and the second training stage is started. This was done to
stop excessive regularisation levels distorting past learning.
This regularisation selection method allows the network to adapt the level
of regularisation as the network grows in size. The motivation for using this
adaptation scheme is the relationship between good regularisation levels in
similarly sized networks. Having found a good regularisation level for a given
network, it is likely that a slightly larger network will benefit from a similar
regularisation level. The adaptation process allows a good regularisation level
to be found by modifying the window of regularisation magnitudes that is
examined. This adaptation process is biased towards selecting larger
regularisation levels, since the initial decay value is increased if either of the
first two training stages has the best validation result. The reason for this bias
is that, as the network grows in size, more regularisation will in general be
required.
The motivation for reducing the regularisation level through each training
stage is that it allows the network to model the main features of the data set,
which can then be refined by lowering the regularisation level. This is the same
motivation for the use of the SA term in the regularisation function. The algo-
rithm incorporating this adaptive regularisation method will be termed acasper.
The parameter values for this algorithm were selected after some initial tuning
on the Two spirals [3] and Complex interaction [9] data sets. Some tuning was
also performed using the cancer1 data set from the Proben1 collection.
4 Comparative Simulations
the grid [0, 1]². Gaussian noise of 0 mean and 0.25 standard deviation was added
to the training and validation sets. The two classification benchmarks were the
Glass and Thyroid data sets, which are glass1 and thyroid1 respectively from
Proben1.
For each data set 50 training runs were performed for each algorithm using
different initial starting weights. The Mann-Whitney U test [10] was used to
compare results, with results significant to a 95% confidence level indicated in
bold. Training in both casper and acasper is halted when either the validation
error (measured after the installation of each hidden neuron) fails to decrease af-
ter the addition of 6 hidden neurons, or a maximum number of hidden neurons
have been installed. This maximum was set to 8 and 30 for the classification
and regression data sets respectively. The measure of computational cost used is
connection crossings (CC) which Fahlman [3] defines as the number of multiply-
accumulate steps required to propagate activation values forward through the
network, and error values backward. This measure is more appropriate for con-
structive networks than the number of epochs trained since it takes into account
varying network size.
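As a rough sketch, connection crossings can be estimated by charging one multiply-accumulate per connection for the forward pass and one for the backward pass, per pattern, per epoch; Fahlman's exact bookkeeping may differ in detail.

```python
def connection_crossings(n_connections, n_patterns, n_epochs):
    """Coarse estimate of Fahlman-style connection crossings: one
    multiply-accumulate per connection forward and one backward,
    for every pattern in every epoch."""
    return 2 * n_connections * n_patterns * n_epochs
```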
The results on the test sets at the point where the best validation result
occurred for the constructed networks after the halting criterion was satisfied
are given in Tables 1 and 2. For the classification data sets this measure is the
percentage of misclassified patterns, while for the regression data sets it is the
Fraction of Variance Unexplained (FVU) [9], a measure proportional to total
sum of squares error. Also reported is the number of hidden neurons installed
at the point where the best validation result occurred, and the total number
of connection crossings performed when the halting criterion was reached. The
casper results reported are those that gave the best generalisation results from
a range of regularisation levels: letting λ = 10^-a, a was varied from 1 to 5.
4.1 Discussion
(Figure: error plotted on a logarithmic scale, from 1.0e+00 down to 1.0e-06,
against the number of hidden neurons, 0 to 30.)
In order to allow comparison between the acasper algorithm and other neural
network algorithms, an additional series of benchmarking was performed on the
remaining data sets in the Probenl collection. The same benchmarking set-up
was used as for the previous comparisons, except the maximum network size for
the regression problems was set to eight. The four regression data sets in Proben1
are building1, flare1, hearta1, and heartac1. The test results for these data sets
are given in terms of the squared error percentage as defined by Prechelt [4]:

E = 100 · (o_max − o_min) / (N · c) · Σ_{p=1..N} Σ_{i=1..c} (o_pi − t_pi)²

where o_max and o_min are the maximum and minimum values of the outputs, N
is the number of training patterns, and c the number of output units.
To allow direct comparison with a well known constructive algorithm, the
results obtained by the cascor algorithm [3] are also given. These results were
sults for twelve out of the fourteen data sets. There are some cases where the
difference is surprisingly large, for example the Soybean and Thyroid data sets.
One reason for this may be that the halting criterion for acasper specifies a max-
imum network size of eight, although in general this limit is rarely reached by
acasper during the benchmarking.
Interestingly, many of the data sets are solved by acasper using very small
networks, often with no hidden units at all. This illustrates a major advantage
of using constructive networks: the simple solutions are tried first. It is often the
case that many real-world data sets, such as the ones in Proben1, can be solved
by relatively simple networks.
5 Conclusion
References
1 Introduction
- Firstly, showing that the agent-based paradigm can provide a neutral, un-
biased, operational model for such a unified framework.
- Secondly, showing that this model includes most of the known forms of meta-
learning proposed by the machine learning community.
- Thirdly, showing that this kind of model may help to overcome some of the
traditionally weak points of the work around meta-learning.
3 Classes of Bias
Hilario distinguishes between two kinds of bias, representational and search bias,
that can be studied at different grain levels. We classify these granularity levels
as follows:
- Hypothetical level.
On the representational side, it has to do with the selection of formalisms or
languages used for the description of hypotheses and instances in the problem
space.
Regarding search, this level deals with the kind of task we are trying to ac-
complish through automatic means: classification, prediction, optimization,
etc.
- Strategic level.
A particular representation model (production rules, decision trees, per-
ceptrons, etc.) has to be selected, compatible with the formalism preferred
at the previous level. This model is built by a particular learning algorithm
by searching the hypothesis space.
- Tactical level.
Once a pair model/algorithm has been selected, some tactical decisions may
remain to be taken about the representation model (e.g., model topology
in neural nets) or the search model (number of generations in genetic al-
gorithms, stopping criteria when inducing decision trees, etc.)
- Semantic level.
This level concerns the interpretation of the primitive objects, relations and
operators. Concerning representation, this level includes the selection, com-
bination, normalization (scaling, in general), discretization, etc. of attributes
in the problem domain. Semantic level search bias includes the selection of
weight updating operator in neural nets and fitness updating operator in
genetic algorithms, the information-content measure used for the selection
5 Case 2. Tactical Level Bias: Parameter Selection
A good amount of work can be found in the literature about systems intended for
the selection of adequate representational or search bias at the tactical level. For
instance, the C45TOX system, developed for a toxicology application in the MIX
project, uses genetic algorithms for optimising the parameters used by the C4.5
learning algorithm. A work with the same goal had been previously developed
by Kohavi and John [4]. They used a wrapper algorithm for parameter setting.
In the C45TOX system, the genetic algorithm acts as a specialised config-
uration manager. It provides the experiment designer with candidate sets of
parameters that are used for training a decision tree. This tree is tested using
cross-validation. The evaluator agent estimates the performance of the decision
tree and transmits the error rate to the genetic agent to update the fitness of the
corresponding individual of the population. The knowledge base of the genetic
system evolves through the application of genetic operators. When a new genera-
tion is obtained, new experiments are launched until no significant improvement
is achieved.
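The loop just described can be sketched as follows. The mutation-only evolutionary scheme and all names are illustrative assumptions, not the actual C45TOX implementation; the user-supplied function stands in for the train-and-cross-validate cycle and returns an error rate.

```python
import random

def ga_tune(train_fn, param_ranges, pop_size=8, generations=5, seed=1):
    """Sketch of a GA acting as a configuration manager: propose candidate
    parameter sets, score each with a train/cross-validate function that
    returns an error rate, keep the fittest half, and mutate them to form
    the next generation, stopping after a fixed number of generations."""
    rng = random.Random(seed)
    pop = [{k: rng.uniform(lo, hi) for k, (lo, hi) in param_ranges.items()}
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=train_fn)       # lower error rate = fitter
        parents = scored[:pop_size // 2]
        children = []
        for p in parents:                        # Gaussian mutation, clamped
            child = {}
            for k, v in p.items():
                lo, hi = param_ranges[k]
                child[k] = min(max(v + rng.gauss(0.0, 0.1 * (hi - lo)), lo), hi)
            children.append(child)
        pop = parents + children
    return min(pop, key=train_fn)                # best candidate found
```

For instance, tuning a single hypothetical parameter against a quadratic error surface converges towards its minimum.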
The architecture of this system is shown in Fig. 2.
- Full integration. The meta-learning agents are exactly the same as those used
for object-level learning. In the same way, several learning agents can be launched
simultaneously for meta-learning, and their results can be compared or in-
tegrated in an arbiter or combiner structure.
- On-line learning. Meta-learning can be achieved simultaneously with object-
level learning.
- Use of transformed and artificial data-sets. The lack of source data-bases is
a difficulty that can be overcome through the generation of new data-sets
obtained from the transformation of the original ones. New attributes can
be derived or noise can be added in order to test noise-immunity. Even fully
artificial data-bases can be generated from rules or any other mechanism,
controlling at the same time the level of noise to be added.
7 Current Work
The ideas and the architecture proposed in this paper are being implemented
at this moment in the project M2D2 ( "Meta-Learning in Distributed Data Min-
ing"), funded by CYCIT, the Spanish Council for Research and Development.
This approach has been successfully used, for instance, for the development of
the C45TOX system.
References
1. P. Chan and S. Stolfo. A comparative evaluation of voting and meta-learning on
partitioned data. In Prieditis and Russell [6], pages 90-98.
2. J. Gama and P. Brazdil. Characterization of classification algorithms. In E. Pinto-
Ferreira and N. Mamede, editors, Progress in Artificial Intelligence. Proceedings of
the 7th Portuguese Conference on Artificial Intelligence (EPIA-95), pages 189-200.
Springer-Verlag, 1995.
3. Melanie Hilario. Bias and knowledge in symbolic and connectionist induction. Tech-
nical report, Centre Universitaire d'Informatique, Université de Genève, Genève,
Switzerland, 1997.
4. R. Kohavi and G. John. Automatic parameter selection by minimizing estimated
error. In Prieditis and Russell [6], pages 304-312.
1 Introduction
previous models in which the two combined algorithms were adapted and applied
to discrete optimisation [3].
The remainder of this article is organised as follows: first, the weak hybrid
model is formalised in Sect. 2. Subsequently, its functioning is described in Sect.
3, considering scalability issues as well. Next, experimental results are reported
in Sect. 4. Finally, some conclusions are outlined in Sect. 5.
According to this definition, ODR returns the best individual (or one of the
best individuals) that can be built without introducing any new material. This
implies performing an implicit exhaustive search in a small subset of the solution
space (i.e., in the discrete dynastic potential of the recombined solutions). Such
an exhaustive search can be efficiently done by means of a subordinate
A*-like algorithm, as described in the next section.
Φ(x) = Φ((x_1, ..., x_n)) = Σ_{i=1..n} Φ_i(x_i)    (3)

It is easy to see that, in this situation, ODR must simply scan x and y,
picking the variables that provide the best value for each Φ_i. Hence, ODR
scales linearly and, consequently, this case does not pose any problem.
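For such a separable function, the per-variable scan can be sketched as follows, assuming minimisation and a hypothetical per-variable contribution function phi(i, value):

```python
def odr_separable(x, y, phi):
    """Dynastically optimal recombination for a separable function
    Phi(x) = sum_i phi(i, x_i): scan both parents and keep, per variable,
    the value with the better (here: lower) contribution. Linear in n."""
    return [xi if phi(i, xi) <= phi(i, yi) else yi
            for i, (xi, yi) in enumerate(zip(x, y))]
```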
The scenario is different when epistasis is involved. In this situation, and
since an A*-like mechanism for exploring A({x, y}) is used, ODR is sensitive
to increments in the dimensionality of the problem and to the consequent expo-
nential growth of |A({x, y})|. An adjustment of the representation granularity
is proposed to alleviate this problem. To be precise, recall that solutions are in-
crementally constructed by adding one variable at a time. If the computational
cost of this procedure were too high, ODR could be modified so as to add g
variables at a time, i.e.,
(5)
It can be seen that increasing g confines ODR to a smaller subset of A({x, y})
whose size is O(2^{n/g}) and thus the computational cost is reduced. However, a
very high value of g may turn ODR ineffective since the chances for combining
valuable information are reduced as well. For this reason, intermediate granu-
larity values represent a good trade-off between computational cost and quality.
4 Experimental Results
A large collection of experiments has been done to assess the quality of the pro-
posed recombination mechanism in the context of several different continuous-
optimisation problems. These problems are described in Subsection 4.1. Subse-
quently, the experimental results are reported and discussed in Subsection 4.2.
The test suite used in this work is composed of four problems: the generalised
Rastrigin function, a weighted noisy matching function, the Rosenbrock func-
tion and the design of a brachystochrone. Each of these functions exhibits several
distinctive properties, thus providing a different scenario for evaluating and com-
paring different operators. These properties are described below in more detail.
For high absolute values of xi, this function behaves like an n-dimensional
parabola. However, the sinusoidal term becomes dominant for small values.
Hence, there exist many local maxima and minima around the optimal value
(located at x = 0). Although not epistatic, this function is highly multimodal
and hence difficult for gradient-based algorithms. The values a = 10, ω = 2π, and
-5.12 ≤ x_i ≤ 5.12 have been used in all experiments.
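With the stated parameter values, the function can be written as follows; this is a sketch of the standard generalised Rastrigin form, which matches the description above (parabolic for large |x_i|, sinusoidal multimodality near the optimum at x = 0):

```python
import math

def rastrigin(x, a=10.0, w=2 * math.pi):
    """Generalised Rastrigin function with the parameter values used in
    the experiments (a = 10, w = 2*pi); global minimum at x = 0."""
    return sum(xi * xi + a * (1.0 - math.cos(w * xi)) for xi in x)
```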
where

σ_i = K · (1 − |x_i − v_i| / max|x_i − v_i|)    if |x_i − v_i| ≥ ε,
σ_i = 0                                         otherwise.    (8)
If the noisy terms N_i(0, σ_i) are discarded, this function is equivalent to a
scaled translated sphere function (the optimum being located at x = v). How-
ever, the presence of Gaussian noise makes this function harder. Moreover, it can
be seen that the amplitude of the noisy terms increases as the reference values
are approached, thus becoming stronger as the algorithm converges. The noise
ceases within a small neighbourhood ε of each reference value.
The values w_i = i, v_i = 5.12 · sin(4πi/n), K = 0.5, ε = 0.1, and -5.12 ≤ x_i ≤
5.12 have been considered in all the experiments.
In this problem, there exist epistatic relations between any pair of adjacent
variables. Additionally, there exist non-epistatic terms as well. However, the
latter have a much lower weight, and hence the search is usually dominated
by the former. As a matter of fact, there exists a strong attractor located at
x = 0, where these terms become zero. The further evolution towards the global
optimum (x = 1) is usually very slow. As for the previous functions, the range
-5.12 < x_i < 5.12 has been considered.
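Assuming the standard generalised Rosenbrock form, which matches the description above (epistatic terms between adjacent variables that vanish at x = 0, weaker non-epistatic terms, global optimum at x = 1), the function can be sketched as:

```python
def rosenbrock(x):
    """Generalised Rosenbrock function: the dominant epistatic terms
    100*(x_{i+1} - x_i**2)**2 are zero at x = 0 (the strong attractor),
    while the global optimum x = (1, ..., 1) has value 0."""
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
               for i in range(len(x) - 1))
```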
where n is the number of pillars, h_i is the height of the ith pillar (h_0 and h_{n+1} are
data of the problem), λ is the distance between consecutive pillars (a problem
parameter as well), and v_i = v(v_{i-1}, h_{i-1}, h_i) is the velocity at the ith pillar
(v_0 = 0).
As can be seen, this is also an epistatic problem: the contribution of each
variable (i.e., pillar height) depends on the value of previous variables; but,
unlike the Rosenbrock function, there does not exist any non-epistatic term. The
experiments with this function have been carried out using h_0 = 2, h_{n+1} = 0,
(n + 1) · λ = 4, and (2h_{n+1} − h_0) < h_i < h_0.
For comparison purposes, experiments have been also done both without
recombination and with several classical recombination operators such as inter-
mediate recombination (IR), random respectful recombination (R3), and random
discrete recombination (RDR). Furthermore, two different reproductive mecha-
nisms have been tried: mutation + recombination (i.e., the parents are mutated
before being recombined) and recombination + mutation (i.e., the parents are
recombined and the resulting child is then mutated). When using a non-discrete
recombination operator, stepsizes are always geometrically averaged. For each
test problem, dimensionality, reproductive mechanism and operator, twenty runs
have been performed and the mean value has been considered. Runs are termi-
nated after 10^5 function evaluations. When using the ODR operator, the addi-
tional partial evaluations carried out during recombination are also considered
and hence fewer generations are performed in this case.
Table 1 shows the results for the generalised Rastrigin function. It can be
seen that using a recombination operator always improves the performance with
respect to a non-recombinative ES. As stated in [6], this is due to the regularity
in the arrangement of local optima and to the global structure of the function,
similar to a unimodal landscape. Also, notice that the results are generally better
when mutating and then recombining than vice versa. Moreover, applying the ODR
operator after mutating provides the best results. The lower performance of
ODR when applied before mutation is a consequence of the disturbing effects of
mutation on the heuristically selected combination of variables.
Table 1. Results for the generalised Rastrigin function. All results are averaged for 20
runs. The best results for each dimensionality are shown in boldface.
The results for the weighted noisy matching function are more impressive.
As it can be seen in Table 2, all operators are relatively satisfactory for dimen-
sionalities up to 16. From that point on, only ODR consistently finds quasi-optimal
solutions. The rest of the operators are incapable of dealing with the high number of
independent and non-uniformly scaled noisy terms that are present in this func-
tion. Again, the use of recombination improves the results of a non-recombinative
ES (with the exception of RDR applied before mutation), and mutating after
applying ODR yields slightly worse results.
Table 2. Results for the weighted noisy matching function. All results are averaged
for 20 runs. The best results for each dimensionality are shown in boldface.
The results for the Rosenbrock function are shown in Table 3. As mentioned
in Sect. 3, the computational cost of ODR quickly grows when the dimension-
ality of this problem is increased. For that reason, the granularity factor g has
been modified. To be precise, the algorithm automatically adjusts g so as to
keep a constant dimensionality-to-granularity ratio p = n/g. In these experi-
ments, the value p ≈ 10 has been used. This value seems to be robust for the
problems considered. Nevertheless, empirical evidence suggests that fine-tuning
of this parameter (usually in the interval 8 ≤ p ≤ 12) may yield better results.
Table 3. Results for the Rosenbrock function. All results are averaged for 20 runs.
The best results for each dimensionality are shown in boldface.
Table 4. Results for the brachystochrone design problem. All results are averaged for
20 runs. The best results for each dimensionality are shown in boldface.
Finally, the results for the brachystochrone problem are given in Table 4. As
for the Rosenbrock function, ODR with g = 1 becomes computationally prohibitive for
high dimensionalities, and hence the value p ≈ 10 has been kept. The results
are very satisfactory: ODR performs better than the other recombination operators
do. Furthermore, it seems to scale much better. It can also be seen that, again,
the results of ODR are worse when it is applied before mutation, although the
difference is not very significant in this case. Also, and with the exception of
RDR applied after mutation, the algorithms with recombination perform better
than the non-recombinative algorithm.
5 Conclusions
A hybrid model that combines evolution strategies with the A* algorithm has
been presented in this work. This model tries to exploit the available knowledge
about the fitness function in order to intelligently combine valuable parts of
solutions independently discovered. By using this model, recombination turns out
to be a strongly exploitative operation (no new value is introduced in any variable),
thus placing the responsibility for exploration on the mutation operator, a very
powerful element in evolution strategies.
The empirical evaluation of the hybrid algorithm has been very satisfac-
tory, outperforming other classical operators on a benchmark composed of mul-
timodal, noisy and epistatic functions. Moreover, the algorithm can be scaled by
tuning the representation granularity (i.e., the size of the blocks combined by the
subordinate A* algorithm). In fact, this parameter can be adjusted according to
the available computational resources to allow a finer exploration.
Notice that, although recombination has been restricted to a binary operation
in this work, the hybrid model can be straightforwardly upgraded to multiparent
recombination [6, 7]. In this sense, it is simply necessary to extend the concept
References
1. Th. Bäck. Evolutionary Algorithms in Theory and Practice. Oxford University
Press, New York, 1996.
2. Th. Bäck, D.B. Fogel, and Z. Michalewicz. Handbook of Evolutionary Computation.
Oxford University Press, New York NY, 1997.
3. C. Cotta, E. Alba, and J.M. Troya. Utilising dynastically optimal forma recombi-
nation in hybrid genetic algorithms. In A.E. Eiben, Th. Bäck, M. Schoenauer, and
H.-P. Schwefel, editors, Parallel Problem Solving From Nature - PPSN V, volume
1498 of Lecture Notes in Computer Science, pages 305-314. Springer-Verlag, Berlin
Heidelberg, 1998.
4. C. Cotta and J.M. Troya. On decision-making in strong hybrid evolutionary al-
gorithms. In A.P. Del Pobil, J. Mira, and M. Ali, editors, Tasks and Methods in
Applied Artificial Intelligence, volume 1416 of Lecture Notes in Computer Science,
pages 418-427. Springer-Verlag, Berlin Heidelberg, 1998.
5. L. Davis. Handbook of Genetic Algorithms. Van Nostrand Reinhold Computer
Library, New York, 1991.
6. A.E. Eiben and Th. Bäck. Empirical investigation of multiparent recombination
operators in evolution strategies. Evolutionary Computation, 5(3):347-365, 1997.
7. A.E. Eiben, P.-E. Raue, and Zs. Ruttkay. Genetic algorithms with multi-parent
recombination. In Y. Davidor, H.-P. Schwefel, and R. Männer, editors, Parallel
Problem Solving From Nature - PPSN III, volume 866 of Lecture Notes in Computer
Science, pages 78-87. Springer-Verlag, Berlin Heidelberg, 1994.
8. W.E. Hart and R.K. Belew. Optimizing an arbitrary function is hard for the genetic
algorithm. In R.K. Belew and L.B. Booker, editors, Proceedings of the Fourth
International Conference on Genetic Algorithms, pages 190-195, San Mateo CA,
1991. Morgan Kaufmann.
9. M. Herdy and G. Patone. Evolution strategy in action: 10 ES-demonstrations.
Technical Report TR-94-05, Technische Universität Berlin, 1994.
10. N.J. Radcliffe. Forma analysis and random respectful recombination. In R.K. Belew
and L.B. Booker, editors, Proceedings of the Fourth International Conference on
Genetic Algorithms, pages 222-229, San Mateo, CA, 1991. Morgan Kaufmann.
11. N.J. Radcliffe. The algebra of genetic algorithms. Annals of Mathematics and
Artificial Intelligence, 10:339-384, 1994.
12. G. Syswerda. Uniform crossover in genetic algorithms. In J.D. Schaffer, editor,
Proceedings of the Third International Conference on Genetic Algorithms, pages
2-9, San Mateo, CA, 1989. Morgan Kaufmann.
13. D.H. Wolpert and W.G. Macready. No free lunch theorems for optimization. IEEE
Transactions on Evolutionary Computation, 1(1):67-82, 1997.
Extracting Rules from Artificial Neural
Networks with Kernel-Based Representations
José M. Ramírez *
Dpto. de Computación
Universidad Simón Bolívar
Apartado 89000, Caracas 1080-A, Venezuela
jramire@ldc.usb.ve
1 Motivation
At a certain level of abstraction, neural learning can be seen as case-based learn-
ing. The Neural Network stores only a selected set of examples (or a combination
of examples) and uses them to find the best approximation for new instances of
the problem, according to its generalization ability. The system learns from ex-
amples, not rules, but the examples are instances of the application of (partially)
unknown rules in a given domain. A successful Neural Network synthesizes in
a subsymbolic representation the rules needed to solve instances of a problem
in a given domain; but its mechanism does not give significant feedback to the
designer that could contribute to the understanding of the problem domain.
In Neural Network models, the knowledge synthesized from the training pro-
cess is represented in a subsymbolic fashion (weights, kernels, combination of
* International postal address: José M. Ramírez, CCS 90996, 4440 N.W. 73 Av., Miami,
FL 33166, USA.
unit's weight into discrete, symbolic descriptions. Some candidate templates are
generated and instantiated; the candidates that best fit the actual weights
are selected. The activation of the units is assumed to be boolean and the transfer
function is restricted to sigmoidal.
Other approaches use domain theories to initialize the networks [26]. The
networks are trained using labeled examples and rules describing the behavior of
the networks, according to the theories, are extracted using an iterative clustering
algorithm. The rules generated are, in fact, mathematical descriptions of the
networks' behavior and not symbolic rules.
Craven and Shavlik use a trained Neural Network to perform queries using
the training data to induce decision trees [7]. The trained network is used as a
black box to answer queries, making the method architecture independent, but
what is generated is not a symbolic translation of the Neural Network's internal
representation, but a decision tree that is functionally equivalent to the network.
This observation may seem subtle, but the training process of the network derives
a compact subsymbolic representation that is lost, since the network is used only
as the oracle of the decision tree induction algorithm.
Deductive learning is another method for extracting rules from Neural Net-
works; it modifies the network architecture to simplify the representation for
better learning. One of the better explained methods of deductive learning is
presented by Ishikawa in [11] and is named structural learning with forgetting.
The method interrupts the standard learning of the network at certain points and
prunes the connections with weights smaller than a certain ε; this action produces
a "forgetting" effect in the network that contributes to generating a more gener-
alized representation. When the training converges, a symbolic interpretation of
the remaining connection values is generated in the form of rules.
Kosko [13] proposes the use of clusters generated in the input-output prob-
lem space by competitive learning as the extensional description of fuzzy sets,
and sketches a method to generate rules from a discretization of the problem space
in a way that captures the spatial distribution of the clusters. This proposal is
closely related to our work, in the sense that, starting from a discretization of
the problem space and target classes, fuzzy implications are derived from mem-
bership functions associated with defined intervals (clusters in Kosko's proposal).
Clustering is also used by Sreerupa and Mozer in [24] and by Omlin and
Giles in [17] to induce Finite State Machines from Neural Networks. The result
is not rules but FSMs, which can be seen as a discrete representation of the network
behavior, closer to human understanding.
3. If a rule appears more than once, leave only one rule and attach the number
of coincident rules as a relative strength.
For example, suppose the prototype vector of one kernel is (1.5, 6.3), representing
the values of the input features of the problem, namely x and y, and that the
discretization criterion assigns the following membership values to the labelled
intervals of x and y:

         Low    Med    High   Very High
    x    0.60   0.90   0.00   0.00
    y    0.00   0.10   0.75   0.00

The confidence of the rule generated will be 0.675, that is, the product of 0.90
and 0.75. This value captures the confidence in the classification suggested by a
kernel, given that part of its area overlaps with other kernels associated with
different classes. The rule will be:

if (x is Med) and (y is High) then ClassA with 0.675
The confidence factor is a kind of compound probability that is far more
accurate than the certainty factor used previously. The confidence factors are
used to resolve ambiguities and as a tie-breaking criterion during the inference
process.
Given that this modification of the algorithm may produce rules with the
same template but different confidence factors, the rules with the same template
are aggregated and a compound confidence factor is associated with the aggregated
rule. Finally, the rules are sorted by confidence factor.
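The confidence computation and the aggregation step above can be sketched as follows (a minimal illustration in Python; the function and variable names are ours, and summing the confidences of coincident templates is an assumption about how the compound factor is formed):

```python
from collections import defaultdict

def rule_confidence(memberships):
    """Confidence of a rule = product of the membership degrees of the
    labels chosen for each input feature (e.g. 0.90 for 'x is Med'
    times 0.75 for 'y is High' gives 0.675)."""
    conf = 1.0
    for m in memberships:
        conf *= m
    return conf

def aggregate_rules(rules):
    """Merge rules sharing the same template (antecedents, class),
    combining their confidence factors (here: summed, an assumption),
    then sort the rule set by descending confidence."""
    merged = defaultdict(float)
    for template, conf in rules:
        merged[template] += conf
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

rules = [
    (("x is Med", "y is High", "ClassA"), rule_confidence([0.90, 0.75])),
    (("x is Med", "y is High", "ClassA"), rule_confidence([0.60, 0.10])),
]
# identical templates collapse into one rule with a compound confidence
compound = aggregate_rules(rules)
```

Two rules with the same template collapse into one, so the final rule set stays small while retaining a graded strength for inference.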
5 Experimental results
To initially test the algorithm we used the well-known Iris Plants Database [5].
The goal is to recognize the type of the iris plant to which a given individual
belongs. The data is composed of 150 instances, equally distributed between
three classes, 50 for each of the three types of plants: Setosa, Versicolor, and
Virginica. The first class is linearly separable from the other two; while the
latter are not linearly separable from each other. Each instance features four
attributes: petal length, petal width, sepal length and sepal width, which take
continuous values measured in centimeters.
The whole database was processed using an RBF network, a Coulomb potential-based
network (RCE), and the algorithm presented in this work. 100 instances of
the database were used as training data and 50 as testing data. The RBF network
was trained using a kernel unit for each training vector, and σ was calculated for
each kernel as the RMS distance to its neighbors, as suggested in [21]. The RCE
network was trained using an initial θ (the threshold or radius of each kernel) equal
to half of the dimension of the feature space. Both networks used 4 input units
(one for each attribute) and 3 output units (one for each class).
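The kernel-width initialization described above can be sketched as follows (an illustrative Python version; the choice of the k nearest neighbours and the function names are our assumptions, following the suggestion attributed to [21]):

```python
import math

def rbf_sigmas(centers, k=2):
    """One kernel per training vector; sigma for each kernel is taken as
    the RMS (root-mean-square) Euclidean distance to its k nearest
    neighbours. k is an assumed parameter, not fixed by the text."""
    sigmas = []
    for i, c in enumerate(centers):
        dists = sorted(
            math.dist(c, other) for j, other in enumerate(centers) if j != i
        )
        nearest = dists[:k]
        sigmas.append(math.sqrt(sum(d * d for d in nearest) / len(nearest)))
    return sigmas
```

With one kernel per training vector, this gives a width that adapts to the local density of the training data.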
For the rule extraction algorithm, membership functions of type S (sigmoidal)
and Π (bell) were used, and the attributes were discretized as shown in Table 1.
Tables 2 and 3 show the results of applying the networks and the extracted
rules to the classification of the testing set. It can be seen that the accuracy
was over 90%, and the number of rules extracted from the kernels created with
the training set gives an idea of the compactness that the rule extraction
achieves while maintaining good performance.
The "extra" rule generated from RBF was a very specialized rule for the class
Virginica that was present in RBF due to the initialization strategy used, that
created a kernel for each instance in the training set.
Finally, the alignment measures the functional equivalence of each extracted rule
set with the corresponding network. This number was obtained by comparing
the output of the network and of the corresponding rule set for each instance of
the testing set, and is proportional to the number of coincident outputs.
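A minimal sketch of this alignment measure (the function name is ours; we assume it is simply the fraction of test instances with coincident class labels):

```python
def alignment(net_outputs, rule_outputs):
    """Fraction of test instances on which the network and the extracted
    rule set produce the same class label (functional equivalence)."""
    assert len(net_outputs) == len(rule_outputs)
    coincident = sum(1 for n, r in zip(net_outputs, rule_outputs) if n == r)
    return coincident / len(net_outputs)

# e.g. 47 coincident labels out of 50 test instances -> alignment 0.94
```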
The rules extracted from the RCE network were:
                   Kernels or Rules
  RBF                   100
  RCE                    23
  Rules from RBF          7
  Rules from RCE          6

Table 2. Kernels and rules generated from each network using the 100-instance training set from the Iris Plants Database.
Another database used was the Mushroom Database, which consists of descriptions
of 8124 instances corresponding to 23 species of mushrooms, each identified
as edible (51.8%) or poisonous (48.2%) and described by 22 nominal-valued
attributes with between 2 and 12 possible values.
In this case the discretization step is skipped, given the nominal nature of
the attributes. RCE and RBF were trained using 5416 instances; the rules were
extracted and then tested against the remaining 2708 instances. As expected,
the extracted rules were perfect, and the accuracy reached almost 100% with 10
rules. Tables 4 and 5 show the results.
                   Kernels or Rules
  RBF                  5416
  RCE                    32
  Rules from RBF         10
  Rules from RCE         10

Table 4. Kernels and rules generated from each network using the 5416-instance training set from the Mushroom Database.
The method presented successfully extracts fuzzy rules from trained kernel-based
Neural Networks, including RBF and RCE. The method is expected to
behave well on the rest of the kernel-based networks, given the analogy
established in terms of the internal representation.
The format of the extracted rules is extremely straightforward, allowing an
understanding of both the functionality of the Neural Network and the dynamics of
the target problem. Moreover, the generated rules show outstanding comparative
performance in the resolution of the same problems.
We are not concerned, for the moment, with the complexity and scalability of the
algorithm; the degradation in performance due to kernel addition is a well-known
drawback of kernel-based networks, but their precision and stability are preferred
for certain tasks. This precision is maintained in the symbolic representation
obtained.
The alignment of the generated rules with the networks gives an idea
of how accurate a study of the original problem can be when it uses the symbolic
representation obtained. Over 94% was obtained, which is outstanding with respect
to the targets widely used for model alignment.
The analysis of the extracted rules can lead to the definition of strategies
to debug or expand Neural Networks, creating new training instances to cover
conditions that seem to be absent from the internal representation of the network.
We will explore this issue elsewhere.
References
1. J.A. Alexander and M.C. Mozer. Template-based algorithms for connectionist rule
extraction. Advances in Neural Information Processing Systems, 7:609-616, 1995.
2. C. Bachmann, L. Cooper, A. Dembo, and O. Zeitouni. A relaxation model for
memory with high storage density. Proceedings of the National Academy of Science,
21:609-616, 1995.
3. J.M. Benitez, J.L. Castro, and I. Requena. Are artificial neural networks black
boxes? IEEE Transactions on Neural Networks, 8:1156-1164, 1997.
4. B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin
classification. In Proceedings 5th annual Workshop on Computational Learning
Theory, pages 144-152. ACM, 1992.
5. C. Blake, E. Keogh, and C.J. Merz. UCI repository of machine learning databases,
1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
6. C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297,
1995.
7. M.W. Craven and J.W. Shavlik. Extracting tree-structured representations of
trained networks. Advances in Neural Information Processing Systems, 8:24-30,
1996.
8. S. I. Gallant. Connectionist expert systems. Communications of the ACM, 31:152-
169, 1988.
9. M.J. Healy and T.P. Caudell. Acquiring rule sets as a product of learning in
a logical neural architecture. IEEE Transactions on Neural Networks, 8:461-474,
1997.
10. J. J. Hopfield. Neural networks and physical systems with emergent collective com-
putational abilities. In Proceedings of the National Academy of Sciences USA, volume 81,
pages 3088-3092, 1984.
11. M. Ishikawa. Neural network approach to rule extraction. In Proceedings of the
2nd New Zealand International Conference on Artificial Neural Networks and Fuzzy
Systems, pages 6-9. IEEE Computer Society, 1995.
12. J.S. Roger Jang and C.T. Sun. Functional equivalence between radial basis function
networks and fuzzy inference systems. IEEE Transactions on Neural Networks,
4:156-159, 1993.
13. B. Kosko. Neural Networks and Fuzzy Systems: A dynamical approach to machine
intelligence. Prentice-Hall, 1992.
14. S. Lee and R. Kil. A gaussian potential function network with hierarchically self-
organizing learning. Neural Networks, 4(2):207-224, 1991.
15. C. McMillan, M.C. Mozer, and P. Smolensky. Rule induction through integrated
symbolic and subsymbolic processing. Advances in Neural Information Processing
Systems, 4:969-976, 1992.
16. J.E. Moody and C. Darken. Fast learning in networks of locally-tuned processing
units. Neural Computation, 1(2):281-294, 1989.
17. C.W. Omlin and C.L. Giles. Extraction of rules from discrete-time recurrent neural
networks. Technical Report 92-23, Department of Computer Science, Rensselaer
Polytechnic Institute, 1992.
18. T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings
of the IEEE, 78:1481-1497, 1990.
19. J. Ramirez. MRC-an evolutive connectionist model for hybrid training. In Lecture
Notes in Computer Science 686: New trends in neural computation, pages 223-229.
Springer-Verlag, 1993.
20. D. Reilly, L. Cooper, and C. Elbaum. A neural model for category learning. Bio-
logical Cybernetics, 45:35-41, 1982.
21. A. Saha and J. Keeler. Algorithms for better representation and faster learning
in radial basis function networks. Advances in Neural Information Processing Sys-
tems, 2:482-489, 1990.
1 Introduction
perturbation ΔA_i of A_i causes the ANN to change its classification from one
class to another, then, according to [Engelbrecht et al 1998a, Engelbrecht 1998b],
a decision boundary is located in the range [A_i, A_i + ΔA_i]. That is, a decision
boundary is located at the point in input space where a small perturbation to
the value of an input parameter causes a change in the output class.
Sensitivity analysis of the ANN output with respect to input parameter A_i
is used to assign a "measure of closeness" of an attribute value to the boundary
value(s) of that attribute. That is, for each example p in the training set, the
first-order derivative

    ∂C_j / ∂A_i^(p)    (3)

is calculated for each class (output) C_j and for each input A_i
[Engelbrecht et al 1998a, Engelbrecht 1998b]. The higher the value of ∂C_j/∂A_i^(p),
the greater the chance that a small perturbation of A_i^(p) will cause a different
classification [Engelbrecht 1998b]. Therefore, patterns with high ∂C_j/∂A_i^(p)
values lie closest to decision boundaries.
A graph of ∂C_j/∂A_i^(p), p = 1, ..., P, reveals peaks at boundary points. A curve
fitting algorithm can be used to fit a curve over the values of ∂C_j/∂A_i^(p) and to
find the values of A_i where a peak is located. These values of A_i constitute decision
boundaries. Sampling values to the left and right of the boundary peaks indicates
whether an attribute should have a value less than or greater than the boundary value
to trigger a rule.
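The boundary-detection idea can be sketched numerically (an illustration, not the authors' implementation: we approximate ∂C_j/∂A_i by finite differences and locate peaks by a simple local-maximum search instead of curve fitting; all names are ours):

```python
def sensitivity(f, x, i, eps=1e-4):
    """Finite-difference estimate of the derivative of the network
    output f (for one class) with respect to input attribute i."""
    xp = list(x)
    xp[i] += eps
    return (f(xp) - f(x)) / eps

def boundary_peaks(values):
    """Indices where the sampled sensitivity curve has a local maximum;
    the corresponding attribute values are candidate decision boundaries."""
    return [
        k for k in range(1, len(values) - 1)
        if values[k] > values[k - 1] and values[k] >= values[k + 1]
    ]
```

For a steeply sloped sigmoid output, the sensitivity is largest exactly where the class flips, so the peak index marks the decision boundary; sampling the attribute on either side of the peak then fixes the relational operator of the rule.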
The sensitivity analysis decision boundary detection algorithm is applied in
conjunction with the ANNSER rule extraction algorithm in the next sections.
The aim of this section is to illustrate the sensitivity analysis decision boundary
detection algorithm and to compare the rules extracted to the rules extracted
when the attribute evaluation method is used.
The Iris classification problem concerns the classification of Irises into one
of three classes, namely Setosa, Versicolor and Virginica. Irises are described by
means of four continuous-valued inputs sepal-width, sepal-length, petal-width and
petal-length.
The original 150 instance Iris data set was randomly divided into a 105
instance training set and a 45 instance test set. The sensitivity analysis decision
boundary detection algorithm was executed against the 105 instance training
set.
Firstly, a 4-2-3 ANN was trained using sigmoid activation functions with
steep slopes to approximate linear threshold functions. All input values were
scaled to the range [-1, 1]. Training converged after 10 epochs, with a classi-
fication test accuracy of 98%. Next, the sensitivity analysis pruning algorithm
in [Engelbrecht et al 1996,Engelbrecht 1998b] was executed to prune irrelevant
[Figure 1: Decision boundary peaks for the petal-length (left) and petal-width (right) attributes; the horizontal axes span the scaled attribute range from -0.8 to 0.8.]
Figure I illustrates the decision boundary peaks formed for the petal-width
and petal-length attributes with regard to the Versicolor Iris. These peaks were
used to determine the actual unscaled attribute values that corresponded to these
boundaries. The boundaries were located at 49.50 for the petal-length attribute
and at 18.50 for the petal-width. The actual relational operators were determined
by sampling values to the left and right of the boundaries. The boundaries pro-
duced two attribute-value tests describing the Versicolor Iris, namely the tests
(petal-length < 49.50) and (petal-width < 18.50). The decision boundaries for
the Setosa and Versicolor Irises were located using the same approach as dis-
cussed above. The Setosa decision boundaries were detected at a petal-length of
19.50 and a petal-width of 6.50, producing attribute-value tests (petal-length <
19.50) and (petal-width < 6.50). The Virginica Iris type was described by the
attribute-value tests (petal-length > 49.50) and (petal-width > 16.50).
82
The test set accuracy of the rule set was 95.9%, with individual rule accuracies
ranging from 93.9% to 100%. The accuracy of the set of rules was equal to that
of the classification accuracy of the 2-2-3 ANN. This implies that the rule set
models the ANN to a comparable degree of fidelity, where the fidelity is measured
by comparing the classification performance of the rule set to that of the ANN
from which it was extracted [Craven et al 1993].
The attribute evaluation method was applied next, and is illustrated by con-
sidering the construction of the attribute-value tests of the rule that describes
the Versicolor Iris, as depicted in Table 1. This rule concerned the petal-length
attribute. For the Versicolor Iris, the petal-length attribute had values within
a range of (13.0 < petal-length < 46.50). The petal-length attribute values in
the training set ranged from 13 to 69. Therefore, the minimum attribute-value
test range value corresponded to the minimum value in the training data set.
The attribute-value test was simplified to (petal-length < 46.50). To improve
the generalization of the rule set, the value of r was set to 0.03. This value was
used to calculate a new threshold, using equation (2). A new attribute-value
test, namely (petal-length < 47.50), was produced.
The resultant rules were subsequently compared with the results of the deci-
sion boundary detection algorithm. Table 1 shows the attribute-value tests of the
two rule sets. Using the attribute evaluation approach, four rules with a test set
accuracy of 93.9% were extracted. The accuracy of the individual rules ranged
from 89.9% to 100.0%. Using the decision boundary threshold values obtained
from the sensitivity analysis approach, an improvement of 2.0% on the overall
accuracy was achieved. An improvement of 4.0% was achieved on the least accu-
rate rule. For this set of experiments, the decision boundary detection algorithm
produced an accurate, general set of rules.
The aim of this section is to illustrate the sensitivity analysis decision boundary
detection algorithm in a noisy domain that contained incorrect values. The breast
cancer data set, obtained from the UCI machine learning repository was used
for this purpose. Originally, the breast cancer database was obtained from Dr.
William H. Wolberg of the University of Wisconsin Hospitals, Madison. The
data set contained 699 tuples and distinguished between benign (noncancerous)
breast diseases and malignant cancer. The data set concerned 458 (65.5%) benign
and 241 (34.5%) malignant cases. In practice, over 80 percent of breast lumps
are proven benign.
The data set contained missing values and the level of noise (incorrect values)
was unknown. There are 10 input attributes, including the redundant sample
code number. The other nine inputs concerned the results obtained from the
tissue samples that were pathologically analyzed.
A 10-10-2 ANN was trained, using sigmoid activation functions with a high
slope to approximate linear threshold values. The sensitivity analysis pruning
algorithm reduced the ANN to a 3-3-2 network that produced six rules. The
classification test accuracy of this ANN was 95.2%. Next, the attribute-value
test thresholds were determined using the attribute evaluation method and the
sensitivity analysis decision boundary detection algorithm. The rule sets for both
methods were extracted. For the original attribute evaluation method, the rule
set accuracy was 79.6%. The individual rule accuracies ranged from 66.4% to
85.3%. The accuracy of the rule set that was produced after the results of the
sensitivity analysis decision boundary detection algorithm were incorporated was
94.3%, giving an improvement of 14.7%. The individual rule accuracies ranged
from 65.4% to 93.4%. The fidelity of the final rule set is high, since the rule set
accuracy of 94.3% is comparable to that of the original ANN (95.2%).
5 Conclusion
References
[Baum 1991] EB Baum, Neural Net Algorithms that Learn in Polynomial Time from
Examples and Queries, IEEE Transactions on Neural Networks, 2(1), 1991, pp 5-19.
[Cohn et al 1994] D Cohn, L Atlas, R Ladner, Improving Generalization with Active
Learning, Machine Learning, Vol 15, 1994, pp 201-221.
[Craven et al 1993] MW Craven and JW Shavlik, 1993. Learning Symbolic Rules using
Artificial Neural Networks, Proceedings of the Tenth International Conference on
Machine Learning, Amherst: USA, pp.79-95.
[Engelbrecht et al 1996] AP Engelbrecht, I Cloete, A Sensitivity Analysis Algorithm
for Pruning Feedforward Neural Networks, IEEE International Conference in Neural
Networks, Washington, Vol 2, 1996, pp 1274-1277.
[Engelbrecht et al 1998a] AP Engelbrecht and I Cloete, 1998. Selective Learning us-
ing Sensitivity Analysis, 1998 International Joint Conference on Neural Networks
(IJCNN'98), Alaska: USA, pp.1150-1155.
[Engelbrecht 1998b] AP Engelbrecht, 1998. Sensitivity Analysis of Multilayer Neural
Networks, submitted PhD dissertation, Department of Computer Science, University
of Stellenbosch, Stellenbosch: South Africa.
[Fu 1994] LM Fu, Rule Generation from Neural Networks, IEEE Transactions on Sys-
tems, Man and Cybernetics, Vol 24, No 8, August 1994, pp 1114-1124.
[Hwang et al 1991] J-N Hwang, JJ Choi, S Oh, RJ Marks II, Query-Based Learning
Applied to Partially Trained Multilayer Perceptrons, IEEE Transactions on Neural
Networks, 2(1), January 1991, pp 131-136.
[Sestito et al 1994] S Sestito and TS Dillon, 1994. Automated Knowledge Acquisition,
Prentice-Hall, Sydney: Australia.
[Towell 1994] GG Towell and JW Shavlik, Refining Symbolic Knowledge using Neural
Networks, Machine Learning, Vol. 12, 1994, pp 321-331.
[Viktor et al 1995] HL Viktor, AP Engelbrecht and I Cloete, 1995. Reduction of Sym-
bolic Rules from Artificial Neural Networks using Sensitivity Analysis, IEEE Inter-
national Conference on Neural Networks (ICNN'95), Perth: Australia, pp.1788-1793.
[Viktor et al 1998a] HL Viktor, AP Engelbrecht, I Cloete, Incorporating Rule Extrac-
tion from ANNs into a Cooperative Learning Environment, Neural Networks & their
Applications (NEURAP'98), Marseilles, France, March 1998, pp 386-391.
[Viktor 1998] HL Viktor, 1998. Learning by Cooperation: An Approach to Rule Induc-
tion and Knowledge Fusion, submitted PhD dissertation, Department of Computer
Science, University of Stellenbosch, Stellenbosch: South Africa.
The Role of Dynamic Reconfiguration for
Implementing Artificial Neural Networks Models in
Programmable Hardware
Abstract. In this paper we address the problems posed when Artificial Neural
Networks models are implemented in programmable digital hardware. Within
this context, we shall especially emphasise the realisation of the arithmetic
operators required by these models, since it constitutes the main constraint (due
to the required amount of resources) found when they are to be translated into
physical hardware. The dynamic reconfiguration properties (i.e., the possibility
to change the functionality of the system in real time) of a new family of
programmable devices called FIPSOC (Field Programmable System On a Chip)
offer an efficient alternative (both in terms of area and speed) for implementing
hardware accelerators. After presenting the data flow associated with a serial
arithmetic unit, we shall show how its dynamic implementation in the FIPSOC
device is able to outperform systems realised in conventional programmable
devices.
1 Introduction
The advances raised during the last years in the microelectronics fabrication processes
have facilitated the advent of new families of FPGA (Field Programmable Gate
Arrays) devices with increasing performance (in terms of both capacity, i.e., number
of implementable equivalent gates, and processing speed). This has motivated their
popularity in the implementation of complex embedded systems for industrial
applications.
Due to their inherent capability of tackling complex, highly non-linear optimisation
tasks (like classification, time series prediction, etc.), Artificial Neural Network
models have been progressively incorporated as a functional section of the final
system. As a consequence, there have been several approaches, [1], [2], [3], [4],
dealing with the digital implementation of different neural models in programmable
hardware. However, due to the amount of resources required by the arithmetic
operations (especially digital multiplication), these realisations have been limited to
small models or alternatively have required many programmable devices.
In recent years, the programmable hardware community has shown a
trend towards the integration of dynamic reconfiguration properties into conventional
FPGA architectures [5]. As a consequence, there have already been several proposals,
coming from both the academic [6] and the industrial [7], [8], [9] communities. The
term dynamic reconfiguration means the possibility to change, totally or partially, the
functionality of a system using a transparent mechanism, so that the system does not
need to be halted while it is being reconfigured. This feature was not available in
early FPGA devices, whose reconfiguration time is usually several orders of
magnitude larger than the execution delay of the system. In this paper we shall
concentrate our attention on the device presented in [9], which constitutes a new
concept of programmable devices, since it includes a programmable digital section
with dynamic reconfiguration properties, a configurable analog section and a
microcontroller, thus constituting an actual system on a chip. Through a careful use of
the dynamic configuration properties of the programmable digital section we shall
provide efficient arithmetic strategies which could assist in the development of
customisable neural coprocessors for real world applications.
The paper is organised as follows: In the next section we shall briefly explain the
main features of the FIPSOC device, paying special attention to those related to its
dynamic reconfiguration properties. Then we shall evaluate some efficient arithmetic
strategies capable of handling the data flow associated with neural models. Bearing in
mind the intrinsic characteristics of the FIPSOC family, we shall then present an
efficient serial scheme for implementing digital multipliers, providing throughput
estimates obtained from the first physical samples. Finally, the conclusions and future
work will be outlined.
[Figure 1: Internal architecture of the FIPSOC device, showing its five main sections.]
As can be seen, the internal architecture of the FIPSOC device is divided into five
main sections: the microcontroller, the programmable digital section, the configurable
analog part, the internal memory, and the interface between the different functional
blocks.
Because the initial goal of the FIPSOC family is to target general-purpose mixed-
signal applications, the microcontroller included in the first version of the device is a
fully compliant 8051 core, which also includes peripherals like a serial port, timers,
parallel ports, etc. Apart from running general-purpose user programs, it is in charge
of handling the initial setup of the device, as well as the interface and configuration of
the remaining sections.
The main function of the analog section is to provide a front-end able to perform
some basic conditioning, pre-processing and acquisition functions on external analog
signals. It is composed of four major blocks: the gain block, the data
conversion block, the comparators block and the reference block. The gain block
consists of twelve differential, fully balanced, programmable gain stages, organised as
four independent channels. Furthermore, it is possible to access every input
and output of the first amplification stage in two of the channels. This feature makes
it possible to construct additional analog functions, like filters, by using external passive
components. The comparators block is composed of four comparators, each one at the
output of an amplification channel. Each two comparators share a reference signal
which is the threshold voltage to which the input signal is to be compared. The
reference block is constructed around a resistor divider, providing nine internal
voltage references. Finally, the data conversion block is configurable, so that it is
possible to provide a 10-bit DAC or ADC, two 9-bit DAC/ADCs, four 8-bit
DAC/ADCs, or one 9-bit and two 8-bit DAC/ADCs. Since nearly any internal point
of the analog block can be routed to this data conversion block, the microprocessor
can use the ADC to probe in real time any internal signal by dynamically
reconfiguring the analog routing resources.
Regarding the programmable digital section, it is composed of a two-dimensional
array of programmable cells, called DMCs (Digital Macro Cell). The organisation of
these cells is shown in figure 2.
As it can be deduced from this figure, it is a large-granularity, 4-bit wide
programmable cell. The sequential block is composed of four registers, whose
functionality can be independently configured as a mux-, E- or D-type flipflop or
latch. Furthermore, it is also possible to define the polarity of the clock (rising/falling
edge) as well as the set/reset configuration (synchronous/asynchronous). Finally, two
main macro modes (counter and shift register) have been provided in order to allow
for compact and fast realisations.
The combinational block of the DMC has been implemented by means of four
16xl-bit dual port memory blocks (Look Up Tables - LUTs - in figure 2). These
ports are connected to the microprocessor interface (permitting a flexible management
of the LUTs contents) and to the DMC inputs and outputs (allowing for their use as
either RAM or combinational functions). Furthermore, an adder/subtractor macro
mode has been included in this combinational block, so as to permit the efficient
implementation of arithmetic functions.
A distinguishing feature of this block is that its implementation permits its use
either with a fixed (static mode) or with two independently selectable (dynamic
reconfigurable mode) functionalities. Each 16-bit LUT can be accessed as two
independent 8-bit LUTs. Therefore it is possible to use four different 4-LUTs in static
mode, sharing two inputs every two LUTs, as depicted in figure 2, or four
independent 3-LUTs in each context in dynamic reconfigurable mode. Table 1
summarises the operating modes attainable by the combinational block of the DMC in
static mode and in each context in dynamic reconfigurable mode.
Furthermore, since the operating modes indicated in table 1 are implemented in
two independent 16x2-bit RAMs (8x2-bit RAMs in dynamic reconfigurable mode), it
is possible to combine the functionalities depicted in this table. For instance, it is
possible to configure the combinational block in order to provide one 5-LUT and one
88
16x2-bit RAM in static mode or two 3-LUTs and one 4-LUT in dynamic
reconfigurable mode.
Fig. 2. Organisation of the basic cell (DMC) in the programmable digital section.
Multicontext dynamic reconfiguration properties have also been provided for
the sequential block of the DMC. For this purpose, the data stored in each register has
been physically duplicated. In addition, an extra configuration bit has been provided
in order to allow the contents of the registers to be saved when the context is changed
and recovered when the context becomes active again.
In order to enhance the overall flexibility of the system, an isolation strategy has
been followed when implementing the configuration scheme of the FIPSOC device.
This strategy, depicted in figure 3(a), provides an effective separation between the
actual configuration bit and the mapped memory through an NMOS switch. This
switch can be used to load the information stored in the memory to the configuration
cell, so that the microprocessor can only read and write the mapped memory. This
implementation is said to have one mapped context (the one mapped in the
microprocessor memory space) and one buffered context (the actual configuration
memory which directly drives the configuration signals).
The benefits of this strategy are clear. First, the mapped memory can be used to
store general-purpose user programs or data once its contents have been transferred
to the configuration cells. Furthermore, the memory cells are safer, since their output
does not directly drive the other side of the configuration bit. Finally, at the expense
of increasing the required silicon area, it is possible to provide more than one mapped
context to be transferred to the buffered context, as depicted in figure 3(b). This is the
actual configuration scheme implemented in the FIPSOC device, and it makes it
possible to change the configuration of the system just by issuing a memory write
command. Furthermore, the programmable hardware also has access to the resources
which implement this context swap process. In this way, it is even possible to change
the actual configuration of the DMCs in just one clock cycle. As will be explained
in the following sections, this constitutes the basis of the strategy we shall use
to efficiently implement arithmetic operators for artificial neural network models.
Fig. 3. (a) Configuration scheme. (b) Multicontext configuration.
Multiplication and addition are among the most common operators found in the data
flow associated with Artificial Neural Networks models. For instance, they are found
in the synaptic function of the neurons constituting a Multilayer Perceptron network
or in the distance calculation process inherent to Learning Vector Quantization (LVQ)
or Radial Basis Function (RBF) models. Since most commercial FPGA devices
include specific hardware macros devoted to the realisation of fast and compact adder
units, addition does not usually represent a serious limitation when a digital
implementation for these neural models is envisioned.
On the contrary, the implementation of a digital multiplier usually requires too
many physical resources or a large latency, thus penalising the performance (in terms
of area and/or execution delay) of the final system. The advent of programmable
devices with dynamic reconfiguration properties has resulted in new strategies for the
physical realisation of multiplier units. In this way, the alternative presented in [10] is
based on what has been termed partial evaluation. The term partial evaluation refers
to the possibility of simplifying certain functions when some operands are fixed. This
is the case of artificial neural networks models during the recall phase, since the
neurons' weights have been already established during the learning phase. For
instance, if we consider the multiplication of two 4-bit numbers, there are 16 possible
8-bit results if one of the operands is fixed. As a consequence, the multiplier could
be implemented in this case by means of 8 4-input LUTs. This is the approach which
was introduced in [10], which is represented in figure 5 for the case of 8-bit numbers.
Figure 5(a) shows how the multiplication of an 8-bit constant (A) by an arbitrary 8-
bit number (B) can be constructed as the overlapped addition of two 12-bit numbers
(resulting from the partial products A × B1 and A × B2, respectively), both of them
obtained from 24 4-LUTs, as indicated in figure 5(b). Since the combinational part of
the DMC included in the FIPSOC device allows, in static mode, for the realisation of
up to 4 4-LUTs (sharing two inputs every two LUTs) or one 4-bit adder, it can be
easily deduced that the multiplier depicted in figure 5 can be implemented using 9
DMCs. The execution delay of this multiplier can be quite low, since it is given by
one LUT access (i.e., the time associated with a read cycle in an SRAM) plus the
propagation delay of the 12-bit adder.
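As an illustration of the partial-evaluation idea, the following sketch (ours, not the authors' circuit; the function names are hypothetical) models the LUT-based constant-coefficient multiplier: with A fixed, each 4-bit half of B addresses a precomputed table of partial products, and one shifted addition combines them.

```python
# Sketch of constant-coefficient multiplication by partial evaluation, as
# described for the approach of [10]: with the 8-bit operand A fixed, each
# 4-bit nibble of B addresses a small lookup table of 12-bit partial products.

def build_luts(a):
    """Precompute the 16-entry table A * n for every 4-bit nibble n."""
    assert 0 <= a < 256
    return [a * n for n in range(16)]  # each entry fits in 12 bits

def lut_multiply(luts, b):
    """Multiply the fixed constant by an arbitrary 8-bit B using two table
    lookups and one overlapped 12-bit addition (B = B2:B1)."""
    b1, b2 = b & 0xF, (b >> 4) & 0xF   # low and high nibbles of B
    return luts[b1] + (luts[b2] << 4)  # the shift aligns the two halves

luts = build_luts(201)            # A = 201 is an arbitrary example constant
print(lut_multiply(luts, 173))    # same value as 201 * 173
```

Since A * B = A * B1 + 16 * (A * B2), the two table outputs overlap by four bit positions, which is exactly the 12-bit addition of figure 5(a).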
Fig. 5. (a) Partial evaluation principle. (b) Implementation with 4-LUTs of an 8-bit multiplier.
Fig. 7. Implementation of an 8 x 8-bit serial multiplier in the FIPSOC device.
As can be deduced from this figure, two DMCs (DMC 1 and 2) of the four
DMCs required are used just for generating the 8-bit partial product to be
accumulated each clock cycle. However, we can further optimise this implementation
by using the dynamic configuration properties of the FIPSOC device. As explained
in the previous section, each configuration bit of the DMCs is attached to
two mapped configurations (i.e., there are two mapped contexts for each buffered
context). Furthermore, there is an input signal in each DMC which permits switching
between both contexts. Therefore, since the routing structure used for the inputs
attached to each DMC is based on a multiplexer (which, in addition to attaching each
input to a given routing channel, is able to fix it to a logic level 1 or 0), we can emulate
the AND function required to obtain the partial products in the serial multiplier by
means of a context swap controlled by the a_i bit. In this way, if the a_i bit
equals 1, it selects the context where each input is connected to the corresponding b_j
bit of the second operand, while if the a_i bit equals 0, it activates the context
where all the inputs are tied to ground. This context swap governed by the value of a_i
is possible since the state of the registers can be saved when a context swap is
produced, and furthermore the 4-bit adder functionality is available in both contexts.
This scheme thus permits the implementation of an 8 × 8-bit digital multiplier using
just 2 DMCs. Furthermore, since all the signals (except the carry signals transferred
between the DMCs, which have fast dedicated routing channels) are propagated
locally (i.e., inside a DMC), the overall execution delay can be kept very small
(operation at a clock frequency of 96 MHz has already been qualified for the
FIPSOC device), since the propagation delays incurred when traversing routing
resources between DMCs are removed.
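A behavioural sketch of this serial scheme (ours; the DMC mapping details are abstracted away) makes the context-swap trick explicit: each cycle, the current bit a_i either passes operand B through or forces an all-zero partial product.

```python
# Behavioural model of the bit-serial 8 x 8 multiplier: one partial product
# per clock cycle, selected by the bit a_i as the context swap does in hardware.

def serial_multiply(a, b, width=8):
    """Bit-serial shift-and-add multiplication of two `width`-bit operands."""
    acc = 0
    for i in range(width):
        a_i = (a >> i) & 1
        partial = b if a_i else 0  # context swap: inputs follow b_j or are tied to ground
        acc += partial << i        # accumulate the shifted partial product
    return acc                     # full 16-bit product after `width` cycles

print(serial_multiply(0xB7, 0x5C))  # same result as 0xB7 * 0x5C
```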
Finally, since the result produced by the multiplier is obtained serially, it is
possible to combine each multiplier with a shift register and a 1-bit adder in order to
accumulate the results and finally provide the activation potential of the neuron. In
this way, we could construct an array of processing elements organised following a
Broadcast Bus Architecture, as depicted in figure 8.
Fig. 8. Array of processing elements organised as a Broadcast Bus Architecture.
As can be seen, there is a global bus shared by all the processing elements (PEs
in the figure), which is in charge of providing the inputs (x_j in the figure) to all the
neurons, where they will be multiplied by the corresponding synaptic weights (a_i in
the figure). An array composed of 12 such units could be mapped onto the first device of
the FIPSOC family (which includes an array of 8 × 12 DMCs). Therefore, since a
maximum clock frequency of 96 MHz can be used, the maximum throughput of the
system is 70 MCPS (Millions of Connections Per Second), thus offering an efficient
alternative for the implementation of neural accelerators in programmable hardware.
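The throughput figure can be cross-checked with a rough calculation; the per-connection cycle count used below is our assumption (one serial 8 × 8 multiplication plus accumulation overhead), not a number taken from the paper.

```python
# Back-of-the-envelope throughput estimate for the 12-PE array.
CLOCK_HZ = 96e6             # qualified FIPSOC clock frequency
NUM_PES = 12                # PEs fitting in the 8 x 12 DMC array
CYCLES_PER_CONNECTION = 16  # assumed: 8 serial steps plus pipeline overhead

mcps = CLOCK_HZ * NUM_PES / CYCLES_PER_CONNECTION / 1e6
print(f"{mcps:.0f} MCPS")   # on the order of the 70 MCPS quoted in the text
```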
4 Conclusions
After presenting the main features and the global organisation of the FIPSOC
(Field Programmable System On a Chip) devices, we have reviewed some strategies
for implementing digital multipliers, which are the core of the arithmetic unit used to
physically realise neural models. By improving a serial multiplication scheme with
the dynamic reconfiguration properties of the FIPSOC devices, we have derived an
architecture which provides an efficient solution for the implementation of parallel
processing systems in programmable hardware.
Our current efforts are concentrated on the exhaustive qualification of the samples
corresponding to the first member of the FIPSOC family, as well as on the
implementation and characterisation of the proposed architecture.
Acknowledgements
This work is being carried out under the ESPRIT project 21625 and Spanish CICYT
project TIC-96-2015-CE.
References
1. Cox, C., Blanz, W.E.: GANGLION: A fast field-programmable gate array implementation
of a connectionist classifier. IEEE Journal of Solid-State Circuits, Vol. 27, No. 3 (1992)
288-299
2. Beiu, V., Taylor, J.G.: Optimal Mapping of Neural Networks Onto FPGAs - A New
Constructive Algorithm. In: Mira, J., Sandoval, F. (eds.): From Natural to Artificial Neural
Computation. Lecture Notes in Computer Science, Vol. 930. Springer-Verlag, Berlin
Heidelberg New York (1995) 822-829
3. Hartmann, G., Frank, G., Schäfer, M., Wolff, C.: SPIKE 128K - An Accelerator for Dynamic
Simulation of Large Pulse-Coded Networks. In: Klar, H., König, A., Ramacher, U. (eds.):
Proceedings of the 6th International Conference on Microelectronics for Neural Networks,
Evolutionary & Fuzzy Systems. University of Technology Dresden (1997) 130-139
4. Pérez-Uribe, A., Sanchez, E.: FPGA Implementation of Neuronlike Adaptive Elements. In:
Gerstner, W., Germond, A., Hasler, M., Nicoud, J.-D. (eds.): Artificial Neural Networks -
ICANN'97. Lecture Notes in Computer Science, Vol. 1327. Springer-Verlag, Berlin
Heidelberg New York (1997) 1247-1252
5. Becker, J., Kirchbaum, A., Renner, F.-M., Glesner, M.: Perspectives of Reconfigurable
Computing in Research, Industry and Education. In: Hartenstein, R., Keevallik, A. (eds.):
Field-Programmable Logic and Applications. Lecture Notes in Computer Science, Vol.
1482. Springer-Verlag, Berlin Heidelberg New York (1998) 39-48
6. DeHon, A.: Reconfigurable Architectures for General-Purpose Computing. A.I. Technical
Report No. 1586. MIT Artificial Intelligence Laboratory (1996)
7. Churcher, S., Kean, T., Wilkie, B.: The XC6200 FastMap Processor Interface. Field
Programmable Logic and Applications, Proceedings of FPL'95. Springer-Verlag (1995) 36-
43
8. Hesener, A.: Implementing Reconfigurable Datapaths in FPGAs for Adaptive Filter Design.
Field Programmable Logic, Proceedings of FPL'96. Springer-Verlag (1996) 220-229
9. Faura, J., Horton, C., Van Duong, P., Madrenas, J., Insenser, J.M.: A Novel Mixed Signal
Programmable Device with On-Chip Microprocessor. Proceedings of the IEEE 1997
Custom Integrated Circuits Conference (1997) 103-106
10. Kean, T., New, B., Slous, B.: A Fast Constant Coefficient Multiplier for the XC6200.
Field-Programmable Logic. Lecture Notes in Computer Science, Vol. 1142. Springer-Verlag,
Berlin Heidelberg New York (1996) 230-236
An Associative Neural Network and Its Special
Purpose Pipeline Architecture in Image Analysis
Topics: Computer vision, neural nets, texture recognition, real-time quality control
Abstract. There are several approaches to texture analysis and classification. Most have limitations in
discrimination accuracy or in computational complexity. A first phase is the extraction of texture
features, which are later classified. Texture features should have the following properties: be invariant
under the transformations of translation, rotation, and scaling; have good discriminating power; and take
the non-stationary nature of texture into account. In our approach we use Orthogonal Associative Neural
Networks for texture identification, both in the feature extraction phase and in the classification phase
(where the network's energy function is minimised). Due to its low computational cost and regular
computational structure, the implementation of a real-time texture classifier based on this algorithm is
feasible. There are several platforms on which to implement Artificial Neural Networks (VLSI chips, PC
accelerator cards, multiboard computers, ...). The choice depends on the type of neural model, the
application, the response time, the storage capacity, the type of communications, and so on. In this paper
we present a pipeline architecture in which precision, cost and speed are optimally traded off. In addition
we propose CPLD (Complex Programmable Logic Device) chips for the complete realisation of the
system. CPLD chips offer reasonable density and performance at low cost.
1. Introduction
Texture segmentation is one of the most important tasks in the analysis of texture
images [1]. It is at this stage that different texture regions within an image are isolated
for subsequent processing, such as texture recognition. The major problem of texture
analysis is the extraction of texture features. Texture features should have the
following properties: be invariant under the transformations of translation, rotation,
and scaling; have good discriminating power; and take the non-stationary nature of
texture into account. There are two basic approaches for the extraction of texture
features: structural and statistical [2]. The structural approach assumes the texture is
characterized by some primitives following a placement rule. In this view, to describe
a texture one needs to describe both the primitives and the placement rule. This
approach is restricted by complications encountered in determining the primitives and
the placement rules that operate on these primitives. Therefore, textures suitable for
structural analysis have in practice been confined to quite regular textures rather than
more natural ones. In the statistical approach, texture is regarded as a sample from a
probability distribution on the image space and defined by a stochastic model or
characterized by a set of statistical features. The most common features used in
practice are based on pattern properties. They are measured from first- and second-
order statistics and have been used as discriminators between textures.
For real-time image analysis, for example in the detection of defects in textile fabric, the
complexity of the calculations has to be reduced in order to limit the system costs [3].
Additionally, algorithms which are suitable for migration into hardware have to be
chosen. Both the extraction method for texture features and the classification algorithm
must satisfy these two conditions. Moreover, the extraction method should have the
following properties: be invariant under the transformations of translation, rotation,
and scaling; have good discriminating power; and take the non-stationary nature of
texture into account. We choose the Morphologic Coefficient [8] as a feature extractor,
which is well suited to implementation by associative memories and dedicated
hardware.
On the other hand, the classification algorithm should be able to store all of the patterns,
have a high correct classification rate and a real-time response. There are many
classifier models based on artificial neural networks. Hopfield [11] and [12] introduced
a first model of one-layer autoassociative memory. The Bidirectional Associative
Memory (BAM) was proposed by Kosko [14] and generalizes the model to be
bidirectional and heteroassociative. BAMs have storage capacity problems [17].
Several improvements have been proposed (Adaptive Bidirectional Associative
Memories [15], multiple training [17] and [18], guaranteed recall, and many more).
One-step models without iteration have been developed too (Orthonormalized
Associative Memories [9] and Hao's associative memory [10], which uses a
hidden layer). In this paper, we propose a new model of associative memory which
can be used in bidirectional or one-step mode.
Artificial neural networks need a high number of computations and a lot of data interchange
[5]. So, parallel and high-integration techniques (multiprocessors, array processors,
superscalar chips, segmentation, VLSI chips, ...) have been used for their
implementation. Neural models come in many ways and flavors; implementations
include analog, digital and hybrid ones. However, when we are looking for
an adequate platform onto which to map a neural model and its application, we choose the most
suitable for both. In our case, we use Complex Programmable Logic Device (CPLD)
chips to implement a small associative memory, and we use it for texture characterization
and classification. These CPLD devices combine gate-array flexibility and desktop
programmability, so we can design a circuit and test and probe it quickly (avoiding
fabrication cycle times). On the other hand, a CPLD only has some thousands of gates, so its
use is only adequate for specific neural models and applications.
The Hausdorff Dimension (HD) was first proposed in 1919 by the mathematician
Hausdorff and has been used mainly in fractal studies [4]. One of the most attractive
features of this measure when analyzing images is its invariance under
isometric transformations. We will use the HD when extracting features. Given an image I
belonging to R² and S a set of points in that image, that is, S ⊂ I, the HD of that
set is defined as follows.
The HD is invariant to isometric and similarity transforms of the image. This property
makes it appropriate for object recognition. It is difficult to calculate the dimension
from the definition. Because of that, alternative methods have been proposed, such as the
mass dimension, the box dimension, etc. Our proposal is to prove that the calculation of
Theorem I. The Hausdorff dimension of a set S (Dh(S)) can be calculated from its
semicover (δ-sm(S)) with the following expression:
with
and
Proof. Definitions II, III and IV express the HD from a δ-cover of the set S. We
only have to consider that in the limit (δ → 0), it follows that δ-cover(S) = δ-sm(S).
Theorem I allows us to express the calculation of the HD as the problem of calculating
a semicover of a set. This implies that its computation with semicovers inherits the
invariance properties of the dimension. And, inversely, the characterization of
semicovers as an NP-complete problem allows us to estimate the complexity of
evaluating the HD.
We can approximate the HD by semicovers, so we define the morphologic coefficient,
which can be used for feature extraction. We call the morphologic coefficient of the
semicover of a set S over a morphologic element A_i of diameter δ = |A_i|:
It is at the level of discretization where the goodness of the semicover method can be
seen in comparison with set-cover approaches (box dimension). For discrete values of δ,
the δ-semicover is much more restrictive than the δ-cover of the set S, and this allows
us to capture its topologic characteristics much better [8]. Therefore, in
practice it allows better feature extraction. The δ-semicover offers output patterns
with a larger Hamming distance than the δ-cover and therefore allows a better
classification process.
Characterization of the texture
In order to extract the invariant characteristics of an image we divide it into several
planes according to the intensity level of each point. Then we define the
multidimensional morphologic coefficient as the vector formed by the CM of each
of these planes. We can characterize the texture with its CM vector:
r = [CM_1, CM_2, ..., CM_p] ; p = number of planes into which the image is partitioned
The CM vectors of the patterns will be employed in the learning process of the
classifier, which is described below.
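A toy sketch of this characterisation follows (our simplification, not the definition of [8]: the true morphologic coefficient involves the δ-semicover, which we replace here with a plain count of fully covered 2×2 mask positions per plane):

```python
# Simplified CM-vector extraction: split the image into intensity planes by
# threshold intervals, then take each plane's coefficient as the number of
# 2x2 mask positions entirely contained in the plane.

def cm_vector(image, thresholds):
    """Return [CM_1 .. CM_p] for the p planes defined by the threshold list."""
    rows, cols = len(image), len(image[0])
    vector = []
    for th_min, th_max in zip(thresholds, thresholds[1:]):
        # binary plane: 1 where the pixel intensity falls in [th_min, th_max)
        plane = [[1 if th_min <= image[r][c] < th_max else 0 for c in range(cols)]
                 for r in range(rows)]
        # count positions where the 2x2 morphologic element fits entirely
        cm = sum(plane[r][c] and plane[r][c + 1] and plane[r + 1][c] and plane[r + 1][c + 1]
                 for r in range(rows - 1) for c in range(cols - 1))
        vector.append(cm)
    return vector

img = [[10, 10, 200], [10, 10, 200], [90, 90, 90]]
print(cm_vector(img, [0, 64, 128, 256]))  # one coefficient per plane
```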
In this paper, we propose a new model of associative memory which can be used in
bidirectional or one-step mode. This model uses a hidden layer, proper filters and
orthogonality to increase the storage capacity and reduce the noise effect of linear
dependencies between patterns. Our model, which we call Bidirectional Associative
Orthogonal Memory (MAO), goes beyond the BAM capacity. The BAM and MAON
models are particular cases of it.
However, it is a particular representation of a more general model [8]. So, when we use
this neural network as a classifier, the particular values of the Q matrix and the filters are
different. For example, the filter in the hidden layer is designed to obtain the maximum
response in forward classification mode.
Usually the quality of the fabric is controlled by human visual inspection. In order
to substitute this work by automatic visual inspection, fast image processing systems
are required [3]. If we consider that the fabric is processed at 100-180 m per minute,
then 5-15 MB of image data have to be processed per second. We propose a pipelined
architecture that carries out this job in real time, and suggest CPLD or FPGA
chips to implement it due to their adequate cost/performance ratio.
Figure 1 shows the block diagram of the texture analysis system (TAS). The TAS
inputs are the eight bits of the pixel and the five thresholds that determine the
intervals of each of the four planes into which we divide the image to be analyzed. These
thresholds are predefined or may be programmed depending on the application.
The TAS is divided into two modules: the Analysis module and the Classification module. The first
module performs the feature extraction of the image using the CM. The second module
performs the classification of the texture using the MAON algorithm. The
Classification module finds the pattern at minimum distance from the texture. This is
equivalent to maximising [8]:
2·(CM_x · CM_i) - ||CM_i||² ;
where CM_x is the CM vector of the texture to classify and ||CM_i||² is the square
module of the CM vector of the i-th pattern. So, in the learning mode, the patterns (CM_i)
and their square modules (||CM_i||²) must be stored in the Classification module, and in
the recognition mode the Classification module has to calculate the dot product and
the subtraction.
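The minimum-distance rule can be sketched as follows (ours; the vectors are made-up examples): selecting the stored pattern CM_i that maximises 2(CM_x·CM_i) − ||CM_i||² yields the same winner as minimising ||CM_x − CM_i||², because the ||CM_x||² term is common to all candidates.

```python
# Nearest-pattern classification via the dot-product score used by the
# Classification module: the squared modules are stored at learning time.

def classify(cm_x, patterns):
    """Return the index of the stored CM vector most similar to cm_x."""
    def score(cm_i):
        dot = sum(x * y for x, y in zip(cm_x, cm_i))
        sq_mod = sum(y * y for y in cm_i)  # ||CM_i||^2, stored in learning mode
        return 2 * dot - sq_mod
    return max(range(len(patterns)), key=lambda i: score(patterns[i]))

stored = [[5, 1, 0, 2], [0, 4, 4, 1], [3, 3, 3, 3]]
print(classify([1, 4, 3, 1], stored))  # index of the nearest pattern
```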
The CM unit
The CM unit is designed to calculate the 4-dimensional CM (four-plane partitioning)
using a 2×2-pixel morphologic element (also named mask). The CM calculations for
each of the four planes are performed in parallel, so four circuits like the one shown
in Figure 3 are necessary. Notice that the column image data is fed to the CM unit in a serial
manner via an 8-bit bus, and the filter module (Figure 3) produces a "1" if the pixel
intensity level belongs to the interval [Thmin, Thmax]. The 2-bit shift register and the
vertical mask process consecutive filtered pixel values within a column.
Fig. 2. CM unit datapath
Let the image have size m × n. To provide the data for the horizontal mask, m+1
filtered pixel values (1 bit each) must be stored. Figure 3 also shows the FIFO array. Every
time the horizontal mask produces a "1" the mask counter is incremented. When the last
pixel has been processed, the CM is stored in the counter.
The CM unit is pipelined in four stages:
First stage: read the pixel and filter
Second stage: shift and vertical mask
Third stage: shift to the FIFO array and horizontal mask
Fourth stage: increment the mask counter
Each pixel is processed in one clock cycle (clk1). The number of cycles to calculate
the CM of an m × n frame is [4 + (m × n) - 1].
The Dot Product unit
It performs the dot product (CM_x · CM_i). The patterns are stored in four 256×8 RAM
modules. The memories are organized in such a way that the first components of all
CM vectors are in module RAM0, the second components are in module RAM1, and so
on. So, when the address is "00000000", the outcome of the RAM modules is the four
components of the first pattern. Then the dot product is computed in parallel with the four
multipliers and the partial adders.
The unit can work in two modes: store mode and recognition mode. In the first, the
CM vectors of the patterns are stored in the RAM memories (RAM0..RAM3), the
square modules (||CM_i||²) in the RAMmod (256×18), and the number of patterns in the
address counter. The data come from the host computer or may be calculated by the
system itself. The second mode is the normal operation mode.
Fig. 3. Filter module and Dot Product unit.
Like the previous unit, the Dot Product unit is pipelined too, and the number of clock cycles
(clk2) needed to perform one dot product is 3:
First stage: generate the address and access the memories
Second stage: product of the individual components
Third stage: partial adders and, in parallel, RAMmod access and 2's complement
The address generator is an 8-bit down counter; additionally, the address is used as the
index of the product, so the dot product and its index travel together to the Classifier
unit.
Classifier unit
This unit produces the index of the pattern most similar to the texture. First, the dot
product is doubled simply by appending a zero to the LSB to produce a 19-bit number;
next, the sign bit is added (appending a zero to the MSB), finally resulting in a 20-bit
number. In the same way, two bits (set to one) are added to ||CM_i||² to convert it
into a 20-bit negative number. Next, the unit performs the addition, and the result is inverse
2's-complemented if the sign is negative. Finally, it is compared with the previous data,
and the larger value, together with its index, is stored in the auxiliary registers. When all
comparisons are completed, the most similar pattern and its index will be in the auxiliary
registers.
The stages of the Classifier unit are:
First stage: adder and inverse 2's complement
Second stage: magnitude and sign comparison
Third stage: store the winner
Figure 4. Classifier unit
Let p be the number of patterns stored previously; then the total number of cycles (clk2) in the
recognition process (Dot Product unit + Classifier unit) is [6 + p - 1]. Therefore, the
total latency in texture recognition (analysis + classification) of an m × n pixel window
using p texture patterns is:
RLat = [4 + (m × n) - 1] clk1 + [6 + p - 1] clk2
We can consider the TAS as a two-stage pipeline (feature extraction and
recognition); then the processing of different windows is overlapped. The total latency
in the texture recognition of k windows is:
Tk = [2 + k - 1] clk ; clk = max{[4 + (m × n) - 1] clk1, [6 + p - 1] clk2}
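The latency expressions above can be transcribed directly to make the cycle accounting explicit (our sketch; the pattern count p = 50 used in the example is an arbitrary assumption):

```python
# Direct transcription of the TAS latency formulas; clk1 and clk2 are the
# periods (in seconds) of the two pipeline clocks.

def recognition_latency(m, n, p, clk1, clk2):
    """RLat = [4 + (m*n) - 1]*clk1 + [6 + p - 1]*clk2 for a single window."""
    return (4 + m * n - 1) * clk1 + (6 + p - 1) * clk2

def k_window_latency(k, m, n, p, clk1, clk2):
    """Tk = [2 + k - 1]*clk, clk being the slower of the two pipeline stages."""
    clk = max((4 + m * n - 1) * clk1, (6 + p - 1) * clk2)
    return (2 + k - 1) * clk

# Example: 70x70 window, clk1 = 30 ns and clk2 = 100 ns (the periods
# suggested in the VLSI design discussion), assumed p = 50 stored patterns.
print(recognition_latency(70, 70, 50, 30e-9, 100e-9))  # seconds
```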
To test the texture analysis algorithm (feature extraction and classifier) we considered
the problem of defect detection in textile fabric. Real-world 512×512 images (jeans
texture) with and without defects (figures 1a and 1b) were employed in the learning
process of the MAO classifier. We considered windows of 70×70 pixels with 256 gray
levels, and the parameters of the algorithm were adjusted to obtain high precision and
low response time. These are shown in Tables 1a and 1b.
The algorithm was compared against the Laws [7] and SAC [6] methods, with 40×40
and 64×64 window sizes, respectively. The implementation was made as a C program. Different
images were employed in the test process and in the learning process; in both
cases there were 1,200 images with defects and 1,000 without defects. The results (Table 2)
show that in all cases our algorithm is two orders of magnitude faster than the others.
In addition, the hit rate is close to 90% for texture recognition both with and without
defects (notice that with C-III, the ad-hoc partitioning, it is over 95%). The
conclusion is that it is feasible to implement a real-time system with a high precision
level based on our algorithm. An architectural proposal is made in the next
section.
Table 1. Parameters of the algorithm

Image partitioning:
  C-I:   [1,32] [33,64] [65,96] [97,128] [129,160] [161,192] [193,224] [225,256]
  C-II:  [1,64] [65,128] [129,192] [193,256]
  C-III: [1,80] [81,120] [121,180] [181,210] [211,220] [221,256]
Learning:        Adaptive (q_initial = 50 patterns and q_final < 40 patterns)
Operation mode:  Non-iterative
Distance:        Euclidean
f1 filter:       Euclidean
f2 filter:       Maximum response determination

Algorithm  pixels  hit rate without defects  hit rate with defects  response time
C-I        70x70   92.23%                    87.14%                 0.081 s
C-II       70x70   96.12%                    93.32%                 0.055 s
C-III      70x70   97.81%                    94.42%                 0.070 s
Laws       40x40   93.71%                    64.69%                 1.5 s
SAC        64x64   95.12%                    84.34%                 1.1 s

Table 2. Simulation results
VLSI Design
The gate-level structure of the architecture has been simulated (using a VHDL timing
simulator) to verify the functionality of the TAS. The parameter values were chosen
to match those used during the algorithm testing. The results show that
there is no performance degradation.
We propose CPLD or FPGA chips for the implementation. Using these technologies it
is possible to achieve moderate processing frequencies at low cost. For example,
there is no problem employing clk1 = 30 ns (33 MHz) and clk2 = 100 ns (10 MHz); in
this case the time to process 5-15 MB of image data is 0.15-0.46 s. These
results comply perfectly with the specifications.
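A quick arithmetic check of this claim (ours), assuming one 8-bit pixel is consumed per clk1 cycle:

```python
# Processing-time estimate for the TAS at the suggested clk1 period.
CLK1_S = 30e-9  # clk1 period quoted above (33 MHz)

def processing_time_s(n_bytes):
    """One 8-bit pixel per byte, one pixel per clk1 cycle."""
    return n_bytes * CLK1_S

# 5 MB and 15 MB of image data per second of fabric
print(processing_time_s(5e6), processing_time_s(15e6))  # close to the quoted 0.15-0.46 s range
```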
6. Conclusion
A real-time system for texture analysis has been successfully applied to solve the problem of
defect segmentation in textile fabric. The system comprises a statistical method for
feature extraction and a neural classifier.
The method for the extraction of texture features is based on the Hausdorff dimension,
and its most important properties are: it is easy to compute and it is invariant under
geometrical mappings such as rotation, translation and scaling.
An Associative Neural Model is used as the classifier. In this extension, the neurons have
an output value that is updated at the same time as the neuron weights. From this
output value we can easily calculate the distance between the neuron and the cluster
and obtain the probability that a neuron belongs to a cluster, that is, the probability
with which the system works well.
The system works in real time and produces a correct classification rate of about 96.44%. This
suggests that a system based on a CCD camera inspecting textile fabric with our
pipeline proposal is a good and low-cost tool for texture analysis in the textile industry.
7. References
[1] N.R. Pal and S.K. Pal, A review on image segmentation techniques, Pattern Recognition, Vol. 26,
No. 9, pp. 1277-1294, 1993.
[2] R.M. Haralick, Statistical and structural approaches to texture, Proc. IEEE, Vol. 67, pp. 786-804,
1979.
[3] C. Neubauer, Segmentation of defects in textile fabric, Proc. IEEE, pp. 688-691, 1992.
[4] Hoggar, S.G. Mathematics for Computer Science. Cambridge University Press. 1993.
[5] J.M. Zurada, Introduction to Artificial Neural Systems, West Publishing Company, 1992.
[6] Harwood, D. et al. Texture Classification by Center-Symmetric Auto-Correlation, using Kullback
Discrimination of Distribution. Pattern Recognition Letters, Vol. 16, pp. 1-10. 1995.
[7] Laws, K.Y. Texture Image Segmentation. Ph.D. Thesis. University of Southern California. January
1980.
[8] Francisco Ibarra Picó. Análisis de Texturas mediante Coeficiente Morfológico. Modelado
Conexionista Aplicado. Ph.D. Thesis. Universidad de Alicante. 1995.
[9] Garcia-Chamizo, J.M., Crespo-Llorente, A. (1992) "Orthonormalized Associative Memories".
Proceedings of the IJCNN, Baltimore, Vol. 1, pp. 476-481.
[10] Hao, J., Vandewalle, J. (1992) "A new model of neural associative memory". Proceedings of the
IJCNN92, Vol. 2, pp. 166-171.
[11] Hopfield, J.J. (1984a) "Neural networks and physical systems with emergent collective
computational abilities". Proceedings of the National Academy of Science, Vol. 79, pp. 2554-2558.
[12] Hopfield, J.J. (1984b) "Neural networks with graded response have collective computational
properties like those of two-state neurons". Proceedings of the National Academy of Science, Vol. 81,
pp. 3088-3092.
[13] Ibarra-Picó, F., Garcia-Chamizo, J.M. (1993) "Bidirectional Associative Orthonormalized
Memories". Actas AEPIA, Vol. 1, pp. 20-30.
[14] Kosko, B. (1988a) "Bidirectional Associative Memories". IEEE Trans. on Systems, Man &
Cybernetics, Vol. 18.
[15] Kosko, B. (1988b) "Competitive adaptive bidirectional associative memories". Proceedings of the
IEEE First International Conference on Neural Networks, eds. M. Caudill and C. Butler, Vol. 2, pp.
759-766.
[16] Pao, You-Han. (1989) "Adaptive Pattern Recognition and Neural Networks". Addison-Wesley
Publishing Company, Inc., pp. 144-148.
[17] Wang, Cruz F.J., Mulligan (1990a) "On Multiple Training for Bidirectional Associative Memory".
IEEE Trans. on Neural Networks, 1(5), pp. 275-276.
[18] Wang, Cruz F.J., Mulligan (1990b) "Two Coding Strategies for Bidirectional Associative Memory".
IEEE Trans. on Neural Networks, pp. 81-92.
Effects of Global Perturbations on Learning
Capability in a CMOS Analogue Implementation
of Synchronous Boltzmann Machine
1. INTRODUCTION
X_i^{n+1} = 1 with probability f(V_in) = 1 / (1 + exp(-V_in / T))    (2)

ΔW_ij = η (p_ij⁺ - p_ij⁻)    (3)
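For concreteness, the stochastic update rule (2) can be modelled as below (our sketch of the standard synchronous Boltzmann machine update; it is not a model of the MBAT2 circuit itself):

```python
import math
import random

def fire_probability(v_in, temperature):
    """f(V_in) = 1 / (1 + exp(-V_in / T)), with T the Boltzmann parameter."""
    return 1.0 / (1.0 + math.exp(-v_in / temperature))

def update_neuron(v_in, temperature, rng=random):
    """Set the neuron state to 1 with probability f(V_in), as done in hardware
    by comparing f(V_in) with a uniformly distributed random voltage."""
    return 1 if rng.random() < fire_probability(v_in, temperature) else 0

print(fire_probability(0.0, 0.5))  # 0.5: zero potential gives a fair coin
```

Lowering T sharpens f toward a step function, which is why controlling the Boltzmann (temperature) parameter against physical perturbations matters in the analogue realisation.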
[15] investigated its analogue implementation and [16] realised a
mixed digital/analogue synaptic circuit for a mixed digital/analogue implementation
of [15]. The electronic board realised by [16] includes two main integrated circuits:
the MBAT2 chip and the MBAT11 chip. MBAT2 is an analogue/digital circuit containing 32
neurones and MBAT11 is a mixed analogue/digital synaptic circuit containing 16
synapses. The prototype includes two MBAT11 synaptic chips, one MBAT2 neurone
circuit and some standard control logic. The neurone cell includes:
Figure 1 reproduces the block diagram of a neurone cell and the B.T.C.B.
circuit (Boltzmann Temperature parameter Control Bloc) of the MBAT2 chip. Each
neurone cell of the MBAT2 chip performs the following operations: it collects and adds
up the synaptic currents; then it converts the total synaptic current into a voltage;
it computes the action potential and gives a voltage representation of f(V_in); at
the same time a random number distributed according to the uniform probability
law is generated (as a voltage) by the C.A.R.N. circuit of the neurone cell; finally, the
generated random voltage is compared to the proportional voltage representation of
f(V_in), performing the neurone state update [15, 18].
Let Vp be the S.F.B. input voltage, so that Vp is an electrical representation of
the V_in quantity. Let Kp be the transformation coefficient given in volts⁻¹ (Kp depends on the
electrical realisation of the sigmoidal function S.F.B. bloc) and let I_M be the maximum
value of the total synaptic current (1 µA in the case of the MBAT2 circuit). Finally, let
Rct be the conversion ratio between the current and voltage
representations of the synaptic potential. Thus, the Boltzmann parameter can be obtained
as a function of Rct, Kp and I_M [12]. Rct and Kp depend on structural and physical
parameters.
T = 1 / (Rct · Kp · I_M)    (4)
Considering figure 1, one can remark that the Rct resistor is obtained by a set of
MOS gates. Two control signals perform the Rct variation: VRC, which controls the MOS
transistor current, and S_k (with k ∈ {2, 3, 4}), which selects a different MOS
transistor geometry (realised by different MOS gate areas). Thus the T_0 value of the T
parameter, corresponding to the Rct0 resistor value, is obtained for S2 = S3 = S4 = 0.
Taking into account the Rct structure and the Kp dependence on physical parameters,
the behaviour of the T parameter with physical temperature and supply voltage is
given by (5) [12, 17] (all voltages are referred to a reference offset Vref).
The VRC analogue control line used to adjust the Boltzmann parameter can be
used to compensate temperature effects. Its expression as a function of the
Boltzmann parameter, the physical temperature and the supply voltage is given by
(7).
Figure 2 : Simulated Boltzmann parameter (T) as a function of both ambient temperature (°C) and supply voltage (V).
Figure 3 : Simulated VRC analogue control line as a function of both temperature and supply voltage.
Figures 2 and 3 show simulation results based on the presented model. Figure 2 plots the Boltzmann parameter as a function of both ambient temperature and supply voltage.
Figure 3 gives the simulated VRC analogue control line as a function of both temperature and supply voltage when the Boltzmann parameter value is kept constant. As one can see, this control line varies in a quasi-linear way with physical temperature [17], which makes it easy to use this analogue control input for temperature effect compensation. In the case of supply voltage perturbations, this control line can also be used as an external compensation parameter, even if the behaviour is more non-linear. Figure 4 shows the experimental evolution of the Boltzmann parameter (T) with the physical temperature and the supply voltage when VRC is controlled so as to compensate the temperature and electrical perturbation effects. The VRC behaviour required to achieve this compensation has been computed using the model presented in the previous section. As one can see, the model succeeds in compensating the perturbation effects and stabilizing the Boltzmann parameter.
Figure 4 : Evolution of the Boltzmann parameter (T) with both temperature and supply voltage when VRC is controlled to compensate the perturbation effects.
Figure 5 shows the evolution of the convergence speed of the neural net with the Boltzmann parameter (T). Looking at this figure, one can see that the convergence speed (measuring the learning speed of the network) decreases as the Boltzmann parameter value grows. Indeed, for high values of the Boltzmann parameter, the decision probability law (updating the neurones' output states) approaches a purely random choice.
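This flattening of the decision law can be illustrated with the usual sigmoidal acceptance probability (the exact functional form is our assumption):

```python
import math

def update_probability(delta_e, t_param):
    """Sigmoidal decision law of a stochastic neuron: probability of
    adopting the favoured state, given an energy gap delta_e."""
    return 1.0 / (1.0 + math.exp(-delta_e / t_param))

# For a fixed energy gap, a large Boltzmann parameter T pushes the
# probability toward 0.5: updates become almost random, so learning
# (and hence convergence) slows down.
p_low_t = update_probability(delta_e=1.0, t_param=0.1)    # close to 1
p_high_t = update_probability(delta_e=1.0, t_param=100.0) # close to 0.5
```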
Figure 5 : Convergence speed of the neural net as a function of the Boltzmann parameter (T).
Figure 6 : Synaptic weights value evolution, after the learning phase, as a function of the Boltzmann parameter.
Figure 7 : Synaptic weights values dynamics evolution, after the learning phase, as a function of both ambient temperature and supply voltage variation.
5. CONCLUSION
REFERENCES
[1] C.A. Mead, "Analogue VLSI and neural systems", Addison Wesley, 1989.
[2] R.F. Lyon and C.A. Mead, "An analog electronic cochlea", IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 36, pp. 1119-1134, 1988.
[3] L. Jackel, "Electronic neural networks", in NATO ARW, Neuro-algorithms, architecture and applications, Les Arcs, 1989.
[4] M. Mayoubi, M. Schafer, S. Sinsel, "Dynamic Neural Units for Non-linear Dynamic Systems Identification", From Natural to Artificial Neural Computation, LNCS Vol. 930, Springer Verlag, pp. 1045-1051, 1995.
1 Introduction
affect the functional power of the neurochip. However, it is evident that decreasing area per synapse and extending the functional possibilities of a neuron are primary aims when creating new neurochips.
We suggested a new type of CMOS threshold element (the β-driven threshold element, βDTE) [5,6] and a CMOS artificial learnable neuron based on it [7-10]. βDTE has a noticeable feature: its implementability depends only on the threshold value and does not depend on the number of inputs and their weights. This fact and its low complexity (5 transistors per learnable synapse) make this artificial learnable neuron a very attractive candidate for use in the next generations of digital/analog neurochips.
The goal of this work is the circuit development of a CMOS artificial neuron based on βDTE, and the study of its characteristics and of the dependence of its behavior on the parameters.
2 Controllable βDTE
F = Sign( Σ_{j=1..n} wj·xj − T ),   Sign(A) = 0 if A < 0, 1 if A ≥ 0.   (1)
Fig. 2. β-driven threshold element.
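A behavioural sketch of Eq. (1) (the majority-gate example below is ours, not from the paper):

```python
def beta_dte(x, w, threshold):
    """Beta-driven threshold element, Eq. (1):
    F = Sign(sum_j wj * xj - T), with Sign(A) = 1 for A >= 0, else 0."""
    a = sum(wj * xj for wj, xj in zip(w, x)) - threshold
    return 1 if a >= 0 else 0

# A 3-input majority gate: unit weights, threshold T = 2.
out = beta_dte([1, 1, 0], [1, 1, 1], 2)   # weighted sum 2 >= T, so F = 1
```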
that leads to the circuit of the β-comparator shown in Fig. 3a, where βnj = wj·β; βp = T·β; βn = Σ_{j=0..n-1} βnj·xj.
Fig. 3. β-comparator: CMOS implementation (a); equivalent circuit (b).
3 To construct S it is sufficient to take any hypercube vertex that lies in the separating
hyperplane and to include in S indexes of the variables having the value 1 on the
vertex.
In − Ip = 0
In [5] these equations were analyzed and it was shown that the suggested comparator circuit has sensitivity dVout/dα ≈ −2 V at the point α = βn/βp = 1. Hence, at the threshold level (Vout = Vdd/2) the reaction of the β-comparator to a unit change of the weighted sum is ΔVout ≈ |2/T| V, i.e. it decreases linearly as the threshold grows.
The analysis of the stability of βDTE to parameter variations made in [5] showed that only βDTEs with small thresholds (<3-4) can be stably implemented. However, an artificial neuron is a learnable object, and variations of many parameters (for example, technological ones) can be compensated during the learning.
The learnable artificial neuron based on βDTE [7,8] has a sufficiently simple control over the input weight (Fig. 4): the control voltage changes the current in the synapse.

Fig. 4. β-driven learnable neuron.
does not change the neuron output state. It follows from this that the implementability of βDTE and, hence, of the neuron based on it depends only on the threshold value and not on the number of inputs or the sum of their weights (this fact was established in [5]). The essential aspect is the sensitivity of the β-comparator to the current change at the threshold point. Since the range of the β-comparator output voltage is limited to (0–Vdd), the only way of increasing the β-comparator steepness at the threshold point is to increase the non-linearity of the dependence of the β-comparator output voltage on the ratio α:
Vout = (1 − α + λp·Vdd) / (λp + λn·α),   α = βn/βp

and

dVout/dα = −(λn + λp + λn·λp·Vdd) / (λp + λn·α)²
Let λn = 0.03 V⁻¹ and λp = 0.11 V⁻¹.⁵ It is easy to calculate that for Vout = Vdd/2, α = 1.15. Parameter α is not equal to one at this point since the values of λn and λp are different. When Vdd = 5 V and α = 1.15, dVout/dα = −7.5 V. Thus, the sensitivity of the β-comparator has increased by 3.75 times. The smaller λn and λp are, the higher the sensitivity.
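The figures quoted above can be reproduced directly from the two formulas (a numerical check, with the λn, λp and Vdd values from the text):

```python
def v_out(alpha, lam_n=0.03, lam_p=0.11, vdd=5.0):
    """Beta-comparator output voltage versus alpha = beta_n / beta_p."""
    return (1.0 - alpha + lam_p * vdd) / (lam_p + lam_n * alpha)

def dv_dalpha(alpha, lam_n=0.03, lam_p=0.11, vdd=5.0):
    """Analytic derivative of v_out with respect to alpha."""
    return -(lam_n + lam_p + lam_n * lam_p * vdd) / (lam_p + lam_n * alpha) ** 2

s = dv_dalpha(1.15)   # about -7.5 V, as stated in the text
```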
In the learnable neuron circuit (Fig. 4), every synapse consists of two transistors. The gate of one transistor is fed by the input variable xj; the gate of the other one is fed by the voltage Vcj that controls the variable weight (the current in the synapse).

5 The values of these parameters were found from the used transistor models.
Let us first consider the lower part of the neuron's β-comparator, where the synapse currents are summed. Let us replace the pairs of transistors that form the synapses by equivalent transistors with the characteristics shown in Fig. 5. These characteristics were obtained by SPICE simulation.
Fig. 5. Characteristics of the equivalent synapse transistors for different control voltages (SPICE simulation).
To the left of the mode switching line, the transistors are in the non-saturated mode; to the right, in the saturated mode. We can see from these characteristics that when Vout = 2.5 V, the equivalent transistors are in the saturated mode if the control voltage Vc < 2.5 V and in the non-saturated mode if Vc > 2.5 V. Thus, the saturated mode condition restricts the range over which the control voltage can change. Breaking this restriction decreases the output signal of the comparator because the currents are re-distributed among the synapses.
Indeed, let the smallest weight correspond to the synapse current Imin, and let adding this current to the total current of the other synapses cause the switching of the neuron. If the synapse with the biggest current is not saturated, the decrease of Vout caused by the total current increase makes the current of this synapse smaller. The currents of the other non-saturated synapses also decrease. As a result, the total current increases by a value which is considerably smaller than Imin. This decreases the output signal of the comparator.
The range in which the control voltages of the synapses change can be extended if an extra n-channel transistor is incorporated into the circuit as shown in Fig. 6. The gate of this transistor is fed by a voltage Vref1 such that when the current provides Vout ≈ Vdd/2, the transistor is saturated under the action of the voltage Vgs = Vref1 − VΣ. Increasing the total current through the synapses by adding the synapse with the smallest current makes VΣ smaller, so that Vgs becomes bigger. The extra transistor opens and the extra increase of the total current compensates the change in VΣ. Thus, due to the negative voltage feedback, the extra transistor stabilizes VΣ and therefore stabilizes the currents
Fig. 6. Modified β-comparator.
Fig. 8. Curve 1 - dependency Vout(I) when the comparator has one p-channel transistor; Curve 2 - dependency Vout(I) when the comparator has two p-channel transistors; Curve 3 - dependency VdM1(I).
is obviously not sufficient for good stabilization of the threshold value of the current.
In the modified β-comparator circuit (Fig. 6), the p-channel part of the comparator consists of two transistors M1 and M2 referenced by voltages Vref2 and Vref3 respectively. These reference voltages are selected so that, as the comparator current grows, transistor M1 gets saturated first, followed by M2. The dependence of the voltage VdM1 at the drain of M1 on the current is shown in Fig. 8 (Curve 3). As soon as M1 comes into the saturation zone, the voltage Vgs of M2 begins to change with higher speed because Vgs = Vref3 − VdM1. The voltage drop on M2 sharply grows, increasing the steepness of Vout(I) (Curve 2 in Fig. 8).
Fig. 9. Comparator characteristics: curve 1 for the old comparator; curve 2 for the new one.
In the second series of experiments, for fixed parameters of the comparator, we tried to find in what range of threshold voltages there existed control voltages on the synapses that provided min ΔVout > 100 mV. In other words, we tried to find in what range the threshold Vth of the output amplifier may change (for example, because of technological parameter variations). This range can be compensated during the learning. The results are given in the third column of Table 1. During the learning, the neuron can be adjusted to any threshold of the output amplifier from these ranges.
The other experiments were associated with the question: with what precision should we maintain the voltages for normal functioning of the neuron after the learning?
First of all, we were interested in the stability of the neuron against supply voltage variations. With constant values of the reference voltages and changes of the supply voltage of ±0.1% (±5 mV), the dependence of the output voltage Vout on the currents flowing through the p-transistors of the comparator shifts along the current axis by ±1.5%, as shown in Fig. 10. For neuron F12, the current in the
Fig. 10. Behavior of the dependency Vout(Ip) when the voltage Vdd changes in the interval ±0.1%.
working point is about 233·Imin; 1.5% of this value is 3.5·Imin, i.e. the shift of the characteristic is 3.5 times more than the minimum current of the synapse. Evidently, the neuron will not function properly when the working current changes like that.
On the other hand, taking into account the way the reference voltages are produced, it is natural to assume that they must change proportionally to the changes of the supply voltage. The effect of the reference voltage change opposes the effect of the supply voltage change, partially compensating it. The experiments carried out under these conditions showed that the learned neurons F10, F11 and F12 can function properly in the respective ranges of supply voltage change shown in the fourth column of Table 1. To fix the borders of the ranges, the following condition was used: the signal ΔVout should be above or below the output amplifier threshold by a value of not more than 50 mV.
The control voltages of the synapses were set up with an accuracy of 1 mV. With what accuracy should they be maintained after the learning? Evidently, the neuron will not function properly if, with the same threshold of the output amplifier, the total current of the synapses drifts by Imin/2 in one direction or the other. Experiments were conducted in which we determined the permissible range in which the control voltage δVc of one of the synapses (the one with the minimum or maximum current) can change while the control voltages of the other synapses are constant. The condition for fixing the range borders was the same as in the previous series of experiments. The obtained results are given in Table 2.
Table 2. Results of SPICE simulation.
5 Conclusion
The suggested neuron with the improved β-comparator has a number of attractive features. It is very simple for hardware implementation and can be implemented in CMOS technology. Its β-comparator has a very high sensitivity, providing a minimum comparator output signal as small as 325 mV for a threshold value as big as T = 233. Its implementability does not depend on the sum of the input weights, being determined only by the threshold value. Such a neuron can perform very complicated functions, for example, all logical threshold functions of 12 variables. There is no doubt that it can be taught any threshold function of 12 variables, because the dispersions of all technological and functional parameters of the circuit are compensated during the learning.
The drawback of the suggested neuron is its very high demands on the stability of the supply voltage after the learning. This drawback seems to be peculiar to all circuits with high resolution, for example, digital-to-analog and analog-to-digital converters. If these demands cannot be met during the interval of neuron functioning, one should reduce the threshold value until they are met, or carry out additional research to study whether it is possible to compensate the influence of an unstable supply voltage.
This work does not deal with the problems of teaching the neuron threshold logical functions and of maintaining it in the learned state. These issues are of special interest and should be the object of separate research.
References
1 Introduction
Artificial neural network models are widely used for the design of adaptive, intel-
ligent systems since they offer an attractive property: the capability of learning
in order to solve problems from examples. These models achieve good perfor-
mance via massively parallel networks composed of non-linear computational
elements, often referred to as units or neurons. A value, referred to as activity
(or activation value) is associated with each neuron. Similarly, a synaptic weight
is associated with each connection between neurons. A neuron's activity depends
on the activity of the neurons connected to it and the weights. Each neuron com-
putes the weighted sum of its inputs. This value is called net input. The activity
is obtained by the application of an activation function (e.g. sigmoid, gaussian
or linear function) to the net input.
Many network architectures have been described in the literature. Multilayer perceptrons, which are used in our project, are composed of several layers of neurons: an input layer simply holding input signals, one or more hidden layers of neurons and an output layer, from which the response comes. Connections are only possible between two adjacent layers. Let us further introduce some notation. Nm designates the number of neurons in layer m. w_{ni,mj} is the weight between neuron i in layer n and neuron j in layer m. a^p_{m,j} = φ(h^p_{m,j}) denotes the activity of neuron j in layer m for pattern p, where φ is the activation function. Finally, h^p_{m,j} = Σ_{i=1..N_{m−1}} w_{(m−1)i,mj} · a^p_{m−1,i} is the net input. Other interconnection schemes include recurrent or competitive networks.
The Backpropagation rule [1] is perhaps the most popular supervised algorithm for multilayer perceptrons. It iteratively computes the values of the weights using a gradient descent algorithm. (1) Initially, all weights are initialized to small random values. (2) An input vector is then presented and propagated layerwise through the network. (3) We compute the error signal, (4) which is back-propagated through the network. This process allows errors to be assigned to hidden neurons. (5) Finally, the computed errors and neuron activities determine the weight change. Steps (2) to (5) are carried out for all training vectors. This process is repeated until the output error signal falls below a given threshold. Supervised algorithms solve a wide range of complex problems including image processing, speech recognition or prediction of stock prices.
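The steps above can be sketched for a minimal one-hidden-layer perceptron (an illustrative Python model without biases; the learning rate, initial weights and squared-error criterion are our choices):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(x, target, w1, w2, lr=0.5):
    """One backpropagation iteration over steps (2)-(5)."""
    # (2) forward propagation, layer by layer
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    y = sigmoid(sum(w * hi for w, hi in zip(w2, h)))
    # (3) error signal at the output
    d_out = (y - target) * y * (1.0 - y)
    # (4) back-propagate the error to the hidden neurons
    d_hid = [d_out * w2[j] * h[j] * (1.0 - h[j]) for j in range(len(h))]
    # (5) weight change from errors and activities
    new_w2 = [w2[j] - lr * d_out * h[j] for j in range(len(h))]
    new_w1 = [[w1[j][i] - lr * d_hid[j] * x[i] for i in range(len(x))]
              for j in range(len(w1))]
    return new_w1, new_w2, (y - target) ** 2

# Repeating the step drives the output error down.
w1, w2 = [[0.1, 0.2], [0.3, 0.4]], [0.5, -0.5]
errors = []
for _ in range(50):
    w1, w2, e = train_step([1.0, 0.0], 1.0, w1, w2)
    errors.append(e)
```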
In reinforcement learning, the system tries an action on its environment and
receives an evaluative reward, indicating whether the action is right or wrong.
Reinforcement learning algorithms try to maximize the received reward over
time. Such algorithms are efficient in autonomous robotics.
There is no teacher available in unsupervised learning. These algorithms try
to cluster or categorize input vectors. Similar inputs are classified within the
same category and activate the same output unit. Applications of unsupervised
learning include data compression, density approximation or feature extraction.
Equation 2 drives the synaptic coefficients to zero. Weights are removed when they decrease below a given threshold. Pruning connections sometimes leads to a situation where some neurons have no more inputs or outputs. Such neurons, called dead units, can be deleted.
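A software sketch of decay, pruning and dead unit removal (the multiplicative decay form and the simplified dead-unit criterion, a unit with no surviving connection at all, are our assumptions):

```python
def prune_network(weights, decay=0.1, threshold=0.05):
    """Apply one decay step to every synaptic coefficient, prune the
    weights that fell below the threshold, and list dead units.
    `weights` maps (source, destination) unit ids to coefficients."""
    surviving = {}
    for edge, w in weights.items():
        w *= (1.0 - decay)            # forgetting term driving w to zero
        if abs(w) >= threshold:       # pruning rule
            surviving[edge] = w
    connected = {unit for edge in surviving for unit in edge}
    every = {unit for edge in weights for unit in edge}
    dead = sorted(every - connected)  # units with no live connection left
    return surviving, dead

kept, dead = prune_network({(0, 2): 0.5, (1, 2): 0.05, (1, 3): 0.04})
```

Here the two weights leaving unit 1 decay below the threshold, so unit 1 and its now-isolated successor 3 are reported as dead.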
3 Hardware Implementation
Though software simulations are essential when one sets about studying a new algorithm, they cannot always fulfill the real-time criteria required by some practical applications. In order to exploit the inherent parallelism of artificial neural networks, hardware implementations are essential.
Analog implementations allow the design of extremely fast and compact low-power circuits. This approach has been successful in the design of signal processing neural networks, like the Hérault-Jutten model [4], or bio-inspired systems like silicon retinas [5]. The main drawback of analog circuits lies perhaps in their limited accuracy. Consequently, they cannot implement the backpropagation algorithm, which requires a resolution from 8 bits to more than 16 bits [6], depending on several factors, such as the complexity of the problem to be solved.
Among the many digital neuroprocessors described in the literature, we distinguish two main design philosophies. The first approach involves the design of a highly parallel computer and a programming language dedicated to neural networks. It allows the implementation of multiple algorithms in the same environment. [7] gives an interesting overview of different academic and commercial systems. However, programming such computers is often arduous.
The second approach involves the design of a specialized chip for a given algorithm, thus avoiding the tedious programming task. [8] describes such circuits and presents the benefits of this method: "resource efficiency in respect to speed, compactness and power consumption". However, the main drawback lies in the need for a different hardware device for each algorithm.
Besides the analog and digital approaches, the literature describes other design paradigms. Let us mention two examples:
• F. N. Sibai and S. D. Kulkarni have proposed a neuroprocessor combining digital weight storage and analog processing [9].
• The optical neural network paradigm "promises to enable the design of highly parallel, analog-based computers for applications in sensor signal processing and fusion, pattern recognition, associative memory, and robotic control" [10]. The VLSI implementation of a fully connected network of N units requires area O(N²). Optics allows the implementation of connections in a third dimension, reducing the chip area to O(N). However, additional research is indispensable to embed such a system in a small chip.
4 Reconfigurable Hardware
(Figure: reconfigurable system and its configuration database, storing one circuit per algorithm phase: forward propagation, error computation, backward propagation, weight update.)
Fig. 3. (a) Delay of an on-line operator. (b) Pipeline with on-line operators.
Many on-line operators are described in the literature. Each of them is characterized by a delay δ, which indicates the number of clock cycles required to compute the MSB of the result. Figure 3a depicts the schedule diagram of an operator of delay 2. An input signal i is provided at time t = 0. Two clock cycles are required to elaborate the MSB of the output signal o. A new bit of the result is then produced at each clock cycle.
We have designed a VHDL library of on-line operators dedicated to our neuroprocessor. Table 1 summarizes the available operators and their delays. Notice that it is possible to implement multipliers and squarers of delay 2; however, the size of these operators depends on the length of the operands [14]. The implementation of activation functions is obtained with Horner's scheme.
Suppose now that the result of a first operator F is needed for further computations. The second operation may begin as soon as F has generated the MSB of the partial result. Figure 3b shows how to chain different on-line operators. The addition and the first multiplication are carried out in parallel. As the delay of the adder is smaller than that of the multiplier, we use a register to synchronize the inputs of the second multiplier.
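The delay bookkeeping for such a chain can be sketched as follows (the individual operator delays below are assumed for illustration, with the adder faster than the multiplier as in the example above):

```python
def pipeline_delay(stages):
    """On-line delay of a chain of operator stages.

    Each stage is the list of delays (in clock cycles) of the operators
    working in parallel in that stage; the slowest branch decides when
    the next stage can start, and every faster branch needs
    (max - delta) synchronization registers."""
    total, registers = 0, 0
    for deltas in stages:
        slowest = max(deltas)
        total += slowest
        registers += sum(slowest - d for d in deltas)
    return total, registers

# Fig. 3b-like chain: an adder (delta = 2) in parallel with a
# multiplier (delta = 3), feeding a second multiplier (delta = 3).
total, regs = pipeline_delay([[2, 3], [3]])
```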
When designing the hardware architecture of our neural network, we first observed that a time-multiplexed interconnection scheme provides a good trade-off between speed and scalability (Fig. 4). The main idea is to connect all the outputs of hidden layer m and the inputs of hidden (or output) layer m + 1 to a common bus; the same hardware is reused for all layers of the network. The multiplexor allows the network to be fed with an input signal or with the activation value of a hidden unit. The neuroprocessor is basically made of FPGAs, each of them embedding N neurons (this number depends on the FPGA family). Furthermore, each FPGA is connected to its own memory. Memory module m stores the synaptic weights between all neurons implemented by FPGA m and their inputs (Fig. 4).
We will focus herein on the forward propagation of a signal (the backward propagation obeys the same principles). The first neuron in layer m places its activity a^p_{m,1} on the bus. All neurons in layer m + 1 read it, multiply it by the appropriate synaptic weight, and finally store the result. This process is sequentially repeated for every neuron in layer m. Each processing element in layer m + 1 accumulates the results of the successive multiplications. Due to this interconnection scheme, each neuron is a very simple processing element.
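The bus protocol can be modelled in a few lines (a functional Python model of the sequential broadcasts, not of the FPGA datapath):

```python
def forward_layer(activities, weights):
    """Time-multiplexed forward propagation between layers m and m+1.

    Each neuron of layer m places its activity on the shared bus in
    turn; every neuron of layer m+1 reads the broadcast value,
    multiplies it by its own synaptic weight and accumulates the
    product, one multiply-accumulate per bus cycle.
    `weights[j][i]` is the weight from neuron i (layer m) to neuron j."""
    acc = [0.0] * len(weights)
    for i, a in enumerate(activities):      # sequential bus broadcasts
        for j, row in enumerate(weights):   # all PEs listen in parallel
            acc[j] += row[i] * a
    return acc

net = forward_layer([1.0, 2.0], [[0.5, 0.5], [1.0, -1.0]])
```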
Fig. 4. General architecture of a reconfigurable neuroprocessor.
The FIFO allows the synchronization of the inputs of the binomier. A special bit, called the Pruning bit, is associated with each weight and indicates whether the connection has been pruned (in which case the bit is set to 0) or not. This information is stored in a flip-flop and combined with an input of the on-line operator.
Fig. 5. (a) Processing element with its Pruning bit logic. (b) Dead unit detection mechanism.
We now have to provide our neuroprocessor with a means for detecting dead units. This problem is solved by the simple mechanism depicted in Fig. 5b. Assume that a neuron j in layer m has no more inputs. All w_{(m−1)i,mj} coefficients are loaded when a signal is forward-propagated through the network. As the Pruning bits associated with the w_{(m−1)i,mj} are set to zero, the flip-flop output remains zero as well. Consider now a neuron with no outputs. The backward-propagation process involves all weights w_{mj,(m+1)k} whose Pruning bits are equal to zero. Consequently, the detection of such dead units occurs during this step. Once a dead unit has been detected, a signal is sent to a global controller which manages the network topology. As the activities of the neurons are sequentially placed on the bus, the deletion of dead units increases the learning speed.
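A behavioural model of this detection mechanism (the data layout is our assumption; a Pruning bit of 1 marks a live weight, 0 a pruned one):

```python
def detect_dead_units(in_bits, out_bits):
    """Dead unit detection with per-weight Pruning bits.

    For every unit, a flip-flop ORs the Pruning bits of its input
    weights during forward propagation, and of its output weights
    during backward propagation; a unit whose flip-flop never rises
    in one of the two passes has lost all its connections on that
    side and is reported as dead."""
    dead = []
    for unit in in_bits:
        has_input = any(in_bits[unit])     # OR over the forward pass
        has_output = any(out_bits[unit])   # OR over the backward pass
        if not (has_input and has_output):
            dead.append(unit)
    return dead

dead = detect_dead_units({"h1": [1, 0], "h2": [0, 0]},
                         {"h1": [1], "h2": [1]})
```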
6 Conclusions
This paper has presented an attractive paradigm for the design of neuropro-
cessors. The reconfigurable approach allows the implementation of circuits ded-
icated to different algorithms on the same board. Furthermore, it leads to an
References
1. B. Widrow and M. A. Lehr. 30 Years of Adaptive Neural Networks: Perceptron, Madaline and Backpropagation. Proc. IEEE, 78(9):1415-1442, September 1990.
2. Russell Reed. Pruning Algorithms - A Survey. IEEE Transactions on Neural Networks, 4(5):740-747, September 1993.
3. Masumi Ishikawa. Structural Learning with Forgetting. Neural Networks, 9(3):509-521, 1996.
4. J. Hérault and C. Jutten. Réseaux neuronaux et traitement du signal. Hermès, 1994.
5. C. Mead. Analog VLSI and Neural Systems. Addison-Wesley, May 1989.
6. Shigeo Sakaue, Toshiyuki Kodha, Hiroshi Yamamoto, Susumu Maruno, and Yasuharu Shimeki. Reduction of Required Precision Bits for Back-Propagation Applied to Pattern Recognition. IEEE Transactions on Neural Networks, 4(2):270-275, March 1993.
7. Paolo Ienne. Digital Connectionist Hardware: Current Problems and Future Challenges. In José Mira, Roberto Moreno-Díaz, and Joan Cabestany, editors, Biological and Artificial Computation: From Neuroscience to Technology, pages 688-713. Springer, 1997.
8. Ulrich Rückert and Ulf Witkowski. Silicon Artificial Neural Networks. In L. Niklasson, M. Bodén, and T. Ziemke, editors, ICANN 98, Perspectives in Neural Computing, pages 75-84. Springer, 1998.
9. Fadi N. Sibai and Sunil D. Kulkarni. A Time-Multiplexed Reconfigurable Neuroprocessor. IEEE Micro, 17(1):58-65, 1997.
10. B. Keith Jenkins and Armand R. Tanguay, Jr. Optical Architectures for Neural Network Implementations. In Michael A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 673-677. The MIT Press, 1995.
11. Jean-Luc Beuchat and Eduardo Sanchez. A Reconfigurable Neuroprocessor with On-chip Pruning. In L. Niklasson, M. Bodén, and T. Ziemke, editors, ICANN 98, Perspectives in Neural Computing, pages 1159-1164. Springer, 1998.
12. Kishor S. Trivedi and Milos D. Ercegovac. On-line Algorithms for Division and Multiplication. IEEE Transactions on Computers, C-26(7), July 1977.
13. Algirdas Avizienis. Signed-Digit Number Representations for Fast Parallel Arithmetic. IRE Transactions on Electronic Computers, 10, 1961.
14. Jean-Claude Bajard, Jean Duprat, Sylvanus Kla, and Jean-Michel Muller. Some Operators for On-Line Radix-2 Computations. Journal of Parallel and Distributed Computing, 22:336-345, 1994.
15. Eduardo Sanchez, Moshe Sipper, Jacques-Olivier Haenni, Jean-Luc Beuchat, André Stauffer, and Andrés Perez-Uribe. Static and Dynamic Configurable Systems. To appear.
16. Stephen M. Scalera, John J. Murray, and Steve Lease. A Mathematical Benefit Analysis of Context Switching Reconfigurable Computing. In José Rolim, editor, Parallel and Distributed Processing, number 1388 in Lecture Notes in Computer Science, pages 73-78. Springer, 1998.
Digital Implementation of Artificial Neural Networks:
From VHDL Description to FPGA Implementation
Abstract
This paper deals with a top-down design methodology for artificial neural networks (ANNs), based upon a parametric VHDL description of the network. To succeed early in the design process, a highly regular architecture was adopted. Then, the parametric VHDL description of the network was realized. The description has the advantage of being generic and flexible, and can easily be changed on user demand. To validate our approach, an ANN for electrocardiogram (ECG) arrhythmia classification was passed through a synthesis tool, GALILEO, for FPGA implementation.
Key words
ANN, top down design, VHDL, parametric description, FPGA implementation.
Introduction
Engineers have long been fascinated by how efficiently and how fast biological neural networks are capable of performing complex tasks such as recognition. Such networks are capable of recognizing input data from any of the five senses with the accuracy and speed necessary to allow a living creature to survive. Machines which perform such complex tasks with similar accuracy and speed were difficult to implement until the technological advances of VLSI circuits and systems in the late 1980's [1]. Since then, VLSI implementation of artificial neural networks (ANNs) has witnessed exponential growth. Today, ANNs are available as microelectronics components.
The benefit of using such implementations is well described in a paper by R. Lippmann [2]: "The great interest of building neural networks remains in the high speed processing that could be provided through massively parallel implementation". In [3], P. Treleaven and others have also reported that the important design issues for VLSI ANNs are parallelism, performance, flexibility and their relationship to silicon area. To cope with these properties, [3] reported that a good VLSI ANN should exhibit the following architectural properties:
• Design simplicity, leading to an architecture based on copies of a few simple cells.
• Regularity of the structure, which reduces wiring.
• Expandability and design scalability, allowing a complete system to be built by packing a number of processing units on a chip and interconnecting many chips.
Historically, the development of VLSI implementations of artificial neural networks has been widely influenced by developments in technology as well as in VLSI CAD tools.
Fig. 1. (a) Biological neuron model. (b) Artificial neuron model. (c) Three-layer artificial neural network.
The ANN computation can be divided into two phases: the learning phase and the recall phase.
The learning phase performs an iterative updating of the synaptic weights based upon the error
back-propagation algorithm [2]; it teaches the ANN to produce the desired output for a set
of input patterns. The recall phase computes the activation values of the neurons up to the
output layer according to the weights computed in the learning phase.
Mathematically, the function of the processing elements can be expressed as:

    s_j^l = f( Σ_i w_ij^l · s_i^(l-1) + θ )        (1)

where w_ij^l is the real-valued synaptic weight between element i in layer l-1 and element j in layer
l, s_i^(l-1) is the current state of element i in layer l-1, θ is the bias value, and f is the
activation function.
It must be mentioned that our aim is to implement the recall phase of a neural network
which has been previously trained on a standard digital computer, where the final synaptic
weights are obtained, i.e. "off-chip training".
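As an illustration of the recall-phase computation of equation (1), a short floating-point sketch follows. This is a Python sketch for clarity only: the hardware itself uses fixed-point MAC units and a look-up table for the activation, and the 5-3-2 topology and random weights here are placeholders standing in for weights obtained off-chip.

```python
import numpy as np

def recall(x, weights, biases):
    """Recall phase: propagate an input vector layer by layer using
    s_j = f(sum_i w_ij * s_i + theta_j), with a sigmoid activation f
    (realised as a LUT in the hardware implementation)."""
    s = np.asarray(x, dtype=float)
    for W, theta in zip(weights, biases):
        s = 1.0 / (1.0 + np.exp(-(W @ s + theta)))  # MAC + activation LUT
    return s

# Example: a 5-3-2 network with placeholder "off-chip trained" weights.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 5)), rng.standard_normal((2, 3))]
biases = [np.zeros(3), np.zeros(2)]
out = recall(np.ones(5), weights, biases)
```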
III. Design methodology
The proposed approach for the ANN implementation follows a top-down design
methodology. As illustrated in Fig. 2, an architecture is first fixed for the ANN. This phase
is followed by the VHDL description of the network at the register transfer level (RTL)
[8], [13]. This VHDL code is then passed through a synthesis tool which performs logic
synthesis and optimization according to the target technology. The result is a netlist ready
for place and route using an automatic FPGA place and route tool. At this level verification
is required before final FPGA implementation.
In the following sections the digital architecture of the ANN will be derived, then the
proposed parametric VHDL description. Synthesis results, placement and routing will be
discussed through an application.
In Fig. 5(a) the VHDL description of the neuron is illustrated. Fig. 5(b) illustrates the
layer description. Fig. 5(c) illustrates the network description.
First, a VHDL description of the MAC circuit, the ROM and the LUT memories was
done. In order to achieve flexibility, the word size (nb_bits) and the memory depths
(nb_addr and nb_add) are kept as generic parameters (Fig. 5(a)).
Second, a VHDL description of the neuron was achieved. The parameters that introduce
the flexibility of the neuron are the word size (nb_bits) and the component instantiation. A
designer can change the performance of the neuron by choosing other pre-described
components stored in a library, without changing the VHDL description of the
neuron (Fig. 5(a)).
Third, a layer is described. The parameters that introduce the design flexibility and
genericity of the layer are the word size (nb_bits) and the number of neurons (nb_neuron).
The designer can easily modify the number of neurons in a layer with only small
modifications of the layer's VHDL description (Fig. 5(b)).
Finally, a VHDL description of the network is achieved. The parameters that introduce
the flexibility of the network are the neurons' word sizes (n), the number of neurons in
each layer (nb_neuron) and the component instantiation of each layer (component layer5,
component layer3 and component layer2). The designer can easily modify the size of the
network simply by making small changes in the layer descriptions. The designer can also
change the performance of the network simply by using other pre-designed layers (Fig.
5(c)).
(a)

entity neuron is
  generic (nb_bits : integer);                     -- word size
  port (in_neur  : in  unsigned(nb_bits-1 downto 0);
        out_neur : out std_logic_vector(nb_bits-1 downto 0);
        read_en, rst, clk, ready : in std_logic);
end neuron;

architecture neuron_description of neuron is
  component MAC
    generic (nb_bits : integer);
    port (x, w : in std_logic_vector(nb_bits-1 downto 0);
          clk, rst : in std_logic;
          q : out std_logic_vector(2*nb_bits-1 downto 0));
  end component;
  component ROM
    generic (nb_addr : integer; nb_bits : integer);
    port (addr    : in unsigned(nb_addr-1 downto 0);
          out_rom : out std_logic_vector(nb_bits-1 downto 0);
          read_en : in std_logic);
  end component;
  component LUT
    generic (nb_addr : integer; nb_bits : integer);
    port (addr    : in std_logic_vector(nb_bits-1 downto 0);
          out_lut : out std_logic_vector(2*nb_bits-1 downto 0);
          read_en : in std_logic);
  end component;
begin
  rom_weight : ROM generic map () port map (read_en, addr, w);
  mult_acc   : MAC generic map () port map (x, w, clk, rst, q);
  result     : LUT generic map () port map (read_en, q, out_lut);
end neuron_description;

(b)

entity layer_n is
  generic (nb_neuron : integer; nb_bits : integer);
  port (input_layer1 : in unsigned(nb_bits-1 downto 0);
        input_layer2 : in std_logic_vector(nb_bits downto 0);
        clk, rst, ready, read_en1 : in std_logic;
        output_layer1, ..., output_layer_n : out
            std_logic_vector(2*nb_bits+1 downto 0));
end layer_n;

architecture layer_description of layer_n is
  component neuron
    port (in_neur  : std_logic_vector(nb_bits-1 downto 0);
          out_neur : out std_logic_vector(nb_bits-1 downto 0);
          read_en, rst, clk, ready : in std_logic);
  end component;
begin
  neuron_n : neuron generic map ()
    port map (input_layer1, input_layer2, clk, rst, ready,
              read_en1, output_layer1, ..., output_layer_n);
end layer_description;

(c)

entity network is
  generic (n, n1, n0 : integer);
  port (X1, X2, X3, X4, X5 : in std_logic_vector(n downto 0);
        ad  : in unsigned(n1 downto 0);
        ad1 : in unsigned(n1 downto 0);
        ad2 : in unsigned(n1 downto 0);
        clk, rst, ready1, read_en : in std_logic;
        n132, n232 : out std_logic_vector(2*n+1 downto 0));
end network;

architecture network_description of network is
  component layer1
    generic (nb_neuron : integer; n1 : integer);
    port (X1, X2, X3, X4, X5 : in std_logic_vector(n1 downto 0);
          ad : in unsigned(n1 downto 0);
          s1 : in std_logic_vector(n0 downto 0);
          clk, rst, ready, read_en : in std_logic;
          n13, n23, n33, n43, n53 : out std_logic_vector(2*n1+1 downto 0));
  end component;
  component layer2
    generic (nb_neuron : integer; n1 : integer);
    port (X1, X2, X3, X4, X5 : in std_logic_vector(n1 downto 0);
          ad1 : in unsigned(n1 downto 0);
          s2 : in std_logic_vector(n0 downto 0);
          clk, rst, ready, read_en : in std_logic;
          n13, n23, n33 : out std_logic_vector(2*n1+1 downto 0));
  end component;
  component layer3
    generic (nb_neuron : integer; n1 : integer);
    port (X1, X2, X3 : in std_logic_vector(n1 downto 0);
          ad2 : in unsigned(n1 downto 0);
          s3 : in std_logic_vector(n0 downto 0);
          clk, rst, ready, read_en : in std_logic;
          n132, n232 : out std_logic_vector(2*n1+1 downto 0));
  end component;
begin
  layer_5 : layer1 generic map ()
    port map (s1, X1, X2, X3, X4, X5, rst, clk, ready, read_en,
              ad, n13, n23, n33, n43, n53);
  layer_3 : layer2 generic map ()
    port map (s2, X1, X2, X3, X4, X5, clk, rst, ready, read_en,
              ad1, n13, n23, n33);
  layer_2 : layer3 generic map ()
    port map (s3, X1, X2, X3, clk, rst, ready, read_en,
              ad2, n132, n232);
end network_description;
Fig. 5. Parametric VHDL description. (a): Neuron description. (b): Layer description. (c): Network
description.
Fig. 7. (a): ANN input- output connections. (b): Functional simulation results of the (5-3-2) ANN.
Fig. 8. Galileo synthesis results. Fig. 9. Top view of the ANN FPGA structure.
References
[1] M. I. Elmasry, "VLSI Artificial Neural Networks Engineering", Kluwer Academic
Publishers.
[2] Richard P. Lippmann, "An Introduction to Computing with Neural Nets", IEEE
ASSP Magazine, pp. 4-22, April 1987.
[3] Philip Treleaven, Marco Pacheco and Marley Vellasco, "VLSI Architectures for
Neural Networks", IEEE MICRO, pp. 8-27, December 1989.
[4] Y. Arima, K. Mashiko, K. Okada, "A Self-Learning Neural Network Chip with 125
Neurons and 10K Self-Organization Synapses", Symposium on VLSI Circuits, pp. 63-
64, IEEE, 1990.
[5] H. Ossoing, "Design and FPGA-Implementation of Neural Networks", ICSPAT'96,
pp. 939-943.
[6] Charles E. Cox and W. Ekkehard Blanz, "GANGLION - A Fast Field Programmable
Gate Array Implementation of a Connectionist Classifier", IEEE JSSC, Vol. 27, No.
3, pp. 288-299, March 1992.
[7] R. Airiau, J. M. Berge, V. Olive, J. Rouillard, "VHDL: du langage a la
modelisation", Presses Polytechniques et Universitaires Romandes et CNET-ENST.
[8] R. Airiau, J. M. Berge, V. Olive, "Circuit Synthesis with VHDL", Kluwer Academic
Publishers.
[9] Daniel Gajski, Nikil Dutt, Allen Wu, Steve Lin, "High Level Synthesis - Introduction
to Chip and System Design", Kluwer Academic Publishers.
[10] XACT user manual.
[11] M. S. Ben Romdhane, V. K. Madisetti and J. W. Hines, "Quick-Turnaround ASIC
Design in VHDL: Core-Based Behavioral Synthesis", Kluwer Academic Publishers.
[12] N. Izeboudjen and A. Farah, "A New Neural Network System for Arrhythmias
Classification", NC'98, International ICSC/IFAC Symposium on Neural
Computation, Vienna, September 23-25, pp. 208-212.
[13] GALILEO HDL Synthesis Manual.
Hardware Implementation Using DSPs of the
Neurocontrol of a Wheelchair
P. Martin, M. Mazo, L. Boquete, F.J. Rodriguez, I. Fernández, R. Barea, J.L. Lázaro,
Abstract. This paper describes the implementation of a neural network control system for
guiding a wheelchair, using an architecture based on a digital signal processor (DSP). The
control algorithm is based on a radial basis function network model, the main advantage of which
is learning speed. The hub of the algorithm architecture is a Texas Instruments DSP
(TMS320C31). The board has complete autonomy of action and is specially designed for
executing control algorithms in real time. The wheelchair prototype forms part of the SIAMO
project, currently being developed by the Electronics Department of the University of Alcalá.
The stability conditions are obtained for the correct functioning of the system and various
simulations are conducted to verify the correct functioning of the system when governing the
output of the wheelchair.
I. Introduction
In the field of wheelchairs to aid the mobility of handicapped persons and/or the elderly, there
is a need for systems affording users smooth, safe driving conditions with a quick response
(above all in situations that are especially dangerous for the user). An important aspect of this aim
is the development of a control system able to respond to the user's commands in the shortest
possible time and with the greatest accuracy. Two requisites to this end are high-performance
hardware and reliable control algorithms governed by a real-time operating system.
A recent computer search revealed that in the period between 1990 and 1995, 9,955 papers
were published in which the words "neural network" appear [Narendra, 1996]. 8,000 of them
deal with the approximation of functions and recognition of patterns (static systems).
Approximately 1,960 papers discuss control problems, only 353 of which deal with their
possible applications, and within this group, 45% are theoretical. Of the rest, 28% are
computer-based simulations and only 14 papers deal with real applications (it may safely be said
that many industrial applications are not published for reasons relating to patent rights). This
paper presents a practical case in which a wheelchair is guided by using an inverse control
system, where the neurocontroller must generate the control signals which allow the output of
the controlled plant to be the appropriate one. One of the main advantages of using a neural
network as a controller is that neural networks are universal function approximators which learn
on the basis of examples and may be immediately applied in an adaptive control system, due to
their capacity to adapt in real time.
When using an inverse control system in which the controller is a neural network, the problem
is how to propagate the control error to the adjustable coefficients of the neurocontroller in
such a way that the latter vary in the right direction, so that the error is reduced. In short, the
problem is how to obtain the sensitivity of each plant output with respect to each input. This
problem has been solved in different ways: thus, [Hunt & Sbarbaro, 1991], [Ku & Lee 1995],
[Noriega & Wang 1998] and [Boquete et al, 1998] use a neuroidentifier in parallel with the
physical system to be controlled which serves as a path for the prolongation of the error. This
neuroidentifier may be a recurrent neural network or a "feed-forward" network with inputs of
different moments of time. Another possibility is that used by [Maeda et al., 1997], who obtain
said sensitivity by increasing each one of the neurocontroller's adjustable coefficients and
making the corresponding observation of the variations in each one of the outputs of the plant
to be controlled, thereby estimating the Jacobian of the plant.
There are also techniques or adjustment algorithms in which this problem does not arise, since
a stochastic algorithm is used, e.g. Alopex [Venugopal, 1993], where in order to adjust the
controller's coefficients, the correlation between the error function to be minimized and the
variations produced in each coefficient is used.
The use of DSPs is justified because their combination with neural networks results in a
powerful system for control tasks [Bona et al, 1997], owing to their computing power in
multiplication and addition operations and also because their peripherals allow additional
hardware to be controlled. The DSP receives the commands, implements the network and sends
the commands on to the motors. To do so it uses an FPGA-implemented odometric system, a
double-port memory to receive the commands (coming from a joystick, voice command, eye
movements, breath expulsion, etc.) and a LonWorks bus (based on a Neuron-Chip) for setting
up the communication protocol with the motors [Garcia et al. 1997].
The architecture of the prototype wheelchair in the tests carried out is shown in figure 1. As
may be seen, it comprises: a) an environment-detection and safety system (including infrared
and ultrasound sensors), b) wheelchair-user interface including such output devices as a display
and voice synthesiser and input devices that allow the user to guide the wheelchair using
different strategies, including joystick, voice commands, breath expulsion and eye movements,
c) low level control, mainly in charge of regulating the motor speed and the speed (angular and
linear) of the wheelchair, d) navigation and sensory integration, whose main functions are dead
reckoning (working from information provided by the encoders in both drive wheels), path
generation (using splines) and avoidance of obstacles and integration of the information from
the various sensors [Mazo et al, 1998].
Communication between the various modules is established through two LonWorks buses and
a parallel bus, which guarantee the response speed and communication reliability required in
these types of applications.
This paper is divided into the following sections: firstly, the control scheme is described.
Section 3 shows the neural model (neurocontroller), the formulas for the adjustment of the
neurocontroller and the system's stability conditions. Section 4 focuses on the hardware
implementation of the system and lastly section 5 shows the results of the different tests and the
main conclusions of this work.
2. Control scheme
The control scheme used is shown in figure 2. The shaded blocks are implemented in the DSP
board. The input commands [V Ω]^T can either be sent by a manual system implemented in the
joystick or automatically by voice, breath expulsion or eye movements.
The neural control is implemented using a neural network model based on radial basis functions
(RBF). The controller outputs are the speeds of the right-hand and left-hand wheels [ω_r ω_l]^T.
These outputs do not act directly on the wheels but are sent to an electronic system that directly
actuates the drivers of the wheelchair's motors, with a classic PID control loop. The latter
ensures that the speed of each wheel corresponds as closely as possible to that sent by the neural
controller (ω'_r ≈ ω_r and ω'_l ≈ ω_l). The control scheme is thus divided into two levels: the
PID that implements the low level control, and the second control level ("high level")
implemented by the neural controller, which makes sure the wheelchair's linear and angular
speeds [V_RE Ω_RE]^T correspond to the command speeds.
The feedback loop of the high level control is set up via a reading of the wheel speeds [ω_r
ω_l]^T, using an odometric system implemented in a field-programmable gate array (FPGA).
Once the equations have been worked out for modelling the wheelchair, a calculation is made
of the linear and angular speeds [V_RE Ω_RE]^T. The difference between the real and desired
speeds (error function e) is used to adjust the coefficients by the gradient descent technique.
3. Neurocontroller model
Figure 3 shows the neurocontroller model. The neural network used is a model with an
architecture based on radial basis functions (RBF). The gradient descent technique is used to
adjust the synaptic connections between the neurons and the network outputs. The model
equations are:

    y_N^p(k) = Σ_{i=1..M} w_ip · g_i(k),   p = 1, 2        (1)

    g_i(k) = exp( -‖x(k) - c_i‖² / σ² )                    (2)

where x(k) is the network input, c_i the centre and σ the width of the i-th Gaussian unit, and
w_ip the output weights.
In the scheme shown in figure 2, the neurocontroller coefficients have to be adjusted on line,
the error (equation 3) being reduced to a minimum by the error backpropagation technique:

    e(k) = ½ (V - V_RE)² + ½ (Ω - Ω_RE)²                   (3)
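A minimal sketch of an RBF controller of this kind is given below, assuming Gaussian units followed by a linear output layer. The centres and the target in the example are placeholders, and the propagation of the error through the plant discussed next is omitted: only the plain delta-rule adjustment of the output weights is shown.

```python
import numpy as np

def rbf_forward(x, centres, sigma, W):
    """Gaussian units g_i(k), then linear outputs
    y_p(k) = sum_i w_ip * g_i(k) for p = 1, 2."""
    g = np.exp(-np.sum((centres - x) ** 2, axis=1) / sigma ** 2)
    return W.T @ g, g

# Parameters reported in the paper: M = 16 units, sigma = 1.8, alpha = 0.1.
M, sigma, alpha = 16, 1.8, 0.1
rng = np.random.default_rng(1)
centres = rng.uniform(-1.0, 1.0, size=(M, 2))  # placeholder centres
W = np.zeros((M, 2))

x = np.array([0.5, 0.1])        # command input [V, Omega]
target = np.array([0.2, -0.1])  # illustrative desired output
y, g = rbf_forward(x, centres, sigma, W)
# One gradient-descent step on e(k) = 1/2 * ||target - y||^2
# with respect to the output weights (delta rule).
W += alpha * np.outer(g, target - y)
y2, _ = rbf_forward(x, centres, sigma, W)
```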
The problem posed is the propagation of the error through the wheelchair dynamics. In this case
a behaviour study of the wheelchair is made, thereby obtaining its Jacobian to adjust the
neurocontroller. With this alternative the following is obtained:

    V = (R/2)(ω_r + ω_l),   Ω = (R/D)(ω_r - ω_l)           (4)

where R is the radius of the drive wheels and D the inter-wheel distance (R = 16 cm, D = 54
cm).
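Equation (4) is the standard differential-drive kinematic relation: linear speed from the mean of the wheel speeds, angular speed from their difference. As a sketch (with R and D in metres):

```python
R = 0.16  # drive-wheel radius (R = 16 cm)
D = 0.54  # inter-wheel distance (D = 54 cm)

def chair_speeds(w_r, w_l):
    """Linear speed V from the mean wheel speed,
    angular speed Omega from the difference (equation (4))."""
    V = (R / 2.0) * (w_r + w_l)
    Omega = (R / D) * (w_r - w_l)
    return V, Omega

# Both wheels at 1 rad/s: the chair moves straight at R m/s.
V, Omega = chair_speeds(1.0, 1.0)
```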
Analysis of stability
In this section we find a maximum value of the learning factor (α) such that the training error
decreases, or at least does not increase, at all times. For this, a vector W containing all the
adjustable coefficients of the network is considered, and bounds of the following form are
obtained for each network output:

    0 < α < 1 / ‖∂y_N^1(k)/∂W‖²_max ,   0 < α < 1 / ‖∂y_N^2(k)/∂W‖²_max    (9)

where the vector W is made up of all the coefficients adjusted in each sampling cycle.
Applying the results indicated by equations (9) and considering that the neurocontroller + chair
unit is a single neural network, the following conditions must be fulfilled in the control system
of Figure 2:

    0 < α < 1 / ‖∂V_RE(k)/∂W‖²_max ,   0 < α < 1 / ‖∂Ω_RE(k)/∂W‖²_max      (10)

With the plant model of equation (4) it results:

    ‖∂V_RE(k)/∂w_i‖ = J_V · g_i(k),   J_V = R/2                            (12)

    ‖∂Ω_RE(k)/∂w_i‖ = J_Ω · g_i(k),   J_Ω = R/D                            (14), (15)

With the physical values of the chair, the most unfavourable case is that indicated by equations
(14) or (15). In short, the maximum value of the learning factor which may be used in the control
system of Figure 3 is:

    0 < α < 1 / (2M · (R/D)²)                                              (16)
4. Description of the board
Figure 4 shows the board's block diagram. A description is given below of the components
therein. The hub of the board is the Texas Instruments DSP (TMS320C31). The use
of a digital signal processor rather than a general purpose one is justified by the fact that its
architecture is specially designed to tackle the type and load of computation needed for the
execution of the proposed algorithms. The most important of the devices complementing the
DSP are the RAM of 256K x 32, and the software-loading memory, which allows the board to
work with complete autonomy. A high-performance, user-programmable device is used for
implementing the other functions: the encoder reader and the calculation of the position and
speed of the drive wheels.
Communication with the exterior can be effected using two different communication protocols.
The control stage, for example, sends the orders to the motor actuators through the LonWorks
bus. The central unit and the path generator make use of a second communication protocol
(the parallel bus in figure 1) by means of mailboxes in a double-port RAM, through which
commands (angular and linear wheelchair speeds) are sent to the control stage.
A debugging tool of the DSP family TMS320C3X was used for measuring the execution times
of the neural control algorithms. The algorithm execution period was established at 100 ms,
conditioned by the response of the wheelchair motors. The total execution time of the control
algorithms for a 16-neuron network was 0.937 ms. In figure 5 an example of adaptive control
is shown, in which at certain moments a person sits in the chair (t = 50 s), stands up
(t = 90 s) and sits down again (t = 140 s). The wheelchair is following a circumference with a
radius equal to 1 m. The parameters used in the neurocontroller are: M = 16 neurons, σ = 1.8
and α = 0.1.
6. Acknowledgements
This work has been carried out thanks to the grants received from CICYT (Interministerial
Science and Technology Committee, Spain), project TER96-1957-C03-01.
References
Bona, B., Carabelli, S., Chiaberge, M., Miranda, E. and Reyneri, L. M. "Neuro-Fuzzy hardware
and DSPs: a promising marriage for control of complex systems". MICRONEURO'97,
Dresden, September 1997.
Boquete, L., et al. "Control with Reference Model Using Recurrent Neural Networks". In:
International ICSC/IFAC Symposium on Neural Computation, September 1998, pp. 506-511.
Garcia, J.C. et al. "An Autonomous Wheelchair with a LonWorks Network based Distributed
Control System". The Spring 97 LonUsers International Conference and Exhibitions, May
1997, Santa Clara.
Hunt, K. J. and Sbarbaro, D. "Neural networks for nonlinear internal model control". IEE
Proceedings-D, Vol. 138, No. 5, 1991.
Ku, C. C. and Lee, K. Y. "Diagonal Recurrent Neural Networks for Dynamic Systems
Control". IEEE Transactions on Neural Networks, Vol. 6, No. 1, January 1995.
Narendra, K. "Neural Networks for Control: Theory and Practice". Proceedings of the IEEE,
Vol. 84, No. 10, October 1996.
Noriega, J. R. and Wang, H. "A Direct Adaptive Neural-Network Control for Unknown
Nonlinear Systems and its Application". IEEE Transactions on Neural Networks, Vol. 9, No. 1,
January 1998.
Venugopal, K. "Learning in Connectionist Networks Using the Alopex Algorithm". PhD
Thesis, Florida Atlantic University, Boca Raton, Florida, April 1993.
Forward-Backward Parallelism in On-Line
Backpropagation
Rafael Gadea Gironés                    Antonio Mocholí Salcedo
Dpto. Ing. Electrónica U.P.V.           Dpto. Ing. Electrónica U.P.V.
Abstract
The paper describes the implementation of a systolic array for a multilayer perceptron on Altera FLEX10KE
FPGAs with a hardware-friendly learning algorithm. A pipelined adaptation of the on-line backpropagation
algorithm is shown. It better exploits the parallelism because both the forward and backward phases can be
performed simultaneously. As a result, a combined systolic array structure is proposed for both phases. Analytic
expressions show that the pipelined version is more efficient than the non-pipelined version. The design is
implemented and simulated using VHDL at different levels of abstraction and finally mapped on FPGAs.
1. Introduction
In recent years it has been shown that neural networks are capable of providing solutions to
many problems in the areas of pattern recognition, signal processing, time series analysis, etc.
While software simulations are very useful for investigating the capabilities of neural network
models and creating new algorithms, hardware implementations are essential to take full
advantage of the inherent parallelism of neural networks.
To organize the ideas described below a careful examination of the parallelism inherent in
artificial neural networks (ANN) is useful. A casual inspection of the standard equations used to
describe backpropagation reveals two obvious degrees of parallelism in an ANN. Firstly, there is
parallel processing by many nodes in each layer. Secondly, there is parallel processing in the
many training examples.
The former comes to mind most easily when parallel aspects of ANNs are being considered.
The network may be partitioned by distributing the synaptic coefficients and neurons throughout
a processor network. Again, there are two variations of this technique: "neuron-oriented
parallelism" and "synapse-oriented parallelism".
In the first variation, the neurons are distributed among available processors. However, it is
difficult to place the neurons in such a way as to produce efficient implementations, which
require both an evenly distributed computational load (easy) and reduced data communications
(difficult) [1].
The second variation is based on the fact that the computations in a neural network are
basically matrix products [2], [3], [4] and [5]. The advantage of this approach is that the amount
of data communicated between processors is moderate and evenly distributed, although in a
multilayer perceptron the synaptic matrix is lower triangular. It can be more interesting to
perform the matrix computation in an implementation distributed by layers (matrix-vector
representation) [6].
The latter, which we will refer to as "training set parallelism", is perhaps the most useful. In
backpropagation, this latter parallel aspect is the result of the linear combination of the individual
contributions made by each training pattern to the adjustment of the network weights. The
linearity implies that the patterns can be processed independently and hence, simultaneously.
However, this implementation requires that the weights be updated after all the parallel-processed
training patterns have been seen: so-called "batch" updating [7].
A third, but less obvious, aspect of backpropagation stems from the fact that forward and
backward passes of different training patterns can be processed in parallel. In a work by
Rosenberg and Blelloch [8] with the Connection Machine, the authors noted this possibility in their
implementation, though it remained unimplemented. Later, A. Petrowski et al. [9] described a
theoretical analysis and experimental results with transputers. However, only the batch-line
version of the backpropagation algorithm was shown. The possibility of an on-line version was noted
by the authors in general terms, but it was not pursued with systematic experiments and
theoretical investigations. In [10] we show that this parallelism, which we will refer to as "forward-
backward parallelism", gives good performance in convergence time and generalization rate, and
we began to show the better hardware performance of the pipelined on-line backpropagation in
terms of speed of learning. In this paper our main purpose is to make this speed improvement
concrete in a hardware implementation on the Altera FLEX10K50 and to show the
hardware costs of this pipelined on-line backpropagation, always compared to standard
backpropagation.
In section 2 pipelined on-line backpropagation is presented and proposed. Section 3 studies the
latency, throughput and efficiency of this algorithm compared to a non-pipelined algorithm. An
alternating orthogonal systolic array is used for these measurements of hardware performance.
The methodology of design using VHDL is described in Section 4. Also in this section the
implementation properties when compiling on FLEX10K FPGAs from Altera will be evaluated.
2. Pipelined on-line backpropagation
The starting point of this study is the backpropagation algorithm in its on-line version. We
assume we have a multilayer perceptron with three layers: two hidden layers and the output layer.
The phases involved in backpropagation, taking one pattern m at a time and updating the
weights after each pattern (on-line version), are as follows:
a) Forward phase. Apply the pattern m to the input layer and propagate the signal forwards
through the network until the final outputs a_i^L have been calculated for each i and l:

    a_i^l = f(u_i^l),   u_i^l = Σ_{j=0..N_{l-1}} w_ij^l · a_j^{l-1}        (1)

    1 ≤ i ≤ N_l ,  1 ≤ l ≤ L

b) Error calculation step. Compute the δ's for the output layer L and compute the δ's for the
preceding layers by propagating the errors backwards:

    δ_i^L = f'(u_i^L)(t_i - a_i^L),   δ_i^l = f'(u_i^l) Σ_j w_ji^{l+1} δ_j^{l+1}    (2)

c) Weight update step. Update the weights according to:

    Δ^m w_ij^l = η · δ_i^l · a_j^{l-1}                                     (3)

    1 ≤ i ≤ N_l ,  1 ≤ l ≤ L
All the elements in (3) are given at the same time as the necessary elements for the error
calculation step; therefore it is possible to perform these two last steps simultaneously (during the
same clock cycle) in this on-line version and to reduce the number of steps to two: the forward step
(1) and the backward step (2) and (3). However, in the batch-line version, the weight update is
performed at the end of an epoch (set of training patterns) and this approach would be
impossible.
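The three steps above can be condensed into a short sketch. This is written in Python rather than VHDL, for readability only; the logistic activation, the learning rate η = 0.5 and the 3-4-4-2 layer sizes are illustrative choices, not taken from the paper.

```python
import numpy as np

def f(u):
    """Logistic activation; note f'(u) = f(u) * (1 - f(u))."""
    return 1.0 / (1.0 + np.exp(-u))

def online_bp_step(x, t, Ws, eta=0.5):
    """One on-line backpropagation step: forward phase (1), error
    calculation (2) and weight update (3). Ws is the list of weight
    matrices of a three-layer perceptron, updated in place."""
    # a) Forward phase: propagate activations layer by layer.
    acts = [np.asarray(x, dtype=float)]
    for W in Ws:
        acts.append(f(W @ acts[-1]))
    # b) Error calculation: output deltas, then propagate backwards.
    deltas = [acts[-1] * (1 - acts[-1]) * (t - acts[-1])]
    for W, a in zip(reversed(Ws), reversed(acts[1:-1])):
        deltas.append(a * (1 - a) * (W.T @ deltas[-1]))
    deltas.reverse()
    # c) Weight update after every pattern (on-line version).
    for W, d, a in zip(Ws, deltas, acts[:-1]):
        W += eta * np.outer(d, a)
    return acts[-1]

rng = np.random.default_rng(2)
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((4, 4)),
      rng.standard_normal((2, 4))]
x, t = np.array([1.0, 0.0, 1.0]), np.array([0.9, 0.1])
e0 = np.linalg.norm(t - online_bp_step(x, t, Ws))
for _ in range(200):
    y = online_bp_step(x, t, Ws)
```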
Non-pipeline: The algorithm takes one training pattern m. Only when the forward
step is finished in the output layer can the backward step for this pattern occur. When this step
reaches the input layer, the forward step for the following training pattern can start (Figure 1).

Figure 1. Non-pipeline.          Figure 2. Pipeline.
In each step s only the neurons of one layer can perform simultaneously, and so this is the
only degree of parallelism for one pattern. However, this disadvantage means we can share the
hardware resources for both phases, because these resources are practically the same (matrix-
vector multiplication).
Pipeline: The algorithm takes one training pattern m and starts the forward phase in the first
layer. The following figure shows what happens at this moment (in this step) in all the layers of
the multilayer perceptron.
Figure 2 shows that in each step, every neuron in each layer is busy working simultaneously,
using two degrees of parallelism: synapse-oriented parallelism and forward-backward
parallelism. Of course, in this type of implementation, the hardware resources of the forward and
backward phases cannot be shared. In the following section we will see how, in spite of this
problem, the pipeline version for the proposed systolic array is more efficient than the non-
pipeline version.
Evidently, the pipeline carries an important modification of the original backpropagation
algorithm [11], [12]. This is clear because the alteration of weights at a given step interferes with
the computations of the states a and errors δ for patterns taken at different steps in the network. For
example, we are going to observe what happens with a pattern m on its way through the network
during the forward phase (from input to output). In particular, we will take into account the last
pattern that has modified the weights of each layer. We can see:
1. For the first layer, the last pattern to modify the weights of this layer is the pattern m-5.
2. When our pattern m passes the second layer, the last pattern to modify the weights of this layer
will be the pattern m-3.
3. Finally, when the pattern reaches the layer L, the last pattern to modify the weights of this
layer will be the pattern m-1.
Of course, the other patterns also contribute. The patterns which modified the weights
before patterns m-5, m-3 and m-1 are patterns m-6, m-4 and m-2 for the first, second and output
layers respectively. In the pipeline version, the pattern m-1 is always the last pattern to modify the
weights of all the layers. It is curious to note that when we use the momentum variation of the
backpropagation algorithm with the pipeline version, the last six patterns before the current
pattern contribute to the weight updates, while with the non-pipeline version, only the last two
patterns before the current pattern contribute.
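One timing model consistent with the schedule described above can be sketched as follows. The step-numbering convention is an assumption: the forward pass of pattern m is taken to reach layer l at step m + (l-1), and the backward pass of pattern m' to update the weights of layer l at step m' + (2L - l).

```python
def last_updater(m, layer, L=3):
    """Index of the most recent pattern whose backward pass has already
    updated the weights of `layer` when pattern m passes it forwards:
    m' + (2L - layer) <= m + (layer - 1), hence
    m' = m + 2*(layer - 1) - (2L - 1)."""
    return m + 2 * (layer - 1) - (2 * L - 1)

# For a three-layer network, pattern m = 100 sees weights last modified
# by patterns m-5, m-3 and m-1 in layers 1, 2 and 3 respectively.
staleness = [last_updater(100, l) for l in (1, 2, 3)]
```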
It is important that the equations for the two phases perform in the same manner as in the non-
pipeline version. For this, it will be necessary to store the values of the sigmoids and their derivatives
(see the following section). Therefore, we have a variation of the original on-line
backpropagation algorithm that consists basically in a modification of the contribution of the
different patterns of a training set to the weight updates, along the same lines as the momentum
variation.
3. Hardware performance of pipeline systolic architecture
The aim of this section is to characterize the hardware performance of the pipeline on-line
backpropagation algorithm compared with the non-pipeline version. For this proposal we employ
the "alternating orthogonal systolic array"[6] and use the following metrics:
1) Throughput rate (measured as the number of clock cycles between two processed
patterns)
2) Latency (clock cycles required to process one pattern)
3) Array efficiency.
The array efficiency metric provides a measure of PE (processing element) and pipeline usage over the period required to process one pattern. We measure the efficiency of a parallel algorithm by the common ratio S/P [13], where S is the speedup and P is the number of processors in the network. The speedup is given by the ratio S = t_seq / t_par, where t_seq and t_par are the sequential and the parallel computation times, respectively.
We suppose that we have an MLP (multilayer perceptron) with three layers and with the following characteristics:
NE = number of inputs.
N1O = number of neurons in the first hidden layer.
N2O = number of neurons in the second hidden layer.
NS = number of outputs.
Figure 3
To quantify this, we are going to assume that the "synapse units" and the "neuron units" perform their operations in one clock cycle.
non-pipeline: For the two phases of the backpropagation algorithm we have the following time costs:
- Forward phase: (NE + N1O + N2O + 5) cycles
- Backward phase: (NS + N2O + N1O + NE + 5) cycles
To measure the efficiency, we are going to consider our object of analysis to be the MLP performing one epoch of training (b being the number of patterns). The duration t_par of the presentation of an epoch for the non-pipelined BP algorithm is given by:
t_par = Latency (one pattern) + Throughput (b-1) = (2NE + 2N1O + 2N2O + NS + 10) cycles + (NE + 2N1O + 2N2O + NS + 9)(b-1) cycles    (6)
It is evident that our alternating orthogonal systolic array has the following number of processing units (taking away the bias PEs):
P = NE + 3N2O + 4    (8)
Efficiency = t_seq / (t_par · P) =
= [(NE+1)N1O + (N1O+1)N2O + (N2O+1)NS + N1O + N2O + NS] / [(NE + 2N1O + 2N2O + NS + 9)(NE + 3N2O + 4)]    (9)
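Expressions (6), (8) and (9) can be evaluated numerically. The sketch below is ours; the operation counts in `t_seq` follow our reading of the numerator of (9):

```python
def systolic_efficiency(NE, N1O, N2O, NS):
    """Efficiency S/P of the non-pipelined array per pattern (a sketch;
    t_seq follows our reading of the numerator of equation (9))."""
    t_seq = ((NE + 1) * N1O + (N1O + 1) * N2O + (N2O + 1) * NS
             + N1O + N2O + NS)               # MACs plus neuron operations
    t_par = NE + 2 * N1O + 2 * N2O + NS + 9  # throughput per pattern, from (6)
    P = NE + 3 * N2O + 4                     # processing elements, equation (8)
    return t_seq / (t_par * P)

# Network used later in Section 4: 3 inputs, hidden layers of 20 and 10, 4 outputs
print(round(systolic_efficiency(3, 20, 10, 4), 3))
```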
pipeline: For the two phases of the backpropagation algorithm we have these time costs:
- Forward phase: (NE + N1O + N2O + 5) cycles
- Backward phase: (NS + N2O + N1O + NE + 5) cycles
This expression is exactly the same as in the non-pipeline version (4), which shows that the latency does not improve with the pipeline.
The throughput rate is, however, strongly affected by the application of this variation. In the particular case of an alternating orthogonal systolic array, with three layers distributed as vertical layer - horizontal layer - vertical layer, the throughput is given by:
t_par = Latency (one pattern) + Throughput (b-1) = (2NE + 2N1O + 2N2O + NS + 10) cycles + (N1O + 1)(b-1) cycles    (12)
It is evident that for our alternating orthogonal systolic array in the pipeline version the number of processing units is:
This equation shows an increase in the number of processing units because the MAC operations of the synapse units for the forward and backward phases cannot work simultaneously, and so we duplicate the quantity of synapse units.
If we compare the expressions obtained, the better efficiency and the clear improvement in throughput of the pipeline version stand out. We must remember that the number of connections updated per second will be directly proportional to the frequency of our implementation and inversely proportional to this throughput.
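The remark about connections updated per second (CUPS) can be made concrete. In the sketch below (ours), the clock frequency is a placeholder rather than a measured Table 1 value; only the two throughput expressions come from the text:

```python
def cups(n_connections, freq_hz, cycles_per_pattern):
    """Connections updated per second: connections per pattern times the
    pattern rate (clock frequency divided by throughput in cycles)."""
    return n_connections * freq_hz / cycles_per_pattern

NE, N1O, N2O, NS = 3, 20, 10, 4
conns = (NE + 1) * N1O + (N1O + 1) * N2O + (N2O + 1) * NS  # weights + biases
thr_std = NE + 2 * N1O + 2 * N2O + NS + 9   # non-pipeline throughput, from (6)
thr_pipe = N1O + 1                          # pipeline throughput, from (12)
f = 10e6                                    # hypothetical 10 MHz clock
print(cups(conns, f, thr_std), cups(conns, f, thr_pipe))
```

With these numbers the pipeline version updates connections more than three times faster at the same clock, which is why the throughput term dominates the comparison.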
This section compares directly the implementation properties of pipelined on-line BP with those of the standard BP algorithm when we use the same technology: ALTERA FLEX10KE FPGAs.
4.1 Design entry with VHDL
The design entry of the pipelined on-line BP, and of the classical on-line BP for comparison, is accomplished in VHDL. It is very important to make these system descriptions independent of the physical hardware, because our objective in the future will be to test our descriptions on other FPGAs and even on ASICs.
We have made eight VHDL testbenches to perform the simulations shown in Figure 5: four for the pipeline version and four for the non-pipeline version. The VHDL description of the "alternating orthogonal systolic array" (always the unit under test) is totally configurable by means of generics and generate statements whose values are obtained from three ASCII files:
- Database file: number of inputs, number of outputs, the training and validation patterns.
- Learning file: number of neurons in the first hidden layer, number of neurons in the second hidden layer, type of learning (on-line, batch-line or BLMS), value of the learning rate, value of the momentum rate, type of sigmoid (binary or bipolar), etc.
- Resolution file: resolution of weights (integer and decimal part), resolution of activations, resolution of the accumulator (integer and decimal part), etc.
Table 1. SPEED
Table 1 shows how the implementation of the pipeline version affects the frequency of operation. This effect is evident in the synapses, because in the pipeline version it is necessary to do 2 read and 1 write operations on the embedded dual-port RAM which stores the weights. Although the FLEX10KE permits simultaneous read and write operations, the cycle period must be shared among these three operations. However, this increased period does not prevent the speed performance of the pipeline version, in Connections Updated Per Second, from being much better than that of the non-pipeline (standard) version. The results of the last row were obtained for a multilayer perceptron as in Figure 4 but with the following parameters: 3 inputs, 4 outputs, 20 neurons in the first hidden layer and 10 in the second hidden layer.
Table 2. AREA
Table 2 shows the resource usage for the two versions, supposing that the number of neurons of the first hidden layer is less than 32. We have used the FASTEST style for the implementation and optimization, and we have mapped all the memory elements (FIFO and RAM) onto embedded array blocks (EAB) of the FPGA by means of ALTERA megafunctions. We can observe that the hardware cost of pipelining the backpropagation algorithm is higher in the synapses than in the neurons, fundamentally because the pipeline version needs different multipliers and accumulators for the forward and backward phases.
5. Conclusions
This paper evaluates the hardware performance of the pipelined on-line backpropagation algorithm. This algorithm removes some of the drawbacks that traditional backpropagation suffers when implemented on VLSI circuits. It may go on to offer considerable improvements, especially with respect to hardware efficiency and speed of learning, although the circuitry is more complex.
We believe this paper contributes new data to the classical contention between researchers who update network weights continuously (on-line) and those updating weights only after some subset, or often after the entire set, of training patterns has been presented to the network (batch-line). Until now, batch updating after the entire training set has been processed (i.e. after each epoch) was preferred in order to best exploit "training set parallelism" and "forward-backward parallelism." Now we can see that, to exploit all the degrees of parallelism, we can use the on-line version of backpropagation without degradation of its properties.
Abstract
The paper describes a VLSI-viable integrate-and-fire neuron model with an easily controllable firing threshold that can be used to induce synchronization processes. The circuits are intended to exploit both rate and spike time coding schemes, taking advantage of these synchronization processes to accelerate processing tasks. In this way the temporal domain can be exploited in neural computation architectures. A simple neural structure is also discussed, providing simulation results to illustrate how these time coded signals can be combined to perform a simple processing task such as coherent input detection.
1. Introduction
The way in which biology manages to perform complex processing tasks in a very short time is still unclear. An efficient exploitation of the temporal domain may be the key. Most spiking neuron models use rate coding, but several biological studies have revealed the important role that single spike timing may play in biological processing schemes. In fact, although some processing tasks in biological nervous systems might be carried out making use of a pure rate coding, this is not the case for other processing pathways where complex processing tasks are performed in times of a few milliseconds, through several neural layers. This is hard to make compatible with a rate coding computing scheme based on biological neurons, which exhibit interspike times in the range of milliseconds. This fact has motivated different research groups to study alternative coding schemes, such as rank order coding [THO98] and temporal coding [HOP95]. These models are compatible with the rapid processing observed in some sensory pathways [HOP95]. Biological systems seem to make efficient use of both firing rates and the relative timing of individual spikes to code neural information [SEJ95].
This paper describes basic circuits, and neural configurations based on them, intended to exploit both rate and spike time coding schemes. Their functionality is illustrated by SPICE simulations taking into account the parameters of the 1.2 µm CMOS fabrication process of AMS, in which some of the circuits have already been implemented and tested [PEL97, ROS97a].
The computational primitives described here are still far from any solution that could be directly applied to a real problem. Nevertheless, in order to reach this point, it is necessary to develop the basic time coding circuits and also to study how the neural information processing capabilities offered by these cells can be used collectively in massively parallel architectures. Basic circuits like the one proposed here may motivate the search for new neural configurations that take full advantage of time coded signals to perform complex processing tasks efficiently.
Section 2 of the paper presents the integrate-and-fire neuron model; Sections 3 and 4 describe briefly the synaptic circuit and the time coding circuit, respectively. In Section 5 a simple neural structure is discussed with simulation results illustrating how coherent input detection can be carried out with the proposed cells. In Section 6 some concluding remarks are made.
2. Neuron Model
The circuits proposed in this paper implement a spiking neuron model [GER98]. The neuron state is represented by a variable (Vx) called the membrane potential. Each time that Vx reaches a certain threshold (Vth) the neuron fires an output spike. Two processes affect the value of Vx according to expression (1). First, Vx falls to its minimum value each time an output pulse is fired. Second, Vx integrates the contribution of all the presynaptic neurons.
In expression (1) w_ij represents the weight of the synaptic connection and ε_ij is a function that describes in time how the synaptic contributions of individual spikes are integrated in the membrane potential (Vx). Using a similar nomenclature to that in [GER98], for a particular neuron i, F_i denotes its receptive field, that is, all its presynaptic connections, and ℱ_i represents the set of its firing times as indicated in expression (2).
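A minimal discrete-time sketch of this model (ours; the synaptic kernel ε_ij is simplified to an instantaneous charge packet, and all names are assumptions):

```python
def simulate_neuron(spike_trains, weights, v_th=1.0, v_min=0.0):
    """Integrate-and-fire sketch: Vx accumulates weighted presynaptic
    spikes and falls back to its minimum each time it crosses Vth."""
    vx, firing_times = v_min, []
    for t, spikes in enumerate(spike_trains):   # spikes: 0/1 per synapse
        vx += sum(w * s for w, s in zip(weights, spikes))
        if vx >= v_th:
            firing_times.append(t)              # fire an output spike
            vx = v_min                          # reset the membrane potential
    return firing_times

# Two synapses with weight 0.4; both fire on every step -> output every 2 steps
print(simulate_neuron([[1, 1]] * 4, [0.4, 0.4]))  # [1, 3]
```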
3. Synaptic Circuits
The synaptic circuits are described in detail in [ROS97a, PEL97]. For the sake of clarity, in this paper we consider a particular configuration of the synaptic circuits that induces a linear behaviour in the synapse model (see Figure 1.a). In this way the membrane potential variation induced by a single spike does not depend on the actual value of the membrane potential (Vx). Therefore Vx will rise (or fall) linearly with the number of excitatory (or inhibitory) input pulses, respectively (see Figure 1.b).
The weight of the excitatory synapse circuit (w+) is controlled by a reference voltage (V_ref) according to equation (3).
w+ = K+ · C+ / Cx    (3)
Figure 1: (a) Schematic circuit of an excitatory synapse. Each time a spike reaches the circuit a charge packet is injected into the membrane capacitance Cx. (b) The membrane potential (Vx) rises in response to spikes received by an excitatory synapse. Each time an input pulse reaches the synapse, a charge packet is injected into Cx for a time that depends on V_ref.
4. Time coding circuit
The proposed time coding circuit can be seen basically as an integrate-and-fire neuron with an external firing threshold that can be modulated for synchronization purposes (see Fig. 2).
Each time that the membrane potential (Vx) reaches the firing threshold (Vth) the comparator circuit switches, producing a fast charge of the intermediate capacitance (Ci), and generates a spike through an output stage similar to the one proposed by Carver Mead [MEA89]. While the pulse is being fired, the membrane capacitance Cx is completely depleted and the intermediate capacitance Ci is partially discharged to a fixed value below the transition threshold of the first inverter (I1). This is done by two specific depletion transistors. An additional current source may be implemented to complete the depletion of Ci in order to avoid undesirable charges caused by leakage currents at the output of the comparator circuit (see Figure 3).
This circuit behaves like an integrate-and-fire neuron. Vth is the firing threshold and i_exc represents the global incoming charge received by the membrane capacitance Cx through all the synapses. The time to the next spike is described by expression (4), and depends on the global excitation received from the whole presynaptic tree and on the time of the previously fired spike.
t_n = (Vth · Cx) / i_exc(t) + t_{n-1}    (4)
When a constant firing threshold (Vth) is used, the output frequency of spikes (Fout) is proportional to the incoming excitation current (i_exc). On the other hand, a periodic threshold signal (Vth) can be used to induce synchronization processes [HOP95, ROS97b]. This periodic signal, applied as the common reference voltage (Vth) for a population of neurons, can be seen as an artificial way of inducing synchronization in a set of neurons receiving similar inputs. In biological systems the synchronous oscillation of populations of neurons leads to subthreshold variations in the membrane potentials, which may play a similar role to this common reference signal. Global periodic oscillation signals have been observed in biological systems and may cause intrinsic modulation signals. Therefore a simple external threshold applied to the circuit simulates its inherent synchronization properties, responding with similar spike timing to similar inputs.
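A toy simulation (ours, not the authors' SPICE model) illustrates the mechanism: neurons driven by similar excitation currents, sharing a threshold that dips periodically, lock their firing to the same phase of the reference cycle:

```python
def fire_times(i_exc, steps=200, period=50):
    """Integrate-and-fire with a periodic firing threshold: the threshold
    dips briefly once per cycle, so sufficiently excited neurons fire
    at the dip and are thereby synchronized."""
    vx, times = 0.0, []
    for t in range(steps):
        vx += i_exc                           # integrate the excitation current
        v_th = 5.0 if t % period else 1.0     # dip at the start of each cycle
        if vx >= v_th:
            times.append(t)
            vx = 0.0                          # reset after firing
    return times

# Two neurons with similar inputs fire at the same phase of the Vth cycle
print(fire_times(0.06), fire_times(0.05))  # both [50, 100, 150]
```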
The power consumption of the above circuits is described in detail in [ROS97a]. All circuits have low power consumption, with typical values in the order of 0.3 µW for each synapse and 0.5 µW for each time coding circuit.
Figure 4: Neural configuration for coherence detection. An input layer encodes the input stimuli through individual spike timing. A collecting neuron (Nc) with a strong passive decay term fires spikes if it receives synchronized bursts of pulses.
This neuron (Nc) has a strong passive decay term and therefore only strong excitation phases (i.e. a high number of input spikes in a short time period) will be able to raise the membrane potential over the threshold and fire output pulses. The passive decay term of
Simulation results of such a neural configuration, with ten neurons in the input layer, are shown in Fig. 5. The input signals (Fig. 5.a) evolve through time and are quite heterogeneous until t=60 ms, when all inputs converge to a similar value. From this time the output spikes of the input layer gradually become synchronized. The collecting neuron receives pulse streams: the more synchronized the firing by the input neurons, the narrower are the pulse streams generated. Due to the strong passive decay term of this neuron (Nc), only very abrupt excitation processes (concentrated pulse streams) will be able to dominate it and raise the membrane potential over the firing threshold. In Fig. 5.b it can be seen that when the input spikes received are synchronized in a short interval of time, the membrane potential rises over the threshold, producing output spikes in the collecting neuron (Nc).
Figure 5: (a) Evolution of input stimuli. (b) Upper trace: spikes fired by the input layer, beginning to be synchronized from t=60 ms, when the input pattern becomes homogeneous. Middle trace: spikes fired by the collecting neuron (Nc). Lower trace: membrane potential of the collecting neuron. Only dense pulse streams coming from Ni lead to abrupt excitation phases able to dominate the passive decay term of the collecting neuron.
The input layer produces spikes of a fixed frequency. The particular time of a spike within the period of the threshold reference signal (Vth) depends on the excitation current (i_exc); weakly excited neurons will fire delayed pulses with respect to more strongly excited units.
Fig. 5.b (lower trace) shows that the collecting neuron Nc exhibits subthreshold
oscillations in response to non-coherent input stimuli sensed by (Ni). On the other hand,
when the input layer receives a homogeneous stimulus, the membrane potential of the
collecting neuron exhibits coherent oscillations over the threshold and generates output
pulse streams (see middle trace in Fig. 5.b).
If different groups of input neurons in the receptive field of a collecting neuron (Nc)
receive homogeneous stimuli, but of different values, then this collecting neuron will
respond with several excitation phases per reference cycle, each one corresponding to a
different region of homogeneous input stimuli.
Fig. 6 shows simulation results of the same configuration with twenty input neurons
where the input pattern converges at time t=60ms. From this time on, two populations
of neurons receive different values and therefore they synchronize their output spikes to
different firing times. This spatial pattern sensed in the receptive field generates an
output time signature consisting of two output pulse streams. If the populations of
synchronized input neurons are significant, the collecting neuron (Nc) will produce
output spikes for each abrupt excitation received. In this way such a neural
configuration could produce a particular output sequence in response to simple patterns
within the receptive field or even to textures (sensed as two input values distributed
through the receptive field).
Figure 6: (a) Input stimuli evolution. (b) Upper trace: spikes fired by the collecting neuron (Nc). Lower trace: Nc membrane potential. Two high activity phases can be observed in each Vth cycle (20 ms) once the input stimuli have converged to homogeneous values.
If the receptive field of Nc has a specific shape and orientation within the input
processing layer (see Figure 7), different collecting neurons may respond with
oscillations to stimuli of a certain shape and orientation. The intensity level of the input
stimulus is coded as oscillations at a specific time within the time interval defined by
the reference oscillation signal. On the other hand the amplitude of the oscillations of
the membrane potential of collecting neurons gives a measure of the number of
synchronized neurons in the previous layer.
Figure 7: (a) An oriented bar stimulates only a few neurons (dark units) in a specific receptive field. (b) In this case an oriented bar stimulates most of the neurons (dark units) of a receptive field. The number of excited neurons represents the degree of similarity between the input stimulus and the receptive field characteristics. A single collecting neuron Nc can be used to detect this degree of matching and to code it in spike streams.
6. Conclusion
J. J. Hopfield made a claim [HOP95], illustrating how a simple periodic firing threshold induces synchronization within integrate-and-fire neuron populations receiving similar inputs. This can be used to exploit the temporal domain, coding the neural information in the spike timing rather than in spike rates. The circuits described in this paper represent a VLSI approach to this concept that is a starting point for the study of VLSI neural structures able to take full advantage of this time coding scheme.
References
[THO98] S.J. Thorpe, J. Gautrais, "Rank Order Coding: A new coding scheme for rapid processing in neural networks", Computational Neuroscience: Trends in Research, J. Bower (Ed.), New York: Plenum Press.
[TON92] G. Tononi, O. Sporns, G. M. Edelman, "Reentry and the problem of integrating multiple cortical areas: simulation of dynamic integration in the visual system", Cerebral Cortex, vol. 2, pp. 310-335, 1992.
An Artificial Dendrite Using Active Channels
1 Introduction
Conventional (digital) computers are an integral part of our lives and are becoming ever more powerful. Nowadays it is possible to do several million calculations per second even with a modest PC. This computational power is accomplished by high clock rates, pipelining and some degree of parallelism. Despite this computational power, a conventionally programmed computer is not able to recognize images (for example) as well as we do. This task would require a very high clock rate to make real-time operation possible. An alternative is to use other computational methods or structures. One of these structures is an artificial neural network [1], a computational structure inspired by the (human) central nervous system.
1.1 Analogue neural nets
could be capable of simulating large neural networks, however most of the par-
allelism found in a neural network has to be translated to sequential programs,
making real-time processing of large neural nets nearly impossible. Because some
of the basic relations found in analogue electronics already consist of summa-
tions and threshold functions, it should be possible to create neurons consisting
only of a few analogue devices. It has been shown by Carver Mead [2] that it
is possible to mimic specific behavior of sensory systems using simple analogue
circuits. His replica of the human retina [3] was able to detect edges and motion
using a photo-sensitive array of neurons.
Due to the possible reduction in the complexity of the neuron, it is possible to fit a large number of neurons on a single chip. In order to use the neural net, learning is required. When conventional supervised learning rules are applied, practical problems arise. A supervised algorithm would be implemented using a complex circuit centrally placed between the neurons and connected to each neuron. In the case of tens of neurons this would be feasible, though the limited number of metal layers on chip would not be sufficient when a few thousand or even a few million neurons are used. In order to create networks of such magnitudes it is necessary to divide and decentralize the supervising algorithm, resulting in more localized learning circuitry. Good examples of localized learning algorithms can be found in the human brain itself. Each individual human neuron possesses the ability to learn.
The goal of this paper is to model the most important part of a neuron, the
dendrite. These dendrites could prove to be the key to local learning, because
most of the processing (if not all) is done by the dendrite. The resulting artificial
dendrite serves as a suitable vehicle for experiments concerning local learning.
The last aspect of this introduction is the information coding. In conventional computers, the information passed on from one element to another is coded using a sequence of patterns. The human brain uses a different information encoding, using the temporal relations in a series of spikes or action potentials. One can compare this modulation with FM: the more spikes, the higher the intensity. This temporal coding could prove to be less sensitive to parasitics than other forms of coding.
2.1 Anatomy
Some segments are similar for each neuron, though neurons can differ greatly in size and shape. Roughly speaking, a neuron has three distinct kinds of segments: dendrites, an axon and a cell body. Neurons are connected to other neurons through the dendrites and the axons. The axons excite other dendrites or cell bodies through small bulb-like terminations called synapses. A typical neuron, a Purkinje cell, can be seen in fig. 1.
The dendrites act as inputs of the neuron, while the axons are used as out-
puts. The dendrites and axons have in common a structure consisting of a cylin-
drical membrane formed by a bilayer of lipid molecules. Large molecules, called
channels, form passages through the membrane between the cell's interior and
exterior. Figure 2 shows a typical membrane with some channels. The terminations of the axon, the synapses, do not connect physically to the dendrite (by
merging the membranes) but lie close to the dendrite. These synapses mainly
use chemical reactions to transfer information from the axon to the dendrite.
2.2 Function
The best way to describe a biological neuron is as a highly nonlinear filter. Several effects can be ascribed to the dendrite, for example threshold behavior and a delay. It has been stated earlier that the information coding in the dendrite uses the temporal relation between individual or groups of spikes. These spikes are
transient potentials (potential difference between the cell's interior and exterior)
along the membrane of the dendrites and axons.
A more proper name for these spikes is action potentials. Though the amplitude and the timing can differ, each action potential has the same characteristic periods. A typical action potential as well as the characteristic periods can be seen in fig. 3. A generic action potential has four characteristic periods, the
resting period, the depolarization and the two refractory periods (partly repolarization/depolarization). During the resting period, the dendrite is in a state of equilibrium with a resulting membrane potential of approx. -60 mV (the resting potential). When the membrane potential is raised above a certain level, the depolarization begins and the membrane potential rises quickly to approx. 90 mV. When this point is reached, the membrane potential ceases to rise and starts to drop quickly below the resting potential (undershoot). After this undershoot, the membrane potential slowly rises to its resting state (repolarization). The last two events are called the refractory period, in which the potential drop is called absolute refractory (the membrane cannot be excited) and the slow return to the resting potential is called relative refractory (the membrane is less sensitive to excitations).
The information is coded in the number of spikes per second, though recent research suspects that the shape of an action potential also contains information [5]. This kind of information encoding is best described as pulse density modulation.
2.3 Dendrite
that the pulse is attenuated and that the RC-line behaves like a low-pass filter. Fig. 5 shows the six membrane potentials measured at the six points on a six-section RC-line. Within a section the values of Ra, Rm, Cm and the membrane potential Vm are considered to be constant. As one can see, the amplitude gradually decays
and the shape changes (the shape gets smoother due to the low-pass nature of
the RC-line) when the potential is measured at a point further away. Although
passive propagation is sufficient for short dendrites, it is necessary to use some
kind of active propagation for long distances (e.g. from the brain to a foot).
Active propagation is achieved using channels, large tunnel-shaped molecules through the membrane. There are several different kinds of channels [6]. These channels are used to transport ions (or molecules) from the cell's interior to its exterior and vice versa.
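The passive behaviour can be reproduced with a simple forward-Euler sketch of an RC-ladder (ours; component values, step size and section count are arbitrary illustrations, not the values used in the paper):

```python
def rc_line_step(v, v_in, ra=1.0, rm=10.0, cm=1.0, dt=0.05):
    """One Euler step of a passive RC-line: each node leaks through Rm
    and exchanges current with its neighbours through the axial Ra."""
    out = v[:]
    for i in range(len(v)):
        left = v_in if i == 0 else v[i - 1]
        right = v[i + 1] if i < len(v) - 1 else v[i]   # sealed far end
        i_axial = (left - v[i]) / ra + (right - v[i]) / ra
        out[i] = v[i] + dt * (i_axial - v[i] / rm) / cm
    return out

v = [0.0] * 6                      # six-section line, as in fig. 5
for _ in range(40):                # drive the first node with a constant pulse
    v = rc_line_step(v, v_in=1.0)
print([round(x, 3) for x in v])    # amplitude decays with distance
```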
2.4 Axons
2.5 Channels
As has been stated in the text above, channels are large molecules through the
membrane. These channels use passive (diffusion) or active mechanisms to trans-
port particles through the membrane. Most of the active channels exhibit a voltage-dependent behavior and memory effects. The ionic flow is determined by the membrane voltage and its derivative (different behavior on a rising/falling edge). These dependencies on the membrane potential are depicted using hysteresis graphs in fig. 6. The arrows show the possible trajectories of the membrane potential and the ionic current. Three ionic currents are important for the shape of the action potential. These flows (electrically seen as currents) are: the inflow of Na+ ions (Na+ influx) and K+ ions (K+ efflux and influx).
This section deals with the translation of the biological dendrite to an artificial
dendrite. The first subsection gives some considerations regarding the conducted
research. The model is a simplified version of the dendrite. The simplifications
are explained in the second paragraph. The next subsection describes the trans-
lation of the hysteresis graphs to ideal electrical schematics. The final phase of
the modeling is the translation of the ideal schematic to a circuit consisting of
conventional components, described in the last subsection.
3.1 Considerations
The biological dendrite is capable of processing information using addition, mul-
tiplication, delay and threshold behavior. The artificial dendrite should be capa-
ble of the same processes except multiplication (this could be a target for further
research). The ionic flows found in the biological dendrite can be modeled using currents. The channels consist of circuits controlled by the membrane potential and drive the membrane with a current source or sink. The advantage of this layout is the fact that both input (membrane potential) and output (driving current) can be connected to the same node, without undesired influences. The artificial dendrite described in this paper is capable of propagating action potentials in both directions (to and from the cell body). It is, however, difficult to model the dendrite completely, because this would result in many circuits.
One of the reductions of the model concerns the characteristics of the channels. Biological channels have a variable aperture, so the flow of ions can vary. The artificial channels described here can only be opened or closed. Fig. 6 showed the membrane potential dependency of the biological channels. The following figure shows the effect of the reductions on the membrane potential dependency of the artificial channels.
[Fig. 7. Membrane-potential dependency of the artificial channels (hysteresis diagrams; panels include the K+ influx and K+ efflux channels).]

This figure is also a hysteresis diagram, using arrows to show the possible trajectories.
The modulation of the ionic flow in the biological channels results in a smoother
action potential. The values VTi are threshold values triggering certain events in
the artificial channels. In this graph, VT2 and VT3 have the same value.
In the practical implementation, these values have been chosen to be different, to
prevent undesired equilibria. The indexes correspond to the ones used in the ideal
schematics later in this section. When the artificial channels function according
to fig. 7, the resulting action potential will be like fig. 8. Another simplification
of the model is the structure used. The structure of a biological dendrite can
be seen as a continuous transmission line with all resistances and capacitances
distributed along its length.
[Fig. 8. Resulting action potential; threshold levels VT0-VT4 marked against time.]

[Fig. 9. Passive line: a stimulation current Istim drives a chain of Rm-Cm sections between Vsup and GND.]
3.3 Derivation of ideal model
The main information used to derive the ideal circuits is the graph shown in
fig. 7. The basic elements used in the artificial dendrite are comparators (for the
threshold behavior), logic gates (used to combine different thresholds), latches
(for the memory behavior, to determine the direction) and switched current
supplies/sinks. When these four elements are combined it is possible to construct
the different hysteresis diagrams. The resulting circuits are shown in fig. 10, 11
and 12.
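The behaviour of such a combination can be sketched in software. The following is an illustrative behavioural model, not the authors' circuit: two comparator thresholds define the hysteresis, a latch stores the open/closed state, and the state switches a fixed driving current. All threshold and current values are made up for the example.

```python
# Behavioural sketch of one artificial channel: two comparators set the
# hysteresis thresholds, a latch remembers the open/closed state, and the
# state switches a fixed driving current on or off. Threshold values and
# the current magnitude are illustrative, not taken from the paper.

class HysteresisChannel:
    def __init__(self, v_open, v_close, i_drive):
        self.v_open = v_open      # comparator threshold that opens the channel
        self.v_close = v_close    # comparator threshold that closes it again
        self.i_drive = i_drive    # current sourced while the channel is open
        self.open = False         # latch: remembers the channel state

    def current(self, v_membrane):
        # Comparators + latch: state changes only when a threshold is crossed.
        if not self.open and v_membrane >= self.v_open:
            self.open = True
        elif self.open and v_membrane <= self.v_close:
            self.open = False
        return self.i_drive if self.open else 0.0

ch = HysteresisChannel(v_open=0.5, v_close=0.2, i_drive=1e-6)
trace = [ch.current(v) for v in (0.0, 0.6, 0.4, 0.3, 0.1)]
print(trace)  # hysteresis: stays open at 0.4 and 0.3, closes only at 0.1
```

Because the latch holds the state between the two thresholds, the same membrane voltage can yield different output currents depending on the trajectory, which is exactly the memory behavior the hysteresis diagrams encode.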
3.4 Translation of ideal model to electrical circuit
It is possible to derive circuits from the three different ideal schematics using
conventional devices. These circuits can be used to simulate the artificial dendrite
[Figs. 10-12. Ideal schematics of the artificial channels: comparator stages with input Vin, thresholds such as VT5 and Vrest, supply rails Vsupp/GND, and an Output node.]
with the circuit simulator SPICE. Each ideal component can be replaced by a
circuit built from conventional components; these small basic circuits can be found
in any general textbook on electronics [8], [9]. The resulting schematics are
included in the appendix of this article.
Having electrical models of the biological channels gives the opportunity to sim-
ulate the artificial dendrite. These experiments were used to verify whether the
model had the desired functionality. This section describes the results of the
conducted simulations.
The simulations have been run in three different categories. The first category
tests the input/output behavior of the three different channels. The second cat-
egory verifies the behavior of the channels connected to an RC pair (representing
the passive line). It is sufficient to mention that the results of both categories
agree with the desired behavior. The values of the different components
have been chosen to make the results of the simulations clear.
4.2 Verification
The experiments of a third category are used to determine whether the artificial
dendrite functions properly. These experiments verify whether the artificial den-
drite is capable of producing and propagating an action potential. Two different
dendrites are verified: a three-section dendrite and a nine-section dendrite. Four
consecutive stimuli (injected currents) have been applied. The passive RC line is
charged in the beginning. The membrane potential rises from 0V to Vrest; the
K + influx channel is responsible for this charging. The first stimulus in both
cases is not sufficient to excite the dendrite above the threshold voltage VT2, due
to the fact that the membrane potential is still rising to the resting potential.
The next three stimuli are able to excite in both cases because the membrane po-
tential rises above a threshold potential, and for each dendrite three consecutive
action potentials are generated.
Fig. 14. Simulation of a 9-section dendrite
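The kind of experiment summarized in Fig. 14 can be approximated numerically. The sketch below is a minimal passive-line integration, not the SPICE model of the paper: the line is discretized into membrane Rm/Cm sections coupled by axial resistances Ra, with a sustained current injected into section 0. All component values are illustrative placeholders.

```python
# Minimal passive-line sketch: n membrane sections, each an Rm/Cm pair to
# ground, coupled by axial resistances Ra. Forward-Euler integration of
# Cm*dV/dt = -V/Rm + axial currents + injected stimulus.
# All values are illustrative, not the paper's component values.

def simulate(n=9, steps=2000, dt=1e-5, Rm=1e5, Cm=1e-8, Ra=1e4, i_stim=2e-6):
    v = [0.0] * n                         # membrane voltage of each section
    for _ in range(steps):
        new_v = v[:]
        for i in range(n):
            i_leak = -v[i] / Rm           # leak through the membrane resistance
            i_axial = 0.0                 # coupling to neighbouring sections
            if i > 0:
                i_axial += (v[i - 1] - v[i]) / Ra
            if i < n - 1:
                i_axial += (v[i + 1] - v[i]) / Ra
            i_in = i_stim if i == 0 else 0.0   # sustained injection at section 0
            new_v[i] = v[i] + dt * (i_leak + i_axial + i_in) / Cm
        v = new_v
    return v

v_end = simulate()
print("attenuation along the line:", v_end[0] > v_end[-1] > 0)
```

A purely passive line like this only attenuates the stimulus with distance; it is the active channel circuits that let the artificial dendrite regenerate and propagate a full action potential along the sections.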
4.3 Review
The experiments from the former section show that action potentials can be
generated in the artificial dendrite. Several properties found in biological den-
drites are also implemented in the artificial dendrite, such as threshold behavior
and refractory periods. The refractory periods of the artificial dendrite have a
sequential nature instead of the timed behavior of the biological dendrite. The
shape of the action potentials generated by the artificial dendrite is also more
rectangular than that of their biological counterparts, due to the discrete nature
of the artificial channels (in contrast to the continuous nature of the biological
channels). The artificial dendrite is capable of summation and threshold behav-
ior. Summation is accomplished by the injection of different currents into one
point. The threshold behavior can be ascribed to the active channels. It is
possible to simulate a conventional artificial neuron with temporal behavior.
In the next section conclusions are drawn regarding the simulations and im-
plementation of the artificial dendrite. The last section gives recommendations
regarding further research.
5.1 Conclusions
This research shows the possibility of mimicking specific behavior of the biolog-
ical dendrite using conventional electronic components. The approach used in
this article resulted in analogue circuits that perform different functions found
in the biological dendrites. The simulation experiments are used for verification
of the desired behaviour. This verification has been conducted on different levels.
The first category verified the response of the different channels to transient in-
put voltages. The second category verified the separate channels in a membrane
like environment. Finally, the active membrane sections were concatenated in
the third category to simulate small artificial dendrites. These three categories
show that the artificial dendrite, as well as its subcircuits, functions properly.
With its behavior, the artificial dendrite could be a good underlying structure
for neural networks processing information coded in the temporal domain.
5.2 Recommendations
References
1. Simon Haykin, Neural Networks, A Comprehensive Foundation, Prentice Hall, New
Jersey, 1994.
2. Carver Mead, Analog VLSI and neural systems, Addison Wesley, 1989.
3. C.A. Mead and M.A. Mahowald, "A silicon model of early visual processing," Neural
Networks, vol. 1, no. 1, pp. 91-97, 1988.
4. R.H.S. Carpenter, Neurophysiology - Third edition, Arnold, London, 1996.
5. C. Koch, "Computation and the single neuron," Nature, vol. 385, pp. 207-210,
1997.
6. Editorial, "Making sense of channel diversity," Neuroscience, vol. 1, no. 1, pp.
169-170, 1998.
7. J. Hoekstra, "Single and multiple compartment models in neural networks," in
Computing Anticipatory Systems - CASYS'97 conference proceedings, Daniel M.
Dubois, Ed., Liège, Belgium, 1997, CHAOS, pp. 626-641, AIP.
8. T. Bogart, Electronic devices and circuits - Third edition, Merrill Prentice Hall,
Columbus, Ohio, 1993.
9. P. Horowitz and W. Hill, The art of electronics - Second edition, Cambridge
University Press, 1996.
Analog Electronic System for Simulating
Biological Neurons
ABSTRACT
This paper deals with the implementation of an analog electronic system capable of emulating
and/or characterizing the electrical activity of biological neurons. We detail the main
characteristics and performances of the system, and point out its fitness as an experimentation
tool:
- high level of modeling accuracy, validated by simple and hybrid experiments;
- analog modeling principle, and the possibility to emulate in real time a large range of neurons or
neural networks, thanks to a set of programmable parameters;
- simplicity of model implementation, owing to a dedicated hardware and software interface.
I. INTRODUCTION
The study of biological neural networks, which are made of complex and highly
non-linear elements, is limited by classical experimental approaches. A way to overcome
these limitations is to study those networks from a theoretical point of view, using
mathematical models of the neurons, such as those following the classical Hodgkin-
Huxley formalism [1], [2], [3]. That statement is a key point justifying the development of
research in computational neuroscience. But software that numerically solves the
model equations is also somehow limited by the computation time: in that case, analog
computation appears as an alternative solution for neural computation.
Following initial studies, we have developed in the 90s [4], [5] an analog
electronic model for the implementation of artificial neurons and neural networks, based
on the Hodgkin-Huxley formalism. Equations are computed by analog circuits, integrated
on ASIC (Application Specific Integrated Circuit) chips in a BiCMOS technology, which
reproduce in real time the electrical activity of the neuron, i.e. its membrane potential, with a
high level of accuracy. Each module of the model chip acts as an ionic current generator,
and a neuron or a synapse is built by summing a number of these generators on a neural
membrane capacitance. Each ionic current generator can be independently configured to
follow the dynamics and gain of the ionic activity, using voltage parameters introduced into
the circuitry via external inputs on pads of the chip. Then, by simply externally tuning the
voltages on these inputs, one can configure the desired neural network.
We have shown in previous publications [6], [7], [8] the validity of this mode of
implementation for artificial neurons. Those circuits represent a powerful tool to test and
The integrated circuits we designed compute, in analog mode, equations of the Hodgkin-
Huxley formalism [1]. That formalism describes single-neuron or neural-network
electrical activity (membrane potential or synaptic current) as the result of the sum of
ionic currents on a membrane capacitance (Figure 1.A). These currents express the
membrane permeability to various ionic species (Sodium, Potassium, Calcium...), and are
time and membrane-voltage dependent [3]. Those variations are made explicit by a set of
mathematical equations that we consider as the generic operations of an ionic current
generator (see Figure 1.B, equations (1) to (5)). The parameters of these equations (Figure
1.B, see list of the parameters) are specific to the ionic species considered and to the
modeled neuron. Those parameters are originally determined experimentally, using
classical neurophysiology methods such as voltage-clamp. That experimental source of
the models justifies the admitted assumption that Hodgkin-Huxley models closely match
the activity of biological neurons.
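As a reference for the operations such a generator performs, a minimal numerical sketch of one Hodgkin-Huxley-style ionic current follows. It uses a generic g·m^p·h·(V − E) form with first-order gate kinetics; the parameter values and function names are illustrative, not those of the chips or of Figure 1.B.

```python
import math

# Generic ionic current generator sketch, in the spirit of the Hodgkin-Huxley
# formalism: I = g_max * m^p * h * (V - E_ion), with the activation variable m
# relaxing toward a sigmoidal steady state m_inf(V) with time constant tau.
# All parameter values below are illustrative placeholders.

def m_inf(v, v_half=-40.0, slope=5.0):
    # steady-state (sigmoidal) voltage dependence of the gate
    return 1.0 / (1.0 + math.exp(-(v - v_half) / slope))

def step_gate(m, v, dt, tau=1.0):
    # first-order kinetics: dm/dt = (m_inf(V) - m) / tau
    return m + dt * (m_inf(v) - m) / tau

def ionic_current(v, m, h=1.0, g_max=120.0, e_ion=50.0, p=3):
    # instantaneous ionic current through the membrane
    return g_max * (m ** p) * h * (v - e_ion)

# relax the gate at a fixed clamped voltage, then read the current
v, m, dt = -30.0, 0.0, 0.01
for _ in range(1000):
    m = step_gate(m, v, dt)
print(abs(m - m_inf(v)) < 1e-3, ionic_current(v, m) < 0.0)
```

Note that the chips' only approximation, mentioned below, is precisely in the kinetics step: the time constant tau is taken as voltage-independent, as in this sketch.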
The analog ASICs we developed include a set of electronic blocks, each emulating
an ionic current generator that follows the equations of Figure 1.B. The only
approximation is found in the kinetic expressions (equation (3)), where we neglect the
voltage-dependence phenomenon. The ionic current generators' outputs are summed on an
external capacitance (representing the membrane capacitance): that current sum is
realized by a simple connection, outside the chip. Synaptic connections that imply the
application of membrane voltages are also made externally. The topology of a neural
network and the constitution of a single neuron are then set by the user when interconnecting the
chips. For example, it is admitted that a simple spiking-neuron activity can be described
by the flowing of two ionic species (typically Sodium and Potassium), whereas cells with
a more complex activity are built with up to 6 ionic currents.
Additional functionalities have been introduced among the implemented blocks.
The first one expresses interdependencies of ionic currents that cannot be directly
expressed by the Hodgkin-Huxley formalism, and that happen to be very important for
some neuron activities.
As shown in equations (6) and (7), internal variables of the model then get an
additional dependency: the most common is the Calcium-dependence phenomenon,
where the Calcium concentration (computed from the Calcium current) balances the
membrane permeability to another ion [10]. To better describe the neural activity, we
added a function that is capable of expressing what is called a regulation process:
observations on living neurons have shown that, if physiological perturbations appear in
Figure 1
A neuron model. A: electrical equivalent schematic (ionic current generator blocks, with currents Ii and conductances gi, between the inside and the outside of the membrane). B: generic equations.
a neuron's environment, the cell may respond with a change in the intrinsic structure of its
ionic channels, in order to maintain its original activity. We use in the model a
mathematical expression defined by G. Le Masson [11], who chose the Calcium
concentration as the feedback parameter for the regulation process, and thereby
reproduced phenomena measured during experiments on living neurons (equations (8) and (9)). Other
expressions could be considered, but this approach nevertheless covers a large part
of the well-known regulation processes in single-neuron computation.
One important specificity of our process for implementing the neuron model
equations is that none of the equation parameters is definitively set. For each of the
parameters that appear in an ionic current generator expression, a dedicated chip input is
reserved; the effective parameter value depends on the voltage applied on that
input by the user. The voltage range is adapted to the concerned parameter, to match its
variation range for neuron models. Using identical chips, we then had the opportunity to
model different types of neurons, of both vertebrate and invertebrate species [7], [8].
Some of these models will appear in the applications of section IV.
However, the model chips by themselves can be exploited for the processing of simple
models. In that case, the many voltage inputs used to fix the parameters are manually
driven, using for example a set of variable resistors. But the circuits cannot be easily used
in more complex or systematic experiments, in which the equation parameters are to be
accurately managed and often modified.
Figure 2
Architecture of the analog simulation system
The other part is composed of 3 digital lines, driving the configuration protocol of
the chips' parameters. These data are first treated by a programmable logic circuit
that decodes the addressed chip and its associated digital-to-analog converter (DAC).
The DAC converts the 12-bit serial data of the parameter value, and continuously applies the
analog result to the corresponding ASIC parameter input.
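The digital side of this path can be illustrated with a small sketch: a 12-bit word is shifted in serially, decoded to an integer, and mapped linearly onto the voltage range reserved for the target parameter. The bit order (MSB first), the voltage range, and the helper names are assumptions for the example, not taken from the system.

```python
# Sketch of the parameter-configuration path: a 12-bit word arrives serially
# (MSB first here, by assumption), is decoded to an integer code, and a DAC
# maps it linearly onto the voltage range reserved for that parameter input.

def shift_in(bits):
    # serial-to-parallel conversion of a 12-bit word, MSB first
    assert len(bits) == 12
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value

def dac_output(code, v_min, v_max, n_bits=12):
    # ideal linear 12-bit DAC: code 0 -> v_min, code 4095 -> v_max
    return v_min + (v_max - v_min) * code / (2 ** n_bits - 1)

code = shift_in([1] + [0] * 11)          # 0b100000000000 = 2048
volts = dac_output(code, v_min=-1.0, v_max=1.0)
print(code, volts)                       # mid-scale code, just above 0 V
```

With 12 bits, each parameter input can be set to one of 4096 levels across its range, which is what makes systematic parameter sweeps practical compared with manual variable resistors.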
IV. APPLICATION EXAMPLES
An interesting application for the analog artificial neurons consists in the study of
the thalamus relay structure. This small neural network is located in the thalamus, often
called the gateway to the cerebral cortex, and acts as an interface transmitting information
from the optic nerve to the cortex. It comprises two populations of neurons, called
TC for the relay thalamo-cortical cells, and nRt for the reticular thalamic neurons. An
interesting function of that intra-thalamic network is that it selectively controls the flow of
visual information from the retina, during the various states of the sleep-wake cycle and
arousal [13]. In a wake or attentiveness state, the neurons of the network are depolarized
and present a general tonic activity, whereas during the sleep phase, synchronized
oscillations (called spindle waves) appear, which are the result of reciprocal synaptic
interaction between nRt and TC cells [14]. The study of these interactions is important for
understanding the mechanisms of the transitions from sleep to waking, and more
generally for explaining how the thalamocortical system may control the state of activity
of the brain.
Figure 3
Modeling a TC neuron's different activity states, using a ramp stimulation current.
A: bursting activity (delta waves). B: bursting activity vanishes.
C: in a depolarization phase, the neuron presents action potentials.
The ramp stimulation allows the visualization of the different states of activity of the cell:
first, a bursting activity where low-kinetic currents are activated; second, a silent phase
when the neuron gets depolarized; third, a tonic phase of simple action potentials for a
high stimulation level. Those three types of activity are characteristic of the TC cell, and
are necessary for the efficiency of the TC-nRt loop, as will be illustrated in the next
example.
The neurophysiological experimental approach for studying the thalamus relay
structure is quite complicated: in in vitro preparations of vertebrate thalamus slices,
due to slice thickness, the synaptic connections between the neurons are generally cut,
and are then difficult to characterize. The effect of the TC-nRt loop on a visual stimulation is
then impossible to evaluate. A hybrid experiment solves that problem by
artificially reconstructing the thalamus relay structure. In the experiment presented in
Figure 4, the TC cell is a living one, impaled in an in vitro preparation of a thalamus slice
of a vertebrate animal. The nRt cells and the synaptic connections are artificial,
constructed with our analog simulation system; the TC->nRt synapse is an excitatory one,
whereas the nRt->TC synapse has an inhibitory effect. The experiment intends to prove
that this synaptic combination is a key point in obtaining the production of bursts, which
are characteristic of the network's awake state.
V. CONCLUSION
Figure 4
Hybrid experiment with the analog artificial neurons, handling a thalamus relay network.
A: one synaptic connection is made, from TC to nRt. An external stimulation produces
only one spike. B: reciprocal synapses are connected. The network responds to the
stimulation with a bursts sequence.
REFERENCES
[1] A.L. Hodgkin and A.F. Huxley, A quantitative description of membrane current and
its application to conduction and excitation in nerve, Journal of Physiology, vol. 117, pp.
500-544, 1952.
[2] W. Softky, C. Koch, Single cell models, in M. Arbib, editor, The handbook of brain
theory and neural networks, pp. 879-884, MIT Press, Boston, MA, 1995.
[3] C. Koch and I. Segev, editors, Methods in neuronal modeling: from synapses to
networks, MIT Press, Cambridge, MA, 1989.
[4] M. Mahowald, R.J. Douglas, A silicon neuron, Nature, vol. 354, pp. 515-518, 1991.
[5] R.J. Douglas, M. Mahowald, A construction set for silicon neurons, in S.F. Zornetzer
et al., editors, Neural and Electronic Networks, pp. 277-296, Academic Press, Arlington,
1995.
[6] D. Dupeyron, S. Le Masson, Y. Deval, G. Le Masson and J.P. Dom, A BiCMOS
implementation of the Hodgkin-Huxley formalism, Proc. of MicroNeuro'96, Lausanne,
IEEE Computer Society Press, pp. 311-316, 1996.
[7] A. Laflaquière, S. Le Masson, G. Le Masson and J.P. Dom, Accurate analog VLSI
model of Calcium-dependent bursting neurons, International Conference on Neural
Networks (ICNN'97, Houston, Texas), 1997.
[8] S. Le Masson, A. Laflaquière, D. Dupeyron, T. Bal, G. Le Masson, Analog circuits
for modeling biological neural networks: design and applications, IEEE Transactions on
Biomedical Engineering, in press.
[9] G. Le Masson, S. Le Masson and M. Moulins, From conductances to neural networks
properties: analysis of simple circuits using the hybrid networks method, Progress in
Biophysics and Molecular Biology, vol. 64, pp. 201-220, 1995.
[10] R.W. Meech, Calcium-dependent activation in nervous tissues, Annual Review of
Biophysics and Bioengineering, vol. 7, pp. 1-18, 1978.
[11] G. Le Masson, E. Marder and L.F. Abbott, Activity-dependent regulation of
conductances in model neurons, Science, vol. 259, pp. 1915-1917, 1993.
[12] G. Le Masson, Stabilité fonctionnelle des réseaux de neurones: étude expérimentale
et théorique dans le cas d'un réseau simple, Thèse de l'Université Bordeaux 1, 1998.
[13] D.A. McCormick, T. Bal, Sensory gating mechanisms of the thalamus, Current
Opinion in Neurobiology, vol. 4, pp. 550-556, 1994.
[14] T. Bal, D.A. McCormick, Mechanisms of oscillatory activity in guinea-pig nucleus
reticularis thalami in vitro: a mammalian pacemaker, Journal of Physiology, vol. 486, pp.
669-691, 1993.
[15] A. Destexhe, A. Babloyantz, T. Sejnowski, Ionic mechanisms for intrinsic slow
oscillations in thalamic relay neurons, Biophysical Journal, vol. 65, pp. 1538-1552, 1993.
[16] T. Bal, M. von Krosigk, D.A. McCormick, Synaptic and membrane mechanisms
underlying synchronized oscillations in the ferret lateral geniculate nucleus in vitro, J. of
Physiology, vol. 483.3, pp. 641-663, 1995.
[17] M. von Krosigk, T. Bal, D.A. McCormick, Cellular mechanisms of a synchronized
oscillation in the thalamus, Science, vol. 261, pp. 361-364, 1993.
Neural Addition and Fibonacci Numbers
Valeriu Beiu *
RN2R LLC, 14850 Montfort Drive, Suite 181, Dallas, Texas 75240, USA
E-mail: vbeiu@rose-research.com
1 Introduction
In this paper we shall consider feedforward artificial neural networks for computing ad-
dition. Formally, a network is a graph having several input nodes, and some (at least
one) output nodes. If a synaptic weight is associated with each edge, and each node i
computes the weighted sum of its inputs to which a nonlinear activation function is then
applied (i.e., artificial neuron, or simply neuron):
f_i(x_i) = f_i(x_{i,1}, ..., x_{i,k}) = σ_i( Σ_{j=1}^{k} w_j x_{i,j} − θ_i ),   (1)

the network is a neural network (NN), with the synaptic weights w_j ∈ ℝ, the
threshold θ_i ∈ ℝ, k ∈ ℕ being the fan-in, and σ_i a non-linear activation function. If the
underlying graph is acyclic, the network does not have feedback connections, and can
be layered, being known as a multilayer feedforward neural network; it is commonly
characterised by two cost functions: depth (i.e., number of layers) and size (i.e., number
of neurons). We shall first discuss Fibonacci numbers, then briefly present known re-
sults for ADDITION, before introducing and proving the VLSI-optimality of an NN having
Fibonacci numbers as its weights. Conclusions and open questions end the pa-
per.
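A neuron of the kind defined by eqn (1), with a hard-limiting (threshold) activation, can be sketched directly. The example below is illustrative and not from the paper; it uses such a threshold gate to compute the carry-out of a full adder.

```python
# Sketch of a single neuron per eqn (1) with a hard-limiting (threshold)
# activation: output 1 iff the weighted sum of the inputs reaches the threshold.

def tg(weights, threshold, inputs):
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s >= threshold else 0

# Example: a fan-in-3 threshold gate computing the carry-out of a full adder,
# carry(a, b, c) = 1 iff a + b + c >= 2, with weights (1, 1, 1) and threshold 2.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            assert tg([1, 1, 1], 2, [a, b, c]) == (1 if a + b + c >= 2 else 0)
print("carry threshold gate verified")
```

Such threshold gates (TGs) are the building blocks whose weights the rest of the paper analyses.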
2 Fibonacci Numbers
Leonard of Pisa (Leonardo Pisano: 1170-1240) is better known by his nickname: Fi-
bonacci. This is short for filius Bonacci, which means son of Bonacci, and which may mean
"lucky son" (literally, "son of good fortune"). He played an important role in reviving
ancient mathematics and made significant contributions of his own. Liber Abaci (pub-
lished in 1202) introduced the Hindu-Arabic place-valued decimal system and the use
The author is 'on leave of absence' from the "Politehnica" University of Bucharest, Computer Science
Department, Spl. Independenței 313, RO-77206 Bucharest, Romania.
3 ADDITION
Historically, much attention has been paid to the tradeoff between delay (depth)
and number of gates (size), but later attention switched and focused on the VLSI
area complexity, by looking at how to connect the gates in simple and regular ways
to minimise it.
Some of the well-known adders built out of bounded fan-in AND-OR logic
gates (we use delay instead of depth, and gates instead of size, as in most of the
original articles) are shown in Table 1 (here n is the number of bits needed to rep-
resent one input).
It has also been proven that a depth-2 circuit of AND-OR logic gates for ADDITION
must have exponential size [33]. Some authors [12, 40] have formulated the problem
of minimising the latency in carry-skip and block carry-lookahead adders [14, 29]
as multidimensional dynamic programming. Others [23] have investigated implemen-
tations based on spanning trees. But on the whole a lot of effort has been devoted
to practical implementations [18, 23, 28, 40]. Out of these, at least two papers made
the remark that a way to reduce the number of logic levels (and correspondingly the
circuit latency) is to increase the fan-in [40], or, equivalently, to group more bits
[23]. But they mentioned that "no practical method for doing this has been pre-
sented in the literature." Still, some very interesting results using fan-ins larger than
two--building on [9, 20]--have been reported in [15, 25, 26, 34]. They mention
that increasing the fan-in affects the time performance of the circuitry in three dif-
ferent ways:
By convention, we consider ∧_i p_i = 1. One restriction is that the input variables are
pair-dependent, meaning that we can group the Δ input variables into Δ/2 pairs of two
input variables each: (g_{Δ/2−1}, p_{Δ/2−1}), ..., (g_0, p_0), and that in each such pair one
variable is "dominant" (i.e., when a dominant variable is 1, the other variable forming
the pair will also be 1). This can be explained if the generate and propagate variables
are defined as g_i = x_i ∧ y_i and p_i = x_i ∨ y_i. Because the Boolean functions from Step 3 and
Step 4 of Lemma 4 from [33] are IF_Δ functions, the depth-7 construction can immedi-
ately be shrunk to depth-5 by allowing threshold gates instead of AND-OR gates in the in-
termediate layers. This depth-5 TG circuit still has O(n log n) size [3, 4].
From Brent and Kung [9] it is known that the carry chain can be computed based
on an associative operator "∘" defined as:

(g, p) ∘ (g′, p′) = (g ∨ (p ∧ g′), p ∧ p′),   (G_i, P_i) = (g_i, p_i) ∘ (G_{i−1}, P_{i−1})   (4)

for 2 ≤ i ≤ n. It has been proven that c_i = G_i. In these equations, g_i is the "carry gener-
ate", p_i is the "carry propagate", G_i can be imagined as "a block carry generate" (also
known as "G-functions" or "triangles" [37]), and P_i can be imagined as "a block carry
propagate". The carry generate is computed as g_i = x_i ∧ y_i; for the carry propagate we
4 ADDITION Revisited
The TGs for implementing the f_Δ functions have as inputs generate and propagate val-
ues and output a group-generate value (for the group-propagate, an AND gate is the sim-
plest implementation). As an example, consider f_6 = g_2 + p_2 (g_1 + p_1 g_0), where
g_0 = C_in, g_1 = a_1·b_1, g_2 = a_2·b_2, p_1 = a_1 + b_1, and p_2 = a_2 + b_2. We will prove that the result-
ing functions:

C_out = g_k + p_k (g_{k−1} + ··· + p_2 (g_1 + p_1 g_0)···) = f_{2(k+1)}   (5)

are always linearly separable. In fact these are f_Δ functions without the restriction of
dominant input variables. We will show recursively that any such function (i.e. having
arbitrary fan-in Δ, see eqn (5)) can be implemented by one TG (having fan-in = Δ − 1):
- f_4 can be implemented by one TG;
- f_{Δ+2} can be implemented by one TG having the same weights as the TG imple-
menting f_Δ, by adding two weights and modifying the threshold.
Proposition 1. IF_Δ is a class of linearly separable functions without any restriction on
the input variables.
Proof. For Δ = 4, eqn (5) becomes:

f_4(g_1, p_1, g_0, p_0) = g_1 ∨ (p_1 ∧ g_0)

and from [16] it is known that this is a linearly separable function:

g_1 ∨ (p_1 ∧ g_0) = [2g_1 + p_1 + g_0 ≥ 2] = sgn(2g_1 + p_1 + g_0 − 2).   (6)
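Eqn (6) can be checked exhaustively; here sgn denotes the 0/1 hard limiter used throughout the proof (sgn(s) = 1 iff s ≥ 0).

```python
from itertools import product

# Exhaustive check of eqn (6): g1 | (p1 & g0) equals the threshold gate
# sgn(2*g1 + p1 + g0 - 2), where sgn(s) = 1 if s >= 0 else 0.

def sgn(s):
    return 1 if s >= 0 else 0

for g1, p1, g0 in product((0, 1), repeat=3):
    assert (g1 | (p1 & g0)) == sgn(2 * g1 + p1 + g0 - 2)
print("f4 is linearly separable with weights (2, 1, 1) and threshold 2")
```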
Refining eqn (5), we can determine the following recursive version (we increment
by 2, as Δ has to be even):

In this paper ⌊x⌋ is the floor of x, i.e. the largest integer less than or equal to x, and ⌈x⌉ is the ceiling
of x, i.e. the smallest integer greater than or equal to x, and all the logarithms are to base 2 (except
where otherwise explicitly mentioned).
a) The exact value for size is size(n, k) = 2k·log n·⌈n/(k·log n)⌉·(⌈n/(k·log n)⌉ − 1) + 8k·log n·⌈n/(k·log n)⌉, spanning
10n ≤ size < 2n² + 8n, while the fan-in, spanning 2·log n ≤ fan-in ≤ 2n, is: Δ(n, k) = max{2k·log n, 4(⌈n/(k·log n)⌉ + 2)}.
f_Δ = sgn( Σ_{i=0}^{Δ/2−1} v_i g_i + Σ_{i=0}^{Δ/2−1} w_i p_i + t_Δ ) = sgn(σ)   (8)
The worst case--due to the fact that all the weights are positive--is when all the
other input variables are 0. By substituting eqns (9), (10) and (11) into the pre-
vious equation we obtain:

f_{Δ+2} = sgn(v_{Δ/2} + t_{Δ+2}) = sgn(0) = 1.
- If g_{Δ/2} = 0 we have to analyse two cases.

First, suppose that p_{Δ/2} = 0. This makes f_{Δ+2} = 0 regardless of the other input
variables (see again eqn (7)). By replacing g_{Δ/2} = 0 and p_{Δ/2} = 0, eqn (12) be-
comes:

f_{Δ+2} = sgn( Σ_{i=0}^{Δ/2−1} v_i g_i + Σ_{i=0}^{Δ/2−1} w_i p_i + t_{Δ+2} )

and even if all the (other) input variables are 1, the value of t_{Δ+2} (see eqn (11)) is
large enough:

f_{Δ+2} = sgn( Σ_{i=0}^{Δ/2−1} v_i + Σ_{i=0}^{Δ/2−1} w_i − 1 − Σ_{i=0}^{Δ/2−1} v_i − Σ_{i=0}^{Δ/2−1} w_i ) = sgn(−1) = 0.
Second, suppose that p_{Δ/2} = 1. This is in fact the most complicated case. We re-
member that g_{Δ/2} = 0. In this case f_{Δ+2} = f_Δ (see eqn (7)). Starting again from eqn
(12), we replace g_{Δ/2} = 0 and p_{Δ/2} = 1:

f_{Δ+2} = sgn[ Σ_{i=0}^{Δ/2−1} v_i g_i + ( w_{Δ/2} + Σ_{i=0}^{Δ/2−1} w_i p_i ) + t_{Δ+2} ]
       = sgn[ ( Σ_{i=0}^{Δ/2−1} v_i g_i + Σ_{i=0}^{Δ/2−1} w_i p_i ) − 1 − Σ_{i=0}^{Δ/2−1} w_i ].
The first two sums are larger or smaller than −t_Δ iff f_Δ = 1, or respectively f_Δ = 0
(see eqn (8)). Let these two sums (i.e., between the round parentheses) be
−t_Δ + ε, with ε positive iff f_Δ = 1, and respectively negative iff f_Δ = 0 (see eqn (7)).
Then:

f_{Δ+2} = sgn[ (−t_Δ + ε) − 1 − Σ_{i=0}^{Δ/2−1} w_i ]

and replacing t_Δ as given by eqn (11):
f_{Δ+2} = sgn( Σ_{i=0}^{Δ/2−2} v_i + ε − w_{Δ/2−1} ).

Finally, we use eqn (10) to obtain:

f_{Δ+2} = sgn(ε) = f_Δ.

The fact that the recursion (eqn (7)) is verified concludes the proof. ∎
Proposition 2. The sequences of weights w_k and v_k are the even- and, respectively, odd-
indexed Fibonacci numbers: w_k = Fib_{2k} and v_k = Fib_{2k+1}.

Proof. The initial conditions show the following correspondence between w_0, v_0, w_1,
v_1 and the Fibonacci numbers:

Fib_0 = w_0 = 0,   Fib_1 = v_0 = 1,   Fib_2 = w_1 = 1,   Fib_3 = v_1 = 2.

Let us suppose that w_k and v_k are the even- and respectively odd-indexed Fibonacci
numbers. We will prove that v_k = Fib_{2k+1} and w_k = Fib_{2k} satisfy eqns (9) and (10).
Because v_k and w_k are Fibonacci numbers, eqn (9) becomes:

v_k = Fib_{2k+1} = 1 + Σ_{i=0}^{2k−1} Fib_i
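Both correspondences can be checked numerically. The sketch below verifies eqn (9) in the form v_k = Fib_{2k+1} = 1 + Σ_{i=0}^{2k−1} Fib_i, together with the companion identity Σ_{i=0}^{k−1} Fib_{2i+1} = Fib_{2k} (consistent with how eqn (10) is used in the proof); it is an illustration, not part of the paper.

```python
# Check that the even/odd-indexed Fibonacci numbers satisfy the identities
# used in the proof: v_k = Fib(2k+1) = 1 + sum(Fib(0)..Fib(2k-1)), and the
# sum of the odd-indexed weights v_0..v_{k-1} telescopes to w_k = Fib(2k).

def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

for k in range(1, 12):
    v_k = fib(2 * k + 1)                  # odd-indexed weights v_k
    w_k = fib(2 * k)                      # even-indexed weights w_k
    assert v_k == 1 + sum(fib(i) for i in range(2 * k))      # eqn (9)
    assert w_k == sum(fib(2 * i + 1) for i in range(k))      # used via eqn (10)
print("Fibonacci weight identities hold for k = 1..11")
```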
Proposition 2 implies that the weights are bounded by φ^{2k+1}/√5 (see eqn (3)).
Solutions with small fan-ins and small weights are of interest [5, 6, 13, 30, 32, 39]
because the area of a VLSI implementation is considered to be proportional to the
sum of the digits needed to represent the weights. The weights we have determined
are the smallest integers (by construction); therefore the solution is VLSI-optimal,
as it minimises the area.
5 Conclusions
The paper has presented a class of NNs computing the ADDITION of two binary numbers.
The interesting result is that the weights of such a VLSI-optimal solution are the Fibon-
acci numbers. Open questions remain as to why the weights are exactly the Fibonacci num-
bers, and whether such 'Fibonacci' TG circuits could compute other (useful) functions.
References
1. Alon, N., & Bruck, J. (1991). Explicit Construction of Depth-2 Majority Cir-
cuits for Comparison and Addition. IBM Technical Report RJ 8300 (75661).
San Jose, CA: IBM Almaden Research Center.
2. Alon, N., & Bruck, J. (1994). Depth-2 Threshold Logic Circuits for Logic and
Arithmetic Functions. Patent US 5357528.
3. Beiu, V., Peperstraete, J.A., Vandewalle, J., & Lauwereins, R., (1994a). Area-
Time Performances of Some Neural Computations. In P. Borne, T. Fukuda &
S.G. Tzafestas (Eds.): Proc. IMACS Intl. Symp. on Signal Proc., Robotics and
Neural Networks, Lille, France (pp. 664-668). Lille: GERF EC.
4. Beiu, V., Peperstraete, J.A., Vandewalle, J., & Lauwereins, R., (1994b). On
the Circuit Complexity of Feedforward Neural Networks. In M. Marinaro &
P.G. Morasso (eds.): Proc. Intl. Conf. on Artif. Neural Networks, Sorrento, It-
aly (pp. 521-524). Springer-Verlag.
5. Beiu, V., Peperstraete, J.A., Vandewalle, J., & Lauwereins, R., (1994c). Optimal
Parallel ADDITION Means Constant Fan-In Threshold Gates. In Proc. Intl.
Conf. on Technical Informatics, Timisoara (vol. 5, pp. 166-177). Timisoara:
Technical University of Timisoara Press.
6. Beiu, V. (1996). Constant Fan-In Discrete Neural Networks Are VLSI-Opti-
mal. In S.W. Ellacott, J.C. Mason, & I.J. Anderson (eds.): Mathematics of Neu-
ral Networks -- Models, Algorithms and Applications (pp. 89-94). Kluwer
Academic.
7. Beiu, V., & Taylor, J.G. (1996). On the circuit complexity of sigmoid
feedforward neural networks. Neural Networks, 9(7), 1155-1171.
8. Beiu, V. (1998). On the Circuit and VLSI Complexity of Threshold Gate COM-
PARISON. Neurocomputing, 19(1), 77-98.
9. Brent, R.P., & Kung, H.T. (1982). A Regular Layout for Parallel Adders. IEEE
Trans. on Comp., 31(3), 260-264.
10. Cannas, S.A. (1995). Arithmetic Perceptrons. Neural Computation, 7(1), 173-
181.
11. Chandra, A.K., Stockmeyer, L.J., & Vishkin, U. (1984). Constant Depth
Reducibility. SIAM J. Comput., 13(2), 423-439.
12. Chang, P.K., Schlag, M.D.F., Thomborson, C.D., & Oklobdzija, V.G. (1992).
Delay Optimization of Carry-Skip Adders and Block Carry-Lookahead Adders
Using Multidimensional Programming. IEEE Trans. on Comp., 41(8), 920-930.
13. Cotofana, S., & Vassiliadis, S. (1997). Low Weight and Fan-In Neural Net-
works for Basic Arithmetic Operations. In Proc. IMACS World Congress on
Sci. Comput., Modelling and Appl. Maths. (vol. 4, pp. 227-232).
14. Doran, R.W. (1988). Variants of an Improved Carry Look-Ahead Adder. IEEE
Trans. on Comp., 37(9), 1110-1113.
15. Han, T., Carlson, D.A., & Levitan, S.P. (1987). VLSI Design of High-Speed,
Low-Area Addition Circuitry. In Proc. Intl. Conf. on Circuit Design (pp. 418-
422). IEEE Press.
16. Hu, S. (1965). Threshold Logic. Berkeley, Los Angeles: University of Califor-
nia Press.
17. Hwang, K. (1979). Computer Arithmetic: Principles, Architecture and Design.
New York: John Wiley & Sons.
18. Kelliher, T.P., Owens, R.M., Irwin, M.J., & Hwang, T.-T. (1992). ELM - A Fast
Addition Algorithm Discovered by a Program. IEEE Trans. on Comp., 41(9),
1181-1184.
Abstract
The fine-grain, data-driven parallelism shown by neural models such as the Boltzmann machine
cannot be implemented in an entirely efficient way either in general-purpose multicomputers
or in networks of computers, which are nowadays the most common parallel computer
architectures.
In this paper we present a parallel implementation of a modified Boltzmann machine
where the processors, with disjoint subsets of neurons allocated, asynchronously compute
the evolution of their neurons by using values that might not be updated for the remaining
neurons, thus reducing interprocessor communication requirements. An evolutionary
algorithm is used to learn the rules that allow the processors to cooperate by interchanging
the local optima that they find while concurrently exploring different zones of the Boltzmann
machine state space. Thus, the way the processors interact changes dynamically during
execution of the algorithm, adapting to the problem at hand. Good figures for speedup with
respect to the Boltzmann machine computation on a uniprocessor computer have been
experimentally obtained.
1. Introduction
The resolution of combinatorial optimization problems can greatly benefit from the parallel and
distributed processing which is characteristic of neural network paradigms. Nevertheless, although artificial
neural networks (ANNs) process the information in a distributed and massively parallel way, the fine grain
data-driven parallelism of neural models, such as the Boltzmann Machine (BM), cannot be implemented
in an entirely efficient way either in general-purpose multicomputers or in networks of computers, which
are nowadays the most common parallel computer architectures because they represent a good choice
in terms of cost/performance and scalability [1]. In these architectures, each processor has its own local
memory and contacts the other processors of the system through an interconnection network or a local
area network, thus corresponding to a coarse grained architecture with the memory distributed among the
processing nodes. As the cost of communication between the processing nodes is high, a
great volume of computation must be performed between subsequent communications in order
to achieve appropriate efficiency.
Some parallel implementations of BM in general-purpose multicomputers have been proposed [4-
6]. In the scheme described in [6], the neurons are distributed among the processors, and as each
processor computes the changes in its subset of neurons while considering the remaining to be clamped,
the set of processors searches in different and smaller zones of the solution space in order to find a local
optimum. The solution found is communicated by each processor to the others, and any processor
receiving this information uses it to guide its search within new subspaces where better solutions could be
found. This method can be included in the class of large-step optimization methods [11,12], which are
defined by a procedure to perform the local search, a procedure to perform the large-step transitions to
non-local solutions, and an accept/reject test. In the present case, the use of several processors and the
characteristics of the BM would make it possible to exploit the work done by the remote processors in order
to drive the large-step transitions of each processor and speed up the search. Several alternatives to allow
the processors to work cooperatively are analyzed and their performance detailed in [6]. Among the
proposed schemes, one of them is identified that allows the corresponding BM to converge to solutions
of high quality and which provides a high acceleration over the execution of the BM on uniprocessor
computers.
Nevertheless, it has been shown [2] that if an algorithm performs well on a particular class of
problems, it shows degraded performance on another class. This implies that each parallel Boltzmann
machine (corresponding to a particular optimization problem) would require a specific rule to allow the
cooperation between the processors in the machine with the best performance. Thus, it is more effective
to devise a procedure that automatically determines the best rule by adapting it dynamically while the
parallel Boltzmann machine is processed than to try to determine the best procedure beforehand, simply
because such a procedure might not exist. In this paper, we propose the use of a genetic algorithm to
learn, while the optimization procedure is running, the best way to improve the local solution by
extracting information from the solutions received from other processors. Thus, a hybrid optimization
procedure which mixes neural and evolutionary techniques is proposed.
Section 2 presents a parallel implementation of the BM in which the processors alternate
phases of local optimization with phases in which the local solutions are interchanged and used by each
receiver processor to update its own local solution according to an Update Rule (UR). The space of
possible Update Rules that allow the processors to work cooperatively through interactions is presented
in Section 3 along with the genetic procedures used to find the best one for a given problem and also to
reach an adaptive behaviour. Finally, Section 4 gives the conclusions of the paper.
Given the configuration S[m], the difference in the consensus, dC(i), when the state of neuron i
changes while the states of the remaining neurons are unchanged is

dC(i) = (1 - 2·S(i)) · ( Σj w(i,j)·S(j) + θ(i) )    (2)

and the change is accepted with probability

A(dC(i)) = 1 / (1 + exp(-dC(i)/T))    (3)

where T denotes the value of a control parameter usually called temperature. Equation (3) describes the
'heatbath' algorithm for a BM. It is also possible to use the Metropolis algorithm [3], in which a change in
a neuron with dC(i)>0 is always accepted, whereas a change with dC(i)<0 is accepted with probability
A(dC(i))=exp(dC(i)/T). In a minimization problem, a change with dC(i)<0 is always accepted, whereas if
dC(i)>0, the change is accepted with probability A(dC(i))=exp(-dC(i)/T).
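The two acceptance rules can be sketched as follows. This is a standard reconstruction after Aarts and Korst [3] rather than the authors' code; the function names are ours, and consensus is being maximised:

```python
import math
import random

def accept_heatbath(dC, T, u=None):
    """Heatbath rule: accept the change of neuron i with probability
    1 / (1 + exp(-dC/T)), whatever the sign of dC."""
    u = random.random() if u is None else u
    return u < 1.0 / (1.0 + math.exp(-dC / T))

def accept_metropolis(dC, T, u=None):
    """Metropolis rule: a change with dC > 0 is always accepted; a change
    with dC < 0 is accepted with probability exp(dC / T)."""
    if dC >= 0:
        return True
    u = random.random() if u is None else u
    return u < math.exp(dC / T)
```

At high temperature T almost every change is accepted; as T decreases, consensus-lowering changes become increasingly unlikely, which is what lets the machine escape local optima early and settle later.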
The BM has been implemented in both general-purpose and special-purpose parallel architectures
[4,5]. These implementations are able to speed up the evolution of the BM with respect to the number of
neurons, but due to synchronization and communication requirements, the use of several processors is
not efficient when either the neurons are highly interconnected or it is not possible to find an adequate
clustering of neurons that reduces the communications among the processors where the BM has been
allocated. In [6], a procedure with low communication and synchronization requirements is proposed, in
order to take advantage of the availability of local networks of general-purpose architectures. The goal of
this procedure is similar to that of [7], which analyzes the effect of reducing communications in a parallel
implementation of the simulated annealing algorithm, or of [8], which considers the possibility of obtaining
sufficiently good solutions for NP-complete problems with an acceleration greater than the number of
processors used.
Figure 1. (a) Distribution of the neurons among the processors and (b) functional
blocks implemented by a given processor
Thus, in the parallel BM implementation presented in [6], the neurons of the BM are distributed
among the P processors (k=1,2,..,P) as shown in Figure 1.(a) using as an example a BM with eight neurons
and four processors. The neurons associated with processor k are called the neurons of k. Nevertheless, to
compute the changes in its neurons (using (2)), a processor also needs the values of all the neurons
connected to it. Thus, each processor k stores in its local memory a local configuration denoted as
Sk[mk]=(Sk(1),Sk(2),..,Sk(N)), where Sk(j) is the state of neuron j in the local configuration of processor k. In
this way, the local configuration of processor k has two kinds of components, those corresponding to the
neurons of k and those corresponding to the neurons assigned to the remaining processors, called the
remote neurons of k. When required, the components of a given local configuration are noted with
superindices. Thus, Sk^q(i) (q=1,2,..,P) refers to the state, in the local configuration of processor k, of neuron
i (which is a neuron of processor q). If q=k, Sk^k(i) is the state of neuron i, which is one of the neurons of
processor k; otherwise i is a remote neuron of k. As we will see, the values Sk^q(i) at a given instant do not
necessarily coincide with Sq^q(i); thus they are not necessarily updated.
The P processors implementing the evolution of the parallel Boltzmann machine interact at some
instants and alternate two processing phases or steps, as indicated in the algorithm shown in Figure 2,
whose functional blocks are provided in Figure 1(b). In the first phase (Step 1), each processor k evolves
(as a sequential Boltzmann machine) by changing in the local configuration, Sk[mk], only its subset of
neurons (values Sk^k(i)) and keeping its remote neurons (values Sk^q(i), with q≠k) clamped and acting as
parameters. In this way, in Step 1 each processor searches within a reduced subspace defined by the
clamped states of its remote neurons. In the second phase (Step 2), the processor receives remote
configurations from other processors and interacts with them through an Update Rule (UR) that defines
the way in which the remote neurons of k (values of Sk^q(i), with q≠k, in Sk[mk]) are modified according to
the states of the neurons of Sq[mq], which come from processor q. In this step, the local neurons in Sk[mk]
are clamped while the remote neurons in Sk[mk] change according to their state in the remote configuration
and the UR used. Each processor takes advantage of the search carried out by the remote processors in
their Step 1 through an interaction among configurations, which takes place in a given processor according
to the UR used. This should produce a diversification of the local configurations, impelling each processor
towards a different solution subspace, which is explored afterwards in a new Step 1.
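The two phases can be sketched as follows. This is a hypothetical illustration of the scheme of Figure 2, not the authors' code: local_search, receive_configurations and update_rule stand for the operations described in the text.

```python
def parallel_bm_iteration(local_conf, mine, local_search,
                          receive_configurations, update_rule):
    """One iteration of the parallel BM on processor k.

    local_conf : dict neuron -> state, the local configuration Sk[mk]
    mine       : set of neurons owned by this processor (the neurons of k)
    """
    # Step 1: evolve only the owned neurons; remote neurons stay clamped
    # and act as parameters of the reduced search subspace.
    local_conf = local_search(local_conf, free=mine)

    # Step 2: the local neurons are clamped while the remote components of
    # Sk[mk] are modified, through the Update Rule, according to the
    # configurations received from the other processors.
    for remote_conf in receive_configurations():
        for i, state in remote_conf.items():
            if i not in mine:
                local_conf[i] = update_rule(local_conf[i], state)
    return local_conf
```

A usage example with three neurons, of which processor k owns only neuron 0, shows that only the remote components change in Step 2.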
The four URs of Table 1 correspond to different alternatives according to the usual evolution of a
sequential Boltzmann machine (other URs can also be defined, as shown in Section 3). In [6], experimental
results are provided and explained for these four rules and different examples of BMs with up to 1024
neurons. Some URs provide good solutions with a reduced number of iterations, thus being able to achieve
a significant speedup with respect to the sequential execution.
The speedup (S) that may be attained with this scheme can be evaluated from the following
expression

S = (Niter^s · Tstep^SEQ(N)) / (Niter^p · (Tstep1^P(N,P) + Tstep2^P(N,P)))    (4)

where Tstep^SEQ(N) is the time required by each iteration in the sequential execution (SEQ) of a Boltzmann
machine with N neurons; Tstep1^P(N,P) is the time per iteration of Step 1 of the parallel scheme when its N
neurons are distributed among P processors; and Tstep2^P(N,P) is the time required by Step 2. The values
Niter^s and Niter^p correspond, respectively, to Niter for the sequential and the parallel schemes.
The value of Tstep^SEQ(N) is proportional to the length of the Markov chain, L, used at each
temperature. In our case, we have used L=N as in [3], so Tstep^SEQ(N)=N·tcomp, with tcomp being the time required
to compute a transition in a neuron. The value of Tstep1^P(N,P) is equal to the product Llocal·tcomp, and as Llocal
has been assumed to be proportional to (N/P), Llocal=Kstep1·(N/P), thus Tstep1^P=Kstep1·(N/P)·tcomp. Finally,
Tstep2^P(N,P) depends on the time required for the communication between the interacting processors, and on the time
required to compute the interaction, which is also considered proportional to N/P. Thus,
Tstep2^P(N,P)=Inter(P)·{(N/P)·tcomp+tcomm(P)}, where Inter(P) is the number of processors interacting in Step 2.
In this way,

Tstep2^P(N,P)=(N/P)·tcomp·Kstep2(N,P)

where Kstep2(N,P)=(1+(P/N)·(tcomm(P)/tcomp))·Inter(P), and the efficiency (E=S/P) can be expressed as:
E = S/P = (Niter^s / Niter^p) · 1 / (Kstep1(N,P) + Kstep2(N,P))    (5)
Whenever the communication can be overlapped with the computation, the value of Kstep2 can be
considered as approximately equal to Inter(P). Thus, for schemes such as UR4, where Niter^s/Niter^p does not
change with the number of processors P, if the complexity of Kstep1 and Kstep2 is less than P, the speedup
obtained increases with the number of processors. Moreover, whenever (Kstep1+Kstep2)<(Niter^s/Niter^p) is verified,
efficiencies higher than 1 are obtained (corresponding to superlinear speedups). If Inter(P) is set
proportional to the number of processors, the speedup tends to Niter^s/Niter^p as P grows.
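Expression (5) can be evaluated numerically. The helper below is an illustrative rendering of the model (variable names follow the text; the parameter values in the example are invented, not measured):

```python
def efficiency(niter_seq, niter_par, k_step1, N, P, t_comm, t_comp, inter):
    """Efficiency E = S/P of expression (5), with
    Kstep2(N,P) = (1 + (P/N) * (t_comm/t_comp)) * Inter(P)."""
    k_step2 = (1.0 + (P / N) * (t_comm / t_comp)) * inter
    return (niter_seq / niter_par) / (k_step1 + k_step2)

# Invented example: cooperation cuts the iteration count fourfold and the
# communication time is negligible, giving a superlinear efficiency E > 1.
E = efficiency(niter_seq=20, niter_par=5, k_step1=1.0,
               N=64, P=8, t_comm=0.0, t_comp=1.0, inter=1)
print(E)  # prints 2.0
```

The superlinear regime appears exactly when the iteration-count ratio outweighs the sum Kstep1 + Kstep2, as stated in the text.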
Figure 3 shows the average efficiencies experimentally obtained by running different Boltzmann
machines (with UR4) that correspond to the vertex cover problem applied to randomly selected graphs of
up to 256 nodes. For clarity, Figure 3 only provides the results for N=64 and N=256. The experiments were
carried out in a PARAMID multicomputer, with nodes based on the TTM200 board equipped with an Intel
i860/XP processor, 16 Mbytes of memory, and a T805 Transputer with four bidirectional links and 4 Mbytes
of memory. Up to eight processors were available for our experiments. As shown in Figure 3, it is possible
to obtain efficiencies higher than one (superlinear speedups). For N=64 and P=2, the experimental data
present a high deviation with respect to expression (5). This can be explained by taking into account
that, as the number of processors is low, the effect of cooperation among processors also decreases, with
a corresponding reduction in Niter^s/Niter^p for this small number of neurons.
Table 1. Rule / Description
three alternatives for (b): (b1) the possible change is applied if the bit in the local solution and the
corresponding bit in the received solution are different; (b2) it is applied if they are equal; and (b3) it is
applied irrespective of whether they are equal or not. Finally, five possibilities are considered for (c): (c1)
the value of a selected bit is always changed (the sign of the cost change is not taken into account); (c2)
it is only changed if this change decreases the cost computed with expression (2); (c3) it is only changed
if the cost decreases according to the approximate expression (used in UR4 in Table 1); (c4) and (c5) are
used in UR3 and UR4, respectively, thus being similar to (c2) and (c3), except that changes that,
although determining a cost increase, can also be accepted with the probability given by expression (3).
In this way, associated with each local solution there are seven bits, (g1,g2,g3,g4,g5,g6,g7), that codify the
program implementing the UR. Bits g1 and g2 codify the four possibilities for step (a), bits g3 and g4 the
three possibilities for step (b), and bits g5, g6, and g7 the five possibilities for step (c).
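A possible decoding of this 7-bit representation is sketched below. This is our own reading: how the paper maps bit patterns onto alternatives, and how it treats the bit patterns that fall outside the valid ranges, is not stated.

```python
def decode_ur(g):
    """Map a 7-bit UR code (g1..g7) to 1-based alternative indices (a, b, c).
    g1,g2 select among the 4 alternatives for (a); g3,g4 among the 3 for (b);
    g5,g6,g7 among the 5 for (c). Patterns outside the valid ranges are
    treated as invalid here (an assumption)."""
    g1, g2, g3, g4, g5, g6, g7 = g
    a = 2 * g1 + g2              # 0..3, always valid
    b = 2 * g3 + g4              # valid when 0..2
    c = 4 * g5 + 2 * g6 + g7     # valid when 0..4
    if b > 2 or c > 4:
        return None
    return (a + 1, b + 1, c + 1)

# 4 * 3 * 5 = 60 distinct Update Rules, the figure quoted in the text.
```

Counting the valid codes over all 128 bit patterns recovers the 60 possible URs mentioned later in the paper.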
Figure 3. Experimental efficiencies obtained in the PARAMID multicomputer compared with the theoretical ones
(equation (5) in [6]).
A genetic algorithm has been devised, by which each processor uses a population of URs defining
the different ways the local solution could change when a solution is received from a remote processor.
This procedure is included in the optimization algorithm and is applied while the optimization is running
(Figure 5); it is called the On-Line (ONL) learning procedure. In the procedure
Evaluate_Fitness_Update_Rule() in Figure 5, the fitness of UR_ji of the population is obtained from the
reduction in the consensus function after applying a small number of iterations of the local optimization
procedure to the solution provided by the corresponding UR_ji. In each generation, the half of the population of
URs with the better fitness values is selected for the new generation. The second half of the population
for the next generation is obtained by applying the two-point crossover and the bit-flip mutation operators
to the half of the URs previously selected. The procedure Selection_Mutation_Crossover() in Figure 5
applies these transformations. The best solution obtained during Step 2 is used to start the next iteration
of Step 1. Thus, after each iteration of the parallel Boltzmann machine (Step 1 plus Step 2), the population
of Update Rules in each processor, and so the way the processors cooperate, evolves and is improved
at the same time as the optimization procedure proceeds.
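One generation of this scheme might look as follows. This is an illustrative sketch: beyond "the better half survives, two-point crossover, bit-flip mutation with probability 0.1", the operator details are our assumptions.

```python
import random

def next_generation(population, fitness, rng=None):
    """Produce the next population of 7-bit UR codes: the better half is
    kept, and the other half is rebuilt by two-point crossover of selected
    parents followed by bit-flip mutation with probability 0.1."""
    rng = rng or random.Random(0)
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[:len(population) // 2]

    children = []
    while len(survivors) + len(children) < len(population):
        p1, p2 = rng.sample(survivors, 2)
        i, j = sorted(rng.sample(range(1, 7), 2))   # two crossover points
        child = list(p1[:i] + p2[i:j] + p1[j:])
        for k in range(len(child)):                 # bit-flip mutation
            if rng.random() < 0.1:
                child[k] ^= 1
        children.append(tuple(child))
    return survivors + children
```

Since the population size per processor is fixed (10 codes in the experiments), one generation costs the same regardless of the number of neurons, which is why this learning step does not affect the scalability of the optimization.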
Table 2 summarizes the experimental results obtained. The set of weight matrices of Table 2 (Boltz.
Mach. column) corresponds to different levels of connection between neurons and weight magnitudes,
with two different randomly selected matrices for each set of conditions. Thus, in the code mX_YYY.ZZ in
Table 2, YYY indicates that the weights take values between -YYY.0 and +YYY.0; ZZ means that there is
a probability of 0.ZZ for two given neurons to be connected; and X is an index which identifies different
randomly selected weight matrices with the same values for YYY and ZZ. The EXH column of Table 2
shows, for each weight matrix, the costs of the minimum found by the parallel procedure of Figure 2 when
the best UR, found by an exhaustive search, is applied. The code describing the UR is given in brackets,
[ai,bi,ci], and the number of iterations, Niter, required to obtain a solution which is less than 1% worse than
the best solution found, appears in parentheses, (Niter). Results are provided considering 8 and 16
processors (column Proc. in Table 2) and N=64 neurons. The asterisk means that the best solution found
is more than 1% worse than the optimum.
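A generator matching this naming convention might look as follows. This is an illustrative reading of the convention; the authors' exact generation procedure is not given in the paper.

```python
import random

def make_matrix(N, YYY, ZZ, seed=0):
    """Symmetric weight matrix for the code mX_YYY.ZZ: any two distinct
    neurons are connected with probability 0.ZZ, and each existing weight
    is drawn uniformly from [-YYY.0, +YYY.0]; the seed plays the role of
    the index X."""
    rng = random.Random(seed)
    w = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(i + 1, N):
            if rng.random() < ZZ / 100.0:
                w[i][j] = w[j][i] = rng.uniform(-float(YYY), float(YYY))
    return w

# m1_025.25 with N=64: weights in [-25.0, +25.0], connection probability 0.25.
w = make_matrix(64, 25, 25, seed=1)
```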
An exhaustive search has been used to determine the UR that, applied during the whole
optimization process by all the processors, provides the best figures of solution quality and convergence
speed. As the number of possible URs in this case is only 60, it is possible to find the best one for each
Boltzmann machine by analyzing the performance of every rule. The EXH column of Table 2 shows, for
each weight matrix, the costs of the minimum found when the UR determined by the indicated exhaustive
search is applied. The code describing the UR is given in brackets, [ai,bi,ci].
As is to be expected from [2], the optimal UR obtained depends on the characteristics of the weight
matrix, although some alternatives occur more frequently than others, and there are alternatives, such as
(a3), (a4), (b2), (c1) and (c3), that do not appear in the EXH column.
Table 2. Cost of the solution obtained, (number of iterations, Niter) and [Update Rule]

Boltz. Mach. | Proc. | EXH Procedure            | ONL Procedure | Random Sel. UR
m1_025.25    |   8   | -1019.75 (6) [a1,b1,c4]  | -1019.75 (5)  | -1019.75 (22)
             |  16   | -1019.75 (5) [a1,b1,c5]  | -1019.75 (5)  | -1019.75 (21)
m2_025.25    |   8   | -1113.10 (7) [a1,b1,c5]  | -1113.10 (6)  | -1113.10 (16)
             |  16   | -1102.46 (6) [a2,b1,c5]  | -1107.32 (7)  | -1087.35 (12)*
m1_025.80    |   8   | -2039.25 (5) [a1,b1,c2]  | -2044.25 (5)  | -2044.25 (7)
             |  16   | -2043.25 (5) [a1,b1,c2]  | -2044.25 (5)  | -2044.25 (19)
m2_025.80    |   8   | -2331.72 (6) [a1,b1,c5]  | -2333.10 (7)  | -2333.10 (10)
             |  16   | -2327.69 (5) [a1,b1,c2]  | -2333.10 (7)  | -2327.69 (14)
The ONL Procedure column of Table 2 gives the cost of the best solution found and the value of
Niter obtained for the different weight matrices, with a population of 10 UR codes in each processor
and a mutation probability of 0.1. The Random Sel. column shows the values of the cost and Niter when
a randomly selected UR is used. Comparing these two columns, it is clear that the ONL procedure
represents an improvement, providing better solutions with fewer iterations (a large reduction in Niter in
most cases).
From the ONL Procedure and EXH Procedure columns, it can be seen that the use of the Update
Rule obtained by the exhaustive search procedure allows a reduction in Niter in most cases, although
sometimes Niter is slightly higher than the value corresponding to the use of the ONL procedure. However,
except for m1_100.80 and m2_100.80 with 16 processors, this reduction is not very important, and in most
cases the differences in Niter are similar to the experimental error (+/-1 iteration). With respect to the quality
of the solutions obtained, the ONL procedure provides similar or better solutions in all cases except for
m2_100.80 with 8 processors. Figure 4 shows the evolution of the local solution found by a processor for
each of the three cases considered in Table 2, and four different weight matrices.
These results are understandable, remembering that the ONL procedure allows the use of different
URs in each processor and iteration, which represents a more general situation with respect to the use of
only one UR in all the processors during the whole optimization procedure. Indeed, in the execution of the
ONL procedure, it has been observed that the processors use populations of URs with different individuals
across different processors.
(Panels include m2_100.25 (8 Proc.) and m1_100.80 (8 Proc.), plotting cost against iterations.)
Figure 4. Cost vs. iterations in a processor, when using the ONL procedure (solid line), the UR obtained by the
exhaustive search procedure (dash-dot line), and a randomly selected UR (dashed line).
4. Conclusions
The cooperation of several processors can improve the resolution of combinatorial optimization
problems by using parallel computer architectures. The parallel Boltzmann machine implementation
considered uses independent search processes implemented by the processors in different solution space
zones and the interaction among processors exchanging solutions, in order to take advantage of the work
done by the other processors. The problem is to determine the way a processor extracts information from
the solutions received. In this paper we propose an evolutionary strategy to learn the interaction between
processors.
The proposed strategy allows an efficient implementation of Boltzmann machines in coarse-grain
parallel computer architectures. It is based on the procedure in Figure 2 [6], in which each processor
alternates two computation phases (Step 1 and Step 2). In Step 1, the processor improves the consensus
of the Boltzmann machine by only changing the states of the neurons assigned to it while its remote
neurons are considered to be clamped. Thus, these clamped neurons act as parameters or constraints,
which are updated in Step 2 through interactions between the processors according to a UR that guides
the optimization process by taking into account the work done by the remote processors in their
corresponding Step 1. The procedure was modified by including a genetic algorithm which operates at the
same time as the execution of the parallel optimization procedure, as shown in Figure 5. Thus, each
processor uses a different population of URs, which can change dynamically, to cooperate with the other
processors. This population of rules changes during the optimization process.
The complexity of the optimization procedure when the number of neurons grows is not increased
by this evolutionary selection of URs, because the number of individuals in the population of rules
associated with each processor does not change with the number of neurons.
An exhaustive search procedure has also been applied to determine the best UR to use.
Nevertheless, it has been proven [2] that for any optimization procedure, the high-quality performance
obtained when it is applied to a given class of problems is balanced by poor results in another class. Thus,
the EXH column in Table 2 provides different optimal URs for different problems.
The results provided in this paper show that the proposed evolutionary computation method
performs well in both convergence speed and quality of the solutions obtained. Although this paper is
devoted to describing a parallel implementation of a Boltzmann machine, similar evolutionary strategies can
be applied to allow processor cooperation in other parallel combinatorial optimization methods. For
example, in [10] a procedure is proposed in which multiple populations evolve independently by a genetic
algorithm. Each population determines local solutions that represent components of the global solution and
which are combined to build the whole solution by assigning a credit to each local solution, according to
how well it collaborates in achieving such a global solution. In the procedure proposed here, each
processor determines a local optimum through Step 1 and then this local solution is combined in Step 2
with the solutions coming from other processors. The matrices of weights used here to obtain the
experimental results correspond to situations in which there is a high number of interconnections among
the neurons assigned to different processors. In this way, the procedure proposed here is tested by using
optimization problems with many interdependent variables and highly multimodal cost functions, thus
corresponding to problems that are more difficult to solve than others considered in works where the
variables of the function to be optimized are reasonably independent.
Acknowledgements. This paper has been supported by project TIC97-1149 (CICYT, Spain).
References
[1] Anderson, T.E.; Culler, D.E.; Patterson, D.A.; and the NOW team: "A Case for NOW (Networks of
Workstations)". IEEE Micro, pp.54-64. February, 1995.
[2] Wolpert, D.H.; Macready, W.G.: "No Free Lunch Theorems for Optimization". IEEE Trans. on
Evolutionary Computation, Vol.1, No.1, pp.67-82. April, 1997.
[3] Aarts, E.H.L.; Korst, J.H.M.: "Simulated Annealing and Boltzmann Machines". Wiley, 1988.
[4] Oh, D.H.; Nang, J.H.; Yoon, H.; Maeng, S.R.:"An efficient mapping of Boltzmann Machine
computations onto distributed-memory multiprocessors". Microprocessing and Microprogramming,
Vol. 33, pp.223-236, 1991/92.
[5] De Gloria, A.; Faraboschi, P.; Ridella, S.: "A dedicated Massively Parallel Architecture for the
Boltzmann Machine". Parallel Comp., Vol.18, No.1, pp.57-75, 1993.
[6] Ortega, J.; Rojas, I.; Diaz, A.F.; Prieto, A.: "Parallel Coarse Grain Computing of Boltzmann
Machines". Neural Processing Letters, Vol.7, No.3, pp.1-16, 1998.
[7] Hong, C.-E.; McMillin, B.M.: "Relaxing Synchronization in Distributed Simulated Annealing". IEEE
Trans. on Parallel and Distributed Systems, Vol.6, No.2, pp.189-195. February, 1995.
[8] Pramanick, I.; Kuhl, J.G.: "An Inherently Parallel Method for Heuristic Problem-Solving: Part I -
General Framework". IEEE Trans. on Parallel and Distributed Systems, Vol.6, No.10, pp.1006-
1015. October, 1995.
[9] Zissimopoulos, V.; Paschos, V.T.; Pekergin, F.: "On the approximation of NP-complete problems
by using the Boltzmann Machine method: The cases of some covering and packing problems".
IEEE Trans. on Computers, Vol.40, No.12, pp.1413-1418. December, 1991.
[10] Potter, M.A.; De Jong, K.A.: "A Cooperative Coevolutionary Approach to Function Optimization". In
Third Conference on Parallel Problem Solving from Nature, Y. Davidor and H.P. Schwefel (Eds.),
Lecture Notes in Computer Science, Vol.866, Springer-Verlag, pp.249-257, 1994.
[11] Martin, O.; Otto, S.W.; Felten, E.W.: "Large-step Markov chains for the TSP incorporating local
search heuristics". Operations Research Letters, 11, pp.219-224. May, 1992.
[12] Lourenço, H.R.: "Job-shop scheduling: Computational study of local search and large-step
optimization methods". European J. of Operational Research, 83, pp.347-364, 1995.
Adaptive Brain Interfaces
(a) Joint Research Centre of the EC, 21020 Ispra (VA), Italy. E-mail: jose.millan@jrc.it
(b) Helsinki University of Technology, PO Box 9400, 02015 HUT, Finland
(c) di Riabilitazione S. Lucia, Via Ardeatina 306, 00179 Roma, Italy
(d) Fase Sistemi Srl, Via Ildebrando Vivanti 12, 00144 Roma, Italy
Abstract. This paper presents first results of an Adaptive Brain Interface suitable
for deployment outside controlled laboratory settings. It robustly recognizes
three cognitive mental states from on-line spontaneous EEG signals and
may have them associated to simple commands. Three commands allow interacting
intelligently with a computer-based system through task decomposition.
Our approach seeks to develop individual interfaces, since no two people are the
same either physiologically or psychologically. Thus the interface adapts to its
owner as its neural classifier learns user-specific filters.
1 Introduction
Physiological studies indicate that EEG signals are a reliable mirror of mental activity.
In addition, the combination of EEG, MRI and PET is providing gradually better
maps of brain functions (i.e., which cortical areas are responsible for specific mental
activities). Thus, it is quite appealing to try to use EEG signals as an alternative means
of interaction with computers. This paper describes a recent European research effort
whose objective is to build Adaptive Brain Interfaces (ABI) suitable for deployment
outside controlled laboratory settings. The immediate application is to extend the ca-
pabilities of physically-disabled people (e.g., select items from a computer screen,
explore virtual worlds, or guide a motorized wheelchair).
We aim to recognize from three to five mental states (e.g., relaxation, visualization,
music composition, arithmetic, verbal) from on-line spontaneous EEG signals by
means of artificial neural networks, and to associate them with simple commands such
as "move wheelchair straight", "turn left" and so on. Thus, users will be able to operate
computer-based systems by composing sequences of these patterns.
An ABI requires users to be conscious of their thoughts and to concentrate sufficiently
on the few mental tasks associated to the commands. Any other EEG pattern
different from those corresponding to these mental tasks will be associated with the
command "nothing", which has no effect on the computer-based system.
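This mapping from classifier outputs to commands can be sketched as follows. The sketch is purely illustrative: the classifier interface, the command names and the confidence threshold are our assumptions, not the ABI prototype's actual values.

```python
def to_command(probs, commands, threshold=0.8):
    """Return the command associated with the recognised mental state, or
    "nothing" when no trained mental task is recognised confidently.
    probs    : class probabilities produced by the neural classifier
    commands : the commands associated with the trained mental states"""
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] >= threshold:
        return commands[best]
    return "nothing"

commands = ["move wheelchair straight", "turn left", "turn right"]
print(to_command([0.92, 0.05, 0.03], commands))  # prints the first command
print(to_command([0.40, 0.35, 0.25], commands))  # prints "nothing"
```

The reject branch is what implements the "nothing" command: uncertain EEG patterns fall through it and leave the computer-based system untouched.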
The current ABI prototype is built upon the experience of the different partners in
the whole spectrum of areas covering the multidisciplinary nature of the project:
An obstacle to the achievement of the ABI project is the robust recognition of EEG
patterns outside laboratory settings. This presumes the existence of appropriate
EEG equipment that is compact, easy-to-use, and suitable for deployment in
real-world environments. No commercial product fulfilling these requirements exists.
We have set up a first prototype for the acquisition of high-quality EEG signals. We
are also able to robustly recognize three cognitive mental states from on-line sponta-
neous EEG signals.
2 Related Work
In recent years several other research groups have begun to develop EEG-based brain
interfaces (BI). Several companies are also commercializing basic mind-controlled
devices. By basic devices we mean that they can only recognize two patterns or rely
on muscular activity.
Two groups are developing BIs based on the recognition of mental states associated
with motor activities. McFarland and Wolpaw's approach relies completely on user
training (users must control their mu rhythm on each brain hemisphere) and looks for
a fixed EEG pattern that should be present in a large majority of individuals (e.g., [7]).
The ABI project adopts an opposite approach: rather than putting all the training re-
quirements on the user, who has to learn to generate a fixed EEG pattern, it makes the
brain interface adapt to the user. This approach is partially followed by Pfurtscheller's
group who seeks to recognize the motor readiness potential generated while people are
planning movements. They are using artificial neural networks with the aim of devel-
oping universal BIs (e.g., [3]). That is, they gather EEG signals from a given number
of users in well-controlled laboratory conditions and learn a classification function that
should be valid for everybody. They have obtained good results with a few healthy
subjects, but there is no definite evidence that motor readiness potentials also occur
in motor-impaired people. Instead of using brain activities associated with motor-
related tasks, the ABI project also seeks to recognize cognitive mental tasks outside
controlled laboratory settings. Anderson's group is also using artificial neural networks
to build universal BIs (e.g., [1]). They use pre-recorded EEG signals and try to derive
invariant information from cognitive tasks. Their results, however, are mixed.
One of our concerns is the acquisition of high-quality EEG signals by means of robust
and easy-to-use equipment suitable for deployment outside controlled laboratory
environments. To this end, we have built a first prototype (see Fig. 1). The EEG sys-
tem consists of a standard PC running LabVIEW and C++, a commercial signal acqui-
sition board, a cap with integrated electrodes, and dedicated hardware for the acqui-
sition of EEG signals. This hardware is a stand-alone, fully isolated, portable system
that gathers analog brain-wave voltages from up to eight scalp electrodes, amplifies
and filters them, converts them to digital values, and transmits them via the acquisition
board to the PC for analysis. This prototype is very easy to operate (healthy users can
run it without external assistance), which greatly reduces the preparation time for the
acquisition of good signals.
Figure 1 shows the current ABI prototype at work. In this picture, the user wears the
cap with integrated electrodes located according to the International 10-20 system.
Eight of these electrodes are directly plugged into amplifiers before sending the sig-
nals to the dedicated hardware (left). On the computer screen one can see two of the
bipolar EEG signals being processed (left windows), their corresponding power spec-
tra (top right window), and a circle (bottom right) indicating that one of the mental
tasks has been recognized.
EEG potentials are measured on the 8 channels F3, F4, C3, C4, P3, P4, O1, and O2,
with a reference electrode located between Fz, Fp1, and Fp2. Ground is applied to one
of the ear lobes. The sampling rate is 128 Hz and data is preprocessed in temporal
windows of half a second. This preprocessing consists of a Hanning windowing, a
Butterworth bandpass filtering (4-30 Hz), on-line removal of temporal windows cor-
rupted by ocular artifacts, and computation of either the energy of 5 differential chan-
nels (F3C3, C3P3, F4C4, C4P4, O1O2) or the coherence between 10 pairs of channels (6
intra- and 4 inter-hemispheres). The energy or coherence features are fed to the neural
classifier. In this paper we only report experiments with energy features.
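The preprocessing pipeline described above can be sketched as follows. The sampling rate, half-second windows, Hanning windowing, 4-30 Hz Butterworth bandpass, and per-channel energy computation come from the text; the filter order, the use of zero-phase filtering, and the scaling are illustrative assumptions, and the artifact-rejection step is omitted.

```python
import numpy as np
from scipy import signal

FS = 128             # sampling rate (Hz), as in the text
WIN = FS // 2        # half-second windows: 64 samples

def preprocess_window(window):
    """Compute energy features for one half-second EEG window.

    `window` is a (64, 5) array holding the five bipolar differential
    channels (F3C3, C3P3, F4C4, C4P4, O1O2). Filter order and scaling
    are assumptions; the paper does not report them.
    """
    # 4-30 Hz Butterworth bandpass (order 4 is an assumption)
    b, a = signal.butter(4, [4 / (FS / 2), 30 / (FS / 2)], btype="band")
    filtered = signal.filtfilt(b, a, window, axis=0)
    # Hanning windowing to taper the segment edges
    tapered = filtered * np.hanning(WIN)[:, None]
    # one energy value per differential channel
    return np.sum(tapered ** 2, axis=0)

# Example: a random 5-channel half-second segment
rng = np.random.default_rng(0)
features = preprocess_window(rng.standard_normal((WIN, 5)))
print(features.shape)   # (5,) -- one energy feature per channel
```

The resulting 5-dimensional energy vector is what would be fed to the neural classifier.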
We want our experimental protocol to fit the real conditions in which users would
work. This means that recognition cannot rely on perfect synchronization, cannot
depend on external events, and must use short time windows. Another critical aspect
of the experimental protocol is the set of mental tasks to recognize (and differentiate
from each other). In this respect, we are considering a relatively large number of them,
consisting of both cognitive tasks (e.g., arithmetic) and motor-related ones (e.g.,
imagination of left-hand movement). Tasks are chosen so that the cortical areas
involved are quite localized and, in particular, so as to invoke hemispheric lateralisation.
From this list, each individual user selects the 3-5 tasks most comfortable for him/her.
The subject is seated and spontaneously concentrates on a mental task. The subject
performs the selected task for 10 to 15 seconds, choosing when to stop it and which
task to undertake next. Each recording session lasts about 5 minutes.
For the training and testing of the neural classifier, the subject informs an operator of
the task he/she will perform; the 2 seconds before and 2 seconds after this announcement
are then removed from the recording to eliminate the artifacts introduced by this
"communication".
The mental tasks used in the study reported in this paper are "relaxation", "cube
rotation", and "subtraction".² Relaxation is done with closed eyes, and all other tasks
with open eyes. Relaxation is used to switch the neural classifier on and off in order
to facilitate the recognition task. The rationale is that if users inform the ABI when
they do (or do not) intend to use it, the remaining desired mental tasks only have to be
distinguished from each other (and from relaxation), not from any possible back-
ground mental activity. An additional advantage is that users will probably concentrate
better on the associated mental tasks, as they will not be stressed while not operat-
ing the ABI (e.g., by trying to avoid thinking of those mental tasks). It follows that
users will be able to use the ABI for longer periods of time. Of course, relaxation must
still be distinguished from every other task.
² The tasks consist of relaxing, visualizing a three-dimensional cube rotating around one of
its axes, and performing successive subtractions by a fixed number (e.g., 64-3=61, 61-3=58,
58-3=55, ...), respectively.
4 Results
In this section we present the first results we have obtained with two users. For each of
them we have recorded 4 sessions to train and test the neural classifier. In particular,
one session is used for training, one for validation, and the remaining two for
testing. We have adopted this unusual splitting of the available data--where training is
done over one fourth of the data while generalization is tested over half the patterns--
to probe our approach under realistic conditions from the very beginning.
Recognizing mental states from on-line spontaneous EEG signals is a complex task
where we cannot expect recognition rates near 100%. But a practical ABI
does not require such high performance; on the contrary, it is our view that a
recognition rate between 70% and 80% suffices, provided the proportion of false
positives is insignificant (less than 2%). This is what we mean by robust recognition.
In other words, the neural classifier (almost) never mistakes one relevant pattern for
another (which, for example, would make the wheelchair move in the wrong direction),
but it may occasionally fail to recognize EEG patterns corresponding to the desired
mental tasks (such patterns are associated with the command "nothing" and thus have
no consequence except delay).
To illustrate the hard task faced by the neural classifier, Figure 2 shows the PCA
projection onto the first two eigendirections of the energy features of the five bipolar
channels recording EEG signals while a user carried out five mental tasks according to
the experimental protocol above (see [9] for details). Every sampled EEG pattern is
indicated with the number of its associated mental task (from 0 to 4). The figure
shows a high degree of overlap among the classes. Thus compact networks, such as
classical multi-layer perceptrons, will fail since they cannot compute different outputs
for very similar inputs. Their performance will not improve even if the data is pre-
processed with self-organizing maps [5], since every unit of the map will codify EEG
patterns of different categories. We have confirmed this suspicion experimentally [9].
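The PCA projection used for Figure 2 can be sketched as below; the data layout (one row per sampled EEG pattern, one column per energy feature) is an assumption.

```python
import numpy as np

def pca_project(features, n_components=2):
    """Project feature vectors onto the leading principal directions.

    `features` is an (n_samples, n_features) array, e.g. the energy
    features of the five bipolar channels; returns the projection onto
    the first `n_components` eigendirections of the covariance matrix.
    """
    centered = features - features.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues ascending
    top = eigvecs[:, ::-1][:, :n_components]    # leading directions
    return centered @ top

# Example with random stand-in data
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
Z = pca_project(X)
print(Z.shape)   # (200, 2)
```

Plotting the two columns of Z, with each point labeled by its task number, reproduces the kind of scatter shown in Figure 2.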
On the other hand, one could use a single local network, such as RBF (e.g., [10]) or
LVQ [5]. Our results, however, show that local networks achieve good results during
training but generalize poorly. For example, Table 1 reports the results³ we have ob-
tained with Platt's RAN algorithm [10] for the classification of the task "cube rota-
tion". Similar results are obtained for the task "subtraction", whereas the task "relaxa-
tion" is better classified.
Table 1. Performance of the RAN algorithm for the mental task "cube rotation".
³ The results of this section refer to the personal brain interface of one of the users.
Similar levels of performance are obtained for the second user.
Fig. 2. PCA projection of the power spectral energy features on the two first eigen directions.
Fig. 3. Hierarchical committee of incremental networks for the classification of EEG patterns.
A new unit is added only if two conditions apply. First, an EEG pattern corre-
sponding to the mental task to be recognized is incorrectly classified. Second, the
current input does not sufficiently activate any existing RBF unit. In this case, the cen-
ter of the new unit corresponds to the current EEG pattern, and its width is initialized
to a fixed value. The weight of the connection from this new unit to the output unit is
set to the difference between the desired output (i.e., 1.0) and the actual output of the
network.
If either of the above conditions is not satisfied, the learning algorithm simulta-
neously adjusts the centers and widths of the active RBF units as well as the weighted
connections to decrease the output error. The resulting gradient descent rules have intui-
tive interpretations. Unit centers are pulled toward EEG patterns of the desired mental
task, while being pushed away from EEG patterns of other tasks. Unit widths grow to
cover as many desired EEG patterns as possible, but shrink to avoid negative EEG
patterns.
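The unit-addition rule above can be sketched as follows. The activation threshold, initial width, and learning rate are illustrative assumptions, and the gradient updates of centers and widths are omitted for brevity (the paper adjusts those too).

```python
import numpy as np

class IncrementalRBF:
    """Minimal sketch of the incremental RBF classifier described in
    the text; all hyperparameter values are illustrative assumptions."""

    def __init__(self, act_threshold=0.5, init_width=1.0, lr=0.05):
        self.centers, self.widths, self.weights = [], [], []
        self.act_threshold = act_threshold
        self.init_width = init_width
        self.lr = lr

    def _activations(self, x):
        return np.array([np.exp(-np.sum((x - c) ** 2) / (2 * w ** 2))
                         for c, w in zip(self.centers, self.widths)])

    def output(self, x):
        if not self.centers:
            return 0.0
        return float(self._activations(x) @ np.array(self.weights))

    def train_step(self, x, target):
        y = self.output(x)
        acts = self._activations(x) if self.centers else np.array([])
        # condition 1: a desired pattern is incorrectly classified
        misclassified = target == 1.0 and y < 0.5
        # condition 2: no existing unit is sufficiently activated
        far_from_units = len(acts) == 0 or acts.max() < self.act_threshold
        if misclassified and far_from_units:
            # new unit centered on the current pattern; its weight is
            # the residual error, as described in the text
            self.centers.append(np.array(x, dtype=float))
            self.widths.append(self.init_width)
            self.weights.append(target - y)
        elif self.centers:
            # otherwise, a gradient step on the weights (center and
            # width updates omitted in this sketch)
            err = target - y
            for i, a in enumerate(acts):
                self.weights[i] += self.lr * err * a

net = IncrementalRBF()
net.train_step(np.array([0.0, 0.0]), 1.0)   # adds the first unit
print(len(net.centers))   # 1
```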
An important feature of this kind of neural classifier is that units are moved so as to
find clusters of those EEG patterns corresponding to the mental task to be recognized.
After training, it turns out that some of the units have learned quite robust user-
specific filters, whereas other units are tuned to EEG patterns that are still too
similar to patterns of different mental tasks. Our approach is, then, to label the output
of the classifier as unknown if one of the latter units is the closest to the observed EEG
pattern. In this way, the neural classifier does not make risky decisions for uncertain
EEG patterns, which are thus associated with the command "nothing". Furthermore,
users can take this "no answer" of the brain interface as a warning that they should
either concentrate more intensively on the desired mental task or choose another strat-
egy to undertake it. Indeed, initial observations seem to indicate that, with practice,
users learn to generate those individual EEG patterns that are better distinguished by
their personal brain interface. But more extensive testing of the approach is needed
before confirming this hypothesis.
Even though the output of a neural classifier is a real number, the ABI makes dis-
crete decisions to classify the incoming EEG patterns all across the hierarchical com-
mittee. To this end, a network classifies an EEG pattern as:
- belonging to the desired class if the output is higher than a given threshold K,
- belonging to the remaining classes if the output is smaller than 1-K, or
- unknown if the output is in between.
This procedure is another key factor for enhancing the robustness of the ABI, for the
same reasons stated before.
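A minimal sketch of this discretization rule; the numerical value of K below is an illustrative assumption, as the paper does not report it.

```python
def classify(output, k=0.8):
    """Discretize a network output as described in the text.

    k plays the role of the threshold K: values above k vote for the
    desired class, values below 1-k for the remaining classes, and
    anything in between is labeled "unknown" (which the ABI maps to
    the command "nothing"). k=0.8 is an assumption.
    """
    if output > k:
        return "desired"
    if output < 1 - k:
        return "other"
    return "unknown"

print(classify(0.95))   # desired
print(classify(0.10))   # other
print(classify(0.50))   # unknown
```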
Figure 4 shows two user-specific filters discovered from the differential channel
F4C4 that classify quite robustly the mental task "cube rotation". Table 2 reports the
performance of our approach for the classification of the task "cube rotation". Similar
results are obtained for the task "subtraction", whereas the task "relaxation" is better
classified. It is worth noting that the channel O1O2 is irrelevant for the classification of
the three mental tasks of interest.
Fig. 4. Two user-specific filters for the robust classification of the mental task "cube rotation"
using energy features.
5 Discussion
action (e.g., move the pointer up or the wheelchair forward), but does not worry about
its implementation (e.g., how far so as to reach the next item above or obstacle avoid-
ance). The implementation of the elementary commands associated with the EEG pat-
terns of interest will depend on the application. For example, we can use these three
patterns to guide a motorized wheelchair. The first pattern--relaxation--switches on
or off the ABI. Turning on the ABI makes the wheelchair move forward, while turn-
ing it off makes the wheelchair stop. The remaining two patterns are used to
make the wheelchair turn right or left, respectively. These elementary commands (i.e.,
move forward, turn right, and turn left) are sent to a second learning system (e.g., [8])
that uses the on-board sensors to bring the wheelchair in the desired direction in a safe
(avoiding collisions) and smooth way.
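The pattern-to-command mapping described above can be sketched as a small state machine. That relaxation toggles the ABI (forward/stop) and that the other two patterns steer comes from the text; which of the two patterns maps to which turn is our illustrative assumption.

```python
class WheelchairABI:
    """Sketch of the elementary-command mapping described in the text.

    Unrecognized patterns (None) map to "nothing"; the assignment of
    "cube rotation" to right and "subtraction" to left is an assumption.
    """

    def __init__(self):
        self.active = False

    def on_pattern(self, pattern):
        if pattern == "relaxation":
            # relaxation toggles the ABI: on -> move forward, off -> stop
            self.active = not self.active
            return "move forward" if self.active else "stop"
        if not self.active or pattern is None:
            return "nothing"
        return {"cube rotation": "turn right",
                "subtraction": "turn left"}.get(pattern, "nothing")

abi = WheelchairABI()
print(abi.on_pattern("relaxation"))     # move forward
print(abi.on_pattern("cube rotation"))  # turn right
print(abi.on_pattern("relaxation"))     # stop
```

In the real system these elementary commands would go to the second learning system [8], which uses the on-board sensors to execute them safely and smoothly.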
These preliminary results have been obtained with two users. Before going
on to recognize a larger set of patterns, we are now experimenting with the ABI on more
users and trying to improve its robustness. In this latter respect we are exploring alter-
native feature extraction methods (e.g., autoregressive models, wavelets, etc.). This
ongoing work is partially built upon previous studies with off-line EEG signals. One
of them confirms that an artificial neural network distinguishes EEG patterns better if
it uses the temporal dynamics of brain activity [11]. This is not surprising, since EEG
signals carry temporal information. We deal with the time dimension (or history) by
means of a novel recurrent self-organizing map.
References
1. Anderson, C.W., Sijercic, Z.: Classification of EEG signals from four subjects during five
mental tasks. Int. Conf. on Engineering Applications of Neural Networks (1996) 407-414.
2. Babiloni, F., et al.: Improved realistic Laplacian estimate of highly-sampled EEG potentials
with regularization techniques. Electroencephalography and Clinical Neurophysiology 106
(1998) 336-343.
3. Bernardi, M., Canale, I., et al.: Ergonomy of paraplegic patients working with a reciprocat-
ing gait orthosis. Paraplegia 33 (1995) 458-463.
4. Kalcher, J., et al.: Graz brain-computer interface II. Medical & Biological Engineering &
Computing 36 (1996) 382-388.
5. Kohonen, T.: Self-Organizing Maps. 2nd ed. Springer-Verlag, Berlin (1995).
6. Marciani, M.G., et al.: Quantitative EEG evaluation in normal elderly subjects during men-
tal processes. International Journal of Neurosciences 76 (1994) 131-140.
7. McFarland, D.J., et al.: Spatial filter selection for EEG-based communication. Electroen-
cephalography and Clinical Neurophysiology 103 (1997) 386-394.
8. Millán, J. del R.: Rapid, safe, and incremental learning of navigation strategies. IEEE Trans.
on SMC-Part B 26 (1996) 408-420.
9. Millán, J. del R., Mouriño, J., et al.: Incremental networks for the robust recognition of
mental states from EEG. Technical Report, Joint Research Centre of the EC, Italy (1998).
10. Platt, J.: A resource allocating network for function interpolation. Neural Computation 3
(1991) 213-225.
11. Varsta, M., Millán, J. del R., Heikkonen, J.: A recurrent self-organizing map for temporal
sequence processing. 7th Intl. Conf. on Artificial Neural Networks (1997) 421-426.
Identifying Mental Tasks from Spontaneous
EEG: Signal Representation and Spatial Analysis
Charles W. Anderson
Department of Computer Science, Colorado State University, Fort Collins, CO 80523, USA,
anderson@cs.colostate.edu,
WWW home page: http://www.cs.colostate.edu/~anderson
1 Introduction
Automatic classification of electroencephalogram, or EEG, signals can lead to signif-
icant advances in studies of psychiatric diagnosis [21], human-computer interfaces,
aids for disabled persons [14], and cognitive workload [15]. The state of the art, how-
ever, is very limited; usually a small number of mental states are discriminated in
any given experiment. For example, previous work by Keirn and Aunon studied the
discrimination of pairs of mental tasks [12]. We have repeated their work with pairs
of tasks [1-4] and a much larger data set. We have also extended their work to the
discrimination of three tasks [5]. In this article, we describe the procedures and results
of attempting to discriminate between five mental tasks.
A critical component of any automatic classification scheme is the representation
with which the information for each case is encoded. For EEG signal classification, a
representation is desired for which an accurate classifier can be trained with a reason-
able number of known examples and that is relatively invariant over time. The lack
of comparative studies in EEG classification makes it difficult to draw useful con-
clusions. Some studies have reported comparisons between conventional classification
methods and neural networks (e.g., [22]), but it is relatively rare to find comparisons
among different signal representations.
This article reports the results of a comparison of EEG signal representations
judged by the performance of neural-network classifiers. We compared representations
based on AR models and Fourier Transforms, and reduced-dimensional versions of both
based on the Karhunen-Loève (KL) Transform [10]. Our best results were obtained
with a sixth-order AR representation and a feedforward neural network having one
hidden layer of 20 units, trained with error backpropagation [17]. By averaging the
output of the classifier over approximately five seconds of consecutive, half-second
windows, we found that 72% of the test segments were classified correctly, averaged
2 Related Work
Since the early days of automatic EEG processing, representations based on a Fourier
Transform have been most commonly applied. This approach is based on earlier ob-
servations that the EEG spectrum contains some characteristic waveforms that fall
primarily within four frequency bands: delta (1-3 Hz), theta (4-7 Hz), alpha (8-13
Hz), and beta (14-20 Hz). Such methods have proved beneficial for various EEG char-
acterizations, but the Fourier Transform and its discrete version, the FFT, suffer from
large noise sensitivity. Numerous other techniques from the theory of signal analy-
sis have been used to obtain representations and extract the features of interest for
classification purposes. Gevins and Rémond [6] summarize many of these techniques.
Yunck and Tuteur [22] describe an ambitious comparative study of a variety of
classifiers, all using the same representation, for the discrimination of EEG recorded
from 40 subjects performing the following seven tasks: resting, mental arithmetic,
listening to music, performing verbal exercises, listening to speech, performing pic-
torial exercises, and viewing a film. They compared four parametric classifiers based
on Gaussian assumptions and four nonparametric k-nearest neighbor classifiers. Their
representation consisted of 320 features based on the power in several frequency bands
from the signals recorded simultaneously at four electrodes [9]. The nonparamet-
ric classifiers were found to be superior to the Gaussian-based classifiers, suggesting
that the majority of published work on EEG classification, which is based on linear-
discriminant analysis and related Gaussian-based methods, could be improved by
using nonparametric methods, such as neural networks.
Others also find AR models to be fruitful ways of characterizing EEG segments.
Sanderson et al. [18] describe a multiple-stage procedure whereby single channels
of EEG are adaptively divided into relatively stationary segments, the segments are
modeled using AR models, and the AR coefficients are clustered. Tseng et al. [20]
evaluated different parametric models on a fairly large database of EEG segments.
Using inverse filtering, white noise tests, and one-second EEG segments, they found
that AR models of orders between 2 and 32 yielded the best EEG estimation. For a
method that avoids the use of signal segmentation and provides an on-line AR
parameter estimation that fits nonstationary signals, like EEG, see [7].
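The AR characterization discussed here can be sketched in plain numpy. A sixth-order model (the order used later in this article) is fitted via the Yule-Walker equations; the estimation method and the biased autocorrelation estimator are assumptions, as the cited works use various procedures.

```python
import numpy as np

def ar_coefficients(x, order=6):
    """Fit an AR model of the given order to a single-channel EEG
    segment via the Yule-Walker equations (plain-numpy sketch; the
    estimation method is an assumption)."""
    x = x - x.mean()
    n = len(x)
    # biased autocorrelation estimates r[0..order]
    r = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(order + 1)])
    # Toeplitz system R a = r[1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

# Example: a half-second segment at 250 Hz (125 samples) of stand-in data
rng = np.random.default_rng(2)
coeffs = ar_coefficients(rng.standard_normal(125))
print(coeffs.shape)   # (6,)
```

Concatenating the coefficient vectors of all channels yields the kind of AR feature vector used as classifier input later in this article.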
In a problem of discriminating the EEG of normal subjects from that of subjects with
psychiatric disorders, Tsoi et al. [21] used AR models to represent one-second EEG
segments and trained neural networks to perform the classification. Their best classifi-
cation results were obtained on data averaged over 250-second intervals.
Inouye et al. [8] used EEG to localize activated areas and determined directional
patterns in activity changes during mental arithmetic. They considered two EEG rep-
resentations based on information-theoretic measures. Signals from 18 electrodes were
represented by first calculating FFTs of each one-second segment, then averaging the
FFTs over four consecutive segments, and finally calculating the entropy of the re-
sulting power spectra. Differences in the entropy at a number of electrode locations
were found during rest versus during the performance of a mental arithmetic task.
They also studied a mutual information measure based on two-dimensional, 15th-order
AR models fitted to each of the 153 pairwise combinations of electrodes. Their results
showed a significant difference in information flow between electrodes for the resting
and mental arithmetic tasks.
3 Method
3.1 EEG Data Acquisition and Representation
Subjects were seated in an Industrial Acoustics Company sound-controlled booth
with dim lighting and noiseless fans for ventilation. An Electro-Cap elastic electrode
cap was used to record from positions C3, C4, P3, P4, O1, and O2, defined by the
10-20 system of electrode placement [9] and shown in Figure 1. The electrodes were
connected through a bank of Grass 7P511 amplifiers and bandpass filtered from 0.1-
100 Hz. Data was recorded at a sampling rate of 250 Hz with a Lab Master 12-bit
A/D converter mounted in an IBM-AT computer. Eye blinks were detected by
means of a separate channel of data recorded from two electrodes placed above and
below the subject's left eye.
For this paper, the data from one subject performing the following five mental
tasks was analyzed. These tasks were chosen by Keirn and Aunon to invoke hemi-
spheric brainwave asymmetry [16]. The five tasks are:
Baseline Task: The subjects were not asked to perform a specific mental task, but
to relax as much as possible and think of nothing in particular.
Letter Task: The subjects were instructed to mentally compose a letter to a friend
or relative without vocalizing.
Math Task: The subjects were given nontrivial multiplication problems, such as 49
times 78, and were asked to solve them without vocalizing or making any other
physical movements.
Visual Counting Task: The subjects were asked to imagine a blackboard and to visualize
numbers being written on the board sequentially, with the previous number being
erased before the next number was written.
segments from the five tasks, the global KL estimate is 31, a small reduction from
the original 36 dimensions of the representation. For the PSD representation, the
global KL estimate is 21. This is a large reduction from the 378 dimensions of the
PSD representation.
4 Results
Figure 2 summarizes the average percent of test segments classified correctly for
various-sized networks using each of the four representations, which will be called
AR-KL and AR for the representations based on AR coefficients, with and without
dimensionality reduction by the Karhunen-LoSve transform, and PSD-KL and PSD
for the representations based oil the power spectral density. 90% confidence intervals
are included ia tile plots. For one hidden unit, the PSD representations perform better
than the AR representations. With two hidden units, the PSD-KL representatiou
performs about 10% better than the other three. With 20 hidden units, the KL
representations perform worse than the non-KL representations, though the difference
is not statistically signiticant.
Fig. 2. Average percent of test segments correctly classified. Error bars show 90% confidence intervals.
Inspection of how the network's classification changes from one segment to the next
suggests that better performance might be achieved by averaging the network's output
over consecutive segments. To investigate this, a 20-unit network trained with the AR
representation is studied. The left column of graphs in Figure 3 shows the output values
of the network's five output units for each segment of test data from one trial. On each
Fig. 3. Network output values and desired values for one test trial. The first five rows of graphs show
the values of the five network outputs over the 175 test segments. The sixth row of graphs plots the task
determined by the network outputs and the true task. The first column of graphs is without averaging over
consecutive segments, the second is for averaging the network output over ten consecutive segments, while
the third column is for averaging over twenty segments.
graph the desired value for the corresponding output is also drawn. The bottom graph
shows the true task and the task predicted by the network. For this trial, 54% of the
segments are classified correctly when no averaging across segments is performed. The
other two columns of graphs show the network's output and predicted classification
that result from averaging over 10 and 20 consecutive segments. Confusions that the
classifier made can be identified by the relatively high responses of an output unit for
test segments that do not correspond to the task represented by that output unit. For
example, in the third graph in the right column, the output value of the math unit
is high during math segments, as it should be, but it is also relatively high during
count segments. Also, the output of the count unit, shown in the fourth graph, is high
during count segments, but is also relatively high during letter segments.
For this trial, averaging over 20 segments results in 96% correct, but performance
is not improved this much on all trials. The best classification performance for the
20-hidden-unit network, averaged over all 90 repetitions, is achieved by averaging
over all segments. Figure 4 shows how the fraction correct varies with the number
of consecutive segments averaged for each representation. All trials contain at least
20 segments, but very few contain 35, so the statistical significance of the averages
plotted in Figure 4 quickly decreases above 20 segments. The AR representation
performs the best whether averaging over 10 or 20 segments, but when averaging
over 20 segments, the AR and AR-KL representations perform equally well. The PSD
and PSD-KL representations do consistently worse than the AR representations.
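The output-averaging scheme studied here can be sketched as below; the sliding window over trailing segments and the arg-max decision are assumptions about details the text does not fully specify.

```python
import numpy as np

def classify_with_averaging(outputs, n_avg):
    """Average the network's per-segment output vectors over up to
    n_avg trailing consecutive segments before taking the arg-max.

    `outputs` is an (n_segments, n_tasks) array of network outputs;
    returns one predicted task index per segment.
    """
    averaged = np.array([outputs[max(0, i - n_avg + 1):i + 1].mean(axis=0)
                         for i in range(len(outputs))])
    return averaged.argmax(axis=1)

# Stand-in data: noisy outputs that on average favor task 2
rng = np.random.default_rng(3)
raw = 0.3 * rng.standard_normal((40, 5))
raw[:, 2] += 0.5
print(classify_with_averaging(raw, 20)[-1])
```

Averaging suppresses segment-to-segment noise, which is why the curves in Figure 4 improve as more consecutive segments are included.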
Fig. 4. The fraction of averaged windows classified correctly versus the number of consecutive windows
averaged over.
Fig. 5. a. Results of k-means clustering for 20 clusters with the AR representation; b. Results of k-means
clustering for 10 clusters with the PSD representation.
Cluster 2 suppresses (is connected negatively to) the math task output unit, and
Cluster 3 suppresses all but the math task unit. Other clusters also contain significant
weights for the O1 and O2 channels. One cluster that includes large weights in other
channels is Cluster 18, for which the first-order weights are relatively large, positive
values for C3, P3, and O1. As Figure 1 shows, these electrodes record from the
left hemisphere. This cluster has positive output weights for the baseline, letter, and
math tasks, and negative for the counting and rotation tasks, suggesting a hemispheric
asymmetry in the EEG signals related to the first three tasks. Recall that prior
to training all representation components were normalized to have the same mean
and variance. This removes biases that would arise from differing input component
variances, allowing the direct comparison of the magnitudes of the weights in these
clusters.
A similar cluster analysis can be applied to other signal representations. Figure 5b
shows the results of a cluster analysis of the PSD representation. As before, vectors
of hidden-unit input weights and output weights were clustered, this time into ten
clusters. There are too many components to display as boxes, so they are simply
plotted versus a component index. The left column of graphs in Figure 5b shows the
input components of the ten clusters. Components are grouped by channel, and the
components for each channel correspond to the power at frequencies ranging from 1
to 125 Hz. The right column of graphs displays the output weights for each cluster,
with one value corresponding to each task.
EEG signals arising from brain activity are typically characterized by their power
spectrum from 0 to 30 or 40 Hz, with higher frequencies being attributed to muscle
activity or sensor noise. Consideration of the third and fifth clusters suggests that,
whatever the cause of the high-frequency signals, the high-frequency components
are correlated with task. Cluster 3 contains a positive connection to the math task
and negative connections to the others, while Cluster 5 contains a negative math
connection and positive or near-zero connections to the others, i.e., the inverse of
Cluster 3. The input weights of these clusters are also approximately negatives of
each other: Cluster 3 has negative weights for O1 components and positive for O2
components, and this pattern is reversed for Cluster 5.
Another very interesting observation is that Cluster 8 contains large weights in
the P3 and O1 channels at a frequency very close to 60 Hz. This is most likely due
to interference from the 60 Hz power supply during the recording process. The EEG
recording amplifier used to gather this data supposedly filters out 60 Hz, but the
cluster analysis clearly shows the presence of a 60 Hz signal and, not only its presence,
but that it is correlated with the letter task. Even though all tasks were repeated on
two different days, there may be more 60 Hz noise in the letter task data than in
other data. This demonstrates how the cluster analysis of a large number of resulting
weight vectors can lead to an understanding of what relationships the networks have
extracted from the data. It also shows how assumptions about the data, such as the
removal of known noise sources, can be verified.
6 Conclusion
The correct task out of five was identified for 64% of the EEG test patterns
when the output of the network was averaged over five consecutive, half-second seg-
ments and each segment was represented by either an AR model or a power spectral
density (PSD). This level of performance was achieved with a neural network of one
hidden layer containing 20 units for the AR representation and 40 units for the PSD
representation. Performance was increased to 72% by averaging over 20 consecutive
segments (approximately five seconds of data), but only for the AR case.
Karhunen-Loève transforms were applied to both the AR and PSD representations
to investigate the possibility of reducing the dimensionality of the input representation
without sacrificing performance. Results show that the AR representation could not
be significantly reduced without decreasing performance. The dimension of the PSD
representation could be greatly reduced with little loss in performance on individual
half-second segments, but performance was considerably lower when averaging over
consecutive segments.
Cluster analysis was applied to learned weight vectors, revealing some of the ac-
quired relationships between representation components and mental tasks and also
revealing unexpected characteristics of the data, such as the presence of 60 Hz noise.
The results of clustering can be used both for the construction of lower-dimensional
representations and for investigating hypotheses regarding differences in brain activity
related to different cognitive behavior.
Many issues remain to be solved before this approach can be developed into a
reliable, portable EEG-computer interface. Portable EEG acquisition devices are not
generally available, but are being developed. Current EEG electrodes are very in-
convenient to use. A primary limitation of work to date is the lack of generalization
studies across subjects. Lin, Tsai, and Liou [13] did test multi-subject generalization
using data very similar to that used in this article, but met with little success.
A c k n o w l e d g m e n t s : This work was supported by the National Science Foundation through grants IRI-
9202100 and 01SE-9422007.
References
1. C. W. Anderson, S. V. Devulapalli, and E. A. Stolz. Determining mental state from EEG signals using
neural networks. Scientific Programming, 4(3):171-183, Fall 1995.
2. C. W. Anderson, S. V. Devulapalli, and E. A. Stolz. EEG signal classification with different signal
representations. In F. Girosi, J. Makhoul, E. Manolakos, and E. Wilson, editors, Neural Networks for
Signal Processing V, pages 475-483. IEEE Service Center, Piscataway, NJ, 1995.
3. C. W. Anderson, E. A. Stolz, and S. Shamsunder. Discriminating mental tasks using EEG represented by
AR models. In Proceedings of the 1995 IEEE Engineering in Medicine and Biology Annual Conference,
Montreal, Canada, 1995.
4. C. W. Anderson, E. A. Stolz, and S. Shamsunder. Multivariate autoregressive models for classification of
spontaneous electroencephalogram during mental tasks. IEEE Transactions on Biomedical Engineering,
45(3):277-286, 1998.
5. Charles W. Anderson. Effects of variations in neural network topology and output averaging on the
discrimination of mental tasks from spontaneous electroencephalogram. Journal of Intelligent Systems,
7(1-2):165-190, 1997.
6. A. S. Gevins and A. Rémond. Methods of Analysis of Brain Electrical and Magnetic Signals, volume 1
of Handbook of Electroencephalography and Clinical Neurophysiology (revised series). Elsevier Science
Publishers B.V., New York, NY, 1987.
7. S. Goto, M. Nakamura, and K. Uosaki. On-line spectral estimation of nonstationary time series based
on AR model parameter estimation and order selection with a forgetting factor. IEEE Transactions on
Signal Processing, 43(6):1519-1522, June 1995.
8. T. Inouye, K. Shinosaki, A. Iyama, and Y. Matsumoto. Localization of activated areas and direc-
tional EEG patterns during mental arithmetic. Electroencephalography and Clinical Neurophysiology,
86(4):224-230, 1993.
9. H. Jasper. The ten twenty electrode system of the international federation. Electroencephalography and
Clinical Neurophysiology, 10:371-375, 1958.
10. I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.
11. S. M. Kay. Modern Spectral Estimation: Theory and Application. Prentice-Hall, Englewood Cliffs, NJ,
1988.
12. Z. A. Keirn and J. I. Aunon. A new mode of communication between man and his surroundings. IEEE
Transactions on Biomedical Engineering, 37(12):1209-1214, December 1990.
13. Shiao-Lin Lin, Yi-Jean Tsai, and Cheng-Yuan Liou. Conscious mental tasks and their EEG signals.
Medical & Biological Engineering & Computing, 31:421-425, 1993.
14. H. S. Lusted and R. B. Knapp. Controlling computers with neural signals. Scientific American, pages
82-87, October 1996.
15. Scott Makeig, Tzyy-Ping Jung, and Terrence J. Sejnowski. Using feedforward neural networks to monitor
alertness from changes in EEG correlation and coherence. In D. S. Touretzky, M. C. Mozer, and M. E.
Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 931-937. The MIT
Press, Cambridge, MA, 1996.
16. M. Osaka. Peak alpha frequency of EEG during a mental task: Task difficulty and hemispheric differ-
ences. Psychophysiology, 21:101-105, 1984.
17. D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propaga-
tion. In D. E. Rumelhart, J. L. McClelland, and The PDP Research Group, editors, Parallel Distributed
Processing: Explorations in the Microstructure of Cognition, volume 1. Bradford, Cambridge, MA, 1986.
18. A. C. Sanderson, J. Segen, and E. Richey. Hierarchical modeling of EEG signals. IEEE Transactions
on Pattern Analysis and Machine Intelligence, PAMI-2(5):405-414, September 1980.
19. E. Stolz. Multivariate autoregressive models for classification of spontaneous electroencephalogram
during mental tasks. Master's thesis, Electrical Engineering Department, Colorado State University,
Fort Collins, CO, 1995.
20. S-Y. Tseng, R-C. Chen, F-C. Chong, and T-S. Kuo. Evaluation of parametric methods in EEG signal
analysis. Med. Eng. Phys., 17:71-78, January 1995.
21. A. C. Tsoi, D. S. C. So, and A. Sergejew. Classification of electroencephalogram using artificial neural
networks. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information
Processing Systems 6, pages 1151-1158. Morgan Kaufmann, San Francisco, CA, 1994.
22. T. P. Yunck and F. B. Tuteur. Comparison of decision rules for automatic EEG classification. IEEE
Transactions on Pattern Analysis and Machine Intelligence, PAMI-2(5):420-428, September 1980.
Independent Component Analysis
of Human Brain Waves
1 Introduction
Without doubt, the brain is among the most intriguing and complex systems ever
studied by humankind. In an attempt to give a plausible explanation of the
whys and hows of human perception and cognition, many conjectures have been
formulated and theories tested throughout the centuries. The end of this
century, in particular, has seen an impressive explosion of knowledge about the brain,
both in the understanding of some of the most basic human processing systems
and in the elaboration of efficient computational neuroscience models.
In a bootstrapping (reinforcing) manner, the discoveries made on the human
brain are leading to the formulation of more efficient computational methods,
which in turn make it possible to design new signal processing tools for better
extracting information from brain data. Some of the most promising of such tools
are in the field of artificial neural networks, of which this paper's independent
component analysis (ICA) algorithm is a good example.
Several approaches to the solution of the ICA problem are available in the
literature [2, 4, 6, 7, 10, 14]. A good tutorial on neural ICA implementations is
given in [15]. The particular algorithm used in this study is discussed in [10,
18].
The initial step in source separation, using the method described in this arti-
cle, is whitening, or sphering. This projection of the data is used to achieve
decorrelation of the solutions found, which is a prerequisite of statistical
independence [10]. The whitening can also be seen to ease the separation of
the independent signals [15]. In [11], it has been shown that a well chosen com-
pression during this stage may be necessary in order to reduce the overlearning
(overfitting) typical of ICA methods. The result of a poor compression choice
is the production of solutions that are practically zero almost everywhere, except
at the point of a single spike or bump.
The whitening may be accomplished by PCA projection: v = Vx, with
E{vv^T} = I. The whitening matrix V is given by V = Λ^{-1/2} Σ^T, where
Λ = diag[λ(1), ..., λ(M)] is a diagonal matrix containing the eigenvalues of the data
covariance matrix E{xx^T}, and Σ is a matrix with the corresponding eigenvectors
as its columns.
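This whitening step can be sketched numerically; the toy data below are an illustrative assumption, and the variable names mirror the symbols in the text (V the whitening matrix, Λ the eigenvalues, Σ the eigenvectors):

```python
import numpy as np

# Whitening (sphering) by PCA projection: v = V x with E{v v^T} = I.
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 4)) @ rng.normal(size=(4, 4))  # correlated toy samples
x -= x.mean(axis=0)                                       # zero-mean data

cov = np.cov(x, rowvar=False)            # E{x x^T}
eigvals, eigvecs = np.linalg.eigh(cov)   # Lambda (eigenvalues), Sigma (eigenvectors)
V = np.diag(eigvals ** -0.5) @ eigvecs.T # whitening matrix V = Lambda^{-1/2} Sigma^T
v = x @ V.T                              # whitened data: covariance is the identity
```

Dropping the columns of Σ with the smallest eigenvalues at this point gives the dimension-reducing compression mentioned above.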
Consider a linear combination y = w^T v of a sphered data vector v, with
||w|| = 1. Then E{y^2} = 1 and kurt(y) = E{y^4} - 3, whose gradient with
respect to w is 4 E{v (w^T v)^3}.
The fixed point algorithm [10], calculated over sphered zero-mean vectors v,
finds one of the rows of the separating matrix B (denoted w) and so identifies one
independent source at a time -- the corresponding independent source signal can
then be found using Eq. 2. Each iteration l of this algorithm, a gradient descent
over the kurtosis, is defined as

    w*_l = E{v (w_{l-1}^T v)^3} - 3 w_{l-1}
    w_l  = w*_l / ||w*_l||.    (3)
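Iteration (3) can be sketched as follows. The two-source mixture is an illustrative assumption chosen so that one sub-Gaussian and one super-Gaussian source are recovered, not the paper's data:

```python
import numpy as np

def fixed_point_ica(v, iters=100, seed=0):
    """One-unit fixed-point iteration over sphered row vectors v (samples x dims):
    w*_l = E{v (w_{l-1}^T v)^3} - 3 w_{l-1};  w_l = w*_l / ||w*_l||  (Eq. 3)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=v.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(iters):
        y = v @ w                                           # w^T v for every sample
        w_star = (v * y[:, None] ** 3).mean(axis=0) - 3 * w # kurtosis-gradient step
        w = w_star / np.linalg.norm(w_star)                 # renormalize
    return w

# Toy demo (illustrative assumptions): one sub-Gaussian and one super-Gaussian
# source, linearly mixed, then whitened as the algorithm requires.
rng = np.random.default_rng(1)
s = np.column_stack([rng.uniform(-1, 1, 5000),   # sub-Gaussian source
                     rng.laplace(0, 1, 5000)])   # super-Gaussian source
x = s @ rng.normal(size=(2, 2)).T
x -= x.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(x, rowvar=False))
v = x @ (np.diag(eigvals ** -0.5) @ eigvecs.T).T
w = fixed_point_ica(v)
y = v @ w   # estimate of one independent source (up to sign and scale)
```

The recovered projection y is highly correlated with one of the two original sources; repeating the iteration with w constrained orthogonal to previous solutions would extract the remaining rows of B.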
The challenges presented to the signal processing community by electro- and
magnetoencephalographic recordings from the human brain may be divided into
two classes: one dealing with the identification and removal of artifacts from the
recordings, and another with the understanding of the brain signals themselves
(see Table 1). The amplitude of the artifactual disturbances may well exceed that
of the brain signals, making the analysis of brain activity a very hard process.
Moreover, artifacts may bear a strong resemblance to some physiological brain
responses, leading to erroneous interpretation of the recording [9].
Table 1. Some signal processing problems encountered in EEG and MEG studies.
Typical artifacts, present in most EEG and MEG measurements, include eye
and muscle activity; the heart's electrical activity, captured at the lowest sen-
sors of a whole-scalp magnetometer array; and externally induced artifacts. The
relevance of identifying such artifacts can be seen in the analysis of the
QRS complex, followed by the repolarising T wave, which may be misinterpreted
as the spike and slow wave associated with some epileptic seizures.
As for the analysis of the human brain's functioning, it is common to use event-
related activity as an entry point to this study. This activity is time-locked to a
particular stimulus, which may be of auditory, somatosensory or visual type [16].
Brain responses to stimulation present minimal inter-individual differences
for a particular set of stimulus parameters. In order to understand the physiolog-
ical origins of the event-related activity, it may be desirable to decompose the
complex brain response into simpler elements that would be easier to model,
and to localize their neural sources. In addition, the separation of multi-modal
responses to complex stimuli may represent a hard task for conventional meth-
ods, but is surely of capital importance, due to the diversity of stimuli in the
perception of the real world.
In this experiment, the measured subject was asked to bite his teeth, to move
his eyes horizontally, and to blink. This activity ensured the presence
of strong eye and myographic artifacts. In order to augment the number of
possible artifacts, a watch was inserted in the shielded room [23]. In Fig. 1 a
sample of the MEG signals is depicted, showing clear areas of eye and muscle
activity. Both the watch and the heart cycle can be guessed from some of the
sensor signals. Vertical and horizontal electro-oculograms (VEOG and HEOG)
and the electrocardiogram (ECG) were recorded simultaneously with the MEG, in
order to guide and ease the identification of the independent components.
Fig. 2. Six independent components extracted from the MEG data. For each compo-
nent the left, back and right views of the field patterns are shown -- solid lines stand
for magnetic flux coming out of the head, and dotted lines for flux going inwards.
The latencies of the two different evoked responses are clear in some MEG
channels (compare e.g. MEG58 with MEG61, over the auditory and somatosen-
sory primary cortices, respectively). Nevertheless, in most of the recorded signals
this separation is far from accomplished. Figure 4 a) shows the results obtained
by PCA, where we may see that the mixing has not been resolved. In b) we see
Fig. 4. Principal a) and independent b) components of the data. Field patterns cor-
responding to the first two independent components in c). In d) the superposition
of the localizations of the dipole originating IC1 (black circles, corresponding to the
auditory cortex activation) and IC2 (white circles, corresponding to the SI cortex acti-
vation) onto magnetic resonance images (MRI) of the subject. The bars illustrate the
orientation of the source net current.
the auditory and somatosensory responses clearly separated in the first two in-
dependent components. The corresponding field patterns c), together with the
superimposition of the localizations of the sources on MRI slices d), allow us to
conclude on a satisfactory agreement between the ICs and the conventional loca-
tions for this type of brain response.
A final experiment, using only averaged auditory evoked fields, illustrated the
decomposition capabilities of ICA in such setups. The stimuli consisted of 200
tone bursts presented to the subject's right ear, with a 1 s interstimulus
interval. The bursts had a duration of 100 ms and a frequency of 1 kHz [25].
Fig. 5. Principal a) and independent b) components found in the auditory evoked field
study. Each tick in a) and b) corresponds to 100 ms, going from 100 ms before stimulation
onset to 500 ms after. In c) and d) the four ICs are plotted, after scaling, against the
left and right original MEG signals.
4 Discussion
In this paper we have seen how to apply the recently developed statistical tech-
nique of independent component analysis to the processing of biomagnetic brain
recordings. In particular, we have seen that it is very well suited for extracting
different types of artifacts from EEG and MEG data, even in situations where
these disturbances are an order of magnitude weaker than the background brain
activity.
More than one sensing modality is often employed to perceive the
world. ICA has been shown to be able to differentiate between somatosensory and
auditory brain responses in the case of a complex vibrotactile stimulation. The
result obtained augurs the appearance of new sets of effective modality-sensitive
applications and studies. In addition to these findings, the experiment showed as well
that the independent components, found with no other modeling assumption
than the independence of the sources, exhibit field patterns that agree with
conventional dipolar source modeling. In fact, when we admitted that model,
the localizations of the equivalent source dipoles of the independent sources fell
on the expected brain regions for the particular stimulus.
Finally, in addition to the above result, the application of ICA to averaged
auditory evoked responses isolated the main response, with a latency of about
100 ms, from subsequent components. Furthermore, it discriminated between the
ipsi- and contralateral main responses of the brain. These decompositions may
lead to an increase in the understanding of the functioning of the human
brain, as a finer mapping of the brain's responses may be achieved.
Acknowledgment
The authors thank Professor Riitta Hari and Dr. Veikko Jousmäki, from the
Brain Research Unit of Helsinki University of Technology, for the MEG data
and for very valuable discussions on the results reported in this paper. We ex-
press as well our gratitude to Mr. Jaakko Särelä for his help in some of the
experiments.
References
1. FastICA MATLAB package. Available at the WWW address:
http://www.cis.hut.fi/projects/ica/fastica.
2. S. Amari. Blind source separation - mathematical foundations. In S. Amari and
N. Kasabov, editors, Brain-like Computing and Intelligent Information Systems,
pages 153-166. Springer, Singapore, 1997.
3. J. S. Barlow. Computerized clinical electroencephalography in perspective. IEEE
Trans. Biomed. Eng., 26:377-391, 1979.
4. A. Bell and T. Sejnowski. An information-maximization approach to blind sepa-
ration and blind deconvolution. Neural Computation, 7:1129-1159, 1995.
5. P. Berg and M. Scherg. A multiple source approach to the correction of eye arti-
facts. Electroenceph. clin. Neurophysiol., 90:229-241, 1994.
6. A. Cichocki and R. Unbehauen. Robust neural networks with on-line learning for
blind identification and blind separation of sources. IEEE Trans. on Circuits and
Systems, 43(11):894-906, 1996.
7. P. Comon. Independent component analysis - a new concept? Signal Processing,
36:287-314, 1994.
8. M. Hämäläinen, R. Hari, R. Ilmoniemi, J. Knuutila, and O. V. Lounasmaa.
Magnetoencephalography--theory, instrumentation, and applications to noninva-
sive studies of the working human brain. Reviews of Modern Physics, 65(2):413-
497, April 1993.
Abstract. Sensorimotor EEG rhythms are affected by motor imagery and can,
therefore, be used as input signals for an EEG-based brain-computer interface (BCI).
Satisfactory classification rates of imagery-related EEG patterns can be achieved
when multiple EEG recordings and the method of common spatial patterns are used
for parameter estimation. Data from 3 BCI experiments with and without feedback
are reported.
Sensorimotor EEG rhythms such as mu and central beta rhythms display an event-
related desynchronization (ERD) not only with execution of hand movement but also with
imagination of the same or a similar type of movement [8]. Imagination of right and left
hand movement can therefore be used as a mental strategy to realize an EEG-based brain
computer interface (BCI) [10]. Examples of high-resolution ERD maps based on a
realistic head model obtained from magnetic resonance imaging (MRI) during left and
right hand movement imagery are displayed in Fig. 1. It can be seen that the ERD is
circumscribed and localized over the contralateral sensorimotor hand area.
Fig. 1. ERD maps calculated for a realistic head model during imagination of left and right hand
movement. The ERD focus is indicated by dense "isopotential" lines
Although the imagery-related ERD forms a focus close to the hand representation area,
one or two EEG signals recorded either from one or both hemispheres are insufficient to
describe the state of brain activation during motor imagery. Therefore, it is understandable
that the BCI system, using either 1 or 2 EEG channels for parameter estimation and
control of cursor movement in 2 directions (e.g. cursor up and down), can achieve only an
accuracy of 80-90% after about 10 sessions [5,7,10]. It can be expected that the analysis
and classification of a large number of EEG signals recorded over sensorimotor areas may
improve the classification accuracy of a BCI.
It was shown recently by off-line analysis of 56-channel EEG data from a motor
imagery experiment that EEG patterns during left and right motor imagery could be
discriminated in 3 healthy subjects with an accuracy of 90.8%, 92.7% and 99.7%,
respectively [11]. For this discrimination the common spatial pattern (CSP) method was
used [3,6]. With this CSP-method variance-related feature vectors from 2 populations of
EEG patterns are extracted and used for classification. It is therefore of interest whether
the CSP-method can be used for on-line BCI sessions with continuous feedback [7] and
what classification accuracy can be achieved after, e.g., only 3 days of training.
The CSP-method leads to new time series that are optimal for discriminating 2
populations of EEG patterns related to right and left motor imagery. The method is based
on the simultaneous diagonalization of 2 covariance matrices [1]. The imagery-related
EEG pattern (E) recorded from m electrodes is multiplied by a mapping matrix W. The
first two and last two rows (time series) of the resulting matrix Z (Z = WE) are best suited
to discriminate the 2 populations of EEG patterns and are used to construct the weight
vector for the classifier. The components (features) used for classification are the
logarithms of the normalized variances of the time series obtained by spatial filtering (for
details see [6]).
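The CSP computation described above can be sketched as follows. The simultaneous diagonalization is carried out here by whitening the composite covariance, one standard way to realize it; the toy trials with class-dependent channel variance are illustrative assumptions, not the recorded EEG:

```python
import numpy as np

def csp_filters(trials_l, trials_r):
    """Common spatial patterns: simultaneously diagonalize the two class
    covariance matrices. trials_*: arrays E of shape (channels, samples)."""
    def mean_cov(trials):
        return np.mean([E @ E.T / np.trace(E @ E.T) for E in trials], axis=0)
    C_l, C_r = mean_cov(trials_l), mean_cov(trials_r)
    d, U = np.linalg.eigh(C_l + C_r)
    P = np.diag(d ** -0.5) @ U.T            # whitens the composite covariance
    d2, B = np.linalg.eigh(P @ C_l @ P.T)   # diagonalizes both classes at once
    return B.T @ P                          # rows of the mapping matrix W

def csp_features(E, W, n_pairs=2):
    """Log of the normalized variances of the first/last filtered time series."""
    Z = W @ E                               # Z = W E
    rows = np.vstack([Z[:n_pairs], Z[-n_pairs:]])
    var = rows.var(axis=1)
    return np.log(var / var.sum())

# Toy demo (illustrative assumptions): class "left" has extra variance on
# channel 0, class "right" on channel 1.
rng = np.random.default_rng(0)
def make_trials(strong_ch, n=30):
    trials = []
    for _ in range(n):
        E = rng.normal(size=(4, 500))
        E[strong_ch] *= 5.0                 # class-specific channel variance
        trials.append(E)
    return trials

W = csp_filters(make_trials(0), make_trials(1))
f_l = csp_features(make_trials(0)[0], W)
f_r = csp_features(make_trials(1)[0], W)
```

The first rows of W maximize the variance of one class and the last rows that of the other, so the resulting feature vectors of the two classes separate clearly even for single trials.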
Three students participated in the BCI experiment, all experienced with the BCI
(subjects g3, g7, i2). Each student imagined 80 left and 80 right hand movements per
session, whereby the side of imagination was indicated by an arrow on a monitor pointing
either to the left or to the right (for details see [2,10]). The experimental paradigm is shown
in Fig. 2.
Fig. 2. Experimental paradigm for EEG data collection during motor imagery without feedback.
EEG was recorded from 27 electrodes closely spaced over the left and right sensorimotor
areas. Amplified EEG signals, filtered between 8 and 30 Hz, sampled at 128 Hz and cleared of
artifacts, were used for calculating subject-specific common spatial filters and weight
vectors. All sessions with and without feedback were performed within only 3 days. A
typical example for one subject (g7) is given in Fig. 3.
Feedback (FB) was given in the form of the outline of a rectangle. Immediately after the
arrow (cue) disappeared, the feedback stimulus appeared in the center of the screen and
began to extend horizontally toward the right or left side. The subject's task was to extend
this feedback bar toward the left or right boundary of the screen, depending on the
direction of the arrow (cue stimulus; see also Fig. 2) presented before. During a 3.75-
second period the bar moved to the right or left side of the screen according to the
results of the on-line analysis (linear distance function as described before).
Fig. 3. Flowchart of 6 BCI sessions with and without feedback for subject g7 within 3 days.
Altogether 3 CSP filters and 4 weight vectors (WV) were calculated.
After applying the update procedure twice (see Fig. 3), in session 6 with FB on the 3rd day a classification
accuracy of 94% was achieved. The time courses of the on-line classification for all
subjects are displayed in Fig. 4.
Fig. 4. Time courses of the on-line classification error over a period of 6 seconds, starting
1 second before visual cue presentation (from second 3 to 4.25). Summarized data of all 3
subjects are shown. Subjects g7 and g3 participated in 4 and subject i2 in 5 sessions with
FB. Instead of the classification accuracy, the error rate (100% minus classification
accuracy) is displayed.
Subject i2 started, similar to subject g7, without any classification power (50%
classification accuracy) in session 2. After calculation of 3 spatial filters and 5 weight
vectors a classification accuracy of 96% was achieved in session 7 with FB.
In contrast to subjects g7 and i2, subject g3 started in the first FB-session with an
accuracy close to 80%. In the last FB-session the classification rate was 98%.
4 Conclusion
Acknowledgements
This research was supported by the "Fonds zur Förderung der wissenschaftlichen
Forschung" project P11208MED, the "Steiermärkische Landesregierung" and the
"Allgemeine Unfallversicherungsanstalt, AUVA" in Austria.
References
ABSTRACT
For about twenty years, functional exploration in otoneurology has possessed experimental
techniques to analyze objectively the state of nervous conduction along the auditory pathway:
the brainstem evoked response auditory. In this paper, we present a new
classification approach based on a hybrid neural network technique, focusing on this
biomedical application in order to develop a diagnostic tool. We have used two models of
artificial neural networks: Learning Vector Quantization and Radial Basis Function ones.
In our approach, these two neural networks are used to achieve the classification in a
serial multi-neural network configuration. A case study and experimental results are
reported and discussed.
1. INTRODUCTION
Artificial Neural Networks are information processing systems which allow the
elaboration of many original techniques covering a large field of applications. Among
their most appealing properties, we can cite their ability to learn and generalize,
and, for some of them, their ability to classify. On the other hand, classification
problems cover a large domain of applications such as signal processing, image
processing, biomedical diagnosis, etc. The problem of non-linear classification, classification
with an incomplete database or with a database showing a high rate of resemblance, is
difficult. Over the past decades, new approaches based on Artificial Neural Networks
have been proposed to solve such classes of problems [1]. Several studies have been made
of electrical signal classification in the field of test and diagnosis of analog circuits
[2][3][4][5]. ANN techniques have also performed well for classification tasks in the
biomedical field [6][7].
This paper is structured as follows. In the next section, we present the BERA signals. Then
we expose the approach based on the Multi-Neural Network (MNN) structure. In section 4,
we present the classification results obtained by using a database of 213 Brainstem
Evoked Response Auditory waveforms. A comparison study with the single RBF and
LVQ ANNs has been made. Finally, we conclude and give the prospects that follow from
our work.
The Brainstem Evoked Response Auditory (BERA) is generated as follows: the patient hears
clicking noises or tone bursts through earphones. The use of auditory stimuli evokes an
electrical response. In fact, the stimulus triggers a number of neurophysiological
responses along the auditory pathway. An action potential is conducted along the eighth
nerve, the brainstem and finally to the brain. A short time after the initial stimulation, the
signal evokes a response in the area of the brain where sounds are interpreted. These response
signals have small amplitude, and so they are frequently masked by the background noise
of electrical activity in the brain. However, the average of this noise equals
zero, so the response is extracted from the noise by the principle of
averaging. The response waveform consists of a series of five peaks numbered with
Roman numerals (waves I to V). Figure I (extracted from [11]) represents a perfect BERA.
This test provides an effective measure of the integrity of the auditory pathway up to the
upper brainstem level.
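The averaging principle behind the extraction can be illustrated numerically. The waveform shape, amplitudes, and noise level below are hypothetical, chosen only to show how averaging N stimulus-locked sweeps reduces zero-mean noise by a factor of sqrt(N):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 10e-3, 200)                  # 10 ms recording epoch
response = 0.5e-6 * np.sin(2 * np.pi * 500 * t)   # hypothetical evoked waveform (V)
noise_sd = 5e-6                                   # zero-mean background EEG noise (V)

# 800 stimulus-locked acquisitions: the same response buried in fresh noise
sweeps = response + rng.normal(0.0, noise_sd, size=(800, t.size))
average = sweeps.mean(axis=0)

# Averaging leaves the time-locked response intact while shrinking the noise
# standard deviation by sqrt(800) (about 28), revealing the buried waveform.
residual = average - response
```

Averaging in blocks of 16 sweeps instead, and stacking the 50 resulting partial averages, gives exactly the kind of estimation surface used in the TDC representation described below.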
A technique of extraction, presented in [11], allows us, after 800 acquisitions as
described before, to visualize the BERA estimation on averages of 16 acquisitions.
Thus, a surface of 50 estimations called the Temporal Dynamic of the Cerebral trunk (TDC)
can be visualized. The software developed for the acquisition and the processing of the
signal is called ELAUDY. It allows us to obtain the average signal, which corresponds to
the average of the 800 acquisitions, and the TDC surface. Figure II (extracted from [11])
shows two typical surfaces, one for a patient with normal audition (II-A) and the other
one for a patient who suffers from an auditory disorder (II-B). This figure shows the large
variety of BERA signals even for the same patient. Moreover, this software automatically
determines, from the average signal, the five significant peaks and gives the latencies of
these waves. It also allows us to record a file for each patient, which contains
administrative information (address, age, ...), the results of auditory tests and the doctor's
conclusions (pathology, cause, confidence index of the pathology, ...).
BERA signals and the TDC technique are important for diagnosing auditory pathologies.
However, medical experts still have to visualize all the auditory test results before making a
diagnosis.
Today, taking into account the progress accomplished in the area of intelligent
computation or artificial intelligence, it becomes conceivable to develop a diagnosis tool
assisting the medical expert. One of the first steps in the development of such a tool is the
classification of BERA signals.
The approach we propose to solve the posed problem is based on the Multi-Neural Network
(MNN) concept. An MNN could be seen as a neural structure including a set of similar
neural networks (homogeneous MNN architecture) or a set of different neural nets
(heterogeneous MNN architecture). On the other hand, both of the above-mentioned
architectures (homogeneous and heterogeneous MNNs) could be organized in different
manners. From a general point of view, three topologies [11] could characterize the MNN's organization:
• parallel organization: in this case, the ANNs are not inter-connected. The MNN
input is dispatched to all neural networks composing the structure.
• serial organization: in this case, the output of a given ANN composing the
structure is the input of the following ANN.
• serial/parallel organization: which combines the two organizations mentioned
above.
The problem (application) on which our efforts have been focused concerns the
classification of signals (signatures), where the signals to be classified may show a high
resemblance. In such a class of problems, a very fine separation should be performed in the
feature space (parameter space). So, the use of single neural structures could lead, on the
one hand, to a large number of neurons, and on the other hand, to a long learning process,
especially when the application must deal with real-time execution constraints: in our case,
the BERA signal classification is intended to be used as a part of the process in a
computer-aided medical diagnosis tool, and so, the execution time constraint should be
taken into account.
As has been mentioned and shown in the previous sections, the main difficulty in the
classification of BERA signals is related, on the one hand, to the large variety of such
signals for the same diagnosis result (the variation panel of corresponding BERA signals
could be very large), and on the other hand, to the close resemblance between such
signals for two different diagnosis results. A serial homogeneous MNN is equivalent to
a single neural structure with a greater number of layers with different neuron activation
functions, so the use of a homogeneous MNN with a serial organization is of no real
interest here. In the parallel homogeneous MNN configuration, each neural net operates as
an 'expert' (learning a specific characteristic of the feature space). So, the interest of
the parallel homogeneous MNN appears when a decision stage, to process the results
pointed out by the set of such 'experts', is associated with such an MNN structure. In this
case, the structure becomes a serial/parallel MNN, needing an optimization procedure to
determine the number of neural nets to be used.
The RBF model we use is a Weighted-RBF model, not a standard one, and so it performs
the feature space mapping associating a set of 'categories' (in our case a category
corresponds to a possible pathological class) with a set of 'areas' of the feature space. The
LVQ neural model belongs to the class of competitive neural network structures. It
includes one hidden layer, called the competitive layer. Even if the LVQ model has
essentially been used for classification tasks, the competitive nature of its learning
strategy (based on a 'winner takes all' strategy) makes it usable as a decision-classification
operator. On the other hand, the weighted nature of the transfer functions between the input
layer and the hidden one, and between the hidden layer and the output one, gives this model
a non-linear approximation capability, making such a neural net a function
'approximation operator'.
Taking into account the above analysis, the proposed serial MNN structure can be seen
as a structure associating a neural decision operator with a neural classifier. Moreover, the
proposed structure can also be seen as a global neural structure with two hidden
layers. So, the association of the two neural models improves the global order of the non-
linear approximation capability of the global neural operator, compared to each single
neural structure (here RBF or LVQ) constituting the MNN system. This technique makes it
possible to fill in the gap left by the RBF ANN, and thus to refine the classification.
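For illustration, the inference chain described above can be sketched as follows (a hypothetical numpy toy with made-up dimensions, parameter values, and prototypes; it is not the authors' implementation, and training of the centers, weights, and prototypes is omitted):

```python
import numpy as np

def rbf_scores(x, centers, width, weights):
    """RBF stage: Gaussian hidden units followed by a weighted output
    layer, mapping a feature vector to one score per category."""
    h = np.exp(-np.sum((centers - x) ** 2, axis=1) / (2.0 * width ** 2))
    return h @ weights

def lvq_decide(scores, prototypes, labels):
    """LVQ stage: winner-takes-all decision, assigning the class of the
    nearest prototype in the RBF-score space."""
    winner = int(np.argmin(np.sum((prototypes - scores) ** 2, axis=1)))
    return labels[winner]

# Serial heterogeneous MNN inference: RBF classifier, then LVQ decision.
# All values below are illustrative only.
centers = np.array([[0.0, 0.0], [1.0, 1.0]])     # RBF hidden-unit centers
weights = np.eye(2)                              # RBF output weights
prototypes = np.array([[1.0, 0.4], [0.0, 1.0]])  # LVQ codebook vectors
scores = rbf_scores(np.array([0.1, 0.0]), centers, 1.0, weights)
decision = lvq_decide(scores, prototypes, [0, 1])
```

The point of the serial arrangement is visible in the second stage: the LVQ decision operates on the RBF scores rather than on the raw 88-sample input.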
[Figure: the serial heterogeneous MNN structure, with the RBF ANN feeding the LVQ ANN.]
4. CASE STUDY AND EXPERIMENTAL RESULTS
In order to carry out this work, we have at our disposal a database which contains the BERA
signals and the associated pathology. This database contains the files of 11,185 patients. We
have decided to work on the average signal presented in section 2, and we choose three
categories of patients according to the type of their auditory disorder:
- normal: these patients have normal hearing
- endocochlear: these patients suffer from disorders which concern the
part of the ear before the cochlea
- retrocochlear: these patients suffer from disorders which concern the
part of the ear at the level of the cochlea and after the cochlea, such as
acoustic neuroma.
We select 213 signals: 92 belong to the normal class, 83 to the endocochlear class, and 38
to the retrocochlear class. Generally speaking, for a patient who has normal
hearing, the result of the TDC test is a regular surface. The waves are well synchronized
and stay stable in latencies, amplitude, and form. The results of the TDC test for patients
who suffer from an endocochlear disorder are much the same as for normal ones: the
latencies stay normal and the morphology of the TDC surface is not altered. One parameter
allows the medical expert to conclude an endocochlear disorder: the auditory level.
Finally, the retrocochlear disorders are characterized by an extension of the latencies and
non-synchronized waves, except for wave V, which can appear well synchronized
with amplitude modulation.
But, in reality, it is not so easy to distinguish one class from the other. The BERA signals
can differ between test sessions for the same patient, because they depend on
the relaxation of the person, the background noise, the test conditions, the signal-to-noise
ratio, and so on. The aim of this case study is to classify these signals using the MNN
technique presented in the above section.
The aim of classification by ANNs is to link a set of input vectors to a set of specific
output vectors. In our case, the components of input vectors are the samples of the BERA
average signals and the output vectors correspond to the different classes. The signals
corresponding to :
- a retrocochlear disorder are associated with class 1,
- an endocochlear disorder are associated with class 2,
- a normal case are associated with class 3.
In order to build our training database, we choose signals that come from patients whose
pathology is known with certainty. All BERA signals come from the same experimental
system. After the learning phase, when previously unseen signal vectors are presented to
the ANN, the corresponding class (type of disorder) must be assigned.
We used the RBF-LVQ based serial heterogeneous MNN described in the previous
section (Figure III).
Concerning the RBF ANN, the number of input neurons (88) corresponds to the
number of components of the input vectors. The output layer contains 3 neurons. The
number of neurons in the hidden layer (in this case, 20 neurons) has been determined by
learning.
For the LVQ ANN, the number of input cells is equal to the number of output cells of the
RBF ANN. The output layer of the LVQ ANN contains as many neurons as classes (3).
The number of neurons in the hidden layer (in this case, 8 neurons) has been determined
by considering the number of subclasses we can count within the 3 classes.
                          Real class
Class given by the ANN    Retrocochlear   Endocochlear   Normal
Retrocochlear                  27               4          10
Endocochlear                    8              45          19
Normal                          4              34          63
Table I - RBF+LVQ's results
The learning database has successfully been learnt. All of the learnt vectors are well
classified in the generalization phase. We can see that this network correctly classifies 63%
of the full database (including the learnt vectors), with a rate of correct classification of:
- 71% for the retrocochlear class,
- 54% for the endocochlear class,
- 68% for the normal class.
The behavior of the MNN concerning the retrocochlear and the normal classes permits a
high rate of correct classification for these classes. However, the classification rate for
endocochlear signals is not satisfactory: only about 55% of the vectors are well classified.
One can remark that when the network makes a mistake on an endocochlear vector, it
classifies it more often as a normal one rather than as a retrocochlear one. In the same way, a
misclassified normal vector is preferentially labeled endocochlear. This can be explained
by the fact that these results have been obtained without taking into account the auditory
threshold. This parameter is among the key parameters used to
distinguish normal hearing from an endocochlear disorder. The results obtained when
considering this parameter are presented in Table II.
To evaluate our MNN approach against single RBF or LVQ ANN based techniques, we
have compared the obtained results with the results for these two cases.
The structures of the RBF and LVQ ANNs for the respective single approaches are
composed as follows:
- the number of input neurons for the RBF and LVQ ANNs corresponds to the
number of components of the input vectors,
- the output layer of the RBF and LVQ ANNs contains 3 neurons, corresponding to
the 3 classes,
- for the RBF ANN, the number of neurons in the hidden layer (in this case, 22
neurons) has been determined by learning,
- for the LVQ ANN, the number of hidden neurons (in this case, 10 neurons) has
been determined by considering the number of subclasses we can count within
the 3 classes.
For the RBF ANN, the learning database contains 24 signals: 11 of them are retrocochlear,
6 endocochlear, and 7 normal. For the LVQ ANN, the learning database contains 20
signals: 6 of them are retrocochlear, 7 endocochlear, and 7 normal. The results we obtain
are given in the following table (Table III).
In both cases, the learning database has successfully been learnt. All of the learnt
vectors are well classified in the generalization phase. The RBF network correctly classifies
62.5% of the full database (including the learnt vectors), with a rate of correct
classification of:
- 61% for the retrocochlear class,
- 58% for the endocochlear class,
- 68% for the normal class.
The LVQ network correctly classifies 59% of the full database (including the learnt vectors),
with a rate of correct classification of:
- 72% for the retrocochlear class,
- 57% for the endocochlear class,
- 57% for the normal class.
Comparing these two single ANN based approaches with our proposed MNN technique,
one can remark that:
- similar performance has been obtained for the normal class by the MNN
technique and the single RBF one,
- similar performance has been obtained for the retrocochlear class by the MNN
technique and the single LVQ one,
- performance is improved for the normal class by the MNN technique
compared to that obtained by the single LVQ approach,
- performance is improved for the retrocochlear class by the MNN technique
compared to that obtained by the single RBF approach.
Therefore, the MNN structure combines the advantages of both the LVQ and RBF ANNs.
Moreover, the high classification rates of our MNN technique are achieved with a low
number of neurons in the ANN architecture, taking into account the specificity of our
problem.
5. CONCLUSION
The MNN we propose involves the Learning Vector Quantization (LVQ) and Radial Basis
Function (RBF) neural models. The first neural net (RBF ANN) is used as a classifier,
and the second one (LVQ ANN) as a competitive decision processor. The RBF model
performs the feature space mapping associating a set of 'pathological classes' with a set of
'areas' of the feature space. Because of the competitive nature of the LVQ model's
learning strategy, this ANN is used, in our case, as a decision-classification operator.
Moreover, the proposed structure can also be seen as a global neural structure with two
hidden layers. So, the association of the two RBF and LVQ neural models improves the
global order of the non-linear approximation capability of the global neural operator,
compared to each single neural structure constituting the MNN system.
To evaluate the capability of this technique, we have classified BERA average signals for
three categories of patients according to the type of their auditory disorder: normal
hearing, endocochlear, and retrocochlear disorders. Our proposed Multi-Neural Network
architecture allows us to keep the advantages of both the RBF (classification rate of 68%
for the normal class) and LVQ (classification rate of 72% for the retrocochlear class) ANNs,
and improves the classification rate in a fine classification problem (classification rates of
71% for the retrocochlear class, 84% for the normal class, and 87% for the endocochlear
class). Moreover, these results are achieved with a low number of neurons in the ANN
architecture, taking into account the specificity of our classification problem.
The results we obtained are encouraging and show the feasibility of a neural network
based tool for diagnosis support. The field of BERA classification remains wide, and this
work should be carried on.
ACKNOWLEDGEMENTS
The database we have used belongs to the CREFON (Center of Research and Functional
Investigation in Oto-Neurology). We wish to thank its members, especially Dr. M.
Ohresser, for her help.
REFERENCES
[1] WIDROW B., LEHR M.A., "30 Years of Adaptive Neural Networks: Perceptron,
Madaline, and Backpropagation", Proceedings of the IEEE, Vol. 78, pp. 1415-1441, 1990.
[2] BENGHARBI A., "Contribution au test et diagnostic des circuits analogiques par des
approches basées sur des techniques neuronales", PhD thesis report, University of
Creteil - Paris XII, 1997.
[3] AMARGER V., BENGHARBI A., MADANI K., "A New Approach to Fault Diagnosis
of Analog Circuits using Neural Networks Based Techniques", IEEE European Test
Workshop 96, Montpellier, June 12-14, 1996.
[5] MADANI K., BENGHARBI A., AMARGER V., "Neural Fault Diagnosis Techniques
for Non Linear Analogue Circuits", SPIE'97, Orlando, 1997.
[6] BAZOON M., STACEY D.A., CUI C., "A Hierarchical Artificial Neural Network
System for the Classification of Cervical Cells", IEEE International Conference on
Neural Networks, Orlando, July 1994.
[7] ALPSAN D., "Auditory Evoked Potential Classification by Unsupervised ART 2-A and
Supervised Fuzzy ARTMAP Networks", IEEE International Conference on Neural
Networks (ICNN), Orlando, July 1994.
[8] KOHONEN T., "Learning Vector Quantization", Neural Networks, Vol. 1, Suppl. 1, p.
303, 1988.
[9] KOHONEN T., "Self-Organization and Associative Memory", 3rd ed., Springer-Verlag,
Germany, 1989.
[10] NIRANJAN M., FALLSIDE F., "Neural Networks and Radial Basis Functions in
Classifying Static Speech Patterns", Report CUED/F-INFENG/TR22, Cambridge
University, England, 1988.
1 Introduction
We study the effect of applying ICA and eICA to EEG data on classification
performance, using standard power spectral density (PSD) signal representations
and feedforward neural network classifiers.
1.2 ICA
ICA is a method for blind source separation. It assumes that the observed signals
are produced by a linear mixture of source signals. Thus, the original source sig-
nals could, in principle, be recovered from the observed signals by running the
observed signals back through the inverted mixing matrix. Computationally-
intensive matrix inversions can be avoided, however, with recent relaxation-based
ICA algorithms [3]. These algorithms derive maximally independent components
u_i by maximizing their joint entropy, which is equivalent to minimizing
the components' mutual information. The joint entropy is maximized with
respect to the unmixing matrix W. The result is a simple rule for evolving W
in an iterative, gradient-based algorithm.
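The resulting update can be sketched as follows (a hedged reconstruction of the standard natural-gradient infomax rule; the exact variant used in [3] may differ in detail):

```latex
\Delta W \;\propto\; \left[\, I + (\mathbf{1} - 2\,g(u))\,u^{T} \right] W,
\qquad u = W x, \qquad g(u_i) = \frac{1}{1 + e^{-u_i}}
```

where g is a logistic nonlinearity applied componentwise; the trailing factor of W is the natural-gradient scaling that removes the need for an explicit matrix inversion.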
It is reasonable to apply ICA to EEG data because EEG signals measured
on the scalp are the result of linear filtering of underlying cortical activity [5,
7]. However, ICA assumes that all of the underlying sources have similar super-
Gaussian probability density functions. It is unknown how well EEG "sources"
follow this assumption, but it is reasonable to assume that some may not. A
recent extension to ICA, extended ICA, takes a first step toward addressing this
issue.
Extended ICA [5] provides the same type of source separation as ICA, but also
allows some sources to have sub-Gaussian distributions. The learning rule for
the unmixing matrix W is modified to be a function of the data's normalized
4th-order cumulant, or kurtosis,

    k4(u_i) = E[u_i^4] / (E[u_i^2])^2 - 3,

where k4 is the kurtosis and u_i is the i-th activation. Periodically during the course
of learning, the kurtosis is calculated and the learning rule adjusted according
to the kurtosis sign. Positive kurtosis is indicative of super-Gaussian distribu-
tions, and negative kurtosis of sub-Gaussian distributions. By accommodating
sub-Gaussian distributions in the data, eICA should provide a more accurate
decomposition of multi-channel EEG data, particularly if different underlying
sources follow different distributions.
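The sign test at the heart of this switching rule can be illustrated with a minimal numpy sketch (illustrative only, not the eICA implementation itself):

```python
import numpy as np

def kurtosis(u):
    """Normalized 4th-order cumulant: positive for super-Gaussian data,
    negative for sub-Gaussian data, near zero for Gaussian data."""
    u = u - u.mean()
    return np.mean(u ** 4) / np.mean(u ** 2) ** 2 - 3.0

rng = np.random.default_rng(0)
k_super = kurtosis(rng.laplace(size=100_000))            # super-Gaussian: k4 > 0
k_sub = kurtosis(rng.uniform(-1.0, 1.0, size=100_000))   # sub-Gaussian: k4 < 0
```

In eICA, the sign of this statistic selects which form of the learning rule is applied to each component.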
2 Methods
Ten 10-second trials were given to each subject for each of three tasks (referred to
below as the base, letter, and math tasks). Subjects kept their eyes open during the
trials and were asked to avoid blinking. EEG data was collected from six channels of the
International 10-20 System: C3, C4, P3, P4, O1, O2, referenced to linked mastoids. EOG
data was also collected to provide a reference for eye blinks. All signals were sampled at
250 Hz. Further details are provided in [6].
Despite instructions to avoid eye blinks, many of the trials contain one or more
eye blinks. Two categories of schemes were used for handling the eye blinks: 1)
the 'threshold' approach and 2) ICA. With the threshold approach, eye blinks
were detected by at least a 100 µV change in less than 100 msec in the EOG
channel. The subsequent 0.5 sec window of the trial was removed from further
consideration.
With the ICA approach, eye blinks are "subtracted" rather than explicitly
detected, and no portion of the trials is thrown out. ICA is performed on the
combination of the EOG and six EEG channels. The number of activations specified
was the same as the number of input channels: seven. As a result, activity in
the EEG channels that is closely correlated with the activity of the EOG channel
is separated and placed in one activation, as illustrated in Figure 1 for the first
five seconds of one trial of the base task. Notice that the eye blinks in the EOG
channel influence even the most posterior EEG recordings at channels O1 and
O2. The ICA activations show the eye blink activity in only one component.
Thus, eye blink activity reflected in the EEG channels is "subtracted" from
those EEG channels. The activation containing the EOG activity can be trans-
parently detected, because it is the one with the highest correlation to the
original EOG data. The remaining activations are retained as the "eye-blink
subtracted" independent components of the EEG data. Thus, with the ICA ap-
proach, the full trial of EEG data is used for all trials, regardless of the number
or distribution of eye blinks in those trials. Within the ICA-based category of
eye blink removal schemes, three specific forms of ICA were used:
- ICA
- Extended ICA (i.e. the algorithm chooses the number of sub-Gaussian com-
ponents to use)
- Extended ICA with fixed number of sub-Gaussian components
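The ICA-based "subtraction" described above can be sketched as follows (a hypothetical numpy outline of the reprojection step only; in practice the unmixing matrix W would come from one of the ICA variants listed above):

```python
import numpy as np

def remove_blink_component(X, W, eog):
    """X: channels x samples (EOG plus EEG); W: square ICA unmixing matrix;
    eog: the recorded EOG channel. Zeroes the activation most correlated
    with the EOG and projects back to the channel domain."""
    U = W @ X                                            # ICA activations
    corr = [abs(np.corrcoef(u, eog)[0, 1]) for u in U]
    U[int(np.argmax(corr))] = 0.0                        # drop blink activation
    return np.linalg.inv(W) @ U                          # blink-subtracted channels

# Synthetic check: two sources (a sinusoid and a blink-like spike train),
# mixed into two channels; the spike train plays the role of the EOG.
t = np.linspace(0.0, 1.0, 1000)
s_brain = np.sin(2.0 * np.pi * 5.0 * t)
s_blink = np.zeros_like(t)
s_blink[::100] = 5.0
A = np.array([[1.0, 0.5], [0.3, 1.0]])                   # mixing matrix
X = A @ np.vstack([s_brain, s_blink])
cleaned = remove_blink_component(X, np.linalg.inv(A), s_blink)
```

With a known mixing matrix, the cleaned channels contain only the projected brain source, which is exactly the "full trial retained" property exploited above.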
[Figure 1: the EOG channel, the six EEG channels, and the seven ICA activations for the first five seconds of one base-task trial; the eye blink activity is isolated in a single activation.]
Thus, a total of four different schemes were used to remove eye blinks and rep-
resent the "blink-free" signals: thresh (for eye blink removal using threshold
detection, as described above), ICA, eICA, and eICA_f (for eICA with a fixed
number of sub-Gaussian components). Our objectives were not only to see how cognitive
task classification performance varies as a function of the eye blink removal ap-
proach, but also to see how it varies as a function of the number of sub-Gaussian
components in the ICA representation.
Each classifier was a feedforward neural network with a single linear output node. The number of sigmoidal hidden nodes was varied over [0 1
2 3 5 10]. By including zero hidden nodes as one of the network architectures,
we are effectively assessing how well a simple linear perceptron can classify the
data. Network inputs were given not only to the hidden layer, but also to the
output node, in a cascade-forward configuration. Thus, network classifications
were based on a combination of the non-linear transformation of the input fea-
tures provided by the hidden layer as well as a linear transformation of the input
features given directly to the output node.
The networks were given input feature vectors normalized so that each feature
has a N(0,1) distribution. The networks were trained with Levenberg-Marquardt
optimized backpropagation [4]. Training was terminated with early stopping,
with the data set partitioned into 80, 10, and 10% portions for training, valida-
tion, and test sets, respectively. The mean and standard deviation of classifica-
tion accuracy reported in the results section reflect the statistics of 20 randomly
chosen partitions of the data and initial network weights.
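The preprocessing just described can be sketched as follows (a minimal numpy version of the N(0,1) normalization and the 80/10/10 partition; the Levenberg-Marquardt training step itself is not reproduced):

```python
import numpy as np

def normalize_features(F):
    """Rescale each feature (column) to zero mean and unit variance."""
    return (F - F.mean(axis=0)) / F.std(axis=0)

def split_80_10_10(n_samples, rng):
    """Randomly partition sample indices into train/validation/test sets."""
    idx = rng.permutation(n_samples)
    n_train, n_val = int(0.8 * n_samples), int(0.1 * n_samples)
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])
```

Repeating the split with fresh permutations (and fresh initial weights) gives the 20 random partitions over which the reported statistics were computed.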
3 Results
The best classification accuracy for each different eye blink removal scheme over
all network architectures is shown in Figure 2. For the eICA_f scheme, the per-
formance shown is for the best number of sub-Gaussian components. The per-
formance is statistically similar across the different schemes. In all cases except
ICA on the letter v. math pair, mean classification accuracies are over 90%. For
both task pairs, eICA and eICA_f perform statistically as well as the thresh
scheme.
Figure 3 shows how classification accuracy varies with the size of the neural
network's hidden layer. For the thresh scheme, the linear neural networks (i.e.
zero non-linear hidden layer nodes) perform about as well as the non-linear net-
works. Thus, the simple thresh scheme seems to represent the data's features in
a linearly-separable fashion. However, with all three of the ICA-based schemes,
performance tends to improve with the size of the hidden layer, then decrease
again as the number of hidden nodes is increased from five to ten. Notice that
eICA and eICA_f perform about as well as thresh when networks of sufficient
hidden layer size are used for the classification. Apparently the eICA repre-
sentations produce feature vectors whose class distinctions fall along non-linear
feature space boundaries. Notice that for the base v. math task pair, the mean
performance with eICA_f is greater than that of thresh for all of the non-linear
networks.
So are there specific numbers of sub-Gaussian components for which perfor-
mance is better than others? We explored this question, analyzing task pair clas-
sification accuracy while varying the number of fixed sub-Gaussian components
used in the eICA_f scheme. The results are summarized in Figure 4. Notice that
for both task pairs, classification performance is indeed a function of the number
of sub-Gaussian components. Also, the variability in performance is consistent
across different size networks. For both task pairs, performance is about max-
imum when the number of sub-Gaussians is four, and decreases steadily with
additional sub-Gaussian components. However, the classification performance
differs markedly between the task pairs when the number of sub-Gaussian com-
ponents is less than four. Perhaps with the base task the underlying sources have
fewer sub-Gaussian components, making the choice of fewer fixed sub-Gaussian
components in our representation helpful for classification.
4 Discussion
We have shown that eICA can be used to subtract eye blinks from EEG data and
still provide a signal representation conducive to accurate cognitive task classi-
fication. We have also provided preliminary evidence that eICA-based schemes
can generalize across different cognitive tasks. In both cases, however, it was
necessary to use non-linear neural networks to achieve the same performance
as was attained with a simple thresholding eye blink removal scheme and linear
neural network classifiers. Further work needs to be done to assess the sensitivity
of these results to different cognitive tasks.
By using a combination of ICA and artifact-correlated recording channels
(e.g. the EOG channel) for artifact removal, eye blinks were removed without a
hard-coded definition of an eye blink, such as magnitude thresholds. This approach
could generalize to other artifact sources. If, for example, specific muscle activity
is interfering with EEG signals in a specific cognitive task monitoring setting,
Fig. 3. Classification performance as a function of hidden layer size, for the base v. math and letter v. math task pairs. (Error bars omitted for clarity. For most data points, σ < 4.)
Fig. 4. Classification performance as a function of the number of sub-Gaussian components used in the eICA_f scheme, for the base v. math and letter v. math task pairs and for several hidden layer sizes.
then this approach could be used to subtract the myographic activity from the
EEG signals by including the appropriate electromyographic (EMG) reference
channel in the ICA decomposition.
References
INTRODUCTION
Inspired by recent physiological studies in animals, which report synchronized activity of
cortical neurons during the processing of visual stimuli (Corbetta, 1998; Goldman-Rakic,
1988; LaBerge et al., 1992; Lopes da Silva, 1991; Mesulam, 1990; McIntosh et al., 1994;
Rees et al., 1997; Webster et al., 1993; Wright & Liley, 1996), we hypothesized that the
analysis of the coherence of phase-locked ERP activity would carry information about the
interactions between distant brain areas involved in visual attention. Event-related brain
potentials (ERPs) are averages of the electroencephalogram (EEG) which are time-locked
to the presentation of a sensory event. ERPs can measure the activity of distant areas of the
cortex with high temporal resolution. Because coherence describes the phase-locked
component shared by two signals, large-scale cortical interactions can be detected over the
whole cortex (Sarnthein et al., 1998; Wright & Liley, 1996). Following this rationale, we
performed a coherence analysis on human scalp EEG recorded while subjects accomplished
a visual attention task in order to test two hypotheses:
In this report we present evidence that focused attention to objects in the visual scene
brings about a corresponding increase in the spectral coherence of the EEG signal recorded
from temporo-parietal and parieto-occipital brain areas. This pattern of synchronization
involved left temporo-parietal rather than frontal regions.
METHODS
Sixteen right-handed young volunteers (8 women and 8 men; age range 19-28 years, mean
= 20.8 years) with normal or corrected vision and no history of neurological or psychiatric
problems were recruited from colleges on the University campus. Subjects were informed
of all aspects of the research and signed a consent form approved by the Ethical Committee
of the Brain Mapping Unit. Subjects were paid for their participation.
Each trial began with the onset of a compound stimulus containing the four WCST key-
cards on top of one choice-card, all centered on the screen. The compound stimulus
subtended a visual angle of 4° horizontally and 3.5° vertically. Subjects were instructed to
match the choice-card with one of the four key-cards following one of three possible
sorting principles: number, color, or shape. The correct sorting principle could be
determined on the basis of feedback which was delivered 1900 ms after each response
through a computer-generated tone (2000 Hz for correct, 500 Hz for incorrect). Responses
were made with a 4-button panel. The length of the WCST series varied randomly between
6 and 9 trials. The inter-trial interval varied randomly between 3000 and 4000 ms. The task
consisted of two blocks of 18 series each. The order of choice-cards within the series was
determined on a semi-random basis so that the first four sorts in the series could be made
unambiguously. Elimination of ambiguity eased the correction of the test and improved the
signal-to-noise ratio in the ERPs. The average duration of each block was 12 min, with a 5
min rest period between blocks.
The electroencephalogram (EEG) was recorded from tin electrodes at positions Fp1, Fp2,
T7, T8, C3, C4, PO7, PO8, O1, and O2 of the extended 10-20 system (American
Electroencephalographic Society, 1994) and referenced to the left mastoid (Figure 1). The
EEG was amplified with a band pass from DC to 30 Hz (12 dB/octave roll-off) and
digitized at 250 Hz over a 1700 ms epoch with a 200 ms baseline. Impedances were kept
below 5 kΩ. The electrooculogram (EOG) was also recorded for blink correction. Trials
with remaining muscle or movement artifacts were discarded. Separate averages were
computed for early and late WCST trials. The second and third trials across series were
averaged into a 'SHIFT' waveform, and the last two trials were averaged into an
'ATTEND' waveform. A linked-mastoid reference was computed off-line for the averaged
data.
Coherence for two signals, x and y (Cxy), is equal to the average cross power spectrum
normalized by the averaged powers of the compared signals. Coherence is the frequency
domain equivalent of the cross-covariance function and is a measure of the similarity of
two signals. Its value lies between zero and one, and it estimates the degree to which
phases at the frequency of interest are dispersed. Coherence estimates were computed on
the pre-stimulus and post-stimulus periods of the averaged stimulus-locked ERP signal
among all possible pairs of electrodes. Separate coherence estimates were obtained for the
'SHIFT' and 'ATTEND' conditions. The increase in mean coherence between the
pre-stimulus and post-stimulus periods, as well as between the SHIFT and ATTEND
conditions, was evaluated with a series of paired t-tests.
[Figure 2: averaged ERP waveforms for the SHIFT and ATTEND conditions.]
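The quantity described above corresponds to the standard magnitude-squared coherence; a minimal numpy sketch (a segment-averaged periodogram version, offered only as an illustration of the definition, not as the authors' exact estimator) is:

```python
import numpy as np

def magnitude_squared_coherence(x, y, nperseg=256):
    """Cxy(f) = |<X Y*>|^2 / (<|X|^2> <|Y|^2>), averaged over
    non-overlapping segments; values lie between zero and one."""
    nseg = len(x) // nperseg
    X = np.fft.rfft(x[:nseg * nperseg].reshape(nseg, nperseg), axis=1)
    Y = np.fft.rfft(y[:nseg * nperseg].reshape(nseg, nperseg), axis=1)
    Pxy = (X * np.conj(Y)).mean(axis=0)        # averaged cross power spectrum
    Pxx = (np.abs(X) ** 2).mean(axis=0)        # averaged auto power spectra
    Pyy = (np.abs(Y) ** 2).mean(axis=0)
    return np.abs(Pxy) ** 2 / (Pxx * Pyy)
```

With a single segment the estimate is identically one, which is why averaging (here over segments; in the study, over the trials entering the ERP average) is essential.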
RESULTS
For the SHIFT condition, the mean coherence of the pre-stimulus period did not differ
significantly from the mean coherence of the post-stimulus period. This was so for all
electrodes tested. For the ATTEND condition, there was an increase in mean coherence
between the pre-stimulus and post-stimulus periods in central, temporal, and parieto-
occipital electrodes (P < 0.02), but not in frontal sites.
A summary of results for the test of differences in coherence between the SHIFT and
ATTEND conditions is presented in Table 1. Figure 3 displays a line connecting all
electrode pairs which showed a significant increase in coherence (P < 0.01 or better).
Table 1. Differences in coherence between the SHIFT and ATTEND conditions.

INTRAHEMISPHERIC COHERENCE
T3->C3: 0.03   T3->P3: 0.13*  T3->O1: 0.13*
T4->C4: 0.00   T4->P4: 0.00   T4->O2: 0.00
C3->P3: 0.06   C3->O1: 0.07   C4->P4: 0.10   C4->O2: 0.07
P3->O1: 0.01   P4->O2: 0.01

INTERHEMISPHERIC COHERENCE
T3->C4: 0.08*  T3->P4: 0.15*  T3->O2: 0.12*
T4->C3: 0.03   T4->P3: 0.04   T4->O1: 0.05
C3->P4: 0.14*  C3->O2: 0.04   C4->P3: 0.11   C4->O1: 0.10*
P3->O2: 0.10*  P4->O1: 0.16*

* P < 0.01
DISCUSSION
We have observed enhanced EEG coherence specifically associated with the phasic
deployment of attention to visual stimuli. This enhancement was not present while the
subject was in the process of shifting attention between stimulus dimensions, but appeared
while the person's attention was concentrated on one aspect of the stimulation. This
outcome indicates that the increase in coherence is a phenomenon specifically associated
with attention, rather than with any other physical property of the stimulation. As expected,
the pattern of coherence was larger during the ATTEND condition than during the SHIFT
condition (Barceló & Rubia, 1998; Berman et al., 1995). Figure 3 illustrates the pattern of
connectivity between those temporal and parieto-occipital areas that registered the larger
increment in coherence. The left temporal area experiences the largest increases in
coherence.
REFERENCES
Barceló, F. & Rubia, F.J. (1998) Non-frontal P3b-like activity evoked by the Wisconsin
Card Sorting Test. Neuroreport, 9, 747-751.
Barceló, F., Sanz, M., Molina, V. & Rubia, F.J. (1997) The Wisconsin Card Sorting Test
and the assessment of frontal function: A validation study with event-related
potentials. Neuropsychologia, 35, 399-408.
Berman K.F., Ostrem J.L., Randolph C., Gold J., Goldberg T.E., Coppola R., Carson R.E.,
Herscovitch P. & Weinberger D.R. (1995) Physiological activation of a cortical
network during performance of the Wisconsin card sorting test: a positron emission
tomography study. Neuropsychologia 33, 1027-1046.
Corbetta, M. Frontoparietal cortical networks for directing attention and the eye to visual
locations: Identical, independent, or overlapping neural systems? (1998) Proc. Natl.
Acad. Sci. USA 95, 831-838.
Getting, P.A. (1989) Emerging principles governing the operation of neural networks.
Annual Review of Neuroscience 12, 185-204.
Goldman-Rakic P.S. (1988) Topography of cognition: Parallel distributed networks in
primate association cortex. Annual Review of Neuroscience 11, 137-156.
Horner, M.D., Flashman, L.A., Freides, D., Epstein, C.M. & Bakay, R.A. (1996) Temporal
lobe epilepsy and performance on the Wisconsin Card Sorting Test. Journal of
Clinical and Experimental Neuropsychology, 18, 310-313.
LaBerge, D., Carter M., & Brown V. (1992) A network simulation of thalamic circuit
operations in selective attention. Neural Computation, 4, 318-331.
Lopes da Silva F. (1991) Neural mechanisms underlying brain waves: from neural
membranes to networks. Electroencephalography and Clinical Neurophysiology, 79,
81-93.
McIntosh, A.R., Grady, C.L., Ungerleider, L.G., Haxby, J.W., Rapoport, S.I. & Horwitz, B.
(1994) Network analysis of cortical visual pathways mapped with PET. J. Neurosci.
14, 655-666.
Mesulam, M.M. (1990) Large-scale neurocognitive networks and distributed processing for
attention, language, and memory. Annals of Neurology, 28, 597-613.
Milner, B. (1963) Effects of different brain lesions on card sorting. Archives of Neurology,
9, 90-100.
Rees, G., Frackowiak, R., & Frith, C. (1997) Two modulatory effects of attention that
mediate object categorization in human cortex. Science 275, 835-838.
Sarnthein, J., Petsche, H., Rappelsberger, P., Shaw, G.L. & von Stein, A. (1998)
Synchronization between prefrontal and posterior association cortex during human
working memory. Proceedings of the National Academy of Sciences USA, 95,
7092-7096.
Webster, M.J., Bachevalier, J. & Ungerleider, L.G. (1993) Connections of inferior temporal
areas TEO and TE with parietal and frontal cortex in macaque monkeys. Cerebral
Cortex 5, 470-483.
Wright J.J. & Liley D.T.J. (1996) Dynamics of the brain at global and microscopic scales:
Neural networks and the EEG. Behavioral and Brain Sciences, 19, 285-320.
A Bioinspired Hierarchical System for Speech
Recognition
1. Introduction
The field of artificial speech recognition is evolving continuously because its applications
involve very different systems, such as bank transactions, friendly interactive user
information, bioengineering, forensic evaluation, automatic translation, language
assistance, aids for handicapped people, and so on. The proposed systems must be user
independent, they must be able to recognize a considerable number of words, they
must handle continuous speech, and they must maintain their performance in adverse
environments.
Human listeners can recognize the speech of different talkers, at different rates, with distinct
accents, and even under noisy conditions. A detailed understanding of how speech
is processed by humans could help in the design of bio-inspired systems which share the
main characteristics of biological systems: accuracy and robustness. If speech signals
are coded in the same way the auditory periphery codes them, the information could later be
extracted, with the desired properties, by a model inspired by the central auditory system.
weights of the different nodes of the map, which will code in their internal organization the
characteristic temporal evolution of the speech units. This kind of network is also
biologically plausible.
2. Speech Production
In order to study the main components of speech sounds, we must first understand
how speech is produced. The process of generating speech begins in the lungs, whose
constrictions force air out through the trachea to the larynx, which contains the vocal
cords and the glottis. The air can follow two paths: it can reach the outside through the
vocal tract, which begins at the vocal cords and ends at the lips, or it can pass through
the nasal cavity. The air flow is regulated by the velum.
The air can arrive at the tract in the form of vibration of the vocal cords (this process
produces voiced phonemes) or in the form of breath noise (unvoiced phonemes will be
heard). The periodicity of this vibration is called the fundamental frequency or pitch.
This flow of air is affected by many physical factors in the vocal tract,
including the position of the tongue, dental effects, the position of the velum, the
movement of the lips, etc. All these physical processes act as a resonator. The natural
resonances of the vocal tract correspond to the poles of the transfer function and
are called formants. They provide the most important cue for recognizing phonemes.
Formants are identified by a number in order of increasing frequency: F1, F2, etc. F1
is the first resonant frequency and, for voiced speech, it generally lies in the range of
250 to 800 Hz. F2 has a wider range, from 600 to 3600 Hz. Formants F3, F4 and F5
may be present in voiced speech; however, the lowest two formants are usually
sufficient to identify specific phonemes.
When spectrograms of speech signals are examined, three distinct components can
be observed. There are horizontal bars defined precisely at certain characteristic
frequencies: the (CF) components, which correspond to static formants. Some oblique
bars appear at the beginning or end of prior elements; they correspond to
transitions between formants and are called frequency-modulated (FM)
components. The last elements that can be observed are certain bands of energy at
frequencies above 2 kHz, corresponding to noise-burst (NB) components. These three
elements are shared by human speech and animal communication sounds [1].
There exist specific groups of neurons for detecting each one of these components
and combinations of them [2][3][4]. In humans, CF elements correspond to vowels
and vowel-like sounds (e.g., nasals), stops are identified by FM elements, while
fricatives have characteristic NB components (Figure 1A).
Vowels are steady-state voiced sounds. This produces a quasiperiodic waveform
with fixed formants for the duration of the vowel. If we set up a co-ordinate system using F1
and F2 as a basis, vowels lie in specific regions. The exact positioning in the F1-F2
space varies with age, sex, language, and from one talker to another, but the overall
clustering pattern does not vary (Figure 1C). However, most of the consonants (stops)
require precise dynamic movements of the vocal-tract articulators for their
production. These articulatory movements make the formant tracks change
continuously, so in the formant spectrum these consonants will be recognized by the
different transitions (FM components) to the steady states that identify vowels, and by
NB elements at the beginning, which correspond to the initial scatter of sound
energy. Consonants will be recognized by the transition part of both formants, not the
steady state, which is responsible for discriminating between vowels. But
detecting an ascending/descending transition in the formants is not enough to recognize
consonants. We must detect the exact slope of the formant transition in order to
identify such phonemes. For the Spanish syllables [ba], [da] and [ga], the only difference is in the
slope of the transition of formant F2: while the transition in F1 is the same for all of
them, a high slope of the ascending formant F2 identifies [ba], a high slope of the descending
formant identifies [ga], and a low descending transition in F2 represents [da] (Figure 1D). The
same spectrograms, but with certain delays in the onset of voicing, describe the
stops [pa], [ta] and [ka] (Figure 1B).
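The rule described above, distinguishing [ba], [da] and [ga] by the direction and steepness of the F2 transition while F1 behaves identically, can be sketched as follows. The slope threshold is an illustrative assumption, not a value from the paper:

```python
def classify_stop(f2_start_hz, f2_end_hz, duration_ms=10.0, steep=20.0):
    """Classify a Spanish voiced stop from its F2 transition.

    steep: slope magnitude (Hz/ms) above which a transition counts as
    'high slope'; an illustrative threshold, not taken from the paper.
    """
    slope = (f2_end_hz - f2_start_hz) / duration_ms  # Hz per ms
    if slope >= steep:
        return "ba"      # high ascending F2 slope
    if slope <= -steep:
        return "ga"      # high descending F2 slope
    return "da"          # low-slope (gently descending) transition

print(classify_stop(900, 1200))   # F2 rising from 900 to 1200 Hz
print(classify_stop(2000, 1200))  # F2 falling from 2000 to 1200 Hz
```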
Fricatives are produced by exciting the vocal tract with a steady air stream that becomes
turbulent at a certain point of constriction. This point of constriction is used to
distinguish between different fricatives. In the spectrogram there are NB
elements centred around certain characteristic frequencies.
Fig. 1. (A) CF, FM and NB components of speech sounds. (B) Voice onset time (VOT) of the stops: b < p, d < t, g < k. (C) Vowel clusters in the F1-F2 plane. (D) Formant transitions for [ba], [da] and [ga].
Signals arrive at the cochlea through the outer and middle ear, where no relevant
frequency computation is made; these centers just amplify the signal level. The first
important processing occurs in the basilar membrane, inside the cochlea. The
basilar membrane has cross striations, much like the strings of a piano, and its apical
end is much wider than the basal one, so the striations resonate at different
frequencies; the overall behavior, however, takes the form of a travelling wave, so a single-frequency
stimulus causes a displacement over a very broad area of the basilar membrane.
Low frequencies are represented by peaks at the apical end of the membrane, while
higher ones are represented towards the basal area, in a topologically ordered way. It is
important to note that different frequencies of sound produce different travelling
waves with different peak amplitudes. These peak locations code the different frequency
stimuli in the basilar membrane. These peaks also excite different hair
cells at different positions in the cochlea, which are responsible for the mechanical-to-neural
transduction process that propagates electrical impulses to higher neural
centers through a tonotopically organized array of auditory nerve fibers. Each auditory
nerve fiber is specialized in the transmission of a different characteristic frequency,
and the rate of the pulses transmitted along these pathways codes not only the frequency
and intensity information, but also certain features of the signal relevant for discrimination
purposes. Fibers with characteristic frequencies below 3 kHz fire in synchrony with
the stimulus, so signals will be coded by the temporal firing. High-CF fibers, on the
other hand, lose this phase-locking mechanism in a linear way, so the stimuli will be
coded by the relative position of the peak along the basilar membrane: the place-coding
mechanism.
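The place-coding mechanism, mapping peak position along the basilar membrane to frequency, is often approximated in the literature by the Greenwood function, f = A(10^(ax) - k). The human constants below are the commonly cited fit and are an assumption of this sketch, not data from the paper:

```python
def greenwood_frequency(x):
    """Greenwood place-to-frequency map for the human cochlea.

    x: normalized distance from the apex (0.0) to the base (1.0).
    A=165.4, a=2.1, k=0.88 are the commonly used human constants
    (an assumption of this sketch, not taken from this paper).
    """
    A, a, k = 165.4, 2.1, 0.88
    return A * (10 ** (a * x) - k)

# Low frequencies peak near the apex, high frequencies near the base:
print(round(greenwood_frequency(0.0)))  # apical end: about 20 Hz
print(round(greenwood_frequency(1.0)))  # basal end: roughly 20 kHz
```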
thalamus (medial geniculate nucleus), which acts as a relay station for prior
representations (some neurons exhibit delays of a hundred milliseconds, storing in this
way delayed information which may be used for detecting sequential
acoustic events), and there are also neurons sensitive to noisy stimuli. Finally,
synaptic plasticity has been detected in this center, which permits the labelling of the
units with their perceptual meanings. A considerable number of inputs to the thalamus
come from the cortex, implementing in this way the circuit recurrence and forming
neural attractors.
High-level processing is done in the cortex. Different anatomical experiments
have confirmed the existence of coarse specialized areas, organized according to their
responsibility for processing information coming from the sensory receptors (visual,
auditory, etc.) and for generating the information sent to different actuators (speech, eye
movement, motor functions, etc.). That is, it seems that the neural tissue in the brain is
organized as ordered feature maps [6] according to its sensory specialization.
The exact location of the area in the human cortex responsible for speech
processing and understanding is not well defined, because the subjects of
experimentation have mainly been animals such as cats, bullfrogs, squirrel monkeys,
guinea pigs, goldfish, etc. Neurons that fire for certain slopes of frequency
transitions (FM elements) have been detected in cats [2], neurons that respond
to specific noise bursts (NB components) in the macaque [3], and also neurons
able to detect the combinations between these elements (CF-CF, FM-FM
with different delays between them) in the auditory cortex [4]. The information that
bats use for locating objects may be used by human beings for
communication, because they share the same neurobiological principles. Finally,
signals arrive at Wernicke's area, for which the existence of a word or concept map has been
postulated from observations of different cognitive dysfunctions.
Fig. 2. The main centers of the auditory pathway: inferior colliculus, medial geniculate nucleus, and auditory cortex (in the temporal lobe).
The system consists of two main modules adopted from standard automatic speech
recognition systems: a parametric extraction module and a recognition module. Each
module is composed of bioinspired algorithms, which provide the desired
functionality at an acceptable computational cost while following the
biological processes very closely. The parametric extraction module incorporates a
cochlear model based on gammatone filtering [7], which supplies the frequency
analysis and the temporal response observed in physiological recordings; a mechanical-to-neural
transduction process based on Meddis' hair cell model [8], which
includes adaptation, compression and half-wave rectification, and even the
loss of phase locking with the stimulus components for fibers with high characteristic
frequency; a temporal integration stage that emphasizes static components,
aligning the information of the different fibers, which is staggered by the cochlear transmission delays;
and a component extraction module that uses a spatio-temporal strategy to
obtain the robustness and level independence provided by temporal
approaches together with the energy estimation achieved by spatial methods [9].
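The gammatone filter at the heart of the cochlear model has the well-known impulse response g(t) = t^(n-1) e^(-2*pi*b*t) cos(2*pi*fc*t). A minimal sampling sketch follows; the filter order and ERB-based bandwidth are the usual choices in the literature, assumed here rather than taken from [7]:

```python
import math

def gammatone_impulse_response(fc, fs=16000.0, n=4, b=None, duration=0.025):
    """Sample the gammatone impulse response for center frequency fc (Hz).

    n = 4 and the ERB-based bandwidth below are common choices; treat
    them as assumptions of this sketch, not the exact parameters of [7].
    """
    if b is None:
        b = 1.019 * (24.7 + 0.108 * fc)  # 1.019 * ERB(fc), a common fit
    samples = []
    for i in range(int(duration * fs)):
        t = i / fs
        samples.append(t ** (n - 1) * math.exp(-2 * math.pi * b * t)
                       * math.cos(2 * math.pi * fc * t))
    return samples

ir = gammatone_impulse_response(fc=645.0)  # a center frequency near F1 of /a/
print(len(ir))  # 400 samples for a 25 ms response at 16 kHz
```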
The recognition module consists of a time-delay self-organizing map [10], which
groups and classifies automatically the different component combinations
provided, capturing the spectro-temporal variability of speech signals in its structure.
The modular and hierarchical system design allows each module to be validated
separately, and permits a more efficient evolution, because any newly published
algorithm can be used just by inserting it in the appropriate module without affecting
the whole hierarchical system.
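A time-delay self-organizing map extends the standard SOM by matching each input against a short history of delayed frames. One training step can be sketched as follows; the frame stacking and winner-take-all update are a generic illustration, not the exact algorithm of [10], and the neighborhood update of a full SOM is omitted:

```python
import random

def tdsom_step(weights, frames, lr=0.05):
    """One training step of a toy time-delay SOM.

    weights: one weight vector per map unit, covering the stacked frames;
    frames: the current window of delayed input frames (newest last).
    Generic winner-take-all sketch; not the exact algorithm of [10].
    """
    x = [v for frame in frames for v in frame]  # stack the delayed frames
    # Best-matching unit by squared Euclidean distance.
    bmu = min(range(len(weights)),
              key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))
    # Move the winner's weights toward the stacked input.
    weights[bmu] = [w + lr * (v - w) for w, v in zip(weights[bmu], x)]
    return bmu

random.seed(0)
units = [[random.random() for _ in range(6)] for _ in range(4)]  # 4 units, 3 frames x 2 features
print(tdsom_step(units, [[0.1, 0.2], [0.1, 0.2], [0.1, 0.2]]))
```

Repeating the step with the same window pulls the winning unit's weights toward the stacked input, so the unit comes to encode that spectro-temporal pattern.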
Fig. 3. Overview of the hierarchical system, from tympanic pressure and basilar-membrane volumetric velocity in the parametric extraction module up to a word map, analogous to Wernicke's area, in the recognition module.
5. Results
To test the parametric extraction module, synthetic speech was used instead
of real speech, in order to measure precisely the accuracy of this stage. Initially, vowels
were generated because of their static spectrum. All of them had the same
fundamental frequency, F0 = 100 Hz, and only the first two formants were included. As
an example, the results for the vowel /a/ are shown. The vowel /a/ was generated using
F1 = 640 Hz and F2 = 1190 Hz. The first case was a synthetic /a/ under clean conditions.
The same vowel was then corrupted with white noise at 100% of the vowel level.
Figure 4 shows the interspike interval histogram of the fiber responses. For the clean
vowel /a/ (left), three peaks appear. The lowest one is centred on 10 msec, so it
codes a fundamental frequency F0 of 100 Hz. The second one is located around 1.55
msec, which corresponds to 645 Hz, and the highest one is at 0.85 msec, or 1176
Hz. These values approximate with reasonable accuracy the formants provided. On the
other hand, in the noisy vowel histogram it can be observed that the white noise is
transformed into a high-frequency component by the mechanical cochlear filtering,
seen as a huge peak on the left of the temporal axis. The other three peaks
(marked with arrows) lie at the same locations as the clean-data peaks, with lower
magnitude caused by the inserted noise. The extraction ignores temporal peaks
below 0.3 msec (high-frequency components will be detected by spatial methods,
identifying them as NB elements), so the formant positions (CF) are not affected
by white noise when this bioinspired parametric extraction module is used.
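The mapping from interspike-interval histogram peaks to frequencies used in this analysis is simply f = 1/T; a small sketch reproducing the conversions for the clean synthetic /a/:

```python
def isi_peak_to_hz(interval_ms):
    """Convert an interspike-interval histogram peak (ms) into frequency (Hz)."""
    return 1000.0 / interval_ms

# Peaks reported for the clean synthetic /a/:
print(isi_peak_to_hz(10.0))   # 10 ms   -> 100 Hz, the fundamental F0
print(isi_peak_to_hz(0.85))   # 0.85 ms -> about 1176 Hz, near the provided F2
```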
Fig. 4. Interspike interval histograms of a clean (left) and a noisy (right) vowel /a/
The precision of the system for dynamic stimuli was checked using synthetic
consonants. The results for the Spanish phonemes /b/ and /g/ are shown in Figure 5. The
first formant consists of an ascending frequency from 500 to 700 Hz during the first 10
msec, until reaching the steady state. The second formant rises from 900 to 1200 Hz for
phoneme /b/ and decreases from 2000 to 1200 Hz for phoneme /g/. The third formant
rises from 2100 to 2400 Hz for phoneme /b/ and falls from 3200 to 2400 Hz for
phoneme /g/. These data are the same that Secker and Searle used in their analysis of auditory
fibers [5], and they match the behaviour of the plosives described in Section 2.
The extraction was performed every 5 msec using a 16 msec overlapping window.
The figure shows the provided formants (solid lines) together with the formant estimations
obtained by the bioinspired parametric extraction module (the first formant is dotted with
diamonds, the second with crosses and the third with squares). Sharp
precision can be observed for all the estimations. The deviations occur mainly in the transition from the slope
to the steady state, and they are caused mainly by the temporal integration stage and
the loss of synchrony (phase locking) with the stimulus for fibers with high characteristic
frequency. The estimation thus detects the temporal evolution of the
components of dynamic stimuli with high accuracy.
Fig. 5. Provided formants (solid lines) and the estimations obtained by the parametric extraction module for the Spanish phonemes /b/ and /g/.
One important aspect of this method is its computational cost. In the prior figures, 75
sections were used in each stage. If we compute the interval histograms using 20 sections,
the peaks lie at the same locations. This permits reducing the complexity of
the model by about two thirds without losing accuracy or robustness.
For the recognition module, static and dynamic phonemes were used to test the
incorporation of the spectro-temporal variability into the structure (weights) of the time-delay
self-organizing map. In the vowel map, each vowel lies in a specific region;
those which share certain characteristics (/a/-/e/ and /o/-/u/) are placed very close together,
and those classes with severe dissimilarities (/a/-/i/-/u/, the vowels
which form the vocalic triangle) lie very distant from each other. The dynamic map consists of the distribution of the
plosives /b/, /d/ and /g/. Their representation is more complex than the previous one,
because there are phonemes with more than one area in the map (phoneme /d/), and
there are certain areas (lower left row) which fire for two different phonemes (/d/
and /g/), due to the similarity of their spectral representations.
To analyze how the map incorporates the spectro-temporal variability into its
structure, the spectrogram of a phoneme was compared with a visualization of the evolution
of the weights of the unit labeled with that phoneme in the map. The three delayed
components were aligned in consecutive columns, with the NB elements (estimation of the
energy) on the upper side. Figure 6 shows the spectrogram and the visual analysis of
the weights of unit /a/. It can be seen that the static behavior of the components is
reflected in the weight evolution. Phoneme /a/ has its first two formants in the lower
part of the spectrum, while the third is on the upper side. This is coded in the unit
weights as static information on the lower side and a band reflecting the third
formant. The NB elements show the energy of this third formant and the absence of
energy above this frequency, also in a static way.
Fig. 6. Spectrogram of phoneme /a/ with a visualization of the weights of its unit in the map
Fig. 7. Spectrogram of phoneme /b/ with a visualization of the weights of its unit in the map
The global results are comparable with those of other similar models, Payton [11] and
Patterson [7], for static speech: the recognition rate for vowels, fricatives and nasals was
about 90, 70 and 50%, respectively. For dynamic speech the performance increased
compared with the other two models: for plosives and glides it was about 50 and 65%. The
inclusion of temporal delays in the network increases the discriminative performance for
dynamic data.
6. Conclusions
The results obtained show the system's precision in the estimation of static, dynamic and
noisy components, its robustness in the extraction of corrupted phonemes, and its
computational efficiency, achieved just by using a limited number of sections in the model. The
recognition module is independent of the order of the data provided, and it
captures the spectro-temporal variability of the speech components in its weight
structure, obtaining recognition rates which match other similar models for static
phonemes while improving on their results for dynamic data.
Acknowledgements
We would like to thank Dr. Roy Patterson for providing the AIM software. This research is funded
by NATO CRG-960053.
References
1 Suga, N: "Basic Acoustic Patterns and Neural Mechanism Shared By Humans and Animals
for Auditory Perception: A Neuroethologist view". Proceedings of Workshop on the Auditory
bases of Speech Perception, ESCA, pp. 31-38, July 1996.
2 Mendelson JR, Cynader MS: "Sensitivity of Cat Primary Auditory Cortex (AI) Neurons to
the Direction and Rate of Frequency Modulation". Brain Research, 327, pp 331-335, 1985.
3 Rauschecker JP, Tian B, Hauser M: "Processing of Complex Sounds in the Macaque
Nonprimary Auditory Cortex". Science, vol. 268, pp 111-114, 7 April 1995.
4 Suga, N: "Cortical Computational Maps for Auditory Imaging". Neural Networks, 3, pp. 3-
21, 1990.
5 Secker H. and Searle C.: "Time domain analysis of auditory-nerve fibers firing rates". J.
Acoust. Soc. Am. 88 (3) pp. 1427-1436, 1990.
6 Schreiner C.E.: "Order and Disorder in Auditory Cortical Maps". Curr. Op. Neurobiol., 5, pp.
489-496, 1995.
7 Patterson RD, Anderson TR, Allerhand M: "The Auditory Image Model as a Pre-processor
for Spoken Language". ICSLP, pp. 1395-1398, 1994.
8 Meddis R: "Simulator of mechanical to neural transduction in the auditory receptor". J.
Acoust. Soc. Am. 79 (3), pp. 702-711, 1986.
9 Ferrández J.M.: "Estudio y Realización de una Arquitectura Jerárquica Bio-Inspirada para el
Reconocimiento del Habla". Ph.D. Thesis, Universidad Politécnica de Madrid, June 1998.
10 McDermott and Katagiri: "Shift-Invariant Multicategory Phoneme Recognition using
Kohonen LVQ2". Proceedings of ICASSP, pp. 81-84, Glasgow, 1989.
11 K. L. Payton: "Vowel processing by a model of the auditory periphery: A comparison to
eighth-nerve responses". J. Acoust. Soc. Am. 83 (1), pp. 145-162, January 1988.
A Neural Network Approach for the Analysis of
Multineural Recordings in Retinal Ganglion Cells
1. Introduction
Our perception of the world, our sensations of light, color, music, speech, taste and
smell, are coded as raw data by the peripheral sensory systems and sent, by the
corresponding nerves, to the brain, where this code is interpreted and colored with
emotions. The raw or binary sensory data consist of sequences of identical voltage
peaks, called action potentials. Seeing implies the decoding of the patterns of spike trains
that are sent to the brain, via the optic nerve, by the visual transduction element, the
retina. Thus, the features of external-world objects, such as size, color and intensity, are
transformed by the retina into a myriad of parallel spike sequences, which must
describe with precision and robustness all the characteristics perceived.
Understanding this population code is, nowadays, a basic question for visual science.
Understanding the code means quantifying the amount of information each cell
carries, and studying the possible parameters that are used by the cells for transmitting
the data. The system has to assign meaning to this population code: for a given
pattern of action potentials, the brain has to estimate the stimulus that produced it.
The encoding has to be unequivocal and fast in order to ensure object recognition from
any single stimulus presentation.
New recording techniques and the emergence of new electrode array technologies
allow simultaneous recordings from populations of neuronal cells. However, there are
still many difficulties associated with collecting and analyzing activity from many
individual cells simultaneously. FitzHugh [6] proposed a statistical analyzer that,
applied to the neural data, estimates the characteristics of the stimulus. Different
approaches have been used in the construction of such a decoder, including
information theory [7], linear filters [8], discriminant analysis [9], etc.
In this paper we used two different artificial neural networks, one trained by back-propagation
and the other implemented with self-organizing maps, to estimate how an
ensemble of retinal ganglion cells can encode the characteristics of the light incident
on the retina. Our results show that artificial neural networks are useful tools for
analyzing multineuronal recordings, and that visual information is coded as the overall
set of activity levels across neurons rather than by single cells.
2. Methods
Light stimuli were produced with a tungsten lamp. Flashes with a duration of 0.2
seconds, followed by a 0.24-second period of darkness, were used as the typical stimuli.
Wavelength selection (400, 450, 488, 514, 546, 577, 600, 633 and 694 nm) was
achieved with narrow-band filters, and intensities were controlled with neutral density
filters. Different spot sizes (ranging from 0.195 to 2.6 mm) were also used throughout
this study in order to learn how well recordings from a network of ganglion cells
could be used to predict the shape, color and intensity of the visual stimulus. Each set
of stimuli was presented 7 times. Responses were amplified with a differential
amplifier and stored in a Pentium-based computer. A custom analysis program
sampled the incoming data at 20 kHz, plotted the waveforms on the screen, and stored
the record for later analysis.
Fig. 1. Experimental setup: the retina, bathed in Ringer solution, is stimulated with light while the responses are recorded and stored in the computer.
3. Analysis
In this study we selected the signals from those electrodes that had the highest signal-to-noise
ratios. In general, multi-unit signals were obtained from most of the
electrodes, and often single-unit separation was difficult, so we selected those 13-15
prototypes which were unequivocal in terms of both amplitude and shape. For each
electrode, a 4-element vector was constructed using the number of spikes, the relative
times of the first and second spikes, and the interspike interval between these firings. A 60-element
vector (4 variables x 15 cells) was used as the input matrix for our different
neural network approaches.
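The feature construction described above, four variables per electrode concatenated over 15 cells, can be sketched as follows. The padding convention for missing spikes is an assumption of the sketch, since the paper does not specify it:

```python
def electrode_features(spike_times_ms):
    """Four response variables for one electrode: spike count, times of the
    first and second spikes, and their interspike interval.

    Missing spikes are encoded as 0.0 here; the paper does not state the
    padding convention, so this is an assumption of the sketch.
    """
    n = float(len(spike_times_ms))
    t1 = spike_times_ms[0] if len(spike_times_ms) > 0 else 0.0
    t2 = spike_times_ms[1] if len(spike_times_ms) > 1 else 0.0
    isi = t2 - t1 if len(spike_times_ms) > 1 else 0.0
    return [n, t1, t2, isi]

def population_vector(per_cell_spike_times):
    """Concatenate the 4 features of each of the 15 cells (60 elements)."""
    return [v for spikes in per_cell_spike_times
            for v in electrode_features(spikes)]

cells = [[12.0, 19.5]] * 15      # toy data: two spikes per cell
print(len(population_vector(cells)))  # 60-element input vector
```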
Two different neural networks were used. The first one was a three-layer
backpropagation network [10] with 20 nodes in the hidden layer. The output layer consisted of
the same number of neurons as classes to be recognized. Using this architecture,
each neuron of the output layer fires only for a certain stimulus, and the rest of the
neurons of the output layer have no activation (winner-take-all network). The
activation function used for all neurons, including the output layer, was the hyperbolic
tangent sigmoid transfer function given by:

f(x) = 2 / (1 + e^(-2x)) - 1                    (1)
using as initial momentum and adaptive learning rate the values established by default
by the Matlab Neural Network Toolbox. The initial weights were randomly initialized,
and the network was trained to a sum-squared-error goal of 1, in order to provide
more generality in the estimation stage.
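The transfer function in Eq. (1) is algebraically identical to the hyperbolic tangent, tanh(x) = 2/(1 + e^(-2x)) - 1, which a quick numerical check confirms:

```python
import math

def tansig(x):
    """The hyperbolic tangent sigmoid of Eq. (1): 2 / (1 + exp(-2x)) - 1."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

# Numerically identical to tanh over a range of inputs:
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    assert abs(tansig(x) - math.tanh(x)) < 1e-12
print(tansig(0.0))  # 0.0, the midpoint of the (-1, 1) output range
```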
The other network used was Kohonen's supervised Learning Vector Quantization
(LVQ) [11], with 16 neurons in the competitive map and a learning rate of 0.05. This is
a competitive network, in which the neurons whose weights are most similar to the
input increase their strength in response to that input, while the rest of the nodes,
except those in a close neighborhood, are weakened. This establishes a topological relation in the
map. The main advantage of using learning vector quantization is that it takes less
time to reach the convergence criteria.
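A supervised LVQ update moves the winning prototype toward inputs of its own class and away from inputs of other classes. A minimal LVQ1-style sketch, using the 0.05 learning rate mentioned in the text (the exact LVQ variant of [11] may differ):

```python
def lvq1_step(prototypes, labels, x, y, lr=0.05):
    """One LVQ1 update: attract the winning prototype if its label matches
    the input class y, otherwise repel it. A generic LVQ1 sketch, not
    necessarily the exact variant used in the paper."""
    win = min(range(len(prototypes)),
              key=lambda i: sum((p - v) ** 2 for p, v in zip(prototypes[i], x)))
    sign = 1.0 if labels[win] == y else -1.0
    prototypes[win] = [p + sign * lr * (v - p)
                       for p, v in zip(prototypes[win], x)]
    return win

protos = [[0.0, 0.0], [1.0, 1.0]]   # two toy prototypes
labels = ["dark", "bright"]         # hypothetical class labels
win = lvq1_step(protos, labels, [0.9, 1.1], "bright")
print(win, protos[win])             # the winner moves toward the input
```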
Once the network was trained, estimation with extended data was performed, and the
correlation coefficients between the stimuli and their estimations were computed.
Other studies use their own measures, such as mutual information [8], to assess the
overall quality of the reconstruction, but there is no common agreement
about the measure that best estimates the goodness of the prediction.
4. Results
For many stimulus conditions and many cells, only a few spikes were produced in
response to light-ON. Figure 2 shows a raster plot of the response time stamps of 15
cells to several identical presentations of a full-field flash, using a wavelength of 546 nm
and a log relative intensity of -0.5. The stimulus is indicated in channel 1, so that 8 different
flashes are shown. It can be seen that most of the cells are ON-OFF, and that they
only fire a few spikes in response to the stimulus. Another characteristic is that, for a
given cell, different presentations of the same, identical stimulus evoke different
responses. These responses differ not only in the number of spikes but also in their
relative timing, manifesting variability in the spiking behavior. This variability
produces uncertainty when recognizing the right stimulus using only one individual cell,
because there is no unequivocal function that associates the firing variables with the
visual information provided.
Ambiguity is another aspect noticed: a single cell can have exactly the same
response to different stimuli, making the stimulus estimation task much more
difficult. These aspects are presented in detail in Ammermüller et al. [9], and they
point to population coding as the strategy used to represent information in the visual
system.
Figure 3 shows the correlation between the output of a trained backpropagation neural
network and the correct stimuli, which in this case consisted of 8 different intensities.
The three wavelengths chosen were those where the discrimination of the population was
worst (633 nm), intermediate (546 nm) and best (450 nm). It can be seen that the
scores show variability depending on the cell and the wavelength studied. On average,
all single cells were far below ideal discrimination, although to a varying degree. The
cells with the highest estimation scores were cells 8, 10, 11 and 12. On the other hand, the
performance of all the cells taken together ("All" column) exceeded 0.95 for all
wavelengths.
Fig. 3. Intensity estimation scores for individual cells and for all the
cells taken together ("All" column) using a BP network
Color estimation is more complex, and the estimation rates for single cells were
considerably lower (Figure 4). For this kind of study the intensity was fixed, and
we asked the network to classify correctly nine different wavelengths. Again the
population discrimination was fairly good, with correlation coefficients ranging from
0.95 to 0.97, values that clearly surpassed all the individual-cell coefficients for all
kinds of stimuli.
Fig. 4. Color estimation scores for individual cells and for all the
cells taken together ("All" column) using a BP network
To validate the above-mentioned results, the same data were presented to another
kind of neural network, a supervised learning vector quantization (LVQ) network with 20
nodes in the competitive layer. This network converged faster than the back-propagation
(BP) network, and again the cells with the highest estimation scores were
cells 8, 10, 11 and 12. The results obtained by using all the cells together were nearly
the same as those obtained using the BP algorithm (Figure 5).
Fig. 5. Intensity estimation scores for isolated cells and for all the cells
taken together ("All" column) using a LVQ network
Fig. 6. Color estimation scores for individual cells and for all the
cells taken together ("All" column) using a LVQ network
In order to gain some insight into the relative importance of each of the variables
for the discrimination task, we used a BP network with 20 nodes in the hidden layer.
The input to this network was only the spike rate (N), only the timing of the first spike
(T1), only the timing of the second spike (T2), or only the time difference between
spike one and spike two (Interval) for the entire population of 15 cells. We also used
all these parameters taken together. Figure 7 shows the correlation indices between
the real stimuli and the network estimations. Spike rate (N in Figure 7) was the most
important variable, followed by the exact timing of the first spike (T1). The timing of
the second spike (T2) and the interspike interval carried less information and were
poor coding elements. When all the variables from the ensemble of cells were used,
the correlation coefficients were close to 1.
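This variable-importance comparison can be sketched as follows: a minimal one-hidden-layer backpropagation regressor with 20 hidden nodes is trained on one feature subset at a time (N, T1, T2, Interval, or all together), and the correlation between its estimates and the real stimuli is taken as the score. The learning rate, epoch count and training procedure are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fit_bp(X, t, hidden=20, lr=0.02, epochs=5000, seed=0):
    """Minimal one-hidden-layer backpropagation regressor
    (tanh hidden units, linear output, full-batch gradient descent)."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 0.5, (X.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.5, (hidden, 1));          b2 = np.zeros(1)
    t = t.reshape(-1, 1)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)            # hidden layer
        Y = H @ W2 + b2                     # linear output
        dY = (Y - t) / len(X)               # gradient of 0.5*MSE w.r.t. Y
        dH = (dY @ W2.T) * (1.0 - H ** 2)   # backprop through tanh
        W2 -= lr * H.T @ dY; b2 -= lr * dY.sum(0)
        W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(0)
    return lambda Z: (np.tanh(Z @ W1 + b1) @ W2 + b2).ravel()

def variable_score(X, t):
    """Correlation between the network's estimates and the real stimuli,
    used here as the estimation score for one feature subset."""
    net = fit_bp(X, t)
    return np.corrcoef(net(X), t)[0, 1]
```

A subset such as the spike rate alone would then be scored with `variable_score(features[:, [0]], stimuli)`, and all parameters together with the full feature matrix, reproducing the comparison of Figure 7.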
[Bar chart; separate bars shown for the 633, 546 and 450 nm stimuli.]
Fig. 7. Intensity estimation scores for the population using different variables
Basically the same results were obtained for color discrimination, although the overall
performance was not as good as in the case of intensity discrimination (Figure 8).
Fig. 8. Color estimation scores for the population using different variables
5. Conclusions
In this paper, a connectionist method has been used to investigate how color and
intensity can be estimated from single cells and from populations of retinal ganglion
cells. Two different neural networks, a feedforward backpropagation network and a
competitive LVQ network, have been used to determine the coding capabilities of
individual cells versus a group of neurons. The correlation between the networks'
estimations and the real stimuli was used to quantify the transmitted information. Both
networks indicate that the brain could potentially deduce reliable information about
stimulus features from the response patterns of ganglion cell populations, but not from
single ganglion cell responses.
The spike rate, together with the exact timing of the first spike at light-ON, were the
most important parameters encoding stimulus features, as has been shown for
different systems [4,5]. The fact that the number of spikes, or the first spike's relative
timing, obtains the same estimation index as all the parameters together, approximately
0.95, could imply redundancy in the transmitted information, and could be related to
the robustness of data transfer inherent to this system.
A more refined data set will help improve the accuracy of our analysis. Thus,
new physiological techniques which decrease the level of background noise in the
recorded responses, and an efficient separation of the action potential prototypes
recorded with a single electrode [12], will help to isolate the firing parameters from
the artifacts which contaminate our present recordings.
Finally, while the quality of the different coding parameters can be assessed using
this neural network approach, we do not know whether the brain actually focuses
on these variables. Once the visual code is understood, the construction of spiking
retina models which accurately reflect the physiological recordings will become
possible.
Acknowledgements
This research is being funded by CICYT SAF98-0098-CO2-02, DFG SFB 517 and
NSF grant #IBN 9424509.
References
1 Ammermüller J., Kolb H.: "The Organization of the Turtle Inner Retina. I. ON- and OFF-
Center Pathways." J. Comp. Neurol. 358(1), pp. 1-34, 1995.
2 Ammermüller J., Weiler R., Perlman I.: "Short-Term Effects of Dopamine on Photoreceptors,
Luminosity- and Chromaticity-Horizontal Cells in the Turtle Retina." Vis. Neurosci. 12(3),
pp. 403-412, 1995.
4 Berry M. J., Warland D. K., Meister M.: "The Structure and Precision of Retinal Spike
Trains." Proc. Natl. Acad. Sci. USA 94(10), pp. 5411-5416, 1997.
5 Secker H., Searle C.: "Time Domain Analysis of Auditory-Nerve Fibers Firing Rates." J.
Acoust. Soc. Am. 88(3), pp. 1427-1436, 1990.
6 Fitzhugh R. A.: "A Statistical Analyzer for Optic Nerve Messages." J. Gen. Physiology 41,
pp. 675-692, 1958.
7 Rieke F., Warland D., van Steveninck R., Bialek W.: "Spikes: Exploring the Neural Code."
MIT Press, Cambridge, MA, 1997.
8 Warland D., Reinagel P., Meister M.: "Decoding Visual Information from a Population of
Retinal Ganglion Cells." J. Neurophysiology 78, pp. 2336-2350, 1997.
9 Ammermüller J., Fernández E., Ferrández J.M., Normann R.A.: "Color and Intensity
Discrimination of Retinal Ganglion Cells in the Turtle Retina." J. Physiology (under revision).
10 McClelland J., Rumelhart D.: "Explorations in Parallel Distributed Processing," vols. 1 and 2.
MIT Press, Cambridge, MA, 1986.
11 Kohonen T.: "Self-Organization and Associative Memory," vol. 8, Springer Series in
Information Sciences. Springer-Verlag, NY, 1984.
12 Öhberg F., Johansson H., Bergenheim M., Pedersen J., Djupsjöbacka M.: "A Neural Network
Approach to Real-Time Spike Discrimination during Simultaneous Recordings from Several
Multi-Unit Filaments." Journal of Neuroscience Methods 64, pp. 181-187, 1996.
Challenges for a Real-World Information
Processing by Means of Real-Time Neural
Computation and Real-Conditions Simulation
J.C. Herrero
Software AG
c/ Ronda de la Luna, 4. - 28760 Tres Cantos (MADRID), SPAIN
e-mail: jcherrer@arrakis.es
Summary
Should we consider the dimensions of natural neural computation as they are known as a result of
scientific research, we realize there is a long tomorrow before us, interested in neural computation, for the
simple reason that we can only handle a relatively low number of units and connections nowadays. All
along this century we have significantly improved our knowledge of natural neural nets, coming to
appreciate that huge number of cells and connections and beginning to understand some of the brain's
signal processing and the repetitive structures which support it. However, even in the most developed
cases, such as auditory pathway modelling, there is no neural computational device which can involve a
real-time response and follow the facts already known or plausibly postulated on some brain processes
(e.g. by McCulloch and Pitts), with the unavoidably great number of processing elements involved too;
besides, neither suitable models regarding those kinds of real-look nets have been designed, nor have
their corresponding real-conditions simulations been carried out. That means there is a lack of
connectionistically computable models and also of reduction methods by which we can obtain a
connectionistic implementation design, given the knowledge level model.
Therefore, we would like to ask: what is within reach? In order to answer this question we are going
to present a restricted auditory pathway modelling case, where we shall be able to see the realistic
challenges we are facing up to. By trying to propose a consistent implementation for it, based on parallel,
modular, distributed and self-programming computation, we shall see the kind of methods, equipment,
software and simulations required and desirable.
1. Introduction.
Natural neural computation is a result of natural evolution. Following Charles
Darwin [Darwin 1859], natural neural things are there because they represent the most
feasible way for them to be in their environment, as a result of such evolution, and in such
terms we may try to understand it. Natural neural nets have evolved to process real-world
information. Thus, when we try to understand what those nets are for, why those
connections, shapes and quantities, perhaps it may be helpful to begin with a real-world
information processing modelling and then construct the corresponding circuits whose
input is some interaction with physical events which happen in the real world
[Churchland 1992] [Hawkins 1996].
We have discovered the computational features of one neuron are not so simple as
some modelling trends pretended, and, unlike the models we are accustomed to manage,
the structures involving neurons are very complex and consist of a huge number of
components [DeFelipe 1997] (that should not be surprising, since it is about the number
of cells in a body), as complex as is the computation they carry out [Moreno-Diaz 1997,
1998]. However, the more we know about them and their functionalities, the more it
seems this is the better way for processing the involved signals [Mira&Delgado 1995a,
1997], and the more we wish to emulate their features [Churchland 1992] [Hawkins 1996].
Thus, parallel, modular, distributed, and self-programming computation appears in
the pathway. However, it is no less true that this processing has to be understood in a causal
relationship with well-known facts at a different level [Mira&Delgado 1995a, 1995b,
1997], i.e. the knowledge level [Marr 1982] [Newell 1981]. For instance, in auditory
pathway modelling, the whole signal processing is causally related to auditory sensations
and perceptions (some person has), like timbre and chord recognition, i.e. the
psychoacoustics [Delgutte 1996]. This causality between levels is admitted although
unknown in the case of natural neural nets (in the brain), and completely impossible in
artificial neural nets nowadays.
This does not mean one cannot eventually model the whole auditory pathway in the
future, make either a design or a physical implementation and objectively interpret the
results in terms of well-known components of the input that the circuit would receive
from the real world; this is, rather, the connectionistic long-term aim. In the meantime, in
this paper we are going to present a restricted auditory pathway synthesis modelling for
timbre recognition, processing real-world information by emulating the features of
natural neural nets, basically those of parallel, modular, distributed and self-programming
computation, which also means real-time processing. Such a synthesis modelling
exercise will show us the magnitude of the problem, even in a restricted case like this, and
it will suggest to us the kind of tools we would need in order to tackle these kinds of
problems, as well as the necessity of explicit reduction methods which should eventually
be used in order to obtain the computational design, given the knowledge level model of
analysis, as in the symbolic computation counterpart [Mira 1997, 1998] [Herrero 1998,
1999]; methods which explicitly justify the causal relationship between computational
levels.
It is usually said that the aim of computational analysis should be the development of
formal models, sufficiently explicit, internally consistent and complete, something which
conceptual models, in natural language, are not [Hawkins 1996] [Benjamins 1997].
However, this must not mislead us. Firstly, because besides the computational analysis aim
we cannot forget there is a computational synthesis aim, or else there would not be any
computational aim at all. Secondly, because there are two kinds of causalities we cannot
forget either: the model's causality and the reduction method's causality. As to the
former, formalisms are intended to express things in a powerful manner [Russell 1959]
[Whitehead 1913] and they have their own formal causality, based on abstract
relationships (usually mathematical ones) for handling abstract entities (like elements of
sets, etc.). Of course, the formalism can be expressed in natural language, albeit in a long-
winded way, or we could never understand what it means; but what it means properly
has to do with those abstract entities and relationships, which have nothing to do
with the real problem's causality and entities, unless someone interprets the formalism in
these other terms. Then, on the one hand, this knowledge level interpretation of the
precise descriptions of the formalism can be expressed in natural language, and therefore
the fact that conceptual models are imprecise is rather a custom than an intrinsic
characteristic. But, on the other hand, if knowledge level models talk about facts of a real
world, this disables the possibility of any formalism, completeness, internal consistency,
etc., at least at the knowledge level, since those facts are the only possible justification of
the relationship between the model entities; we can always ask some "why?" about the
model whose only possible answer is "because of the observed facts", and then it is a
scientific model. As to the reduction method's causality, we have to justify why the
implementation level model has to do with the knowledge level model, and we must
explain it explicitly. While it is possible to describe reduction methods for obtaining the
program code, given the knowledge level description of a problem [Mira 1997, 1998]
[Herrero 1998, 1999], analogous reduction methods are not yet available for
connectionistic implementations. Anyway, the interpretation of the (electronic level of
2. From physical events to neocortex, by the auditory pathway.
We begin with a very brief summary of known facts about the auditory pathway,
picking only those which are significant (we estimate) for our purposes, disregarding the
description of a good deal of wonderful details already known to science.
Sounds are physically described as a kind of vibration usually transmitted by the air,
as very fast cyclic changes in pressure in the direction of the sound propagation. We hear
sounds because of a causal chain or line of events [Lyon 1996] [Russell 1948], starting at
the physical event which produces the air vibrations that eventually reach our outer ear
and then the tympanic membrane. The tympanic membrane transmits the vibrations to the
middle ear ossicles, through which the vibrations get to the inner ear and then enter the
neocortex, and then we hear, although maybe we do not listen.
There are several noteworthy components in the inner ear, contained in the cochlea.
The shape of the cochlea looks like a snail shell, as is well known, and it is filled
with a fluid (endolymph) and divided along the shell spiral into three compartments by
the basilar membrane and Reissner's membrane. The basilar membrane gets thicker
the nearer we move to the spiral center or apex. The vibrations which get to the
cochlea are transmitted by the fluid to the basilar membrane, which responds to them
depending on the frequency. In the 19th century, Hermann von Helmholtz modelled the
basilar membrane as a series of mechanical oscillators. The fact is that, depending on the
vibration frequency, a different point or zone of the basilar membrane vibrates
maximally, the thinner the zone the higher the frequency, the thicker the lower. On the
basilar membrane sits the organ of Corti, which undergoes the vibrations of the
membrane. In the organ of Corti there are two kinds of cells: the inner hair cells (about
3,500 in humans) and the outer hair cells (about 12,000). The inner hair cells seem to be
primarily responsible for our hearing, since "almost 95% of afferent fibers of the cochlear
division of the eighth nerve (auditory nerve) originate at the base of these cells, while
most of the efferent input to the cochlea from the central nervous system reaches the
bases of the outer hair cells", a really astonishing phenomenon [Dnuw 1996] [Delgutte
1996] [Hanavan 1996] [Lyon 1996]. However, there are interactions with the outer hair
cells that join the filtering, resonance, amplifying, and other effects of the outer, middle,
and rest of the inner ear [Mountain 1996].
The inner ear transmits to the cortex two features of the sound we are interested in:
frequency and intensity. As we have said, frequency is identified by the inner cells
corresponding to a basilar membrane zone, and this information is transmitted by the
corresponding nerve fibers, these being like labelled lines in the very causal lines we
referred to previously. No less interesting is the fact that the sound intensity is coded
and travels along the same fibers, carried by the rate or frequency of discharge of the
neurons, besides cooperative processing under saturation conditions [Delgutte 1996]. The
computation can code intensity by means of numbers. Following Marr [Marr 1982], if we
have to understand the behaviour of neurons, i.e. the natural processing of information, at
some other level, there is no need for an exact neuron-by-neuron, synapse-by-synapse
artificial synthesis. Albeit we could choose this way, we can also consider that a local
computation program represents the computation of several neurons and synapses, or
conversely, while we preserve the problem structure and reproduce the suitable global
functionalities with regard to our problem.
Figure 1 represents the modules of our model. The reception module is a transducer
whose mission is to pick the sound waves from the environment as input and return a
complete feature map as output, the wave spectrum for each time interval Δt. That
complete feature map may be represented by a bidimensional chart, where the X axis holds
the frequency values and the Y axis holds the intensity values, both in a logarithmic scale,
as we can see in the same figure. The next module detects intensity variations in time for
each frequency, parses the wave spectrum and returns one or more sub-spectra
components. This module uses columns for computing the suitable output. The final one
is the memory-recognition module, which handles normalized forms, where spectrum
frequencies and their respective intensities are relative.
Next we are going to see these modules in detail, except for the reception module,
and we shall discuss some alternatives.
[Figure: reception module → wave spectrum in Δt (intensity vs. frequency, over time) →
intensity variations in time for frequency module (IVTM) → memory-recognition module (MRM).]
Figure 1. Modular model for timbre recognition, a restricted auditory pathway model.
3.1. Reception.
In this restricted model, we can see the reception module (RM) as a low-frequency
electronic device, with a total bandwidth of 20 Hz-20 kHz, which consists of one or more
electroacoustic components for transforming the sound waves into electric signals,
followed by band-pass filters tuned to m different frequencies, spread over the total
bandwidth in a logarithmic scale, as usual, and finally followed by half-wave rectifiers and
analog-to-digital converters, so that the output of such a device is m labelled lines, each one
corresponding to a definite frequency and each one carrying a coded number a
representing the intensity associated with that frequency. Note that m is over 3,000. While
this device's analog input may vary continuously, the device changes the digital output a
discretely every Δt₀, i.e. it renders a complete feature map every Δt₀, the wave spectrum
for this time interval, as we saw in Figure 1.
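A toy software stand-in for this reception module can be sketched as follows, with the analog band-pass filters approximated by projecting each Δt₀ frame onto complex exponentials at m log-spaced frequencies; the value of m, the quantization rule and the per-channel detector are illustrative simplifications, not the paper's design.

```python
import numpy as np

def reception_module(signal, fs, m=32, f_lo=20.0, f_hi=20000.0,
                     dt0=0.01, levels=256):
    """Toy reception module: m labelled lines at log-spaced frequencies,
    each carrying an integer intensity code, refreshed every dt0 seconds.
    Band-pass filtering is approximated by projecting each frame onto
    one complex exponential per channel."""
    freqs = np.logspace(np.log10(f_lo), np.log10(f_hi), m)  # labelled lines
    n = int(fs * dt0)                                       # samples per frame
    t = np.arange(n) / fs
    probes = np.exp(-2j * np.pi * freqs[:, None] * t[None, :])  # (m, n)
    frames = len(signal) // n
    spectrum = np.empty((frames, m))
    for k in range(frames):
        frame = signal[k * n:(k + 1) * n]
        mag = np.abs(probes @ frame) / n        # per-channel amplitude
        # crude "ADC": log-compress and quantize to an integer code
        spectrum[k] = np.round(levels * np.log1p(mag) / np.log1p(1.0))
    return freqs, spectrum.clip(0, levels)
```

Feeding a pure tone through this sketch yields a feature map in which the labelled line nearest the tone's frequency carries the largest code in every Δt₀ frame.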
We can computationally implement this module in a slightly different way, say the
second version, functionally equivalent although it seems unlikely that real neurons operate
like this. We can arrange m units, one per frequency, with q outputs per unit, and assign εⱼ
to the jth output. We can preserve an order for each output in each unit, consistent with
the corresponding output assignments of the rest of the units, so that the set which
consists of the jth output of each unit corresponds to the jth column; now columns are
made up just of output connections, corresponding to the computations performed in
the m units. Each unit i calculates Δaᵢ/aᵢ = ε′ⱼ and, by comparing ε′ⱼ to εⱼ and εⱼ₋₁ (j = 1, …, q),
the unit decides which of the q outputs is not 0 and will transmit the coded information to
the next module. Since the number of units is m, the number of inputs is m too, and the
number of outputs is mq, so we are talking about 3,000 units, 3,000 inputs and 150,000
outputs. We can see that, for the same number of outputs, the number of inputs and units
are considerably smaller than those of the first version, so this is the better choice for
implementation.
The purpose of this module is to detect groups of frequencies which vary together. As
we previously said in this paper, we are trying to recognize timbres. Although this does
not mean that we restrict ourselves to the case of harmonic sounds, suppose we hear
several instruments playing a melody together on the radio (which is not stereo). We can
distinguish each one because they are not too many, they do not play continuously
together all the time, and the attack-release-sustain-decay series is different for each
instrument; if not for this, the timbre alone could not help us distinguish
between them (this happens when we listen to an orchestra). So we can imagine each
instrument's spectrum in frequency varies in a different way during a given Δt and also
from one Δt to another as time goes by. Because of the latter, we can recognize the
melody that each instrument plays, and this is not the basic functionality of the module.
Because of the former, the module gathers in the same column all the frequencies whose
intensities vary in the same proportion, so they probably have the same origin, e.g. the
same instrument. Besides, if we think in terms of evolution, suppose some living being
has been hearing a sound with the same frequency spectrum, say the soft breeze on the
savannah. Probably, this living being will associate this sound with a unique mild origin.
But a quite different case arises if some part of the frequency spectrum varies, though
softly, at a different rate from the rest: it will probably be successful, in the struggle for
life, to associate all the frequencies of this spectrum which vary at a different rate and
identify them with a different origin.
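A minimal sketch of this grouping step follows, assuming intensities are compared between two consecutive Δt intervals and that "varying in the same proportion" means intensity ratios equal within a tolerance; both assumptions are ours, for illustration.

```python
import numpy as np

def group_by_covariation(prev, curr, tol=0.05):
    """Toy IVTM step: gather in the same 'column' all frequency lines
    whose intensities changed in (approximately) the same proportion
    between two consecutive time intervals; silent lines are skipped."""
    prev = np.asarray(prev, float); curr = np.asarray(curr, float)
    active = np.flatnonzero(prev > 0)
    ratio = {i: curr[i] / prev[i] for i in active}    # proportional change
    columns, unassigned = [], set(active)
    while unassigned:
        i = min(unassigned)
        members = sorted(j for j in unassigned
                         if abs(ratio[j] - ratio[i]) <= tol)
        columns.append(members)
        unassigned -= set(members)
    return columns
```

Lines whose intensities all grow by 50% end up in one column, lines that halve end up in another, matching the idea that frequencies with a common proportional change probably share a common origin.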
where m is the number of IVTM outputs per column (as many as labelled lines, one per
frequency, in a given column k). Every one of these aᵢⱼₖ can be either 0 or the intensity
coded value. A normalization over the absolute intensity coded values turns the series into
relative intensity ones. Another normalization over the absolute frequencies turns them
into relative ones, simply by displacement. We call the normalized series a "form", so
units memorize forms. Once these normalizations end, the MR connections are no
longer used to memorize, but to recognize, i.e. the MR connections start working like RO
connections.
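The two normalizations can be sketched as follows, assuming the series is a vector of intensity codes indexed by frequency line: division by the maximum gives relative intensities, and shifting the active span down to index 0 implements the normalization by displacement.

```python
import numpy as np

def normalize_form(series):
    """Toy MRM normalization: divide intensity codes by their maximum
    (relative intensities) and shift the active span down to index 0
    (relative frequencies, i.e. normalization by displacement)."""
    s = np.asarray(series, dtype=float)
    active = np.flatnonzero(s > 0)
    form = np.zeros_like(s)
    if active.size:
        span = s[active[0]:active[-1] + 1]
        form[:span.size] = span / s.max()
    return form
```

With this definition, the same spectral pattern played louder or transposed along the frequency lines yields the same "form", which is exactly what the unit is meant to memorize.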
Figure 3. The three modules. In the memory-recognition module, only 1 unit is shown. At a given time,
there are q units ready to memorize, one per IVTM column, each one with its MR connections to a
different column. In general, there are many other units like them; some of them have already memorized
their own form, and are therefore capable of recognizing it, and some others have not yet memorized
anything, but only q of them are ready to memorize.
The recognition process is available once a unit has memorized a form. Then, whatever
the input series it receives, it will be normalized and compared with the memorized form.
The RO connections are arranged so that the unit considers that inputs which come from
different IVTM columns belong to different series. So we have a series like
(aᵖᵢ₁ᵣ, aᵖᵢ₂ᵣ, …, aᵖᵢₘᵣ)
where m is the number of IVTM outputs per column and r the specific column whose
outputs are now inputs to the unit i for recognition purposes. Computation acts so as to
consider only one of the q possible series in the normalization and comparison process. If
the comparison fails, then another series is considered, until recognition is achieved or no
matches at all are found. If recognition is successful, the unit output is not 0, being 0
otherwise. The comparison criterion is not necessarily an exact-match one, since we can
admit little differences in the values of the series. Note that, since this unit operates in
parallel with the rest of the units for recognition purposes, recognition is achieved in
parallel and in a few steps. Note also that, first of all, the memory-recognition module
(MRM) tries to recognize the form, by inhibiting the MR connections. If no output is
obtained, then the inhibitions cease, and memorization is permitted.
Setting aside the displacement and normalization, the underlying idea is that connections
support the memory function. The natural neural nets analogy is to interpret
straightforwardly that memory would imply synaptic changes: when the memory function
is on, the synapses which received impulses become excitatory ones, while those which
did not receive any impulse in the same process become inhibitory ones. But this process
is not meant to be accomplished by only one neuron, but by a group of them working
cooperatively, i.e. a neural net.
Figure 4. The memory-recognition unit inspired by the Pitts-McCulloch research. Grey lines come from
IVTM column outputs for the corresponding frequency; these input lines converge in units X¹ and, after
being normalized, cross the manifolds slantwise and enter the units on a multiplexed-by-manifold basis. On
the other hand, the black lines represent the MR connections that come from only one IVTM column j,
whose intensity was also normalized.
In order to preserve as far as possible this very interesting feature in our design,
we could think of a second version of our MRM, based on the neural structures studied by
W. Pitts and W.S. McCulloch [Pitts&McCulloch 1947]. As we can see in Figure 4, for
each IVTM column there are m manifolds, each representing the m possible
displacements in frequency, so that we have an m × m matrix of units. The m possible
outputs of an IVTM column k, previously normalized (in intensity), are sent to each of the
inputs corresponding to the same frequency but in a different manifold. These are the
MO connections (memorizing-only), so memorization takes place in all the manifolds in
the corresponding inputs of the matrix, and also in a unit X (on the right). As to the
recognition connections RO, they are as follows: the m possible outputs of each IVTM
column k are sent firstly to multiplexed units whose output is controlled by a multiplexed
signal, then converge in units X¹, then are normalized and afterwards sent slantwise to the
manifolds, so that the output corresponding to the lowest frequency crosses the diagonal of
the matrix, and the highest theoretically arrives at only one input. Each IVTM column
output is examined when the multiplexed signal is on respectively for that column, and
the multiplexor's units render output only when the multiplexed signal is on for them and
there is input coming from the corresponding frequency and column. Recognition takes
place only when the RO connections match a manifold with the suitable input inhibitory
characteristics and a multiplexed signal (not represented in the figure) is on for this
manifold; then this manifold's units render outputs different from zero, such that,
entering the unit X as inputs, they coincide with this unit's memorized form (these connections
for unit X are not represented in the figure, being the manifold units' outputs; unit X
considers these connections ordered by manifold). Normalization for MO as well as RO
connections is performed as follows, as we can see in Figure 4 (for RO connections only):
m units X¹ receive the RO inputs and send them to other m units Xⁿ⁺¹ which keep them
during nΔt, while signals are compared r by r in n layers (represented by the dotted
triangle), so that the nth layer consists of only one unit Xⁿ: this unit's output is the greatest
input value received in the X¹ units. So, units Xⁿ⁺¹ divide their inputs by the value of unit
Xⁿ's output. That quotient is sent to the manifold's units as input.
If we calculate the dimensions of the MRM for this second version, we have that one
MRM unit has only 1 output, m MO inputs and m² RO inputs corresponding to unit X;
m² MO inputs and m²/2 RO inputs plus m²/2 multiplexed-by-manifold inputs
corresponding to the manifolds; 2mq inputs corresponding to the multiplexed units;
mq inputs corresponding to the units X¹; 4m−2 inputs corresponding to the RO normalizing
units and 5m−2 inputs corresponding to the MO normalizing units (if r=2). That yields
6m²/2 + m(10+3q) − 4 connections per MRM unit. Under the same circumstances
considered in the IVTM case, that means we are talking about 2.7×10⁷ connections per
MRM unit; please note once more that we do not mean this MRM unit is a neuron. This
MRM unit consists of simpler units: 1 unit X, m² manifold units, 2(3m−1) normalizing
units (one set for the MO connections, one for the RO connections), and mq multiplexed
units. That yields m² + m(3+q) simpler units per MRM unit, i.e. we are talking about
9.1×10⁶ simpler units. Then, if the MRM has 1000 MRM units, we are talking about
2.7×10¹⁰ connections and 9.1×10⁹ simpler units per MRM, following the Pitts-McCulloch
postulates.
However, per MRM unit, the first version of the MRM has 1 output, m MR inputs, and
mq RO inputs (since the MR inputs have to change to RO ones after memorization), so
we are talking about 1.53×10⁵ connections per MRM unit, and there are no simpler units,
but each unit's local computation is more complex than that of the second version.
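The counts quoted above can be checked mechanically. The value of q is not stated in the text, so the sketch below infers q = 50 from the quoted figure of mq = 150,000 IVTM outputs; this inference is ours.

```python
# Sanity check of the unit and connection counts quoted in the text,
# with m = 3000 frequency lines; q = 50 is inferred from mq = 150,000.
m, q = 3000, 50

# Second MRM version (Pitts-McCulloch style), connections per MRM unit:
conns_v2 = ((m + m**2)                        # unit X: m MO + m^2 RO
            + (m**2 + m**2 // 2 + m**2 // 2)  # manifolds: MO + RO + multiplexed
            + 2 * m * q                       # multiplexed units
            + m * q                           # units X^1
            + (4 * m - 2)                     # RO normalizing units
            + (5 * m - 2))                    # MO normalizing units
units_v2 = m**2 + m * (3 + q)                 # simpler units per MRM unit

# First MRM version: m MR inputs + mq RO inputs per MRM unit
conns_v1 = m + m * q

print(f"v2: {conns_v2:.1e} connections/unit, {units_v2:.1e} simpler units/unit")
print(f"v1: {conns_v1:.2e} connections/unit")
print(f"whole v2 MRM (1000 units): {1000 * conns_v2:.1e} connections")
```

Summing the listed terms reproduces the quoted orders of magnitude: about 2.7×10⁷ connections and 9.1×10⁶ simpler units per second-version MRM unit, and 1.53×10⁵ connections per first-version unit.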
Multiplexing can be made unnecessary if we think of a third version of this module, at a
higher cost. We could have q matrices of units (one per IVTM column), one for MO
connections and the rest for RO connections. When an MRM unit memorizes in the matrix
of units where the MO connections arrive, the outputs of every unit of this matrix
are sent to the corresponding units of the rest of the q−1 matrices, where memorization
takes place too. In each manifold, units are fully interconnected to each other, so that if
one of the units which has not memorized receives input, it inhibits the rest of its
manifold's units, so they do not produce any output. Besides, there are m units which
receive the outputs which come from each manifold's corresponding same frequency, and
whose outputs are sent to unit X's inputs. This third version of the MRM does not
significantly increase the number of units with respect to the second version. However,
note that the full interconnection per manifold yields m²(m−1) connections per matrix;
that means we are talking about m³ connections per MRM unit, i.e. 2.7×10¹⁰ connections,
and therefore 2.7×10¹³ connections per MRM if we wanted 1000 memory-recognition
units; anyway, human technology cannot afford it nowadays.
4. Implementation issues.
The restricted synthesis modelling we have just presented, based on the processing of
real-world coded information carried by labelled lines, is an example of parallel, modular,
distributed and self-programming computation for real-world information processing.
Parallelism is needed because of the real-world processing, since inputs are
provided in parallel and real-time processing is required. As to modularity, it is a sound
methodological design procedure, but in this case it is also a consequence of two special
circumstances. First, disregarding the modules that pick up the real-world information, the
rest of the modules process feature maps, i.e. abstract information; therefore they can be used
for a wider class of problems, i.e. they are generic, just as their natural analogues are frequently
found all over the brain. Second, the high number of units and connections with
which we have to deal means that, if a real implementation has to be made, we cannot
think in terms of interconnected units any more, but in terms of interconnected modules,
i.e. prefabricated standard objects. These standard modules (or objects) are
characterized only by their functionality and by the number and characteristics of their
inputs and outputs.
As we have seen, the simpler the computation per unit, the greater the number of
units and connections needed to achieve the same global functionality. If we take the
current technological resources into account, increasing the complexity of each unit's
computation can jeopardize the real-time response of the net; thus, we must estimate
what we can implement, if anything, for a real-time response, and what we can only simulate,
outside any real-time response but still under real-world conditions.
The estimation of the number of units and connections for the 2nd version of the IVTM and
the 1st version of the MRM leads us to an implementation which involves thousands of units and
hundreds of thousands of connections. Due to the characteristics of this kind of
processing, simulations can and should be done simply by sequencing the parallel
computation module by module, and unit by unit, regarding the net layers in each module.
In spite of such numbers of units (about 3,000) and connections (about 300,000), this kind
of simulation is totally within reach, all the more so because very powerful GHz processors
and GB RAM memories will be available at low cost in the short term. It would be
desirable for programming interfaces to be available for performing those simulations,
such that researchers could define the net in terms of standard modules, picked from
either a standard library or a user library (so that researchers could define their own
modules). A very important characteristic would be that those simulations would process
real-world information, so the feature maps should be provided as input. That means that
recognition modules (RM) should be built in order to record those real-world
characteristics. This is a major difference with respect to current simulations, which
seldom perform computations with inputs under real conditions. Most use analytical
mathematical models which are not translated into synthesis models, so the real
computation is not described and therefore the simulations cannot display any information
about the behaviour of the real implementation.
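The sequencing strategy described above (module by module, layer by layer, unit by unit) can be sketched as follows. The data structures and names are illustrative, not part of the original design.

```python
def simulate_step(modules, inputs):
    """One global time step of a parallel, modular net, simulated
    sequentially: modules in dependency order, then layer by layer,
    then unit by unit within each layer."""
    signals = dict(inputs)
    for module in modules:
        for layer in module["layers"]:
            outputs = {}
            for unit in layer:
                # each unit computes locally from the signals it can see
                outputs[unit["name"]] = unit["fn"](signals)
            # a layer's results become visible only to the next layer,
            # mimicking the parallel update of the real net
            signals.update(outputs)
    return signals

# Illustrative two-layer module: unit "a" feeds unit "b".
net = [{"layers": [
    [{"name": "a", "fn": lambda s: s["x"] + 1}],
    [{"name": "b", "fn": lambda s: s["a"] * 2}],
]}]
print(simulate_step(net, {"x": 1}))  # {'x': 1, 'a': 2, 'b': 4}
```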
As to real implementations, we can distinguish two cases: specific and universal
machines, both including modules for picking up real-world information and rendering
feature maps. Specific machines are built for a definite purpose, with fixed modules and
fixed connections between and within them. In this case, units execute specific
programming, which could be either software or hardware. The former means that units
must be general-purpose processors; the latter means that units can be simpler,
specific circuits. The latter is the preferred option and therefore, as programs are Turing
machines (TM's), we call these specific machines "neural Turing machines" (NTM's).
These NTM's are made of standard modules, each of which is also an NTM. (We can also
think of machines which could build these NTM's following CAD/CAM design
specifications, given the high number of connections to be made, as well as of symbolic
languages which express the modules' characteristics and interconnections.) Finally,
starting from the knowledge level model described in terms of members of a PSM
library, a library of reduction methods is needed for building them, following a suitable
methodology [Mira 1997, 1998] [Herrero 1998, 1999].
Universal machines are justified in the same way as sequential universal machines (i.e., von
Neumann machines) are, but for neural computation instead. We call these machines "neural
universal Turing machines" (NUTM's). In these machines, units must be general-purpose
processors (i.e. UTM's, or universal Turing machines), and the connections between them
will be made depending on the case. Thus, any kind of module could be defined, as well
as the same kind of module with different dimensions, and connections between them
and also within them. A v o n Neumann machine could support the configuration of this
NUTM, the NUTM being a peripheral of the von Neumann machine. In the von Neumann
machine, we could perfonn simulations and design; there would be tile libraries of
modules, and once the design is achieved, this machine would load the suitable modules
in the NUTM, carrying out their configuration and interconnection. This would be a
totally flexible neural computer. We can think of a NUTM as net of computers, like the
current ones which support the big corporations' management, like banks, notorious
computer manufacturers, etc., which involve hundreds and even thousands of computers,
i.e. processors, all over the country or even the world. In general, these machines are
interconnected following telephone lines, that means connections capable of
configuration, even software configuration, and sometimes these lines are high speed
ones. Usually, there is a central (von Neumann) machine which supports the system
management and maintenance of the net, as well as software distribution utilities maintain
each system's software. Therefore, NUTM's are really possible in the world today,
although at a very expensive cost, in spite of they would not have the long distance
communication handicaps that the mentioned kind of computer nets actually have.
Nevertheless, the main problem is the lack of models, both at the knowledge level and at
the implementation level, which could be executed in these machines, and therefore the
corresponding reduction methods are not yet available.
5. Conclusions.
The lack of computational designs, which should be provided by reduction methods,
defers any implementation, since the solution for the available models based on accurate
analytical formulations can only be carried out by mathematical means; these are
computable, but do not by themselves render a description of any neural net, and thus
give no idea of the structure and dimensions of the implementation, the
required tools (if available), the required resources, and so on. Thus, inspiration
from natural neural nets is desirable: more research has to be done to find out more and
more about brain structures and their functioning, as well as about PSM's and the reduction
methods which translate those models to the connectionist implementation level. These
reduction methods' results should be neural nets, i.e. parallel, modular, distributed and
self-programming computation, where the modules consist of components also inspired by
already known structures widely found throughout the brain (lateral inhibitory nets,
columns, slanted scan, etc.). PSM's describe functionalities observed at the knowledge
level, beginning with the sense pathways, and simulations of them should be done under
real conditions with real-world inputs. By simulations under real conditions (i.e. by
sequencing local computations module by module, unit by unit, following the net design
and regarding inputs as they are received in the real implementation), we are preparing
the design of future resources and virtual machines which will then carry out the neural
computation, although we do not know whether new materials, components and
conceptions will necessarily have to replace the currently available ones in order to achieve
real-time responses.
6. Acknowledgements.
To J. Mira and A. E. Delgado, for their support and for their underlying ideas on a suitable
viewpoint of computation, computational problem solving and reduction methods.
7. References.
[Benjamins 1997] V.R. Benjamins & M. Aben. Structure-preserving knowledge-based system
development through reusable libraries: a case study in diagnosis. International Journal of Human-
Computer Studies, 47 (1997) 259-288.
[Churchland 1992] P.S. Churchland & T.J. Sejnowski. The Computational Brain. (The MIT Press,
Cambridge, MA, 1992).
[Darwin 1859] C. Darwin. On the Origin of Species by Means of Natural Selection, or the
Preservation of Favoured Races in the Struggle for Life. (London: John Murray, Albemarle Street, 1859).
[DeFelipe 1997] J. de Felipe. Microcircuits in the Brain. In Biological and Artificial Computation:
From Neuroscience to Technology. Mira, Moreno-Diaz & Cabestany (Eds.) (Springer, 1997) 1-14.
[Delgutte 1996] B. Delgutte. Physiological Models for Basic Auditory Percepts. In Auditory
Computation. H.L. Hawkins et al. (Eds.) (Springer, 1996) 157-220.
[Dnuw 1996] Department of Neurophysiology, University of Wisconsin - Madison. Hearing and
Balance. http://www.neurophys.wisc.edu/-ychen/texlbook/textindex.hlml
[Hanavan 1996] P.C. Hanavan. Virtual Tour of the Ear. http://ctl.augie.edu/perry/frames.htm
Augustana College, SD.
[Hawkins 1996] H.L. Hawkins & T.A. McMullen. Auditory Computation: An Overview. In Auditory
Computation. H.L. Hawkins et al. (Eds.) (Springer, 1996) 1-14.
[Herrero 1998] J.C. Herrero & J. Mira. In Search of a Common Structure Underlying a Representative
Set of Generic Tasks and Methods: The Hierarchical Classification and Therapy Planning Cases Study. In
Methodology and Tools in Knowledge Based Systems. Mira, del Pobil & Ali (Eds.) (Springer, 1998) 21-36.
[Herrero 1999] J.C. Herrero & J. Mira. SCHEMA: A Knowledge Edition Interface for Obtaining
Program Code from Structured Descriptions of PSM's. Two Cases Study. Applied Intelligence (accepted).
[Lyon 1996] R. Lyon & S. Shamma. Auditory Representations of Timbre and Pitch. In Auditory
Computation. H.L. Hawkins et al. (Eds.) (Springer, 1996) 221-270.
[Marr 1982] D. Marr. Vision. (Freeman, New York, 1982).
[Mira 1995] J. Mira et al. Aspectos básicos de la inteligencia artificial. (Sanz y Torres, 1995).
[Mira 1997] J. Mira, J.C. Herrero & A.E. Delgado. A Generic Formulation of Neural Nets as a Model
of Parallel and Self-Programming Computation. In Biological and Artificial Computation: From
Neuroscience to Technology. Mira, Moreno-Diaz & Cabestany (Eds.) (Springer, 1997) 195-206.
[Mira 1998] J. Mira, J.C. Herrero & A.E. Delgado. Where is Knowledge in Computational
Intelligence? On the Reduction of the Knowledge Level to the Level Below. Proceedings of the 24th
Euromicro Conference. (IEEE, 1998) 723-732.
[Mira&Delgado 1995a] J. Mira & A.E. Delgado. Reverse Neurophysiology: the "Embodiments of
Mind" Revisited. In Proceedings of the International Conference on Brain Processes, Theories and Models.
R. Moreno & J. Mira (Eds.) (The MIT Press, Cambridge, MA, 1995) 37-49.
[Mira&Delgado 1995b] J. Mira & A.E. Delgado. Computación neuronal avanzada: fundamentos
biológicos y aspectos metodológicos. In Computación Neuronal. Senén Barro & José Mira (Eds.)
(Universidad de Santiago de Compostela, 1995) 125-178.
[Mira&Delgado 1997] J. Mira & A.E. Delgado. Some Reflections on the Relationships between
Neuroscience and Computation. In Biological and Artificial Computation: From Neuroscience to
Technology. Mira, Moreno-Diaz & Cabestany (Eds.) (Springer, 1997) 15-26.
[Moreno-Diaz 1997] R. Moreno-Diaz. Systems Models of Retinal Cells: A Classical Example. In
Biological and Artificial Computation: From Neuroscience to Technology. Mira, Moreno-Diaz &
Cabestany (Eds.) (Springer, 1997) 178-194.
[Moreno-Diaz 1998] R. Moreno-Diaz. Neurocybernetics, Codes and Computation. In Tasks and
Methods in Applied Artificial Intelligence. Mira, del Pobil & Ali (Eds.) (Springer, 1998) 1-14.
[Mountain 1996] D.C. Mountain & A.E. Hubbard. Computational Analysis of Hair Cell and Auditory
Nerve Processes. In Auditory Computation. H.L. Hawkins et al. (Eds.) (Springer, 1996) 121-156.
[Newell 1981] A. Newell. The Knowledge Level. AI Magazine 2 (2) (Summer 1981) 1-20, 33.
[Pitts&McCulloch 1947] W. Pitts & W.S. McCulloch. How we know universals: the perception of
auditory and visual forms. Bulletin of Mathematical Biophysics, Vol. 9, pp. 127-147. (University of Chicago
Press, 1947).
[Russell 1948] B. Russell. Human Knowledge: Its Scope and Limits. (George Allen & Unwin, 1948).
[Russell 1950] B. Russell. An Inquiry into Meaning and Truth. The William James Lectures for 1940,
delivered at Harvard University. (George Allen & Unwin, 1950).
[Russell 1959] B. Russell. My Philosophical Development. (George Allen & Unwin, 1959).
[Turing 1950] A.M. Turing. Computing Machinery and Intelligence. Mind 59 (1950) 433-460.
[Whitehead 1913] A.N. Whitehead & B. Russell. Principia Mathematica. (Cambridge University
Press, 1913).
A Parametrizable Design of the Mechanical-Neural Transduction System of the
Auditory Brainstem
1 Introduction
time-space information, the items previously detected and will be driven to the main
nervous system.
2.1 Description
In most of the models [1,3], the flow of the transmitter fluid is modelled through
different reservoirs which can be found in the outer and the inner cytoplasm. These
models fit real behaviour more closely with respect to the adaptation process and
phase-locking. Van Schaik [4] gives an analogue implementation of the model, which
contains only simplified top-level characteristics of the more general hair cell
model. Lyon [5] and Kumar [6] have also developed some portions of the auditory system in
analogue hardware. We have decided to apply the Meddis physiological model [1,2] for
several reasons:
- To begin with, it develops the biological process in a more specific way than the
rest of the models.
- It reproduces in a realistic way some of the basic properties of the auditory nerve
fibers, such as two-component adaptation or phase-locking with the
stimulus for the fibers of high characteristic frequency.
- It is a nonlinear model, and the relationship between the linear form of a system
and its performance under adverse conditions has been examined.
- Finally, its computational cost is not very high compared with other similar
models.
2.2 Structure
The Meddis model simulates the electrical activity of the auditory nerve fibers
starting from a stimulus, in this case the displacement of the basilar membrane.
As noted previously, the onset of the stimulus provokes an
increase in the firing rate followed by an adaptation process that depends on the
intensity of the stimulus. However, the rate of decay to the adaptation level is
independent of the amplitude of the perceived signal. The first models developed had
difficulty in this respect, because the quantity of transmitter released (or its probability
of release) was determined directly by the intensity of the stimulus. Furthermore,
the cell was thoroughly emptied of transmitter substance in the presence of a
moderately intense signal. The firing rate must therefore depend on the amount of
neurotransmitter available; however, a stronger stimulus could then
fail to be propagated due to the lack of transmitter substance in the cell. As a
consequence, the model should keep a certain amount of neurotransmitter in stock
to answer increases in stimulation, so a certain quantity of this substance
should not be affected by the intensity of the signal.
With all this, the properties of the model can be summarized in Table 1.
Fig. 1. Reservoir diagram of the model: a factory replenishes the free transmitter pool q(t) at
rate y·[m − q(t)]; transmitter flows into the cleft c(t) at rate k(t)·q(t); the cleft loses
transmitter at rate l·c(t) and returns it to the reprocessing store w(t) at rate r·c(t), which
feeds the pool back at rate x·w(t). The output is c(t), the cleft contents.
Some transmitter is lost from the cleft (and from the system) by diffusion from the
cleft. This would lead to a gradual run-down of the system if it were not for a gradual
replenishment of the transmitter from a "factory" within the cell that slowly replaces
the lost transmitter.
k(t) = g·dt·[s(t) + A] / [s(t) + A + B]   for [s(t) + A] > 0
k(t) = 0                                  for [s(t) + A] ≤ 0        (1)

dw/dt = r·c(t) − x·w(t)                                             (2)

dq/dt = y·[1 − q(t)] + x·w(t) − k(t)·q(t)                           (3)

dc/dt = k(t)·q(t) − l·c(t) − r·c(t)                                 (4)
Equations (1)-(4) give a complete mathematical account of the model. The flow
parameters y (replenishment rate), l (cleft loss rate), r (reprocessing rate) and x
(adaptation rate) are constant; k(t) is a variable function of s(t), the instantaneous
amplitude of the acoustic stimulus. The k(t) equation represents a saturation function
with a maximum value of g·dt. We can also see that with no stimulus, s(t) = 0, we get a
non-zero value of k(t). This non-zero value is determined by the parameters A and B
(permeability constants). If we develop equations (1)-(4) according to the
design graph in Figure 1, we obtain the following in/out relationships, equations
(5)-(7), determined by the discrete form of the previously referenced differential
equations.
In these equations we can see, as the last addend of the right-hand side, the reservoirs w(t),
q(t) and c(t). This is due to the intrinsic feedback in the model and in the
development of the differential equations (1)-(4).
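As a cross-check of equations (1)-(4), the model can be integrated numerically with a simple Euler scheme. This is a sketch, not the VHDL design of the paper, and the parameter values below are illustrative rather than the calibrated constants of the original model; the saturation maximum g·dt of the discrete form is recovered here by multiplying the continuous rate k(t) by the Euler step dt.

```python
def meddis_step(q, c, w, s, dt,
                g=2000.0, A=5.0, B=300.0,             # permeability constants (illustrative)
                y=5.05, l=2500.0, r=6580.0, x=66.3):  # flow rates (illustrative)
    """One Euler step of the transmitter reservoirs:
    q = free transmitter pool, c = cleft contents, w = reprocessing store,
    s = instantaneous stimulus amplitude."""
    # eq. (1): saturating permeability, zero when s + A <= 0
    k = g * (s + A) / (s + A + B) if (s + A) > 0 else 0.0
    dq = y * (1.0 - q) + x * w - k * q   # eq. (3): free pool
    dc = k * q - l * c - r * c           # eq. (4): cleft
    dw = r * c - x * w                   # eq. (2): reprocessing store
    return q + dq * dt, c + dc * dt, w + dw * dt, k
```

The output of the model is c(t), the cleft contents; note that even at rest (s = 0) the permeability k(t) is non-zero, as the text observes.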
3 Computational analysis
The hair cell can be initially modeled at the structural level in VHDL. This model will be
used as a reference to verify the subsequent stages of the design. All the code was
converted to VHDL. Once the synthesized netlist had been verified, we proceeded
to obtain the values of the critical path, the number of ports and other determining factors
for the study of the structural model's performance.
3.1 Design
The data-flow architecture used to implement the hair cell, modeled according to
equations (5)-(7), is represented below. To carry out this task, the operations have
been structured in 7 stages, attending to the precedence between operations and their
possible parallelism in working out the results; moreover, the stages have been optimized to
obtain a minimal computing time. Figure 2 shows the complete model.
Fig. 2. Data-flow architecture of the hair cell, structured in stages 1 to 7; the outputs are the
updated reservoir values q(t+1) and c(t+1).
Finally, the divider used for the design is a barrel-shifter implementation running at 50
MHz (20 ns). This model is still in development, and its future performance is being
calculated.
4 Result analysis
Now let us look at the results obtained for the scheme of Figure 2 using the hardware
architecture of point 3.2. First we study the times of each stage, and then
the occupation (in area terms) of each stage for two different solutions.
Time (ns)
Stage      Solution 1   Solution 2
Stage 1       345.56       172.78
Stage 2       345.56       172.78
Stage 3        36.63        36.63
Stage 4       172.78       172.78
Stage 5        73.26        36.63
Stage 6        73.26        36.63
Stage 7        36.63        36.63
Sum          1083.68       664.86
Solution 1 involves only one functional unit of the same type per stage, while
solution 2 involves two functional units of the same type per stage. The clock
frequencies are 0.92 MHz and 1.5 MHz for solutions 1 and 2 respectively.
It is important to emphasize that solution 2 implies an increase in the
speed of the circuit, the difference between the two solutions being about 418 ns.
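The quoted clock frequencies follow directly from the summed stage latencies; a quick arithmetic check, with values taken from the table above:

```python
# Stage latencies in ns, copied from the table (solutions 1 and 2).
sol1 = [345.56, 345.56, 36.63, 172.78, 73.26, 73.26, 36.63]
sol2 = [172.78, 172.78, 36.63, 172.78, 36.63, 36.63, 36.63]

total1, total2 = sum(sol1), sum(sol2)    # critical-path latency per sample
f1_mhz = 1e3 / total1                    # 1/(ns) gives GHz; x1e3 gives MHz
f2_mhz = 1e3 / total2

print(f"solution 1: {total1:.2f} ns -> {f1_mhz:.2f} MHz")  # 1083.68 ns -> 0.92 MHz
print(f"solution 2: {total2:.2f} ns -> {f2_mhz:.2f} MHz")  # 664.86 ns -> 1.50 MHz
print(f"difference: {total1 - total2:.2f} ns")             # 418.82 ns
```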
We are now going to study the occupation, in area terms, of the two solutions described above.

Area (ports)
Stage      Solution 1   Solution 2
Stage 1         217          434
Stage 2         217          434
Stage 3          73           73
Stage 4         217          217
Stage 5          73          146
Stage 6          73          146
Stage 7          73           73
Sum             943         1523
As we can see, solution 2 involves a considerable increase in the area used in the circuit
design, requiring 580 ports more than solution 1. Taking into account the results of
Table 4, it is important to reach a compromise between speed and the number of ports
used in the design.
5 Conclusions
A hardware implementation of the mechanical-neural transduction model present in
mammals has been presented. The implementation of such a system is
based on the parametrizable VLSI development of a fundamental component, the
hair cell, modeled through standard cells starting from the scheme presented in Section 3.
The contribution of this design is the ability to make parametrizable and reusable
items using high-level design techniques. In this way, we have described the design of
a structural block, the hair cell, which will be used for the linear development of an
array of cells to simulate the function of the cochlea as the mechanical transduction
mechanism of the auditory system. Furthermore, this design will make it possible to build
modular libraries which can be used for the construction of a bio-inspired system for
human speech recognition.
Acknowledgements
This research is supported by PRONTIC 97-1011 and NATO CRG-960053 grants.
References
1. Meddis, R. (1986). "Simulation of mechanical to neural transduction in the auditory
receptor." J. Acoust. Soc. Am. 79, pp. 702-711.
4. Van Schaik, A., Fragniere, E., Vittoz, E. (1996). "A silicon model of amplitude modulation
detection in the auditory brainstem." Advances in Neural Information Processing Systems 9,
MIT Press.
5. Lyon, R.F., Mead, C. (1988). "An analog electronic cochlea." IEEE Trans. Acoust., Speech,
Signal Processing, vol. 36, pp. 1119-1134.
6. Kumar, N., Himmelbauer, W., Cauwenberghs, G., Andreou, A.G. (1997). "An analog
VLSI front-end for auditory signal analysis." IEEE International Conference on Neural
Networks 1997, pp. 876-881.
Development of a New Space Perception System for Blind People, Based on the
Creation of a Virtual Acoustic Space

¹González-Mora, J.L., ¹Rodríguez-Hernández, A., ²Rodríguez-Ramos, L.F., ²Díaz-Saco, L., ²Sosa, N.
Abstract. The aim of the project is to give blind people more information about their immediate
environment than they get using traditional methods. We have developed a device which captures the
form and the volume of the space in front of the blind person and sends this information, in the form of a
sound map, to the blind person through headphones in real time. The effect produced is comparable to
perceiving the environment as if the objects were covered with small sound sources which are
continuously and simultaneously emitting signals. An experimental working prototype has been
developed, which has allowed us to validate the idea that it is possible to perceive the spatial
characteristics of the environment. The validation experiments have been carried out with the
collaboration of blind people; to a large extent, the sound perception of the environment has been
accompanied by simultaneous visual evocation, namely the visualisation of luminous points
(phosphenes) located at the same positions as the virtual sound sources.
This new form of global and simultaneous perception of three-dimensional space via a sense
other than vision will improve the user's immediate knowledge of, and interaction with, the
environment, giving the person more independence in orientation and mobility. It also paves the way for
an interesting line of research in the field of sensory rehabilitation, with immediate applications in
the psychomotor development of children with congenital blindness.
1 Introduction
From both a physiological and a psychological point of view, the existence of three
senses capable of generating the perception of space (vision, hearing and touch) can be
considered. They all use comparative processes between the information received at
spatially separated sensors; complex neural integration algorithms then allow the three
dimensions of our surroundings to be perceived and "felt" [2]. Therefore, not only light
but also sound can be used to carry spatial information to the brain, thus creating
the psychological perception of space [14].
The basic idea of this project can be intuitively imagined as trying to emulate, using
virtual reality techniques, the continuous stream of information flowing to the brain
through the eyes, coming from the objects which define the surrounding space, and being
carried by the light which illuminates the room. In this scheme two slightly different
images of the environment are formed on the retina with the light reflected by
surrounding objects, and processed by the brain in order to generate its perception. The
proposed analogy consists of simulating the sounds that all objects in the surrounding
space would generate, these sounds being capable of carrying enough information, despite
source position, to allow the brain to create a three-dimensional perception of the objects
in the environment and their spatial arrangement, after modelling their position,
orientation and relative depth.
This simulation will generate a perception which is equivalent to covering all
surrounding objects (doors, chairs, windows, walls, etc.) with small loudspeakers emitting
sounds according to their physical characteristics (colour, texture, light level, etc.). In this
situation, the brain can access this information together with the sound source position,
using its natural capabilities. The overall hearing of all these sounds will allow the blind person
to form an idea of what his/her surroundings are like and how they are organised, to
the point of being capable of understanding them and moving about as though he/she could see them.
A lot of work has been done on the application of technical aids for the
handicapped, and particularly for the blind. This work can be divided into two broad
categories: orientation providers (both at city and at building level) and obstacle detectors.
The former have been investigated all over the world, a good example being the
MOBIC project, which supplies positional information obtained from both a GPS satellite
receiver and a computerised cartography system. There are also many examples of the
latter group, using all kinds of sensing devices for identifying obstacles (ultrasonic, laser,
etc.), and informing the blind user by means of simple or complex sounds. The "Sonic
Path Finder" prototype developed by the Blind Mobility Research Group, University of
Nottingham, should be specifically mentioned here.
Our system fulfils the criteria of the first group because it can provide its users with
an orientation capability, but goes much further by building a perception of space itself at
neuronal level [20,18], which can be used by the blind person not only as a guide for
moving, but also as a way of creating a brain map of how his surrounding space is
organised.
A very successful and qualified precedent of our work is the KASPA system [8],
developed by Dr. Leslie Kay and commercialised by SonicVisioN. This system uses an
ultrasonic transmitter and three receivers with different directional responses. After
suitable demodulation, acoustic signals carrying spatial information are generated, which
can be learnt by the blind user after some training. Other systems have also tried to
perform the conversion between image and sound, such as the system invented by Mr.
Peter Meijer (PHILIPS), which scans the image horizontally in a temporal sequence;
every pixel of a vertical column contributes a specific tone with an amplitude proportional
to its grey level.
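The horizontal-scan conversion attributed to Meijer's system can be illustrated with a minimal sketch: each vertical column of pixels is rendered as a mixture of sine tones, one per row, with amplitude proportional to the grey level. All parameter choices below (sample rate, frequency range, row-to-frequency mapping) are illustrative assumptions, not the actual system's values.

```python
import math

def column_to_samples(column, sample_rate=8000, duration=0.05,
                      f_low=500.0, f_high=5000.0):
    """Render one image column (grey levels 0..255, top row first) as audio:
    each pixel contributes a sine tone; row -> frequency (top = high, an
    assumed convention), grey level -> amplitude."""
    n_rows = len(column)
    n_samples = int(sample_rate * duration)
    samples = []
    for i in range(n_samples):
        t = i / sample_rate
        value = 0.0
        for row, grey in enumerate(column):
            frac = 1.0 - row / max(n_rows - 1, 1)      # 1 at top, 0 at bottom
            freq = f_low + frac * (f_high - f_low)
            value += (grey / 255.0) * math.sin(2 * math.pi * freq * t)
        samples.append(value / n_rows)                 # normalise the mixture
    return samples
```

Scanning the image column by column over time then yields the temporal sequence described above.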
The aim of our work is to develop a prototype capable of capturing a three-
dimensional description of the surrounding space, as well as other characteristics such as
colour, texture, etc., in order to translate them into binaural sonic parameters, virtually
allocating a sound source to every position of the surrounding space, and performing this
task in real time, i.e. fast enough in comparison with the brain's perception speed to
allow training through simple interaction with the environment.

Fig. 1. Two-dimensional example of the system behaviour (a room with a half-open door
leading to a corridor).

The example uses stereopixels which represent the horizontal resolution of the vision
system (however, the equipment could work with
an image of 16 x 16 and 16 depths), providing more detail at the centre of the field in the
same way as human vision. The description of the surroundings is obtained by
calculating the average depth (or distance) of each stereopixel. This description is
virtually converted into sound sources located at every stereopixel's distance, thus
producing the perception depicted in drawing c, where the major components of the
surrounding space can be easily recognised (the room itself, the half-open door, the
corridor, etc.).
This example contains the equivalent of just one acoustic image, constrained to two
dimensions for ease of representation. The real prototype will produce about ten such
images per second, and include a third (vertical) dimension, enough for the brain to build
a real (neuronal based) perception of the surroundings.
Two completely different signal processing areas are needed for the
implementation of a system capable of performing this simulation. First, it is necessary to
capture information about the surroundings, basically a depth map with simple attributes such
as colour or texture. Secondly, every depth has to be converted into a virtual sound
source, with sound parameters coherently related to the attributes and located in the
spatial position contained in the depth map. All this processing has to be completed in
real time with respect to the speed of human perception, i.e. approximately ten times per
second.
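The two processing areas can be tied together in a skeleton loop like the one below. The function and field names are placeholders standing in for the vision and acoustic subsystems; only the ~10 Hz rate comes from the text.

```python
import time

def frame_to_sources(depth_map):
    """Turn one depth map (a list of stereopixels) into virtual sound
    sources; the field names are illustrative, not the real interface."""
    return [{"azimuth": px["azimuth"],
             "elevation": px["elevation"],
             "distance": px["depth"]} for px in depth_map]

def perception_loop(capture, render, rate_hz=10, n_frames=10):
    """Run the capture -> spatialise pipeline at roughly rate_hz frames/s:
    `capture` stands for the vision subsystem, `render` for the acoustic one."""
    period = 1.0 / rate_hz
    for _ in range(n_frames):
        start = time.monotonic()
        render(frame_to_sources(capture()))
        # keep the frame rate compatible with the brain's perception speed
        time.sleep(max(0.0, period - (time.monotonic() - start)))
```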
Fig. 2. Conceptual diagram of the prototype: two miniature colour video cameras (JAI
CV-M1050) feed a MATROX frame grabber (model GENESIS); an Ethernet link (TCP-IP)
connects the vision subsystem to the acoustic subsystem, built on Huron bus cards (DSP
56002, A/D, D/A) driving professional SENNHEISER HD-580 headphones.
Figure 2 shows a conceptual diagram of the technical solution we have chosen for
the prototype development. The overall system has been divided into two subsystems:
vision and acoustic. The former captures the shape and characteristics of the surrounding
space, and the second simulates the sound sources as if they were located where the
vision system has measured them. Their sounds depend on the selected parameters, both
reinforcing the spatial position indication and also carrying colour, texture, or light-level
information. Both subsystems are linked using a TCP-IP Ethernet link.
The Vision Subsystem
A stereoscopic machine vision system has been selected for the surrounding data
capture [12]. Two miniature colour cameras are glued to the frame of conventional
spectacles, which will be worn by the blind person using the system. The set will be
calibrated in order to calculate absolute depths. In the prototype system, a feature-based
method is used to calculate a disparity map. First of all, the vision subsystem obtains a set
of corner features all over each image, and the matching calculation is based on the
epipolar restriction and the similarity of the grey levels in the neighbourhood of the
selected corners.
The map is sparse but it can be obtained in a short time and contains enough
information for the overall system to behave correctly.
The vision subsystem hardware is based on a high-performance PC
(PENTIUM II, 300 MHz) with a MATROX frame grabber board, model GENESIS,
featuring a C80 DSP.
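The conversion from a matched-corner disparity to absolute depth follows classical stereo triangulation for a calibrated pair; a minimal sketch, with illustrative values for the focal length and camera baseline (neither is given in the text):

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Classical stereo triangulation for a calibrated, rectified pair:
    depth Z = f * B / d, with focal length f in pixels, baseline B in
    metres and disparity d in pixels."""
    if disparity_px <= 0:
        return float("inf")   # zero disparity: point at infinity
    return focal_px * baseline_m / disparity_px

# Illustrative numbers: f = 700 px, B = 6 cm; a 21-px disparity
# then corresponds to a depth of 2 m.
print(disparity_to_depth(21, 700, 0.06))
```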
taken from the 6 healthy, sighted young volunteers with closed eyes in all the
experimental conditions. All the subjects included in both experimental groups described
above were selected according to the results of an audiometric control. The acoustic
experimental stimulus generated was a burst of 6 Dirac deltas spaced at 100 msec and the
subjects indicated the apparent spatial position by calling out numerical estimates of
apparent azimuth and elevation, using standard spherical coordinates. These acoustic
stimuli were generated to simulate a set of five virtual positions covering a 90-deg range
of azimuths and elevation from 30 deg below the horizontal plane to 60 deg above it. The
depth or Z was studied by placing the virtual sound at different distances of up to 4
meters, which were divided into five intermediate positions in a logarithmic arrangement,
from the subjects.
2.4 Data analysis
The data obtained from both experimental groups (blind people as well as sighted
subjects) were evaluated by analysis of variance (ANOVA), comparing the changes in
the response following the change of virtual sound sources. This was followed by post-
hoc comparisons of both group values using Bonferroni's Multiple Comparison Test.
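The analysis just described can be sketched in a few lines: a one-way ANOVA F statistic over the group responses, followed by a Bonferroni adjustment of the per-comparison significance level. This is a minimal, library-free sketch; the function names are illustrative and it is not the authors' analysis code.

```python
# Hedged sketch of the statistical analysis: one-way ANOVA F statistic
# (between-group vs. within-group variance) and the Bonferroni-corrected
# per-test alpha used for the post-hoc comparisons.

def f_oneway(*groups):
    """One-way ANOVA F statistic for k groups of observations."""
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

def bonferroni_alpha(alpha, n_comparisons):
    """Per-test significance level under Bonferroni correction."""
    return alpha / n_comparisons
```

In practice the F statistic would be compared against the F distribution with (k-1, N-k) degrees of freedom to obtain the p-values reported in the figures.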
3 Results
Having first checked that neither blind subjects nor sighted controls could
distinguish real sound sources from their corresponding virtual ones, we tried to
determine the capability of blind people to localise virtual sound sources compared
with sighted controls. Without any previous experience, both groups carried out
localisation tests of spatialized virtual sounds, each lasting 4 seconds. We found
significant differences in blind people as well as in the sighted group when the sound
came from different azimuthal positions (see figure 3). However, as can be observed
in this graph, blind people detected the position of the source with more accuracy
than people with normal vision.
[Fig. 3.- Graph: accuracy of virtual sound localisation in azimuth, blind vs. sighted controls; ** = p<0.005.]
Fig. 4.- Mean percentages (with standard deviations) of accuracy in response to the
virtual sound localisation generated through headphones in elevation. ** = p<0.005.
When the virtual sound sources were arranged in a vertical position, to evaluate
the discrimination capacity in elevation, one can see that there were significant
differences amongst the blind group, which did not exist in the control group (see figure
4).
Fig. 5.- Mean percentages (with standard deviations) of accuracy in response to the
virtual sound localisation generated through headphones in distances (Z axis).
** = p<0.005.
Figure 5 shows that both groups can distinguish the distances well; nevertheless,
only the group of blind subjects showed significant differences. The results of the initial
tests using simultaneous multiple virtual or real sounds showed that, fundamentally in
blind subjects, it is possible to generate the perception of a spatial image from the spatial
information contained in sounds. The subjects can perceive complex tridimensional
aspects from this image, such as form, azimuthal and vertical dimensions, surface
sensation, limits against a silent background, and even the presence of several spatial
images related to different objects. This perception seems to be accompanied by an
impression of reality, a vivid constancy of the presence of the object we have
attempted to reproduce. It is interesting to mention that, in some subjects, the
tridimensional pattern of sound-evoked perceptions had mental representations which
were subjectively described as being more similar to visual images than to auditive
ones. Presented in a general way, and considering that the objects to be perceived are
punctual shapes, or change from punctual shapes into mono-, bi- and three-dimensional
shapes (including horizontal or vertical lines; concave or convex, isolated or grouped,
flat and curved surfaces composing figures, e.g., squares; or columns or parallel rows),
the following observed aspects stand out:
• An object located in the field of the user's perception, generated from the received
sound information, can be described, and therefore perceived, in significant spatial aspects
such as its position, its distance, and its dimensions in the horizontal and vertical axes,
and even in the depth (Z) axis.
• Two objects separated by a certain distance, each one inside the perceptual field
captured by the system, can be perceived in their exact positions, regardless of their
relative distances from each other.
• After a brief period of time, which is normally immediate, the objects in the
environment are perceived in their own spatial disposition in a global manner, and the
final perception is that all the objects appear to be inside a global scene.
This suggests that the blind can, with the help of this interface, recognise the
presence of a panel or rectangular surface in its position, at its correct distance, and with
its dimensions of width and height. Surface structures of spatial continuity, e.g. a door,
window, or gap, are also perceived. Two parallel panels forming the shape of a corridor
are perceived as two objects, one on each side, with their vertical dimensions and depth,
and with a space between them through which one can pass.
In an attempt to simulate the everyday tasks of the blind, we created a dummy and a
very simple experimental room. The blind subject was able to move in this space without
relying on touch, and could extract enough information to then give a verbal global
image, graphically described (see figure 6), including its general disposition with respect
to the starting point, the presence of the walls, his/her relative position, the existence of a
gap simulating a window in one of them, the position of the door, and the existence of a
central column, perceived in its vertical and horizontal dimensions. In summary, it was
possible to move freely everywhere in the experimental room.
Fig. 6.- A. Schematic representation of the experimental room, with a particular
distribution of objects. B. Drawing made by a blind person after a very short exploration
using the developed prototype, without relying on touch.
It is very important to remark that in several blind people the sound perception of
the environment was accompanied by simultaneous visual evocation, consisting of
punctate spots of light (phosphenes) located in the same positions as the virtual sound
sources. The phosphenes did not flicker, so this perception gives a great impression of
reality and is described by the blind as visual images of the environment.
4 Discussion
Do blind people develop the capacities of their other remaining senses to a higher
level than those of sighted people? This has been an important question of debate for
a long time. Anecdotal evidence in favour of this hypothesis abounds, and a number of
systematic studies have provided experimental evidence for compensatory plasticity in
blind humans [15], [19], [16]. Other authors have often argued that blind individuals
should also have perceptual and learning disabilities in their other senses, such as the
auditory system, because vision is needed to instruct them [10], [17]. Thus, the
question of whether intermodal plasticity exists has remained one of the most vexing
problems in cognitive neuroscience. In the last few years, results of PET and MRI in blind
humans indicate activation of areas that are normally visual during auditory stimulation
[23], [4] or Braille reading [19]. In most cases, a compensatory expansion of
auditory areas at the expense of visual areas was observed [14]. In principle this would
suggest that this would result in a finer resolution of auditory behaviour rather than in a
reinterpretation of auditory signals as visual ones. However, these findings pose several
interesting questions. What kind of percept does a blind individual experience when
a 'visual' area becomes activated by an auditory stimulus? Does the co-activation of
'visual' regions add anything to the quality of this sound that is not perceived normally, or
does the expansion of auditory territory simply enhance the accuracy of perception for
auditory stimuli?
Accordingly, our findings suggest that, at least in our sample, blind people
present a significantly higher spatial capability of acoustic localisation than visually
enabled subjects. This capability, as one would expect, is more pronounced in
azimuth than in elevation and distance; nevertheless, in the latter two the differences
are still statistically significant. These results support the idea of a possible use of the
auditory system as a substratum to transport spatial information in visually disabled
people and, in fact, the system we have developed using multiple virtual sounds suggests
that the brain can generate an image of the spatial occupation of an object with its shape,
size and three-dimensional location. To form this image the brain needs to receive spatial
information about the characteristics of the object's spatial disposition, and this
information needs to arrive fast enough so that the flow is not interrupted, regardless of
the sensorial source it comes through.
It seems plausible that neighbouring cortical areas share certain functional
aspects, defined partly by their common projection targets. In agreement with our results,
several authors think that the function shared by all sensory modalities seems to be spatial
processing [14]. Therefore, a common code for spatial information that can be interpreted
by the nervous system has to be used, and probably the parietal areas, in conjunction with
the prefrontal areas, form a network involved in sound spatial perception and selective
attention [6].
Thus, to explain our results, it is necessary to consider that signals from many
different modalities need to be combined in order to create an abstract representation of
space that can be used, for instance, to guide movements. Many authors [3], [6] have
shown evidence that the posterior parietal cortex combines visual, auditory, eye position,
head position, eye velocity, vestibular, and proprioceptive signals in order to perform
spatial operations. These signals are combined in a systematic fashion by using the gain
field mechanism. This mechanism can represent space in a distributed format that is quite
powerful, allowing inputs from multiple sensory systems with discordant spatial frames
and sending out signals for action in many different motor co-ordinate frames. Our
holistic impression of space, independent of sensory modality, may be embodied in this
abstract and distributed representation of space in the posterior parietal cortex.
These spatial representations generated in the posterior parietal cortex are related to other
higher cognitive neuronal activities, including attention.
In conclusion, our results suggest a possible amodal treatment of spatial information;
in situations such as after the plastic changes which follow sensorial deficits, this
could have practical implications in the field of sensorial substitution and
rehabilitation. Furthermore, contrary to the results obtained from other lines of research
into sensorial substitution [8], [4], the results of this project were spontaneous and
did not follow any protocol of previous learning, which suggests the high potential of the
auditory system and of the human brain, provided the stimuli are presented in the most
complete and coherent way possible.
Regarding the appearance of the evoked visual stimuli that we have found when
blind people are exposed to spatialized sounds, using the Dirac deltas is very important in
this context, since this demonstrates that the proposed method can, without direct
Acknowledgements
This work was supported by Grants from the Government of the Canary Islands, the
European Community and IMSERSO (Piter Grants).
References
1. Albert S. Bregman, Auditory Scene Analysis, The MIT Press (1990).
2. Alho, K., Kujala, T., Paavilainen, P., Summala, H. and Näätänen, R. Auditory processing in visual areas of the
early blind: evidence from event-related potentials. Electroenc. and Clin. Neurophysiol. 86 (1993) 418-427.
3. Andersen, R., Snyder, L.H., Bradley, D.C., Xing, J. (1997). Multimodal representation of space in posterior
parietal cortex and its use in planning movements. Annu. Rev. Neurosci. 20, 303-330.
4. Bach-y-Rita, P. Vision Substitution by Tactile Image Projection. Nature. Vol 221, 8,963-964, 1969.
5. Frederic L. Wightman & Doris J. Kistler, "Headphone simulation of free-field listening. I: Stimulus
synthesis", "II: Psychophysical validation", J. Acoust. Soc. Am. 85 (2), feb 1989.
6. Griffiths, T., Rees, G., Green, G., Witton, C., Rowe, D., Büchel, C., Turner, R., Frackowiak, R. (1998). Right
parietal cortex is involved in the perception of sound movement in humans. Nature Neuroscience 1, 74-77.
7. Innocenti G.M., Clarke S., (1984), Bilateral transitory projection to visual areas from auditory cortex in
kittens. Develop. Brain Research. 14: 143-148.
8. Kay, L. Air sonars with acoustical display of spatial information. In Busnel, R.-G. and Fish, J.F. (Eds.),
Animal Sonar Systems, 769-816. New York: Plenum Press.
9. Kujala, T., (1992). Neural plasticity in processing of sound location by the early blind: an event-related
potential study. Electroencephalogr. Clin. Neurophysiol. 84,469-472.
10. Locke, J. (1991). An Essay Concerning Human Understanding (Reprinted 1991, Turtle).
11. Lessell, S. and M.M. Cohen. Phosphenes induced by sound. Neurology 29: 1524-1526, 1979.
12. Nitzan, David. "Three Dimensional Vision Structure for Robot Applications", IEEE Trans. Patt. Analysis
& Mach. Intell., 1988.
13. Page, N.G., J.P. Bolger, and M.D. Sanders. Auditory evoked phosphenes in optic nerve disease.
J.Neurol.Neurosurg.Psychiatry 45: 7-12, 1982.
14. Rauschecker JP, Korte M. (1993.) Auditory compensation of early Blindness in cat cerebral cortex.
Journal of Neuroscience, 13(10) 4538:4548.
15. Rauschecker JP. (1995). Compensatory plasticity and sensory substitution in the cerebral cortex.
TINS. 18,1,36-43
16. Rice CE (1995) Early blindness, early experience, and perceptual enhancement. Res Bull Am Found Blind
22:1-22.
17. Rock, I, (1966). The Nature of Perceptual Adaptation. Basic Books.
18. Rodríguez-Ramos, L.F., Chulani, H.M., Díaz-Saco, L., Sosa, N., Rodríguez-Hernández, A., González-
Mora, J.L. (1997). Image and Sound Processing for the Creation of a Virtual Acoustic Space for the
Blind People. Signal Processing and Communications, 472-475.
19. Sadato, N., Pascual-Leone, A., Grafman, J., Ibáñez, V., Deiber, M.P., Dold, G., Hallett, M. (1996).
Activation of primary visual cortex by Braille reading in blind people. Nature. 380, 526-527.
20. Takahashi, T.T., Keller, C.H. (1994). "Representation of Multiple Sound Sources in the Owl's Auditory
Map." Journal of Neuroscience, 14(8) 4780-4793.
21. Takeo Kanade, Atsushi Yoshida. A Stereo Matching for Video-rate Dense Depth Mapping and Its New
Applications (Carnegie Mellon University). Proceedings of the 15th Computer Vision and Pattern
Recognition Conference.
22. Tasker, R.R., L.W. Organ, and P. Hawrylyshyn. Visual phenomena evoked by electrical stimulation of the
human brain stem. Appl.Neurophysiol. 43: 89-95, 1980.
23. Veraart, C., De Volder, A.G., Wanet-Defalque, M.-C., Bol, A., Michel, Ch., Goffinet, A.M. (1990).
Glucose utilisation in visual cortex is abnormally elevated in blindness of early onset but decreased in
blindness of late onset. Brain Res. 510, 115-121.
24. Winfield, D.A. The postnatal development of synapses in the visual cortex of the cat and the effects of
eyelid closure. Brain Res. 206 (1981) 166-171.
Application of the Fuzzy Kohonen Clustering Network
to Biological Macromolecules Images Classification
1. Introduction.
Image classification is a very important step in the three-dimensional study of
biological macromolecules using Electron Microscopy (EM) because three-
dimensional reconstruction methods need a homogeneous set of projections, that is,
different projection views (two-dimensional images) from the same biological
specimen. Obtaining such a set is a very complicated task due to several factors: the
low signal/noise ratio of the images obtained in the electron microscope and the
intrinsic heterogeneity in the set of images. This is because a biochemically
homogeneous population does not necessarily produce a homogeneous set of images,
since different 2D views of the same 3D structure may exist, and projections are
usually obtained from a large set of particles of the same specimen.
In the context of Pattern Recognition and Classification in Electron Microscopy,
different approaches have been previously used: classical statistical methods,
clustering techniques and Neural Networks. Multivariate Statistical Analysis (MSA)
[1][2] was first proposed as a way to reduce the number of variables characterizing
an image, and in some cases a visual inspection was enough to enable the
identification of the clusters in the data set under analysis. Visual inspection,
however, is not suitable for all kinds of data, so more objective clustering methods
2. Materials and Methods.
We have used a set of images of negatively stained hexamers of the SPP1 G40P
helicase obtained in the electron microscope [14]. 2458 images were translationally
and rotationally aligned and their rotational power spectra were calculated [15]. For
experimental purposes, we created two data sets: one composed of 2458 rotational
power spectra (up to 15 harmonics) and the other composed of 338 50x50-pixel
images that were extracted from an apparently homogeneous 6-fold and 3-fold
symmetry population. The original images have a very low signal/noise ratio, making
visual classification impossible (Figure 1).
Fig. 1. Two examples of the images used for the experiments. As can be seen, they have a low signal/noise ratio,
making visual classification impossible.
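The rotational power spectra used above measure the strength of each angular harmonic of a particle. As a hedged sketch, assuming the image has already been reduced to an intensity profile sampled at equal angles around the particle centre (the ring-sampling step of [15] is omitted), the harmonic powers are just a discrete Fourier transform of that profile:

```python
import math

def rotational_power_spectrum(angular_profile, max_harmonic=15):
    """Power of angular harmonics 1..max_harmonic of an intensity profile
    sampled at n equal angles around the particle centre. A plain DFT
    sketch of the harmonic analysis; names and normalisation are
    illustrative, not taken from the paper's implementation."""
    n = len(angular_profile)
    powers = []
    for h in range(1, max_harmonic + 1):
        re = sum(v * math.cos(2.0 * math.pi * h * i / n)
                 for i, v in enumerate(angular_profile))
        im = sum(v * math.sin(2.0 * math.pi * h * i / n)
                 for i, v in enumerate(angular_profile))
        powers.append((re * re + im * im) / n)
    return powers
```

A profile with pure 6-fold symmetry would concentrate its power in harmonic 6, which is why 6-fold and 3-fold particles separate in this 15-component representation.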
The Kohonen model is a neural network that simulates the hypothesized self-
organization process carried out in the human brain when some input data is presented
[7]. The algorithm can be interpreted as a nonlinear projection of the probability
density function of the n-dimensional input onto an output array of nodes. The
functionality of the algorithm can be described as follows: when an input vector is
presented to the net, the neurons in the output layer compete among themselves and
the winner (whose weight has the minimum distance from the input) as well as a
predefined set of neighbors update their weights. This process is continued until some
stopping criterion is met, usually when the weight vectors "stabilize" or when a given
number of iterations is completed. The update rule of this algorithm is:

    v_i(t+1) = v_i(t) + α_t h_ri(t) [x_k − v_i(t)]    (1)

where the learning rate α_t is defined as a decreasing function that controls the
magnitude of the changes with time, and h_ri is a sigmoidal function that controls the
neighborhood of the winning node to be updated during training.
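One step of the update rule described above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the linear decay of the learning rate and the Gaussian neighbourhood kernel stand in for α_t and h_ri, and all parameter values are assumptions.

```python
import math

def som_update(weights, x, t, t_max, grid_pos, sigma0=2.0, alpha0=0.5):
    """One Kohonen update step: the winner and its grid neighbours move
    towards the input x. weights[i] is the weight vector of output node
    i and grid_pos[i] its coordinates on the output array."""
    # winner = output node whose weight vector is closest to the input
    win = min(range(len(weights)),
              key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))
    alpha = alpha0 * (1.0 - t / t_max)          # decreasing learning rate
    sigma = sigma0 * (1.0 - t / t_max) + 1e-9   # shrinking neighbourhood
    for i in range(len(weights)):
        d2 = sum((a - b) ** 2 for a, b in zip(grid_pos[i], grid_pos[win]))
        h = math.exp(-d2 / (2.0 * sigma ** 2))  # neighbourhood factor h
        weights[i] = [w + alpha * h * (v - w) for w, v in zip(weights[i], x)]
    return win
```

Repeating this step over the 2458 spectra, while α and the neighbourhood shrink, yields the ordered 7x7 map clustered manually in Figure 2.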
2.3 Fuzzy c-means
Fuzzy c-means clustering is a process of grouping similar objects into the same class,
but the resulting partition is fuzzy, which means that in this case images are not
assigned exclusively to a single class, but partially to all classes. The goal is to
optimize the clustering criteria in order to achieve a high intracluster similarity and a
low intercluster similarity using n-dimensional feature vectors. The theoretical basis
of these methods has been reported in detail elsewhere [10][11] and will only be
briefly reviewed here.
Let X = {x_1, x_2, x_3, ..., x_n} denote a set of n feature vectors x_k. The data set X is to
be partitioned into c fuzzy clusters, where 1 < c < n, c being the number of clusters to
be found. A c-partition of X can be represented by u_i(x_k) or u_ik, where u_ik is a
continuous function in the [0,1] interval and represents the membership of x_k in
cluster i, 1 ≤ i ≤ c, 1 ≤ k ≤ n. The fuzzy c-means algorithm consists of an iterative
optimization of an objective function:

    J_m(U,v) = Σ_{k=1..n} Σ_{i=1..c} (u_ik)^m D_ik    (2)

where D_ik = ||x_k − v_i||²_A is the distance from x_k to the cluster center v_i
and A is a positive definite matrix. The parameter m determines the "fuzziness" of the
result, m ∈ [1,∞). The choice of m depends on the data under analysis.
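The memberships that minimize the objective (2) for fixed centers have a standard closed form, u_ik = 1 / Σ_j (D_ik/D_jk)^{1/(m-1)}. The following is a minimal sketch of that computation, assuming squared Euclidean distance (A = the identity matrix); the 1e-12 distance floor and the normalisation by the smallest distance (which avoids numerical overflow) are implementation choices introduced here.

```python
def fcm_memberships(X, centers, m=1.5):
    """Fuzzy c-means memberships u_ik for points X and fixed centers,
    using squared Euclidean distances D_ik. Each column of U (one per
    point) sums to 1 by construction."""
    e = 1.0 / (m - 1.0)
    U = [[0.0] * len(X) for _ in centers]
    for k, x in enumerate(X):
        d = [max(sum((a - b) ** 2 for a, b in zip(v, x)), 1e-12)
             for v in centers]
        dmin = min(d)
        w = [(dmin / dk) ** e for dk in d]   # ratios <= 1: no overflow
        s = sum(w)
        for i in range(len(centers)):
            U[i][k] = w[i] / s
    return U
```

Alternating this membership step with a membership-weighted recomputation of the centers is the usual FCM iteration.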
The Fuzzy Kohonen clustering network [11] is a type of Neural Network that combines
both methods described above: SOM and Fuzzy c-means. The structure of this self-
organizing feature-mapping model consists of two layers: input and output. The
input layer is composed of p nodes, where p is the number of features, and the output
layer is formed by c nodes, where c is the number of clusters to be found. Every
single input node is fully connected to all output nodes, with an adjustable weight v_i
(cluster center) assigned to each connection. Given an input vector, the neurons in the
output layer update their weights based on a pre-defined learning rate α. This
approach integrates the fuzzy membership u_ik from the FCM in the following update
rule:

    v_i(t+1) = v_i(t) + Σ_{k=1..n} α_ik(t) [x_k − v_i(t)] / Σ_{k=1..n} α_ik(t),
    with α_ik(t) = (u_ik(t))^{m_t}, m_t = m_0 − t·Δm, Δm = (m_0 − 1)/t_max

where u_ik is the fuzzy membership matrix calculated by Fuzzy c-means, m_0 is any
positive constant greater than one, t is the current iteration and t_max is the iteration
limit.
• The learning rate is a function of the iteration t, and its effect is to distribute the
contribution of each input vector x_k to the next update of the neuron weights
inversely proportionally to their distance from x_k. The winner node (whose weight
has the minimum distance from the input) updates its weight favored by the
learning rate as the iteration increases, and in this way the Kohonen concepts of
neighborhood size and neighborhood updating are embedded in this new learning
rate.
• FKCN is not sequential; code vectors are updated after each pass through X
(the input vectors). Hence, it is not label dependent.
• In the limit (m_t = 1), the update rule reverts to Hard c-means (winner takes all).
• For a fixed m_0 > 1 (that is, Δm = 0) FKCN is truly a Fuzzy c-means algorithm.
• FKCN is truly a Kohonen-type algorithm, because it possesses a well-defined
method for adjusting both the learning rate distribution and the update neighborhood
as functions of time. Hence, FKCN inherits the "self-organizing" structure of
SOM-type algorithms and, at the same time, is stepwise optimal with respect to a
widely used fuzzy clustering model.
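Putting the pieces together, the FKCN iteration described above might be sketched as follows (after Tsao, Bezdek & Pal [11]): FCM memberships raised to the decreasing exponent m_t act as the learning rates of a batch Kohonen-style update. Initialising the centers with the first c points, the 1e-12 numerical floors, and the parameter defaults are illustrative assumptions, not details of the paper's code.

```python
def _memberships(X, centers, m):
    """FCM memberships with squared Euclidean distances, computed with
    ratios <= 1 so large membership exponents cannot overflow."""
    e = 1.0 / (m - 1.0)
    U = [[0.0] * len(X) for _ in centers]
    for k, x in enumerate(X):
        d = [max(sum((a - b) ** 2 for a, b in zip(v, x)), 1e-12)
             for v in centers]
        dmin = min(d)
        w = [(dmin / dk) ** e for dk in d]
        s = sum(w)
        for i in range(len(centers)):
            U[i][k] = w[i] / s
    return U

def fkcn(X, c, m0=1.5, t_max=50):
    """Fuzzy Kohonen Clustering Network sketch: batch updates whose
    learning rates are FCM memberships raised to m_t, with m_t
    decreasing from m0 towards 1 (hard c-means in the limit)."""
    centers = [list(x) for x in X[:c]]      # illustrative initialisation
    dim = len(X[0])
    U = None
    for t in range(t_max):
        m_t = m0 - t * (m0 - 1.0) / t_max   # m_t decreases towards 1
        U = _memberships(X, centers, max(m_t, 1.0 + 1e-6))
        for i in range(c):
            a = [U[i][k] ** m_t for k in range(len(X))]   # learning rates
            denom = sum(a) or 1e-12
            centers[i] = [sum(a[k] * X[k][j] for k in range(len(X))) / denom
                          for j in range(dim)]
    return centers, U
```

With m0 = 1.5 and a few hundred iterations this reproduces the configuration used in the experiments below (15 inputs, 4 output clusters for the spectra data set).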
3. Results.
We have tested the proposed method on two different data sets: one composed of the
rotational power spectra of a large set of images and the other composed of a subset
of these images. In this way we attempt to demonstrate that FKCN is suitable for
working with the large, noisy and high-dimensional data that are very common in
Electron Microscopy Image Analysis.
In this example, 2458 images were used for analysis. The rotational power spectrum
[15] of each particle was calculated, yielding a 2458×15 data set.
A 7x7 SOM was applied to the data set and the resulting map was manually clustered
in four classes as described in [14]. The results are shown in Figure 2. Group A shows
a predominant 6-fold symmetry with a small but noticeable component on harmonic
3. Group B represents 3-fold symmetry images. Group C is closely related to 2-fold
symmetry particles and Group D showed only a prominent harmonic in 1, which can
be interpreted as a lack of symmetry in this group of particles.
FKCN was applied to the whole set of particle spectra with the following
configuration: 15 input nodes (each node representing a component in the spectrum)
and 4 nodes in the output layer (representing four clusters). Four clusters were used
for comparison with the results already obtained using this data set [14]. The fuzzy
constant m was set to 1.5 and 500 iterations were used. The resulting code vectors
(cluster centers) are shown in Figure 3. As can be seen in the cluster centers, the four
groups visualized by the SOM were also successfully extracted by FKCN.
Quantitative results of coincidence (with respect to the SOM groups) are shown in
Table 1; however, a major difference in the sets extracted by the two algorithms (SOM
and FKCN) should be noticed. When the SOM output was manually clustered, the
code vectors bounding the groups were not considered, in order to avoid an
erroneous classification at the borders. FKCN considered the whole data set, so an
unavoidable difference will be reflected in the results. Cluster 2 obtained by FKCN
seems to be composed of the group of spectra associated with the code vectors that
were eliminated from the SOM for being part of the borders between the four
hypothetical clusters. Furthermore, a large set of noisy spectra as well as the non-
symmetric ones are also included in this cluster.
Fig. 2. 7x7 SOM output manually clustered in four regions. Group A: particles with a prominent 6-fold
component and a small but noticeable 3-fold component. Group B: 3-fold symmetry. Group C: 2-fold
symmetry. Group D: lack of a predominant symmetry. The number of elements assigned to each code
vector is printed in the upper-right corner of each spectrum.
Fig. 3. Four cluster centers obtained from FKCN. The number of elements assigned to each cluster
is printed in the upper-right corner of each spectrum.
In this experiment, 338 images were used for testing the algorithms in the presence of
high-dimensional and very noisy data. These images share a similar rotational
symmetry (6-fold with a minor 3-fold component); see [14] for details. Classification
of this kind of particle is very difficult because of the low signal/noise ratio and the
apparent structural similarity of the population. Two examples of the images forming
the data set are shown in Figure 1.
For comparison purposes, a 7x7 SOM was applied to the data set and the resulting
map was manually clustered in two different classes that apparently exhibited an
opposite handedness [14]. Figure 4 shows the map clustered in two classes that seem
to reflect essentially the same type of macromolecular structure.
FKCN was applied to this data set, using a circular mask of the images (Area of
Interest) as input nodes. In the first experiment we clustered the data into 2 groups;
however, the results (not shown here) showed that the method could not find the subtle
variations in the handedness of the particle. A small but noticeable difference in the
rotational spectra was detected, but it was not enough to draw conclusions from. We
then clustered the data into 3 groups using m=1.5 and 500 iterations. The cluster centers
obtained are shown in Figure 5. Classification accuracy is also shown in Table 2.
Analyzing the results of the clustering algorithm, it is clear that FKCN correctly
clustered Group A of the SOM (Cluster 3); however, Group B needs further analysis. It is
obvious that both clusters 1 and 2 belong to Group B in the SOM, but a main question
arises: why did FKCN need three clusters to "find" this structure of the data? The
answer can be found by analyzing the particles assigned to these clusters. The images
from the three classes were independently realigned to obtain their averages. Figure 6
shows the average image and rotational spectra of the subsets belonging to each cluster.
In the case of cluster 3, a clockwise handedness is clearly observed, as expected from
Group A of the SOM. In the case of clusters 1 and 2, the average images show the same
handedness (counterclockwise), as expected from Group B of the SOM; however, the
differences in rotational symmetry between the two clusters are also clear. Both of them
have a predominant 6-fold symmetry, but Cluster 2, as opposed to Cluster 1, is
influenced by a noticeable 3-fold component. This very subtle difference was not very
clear in the SOM. This small symmetry variation in the data set may have been the cause
of the misclassification when using two clusters: symmetry was influencing the result
more than handedness.
Fig. 4. 7x7 SOM output manually clustered in two regions with different
handedness. The number of particles assigned to each code vector is printed in
the lower-right corner of each image.
Fig. 5. Three cluster centers obtained from FKCN. The number of particles
assigned to each cluster is printed in the lower-right corner of each image.
4. Conclusions.
In this paper, a new fuzzy classification technique has been applied to the study
of biological specimens by Electron Microscopy. This technique uses a special type of
Neural Network named the Fuzzy Kohonen Clustering Network (FKCN), which
successfully combines the well-known Self-Organizing feature maps (SOM) and the
fuzzy c-means clustering technique (FCM). The need for classification tools suitable
for working with large sets of noisy images is evident in the electron microscopy field.
Here we have proposed a new approach which lies somewhere between
two methods already applied in this context: SOM and Fuzzy c-means. The proposed
method combines the idea of fuzzy membership values as learning rates and the
parallelism of FCM with the structure of the update rules of SOM, producing a robust
clustering technique with a self-organizing structure.
It is important to note that FKCN can also be considered an optimization algorithm
like FCM, so the possibility of falling into a local minimum is theoretically present;
however, the experiments carried out in this work showed that FKCN converged
properly in all the analyzed cases. On the contrary, FCM apparently fell into a local
minimum in the presence of this kind of data.
FKCN has been fully tested in this work using two kinds of data sets that are very
common in any Electron Microscopy laboratory: the rotational power spectra and the
images of individual particles of a protein (in this case the G40P helicase was
used). In both cases FKCN was able to discriminate not only evident but also subtle
variations in the data set. The results demonstrate the suitability of this method for
working with these kinds of high-dimensional and noisy data sets. Comparing this
clustering approach with others previously proposed in the field for structure-based
classification, we should emphasize that this method directly performs classification
(assignment of data to clusters), while at the same time offering direct visualization
by inspection of the cluster centers.
A number of future research topics remain open, especially the automatic
determination of the number of clusters. In our opinion, that topic can be addressed by
means of exploratory data analysis capable of faithfully showing the probability
density function of the data set under analysis. We think that this type of neural
computation approach (self-organizing networks) can be successfully employed for
the exploration of data in the Electron Microscopy field.
5. References.
1. Van Heel, M., Frank J.: Use of multivariate statistics in analyzing the images of biological
macromolecules. Ultramicroscopy 6 (1981) 187-194.
2. Frank, J., Van Heel, M.: Correspondence analysis of aligned images of biological
particles. J. Mol. Biol. 161 (1982) 134-137.
3. Van Heel, M.: Multivariate statistical classification of noisy images (randomly oriented
biological macromolecules). Ultramicroscopy 13 (1984) 165-184.
4. Frank, J., Bretaudiere, J.P., Carazo, J.M., Verschoor, A., Wagenknecht, T.: Classification
of images of biomolecular assemblies. A study of ribosomes and ribosomal subunits of
Escherichia coli. J. Microsc. 150 (1988) 99-115.
5. Carazo, J.M, Rivera, F.F., Zapata, E.L., Radermacher, M., Frank, J.: Fuzzy set based
classification of electron microscopy images of biological macromolecules with an
application to ribosomal particles. J. Microsc. 157 (1990) 187-203.
6. Marabini, R., Carazo, J.M.: Pattern Recognition and Classification of Images of
Biological Macromolecules using Artificial Neural Networks. Biophysical Journal 66
(1994) 1804-1814.
7. Kohonen, T.: Self-Organizing Maps, 2nd Edition, Springer-Verlag (1997).
8. Siemon, H.P.: Selection of Optimal Parameters for Kohonen Self-organizing Feature Maps.
Artificial Neural Networks 2 (1992) 1573-1577.
9. Villmann, T., Der, R., Herrmann, M., Martinetz, T.M.: Topology Preservation in Self-
Organizing Feature Maps: Exact Definition and Measurement. IEEE Transactions on
Neural Networks 8 (1997) 256-266.
10. Bezdek, J. C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum
Press, New York. (1984).
11. Chen.Kuo Tsao, E., Bezdek, J. C., Pal, N. R.: Fuzzy Kohonen Clustering Networks.
Pattem Recognition 27 (1994) 757-764.
12. Jin-Shin Chou, Chin-Tu Chen, Wei-Chung Lin: Segmentation of Dual-echo MR Images
using Neural Networks. Image Processing 1998 (1993) 220-227.
13. Diago, L.A., Pascual, A., Ochoa, A.: A Genetic Algorithm for Automatic Determination of
the Cup/Disc Ratio in Eye Fundus Images. Proceedings TIARP'98, Mexico (1998) 461-
472.
14. Barcena, M., San Martin, C., Weise, F., Ayora, S. Alonso, J.C., Carazo, J.M.: Polymorphic
quaternary organization of the Bacillus subtilis bacteriophage SPP1 repilcative helicase
(G4OP).Journal of Molecular Biology (1988) (in press).
15. Crowther, R.A., Amos, L.A.: Harmonic analysis of electron microscope images with
rotational symmetry. J. Mol. BioL 60 (1971) 123-130.
16. Rivera, F.F., Zapata, E.L., Carazo, J.M.: Cluster validity based on the hard tendency of
the fuzzy classification. Pattern Recognition Letters 11 (1990) 7-12.
Bayesian VQ Image Filtering Design with
Fast Adaption Competitive Neural Networks
1. Introduction.
The Self Organizing Map and other Competitive Neural Networks [5] are applied to
Vector Quantization (VQ) [3,4] which is a widely used technique for signal
compression and pattern recognition. VQ is used as an encoding mechanism or to
provide the dimensionality reduction needed for the classifiers. In the field of digital
image processing [1] it has been mainly proposed for lossy image compression,
because of its nice rate-distortion properties. However, the problem of efficiently
computing the codebook and the image quantization is still an issue for research,
where Competitive Neural Networks are among the salient approaches. We apply fast
learning variants [7] of the SOM to obtain good approximations to the optimal
codebooks with affordable amounts of computation. This approach makes sense for large problems such as the one tackled here. The practical application presented here deals with the restoration and noise removal of 3D images produced by micro-magnetic resonance imaging devices. The 3D image is a 128x256x128 matrix of 32 bits/pixel elements. Thus, the computation of the codebooks for either the filtering or compression of these images requires very efficient VQ design techniques.
The Bayesian VQ filtering approach consists of two steps: (1) determine the codevector that encodes the vector given by the pixel and its neighborhood, and (2) substitute the pixel by the central pixel of the codevector. The probabilistic interpretation of this process comes from viewing the encoding process as a Maximum A Posteriori (MAP) classification. Under this interpretation, the VQ filter performs a simplified Bayesian restoration [1,2], where (1) the stochastic model and its estimation are, respectively, the codebook and the search for the optimal codebook; and (2) the restoration process does not involve complex and lengthy relaxation procedures (like simulated annealing in [2]), only the search for the nearest codevector. The problem of estimating the optimal codebook remains as difficult as in conventional VQ. Our approach is related to other techniques that involve the application of compression algorithms to image filtering. Among
them the Occam filters [8,9] are of special interest for future developments. The
definition of Occam filters based on VQ demands very efficient codebook design
algorithms. In this respect our fast learning application of the SOM could be a first
step towards this goal.
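The two-step filtering procedure described above (encode the pixel neighborhood with the nearest codevector, then substitute the central pixel of that codevector) can be sketched in NumPy as follows; this is an illustrative rendering with names of our own choosing, not the authors' implementation:

```python
import numpy as np

def bayesian_vq_filter(image, codebook):
    """Sketch of the Bayesian VQ filter for a 2D image with 3x3 neighborhoods.
    `codebook` has shape (c, 9); each row is a codevector over the neighborhood."""
    h, w = image.shape
    out = image.astype(float).copy()
    pad = np.pad(image.astype(float), 1, mode="edge")
    for i in range(h):
        for j in range(w):
            v = pad[i:i + 3, j:j + 3].ravel()                 # 9-dim neighborhood vector
            k = np.argmin(((codebook - v) ** 2).sum(axis=1))  # nearest codevector (step 1)
            out[i, j] = codebook[k, 4]                        # central pixel of codevector (step 2)
    return out
```

A 3D version would use a 3x3x3 window and 27-component codevectors in the same way.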
In Section 2, we present the Bayesian VQ filter more formally. Section 3 presents
the results on the application to an image, compared with the Compression VQ
approach. Finally, Section 4 gives some concluding remarks.
In this section we will review the approach to signal noise removal based on the so-
called Occam filters. Afterwards we will present our approach relating the VQ-based
filter to the Bayesian framework for image restoration. Let us first recall some basic
definitions about Vector Quantization.
A conventional definition of Vector Quantization [4] is as follows: given a stochastic process {X_t; t > 0} whose state space is (inside) the d-dimensional Euclidean real space R^d, the Vector Quantizer is given by a set of codevectors that form a codebook Y = {y_1, ..., y_c}, where c is the codebook size; the encoding operation E : R^d -> {1, ..., c} that maps each vector in R^d to the nearest codevector in the sense of some defined distance (most usually the Euclidean distance); and the decoding operation E^{-1} : {1, ..., c} -> R^d that reconstructs the encoded signal using the codebook. The Vector Quantization design consists of the estimation of the codebook from a sample X = {x_1, ..., x_n}.
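The encoding and decoding maps just defined can be written as a minimal sketch (indices 0-based here, helper names ours):

```python
import numpy as np

def vq_encode(x, codebook):
    """E: R^d -> {0, ..., c-1}; index of the nearest codevector (Euclidean distance)."""
    return int(np.argmin(((codebook - x) ** 2).sum(axis=1)))

def vq_decode(k, codebook):
    """E^{-1}: {0, ..., c-1} -> R^d; reconstruct from the codebook."""
    return codebook[k]
```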
For image compression, the image is decomposed into non-overlapping blocks and each block is treated as an independent vector: F = {F_{1,1}, ..., F_{n,n}}, where n = N / sqrt(d), N^2 is the dimension of the square image, and each F_{i,j} = {f_{i+k, j+l} : 0 <= k, l <= sqrt(d) - 1} can be considered a d-dimensional vector. The codebook is therefore a set of image blocks y_i = (y^i_{k,l} : 0 <= k, l <= sqrt(d) - 1); the codification process produces a reduction of the data proportional to d, and therefore the distortion of the decoded image (measured either by the mean square error or the signal-to-noise ratio) increases accordingly. The complexity of the search for the optimal codebooks also grows with the size of the blocks considered.
The so-called Occam filters were introduced by Natarajan in [8,9]. The essence of the technique is that when a lossy data compression algorithm is applied to a noisy signal with the allowed loss set equal to the noise strength, the loss and the noise tend to cancel rather than add. A lossy compression algorithm C is a program that takes as input a sequence f_n and a loss tolerance δ > 0 and produces as output a string s representing an encoded sequence g_n such that ||f_n - g_n|| <= δ; it is then said to obey the norm ||.||. The decompression algorithm C^{-1} produces the output sequence g_n on input the string s.
Let f~_n be a sequence corrupted with noise: f~_n = f_n + v_n, where v_n is a sample sequence of a random variable representing the noise. With respect to a metric ||.||, the strength of the noise is denoted ||v|| = lim_{n -> oo} ||v_n||. The Occam filter algorithm is then: compress f~_n with the loss tolerance set to the noise strength ||v|| to obtain s; decompress s to obtain the filtered sequence g_n.
Thus, the general definition of the Occam filters requires the estimation of the rate-distortion response of the compression algorithm. The knee-point of this plot, i.e., the point at which its second difference attains a maximum, is taken as the optimal rate-distortion relationship. The filtered sequence g_n is the one obtained by the compression/decompression with a loss tolerance set to the optimal distortion.
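The knee-point selection can be sketched as a toy helper (ours, not Natarajan's implementation), assuming the distortion values are sampled in order of increasing rate:

```python
import numpy as np

def knee_point(distortions):
    """Index at which the second difference of the distortion values attains
    its maximum -- the 'knee' used to pick the Occam filter's operating tolerance."""
    d2 = np.diff(np.asarray(distortions, dtype=float), n=2)  # second difference
    return int(np.argmax(d2)) + 1                            # offset: diff shortens the array
```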
The applications of Vector Quantization to digital image processing are discussed in [3]. They suggested that the encoding/decoding process introduces some nonlinear smoothing of the image that removes some kinds of noise, especially speckle noise. Therefore, VQ can be considered an instance of Occam filters. However, some specific characteristics of VQ must be taken into account. The application of VQ as a compression algorithm is usually guided by the compression rate; therefore, to obtain a prescribed distortion result, an exhaustive exploration of the codebook dimension and size must be performed. The computational complexity of the VQ design is very high and grows exponentially with the codebook size and dimension. Therefore, the general Occam approach is of little practical application, a situation that worsens for image processing. We apply the well-known Self Organizing Map [5] for the VQ design task, employing some fast learning strategies [7]. Nevertheless, the computational cost precludes the exhaustive computation of the rate-distortion curve.
The Bayesian approach to the restoration of images [1,2] is one of the less restrictive approaches in its assumptions. The observed image is modeled as the degraded image G, which is of the form Φ(H(F)) ⊕ N, where F is the original image, H is the blurring operation, Φ is a possibly nonlinear transformation, N is an independent noise field and ⊕ denotes any suitably invertible operation. The a posteriori conditional density is given by Bayes' rule:

P(F = f | G = g) = P(G = g | F = f) P(F = f) / P(G = g)

It must be noted that in this application of the codevectors, ideas relevant to compression, such as the relationship between distortion, signal-to-noise ratio and the dimensionality of the codevectors, are no longer meaningful. The codevectors become the probabilistic models of the pixel neighborhood. The filtering application of the codebook must be interpreted as realizing the following approximation of the posterior probabilities:

P(F_{i,j} = y^k_{0,0} | G_{i,j} = g) = δ_{k,E(g)},  1 <= i,j <= N and 1 <= k <= c   (3)
To put the VQ filter in the framework of Bayesian image restoration, we recall the probabilistic model embodied by the codebook. In our work we consider that the codebook design performed by the Self Organizing Map intends to minimize the Euclidean distortion, and as such it has been applied in the experiments described below:

D = Σ_{i=1}^{n} || x_i - y_{E(x_i)} ||^2   (4)

If we assume that the class-conditional densities are Gaussian with identical unit covariance matrices, p(x | ω_j) = N(y_j, I), and that the classes are equiprobable, then the minimization of (4) is equivalent to maximum log-likelihood estimation of the parameters of the model, the class means. Based on these parameters, the MAP decision max_j p(ω_j | x) is the Bayesian minimum risk decision. Thus, the filtering realized by (3) corresponds to a MAP classification and restoration process, in which the classes are the gray levels of the central pixel in the representative neighborhoods extracted from the image. We can state the model of the dependencies of each pixel on its neighborhood as:

P(F_{i,j} = f_{0,0} | F̄_{i,j} = f) = Σ_{j=1}^{c} (1/c) (2π)^{-d/2} exp(-(1/2) || f - y_j ||^2)   (5)
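The equivalence just claimed — nearest-codevector encoding as a MAP decision under equiprobable, unit-covariance Gaussian classes — can be checked numerically with a small sketch (helper names are ours):

```python
import numpy as np

def map_class(x, codebook):
    """MAP decision under equiprobable N(y_j, I) classes: maximize the
    log-likelihood, which is -||x - y_j||^2 / 2 up to a constant."""
    log_lik = -0.5 * ((codebook - x) ** 2).sum(axis=1)
    return int(np.argmax(log_lik))

def nearest_codevector(x, codebook):
    """Plain VQ encoding: minimize the Euclidean distance."""
    return int(np.argmin(((codebook - x) ** 2).sum(axis=1)))
```

Both functions return the same index for any input, which is exactly the equivalence used by the filter.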
In this paper we present the visual results of the application of the filtering based on the VQ codebook computed by the SOM [5] over a 3D image. We have applied some fast learning strategies that we have already tested elsewhere [7]. We have tested several sample sizes, but the visual results shown in this paper are obtained with a sample that consists of 20% of the 3D image pixels and their 2D or 3D neighborhoods. We have also tested several numbers of classes: 32, 64 and 128. The compression rate is determined by the number of classes and the neighborhood sizes. More detailed results with other numbers of classes and sample sizes can be found at (http://sizx01.si.ehu.es/resultados.html). As the end interest of these images is medical-biological inspection, the visual evaluation is the prime concern. Therefore, we present the visual results of the application of the Bayesian VQ filter and of the conventional compression/decompression application as in the Occam filter approach.
The computational process is as follows:
1. Image samples are extracted randomly in the 3D image grid, with the specified
neighborhood sizes.
2. A one-pass learning SOM [7] is applied to the sample to obtain the codebook.
3. The whole 3D image is filtered based on (a) the Bayesian VQ approach, and (b)
the Occam filter approach.
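As an illustration of step 2, a one-pass SOM can be sketched as follows; this is a generic single-pass variant with decaying learning rate and neighborhood radius over a 1-D topology, not the specific fast-learning scheme of [7]:

```python
import numpy as np

def one_pass_som(sample, c, seed=0):
    """Single-pass SOM codebook sketch: each sample vector is presented once;
    the winner and its topological neighbors move toward it with decaying
    rate alpha and shrinking Gaussian radius sigma."""
    rng = np.random.default_rng(seed)
    d = sample.shape[1]
    units = np.arange(c)
    codebook = rng.standard_normal((c, d))
    n = len(sample)
    for t, x in enumerate(sample):
        alpha = 0.5 * (1.0 - t / n)                # decaying learning rate
        sigma = max(c / 2.0 * (1.0 - t / n), 0.5)  # shrinking neighborhood radius
        win = np.argmin(((codebook - x) ** 2).sum(axis=1))
        h = np.exp(-((units - win) ** 2) / (2 * sigma ** 2))
        codebook += alpha * h[:, None] * (x - codebook)
    return codebook
```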
The data upon which we have performed the experiments is a sequence of images of micro-magnetic resonance obtained by the research group of the Unit of Magnetic Resonance of the Universidad Complutense. The images have been obtained with an experimental 4.7 Tesla magnet. The sequence corresponds to 128 cuts of a human embryo. Each image is 128x256 pixels of 32 bits/pixel. The reduction to 8 bits/pixel has been done by ad-hoc manipulations of the intensities based on the statistics over the whole sequence. In Figure 1 we show image #80 of the sequence as it appears after the manipulation. One of the main interests of the Magnetic Resonance group is the removal of artifacts in the background that corresponds to the empty space. Also, the noise removal must preserve some very small classes of pixels for eventual segmentation.
Fig. 1. Original frame #80 after manual intensity range reduction to 8 bits/pixel, before processing by the SOM Bayesian VQ filter.
In this section we present some visual results that consist of frame #80 of the sequence after being filtered with the Bayesian and compression approaches. The complete sequence results can be obtained as MPEG movies from (http://sizx01.si.ehu.es/resultados.html). We plan to perform volumetric rendering of the sequence and present the results as MPEG movies, but at present the movies show the sequence of cuts after processing with the Bayesian VQ filter and the compression VQ filter.
Figures 2 and 3 show the Bayesian and Compression VQ filtering results on frame #80 of the sequence, using 3x3x3 spatial neighborhoods, with 32 and 64 codevectors respectively. That gives compression ratios of 216:5 and 216:6, respectively. It can be appreciated that the Bayesian VQ filter provides meaningful results, whereas the results from the Compression VQ filter are clearly affected by the large distortion produced by the high compression rates (very low coding rates).
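The quoted ratios follow from comparing the 8 bits of each voxel in the neighborhood against the log2(c) bits of the codevector index; a small helper (ours) reproduces all the ratios used in this section:

```python
import math

def vq_ratio(neigh, c, bits_per_pixel=8):
    """Compression ratio for VQ with c codevectors over a neighborhood of the
    given shape: (original bits per block, index bits). E.g. 3x3x3 with 32
    codevectors gives 27*8 = 216 original bits against log2(32) = 5 index bits."""
    d = 1
    for s in neigh:
        d *= s
    return d * bits_per_pixel, int(math.log2(c))
```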
Figures 4 and 5 show the Bayesian and Compression VQ filtering results using 5x5x5 spatial neighborhoods, again with 32 and 64 codevectors respectively. The compression ratios are 1000:5 and 1000:6. The results of the Compression VQ are almost unrecognizable, while the Bayesian VQ provides a strong smoothing effect that preserves the boundaries of the image regions. We note that the Bayesian VQ does not involve a proper compression of the image. Nevertheless, the compression ratio is an indication of the reduction in terms of the number of potential models applied to the MAP filtering.
Figures 6 and 7 show the Bayesian and Compression VQ filtering results with 128 codevectors using 3x3x1 and 5x5x1 spatial neighborhoods, respectively. The compression ratios are 72:7 and 200:7, respectively. Again, the degradation introduced by the Compression VQ distorts the image to the point of making it impossible to process, whereas the Bayesian VQ improves over the previous figures and over the Compression VQ for the same setting. The improvement of the Bayesian VQ given by the 72:7 compression ratio over the 200:7 compression ratio is significant.
Fig. 2. Filtering with a 3x3x3 neighborhood and 32 codevectors (a) Bayesian VQ filter
and (b) Compression VQ filter.
Fig. 3. Filtering with a 3x3x3 neighborhood and 64 codevectors (a) Bayesian VQ filter
and (b) Compression VQ filter
Fig. 4. Filtering with a 5x5x5 neighborhood and 32 codevectors (a) Bayesian VQ filter
and (b) Compression VQ filter
Fig. 5. Filtering with a 5x5x5 neighborhood and 64 codevectors (a) Bayesian VQ filter and (b) Compression VQ filter
Fig. 6. Filtering with a 3x3xl neighborhood and 128 codevectors (a) Bayesian VQ filter
and (b) Compression VQ filter
Fig. 7. Filtering with a 5x5xl neighborhood and 128 codevectors (a) Bayesian VQ filter
and (b) Compression VQ filter
4. Concluding Remarks.
Acknowledgments.
Work partially supported by the Departments of Education and Industry of the
Gobierno Vasco under project UE96/9. Ana Isabel Gonzalez has a predoctoral grant
from the Departamento de Educacion, and Imanol Echave has a predoctoral grant from
the Departamento de Industria. The work is also partially supported by the CICYT
project TAP-98-0294-C02-02.
Abstract
We present a unique method for estimating the upper frequency band coefficients
solely from the low frequency information in a subband multiresolution
decomposition. First, a Bayesian classifier predicts the significance or
insignificance of the high frequency coefficients. A neural network then estimates
the sign and magnitude of the visually significant information. This prediction
model allows us to construct an image coder which can exclude transmission of the
upper subbands and reconstruct this information at the decoder. We demonstrate
results for a two level subband decomposition.
INTRODUCTION
Current subband coding techniques, however, make little or no use of the apparent relationships between coefficients in neighboring subbands. This paper investigates the
use of neural networks to predict the coefficients between adjacent frequency bands, given
the challenges that 1) the coefficients are produced by orthogonal basis functions, and 2)
the adjacent frequency bands have been downsampled, resulting in a loss of half of the
original information.
It is well known that neural networks excel at recognizing patterns and that missing data can be reconstructed from local neighbors with reasonable fidelity. Specifically, our
goal is to learn about the behavior of image structure (e.g., edges and texture) across
frequency bands by exploiting the inherent abilities of neural networks. First, we
empirically model the intraband relationships characteristic of natural images in the
wavelet transform domain. Using these models, a Bayesian neural network, constrained to
minimize total entropy, classifies the high frequency signal as significant or insignificant.
A second neural network with analog outputs then learns the nonlinear mappings of the
patterns existing between the low and high frequency subbands. The algorithm estimates
upper frequency coefficients solely from lower ones within the same band and then
performs this estimation recursively. This prediction ability allows us to exclude
transmission of the upper subbands and reconstruct the information at the decoder.
PREDICTION MODEL
Figure 2. Prediction Within One Level. First, information in the LL subband is used to predict coefficients in both the HL and LH subbands, which are in turn used jointly to predict the HH subband coefficients. Note each predicted coefficient corresponds physically with the center of its input neighborhood.
The prediction process consists of two steps as illustrated in Figure 3. Using a Bayesian
classifier, we first classify whether a particular upper frequency is significant or
[Figure 3 diagram: subband input -> separation of the classes of upper coefficients -> neural network estimates the value of significant coefficients -> encode errors.]
Figure 3. Prediction Summary. Within the prediction of each subband, the algorithm consists of two steps: 1) separation of the significant and insignificant coefficients, and 2) prediction of the value of the significant coefficients. The process is carried out recursively over each level of the decomposition.
Bayesian Classifier
where p(m, z) is the joint probability of a particular slope value generating an insignificant coefficient, and p(m, nz) is the joint probability of a particular slope value generating a significant coefficient.
In addition to the probabilities described above, we also calculate the cost of misclassification, defined as the number of bits necessary for transmitting corrections when the significance classification process is not correct. We define λ(z|nz) as the cost of classifying a coefficient as insignificant when the true state is significant, and λ(nz|z) as the
cost of classifying a coefficient as significant when the true state is insignificant. The minimum number of transmission bits for corrections is defined as L_min. The value of λ is determined experimentally for each image using a variety of common coding techniques such as a bit string code or a run length code. The components of L_min then define the ratio
τ = λ(z|nz) / λ(nz|z).
Now, according to Bayes decision theory, we select class 'z' (the coefficient is insignificant) if
p(m|z) P(z) / ( p(m|nz) P(nz) ) > τ ;
otherwise, we select class 'nz' (the coefficient is significant). The ratio on the left-hand side is known as the likelihood ratio, and the two joint probabilities in this rule are shown in the graph of Figure 4.
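This decision rule can be sketched as follows (illustrative, with the densities already evaluated at the observed slope and passed in as plain numbers):

```python
def classify_significance(p_m_z, p_m_nz, prior_z, prior_nz, tau):
    """Bayes decision sketch for the significance classes: declare the
    coefficient insignificant ('z') when the prior-weighted likelihood ratio
    exceeds the cost ratio tau = lambda(z|nz) / lambda(nz|z)."""
    return "z" if p_m_z * prior_z > tau * p_m_nz * prior_nz else "nz"
```

Raising tau (i.e., making a missed significant coefficient costlier to correct) biases the rule toward declaring coefficients significant.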
[Figure 4 plot: the joint probabilities p(m|z) and p(m|nz) versus the slope m; the threshold M is the slope value at which p(m|z) P(z) = τ p(m|nz) P(nz).]
Figure 4. Separating Significance Classes. Using a measure of slope across the input field, m, the requisite joint probabilities are computed, as displayed in the graph above. Next, we experimentally calculate the point on the graph, defined as T, which minimizes the cost of transmitting corrections for any misclassifications. The slope (x-axis) value corresponding to the y-axis value of T is defined as M. This value of M is then used as a threshold to define the two classes of coefficients: significant and insignificant.
From such a plot, we determine the slope, M, where the likelihood ratio equals the value τ, and select this slope, M, as our threshold for classification. Classification is based on
CODER IMPLEMENTATION
Figure 5. Proposed Image Coder Model. In the proposed system, a prediction model, which can recursively estimate the upper subband coefficients from the LL subband coefficients, is duplicated at both the encoder and decoder. Thus, the smallest LL subband in the decomposition is the only subband information necessary for lossy image compression. Image fidelity improves with transmission of prediction errors, yielding lossless compression upon transmission of all errors.
Both the encoder and decoder are augmented by replicate learning models which are used to predict the upper subband coefficient data within a wavelet decomposition. Only the LL subband coefficients of the smallest level within the wavelet decomposition must be transmitted verbatim. All of the remaining upper subbands are predicted by the recursive coefficient prediction algorithm described below. For the latter, the errors (which are defined by the learning model at the encoder) may be transmitted for use at the decoder. All of the errors must be transmitted for lossless compression, while only some of the errors must be transmitted for lossy compression.
The data rate reduction is based on the variance and shape of the error distribution. The savings can be generalized, for the general case, to the approximation
H_e = H_o - log(a), where a = σ_o / σ_e.
The values H_e and H_o are defined as the entropy of the error and original distributions, respectively. Likewise, the values σ_e and σ_o are the standard deviations of the error and original distributions, respectively.
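A sketch of this savings estimate, assuming base-2 logarithms since the entropies are measured in bits (the text writes log(a) without a base):

```python
import math

def entropy_savings(h_original, sigma_original, sigma_error):
    """Approximate entropy of the error distribution, H_e = H_o - log2(a)
    with a = sigma_o / sigma_e. A narrower error distribution (larger a)
    yields a lower entropy and hence a lower data rate."""
    a = sigma_original / sigma_error
    return h_original - math.log2(a)
```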
Both the Bayesian Classifier and the neural network can produce errors. The Bayesian Classifier can incorrectly classify the significance of a coefficient, which leads to a binary error string. Errors in the sign of the predicted coefficient at the output of the neural network also lead to a binary error string. The binary errors are currently encoded by a positional code, a bit string code, or a run length code, with the method selected experimentally for each image to provide a minimum transmission of data.
the slope calculated from the input field to the Bayes Classifier. The above classification
scheme is outlined in Figure 4.
The necessary probabilities have to be estimated for each subband individually, and
thus, a different threshold value is used for each subband. In addition, unique threshold
values are currently determined for each image, however, a series of training images could
be used to generate a global threshold for each subband.
Neural Network
Once a coefficient has been classified as significant or insignificant, the values of the significant coefficients are estimated with a three-layer, feedforward neural network, with a different network specified for each subband. The data input to the neural network is normalized between values of -1.0 and 1.0 and is exactly the same as the data input to the Bayesian Classifier in each case. The number of neurons in the middle layer varies for each subband and level, and is selected experimentally.
The neural network architecture contains two output neurons: one for positive coefficient values, node A (the other output node is held to zero during this part of the training), and one for negative coefficient values, node B (the positive output node is held to zero during this part of training). These output nodes are only allowed to vary between 0.0 and 0.9. During operation, the maximum of the two output nodes is accepted as the valid estimation of magnitude for the predicted coefficient: magnitude = max(y_A, y_B), where y_A is the output of node A and y_B is the output of node B. The larger of the two output nodes also denotes the sign of the predicted coefficient.
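The decoding of the two output nodes can be sketched as (helper name ours):

```python
def decode_output(y_a, y_b):
    """Decode the two-output-node scheme: the larger activation gives the
    magnitude of the predicted coefficient, and which node won gives the
    sign (node A positive, node B negative)."""
    if y_a >= y_b:
        return +y_a
    return -y_b
```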
Each neural network is trained over the same set of several images with the standard
back propagation training algorithm. Once training is complete, the weight values are
stored and operation of the algorithm then remains fixed over any test image.
Recursive Operation
The procedure described above to predict the coefficients in one level of a subband
decomposition is applied recursively to predict multiple levels of upper subband
coefficients. In multilevel subband decompositions, the algorithm is performed as follows.
1. Define n as the number of levels in the given subband decomposition.
2. Estimate the coefficients of the three upper subbands of level n as outlined in the
algorithm above and illustrated in Figures 2 and 3.
3. Reconstruct the low frequency subband of level n-1 with the synthesis filters
designed in the subband analysis. (not part of the prediction algorithm)
4. Replace n with n-1 and go back to step 2. Perform this recursion until n equals 0, i.e., the original image size has been reconstructed.
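The recursion in steps 1-4 can be sketched as follows, with `predict_level` and `synthesize` as placeholders (ours) for the paper's one-level prediction step and synthesis filter bank:

```python
def reconstruct(ll_subband, n_levels, predict_level, synthesize):
    """Recursive reconstruction loop from the steps above. `predict_level`
    estimates the three upper subbands (HL, LH, HH) of the current level;
    `synthesize` rebuilds the LL subband of the next finer level."""
    ll = ll_subband
    for n in range(n_levels, 0, -1):      # steps 2-4, repeated until n == 0
        hl, lh, hh = predict_level(ll)    # step 2: estimate the upper subbands
        ll = synthesize(ll, hl, lh, hh)   # step 3: synthesis filters of level n-1
    return ll                             # original-size image when n reaches 0
```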
Progressive Transmission
To progressively improve image reconstruction at the decoder, we currently use the
following transmission techniques in succession.
1. Transmit only the sign flip and significance map errors. This method can be used effectively for Very Low Bit Rate applications.
2. Transmit all magnitude error terms that are greater than 10% of the true
coefficient value for the significant coefficients, resulting in lossy compression.
3. Transmit all remaining magnitude error terms for the significant coefficients.
This again results in a lossy reconstruction of the image.
4. Transmit all the insignificant coefficient values. This results in a lossless
reconstruction of the image.
RESULTS
The following results were obtained for a two level decomposition over a test set of twelve images displayed in Table 1. A separate set of twelve images was used to train the
neural network. Additionally, a third set of twelve images became a validation set for the
neural network. Neither the training set nor validation set are displayed in this paper, but
they contained a varied collection of images.
Network Performance
Lossy Compression
In Figure 6 we show three examples of image reconstruction for Very Low Bit Rate applications. That is, only the significance map errors and sign flip errors were transmitted and employed at the decoder. For these three images, coefficient prediction was employed for two levels of the wavelet decomposition. The starting point, or LL subband of level two, is shown in the top of Figure 6. By employing the coefficient prediction algorithm we were able to reconstruct the original size image at the rates given in Figure 6 with the visual quality exhibited therein.
Table 1. Lossy Compression Results. A comparison of the transmission rates for all
significant coefficients using prediction and non-prediction methods.
Figure 6. Reconstruction with Prediction. Reconstruction of the original image from the 1/8 size LL subband is illustrated for the above three images.
Figure 7 also illustrates how well the coefficient prediction algorithm performs at
very low bit rates. We have enlarged (four times) the swimmer's hat from the swimmer
image in Figure 6 to more closely observe the details. The image on the far right is shown
for reference and displays reconstruction after transmission of all the wavelet coefficients.
In this case, the rate is computed as simply the entropy of all the coefficient data. The
image in the middle of Figure 7 exhibits reconstruction with the coefficient prediction
algorithm when only sign flip errors and significance map errors have been transmitted,
which results in a very minimal cost of 350 bytes.
The image on the left is reconstructed after transmitting the significant coefficients in
descending order along with their corresponding geographical locations (no significance
map is transmitted) until a rate equivalent to the Coefficient Prediction rate (350 bytes) is
achieved. The rate is calculated by computing entropy.
Now image quality, measured as peak signal-to-noise ratio (PSNR), is compared for the same bit rate. The Coefficient Prediction technique achieves a 26.2 PSNR, a 13.4% increase over the 23.1 PSNR of the standard image coding technique. This, coupled with the visual similarity to the reference image noted in the flag and letters, illustrates the accuracy of the prediction of the significant coefficient magnitudes.
[Figure 7 rates and quality: left, standard coding at 0.061 bpp, 23.1 PSNR; middle, coefficient prediction at 0.061 bpp (350 bytes), 26.2 PSNR; right, reference reconstruction at 6.04 bpp (31K bytes).]
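The PSNR figures quoted above use the standard definition; a sketch, assuming an 8-bit peak value of 255:

```python
import numpy as np

def psnr(reference, reconstruction, peak=255.0):
    """Peak signal-to-noise ratio in dB: 10 log10(peak^2 / MSE)."""
    mse = np.mean((reference.astype(float) - reconstruction.astype(float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

# Relative improvement quoted in the text: (26.2 - 23.1) / 23.1 is about 13.4%.
```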
Figures 6 and 7 demonstrate the accuracy of the prediction algorithm when used in
an image coding application. In both figures, the reconstructed images rely purely on the
prediction precision of the neural network for the significant coefficients' magnitude
values, as no magnitude error terms are transmitted. This precision is illustrated by the
fidelity and visual quality of the reconstructed images in both figures. Furthermore, the
low bit rates demonstrate the accuracy of the significance classification by the Bayesian
Classifier and the accuracy of the sign value prediction for the significant coefficients by
the neural network.
DISCUSSION
Combining the compression power of the wavelet transform with the pattern
recognition power of the neural network allows enhanced visual perception, especially at
very low bit rates. For Very Low Bit Rate applications we can perform lossy image
reconstruction with near zero transmission because only the initial low frequency subband
is transmitted along with sign errors of the upper coefficient predicted values. Magnitude
error terms are sent for higher fidelity lossy image reconstruction and for the lossless case.
Additionally, the off-line training of the prediction weights, which can vary with
initialization techniques and network architecture, facilitates data encryption in secure
transmission environments.
This work presents predictive models for multiresolution image representations,
such as wavelets. These models are adaptive to different classes of imagery and their
application facilitates image reconstruction, image compression, and image enhancement.
1 Introduction
There has been considerable progress in the area of content based image retrieval
during the last two decades. However, capturing perceptual similarity of images
is a relatively under-explored area of research [San96]. Trademark image retrieval
provides a good avenue of investigation in this regard since an effective trademark
retrieval system should necessarily be able to retrieve images which humans
perceive as similar.
Trademarks play an important role in providing unique identity for products
and services in the marketing environment and trademark classification systems
should be able to ensure that the existing trademarks are distinct to avoid confusion. Traditionally, classification of trademarks is based on limited vocabulary
descriptions. Most of the patent offices use manually assigned codes to represent
these descriptions such as human beings, animals or geometrical figures. But it
has been shown that these methods suffer from a number of problems. The assignment of classes to trademarks is subjective, the classes become either too specific or
too broad depending on the use of classes, there is no mechanism to handle the
generation of new classes, and there is a large fraction of images with little or no
representational meaning making such a classification extremely difficult. This
motivates the need to investigate the potential of content based image retrieval
techniques to solve this problem.
2 Feature Extraction
During feature extraction (as summarised in figure 1), we first perform edge
extraction using the original image (raw image) and segment edge pixels into
constant curvature edge segments using the method proposed by Wuescher and
Boyer [Wus91]. These two steps give straight line segments and curve segments
having the following properties: end points, orientation, curvature and pixel
points on the segment. Using this information we extract different perceptual
features (end-point proximity, parallelism, co-linearism and co-curvilinearism)
using both Lowe's methods [Low87] and Sarkar and Boyer's [Sar94] methods
for perceptual feature extraction [Suj98]. Figure 2 shows some of these features
extracted using an example image. We group images based on co-linearism and
co-curvilinearism and obtain a new image (gestalt image) and this step gives
some new segments derived from segments from the raw image. We also store
the relationships between antecedents and new segments. The gestalt image is
then subjected to the earlier process of extraction of end-point proximity and
parallelism. We also obtain closed figures by grouping the segments of the image
based on end-point proximity and continuity. This method extracts alternative
Fig. 2. Figure 2(b) shows the co-linear and co-curvilinear segments while figure
2(c) shows parallel segments extracted using the image in figure 2(a).
Fig. 3. Some of the closed figures extracted using the image in figure 3(a). Using pixel
based linking methods it is possible to extract half circle shapes but not the perceived
circle or triangle.
The AURA system is aimed at fast combinatorial searching and high-performance
knowledge base system design. The main building block of AURA is a
simple one-layer neural network called a correlation matrix memory (CMM) that
utilises binary weights. The information processing methods of AURA exploit the
threshold logic and distributed processing capabilities of binary neural networks.
The CMMs used in AURA have binary weights and binary input and output
vectors which make the training process a one shot learning process of binary
associations.
The training process of a CMM (M) using an input vector (I) and an output
vector (O) can be summarised as follows: first obtain the vector product of
I and O, which gives a matrix M'; then perform a logical OR operation
between M' and M, which superimposes the patterns in M' onto M.
During the recall phase, an input pattern (I) is presented to the CMM to
recall the stored output pattern(s) (O). This produces a vector of summed values
at the output of the CMM. The internal operation behind the generation of
this output pattern (R) can be expressed as R = M I^T. This is followed by a
thresholding process which generates the subsequent output vector(s), called
separator(s) in the context of AURA. During thresholding, each bit in the output
pattern (R) is set to zero if it is below a previously determined threshold, and
otherwise set to one. Several strategies for determining the threshold are available,
including L-max thresholding and Willshaw thresholding [Jim95]. AURA
includes the pre-processing and post-processing modules required to support symbolic pattern processing using CMMs.
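As a rough illustration (a minimal numpy sketch, not the AURA implementation itself), the one-shot training and thresholded recall of a binary CMM can be expressed as follows; a Willshaw-style threshold equal to the number of set bits in the input is assumed:

```python
import numpy as np

def train(M, I, O):
    """One-shot training: OR the binary outer product of O and I into M."""
    return M | np.outer(O, I)

def recall(M, I, threshold):
    """Present input I, sum along rows of M, then apply a fixed threshold."""
    R = M @ I                                 # vector of summed values
    return (R >= threshold).astype(np.uint8)  # separator bits

# Tiny example: a 4-bit input associated with a 3-bit output pattern
M = np.zeros((3, 4), dtype=np.uint8)
I = np.array([1, 0, 1, 0], dtype=np.uint8)    # input with k = 2 bits set
O = np.array([0, 1, 1], dtype=np.uint8)
M = train(M, I, O)
assert (recall(M, I, threshold=2) == O).all()  # stored pattern is recovered
```

Because training is a logical OR, storing further associations never disturbs the bits already set, which is what makes one-shot learning possible.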
3.1 Pre-processing
During pre-processing, all the inputs (in symbolic form) are converted into
binary vectors with exactly k bits set (where k is a constant for a given CMM)
by the pre-processor. The features offered in the architecture allow the simultaneous
presentation of more than one input pattern. The input patterns can be
superimposed to present them as a single pattern vector as follows:
X1 = 000001000000; X2 = 000100000000; X3 = 100000000000
super-imposed input pattern: 100101000000
This is an important feature of this architecture as it allows parallelising the
search by presenting multiple data items at once. This has been an important
step towards removing the information about the order of the inputs, as well as
preserving memory space and making the size of the input query to the network
independent of the number of variables that the input contains.
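The superimposition of input patterns is a plain bitwise OR; the example above can be reproduced directly (a minimal sketch):

```python
# Superimpose several k-bit binary input patterns into a single pattern
# vector by bitwise OR, as in the pre-processing example above.
def superimpose(patterns):
    result = 0
    for p in patterns:
        result |= int(p, 2)           # interpret each pattern as a binary string
    return format(result, f"0{len(patterns[0])}b")

X1 = "000001000000"
X2 = "000100000000"
X3 = "100000000000"
assert superimpose([X1, X2, X3]) == "100101000000"
```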
3.2 Post-processing
The output vector may contain more than one trained pattern. To separate
these patterns a method called MBI (Middle Bit Indexing) is used [Fil94]. In
this method, the relationship between a separator and the symbolic output is
stored in the MBI database using the middle bit of the separator as the index to
the database. This reduces the search process to dealing only with the bits in the
output data that could belong to the middle bit of the code.
4 Connectionist Architecture
In designing the search engine for the image retrieval system, four factors have to
be considered. First, the strategy for mapping feature information into symbolic
data has to be determined. Second, the message passing strategy between CMM
nodes has to be determined. Third, the training strategy to store the above-mentioned
symbolic data has to be determined. Fourth, a method must be established by
which all the evidence can be combined to evaluate the similarity between a query
image and the images in the database.
[Figures summarising the image representation (local perceptual features, relationships between segments, features of closed figures such as circularity and aspect-ratio, and relationships between segments and closed figures) and the symbolic associations (feature vector - tag and tag - tag pairs) generated at type 1 and type 2 nodes, where type 2 nodes hold end-point proximity and parallelism relationships.]
The training phase is aimed at storing all feature information inside CMMs in order
to utilise them effectively and efficiently during retrieval. During the training
phase, symbolic associations, as summarised in figure 5, are generated to represent
feature patterns in images. Memories at nodes of type 2 are trained with
different perceptual relationships (end-point proximity, parallelism, co-linearity
or co-curvilinearity, antecedents - new segments, and segments - closed figures).
For example, the memory for end-point proximity is trained with associations of the
form tag - tag, which represent two segments connected by an end-point proximity
relationship. According to the feature arrangement in figure 7(a), the
following symbolic associations are generated for the node I1 S1, which represents
a type 2 node (constant curvature segment).
end point proximity relationships: I1 S1 - I1 S2, I1 S1 - I1 S3
parallelism relationships: I1 S1 - I1 S4
co-linear relationships: I1 S1 - I1 S5
antecedents - new segment relationships: I1 S1 - GI1 S1
segments - closed figures relationships: I1 S1 - I1 C2
During training, each symbolic tag is assigned a unique binary token and, as
a result, an association between a pair of tags becomes an association between a
pair of binary vectors.
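The assignment of tags to unique binary tokens can be sketched as follows (an illustrative allocation scheme only; the class name TagTokenizer and the fixed-weight code allocation are our own assumptions, not part of AURA):

```python
import itertools
import numpy as np

class TagTokenizer:
    """Assign each symbolic tag a unique binary token with exactly k bits set
    (a hypothetical scheme; the token allocation used with AURA may differ)."""
    def __init__(self, n_bits, k):
        self._codes = itertools.combinations(range(n_bits), k)
        self.n_bits = n_bits
        self.table = {}

    def __getitem__(self, tag):
        if tag not in self.table:
            vec = np.zeros(self.n_bits, dtype=np.uint8)
            vec[list(next(self._codes))] = 1   # next unused k-bit code
            self.table[tag] = vec
        return self.table[tag]

# An association between a pair of tags then becomes an association
# between a pair of binary vectors:
tok = TagTokenizer(n_bits=12, k=2)
pair = (tok["I1 S1"], tok["I1 S2"])
```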
Memories at nodes of type 1 are trained with relationships between feature
vectors and tags, as well as relationships between tags which represent closed
figures and constant curvature segments. Each feature vector consists of eight
feature elements as explained in section 2 (circularity, directionality, etc.), which
are quantised to obtain eight different symbolic patterns. First, binary patterns
are generated to represent these symbolic patterns and then superimposed to
generate a single pattern. As a result, associations between feature vectors and
tags for closed figures become associations between binary vectors.
According to the feature arrangement in figure 7(a), the following symbolic
associations are generated to represent the inter-relationships for the closed figure
at I1 C1.
closed figure - closed figure relationships: I1 C1 - I1 C2
closed figure - segment relationships: I1 C1 - I1 S2, I1 C1 - I1 S3
The training phase is completed when all the feature information in the
database is stored within the relevant CMMs. The addition of a new image into the
database can be performed at any time, since the addition of new associations does
not affect the existing ones.
Fig. 7. (a) Typical feature arrangement of an image (b) Information fusion and pruning
process.
First, the message passing structure for the particular query image has to be
determined. The considerations are the number of nodes allocated for each type
of node and the nature of the connectivity between the allocated nodes. There has
to be a type 1 node for each closed figure and a type 2 node for each constant
curvature segment in the query image. As a result, the allocated number of type 1
nodes is equal to the number of closed figures, and the allocated number of type 2
nodes is equal to the number of constant curvature segments in the query image.
The connectivity pattern is determined according to the perceptual relationships
within the query image, as described in section 4.2.
Second, each node obtains a set of initial candidates for similarity, based
on its internal (context free) information (initialisation stage). The nature
of the information used for this task depends on the type of the node. For type 2
nodes, a feature vector of three elements is used, consisting of length, orientation
(for lines) and curvature (for curves). For type 1 nodes, a feature vector
of eight elements (circularity, directionality, straightness, complexity, right-angleness,
aspect-ratio, sharpness and stuffedness) is used, and we also allow
partial matching during this process. This can easily be implemented under the
AURA framework by first obtaining the superimposed binary vector for the feature
vector at node n and using it as the input vector (In) to obtain the output vector
(Cn) with the CMM (Mn) which has been used to store the relevant information
(i.e. feature vector - tag relationships for the whole database), as shown below.

Cn = Mn In^T    (1)
During this process, each node i obtains its candidate vector Ci, and the
third step is aimed at exchanging feature information between nodes to achieve
contextual consistency.
During the third step, each node obtains evidence from other nodes to support
the existing candidates. This information can be used to prune implausible
candidates. During the process, each node plays its role by providing its possible
support for the candidates at the other nodes connected to it by information
channels. In doing so, the nodes use the different CMMs stored within them. These
CMMs contain feature information between nodes for the entire database (end-point
proximity, parallelism, etc.). For example, to obtain supporting candidates
for the nodes connected by relationship x to node n, it presents the candidate
vector for n (Cn) to the CMM (Mn^x) as follows.

On = Mn^x Cn^T    (2)

where Mn^x is the CMM which contains feature information on relationships
x at node n. Then On becomes one of the support vectors (denoted Cm^i) for
node m. Node m accumulates all the support vectors (i.e. C'm = Sum_i Cm^i) received
through the information channels (this can be viewed as an information fusion
process), and the existing vector in the accumulator is subjected to thresholding,
which prunes the entries for candidates below a threshold t (currently we use
t = 1). To guarantee convergence we perform a simple "and" operation between
C'm and Cm to obtain the new candidate vector Cm at node m. This process is
illustrated in figure 7(b).
Due to the smaller feature vector used during initialisation at nodes of type
2, we continue this information fusion and pruning process at all nodes of type 2
as an iterative process, which helps minimise ambiguities. The iterative nature of the
process ensures the propagation of constraints throughout the connectionist structure.
This process is halted when all type 2 nodes reach stability (i.e. no more
pruning). The information fusion process at nodes of type 1 is a non-iterative process
which is started after the type 2 nodes reach stability.
The theoretical foundation for the information fusion and pruning process is
obtained from the relaxation by elimination (RBE) framework [Jim97].
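The accumulate-threshold-AND step described above can be sketched as follows (a simplified numpy illustration of the fusion and pruning step, with hypothetical candidate and support vectors):

```python
import numpy as np

def fuse_and_prune(C_m, supports, t=1):
    """Accumulate the support vectors arriving at node m, threshold at t,
    then AND with the existing candidate vector to guarantee convergence
    (a sketch of the relaxation-by-elimination step)."""
    acc = np.sum(supports, axis=0)           # C'_m = sum of the Cm^i
    pruned = (acc >= t).astype(np.uint8)     # prune candidates below t
    return pruned & C_m                      # new candidate vector at node m

# Hypothetical 4-candidate example: support from two connected nodes
C_m = np.array([1, 1, 0, 1], dtype=np.uint8)
supports = [np.array([1, 0, 0, 1], dtype=np.uint8),
            np.array([0, 0, 1, 1], dtype=np.uint8)]
new_C = fuse_and_prune(C_m, supports, t=1)   # -> [1, 0, 0, 1]
```

The final AND ensures the candidate set at a node can only shrink, which is what guarantees that the iterative process converges.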
5 Experiments
Fig. 8. The ten query images, (a)-(j), used in the experiments.
recall (x) = number of objects found and relevant to x / the total number of
relevant objects
precision (x) = number of objects found and relevant to x / the total number
of objects found
According to these criteria, we can obtain pairs of recall-precision values, which
indicate the fraction of relevant items retrieved and the fraction of retrieved items
that are relevant respectively, as we traverse from the top to the bottom of the
list.
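Traversing a ranked retrieval list and emitting recall-precision pairs, as described above, can be sketched as:

```python
def recall_precision_pairs(ranked, relevant):
    """Traverse a ranked retrieval list from top to bottom, emitting a
    (recall, precision) pair at each position."""
    pairs, found = [], 0
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            found += 1
        pairs.append((found / len(relevant), found / i))
    return pairs

# Hypothetical example: two relevant items among three retrieved
pairs = recall_precision_pairs(["a", "x", "b"], relevant={"a", "b"})
assert pairs[0] == (0.5, 1.0)     # first item retrieved is relevant
assert pairs[-1] == (1.0, 2 / 3)  # both relevant items found by rank 3
```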
During the experiment, we compared two different methods of combining
evidence from different feature interpretations within our image retrieval framework.
We compared the communication strategy described in this paper against
the external combination strategy we proposed earlier in [Suj99]. In obtaining the
results published in this section, we simulated the behaviour of CMMs using linked
lists. In our work on external evidence combination, we considered the features
of closed figures and constant curvature segments of raw and gestalt images
in four separate modules. In contrast to the model in this paper, there was no
inter-communication between these modules. We allowed the modules to generate
different sets of results, and we observed that the combination of these results using
the Dempster-Shafer mechanism gave the best performance.
Figure 9 shows the average recall-precision distribution of retrieval performance
over the ten queries under our external combination method using the Dempster-Shafer
mechanism [Suj99] and under the new model for image retrieval presented in
this paper. The results show that the new model, which utilises inter-communication
between different interpretations, performs better than the external combination
of modules. These results give evidence to argue that, within our image retrieval
framework, better performance can be achieved by facilitating granular-level
communication in order to achieve global consensus, rather than attempting it
at the modular level.
6 Conclusions
Fig. 9. Average recall-precision distribution of the system over the 10 queries.
Acknowledgements
We would like to thank Dr. John Eakins at the University of Northumbria at
Newcastle and Dr. Philip Quinlan at the University of York for the substantial
benefits obtained from our discussions with them. Financial support for this
project comes from the Association of Commonwealth Universities.
References
[San96] Santini, S. and Jain, R.: Similarity matching. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 1996.
[Mic97] Turner, M. and Austin, J.: Matching performance of binary correlation matrix
memories. Neural Networks, 1997.
[Jim95] Austin, J., Kennedy, J. and Lees, K.: The Advanced Uncertain Reasoning
Architecture. In Proc. Artificial Neural Networks and Expert Systems Conference, 1995.
[Bid87] Biederman, I.: Recognition by components: a theory of human image
understanding. Psychological Review, vol. 94, no. 2, pages 115-147, 1987.
[Wus91] Wuescher, D. and Boyer, K.L.: Robust contour decomposition using a constant
curvature criterion. IEEE PAMI, vol. 13, pages 41-51, 1991.
[Low87] Lowe, D.G.: Three-dimensional object recognition from single two-dimensional
images. Artificial Intelligence, vol. 31, pages 355-395, 1987.
[Sar94] Sarkar, S. and Boyer, K.: Computing perceptual organization in computer vision.
World Scientific Publishers, 1994.
[Suj98] Alwis, S. and Austin, J.: A novel architecture for trademark image retrieval
systems. In Proceedings of The Challenge of Image Retrieval, 1998.
[Fil94] Filer, R.: Symbolic reasoning in an associative neural network. MSc Thesis,
University of York, UK, 1994.
[Jim97] Turner, M. and Austin, J.: A neural relaxation technique for chemical graph
matching. In Proceedings of the Fifth International Conference on Artificial
Neural Networks, Cambridge, UK, ed. M. Niranjan, IEE Publishers, July 1997.
[Van77] Van Rijsbergen, C.J.: Information retrieval. Butterworths, London, 1979.
[Suj99] Alwis, S. and Austin, J.: Trademark image retrieval using multiple features. To
be published in Proceedings of The Challenge of Image Retrieval, February 1999.
Improved Automatic Classification of Biological
Particles from Electron-Microscopy Images
Using Genetic Neural Nets
1 Introduction
structure remains unknown. In the last few years, electron microscopy has pro-
vided low resolution structural information for some of these hexameric proteins
[1, 2, 3, 4]. Based on these studies, a general picture of the structure of the
hexameric helicases has emerged, featuring the homohexamer as a ring-like self-
assembly of protomers arranged around a central channel. Recently, Bárcena
et al. [3] have addressed the structural characterization of the gene 40 product
(G40P) of SPP1, a Bacillus subtilis double-stranded DNA bacteriophage, by
means of electron microscopy of negatively stained samples of the protein, image
processing techniques and unsupervised classification methods (Self-Organizing
Maps [5, 6]). They proposed a new approach for the analysis of rotational symmetry
heterogeneities, which allowed the detection of different quaternary organizations
of the protein. Normally, images produced by electron microscopy
present a low signal-to-noise ratio, which hides the subtle heterogeneities in the
population. In such cases classical classification techniques may fail. Rotational
symmetry analysis [7] has proved to be useful for the detection of possible
differences in the symmetry of the population, as shown in the structural study
of G40P. Here we propose to study the structure of this hexameric helicase by
a different approach: instead of using non-supervised techniques, a supervised
classification scheme is proposed.
In general, we are interested in two kinds of final image processing approaches.
The first is an increase of the signal-to-noise ratio of the images
by a process of averaging. The second is to use the views of a specimen
coming from different directions as input to a 3D reconstruction algorithm whose
mathematical principles are very similar to the ones behind the familiar "Computerized
Axial Tomography" in medicine. Our biological goal is to reach
high resolution (subnanometer resolution), and to this end it is necessary to
have very significant image statistics, forcing us to process thousands of images.
However, it is obvious that prior to any of these processes it is vital to be able to
separate and classify these images into their main views in as fast, reliable,
automatic and objective a way as possible.
Our strategic aim is then to provide a reliable method to classify
large quantities of images (in the order of thousands or tens of thousands) in an
automatic manner, as a way to obtain high resolution structural results in spite
of the low signal-to-noise ratio of the individual images.
Neural network techniques have already been applied successfully to biological
particle classification and reconstruction [8, 6], showing results that are more
robust, and sometimes faster, than traditional statistical techniques. In this paper,
we will apply to the above-mentioned problem several supervised learning
algorithms: G-LVQ [9], based on a two-level genetic algorithm (GA) operating on
variable-size chromosomes, which codify the initial weights and labels for an LVQ
network, and G-PROP, which acts in a similar way on MLPs trained with BP;
the results will be compared with Kohonen's Learning Vector Quantization [10] (LVQ)
algorithm for codebook training and with QuickProp. We will first present the
state of the art in evolutionary codebook design in section 2. The new version of
the G-LVQ algorithm, used in this paper to classify helicases, is briefly described
Codebooks are sets of labeled vectors {wi} (called codevectors) which, acting as
classifiers, offer more information than, for instance, the usual feed-forward neural
network classifiers: a codebook not only provides a method for predicting the classes of
unknown patterns, but is also in itself a sample of the training universe; each
codevector wi belongs to the input space and can thus be visualized and evaluated
in the same way as the original sample, giving the user a clue as to why
an unknown sample has been classified in one way and not another. Designing a
codebook usually involves deciding in advance its size, class label distribution,
initial weights, and the iterative training algorithm used to set the codevectors.
LVQ is one of the possible codebook design and training algorithms, proposed
originally by Kohonen [5]. In this algorithm, the initial weights (initial codevector
values) and the codevector labels are set in advance, as well as a learning
constant, which determines the vector change rate; then the codevectors are
changed by gradient descent, making them as close as possible to the vectors
in the training sample. LVQ, along with the multilayer perceptron and Radial
Basis Functions, is among the most widely used techniques in the neural net realm.
The task of finding the correct number and values of a classifier's parameters,
that is, what is known in the neural network world as "training the weights of a
neural net", is called parameter estimation by statisticians. Statisticians face the
same problem as neural net researchers: finding a combination of parameters
which gives optimal accuracy results with maximum parsimony, that is, a minimum
number of parameters. So far, several global optimization procedures have
been tested: simulated annealing, stochastic approximation, and genetic algorithms.
In each case, the usual approach to codebook design optimization is
to optimize one of the three metrics for its performance, for instance maximizing
classification accuracy while leaving size fixed. Several size values are usually
tested, and the best is chosen. However, the technique used in this paper, G-LVQ,
intends to optimize several codebook parameters at the same time: codevector
labels, codebook size (or number of levels) and codevector initial values.
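A lexicographic comparison over the (accuracy, size, distortion) triplet is one plausible reading of such a vectorial fitness (a sketch under that assumption; the actual ordering used by G-LVQ is given in [9]):

```python
# Comparing codebooks by a vectorial fitness (accuracy, size, distortion):
# a hypothetical lexicographic ordering that maximises accuracy first,
# then minimises size, then distortion.
def fitness_key(triplet):
    accuracy, size, distortion = triplet
    return (-accuracy, size, distortion)   # ascending sort = best first

# Hypothetical population of (accuracy, size, distortion) triplets
population = [(0.90, 16, 3.2), (0.95, 20, 2.9), (0.95, 12, 3.0)]
best = min(population, key=fitness_key)
assert best == (0.95, 12, 3.0)             # accuracy tie broken by size
```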
Some other methods for optimizing LVQ have been based on incremental
approaches (for a review, see [11]), which are still local error-gradient descent
search algorithms on the space of networks or dictionaries of different sizes.
Other methods use genetic algorithms [12] to set the initial weights of a codebook
with an implicit maximum length; this maximum length limits the search
space, which might make it impossible to find the correct codebook, whose number
of levels might be higher. Since then, several papers combining GAs and vector
quantization have been published; for instance, Johnson and collaborators [13]
use GAs to select the optimal set of descriptors that are fed to a vector
quantization algorithm, and Fränti and collaborators [14] use a genetic algorithm
for codebook generation.
3 Method
The method used here for obtaining optimal biological particle classifiers is based
on several pillars, which sometimes lean on each other: hybridization of local and
global search operators (an idea originally proposed by Ackley [21]), a variable-length
chromosome genetic algorithm (which is not too common in the GA
literature, but has already been used, for instance, by Harvey [22]) and vectorial
fitness (not very popular either, but proposed, for instance, by Schaffer [23]). The
genetic algorithm used in this work, based on the previous considerations, can
be described in the following way:
Table 1. Distribution of patterns in the training set; the most frequent patterns were
those with symmetry 3; frequencies are not exact and thus do not add up to 100.
tion in classification over the test set. Set vectorial fitness to the triplet
(accuracy, size, distortion).
(b) With a certain probability, apply the chromosome length changing operators,
suppressing the genetic representation of the codevector which responds
to the lowest number of input space vectors, or duplicating the one which
responds to the highest number.
3. Select the P best chromosomes in the population, according to their fitness
vector, and make them reproduce, with mutation and 2-point crossover
(acting over the length of the smallest chromosome of the couple).
4. Finish after a predetermined number of generations, keeping the best trained
codebook, and apply it to the validation set.
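Step 3's 2-point crossover acting over the length of the smaller chromosome of the couple can be sketched with list-encoded chromosomes (an illustrative reconstruction, not the authors' code):

```python
import random

def two_point_crossover(a, b):
    """2-point crossover between two variable-length chromosomes; the cut
    points are drawn over the length of the smaller chromosome of the couple."""
    n = min(len(a), len(b))
    i, j = sorted(random.sample(range(n + 1), 2))   # two distinct cut points
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

random.seed(1)
a, b = [1, 2, 3, 4, 5], [9, 8, 7]
c1, c2 = two_point_crossover(a, b)
assert len(c1) == len(a) and len(c2) == len(b)   # lengths are preserved
assert sorted(c1 + c2) == sorted(a + b)          # genes are only exchanged
```

Because the cut points stay within the smaller chromosome, the operation never indexes past either parent, so offspring lengths equal the parents' lengths.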
4 Results
The whole set of samples available consisted of 933 spectra, obtained after a
process of segmentation, translational and rotational alignment, manual labeling
by visual inspection of the rotational symmetry, and preprocessing to obtain the
circular symmetry spectrum of the electron-microscopy images. After obtaining the
spectra, only those with physical meaning were used for training and testing, that is,
only those with symmetry equal to 1, 2, 3, 4 and 6, and those classified as noisy, which
were assigned the same class label as those with symmetry 1 (or no symmetry).
The absolute number of samples and the frequencies for each class are shown in table
1. Each particle is represented by a 15-component vector, corresponding to the
spectrum components. Class averages are shown in figure 1; each class represents
vectors with a strong symmetry, so there should be a peak value in one of the
components, as is seen in the figure.
Using these spectra, three files were created: one for training, another for
testing, and another for validation. Training files were used to train
Fig. 1. Class averages for each class. As can be seen, each class has a sharp peak
at one component, along with some harmonics at components with double and half
the value; for instance, objects with symmetry 6 have a small harmonic at component 12.
all neural nets, the test files to select one among all the trained nets, and the
validation file to check the accuracy of that network; the validation file has never
been shown to the chosen network before. Each file contained 311 samples, with
random class distribution. Only the training and validation files were used for the
LVQ algorithm since, instead of selecting one net, the average and lowest errors
are taken.
Several classification algorithms were then tested on these sets: Kohonen's
classical Learning Vector Quantization, or LVQ; G-LVQ [9], which is basically, as
pointed out above, a genetic optimization of Kohonen's LVQ which, at the same
time, discovers the optimal size of the LVQ network; QuickProp [24]; and G-PROP,
a genetically optimized version of Backpropagation which is also able
to discover the correct size for the hidden layer [18].
LVQ was run 1000 times with several preset codebook sizes, using the training
file to train and the validation file as the test file; since no further selection is made on
the training parameters, the test file was not considered necessary. The LVQ algorithm
has no method to select the codebook size in advance; thus, two different sizes
were tested: 8 and 16 levels. Weight vectors and labels were initialized with
vectors from the training file, which usually gives good results, but with so few
codevectors it might happen that many codevectors get the same label.
The gain factor α was set to 0.1, decreasing by α/epochs each step, and the
networks were trained for 1000 epochs. The program was written using S-VQ (Simple
Vector Quantization), a C++ class library for programming vector quantization
applications, which is available from the author. The whole test took several
minutes on a Silicon Graphics O2. 100 QuickProp networks were also trained,
with the hidden layer size and gain factor set to the best ones obtained by G-PROP.
All settings for G-PROP were the same as in [18].
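The LVQ codevector update referred to above (move the nearest codevector towards a sample of its own class, away otherwise) can be sketched as follows (a standard LVQ1 step, assuming Euclidean distance; not the S-VQ library itself):

```python
import numpy as np

def lvq1_step(codebook, labels, x, y, alpha):
    """One LVQ1 update: move the nearest codevector towards sample x if its
    label matches class y, and away from x otherwise."""
    i = np.argmin(np.linalg.norm(codebook - x, axis=1))  # nearest codevector
    sign = 1.0 if labels[i] == y else -1.0
    codebook[i] += sign * alpha * (x - codebook[i])
    return codebook

# Hypothetical 2-level codebook in 2-D, gain factor alpha = 0.1
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
labels = [0, 1]
x = np.array([0.2, 0.0])
codebook = lvq1_step(codebook, labels, x, y=0, alpha=0.1)
# codevector 0 has moved towards the correctly classified sample
```

In the setup described above, α itself would be decreased by α/epochs after each step, so the codevectors settle down as training proceeds.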
The genetic algorithm used for G-LVQ is a steady-state algorithm, which
means that a part of the population is kept in each generation. In this case,
40% of the population was substituted each generation with the offspring of
the 40% best. The population was fixed at 100 networks for each generation (or
200 in one case). Variable-length chromosome operators were used, with 20%
probability for the gene-duplication operator and 10% probability for the gene-elimination
operator. Bit-flip mutation is applied with 10% probability, and crossover with
40% probability. The GA was run for several generations, using a vectorial fitness as
shown in [9]; the main criterion for optimization was the minimization of the
misclassification rate, followed by length. G-LVQ was run 5 times with each of the
different parameter settings and different random initializations; each network was
trained for a number of epochs equal to twice the size of the training set (around 600
vector presentations in this case). Running the test took around one hour on an old
Silicon Graphics R4000 Entry (similar to a Pentium 100 MHz). G-PROP used the
same parameters as described in [18].
The best algorithm is G-PROP, which obtains an outstanding error level, with
low standard deviation, and an acceptable size. 16-level LVQ initialized with samples
from the training file also obtains acceptable values and, if enough networks
are trained (1000 in this case), is able to find one which generalizes perfectly (0%
Fig. 2. Trained codebook obtained in one of the G-LVQ runs. The genetic algorithm has
assigned 1 codevector to each class, except the class with symmetry 3 (3 codevectors) and
the class with symmetry 2 (2 codevectors). This particular run obtained a 9% error on the
validation sample. Usually, each codevector has higher values in the component
corresponding to its symmetry and its harmonics.
This paper shows that a very efficient classification of spectra obtained from
electron microscopy images is possible, which paves the way for high-speed and accurate
electron microscopy image processing. For this problem, genetically optimized
versions of LVQ and Backprop obtained better results than the standalone
algorithms and, besides, were able to find the relevant parameters (initial weights,
size and learning parameters). G-PROP outperforms all the other algorithms,
and finds a perceptron with a small size, less than 1% of the number of variables
involved in training. Probably, G-PROP is much more efficient than G-LVQ when
more than 2 classes are involved and the class frequencies are not the same for all
classes, since G-LVQ sometimes finds codebooks with one or several classes missing
(and, depending on class frequencies, it could still have a good error rate). Future
work will include improvements to G-LVQ, so that it obtains better results on
this kind of problem, and its application to other electron microscopy problems.
Acknowledgements
This work has been supported in part by CICYT (Spain) grant numbers 1FD97-
0439-TEL1 and BIO98-076.
References
1. M. C. San Martín, N.P.J. Stamford, N. Dammerova, N.E. Dixon, and J. M. Carazo. A structural model for the Escherichia coli DnaB helicase based on electron microscopy data. J. Struct. Biol., (114):167-176, 1995.
2. X. Yu, M.J. Jezewska, W. Bujalowski, and E.H. Egelman. The hexameric E. coli DnaB helicase can exist in different quaternary states. J. Mol. Biol., (259):7-14, 1996.
3. M. Bárcena, M.C. San Martín, F. Weise, S. Ayora, J.C. Alonso, and J.M. Carazo. Polymorphic quaternary organization of the Bacillus subtilis bacteriophage SPP1 replicative helicase (G40P). J. Mol. Biol., (283):809-819, 1998.
4. C. San Martín, C. Gruss, and J.M. Carazo. Six molecules of SV40 large T antigen assemble in a propeller-shaped particle around a channel. J. Mol. Biol., (269), 1997.
5. Teuvo Kohonen. The self-organizing map. Proc. IEEE, 78:1464-1480, 1990.
6. R. Marabini and J.M. Carazo. Pattern recognition and classification of images of biological macromolecules using artificial neural networks. Biophysical Journal, 66:1804-1814, 1994.
7. R.A. Crowther and L.A. Amos. Harmonic analysis of electron microscope images with rotational symmetry. J. Mol. Biol., (60):123-130, 1971.
8. Jose-Jesus Fernandez and Jose-Maria Carazo. Analysis of structural variability within two-dimensional biological crystals by a combination of patch averaging techniques and self-organizing maps. Ultramicroscopy, 65:81-93, 1996.
9. J. J. Merelo and A. Prieto. G-LVQ, a combination of genetic algorithms and LVQ. In D.W. Pearson, N.C. Steele, and R.F. Albrecht, editors, Artificial Neural Nets and Genetic Algorithms, pages 92-95. Springer-Verlag, 1995.
10. T. Kohonen. The self-organizing map. Proc. IEEE, 78:1464 ff., 1990.
11. Ethem Alpaydin. GAL: Networks that grow when they learn and shrink when they forget. Technical Report TR-91-032, International Computer Science Institute, May 1991.
12. Enrique Monte, D. Hidalgo, J. Mariño, and I. Herns. A vector quantization algorithm based on genetic algorithms and LVQ. In NATO-ASI Bubión, pages 231 ff., 1993.
13. S.R. Johnson, J.M. Sutter, H.L. Engelhardt, P.C. Jurs, J. White, J.S. Kauer, T.A. Dickinson, and D.R. Walt. Identification of multiple analytes using an optical sensor array and pattern recognition neural networks. Anal. Chem., (69):4641, 1997.
14. P. Fränti, J. Kivijärvi, T. Kaukoranta, and O. Nevalainen. Genetic algorithms for codebook generation in VQ. In Proc. 3rd Nordic Workshop on Genetic Algorithms, Helsinki, Finland, pages 207-222, 1997.
15. Juan-Carlos Perez and Enrique Vidal. Constructive design of LVQ and DSM classifiers. In J. Mira, J. Cabestany, and A. Prieto, editors, New Trends in Neural Computation, Lecture Notes in Computer Science No. 686, pages 335-339. Springer, 1993.
16. Xin Yao and Yong Liu. Towards designing artificial neural networks by evolution. Applied Mathematics and Computation, 91(1):83-90, 1998.
17. P.A. Castillo, J. González, J.J. Merelo, V. Rivas, G. Romero, and A. Prieto. G-Prop: global optimization of multilayer perceptrons using GAs. Submitted to Neurocomputing, 1998.
18. P.A. Castillo, J. González, J.J. Merelo, V. Rivas, G. Romero, and A. Prieto. SA-Prop: optimization of multilayer perceptron parameters using simulated annealing. Submitted to IWANN99, 1998.
19. J. J. Merelo, A. Prieto, F. Morán, R. Marabini, and J. M. Carazo. Automatic classification of biological particles from electron-microscopy images using conventional and genetic-algorithm optimized learning vector quantization. Neural Processing Letters, 8:55-65, 1998.
20. A. Pascual, M. Bárcena, and J.M. Carazo. Application of the fuzzy Kohonen clustering network to biological macromolecules images classification. Submitted to IWANN99, 1999.
21. David H. Ackley. A connectionist algorithm for genetic search. In John J. Grefenstette, editor, Proceedings of the First International Conference on Genetic Algorithms and their Applications, pages 121-135, Hillsdale, New Jersey, 1985. Lawrence Erlbaum Associates.
22. I. Harvey. Species adaptation genetic algorithms: a basis for a continuing SAGA. In F. J. Varela and P. Bourgine, editors, Proceedings of the First European Conference on Artificial Life. Toward a Practice of Autonomous Systems, pages 346-354, Paris, France, 11-13 December 1991. MIT Press, Cambridge, MA.
23. J. D. Schaffer and J. J. Grefenstette. Multi-objective learning via genetic algorithms. In Proceedings of the 9th International Joint Conference on Artificial Intelligence, pages 593-595, 1985.
24. S.E. Fahlman. Faster-learning variations on back-propagation: an empirical study. In Proceedings of the 1988 Connectionist Models Summer School. Morgan Kaufmann, 1988.
Pattern Recognition
Using Neural Network Based on
Multi-valued Neurons
Abstract
1. INTRODUCTION
On the other hand, a Hopfield-like MVN-based neural network has been proposed as an associative memory in [7]. A disadvantage of these three networks is their inability to recognize shifted or rotated images, as well as images with a changed dynamic range. To overcome these disadvantages, and to use the features of multi-valued neurons effectively, we propose here a new type of network, learning strategy, and data representation (the frequency domain will be used instead of the spatial one). The idea of image recognition on a single-layered MVN-based neural network using analysis of the orthogonal spectral coefficients has been proposed in [6]. It will be considerably developed here.
2. MULTI-VALUED NEURONS AND THEIR LEARNING
The multi-valued neuron (MVN), introduced in [2], was considered later in many papers; e.g., some important theoretical aspects have been presented in [3] and [5]. We would like to recall here some key points of the MVN theory (the mathematical model of the MVN and its learning).
The MVN [2, 3, 5] performs a mapping between n inputs and a single output. The performed mapping is described by a multiple-valued (k-valued) function of n variables f(x_1, ..., x_n) via its representation through n+1 complex-valued weights w_0, w_1, ..., w_n:

f(x_1, ..., x_n) = P(w_0 + w_1 x_1 + ... + w_n x_n),   (1)

where x_1, ..., x_n are the variables on which the performed function depends. Values of the function and of the variables are coded by complex numbers which are k-th roots of unity: ε^j = exp(i 2πj/k), j ∈ {0, ..., k-1}, where i is the imaginary unit. In other words, the values of k-valued logic are represented as k-th roots of unity: j → ε^j. P is the activation function of the neuron:

P(z) = exp(i 2πj/k),  if 2π(j+1)/k > arg(z) ≥ 2πj/k,   (2)

or, with integer output:

P(z) = j,  if 2π(j+1)/k > arg(z) ≥ 2πj/k,   (2a)

where j = 0, 1, ..., k-1 are the values of k-valued logic, z = w_0 + w_1 x_1 + ... + w_n x_n is the weighted sum, and arg(z) is the argument of the complex number z. So, if z belongs to the j-th sector, into which the complex plane is divided by (2), the neuron's output is equal to ε^j, or to j in integer form (Fig. 1).
Fig. 1. Definition of the MVN activation function.
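The sector-based activation (2)/(2a) can be sketched in a few lines of Python. This is a hedged illustration, not the authors' implementation; the function names are ours:

```python
import cmath

def mvn_activation(z, k, integer_output=False):
    """Activation (2)/(2a): map the weighted sum z to the index j of the
    sector (of k equal sectors) of the complex plane containing arg(z)."""
    j = int((cmath.phase(z) % (2 * cmath.pi)) * k / (2 * cmath.pi))
    if integer_output:
        return j                                   # integer form (2a)
    return cmath.exp(1j * 2 * cmath.pi * j / k)    # form (2): a k-th root of unity

def mvn_output(weights, inputs, k):
    """Output of one MVN: P(w_0 + w_1 x_1 + ... + w_n x_n), integer form."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return mvn_activation(z, k, integer_output=True)
```

Reducing the phase modulo 2π maps Python's (-π, π] phase range onto the [0, 2π) range assumed by (2).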
The MVN has some remarkable properties that make it much more powerful than traditional artificial neurons. The representation (1)-(2) makes it possible to implement input/output mappings described by arbitrary partially defined multiple-valued functions, on a single neuron.
The following correction rule for learning of the MVN has been proposed in [2]:

W_{m+1} = W_m + C_m ω ε^q X̄,   (3)

where W_m and W_{m+1} are the current and next weighting vectors, X̄ is the complex-conjugated vector of the neuron's input signals (the current vector from the learning set), ε^q is the desired neuron's output (in complex-valued form), C_m is the scale coefficient, and ω is the correction coefficient. This coefficient must be chosen so that, after the correction of the weights according to rule (3), the weighted sum moves exactly into the desired sector, or at least as close to it as possible.
Another effective, quickly convergent learning algorithm for the multi-valued neuron has been proposed in [3] and then developed in [6]. It is based on the error-correction rule:

W_{m+1} = W_m + (C / (n+1)) (ε^q − ε^s) X̄,   (4)

where W_m and W_{m+1} are the current and next weighting vectors, X̄ is the vector of the neuron's input signals (complex-conjugated), ε is a primitive k-th root of unity (k is chosen from (2)), C is a scale coefficient, q is the number of the desired sector on the complex plane, s is the number of the sector into which the actual value of the weighted sum has fallen, and n is the number of neuron inputs. Learning algorithms based on both rules (3) and (4) converge very quickly. It is possible to implement them in purely integer arithmetic [3], and it is always possible to find a value of k in (2) such that (1) holds for the given function f describing the mapping between the neuron's inputs and output [3, 6].
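The error-correction rule (4) is equally short in code. The sketch below assumes a single training sample and our own naming; the stopping criterion (the weighted sum falls into the desired sector) follows the text:

```python
import cmath

def root_of_unity(j, k):
    return cmath.exp(1j * 2 * cmath.pi * j / k)

def sector(z, k):
    """Index of the sector of the complex plane containing arg(z)."""
    return int((cmath.phase(z) % (2 * cmath.pi)) * k / (2 * cmath.pi))

def train_step(weights, x, q, k, c=1.0):
    """One application of rule (4) for one sample x with desired sector q.
    Returns (new_weights, converged)."""
    n = len(x)
    z = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    s = sector(z, k)
    if s == q:                  # weighted sum already in the desired sector
        return weights, True
    delta = c / (n + 1) * (root_of_unity(q, k) - root_of_unity(s, k))
    # the bias weight sees the constant input 1; real inputs are conjugated
    new_weights = [weights[0] + delta] + [w + delta * xi.conjugate()
                                          for w, xi in zip(weights[1:], x)]
    return new_weights, False
```

Iterating `train_step` over the learning set until every sample lands in its desired sector reproduces the learning loop described in the text.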
The ij-th neuron is connected with 8 other neurons, and with itself. The numbers of the neurons from which the ij-th neuron receives input signals are chosen randomly.
To use the MVN features more effectively, and to overcome the disadvantages mentioned above, we would like to consider here a new type of network, learning strategy, and data representation (the frequency domain will be used instead of the spatial one).

Consider N classes of objects, presented by images of n × m pixels. The problem is formulated in the following way: we have to create a recognition system, based on a neural network, which makes possible the successful identification of the objects by fast learning on a minimal number of representatives of all classes.
To make our method invariant to rotations and shifts, and to make possible the recognition of other images of the same objects, we move to a frequency-domain representation of the objects. It has been observed (see, e.g., [11]) that objects belonging to the same class have similar low-frequency spectral coefficients. For different classes of discrete signals (of different nature, with lengths from 64 to 512), the sets of the lowest (quarter to half of the) coefficients are very close to each other for signals from the same class, from the point of view of learning and analysis on a neural network [11]. This observation holds for different orthogonal transformations. It should be mentioned that the neural network proposed in [11] for the solution of a similar problem was based on ordinary threshold elements, and only two classes of objects were considered. In neural network terms, to classify objects we have to train a neural network with a learning set containing the spectra of representatives of our classes. The weights obtained by learning are then used for the classification of unknown objects.
We propose the following structure of the MVN-based neural network for the solution of our problem. It is a single-layer network, which contains the same number of neurons as the number of classes to be identified (Fig. 4). Each neuron has to recognize patterns belonging to its class and to reject patterns from any other class. Taking into account that a single MVN can perform an arbitrary mapping, it is easy to conclude that exactly this network structure is optimal and the most effective.
To ensure a more precise representation of the spectral coefficients in the neural network, they have to be normalized; their new dynamic range after normalization is [0, 511]. More exactly, they take discrete values from the set {0, 1, 2, ..., 511}. We will use two different models for the frequency-domain representation of our data. The first one uses the low part of the Cosine transform coefficients. The second one uses the phases of the low part of the Fourier transform coefficients. In the latter case we used the property of the Fourier transform that phase contains more information about the signal than amplitude (this fact is investigated, e.g., in [12]).

The best results for the first model were obtained experimentally when we reserved the first l = k/4 (of the k = 512) sectors on the complex plane (see (2) and Fig. 1) for the classification of a pattern as belonging to the given class. The other 3k/4 sectors correspond to rejected patterns (Fig. 5). The best results for the second model were also obtained experimentally, when for the classification of a pattern as belonging to the given class we reserved the first l = k/2 (of the k = 512) sectors on the complex plane (see (2) and Fig. 1). The other k/2 sectors correspond to rejected patterns (Fig. 6).
Fig. 5. Sectors 0, ..., k/4−1: domain for the patterns from the given class; sectors k/4, ..., k−1: domain for rejected patterns.

Fig. 6. Sectors 0, ..., k/2−1: domain for the patterns from the given class; sectors k/2, ..., k−1: domain for rejected patterns.
Thus, for both models, output values 0, ..., l−1 of the i-th neuron correspond to the classification of an object as belonging to the i-th class. Output values l, ..., k−1 correspond to the classification of an object as rejected, for the given neuron and class respectively.

Hence there are three possible results of recognition after the training: 1) the output of neuron number i belongs to {0, ..., l−1} (meaning that the network classified the pattern as belonging to class number i), and the outputs of all other neurons belong to {l, ..., k−1}; 2) the outputs of all neurons belong to {l, ..., k−1}; 3) the outputs of several neurons belong to {0, ..., l−1}. Case 1 corresponds to exact (or wrong) recognition. Case 2 means that a new class of objects has appeared, that learning was insufficient, or that the learning set was not representative. Case 3 means that the number of neuron inputs is too small or too large, or that learning has been performed on a non-representative learning set.
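The three-way decision above can be summarised by a small helper (the names are ours; `l` and `k` are as defined in the text, and `outputs` holds the integer sector index produced by each of the N neurons):

```python
def classify(outputs, l, k):
    """Interpret the integer outputs of the N single-class MVNs.
    Returns one of the three cases described in the text."""
    accepted = [i for i, out in enumerate(outputs) if 0 <= out < l]
    if len(accepted) == 1:
        return ("class", accepted[0])    # case 1: exactly one neuron accepts
    if not accepted:
        return ("unknown", None)         # case 2: every neuron rejects
    return ("ambiguous", accepted)       # case 3: several neurons accept
```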
4. SIMULATION RESULTS

The proposed structure of the MVN-based network, and the approach to the solution of the recognition problem, have been evaluated on the example of face recognition. The experiments have been performed on a software simulator of the neural network.
Fig. 7. Testing image database (portraits numbered 1-20).
We used the MIT faces database [13], supplemented by some images from the database used in our previous work on associative memories (see [3-4]). Our testing database thus contained 64 × 64 portraits of 20 people (27 images per person, with different dynamic ranges, lighting conditions, and positions in the field). So our task was to train the neural network to recognize twenty classes. A fragment of the database is presented in Fig. 7 (each class is represented by a single image within this fragment).

According to the structure proposed above, our single-layer network contains twenty MVNs (the same number as the number of classes). For each neuron we have the following learning set: 16 images from the class corresponding to the given neuron, and 2 images from each other class (so 38 images from other classes). Let us describe the results obtained for both models.
Model 1 (Cosine transform).

According to the scheme presented in Fig. 5, sectors 0, ..., 127 have been reserved for the classification of an image as belonging to the current class, and sectors 128, ..., 511 have been reserved for the classification of the images from other classes. The learning algorithm with rule (4) has been used. Thus, for each neuron, q = 63 for patterns from the current class and q = 319 for other patterns in learning rule (4).

The best results have been obtained for 20 inputs of the network, that is, for 20 spectral coefficients which are inputs of the network. More exactly, these are the 20 lowest coefficients (from the second until the sixth diagonal; the zero-frequency coefficient has not been used). The choice of the spectral coefficients from the diagonals of the spectrum is based on a property of 2-D frequency-ordered spectra: each diagonal contains the coefficients corresponding to the same 2-D frequency ("zigzag", see Fig. 8).
Fig. 8. Choice of the spectral coefficients which are inputs of the neural network.
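The "zigzag" selection can be sketched as follows (our own helper, under the assumption that the spectrum is a 2-D array whose anti-diagonals are indexed from zero, so the second to sixth diagonals are d = 1..5, giving 2+3+4+5+6 = 20 coefficients as in the text):

```python
def zigzag_coefficients(spectrum, first_diag=1, last_diag=5):
    """Collect the coefficients of a 2-D frequency-ordered spectrum along
    the anti-diagonals first_diag..last_diag (the zero-frequency diagonal
    d = 0 is skipped by default, as in the text)."""
    coeffs = []
    rows, cols = len(spectrum), len(spectrum[0])
    for d in range(first_diag, last_diag + 1):
        for i in range(d + 1):          # (i, j) with i + j == d: same 2-D frequency
            j = d - i
            if i < rows and j < cols:
                coeffs.append(spectrum[i][j])
    return coeffs
```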
We obtained quick convergence of the learning for all neurons. The computing time of the software simulator, implemented on a Pentium-133, is about 5-15 seconds per neuron, which corresponds to 2000-3000 iterations. An important remark: if it is impossible to obtain convergence of the learning for the given k in (2), it is necessary to change k and to repeat the process.
For testing, twelve images per person which were not present in the learning set, being other or corrupted photos of the same people, have been shown to the neural network for recognition. For classes 1, 2, and 17 the testing images are presented in Figs. 9-11, respectively. The results are the following. The number of incorrectly identified images for each class (neuron) is from 0 (for 15 of the 20 classes) to 2 (8%), except for classes No. 2 and 13, for which this number increases to 3-4. This may be the influence of the same background against which the photos have been taken, and of the very similar glasses of both persons (see Fig. 7). To improve the results of recognition in such a case, the learning set should be expanded. From our point of view this is not a problem, because additional
learning is very simple. On the other hand, increasing the number of classes to be identified is also not a problem, because it is always possible to add the necessary number of neurons to the network (Fig. 4) and to repeat the learning process, starting from the previous weighting vectors.
Model 2 (Fourier transform).

The results corresponding to Model 2 are better. According to the scheme presented in Fig. 6, sectors 0, ..., 255 have been reserved for the classification of an image as belonging to the current class, and sectors 256, ..., 511 have been reserved for the classification of the images from other classes. The learning algorithm with learning rule (3) has been used. So, for each neuron, q = 127 for patterns from the current class and q = 383 for other patterns in learning rule (3).
The results of recognition improved steadily with an increasing number of network inputs. It should be noted that such a property was not observed for Model 1. The results of recognition were stable for numbers of coefficients above 20. The best results have been obtained for 405 inputs of the network, that is, for 405 spectral coefficients which are inputs of the network; beginning from this number the results stabilized. The phases of the spectral coefficients have again been chosen according to the "zigzag" rule (Fig. 8).
Fig. 11. Testing images 1-12 for class "17": 100% successful recognition.
For all classes, 100% successful recognition has been obtained. For classes "2" and "13", 2 images from the other class ("13" for "2", and "2" for "13") have also been identified as "theirs", but this mistake has been easily corrected by additional learning. The reason for this mistake is, evidently, again the same background of the images, and the very similar glasses of the two persons whose portraits constitute the corresponding classes.

To compare both methods, and to estimate the margin of precision ensured by the learning, Table 1 contains the numbers of the sectors (out of 512) into which the weighted sum has fallen for the images from class No. 17 (see Fig. 11).

It should be mentioned that, using the frequency-domain data representation, it is very easy to recognize noisy objects (see Fig. 11, Table 1). Indeed, we use low-frequency spectral coefficients for the data representation, while noise is concentrated in the high-frequency part of the spectrum, which is not used.

We hope that the considered examples are convincing, and show both the efficiency of the proposed solution of the image recognition problem and the high potential of MVNs and of the neural networks based on them.
Table 1. Number of the sector into which the weighted sum has fallen during recognition of the images presented in Fig. 11 (class 17).

Image                     1    2    3    4    5    6    7    8    9   10   11   12
Method 1, sector
(borders are 0, 127)     60   62   62  102   40   34   65   45   99   65   35   46
Method 2, sector
(borders are 0, 255)    126  122  130  129  120  135  118  134  151  126  107  119
A new MVN-based neural network for the solution of pattern recognition problems has been proposed in this paper. This single-layered network contains a minimal number of neurons, equal to the number of classes to be recognized. Orthogonal spectral coefficients (Cosine and Fourier) are used for the representation of the objects to be recognized. The proposed solution of the recognition problem has been tested on the example of face recognition. The simulation results confirmed the high efficiency of the proposed solution: the probability of correct recognition of the images from the testing set is close to 100%. The obtained results may be generalized from face recognition to image recognition, and to pattern recognition in general. Future work in developing the obtained results will be directed to the minimization of the number of neural network inputs, and to the search for the best orthogonal basis for the representation of the data describing the analyzed objects.
REFERENCES
I. Introduction
The majority of pattern recognition systems based on neural networks are very sensitive to transformations such as rotation, scaling, and translation. In recent years, some researchers have built systems based on higher-order neural networks that are insensitive to translation, rotation, and scaling, but in practical applications these suffer from a combinatorial explosion of units. Although many other invariant neural recognisers have been proposed, no useful system yet exists that can be considered insensitive to translation, rotation, and scaling. This work is oriented particularly to the development of a pattern pre-processing system, able to make any pattern invariant to the aforesaid transformations. The effect of the transformations is typical when the acquisition device, for example a television camera, changes its orientation or distance with respect to the object. Systems with the ability of recognising patterns in a transformation-invariant manner have practical applications in a great variety of fields, from checking the presence of simple objects up to guiding robots in their exploration space. Further mandatory characteristics of the system are: independence from the recognition approach used; autonomy, that is, it alone extracts the pattern from a generic binary image and, through various elaborations such as expansion, translation, normalisation, rotation, and finally scaling, reaches complete transformation invariance, adapting the extracted pattern to the dimensions demanded by the recogniser. In the present work the recogniser has been implemented as a Hopfield neural network, consisting of a fully interconnected matrix of 13×13 neurones.
2. Pre-processing
The first phase of the pre-processing consists of the acquisition through a scanner with an optical resolution of 100 dpi, producing a representation of the image inside the computer as a bit-map in graphic format (PCX). This representation does not facilitate the subsequent elaborations on the image, for which the conversion to text format is necessary. The second phase foresees the extraction of the pattern from within the image converted to text; the pattern is then subjected to the morphological operation of dilation, which has the double purpose of filling the holes inside the pattern, introduced by the low sampling resolution, and of conferring lower mutual correlations on the patterns. At this point the process of adaptation, or normalisation, begins; it consists of the sequence of functional forms translation, normalisation, rotation, and scaling, each action taking a measure of the pattern in input that allows it to be led to a canonical form.
2.1. Translation

The translation of an object consists of its shift to a new position, keeping its dimensions and orientation unchanged. This process makes it possible to obtain invariance to position, by calculating the centre of gravity of the pattern and then translating the pattern so that its centre of gravity coincides with the centre of the new window. For instance (see figure 1):
Figure 1. Translation of a 5×8 window into a 9×9 window.

f_t(x, y) = f(x + x_c, y + y_c)

where x_c and y_c are the co-ordinates of the barycentre of the pattern, and f(x, y) gives the value of the pixel of the pattern at the co-ordinates (x, y); for binary two-dimensional digitised objects, this function can return only the values 0 or 1.
The centre of the area in a binary image is the same as the centre of mass, if we consider the intensity of a point as the mass of that point. The position of the pattern is given by [3]:

x_c = (Σ_i Σ_j i·f[i, j]) / (Σ_i Σ_j f[i, j]),    y_c = (Σ_i Σ_j j·f[i, j]) / (Σ_i Σ_j f[i, j])
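The barycentre formulas and the re-centring translation can be sketched as follows (a minimal sketch with our own naming, taking a binary image as a list of rows of 0/1 values):

```python
def barycentre(img):
    """Centre of mass of a binary image: integer co-ordinates (x_c, y_c)."""
    total = xs = ys = 0
    for i, row in enumerate(img):
        for j, v in enumerate(row):
            total += v
            xs += i * v
            ys += j * v
    return xs // total, ys // total

def translate_to_centre(img, n):
    """Copy the pattern into an n x n window so that its barycentre
    coincides with the centre of the new window."""
    xc, yc = barycentre(img)
    out = [[0] * n for _ in range(n)]
    for i, row in enumerate(img):
        for j, v in enumerate(row):
            if v:
                x, y = i - xc + n // 2, j - yc + n // 2
                if 0 <= x < n and 0 <= y < n:
                    out[x][y] = 1
    return out
```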
2.2 Normalisation

The first step of this process consists of taking a measure of the size of the pattern, calculating the middle ray (average radius) with the following formula [1]:

r_m = (Σ_{x=1}^{n} Σ_{y=1}^{n} max{|x − x_c|, |y − y_c|} · u(x, y)) / (Σ_{x=1}^{n} Σ_{y=1}^{n} u(x, y))

where n is the dimension of the window, (x_c, y_c) are the co-ordinates of the centre of the window, and u(x, y) is the matrix of the pattern to normalise.

The performed normalisation preserves the form of the pattern, scaling it by means of a coefficient named the scale factor, expressed by:

s = r_m / r,    r = n/3

The mapping function for the translation-invariant pattern is the following:

u_n(x, y) = u(s·x, s·y)
2.3 Rotation

The first computation, for carrying the orientation vector into the canonical direction and so realising the rotation invariance, consists in the calculation of a vector by means of the Karhunen-Loeve transformation. From this we have that, given a set of vectors, the eigenvector related to the biggest eigenvalue of the covariance matrix derived (see below) from the set of vectors points in the direction of maximum variance [1, 3].

The inclination of the eigenvector corresponding to the biggest of the eigenvalues of the covariance matrix, which allows the orientation of the pattern to be determined, is the following:

y/x = (T_yy − T_xx + sqrt((T_yy − T_xx)² + 4·T_xy²)) / (2·T_xy)

where:

T_xx = Σ_{i=1}^{n} Σ_{j=1}^{n} u(x_i, y_j)·x_i²,   T_yy = Σ_{i=1}^{n} Σ_{j=1}^{n} u(x_i, y_j)·y_j²,   T_xy = Σ_{i=1}^{n} Σ_{j=1}^{n} u(x_i, y_j)·x_i·y_j
Subsequently the pattern is rotated, toward the canonical direction, chosen as coincident with the y axis in the direction "south", by the rotation algorithm. This algorithm has been built so that it minimises the approximation errors, guarantees less complex elaborations, and finally allows a possible implementation inside a neural network. The purpose is reached with the introduction of the following hypotheses:
1. we use the chessboard metric, that is: d = max(|x − x_b|, |y − y_b|), where (x_b, y_b) are the co-ordinates of the barycentre of the figure;
2. an oblique line is approximated by means of a broken one, for example see figure 3;
3. the points equidistant from the centre of rotation, according to the metric of point 1, form a circumference which constitutes the domain into which the result of the rotation of such points falls. For instance, as in figures 4a and 4b:

Figure 4a. Pattern before rotation by 22.50°.   Figure 4b. Pattern after rotation by 22.50°.

The rotation of a matrix of points by an angle α involves all the points belonging to the circumference of ray d undergoing a shift along the same circumference, equal to the integer part of the value (d·α)/45.
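The orientation formula above follows directly from the moments T_xx, T_yy, and T_xy. A minimal sketch (our own helper; the pattern u is assumed already translated to the window centre, as in section 2.1):

```python
import math

def orientation(u):
    """Inclination (in radians) of the principal axis of a binary pattern,
    computed from the moments T_xx, T_yy, T_xy as in the text."""
    txx = tyy = txy = 0.0
    for x, row in enumerate(u):
        for y, v in enumerate(row):
            txx += v * x * x
            tyy += v * y * y
            txy += v * x * y
    # y/x = (T_yy - T_xx + sqrt((T_yy - T_xx)^2 + 4 T_xy^2)) / (2 T_xy)
    num = tyy - txx + math.sqrt((tyy - txx) ** 2 + 4 * txy ** 2)
    return math.atan2(num, 2 * txy)
```

Using `atan2` rather than a plain division keeps the result well defined when T_xy = 0.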
2.4 Scaling

The scaling allows the dimension of an image to be adjusted to that of the entry layer of the recogniser neural network. It does not take a measure of the size of the pattern, but it guarantees a small distortion. A scale factor, 'rap', is determined, equal to the ratio between the side of the window to be scaled and the side of the scaled window. The window to be scaled is sampled with a grid of dimension equal to the side of the scaled window. The weighted sum (where the weights are constituted by the values of the contribution factors) of the pixels that fall inside the sampling grid is computed; this sum is compared with a value called the decision threshold, which depends on the dimensions of the two windows. This threshold value is:

θ = rap²/2

the least value taken by the sum of the contributions for the pixel under examination to be considered active. This value corresponds to half of the sum of all the weighted contributions, in the case in which all the pixels are active. In figure 5 we show, with the thinner grid, the pixels of the scaled window to which the contributions of the pixels of the window to be scaled are assigned.
Figure 5. The value of this pixel is conditioned by the pixel above left with weight 0.11, by the pixel above right with weight 0.11, by the pixel below left with weight 0.11, and by the pixel below right with weight 0.11 (rap = 2/3).
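One possible reading of the described scaling in code — a hedged sketch, assuming square windows, area-overlap contribution factors, and the threshold θ = rap²/2 from the text:

```python
def rescale(img, m):
    """Scale an n x n binary image to an m x m window. Each target pixel
    takes the area-weighted sum of the source pixels overlapping its grid
    cell, and is active when that sum reaches the threshold rap^2 / 2."""
    n = len(img)
    rap = n / m                       # ratio of the two window sides
    theta = rap * rap / 2             # decision threshold
    out = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            x0, x1 = i * rap, (i + 1) * rap
            y0, y1 = j * rap, (j + 1) * rap
            s = 0.0
            for x in range(int(x0), min(n, int(x1) + 1)):
                for y in range(int(y0), min(n, int(y1) + 1)):
                    wx = max(0.0, min(x + 1, x1) - max(x, x0))
                    wy = max(0.0, min(y + 1, y1) - max(y, y0))
                    s += wx * wy * img[x][y]   # contribution factor wx*wy
            out[i][j] = 1 if s >= theta else 0
    return out
```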
2.5 Dilation

This process performs the morphological operation 'dilation', which allows the geometric size of the pattern in the window to be widened, and the "holes" inside the pattern to be filled. The dilation is based on the Minkowski sum, defined as [7]:

A ⊕ B = { a + b : a ∈ A, b ∈ B }

For the binary patterns used in this work, the procedure proceeds as in figure 6:

Fig. 6. Dilation of the letter E rotated by 22.5° (b), using the structuring element (a).

The expansion process eliminated the "holes"; the image can now be furnished to the contour-following algorithm without further problems. Along the contour of the expanded object, the vertices belonging to the minimal rectangle containing the object are identified. Then the non-expanded object is extracted, leaving unchanged, accordingly, all the characteristics of the object.
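Binary dilation as a Minkowski sum over the active pixels can be sketched as follows (the cross-shaped structuring element below is our assumption; the paper's element (a) is not reproduced here):

```python
def dilate(img, se=((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1))):
    """Binary dilation: Minkowski sum of the pattern's active pixels
    with the structuring element se (offsets from the origin)."""
    n, m = len(img), len(img[0])
    out = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if img[i][j]:
                for di, dj in se:       # stamp the element on each active pixel
                    x, y = i + di, j + dj
                    if 0 <= x < n and 0 <= y < m:
                        out[x][y] = 1
    return out
```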
2.7 Results

In this paragraph we report the results obtained from the process of extraction and normalisation of the patterns representing the 26 hand-written letters acquired through the scanner. For instance:

Figure 8. The 26 handwritten letters of the alphabet.
Figure 9a. Handwritten letters C B M A U Y Z after extraction and dilation. Figure 9b. The same patterns normalised by the preprocessing.
Figure 10. Letters from the set {A, J, T}. On the right, extracted and dilated; on the left, the corresponding normalised characters.
Figure 11. Letters from the set {X, C, G}. On the right, extracted and dilated; on the left, the corresponding normalised characters.
Figure 12. Letters from the set {H, U, S}. On the right, extracted and dilated; on the left, the corresponding normalised characters.
The computer simulation has shown the error-free recovery of all sixteen patterns shown in figure 13, highlighting a memorisation capacity superior to the expected one of 7 patterns. Unfortunately, in practical cases, the conditions of zero mean value and zero mutual correlation do not always hold; therefore we have divided the training patterns into groups, in order to train an equal number of Hopfield nets, as in figure 14.
The proposed strategy must guarantee, inside each group, mutual correlations between patterns that are as small as possible, in order to obtain a correct recall in the recovery phase. In fact, if we have a great number of patterns, as in our case, we are not able to divide them into minimally correlated groups without the help of a process that guides us. To this purpose we introduced a threshold which, chosen by the user, allows a first selection of those patterns whose mutual correlation falls below it. The mutual correlation, or overlap, is defined by [5]:

C_sr = Σ_{i=1}^{n} x_i^s · x_i^r

where C_sr represents the value of the mutual correlation between pattern s and pattern r, n is the number of neurons, and p is the number of patterns. The correlation matrix C_ij is built as in the following schema:
Subsequently, by thresholding, the correlation matrix is turned into the binary matrix Cbin_{ij}, as follows:
The element in row i and column j of the binary correlation matrix is equal to 1 if C_{ij} < threshold; otherwise it is 0. The pattern with row index i is compatibly correlated, below the selected threshold value, with the pattern with column index j if the element in row i and column j is 1; otherwise the two are incompatible. In short: pattern i is compatibly correlated to pattern j if Cbin_{ij} = 1. If pattern i is compatibly correlated with pattern j, pattern k and pattern z, all belonging to the same row, it is not true in general that patterns j, k and z are compatibly correlated among themselves. For instance, as seen in the following chart, pattern 1 is compatible with patterns 2, 3 and 4, while pattern 2 is compatible with pattern 3 but not with pattern 4.
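As a minimal sketch of the overlap computation and thresholding just described: the bipolar patterns X and the threshold value below are illustrative assumptions, not data from the paper.

```python
import numpy as np

# Overlap matrix C[s, r] = sum_i x_i^s x_i^r for bipolar patterns,
# then binary compatibility matrix Cbin by thresholding.
# X and threshold are hypothetical, for illustration only.
rng = np.random.default_rng(0)
p, n = 4, 13 * 13                       # p patterns of n = 13x13 neurons
X = rng.choice([-1, 1], size=(p, n))    # hypothetical training patterns

C = X @ X.T                             # C[s, r] = sum_i x_i^s x_i^r
np.fill_diagonal(C, 0)                  # the self-overlap (= n) is not used

threshold = 20                          # user-chosen threshold (assumption)
Cbin = (C < threshold).astype(int)      # 1 = compatibly correlated pair
np.fill_diagonal(Cbin, 0)
```

Since C is symmetric, Cbin is symmetric as well, so compatibility is a mutual relation between pattern pairs.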
The following phase consists in eliminating, in each row, the patterns that are incompatible among themselves, building p groups of minimally correlated elements and obtaining the matrix C'.
For instance, the following chart results:
From the chart it is deduced that patterns 1, 2 and 3 form one group, while the other is given by patterns 1 and 4. The minimally correlated group is the one whose patterns have the smallest sum of mutual correlation values in the matrix C_{ij}. For instance, if we have a net with 36 neurons and the following correlation matrix C_{ij}:
we have: total correlation of group {1,2,3} = C_{12} + C_{13} + C_{23} = 11; total correlation of group {1,4} = C_{14} = 23; min{11,23} = 11. C_{ij} << 0 indicates weakly correlated patterns; C_{ij} >> 0 indicates strongly correlated patterns. The minimally correlated group thus obtained is extracted and the process is restarted afresh, until complete exhaustion of the patterns. The n groups obtained from this process are used to train n Hopfield nets.
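The grouping procedure can be sketched as follows. The correlation values are invented so as to reproduce the worked example in the text (group {1,2,3} with total correlation 3+4+4 = 11, group {1,4} with total 23); the greedy row-wise group construction is our reading of the procedure, shown for one extraction step.

```python
import numpy as np
from itertools import combinations

# Hypothetical correlation matrix reproducing the worked example above.
C = np.array([[ 0,  3,  4, 23],
              [ 3,  0,  4, 99],
              [ 4,  4,  0, 99],
              [23, 99, 99,  0]])
threshold = 30
Cbin = (C < threshold).astype(int)   # 1 = compatibly correlated
np.fill_diagonal(Cbin, 0)

def total_correlation(group, C):
    """Sum of mutual correlations over all pattern pairs in the group."""
    return sum(C[i, j] for i, j in combinations(group, 2))

def candidate_group(i, Cbin):
    """Greedily grow a pairwise-compatible group starting from pattern i."""
    group = [i]
    for j in range(Cbin.shape[0]):
        if j not in group and all(Cbin[j, k] for k in group):
            group.append(j)
    return group

# One extraction step: build a candidate group per row and keep the one
# with the smallest total correlation; it would then train one Hopfield
# net, be removed, and the process restarted on the remaining patterns.
candidates = [candidate_group(i, Cbin) for i in range(C.shape[0])]
best = min(candidates, key=lambda g: total_correlation(g, C))
```

With 0-based indices, `best` is the group {1,2,3} of the text, selected because its total correlation 11 is smaller than the 23 of {1,4}.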
3.1 Results
In this paragraph we visualise (see figure 15) the results concerning the recognition of some patterns normalised by the pre-processing and representing the hand-written characters {A, E, J}. Each normalised letter is given as input to the net trained with the group containing that same normalised letter; in fact only with this net can the correct recognition be obtained. The recall of the pattern has been performed by deterministic recovery.
4. Conclusions
In this work, the problem has been faced of how to build a pre-processing system that allows transformation-invariant recognition of patterns with a neural net. Such a pre-processing system must guarantee invariance to position, size and rotation of each pattern, since all recognition systems based on neural nets are very sensitive to these transformations. To obtain invariance, the object has first been centred inside a square window of dimensions 59x59 pixels, achieving invariance to translation; then the object has been normalised inside the same window, obtaining invariance to size; finally, the orientation angle has been computed by measuring the direction of maximum variance, to obtain rotation invariance. The pattern then requires a scaling to adapt itself to the dimensions of the neural net, equal to 13x13 pixels. The results have shown the success of this pre-processing system, which is able to extract a pattern autonomously from an image and subsequently, through various stages, to adapt it to the dimensions of any recognition system. The choice of the Hopfield net as a recogniser presents a limited memorisation capacity, which is partially resolved with the choice of minimally correlated groups; this kind of net, however, has the undeniable advantage of being able to recover patterns severely damaged by noise and by the elaborations that realise the invariance.
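The rotation-invariance step, computing the angle of the direction of maximum variance, can be sketched from the second-order central moments of the foreground pixels. The window size and the test pattern below are illustrative; this is our reconstruction of the step, not the authors' code.

```python
import numpy as np

# Orientation angle of a binary pattern from its second-order central
# moments: theta = 0.5 * atan2(2*mu11, mu20 - mu02) is the angle of the
# principal axis (direction of maximum variance).
def orientation_angle(img):
    """Angle (radians) of the principal axis of the binary pattern `img`."""
    ys, xs = np.nonzero(img)
    xc, yc = xs.mean(), ys.mean()          # centroid: translation invariance
    mu20 = ((xs - xc) ** 2).mean()
    mu02 = ((ys - yc) ** 2).mean()
    mu11 = ((xs - xc) * (ys - yc)).mean()
    return 0.5 * np.arctan2(2 * mu11, mu20 - mu02)

# A thin diagonal bar at 45 degrees inside a 59x59 window.
img = np.zeros((59, 59))
idx = np.arange(10, 50)
img[idx, idx] = 1
angle = np.degrees(orientation_angle(img))
```

Rotating the pattern by -angle then aligns its principal axis with a fixed direction, after which the scaling to 13x13 pixels can be applied.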
References
[1] C. Yüceer, K. Oflazer, "A rotation, scaling, and translation invariant pattern classification system", Pattern Recognition, vol. 26, no. 5, pp. 687-710, (1993).
[2] S.O. Belkasim, M. Shridhar and M. Ahmadi, "Pattern recognition with moment invariants: a comparative study and new results", Pattern Recognition, vol. 24, no. 12, pp. 1117-1138, (1991).
[3] W. Pratt, "Digital Image Processing", second edition, Wiley, New York, pp. 629-647, (1978).
[4] R. Jain, R. Kasturi, B.G. Schunck, "Machine Vision", McGraw-Hill, (1995).
[5] E. Pessa, "Reti neurali e processi cognitivi", Di Renzo Editore, (1993).
[6] J.J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities", Proceedings of the National Academy of Sciences USA, vol. 79, pp. 2554-2558, (1982).
[7] V. Cantoni, S. Levialdi, "La visione delle macchine", Tecniche Nuove, (1989).
[8] M. Fukumi, S. Omatu, Y. Nishikawa, "Rotation-invariant neural pattern recognition system estimating a rotation angle", IEEE Transactions on Neural Networks, vol. 8, no. 3, May 1997.
[9] Cho-Huak Teh, Roland T. Chin, "On image analysis by the methods of moments", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 10, no. 4, July (1988).
[10] Michael Reed Teague, "Image analysis via the general theory of moments", J. Optical Society of America, vol. 70, no. 8, August 1980.
[11] R.P.N. Rao, D.H. Ballard, "Localized receptive fields may mediate transformation-invariant recognition in the visual cortex", Technical Report 97.2, National Resource Laboratory for the Study of Brain and Behavior, Department of Computer Science, University of Rochester, May (1997).
Method for Automatic Karyotyping of Human Chromosomes Based on the Visual Attention System
J.F. Díez Higuera & F.J. Díaz Pernas
Department of Signal Theory, Communications and Telematics Engineering
School of Telecommunications Engineering, University of Valladolid
Campus Miguel Delibes. Camino del Cementerio, s/n. 47011 Valladolid, Spain
josdie@tel.uva.es
1 Introduction
Cytogenetic techniques have made it possible to identify each human chromosome by means of its pattern of bands. Technically, karyotyping is the process by which the chromosomes of a cell in division (see Fig. 1), properly stained, are identified and assigned to a certain group [27]. In Fig. 2, the typical aspect of a karyotype is shown. This process is very important, since the inspection of human chromosomes is an important and complex task used mainly in clinical diagnosis and biological research [18, 24]. This task is expensive, in time and money, and imprecise when it is carried out manually.
An expert in cytology can produce karyotypes with a small error, about 0.1% [19]. It is a tedious process, expensive in time: it is necessary to photograph the selected metaphase and to cut out the chromosomes in order to classify them in a karyotype. During the last 30 years there have been diverse attempts to automate some or all of the procedures involved in the analysis of chromosomes [8, 9, 16, 18, 20]. The automation of this task presents several difficulties, due partly to the deviation that the chromosomes present with respect to the standard pattern of bands, and also because the chromosomes have random orientation and can be bent, overlapped and/or touching each other.
All the efforts to automate the analysis of chromosomes have had limited success, with poor classification results compared with those obtained by a skilled cytotechnician [10, 24]. Some of the reasons for the poor performance are the inadequate use of the knowledge and experience of the expert, and the insufficient ability to make comparisons and/or eliminations between chromosomes of the same metaphase. In addition, the systems require the interaction of the operator to separate touching and/or overlapped chromosomes and to verify the classification results [24].
3 System description
The present architecture has been designed to solve the problem of the analysis and classification of human chromosomes, while trying at every moment to maintain as much biological plausibility as possible. As in the human visual system, the analysis made by the proposed architecture is divided into a first preattentive level and a later attentive level. Treisman [26] suggests two different processes in visual perception. A preattentive process acts as a fast tracking system and is related only to object detection. This process checks the global object features and codifies the useful elementary properties of the scene: colour, orientation, size or movement direction. At this point, an edge or contour can be discerned owing to the variation of a single property, but complex differences in combinations of properties are not detected. The different properties are coded in different feature maps, in different regions of the brain. The later attentive processing initially directs the attention to the specific features of an object, selecting and emphasizing the characteristics segregated in the independent maps. Also, a saliency map must exist that codifies only the key aspects of the image. This map receives inputs from all the feature maps, but it only abstracts those features that distinguish the object of attention from the background. In this way the saliency map selects the details that are essential for attentive recognition. Recognition takes place when the emergent positions in the different feature maps are associated.
The modular diagram of the architecture is shown in Fig. 3. As can be observed, there are three main blocks: dorsal module, ventral module and recognition module. This diagram is based on the recognition model proposed by Kosslyn [17]. The dorsal module is in charge of the preattentive processing of spatial features and of the integration of the feature maps: it determines the position, orientation and size of the object to identify. The ventral module is in charge of the attentive extraction of figure features in the attentional window selected by the dorsal module. Finally, the information generated by both systems is sent to the recognition module, where the recognition takes place.
Particularising to the case of the chromosomes, the dorsal module must isolate each one of the chromosomes, so that the ventral module individually analyzes each of them, and the recognition module, from the size information generated in the dorsal module and the information relative to the formal structure of the chromosome produced by the ventral module, identifies the chromosome and classifies it into one of the 24 groups.
Next, a functional description of each of the blocks is made, proposing a possible anatomical location for each of them, and the basic models used for their design.
The sequence of operation of the proposed architecture begins with the image coming from a CCD camera connected to a microscope. The visual memory corresponds retinotopically to an assembly of mapped areas, in which knowledge on the objects that can affect the processing is stored [21]. These areas constitute a functionally characterized structure and, therefore, they do not need to be anatomically contiguous. There are several visual areas that are components of this functional structure, including the areas V1, V2, V3 and V4 [6, 28].
Dorsal module
The block diagram of the dorsal module appears in Fig. 4. The image enters the dorsal module. This module processes, preattentively, spatial properties such as position, orientation and size [28]. In this module the luminance and orientation feature maps are extracted. Later, these maps are integrated by means of a relaxation process, in order to generate a saliency map. This module has been denominated dorsal because its operation is similar to that of the human dorsal system [17], an assembly of cerebral areas that extends from the occipital lobe to the parietal lobe. In the proposed architecture, this system is based on Grossberg's BCS model [15]. In this case, receptive fields greater than those in the ventral module are used, since the objective is to detect objects, regions of interest in the scene. The integration of the feature maps follows the model proposed by Milanese [22], using the contrast maps generated in the LGN (ON and OFF channels), and the texture maps generated by the BCS.
As a result of the preattentive processing, a binary map is generated, where the detection of emergent regions has been made. The region of greatest area is selected, and two signals are generated: one towards the attentional control, and another towards the recognition module. The first signal indicates to the ventral module the region that it has to process attentively. Biologically speaking, it corresponds to the attention shift. Computationally, it is reduced to a rotation (orientation information) and a zoom (size information). In this way the region of interest is placed in the attention window. This first processing allows the generation of the three types of invariance, since the region of interest can be moved, turned and resized in agreement with the information provided by the dorsal module. The second signal, towards the recognition module, will be the input to one of the recognition channels, the one that classifies the input pattern based on its size.
Fig. 5. Block diagram of the ventral module.
Ventral module
The block diagram of the ventral module appears in Fig. 5. The ventral module receives its name from the ventral system, which is an assembly of cerebral areas that extends from the occipital lobe to the inferior temporal lobe, IT, and whose cells typically respond to object properties such as form, colour and texture. The ventral module processes, in an attentive form, the region selected by the dorsal module. This processing generates the contour maps coming from two types of receptive fields (symmetrical and anti-symmetrical). Since the object is turned and resized before beginning the attentive processing, the orientation of the contour maps is fixed, and in the case of the chromosomes it corresponds to 0°. This map contains information referring to the banded pattern. Therefore, the ventral module sends two signals to the recognition module. These two signals (contour maps), along with the size information sent by the dorsal module, constitute the input pattern that the recognition module has to identify.
In the proposed architecture, the generation of features in the ventral module is made by means of a model based on Grossberg's BCS. In the case of the chromosomes, the receptive fields are small, adapted to the size of the chromosomes, with the objective of detecting the transitions between the bands of the chromosome.
Recognition Module
The block diagram of the recognition module appears in Fig. 6. The outputs from the ventral module (pattern of bands) and from the dorsal module (size) arrive at the recognition module, where they are compared with the stored information. This module corresponds to the associative memory described by Kosslyn, which seems to be implemented partly in the superior and posterior temporal lobes [17].
In the proposed architecture the recognition module is implemented by means of a multisensorial ART network, composed of 3 Fuzzy ARTMAP networks [3] and one ART1 network [2]. The first three networks receive information from the dorsal module (size information) and from the ventral module (information on band transitions). ART1 receives as input the output from the 3 Fuzzy ARTMAP networks and generates a single output that indicates the identification of the chromosome. There are other models with a similar philosophy, like the Fusion ART network proposed by Asfour et al. [1]. If the chromosome cannot be identified, it is set apart so that an expert can analyze it.
Once the object selected preattentively by the dorsal module, and analyzed attentively by the ventral module, has been identified (or set apart), the region corresponding to the analyzed object is inhibited in the saliency map. And so on, until all the interesting regions of the image have been analyzed. Each of the modules is described in detail next.
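The preattentive selection loop just described (take the largest emergent region, analyze it, inhibit it, repeat) can be sketched as follows. The flood-fill labeling and the toy saliency map are illustrative assumptions; the paper's saliency map is produced by the BCS-based dorsal module, not synthesized like this.

```python
import numpy as np
from collections import deque

# Repeatedly select the largest 4-connected region of the binary saliency
# map, "analyze" it, then inhibit it (inhibition of return).
def regions(mask):
    """Yield lists of (row, col) pixels, one list per 4-connected region."""
    mask = mask.astype(bool).copy()
    h, w = mask.shape
    for r in range(h):
        for c in range(w):
            if mask[r, c]:
                region, queue = [], deque([(r, c)])
                mask[r, c] = False
                while queue:
                    y, x = queue.popleft()
                    region.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx]:
                            mask[ny, nx] = False
                            queue.append((ny, nx))
                yield region

saliency = np.zeros((8, 8), int)       # toy binary saliency map
saliency[0:2, 0:3] = 1                 # region of 6 pixels
saliency[5:8, 5:8] = 1                 # region of 9 pixels (selected first)
order = []
remaining = saliency.copy()
while remaining.any():
    region = max(regions(remaining), key=len)   # attention shift
    order.append(len(region))                   # attentive analysis goes here
    for y, x in region:                         # inhibition of return
        remaining[y, x] = 0
```

The loop visits regions in decreasing order of area, matching the "region of greatest area is selected" rule of the dorsal module.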
Fig. 6. Block diagram of the recognition module.
5 Results
This section presents the results obtained when applying the proposed architecture to a chromosome database widely used as a benchmark for several classification methods. For the experimentation of the proposed model, three extensive databases of G-banded chromosomes are available, coming from Copenhagen, Edinburgh and Philadelphia. These data have been used in previous classification studies [7, 24, 25]. Each lot contains a great number of chromosomes extracted from images of cells in the metaphase stage of cellular division. The first database was created in Copenhagen in 1976-1978 and consists of 180 G-banded metaphasic cells with TRG, coming from blood samples. The second database was obtained in the MRC, Edinburgh, in 1984, and contains 125 G-banded blood cells prepared with the ASG method. The third database was obtained in the Jefferson Medical College, Philadelphia, in 1987, and contains 130 cells of chorionic villus coming from routine and crossed laboratory analyses with Giemsa staining.
It must be indicated that all the chromosomes come from normal human cells and therefore do not contain abnormalities. It is also necessary to emphasize that no cases of overlapping chromosomes are present, and very few chromosomes show a considerable curvature. This fact has not allowed extending the architecture to all the possible cases. Of all the chromosomes available in the database, those with an excessive curvature have been rejected, because a sufficient number of them is not available to carry out a trustworthy learning.
On the other hand, it is necessary to indicate that the results obtained in the classification are compared with those obtained by means of other methods that use the same database as a benchmark. Nevertheless, the information used is not the same in all the cases. In our tests the chromosome images have been used as source, extracting from them all the information necessary for the classification. The methods that participate in the comparative study use as source image features extracted by experts in chromosome recognition. These features are the one-dimensional band profile, the centromeric index, geometric parameters, etc. Therefore, in the comparison of results it is necessary to consider not only the goodness of the classification method, but also the capacity to extract the relevant information from the image. In the following table the results obtained by the proposed architecture are shown, in comparison with other classification models.
It is worth emphasizing that the results of the proposed architecture with respect to the units used in the learning are highly satisfactory, since the network recognizes them with a percentage of 100%. With respect to the units that the network has not learned, the percentage drops slightly, due to chromosomes with smaller contrast or excessive curvature.
6 Conclusions
The present paper describes a neuronal architecture for segmenting and recognizing textured monochrome images in general, and specially oriented towards the classification of human chromosomes.
The initial objective of trying to solve the problem of the analysis and classification of human chromosomes has been complemented with the maintenance of biological plausibility, which confers on the proposed architecture a general-purpose character. In each particular application a recognition module adapted to the scene and object features will be required. In the present paper, a specific recognition module for objects with a texture of parallel bands has been designed, as is the case of the chromosomes.
In relation to the biological plausibility, both modes of operation of the visual system, the preattentive mode and the attentive mode, have been modelled, as well as the mechanism of visual attention. In the preattentive mode, the processing is highly parallel and of low resolution, due to the extension of the receptive fields. Its field of action extends to the whole scene, and its function is to select the regions with salient features, ignoring the rest of the image. The attentive mode sequentially analyzes each of the regions selected by the preattentive module. In the analysis of chromosomes, the preattentive mode, represented in the dorsal module, analyzes the whole image and tries to isolate each of the chromosomes. Next, already processing in attentive mode, the individual identification of each chromosome is carried out.
Finally, the emergent properties that define the contribution of the present article are:
• the proposed architecture is a first approach to the behavior of the visual system in the processing and recognition of visual stimuli. In the development of the neuronal architecture all its stages have been justified, suggesting their location within the structure of the visual system;
• it proposes a transformation invariant to translation, rotation and scale, from the information provided by the dorsal module of preattentive segmentation, so that the objects are presented to the attentive module with the same orientation and size, since they adapt to the attentional window.
7 References
[1] Asfour, Y.R., G.A. Carpenter, S. Grossberg and G.W. Lesher. 1993a. Fusion ARTMAP: a neural network architecture for multi-channel data fusion and classification. Technical Report CAS/CNS-TR-96-006, Boston University, January 1993.
[2] Carpenter, G.A. and S. Grossberg. 1987a. A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 37: 54-115.
[3] Carpenter, G.A., S. Grossberg, and J.H. Reynolds. 1991. ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network. Neural Networks, vol. 4, no. 5, pp. 565-588.
[4] Cohen, M.A. and S. Grossberg. 1984. Neural dynamics of brightness perception: Features, boundaries, diffusion, and resonance. Perception and Psychophysics, vol. 36, pp. 428-456.
[5] Desimone, R. 1992. Neural circuits for visual attention in the primate brain. In G.A. Carpenter and S. Grossberg, editors, Neural Networks for Vision and Image Processing. Cambridge, MA: MIT Press, pages 343-364.
Abstract
In this paper we propose an adaptive modification of the output function of the CNN (Cellular Neural Network) model to perform contrast enhancement of an image. First, we define the output function to operate in the interval [0,1] with variable saturation limits in order to adapt the behaviour of the network to the grey levels in the neighbourhood of every cell. Then we propose a three-layer CNN where the mean value of the neighbourhood of a pixel is obtained by the first layer, and the calculation of the mean deviation of the pixel values from the mean in the same neighbourhood is carried out by the second one. These parameters are control signals that define the saturation limits of the piecewise linear output function of each cell in the third layer, the output of the network, adapting it to the neighbourhood of each cell. Some examples are presented to demonstrate the capabilities of the model.
1.- Introduction
The use of neural networks as image processing structures is a growing research field in the neural network community: just as brain capabilities justified that the first applications of these structures were devoted to pattern learning and recognition, the visual processing carried out by the visual neural system of living beings justifies the application of neural networks to such tasks. However, the existence of a quite developed theory in image processing, which constitutes a whole scientific field, is the reason why the attention of the neural network community to these tasks has not been very intense, apart from being rather recent. It has been the need to look for new applications for neural network models that has originated this growth. Furthermore, as many of the recognition tasks neural networks are devoted to are performed with optical patterns, it is natural to pay some attention to pure image processing. The possibility of having a pattern recognition system with a previous image processing unit enhancing the image quality, both developed with neural networks, is a very attractive idea.
On the other hand, the existence of a highly developed image processing theory, which at first glance could seem a handicap, can turn out to be helpful for neural image processing research, since the study of techniques of proved efficiency can help to develop neural models to be used as such structures. If one can obtain with neural networks results analogous to those provided by the standard image processing tools, their potentiality as image processing systems would be justified.
A lot of works have appeared in which neural networks are used to process images with the goal of helping in pattern and shape recognition tasks. However, the model that seems best adapted to these tasks is one whose main aim was to obtain an easy VLSI implementation: Cellular Neural Networks (CNN). They were originally proposed by Chua and Yang [1][2] as a unification of some aspects of Neural Networks and Cellular Automata. Their neuron model is very similar to the Hopfield one, but with the difference that each cell is only connected with those surrounding it. These connections are the same for every cell, defining a repetitive structure usually named a "cloning template". This repetitive synaptic scheme is the main feature of the model, providing a local processing of the input signal that makes it specially appropriate for use in image processing, which has become one of the main applications of CNNs [3][4][5]. For this reason, it seems reasonable to use them as an image processing system, leaving aside their possible VLSI implementation.
To describe the model we will consider an image u as the network input and, assuming that both input and output have the structure of an m×n matrix, the equations that govern the dynamic behaviour of the neural activity v(t) are:
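The equations themselves were lost in the reproduction of this paper; for reference, the standard Chua-Yang CNN state equation they describe has the following form (a reconstruction, not the original display):

```latex
C \,\frac{dv_{ij}(t)}{dt} = -\frac{1}{R}\, v_{ij}(t)
  + \sum_{C(k,l)\in N_r(i,j)} A(i,j;k,l)\, y_{kl}(t)
  + \sum_{C(k,l)\in N_r(i,j)} B(i,j;k,l)\, u_{kl} + I
```

with y_{kl}(t) the cell outputs, u_{kl} the input image pixels, N_r(i,j) the r-neighbourhood of cell (i,j), and I a bias term.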
where A(i,j;k,l) and B(i,j;k,l) represent, respectively, the synaptic weights of each neuron with other cells and with pixels in the input image. They are the same for each neuron, and are what we have defined as "cloning templates" that define the connectivity of each neuron with its neighbourhood:
y_{ij} = \tfrac{1}{2} ( |v_{ij} + 1| - |v_{ij} - 1| )   (3)
which represents a piecewise linear function (Fig. 1). The output function may also be a radial basis function or a sigmoid one [6]. For simulation, the dynamics are discretized as
v_{ij}(t+1) = v_{ij}(t) + h \, \dot{v}_{ij}(t)   (4)
where h is a constant that defines the time step during the simulation.
2.- Neural Transfer Function
As stated previously, the piecewise linear function (3) is used in the CNN as the cell output, so its value will be within the interval [-1,1].
Figure 2. (a) Adaptive piecewise linear function. (b) Expansion of the grey interval.
The new saturation limits C and D allow the definition of different gains that can modify the relation between the neural activity and the neuron output. The equation so obtained has the form:
y_{ij} = \frac{ |v_{ij} - C| - |v_{ij} - D| + D - C }{ 2\,(D - C) }   (5)
This definition of the output function allows it to adapt to the features of the task to be performed. So, if C=D a binary response is obtained. At the opposite extreme, if C=0 and D=1 the neuron activity is provided unchanged as the neuron output. Between these extreme definitions a great variety of possibilities appears. Moreover, fixing the difference between C and D, which implies a fixed slope, their values may vary between 0 and 1 so that the neural response is adapted to the cell input, avoiding its saturation. So, defining the values of the elements of A(i,j;k,l) and B(i,j;k,l) so that the cell activity is always inside the interval [0,1], the neural response may be controlled with C and D, and the maximum possible resolution in the neural output can be obtained. The definition of these parameters may be carried out before running the simulation, adapting the output function to the specific task determined by A(i,j;k,l) and B(i,j;k,l).
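The adaptive output function can be written down directly; the expression below is our reconstruction of the mapping just described (it maps [C, D] linearly onto [0, 1] and saturates outside), with illustrative parameter values.

```python
import numpy as np

# Adaptive piecewise linear output: activity v is mapped linearly from
# [C, D] onto [0, 1], saturating at 0 below C and at 1 above D.
def output(v, C, D):
    return (np.abs(v - C) - np.abs(v - D) + (D - C)) / (2 * (D - C))

# Illustrative activities and limits (C = 0.25, D = 0.75 are assumptions).
v = np.array([-0.5, 0.0, 0.25, 0.5, 0.75, 1.0, 1.5])
y = output(v, C=0.25, D=0.75)
```

As a check of the two extreme cases in the text: with C=0 and D=1 the function returns the activity unchanged on [0, 1], and activities outside [C, D] saturate to 0 or 1.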
However, there is another possibility: to look for an automatic adaptation of these parameters to the input neighbourhood of each cell, so that, instead of being defined as fixed values, they may vary according to the environment in which every neuron is placed. Just as A(i,j;k,l) and B(i,j;k,l) determine the task the network will perform and are defined before running the simulation, C and D may be adapted to the input neighbourhood during the simulation to obtain an optimal response of the network.
As the adaptation capacity of the output function must be obtained by varying the values of C and D while running the simulation of the neural network, it is necessary to obtain them with a network different from the one performing the image processing. In addition, this structure should be a CNN, to preserve the homogeneity of the whole system. So, the multilayer structure proposed in the original model [1] may be used, defining two hidden layers to provide the desired values for those parameters.
The calculation of C and D as elements that adapt to the features of the input image needs to be carried out through the definition of larger synaptic masks than those commonly used in CNNs, because, as they are assumed to represent a sort of "mean values" of the pixels surrounding each cell, they must be obtained statistically, and then the number of pixels considered must be large enough to obtain meaningful values.
So the definition of the local adaptation of the output function may provide an image enhancement, since small differences in the grey level in a certain zone of the image may be increased by giving C and D the appropriate values to obtain a relation between inputs and outputs like that represented in Fig. 2(b), where a small grey interval is expanded into a bigger one. Differences in the first interval are thus augmented when transformed into the second. Therefore, assuming that C and D are the extreme grey values in a neuron neighbourhood, the output function will expand the original grey interval into another ranging from 0 to 1, so that the contrast of that zone of the image will be increased, allowing details that are not clear enough to be easily detected now. In this way, a generalized contrast enhancement of the image is obtained, as every neuron performs the same transformation.
Intuitively, the easiest way to obtain C and D is to define them as the minimum and the maximum values in the input neighbourhood of each neuron. However, this definition would need the use of two new relations between the inputs: the maximum and the minimum of a set of values. Some neural models have been defined where the relations between inputs are different from their weighted sum, usually the sum of their weighted products [7] or logic functions [8]. Therefore, the use of the maximum and minimum functions to obtain C and D would be feasible. However, their utilisation is not the best solution, since these functions are very sensitive to noise: as noise usually appears as an extreme value (black or white), it would be taken as the maximum or minimum in the neighbourhood, defining a larger grey interval for the input and therefore decreasing the model performance.
Figure 3. Plot of function (6) with α=1. The continuous line is obtained for ε=0. The dashed line plots the function for ε=0.2.
To avoid the noise influence we propose the use of an input function that is a particular case of the extended absolute value proposed in [9]:
This function provides a new type of input relation that allows us to obtain statistically the values of parameters C and D for the output function of every cell, so that the influence of noise is highly reduced.
They may be obtained in the following way. The two hidden layers have a neural
activity function where A(i,j;k,l) and I are assumed to be zero, while B(i,j;k,l) will define
the layer behaviour. The output function is (5) with C=0 and D=1. The first hidden
layer, which we will call L_C, will provide the mean value of the neighbourhood of every
pixel if we assume that every element in B(i,j;k,l) has a value of 1/r² when B(i,j;k,l) has
dimension r×r. So the output of every cell will have the form:

m_ij = Σ_{C(k,l) ∈ N_r(i,j)} u_kl / r²   (8)
This value is considered the central point of the [C_ij, D_ij] interval. The second
hidden layer, named L_D, will calculate the mean deviation from m_ij in the same
neighbourhood. So, using expression (7), we have:

d_ij = Σ_{C(k,l) ∈ N_r(i,j)} abs_ε(u_kl − m_ij) / r²   (9)

C_ij = m_ij − d_ij    D_ij = m_ij + d_ij   (10)

However, as the average value of the deviation is relatively small, the output
function slope could be too high, and some grey levels could fall into the saturation
limits. To avoid this effect d_ij is multiplied by δ, a constant greater than unity.
Therefore, equations (10) take the form:

C_ij = m_ij − δ·d_ij    D_ij = m_ij + δ·d_ij   (11)
and the size of the intervals [C_ij, D_ij] is widened. The value of δ must be fitted by
simulation.
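The pipeline of equations (8)-(11) can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' implementation: the helper `mean_filter`, the window size `r`, and the clipped-linear form assumed for output function (5) are our own choices, and the deviation in (9) is computed against the local mean at every pixel rather than a fixed m_ij per neighbourhood.

```python
import numpy as np

def mean_filter(u, r):
    """r x r mean filter with edge padding (hypothetical helper)."""
    pad = r // 2
    up = np.pad(u, pad, mode="edge")
    out = np.zeros_like(u, dtype=float)
    for di in range(r):
        for dj in range(r):
            out += up[di:di + u.shape[0], dj:dj + u.shape[1]]
    return out / (r * r)

def adaptive_contrast(u, r=7, delta=2.0, eps=0.1):
    m = mean_filter(u, r)                                  # eq (8): local mean m_ij
    # eq (9) with the regularized absolute value (12); approximation: m varies over the window
    d = mean_filter(np.sqrt((u - m) ** 2 + eps ** 2), r)
    C, D = m - delta * d, m + delta * d                    # eq (11)
    # assumed form of output function (5): linear ramp from C to D, clipped to [0, 1]
    return np.clip((u - C) / (D - C), 0.0, 1.0)

out = adaptive_contrast(np.random.rand(32, 32))
```

Note that d ≥ ε > 0, so the interval [C, D] never collapses and the slope of the ramp stays bounded, which is exactly the role of ε discussed below.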
Figure 4. Multilayer structure.
Now they represent the two limits of saturation that define the output function of
every neuron in the processing layer and will be provided as control signals. The neural
activity function is represented by equation (4), where A(i,j;k,l) and I are zero and
B(i,j;k,l) is 1×1 with this element equal to 1.
So, the system structure is as follows (Fig. 4):
Therefore, we may say that the network performs the processing of each pixel in
the input image, assigning it a new value that will depend on its neighbourhood. The
new values form the output image. As connections in the three layers are purely
feedforward, the system stability is guaranteed.
Since the described structure will increase the contrast of the image, it can also
make excessively noticeable details that were already clear enough. To compensate for
this effect, the processing layer can be provided with a smoothing ability that
diminishes it without damaging the overall capability of the system. This effect can be
obtained if B(i,j;k,l) is defined as a mean filter analogous to that used in layer L_C but
of smaller size. A 3×3 dimension will be enough.
An added problem that may appear is that this structure can also produce
details that do not really exist. In areas with small variations in their grey levels, these
variations may have no special meaning: they could be just small faults or noise present
in the image, but the net will amplify them. So the network performance may be
degraded by noise amplification or by new noise generation. This unwanted effect can
be even more harmful when the grey level in an area of the image is nearly
homogeneous, yielding a near-zero mean deviation. Then C_ij and D_ij will be close
together and the corresponding slope will be too high. Therefore, small differences in
the grey levels will be stretched to the maximum range (black and white), presenting
as important details things that are nothing but noise.
The problem may be solved by imposing a lower bound on the mean
deviation value computed by layer L_D, while modifying the obtained values as little
as possible when there is a meaningful variation in the grey scale. This effect may
be obtained from (6) assuming α=1 and ε>0:

abs_ε(u_kl − m_ij) = √((u_kl − m_ij)² + ε²) ,  ε > 0   (12)
4.- Simulation
We present three different pictures in Figures 5, 6 and 7 to test the behaviour of the
proposed model. In order to reduce the computation time we use a more simplified
expression for (4). As the model has no feedback, A(i,j;k,l)=0, the output only depends
on the fixed values of the input image and the neural activity may be obtained in one
time step. So, assuming R = C = h = 1 and I = 0 in (4), the neural activity function is now:

v_ij = Σ_{C(k,l) ∈ N_r(i,j)} B(i,j;k,l)·u_kl   (13)

The neural output is provided by (5), and the parameters controlling it are obtained as
mentioned above with δ=2.
The original images are presented in (a). They are processed in (b), with formula
(10) defining the input interval of the output function. A general contrast enhancement
appears, and details that were hardly perceived are clearly visible now. Nevertheless,
many extreme values (black or white) also appear because, as the interval [C_ij, D_ij] is
statistically obtained, some of the pixel values inside each neighbourhood may be
outside this interval, saturating the output function. To compensate for this effect a 3×3
mean filter is added in (c) to obtain a smoothing of the images. As we can see, many
extreme values have disappeared, but a blurred image is also obtained. On the other hand,
we can also see in Fig. 5 (b) and Fig. 7 (b) that those areas with a uniform grey level in
the original image present details that actually do not exist. They are produced by the
amplification of small differences in the pixel values caused by a very high value of
the slope of the output function, as previously mentioned. This effect appears in the
girl's cheek and in her hat in Fig. 5 (c), and at the bottom of Fig. 7 (c). To avoid it,
function (11), with ε=0.1, is used instead of (10) to obtain C_ij and D_ij. We can see in
Fig. 5 (d) and Fig. 7 (d) that these artefacts have been removed, while the presence of
extreme values has also been decreased in the three figures. So a contrast enhancement
has been obtained only in those areas where it was necessary.
5.- Conclusions
We have proposed an adaptive model of the CNN output function that provides a
contrast enhancement capability in those areas of an image where details are not clear
enough. This result was obtained only with the use of the adaptive output function. A
small lowpass filter was used to provide a slight softening to compensate for an
excessive contrast obtained in some zones of the output image. The use of the
adaptive function with different types of filters will therefore probably improve their
performance. It could be interesting to study the effect of endowing neural networks
for image processing with an adaptive output function in order to increase their
performance and flexibility.
Figure 5.- (a)-(d).
Figure 6.- (a)-(d).
Figure 7.- (a)-(d).
References
[1] L. O. Chua, L. Yang. "Cellular Neural Networks: Theory". IEEE Trans. on Circuits
and Systems. Vol. 35, No. 10. October 1988. pp. 1257-1272.
[2] L. O. Chua, L. Yang. "Cellular Neural Networks: Applications". IEEE Trans. on
Circuits and Systems. Vol. 35, No. 10. October 1988. pp. 1273-1290.
[3] T. Matsumoto, L. O. Chua, R. Furukawa. "CNN Cloning Template: Hole-Filler".
IEEE Trans. on Circuits and Systems. Vol. 37, No. 5. May 1990. pp. 635-638.
[4] T. Matsumoto, L. O. Chua, T. Yokohama. "Image Thinning with a Cellular Neural
Network". IEEE Trans. on Circuits and Systems. Vol. 37, No. 5. May 1990. pp. 638-640.
[5] B. E. Shi, T. Roska, L. O. Chua. "Design of Linear Cellular Neural Networks for
Motion Sensitive Filtering". IEEE Trans. on Circuits and Systems II: Analog and
Digital Signal Processing. Vol. 40, No. 5. May 1993. pp. 320-331.
[6] L. O. Chua, T. Roska. "The CNN Paradigm". IEEE Trans. on Circuits and Systems
I: Fundamental Theory and Applications. Vol. 40, No. 3. March 1993. pp. 147-156.
[7] E. B. Kosmatopoulos, M. M. Polycarpou, M. A. Christodoulou, P. A. Ioannou.
"High-Order Neural Network Structures for Identification of Dynamical Systems". IEEE
Transactions on Neural Networks, Vol. 6, No. 2, pp. 422-431.
[8] F. J. López Aligué, M. I. Acevedo Sotoca, M. A. Jaramillo Morán. "A High Order
Neural Model". Lecture Notes in Computer Science, No. 686, "New Trends in Neural
Computation", pp. 108-113, Springer-Verlag, Berlin. June 1993.
[9] R. Dogaru, K. R. Crounse, L. O. Chua. "An Extended Class of Synaptic Operators
with Applications for Efficient VLSI Implementation of Cellular Neural Networks".
IEEE Transactions on Circuits and Systems, Vol. 45, No. 7, July 1998, pp. 745-755.
Application of ANN Techniques to Automated
Identification of Bovine Livestock
1. Introduction
For the control and conservation of the purity of certain breeds of bovine livestock,
one of the fundamental tasks is the morphological evaluation of the animals. This
process consists of scoring a series of very well defined characteristics [10, 11] in the
morphology of the animal, such as the head or the back and loins, and forming a final
score from a weighted sum of these partial scores. Evidently the process must be
carried out by people with great experience in this task, so the number of qualified
people is very small. This, together with the high degree of subjectivity involved in
the whole process, leads one to think of the utility of a semiautomatic system of
morphological evaluation based on images of the animal.
In the publications on the topic it is suggested that most of the morphological
information of the animals involved in the process can be obtained by analysing their
different profiles. In the present work we try to corroborate this statement by studying
the similarities between the profiles of different images taken of the same animal and
the similarities between the profiles of animals of the same breed, as well as the
degree of difference between animals of different breeds. To carry out this study we
developed an image-based classifier with a conventional structure [5] that takes lateral
images of cows as inputs (i.e. in profile) and that, after a first processing stage for the
extraction and normalization of contours, processes them with a neural classifier
which associates the image with one of the animals that it has previously learned, and
also relates it to one of the breeds under study. That is, we describe a classifier that
identifies the individual animal as well as classifying it by breed, using only the
information contained in the profile.
As noted above, our classification system consists of two clearly differentiated parts.
In the first, using a lateral image of a cow, we extract its contour and represent it in an
appropriate way for use as input to the neural classifier. In this phase we have mainly
used deformable model techniques [6], in particular those known as active shape
models [1, 2, 3], combined with strategies for within-image searching.
For the neural classifier we have used a type of network known as SIMIL [7, 8, 9],
a model which has been developed in our laboratory and that we have already applied
with success to other classification tasks. In the sections that follow we describe these
two parts in detail.
which the alignment of equivalent points is the most appropriate. Each contour will
then be represented mathematically by a vector x such that

x = x̄ + P·b   (1)

where x̄ is the average shape, P is the matrix of eigenvectors of the covariance
matrix, and b is a vector containing the weights for each eigenvector; b is what
properly defines the contour in our description. Considering only a few eigenvectors
corresponding to the largest eigenvalues of the covariance matrix, we will be able to
describe practically all the variations that take place in the training set.
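In code, the point-distribution model of equation (1) reduces to an eigendecomposition of the shape covariance matrix. The sketch below uses random stand-in "shapes" (20 training vectors of 73 (x, y) landmarks and 12 retained modes, the figures quoted in this section); it illustrates the technique and is not the authors' software.

```python
import numpy as np

rng = np.random.default_rng(0)
shapes = rng.normal(size=(20, 146))        # stand-in for 20 aligned 73-point contours
x_mean = shapes.mean(axis=0)               # average shape
cov = np.cov(shapes - x_mean, rowvar=False)
eigval, eigvec = np.linalg.eigh(cov)       # eigh returns eigenvalues in ascending order
P = eigvec[:, ::-1][:, :12]                # 12 leading eigenvectors (modes of variation)

b = P.T @ (shapes[0] - x_mean)             # weights describing one contour
x_rec = x_mean + P @ b                     # eq (1): x = x_mean + P.b
```

Projecting onto the leading modes can only shrink the deviation from the mean shape, which is why a handful of eigenvectors suffices when the training shapes are strongly correlated.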
Fig. 1. In figure (a) a representative example is shown of the input images to our classifier. The
model of the cow contour, formed by 73 points, the most representative of which are numbered,
is plotted in figure (b).
In our specific case we have used a description of the cow's contour, not including
the limbs, that consists of 73 points (fig. 1,b), and the model was constructed using a
training set of 20 photographs, distributed evenly over the considered breeds, and
where the animals are in similar poses, since our purpose is to study the variations due
to different animals and not those due to the pose. Once the principal component
analysis was made, we only needed to use 12 eigenvalues to represent practically the
entirety of the possible variations. Hence, each normalized contour is perfectly
defined by a vector b of 12 components, to which we have to add a translation t, a
rotation θ and a scale s to transform it to the space of the image.
Fig. 2. Representation of the final edge determined from the colour information, compared with
the edge obtained from the coordinate L*, in which only the luminance has been used.
To apply this set of techniques, a series of computer programs has been developed
which allows us to automate all the tasks without user intervention. The photographs
are taken in the field, and then read into a database which is consulted by the
programs which have to process them. All these computer applications were
developed in C except the one dedicated to the PDM and ASM that uses OCTAVE
due to the great amount of matrix calculations involved.
Once an image has been preprocessed, we have the information concerning its
contour as a set of 12 parameters b_i that form the vector b. However this space does
not seem well suited for the classification process, because various vectors b can exist
that give very similar contours but are quite far from each other. Accordingly, as our
objective is to classify the contours, it seems more appropriate to use the contours
themselves as inputs to the classifier. To do that, a bitmap representing the contour
can be generated from the vector b and a transformation that must be the same for
all the cases, so that the contours are comparable independently of the position or the
size of the cow in the original image.
To classify this kind of input we used a type of neural network known as SIMIL
[7, 8, 9], which has presented very good results in classification problems similar to
the present [9]. This network was conceived from its origin to be integrated into a
classification system. In the learning process it uses a series of prototypes of the
classes into which we will classify the group of inputs. This learning process is based
on the direct assignation of prototype values to the weights of neurons, which makes
it fast and effective. Also, a neural function that detects similarities between its inputs
and the information in its weights is used, based on ideas similar to those of
Kohonen's self-organizing maps. All this is integrated into a feedforward network,
which permits high performance in classification problems.
As output of the network for each input we obtain the membership rates d_p to each
one of the p classes that the network learned. To offer a final result we introduced
another element into the process, which we have denominated the decision-maker [4, 7],
and whose purpose is, given the membership rates, to indicate either the class
whose membership rate is the largest, or a state of indecision. This decision-maker is
based on two rules:

• For the class with the largest membership rate to be the final result, this rate should
surpass a minimum value. If we define

d_m = max{D}  where  D = {d_1, ..., d_p}   (2)

one must have d_m > V_m, where V_m is one of the decision-maker parameters.

• Also, we should require the network to select "sufficiently" one of the classes, i.e.
that the difference between the largest of the membership rates and the second, d_s,
is sufficiently large. To quantify this, if we write

s = 1 − d_s / d_m   (4)

which rises as the distance between d_s and d_m increases, then class m must satisfy
s > V_s to be the output of the decision-maker, where V_s is the other parameter that
defines that processing block.
It should be noticed that d_m and s can take completely different values, i.e., we can
find cases of maximum separation with very small values of d_m and, vice versa, cases
of minimum separation with high values of d_m. For this reason, having established the
classification function, we were interested in defining a parameter that measures the
quality of the classification for the cases in which classification is possible. That
parameter is called the classification index and we define it as:

I = (d_m·d_s − V_m·V_s) / (1 − V_m·V_s)   (5)

As we can see, in the worst classification case, d_m = V_m and d_s = V_s, we will have
I = 0. In the rest of the cases the variations of the d_m and d_s values have the same
importance.
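A direct transcription of the two decision rules and the classification index (5) might look as follows; the function name and the return convention are our own, but the thresholds V_m = 0.1 and V_s = 0.05 are the values used later in this paper.

```python
def decide(d, v_m=0.1, v_s=0.05):
    """Membership rates in -> (class index, classification index) out,
    or (None, None) when the decision-maker reports indecision."""
    order = sorted(range(len(d)), key=lambda p: d[p], reverse=True)
    d_m, d_s = d[order[0]], d[order[1]]
    s = 1.0 - d_s / d_m                     # eq (4): separation of the two largest rates
    if d_m <= v_m or s <= v_s:              # both decision rules must hold
        return None, None
    i = (d_m * d_s - v_m * v_s) / (1.0 - v_m * v_s)   # eq (5): classification index
    return order[0], i

print(decide([0.200, 0.994, 0.560]))        # reproduces row dx501_4 of table 2: I ~ 0.55
```

With the top two rates at 0.638 and 0.630 (row dx501_3 of table 2) the separation s falls below V_s and the function reports indecision, matching the table.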
Regarding the application of the entire technique to our problem, we must
comment on the following points:

• As inputs to the neural network we used bitmaps of 400x300 pixels, generated with
the contours obtained in preprocessing. We used three contours with a thickness of
5 pixels, centred on the same point and with different sizes (see fig. 3,b), to
minimize the number of neurons that do not receive information.

• The SIMIL network we used is composed of a single processing layer with
400x300 neurons and a random feedforward connection scheme, with 400 inputs
per neuron in a neighbourhood of radius 200. As output function of the neurons we
used a sigmoid with parameters 1, 0.4 and 0.1. To simulate this network we used a
parallel system with 6 processors (3 Pentium 200, 2 Pentium 233 and one Pentium
Pro), running the large neural network simulator NeuSim [4], which has been
developed in our laboratory. With this system we obtained recognition times of
approximately 3 to 4 seconds per image.
(a) (b)
Fig. 3. One of the photographs used, with its snake fitting the contour, is shown in figure (a),
and next to it we see the real input to our neural network (b). We have inverted the image to
facilitate the presentation.
With respect to the decision-maker, taking into account the trial simulations we
had made, it seemed reasonable to require, in order to establish a definitive
classification, that a minimum of 10% of the neurons associated to the
corresponding prototype are activated (V_m = 0.1), and also that a minimum separation
of 5% exists between the activated neurons of the chosen prototype and the second
(i.e. that d_s is smaller than 95% of d_m), i.e. V_s = 0.05.
3. Results
The system described in the previous section was tested with a total of 95 pictures
corresponding to 45 different animals, distributed among the 5 breeds considered in
this study (Retinta, Blanca Cacereña, Morucha, Limusín and Avileña). Once
the photographs had been preprocessed and the input bitmaps for the classifier
obtained, we ran the neural network learning process, using one contour for each of
the animals. After learning, we had the network recognize the 95 pictures, classifying
them into the 46 corresponding classes (considering indecision as a separate class).
For the classification into breeds, these were considered to be superclasses formed
by those classes corresponding to animals of one specific breed. Hence, the animal
obtained as a result of the classification also determines the breed.
Given the number of images processed, it is impossible to describe all the results
obtained in the classification. In table 1 an overall summary of those results is shown,
corresponding to the classification into animals and into breeds. Also, table 2 presents
an example of results for a specific animal of which four photographs (fig. 4) were
used. In that table separation and classification indexes are presented, as well as the
two largest membership rates, for both classification processes.
Table 1. Summary of the final results, showing successes, mistakes and indecisions, of our
identification and classification system.
The data in table 2 are quite representative of the cases that may occur. As one can
see, in the classification into animals there is an erroneous assignment (with quite a
low classification index), in which the system has related the input image with
another animal of the same breed. There was also a case of indecision due to the low
separation index between the first and the second membership rates, although, as one
can see, the classification would have been correct. One should also notice that,
although the classification into animals for these images is not very good, all are
correctly assigned to breeds.
Table 2. Classification results, by animals and by breeds, for the 4 photographs of cow dx501.
The separation (s) and classification (I) indices are shown.

Classification by animals:
  dx501_1  s=0.18  I=0.09  WRONG [1st: dx303 (0.333), 2nd: 01416 (0.273)]
  dx501_2  s=0.06  I=0.18  RIGHT [1st: dx501 (0.440), 2nd: gb516 (0.411)]
  dx501_3  s=0.01  I=0.40  IND.  [1st: dx501 (0.638), 2nd: dx201 (0.630)]
  dx501_4  s=0.44  I=0.55  RIGHT [1st: dx501 (0.994), 2nd: 014005 (0.560)]

Classification by breeds:
  dx501_1  s=0.18  I=0.09  RIGHT [1st: Blanca (0.333), 2nd: Avileña (0.273)]
  dx501_2  s=0.06  I=0.18  RIGHT [1st: Blanca (0.440), 2nd: Retinta (0.411)]
  dx501_3  s=0.20  I=0.32  RIGHT [1st: Blanca (0.638), 2nd: Avileña (0.511)]
  dx501_4  s=0.44  I=0.55  RIGHT [1st: Blanca (0.994), 2nd: Avileña (0.560)]
In light of the results described in the previous section, we can state that the
classification results are excellent, especially in the case of the classification into
breeds, where the indecisions are reduced to the minimum and the mistakes are very
few. It must be emphasized that, when the system makes a mistake in classification
into animals, the wrong choice is usually an animal of the same breed. This supports
one of our premises: the fact that a great part of the morphological characteristics of a
breed is reflected in the contour.
Fig. 4. The 4 photographs of cow dx501 with the fitted snake are presented, corresponding to
the data in table 2.
Acknowledgements

This work has been supported in part by the Junta de Extremadura (project
PR19606D007, and the doctoral scholarship of D. Horacio M. González Velasco) and
the CICYT (project TIC 97-0268).

We also wish to express our gratitude to the personnel of the Centro de Selección y
Reproducción Animal (CENSYRA) for the technical help with everything related to
cattle, and for aiding us in the process of taking photographs.
References
1. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active Shape Models - Their Training
and Application. Computer Vision and Image Understanding, vol 61, no. 1, pp 38-59. Jan. 1995.
2. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Training Models of Shape from Sets of
Examples. Proc. British Machine Vision Conference, pp 9-18. 1992.
3. Cootes, T.F., Taylor, C.J.: Active Shape Models - Smart Snakes. Proc. British Machine
Vision Conference, pp 266-275. 1992.
4. García, C.J.: Modelado y Simulación de Grandes Redes Neuronales. Doctoral Thesis,
Universidad de Extremadura. 1998.
5. Jain, A.K.: Fundamentals of Digital Image Processing. Prentice Hall, 1989.
6. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active Contour Models. International
Journal of Computer Vision, vol 1, no. 4, pp 321-331. 1988.
7. López, F.J., González, H.M., García, C.J., Macías, M.: SIMIL: Modelo neuronal para
clasificación de patrones. Proc. Conferencia de la Asociación Española para la Inteligencia
Artificial, pp 187-196. 1997.
8. López, F.J., Macías, M., Acevedo, I., González, H.M., García, C.J.: Red neuro-fuzzy para
clasificación de patrones. Proc. Congreso Español sobre Tecnologías y Lógica Fuzzy, pp
225-232. 1998.
9. Macías, M.: Diseño y realización de un Neurocomputador Multi-CPU. Doctoral Thesis,
Universidad de Extremadura, 1997.
10. Reglamentación específica del Libro Genealógico y de Comprobación de Rendimientos de
la raza bovina Retinta. Boletín Oficial del Estado, España. 05/04/1977.
11. Sánchez-Belda, A.: Razas Bovinas Españolas. Manual Técnico, Ministerio de Agricultura,
Pesca y Alimentación, España. 1984.
An Investigation into Cellular Neural Networks
Internal Dynamics Applied to Image Processing

David Monnin 1,2, Lionel Merlat 1, Axel Köneke 1, and Jeanny Hérault 2
1 Introduction
Cellular Neural Networks (CNNs) [1] are lattices of analog locally connected
cells conceived for an implementation in VLSI technology and perfectly suitable
for analog image processing. The operation of a cell (i, j) is described by the
following dimensionless equations:

dx_i,j/dt = −x_i,j + A ⊗ y_i,j + B ⊗ u_i,j + I   (1)

y_i,j = (1/2)·(|x_i,j + 1| − |x_i,j − 1|)   (2)

where ⊗ denotes a two-dimensional discrete spatial convolution such that
A ⊗ y_i,j = Σ_{k,l ∈ N(i,j)} A_k,l · y_i+k,j+l, for k and l in the neighborhood N(i,j) of
cell (i,j), which is generally restricted to the 8-connected cells. A and B are the
so-called feedback and feedforward weighting matrices, and I is the cell bias.
u_i,j, x_i,j and y_i,j are the input, internal state and output of a cell, respectively.
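For readers who want to experiment, equations (1)-(2) discretize readily with an explicit Euler step. This is a toy sketch with choices of our own (zero-padded borders, step size dt), not a substitute for the analog dynamics:

```python
import numpy as np

def cnn_step(x, u, A, B, I, dt=0.05):
    y = 0.5 * (np.abs(x + 1) - np.abs(x - 1))            # eq (2): saturated output
    def conv(T, z):                                      # T (x) z over the 8-neighborhood
        zp = np.pad(z, 1)                                # zero-padded borders
        return sum(T[k + 1, l + 1] * zp[1 + k:1 + k + z.shape[0],
                                        1 + l:1 + l + z.shape[1]]
                   for k in (-1, 0, 1) for l in (-1, 0, 1))
    return x + dt * (-x + conv(A, y) + conv(B, u) + I)   # eq (1), explicit Euler

# example: an uncoupled averaging template run to (near) steady state
u = np.random.rand(16, 16) * 2 - 1
A, B, I = np.zeros((3, 3)), np.full((3, 3), 1 / 9.0), 0.0
x = np.zeros_like(u)
for _ in range(400):
    x = cnn_step(x, u, A, B, I)
```

With A = 0 the iteration converges geometrically to the steady state B ⊗ u + I, which is the uncoupled case discussed below.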
The same set of parameters A, B and I, also called cloning template, is re-
peated periodically for each cell over the whole network, which implies a reduced
set of at most 19 control parameters, but nevertheless a large number of possible
processing operations [2]. It was shown that numerous traditional operators for
binary and gray level image processing, among which are all linear convolution
filters as well as morphological and Boolean operators, can be designed for
uncoupled CNNs, i.e. CNNs with no feedback interconnection [3]. In the case of
uncoupled CNNs and according to equations (1) and (2), the CNN dynamics
can be defined by a set of three differential equations, each valid over one of the three
domains of linearity of (2):

dx_i,j/dt = −x_i,j − a + B ⊗ u_i,j + I ,  for x_i,j ∈ ]−∞, −1]   (3a)

dx_i,j/dt = −x_i,j + a·x_i,j + B ⊗ u_i,j + I ,  for x_i,j ∈ [−1, 1]   (3b)

dx_i,j/dt = −x_i,j + a + B ⊗ u_i,j + I ,  for x_i,j ∈ [1, +∞[   (3c)
It is known from [3] that gray level output operators can be obtained when
a < 1, while only binary output operators are obtained when a > 1. In terms
of the dynamical internal behavior of a cell, a < 1 implies only one stable equilibrium
point xq°_i,j, whereas a > 1 leads to two possible stable equilibrium points
xq−_i,j and xq+_i,j, respectively located in ]−∞, −1] and ]1, +∞[. The values of the
different possible equilibrium points are derived from (3a-c) when the derivative
is canceled and are given by:

xq−_i,j = B ⊗ u_i,j + I − a   (4a)

xq°_i,j = (B ⊗ u_i,j + I) / (1 − a)   (4b)

xq+_i,j = B ⊗ u_i,j + I + a   (4c)
The usual way of designing cloning templates consists in acting on the CNN
dynamics to set prescribed equilibrium points and thus obtain the expected processing
operator in a straightforward way. The byroad presented in this paper investigates
the internal state dynamics to prescribe particular state configurations which
are finally used to derive operations not realizable directly. After a short overview
of the design of CNNs for image processing, the processing of internal states
will be introduced, and applications to the composition of complex processing
schemes, gray level preserving segmentation and selective brightness variation
will be presented.
where the initial state x(0) ∈ [−1, 1] is the same for all cells of the network. In
addition, an inversion effect is obtained by reversing the sign of B and Th.

where Th− applies to cells with an initial state x−(0) ∈ [−1, 1], and Th+ to cells
with an initial state x+(0) ∈ [−1, 1], such that x−(0) < x+(0) and Th− > Th+.
Single Threshold Processing and Boolean Operators. This is an adaptation
of the previous method which allows combining a binary initial state with
the result of a thresholded convolution filter.

"OR" Boolean operators are obtained when:

Once again, an inversion effect can be obtained by simply reversing the sign
of B and of the threshold value Th.
It is obvious from (2) that for x ∈ [−1, 1] the value of the output y reflects that
of the internal state x, which can hence be straightforwardly observed from
the output of the CNN. However, when y = ±1, the only information on the
internal state provided by the output is that |x| ≥ 1. In the latter case, it is
understood that a binary output does not imply binary internal states. The use
of internal state histograms as investigation tools allows an insight into
the CNN behavior beyond the [−1, 1] range. As a meaningful example, fig. 1
shows an image and its internal state histogram before and after thresholding.
It is clear from this representation that even if the output is binary, it is not
necessarily so for the internal states, and thus the gray level information is not
really lost but merely hidden, and can be processed in order to complete specific
operations. To achieve this aim it is first interesting to focus on the internal
state locations after a binary output image processing.
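The observation above is easy to reproduce numerically: apply the saturation (2) to a spread of internal states and compare the histogram of x with the values visible at the output. The numbers below are arbitrary illustrative values, not taken from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=10_000)              # assumed internal states
y = 0.5 * (np.abs(x + 1) - np.abs(x - 1))        # eq (2): what the output shows
hist_x, _ = np.histogram(x, bins=12, range=(-3, 3))
# outputs of saturated cells collapse to +/-1, but hist_x keeps their spread
saturated = np.unique(np.round(y[np.abs(x) >= 1], 9))
```

The histogram of x retains the full spread over [−3, 3] even though two thirds of the cells present a strictly binary output.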
Internal States Location After a Single Threshold Processing. For
single threshold processing, the value of B ⊗ u_i,j in max_i,j(xq−_i,j) and min_i,j(xq+_i,j)
is equal to the threshold value Th, while for min_i,j(xq−_i,j) and max_i,j(xq+_i,j) the
value of B ⊗ u_i,j is respectively equal to −‖B‖₁ and ‖B‖₁. This leads to the
following expression of D− and D+:

D− = [−‖B‖₁ + I − a, Th + I − a[ ,  D+ = ]Th + I + a, ‖B‖₁ + I + a]   (11a-b)
Internal States Location After a Two Thresholds Processing. Deriving
the previous approach for two thresholds processing leads for Th− to:

D−₋ = [−‖B‖₁ + I − a, Th− + I − a[ ,  D+₋ = ]Th− + I + a, ‖B‖₁ + I + a]   (12a-b)

and for Th+ to:

D−₊ = [−‖B‖₁ + I − a, Th+ + I − a[ ,  D+₊ = ]Th+ + I + a, ‖B‖₁ + I + a]   (13a-b)

Considering Th− and Th+ simultaneously, the overall expression of D− and D+
is:

D− = D−₋ ∪ D−₊ ,  D+ = D+₋ ∪ D+₊   (14a-b)

Remembering that Th− > Th+, it finally yields:

D− = [−‖B‖₁ + I − a, Th− + I − a[ ,  D+ = ]Th+ + I + a, ‖B‖₁ + I + a]   (15a-b)
xq+ = I + a .   (16c)

Th = I / (1 − a) .   (17)
Hence, when a cell's initial state is less than Th it leads to xq−, and to xq+ when
it is greater. The method then allows, by choosing parameters a and I according
to (17), the design of a threshold operator which operates on an image stored in the
CNN initial state, and results in a binary image for both the output and the internal
state image. The possible values for the outputs are then of course −1 and 1,
whereas they are xq− and xq+ for the internal states.
The choice of parameters a and I in the equation of Th (17) allows setting
either the value of xq− or that of xq+, but not both at the same time. However,
if a threshold operation is not useful because the internal state already results
from a previous threshold operation, it is possible to binarize the internal state
and to fix both the value of xq− and that of xq+. This is done by solving the
following set of equations for a and I:

xq− = I − a
xq+ = I + a   (18)

which yields:

a = (xq+ − xq−) / 2 ,  I = (xq+ + xq−) / 2   (19a-b)
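Equations (19a-b) give a two-line recipe: pick the two desired stable states, then read off a and I. A sketch (the function name is ours; B = 0 is assumed as in the derivation):

```python
def binarizing_template(xq_minus, xq_plus):
    """eq (19a-b): feedback gain a and bias I pinning the two stable
    internal states to prescribed values (B = 0 assumed)."""
    a = (xq_plus - xq_minus) / 2.0
    I = (xq_plus + xq_minus) / 2.0
    return a, I

a, I = binarizing_template(-2.0, 3.0)
assert (I - a, I + a) == (-2.0, 3.0)       # system (18) is recovered
assert a > 1                               # bistability requires a > 1
```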
It must be clearly noted that, while the latter method can binarize internal
states to prescribed values, whether the internal states have already been
binarized or not, it cannot modify any CNN output, i.e. it cannot move an internal
state from ]−∞, −1[ to ]1, +∞[ or from ]1, +∞[ to ]−∞, −1[. The only way of
changing the CNN output consists in fact in dealing with an initial state in
[−1, 1].
As all the internal state processing operations involved in section 3 are regarded
as a kind of preprocessing for new CNN image processing operators, the
internal states involved should not get stuck in ]−∞, −1[ and ]1, +∞[, and
it should be possible to shift them even into [−1, 1]. This implies the use of
cloning templates for which a < 1, which paradoxically generates CNN operators
whose steady state is independent of the CNN initial state [3]. Fortunately,
this paradox can be solved if the CNN convergence is stopped before the steady
state is reached. The following subsections will establish the relation between
internal state value and transient time and expose the principle of internal state
shifting.
Relation between Internal State Value and Transient Time. The de-
termination of the relation between internal state value and transient time can
be done by solving the differential equations (3a-c). Even if more complex cases
could be considered, for clarity it is convenient to set a = 0. The equations in
(3a-c) can then be gathered into a single equation:
dx_ij(t)/dt = -x_ij(t) + B ⊗ u_ij + I        (20)

which finally yields the expression of the transient time t for a given value of
x_ij = x_ij(t):

t_ij = -ln[ (x_ij - B ⊗ u_ij - I) / (x_ij(0) - B ⊗ u_ij - I) ]        (22)
I = [x-(0) · x+(t) - x+(0) · x-(t)] / [x-(0) + x+(t) - x+(0) - x-(t)]        (24)

dx_ij(t)/dt = -x_ij(t) + u_ij + I        (25)

t = -ln[(x+(t) - u+ - I) / (x+(0) - u+ - I)] = -ln[(x-(t) - u- - I) / (x-(0) - u- - I)]        (26)
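The closed-form transient time in (22) can be sanity-checked numerically: integrating the scalar ODE dx/dt = -x + c (with c standing in for B ⊗ u_ij + I) by a fine Euler scheme and timing the crossing should reproduce the analytic formula. The constants below are illustrative, not taken from the paper.

```python
import math

def transient_time(x_t, x0, c):
    """Analytic transient time from eq. (22): t = -ln((x(t)-c)/(x0-c))."""
    return -math.log((x_t - c) / (x0 - c))

def euler_until(x_target, x0, c, dt=1e-5):
    """Integrate dx/dt = -x + c until x crosses x_target; return elapsed time."""
    x, t = x0, 0.0
    # x decays monotonically toward c, so a simple crossing test suffices
    while (x - x_target) * (x0 - x_target) > 0:
        x += dt * (-x + c)
        t += dt
    return t

c, x0, x_target = 0.5, 2.0, 1.0      # illustrative values
t_analytic = transient_time(x_target, x0, c)   # = ln 3 here
t_numeric = euler_until(x_target, x0, c)
print(abs(t_analytic - t_numeric) < 1e-2)   # True
```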
4 Applications
The applications proposed here are based on the processing of the image in
fig. 1a. This image has a particular histogram which makes the segmentation
of objects easier and does not require complex segmentation schemes, which are
beyond the scope of this paper. In fact, the sky has a histogram included in
[-1, -0.81], the balloon in [-0.81, -0.56], the landscape in [-0.56, 0.40] and the
helicopter in [0.40, 1].
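Segmenting by these histogram bands amounts to a simple per-pixel lookup; a minimal sketch (band edges from the text, pixel values hypothetical):

```python
# Classify pixels of a [-1, 1]-valued image into the four regions
# named in the text, using the histogram bands it reports.
BANDS = [("sky", -1.00, -0.81), ("balloon", -0.81, -0.56),
         ("landscape", -0.56, 0.40), ("helicopter", 0.40, 1.00)]

def classify(value):
    for name, lo, hi in BANDS:
        if lo <= value <= hi:
            return name
    raise ValueError("value outside [-1, 1]")

print(classify(-0.9), classify(0.7))   # sky helicopter
```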
5 Conclusion
References
1. L. O. Chua and L. Yang, Cellular Neural Networks: Theory, IEEE T-CAS vol. 35 (1988)
1257-1272.
2. L. Merlat, A. Köneke, J. Mercklé, A Tutorial Introduction to Cellular Neural Net-
works, Proc. of Workshop in Electrical Engineering and Automatic Control (1997),
ESSAIM, Mulhouse, France.
3. D. Monnin, L. Merlat, A. Köneke and J. Hérault, Design of Cellular Neural Networks
for Binary and Gray Level Image Processing, Proc. of ICANN 98 (1998) 743-748.
4. T. Roska and L. O. Chua, CNN with Non-linear and Delay-type Template Elements
and Non-uniform Grids, in Cellular Neural Networks, J. Wiley & Sons (1993) 31-43.
Figures

Fig. 2. Single threshold processing, parameters: a = 2, b = -1, I = -0.56, x(0) = 0
Fig. 3. Internal states binarization, parameters: a = 2, b = 0, I = 0
Fig. 4. Internal states shifting, parameters: a = 0, b = 0, I = 0, t = 0.69
Fig. 5. Two thresholds processing, parameters: a = 1.605, b = 1, I = 0.205
Fig. 6. Internal states shifting, parameters: a = 0, b = 1, I = -1.4, t = 0.83
Fig. 7. Single threshold processing & AND, parameters: a = 1.905, b = 1, I = -0.095
Fig. 8. Internal states shifting, parameters: a = 0, b = 1, I = 0.19, t = 2.44
Autopoiesis and Image Processing: Detection of Structure and
Organization in Images
Abstract
The theory of Autopoiesis describes what living systems are, not what they do.
Instead of investigating the behavior of systems exhibiting autonomy and the concrete
implementation of this autonomy (i.e. the system structure), the study addresses the
reason why such behavior is exhibited (i.e. the abstract system organization). This article
explores the use of autopoietic concepts in the field of Image Processing. Two different
approaches are presented. The first approach assumes that the organization of an image is
represented only by its grayvalue distribution. In order to identify autopoietic
organization inside an image's pixel distribution, the steady state Xor-operation is
identified as the only valid approach for an autopoietic processing of images. The effect
of its application on images is explored and discussed. The second approach makes use of
a second space, the A-space, as the autopoietic-processing domain. This allows for the
formulation of adaptable recognition tasks. Based on this second approach, the concept of
autopoiesis as a tool for the analysis of textures is explored.
1 Introduction
The theory of Autopoiesis, developed by the Chilean biologists Humberto Maturana and
Francisco Varela, attempts to give an integrated characterization of the nature of a living
system, which is framed purely with respect to the system in and of itself. The term
autopoiesis was coined some twenty-five years ago by combining the Greek auto (self-)
and poiesis (creation; production). The concept of autopoiesis is defined as [Varela,
1979, p. 13]:
2. constitute it (the machine) as a concrete unity in the space in which they [the
components] exist by specifying the topological domain of its realization as
such a network.'
In a first attempt, the question arises whether images by themselves preserve some kind
of autopoietic organization. Because images are generally considered as static
representations of real-world objects, whereas autopoiesis is constituted by a network of
dynamic transformations, the image must be processed by suitable operators in order to
reveal possible organizational principles. Thereby, the original image appears to be like a
"frozen" state of its intrinsic dynamical processes. Two approaches are possible from
now on: the first approach assumes no relation between these dynamics and the real-
world objects pictured in the image, in contrast to the second approach, which names the
kind of features of real-world objects by which the dynamics are driven.
This section is concerned with the first approach, i.e., image dynamics are restricted to the
distribution of colors or grayvalues in the image. No reference is made to the pictured real-
world objects. In order to "melt" the image according to a possible intrinsic dynamic, an
operator is sought with the following two essential properties:
1. The operator should be applied point-wise. Normally, image operators like the
Laplacian, the Sobel or the median operator are applied to all image pixels at once. But, as
was mentioned in the introduction, autopoietic systems constitute the domain of their
realization. For an effective search of these domains, their size cannot be
predicted. Hence, the application domain of the operator must be balanced between purely
local analysis (a pixel and its neighborhood) and global analysis (all image pixels). Pre-
defined image operators do not offer such a choice. By repetitively applying the local
operator point-wise, the effect of a local operation is spread out over the image and more
complicated patterns of interaction are possible. In analogy with a similar procedure in
genetic algorithms, we refer to this manner of image operator application as steady state
image processing.
Speaking more formally, let ⊕ be the image operation, which is applied point-wise. A
sequence of points p(T) is generated randomly, where p(T) is the point chosen at time
step T. Let g(p) be the grayvalue of point p in the image. In this article, all points are
equally probable in the random sequence. Non-adjacent points in the sequence could be
neighbors in the image. Assume p(T1) and p(T2) are such a pair of points with T1 < T2.
Then, while applying the operator ⊕ to the point p(T1) and its neighbors, p(T2) is also
affected. But later, at step T2, the application of the operator onto the modified p(T2) also
affects p(T1). Our demand is to have a non-zero probability of reproducing p(T1)'s original
value by this procedure. This demand can be fulfilled using the Xor-operation. This can
be verified by considering the following three properties of the Xor-operation:

Commutativity: a ⊕ b = b ⊕ a
Associativity: (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
Auto-projection: (a ⊕ b) ⊕ b = a
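These three properties can be checked mechanically for 8-bit grayvalues, using Python's `^` as the Xor:

```python
import itertools

vals = range(256)  # all 8-bit grayvalues

# Auto-projection must hold for every pair: (a ^ b) ^ b == a
assert all((a ^ b) ^ b == a for a in vals for b in vals)

# Commutativity and associativity, spot-checked on a sample grid
sample = [0, 1, 37, 128, 200, 255]
assert all(a ^ b == b ^ a for a, b in itertools.product(sample, repeat=2))
assert all((a ^ b) ^ c == a ^ (b ^ c)
           for a, b, c in itertools.product(sample, repeat=3))
print("all three Xor properties hold")
```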
The third property explains the fundamental role of the Xor-operation in data coding, as
well as for sprite algorithms in computer animations. It can easily be shown that the Xor-
operation and its negation are the only binary operations fulfilling all three of these
properties. Hence, for detecting organized autopoietic structures in an image's grayvalue
distribution, it is necessary to apply the Xor-operation in a steady state manner. The
resulting algorithm is as simple as the following:
Repeat:
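A minimal sketch of such a steady-state Xor loop, assembled from the description above (the 4-neighborhood and the exact update rule are assumptions, since the listing itself is not reproduced here):

```python
import random

def steady_state_xor(img, steps, seed=0):
    """Repeat: choose a random pixel, Xor its value into each 4-neighbor."""
    rng = random.Random(seed)
    h, w = len(img), len(img[0])
    for _ in range(steps):
        i, j = rng.randrange(h), rng.randrange(w)
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < h and 0 <= nj < w:
                img[ni][nj] ^= img[i][j]   # point-wise Xor update
    return img

img = [[10, 20], [30, 40]]
steady_state_xor(img, steps=1000)
print(img)
```

By auto-projection, a later visit that Xors the same value in again can restore a pixel, which is exactly the non-zero regeneration probability demanded above.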
2.1 Discussion
For a further understanding of the effect of this operation, consider figure 1. There, a face
image (a), the result of the repetitive application of the above algorithm after 1000
"generations" (b), and a dilated version of the second image (c) are shown. A white
circular contour around the phong-pattern on the forehead can be seen. This phong-
pattern is a result of the lighting conditions during image acquisition. Phong-patterns are a
major problem for facial recognition tasks. By using the Xor-operator, they can be easily
detected. But where does this circle around the phong-pattern come from?
To understand this, consider the effect of the same procedure on a gradient
image (figure 2). Again, the original image, the image after 1000 steps and its dilated version
are shown. A white line appears in the middle of the image, around the grayvalue 128, but
not exactly at this position. The explanation of this effect recalls another famous role
of the Xor-operation, as a benchmark for neural networks. The Xor-operation is not
linearly separable. A neural network needs a hidden layer to learn the Xor-operation. The
gradient image helps to give an intuition for this fact. If p1 has grayvalue 0, then g(p1) ⊕
g(p2) gives g(p2), i.e. for low grayvalues, the Xor-operator tends to be the identity
transformation. If g(p1) is the maximum grayvalue (255 in our case), g(p1) ⊕ g(p2) gives
the inverse of g(p2), i.e. for high grayvalues it tends to be the inverting transformation.
But there is no linear descent from identity to inverse! Hence, we must have a non-linear
anomaly between these two extremes. The white line represents this anomaly.
Grayvalues around 128 tend to complete each other to the maximum grayvalue 255. The
white line appears to be a boundary between a gradual descent from the maximum to the
minimum grayvalue. This way, boundary exchange processes can be identified; i.e. the
boundary must be a closed one to prevent the Xor-operations from pocketing it. Hence,
Figure 1. A face image (a), the result of the repetitive application of the above algorithm after 1000
"generations" (b), and a dilated version of the second image (c).
Figure 2. A gradient image (a), the result of the repetitive application of the above algorithm after 1000
"generations" (b), and a dilated version of the second image (c).
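The identity and inversion behavior at the grayvalue extremes, and the "completion to 255" around mid-grayvalues, can be checked directly:

```python
# For 8-bit grayvalues: Xor with 0 is the identity, Xor with 255 inverts,
# and two mid-range values with complementary bit patterns Xor to 255.
g = 77
assert 0 ^ g == g              # low grayvalue: identity transformation
assert 255 ^ g == 255 - g      # high grayvalue: inverting transformation
# Around 128: 130 and 125 have complementary bits (130 + 125 = 255)
print(130 ^ 125)   # 255
```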
To summarize the foregoing discussion: for the detection of the autopoietic
organization of a grayvalue distribution, or better, of the actual grayvalue distribution as a
"frozen" state of a possible autopoietic organization, the Xor-operator must be applied in
a steady state manner, i.e. on a sequence of randomly chosen image points. Only the Xor-
operator has the property of auto-projection, which ensures a probability much greater
than zero of regenerating the original image. This is true as long as binary numbers
represent grayvalues. The application of the Xor-operator to images yields phong-like
structures, which prove to be the only organizational issues of intensity ordering in an
autopoietic manner.
These first results are encouraging enough to continue this work. It has been
shown that the search for autopoietic organization in grayvalue distributions of images
reveals new structural properties of them, which are hard to find by means of other
image processing operations. Further research on the Xor-operator should explore the role
of the probability distribution for the random sequence of pixel positions. Also,
ordering relations in the image other than the conventional intensity ordering should offer new
application tasks for the Xor-operator.
Texture perception plays an important role in human vision. It is used to detect and
distinguish objects, to infer surface orientation and perspective, and to determine shape in
3D scenes. Even though texture is an intuitive concept, there is no universally accepted
definition for it. Despite this fact, we can say that textures are homogeneous visual
patterns that we perceive in natural or synthetic images. They are made of local
micropatterns, repeated somehow, producing the sensation of uniformity [Ruiz-del-Solar,
1997]. It is important to point out that textures cannot be characterized only by their
structure, because the same texture, viewed under different conditions, is perceived as
having different structures.
In the framework of the theory of autopoiesis, Maturana and Varela make a
complementary definition of the concepts of organization and structure of a system. The
organization of a system defines its identity as a unity, while the structure determines
only an instance of the system organization. In other words, the organization of a system
defines its invariant characteristics. The concept of autopoiesis captures the key idea that
living systems are systems that self-maintain their organization (see introduction). In the
context of texture analysis, the systems to be analyzed are the textures. As established
above, the concept of organization must be used to characterize a system, in our
case a texture. For this reason, in this section the concept of autopoiesis is
explored as a tool for texture identification, which corresponds to an important task in the
field of texture analysis. The analogy between the process of autopoietic organization (i.e.
life) in a chemical medium and the process of texture identification is used.
Before applying the concept of autopoiesis as a tool for texture identification, a
computational model of autopoiesis must be defined. Varela et al. developed the first
model that was capable of supporting autopoietic organization [Varela et al., 1974].
Recently, McMullin developed the SCL model, which corresponds to an improvement of
the model presented by Varela [McMullin, 1996a and 1997b]. The SCL model from
McMullin is used here.
SCL involves three different chemical elements (or particles): Substrate (S), Catalyst (K)
and Link (L). These particles move in random walks in a discrete, two-dimensional space.
In this space, each position is occupied by a single particle, or is empty. Empty positions
are managed by introducing a fourth class of particles: a Hole (H). SCL supports six
distinct reactions among particles [McMullin, 1997b]:
1. Production:
K + S + S -> K + L + H
2. Disintegration:
L -> S + S
3. Bonding:
Adjacent L particles bond into indefinitely long chains
4. Bond decay:
Individual bonds can decay, breaking a chain
5. Absorption:
L + S -> L*
6. Emission:
L* -> L + S
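As an illustration only (the real SCL model tracks particle positions, bonds and random walks on the grid), the production reaction can be sketched as a rewrite on a multiset of particles; the representation below is an assumption, not McMullin's implementation:

```python
from collections import Counter

# Particle multiset; 'H' counts hole (empty) positions.
state = Counter({"K": 1, "S": 6, "L": 0, "H": 0})

def produce(state):
    """Production reaction: K + S + S -> K + L + H (the catalyst is conserved)."""
    if state["K"] >= 1 and state["S"] >= 2:
        state["S"] -= 2
        state["L"] += 1
        state["H"] += 1
        return True
    return False

# Apply production until the substrate is exhausted
while produce(state):
    pass
print(dict(state))   # {'K': 1, 'S': 0, 'L': 3, 'H': 3}
```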
The original SCL model was modified to allow the identification of textures by
introducing the idea of a texture-dependent catalyst, that is, a catalyst that is tuned
to a defined texture and that produces an autopoietic organization only in this texture.
To implement this idea, an autopoietic image A(i,j) is defined for each texture image T(i,j).
Each pixel of A(i,j) has a corresponding position in T(i,j) and is represented by 2 bits
(enough to represent the four particles). A T-Space is associated with the texture images
T(i,j) and an A-Space is associated with the autopoietic images A(i,j) (see figure 3). The
reactions defined by the SCL model, that is the possible autopoietic organization, take
place in the A-Space, but taking into account information from the T-Space (textures).
Figure 3. The A-Space, where the autopoietic organization is created, and the T-Space, where the convolution
between the texture and the Gabor-Filter is performed, are shown.
Production:
C1 = N1 * Gk
C2 = N2 * Gk
where Gk is the Gabor-Filter associated with the catalyst K; N1 and N2 are the
neighborhoods, in the T-Space, of S1 and S2, respectively (see figure 3); C1 and C2 are the
results of the convolution (performed in the T-Space); and TH is a threshold value. If, in
the A-Space of a given texture, a chain of elements forms a boundary after an interaction
time, then the catalyst K has identified the texture (in its T-Space) as corresponding to the
class of textures characterized by the Gabor-Filter Gk.
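The tuning of the catalyst can be sketched as follows: a Gabor kernel is correlated with the local T-Space neighborhood and the response compared to a threshold TH. The kernel parameters, the test patches and the threshold below are illustrative assumptions, not the paper's values.

```python
import math

def gabor_kernel(size=7, wavelength=4.0, sigma=2.0):
    """Real (cosine-carrier) Gabor kernel, made zero-mean so a flat patch scores 0."""
    half = size // 2
    k = [[math.exp(-(x * x + y * y) / (2 * sigma * sigma))
          * math.cos(2 * math.pi * x / wavelength)
          for x in range(-half, half + 1)] for y in range(-half, half + 1)]
    mean = sum(map(sum, k)) / size ** 2
    return [[v - mean for v in row] for row in k]

def response(patch, kernel):
    """Correlation of one neighborhood with the kernel (one 'pixel' of C1 or C2)."""
    return sum(p * w for prow, krow in zip(patch, kernel)
               for p, w in zip(prow, krow))

G = gabor_kernel()
flat = [[1.0] * 7 for _ in range(7)]                        # untextured patch
grating = [[math.cos(2 * math.pi * x / 4.0) for x in range(-3, 4)]
           for _ in range(7)]                               # matched texture
TH = 1.0   # illustrative threshold
print(abs(response(flat, G)) < 1e-9, response(grating, G) > TH)   # True True
```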
Figure 4. Proposed Texture Retrieval System (A3G: Automatic Autopoietic-Agent Generator; TA2T:
Textural Autopoietic-Agent Tester).
4 Conclusions
The use of autopoietic concepts in the field of Image Processing was explored. Two
different approaches were presented. The first approach, presented in section 2, assumes
that the organization of an image is represented only by its grayvalue distribution. In
order to identify autopoietic organization inside an image's pixel distribution, the steady
state Xor-operation was identified as the only valid approach for an autopoietic
processing of images. The application of the Xor-operator to images yields phong-like
structures, which prove to be the only organizational issues of intensity ordering in an
autopoietic manner. These first results are encouraging enough to continue this work. It
was shown that the search for autopoietic organization in grayvalue distributions of
images reveals new structural properties of them, which are hard to find by means of
other image processing operations. Further research on the Xor-operator should explore
the role of the probability distribution for the random sequence of pixel positions. Also,
ordering relations in the image other than the conventional intensity ordering should offer
new application tasks for the Xor-operator.
The second approach, presented in section 3, makes use of a second space, the A-
space, as the autopoietic processing domain. This allows the formulation of adaptable
recognition tasks. Based on this second approach, the concept of autopoiesis as a tool for
the analysis of textures was explored. The SCL model, a computational model of
autopoiesis, was modified to allow the identification of textures by introducing the idea
of a texture-dependent catalyst. As a demonstrating example, a Texture Retrieval System
based on the use of an autopoietic agent, the texture-dependent catalyst, was presented.
Further research must be performed to apply this concept to the solution of real-world
problems.
References
Varela, F.J. (1979). Principles of Biological Autonomy, New York: Elsevier (North
Holland).
Varela, F.J., Maturana, H.R., and Uribe, R. (1974). Autopoiesis: The organization of
living systems, its characterization and a model. BioSystems 5: 187-196.
Abstract. One of the major difficulties arising in the analysis of a radiological image is
that of non-uniform variations in luminosity in the background. This problem urgently
requires a solution, given that differing areas of the image have the same values
attributed to them, and this may potentially lead to grave errors in the analysis of an image.
This article describes the application of two different methods for the solution of this
problem: polynomial algorithms and artificial neural networks. The results obtained
using each method are described and compared, the advantages and drawbacks of each
method are commented on and reference is made to areas of potential interest from the
point of view of future research.
1 Introduction
Within the field of digital image processing in medicine, one of the areas to which
most effort is dedicated is the analysis of radiological images [1]. Any
improvement in either the quality of these images or their analysis process
would guarantee an important improvement in patient care.
Moreover, this area of investigation is particularly interesting in terms of
developing new support systems for specialists in a particular image field, given that
there is generally a good supply of images available both for development and for
system tests.
In the digital analysis of radiological images, one of the problems that occurs most
frequently is that of variations in luminosity [2]. This problem occurs as a
consequence of curvature in the exposed surface or an intrusion of some kind between
the image acquisition apparatus and the object. The consequence is that the non-
uniform illumination causes the elements making up the image to have different
luminosity values depending on the area of the image, while these values, for
different elements, are similar in different areas of the image.
This is a problem that needs to be resolved before proceeding to a detailed analysis
of the image. Not doing so could cause grave errors during the segmentation
phase, given the impossibility of establishing a criterion that delimits, within a sufficient
margin of error, the different elements that make up the radiograph.
The traditional approach to this problem is based on statistical methods [3]. Using
a set of images, a series of probabilities is calculated, on the basis of which a function
is applied to the luminosity value of each point of the image so as to obtain the correct
values. However, this kind of method has two major drawbacks:
This article presents the results obtained in the preprocessing of radiological images
corresponding to an orthopedic service. The aim of the research is to endeavour to
eliminate problems of variations in luminosity and to make an in-depth analysis of the
images, with a view to creating a valuable tool for specialists to employ in their
diagnoses. In view of this aim, the two methods selected as most
appropriate were polynomial algorithms [4] and artificial neural networks. In order
to evaluate the quality of the results, a segmentation of each image obtained from
applying the two methods was carried out using different clustering algorithms. The
results obtained using both methods, along with the advantages and drawbacks in the
use of either, are described below.
A solution to the problem that is the concern of this research would mean
significant progress in the development of an automatic process for the examination
of radiological images, given that the value of the radiographs available depends
greatly on the extent to which this flaw is corrected.
As a longer-term aim, it is hoped to extend the research so as to develop a system
that assists specialists in the fitting of prostheses as well as in the assessment of screw
implants.
To start with, the characteristics that best defined the image were identified (Fig.
1), in order to select those techniques that would produce the best results. For this
characterization of the image, standard digital processing techniques were used.
It was observed that the borders between the different elements are both close to
each other and fuzzy or blurred.
The histogram of the radiograph was also examined (Fig. 2), with a view to
obtaining a precise idea of the distribution of the values, and this confirmed that the
borders were blurred. In addition, the radiograph presented a non-uniform variation
in luminosity: the intensity of the bone and the screw in the upper and lower portions
of the image is very different.
3 The Methods Analysed
Applied to the analysis of the problem were the polynomial algorithm and artificial
neural network techniques, with the aim of comparing the results of both. The
former is a linear algorithmic technique whereby it is only possible to adjust a fixed
number of parameters; the ANN technique, on the other hand, is a non-linear one
whereby, after training, it is expected to be capable of generalizing, i.e. of
adapting to images with totally different characteristics from those of the images used for
training.
The least squares method consists of the construction of an image reflecting the
variations occurring in the background of the image, by means of a bidimensional
polynomial p(x,y) calculated by way of the least squares method, and subtracting it
from the original image with a view to eliminating the variation.
The calculation of the polynomial is based on the assumption that the values for
background luminosity in an image are spatially continuous, it being possible to make
the calculations using a polynomial of arbitrary degree based on the Weierstrass
approximation theorem [5].
p(x,y) = a00 x^0 y^0 + a10 x^1 y^0 + a20 x^2 y^0 + ... + an0 x^n y^0 + a01 x^0 y^1 + a02 x^0 y^2 + ... + a0n x^0 y^n   (1)
The values for the polynomial image are subsequently determined by calculating the
value of the polynomial for (x,y) with x = 1..N, y = 1..M, where N and M represent the
range of each dimension of the image.
Bearing in mind that this technique has the drawback of being time-consuming in
computational terms, the degree of the polynomial and the number of points used to
calculate it should be limited as far as possible.
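A minimal sketch of this background-fitting scheme, restricted for brevity to a first-degree polynomial p(x,y) = a + b·x + c·y (the paper's actual degree and sampling are not specified here):

```python
def fit_plane(img):
    """Least-squares fit of p(x,y) = a + b*x + c*y via the normal equations."""
    N, M = len(img), len(img[0])
    # Accumulate A^T A (S) and A^T z (t) for the 3-parameter model
    S = [[0.0] * 3 for _ in range(3)]
    t = [0.0, 0.0, 0.0]
    for yy in range(N):
        for xx in range(M):
            row = (1.0, float(xx), float(yy))
            z = img[yy][xx]
            for i in range(3):
                t[i] += row[i] * z
                for j in range(3):
                    S[i][j] += row[i] * row[j]
    # Solve the 3x3 system by Gauss-Jordan elimination
    for i in range(3):
        piv = S[i][i]
        for j in range(i, 3):
            S[i][j] /= piv
        t[i] /= piv
        for k in range(3):
            if k != i:
                f = S[k][i]
                for j in range(i, 3):
                    S[k][j] -= f * S[i][j]
                t[k] -= f * t[i]
    a, b, c = t
    return [[a + b * xx + c * yy for xx in range(M)] for yy in range(N)]

# An exact planar background is recovered, so the residual is ~0 everywhere
img = [[2.0 + 0.1 * xx - 0.05 * yy for xx in range(16)] for yy in range(16)]
bg = fit_plane(img)
residual = max(abs(img[yy][xx] - bg[yy][xx])
               for yy in range(16) for xx in range(16))
print(residual < 1e-6)   # True
```

Subtracting `bg` from the original image then removes the smooth luminosity variation while leaving local structure in place.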
The type of neural network utilized was the feed-forward type [6], with an input
layer composed of 25 process elements, plus a hidden layer and an output layer, each
having one process element. Connectivity is total between all the process
elements of the network. The input layer was defined at a size of 5x5 pixels, with a
view to a fragmented processing of the image, simulating a convolution [7].
Training. Selected in order to train the network was a supervised learning process
using the backpropagation algorithm.
Input pattern. A synthetic image in a range of grey tones was used, with dimensions
of 368 by 360 pixels taking values in the interval [0,255]. A total of 9,975 fragments
of 5x5 pixels were extracted, the values of which were fed to the network as
input.
Output pattern. One fragment of 5x5 pixels was extracted from the output image for
each fragment of the input image. The expected output of the network would be the
mean value of the pixels in each fragment of the image taken as the output model.
actv = 1 / (1 + e^(-x))        (2)
where actv is the activation of each process element and x the input to the neuron.
Initialization function. Randomized Weights initializes weights and biases with
randomly distributed values; in this case, in the interval [-1,1].
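A minimal sketch of the described 25-1-1 sigmoid network trained by backpropagation (the learning rate, number of epochs and the [0,1] scaling of targets are assumptions, not the paper's settings):

```python
import math, random

random.seed(1)
sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))   # activation function, eq. (2)

# 25 inputs -> 1 hidden PE -> 1 output PE, weights/biases drawn from [-1, 1]
w1 = [random.uniform(-1, 1) for _ in range(25)]; b1 = random.uniform(-1, 1)
w2 = random.uniform(-1, 1); b2 = random.uniform(-1, 1)

def forward(patch):
    h = sigmoid(sum(w * p for w, p in zip(w1, patch)) + b1)
    return h, sigmoid(w2 * h + b2)

def train_step(patch, target, lr=0.5):
    global b1, w2, b2
    h, out = forward(patch)
    # Backpropagation with squared-error loss and sigmoid derivatives
    d_out = (out - target) * out * (1 - out)
    d_hid = d_out * w2 * h * (1 - h)
    w2 -= lr * d_out * h; b2 -= lr * d_out
    for i in range(25):
        w1[i] -= lr * d_hid * patch[i]
    b1 -= lr * d_hid
    return (out - target) ** 2

# Train on random 5x5 patches; target = patch mean (grayvalues scaled to [0,1])
patches = [[random.random() for _ in range(25)] for _ in range(200)]
errs = [sum(train_step(p, sum(p) / 25) for p in patches) for _ in range(50)]
print(errs[-1] < errs[0])
```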
4 Comparison of Results.
Fig. 4. Results for the Clustering Algorithm in the Segmentation of the Different Images.
There appears to be a case for claiming that the ANNs produce a considerably improved
result over the polynomial algorithms. Nevertheless, there still remain various
adjustments to be made to the training network so as to obtain optimum results, given
that there are certain patterns that the network is not capable of treating optimally. For
example, in the radiographs it can be appreciated that there is an elevated level of
noise, conducive to error in the segmentation phase, for which reason its elimination
during the pre-processing phase is desirable.
Finally, another interesting modification would be the design of a non-supervised
training network that would permit the detection of patterns of interest in the images,
thus facilitating segmentation and characterization of a radiological image.
6 Acknowledgements
Our thanks to the Computing Service of the Juan Canalejo Hospital (A Coruña,
Spain) for their collaboration in this research.
References
Topics: Image Processing, neural nets, industrial automation, texture recognition, real-time quality
control
Abstract. There are several approaches to quality control in industrial processes. This work is centered
on artificial vision applications for defect detection, classification and control. In particular, we focus
on textile fabric and the use of texture analysis for discrimination and classification. Most
previous methods have limitations in accurate discrimination or complexity in time calculation, so we
apply parallel and signal processing techniques. Our algorithm is divided into two phases: a first phase is
the extraction of texture features, and later we classify them. Texture features should have the following
properties: be invariant under the transformations of translation, rotation, and scaling; have a good
discriminating power; and take the non-stationary nature of texture into account. In our approach we use
Orthogonal Associative Neural Networks for texture identification and extraction of features with the
previous properties. It is also used in the feature extraction and classification phase (where its energy
function is minimized), so the whole method was applied to defect detection in textile fabric. Several
experiments have been done comparing the proposed method with other paradigms. In response time and
quality of response our proposal obtains the best parameters.
1. Introduction
For real-time image analysis, for example in the detection of defects in textile fabric, the
complexity of calculations has to be reduced in order to limit the system costs [3].
There are several approaches to quality control in industrial processes [1][2][7].
Additionally, algorithms which are suitable for migration into hardware have to be
chosen. Both the extraction method of texture features and the classification algorithm
must satisfy these two conditions. Moreover, the extraction method of texture features
should have the following properties: be invariant under the transformations of
translation, rotation, and scaling; have a good discriminating power; and take the non-
stationary nature of texture into account. We choose the Morphologic Coefficient [8] as a
feature extractor that is adequate for implementation by associative memories and
dedicated hardware.
On the other hand, the classification algorithm should be able to store all of the patterns,
have a high correct classification rate and a real-time response. There are many
models of classifier based on artificial neural networks [5][13][16]. Hopfield [11] and [12]
introduced a first model of one-layer autoassociative memory. The Bi-directional
Associative Memory (BAM) was proposed by Kosko [14] and generalizes the model
to be bidirectional and heteroassociative. The BAMs have storage capacity problems
[17].
Several improvements have been proposed (Adaptative Bidirectional Associative
Memories [15], multiple training [17] and [18], guaranteed recall, and a lot more
besides). One-step models without iteration have been developed too (Orthonormalized
Associative Memories [9] and Hao's associative memory [10], which uses a
hidden layer). In this paper, we propose a new model of associative memory which
can be used in bidirectional or one-step mode.
The Hausdorff Dimension (HD) was first proposed in 1919 by the mathematician
Hausdorff and has been used, mainly, in fractal studies [4]. One of the most attractive
features of this measure when analyzing images is its invariance under
isometric transformations. We will use HD when extracting features.
Definition I. The Hausdorff measure of order h of a set S, with S ⊂ R^n, h ≥ 0 and δ > 0,
is defined as follows:

H_h(S) = lim_{δ→0} inf { Σ_i |U_i|^h : S ⊂ ∪_i U_i, |U_i| < δ }   (1)

with |U_i| the diameter of the covering set U_i.
Definition II. The Hausdorff dimension of a set S is the value of h at which H_h(S)
jumps between infinity and 0. Formally,

dim_H(S) = inf { h ≥ 0 : H_h(S) = 0 }
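In practice the Hausdorff dimension is usually approximated by the closely related box-counting dimension; a small sketch of such an estimator (the box sizes and test set are illustrative, not the paper's implementation):

```python
import math

def box_count_dim(points, sizes=(1, 2, 4, 8, 16)):
    """Estimate the fractal dimension of a 2D point set:
    slope of log N(s) versus log(1/s) over box sizes s."""
    logs = []
    for s in sizes:
        boxes = {(x // s, y // s) for x, y in points}   # occupied s-by-s boxes
        logs.append((math.log(1.0 / s), math.log(len(boxes))))
    # Least-squares slope of the log-log points
    n = len(logs)
    mx = sum(a for a, _ in logs) / n
    my = sum(b for _, b in logs) / n
    return (sum((a - mx) * (b - my) for a, b in logs)
            / sum((a - mx) ** 2 for a, _ in logs))

# A filled 64x64 square should have dimension close to 2
square = [(x, y) for x in range(64) for y in range(64)]
d = box_count_dim(square)
print(abs(d - 2.0) < 0.1)   # True
```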
In this paper, we use a new model of associative memory which can be used in
bidirectional or one-step mode. This model uses a hidden layer, proper filters and
orthogonality to increase the storage capacity and reduce the noise effect of lineal
semicover in each plane λ_i, for pixel i = 1..P, is obtained by a bipolar filter:
The norm of the δ-semicover of the image in the plane λ_i, for i = 1..P and window V_j,
for j = 1..N/|V_j|, is

V_j-sm(I/λ_i) = |{ x ∈ V_j : f(x) = 1 }|   (6)

[Figure: a neuron computing g(y_j) from the filtered inputs f(x_1), ..., f(x_n)]

g(y_j) = 1 if y_j > 0, 0 if y_j < 0
Finally, the CM (Morphologic Coefficient) in a plane λ_i, for i = 1..P, is calculated
from several windows V_j, for j = 1..N/|V_j|, with V_j = 1, 2, ..., N in each plane, that is

CM(λ_i, V_j) = Σ_{l=1}^{K} log[g(y_l)] / (-log|V|)   (7)
So, we need a neuron that represents the output of a window.
Figure 4. Morphologic Coefficient in the plane λ_i
W = V = ( +1 +1 +1 +1 +1 ; +1 +1 +1 +1 +1 ; ... )   (8)
4. Experiments
To test the texture analysis algorithm (feature extraction and classifier) we consider
the problem of defect detection in textile fabric. Real-world 512x512 images (jeans
texture) with and without defects (fig. 1a and 1b) were employed in the learning
process of the MAO classifier. We considered windows of 70x70 pixels with 256 gray
levels, and the parameters of the algorithm were adjusted to obtain high precision and
low response time. These are shown in tables 1a and 1b.
The implementation was made in a C program. Different images were employed in
the test process and in the learning process. In both cases there were 1,200 images with
defects and 1,000 without defects. The results show that in all cases our algorithm is two
orders of magnitude faster than the others. In addition, the hit rate is close to 90% for
texture recognition both with and without defects (notice that in C-III, with ad-hoc
partitioning, it is over 95%). The conclusion is that it is feasible to implement a real-time
system with a high precision level based on our algorithm, so an architectural proposal
will be made.
Table 1. Simulation results
5. Conclusion
References
[1] N.R. Pal and S.K. Pal, A review on image segmentation techniques, Pattern
Recognition, Vol. 26, No. 9, pp. 1277-1294, 1993.
[2] R.M. Haralick, Statistical and structural approaches to texture, Proc. IEEE, Vol.
67, pp. 786-804, 1979.
[3] C. Neubauer, Segmentation of defects in textile fabric, Proc. IEEE, pp. 688-691,
1992.
[4] Hoggar, S.G., Mathematics for Computer Science, Cambridge University
Press, 1993.
[5] J.M. Zurada, Introduction to Artificial Neural Systems, West Publishing Company,
1992.
[6] Harwood, D. et al., Texture Classification by Center-Symmetric Auto-Correlation,
using Kullback Discrimination of Distribution, Pattern Recognition Letters, Vol. 16,
pp. 1-10, 1995.
[7] Laws, K.Y., Texture Image Segmentation, Ph.D. Thesis, University of Southern
California, January 1980.
1 Introduction
Cellular Neural Networks (CNN) [1] are a neural network model belonging to
the dynamic network category. They are characterized by the parallel computation
of simple, locally interconnected processing elements (so-called cells).
On the other hand, many image processing tasks consist of simple operations
restricted to the neighbourhood of each pixel in the image under processing.
Therefore, they map directly onto a CNN architecture. This fact, along
with the possibility of implementing a CNN as an integrated circuit, makes
these architectures an interesting choice for image processing applications
needing high processing speeds.
In order to approach a given task by means of a CNN architecture, the
weights of the connections among cells must be determined. This is usually
achieved through a heuristic design, which requires a good definition of the problem
under consideration, as well as through the use of learning algorithms [2]. Most of these
algorithms are adaptations of classical learning algorithms and lead to
good solutions for applications projected onto single-layer CNN. However,
many of them fail when multiple CNN operations are required.
Multiple CNN operations, which are needed for the resolution of complex problems,
can be implemented using the CNN Universal Machine (CNN-UM) [3].
The CNN-UM is an algorithmically programmable analog array computer
which makes it possible to approach complex problems by splitting them up into
simpler operations (many of which are already implemented in existing libraries
and subroutines [4]).
Another way to approach such complex tasks is the use of the
discrete-time extension of the CNN (DTCNN) [5]. Due to the synchronous
processing in DTCNN, robust control over the propagation velocity is possible,
facilitating the extension to multilayer structures [6]. This makes it possible to
approach the global problem directly. However, the high complexity of the dynamical
behaviour of this kind of structure makes most of the learning algorithms applied
to single-layer structures unsuitable. Usually the learning process in multilayer
systems is tackled by optimizing each layer independently,
either heuristically or by means of single-layer learning algorithms. However, a
global training process, in which all the weights of the different
layers are optimized at the same time, can be of interest.
In this work we present a global learning strategy for multilayer DTCNN
architectures. We apply a stochastic optimization method, namely Genetic
Algorithms (GA), to simultaneously optimize all the weights of the different layers
of the system. To prove the generality of the method, we applied it to different
image processing tasks projected onto multilayer DTCNN. First we tackled
the problem of training a system to perform the skeletonization of arbitrary
binary images. Next, edge detection in general grayscale images was considered.
Finally, a novel active-contour-based technique for image segmentation is
approached using this learning strategy.
In Section 2 the notions of multilayer DTCNN architectures are briefly
recalled. Section 3 describes general GA characteristics and discusses the specific
GA used. Application examples of the GA-based training process are in Section 4,
and the final conclusions and discussion are in Section 5.
Single-layer DTCNN [5] have been shown to be an efficient tool in image processing
and pattern recognition tasks. They are completely described by a recursive
algorithm, and their dynamic behaviour is based on the feedback of clocked,
binary outputs.
The equations which govern the behaviour of a multilayer DTCNN with
time-variant templates are [6]:

x_c^l(k) = Σ_{d∈N_r(c)} a_c^d(k)·y_d^l(k) + Σ_{d∈N_r(c)} b_c^d(k)·u_d^l + i^l(k)  (1)

y_c^l(k+1) = f(x_c^l(k)) = +1 if x_c^l(k) > 0, −1 if x_c^l(k) < 0  (2)
where u_c^l, x_c^l(k) and y_c^l(k) are the input, internal state and output of the
c-th cell in layer l, respectively. The inputs u_c^l have continuous values, and the
outputs y_c^l(k) are binary valued. The summations are performed within the
neighbourhood N_r(c) of a cell c, which is defined as the set of all cells within
distance r, including cell c itself. The feedback coefficients a_c^d(k) ∈ R, the control
coefficients b_c^d(k) ∈ R and the thresholds i^l(k) are called templates. For
our purpose they are time variant and translation invariant, so the set of
weights which characterizes the topology of the network is greatly reduced, making
the learning process easier.
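The synchronous update of equation (2) can be sketched for a single layer with translation-invariant 3x3 templates as follows (the function, the boundary handling and the treatment of the state-zero case are our illustrative assumptions, not the paper's implementation):

```python
import numpy as np
from scipy.signal import correlate2d

def dtcnn_step(x, u, A, B, i_bias):
    """One synchronous DTCNN update: the new state is the template
    correlation of the current binary output plus the control term and
    the threshold; the output is a hard limiter (+1/-1) on the state.
    States that are exactly zero are mapped to -1 here (a convention)."""
    y = np.where(x > 0, 1.0, -1.0)  # hard-limiter output f(x)
    x_new = (correlate2d(y, A, mode='same', boundary='fill', fillvalue=-1.0)
             + correlate2d(u, B, mode='same', fillvalue=0.0)
             + i_bias)
    return x_new, np.where(x_new > 0, 1.0, -1.0)

# Toy usage: with zero feedback and an identity control template,
# the output is simply the sign of the input image.
u = np.array([[0.5, -0.5], [-1.0, 2.0]])
A = np.zeros((3, 3))
B = np.zeros((3, 3)); B[1, 1] = 1.0
state, out = dtcnn_step(np.zeros((2, 2)), u, A, B, 0.0)
print(out)
```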
where y^c(k_end) is the output of cell c once convergence is reached at time
interval k_end, and y_d^c is the desired output value. The total error value is
computed over all the cells of the network.
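The total error over all cells serves as the fitness that the GA minimizes, with all template coefficients of all layers concatenated into one real-valued chromosome. A generic real-coded GA sketch is given below; the operators chosen (truncation selection, one-point crossover, Gaussian mutation) and all parameter values are illustrative assumptions, not the paper's exact algorithm:

```python
import random

def evolve_templates(fitness, n_genes, pop_size=30, generations=50,
                     p_mut=0.1, sigma=0.3):
    """Minimize `fitness` over a flat vector of template coefficients
    (all layers concatenated) with a simple real-coded GA."""
    pop = [[random.uniform(-1, 1) for _ in range(n_genes)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        elite = pop[:pop_size // 2]               # truncation selection
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = random.sample(elite, 2)
            cut = random.randrange(1, n_genes)    # one-point crossover
            child = a[:cut] + b[cut:]
            child = [g + random.gauss(0, sigma) if random.random() < p_mut
                     else g for g in child]       # Gaussian mutation
            children.append(child)
        pop = elite + children
    return min(pop, key=fitness)

# Toy check: recover a known coefficient vector by minimizing squared error.
target = [0.5, -0.2, 0.8]
best = evolve_templates(lambda w: sum((wi - ti) ** 2
                                      for wi, ti in zip(w, target)),
                        n_genes=3)
```

In the paper's setting, evaluating the fitness of one chromosome means running the multilayer DTCNN to convergence on the training pattern and accumulating the output error over all cells.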
4 Application Examples
[The 3x3 northwestern edge-detection template A_NW, with entries marked a, b, c and d, is not legible in the source.]
The b coefficient corresponds to the pixel under processing, and the positions
marked a and b to white and black neighbours respectively. The value of the
remaining three neighbours is don't-care, because they do not take part in the
decision of whether a pixel belongs to the northwestern edge or not, and they
correspond to positions marked c. In fact, this coefficient could be set to 0, but we will
also optimize its value during the training process. The processing pattern in
As can be noticed, the value for coefficient c, which measures the influence of
the don't-care neighbours, is nearly 0. Applying the corresponding templates
to a general test image with multiple objects, the result in Fig. 2 was obtained.
Fig. 3. Edge detection training phase: input and target images, respectively
Fig. 4. Edge detection test phase: input and output images, respectively
of interest. The region edges will be eroded based on certain local information
until the final contours coincide with those of the objects under consideration.
Unlike the classical techniques, the optimum location of the final contour
is now achieved not by means of a global minimization of the energy of the
snake but as a result of the local processing of an external energy. Furthermore,
this strategy easily allows an initial contour to split into several ones in
order to delimit different objects in the image under processing. A block
diagram of the proposed structure is shown in Fig. 5.
The first two layers act consecutively for each of the four cardinal directions
until convergence is reached, that is, until the erosion-layer output remains
unchanged after two consecutive iterations. The number of iterations needed to
reach convergence depends on the shape and size of the initial region. For the
training phase we only considered the optimization of the first
two layers. The third layer performs an edge detection task on binary images in
order to have the contour of the region in the erosion-layer output as the global
output of the system. Templates for this simple task can be found in the CNN
literature.
In order to determine valid templates for this application we have carried out
a learning procedure using the training pattern in Figure 6. The input represents
the energy image. In this case it only includes external energy forces, in such a
way that the gray level associated with each pixel in the image is a function of
the distance to the closest border. The desired output is represented by a binary
image where the objects are perfectly defined.
[The learned templates A_EPN, B_EPN and I_EPN are only partially legible in the source; recoverable entries include −0.16, −0.47 and 0.54, with I_EPN = −0.15.]
Figure 7 shows the evolution of the region contour from its initial state to
its final location, superposed on the energy image. Note that the evolution starts
from a single contour, which is broken in order to delimit two different objects.
5 Conclusions
In this paper we have shown that GAs can be successfully used as a global
learning strategy for multilayer DTCNN architectures that perform complex tasks
on general grayscale images. This method makes it possible to simultaneously optimize
all the coefficients of the different layers of the network instead of applying a
different learning strategy to each one. Due to the global nature of the training
process, the only information required is the global input and output of the
overall structure.
Fig. 7. Example of the evolution of the contour superposed on the energy image
References
1. Chua, L.O., Yang, L.: Cellular Neural Networks: Theory and Applications. IEEE
Transactions on Circuits and Systems. 35 (1988) 1257-1290
2. Nossek, J.A.: Design and learning with Cellular Neural Networks. International
Journal of Circuit Theory and Applications. 24 (1996) 15-24
3. Roska, T., Chua, L.O.: The CNN Universal Machine: An analogic array computer.
IEEE Transactions on Circuits and Systems. 40 (1993) 163-173
4. Roska, T., Kék, L., Nemes, L., Zarándy, Á., Brendel, M.: CSL-CNN Software
Library, Version 7.1. DNS-CADET-15, Analogical and Neural Computing Laboratory,
Computer and Automation Institute, Hungarian Academy of Sciences.
5. Harrer, H., Nossek, J.A.: Discrete-Time Cellular Neural Networks. International
Journal of Circuit Theory and Applications. 20 (1992) 453-467
6. Harrer, H.: Multiple layer Discrete-Time Cellular Neural Networks using time
variant templates. IEEE Transactions on Circuits and Systems-II: Analog and Digital
Signal Processing. 40 (1993) 191-199
7. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning.
Addison-Wesley Publishing Company. (1989)
8. Matsumoto, T., Chua, L.O., Yokohama, T.: Image Thinning with Cellular Neural
Networks. IEEE Transactions on Circuits and Systems. 37 (1990) 638-640
9. Venetianer, P.L., Werblin, F., Roska, T., Chua, L.O.: Analogic CNN Algorithms for
Some Image Compression and Restoration Tasks. IEEE Transactions on Circuits
and Systems-I: Fundamental Theory and Applications. 42 (1995) 278-283
10. Harrer, H., Venetianer, P.L., Nossek, J.A., Roska, T., Chua, L.O.: Some Examples
of Preprocessing Analog Images with Discrete-Time Cellular Neural Networks. IEEE
International Workshop on Cellular Neural Networks and their Applications. Rome,
Italy. (1994) 18-21
11. Vilariño, D.L., Brea, V.M., Cabello, D., Pardo, J.M.: Discrete-Time CNN for Image
Segmentation by Active Contours. Pattern Recognition Letters. 19 (1998) 721-734
12. Vilariño, D.L., Cabello, D., Balsi, M., Brea, V.M.: Image Segmentation Based on
Active Contours Using Discrete-Time Cellular Neural Networks. IEEE International
Workshop on Cellular Neural Networks and their Applications. London, England.
(1998) 331-336
How to Select the Inputs for a Multilayer Feedforward
Network by Using the Training Set
1 Introduction
Neural networks (NNs) are used in quite a variety of real-world applications, in which
one can usually measure a large number of variables that can be used as potential
inputs. One clear example is the extraction of features for object recognition [1]:
many different types of features can be utilized, such as geometric features,
morphological features, etc. However, not all the variables that can be collected are
usually equally informative: they may be noisy, irrelevant or redundant.
Feature selection is the problem of choosing, from a larger set of candidate features,
a small subset of features ideally necessary and sufficient to perform the
classification task. Feature selection has long been one of the most important topics in
pattern recognition, and it is also an important issue for NNs. If one could select a
subset of variables, one could reduce the size of the NN, the amount of data to process
and the training time, and possibly increase the generalization performance. This last
result is known in the literature and is ratified by our results.
Feature selection is also a complex problem: we need a criterion to measure the
importance of a subset of variables, and that criterion will depend on the classifier. A
subset of variables could be optimal for one system and very inefficient for another.
In the literature there are several potential ways to determine the best subset of
features: analyzing all subsets, genetic algorithms, a heuristic stepwise analysis and
direct estimations.
In the case of NNs, direct estimation methods are preferred because of the
computational complexity of training a NN. Inside this category we can distinguish
further: methods based on the analysis of the training set [2], [3], [4],
[5], [6], [7], [8], methods based on the analysis of a trained multilayer feedforward
network [9], and methods based on the analysis of other specific architectures [10].
The purpose of this paper is to make a brief review of the methods based on an
analysis of the training set, present a methodology to compare them, and present the
first empirical comparison among them.
In the next section we review the methods; in Section 3 we present the comparison
methodology, the experimental results and an ordination of the methods according to
their performance; and we finally conclude in Section 4.
2 Theory
The first method reviewed was proposed by Battiti [4] (from here on named BA). The
algorithm is:
1. (Initialization) Set F to the whole set of p features and S to the empty set.
2. Compute the mutual information I(C,f) for each feature f ∈ F and the full set of
classification classes C = (c_1, c_2, ..., c_M).
3. Find the feature f that maximizes I(C,f). Then include f in S and extract f from F:
S = S ∪ {f}, F = F − {f}.
4. Repeat until the cardinal of S is k.
4.1 Compute the mutual information I(f,s) between each f ∈ F and each s ∈ S.
4.2 Choose g as the feature that maximizes the following equation:

I(C,f) − β · Σ_{s∈S} I(f,s)  (1)

and set S = S ∪ {g}, F = F − {g}.
The mutual information between the classes and a feature is computed as

I(C,f) = Σ_{k=1..N} Σ_{i=1..M} p((f ∈ r_k(f)) ∧ c_i) · log2 [ p((f ∈ r_k(f)) ∧ c_i) / (p(c_i) · p(f ∈ r_k(f))) ]  (2)

where r_1(f), r_2(f), ..., r_N(f) is a partition resulting from dividing the range of f values into
equal parts, and the above sum is over all these parts of f and all the classes c_1, c_2, ...,
c_M. The appropriate number of parts is usually between 16 and 32; we have used 24 in
our experiments. In the equations, p denotes probability.
Analogously:

I(f,s) = Σ_{k,j=1..N} p((f ∈ r_k(f)) ∧ (s ∈ r_j(s))) · log2 [ p((f ∈ r_k(f)) ∧ (s ∈ r_j(s))) / (p(s ∈ r_j(s)) · p(f ∈ r_k(f))) ]  (3)
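The mutual information I(C,f) used by Battiti's method is a histogram estimate over equal-width bins. A minimal sketch, using the 24 bins reported in the experiments (function name and binning details are our illustrative choices):

```python
import numpy as np

def mutual_information(f, labels, n_bins=24):
    """Histogram estimate of I(C, f): discretize f into equal-width
    bins r_1(f)..r_N(f) and sum p(joint)*log2(p(joint)/(p(bin)*p(class)))
    over bins and classes."""
    edges = np.linspace(f.min(), f.max(), n_bins + 1)
    fk = np.clip(np.digitize(f, edges) - 1, 0, n_bins - 1)
    mi = 0.0
    for k in range(n_bins):
        p_k = np.mean(fk == k)
        if p_k == 0:
            continue
        for c in np.unique(labels):
            p_c = np.mean(labels == c)
            p_joint = np.mean((fk == k) & (labels == c))
            if p_joint > 0:
                mi += p_joint * np.log2(p_joint / (p_k * p_c))
    return mi

# A feature that perfectly separates two equiprobable classes carries 1 bit.
f = np.array([0.0] * 50 + [1.0] * 50)
labels = np.array([0] * 50 + [1] * 50)
print(mutual_information(f, labels))
```

The greedy MIFS loop then repeatedly picks the candidate maximizing I(C,f) − β·Σ_{s∈S} I(f,s), with I(f,s) estimated the same way over pairs of binned features.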
Another method was proposed by Chi [7] (from here on named CHI). He defined an
entropy measurement for a feature f as:

CH(f) = −Σ_{k=1..N} p(f ∈ r_k(f)) · Σ_{i=1..M} p((f ∈ c_i) ∧ (f ∈ r_k(f))) · log2 p((f ∈ c_i) ∧ (f ∈ r_k(f)))  (4)

This magnitude is always positive, and a feature is considered more important as its
CH measurement decreases. This method also allows an ordination of the features
according to their importance.
Setiono [6] proposed another method (from here on called SET) based on the use of
normalized information gains G'_i of each feature f_i to estimate its importance.
The normalized information gain can be calculated with the following equations:

I(S) = −Σ_{j=1..M} (n_j / n) · log2 (n_j / n),   I(S_ik) = −Σ_{j=1..M} (n_ikj / n_ik) · log2 (n_ikj / n_ik)  (5)

where n_j is the number of samples x belonging to class c_j (x ∈ c_j), n is the total number
of samples in the training set S, n_ik is the number of samples for which feature f_i takes
a value inside r_k(f_i), and n_ikj is the number of samples x for which f_i ∈ r_k(f_i) and x ∈ c_j.
And finally:

E_i = Σ_{k=1..N} (n_ik / n) · I(S_ik),   I_i = −Σ_{k=1..N} (n_ik / n) · log2 (n_ik / n),   G_i = I(S) − E_i,   G'_i = G_i / I_i  (6)

The method also allows an ordination of the features in a way similar to Battiti's
method: the first selected feature is considered the most important and the last
selected the least important.
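Equations (5) and (6) together compute a gain ratio over the discretized feature. A compact sketch (the binning granularity and names are our assumptions):

```python
import numpy as np

def normalized_gain(f, labels, n_bins=24):
    """Setiono's normalized information gain G'_i = (I(S) - E_i) / I_i
    for one feature, after equal-width discretization."""
    edges = np.linspace(f.min(), f.max(), n_bins + 1)
    fk = np.clip(np.digitize(f, edges) - 1, 0, n_bins - 1)

    def entropy(x):
        _, counts = np.unique(x, return_counts=True)
        p = counts / len(x)
        return -np.sum(p * np.log2(p))

    i_s = entropy(labels)           # I(S), class entropy
    e_i, split = 0.0, 0.0           # E_i and I_i (split entropy)
    for k in np.unique(fk):
        w = np.mean(fk == k)        # n_ik / n
        e_i += w * entropy(labels[fk == k])
        split -= w * np.log2(w)
    return (i_s - e_i) / split

# A perfectly class-separating feature has gain ratio 1.
f = np.array([0.0] * 50 + [1.0] * 50)
labels = np.array([0] * 50 + [1] * 50)
print(normalized_gain(f, labels))
```

Dividing by the split entropy I_i penalizes features that merely fragment the data into many bins, the same correction used by gain-ratio decision-tree splitting criteria.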
The GD distance between a set of features F and the classes C is calculated from the
transinformation matrix T, whose last row is I(f_p,f_1), I(f_p,f_2), ..., I(f_p,f_p). The
mutual information between features, I(f_i,f_j), can be calculated as described for
method BA, eq. (3). [The expressions for the GD distance and for the Mántaras
distance are not legible in the source.]
The next method is quite popular in pattern recognition and is called Relief [3]
(from here on RLF). We review here a simpler version for problems of two classes;
see the reference for the complete method, which supports multiclass problems
and missing attributes. The algorithm is basically the following:
1. We will call S the training set, c_1 and c_2 the two classification classes, and p the
number of inputs or features in every instance of the training set.
2. Initialize a weight vector of feature relevances W = (w_1, ..., w_p) randomly around 0.5.
3. Repeat the following an appropriate number of steps, m:
3.1 Choose at random an instance x = (x_1, x_2, ..., x_p) from S.
3.2 Choose at random two instances z, y closest to x, with z ∈ c_1 and y ∈ c_2.
3.3 If (x ∈ c_1) then N_hit = z; N_miss = y; else N_hit = y; N_miss = z.
3.4 Update the weights:
for i = 1 to p: w_i = w_i − diff(x_i, N_hit_i)^2 + diff(x_i, N_miss_i)^2
4. Normalize the relevance of the features: Relevance = (1/m)·W.
The difference diff of two features f_i and s_i of two samples f and s is, for nominal
values:

diff(f_i, s_i) = 0 if f_i and s_i are the same; 1 if f_i and s_i are different  (11)
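The Relief loop above can be sketched for continuous two-class data as follows. Two simplifications are ours: diff is taken as the raw difference (so the update uses squared differences), and the hit/miss instances are the deterministic nearest neighbours rather than randomly chosen close ones:

```python
import numpy as np

def relief(X, y, n_steps=100, rng=None):
    """Two-class Relief sketch: feature weights grow when a feature
    agrees with the nearest hit and differs from the nearest miss."""
    rng = rng or np.random.default_rng(0)
    n, p = X.shape
    w = np.full(p, 0.5)                       # initialized around 0.5
    for _ in range(n_steps):
        i = rng.integers(n)
        x, c = X[i], y[i]
        d = np.sum((X - x) ** 2, axis=1)
        d[i] = np.inf                         # exclude x itself
        hit = X[np.where(y == c, d, np.inf).argmin()]
        miss = X[np.where(y != c, d, np.inf).argmin()]
        w += -(x - hit) ** 2 + (x - miss) ** 2
    return w / n_steps                        # Relevance = (1/m) * W

# Feature 0 separates the classes, feature 1 is pure noise.
rng = np.random.default_rng(1)
X = np.column_stack([np.r_[np.zeros(20), np.ones(20)], rng.random(40)])
y = np.r_[np.zeros(20), np.ones(20)]
rel = relief(X, y, rng=rng)
print(rel[0] > rel[1])
```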
d_ij = sqrt( Σ_{q=1..p} w_q · (f_q^i − f_q^j)^2 )  (13)

where q = 1, ..., p denotes every feature of the instances y^i, y^j, with y^i = (f_1^i, f_2^i, ..., f_p^i) and
y^j = (f_1^j, f_2^j, ..., f_p^j), and there is a weight w_q for each feature.
[The definitions in equations (14)-(17), including the pair sums N_S and N_D over i = 1..N, j = i+1..N, are not legible in the source.]
Finally, we review a method proposed by Thawonmas [2] (from here on FUZ). The
fuzzy rules are generated as follows: let X_i be the set of samples which belong to
class c_i; we can generate an activation hyperbox of level 1, A_ii(1), i = 1, ..., M, by
finding the maximum and minimum values of each input variable over X_i. Then, if
A_ii(1) and A_jj(1) overlap, we can define an inhibition hyperbox I_ij(1) of level 1 as the
intersection of the hyperboxes A_ii(1) and A_jj(1). The fuzzy rule would be:
- If the sample x is in A_ii(1) and is not in I_ij(1) for every j, j ≠ i, then x belongs to class
c_i.
After that, we can define new activation hyperboxes of level 2, A_ij(2) and A_ji(2), by
taking into account the samples included in the inhibition hyperbox I_ij(1) which
belong to classes i and j respectively. If A_ij(2) and A_ji(2) overlap, we can define a new
inhibition hyperbox of level 2. This process continues until no inhibition hyperboxes
are found.
Then, we can define the exception ratio o_ij(F) for a set of features F as the sum, over
all levels, of the ratio between the volume of the inhibition hyperbox of level n, I_ij(n),
and the volume of the activation hyperbox A_ij(n), multiplied by the probability of
finding a sample inside the inhibition hyperbox. See the reference for a more complete
explanation and the exact equations.
The total exception ratio O(F) is the sum of o_ij(F) over all i, j with i ≠ j.
After all these calculations, we apply the following algorithm:
1. (Initialization) Set F to the whole set of p features and S to the empty set.
2. Compute O(F − {f}) for every feature f ∈ F.
3. Find the feature g that minimizes the total exception ratios calculated above.
4. Include g in S and extract g from F: F = F − {g} and S = S ∪ {g}.
5. Repeat steps 2) and 3) until the cardinal of S is k.
This method allows an ordination of the features, as method BA does.
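The selection loop of steps 1-5 is independent of how O(.) is computed; a sketch with the total exception ratio supplied as a black-box function (the fuzzy hyperbox machinery itself is not reproduced here):

```python
def fuz_select(features, total_exception_ratio, k):
    """Steps 1-5 of the FUZ selection loop: repeatedly move to S the
    feature whose removal from F minimizes O(F - {f}).
    `total_exception_ratio` is a caller-supplied stand-in for O(.)."""
    F, S = set(features), []
    while len(S) < k and F:
        g = min(F, key=lambda f: total_exception_ratio(F - {f}))
        F.remove(g)
        S.append(g)
    return S

# Toy stand-in: O(F) just sums a per-feature "overlap cost", so the
# costliest features are moved to S first.
cost = {1: 5, 2: 1, 3: 10, 4: 2}
S = fuz_select([1, 2, 3, 4], lambda F: sum(cost[f] for f in F), k=2)
print(S)
```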
We have pointed out that all the methods allow an ordination of the inputs
according to their importance. Besides that, there is no simple way to choose the
cardinal k of the subset of inputs that should be selected for an application.
Furthermore, the ordination of the inputs depends on the method: two ordinations
obtained from two different methods will in general be different, and therefore so
will their performance.
Every method is based on some heuristic principle, and because of the complexity of
the problem the only way to compare them may be empirical. This is described
and carried out in the following section.
3 Experimental results
Mushroom (LE), The Monk's Problems (M1, M2, M3), Pima Indians Diabetes (PI),
Voting Records (VO) and Wisconsin Breast Cancer (WD). The complete data of the
problems and a full description of them can be found in the UCI repository.
In all the problems, we have included a first useless input generated at random in the
interval [0,1]. It is interesting to see how important this input is considered to be by
each method.
For each problem and method we have obtained an ordination of feature importance.
For example, the ordination of all the methods for problem AB is in Table 1 (BA 0.5
means Battiti's method with β parameter equal to 0.5).
Table 1. Input importance ordination for all methods and problem AB.
Method Least Important Most Important
BA 0.5 6 3 4 8 7 5 1 2 9
BA 0.6 6 3 4 8 7 5 1 2 9
BA 0.7 6 4 3 8 7 5 1 2 9
BA 0.8 6 4 3 8 7 5 1 2 9
BA 0.9 6 4 3 8 7 5 2 1 9
BA 1 6 4 3 8 7 5 2 1 9
SET 1 7 3 6 4 8 9 2 5
CHI 1 2 7 5 3 8 6 4 9
GD DST 1 2 7 8 5 6 9 3 4
RLF Not applicable
SCH 2 4 6 9 3 8 7 5 1
FUZ 2 3 5 4 8 7 9 1 6
After that, we obtained several input subsets by successively deleting the least
important input, down to a final subset of one input. For example, using the results of
Table 1 for method BA 0.5, the first subset is obtained by deleting input {6}, the
following subset is obtained by deleting inputs {6,3}, and the final subset of one input
is {9}.
For every subset we wanted to obtain the performance of a classifier to see how good
the subset is; the classifier of interest is Multilayer Feedforward. We trained several
multilayer feedforward networks to get a mean performance independent of the initial
conditions (initial weights), and also an error for the mean, by using standard error
theory [11]. The performance criterion was the percentage of correct classifications on
the test set. The number of neural networks trained for each subset was a minimum of
ten; in many cases we trained many more than ten networks in order to diminish the
error in the mean. The maximum permitted error in a measurement was 3%.
In Table 2, we can see the results of all the methods for problem AB.
Then, for each method and problem we can obtain what we call the optimal subset.
This subset is the one which provides the best performance or, in the case of two
subsets with indistinguishable performance, the one with the lower number of inputs,
because it provides a simpler neural network model.
In order to see whether the performances are distinguishable we have performed
t-tests. The hypothesis tested was the null hypothesis μ_A = μ_B, i.e. that the two
mean performances are indistinguishable. In the case that this null hypothesis can be
rejected, we conclude that the difference between the two measurements is significant.
The significance level of the tests, α, was 0.1.
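This comparison of two mean performances can be sketched with a standard two-sample t-test (a SciPy-based illustration; the accuracy values below are made up for the example, not the paper's data):

```python
from scipy import stats

def indistinguishable(perf_a, perf_b, alpha=0.1):
    """Two-sample t-test on per-run test accuracies: returns True when
    the null hypothesis mu_A = mu_B cannot be rejected at level alpha."""
    _, p_value = stats.ttest_ind(perf_a, perf_b)
    return bool(p_value >= alpha)

# Ten training runs per subset (illustrative accuracies).
subset_a = [55.1, 55.6, 55.3, 55.8, 55.2, 55.5, 55.4, 55.0, 55.7, 55.4]
subset_b = [58.2, 58.9, 58.6, 59.1, 58.4, 58.8, 58.7, 58.3, 59.0, 58.6]
print(indistinguishable(subset_a, subset_b))
```

When two subsets are indistinguishable under this test, the tie-breaking rule of the paper applies: prefer the subset with fewer inputs.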
For example, from the results of Table 2, the mean performance of method BA 0.5 for
the subset with 2 omitted inputs is the best, but this performance is indistinguishable
(according to the described t-test) from that of the subsets with 6 and 7 omitted
inputs. Since the performance differences among subsets 2, 6 and 7 are not
significant, we should select the subset with the lower number of inputs, the subset
with 7 omitted inputs, as the optimal one. The inputs in this subset are the most
appropriate for designing an application for problem AB according to method BA 0.5.
Also, for method SCH, there are two subsets, 1 and 3, with the same performance,
which is better than and distinguishable from that of the rest of the subsets. Again, we
should select the subset with the lower number of inputs, subset 3, as the optimal one.
After obtaining the optimal subsets for each method and problem, we can compare the
performance of different methods on the same problem by comparing the performance
of their optimal subsets and, in the case of indistinguishable performance, their
numbers of inputs. Again, we performed t-tests to detect significant differences.
For example, the results for methods BA 0.5 (55.4±0.4) and SCH (58.7±0.7) in Table 2
are distinguishable, and we can conclude that the performance of method SCH is
better for problem AB.
By comparing all the methods two by two, we can obtain another table showing
whether one method is better or worse than another for a concrete problem. An
extract of that table is in Table 3.
For example, we can see that method BA 0.5 is better than SCH in problems BL, BN,
CR, DI, GL, HE, M1, M3, PI and VO. The number of problems where BA 0.5 performs
better is larger, and so we can expect its performance to be better than that of SCH.
Following this methodology and this type of comparison with the full results (which
we do not present for lack of space), we obtain the following ordination:
GD DST > BA 0.5 > RLF = BA 0.7 = BA 0.9 = BA 1.0 > BA 0.6 = BA 0.8 = SET >
> CHI > FUZ > SCH
The best method is the GD distance, followed by Battiti with a low value of β
(β = 0.5); between methods RLF and SET the differences are not very important; and
the worst methods are clearly SCH and FUZ.
However, we can and should further discuss the applicability of each method. For
example, to apply the GD distance the transinformation matrix T must not be
singular, and we have found a singular matrix, within the working precision (double
floating point), in 3 of the 15 problems. This method is the best, but because of
its limited applicability we may prefer Battiti with a low value of β.
Another method with limited applicability is FUZ: consider the case where the
activation hyperboxes of level 1, A_ii(1), i = 1, ..., M, do not overlap. In that case there
are no inhibition hyperboxes and therefore the method is not applicable. We found
this situation in 3 of the 15 problems.
Finally, the Relief method (RLF) was not applicable, for several reasons, in 4 of the 15
problems.
Another important question is the computational complexity. The methods based on
an analysis of the training set are usually characterized by a low computational cost.
This is true for all the methods reviewed except Scherf's (SCH), which performs a
gradient descent search with a computational cost larger than the training of a neural
network.
4 Conclusions
References
1. Devena, L.: Automatic selection of the most relevant features to recognize objects.
Proc. of the Int. Conf. on Artificial NNs, vol. 2, pp. 1113-1116, 1994.
2. Thawonmas, R., Abe, S.: Feature reduction based on analysis of fuzzy regions.
Proc. of the 1995 Int. Conf. on Neural Networks, vol. 4, pp. 2130-2133, 1995.
3. Kira, K., Rendell, L.A.: The feature selection problem: Traditional methods and a
new algorithm. Proc. of the 10th Nat. Conf. on Artif. Intellig., pp. 129-134, 1992.
4. Battiti, R.: Using mutual information for selecting features in supervised neural net
learning. IEEE Trans. on Neural Networks, vol. 5, n. 4, pp. 537-550, 1994.
5. Scherf: A new approach to feature selection. Proc. of the 6th Conf. on Artificial
Intelligence in Medicine (AIME'97), pp. 181-184, 1997.
6. Setiono, R., Liu, H.: Improving Backpropagation learning with feature selection.
Applied Intellig.: The Int. Journal of Artif. Intellig., NNs, and Complex Problem-
Solving Technologies, vol. 6, n. 2, pp. 129-139, 1996.
7. Chi, Jabri: An entropy based feature evaluation and selection technique. Proc. of
the 4th Australian Conf. on NNs (ACNN'93), pp. 193-196, 1993.
8. Lorenzo, Hernández, Méndez: Attribute selection through a measurement based on
information theory (in Spanish). 7ª Conferencia de la Asociación Española para la
Inteligencia Artificial (CAEPIA 1997), pp. 469-478, 1997.
9. Tetko, I.V., Villa, A.E.P., Livingstone, D.J.: Neural network studies 2. Variable
selection. Journal of Chem. Inf. Comput. Sci., vol. 36, n. 4, pp. 794-803, 1996.
10. Watzel, R., Meyer-Bäse, A., Meyer-Bäse, U., Hilberg, H., Scheich, H.:
Identification of irrelevant features in phoneme recognition with radial basis
classifiers. Proc. of the 1994 Int. Symp. on Artificial NNs, pp. 507-512, 1994.
11. Bronshtein, I., Semendyayev, K.: Mathematics Handbook for Engineers and
Students (in Spanish). MIR, Moscow, 1977.
Neural Implementation of the JADE-Algorithm
1 Introduction
Principal Component Analysis (PCA) is a well known tool for multivariate data
analysis and signal processing. PCA finds the orthogonal set of eigenvectors of
the covariance matrix and therefore responds to second-order information in the
input data. One frequent application of PCA is dimensionality reduction. But
second-order information is only sufficient to describe data that are Gaussian
or close to Gaussian. In all other cases higher-order statistical properties must
be considered to describe the data appropriately. A recent technique that
includes PCA and also uses higher-order statistics of the input is Independent
Component Analysis (ICA).
The basic assumption needed to perform an ICA is a linear mixture model
representing an n-dimensional real vector x = [x_0, ..., x_{n−1}]^T as a superposition
of m linearly independent but otherwise arbitrary n-dimensional signatures a^(p),
0 ≤ p < m, forming the columns of an n × m-dimensional mixing matrix
A = [a^(0), ..., a^(m−1)]. The coefficients of the superposition, interpreted as an
m-dimensional vector s = [s_0, ..., s_{m−1}]^T, lead to the following basic equation of
linear ICA:

x = As.  (1)
The influence of an additional noise term is assumed to be negligible and will not
be considered here. The components of s are often called source signals, those
of x mixtures. This reflects the basic assumption that x is given as a mixture of
the source signals s; x is the quantity that can be measured. It is often
assumed that the number of mixtures equals the number of sources (n = m).
A few requirements on the statistical properties of the sources have to be
met for ICA to be possible [4]. The source signals are assumed to be statistically
independent and stationary processes, with at most one of the sources following a
normal distribution, i.e. having zero kurtosis. Additionally, for the sake of simplicity
it can be taken for granted that all source signals are zero mean: E{s_i} = 0,
0 ≤ i < m.
The implementation of an ICA can in principle be seen as the search for an
m × n-dimensional linear filter matrix W = [w^(0), ..., w^(m−1)]^T whose output

y = Wx,   y_i = Σ_{j=0..n−1} w_j^(i) x_j,

reconstructs the source signals s. Ideally the problem could be solved by choosing
W according to

WA = I_m,

where I_m represents the m-dimensional unit matrix. But it is clear that the
source signals can only be recovered arbitrarily permuted and with a scaling factor,
possibly leading to a change of sign, because there is a priori no predetermination
of which filter leads to which source signal. This means it is impossible to
distinguish As from A's' with A' = A(PS) and s' = (PS)^{−1} s, where P represents
an arbitrary orthogonal permutation matrix and S a scaling matrix with nonzero
diagonal elements [2].
The determination of an arbitrary mixing matrix A can be reduced to the problem of finding an orthogonal matrix U by using second-order information of the input data [3][9][12]. This can be done by whitening or sphering the data via an $m \times n$-dimensional whitening matrix $W_S$ obtained from the correlation matrix $R^x$ ($R^x_{ij} \stackrel{\mathrm{def}}{=} E\{x_i x_j\}$) of x, leading to

$$z = W_S x = W_S A s = U s. \qquad (2)$$
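The whitening step can be sketched as follows; constructing $W_S$ from an eigendecomposition of the sample correlation matrix is one standard choice, and the data and mixing matrix here are assumed for illustration:

```python
import numpy as np

def whitening_matrix(x):
    """Whitening (sphering) matrix W_S from the sample correlation matrix
    R^x_ij = E{x_i x_j} of the zero-mean data x (shape n x T)."""
    R = (x @ x.T) / x.shape[1]           # sample estimate of R^x
    eigval, E = np.linalg.eigh(R)        # R = E diag(eigval) E^T
    return E @ np.diag(eigval ** -0.5) @ E.T

rng = np.random.default_rng(1)
s = rng.uniform(-1, 1, (2, 5000))        # independent zero-mean sources (illustrative)
A = np.array([[1.0, 0.6], [0.2, 1.0]])   # assumed mixing matrix
x = A @ s
Ws = whitening_matrix(x)
z = Ws @ x                               # whitened data, cf. Eq. (2)
print(np.round((z @ z.T) / z.shape[1], 2))   # ~ identity matrix
```

By construction $W_S R^x W_S^T = I$, so the sample correlation of $z$ is the identity regardless of the source distributions.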
2.2 Representations of the cumulant matrices
For the determination of the orthogonal mixing matrix U according to the
whitened model (2) it will be necessary to represent the cumulant matrices first
by the orthogonal mixing matrix and second by an eigendecomposition of the
fourth-order cumulant of z.
At this point nothing has been assumed about the statistical structure of the sources. From the statistical independence of the components of s it follows that cumulants of all orders of s are diagonal, leading to $[\Lambda(M)]_{ij} = \delta_{ij}\, \sigma_i\, u^{(i)T} M u^{(i)}$. The FOSS is thus given as

$$\left\{ M \,\middle|\, M = \sum_{p=0}^{m-1} c_p\, u^{(p)} u^{(p)T} \right\} = \left\{ M \,\middle|\, M = U \Lambda U^T,\ \Lambda \text{ diagonal} \right\}.$$
This means that the dimensionality of the FOSS equals m, the number of sources.
$$\mathrm{Cum}(z_i, z_j, z_k, z_l) = \sum_{p=0}^{m^2-1} \lambda^{(p)} M^{(p)}_{ij} M^{(p)}_{kl}, \qquad (7)$$
where Diag(·) denotes the $m \times m$-dimensional diagonal matrix with the m arguments as diagonal elements. The joint diagonalizer D can be found by a maximization of the joint diagonality criterion [7]

$$c(V) = \sum_{p=0}^{m^2-1} \left| \mathrm{Diag}\!\left( V^T M^{(p)} V \right) \right|^2, \qquad (9)$$

where $|\mathrm{Diag}(\cdot)|$ is the norm of the vector of diagonal matrix elements. This is equivalent to a minimization of the sum of the squared off-diagonal elements,

$$\sum_{p=0}^{m^2-1} \mathrm{Off}\!\left( V^T M^{(p)} V \right). \qquad (10)$$
$$c(V) = \sum_{i,k,l=0}^{m-1} \left| \mathrm{Cum}(h_i, h_i, h_k, h_l) \right|^2, \qquad (11)$$
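The two diagonality criteria can be sketched as follows; the helper names are our own, and $\mathrm{Diag}(\cdot)$ is taken as the vector of diagonal elements while the complementary criterion sums the squared off-diagonal elements:

```python
import numpy as np

def diag_criterion(V, Ms):
    """Joint diagonality criterion c(V) of Eq. (9): squared norm of the
    diagonal of V^T M V, summed over all cumulant matrices M."""
    return sum(float(np.sum(np.diag(V.T @ M @ V) ** 2)) for M in Ms)

def off_criterion(V, Ms):
    """Complementary criterion (10): squared off-diagonal elements,
    summed over all matrices; zero iff V jointly diagonalizes every M."""
    total = 0.0
    for M in Ms:
        B = V.T @ M @ V
        total += float(np.sum(B ** 2) - np.sum(np.diag(B) ** 2))
    return total

# Matrices sharing one orthogonal eigenbasis U are exactly jointly
# diagonalized by V = U, so the off-criterion vanishes there.
rng = np.random.default_rng(2)
U, _ = np.linalg.qr(rng.standard_normal((3, 3)))
Ms = [U @ np.diag(rng.standard_normal(3)) @ U.T for _ in range(4)]
print(off_criterion(U, Ms))              # ~ 0: exact joint diagonalization
print(off_criterion(np.eye(3), Ms) > 0)  # True for a generic basis
```

Since the Frobenius norm of $V^T M V$ is invariant under orthogonal $V$, maximizing the diagonal criterion and minimizing the off-diagonal one are indeed equivalent.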
Using (7) and (12) leads to a new representation of the cumulant matrices. The dimension of the real vector space given by the set of cumulants of order d of an m-dimensional random vector is

$$\mathcal{D}(m,d) = \binom{m+d-1}{d}.$$

In the generic case, as defined in [8], the dimension or generic width $\mathcal{G}(m,d)$ is even smaller than $\mathcal{D}(m,d)$. A few values for $\mathcal{G}(m,d)$ are given in Table 1, compared to $\mathcal{D}(m,d)$ and $m^2$. Additionally it follows from (6) that only m out of the $m^2$ possible eigenvalues $\lambda^{(p)}$ are nonzero, which means that only m eigenmatrices should really be important.
Table 1. Comparison of the possible dimension of the real vector space given by the set of cumulants $\mathrm{Cum}(z_i, z_j, z_k, z_l)$ (d = 4) for various dimensionalities m of the real random vector z.

  m   m^2   G(m,4)   D(m,4)
  4    16       10       35
  5    25       15       70
  6    36       22      126
arbitrary orthogonal matrix, the question arises whether the orthogonal mixing matrix U can be identified by the joint diagonalizer of the eigenmatrices $M^{(p)}$, $0 \le p < m$. The answer is yes, and the reason is given by Theorem 2 in [12], which states that D is equal to the transpose (= inverse) of U up to a sign permutation and a rescaling if two conditions are fulfilled:

1. $\mathrm{Cum}(h_i, h_j) = \delta_{ij}$
2. $\mathrm{Cum}(h_i, h_j, h_k, h_l) = 0$ for at least two nonidentical indices,

with $h = D^T z$. While the first condition is fulfilled by our orthogonal model (2), condition two is given by the way the joint diagonalizer D is determined through (11).
Averaging over the ensemble of input data used for training the network leads to an eigenequation of the fourth-order correlation tensor. From equation (3) it can be seen that the main difference between the fourth-order cumulant and the corresponding correlation tensor is an explicit suppression of two-point correlations. Taking the latter into account we propose a new
$$\Delta w_{ij}(t) = \eta(t)\, y(t) \left\{ \left( z_i(t)\, z_j(t) - \delta_{ij} \right) - w_{ij}(t) \left( y(t) - \mathrm{Tr}(W) \right) \right\}$$
$\mathrm{Tr}(\cdot)$ denotes the trace of the matrix argument. The corresponding weight update rule in the case of m output neurons can be found straightforwardly to read (18), where $w_{ij}^{(p)}$ denotes the weight matrix of output neuron p connecting to input neurons i and j. The upper bound of the sum over q representing the decay term is intentionally unspecified. The learning rule can be implemented in two different ways, namely with $0 \le q < m$ (Oja-type) and $0 \le q \le p$ (Sanger-type). While in the first case the resulting weight matrices belong to approximately equal eigenvalues, the second case leads to weights whose corresponding eigenvalues are obtained in decreasing order. The latter thus can give information about the number of eigenmatrices necessary to span the FOSS.
Finding an (approximate) orthogonal joint diagonalizer D by optimizing (9) can be interpreted as determining something like an average eigenstructure [7]. Since the off-diagonal criterion (10) can only be minimized but cannot generally be driven to zero, this notion corresponds only to an approximate simultaneous diagonalization, though the average eigenstructure is nevertheless well defined.
3 Experimental Results
Fig. 1. Image ensemble used to evaluate the algorithm developed within this paper. It consists of 1. the three letters ICA, 2. a painting by Franz Marc titled 'Der Tiger', 3. an urban image, 4. normally distributed noise, 5. the Lena image and 6. a natural image gathered from the natural image ensemble used in [2]. They are all 256 x 256 pixels in size with pixel values in the interval [0, ..., 255]. The images have been normalized to yield unit variance and the mean pixel value has been subtracted from each picture.
For both the Oja- and the Sanger-type learning rule, the same mixing matrix has been used for each m under consideration.
Since statistical independence of the source signals is an important condition to separate the source signals from the mixtures, we calculated the source correlation matrix

$$S_{ij} \stackrel{\mathrm{def}}{=} \frac{1}{256^2} \sum_{x,y=0}^{255} s_i(x,y)\, s_j(x,y), \qquad 0 \le i,j < m.$$
The cross-talk error

$$E = \sum_{i=0}^{m-1} \left( \sum_{j=0}^{m-1} \frac{|u_{ij}|}{\max_k |u_{ik}|} - 1 \right) + \sum_{j=0}^{m-1} \left( \sum_{i=0}^{m-1} \frac{|u_{ij}|}{\max_k |u_{kj}|} - 1 \right) \qquad (19)$$

has been calculated to get a measure of how well the demixing or separation has been performed. The closer E is to zero the better the separation; a value $E \approx 1$-$3$ usually indicates good demixing.
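The cross-talk measure can be sketched as follows for a given global system matrix $U = WA$; the function name and the example matrices are illustrative assumptions:

```python
import numpy as np

def cross_talk_error(U):
    """Cross-talk measure in the spirit of Eq. (19): for each row and each
    column of |U|, sum the entries normalized by the largest one and
    subtract 1. The result is zero iff U is a scaled permutation matrix."""
    P = np.abs(U)
    row_term = np.sum(P / P.max(axis=1, keepdims=True)) - P.shape[0]
    col_term = np.sum(P / P.max(axis=0, keepdims=True)) - P.shape[1]
    return row_term + col_term

perm = np.array([[0.0, 2.0], [-1.5, 0.0]])   # scaled permutation: perfect demixing
mixed = np.array([[1.0, 0.9], [0.8, 1.0]])   # strongly cross-talking system
print(cross_talk_error(perm), cross_talk_error(mixed))   # 0.0 3.4
```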
Table 2 summarizes the experimental results. It can be seen that the average eigenstructure can be determined better with the Oja-type learning rule, leading to much better separation results (see also [15]). Figure 2 shows the eigenvalues obtained using the Sanger-type learning rule. For the determination of the joint diagonalizer D here only the first 10, 15, 22 (m = 4, 5, 6) eigenmatrices have been used. After convergence the weights with numbers greater than 10, 15, 22 (m = 4, 5, 6) have died away, which means that their norm converged to zero (see Table 1).
Table 2. Summary of the simulation results with m = 4, 5, 6. The table shows the cross-talk error E as defined in (19) obtained using the Oja-type and the Sanger-type learning rule (18).

  m   E (Oja)   E (Sanger)
  4      2.05         7.95
  5      1.92        16.69
  6      3.98        25.25
Fig. 2. Eigenvalues plotted against the weight number (# of weight) for m = 4, 5, 6, obtained with the Sanger-type learning rule.
4 Discussion
References
1. Anthony J. Bell and Terrence J. Sejnowski. An information-maximisation approach
to blind separation and blind deconvolution. Neural Computation, 7:1129-1159,
1995.
2. Anthony J. Bell and Terrence J. Sejnowski. The 'independent components' of
natural scenes are edge filters. Vision Research, 37(23):3327-3338, 1997.
3. Jean-Francois Cardoso. Source separation using higher order moments. In Pro-
ceedings of the ICASSP, pages 2109-2112, Glasgow, 1989.
4. Jean-Francois Cardoso. Fourth-order cumulant structure forcing, application to
blind array processing. In Proceedings of the 6th workshop on statistical signal and
array processing (SSAP 1992), pages 136-139, Victoria, Canada, 1992.
5. Jean-Francois Cardoso and Pierre Comon. Independent component analysis, a
survey of some algebraic methods. In Proceedings ISCAS 1996, pages 93-96, 1996.
6. Jean-Francois Cardoso and Antoine Souloumiac. Blind beamforming for non-
gaussian signals. IEE Proceedings - Part F, 140(6):362-370, 1993.
7. Jean-Francois Cardoso and Antoine Souloumiac. Jacobi angles for simultaneous
diagonalization. SIAM Journal on Matrix Analysis and Applications, 17(1):161-
164, 1996.
8. P. Comon and B. Mourrain. Decomposition of quantics in sums of powers of linear
forms. Signal Processing, 53(2):96-107, 1996.
9. Pierre Comon. Independent component analysis, a new concept? Signal Processing,
36:287-314, April 1994.
10. Gustavo Deco and Dragan Obradovic. An information-theoretic approach to neu-
ral computing. Perspectives in Neural Computing. Springer, New York, Berlin,
Heidelberg, 1996.
11. Te-Won Lee, Mark Girolami, Anthony J. Bell, and Terrence J. Sejnowski. A unifying information-theoretic framework for independent component analysis. International Journal on Mathematical and Computer Modeling (in press), 1998.
12. Jean-Pierre Nadal and Nestor Parga. Redundancy reduction and independent
component analysis: Conditions on cumulants and adaptive approaches. Neural
Computation, 9:1421-1456, 1997.
13. J. G. Taylor and S. Coombes. Learning higher order correlations. Neural Networks,
6:423-427, 1993.
14. Christian Ziegaus and Elmar W. Lang. Statistics of natural and urban images. Lecture Notes in Computer Science (Proceedings ICANN 1997, Lausanne), 1327:219-224, 1997.
15. Christian Ziegaus and Elmar W. Lang. Independent component extraction of
natural images based on fourth-order cumulants. In Proceedings of the ICA (Inde-
pendent Component Analysis) 1999, in press.
Variable Selection by Recurrent Neural
Networks. Application in Structure Activity
Relationship Study of Cephalosporins
Abstract. Two methods for variable selection which are efficiently implemented by Hopfield-like neural networks are described. Qualitative SAR models using the variables selected by both neural network variable selection methods are built. The biological activity of cephalosporins against Staphylococcus aureus was used as the dependent variable. The final correlations between observed and predicted activity values are good, indicating that the informative weight of the favored variables is high, providing a sound basis to select a good variable set in qualitative structure-activity relationship (SAR) modeling.
1 Introduction
Two variable selection methods based on recurrent neural models are proposed. The
first model searches the best variable subset looking for the maximal independent
set of a graph with minimum cardinality. The second one builds clusters of
analogous variables and chooses the best one of each cluster to form the most
relevant subset. The analogy function measures the variables capacity to keep
the class distribution of the data.
We build SAR models for cephalosporins using the selected variables. Cephalosporins are antibacterial compounds belonging to the β-lactam family. Their basic structure has a β-lactam ring bound to a dihydrothiazine ring, known as the cephem nucleus.
The second section describes the recurrent neural network methods for vari-
able selection and the third presents the application of the proposed methods to
a SAR study.
The approach to the discovery of relevant variables needs an accurate search for
the best solution of a minimization problem [6]. Recurrent neural networks have
been used in optimization problems.
The proposed RNN approach to variable selection has two main components: the analogy matrix and the recurrent neural network search. Following the structure of the variable selection models, the similarity matrix can be seen as an evaluation function. In this article, the variable selection is done using two recurrent models. We call them the independent set model (VSIS) and the clustering model (VSCA). Both use the same relevance function but make the search using different dynamics (energy functions).
where $i, k \in N_m$, the vector $[c]_m$ is the class distribution of the object collection, $b_{max} \in [0,1]$ is the similarity threshold between objects and $b_{min} \in [0,1]$ is the dissimilarity threshold. As can be seen, $V^j_{ik} = 1$ if objects i and k have similar measurements of the j-th variable and belong to the same class, or if the objects have different measurements of the j-th variable and belong to different classes. It means that $V^j_{ik} = 1$ if objects i and k are "well classified" by the j-th variable.
The similarity matrix $[s_{jl}]_{n \times n}$ is calculated using the formula

$$s_{jl} = \frac{\sum_{i=1}^{m-1} \sum_{k=i+1}^{m} V^j_{ik} V^l_{ik}}{m(m-1)/2}, \qquad (3)$$

and each element $s_{jl}$ of the similarity matrix equals the number of object pairs which are well classified by the variables j and l, divided by the total number of object pairs.
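The similarity matrix of equation 3 can be sketched directly from the binary "well classified" indicators; the tensor layout and all names and values below are assumptions for illustration:

```python
import numpy as np

def similarity_matrix(V):
    """Similarity s_jl of Eq. (3) from the indicators V^j_ik. V has shape
    (n_vars, m, m): V[j, i, k] = 1 when the pair of objects (i, k) is
    'well classified' by variable j; m is the number of objects."""
    n_vars, m, _ = V.shape
    iu = np.triu_indices(m, k=1)             # all object pairs i < k
    pairs = V[:, iu[0], iu[1]]               # shape (n_vars, m(m-1)/2)
    return (pairs @ pairs.T) / (m * (m - 1) / 2)

# Illustrative check with 2 variables and 3 objects (assumed indicator values):
V = np.zeros((2, 3, 3))
V[0, 0, 1] = V[0, 0, 2] = 1                  # variable 0 handles 2 of 3 pairs
V[1, 0, 1] = 1                               # variable 1 handles 1 pair
S = similarity_matrix(V)
print(S)    # S[0,0] = 2/3, S[0,1] = S[1,1] = 1/3
```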
The clustering model uses a recurrent neural network for making clusters of vari-
ables given a similarity level. Given a set of objects of some kind and a relevant
measure of similarity between these objects, the purpose of cluster analysis is
to partition the set into several clusters (subsets) in such a way that objects in
each cluster are highly similar to one another, while objects assigned to differ-
ent clusters have low degrees of similarity. The cluster analysis can be used to
perform variable selection, if a measure of similarity between variable is given.
After clustering of variables is performed, we have to select one variable from each cluster according to certain criteria. Using the similarity matrix from equation 3, for a similarity level s, $\min s_{jl} \le s \le \max s_{jl}$, we use a neural network algorithm developed by Cruz and López [9] to perform cluster analysis.
A Hopfield-like neural network with $n^2$ neurons was considered. The differential equation system expressing the state of the network at time t is:

$$\frac{dx_{ij}}{dt} = A \sum_{k \neq j} (s_{jk} - s)\, y_{ik} \qquad (4)$$

$$y_{ij} = f(x_{ij}), \qquad i, j = 1, \ldots, m$$
In this system $y_{ij}$ is the state of the ij-th neuron at a given time; $y_{ij} = 1$ if the j-th variable is placed in the i-th cluster and $y_{ij} = 0$ otherwise. The function $f(x_{ij})$ is the transfer function of the neural network. In this model the Takefuji maximum transfer function was used:

$$f(x_{ij}) = \begin{cases} 1 & \text{if } x_{ij} = \max(x_{1j}, \ldots, x_{mj}) \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$
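A minimal sketch of the dynamics (4) with the maximum transfer function (5), integrated by a simple Euler scheme; the constant A, the step size, the sign conventions and the deterministic initial states are assumptions for illustration, not taken from the paper:

```python
import numpy as np

def takefuji_max(X):
    """Maximum transfer function, Eq. (5): in every column j the neuron
    with the largest internal state x_ij fires (y_ij = 1), others are 0."""
    Y = np.zeros_like(X)
    Y[np.argmax(X, axis=0), np.arange(X.shape[1])] = 1.0
    return Y

def euler_step(X, S, s_level, A=1.0, dt=0.01):
    """One Euler step of the clustering dynamics, Eq. (4):
    dx_ij/dt = A * sum_{k != j} (s_jk - s) * y_ik."""
    Y = takefuji_max(X)
    C = S - s_level
    np.fill_diagonal(C, 0.0)              # exclude k = j from the sum
    return X + dt * A * (Y @ C)

# Three variables: 0 and 1 are similar (0.9); variable 2 is dissimilar.
S = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
X = np.array([[0.030, 0.000, 0.000],      # assumed initial internal states
              [0.000, 0.025, 0.010],
              [0.000, 0.000, 0.020]])
for _ in range(500):
    X = euler_step(X, S, s_level=0.5)
Y = takefuji_max(X)
print(Y[:, 0] @ Y[:, 1], Y[:, 0] @ Y[:, 2])   # 1.0 0.0: vars 0,1 co-cluster
```

At a similarity level of 0.5, variables 0 and 1 attract each other into one cluster while variable 2 remains alone, which matches the intended clustering behaviour.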
As in the first method, the algorithm for variable selection using cluster analysis is applied for each similarity level s equal to each different element $s_{jl}$ of the similarity matrix from equation 3, taken in increasing order. This means solving $n(n-1)/2$ differential systems (4) in the worst case. The solution was updated for every different performance of the clustering pattern. The effectiveness of the VSCA algorithm for each similarity level s was also evaluated by using the 1NN classifier.
SAR studies were carried out by means of MLPs with a v-x-1 architecture, where v is the number of descriptors and x the number of neurons in the hidden layer. The neuron in the output layer corresponds to the biological activity class. In this qualitative study the target values of biological activity presented to the networks were 0.1 and 0.9 for compounds belonging to the inactive and active classes, respectively. ANN training was performed using the backpropagation algorithm with the SNNS [25] package running on an Indigo2 R4400 workstation.
3.3 Results
The similarity matrix of the 43 molecular descriptors which describe the 105 compounds was calculated and used to apply the variable selection methods. The selected variable subsets obtained with the VSIS and VSCA models, with cardinality $N_s$ up to 10, are shown in Table 1. In general all subsets selected by both methods allowed classification with 100% of well classified patterns. Moreover, the
variables 1 and 2, which form the subsets with cardinality 1, separate the pattern collection into the two studied classes, according to the 1NN classifier. As can be
seen both methods perform similarly, selecting in general the same subsets of
variables, particularly in subsets with low cardinality.
The most favored variables in the selection were variables 1 and 2 (HOMO and LUMO) and variables 38, 40, 41, 42 and 43, corresponding to electrotopological indexes calculated on carbon atoms of the cephem nucleus. The values of these electrotopological indexes depend on the substituent at C-3 of the cephem nucleus, and it has been reported that the in vitro activity and bioavailability of cephalosporins is affected by hydrophobic and electronic characteristics of this group.
Although with the 1NN classifier all patterns were well classified, 3 or 4 variables are not enough to build an effective SAR model.

  Ns         r      M.A.E.   Np
  3          0.86   0.73     17
  4          0.93   0.78      5
  6          0.99   0.23      1
  7 (VSIS)   0.999  0.13      0
  7 (VSCA)   0.998  0.17      0
4 Conclusions
Two variable selection methods based on recurrent neural models were described.
The first model selects the best variable subset looking for the maximal indepen-
dent set of a graph with minimum cardinality. The second one builds clusters of
analogous variables and chooses the best one of each cluster to form the most
relevant subset.
Both methods were applied to a sample of 105 cephalosporins described by 43
molecular descriptors and distributed in two classes: active and inactive against
S. aureus. All the selected subsets of variables showed the capacity to keep the
distribution of the pattern collection. Both algorithms performed similarly.
SAR NN models for S. aureus, using the selected variables were built. The
obtained SAR models provide good classifications of the compounds and show
the strong activity dependence on electronic and hydrophobic parameters of
cephalosporins.
5 Acknowledgments
This work has been supported by University of Antioquia under the Research
Project "Development of Heuristics to the Combinatorial Optimization NP-
Problem". The authors also thank the financial support of Third World Academy
of Sciences (TWAS R.G.A. No. 97-144 RG/CHE/LA).
References
1. Rose V.S., Wood J. and MacFie H.J.H., Analysis of Embedded Data: k-Nearest
Neighbor and Single Class Discrimination in Advanced Computer-Assisted Tech-
niques in Drug Discovery (Methods and Principles in Medicinal Chemistry, vol
III), Mannhold R. and Krogsgaard-Larsen H., van de Waterbeemd H., ed., VCH,
1995, pp 229-242.
2. Tetko I.V., Luik A.I. and Poda G.I., J. Med. Chem., 36, 811-814 (1993).
3. Lin C.T., Pavlick P.A. and Martin Y.C., Tetr. Comput. Methodol., 3, 723-738
(1990).
4. Wikel J.H. and Dow E.R., BioMed. Chem. Lett., 3, 645-651 (1993).
5. Hopfield J.J. and Tank D.W. Biological Cybernetics, 52, 141-152 (1985)
6. Takefuji Y. Neural Network Parallel Computing. KLUWER Acad. Pu. 1992
7. Garey M. R. and Johnson D. S. , "Computers and Intractability : A Guide to the
Theory of NP-Completeness". Freeman, San Francisco, 1979.
8. Cruz R. and López N. Proceedings of the V European Congress on Intelligent Techniques and Soft Computing, Eufit'97, V 1, 465-470 (1997).
9. Cruz R., López N., Quintero M. and Rojas G. Journal of Mathematical Chemistry,
20 385-394 (1996)
10. Ishikura K., Kubota T., Minami K., Hamashima Y., Nakashimizu H., Motokawa
K. and Yoshida T. The Journal of Antibiotics, 47, 453-465 (1994).
11. Lee Y.S., Lee J.Y., Jung S.H., Woo E., Suk D.H., Seo S.H. and Park H., The Journal of Antibiotics, 47, 609-612 (1994).
12. Negi S., Yamanaka M., Sugiyama I., Komatsu Y., Sasho M., Tsuruoka A., Kamada
A., Tsukada I., Hiruma R., Katsu K. and Machida Y., The Journal of Antibiotics,
47, 1507-1525 (1994).
13. Negi S., Sasho M., Yamanaka M., Sugiyama I., Komatsu Y., Tsuruoka A., Kamada
A., Tsukada I., Hiruma R., Katsu K. and Machida Y. The Journal of Antibiotics,
47, 1526-1540 (1994).
14. Ishikura K., Kubota T., Minami K., Hamashima Y., Nakashimizu H., Motokawa K., Kimura Y., Miwa H. and Yoshida T., The Journal of Antibiotics, 47, 466-477 (1994).
15. Park H., Lee J.Y., Lee Y.S., Park J.O., Koh S.B. and Ham, W., The Journal of
Antibiotics, 47, 606-608 (1994).
16. Yokoo C., Onodera A., Fukushima H., Numata K., Nagate T. The Journal of
Antibiotics, 45, 932-939 (1992).
17. Yokoo C., Onodera A., Fukushima H., Numata K. and Nagate T., The Journal of
Antibiotics, 45, 1533-1539 (1992).
18. Yokoo C., Got M., Onodera A., Fukushima H. and Nagate T., The Journal of
Antibiotics, 44, 1422-1431 (1991).
19. Dewar M.J.S., Zoebisch E.V., Healy E.F. and Stewart J.J.P., J. Am. Chem. Soc.,
107, 3902-3909 (1985).
20. Stewart J.J.P., MOPAC 6.0 User Manual, Frank J. Seiler Research Laboratory, US
Air Force Academy, 1990.
21. Estrada E., J. Chem. Inf. Comput. Sci., 35, 31-33 (1995).
22. Estrada E., J. Chem. Inf. Comput. Sci., 35, 708-713 (1995).
23. Kier L.B. and Hall L.H., J. Pharm. Sci., 72, 1170-1173 (1983).
24. Kier L.B. and Hall L.H., Pharmaceutical Research,7, 801-807 (1990).
25. Stuttgart Neural Network Simulator (SNNS), Version 4.1, Institute for Parallel
and Distributed High Performance Systems. 1995, Report No. 6/95.
Optimal Use of a Trained Neural Network for Input
Selection
1 Introduction
Neural networks (NNs) are used in quite a variety of real-world applications, in which one can usually measure a large number of variables that can be used as potential inputs. One clear example is the extraction of features for object recognition [1]; many different types of features can be utilized, such as geometric features, morphological features, etc. However, usually not all variables that can be collected are equally informative: they may be noisy, irrelevant or redundant.
Feature selection is the problem of choosing, from a larger set of candidate features, a small subset of features ideally necessary and sufficient to perform the classification task. Feature selection has long been one of the most important topics in pattern recognition and it is also an important issue in NNs. If one could select a subset of variables one could reduce the size of the NN, the amount of data to process and the training time, and possibly increase the generalization performance. This last result is known in the literature and confirmed by our results.
Feature selection is also a complex problem: we need a criterion to measure the importance of a subset of variables, and that criterion will depend on the classifier. A subset of variables could be optimal for one system and very inefficient for another. In the literature there are several potential ways to determine the best subset of features: analyzing all subsets, genetic algorithms, a heuristic stepwise analysis and direct estimations.
In the case of NNs, direct estimation methods are preferred because of the computational complexity of training a NN. Within this category we can perform another classification: methods based on the analysis of the training set [2], methods based on the analysis of a trained multilayer feedforward network [1], [3-16], and methods based on the analysis of other specific architectures [18].
The purpose of this paper is to make a brief review of the methods based on the
analysis of a trained multilayer feedforward network and present the first empirical
comparison among them.
In the next section we will briefly review the 19 different methods; in section 3 we present the comparison methodology, the experimental results and an ordering of the methods according to their performance; and we finally conclude in section 4.
2 Theory
Many methods based on the analysis of a trained multilayer feedforward network try to define what is called the relevance $S_i$ of an input unit: one input $I_i$ is considered more important if its relevance $S_i$ is larger. They also define the relevance $s_{ij}$ of a weight $w_{ij}$ connecting the input unit i and the hidden unit j. The relation between $S_i$ and $s_{ij}$ is:

$$S_i = \sum_{j=1}^{N_h} s_{ij} \qquad (1)$$

A simple criterion for the weight relevance is

$$s_{ij} = (w_{ij})^2 \qquad (2)$$
Another magnitude-based criterion was proposed by Tetko [9] (from here named TEKA), given by (3).
These criteria are based on the heuristic principle that, as a result of the learning process, the weights of an important input should have a larger magnitude than the weights connected to a useless, possibly random, input.
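The magnitude heuristic of Eqs. (1)-(2) reduces to a row-wise sum of squared weights; the weight matrix below is an assumed example, not taken from any of the cited papers:

```python
import numpy as np

def input_relevance_squared(W):
    """Relevance S_i of Eqs. (1)-(2): s_ij = w_ij^2 summed over the
    N_h hidden units. W has shape (n_inputs, n_hidden)."""
    return np.sum(W ** 2, axis=1)

# Illustrative trained-weight matrix: input 0 carries large weights,
# input 2 only small, noise-like ones.
W = np.array([[ 2.0, -1.5,  1.0 ],
              [ 0.8,  0.5, -0.9 ],
              [ 0.1, -0.05, 0.02]])
S = input_relevance_squared(W)
print(np.argsort(S)[::-1])   # ranking: input 0 first, input 2 last
```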
Other criteria of weight relevance are based on an estimation of the change in the m.s.e. (mean square error) E when setting the weight to 0; this estimation is calculated by using the Hessian matrix H, as a result of considering the Taylor expansion of the m.s.e. E with respect to the weight. One example is the method proposed by Cibas [8] (from here CIB), where we denote by $w_k$ the weight $w_{ij}$:

$$s_k = s_{ij} = \frac{1}{2}\, h_{kk}\, w_k^2 \qquad (4)$$

In the above expression $h_{kk}$ is a diagonal element of the Hessian matrix H.
The criterion proposed by Tetko [9] (from here TEKE) is a related Hessian-based estimate (5). The Hessian matrix can be exactly calculated with the algorithm and expressions described in [17].
Another method of estimating weight relevance is also based on an estimation of the change in E when setting $w_{ij}$ equal to 0, but it does not use the Hessian matrix. It was proposed by Tetko [9] (named TEKC from here), and the value of the weight relevance is

$$s_{ij} = \sum_{t=0}^{t_c} \frac{\partial E(t)}{\partial w_{ij}}\, \Delta w_{ij}(t)\, \frac{w_{ij}(t)}{w^c_{ij} - w^1_{ij}}, \qquad (6)$$

where the sum over t runs over all iteration steps of the training process, from the initialization t = 0 until the iteration of convergence $t_c$; $w^1_{ij}$ is the initial value of weight $w_{ij}$, $w^c_{ij}$ is the value of that weight at the iteration of convergence, and $\Delta w_{ij}(t)$ is the change in the weight calculated by the learning algorithm at iteration t. In order to apply the method, one should keep a record of the appropriate information during the learning process for calculating the weight relevance.
Other methods define the relevance $S_i$ of input i by a calculation related to the variance of the weights $w_{ij}$ of input i. They are based on the heuristic that an input unit with a small weight variance will behave as a threshold and therefore will have little importance. One example is the criterion defined by Devena [1] (from here DEV):

$$S_i = \sum_{j=1}^{N_h} w_{ij}^2 - \frac{1}{N_h} \left( \sum_{j=1}^{N_h} w_{ij} \right)^2 \qquad (7)$$

Another example is the criterion proposed by Deredy [10] (from here DER3):

$$S_i = \frac{N_h \cdot \mathrm{var}_i}{\sum_j w_{ij}} \qquad (8)$$
Another way to define the relevance $S_i$ of input i is by using the sensitivity of the outputs $o_j$ with respect to the input $I_i$. It is based on the heuristic that a higher sensitivity means a larger variation of the output values with respect to a change in the input, and therefore we can suppose that the input is more important. For example, Belue [3] (from here BL1) uses a definition (9) based on these output sensitivities. A similar method was proposed by Cloete [4] (from here named CLO):

$$S_i = \max(A_{ij}) \quad \forall j, \qquad (10)$$

where $N_s$ is the number of training samples.
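The sensitivity heuristic behind BL1 and CLO can be sketched with finite differences in place of the papers' analytic derivatives; the small network, its weights and all names below are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, W2):
    """Small v-x-1 feedforward net used only for illustration."""
    return sigmoid(W2 @ sigmoid(W1 @ x))

def sensitivity_relevance(X, W1, W2, eps=1e-4):
    """Average |do/dI_i| over the training set, estimated by central
    finite differences (a sketch of the sensitivity-based relevance)."""
    S = np.zeros(X.shape[1])
    for x in X:
        for i in range(x.size):
            xp, xm = x.copy(), x.copy()
            xp[i] += eps
            xm[i] -= eps
            S[i] += np.abs(forward(xp, W1, W2) - forward(xm, W1, W2)).item() / (2 * eps)
    return S / X.shape[0]

rng = np.random.default_rng(4)
W1 = np.array([[1.5, -2.0, 0.0],     # input 2 is disconnected (zero weights)
               [0.7,  1.2, 0.0]])
W2 = np.array([[1.0, -1.0]])
X = rng.uniform(-1, 1, (50, 3))
S = sensitivity_relevance(X, W1, W2)
print(S[2] < S[0], S[2] < S[1])      # the disconnected input ranks last
```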
Priddy [5] also proposed a method (from here called PRI) based on sensitivities:

$$S_i = \sum_{j=1}^{N_o} \sum_{x \in D} \sum_{k \neq j} \frac{\partial o_k}{\partial I_i}(x, w), \qquad (12)$$

which tries to estimate the variation of the probability of classification error with respect to the variation of the input $I_i$. See the reference for more details.
The method proposed by Sano [16] (from here named SAN) also uses sensitivities. This method gives a matrix of sensitivities D(k,i), with k = 1, ..., number of outputs and i = 1, ..., number of inputs, and the relevance $S_i$ of input i is considered to be larger than the relevance $S_j$ of input j if D(k,i) > D(k,j) for a number of values of k greater than $N_o/2$. And, finally, Deredy [10] proposes the use of logarithmic sensitivities (from here named DER2).
tries to estimate the overall importance $S_i^s$ of unit i in layer s over the units in the next layer s+1, where $w_{ij}$ is the weight between unit i in layer s and unit j in layer s+1, and M is the number of units in layer s+1. The equation is recursive: we set $S_j$ equal to 1 for all outputs and calculate $S_i$ for all hidden units; applying the equation again we calculate the input relevance.
An analogous method also proposed by Tetko [7] (named TEK from here) is based on the equation

$$S_i^s = \sum_{j=1}^{M} (w_{ij})^2\, E[a_i^s]\, S_j^{s+1}, \qquad (17)$$

where $E[a_k^s]$ is the mean value of the output of unit k in layer s. We also set $S_j$ equal to one for the outputs and recursively calculate $S_i$ for the inputs.
Another method proposed by Mao [12] (named MAO from here) calculates an estimation of the m.s.e. increase when deleting input i; the estimation is made again by a Taylor expansion of the m.s.e. with respect to the value of the input $I_i$. That value is used as the relevance $S_i$ of the input. The equations are:

$$S_i = \sum_{k=1}^{N_s} \Delta E_k(I_i), \qquad (18)$$

where the sum is over all $N_s$ patterns in the training set, and:
where $\Delta I_i$ should be $0 - I_i$ (which is the effect of setting $I_i$ equal to 0), and the derivatives can be calculated recursively from the output units down through the layers l+1 and l. In the recursion g denotes the sigmoid function, whose first and second derivatives are

$$g' = y_j^{l+1}\left(1 - y_j^{l+1}\right), \qquad g'' = y_j^{l+1}\left(1 - y_j^{l+1}\right)\left(1 - 2\, y_j^{l+1}\right) \qquad (23)$$
There are two very simple methods that calculate the effect of substituting an input by its mean value in the training set. They are based on the heuristic that if this substitution has little effect on the performance, the input nearly behaves as a threshold and has little importance. The first one, proposed by Lee [6] (called LEE in this paper), calculates the percentage correct on the test set with one input substituted by its mean value; the input is considered more relevant if the value of the tested percentage is lower, because the performance decrease is larger. The second one, proposed by Utans [13] (called UTA in this paper), focuses on the m.s.e. E and calculates its increment when substituting an input by its mean value. One input is considered more relevant if the increment of E is higher.
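The UTA procedure can be sketched as follows; the "trained model" here is an assumed stand-in function rather than a real trained MLP, and all names and values are illustrative:

```python
import numpy as np

def mse(Y_pred, Y):
    return float(np.mean((Y_pred - Y) ** 2))

def uta_relevance(predict, X, Y):
    """UTA-style relevance: increase of the m.s.e. E when input i is
    replaced by its mean over the training set."""
    base = mse(predict(X), Y)
    means = X.mean(axis=0)
    S = np.zeros(X.shape[1])
    for i in range(X.shape[1]):
        Xm = X.copy()
        Xm[:, i] = means[i]                  # clamp input i to its mean
        S[i] = mse(predict(Xm), Y) - base    # relevance = increment of E
    return S

rng = np.random.default_rng(5)
X = rng.standard_normal((200, 3))
Y = 2.0 * X[:, 0] + 0.1 * X[:, 1]            # input 2 is pure noise
predict = lambda X: 2.0 * X[:, 0] + 0.1 * X[:, 1]   # assumed 'trained' model
S = uta_relevance(predict, X, Y)
print(np.argsort(S)[::-1])   # input 0 most relevant, input 2 least
```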
Bowles [15] proposed another method (called BOW here) that also keeps an information record of the training process. The relevance $S_i$ of one input is defined as the following sum over all iteration steps until the convergence point T:

$$S_i = \sum_{t=0}^{T} \left| \sum_{j=1}^{N_h} \delta_j\, w_{ij} \right| \qquad (24)$$

where $w_{ij}$ is the weight between input i and hidden unit j, and $\delta_j$ is the backpropagated error of hidden unit j.
Finally, Younes [14] proposed another method that we have used in the comparison (we call it YOU), but we will not describe it because it is rather complex, its computational cost is high and its applicability limited: we got division-by-zero errors in 6 of the 15 problems.
It is very important to point out that every method reviewed allows obtaining an ordering of the inputs or features according to their importance or relevance. Obviously, the orderings produced by two methods will not, in general, be the same, and therefore their performance will also differ. Furthermore, although we can get an ordering of the inputs and will know which inputs should be discarded first from the training set (the least important ones), there is no simple and efficient procedure to determine the cardinality, k, of the final subset of inputs. We do not know the optimal number of inputs that should be kept in the training set. As we saw before, every method is based on heuristic principles, and the only way to compare them may be empirical, because of the complexity of the problem. This will be described and accomplished in the following section.
3 Experimental Results
For example, the results for methods UTA and CIB in Table 2 are distinguishable and we can conclude that the performance of method UTA is better for the problem PI. By comparing all the methods two by two, we can obtain another table where we can find whether one method is better or worse than another for a concrete problem. An extract of that table is in Table 3.
UTA vs. CIB:
  Best UTA: AB, BN, BU, CR, GL, VO, LE, PI, WD
  Equal:    BL, DI, M1, M2, M3
  Best CIB: HE
We can see that method UTA is better than CIB in a larger number of problems and
conclude that method UTA performs better than CIB.
Following this methodology and this type of comparison with the full results, (we do
not present them because of the lack of space) we can get the following ordination:
UTA > TEKA > BL2 > DEV > TEKB = DER2 = MAO > CLO > BL1 = BOW = YOU > PRI = SAN > CIB = TEKE > TEK > TEKC > DER3 > LEE
The best method is UTA and the worst is LEE. We can further discuss the applicability of every method. The only method with limited applicability was YOU, as commented in the theory section. Another important question is the computational complexity. The methods with the highest computational cost were CIB and TEKE, because of the calculation of the Hessian matrix, and also YOU, which requires an iteration over all samples of the training set and the calculation of two integrals for each iteration.
4 Conclusions
References
1. Devena, L.: Automatic selection of the most relevant features to recognize objects.
Proc. of the Int. Conf. on Artificial NNs, vol.2, pp.1113-1116, 1994.
2. Battiti, R.: Using mutual information for selecting features in supervised neural net
learning. IEEE Trans. on Neural Networks, vol. 5, n. 4, pp. 537-550, 1994.
3. Belue, L.M., Bauer, K.W.: Determining input features for multilayer perceptrons.
Neurocomputing, vol. 7, n. 2, pp. 111-121, 1995.
4. Engelbrecht, A.P., Cloete, I.: A sensitivity analysis algorithm for pruning
feedforward neural networks. Proc. of the Int. Conf. on Neural Networks, vol. 2,
pp. 1274-1277, 1996.
5. Priddy, K.L., Rogers, S.K., Ruck D.W., Tarr G.L., Kabrisky, M.: Bayesian
selection of important features for feedforward neural networks. Neurocomputing,
vol. 5, n. 2&3, pp. 91-103, 1993.
6. Lee, H., Mehrotra, K., Mohan, C., Ranka, S.: Selection procedures for redundant
inputs in neural networks. Proc. of the World Congress on Neural Networks, vol. 1,
pp. 300-303, 1993.
7. Tetko, I.V., Tanchuk, V.Y., Luik, A.I.: Simple heuristic methods for input
parameter estimation in neural networks. Proc. of the IEEE Int. Conf. on Neural
Networks, vol. 1, pp. 376-380, 1994.
8. Cibas, T., Soulié, F.F., Gallinari, P., Raudys, S.: Variable selection with neural
networks. Neurocomputing, vol. 12, pp. 223-248, 1996.
9. Tetko, I.V., Villa, A.E.P., Livingstone, D.J.: Neural network studies. 2. Variable
selection. Journal of Chemical Information and Computer Sciences, vol. 36, n. 4,
pp. 794-803, 1996.
10. El-Deredy, W., Branston, N.M.: Identification of relevant features in HMR tumor
spectra using neural networks. Proc. of the 4th Int. Conf. on Artificial Neural
Networks, pp. 454-458, 1995.
11. Steppe, J.M., Bauer, K.W.: Improved feature screening in feedforward neural
networks. Neurocomputing, vol. 13, pp. 47-58, 1996.
12. Mao, J., Mohiuddin, K., Jain, A.K.: Parsimonious network design and feature
selection through node pruning. Proc. of the 12th IAPR Int. Conf. on Pattern
Recognition, vol. 2, pp. 622-624, 1994.
13. Utans, J., Moody, J., Rehfuss, S., Siegelmann, H.: Input variable selection for
neural networks: Application to predicting the U.S. business cycle. Proc. of
IEEE/IAFE 1995 Comput. Intellig. for Financial Eng., pp. 118-122, 1995.
14. Younes, B., Fabrice, B.: A neural network based variable selector. Proc. of the
Artificial Neural Networks in Engineering (ANNIE'95), pp. 425-430, 1995.
15. Bowles, A.: Machine learns which features to select. Proc. of the 5th Australian
Joint Conf. on Artificial Intelligence, pp. 127-132, 1992.
16. Sano, H., Nada, A., Iwahori, Y., Ishii, N.: A method of analyzing information
represented in neural networks. Proc. of 1993 Int. Joint Conf. on Neural Networks,
pp. 2719-2722, 1993.
17. Bishop, C.: Exact calculation of the Hessian matrix for the multilayer perceptron.
Neural Computation, vol. 4, pp. 494-501, 1992.
18. Watzel, R., Meyer-Bäse, A., Meyer-Bäse, U., Hilberg, H., Scheich, H.:
Identification of irrelevant features in phoneme recognition with radial basis
classifiers. Proc. of 1994 Int. Symp. on Artificial NNs, pp. 507-512, 1994.
19. Bronshtein, I., Semendyayev, K.: Mathematics Handbook for Engineers and
Students (in Spanish). MIR, Moscow, 1977.
Applying Evolution Strategies to Neural Network Robot
Controllers
Abstract - In this paper an evolution strategy (ES) is introduced, to learn weights of a neural
I. Introduction
Autonomous robots are sometimes viewed as reactive systems; that is, as systems whose actions are completely determined by current sensory inputs. This is the basis of the subsumption architecture [1], where finite state machines are used to implement robot behaviors. Other systems use fuzzy logic controllers instead [2]. The rules of these behaviors can be designed by a human expert, designed ad hoc for the problem, or learned using different artificial intelligence techniques [3]. The control architecture used here to evolve the reaction (adaptation) is based on a neural network.
A neural network controller has several advantages [4]: (1) NNs are resistant to the noise that exists in real environments and are able to generalize to new situations; (2) the primitives manipulated by the evolution strategy are at the lowest level, in order to avoid undesirable choices made by the human designer; (3) an NN can easily exploit several ways of learning during its lifetime. The use of a feed-forward network with eight input units and two output units directly connected to the motors appears in previous work [4] as an efficient way to learn an "avoid obstacles" behavior using genetic algorithms. In this work the NN must learn a more complex behavior: "navigation". This task requires more environmental information, and the sensors have been grouped so that only five input units are used.
In the proposed model, the robot starts without information about the right associations between environmental signals and the actions responding to those signals. From this situation the robot is able to learn through experience, reaching the highest grade of adaptation to the sensory information. The number of inputs (robot sensors), the range of the sensors, the number of outputs (robot motors) and their description are the only prior information.
2. Evolution Strategies
Evolution strategies (ES), developed by Rechenberg [5] and Schwefel [6], have traditionally been used for optimization problems with real-valued vector representations. Like genetic algorithms (GA) [7], ES are heuristic search techniques based on the building-block hypothesis. Unlike GA, however, the search is basically focused on gene mutation. This is an adaptive mutation, based on how well the individual represents the problem solution. Recombination also plays an important role in the search, mainly in the adaptive mutation.
Figure 1: Schema of an evolution strategy (selection of parents → recombination → mutation → evaluation of children + parents by the fitness function → survival).
where x_i' and σ_i' are the mutated values, following a normal distribution N(μ, σ).
However, when a (μ+1)-ES is used, the mutation process follows the 1/5 success rule [8].
In both cases, recombination follows the canonical GA approach [7].
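As an illustration of the 1/5 success rule in its simplest setting, a (1+1)-ES on a toy sphere function might look as follows; the constants, the 20-mutation adaptation window, and the objective are our own choices, not values taken from the paper:

```python
import random

# Illustrative (1+1)-ES with Rechenberg's 1/5 success rule; the constants,
# the 20-mutation adaptation window, and the sphere objective are our own
# choices for this sketch, not taken from the paper.

def es_1plus1(f, x, sigma=1.0, steps=200, c=0.85):
    successes = 0
    for t in range(1, steps + 1):
        child = [xi + random.gauss(0.0, sigma) for xi in x]
        if f(child) < f(x):                   # elitist: keep the better point
            x, successes = child, successes + 1
        if t % 20 == 0:                       # adapt sigma periodically
            rate = successes / 20.0
            sigma *= (1 / c) if rate > 0.2 else c   # the 1/5 rule
            successes = 0
    return x, sigma

sphere = lambda v: sum(vi * vi for vi in v)
random.seed(1)
best, _ = es_1plus1(sphere, [5.0, -3.0])
```

If more than a fifth of recent mutations succeed, the step size grows; otherwise it shrinks, keeping the success rate near 1/5.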
3. Experimental Environment
The task faced by the autonomous robot is to reach a goal in a complex environment, avoiding the obstacles found in the path. Different environments have been used to find the connections of the NN. The system has been developed using a simulator to test different characteristics of the system. Finally, a real robot has been used to test the proposed solution.
A simulator developed in previous work [10] has been used as complete software for the simulation of the mobile robot. Working with a simulation offers the possibility of evaluating several systems in different environments while controlling the execution parameters. The simulator is based on the mini-robot Khepera [9], a commercial robot developed at LAMI (EPFL, Lausanne, Switzerland). The robot's characteristics are: circular shape, 5.5 cm in diameter, 3 cm in height, and 70 g in weight. The robot has two wheels controlled by two motors that allow any type of movement. The ES should specify the wheel velocities, which can be read later by an odometer. Eight infrared sensors supply two kinds of incoming information: proximity to obstacles and ambient light. To reduce the amount of information, instead of using the eight sensors individually, six sensors are used and grouped (as Figure 2 shows), each pair of input values giving a unique value, their average. Representing the goal by a light source, the ambient information lets the robot know the angle (the angular position on the robot of the ambient sensor receiving the most light) and the distance (the amount of light at the sensor).
Different simulated worlds that resemble real ones have been defined before being implemented in the real world. Examples of these environments are shown in Figure 3 (a) and Figure 3 (b). The controller developed is the same in both cases (simulated and real), except for differences in the treatment of the sensors.
It has been shown that, by means of connections between sensors and actuators, a controller is able to solve any autonomous robot navigation behavior [11]. This theoretical approach is based on the possibility of finding the right connections of a feed-forward NN without hidden layers for each particular problem. The input sensors considered in this approach are the ambient and proximity sensors of Figure 2. The NN outputs are the wheel velocities. The velocity of each wheel is calculated by means of a linear combination of the sensor values using those weights (Figure 4):
v_j = f( Σ_{i=1}^{5} w_ij · s_i )    (3)
where w_ij are the searched weights, s_i are the sensor input values, and f is a function constraining the maximum velocity values of the wheels.
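Written out, eq. (3) amounts to the following sketch; taking f to be a hard clip to [-V_MAX, V_MAX] is our assumption, as the text only says f constrains the maximum velocity:

```python
# Sketch of eq. (3): each wheel velocity is a linear combination of the five
# grouped sensor inputs, passed through a limiting function f. We take f to
# be a hard clip to [-V_MAX, V_MAX]; the paper only says that f constrains
# the maximum velocity values.

V_MAX = 10.0

def wheel_velocities(weights, sensors):
    """weights: one row of searched weights w_ij per wheel;
    sensors: the grouped sensor values s_i."""
    clip = lambda v: max(-V_MAX, min(V_MAX, v))
    return [clip(sum(w * s for w, s in zip(row, sensors))) for row in weights]

w = [[0.1, 0.2, 0.0, -0.1, 0.3],   # left wheel
     [0.3, -0.1, 0.0, 0.2, 0.1]]   # right wheel
v = wheel_velocities(w, [1.0, 2.0, 3.0, 4.0, 5.0])
```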
Figure 4: Network connecting the sensor inputs s_i through the weights w_ij to the wheel velocities.
5. Experimental Results
Different experiments have been done all of them over the same set of environments.
The environments have been generated by changing the goal position, number and
location of obstacles looking for a generalized environment. In a set of preliminary
comparisons, it was found that results obtained with the software model did not differ
significantly from the results obtained with the physical robot.
An exploratory set of experiments was performed in simulation to adjust the quality measures used in the fitness function as well as the parameters of the evolution strategy. A (μ+λ)-ES with μ=6, λ=6 was used.
The quality measures used to calculate the fitness value of a controller were the
following:
- Number of collisions. (Collisions)
- Number of stops: cycles of the simulation in which the robot stays in the same location. (Stand)
- Time needed to reach the goal. (Time)
- Length of the robot trajectory from the starting point to the final location. (Path Length)
The global evaluation depends linearly on these concepts: 10*Collisions + 10*Stand + 20*Time - 1.5*Path_Length. Each evaluated robot behavior over one environment ends when the goal has been reached or the time exceeds some timeout.
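The global evaluation can be written directly from the coefficients given above (whether the ES minimises or maximises this value is not restated in this excerpt):

```python
# The global evaluation as stated in the text: 10*Collisions + 10*Stand +
# 20*Time - 1.5*Path_Length. Whether the ES minimises or maximises it is
# not restated in this excerpt.

def fitness(collisions, stand, time, path_length):
    return 10 * collisions + 10 * stand + 20 * time - 1.5 * path_length

f = fitness(collisions=2, stand=3, time=100, path_length=40)
# f -> 20 + 30 + 2000 - 60 = 1990.0
```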
Five evolutionary runs of 70 generations each have been performed for eight different environments, each run starting with a different seed for initializing the computer's random functions.
Figure 5. Eight environments used to evolve the controller. Dark shapes are the
obstacles, the big point is the starting location of the robot and the small
point is the goal. The environments are closed.
The evolution of the quality measures used to calculate the fitness value shows similar behavior over all environments. All the quality measures evolve towards the optimal robot behavior. See Figures 6-10.
Figure 6. Evolution of the "Path Length" versus generations in each environment.
Figure 7. Evolution of the "Time" needed to reach the goal versus generations in each environment.
Figure 10. Evolution of the fitness value of the population's best controller versus
generations in each environment.
Figure 11. Evolution of the quality measures versus generations in environment 1.
Figure 12. Evolution of the quality measures versus generations in environment 3.
Figure 13. The fitness value of the solution (S_n) obtained in environment n is measured in all environments. The point shows the fitness value calculated in the training environment.
Neural networks trained with an ES adjust their weights precisely to the training environment. This is an advantage when we want to obtain a good solution within a short processing time, but a drawback for obtaining generalized solutions. This behavior is displayed in Figure 14; the solution trained in environment 3 is validated in environments 1, 2 and 6.
7. References
[3] Matellán V., Molina J.M., Sanz J., Fernández C. "Learning Fuzzy Reactive
Behaviors in Autonomous Robots". Proceedings of the Fourth European
Workshop on Learning Robots, Germany, (1995).
[4] Miglino O., Hautop H., Nolfi S. "Evolving Mobile Robots in Simulated and Real
Environments". Artificial Life 2:417-434 (1995).
[5] Rechenberg, I. Evolutionsstrategie: Optimierung Technischer Systeme nach
Prinzipien der Biologischen Evolution. Frommann-Holzboog, Stuttgart (1973).
[6] Schwefel, H. P. Numerical Optimization of Computer Models. New York: John
Wiley & Sons (1981).
[7] Goldberg D., Genetic Algorithms in Search, Optimization and Machine Learning,
Addison-Wesley, New York, (1989).
[8] Rechenberg I., Evolution Strategy: Nature's Way of Optimization. In H. W.
Bergmann, editor, "Optimization: Methods and Applications, Possibilities and
Limitations", Lecture Notes in Engineering, pp. 106-126, Springer, Bonn (1989).
[9] Mondada F. and Franzi P.I. "Mobile Robot Miniaturization: A Tool for
Investigation in Control Algorithms". Proceedings of the Second International
Conference on Fuzzy Systems. San Francisco, USA, (1993).
[10] Sommaruga L., Merino I., Matellán V. and Molina J. "A Distributed Simulator
for Intelligent Autonomous Robots", Fourth International Symposium on
Intelligent Robotic Systems-SIRS96, Lisboa (Portugal), (1996).
[11] Braitenberg V. Vehicles: Experiments on Synthetic Psychology. MIT Press,
Cambridge, Massachusetts (1984).
On Virtual Sensory Coding: An Analytical
Model of the Endogenous Representation
Dpto. Inteligencia Artificial - UNED, Senda del Rey, s/n, E-28040 Madrid, Spain,
{jras, delapaz, jmira}@dia.uned.es
1 Introduction
Those methods have problems caused by the lack of mechanical precision of the robot (dead reckoning), problems related to local minima, and, in general, problems due to discrepancies between the model and the real environment. Other, qualitative methods are used to solve part of those problems [9], but the solution seems to lie in the use of a hybrid strategy combining qualitative and geometric methods. Still, the problem of building an internal model of the external environment, allowing the robot to navigate or to perform other tasks involving an efficient use of the inner representation of the external geometry, has not been solved in a satisfactory way.
In this paper a very modest, but analytically complete, example has been worked out. We deal with the task of creating a computable structure at the analytical level for a simple set-up: a circular system with a limited set of distance sensors, arranged with planar radial symmetry, that can move as a whole (rotation relative to the system and displacements with it). We also assume that the system has other sensors for inner perception, such as the angle rotated by the sensor set or the displacement of the center (direction and amount). The codification of these sensors can be absolute, relative, or as a rate of change. These inner sensors can suffer dead-reckoning errors, which must be taken into account.
The system can move around measuring distances in an environment filled with two-dimensional obstacles (from the point of view of the system). The obstacles are fixed (or they move very slowly relative to the system's movement). The sensors can also produce sporadic errors (wrong measurements) that must be compensated.
The rest of the paper is structured in the following way. In section 2 we describe the solution method at the knowledge level, starting with the data structures (distance sensors, system movement and inner sensors) and giving the diagram of transformations for the successive representations in the model. Section 3 describes the first transformation (the way we use the system movement to increase the sensors' resolution and introduce rotation invariance and adaptation to displacements). Then, in section 4, we describe the second transformation (a sensory representation independent of the position). Finally, in section 5 we conclude by noting the usefulness of this method of design.
several types with different properties. From a formal point of view these sensors are characterized by the following properties (figure 1):
1. Each sensor is fixed at a point at distance R^t (where t denotes the type) from the system center. That distance is the same for all the sensors of the same type.
2. The sensory field of each sensor faces outwards from the system. The position of a sensor i (of type t) is determined by an angle θ_i^t (relative to the system) in the same direction as the axis of its sensory field. As a first approximation, we suppose that the sensors of each type are distributed uniformly around the system, such that θ_i^t = i · Δθ^t, where Δθ^t = 2π/N^t, with N^t being the total number of sensors of type t in the system.
3. The sensor has a sensitivity sector defined by the angle δ^t centered on its axis.
4. The sensor can detect objects within its sensory field, between a minimum distance (d^t_min) and a maximum one (d^t_max) from its position. The value given by the sensor represents the distance from it to the closest object within range. The sensor can signal saturation (all objects out of range, too far away or too close). The precision of the returned value can be limited; this can be represented by the value belonging only to a finite set. The most common case is values distributed over a linear range.
5. Each type of sensor has an accuracy given by a function depending on the distance and the angular position of the object relative to the sensor. There is also a minimum size of detectable object (i.e. of its projection), again depending on the distance and the angle.
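The formal properties above can be collected into a minimal data structure; the field names and the saturation convention (returning None) are our own choices:

```python
import math
from dataclasses import dataclass

# Minimal encoding of the formal sensor properties listed above; the field
# names and the saturation convention (returning None) are our own choices.

@dataclass
class DistanceSensor:
    r: float        # R^t: distance from the system center (same for a type)
    theta: float    # theta_i^t: angular position, equal to the axis direction
    delta: float    # delta^t: width of the sensitivity sector on the axis
    d_min: float    # minimum detectable distance
    d_max: float    # maximum detectable distance

    def measure(self, true_distance):
        """Return the reading, or None to signal saturation (out of range)."""
        if self.d_min <= true_distance <= self.d_max:
            return true_distance
        return None

# Uniform placement of N^t sensors of one type: theta_i = i * (2*pi / N^t).
N = 8
ring = [DistanceSensor(r=1.0, theta=i * 2 * math.pi / N,
                       delta=math.pi / 8, d_min=0.1, d_max=5.0)
        for i in range(N)]
```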
Figure 1: Geometry of a sensor (sensory field, d_min) relative to the system center.
A way to use the system movement to improve the sensors' resolution consists of accumulating the instantaneous values of the information at the primary sensors corresponding to many different coordinates in successive sampling intervals. This expansion is developed in two parts: 1) rotation of the sensors, without displacements (static virtual sensors), and 2) displacement in one direction, without rotation, which gives us the corrected or dynamic virtual sensors.
Given that the information received by the system has to be transformed to represent the environment from the endogenous "point of view", the first representation, relative to the system position and independent of the direction, defines the properties of the virtual sensors (VS's):
Figure 2: Diagram of transformations. From the real sensors (instant sample), through the system rotations and system displacements, to the local position-independent sensors; from these, the mobile environment objects and movements, and finally the topological map of zones (+ metric information).
1. The VS's are "placed" at the center of the system. That means the distance value stored by them is relative to that center.
2. Every VS is assigned a two-dimensional spatial sector around the system, to represent the distance to the closest object in that sector.
3. The VS's receive unified values from the different kinds of real sensors, but only when one of them is in range (not saturated) and is facing the direction assigned to the VS.
4. The VS's change the stored distance information when the system rotates or when it moves, thus keeping the representation of the external obstacles approximately invariant.
The main function in this stage is the data routing, depending on the angular direction of the sensors. Data distribution is done by intermediate elements which group virtual sensors into zones, allowing more modularity and fault tolerance. There are as many groups as real sensor sectors in the first step of the distribution. The sector covered by the virtual sensors of a group belongs to the group. Every real sensor is connected to the group whose assigned sector corresponds to the facing direction of the real sensor at the sampling moment, when the measurement is taken.
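With equal angular sectors, the routing step reduces to an index computation such as the following sketch (the equal-sector layout is our simplification):

```python
import math

# Hedged sketch of the routing step: each real-sensor measurement is sent to
# the group of virtual sensors whose angular sector contains the sensor's
# facing direction at the sampling moment. Equal-width sectors are our
# simplification of the paper's grouping.

def route_group(facing_angle, n_groups):
    """Return the index of the group whose sector (width 2*pi/n_groups)
    contains facing_angle (radians)."""
    sector = 2 * math.pi / n_groups
    return int((facing_angle % (2 * math.pi)) // sector)

# A sensor facing 100 degrees with 8 groups (45-degree sectors) -> group 2.
g = route_group(math.radians(100), 8)
```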
Once the valid data (sensor in range) are in the correct place, each VS accumulates the new incoming value onto the previously stored values with a weight. This temporal accumulation also includes a continuous "forgetting" (incrementing the distance) if the VS is not activated frequently. The accumulation is also spatial, by lateral interaction with the neighboring virtual sensors: small contributions from the neighbors are added to the stored value. The contributions are related to the dispersion in the real sensors, depending on the overlap between sensory fields.
The "forgetting" consists of a periodic increment of the stored distance, inversely proportional to it. This function allows the correction of isolated erroneous data. The following expression for the distance increment due to forgetting includes all these functional specifications:
Δd = K · (d_max − d)    (1)

It is null when d = d_max and it is K · (d_max − d_min) when d = d_min. The constant K must be the part of the complete range corresponding to the number of times the VS is activated in a sampling period of the real sensors, so it must be K = (N · f_r) / (Ñ · f_a), where N is the number of real sensors of one kind, Ñ is the number of virtual sensors, f_r is the sampling frequency of the slowest real sensors, and f_a is the activation frequency of the forgetting increments in the virtual sensor. Normally f_a > f_r and Ñ > N, so K is a small value.
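A sketch of one forgetting activation, taking the increment to be K·(d_max − d), which satisfies the boundary conditions stated in the text (null at d_max, K·(d_max − d_min) at d_min); the constants here are illustrative:

```python
# Sketch of one forgetting activation. We take the increment to be
# K * (d_max - d), which satisfies the boundary conditions stated in the
# text: null at d_max, K * (d_max - d_min) at d_min. Constants illustrative.

D_MIN, D_MAX, K = 0.1, 5.0, 0.01

def forget(d):
    """One forgetting step: push the stored distance towards d_max."""
    return min(D_MAX, d + K * (D_MAX - d))

d_next = forget(1.0)   # 1.0 + 0.01 * (5.0 - 1.0) = 1.04
```

Repeated activations without new measurements make the stored distance drift towards d_max, which is what erases isolated erroneous readings.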
The virtual sensors must correct the stored value when the system changes its position, thus reflecting the changes of the objects relative to the system. That correction has two components, depending on the angle of the displacement relative to the facing direction of the virtual sensor.
The two corrections are proportional to the projection of the displacement onto the facing angle of the sensor. To avoid a global calculation depending on the angle of each sensor relative to the displacement, we distribute the computations and the connections between the virtual sensors in a local and modular way. To simplify the calculations, we also suppose that the displacement is along the direction nearest to that of one virtual sensor.
The process starts at the sensor facing the direction of displacement. This sensor has a longitudinal correction equal to the displacement and a null transversal correction. The sensor transmits this information to its two neighboring sensors, activating them. The sensors receiving information from one side compute their corrections and transmit to the other side. This cascade process ends at the last sensor (pointing in the direction opposite to the displacement), which receives two activations with equilibrating transversal corrections (the number of virtual sensors must be even). This way of computing allows all the sensors to use the same formulae, independent of the angle relative to the displacement.
We now compute the correction of the sensor at place number k, counting from the first activated sensor (in the direction of the displacement), which has index 0. We call d_k the distance stored before the displacement in the k-th sensor and d'_k the corrected distance stored after the displacement; it will be the interpolation (or extrapolation) between d_k and d_{k-1}. The correction depends on the angle of the sensor relative to the direction of displacement, θ_k, which can be substituted by θ_k = k · Δθ, where Δθ = 2π/N, with N the number of virtual sensors. We will call a the distance advanced by the system.
The diagrams of fig. 4 help us develop the expressions for the new corrected value. There are two possible geometric configurations, depending on the advance and the value of the previous sensor (d_{k-1}); the first is calculated by interpolation and the second by extrapolation. The results are the same in both cases, as we will prove. We use a shortened notation, calling S ≡ sin(Δθ), C ≡ cos(Δθ), s_k ≡ a · sin(θ_k) and c_k ≡ a · cos(θ_k).
The interpolation correction gives the new value of the distance d'_k = d_{k-1}·C − c_k + p, where the last term p can be solved by similar triangles (see fig. 4) as

p = [(d_k − d_{k-1}·C) / (d_{k-1}·S)] · (d_{k-1}·S − s_k)    (2)

In the extrapolation configuration the corresponding term is

q = [(d_k − d_{k-1}·C) / (d_{k-1}·S)] · (−d_{k-1}·S + s_k)    (3)

d'_k = d_{k-1}·C − c_k + [(d_k − d_{k-1}·C) / (d_{k-1}·S)] · (d_{k-1}·S − s_k)    (4)
Figure 4. Diagram for the projections used in the calculations of the longitudinal and transversal corrections of the virtual sensors in a system displacement. The two possible configurations are represented (interpolation above and extrapolation below).
Expanding (4) we obtain

d'_k = d_k − [(d_k − d_{k-1}·C) / (d_{k-1}·S)] · s_k − c_k    (5)

where the second term on the right side is the transversal correction and the last term is the longitudinal correction.
Expression (5), in the form shown and using only the definitions of s_k and c_k, has a direct dependence on θ_k, which varies with the displacement direction. We seek an expression depending only on values in the sensor and in the neighbor activating it. Using the trigonometric identities for the angle sum and the fact that θ_k = θ_{k-1} + Δθ, with s_k and c_k this gives:
s_k = C · s_{k-1} + S · c_{k-1}
c_k = −S · s_{k-1} + C · c_{k-1}    (6)
where both expressions depend only on the previous sensor values and on constants (S and C). That is to say, the corrections in every virtual sensor are done using the three values sent by the adjacent sensor, s_{k-1}, c_{k-1} and d_{k-1}, and using formulae (6) and (5). Every sensor sends to the next one the three values s_k, c_k and d_k. In the first activated sensor the initial values are s_0 = 0 (null transversal correction) and c_0 = a (full longitudinal correction).
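The whole cascade, using the recurrences of eq. (6) seeded with s_0 = 0, c_0 = a and the interpolation formula of eq. (4), can be sketched as follows (a one-sided pass around the ring; we omit the two-sided symmetric activation described in the text for brevity):

```python
import math

# Sketch of the displacement correction: starting from the sensor aligned
# with the displacement (s_0 = 0, c_0 = a), each sensor k derives s_k, c_k
# from its neighbour via eq. (6) and its corrected distance via eq. (4).
# One-sided pass around the ring; the paper propagates symmetrically.

def correct_for_displacement(d, a):
    """d: distances stored by the N virtual sensors, d[0] facing the
    displacement direction; a: distance advanced. Returns corrected list."""
    n = len(d)
    S, C = math.sin(2 * math.pi / n), math.cos(2 * math.pi / n)
    corrected = [d[0] - a]        # first sensor: full longitudinal correction
    s_prev, c_prev = 0.0, a       # s_0 = 0, c_0 = a
    for k in range(1, n):
        s_k = C * s_prev + S * c_prev           # eq. (6)
        c_k = -S * s_prev + C * c_prev
        # eq. (4): interpolation between d[k-1] and d[k]
        dk = (d[k - 1] * C - c_k
              + (d[k] - d[k - 1] * C) / (d[k - 1] * S) * (d[k - 1] * S - s_k))
        corrected.append(dk)
        s_prev, c_prev = s_k, c_k
    return corrected

# With no displacement (a = 0) the stored distances are unchanged.
same = correct_for_displacement([3.0, 2.5, 2.0, 2.5, 3.0, 3.5, 4.0, 3.5], 0.0)
```

As a sanity check, for a system centered in a circular enclosure the correction reduces each stored distance by approximately a·cos(θ_k), as the geometry requires.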
s_c = (1/2) · r_a · r_b · sin(Δθ)
x_c = (1/3) · (r_a · cos(α_a) + r_b · cos(α_b))    (7)
y_c = (1/3) · (r_a · sin(α_a) + r_b · sin(α_b))
s_c = s_1 + s_2
x_c = (s_1·x_1 + s_2·x_2) / (s_1 + s_2)    (8)
y_c = (s_1·y_1 + s_2·y_2) / (s_1 + s_2)

s_c = s_1 + s_2
x_c = x_1 + x_2    (9)
d'_k are the position-independent virtual sensors (eq. 5 again), and s_c, x_c, y_c are the area and the coordinates of the center of area from eq. (9).
Acknowledgements
This work has been partially supported by project TIC-97-0604 of the Comisión Interministerial de Ciencia y Tecnología (CICYT) of Spain.
References
8. Borenstein, J., Koren, Y.: The vector field histogram - fast obstacle avoidance for
mobile robots. IEEE Journal of Robotics and Automation, Vol. 7, No. 3 (1991)
278-288
9. Levitt, T.S., Lawton, D.T., Chelberg, D.M., Nelson, P.C.: Qualitative Navigation.
Proc DARPA Image Understanding Workshop Los Altos. Morgan Kaufmann (1987)
447-465
10. Romo, J., de la Paz, F., Mira, J.: Incremental Building of a Model of Environment
in the Context of the McCulloch-Craik's Functional Architecture for Mobile Robots.
Tasks and Methods in Applied Artificial Intelligence. Springer (1998) 339-352
Using Temporal Information in ANNs for the Implementation of
Autonomous Robot Controllers
Abstract
1 Introduction
In the last few years, there has been an important trend towards obtaining robot controllers with the emphasis on the desired behavior rather than on the knowledge required by the robot in order to carry out its functions [1]. Most implementations have made use of artificial neural networks as their basic building block, due to their tolerance to noise and their adequacy for automatic implementation [2][3], although normally based on static architectures. These approaches have permitted the generation of controllers able to perform simple tasks in uncomplicated environments, but they have been difficult to scale to more complex problems, among them those that must handle temporal information. Obviously, handling temporal information is necessary for performing certain functions and very useful in others. The use of this type of information becomes even more necessary when the robot suffers from undersensorization, as the data perceived by the sensors at a given instant may be the same for substantially different situations, and the only way to avoid this ambiguity is to increase the dimensionality of the sensing space. In an analogy to the Embedding Theorem [4], the dimensionality may be increased by considering the data sensed at previous instants. It thus seems necessary to employ an ANN structure that permits this type of temporal processing.
Several methods have been proposed in order to give ANNs the capacity of processing temporal information: recurrences [5][6], unfolding of the temporal dimension into a fixed spatial window [7], and variable delays in the synaptic connections [8]. Recurrences permit obtaining behaviors in which it is necessary to maintain a state that somehow summarizes the history of the network in previous instants, but they do not easily permit using particular previous values of given inputs. Temporal windows do permit using particular values, but the network must process all the data within the window, even when some of them are not necessary. Variable delays in the synaptic connections permit the network to learn to process only the values of those instants of time it requires, thus reducing the number of connections needed.
In this work we will make use of variable delays in the connections of ANNs in order to obtain controllers for a real autonomous robot that needs to perform tasks for which the use of temporal information is necessary. The ANNs will be obtained using an evolutionary algorithm. The evolution will be carried out in a simulated environment, and the final controller will be downloaded to the real robot. In the examples we will show how handling temporal information permits obtaining functions that would not be possible otherwise. We will also show that the networks withstand very high levels of noise that would make it very difficult to obtain the behaviors with structures other than ANNs.
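A minimal sketch of a connection with a variable synaptic delay (our own construction, not the paper's implementation) shows the idea: each synapse reads the input from a learnable number of time steps in the past, so the network processes only the instants it needs rather than a full temporal window:

```python
from collections import deque

# Our own illustrative construction (not the paper's implementation) of a
# synaptic connection with a variable delay: the synapse outputs its weight
# times the input value seen `delay` time steps ago.

class DelayedSynapse:
    def __init__(self, weight, delay, max_delay=10):
        self.weight, self.delay = weight, delay
        # History buffer pre-filled with zeros, bounded by max_delay.
        self.history = deque([0.0] * (max_delay + 1), maxlen=max_delay + 1)

    def step(self, x):
        """Push the current input and return weight * x(t - delay)."""
        self.history.append(x)
        return self.weight * self.history[-1 - self.delay]

syn = DelayedSynapse(weight=2.0, delay=3)
outs = [syn.step(x) for x in [1, 2, 3, 4, 5]]
# outs -> [0.0, 0.0, 0.0, 2.0, 4.0]: the output lags the input by 3 steps.
```

In an evolutionary setting, the `delay` of each synapse would be part of the genotype alongside the `weight`.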
It is evident that handling temporal information is important for performing some tasks, but in addition it may be very useful for improving the performance of others. When the inputs of the ANN are very noisy, if the information of previous instants is available the network may choose to average the input data and thus reduce the effect of noise, increasing the robustness of the behavior. In other cases, identical sets of input values may be differentiated if the data from previous instants are considered, thus reducing ambiguities. In addition, temporal data may be employed for predictions, as in [9], where a robot predicts the trajectory a mobile target is going to follow and finds a fast interception point for chasing it in the minimum possible time, or predicts situations where a mobile target is going to crash into the robot, in order to prevent them.
Note that if we add a new dimension (time) to the ANN, then, in order to satisfy the criteria
established by [10] for the transfer of behaviors from simple simulations to complex real
worlds, we must also add noise to this new dimension. This will make the behavior
generated by the network robust. The noise present at the inputs of an ANN usually
has zero mean and consists of larger or smaller variations around the ideal value of the input.
Thus, for example, in the case of robots, this noise corresponds to imperfections in the
operation of the sensors. When the ANN must learn a given temporal pattern present at its
inputs, there is a high probability of the pattern not always being exactly the same. That is,
there may be slight differences between the points that make up the pattern: differences not
only in the values themselves, but also in the spacing of the samples
in time. The ANN must tolerate reasonable differences in the input values as well as errors in
the spacing between them.
For example, in the case of a robot, tolerance to temporal noise is fundamental. The
distance between the different events of a temporal pattern, even when it does not vary from
the viewpoint of an external observer (which is a very strong assumption), may be altered
from the point of view of the robot for different reasons. The robot may perceive events with
a different temporal distance simply because of a change in the controller (for example, an
increase or decrease in the number of neurons of the network) or even in the compiler employed
to obtain the object code, as any of these changes may imply longer or shorter times
between two consecutive processings of the input values by the ANN. If no type of
temporal noise is employed that makes the ANN tolerant to temporal variations in the
duration and/or separation between events, we will have to obtain new controllers
whenever any change of this type occurs.
This noise, which we may call temporal noise, may be addressed in different ways. As with
the other types of noise, care must be taken when it is introduced into the network. Zero-mean
noise redrawn on each evaluation of the ANN in the evolutionary process does not always
correspond to reality. It may be necessary to employ the same amount of noise for a large
number of executions of the ANN, changing it later to another value and preserving that
value for another number of executions, and so on. This is the case, for instance, if we want to
make an ANN tolerant to variations in the time the robot may need to execute it, due to the
previously discussed reasons: in these cases, the execution time of the controller will not
change significantly from one step to the next, but will change between evaluations of the
controller.
3 Architecture
Recurrent connections, consisting of at least one return path for the output
information of at least one neuron through connections between neurons of the same layer or
from one layer to a previous one, are useful for summarizing the history of previous activation
states, but do not permit simple storage of fixed temporal patterns, which is necessary for a
large number of applications. On the other hand, temporal windows, consisting of several
inputs to a given neuron (each usually with a different weight) corresponding to consecutive
temporal values of the same sensor, permit storing these temporal patterns,
but present the drawback that connections must exist for all the temporal instants within the
window, even when they are not necessary. This leads to a large number of connections,
longer processing times, and obscures the processing the network must perform. In
some applications, such as mobile robotics, processing time is very important, as it
determines the reaction speed of the robot. The importance of this must be stressed, as
the robot may be faced with dangerous situations in which processing speed matters,
especially when noise may delay the perception of dangers, or in the case of
simple robots, whose processing capacity is very small.

Figure 1: ANN with synaptic delays
In order to avoid these drawbacks, the architecture we employ for the ANNs (figure 1)
consists of several layers of neurons interconnected as a multilayer perceptron where each
synapse, in addition to its weight, includes a delay term that indicates the time an event takes
to traverse it. These delays, as well as the weights, are trainable, allowing the network to
obtain from its interaction with the world a model of the temporal processes required. One fact
that must be taken into account is that, in general, for the processes we are going to consider,
having delays only in the first layer is equivalent to having them in all the layers.
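The delayed-synapse forward pass described above can be sketched as follows. This is not the authors' implementation, only a minimal illustration of the idea under the assumption that delays are integer numbers of time steps and that each input keeps a short history buffer; the class and parameter names are ours.

```python
# Sketch of a layer whose synapses carry trainable integer delays: the
# synapse from input i to neuron j reads the value input i had `delay`
# steps ago, instead of the current one. Buffer sizes are illustrative.
from collections import deque
import math

class DelayedLayer:
    def __init__(self, weights, delays, max_delay):
        # weights[j][i], delays[j][i]: weight and delay of synapse i -> neuron j
        self.weights = weights
        self.delays = delays
        # history[i] holds the last max_delay+1 values of input i (newest first)
        self.history = [deque([0.0] * (max_delay + 1), maxlen=max_delay + 1)
                        for _ in range(len(weights[0]))]

    def step(self, inputs):
        for i, x in enumerate(inputs):
            self.history[i].appendleft(x)
        outputs = []
        for w_row, d_row in zip(self.weights, self.delays):
            # each synapse looks `d` steps into the past of its input
            act = sum(w * self.history[i][d]
                      for i, (w, d) in enumerate(zip(w_row, d_row)))
            outputs.append(math.tanh(act))
        return outputs
```

With a delay of 2 on one synapse, a pulse presented at that input only influences the neuron two steps later, which is exactly the selective access to past instants that distinguishes this scheme from a full temporal window.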
In order to obtain the weights and delays of the synaptic connections of the ANN, we have
made use of an evolutionary algorithm. The reason for using this type of algorithm is the
difficulty of determining, in general, how good a single action of the robot is towards the
achievement of its objectives. This is the credit assignment problem, which precludes the use
of a supervised learning algorithm or even a reinforcement learning scheme. In most cases we
cannot decide a priori what the best motion or sequence of motions is, especially because the
motion may imply a compromise among several cases that are perceptually identical to the
robot but in reality are different. When the behavioral complexity increases, or when the noise
level is so high that the designer cannot choose the optimal strategy, it becomes very difficult
to make use of learning. For this reason, the selection of an evolutionary algorithm as a
method for obtaining the parameters of the ANNs seems more adequate.
The type of algorithm employed is basically an evolutionary strategy with some changes in
order to adapt it to the nature of the problem. The selection of an evolutionary strategy rather
than a genetic algorithm is motivated by the large dependence between the weights of an ANN,
leading to a high level of epistasis, which can make the problem deceptive and slow
down the process of obtaining a solution, as indicated in [11]. Thus, an evolutionary strategy,
where more emphasis is put on mutation than on crossover, seems better suited than a genetic algorithm.
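A mutation-driven loop of the kind described can be sketched as below. This is a generic truncation-selection strategy over (weights, delays) genomes, not the authors' algorithm; the fitness function, population size and mutation parameters are illustrative stand-ins for the robot task.

```python
# Illustrative evolutionary-strategy loop in the spirit of the text:
# mutation-only search over real-valued weights and integer synaptic delays.
import random

def mutate(genome, sigma=0.3, max_delay=5):
    weights, delays = genome
    # Gaussian perturbation of weights, +/-1 steps on delays (clamped)
    new_w = [w + random.gauss(0.0, sigma) for w in weights]
    new_d = [min(max_delay, max(0, d + random.choice((-1, 0, 1))))
             for d in delays]
    return (new_w, new_d)

def evolve(fitness, pop_size=20, generations=100, n_weights=4, max_delay=5):
    pop = [([random.uniform(-1, 1) for _ in range(n_weights)],
            [random.randrange(max_delay + 1) for _ in range(n_weights)])
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        parents = scored[:pop_size // 2]          # truncation selection
        # keep the parents (elitism) and fill up with mutated offspring
        pop = parents + [mutate(random.choice(parents)) for _ in parents]
    return max(pop, key=fitness)
```

In the paper's setting the fitness would come from simulated evaluations of the robot controller; here any callable scoring a genome will do.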
6 Wall following
The wall-following behavior is one of the most common behaviors in the autonomous robotics
literature. The behavior consists in the robot finding and following the walls of an enclosure at
the highest speed possible, minimizing the distance to the wall it is following at each instant of
time and avoiding collisions. It is usually implemented in robots where the sensors employed
in this task provide values in a range large enough for the robot to be able to distinguish
whether it is approaching a wall or moving away from it. The biggest problem found
when obtaining these behaviors is caused by the presence of noise in the sensors. The infrared
sensors of the Rug Warrior, which are the only ones we can use for this task, are binary, as
mentioned before. This fact makes it impossible to decide whether we are approaching or moving
away from an object without taking the previous instants into account. An additional problem
found when obtaining this behavior is that the Rug Warrior has a single receiver for the two
emitters, located at an intermediate point between both sensors. This particular
arrangement of the emitters and receiver compounds the noise problem.
In order to guide the evolutionary process towards obtaining an ANN that implements the
desired behavior, we have implemented the following procedure. The fitness of the
robot is the amount of energy it possesses at the end of an evaluation period. The robot
increases its energy level by eating the food it finds stuck on the walls of the simulated
environment. In order to eat, the robot must simply sense a point on the wall with one of its
infrared sensors. Once the robot has sensed a brick of the wall, the food disappears from that
brick, forcing the robot to follow the walls of the environment in order to continue eating and
thus increasing its energy. The reasons for using this strategy, as opposed to engineering the
fitness function, are studied in detail in [13].
The environment employed is a world enclosed by walls that present a large
number of angles and shapes, so that the robot evolves a behavior that follows walls of any
shape without colliding with them.
If the controller is evolved without considering temporal information, we see (figure 3) that
there are some types of curves, whatever the number of hidden layers employed, that the robot
cannot handle in a satisfactory manner. This is due to the fact, commented on before, that
without temporal information the robot cannot differentiate some situations from others, and
thus adopts a simple strategy of turning one way when a wall is detected and the other way
when it is not. As the robot must not collide with the wall, the turning radius must be large
enough for this not to happen. This turning radius must be even larger if the robot starts its
evaluation far from the walls, as it must be capable of reaching them.

Figure 3: Wall following without delays
Using an ANN with delays we obtain the behavior of figure 4, in which it is easy to see how
the robot is capable of making all the turns, even when it starts its evaluation far from the
walls.

Note that these behaviors have been obtained with very high levels of noise. Smoother
behaviors may be obtained in the simulator with lower levels of noise, but they are less robust
and will not work adequately on the real robot. Thus, the range of the infrared sensors may be
reduced by up to 50% (in order to simulate different surfaces) and their orientation may also
vary between theoretical_position − π/8 and theoretical_position + π/8 in order to simulate
changes in orientation due to collisions. These types of noise are applied on top of the usual
5% noise level on the values that reach sensors and actuators. In the case of temporal noise,
the time elapsed between two data samplings is taken randomly from a range of values
determined at the beginning of each evaluation, and is maintained until the next evaluation.
This noise is

Figure 4: Following walls with delays
useful, in addition to the general reasons already mentioned, for simulating other
circumstances, such as a decrease in the battery charge (which leads to a smaller distance
advanced by the robot, equivalent to a smaller time interval between consecutive data
samplings), or different friction coefficients between the surfaces and the robot wheels. The main
problem observed with these very high levels of noise is the large oscillations that may arise
in the fitness of the individuals across generations, as a controller may have obtained a
very good fitness with some given levels of noise and be bad for other levels. In
order to minimize this problem it becomes necessary to increase the number of evaluations of
each individual above the number that would be required without these types of noise.
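The noise schedule just described can be sketched as follows. This is an illustration, not the authors' simulator: sensor noise is redrawn at every step, while the temporal noise (the sampling interval) is drawn once per evaluation and held fixed until the next one. All parameter values and names are ours.

```python
# Sketch of the two noise regimes described in the text: per-step
# multiplicative sensor noise, and a per-evaluation sampling interval.
import random

def make_evaluation_noise(rng, dt_range=(0.8, 1.2)):
    # temporal noise: drawn ONCE, kept constant for the whole evaluation
    dt = rng.uniform(*dt_range)

    def noisy_step(sensor_values, level=0.05):
        # sensor noise: redrawn at every step (here, +/-5% multiplicative)
        noisy = [v * (1.0 + rng.uniform(-level, level)) for v in sensor_values]
        return noisy, dt

    return noisy_step
```

Calling `make_evaluation_noise` again at the start of the next evaluation draws a new interval, which is what forces the evolved controller to tolerate changes in its own execution time.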
7 Homing
The robots learn to perform the tasks adequately in these environments and in many others
where they were tested. The evolutions took around 600 generations and each robot was
evaluated 16 times per generation. An order of magnitude fewer generations are required if no
temporal noise is used, but the results are not useful on real robots.
Figure 7: Homing behavior, initial position (light at the bottom is the trap, light at the top is
home, and the circle is the robot). Figure 8: Homing behavior, final position.
8 Conclusions
In this work we have studied the use of ANNs with temporal delays in the synapses for the
generation of behavioral robot controllers. The controllers were obtained using an
evolutionary process, and it was seen that in a relatively small number of generations the
results obtained were very adequate for tasks that could not be performed without the use of
temporal information. We have also ascertained the need to include a different type of noise in
the evolutionary process when temporal information is employed. This noise acts on the
temporal positions of the events perceived by the robot and helps to make the robot robust
with respect to time-dependent phenomena. It is also shown how the combination of ANNs
and evolutionary algorithms is capable of autonomously generating structures that can operate
in environments and on real robots where huge amounts of noise are present.
Acknowledgments
This work was funded by the Universidade da Coruña and the CICYT under project
TAP98-0294-C02-01.
References
Abstract - Classifier Systems are special production systems in which conditions and actions
are codified so that new rules can be learnt by means of Genetic Algorithms (GA). These
systems combine the execution capabilities of symbolic systems with the learning capabilities
of Genetic Algorithms. The Reactive with Tags Classifier System (RTCS) is able to learn
symbolic rules that generate sequences of actions, chaining rules across different time instants,
and to react to new environmental situations, taking the last environmental situation into
account when making a decision. The capacity of the RTCS to learn good rules has been
proven in a robot navigation problem. Results show the suitability of this approach to the
navigation problem and the coherence of the extracted rules.
1. Introduction
A Classifier System (CS), proposed by John Holland [1, 2, 3, 4, 5, 6, 7], is a kind of
production system. In general, a production system is a set of rules that trigger one another
and accomplish certain actions. Rules consist of a condition and an action. An action
can activate the condition of another rule, and thus rules interact with one another.
Classifier Systems are parallel production systems, while traditional expert systems
generally are not. In a parallel production system several rules can be
activated at the same time, while in non-parallel ones only one rule can be activated in
each cycle. Together with this capacity for parallel rule activation, CSs have the
property of learning syntactically simple rule chains to guide their behavior in
changing environments, and are therefore considered learning systems.
In traditional production systems, the value of a rule relative to the others is
fixed by the programmer in conjunction with an expert or group of experts in the
matter being emulated. In a CS this advantage does not exist: the relative
value of the different rules is one of the key pieces of information that must be
learnt. To facilitate this learning, the CS forces rules to coexist in an information-based
service economy. A competition is held among rules, in which the right to answer
an activation goes to the highest bidders, which pay the value of their bids
to the rules responsible for their activation. The competitive nature of the economy
ensures that good (useful) rules survive and bad rules disappear.
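The rule economy just described can be sketched as a single auction step. This is a minimal illustration of the bid-and-pay mechanism (in the spirit of Holland's bucket brigade), not the RTCS implementation; the bid fraction and the rule representation are our assumptions.

```python
# Minimal sketch of the rule economy: matching rules bid a fraction of
# their strength; the winner pays its bid to the rule that activated it.
BID_FRACTION = 0.1  # illustrative constant

def run_auction(rules, message):
    """rules: dicts with 'condition' (callable), 'strength', 'activated_by'."""
    bidders = [r for r in rules if r["condition"](message)]
    if not bidders:
        return None
    # the right to answer goes to the highest bidder
    winner = max(bidders, key=lambda r: BID_FRACTION * r["strength"])
    bid = BID_FRACTION * winner["strength"]
    winner["strength"] -= bid
    payee = winner.get("activated_by")
    if payee is not None:
        payee["strength"] += bid   # reward the rule that set up this activation
    return winner
```

Repeated over many cycles, with external reward paid to rules whose actions succeed, strength flows backwards along useful rule chains, which is the sense in which good rules survive.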
When a CS is employed for learning reactive behaviors, an additional problem is
detected with respect to the action chains: these chains blind the system, making it
insensitive to the environment for the duration of the chain, since the system cannot
handle any new input during the decision process. If, furthermore, the
environment where the learning takes place is dynamic, the system should read
the sensors (the input, i.e. the situation of the environment) at each decision step, since
this is the principal characteristic of reactive systems. To solve this problem, the
Reactive with Tags Classifier System, RTCS, has been proposed [8], [9]. For example, in the
problem studied here, the navigation of an autonomous robot through a dynamic
environment (where obstacles may be mobile), the robot should never remain blind:
each movement must be the result of applying a decision process to the last reading
of the sensors [10]. Control rules could be designed "ad hoc" for the problem by a human
expert, or learnt through artificial intelligence techniques. Some approaches have employed
Genetic Algorithms to evolve fuzzy controllers [11], Evolution Strategies to evolve
connection weights in a Braitenberg-style architecture [12], or neural networks for
behavior learning [13].
In the proposed learning system, the only prior information given to the system concerns
the number of inputs (for the robot, the number of sensors), their domain, and the number
of outputs (for the robot, the number of motors) and their description. Thus, the robot
controller (the RTCS) starts without any information about correct associations between
sensor inputs and motor velocities. From this situation, the system (robot +
controller) must be capable of learning to fit the sensor information as well as possible.
The robot has to discover a set of effective rules, employing the experience of past
situations, and must extract information from each situation as it occurs.
In this way, the system learns incrementally, and past experience
remains implicitly represented in the evolved rules.
2. Classifier Systems
Fig. 1: (a) Sensors in the real robot. (b) Input information to the system.
The input to the CS consists of three proximity sensors, the angle and distance to the goal
(given by ambient light sensors), and the velocity values obtained from the speedometer. The
outputs are the velocity values. The composition of the message can be seen in
figure 2.
Velocity values flow as input to the classifier system and as decisions from the CS to
the robot. The values are bounded by the maximum and minimum velocities (10, −10).
This range is divided into four equal sets. All these sets must be codified to build the
message from the environment; two binary digits are needed to represent each set.
The codified inputs to the robot are displayed in the table:
Code | Proximity      | Angle          | Distance       | Velocities
00   | Very Near (VN) | Near 0 (0)     | (0,25) (VN)    | Slow Forward (F)
01   | Near (N)       | <π (0-PI)      | (25,100) (N)   | Fast Forward (FF)
11   | Far (F)        | >π (PI-2PI)    | (100,200) (F)  | Backward (Bc)
10   | Very Far (VF)  | Near 2π (2PI)  | (200,∞) (VF)   | Stop (ST)
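The quantization into 2-bit codes shown in the table above can be sketched for one of the quantities, distance, as follows. The function name is ours and the thresholds are the interval boundaries from the table; note the code order (00, 01, 11, 10) follows the table rather than plain binary counting.

```python
# Sketch of the 2-bit coding of the distance reading, using the interval
# boundaries from the table above.
def encode_distance(d):
    if d < 25:
        return "00"   # Very Near (VN)
    elif d < 100:
        return "01"   # Near (N)
    elif d < 200:
        return "11"   # Far (F)
    else:
        return "10"   # Very Far (VF)
```

Concatenating such 2-bit fields for the three proximity sensors, the angle, the distance and the two velocities yields the fixed-length binary message the classifiers match against.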
The results obtained with the RTCS are due, on the one hand, to the introduction of Internal
Tags (IT) and, on the other, to the mechanism that allows the CS to be reactive (RTCS).
Evidently, it is the combination of the two mechanisms that permits such good results when
applying the CS to the navigation problem. In this section, the influence and contribution of
the Internal Tags will be analyzed. When an RTCS is allowed to evolve in the simulator, at a
certain moment it becomes able to solve the navigation problem; the robot is then considered
to have learnt. This RTCS has been transferred to the real robot and its efficiency in
navigation has been proven.
The analysis of the meaning of the symbolic rules obtained has been carried out by grouping
them. Each group contains a different number of rules that share some condition values
reflecting similar situations. Tables 1-6 collect the different groups: the symbolic values of
the condition part are represented by the concepts s1, s2, s3, A (angle), d (distance), and v1
and v2 (left and right wheel velocity values), and below the condition values the message
values (v1 and v2 in the message part) are found.
Table 1: Group 1 rules.
CONDITION MES
sl s2 s3 A d vl v2 vl v2
F or VF VN or N VF 0 N Bc ST F F
VF MN N or VF 0 VF F F F F
VF MN or L N or VF 0 VN or F F F FF FF
F or VF F or VF MN 0-PI VF Bc ST F F
F or VF N or VF VN or N 2PI or 0-PI VF FF FF or Bc Bc ST
VF VF MN or L 0-PI VF FF or Bc All Bc ST
F or VF VN or N VF 0-PI N FF FF ST Bc
VF VN or N VF 0-PI F or VF ST or F Bc ST Bc
VF VN or N VF 0-PI F F or FF FF ST Bc
VF N or VF VN or N 2PI or 0-PI F F or FF FF Bc ST
F or VF VF MN 0 or PI-2PI VF ST or Bc ST or Bc Bc ST
F or VF MN VF 2PI or PI-2PI VF ST or F Bc F F
VF VN or N VF PI-2PI N or VF ST or Bc ST F F
MN F or VF VF 0 or PI-2PI F or VF Bc ST Bc ST
VN or N VF VF 0-PI VF All ST F F
VN or N VF VF 2PI N F ST or F ST Bc
VN or N VF VF 0-PI VF All F F F
VF MN VF N VF ST Bc F F
VF VF MN 0-PI F All ST or Bc Bc ST
F or VF VF MN 0 F FF or Bc ST or Bc F F
Group 1 consists of 20 rules of the 119 that the RTCS contains. Analyzing the sensor
values s1, s2 and s3 in the rules of group 1, this group seems to have in common that it
represents situations of collision danger in some of the sensors. It does not seem to be a
group that answers solely to this characteristic, since collision danger appears frequently
in only one sensor. Furthermore, if the velocity values decided for each situation are
analyzed, turnings and advances of the robot are observed, though the number of rules that
make the robot turn is not very high. The angle values are those which compel the robot to
advance without turning when the collision risk appears at the sides (values s2 and s3).
Table 2: Group 2 rules.
CONDITION MES
S1 s2 s3 A d vl v2 vl v2
VN VN VF 0-PI N or VF FF FF ST Bc
N or VF VN or N F or VF 0 or 0-PI F F or FF FF ST Bc
N or VF VN VF 0 or PI-2PI F FF FF or Bc FF FF
VN VN or F VN 0-PI VF F or FF FF or Bc Bc ST
N F VN 0-PI VF Bc ST Bc ST
VN or N N or VF VN or F 0 VF Bc all Bc ST
N or VF VN or N VF 0-PI VF F F ST Bc
VN or N VN or F VN or F 2PI VF ST all Bc ST
N F VN or F PI-2PI VF FF FF Bc ST
VN or F VF F PI-2PI F or VF Bc ST Bc ST
N N or VF VN 2PI or PI-2PI F or VF Bc ST FF FF
F VN or N VF PI-2PI F or VF F or FF F F F
VN or N VN or F VF 0-PI F ST or F F FF FF
VN or F F N 2PI or PI-2PI VF ST or F Bc Bc ST
VN F VN 0-PI VF ST Bc Bc ST
VN N N all VF ST or F Bc ST Bc
VN VN or N VN or N 0-PI VF ST or Bc Bc ST Bc
VN or N F VF 2PI or 0-PI VN or F FF FF or Bc ST Bc
F VN N or VF PI-2PI VF FF FF ST Bc
N or VF F or VF VN or F 2PI VF all ST F F
N VN or N VF 0 or 0-PI VF FF FF ST Bc
N or VF VF VN PI-2PI N or VF FF FF Bc ST
N N or VF VN 2PI or PI-2PI F or VF Bc ST Bc ST
VN VN or N VF 0 or 0-PI VF F F ST Bc
N or VF VN VF PI-2PI VF F ST or F ST Bc
N VN VF 0 VF ST or Bc ST or Bc ST Bc
N or VF VN VF PI-2PI VF FF or Bc FF ST Bc
F VN or F VF 0 or 0-PI N ST or F all ST Bc
N or VF VN VF PI-2PI VF F ST or F ST Bc
Group 2 contains 29 of the 119 rules that form the RTCS, so it has 50% more rules than
the previous group. The most important characteristic observed in this group is that many
Near or Very Near values appear in sensors s1, s2 or s3, that
Table 3: Group 3 rules.
CONDITION MES
sl s2 s3 A d vl v2 vl v2
VF VF VF 0-PI VN or F All ST or F F F
VF F or VF F or VF 0 or PI-2PI VF ST Bc Bc ST
F or VF VF VF 0 VF ST Bc FF FF
VF VF VF 2PI or 0-PI All FF or Bc ST or Bc ST Bc
F or VF VF F or VF 0-PI VF FF FF FF FF
F or VF F or VF VF 2PI N FF FF or Bc FF FF
VF VF F or VF 2PI N F ST or F FF FF
VF F or VF VF 0 VF F F F F
VF VF VF 0 F F F F F
VF VF VF 0 F Bc ST F F
VF VF F or VF 2PI or 0-PI VN or N ST FF or Bc F F
Group 3 consists of 11 of the 119 rules that the RTCS contains. This group has fewer
rules than the previous ones. The situations represented, in contrast with the two previous
groups, do not contain collision situations; in fact the rules of this group seem rather to be
related to angle values. Analyzing the angle values of all the rules, the robot mostly seems
quite aligned with the objective. As a result of the inference of each rule, the messages sent
to the robot compel it to advance in a straight line, which corresponds to the situation where
the robot is located forming a 0 or 2π angle with the objective. The distance values to the
objective do not seem to be determinant in taking decisions.
Table 4: Group 4 rules.
CONDITION MES
sl s2 s3 a d vl v2 vl v2
N N or VF VF 0 VN or N ST or Bc Bc FF FF
VF N VF 2PI or PI-2PI VF Bc ST F F
VF N N or VF PI-2PI VF F or FF F F F
F VF N 0-PI VF Bc ST Bc ST
N VF F 0-PI VF FF or Bc ST Bc ST
N N or VF N 0 or 0-PI All Bc ST or F Bc ST
VF VF N All VF ST ST or Bc FF FF
N VF All 2PI VF F All FF FF
N or VF N VF 2PI or 0-PI VF FF FF FF FF
F or VF N N or VF 2PI or PI-2PI VF ST or Bc FF or Bc F F
F N VF 0-PI F F F F F
F N VF 2PI or 0-PI VF All F ST Bc
F N F or VF 0-PI VF FF FF or Bc ST Bc
N or VF N VF 0-PI VF FF FF FF FF
F F or VF N 0 or PI-2PI N or VF FF or Bc ST F F
All VF N 2PI or PI-2PI N or VF ST or F Bc Bc ST
F N VF 2PI VF FF F or FF ST Bc
VF N VF 0 VF F F F F
F VF N PI-2PI VF FF FF Bc ST
F or VF VF N 0 or PI-2PI VN or F F F F F
VF F or VF N 0-PI F F F F F
Group 4 consists of 21 of the 119 rules that the RTCS contains. In this group, no clear
tendency appears with respect to the general behaviors "straight to objective" and "avoid
obstacles". Analyzing the rules, some of the proximity sensors s1, s2 and s3 have Near
values, but this value appears in rules jointly with Far or Very Far values, so these are not
dangerous situations. Attending to the angle and distance values, it is observed that in
almost all rules the robot is far from the objective and in general not aligned with it. These
circumstances cause the rules to compel the robot to turn toward the objective and to
advance, so that, thereafter, some danger situation will be produced that groups 1 and 2 can
resolve.
Table 5: Group 5 rules.
CONDITION MES
sl s2 s3 a d vl v2 vl v2
VF N or VF VF All L ST or Bc ST or Bc FF FF
VF All VF 0 All FF or Bc ST FF FF
VF All VF 0 or 0-PI N ST or F F F F
VF All VF 0 All FF or Bc FF FF FF
VF N or VF F 0-PI VF ST or Bc Bc F F
VF N or VF F 0-PI VF All F or FF ST Bc
N or VF F or VF F 0-PI VF ST Bc Bc ST
VF VF N or VF 0-PI N or VF ST Bc FF FF
VF All All PI-2PI VF ST ST or F Bc ST
VF VF N or VF 2PI or 0-PI VF F F F F
VF VF N or VF PI-2PI VN or N FF FF or Bc FF FF
F or VF VF N or VF 2PI VF F F F F
N or VF F F or VF 2PI VF F F or FF F F
VF N or VF N or VF 2PI F or VF F ST or F F F
N or VF F All 2PI L ST or F ST or F F F
VF VF N or VF 2PI L FF FF FF FF
VF VF N or VF 2PI L F F FF FF
VF All VF 0-PI L F F F F
N or VF N or VF F or VF 0 or 0-PI L F F FF FF
VF F or VF N or VF 2PI VF FF FF FF FF
VF VF N or VF 2PI or PI-2PI L ST or Bc ST Bc ST
VF F or VF N or VF 2PI or PI-2PI L ST Bc FF FF
N or VF VF VF 0 N FF or Bc ST F F
VF F N or VF 0-PI N ST FF or Bc ST Bc
Group 5 is composed of 24 of the 119 rules that form the RTCS. This group is similar to
the previous one with respect to the values of s1, s2 and s3, but opposite with respect to the
angle and distance values. In this case, the angle values define, in almost all rules,
situations where the robot is aligned with the objective. As the distance value, in most of
the rules, corresponds to a Far or Very Far situation, the combined effect of the angle and
distance values causes the rules to compel the robot to advance straight to the objective.
5. Conclusions
This work has centered on the application of a new CS, named the Reactive with Tags
Classifier System (RTCS), to learning symbolic rules in a navigation problem.
The navigation of a robot can be defined as a complex behavior that requires
movement in a world with obstacles, where the robot's goal is to reach a predefined
point. From the point of view of learning, this problem is considered sufficiently
complex if the decision must be obtained in real time, since the environment
keeps changing while the decision is being taken, or, put another way,
the robot is moving while the decision is taken.
The RTCS contains a set of mechanisms that allow the incorporation of new
environmental information into the decision-making process. This allows
rule sequences (chaining rules across different execution instants) to be formed and broken
in order to provide a reactive output. The RTCS has proven its capacity for learning
both reactions and strategies, so the dilemma between reactive and planned systems may be
overcome.
6. References
1 Introduction
δ(X) = argmax_{i=0,1} { P(Ω_0|X), P(Ω_1|X) }

MEP(S_B) = (1/2) ∫ min( f(X|Ω_0), f(X|Ω_1) ) dX,

where f(X|Ω_i) denotes the density function of the patterns in class Ω_i. A value of
MEP(S_B) close to 0 means that the classes involved are well separated, whereas
if a considerable degree of overlapping is present, its value will tend to 0.5.
It is evident that no classifier will be able to discriminate between patterns
coming from the overlapping region between the classes involved. However, even
* J. Dorronsoro and C. Santa Cruz were partially supported by Spain's CICYT under
grant TIC 98-0247. Both are also members of the Department of Computer
Engineering, Universidad Autónoma de Madrid.
Fig. 1. Histograms of the x-sections of a sample of 10000 normal (solid line) and
exponential (dashed line) data.
Fig. 2. Evolution of the mean probability errors of NLDA (solid line) and MLP (dashed
line) classifiers with respect to the fraction of normal data in the training set.
classes. Some alternative coding procedures can be used to try to correct this
fact. However, they do not essentially improve the MLP performance depicted
above.
It thus seems desirable to avoid any incorporation of class-size information
into network training through target coding. Notice that discrimination-building
procedures that do not depend on a concrete target scheme certainly exist. In fact,
probably the best known discrimination method, Fisher's Analysis, does not rely on
any targeting scheme at all. In the following section we will discuss what we may call
Nonlinear Discriminant Analysis (NLDA), a classifier construction method that combines
the target-free nature of Fisher's Analysis with the approximating properties of MLPs.
Although the training of NLDA networks has a larger complexity than that of ordinary
MLPs, they appear to be more robust than MLPs on pattern recognition problems such
as the above synthetic example. In fact, figure 2 also shows the evolution of
the MEP values for NLDA classifiers built upon training sets with a decreasing
fraction of exponential sample data. It can be seen that at the beginning both
are similar and close to the Bayes optimum of 0.263, but once the fraction of
normal data in the sample is about 70%, MLP performance degrades faster than
that of the NLDA net, which remains below 0.3 even when 90% of the sample
are normal data, and below 0.35 when that proportion reaches 98% (notice that
when the sample is made up of just normal data, any classification method must
give a MEP value of 0.50).
The robustness that this target coding free training gives to NLDA classifiers
may make them suitable for use in a number of practical situations. For instance,
they have been used successfully in credit card fraud prevention [2], a problem in
which classes tend to have overlapping and size characteristics similar to those
of the above synthetic problem. NLDA network training will be briefly reviewed
in the next section, and in section 3 an example of their application to a psy-
Fig. 3. Architecture of a NLDA net and notational conventions used in the paper.
\[ J(W^I, W^H) = \frac{\left| (W^H)^t \, S_W^H \, W^H \right|}{\left| (W^H)^t \, S_B^H \, W^H \right|} \tag{1} \]
where S_B^H and S_W^H denote now the between and within class scatter matrices of
the transforms of the input patterns at the hidden layer level. These matrices are
thus affected by the concrete values of the input weights W^I. Therefore, the
criterion function J depends now on the pair of weight vectors (W^I, W^H), and
its minimization is to be done with respect to both.
A simple way to do it (see [9]) is to iteratively generate a sequence (W_k^I, W_k^H)
of minimizing weights in a two step fashion. For the k+1 weights, we first
compute W_{k+1}^H by keeping the W_k^I fixed and performing multiple C class dis-
criminant analysis on the pattern vectors provided by the hidden unit outputs
these fixed weights provide. This is done by the well known Fisher eigen-
value and eigenvector computations (see [8], pp. 115-121). Once this is done,
we keep fixed the just computed W_{k+1}^H and obtain the new W_{k+1}^I
by optimizing against it the corresponding version of the criterion function
J^I(W^I) = J(W^I, W_{k+1}^H).
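The two-step procedure can be illustrated with a minimal sketch for a 2-class problem, where the criterion reduces to the scalar ratio s_W/s_B of within to between class scatter of the single network output. This is not the paper's implementation: the tanh hidden layer, the synthetic data, and the plain finite-difference gradient step with backtracking (in place of the quasi-Newton procedure of [9]) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden(X, WI):
    # hidden-layer outputs; tanh is an assumed sigmoid choice
    return np.tanh(X @ WI)

def criterion(X0, X1, WI, wH):
    # scalar criterion s_W / s_B on the single network output
    # (2-class case: the output layer has C - 1 = 1 unit)
    y0, y1 = hidden(X0, WI) @ wH, hidden(X1, WI) @ wH
    m0, m1, m = y0.mean(), y1.mean(), np.concatenate([y0, y1]).mean()
    sB = len(y0) * (m0 - m) ** 2 + len(y1) * (m1 - m) ** 2
    sW = ((y0 - m0) ** 2).sum() + ((y1 - m1) ** 2).sum()
    return sW / sB

def fisher_w(H0, H1):
    # step 1: classical Fisher analysis on the hidden-layer outputs,
    # optimal output weights for the current input weights
    m0, m1 = H0.mean(axis=0), H1.mean(axis=0)
    SW = (H0 - m0).T @ (H0 - m0) + (H1 - m1).T @ (H1 - m1)
    return np.linalg.solve(SW + 1e-6 * np.eye(SW.shape[0]), m1 - m0)

# toy 2-class sample: D = 2 inputs, M = 3 hidden units
X0 = rng.normal(0.0, 1.0, (100, 2))
X1 = rng.normal(1.5, 1.0, (100, 2))
WI = rng.normal(0.0, 0.5, (2, 3))
wH = fisher_w(hidden(X0, WI), hidden(X1, WI))
j_init = criterion(X0, X1, WI, wH)

lr = 0.1
for k in range(20):
    # step 1: recompute the Fisher-optimal output weights
    wH = fisher_w(hidden(X0, WI), hidden(X1, WI))
    # step 2: improve W^I on J^I(W^I) = J(W^I, W^H) by a
    # finite-difference gradient step (the paper uses quasi-Newton)
    eps = 1e-5
    J0 = criterion(X0, X1, WI, wH)
    G = np.zeros_like(WI)
    for idx in np.ndindex(WI.shape):
        Wp = WI.copy()
        Wp[idx] += eps
        G[idx] = (criterion(X0, X1, Wp, wH) - J0) / eps
    Wn = WI - lr * G
    if criterion(X0, X1, Wn, wH) < J0:
        WI = Wn
    else:
        lr *= 0.5  # simple backtracking keeps the criterion non-increasing

j_final = criterion(X0, X1, WI, wH)
```

Because step 1 is globally optimal in the output weights and step 2 only accepts improving steps, the criterion value never increases over the iterations.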
Several choices are now available; in [9] a quasi-Newton procedure is used,
for which the gradients ∇J^I = (∂J^I/∂w_{kl}^I) have to be computed. This can be
done using the vectors out^h = (out_{ij}^h) as intermediate variables (see figure 3 for
the concrete variable labeling). More precisely, if M denotes the number of hidden
units, C the number of classes and N_i that of sample patterns in class i, we have

\[ \frac{\partial J^I}{\partial w_{kl}^I} = \sum_{i=1}^{C} \sum_{j=1}^{N_i} \sum_{h=1}^{M} \frac{\partial J}{\partial out_{ij}^h} \, \frac{\partial out_{ij}^h}{\partial act_{ij}^h} \, \frac{\partial act_{ij}^h}{\partial w_{kl}^I} , \tag{2} \]
which, for instance, in the simplest case of a 2 class problem, reduces to an
expression in terms of two scalars s_B and s_W, which denote now the between and
within class scatter of the network outputs. Notice that here the output layer has
a single unit (in general, as it happens with classical Fisher analysis, it has C - 1
units for a C class problem) and, therefore, the determinant ratio of the general
criterion function reduces to a simple quotient of scalars.
For the practical application of the above formula, the partials ∂s_B/∂out_{ij}^h
and ∂s_W/∂out_{ij}^h have to be computed. A further analysis shows that
\[ \frac{\partial s_B}{\partial out_{ij}^h} = 2 \sum_{k=1}^{M} w_{h1}^H \, w_{k1}^H \left( m_i^k - m^k \right) , \]

\[ \frac{\partial s_W}{\partial out_{ij}^h} = 2 \sum_{k=1}^{M} w_{h1}^H \, w_{k1}^H \left( out_{ij}^k - m_i^k \right) , \]
where m_i^k and m^k denote the components of the class means m_i and of the total
mean m respectively, and N = Σ_i N_i is the total number of sample patterns.
Further details on NLDA networks can be found in [9]. We just mention here
that, with respect to NLDA complexity, the more complicated criterion function
used in NLDA obviously implies costlier model training than that of
MLPs. The simplest way of comparing both is to estimate the cost of gradient
computations for each model. For a two class problem, it can easily be derived
from the above estimates that the cost of a single full network gradient estimation
is O(NDM^2), with M and N as before, and where D denotes the input pattern
dimension. In contrast, gradient computations in backpropagation have a
cost of O(NDM), that is, they are M times faster. In a general C class problem
the relative costs are the same if traces are used instead of determinants in the
definition of the NLDA criterion function; if the latter are used, the cost of an
NLDA gradient estimate shoots up to O(NDM^2 (C-1)^5) = O(NDM^2 C^5). In any
case, when tried on problems in which both methods converge, NLDA training
tends to require fewer iterations than backpropagation, substantially alleviating
its greater cost.
                Training              Test
             MLP      NLDA        MLP      NLDA
   class    0    1   0    1     0    1   0    1
     0     51    1  47    5    17   13  22    8
     1      2   56   8   50    12   18  10   20

Table 1. Classification results for training and test sets for MLP and NLDA
networks (rows: true class, columns: predicted class).
The results reported below are derived from a 170 pattern initial sample
obtained in an early phase of the project. The sample individuals were divided
into two performance categories, with 82 and 88 individuals respectively. Observe
that small sample sizes are unavoidable in most performance assessment
applications, because of factors such as the specificity of the task abilities to be
measured, the usually not too large sizes of hiring or training groups, or
the concrete company testing requirements that make it difficult to aggregate
samples obtained through different test procedures.
Several test sets were formed with 60 patterns, 30 from each class, and
the training sets were made up of the remaining 110 patterns, 52 of them
corresponding to the first class and 58 to the other. For each set, various MLP
and NLDA networks were trained, with some of them being discarded because
of either poor training convergence or poor test results (notice that local minima
convergence or overfitting are bound to appear, given the above mentioned
pattern input dimension and sample sizes). Table 1 contains training and test
set confusion matrices for both NLDA and MLP networks. The MLP has two
output units, with the usual target coding values of (1,0) and (0,1), while NLDA
networks yielded one dimensional outputs (see section 2). As can be seen from
the table, training set classification was slightly better for the MLP net than for
the NLDA one. However, the MLP net has a test classification error percentage
of 42% while that of the NLDA network was just 30%. Among other plausible
interpretations, this seems to imply a more robust behavior of NLDA
networks with respect to overfitting.
The overall application in which NLDA classifiers will be incorporated is now
in mid development. A second testing phase, very nearly completed, will add
about 200 more patterns to the initial 170 pattern sample, which will be further
augmented in subsequent validation and maintenance phases. These new data will
be used to validate the initial models and to derive, if possible, new, more efficient
classifiers.
4 Conclusions
The target free nature of NLDA network training seems to make the resulting
classifiers rather robust in situations where classes have a high degree of
overlapping and sample sizes are very uneven. This has been observed in synthetic
classification problems (such as the one reported in the first section) and
also in practical applications (such as credit card fraud detection). In this work
their use for professional performance assessment has been reported. This is a
difficult classification task with a priori high class overlapping, but where usually
all class sample sizes are rather small, a difficulty which is compounded by
large input pattern sizes that cannot be reduced through standard dimensionality
reduction techniques. In any case, NLDA results over a first field sample are very
encouraging. Subsequent work will concentrate, on the one hand, on enlarging
the available samples, and, on the other, on an empirical study of the general
effects of small overlapping samples on the discrimination capabilities of NLDA
networks, and on their comparison with other classifier construction techniques.
References
1 Introduction
Modeling and control of industrial processes usually requires that an analytic
system model can be built. However, in large industrial systems, global models
cannot always be defined. In such cases, discovering complex relationships between
system variables is often problematic, and modeling should be based on experimental
knowledge. Modern automation systems produce masses of measurement data
which, however, may be very difficult or even impossible to interpret. In many
practical situations, even minor knowledge about the characteristic behavior of the
system might be useful. For this purpose, the measurements need to be converted
into some simple and comprehensive display which would reduce the dimensionality
of the measurements and simultaneously preserve the most important metric relation-
ships in the data.
Artificial neural networks have successfully been used to build system models
directly based on process data. They provide means to analyze the process without
an explicit physical model. The Self-Organizing Map (SOM) [7] is one of the most
popular neural network models. It is especially suitable for system analysis due
to its unsupervised learning and topology preserving properties. The SOM algorithm
implements a nonlinear mapping from the high-dimensional input data space onto a
two-dimensional grid or net of neurons. The mapping preserves the most important
topological and metric relationships of the input data. The net roughly approxi-
mates the probability density function of the data and, thus, inherently clusters
the data. Various visualization alternatives of the SOM are helpful, e.g., in hunting
correlations between measurements and in investigating the cluster structure of the
data.
The SOM based data exploration has been applied in various engineering appli-
cations such as pattern recognition, text and image analysis, financial data analysis,
process monitoring and modeling as well as control and fault diagnosis [8, 12]. In
addition, the SOM has been used in analysis of telecommunications environment,
e.g., in discrete-signal detection and adaptive resource allocation problems [10, 13].
The ordered signal mapping property of the SOM algorithm has proven to be
especially powerful in the analysis of complex industrial processes [1]. In this paper,
the analysis of a pulping process is considered. The SOM based approach has been
utilized to determine the reasons for situations where the pulp quality variable, the
kappa number, becomes too low. A similar approach has also been used in the analysis
of a steel rolling process.
[Figure: utilization of process data, divided into black-box modeling and data analysis.]
Black-box models build a regression model between the inputs and outputs of a sys-
tem based on measurement data. They can then be used to predict system
outputs when the inputs are known. For example, in [11, 2, 9], feed-forward
artificial neural networks are suggested for the prediction of pulping processes. The
disadvantage of black-box models, however, is that they give little information about
the dependencies between the variables, and it is difficult to get a general view of
the system behavior.
In data analysis, useful information is extracted from process data without model-
ing the system. The approach needs to be distinguished from process experiments,
where inputs are intentionally varied to find out the effect of the changes on the
output. Data analysis could rather be used to find reasonable setups for the
experiments.
The application areas of data analysis are processes or phenomena which,
due to their complexity or nature, are impossible to model analytically. The
advantage of the analysis is that neither process modifications nor experiments are
required, and general methods independent of the application can be used.
Two different types of information may be acquired. Qualitative information
can be obtained using a data representation that is easy to understand and
interpret; this is usually done using data visualization techniques. An example of
quantitative, i.e., numerical information is the correlation coefficient depicting the
strength of the linear dependency between two variables.
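As a tiny illustration of the quantitative side, the Pearson correlation coefficient can be computed directly from two measurement vectors; the two signals below are synthetic stand-ins for real process measurements.

```python
import numpy as np

rng = np.random.default_rng(1)

# two process signals: x, and y depending linearly on x plus measurement noise
x = rng.normal(size=500)
y = 0.8 * x + 0.3 * rng.normal(size=500)

# Pearson correlation coefficient: strength of linear dependency, in [-1, 1]
r = np.corrcoef(x, y)[0, 1]
```

For the chosen noise level the coefficient comes out close to 1, reflecting the strong linear dependency between the two signals.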
Because the whole analysis is based on data, the measurements have to be
reliable. Signal noise is usually not a problem, because it may be reduced using
filtering techniques. Real problems are due to sensor faults. A good example is signal
non-stationarity caused by the slow drifting of dirt on the surface of the measurement
sensor.
Analysis of data originating from a complex system is an iterative process. In
Figure 2, the different stages of data analysis are presented. After data acquisition, the
variables of interest are selected. Common preprocessing operations include the removal
of erroneous measurements, noise reduction, and compensation of delays between dif-
ferent variables. The analysis may directly lead to conclusions, but typically several
variable sets need to be considered; the preprocessing may also need to be altered.
Usually the data analysis is started using as many variables as possible, which are
then reduced to the most significant ones.
Fig. 2. Stages of data analysis: data acquisition, variable selection, preprocessing,
analysis, and conclusions.
It should be emphasized that the whole analysis process requires the presence of
a process expert, whose assistance is valuable in the selection of variables, in choosing
the methods and their parameters for preprocessing, as well as in the interpretation and
checking of the results. Expertise is, of course, also needed in drawing the
correct conclusions from the results.
the algorithm simultaneously does two things: vector quantization and projection
of the prototype vectors into two dimensions.
The topology preserving property of the mapping makes it possible to apply
several different visualization methods to study the dependencies between variables
in different parts of the input space. In the following, the visualizations used in the
case study are discussed in detail.
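The core of the algorithm can be sketched in a few lines. The sketch below is illustrative only: the grid size, the exponential decay schedules for the learning rate and neighborhood radius, and the Gaussian neighborhood function are assumed choices, and the data are a synthetic stand-in for process measurements.

```python
import numpy as np

rng = np.random.default_rng(2)

# map grid: 6 x 6 neurons; data: 3-dimensional measurement vectors
rows, cols, dim = 6, 6, 3
grid = np.array([(i, j) for i in range(rows) for j in range(cols)], float)
W = rng.normal(size=(rows * cols, dim))    # prototype (weight) vectors
X = rng.normal(size=(1000, dim))           # stand-in for process data

def quant_err(W, X):
    # mean distance from each data vector to its nearest prototype
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return np.sqrt(d2.min(axis=1)).mean()

qe_before = quant_err(W, X)

n_iter = 2000
for t in range(n_iter):
    x = X[rng.integers(len(X))]
    # best-matching unit: nearest prototype in the input space
    bmu = np.argmin(((W - x) ** 2).sum(axis=1))
    # neighborhood radius and learning rate shrink over time
    sigma = 3.0 * (0.1 / 3.0) ** (t / n_iter)
    alpha = 0.5 * (0.01 / 0.5) ** (t / n_iter)
    # Gaussian neighborhood measured on the 2-D map grid, not input space
    d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
    h = np.exp(-d2 / (2 * sigma ** 2))
    # move the BMU and its map neighbors toward the input vector
    W += alpha * h[:, None] * (x - W)

qe_after = quant_err(W, X)
```

Updating whole map neighborhoods (rather than only the winner) is what produces the topology preservation: neurons adjacent on the grid end up with similar prototype vectors, while the quantization error of the prototypes decreases.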
3.1 SOM visualization
In the case study, the behavior of a continuous pulp digester of a pulp mill was
studied. An illustration of the digester and the separate impregnation vessel is
shown in Figure 3.
The color images of this article can be viewed at URL
http://www.cis.hut.fi/projects/ide/publications/fulldetails.html#iwann99
Fig. 3. The continuous digester and the impregnation vessel. The cooking and wash liquor
flows are marked by thin lines and the chip flow by a thick line. The four square-shaped
symbols are heat exchangers.
Presteamed wood chips together with cooking liquor are fed into the impreg-
nation vessel. After the impregnation, the chips are fed into the digester. At the
top of the digester, they are heated to cooking temperature using steam, and the
pulping reaction, the removal of lignin, begins. During the cooking, the chips move
downwards in the digester. The cooking ends at the extraction screens by displacement
of the hot cooking liquor by cooler wash liquor, which is injected into the digester
through bottom nozzles and the bottom scraper. The liquor moves counter-current to
the chip flow and washes the chips.
Problems in the digester operation, in which the pulp consistency at the digester outlet
dropped, were the starting point for the analysis. In those situations, the values of the
end product quality variable, the kappa number, were smaller than the target value.
The test material consisted of measurements made in the digester during one
week with constant production speed. At the end of the period, the digester ended
up in such a faulty situation that the production speed of the line had to be
dropped in order to restore control of the digester.
Fig. 4. Component planes of the SOM trained using 24 measurement signals of the digester.
The signals depicting the operation of the digester were collected from the automa-
tion system of the mill. The test period was selected so that there were no significant
errors in the measurements. In preprocessing, the signals were delayed with respect
to each other using known digester delays. Because the signal values were already
averages of ten measurements made once a minute, no further noise reduction by
filtering was required.
In Figure 4, the component planes of a SOM trained using 24 measured signals
are presented. In the lower part of the map, especially in the right corner, the value
of kappa number is low. This means that the problematic states are mapped into
that part of the map.
The component planes of Figure 4 are rearranged in Figure 5 so that component
planes resembling each other, i.e., the correlating ones, lie near each other; this aids
in the interpretation of the numerous component planes.
Using the reorganization of the component planes in Figure 5, the 11 variables
best correlating with the kappa number were selected for further investigation, while
the 12 others were rejected. In Figure 6, the component planes of another SOM
trained using the selected variables only are shown. Also in this case, the problematic
process states with a low kappa number were mapped into the bottom right corner
of the map. To illustrate correlations between the kappa number and several other
variables, each map node is assigned a hue (see Figure 7 top left corner). Then,
Fig. 5. Rearranged component planes. The variables selected for further investigation are
surrounded by a black line. The output variable of interest, the kappa number, is indicated
by a black arrow.
xy-plots with the kappa number on the y-axis and the other ten variables on the x-axis
were produced using the map weight vector values.
The scatter plots indicate that in the faulty states there seems to be very lit-
tle correlation between the kappa number and the H-factor, which is the variable used
to control the kappa number. On the other hand, the variables "Dig_chip_r", "Dig_liq_l",
"Black_liq", "Screens", and "Press_d" seem to correlate with the kappa number.
The explanation for this is that in a faulty situation, the downward movement
of the chip plug in the digester slows down. This is due to the fact that the plug
is so tightly packed at the extraction screens that the wash liquor cannot pass it.
There are two consequences: the wash liquor slows down the downward movement
of the plug, and the pulping reaction does not stop.
Because the cooking continues, the kappa number becomes too small. In addi-
tion, the H-factor based digester control fails, because the cooking times of the chips,
assumed constant in the H-factor computation, become longer due to the slowing
down of the chip plug movement.
Finally, in Figure 8, the signals that were used to train the SOM of Figure 6
are colored using the coding presented in Figure 7. Now it can be clearly seen
that the process operates normally until sample 700. From then on, problematic
situations marked by yellow and green hues appear every now and then. The last
variations in the kappa number are so alarming that the operators have to slow
down the production rate of the digester (not shown).
Fig. 8. Color coding of signals used to train the SOM with 12 variables.
5 Conclusions
The Self-Organizing Map can be effectively used to find and visualize correlations
between process variables in different operational states of the process. In faulty
states of the continuous digester, variables that normally do not have much effect
on the pulp quality (chip level etc.) seemed to affect the pulping. This was due to the
fact that they correlated with chip plug movement in the digester. In the problematic
situations, the H-factor based kappa number control failed due to increased residence
time of the chips in the digester, which is assumed to be approximately constant at
constant production speed. In other words, the H-factor was bigger than the control
system expected it to be.
5.1 Acknowledgments
The authors wish to thank UPM-Kymmene Wisaforest pulp mill for the pulping
data and UPM-Kymmene Pulp Center for aid in the interpretation of the results. Fi-
nancial support by Technology Development Centre of Finland and UPM-Kymmene
is gratefully acknowledged.
References
1. E. Alhoniemi, J. Hollmén, O. Simula, and J. Vesanto. Process Monitoring and Model-
ing Using the Self-Organizing Map. Integrated Computer-Aided Engineering, 6(1):3-14,
1999.
2. B.S. Dayal, J. F. MacGregor, P. A. Taylor, and S. Marcikic. Application of feedforward
neural networks and partial least squares for modelling kappa number in a continuous
kamyr digester. Pulp & Paper Canada, 95(1):26-32, 1994.
3. R. R. Gustafson, C. A. Sleicher, W. T. McKean, and B. A. Finlayson. Theoretical model
of the kraft pulping process. Industrial & Engineering Chemistry Process, 22(1):87-96,
Jan. 1983.
4. J. Himberg. Enhancing SOM-based data visualization by linking different data pro-
jections. In L. Xu, L. W. Chan, and I. King, editors, Intelligent Data Engineering and
Learning, pages 427-434. Springer, 1998.
5. E. Härkönen. A mathematical model for two-phase flow in a continuous digester. Tappi
Journal, 70:122-126, Dec. 1987.
6. S. Kaski, J. Venna, and T. Kohonen. Tips for Processing and Color-Coding of Self-
Organizing Maps. In G. Deboeck and T. Kohonen, editors, Visual Explorations in
Finance, Springer Finance, chapter 14, pages 195-202. Springer-Verlag, 1998.
7. T. Kohonen. Self-Organizing Maps, volume 30 of Springer Series in Information Sci-
ences. Springer, Berlin, Heidelberg, 1995.
8. T. Kohonen, E. Oja, O. Simula, A. Visa, and J. Kangas. Engineering Applications of
the Self-Organizing Map. Proceedings of the IEEE, 84(10):1358 - 1384, 1996.
9. M. T. Musavi, D. H. Coughlin, and M. Qiao. Prediction of wood pulp k# with
radial basis function neural network. In Proceedings of the 1995 IEEE International
Symposium on Circuits and Systems, volume 3, pages 1716-1719, Piscataway, 1995.
IEEE.
10. K. Raivio, J. Henriksson, and O. Simula. Neural detection of QAM signal with strongly
nonlinear receiver. Neurocomputing, 21:159-171, 1998.
11. J. B. Rudd. Prediction and control of pulping processes using neural network mod-
els. In 80th Annual Meeting, Technical Section, volume B, pages 169-173, Montreal,
Quebec, Canada, Feb. 1994. Canadian Pulp & Paper Association.
12. O. Simula and J. Kangas. Neural Networks for Chemical Engineers, volume 6 of
Computer-Aided Chemical Engineering, chapter 14: Process monitoring and visualiza-
tion using self-organizing maps, pages 371-384. Elsevier, Amsterdam, 1995.
13. H. Tang and O. Simula. The optimal utilization of multi-service SCP. In Intelligent
Networks and New Technologies, pages 175-188. Chapman & Hall, 1996.
14. J. Vesanto. SOM-Based Data Visualization Methods. Intelligent Data Analysis, 1998.
Accepted for publication.
15. J. Vesanto and J. Ahola. Hunting for Correlations in Data Using the Self-Organizing
Map. Accepted for publication in International ICSC Symposium on Advances in
Intelligent Data Analysis.
16. J. Vesanto, J. Himberg, M. Siponen, and O. Simula. Enhancing SOM Based Data
Visualization. In T. Yamakawa and G. Matsumoto, editors, Proceedings of the 5th In-
ternational Conference on Soft Computing and Information/Intelligent Systems, pages
64-67. World Scientific, 1998.
Gradient Descent Learning Algorithm for
Hierarchical Neural Networks: A Case Study
in Industrial Quality
Daniela Baratta, Francesco Diotalevi, Maurizio Valle and Daniele D. Caviglia
1 Introduction
Artificial Neural Networks (NNs) are an efficient solution for many real
world problems. At present there is a growing interest in applications like Optical
Character Recognition (OCR), remote-sensing image classification, industrial
quality control analysis and many others in which NNs can be
effectively employed [1]-[3].
Among learning algorithms, the most widespread are supervised ones that adopt the
error function gradient descent technique [4]: Back Propagation (BP), for example,
is one of the most widely used and reliable, but its implementation in analog VLSI
requires precise and complex circuits [5]. The Weight Perturbation
(WP) algorithm, on the other hand, was formerly developed to simplify the circuit
implementation [6] and, although it looks more attractive than BP for analog
VLSI implementation, its efficiency in solving real world problems has not yet been
thoroughly investigated.
Usually, in pattern recognition problems where the input data can be separated
into categories, classification trees are a popular approach (see, among others,
[7] and [8]). On the other hand, Neural Networks also constitute a cost-
efficient solution [9]. Comparisons between classification trees and Multi Layer
Perceptrons (MLPs) show that MLPs feature similar or even better classification
and generalisation performances [10], [11]. To take advantage of both the NN and
classification tree approaches, some authors have tried to use NNs together with
classification trees (see, among others, [12], [13]).
Fig. 1. A sketch of surface defect classes and families in flat rolled mills (families:
Scratches, Seams, Transverse, Stains, Dirtiness, Marks).
Concerning training, each MLP is trained independently of the others (e.g.
using either the Back Propagation algorithm [15], [20] or the Weight Perturbation
algorithm [6]) on the corresponding data sub-set.
[Figure: hierarchical network; the input features (d1, d10, ..., d27) are routed to the
family classifiers Scratches, Seams, Transverse, Stains, Dirtiness, Marks.]
The WP algorithm approximates the gradient of the output error function e by
perturbing the weights and measuring the resulting error change:

\[ \Delta w_{ij}^{(n)} = -\eta \, \frac{e\left(w_{ij} + step \cdot pert_{ij}^{(n)}\right) - e\left(w_{ij}\right)}{step} \; pert_{ij}^{(n)} = -\eta \, \frac{\Delta e}{step} \; pert_{ij}^{(n)} \tag{5} \]

We can combine the information of the term step in the η value, i.e.:

\[ \Delta w_{ij}^{(n)} = -\eta' \, \Delta e \; pert_{ij}^{(n)}, \qquad \eta' = \frac{\eta}{step} \tag{6} \]

\[ pert_{ij}^{(n)} = \pm 1 \quad \text{with equal probability} \tag{7} \]

To update the synaptic weight w_ij, we only need to compute Δe and to know
pert_ij^(n).
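Equations (5)-(7) can be illustrated with a minimal sketch that is not the paper's implementation: a single linear neuron trained by weight perturbation on a synthetic regression task, with a mean squared output error e as an assumed choice, all weights perturbed simultaneously by random ±1 steps, and arbitrary values for step and η.

```python
import numpy as np

rng = np.random.default_rng(3)

# synthetic task (an assumption): a single linear neuron learning y = x1 - 2*x2
X = rng.normal(size=(200, 2))
y = X[:, 0] - 2.0 * X[:, 1]
w = np.zeros(2)

def error(w):
    # output error function e; mean squared error is an assumed choice
    return np.mean((X @ w - y) ** 2)

step = 1e-3        # perturbation magnitude
eta_prime = 50.0   # eta' = eta / step with eta = 0.05, as in eq. (6)
e_init = error(w)

for n in range(300):
    pert = rng.choice([-1.0, 1.0], size=w.shape)  # eq. (7): +/-1 equiprobable
    de = error(w + step * pert) - error(w)        # Delta-e of eq. (5)
    w -= eta_prime * de * pert                    # eq. (6): dw = -eta' * de * pert
```

Note that only two error evaluations per update are needed, which is what makes the rule attractive for analog VLSI: no explicit gradient circuitry is required.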
By the term "learning strategies" we mean the way in which the
synaptic weights are updated [17]. The two main learning strategies are:
By pattern: with the by-pattern approach, the pattern examples are sequentially
and usually randomly given in input to the network; the synaptic weight values are
updated at each example presentation following the direction of the negative
gradient of the output error function e.
By epoch: with the by-epoch approach, the synaptic weight values are updated
when all the pattern examples have been given in input to the network, following the
direction of the negative gradient of the output error function e.
With respect to the by-epoch approach, the by-pattern example presentation
procedure introduces some randomness into the learning process that may often
help in escaping from the local minima of the output error function e. Moreover,
this technique is usually faster and more effective when the training set is composed
of thousands of pattern examples (e.g. in the case of hand-written character
recognition, speech recognition, etc.). On the other hand, the by-epoch approach
usually gives better results when high precision is required (e.g. function
approximation).
To accelerate the learning process we adopt an adaptive and local learning
rate strategy [18]: each synapse has its own local learning rate η_ij, and the value of
each η_ij is changed adaptively following the behavior of the local gradient of the
error function (∂e/∂w_ij).
More precisely, η_ij is increased when the sign of ∂e/∂w_ij is the same during at
least two successive iterations, and is decreased when the signs during two
consecutive iterations differ.
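The sign-based adaptation can be sketched as follows; `adapt_rates` is a hypothetical helper, and the increase/decrease factors are illustrative, since the paper does not state them here.

```python
import numpy as np

def adapt_rates(eta, grad, prev_grad, up=1.2, down=0.5):
    # increase eta_ij when the local gradient estimate kept its sign over
    # two successive iterations, decrease it when the sign flipped
    # (up/down factors are assumed, not taken from the paper)
    same_sign = np.sign(grad) == np.sign(prev_grad)
    return np.where(same_sign, eta * up, eta * down)

eta = np.full(3, 0.1)                  # one local learning rate per synapse
g_prev = np.array([1.0, -1.0, 1.0])
g_now = np.array([0.5, -2.0, -1.0])    # third component changed sign
eta = adapt_rates(eta, g_now, g_prev)  # third rate shrinks, others grow
```

The per-synapse rates let weights with a consistent descent direction accelerate while oscillating weights are damped.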
The final version of the WP learning algorithm that we adopted combines the update
rule (5)-(7) with this adaptive local learning rate strategy.
5 Simulation results
Though neural approaches are very effective in dealing with problems whose
specifications cannot be explicitly defined (e.g. using "rules" as in
Expert Systems), they need a large database for the training task. The
database that we received from the technicians working on the steel-industry plant is
relatively small (1725 patterns), due to the high cost of collecting data. Moreover,
the number of samples is not equally distributed among the classes and families (see
Fig. 3).
The measurements were collected through the following system (Fig. 4):
- an on-line CCD camera;
- an on-line image acquisition/pre-processing system which performs
filtering and feature extraction tasks on images of the surface of the flat
rolled strip.
The Data Base (DB) was obtained with on-line and on-plant measurements; it
includes samples of steel-ribbon impurities and flaws. Each sample consists of 16
features, some representing the geometrical properties of the imperfections on the
strip, others providing information about the illumination, width and thickness of the
strip, etc., and a code number which identifies the type of defect.
Most of the input features come from the on-line image acquisition/pre-
processing system. The complete list of the features of the input samples is detailed
in Table 1.
The use of the DB augmented with noise allows a better comparison of the
classification performances of the TMLP trained with either the BP or the WP
learning algorithm. In Table 3, the most significant features of the two algorithms
are reported.
Fig. 5 and Fig. 6 report the classification results related to the superclasses and
the defect classes, respectively. Fig. 7 reports the overall performance of the
TMLP with the two learning techniques.
Table 2. Number of patterns for the Training Set, Test Set and Validation Set.

Table 3. Most significant features of the BP and WP learning algorithms.

                       BP                              WP
Weight update          by epoch                        by pattern
Learning rate update   Vogl's acceleration technique   local learning rate
                                                       adaptation strategy
Stop criteria          - the number of epochs is       - the minimum of the
                         greater than an upper limit     Validation Set error
                       - ABS(gradient) is less than      is reached
                         a given threshold
Activation function    Hyperbolic tangent              Hyperbolic tangent
6 Conclusions
It is worth noting that even if the available Data Base was of limited
dimensions and of poor quality with respect to the distribution of examples, the use
of the TMLP architecture makes it possible to reach reasonable classification
performances.
The comparison between the BP algorithm with Vogl's acceleration and the WP
algorithm favors the former in this case. More extensive experiments with
larger Data Bases will be necessary to give a better insight into the problem.
7 References
1. B. E. Boser, E. Sackinger, J. Bromley, Y. Le Cun, and L. D. Jackel, "An
Analog Neural Network Processor with Programmable Topology," IEEE
Journal of Solid-State Circuits, Vol. 26, No. 12, pp. 2017-2025, 1991.
2. D.Baratta, G.M. Bo, D.D. Caviglia, M. Valle, G. Canepa, R. Parenti and C.
Penna, "A Hardware Implementation of Hierarchical Neural Networks for Real
Time Quality Control Systems in Industrial Applications," In Proc. of the
International Conference on Artificial Neural Networks, ICANN'97, pp. 1229-
1234, Lausanne, Switzerland, 1997.
3. G.M. Bo, D.D. Caviglia, and M. Valle, "An Analog VLSI Neural Architecture
for Handwritten Numeric Character Recognition," In Proc. of the International
Conference on Artificial Neural Networks, ICANN'95, Paris, France, 1995.
4. J. Hertz, A. Krogh, and R. G. Palmer, "Introduction to the Theory of Neural
Computation," Addison-Wesley Publishing Company, 1991.
5. M. Valle, D.D. Caviglia and G.M. Bisio, "An Experimental Analog VLSI
Neural Network with On-Chip Back-Propagation Learning," Journal of Analog
Integrated Circuits and Signal Processing, Kluwer Academic Publishers, Vol. 9,
pp. 25-40, 1996, Dordrecht (NL).
6. M. Jabri and B. Flower, "Weight Perturbation: An Optimal Architecture and
Learning Technique for Analog VLSI Feedforward and Recurrent Multilayer
Networks," IEEE Trans. Neural Networks, vol. 3 (1), pp. 154-157, 1992.
1 Introduction
Artificial neural network algorithms are currently applied as valuable tools and
system components in a variety of application domains, e.g. control, prediction,
optimisation, OCR, image and signal processing, computer vision, and pattern
recognition. Application examples include medical imaging, cheque and credit card slip reading, surveillance, identification and access control, image coding, intelligent cruise control tasks, and visual and multisensorial quality control in industrial manufacturing [4], which is the focus of this paper.
However, advanced cognitive models, e.g. selective attention mechanisms, complex feature maps, and feature binding schemes, are out of reach for the majority of applications, because their inherent computational complexity exceeds what cost and hardware constraints permit. Successful state-of-the-art systems therefore typically employ hybrid structures incorporating modules from disciplines such as image processing, artificial intelligence, artificial neural networks, and statistical pattern recognition. Further, to achieve industrial acceptance, such inspection system development tools have to provide transparency, ease-of-use, rapid configuration, and robust classification.
In previous work (cf. e.g. [4]) such a system, denoted QuickCog, has been developed. QuickCog is both a development system and a run-time platform on PC as well as on the industrial PC/104+ standard (MS-Windows 95, 98, NT). Instant deployment of developed inspection systems on these platforms is feasible. The key
From the large number of neural network algorithms, we have picked a few according to their desirable properties with regard to ease-of-use, speed, transparency, and performance. Though backpropagation networks in the hands of an experienced user can be very powerful tools, it is well known that appropriately defining the network topology and learning parameters is not an easy task, and for each application or modification this burden is imposed on the user anew.
Today's situation in industrial manufacturing does not leave any room for such time-consuming processes. Thus, we focused on neural algorithms for powerful nonparametric classification that autonomously find their topology during learning, tailored to the problem, and have no critical learning parameters. One convenient method is the well-known and proven LVQ3 method of Kohonen [1], which adapts a set of reference vectors according to the following update rules for the two nearest neighbors wi(t), wl(t) of a pattern xj:
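The LVQ3 update can be sketched in pure Python following Kohonen's standard formulation [1]; this is an illustrative sketch, not the system's implementation, and the parameter values (alpha, window, eps) and function names are assumptions:

```python
import math

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def lvq3_step(codebook, labels, x, x_label, alpha=0.05, window=0.2, eps=0.1):
    """One LVQ3 update on the two reference vectors nearest to pattern x."""
    order = sorted(range(len(codebook)), key=lambda k: dist(codebook[k], x))
    i, j = order[0], order[1]
    di, dj = dist(codebook[i], x), dist(codebook[j], x)
    # Update only if x falls inside a window around the midplane of w_i, w_j.
    s = (1.0 - window) / (1.0 + window)
    if di == 0.0 or dj == 0.0 or min(di / dj, dj / di) <= s:
        return
    wi, wj = codebook[i], codebook[j]
    if labels[i] == x_label and labels[j] == x_label:
        # Both correct: move both slightly towards x (damping factor eps).
        for w in (wi, wj):
            for d in range(len(w)):
                w[d] += eps * alpha * (x[d] - w[d])
    elif labels[i] == x_label or labels[j] == x_label:
        # One correct, one wrong: attract the correct one, repel the wrong one.
        good, bad = (wi, wj) if labels[i] == x_label else (wj, wi)
        for d in range(len(good)):
            good[d] += alpha * (x[d] - good[d])
            bad[d] -= alpha * (x[d] - bad[d])
```

In nominal operation only reference vectors near class borders are updated, which is why LVQ3 serves well as a fine-tuning step on top of an already reasonable initialisation.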
[Figure: approximation of the class borders by Voronoi tessellation of the reference vectors for a two-class sample distribution; an evaluation measure (separability) is indicated.]
from the original data set to achieve perfect resubstitution. These vectors perfectly serve as an initialisation for LVQ3, which carries out a fine-tuning step to improve the generalisation of the neural classifier 1. For instance, for the Iris data, 4 errors occurred using 10 reference vectors of the RNN; fine tuning with LVQ3 reduced the errors to 2. Further, Radial-Basis-Function-type (RBF) neural networks with similar salient properties are implemented in our system, e.g.
1 In fact, the RNN initialisation in many simpler cases is already sufficient for reliable
classification.
and the generalisation capability of the trained network, the appropriate collection of training, validation, and test data sets is both a crucial and a tedious, error-prone task. The same holds for the organisation and preclassification of larger numbers of patterns, and thus demands efficient and ergonomic support by the development system. Thus, QuickCog offers an interactive tool, the sample set editor, tailored to this problem. Fig. 5 shows the results of QuickCog's interactive multiple ROI selection. The selected ROIs are extracted and assigned to training and test sets. In the regarded case we tentatively defined three classes for the problem, i.e. correct pins (Pin_ok), barely visible bubbles (Weak_Bubbles), and prominently visible bubbles (Strong_Bubbles). This tentative class affiliation can be swiftly changed according to context knowledge and constraints of the regarded production line. Based on the extracted sample sets, a classification
system for training and test was designed. As a first processing step, all pin images were subjected to histogram equalization. For the ensuing preprocessing, segmentation based on multi-level thresholding already provides useful results. This is illustrated in Fig. 7, where only pixels with gray values in [100;160] have a corresponding output pixel set in the segmentation image. Evidently, the amount of bubbles is proportional to the accumulated segmentation area. From the available X-ray images, 48 characteristic pins were extracted and preclassified; 33 were used for training and 15 for testing. Based on the segmentation result and an ensuing masking operation, which eliminates the ring that can be observed in Fig. 7, the bubble regions were isolated, and the mean, standard deviation, and gray-value histograms were computed, concatenated, and normalized to serve as
feature vectors for ensuing classification. Fig. 8 gives insight into the resulting feature space, using a technique basically similar to the SOFM [1] but more convenient for industrial applications [4]. The resulting feature map is sensitive, i.e. from each projection point in the map, the corresponding pin image can be invoked by mouse click from the sample set database. This transparent property considerably alleviates system design and analysis [4]. The WeightWatcher supports interactive navigation in feature space, which is especially useful when feature data with considerably varying density is met. Fig. 9 shows the confusion matrices for the test data using the RNN/LVQ3 neural classifier. The selection principle of the RNN achieves correct resubstitution by default. Generalisation only achieved an 86% recognition rate, but this becomes less worrisome if Fig. 9 is observed. The resulting errors, which are due to two misclassified patterns, are confusions between weak and strong bubbles. For the corresponding pin images, a unique affiliation in preclassification was indeed hard to make. Good pins and potential defect pins, however, have been correctly identified. This application is one of the cases where LVQ fine tuning did not accomplish a better solution. With the PNN neural classifier, the same recognition rate of 86% has been achieved.
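The preprocessing and feature extraction chain described above (multi-level thresholding, masking, then mean / standard deviation / gray-value histogram features) can be sketched as follows; the function names, bin count, and list-of-lists image representation are illustrative assumptions, not QuickCog's implementation:

```python
def segment(image, lo=100, hi=160):
    """Multi-level thresholding: keep pixels whose gray value lies in [lo, hi]."""
    return [[1 if lo <= p <= hi else 0 for p in row] for row in image]

def features(image, mask, bins=8):
    """Mean, standard deviation and gray-value histogram of the masked
    (bubble) region, concatenated and normalised to unit sum."""
    vals = [p for row_i, row_m in zip(image, mask)
            for p, m in zip(row_i, row_m) if m]
    n = len(vals)
    mean = sum(vals) / n
    std = (sum((v - mean) ** 2 for v in vals) / n) ** 0.5
    hist = [0] * bins
    for v in vals:
        hist[min(v * bins // 256, bins - 1)] += 1
    vec = [mean, std] + hist
    total = sum(vec)
    return [v / total for v in vec]
```

The accumulated segmentation area — here the count of mask pixels — is the quantity that the text observes to be proportional to the amount of bubbles.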
4 Summary and Conclusions
In this paper, we applied an innovative inspection system development platform, the also commercially available QuickCog PC system [4], [5], for the first time in the domain of automated quality control in electronics and microelectronics production. Due to the nonparametric properties of the applied neural classifiers, the proposed approach is viable, but we will investigate improved preprocessing techniques and consider larger sample sets. In particular, image segmentation based on recurrent neural networks could help to refine bubble segmentation at the fringes. We will apply a similar approach to other relevant sensor technologies, e.g. ultrasonic inspection for chip and smart card production (cf. Fig. 1).
For future work, one focus of our research will be on the application of neural networks with time-dependent processing for quality data analysis and predictive diagnosis. An ambitious long-term research objective is the realization of a quality control loop to optimize production output. QuickCog will serve as the platform in this research and will be enhanced by the required methods and modules.
References
1. Kohonen, T.: Self-Organization and Associative Memory. Springer-Verlag, Berlin Heidelberg (1989)
2. Elbaum, C., Reilly, D.L., Cooper, L.N.: A Neural Model for Category Learning. Biological Cybernetics 45 (1982) 35
3. Gates, G.W.: The Reduced Nearest Neighbour Rule. IEEE Transactions on Information Theory, vol. IT-18 (1972) 431-433
4. König, A., Eberhardt, M., Wenzel, R.: A Transparent and Flexible Development Environment for Rapid Design of Cognitive Systems. Proc. Int. EUROMICRO Conf., Workshop CI, Västerås, Sweden (1998) 655-662
5. König, A., Eberhardt, M., Wenzel, R.: QuickCog - Cognitive Systems Design Environment. QuickCog home page: http://www.iee.et.tu-dresden.de/~koeniga/QuickCog.html (1999)
Forecasting Financial Time Series
through Intrinsic Dimension Estimation
and Non-linear Data Projection
1. Introduction
These methods are often the only possible ones, but they have serious drawbacks. The first over-estimates the necessary autoregressive order (because it does not take account of non-linear dependencies between the data) and leads to overfitting. The second is very cumbersome to implement and often not very reliable; indeed, the various trainings can be marred by errors caused by the prediction method itself, such as the presence of local minima in the optimization of Multilayer Perceptrons.
The method suggested in this paper tries to overcome these disadvantages. It is described in the second section and then applied to a simple artificial example. In the fourth section, we try to predict the successive fluctuations of the SBF 250 Stock Market Index.
2. Forecasting method
The non-linear autoregressive order can be defined as the optimal number of past values of a time series to use for a good prediction. The autoregressive vector includes these past values. Using a non-linear method to evaluate the autoregressive order should make it possible to take into account the non-linear relations between past values of the series; a traditional linear method to estimate the autoregressive order only takes into account the correlation (linear dependence) between past values.
One can still choose the autoregressive vector in two ways. The first one consists in estimating the optimal autoregressive order n, and looking for the best n past values in the series to use for the prediction [5, 9]. Another possibility is to look for an n-dimensional vector built with non-linear mixtures of the past values of the series, instead of the raw values themselves.
In the following, we will use the second possibility. We will first look for a way to estimate the non-linear autoregressive order, and secondly we will build the autoregressive vector with a projection method.
In order to determine the non-linear autoregressive order, we will use the notion of
"intrinsic" dimension of a set of points. Without going into mathematical details, the
intrinsic dimension of a data set can be defined as the minimum number of coordinates that would be necessary to describe the data without loss of information, if these coordinates were measured on curved axes. For example, the intrinsic dimension of a
set of points forming a string in dimension 2 (or higher) is 1, and the intrinsic
dimension of a set of points forming a non-planar surface in dimension 3 (like the
well-known horseshoe distribution) is 2.
First, we build an autoregressive vector of size m from the last past values of the raw time series. This vector has to be large enough to contain all the information necessary for a good prediction. One possible solution is to take the optimal autoregressive vector for an ARX model [5]; indeed, this one is built so that it contains "sufficient" information when used with a linear prediction method, and will thus obviously contain enough information when used with a non-linear prediction method too. Larger vectors can be taken for more safety, but they would make the subsequent work more difficult. An autoregressive vector is built at each time step; these vectors are laid out as rows in a matrix called the autoregressive matrix.
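Building the autoregressive matrix amounts to a delay embedding of the series; a minimal sketch, with illustrative function names:

```python
def autoregressive_matrix(series, m):
    """One autoregressive vector per time step: each row holds the m values
    preceding one target sample, rows stacked as in the autoregressive matrix."""
    return [series[t - m:t] for t in range(m, len(series))]

def targets(series, m):
    """The sample following each autoregressive vector (the value to predict)."""
    return series[m:]
```

Each row of the matrix is one point of the point set whose intrinsic dimension is estimated next.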
To estimate the fractal dimension of the autoregressive matrix, we use the Grassberger-Procaccia method [4]; many other methods can however be used to estimate a fractal dimension [1, 6, 7]. It must be mentioned that the very concept of non-linear dependency is difficult to define. Therefore, the fractal dimension found by these methods can vary; in difficult situations, it may be worthwhile to use several methods in order to assess their results. The intrinsic dimension can also be a non-integer value; in the following, we will use the integer value nearest to the intrinsic dimension as an approximation of the non-linear autoregressive vector size defined below.
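The Grassberger-Procaccia estimate can be sketched as the slope of the log correlation sum between two radii; the choice of radii is left to the user, and the O(N²) pair loop is kept naive for clarity (requires Python 3.8+ for math.dist):

```python
import math

def correlation_dimension(points, r1, r2):
    """Grassberger-Procaccia sketch: slope of log C(r) between radii r1 < r2,
    where C(r) is the fraction of point pairs closer than r."""
    n = len(points)

    def corr_sum(r):
        count = sum(1 for i in range(n) for j in range(i + 1, n)
                    if math.dist(points[i], points[j]) < r)
        return 2.0 * count / (n * (n - 1))

    c1, c2 = corr_sum(r1), corr_sum(r2)
    return (math.log(c2) - math.log(c1)) / (math.log(r2) - math.log(r1))
```

For points lying on a curve embedded in a higher-dimensional space, the estimate approaches 1, matching the "string" example in the text; in practice one inspects the slope over a range of radii rather than a single pair.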
The set of points defined by the rows of the autoregressive matrix forms a d-surface in an m-dimensional space. If we could unfold this d-surface by projecting the m-dimensional space onto a d-dimensional one, keeping the topology of the initial set, we would obtain a d-dimensional non-linear autoregressive matrix that could be used for further prediction.
After this projection, we obtain the required non-linear autoregressive matrix. Its rows will be used as input vectors to any non-linear forecasting method. In our experiments, we used the standard multi-layer perceptron (MLP) and radial-basis function (RBF) networks as prediction cores.
Obviously, the prediction method could also use the initial m-dimensional autoregressive vectors extracted from the raw series. Nevertheless, it must be remembered that even if neural networks are known to be good candidates (compared to other non-linear interpolators) when dealing with the curse of dimensionality, for a fixed number of training vectors their performance still decreases with the dimension of their input vectors. This is precisely where the interest of our method lies: we expect that the little information lost in the non-linear projection will be largely compensated by the gain of performance in the forecasting itself. This will be illustrated in the examples below.
In order to test the above method, we built a chaotic artificial time series from the following non-linear equation:

x(t+1) = a·x(t−1)² + b·x(t−2)² + e   (1)

Obviously, the non-linear autoregressive order of this time series is 2 (it is generated from 2 past values). Let us note the lack of an x(t) term, as well as the presence of a noise e (about 10% of the maximum value of the series).
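A series of this kind can be generated as sketched below; the recurrence is our reading of Eq. (1), and the coefficients a, b and the noise level are illustrative values chosen so the series stays bounded, not the authors' values:

```python
import random

def generate_series(n, a=0.6, b=0.35, noise=0.05, seed=0):
    """Test series of autoregressive order 2, assuming Eq. (1) reads
    x(t+1) = a*x(t-1)^2 + b*x(t-2)^2 + e, with e uniform noise.
    With a + b + noise <= 1 and seed values in [-1, 1],
    every term stays in [-1, 1]."""
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in range(3)]
    for _ in range(n - 3):
        e = rng.uniform(-noise, noise)
        x.append(a * x[-2] ** 2 + b * x[-3] ** 2 + e)
    return x
```

Feeding such a series to the embedding and dimension-estimation steps above should recover an intrinsic dimension near 2.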
The first step of our method consists in the search for the optimal autoregressive matrix for a linear ARX prediction model.
Figure 2 shows the sum (over 1000 test points) of the quadratic errors obtained when one uses a standard ARX model of increasing size; the x-coordinate of the figure is the autoregressive order.
Fig. 2. Sum of quadratic errors (on 1000 test points) obtained with an ARX model for different
values of the autoregressive order.
To make sure the whole dynamics of the series is captured, we build an initial autoregressive matrix of order 6. The estimation of the fractal dimension of this matrix gives 2.12, which is very close to reality.
The following step of the method is the projection of the set of points (rows of the autoregressive matrix) from R^6 to R^2. Note that in the simulations we added the x(t) term
to the two coordinates found by this projection, in order to improve the results. The final autoregressive vector dimension is thus equal to 3.
We also compared this result to the error obtained with a similar Multi-Layer Perceptron, where the input vector is the set of the p last values from the raw series. Figure 3 shows this error for different values of p. The horizontal line corresponds to the error obtained with our method; we conclude that we obtain (for this example) an error similar to a result obtained by trial and error on several non-linear models, which was the goal of our investigation. This ease of implementation will be valuable when dealing with a "real-size" dataset for which the non-linear autoregressive order is unknown.
Fig. 3. Sum of quadratic errors (on 1000 points) obtained with an MLP network for different values of the autoregressive order. The horizontal line corresponds to the result of the proposed method.
An interesting example of a time series in the field of finance is the SBF 250¹ index. The application of time series forecasting procedures to financial market data is a real
¹ The SBF 250 is one of the reference indexes of the French stock market. As suggested by its name, it is based on a representative sample of 250 individual stocks.
challenge. The efficient market hypothesis (EMH) remains up to now the most generally accepted one in the academic community, while being essentially challenged by practitioners. Under the EMH, one of the classical econometric tools used to model the behavior of stock market prices is the geometric Brownian motion². If it does represent the true generating process of stock returns, the best prediction that we can obtain of the future value is the current one. Results presented in this section must therefore be analyzed with a lot of caution.
To succeed in determining the variations of the SBF 250, other variables that may influence its fluctuations are included as inputs (extrinsic variables). We selected three international stock price indexes (S&P500, Topix and FTSE100, respectively American, Japanese and English), two exchange rates (Dollar/Mark and Dollar/Yen), and two American interest rates (3-month T-Bills and 10-year US Treasury Constant Maturity). We used daily data over 5 years (from 01/06/92 to 01/12/97), to have a significant data set.
The problem considered here is the forecasting of the SBF250 index at time t+l, from
available data at time t.
To capture the relations existing between the French (non-stationary) index and the other chosen variables, a co-integration is necessary. The result of this co-integration is the (stationary) residue of the SBF 250 index, defined by the difference between the true value SBF(t+1) and the approximation given by the model:

R(t) = SBF(t+1) − ŜBF(t+1) = SBF(t+1) − ( s(t) + Σ_{i=1..7} P(t,i)·I(t,i) )   (2)

where I(t,i) (1 ≤ i ≤ 7) are the 7 selected variables at time t.
In the following, we will focus on the forecast of these residues, or more exactly on the forecast of the daily return of these residues. Indeed, for somebody eager to play on the market, it is more useful to forecast the market's fluctuations rather than its level. To predict that the level of the SBF index tomorrow will be close to the level today is trivial. On the contrary, to determine whether the market will rise or fall is much more complex and interesting.

ρ(t) = ( R(t) − R(t−1) ) / R(t−1)   (3)
² dS/S = μ·dt + σ·dz, where dz = ε·√dt and ε ~ N(0,1). S is the stock price, μ is the drift rate per unit of time and σ is the instantaneous volatility.
According to Refenes et al. [12], we will use technical indicators directly computed from the residues:
• P(t), P(t−10), P(t−20), P(t−40): returns;
• P(t)−P(t−5), P(t−5)−P(t−10), P(t−10)−P(t−15), P(t−15)−P(t−20): differences of returns;
• K(20), K(40): oscillators;
• MM(10), MM(50): moving averages;
• MME(10), MME(50): exponential moving averages;
• P−MME(10), P−MME(50): return and moving average differences;
• MME(10)−MME(50): moving average difference.
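The moving-average indicators can be sketched as follows; reading K(n) as a stochastic-style oscillator is our assumption, and the smoothing factor 2/(n+1) for MME(n) is a common convention rather than a detail given in the text:

```python
def mm(series, n):
    """Simple moving average MM(n) over the last n values."""
    return sum(series[-n:]) / n

def mme(series, n):
    """Exponential moving average MME(n), smoothing factor 2/(n+1)."""
    alpha = 2.0 / (n + 1)
    ema = series[0]
    for p in series[1:]:
        ema = alpha * p + (1 - alpha) * ema
    return ema

def oscillator(series, n):
    """Stochastic-style oscillator K(n): position of the last value
    within the range of the last n values (0.5 on a flat window)."""
    window = series[-n:]
    lo, hi = min(window), max(window)
    return (series[-1] - lo) / (hi - lo) if hi > lo else 0.5
```

Applied at each time step, these functions produce one column per indicator; PCA then reduces the resulting feature set to the 11 inputs mentioned below.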
The target variable, whose sign has to be predicted, is a forecast variable over 5 days. This variable has to be predicted using the 11 indicators selected after PCA. The interpolator we used is a Radial-Basis Function (RBF) network with the learning algorithm presented in [13]. The network is trained with 1000 points and tested on 100 other points. Our interest lies in the sign of the prediction only, which is compared to the real sign of the target variable.
The best results we obtained are 60.2% correct sign predictions on the training set, and 48% on the test set. This result is obviously bad: it is worse than a pure random guess on the test set!
On the other hand, if we use the proposed method and estimate the fractal dimension of the data set, we obtain an approximate value of 5. We then use the CCA method to project the 11-dimensional data (after PCA) onto a 5-dimensional space. Thereafter, we use another RBF network to approximate the variable to predict. We obtain 61% correct sign predictions on the training set and 57% on the test set. This result seems to be significantly better than the result that we could get by using a purely naive approach (for example, by always predicting a + sign). A lot of simulation work remains however to be done to validate it (for example, by constructing a bootstrap estimator).
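A bootstrap estimator of the kind suggested can be sketched by resampling the per-pattern hit/miss indicators of the test set; function and variable names are illustrative:

```python
import random

def bootstrap_sign_accuracy(hits, n_boot=2000, seed=0):
    """Bootstrap distribution of the sign-prediction hit rate: resample the
    hit/miss indicators (1 = correct sign, 0 = wrong sign) with replacement
    and return an approximate 95% percentile interval."""
    rng = random.Random(seed)
    n = len(hits)
    rates = sorted(sum(rng.choice(hits) for _ in range(n)) / n
                   for _ in range(n_boot))
    return rates[int(0.025 * n_boot)], rates[int(0.975 * n_boot)]
```

With 100 test points and a 57% hit rate, such an interval is wide (roughly ±10 percentage points), which is exactly why the text calls for further validation before claiming an edge over the naive predictor.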
Still better results were obtained using an MLP instead of an RBF network (more than 62% correct sign predictions on the validation set). Unfortunately, the results obtained with an MLP are difficult to reproduce across different initial conditions, convergence parameters, etc. We prefer to restrict our reported performance to that obtained with an RBF network, because it is much less parameter-dependent.
5. Conclusion
The proposed method for the determination of the best autoregressive vector gives satisfactory results on a financial series. Indeed, the quality of the prediction obtained is comparable to the quality obtained with other methods (slightly higher on a real-world financial time series, and equivalent on an artificial data set). The advantage of our method mainly comes from the systematization of the procedure: there is no need for extensive trial and error in the determination of the variables to use at the input of the predictor and of its parameters. Moreover, the determination of the autoregressive vector is completely independent of the prediction method. Improvements of the proposed method could be sought in alternative ways to estimate the fractal dimension of the series or to project the data in a non-linear way.
The question of the predictability of a series such as the SBF 250 index remains. The results presented in this paper are promising, but could certainly be improved. We must also remember that the prediction of a complex, mostly stochastic time series such as the SBF 250 should be attempted with several prediction methods, in order to cross-validate their results. It must also be noted that the simple fact of being able to forecast, at a certain level of confidence, a financial time series is not in itself sufficient to invalidate the EMH. The problem is to see whether it is possible to exploit the prediction algorithm to obtain abnormal returns, that is to say, returns that take into account the level of risk generated by the trading strategy as well as the associated transaction costs.
References
1. Alligood K. T., Sauer T. D., Yorke J. A.: Chaos: An Introduction to Dynamical Systems. Springer-Verlag, New York (1997) 537-556
2. Box G.E.P., Jenkins G.M.: Time Series Analysis: Forecasting and Control. Holden-Day (1976)
3. Demartines P., Hérault J.: Curvilinear Component Analysis: A Self-Organizing Neural Network for Nonlinear Mapping of Data Sets. IEEE Trans. on Neural Networks 8(1) (1997) 148-154
4. Grassberger P., Procaccia I.: Measuring the Strangeness of Strange Attractors. Physica D 9 (1983) 189-208
5. Ljung L.: System Identification - Theory for the User. Prentice-Hall (1987)
6. Takens F.: On the Numerical Determination of the Dimension of an Attractor. In: Lecture Notes in Mathematics Vol. 1125, Springer-Verlag (1985) 99-106
7. Theiler J.: Statistical Precision of Dimension Estimators. Phys. Rev. A 41 (1990) 3038-3051
8. Weigend A. S., Gershenfeld N. A.: Time Series Prediction: Forecasting the Future and Understanding the Past. Addison-Wesley Publishing Company (1994)
9. Xiangdong He, Haruhiko Asada: A New Method for Identifying Orders of Input-Output Models for Nonlinear Dynamic Systems. In: Proc. of the American Control Conf., San Francisco (CA) (1993) 2520-2523
10. Burgess A.N.: Non-linear Model Identification and Statistical Significance Tests and their Application to Financial Modelling. In: Artificial Neural Networks, Inst. Elect. Eng. Conf., June (1995)
11. Fama E.: Efficient Capital Markets: A Review of Theory and Empirical Work. Journal of Finance XXV No. 2 (1970) 383-417
12. Refenes A. N., Burgess A.N., Bentz Y.: Neural Networks in Financial Engineering: A Study in Methodology. IEEE Transactions on Neural Networks 8(6) (1997) 1222-1267
13. Verleysen M., Hlavackova K.: An Optimized RBF Network for Approximation of Functions. In: Proc. of the European Symposium on Artificial Neural Networks, Brussels (Belgium), April 1994, D facto publications (Brussels)
Parametric Characterization of Hardness Profiles
of Steels with Neuro-Wavelet Networks
Abstract. This work addresses the problem of extracting the Jominy hardness profiles of steels directly from the chemical composition. Wavelet and neural networks provide very interesting results, especially when compared with classical methods. A hierarchical architecture is proposed, with a first network used as a parametric modeler of the Jominy profile, and a second one estimating its parameters from the steel chemical composition. Suitable data preprocessing helps to reduce network size.
1 Introduction
Hardenability is a basic feature of steels: in order to characterize it, manufacturers usually perform the so-called Jominy end-quench test [1], which consists in measuring the hardness along a specimen of a heat-treated steel at predefined positions; the measured values form the Jominy hardness profile.
Hardenability depends on chemical composition in a partially unknown fashion, therefore black-box models have been developed to predict the shape of Jominy profiles directly from the chemical analysis. Most of them are linear, but this limits accuracy, especially when a wide variety of steels is considered.
Neural Networks (NNs) seem to cope well with such a modeling problem, as they are good approximators of strongly non-linear functions. An attempt to apply NNs to predict Jominy profiles was made in [2] using a standard Multi-Layer Perceptron (MLP) with one hidden layer, but there is no reported attempt to use Wavelet Networks (WNs) for the same task.
Unfortunately, most methods based on NNs alone suffer from several caveats. For instance, their initialization and training require a large amount of data, which is seldom easily and rapidly available. In addition, simple NNs may often predict profiles which are not physically plausible, unless very complex networks are used and long training processes are employed. It is therefore mandatory to accurately select the network structure, in order to obtain good performance, to reduce as much as possible the number of free parameters, and consequently to reduce the required size of the training set.
Another drawback of NNs alone is that no information related to physical characteristics of the steel can be extracted from the trained network; this means that NNs can only be used to predict the profiles themselves, but not any other steel characteristic.
This paper presents some more powerful methods based on two combined Neuro-Wavelet Networks (NWNs), where one network provides a parametric model of the Jominy profile, while the second one predicts its parameters as a function of the chemical composition. The extracted parameters do have a strong relationship with the Jominy profile, of which they are a compact representation.
2 Neuro-Wavelet Unification
Radial Wavelet Networks are based on the Wavelet decomposition and use radial Mother Wavelets Ψ(‖X‖) ∈ L²(R^N), suitably dilated and translated. Such networks are based on Radial Wavelons (WAVs), whose model is based on the Euclidean distance between the input vector X and a translation vector E, where each distance component is weighted by a component of a dilation vector T:

y = Ψ( √( Σ_j ((x_j − e_j)/t_j)² ) )   (1)

A function Ψ(·) is admissible as a radial Wavelet only if its Fourier transform satisfies a few constraints not discussed here [5]. A commonly used function is the Mexican hat Ψ(z) = (1 − 2z²)·e^(−z²).
Radial Wavelet Networks, as well as many other neural and fuzzy paradigms, can be viewed in a unified perspective by means of the Weighted Radial Basis Functions (WRBF) [4].
Each layer (array) of WRBF neurons is associated with a set of parameters: an order n ∈ R, defining the neuron's metric (mostly n ∈ {0, 1, 2}), a weight matrix W, a center matrix C, a bias vector θ and an activation function F(z). The mathematical model of a WRBF neuron of order n (or WRBF-n) is:

y_i = F( Σ_j W_ji · D_n(x_j, C_ji) + θ_i )   (2)

where F(z) can be any function (although in most cases monotonic functions or Wavelets or linear or polynomial functions are used) and the distance function D_n(·) is defined as:

D_n(x_j, C_ji) = x_j − C_ji          for n = 0
D_n(x_j, C_ji) = |x_j − C_ji|^n      for n ≠ 0   (3)
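The WRBF-n neuron and its Radial Wavelon special case can be sketched directly from the definitions above; this is a minimal single-neuron sketch in vector rather than matrix form, with illustrative function names:

```python
import math

def mexican_hat(z):
    """Radial Mother Wavelet from the text: Psi(z) = (1 - 2z^2) * exp(-z^2)."""
    return (1.0 - 2.0 * z * z) * math.exp(-z * z)

def wrbf_neuron(x, w, c, bias, n, activation):
    """WRBF-n neuron: y = F( sum_j w_j * D_n(x_j, c_j) + bias )."""
    def d_n(xj, cj):
        # Distance function: signed difference for n = 0, |x_j - c_j|^n otherwise.
        return (xj - cj) if n == 0 else abs(xj - cj) ** n
    z = sum(wj * d_n(xj, cj) for xj, wj, cj in zip(x, w, c)) + bias
    return activation(z)

def wavelon(x, t, e):
    """Radial Wavelon as a WRBF-2 neuron: weights (1/t_j)^2, centers e,
    activation F(z) = Psi(sqrt(z)), with dilations t and translations e."""
    w = [(1.0 / tj) ** 2 for tj in t]
    return wrbf_neuron(x, w, e, 0.0, 2, lambda z: mexican_hat(math.sqrt(z)))
```

With n = 0 and a monotonic activation the same neuron reduces to an MLP-style unit, which is the point of the unification.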
All the NWN paradigms used here have been recast as WRBF networks in order to share common paradigms, methodologies, initialization strategies and learning rules, which is the main advantage of unification. Radial Wavelons are WRBF-2 neurons with W_ji = (1/T_ji)² and C = E (i.e. the matrix made of one translation vector E per neuron), while the activation function comes from the radial Mother Wavelet: F(z) = Ψ(√z). Details on the unification of other neural paradigms can be found in [7, 4].
As far as the initialization of NWNs is concerned, in this work we have used three forms of initialization:
- Fixed initialization: all weights, biases and centers are initialized to a predefined value (or set of values). This has been used for all the networks of the parametric model described in section 4.
- Random initialization: the parameters are initialized to random values (uniform distribution). This has been used for the WRBF-0 networks of the parameter estimator described in section 5.
Fig. 1. A) A few examples of Jominy profiles. B) Eigenvalues of the input data covariance matrix. C) Block diagram of the parametric neuro-wavelet estimator.
M and N are, respectively, the number of samples in the training (or validation) set and the number of network outputs; Y_j^p is the j-th component of the p-th target vector Y^p in the training (or validation) set, while Ŷ_j^p is the corresponding network estimate.
To reduce the size of the NWNs, we tried to reduce as much as possible the number of input variables to the network (without losing significant information) by applying Principal Component Analysis [8] to the vectors N of the training set.
Figure 1.B plots the computed eigenvalues in decreasing order. The eigenvectors associated with the largest eigenvalues span a subspace containing most of the information of the training set. The 6 largest eigenvalues have been retained, as a good compromise between complexity and performance.
The projection of the input data onto the subspace spanned by the corresponding 6 eigenvectors (properly normalized) maintains most of the original information and constitutes a new input vector V ∈ R^6 to be fed into the network. This vector is obtained as V = N·M, where M is a matrix containing as columns the 6 principal eigenvectors.
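The eigenvalue ranking of Figure 1.B can be reproduced for the leading component with plain power iteration; this is a sketch assuming no numerical library (a full PCA would repeat the step with deflation to obtain all 6 eigenvectors):

```python
def power_iteration(cov, iters=200):
    """Leading eigenvalue and unit eigenvector of a symmetric covariance
    matrix (list of lists), by repeated multiplication and normalisation."""
    n = len(cov)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        # w = cov @ v
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
        lam = norm  # ||cov v|| for unit v converges to the top eigenvalue
    return lam, v
```

The projection V = N·M then amounts to taking the dot product of each input vector with the retained eigenvectors.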
The aim of our work was to determine which NWN performs best in modeling the Jominy hardness profiles. At first, we drew some preliminary considerations:
1. In traditional approaches [3, 2], the number of network outputs equals the number of measured points of the Jominy profiles, namely 15. But, from Figure 1.A, it can be observed that Jominy profiles are relatively slowly varying, especially in the initial and final parts. There is usually little difference between two neighboring points, and thus the information conveyed by these values is somewhat redundant. The statistical correlation of adjacent elements of J approaches unity (its average value is 0.93), as does, consequently, the correlation between the weights of adjacent neurons.
2. In traditional approaches, several approximation errors can produce estimates of the Jominy profiles which are physically not plausible (for instance, small local increases instead of a continuous decrease of the hardness along the specimen).
3. The 15 positions where the hardness is measured are not evenly distributed and often differ among manufacturers, therefore Jominy profiles cannot always be compared directly. In addition, even the number of points can vary (for instance, up to 18 or 19 points can be measured).
4. Hardness measurement is often affected by large errors, therefore the Jominy vector J is usually affected by relatively large amounts of noise.
1. The size of the parameter vector P is smaller than that of J (see Table 1); thus
network B is smaller than a network predicting J directly would be, for comparable
accuracy. As a consequence, a smaller training set is sufficient and, at the same
time, a considerable saving in computational time and memory can be achieved
during both training and relaxation (namely, during nominal operation).
2. If network A is properly chosen, P is less sensitive to measurement noise than J
(see section 4.1); therefore steel characterization will be more robust.
3. P can be computed even when some measurements of J are missing.
4. P is almost independent of the number and position of hardness measurements.
5. By properly selecting network A, P can be made representative of the physical
process; therefore it can also be used to classify steel quality (more robustly).
6. Network A can also be used to reduce the effects of measurement noise.
The choice of the best NWN for network A (the parametric model) is by itself not a sim-
ple problem, due to the need to reduce the number of tunable parameters as much as
possible while maintaining good estimation and classification accuracy. We have
tested a set of very small two-layer WRBF networks, with one or two hidden neu-
rons, different activation functions in the hidden layer, and a linear output layer (see
Table 1). Such networks are very easy to train using generalized backpropagation [4].
Jominy curves (Figure 1.A) are monotonically decreasing; if one wishes to ap-
proximate them by means of a WRBF-2 network with either an exponential or a Mexican-
hat activation function (WAV-x, RBF-x), the center vector in the first layer can be
fixed to 0 and need not be trained. Moreover, all WRBF-0 networks (MLP-x) have a null
center vector, as bias and center are somewhat redundant in WRBF-0 networks.
The LIN network in Table 1 takes into account the slight linear trend superimposed
on the nearly-sigmoidal shape of the Jominy profiles; it is similar to an MLP-2
network, but the hidden layer is composed of a linear neuron and a neuron with a
hyperbolic-tangent activation F(z). This network has 5 parameters, as the linear
activation function of the second hidden neuron allows two weights and biases to be merged.
Table 1 (column e_r, "in mm") compares the different models in terms of NSRMSE.
The values given are an average over the whole training plus validation sets (800
different specimens). There is no need to distinguish between training and validation
sets, as each profile is trained independently of the others.
We observed that the estimation error of the Jominy profiles predicted by each network
A has a non-null average, which varies with the distance x. We therefore subtract
this average modelization error (in tabular form) from the output of network A (a-
posteriori correction), as shown in Figure 1.C. This has reduced the modelization error
roughly by a factor of 2, as shown in Table 1 (column e_r, "in mm").
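The a-posteriori correction can be sketched as follows (the measurements and the bias pattern are synthetic stand-ins, not the paper's data):

```python
import numpy as np

# A-posteriori correction sketch: subtract the average modelization error,
# tabulated per measurement distance x, from the network-A output.
measured = np.random.default_rng(1).normal(50.0, 5.0, size=(800, 15))
predicted = measured + np.linspace(1.0, -1.0, 15)   # predictions with an x-dependent bias

# Average error at each of the 15 distances, estimated over the data set ...
avg_error = (predicted - measured).mean(axis=0)     # the tabular correction

# ... and removed from every prediction.
corrected = predicted - avg_error
residual_bias = np.abs((corrected - measured).mean(axis=0)).max()
assert residual_bias < 1e-9
```

The correction table `avg_error` is fixed after training and applied to every subsequent prediction.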
In both cases (with or without correction) the MLP-x, LIN and RBF-2
networks clearly outperform the other networks, thanks to the particular (nearly
sigmoidal) shape of the Jominy profile. This similarity between a sigmoidal function and
the Jominy curves is further enhanced by approximating J_i = f(i) instead of J(x)
(namely, the vector elements as a function of their "index" i ∈ [1, 15] instead of their
distance x). The results of this approximation are listed in Table 1 under the columns
"in pts".
Now, we can choose one of the seven types of network A according to the following
criteria:
1. To have the smallest approximation error. Networks MLP-1, MLP-2, LIN and RBF-
2 are the best in this respect.
2. To have as few parameters as possible. This reduces both the training time for network
A and the size of network B. Networks WAV-1, RBF-1 and MLP-1 are the best in
this respect.
3. To have a set of parameters which is as representative as possible of the physical
process of hardening. The degree of representativeness has been assessed by ana-
lyzing the correlation between pairs of Jominy vectors J and the corresponding
parameter vectors P. Very representative models should have a roughly linear re-
lationship between ΔJ = ||J1 - J2|| and ΔP = ||P1 - P2||, where J1 and
J2 are two Jominy vectors (taken randomly from the data sets), while P1 and
P2 are the corresponding parameter vectors. Figure 2.A plots ΔP versus ΔJ for
Fig. 2. A) Parameter vector distance as a function of Jominy vector distance for networks A, for 1,800
random pairs of specimens. B) σ_P (plain line) and σ_Ĵ (dotted line) versus σ_J (on normalized axes). C)
Comparison of estimated Jominy profiles.
several pairs of specimens ("in pts"; similar plots have been obtained for estimation
"in mm") and for the 4 neural paradigms which provide the best results (WAV-2,
RBF-2 and MLP-2 perform poorly, so their graphs are not reported). The closer
the points are to the main diagonal, the more representative the model is, and the
easier it will be to train network B. The MLP-1 and LIN networks are the best in this
respect.
4. To have a set of parameters which provides the smallest noise sensitivity, namely
the smallest sensitivity of the model to the noise affecting hardness measurements.
Noise sensitivity is assessed by means of simulations, as described below.
Consider a measured Jominy vector J^p. This is associated with a parameter vector
P^p by training. A Jominy vector estimate Ĵ^p is obtained from network A with
the parameters P^p. By definition, Ĵ^p is the best estimate of J^p compatible with
the given model. Namely:

J^p --training--> P^p --model evaluation--> Ĵ^p     (5)

When some noise ΔJ^p ∈ R^15 is added to the original profile, the associated pa-
rameter vector estimated via training is corrupted by an error ΔP^p which increases
the error on the reconstructed profile:

J^p + ΔJ^p --training--> P^p + ΔP^p --model evaluation--> Ĵ^p + ΔĴ^p     (6)

The noise standard deviations are related to each other: σ_Ĵ = sqrt(E(||ΔĴ^p||^2)) and
σ_P = sqrt(E(||ΔP^p||^2)) increase almost linearly with σ_J = sqrt(E(||ΔJ^p||^2)), and σ_Ĵ
is smaller than σ_J.
The noise sensitivity of the parametric model is defined as the average slope of the
curve σ_P = f(σ_J); when it is smaller than one, the estimated profile is less affected
by noise than the original profile. Most networks are good enough (except MLP-2),
yet networks LIN, WAV-1, MLP-1 and EXP-1 are slightly better. Figure 2.B plots
the results obtained by an average over 40 specimens, with 0 ≤ σ_J ≤ 13 HRc¹.
5. To have a model which is as independent as possible of the number and position
of hardness measurements. All the networks "in mm" are appropriate in this
respect.
In the end, we have chosen two networks, namely MLP-1 and LIN, "in mm", because
the increased flexibility given by the dependency on the distance was considered
important.
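The noise-sensitivity test of criterion 4 can be sketched with a toy model (a simple linear fit stands in for network A; all values, noise levels and the Monte-Carlo setup are illustrative assumptions):

```python
import numpy as np

# Toy sketch of the noise-sensitivity test: corrupt a profile with noise of
# increasing standard deviation sigma_J, re-estimate the model parameters,
# and measure how the parameter error sigma_P grows with sigma_J.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 15)
profile = 60.0 - 30.0 * x                        # idealized decreasing profile

p_clean = np.polyfit(x, profile, 1)              # parameters of the clean profile
sigmas_J, sigmas_P = [], []
for s in np.linspace(0.5, 5.0, 10):
    errs = []
    for _ in range(200):                         # Monte-Carlo noise realizations
        noisy = profile + rng.normal(0.0, s, size=15)
        p_noisy = np.polyfit(x, noisy, 1)
        errs.append(np.sum((p_noisy - p_clean) ** 2))
    sigmas_J.append(s)
    sigmas_P.append(np.sqrt(np.mean(errs)))

# Average slope of sigma_P = f(sigma_J): a value below 1 means the parametric
# representation attenuates the measurement noise.
slope = np.polyfit(sigmas_J, sigmas_P, 1)[0]
assert slope > 0
```

The same procedure, with network A in place of the linear fit, yields the slopes plotted in Figure 2.B.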
¹ HRc: Rockwell hardness unit.
We have also outlined a method to obtain performance similar to that of the networks "in
pts", while maintaining the flexibility of the approach "in mm". This method consists
in processing the distance x through an appropriate non-linear function before passing
it to network A. This method is currently under consideration, but no final result is
yet available.
5 Parameter Estimation
are within specified ranges, which are in our case too restrictive. This confirms that,
in practical cases, the neuro-wavelet parametric approach can be a more effective and
reliable alternative to traditional models. Figure 2.C compares a measured Jominy
profile with the corresponding prediction, for the proposed and the linear methods.
Acknowledgments
The authors wish to thank Dr. Qinghua Zhang, Dr. Benedetto Allotta, Dr. Renzo
Valentini and Prof. Giorgio Buttazzo for their fruitful discussions.
This work has been partially supported by the National Research Council project
MADESS-II "Architectures and VLSI devices for embedded Neuro-Fuzzy control".
References
1. "Standard Method for End-Quench Test for Hardenability of Steel," A255, Annual Book of ASTM
Standards, pp. 27-44, ASTM, 1989.
2. W.G. Vermeulen, P.J. van der Wolk, A.P. de Weijer, S. van der Zwaag: "Prediction of Jominy Hardness
Profiles of Steels Using Artificial Neural Networks," Journ. Material Eng. and Performance, Vol. 5,
No. 1, February 1996.
3. D.V. Doane, J.S. Kirkaldy (eds.): "Hardenability Concepts with Applications to Steel," TMS-AIME,
1978.
4. L.M. Reyneri: "Unification of Neural and Wavelet Networks and Fuzzy Systems," to be printed in IEEE
Trans. Neural Networks, 1999.
5. Q. Zhang: "Using Wavelet Network in Non-parametric Estimation," IEEE Trans. Neural Networks,
Vol. 8, No. 2, pp. 227-236, March 1997.
6. I. Daubechies: "Ten Lectures on Wavelets," Society for Industrial and Applied Mathematics, Philadel-
phia, Pennsylvania, 1992.
7. V. Colla, M. Sgarbi, L.M. Reyneri: "A Comparison Between Weighted Radial Basis Functions and
Wavelet Networks," in Proc. of ESANN'98, Bruges, Belgium, 22-24 April 1998, pp. 13-19.
8. W.W. Cooley, P.R. Lohnes: "Multivariate Data Analysis," John Wiley & Sons Inc., USA, 1971.
9. M. Riedmiller: "Advanced Supervised Learning in Multi-layer Perceptrons - From Backpropagation to
Adaptive Learning Algorithms," Int'l Journ. Computers Standards and Interfaces, No. 5, 1994.
10. S. Chen, C.F.N. Cowan, P.M. Grant: "Orthogonal Least Squares Learning Algorithm for Radial Basis
Function Networks," IEEE Trans. Neural Networks, Vol. 2, No. 2, pp. 302-309, March 1991.
11. Q. Zhang, A. Benveniste: "Wavelet Networks," IEEE Trans. Neural Networks, Vol. 3, No. 6, pp. 889-898,
November 1992.
Study of Two ANN Digital Implementations of a Radar Detector
Candidate to an On-Board Satellite Experiment
R. Velazco 1, Ch. Godin 2,4, Ph. Cheynet 1,
S. Torres-Alegre 3, D. Andina 3, M. B. Gordon 2
1 Laboratoire TIMA
46, Av. Félix Viallet, 38031 Grenoble, FRANCE
2 Commissariat à l'Energie Atomique (CEA)
Département de Recherche Fondamentale sur la Matière Condensée
17 Av. des Martyrs, 38054 Grenoble Cedex 9, FRANCE
3 Universidad Politécnica de Madrid
ETS Ingenieros de Telecomunicación, 28040 Madrid, SPAIN
4 Commissariat à l'Energie Atomique - Division d'Applications Militaires
(CEA-DAM) Bruyères-le-Châtel, FRANCE
Abstract
The Microelectronics and Photonics Testbed (MPTB) is a scientific satellite that has
been carrying twenty-four experiments on board in a high-radiation orbit since November 1997. The
first objective of this paper is to summarize one year of flight results, telemetered from
one of its experiments, a digital "neural board" programmed to perform texture
analysis by means of an Artificial Neural Network (ANN). One of the attractive
features of the MPTB neural board is the possibility of re-programming it from the ground.
The second objective of this paper is to present two new ANN architectures, devoted
to radar or sonar detection, intended to be telecommanded to the MPTB neural
board. Their characteristics (performance and potential robustness with respect to
parameter deviations due to the interaction with charged particles) are compared in
order to predict their behavior under radiation.
1. Introduction
It is expected that neural hardware will provide attractive tools for automatic pattern
recognition and data classification. In particular, the application of neural networks
has been considered [1-4] relevant to automatic target recognition, speech
recognition, seismic signal processing and sonar signal processing.
As with most information-processing devices, ANNs may be implemented
in three different ways: (i) software simulation on a general-purpose
computer; (ii) hardware emulation, which mimics the ANN on some particular hardware
architecture, possibly including dedicated processors to accelerate the response; and
(iii) physical implementation, where there is, at least in principle, a one-to-one
correspondence between virtual and physical neurons. Only the last two strategies
can cope with the timing constraints imposed by real-time applications. The main
difference between hardware emulation and physical implementation resides in the
way the network response is obtained. In the latter, the physical implementation
attempts to take advantage of the network's structure and principles, by means of
either a computer with multiprocessing or parallelism capabilities or dedicated
hardware in which the neurons are individually implemented (analog or digital neural
processors). In the former, neuron responses are calculated, sequentially or with some
parallelism, by a processor running a suitable program. The emulation program has
loops that calculate the responses of the network's neurons as a function of the
state of other neurons and/or the input values. Thus, neurons exist only virtually,
corresponding to a piece of program during a particular period of time. This is
turned out to be the most sensitive component. These twin boards are called below
board A (the one with the neural coprocessor) and board B (the one without it). After
a suitable telecommand operation, these two boards may run in parallel two different
ANN versions of the same problem.
[Block diagram: T225 processor and L-Neuro 1.0 neural coprocessor, with 32 KB Hitachi SRAM and 32 KB MHS SRAM connected via the T225 bus.]
Figure 1: Experiment block diagram of MPTB neural board A
input of the ANN on board, which thus has only two hidden layers, is layer 2 of Fig. 2.
[Diagram: a band-pass filter produces the complex envelope components x_c(t) and x_s(t), sampled at kT0 and fed to the neural detector.]
Figure 3: The Neural Detector.
The binary detection problem is reduced to deciding whether an input complex value (the
complex envelope of the input, involving signal and noise) has to be classified as one
of two outputs: 0 (noise) or 1 (noisy signal). The need to process complex signals
with an all-real-coefficient NN requires splitting the input into its real and imaginary
parts (the number of inputs is twice the number of integrated pulses); then, a
threshold T is established at the NN output.
The input r(t) is a band-pass signal, and the complex envelope x(t) = x_c(t) + j·x_s(t) is
sampled every T0 seconds.
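This input/output handling can be sketched as follows (the single-neuron "network", its weights and the threshold value are illustrative stand-ins, not the paper's detector):

```python
import numpy as np

# Sketch: complex envelope samples are split into real and imaginary parts
# (doubling the input count) before being fed to an all-real-coefficient
# network; the scalar network output is then compared against a threshold T.
def detect(envelope_samples, weights, bias, T=0.5):
    # envelope_samples: complex vector x(kT0) = xc + j*xs, one per integrated pulse
    x = np.concatenate([envelope_samples.real, envelope_samples.imag])
    y = 1.0 / (1.0 + np.exp(-(weights @ x + bias)))   # single sigmoid unit
    return 1 if y > T else 0                          # 1: noisy signal, 0: noise

rng = np.random.default_rng(3)
pulses = 8                                            # 8 pulses -> 16 real inputs
w = rng.normal(size=2 * pulses)
sample = rng.normal(size=pulses) + 1j * rng.normal(size=pulses)
decision = detect(sample, w, bias=0.0)
assert decision in (0, 1)
```

In the paper's 16/8/1 MLP the same split feeds 16 inputs through 8 hidden units; the threshold T fixes the operating point (Pfa) of the detector.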
Figure 4: Detection Probability (Pd) vs. Signal-to-Noise Ratio for an MLP of structure 16/8/1
and different Training Signal-to-Noise Ratios. (a) Pfa = 0.01 and (b) Pfa = 0.001.
Aiming to evaluate the SEU-sensitive surface of a digital implementation of the MLP-
based neural detector, we have performed software simulations. The network
performance (detection capability) in the presence of a single bit-flip fault on its
parameters (synaptic weights and neuron offsets) was calculated. These bit-flips
were successively (and exhaustively) injected to obtain all possible neural-detector
mutations due to SEUs, before running a C program emulating the neural detector.
The network parameters were coded as 32-bit floating-point numbers (IEEE 754-
1985 standard), with a sign bit, a one-byte signed exponent and a 23-bit mantissa.
The parameters to be corrupted occupy 4640 bits. We have studied the degradation of the
detector performance for the particular case Pfa = 0.1. We have considered that a
bit-flip is critical when the Pfa increases by more than 5% or when the Pd decreases by more
than 10% for any of the studied values of SNR (between -10 dB and +9 dB). The
results of these simulations can be summarized as follows:
- The synaptic weights of the hidden-layer neurons all have the same SEU-sensitive
bit, which is the sign of the exponent. Its modification is critical: the Pfa grows from
0.1 to 0.5 when this bit is modified. This effect concerns 136 bits. We have also shown
that bit-flips on another bit of the exponent lead to slight modifications of
Pfa (up to a maximum of 0.15). This concerns another 136 bits.
- Corrupting the synaptic weights of the output neuron is more critical. Bit-flips of
practically all of the bits of the exponent lead to a serious loss of detection
performance. There are 98 bits in this situation.
- Bit-flips of only 30 bits of the weights in the hidden layer can be considered
beneficial: their inversion improves the Pd without modifying the Pfa.
- Modifying the remaining 4240 bits has no significant effect on the neural detector
performance.
Obviously, this study takes into account only bit-flips in the memory area used to
store the network parameters. The study of the effects of bit-flips on other memory
regions needs to be done on the final digital implementation. It is also important to
notice that, to study the effects of SEU-like faults on the neural detector, not only the
detection probability must be checked, but also the false-alarm probability. Let
us also remark that, owing to the chosen format, these figures correspond to a worst case.
In the final digital implementation, the network's parameters will be coded as 16-bit
integers, leading to a minimization of the sensitive memory area that avoids the
proliferation of critical bits.
of unit (linear or spherical) included at each growth step. They proceed as follows: a
first unit is trained to separate the patterns of the training set belonging to one class
from the other. If this succeeds, only one neuron suffices, and the algorithm stops.
Otherwise, this unit becomes the first neuron of a hidden layer. New hidden neurons
are successively added and trained to separate (either linearly or spherically) the
remaining errors. After each hidden unit is trained, a (linear) output neuron attempts
to learn the training set. If its training error is lower than the accepted bound, the
algorithm stops and the last trained output neuron is kept. Otherwise the output
neuron is removed and the algorithm goes back to add and train a new hidden unit.
The algorithm used to train the linear discriminant units is Minimerror [19, and
references therein]. Minimerror-S, a generalization of Minimerror, is used for hyper-
spherical discriminations. Both algorithms are based on the minimization of suitable
cost functions that depend on two hyper-parameters called "temperatures". The final
weights minimize the number of errors close to the discriminating surfaces. The
algorithms have three adjustable parameters: the learning rate, the ratio between the
two "temperatures", and an annealing rate.
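The growth procedure described above can be sketched as follows (a plain perceptron rule stands in for Minimerror, and the error-targeting step is a simplification of the actual algorithms; the toy data are illustrative):

```python
import numpy as np

def activate(X, w, b):
    return np.where(X @ w + b > 0, 1, -1)

def train_unit(X, y, epochs=100, lr=0.1):
    """Train one linear threshold unit (perceptron rule); returns (w, b)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, ti in zip(X, y):
            out = 1 if xi @ w + b > 0 else -1
            w = w + lr * (ti - out) * xi
            b = b + lr * (ti - out)
    return w, b

def grow_network(X, y, max_hidden=10):
    """Add hidden units until a linear output unit learns the training set."""
    w, b = train_unit(X, y)
    if np.all(activate(X, w, b) == y):
        return [(w, b)], None                  # one neuron suffices: stop
    hidden = [(w, b)]
    while len(hidden) < max_hidden:
        H = np.column_stack([activate(X, wh, bh) for wh, bh in hidden])
        wo, bo = train_unit(H, y)              # (re)train the output unit
        if np.all(activate(H, wo, bo) == y):
            return hidden, (wo, bo)            # output unit learned: keep it
        errors = activate(H, wo, bo) != y      # otherwise target remaining errors
        hidden.append(train_unit(X, np.where(errors, 1, -1)))
    return hidden, (wo, bo)

# Linearly separable toy set: the first unit alone should solve it.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [-1.0, -1.0]])
y = np.array([-1, 1, 1, -1])
hidden, output = grow_network(X, y)
assert output is None and len(hidden) == 1
```

The spherical variant would replace `activate` by a radial unit of the form given in equation (3) below.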
σ = sign(||x − w|| − w0)     (3)

where the vector x is the input, and the vector w and the scalar w0 are the weights and the
threshold of the radial neuron, respectively. This result means that the training set is
spherically separable. The generalization error, determined with an independent test
set of 10000 patterns, also vanishes. This explains why so many neurons are needed
when the hidden units implement linear discriminations: as shown in Fig. 5, we were
unable to improve the classification performance using NetLines: neither the
recognition error of signals nor the fraction of false alarms decreases upon adding
hidden neurons beyond three. Comparison with the results obtained with BP shows
clearly that, in this task, the performance reached with real-valued neurons is better
than with binary linear units.
As small network implementations use less physical memory area, they should be
less sensitive to radiation effects. Thus we investigated the performance of the single
Figure 6: Pd vs. SNR for different Pfa levels, for the two investigated networks.
errors due to radiation: the Single Event Upset phenomenon responsible for bit-flips.
These neural experiments are re-programmable from the ground. Radar detectors based
upon ANNs have been selected as the next application candidates to run on the MPTB
neural boards. Two different solutions were presented: the first one based on a
Multi-Layer Perceptron trained with Backpropagation, which showed quasi-optimal
performance; the second one an ANN composed of binary neurons trained with
incremental training algorithms. Although the latter network showed worse detection
performance, we expect that the small memory surface needed to implement it should
make it less sensitive to transient errors due to the interaction with the radiation
environment. Thus, its minimal structure should make it more suitable for space
applications. Future work includes the digital implementation of these neural radar
detectors to evaluate both the detection performance and the robustness against
transient perturbations. The telecommand of these neural detectors to the MPTB neural
boards is scheduled for June 1999.
6. References
[1] S.E. Decatur, "Application of neural networks to terrain classification," Proc. Int. Conf. Neural
Networks, pp. 283-288, 1989.
[2] N. Miller, M.W. McKenna, T.C. Lau, "Office of Naval Research Contributions to Neural Networks
and Signal Processing in Oceanic Engineering," IEEE Journal of Oceanic Engineering, Vol. 17, No. 4,
Oct. 1992.
[3] M.W. Roth, "Survey of neural network technology for automatic target recognition," IEEE Trans.
Neural Networks, Vol. 1, No. 1, pp. 28-43, March 1990.
[4] D. Andina and J.L. Sanz-González, "Optimization of a Neural Network Applied to Pulsed Radar
Detection," Proc. of VIII European Signal Processing Conference, EUSIPCO-96, Trieste, Italy,
pp. 851-854, September 1996.
[5] J.D. Muller, P. Cheynet, R. Velazco, "Analysis and improvement of network robustness for on-board
satellite image processing," Int. Conference on Artificial Neural Networks ICANN'97, Lausanne,
Switzerland, 8-10 Oct. 1997.
[6] A. Assoum, N.E. Radi, R. Velazco, F. Elie and R. Ecoffet, "Robustness Against SEU of an Artificial
Neural Network Space Applications," Special Issue IEEE Trans. on Nuclear Science, Vol. 43, No. 3,
Part I, pp. 973-978, June 1996.
[7] R. Velazco, Ph. Cheynet, J-D. Muller, R. Ecoffet, "Artificial Neural Network Robustness for on-board
satellite image processing: Results of SEU simulations and ground tests," IEEE Transactions on
Nuclear Science, Part I, Vol. 44, pp. 2337-2344, 1997.
[8] J.C. Ritter, "Microelectronics and Photonics Test Bed," 20th Annual AAS Guidance and Control
Conference, Breckenridge, Colorado, Feb. 5-9, 1997.
[9] F. Bezerra, D. Benezech, R. Velazco, "Study of the sensitivity of Transputers with respect to SEU
and latchup phenomenons," Proc. of Radiation Effects on Components and Systems (RADECS'95),
Arcachon, 18-23 Sept. 1995.
[10] A. Assoum et al., "Robustness against single event upsets of digital implementations of neural
networks," Proceedings International Conference on Artificial Neural Networks, Session 9, Paris,
October 9-13, 1995.
[11] D. Andina, J.L. Sanz-González, "Quasi-Optimum Detection Results Using a Neural Network,"
Proc. of IEEE Int. Conf. on Neural Networks, ICNN'96, Washington DC, USA, pp. 1929-1932,
June 1996.
[12] W.L. Root, "An Introduction to the Theory of the Detection of Signals in Noise," Proc. of the IEEE,
Vol. 58, pp. 610-622, May 1970.
[13] D. Andina, J.L. Sanz-González and J.A. Jiménez-Pajares, "A Comparison of Criterion Functions for
a Neural Network Applied to Binary Detection," Proc. of Int. Conf. Neural Networks, ICNN, Perth,
Australia, 1995.
[14] D. Andina, J.L. Sanz-González, "Design and Performance Analysis of Neural Binary Detectors,"
Novel Intelligent Automation and Control Systems, Vol. I, pp. 59-78, Germany, 1998.
[15] J.L. Marcum, "A Statistical Theory of Target Detection by Pulsed Radar," IRE Trans. on Information
Theory, Vol. IT-6, No. 2, pp. 59-144, Apr. 1960.
[16] J.M. Torres-Moreno, "Apprentissage et généralisation par des réseaux de neurones: étude de
nouveaux algorithmes constructifs," Ph.D. Thesis, Institut Nat. Polytechnique de Grenoble, 1997.
[17] B. Raffin, M.B. Gordon, "Learning and generalization with Minimerror, a temperature dependent
learning algorithm," Neural Computation 7, pp. 1206-1224, 1995.
[18] J.-M. Torres-Moreno and M. Gordon, "Characterization of the Sonar Signals Benchmark," Neural
Processing Letters 7, pp. 1-4, 1998.
[19] J.M. Torres-Moreno, M.B. Gordon, "Efficient adaptive learning for classification tasks with binary
units," Neural Computation 10, pp. 1017-1040, 1998.
Curvilinear Component Analysis for
High-Dimensional Data Representation:
I. Theoretical Aspects and Practical
Use in the Presence of Noise
Starting from a recall of the theoretical framework, this paper presents the
conditions and the strategy of implementation of CCA, a recent algorithm
for non-linear mapping. Initially developed in a basic form for non-linear
and high-dimensional data sets, the algorithm is here adapted to the general,
and more realistic, case of noisy data. This algorithm, which finds the
manifold (in particular, the intrinsic dimension) of the data, has proved to
be very efficient in the representation of highly folded data structures. We
describe here how it can be tuned to find the average manifold and how
robust the convergence is. A companion paper (this issue) presents various
applications using this property.
E = Σ_{i,j} (Xij − Yij)²     (1)
for example by means of some gradient-descent algorithm. If the dimensions of the
input and output spaces are the same, the cost function E can be made null. It can be
normalised according to the input distribution by:

E_n = Σ_{i,j} (Xij − Yij)² / Σ_{i,j} Xij²
But in the case of non-linear and folded data structures, this cost function is not
really suitable, because it works according to the relative error. The idea here is to
favour the mapping of small distances in the input space with respect to the mapping
of large distances, leading thus (intuitively) to some local topology preservation,
which is the aim. However, in the case of strongly folded data, the unfolding is
difficult or impossible to obtain, simply because when the input is folded, the
extreme points of the input distribution have a small distance Xij, and the algorithm,
which favours such distances, prevents the desired unfolding.
A compromise was proposed with the so-called Sammon mapping [Sammon
(1969)], which gives less importance to the relative error:

E_S = (1 / Σ_{i,j} Xij) · Σ_{i,j} (Xij − Yij)² / Xij
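The difference between the plain quadratic cost and the Sammon weighting can be illustrated on toy distance values (the numbers are arbitrary):

```python
import numpy as np

# Compare the plain quadratic stress and the Sammon stress for a given
# configuration of input/output interpoint distances.
def quadratic_stress(Xd, Yd):
    return np.sum((Xd - Yd) ** 2)

def sammon_stress(Xd, Yd):
    # Small distances weigh more: each term is divided by X_ij,
    # and the sum is normalised by the total input distance.
    return np.sum((Xd - Yd) ** 2 / Xd) / np.sum(Xd)

Xd = np.array([1.0, 2.0, 4.0])      # input interpoint distances X_ij (i < j)
Yd = np.array([1.5, 2.0, 3.5])      # output interpoint distances Y_ij

E  = quadratic_stress(Xd, Yd)       # 0.25 + 0 + 0.25 = 0.5
Es = sammon_stress(Xd, Yd)          # (0.25/1 + 0 + 0.25/4) / 7
assert abs(E - 0.5) < 1e-12
assert abs(Es - 0.3125 / 7) < 1e-12
```

The same absolute error on a small distance (X = 1) contributes four times more to `Es` than the same error on the large distance (X = 4), which is the local-topology bias discussed above.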
The unfolding behaviour, though slightly better, has been shown to fail with
strongly folded data manifolds [Demartines and Hérault (1997)].
In order to circumvent these drawbacks, we have derived a new algorithm for data
representation called "Curvilinear Component Analysis". In this algorithm, the
strategy of processing is as important as the equations themselves. The goal is to find
the dimension of the average manifold of the data and to map it onto a space of lower
dimension (the representation space). In summary, we proceed to a Global Unfolding
followed by a Local Projection onto the average manifold (see Fig. 1).
Hypothesis: the input consists of N_s samples belonging to some theoretically p-
dimensional manifold, embedded in an n-dimensional input space X = {x_ik}, i = 1..N_s,
k = 1..n. However, because of noise, the manifold has some "thickness", and is thus also of
dimension n.
The aim is to find the average manifold and to map it onto a p-dimensional output
space Y. To do this, we use N_n neurons with n-dimensional input weights and p-
dimensional output weights. If the number of samples is low, we use one neuron per
sample: N_n = N_s. If N_s is too large, we first proceed to a vector quantization [Gersho
and Gray (1992)] of the input space, and the number of neurons is equal to the
number of prototypes, N_n < N_s [Demartines (1994)].
Figure I. Principle of the CCA algorithm. The input weights first proceed to a vector
quantization (VQ) of the input data space (X) in n dimensions. Then, the output weights
map the local topology of the input average manifold by projecting it (P) into an output
representation space (Y) of dimension p < n. This way, tasks like classification and
recognition are highly facilitated in an unfolded and lower-dimensional output space.
Then, each neuron i is associated to one input sample (or prototype) and its input
weights are made equal to the components of the sample (or prototype): w_ik = x_ik,
k = 1..n. Unlike in Kohonen's Self-Organising Maps [Kohonen (1989)], the
neurons have no a priori pre-defined neighbourhood, but they have p-dimensional
output weights y_iq, q = 1..p, pointing into the output space Y. They will find their
neighbourhood themselves by adapting their output weights to the local topology of the
input samples.
Let us come back to the basic cost function, without normalisation for the sake of clarity:

E = Σ_{i,j} Eij, with Eij = (Xij − Yij)²     (2)

The input interpoint distances Xij = ||x_i − x_j|| are given, and for every point y_i in
the output space we move the points y_j so that the terms Eij are minimised, for
example by means of a gradient-descent algorithm. In order to map the average
manifold of the data, two cases are to be considered (see Figure 2): first, we need a
global unfolding of the average manifold of the data, and second, we need a local
projection of these data onto their average manifold.
Let us consider the first case alone (Unfolding: Figure 2, top). In order to unfold the
data, only some of the Eij terms in formula 2 need to be minimised: those for which
the distance Yij is smaller than some pre-defined distance λ. Thus, allowing the
matching for only short distances is a way to respect the local topology. It has been
proved that this condition (applied to the output distances) ensures a global unfolding,
much better than other mapping techniques, which apply it to the input distances
[Demartines (1994)]. In this case, the general term to be minimised becomes:

Eij = (Xij − Yij)² Fλ(Yij)     (3)

with Fλ(Yij) = 1 for Yij < λ, and Fλ(Yij) = 0 for Yij > λ.
The choice of λ strongly depends on the data structure (e.g. the curvature of the
average manifold, and the spreading of the data around this manifold). As the data structure is
in most cases unknown, some strategy should be defined in order to find the best
value of λ; see section 4.
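A minimal sketch of this unfolding rule as a stochastic update (the adaptation rate, its schedule, λ and the data are illustrative; this follows the commonly published CCA update, not necessarily the exact implementation of this paper):

```python
import numpy as np

# One stochastic adaptation step of the output weights: only pairs whose
# *output* distance Y_ij is below lambda contribute, per formula (3).
def cca_step(X, Y, lam, alpha):
    n = len(Y)
    i = np.random.default_rng().integers(n)      # pick one reference point
    for j in range(n):
        if j == i:
            continue
        Xij = np.linalg.norm(X[i] - X[j])        # fixed input distance
        Yij = np.linalg.norm(Y[i] - Y[j])        # current output distance
        if 0.0 < Yij < lam:                      # F_lambda(Y_ij) = 1
            # move y_j along the unit vector (y_j - y_i)/Y_ij to shrink
            # the distance error (X_ij - Y_ij)
            Y[j] += alpha * (Xij - Yij) * (Y[j] - Y[i]) / Yij
    return Y

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))                     # input samples, n = 3
Y = rng.normal(size=(50, 2)) * 0.1               # output weights, p = 2
for t in range(500):
    Y = cca_step(X, Y, lam=2.0, alpha=0.5 * (1.0 - t / 500))
```

Each move is bounded by alpha·|Xij − Yij|, and pairs already farther apart than λ in the output space are left untouched, which is what allows the global unfolding.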
We should remark that, apart from the desired global unfolding, there is also some
tendency to make a local projection. Look at the input distribution in Figure 2:
because we ask for the mapping of X14 simultaneously with the mapping of X12, X23
and X34, the resulting compromise will lead to Y12 < X12, Y23 < X23 and Y34 <
X34, which is an approximate projection. This property will be used hereafter.
Figure 2. Illustration of the problem of data representation, in two cases: either only an
unfolding is desired, or only a local projection is desired (see text).
Let us now consider the second case (Projection, figure 2, bottom). This situation
is the opposite of the preceding one: let us suppose that we have already projected the
data onto their average manifold; the interpoint distances X'_ij of the projected data
will locally minimise the quadratic error (X_ij² − X'_ij²). Then, the
output vectors should map this local projection, that is, translated into a cost function
problem, they should minimise:
E^p_ij = (X_ij² − Y_ij²)² (4)
This term should apply only when Y_ij < X_ij, a situation which is initiated by the above-
mentioned tendency to make a local projection. Conversely, when Y_ij > X_ij, we are
in the condition of unfolding. Hence, the two situations (unfolding and projection) do
not overlap, and the global cost function can merge formulae (3) and (4), provided that
the continuity between them is assured at Y_ij = X_ij.
Let us remark that, with such a cost function, there are some degrees of freedom: it
is invariant under transformations such as translations, rotations, or inversions of axes.
This property can be exploited by adding constraints suited to various conveniences
of data representation; refer to the companion paper [Guérin et al. (1999), this issue].
In particular, various constraints may be added, for example:
- smoothness constraints in the case of sparse distance matrices,
- constraints to let the axis of maximum variance be horizontal,
- addition of a term containing the information relative to one given factor in
factorial discriminant analysis,
- choosing one axis to minimise intra-class variance while maximising inter-class
variance in the case of supervised learning.
E_ij = (X_ij − Y_ij)² = (X_ij − √((y_i − y_j)ᵀ(y_i − y_j)))² (5)
and, with respect to the variation dy_j, we have:
The gradient is a vector in the direction of (y_i − y_j); its norm is proportional to the
distance error. The second order differential is:
For E^p_ij, the gradient is again a vector in the direction of (y_i − y_j); its norm is proportional to the
squared projection error and to ||y_i − y_j||. The second order differential is:
As previously, the Hessian matrix is positive definite at the same point Y_ij = X_ij,
which is also a minimum for E^p_ij. For the same reason as previously, the basin of
attraction is quadratic and wide.
In order to have the same cost functions in both cases around Y_ij = X_ij, we need to
normalise them so that their second order derivatives at this point are equal. The
global function to be minimised is then:
E = Σ_{i<j} [ (X_ij − Y_ij)² F_λ(Y_ij) for Y_ij ≥ X_ij ; (X_ij² − Y_ij²)² / (4 X_ij²) for Y_ij < X_ij ] (6)
Figure 3. Typical aspect of the joint distribution dx/dy for a good mapping (see text).
In the case of unfolding, the points lie on the dy>dx side of the first diagonal and, in the case
of projection, they lie on the dy<dx side. A "good" mapping is obtained when there is
an unfolding for large dy values and a projection for small values. Then, the aspect of
the joint distribution dx/dy should be the one of figure 3.
The visual analysis of this dx/dy graph is very useful [Demartines (1992)]:
1. When searching for the (unknown) intrinsic dimension of the input data, we
choose the output dimension by dichotomy: if the distribution lies on the first
diagonal, we can lower the output dimension, and if the distribution becomes
thicker, the output dimension is too small.
2. Once in the right dimension, playing on the minimum value of λ to reduce the
scattering around dx=dy for medium values of dy will improve the quality of
the mapping.
3. Moreover, looking at the maximum of dx near dy=0 gives an idea of the spreading
of the data near the average manifold.
4. In the case of multimodal input data distributions, it can be interesting to
provide one dx/dy representation for each modality [Teissier et al. (1998)].
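The raw material of this diagnostic graph is simply the set of pairs (input distance, output distance) over all point pairs. A minimal sketch (the helper name is our own):

```python
import numpy as np

def dxdy_pairs(X_points, Y_points):
    """Collect (dx, dy) = (input distance, output distance) for every
    unordered pair of samples: the raw material of the dx/dy plot."""
    n = len(X_points)
    iu = np.triu_indices(n, k=1)   # indices of the upper triangle (i < j)
    dx = np.linalg.norm(X_points[:, None, :] - X_points[None, :, :], axis=-1)[iu]
    dy = np.linalg.norm(Y_points[:, None, :] - Y_points[None, :, :], axis=-1)[iu]
    return dx, dy
```

If the resulting cloud hugs the diagonal dx = dy, the output dimension is sufficient; a thickened cloud signals that it is too small, as described in point 1 above.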
As an example of a difficult problem of non-linear mapping, see figure 4. Here we
try to map a 3-D data set of two interlaced rings onto a 2-D representation space. The
problem has no solution, but the CCA algorithm finds the best compromise
satisfying the 2-D constraint: it breaks the two rings so that the local topology is
best preserved.
Figure 4. Mapping of a 3-D data set of two interlaced rings onto a 2-D representation
space, a) input space, b) output space where the two rings are broken in order to satisfy at
best the 2-D representation, c) the dx/dy distribution showing the local projection and the
complexity of the unfolding.
More recently, CCA has been successfully applied to difficult problems of audio-
visual fusion for vowel recognition in a noisy environment [Teissier P. et al. (1998)],
and of nuclear physics for the calibration of detectors [Vigneron V. et al. (1997)].
Some new extensions are given in the companion paper [Guérin-Dugué et al. (1999),
this issue].
Another difficult problem has been approached, that of scene categorisation
from spatial statistics of the energy distribution of an image in various frequency
bands and orientations [Hérault J. et al. (1997)]. An image is analysed by a bank of
spatial filters, according to 4 orientations and 5 frequency bands, ranging from very
low spatial frequencies to medium ones. The global energies of the 20 filters' outputs
constitute the 20-dimensional measure space, and each image is a 20-vector in this
space. By CCA, we have found that a 2-dimensional representation was possible and
that, in this space, the organisation of the data was surprisingly in accordance with
some semantic meaning (see figure 5).
6. References
Borg I. and Groenen P. (1997). Modern Multidimensional Scaling: Theory and
Applications. Springer Series in Statistics.
Cirrincione G., Cirrincione M., Vitale G. (1994). "Diagnosis of Three-Phase Converters
Using the VQP Neural Network". 2nd IFAC Workshop on Computer Software Structures
Integrating AI/KBS Systems in Process Control, Lund, Sweden, 11-13 August 1994, 5
pages.
D'Aubigny G. (1989). L'analyse Multidimensionnelle des Données de Dissimilarités. Thèse
d'état, Université Grenoble I.
Demartines P. (1992). Mesures d'organisation du réseau de Kohonen. In M. Cottrell, editor,
Congrès Satellite du Congrès Européen de Mathématiques: Aspects Théoriques des
Réseaux de Neurones.
Demartines P. (1994). Analyse de données par réseaux de neurones auto-organisés. PhD
thesis, Institut National Polytechnique de Grenoble.
Demartines P. and Hérault J. (1997). Curvilinear Component Analysis: a Self-Organising
Neural Network for Non-Linear Mapping of Data Sets. IEEE Trans. on Neural Networks,
8, 1, 148-154.
Gersho A. and Gray R. M. (1992). Vector Quantization and Signal Compression. Kluwer
Academic Publishers, London.
Guérin-Dugué A., Teissier P., Delso-Gafaro G. and Hérault J. (1999). Curvilinear
Component Analysis for High-dimensional Data Representation: II. Examples of
introducing additional mapping constraints for specific applications. Proceedings of
IWANN'99, Alicante, Spain.
Hérault J., Oliva A., Guérin-Dugué A. (1997). Scene Categorisation by Curvilinear
Component Analysis of Low Frequency Spectra. European Symposium on Artificial
Neural Networks, Bruges, BE.
Kohonen T. (1989). Self-Organisation and Associative Memory. Springer-Verlag, Berlin,
3rd edition.
Kruskal J.B. (1964). Non-metric multidimensional scaling: a numerical method.
Psychometrika, 29:115-129.
Mardia K.V., Kent J.T., and Bibby J.M. (1979). Multivariate Analysis. Academic Press,
London.
Sammon J.W. (1969). A non-linear mapping algorithm for data structure analysis. IEEE
Trans. Computers, C-18(5):401-409.
Shepard R. N. (1962). The analysis of proximities: multidimensional scaling with an
unknown distance function. Psychometrika, 27:125-139.
Siedlecki W., Siedlecka K., and Sklansky J. (1988). An overview of mapping techniques
for exploratory pattern analysis. Pattern Recognition, 21(5):411-429.
Teissier P., Guérin-Dugué A., Schwartz J.L. (1998). Models for Audiovisual Fusion in a
Noisy-Vowel Recognition Task. Journal of VLSI Signal Processing, 20:25-44.
Vigneron V., Maiorov V., Berndt R., Sanz-Ortega J.J. and Schillebeeckx P. (1997). Neural
network application to enrichment measurements with NaI detectors. VCCSR
Proceedings, Vienna, November 1997.
Curvilinear Component Analysis for
High-Dimensional Data Representation:
II. Examples of Additional Mapping
Constraints in Specific Applications
1 Introduction
The companion paper ([Hérault et al. 1999], in this issue) describes a new version of this algorithm adapted to the general
and more realistic case of noisy data. The basic principles are recalled in section 2.
The theoretical constraints will be discussed in section 3 and illustrated with three
specific applications in sections 4, 5 and 6, respectively.
Usually, two kinds of constraints on the output data structure are considered
[d'Aubigny 1988]: (i) constraints on the configuration of the output representation
(section 3.1) and (ii) constraints on the relationships between data (section 3.2). In
the following, we present how these two constraints can be taken into account inside
the CCA framework, from both theoretical and experimental points of view.
2. Add a penalty term to impose a constant distance (radius) between all the input
samples and an additional sample at the centre of the input structure [Borg &
Groenen 1997]. These two terms can be weighted in the cost function.
3. Change the coordinate system from the Cartesian system to the spherical system
and impose a constant radius for all the output samples [Cox & Cox, 1991]. The
output distances are evaluated at the surface of the output sphere. The two free
parameters are the angles for the position in the spherical coordinate system.
From strategies 1 to 3, the spherical constraint on the output representation is more and more
strongly imposed. For example, some perceptual data coming from psychological
experiments fit well with a circular or spherical structure [Drösler 1981, Eckman
1954, Rogowitz et al. 1998].
E = Σ_ij w_ij (X_ij − Y_ij)² , w_ij ≥ 0 (2)
With only short range distances, the global structure of the manifold is not revealed. If we add
more global information by way of some
long range distances, this 3D structure can be unfolded (see figure 1c-d-e-f). For
these experiments, the number of samples for which all the distances are known
increases from 4 to 16. These samples fix the global structure and are called "anchor
points". In this example, a new "anchor point" (selected by vector quantization in
the input database) provides only 0.01% supplementary distances.
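The selection of well-spread "anchor points" by vector quantization can be sketched as follows. This basic k-means quantiser is our own illustrative stand-in; the text does not specify which quantiser the authors used.

```python
import numpy as np

def pick_anchor_points(data, n_anchors=16, n_iter=20, seed=0):
    """Choose spread-out 'anchor points' by a plain k-means vector
    quantisation, then return the index of the sample nearest each
    centroid (anchors must be actual database samples)."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), n_anchors, replace=False)]
    for _ in range(n_iter):
        # assign each sample to its nearest centroid
        d = ((data[:, None] - centroids[None]) ** 2).sum(-1)
        labels = np.argmin(d, axis=1)
        # move each centroid to the mean of its assigned samples
        for k in range(n_anchors):
            members = data[labels == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    d = ((data[:, None] - centroids[None]) ** 2).sum(-1)
    return np.argmin(d, axis=0)   # one sample index per centroid
```

The distances between the returned samples (and from them to their neighbours) are then the only long range entries kept in the sparse distance matrix.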
Fig. 1. (a) Original 3D data set, (b) unfolding from a sparse matrix with only short range
distances (8% of distances are known), (c) unfolding with short range distances and 4 "anchor
points", (d) 8 "anchor points", (e) 12 "anchor points", (f) 16 "anchor points".
In figures 1c-d-e, twists appear in the output data representation: the proportion of
long range distances remains too small. In figure 1f, the structure is completely
unfolded considering only 16 "anchor points" uniformly distributed in the input
database. In sections 5 and 6, we present two applications using this specific
weighting of the distance matrix.
1. Flatten the global structure (fig. 2b) by choosing an output dimension as the
intrinsic dimension of this global manifold (here 1). The intrinsic structure of
each cluster is then lost.
2. From this organization, process a second CCA with a supplementary dimension,
up to the intrinsic dimension of each cluster (here 2 and then 3). For each new
stage, the initialization step keeps the configuration of the final previous step for
the first dimensions and initializes the new dimension at random. Figure 2c
illustrates this process in two dimensions: the clusters are circular and the global
manifold is flattened. For the third CCA in three dimensions, the clusters are
spherical on the same flattened global manifold.
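The initialization rule of step 2 — keep the previous configuration on the first dimensions and start the new dimension at random — is a one-liner; in this sketch, the small scale of the random new coordinate is our own assumption:

```python
import numpy as np

def add_dimension(Y_prev, scale=0.01, seed=0):
    """Initialise the next CCA stage: keep the configuration of the final
    previous step on the first dimensions, and initialise the new
    dimension at small random values."""
    rng = np.random.default_rng(seed)
    new_col = scale * rng.standard_normal((Y_prev.shape[0], 1))
    return np.hstack([Y_prev, new_col])
```

Running CCA again on the augmented coordinates lets the clusters inflate into the new dimension while the flattened global arrangement is preserved.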
Through the joint distribution dx/dy after the three CCAs, we globally observe the
same behavior: a distribution (fig. 2d) in eight packets (first packet: within-class
distances; other seven packets: between-class distances) and an unfolding process for
long range input distances. But the distribution of the within-class distances shows
differences. By CCA in 1 dimension, projection mainly occurs on the within-class
distances (fig. 2e). By CCA in 1 then 2 dimensions, both behaviors (projection
and unfolding) coexist (fig. 2f). Finally, by CCA in 1 then 2 then 3 dimensions,
the matching is almost complete (fig. 2g). An illustration of this procedure is given in
section 5 on a real application in speech recognition.
In order to illustrate the spherical representation, let us consider the distances (as
the crow flies) between towns all around the world. A flat representation on a plane is
not convenient for this database. Figure 3a-b illustrates the result of the CCA
Fig. 3. CCA in 2 dimensions from the distances between towns all around the world: (a)
dx/dy distribution, (b) town positions on the output plane.
Here, the true positions of the towns are known, so the quality of the
representation is estimated by the residue of a Procrustes rotation [in Borg &
Groenen 1997] fitting this representation to the true one. This mean
squared error is 12 and falls down to 3·10⁻⁴ with the constraint of a spherical
output space implemented with strategy 3 (see section 3.1).
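The Procrustes residue used here as a quality measure can be computed with a standard orthogonal-Procrustes alignment via the SVD. This sketch handles rotation only, translation being removed by centring; scaling is not compensated:

```python
import numpy as np

def procrustes_residue(Y, truth):
    """Mean squared error after the best rotation aligning the output
    configuration Y with the known true positions (orthogonal
    Procrustes: the optimal rotation is U V^T from the SVD of Yc^T Tc)."""
    Yc = Y - Y.mean(axis=0)         # remove translation
    Tc = truth - truth.mean(axis=0)
    U, _, Vt = np.linalg.svd(Yc.T @ Tc)
    aligned = Yc @ (U @ Vt)          # apply the optimal rotation
    return float(np.mean((aligned - Tc) ** 2))
```

A residue near zero means the output differs from the truth only by a rigid motion, exactly the degrees of freedom the CCA cost function leaves unconstrained.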
The data set is composed of 100 repetitions of each of the 10 French oral vowels
[a, i, e, ɛ, u, o, ɔ, y, ø, œ] pronounced in isolation by a single speaker. Noisy
acoustic signals were obtained by adding various amounts of white Gaussian noise to
the temporal stimuli (24 dB, 12 dB, 6 dB, 0 dB, -6 dB, -12 dB and -24 dB). Acoustic
data are the normalized spectral components of the speech signal in 20 frequency
bands. The acoustic observations thus lie in a 20-dimensional space. But it is well
known in phonetics that vowels can be well represented in a 2D triangular shape,
called the "vowel triangle", organized along the first two formants (fig. 4a).
Moreover, with noise, the vocalic triangular shape is distorted according to a
progressive shrinking of the convex shape of the vowel clusters.
Fig. 4. (a) The vowel triangle in the (F1, F2) formant plane; (b) CCA representation of the
audio data; (c) constrained CCA representation through the noise levels.
In this application, the unfolding process concerns two data structures which are
linked together. The first one is the organization of the vowel structure at each noise
level, and the second one is the evolution of this organization through the noise.
This evolution is seen as a trajectory for each cluster which depends on the
interaction between each vowel and the noise. In [Teissier et al. 1998], we have
shown that CCA first reveals the intrinsic audio data structure (fig. 4b), and
second can be constrained to unfold the trajectory of each cluster disturbed by
noise (fig. 4c). This is done by combining four constraints:
1/ Supervised data representation: data are sequentially presented to the network
from the level "without noise" to the most noisy level (-24 dB).
2/ Output space configuration: two dimensions are enough to unfold the level
"without noise" (fig. 4a-b). For the following levels, a third dimension is added in
order to capture this new degree of freedom (see section 3.2.2).
3/ Initialization: the initialization of the output samples for a given level is set
from the coordinates of the output samples of the immediately inferior level, by
adding a positive random offset on the third coordinate. Then, for a new level i, the
initialization state is the final state of the level i-1 (see section 3.2.2).
4/ Sparse distance matrix: with this sequential process (level 0 "without noise",
level 1, ..., level i, ...), the number of output samples increases for each level, and
consequently so does the dimension of the input and the output distance matrix. These
two matrices X_ij and Y_ij are not full (see section 3.2.1): only distances between
samples inside the same level of noise are known, and also distances with samples
inside the immediately inferior level. These matrices are then structured by blocks.
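The block structure of these distance matrices can be expressed as a boolean mask of the known entries. Keeping the mask symmetric (each level also sees the immediately superior one) is our reading of the text:

```python
import numpy as np

def block_distance_mask(levels):
    """Boolean mask of the known entries of the distance matrix:
    a distance is kept only between samples of the same noise level
    or of immediately adjacent levels, giving a block-structured
    sparse matrix."""
    levels = np.asarray(levels)
    return np.abs(levels[:, None] - levels[None, :]) <= 1
```

Applied to the weighted cost of equation (2), the mask simply sets w_ij = 0 wherever the distance is unknown.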
In this section, we illustrate this preprocessing on a recognition task of the ten
vowels with several levels of noise. Two preprocessing stages and two classifiers are
tested (Principal Component Analysis in 3D -fig. 5a-, and constrained CCA in 3D -
fig. 4c-; a simple Gaussian classifier -legend 'SG', fig. 5b-, and a mixture of Gaussians
classifier -legend 'MG', fig. 5b-). The regularization of the cluster trajectories in the
noise allows the use of a simpler classifier, as illustrated in figure 5b.
Fig. 5. (a) Output representation after PCA (3D), (b) cross comparison between two classifiers
(simple Gaussian classifier -SG- and mixture of Gaussians classifier -MG-) and two
preprocessings (constrained CCA 3D and PCA 3D).
The link with multidimensional scaling is now evident: for flattening, we use a
"flat" Euclidean geometry and, for unfolding, a "curved" geometry (see section 3.1).
In both cases, the distance matrices are sparse (local mapping). Furthermore, the
introduction of the discontinuities necessary for the flattening process is simply
realized by not considering the associated input and output distances.
In the CCA framework, we show that the cortical flattening representation can be
obtained with a very sparse input distance matrix:
1. Local distances are considered to preserve the local topology.
2. Some long range distances evaluated at the cortical surface are
considered to take into account the global structure ("anchor points")
and to avoid the "twists" phenomenon, as explained in section 3.2.1.
For this example, the number of nodes is 4203 (approximately 40 cm²). This flattened
representation is obtained with a sparse distance matrix (mean number of local
distances per sample: 32; number of "anchor points": 16) in 200 iterations.
Conclusion
References
Vicenç Parisi Baradad¹, Hussein Yahia², Jordi Font³, Isabelle Herlin², and
Emili Garcia-Ladona³
1 Introduction
The study of oceanographic phenomena like vortices, dipole rings and fronts in-
volves processing of sequences of remotely sensed images, like the Sea Surface
Temperature images obtained with the AVHRR sensor [7], to analyze the dy-
namics using mathematical spatial modelling and estimation of the motion field
to model the temporal evolution [5].
The first attempts at computing image motion in oceanographic images [4],
to get sea surface currents, consist in identifying the maximum cross correla-
tion (MCC) between extracted windows in consecutive images. As pointed out
in [3], this gives poor results in zones of high rotation. They propose to solve this
insensitivity to rotation by formulating the problem as a correspondence be-
tween selected image tokens located in consecutive images. This correspondence
is found using a Hopfield neural network that minimizes a cost function which
quantifies the differences between the tokens.
Each one of the tokens in an image has to be described in a manner that makes
it possible to compare it with all the tokens in successive images. This description
is made through features that quantify its characteristics (area, eccentricity,
curvature, ...). They characterize the token locally when they result from an
analysis of a small region around it, or globally, as the region of analysis grows.
In fact, token selection is made after an analysis of the image looking for
those regions which present conservative features. Thus, when object occlusion is
probable, local features are preferred, as the number of tokens to track will be
higher.
Fig. 1. Geometric construction used for the estimation of the features of a corner
Côte [3] looks for tokens in contours of strong spatial gradients of temper-
ature. They select as tokens those points of the gradient with highly curved
shapes. Selection of these tokens avoids the aperture problem, which refers to the
ambiguity in interpreting the translation of a point without salient characteris-
tics, located on a moving edge seen through an aperture.
Though these tokens make the method robust against the aperture problem,
and it performs well when dealing with cloudy images that can occlude parts of
the contours, it meets the problem cited above: the velocity field computed is
very sparse, so, in order to obtain a denser field, it is necessary to look for
global descriptors which include information to better distinguish those close
points that are very similar locally.
It is proposed to characterize each point on the contours using the segments
at each side of the point, analyzing the geometric characteristics of this corner
when the contour is approximated at different scales. The lower scales will take
into account local information and, as the scale rises, more global information is
used.
This representation obtains, for each point of the contour, the features of the
corner formed by the point and the two segments of the contour leaving it in
opposite directions when approximated at different scales.
At the lowest scale the closest points correspond to the neighbouring pixels
of each point, and at higher scales they correspond to an approximation of the
evolution of the contour.
The features used will be the local position (x, y) of each point on the contour,
the angle φ between segments and the orientation θ of BK, appearing in Fig. 1,
at each scale of representation.
The positions of the most similar pairs of tokens in successive images indicate
the apparent motion. Let p_i be the i-th token in an image and p_j the j-th token
in the successive image; their associated features are f_{i,k} and f_{j,k} respectively,
where the index k varies from 1 to N_f. The difference between tokens is
diff(p_i, p_j) = Σ_{k=1}^{N_f} (f_{i,k} − f_{j,k}) (1)
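Equation (1) is a direct feature-wise comparison of two tokens. In the sketch below we sum absolute differences so that deviations of opposite sign cannot cancel — an assumption on our part, since the printed formula shows signed differences:

```python
def token_difference(features_i, features_j):
    """Dissimilarity between two tokens as the summed differences over
    their N_f features (cf. equation 1). Absolute values are used here
    (our assumption) so that positive and negative feature deviations
    cannot cancel each other."""
    return sum(abs(a - b) for a, b in zip(features_i, features_j))
```

The cost function minimized by the network then accumulates these token differences over every candidate pairing between the two images.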
The Hopfield neural network is useful for solving combinatorial optimization
problems when the initial state of the network is adequately chosen in order to
arrive at a global minimum [6], but these initial settings are difficult to achieve
when the number of variables in the function to minimize grows.
To overcome this drawback, it is proposed to use the simulated annealing scheme,
which allows convergence to a global minimum regardless of the initial
state of the neurons, thanks to the possibility of increasing the energy of the network
when it arrives at a local minimum.
This paper is organized as follows: section 2 gives the basics of the scale space
approximation using multiresolution analysis; in section 3 the principles of
multiresolution analysis are applied to find features of the points in discrete
curves. The cost function that represents the correspondence problem and its
minimization using simulated annealing appear in section 4. Section 5 shows an
exemplification of the proposed method using curves that have been displaced,
rotated and deformed. Finally, the conclusions and future work appear in section
6.
2 Multiresolution analysis
V_{j-1} = V_j ⊕ W_j (4)
and there exists a function ψ(x) ∈ L²(R), called the wavelet, that can generate
an orthonormal base of W_j. The basis functions are ψ_{j,n}(x) and are constructed
using the relation
ψ_{j,n}(x) = 2^{-j/2} ψ(2^{-j} x − n), n ∈ Z (5)
The pass from the approximation of a function f at the scale j to the scale j−1
can be done with the relation
As the pixels are on a grid, x_ij(s), y_ij(s) and their evolutions along the contour
can be described using piecewise constant slope curves, with segment lengths
l ∈ {1, √2} and slopes m ∈ {0, ±√2/2, ±1} (11),
and the evolution of x and y over all the pixels, from {x_ij, y_ij} to both
extremes of the contour, is expressed as a sum of such elementary segments:
one sum for the segment between {x_ij, y_ij} and one extreme of the contour, and
one for the segment between {x_ij, y_ij} and the other extreme, with
l_k, l_n ∈ {1, √2} and m_k ∈ {0, ±√2/2, ±1}.
So the approximations are made up of the scaled basis functions, and we will
get the same kind of geometric configuration if the scaling function φ_{m,n} has
the same form as these evolutions. As they are piecewise constant slope
functions, Haar's scaling function is chosen.
As the only accepted transitions for the neurones are those which decrease the
energy, the initial state of the network has to be set carefully in order to arrive
at a good solution and not be trapped in local minima.
Simulated annealing [1] permits escape from these local minima and
arrival at a high quality solution without depending on the choice of the initial
state.
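A generic simulated-annealing loop of the kind invoked here can be sketched as follows; the geometric cooling schedule and all parameter values are illustrative, not those of [1] or [2]:

```python
import math
import random

def anneal(energy, neighbour, state, t0=1.0, cooling=0.95, steps=500, seed=0):
    """Generic simulated-annealing loop: downhill moves are always
    accepted, and uphill moves are accepted with probability
    exp(-dE / T), which is what lets the search escape local minima."""
    rng = random.Random(seed)
    t = t0
    e = energy(state)
    best, best_e = state, e
    for _ in range(steps):
        cand = neighbour(state, rng)        # propose a random transition
        de = energy(cand) - e
        if de < 0 or rng.random() < math.exp(-de / t):
            state, e = cand, e + de         # accept (possibly uphill) move
            if e < best_e:
                best, best_e = state, e
        t *= cooling                        # geometric cooling schedule
    return best, best_e
```

For the correspondence problem, `state` would encode a candidate token pairing and `energy` the cost function of section 4; here the loop is exercised on a simple scalar function.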
5 Results
To exemplify this method, this section shows the analysis of two images which
consist of three curves. The features at each pixel of the curves are computed us-
ing different scales. In Fig. 2 and Fig. 3 appear the features found for two adjacent
pixels; one can appreciate the capability provided by this characteriza-
tion to distinguish neighbouring pixels in cases where the curves have no especially
salient local features.
The correspondence between the pixels, found by the simulated annealing, is
shown in Fig. 4. Note that the method fails at the pixels located at the extremes
of the curves; this is due to the fact that each pixel is analyzed at increasing
scales, until they involve the closest extreme of the curve to the pixel, so in the
case of pixels located at the extremes, the only information used is their position.
6 Conclusions
Fig. 2. Features of a point in a curve when this is approximated at different scales
Fig. 3. Features of a point in a curve when this is approximated at different scales
7 Acknowledgments
This work has been undertaken in the framework of the Mediterranean Tar-
geted Project (MTP phase II-MATER). We acknowledge the support of the
European Commission's Marine Science and Technology Programme (MAST
III) under contract MAS-CT96-0051.
References
1. Aarts, E., Korst, J.: Simulated Annealing and Boltzmann Machines: a Stochastic
Approach to Combinatorial Optimization and Neural Computing. John Wiley & Sons
(1989)
2. Aarts, E., Van Laarhoven, P.: A new polynomial time cooling schedule. Proc. IEEE
Int. Conf. on Computer Aided Design, Santa Clara (1985) 206-208
3. Côte, S., Tatnall, A.R.L.: Estimation of ocean surface currents from satellite im-
agery using a Hopfield neural network. Third Thematic Conference on Remote
Sensing for Marine and Coastal Environments, Seattle (1995) 538-548
4. Emery, W.J.: An objective method for computing advective surface velocities from
sequential infrared satellite images. Journal of Geophysical Research, 91 (1986)
12865-12878
5. Herlin, I.L., Cohen, I., Bouzidi, S.: Image processing for sequences of oceanographic
images. J. Visualization and Computer Animation 7 (1996) 169-176
Fig. 1. Left: the unknown Wiener system, where s(t) is filtered by h and then distorted by f
to give e(t). Right: the inversion structure, where e(t) is distorted by g and then filtered by w
to give y(t).
(Figure 1, left). This class of nonlinear systems, also known as Wiener systems,
is not only another nice and mathematically attractive model, but also a model
found in various areas, such as biology (study of the visual system [4], relation
between muscle length and tension [6]), industry (description of a distillation
plant), sociology and psychology; see also [7] and the references therein. Despite
its interest, to our knowledge, no blind procedure exists for the identification of
such systems.
We suppose that the input of the system S = {s(t)} is an unknown non-
Gaussian independent and identically distributed (iid) process, and that both
subsystems h, f are unknown and invertible. We are concerned with the restitution
of s(t) by observing only the output of the system. This implies that we will
blindly design an inverse structure g, w (Figure 1, right). The nonlinear part g
is concerned with the compensation of the distortion f without access to its input,
while the linear part w is a linear deconvolution filter.
The following notation will be adopted throughout the paper. For each process
Z = {z(t)}, z denotes a vector of infinite dimension, whose t-th entry is z(t).
Following this notation, the input-output transfer can be written as:
e = f(Hs) (1)
where H, with entries [H]_{t,τ} = h(t − τ),
denotes a square Toeplitz matrix of infinite dimension and represents the action
of the filter h on s(t). This matrix is nonsingular provided that the filter h is
invertible.
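A finite section of such a Toeplitz matrix acts on a signal exactly as the convolution by h, which the following sketch verifies numerically (the helper name is ours):

```python
import numpy as np

def filter_matrix(h, n):
    """Finite n x n section of the Toeplitz matrix H of a causal FIR
    filter h: row t holds the coefficients so that
    (H s)(t) = sum_k h(k) s(t - k), i.e. H acts on s as the
    convolution by h."""
    H = np.zeros((n, n))
    for t in range(n):
        for k, hk in enumerate(h):
            if 0 <= t - k < n:
                H[t, t - k] = hk
    return H
```

Every row of H contains the same coefficients shifted by one position, which is precisely the "at least two nonzero entries per row or per column" structure required by the separability argument, except when h is a pure delay.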
One can recognise in equation (1) the postnonlinear (PNL) model [12]. How-
ever, this model has been studied only in the finite dimensional case, in which it
has been shown that, under mild conditions, the system is separable provided
that the input s has independent components, and that the matrix H has at least
two nonzero entries per row or per column.
We conjecture that this remains true in the infinite dimensional case. Here,
the first separability condition is fulfilled since s has independent components
due to the iid assumption. Moreover, due to the particular structure of the matrix H,
the second condition of separability will always hold except if h is proportional
to a pure delay.
The output of the inversion structure can be written in the same way as
(1):
y = Wx (3)
with x(t) = g(e(t)). Following [12], to invert such a system, the inverse system
g, w is estimated by minimizing the output mutual information.
The mutual information of a random vector z of dimension n is defined by:
I(z) = Σ_{i=1}^{n} H(z_i) − H(z) (4)
The entropy rate of a stationary stochastic process Z is defined by:
H(Z) = lim_{T→∞} (1/(2T+1)) H(z(−T), …, z(T)) (5)
when the limit exists. Theorem 4.2.1 of [3] states that this limit exists for a
stationary stochastic process. We shall then define the mutual information rate of a
stationary stochastic process by:
(y(−T), …, y(T))ᵀ = W_T (x_T + v_T) (8)
where W_T is the finite (2T+1)-dimensional section of the filter matrix of w, and v_T is
a random vector which contains the remaining terms corresponding
to the convolution truncation,
since, as T → ∞, x(t) + v(t) converges to x(t) in the mean square sense. The first term
of this last equation is:
H(X) = lim_{T→∞} (1/(2T+1)) [ H(e(−T), …, e(T)) + Σ_{t=−T}^{T} E[log g′(e(t))] ]
= H(E) + E[log g′(e(τ))] (12)
I(Y) = H(y(τ)) − (1/2π) ∫₀^{2π} log | Σ_{t=−∞}^{+∞} w(t) e^{−jtθ} | dθ − E[log g′(e(τ))] − H(E)
(13)
where ψ_y is the score function of the output,
ψ_y(u) = p′_y(u) / p_y(u) (14)
The gradient of I(Y) with respect to the filter coefficients is:
∂I(Y)/∂w(t) = −E[x(τ − t) ψ_y(y(τ))] − w̃(−t) (16)
where w̃ collects the terms coming from the derivative of the log-modulus integral in (13).
w → w + ε ∗ w (17)
(19)
(20)
g → g + ε ∘ g (21)
¹ Small enough to ensure the validity of the first order variation approximation.
660
then:
It suffices to take ℜ{Q(u)} > 0 to ensure this condition. Based on the gradient
descent, the algorithm then writes as:
g ← g + Q{∂*J} ∘ g (31)
4 Practical issues
It is clear that (20) and (31) are unusable in practice. This section is concerned
with adapting these algorithms to an actual situation. We consider then a finite
discrete sample E = {e(1), e(2), …, e(T)}. The first question of interest is the
estimation of the quantities involved in equations (20) and (31). We assume
that we have already computed the output of the inversion system, i.e. X =
{x(1), x(2), …, x(T)} and Y = {y(1), y(2), …, y(T)}.
The output density is estimated by a kernel estimator:
p̂_y(u) = (1/(Th)) Σ_{t=1}^{T} K((u − y(t))/h) (32)
and the score function by:
ψ̂_y(u) = p̂′_y(u) / p̂_y(u) (33)
The cross-moment γ_{y,ψ(y)}(t) is estimated by:
γ̂_{y,ψ(y)}(t) = (1/T) Σ_{τ=1}^{T} y(τ − t) ψ_y(y(τ)) (34)
assuming ergodicity. Since γ_{y,ψ(y)}(0) = −1, γ̂_{y,ψ(y)}(0) may be set to −1 without
computing it.
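Equations (32)-(33) can be sketched with a Gaussian kernel; the bandwidth value is an arbitrary illustration, and the kernel's normalising constant is dropped since it cancels in the ratio p̂′/p̂:

```python
import numpy as np

def score_estimate(sample, u, bandwidth=0.3):
    """Kernel estimate of the score psi(u) = p'(u) / p(u) from a finite
    sample, using an (unnormalised) Gaussian kernel: the common
    normalising constant cancels in the ratio."""
    z = (u - sample) / bandwidth
    w = np.exp(-0.5 * z ** 2)            # kernel values K(z)
    p = w.sum()                          # density at u, up to a constant
    dp = -(z * w).sum() / bandwidth      # derivative, same constant
    return dp / p
```

For an approximately Gaussian output the estimate behaves like ψ(u) ≈ −u, the score of the normal density (slightly shrunk by the kernel smoothing).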
Nonlinear subsystem parametrisation and estimation: no parametrisation
of g is used. One may ask the intriguing question "How would I compute
the output of the nonlinear subsystem without g?". In fact, applying equation
(31) to the t-th element of the sample E, and using x(t) = g(e(t)), one gets:
This equation then computes the output of g without requiring a particular
form of this function. A possible choice of Q is:
Q(u) = −u if u ≥ 0, Q(u) = 0 otherwise (37)
5 Experimental results
To test the previous algorithm, we simulate a hard situation. The iid input sequence s(t), shown in Figure 3, is generated by applying a cubic distortion to an iid Gaussian sequence. The filter h is FIR, with the coefficients:

h = [0.826, −0.165, 0.851, 0.163, 0.810]
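This simulation setup can be reproduced in a few lines (a sketch; the saturation standing in for the memoryless distortion is an illustrative assumption, since the exact nonlinearity is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(1)

# iid non-Gaussian input: cubic distortion of an iid Gaussian sequence
s = rng.standard_normal(1000) ** 3

# FIR filter h with the coefficients quoted in the text
h = np.array([0.826, -0.165, 0.851, 0.163, 0.810])
filtered = np.convolve(s, h)[: len(s)]

# hard memoryless distortion -- an illustrative saturation standing in
# for the (unspecified) nonlinearity of the experiment
e = np.tanh(2.0 * filtered)   # observed Wiener-system output
```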
Its frequency response is shown in Figure 2. The nonlinear distortion is a hard one.

Fig. 2. Frequency response (magnitude and phase) of the filter h versus normalized frequency (Nyquist == 1).
Fig. 3. From left to right: Original input sequence s(t), Observed sequence e(t), Restored sequence y(t).
The algorithm was provided with a sample of size T = 1000. The size of the impulse response of w was set to 51. Estimation results, shown in Figures 3, 4 and 5, demonstrate the good behaviour of the proposed algorithm. The phase of the filter w, Figure 4, is composed of a linear part, which corresponds to an arbitrary uncontrolled but constant delay, and of a nonlinear part, which compensates the phase of h.
Fig. 4. Frequency response (magnitude and phase) of the estimated filter w versus normalized frequency (Nyquist == 1).
In this paper a blind procedure for the inversion of a nonlinear Wiener system
was proposed. This procedure is based on a relative gradient descent of the
mutual information rate of the inversion system output.
One may notice that some quantities involved in the algorithm can be efficiently estimated by resorting to the FFT, which dramatically reduces the computational cost. The estimation of g is done implicitly: only the values x(t) = g(e(t)), t = 1, …, T are estimated. One can further use any regression algorithm based on these data to estimate g, e.g. neural networks, splines, etc. The relation between the choice of Q and the performance of the algorithm is not yet well understood and is currently under investigation.
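For instance, the correlation (34) can be computed for all lags at once with the FFT (a sketch; zero-padding turns the circular correlation into a linear one):

```python
import numpy as np

def gamma_all_lags(y, psi_y):
    """FFT-based estimate of gamma(t) = (1/T) sum_m y(m) psi_y(m + t)
    for all non-negative lags t at once."""
    T = len(y)
    n = 2 * T                        # zero-pad: linear, not circular, correlation
    Y = np.fft.rfft(y, n)
    P = np.fft.rfft(psi_y, n)
    c = np.fft.irfft(np.conj(Y) * P, n)
    return c[:T] / T                 # lags t = 0, ..., T-1

rng = np.random.default_rng(2)
y = rng.standard_normal(256)
psi_y = -y                           # Gaussian score, for illustration
g = gamma_all_lags(y, psi_y)
```

The O(T log T) cost replaces the O(T²) double loop of the direct estimator.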
The proposed procedure shows good performance on simulated data, and is now being applied to real data. The extension to multichannel Wiener systems is currently under investigation.
Abstract. This paper presents an approach to recovering original speech signals from their nonlinear mixtures. Using a geometric method that makes a piecewise linear approximation of the nonlinear mixing space, and the fact that speech distributions are of Laplacian or Gamma type, a set of slopes is obtained as a set of linear mixtures.
1 Introduction
The problem of blind separation of sources [1] involves obtaining the signals generated by p sources, s_j, j = 1, …, p, from the mixtures detected by p sensors, e_i, i = 1, …, p. The mixture of the signals takes place in the medium in which they are propagated, and:

where F_i: ℜ^p → ℜ is a function of p variables from the s-space to the e-space, represented by one matrix, A_{p×p}. The goal of source separation is to obtain p functions, L_j, such that:

where L_j: ℜ^p → ℜ is a function from the e-space to the s-space. The source separation is considered solved when the signals y_j(t) are obtained from a matrix W_{p×p} (similar to A) [2], and:
Nevertheless, there exists a great variety of sensors [6] whose transfer characteristics are modelled by diverse functions. Thus, we can also consider a more general nonlinear model in which the F_i transformations are continuous nonlinear functions.
2 Basis of procedure
In previous papers [3,4] we have shown that, for linear mixtures, the set of all the images, E(t), forms a hyperparallelepiped in the E-space; by taking p vectors, (w₁, …, w_p), each one located at one of the edges of the cone that contains the mixing space, as column vectors of a matrix W_{p×p}, this matrix is similar to A_{p×p}. This can be performed as follows:

Recently, for linear mixtures of two speech signals [1], we used the property that speech signal distributions are of Laplacian or Gamma type and symmetrical; then, normalizing the mixing space, it is possible to determine the distribution of the points on the unit circle, obtaining two maxima that correspond to the slopes w₁₂ and w₂₁ or, equivalently, the independent components, because, due to the linearity of the F_i transformations, the mixtures are distributed with maxima of probability in directions parallel to the edges of the parallelepiped (distribution axes). Given the values of w_ij and c = det(W), the sources, X, may be obtained. Thus, for p = 2 we have:
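In code, the p = 2 recovery step can be sketched as follows (slopes and data are illustrative; W has unit diagonal and the detected slopes off-diagonal):

```python
import numpy as np

# hypothetical slopes detected as the two angular maxima
w12, w21 = 0.4, 0.7
W = np.array([[1.0, w12],
              [w21, 1.0]])

rng = np.random.default_rng(3)
s = rng.laplace(size=(2, 1000))   # Laplacian-type sources, as in the text
e = W @ s                          # observed linear mixtures
x = np.linalg.solve(W, e)          # recovered sources
```

Here det(W) = 1 − w₁₂·w₂₁ must be non-zero for the inversion to exist.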
3 Piecewise Linearization
When a normalization of the mixing space is performed, as in the case of the previous method, a loss of information occurs for nonlinear mixtures of two speech signals, since the irregularities in the point density of the two-dimensional mixing space are projected onto the unit circle. The method proposed in this paper considers the distribution of the points observed in the E(t) space by sectorizing the latter by means of radial and angular parameters. In this way, each sector is addressed by two numbers: the radius and the angle, as shown in Figure 1. Then, for each circle (or radius) there are two sectors (or angles) with the maximum distribution of points, corresponding to the independent axes (or slopes), as if a linear mixture of signals were made in each span between two circles. Thus, for each circle we obtain a W_ρ matrix as in the linear case. If the nonlinear function F_i is continuous, a piecewise linearization can be done in order to approximate F_i. In some cases, when the nonlinear mixing function is not continuous, good approximations can still be obtained if the gap between two successive slopes is low, i.e., if the distance between two sectors is not excessive. Clearly, a high number of sectors provides greater accuracy in the piecewise linearization. This procedure can be applied not only to Gamma-distributed signals, but also to all kinds of sources presenting a probability distribution with a maximum at the centre and symmetrical around this centre, such as Gaussian, Laplacian and Poisson functions. Furthermore, the method is valid, in general, even in the presence of additive noise produced in the medium itself (or in the mixing sensors), as the usual noise models do not alter the relative centres or the distribution symmetries.
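The sectorization can be sketched as follows (data and bin counts are illustrative): each observation is mapped to (ρ, θ), the plane is split into radial and angular bins, and within each radial bin the two most populated angular sectors give the slope estimates.

```python
import numpy as np

rng = np.random.default_rng(4)
s = rng.laplace(size=(2, 5000))
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
e = A @ s                                      # mixture, for illustration

rho = np.hypot(e[0], e[1])                     # radial parameter
theta = np.mod(np.arctan2(e[1], e[0]), np.pi)  # angle, folded by symmetry

n_radii, n_angles = 4, 16
r_edges = np.quantile(rho, np.linspace(0, 1, n_radii + 1))
slopes = []
for r in range(n_radii):
    ring = (rho >= r_edges[r]) & (rho <= r_edges[r + 1])
    counts, a_edges = np.histogram(theta[ring], bins=n_angles, range=(0, np.pi))
    top2 = np.argsort(counts)[-2:]             # two most populated sectors
    centers = 0.5 * (a_edges[top2] + a_edges[top2 + 1])
    slopes.append(np.tan(centers))             # slope estimates for this ring
```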
Fig. 1. Sectorization of the observation space E: each sector is addressed by a radius and an angle (ρ, θ).
4 Adaptive processing

The piecewise linearization procedure for the separation of two signals can be implemented in a recursive artificial neural network. The number of processing elements is proportional to the number of radii (ρ_max) used in the observed signals map and will depend on the number of sources to be separated. For the case of two signals, this number is 2ρ_max, irrespective of the number of angular sectors used. The structure of the recursive network (Hopfield or Herault-Jutten) allows us to separate the sources, s(t), as follows:

where e(t,ρ) represents the value of an observation vector belonging to the sector identified by the radius ρ, and W_ρ = (w_ij) is the weight matrix associated with this radius. Note that, without loss of generality and for two signals, the elements (w₁₁, w₂₂) of W_ρ are equal to 1, and that the two slopes (w₁₂, w₂₁) have the value w_ij = tan(θ_k), with θ_k representing the angles of the two winning sectors in each circle of radius ρ; in other words, θ₁ and θ₂ are the angles formed by the weight vectors w₁ and w₂ with the (e₁, e₂) axes respectively. The adaptive rule for the weights is the recursive expression used in the context of competitive learning [8] since, geometrically, the two weight vectors (w₁, w₂) that are representative of each circle of radius ρ are shifted towards a new vector e(t), i.e.:

where α is the classical learning rate, which must be a suitably monotonically decreasing scalar-valued coefficient, 0 < α < 1. Initially, the weights are located on the (e₁, e₂) axes with zero value, i.e., w(t=0, ρ) = 0. After convergence, the two weight vectors of equation (10) will be located on the two maximum distributions of points for each circle, respectively.
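A sketch of this competitive update (illustrative: the weights are initialised on the axes rather than exactly at zero, so that the winner can be decided by angle from the first step):

```python
import numpy as np

def competitive_step(w, e, alpha):
    """Shift the winning weight vector (closest in angle to e) towards e."""
    ang = lambda v: np.arctan2(v[1], v[0])
    k = int(np.argmin([abs(ang(wi) - ang(e)) for wi in w]))
    w[k] += alpha * (e - w[k])

# two weight vectors for one ring, initialised on the (e1, e2) axes
w = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
A = np.array([[1.0, 0.4],
              [0.6, 1.0]])
rng = np.random.default_rng(5)
for t in range(2000):
    e = A @ rng.laplace(size=2)
    if e[0] < 0:                      # fold to the half-plane e1 >= 0
        e = -e
    competitive_step(w, e, alpha=1.0 / (t + 10))
```

After convergence the two weight vectors point towards the two maximum point distributions, i.e. the distribution axes of the ring.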
5 Simulation results
We simulated this adaptive procedure for linear and nonlinear mixtures. Four simulations were made with synthetic and speech signals. In the figures, we show the input space, the observed space, the sectorization of the latter, with the lines corresponding to the (w₁, w₂) vectors, and the separated signals space. For the sake of clarity, the radial sectors have not been plotted. In simulations 3 and 4, we show the window of the separated speech signals in the convergence.

Simulation 1. There were 4 circles (radii) and 16 sectors (angles). The crosstalk values for each separated signal, with 2000 samples, were c(1) = -35 dB and c(2) = -34 dB.
Simulation 2. In this simulation, a nonlinear mixture was generated from linear mixtures
in each circular sector, i.e., four matrices depending on the radius were used, as follows:
The crosstalk values for each separated signal, with 2000 samples, were c(1)=-26 dB and
c(2)=-25 dB.
Simulation 3. The third simulation used speech signals as source signals, namely the Spanish words "mano (hand)" and "muñeca (doll)". The crosstalk values for each separated signal, with 5000 samples, were c(1) = -22 dB and c(2) = -27 dB. There were 4 circles, and the nonlinear mixing applied was as follows:
Simulation 4. The fourth simulation also used speech signals as source signals, namely the Spanish words "mano (hand)" and "muñeca (doll)". The crosstalk values for each separated signal, with 5000 samples, were c(1) = -25 dB and c(2) = -28 dB. There were 5 circles and the nonlinear mixing applied was as follows:
6 Conclusions
This paper presents an adaptive procedure for the demixing of linear and nonlinear mixtures of signals with probability distributions that are symmetrical with respect to their centres and non-uniform. The main idea is that it is possible to perform a piecewise linearization in the case of nonlinear mixtures (and in the linear case) in order to obtain the probability distribution axes that are parallel to the slopes of the hyperparallelepiped (a parallelepiped for two sources), or independent components, for each circle of radius ρ.
Although this paper describes the application of the algorithm to two signals, the
operations are performed in vectorial form and future work will concern the application
of this method to the separation of more than two sources and the study of the influence
of noise.
Acknowledgments
This work has been supported in part by the Spanish CICYT project TIC98-0982.
References
Fig. 2. Simulation 1, Orthogonal input space, Linear mixing, Sectorized mixing and Output space.
Fig. 3. Simulation 2, Orthogonal input space, Non-linear mixing, Sectorized mixing and Output space.
Fig. 5. Simulation 3. Input space, Nonlinear mixing, Sectorized mixing and Output space.
Fig. 7. Simulation 4. Input space, Nonlinear mixing, Sectorized mixing and Output space
Nonlinear Blind Source Separation by Pattern
Repulsion*
1 Introduction
Blind source separation has been a topic of growing interest for researchers in the last few years [6, 3, 2]. Most of the work that has been done until now has focused on the separation of linear mixtures. A few authors have started to address the more general problem of separation of nonlinear mixtures [4, 7, 5, 9, 8, 10] (see further references in [6]). In this paper we address that problem. We present a method for performing nonlinear separation, together with some examples.
We shall denote vectors by bold lowercase letters and vector functions by bold uppercase letters. Superscripts shall denote vector components, as in s^i. To denote an exponent, we shall enclose the base in parentheses, as in (s^i)², unless the meaning is clear from the context. Subscripts shall be used to index vectors (patterns) within a set.
We shall consider the following setting. A set of sources s^i, statistically independent from one another, form the source vector s, which is passed through a nonlinear mixing system M, whose output forms the vector of observations o = M(s). Our aim is to recover (i.e. separate) the original sources from the nonlinear mixtures o^i. The separation is blind, meaning that little is known about the sources s^i or the mixing system M. Regarding the sources, we only assume that they are independent from one another. Regarding the mixing system, we have to assume, first of all, that it is invertible.
and its magnitude will depend only on the distance between the patterns.
As noted above, the force, and thus also the potential, will have to decay faster than the electrostatic ones. We shall assume that E(x) is finite everywhere, and that its integral is also finite, ∫ E(x) dx = K. We shall also assume that E is concentrated around the origin, so that, for any ξ, E(‖x − ξ‖), considered as a function of x, has significant values only in a small region around ξ, in which the density p(x) can be considered constant. Therefore we can use E(x)/K as a Parzen kernel for estimating the probability density of patterns p from the samples x_i,

p̂(x) = 1/(N K) Σ_i E(‖x_i − x‖) ≈ p(x) ,   (4)

where N is the number of patterns. We can then express the total energy as
∂W/∂w = Σ_i (∂W/∂x_i)(∂x_i/∂w) = − Σ_{j,i} F(x_j, x_i) · ∂x_i/∂w   (9)
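A direct sketch of this energy and its gradient, with a Gaussian potential E(d) = e^(−d²) (an illustrative choice satisfying the finiteness and concentration assumptions above):

```python
import numpy as np

def energy(x):
    """Total repulsion energy W = sum over pairs of E(||x_i - x_j||),
    with the Gaussian potential E(d) = exp(-d^2)."""
    n = len(x)
    w = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            w += np.exp(-np.sum((x[i] - x[j]) ** 2))
    return w

def energy_grad(x):
    """Gradient of W with respect to each output pattern x_i
    (the pairwise repulsion force, summed over partners)."""
    g = np.zeros_like(x)
    n = len(x)
    for i in range(n):
        for j in range(n):
            if i != j:
                diff = x[i] - x[j]
                g[i] += -2.0 * np.exp(-np.sum(diff ** 2)) * diff
    return g

rng = np.random.default_rng(6)
x = rng.standard_normal((20, 2))
g = energy_grad(x)
```

Following this gradient downhill spreads the output patterns apart, pushing their density towards uniformity, as the appendix argues.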
3 Application examples
We present two examples of nonlinear separation, the first one with images (see Fig. 1-a) and the second one with time-domain signals (see Fig. 3-a). The nonlinear mixtures corresponded to nonlinear analytical expressions given below. These expressions were chosen somewhat arbitrarily, but taking into consideration that they shouldn't be too unsmooth (cf. Sect. 1). The separation was implemented, in both cases, by MLPs with 10 tanh hidden units and with two linear output units. The MLPs had full connectivity between consecutive layers, and also had direct connections from the inputs to the output units.
The images had a total of 200 × 200 = 40,000 pixels each, and the signals were both 1000 samples in length. In both cases the sum in (9) would have involved too large a number of terms to be usable in practice. We dealt with this, in both cases, by strongly subsampling the set of pairs of patterns used in the computation. In the case of the images we started by randomly sampling, without replacement, a set of 1000 patterns from the 40,000 observations. The remaining procedure was the same for both the images and the signals. Designating by o_i, i = 1, …, 1000, the training patterns, and setting by convention o₁₀₀₁ = o₁, we then further subsampled the set of pairs of output patterns to be used in (9) by using only the pairs of the form (x_i, x_{i+1}), with i = 1, …, 1000.
For the computation of W and of its gradient we used, for the potential, E(‖x₁ − x₂‖) = 0.05 e^(−2‖x₁ − x₂‖²). The objective function J that we minimized had two extra terms besides the energy, J = W + B + R. The term B bounded the distribution of output patterns within the square [−1, 1]², and R was a regularization term over 𝒲, the set of all weights except those of the direct connections from inputs to outputs and the unit biases, which did not affect the smoothness of the mapping. For B we used

B = Σ_{i=1}^{1000} { max[(|x_i¹| − 1), 0]² + max[(|x_i²| − 1), 0]² }   (10)

Training was performed in batch mode. We used adaptive step sizes and error control (cf. [1], Sects. C.1.2.4.2 and C.1.2.4.3), which were essential in getting the training to converge quickly. The training converged in roughly 1000 epochs, in both cases.
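The box-bounding term B can be sketched as a quadratic penalty on excursions outside [−1, 1]² (an assumed form, consistent with its stated purpose of keeping output patterns inside the square):

```python
import numpy as np

def bounding_term(x):
    """B penalises output patterns outside the square [-1, 1]^2:
    B = sum_i sum_k max(|x_i^k| - 1, 0)^2."""
    excess = np.maximum(np.abs(x) - 1.0, 0.0)
    return float(np.sum(excess ** 2))

inside = np.array([[0.5, -0.9], [0.0, 1.0]])    # no penalty
outside = np.array([[1.5, 0.0], [0.0, -2.0]])   # penalised excursions
```

The penalty is zero inside the square and grows quadratically outside it, so its gradient combines smoothly with the energy gradient.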
For the images we used the mixture equations

o^1 = s^1 (s^2 + 1) / 2   (12)
o^2 = s^2 / (s^1 + 1)   (13)

where s^i and o^i designate pixel intensities, with 0 corresponding to black and 1 to white. We set λ = 10⁻³ in the regularization term. Figures 1-b and 1-c show the nonlinear mixtures and the nonlinear separation results, respectively.
We see that the method was able to recover the original sources without visually
noticeable error. Figure 1-d shows a result of linear separation, obtained by using
a linear network instead of the MLP, and following the same training procedure.
Note that, since the mixture is not linearly separable, different linear separation
methods would probably have yielded different results, but none would have
been able to recover the original sources.
To give a better view of how the nonlinear separation performed, we processed a square grid of 30 × 30 points from the source space through the nonlinear mixture and separation. Figure 2 shows the grid after the nonlinear mixture and
after the separation. We see that the separation was able to recover the original
square grid with relatively small error.
In the second example, the sources were a triangular waveform with 4 complete periods within the 1000 samples, and a sine wave with 23 complete periods within that interval. Both sources were first rescaled in amplitude to the interval [0, 1]. We used the mixture equations

o^1 = (s^1)² / log(s^2 + 2)   (14)
o^2 = (s^2)² / √(s^1 + 1) .   (15)
Fig. 1. Nonlinear separation of images. (a) originals, (b) nonlinear mixture, (c) nonlinear separation, (d) linear separation.
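The sources and mixtures (14)-(15) of the second example can be generated directly (a sketch; the triangular-wave construction is an illustrative formula):

```python
import numpy as np

n = 1000
t = np.arange(n)

# triangular wave, 4 complete periods over the 1000 samples, in [0, 1]
period = n / 4
s1 = 2 * np.abs(t / period - np.floor(t / period + 0.5))

# sine wave, 23 complete periods, rescaled to [0, 1]
s2 = 0.5 * (1 + np.sin(2 * np.pi * 23 * t / n))

# mixture equations (14)-(15)
o1 = s1 ** 2 / np.log(s2 + 2)
o2 = s2 ** 2 / np.sqrt(s1 + 1)
```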
This mixture is more strongly nonlinear than the one of the first example, especially due to the quadratic terms. Accordingly, we had to use a smaller coefficient in the regularization term, λ = 5 × 10⁻⁵. The order of the 1000 mixture observations was randomized, so that in the pairs of the form (x_i, x_{i+1}) there would be no correlation between the two patterns. Apart from these details, the training was performed in the same way as in the first example. Figure 3 shows the results, and we can see that the nonlinear system was able to separate the sources relatively well, while a linear system was not. Figure 4 shows scatter plots of the various signals, to give a better view of the transformations involved in this experiment.
4 Conclusions
Fig. 2. Mappings. (a) nonlinear mixture, (b) nonlinear separation.
Appendix
Consider a finite region Ω ⊂ ℝⁿ, and designate by u(x) = u the uniform density within that region. Let p(x) be any density within that region, and let e(x) = p(x) − u. Since both u and p are probability densities, we have

∫_Ω e(x) dx = 0 .   (16)

Therefore,

∫_Ω p²(x) dx = ∫_Ω u²(x) dx + 2u ∫_Ω e(x) dx + ∫_Ω e²(x) dx = ∫_Ω u²(x) dx + ∫_Ω e²(x) dx ≥ ∫_Ω u²(x) dx ,   (18)

with equality only if p = u. The uniform density is therefore the absolute minimizer of ∫_Ω p²(x) dx.
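A quick numerical check of this appendix result, with densities discretised on a grid over [0, 1] (illustrative construction):

```python
import numpy as np

def p_squared_integral(p, dx):
    """Discrete approximation of the integral of p(x)^2."""
    return float(np.sum(p ** 2) * dx)

n, dx = 100, 0.01                  # grid on [0, 1]
uniform = np.full(n, 1.0)          # the uniform density on [0, 1]

rng = np.random.default_rng(7)
raw = rng.random(n)
other = raw / (raw.sum() * dx)     # some non-uniform density on the same grid
```

Both vectors integrate to 1, yet the uniform one yields the smaller value of the integral, as the argument above predicts.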
References
[1] Almeida, L. B.: Multilayer perceptrons. Handbook of Neural Computation.
Fiesler, E. and Beale, R., eds. Institute of Physics and Oxford University Press
(1997) available at http://www.oup-usa.org/acadref/nccl_2.pdf.
[2] Bell, A. and Sejnowski, T.: An information-maximization approach to blind sep-
aration and blind deconvolution. Neural Computation 7 (1995) 1129-1159.
Fig. 3. Nonlinear separation of time-domain signals. (a) originals, (b) nonlinear mixture, (c) nonlinear separation, (d) linear separation.
[3] Comon, P.: Independent component analysis - A new concept?. Signal Processing
36 (1994) 287-314.
[4] Deco, G. and Brauer, W.: Nonlinear higher-order statistical decorrelation by volume-conserving neural architectures. Neural Networks 8 (1995) 525-535.
[5] Hochreiter, S. and Schmidhuber, J.: LOCOCODE performs nonlinear ICA without knowing the number of sources. Proc. First Int. Worksh. Independent Component Analysis and Signal Separation. Aussois, France. Cardoso, J. F., Jutten, C., and Loubaton, P., eds. (1999) 277-282.
[6] Lee, T.-W., Girolami, M., Bell, A., and Sejnowski, T.: A unifying information-theoretic framework for independent component analysis. International Journal on Mathematical and Computer Modeling (1998).
[7] Marques, G. C. and Almeida, L. B.: An objective function for independence. Proc.
International Conference on Neural Networks. Washington DC (1996) 453-457.
[8] Marques, G. C. and Almeida, L. B.: Separation of nonlinear mixtures using pattern
repulsion. Proc. First Int. Worksh. Independent Component Analysis and Signal
F i g . 4. Scatter plots. (a) originals, (b) nonlinear mixture, (c) nonlinear separation, (d)
linear separation.
Separation. Aussois, France. Cardoso, J. F., Jutten, C., and Loubaton, P., eds.
(1999) 277-282.
[9] Pajunen, P.: Nonlinear independent component analysis by self-organizing maps. Proc. Int. Conf. on Artificial Neural Networks. Bochum, Germany (1996) 815-819.
[10] Palmieri, F., Mattera, D., and Budillon, A.: Multi-layer independent component analysis (MLICA). Proc. First Int. Worksh. Independent Component Analysis and Signal Separation. Aussois, France. Cardoso, J. F., Jutten, C., and Loubaton, P., eds. (1999) 93-97.
[11] Xu, D., Principe, J., Fisher, J., and Wu, H.-C.: A novel measure for independent component analysis. Proc. IEEE Int. Conf. Acoust., Speech and Sig. Processing. Seattle, WA 2 (1998) 1161-1164.
Text-to-Text Machine Translation Using the RECONTRA Connectionist Model*

M.A. Castaño¹ and F. Casacuberta²
1 Introduction
The task chosen in this paper was the Traveller task, which was defined within the first phase of the EuTrans project [1]. This task can be considered a more realistic test for our connectionist RECONTRA translator than the one we previously employed [7]. The framework adopted for the task is that of a traveller (tourist) at the reception of a hotel of a country whose language he/she does not speak. The vocabularies of the languages considered in the project (Spanish, English, German and Italian) ranged from 500 to 700 words. Taking into account the great difference between the 30 words of the vocabularies of the task previously approached using the RECONTRA model and the 700 words in the Traveller task, we chose a subtask of the Traveller task to test our translator. This subtask includes sentences in which the tourist notifies the reception of his departure, asks for the bill, asks and complains about the bill, and asks for his luggage to be moved.
In order to decrease the sizes of the vocabularies and the complexity of the chosen Traveller (sub)task, the grouping of some words and word sequences into categories was introduced. Specifically, two categories labelled $DATE and $HOUR were used, which respectively represented generic dates and hours, just as their names suggest. In a first experiment we considered pairs of categorized Spanish-into-English sentences and later, non-categorized Spanish-into-English sentences. In what follows we will refer to them as the categorized and non-categorized Traveller tasks respectively.
The Spanish vocabulary of the non-categorized task had 178 different words, which decreased to 132 words after categorizing the sentences. The English vocabulary had 140 and 82 words in the corresponding non-categorized and categorized tasks. Figure 1 shows some examples of both non-categorized and categorized Spanish-into-English translations.
Figure 1. Some examples of pairs of sentences of the non-categorized and categorized Traveller
tasks.
The basic architecture adopted for the connectionist RECONTRA translator is a simple recurrent network presented in [9]. In addition, it includes "delayed" inputs, which reinforce the preceding and the following contexts of the input signal. The resulting neural topology is shown in Figure 2.

Let us now see how the RECONTRA model runs. The words of the sentence to be translated are presented sequentially at the input layer of the net, while the model has to provide the successive words of the corresponding translated sentence. That is, the translator sees the input sentence through a window of n words which is shifted word by word, and generates the successive words of the output sentence one after the other.
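The windowed presentation can be sketched as follows (hypothetical padding token; `left` words of left context precede the current word, the rest follow):

```python
def windows(sentence, n, left, pad="<empty>"):
    """Yield successive n-word input windows over the sentence,
    shifted one word at a time, padded at both ends."""
    words = sentence.split()
    padded = [pad] * left + words + [pad] * (n - left - 1)
    return [tuple(padded[i:i + n]) for i in range(len(words))]

ws = windows("quisiera pedir la cuenta", n=3, left=1)
```

Each sentence of L words yields exactly L windows, one per output time cycle.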
It should be noted that the window should be wide enough so that RECONTRA will have seen enough information at the input layer before providing the appropriate translated output word; that is, the net cannot translate something it still has not seen. However, it is not strictly necessary that the input word(s) related to the translated output word be inside the current input window, since the net has memory and is able to remember (some) past events.
Figure 2. Neural topology of the RECONTRA translator.
In order to mark the end of the sentence translated by the net, an additional output word is included in the target vocabulary. Consequently, the presentation of the input sentence finishes after the translator provides this special word or, failing this, after introducing the whole input sentence and a certain number of following empty input words to the network.
In order to approach MT between languages which involve huge (or even medium) vocabularies using our RECONTRA translator, a local representation of these vocabularies cannot be employed. It would lead to networks with an excessive (and so unapproachable) number of connections to be trained. Consequently, a distributed representation of both source and target vocabularies is required. Previous studies on the most appropriate type of distributed codification to represent the vocabularies in our RECONTRA translator were carried out in [6]. They suggested employing similar (boolean) codifications for those words in the vocabulary which appeared in similar syntactic contexts. These experiments also showed that the learning convergence improved when the same codification was adopted (as far as possible) for the source word to be translated and the corresponding translated target word. Finally, these studies revealed that significantly better translation performance was obtained by using coarse representations, in contrast to severe subsymbolic distributed representations. Consequently, in the experiments presented in this paper we adopted boolean coarse codifications for both the source and target vocabularies of the Traveller tasks; in addition, similar codifications were assigned to words in the same vocabulary that appeared in the same syntactic context, and to words in different vocabularies that were translations of each other.
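Such pseudo-random boolean coarse codes can be sketched as follows (illustrative: each word gets a fixed number of active bits, and a word in a similar syntactic context derives its code from an existing one by moving only a few bits):

```python
import random

def coarse_code(n_bits, n_active, base=None, flip=0, rng=None):
    """Pseudo-random boolean coarse code: n_active bits set out of n_bits.
    If a base code is given, derive a similar code by moving `flip` of its
    active bits, so words in similar contexts get similar codes."""
    rng = rng or random.Random(0)
    if base is None:
        on = rng.sample(range(n_bits), n_active)
    else:
        on = [i for i, b in enumerate(base) if b]
        moved = rng.sample(on, flip)
        off = [i for i in range(n_bits) if not base[i]]
        on = [i for i in on if i not in moved] + rng.sample(off, flip)
    on = set(on)
    return [1 if i in on else 0 for i in range(n_bits)]

code_lunes = coarse_code(61, 8)
code_martes = coarse_code(61, 8, base=code_lunes, flip=2)  # similar context
overlap = sum(a & b for a, b in zip(code_lunes, code_martes))
```

The 61-bit, 8-active choice mirrors the scale of the codings reported later in the paper, but the construction itself is only a sketch.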
The RECONTRA translator described above is trained using an on-line version of the Backward-Error Propagation algorithm [16]. This means that the recurrent connections of the net are ignored in the process of adjusting the weights and that they are considered additional external inputs to the architecture (although the net is not unfolded in time as in the Back-Propagation-Through-Time learning method). Consequently, the gradient of the error is truncated in the estimation of the weights; that is, it is not exactly computed. However, this learning method works well in practice, as shown in [5], where it is compared to (more computationally costly) methods which exactly follow the gradient.
The resulting training algorithm is as follows. After the inputs and target units are updated, the forward step is computed, the error is back-propagated through the net, and the weights are modified. Later, the hidden unit activations are copied onto the corresponding context units. This time cycle is continuously repeated until the target values mark the end of the translated sentence. A sigmoid function (0,1) is assumed as the nonlinear activation function, and context activations are initialized to 0.5 at the beginning of every input-output pair. The updating of the weights requires estimating appropriate values for the learning rate and momentum. The choice of these parameters is carried out inside the unitary bidimensional space which they define, by analyzing the residual mean squared error of a network trained for 10 random presentations of the learning corpus (10 epochs). Training continues with the learning rate and momentum which led to the lowest mean squared error, and the training process stops when a certain established criterion is verified.
With regard to the translated message provided by the RECONTRA model, the network continuously generates output activations. In order to interpret the activations provided at a given time cycle, the word associated with the pre-established codification of the target vocabulary which is nearest to these activations is searched for.
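This decoding step can be sketched as a nearest-codeword search (toy codifications; Euclidean distance assumed):

```python
import numpy as np

def decode(activations, codebook):
    """Return the vocabulary word whose pre-established codification is
    nearest (in Euclidean distance) to the output activations."""
    words = list(codebook)
    codes = np.array([codebook[w] for w in words], dtype=float)
    d = np.sum((codes - np.asarray(activations)) ** 2, axis=1)
    return words[int(np.argmin(d))]

codebook = {
    "bill":  [1, 0, 1, 0],
    "room":  [0, 1, 1, 0],
    "<end>": [0, 0, 0, 1],   # special word marking the end of the translation
}
word = decode([0.9, 0.1, 0.8, 0.2], codebook)
```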
4 Experimental Results

First, the Spanish-into-English categorized Traveller task was approached using the RECONTRA translator described in the above section. Later, the non-categorized MT task was learned in a second experiment using both the RECONTRA model and other recent inductive MT approaches.¹

¹ All the connectionist experiments presented in the paper were trained and tested using the SNNS neural simulator [19].
The corpora adopted in the two tasks approached in the paper were sets of text-to-text pairs, each of which consisted of a sentence in the Spanish input language and the corresponding translation in the English output language. Among the Spanish-into-English non-categorized pairs of sentences considered in the EuTrans project (related to the subtask considered in this paper), we randomly chose 5,000 samples to train the connectionist translator and 1,000 different pairs to test the resulting learned models. These training and test corpora were later categorized and employed to learn and recognize the categorized task.

3,425 of the 5,000 non-categorized training pairs were different; after categorizing these 5,000 samples, the number of different pairs decreased to 2,687. In the test corpus, 991 out of the 1,000 non-categorized pairs were different, and after categorization this number went down to 771.
There was no overlapping between the non-categorized training and test corpora;
however, 54% of the pairs in the categorized test set were included in the categorized
learning set.
The length of the non-categorized Spanish sentences ranged from 3 to 20 and the length of
the non-categorized English sentences, from 3 to 17. The number of words of the
categorized sentences ranged from 3 to 13 for the Spanish ones and from 3 to 12 for the
English ones.
The RECONTRA translators employed to approach the categorized Traveller task had 50 input units and 37 outputs, which respectively coded the 132 words of the Spanish source vocabulary and the 83 words of the English target vocabulary (including the special word which marks the end of the translated sentence). The codifications adopted were pseudo-random boolean coarse codings with the features specified in Section 4.2.
Six Spanish words were presented simultaneously to the net, so that 3 and 2 words were considered as the corresponding (balanced) right and left contexts of the input word. This input context was adopted after studying some examples of the task and verifying that the source word(s) corresponding to the target word to be translated at every time cycle have been previously presented to the input of the net. In order to avoid translators with an excessive number of trainable connections, larger input contexts were not considered.

The next step in our approach to the categorized task was to estimate an adequate value for the number of hidden units. With this objective in mind, translators with the above features and with a single hidden layer ranging from 130 to 160 units were designed. Appropriate values for the learning rate and momentum were found for every model. Each of these models was trained for up to 500 epochs using the 5,000 categorized pairs corresponding to the learning corpus of the task. The resulting trained translators were then tested on the 1,000 recognition samples. The best test performances were obtained for the network with 140 hidden units. Table 1 shows the (test) sentence accuracy translation rates and the word accuracies achieved for that topology after both 100 and 500 training epochs. These results reveal that, in spite of the low number of training epochs, the translation performances obtained were quite good.
Table 1. Sentence accuracy translation rates and word accuracy rates for the categorized and non-
categorized Traveller tasks.
Let us now examine the behaviour of the RECONTRA translator on the same pairs of sentences
without categorization.
Taking into account that a RECONTRA translator with 140 hidden units provided good
accuracy rates for the categorized Traveller task, we employed a model with 160 neurons
this time, since the vocabularies of the non-categorized task were larger. The
178 words of the Spanish vocabulary were coded into 61 boolean units using pseudo-
random coarse representations. The 140 words of the English vocabulary together with the
word which marks the end of the translation were coded into 52 units in a similar way.
After studying some pairs of examples of the task, we noticed that at least 8 delayed inputs
(with 4 words for the left context and 3 words for the right context) were required.
In order to compare with the results achieved using our connectionist RECONTRA translator,
the experiments on the non-categorized Traveller task presented in the previous Section
were also approached using a translation model based on subsequential transducers similar to
that presented in [2]. The scheme combined subsequential transducers with language
models of both the input and output languages (built using 3-grams [12]) and with an
error-correcting model based on the Levenshtein distance [10]. The resulting
translator was trained and tested employing the same respective learning and test samples
as those considered with the RECONTRA translator. Table 2 shows the test sentence
performances reached.
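The Levenshtein distance on which the error-correcting model is based can be sketched with the classic dynamic-programming recurrence; this is the standard algorithm, not the paper's specific error-correcting parser.

```python
def levenshtein(a, b):
    """Edit distance between two token sequences, counting unit-cost
    insertions, deletions and substitutions (dynamic programming,
    keeping only one row of the DP table at a time)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

assert levenshtein("kitten", "sitting") == 3
assert levenshtein(["a", "double", "room"], ["a", "single", "room"]) == 1
```

In an error-correcting scheme, such a distance lets the decoder map an ill-formed hypothesis onto the closest sentence accepted by the language model.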
Table 2. Sentence translation rates achieved using different inductive techniques to approach the
non-categorized Traveller task.
TRANSLATION MODEL                       SENTENCE ACC. RATE
Probabilistic alignments                77.6%
Grammar association with perceptrons    58.3%
Grammar association with LOCO model     79.5%
Subsequential transducers               27.2%
RECONTRA                                91.1%
The non-categorized Traveller task was also approached by García and Prat using their
respective (developing) techniques based on probabilistic alignments and grammar
association. The first technique [11] was inspired by the translation Model-II previously
developed at IBM [3]; the second technique estimated the association
probabilities through a multilayer perceptron and through a model called LOCO [15].
Table 2 summarizes the sentence translation rates achieved using the same corpora
employed in our non-categorized experiment to train and test these last translation models.
More details about these experiments can be found in [15].
Looking at the comparative results shown in Table 2 we can observe that the best
translation performance was provided by the RECONTRA connectionist model. However,
our translator required the largest spatial storage and learning time. On the other hand, the
accuracies obtained using subsequential transducers could be improved by providing
larger training samples.
Considering these encouraging results, it seems feasible that future work could deal with
more complex limited-domain translations and with larger vocabularies. However, in
order to avoid translators of unapproachable sizes, more experimentation with
effective compact (coarse or distributed) representations of the vocabularies is required.
Destructive training methods can also be employed to reduce the size of the networks (and
thus the learning time). New connectionist architectures which further lower this
learning time should also be considered. Automatic categorization of non-categorized
source sentences and translation of instances of translated categories will be added to
the process of translating categorized sentences. Finally, the integration of our translator
with a module which recognizes voice input is still pending.
References
3. P.F. Brown, S.A. Della Pietra, V.J. Della Pietra, R.L. Mercer. The Mathematics of Statistical
Machine Translation: Parameter Estimation. Computational Linguistics, vol. 19, no. 2, pp.
263--311. 1993.
4. M.A. Castaño, F. Casacuberta. A Connectionist Approach to Machine Translation. Procs. of
the 5th European Conference on Speech Communication and Technology (EUROSPEECH-
97), vol. 1, pp. 91--94, Rhodes, Greece. 1997.
5. M.A. Castaño, F. Casacuberta. Training Simple Recurrent Networks through Gradient Descent
Algorithms. In "Biological and Artificial Computation: From Neuroscience to Technology",
"Lecture Notes in Computer Science", vol. 1240, pp. 493--500. Eds. J. Mira, R. Moreno-Díaz,
J. Cabestany. Springer-Verlag. 1997.
6. M.A. Castaño. Redes Neuronales Recurrentes para Inferencia Gramatical y Traducción
Automática. Ph.D. dissertation, Dpto. Sistemas Informáticos y Computación, Universidad
Politécnica de Valencia. 1998.
7. A. Castellanos, I. Galiano, E. Vidal. Application of OSTIA to Machine Translation Tasks. In
"Lecture Notes in Computer Science", vol. 862, pp. 93--105, R.C. Carrasco and J. Oncina
(Eds.), Springer-Verlag. 1994.
8. G. Dorffner. A step towards sub-symbolic language models without linguistic representations.
Connectionist Approaches to Language Processing, vol. 1. Eds. R. Reilly, N. Sharkey.
Erlbaum. 1990.
9. J.L. Elman. Finding Structure in Time. Cognitive Science, vol. 14, no. 2, pp. 179--211. 1990.
10. K.S. Fu. Syntactic Pattern Recognition and Applications. Prentice-Hall. 1982.
11. I. García. Traducción Automática basada en Métodos Estadísticos. Final year project. Dpto.
Sistemas Informáticos y Computación. Universidad Politécnica de Valencia. 1996.
12. F. Jelinek. Language Modelling for Speech Recognition. Procs. of the 12th European
Conference on Artificial Intelligence (ECAI-96), pp. 26--32, Hungary. 1996.
13. N. Koncar, G. Guthrie. A Natural Language Translation Neural Network. Procs. of the Int.
Conf. on New Methods in Language Processing, pp. 71--77, Manchester, UK. 1994.
14. A. Marzal, E. Vidal. Computation of Normalized Edit Distance and Applications. IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 9. 1993.
15. F. Prat. Traducción Automática en Dominios Restringidos: Algunos Modelos Estocásticos
Susceptibles de ser Aprendidos a partir de Ejemplos. Ph.D. dissertation, Dpto. Sistemas
Informáticos y Computación, Universidad Politécnica de Valencia. 1998.
16. D.E. Rumelhart, G. Hinton, R. Williams. Learning sequential structure in simple recurrent
networks. In "Parallel distributed processing: Experiments in the microstructure of cognition",
vol. 1. Rumelhart D.E., McClelland J.L. and the PDP Research Group (Eds.), MIT Press,
Cambridge. 1986.
17. N.E. Sharkey. Connectionist Representations for Natural Language: Old and New. Procs. of
the VI SEPLN, Donostia. 1990.
18. A. Waibel, A.N. Jain, A.E. McNair, H. Saito, A.G. Hauptmann, J. Tebelskis. JANUS: A
Speech-to-Speech Translation System using Connectionist and Symbolic Processing
Strategies. Procs. ICASSP-91, pp. 793--796. 1991.
19. A. Zell et al. SNNS: Stuttgart Neural Network Simulator. User manual, Version 4.1. Technical
Report no. 6/95, Institute for Parallel and Distributed High Performance Systems, University
of Stuttgart. 1995.
An Intelligent Agent for Brokering
Problem-Solving Knowledge
1 Dept. of Social Science Informatics (SWI), University of Amsterdam, Roetersstraat 15, 1018 WB
Amsterdam, The Netherlands, richard@swi.psy.uva.nl, http://www.swi.psy.uva.nl/
2 University of Karlsruhe, Institute AIFB, 76128 Karlsruhe, Germany, dfe@aifb.uni-karlsruhe.de,
http://www.aifb.uni-karlsruhe.de/WBS/dfe/
Abstract
We describe an intelligent agent (a broker) for the configuration and execution of knowl-
edge systems for customer requests. The knowledge systems are configured from reusable
problem-solving methods that reside in digital libraries on the Internet. The approach
followed amounts to solving two subproblems: (i) the configuration problem, which im-
plies that we have to reason about problem-solving components, and (ii) the execution of
heterogeneous components. We use CORBA as the communication infrastructure.
FIGURE 1: Distinction between two systems: (1) the broker configures a problem solver by reasoning with
UPML, and (2) the output of the broker is a knowledge system, which consists of executable code fragments
corresponding to the selected PSMs, along with "glue" for their integration to make them interoperate. The
arrows in system 1 denote UPML expressions, whereas the arrows in system 2 stand for CORBA structures.
that the site where a PSM is executed should support the language in which the PSM is
implemented (System2 in Figure 1).
In Section 2, we briefly review the ingredients needed to explain our approach. Section 3
describes the configuration task of the broker. In Section 4, we outline how the configured
problem solver is executed, and in Section 5 we sketch the CORBA architecture used to implement
our approach. Finally, Section 6 concludes the paper.
2 Ingredients
Before we explain our approach in detail, we first briefly explain its ingredients: PSMs,
ontologies, UPML and CORBA.
2.1 Problem-Solving Methods
The components we broker are problem-solving methods, which are domain-independent de-
scriptions of reasoning procedures. PSMs are usually described as having an input/output
description, a competence description (what they can deliver), and assumptions on domain
knowledge (what they require before they can deliver their competence). We distinguish be-
tween two kinds of PSMs: primitive and composite ones. Composite PSMs comprise several
subtasks that together achieve the competence, along with an operational description speci-
fying the control over the subtasks. Primitive PSMs are directly associated with executable
code.
2.2 Ontologies
An ontology is a shared and common understanding of some domain that can be commu-
nicated across people and computers [17, 32, 30]. Most existing ontologies are domain
ontologies, reflecting the fact that they capture (domain) knowledge about the world inde-
pendently of its use [18]. However, one can also view the world from a "reasoning" (i.e.
use) perspective [19, 14, 10]. For instance, if we are concerned with diagnosis, we will talk
about "hypotheses", "symptoms" and "observations". We say that those terms belong to
the task ontology of diagnosis. Similarly, we can view the world from a problem-solving
point of view. For example, Propose & Revise sees the world in terms of "states", "state
transitions", "preferences" and "fixes" [14, 20]. These terms are part of the method or PSM
ontology [16] of Propose & Revise.
Ontologies can be used to model the different agents involved in our scenario (illus-
trated in Figure 3). So we have task ontologies, PSM ontologies and domain ontologies to
characterize respectively the type of task the customer wants to solve, the PSM, and the
application domain for which a customer wants a KBS to be built. These different ontologies
are related to each other through what we call bridges.
FIGURE 2: The class hierarchy of the UPML language (left), and the attributes of a UPML specification
(right).
language is specified in the ProtegeWin tool [21], which allows one to write
down a meta-description of a language. Figure 2 gives the class hierarchy of UPML (left part
of the figure). A UPML specification consists of, among others, tasks, PSMs, domain models,
ontologies and bridges (see right part of Figure 2). Bridges have to fill the gap between
different ontologies by renaming and mapping.
Having specified the structure and syntax of UPML, ProtegeWin can automatically
generate a knowledge acquisition tool for it, which can be used to write instances in UPML
(i.e. actual model components). For describing the competence of PSMs, FOL formulas can
be used. Typically, library providers use a subset (the part related to PSMs) of UPML to
characterize their PSMs, using the generated KA tool.
FIGURE 3: The steps the broker needs to make for selecting a PSM.
- Matching the goal with PSM competences and finding a suitable renaming of terms
(the ontology of the task and the ontology of the PSMs may have different signatures).
- Checking the assumptions of the PSM in the customer's knowledge base, and gener-
ating the needed PSM-domain bridge (for mapping different signatures).
These tasks are closely related to matching software components in Software Engineering
[34, 26], where theorem-proving techniques have shown to be interesting candidates. For the
current version of our broker, we use the leanTAP [3] theorem prover, which is an iterative-
deepening theorem prover for Prolog that uses tableau-based deduction.
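The iterative-deepening strategy that leanTAP uses for proof search can be illustrated, in a hedged way, on an ordinary state-space search: repeated depth-limited searches with an increasing bound. The toy state space below (not part of the paper) stands in for the proof search tree.

```python
def iddfs(start, goal, successors, max_depth=10):
    """Iterative deepening: repeat a depth-limited DFS with an
    increasing depth bound until the goal is reached or the
    maximum bound is exhausted."""
    def dls(node, limit):
        if node == goal:
            return [node]
        if limit == 0:
            return None
        for nxt in successors(node):
            path = dls(nxt, limit - 1)
            if path is not None:
                return [node] + path
        return None

    for limit in range(max_depth + 1):
        path = dls(start, limit)
        if path is not None:
            return limit, path
    return None

# toy state space: the successors of n are n+1 and n*2
limit, path = iddfs(1, 5, lambda n: [n + 1, n * 2] if n < 5 else [])
assert limit == 3 and path == [1, 2, 4, 5]
```

The depth bound plays the same role as the "10" passed to the broker's `match_psm` call in Figure 4: it caps how deep the prover may search at each round while keeping the search complete up to that bound.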
For matching the customer's goal with the competence of a PSM, we try to prove the
task goal given the PSM competence. More precisely, we want to know whether the goal
logically follows from the conjunction of the assumptions of the task, the postcondition of
the PSM and the assumptions of the PSM. Figure 4 illustrates the result of a successful
proof for a classification task (set-pruning) and one specific PSM (prune). In Figure 4,
Formula (1) represents the task goal to be proven (explained in the paragraph on "broker-
customer interaction"), Formula (2) denotes the assumption of the task, Formula (3) the
postcondition of the PSM and Formula (4) represents the assumption of the PSM. The
generated substitution represents the PSM-task bridge needed to map the output roles of
the task and PSM onto each other. The output of the whole matching process, if successful,
is a set of PSMs whose competences match the goal, along with a renaming of the input and
output terms involved. If more than one match is found, the best one 6 needs to be selected. If no
match is found, then a relaxation of the goal might be considered or additional assumptions
could be made [5].
5 Note that "class" is used in the context of classification, and not in the sense of the OO paradigm.
6 In the current version, we match only on competence (thus on functionality). However, UPML has a slot
for capturing non-functional, pragmatic factors, such as how often the component has been retrieved, whether that
was successful, for what application, etc. Such non-functional aspects play an important role in practical
component selection.
9 ?- match_psm('set-pruning',prune,Substitution,10).
The goal to be proven is:
formula(forall([var(x, class)]),                              (1)
    implies(in(x, 'output-set'),
        and(in(x, 'input-set'), test(x, properties)))).
The theory is:
and(formula(forall([var(x, class)]),                          (2)
        equivalent(test(x, properties),
            formula(forall([var(p, property)]),
                implies(in(p, properties),
                    implies(true(x), true(p)))))),
    and(formula(forall([var(x, class)]),                      (3)
            implies(in(x, output),
                and(in(x, input),
                    formula(forall([var(p, property)]),
                        and(in(p, properties),
                            has_property(x, p)))))),
        formula(forall([var(x, element), var(p, property)]),  (4)
            implies(has_property(x, p), implies(true(x), true(p)))))).
Yes
......................................................................................
FIGURE 4: Matching the task goal of the "set-pruning" task with the competence description of the "prune"
PSM using a theorem prover. The task provides the goal to be proven (1). The theory from which to
prove the goal is constituted by the assumptions of the task (2), the postcondition of the PSM (3) and the
assumptions of the PSM (4). The "10" in the call of the match denotes that we allow the theorem prover to
search 10 levels deep. The resulting substitution constitutes the PSM-task bridge.
10 ?- bridge_pd(prune,'apple-classification',B).
The goal to be proven is:
formula(forall([var(x, element), var(p, property)]),          (1)
    implies(has_property(x, p), implies(true(x), true(p)))).
The theory is:
and(formula(forall([var(c, class), var(f, feature)]),         (2)
        implies(has_feature(c, f), implies(true(c), true(f)))),
    forall([var(x, class), var(y, feature)],                  (3)
        equivalent(has_feature(x, y), has_property(x, y)))).
Limit = 1
Limit = 2
Limit = 3
Limit = 4
Yes
......................................................................................
FIGURE 5: Deriving the PSM-domain bridge ∀ x:class, y:feature (has-feature(x,y) ↔ has-property(x,y)) at
the fourth level.
Once a PSM has been selected, its assumptions need to be checked in the customer's
knowledge base. Because the signatures of the KB ontology and the PSM ontology are usually
different, we may need to find a bridge to make the required proof possible. Figure 5
illustrates the result of a successful proof for deriving a PSM-domain bridge. In the figure,
we ask to derive a PSM-domain bridge to link together the prune PSM and a KB for
apple classification. Formula (1) represents the PSM assumptions (same as Formula (4)
in Figure 4). We want to prove Formula (1) from the assumption of the KB (Formula
(2)) and some PSM-domain bridge (if needed). In our prototype, a PSM-domain bridge is
automatically constructed, based on an analysis of the respective signatures. This involves
pairwise comparison of the predicates used by the PSM and the KB that have the same
arity. A match is found if the respective predicate domains can be mapped onto each other.
The constructed bridge is added to the theory (Formula (3)) and then the theorem prover
tries to prove the PSM assumption from this theory (i.e., from the conjunction of Formulas
(2) and (3)), which succeeds at the fourth level of iteration (Figure 5). Note that, in
general, it is not possible to check every assumption automatically in the KB; some of them
just have to be believed true [6].
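The pairwise signature comparison described above can be sketched as follows. This is an informal reconstruction, not the prototype's actual code: signatures are modeled as predicate-name-to-argument-sort maps, and the sort mapping between PSM and KB ontologies is a hypothetical assumption, chosen here to mirror the prune/apple-classification example of Figure 5.

```python
def derive_bridges(psm_sig, kb_sig, sort_map):
    """Pair each PSM predicate with every KB predicate of the same
    arity whose argument sorts can be mapped onto each other; each
    pair yields a candidate bridge axiom (an equivalence)."""
    bridges = []
    for p_name, p_sorts in psm_sig.items():
        for k_name, k_sorts in kb_sig.items():
            if len(p_sorts) == len(k_sorts) and \
               tuple(sort_map.get(s, s) for s in p_sorts) == tuple(k_sorts):
                bridges.append((p_name, k_name))
    return bridges

# hypothetical signatures mirroring Figure 5
psm_sig = {"has_property": ("element", "property")}
kb_sig = {"has_feature": ("class", "feature")}
sort_map = {"element": "class", "property": "feature"}
assert derive_bridges(psm_sig, kb_sig, sort_map) == \
       [("has_property", "has_feature")]
```

Each returned pair corresponds to a bridge like Formula (3) in Figure 5 (has_feature ↔ has_property), which is then added to the theory before the proof attempt.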
Figure 3 shows the case for primitive PSMs. For composite PSMs, the following
happens. When a composite PSM has been found to match the task goal, its constituent
subtasks are considered as new goals for which PSMs need to be found. Thus, the broker
again consults the libraries for PSMs. This continues recursively until only primitive
PSMs are found.
FIGURE 6: The whole picture.
reasoner, they need to be put together. In the current version, we simply chain PSMs based
on common inputs and outputs, taking into account the types of the initial data and the
final solution. This means that we only deal with sequential control and not with iteration
and branching. We plan to extend this by considering the control knowledge specified in the
operational descriptions of composite PSMs (controlling the execution of their subtasks).
This knowledge needs to be kept track of during PSM selection, and can then be used to glue
the primitive PSMs together. The same type of control knowledge can be found explicitly in
existing task structures for modeling particular task-specific reasoning strategies [9, 2, 11].
Task structures include task/subtask relations along with control knowledge, and represent
knowledge-level descriptions of domain-independent problem solvers. If the collection of
PSMs selected by the broker matches an existing task structure (this can be a more or less
strict match), then we can retrieve the corresponding control structure and apply it.
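The simple input/output chaining described at the start of this paragraph can be sketched as below. The PSM names and data types are hypothetical illustrations; the current prototype's actual chaining logic is not published here.

```python
def chain_psms(psms, initial_type, goal_type):
    """Greedily chain primitive PSMs: start from the type of the
    initial data and repeatedly append a PSM whose input type equals
    the current output type, until the goal type is reached.
    Sequential control only -- no iteration or branching."""
    chain, current = [], initial_type
    while current != goal_type:
        step = next((name for name, (inp, out) in psms.items()
                     if inp == current), None)
        if step is None:
            return None  # no PSM consumes the current data type
        chain.append(step)
        current = psms[step][1]
    return chain

# hypothetical library: PSM name -> (input type, output type)
psms = {"abstract": ("observations", "findings"),
        "prune":    ("findings", "candidates"),
        "select":   ("candidates", "solution")}
assert chain_psms(psms, "observations", "solution") == \
       ["abstract", "prune", "select"]
```

A chain produced this way corresponds to the purely sequential problem-solver programs the broker currently emits; the planned extension would replace this greedy loop with the control knowledge from composite PSMs' operational descriptions.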
module ibrow
{
  typedef string atom;

  enum simple_value_type
  {
    int_type,
    float_type,
    atom_type
  };

  union simple_value switch (simple_value_type)
  {
    case int_type:   long int_value;
    case float_type: float float_value;
    case atom_type:  atom atom_value;
  };

  enum value_type
  {
    simple_type,
    compound_type,
    list_type
  };

  union value switch (value_type)
  {
    case simple_type:   simple_value simple_value_value;
    case compound_type: sequence<value> name_and_arguments;
    case list_type:     sequence<value> list_value;
  };

  interface psm
  {
    value solve(in value arg);
  };
};
FIGURE 7: The IDL description for list-like data structures.
the customer's knowledge base. Figure 6 situates the execution process in the context of
the overall process.
Since we use CORBA, we need to write an IDL in which we specify the data structures
through which the PSMs, the KB and the broker communicate [15]. Figure 7 shows the
IDL. In principle, this IDL can then be used to make interoperable PSMs written in any
language (as long as a mapping can be made from the language's internal data structures
to the IDL-defined data structures). In our current prototype, we experiment with Prolog
and Lisp, and our IDL provides definitions for list-like data structures with simple and
compound terms. This IDL is good for languages based on lists, but might not be the best
choice for including object-oriented languages such as Java. An IDL based on attribute-
value pairs might be an alternative. Figure 8 illustrates the role of IDL in the context of
heterogeneous programs and CORBA. Given the IDL, compilers generate language-specific
wrappers that translate statements that comply with the IDL into structures that go onto
the CORBA bus. The availability of such compilers 7 depends on the particular CORBA
version/implementation used (we used ILU and Orbix).
FIGURE 8: Prolog and Lisp programs connected to the CORBA bus through wrappers.
The last conversion to connect a particular language to the CORBA bus is performed
by a wrapper (see left wrappers in Figure 8) constructed by a participating partner (e.g.
a library provider of Prolog PSMs). This wrapper translates the internal data structures
used by the programmer of the PSM or KB (e.g. pure Prolog) into statements accepted by
the automatically generated wrapper (e.g. "IDL-ed" Prolog). Figure 9 shows an example
of a simple PSM written in Prolog and wrapped to IDL. Wrapping is done by the module
convert, which is imported into the PSM and activated by the predicates in_value and
out_value.
5 Architecture
In the context of CORBA, our PSMs are servers and the statements in the problem solver
(the program configured by the broker, see Figure 6) are the clients. This means that each
PSM is a separate server, the advantage being modularity. If we add a new PSM to the
library, and assuming that the PSMs run distributively at the library's site, then we can
easily add new PSMs, without side effects. The customer's KB is also a server, to which the
broker and the PSMs can send requests.
During the execution of the problem solver, the broker remains in charge of the overall
control. Execution means that when a statement in the problem solver program is called (a
client is activated), a request is sent out to the CORBA bus and picked up by the appropriate
7 The compiler for Prolog has been developed in-house.
:- module(prune, [
       psm_solve/3
   ]).
:- use_module(server(client)).
:- use_module(convert).

prune([], _, []) :- !.
prune(Classes, Features, Candidates) :-
    setof(Class,
          ( member(Class, Classes),
            forall(member(Feature, Features),
                   has_property(Class, Feature)) ),
          Candidates).
...........................................................
PSM - a server (through a unique naming service). Execution of the PSM may mean that
the PSM itself becomes a client, which sends requests to the customer's knowledge base (a
server). Once the PSM has finished running and has generated an output, this is sent back to
the broker program, which then continues with the next statement.
Another issue is that typically a library offers several PSMs. Our approach is that each
library needs a meta-server that knows which PSMs are available and that starts up their
corresponding servers when needed. The same meta-server is also used for making UPML
descriptions available to the broker. In our architecture, PSMs are considered objects with
two properties: (i) their UPML description and (ii) their executable code. The meta-servers
of the various libraries thus have a dual function: (i) provide UPML descriptions of their
constituent PSMs, and (ii) provide a handle to the appropriate PSM implementation.
Interaction with the broker takes place through a Web browser. We use a common
gateway interface to Prolog (called PLCGI 8) to connect the broker with the Web.
6 Conclusions
We presented an approach for brokering problem-solving knowledge on the Internet. We
argued that this implied solving two problems: (i) configuration of a problem solver from
individual problem-solving methods, and (ii) execution of the configured, possibly hetero-
geneous, problem solver. For the configuration problem, we developed a language to char-
acterize problem-solving methods (UPML), which can be considered as a proposal for a
standard product-description language for problem-solving components in the context of
electronic commerce. We assume that library providers of PSMs characterize their products
in UPML. Moreover, we also assume that the customer's knowledge base is either charac-
terized in UPML, or that a knowledgeable person is capable of answering all questions the
broker might ask concerning the fulfillment of PSM assumptions. For matching customers'
goals with competences of PSMs, we used a theorem prover, which worked out satisfactorily
for the experiments we did.
8 PLCGI is only for internal use.
Acknowledgment
This work is carried out in the context of the IBROW project 9 with support from the
European Union under contract number EP 27169.
References
[1] J. Angele, D. Fensel, D. Landes, S. Neubert, and R. Studer. Model-based and incremental
knowledge engineering: the MIKE approach. In J. Cuena, editor, Knowledge Oriented Software
Design, IFIP Transactions A-27, Amsterdam, 1993. Elsevier.
[2] Barros, L. Nunes de, J. Hendler, and V. R. Benjamins. Par-KAP: a knowledge acquisition
tool for building practical planning systems. In M. E. Pollack, editor, Proc. of the 15th IJCAI,
pages 1246-1251, Japan, 1997. International Joint Conference on Artificial Intelligence, Morgan
Kaufmann Publishers, Inc. Also published in Proceedings of the Ninth Dutch Conference on
Artificial Intelligence, NAIC'97, K. van Marcke, W. Daelemans (eds), University of Antwerp,
Belgium, pages 137-148.
[3] B. Beckert and J. Posegga. leanTAP: Lean tableau-based deduction. Journal of Automated
Reasoning, 15(3):339-358, 1995.
[4] V. R. Benjamins. Problem-solving methods for diagnosis and their role in knowledge acquisition.
International Journal of Expert Systems: Research and Applications, 8(2):93-120, 1995.
[5] V. R. Benjamins, D. Fensel, and R. Straatman. Assumptions of problem-solving methods and
their role in knowledge engineering. In W. Wahlster, editor, Proc. ECAI-96, pages 408-412. J.
Wiley & Sons, Ltd., 1996.
[6] V. R. Benjamins and C. Pierret-Golbreich. Assumptions of problem-solving methods. In
N. Shadbolt, K. O'Hara, and G. Schreiber, editors, Lecture Notes in Artificial Intelligence,
1076, 9th European Knowledge Acquisition Workshop, EKAW-96, pages 1-16, Berlin, 1996.
Springer-Verlag.
9 http://www.swi.psy.uva.nl/projects/IBROW3/home.html
[7] J. Breuker and W. van de Velde, editors. CommonKADS Library for Expertise Modeling. IOS
Press, Amsterdam, The Netherlands, 1994.
[8] B. Chandrasekaran. Design problem solving: A task analysis. AI Magazine, 11:59-71, 1990.
[9] B. Chandrasekaran, T. R. Johnson, and J. W. Smith. Task-structure analysis for knowledge
modeling. Communications of the ACM, 35(9):124-137, 1992.
[10] B. Chandrasekaran, J. R. Josephson, and V. R. Benjamins. The ontology of tasks and methods.
In B. R. Gaines and M. A. Musen, editors, Proceedings of the 11th Workshop on Knowledge
Acquisition, Modeling and Management, Banff, Alberta, Canada, 1998. SRDG Publications,
University of Calgary.
[11] D. Fensel and V. R. Benjamins. Key issues for automated problem-solving methods reuse. In
H. Prade, editor, Proc. of the 13th European Conference on Artificial Intelligence (ECAI-98),
pages 63-67. J. Wiley & Sons, Ltd., 1998.
[12] D. Fensel, V. R. Benjamins, S. Decker, M. Gaspari, R. Groenboom, W. Grosso, M. Musen,
E. Motta, E. Plaza, A. Th. Schreiber, R. Studer, and B. J. Wielinga. The component model
of UPML in a nutshell. In Proceedings of the First Working IFIP Conference on Software
Architecture (WICSA1), San Antonio, Texas, 1999.
[13] D. Fensel, S. Decker, M. Erdmann, and R. Studer. Ontobroker: The very high idea. In Proceed-
ings of the 11th International Flairs Conference (FLAIRS-98), Sanibel Island, Florida, 1998.
[14] D. Fensel, E. Motta, S. Decker, and Z. Zdrahal. Using ontologies for defining tasks, problem-
solving methods and their mappings. In E. Plaza and V. R. Benjamins, editors, Knowledge
Acquisition, Modeling and Management, pages 113-128. Springer-Verlag, 1997.
[15] J. H. Gennari, H. Cheng, R. Altman, and M. A. Musen. Reuse, corba, and knowledge-based
systems. International Journal of Human-Computer Studies, 49(4):523-546, 1998. Special issue
on Problem-Solving Methods.
[16] J. H. Gennari, S. W. Tu, T. E. Rothenfluh, and M. A. Musen. Mapping domains to methods in
support of reuse. International Journal of Human-Computer Studies, 41:399-424, 1994.
[17] T. R. Gruber. A translation approach to portable ontology specifications. Knowledge Acquisi-
tion, 5:199-220, 1993.
[18] N. Guarino. Formal ontology, conceptual analysis and knowledge representation. International
Journal of Human-Computer Studies, 43(5/6):625-640, 1995. Special issue on The Role of
Formal Ontology in the Information Technology.
[19] M. Ikeda, K. Seta, and R. Mizoguchi. Task ontology makes it easier to use authoring tools.
In Proc. of the 15th IJCAI, pages 342-347, Japan, 1997. International Joint Conference on
Artificial Intelligence, Morgan Kaufmann Publishers, Inc.
[20] E. Motta and Z. Zdrahal. A library of problem-solving components based on the integration
of the search paradigm with task and method ontologies. International Journal of Human-
Computer Studies, 49(4):437-470, 1998. Special issue on Problem-Solving Methods.
[21] M. A. Musen, J. H. Gennari, H. Eriksson, S. W. Tu, and A. R. Puerta. PROTEGE-II: Computer
support for development of intelligent systems from libraries of components. In Proceedings of
the Eighth World Congress on Medical Informatics (MEDINFO-95), pages 766-770, Vancouver,
B. C., 1995.
[22] R. Orfali, D. Harkey, and J. Edwards, editors. The Essential Distributed Objects Survival Guide.
John Wiley & Sons, New York, 1996.
[23] A. Puerta, S. W. Tu, and M. A. Musen. Modeling tasks with mechanisms. In Workshop on
Problem-Solving Methods, Stanford, July 1992. GMD, Germany.
[24] F. Puppe. Knowledge reuse among diagnostic problem-solving methods in the shell-kit D3.
International Journal of Human-Computer Studies, 49(4):627-649, 1998. Special issue on
Problem-Solving Methods.
[25] A. Th. Schreiber, B. J. Wielinga, and J. A. Breuker, editors. KADS: A Principled Approach
to Knowledge-Based System Development, volume 11 of Knowledge-Based Systems Book Series.
Academic Press, London, 1993.
[26] J. Schumann and B. Fischer. NORA/HAMMR: making deduction-based software component
retrieval practical. In 12th IEEE International Conference on Automated Software Engineering,
pages 246-254. IEEE Computer Society, 1997.
[27] N. Shadbolt, E. Motta, and A. Rouge. Constructing knowledge-based systems. IEEE Software,
10(6):34-39, November 1993.
[28] L. Steels. Components of expertise. AI Magazine, 11(2):28-49, Smmner 1990.
[29] A. ten Teije, F. van Harmelen, A. Th. Schreiber, and B. Wielinga. Construction of l)roblem-
solving methods as parametric design. International Journal of Human-Computer Studies,
49(4):363-389, 1998. Special issue on Problem-Solving Methods.
[30] M. Usehold and M. Gruninger. Ontologies: principles, methods, and applications. Knowledge
Engineering Review, 11(2):93-155, 1996.
[31] A. Valente and C. Liickenhoff. Organization as guidance: A library of assessment models. In
Proceedings of the Seventh European Knowledge Acquisition Workshop (EKAW'93), Lecture
Notes in Artificial Intelligence, LNCS 723, pages 243-262, 1993.
[32] G. van Heijst, A. T. Schreiber, and B. J. Wielinga. Using explicit ontologies in KBS develop-
ment. International Journal of Human-Computer Studies, 46(2/3):183-292, 1997.
[33] J. Wielemaker. SWI-Prolog ~.9: Reference Manual. SWI, University of Amsterdam,
Roetersstraat 15, 1018 WB Amsterdam, The Netherlands, 1997. E-mail: jan@swi.psy.uva.nl.
[34] A. M. Zaremski and J. M. Wing. Specification matching of software components. ACM Trans-
actions on Software Engineering and Methodology, 6(4):333-369, 1997.
A System for Facilitating and Enhancing Web Search
Abstract
We present a system that uses semantic methods and natural language processing capabilities in order to provide comprehensive and easy-to-use access to tourist information in the WWW. The system is designed such that as background knowledge and linguistic coverage increase, the benefits of the system improve, while it guarantees state-of-the-art information and database retrieval capabilities as its bottom line.
1 Introduction
Due to the vast amounts of information in the WWW, its users have more and more difficulties finding the information they are looking for among the many heterogeneous information resources. Therefore, methods for comfortable and intelligent access are in the primary focus of a number of research communities these days. Currently, syntactic methods of information retrieval prevail in realistic scenarios (cf., e.g., Ballerini et al. (1996)), such as in general search engines like AltaVista, but the limits inherent in these approaches often make finding the proper information a nuisance. At the other end of the methodological spectrum, semantic methods could provide just the right level for finding information, but they rely on explicitly annotated sources (cf., e.g., (Fensel et al., 1998)) or on complete and correct natural language understanding systems, neither of which can be expected in the near future.
Therefore our system, GETESS, uses the semantics of documents in the WWW, as far as it is provided explicitly or can be inferred by an incomplete natural language understanding system, but falls back on syntactic retrieval methods once the methods at the semantic level fail to fulfill their task. In particular, we consider an information finding system that, (i), has semantic knowledge for supporting the retrieval task, (ii), partially, but robustly, understands natural language, (iii), allows for several ways of interaction that appear natural to the human user, and, (iv), combines knowledge from unstructured and semi-structured documents with knowledge from relational database systems.
In our project, we decided to aim at an information system that provides information finding and filtering methods for a restricted domain, viz. for prospective tourists who may travel in a certain region and are looking for all kinds of information, such as housing, leisure activities, sights, etc. The information about all this cannot be found within a narrowly restricted format, neither in a single database nor in a single web site. Rather, the information agent must gather information that is stored on many different web servers, often in unstructured text, and even in some databases, such as a booking database of a hotel chain. In order to improve on common information retrieval systems, at least part of what is stated in the (HTML) texts must be made available semantically. However, since automatic text understanding is still far from perfect, we pursue a fail-soft approach that is based on extracting knowledge from text with a robust parser, but also integrates and falls back onto common information retrieval mechanisms when the more elaborate understanding component fails.
In the following, we outline the architecture of the GETESS system with its overall sharing of the work load. From this outline we will then motivate and describe some key issues of the major subsystems of GETESS.
2 Architecture
The front end of the GETESS system (cf. a depiction of its architecture in Figure 1) provides a user interface that is embedded in a dialogue system controlling the history of interactions (cf. Section 3). Single interactions are handed to the query processor that selects the corresponding analysis methods, viz. the natural language processing module (NLP system; also cf. Section 5) or the information retrieval and database query mechanisms (cf. Section 4). While the latter ones can be directly used as input to the search system, the natural language processing module first translates the natural language query into a corresponding database query before it sends this formal query to the search system.
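This routing step can be sketched as follows. The mode names, the toy translation function, and the query strings here are invented for illustration; they are not GETESS code, only a hedged picture of how formal and keyword queries pass straight through while natural language is translated first.

```python
def translate_nl(query: str) -> str:
    """Toy stand-in for the NLP module: map an NL question to a formal query."""
    if "hotel" in query.lower():
        return "SELECT * FROM hotels"
    return "SELECT * FROM documents"

def dispatch(query: str, mode: str) -> str:
    """Route a user interaction to the search system as a formal query."""
    if mode in ("formal", "keyword"):
        return query                 # already usable as search input
    elif mode == "natural":
        return translate_nl(query)   # NLP module translates first
    raise ValueError(f"unknown interaction mode: {mode}")

print(dispatch("SELECT * FROM hotels", "formal"))
print(dispatch("Which hotels are near the beach?", "natural"))
```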
In order to process queries and search for results, three kinds of resources are provided by the back end of the GETESS system. First, archived information is available in several content databases (the abstract DB, the index DB and the DB repository), the function of which is explained below. Second, the lexicon and the ontology provide metaknowledge about the queries, viz. about the grammatical status of words and their conceptual denotations. Third, a database incorporating dialogue sequences and user profiles gives control over dialogue interactions. While dialogue sequences and user profiles are acquired during the course of interactions and the metaknowledge is provided by the human modeller with the help of knowledge acquisition tools (KA Tools), the content databases must be filled automatically, since the contents of typical web sites change almost on a daily basis. For this task the gatherer searches regularly through relevant XML/HTML pages and specified databases in order to generate corresponding entries in the abstract database, the index database and the database repository.
The content in the abstract database is derived from a robust, though incomplete, natural language understanding module that parses documents and extracts semantic information, building a so-called "abstract" for each document. These abstracts are sets of facts, i.e. tuples, like hasChurch(Alicante, Church-1), that could be extracted from natural language text like "Alicante's major church was built during medieval times". The index generator builds access information for full text search with information retrieval methods, while the DB repository offers relevant views onto external databases.
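A minimal sketch of abstract construction, with a toy pattern-based extractor standing in for the actual robust parser (the regular expression and the fact naming below are invented assumptions):

```python
import re

def build_abstract(text):
    """Extract a set of (relation, subject, object) fact tuples from text."""
    facts = set()
    # Toy pattern: "<City>'s ... church" yields hasChurch(City, Church-n).
    for i, m in enumerate(
            re.finditer(r"(\w+)'s (?:\w+ )*church", text, re.IGNORECASE), 1):
        facts.add(("hasChurch", m.group(1), f"Church-{i}"))
    return facts

abstract = build_abstract("Alicante's major church was built during medieval times.")
print(abstract)  # {('hasChurch', 'Alicante', 'Church-1')}
```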
Subsequently, we will first introduce the front end, the dialogue system. The key issues here are concerned with facilitating user interaction at different levels of expertise (Section 3). At the back end of the system, the tools for gathering, database management and information retrieval provide the technical platform for efficiently updating and accessing the system's information repositories (Section 4). The natural language processing component in GETESS is employed by the dialogue system as well as by the back end in order to understand natural language queries and extract information from natural language texts, respectively, and, thus, enhance the quality of the web search (Section 5). Finally, we outline the function of the ontology that constitutes the "glue" of the system at the semantic level (Section 6).
3 Dialogue System
The dialogue system constitutes the interface between the human user and the data in the repositories of the GETESS system. In order to facilitate the user's task of finding the information he is looking for, users should be able to express queries conveniently at their level of expertise. This means the system should allow for intuitive interaction by natural language queries as well as for formal queries that may be the preferred mode of interaction for a human expert user. Independent of the concrete mode of interaction, the system should react quickly and accurately while using the capabilities of the different modes of interaction.
In the GETESS system we allow for four types of interaction, viz. natural language, graphical interface, keyword search and formal database query. Since the methods for natural language processing as well as for keyword search and formal database queries form major components of the system, their description has been delegated to subsequent sections (Sections 5 and 4, respectively).
Thus, this section serves the following three goals: First, it describes how reasoning about user interactions may support the user's goal of quickly finding the appropriate information. Second, it sketches how single interactions are treated as elements of a complex dialogue. Hence, the user does not have to start from scratch every time he initiates a new query, but can instead refer to his previous queries, e.g. by requests like "Show me information related to this matter.", where "this matter" relates to the last query. Finally, we give a glimpse of the use of the graphical interface.
The Knowledge Base of the Dialogue System. Knowledge is crucial for all modes of interaction, because we want the system to give appropriate responses to the user when problems arise. For instance, when a query results in an abundance of hits, the system must reason about why this problem might have occurred and how it might be solved. Knowledge that allows for this type of reasoning is encoded in the knowledge base of the dialogue system (KBD).
The KBD includes all the definitions available in the ontology (cf. Section 6). These definitions help in explaining to the user why a query was too unspecific, or give him hints how he might rephrase the query such that he gets the information he is looking for. For example, if the user seeks information on the local offers of "entertainment", the hit rate for a database query can be reduced by the choice of one of the refined search terms "music events", "theater events" and "sport events". Vice versa, a too specific choice like "folk music event" might result in no hits, but the hint towards more general search terms like "folk culture presentation" might bring up an event that also includes live demonstrations of "folk music". Further help is also provided through important terminological links such as synonyms, homonyms, antonyms and terms that may be parts of other terms; e.g. the show of a magician may be part of a circus show and, hence, the circus show might be a viable entertainment alternative to a magician's show.
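The refinement and generalization hints described above can be sketched over a toy subsumption hierarchy. The terms, the hit-count threshold, and the function names are illustrative assumptions, not the actual KBD:

```python
# Invented miniature hierarchy: parent term -> narrower terms.
NARROWER = {
    "entertainment": ["music event", "theater event", "sport event"],
    "music event": ["folk music event", "classical concert"],
}
BROADER = {child: parent for parent, children in NARROWER.items()
           for child in children}

def suggest(term, hit_count, too_many=100):
    """Hint at narrower terms on too many hits, broader terms on none."""
    if hit_count == 0:
        return ("generalize", [BROADER[term]] if term in BROADER else [])
    if hit_count > too_many:
        return ("refine", NARROWER.get(term, []))
    return ("ok", [])

print(suggest("entertainment", 5000))   # too many hits: refine
print(suggest("folk music event", 0))   # no hits: generalize
```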
In addition to the definitions of the ontology, the KBD features definitions and rules about dialogue concepts. At the moment, this part is tuned to map different interactions onto common requests to the database.¹ For example, the user input "I am looking for the station", which may also be supplemented by restrictions from the graphical interface, has the same meaning, i.e. it constitutes the same speech act, as "Where is the station?". Therefore, both inputs must be mapped onto the same query to the database.

¹In the linguistic literature this mapping is defined by the way natural language propositions, requests or questions can be considered as so-called speech acts (Austin, 1962).
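A hedged sketch of this mapping: the matching rules and the target query below are invented, but they illustrate how two surface forms of one speech act collapse onto a single database query, with a full-text fallback for everything else.

```python
def normalize(utterance: str) -> str:
    """Map utterances that express the same speech act to one formal query."""
    u = utterance.lower().rstrip("?.! ")
    # Both a declarative request and a wh-question about the station
    # express the same information need.
    if "station" in u and ("looking for" in u or u.startswith("where is")):
        return "SELECT location FROM places WHERE type = 'station'"
    return "FULLTEXT_SEARCH(" + repr(utterance) + ")"

q1 = normalize("I am looking for the station")
q2 = normalize("Where is the station?")
print(q1 == q2)  # True: both map to the same database query
```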
Both types of knowledge provide user support that reduces the number of inquiries the user
has to pose to the system and, hence, accelerates the dialogue compared to common keyword
retrieval interactions.
Complex Dialogues. As indicated above, information finding rarely produces an instantaneous hit after the user has formulated just a single query. This is true for syntactic methods, and it will improve only to a limited extent with semantic methods. However, we believe that when a user's sequence of interactions is perceived as being executed in order to achieve a goal, then the task of finding the proper information can be substantially facilitated. For this purpose, we provide a query processor that analyses not only the single interactions, but also views them as being embedded into a more global structure.
The methodology we use is based on work done by Ahrenberg et al. (1996), who structure the dialogue hierarchically into segments that are opened by a request and closed by the appropriate answer. The assumption in our scenario is that users typically have a request for a certain piece of information and give related information in order to succeed. For example, they give a topic², which here boils down to a type restriction, like "sightseeing tour", and temporal information when they want to take part in a sightseeing tour during a particular time frame. The task of the dialogue system lies in zooming in or out on relevant information according to the interaction initiated by the user. For example, two user interactions³ like, (i), "Show me all theater events.", and, (ii), "No, just the ones in August.", return a large set of documents first (with feedback such as described in the previous subsection), but a much smaller set of data after the second interaction has narrowed down the focus.
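The zooming behaviour can be sketched as a running set of constraints, where each interaction narrows the current focus instead of starting a fresh query. The event data and field names below are invented for illustration:

```python
events = [
    {"type": "theater", "month": "August"},
    {"type": "theater", "month": "May"},
    {"type": "concert", "month": "August"},
]

class Dialogue:
    """Keeps the accumulated constraints of the current dialogue segment."""
    def __init__(self, data):
        self.data, self.filters = data, {}
    def ask(self, **constraints):
        self.filters.update(constraints)   # narrow the current focus
        return [e for e in self.data
                if all(e.get(k) == v for k, v in self.filters.items())]

d = Dialogue(events)
print(len(d.ask(type="theater")))   # "Show me all theater events."  -> 2
print(len(d.ask(month="August")))   # "No, just the ones in August." -> 1
```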
Hence, we here identify dialogue segments, interactions and topics as the major parameters (though not the only ones) that the dialogue system keeps track of. In this way the user's single interactions may all contribute towards the common information finding goal and, thus, facilitate the human-computer interaction.
The Graphical Interface. Besides the natural language query capabilities and the possibility of directly composing a formal query, the GETESS system features a graphical interface. This interface constitutes an intermediate level of access to the system between the most professional (and fastest) one, viz. the formal query, and the most intuitive one (that requires a somewhat more elaborate interaction), viz. the natural language access. The graphical interface does not require the user to learn the syntax of a particular query language or the concepts that are available in the ontology, but expects some basic understanding of formal systems from the user. This interface visualizes the ontology in a manner suited for selecting appropriate classes and attributes and, thus, allows the assembly of a formal query through simple mouse clicks. For this purpose, the ontology is visualized by a technique based on hyperbolic geometry (Lamping & Rao, 1996): classes in the center of the visualization are represented with a big circle, surrounding classes are represented with smaller circles. This technique allows fast navigation to distant classes and a clear illustration of each class and its neighboring concepts.
²The "topic" in a dialogue corresponds to what the dialogue is about. Usually, it is given only implicitly in natural language statements. In our setting it may also be given explicitly, e.g. through the graphical interface.
³Each interaction corresponds loosely to a speech act as introduced above, but might also be an act in the graphical user interface. Examples are update (users provide information to the system), question, answer, assertion or directive.
The back end of GETESS employs a typical gatherer-broker structure, viz. a Harvest search service (Bowman et al., 1995) with a database interface. Though we use the tools provided by another project, SWING (Heyer et al., 1997), the setting of GETESS puts additional demands on the gatherer-broker system: (i), the GETESS search engine has to work with facts contained in the abstracts, (ii), ontology knowledge must be integrated into the process of analysing internet information as well as answering user queries, (iii), internet information can be of different types (e.g., HTML, XML texts), and, (iv), data collections such as information stored in databases must also be accessible via the GETESS search engine. These different requirements must be met both during the main process of gathering data and during the querying process (broker).
The Gatherer process. Periodically, internet information is analysed via internet agents in order to build a search index for the GETESS search engine. Information (e.g. HTML texts, Postscript files, ...) is checked to find keywords. Additionally, the GETESS gatherer has to build abstracts from this information. The two kinds of index data ('simple' keywords and abstracts) are stored in databases.
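The keyword half of this index can be pictured as a simple inverted index (word to document mapping). The tokenization and the sample documents are invented; the actual gatherer works on full web pages:

```python
from collections import defaultdict

def build_keyword_index(docs):
    """Map each lowercased word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word.strip(".,")].add(doc_id)
    return index

docs = {"d1": "Alicante has a medieval church.", "d2": "Hotels in Alicante."}
idx = build_keyword_index(docs)
print(sorted(idx["alicante"]))  # ['d1', 'd2']
```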
The Broker process. As indicated above, the dialogue system maps the user's queries (with the help of the natural language processing module and the definitions in the ontology) onto formal or keyword queries in IRQL, the Information Retrieval Query Language. The IRQL language combines different kinds of queries, both database and information retrieval queries, thus providing access to the index data. The query result set is ranked with a user-centered ranking function. The ranked result set is then presented to the user via the dialogue system.
The integration of different types of information (full text, abstracts, relational database facts) during gathering and querying has raised, and still raises, demands for research; at the same time, however, it opens up new possibilities of posing queries, because:
2. Searching for particular integer values, for instance prices or distances, is nearly impossible with a conventional information retrieval approach. In GETESS, it will be possible to compare integer and real values, e.g. to search for all prices that fall below a threshold. In addition, one may also determine minimum, maximum and average values as well as sort and group results by particular values.
3. Database functionality brings up answers from the abstract databases that are composed of different abstracts. That means, for answering a query of a user we may refer to and exploit facts derived from different websites. Thereby, it is not even necessary that these websites are connected by links; all the algebraic operations given through database functionality can be employed in order to deduce information. For instance, today's cinema events may be announced on one web site, while the corresponding reviews are found on another one. Database technology allows for retrieving all movies that are shown today and that received a good rating in the corresponding review.
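A minimal sketch of such a cross-site join, with invented data standing in for facts gathered from two unlinked websites; the threshold comparison also illustrates the value comparisons mentioned in the previous item:

```python
# Facts from two independent, unlinked sites (data invented for illustration).
showings = {("Cinema-1", "Movie-A"), ("Cinema-2", "Movie-B")}  # cinema site
reviews  = {"Movie-A": 8.5, "Movie-B": 4.0}                    # review site

# Relational join plus a threshold filter: movies shown today with a good rating.
good_today = sorted(
    movie for (_cinema, movie) in showings
    if reviews.get(movie, 0) >= 7.0
)
print(good_today)  # ['Movie-A']
```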
6 Ontology
As already mentioned, the gathering, use and querying of information with syntactic methods is very limited and in many cases not successful. A semantic reference model, an ontology, which structures the content and describes relationships between parts of the content, helps to overcome these limitations. With the ontology in GETESS, we aim at two major purposes: First, it offers inference facilities that are exploited by the other modules; as, e.g., described in Section 3, the dialogue module may ask for the types a particular instance belongs to in order to present alternative query options to the user. Second, the ontology acts as a mediator between the different modules. This latter role is explained here in more detail, since it illustrates how ontological design influences the working of the GETESS system and, in particular, the extraction of facts from natural language texts.
The text processing (cf. Section 5) of natural language documents and queries delivers syntactic relations between words and phrases. Whether and how such a syntactic relation can be translated into a meaningful semantic relation depends on how the tourism domain is conceptualized in the ontology. For example (cf. Fig. 2), the natural language processing system finds syntactic relations between the words "church" and "Alicante" in the phrase "Alicante's main church". The word "Alicante" refers to Alicante, which is known as an instance of the class city in the database. The database refers to the ontology for the description of the classes city and church. Querying the ontology for semantic relations between church and city results in hasBuilding and hasChurch. Both relations are inherited from the class location to the class city. Since hasChurch is the more specific one, a corresponding entry between Alicante and church is added to the abstract, i.e. the set of extracted facts, of the currently processed document in the abstract database.
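The lookup described in this paragraph can be sketched over a miniature ontology. The class and relation tables below are invented stand-ins for the real ontology, but they reproduce the steps: collect candidate relations (including those inherited from location), then keep the most specific one:

```python
SUBCLASS = {"city": "location", "church": "building"}   # class -> superclass
RELATIONS = {  # (domain, range) -> relation name
    ("location", "building"): "hasBuilding",
    ("location", "church"): "hasChurch",
}
SUBREL = {"hasChurch": "hasBuilding"}  # hasChurch specializes hasBuilding

def ancestors(cls):
    """The class itself plus all of its superclasses."""
    chain = [cls]
    while chain[-1] in SUBCLASS:
        chain.append(SUBCLASS[chain[-1]])
    return chain

def semantic_relations(domain, rng):
    """All relations applicable between two classes, inheritance included."""
    return [rel for (d, r), rel in RELATIONS.items()
            if d in ancestors(domain) and r in ancestors(rng)]

def most_specific(rels):
    """Drop any relation that another candidate specializes."""
    parents = {SUBREL.get(r) for r in rels}
    return [r for r in rels if r not in parents]

cands = semantic_relations("city", "church")
print(sorted(cands))         # ['hasBuilding', 'hasChurch']
print(most_specific(cands))  # ['hasChurch']
```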
This example shows that the design of the ontology determines the facts which may be extracted from texts, the database schema that must be used to store these facts and, thus of course, what information is made available at the semantic level. Hence, the ontology might constitute an engineering bottleneck. However, we try to overcome this problem by using the linguistic and statistical analyses of the text processing component for indicating frequent, though unmodelled, concepts and relations to the knowledge engineer.

Figure 2: Interaction of Ontology with NLP system and Database Organization
7 Related Work
The GETESS project builds on and extends a lot of earlier work in various domains. In the natural language community, research like (Grosz et al., 1987; Wahlster et al., 1978) fostered the use of natural language applications to databases, though these applications never reached the high precision and generality required in order to access typical databases, e.g. for accounting. Here, our approach seems better suited, since the imponderabilities of general natural language understanding are counterbalanced by information retrieval facilities and an accompanying graphical interface.

Only few researchers, e.g. Hahn et al. (1999), have elaborated on the interaction between natural language understanding and the corresponding use of ontologies. We think this to be an important point, since underlying ontologies cannot only be used as submodules of text understanding systems, but can also be employed for a more direct access to the knowledge base and for providing an intermediate layer between text representation and external databases, an interesting topic that has not been raised so far to the best of our knowledge.
As far as such querying of conceptual structures is concerned, we agree with McGuinness & Patel-Schneider (1998) that usability issues play a vital role in determining whether a semantic layer can be made available to the user and, hence, we elaborated on this topic early on (Fensel et al., 1998). We thereby keep in mind that regular users may find lengthy natural language questions too troublesome to deal with and, therefore, prefer an interface that allows fast access, but which is still more comfortable than any formal query language.
Projects that compare directly to GETESS are, e.g., Paradime (Neumann et al., 1997), MULINEX (Capstick et al., 1998) and MIETTA (Buitelaar et al., 1998). However, none of these projects combines information extraction with similarly rich interactions at the semantic layer. Hence, to the best of our knowledge we are the only project integrating unstructured, semi-structured and highly-structured data with a variety of easy-to-use facilities for human-computer interaction.
8 Conclusion
In the project GETESS (GErman Text Exploitation and Search System) we decided to build an intelligent information finder that relies on current techniques for information retrieval and database querying as its bottom line. The support for finding information is enhanced through an additional semantic layer that is based on ontological engineering and on a partial text understanding tool.

In order to facilitate web search, the dialogue is considered a complex entity. The analysis of sequences of interactions allows for refining, rephrasing or refocusing succeeding queries, and thus eliminates the burden of starting from scratch with every single interaction. Thereby, several modes of interaction are possible; besides keyword search and SQL queries, one can mix natural language queries with clicking in the graphical query interface.
Having built the single modules for our system, the next task in the GETESS project is bringing these components together. Given the design methodology of achieving entry-level features first and then working towards "the high ceiling" (viz. complete text understanding and representation), we expect benefits on the parts of economic and research interests early in the project. The system is general enough to be applied to many realistic scenarios, e.g. as an intelligent interface to a company's intranet, even though it is still far from offering a general solution for the most general information finding problems in the WWW. Further research will have to show an evaluation of how a user's performance in finding particular pieces of information loses or (hopefully) gains from using this information agent.
References
Ahrenberg, L., Dahlbäck, N., Jönsson, A., & Thure, A. (1996). Customizing interaction for natural language interfaces. Computer and Information Science, 1(1).
Austin, J. (1962). How to Do Things with Words. Oxford University Press.
Ballerini, J., Büchel, M., Knaus, D., Mateev, B., Mittendorf, M., Schäuble, P., Sheridan, P., & Wechsler, M. (1996). SPIDER retrieval system at TREC-5. In Proc. of TREC-5, Gaithersburg, Maryland, November 20-22, 1996. http://www-nlpir.nist.gov/pubs.
Bowman, C., Danzig, P., Hardy, R., Manber, U., & Schwartz, M. (1995). The Harvest information discovery and access system. Networks and ISDN Systems. http://ftp.cs.colorado.edu/pub/cs/techreports/schwartz/Harvest.Conf.ps.Z.
Buitelaar, P., Netter, K., & Xu, F. (1998). Integrating different strategies for cross-language information retrieval in the MIETTA project. In Hiemstra, D., de Jong, F., & Netter, K. (Eds.), Language Technology in Multimedia Information Retrieval: Proceedings of the 14th Twente Workshop on Language Technology, TWLT 14, pages 9-17. Universiteit Twente, Enschede.
Capstick, J., Diagne, A. K., Erbach, G., Uszkoreit, H., Cagno, F., Gadaleta, G., Hernandez, J. A., Korte, R., Leisenberg, A., Leisenberg, M., & Christ, O. (1998). MULINEX: Multilingual web search and navigation. In Proceedings of Natural Language Processing and Industrial Applications.
Fensel, D., Decker, S., Erdmann, M., & Studer, R. (1998). Ontobroker: The very high idea. In FLAIRS-98: Proceedings of the 11th International FLAIRS Conference, Sanibel Island, Florida, May 1998.
Grosz, B., Appelt, D., Martin, P., & Pereira, F. (1987). TEAM: An experiment in the design of transportable natural-language interfaces. Artificial Intelligence, 32(2):173-243.
Hahn, U., Romacker, M., & Schulz, S. (1999). How knowledge drives understanding: Matching medical ontologies with the needs of medical language processing. AI in Medicine, 15(1):25-51.
Heyer, A., Meyer, H., Düsterhöft, A., & Langer, U. (1997). SWING: Der Anfrage- und Suchdienst des Regionalen Informationssystems MV-Info. In Tagungsband IuK-Tage Mecklenburg-Vorpommern, Schwerin, 27./28. Juni 1997.
Lamping, J. & Rao, R. (1996). The hyperbolic browser: A focus + context technique for visualizing large hierarchies. Journal of Visual Languages & Computing, 7.
McGuinness, D. & Patel-Schneider, P. (1998). Usability issues in knowledge representation systems. In Proc. of AAAI-98, pages 608-614.
Neumann, G., Backofen, R., Baur, J., Becker, M., & Braun, C. (1997). An information extraction core system for real world German text processing. In 5th International Conference on Applied Natural Language Processing, pages 208-215, Washington, USA.
Neumann, G. & Mazzini, G. (1998). Domain-adaptive information extraction. Technical report, DFKI, Saarbrücken.
Wahlster, W., Jameson, A., & Hoeppner, W. (1978). Glancing, referring and explaining in the dialogue system HAM-RPM. American Journal of Computational Linguistics, pages 53-67.
Applying Ontology to the Web: A Case Study
Abstract
This paper describes the use of Simple HTML Ontology Extensions (SHOE) in a real world internet application. SHOE allows authors to add semantic content to web pages and to relate this content to common ontologies that provide contextual information about the domain. Using this information, query systems can provide more accurate responses than are possible with the search engines available on the Web. We have applied these techniques to the domain of Transmissible Spongiform Encephalopathies (TSEs), a class of diseases that includes "Mad Cow Disease". We discuss our experiences and provide lessons learned from the process.
1. Introduction
The "Mad Cow Disease" epidemic in Great Britain and the apparent link to Creutzfeldt-
Jakob disease (CJD) in humans generated an international interest in these diseases.
Bovine Spongiform Encephalopathy (BSE), the technical name for "Mad Cow Disease",
and CJD are both Transmissible Spongiform Encephalopathies (TSEs), brain diseases
that cause sponge-like abnormalities in brain cells. Concern about the risks of BSE to
humans continues to spawn a number of websites on the topic; some of these sites
provide valuable information, while others are simply sources of rumors. The reliable
sites range in content from epidemiology of the diseases, to scientific studies on
inactivation, to regulations by various agencies. It is difficult for users to locate relevant
information with the standard web search engines because these tools match on individual
words instead of their meanings. As such, they cannot take the relationship between
words into account, map between the terminology of different communities, or use any
contextual information to differentiate between terms with many meanings.
The Joint Institute for Food Safety and Nutrition (JIFSAN), a partnership between
the Food and Drug Administration (FDA) and the University of Maryland, is attempting
to rectify this situation. They wish to provide a clearinghouse for information on TSEs.
This site must be able to serve a diverse group of users, including the general public,
researchers, risk assessors, and policy makers. However, the diversity of data, the
constant appearance of new information, and the distribution of ownership make it
difficult to manually maintain an accurate index. Additionally, the nature of the target
user community means the retrieval tools must be able to respond to general queries and
very specialized queries with the appropriate level of detail to inform the user.
We have built a suite of tools to address these problems, with the basis for these
tools being an internet compatible knowledge representation language called Simple
HTML Ontology Extensions (SHOE). The underlying philosophy of SHOE is that
intelligent agents will be able to better perform tasks on the Internet if the most useful
information on web pages is provided in a structured manner. To this end, SHOE extends
HTML with a set of knowledge oriented tags that, unlike HTML tags, provide structure
for knowledge acquisition as opposed to information presentation. In addition to
providing explicit knowledge, SHOE sanctions the discovery of implicit knowledge
through the use of taxonomies and inference rules available in reusable ontologies that are
referenced by SHOE web pages. This allows information providers to encode only the
necessary information on their web pages, and to use the level of detail that is appropriate
to the context. SHOE-enabled web tools can then process this information in novel ways
to provide more intelligent access to the information on the Internet.
This paper describes the first application of SHOE to a large-scale, real world
domain. In Section 2, we lay out the architecture of the system and detail the efforts to put
each piece in place. Section 3 discusses what we have learned from the process. Sections
4 and 5 discuss related and future work, respectively. Finally, Section 6 presents our
conclusions.
The following subsections describe how we created our ontology, how SHOE tags were
added to web pages, how new SHOE information is discovered, and how users access
information that is relevant to them.
<BODY>
<ONTOLOGY ID="TSE-Ontology" VERSION="1.0">
<USE-ONTOLOGY ID="Base-Ontology" VERSION="1.0" PREFIX="base">
<RELATION NAME="hasInput">
   <ARG POS="1" TYPE="Process">
   <ARG POS="2" TYPE="Material">
</RELATION>
<RELATION NAME="hasOutput">
   <ARG POS="1" TYPE="Process">
   <ARG POS="2" TYPE="Material">
</RELATION>
</ONTOLOGY>
</BODY>
Note that the motivation for web ontologies is slightly different from that of traditional
ontologies. People rarely query the web searching for abstract concepts or similarities
between very disparate concepts, and as such, complex upper ontologies are not
necessary. Since most pages with SHOE annotations will tend to have tags that categorize
the concepts, there is no need for complex inference rules to perform automatic
classification. In many cases, rules that identify the symmetric, inverse, and transitive
relationships will provide sufficient inference.
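A minimal sketch of this style of inference, assuming hypothetical relation names and facts (the TSE ontology's actual rule syntax is not shown in the paper): forward-chaining symmetric, inverse, and transitive rules over a set of (subject, relation, object) triples until no new facts appear.

```python
# Hypothetical rule declarations; only the three rule types named in the text.
SYMMETRIC = {"adjacentTo"}
INVERSES = {"hasInput": "isInputOf", "hasOutput": "isOutputOf"}
TRANSITIVE = {"partOf"}

def infer(facts):
    """Saturate `facts` under the symmetric, inverse, and transitive rules."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for (s, r, o) in list(facts):
            new = set()
            if r in SYMMETRIC:
                new.add((o, r, s))
            if r in INVERSES:
                new.add((o, INVERSES[r], s))
            if r in TRANSITIVE:
                # chain r(s, o) with r(o, x) to get r(s, x)
                for (s2, r2, o2) in list(facts):
                    if r2 == r and s2 == o:
                        new.add((s, r, o2))
            fresh = new - facts
            if fresh:
                facts |= fresh
                changed = True
    return facts

facts = infer({("offal", "partOf", "carcass"), ("carcass", "partOf", "cow"),
               ("rendering", "hasInput", "offal")})
assert ("offal", "partOf", "cow") in facts          # transitivity
assert ("offal", "isInputOf", "rendering") in facts  # inverse relation
```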
The initial TSE ontology was fleshed out in a series of meetings that included
members of the FDA and the Maryland Veterinarian School. Since one of the key goals
was to help risk assessors gather information, the ontology focused on the three main
concerns for TSE Risks: source material, processing, and end-product use. Source
materials are described using the concepts of Animal, Tissue, and DiseaseAgent.
Processing focused on the types of Processes, and relations to describe inputs, outputs,
duration, etc. Finally, end-product use categorized the types of Products and dealt with
the RouteOfExposure. We also defined a number of general concepts such as People,
Organizations, Events, and Locations.
Currently, the ontology has 73 categories and 88 relations. It is stored as a file on
a web server with an HTML section that presents a human-readable description and a
machine-readable section with SHOE syntax. In this way, the file can serve the purpose
of educating users in addition to being understandable to machines.
2.2 Annotation
Annotation is the process of adding SHOE semantic markup to a web page. A SHOE web
page describes one or more instances, each representing an entity or concept. An instance
is uniquely identified by a key, which is usually formed from the URL of the web page.
The description of an instance consists of ontologies that it references, categories that
classify it, and relations that describe it. A sample instance is shown in Figure 2.
Determining what concepts in a page to annotate can be complicated. First, if the
document represents or describes a real world object, then an instance whose key is the
<HTML>
<BODY>
<INSTANCE KEY="http://www.cs.umd.edu/projects/plus/SHOE/tse/rendering.html">
<USE-ONTOLOGY ID="TSE-Ontology" VERSION="1.0" PREFIX="tse"
   URL="http://www.cs.umd.edu/projects/plus/SHOE/tse/tseont.html">
<CATEGORY NAME="tse.Process">
<RELATION NAME="tse.name">
   <ARG POS="TO" VALUE="Rendering">
</RELATION>
<RELATION NAME="tse.hasInput">
   <ARG POS="TO" VALUE="http://www.cs.umd.edu/projects/plus/SHOE/tse/offal.html">
</RELATION>
<RELATION NAME="tse.hasInput">
   <ARG POS="TO" VALUE="http://www.cs.umd.edu/projects/plus/SHOE/tse/bones.html">
</RELATION>
<RELATION NAME="tse.hasOutput">
   <ARG POS="TO" VALUE="http://www.cs.umd.edu/projects/plus/SHOE/tse/mbm.html">
</RELATION>
<RELATION NAME="tse.hasOutput">
   <ARG POS="TO" VALUE="http://www.cs.umd.edu/projects/plus/SHOE/tse/tallow.html">
</RELATION>
<RELATION NAME="tse.hasOutput">
   <ARG POS="TO" VALUE="http://www.cs.umd.edu/projects/plus/SHOE/tse/gelatin.html">
</RELATION>
</INSTANCE>
</BODY>
</HTML>
document's URL should be created. Second, hyperlinks are often signs that there is some
relation between the object in the document and another object represented by the
hyperlinked URL. If a hyperlinked document does not have SHOE annotations, it may
also be useful to make claims about its object. Third, one can create an instance for every
proper noun, although in large documents this may be excessive. If these concepts have a
web presence, then that URL should be used as the key, otherwise, unique keys can be
created by appending a "#" and a unique string to the end of the document's URL.
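For illustration, the key convention described above can be sketched as follows; the helper function and URLs are ours, SHOE itself only prescribes the URL-plus-fragment rule.

```python
def instance_key(doc_url, fragment=None):
    """Derive a SHOE instance key: the document's own URL for the page's
    main object, or URL + '#' + a unique string for a concept that has no
    web presence of its own (helper name is illustrative, not part of SHOE)."""
    return doc_url if fragment is None else f"{doc_url}#{fragment}"

# The page's main object keys on the page URL itself:
assert instance_key("http://example.org/rendering.html") == \
    "http://example.org/rendering.html"
# A proper noun without its own page gets a fragment suffix:
assert instance_key("http://example.org/report.html", "bse") == \
    "http://example.org/report.html#bse"
```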
Since manually annotating a page can be time consuming and prone to error, we
have developed the Knowledge Annotator, a tool that makes it easy to add SHOE
knowledge to web pages by making selections and filling in forms. As can be seen in
Figure 3, the tool has an interface that displays instances, ontologies, and claims. Users
can add, edit or remove any of these objects. When creating a new object, users are
prompted for the necessary information. In the case of claims, a user can choose the
source ontology from a list, and then choose categories or relations from a corresponding
list. The available relations will automatically filter based upon whether the instances
entered can fill the argument positions. A variety of methods can be used to view the
knowledge in the document. These include a view of the source HTML, a logical notation
view, and a view that organizes claims by subject and describes them using simple
English. In addition to prompting the user for inputs, the tool performs error checking to
ensure correctness¹ and converts the inputs into legal SHOE syntax. For these reasons,
only a rudimentary understanding of SHOE is necessary to markup web pages.
We selected pages to annotate with two goals in mind: provide information on the
processing of animal-based products and provide access to existing documents related to
TSEs. We were unable to locate web pages relevant to the first goal, and therefore had to
create a set of pages describing many important source materials, processes and products.
To achieve the second goal we selected relevant pages from sites provided by the FDA,
United States Department of Agriculture (USDA), the World Health Organization and
¹ Here correctness is with respect to SHOE's syntax and semantics. The Knowledge Annotator cannot verify whether
the user's inputs properly describe the page.
them². We chose Parka (Evett, Andersen, and Hendler 1993; Stoffel, Taylor, and Hendler
1997) as our knowledge base because evaluations have shown it to be very scalable, there
is an n-ary version, and parallel processing can be used to improve query execution time.
Since we were not interested in performing complex inferences on the data at the time,
the fact that Parka's only inference mechanism is inheritance was of no consequence.
An important aspect of the Internet is that its distributed nature means that all
information discovered must be treated as claims rather than facts. Parka, as well as most
other knowledge base systems, does not provide a mechanism for attaching sources to
assertions or facilities for treating these assertions as claims. To represent such
information, one must create an extra layer of structure using the existing representation.
Parka uses categories, instances and n-ary predicates to represent the world. A natural
representation of SHOE information would be to treat each declaration of a SHOE
relation as an assertion where the relation name is the predicate, and each category
declaration as an assertion where instanceof is the predicate. To represent the source of
the information, we could add an extra term to each predicate. Thus, an n-ary predicate
would become an (n+l)-ary predicate. However, the structural links (i.e., isa and
instanceof) are default binary predicates in Parka. Thus, this approach could not be used
without changing the internal workings of the knowledge base. We opted for a simpler
approach, and instead made two assertions for each claim. The first assertion ignores the
claimant, and can be used normally in Parka. The second assertion uses a claims predicate
to link the source to the first assertion. When the source of information is important, it
can be retrieved through the claims predicate. Although this results in twice as many
assertions being made to the knowledge base, it preserves classification while keeping
queries straightforward.
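The dual-assertion scheme can be sketched as follows; this is a toy in-memory store for illustration, not Parka's actual API.

```python
class ClaimStore:
    """Toy store illustrating the two-assertions-per-claim scheme."""
    def __init__(self):
        self.assertions = []   # plain assertions, usable as ordinary facts
        self.claims = []       # (source, assertion_id) links via a "claims" predicate

    def assert_claim(self, source, predicate, *args):
        assertion = (predicate, args)
        self.assertions.append(assertion)            # first assertion: ignores the claimant
        assertion_id = len(self.assertions) - 1
        self.claims.append((source, assertion_id))   # second: links source to the assertion
        return assertion_id

    def query(self, predicate):
        """Ordinary query over the claimant-free assertions."""
        return [a for a in self.assertions if a[0] == predicate]

    def sources_of(self, assertion_id):
        """Retrieve claimants only when provenance matters."""
        return [s for (s, i) in self.claims if i == assertion_id]

kb = ClaimStore()
aid = kb.assert_claim("http://example.org/rendering.html",
                      "hasInput", "rendering", "offal")
assert kb.query("hasInput") == [("hasInput", ("rendering", "offal"))]
assert kb.sources_of(aid) == ["http://example.org/rendering.html"]
```

The design trade-off is visible here: every claim costs two entries, but ordinary queries never pay for the provenance layer.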
As designed, the agent will only visit websites that have registered with JIFSAN.
This allows JIFSAN to review the sites so that Expose will only be directed to search
sites that meet a certain level of quality. Note that this does not restrict the ability of
approved sites to get current information indexed. Once a site is registered, it is
considered trusted and Expose will revisit it periodically.
2 A binary knowledge base can represent the same data as an n-ary knowledge base, but requires an
intermediate processing step to convert an n-ary relation into a set of binary relations. This is inefficient in
terms of storage and execution time.
as a table of the possible variable bindings. If the user double-clicks on a binding that is a
URL, then the corresponding web page will be opened in a new window of the user's web
browser.
It is widely believed that the outbreak of BSE in Great Britain was the result of
changes in rendering practices. Since processing can lead to the inactivation or spread of
a disease, JIFSAN expressed a desire to be able to visualize and understand the
processing of animal materials from source to end-product. To accommodate this, we
built the TSE Path Analyzer, a graphical tool which allows the user to pick a source,
process and/or end product and view all possible pathways that match their queries. The
input choices are derived from the taxonomies of the ontology, allowing the user to
specify the query at the level of generality that they wish. This display, which can be seen
in Figure 5, is created dynamically based on the semantic information in the SHOE web
pages. As such, it is automatically updated as new information becomes available,
including information that has been made available elsewhere on the web.
Since both these interfaces are applets, they are executed on the machine of each
user who opens it. This client application communicates with the central Parka
knowledge base through a Parka server that is located on the JIFSAN website. When a
user starts one of these applets on their machine, the applet sends a message to the Parka
server. The server responds by creating a new process and establishing a socket for
communication with the applet.
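The applet/server handshake might look roughly like this; the port, message format, and the use of a thread (where the text describes spawning a new process) are illustrative assumptions, not the actual Parka server protocol.

```python
import socket, threading

def handle(conn):
    """Per-client handler: stand-in for a Parka query process."""
    with conn:
        query = conn.recv(4096).decode()
        conn.sendall(f"results for: {query}".encode())

def serve(host="127.0.0.1", port=0):
    """Bind a listening socket on an ephemeral port."""
    srv = socket.socket()
    srv.bind((host, port))
    srv.listen()
    return srv

srv = serve()
port = srv.getsockname()[1]
# The server hands each accepted connection to its own handler.
threading.Thread(target=lambda: handle(srv.accept()[0]), daemon=True).start()

# The "applet" side: open a socket to the server and send a query.
cli = socket.create_connection(("127.0.0.1", port))
cli.sendall(b"categories of tse.Process")
reply = cli.recv(4096)
cli.close()
srv.close()
assert reply == b"results for: categories of tse.Process"
```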
3. Lessons Learned
This research has given us many insights into the use of ontologies in providing access to
internet information. The first insight is that it is worthwhile to spend time getting the
ontology "right". By "right", we mean that it must cover the concepts in the types of
pages that are to be used and the ways in which these pages will be accessed. We often
had to extend our ontology to accommodate concepts in pages that we were annotating,
and this slowed the annotation process.
Second, real world web pages often refer to shared entities such as BSE or the
North American continent. Such concepts may be described in many web pages, none of
which should have the authority to assign a key to them. In such cases, we revise the
appropriate ontologies to include a constant for the shared object. However, this may
result in frequent updates if the ontology is used extensively.
Third, ordinary web-users do not have the time or desire to learn to use complex
tools. Although the PIQ is easy to use once one has gained a little experience with it, it
can be intimidating to the occasional user. On the other hand, users liked the Path
Analyzer, even though it can only be used to answer a restricted set of queries, because it
presents the results in a way that makes it easy to explore the problem. It seems web users
are often willing to sacrifice power for simplicity.
Finally, the knowledge base must be able to perform certain complex operations
as a single unit. For example, the Path Analyzer needs to display certain descendant
hierarchies. Although such lists can be built by recursively asking for the immediate
children of the categories retrieved in the last step, this requires many separate queries. In
a client-server situation this is expensive, since each query requires its own
communication overhead and internet transmission delays can be significant. To improve
performance, we implemented a special server request that returns the complete set of
parent-child pairs that form a hierarchy. Although this requires the same amount of
processing by the knowledge base, it results in a significant speedup of the client
application.
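The performance argument can be illustrated with a sketch; the category names and the `kb_children` stand-in (each call representing one server round-trip) are hypothetical.

```python
# Toy taxonomy standing in for the knowledge base's category hierarchy.
HIERARCHY = {"Process": ["Rendering", "Cooking"], "Rendering": ["BatchRendering"]}

def kb_children(cat):
    """Stand-in for one client-server round-trip."""
    return HIERARCHY.get(cat, [])

def descendants_recursive(root):
    """Naive client-side strategy: one round-trip per category queried."""
    pairs = []
    for child in kb_children(root):
        pairs.append((root, child))
        pairs.extend(descendants_recursive(child))
    return pairs

def descendants_batched(root):
    """Special server request: same traversal done server-side, returning
    the complete set of parent-child pairs in a single exchange."""
    pairs, frontier = [], [root]
    while frontier:
        cat = frontier.pop()
        for child in kb_children(cat):
            pairs.append((cat, child))
            frontier.append(child)
    return pairs

# Same result either way; only the number of network exchanges differs.
assert sorted(descendants_recursive("Process")) == sorted(descendants_batched("Process"))
```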
4. Related Work
The World-Wide Web Consortium (W3C) has proposed the Extensible Markup Language
(XML) (Bray, Paoli, and Sperberg-McQueen 1998) as a standard that is a simplified
version of SGML (ISO 1986) intended for the Internet. XML allows web authors to create
customized sets of tags for their documents. Style sheets can then be used to display this
information in whatever format is appropriate. SHOE is a natural fit with XML: XML
allows SHOE to be added to web pages without creating an HTML variant, while SHOE
adds to XML a standard way of expressing semantics within a specified context. The
Resource Description Framework (RDF) (Lassila and Swick 1998) is another work in
progress by the W3C. RDF uses XML to specify semantic networks of information on
web pages, but has no inferential capabilities and is limited to binary relations.
There are many other projects that are using ontologies with the Web. The World
Wide Knowledge Base (WebKB) project (Craven et al. 1998) is using ontologies and
machine learning to attempt automatic classification of web pages. The Ontobroker
(Fensel et al. 1998) project has resulted in a language which, like SHOE, is embedded in
HTML. Although the syntax of this language is more compact, it is not as easy to
understand as SHOE. Also, Ontobroker does not have a mechanism for pages to use
multiple ontologies and those who are not members of the community have no way of
discovering the ontology information.
5. Future Work
The JIFSAN TSE Website is a work in progress, and we will continue to annotate pages,
refine the ontology, and improve the tool set. When we have accumulated a significantly
large and diverse set of annotated pages, we will systematically evaluate the performance
of SHOE relative to other methods. We also plan to develop a set of reusable ontologies
for concepts that appear commonly on the Web, so that future ontologies may be
constructed more quickly and will have a commonality that allows for queries across
subject areas when appropriate.
To gain acceptance by the web community, a new language must have intuitive
tools. We plan to create an ontology design tool that simplifies the ontology development
process. We also plan to improve the Knowledge Annotator so that more pages can be
annotated more quickly. We are particularly interested in including lightweight natural
language processing techniques that suggest annotations to the users. Finally, we are
investigating other query tools with the goal of reducing the learning curve while still
providing the full capabilities of the underlying knowledge base.
6. Conclusion
The TSE Risk Website is the first step in developing a clearinghouse on food safety risks
that serves both the general public and individuals who assess risk. SHOE allows this
information to be accessed and processed in powerful ways without constraining the
distributed nature of the sources. Since SHOE does not depend on keyword matching, it
prevents the false hits that occur with ordinary search engines and finds other matches
that they cannot. Additionally, the structure of SHOE allows intelligent agents to process
the information from many sources and combine or present it in novel ways.
We have demonstrated that SHOE can be used in large domains without clear
boundaries. The methodology and tools we have described in this paper can be applied to
other subject areas with little or no modifications. We have determined that the hardest
part of using SHOE in new domains is creating the ontology, but we are convinced that as
high quality ontology components are made available, this process will be simplified. We
are encouraged by the interest that our initial efforts have generated in the TSE
community, and believe that improvements in our tools and the availability of basic
ontologies will lead to an internet where the right data is always available at the right
time.
Acknowledgments
This work is supported in part by grants from ONR (N00014-J-91-1451), ARPA
(N00014-94-1090, DABT-95-C0037, F30602-93-C-0039) and the ARL
(DAAH049610297).
References
Bray, T., J. Paoli and C.M. Sperberg-McQueen. 1998. Extensible Markup Language (XML). W3C (World-Wide
Web Consortium). (At http://www.w3.org/TR/1998/REC-xml-19980210.html)
Craven, M., D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam and S. Slattery. 1998. Learning
to Extract Symbolic Knowledge from the World Wide Web. In Proceedings of the AAAI-98 Conference on
Artificial Intelligence. AAAI/MIT Press.
Lassila, O. and R.R. Swick. 1998. Resource Description Framework (RDF) Model and Syntax. W3C
(World-Wide Web Consortium). (At http://www.w3.org/TR/WD-rdf-syntax-19980216.html)
Stoffel, K., M. Taylor and J. Hendler. 1997. Efficient Management of Very Large Ontologies. In
Proceedings of the American Association for Artificial Intelligence Conference (AAAI-97). AAAI/MIT Press.
How to Find Suitable Ontologies
Using an Ontology-Based WWW Broker
Julio César Arpírez Vega¹, Asunción Gómez-Pérez¹,
Adolfo Lozano Tello² and Helena Sofia Andrade N. P. Pinto³†
{arpirez, asun, alozano}@delicias.dia.fi.upm.es, sofia@gia.ist.utl.pt
1 INTRODUCTION AND MOTIVATION
Nowadays, it is easy to get information from organizations that have ontologies using the
WWW. There are even specific points that gather information about ontologies and have
links to other web pages containing more explicit information about such ontologies (see
The Ontology Page⁴, also known as TOP) and there are also ontology servers like The
Ontology Server⁵ [8, 9], Cycorp's Upper CYC Ontology Server⁶ [29] or Ontosaurus⁷ [36]
that collect a huge number of very well-known ontologies.
When developers search for candidate ontologies for their application, they face a
complex multi-criteria choice problem. Apart from the dispersion of ontologies over
several servers: (a) ontology content formalization differs depending on the server at
which it is stored; (b) ontologies on the same server are usually described with different
detail levels; and (c) there is no common format for presenting relevant information
about the ontologies so that users can decide which ontology best suits their purpose.
Choosing an ontology that does not match the system needs properly or whose usage is
expensive (people, hardware and software resources, time) may force future users to stop
reusing the ontology already built and oblige them to formalize the same knowledge
again. It would be very useful for the knowledge reuse market to prepare a kind of yellow
pages of ontologies that provides classified and updated information about ontologies.
These living yellow pages would help future users to locate candidate ontologies for a
given application. A broker specialized in the ontology field can help in this search,
¹ Grupo de reutilización. Laboratorio de Inteligencia Artificial. Facultad de Informática. Universidad Politécnica de Madrid. España
² Área de Lenguajes y Sistemas Informáticos. Departamento de Informática. Universidad de Extremadura. España
³ Grupo de Inteligência Artificial. Departamento de Engenharia Informática. Instituto Superior Técnico. Lisboa. Portugal
† This work was partially supported by JNICT grant PRAXIS XXI/BD/11202/97 (Sub-Programa Ciência e Tecnologia do Segundo
Quadro Comunitário de Apoio).
⁴ http://www.medg.lcs.mit.edu/doyle/top
⁵ http://www-ksl.stanford.edu:5915
⁶ http://www.cyc.com
⁷ http://indra.isi.edu:8000/Loom
[Table fragment: "Functional" features — description of use tools, documentation quality, training courses, on-line help, operating instructions, availability of modular use, possibility of adding new knowledge, possibility of dealing with contexts, availability of PSMs.]
speeding up the search and selection process, by supplying the engineer with a set of
ontologies that totally/partially meet the identified requirements. As a first step to solving
the problem of searching for candidate ontologies, we present (ONTO)2Agent, an
ontology-based WWW broker on the field of ontologies that spreads information about
existing ontologies, helps to search appropriate ontologies, and reduces the search time
for the desired ontology. (ONTO)2Agent uses as a source of its knowledge an ontology
about ontologies (called Reference Ontology) that plays the role of a yellow pages of
ontologies.
In this paper, we will firstly present an initial set of features that allow us to
characterize, evaluate and assess ontologies from the user point of view. Secondly, we
will show how we have built the Reference Ontology at the knowledge level [32] using
the METHONTOLOGY framework [5, 11, 16] and the Ontology Design Environment
(ODE) [5], and how we have incorporated the Reference Ontology into the (KA)²
initiative [4]. Finally, we will present the technology we have used to build ontology-
based WWW brokers and how it has been instantiated in (ONTO)2Agent. (ONTO)2Agent
is capable of answering questions like: give me all the ontologies in the domain D that
are implemented in languages L1 and L2.
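A query of this form could be sketched as follows; the records and field names are invented for illustration, not the real Reference Ontology schema.

```python
# Toy stand-in for the Reference Ontology's descriptions of ontologies.
ONTOLOGIES = [
    {"name": "Standard-Units", "domain": "engineering",
     "languages": {"Ontolingua", "LOOM"}},
    {"name": "Chemicals", "domain": "chemistry",
     "languages": {"Ontolingua"}},
]

def find(domain, *required_languages):
    """All ontologies in `domain` implemented in every listed language."""
    return [o["name"] for o in ONTOLOGIES
            if o["domain"] == domain
            and set(required_languages) <= o["languages"]]

# "Give me all the ontologies in domain D implemented in languages L1 and L2":
assert find("engineering", "Ontolingua", "LOOM") == ["Standard-Units"]
assert find("chemistry", "LOOM") == []
```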
3.1 METHONTOLOGY
The METHONTOLOGY framework enables the construction of ontologies at the
knowledge level. It includes: the identification of the ontology development process, a
proposed life cycle and the methodology itself. The ontology development process
identifies which tasks should be performed when building ontologies (planning, control,
quality assurance, specification, knowledge acquisition, conceptualization, integration,
formalization, implementation, evaluation, maintenance, documentation and
configuration management). The life cycle (based on evolving prototypes) identifies the
stages through which the ontology passes during its lifetime. Finally, the methodology
itself specifies the steps to be taken to perform each activity, the techniques used, the
products to be outputted and how they are to be evaluated. The main phase in the
ontology development process using the METHONTOLOGY approach is the
conceptualization phase. Its aims are: to organize and structure the acquired knowledge
in a complete and consistent knowledge model, using external representations (glossary
of terms, concept classification trees, "ad hoc" binary relation diagrams, concept
dictionary, table of "ad-hoc" binary relations, instance attribute table, class attribute table,
logical axiom table, constant table, formula table, attribute classification trees and an
instance table) that are independent of implementation languages and environments. As a
result of this activity, the domain vocabulary is identified and defined. For detailed
information on building ontologies using this approach, see [16].
was formalized in Flogic [28]. A
WWW broker called Ontobroker [10]
uses this Flogic ontology to infer new
information that is not explicitly stored
on the ontology.
[Figure 2. Ontological Reengineering Process of the (KA)² Ontology.]
To make this ontology accessible to the entire community, it was decided to translate this Flogic ontology to
Ontolingua [20] and to make it accessible through the Ontology Server. Since all the
knowledge had been represented in a single ontology, the option of directly translating
from Flogic to Ontolingua was ruled out (since it transgressed the modularity criterion),
and it was decided to carry out an ontological reengineering process of the (KA)²
ontology as shown in Figure 2. First, we obtained a (KA)² conceptual model, attached to
the Flogic ontology manually by a reverse engineering process. Second, we restructured
it using ODE conceptualization modules. After this, we got a new (KA)² conceptual
model, composed of eight smaller ontologies: People, Publications, Events,
Organizations, Research-Topics, Projects, Research-Products and Research-Groups.
Finally, we converted the restructured (KA)² conceptual model into Ontolingua using
forward ODE translators.
[Figure: concept classification trees of the (KA)² ontology, showing classes such as People, Employee, Academic-Staff, Administrative-Staff, Publication, Article, Article-In-Book, Conference-Paper, Journal-Article, Workshop-Paper, Technical-Report, Book, Journal, IEEE-Expert, IJCAI, Special-Issue, Department, Research-Organization, University and Research-Project.]
and a Researcher. For a detailed explanation of the new (KA)² ontology conceptual
model built after restructuring the Flogic (KA)² ontology, see [5].
[Figure 4. Diagram of Binary "Ad-hoc" Relations in (KA)², showing relations such as Employs, Research-Interest and Supervises among Project, Organization and related classes.]
[Figure fragment: extended taxonomy, including Software-Project, Research-Project, Research-Topic-Events, Conference, Workshop, Article, Article-In-Book, Conference-Article, Journal-Article, Technical-Report, Workshop-Article, Journal, IEEE-Expert, IJCAI, Special-Issue and On-line-Publication.]
element of the relation), the name of the relation and the name of the target concept;
for instance, the relation Ontology-Formalized-in-Language between the class of
ontologies and one Language.
Based on the previous criteria, our analysis of the conceptual model of the (KA) 2
ontology showed that:
• about the classes: from the viewpoint of the Reference Ontology, some important
classes were missing; for instance, the classes Servers and Languages, subclasses of
Computer-Support at the Product ontology. The subclass of the class Servers is the
class Ontology-Servers, whose instances are the Ontology-Server, the Ontosaurus and
the CycServer. The subclass of the class Languages is the class Ontology-Languages,
whose instances are Ontolingua, CycL [29] and LOOM [30].
• about the relations: from the viewpoint of the Reference Ontology, some important
relations were missing; for instance, the relation Research-Topic-Products between a
research topic and a product, or the relation Distributed-by between a product and an
organization or the relation Ontology-Located-at-Server that relates an ontology to a
server.
• about the properties: from the viewpoint of the Reference Ontology, some important
properties were missing; for instance, Research-Topic-Webpages, Developers-Web-
Pages, Type-of-Ontology or Product-Name.
So, we introduced the classes, relations and properties needed. The most representative
appear highlighted in bold lettering in Figure 5.
All the changes, the entry of new relations and properties and the entry of new concepts
were guided by the features that were presented in section 2. Essentially, the (KA)²
ontology was extended using new concepts and some knowledge previously represented
in the (KA)² ontology was specialized in order to represent the information that we found
was of use and of interest for comparing different ontologies with a view to reuse or use
as a basis for further applications.
4 ONTOAGENT ARCHITECTURE
Having identified the relevant features of ontologies and built the conceptual structure of
the Reference Ontology using the Ontology Design Environment, the problem of
entering, accessing and updating the information about each individual ontology arises.
Ontology developers will enter such knowledge using a WWW form based on the
features identified in section 2. A broker specialized in the ontology field, called
(ONTO)2Agent, can help in this search. In this section, we describe domain-independent
technology for building and maintaining ontology-based WWW brokers. The broker
uses ontologies as a source of its knowledge and interactive WWW user interfaces to
collect information that is distributed among ontology developers.
The approach taken to build ontology-based WWW brokers is based on the architecture
presented in Figure 6. It consists of different modules, each of which carries out a major
function within the system. These modules are:
4.1 (ONTO)2Agent
In the ontological engineering context, using the Reference Ontology as a source of its
knowledge, the broker locates and retrieves descriptions of ontologies that satisfy a given
set of constraints. For example, when a knowledge engineer is looking for ontologies
written in a given language applicable to a particular domain, (ONTO)2Agent can help in
the search, supplying the engineer with a set of ontologies that totally/partially comply
with the requirements identified.
Both query builders allow users to formulate simple and complex queries. Simple
queries can be made using the predefined queries present in the agent. They are
based on ODE intermediate representations and include: definition of a concept,
instances of a concept, comparison between two concepts, etc. They are used to get
answers, loaded with information, easily and quickly. The query procedure is similar
to the one used by Yahoo or Alta Vista, so anyone used to working with these
Internet search tools is unlikely to have any problems using the interface. Complex
queries can be formulated by using a query builder wizard that works with AND/OR
trees and the vocabulary obtained from the ontologies we are querying. It will allow
us to build a more restrictive and detailed query than is shown in Figure 8, where we
are looking for all the ontologies in the engineering domain, with standard units as a
defined term and whose language is either Ontolingua, LOOM or SFK; before the query
is translated to the proper query language, it is checked semantically for inconsistencies
(syntactic correctness is implicit), thanks to the query building method. If it is all right,
it is refined, eliminating any redundancies.
Figure 8. (ONTO)2Agent is asked to provide all the ontologies in the engineering domain, written in
Ontolingua, LOOM or SFK, with Standard Units as a defined term, using a query expressed by means
of an AND/OR tree.
B.2. The resulting query is then translated into the SQL language in order to match
the ontology specification at the knowledge level, using the implementation of the
ontology stored in a database. For the Ontolingua implementation of a similar agent,
an OKBC-capable [39] builder would be required.
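Step B.2 might be sketched like this; the table and column names are invented, since the broker's actual relational schema is not given here.

```python
def to_sql(node):
    """Translate an AND/OR query tree into a SQL WHERE clause.
    A node is ('and'|'or', [children]) or a leaf ('eq', column, value)."""
    op, payload = node[0], node[1:]
    if op == "eq":
        column, value = payload
        return f"{column} = '{value}'"
    children = payload[0]
    joiner = " AND " if op == "and" else " OR "
    return "(" + joiner.join(to_sql(c) for c in children) + ")"

# The Figure 8 style query: engineering domain, language Ontolingua or LOOM.
query = ("and", [("eq", "domain", "engineering"),
                 ("or", [("eq", "language", "Ontolingua"),
                         ("eq", "language", "LOOM")])])
sql = "SELECT name FROM ontologies WHERE " + to_sql(query)
assert sql == ("SELECT name FROM ontologies WHERE (domain = 'engineering' "
               "AND (language = 'Ontolingua' OR language = 'LOOM'))")
```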
B.3. The SQL query is sent to the server by means of an OntoAgent-specific protocol
built on top of the TCP/IP stack. Therefore, the applications will be able to contact
the server by means of this protocol. The inference engine used is the search engine
equipped with MS-Access and some add-ins.
B.4. Once the query is sent to the server, the results will be returned and will be
graphically visualized by the system. This representation will be different depending
on whether or not natural language generation was requested. These results can be
saved in HTML format for later consultation using a common web browser.
Apart from this querying capability, ontologies can also be downloaded from or
uploaded to the server. We can thus work on an ontology on our own workstation,
using ODE, and modify and/or enlarge it as desired.
Figure 9. Search results in natural language and in tabular form. Sodium definition:
Sodium is an element that belongs to the alkali metal group and has an atomic number
of 11, an atomic weight of 22.98977 and a valency of 1. The table also shows the
Chemicals instance attributes table.

The Chemical OntoAgent can generate Spanish text descriptions in response to queries
in the domain of chemistry. This is shown in Figure 9, where we queried the
definition of sodium and the instance attributes table of the Chemicals ontology
using a predefined query.
Chemical OntoAgent does not have the modules described for the world-wide web
domain model builder broker, since the Chemicals ontology was built entirely using
ODE, and needed no further dynamic updating after its completion.
5 CONCLUSIONS
In this paper we presented (ONTO)2Agent, an ontology-based WWW broker to select
ontologies for a given application. This application seeks to solve some important
problems:
1. To solve the problem of the absence of standardized features for describing
ontologies, we have presented a living and domain-independent taxonomy of 70 features
to compare ontologies using the same logical organization. This framework differs from
Hovy's approach, which was built exclusively for comparing natural language processing
ontologies. This framework also extends the limited number of features proposed by
Fridman and Hafner for comparing well-known and representative ontologies, like: CYC
[29], Wordnet [31], GUM [3], Sowa's Ontology [35], Dahlgren's Ontology [7], UMLS
[24], TOVE [21], GENSIM [27], Plinius [38] and KIF [15].
2. To solve the problem of the dispersion of ontologies over several servers, and the
absence of common formats for representing relevant information about ontologies using
the same logical organization, we built a living Reference Ontology (a domain ontology
about ontologies) that gathers, describes using the same logical organization, and
provides links to existing ontologies. We built this ontology at the knowledge level using the
METHONTOLOGY framework and the Ontology Design Environment. We also
presented the design choices we made to incorporate the Reference Ontology into the
(KA) 2 initiative ontology after carrying out an Ontological Reengineering Process.
3. To solve the problem of searching for and locating candidate ontologies over several
servers, we built (ONTO)2Agent, an ontology-based WWW broker that retrieves the
ontologies that satisfy a given set of constraints using the knowledge formalized in the
Reference Ontology. (ONTO)2Agent is an instantiation of the OntoAgent Architecture.
OntoAgent and Ontobroker have several key points in common. Both are distributed,
joint efforts by the community, they use an ontology as the source of their knowledge,
they use the web to collect information, and they have a query language for formulating
queries. However, the main differences between them are:
- OntoAgent architecture uses: (1) a SQL database to formalize the ontology, (2) a
WWW form and an ontology generator to store the captured knowledge, and (3) simple
and complex queries based on ODE intermediate representations and AND/OR trees to
retrieve information from the ontology.
- Ontobroker uses: (1) a Flogic ontology, (2) Ontocrawler for searching WWW
documents annotated with ontological information, and (3) a Flogic-based syntax to
formulate queries.
We hope that (ONTO)2Agent and the Reference Ontology will ease the search for
ontologies to be used in other applications.
6 ACKNOWLEDGEMENTS
We would like to thank Mariano Fernández and Juanma García for their help in using
ODE.
7 REFERENCES
1. Aguado G., Bateman J., Bañón A., Bernardos S., Fernández M., Gómez-Pérez A., Nieto E., Olalla A., Plaza R., Sánchez A. ONTOGENERATION: Reusing domain and linguistic ontologies for Spanish. Workshop on Applications of Ontologies and Problem-Solving Methods (PSMs), Brighton, England, August 1998.
2. Bateman J.A., Magnini B., Fabris G. The Generalized Upper Model Knowledge Base: Organization and Use. In Towards Very Large Knowledge Bases, pp. 60-72, IOS Press, 1995.
3. Bateman J.A., Magnini B., Rinaldi F. The Generalized Italian, German, English Upper Model. Proceedings of ECAI94's Workshop on Comparison of Implemented Ontologies, Amsterdam, 1994.
4. Benjamins R., Fensel D. Community is Knowledge! in (KA)2. Knowledge Acquisition Workshop, KAW98. Presented in FOIS 1998.
5. Blázquez M., Fernández M., García-Pinar J.M., Gómez-Pérez A. Building Ontologies at the Knowledge Level using the Ontology Design Environment. Knowledge Acquisition Workshop, KAW98, Banff, 1998.
6. Borst P., Benjamins J., Wielinga B., Akkermans H. An Application of Ontology Construction. Workshop on Ontological Engineering, ECAI96, Budapest, pp. 5-16, 1996.
7. Dahlgren K. Naive Semantics for Natural Language Understanding. Boston, MA: Kluwer Academic, 1988.
8. Farquhar A., Fikes R., Rice J. The Ontolingua Server: A Tool for Collaborative Ontology Construction. Proceedings of the 10th Knowledge Acquisition for Knowledge-Based Systems Workshop, Banff, Alberta, Canada, pp. 44.1-44.19, 1996.
9. Farquhar A., Fikes R., Pratt W., Rice J. Collaborative Ontology Construction for Information Integration. Technical Report KSL-95-10, Knowledge Systems Laboratory, Stanford University, CA, 1995.
10. Fensel D., Decker S., Erdmann M., Studer R. Ontobroker: The Very High Idea. In Proceedings of the 11th International FLAIRS Conference (FLAIRS-98), Sanibel Island, Florida, May 1998.
11. Fernández M., Gómez-Pérez A., Juristo N. METHONTOLOGY: From Ontological Art Towards Ontological Engineering. Spring Symposium Series on Ontological Engineering, AAAI97, Stanford, USA, March 1997.
12. Fernández M. CHEMICALS: Ontología de elementos químicos. Proyecto fin de carrera, Facultad de Informática, Universidad Politécnica de Madrid, December 1996.
13. Fisher D., Rust K. SFK: A Smalltalk Frame Kit. Technical Report, GMD/IPSI, Darmstadt, Germany, 1993.
14. Fridman N., Hafner C. The State of the Art in Ontology Design. AI Magazine, Fall 1997, pp. 53-74, 1997.
15. Genesereth M., Fikes R. Knowledge Interchange Format. Technical Report Logic-92-1, Computer Science Department, Stanford University, 1992.
16. Gómez-Pérez A. Knowledge Sharing and Reuse. In The Handbook of Applied Expert Systems, edited by J. Liebowitz, CRC Press, 1998.
17. Gómez-Pérez A. Towards a Framework to Verify Knowledge Sharing Technology. Expert Systems with Applications, Vol. 11, No. 4, pp. 519-529, 1996.
18. Gruber T., Olsen R. An Ontology for Engineering Mathematics. Technical Report KSL-94-18, Knowledge Systems Laboratory, Stanford University, CA, 1994.
19. Gruber T. Toward Principles for the Design of Ontologies Used for Knowledge Sharing. Technical Report KSL-93-04, Knowledge Systems Laboratory, Stanford University, CA, 1993.
20. Gruber T. ONTOLINGUA: A Mechanism to Support Portable Ontologies. Technical Report KSL-91-66, Knowledge Systems Laboratory, Stanford University, 1992.
21. Gruninger M., Fox M. Methodology for the Design and Evaluation of Ontologies. Proceedings of IJCAI95's Workshop on Basic Ontological Issues in Knowledge Sharing, 1995.
22. van Heijst G., Schreiber A.Th., Wielinga B.J. Using explicit ontologies in KBS development. International Journal of Human-Computer Studies, 45, pp. 183-292, 1997.
23. Hovy E. What Would It Mean to Measure an Ontology? Unpublished, 1997.
24. Humphreys B.L., Lindberg D.A.B. UMLS project: making the conceptual connection between users and the information they need. Bulletin of the Medical Library Association, 81(2), 1993.
25. JavaSoft. Java Security FAQ. http://java.sun.com/sfaq, October 1997.
26. Kan S.K. Metrics and Models in Software Quality Engineering. Addison-Wesley Publishing Company, MA, USA, 1995.
27. Karp P.D. A Qualitative Biochemistry and its Application to the Regulation of the Tryptophan Operon. In Artificial Intelligence and Molecular Biology, L. Hunter (ed.), pp. 289-325, AAAI Press/MIT Press, 1993.
28. Kifer M., Lausen G., Wu J. Logical Foundations of Object-Oriented and Frame-Based Languages. Journal of the ACM, 1995.
29. Lenat D.B. CYC: Toward Programs with Common Sense. Communications of the ACM, 33(8), pp. 30-49, 1990.
30. Loom Users Guide Version 1.4. ISI Corporation, 1991.
31. Miller G.A. WordNet: An On-line Lexical Database. International Journal of Lexicography, 3(4), pp. 235-312, 1990.
32. Newell A. The Knowledge Level. Artificial Intelligence (18), pp. 87-127, 1982.
33. Pressman R. Software Engineering: A Practitioner's Approach. McGraw-Hill, 1997.
34. Slagle J., Wick M. A Method for Evaluating Candidate Expert System Applications. AI Magazine, Winter 1988, pp. 44-53, 1988.
35. Sowa J.F. Knowledge Representation: Logical, Philosophical, and Computational Foundations. Boston, MA: PWS Publishing Company, forthcoming, 1997.
36. Swartout B., Patil R., Knight K., Russ T. Towards Distributed Use of Large-Scale Ontologies. AAAI97 Spring Symposium Series on Ontological Engineering, 1997.
37. Uschold M., Gruninger M. ONTOLOGIES: Principles, Methods and Applications. Knowledge Engineering Review, Vol. 11, No. 2, June 1996.
38. Van der Vet P.E., Speel P.-H., Mars N.J.I. The Plinius ontology of ceramic materials. Proceedings of ECAI94's Workshop on Comparison of Implemented Ontologies, Amsterdam, 1994.
39. Chaudhri V.K., Farquhar A., Fikes R., Karp P.D., Rice J.P. The Generic Frame Protocol 2.0. July 21, 1997.
Towards Personalized Distance Learning on the Web
One of the features that characterizes distance learning (DL) is the "systematic
use of communication media and technical support" [7] as alternatives to mediate
in learning experiences. Any theory about learning insists that the quality of the
communication between teacher and student is a decisive factor in the process.
Therefore, taking advantage of resources such as the Internet, which can
significantly improve the information sources and the quality of the communication
with the students, should be seen as an obligation (the natural evolution of this
kind of education will eventually lead to its imposition).
In the near future it is likely that a distance learning student will contact
his/her classmates, teachers, advisor, and the University administration, as well
as make use of common university facilities, through the Internet. Telematic
services can be used by any student, for example, in clearing up doubts together
with fellow students or the teacher, regardless of his/her degree of isolation, or
in lightening the administration involved in compiling his/her academic record.
Considering the student diversity which characterizes this kind of education
(workers with family responsibilities, disabled people, teachers with a permanent
need to bring their background knowledge up to date, teenagers coming from
technical schools and secondary education ... ) as well as the dispersion of the
information sources (news, mailing lists, web pages of different kinds: those of
the institution or other institutions, pages for the different courses, FAQ's, the
lecturers' pages, practical exercises, continuous remote assessments ... ) the
1 www.geocities.com/Athens/Forum/5889/index.html
2 usuarios.iponet.es/jastorga/matematicas
Educational software can be found on many different web sites 3. All these
applications are closed systems, implemented especially for specific contents and
for a specific level. Then there are the so-called authoring tools, working
environments used in the creation of Internet courses. These tools mainly provide
3 www.edsoft.com,
www.gcse.com/maths,
node.on.ca/tile,
curriculum.qued.qld.gov.au/lisc/edsw/dossoft.htm,
www.telelearn.ca/conference/demos.html
We finally opted to make use of a Web server implemented in Lisp, CL-HTTP
(Common Lisp Hypermedia Server), which was also used in the development of the
Interbook system (sect. 3).
4 www.learning-web.com
5 www.lotus.com/learningspace
6 www.educom.edu/program/nlii/articles/moshwils.html
7 www.realeducation.com/products/index.html
8 www.contrib.andrew.cmu.edu/~plb/InterBook.html
9 www.contrib.andrew.cmu.edu/~plb/AIED97_workshop/Vassileva/Vassileva.html
10 www.icbl.hw.ac.uk/projects/isle/Doc.html
11 curriculum.qed.qld.gov.au/lisc/edsw/d-ctools.htm
12 www.icbl.hw.ac.uk
13 www.ai.mit.edu/projects/iiip/doc/cl-http/home-page.html
(that is, the data that refers to the page: its content, referring pages,
identification of the author ... ). This specification is called RDF (Resource
Description Framework, www.w3.org/TR/WD-rdf-syntax) and it will eventually be
included in the formal specification of HTML.
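As a sketch of the idea, a minimal RDF description of a page might look like the following. The namespaces and property names (Dublin Core for title and author) are illustrative assumptions, and the exact syntax of the then-draft WD-rdf-syntax differs in detail:

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- Metadata about the page, kept separate from its content -->
  <rdf:Description rdf:about="http://www.example.org/course/page.html">
    <dc:title>Example course page</dc:title>
    <dc:creator>A. Author</dc:creator>
  </rdf:Description>
</rdf:RDF>
```

Such a description lets the system reason about the page (its author, its referring pages) without parsing the HTML itself.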
4.3 Experimentation
At the moment, access to the exercises of the machine-learning courses at the
CSS and postgraduate courses of the AID has been personalized.
The system maintains a model for each user who interacts with it; when the student
first starts a session with the server, he/she must register; in this way, his/her
model is automatically initialized in the system (Figure 1).
The student can follow whatever path he/she wants while doing the exercises,
irrespective of the pages that the system recommends.
The personalization of the exercises module is basically focused on:
1. Recommendations about the pages, the aim of which is to help the student
to understand the purpose of the exercise. For instance, in the exercise of
Figure 2 the system advises the student to study the objectives of the exer-
cise among the course contents.
Fig. 1. The login page where the user is asked to introduce a login identifier and a
security password
After the student has visited the objectives page, the system will then re-
construct the original page, recommending a different hyperlink (Figure 3).
2. The system allows the student to add new hyperlinks in documentation
pages. For example, if the system presents a page with interesting hyper-
links, the user could add a new hyperlink using a form, as in Figure 4.
Fig. 2. Exercise with several possible links to follow together with the system's recom-
mendations
The success of the learning task of the Web personal assistant depends crucially
on the quality of its knowledge. The first design choice is to select a stable
set of attributes for describing training examples. The selected attributes must
satisfy some of the following requirements:
- They are correlated.
- There are causal dependencies between them.
- There are hierarchical dependencies between attributes and classes.
- They cover a significant portion of the training examples.
- They are based on measurement or objective judgments.
- Their values can discriminate between the training examples.
Fig. 3. The system recommends the next hyperlink for the student
some structure of feature values. The objective of this process is to decrease the
dispersion of training values, improving the predictive quality of the learning
task. However, there is a tradeoff between the usefulness of these clusters of
feature values and the quality of the program results.
Fig. 4. Form where the student is asked to introduce the data for the new hyperlink
extend the system to the whole CSS. However, this system will only become
really useful when the WWW resources become the main support for distance
learning education [1].
With respect to the performance of the personalization task, in an extended
design of the system we decided to apply an ensemble of classifiers to improve
its learning accuracy. In addition, content-based information filtering techniques
are applied in the representation of the Web pages [4]. Two information sources
are combined: academic reports and available data from user activity on the
web, including information directly introduced by the student and items which
he/she has selected (web pages, added hyperlinks, news groups, e-mail lists ... ).
Finally, the classification model is constructed from the overlapping training sets
of the cross-validation sampling method [6]. The final system will go beyond the
identification of relevant items for the student to find out the preferred channel
of communication with other teachers and students. For example, it is quite pos-
sible that some students will prefer to contact their companions through news
groups, instead of looking at the Web pages of registered students. Additionally,
the unstructured nature of the information sources (web pages, information
associated with hyperlinks ... ) requires the application of representation techniques
that summarize the relevant features of domain objects (there is an interesting
proposal in [3]).
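The cross-validation ensemble mentioned above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the toy majority-label "classifier" stands in for whatever learner the system actually uses, and the point is only how overlapping k-fold training sets yield an ensemble combined by voting.

```python
# Sketch of an ensemble built from the overlapping training sets of k-fold
# cross-validation: one classifier per fold split, combined by majority vote.
from collections import Counter

def kfold_splits(data, k):
    """Yield (train, held_out) pairs; the k train sets overlap pairwise."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        yield train, held_out

def train_majority(train):
    """Toy 'classifier': always predicts the majority label of its training set."""
    labels = [label for _, label in train]
    majority = Counter(labels).most_common(1)[0][0]
    return lambda x: majority

def ensemble_predict(classifiers, x):
    """Combine the per-fold classifiers by majority vote."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Hypothetical labelled examples: (item id, preferred channel).
data = [(0, "web"), (1, "web"), (2, "news"), (3, "web"), (4, "web"), (5, "news")]
classifiers = [train_majority(tr) for tr, _ in kfold_splits(data, k=3)]
prediction = ensemble_predict(classifiers, x=6)
```

In the real system each per-fold learner would be trained on the attribute vectors described earlier, and the held-out fold would estimate its accuracy.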
6 Acknowledgements
The authors would like to acknowledge the helpful comments of Simon Pickin,
arising in the course of his language revision of this article. We also thank the en-
tire Artificial Intelligence Department of the Spanish National Distance-learning
University (UNED) for providing support for this project.
References
1 Introduction
Knowledge based system (KBS) designers and hypertext system developers contend
that information structures may reflect the semantic structures of human memory.
Further, they believe that mapping the semantic structure of an expert onto a
knowledge hypertext information structure, and explicitly illustrating that
structure in the hypertext, will result in improved comprehension, because the
knowledge structures of the users will then reflect the knowledge structures of
the expert to a greater degree [13]. This paper reviews techniques for ascertaining
an expert's knowledge structure and mapping it onto visual representations. The
studies show that generating a semantic network through structured knowledge
acquisition significantly improves the development phase.
The short prehistory of knowledge engineering (KE) techniques and tools
(including knowledge acquisition, conceptual structuring and representation
models), of which an overall overview is presented in [5, 27], is an attempt to
develop a methodology that can bridge the gap between the remarkable capacity of
the human brain as a knowledge store and the efforts of knowledge engineers to
materialise this compiled experience of specialists in their domain of skill.
From the first steps and research that revealed the "bottleneck" [1] in expert
system development up to the present day, AI (artificial intelligence)
investigators and designers have been only slightly guided by cognitive science.
As a result, a major part of KE methodology suffers from fragmentation,
incoherence and shallowness.
The highlights in this area relate to the early work in the 1980s on the
reconstruction of the semantic space of human expertise [3] and the considerable
success of repertory-grid-centred tools such as the Expertise Transfer System
(ETS) [4], AQUINAS [3], KSS0 and others. All these programs can be regarded as
the first generation of KE tools.
The next advance in knowledge acquisition refinement is connected with visual
knowledge engineering [5], which developed novel techniques aimed at knowledge
engineers. These so-called second generation KE tools [7] bring the ideas of CASE
technology to AI [2]. They help to traverse and organise visually an emerging
knowledge store and the semantic space of the domain in the most natural form,
for example as an "image panel" or a sketchpad for concept maps, diagrams and
pictures.
Although the popular methods described above are rather powerful and versatile,
the knowledge engineer is in fact weakly supported at the most important and
critical stage in the knowledge engineering life cycle: the transition from
elicitation to conceptualisation, through understanding and realisation of the
domain structure and the expert's way of reasoning. He needs a mindtool to help
and assist him.
Over the last 5-7 years, the main interest of researchers in this field has been
in special tools that help knowledge capture and structuring. Many KA tools have
appeared that help to cut down the revise-and-review cycle time and to refine,
structure and test human knowledge and expertise [1, 24].
In this paper the new technology called CAKE (Computer Aided Knowledge
Engineering) is described. CAKE may also be used effectively for concept mapping
and ontology development.
Like KBS development, ontology development faces the knowledge acquisition
bottleneck problem. However, unlike KBS, the ontology developer comes up against
the additional problem of not having any sufficiently tested and generalised
methodologies recommending what activities to perform and at what stage of the
ontology development process these activities should be performed. That is, each
development team usually follows their own set of principles, design criteria and steps
in the ontology development process. The absence of structured guidelines and
methods hinders the development of shared and consensual ontologies within and
between teams, the extension of a given ontology by others and its reuse in other
ontologies and final applications [6].
Until now, few domain-independent methodological approaches have been
reported for building ontologies. Uschold's methodology [25], Gruninger and Fox's
methodology [12] and METHONTOLOGY [6] are the most representative. These
methodologies have in common that they start from the identification of the purpose
of the ontology and the need for domain knowledge acquisition. However, having
Cognitive tools have been around for thousands of years. Cognitive tools refer to
technologies, tangible or intangible, that enhance the cognitive powers of human
beings during thinking, problem solving, and learning. Cognitive tools represent
formalisms for thinking about ideas. They constrain the ways people organise and
represent ideas, so they necessarily engage different kinds of thinking [16].
Today, computer software programs are examples of exceptionally powerful
cognitive tools. As computers have become more and more common in education,
training, and performance contexts, the effectiveness and impact of software as
cognitive tools have begun growing.
Although many types of software can be used as cognitive tools for learning (e.g.,
databases, spreadsheets, expert system shells, abductive reasoning tools, multimedia
authoring systems, micro-worlds, and dynamic modelling tools), this article focuses
on the effectiveness of such visual techniques as concept mapping, ontologies and
knowledge base design software employed as intellectual partners in learning.
Concept maps, which are very similar to semantic networks, are spatial
representations of concepts and their interrelationships that are intended to
represent the knowledge structures that humans store in their minds [14]. Concept
maps are graphs consisting of nodes representing concepts and labelled lines
representing relationships between the concepts. Concept mapping is the process
of constructing
The same approach may be applied to hypertext design. Many modern Internet
hypertext tools, such as Explorer and Netscape, are intended to serve as graphical
browsers for a global hyperlinked mediaspace. In reality, however, every user of
a more or less complex hypertext structure is usually frustrated by a chaotic
labyrinth of crosslinks. This is especially true of the World Wide Web as a
distributed hypermedia system, where the nature of the associated information is
usually unavailable to the local node.
Imposing a knowledge structure on such amorphous hyperlink spaces can
dramatically shorten the conceptual apprehension of the corresponding flow of
information. In this way, the CAKE technology, even in the described
implementation, appears to be useful for this class of problems, because it offers
key functionality for elucidating the basic logical skeleton of the domain. Even
the plain visualisation of the logical schemata of the domain has a powerful
cognitive impact on both the user and the designer.
For example, fig. 2 shows a draft of one hypertext tutorial chapter. This tutorial
is based on the course in intelligent system development and is intended for
distance learning [8].
A last but not least contribution of the CAKE technology to this class of problems
is the possibility for the end user to navigate consciously through the hypermedia
space, while gradually building up the knowledge structure of the path left
behind. Such a structure may generalise the primitive apparatus of bookmarks and
index files.
The active browsing support currently implemented in CAKE allows the user of
the system to automate both the analysis and synthesis procedures of these
activities.
The proof of a framework's value is how much time and cost one saves when
developing and modifying the knowledge base and hypertext environment. The
framework of CAKE is a modern design environment with the openness, and the tool
and data integration capabilities, one needs to:
6 Discussion
This paper presents a rationale for the application of visual knowledge engineering
software as cognitive tools in education and industrial development of intelligent
systems.
Higher order thinking, especially problem solving, relies on well-organised,
domain-specific knowledge. The approach described in this paper facilitates the
development and representation of domain knowledge. Visual tools are therefore
predictive of different forms of higher order thinking.
They help in organising knowledge and data by integrating information into a
progressively more complex conceptual framework. When learners construct concept
maps or ontologies for representing their understanding in a domain, they may
reconceptualise the content domain by constantly using new propositions to elaborate
and refine concepts that are already known based on decontextualised knowledge [16,
18, 20]. The cross links, which connect different sub-domains of the conceptual
structure, enhance the anchorage of the concepts in the cognitive structure.
However, the research described above is limited and there is a great need for
sustained research regarding the implementation and effects of visual tools as
cognitive tools.
7 Acknowledgements
The presented research was partially supported by the Russian Foundation for
Basic Research (grant 98-01-00081).
References
20. Musen, M.: Conceptual Models of Interactive Knowledge Acquisition Tools. Knowledge Acquisition, Vol. 1, No. 1 (1994) 73-88
21. Nosek, J.T., Roth, I.: A Comparison of Formal Knowledge Representation Schemes as Communication Tools: Predicate Logic vs. Semantic Network. International Journal of Man-Machine Studies, 33, 227-239
22. Shavelson, R.J., Lang, H., Lewin, B.: On concept maps as potential "authentic" assessments in science (CSE Technical Report No. 388). Los Angeles, CA: National Centre for Research on Evaluation, Standards, and Student Testing (CRESST), UCLA (1994)
23. Sowa, J.F.: Conceptual Structures: Information Processing in Mind and Machine. Addison-Wesley, Reading, Mass.
24. Tuthill, S.: Knowledge Engineering. TAB Professional and Reference Books.
25. Uschold, M., Gruninger, M.: ONTOLOGIES: Principles, Methods and Applications. Knowledge Engineering Review, Vol. 11, No. 2 (1996)
26. Welsh, M.: http://www.linux.org/LDP (1995)
27. Wielinga, B., Schreiber, G., Breuker, J.: A Modelling Approach to Knowledge Engineering. Knowledge Acquisition, 4(1). Special Issue. (1992)
Optimizing Web Newspaper Layout Using
Simulated Annealing
1 Introduction
2 The problem
After a user sends a query to a news server site, a set of articles related to the
query is obtained. These articles are page segments extracted from web newspapers
that may contain headers, text and even images. The fact is that the client does
not know exactly what kind or amount of information will be received.
As the user's query is sent via a web browser, the results should be presented
as a web page containing all the articles extracted by the server in a correct way,
that is, without overlapping between articles, occupying the smallest possible
area and with no empty gaps between articles.
It would be convenient for the optimization process to take place inside the
client machine to avoid server overload due to several queries being made at the
same time and because the results depend on the client's computer configuration,
such as the face and size of fonts being used and the size and resolution of the
screen.
As described in [4], the best way to manage the above constraints is to
program the optimization process as a JavaScript [2] script to be sent within the
web page containing the articles to be laid out and which will be interpreted by
the web browser when the page is loaded. Such a script is able to change the
appearance of a web page dynamically and to lay out all the articles, taking into
account the face and size of fonts and the size of the browser window, avoiding
scroll bars if possible. Thus, the server only has to find the articles that
satisfy the query and send them to the user, while the rest of the work is done
at the client's end.
check whether the layout is legal; thus, the optimization process becomes too
slow to be very useful.
4 Proposed approach
The surface of the window is divided into columns with a fixed width and an
infinite height (if the number of articles is such that they do not fit inside the
window, a vertical scroll of the window is allowed). The number of columns in
the layout depends on the size of the browser window. In this paper, each article
has the same width as a column and a height that depends on the amount of
information, but the system will shortly be able to deal with articles with a
width of several columns, although if an article originally takes up more than
one column, it can be fitted to one column without loss of information. The
problem is then how to fill all the columns with articles to get the heights of the
columns as close as possible.
This problem is very similar to a bin packing problem [10], in which the goal
is to minimize the amount of fixed size bins used to pack a number of objects;
however, in this problem, the number of bins (columns) is fixed and what has
to be minimized is the used capacity difference between bins (columns).
The newly generated permutation is a copy of the old one, but with the numbers
at the marked positions swapped:

new = (1 2 8 4 5 6 7 3 9)
The algorithm used in this approach needs only two parameters: the number of
iterations required by the search process, numIt, and the number of changes
necessary to reach thermal equilibrium, k. Its implementation is detailed in the
following pseudocode:

n = 0
T = T0
select a configuration i_old at random
evaluate i_old
repeat
    for j = 1 to k
        select a new configuration i_new in the neighbourhood of i_old
            by mutating i_old
        Δf = f(i_new) - f(i_old)
        if (Δf < 0) OR (random(0, 1) < e^(-Δf/T))
            then i_old = i_new
    end for
    T = fT(T0, n)
    n = n + 1
until (T < Tmin)
use the last configuration to obtain the layout
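The pseudocode above can be turned into a runnable sketch. The layout rule used here (each article is placed, in permutation order, into the currently least-filled column), the article heights, and all parameter values are assumptions for illustration; the paper fixes none of them at this point, and its implementation is in JavaScript rather than Python:

```python
# Runnable sketch of the simulated-annealing layout optimizer; placement
# rule and parameters are illustrative assumptions.
import math
import random

def column_heights(perm, heights, n_cols):
    """Place articles, in permutation order, into the least-filled column."""
    cols = [0.0] * n_cols
    for a in perm:
        cols[cols.index(min(cols))] += heights[a]
    return cols

def f2(perm, heights, n_cols):
    """Objective (5): gap between the most and least filled column."""
    cols = column_heights(perm, heights, n_cols)
    return max(cols) - min(cols)

def mutate(perm):
    """Swap the articles at two randomly marked positions."""
    i, j = random.sample(range(len(perm)), 2)
    new = list(perm)
    new[i], new[j] = new[j], new[i]
    return new

def anneal(heights, n_cols, num_it=100, k=20, t0=10.0):
    f = lambda p: f2(p, heights, n_cols)
    i_old = list(range(len(heights)))
    random.shuffle(i_old)
    t, n = t0, 0
    t_min = t0 / (1 + num_it)            # (3): Tmin = fT(T0, numIt)
    while t >= t_min:
        for _ in range(k):               # k changes per temperature step
            i_new = mutate(i_old)
            delta = f(i_new) - f(i_old)
            if delta < 0 or random.random() < math.exp(-delta / t):
                i_old = i_new
        n += 1
        t = t0 / (1 + n)                 # (2): freezer function fT
    return i_old

random.seed(1)
heights = [120, 80, 60, 200, 90, 150, 70, 110]   # assumed article heights
best = anneal(heights, n_cols=3)
```

The in-browser version works the same way, but measures article heights from the rendered fonts and window size instead of taking them as given.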
The initial temperature is calculated as:

T0 = -Δf* / ln(p_a)    (1)

where Δf* is the average objective increase observed in a random change, and p_a
is the initial acceptance probability (0.8 is usually used).
For the freezer function (fT) this approach uses:

fT(T0, n) = T0 / (1 + n)    (2)

This function lowers the temperature, and thus the acceptance probability, quickly
at first, and later starts a more controlled descent until the minimum temperature
is reached.
The minimum temperature is calculated on the basis of the desired number of
iterations numIt as follows:

Tmin = fT(T0, numIt)    (3)
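Equations (1)-(3) can be checked numerically; the value of Δf* below is an assumed measurement, not one taken from the paper:

```python
# Numeric illustration of equations (1)-(3).
import math

delta_f_star = 25.0     # assumed average objective increase Δf* from random changes
p_a = 0.8               # initial acceptance probability

t0 = -delta_f_star / math.log(p_a)      # (1): initial temperature

num_it = 100
f_T = lambda t0, n: t0 / (1 + n)        # (2): freezer function
t_min = f_T(t0, num_it)                 # (3): minimum temperature
```

By construction, accepting a typical uphill move of size Δf* at temperature T0 has probability exactly p_a, and the schedule reaches Tmin after numIt temperature steps.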
Two different objective functions are tested in this approach. The first one is
the sum, over all columns, of the difference between the capacity taken up by each
column and that of the most filled column:

f1 = Σ_{i=0..n-1} (C - c_i)    (4)

where c_i is the capacity taken up by the i-th column and C is that of the most
filled column (C = max_i(c_i)). This function measures the unused area in the
layout, but implies a lot of calculation, making the algorithm slower, so a
different objective function was designed and tested.
The final objective function measures the difference in capacity taken up
between the most filled and the least filled column:

    f2 = C - c    (5)

where c is the capacity used by the least filled column (c = min(c_i)). The
optimal layout (if it exists with the given articles) is reached when this
difference f2 is zero, which means that all columns are equally filled.
This objective function is easier and faster to calculate than the first one,
and it guides the search better: the first one cannot distinguish between two
layouts having the same total unused surface area where one spreads the unused
capacity evenly over the columns while the other has some columns with a little
unused capacity and other columns that are almost empty.
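The difference between the two objectives can be seen on a toy example; the column heights below are invented for illustration:

```python
# heights[i] is the capacity taken up by column i.

def f1(heights):
    """Equation (4): total unused area below the tallest column."""
    c_max = max(heights)
    return sum(c_max - c for c in heights)

def f2(heights):
    """Equation (5): gap between the most and least filled columns."""
    return max(heights) - min(heights)

# Two layouts with the same total unused area but different balance:
even = [10, 8, 8, 8]      # unused capacity spread evenly
skewed = [10, 10, 10, 4]  # one almost-empty column

print(f1(even), f1(skewed))  # 6 6  -> f1 cannot tell them apart
print(f2(even), f2(skewed))  # 2 6  -> f2 prefers the balanced layout
```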
5 Results
To determine how the number of articles influences the time spent by the SA
in the optimization process, the algorithm was tested with 25, 50, 100 and
200 articles.
Table 1. Minimum, maximum, average and standard deviation of time and cost
optimizing 25, 50, 100 and 200 articles

Fig. 1. Minimum, maximum and average time optimizing 25, 50, 100 and 200 articles
Taking into account that the program is written in JavaScript and that every
execution of the algorithm must be interpreted by a JavaScript engine inside the
browser, the times obtained are acceptable. If the algorithm were written in C
and compiled, every execution would be much faster, but it would not be able
to optimize web pages dynamically on the client's computer. Moreover, in
a browser window of the usual size there is only room for 10 articles without
scroll bars, and the usual number of articles returned from the server is
no greater than 25, so typical times are between 2 seconds (with 10 articles) and
5 seconds (with 25 articles) in most cases, which is a really short time compared
with the time spent loading the web page.
An example of a final result is shown in figure 2, where 25 articles are
displayed using a very small (unreadable) font. An 8-point font size was used in
the execution shown in the figure in order to fit as many articles as possible
in the window without scrolling. With a normal font size, i.e. 10-12 points,
no more than 10 articles can be fitted into a window.
6 Conclusions
This paper presents a different approach from the one presented in a previous
paper [4], based on SA, to solve the pagination problem, in which the code that
solves the problem is sent by the server within the same web page to be optimized.
With this approach, the server only has to look up the information the user
requests, and as the optimization process runs at the client's end, it knows the
exact configuration of the client's computer and adapts to it easily, always
obtaining a personalized result for each user.
The time required for optimization is acceptable; for example, on a 233 MHz
Intel Pentium MMX it is usually between 2 seconds (optimizing 10 articles) and 5
seconds (optimizing 25 articles). With current processors the optimization time
should be better, so this is a very good time considering that the code that
performs the optimization is interpreted within a web browser, which is slower
than a normal optimization application compiled for a particular computer
architecture.
The proposed approach is available at http://kal-el.ugr.es/~jesus/layout.
In the near future this application will be able to handle articles of different
widths, so that a long article, which at present is restricted to fit in only
one column, could occupy more than one column and thus have a squarer shape.
This is not really a restriction, since the shape of the article can be altered
to occupy as many columns as necessary.
Another interesting improvement would be to place related articles as close
together as possible. This would make the layout easier to read and understand
for the user, but it would involve tagging or an understanding of the articles
by the machine, which is a much more complex and completely different problem.
7 Acknowledgements
This work has been supported in part by CICYT project BIO96-0895 (Spain),
DGICYT PB-95-0502 and FEDER 1FD97-0439-TEL1.
References
1. E.H.L. Aarts and J. Korst. Simulated Annealing and Boltzmann Machines. John
   Wiley, 1989.
2. Netscape Communications Corporation. JavaScript developer central. Web
   address: http://developer.netscape.com/tech/javascript.
3. K. de Jong. An analysis of the behavior of a class of genetic adaptive systems. PhD
   thesis, Dept. of Computer and Communication Sciences, University of Michigan,
   Ann Arbor, 1975.
4. J. González and J.J. Merelo. Optimizing web page layout using an annealed genetic
   algorithm as client-side script. In A.E. Eiben, T. Bäck, M. Schoenauer, and H.-P.
   Schwefel, editors, Proceedings of the 5th Conference on Parallel Problem Solving
   from Nature, volume 1498 of Lecture Notes in Computer Science, pages 1018-1027,
   Amsterdam, The Netherlands, September 1998. Springer-Verlag.
5. W.H. Graf. Graf's home page. Web address: http://www.dfki.de/~graf/.
6. T. Kamba, K. Bharat, and M.C. Albers. The Krakatoa Chronicle - an interactive,
   personalized newspaper on the web. Technical Report 95-25, Graphics,
   Visualization and Usability Center, Georgia Institute of Technology, USA, 1995.
7. S. Kirkpatrick. Optimization by simulated annealing - quantitative studies. J.
   Stat. Phys. 34, 975-986, 1984.
1. Introduction
During the last years the number of biomedical applications based on artificial
neural networks has increased considerably. ANN-based systems are used both in
hardware applications, such as administering doses of medicine to a patient in an
Intensive Care Unit, and in software applications, such as diagnostic and monitoring
assistance systems for a patient. In this paper we focus the discussion on diagnostic
assistance systems. Specifically, we propose a development methodology for this kind
of system, from data collection to the final analysis of the results.
The design process of this kind of system is divided into three phases. First, a
basic pre-processing of the data collected by doctors. With this analysis we want to
detect errors made during data collection and to carry out a first study of the nature
of the data. This phase allows filtering out non-significant variables. The last step
of this phase consists of generating training and test data sets for the training
process of the ANN.
In a second phase the training process of the ANN is carried out and its
performance is evaluated. The third phase consists of studying the criteria used by
the ANN to give the final diagnosis. If we know the criteria of the ANN, we can
increase the medical reliance on the diagnostic assistance system. Trepan is the
algorithm used for this purpose.
Next, we describe in detail the three phases commented on before, and we present
three medical applications carried out by the Soft Computing Applications Group of
the Universitat Autònoma de Barcelona, in collaboration with different hospitals.
2. Data Pre-processing
The main goal of data pre-processing is to detect and, if possible, to correct
abnormalities in the original data set, so as to present the neural network with a
learning set that contains all the information in a simplified form, improving both
the time needed for the learning process and the internal neural network
architecture. This process is divided into three phases: descriptive statistical
analysis, data transformation and data validation using Trepan.
All this process is carried out automatically by an AWK validation program. This
program, using the original data set and a set of validation rules, detects data set
abnormalities and, if the original data set is correct, generates two data subsets: one
of these is used to train the network and the other is used to validate the learning
process. The number of cases and the case balancing can be selected by the user as
part of the validation rules.
In order to detect abnormalities due to the data input process or to a wrong
selection of the data set, a simple statistical study is carried out. This study
provides the parameters depicted in Table 1.
Present values       Standard deviation
Missing values       Standard error
Mean                 Lowest value
Central reservation  Upper value
First quartile       Third quartile

Table 1. Descriptive statistical analysis parameters.
The number of missing values can be used to pin down data validity. Patterns with
missing data can be cast off, or can be completed with mean values, typical values
or, if possible, with correlated values. Nevertheless, if the data set has enough
examples, patterns with missing data are rejected.
Standard deviation is used to detect noise in the data. A standard deviation bigger
than expected suggests errors in the data input. On the other hand, if the standard
deviation lies between the expected values, it does not mean that the data input
process was correct; it could be that the whole data set is shifted. This error can be
corrected in later steps through data transformation. In order to ensure that every
variable is within its domain, the simple statistical study includes the lowest and
upper values that every variable achieves.
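A minimal sketch of this descriptive analysis for a single variable might look as follows; the column values and the convention of marking missing entries as None are assumptions for the example:

```python
import statistics

def describe(column):
    """Compute the Table 1 parameters for one raw data column,
    where None marks a missing entry (an assumed convention)."""
    values = [v for v in column if v is not None]
    n = len(values)
    sd = statistics.stdev(values)
    q = statistics.quantiles(values, n=4)  # [Q1, median, Q3]
    return {
        "present": n,                      # present values
        "missing": len(column) - n,        # missing values
        "mean": statistics.mean(values),
        "stdev": sd,
        "stderr": sd / n ** 0.5,           # standard error of the mean
        "lowest": min(values),             # lowest value
        "upper": max(values),              # upper value
        "q1": q[0],                        # first quartile
        "q3": q[2],                        # third quartile
    }

# Toy column (e.g. a temperature variable) with two missing entries.
col = [36.5, 37.0, None, 38.2, 36.9, None, 39.1, 37.4]
stats = describe(col)
print(stats["present"], stats["missing"], round(stats["mean"], 2))  # 6 2 37.52
```

Checking that the lowest and upper values fall inside the variable's expected domain is then a direct comparison against the validation rules.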
2.2. Data transformation
Various transformations are made over the original data set in order to improve the
learning process and reduce the training time. The data transformation depends on the
variable type. It is possible to identify three variable types:
- Nominal variables. Those that present one or more exclusive states. It is not
  possible to determine an order among the adopted values.
- Ordinal variables. They have different states through which it is possible to
  determine an order, but the distance between grades is undetermined.
- Absolute variables. They have different states through which it is possible to
  determine an order, and it is also possible to determine the distance between
  degrees.
Nominal variables can be coded in two ways:
1. By following a one-neuron-one-state configuration, in which there are as many
   neurons as states and a different single neuron is active in every state.
2. If the variable has a binary value, it can be coded with only one neuron,
   enabling or disabling it in order to code the presence or absence of the
   characteristic.
Ordinal variables are coded as cumulative indicators, using one neuron fewer than
the number of states; every state is coded by enabling as many neurons as the degree
it occupies in the ordered value scale. Absolute variables are normalised into the
interval [0.1, 0.9].
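The three codings just described can be sketched as follows; the function names and the example states are illustrative, not from the paper:

```python
def encode_nominal(state, states):
    """One-neuron-one-state coding: one neuron per state,
    exactly one active."""
    return [1 if s == state else 0 for s in states]

def encode_ordinal(rank, n_states):
    """Cumulative indicator coding: n_states - 1 neurons, enabling
    as many neurons as the (0-based) rank of the state."""
    return [1 if i < rank else 0 for i in range(n_states - 1)]

def encode_absolute(value, lo, hi):
    """Linear rescaling of the variable's domain [lo, hi]
    into [0.1, 0.9]."""
    return 0.1 + 0.8 * (value - lo) / (hi - lo)

print(encode_nominal("B", ["A", "B", "C"]))          # [0, 1, 0]
print(encode_ordinal(2, 4))                          # [1, 1, 0]
print(round(encode_absolute(50.0, 0.0, 100.0), 2))   # 0.5
```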
TREPAN performs a statistical separation of the data set according to the variable or
variables that give more information for classifying the resulting class. This study
can show data deviations that would be very difficult to see using traditional
statistical techniques. The algorithm can be used to test the validity of the
protocol, giving information in several senses:
- It gives information about the more important and the less important variables.
- If the generated tree is very unbalanced, it indicates a lack of variables in the
  protocol definition.
- The Trepan trees can be contrasted with the specialist's criterion in order to
  debug the protocol definition.
In the other sense, the complexity of the node decision rules gives information
about the global complexity of the problem.
Artificial Neural Networks (ANNs) are the best option to build a diagnostic assistance
system. ANNs are attractive for medical applications due to several characteristics:
- If the input data consists of human opinions or ill-defined categories, or is
  subject to possibly large errors, the robust behaviour of neural networks is
  important.
- A neural network presents the ability to discover patterns in data that are so
  obscure as to be imperceptible to human researchers and standard statistical
  methods.
- Medical data exhibits significant unpredictable nonlinearity, whereas traditional
  time-series models for predicting future values are based on strictly defined
  models.
- A neural network acquires information, 'knowledge', concerning a problem by
  means of a learning/training process, extracting the knowledge directly from the
  data. This information is stored in a compact way, and access to it is simple and
  fast.
- A neural network presents a high degree of precision (generalisation) when
  giving a solution to new input data in the same problem domain.
- Determining the training and test pattern sets. If we want an effective ANN-based
  system, the training set must be complete enough to satisfy several goals:
  (1) every class must be represented; (2) within each class, statistical variation
  must be adequately represented; (3) the training set must have approximately twice
  as many examples as ANN free parameters (internal connections) to avoid the
  overfitting problem; (4) the training and test sets must be balanced.
- Training the neural network.
- Evaluating the performance of the neural networks by means of Receiver Operating
  Characteristic (ROC) curves. With this step we evaluate whether the training
  process has been carried out correctly, i.e. whether the ANN gives a good solution
  in relation to the input data.
- Validating the diagnostic assistance system. A group of specialists evaluates the
  performance of the ANN-based system, comparing the ANN diagnosis with the
  specialist diagnosis.
Once the ANN has been trained, the next step is to study the criteria followed by the
ANN to reach the final result. If we know the criteria of the ANN, we can increase the
medical reliance on the diagnostic assistance system.
The internal representation of the knowledge acquired by an ANN after its learning
process is not easily understandable. Several parameters of the ANN take part in this
internal representation, for example the weight values, the bias values, and the
activation and output functions. This aspect is a great inconvenience for the use of
ANNs in medical applications.
But why do we want to gain access to the internal representation in an
understandable and easy manner? And, maybe, we can reach an ANN that can be
interpreted. Several answers to the question follow:
- The deduced criteria can explain how the net reaches the final diagnoses. This
  is the major obstacle of several ANN-based systems, above all in medical
  applications.
- If the knowledge of the ANN-based system can be expressed as a rule set, it can
  be incorporated into other intelligent systems, for example an expert system. This
  is possible because we can handle and express the ANN knowledge in an easy
  manner.
- Thanks to the ANN knowledge we can explore the collected data and evaluate the
  ANN conclusions. With this process it is possible to give the specialist more
  information about the problem.
Learning techniques that use rules as knowledge representation resolve the
commented problem in a direct manner, that is to say, the acquired knowledge is easy
to work with. But there are applications where ANN systems present better solutions
than other learning algorithms.
Several ANN knowledge extraction algorithms have been proposed, each of them
with different characteristics. The selected algorithm is Trepan, which generates a
decision tree from a trained ANN and the pattern set used to train the net. In fact,
Trepan does not need an ANN; it only needs an oracle or teacher that answers the
questions made by the algorithm, and an instance distribution model.
We can easily understand Trepan starting from a classic algorithm such as ID3. ID3
is a symbolic learning algorithm that learns concepts. ID3 generates a decision tree
(DT) from a set of examples classified by a teacher. This DT is composed of
rule-nodes, where a rule separates the example set into two classes: one class
complies with the rule, and the other does not. In a recursive way, rule-nodes are
selected to classify the example set. The algorithm finishes when the decision tree
classifies the initial example set completely. The behaviour of Trepan is similar to
the ID3 algorithm, with the difference that Trepan generates new examples from a
data model (the model is deduced from the example set). Trepan uses the trained ANN
as the oracle, and we can conclude that the decision tree shows the ANN knowledge.
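A much-simplified sketch of this core idea (not the full Trepan algorithm): draw new examples from an instance distribution model, label them by querying the trained model as an oracle, and pick the split with maximal information gain. The oracle and the distribution model below are invented stand-ins for a trained ANN and the real data model.

```python
import math
import random

def oracle(x):
    """Stand-in for a trained ANN: a fixed linear decision rule."""
    return 1 if 0.8 * x[0] - 0.5 * x[1] > 0.1 else 0

def sample_from_model(n, rng):
    """Instance distribution model: here, uniform over [0,1]^2."""
    return [(rng.random(), rng.random()) for _ in range(n)]

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def best_split(points, labels):
    """Choose the (feature, threshold) pair with maximal information gain."""
    base = entropy(labels)
    best = (None, None, -1.0)
    for f in range(2):
        for thr in sorted({p[f] for p in points}):
            left = [l for p, l in zip(points, labels) if p[f] <= thr]
            right = [l for p, l in zip(points, labels) if p[f] > thr]
            gain = base - (len(left) * entropy(left)
                           + len(right) * entropy(right)) / len(labels)
            if gain > best[2]:
                best = (f, thr, gain)
    return best

rng = random.Random(42)
queries = sample_from_model(500, rng)    # NEW examples, as Trepan generates
answers = [oracle(q) for q in queries]   # labelled by querying the oracle
feature, threshold, gain = best_split(queries, answers)
print(feature, round(gain, 2))
```

Applied recursively with a stopping criterion, this yields a tree whose splits reflect the oracle's decisions rather than the raw training data, which is what lets the tree expose the network's knowledge.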
Thanks to Trepan we can:
- Study the knowledge acquired by the net. From this information we can observe
  useful characteristics of the problem, characteristics detected by the net.
  Afterwards, the specialist can evaluate this knowledge.
- Generate a rule-based expert system from the ANN knowledge. It is possible to
  complete expert systems with the deduced rule set.
- Study the weight of the different variables or attributes in the ANN solution. If
  a variable does not appear in the DT, it is probable that this attribute is not
  important for the net. In the same way, we can detect a variable that is very
  important for the final diagnosis.
- Study possible ANN performance problems. If there are unclassified examples, we
  can suppose that some attribute is not present in the protocol, and maybe these
  attributes are important for the net.
Trepan maximises the readability and understandability of the ANN decision
tree, generating trees that are more compact and useful than the trees generated by
ID3-like algorithms. The rules generated by Trepan present a greater semantic
expressiveness than the rules reached by ID3.
The explained methodology has been used in several medical projects carried out
by members of our group in collaboration with different hospitals. Next, we present
the three most important ones.
This project presents the study of Open Angle Chronic Glaucoma (OACG). OACG is a
frequent and serious eye disease (with a prevalence of 0.4-2.1% among the population
older than 40 years), since it can produce great damage to the visual function, being
one of the main causes of blindness in developed countries. At present there are two
tests that are considered the pillars of glaucoma diagnosis: the study of the atrophy
of the fibre layer and the optic nerve head, and the exploration of the visual field.
However, the study of the visual field is still the main data in glaucoma diagnosis,
and the absence of campimetric defects excludes the diagnosis.
The diagnosis system is based on artificial neural networks, specifically
feedforward networks trained by means of the backpropagation learning algorithm. The
network has seven input units, which correspond to the zones defined in the
campimetry, and the response is whether the patient's visual field presents
glaucomatous defects.
The specificity (>82%) and sensitivity (>90%) values are higher than the indices
obtained by other methods of visual field interpretation. To sum up, from the results
obtained it is deduced that artificial neural networks are a good solution for
developing a diagnostic assistance system.
Another positive aspect of this approach is the possibility of knowing the
criteria followed by the net to reach the final diagnoses, and for this job we have
used the Trepan algorithm. The glaucoma application has the particular characteristic
that its variables are continuous, and this aspect represents a great effort for
Trepan, because the process of determining rule conditions with continuous variables
is very hard. Ophthalmologists of the Glaucoma Unit of IOBA (Instituto de
Oftalmobiología Aplicada) have evaluated and accepted the final rule set. With this
we can increase the credibility of the ANN solution.
The aim of this project is to determine whether the use of an ANN for the
indication of radio-guided biopsies can reduce the percentage of negative biopsies in
diagnosing breast cancer. Between 15 and 30% of mammography-detected abnormalities
are breast carcinomas. Hence, radio-guided biopsy is indicated for outlining the
suspected breast zone and confirming/refuting the presence of breast carcinoma.
Nevertheless, the percentage of negative biopsies is, fortunately for patients,
extremely high (up to 85%); this represents extraordinary expenses of time and money
for the hospital. Objective methods designed to reduce the percentage of negative
biopsies would not only alleviate hospital budgets but also lessen the understandable
fears and nuisances patients suffer when facing the doubt of cancer. An additional
goal of the project is to study the weight of every attribute that characterises a
mammography, with the purpose of determining the quality of the protocol.
Mammography is the least aggressive breast cancer detection technique, and is
therefore the first breast exploration; after that, if necessary, the patient is
subjected to other, more aggressive explorations. A mammography can present two types
of characteristics: microcalcifications and nodules. Both characteristics can be
shown at the same time, but this is a very strange case; they usually appear alone.
As in the previous application, an ANN-based system is proposed. In the initial
analysis, before the training process, we can detect the aspect commented on before:
the separation between microcalcifications and nodules. For this reason we decided to
divide the problem into two parts: (1) detection of dangerous microcalcifications,
and (2) detection of dangerous nodules (high risk of breast cancer). With this
solution we achieve a complexity reduction in the two new problems.
An ANN-based system has been designed for resolving the microcalcifications
problem. After that, we obtain a rule set from the ANN by means of the Trepan
algorithm, which has been evaluated and validated by specialists.
For the nodules problem we have decided to use another approach, different from
the ANN-based system. We have decided to design a rule-based system, basically
because it is very simple to deduce. The Trepan algorithm has been used to deduce the
rule set and afterwards, if necessary, breast specialists complete the rule-based
system.
In conclusion, the final system for resolving the mammography radio-guided
biopsies has a hybrid nature. It presents two blocks, a rule-based block and an
ANN-based block, improving the performance of the complete system in relation to a
unique ANN-based solution.
[Figure 1 (first node of a Trepan tree): F (95, 39.6 F): 2of3 (sex=male, age<0.29,
albumin>0.3); F (94.1, 20.4 F): 1of6 (respiratory frequency<0.66, sex=male,
axillary temp>0.512, bronchoplegy=yes, albumin<0.28, urea<0.115)]
Figure 1 shows the first node of a Trepan tree. It shows two kinds of rules:
categorical rules and simple rules. The first node is a categorical rule with a
variable that has four possible states. For every state of this variable, Trepan
either generates a new rule set or gives a classification into the set with true
result (T) or the set with false result (F). It can be seen that 39.6% of the
examples have the cognitive function set to 'no damage', and 95% of these have a
false final evolution, so this kind of example is classified as false. Likewise,
20.4% of the examples have moderate damage and 94.1% of these have a false evolution,
so all these examples are classified as false. On the other hand, the examples with
light damage and high damage need a more complex rule, which will be generated in the
next tree nodes.
It can be seen that the variable 'cognitive function' carries the most information
in the set, because it sits at the top of the tree; but, on the other hand, this
variable does not have sufficient clinical sense to predict the patient's evolution
by itself. In this case the Trepan tree helps to discover that the data set is biased
by this variable.
6. Conclusions
We have shown that a good analysis of the collected data can help us to know better
the nature of the problem. Specifically, in mammography radio-guided biopsies, thanks
to this analysis we could detect several aspects of the problem that we did not know.
In the last application (evolution prediction for advanced-age patients), we use the
Trepan algorithm to study the solution given by the ANN-based system.
The methodology shown is divided into three phases: (1) to carry out a basic
pre-processing of the collected data and to generate the training and test pattern
sets; (2) to train the ANN and to evaluate its performance; and (3) to obtain and
study the criteria followed by the ANN to reach the final solution. Thanks to the
third step we can increase the medical reliance on the diagnosis given by the
ANN-based assistance system.
7. Acknowledgements
8. References
Introduction
- Spectra from samples with different medical diagnoses can overlap, producing an
  impression of homogeneity or common causality.
- The identification of metabolites with low concentrations becomes very difficult,
  or almost impossible, with the common techniques.
- The presence of noise, or a traditional statistical analysis, can provoke a loss
  of information related to small biochemical changes, which may represent highly
  important facts from a clinical point of view.
Fig. 1. An example of the spectra's variability. The image shows a perchloric acid
proton spectrum, extracted from a brain tumor biopsy, with the diagnosis of a
high-grade Glioma. The differences between both spectra can be noticed in the marked
metabolites (Ala, NAA, etc.).
The first part of this work intends to analyze the principal problems responsible
for that kind of behavior. The second part is dedicated to the study of Artificial
Neural Networks as a computational tool for solving those difficulties and for
spectra classification. In the third part we end with the proposal of a distributed
object-oriented system for automated diagnosis.
NMR spectroscopy
- Its height, which represents its intensity and indicates the metabolite's
  concentration in the studied sample. It is equivalent to the ordinate y in the XY
  plane.
Starting from the spectrum of a normal brain, shown in figure 2, one can determine
the typical metabolite resonances, which can serve as a criterion for later
classification purposes.
The determined metabolites are:
1. Lactate's (Lac) H3 protons.
2. Alanine's (Ala) H3 protons.
3. N-Acetylaspartate's (NAA) H6 protons.
4. The H4 protons of Glutamate and Glutamine (Glu and Gln).
5. Creatine's and phosphocreatine's CH3 and CH2 groups, respectively.
6. The trimethylammonium groups of Choline and its derivatives.
7. Taurine's H2 and H3 protons.
8. The H2 proton of Myoinositol.
The spectrum's formation process is very simple, and can be easily explained
through any two metabolites, for example Lactate and N-Acetylaspartate, which produce
the spectra shown in figure 3. The resulting spectrum, built on the basis of a 50:50
mixture of both metabolites, could be obtained by overlapping the former spectra, as
figure 4 shows.

Fig. 2. Proton NMR spectrum from a normal brain, with the main metabolites.

Following this approach it is easy to conclude that the more metabolites are
present in a spectrum, the harder it will be to read and classify it in order to
produce a safe diagnosis. All this leads us to the conclusion that the application of
Pattern Recognition techniques to the classification of NMR spectra could produce
quite interesting results for this and the other formerly mentioned problems. This
approach is not completely new, since signal processing has always been associated
with NMR spectra analysis; but in our work we would like to show the significance
that an integrated, completely automated, Neural Network based classification system
would have.
The pattern-recognition-based analysis and classification of spectra started in
the 80s with the works of Jeremy Nicholson and John Lindon. However, in spite of the
success that NMR techniques have obtained in the biochemical area, their use has not
been extended to many research groups in that field, mainly due to the lack of
experience and, basically, of effective real-time software systems. That is the main
reason why we think that an Integrated Classification Environment would be of great
importance for this field.
Pattern Recognition
Pattern recognition can be defined as the capacity to identify, analyze and
interpret a set of regularities, previously defined and characterized, within a
collection of objects (for example our metabolites) which are described through a set
of measurements. Those measurements are commonly affected by noise and by other, less
important, elements of the complex environment.
Our main task would be the combination of these techniques with NMR spectroscopy
knowledge in order to achieve the following goals:
- To detect subtle differences in the metabolites present in spectra with the same
  diagnosis. This ability would allow us to refine our final classification.
- To apply modern pattern recognition techniques to the analysis of noisy spectra,
  where it is very difficult to distinguish the important metabolites from the noise.
FID stands for Free Induction Decay, which represents the measurement acquisition
process.
Fig. 5. Flow of control in any Pattern Recognition System.
Figure 5 depicts the flow of control in any pattern recognition system, which we
have adapted to the problem of spectral classification.
In the following sections we will study the different steps of that process, with
the exception of the first and second ones, since they do not provide any important
information for our computational process.
The representation of the pattern we are trying to classify constitutes one of
the fundamental steps in the classification process, since its goal is the reduction
of the data vector's dimension; in other words, we are preparing the spectral data in
a comprehensive and reduced form for the neural network.
The main characteristic of the resulting vector, which we call the Feature Vector
(FV), is that it only contains the spectrum's relevant components from a biochemical
and a classificatory point of view. That is, its components are freed from any noise
and non-relevant data.
This FV can be obtained through any of the following approaches:
1. Feature Selection: This method is very simple, and it consists of directly
   selecting, from the original vector, a subset of components which represent the
   spectrum's main features. It commonly relies on the specialist's experience, for
   example a doctor's or a biochemist's.
Wavelet Transform
This method starts from a function named the "mother wavelet", which acts as a
prototype. This "wave" is translated and scaled in such a way that the N spectrum
components, or points, are transformed into N coefficients with the help of N
wavelets. Those wavelets form an orthogonal basis of the N-dimensional spectral
space. The resulting signal has the minimum of noise when the transformation and the
mother wavelet are correct.
Each one of the mentioned coefficients is calculated through the dot product of
the data vector with one of the basis functions.
The set of basis functions, as mentioned earlier, can be obtained from the mother
wavelet g_basic(t) through transformations as equation (1) (Tate, Anne Rosemary,
1996) shows:

    g_{a,b}(t) = (1/√a) g_basic((t - b) / a)    (1)
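Equation (1) and the dot-product computation of the coefficients can be sketched with the Haar wavelet as the mother wavelet; the text does not fix a particular wavelet, so this choice, the dyadic scales/translations, and the toy signal are assumptions for the example:

```python
import math

def haar(t):
    """Haar mother wavelet on [0, 1)."""
    if 0.0 <= t < 0.5:
        return 1.0
    if 0.5 <= t < 1.0:
        return -1.0
    return 0.0

def basis_function(a, b, n):
    """Sampled g_{a,b}(t) = (1/sqrt(a)) * g_basic((t - b) / a), eq. (1)."""
    return [haar((t - b) / a) / math.sqrt(a) for t in range(n)]

def dot(u, v):
    """Each coefficient is a dot product of data vector and basis function."""
    return sum(x * y for x, y in zip(u, v))

n = 8
signal = [4.0, 4.0, 4.0, 4.0, 0.0, 0.0, 0.0, 0.0]  # a toy "spectrum"

# One dyadic family of translated and scaled Haar wavelets.
basis = [basis_function(a, b, n)
         for a in (2, 4, 8)
         for b in range(0, n, a)]

coeffs = [dot(signal, g) for g in basis]
print([round(c, 3) for c in coeffs])
# [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.657]
```

Only the coarsest-scale coefficient is non-zero here, because the toy signal is constant within each fine-scale window; in the same way, components of a real spectrum that do not match any wavelet (e.g. broadband noise) contribute little to the retained coefficients.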
Once the principal features are extracted, one can start with the classification task,
through a study of the more frequently applied neural network topologies for the last
step of the pattern recognition process.
In the literature we found that Backpropagation Neural Networks (NN) constitute
the most frequently applied model; that is an input layer, one or more hidden layers
and an output one. Among the more frequent applications are:
• Interpretation of:
- Radiographs.
- Electrocardiograms.
- Dementia states.
- Blood analyses.
• Diagnosis of:
- Lung tumors.
- Breast tumors.
The main reason for this model's frequent application to medical problems lies in
the typical overlap that exists between the sets of values reporting malignant and
benign diagnoses.
We have chosen the Backpropagation model of NNs for this project's first phase,
on the basis of the following reasons:
• Backpropagation networks learn by example, which makes them a very
frequently applied model in medicine.
• New knowledge can easily be added to the network by including new examples in
the training set and retraining the network. This is very important, since the
final system should be usable by operators and other non-specialized staff.
• The input data do not need to follow any specific probability distribution.
• Once the network is trained, it can be applied to real-time problems.
Model features
Input layer
It contains the input neurons, whose role is to connect the hidden layer with the
input data determined in the former process. This layer has as many neurons as
input variables; in our case, ten.
Hidden layer
This represents the main processing component of the NN. The elements to take
into account at this point are the activation function, denoted AF(x), where x
represents the weighted sum of each neuron's inputs, since it decides the possible
activation of each neuron, and the total number of neurons to place in this layer.
In the case of the activation function, the end user is able to choose among
the following:
• the Sigmoid function, the most widely used transfer function,
• the Piecewise Step function,
• the Unit Step or Hard Limiter function, and finally
• the Gaussian function.
Regarding the total number of neurons, it is generally determined empirically
through a trial-and-error approach; in our case we left this decision to the end
user, allowing him, through an interactive process, to determine the number of
neurons in this layer.
Output layer
Generally this layer contains one neuron per classification class. For our problem
we have six possible classes: Astrocytomas, Meningiomas, Glioblastomas,
Oligodendrogliomas, Medulloblastomas, and non-malignant tumors, which represent
the different tumor classes to diagnose.
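The topology just described (ten inputs, a user-sized sigmoid hidden layer, six output neurons trained by backpropagation) can be sketched as follows; the hidden-layer size of 8 and the learning rate are illustrative assumptions, since the text leaves both to the end user:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Dimensions taken from the text: 10 input variables (the FV) and
# 6 output classes (tumor types). Hidden size 8 is a hypothetical choice.
n_in, n_hidden, n_out = 10, 8, 6

W1 = rng.normal(0, 0.1, (n_hidden, n_in))
W2 = rng.normal(0, 0.1, (n_out, n_hidden))

def forward(fv):
    """One forward pass: AF(x) applied to the weighted sum of inputs."""
    h = sigmoid(W1 @ fv)        # hidden-layer activations
    return sigmoid(W2 @ h), h   # output-layer activations

def train_step(fv, target, lr=0.5):
    """One backpropagation update on a single example (squared error)."""
    global W1, W2
    out, h = forward(fv)
    err_out = (out - target) * out * (1 - out)    # delta at output layer
    err_hid = (W2.T @ err_out) * h * (1 - h)      # delta backpropagated
    W2 -= lr * np.outer(err_out, h)
    W1 -= lr * np.outer(err_hid, fv)
    return float(np.sum((out - target) ** 2))

# Usage: train on one synthetic feature vector labelled as class 2.
fv = rng.random(n_in)
target = np.zeros(n_out); target[2] = 1.0
losses = [train_step(fv, target) for _ in range(200)]
```

Retraining with new examples, as the text notes, only requires extending the training set and repeating `train_step` over it.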
The previous sections have developed a background for this last part, where we
briefly describe how to use the methods and techniques mentioned earlier, to design
and implement a distributed diagnostic system.
Why distributed?
Since the practical application of our system will be, in its last phase, in real
cases, and looking for simplicity in its design, we planned it on the basis of a
distributed CORBA-compliant architecture.
The main reason why we chose CORBA as the architecture for distributed objects
is its independence with respect to:
• the implementation language, and
• the working platform.
On the basis of this architecture, we have designed a thin-client component, whose
main role would be the interaction with the end-user, leaving the hard processing
work to the server component. Figure 6 depicts this basic idea.
CORBA stands for Common Object Request Broker Architecture, a methodology for the design
and implementation of distributed applications for the net.
Under global services we grouped, basically, the first and third steps mentioned
earlier. Local services deal with the user-system interaction, storing locally the
ready-to-classify NNs, that is, NNs already modeled and trained, developed in the
second step.
In the near future we plan to allow communication among local clients, in order
to permit resource interchange, thereby eliminating redundant operations.
J.L. Fernández-Villacañas
1 Introduction
Evolutionary methods have been the focus of much attention in computer sci-
ence, principally because of their potential for performing partially directed
search in very large combinatorial spaces. Evolutionary algorithms (EAs) have
the potential to balance exploration of the search space with exploitation of
useful features of that search space. However, the correct balance is difficult to
achieve, which places limits on what can be predicted about an algorithm's
behaviour. In addition, EAs are often implemented in system-specific ways, making
it very difficult to compare results across different implementations.
A similar problem exists in evolutionary biology, and substantial progress
has been made in this area by choosing the proper levels of abstraction at which
to study natural systems (see, for instance [1] [2] and [3]). This suggests that
abstracting away from the comprehensive detail of EAs may generate rewards
in terms of our understanding of the evolutionary processes.
Several attempts have been made at establishing a methodology that deals
with measures of evolutionary processes in EAs (see, for instance, [4] [5]). The
justification of these approaches is, among others, to measure the present and
past performance of EAs, compare their current performance and predict their
future behaviour; it can also help in specifying the characteristics of proposed
EAs, understanding the reasons for observed EA performance, and providing the
know-how to tackle fundamental problems in EAs (i.e. scaling, transferability,
flexibility, evolvability).
The latter reason is based on the assumption that all EAs face fundamental
problems to do with their use for large scale applications. In general, computa-
tional EAs do not scale well from small to large problems, do not transfer well
from one problem domain to another and are not very flexible in response to
changing test problems. Biological EAs are arguably better, but we have not
been able to work out how to implement them in a feasible manner outside
their natural context. By developing and using measures on evolutionary sys-
tems we are likely to be able to quantify and learn more about how to solve
these problems.
It may be that these fundamental problems are all aspects of evolvability - the
capacity of systems to evolve - and it has been argued elsewhere (i.e. [6] [7]) that
we may be able to measure aspects of evolvability. Understanding evolvability
would yield substantial benefits in the application of EAs to real problems.
Measures of evolutionary processes in EAs have been derived from a number
of different sources: theory of animal breeding and theoretical genetics ( [6] [8] [9]),
study of natural selection ([1]), adaptive landscape theory ([10]) and ALife
modelling ([11]).
In summary, the advantages to be gained from developing measures of evolu-
tionary processes in EAs strongly suggest that research incorporating this area is
essential to the development of EAs for real-world applications. As Mitchell and
Forrest [12], discussing the relation of genetic algorithms to Artificial Life, write:
" . . .the formulation of macroscopic measures of evolution and adaptation, as
well as descriptions of the microscopic mechanisms by which the macroscopic
quantities emerge, is essential if artificial life is to be made into an explanatory
s c i e n c e . . . " and "... we consider it an open problem to develop adequate criteria
and methods for evaluating artificial life systems.". Their comments still apply
strongly to the whole fields of evolutionary computation and Artificial Life and
should be acted upon.
In this paper we present one specific evolutionary measure applied to a par-
ticular EA based on Genetic Programming, BTGP, and to a specific problem
(filtering of Boolean query trees for a classification problem). The outline of the
paper is as follows: after this introduction, Section 2 will give a brief description
of BTGP while Section 3 will describe the real-world application. In Section 4,
we will implement the evolutionary measure and a new evolvability diagram will
be introduced and applied to the information retrieval task with BTGP. Finally,
Section 5 will contain the conclusions of our work and some future directions of
research.
application of the genetic operators to produce the children of the next gener-
ation. The BTGP has many configuration options (see [13]), but for the experi-
ments described in this paper the following options were used:
The phenotypic representation is a Boolean decision tree. Each node of this tree
is either a function node taking one of the values AND, OR, NOR, NAND, or a
leaf-node variable which references a particular keyword. For a given training or
test case each keyword variable will be instantiated to the value 1 or 0, denoting
the presence or absence (respectively) of the corresponding keyword for that
case. A tree which evaluates TRUE for a positive case or FALSE for a negative
case has thus correctly classified that case.
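The representation just described can be sketched as follows; the tree encoding and the left-fold over multiple children are illustrative assumptions, not BTGP's actual internals:

```python
# Evaluating a Boolean query tree of the kind the text describes: internal
# nodes are AND/OR/NAND/NOR, leaves are keyword variables instantiated to
# 1 or 0 for a given case.

OPS = {
    "AND":  lambda a, b: a and b,
    "OR":   lambda a, b: a or b,
    "NAND": lambda a, b: not (a and b),
    "NOR":  lambda a, b: not (a or b),
}

def evaluate(node, case):
    """Recursively evaluate a tree; `case` maps keyword -> 0/1 presence."""
    if isinstance(node, str):                 # leaf: keyword variable
        return bool(case.get(node, 0))
    op, children = node[0], node[1:]
    result = evaluate(children[0], case)
    for child in children[1:]:                # fold over the 2..4 children
        result = OPS[op](result, evaluate(child, case))
    return result

# Usage: a tiny query tree over keywords appearing in Fig. 1.
tree = ("OR", ("AND", "database", "tutorial"), "vintage")
print(evaluate(tree, {"database": 1, "tutorial": 1}))  # True
print(evaluate(tree, {"car": 1}))                      # False
```

A tree returning True for a positive case or False for a negative case has classified that case correctly.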
The fitness function is evaluated over a set of training or test cases. It is
parameterised by the following values: the number of correctly identified positives
npos, the number of negatives falsely identified as positive nneg, the total number
of positives Npos, and the total number of negatives Nneg. The fitness function
is designed to minimise both the number of missed positives and the number of
false positives:

f = α · (Npos − npos)/Npos + β · nneg/Nneg

Note that α and β and the function lie in the range [0, 1], with 0 being the
best possible fitness and 1 the worst. The aim is therefore to minimise its value.
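As a direct sketch of this fitness function; the equal weights α = β = 0.5 are an assumption (the text only states the range, not the weights):

```python
def fitness(n_pos, n_neg, N_pos, N_neg, alpha=0.5, beta=0.5):
    """0 is best (all positives found, no false positives), 1 is worst.

    alpha and beta are assumed to sum to 1 so that f stays in [0, 1].
    """
    missed_positives = (N_pos - n_pos) / N_pos   # fraction of positives missed
    false_positives = n_neg / N_neg              # fraction of negatives passed
    return alpha * missed_positives + beta * false_positives

print(fitness(n_pos=100, n_neg=0, N_pos=100, N_neg=100))    # perfect: 0.0
print(fitness(n_pos=0, n_neg=100, N_pos=100, N_neg=100))    # worst: 1.0
```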
The data set was generated from a known decision tree illustrated in Fig. 1.
It has 16 keywords, a training set of 200 cases and a test set of 50 cases. The
training and test cases were chosen randomly from the 2^16 possible keyword
configurations such that each set contained an equal number of positive and
negative cases.
[Fig. 1. The known decision tree used to generate the data set, built from AND/OR
nodes over keywords such as vintage, antique, collector, car, vehicle, transport,
design, programming, construction, database, tutorial, and beans.]
4 Evolutionary Measure
A new diagram for measuring and comparing the performance of one or several
algorithms on the same task is proposed here. Taking inspiration from
astrophysics, the Hertzsprung-Russell diagram [14] [15], otherwise known as the
HR diagram, allows us to track the evolution of every star in the Universe in
a simple two-dimensional diagram where temperature is plotted against the
star's magnitude, or the logarithm of its luminosity referred to a standard star,
generally the Sun. In this diagram, stars with different chemical composition and
mass evolve through well-studied paths that are the result of their internal phys-
ical phenomena guided by the laws of physics. Different temperatures alter the
nuclear reactions in their interiors, while their luminosity is a balance between
the radiation pressure trying to escape the outer layers of the star and the grav-
itational collapse of the latter, which increases the optical depth, thus trapping
the photons.
The simile with EAs appears if we consider the mutation rate to play the
role of temperature (or level of agitation), while the ratio between exploitation
and exploration resembles the luminosity (actually the inverse of the luminosity).
Exploitation can be measured as the inverse of the mean fitness of the population,
while exploration can be realised as the fitness variance. Therefore we can study
the time dependence of an algorithm while changing the degree of mutation,
drawing conclusions on the capabilities of the algorithm to explore and exploit.
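The two measures just defined can be sketched as follows; the toy mutation-only bitstring EA below merely stands in for BTGP, and all its parameters are illustrative assumptions:

```python
# For each mutation rate, run the EA and record exploitation (inverse mean
# fitness) and exploration (fitness variance); plotting their ratio against
# the mutation rate gives one track of the proposed evolvability diagram.
import random
import statistics

def run_ea(mut_rate, pop_size=50, generations=30, genome_len=16, seed=1):
    rnd = random.Random(seed)
    fit = lambda g: sum(g) / len(g)        # fitness in [0, 1], 0 is best
    pop = [[rnd.randint(0, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fit)                                  # best first
        pop = [list(p) for p in (pop[:pop_size // 2] * 2)] # truncation selection
        for g in pop:
            for i in range(genome_len):
                if rnd.random() < mut_rate:
                    g[i] ^= 1                              # bit-flip mutation
    fits = [fit(g) for g in pop]
    exploitation = 1.0 / (statistics.mean(fits) + 1e-9)    # inverse mean fitness
    exploration = statistics.variance(fits) + 1e-9         # fitness variance
    return exploitation, exploration

# One diagram point per mutation rate, as in the experiments below.
points = [(m, run_ea(m)) for m in (0.01, 0.1, 0.5)]
```

High mutation keeps the population agitated (low exploitation), while low mutation lets selection drive the mean fitness down.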
The proposed evolutionary parameters have been measured for our algorithm
(BTGP) and task (information retrieval). Some of the measures' specifications
will depend upon the nature of the algorithm itself and its representation of
solutions (Boolean trees), where phenotype and genotype are the same. Further-
more, some will be influenced by the definition of fitness and the sampling of our
fitness landscape derived from the task.
In the remainder of this section we will present some preliminary descriptive
results and discuss them. For testing purposes we have fixed a few BTGP
parameters: 100 individuals, 100 generations, maximum tree depth of 4 lev-
els, roulette-wheel selection (unless otherwise indicated), ramped half-and-half
tree generation, and a branching factor between 2 and 4. The data set and fitness
function used have been discussed in Section 2.
With these settings, BTGP was executed for 13 different values of the mu-
tation rate: 0, 0.01, 0.05, and then 10 values evenly spaced between 0.1
and 1.
Mean fitness, ] , slowly decreases with generations for rate 0.2, but as soon as
mutation is increased to 0.5 or 0.8, ] oscillates around a constant value and does
not decrease further. For smaller values of the mutation rate (e.g. Mutrate =
0.05), the mean fitness decreases even further.
Fitness variance, av, 2 does not change significantly in the course of the run
for mutation rates Mutrate > 0.1). Nevertheless the rate is decreased to small
values (i.e. M u t ~ t ~ = 0.05), variance seems to increase steadily as generations
go by.
Statistics on the degrees of exploitation and exploration at different muta-
tion rates were gathered for BTGP. The result is the diagram shown in Fig. 2.
Fig. 2. Evolvability diagram for BTGP and information retrieval.
Directions for future work could include the extension of these measures
and the evolvability diagram to other algorithms applied to the same and different
problems. In particular, the exploration of algorithms with a non-trivial mapping
between genotype and phenotype could lead us to establish some conclusions
on the suitability of algorithms to tasks and to compare different algorithms'
performances, linking these to evolvability.
References
1. Endler, J.A., "Natural Selection in the Wild", Princeton, NJ, Princeton University
Press, 1986.
2. Hofbauer, J. and Sigmund, K., "The Theory of Evolution and Dynamical Systems",
Cambridge, Cambridge University Press, 1988.
3. Roff, D., "Evolutionary Quantitative Genetics", London, Chapman and Hall, 1998.
4. Fernández-Villacañas, J.L., Marrow, P., Shackleton, M., submitted to GECCO'99,
1999.
5. Bedau, M.A., Snyder, E., Brown, C.T. and Packard, N.H., A comparison of evo-
lutionary activity in artificial evolving systems and in the biosphere, in "Fourth
European Conference on Artificial Life", P. Husbands and I. Harvey (Eds.), pp.
125-134, Cambridge, MA, MIT Press, 1997.
6. Altenberg, L., The evolution of evolvability in genetic programming, in "Advances
in Genetic Programming", K.E. Kinnear Jr. (Ed.), pp. 47-74, Cambridge, MA, MIT
Press, 1994.
7. Wagner, G.P., Altenberg, L., Complex adaptations and the evolution of evolvabil-
ity, "Evolution" 50, 967-976, 1996.
8. Falconer, D.S., "Introduction to Quantitative Genetics", 3rd. ed. Harlow, Long-
man, 1994.
9. Mühlenbein, H., The equation for the response to selection and its use for predic-
tion, "Evolutionary Computation" 5, 303-346, 1998.
10. Hordijk, W., A measure of landscapes, "Evolutionary Computation" 4, 335-360,
1997.
11. Bedau, M.A. and Packard, N.H., Measurement of evolutionary activity, teleology
and life, in "Artificial Life II", C.G. Langton, C. Taylor, J.D. Farmer and S. Ras-
mussen (Eds.), pp. 431-461, Redwood City, CA, Addison-Wesley, 1991.
12. Mitchell, M. and Forrest, S., Genetic algorithms and Artificial Life, "Artificial Life"
1, 267-289, 1995.
13. Fernández-Villacañas, J.L. and Exell, J., BTGP and information retrieval, in "Pro-
ceedings of the Second International Conference ACEDC'96", PEDC, University
of Plymouth, 1996.
14. Hertzsprung, E., Ueber die Verwendung photographischer effektiver Wellenlaengen
zur Bestimmung von Farbenaequivalenten, "Publikationen des Astrophysikalischen
Observatoriums zu Potsdam", 22. Bd., 1. Nr. 63, 1911.
15. Russell, H.N., Nature, no. 93, 252, 1914.
Artificial Neural Networks as Useful Tools for
the Optimization of the Relative Offset between
Two Consecutive Sets of Traffic Lights
1 Introduction
One of the most difficult problems in urban traffic control is to decide the optimal
offset between two consecutive sets of traffic lights. To illustrate the basic concepts of
the problem, suppose that A and B are two consecutive traffic lights and that the
vehicles drive from B to A.
Fig. 1. A sketch of the essential characteristics of the problem: signal A with its
queue (last vehicle ALast), the detector, and signal B (first vehicle BFirst). Our aim
is to adjust the offset between signals A and B in such a way that the vehicle BFirst
reaches the vehicle ALast when the latter reaches the stop line at A.
This paper is based upon data provided by the Traffic Control Department of the city of
Gijón (Spain). The author would like to thank the Gijón City Council for its helpful
collaboration.
Let QA and QB be the queues of vehicles stopped in front of the signals A and B,
and let ALast and BFirst be the last vehicle in queue QA and the first one in queue QB,
respectively. The difference between the instants at which the green phases of
signals A and B start is called the relative offset between A and B. The problem is how
to coordinate A and B in such a way that the vehicle BFirst reaches the vehicle ALast
when the latter is crossing the stop line of signal A. See Figure 1.
The early researchers in urban traffic control tried to solve the problem by searching
for an offset value near the optimum, computed for a statistically estimated average
queue length and rate of queue output. These solutions were obtained independently
of the real length of the queue QA. This is not surprising, since computing this
length is a hard task. Neural networks have proved to be a powerful tool to
compute it.
To improve the offset we need extra information about the traffic behavior
in the network. An induction loop detector buried in the road provides this
information. The detector is able to supply the number of vehicles crossing over it
every 5 seconds and the length of time it was driven over. These data are cyclically
recorded in two patterns called flow and occupancy profiles (see Fig. 2). It is supposed
that the detector is placed upstream, close to the stop line.
Fig. 2. A pair of flow and occupancy profiles represented as integer arrays, spanning
alternating Green and Red windows. Each component corresponds to flow and
occupancy data measured over 5 seconds, respectively. Some of the first rectangles
included in the Green window could correspond to vehicles starting from the queue.
Obtaining this information directly from real traffic nets is very complicated, since we
should force some situations close to congestion, constituting a critical risk, and there
are other special situations quite difficult to obtain because the traffic demand would
need to be modified.
To eliminate these difficulties we have built up a realistic traffic simulator, able to
provide accurate profiles of flow and occupancy and capable of managing the queue
output efficiently. So we can generate a lot of profiles in many different traffic
situations.
Since our simulator manages the position of the vehicles and their state (stationary or
driving) every hundredth of a second, the part of the profile corresponding to vehicles
in the queue is known, as well as the value t0. With these data we have trained an ANN.
In this way an algorithm has been generated to decide whether t0 exists and to
determine its value in the positive case. These results were successfully tested on a
collection of real data.
Once the problems of generating reliable profiles and of obtaining the value t0
have been solved, it was necessary to compute the optimal offset in terms of the value
t0 and to build an algorithm to implant it in the network. However, a question arises
immediately: how can we reach the optimal offset from the current offset? The
problem of searching for a set of decisions to optimize the offset (let's say, what kind
of actions could or should be taken) is not easy to solve. In most cases it won't be
possible to correct the offset deviation in only one step. Moreover, each action taken
to move the offset can modify the previously calculated value t0. Thus, before the
process of optimizing the offset concludes, the value t0 could have been modified.
An efficient method to reach the optimal offset was proposed; it is based on the
supposition that there is always green to excess, that is, the green supplied is long
enough to satisfy the demand of vehicles.
Finally, we present the results of a collection of trials carried out on the simulator
in many different situations, as well as some important conclusions.
2 The Simulator
When designing the simulator we took into account two main characteristics. On the
one hand, it was necessary to move the vehicles in a realistic way in order to have
reliable profiles; so we needed to model the behavior of two different processes: the
queue output (ahead) and the queue input (behind). On the other hand, it was also
necessary to have the capability of implanting the required actions to optimize the
offset.
To solve the first problem (queue output) we designed a street experiment
consisting of the observation of queues (more specifically, of the vehicles leaving
the queue) under different traffic conditions and in different lanes, always containing
at least ten vehicles. In this way we obtained a sample of 150 observations
representing the behavior of vehicles when they start leaving a queue. A recording
was made for each vehicle; it contains its position in the queue as well as two
noticeable instants, starting and reaching the stop line, taking as reference the
instant at which the green phase begins.
With these data, and assuming that a vehicle is separated from its neighbors by an
average distance of 1.5 meters and that vehicles undergo uniformly accelerated
movement, the statistical distribution of two random variables has been obtained for
each vehicle in the queue, allowing us to simulate the queue output:
• Starting instant, computed from the beginning of the green phase. A simple
examination of the real data shows that the starting instant of the first
vehicle in the queue approaches a uniform distribution with a mean of 1.97 and a
standard deviation of 1.056, and that the difference between the starting instants of
two consecutive vehicles also approaches a uniform distribution with a mean of
1.44 and a standard deviation of 0.56.
• Vehicle acceleration. Considering that the vehicles start from a stationary state, the
equation for uniformly accelerated movement reduces to e = a·t²/2. Since we
know the instant at which every vehicle crossed the stop line as well as its
position in the queue, and considering this position as a measure of the distance to
the stop line, we can deduce the uniform acceleration associated with each vehicle. In
this way we are able to estimate the acceleration of each vehicle depending only on
its position within the queue.
Place in queue 1 2 3 4 5 6 7 8 9
Acceleration 5.95 2.97 2.20 1.73 1.54 1.31 1.12 1.02 0.95
These data have been submitted to the Curve Estimation method of SPSS [10] to fit
an inverse model.
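The inverse-model fit just mentioned can be sketched as follows; the text names SPSS's Curve Estimation, and this reproduces the same idea (a ≈ b0 + b1/p, fitted by ordinary least squares) with numpy on the acceleration table above:

```python
import numpy as np

position = np.arange(1, 10)                       # place in queue, 1..9
accel = np.array([5.95, 2.97, 2.20, 1.73, 1.54, 1.31, 1.12, 1.02, 0.95])

# Design matrix for the inverse model a = b0 + b1 * (1/p).
X = np.column_stack([np.ones_like(position, dtype=float), 1.0 / position])
(b0, b1), *_ = np.linalg.lstsq(X, accel, rcond=None)

predicted = b0 + b1 / position                    # fitted accelerations
```

The fitted curve lets the simulator assign an acceleration to any queue position, not just the nine observed ones.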
Table 3. A set of results of the Friedman two-way ANOVA test, where Pi is the position
in the queue of vehicle i.
Position in queue: P1   P2   P3   P4   P5   P6   P7   P8   P9   P10
Significance:      0.1  0.8  0.8  0.6  0.4  1    1    0.6  0.8  0.4
In this way the behavior of the simulator with regard to the process of queue output
has been validated.
To simulate the model of the queue input we have supposed that every vehicle is
driven at the average speed until its distance to the previous one becomes less than 30
meters. At that moment it begins to brake until its speed equals the speed of the
previous vehicle. To compute the deceleration and the time spent braking, we
have used again the equations of uniformly accelerated movement:

Vf = V0 + a·t  and  e = V0·t + a·t²/2    (2)

The variables Vf (speed of the previous vehicle), V0 (speed of the objective vehicle)
and e (30 meters) are known, and then we can compute a and t.
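Solving equation (2) for a and t is a small closed-form computation: substituting a = (Vf − V0)/t into the second equation gives e = t·(V0 + Vf)/2, hence t = 2e/(V0 + Vf) and a = (Vf² − V0²)/(2e). A minimal sketch:

```python
def braking(v0, vf, e=30.0):
    """Deceleration a and time t to go from speed v0 to vf over e meters,
    under uniformly accelerated movement (equation (2))."""
    t = 2.0 * e / (v0 + vf)       # from e = t * (v0 + vf) / 2
    a = (vf - v0) / t             # negative when the vehicle decelerates
    return a, t

# Usage: a vehicle at 15 m/s braking to match a 10 m/s vehicle 30 m ahead.
a, t = braking(v0=15.0, vf=10.0)
```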
To complete the simulator, it should be equipped with the possibility of modifying
the signal timing. So we have implemented and included in the simulator a
virtual traffic regulator capable of receiving control messages and of putting them into
action on the street.
Our approach to finding the value t0 is based on the idea that the speed of the
vehicles in the queue is slightly different from that of vehicles circulating without any
restriction. So we go through the profiles supplied by the simulator
searching for that instant, if it exists. Unfortunately, this is a very difficult decision if
only one measuring instant is considered. Therefore, to detect this difference it will be
necessary to consider a window including three consecutive measuring instants.
Since the detector is placed very close to the stop line, the queue usually extends
beyond the detector. In fact, there are three different possibilities depending on the
queue length. If there is no queue, or if the queue needs a longer output time than the
green phase, then the instant we are searching for doesn't exist; otherwise it must lie
within the green phase. In this way the problem space can be restricted to the instants
associated with the green phase, and then a collection of all the sets including three
consecutive pairs of flow and occupancy profiles is generated. Attached to every set,
we record whether the intermediate instant corresponds to the last vehicle in the
queue passing over the detector or not. This process is carried out on each profile.
Data provided by the simulator are used to train an artificial neural network. As in
[4], our aim is to compute a function like this:
last_vehicle_in_queue?(measure_instant) : truth_value.
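The windowing step just described can be sketched as follows; the profile values and the slice bounds are illustrative, and the window features would feed the ANN classifier in place of the predicate above:

```python
# Slide a window of three consecutive measuring instants over the
# green-phase part of a pair of flow/occupancy profiles. Each window's six
# values are the features for one training example; the label (whether the
# middle instant is the last queued vehicle's passage) comes from the
# simulator's ground truth and is not reproduced here.

def windows(flow, occupancy, green_slice):
    """Yield (features, middle_index) for each 3-instant window."""
    lo, hi = green_slice
    for i in range(lo + 1, hi - 1):
        feats = (flow[i - 1], flow[i], flow[i + 1],
                 occupancy[i - 1], occupancy[i], occupancy[i + 1])
        yield feats, i

# Usage: profiles measured every 5 s; the green phase covers instants 0..7.
flow      = [2, 3, 3, 2, 1, 1, 0, 0]
occupancy = [5, 5, 4, 2, 1, 1, 0, 0]
samples = list(windows(flow, occupancy, (0, 8)))
```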
Even when a profile was misclassified, or when the estimation error was greater
than 1, the traffic conditions were close to saturation. That is, the saturation level
(the percentage of the green phase required to clear the queue) was close to 100%,
and therefore the error becomes less important in terms of traffic control, because it
is very difficult to improve the offset in this situation, as was said above. This fact is
also confirmed by two other parameters: the spare time of the green phase and the
time with maximum occupancy. See Table 5.
Table 5. Relation between the error of the estimated value t0 and the saturation level of the
green phase.

Error of estimated t0      Average time of the cycle     Average time of the cycle with     Saturation level
(in measuring instants     with flow and occupancy       flow equal to 0 and occupancy      of the green
of the profile)            equal to 0 (spare time)       equal to 5 (detector continuously  phase
                                                         stepped)
0 or 1                     14%                           41%                                80%
≥ 2                        15%                           35%                                86%
Misclassification          3%                            46%                                93%
At this point we are able to compute the value t0 through an ANN. To complete
our task we need to compute the optimal relative offset and to design an algorithmic
solution to reach this optimal offset. To do this it is necessary to introduce some
previous concepts. Figure 3 shows a schematic representation of the problem: it
describes the traffic evolution throughout a cycle.
Fig. 3. Evolution of traffic (which has a cyclical and therefore repetitive
character) throughout a cycle, with the queue, the platoon, and the green and red
phases marked. The cycle is split into the usual green and red phases. In order to
simplify the notation we have made two particular instants coincide: the beginning of
the cycle and the beginning of the green phase. This restriction doesn't suppose any
loss of generality in the system representation. The location of the vehicles at the
beginning of the cycle is represented on the horizontal axis. Oblique lines allow us to
represent the instant at which the vehicles reach the stop line, and their slope
corresponds to the multiplicative inverse of the traffic speed. Obviously the queue of
vehicles formed in front of the signal has an output speed lower than the free
circulation speed, and therefore its related line has a slope bigger than the remaining
ones. So, the origin represents both the beginning of the cycle (vertical axis) and the
location of the controlled signal (horizontal axis).
First of all, we need to define exactly the term optimal relative offset. Let D be
the location, at the beginning of the cycle, of the first vehicle coming from the signal B.
Then tp represents the instant at which such a vehicle would reach the signal under the
hypothesis that it circulates along the street without any restriction; tp
would be the optimal relative offset between the signals A and B if there were no
queue. The instant at which the last vehicle in the queue of signal A arrives at the
stop line of A will be called the output time of the queue; it was already denoted by t0
in the Introduction. According to the situation represented in Fig. 1, we can see that
the platoon located at D reaches the signal A at an instant tp later than t0.
Consequently there exists an offset deviation Δ = tp − t0. This deviation can also be
interpreted in the following sense: the head of the platoon D should have arrived at
the position P at the beginning of the cycle. We say that the relative offset is optimal
when Δ = 0.
In the case Δ ≠ 0, our objective consists of designing and implementing some actions
to reduce the magnitude of Δ until making it 0. From this viewpoint, in the situation
of Fig. 1 it seems reasonable to produce at signal A a delay of Δ seconds in the
opening of the green phase, or an advancement of Δ seconds at signal B. Both
solutions are symmetrical, and from now on we suppose that actions to modify the
offset will be taken only at signal A. Unfortunately, the decision to delay the
beginning of the green phase by Δ seconds implies a temporary increase of the cycle
length. This fact constitutes a critical point for any offset strategy. As a first
consequence, the junction could turn out to be jammed. Moreover, the delay Δ can
become non-optimal. Let us see.
• Possibility of congesting the junction. To point out this circumstance it is necessary
to introduce some other terms. Let us call Supply the current length of the green
phase, and Demand the length of the green phase required to clear a queue
containing all the vehicles entering the lane each cycle. In our approach the
restriction Supply > Demand is assumed for all cycles (non-saturation
hypothesis); otherwise there would be no strategy able to improve the offset,
since the queue would increase continuously until reaching the saturation point of the
green phase, without any possibility of reducing it. A brief analysis of the problem
of implanting a new offset shows that a modification of Δ seconds in the
cycle might be too much to guarantee the restriction Supply > Demand. In fact, the
violation of the inequality could be temporary; in this case the non-
saturation hypothesis would be fulfilled again some cycles later. But this
possibility is not fully satisfactory: although the profiles provide us with enough
information to detect saturation, it is difficult to check whether this saturation is going
to be temporary or whether it hides a new state of permanent saturation, unless we
decide to wait (maybe indefinitely) until a situation of non-saturation is detected
again. Obviously this risk is critical.
• Possibility of generating an infinite loop. In any case, the action of modifying the
cycle length could temporarily modify the relationship between the supply and the
demand, which would have consequences on the queue length and therefore on the
value t0. Then a paradoxical effect could take place: by the time the solution was
reached, it would already have stopped being optimal, so that we would be trying to
pursue a moving target and would have to restart the process with a new goal. This
process could continue indefinitely. Our only chance consists of achieving a
convergent displacement of the goal.
In conclusion, any offset strategy should maintain the intersection in a non-saturation state and should converge to the optimum. To do this we have implemented an algorithm to compute the maximum offset variation (upper bounded by Δ) compatible with the non-saturation state. Thus, any feasible decision has to satisfy the inequality t0 < GL (see Fig. 3), where GL is the length of the green phase. In this way, the offset also converges to the optimum.
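The bounding rule above can be sketched as follows. This is an illustrative reconstruction, not the authors' algorithm: the function and parameter names (`desired_shift`, `delta_max`) are our own, and the clipping policy is only one plausible reading of "maximum offset variation compatible with the non-saturation state".

```python
def bounded_offset_step(desired_shift, t0, green_length, delta_max):
    """One cycle of an illustrative offset-correction step.

    desired_shift: offset error (seconds) we would like to remove this cycle.
    t0: estimated time needed to clear the queue within the green phase.
    green_length: length GL of the green phase.
    delta_max: upper bound (the Delta of the text) on any per-cycle change.
    """
    margin = green_length - t0  # room left before saturation (t0 < GL)
    if margin <= 0.0:
        return 0.0  # saturated: no feasible offset action this cycle
    # First respect the global per-cycle bound Delta ...
    step = max(-delta_max, min(delta_max, desired_shift))
    # ... then keep the change small enough that Supply > Demand keeps holding.
    return float(max(-margin, min(margin, step)))
```

Iterating this clipped step each cycle is what makes the displacement of the goal convergent rather than oscillatory.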
The simulator is a useful tool to evaluate this strategy. To test the efficiency of the algorithm we have fed the simulator with a wide collection of traffic situations representing many different relative offset problems. In 92% of the cases the offset error was reduced below 5 seconds in at most 4 cycles. Moreover, when the offset error could not be corrected, the queue was too long (occupying almost the whole green phase) and the offset then lost importance in terms of traffic control.
Figure 3 shows a situation where the offset is not optimal. As stated above, the offset can be optimized by increasing the cycle length. Figure 4 describes another non-optimal situation, requiring a decrease of the cycle to correct the offset. The action is now completely the opposite, but our algorithm maintains its effectiveness because its performance is independent of the action as long as the non-saturation restriction is continuously satisfied.
Although the scope of this paper is more restricted, it is important to point out two other considerations. On the one hand, any of the situations described in figures 3 and 4 could also be optimized by advancing or delaying the beginning of the green phase. This is due to the recurrent behavior of traffic when it is regulated by traffic lights. On the other hand, to increase or decrease the cycle it is necessary to modify the length of the green and/or red phases. Nevertheless, the consequences of the actions are not the same depending on the order in which the phases are considered. All the possibilities were included in the algorithm, and the results were always analogous.
[Figure: lane diagram showing the queue, the platoon, the green phase GL, and the times t0 and tp]
Fig. 4. Another typical situation with a non-optimal offset. The offset deviation is now
negative.
5 Conclusions
In [4] we proved that the value of t0 could be accurately estimated through an artificial neural network in some particular situations, depending on the location of the detector in the lane. Now, with the aid of the simulator, we can make reliable estimations of the value of t0 in any case, with errors smaller than 11.6% of the queue.
To compute queues, some traffic controllers ([7], [8], [9]) use accumulative algorithms based on the idea that the queue length in cycle t+1 can be obtained by adding to the queue in cycle t the estimated number of vehicles entering the lane in cycle t+1, minus the number of vehicles leaving the lane along the green phase of cycle t+1. This method may fail sometimes, since these parameters are difficult to compute accurately, and the error in the estimation of the queue length could then grow indefinitely. Our approach avoids this risk because in every cycle the value of t0 is estimated independently of the former cycle.
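The drift of such accumulative algorithms is easy to illustrate. The sketch below is not taken from [7]-[9]; the constant-bias scenario is hypothetical. It shows how a small per-cycle overcount of arrivals accumulates in the recursive queue estimate, whereas a per-cycle estimate like ours resets the error each cycle.

```python
def accumulative_queue(initial_queue, arrivals, departures):
    """Recursive update q(t+1) = q(t) + in(t+1) - out(t+1), clamped at 0."""
    q, history = initial_queue, []
    for a, d in zip(arrivals, departures):
        q = max(0, q + a - d)
        history.append(q)
    return history

# Hypothetical scenario: true flow is balanced (10 in, 10 out per cycle),
# but arrivals are overcounted by 1 vehicle/cycle; the estimate drifts.
print(accumulative_queue(4, [11] * 5, [10] * 5))  # [5, 6, 7, 8, 9]
```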
Additionally, we have built an algorithm to correct the relative offset error that only needs the value of t0 as input. In this way, the actions required to correct the error become independent both of the actions taken in the former cycle and of their consequences. The results of this strategy were completely successful.
6 References
[1] Bahamonde, A.; López-García, S.; Hernández-Arauzo, P.; Bilbao, A.; Vela, C.R.: ITACA: An Intelligent Urban Traffic Controller. Proceedings of the IFAC Symposium on Intelligent Components and Instruments for Control Applications, SICICA'92. Málaga (1992) 787-792
[2] Bell, M.; Scemama, G.; Ibbetson, L.: CLAIRE: an expert system for congestion management. Proceedings of the DRIVE Conference. Brussels (1991)
[3] Forasté, B.; Scemama, G.: Surveillance and congested traffic control in Paris by expert system. Proceedings of the 2nd International Conference on Road Traffic Control. London (1986) 333-337
[4] Hernández-Arauzo, P.; López-García, S.; Bahamonde, A.: Artificial Neural Networks for the computation of traffic queues. Biological and Artificial Computation: From Neuroscience to Technology. LNCS, Vol. 1240. Springer-Verlag, Berlin (1997) 1288-1297
[5] Hernández-Arauzo, P.; Bahamonde, A.; López-García, S.: Sobre la calculabilidad del tiempo de desalojo de una cola de vehículos [On the computability of the clearing time of a vehicle queue]. Proceedings of the VI Conferencia de la Asociación Española para la Inteligencia Artificial, CAEPIA-95. Alicante, Spain (1995) 449-458
[6] Hernández-Arauzo, P.: Traffic queues computation. A virtual problems model. Ph.D. dissertation, Universidad de Oviedo at Gijón (1996) p. 104 + ii
[7] Hunt, P.B.; Robertson, D.I.; Bretherton, R.D.; Winton, R.I.: SCOOT: a traffic responsive method of coordinating signals. TRRL Report LR1014, Transport and Road Research Laboratory. Crowthorne (1981)
[8] Institute of Transportation Engineers Australian section: Management and Operation of
Traffic signals in Melbourne. Technical report. Melbourne. (1985)
[9] Lowrie, P.R.: The Sydney co-ordinated adaptive traffic system. Principles, Methodology
and algorithms. Proceedings of the International Conference on Road Traffic Signalling.
London. (1982). 67-70
[10] SPSS Inc.: SPSS-X User's Guide. McGraw-Hill, New York (1983)
[11] Zell, A. et al.: SNNS: Stuttgart Neural Network Simulator. User Manual, Version 4.1. Institute for Parallel and Distributed High Performance Systems. Technical Report No. 6/95 (1995)
ASGCS: A New Self-Organizing Network for Automatic Selection of Feature Variables
J. Ruiz-del-Solar, D. Kottow
Department of Electrical Engineering
Universidad de Chile
Casilla 412-3, Santiago, CHILE
Ph.: +56-2-6784207 / Fax: +56-2-6953881
E-mail: j.ruizdelsolar@computer.org
Abstract
1. Introduction
[Van Sluyters et al., 1990]. The receptive fields of these cells can be seen as feature
detectors, which are then modeled as Gabor Functions [Daugman, 1980]. These functions
have often been used as filters in technical systems.
In this context, it seems natural to follow this example-based learning strategy to automatically select the feature variables or, in other words, to automatically generate the invariant feature detectors (Gabor-like filters). Different approaches have been used to generate detectors of this kind by using neural models [Kohonen, 1995a; Sanger, 1989; Sirosh, 1995]. Among them, the adaptive-subspace SOM (ASSOM), proposed by Kohonen, stands out because of its simplicity and biological plausibility. TEXSOM, the first image processing architecture based on the ASSOM model, was recently proposed [Ruiz-del-Solar and Köppen, 1996 and 1997; Ruiz-del-Solar, 1998]. This architecture is suitable for performing texture segmentation and defect identification on textured images.
The main drawback of the application of the ASSOM model in image processing
systems is that a priori information is necessary to choose a suitable network size (the
number of feature variables/the number of Gabor-like filters) and topology in advance.
Moreover, in some cases the lack of flexibility in the selection of the network topology
(rectangular or hexagonal grids) makes it very difficult to cover some areas of the (two-
dimensional) frequency domain with the filters.
As a first improvement of the ASSOM, the Supervised ASSOM (SASSOM), proposed in [Ruiz-del-Solar and Köppen, 1996], automatically sets the number of neurons (or filters) in the network equal to the number of classes under consideration in the classification process. However, the lack of flexibility in the selection of the network topology still remains an important drawback.
On the other hand, the Growing Cell Structures (GCS) network, proposed in
[Fritzke, 1994], improves existing SOM models by selecting automatically the network
size and topology from the input data. GCS corresponds to a self-organizing network that
grows until a performance criterion is met.
The main purpose of this article is to present the Adaptive-Subspace GCS (ASGCS) network, which corresponds to a further improvement of the ASSOM. The ASGCS network introduces some GCS concepts into the ASSOM model. These new concepts allow the automatic selection of the number of feature variables (the number of filters or neurons in the network) and of the topology (not only rectangular or hexagonal grids) of the network. The article is organized as follows. The ASSOM model is explained in section 2. The proposed ASGCS network is presented in section 3. The generation of Gabor-like feature filters using the ASGCS and the ASSOM models is shown and compared in section 4. Finally, in section 5, a summary of this work and some conclusions and projections are given.
parametric reference vector, but by basis vectors that span a linear subspace. The comparison of the orthogonal projections of every input vector onto the different subspaces is used as the matching criterion by the network. If one wants these subspaces to correspond to invariant-feature detectors, one must define an episode (a group of vectors) in the training data, and then locate a representative winner for this episode. The training data are made of randomly displaced input patterns. The generation of the input vectors belonging to an episode differs depending on whether translation-, rotation-, or scale-invariant feature detectors are to be obtained [Kohonen, 1995a, 1995b, 1996]. In either case, the learning rule of the ASSOM architecture is given by [Kohonen, 1995b]:
(2)

2. Updating the basis vectors of the representative winner and its direct topological neighbors $N_s$ as follows:

with

$\alpha'(t_p) = \varepsilon_s$ for the winner
$\alpha'(t_p) = \varepsilon_c \quad \forall c \in N_s$   (6)
6.3. Look for the direct neighbor of q with the largest distance in input space. This is a cell f satisfying the condition that the sum of the projections of the basis vectors of q onto the subspace of f is minimum:

$f = \arg\min_{c \in N_q} \sum_{h} \left( \mathbf{b}_h^{(c)}, \mathbf{b}_h^{(q)} \right)$   (11)
6.4. Insert a new cell r between q and f in such a way that one again obtains a structure consisting only of simplex structures of dimension k.
6.6. Perform a redistribution of the counter variables $\tau_c$ of the topological neighbors of r:

$\Delta\tau_c = \ldots \quad \forall c \in N_r$   (13)

and initialize the counter variable of r as:

$\tau_r = \sum_{c \in N_r} \Delta\tau_c$   (14)
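The subspace matching criterion that drives both the ASSOM update and the ASGCS growth steps above can be sketched as follows. This is an illustrative reading, not the authors' code: the basis vectors of each cell are assumed orthonormal, and the function names are ours.

```python
import numpy as np

def projection_energy(x, basis):
    """Squared norm of the orthogonal projection of x onto span(basis).

    basis: (dim, k) array whose k columns are assumed orthonormal, so the
    projection coefficients are simply the inner products basis.T @ x.
    """
    coeffs = basis.T @ x
    return float(coeffs @ coeffs)

def episode_winner(episode, subspaces):
    """Representative winner for an episode (a group of input vectors):
    the cell whose subspace captures the most energy over the whole episode."""
    scores = [sum(projection_energy(x, B) for x in episode) for B in subspaces]
    return int(np.argmax(scores))
```

Because the winner is chosen per episode rather than per vector, the subspaces become invariant to the transformation (e.g. translation) used to generate the episode's members.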
In this section, the automatic generation of Gabor-type spatial feature filters by using the ASGCS and the ASSOM is shown. As in [Kohonen, 1995a], artificial, two-dimensional, randomly oriented, random-frequency sinusoidal waves are used to train the networks. Sampling lattices with 169 (13×13) points were used.
Fig. 1: The growing process of the ASGCS, shown in four stages (a)-(d). Only the even component (b1) of the generated filters is shown.
Fig. 2: The Gabor-like feature filters generated by the ASGCS after: (a) 1000; (b) 2500; (c) 5000; (d) 10000; and (e) 20000 iterations.
Fig. 3: The Gabor-like feature filters generated by the ASSOM after: (a) 1000; (b) 5000; (c) 10000; (d) 20000; (e) 30000; (f) 40000 iterations. Only the even component (b1) of the filters is shown.
First, the results obtained with the ASGCS network are presented. The growing process of the ASGCS is shown in figure 1. Filters which were close neighbors in the underlying growing graph structure are shown near each other in these pictures; this is done in the manner described in [Fritzke, 1994]. As can be seen, the growing of the network is performed in parallel with the frequency and orientation tuning of the filters. It should be noted that already in early phases of the simulation the ASGCS network basically has its final shape, with fewer neurons (filters). This behavior is described as fractal growth. In figure 2 the generated feature filters are shown as a function of the number of iterations, ranging from 1000 to 20000. It can be observed that the generated filters exhibit a Gabor-type structure very quickly (after 5000 iterations).
Finally, in figure 3 the feature filters generated by the ASSOM are shown as a function of the number of iterations (ranging from 1000 to 40000). It can be observed that it takes a large number of iterations until the filters show a Gabor-like structure. It should also be mentioned that the ASGCS algorithm is about five times faster than the ASSOM algorithm. In addition, the setting of the ASGCS parameters is easier and more robust than the setting of the ASSOM parameters.
The ASGCS network was introduced in this article. This network corresponds to a further improvement of the ASSOM that introduces some GCS concepts into the ASSOM model. These new concepts allow the automatic selection of the number of feature variables (the number of filters or neurons in the network) and of the topology (not only rectangular or hexagonal grids) of the network. Examples of the automatic generation of Gabor-type spatial feature filters by using the ASGCS and the ASSOM were shown.
It can be concluded that the proposed network is adequate for automatically generating Gabor-like feature detectors, although its properties must still be analyzed. Moreover, the ASGCS algorithm is about five times faster than the ASSOM algorithm, and the setting of its parameters is easier and more robust.
Further extensions of this work include:
• Study of the dynamics and the properties of the ASGCS.
• Introduction of the ASGCS network into an image processing architecture, e.g. the TEXSOM architecture.
• Experimentation with the ASGCS in one-dimensional signal processing.
References
Kohonen, T. (1995b). The Adaptive-Subspace SOM (ASSOM) and its use for the
implementation of invariant feature detection. Proc. of the Int. Conf. on Artificial Neural
Networks. ICANN 95, October 9-13, Paris, France.
813
Ruiz-del-Solar, J., and Köppen, M. (1996). Automatic generation of Oriented Filters for Texture Segmentation. Proc. of the Int. Workshop on Neural Networks for Identification, Control, Robotics & Signal/Image Processing, NICROSP 96, August 21-23, Venice, Italy.
Sirosh, J. (1995). A Self-Organizing neural network model of the primary visual cortex,
Ph.D. Thesis, The University of Texas at Austin, USA.
Van Sluyters, R.C., Atkinson, J., Banks, M.S., Held, R.M., Hoffmann, K.-P., and Shatz,
C.J. (1990). The Development of Vision and Visual Perception. In L. Spillman and J.
Werner (Eds.), Visual Perception: The Neurophysiological Foundations, Academic Press.
Wilson, H.R., Levi, D., Maffei, L., Rovamo, J., and DeValois, R. (1990). The Perception
of form: Retina to Striate Cortex. In L. Spillman and J. Werner (Eds.), Visual Perception:
The Neurophysiological Foundations, Academic Press.
Adaptive Hybrid Speech Coding with an MLP/LPC Structure
Marcos Faúndez-Zanuy
ABSTRACT
In recent years there has been growing interest in nonlinear speech models. Several works have been published revealing the better performance of nonlinear techniques, but little attention has been dedicated to the implementation of the nonlinear model in real applications. This work focuses on the study of the behaviour of a combined linear/nonlinear predictive model, based on linear predictive coding (LPC-10) and neural nets, in a speech waveform coder. Our novel scheme obtains an improvement in SEGSNR of between 1 and 2.5 dB for an adaptive quantization ranging from 2 to 5 bits.
1. Introduction
Speech applications usually require the computation of a linear prediction model for the vocal tract. This model has been successfully applied during the last thirty years, but it has some drawbacks: mainly, it is unable to model the nonlinearities involved in the speech production mechanism, and only one parameter can be fixed, the analysis order. With nonlinear models, the speech signal is better fitted, and there is more flexibility to adapt the model to the application.
In recent years there has been growing interest in nonlinear models applied to speech. This interest is based on the evidence of nonlinearities in the speech production mechanism. Several arguments justify this fact:
a) The residual signal of predictive analysis [1].
b) The correlation dimension of the speech signal [2].
c) The physiology of the speech production mechanism [3].
d) Probability density functions [4].
e) High-order statistics [5].
Despite this evidence, few applications have been developed so far, mainly because of the high computational complexity and the difficulty of analyzing nonlinear systems. The applications of nonlinear predictive analysis have focused on speech coding, because it achieves greater prediction gains than LPC. The most relevant systems are [6] and [7], which propose CELP coders with different nonlinear predictors that improve the SEGSNR of the decoded signal.
Three main approaches have been proposed for the nonlinear predictive analysis of speech:
a) Nonparametric prediction: it does not assume any model for the nonlinearity. It is a quite simple method, but the improvement over linear predictive methods is lower than with nonlinear parametric models.
b) Parametric prediction: it assumes a model of prediction. The main approaches are Volterra series and neural nets.
Recently, several contributions have appeared in the context of neural nets. In this paper we propose a novel ADPCM speech waveform coder for the following bit rates: 16 kbps, 24 kbps, 32 kbps and 40 kbps, with a hybrid (linear/nonlinear) predictor. With this structure a significant improvement in SEGSNR of between 1 and 2.5 dB is achieved over the equivalent coders based on the MLP or LPC alone.
[Fig. 1: approximation of a saturation function with quadratic (x vs. x²) and cubic (x vs. x³) terms]
phenomena in the vocal cords. Figure 1 shows that it is possible to model a saturation function with a cubic function, but not with a quadratic one.
A more detailed explanation of the nonlinear predictive model based on neural nets can be found in [8] and [9]. This paper focuses on the speech coding application.
In a preliminary work we studied the behaviour of the linear (LPC) and nonlinear MultiLayer Perceptron (MLP) predictors alone. This study reveals that the optimal solution is an adaptive LPC/MLP prediction. We propose a switched linear/nonlinear predictor in order to always choose the best predictor and to increase the SEGSNR of the decoded signal. Figure 2 represents the implemented scheme.
For each frame, the outputs of the linear and nonlinear predictors are computed simultaneously with the coefficients obtained from the previous encoded frame. Then a logical decision is made that chooses the output with the smaller prediction error. This implies an overhead of 1 bit per frame, which represents only 1/100 bits more per sample (in our simulations the frame size is 100 samples). It is referred to in the tables as the hybrid predictor, because it combines linear and nonlinear technologies. The percentage of use of each predictor is shown in Table 1.
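A minimal sketch of this per-frame switch follows. It is our own illustration: `lp_predict` and `nlp_predict` stand for the LPC-10 and MLP predictors (whose internals, and the backward coefficient adaptation, are omitted).

```python
import numpy as np

def hybrid_select(frame, lp_predict, nlp_predict):
    """Run both predictors on a frame and keep the one with the smaller
    prediction-error energy; the choice costs 1 side-information bit/frame."""
    e_lin = frame - lp_predict(frame)
    e_nl = frame - nlp_predict(frame)
    use_nl = float(np.sum(e_nl ** 2)) < float(np.sum(e_lin ** 2))
    residual = e_nl if use_nl else e_lin
    return residual, int(use_nl)  # residual to quantize + the 1-bit flag
```

Since the decoder receives the flag, it can run only the selected predictor when reconstructing the frame.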
Fig. 2 Adaptive ADPCM-B hybrid coder. LP: linear predictor, NLP: nonlinear predictor,
SW: switch
(ADPCMB) configuration is adopted. That is, the coefficients of the predictor are computed over the previous decoded frame, because it is already available at the receiver, which can compute the same coefficient values without any additional information. The results obtained with forward unquantized predictor coefficients (ADPCMF) are also provided for comparison purposes.
• The nonlinear analysis consists of a multilayer perceptron with 10 input neurons, 2 hidden neurons and 1 output neuron. The network is trained with the Levenberg-Marquardt algorithm.
• The linear prediction analysis of each frame consists of 10 coefficients obtained with the autocorrelation method (LPC-10).
Residual prediction error quantization
• The prediction error has been quantized with Nq = 2 to 5 bits (bit rates of 16 kbps to 40 kbps).
• The quantizer step is adapted with multiplier factors obtained from [10]. Δmax and Δmin are set empirically [11].
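This step adaptation can be sketched in the usual Jayant fashion. The multiplier values below are illustrative placeholders, not the factors of [10], and the function is our own reading of the scheme.

```python
def adapt_step(step, code, multipliers, step_min, step_max):
    """Scale the quantizer step by a multiplier selected by the previous
    code word, then clamp it to the empirically set bounds of the text."""
    return min(step_max, max(step_min, step * multipliers[code]))

# Hypothetical 2-bit (Nq=2) multipliers: shrink on the inner level, grow on
# the outer one, so the step tracks the residual's dynamic range.
multipliers_2bit = {0: 0.8, 1: 1.6}
```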
Database
• The results have been obtained with the following database: 8 speakers (4 males and 4 females), sampled at 8 kHz and quantized at 12 bits/sample.
Additional details about the predictor and the database were reported in [8] and [9].
Figure 3 MSE vs epochs for Multilayer perceptron (trained with BP and L-M) and Elman
net
a) Linear predictor
For the linear predictor the parameters are:
• Prediction order: LPC-10 (same number of input samples as the MLP 10x2x1) and LPC-25 (same number of prediction coefficients as the MLP 10x2x1) are studied.
• Frame length: sizes from 10 to 300 samples, with a step of 10 samples, are evaluated. Notice that the bigger the frame size, the smaller the number of frames for a given speech signal; but if the frame length is large, the assumption of a stationary signal within the analysis window is no longer valid and the behaviour degrades. If the frame length is short, the parameter estimation is not robust enough and the behaviour also degrades.
b) Nonlinear predictor
For the nonlinear predictor based on neural nets, the number of parameters that must be optimized is greater. The selected network architecture is the Multi-Layer Perceptron with 10 input neurons, 2 hidden neurons with a sigmoid transfer function, and one output neuron with a linear transfer function, trained with the Levenberg-Marquardt (L-M) algorithm, based on our previous results [8]. We have also evaluated a recurrent Elman net, but we found that its behaviour was worse than that of the MLP trained with L-M. Fig. 3 shows
Fig. 4 Histograms of the prediction gain (Gp) for 500 random initializations of the neural net weights (10x2x1 and 10x4x1 structures).
the Mean Square Error as a function of the number of epochs for a typical voiced frame of the database. It can be seen that the L-M algorithm presents fast convergence and a small MSE. The MLP 10x4x1 was also tested, but it has more coefficients and the computational complexity is greater. Also, a greater number of random initializations must be done for the 10x4x1 structure, because the probability of achieving the greatest prediction gain for a random initialization is lower than for the 10x2x1 structure (fig. 4).
The adjusted parameters of the predictor within the closed-loop ADPCM scheme are:
• Number of training epochs: This is a critical parameter. To encode a given frame, the neural net is trained over the previous frame in the backward scheme and over the current frame in the forward configuration. In both cases special attention must be paid to avoiding the problem of overtraining (the network must have a good generalization capability to manage inputs not used for training). Although consecutive frames are normally very similar, there are significant changes in the waveform that must be seen as perturbations of the input; and even if the neural net is applied over the same frame used for training, the conditions are different, because the predictor is trained in an open-loop scheme and tested in closed loop, so the input signal is actually corrupted by the quantization noise. This is more important the smaller the number of quantizer bits. Making the neural net as robust as possible to these small changes implies the optimization of training conditions such as:
a) The number of epochs used for training.
b) The number of random initializations of the weights (a multistart algorithm is used).
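Points (a) and (b) together amount to a multistart loop of the following shape. This is a sketch: `train_fn` stands for the L-M training run, assumed to return trained weights and the achieved prediction gain; its name and interface are ours.

```python
import numpy as np

def multistart_train(train_fn, n_starts, n_epochs, seed=0):
    """Train from several random initializations, each for a limited number
    of epochs (to limit overtraining), and keep the best prediction gain."""
    rng = np.random.default_rng(seed)
    best_w, best_gain = None, -np.inf
    for _ in range(n_starts):
        weights, gain = train_fn(rng, n_epochs)  # one L-M run from a fresh init
        if gain > best_gain:
            best_w, best_gain = weights, gain
    return best_w, best_gain
```

The histogram of figure 4 is exactly the empirical distribution of `gain` over such restarts.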
Fig. 5 SEGSNR vs frame length for ADPCM forward.
[Fig. 6: SEGSNR vs frame length for ADPCM backward; curves for the hybrid, MLP and LPC-10 predictors at Nq = 2 to 5 bits]
Figures 5 and 6 show the SEGSNR (computed with a 200-sample analysis window) for frame lengths ranging from 10 to 300 samples, for the MLP 10x2x1, LPC-10, LPC-25 and hybrid predictors with Nq = 2 to 5 bits, averaged over the frames of one sentence. For the hybrid predictor an overhead of 1 bit/frame must be sent, so if the frame length is reduced the compression ratio is also reduced. For these reasons, in this study the block size has been set to 100 samples/frame, because it offers a good compromise.
3. Results
The results have been evaluated using subjective criteria (listening to the original and
decoded files), and SEGSNR.
Table 2 shows the SEGSNR obtained with the ADPCM configuration for the whole database with the following predictors: LPC-10, LPC-25 and MLP 10x2x1. The results of the ADPCM forward (with unquantized predictor coefficients) are also provided as a reference for the backward configuration.
These results reveal the superiority of the nonlinear predictor in the forward configuration (approx. 3.5 dB over LPC-25, except for the 2-bit quantizer). This superiority is greater when the quantizer has a high number of levels.
In the backward configuration there is a small SEGSNR decrease with the linear predictor with respect to the forward configuration. For the nonlinear predictor the decrease is more significant (nearly 3 dB), but the SEGSNR is still better than LPC-10 except for Nq = 2 bits. Also, the variance of the SEGSNR is greater than for the linear predictor, because in the stationary portions of speech the neural net works satisfactorily, whereas for the unvoiced parts it generalizes poorly. Therefore, we propose a hybrid predictor.
ADPCMB LPC-10: 14.92 (5.1) at Nq=2; 20.59 (5.9) at Nq=3; 25.38 (6.6) at Nq=4; 30.02 (7.1) at Nq=5.
Table 2. SEGSNR for ADPCM forward, backward, linear, nonlinear and hybrid.
We have also evaluated the computational complexity of the studied systems. Table 3 summarizes the number of flops required for encoding the whole database with the different schemes. For comparison purposes, the computational complexity has been referred to the ADPCM LPC-10 system; thus, the numbers in Table 3 show how many times greater the computational burden is. The evaluated systems are:
• B: ADPCM with backward adaptation of the prediction coefficients.
100 and 200 indicate the frame length in the block-adaptive prediction system.
The only work we have found that deals with ADPCM with nonlinear prediction is the one proposed by Mumolo et al. [12]. It was based on Volterra series and had instability problems, which were overcome with a switched linear/nonlinear predictor. Our novel nonlinear scheme has always been stable in our experiments, although we also propose a switched predictor in order to increase the SEGSNR of the decoded signals. The results of our novel scheme show an increase of between 1 and 2.5 dB over classical LPC-10 for quantizers ranging from 2 to 5 bits, while the work of Mumolo [12] achieves 1 dB over classical LPC for quantizers ranging from 3 to 4 bits, also with a hybrid predictor. On the other hand, the computational complexity has increased approximately thirty times in the hybrid structure.
A statistical test was done in order to check whether the results are statistically significant. The selected test is ANOVA (Analysis of Variance), and it proves that the proposed adaptive hybrid speech coder is significantly better than the ADPCMB LPC-10 and LPC-25 schemes for all the studied bit rates.
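The paper does not detail its ANOVA setup; as an illustration only, a one-way F statistic over groups of per-sentence SEGSNR values (one group per coder) could be computed as follows.

```python
import numpy as np

def one_way_anova_F(groups):
    """One-way ANOVA F statistic for k groups of measurements
    (e.g. per-sentence SEGSNR values for the coders under comparison)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = np.mean(np.concatenate([np.asarray(g, float) for g in groups]))
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (np.mean(g) - grand) ** 2 for g in groups)
    ss_within = sum(((np.asarray(g, float) - np.mean(g)) ** 2).sum()
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

The resulting F is compared against the F distribution with (k-1, n-k) degrees of freedom to decide significance at the chosen level.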
In this paper we have obtained the same conclusion as in our speaker recognition application of nonlinear predictive models based on MLP [13]: the best results are achieved with a combination of linear and nonlinear predictive models. In [14] we obtained the same conclusion (also in speaker recognition) for a combination of an MLP trained as a classifier for each speaker and a codebook of cepstral parameters derived from a linear parametrization.
Acknowledgements
References
ABSTRACT
In this paper, we present a new speech coding scheme named NPC (Neural Predictive Coding). It is obtained by means of an MLP (Multi-Layer Perceptron) used in prediction. The system is designed to predict the samples of a signal window from the previous ones. The goal of this coding is to extract the characteristics of the signal window relative to the database from which it is extracted. After a precise description of our coding, we compare the results obtained by our coding with those obtained by classical codings (MFCC, FFT, LAR, LPC and LPCC) on phoneme recognition. The NPC coding improves the recognition rate with respect to the other codings.
INTRODUCTION
The first goal of coding is to reduce the amount of data to process. At the same time, it has to preserve the maximum of discriminating information. There are several coding types: frequential coding (FFT, MFCC) and predictive coding (LPC, LAR, LPCC). These codings are effective, but not adaptive. The objective of the NPC coding is to adapt itself to the database to be coded. The classical coding closest to NPC is LPC.
In the first part we will describe the model that we propose, first in a qualitative way and then in a formal one. In the second part we will present the parameters chosen for the coding system and give explanations for these choices. We will also present the results obtained by our coding and by the classical codings already mentioned on phoneme recognition. Finally, we will discuss possible new research axes.
1 NPC CODING
The coding system is a two-layer perceptron. It possesses n1 inputs, n2 neurons in the hidden layer, and one output. It is trained to predict a signal sample from the n1 previous ones. The (n1+1)·n2 weights of the first layer are common to all windows and constitute the fixed part of the system. On the other hand, the n2+1 weights of the second layer are specific to each window and constitute the coding coefficients. The process is decomposed into two phases: first the adjustment of the first layer, and then the coding itself.
We choose b signal windows of N samples each. A second layer is associated with each window. We present the first n1 samples of a window to the MLP composed of the common first layer and the second layer associated with this window. The MLP is adapted to predict sample n1+1 of the window. Then we shift input and output forward by one sample, i.e. the system has to predict sample n1+2 from samples 2 to n1+1 placed at the input, and so on, up to predicting the last sample from samples N-n1 to N-1. Therefore each window provides N-n1-1 examples. Modifications of the second layer are then executed. On the other hand, modifications of the common first layer are executed only after the passage of all the windows. Thus, the weights of the first layer are modified by the N-n1-1 examples of the b windows of the database, while the weights of the second layer are modified by the N-n1-1 examples of the associated window. Once this weight optimization is done, we obtain a first layer that constitutes the fixed part of the coding system. The system is then ready to code.
From a connectionist viewpoint, the first layer captures the common information. From a signal viewpoint, the first layer performs transformations that are optimal for the prediction of the windows.
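The two-phase weight update described above can be sketched as follows. This is an illustrative reconstruction with squared-error backpropagation: a linear output unit is used for simplicity (eq. (2) of the paper uses a sigmoid output), and the layer sizes and learning rate are arbitrary assumptions.

```python
import numpy as np

def npc_epoch(windows, W1, B1, second_layers, n1, lr=0.01):
    """One pass of the NPC first-layer adjustment phase (illustrative).

    Each second layer [W2, b2] is updated right after its window's examples;
    the shared first layer (W1, B1) accumulates its gradient over all the
    windows and is updated once at the end, as the text prescribes.
    A linear output unit and squared error are assumed here.
    """
    gW1, gB1 = np.zeros_like(W1), np.zeros_like(B1)
    for w, sl in zip(windows, second_layers):
        W2, b2 = sl
        gW2, gb2 = np.zeros_like(W2), 0.0
        for k in range(n1, len(w)):
            x = w[k - n1:k]                           # the n1 previous samples
            h = 1.0 / (1.0 + np.exp(-(W1 @ x + B1)))  # hidden activations
            err = (W2 @ h + b2) - w[k]                # prediction error
            gW2 += err * h
            gb2 += err
            dh = err * W2 * h * (1.0 - h)             # backprop through sigmoid
            gW1 += np.outer(dh, x)
            gB1 += dh
        sl[0] = W2 - lr * gW2                         # per-window update
        sl[1] = b2 - lr * gb2
    return W1 - lr * gW1, B1 - lr * gB1, second_layers  # shared update, once
```

In the coding phase the same loop is run with `W1` and `B1` frozen, updating only the second layer of the window being encoded.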
1.1.2 Coding
Each window to be encoded is presented to the MLP composed of the first layer previously computed and a second layer initialized at random. We optimize the weights of the second layer to minimize the prediction error at the output. The weights of this second layer then constitute the coding coefficients of the window. We reiterate this procedure for all the windows. The objective is to put the information common to all windows (not discriminative) on the first layer, and the information specific to each window on the second one.
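As a sketch of this coding phase, the following fragment freezes a given first layer (W1, B1) and fits only the per-window second layer by gradient descent on the prediction error; the linear output neuron, the zero initialization, the learning rate, and all names are our assumptions, not the authors' exact settings:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def code_window(window, W1, B1, n1, n_iter=40, lr=0.1):
    """Return the coding coefficients (second-layer weights and bias) of
    one window; the shared first layer (W1, B1) stays frozen."""
    X = np.array([window[k - n1:k] for k in range(n1, len(window))])
    y = np.array([window[k] for k in range(n1, len(window))])
    H = sigmoid(X @ W1.T + B1)          # hidden outputs: the fixed part
    w2 = np.zeros(W1.shape[0])          # per-window weights (random in the paper)
    b2 = 0.0                            # per-window bias
    for _ in range(n_iter):             # plain gradient descent on the error
        err = H @ w2 + b2 - y
        w2 -= lr * H.T @ err / len(y)
        b2 -= lr * err.mean()
    return w2, b2

rng = np.random.default_rng(0)
W1, B1 = rng.normal(size=(5, 20)), rng.normal(size=5)
w2, b2 = code_window(np.sin(0.3 * np.arange(50)), W1, B1, n1=20)
```

Calling `code_window` once per window yields the per-window coefficient vectors that serve as the code.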
We also propose a variant of this coding, associating a second layer not with each window but with each class. That is, during the first-layer adjustment, when we present a window to the system we do not attach its associated second layer, but the second layer associated with its class. Modifications of this second layer are executed only after the passage of all the windows of this class. The objective is to put the information common to the associated class (discriminative) on the second layer.
Let X_{i,j}^{x,c} = {e^{x,c}(i), e^{x,c}(i+1), ..., e^{x,c}(j-1), e^{x,c}(j)} be the vector composed of samples i to j of window x of class c.
The output of the hidden-layer neurons is:

V^{x,c}(k) = σ(W_1 · X_{k-n1,k-1}^{x,c} + B_1)    (1)

where σ is the sigmoid function, W_1 the n1×n2 matrix of first-layer weights, and B_1 the n2-vector of first-layer biases.
The prediction of the k-th sample of window x of class c is then:

ŝ^{x,c}(k) = σ(W_2^{x,c} · V^{x,c}(k) + b_2^{x,c})    (2)

where W_2^{x,c} is the n2-vector of second-layer weights associated with window x of class c, and b_2^{x,c} the bias.
J^c = Σ_{x=1}^{b} Σ_{k=n1+1}^{N} e^{x,c}(k)    (6)
i.e. the sum of (3) over all examples of all windows of class c. We modify the weights to minimize this criterion using the backpropagation algorithm. Once this minimization is done, we obtain the fixed part of the coding system, W_1 and B_1.
1.2.2 Coding
For each window we initialize W_2^{x,c} at random. Then we minimize with the backpropagation algorithm the criterion

J^{x,c} = Σ_{k=n1+1}^{N} e^{x,c}(k)

by modifying only W_2^{x,c}. W_2^{x,c} then constitutes the coding coefficients of window x of class c.
As we can see, NPC, like LPC, is a predictive coding in which the coding coefficients are optimized for prediction. But there are two fundamental differences from LPC. First, NPC has a fixed part which depends on the database to be coded. The goal of this fixed part is to capture the information that is necessary for prediction but useless for classification. Making some parameters of the coding system depend on the data has been used in several studies [1], often as the parameters of a filter bank. Second, the prediction is nonlinear. The nonlinearity of the speech signal has been demonstrated and exploited in several studies [2,3,4].
One application of our system is phoneme recognition. We have carried out several experiments to test the performance of our system on this application. We first describe the general conditions of these experiments, then present the results obtained by our coding and by the classical ones, and finally discuss these results.
We tested 5 codings against which to evaluate NPC. We chose these codings because they are often used for phoneme recognition. The results obtained by the different codings are presented in Table 1. The indicated scores are the recognition rates obtained on the generalization database.
Coding   Maximum generalization recognition rate
FFT      59.14
MFCC     58.58
LAR      57.95
LPCC     57.88
LPC      56.69

Table 1: Recognition rates obtained by the different codings.
One observes that the recognition rates are relatively close. The best rate is obtained by the FFT coding; the LPC coding obtains the worst. The frequency-domain codings appear to be better than the predictive ones.
The transfer function of all hidden neurons is the sigmoid. We carried out several experiments on the number of inputs of the predictive MLP (n1). Figure 1 shows the sum of the absolute connection weights linking input i to the neurons of the hidden layer; n1 is fixed at 20 in this experiment, i varies from 1 to 20, with i=1 representing the oldest input and i=20 the most recent. One observes that this sum decreases as one goes back into the past. We can therefore surmise that the prediction is made mainly from the most recent samples (which seems natural). If we add too many inputs, they bring more noise than information; if we remove too many, we remove information. The optimal value we found for n1 is 20. Two additional experiments with n1 equal to 18 and 22 gave worse recognition rates.
Fig. 1. Sum of the absolute connection weights (0 to 0.4) from each input i (1 to 20) to the hidden layer.
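The quantity plotted in figure 1, the total absolute weight leaving each input, is simply a column sum over the first-layer weight matrix; a sketch (the n2 × n1 storage convention and the random weights are our assumptions):

```python
import numpy as np

# Hypothetical first-layer weight matrix: n2 = 15 hidden neurons, n1 = 20 inputs.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(15, 20))

# For each input i, sum over the hidden neurons of the absolute connection weight.
influence = np.abs(W1).sum(axis=0)
print(influence.shape)   # one value per input: (20,)
```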
We carried out 6 experiments. For each one we performed classifications for several iteration numbers: iterations for the first-layer adjustment part, and for the coding part. During the first-layer adjustment phase, the differences among the 6 experiments were:
- The transfer function of the output neuron was the sigmoid function or the linear one.
- There was a bias for the second layer or there was not.
- The second layer was associated with each window or with each class.
Table 3 presents an experiment carried out with a second layer for each window, without bias and with a linear function for the output neuron during the first-layer adjustment phase. The scores are the recognition rates on the test database. The character size used is proportional to the score, to give a visual impression. Each row corresponds to an iteration number of the first-layer adjustment phase (indicated on the left), and each column to an iteration number of the coding phase (indicated on top).
Figures 3 and 4 present the results of the experiment that gives the best recognition rate. This rate is 62.25, for 7000 iterations of the first-layer adjustment phase and 40 iterations of the coding phase. Figure 3 shows the mean of the recognition rates over all iteration numbers of the coding phase (1, 5, 10, 20, 40, 100) as a function of the iteration number of the first-layer adjustment phase. Conversely, figure 4 shows the mean of the recognition rates over all iteration numbers of the first-layer adjustment phase (200, 800, 3200, 7000) as a function of the iteration number of the coding phase.
Fig. 3 (left). Mean recognition rate (59.5 to 62) versus the number of first-layer adjustment iterations (0 to 8000). Fig. 4 (right). Mean recognition rate (59.5 to 61.5) versus the number of coding iterations (0 to 100).
- According to table 2 and figure 4, one observes that the iteration number of the coding phase is decisive. Too many iterations deteriorate the recognition rate (even if the prediction error is lower); overfitting appears after 100 iterations. The optimal number seems to be between 5 and 40. This point was noted in all the experiments carried out.
- According to table 2 and figure 3, the larger the iteration number during the first-layer adjustment phase, the better the recognition rate. This point was also noted in the other experiments. However, after 3000 iterations the recognition rate no longer grows significantly. We never went far enough in the experiments to observe overfitting.
- Finally, we note that our best score is 3 points higher than the best score obtained by the classical codings, and 5.5 points higher than the LPC coding.
3 FURTHER WORK
- The MLP used for the classification is a basic classifier. It would be interesting to test this coding with other, more advanced classifiers. Moreover the database used, even if balanced, is small, and deserves to be extended to a greater number of phoneme classes. Finally, the number of coefficients used to code the speech signal is very often greater than 12, so it would be interesting to increase it.
- Another application of our coding would be speech compression. Exactly as is done with LPC coding, we could compress and decompress a speech signal. It would be interesting to test this point.
- At present the first layer is optimized for prediction. But we want a good recognition rate more than a good prediction. We are therefore considering ways to couple the prediction and the classification. A study is in progress at the laboratory.
CONCLUSIONS
In this paper we have presented the NPC coding that we have developed. NPC is original because it has a part which depends on the database to be coded (the first layer of a perceptron), and because the prediction is nonlinear. The experiments we have presented show that our coding obtains a better generalization score than the FFT coding (the most effective of the 5 classical codings we tested). However, a study of greater scope remains to be carried out, and is in progress at the laboratory.
REFERENCES
1 Introduction
Automated classification addresses the general problem of finding an approximation F̂ of an unknown function F defined from an input space Ω onto an unordered set of classes {ω_1, ..., ω_K}, given a training set T = {(x^p, y^p = F(x^p))}_{p=1}^{P} ⊂ Ω × {ω_1, ..., ω_K}.
Among the wide variety of methods available in the literature to
learn classification problems, some are able to handle many classes
(e.g. decision trees [2,12], feedforward neural networks), while others
are specific to 2-class problems, also called dichotomies. This is the
case of perceptrons or of support vector machines (SVMs) [1,4,14].
When the former are used to solve K-class classification problems,
K classifiers are typically placed in parallel and each one of them is
trained to separate one class from the K - 1 others. The same idea
can be applied with SVMs [13]. This way of decomposing a general
classification problem into dichotomies is known as a one-per-class
decomposition, and is independent of the learning method used to
train the classifiers.
2 Illustrative example
Fig. 1. Two-dimensional data of the three classes (Class 1, Class 2, Class 3; horizontal axis from -2.5 to 2.5), with the decision boundaries described in the text.
Since the three covariance matrices are identical and the a priori probabilities are equal, the boundaries of the decision regions based on an exact Bayesian classifier are three lines intersecting in one point [7], represented by continuous lines in Figure 1. The 50 data points of each class are linearly separable from the data of the other two classes. However, the maximal margin of a linear separator isolating Class 3 from Classes 1 and 2 is much larger than the margin of the other two linear separators. Thus, when using 3 linear SVMs to solve the three dichotomies, the norm of the optimal hyperplane found by the SVM algorithm is much smaller in one case than in the other two. When the output class is selected as the one corresponding to the SVM with the largest output, the decision region obtained is shown in Figure 1 by dashed lines; it is quite different from the optimal Bayes decision.
For comparison, the dash-dotted lines (with the crossing point marked by a square) correspond to the boundaries of the decision regions obtained by three linear perceptrons trained by the pseudo-inverse method, i.e. linear separators minimizing the mean square error [7]. This matches the optimal boundary closely.
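The pseudo-inverse training mentioned above amounts to a least-squares fit of one linear output per class followed by an arg max; a minimal sketch with toy data (the ±1 target coding and the blob positions are our assumptions):

```python
import numpy as np

def pseudo_inverse_classifiers(X, y, n_classes):
    """Fit K one-per-class linear separators minimizing the mean square
    error, via the Moore-Penrose pseudo-inverse."""
    Xa = np.hstack([X, np.ones((len(X), 1))])       # append a bias input
    T = np.where(y[:, None] == np.arange(n_classes), 1.0, -1.0)
    return np.linalg.pinv(Xa) @ T                   # least-squares weights

def predict(W, X):
    Xa = np.hstack([X, np.ones((len(X), 1))])
    return np.argmax(Xa @ W, axis=1)                # class with largest output

# Three well-separated Gaussian blobs, in the spirit of the illustrative example.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.2, size=(50, 2)) for m in ([0, 0], [2, 0], [1, 2])])
y = np.repeat([0, 1, 2], 50)
W = pseudo_inverse_classifiers(X, y, 3)
print((predict(W, X) == y).mean())   # training accuracy on separable blobs
```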
Two different ways of normalizing the outputs of the SVMs are
also illustrated in Figure 1 and the boundaries of the correspond-
ing decision regions are shown with dotted lines. In one case, the
3 SVM output normalization
The first normalization technique considered has a geometrical interpretation. When a linear classifier f_k : ℝ^d → {−1, +1} of the form

f_k(x) = sgn(g_k(x)) = sgn(x^T w^k + b_k)    (1)

is normalized such that the Euclidean norm ||w^k||_2 is 1, g_k(x) gives the Euclidean distance from x to the boundary of f_k.
Non-linear SVMs are defined as linear separators in a high-dimensional space H into which the input space ℝ^d is mapped through a non-linear mapping Φ (for more details on SVMs, see for example the very good tutorial [3], from which our notations are borrowed). Thus, the same geometrical interpretation holds in H. The parameter w^k of the linear separator f_k in H of the form (1) is never computed explicitly (its dimension may be huge or infinite), but it is known as a linear combination of the images through Φ of the support vectors (input data with indices in N_s^k):

w^k = Σ_{p ∈ N_s^k} α_p^k y^p Φ(x^p).    (2)
The normalization factor π_w^k used in this work will thus be defined by

(π_w^k)^2 = Σ_{p,p' ∈ N_s^k} α_p^k α_{p'}^k y^p y^{p'} Φ(x^p)^T Φ(x^{p'})    (3)
          = Σ_{p,p' ∈ N_s^k} α_p^k α_{p'}^k y^p y^{p'} K(x^p, x^{p'}).    (4)
The decision k̂ = arg max_k (g_k) is replaced by

k̂ = arg max_k ((M g)_k),    (6)
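For a linear kernel, the normalization factor of equations (3)-(4) reduces to the ordinary Euclidean norm of w^k, which gives a quick numerical check (the dual coefficients and support vectors below are synthetic, not from a trained SVM):

```python
import numpy as np

def pi_w(alpha, labels, sv, kernel):
    """Normalization factor pi_w^k of equations (3)-(4):
    (pi_w)^2 = sum_{p,p'} alpha_p alpha_p' y_p y_p' K(x_p, x_p')."""
    K = np.array([[kernel(a, b) for b in sv] for a in sv])
    coef = alpha * labels
    return np.sqrt(coef @ K @ coef)

rng = np.random.default_rng(0)
sv = rng.normal(size=(4, 3))            # support vectors
alpha = rng.uniform(0.1, 1.0, size=4)   # dual coefficients
labels = np.array([1.0, -1.0, 1.0, -1.0])

linear = lambda a, b: a @ b
# With a linear kernel, pi_w equals ||w|| for w = sum_p alpha_p y_p x_p.
w = (alpha * labels) @ sv
print(np.isclose(pi_w(alpha, labels, sv, linear), np.linalg.norm(w)))
```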
5 Numerical experiments
All the experiments reported in this section are based on datasets from the Machine Learning repository at Irvine [10]. The values listed are percentages of classification errors, averaged over 10 experiments. For glass and dermatology, 10-fold cross-validation was done once, while for vowel and soybean, the ten runs correspond to 5 repetitions of 2-fold cross-validation. We used SVMs with polynomial kernels of degrees 2 and 3.
database     deg  no normal.    π_w^k         2nd norm.     M
glass         2   35.7 ± 13.5   31.6 ± 10.3   31.9 ± 12.3   39.0 ± 12.5
glass         3   37.6 ± 12.8   33.3 ± 11.4   35.7 ± 10.6   45.2 ± 10.8
dermatology   2    3.9 ±  1.9    4.1 ±  2.0    3.9 ±  1.9    4.2 ±  2.0
dermatology   3    3.9 ±  2.7    4.4 ±  2.7    3.9 ±  2.7    4.4 ±  2.7
vowel         2   70.3 ± 39.7   69.8 ± 40.7   69.9 ± 40.5   24.2 ±  1.6
vowel         3   62.1 ± 44.5   61.4 ± 45.4   61.8 ± 44.9   10.5 ±  3.2
soybean       2   71.6 ± 34.7   71.6 ± 34.8   71.6 ± 34.9   29.2 ± 11.2
soybean       3   71.6 ± 34.8   71.4 ± 35.1   71.6 ± 34.8   28.8 ± 11.1
6 Robust decomposition/reconstruction
schemes
Lately, some work has been devoted to the issue of decomposing a K-class classification problem into a set of dichotomies. Note that all the research we refer to was carried out independently of the method used to learn the dichotomies; consequently, all the techniques can be applied directly with SVMs.
The one-per-class decomposition scheme can be advantageously replaced by other schemes. If there are not too many classes, the so-called pairwise-coupling decomposition scheme is a classical alternative, in which one classifier is trained to discriminate between each pair of classes, ignoring the others. This method is certainly more efficient than one-per-class, but it has two major drawbacks. First, the number of dichotomies is quadratic in the number of classes. Second, each classifier is trained with data coming from two classes only, but at classification time the outputs for data from any class are involved in the final decision [11].
A more sophisticated decomposition scheme, proposed in [6,5],
is based on error-correcting code theory and will be referred to as
ECOC. The underlying idea of the ECOC method is to design a set
of dichotomies so that any two classes are discriminated by as many
dichotomies as possible. This provides robustness to the global clas-
sifier, as long as the errors of the simple classifiers are not correlated.
For this purpose, every two dichotomies must also be as distinct as
possible.
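The ECOC idea above can be sketched as follows: each class receives a code word over L dichotomies, and a sample is assigned to the class whose code word is nearest in Hamming distance to the signs of the dichotomy outputs. The code matrix below is illustrative only, not one from [6,5]:

```python
import numpy as np

# Hypothetical code matrix: K = 4 classes, L = 6 dichotomies (entries +1 / -1).
codes = np.array([
    [+1, +1, +1, +1, +1, +1],
    [+1, -1, -1, +1, -1, -1],
    [-1, +1, -1, -1, +1, -1],
    [-1, -1, +1, -1, -1, +1],
])

def ecoc_decode(outputs, codes):
    """Assign the class whose code word is closest in Hamming distance
    to the signs of the L dichotomy outputs."""
    bits = np.sign(outputs)
    dist = (codes != bits).sum(axis=1)
    return int(np.argmin(dist))

# The third dichotomy got it wrong, but the redundancy corrects the error.
print(ecoc_decode(np.array([0.9, 0.7, -0.8, 0.6, 0.5, 0.4]), codes))   # 0
```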
In this pioneering work, the set of dichotomies was designed a priori, i.e. without looking at the data. The drawback of this approach is that each dichotomy may gather classes that are very far apart and is thus likely to be hard to learn. Our contribution to this field [8] was to elaborate algorithms constructing the decomposition matrix a posteriori, i.e. by taking into account the organization of the classes in the input space as well as the classification method used to learn the dichotomies. Thus, once again, the approach is immediately applicable with SVMs.
The algorithm constructs the decomposition matrix iteratively, adding one column (dichotomy) at a time. At each iteration, it chooses a pair of classes (ω_k, ω_k') at random among the pairs of classes that are so far the least discriminated by the system. A classifier (e.g. an SVM) is trained to separate ω_k from ω_k'. Then, the performance of this classifier is tested on the other classes, and a class ω_l is added to the dichotomy under construction as a positive (resp. negative) class if a large part of it is classified as positive (resp. negative). The classifier is finally retrained on the augmented dichotomy. The iterative construction stops either when all the pairs of classes are sufficiently discriminated or when a given number of dichotomies is reached.
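The iterative construction described above can be sketched as follows. This is a simplified reading, not the authors' algorithm: a nearest-centroid separator stands in for the SVM, the random pair is not restricted to the least-discriminated ones, the final retraining step is omitted, and `accept` is an assumed absorption threshold:

```python
import numpy as np

def build_decomposition(X, y, n_classes, n_dichotomies, accept=0.7, seed=0):
    """Sketch: pick a pair of classes, train a separator for it, then
    absorb each remaining class into the side where most of its data
    falls, yielding one column of the decomposition matrix per step."""
    rng = np.random.default_rng(seed)
    M = np.zeros((n_classes, 0), dtype=int)          # decomposition matrix
    for _ in range(n_dichotomies):
        k, k2 = rng.choice(n_classes, size=2, replace=False)
        col = np.zeros(n_classes, dtype=int)
        col[k], col[k2] = 1, -1
        c_pos, c_neg = X[y == k].mean(axis=0), X[y == k2].mean(axis=0)
        for l in range(n_classes):                   # test on the other classes
            if l in (k, k2):
                continue
            pos_side = (np.linalg.norm(X[y == l] - c_pos, axis=1)
                        < np.linalg.norm(X[y == l] - c_neg, axis=1))
            if pos_side.mean() > accept:
                col[l] = 1                           # absorbed as positive
            elif pos_side.mean() < 1 - accept:
                col[l] = -1                          # absorbed as negative
        M = np.column_stack([M, col])
    return M

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.2, size=(30, 2)) for m in ([0, 0], [2, 0], [1, 2])])
y = np.repeat([0, 1, 2], 30)
M = build_decomposition(X, y, 3, 4)
print(M.shape)   # (3, 4): one row per class, one column per dichotomy
```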
Although each of these general and robust decomposition techniques is applicable to SVMs and should in any case be preferred to the one-per-class decomposition, they do not solve the normalization problem. When choosing a general decomposition scheme composed of L dichotomies providing a mapping from the input space Ω into {−1, +1}^L or ℝ^L, one also has to select a mapping m : ℝ^L → ℝ^K, called the reconstruction strategy, on which the arg max_k operator will finally be applied.
Among the large set of possible reconstruction strategies explored in [9], one distinguishes the a priori reconstructions from the a posteriori reconstructions. In the latter, the mapping m can be basically any classification technique (neural networks, decision trees, nearest neighbor, etc.). It is learned from new data and thus it solves the normalization problem.
Reconstruction mappings m composed of L SVMs have also been investigated in [9] and provided excellent results, especially for degree-2 and degree-3 polynomial kernels. Note that in this case, the normalization problem occurs again at the output of the mapping m and in our
7 Conclusions
References
5. Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via
error-correcting output codes. Journal of Artificial Intelligence Research, 2:263-
286, 1995.
6. T. G. Dietterich and G. Bakiri. Error-correcting output codes: A general method
for improving multiclass inductive learning programs. In Proceedings of AAAI-91,
pages 572-577. AAAI Press / MIT Press, 1991.
7. R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley
& Sons, New York, 1973.
8. Eddy Mayoraz and Miguel Moreira. On the decomposition of polychotomies into
dichotomies. In Douglas H. Fisher, editor, The Fourteenth International Confer-
ence on Machine Learning, pages 219-226, 1997.
9. Ana Merchan and Eddy Mayoraz. Combination of binary classifiers for multi-class classification. IDIAP-Com 02, IDIAP, 1998. Paper 22 in the Proceedings of Learning'98, Madrid, September 1998, http://learn98.tsc.uc3m.es/~learn98/papers/abstracts.
10. C. J. Merz and P. M. Murphy. UCI repository of machine learning databases. Machine-readable data repository, http://www.ics.uci.edu/~mlearn/mlrepository.html, Irvine, CA: University of California, Department of Information and Computer Science, 1998.
11. Miguel Moreira and Eddy Mayoraz. Improved pairwise coupling classification with
correcting classifiers. IDIAP-RR 9, IDIAP, 1997. To appear in the Proceedings of
the European Conference on Machine Learning, ECML'98.
12. J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
13. B. Schölkopf, C. Burges, and V. Vapnik. Extracting support data for a given task.
In U. M. Fayyad and R. Uthurusamy, editors, Proceedings of the First International
Conference on Knowledge Discovery and Data Mining, pages 252-257. AAAI Press,
1995.
14. V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York,
1995.
Self-Organizing Yprel Network Population
for Distributed Classification Problem Solving
Abstract: This paper deals with a new scheme of distributed classifier based on a particular formal neuron named "yprel". The main characteristics of the proposed approach are: (i) a classifier is a set of interconnected and cooperating networks; (ii) the distributed resolution strategy emerges from the individual network classification behaviors during the incremental building phase of the classifier; (iii) each neuron is able to come to classification decisions about some elements and to communicate them; (iv) the network architectures and the interconnection links between the networks are not chosen a priori, but organize themselves through incremental and competitive learning between the decision-making neurons.
I. INTRODUCTION
One of the main problems raised by any pattern recognition work is solving the classification task. Nowadays, neural methodologies have proved their ability to treat complex learning and classification problems, but some critical points remain unsolved. Two of them are dealt with in this paper: (i) how to build a network architecture adapted to a given task; (ii) how to share the resolution of a complex problem among a set of cooperating networks. These points have been acknowledged as key problems by several authors in recent years [1], [2], [3], [4], [5].
In this paper, we describe the main principles of the yprel methodology, which puts forward the self-organization of a distributed solution for supervised classification problems. The two following sections detail the formal neuron used and the way a network architecture adapted to a given goal is determined. The third section describes the automatic task decomposition and the distribution process which emerge from the learning phase. Preliminary results obtained for handwritten digit recognition on the NIST database are given in the last section.
This section presents a particular neuron named "yprel", an abbreviation of "Y-PRocessing-ELement". The "Y" character symbolizes the neuron structure, which possesses at most two inputs and one output. Figure 1 gives an example of an yprel network. Two kinds of neurons can be distinguished: the one-input neurons, linked to the real features extracted from the shape to identify, and the standard neurons with their two inputs. In an yprel network, each neuron output can be connected to one or several other neurons, without any layer constraint on the global structure.
Fig. 1. Example of an yprel network, from the input features to the network output.
The role of the one-input yprels is not only to normalize the data coming from the feature vector; they are also able to take some classification decisions. Each extracted feature is linked to a particular neuron, which uses a linear separation to determine two homogeneous decision domains on both sides of a remaining non-decided area. This decision scheme is illustrated by figure 3.
rules allow them to be deduced from the respective limits of the studied-class elements and the other classes' elements. Only the six cases listed in figure 4 are possible.
Fig. 4. The six possible domain-label configurations (e.g. d0 = -1, d1 = 2).
yp = k1.x + k2

with x the extracted feature component. The two parameters k1, k2 are determined during the learning phase. They normalize the enlarged interval [Fmin, Fmax] onto the segment [0,1]. Thus, if (yp < 0) or (yp > 1), the one-input yprel can directly come to a decision about the corresponding elements. These decision values are then updated according to the respective domain labels d0, d1 in order to respect the output encoding used. The final yp output becomes nonlinear and defined by the following rules:
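The rules themselves are not legible in our copy; the following sketch reconstructs the behaviour described above (normalization of the enlarged interval onto [0,1], with out-of-range values replaced by the domain labels; the particular labels d0 = -1, d1 = 2 are assumptions):

```python
def one_input_yprel(x, fmin, fmax, d0=-1.0, d1=2.0):
    """Sketch of a one-input yprel: normalize the enlarged undecided
    interval [fmin, fmax] onto [0, 1]; values falling outside it become
    direct classification decisions carrying the domain labels d0, d1."""
    k1 = 1.0 / (fmax - fmin)       # yp = k1 * x + k2 as in the text
    k2 = -fmin * k1
    yp = k1 * x + k2
    if yp < 0.0:
        return d0                  # decided: left-domain label
    if yp > 1.0:
        return d1                  # decided: right-domain label
    return yp                      # undecided: normalized value in [0, 1]

print(one_input_yprel(0.5, fmin=1.0, fmax=2.0))   # below fmin: decision d0 = -1.0
```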
The role of the standard yprel is also to perform, in its own input space, a kind of linear separation between the studied-class elements and all the others. First, it ensures the propagation of all the decisions coming from the two inputs and verifies their coherence. Let yp be a standard yprel, with two inputs e0 and e1 which refer to other neuron outputs. The yp input values depend on the encoding scheme used, while the data distribution in this space is due to the classification behaviors of e0 and e1. The input space of the standard yprel is illustrated by figure 5. The encoding used makes it possible to differentiate, in this space, several regions which correspond to the different decision-making situations.
The decisions provided by the two inputs e0, e1 are compatible if the two sub-nets have come to the same conclusion. In this case, the corresponding decision is simply transmitted to the following yprels through the yp neuron output.
Fig. 5. Standard yprel input space. Fig. 6. Standard yprel decision function.
The two input values are also compatible if only one input e_i provides a decision and the other e_j gives no conclusion (i ≠ j; i, j ∈ {0,1}). In this case, the yp neuron output takes the value of the classification decision. This situation ensures a real cooperation process between the two sub-nets, by taking into account the complementarity of the decisions coming from the two inputs.
When no decision is taken by the two inputs (0 < e0 < 1 and 0 < e1 < 1), the standard yprel computes a certain linear combination of its inputs. This combination determines the direction of a projection line, which again makes it possible to find two homogeneous decision domains on both sides of a remaining non-decided area. Figure 6 illustrates this standard-neuron decision mechanism.
Once the projection direction is found, the standard neuron is determined exactly like the one-input yprel. The supervised learning allows the two frontiers Fmin, Fmax of the mixed zone and the corresponding decision-domain labels d0, d1 to be deduced from the respective element positions on the projection line, as described in figure 4. The obtained non-decision interval is likewise enlarged and normalized onto the segment [0,1], and the same rules perform the final output encoding. The standard neuron is simulated by computing:

yp = k1.(a0.e0 + a1.e1) + k2

with e0, e1 the two input components, a0, a1 the synaptic weights and k1, k2 the normalisation parameters. The expression (a0.e0 + a1.e1) computes the orthogonal projection of the element (e0, e1) on a line whose direction-vector components are (a0 = cos μ, a1 = sin μ), μ being the angle between the projection line and the horizontal axis.
The use of only two inputs makes it possible to search for the projection direction exhaustively, varying the angle μ in 1-degree steps. For each value of μ, we determine the two decision domains according to the element projections on the corresponding line. The number of decisions inside one domain acts as a selection criterion to
find the final direction. The selected line corresponds to the domain which provides the greatest possible number of decisions. To ensure a better generalization behavior, we only keep a domain for a given line if its number of decisions is greater than a certain threshold determined during the learning phase. Otherwise, the corresponding frontier is moved and the normalisation parameters are updated to include this domain in the non-decision area. This frontier-shifting mechanism is illustrated by figure 7.
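The exhaustive 1-degree search can be sketched as follows. This is a simplified reading: only one of the six boundary cases of figure 4 is handled (elements projecting beyond all elements of the other class count as decided), and `threshold` stands in for the validation threshold; the toy data are ours:

```python
import numpy as np

def best_projection(E, labels, threshold=3):
    """Sweep the projection angle mu in 1-degree steps over the 2-D
    inputs E and keep the angle whose single-class end domains decide
    the most elements (simplified standard-yprel search)."""
    best_angle, best_count = None, threshold - 1
    for deg in range(180):                  # a direction, 1-degree steps
        mu = np.radians(deg)
        proj = E[:, 0] * np.cos(mu) + E[:, 1] * np.sin(mu)
        pos, neg = proj[labels == 1], proj[labels == 0]
        count = np.sum(pos > neg.max()) + np.sum(neg < pos.min())
        if count > best_count:              # keep the most decisive angle
            best_angle, best_count = deg, count
    return best_angle, best_count

rng = np.random.default_rng(0)
E = np.vstack([rng.normal(0.8, 0.05, (10, 2)),   # studied class
               rng.normal(0.2, 0.05, (10, 2))])  # all the others
labels = np.repeat([1, 0], 10)
angle, count = best_projection(E, labels)
print(angle, count)   # all 20 elements decided for a separating angle
```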
If no domain is finally kept, we select the projection direction which minimizes the length of the non-decision interval computed before the frontier movements. This case has to be considered, since a standard neuron can be very efficient solely by combining the decisions coming from two complementary sub-nets. The same shifting mechanism is used to validate the decision domains of the one-input yprels. Its role in the generalization behavior is very important: it eliminates from the network all the non-representative decision domains which contain only a few elements of the learning set.
We have described the formal neuron used in the yprel networks and the way it is able to take decisions with respect to a given class. Nevertheless, no hypothesis has been made about the network structure. A network has to combine the decision abilities of several yprels to solve its particular classification problem. The aim of the following section is to present the method used to obtain a self-organizing architecture adapted to each network goal.
III. A SELF-ORGANIZING NETWORK STRUCTURE
In the yprel methodology, the network structure is not chosen a priori. It is determined step by step during the learning process. The learning strategy used combines the main advantages of incremental building methods with the competition process of genetic algorithms, but without requiring a gene-encoding phase. This learning algorithm is based on a cooperative and competitive process between the deciding neurons.
The first step of the network building phase consists in creating the set of all the one-input yprels linked to the extracted features. Then, we incrementally generate a set of standard yprels. Each generated neuron becomes the terminal element of a particular sub-net which takes some classification decisions. All the created sub-nets attempt to reach the
same goal, but each determines its own solution space owing to the decision distribution
and the propagation mechanism. Thus, each sub-net becomes a potential candidate for the
particular problem to solve. The total number of decisions taken inside each sub-net acts
as a selection criterion to find the winning sub-net. Only the sub-net with the best performance will be kept as the final network. This competition principle is illustrated by figure 8.
Fig. 8. Competition between candidate sub-nets, from the input features to the network output.
With this competitive learning algorithm, the parameters of each yprel are calculated only once, when the neuron is used as the terminal element of a new sub-net. They are never modified later. The computation needed to generate a new sub-net is therefore limited to evaluating the parameters of a single yprel. The cost of generating one standard neuron decreases as the network structure building progresses, since only the remaining non-decided elements are used to determine its parameters. The learning procedure being very fast, it allows a great number of candidate neurons to be generated, testing different possible architectures simultaneously. This competition mechanism is essential to ensure the search for a "good" solution in the combination space. It limits overlearning phenomena, since it does not always extend the same network structure. Thus, a biased or overfitted architecture can be completely given up if other tested combinations lead to better solutions.
The main problem raised by this learning strategy is how to choose the two inputs of a new yprel. The selection rule must avoid systematically exploring all the possible combinations, since the total number of different network-structure possibilities grows exponentially with the number of created neurons. The selection rule must strongly constrain the competition process, to ensure the system convergence and to limit the combinatorial explosion. At the same time, it has to remain flexible enough to get out of a biased solution by testing other network-structure possibilities. The proposed selection rule has this double-acting behavior.
First, we use the individual standard-neuron efficiency, which measures the number of decisions taken by the neuron itself in comparison with the best of its two parents. During the generation phase, only the yprels with individual efficiencies greater than a certain threshold are kept as potential candidates. The other ones are eliminated, and the corresponding sub-net combinations are marked so that they are never tested again. The same threshold value is used to validate a decision domain and to keep a candidate. This threshold mechanism strongly limits the combinatorial explosion, by keeping only the most reliable candidates, those which really make the problem solving progress by a certain number of decisions compared to their two parents.
During the generation phase, the system maintains a two-dimensional array which sorts the selected population according to the performance (number of decisions) and the
size (number of yprels) of the obtained sub-nets. Each input of a new yprel is chosen by performing two biased random draws inside this table. The first one determines the size category of the candidates; the second selects one element among these candidates. Each pseudo-random selection is made by computing the following function, which returns the position of the selected element in the considered list:
(n − 1) · r^(tmp+1)

with n the number of elements in the list, r a random value drawn uniformly in [0,1] and tmp a
temperature parameter. This function associates a selection probability with each element according to its position in the ordered list. These probabilities follow an exponential law whose curvature is set by the temperature parameter. The first pseudo-random selection favours the candidates with the smallest sizes and the one-input yprel selection, while the second increases the chances of the most efficient sub-nets for a given size. The random component of this selection function makes it possible to test new neuron combinations and to avoid local minima, while the bias component constrains the process to ensure the incremental structure building and the system convergence. In the proposed approach, the two temperature parameters were frozen at experimentally satisfactory values, close to a breadth-first search that favours neuron competition over the incremental building scheme. This is possible because the convergence of our system is mainly due to the threshold mechanism that selects the candidates, and to the strategy used to distribute the problem solving.
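Assuming the garbled selection formula above reads pos = floor((n − 1) · r^(tmp+1)), the two-stage biased draw can be sketched as follows; all function and table names are illustrative, not from the paper:

```python
import random

def biased_pick(n, tmp, rng=random):
    """Return a position in an ordered list of n candidates.
    Assumes the selection function pos = floor((n - 1) * r**(tmp + 1))
    with r uniform in [0, 1): tmp = 0 gives a nearly uniform draw,
    while larger temperatures concentrate probability near position 0."""
    r = rng.random()
    return int((n - 1) * r ** (tmp + 1))

def select_input(table, size_tmp, perf_tmp, rng=random):
    """Two-stage biased draw described in the text: first pick a size
    category (smallest sizes favoured), then one sub-net inside it
    (most efficient favoured). `table` maps size -> candidate list
    sorted by decreasing performance (illustrative layout)."""
    sizes = sorted(table)
    size = sizes[biased_pick(len(sizes), size_tmp, rng)]
    candidates = table[size]
    return candidates[biased_pick(len(candidates), perf_tmp, rng)]
```

With a large temperature, `biased_pick` almost always returns position 0, reproducing the near breadth-first behaviour mentioned in the text.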
During the standard neuron generation phase, this biased random selection is applied to the ordered list of the one-input yprels. Only the input features used by the sub-net that wins the competition process will be present in the final network architecture. In this way, the yprel methodology combines automatic structure determination with an active input feature selection adapted to each particular network goal.
Unfortunately, the first trials on real complex problems with just one network per class showed how difficult it is to find satisfactory network structures, and how important the threshold value on the number of decisions taken by each yprel is. A high value limits the number of candidates by selecting only the most reliable neurons, but does not allow all the learning base elements to be classified. Conversely, a small value increases the number of neurons with non-representative decision domains, leading to overfitting. Moreover, these trials showed that the threshold values had to differ according to the treated classes and the problems to solve; in other words, an adaptive parameter is absolutely necessary.
The maximum possible value for this parameter corresponds to a decision domain that contains all the elements not belonging to the network class (i.e. the concerned class is linearly separable from the others inside a single yprel). The neuron generation starts with this maximum value. After a certain number of trials, if no standard yprel has been kept, the threshold value is decreased. The process is repeated until the system can select some candidates within the number of available trials. The generated network takes classification decisions about a part of the learning data. All the undecided elements make up the learning base of a new network linked to the same class, which in turn reduces the number of non-decisions, and so on. Figure 9 illustrates this strategy based on successive learning base filterings. This task decomposition depends on the created networks and their individual classification behaviors. It is entirely linked to the self-organization of the network structures and thus emerges from the learning phases.
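The adaptive-threshold loop and the successive base filterings described above can be sketched as follows; `train_network`, `decides` and all other names are hypothetical stand-ins for the paper's yprel generation phase:

```python
def build_class_networks(learning_base, train_network, max_trials=50):
    """Successive learning-base filterings with an adaptive threshold.

    `train_network(base, threshold, max_trials)` stands in for the yprel
    generation phase: it returns a trained network object exposing
    `decides(element)`, or None when no standard yprel reaches the
    threshold within the allowed trials. All names are illustrative."""
    networks = []
    base = list(learning_base)
    while base:
        threshold = len(base)            # start from the maximum possible value
        net = None
        while net is None and threshold > 0:
            net = train_network(base, threshold, max_trials)
            if net is None:
                threshold -= 1           # relax the selection rule and retry
        if net is None:
            break                        # nothing decidable remains
        networks.append(net)
        remaining = [x for x in base if not net.decides(x)]
        if len(remaining) == len(base):
            break                        # no progress: stop filtering
        base = remaining                 # undecided elements feed the next net
    return networks
```

Each network only sees the elements left undecided by its predecessors, which is exactly the filtering chain of Fig. 9.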
Fig. 9. Task decomposition by successive filterings of the learning base: the initial base feeds the first network, and each filtered base feeds a new network for the same class.
This strategy based on the emergence of the task decomposition is applied to all classes. Each class will be treated by a different number of networks. The networks linked to a particular class have to reach rejection decisions about the elements of other classes. At the same time, the networks associated with other classes try to reach recognition decisions about the same elements. The different network goals can therefore be treated in an interacting way, simplifying the learning phase and the structure of a particular network under construction. To obtain this inter-class cooperation scheme, the whole classifier is built layer by layer. One layer represents one step of each intra-class decomposition. Inside a layer, the networks are built class after class. A given layer will contain at most one network per class, and fewer if certain class treatments have been completed in the previous layers. A network under construction can then consider the outputs of all the previously built networks as a supplementary set of possible input features. The proposed methodology leads to a real self-organization of the internal cooperation links between the networks, since they emerge from the learning phase thanks to the selection rule working on this new set of one-input yprels. Figure 10 gives an example of a possible distributed architecture for a 3-class problem c0, c1, c2, where the networks are built in the class order (0, 1, 2) and the classifier layer after layer.
Fig. 10. A possible distributed architecture for a 3-class problem: the initial base feeds the network layers, which produce the classifier outputs.
V. PRELIMINARY RESULTS
Some preliminary results have been obtained with the previous methodology on the classical problem of handwritten digit recognition. The learning base contains 5,000 elements, constituted by the first 500 examples of each class extracted from the NIST database without any sorting of the population, while the test base contains 53,383 digits. A feature vector of 124 components has been extracted from each character. This vector contains classical features in the field of OCR (invariant moments, number and position of intersections with vertical and horizontal straight lines, projection profiles, etc.). The total recognition rate is 95.5% for a 4.5% error rate, which is reasonable given the size and the unsorted population of the learning base.
The whole classifier structure is shared among a set of 117 networks for a total of 445 neurons. The average network size is 3.8 yprels. The smallest networks have only 3 neurons (1 standard yprel combining 2 one-input yprels); the biggest has 9 neurons. The different cooperation links are summarized in the following table. The number of networks per class illustrates the intra-class task decomposition. The whole system contains 112 inter-class cooperation links corresponding to the generation of internal features (i.e. the concerned one-input yprels are linked to other network outputs).
class              0   1   2   3   4   5   6   7   8   9
nb networks        7  11  15  14  10  15   6  10  18  11
nb neurons        26  43  57  60  46  49  26  38  64  36
internal features  4  10  20  16   3  12   6   8  24   9
We can notice that the strategy used generates a large number of very small networks linked by the two cooperation schemes. The global classifier can thus be compared to a single network containing very localized areas dedicated to explicit treatments of each class.
Except for the first network of each class, which works on the whole learning base, all the created networks only try to reduce previously filtered bases. There is thus a real reduction of the task to solve for each class as we progress through the classifier layers. Figure 11 indicates the decision rates obtained for each class inside the first layers. The simulated elements will not use the same network sub-sets, depending on where the recognition and rejection decisions are taken for each class. Figure 12 gives the percentages of networks used.
Fig. 11. Decision rates by class. Fig. 12. Network usage.
We have seen that the elements do not use the same network sets, depending on the decisions made. As each network selects its own input features, different elements will use different input feature sub-sets. Thus, in the yprel methodology the input feature extraction could be managed by the element itself according to the networks to simulate. The two following figures illustrate the use of the extracted input features. The first gives the number of input features used by the elements, while the second shows the use frequencies of the extracted features.
VI. CONCLUSION
This paper shows that good performance can be obtained on supervised classification problems by using a completely self-organizing yprel network population. The proposed methodology can be very useful for large-scale problems (high numbers of classes, elements and input features) thanks to the distributed resolution strategy, which emerges from the learning phase and allows both a partial simulation of the generated system and an active input feature extraction according to the networks to simulate. A way to restart the learning phases with new elements by freezing some reliable networks is currently under study, in order to obtain a real self-organized incremental data learning scheme.
References
[1] R.K. Powalka, N. Sherkat, R.J. Whitrow, (1996), "Multiple recognizer combination topologies", in "Handwriting and drawing research: basic and applied issues", M.L. Simner, C.G. Leedham, A.J.W.M. Thomassen (Eds.), IOS Press, pp 329-342.
[2] C. Jutten, (1995), "Learning in evolutive architecture: an ill-posed problem?", Proc. IWANN'95, "From natural to artificial neural computation", J. Mira, F. Sandoval (Eds.), Lecture Notes in Computer Science, Springer-Verlag, pp 361-373.
[3] D. Ackley, M. Littman, (1991), "Interactions between learning and evolution", Artificial Life II, SFI studies in the sciences of complexity, vol. X, Addison-Wesley, pp 487-509.
[4] J. Sietsma, R.J.F. Dow, (1991), "Creating artificial networks that generalize", Neural Networks, Vol. 4, No. 1, pp 67-79.
[5] S.A. Harp, T. Samad, A. Guha, (1989), "Towards the genetic synthesis of neural networks", Proc. 3rd Int. Conf. on Genetic Algorithms, pp 360-369.
An Accurate Measure for Multilayer Perceptron
Tolerance to Additive Weight Deviations
1 Introduction
Based on the analogy between Artificial Neural Networks (ANNs) and natural ones, it is frequently assumed that ANNs are inherently tolerant to faults. This assumption has been shown to be false [1,2]; consequently, some configurations of weights for a fixed ANN structure are more tolerant than others, although they provide similar performance with respect to learning ability. It would thus be useful to have an accurate means of measuring this tolerance.
Different measurements have been proposed. In [3] the probability of errors in Multilayer Perceptrons (MLPs) that use the simple-step function as activation is used to study the tolerance of such structures, and in [4] the study is extended to neurons with a multiple-step activation function. The authors found that the probability of error is not affected by an increase in the number of neurons per layer.
In [1] a simulation-derived quantitative measure making use of the worst-case hypothesis to consider the number of neurons that present a stuck-at fault is used, while in [2] a procedure based on the replication of hidden units is proposed to achieve fault-tolerant ANNs, providing metrics for tolerance as a function of redundancy.
The authors of [5] show that the insertion of synaptic noise during the learning process increases the fault tolerance of ANNs. They also showed that enlarging the networks does not improve fault tolerance at all, but on the contrary makes them more susceptible to faults.
Hence, it is clear that the supposed fault tolerance of MLPs is relative, and that the degree of fault tolerance can and should be measured, as is the learning performance. This becomes significant when the learning process is performed in a medium different from the one where the MLP is implemented, as there may be differences in the accuracy used to store the values of the weights or other magnitudes. These differences may seriously degrade the performance of the MLP. A discussion of hardware error sources is presented in [5].
Choi et al. [7] proposed statistical sensitivity as a measure of tolerance in MLPs, and thus as a criterion for selecting a configuration of weights from different alternatives presenting similar performance with respect to learning. Statistical sensitivity measures the output changes of the MLP when the values of the weights change, a low value implying less degradation of the learning performance. This fact is implicitly proved by the results obtained.
In this work, an explicit relation between the statistical sensitivity to weight deviations and the mean square error (MSE) between the desired and obtained outputs of the MLP is shown, such that it is possible to accurately predict the degradation of the MSE for a particular value of this deviation. A new figure that we call Mean Square Sensitivity (MSS) is proposed as a measurement of the fault tolerance to weight deviations. The MSS can be easily computed after training and is shown to constitute a good measurement of the tolerance of the network.
The statistical sensitivity can be computed locally for each neuron, so different alternatives arise for future work and research; for example, the study of tolerance to weight perturbations of particular elements in the network, or the modification of the training algorithm in order to reduce the MSS. Other proposed measures affect the whole network [1,2,3,4], so their use to study tolerance is more limited. Furthermore, as the MSS is directly related to the MSE degradation, an accurate and quantitative measure of MLP tolerance is obtained without making use of any hypotheses.
A multilayer perceptron (MLP) is composed of M layers, where each layer m (m = 1, ..., M) has N_m neurons. Neuron i of layer m is connected to the N_{m-1} neurons of the previous layer by a set of weights w_{ij}^m (j = 1, ..., N_{m-1}).
The output y_i^m of a neuron i belonging to layer m is a function f_i^m (the activation function) of the weighted sum z_i^m of the outputs coming from the neurons in the previous layer:

    y_i^m = f_i^m(z_i^m) = f_i^m( \sum_{j=1}^{N_{m-1}} w_{ij}^m \, y_j^{m-1} )        (1)

where the y_j^{m-1} are the outputs of the N_{m-1} neurons of the previous layer. In particular, y_i^0 (i = 1, ..., N_0) are the inputs to the network.
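Expression (1) amounts to the usual matrix-vector forward pass; a minimal NumPy sketch follows (layer shapes and activation choice are illustrative, and biases are omitted as in (1)):

```python
import numpy as np

def forward(weights, activations, x):
    """Forward pass implementing expression (1):
    y_i^m = f_i^m( sum_j w_ij^m * y_j^(m-1) ).
    `weights[m]` is the N_m x N_{m-1} matrix of layer m+1 and
    `activations[m]` its activation function."""
    y = np.asarray(x, dtype=float)       # y^0: the network inputs
    for W, f in zip(weights, activations):
        y = f(W @ y)                     # z^m = W y^(m-1);  y^m = f(z^m)
    return y
```

With sigmoid activations this matches the kind of network analysed later in the paper.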
During learning, the weights are adapted in order to minimize the MSE by using a gradient descent algorithm known as backpropagation. Depending on the initial values of the weights and other parameters of the algorithm, different values are reached after training. These possible solutions may present a similar MSE, but they differ with respect to the fault tolerance obtained. Fault tolerance is related to a uniform distribution of learning, such that the saliency of the weights is regular [6].
The values of the weights in an electronic implementation can be changed if a circuit presents a defect. Moreover, the backpropagation algorithm is often executed on a general-purpose computer and the weights obtained are then physically implemented; differences in the loaded values may thus occur. The statistical sensitivity S_i^m allows us to measure, in a quantitative way, the degradation of the expected output of a neuron i in layer m when the values of the weights change. The statistical sensitivity is defined in [7] by the following expression:
    S_i^m = \lim_{\sigma \to 0} \frac{ \sqrt{ var(\Delta y_i^m) } }{ \sigma }        (2)

where \sigma represents the standard deviation of the changes in the weights, and var(\Delta y_i^m) is the variance of the deviation in the output (with respect to the output of the MLP in the absence of perturbations) due to these changes, which can be computed as:

    var(\Delta y_i^m) = E[(\Delta y_i^m)^2] - (E[\Delta y_i^m])^2        (3)

where E[.] is the expected value of [.]. To first order, the deviation of the output is:

    \Delta y_i^m \approx \partial f_i^m \sum_{j=1}^{N_{m-1}} ( y_j^{m-1} \Delta w_{ij}^m + w_{ij}^m \Delta y_j^{m-1} )        (4)

where \partial f_i^m = df_i^m / dz_i^m.
For the first layer, since the input deviations are zero:

    E[\Delta y_i^1] = E[ \partial f_i^1 \sum_{j=1}^{N_0} y_j^0 \Delta w_{ij}^1 ] = \partial f_i^1 \sum_{j=1}^{N_0} y_j^0 E[\Delta w_{ij}^1] = 0        (5)

and, for a generic layer m:

    E[\Delta y_i^m] = E[ \partial f_i^m \sum_{j=1}^{N_{m-1}} ( y_j^{m-1} \Delta w_{ij}^m + w_{ij}^m \Delta y_j^{m-1} ) ]
                    = \partial f_i^m \sum_{j=1}^{N_{m-1}} ( y_j^{m-1} E[\Delta w_{ij}^m] + w_{ij}^m E[\Delta y_j^{m-1}] ) = 0        (6)

since E[\Delta w_{ij}^m] = 0 and, by induction, E[\Delta y_j^{m-1}] = 0.
Therefore var(\Delta y_i^m) = E[(\Delta y_i^m)^2]. Assuming independent weight deviations with variance \sigma^2:

    E[(\Delta y_i^m)^2] = (\partial f_i^m)^2 ( \sigma^2 \sum_{j=1}^{N_{m-1}} (y_j^{m-1})^2 + \sum_{j=1}^{N_{m-1}} \sum_{k=1}^{N_{m-1}} w_{ij}^m w_{ik}^m C_{jk}^{m-1} )        (7)

where C_{jk}^{m-1} = E[\Delta y_j^{m-1} \Delta y_k^{m-1}] can be computed recursively as:

    C_{jk}^m = (\partial f_j^m)^2 ( \sigma^2 \sum_{r=1}^{N_{m-1}} (y_r^{m-1})^2 + \sum_{r=1}^{N_{m-1}} \sum_{s=1}^{N_{m-1}} w_{jr}^m w_{js}^m C_{rs}^{m-1} )   if j = k

    C_{jk}^m = \partial f_j^m \, \partial f_k^m \sum_{r=1}^{N_{m-1}} \sum_{s=1}^{N_{m-1}} w_{jr}^m w_{ks}^m C_{rs}^{m-1}   otherwise        (8)
The off-diagonal terms come from expanding the product of the two first-order deviations:

    E[\Delta y_j^m \Delta y_k^m]
      = E[ \partial f_j^m \sum_r ( y_r^{m-1} \Delta w_{jr}^m + w_{jr}^m \Delta y_r^{m-1} ) \, \partial f_k^m \sum_s ( y_s^{m-1} \Delta w_{ks}^m + w_{ks}^m \Delta y_s^{m-1} ) ]
      = \partial f_j^m \, \partial f_k^m \sum_{r=1}^{N_{m-1}} \sum_{s=1}^{N_{m-1}} ( y_r^{m-1} y_s^{m-1} E[\Delta w_{jr}^m \Delta w_{ks}^m]
        + w_{jr}^m w_{ks}^m E[\Delta y_r^{m-1} \Delta y_s^{m-1}]
        + y_r^{m-1} w_{ks}^m E[\Delta w_{jr}^m \Delta y_s^{m-1}]
        + y_s^{m-1} w_{jr}^m E[\Delta w_{ks}^m \Delta y_r^{m-1}] )        (9)

where, for j \neq k, the first term vanishes (independent weight deviations) and the two cross terms tend to zero, leaving only the term in C_{rs}^{m-1}.
In the particular case of MLPs with only one hidden layer, expression (7) can be computed in a more compact form. Taking into account that the statistical sensitivity of the inputs is zero, i.e., S_i^0 = 0 for all i, the matrix C^1 is diagonal and the sensitivity of an output neuron reduces to:

    (S_i^2)^2 = (\partial f_i^2)^2 ( \sum_{j=1}^{N_1} (y_j^1)^2 + \sum_{j=1}^{N_1} ( w_{ij}^2 \, \partial f_j^1 )^2 \sum_{k=1}^{N_0} (y_k^0)^2 )        (10)

In this way, as expression (10) shows, the computation of the statistical sensitivity can be performed in a relatively easy way for each neuron of the MLP and for each input pattern in the case of MLPs with one hidden layer, because there are no cross terms as in expression (7).
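Under the reconstruction of expression (10) above (sigmoid activations, no bias terms, independent additive weight deviations, all of which are assumptions of this sketch), the output-layer sensitivities of a one-hidden-layer MLP can be computed as:

```python
import numpy as np

def sensitivity_output(x, W1, W2):
    """Statistical sensitivity of the output neurons of a one-hidden-layer
    MLP with sigmoid activations, following the compact form (10):
    (S_i^2)^2 = (df_i^2)^2 * ( sum_j (y_j^1)^2
                               + sum_j (w_ij^2 * df_j^1)^2 * sum_k (y_k^0)^2 )."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    y0 = np.asarray(x, dtype=float)
    y1 = sigmoid(W1 @ y0)
    d1 = y1 * (1.0 - y1)                  # df^1/dz for the sigmoid
    y2 = sigmoid(W2 @ y1)
    d2 = y2 * (1.0 - y2)                  # df^2/dz
    s_sq = d2**2 * (np.sum(y1**2) + (W2**2) @ (d1**2) * np.sum(y0**2))
    return np.sqrt(s_sq)
```

The result can be cross-checked against the definition (2) by Monte-Carlo perturbation of the weights with a small \sigma.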
The goal of the backpropagation algorithm is to reduce the mean square error (MSE), which can be expressed as:

    e = \frac{1}{2 N_p} \sum_{p=1}^{N_p} \sum_{i=1}^{N_M} ( d_i(p) - y_i^M(p) )^2        (11)

where d_i(p) is the desired output of output neuron i for pattern p, and N_p is the number of patterns. When the weights are perturbed, the output deviates by \Delta y_i^M(p) and the degraded error e' becomes:

    e' = e - \frac{1}{N_p} \sum_{p=1}^{N_p} \sum_{i=1}^{N_M} ( d_i(p) - y_i^M(p) ) \Delta y_i^M(p) + \frac{1}{2 N_p} \sum_{p=1}^{N_p} \sum_{i=1}^{N_M} ( \Delta y_i^M(p) )^2        (12)

Now, if we compute the expected value of e' and take into account that E[\Delta y_i^M] = 0 and that E[(\Delta y_i^M)^2] can be obtained from expressions (2) and (3) as E[(\Delta y_i^M)^2] = \sigma^2 (S_i^M)^2, the following expression is obtained:

    E[e'] = e + \frac{\sigma^2}{2 N_p} \sum_{p=1}^{N_p} \sum_{i=1}^{N_M} ( S_i^M(p) )^2        (13)

By analogy with the definition of the MSE, we define the following figure as the Mean Square Sensitivity (MSS):

    MSS = \frac{1}{2 N_p} \sum_{p=1}^{N_p} \sum_{i=1}^{N_M} ( S_i^M(p) )^2        (14)
The MSS can be computed from the statistical sensitivity of the neurons belonging to the output layer, as expression (14) shows. Combining expressions (13) and (14), the expected degradation of the MSE, E[e'], can be computed as:

    E[e'] = e + \sigma^2 \, MSS        (15)

Thus, (15) shows the direct relation between the MSE degradation and the MSS. As the MSS can be directly computed after training, it is possible to predict the degradation of the MSE when the weights deviate from their nominal values within a range with standard deviation equal to \sigma. Moreover, as the expression obtained shows, a lower value of the MSS implies a lower degradation of the MSE, so we propose using the MSS as a suitable measure of the tolerance of MLPs to weight deviations. Note that, as the statistical sensitivity of a particular neuron can be computed independently, several lines of research are open to study the tolerance of particular elements, or to develop new training algorithms that take the MSS into account as another term to minimize during learning. In [7] it is proposed to use the average statistical sensitivity as a criterion to select a weight configuration from different possibilities presenting a similar MSE after training. However, the MSS is directly obtained from the MSE degradation, as expression (15) shows, and thus constitutes a better measurement of MLP tolerance against weight perturbations.
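Expressions (14) and (15) translate directly into code. A small sketch follows; the array layout is an assumption (`S_out[p, i]` holds the sensitivity of output neuron i for pattern p):

```python
import numpy as np

def mean_square_sensitivity(S_out):
    """Expression (14): MSS = 1/(2*Np) * sum_p sum_i (S_i^M(p))^2,
    with S_out[p, i] the sensitivity of output neuron i for pattern p."""
    S = np.asarray(S_out, dtype=float)
    return np.sum(S**2) / (2.0 * S.shape[0])

def predicted_degraded_mse(nominal_mse, mss, sigma):
    """Expression (15): E[e'] = e + sigma^2 * MSS."""
    return nominal_mse + sigma**2 * mss
```

Given the sensitivities of the output layer over the pattern set, the expected MSE under weight deviations of standard deviation sigma follows in two lines.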
4 Results
In order to validate expression (15), we compared the results obtained for the MSE when the MLPs are subject to additive deviations with the value predicted by this expression. Two MLPs were considered: an approximator of the sine function [8] and a predictor of the Mackey-Glass temporal series [9]. The approximator had 1 input neuron, 11 neurons in the hidden layer and 1 output neuron, and the predictor consisted of 3 input neurons, 11 neurons in the hidden layer and 1 output neuron. All the neurons considered contained a bias input. Table 1 shows the values of MSE and MSS obtained after training with the test patterns (different from those used for training).
Table 1. MSE and MSS obtained after training, for the approximator and the predictor.
All the weights obtained after learning were deviated from their nominal values using the additive model, such that each weight w_{ij}^m takes a value (w_{ij}^m + \delta_{ij}), where \delta_{ij} is a random variable with standard deviation \sigma and zero mean. Table 2 shows the values of the MSE predicted and obtained experimentally for different values of \sigma. For each value of \sigma considered, the experimental values of the MSE are averaged over 100 tests, where each test consists of a random deviation of all the weights of the MLP. The confidence interval at 95% is also presented in Table 2.
Expression (15) is shown to be valid; it accurately predicts the degradation of the MSE when the weights present perturbations. It is also confirmed that the lower the value of the MSS, the lower the degradation of the MSE. Thus, even if a particular configuration presents a lower MSE after training, when its MSS is high this nominal MSE is strongly degraded in the presence of deviations; the MSS must therefore be considered when a weight configuration is to be chosen.
Table 2. Predicted and experimental MSE for different values of \sigma, for the approximator and the predictor.
Figures 1 and 2 show the degradation of the MSE for different values of \sigma. The values predicted and obtained experimentally are represented for the approximator and the predictor, respectively. Each experimental value is plotted with its respective confidence interval at 95%, obtained with 100 samples. As in Table 2, the predicted values of the MSE accurately fit those obtained experimentally. The matching between the predicted and the experimental values of the MSE is better when the weight deviations are smaller; for greater deviations, however, the prediction constitutes an upper bound for the MSE degradation.
5 Conclusions
In this letter we have presented the relation between the mean square error (MSE) and the statistical sensitivity. As the statistical sensitivity measures the deviation in the output of an MLP when its weights are perturbed, this relation allows us to obtain a useful criterion to evaluate the fault tolerance of the network. To compare different weight configurations, we propose the use of the Mean Square Sensitivity (MSS), which is computed from the statistical sensitivity. Lower values of MSS imply lower degradations of the MSE. The results show the correctness of the expressions obtained. What distinguishes the MSS from other measures proposed to assess the tolerance of MLPs is that it is directly related to the MSE degradation; also, as the statistical sensitivity can be computed for each neuron of the MLP, new research possibilities are opened for the study of related aspects. As future work, a new backpropagation algorithm that includes the objective of minimizing the MSS jointly with the MSE will be developed, in order to obtain weight configurations that maximize fault tolerance while maintaining learning performance. As the MSS is an accurate measure of MSE degradation, the performance of such an algorithm will probably be better than that described in [10] for a similar training algorithm based on average statistical sensitivity minimization.
Fig. 1. Predicted and experimental MSE vs. standard deviation of the weight perturbations (approximator).
Fig. 2. Predicted and experimental MSE vs. standard deviation of the weight perturbations (predictor).
References
Abstract. Fuzzy heterogeneous networks are recently introduced neural network models composed of neurons of a general class whose inputs and weights are mixtures of continuous variables (crisp and/or fuzzy) with discrete quantities, also admitting missing data. These networks have net input functions based on similarity relations between the inputs and the weights of a neuron. They thus accept heterogeneous (possibly missing) inputs, and can be coupled with classical neurons in hybrid network architectures, trained by means of genetic algorithms or other evolutionary methods. This paper compares the effectiveness of the similarity-based fuzzy heterogeneous model with the classical feed-forward one, in the context of an investigation in the field of environmental sciences, namely the geochemical study of natural waters in the Arctic (Spitzbergen). Classification performance, the effect of working with crisp or fuzzy inputs, the use of traditional scalar-product vs. similarity-based net input functions, and the presence of missing data are studied. The results obtained show that, from these standpoints, fuzzy heterogeneous networks based on similarity perform better than classical feed-forward models. This behaviour is consistent with previous results in other application domains.
1 Introduction
training procedure, and indeed these networks were able to learn from non-trivial data sets with an effectiveness comparable, and sometimes superior, to that of classical methods. They also exhibited a remarkable robustness when information degrades due to the increasing presence of missing data. One step further in the development of the heterogeneous neuron model was the inclusion of fuzzy quantities within the input set, extending the former use of real-valued quantities of crisp character. In this way, uncertainty and imprecision (in inputs and weights) can be explicitly considered within the model, making it more flexible. In the context of a real-world application example in geology [12], it was found that hybrid networks using fuzzy heterogeneous neurons perform better when treating the same data with its natural imprecision than when considering the data as crisp quantities, as is usually done. Moreover, in the same study it was found that hybrid networks with heterogeneous neurons in general (i.e. with or without fuzzy inputs) outperform feed-forward networks with classical neurons, even when trained with sophisticated procedures such as a combination of gradient techniques with simulated annealing.
In this paper, the possibilities of this kind of neuron are illustrated by comparison with fully classical architectures on a real-world problem. The paper is organized as follows. Section 2 reviews the concept of heterogeneous neurons and their use in configuring hybrid neural networks for classification tasks. Section 3 describes the example application at hand, the fruit of environmental research in the Arctic, while Section 4 covers the different experiments performed: description, settings and discussion. Finally, Section 5 presents the conclusions.
A fuzzy heterogeneous neuron was defined in [12] as a mapping h : H^n \to R_{out} \subseteq R. Here R denotes the reals and H^n is a cartesian product of an arbitrary number n of source sets. These source sets may be extended reals \hat{R}_i = R_i \cup \{X\}, extended families of (normalized) fuzzy sets \hat{F}_i = F_i \cup \{X\}, and extended finite sets of the form \hat{O}_i = O_i \cup \{X\}, \hat{M}_i = M_i \cup \{X\}, where each of the O_i has a full order relation, while the M_i have not. In all cases, the extension is given by the special symbol X, which denotes the unknown element (missing information) and behaves as an incomparable element w.r.t. any ordering relation. Consider now the collection of n_f extended fuzzy sets of the form \hat{F}_i = F_i \cup \{X\} and their cartesian product \hat{F}^{n_f} = \hat{F}_1 \times \hat{F}_2 \times \cdots \times \hat{F}_{n_f}. The resulting input set will then be H^n = \hat{R}^{n_r} \times \hat{F}^{n_f} \times \hat{O}^{n_o} \times \hat{M}^{n_m}, where the cartesian products for the other kinds of source sets (\hat{R}^{n_r}, \hat{O}^{n_o}, \hat{M}^{n_m}) are constructed in a similar way from their respective cardinalities n_r, n_o, n_m, with n = n_r + n_f + n_o + n_m and n > 0. According to this definition, neuron inputs are vectors composed of n elements, among which there might be reals, fuzzy sets, ordinals, nominals and missing data.
An interesting particular class of heterogeneous submodels is constructed by considering h as the composition of two mappings, h = f \circ s, such that s : H^n \to R_s \subseteq R and f : R_s \to R_{out} \subseteq R. The mapping h can be seen as an n-ary function parameterized by an n-ary vector \hat{w} \in H^n representing the neuron's weights, i.e. h(\hat{x}, \hat{w}) = f(s(\hat{x}, \hat{w})). Within this framework, several of the most common artificial neuron models can be derived. For example, the classical scalar-product-driven model is obtained by making
    s_{ij} = \frac{ \sum_{k=1}^{n} g_{ijk} \, \delta_{ijk} }{ \sum_{k=1}^{n} \delta_{ijk} }

where g_{ijk} is a similarity score for objects i, j according to their value for variable k. These scores are in the interval [0, 1] and are computed according to different schemes for numeric and qualitative variables. In particular, for a continuous variable k and any two objects i, j the following similarity score is used:

    g_{ijk} = 1 - \frac{ |v_{ik} - v_{jk}| }{ range(v_{.k}) }

Here, v_{ik} denotes the value of object i for variable k, and range(v_{.k}) = \max_{i,j}( |v_{ik} - v_{jk}| ) (see [7] for details on other kinds of variables). The \delta_{ijk} is a binary function expressing whether both objects are comparable or not according to their values for variable k. It is 1 if and only if both objects have values different from X for variable k, and 0 otherwise. In this way, in the model considered here, Gower's original definitions for real-valued and discrete variables are kept. For variables representing fuzzy sets,
similarity relations from the point of view of fuzzy theory have been defined elsewhere
[5], [15], and different choices are possible. In our case, if F_i is an arbitrary family of fuzzy sets from the source set, and \tilde{A}, \tilde{B} are two fuzzy sets such that \tilde{A}, \tilde{B} \in F_i, the following similarity relation is used:
input function, and the presence of missing information prevent the use of gradient-based techniques. The resulting heterogeneous neuron can be used to configure feed-forward network architectures in several ways. In this paper it is shown how layered feed-forward structures, with a hidden layer composed of heterogeneous neurons and an output layer of classical units, are natural choices better suited to the data than their fully classical counterparts.
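The Gower-based net input s_ij described above, with the \delta_ijk comparability mask, can be sketched as follows; only the numeric-variable score is implemented, and `None` stands for the unknown element X:

```python
def gower_similarity(vi, vj, ranges):
    """Gower-based net input: s_ij = sum_k g_ijk * d_ijk / sum_k d_ijk.

    `vi`, `vj` are object vectors where None marks the unknown element X;
    `ranges[k]` is range(v_.k) for the k-th (numeric) variable.
    A sketch covering only the continuous-variable case of g_ijk."""
    num = den = 0.0
    for a, b, rng in zip(vi, vj, ranges):
        comparable = a is not None and b is not None   # delta_ijk
        if comparable:
            num += 1.0 - abs(a - b) / rng              # g_ijk, in [0, 1]
            den += 1.0
    return num / den if den else 0.0
```

Missing entries simply drop out of both sums, which is how the model tolerates incomplete inputs without any imputation.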
4 Experiments
4.1 General Information
way, the actual distribution was 37, 29, 10, 11, 27. Default accuracy (relative frequency of the most common class) is then 37/114 or 32.5%. Entropy, calculated as -\sum_{k} (n_k / N) \log_2 (n_k / N), is equal to 2.15 bits. There were no missing data and all
Fig. 1. A triangular fuzzy number constructed from the reported crisp value r, with support [r - 5%, r + 5%].
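A membership function for the triangular fuzzy number of Fig. 1 might look as follows; the 5% half-width is read off the figure, and the exact spread used in the study is an assumption:

```python
def triangular_membership(x, r, spread=0.05):
    """Membership degree of x in the triangular fuzzy number built
    from the crisp reported value r, with support [r*(1-spread),
    r*(1+spread)] and peak 1 at x = r (spread of 5% as in Fig. 1)."""
    lo, hi = r * (1.0 - spread), r * (1.0 + spread)
    if x <= lo or x >= hi:
        return 0.0
    if x <= r:
        return (x - lo) / (r - lo)       # rising edge
    return (hi - x) / (hi - r)           # falling edge
```

This is the fuzzification applied to each crisp measurement before it is fed to the fuzzy heterogeneous networks.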
4.2 Experiment Settings
In the present study, all models (including the classical feed-forward one) were trained using exactly the same procedure and parameters, in order to exclude this source of variation from the analysis. Of course, fully classical architectures need not be trained using the SGA. They could instead be trained using any standard (or more sophisticated) algorithm using gradient information. However, this would have made direct comparison much more difficult, since one could not attribute differences in performance exclusively to the different neuron models, but also to their training algorithms. The experiment settings were the following:
    SE = \sum_{i} \sum_{j} ( t_j^i - y_j^i )^2

where y_j^i is the j-th component of the output vector y^i computed by the network at a given time, when the input vector x^i is presented, and t_j^i = c_j(x^i) is the target for x^i, where c_j represents the characteristic function for class j. The error displayed will be the mean squared error, defined as MSE = SE / (m p), where m is the number of outputs and p the number of patterns.
Let the classification accuracy for training (TR) and test (TE) sets, calculated with a winner-take-all strategy, be denoted CA_TR(r) and CA_TE(r), respectively, for a given run r. The errors MSE_TR(r) and MSE_TE(r) are similarly defined. For each neural architecture, the following data is displayed:

Accuracy: mean classification accuracy on training, MCA_TR = (1/R) \sum_{run=1}^{R} CA_TR(run); the same on test, MCA_TE = (1/R) \sum_{run=1}^{R} CA_TE(run); and best classification accuracy (BCA), defined as the pair < CA_TR(r), CA_TE(r) > with the highest CA_TE(r).
[Table: classification accuracies and MSEs for the 5n and 5l architectures. Values appearing in the original include accuracies 66.3%, 99.4%, 67.1%, 69.3%, 75.0%, 76.8%, 100%, 75.6% and MSEs 0.1084, 0.0338, 0.1202, 0.0917; the column layout is not recoverable from the scan.]
The neural nets obtained in the previous experiment can now be used to assess the
effect of factor (c), the influence of missing values in the data. The purpose of this
experiment is twofold: first, it is useful to study to what extent missing information
degrades performance. This is an indication of robustness and is important from the
point of view of the methods. Second, in this particular problem, studying the effect
of missing data is very interesting, because it can give an answer to the following
questions:
1. What predictive performance could we expect if we do not supply all the information, but just a fraction of it?
2. What would have happened had we presented incomplete training information to the net from the outset?
This scenario makes sense in our case study, for which a rich set of complete data
may be impossible to obtain, because of scarce or damaged resources, physical or
practical unrealizability, lack of time, climatic conditions, etc. Note that it is not
that a particular variable cannot be measured (we could readily remove it) but that
some realizations of (potentially) all variables may be missing. These experiments
were performed with the same nets found in the previous section. This time, however,
they were each run on different test sets, obtained by artificially and randomly (with
Fig. 2. Increasing presence of missing data in test. Mean test classification accuracy for the heterogeneous
(ph5n) and fuzzy heterogeneous (pf5n) families. (a) 5h and 5f (b) 2h5n and 2f5n (c) 4h5n and 4f5n (d)
6h5n and 6f5n (e) 8h5n and 8f5n (f) Mean test classification accuracy for 6h5n and 6f5n when trained with
a 30% of missing information. See text for an explanation of axes.
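The dilution procedure described in the text (random removal of individual values, potentially from every variable, at increasing percentages) can be sketched as follows; the function name and the NaN encoding of missing values are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def dilute(test_set, fraction, missing=np.nan):
    """Return a copy of the test set in which roughly `fraction` of all
    cells (realizations of potentially every variable) are replaced by
    a missing-value mark."""
    diluted = test_set.astype(float)          # astype returns a copy
    mask = rng.random(diluted.shape) < fraction
    diluted[mask] = missing
    return diluted

X_test = rng.random((100, 5))
for frac in (0.1, 0.3, 0.5, 0.7, 0.9):
    Xd = dilute(X_test, frac)
    print(frac, np.isnan(Xd).mean())          # observed missing rate
```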
Both neuron models, h and f, are very robust, a fact that shows in the curves, which follow
a quasilinear decay. The accuracies are consistently higher for the fuzzy model than for
its crisp counterpart for all the network architectures, again showing that allowing
imprecision increases effectiveness and robustness. Performance, in general, is well
above the default accuracy until 50%-60% of missing information is introduced. In
many cases, mean classification accuracy is still above it for as much as 70%-90%, which
is very remarkable. This graceful degradation of fuzzy heterogeneous models should
not be overlooked, since it is a very desirable feature in any model intended to be useful
in real-world problems.
The last figure -fig. 2 (f)- shows the effect of a different training outset. Choosing
what seems to be the best group of architectures for the given problem, the 6h5n and
6f5n, these networks were trained again, this time with a modified training set: adding
to it a 30% of missing information, in the same way as was done for the test set, and
using them again to predict the increasingly diluted test sets. As usual, the horizontal
line represents the size of the majority class, and k-nearest neighbours performance is
also shown. Training and test accuracies were this time lower (as one should expect)
and equal to MCA_TR = 88.8% for 6h5n and to MCA_TR = 96.3% for 6f5n. However,
the differences with previous performance are relatively low. Some simple calculations
show that, although the amount of data is 70% that of the previous situation, the new
accuracies are 97.3% and 96.3% of those obtained with full information for 6h5n and
6f5n, respectively. Performance on test sets is also noteworthy: although the new curves
begin at a lower point than before, the degradation is still quasilinear. What is more,
the slope of this linear trend is lower (in absolute value), resulting in a slight raising
of both curves.
5 Conclusions
Experiments carried out with data coming from a real-world problem in the domain
of environmental studies have shown that allowing imprecise inputs, and using fuzzy
heterogeneous neurons based on similarity, yields much better prediction indicators
-mean accuracies, mean errors and their variances, and absolute best models found-
than those from classical crisp real-valued models. These results for heterogeneous
† These experiments could not be performed for the classical real-valued architectures, for they do not accept
missing information. Although there are estimation techniques, they are not an integrated part of the models,
and would have introduced a bias.
networks confirm the features observed in other studies [1] [2] [3] [11] [12] concerning
their mapping effectiveness and their robustness with respect to the presence of uncertainty
and missing data. Their ability to consider imprecise data directly and their
performance under those circumstances deserve closer attention, due to their implications
for real-world problems from the point of view of neurofuzzy systems. However,
the study of these networks is still in its initial stage. Several other architectures are
possible, along with different (partial) similarity measures, and further investigations
are being made in order to explore their properties to a greater extent, and to make the
scope of their application more precise.
References
1. Ll. Belanche and J.J. Valdés: "Using Fuzzy Heterogeneous Neural Networks to Learn a Model of
the Central Nervous System Control". In Procs. of EUFIT'98, 6th European Congress on Intelligent
Techniques and Soft Computing, pp. 1858-62, Elite Foundation, Aachen, Germany, 1998.
2. Ll. Belanche, J.J. Valdés and R. Alquézar: "Fuzzy Heterogeneous Neural Networks for Signal Forecasting".
In Procs. of ICANN'98, Intl. Conf. on Natural and Artificial Neural Networks (Perspectives in
Neural Computing), pp. 1089-94, Skövde, Sweden. Springer-Verlag, 1998.
3. Ll. Belanche, J.J. Valdés, J. Comas, I.-R. Roda and M. Poch: "Modeling the Input-Output Behaviour
of Wastewater Treatment Plants using Soft Computing Techniques". In Procs. of BESAI'98, Binding
Environmental Sciences and AI, held as part of ECAI'98, European Conference on Artificial Intelligence,
pp. 81-94, Brighton, UK, 1998.
4. Chandon, J.L., Pinson, S.: Analyse Typologique. Théorie et Applications. Masson, 1981.
5. Dubois D., Esteva F., García P., Godo L., Prade H.: A logical approach to interpolation based on similarity
relations. Instituto de Investigación en Inteligencia Artificial, Consejo Superior de Investigaciones
Científicas, Barcelona, España. Research Report IIIA 96/07, 1996.
6. Dubois D., Prade H., Esteva F., García P., Godo L., López de Mántaras R.: Fuzzy set modelling in
case-based reasoning. Int. Journal of Intelligent Systems (to appear) (1997).
7. Gower, J.C.: A General Coefficient of Similarity and Some of its Properties. Biometrics 27, 857-871,
1971.
8. Fagundo, J.R., Valdés, J.J., Rodríguez, J.E.: Karst Hydrochemistry (in Spanish). Research Group of Water
Resources and Environmental Geology, University of Granada, Ediciones Osuna, pp. 212, Granada,
Spain, 1996.
9. Fagundo, J.R., Valdés, J.J., Pulina, M.: Hydrochemical investigations in extreme climatic areas, Cuba
and Spitzbergen. In: Water Resources Management and Protection in Tropical Climates, pp. 45-54,
Havana, Stockholm, 1990.
10. G.J. Klir, T.A. Folger: Fuzzy Sets, Uncertainty and Information. Prentice Hall Int. Editions, 1988.
11. Valdés, J.J., García, R.: A model for heterogeneous neurons and its use in configuring neural networks
for classification problems. In Procs. of IWANN'97, International Work-Conference on Artificial and
Natural Neural Networks. Lecture Notes in Computer Science 1240, pp. 237-246. Springer-Verlag, 1997.
12. Valdés, J.J., Belanche, Ll., Alquézar, R.: Fuzzy heterogeneous neurons based on similarity. International
Journal of Intelligent Systems (accepted for publication, 1999). Also in Procs. of CCIA'98: Congrés
Català per a la Intel·ligència Artificial (Catalan Congress for Artificial Intelligence), Tarragona, Spain,
1998. Also in LSI Research Report LSI-98-33-R, Universitat Politècnica de Catalunya, Barcelona (1998).
13. Goldberg, D.E.: Genetic Algorithms in Search, Optimization & Machine Learning. Addison-Wesley
(1989).
14. Davis, L.D.: Handbook of Genetic Algorithms. Van Nostrand Reinhold (1991).
15. Zimmermann, H.J.: Fuzzy Set Theory and its Applications. Kluwer Academic Publishers (1992).
A Neural N e t w o r k A p p r o a c h for G e n e r a t i n g Solar
Irradiation Artificial Series
ABSTRACT
In this paper a relevant problem in the photovoltaic solar energy field is considered:
the generation of artificial series of hourly solar irradiation. The proposed methodology
artificially generates series following the average tendency of the hourly radiation
series kt in a given place. This is obtained by making use of a set of historical values
of this series in such place (for training purposes) as well as the daily clarity index KT
of the year to be generated. This information is employed for the supervised training
of a proposed neural network model. The neural model employs a well known
paradigm, called the Multilayer Perceptron (MLP), in a feedback architecture. The
generation method is based on the MLP's ability to extract, from a sufficiently general
training set, the existing relationships between variables whose interdependence is
unknown a priori. This way, the presented design methodology can implicitly include
all the available information. Simulation results show the good performance of the
irradiation series generator, and the general applicability of this methodology in the
estimation of highly complex temporal series.
1 Introduction
The design and analysis of photovoltaic converters is usually performed via numerical
simulations which require as input data large time sequences of hourly or daily irradiation*
values [Grah 90, Lore 91]. Nevertheless, these historic radiation measurements
do not exist in most countries of the world, and, if any, their quality is questionable
or they have plenty of missing values.
In 1988 Graham proposed the substitution of these historical measurements by synthetic
sequences of irradiation values generated using mathematical models of the irradiation
process. These generated sequences should preserve the statistical properties
of the historical measurements. The proposed methodology was based on autoregressive
time series theory for generating sets of daily values of solar irradiation.
The work described in [Grah 90] extends such methodology to the generation of
hourly solar irradiation series making use of daily values. These daily values can be
obtained from historical measurements (which are more common than hourly measurements)
or via some daily value generation methods (which are better validated
than hourly methods). This is a stochastic disaggregation method (very typical in
Hydrology: to separate the annual flow estimation into monthly estimations). The
hourly radiation series are very useful when studying photovoltaic systems with one-
or two-hour response time, such as peak plants or photovoltaic plants which return
energy to the network at maximum charge instants.
The main criticisms of Graham's method are the high computing requirements for
obtaining each series value, and the geographical dependency of the method on
the place where the data used for constructing the model were retrieved.
In this work, we propose a neural network approach, making use of the Multilayer
Perceptron (MLP) [Lipp 87, Rume 86, Werb 74] in a feedforward-feedback architecture
[Nare 90], for generating hourly solar radiation series. The main attractive property
of our method is the MLP capability for approximating any continuous function
defined on a compact set within a prescribed error margin. Existence results prove
that it suffices to employ a MLP with one hidden layer, a required number of neurons
and an appropriate training procedure [Horn 89]. In practice, selection of an appropriate
topology as well as training algorithms may become a big challenge.
One important aspect addressed in this paper is the possibility of employing the
presented architecture with a reduced knowledge of the problem to be considered.
In that sense the paper defines a simple design methodology with quite general
applicability.
The paper is organized as follows. Section 2 presents some basic aspects concerning
the use of MLP-based architectures for time series processing; in addition, specific
aspects related to the generation of irradiation series are also considered. The
specific proposed model for the generation of time series is presented in section 3.
Concluding remarks are outlined in section 4.
*The term solar radiation refers to the physical phenomenon in a generic sense, whereas the term
irradiation refers to the incident energy on a horizontal surface over a given period of time (hourly,
daily irradiation, etc). Therefore, the irradiation units are kW·h/m².
By training a MLP with p inputs and 1 output, with a training set representative
enough, the MLP will be able to find the desired relationship (in case that it exists)
[Figure: MLP with inputs KT (daily clarity index), i (hour), n (day) and the three previous hourly clarity index values kt(i-1), kt(i-2), kt(i-3), producing the output kt(i).]
In order to generate it, the 365 KT (daily clarity index) values of such year were needed as inputs,
as well as the 3 initial values of the hourly clarity index kt of each day. The MRV
obtained was 0.0943, proving that the method emulates quite well the deterministic
component of the series.
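The feedback generation scheme described above (each new kt value produced from the daily index, the hour, the day, and the three previously generated hourly values) can be sketched as follows; the function names, the dummy stand-in for the trained MLP, and the 16-hour day length are illustrative assumptions:

```python
import numpy as np

def generate_day(mlp_predict, KT, day, kt_init, hours=16):
    """Feedback generation of one day's hourly clarity indexes:
    kt(i) is produced from (KT, hour i, day n, kt(i-1), kt(i-2), kt(i-3))
    and fed back as input for the next hour."""
    kt = list(kt_init)                       # the 3 initial hourly values
    for i in range(len(kt_init), hours):
        x = np.array([KT, i, day, kt[i - 1], kt[i - 2], kt[i - 3]])
        kt.append(float(mlp_predict(x)))     # feedback: output -> next input
    return kt

# Dummy stand-in for the trained MLP, for illustration only.
fake_mlp = lambda x: 0.5 * x[3] + 0.2
series = generate_day(fake_mlp, KT=0.6, day=5, kt_init=[0.1, 0.2, 0.3])
print(len(series))  # 16
```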
The obtained series generator can be favourably compared in some aspects with
the computation of the average tendency ktm performed by Graham's method. Our
proposed method can be employed for generating series corresponding to any locality,
if the corresponding training data set is available, i.e. a set of hourly and daily
clarity indexes measured over several years. Likewise, Graham's method requires the same
training set for computing the nonlinear regressions corresponding to each locality,
which link each hourly ktm value and the KT value of the corresponding day. On the
other hand, the use of a MLP does not assume any a priori model, which is advantageous
versus a nonlinear regression approach.
From an academic point of view it is very interesting to note the MLP's capability
for finding relationships among variables of different nature. In our example, making
use of an appropriate training set, the MLP was able to relate information from the hour
of the day, the daily clarity index value, and 3 previous values of the hourly clarity index
in order to generate a new kt index value.
Nevertheless, the shape of the resulting series does not have the characteristic
rippling of the real series. This is due to the fact that the employed training set
(8 years of kt and KT values) was large in relation to the MLP 5-x-1 topology. It is
possible that such a training set may have input/output pairs such that different desired
output values may be linked with the same input value. Therefore, after training the
MLP, an averaging effect might have occurred among such different output values.
Hence, this could justify that our proposed method does not generate hourly radiation
series with the characteristic stochastic rippling of the real series (the generated series
are smooth, as can be seen in Figure 3).
For the sake of emulation completeness, the stochastic rippling was emulated, as
a first approach, via a generated set of random gaussian variables corresponding to
the 16 hours of the day excluding the initial p and last 2 (that is, 16 - p - 2 random
Figure 3: Real series versus generated one without noise. Days 5-13.
Figure 4: Real series versus generated one with additive noise. Days 5-13.
Figure 5: Real series versus generated one with additive noise. Days 355-361.
variables). The means and variances of these random variables were estimated from an
error signal between the 9th-year real series and the series generated by our proposed
method. Hence, we added to each of the generated hourly series values {kt} one
realization value of the random variable corresponding to that hour. The initial 3
hours of each day and the last 2 did not suffer such perturbation. In Figures 4 and 5
we show the series corresponding to the 9th year, generated by the described method
after adding the noise, corresponding to the hourly values from day 5 to 13, and from
355 to 361, respectively.
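The perturbation step just described can be sketched as follows; the function name and per-hour parameter arrays are our own, and the mean/variance values shown are placeholders rather than the paper's estimates:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_hourly_noise(kt_day, mu, sigma, skip_first=3, skip_last=2):
    """Add one Gaussian realization per hour (parameters estimated from
    the real-vs-generated error signal); the first 3 and last 2 hours of
    the day are left unperturbed, as in the text."""
    kt_day = np.asarray(kt_day, dtype=float).copy()
    lo, hi = skip_first, len(kt_day) - skip_last
    kt_day[lo:hi] += rng.normal(mu[lo:hi], sigma[lo:hi])
    return kt_day

day = np.full(16, 0.5)                       # a smooth generated day
mu, sigma = np.zeros(16), np.full(16, 0.05)  # placeholder noise parameters
noisy = add_hourly_noise(day, mu, sigma)
print(noisy[:3], noisy[-2:])  # boundary hours stay at 0.5
```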
4 Concluding Remarks
A methodology based on neural networks has been presented for generating time series
following the average tendency of the hourly radiation series kt in a given place. Such
methodology is based on the possibility of implicitly employing information associated
with the problem, without knowing the existing relationships between different
variables and sources of information.
The proposed methodology makes use of both a set of historical values of the
series (for training purposes) as well as the daily clarity index KT of the year to be
generated in a straightforward manner: the whole information has been employed
for the supervised training of a MLP-based feedforward-feedback architecture. A
proper selection of the model has proved to be more critical than the training method
selected.
Although the quality of the developed method needs further testing, one can conclude
that the generation can be performed with little knowledge of the problem. This
is due to the MLP capability for finding relationships among variables with an unknown
a priori relationship. Nevertheless, a proper MLP topology and training set must be
selected for such purpose.
The proposed method does not assume any a priori model, as opposed to standard
approximation techniques where polynomial regression is employed.
Acknowledgments
This work has been financially supported by Proyecto Multidisciplinar de Investigación
y Desarrollo 14908 of the Universidad Politécnica de Madrid and Project
PB97-0566-C02-01 of the Programa Sectorial de PGC of the Dirección General de
Enseñanza Superior e Investigación Científica in the MEC.
The authors want to thank Professor Eduardo Lorenzo and Dr. Mario Macagnan,
from the Instituto de Energía Solar in the UPM, for their helpful comments and
suggestions, as well as for providing the data on radiation series employed in this
work.
References
[Agar 97] M. Agarwal, "A Systematic Classification of Neural-Network-Based Control," IEEE Control Systems Magazine, vol. 17, no. 2, pp. 75-93, April 1997.
[Gold 96] R. Golden, Mathematical Methods for Neural Network Analysis and Design, MIT Press, Cambridge, 1996.
[Horn 89] K. Hornik, M. Stinchcombe and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, 2(5), 359-366, 1989.
[Lowe 91] D. Lowe and A. R. Webb, "Time series prediction by adaptive networks: a dynamical systems perspective," IEE Proceedings-F, February 1991.
[Nare 90] K. S. Narendra and K. Parthasarathy, "Identification and Control of Dynamical Systems Using Neural Networks," IEEE Transactions on Neural Networks, vol. 1, no. 1, pp. 4-27, March 1990.
[Nare 91] K. S. Narendra and K. Parthasarathy, "Gradient Methods for the Optimization of Dynamical Systems Containing Neural Networks," IEEE Transactions on Neural Networks, vol. 2, no. 2, pp. 252-262, March 1991.
[Prie 88] M. B. Priestley, Non-linear and Non-stationary Time Series Analysis, Academic Press, 1988.
Rautenberg, Sandro¹
Todesco, José Leomar²
Engenharia de Produção e Sistemas
Universidade Federal de Santa Catarina - UFSC - Brasil
¹ dinho@eps.ufsc.br
² tite@eps.ufsc.br
Abstract
Color recipe specification in a textile print shop requires a great deal of human
experience. There is an intrinsic knowledge that makes the computational modeling a difficult
task. One of the main issues is human color perception. A small variation in the
intensity of colorants can lead to very different results. In this paper, we propose to use
Radial Basis Function Networks (RBFN) for color recipe specification in the textile print
shop. The method has been applied in a real environment with the following results: it
allowed the modeling of the intuitive nature of color perception; it made it possible to
simulate the color mixing process on a computer; and it became a suitable means for
training on color recipe specification.
Keywords: Textile print shop, Color Recipe, RBFN, Artificial Neural Network.
I. Introduction
One of the most important processes in the textile industry is the development of an
appropriate color to print a certain kind of fabric. Issues such as product, esthetic beauty,
and artistic creativity, among other things, are directly dependent on this development. In general,
these issues are the main points observed by customers, causing direct impact on sales [9].
Despite its importance, in the majority of the industries the color recipe process is
still very primitive. It basically amounts to attempts at reaching a desired color. In this
process, the person in charge uses experience, mixing the colorants to obtain the target
color [10]. In many cases, this process leads to unsatisfactory results, with a high number of
failures and a considerable amount of wasted material. The nature of the method may lead
to failures even when the colorist is a color recipe expert [12].
To cope with this situation, many companies invest in the acquisition of a
spectrophotometer and in a computerized recipe formulation system. Nevertheless, such
investments occur without a previous study regarding the adaptation of the company
environment. In this case, the company may face some problems, especially after the system
has been implemented [3].
In the current state of the art, color development is intrinsically dependent on
individual perception, making color recipe specification a highly specialized task. The colorist uses
his/her experience, comparing and prescribing colors with different amounts of colorants.
In this article we propose a system to simulate human color perception in the textile
print shop. The system is able to capture and store the requested knowledge and to deal with
the knowledge inherent in the process. The main result is the system's capability of helping
individuals in the development of a certain color. It is also a source of training for others
involved in the process of stamping the fabric.
In the current process, the colorist's decisions are based mainly on his/her professional
experience with color development. The amount of each colorant in the mixture is
determined intuitively and based on practical experience with different mixes [7]. The lack
of an explicit methodology may cause a large amount of re-work or waste of raw material.
Besides, the subjectivity of color evaluation makes it difficult to reach complete
agreement on how close a color is to the initial target. Variables such as age, tiredness,
visual defects, opinions, taste, etc. can make color perception differ among observers [9].
Another factor is how sensitive the mixture is to an increase in the amount of a
colorant. There are colorants that, when mixed in a small quantity, strongly affect the
final result. In Table 1 we present the quantity (in grams) of each colorant for the
production of one kilogram of paint. The pantones identify the colors according to industry
standards.
biggest errors. However, when the mix involved more than one colorant, the system
answered very well. In 78% of the cases the error was smaller than 0.8.
                     Total of tests   A3
textile                    22         1.0
surface coating            11         1.1
paper: transparent         12         1.2
paper: opaque               5         1.1
Table 2: Test Results in Color Recipe Prediction.
f_j(x) = [1 / ((2π)^{n/2} σ_1 σ_2 ··· σ_n)] exp( - Σ_{i=1}^{n} (x_i - c_{ji})² / (2σ_i²) )   (4.2)

The values of σ_1, σ_2, ..., σ_n are used in the same manner as with "normal"
probability densities to provide "dispersion" scales in each component direction.
Another common variation on the basis functions is to increase their functionality
by using the Mahalanobis distance in the Gaussian function. The above equation becomes:

f_j(x) = (2π)^{-n/2} |K_j|^{1/2} exp( -(1/2) (x - c_j)^T K_j (x - c_j) )   (4.3)

where K_j is the inverse of the covariance matrix of X associated with hidden node c_j.
Given p exemplar n-vectors, representing p classes, the network can be initialized with
knowledge of the centers (locations of the exemplars). If c_j represents the j-th exemplar
vector, then we can define the weight matrix C as follows:
C = [c_1 c_2 ... c_p]   (4.4)
such that the weights in hidden node j are the components of the "center" c_j. Thus, a
hidden-layer node calculates the expression of Eq. (4.2).
The output layer is a weighted sum of the hidden-layer outputs. When presenting an
input vector x to the network, the network implements
y = W f(||x - c||)   (4.5)
where f represents the vector of functional outputs from the hidden layer, and c the
corresponding center vector. Given some training data with desired responses, the output
weights W can be found using the LMS rule either iteratively or non-iteratively, e.g.,
via gradient descent or pseudo-inverse techniques, respectively.
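A minimal sketch of Eqs. (4.2)-(4.5) follows, using a single width per basis function and a least-squares (pseudo-inverse) solution for W; the function names and the toy data are illustrative assumptions, not from the paper:

```python
import numpy as np

def rbf_hidden(X, C, sigma):
    """Gaussian hidden-layer outputs f(||x - c_j||) for each center c_j."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_output_weights(X, Y, C, sigma):
    """Non-iterative training of the output layer: solve y = W f
    in the least-squares (pseudo-inverse) sense."""
    F = rbf_hidden(X, C, sigma)            # (patterns, centers)
    W, *_ = np.linalg.lstsq(F, Y, rcond=None)
    return W

rng = np.random.default_rng(0)
X = rng.random((50, 2))
Y = (X.sum(axis=1, keepdims=True) > 1).astype(float)
C = X[:10]                                 # exemplar vectors as centers
W = fit_output_weights(X, Y, C, sigma=0.5)
pred = rbf_hidden(X, C, 0.5) @ W           # network output, Eq. (4.5)
print(pred.shape)
```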
A simple way of choosing the scaling factors for the Gaussian functions is to set them
equal to the average squared distance between the training data in each cluster and its center:

σ_j² = (1/M_j) Σ_{x ∈ Θ_j} (x - c_j)^T (x - c_j)   (4.6)

where Θ_j is the set of training patterns grouped with cluster center c_j, and M_j is the
number of patterns in Θ_j.
Another manner of choosing the σ² parameters is to calculate the distances between
the centers in each dimension and use some percentage of this distance for the scaling
factor. In this way, the p-nearest neighbor algorithm has been used. Sometimes, to improve
the radius of the Gaussian function, it is interesting to multiply this variance by a constant.
The objective is to increase the radius and consequently the amplitude or range of the
neuron [19].
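The per-cluster width of Eq. (4.6), with the optional widening constant mentioned above, can be sketched as follows; the function name and the cluster-label encoding are our own assumptions:

```python
import numpy as np

def cluster_sigmas(X, centers, labels, scale=1.0):
    """Eq. (4.6): sigma_j^2 is the mean squared distance of the patterns
    in cluster Theta_j to their center c_j, optionally multiplied by a
    constant to widen the Gaussian radius."""
    sigmas2 = np.empty(len(centers))
    for j, c in enumerate(centers):
        members = X[labels == j]
        sigmas2[j] = scale * np.mean(((members - c) ** 2).sum(axis=1))
    return sigmas2

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
centers = np.array([[1.0, 0.0], [11.0, 0.0]])
labels = np.array([0, 0, 1, 1])
print(cluster_sigmas(X, centers, labels))  # [1. 1.]
```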
5. The Application
The first step towards implementing the solution was the normalization of the
environment, in order to obtain good conditions. Some steps were:
• to form a team and a strategy to evaluate the colors;
• definition of the colorants used;
• the type of cloth to be considered; and
• configuration of the textile print shop machine.
After the normalization of the environment, the data acquisition was initiated. To do that,
the pantone system was simulated on the cloth. The pantone system is formed by
approximately a thousand diversified color samples, and is considered a good vehicle
of color communication by colorist professionals.
When a certain color was reached, three spectral measures were taken with the
X-rite 978 spectrophotometer. The development of a color recipe prediction system
requires the utilization of technological resources, particularly the spectrophotometer. This
equipment quantifies the human perception of a certain color [3, 13]. The
spectrophotometer output is a color evaluation in the so-called Lab scale. The first step is to
convert this scale to another parameter, more intuitive to the colorist. An example is the
CMYK (Cyan, Magenta, Yellow, Black) scale. By working with this scale, the knowledge
acquisition process becomes easier. The spectral data were converted to other scales of
representation, XYZ and xyz [11].
The best results were obtained using two RBFNs, according to Figure 2, where each net has its
own functionality. The two stages were:
• composition: 9 inputs (Lab Lab Lab) and 7 outputs (the 7 colorants used by the industry);
• quantity: an RBFN to predict the amount (in grams) of each colorant identified in the first stage.
To test the system, 21 colors were selected from the pantone system that had not been
made before, mainly because they were difficult to obtain. For the composition stage, the
system presented the following results:
• 17 excellent compositions, resulting in 81% success;
• 02 compositions partially correct, where it was possible to reach the desired
color with a small adjustment (9.5% of the compositions);
• 02 compositions completely wrong (9.5%).
For the second stage, quantity, the system presented the following results:
• 11 excellent recipes;
• 08 recipes close to the target, needing small corrections; and
• 02 recipes completely wrong.
Although these results were already observed in the field, there is still room for
other approaches. Actually, the most difficult step, knowledge acquisition, can be
notably improved by the adoption of automatic knowledge extraction techniques (e.g.,
rule extraction [15], fuzzy neural networks [20], or hybrid learning techniques [21]). Such
methods can elicit rules directly from a set of samples composed of pairs (color target;
colorant mix), avoiding most of the steps of the laborious knowledge acquisition task
needed to design a fuzzy system.
8. References
[1] Araújo, M., and Castro, E. M. M., Manual de Engenharia Têxtil (Textile Engineering Manual), Fundação Calouste Gulbenkian, Lisboa, September 1984.
[2] J. M. Bishop, M. J. Bushnell, and S. Westland, "Application of Neural Networks to Computer Recipe Prediction," Color Research and Application, John Wiley & Sons, New York, February 1991, pp. 3-9.
[3] R. Hirschler, L.C.R. Almeida, and K.S. Araújo, "Formulação computadorizada de receitas de cores de tingimento e estamparia têxtil: como obter sucesso na indústria" ("Computerized color recipes in textile print shop: how to obtain success in industry"), Química Têxtil, Associação Brasileira de Químicos e Coloristas Têxteis, Barueri - São Paulo, September 1995, pp. 61-67.
[4] Moody, J. & Darken, C. J., "Fast learning in networks of locally-tuned processing units," Neural Computation, vol. 1, 281-294, 1989.
[5] R. Luo, P. Rhodes, J. Xin, and S. Scrivener, "Effective colour communication for industry," JSDC, Society of Dyers and Colourists, Bradford, December 1992, pp. 516-520.
[6] Haykin, Simon, Neural Networks: A Comprehensive Foundation, Macmillan College Publishing Company, New York, 1994.
[7] Ribeiro, E.G., Como iniciar uma estamparia em silk-screen (How to open a textile print shop), CNI, Rio de Janeiro, 1987.
[8] Welstead, S.T., Neural Network and Fuzzy Logic Applications in C/C++, John Wiley & Sons, New York, 1994.
[9] Farina, M., Psicodinâmica das cores em comunicação (Psycho-dynamics of colors in communication), Editora Edgard Blücher Ltda, São Paulo, 1990.
[10] Vigo, T., Textile Processing and Properties - Preparation, Dyeing, Finishing and Performance, Elsevier, Amsterdam, 1994.
[11] Billmeyer, F.W. Jr., and M. Saltzman, Principles of Color Technology, John Wiley & Sons, New York, 1981.
[12] Ingamells, W., Colour for Textiles, Society of Dyers and Colourists, Bradford, 1993.
[13] M. R. Costa, "Princípios básicos da colorimetria" (Basic principles of colorimetry), Química Têxtil, Associação Brasileira de Químicos e Coloristas Têxteis, Barueri - São Paulo, June 1996, pp. 36-71.
[14] Lammens, J.M.G., A computational model of color perception and color naming, Faculty of the Graduate School of the State University of New York at Buffalo, New York, June 1994.
[15] Abe, S., and Lan, M.-S., "A method for fuzzy rule extraction directly from numerical data and its application to pattern classification," IEEE Trans. on Fuzzy Systems, vol. 3, no. 1, pp. 18-28, 1995.
[16] Hush, Don R. & Horne, B. G., "Progress in Supervised Neural Networks: What's New Since Lippmann?," IEEE Signal Processing Magazine, 8-39, January 1993.
[17] Lee, S. & Kil, R. M., "A Gaussian Potential Function Network with Hierarchically Self-Organizing Learning," Neural Networks, vol. 4, 207-224, 1991.
[18] Wettschereck, D. & Dietterich, T., "Improving the Performance of Radial Basis Function Networks by Learning Center Locations," Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson and R. P. Lippmann, editors, 1133-1140, 1992.
[19] Saha, Avijit & Keeler, J. D., "Algorithms for Better Representation and Faster Learning in Radial Basis Function Networks," Advances in Neural Information Processing Systems 2, D.S. Touretzky, editor, 482-489, 1990.
[20] Ishibuchi, H., Kwon, K., and Tanaka, H., "A learning algorithm of fuzzy neural networks with triangular fuzzy weights," Fuzzy Sets and Systems, vol. 71, pp. 277-293, 1995.
[21] Bonarini, A., "Evolutionary Learning of Fuzzy Rules: Competition and Cooperation," in Fuzzy Modeling: Paradigms and Practice, ed. by W. Pedrycz, Kluwer Academic Press, 1996.
[22] TODESCO, José L., Reconhecimento de padrões usando rede neuronal artificial com uma função de base radial: uma aplicação na classificação de cromossomos humanos. Florianópolis, 1995. Tese (Doutorado em Engenharia de Produção) - Engenharia de Produção e Sistemas, UFSC.
[23] TONTINI, Gerson, Automatização da identificação de padrões em gráficos de controle estatístico de processos (CEP) através de redes neurais com lógica difusa. Florianópolis, 1995. Tese (Doutorado em Engenharia Mecânica) - Engenharia Mecânica, UFSC.
Predicting the Speed of Beer Fermentation
in Laboratory and Industrial Scale
1 Introduction
From the production management point of view, the ability to predict the
duration of fermentations would be a useful one [3]. In practice, the fermentation
times in seemingly equivalent settings can vary considerably, which hinders
efficient scheduling of the plants. Moreover, breweries are forced to make daily
measurements to follow the course of the fermentations, in order to decide when
to stop the process. With a good predictor of the fermentation speed, one could
manage with fewer measurements.
In this paper we study how well two predictor families, neural nets and decision
trees, suit this problem. The task of the neural net is to predict the fermentation
time, and the task of the decision tree is to classify the batches as slow or fast.
The neural net gives a continuous prediction, while the decision tree is
understandable even to the brewers. We perform two sets of tests. The first set
is performed with data from laboratory tests. The second data set is collected
from a real brewery.
The rest of this paper is organized as follows. First, Section 2 briefly explains
the beer fermentation process. Section 3 reviews the results that were obtained on
the laboratory-scale data. Section 4 goes through the results that were obtained
on the brewery data. Finally, Section 5 presents the conclusions of the current
work.
2 The Beer Fermentation Process

The main ingredients of beer are malt, water, and hops. The main phases of the
brewing process are wort production and fermentation.
The wort production starts with crushing the malt into coarse flour, which is
then mixed with water. The resulting porridge-like mash is heated according to
a carefully selected temperature program which encourages the malt enzymes to
partially solubilize the ground malt. The resulting sugar-rich aqueous extract,
wort, is then separated from the solids and boiled with hops. The wort is then
clarified and cooled.
The fermentation process starts with aerating the cooled wort and adding
yeast to it. The yeast starts to consume the nutrients contained in the wort in
order to stay alive and grow. At the same time, the yeast produces alcohols
and esters. Fermentation is controlled by regulating the temperature, the oxygen
content, and the pitch rate, i.e., the amount of yeast put into the fermentation
tank. Temperature has a great effect on both the speed of fermentation and the
flavour of the beer. The growth of the yeast can be controlled by the oxygen content.
The pitch rate affects the fermentation speed, but not as much as the temperature.
However, the effects of the pitch rate on flavour are small, which permits larger
changes without altering the flavour profile.
In addition, the course of fermentation is affected by other factors, such as
the wort composition and the yeast condition. Ideally, these factors should be
constant, so that the predictability of fermentation is maintained. In practice,
neither the wort composition nor the yeast condition is static. The natural
variation of malt induces some variation in the wort composition, although such
variations can be diminished by re-planning the mashing recipes [1, 2].
The condition of the yeast is a more complicated issue. Traditionally,
breweries have observed the viability, i.e. the percentage of live cells in the batch,
by laboratory analyses. However, these methods do not tell anything about the
vitality of the yeast, i.e. the fermentation rate of the cells. The yeast used in
brewing is grown by the brewery and recycled many times before disposal. The
ability of the yeast to ferment depends greatly on the history of the yeast.
For example, new yeast typically behaves differently from yeast that has been
recycled many times. Also, yeast that has been stored for long periods between
fermentations is often less vital.
Ideally, the brewery should be able to modify the fermentation recipes so that
the variability of the yeast and wort is canceled out. Thus, if the vitality
of the yeast is low, the brewery could increase the pitch rate or elevate the
temperature or oxygen content slightly. A fermentation recipe planner, such as
the Sophist system [8], is well suited to this task. A reliable estimate of the yeast
vitality is needed for such an approach, though. However, as one can expect from
the above introduction, no single analysis exists that would permit predicting
the duration of fermentations to any reasonable degree of accuracy.
3 Laboratory-Scale Experiments

A set of 100 fermentations [4] was used for both the artificial neural net (ANN)
and the decision tree experiments. This data set contains fermentations with
recycled yeast (up to 4 cycles) and fermentations with freshly propagated yeast.
The worts used in these experiments were all made according to one recipe,
using a single lot of malt extract. Hence the worts were all very similar indeed.
Yeast viability was assessed by methylene blue (MB) and methylene violet (MV)
staining, both at the end and at the start of a fermentation. In addition, the
trehalose content of the yeast, which is a stress indicator, was measured before
pitching. The pitching rate was constant. As a fourth yeast condition
measurement, the acidifying power (AP) was recorded. Cropped yeast was aerated for
0, 3 or 5 hours before pitching. The percentage of apparent fermentation (the
percentage of sugars consumed) was calculated from daily measurements of the
specific gravity (SG) of the wort. A review of these measurements is given, e.g.,
by Londesborough [5].
The first approach was to train an ANN on this data. In the work presented here,
an ANN was trained to predict the relative degree of fermentation at 72 and 130
hours. Several sets of inputs were used, in order to see which analyses contribute
to the quality of prediction.
A number of neural nets estimating the apparent degree of fermentation at
72 and 130 hours were trained. For each net, approximately 75% of the available
Table 1. The prediction errors of neural nets estimating the degree of fermentation
using different measurements. The errors are given in absolute percentage points, i.e.,
the difference between the predicted value and the actual measured value was never
more than the given error. "Prev. adf" means the measured degree of fermentation of
the batch that the yeast was cropped from. This value is not available when freshly
propagated yeast is used.
data was used for training and 25% for validation. The nets differed in the input
measurements used. Table 1 lists the inputs to the nets that were constructed
and the prediction errors of these nets.
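The evaluation protocol above, a roughly 75/25 split with Table 1 reporting the largest absolute deviation, can be sketched in a few lines. This is a minimal illustration; the prediction values below are invented, not taken from the paper:

```python
import random

def split_75_25(batches, seed=0):
    """Shuffle and split fermentation batches into ~75% training / 25% validation."""
    rng = random.Random(seed)
    shuffled = batches[:]
    rng.shuffle(shuffled)
    cut = int(round(0.75 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def max_abs_error(predicted, measured):
    """The error figure of Table 1: the largest absolute difference, in
    percentage points of apparent degree of fermentation."""
    return max(abs(p - m) for p, m in zip(predicted, measured))

# Invented example: predicted vs. measured degree of fermentation at 72 h.
predicted = [61.2, 58.7, 64.1, 59.9]
measured  = [60.0, 59.5, 65.0, 60.4]
print(f"max error: {max_abs_error(predicted, measured):.1f} percentage points")
```

The maximum rather than the mean error is a conservative summary: it bounds how far any single prediction strayed from the measured value.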
It can be seen that information about the behavior of the yeast in previous
batches is rather useful: including this data reduces the prediction error
significantly. For freshly propagated yeast such data is not available, and it is
therefore more difficult to predict the behavior of such yeast. In this case, adding
information about the physiological condition of the yeast, in the form of the
trehalose measurement, helped the prediction.
Of this set, only two measurements appeared in the tree (Table 2) induced
from the whole data, namely the methylene blue measurement and, somewhat
surprisingly, the specific gravity at the start of the fermentation. The training
accuracy of the depicted tree, as well as that of the trees in Tables 3 and 4, is 98%.
The cross-validation accuracy (i.e. the estimated performance on unseen cases)
of this scheme is 97.8% ± 0.4%, meaning that circa 2% of new batches would be
misclassified using this rule.
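A two-attribute rule of the kind described above can be sketched as follows. The thresholds and batches below are invented for illustration only; the actual induced tree is given in Table 2:

```python
def classify_batch(methylene_blue, original_sg,
                   mb_threshold=5.0, sg_threshold=1.048):
    """Toy stand-in for an induced tree: flag a batch as 'slow' when yeast
    viability is poor (high % of dead cells in methylene blue staining),
    otherwise decide on the original specific gravity. Thresholds invented."""
    if methylene_blue > mb_threshold:
        return "slow"
    return "slow" if original_sg > sg_threshold else "fast"

# Invented batches: (MB % dead cells, original SG, true class)
batches = [
    (2.0, 1.044, "fast"),
    (7.5, 1.046, "slow"),
    (3.1, 1.052, "slow"),
    (1.8, 1.040, "fast"),
]
correct = sum(classify_batch(mb, sg) == label for mb, sg, label in batches)
print(f"training accuracy: {correct / len(batches):.0%}")
```

Such a rule is trivially interpretable by the brewers, which is precisely the advantage claimed for the decision tree approach.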
Table 3. Rule induced from the data where the methylene blue measurement was excluded.
Our third experiment was to exclude the original specific gravity from the
set of available measurements. The effects were parallel to the second experiment:
the specific gravity was replaced by the pH of the original wort. The cross-validation
accuracy was 94.8% ± 0.7%, again indicating that the original gravity
is the more informative measurement in this setting.
The immediate conclusion to be drawn from these decision tree experiments is
that predicting whether a batch will be slow or not can be done with surprisingly
little information about the wort quality and the yeast condition. Only two
measurements of each appear in the three decision trees, which, moreover, all have
very high accuracy.
Table 4. Rule induced from the data where the original specific gravity measurement
was excluded.
One important question arose from these experiments: why does the original
specific gravity of the wort seem to be necessary in predicting the speed of
fermentation? This finding seems peculiar, since the wort was of very even quality
in the different batches. Another question was why replacing the SG measurement
with pH yields almost as good results. A technical answer to this question is that
the two measurements are quite strongly inversely correlated (r = -0.7818) in this
data. Still, the answer to the fundamental question of why either of these
measurements is relevant remains unclear.
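The inverse correlation quoted above is an ordinary Pearson coefficient; a small sketch with invented wort measurements (higher original gravity paired with lower pH, as in the data):

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented wort measurements: original specific gravity vs. wort pH.
gravity = [1.040, 1.044, 1.048, 1.052, 1.056]
ph      = [5.45, 5.40, 5.34, 5.31, 5.25]
print(f"r = {pearson_r(gravity, ph):.3f}")  # strongly negative
```

A strong negative r explains technically why pH can stand in for gravity in the tree, though, as noted above, not why either measurement is predictive in the first place.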
4 Industrial-Scale Experiments

We set out to validate the laboratory results on an industrial scale. To that end,
we collected data on 118 fermentations from a brewery.

The set of attributes in the data was different from the one used in the
laboratory tests: the brewery uses an online capacitance measuring device for
assessing the viability of the yeast mass rather than the staining methods. The
benefit of the former approach is that the whole yeast mass is measured
instead of a small sample. In addition, the volume of the pitched yeast was used
as an additional measurement.
Since the history of the yeast was found important in the laboratory data set,
the fermentation time of the previous round of the yeast was included, as was the
length of the yeast's history, measured as the number of fermentations.
The propagations from which the yeasts originated were more numerous, which
contributed variation that could not be coded into the data.
The fermentation tank was filled with two brews that entered the tank at
intervals of varying length (several hours in each case). To manage this
complication, we found it necessary to include in the data set the first SG
measurement performed on the full tank, in addition to the average of the original
specific gravities of the two brews. The interim time between the two brews was
also included in the data set.
Fig. 1. Neural network prediction results. Each dot represents one prediction: the
squares correspond to data items that were included in the training set and the
circles represent predictions on fresh cases. The solid line corresponds to the correct
prediction and the dashed lines are the one-day error margins.
Different network sizes were compared on the basis of the predictive accuracy
on the test set. A network of 2x4 hidden units was found to give a good result
when all yeast strains were present in the data. In contrast, a network as small
as 1x3 units was found to generalize well when the data was restricted to include
just one strain.
The predictions given by the network of 1x3 hidden units are depicted in
Figure 1, which plots the fermentation speed predictions given by the network on
the training and test data against the measured duration. A correct prediction
falls on the solid line. The dashed lines represent error margins of one day. It can
be seen that most predictions are within the one-day error limit. The average
deviation of the predictions is 0.6 days (14.4 hours), which is clearly worse than
the best results (1 hour and 6.5 hours) obtained on the laboratory-scale data.
However, taking the more complicated real-world setting into account, the result
is satisfactory.
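The error summary above, an average deviation together with the one-day margin of Fig. 1, amounts to the following computation, shown here with invented durations:

```python
def summarize_errors(predicted_days, actual_days, margin=1.0):
    """Average absolute deviation, and the share of predictions that fall
    within the given error margin (one day, as in Fig. 1)."""
    errors = [abs(p - a) for p, a in zip(predicted_days, actual_days)]
    mean_dev = sum(errors) / len(errors)
    within = sum(e <= margin for e in errors) / len(errors)
    return mean_dev, within

# Invented fermentation durations in days.
predicted = [6.5, 7.2, 8.0, 6.9, 7.6]
actual    = [7.0, 7.0, 8.9, 6.8, 7.5]
mean_dev, within = summarize_errors(predicted, actual)
print(f"mean deviation {mean_dev:.2f} days ({mean_dev * 24:.1f} h), "
      f"{within:.0%} within one day")
```

Reported this way, the 0.6-day average deviation of the brewery model translates directly into how many daily gravity measurements could be skipped without losing track of a batch.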
5 Conclusions
References
1. Aarts, R., Sjöholm, K., Home, S., Pietilä, K.: Computer-Planned Mashing. In
Proceedings of the Twenty-Fourth Congress, European Brewery Convention. IRL Press,
Oxford (1993) 655-662
2. Aarts, R., Rousu, J.: Towards CBR for Bioprocess Planning. In Smith, I., Faltings,
B. (eds.): Advances in Case-Based Reasoning, Proceedings of the Third European
Workshop, EWCBR-96. Lecture Notes in Artificial Intelligence, Vol. 1168. Springer-
Verlag, Berlin Heidelberg New York (1996) 16-27
3. Cummins, S., Plant, N., Kelleher, P., O'Connor, J.B.: Optimisation of Brewery
Operations Using Fuzzy Logic and Simulation Tools. Proceedings of the International
Symposium on Automatic Control of Food and Biological Processes. SIK, Göteborg,
Sweden (1998) 459-467
4. Kataja, K.: Yeast Recycling in Main Fermentation of Beer (in Finnish). Master's
thesis, Department of Chemical Technology, Helsinki University of Technology, Fin-
land (1997)
5. Londesborough, J.: The Measurement of Yeast Viability in Breweries (in Finnish).
Mallas ja Olut 5 (Oct. 1998) 139-148
6. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Ma-
teo, Calif. (1993)
7. Quinlan, J.R.: Improved Use of Continuous Attributes in C4.5. J. Artif. Intell. Res.
4 (1996) 77-90
8. Rousu, J., Aarts, R.: Case-Based Planning Methods in Biotechnical and Food
Processes. Proceedings of the International Symposium on Automatic Control of Food
and Biological Processes. SIK, Göteborg, Sweden (1998) 215-224